# Clustering VoC

Date: 2025/5/7

## Clustering with OpenAI

https://cookbook.openai.com/examples/clustering

Clustering inquiries with embeddings and the K-means method

https://qiita.com/Tokoroteen/items/c2974c76baf05c59b94d
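As a rough illustration of the idea behind both links, the sketch below runs a minimal K-means (Lloyd's algorithm) on a few toy 2-D vectors. The toy vectors stand in for real embeddings, and the function name `kmeans`, the choice of `k=2`, and the seed are assumptions made for illustration; this is not the code from the cookbook or the Qiita article.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm. Returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct rows of X.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # assignments have stabilised
        centers = new_centers
    return labels, centers

# Two well-separated toy groups standing in for embedding vectors.
toy_vectors = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
labels, centers = kmeans(toy_vectors, k=2)
print(labels)
```

For real embeddings, a library implementation such as scikit-learn's `KMeans` is the more usual choice; the hand-written loop is only meant to show the assign-then-update cycle.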

## Dataset (CC0 license)

DATASET: Digital Voice-of-Customer on sharing mobility services

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/JXK3Z8

Publication: MOBI-Qual: a common framework to manage the product-service system quality of shared mobility

https://link.springer.com/article/10.1007/s10696-023-09520-y

In [1]:
DATA_PATH = "dataset/Digital VoC Dataset.xlsx"

In [2]:
import pandas as pd

df = pd.read_excel(DATA_PATH)
df.head()

| | ID | Fonte | Country | Provider | Type | Data | Rating | Review |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 5 | Did you know you can make a zipcar reservation... |
| 1 | 2 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 5 | I've been meaning to write up Zipcar for a whi... |
| 2 | 3 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 4 | EDIT 2/21: After a slightly messy first Zipcar... |
| 3 | 4 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 5 | I love Zipcar I have be a pround member sinc... |
| 4 | 5 | Yelp! | UNITED STATES | Zipcar | Station based | 38777 | 5 | Flexcar is the answer for those of us that are... |


In [3]:
len(df)

11127

In [4]:
# "Fonte" is Italian for "Source", so rename the column to its English name.

df = df.rename(columns={"Fonte": "Source"})
df.head()

| | ID | Source | Country | Provider | Type | Data | Rating | Review |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 5 | Did you know you can make a zipcar reservation... |
| 1 | 2 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 5 | I've been meaning to write up Zipcar for a whi... |
| 2 | 3 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 4 | EDIT 2/21: After a slightly messy first Zipcar... |
| 3 | 4 | Yelp! | UNITED STATES | Zipcar | Station based | 38749 | 5 | I love Zipcar I have be a pround member sinc... |
| 4 | 5 | Yelp! | UNITED STATES | Zipcar | Station based | 38777 | 5 | Flexcar is the answer for those of us that are... |
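The `Data` column ("data" is Italian for "date") shows integers such as 38749, which look like Excel serial day numbers. Assuming that reading is right, a sketch of the conversion could look like the following; the toy frame and the new column name `Date` are illustrative only, and the notebook itself leaves the column untouched.

```python
import pandas as pd

# Toy frame standing in for df; the two values are copied from the preview.
toy = pd.DataFrame({"Data": [38749, 38777]})

# Assumption: "Data" holds Excel serial day numbers, counted in days from
# 1899-12-30 (the usual origin for converting Excel dates in pandas).
toy["Date"] = pd.to_datetime(toy["Data"], unit="D", origin="1899-12-30")
print(toy)
```

Under this assumption the first reviews date from early 2006, which is plausible for early Yelp reviews of Zipcar.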


In [5]:
df.Provider.unique()

array(['Zipcar', 'Enterprice Car Share', 'Car2go', 'piccolo',
       'GoGet CarShare', 'DriveNow', 'Hertz 24', 'Evo Car Share',
       'Ubeeqo', 'Maven'], dtype=object)

In [6]:
df.Type.unique()

array(['Station based', 'Free floating', 'Mixed'], dtype=object)

In [7]:
df.Country.unique()

array(['UNITED STATES ', 'CANADA', 'AUSTRALIA', 'Non specificata ', 'UK'],
      dtype=object)
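Two of the country values above carry a trailing space (`'UNITED STATES '`, `'Non specificata '`), and "Non specificata" is Italian for "not specified". A possible cleanup, shown on a toy Series rather than on `df` and not part of the original notebook, might be:

```python
import pandas as pd

# Toy values copied from the unique() output above.
country = pd.Series(["UNITED STATES ", "CANADA", "Non specificata ", "UK"])

# Strip surrounding whitespace, then translate the Italian placeholder.
cleaned = country.str.strip().replace({"Non specificata": "Not specified"})
print(cleaned.unique())
```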

In [8]:
df.Source.unique()

array(['Yelp!', 'Playstore', 'Google', 'Trustpilot.com', 'Facebook'],
      dtype=object)

In [9]:
df.groupby("Provider").Provider.count()

Provider
Car2go                  3003
DriveNow                 599
Enterprice Car Share     547
Evo Car Share            585
GoGet CarShare           252
Hertz 24                 151
Maven                    313
Ubeeqo                   154
Zipcar                  5259
piccolo                  264
Name: Provider, dtype: int64

In [10]:
df.Review[0]

"Did you know you can make a zipcar reservation from your cell?? And you can extend your reservation in about 4 seconds from your cell as well     this was HUGH when the line at TraderJoe's was more hellish than expected  Love zipcar _x000D_\n_x000D_\nAnyone know what happens if you have the car out longer than your allotted reservation time? -- Update: I've heard from a few people that the alleged $50 late fine is legit  \xa0Guess I'll always be bringing my car back on time!"

In [11]:
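# Replace newlines, non-breaking spaces (\xa0), and Excel's escaped
# carriage returns (_x000D_) with a plain space.
replace_patterns = [['\n', ' '], ['\xa0', ' '], ['_x000D_', ' ']]

def replace_(x):
    """Apply every (old, new) pair in replace_patterns to the string x."""
    for old, new in replace_patterns:
        x = x.replace(old, new)
    return x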

print(replace_(df.Review[2]))

EDIT 2/21: After a slightly messy first Zipcar experience   Pam from Zipcar was extremely helpful and immediately responded to my concerns   They get an extra star for excellent customer service and hopefully I'll be adding back that last star after my next trip!      2/20 - On Sunday we decided to take our first mini trip since selling our car   When we got to the car location   the car we'd reserved was lost (someone from Zipcar was there looking for it)   He'd driven another Zipcar so he let us take that while he went to look for the other car   It did take a good 15 minutes to get everything sorted out though and the credit we were told we'd get was not added to our account   We'd also planned on extending the car we requested if need be (it was open according to the internet about five minutes before we left home)   but the car we were given was not open so we were unable to extend the reservation when I called   Last   were given no instructions about what to do with the car we'd
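The cleaned text above still contains runs of several spaces. If those matter downstream, one alternative is to collapse all whitespace with a regular expression. The helper name `normalize_text` below is hypothetical and the sketch is a variation on `replace_`, not the notebook's method.

```python
import re

def normalize_text(text):
    """Drop the _x000D_ debris and collapse any whitespace run to one space."""
    text = text.replace("_x000D_", " ")
    # In Python 3, \s also matches the non-breaking space \xa0 for str patterns.
    return re.sub(r"\s+", " ", text).strip()

sample = "Love zipcar _x000D_\n_x000D_\nAnyone know  \xa0Guess"
print(normalize_text(sample))
```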

In [12]:
df.Review = df.Review.apply(replace_)
df.Review.head()

0    Did you know you can make a zipcar reservation...
1    I've been meaning to write up Zipcar for a whi...
2    EDIT 2/21: After a slightly messy first Zipcar...
3    I love Zipcar   I have be a pround member sinc...
4    Flexcar is the answer for those of us that are...
Name: Review, dtype: object

## Calculating embeddings

In [13]:
from openai import OpenAI

MODEL = "text-embedding-3-small"

client = OpenAI()

In [14]:
texts = df.Review[0:2]
texts

0    Did you know you can make a zipcar reservation...
1    I've been meaning to write up Zipcar for a whi...
Name: Review, dtype: object

In [15]:
response = client.embeddings.create(input=texts, model=MODEL)

In [16]:
[data.embedding for data in response.data]

[[-0.020828261971473694,
  -0.00851521547883749,
  -0.024910181760787964,
  0.03663261979818344,
  -0.044108666479587555,
  -0.01463809609413147,
  -0.012985890731215477,
  0.01610340178012848,
  0.00288201542571187,
  -0.007539591286331415,
  0.01152806170284748,
  -0.0003686624695546925,
  0.006365852430462837,
  0.03636348247528076,
  -0.03047236055135727,
  -0.007236811798065901,
  -0.038157735019922256,
  0.014787617139518261,
  0.00574160274118185,
  0.026360535994172096,
  0.010055280290544033,
  0.04629167169332504,
  -0.0022951457649469376,
  -0.012701801024377346,
  -0.021321680396795273,
  -0.01410729717463255,
  -0.018480783328413963,
  0.03325344994664192,
  -0.003999684005975723,
  -0.05161461606621742,
  0.048983048647642136,
  -0.026779193431138992,
  -0.025224175304174423,
  -0.047727070748806,
  0.008657259866595268,
  -0.00012919541040901095,
  -0.0006714423070661724,
  -0.002422238700091839,
  0.04075939953327179,
  -0.034479521214962006,
  -0.051734231412410736,
  

In [17]:
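def calc_embeddings(texts):
    """Return one embedding (a list of floats) per input text."""
    # Pass a plain list so the request does not depend on how a pandas
    # Series is serialized.
    response = client.embeddings.create(input=list(texts), model=MODEL)
    return [data.embedding for data in response.data]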
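Embedding vectors like the ones returned here are usually compared by cosine similarity. The sketch below computes it on toy vectors so that it runs without an API key; the function name `cosine_similarity` and the toy inputs are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two review embeddings.
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.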

In [18]:
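import pickle

N = 10  # number of reviews sent per embeddings request
embeddings = []
RUN_CALC_EMBEDDINGS = False  # set to True to call the API and refresh the cache

if RUN_CALC_EMBEDDINGS:
    # Embed a random sample of 300 reviews, N at a time.
    samples = df.sample(300)
    for n in range(0, len(samples), N):
        print(f"{n}/{len(samples)}", end=" ")
        texts = samples.Review.iloc[n:n+N]  # positional slice of the sample
        embeddings.extend(calc_embeddings(texts))

    # Cache the sample and its embeddings so later runs can skip the API calls.
    with open('pickle_data/embeddings.pkl', 'wb') as f:
        pickle.dump({"samples": samples, "embeddings": embeddings}, f)
else:
    # Reuse the cached sample and embeddings.
    with open('pickle_data/embeddings.pkl', 'rb') as f:
        pickle_data = pickle.load(f)
    samples = pickle_data["samples"]
    embeddings = pickle_data["embeddings"]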
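The cell above switches between computing and loading with a manual flag. One possible variation is to decide based on whether the cache file exists. The helper `load_or_compute` and the stand-in `fake_embeddings` below are hypothetical names, and the sketch uses a temporary directory instead of `pickle_data/` so that it runs anywhere.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at path, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def fake_embeddings():
    """Stand-in for the API call; records how often it is invoked."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache", "embeddings.pkl")
    first = load_or_compute(cache_path, fake_embeddings)   # computes and saves
    second = load_or_compute(cache_path, fake_embeddings)  # loads from disk

print(first == second, len(calls))
```

The trade-off is that a stale cache is reused silently, whereas the explicit flag makes the recomputation decision visible in the notebook.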

In [19]:
samples["embedding"] = embeddings
samples.head()

| | ID | Source | Country | Provider | Type | Data | Rating | Review | embedding |
|---|---|---|---|---|---|---|---|---|---|
| 2280 | 2281 | Yelp! | UNITED STATES | Maven | Free floating | 43191 | 1 | I will never use maven ever again Their custo... | [-0.022855188697576523, 0.02548600174486637, -... |
| 4937 | 4939 | Facebook | CANADA | Evo Car Share | Free floating | 42705 | 1 | Irresponsible marketing They advertise their ... | [0.02817665785551071, 0.020048251375555992, -0... |
| 544 | 545 | Playstore | Non specificata | Car2go | Free floating | 41579 | 4 | Great app! This is a great way to get somewher... | [-0.03024926967918873, -0.04144366830587387, -... |
| 9474 | 9478 | Playstore | Non specificata | Car2go | Free floating | 41183 | 1 | Does not work with proxy I use a proxy on my p... | [0.0065634711645543575, -0.0272508691996336, 0... |
| 6190 | 6193 | Playstore | Non specificata | Maven | Free floating | 43466 | 1 | The app load to show available cars is really ... | [-0.05524517595767975, -0.02250681258738041, -... |
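Clustering algorithms generally expect a 2-D numeric array rather than a column of Python lists. A sketch of that conversion, on a toy frame with made-up 2-D vectors in place of the real `samples`, could be:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for samples; the vectors are invented for illustration.
toy = pd.DataFrame({
    "Review": ["great app", "never again", "works fine"],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# One row per review, one column per embedding dimension.
matrix = np.array(toy["embedding"].tolist())
print(matrix.shape)
```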
