This notebook is a place to practice embeddings and clustering. My goal is to learn a general pattern for identifying groups of similar items, whether they are ride-hailing trips, customer session journeys, or retail items. I'll be closely following [this Medium post by Zhi Li](https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92).

# Word2vec Embeddings
Word2vec is a popular technique for learning word embeddings with a two-layer neural network. In the CBOW (continuous bag of words) approach, the network uses the surrounding context to predict a target word, while skip-gram uses a word to predict its surrounding context.
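
To make the two directions concrete, here is a minimal sketch of the (input, label) pairs each approach would draw from one token sequence. It only illustrates the pairing; it is not how `gensim` implements training, and the function names are made up for this example.

```python
def context_windows(tokens, window=2):
    """Yield (target, context) with up to `window` tokens on each side of the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the centre word is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the centre word is the input, each context word is a separate label.
    return [
        (target, context_word)
        for target, context in context_windows(tokens, window)
        for context_word in context
    ]


tokens = ["31", "mpg", "Luxury", "Compact", "Sedan"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow[1])      # context -> target for the token 'mpg'
print(skipgram[:3])  # first few target -> context pairs
```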

We'll use `gensim`, a Python natural language processing library, as our word2vec implementation.

In [2]:
import gensim

# Car Data
We'll read in a [car dataset downloaded from Kaggle](https://www.kaggle.com/datasets/CooperUnion/cardataset?resource=download).

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv('./data/raw_car_data.csv')

In [6]:
print(df.shape)

(11914, 16)


In [7]:
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [15]:
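# Collapse to one row per (Make, Model) by taking the first value of each attribute.
# Differences between model years and trims are discarded.
df_model = (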
    df
    .groupby(['Make', 'Model'])
    .agg({
        'Year': 'first',
        'Engine Fuel Type': 'first',
        'Engine HP': 'first',
        'Engine Cylinders': 'first',
        'Transmission Type': 'first',
        'Driven_Wheels': 'first',
        'Number of Doors': 'first',
        'Market Category': 'first',
        'Vehicle Size': 'first',
        'Vehicle Style': 'first',
        'highway MPG': 'first',
        'city mpg': 'first',
        'Popularity': 'first',
        'MSRP': 'first',
    })
    .reset_index()
)

In [16]:
df_model.shape

(928, 16)

In [17]:
df_model.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,Acura,CL,2001,premium unleaded (required),225.0,6.0,AUTOMATIC,front wheel drive,2.0,Luxury,Midsize,Coupe,27,17,204,29980
1,Acura,ILX,2015,premium unleaded (recommended),201.0,4.0,MANUAL,front wheel drive,4.0,"Luxury,Performance",Compact,Sedan,31,22,204,29350
2,Acura,ILX Hybrid,2014,premium unleaded (recommended),111.0,4.0,AUTOMATIC,front wheel drive,4.0,"Luxury,Hybrid",Compact,Sedan,38,39,204,28900
3,Acura,Integra,1999,regular unleaded,140.0,4.0,MANUAL,front wheel drive,4.0,Luxury,Compact,Sedan,29,22,204,2827
4,Acura,Legend,1993,regular unleaded,200.0,6.0,MANUAL,front wheel drive,4.0,Luxury,Midsize,Sedan,23,15,204,2000


# Data Prep
The expected input to `Word2Vec` is a list of lists, where each inner list is one "sentence": a sequence of string tokens.

We're going to turn the vehicle data from `df_model` into this list of lists.
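
As a small illustration of that shape, the sketch below builds two sentences by hand (the same two rows that come out of the tokenization step later in this notebook) and counts how often each token occurs. `toy_sentences` and `token_counts` are names invented for this example.

```python
from collections import Counter

# Each inner list is one sentence; each element is one string token.
# A multi-word string such as 'Acura CL' still counts as a single token.
toy_sentences = [
    ["27", "mpg", "Luxury", "Midsize", "Coupe", "Acura CL"],
    ["31", "mpg", "Luxury", "Performance", "Compact", "Sedan", "Acura ILX"],
]

# Sanity check: a list of lists of strings.
is_valid = all(
    isinstance(sentence, list) and all(isinstance(token, str) for token in sentence)
    for sentence in toy_sentences
)

# Token frequencies across the whole corpus.
token_counts = Counter(token for sentence in toy_sentences for token in sentence)
print(is_valid, token_counts.most_common(3))
```

Attribute tokens such as `mpg` repeat across sentences, while each make/model token appears only once.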

In [461]:
df_train = pd.DataFrame()
# df_train['engine_fuel_type'] = df_model['Engine Fuel Type']
df_train['highway_mpg'] = df_model['highway MPG'].astype(str) + ' mpg'
# df_train['engine_cylinders'] = 'engine cylinders ' + df_model['Engine Cylinders'].astype(str)
# df_train['transmission_type'] = df_model['Transmission Type']
# df_train['driven_wheels'] = df_model['Driven_Wheels']
# df_train['market_category'] = [str(x).split(',')[-1] for x in df_model['Market Category'].fillna('Default')]
df_train['market_category'] = df_model['Market Category'].fillna('Default')
df_train['vehicle_size'] = df_model['Vehicle Size']
df_train['vehicle_style'] = df_model['Vehicle Style']
df_train['make_model'] = df_model['Make'] + ' ' + df_model['Model']

In [462]:
df_train.shape

(928, 5)

In [463]:
df_train.head()

Unnamed: 0,highway_mpg,market_category,vehicle_size,vehicle_style,make_model
0,27 mpg,Luxury,Midsize,Coupe,Acura CL
1,31 mpg,"Luxury,Performance",Compact,Sedan,Acura ILX
2,38 mpg,"Luxury,Hybrid",Compact,Sedan,Acura ILX Hybrid
3,29 mpg,Luxury,Compact,Sedan,Acura Integra
4,23 mpg,Luxury,Midsize,Sedan,Acura Legend


In [464]:
df_single_column = df_train.apply(lambda x: ','.join(x.astype(str)), axis=1)

In [465]:
df_single_column.shape

(928,)

In [466]:
df_single_column.head()

0                 27 mpg,Luxury,Midsize,Coupe,Acura CL
1    31 mpg,Luxury,Performance,Compact,Sedan,Acura ILX
2    38 mpg,Luxury,Hybrid,Compact,Sedan,Acura ILX H...
3            29 mpg,Luxury,Compact,Sedan,Acura Integra
4             23 mpg,Luxury,Midsize,Sedan,Acura Legend
dtype: object

In [467]:
import re

In [468]:
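# sentences = [item.split(',') for item in df_single_column]
# Split every field into word tokens, drop the last two words (assumed to be the
# make and model), then append the full make/model string as a single token.
# Caveat: names longer than two words leave their leading words behind as tokens.
sentences = [re.split(r'\W+', item)[:-2] + [item.split(',')[-1]] for item in df_single_column]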

In [469]:
sentences[:2]

[['27', 'mpg', 'Luxury', 'Midsize', 'Coupe', 'Acura CL'],
 ['31', 'mpg', 'Luxury', 'Performance', 'Compact', 'Sedan', 'Acura ILX']]
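
The `[:-2]` slice in the cell above assumes every make/model name is exactly two words. For a longer name such as `Acura ILX Hybrid`, the leading word (`Acura`) stays in the sentence as its own token. A possible alternative, sketched here under the assumption that the attribute fields and the make/model string are available separately, tokenizes only the attributes and appends the name untouched. `row_to_tokens` and `alt_sentences` are hypothetical names, and the results shown in the rest of this notebook were produced with the original tokenization, not this one.

```python
import re


def row_to_tokens(attribute_fields, make_model):
    """Split attribute fields on non-word characters; keep make_model as one token."""
    tokens = []
    for field in attribute_fields:
        # Filter out empty strings that re.split can produce at the edges.
        tokens.extend(t for t in re.split(r"\W+", str(field)) if t)
    tokens.append(make_model)
    return tokens


# Two rows copied from the df_train preview above.
rows = [
    (["38 mpg", "Luxury,Hybrid", "Compact", "Sedan"], "Acura ILX Hybrid"),
    (["31 mpg", "Luxury,Performance", "Compact", "Sedan"], "Acura ILX"),
]
alt_sentences = [row_to_tokens(fields, name) for fields, name in rows]
print(alt_sentences[0])
```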

# Train word2vec

In [470]:
from gensim.models import Word2Vec

In [500]:
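car_model_model = Word2Vec(
    sentences,
    min_count=1,     # keep every token, even those seen only once
    window=15,       # context window: up to 15 tokens on each side
    workers=3,       # number of training threads
    vector_size=10,  # embedding dimensions
    sg=0,            # 0 = CBOW, 1 = skip-gram
)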

In [501]:
car_model_model.wv['Acura ILX']

array([ 0.09918414, -0.04570673, -0.0536518 ,  0.07772738, -0.0920788 ,
       -0.0762367 , -0.05244715, -0.00252737,  0.04264588,  0.07753378],
      dtype=float32)

In [502]:
car_model_model.wv.most_similar('Toyota Corolla', topn=5)

[('Chevrolet Corsica', 0.9386063814163208),
 ('Scion xB', 0.8791967630386353),
 ('Cadillac Allante', 0.8736720681190491),
 ('Toyota Prius v', 0.8294816613197327),
 ('3', 0.809768795967102)]

In [503]:
car_model_model.wv.most_similar('Acura ILX', topn=5)

[('Pontiac G3', 0.826746940612793),
 ('Mitsubishi Mighty Max Pickup', 0.8143463134765625),
 ('Volvo S70', 0.7307913303375244),
 ('Suzuki Grand Vitara', 0.7175847887992859),
 ('Mazda 6', 0.7044230103492737)]

In [504]:
car_model_model.wv.most_similar('Audi Q5', topn=5)

[('Toyota RAV4 Hybrid', 0.8927847146987915),
 ('Ford Explorer Sport', 0.8797715306282043),
 ('CTS', 0.8325408101081848),
 ('TSX', 0.8235213756561279),
 ('Chevrolet Captiva Sport', 0.7879918813705444)]

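Note, however, that the vocabulary is not limited to car makes and models. It contains every token from the supplied sentences that meets the `min_count` threshold, which is why stray tokens such as `'3'`, `'CTS'`, and `'TSX'` appear among the nearest neighbors above.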

In [481]:
# these represent the vocabulary words learned in the embedding
list(car_model_model.wv.key_to_index.keys())[:10]

['mpg',
 'Luxury',
 'Performance',
 'Compact',
 'Midsize',
 'Sedan',
 '4dr',
 'Large',
 'Default',
 'Hatchback']
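
Because attribute tokens share the vocabulary with car names, one option is to filter the neighbor list down to full make/model names after querying. The sketch below does this on the `Audi Q5` neighbors printed earlier (scores rounded). `keep_known_names` is a hypothetical helper, and `known_names` is a hand-written stand-in for something like `set(df_train['make_model'])`.

```python
def keep_known_names(neighbours, known_names):
    """Drop (token, score) pairs whose token is not a full make/model name."""
    return [(token, score) for token, score in neighbours if token in known_names]


# Neighbours of 'Audi Q5' as printed earlier in this notebook (scores rounded).
neighbours = [
    ("Toyota RAV4 Hybrid", 0.8928),
    ("Ford Explorer Sport", 0.8798),
    ("CTS", 0.8325),
    ("TSX", 0.8235),
    ("Chevrolet Captiva Sport", 0.7880),
]

# Assumption: a small stand-in for the full set of make/model names.
known_names = {
    "Audi Q5",
    "Toyota RAV4 Hybrid",
    "Ford Explorer Sport",
    "Chevrolet Captiva Sport",
}

car_neighbours = keep_known_names(neighbours, known_names)
print(car_neighbours)
```

In the notebook itself, this would mean asking `most_similar` for a larger `topn` and filtering the result, since some of the top hits are discarded.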

In [482]:
car_model_model.wv.similarity('Toyota Camry', 'BMW M4')

0.34580588

In [483]:
car_model_model.wv.similarity('Toyota Camry', 'Honda Accord')

-0.40272263

In [484]:
car_model_model.wv.similarity('Toyota Camry', 'Honda Odyssey')

0.054913536

In [485]:
car_model_model.wv.similarity('Honda Civic', 'Toyota Corolla')

-0.2118072
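
For reference, `wv.similarity` returns the cosine similarity of the two word vectors, which ranges from -1 to 1. A negative value, as for the Camry and Accord above, means the two vectors are more than 90 degrees apart. A minimal standalone sketch of the calculation, using toy vectors rather than the trained embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # anti-parallel vectors
print(same_direction, orthogonal, opposite)
```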

In [494]:
car_model_model.wv.most_similar('Audi Q5', topn=5)

[('Toyota RAV4 Hybrid', 0.8927847146987915),
 ('Ford Explorer Sport', 0.8797715306282043),
 ('CTS', 0.8325408101081848),
 ('TSX', 0.8235213756561279),
 ('Chevrolet Captiva Sport', 0.7879918813705444)]

In [493]:
df_train[df_train['make_model'].isin(['Audi Q5', 'Toyota RAV4 Hybrid', 'Ford Explorer Sport', 'Acura MDX',])]

Unnamed: 0,highway_mpg,market_category,vehicle_size,vehicle_style,make_model
5,27 mpg,"Crossover,Luxury",Midsize,4dr SUV,Acura MDX
45,26 mpg,"Crossover,Luxury",Midsize,4dr SUV,Audi Q5
325,18 mpg,Default,Midsize,2dr SUV,Ford Explorer Sport
860,31 mpg,"Crossover,Hybrid",Midsize,4dr SUV,Toyota RAV4 Hybrid


In [505]:
car_model_model.wv.similarity('Audi Q5', 'Acura MDX')

-0.10118388

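This approach is clearly not doing what we want. The Audi Q5 and Acura MDX have the same market category, vehicle size, and vehicle style and differ by only 1 highway mpg, yet the model gives them a cosine similarity of about -0.10. One likely contributor is that each make/model token appears in exactly one sentence, so its vector receives very few training updates.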
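
As a sanity check on the raw attributes, independent of word2vec, the sketch below computes the Jaccard overlap between the attribute tokens of the four SUVs shown above. This is a swapped-in baseline for comparison, not the method used in this notebook, and `attribute_tokens` and `jaccard` are hypothetical helpers.

```python
import re


def attribute_tokens(*fields):
    """Set of word tokens found across the given attribute strings."""
    return {t for field in fields for t in re.split(r"\W+", field) if t}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


# Attribute values copied from the df_train rows shown above.
cars = {
    "Acura MDX": attribute_tokens("27 mpg", "Crossover,Luxury", "Midsize", "4dr SUV"),
    "Audi Q5": attribute_tokens("26 mpg", "Crossover,Luxury", "Midsize", "4dr SUV"),
    "Ford Explorer Sport": attribute_tokens("18 mpg", "Default", "Midsize", "2dr SUV"),
    "Toyota RAV4 Hybrid": attribute_tokens("31 mpg", "Crossover,Hybrid", "Midsize", "4dr SUV"),
}

overlap_with_q5 = {
    name: jaccard(cars["Audi Q5"], tokens)
    for name, tokens in cars.items()
    if name != "Audi Q5"
}
for name, score in sorted(overlap_with_q5.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

On these four rows, the Acura MDX has the highest attribute overlap with the Audi Q5, which is the ranking we would want the embedding to reproduce.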