# Content-Based Embedding

In this experiment we use the movie attributes (content features) to generate dense vectors that preserve content similarity between movies.

Sparse matrices are essential here: the feature space has tens of thousands of columns and is almost entirely zeros. First we create sparse tf-idf features from the text attributes, then we encode the categorical features as sparse binary (one-hot) features. Finally, the Truncated SVD algorithm converts the sparse features into low-dimensional, dense embedding features.
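
The three steps above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the four texts, the director indices, and the choice of two components are made up, and the default `TruncatedSVD` solver is used for brevity.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy data (hypothetical): four "movies" with a text field and a director index.
texts = [
    "space battle rebels empire",
    "space opera rebels droid",
    "wedding family comedy feud",
    "family comedy road trip",
]
director_idx = np.array([0, 0, 1, 2])

# 1) Sparse tf-idf features from the text.
x_text = TfidfVectorizer().fit_transform(texts)

# 2) Sparse one-hot features from the categorical column.
rows = np.arange(len(texts))
ones = np.ones(len(texts))
x_cat = csr_matrix((ones, (rows, director_idx)), shape=(len(texts), 3))

# 3) Stack the sparse blocks and reduce them to a small dense embedding.
x_all = hstack([x_text, x_cat]).tocsr()
svd = TruncatedSVD(n_components=2, random_state=0)
embedding = normalize(svd.fit_transform(x_all))

print(embedding.shape)  # (4, 2)
```

In this toy case the two space movies should end up closer to each other than to the family comedies, since they share both words and a director.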

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

from scipy.sparse import coo_matrix, hstack

from src.text import clean_text
from src.factorization import describe_csr_matrix
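
The helper `describe_csr_matrix` comes from the project's own `src` package and its source is not shown in this notebook. Judging only from the output it prints further down, a hypothetical stand-in (the name `describe_sparse_sketch` is an assumption) could look like this:

```python
from scipy.sparse import csr_matrix

def describe_sparse_sketch(x) -> str:
    """Hypothetical stand-in for src.factorization.describe_csr_matrix."""
    n_rows, n_cols = x.shape
    # Sparsity is the share of cells that are not explicitly stored.
    sparsity = 100.0 * (1.0 - x.nnz / (n_rows * n_cols))
    return f"{n_rows} x {n_cols} sparse matrix with {sparsity:.2f}% of sparsity."

print(describe_sparse_sketch(csr_matrix([[1, 0, 0, 0], [0, 0, 0, 2]])))
# 2 x 4 sparse matrix with 75.00% of sparsity.
```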

## Text Embedding Features

Here we combine all text fields into a single one. Then we apply the `TfidfVectorizer` to encode the text as sparse numerical features. Before applying SVD, we normalize each row by its l2-norm, and we normalize the rows again after SVD. With unit-norm rows, the cosine similarity can be computed as the dot product between the vectors.
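
To see why the l2 normalization matters, here is a small check that the dot product of unit-norm rows equals the cosine similarity of the original rows. The two vectors are arbitrary illustrative values.

```python
import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 1.0, 2.0]])

# Cosine similarity from its definition.
cosine = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# The same value as a plain dot product once both rows have unit l2-norm.
a_unit = normalize(a, norm="l2", axis=1)
b_unit = normalize(b, norm="l2", axis=1)
dot = (a_unit @ b_unit.T).item()

print(round(cosine, 4), round(dot, 4))  # 0.5963 0.5963
```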

In [2]:
movies = pd.read_csv('data/movie_info.csv', index_col='id')
tags = pd.read_csv('data/movie_tags.csv', index_col='id')
keywords = pd.read_csv('data/movie_keywords.csv', index_col='id')

movies = movies.join(tags)
movies = movies.join(keywords)

movies.fillna('', inplace=True)
movies.reset_index(drop=False, inplace=True)
movies['idx'] = movies.index

movies.head()

| | id | original_title | title | overview | tagline | tags | keywords | idx |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Jumanji | Jumanji | When siblings Judy and Peter discover an encha... | Roll the dice and unleash the excitement! | fantasy adapted from:book animals bad cgi base... | board game disappearance based on children's b... | 0 |
| 1 | 3 | Grumpier Old Men | Grumpier Old Men | A family wedding reignites the ancient feud be... | Still Yelling. Still Fighting. Still Ready for... | moldy old Ann Margaret Burgess Meredith Daryl ... | fishing best friend duringcreditsstinger old men | 1 |
| 2 | 4 | Waiting to Exhale | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | Friends are the people who let you be yourself... | characters girl movie characters chick flick b... | based on novel interracial relationship single... | 2 |
| 3 | 5 | Father of the Bride Part II | Father of the Bride Part II | Just when George Banks has recovered from his ... | Just When His World Is Back To Normal... He's ... | steve martin steve martin pregnancy remake agi... | baby midlife crisis confidence aging daughter ... | 3 |
| 4 | 6 | Heat | Heat | Obsessive master thief, Neil McCauley leads a ... | A Los Angeles Crime Saga | overrated bank robbery crime heists relationsh... | robbery detective bank obsession chase shootin... | 4 |


After joining all text fields, we apply some cleaning procedures:
- Lower case.
- Remove numbers, punctuation and special chars.
- Remove single char words.
- Remove stopwords.
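
The `clean_text` function is imported from the project's `src.text` module and is not shown here. The four steps listed above could be sketched roughly as follows. The function name `clean_text_sketch`, the tiny stopword set, and the ASCII-only character filter are all assumptions made for illustration, not the project's actual implementation.

```python
import re

# Tiny illustrative stopword set; a real helper would likely use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "is", "it", "from"}

def clean_text_sketch(text: str) -> str:
    """Hypothetical stand-in for src.text.clean_text, following the four steps."""
    text = text.lower()                                  # lower case
    text = re.sub(r"[^a-z\s]", " ", text)                # drop numbers, punctuation, special chars
    tokens = text.split()
    tokens = [t for t in tokens if len(t) > 1]           # drop single-char words
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stopwords
    return " ".join(tokens)

print(clean_text_sketch("Roll the dice and unleash the excitement!"))
# roll dice unleash excitement
```

Note that this sketch also strips accented letters, which may or may not match what the real helper does.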

In [3]:
%%time

movies['text'] = movies['title'] + " " + movies['overview'] + " " + movies['tagline'] + " " + movies['tags'] + " " + movies['keywords']
text = list(map(clean_text, movies['text']))

CPU times: user 5.15 s, sys: 8 ms, total: 5.16 s
Wall time: 5.15 s


In [4]:
text[0]

'jumanji siblings judy peter discover enchanted board game opens door magical world unwittingly invite alan adult trapped inside game years living room alan hope freedom finish game proves risky three find running giant rhinoceroses evil monkeys terrifying creatures roll dice unleash excitement fantasy adapted book animals bad cgi based book board game childhood recaptured children chris van allsburg fantasy filmed bc jungle kid flick kirsten dunst monkey robin williams saturn award best special effects saturn award best supporting actress scary time time travel fantasy robin williams adapted book childish children kid flick time travel robin williams time travel robin williams joe johnston robin williams children kid flick itaege fantasy robin williams scary time travel game animals comedy fiction thrill dynamic cgi action bad cgi horrifying horror genre kirsten dunst magic board game monkey kids based children book board game disappearance giant insect new home recluse animals fantas

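We consider a vocabulary of at most 30k terms, taking into account unigrams and bigrams. Terms that appear in fewer than 5 documents (`min_df=5`) or in more than half of the documents (`max_df=0.5`) are ignored.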

In [5]:
%%time

vectorizer = TfidfVectorizer(lowercase=False, max_df=0.5, min_df=5, ngram_range=(1, 2), 
                             norm='l2', max_features=30000, sublinear_tf=True)
tfidf = vectorizer.fit_transform(text)

print(describe_csr_matrix(tfidf))

20624 x 30000 sparse matrix with 99.79% of sparsity.
CPU times: user 7.47 s, sys: 192 ms, total: 7.66 s
Wall time: 7.66 s
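
As a small illustration of what `ngram_range=(1, 2)` adds to the vocabulary, the sketch below fits a vectorizer on three made-up documents and lists the learned terms. The documents are hypothetical, and the frequency filters are left at their defaults because the corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["board game magic", "board game night", "magic night"]

vec = TfidfVectorizer(ngram_range=(1, 2), norm="l2")
x = vec.fit_transform(docs)

# The vocabulary holds single words and adjacent word pairs.
print(sorted(vec.vocabulary_))
# ['board', 'board game', 'game', 'game magic', 'game night', 'magic', 'magic night', 'night']
print(x.shape)  # (3, 8)
```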


Then we reduce the dimension to 300 using SVD. We select the `arpack` solver here; `TruncatedSVD` accepts sparse input directly with either of its solvers (`arpack` or `randomized`).

In [6]:
%%time

N_COMPONENTS = 300

svd = TruncatedSVD(n_components=N_COMPONENTS, algorithm="arpack", random_state=0)
tfidf_embedding = svd.fit_transform(tfidf)
tfidf_embedding = normalize(tfidf_embedding, norm="l2", axis=1, copy=False)

CPU times: user 2min 54s, sys: 2.68 s, total: 2min 57s
Wall time: 15.7 s
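
Both solvers offered by `TruncatedSVD` take a SciPy sparse matrix as input and return a dense array. A minimal sketch on a synthetic matrix (the shape, density, and number of components are arbitrary choices for illustration):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for the tf-idf features.
x = sparse_random(200, 50, density=0.05, format="csr", random_state=0)

for algorithm in ("arpack", "randomized"):
    svd = TruncatedSVD(n_components=5, algorithm=algorithm, random_state=0)
    dense = svd.fit_transform(x)  # input stays sparse, output is a dense ndarray
    print(algorithm, dense.shape, round(svd.explained_variance_ratio_.sum(), 3))
```

Comparing `explained_variance_ratio_` between solvers is one simple way to check that the cheaper solver is good enough for a given matrix.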


In [7]:
tfidf_embedding[:10, :5]

array([[ 0.25727685,  0.14228107, -0.05615754, -0.22410679, -0.04457764],
       [ 0.28550536, -0.00267135, -0.01574177, -0.07095338, -0.13337099],
       [ 0.26452568, -0.00370943,  0.04026973, -0.19422941, -0.00220639],
       [ 0.26318172, -0.01038827,  0.04837783, -0.08737987, -0.10562586],
       [ 0.28532055,  0.05990469, -0.17970695, -0.02468343,  0.08128035],
       [ 0.24741332,  0.02394533,  0.06111758, -0.22589893, -0.01820054],
       [ 0.24301806,  0.06613637, -0.05265269, -0.13214851, -0.00093491],
       [ 0.09992267,  0.03170735, -0.04626095,  0.02680382, -0.03229045],
       [ 0.18925709,  0.09980298, -0.11548792, -0.01287145,  0.01059282],
       [ 0.25855054,  0.08158593,  0.03407546, -0.08312593, -0.03829589]])

Now, let's use the [Embedding Projector](https://projector.tensorflow.org/) to visualize the embedding space:

In [8]:
%%time

movies[['id','title']].to_csv('output/content_embedding_meta.tsv', sep='\t', header=True, index=False)
pd.DataFrame(tfidf_embedding).to_csv('output/tfidf_embedding_vectors.tsv', sep='\t', 
                                     float_format='%.5f', header=False, index=False)

CPU times: user 8.82 s, sys: 44 ms, total: 8.87 s
Wall time: 8.87 s


![](img/tfidf_lord_rings.png)

![](img/tfidf_star_wars.png)

![](img/tfidf_pulp_fiction.png)

## Categorical Embedding Features

Categorical features must be transformed into a sparse matrix with movies as rows and categories as columns. Row `i` has to refer to the same movie as row `i` of the tf-idf matrix, which is why we carry over the `idx` column from `movies`.

In [9]:
crew = pd.read_csv('data/movie_producer.csv', index_col='id')

crew = crew.join(movies[['id','idx']].set_index('id'), how='inner')
crew.reset_index(drop=True, inplace=True)
crew.rename(columns={'idx':'movie_idx'}, inplace=True)

crew['person_id'] = crew['person_id'].astype('category')
crew['person_idx'] = crew['person_id'].cat.codes

crew.sort_values(['person_idx','movie_idx'], inplace=True)

crew.head()

| | person_id | job | person_name | num_movies | movie_idx | person_idx |
|---|---|---|---|---|---|---|
| 361 | 1 | Director | George Lucas | 61 | 204 | 0 |
| 362 | 1 | Writer | George Lucas | 61 | 204 | 0 |
| 3389 | 1 | Director | George Lucas | 61 | 1923 | 0 |
| 4327 | 1 | Director | George Lucas | 61 | 2461 | 0 |
| 6794 | 1 | Director | George Lucas | 61 | 3921 | 0 |


Some people have more than one job in the same movie, so we count the number of roles per person and movie and use that count as the value in the sparse matrix.

In [10]:
crew = crew.groupby(['person_idx','movie_idx']).agg(
    count=pd.NamedAgg(column='movie_idx', aggfunc='count')
)
crew.reset_index(drop=False, inplace=True)

crew.head()

| | person_idx | movie_idx | count |
|---|---|---|---|
| 0 | 0 | 204 | 2 |
| 1 | 0 | 1923 | 1 |
| 2 | 0 | 2461 | 1 |
| 3 | 0 | 3921 | 2 |
| 4 | 0 | 4758 | 1 |
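
The role-counting step can be reproduced on a toy frame. The sketch below uses `groupby(...).size()`, an alternative to the `NamedAgg` call that should give the same counts as long as the grouping keys have no missing values. The rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: person 0 is both director and writer of movie 204.
toy_crew = pd.DataFrame({
    "person_idx": [0, 0, 0, 1],
    "movie_idx": [204, 204, 1923, 204],
    "job": ["Director", "Writer", "Director", "Producer"],
})

# One row per (person, movie) pair, with the number of jobs as the value.
counts = (
    toy_crew.groupby(["person_idx", "movie_idx"])
    .size()
    .reset_index(name="count")
)
print(counts)
```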


In [11]:
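# COO triplets: one value per (row=movie, col=person) pair.
values = crew['count'].values
idx = (crew.movie_idx.values, crew.person_idx.values)
# Use the full number of movies so rows line up with the tf-idf matrix,
# even for movies that have no crew entry.
dim = (movies.idx.max()+1, crew.person_idx.max()+1)
x_crew = coo_matrix((values, idx), shape=dim).tocsr()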

print(describe_csr_matrix(x_crew))

20624 x 3635 sparse matrix with 99.97% of sparsity.
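
The construction from `(value, (row, col))` triplets is easier to follow on a tiny example. The indices and counts below are made up. Note how passing an explicit `shape` keeps an all-zero row for a movie with no people attached.

```python
import numpy as np
from scipy.sparse import coo_matrix

# (movie_idx, person_idx, count) triplets for 3 movies and 2 people.
movie_idx = np.array([0, 0, 2])
person_idx = np.array([0, 1, 1])
count = np.array([2, 1, 1])

x = coo_matrix((count, (movie_idx, person_idx)), shape=(3, 2)).tocsr()
print(x.toarray())
# [[2 1]
#  [0 0]
#  [0 1]]
```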


After creating the sparse matrix, we normalize its rows and concatenate it column-wise with the tf-idf matrix:

In [12]:
x_crew = normalize(x_crew, norm="l2", axis=1, copy=False)
x_tfidf_crew = hstack([tfidf, x_crew])

print(describe_csr_matrix(x_tfidf_crew))

20624 x 33635 sparse matrix with 99.81% of sparsity.
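
A small sketch of the normalize-then-stack step, using made-up values. Normalizing each block separately means every block contributes at most unit norm to a row, so neither the text nor the people features dominate simply because of their scale.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import normalize

x_text = csr_matrix(np.array([[0.6, 0.8], [1.0, 0.0]]))  # rows already unit-norm
x_people = csr_matrix(np.array([[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]))

x_people = normalize(x_people, norm="l2", axis=1)  # all-zero rows stay zero
x_both = hstack([x_text, x_people]).tocsr()

print(x_both.shape)  # (2, 5)
```

The second row has no people features, so only its text block carries weight after stacking.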


Now we apply SVD to all sparse features (text + categorical):

In [13]:
%%time

tfidf_crew_embedding = svd.fit_transform(x_tfidf_crew)
tfidf_crew_embedding = normalize(tfidf_crew_embedding, norm="l2", axis=1, copy=False)

CPU times: user 2min 55s, sys: 2.46 s, total: 2min 57s
Wall time: 15.8 s


In [14]:
tfidf_crew_embedding[:10, :5]

array([[ 0.31059373,  0.17439317, -0.07966628, -0.24312864,  0.05104854],
       [ 0.33851136, -0.00991364,  0.00576794, -0.06548582, -0.1383411 ],
       [ 0.32878417, -0.01374984,  0.06625597, -0.204537  , -0.05281414],
       [ 0.30297741, -0.03919086,  0.08155608, -0.1031477 , -0.0885925 ],
       [ 0.15918579,  0.04355397, -0.09853233, -0.04052476, -0.02758957],
       [ 0.12162072,  0.02209307,  0.03885346, -0.12141141, -0.0453824 ],
       [ 0.32863383,  0.08996191, -0.0563822 , -0.15020375, -0.03584434],
       [ 0.06833647,  0.02258088, -0.04616372,  0.02299206, -0.00446049],
       [ 0.17945057,  0.10574306, -0.12189527, -0.02358589,  0.01602491],
       [ 0.15631755,  0.04875546,  0.03154279, -0.07359079, -0.04261311]])

In [15]:
%%time

pd.DataFrame(tfidf_crew_embedding).to_csv('output/tfidf_crew_embedding_vectors.tsv', sep='\t', 
                                          float_format='%.5f', header=False, index=False)

CPU times: user 8.76 s, sys: 36 ms, total: 8.79 s
Wall time: 8.79 s


![](img/tfidf_crew_lord_rings.png)

![](img/tfidf_crew_star_wars.png)

![](img/tfidf_crew_pulp_fiction.png)

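We apply the same process to the cast information. Note that, unlike `x_crew`, the `x_cast` matrix is not row-normalized in the cells below before it is stacked with the other features: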

In [16]:
cast = pd.read_csv('data/movie_actor.csv', index_col='id')

cast = cast.join(movies[['id','idx']].set_index('id'), how='inner')
cast.reset_index(drop=True, inplace=True)
cast.rename(columns={'idx':'movie_idx'}, inplace=True)

cast['actor_id'] = cast['actor_id'].astype('category')
cast['actor_idx'] = cast['actor_id'].cat.codes
cast['count'] = 1.0

cast.sort_values(['actor_idx','movie_idx'], inplace=True)

cast.head()

| | actor_id | actor_name | num_movies | movie_idx | actor_idx | count |
|---|---|---|---|---|---|---|
| 14338 | 1 | George Lucas | 20 | 1516 | 0 | 1.0 |
| 23600 | 1 | George Lucas | 20 | 2547 | 0 | 1.0 |
| 55588 | 1 | George Lucas | 20 | 6392 | 0 | 1.0 |
| 59375 | 1 | George Lucas | 20 | 6935 | 0 | 1.0 |
| 84758 | 1 | George Lucas | 20 | 10202 | 0 | 1.0 |


In [17]:
values = cast['count'].values
idx = (cast.movie_idx.values, cast.actor_idx.values,)
dim = (movies.idx.max()+1, cast.actor_idx.max()+1)
x_cast = coo_matrix((values, idx), shape=dim).tocsr()

print(describe_csr_matrix(x_cast))

20624 x 10269 sparse matrix with 99.94% of sparsity.


In [18]:
x_tfidf_crew_cast = hstack([tfidf, x_crew, x_cast])

print(describe_csr_matrix(x_tfidf_crew_cast))

20624 x 43904 sparse matrix with 99.84% of sparsity.


In [19]:
%%time

tfidf_crew_cast_embedding = svd.fit_transform(x_tfidf_crew_cast)
tfidf_crew_cast_embedding = normalize(tfidf_crew_cast_embedding, norm="l2", axis=1, copy=False)

CPU times: user 3min 34s, sys: 3.27 s, total: 3min 37s
Wall time: 19.2 s


In [20]:
tfidf_crew_cast_embedding[:10, :5]

array([[ 5.97115181e-02, -1.68955073e-01, -2.03521438e-02,
        -3.50106713e-03,  4.41017307e-02],
       [ 5.27002925e-02, -8.92850924e-02, -2.73559422e-02,
         2.60500081e-04, -6.02938200e-02],
       [ 7.78474004e-02, -1.93302408e-01, -4.77872985e-02,
        -2.22740679e-03,  2.55805720e-02],
       [ 5.09061308e-02, -1.27617687e-01, -1.85314134e-02,
        -2.61425765e-03,  1.86144973e-02],
       [ 5.03717413e-02, -1.59873186e-01, -2.73723736e-02,
        -3.54352878e-03,  5.11350651e-02],
       [ 5.50013444e-02, -1.46239852e-01, -2.02701450e-02,
        -1.80958369e-03,  2.07394806e-02],
       [ 1.85454047e-01, -3.60821251e-01, -8.32022824e-02,
        -3.71871403e-03, -1.30683332e-02],
       [ 5.56385271e-02, -1.27784343e-01, -3.40226278e-02,
        -1.44905896e-03, -1.11192659e-03],
       [ 3.82868947e-02, -1.04685110e-01, -2.44340864e-02,
        -2.14479531e-03,  3.64669762e-02],
       [ 5.87999692e-02, -1.52994027e-01, -2.44519542e-02,
        -1.98539931e-03

In [21]:
%%time

pd.DataFrame(tfidf_crew_cast_embedding).to_csv('output/tfidf_crew_cast_embedding_vectors.tsv', sep='\t', 
                                          float_format='%.5f', header=False, index=False)

CPU times: user 8.62 s, sys: 28 ms, total: 8.65 s
Wall time: 8.64 s


![](img/tfidf_crew_cast_lord_rings.png)

![](img/tfidf_crew_cast_star_wars.png)

![](img/tfidf_crew_cast_pulp_fiction.png)

We now have content embedding features ready to be used in the similarity matching task.
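
As a rough idea of how such embeddings could be queried, the sketch below returns the rows most similar to a given row using a plain dot product. It assumes every row has unit l2-norm, as produced by the `normalize` calls in this notebook. The function name `top_k_similar` and the toy values are hypothetical.

```python
import numpy as np

def top_k_similar(embedding: np.ndarray, query_idx: int, k: int = 3) -> np.ndarray:
    """Return indices of the k rows most similar to row `query_idx`.

    Assumes every row of `embedding` has unit l2-norm, so the dot
    product equals the cosine similarity.
    """
    scores = embedding @ embedding[query_idx]
    scores[query_idx] = -np.inf  # exclude the query itself
    return np.argsort(-scores)[:k]

# Toy unit-norm embedding (hypothetical values).
toy = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
print(top_k_similar(toy, query_idx=0, k=2))  # [1 2]
```

For a catalogue of this size (about 20k movies) a brute-force dot product like this is usually fast enough; an approximate nearest-neighbour index would only be worth considering for much larger collections.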