# Content Based Recommendation

Popular movies can be recommended via collaborative filtering, or by filtering out unpopular movies with a threshold and averaging the ratings of the rest. However, recently uploaded movies have no user ratings yet. This situation, known as the cold-start problem, means that new movies are never recommended to users.

To recommend recently published movies, content-based recommendation can be applied. It compares the features of two movies to measure how similar they are, and then retrieves the most similar movies for each movie.

In this chapter, recommendations are generated with two types of content-based approaches:
1. Genre based recommendation
2. TF-IDF based recommendation
    * Item to Item (Non-Personalized)
    * Item to User (Personalized)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Libraries for genre based recommendation
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform

# Libraries for TF-IDF based recommendation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Load movies
df_movies = pd.read_csv('datasets/movies.csv')
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## 1. Genre Based Recommendation

### Jaccard Similarity

It is the size of the intersection of two sets divided by the size of their union. We adapt this to our problem as the number of genres that two movies share divided by the number of distinct genres across both movies.
<div>
<img src="resources/jaccard_similarity.jpeg" width="500"/>
</div>

*[Reference](https://www.youtube.com/watch?v=Ah_4xqvS1WU)*
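
As a minimal sketch of the definition above, Jaccard similarity can be computed directly on Python sets. The helper name `jaccard_similarity` is hypothetical and used only for illustration; the two genre sets are copied from the first rows of `df_movies`.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


toy_story = {"Adventure", "Animation", "Children", "Comedy", "Fantasy"}
jumanji = {"Adventure", "Children", "Fantasy"}

# 3 shared genres out of 5 distinct genres overall
print(jaccard_similarity(toy_story, jumanji))  # 0.6
```

The cells below obtain the same quantity for every pair of movies at once by working on binary genre vectors instead of sets.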

In [3]:
# Split genres to generate matches between title and each genre
df_movies['genres_splitted'] = df_movies.genres.apply(lambda x: x.split('|'))
df_movies

Unnamed: 0,movieId,title,genres,genres_splitted
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),Comedy|Romance,"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),Comedy,[Comedy]
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,"[Action, Animation, Comedy, Fantasy]"
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,"[Animation, Comedy, Fantasy]"
9739,193585,Flint (2017),Drama,[Drama]
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,"[Action, Animation]"


In [4]:
# Generating a new dataframe where each row corresponds to one title and one genre
title_genre_matches = []
for idx, title in enumerate(df_movies['title']):
    for genre in df_movies['genres_splitted'][idx]:
        title_genre_matches.append([title, genre])

df_movie_genres = pd.DataFrame(title_genre_matches, columns=['title', 'genre'])
df_movie_genres.head()

Unnamed: 0,title,genre
0,Toy Story (1995),Adventure
1,Toy Story (1995),Animation
2,Toy Story (1995),Children
3,Toy Story (1995),Comedy
4,Toy Story (1995),Fantasy


In [5]:
# Generate binary genre vectors from genres for each movie
df_movie_cross = pd.crosstab(df_movie_genres['title'], df_movie_genres['genre'])
df_movie_cross.drop(['(no genres listed)'], axis=1, inplace=True)
df_movie_cross.head()

title,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
'71 (2014),1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
'Hellboy': The Seeds of Creation (2004),1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0
'Round Midnight (1986),0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
'Salem's Lot (2004),0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0
'Til There Was You (1997),0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0


In [6]:
# Compare two similar movies
toy_story = df_movie_cross.loc['Toy Story (1995)']
incredibles_2 = df_movie_cross.loc['Incredibles 2 (2018)']

print(f"Genres of Toy Story: {toy_story[toy_story==1].index.values}")
print(f"Genres of Incredibles 2: {incredibles_2[incredibles_2==1].index.values}")
print(f"Jaccard Similarity: {jaccard_score(toy_story, incredibles_2)}")

Genres of Toy Story: ['Adventure' 'Animation' 'Children' 'Comedy' 'Fantasy']
Genres of Incredibles 2: ['Action' 'Adventure' 'Animation' 'Children']
Jaccard Similarity: 0.5


In [7]:
# Compare two different movies
toy_story = df_movie_cross.loc['Toy Story (1995)']
matrix = df_movie_cross.loc['Matrix, The (1999)']

print(f"Genres of Toy Story: {toy_story[toy_story==1].index.values}")
print(f"Genres of Matrix: {matrix[matrix==1].index.values}")
print(f"Jaccard Similarity: {jaccard_score(toy_story, matrix)}")

Genres of Toy Story: ['Adventure' 'Animation' 'Children' 'Comedy' 'Fantasy']
Genres of Matrix: ['Action' 'Sci-Fi' 'Thriller']
Jaccard Similarity: 0.0


In [8]:
# Pairwise Jaccard distances between all movies (condensed 1-D form)
jaccard_distances = pdist(df_movie_cross.values, metric='jaccard')
print(jaccard_distances)

[0.875      0.8        0.66666667 ... 1.         1.         0.66666667]


In [9]:
# Expand the condensed distances into a square movie-by-movie matrix
square_jaccard_distances = squareform(jaccard_distances)
print(square_jaccard_distances)

[[0.         0.875      0.8        ... 0.6        1.         1.        ]
 [0.875      0.         1.         ... 0.85714286 0.83333333 0.83333333]
 [0.8        1.         0.         ... 1.         1.         0.66666667]
 ...
 [0.6        0.85714286 1.         ... 0.         1.         1.        ]
 [1.         0.83333333 1.         ... 1.         0.         0.66666667]
 [1.         0.83333333 0.66666667 ... 1.         0.66666667 0.        ]]


In [10]:
df_movie_cross.head()

title,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
'71 (2014),1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
'Hellboy': The Seeds of Creation (2004),1,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0
'Round Midnight (1986),0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0
'Salem's Lot (2004),0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0
'Til There Was You (1997),0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0


In [11]:
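# Jaccard similarity is 1 - Jaccard distance
# (note: df_distances below holds similarities despite its name)
jaccard_similarity_array = 1 - square_jaccard_distances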
df_movie_cross = df_movie_cross.reset_index()

df_distances = pd.DataFrame(jaccard_similarity_array,
                            index=df_movie_cross['title'],
                            columns=df_movie_cross['title'])
df_distances.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.125,0.2,0.333333,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.4,0.2,0.2,0.2,0.4,0.4,0.4,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.125,1.0,0.0,0.0,0.0,0.0,0.2,0.0,0.142857,0.285714,...,0.0,0.0,0.0,0.0,0.0,0.142857,0.142857,0.142857,0.166667,0.166667
'Round Midnight (1986),0.2,0.0,1.0,0.2,0.333333,0.0,0.0,0.5,0.25,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.333333
'Salem's Lot (2004),0.333333,0.0,0.2,1.0,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.75,0.5,0.5,0.2,0.166667,0.166667,0.166667,0.0,0.0
'Til There Was You (1997),0.2,0.0,0.333333,0.2,1.0,0.5,0.0,0.5,0.666667,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0


In [12]:
# Most similar items for Toy Story
movie = 'Toy Story (1995)'
df_distances[df_distances.index != movie][movie].sort_values(ascending=False).head()

title
Tale of Despereaux, The (2008)                             1.0
Monsters, Inc. (2001)                                      1.0
Adventures of Rocky and Bullwinkle, The (2000)             1.0
Asterix and the Vikings (Astérix et les Vikings) (2006)    1.0
Toy Story 2 (1999)                                         1.0
Name: Toy Story (1995), dtype: float64

In [13]:
# Most similar items for Matrix
movie = 'Matrix, The (1999)'
df_distances[df_distances.index != movie][movie].sort_values(ascending=False).head()

title
Universal Soldier: Day of Reckoning (2012)    1.0
X-Men: The Last Stand (2006)                  1.0
Screamers (1995)                              1.0
Eve of Destruction (1991)                     1.0
X-Men Origins: Wolverine (2009)               1.0
Name: Matrix, The (1999), dtype: float64

### Conclusion

Rather than relying on ratings or popularity, we obtained similar movies for each movie by comparing their genres. With this approach, new movies can also be recommended to users with ease. In addition to genres, release years could also be used as features, at the cost of increased sparsity in the data.

## 2. TF-IDF Based Recommendation

In real-world data, product features may be filled in incorrectly or incompletely by merchants. They may simply copy and paste the description of the product into the description field. Also, checking only the product feature fields may limit how well the product is described. For this reason, in this section, movie similarities are obtained from the descriptions of the movies.

[Reference Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset)

### TF-IDF Formula
<div>
<img src="resources/tf_idf.png" width="500"/>
</div>

*[Image Reference](https://app.datacamp.com/learn/courses/building-recommendation-engines-in-python)*

### Numerical Example

Let's assume that the word "military" appears 5 times in a movie overview that contains 100 words. The term frequency (TF) is calculated as below.

$TF = 5 / 100 = 0.05$

Also assume that there are 10,000 overviews and the word "military" occurs in 100 of them. Using a base-10 logarithm, the IDF score is:

$IDF = \log_{10}(10000 / 100) = 2$

Hence, the final TF-IDF score is the product of the two.

$TF\text{-}IDF = TF \times IDF = 0.05 \times 2 = 0.1$
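
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the textbook formula with a base-10 logarithm (an assumption that matches the result of 2 in the example); the variable names are illustrative.

```python
import math

term_count = 5        # occurrences of "military" in one overview
total_terms = 100     # number of words in that overview
n_documents = 10_000  # number of overviews in the corpus
n_containing = 100    # overviews that contain "military"

tf = term_count / total_terms
idf = math.log10(n_documents / n_containing)
tf_idf = tf * idf

print(tf, idf, round(tf_idf, 4))  # 0.05 2.0 0.1
```

Note that scikit-learn's `TfidfVectorizer`, used in the cells below, by default applies a natural logarithm with smoothing and L2-normalizes each row, so its scores will not match this hand calculation exactly.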

In [14]:
# Loading movies with overviews
df_tmdb_movies = pd.read_csv('datasets/tmdb_movies.csv')
df_tmdb_movies.dropna(inplace=True)
df_tmdb_movies.head()

Unnamed: 0,movieId,title,overview
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...
4,49529,John Carter,"John Carter is a war-weary, former military ca..."


In [15]:
# Calculating tf-idf score
# min_df=2 keeps only words that occur in at least two different overviews
# max_df=0.75 drops words that occur in more than 75% of the overviews
# stop_words='english' removes English stop words
tfidf_vec = TfidfVectorizer(min_df=2, max_df=0.75, stop_words='english')
data_vectorized = tfidf_vec.fit_transform(df_tmdb_movies['overview'])
print(tfidf_vec.get_feature_names_out()[-10:])

['zion' 'zoe' 'zombie' 'zombies' 'zone' 'zoo' 'zooey' 'zookeeper'
 'zoologists' 'zorro']
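
To make the effect of `min_df` and `max_df` concrete, the following pure-Python sketch mimics document-frequency filtering on a toy corpus. It is an illustration under simplifying assumptions (whitespace tokenization, no stop-word handling), not scikit-learn's implementation; the function name `filter_vocabulary` and the corpus are hypothetical.

```python
def filter_vocabulary(docs, min_df=2, max_df=0.75):
    """Keep terms found in at least `min_df` documents and in at most
    a `max_df` fraction of all documents."""
    doc_freq = {}
    for doc in docs:
        # set() so that a term counts once per document
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


corpus = [
    "space marine war",
    "space pirate ship",
    "space war hero",
    "space zoo",
]

# "space" is in every overview (too common); most words appear only once (too rare)
print(filter_vocabulary(corpus))  # ['war']
```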


In [16]:
# Building a dataframe of TF-IDF vectors, one row per movie
df_tfidf = pd.DataFrame(data_vectorized.toarray(),
                        columns=tfidf_vec.get_feature_names_out())
df_tfidf.index = df_tmdb_movies['title']
df_tfidf.head()

title,00,000,007,10,100,1000,10th,11,119,11th,...,zion,zoe,zombie,zombies,zone,zoo,zooey,zookeeper,zoologists,zorro
Avatar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Pirates of the Caribbean: At World's End,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Spectre,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Dark Knight Rises,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
John Carter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cosine Similarity

Cosine similarity compares the angle between two high-dimensional vectors. Even if the magnitudes of the vectors differ greatly, the angle between them may still be small. The similarity increases as the angle decreases.

<div>
<img src="resources/cosine_similarity.png" width="300"/>
</div>

*[Image Reference](https://app.datacamp.com/learn/courses/building-recommendation-engines-in-python)*
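
The point about magnitude can be checked with a small, self-contained sketch. The helper `cosine` and the three toy vectors are hypothetical and only serve to illustrate the formula; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]
b = [10.0, 20.0, 0.0]  # same direction as a, ten times longer
c = [0.0, 0.0, 3.0]    # orthogonal to a

print(round(cosine(a, b), 6))  # 1.0
print(cosine(a, c))            # 0.0
```

Vectors `a` and `b` differ greatly in length but point in the same direction, so their similarity is 1, while `a` and `c` share no non-zero dimension and score 0.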

In [17]:
# Find similarity between all items
cosine_similarity_array = cosine_similarity(df_tfidf)

In [18]:
# Find similarity between two items
cosine_similarity(df_tfidf.loc['Avatar'].values.reshape(1, -1),
                df_tfidf.loc['The Dark Knight Rises'].values.reshape(1, -1))

array([[0.02658925]])

In [19]:
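# Wrap the similarity matrix in a dataframe labeled by movie title on both axes
# (titles are the index of df_tfidf, not a column)
df_movie_similarities = pd.DataFrame(cosine_similarity_array,
                                    index=df_tfidf.index,
                                    columns=df_tfidf.index)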
df_movie_similarities.head()

title,Avatar,Pirates of the Caribbean: At World's End,Spectre,The Dark Knight Rises,John Carter,Spider-Man 3,Tangled,Avengers: Age of Ultron,Harry Potter and the Half-Blood Prince,Batman v Superman: Dawn of Justice,...,On The Downlow,Sanctuary: Quite a Conundrum,Bang,Primer,Cavite,El Mariachi,Newlyweds,"Signed, Sealed, Delivered",Shanghai Calling,My Date with Drew
Avatar,1.0,0.0,0.0,0.026589,0.0,0.033247,0.0,0.043233,0.0,0.0,...,0.0,0.0,0.032564,0.047873,0.0,0.0,0.0,0.0,0.0,0.0
Pirates of the Caribbean: At World's End,0.0,1.0,0.0,0.0,0.044533,0.0,0.0,0.028912,0.0,0.0,...,0.0,0.0,0.008528,0.0,0.0,0.0,0.0,0.027212,0.0,0.0
Spectre,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.034873,0.02587,0.0,...,0.035956,0.0,0.0,0.0,0.019912,0.0,0.0,0.016563,0.0,0.0
The Dark Knight Rises,0.026589,0.0,0.0,1.0,0.011808,0.005296,0.01355,0.029149,0.020651,0.149236,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036173,0.044581,0.024835
John Carter,0.0,0.044533,0.0,0.011808,1.0,0.0,0.011367,0.045785,0.0,0.021655,...,0.017951,0.0,0.0,0.0,0.0,0.0,0.0,0.007405,0.0,0.0


In [20]:
# Similar movies for The Matrix
movie = 'The Matrix'
df_movie_similarities[df_movie_similarities.index != movie][movie].sort_values(ascending=False).head()

title
Hackers                 0.190574
Pulse                   0.189060
Commando                0.177248
The Inhabited Island    0.150542
Transcendence           0.149113
Name: The Matrix, dtype: float64

In [21]:
# Similar movies for The Dark Knight Rises
# Because "Batman" is a fairly distinctive word, the other Batman movies rank highest.
movie = 'The Dark Knight Rises'
df_movie_similarities[df_movie_similarities.index != movie][movie].sort_values(ascending=False).head()

title
Batman Forever     0.343072
Batman Returns     0.311258
The Dark Knight    0.301516
Batman             0.279158
Slow Burn          0.186555
Name: The Dark Knight Rises, dtype: float64

### Conclusion

In this section, we generated recommendations using TF-IDF based scoring. The Batman example in particular shows how well this approach can work at recommending related movies.

## 3. TF-IDF Based Personalized Recommendation

Until now, we have generated non-personalized recommendations. In this section, personalized recommendations are generated for the first time, based on the movies that a user has watched. First, an average vector is computed from the user's watched movies. Then, the cosine similarity between the user vector and every movie is computed. Finally, the most similar movies are shown to the user as recommendations.
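
The three steps described above can be sketched on toy data as follows. The movie titles, the 3-dimensional vectors, and the helper `cosine` are all hypothetical stand-ins; the cells below apply the same idea to the rows of `df_movie_similarities` for a real user.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical item vectors (stand-ins for the per-movie vectors in the notebook)
item_vectors = {
    "Alien Invasion": [0.9, 0.1, 0.0],
    "Space Patrol": [0.8, 0.2, 0.0],
    "Star Drifter": [0.7, 0.0, 0.3],
    "Love in Paris": [0.0, 0.1, 0.9],
}
liked = ["Alien Invasion", "Space Patrol"]

# 1. Average the vectors of the liked movies to build the user vector
dims = len(item_vectors[liked[0]])
user_vec = [sum(item_vectors[t][i] for t in liked) / len(liked) for i in range(dims)]

# 2. Score every movie the user has not liked yet against the user vector
scores = {t: cosine(user_vec, v) for t, v in item_vectors.items() if t not in liked}

# 3. Rank the candidates from most to least similar
recommendations = sorted(scores, key=scores.get, reverse=True)
print(recommendations)  # ['Star Drifter', 'Love in Paris']
```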

In [22]:
# Loading ratings
df_tmdb_ratings = pd.read_csv('datasets/tmdb_ratings.csv')
df_tmdb_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [23]:
# Removing timestamp since it is not used.
df_tmdb_ratings.drop('timestamp', axis=1, inplace=True)

# Keeping only liked movies (rating >= 4), since we want to recommend movies the user will like.
df_tmdb_ratings = df_tmdb_ratings[df_tmdb_ratings.rating >= 4]


In [24]:
# Merging user ratings with movies
df_tmdb_all = pd.merge(df_tmdb_movies, df_tmdb_ratings, on='movieId', how='inner')
df_tmdb_all.head()

Unnamed: 0,movieId,title,overview,userId,rating
0,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",39,4.0
1,559,Spider-Man 3,The seemingly invincible Spider-Man goes up ag...,492,5.0
2,767,Harry Potter and the Half-Blood Prince,"As Harry begins his sixth year at Hogwarts, he...",30,4.0
3,58,Pirates of the Caribbean: Dead Man's Chest,Captain Jack Sparrow works his way out of a bl...,28,5.0
4,58,Pirates of the Caribbean: Dead Man's Chest,Captain Jack Sparrow works his way out of a bl...,36,5.0


In [25]:
# Getting liked movies of random user
user_liked_movies = df_tmdb_all[df_tmdb_all['userId'] == 668]['title'].values
user_liked_movies

array(['Terminator 3: Rise of the Machines', 'Men in Black II', 'Solaris',
       'The Talented Mr. Ripley'], dtype=object)

In [26]:
# Collecting the similarity vectors (rows of the similarity matrix) of the user's liked movies
user_movies_vec = df_movie_similarities[df_movie_similarities.index.isin(user_liked_movies)]
user_movies_vec.head()

title,Avatar,Pirates of the Caribbean: At World's End,Spectre,The Dark Knight Rises,John Carter,Spider-Man 3,Tangled,Avengers: Age of Ultron,Harry Potter and the Half-Blood Prince,Batman v Superman: Dawn of Justice,...,On The Downlow,Sanctuary: Quite a Conundrum,Bang,Primer,Cavite,El Mariachi,Newlyweds,"Signed, Sealed, Delivered",Shanghai Calling,My Date with Drew
Terminator 3: Rise of the Machines,0.0,0.026498,0.0,0.024672,0.016636,0.028675,0.006714,0.014255,0.0,0.027216,...,0.0,0.028045,0.008932,0.0,0.0,0.011153,0.0,0.0,0.010044,0.008443
Men in Black II,0.076051,0.040757,0.023209,0.005364,0.017106,0.015346,0.005163,0.054548,0.0,0.011615,...,0.0,0.0,0.0,0.021574,0.0,0.0,0.0,0.047639,0.007888,0.006493
Solaris,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02449,0.0,0.0,0.0,0.0,0.0,0.018744,0.0
The Talented Mr. Ripley,0.0,0.0,0.0,0.0,0.0,0.014192,0.0,0.009695,0.0,0.0,...,0.013798,0.0,0.020226,0.025314,0.008781,0.0,0.0,0.008685,0.0,0.0


In [27]:
# Generating user vector
user_vec = user_movies_vec.mean()
user_vec.values.reshape(1, -1)

array([[0.01901282, 0.01681385, 0.00580236, ..., 0.01408105, 0.00916909,
        0.00373392]])

In [28]:
# Checking similarities between user vector and movies
user_movie_similarities = cosine_similarity(user_vec.values.reshape(1, -1), df_movie_similarities)

# Generating dataframe of recommendations to user
df_user_movie_similarities = pd.DataFrame(user_movie_similarities.T,
                                          index=df_movie_similarities.index,
                                          columns=['similarity_score'])

# Dropping the movies the user has already liked
df_user_movie_similarities.drop(user_liked_movies, axis=0, inplace=True)

# Visualizing the top recommendations
df_user_movie_similarities.sort_values(by='similarity_score', ascending=False).head(10)


title,similarity_score
My Stepmother is an Alien,0.434256
Aliens,0.42107
Elysium,0.419575
Ripley's Game,0.41746
Space Battleship Yamato,0.415428
Men in Black,0.410393
Ponyo,0.408899
After Earth,0.407536
Hercules,0.405676
Sunshine,0.397256


### Conclusion

In this section, we filtered out the movies the user did not like. Afterwards, we generated a user vector and computed the similarity between the user vector and every movie. Finally, we achieved personalized recommendation by producing a ranked movie list for the user. This list could be further improved by also including genre-based similarity.