## Sample project: recommender system

This project is a proof of concept for a recommender system.
The data is the public MovieLens dataset of anonymous users' movie ratings.

We will use it to recommend movies to users based on the genres of the movies they like.




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## The data consists of three parts: users, movies, and ratings

In [2]:
ratings  = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1')
users    = pd.read_csv('users.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])
movies   = pd.read_csv('movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

## Let's take a look inside movies

In [3]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


## Now we start processing the data
### Join the parts together


In [4]:
dataset = pd.merge(pd.merge(movies, ratings), users)
dataset.head()

Unnamed: 0.1,movie_id,title,genres,Unnamed: 0,user_id,rating,timestamp,user_emb_id,movie_emb_id,gender,zipcode,age_desc,occ_desc
0,1,Toy Story (1995),Animation|Children's|Comedy,40,1,5,978824268,0,0,F,48067,Under 18,K-12 student
1,48,Pocahontas (1995),Animation|Children's|Musical|Romance,25,1,5,978824351,0,47,F,48067,Under 18,K-12 student
2,150,Apollo 13 (1995),Drama,39,1,5,978301777,0,149,F,48067,Under 18,K-12 student
3,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi,44,1,4,978300760,0,259,F,48067,Under 18,K-12 student
4,527,Schindler's List (1993),Drama|War,23,1,5,978824195,0,526,F,48067,Under 18,K-12 student


### Genres (Movie categories)

For example, if a user picks Toy Story, we can infer that the user likes "Animation", "Children's" and "Comedy" movies, because those are the genres Toy Story has in the data.

So we will recommend that user other movies that have the "Animation", "Children's" and "Comedy" genres, or at least one of them.

Make sense?
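To make the idea concrete, here is a minimal sketch of "share at least one genre" matching. It uses a tiny hand-made dictionary (`toy_genre_sets`, a hypothetical name, with rows copied from the table above) rather than the real dataset, and the helper names are ours, not part of any library.

```python
# Hypothetical toy data: titles and genre sets copied from the movies table above.
toy_genre_sets = {
    "Toy Story (1995)": {"Animation", "Children's", "Comedy"},
    "Jumanji (1995)": {"Adventure", "Children's", "Fantasy"},
    "Waiting to Exhale (1995)": {"Comedy", "Drama"},
}

def shared_genres(title_a, title_b, genres=toy_genre_sets):
    """Return the set of genres that two movies have in common."""
    return genres[title_a] & genres[title_b]

def candidates(title, genres=toy_genre_sets):
    """List the other titles sharing at least one genre, most shared genres first."""
    scored = [(other, len(shared_genres(title, other, genres)))
              for other in genres if other != title]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [other for other, n_shared in ranked if n_shared > 0]

print(candidates("Toy Story (1995)"))
```

Counting shared genres is a crude score: every genre weighs the same, however common it is. The TF-IDF step later in this notebook addresses that.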

In [5]:
genre_labels = set()
for s in movies['genres'].str.split('|').values:
    genre_labels = genre_labels.union(set(s))

def count_word(dataset, ref_col, census):
    keyword_count = {}
    for s in census: 
        keyword_count[s] = 0
    for census_keywords in dataset[ref_col].str.split('|'):        
        if type(census_keywords) == float and pd.isnull(census_keywords): 
            continue        
        for s in [s for s in census_keywords if s in census]: 
            if pd.notnull(s): 
                keyword_count[s] += 1
    keyword_occurences = []
    for k,v in keyword_count.items():
        keyword_occurences.append([k,v])
    keyword_occurences.sort(key = lambda x:x[1], reverse = True)
    return keyword_occurences, keyword_count

keyword_occurences, dum = count_word(movies, 'genres', genre_labels)
print(keyword_occurences[:5])
print(len(keyword_occurences))

[['Drama', 1603], ['Comedy', 1200], ['Action', 503], ['Thriller', 492], ['Romance', 471]]
18


The code above shows that there are 18 genres in the data, and the genre with the most movies is Drama (what a surprise!).
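As an aside, the same kind of count can be sketched more compactly with pandas alone. The frame below (`toy_movies`, a hypothetical name) is a three-row stand-in copied from the `movies.head()` output, so the numbers are not those of the full dataset.

```python
import pandas as pd

# Toy stand-in for the `movies` frame (rows copied from the movies.head() output).
toy_movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               "Comedy|Romance"],
})

genre_counts = (
    toy_movies["genres"]
    .str.split("|")    # "A|B" -> ["A", "B"]
    .explode()         # one row per (movie, genre) pair
    .value_counts()    # number of movies per genre, most frequent first
)
print(genre_counts)
```

This assumes a pandas version that provides `Series.explode` (0.25 or later).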

In [6]:
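# Split "A|B|C" into a list of genres, then cast each list back to its string
# form (e.g. "['Comedy', 'Romance']") so TfidfVectorizer can tokenize it as text.
movies['genres'] = movies['genres'].str.split('|')
movies['genres'] = movies['genres'].fillna("").astype('str')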


## Get everything into numbers.

Users don't always want exactly "Animation", "Children's" and "Comedy" when they pick Toy Story. Sometimes they want something that is, say, "nearly Comedy"...

So what is "nearly Comedy"? Based on the data we have, some movies share only part of Toy Story's genre list, like ["Animation", "Comedy", ...]. Maybe they will fit too.

So we measure the similarity between movies based on how often each genre term appears, both within a movie and across the whole catalogue. Using an algorithm called TF-IDF, we can transform each movie into a vector (like a point on a map).
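Here is a small sketch of what that transformation looks like, assuming scikit-learn's default `TfidfVectorizer` settings and three made-up genre strings (`toy_docs` is a hypothetical name; "Children's" is simplified to "Children" to keep the tokens obvious).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy "documents": one genre string per movie.
toy_docs = [
    "Animation Children Comedy",
    "Adventure Children Fantasy",
    "Comedy Romance",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs).toarray()

# One column per distinct (lowercased) genre term, one row per movie.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(toy_vectors.shape)

# Within the first movie, the rarer term gets the larger weight.
first = dict(zip(vocabulary, toy_vectors[0]))
print(first["animation"] > first["children"])

# Each row has unit length (L2 norm) by default.
print(np.linalg.norm(toy_vectors, axis=1))
```

A genre that appears in few movies ("animation" here) ends up with a larger weight than one that appears in many ("children"), which is what lets rare genres count for more when comparing movies.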

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

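# Unigrams and bigrams over the genre text; min_df=1 keeps every term
# (newer scikit-learn versions reject an integer min_df of 0).
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])

# Pairwise similarity score between every pair of movies
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)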

We use [Cosine Similarity](https://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html) to measure the "similarity" of two movies.
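The cell above calls `linear_kernel` (a plain dot product) rather than a cosine function. The sketch below, on made-up genre strings (`sample_docs` is a hypothetical name), checks that the two agree. They do because `TfidfVectorizer` scales every vector to unit length by default, so the dot product is already the cosine of the angle between two vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy genre strings, one per movie.
sample_docs = ["Comedy Romance", "Comedy Romance", "Comedy Drama", "Action Thriller"]
sample_matrix = TfidfVectorizer().fit_transform(sample_docs)

dot = linear_kernel(sample_matrix, sample_matrix)      # plain dot products
cos = cosine_similarity(sample_matrix, sample_matrix)  # explicit cosine similarity

print(np.allclose(dot, cos))   # the two matrices match
print(dot.round(2))
```

Movies with identical genres score 1, movies with no genre in common score 0, and partial overlaps fall in between.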

We now have a pairwise cosine similarity matrix for all the movies in the dataset. The next step is to write a function that returns the 20 most similar movies based on the cosine similarity score.

In [8]:
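titles = movies['title']
# Map each title to its row position in `movies`
indices = pd.Series(movies.index, index=movies['title'])

def recommendations(title):
    """Return the titles of the 20 movies most similar to `title`."""
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Drop the query movie explicitly: with ties at the top score it is not always first
    sim_scores = [s for s in sim_scores if s[0] != idx][:20]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]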

## Let's test whether we can recommend movies to a user

For example, we pick a movie (here, the third row of `movies`) and get the recommendations for it.

In [10]:
movie = movies['title'][2]

print(movie)
print(movies['genres'][2])
print(recommendations(movie))


Grumpier Old Men (1995)
['Comedy', 'Romance']
6                          Sabrina (1995)
38                        Clueless (1995)
63                   Two if by Sea (1996)
67     French Twist (Gazon maudit) (1995)
91             Vampire in Brooklyn (1995)
116                   If Lucy Fell (1996)
120                      Boomerang (1992)
127                 Pie in the Sky (1995)
233                    French Kiss (1995)
234                   Forget Paris (1995)
249                           I.Q. (1994)
273                     Milk Money (1994)
284             Nina Takes a Lover (1994)
286                       Only You (1994)
291              Perez Family, The (1995)
292     Pyromaniac's Love Story, A (1995)
335        While You Were Sleeping (1995)
338               Muriel's Wedding (1994)
353    Four Weddings and a Funeral (1994)
374                     Speechless (1994)
Name: title, dtype: object


All right! We can see the movie's title, its genres, and the top 20 movies we can recommend to the user.

