# CHALLENGE-FELLOWSHIP.AI

## Objective
To build a movie recommendation engine.
## Introduction
We build a movie recommendation engine using the MovieLens "ml-1m" dataset. The dataset consists of users' ratings of movies, genre tags for the movies, user attributes such as occupation, and a Unix timestamp for each rating.



In [44]:
# Import the required libraries
import pandas as pd
import numpy as np
import graphlab
from sklearn.model_selection import train_test_split

# MovieLens Data Files
The data is presented in three files:
1. Users file: contains information about the users. Each user is indexed by an allotted user id, and the associated data consists of sex, age, occupation and zip code.
2. Movies file: contains information about the movies, with a movie id, a title and the corresponding genres.
3. Ratings file: contains the ratings awarded to each movie by the users.

In [45]:
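# Column names for the '::'-separated movies file
movie_data= ['movie id', 'movie title','genre']
# Raw string so the backslashes in the Windows path are never read as escapes
movies=pd.read_csv(r'C:\Users\Swapnil\.jupyter\ml-1m\movies.dat',sep='::',names=movie_data,engine='python')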
print movies.genre.head()
    

0     Animation|Children's|Comedy
1    Adventure|Children's|Fantasy
2                  Comedy|Romance
3                    Comedy|Drama
4                          Comedy
Name: genre, dtype: object


We split the genre column and convert the strings to boolean columns, one per genre, where a value is True if the movie belongs to that genre and False otherwise.

In [46]:
genres_unique = pd.DataFrame(movies.genre.str.split('|').tolist()).stack().unique()
genres_unique = pd.DataFrame(genres_unique, columns=['genre']) # Format into DataFrame to store later
movies = movies.join(movies.genre.str.get_dummies().astype(bool))
movies.drop('genre', inplace=True, axis=1)
print movies.head()

   movie id                         movie title Action Adventure Animation  \
0         1                    Toy Story (1995)  False     False      True   
1         2                      Jumanji (1995)  False      True     False   
2         3             Grumpier Old Men (1995)  False     False     False   
3         4            Waiting to Exhale (1995)  False     False     False   
4         5  Father of the Bride Part II (1995)  False     False     False   

  Children's Comedy  Crime Documentary  Drama Fantasy Film-Noir Horror  \
0       True   True  False       False  False   False     False  False   
1       True  False  False       False  False    True     False  False   
2      False   True  False       False  False   False     False  False   
3      False   True  False       False   True   False     False  False   
4      False   True  False       False  False   False     False  False   

  Musical Mystery Romance Sci-Fi Thriller    War Western  
0   False   False   False  
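To make the transformation itself easier to follow, here is a minimal sketch of the same idea in plain Python, without pandas. The three genre strings are copied from the output above; the names `genre_strings`, `all_genres` and `indicator_rows` are ours and do not appear in the notebook.

```python
# Toy genre strings in the same "A|B|C" format as the movies file.
genre_strings = [
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
]

# Collect every distinct genre, sorted so the column order is stable.
all_genres = sorted({g for s in genre_strings for g in s.split("|")})

# One dict per movie: True where the movie carries that genre, False otherwise.
indicator_rows = [
    {genre: genre in s.split("|") for genre in all_genres}
    for s in genre_strings
]

print(all_genres)
print(indicator_rows[2]["Romance"])  # the third movie is tagged Romance -> True
```

Each dict plays the role of one row of boolean genre columns in the `movies` frame.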

In [47]:
# Load the users file.
user_data = ['user_id', 'sex', 'age', 'occupation', 'zip_code']
users = pd.read_csv(r'C:\Users\Swapnil\.jupyter\ml-1m\users.dat', sep='::', names=user_data,engine='python',
                    encoding='latin-1')
print users.head()
# Load the ratings file.
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(r'C:\Users\Swapnil\.jupyter\ml-1m\ratings.dat', sep='::', names=r_cols,engine='python',
                      encoding='latin-1')
print ratings.head()

   user_id sex  age  occupation zip_code
0        1   F    1          10    48067
1        2   M   56          16    70072
2        3   M   25          15    55117
3        4   M   45           7    02460
4        5   M   25          20    55455
   user_id  movie_id  rating  unix_timestamp
0        1      1193       5       978300760
1        1       661       3       978302109
2        1       914       3       978301968
3        1      3408       4       978300275
4        1      2355       5       978824291


## Split Data
To check the accuracy of our predictor model, we need data to test on. Since we already have rating data, we can train the model on one part of it and test it on the remaining part. Here we hold out 20% of the ratings for testing.

In [48]:
X_train, X_test = train_test_split(ratings,test_size=0.2)
print X_train.head()
print X_test.head()

        user_id  movie_id  rating  unix_timestamp
505111     3108       457       5       969494084
208453     1273       541       4       974814174
654055     3942      2662       2       965694189
810615     4864      2375       3       962819749
256320     1564      3283       3       974739166
        user_id  movie_id  rating  unix_timestamp
706385     4234       969       4       965332319
620555     3760      3016       3       966093953
754978     4497      2728       4       964998985
706043     4229      1885       5       965312427
438496     2679      1242       5       973391738
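Because `train_test_split` is called without a `random_state`, every run draws a different split, so the rows printed above will not repeat. As a rough sketch of what a reproducible hold-out split looks like, here is a plain-Python version with a fixed seed. The helper `holdout_split` and the toy data are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed and split off a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Toy (user_id, movie_id, rating) triples: 5 users x 4 movies = 20 rows.
toy_ratings = [(user, movie, 3) for user in range(1, 6) for movie in range(1, 5)]
train_rows, test_rows = holdout_split(toy_ratings, test_fraction=0.2, seed=42)

print(len(train_rows))  # 16
print(len(test_rows))   # 4
```

With scikit-learn, passing a fixed integer as `random_state` to `train_test_split` serves the same purpose.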


## Popularity Recommender
This model recommends movies based on the overall popularity of each film, so it does not take an individual user's tastes into account.

In [49]:
train_data = graphlab.SFrame(X_train)
test_data = graphlab.SFrame(X_test)
user_data=graphlab.SFrame(users)
popularity_model = graphlab.popularity_recommender.create(train_data, user_id='user_id', item_id='movie_id',
                                                          user_data=user_data,target='rating')

Let's generate recommendations for the first few users.

In [50]:
popularity_recomm = popularity_model.recommend(users=range(1,10),k=5)
popularity_recomm.print_rows(num_rows=40)
popularity_recomm.show()

+---------+----------+-------+------+
| user_id | movie_id | score | rank |
+---------+----------+-------+------+
|    1    |   3881   |  5.0  |  1   |
|    1    |   787    |  5.0  |  2   |
|    1    |   853    |  5.0  |  3   |
|    1    |   3233   |  5.0  |  4   |
|    1    |   989    |  5.0  |  5   |
|    2    |   3881   |  5.0  |  1   |
|    2    |   787    |  5.0  |  2   |
|    2    |   853    |  5.0  |  3   |
|    2    |   3233   |  5.0  |  4   |
|    2    |   989    |  5.0  |  5   |
|    3    |   3881   |  5.0  |  1   |
|    3    |   787    |  5.0  |  2   |
|    3    |   853    |  5.0  |  3   |
|    3    |   3233   |  5.0  |  4   |
|    3    |   989    |  5.0  |  5   |
|    4    |   3881   |  5.0  |  1   |
|    4    |   787    |  5.0  |  2   |
|    4    |   853    |  5.0  |  3   |
|    4    |   3233   |  5.0  |  4   |
|    4    |   989    |  5.0  |  5   |
|    5    |   3881   |  5.0  |  1   |
|    5    |   787    |  5.0  |  2   |
|    5    |   853    |  5.0  |  3   |
|    5    | 

We can see that all the users get the same recommendations. This happens because the popularity model is not personalized, unlike a user-based approach such as collaborative filtering. Since the most popular movies are the same for everyone, every user is recommended the same movies.
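The effect can be reproduced with a small sketch. It assumes that popularity is scored as the mean rating per movie, which is consistent with the 5.0 scores above but is our assumption about the library's internals. The toy ratings and the helper names are made up for illustration.

```python
from collections import defaultdict

# Toy (user_id, movie_id, rating) triples; the values are invented.
toy_ratings = [
    (1, "A", 4), (1, "B", 3),
    (2, "A", 5), (2, "B", 2),
    (3, "A", 4),
    (8, "X", 5),  # rated once, with a perfect score
    (9, "Y", 5),  # rated once, with a perfect score
]

def mean_rating_per_movie(ratings):
    """Return {movie_id: mean rating} over all users."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, movie, rating in ratings:
        totals[movie] += rating
        counts[movie] += 1
    return {m: totals[m] / counts[m] for m in totals}

def recommend_popular(ratings, user, k=2):
    """Top-k movies by mean rating. The score ignores the user entirely;
    the user id is only used to drop movies this user has already rated."""
    scores = mean_rating_per_movie(ratings)
    seen = {m for u, m, _ in ratings if u == user}
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return [m for m in ranked if m not in seen][:k]

for user in (1, 2, 3):
    print(recommend_popular(toy_ratings, user))  # ['X', 'Y'] each time
```

In this toy data, movies with a single perfect rating outrank a widely rated movie, and users 1 to 3 all receive the same list because the ranking does not depend on who is asking.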

## Collaborative Filtering
A user's recommendations are based on the preferences of similar users, on the assumption that users whose profiles match will also share interests.

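To measure similarity, we use cosine similarity. The model used here is item-based: each item is represented as a vector of ratings, and the similarity between two items is the cosine of the angle between their vectors. The smaller the angle, the more similar the two items are.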
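As a small worked example of the measure itself, the sketch below computes cosine similarity between toy rating vectors in plain Python. Treating an unrated movie as 0 is a simplifying assumption of this sketch, and the vectors are invented; how the library handles missing ratings internally is not covered here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

# Each vector holds one movie's ratings from the same four users (0 = not rated).
movie_a = [5, 4, 0, 1]
movie_b = [4, 5, 0, 1]
movie_c = [0, 1, 5, 4]

sim_ab = cosine_similarity(movie_a, movie_b)  # 41/42, about 0.976
sim_ac = cosine_similarity(movie_a, movie_c)  # 8/42, about 0.190
print(sim_ab > sim_ac)  # True
```

Movies A and B are rated highly by the same users, so their vectors point in nearly the same direction, while movie C is liked by a different set of users.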

In [51]:
item_sim_model = graphlab.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating',
                                                             user_data=user_data,similarity_type='cosine')


## Evaluation

In [53]:
model_performance = graphlab.compare(test_data, [popularity_model, item_sim_model])
graphlab.show_comparison(model_performance,[popularity_model, item_sim_model])

PROGRESS: Evaluate model M0



Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    |        0.0        |        0.0        |
|   3    |        0.0        |        0.0        |
|   4    | 4.14387535223e-05 | 1.97327397725e-06 |
|   5    | 3.31510028178e-05 | 1.97327397725e-06 |
|   6    | 2.76258356815e-05 | 1.97327397725e-06 |
|   7    |  2.3679287727e-05 | 1.97327397725e-06 |
|   8    | 2.07193767611e-05 | 1.97327397725e-06 |
|   9    | 3.68344475754e-05 | 4.93318494313e-06 |
|   10   | 4.97265042268e-05 | 7.89309590901e-06 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1



Precision and recall summary statistics by cutoff
+--------+----------------+-----------------+
| cutoff | mean_precision |   mean_recall   |
+--------+----------------+-----------------+
|   1    | 0.436764462125 |  0.021680102215 |
|   2    | 0.398226421349 | 0.0378754235162 |
|   3    | 0.374772086856 | 0.0519115989812 |
|   4    | 0.356456157799 | 0.0640619711407 |
|   5    | 0.341720537046 | 0.0755338420805 |
|   6    | 0.328968451296 | 0.0861835980627 |
|   7    | 0.318060192749 | 0.0956787772369 |
|   8    | 0.308615116857 |  0.105479404532 |
|   9    | 0.298653700941 |  0.113758105904 |
|   10   | 0.290502237693 |  0.121921196153 |
+--------+----------------+-----------------+
[10 rows x 3 columns]

Model compare metric: precision_recall
Canvas is updated and available in a tab in the default browser.
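The tables above report mean precision and mean recall at each cutoff. As a hedged sketch of what these numbers mean for a single user, the function below uses the usual definitions: precision at k is the share of the top-k recommendations that are relevant, and recall at k is the share of the relevant items that appear in the top k. The helper name and toy data are hypothetical, and how the library decides which test items count as relevant is not shown in this notebook.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall for one user's ranked list at cutoff k."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

recommended = [10, 20, 30, 40, 50]   # ranked recommendations for one user
relevant = {20, 50, 60, 70}          # movies this user rated in the test set

for k in (1, 2, 5):
    print(precision_recall_at_k(recommended, relevant, k))
# (0.0, 0.0)
# (0.5, 0.25)
# (0.4, 0.5)
```

Averaging these per-user values over all test users gives one row of a table like the ones above.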


# Citation
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History
and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4,
Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872