# Project: Movie Pairing Recommender System

## Data Collection and Preprocessing

### Downloading the MovieLens and IMDb datasets

In [170]:
import pandas as pd
import os
import re

#### Download the MovieLens dataset and merge its files

In [171]:
if not os.path.exists('ml-1m'):
    !wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
    !unzip ml-1m.zip
    !rm ml-1m.zip

In [172]:
movies_MovieLens = pd.read_csv('ml-1m/movies.dat', sep='::', engine='python', header=None, names=['movieId', 'title', 'genres'], encoding='ISO-8859-1')
ratings_MovieLens = pd.read_csv('ml-1m/ratings.dat', sep='::', engine='python', header=None, names=['userId', 'movieId', 'rating', 'timestamp'], encoding='ISO-8859-1')
print("The movies MovieLens dataset:")
print(movies_MovieLens.head())
print(f"The number of rows of the movies Movielens dataset is {movies_MovieLens.shape[0]}")
print("The ratings MovieLens dataset:")
print(ratings_MovieLens.head())
print(f"The number of rows of the ratings Movielens dataset is {ratings_MovieLens.shape[0]}")

The movies MovieLens dataset:
   movieId                               title                        genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy
The number of rows of the movies Movielens dataset is 3883
The ratings MovieLens dataset:
   userId  movieId  rating  timestamp
0       1     1193       5  978300760
1       1      661       3  978302109
2       1      914       3  978301968
3       1     3408       4  978300275
4       1     2355       5  978824291
The number of rows of the ratings Movielens dataset is 1000209


In [173]:
merge_MovieLens = pd.merge(ratings_MovieLens, movies_MovieLens, on='movieId')
print("The merge MovieLens dataset:")
print(merge_MovieLens.head())
print(f"The number of rows of the merge Movielens dataset is {merge_MovieLens.shape[0]}")

The merge MovieLens dataset:
   userId  movieId  rating  timestamp                                   title  \
0       1     1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)   
1       2     1193       5  978298413  One Flew Over the Cuckoo's Nest (1975)   
2      12     1193       4  978220179  One Flew Over the Cuckoo's Nest (1975)   
3      15     1193       4  978199279  One Flew Over the Cuckoo's Nest (1975)   
4      17     1193       5  978158471  One Flew Over the Cuckoo's Nest (1975)   

  genres  
0  Drama  
1  Drama  
2  Drama  
3  Drama  
4  Drama  
The number of rows of the merge Movielens dataset is 1000209


In [174]:
# delete the ratings and movies MovieLens datasets to make some memory space
del movies_MovieLens
del ratings_MovieLens

In [175]:
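# pattern for the release year in parentheses, e.g. "(1995)"
year_regex = re.compile(r'\((\d{4})\)')
# expand=False returns a Series, so the result can be cast to int and assigned directly
merge_MovieLens['year'] = merge_MovieLens['title'].str.extract(year_regex.pattern, expand=False).astype(int)
# remove the year from the title and trim the leftover whitespace
merge_MovieLens['title'] = merge_MovieLens['title'].apply(lambda x: year_regex.sub("", x).strip())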
print("The merge MovieLens dataset with the column year:")
print(merge_MovieLens.head())
print(f"The number of rows of the merge Movielens dataset is {merge_MovieLens.shape[0]}")

The merge MovieLens dataset with the column year:
   userId  movieId  rating  timestamp                            title genres  \
0       1     1193       5  978300760  One Flew Over the Cuckoo's Nest  Drama   
1       2     1193       5  978298413  One Flew Over the Cuckoo's Nest  Drama   
2      12     1193       4  978220179  One Flew Over the Cuckoo's Nest  Drama   
3      15     1193       4  978199279  One Flew Over the Cuckoo's Nest  Drama   
4      17     1193       5  978158471  One Flew Over the Cuckoo's Nest  Drama   

   year  
0  1975  
1  1975  
2  1975  
3  1975  
4  1975  
The number of rows of the merge Movielens dataset is 1000209
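The year extraction can be checked on a few hand-written titles. The sketch below is a minimal example and not the notebook's cell: the sample titles and the `split_title_year` helper are hypothetical, and the pattern is anchored to the end of the title (an assumption) so that a four-digit parenthetical elsewhere in a title would not be taken for the release year.

```python
import pandas as pd

# Hypothetical sample in the MovieLens "Title (YYYY)" format.
sample = pd.DataFrame({'title': ['Toy Story (1995)',
                                 'Jumanji (1995)',
                                 "One Flew Over the Cuckoo's Nest (1975)"]})

def split_title_year(df):
    """Return a copy with the trailing '(YYYY)' moved into an int 'year' column."""
    out = df.copy()
    # expand=False returns a Series; '$' restricts the match to the end of the title.
    out['year'] = out['title'].str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)
    # Remove the trailing year together with the whitespace in front of it.
    out['title'] = out['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)
    return out

sample_clean = split_title_year(sample)
print(sample_clean)
```

If a title had no trailing year, `astype(int)` would raise on the resulting missing value, which makes such rows visible instead of silently mislabelling them.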


##### Feature Engineering

Remove the timestamp column

In [176]:
merge_MovieLens.drop('timestamp', axis=1, inplace=True)
print("The merge MovieLens dataset without the column timestamp:")
print(merge_MovieLens.head())
print(f"The number of rows of the merge Movielens dataset is {merge_MovieLens.shape[0]}")

The merge MovieLens dataset without the column timestamp:
   userId  movieId  rating                            title genres  year
0       1     1193       5  One Flew Over the Cuckoo's Nest  Drama  1975
1       2     1193       5  One Flew Over the Cuckoo's Nest  Drama  1975
2      12     1193       4  One Flew Over the Cuckoo's Nest  Drama  1975
3      15     1193       4  One Flew Over the Cuckoo's Nest  Drama  1975
4      17     1193       5  One Flew Over the Cuckoo's Nest  Drama  1975
The number of rows of the merge Movielens dataset is 1000209


Convert the genres from a '|'-separated string into a sorted list of strings

In [177]:
merge_MovieLens['genres'] = merge_MovieLens['genres'].str.split('|')
merge_MovieLens['genres'] = merge_MovieLens['genres'].apply(lambda x: sorted(x))
print(merge_MovieLens.head())

   userId  movieId  rating                            title   genres  year
0       1     1193       5  One Flew Over the Cuckoo's Nest  [Drama]  1975
1       2     1193       5  One Flew Over the Cuckoo's Nest  [Drama]  1975
2      12     1193       4  One Flew Over the Cuckoo's Nest  [Drama]  1975
3      15     1193       4  One Flew Over the Cuckoo's Nest  [Drama]  1975
4      17     1193       5  One Flew Over the Cuckoo's Nest  [Drama]  1975


#### Download the IMDb datasets and merge them

In [178]:
if not os.path.exists('title.basics.tsv.gz'):
    !wget https://datasets.imdbws.com/title.basics.tsv.gz

In [179]:
if not os.path.exists('title.ratings.tsv.gz'):
    !wget https://datasets.imdbws.com/title.ratings.tsv.gz

In [180]:
if not os.path.exists('title.crew.tsv.gz'):
    !wget https://datasets.imdbws.com/title.crew.tsv.gz

Load the movies IMDb dataset with only the columns tconst, titleType, primaryTitle, startYear, and genres

In [181]:
only_columns = ['tconst', 'titleType', 'primaryTitle', 'startYear', 'genres']
movies_IMDb = pd.read_csv('title.basics.tsv.gz', sep='\t', usecols=only_columns,compression='gzip')
print("The movies IMDb dataset:")
print(movies_IMDb.head())
print(f"The number of rows of the movies IMDb dataset is {movies_IMDb.shape[0]}")

The movies IMDb dataset:
      tconst titleType            primaryTitle startYear  \
0  tt0000001     short              Carmencita      1894   
1  tt0000002     short  Le clown et ses chiens      1892   
2  tt0000003     short          Pauvre Pierrot      1892   
3  tt0000004     short             Un bon bock      1892   
4  tt0000005     short        Blacksmith Scene      1893   

                     genres  
0         Documentary,Short  
1           Animation,Short  
2  Animation,Comedy,Romance  
3           Animation,Short  
4              Comedy,Short  
The number of rows of the movies IMDb dataset is 10904346


Keep only the rows whose titleType is movie, which removes series, shorts, and every other title type from the dataset

In [182]:
movies_IMDb = movies_IMDb[movies_IMDb['titleType'] == 'movie']
print("The movies IMDb dataset:")
print(movies_IMDb.head())
print(f"The number of rows of the movies IMDb dataset is {movies_IMDb.shape[0]}")

The movies IMDb dataset:
        tconst titleType                   primaryTitle startYear  \
8    tt0000009     movie                     Miss Jerry      1894   
144  tt0000147     movie  The Corbett-Fitzsimmons Fight      1897   
498  tt0000502     movie                       Bohemios      1905   
570  tt0000574     movie    The Story of the Kelly Gang      1906   
587  tt0000591     movie               The Prodigal Son      1907   

                         genres  
8                       Romance  
144      Documentary,News,Sport  
498                          \N  
570  Action,Adventure,Biography  
587                       Drama  
The number of rows of the movies IMDb dataset is 685353


Remove the titleType column to make the dataset lighter for the following merges

In [183]:
movies_IMDb.drop('titleType', axis=1, inplace=True)

Add the ratings to the movies IMDb dataset, loading only the columns tconst and averageRating

In [184]:
only_columns = ['tconst', 'averageRating']
movies_IMDb = pd.merge(movies_IMDb, pd.read_csv('title.ratings.tsv.gz', sep='\t', usecols=only_columns, compression='gzip'), on='tconst', how='left')
print("The movies IMDb dataset:")
print(movies_IMDb.head())
print(f"The number of rows of the movies IMDb dataset is {movies_IMDb.shape[0]}")

The movies IMDb dataset:
      tconst                   primaryTitle startYear  \
0  tt0000009                     Miss Jerry      1894   
1  tt0000147  The Corbett-Fitzsimmons Fight      1897   
2  tt0000502                       Bohemios      1905   
3  tt0000574    The Story of the Kelly Gang      1906   
4  tt0000591               The Prodigal Son      1907   

                       genres  averageRating  
0                     Romance            5.4  
1      Documentary,News,Sport            5.2  
2                          \N            4.4  
3  Action,Adventure,Biography            6.0  
4                       Drama            5.6  
The number of rows of the movies IMDb dataset is 685353


Add the crew to the movies IMDb dataset

In [185]:
movies_IMDb = pd.merge(movies_IMDb, pd.read_csv('title.crew.tsv.gz', sep='\t', compression='gzip'), on='tconst', how='left')
print("The movies IMDb dataset:")
print(movies_IMDb.head())
print(f"The number of rows of the movies IMDb dataset is {movies_IMDb.shape[0]}")

The movies IMDb dataset:
      tconst                   primaryTitle startYear  \
0  tt0000009                     Miss Jerry      1894   
1  tt0000147  The Corbett-Fitzsimmons Fight      1897   
2  tt0000502                       Bohemios      1905   
3  tt0000574    The Story of the Kelly Gang      1906   
4  tt0000591               The Prodigal Son      1907   

                       genres  averageRating  directors  \
0                     Romance            5.4  nm0085156   
1      Documentary,News,Sport            5.2  nm0714557   
2                          \N            4.4  nm0063413   
3  Action,Adventure,Biography            6.0  nm0846879   
4                       Drama            5.6  nm0141150   

                         writers  
0                      nm0085156  
1                             \N  
2  nm0063413,nm0657268,nm0675388  
3                      nm0846879  
4                      nm0141150  
The number of rows of the movies IMDb dataset is 685353


In [186]:
# rename the columns primaryTitle to title and startYear to year
movies_IMDb.rename(columns={'primaryTitle': 'title', 'startYear': 'year'}, inplace=True)
print("The movies IMDb dataset:")
print(movies_IMDb.head())
print(f"The number of rows of the movies IMDb dataset is {movies_IMDb.shape[0]}")

The movies IMDb dataset:
      tconst                          title  year                      genres  \
0  tt0000009                     Miss Jerry  1894                     Romance   
1  tt0000147  The Corbett-Fitzsimmons Fight  1897      Documentary,News,Sport   
2  tt0000502                       Bohemios  1905                          \N   
3  tt0000574    The Story of the Kelly Gang  1906  Action,Adventure,Biography   
4  tt0000591               The Prodigal Son  1907                       Drama   

   averageRating  directors                        writers  
0            5.4  nm0085156                      nm0085156  
1            5.2  nm0714557                             \N  
2            4.4  nm0063413  nm0063413,nm0657268,nm0675388  
3            6.0  nm0846879                      nm0846879  
4            5.6  nm0141150                      nm0141150  
The number of rows of the movies IMDb dataset is 685353


Convert the genres from a comma-separated string into a sorted list of strings, mapping the missing-value marker \N to an empty list

In [187]:
movies_IMDb['genres'] = movies_IMDb['genres'].str.split(',')
movies_IMDb['genres'] = movies_IMDb['genres'].apply(lambda x: [] if x == ['\\N'] else x)
movies_IMDb['genres'] = movies_IMDb['genres'].apply(lambda x: sorted(x))
print(movies_IMDb.head())
print(f"The number of rows of the movies IMDb dataset is {movies_IMDb.shape[0]}")

      tconst                          title  year  \
0  tt0000009                     Miss Jerry  1894   
1  tt0000147  The Corbett-Fitzsimmons Fight  1897   
2  tt0000502                       Bohemios  1905   
3  tt0000574    The Story of the Kelly Gang  1906   
4  tt0000591               The Prodigal Son  1907   

                           genres  averageRating  directors  \
0                       [Romance]            5.4  nm0085156   
1      [Documentary, News, Sport]            5.2  nm0714557   
2                              []            4.4  nm0063413   
3  [Action, Adventure, Biography]            6.0  nm0846879   
4                         [Drama]            5.6  nm0141150   

                         writers  
0                      nm0085156  
1                             \N  
2  nm0063413,nm0657268,nm0675388  
3                      nm0846879  
4                      nm0141150  
The number of rows of the movies IMDb dataset is 685353


Convert the writers from a comma-separated string into a list of strings, mapping the missing-value marker \N to an empty list

In [188]:
movies_IMDb['writers'] = movies_IMDb['writers'].str.split(',')
movies_IMDb['writers'] = movies_IMDb['writers'].apply(lambda x: [] if x == ['\\N'] else x)
print(movies_IMDb.head())

      tconst                          title  year  \
0  tt0000009                     Miss Jerry  1894   
1  tt0000147  The Corbett-Fitzsimmons Fight  1897   
2  tt0000502                       Bohemios  1905   
3  tt0000574    The Story of the Kelly Gang  1906   
4  tt0000591               The Prodigal Son  1907   

                           genres  averageRating  directors  \
0                       [Romance]            5.4  nm0085156   
1      [Documentary, News, Sport]            5.2  nm0714557   
2                              []            4.4  nm0063413   
3  [Action, Adventure, Biography]            6.0  nm0846879   
4                         [Drama]            5.6  nm0141150   

                             writers  
0                        [nm0085156]  
1                                 []  
2  [nm0063413, nm0657268, nm0675388]  
3                        [nm0846879]  
4                        [nm0141150]  


Convert the directors from a comma-separated string into a list of strings, mapping the missing-value marker \N to an empty list

In [189]:
movies_IMDb['directors'] = movies_IMDb['directors'].str.split(',')
movies_IMDb['directors'] = movies_IMDb['directors'].apply(lambda x: [] if x == ['\\N'] else x)
print(movies_IMDb.head())

      tconst                          title  year  \
0  tt0000009                     Miss Jerry  1894   
1  tt0000147  The Corbett-Fitzsimmons Fight  1897   
2  tt0000502                       Bohemios  1905   
3  tt0000574    The Story of the Kelly Gang  1906   
4  tt0000591               The Prodigal Son  1907   

                           genres  averageRating    directors  \
0                       [Romance]            5.4  [nm0085156]   
1      [Documentary, News, Sport]            5.2  [nm0714557]   
2                              []            4.4  [nm0063413]   
3  [Action, Adventure, Biography]            6.0  [nm0846879]   
4                         [Drama]            5.6  [nm0141150]   

                             writers  
0                        [nm0085156]  
1                                 []  
2  [nm0063413, nm0657268, nm0675388]  
3                        [nm0846879]  
4                        [nm0141150]  
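The genres, writers, and directors conversions all follow the same pattern: split a delimited string and turn the \N marker into an empty list. As a rough sketch, that pattern could be factored into one helper. The `split_to_list` function, its parameters, and the `demo` frame are hypothetical names introduced here; the helper also treats NaN as missing, on the assumption that the left merges may leave NaN where a tconst has no crew entry.

```python
import pandas as pd

def split_to_list(series, sep=',', missing='\\N', sort=False):
    """Split a delimited string column into lists; missing values become []."""
    def convert(value):
        # NaN (e.g. from a left merge) and the IMDb marker \N both mean "no data".
        if pd.isna(value) or value == missing:
            return []
        parts = value.split(sep)
        return sorted(parts) if sort else parts
    return series.apply(convert)

# Hypothetical rows mimicking the IMDb columns.
demo = pd.DataFrame({
    'genres': ['News,Documentary,Sport', '\\N', 'Drama'],
    'writers': ['nm0063413,nm0657268', '\\N', None],
})
demo['genres'] = split_to_list(demo['genres'], sort=True)
demo['writers'] = split_to_list(demo['writers'])
print(demo)
```

Sorting is left optional because only the genres lists are sorted in the cells above.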


In [190]:
# convert year to int; values that cannot be parsed (such as \N) become 0
movies_IMDb['year'] = pd.to_numeric(movies_IMDb['year'], errors='coerce').fillna(0).astype(int)
print(movies_IMDb.dtypes)

tconst            object
title             object
year               int64
genres            object
averageRating    float64
directors         object
writers           object
dtype: object


### Merge datasets to include movie ratings, genres, and metadata

In [191]:
merge_dataset = pd.merge(merge_MovieLens, movies_IMDb, on=['title', 'year'], how='inner')

In [192]:
print("The merge dataset:")
print(merge_dataset.head())
print(f"The number of rows of the merge dataset is {merge_dataset.shape[0]}")

The merge dataset:
   userId  movieId  rating                            title genres_x  year  \
0       1     1193       5  One Flew Over the Cuckoo's Nest  [Drama]  1975   
1       2     1193       5  One Flew Over the Cuckoo's Nest  [Drama]  1975   
2      12     1193       4  One Flew Over the Cuckoo's Nest  [Drama]  1975   
3      15     1193       4  One Flew Over the Cuckoo's Nest  [Drama]  1975   
4      17     1193       5  One Flew Over the Cuckoo's Nest  [Drama]  1975   

      tconst genres_y  averageRating    directors  \
0  tt0073486  [Drama]            8.7  [nm0001232]   
1  tt0073486  [Drama]            8.7  [nm0001232]   
2  tt0073486  [Drama]            8.7  [nm0001232]   
3  tt0073486  [Drama]            8.7  [nm0001232]   
4  tt0073486  [Drama]            8.7  [nm0001232]   

                                        writers  
0  [nm0369142, nm0325743, nm0450181, nm0913670]  
1  [nm0369142, nm0325743, nm0450181, nm0913670]  
2  [nm0369142, nm0325743, nm0450181, nm0913
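An inner merge on title and year silently drops every MovieLens row whose title is not spelled exactly as in IMDb. One likely cause is that MovieLens moves a leading article to the end of the title (for example "Usual Suspects, The"), while IMDb keeps it in front. The sketch below shows one way to normalise such titles and to count matches with the `indicator` option of `pd.merge`. The `move_trailing_article` helper, the toy frames, and the placeholder tconst values are hypothetical, and only English articles are handled.

```python
import pandas as pd

def move_trailing_article(title):
    """Turn 'Usual Suspects, The' into 'The Usual Suspects' (English articles only)."""
    for article in ('The', 'A', 'An'):
        suffix = ', ' + article
        if title.endswith(suffix):
            return article + ' ' + title[:-len(suffix)]
    return title

# Hypothetical miniature versions of the two tables.
left = pd.DataFrame({'title': ['Usual Suspects, The', 'Toy Story', 'Unknown Film'],
                     'year': [1995, 1995, 1990]})
right = pd.DataFrame({'title': ['The Usual Suspects', 'Toy Story'],
                      'year': [1995, 1995],
                      'tconst': ['tt_placeholder_1', 'tt_placeholder_2']})

left['title'] = left['title'].apply(move_trailing_article)
# indicator=True adds a '_merge' column that tells which rows found a partner.
check = pd.merge(left, right, on=['title', 'year'], how='left', indicator=True)
match_counts = check['_merge'].value_counts()
print(match_counts)
```

Rows marked left_only are the ones an inner merge would discard, so their count gives a quick estimate of how much rating data is lost in the merge.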

## Feature Engineering

### Combine genres from both datasets

In [197]:
# combine the genres columns (guarded so that re-running the cell after
# genres_x and genres_y have been dropped does not raise a KeyError)
def combine_genres(genres_x, genres_y):
    # sorted() keeps the combined list in a deterministic order
    return sorted(set(genres_x + genres_y))
if {'genres_x', 'genres_y'}.issubset(merge_dataset.columns):
    merge_dataset['genres'] = merge_dataset.apply(lambda row: combine_genres(row['genres_x'], row['genres_y']), axis=1)
    # remove the genres_x and genres_y columns
    merge_dataset.drop(['genres_x', 'genres_y'], axis=1, inplace=True)
print(merge_dataset.head())
print(f"The number of rows of the merge dataset is {merge_dataset.shape[0]}")

### Convert the rating column to float

In [198]:
merge_dataset['rating'] = merge_dataset['rating'].astype(float)
print(merge_dataset.dtypes)

userId             int64
movieId            int64
rating           float64
title             object
year               int64
tconst            object
averageRating    float64
directors         object
writers           object
genres            object
dtype: object


### Create a user-item interaction matrix for collaborative filtering

In [199]:
user_item_matrix = merge_dataset.pivot_table(index='userId', columns='movieId', values='rating')
print(user_item_matrix.head())

movieId  1     2     3     4     5     6     7     8     9     10    ...  \
userId                                                               ...   
1         5.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
2         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
3         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
4         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
5         NaN   NaN   NaN   NaN   NaN   2.0   NaN   NaN   NaN   NaN  ...   

movieId  3942  3943  3944  3945  3946  3947  3948  3949  3950  3951  
userId                                                               
1         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
2         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
3         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
4         NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
5         NaN   NaN   NaN   NaN   NaN   NaN   N
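Most cells of a user-item matrix are NaN because each user rates only a small share of the movies. The toy example below, with made-up ratings, shows what `pivot_table` produces and how the share of filled cells can be measured; the `ratings`, `matrix`, and `density` names are hypothetical.

```python
import pandas as pd

# Made-up ratings: three users, three movies, five ratings in total.
ratings = pd.DataFrame({'userId':  [1, 1, 2, 3, 3],
                        'movieId': [10, 20, 10, 20, 30],
                        'rating':  [5.0, 3.0, 4.0, 2.0, 1.0]})

# One row per user, one column per movie; unrated pairs become NaN.
# pivot_table averages duplicates (its default aggfunc is the mean).
matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Fraction of user-movie pairs that actually carry a rating.
density = matrix.notna().to_numpy().mean()
print(matrix)
print(f"Share of filled cells: {density:.2f}")
```

The same `density` computation can be applied to `user_item_matrix` to see how sparse the real data is.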

### Understand how to combine the ratings of two users to get data for the couple

movieId
1       5.0
2       NaN
3       NaN
4       NaN
5       NaN
       ... 
3947    NaN
3948    NaN
3949    NaN
3950    NaN
3951    NaN
Name: 1, Length: 2299, dtype: float64
movieId
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
        ..
3947   NaN
3948   NaN
3949   NaN
3950   NaN
3951   NaN
Name: 2, Length: 2299, dtype: float64
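The two rows above are the rating vectors of users 1 and 2. A possible way to merge such vectors into a single vector for the couple is sketched below with made-up ratings. The averaging and "least misery" (minimum) strategies are common choices in group recommendation and are assumptions here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Made-up rating vectors indexed by movieId (NaN = not rated).
user_1 = pd.Series([5.0, np.nan, 3.0, np.nan], index=[1, 2, 3, 4], name='user_1')
user_2 = pd.Series([4.0, 2.0, np.nan, np.nan], index=[1, 2, 3, 4], name='user_2')

pair = pd.concat([user_1, user_2], axis=1)

# Average strategy: mean of the available ratings (NaN values are skipped).
couple_mean = pair.mean(axis=1)
# Least-misery strategy: the couple's rating is that of the less happy partner.
couple_min = pair.min(axis=1)
# Movies that both partners have rated.
both_rated = pair.dropna().index.tolist()

print(pd.DataFrame({'mean': couple_mean, 'least_misery': couple_min}))
print("Rated by both:", both_rated)
```

A movie that neither user has rated stays NaN under both strategies, so it remains a candidate for prediction.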


## Model Development

### Implement a recommender system algorithm to predict the rating of a movie by a couple of users
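The notebook does not yet fix an algorithm for this step. As one possible starting point, the sketch below uses user-based collaborative filtering with cosine similarity on mean-centred ratings and then averages the two partners' predictions. The function names, the toy matrix, and the choice of averaging are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def predict_user_rating(matrix, user, movie):
    """Predict a rating as the user's mean plus a similarity-weighted deviation."""
    means = matrix.mean(axis=1)                       # per-user mean, NaN skipped
    centred = matrix.sub(means, axis=0).fillna(0.0)   # unrated cells count as 0
    target = centred.loc[user].to_numpy()
    others = centred.drop(index=user)
    other_values = others.to_numpy()
    # Cosine similarity between the target user and every other user.
    norms = np.linalg.norm(other_values, axis=1) * np.linalg.norm(target)
    sims = np.divide(other_values @ target, norms,
                     out=np.zeros(len(others)), where=norms > 0)
    # Only users who actually rated the movie may contribute.
    rated = matrix.loc[others.index, movie].notna().to_numpy()
    weights = np.where(rated, sims, 0.0)
    denom = np.abs(weights).sum()
    if denom == 0:
        return float(means.loc[user])                 # fall back to the user's mean
    deviation = centred.loc[others.index, movie].to_numpy()
    return float(means.loc[user] + (weights * deviation).sum() / denom)

def predict_couple_rating(matrix, user_a, user_b, movie):
    """Average the two individual predictions (one of several possible strategies)."""
    return (predict_user_rating(matrix, user_a, movie)
            + predict_user_rating(matrix, user_b, movie)) / 2

# Made-up user-item matrix: rows are users, columns are movies.
toy = pd.DataFrame([[5.0, 3.0, np.nan],
                    [4.0, 2.0, np.nan],
                    [1.0, 5.0, 4.0]],
                   index=[1, 2, 3], columns=[10, 20, 30])

pred_user_1 = predict_user_rating(toy, 1, 30)
pred_couple = predict_couple_rating(toy, 1, 2, 30)
print(pred_user_1, pred_couple)
```

Replacing the average in `predict_couple_rating` with `min` would give a least-misery variant.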

## Recommendation Algorithm

### Develop an algorithm to suggest one movie that might be liked by the couple of users
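Once each partner has a predicted score per movie, suggesting a single movie reduces to combining the two scores and taking the best candidate that neither partner has already seen. The sketch below assumes the predicted scores are available as two Series indexed by movieId; `recommend_for_couple`, its parameters, and the sample scores are hypothetical.

```python
import pandas as pd

def recommend_for_couple(scores_a, scores_b, seen_a, seen_b, strategy='mean'):
    """Return the movieId with the best combined score among unseen movies."""
    # Keep only movies for which both partners have a predicted score.
    combined = pd.concat([scores_a, scores_b], axis=1).dropna()
    # Exclude everything that at least one partner has already watched.
    combined = combined.drop(index=list(set(seen_a) | set(seen_b)), errors='ignore')
    if combined.empty:
        return None
    if strategy == 'least_misery':
        agg = combined.min(axis=1)
    else:
        agg = combined.mean(axis=1)
    return agg.idxmax()

# Made-up predicted scores indexed by movieId.
scores_a = pd.Series({10: 5.0, 20: 3.5, 30: 5.0})
scores_b = pd.Series({10: 2.5, 20: 3.6, 30: 5.0})

best_mean = recommend_for_couple(scores_a, scores_b, seen_a={30}, seen_b=set())
best_lm = recommend_for_couple(scores_a, scores_b, seen_a={30}, seen_b=set(),
                               strategy='least_misery')
print(best_mean, best_lm)
```

The two strategies can disagree: averaging favours a movie that one partner loves, while least misery favours a movie that neither partner dislikes.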

## Evaluation

### Split the data into training and testing sets
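A simple random hold-out split of the ratings could look like the sketch below. The 80/20 ratio, the seed, and the `split_ratings` helper are assumptions, and the approach requires a DataFrame with a unique index.

```python
import pandas as pd

def split_ratings(ratings, test_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the rating rows as the test set."""
    test = ratings.sample(frac=test_fraction, random_state=seed)
    # Every row that was not sampled stays in the training set.
    train = ratings.drop(index=test.index)
    return train, test

# Made-up ratings with a default unique index.
ratings = pd.DataFrame({'userId':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                        'movieId': [10, 20, 30, 10, 20, 40, 10, 30, 40, 50],
                        'rating':  [5.0, 3.0, 4.0, 4.0, 2.0, 5.0, 1.0, 3.0, 4.0, 2.0]})

train, test = split_ratings(ratings)
print(len(train), len(test))
```

With a purely random split, some users or movies may end up only in the test set; splitting per user is one way to avoid that if it becomes a problem.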

### Evaluate the model
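Rating predictions are commonly scored with the root mean squared error (RMSE) and the mean absolute error (MAE) between the held-out ratings and the predicted ones. The sketch below implements both with NumPy; the choice of metrics and the sample values are assumptions, as the notebook does not name a metric.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error; penalises large errors more strongly."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error; the average size of the error in rating points."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Made-up held-out ratings and predictions.
actual = [5, 3, 4]
predicted = [4, 3, 2]
print(rmse(actual, predicted), mae(actual, predicted))
```

Both functions accept any pair of equally long sequences, for example the rating column of a test set and the model's predictions for the same rows.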