## Data

The data can be downloaded from: https://www.kaggle.com/netflix-inc/netflix-prize-data/data

Data files:

- combined_data_1.txt
- combined_data_2.txt
- combined_data_3.txt
- combined_data_4.txt
- movie_titles.csv

<pre>  
The first line of each file [combined_data_1.txt, combined_data_2.txt, combined_data_3.txt, combined_data_4.txt] contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

CustomerID,Rating,Date

MovieIDs range from 1 to 17770 sequentially.
CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
Ratings are on a five star (integral) scale from 1 to 5.
Dates have the format YYYY-MM-DD.
</pre>
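The format described above can be read with a small streaming parser. The following is a minimal sketch, not the notebook's own loader: the function name `parse_ratings` is hypothetical, and it assumes that every line ending in a colon starts a new movie block (the same rule the merge cell later in this notebook applies).

```python
from typing import Iterable, Iterator, Tuple


def parse_ratings(lines: Iterable[str]) -> Iterator[Tuple[int, int, int, str]]:
    """Yield (movie_id, customer_id, rating, date) tuples from raw lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(":"):
            # Header line such as "1:"; all following ratings belong to it.
            movie_id = int(line[:-1])
            continue
        if movie_id is None:
            raise ValueError("rating line encountered before any movie header")
        customer_id, rating, date = line.split(",")
        yield movie_id, int(customer_id), int(rating), date


# The first three lines come from the example data point in this notebook;
# the block under "2:" is made up for illustration.
sample = [
    "1:",
    "1488844,3,2005-09-06",
    "822109,5,2005-05-13",
    "2:",
    "123,4,2005-01-01",
]
rows = list(parse_ratings(sample))
```

Because the function is a generator, it can be pointed at an open file handle and will process one line at a time without loading the whole file into memory.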


### Example Data Point

<pre>
1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26
823519,3,2004-05-03
893988,3,2005-11-17
124105,4,2004-08-05
1248029,3,2004-04-22
1842128,4,2004-05-09
2238063,3,2005-05-11
1503895,4,2005-05-19
2207774,5,2005-06-06
2590061,3,2004-08-12
2442,3,2004-04-14
543865,4,2004-05-28
1209119,4,2004-03-23
804919,4,2004-06-10
1086807,3,2004-12-28
1711859,4,2005-05-08
372233,5,2005-11-23
1080361,3,2005-03-28
1245640,3,2005-12-19
558634,4,2004-12-14
2165002,4,2004-04-06
1181550,3,2004-02-01
1227322,4,2004-02-06
427928,4,2004-02-26
814701,5,2005-09-29
808731,4,2005-10-31
662870,5,2005-08-24
337541,5,2005-03-23
786312,3,2004-11-16
1133214,4,2004-03-07
1537427,4,2004-03-29
1209954,5,2005-05-09
2381599,3,2005-09-12
525356,2,2004-07-11
1910569,4,2004-04-12
2263586,4,2004-08-20
2421815,2,2004-02-26
1009622,1,2005-01-19
1481961,2,2005-05-24
401047,4,2005-06-03
2179073,3,2004-08-29
1434636,3,2004-05-01
93986,5,2005-10-06
1308744,5,2005-10-29
2647871,4,2005-12-30
1905581,5,2005-08-16
2508819,3,2004-05-18
1578279,1,2005-05-19
1159695,4,2005-02-15
2588432,3,2005-03-31
2423091,3,2005-09-12
470232,4,2004-04-08
2148699,2,2004-06-05
1342007,3,2004-07-16
466135,4,2004-07-13
2472440,3,2005-08-13
1283744,3,2004-04-17
1927580,4,2004-11-08
716874,5,2005-05-06
4326,4,2005-10-29
</pre>

## Mapping the real world problem to a Machine Learning Problem

### Type of Machine Learning Problem

<pre>
For a given movie and user, we need to predict the rating that the user would give to the movie.
The given problem is a recommendation problem.
It can also be seen as a regression problem.
</pre>

### Performance metric

- Mean Absolute Percentage Error: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
- Root Mean Square Error: https://en.wikipedia.org/wiki/Root-mean-square_deviation
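
Both metrics are easy to compute by hand. The sketch below uses NumPy; the helper names `rmse` and `mape` and the toy numbers are assumptions made for illustration. MAPE divides by the true value, which is safe here only because ratings are never zero.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Toy ratings, made up for illustration.
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]

toy_rmse = rmse(actual, predicted)  # 0.75
toy_mape = mape(actual, predicted)  # 20.625
```

RMSE penalises large misses more heavily because the errors are squared before averaging, while MAPE weights each error relative to the true rating, so a miss of one star on a 2-star rating counts more than the same miss on a 5-star rating.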


### Objective and Constraints
- Minimize RMSE.
- Try to provide some interpretability.

In [None]:
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('nbagg')

import matplotlib.pyplot as plt
# plt.rcParams.update({'figure.max_open_warning': 0})

import seaborn as sns
sns.set_style('whitegrid')
import os
from scipy import sparse
from scipy.sparse import csr_matrix

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
import random

## Exploratory Data Analysis (EDA)

### Preprocessing

#### Converting / Merging whole data to required format: u_i, m_j, r_ij

In [None]:
start = datetime.now()
if not os.path.isfile('/content/drive/MyDrive/projects/netflix_movie_recommendation/data.csv'):
    # Create 'data.csv' before reading it.
    # Read all four Netflix rating files and store them in one big file ('data.csv').
    # We read each of the four files and append each rating to the combined file 'data.csv'.
    data = open('/content/drive/MyDrive/projects/netflix_movie_recommendation/data.csv', mode='w')

    row = list()
    files=['https://drive.google.com/open?id=1ILIaNy-Pi-O0cnhvfoR05HgzTqFPRKAu&usp=drive_copy/data_folder/combined_data_1.txt',
           'https://drive.google.com/open?id=1ILIaNy-Pi-O0cnhvfoR05HgzTqFPRKAu&usp=drive_copy/data_folder/combined_data_2.txt',
           'https://drive.google.com/open?id=1ILIaNy-Pi-O0cnhvfoR05HgzTqFPRKAu&usp=drive_copy/data_folder/combined_data_3.txt',
           'https://drive.google.com/open?id=1ILIaNy-Pi-O0cnhvfoR05HgzTqFPRKAu&usp=drive_copy/data_folder/combined_data_4.txt']

    for file in files:
        print("Reading ratings from {}...".format(file))
        with open(file) as f:
            for line in f:
                del row[:] # you don't have to do this.
                line = line.strip()
                if line.endswith(':'):
                    # All below are ratings for this movie, until another movie appears.
                    movie_id = line.replace(':', '')
                else:
                    row = [x for x in line.split(',')]
                    row.insert(0, movie_id)
                    data.write(','.join(row))
                    data.write('\n')
        print("Done.\n")
    data.close()
print('Time taken :', datetime.now() - start)

Time taken : 0:00:00.140793


In [None]:
print("creating the dataframe from data.csv file..")
df = pd.read_csv('/content/drive/MyDrive/projects/netflix_movie_recommendation/data.csv', sep=',',
                       names=['movie', 'user','rating','date'])
df.date = pd.to_datetime(df.date)
print('Done.\n')

# we are arranging the ratings according to time.
print('Sorting the dataframe by date..')
df.sort_values(by='date', inplace=True)
print('Done..')

creating the dataframe from data.csv file..
Done.

Sorting the dataframe by date..
Done..
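
With roughly 100 million rows, column dtypes matter for memory. The sketch below is an optional variation on the loading cell, not what the notebook ran: it assumes the same four column names and picks the smallest integer types that fit the documented ranges (MovieIDs up to 17770, CustomerIDs up to 2649429, ratings 1 to 5). The in-memory CSV text stands in for `data.csv`.

```python
import io

import pandas as pd

# Two rows in the merged movie,user,rating,date layout (taken from the
# example data point shown earlier); a stand-in for the real data.csv.
csv_text = "1,1488844,3,2005-09-06\n1,822109,5,2005-05-13\n"

df_small = pd.read_csv(
    io.StringIO(csv_text),
    names=["movie", "user", "rating", "date"],
    # Smallest integer types that cover the documented value ranges.
    dtype={"movie": "int16", "user": "int32", "rating": "int8"},
    parse_dates=["date"],
)

# Arrange the ratings chronologically, as the cell above does.
df_small = df_small.sort_values(by="date").reset_index(drop=True)
```

Compared with the default 64-bit integers, these three columns take 7 bytes per row instead of 24.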


In [None]:
df.head()

Unnamed: 0,movie,user,rating,date
56431994,10341,510180,4,1999-11-11
9056171,1798,510180,5,1999-11-11
58698779,10774,510180,3,1999-11-11
48101611,8651,510180,2,1999-11-11
81893208,14660,510180,2,1999-11-11


In [None]:
df['rating'].describe()

count    1.004805e+08
mean     3.604290e+00
std      1.085219e+00
min      1.000000e+00
25%      3.000000e+00
50%      4.000000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

In [None]:
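# df.isnull().sum().sum() counts missing cells; sum(df.isnull().any()) would only count columns containing NaN
print("No of Nan values in our dataframe : ", df.isnull().sum().sum())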

No of Nan values in our dataframe :  0


In [None]:
# Checking for duplicates

dup_bool = df.duplicated(['movie','user','rating'])
dups = sum(dup_bool) # duplicates are judged on movie, user and rating only (the date is ignored)
print("There are {} duplicate rating entries in the data..".format(dups))

There are 0 duplicate rating entries in the data..
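
The check above passes `['movie', 'user', 'rating']` as the subset, so the date is not part of the comparison. The toy frame below (made-up values) illustrates how the subset changes what counts as a duplicate, and how `drop_duplicates` would remove such rows if any were found.

```python
import pandas as pd

# Made-up ratings: rows 0 and 1 differ only in their date.
toy = pd.DataFrame({
    "movie": [1, 1, 1, 2],
    "user": [10, 10, 11, 10],
    "rating": [4, 4, 3, 5],
    "date": pd.to_datetime(
        ["2005-01-01", "2005-01-02", "2005-01-01", "2005-01-03"]
    ),
})

# Duplicates judged on movie, user and rating only.
n_subset = int(toy.duplicated(subset=["movie", "user", "rating"]).sum())

# Duplicates judged on every column, including the date.
n_all = int(toy.duplicated().sum())

# Keep the first occurrence of each (movie, user, rating) triple.
deduped = toy.drop_duplicates(subset=["movie", "user", "rating"], keep="first")
```

Here `n_subset` is 1 while `n_all` is 0, because the two matching rows carry different dates.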


In [None]:
print("Total data ")
print("-"*50)
print("\nTotal no of ratings :",df.shape[0])
print("Total No of Users   :", df['user'].nunique())
print("Total No of movies  :", df['movie'].nunique())

Total data 
--------------------------------------------------

Total no of ratings : 100480507
Total No of Users   : 480189
Total No of movies  : 17770


### Split Data into test and train

In [None]:
if not os.path.isfile('/content/drive/MyDrive/projects/netflix_movie_recommendation/train.csv'):
    # create the dataframe and store it in the disk for offline purposes..
    df[:int(df.shape[0]*0.80)].to_csv("/content/drive/MyDrive/projects/netflix_movie_recommendation/train.csv", index=False)

if not os.path.isfile('/content/drive/MyDrive/projects/netflix_movie_recommendation/test.csv'):
    # create the dataframe and store it in the disk for offline purposes..
    df[int(df.shape[0]*0.80):].to_csv("/content/drive/MyDrive/projects/netflix_movie_recommendation/test.csv", index=False)

train_df = pd.read_csv("/content/drive/MyDrive/projects/netflix_movie_recommendation/train.csv", parse_dates=['date'])
test_df = pd.read_csv("/content/drive/MyDrive/projects/netflix_movie_recommendation/test.csv", parse_dates=['date'])

In [None]:
display(train_df.head())
print('\n\n')
display(test_df.head())

Unnamed: 0,movie,user,rating,date
0,10341,510180,4,1999-11-11
1,1798,510180,5,1999-11-11
2,10774,510180,3,1999-11-11
3,8651,510180,2,1999-11-11
4,14660,510180,2,1999-11-11







Unnamed: 0,movie,user,rating,date
0,5926,2294429,2,2005-08-08
1,10158,1743373,4,2005-08-08
2,17064,381625,5,2005-08-08
3,1443,1252933,5,2005-08-08
4,1201,1434500,4,2005-08-08


### Basic Statistics in Train data (#Ratings, #Users, and #Movies)

In [None]:
print("Training data ")
print("-"*50)
print("\nTotal no of ratings :",train_df.shape[0])
print("Total No of Users   :", train_df['user'].nunique())
print("Total No of movies  :", train_df['movie'].nunique())

Training data 
--------------------------------------------------

Total no of ratings : 80384405
Total No of Users   : 405041
Total No of movies  : 17424


### Basic Statistics in Test data (#Ratings, #Users, and #Movies)

In [None]:
print("Test data ")
print("-"*50)
print("\nTotal no of ratings :", test_df.shape[0])
print("Total No of Users   :", test_df['user'].nunique())
print("Total No of movies  :", test_df['movie'].nunique())

Test data 
--------------------------------------------------

Total no of ratings : 20096102
Total No of Users   : 349312
Total No of movies  : 17757
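
Because the split is chronological, some users and movies never appear in the training data: 480189 - 405041 = 75148 users and 17770 - 17424 = 346 movies occur only in the test set. The sketch below shows one way to count such cold-start cases; the helper name `cold_start_counts` and the toy frames are assumptions made for illustration.

```python
import pandas as pd


def cold_start_counts(train, test, column):
    """Return (number of ids unseen in train, number of test rows affected)."""
    unseen_mask = ~test[column].isin(train[column].unique())
    n_ids = int(test.loc[unseen_mask, column].nunique())
    n_rows = int(unseen_mask.sum())
    return n_ids, n_rows


# Made-up frames standing in for train_df and test_df.
toy_train = pd.DataFrame({"user": [1, 1, 2], "movie": [10, 11, 10]})
toy_test = pd.DataFrame({"user": [2, 3, 3], "movie": [11, 10, 12]})

new_users = cold_start_counts(toy_train, toy_test, "user")    # (1, 2)
new_movies = cold_start_counts(toy_train, toy_test, "movie")  # (1, 1)
```

Calling the helper as `cold_start_counts(train_df, test_df, 'user')` would also report how many test ratings belong to those unseen users, which the unique counts alone do not reveal.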


### EDA on Train Data

In [None]:
# method to make y-axis more readable
def human(num, units = 'M'):
    units = units.lower()
    num = float(num)
    if units == 'k':
        return str(num/10**3) + " K"
    elif units == 'm':
        return str(num/10**6) + " M"
    elif units == 'b':
        return str(num/10**9) +  " B"

In [None]:
train_df

Unnamed: 0,movie,user,rating,date
0,10341,510180,4,1999-11-11
1,1798,510180,5,1999-11-11
2,10774,510180,3,1999-11-11
3,8651,510180,2,1999-11-11
4,14660,510180,2,1999-11-11
...,...,...,...,...
80384400,12074,2033618,4,2005-08-08
80384401,862,1797061,3,2005-08-08
80384402,10986,1498715,5,2005-08-08
80384403,14861,500016,4,2005-08-08


In [None]:
train_df['rating'].value_counts().plot(kind = 'bar')

[Figure: bar chart of the number of ratings per rating value in train_df]

In [None]:
type(train_df['date'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'

train_df['day_of_week'] = train_df['date'].dt.day_name()
train_df.tail()
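
Once the `day_of_week` column exists, ratings can be summarised per weekday. The following is a small sketch on made-up ratings (the dates are ones that appear in this notebook); the `summary` frame and the weekday ordering list are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Made-up ratings attached to dates that appear in this notebook.
toy = pd.DataFrame({
    "rating": [4, 5, 3, 2],
    "date": pd.to_datetime(
        ["1999-11-11", "2005-08-08", "2005-09-06", "2005-08-08"]
    ),
})
toy["day_of_week"] = toy["date"].dt.day_name()

# Calendar order, so the summary is not sorted alphabetically.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# Number of ratings and mean rating per weekday; days without any
# ratings show up as NaN after the reindex.
summary = (
    toy.groupby("day_of_week")["rating"]
       .agg(["count", "mean"])
       .reindex(order)
)
```

The same `groupby` applied to `train_df` gives the per-weekday rating volume and average.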