<a href="https://colab.research.google.com/github/endiesworld/2110ACDS_T7_C_Predict/blob/main/2110ACDS_T7_starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDSA Movie Recommendation 2022

© Explore Data Science Academy

---
### Honour Code

**2110ACDS_T6**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.


  

<h2><center> EDSA Movie Recommendation 2022</h2></center>
<figure>
<center><img src ="./assets/movies.png" width = "800" height = '500'/>

*Introduction*
<p align = "justify">Recommender System is a system that seeks to predict or filter preferences according to the user’s choices. Recommender systems are utilized in a variety of areas, and in this project we will use a recommender system to recommend movies for movie lovers.


*About the problem*
<p align = "justify">PUT PROBLEM STATEMENT HERE.

*Objective*
<p align = "justify"> We aim to provide an accurate and robust solution to this problem, by providing personalised recommendations to users of this product, and generating platform affinity for the streaming services which best facilitates their audience's viewing

*Process*
<p align = "justify"> In order to achieve this objective the team will follow the process below:-

1. analyse the supplied data, identify potential errors in the data and clean the existing data set;

2. determine if additional features can be added to enrich the data set;

3. build a model that is capable of predicting how a user will rate a movie;

4. evaluate the accuracy of the best machine learning model;

5. accurately predicting how a user will rate a movie they have not yet viewed, based on their historical preferences, and

6. explain the inner working of the model to a non-technical audience.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:

# Import comet_ml at the top of your file
# from comet_ml import Experiment

# # Create an experiment with your api key
# experiment = Experiment(
#     api_key="emBEBYBp72gW5tfeZBSGftD0Y",
#     project_name="movie-recommendation",
#     workspace="emmanuelokoro",
#     log_code = True
# )

In [2]:
# Libraries for importing and loading data
import numpy as np
import pandas as pd
import scipy as sp # <-- The sister of Numpy, used in our code for numerical efficientcy. 
import matplotlib.pyplot as plt
import seaborn as sns
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.feature_extraction.text import TfidfVectorizer

# Libraries used during sorting procedures.
import operator # <-- Convienient item retrieval during iteration 
import heapq # <-- Efficient sorting of large lists
from time import time

# Setting global constants to ensure notebook results are reproducible

RANDOM_STATE = 42


import warnings
warnings.filterwarnings('ignore')

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

### 2.1 Brief description of the data



In [3]:
# load the data
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
genome_scores = pd.read_csv('./data/genome_scores.csv')
genome_tags = pd.read_csv('./data/tags.csv')
imdb_data = pd.read_csv('./data/imdb_data.csv')
links = pd.read_csv('./data/links.csv')
movies = pd.read_csv('./data/movies.csv')
# tags = pd.read_csv('./data/tags.csv')

In [4]:
# Preview train dataset
print('The Shape of the data is: ', train.shape)
train.head()

The Shape of the data is:  (10000038, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [5]:
train['userId'].nunique()

162541

In [6]:
# Preview train dataset
print('The Shape of the data is: ', test.shape)
test.head()

The Shape of the data is:  (5000019, 2)


Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [7]:
test['movieId'].nunique()

39643

In [8]:
test.tail()

Unnamed: 0,userId,movieId
5000014,162541,4079
5000015,162541,4467
5000016,162541,4980
5000017,162541,5689
5000018,162541,7153


In [9]:
# Preview genome_scores dataset
print('The Shape of the data is: ', genome_scores.shape)
genome_scores.head()

The Shape of the data is:  (15584448, 3)


Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [10]:
# Preview genome_scores dataset
print('The Shape of the data is: ', genome_tags.shape)
genome_tags.head()

The Shape of the data is:  (1093360, 4)


Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [11]:
# Preview imdb_data dataset
print('The Shape of the data is: ', imdb_data.shape)
imdb_data.head()

The Shape of the data is:  (27278, 6)


Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [12]:
imdb_data.isna().sum()

movieId              0
title_cast       10068
director          9874
runtime          12089
budget           19372
plot_keywords    11078
dtype: int64

In [13]:
imdb_data = imdb_data.dropna()
imdb_data.isna().sum()

movieId          0
title_cast       0
director         0
runtime          0
budget           0
plot_keywords    0
dtype: int64

In [14]:
# Preview links dataset
print('The Shape of the data is: ', links.shape)
links.head()

The Shape of the data is:  (62423, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [15]:
# Preview movies dataset
print('The Shape of the data is: ', movies.shape)
movies.head()

The Shape of the data is:  (62423, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [16]:
movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [17]:
# Preview tags dataset
# print('The Shape of the data is: ', tags.shape)
# tags.head()

#### Dataset summary


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


### 3.1 Exploratory Data Analysis
*What is Exploratory data analysis?*
    Exploratory data analysis (EDA) is the process of analysing and investigating data sets and summarizing their main characteristics, often employing both non-graphical and graphical methods. 

*Why is conducting EDA important?*
    It aids in determining how best to manipulate data to get the required answers, expose trends, patterns, and relationships that are not readily apparent i.e. get insights into the dataset.

*How is EDA conducted?*
    EDA can be conducted in the following ways:
- **Univariate**:- \
    i. **non-graphical**:- This is simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships.\
    ii. **graphical**:- Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. It involves visual exploratory analysis of the data.
- Multivariate:-  \
    i. **non-graphical**:- Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics. \
    ii. **graphical**:- Multivariate data uses graphics to display relationships between two or more sets of data. The most used graphic is a grouped bar plot or bar chart with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.
    
To achieve the above, while considering the volume of dataset for this project, we make use of this python module **pandas_profiling**

#### 3.1.1 pandas_profiling
Pandas profiling is an open source Python module with which we can quickly do an exploratory data analysis with just a few lines of code. It offers report generation for the dataset with lots of features and customizations for the report generated.

In [18]:
# Generate EDA report of train dataset
# from pandas_profiling import ProfileReport
# profile = ProfileReport(train, title="Report")
# profile


In [19]:
# Generate report for genome_scores
# profile = ProfileReport(genome_scores, title="genome_scores report")
# profile


#### summarize the above.

 **Descriptive Statistics**

>Descriptive statistics summarize the data by computing mean, median, mode, standard deviation likewise.descriptive statistics describe the dataset in a way simpler manner through;

*   The measure of central tendency 
>*  Mean:- The average value 
>*  Median:- The mid point value 
>*  Mode:- The most common value

*   Measure of spread  
>* Percentiles:- Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
>* standard deviation:-a number that describes how spread out the values are.
*  Measure of symmetry 
>* Skewness:- a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
>>* If skewness is less than -1 or greater than 1, the distribution is highly skewed.
>>* If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
>>* If skewness is between -0.5 and 0.5, the distribution is approximately symmetric. 
*  Measure of Peakedness 
>* Kurtosis:-  a measure of relative peakedness of a probability distribution, or alternatively how heavy or how light its tails are. A standard normal distribution has kurtosis of 3 and is recognized as mesokurtic. An increased kurtosis (>3) can be visualized as a thin “bell” with a high peak whereas a decreased kurtosis corresponds to a broadening of the peak and “thickening” of the tails. Kurtosis >3 is recognized as leptokurtic and <3 as platykurtic (lepto=thin; platy=broad).
>>








In [19]:
# look at data statistics


### 3.2 Key Insights from EDA 


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

### 4.1 Content-based Filtering 
Making recommendations based on how similar the properties or features of an item are to other items

Considering the large volume of dataset we have, we shall restrict this work to only userIds present in the test dataset.

**Unique userid**
We want to evaluate the difference between the unique userId in the Train dataset and Test dataset

In [20]:
test['userId'].nunique()

162350

In [21]:
test_case = test['userId'].nunique()
train_case = train['userId'].nunique()
print('The difference in unique userID count between train and test data set is:', (train_case - test_case))

The difference in unique userID count between train and test data set is: 191


From the above, proceed to extract these 191 userIds, that are not required for prediction

In [22]:
test_userids = test['userId'].unique().tolist()
test_userids[:20]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [23]:
#Extract rows with userId present in test userId
useful_train = train[train['userId'].isin(test_userids)]
useful_train.shape

(9997845, 4)

#### Sorting of Tables

We proceed to sort both tables( train and test ) by useId

In [24]:
# Sort train dataset by userId
useful_train.sort_values(by=['userId'], inplace= True)
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
5122500,1,3949,5.0,1147868678
9153002,1,1175,3.5,1147868826
6923102,1,6016,5.0,1147869090
724395,1,7323,3.5,1147869119
2805472,1,4973,4.5,1147869080


In [25]:
# Sort test dataset by userId
test.sort_values(by=['userId'], inplace= True)
test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


#### Merging of relevant tables

At this stage, we merge both tables with other tables considered to be useful for the task at hand.
The tables we merge with are listed below:
- imdb_data
- movies


In [26]:
# Merge train table with imdb_data table 
useful_train = useful_train.merge(imdb_data, on = 'movieId', how= 'left')
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene
1,1,1175,3.5,1147868826,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,99.0,"FRF24,000,000",black comedy|absurd comedy|surrealist|bed
2,1,6016,5.0,1147869090,Alexandre Rodrigues|Leandro Firmino|Phellipe H...,Kátia Lund,130.0,"$3,300,000",photographer|slum|gang|brazil
3,1,7323,3.5,1147869119,Daniel Brühl|Katrin Saß|Chulpan Khamatova|Mari...,Bernd Lichtenberg,121.0,"EUR4,800,000",coma|german democratic republic|capitalism|pol...
4,1,4973,4.5,1147869080,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,122.0,"$10,000,000",female protagonist|paris france|france|montmar...


In [27]:
# Merge test table with imdb_data table 
test = test.merge(imdb_data, on = 'movieId', how= 'left')
test.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,2011,,,,,
1,1,4144,,,,,
2,1,5767,,,,,
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping


In [36]:
# Merge train table with movies table 
useful_train = useful_train.merge(movies, on = 'movieId', how= 'left')
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords,title,genres
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene,Requiem for a Dream (2000),Drama
1,1,2351,4.5,1147877957,,,,,,"Nights of Cabiria (Notti di Cabiria, Le) (1957)",Drama
2,1,2068,2.5,1147869044,,,,,,Fanny and Alexander (Fanny och Alexander) (1982),Drama|Fantasy|Mystery
3,1,27266,4.5,1147879365,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...,Kar-Wai Wong,129.0,"$12,000,000",nostalgia|loneliness|room 2046|1960s,2046 (2004),Drama|Fantasy|Romance|Sci-Fi
4,1,7939,2.5,1147869183,,,,,,Through a Glass Darkly (Såsom i en spegel) (1961),Drama


In [37]:
# Merge test table with movies table 
test = test.merge(movies, on = 'movieId', how= 'left')
test.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords,title,genres
0,1,2011,,,,,,Back to the Future Part II (1989),Adventure|Comedy|Sci-Fi
1,1,4144,,,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance
2,1,5767,,,,,,Teddy Bear (Mis) (1981),Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...,Lost in Translation (2003),Comedy|Drama|Romance
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping,"Passion of the Christ, The (2004)",Drama


#### Merging vital colunms

For this stage, we proceed to merge columns we have considered to be important in describing the content of a movie into a new column called key_words. The columns are listed below:
- title_cast
- director
- plot_keywords
- genres

In [38]:
# Merge the columns listed above into a new column named key_words fot the train data
useful_train['key_words'] = (pd.Series(useful_train[['title_cast', 'director', 'plot_keywords', 'genres']].fillna('')
                      .values.tolist()).str.join(' '))
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords,title,genres,key_words
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene,Requiem for a Dream (2000),Drama,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...
1,1,2351,4.5,1147877957,,,,,,"Nights of Cabiria (Notti di Cabiria, Le) (1957)",Drama,Drama
2,1,2068,2.5,1147869044,,,,,,Fanny and Alexander (Fanny och Alexander) (1982),Drama|Fantasy|Mystery,Drama|Fantasy|Mystery
3,1,27266,4.5,1147879365,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...,Kar-Wai Wong,129.0,"$12,000,000",nostalgia|loneliness|room 2046|1960s,2046 (2004),Drama|Fantasy|Romance|Sci-Fi,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...
4,1,7939,2.5,1147869183,,,,,,Through a Glass Darkly (Såsom i en spegel) (1961),Drama,Drama


In [41]:
# confrim the absense of NaN value in the key_word column for the train data
nan = useful_train['key_words'].isna().sum()
print(f' There are {nan} numbers of NaN values in the train keywords column')

 There are 0 numbers of NaN values in the train keywords column


In [42]:
# Merge the columns listed above into a new column named key_words fot the test data
test['key_words'] = (pd.Series(test[['title_cast', 'director', 'plot_keywords', 'genres']].fillna('')
                      .values.tolist()).str.join(' '))
test.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords,title,genres,key_words
0,1,2011,,,,,,Back to the Future Part II (1989),Adventure|Comedy|Sci-Fi,Adventure|Comedy|Sci-Fi
1,1,4144,,,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance,Drama|Romance
2,1,5767,,,,,,Teddy Bear (Mis) (1981),Comedy|Crime,Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...,Lost in Translation (2003),Comedy|Drama|Romance,Scarlett Johansson|Bill Murray|Akiko Takeshita...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping,"Passion of the Christ, The (2004)",Drama,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...


In [43]:
# confrim the absense of NaN value in the key_word column for the test data
nan = test['key_words'].isna().sum()
print(f' There are {nan} numbers of NaN values in the test keywords column')

 There are 0 numbers of NaN values in the test keywords column


#### Droping of colunms not needed

Going forward, we drop colunms we have considered not realy important for the task at hand. The columns are listed below:
- runtime
- budget
- timestamp
- title
- title_cast
- director
- plot_keywords 
- genres'

In [44]:
# Drop the above listed columns in the train data
useful_train.drop(columns=['timestamp', 'runtime', 'budget','title', 'title_cast', 'director', 
                           'plot_keywords','genres'], inplace= True)
useful_train.head()

Unnamed: 0,userId,movieId,rating,key_words
0,1,3949,5.0,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...
1,1,2351,4.5,Drama
2,1,2068,2.5,Drama|Fantasy|Mystery
3,1,27266,4.5,Tony Chiu-Wai Leung|Li Gong|Faye Wong|Takuya K...
4,1,7939,2.5,Drama


In [46]:
# Drop the above listed columns in the test data
test.drop(columns=['runtime', 'budget','title', 'title_cast', 'director', 
                           'plot_keywords','genres'], inplace= True)
test.head()

Unnamed: 0,userId,movieId,key_words
0,1,2011,Adventure|Comedy|Sci-Fi
1,1,4144,Drama|Romance
2,1,5767,Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...


#### Data Formating

As can be seen in the key_words colunm, each enity are separated by a '|', and this character which is a separator(delimiter) if left with the data, will affect the accuracy of our model, hence needs to be removed. 

To achieve this, we write a function called splitter, to operate on both the train and test dataset. 

In [48]:
# Remove delimeters(Separators) from string data
def splitter(df, col_list, delim):
    """
        This function accepts a dataframe(df) and a list of columns(col_list), which contains the delimiter
        to be removed, it also accepts the delimiter which is to be removed
    """
    new_df = df.copy()
    
    for col in col_list:
        new_df[col] = new_df[col].str.split(delim).str.join(' ')
    
    return new_df

In [49]:
# Remove delimeter from key_words colunm in train data
useful_train = splitter(useful_train, ['key_words'], '|')
useful_train.head()

Unnamed: 0,userId,movieId,rating,key_words
0,1,3949,5.0,Ellen Burstyn Jared Leto Jennifer Connelly Mar...
1,1,2351,4.5,Drama
2,1,2068,2.5,Drama Fantasy Mystery
3,1,27266,4.5,Tony Chiu-Wai Leung Li Gong Faye Wong Takuya K...
4,1,7939,2.5,Drama


In [50]:
# Remove delimeter from key_words colunm in test data
test = splitter(test, ['key_words'], '|')
test.head()

Unnamed: 0,userId,movieId,key_words
0,1,2011,Adventure Comedy Sci-Fi
1,1,4144,Drama Romance
2,1,5767,Comedy Crime
3,1,6711,Scarlett Johansson Bill Murray Akiko Takeshita...
4,1,7318,Jim Caviezel Maia Morgenstern Christo Jivkov F...


In [53]:
# Check the value of the last userId for train data
useful_train.tail()

Unnamed: 0,userId,movieId,rating,key_words
9997840,162541,4476,2.5,Comedy
9997841,162541,6548,3.0,Martin Lawrence Will Smith Jordi Mollà Gabriel...
9997842,162541,1136,4.5,Adventure Comedy Fantasy
9997843,162541,745,4.0,Animation Children Comedy
9997844,162541,1230,3.5,Comedy Romance


In [54]:
# Check the value of the last userId for test data
test.tail()

Unnamed: 0,userId,movieId,key_words
5000014,162541,345,Comedy Drama
5000015,162541,150,Tom Hanks Bill Paxton Kevin Bacon Gary Sinise ...
5000016,162541,5689,Dustin Hoffman Nicole Kidman Loren Dean Bruce ...
5000017,162541,2324,Roberto Benigni Nicoletta Braschi Giorgio Cant...
5000018,162541,7153,Noel Appleby Ali Astin Sean Astin David Aston ...


#### Dividing dataset into chunks

We shall now proceed to divide both dataset into chunks. The numbers of chunk we have chosen is 162350, which is the numbers of unique userIds we have for both dataset. We do this to enables us save these chunks into local storage system of our machines, fetch these chunks back individualy and separately processing these chunks and making prediction afterwards. This act is necessary because our machines all have limited capacity, which prevents us from processing theses dataset at once.



In [61]:
# A function that generate a list of chunks
def create_chunk_list(df, col_ref, col_val):
    """
        This function accepts a dataframe, the dataframe column and the colunm value to filter by
        It returns a new dataframe, which is a datframe where the reference column matches the passed column value.
    """
    new_df = df[df[col_ref] == col_val]
        
    return new_df
            

#### Collection of unique userIds

As can be observed from our work so far, the userId are not ordered numerically, hence, there is need to collect these userIds into a list, which we use as the extensions to the chunks to be created.

In [28]:
extentions = test['userId'].unique().tolist()

print('The total numbers of extensions is: ', len(extentions))

The total numbers of extensions is:  162350


In [66]:
# Create and store chunk for train data
t0 = time()
for index, extention in enumerate (extentions):
    
    # Create chunk name
    chunk_name = "train_chunk_{0}".format(extention)
    
    # Create and store chunk
    globals()[chunk_name] = create_chunk_list(useful_train, 'userId', extention)
    
    # Create create directory and save chunk
    directory = './data/chunked_train_data/'+chunk_name+'.csv'
    globals()[chunk_name].to_csv(directory,index=False)
    
    # Delete chunk from global memory space
    del globals()[chunk_name]
t1 = time()

print(f'I took {(t1 - t0) / 60} to create these chunks')

I took 26.652890384197235 to create these chunks


In [64]:
# Create and store chunk for test data
for index, extention in enumerate (extentions):
    # Create chunk name
    chunk_name = "test_chunk_{0}".format(extention)
    
    # Create and store chunk
    globals()[chunk_name] = create_chunk_list(useful_train, 'userId', extention)
    
    # Create create directory and save chunk
    directory = './data/chunked_test_data/'+chunk_name+'.csv'
    globals()[chunk_name].to_csv(directory,index=False)
    
    # Delete chunk from global memory space
    del globals()[chunk_name]

# WORK continues here

In [29]:
def content_generate_rating_estimate(k=20, threshold=0.0):
    frames = []
    for index, extention in enumerate (extentions):
        new_ids = []
        t0 = time()
        
        # Create chunk name
        chunk_name = "prediction_{0}".format(extention)
        directory = './data/predictions/'+chunk_name+'.csv'
        
        
        rating_data = './data/chunked_train_data/train_chunk_{0}.csv'.format(extention)
        user = './data/chunked_test_data/test_chunk_{0}.csv'.format(extention)
        
        rating_data = pd.read_csv(rating_data)
        user = pd.read_csv(user)
        
        user_record = rating_data.drop(columns = ['rating'])
        
        
        
        combined_movies_table = pd.concat([user_record, user], ignore_index=True)
        
        titles = combined_movies_table['movieId']
        indices = pd.Series(combined_movies_table.index, combined_movies_table['movieId'])
        
        tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2),
                     min_df=0, stop_words='english')

        # Produce a feature matrix, where each row corresponds to a book,
        # with TF-IDF features as columns 
        tf_authTags_matrix = tf.fit_transform(combined_movies_table['key_words'])
        
        cosine_sim_authTags = cosine_similarity(tf_authTags_matrix, 
                                        tf_authTags_matrix)
        
        interested_movies = user['movieId'].tolist()
        
        new_ids = []
        predictedRatings = []
        for i, movie in enumerate(interested_movies):
            
            b_idx = indices[movie]
            neighbors = [] # <-- Stores our collection of similarity values 
        
            for j, row in rating_data.iterrows():
                sim = cosine_sim_authTags[b_idx-1, indices[row['movieId']]-1]
                
                neighbors.append((sim[0], row['rating']))
            
            # Select the top-N values from our collection
            k_neighbors = heapq.nlargest(k, neighbors, key=lambda t: t[0])
            
            # Compute the weighted average using similarity scores and 
            # user item ratings. 
            simTotal, weightedSum = 0, 0
            
            for (simScore, rating) in k_neighbors:
                # Ensure that similarity ratings are above a given threshold
                if (simScore > threshold):
                    simTotal += simScore
                    weightedSum += simScore * rating
            try:
                predictedRating = weightedSum / simTotal
            except ZeroDivisionError:
                # Cold-start problem - No ratings given by user. 
                # We use the average rating for the reference item as a proxy in this case 
                predictedRating = np.mean(rating_data[rating_data['movieId']==movie]['rating'])
            
            # Prepare Id colunm for submission 
            new_id = str(extention)+ '_' + str(movie)
            
            # Append the new id generated
            new_ids.append(new_id)
            
            # append the new prediction
            predictedRatings.append(predictedRating)
            
            
        df = pd.DataFrame({
        'Id': new_ids,
        'rating': predictedRatings
         })
        
        frames.append(df)
        df.to_csv(directory,index=False)
        
        t1 = time()
        print('Pridiction for chunk no: ', (index+1))
        print('Time elapsed: ', (t1-t0))
        
        
    return frames

In [None]:
frames = content_generate_rating_estimate(k=20, threshold=0.0)

Pridiction for chunk no:  1
Time elapsed:  0.26274824142456055
Pridiction for chunk no:  2
Time elapsed:  1.4923338890075684
Pridiction for chunk no:  3
Time elapsed:  16.388689041137695
Pridiction for chunk no:  4
Time elapsed:  2.1610679626464844
Pridiction for chunk no:  5
Time elapsed:  0.3312692642211914
Pridiction for chunk no:  6
Time elapsed:  0.03435182571411133
Pridiction for chunk no:  7
Time elapsed:  0.03903698921203613
Pridiction for chunk no:  8
Time elapsed:  1.0716240406036377
Pridiction for chunk no:  9
Time elapsed:  0.9739418029785156
Pridiction for chunk no:  10
Time elapsed:  0.11452221870422363
Pridiction for chunk no:  11
Time elapsed:  0.038392066955566406
Pridiction for chunk no:  12
Time elapsed:  25.108580112457275
Pridiction for chunk no:  13
Time elapsed:  5.730757713317871
Pridiction for chunk no:  14
Time elapsed:  0.05658292770385742
Pridiction for chunk no:  15
Time elapsed:  0.21429872512817383
Pridiction for chunk no:  16
Time elapsed:  0.01987814903

Pridiction for chunk no:  130
Time elapsed:  0.23471403121948242
Pridiction for chunk no:  131
Time elapsed:  0.19109606742858887
Pridiction for chunk no:  132
Time elapsed:  2.718611001968384
Pridiction for chunk no:  133
Time elapsed:  0.3911278247833252
Pridiction for chunk no:  134
Time elapsed:  0.20193696022033691
Pridiction for chunk no:  135
Time elapsed:  0.055590152740478516
Pridiction for chunk no:  136
Time elapsed:  0.11664390563964844
Pridiction for chunk no:  137
Time elapsed:  0.23596882820129395
Pridiction for chunk no:  138
Time elapsed:  0.08727097511291504
Pridiction for chunk no:  139
Time elapsed:  0.8480520248413086
Pridiction for chunk no:  140
Time elapsed:  0.31607508659362793
Pridiction for chunk no:  141
Time elapsed:  1.5908281803131104
Pridiction for chunk no:  142
Time elapsed:  0.03965020179748535
Pridiction for chunk no:  143
Time elapsed:  0.38677072525024414
Pridiction for chunk no:  144
Time elapsed:  0.11972403526306152
Pridiction for chunk no:  145

Pridiction for chunk no:  258
Time elapsed:  0.6564390659332275
Pridiction for chunk no:  259
Time elapsed:  0.07973694801330566
Pridiction for chunk no:  260
Time elapsed:  1.229539155960083
Pridiction for chunk no:  261
Time elapsed:  0.19804596900939941
Pridiction for chunk no:  262
Time elapsed:  0.21694302558898926
Pridiction for chunk no:  263
Time elapsed:  0.07293915748596191
Pridiction for chunk no:  264
Time elapsed:  0.3513360023498535
Pridiction for chunk no:  265
Time elapsed:  0.037069082260131836
Pridiction for chunk no:  266
Time elapsed:  0.5911638736724854
Pridiction for chunk no:  267
Time elapsed:  0.1475059986114502
Pridiction for chunk no:  268
Time elapsed:  0.04766988754272461
Pridiction for chunk no:  269
Time elapsed:  0.13809418678283691
Pridiction for chunk no:  270
Time elapsed:  0.22577977180480957
Pridiction for chunk no:  271
Time elapsed:  0.07178878784179688
Pridiction for chunk no:  272
Time elapsed:  0.03578686714172363
Pridiction for chunk no:  273


Pridiction for chunk no:  386
Time elapsed:  0.2749009132385254
Pridiction for chunk no:  387
Time elapsed:  2.5952911376953125
Pridiction for chunk no:  388
Time elapsed:  0.04221081733703613
Pridiction for chunk no:  389
Time elapsed:  0.26905107498168945
Pridiction for chunk no:  390
Time elapsed:  0.05039215087890625
Pridiction for chunk no:  391
Time elapsed:  0.06717109680175781
Pridiction for chunk no:  392
Time elapsed:  0.011054039001464844
Pridiction for chunk no:  393
Time elapsed:  0.32199716567993164
Pridiction for chunk no:  394
Time elapsed:  2.922919988632202
Pridiction for chunk no:  395
Time elapsed:  3.8703839778900146
Pridiction for chunk no:  396
Time elapsed:  0.1971571445465088
Pridiction for chunk no:  397
Time elapsed:  0.553257942199707
Pridiction for chunk no:  398
Time elapsed:  0.42885613441467285
Pridiction for chunk no:  399
Time elapsed:  0.45825791358947754
Pridiction for chunk no:  400
Time elapsed:  0.22166085243225098
Pridiction for chunk no:  401
Ti

Pridiction for chunk no:  514
Time elapsed:  0.9197118282318115
Pridiction for chunk no:  515
Time elapsed:  0.09831500053405762
Pridiction for chunk no:  516
Time elapsed:  0.047943830490112305
Pridiction for chunk no:  517
Time elapsed:  0.08083987236022949
Pridiction for chunk no:  518
Time elapsed:  18.68558120727539
Pridiction for chunk no:  519
Time elapsed:  0.042043209075927734
Pridiction for chunk no:  520
Time elapsed:  3.6770517826080322
Pridiction for chunk no:  521
Time elapsed:  0.07885313034057617
Pridiction for chunk no:  522
Time elapsed:  1.6593399047851562
Pridiction for chunk no:  523
Time elapsed:  0.022409915924072266
Pridiction for chunk no:  524
Time elapsed:  0.05023002624511719
Pridiction for chunk no:  525
Time elapsed:  2.2130088806152344
Pridiction for chunk no:  526
Time elapsed:  1.5211901664733887
Pridiction for chunk no:  527
Time elapsed:  0.19494104385375977
Pridiction for chunk no:  528
Time elapsed:  0.04416799545288086
Pridiction for chunk no:  529

Pridiction for chunk no:  641
Time elapsed:  3.02108097076416
Pridiction for chunk no:  642
Time elapsed:  0.031294822692871094
Pridiction for chunk no:  643
Time elapsed:  0.1147613525390625
Pridiction for chunk no:  644
Time elapsed:  0.12882184982299805
Pridiction for chunk no:  645
Time elapsed:  0.09006214141845703
Pridiction for chunk no:  646
Time elapsed:  5.964623689651489
Pridiction for chunk no:  647
Time elapsed:  18.021825075149536
Pridiction for chunk no:  648
Time elapsed:  4.7985570430755615
Pridiction for chunk no:  649
Time elapsed:  0.2658579349517822
Pridiction for chunk no:  650
Time elapsed:  0.0500798225402832
Pridiction for chunk no:  651
Time elapsed:  0.4764888286590576
Pridiction for chunk no:  652
Time elapsed:  48.626953125
Pridiction for chunk no:  653
Time elapsed:  0.0627598762512207
Pridiction for chunk no:  654
Time elapsed:  0.047059059143066406
Pridiction for chunk no:  655
Time elapsed:  0.08500885963439941
Pridiction for chunk no:  656
Time elapsed

Pridiction for chunk no:  769
Time elapsed:  1.7900559902191162
Pridiction for chunk no:  770
Time elapsed:  0.5007388591766357
Pridiction for chunk no:  771
Time elapsed:  5.672645807266235
Pridiction for chunk no:  772
Time elapsed:  0.07702875137329102
Pridiction for chunk no:  773
Time elapsed:  0.061028242111206055
Pridiction for chunk no:  774
Time elapsed:  1.9752659797668457
Pridiction for chunk no:  775
Time elapsed:  0.04368901252746582
Pridiction for chunk no:  776
Time elapsed:  0.04114818572998047
Pridiction for chunk no:  777
Time elapsed:  0.029083251953125
Pridiction for chunk no:  778
Time elapsed:  0.1798992156982422
Pridiction for chunk no:  779
Time elapsed:  0.20333003997802734
Pridiction for chunk no:  780
Time elapsed:  0.1275949478149414
Pridiction for chunk no:  781
Time elapsed:  0.02664494514465332
Pridiction for chunk no:  782
Time elapsed:  0.14398908615112305
Pridiction for chunk no:  783
Time elapsed:  0.01694321632385254
Pridiction for chunk no:  784
Tim

Pridiction for chunk no:  898
Time elapsed:  0.7675168514251709
Pridiction for chunk no:  899
Time elapsed:  0.48708200454711914
Pridiction for chunk no:  900
Time elapsed:  44.14847993850708
Pridiction for chunk no:  901
Time elapsed:  2.8289682865142822
Pridiction for chunk no:  902
Time elapsed:  0.24162912368774414
Pridiction for chunk no:  903
Time elapsed:  12.320036888122559
Pridiction for chunk no:  904
Time elapsed:  1.6026642322540283
Pridiction for chunk no:  905
Time elapsed:  3.5077531337738037
Pridiction for chunk no:  906
Time elapsed:  0.10321211814880371
Pridiction for chunk no:  907
Time elapsed:  0.016283035278320312
Pridiction for chunk no:  908
Time elapsed:  0.9107828140258789
Pridiction for chunk no:  909
Time elapsed:  2.180769681930542
Pridiction for chunk no:  910
Time elapsed:  0.3433878421783447
Pridiction for chunk no:  911
Time elapsed:  1.9026460647583008
Pridiction for chunk no:  912
Time elapsed:  0.14867496490478516
Pridiction for chunk no:  913
Time e

Pridiction for chunk no:  1025
Time elapsed:  0.2336559295654297
Pridiction for chunk no:  1026
Time elapsed:  0.01667189598083496
Pridiction for chunk no:  1027
Time elapsed:  0.2756941318511963
Pridiction for chunk no:  1028
Time elapsed:  0.03847670555114746
Pridiction for chunk no:  1029
Time elapsed:  4.336955785751343
Pridiction for chunk no:  1030
Time elapsed:  0.0962221622467041
Pridiction for chunk no:  1031
Time elapsed:  0.10278797149658203
Pridiction for chunk no:  1032
Time elapsed:  0.918302059173584
Pridiction for chunk no:  1033
Time elapsed:  0.8960220813751221
Pridiction for chunk no:  1034
Time elapsed:  0.027264833450317383
Pridiction for chunk no:  1035
Time elapsed:  6.54882287979126
Pridiction for chunk no:  1036
Time elapsed:  0.5764710903167725
Pridiction for chunk no:  1037
Time elapsed:  20.118993282318115
Pridiction for chunk no:  1038
Time elapsed:  0.06869316101074219
Pridiction for chunk no:  1039
Time elapsed:  1.4174768924713135
Pridiction for chunk no

Pridiction for chunk no:  1153
Time elapsed:  0.16630101203918457
Pridiction for chunk no:  1154
Time elapsed:  0.9108340740203857
Pridiction for chunk no:  1155
Time elapsed:  2.4431469440460205
Pridiction for chunk no:  1156
Time elapsed:  6.076007843017578
Pridiction for chunk no:  1157
Time elapsed:  0.14094209671020508
Pridiction for chunk no:  1158
Time elapsed:  0.3015601634979248
Pridiction for chunk no:  1159
Time elapsed:  0.8465120792388916
Pridiction for chunk no:  1160
Time elapsed:  0.07929277420043945
Pridiction for chunk no:  1161
Time elapsed:  0.37604188919067383
Pridiction for chunk no:  1162
Time elapsed:  1.269806146621704
Pridiction for chunk no:  1163
Time elapsed:  0.0510251522064209
Pridiction for chunk no:  1164
Time elapsed:  0.08780908584594727
Pridiction for chunk no:  1165
Time elapsed:  0.8107466697692871
Pridiction for chunk no:  1166
Time elapsed:  1.639969825744629
Pridiction for chunk no:  1167
Time elapsed:  9.466527938842773
Pridiction for chunk no:

Pridiction for chunk no:  1279
Time elapsed:  0.5379948616027832
Pridiction for chunk no:  1280
Time elapsed:  43.617499113082886
Pridiction for chunk no:  1281
Time elapsed:  0.32932186126708984
Pridiction for chunk no:  1282
Time elapsed:  0.7329871654510498
Pridiction for chunk no:  1283
Time elapsed:  0.05816006660461426
Pridiction for chunk no:  1284
Time elapsed:  0.04874587059020996
Pridiction for chunk no:  1285
Time elapsed:  0.06810402870178223
Pridiction for chunk no:  1286
Time elapsed:  0.1915299892425537
Pridiction for chunk no:  1287
Time elapsed:  13.048079013824463
Pridiction for chunk no:  1288
Time elapsed:  9.802242040634155
Pridiction for chunk no:  1289
Time elapsed:  2.754894256591797
Pridiction for chunk no:  1290
Time elapsed:  0.05560898780822754
Pridiction for chunk no:  1291
Time elapsed:  0.2507309913635254
Pridiction for chunk no:  1292
Time elapsed:  0.10744976997375488
Pridiction for chunk no:  1293
Time elapsed:  0.07682108879089355
Pridiction for chunk

Pridiction for chunk no:  1405
Time elapsed:  0.6169862747192383
Pridiction for chunk no:  1406
Time elapsed:  0.07728981971740723
Pridiction for chunk no:  1407
Time elapsed:  24.974124908447266
Pridiction for chunk no:  1408
Time elapsed:  0.03306317329406738
Pridiction for chunk no:  1409
Time elapsed:  1.2430570125579834
Pridiction for chunk no:  1410
Time elapsed:  0.030054092407226562
Pridiction for chunk no:  1411
Time elapsed:  0.09010004997253418
Pridiction for chunk no:  1412
Time elapsed:  0.31110715866088867
Pridiction for chunk no:  1413
Time elapsed:  0.14530301094055176
Pridiction for chunk no:  1414
Time elapsed:  1.0033092498779297
Pridiction for chunk no:  1415
Time elapsed:  0.060172080993652344
Pridiction for chunk no:  1416
Time elapsed:  2.3289639949798584
Pridiction for chunk no:  1417
Time elapsed:  0.09660482406616211
Pridiction for chunk no:  1418
Time elapsed:  0.0268096923828125
Pridiction for chunk no:  1419
Time elapsed:  0.14995789527893066
Pridiction for

Pridiction for chunk no:  1531
Time elapsed:  5.599873065948486
Pridiction for chunk no:  1532
Time elapsed:  0.047924041748046875
Pridiction for chunk no:  1533
Time elapsed:  3.405958890914917
Pridiction for chunk no:  1534
Time elapsed:  0.27728796005249023
Pridiction for chunk no:  1535
Time elapsed:  0.4547388553619385
Pridiction for chunk no:  1536
Time elapsed:  0.2677879333496094
Pridiction for chunk no:  1537
Time elapsed:  0.10814189910888672
Pridiction for chunk no:  1538
Time elapsed:  0.23619413375854492
Pridiction for chunk no:  1539
Time elapsed:  7.757724285125732
Pridiction for chunk no:  1540
Time elapsed:  0.05358386039733887
Pridiction for chunk no:  1541
Time elapsed:  2.358886241912842
Pridiction for chunk no:  1542
Time elapsed:  2.946519136428833
Pridiction for chunk no:  1543
Time elapsed:  0.04104304313659668
Pridiction for chunk no:  1544
Time elapsed:  0.11474609375
Pridiction for chunk no:  1545
Time elapsed:  0.05360293388366699
Pridiction for chunk no:  1

Pridiction for chunk no:  1657
Time elapsed:  0.2395930290222168
Pridiction for chunk no:  1658
Time elapsed:  0.030261993408203125
Pridiction for chunk no:  1659
Time elapsed:  0.7340478897094727
Pridiction for chunk no:  1660
Time elapsed:  0.10948610305786133
Pridiction for chunk no:  1661
Time elapsed:  0.9161782264709473
Pridiction for chunk no:  1662
Time elapsed:  0.03229022026062012
Pridiction for chunk no:  1663
Time elapsed:  0.0499720573425293
Pridiction for chunk no:  1664
Time elapsed:  0.8875761032104492
Pridiction for chunk no:  1665
Time elapsed:  0.03428292274475098
Pridiction for chunk no:  1666
Time elapsed:  0.04227590560913086
Pridiction for chunk no:  1667
Time elapsed:  0.14184999465942383
Pridiction for chunk no:  1668
Time elapsed:  0.038706302642822266
Pridiction for chunk no:  1669
Time elapsed:  0.04713892936706543
Pridiction for chunk no:  1670
Time elapsed:  3.6261959075927734
Pridiction for chunk no:  1671
Time elapsed:  0.12199592590332031
Pridiction for

Pridiction for chunk no:  1783
Time elapsed:  0.34227919578552246
Pridiction for chunk no:  1784
Time elapsed:  0.02733612060546875
Pridiction for chunk no:  1785
Time elapsed:  0.1103360652923584
Pridiction for chunk no:  1786
Time elapsed:  6.017061233520508
Pridiction for chunk no:  1787
Time elapsed:  0.019871950149536133
Pridiction for chunk no:  1788
Time elapsed:  0.4640829563140869
Pridiction for chunk no:  1789
Time elapsed:  0.8394901752471924
Pridiction for chunk no:  1790
Time elapsed:  1.5692598819732666
Pridiction for chunk no:  1791
Time elapsed:  0.04882097244262695
Pridiction for chunk no:  1792
Time elapsed:  0.03506588935852051
Pridiction for chunk no:  1793
Time elapsed:  0.20447826385498047
Pridiction for chunk no:  1794
Time elapsed:  0.1894388198852539
Pridiction for chunk no:  1795
Time elapsed:  0.03863406181335449
Pridiction for chunk no:  1796
Time elapsed:  0.3915088176727295
Pridiction for chunk no:  1797
Time elapsed:  0.11600589752197266
Pridiction for ch

Pridiction for chunk no:  1909
Time elapsed:  1.9658229351043701
Pridiction for chunk no:  1910
Time elapsed:  0.028370141983032227
Pridiction for chunk no:  1911
Time elapsed:  0.7342028617858887
Pridiction for chunk no:  1912
Time elapsed:  0.16292285919189453
Pridiction for chunk no:  1913
Time elapsed:  1.874884843826294
Pridiction for chunk no:  1914
Time elapsed:  0.5892741680145264
Pridiction for chunk no:  1915
Time elapsed:  0.32633209228515625
Pridiction for chunk no:  1916
Time elapsed:  0.08823871612548828
Pridiction for chunk no:  1917
Time elapsed:  0.0940101146697998
Pridiction for chunk no:  1918
Time elapsed:  221.5383689403534
Pridiction for chunk no:  1919
Time elapsed:  0.08339691162109375
Pridiction for chunk no:  1920
Time elapsed:  1.0967121124267578
Pridiction for chunk no:  1921
Time elapsed:  0.051371097564697266
Pridiction for chunk no:  1922
Time elapsed:  2.0157902240753174
Pridiction for chunk no:  1923
Time elapsed:  0.112335205078125
Pridiction for chunk

Pridiction for chunk no:  2036
Time elapsed:  0.2186272144317627
Pridiction for chunk no:  2037
Time elapsed:  0.3265352249145508
Pridiction for chunk no:  2038
Time elapsed:  1.804419994354248
Pridiction for chunk no:  2039
Time elapsed:  0.05628490447998047
Pridiction for chunk no:  2040
Time elapsed:  0.18528103828430176
Pridiction for chunk no:  2041
Time elapsed:  0.042289018630981445
Pridiction for chunk no:  2042
Time elapsed:  0.06302118301391602
Pridiction for chunk no:  2043
Time elapsed:  0.6496899127960205
Pridiction for chunk no:  2044
Time elapsed:  0.14469408988952637
Pridiction for chunk no:  2045
Time elapsed:  0.04144597053527832
Pridiction for chunk no:  2046
Time elapsed:  0.5340359210968018
Pridiction for chunk no:  2047
Time elapsed:  0.13978290557861328
Pridiction for chunk no:  2048
Time elapsed:  0.15269994735717773
Pridiction for chunk no:  2049
Time elapsed:  0.022120952606201172
Pridiction for chunk no:  2050
Time elapsed:  0.7929458618164062
Pridiction for 

Pridiction for chunk no:  2162
Time elapsed:  5.132156848907471
Pridiction for chunk no:  2163
Time elapsed:  129.5709171295166
Pridiction for chunk no:  2164
Time elapsed:  1.0432970523834229
Pridiction for chunk no:  2165
Time elapsed:  0.2487800121307373
Pridiction for chunk no:  2166
Time elapsed:  0.7372798919677734
Pridiction for chunk no:  2167
Time elapsed:  0.4510810375213623
Pridiction for chunk no:  2168
Time elapsed:  0.037000179290771484
Pridiction for chunk no:  2169
Time elapsed:  1.4166841506958008
Pridiction for chunk no:  2170
Time elapsed:  2.9032680988311768
Pridiction for chunk no:  2171
Time elapsed:  0.9017901420593262
Pridiction for chunk no:  2172
Time elapsed:  0.10422706604003906
Pridiction for chunk no:  2173
Time elapsed:  0.050855159759521484
Pridiction for chunk no:  2174
Time elapsed:  0.57460618019104
Pridiction for chunk no:  2175
Time elapsed:  950.3304500579834
Pridiction for chunk no:  2176
Time elapsed:  0.089080810546875
Pridiction for chunk no:  

Pridiction for chunk no:  2288
Time elapsed:  1.386807918548584
Pridiction for chunk no:  2289
Time elapsed:  0.821382999420166
Pridiction for chunk no:  2290
Time elapsed:  0.26947617530822754
Pridiction for chunk no:  2291
Time elapsed:  3.216996192932129
Pridiction for chunk no:  2292
Time elapsed:  0.24216508865356445
Pridiction for chunk no:  2293
Time elapsed:  0.027313947677612305
Pridiction for chunk no:  2294
Time elapsed:  0.13255691528320312
Pridiction for chunk no:  2295
Time elapsed:  9.447968006134033
Pridiction for chunk no:  2296
Time elapsed:  0.25383687019348145
Pridiction for chunk no:  2297
Time elapsed:  0.2357339859008789
Pridiction for chunk no:  2298
Time elapsed:  0.11385393142700195
Pridiction for chunk no:  2299
Time elapsed:  0.03866171836853027
Pridiction for chunk no:  2300
Time elapsed:  0.04787087440490723
Pridiction for chunk no:  2301
Time elapsed:  0.03248190879821777
Pridiction for chunk no:  2302
Time elapsed:  0.1147761344909668
Pridiction for chun

Pridiction for chunk no:  2414
Time elapsed:  17.695953845977783
Pridiction for chunk no:  2415
Time elapsed:  8.679585933685303
Pridiction for chunk no:  2416
Time elapsed:  16.19166398048401
Pridiction for chunk no:  2417
Time elapsed:  2.9112980365753174
Pridiction for chunk no:  2418
Time elapsed:  0.4553048610687256
Pridiction for chunk no:  2419
Time elapsed:  0.023504018783569336
Pridiction for chunk no:  2420
Time elapsed:  19.00049614906311
Pridiction for chunk no:  2421
Time elapsed:  0.7329258918762207
Pridiction for chunk no:  2422
Time elapsed:  0.3578958511352539
Pridiction for chunk no:  2423
Time elapsed:  0.3928210735321045
Pridiction for chunk no:  2424
Time elapsed:  1.740196704864502
Pridiction for chunk no:  2425
Time elapsed:  19.63369607925415
Pridiction for chunk no:  2426
Time elapsed:  6.769175052642822
Pridiction for chunk no:  2427
Time elapsed:  0.05356788635253906
Pridiction for chunk no:  2428
Time elapsed:  0.0682821273803711
Pridiction for chunk no:  24

Pridiction for chunk no:  2543
Time elapsed:  0.08937692642211914
Pridiction for chunk no:  2544
Time elapsed:  0.6544997692108154
Pridiction for chunk no:  2545
Time elapsed:  0.0922701358795166
Pridiction for chunk no:  2546
Time elapsed:  0.21391892433166504
Pridiction for chunk no:  2547
Time elapsed:  0.036543846130371094
Pridiction for chunk no:  2548
Time elapsed:  0.05222773551940918
Pridiction for chunk no:  2549
Time elapsed:  0.07846879959106445
Pridiction for chunk no:  2550
Time elapsed:  1.7409589290618896
Pridiction for chunk no:  2551
Time elapsed:  0.1270449161529541
Pridiction for chunk no:  2552
Time elapsed:  0.10676097869873047
Pridiction for chunk no:  2553
Time elapsed:  2.2110838890075684
Pridiction for chunk no:  2554
Time elapsed:  0.45694899559020996
Pridiction for chunk no:  2555
Time elapsed:  0.5809080600738525
Pridiction for chunk no:  2556
Time elapsed:  0.03325390815734863
Pridiction for chunk no:  2557
Time elapsed:  0.06497716903686523
Pridiction for 

Pridiction for chunk no:  2670
Time elapsed:  0.3800208568572998
Pridiction for chunk no:  2671
Time elapsed:  0.050295114517211914
Pridiction for chunk no:  2672
Time elapsed:  19.55516004562378
Pridiction for chunk no:  2673
Time elapsed:  0.04825282096862793
Pridiction for chunk no:  2674
Time elapsed:  3.0245461463928223
Pridiction for chunk no:  2675
Time elapsed:  0.5214619636535645
Pridiction for chunk no:  2676
Time elapsed:  0.050350189208984375
Pridiction for chunk no:  2677
Time elapsed:  0.04259490966796875
Pridiction for chunk no:  2678
Time elapsed:  19.854334115982056
Pridiction for chunk no:  2679
Time elapsed:  0.07147693634033203
Pridiction for chunk no:  2680
Time elapsed:  2.260227918624878
Pridiction for chunk no:  2681
Time elapsed:  0.24693775177001953
Pridiction for chunk no:  2682
Time elapsed:  0.0330810546875
Pridiction for chunk no:  2683
Time elapsed:  0.5571310520172119
Pridiction for chunk no:  2684
Time elapsed:  4.13204288482666
Pridiction for chunk no:

Pridiction for chunk no:  2796
Time elapsed:  1.9012508392333984
Pridiction for chunk no:  2797
Time elapsed:  0.10658788681030273
Pridiction for chunk no:  2798
Time elapsed:  1.6110329627990723
Pridiction for chunk no:  2799
Time elapsed:  16.445533752441406
Pridiction for chunk no:  2800
Time elapsed:  0.0955359935760498
Pridiction for chunk no:  2801
Time elapsed:  0.06872797012329102
Pridiction for chunk no:  2802
Time elapsed:  0.18312406539916992
Pridiction for chunk no:  2803
Time elapsed:  0.18124794960021973
Pridiction for chunk no:  2804
Time elapsed:  14.903814792633057
Pridiction for chunk no:  2805
Time elapsed:  0.09632587432861328
Pridiction for chunk no:  2806
Time elapsed:  0.06904315948486328
Pridiction for chunk no:  2807
Time elapsed:  0.2778480052947998
Pridiction for chunk no:  2808
Time elapsed:  0.04683494567871094
Pridiction for chunk no:  2809
Time elapsed:  0.14076018333435059
Pridiction for chunk no:  2810
Time elapsed:  0.0991983413696289
Pridiction for ch

Pridiction for chunk no:  2923
Time elapsed:  0.04882478713989258
Pridiction for chunk no:  2924
Time elapsed:  0.1712789535522461
Pridiction for chunk no:  2925
Time elapsed:  23.875988006591797
Pridiction for chunk no:  2926
Time elapsed:  0.6343967914581299
Pridiction for chunk no:  2927
Time elapsed:  0.33113527297973633
Pridiction for chunk no:  2928
Time elapsed:  0.06923127174377441
Pridiction for chunk no:  2929
Time elapsed:  0.12400197982788086
Pridiction for chunk no:  2930
Time elapsed:  4.137103796005249
Pridiction for chunk no:  2931
Time elapsed:  64.75882506370544
Pridiction for chunk no:  2932
Time elapsed:  0.30309391021728516
Pridiction for chunk no:  2933
Time elapsed:  0.015243053436279297
Pridiction for chunk no:  2934
Time elapsed:  0.06345295906066895
Pridiction for chunk no:  2935
Time elapsed:  0.05635476112365723
Pridiction for chunk no:  2936
Time elapsed:  0.03579401969909668
Pridiction for chunk no:  2937
Time elapsed:  0.049807071685791016
Pridiction for 

Pridiction for chunk no:  3049
Time elapsed:  0.4351832866668701
Pridiction for chunk no:  3050
Time elapsed:  0.11647510528564453
Pridiction for chunk no:  3051
Time elapsed:  0.601895809173584
Pridiction for chunk no:  3052
Time elapsed:  0.06509184837341309
Pridiction for chunk no:  3053
Time elapsed:  0.03750205039978027
Pridiction for chunk no:  3054
Time elapsed:  0.07778620719909668
Pridiction for chunk no:  3055
Time elapsed:  74.03244400024414
Pridiction for chunk no:  3056
Time elapsed:  1.916290044784546
Pridiction for chunk no:  3057
Time elapsed:  0.6478362083435059
Pridiction for chunk no:  3058
Time elapsed:  1.2216479778289795
Pridiction for chunk no:  3059
Time elapsed:  1.761998176574707
Pridiction for chunk no:  3060
Time elapsed:  0.03081226348876953
Pridiction for chunk no:  3061
Time elapsed:  0.020759105682373047
Pridiction for chunk no:  3062
Time elapsed:  0.0606999397277832
Pridiction for chunk no:  3063
Time elapsed:  6.4788148403167725
Pridiction for chunk n

Pridiction for chunk no:  3175
Time elapsed:  5.460804224014282
Pridiction for chunk no:  3176
Time elapsed:  0.03596186637878418
Pridiction for chunk no:  3177
Time elapsed:  0.1862320899963379
Pridiction for chunk no:  3178
Time elapsed:  0.05181479454040527
Pridiction for chunk no:  3179
Time elapsed:  12.990059852600098
Pridiction for chunk no:  3180
Time elapsed:  0.06996679306030273
Pridiction for chunk no:  3181
Time elapsed:  0.10569286346435547
Pridiction for chunk no:  3182
Time elapsed:  0.03748607635498047
Pridiction for chunk no:  3183
Time elapsed:  4.722891092300415
Pridiction for chunk no:  3184
Time elapsed:  0.23192930221557617
Pridiction for chunk no:  3185
Time elapsed:  0.08865499496459961
Pridiction for chunk no:  3186
Time elapsed:  0.036077260971069336
Pridiction for chunk no:  3187
Time elapsed:  0.24042081832885742
Pridiction for chunk no:  3188
Time elapsed:  0.032552242279052734
Pridiction for chunk no:  3189
Time elapsed:  0.7713029384613037
Pridiction for 

Pridiction for chunk no:  3301
Time elapsed:  0.33017396926879883
Pridiction for chunk no:  3302
Time elapsed:  9.669530868530273
Pridiction for chunk no:  3303
Time elapsed:  0.06794381141662598
Pridiction for chunk no:  3304
Time elapsed:  2.515493154525757
Pridiction for chunk no:  3305
Time elapsed:  0.23031210899353027
Pridiction for chunk no:  3306
Time elapsed:  0.1293492317199707
Pridiction for chunk no:  3307
Time elapsed:  0.2517571449279785
Pridiction for chunk no:  3308
Time elapsed:  1.33927583694458
Pridiction for chunk no:  3309
Time elapsed:  1.324690341949463
Pridiction for chunk no:  3310
Time elapsed:  0.05586886405944824
Pridiction for chunk no:  3311
Time elapsed:  0.04203605651855469
Pridiction for chunk no:  3312
Time elapsed:  0.1638197898864746
Pridiction for chunk no:  3313
Time elapsed:  0.2823789119720459
Pridiction for chunk no:  3314
Time elapsed:  0.05551409721374512
Pridiction for chunk no:  3315
Time elapsed:  0.17538714408874512
Pridiction for chunk no

Pridiction for chunk no:  3429
Time elapsed:  0.4389979839324951
Pridiction for chunk no:  3430
Time elapsed:  0.11033892631530762
Pridiction for chunk no:  3431
Time elapsed:  0.11730194091796875
Pridiction for chunk no:  3432
Time elapsed:  0.0232696533203125
Pridiction for chunk no:  3433
Time elapsed:  0.06729698181152344
Pridiction for chunk no:  3434
Time elapsed:  0.17604804039001465
Pridiction for chunk no:  3435
Time elapsed:  0.7830491065979004
Pridiction for chunk no:  3436
Time elapsed:  7.009062051773071
Pridiction for chunk no:  3437
Time elapsed:  0.9680371284484863
Pridiction for chunk no:  3438
Time elapsed:  57.66536784172058
Pridiction for chunk no:  3439
Time elapsed:  20.744728088378906
Pridiction for chunk no:  3440
Time elapsed:  1.6103169918060303
Pridiction for chunk no:  3441
Time elapsed:  7.437278985977173
Pridiction for chunk no:  3442
Time elapsed:  0.41513872146606445
Pridiction for chunk no:  3443
Time elapsed:  7.449205160140991
Pridiction for chunk no:

Pridiction for chunk no:  3555
Time elapsed:  0.3165252208709717
Pridiction for chunk no:  3556
Time elapsed:  0.054045915603637695
Pridiction for chunk no:  3557
Time elapsed:  0.4908721446990967
Pridiction for chunk no:  3558
Time elapsed:  0.32207703590393066
Pridiction for chunk no:  3559
Time elapsed:  1.0263547897338867
Pridiction for chunk no:  3560
Time elapsed:  0.9931371212005615
Pridiction for chunk no:  3561
Time elapsed:  1.6268770694732666
Pridiction for chunk no:  3562
Time elapsed:  0.02818894386291504
Pridiction for chunk no:  3563
Time elapsed:  0.3728928565979004
Pridiction for chunk no:  3564
Time elapsed:  0.03219199180603027
Pridiction for chunk no:  3565
Time elapsed:  0.14957404136657715
Pridiction for chunk no:  3566
Time elapsed:  0.06262779235839844
Pridiction for chunk no:  3567
Time elapsed:  0.0384061336517334
Pridiction for chunk no:  3568
Time elapsed:  0.01816391944885254
Pridiction for chunk no:  3569
Time elapsed:  0.19090008735656738
Pridiction for c

Pridiction for chunk no:  3681
Time elapsed:  0.7792439460754395
Pridiction for chunk no:  3682
Time elapsed:  0.09988141059875488
Pridiction for chunk no:  3683
Time elapsed:  0.2684779167175293
Pridiction for chunk no:  3684
Time elapsed:  0.45162391662597656
Pridiction for chunk no:  3685
Time elapsed:  0.13037800788879395
Pridiction for chunk no:  3686
Time elapsed:  0.7067608833312988
Pridiction for chunk no:  3687
Time elapsed:  0.6113228797912598
Pridiction for chunk no:  3688
Time elapsed:  0.752129077911377
Pridiction for chunk no:  3689
Time elapsed:  0.10613083839416504
Pridiction for chunk no:  3690
Time elapsed:  1.6342227458953857
Pridiction for chunk no:  3691
Time elapsed:  4.953038692474365
Pridiction for chunk no:  3692
Time elapsed:  0.816720724105835
Pridiction for chunk no:  3693
Time elapsed:  0.054512977600097656
Pridiction for chunk no:  3694
Time elapsed:  1.0622501373291016
Pridiction for chunk no:  3695
Time elapsed:  0.3016200065612793
Pridiction for chunk n

Pridiction for chunk no:  3807
Time elapsed:  2.6312100887298584
Pridiction for chunk no:  3808
Time elapsed:  0.03516888618469238
Pridiction for chunk no:  3809
Time elapsed:  3.5684502124786377
Pridiction for chunk no:  3810
Time elapsed:  2.279616594314575
Pridiction for chunk no:  3811
Time elapsed:  0.040563106536865234
Pridiction for chunk no:  3812
Time elapsed:  0.02204418182373047
Pridiction for chunk no:  3813
Time elapsed:  0.49698519706726074
Pridiction for chunk no:  3814
Time elapsed:  1.0931189060211182
Pridiction for chunk no:  3815
Time elapsed:  0.23131227493286133
Pridiction for chunk no:  3816
Time elapsed:  0.9911081790924072
Pridiction for chunk no:  3817
Time elapsed:  0.04181694984436035
Pridiction for chunk no:  3818
Time elapsed:  0.5143270492553711
Pridiction for chunk no:  3819
Time elapsed:  0.5380818843841553
Pridiction for chunk no:  3820
Time elapsed:  8.115378379821777
Pridiction for chunk no:  3821
Time elapsed:  0.2967820167541504
Pridiction for chunk

Pridiction for chunk no:  3934
Time elapsed:  0.32329702377319336
Pridiction for chunk no:  3935
Time elapsed:  0.05321526527404785
Pridiction for chunk no:  3936
Time elapsed:  0.07816863059997559
Pridiction for chunk no:  3937
Time elapsed:  1.0977540016174316
Pridiction for chunk no:  3938
Time elapsed:  0.012227058410644531
Pridiction for chunk no:  3939
Time elapsed:  16.749524116516113
Pridiction for chunk no:  3940
Time elapsed:  0.147017240524292
Pridiction for chunk no:  3941
Time elapsed:  1.1996638774871826
Pridiction for chunk no:  3942
Time elapsed:  0.33746886253356934
Pridiction for chunk no:  3943
Time elapsed:  0.14567899703979492
Pridiction for chunk no:  3944
Time elapsed:  0.19610595703125
Pridiction for chunk no:  3945
Time elapsed:  0.8850030899047852
Pridiction for chunk no:  3946
Time elapsed:  0.3177659511566162
Pridiction for chunk no:  3947
Time elapsed:  1.6921300888061523
Pridiction for chunk no:  3948
Time elapsed:  2.8044626712799072
Pridiction for chunk 

Pridiction for chunk no:  4060
Time elapsed:  0.6214351654052734
Pridiction for chunk no:  4061
Time elapsed:  0.10184192657470703
Pridiction for chunk no:  4062
Time elapsed:  0.1696462631225586
Pridiction for chunk no:  4063
Time elapsed:  0.018929243087768555
Pridiction for chunk no:  4064
Time elapsed:  0.12814116477966309
Pridiction for chunk no:  4065
Time elapsed:  0.44333624839782715
Pridiction for chunk no:  4066
Time elapsed:  0.15312910079956055
Pridiction for chunk no:  4067
Time elapsed:  0.18300509452819824
Pridiction for chunk no:  4068
Time elapsed:  0.10771679878234863
Pridiction for chunk no:  4069
Time elapsed:  0.3969118595123291
Pridiction for chunk no:  4070
Time elapsed:  0.8495001792907715
Pridiction for chunk no:  4071
Time elapsed:  0.11965203285217285
Pridiction for chunk no:  4072
Time elapsed:  0.16718411445617676
Pridiction for chunk no:  4073
Time elapsed:  1.1976509094238281
Pridiction for chunk no:  4074
Time elapsed:  1.2917349338531494
Pridiction for 

Pridiction for chunk no:  4188
Time elapsed:  0.23717784881591797
Pridiction for chunk no:  4189
Time elapsed:  0.09737610816955566
Pridiction for chunk no:  4190
Time elapsed:  0.041889190673828125
Pridiction for chunk no:  4191
Time elapsed:  0.06853985786437988
Pridiction for chunk no:  4192
Time elapsed:  0.050813913345336914
Pridiction for chunk no:  4193
Time elapsed:  0.2635819911956787
Pridiction for chunk no:  4194
Time elapsed:  0.1287980079650879
Pridiction for chunk no:  4195
Time elapsed:  6.522794008255005
Pridiction for chunk no:  4196
Time elapsed:  1.5649421215057373
Pridiction for chunk no:  4197
Time elapsed:  12.932546138763428
Pridiction for chunk no:  4198
Time elapsed:  0.09788918495178223
Pridiction for chunk no:  4199
Time elapsed:  0.09434890747070312
Pridiction for chunk no:  4200
Time elapsed:  0.45969414710998535
Pridiction for chunk no:  4201
Time elapsed:  2.8456599712371826
Pridiction for chunk no:  4202
Time elapsed:  0.04599905014038086
Pridiction for 

Pridiction for chunk no:  4315
Time elapsed:  1.0613889694213867
Pridiction for chunk no:  4316
Time elapsed:  3.7557129859924316
Pridiction for chunk no:  4317
Time elapsed:  2.451179027557373
Pridiction for chunk no:  4318
Time elapsed:  0.4417247772216797
Pridiction for chunk no:  4319
Time elapsed:  4.490937232971191
Pridiction for chunk no:  4320
Time elapsed:  0.24952292442321777
Pridiction for chunk no:  4321
Time elapsed:  0.06589007377624512
Pridiction for chunk no:  4322
Time elapsed:  4.499582052230835
Pridiction for chunk no:  4323
Time elapsed:  0.03133273124694824
Pridiction for chunk no:  4324
Time elapsed:  1.9393069744110107
Pridiction for chunk no:  4325
Time elapsed:  0.07977294921875
Pridiction for chunk no:  4326
Time elapsed:  0.23496007919311523
Pridiction for chunk no:  4327
Time elapsed:  0.1490001678466797
Pridiction for chunk no:  4328
Time elapsed:  0.08131885528564453
Pridiction for chunk no:  4329
Time elapsed:  0.07381677627563477
Pridiction for chunk no:

Pridiction for chunk no:  4441
Time elapsed:  0.9988818168640137
Pridiction for chunk no:  4442
Time elapsed:  0.2064661979675293
Pridiction for chunk no:  4443
Time elapsed:  0.4833106994628906
Pridiction for chunk no:  4444
Time elapsed:  0.06494474411010742
Pridiction for chunk no:  4445
Time elapsed:  0.030766010284423828
Pridiction for chunk no:  4446
Time elapsed:  0.9363288879394531
Pridiction for chunk no:  4447
Time elapsed:  1.499654769897461
Pridiction for chunk no:  4448
Time elapsed:  1.828110694885254
Pridiction for chunk no:  4449
Time elapsed:  3.5143258571624756
Pridiction for chunk no:  4450
Time elapsed:  0.2162618637084961
Pridiction for chunk no:  4451
Time elapsed:  0.06529593467712402
Pridiction for chunk no:  4452
Time elapsed:  0.03488302230834961
Pridiction for chunk no:  4453
Time elapsed:  11.627497911453247
Pridiction for chunk no:  4454
Time elapsed:  0.050650835037231445
Pridiction for chunk no:  4455
Time elapsed:  0.1043100357055664
Pridiction for chunk

Pridiction for chunk no:  4569
Time elapsed:  10.878185033798218
Pridiction for chunk no:  4570
Time elapsed:  9.328129053115845
Pridiction for chunk no:  4571
Time elapsed:  1.3500332832336426
Pridiction for chunk no:  4572
Time elapsed:  0.03337693214416504
Pridiction for chunk no:  4573
Time elapsed:  0.021904945373535156
Pridiction for chunk no:  4574
Time elapsed:  0.283048152923584
Pridiction for chunk no:  4575
Time elapsed:  0.034052133560180664
Pridiction for chunk no:  4576
Time elapsed:  1.2786109447479248
Pridiction for chunk no:  4577
Time elapsed:  4.02272629737854
Pridiction for chunk no:  4578
Time elapsed:  0.24570178985595703
Pridiction for chunk no:  4579
Time elapsed:  0.21785783767700195
Pridiction for chunk no:  4580
Time elapsed:  19.253971815109253
Pridiction for chunk no:  4581
Time elapsed:  0.24523401260375977
Pridiction for chunk no:  4582
Time elapsed:  1.9827358722686768
Pridiction for chunk no:  4583
Time elapsed:  0.03626704216003418
Pridiction for chunk

Pridiction for chunk no:  4696
Time elapsed:  0.05549001693725586
Pridiction for chunk no:  4697
Time elapsed:  0.9454593658447266
Pridiction for chunk no:  4698
Time elapsed:  0.08031582832336426
Pridiction for chunk no:  4699
Time elapsed:  28.609534978866577
Pridiction for chunk no:  4700
Time elapsed:  3.0908432006835938
Pridiction for chunk no:  4701
Time elapsed:  17.97972297668457
Pridiction for chunk no:  4702
Time elapsed:  0.024461030960083008
Pridiction for chunk no:  4703
Time elapsed:  0.3716411590576172
Pridiction for chunk no:  4704
Time elapsed:  0.045406341552734375
Pridiction for chunk no:  4705
Time elapsed:  0.13604497909545898
Pridiction for chunk no:  4706
Time elapsed:  0.20891094207763672
Pridiction for chunk no:  4707
Time elapsed:  0.20785307884216309
Pridiction for chunk no:  4708
Time elapsed:  0.056854963302612305
Pridiction for chunk no:  4709
Time elapsed:  0.08716106414794922
Pridiction for chunk no:  4710
Time elapsed:  0.021526575088500977
Pridiction f

Pridiction for chunk no:  4822
Time elapsed:  24.72871494293213
Pridiction for chunk no:  4823
Time elapsed:  0.7013158798217773
Pridiction for chunk no:  4824
Time elapsed:  0.0341489315032959
Pridiction for chunk no:  4825
Time elapsed:  12.24532413482666
Pridiction for chunk no:  4826
Time elapsed:  0.041152000427246094
Pridiction for chunk no:  4827
Time elapsed:  0.41560792922973633
Pridiction for chunk no:  4828
Time elapsed:  6.669635772705078
Pridiction for chunk no:  4829
Time elapsed:  0.026453018188476562
Pridiction for chunk no:  4830
Time elapsed:  0.20245623588562012
Pridiction for chunk no:  4831
Time elapsed:  2.2307300567626953
Pridiction for chunk no:  4832
Time elapsed:  0.042524099349975586
Pridiction for chunk no:  4833
Time elapsed:  3.1918351650238037
Pridiction for chunk no:  4834
Time elapsed:  0.11659693717956543
Pridiction for chunk no:  4835
Time elapsed:  0.8993809223175049
Pridiction for chunk no:  4836
Time elapsed:  0.04152393341064453
Pridiction for chu

Pridiction for chunk no:  4947
Time elapsed:  8.644846200942993
Pridiction for chunk no:  4948
Time elapsed:  0.16114592552185059
Pridiction for chunk no:  4949
Time elapsed:  0.17927932739257812
Pridiction for chunk no:  4950
Time elapsed:  0.020650148391723633
Pridiction for chunk no:  4951
Time elapsed:  0.7073280811309814
Pridiction for chunk no:  4952
Time elapsed:  0.12893199920654297
Pridiction for chunk no:  4953
Time elapsed:  0.022632837295532227
Pridiction for chunk no:  4954
Time elapsed:  0.1438448429107666
Pridiction for chunk no:  4955
Time elapsed:  0.2769510746002197
Pridiction for chunk no:  4956
Time elapsed:  0.27977585792541504
Pridiction for chunk no:  4957
Time elapsed:  0.02707982063293457
Pridiction for chunk no:  4958
Time elapsed:  0.08789205551147461
Pridiction for chunk no:  4959
Time elapsed:  0.4554409980773926
Pridiction for chunk no:  4960
Time elapsed:  0.7772080898284912
Pridiction for chunk no:  4961
Time elapsed:  0.04242300987243652
Pridiction for 

Pridiction for chunk no:  5073
Time elapsed:  0.8539493083953857
Pridiction for chunk no:  5074
Time elapsed:  0.056946754455566406
Pridiction for chunk no:  5075
Time elapsed:  0.9387009143829346
Pridiction for chunk no:  5076
Time elapsed:  0.020054101943969727
Pridiction for chunk no:  5077
Time elapsed:  1.123122215270996
Pridiction for chunk no:  5078
Time elapsed:  0.08146905899047852
Pridiction for chunk no:  5079
Time elapsed:  0.2653501033782959
Pridiction for chunk no:  5080
Time elapsed:  0.1152808666229248
Pridiction for chunk no:  5081
Time elapsed:  0.0690312385559082
Pridiction for chunk no:  5082
Time elapsed:  0.033406972885131836
Pridiction for chunk no:  5083
Time elapsed:  0.22289204597473145
Pridiction for chunk no:  5084
Time elapsed:  2.751970052719116
Pridiction for chunk no:  5085
Time elapsed:  0.5953998565673828
Pridiction for chunk no:  5086
Time elapsed:  0.03779411315917969
Pridiction for chunk no:  5087
Time elapsed:  0.4185171127319336
Pridiction for chu

Pridiction for chunk no:  5199
Time elapsed:  1.9794890880584717
Pridiction for chunk no:  5200
Time elapsed:  0.053117990493774414
Pridiction for chunk no:  5201
Time elapsed:  3.564177989959717
Pridiction for chunk no:  5202
Time elapsed:  0.05192208290100098
Pridiction for chunk no:  5203
Time elapsed:  0.8614518642425537
Pridiction for chunk no:  5204
Time elapsed:  2.4151499271392822
Pridiction for chunk no:  5205
Time elapsed:  25.374186038970947
Pridiction for chunk no:  5206
Time elapsed:  0.5753200054168701
Pridiction for chunk no:  5207
Time elapsed:  1.0747928619384766
Pridiction for chunk no:  5208
Time elapsed:  0.08523917198181152
Pridiction for chunk no:  5209
Time elapsed:  0.2875180244445801
Pridiction for chunk no:  5210
Time elapsed:  0.020030975341796875
Pridiction for chunk no:  5211
Time elapsed:  6.600864887237549
Pridiction for chunk no:  5212
Time elapsed:  6.327651023864746
Pridiction for chunk no:  5213
Time elapsed:  0.024058818817138672
Pridiction for chunk

Pridiction for chunk no:  5325
Time elapsed:  5.705017805099487
Pridiction for chunk no:  5326
Time elapsed:  0.04749011993408203
Pridiction for chunk no:  5327
Time elapsed:  0.1348731517791748
Pridiction for chunk no:  5328
Time elapsed:  0.14430999755859375
Pridiction for chunk no:  5329
Time elapsed:  7.514366865158081
Pridiction for chunk no:  5330
Time elapsed:  0.040146827697753906
Pridiction for chunk no:  5331
Time elapsed:  0.08814597129821777
Pridiction for chunk no:  5332
Time elapsed:  0.05783414840698242
Pridiction for chunk no:  5333
Time elapsed:  0.2570018768310547
Pridiction for chunk no:  5334
Time elapsed:  0.023705005645751953
Pridiction for chunk no:  5335
Time elapsed:  1.4390058517456055
Pridiction for chunk no:  5336
Time elapsed:  22.38445806503296
Pridiction for chunk no:  5337
Time elapsed:  2.3216300010681152
Pridiction for chunk no:  5338
Time elapsed:  2.076744318008423
Pridiction for chunk no:  5339
Time elapsed:  30.654002904891968
Pridiction for chunk 

Pridiction for chunk no:  5452
Time elapsed:  16.57384705543518
Pridiction for chunk no:  5453
Time elapsed:  13.381373882293701
Pridiction for chunk no:  5454
Time elapsed:  2.0232303142547607
Pridiction for chunk no:  5455
Time elapsed:  17.623174905776978
Pridiction for chunk no:  5456
Time elapsed:  0.3125431537628174
Pridiction for chunk no:  5457
Time elapsed:  0.19051885604858398
Pridiction for chunk no:  5458
Time elapsed:  2.693682909011841
Pridiction for chunk no:  5459
Time elapsed:  3.5212759971618652
Pridiction for chunk no:  5460
Time elapsed:  33.16844415664673
Pridiction for chunk no:  5461
Time elapsed:  0.9543921947479248
Pridiction for chunk no:  5462
Time elapsed:  0.11525487899780273
Pridiction for chunk no:  5463
Time elapsed:  0.13886189460754395
Pridiction for chunk no:  5464
Time elapsed:  0.28999781608581543
Pridiction for chunk no:  5465
Time elapsed:  2.5912678241729736
Pridiction for chunk no:  5466
Time elapsed:  0.18670892715454102
Pridiction for chunk no

Pridiction for chunk no:  5578
Time elapsed:  4.531329870223999
Pridiction for chunk no:  5579
Time elapsed:  0.25661206245422363
Pridiction for chunk no:  5580
Time elapsed:  9.364604949951172
Pridiction for chunk no:  5581
Time elapsed:  1.070005178451538
Pridiction for chunk no:  5582
Time elapsed:  0.06508898735046387
Pridiction for chunk no:  5583
Time elapsed:  0.15914106369018555
Pridiction for chunk no:  5584
Time elapsed:  0.10860276222229004
Pridiction for chunk no:  5585
Time elapsed:  0.050171852111816406
Pridiction for chunk no:  5586
Time elapsed:  0.4060397148132324
Pridiction for chunk no:  5587
Time elapsed:  7.068563938140869
Pridiction for chunk no:  5588
Time elapsed:  0.03247809410095215
Pridiction for chunk no:  5589
Time elapsed:  0.23282408714294434
Pridiction for chunk no:  5590
Time elapsed:  0.0988309383392334
Pridiction for chunk no:  5591
Time elapsed:  0.7491307258605957
Pridiction for chunk no:  5592
Time elapsed:  2.7978110313415527
Pridiction for chunk 

Pridiction for chunk no:  5704
Time elapsed:  38.77610802650452
Pridiction for chunk no:  5705
Time elapsed:  0.6742188930511475
Pridiction for chunk no:  5706
Time elapsed:  1.6593499183654785
Pridiction for chunk no:  5707
Time elapsed:  0.7732629776000977
Pridiction for chunk no:  5708
Time elapsed:  0.024795055389404297
Pridiction for chunk no:  5709
Time elapsed:  0.02943897247314453
Pridiction for chunk no:  5710
Time elapsed:  0.02779388427734375
Pridiction for chunk no:  5711
Time elapsed:  0.053002119064331055
Pridiction for chunk no:  5712
Time elapsed:  0.647972822189331
Pridiction for chunk no:  5713
Time elapsed:  0.039438724517822266
Pridiction for chunk no:  5714
Time elapsed:  0.3599386215209961
Pridiction for chunk no:  5715
Time elapsed:  0.03206992149353027
Pridiction for chunk no:  5716
Time elapsed:  0.021551132202148438
Pridiction for chunk no:  5717
Time elapsed:  15.966190099716187
Pridiction for chunk no:  5718
Time elapsed:  0.5018818378448486
Pridiction for c

Pridiction for chunk no:  5831
Time elapsed:  0.062195777893066406
Pridiction for chunk no:  5832
Time elapsed:  3.3575079441070557
Pridiction for chunk no:  5833
Time elapsed:  0.05639505386352539
Pridiction for chunk no:  5834
Time elapsed:  39.00046706199646
Pridiction for chunk no:  5835
Time elapsed:  0.13151121139526367
Pridiction for chunk no:  5836
Time elapsed:  1.0743119716644287
Pridiction for chunk no:  5837
Time elapsed:  1.7523760795593262
Pridiction for chunk no:  5838
Time elapsed:  8.361529111862183
Pridiction for chunk no:  5839
Time elapsed:  0.06172919273376465
Pridiction for chunk no:  5840
Time elapsed:  1.249547004699707
Pridiction for chunk no:  5841
Time elapsed:  0.05220818519592285
Pridiction for chunk no:  5842
Time elapsed:  3.9320640563964844
Pridiction for chunk no:  5843
Time elapsed:  0.06598401069641113
Pridiction for chunk no:  5844
Time elapsed:  4.234951972961426
Pridiction for chunk no:  5845
Time elapsed:  1.1089520454406738
Pridiction for chunk n

Pridiction for chunk no:  5958
Time elapsed:  0.604619026184082
Pridiction for chunk no:  5959
Time elapsed:  1.2661280632019043
Pridiction for chunk no:  5960
Time elapsed:  0.24173307418823242
Pridiction for chunk no:  5961
Time elapsed:  0.5621020793914795
Pridiction for chunk no:  5962
Time elapsed:  0.06389808654785156
Pridiction for chunk no:  5963
Time elapsed:  1.3659801483154297
Pridiction for chunk no:  5964
Time elapsed:  2.9172050952911377
Pridiction for chunk no:  5965
Time elapsed:  0.10772180557250977
Pridiction for chunk no:  5966
Time elapsed:  0.05816006660461426
Pridiction for chunk no:  5967
Time elapsed:  2.7046058177948
Pridiction for chunk no:  5968
Time elapsed:  2.4954848289489746
Pridiction for chunk no:  5969
Time elapsed:  13.686753034591675
Pridiction for chunk no:  5970
Time elapsed:  3.1240901947021484
Pridiction for chunk no:  5971
Time elapsed:  0.2985079288482666
Pridiction for chunk no:  5972
Time elapsed:  0.2514479160308838
Pridiction for chunk no: 

Pridiction for chunk no:  6085
Time elapsed:  0.12266397476196289
Pridiction for chunk no:  6086
Time elapsed:  0.09631490707397461
Pridiction for chunk no:  6087
Time elapsed:  6.858008146286011
Pridiction for chunk no:  6088
Time elapsed:  0.23132920265197754
Pridiction for chunk no:  6089
Time elapsed:  7.313436985015869
Pridiction for chunk no:  6090
Time elapsed:  0.06801009178161621
Pridiction for chunk no:  6091
Time elapsed:  2.1105949878692627
Pridiction for chunk no:  6092
Time elapsed:  0.04651188850402832
Pridiction for chunk no:  6093
Time elapsed:  5.343484878540039
Pridiction for chunk no:  6094
Time elapsed:  0.025625944137573242
Pridiction for chunk no:  6095
Time elapsed:  0.11444211006164551
Pridiction for chunk no:  6096
Time elapsed:  0.025711774826049805
Pridiction for chunk no:  6097
Time elapsed:  0.6491641998291016
Pridiction for chunk no:  6098
Time elapsed:  0.028545141220092773
Pridiction for chunk no:  6099
Time elapsed:  0.027437925338745117
Pridiction for

Pridiction for chunk no:  6212
Time elapsed:  2.342237949371338
Pridiction for chunk no:  6213
Time elapsed:  0.9009602069854736
Pridiction for chunk no:  6214
Time elapsed:  0.055947065353393555
Pridiction for chunk no:  6215
Time elapsed:  0.09939885139465332
Pridiction for chunk no:  6216
Time elapsed:  0.029272079467773438
Pridiction for chunk no:  6217
Time elapsed:  0.04180765151977539
Pridiction for chunk no:  6218
Time elapsed:  0.048909902572631836
Pridiction for chunk no:  6219
Time elapsed:  1.4772098064422607
Pridiction for chunk no:  6220
Time elapsed:  0.028172016143798828
Pridiction for chunk no:  6221
Time elapsed:  0.04141426086425781
Pridiction for chunk no:  6222
Time elapsed:  1.3869409561157227
Pridiction for chunk no:  6223
Time elapsed:  0.03992486000061035
Pridiction for chunk no:  6224
Time elapsed:  0.025226831436157227
Pridiction for chunk no:  6225
Time elapsed:  0.6275420188903809
Pridiction for chunk no:  6226
Time elapsed:  19.032278060913086
Pridiction f

Pridiction for chunk no:  6340
Time elapsed:  5.195189952850342
Pridiction for chunk no:  6341
Time elapsed:  0.22952699661254883
Pridiction for chunk no:  6342
Time elapsed:  88.0693690776825
Pridiction for chunk no:  6343
Time elapsed:  1.319894790649414
Pridiction for chunk no:  6344
Time elapsed:  0.04080605506896973
Pridiction for chunk no:  6345
Time elapsed:  0.3912670612335205
Pridiction for chunk no:  6346
Time elapsed:  2.553753137588501
Pridiction for chunk no:  6347
Time elapsed:  0.9974498748779297
Pridiction for chunk no:  6348
Time elapsed:  1.4969401359558105
Pridiction for chunk no:  6349
Time elapsed:  0.04743385314941406
Pridiction for chunk no:  6350
Time elapsed:  0.7186119556427002
Pridiction for chunk no:  6351
Time elapsed:  2.8576841354370117
Pridiction for chunk no:  6352
Time elapsed:  0.48105907440185547
Pridiction for chunk no:  6353
Time elapsed:  4.533514022827148
Pridiction for chunk no:  6354
Time elapsed:  1.8960978984832764
Pridiction for chunk no:  6

Pridiction for chunk no:  6466
Time elapsed:  0.8416860103607178
Pridiction for chunk no:  6467
Time elapsed:  0.9679579734802246
Pridiction for chunk no:  6468
Time elapsed:  0.2946817874908447
Pridiction for chunk no:  6469
Time elapsed:  1.7983479499816895
Pridiction for chunk no:  6470
Time elapsed:  1.9985337257385254
Pridiction for chunk no:  6471
Time elapsed:  0.10598111152648926
Pridiction for chunk no:  6472
Time elapsed:  0.09471988677978516
Pridiction for chunk no:  6473
Time elapsed:  0.1974780559539795
Pridiction for chunk no:  6474
Time elapsed:  0.14762115478515625
Pridiction for chunk no:  6475
Time elapsed:  0.17943716049194336
Pridiction for chunk no:  6476
Time elapsed:  0.06928896903991699
Pridiction for chunk no:  6477
Time elapsed:  0.07249712944030762
Pridiction for chunk no:  6478
Time elapsed:  0.10894989967346191
Pridiction for chunk no:  6479
Time elapsed:  4.077239990234375
Pridiction for chunk no:  6480
Time elapsed:  1.5366899967193604
Pridiction for chun

Pridiction for chunk no:  6592
Time elapsed:  0.541085958480835
Pridiction for chunk no:  6593
Time elapsed:  0.11780214309692383
Pridiction for chunk no:  6594
Time elapsed:  0.04452800750732422
Pridiction for chunk no:  6595
Time elapsed:  0.6487529277801514
Pridiction for chunk no:  6596
Time elapsed:  0.05897808074951172
Pridiction for chunk no:  6597
Time elapsed:  1.276181936264038
Pridiction for chunk no:  6598
Time elapsed:  0.0645751953125
Pridiction for chunk no:  6599
Time elapsed:  0.013058900833129883
Pridiction for chunk no:  6600
Time elapsed:  0.04453110694885254
Pridiction for chunk no:  6601
Time elapsed:  0.08536887168884277
Pridiction for chunk no:  6602
Time elapsed:  0.03565216064453125
Pridiction for chunk no:  6603
Time elapsed:  2.697659969329834
Pridiction for chunk no:  6604
Time elapsed:  0.021368980407714844
Pridiction for chunk no:  6605
Time elapsed:  0.04483199119567871
Pridiction for chunk no:  6606
Time elapsed:  23.21705389022827
Pridiction for chunk 

Pridiction for chunk no:  6717
Time elapsed:  4.164147853851318
Pridiction for chunk no:  6718
Time elapsed:  0.3406031131744385
Pridiction for chunk no:  6719
Time elapsed:  1.6886169910430908
Pridiction for chunk no:  6720
Time elapsed:  0.09382009506225586
Pridiction for chunk no:  6721
Time elapsed:  1.6201591491699219
Pridiction for chunk no:  6722
Time elapsed:  0.025144100189208984
Pridiction for chunk no:  6723
Time elapsed:  2.0283827781677246
Pridiction for chunk no:  6724
Time elapsed:  1.1288459300994873
Pridiction for chunk no:  6725
Time elapsed:  0.07233595848083496
Pridiction for chunk no:  6726
Time elapsed:  0.03463602066040039
Pridiction for chunk no:  6727
Time elapsed:  0.1627037525177002
Pridiction for chunk no:  6728
Time elapsed:  0.8533270359039307
Pridiction for chunk no:  6729
Time elapsed:  0.01624894142150879
Pridiction for chunk no:  6730
Time elapsed:  0.1978600025177002
Pridiction for chunk no:  6731
Time elapsed:  0.7577331066131592
Pridiction for chunk

Pridiction for chunk no:  6843
Time elapsed:  3.656116008758545
Pridiction for chunk no:  6844
Time elapsed:  0.2132720947265625
Pridiction for chunk no:  6845
Time elapsed:  0.07533001899719238
Pridiction for chunk no:  6846
Time elapsed:  0.0931088924407959
Pridiction for chunk no:  6847
Time elapsed:  0.7132298946380615
Pridiction for chunk no:  6848
Time elapsed:  0.5262930393218994
Pridiction for chunk no:  6849
Time elapsed:  0.3577847480773926
Pridiction for chunk no:  6850
Time elapsed:  9.183034896850586
Pridiction for chunk no:  6851
Time elapsed:  0.024964094161987305
Pridiction for chunk no:  6852
Time elapsed:  0.049677133560180664
Pridiction for chunk no:  6853
Time elapsed:  0.020656108856201172
Pridiction for chunk no:  6854
Time elapsed:  0.13148808479309082
Pridiction for chunk no:  6855
Time elapsed:  0.8519430160522461
Pridiction for chunk no:  6856
Time elapsed:  0.24721097946166992
Pridiction for chunk no:  6857
Time elapsed:  0.2150897979736328
Pridiction for chu

Pridiction for chunk no:  6969
Time elapsed:  1.177213191986084
Pridiction for chunk no:  6970
Time elapsed:  2.463430166244507
Pridiction for chunk no:  6971
Time elapsed:  11.84913682937622
Pridiction for chunk no:  6972
Time elapsed:  1.2938570976257324
Pridiction for chunk no:  6973
Time elapsed:  0.046405792236328125
Pridiction for chunk no:  6974
Time elapsed:  0.43584609031677246
Pridiction for chunk no:  6975
Time elapsed:  0.24883294105529785
Pridiction for chunk no:  6976
Time elapsed:  0.6491680145263672
Pridiction for chunk no:  6977
Time elapsed:  0.0871739387512207
Pridiction for chunk no:  6978
Time elapsed:  0.1625041961669922


In [93]:
# len(frames)

392

In [80]:
iterable = [(6,1),(6,1),(7,1),(9,1),(3,1),(5,1),(4,1),(9,1),(9,1),(9,1)]

selectCount = 3

largests = heapq.nlargest(selectCount, iterable, key=lambda t: t[0])

print(largests)

[(9, 1), (9, 1), (9, 1)]


In [69]:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])
pd.concat([df2, df1], ignore_index=True)

Unnamed: 0,letter,number
0,c,3
1,d,4
2,a,1
3,b,2


In [22]:
# Order train dataset by userId
useful_train.sort_values(by=['userId'], inplace= True)
useful_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
5122500,1,3949,5.0,1147868678
9153002,1,1175,3.5,1147868826
6923102,1,6016,5.0,1147869090
724395,1,7323,3.5,1147869119
2805472,1,4973,4.5,1147869080


In [23]:
# View the last five rows of the train data
useful_train.tail()

Unnamed: 0,userId,movieId,rating,timestamp
9103441,162541,2396,4.0,1240952712
547504,162541,4973,4.5,1240950790
7991803,162541,2539,1.0,1240950911
1861237,162541,1201,3.0,1240953800
9435687,162541,1230,3.5,1240951041


In [24]:
# Get the values of all userId into a list 
train_userids = useful_train['userId'].unique().tolist()
print(f'There are {len(train_userids)} of different userIds in the useful train dataset')

There are 162350 of different userIds in the useful train dataset


#### Order the test dataset by userId

In [25]:
# Order test dataset 
useful_test = test.sort_values(by=['userId'])
useful_test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [26]:
# View the last five rows of the test dataset
useful_test.tail()

Unnamed: 0,userId,movieId
4999993,162541,345
4999992,162541,150
5000017,162541,5689
5000004,162541,2324
5000018,162541,7153


From the above result, we proceed to divide 162350 by 40, which give an approximate value of 4059. Which means when we divide the useful train dataset into chuncks, we would have 4059 unique userIds in the first 39 chunks, and 4049 in the last chunks. 

### Compare userId's positions for both dataset

In [28]:
twotables = useful_test.copy
# useful_test['userId'].compare(useful_train['userId'])

In [29]:
# A function that generate a list of chunks
def create_chunk_list(obj, limit, cycle):
    """
        This function accepts a list of data as obj argument, a limit value which is the maximum chunk size in each 
        chunk and a cycle, which is the numbers/size of chunks to be created
    """
    chunks = []
    start = 0
    new_limit = limit
    for i in range(cycle):
        chunks.append(obj[start:new_limit])
        start = start + limit
        new_limit = new_limit + limit
        
    return chunks
            

In [30]:
# Create 1624 chunks of userId, with a chunk size of 100
userId_chunks = create_chunk_list(train_userids, 100, 1624)

# Random evaluation of the userId chunk size
print(f'We have {len(userId_chunks)} chunks of dataset')
print(f'The length of the first chunk is: {len(userId_chunks[0])}')
print(f'The length of the 1623 chunk is: {len(userId_chunks[1622])}')
print(f'The length of the last chunk is: {len(userId_chunks[-1])}')

We have 1624 chunks of dataset
The length of the first chunk is: 100
The length of the 1623 chunk is: 100
The length of the last chunk is: 50


We separated the uniqueIds into 16234 chuncks which gives us 100 unique userIds in each of the first 1623 chunks and 
50 unique userId in the last chunk.
For proper understanding, let us view the first 10 userIds in three random chunks

In [31]:
# Random evaluation of the first 10 userids in the chunks created above

print(f'The first 10 userIds in the first chunk are: {userId_chunks[0][:10]}')
print(f'The first 10 userIds in the 1623 chunk is: {userId_chunks[1622][:10]}')
print(f'The first 10 userIds in the last chunk is: {userId_chunks[-1][:10]}')

The first 10 userIds in the first chunk are: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The first 10 userIds in the 1623 chunk is: [162392, 162393, 162394, 162395, 162396, 162397, 162398, 162399, 162400, 162401]
The first 10 userIds in the last chunk is: [162492, 162493, 162494, 162495, 162496, 162497, 162498, 162499, 162500, 162501]


#### Divide dataset

For simplicity and explanatory reasons, we have chosen to divide the dataset manually i.e one line at a time. A faster will be to dynamically create it and store in a list. This operation will generate **train_chunk_1** to **train_chunk_1623**

In [32]:
# Divide useful trian into chunks, to create train_chunk_1 to train_chunk_40
train_chunks_name = []
for index, chunkId in enumerate(userId_chunks):
    chunk_name = "train_chunk_{0}".format(index + 1)
    globals()[chunk_name] = useful_train[useful_train['userId'].isin(chunkId)]
    train_chunks_name.append(chunk_name)
    
# THE ABOVE AS A FUNCTION
# def chunk_dataframe(df,col , chunk_ref):
#     """
#         This function accepts a dataframe and a chunk reference, which it uses to create smaller pieces of dataframe
#         as a chunk to the inputed dataframe. It returns a list of chunked dataframe.
#     """
#     df_chunks = []
#     for index, chunkId in enumerate(chunk_ref):
#         chunk_name = "train_chunk_{0}".format(index + 1)
#         globals()[chunk_name] = df[df[col].isin(chunkId)]
#         df_chunks.append(chunk_name)
        
#     return df_chunks

In [52]:
# Divide useful test dataset into chunks, to create train_chunk_1 to train_chunk_40
test_chunks_name = []
for index, chunkId in enumerate(userId_chunks):
    chunk_name = "test_chunk_{0}".format(index + 1)
    globals()[chunk_name] = useful_test[useful_test['userId'].isin(chunkId)]
    test_chunks_name.append(chunk_name)

In [33]:
#Test the above operation by printing the first five rows of the first chunk and the last five rows of the last chunk
train_chunk_1.head()

Unnamed: 0,userId,movieId,rating,timestamp
5122500,1,3949,5.0,1147868678
9153002,1,1175,3.5,1147868826
6923102,1,6016,5.0,1147869090
724395,1,7323,3.5,1147869119
2805472,1,4973,4.5,1147869080


In [34]:
# View the last five rows of the last chunk of the train data
train_chunk_1624.tail()

Unnamed: 0,userId,movieId,rating,timestamp
9103441,162541,2396,4.0,1240952712
547504,162541,4973,4.5,1240950790
7991803,162541,2539,1.0,1240950911
1861237,162541,1201,3.0,1240953800
9435687,162541,1230,3.5,1240951041


In [54]:
# View the last five rows of the last chunk of the test data
test_chunk_1624.tail()

Unnamed: 0,userId,movieId
4999993,162541,345
4999992,162541,150
5000017,162541,5689
5000004,162541,2324
5000018,162541,7153


In [35]:
# View the first and last chunk names created above
print(f'The first chunk name created above is: {train_chunks_name[0]}')
print(f'The last chunk name created above is: {train_chunks_name[-1]}')

The first chunk name created above is: train_chunk_1
The last chunk name created above is: train_chunk_1624


In [55]:
# View the first and last test dataset chunk names created above
print(f'The first chunk name created above is: {test_chunks_name[0]}')
print(f'The last chunk name created above is: {test_chunks_name[-1]}')

The first chunk name created above is: test_chunk_1
The last chunk name created above is: test_chunk_1624


#### Merging of tables

We proceed to merge tables with all train_chunks dataset created above. We execute this merge operation, using the userId as a reference. This operation will generate variables **merge_table_1** to **merge_table_40**

In [36]:
# Merge chunk with imdb_data 
merge_chunks_name = []
for index, chunk_name in enumerate(train_chunks_name):
    merge_name = "merge_chunk_{0}".format(index + 1)
    globals()[merge_name] = globals()[chunk_name].merge(imdb_data, on = 'movieId', how= 'left')
    merge_chunks_name.append(merge_name)
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords
0,1,3949,5.0,1147868678,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,102.0,"$4,500,000",drug addiction|heroin|sex show|sex scene
1,1,1175,3.5,1147868826,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,99.0,"FRF24,000,000",black comedy|absurd comedy|surrealist|bed
2,1,6016,5.0,1147869090,Alexandre Rodrigues|Leandro Firmino|Phellipe H...,Kátia Lund,130.0,"$3,300,000",photographer|slum|gang|brazil
3,1,7323,3.5,1147869119,Daniel Brühl|Katrin Saß|Chulpan Khamatova|Mari...,Bernd Lichtenberg,121.0,"EUR4,800,000",coma|german democratic republic|capitalism|pol...
4,1,4973,4.5,1147869080,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,122.0,"$10,000,000",female protagonist|paris france|france|montmar...


In [56]:
# Merge test chunk with imdb_data 
test_merge_chunks_name = []
for index, chunk_name in enumerate(test_chunks_name):
    merge_name = "test_merge_chunk_{0}".format(index + 1)
    globals()[merge_name] = globals()[chunk_name].merge(imdb_data, on = 'movieId', how= 'left')
    test_merge_chunks_name.append(merge_name)
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,2011,,,,,
1,1,4144,,,,,
2,1,5767,,,,,
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,102.0,"$4,000,000",older man younger woman relationship|lonelines...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,127.0,"$30,000,000",suffering|torture|brutality|whipping


In [37]:
# View the result of the last five rows in the last merged chunk 
merge_chunk_1624.tail()

Unnamed: 0,userId,movieId,rating,timestamp,title_cast,director,runtime,budget,plot_keywords
5336,162541,2396,4.0,1240952712,Geoffrey Rush|Tom Wilkinson|Steven O'Donnell|T...,John Madden,123.0,"$25,000,000",william shakespeare character|shakespeare play...
5337,162541,4973,4.5,1240950790,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,122.0,"$10,000,000",female protagonist|paris france|france|montmar...
5338,162541,2539,1.0,1240950911,Robert De Niro|Billy Crystal|Lisa Kudrow|Chazz...,Kenneth Lonergan,103.0,"$80,000,000",sex scene|mafia boss|mob boss|sexual intercourse
5339,162541,1201,3.0,1240953800,,,,,
5340,162541,1230,3.5,1240951041,,,,,


In [38]:
# Drop columns that are considered not necessary from the result of the merge operation above

for  merge_name in merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].drop(columns=['timestamp', 'runtime', 'budget'])
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords
0,1,3949,5.0,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,drug addiction|heroin|sex show|sex scene
1,1,1175,3.5,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,black comedy|absurd comedy|surrealist|bed
2,1,6016,5.0,Alexandre Rodrigues|Leandro Firmino|Phellipe H...,Kátia Lund,photographer|slum|gang|brazil
3,1,7323,3.5,Daniel Brühl|Katrin Saß|Chulpan Khamatova|Mari...,Bernd Lichtenberg,coma|german democratic republic|capitalism|pol...
4,1,4973,4.5,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,female protagonist|paris france|france|montmar...


In [57]:
# Drop test columns that are considered not necessary from the result of the merge operation above

for  merge_name in test_merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].drop(columns=['runtime', 'budget'])
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords
0,1,2011,,,
1,1,4144,,,
2,1,5767,,,
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,older man younger woman relationship|lonelines...
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,suffering|torture|brutality|whipping


In [39]:
# Merge chunks with movies table 

for  merge_name in merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].merge(movies, on = 'movieId', how= 'left')
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres
0,1,3949,5.0,Ellen Burstyn|Jared Leto|Jennifer Connelly|Mar...,Hubert Selby Jr.,drug addiction|heroin|sex show|sex scene,Requiem for a Dream (2000),Drama
1,1,1175,3.5,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,black comedy|absurd comedy|surrealist|bed,Delicatessen (1991),Comedy|Drama|Romance
2,1,6016,5.0,Alexandre Rodrigues|Leandro Firmino|Phellipe H...,Kátia Lund,photographer|slum|gang|brazil,City of God (Cidade de Deus) (2002),Action|Adventure|Crime|Drama|Thriller
3,1,7323,3.5,Daniel Brühl|Katrin Saß|Chulpan Khamatova|Mari...,Bernd Lichtenberg,coma|german democratic republic|capitalism|pol...,"Good bye, Lenin! (2003)",Comedy|Drama
4,1,4973,4.5,Audrey Tautou|Mathieu Kassovitz|Rufus|Lorella ...,Guillaume Laurant,female protagonist|paris france|france|montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy|Romance


In [58]:
# Merge chunks with movies table 

for  merge_name in test_merge_chunks_name:
    globals()[merge_name] = globals()[merge_name].merge(movies, on = 'movieId', how= 'left')
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords,title,genres
0,1,2011,,,,Back to the Future Part II (1989),Adventure|Comedy|Sci-Fi
1,1,4144,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama|Romance
2,1,5767,,,,Teddy Bear (Mis) (1981),Comedy|Crime
3,1,6711,Scarlett Johansson|Bill Murray|Akiko Takeshita...,Sofia Coppola,older man younger woman relationship|lonelines...,Lost in Translation (2003),Comedy|Drama|Romance
4,1,7318,Jim Caviezel|Maia Morgenstern|Christo Jivkov|F...,Benedict Fitzgerald,suffering|torture|brutality|whipping,"Passion of the Christ, The (2004)",Drama


#### Free up memory in the global variable

In [61]:
# Delete the train_chunks and test_chunks_name stored in the global variable
for data in train_chunks_name:
    del globals()[data]
for data in test_chunks_name:
    del globals()[data]

#### Data formating

Before we can use any string vectorizer on our data, we need to properly format the data.

In [41]:
# Remove delimeters(Separators) from string data
def splitter(df, col_list, delim):
    """
        This function accepts a dataframe(df) and a list of columns(col_list), which contains the delimiter
        to be removed, it also accepts the delimiter which is to be removed
    """
    new_df = df.copy()
    
    for col in col_list:
        new_df[col] = new_df[col].str.split(delim).str.join(' ')
    
    return new_df
        

In [42]:
# Remove separators form string data
for  merge_name in merge_chunks_name:
    globals()[merge_name] = splitter(globals()[merge_name], ['title_cast', 'plot_keywords', 'genres'], '|')
    
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres
0,1,3949,5.0,Ellen Burstyn Jared Leto Jennifer Connelly Mar...,Hubert Selby Jr.,drug addiction heroin sex show sex scene,Requiem for a Dream (2000),Drama
1,1,1175,3.5,Pascal Benezech Dominique Pinon Marie-Laure Do...,Jean-Pierre Jeunet,black comedy absurd comedy surrealist bed,Delicatessen (1991),Comedy Drama Romance
2,1,6016,5.0,Alexandre Rodrigues Leandro Firmino Phellipe H...,Kátia Lund,photographer slum gang brazil,City of God (Cidade de Deus) (2002),Action Adventure Crime Drama Thriller
3,1,7323,3.5,Daniel Brühl Katrin Saß Chulpan Khamatova Mari...,Bernd Lichtenberg,coma german democratic republic capitalism pol...,"Good bye, Lenin! (2003)",Comedy Drama
4,1,4973,4.5,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...,Guillaume Laurant,female protagonist paris france france montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy Romance


In [62]:
for  merge_name in test_merge_chunks_name:
    globals()[merge_name] = splitter(globals()[merge_name], ['title_cast', 'plot_keywords', 'genres'], '|')
    
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords,title,genres
0,1,2011,,,,Back to the Future Part II (1989),Adventure Comedy Sci-Fi
1,1,4144,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama Romance
2,1,5767,,,,Teddy Bear (Mis) (1981),Comedy Crime
3,1,6711,Scarlett Johansson Bill Murray Akiko Takeshita...,Sofia Coppola,older man younger woman relationship lonelines...,Lost in Translation (2003),Comedy Drama Romance
4,1,7318,Jim Caviezel Maia Morgenstern Christo Jivkov F...,Benedict Fitzgerald,suffering torture brutality whipping,"Passion of the Christ, The (2004)",Drama


In [43]:
# View the last merged chunk
merge_chunk_1624.tail()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres
5336,162541,2396,4.0,Geoffrey Rush Tom Wilkinson Steven O'Donnell T...,John Madden,william shakespeare character shakespeare play...,Shakespeare in Love (1998),Comedy Drama Romance
5337,162541,4973,4.5,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...,Guillaume Laurant,female protagonist paris france france montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy Romance
5338,162541,2539,1.0,Robert De Niro Billy Crystal Lisa Kudrow Chazz...,Kenneth Lonergan,sex scene mafia boss mob boss sexual intercourse,Analyze This (1999),Comedy
5339,162541,1201,3.0,,,,"Good, the Bad and the Ugly, The (Buono, il bru...",Action Adventure Western
5340,162541,1230,3.5,,,,Annie Hall (1977),Comedy Romance


In [44]:
# Merge interested columns values for vectorization
title_list = []
indices_list = []

for index, merge_name in enumerate(merge_chunks_name):
    globals()[merge_name]['key_words'] = (pd.Series(globals()[merge_name][['title_cast', 'director', 'plot_keywords', 
                                                                           'genres']].fillna('')
                      .values.tolist()).str.join(' '))
    
    titles = "titles_{0}".format(index + 1)
    globals()[titles] = globals()[merge_name]['title']
    title_list.append(titles)
    
    indices = "indices_{0}".format(index + 1)
    globals()[indices] = pd.Series(globals()[merge_name].index, index=globals()[merge_name]['title'])
    indices_list.append(indices)
    
    
# View the result of the first five rows in the first merged chunks
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title_cast,director,plot_keywords,title,genres,key_words
0,1,3949,5.0,Ellen Burstyn Jared Leto Jennifer Connelly Mar...,Hubert Selby Jr.,drug addiction heroin sex show sex scene,Requiem for a Dream (2000),Drama,Ellen Burstyn Jared Leto Jennifer Connelly Mar...
1,1,1175,3.5,Pascal Benezech Dominique Pinon Marie-Laure Do...,Jean-Pierre Jeunet,black comedy absurd comedy surrealist bed,Delicatessen (1991),Comedy Drama Romance,Pascal Benezech Dominique Pinon Marie-Laure Do...
2,1,6016,5.0,Alexandre Rodrigues Leandro Firmino Phellipe H...,Kátia Lund,photographer slum gang brazil,City of God (Cidade de Deus) (2002),Action Adventure Crime Drama Thriller,Alexandre Rodrigues Leandro Firmino Phellipe H...
3,1,7323,3.5,Daniel Brühl Katrin Saß Chulpan Khamatova Mari...,Bernd Lichtenberg,coma german democratic republic capitalism pol...,"Good bye, Lenin! (2003)",Comedy Drama,Daniel Brühl Katrin Saß Chulpan Khamatova Mari...
4,1,4973,4.5,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...,Guillaume Laurant,female protagonist paris france france montmar...,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Comedy Romance,Audrey Tautou Mathieu Kassovitz Rufus Lorella ...


In [63]:
# Merge interested columns values for vectorization
test_title_list = []
test_indices_list = []

for index, merge_name in enumerate(test_merge_chunks_name):
    globals()[merge_name]['key_words'] = (pd.Series(globals()[merge_name][['title_cast', 'director', 'plot_keywords', 
                                                                           'genres']].fillna('')
                      .values.tolist()).str.join(' '))
    
    titles = "titles_{0}".format(index + 1)
    globals()[titles] = globals()[merge_name]['title']
    title_list.append(titles)
    
    indices = "indices_{0}".format(index + 1)
    globals()[indices] = pd.Series(globals()[merge_name].index, index=globals()[merge_name]['title'])
    indices_list.append(indices)
    
    
# View the result of the first five rows in the first merged chunks
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title_cast,director,plot_keywords,title,genres,key_words
0,1,2011,,,,Back to the Future Part II (1989),Adventure Comedy Sci-Fi,Adventure Comedy Sci-Fi
1,1,4144,,,,In the Mood For Love (Fa yeung nin wa) (2000),Drama Romance,Drama Romance
2,1,5767,,,,Teddy Bear (Mis) (1981),Comedy Crime,Comedy Crime
3,1,6711,Scarlett Johansson Bill Murray Akiko Takeshita...,Sofia Coppola,older man younger woman relationship lonelines...,Lost in Translation (2003),Comedy Drama Romance,Scarlett Johansson Bill Murray Akiko Takeshita...
4,1,7318,Jim Caviezel Maia Morgenstern Christo Jivkov F...,Benedict Fitzgerald,suffering torture brutality whipping,"Passion of the Christ, The (2004)",Drama,Jim Caviezel Maia Morgenstern Christo Jivkov F...


In [45]:
# Drop unwanted colunms in the train data set
for index, merge_name in enumerate(merge_chunks_name):
    globals()[merge_name].drop(columns= ['title_cast','director','plot_keywords', 'genres'], inplace=True)
merge_chunk_1.head()

Unnamed: 0,userId,movieId,rating,title,key_words
0,1,3949,5.0,Requiem for a Dream (2000),Ellen Burstyn Jared Leto Jennifer Connelly Mar...
1,1,1175,3.5,Delicatessen (1991),Pascal Benezech Dominique Pinon Marie-Laure Do...
2,1,6016,5.0,City of God (Cidade de Deus) (2002),Alexandre Rodrigues Leandro Firmino Phellipe H...
3,1,7323,3.5,"Good bye, Lenin! (2003)",Daniel Brühl Katrin Saß Chulpan Khamatova Mari...
4,1,4973,4.5,"Amelie (Fabuleux destin d'Amélie Poulain, Le) ...",Audrey Tautou Mathieu Kassovitz Rufus Lorella ...


In [64]:
# Drop unwanted colunms in the test data set
for index, merge_name in enumerate(test_merge_chunks_name):
    globals()[merge_name].drop(columns= ['title_cast','director','plot_keywords', 'genres'], inplace=True)
test_merge_chunk_1.head()

Unnamed: 0,userId,movieId,title,key_words
0,1,2011,Back to the Future Part II (1989),Adventure Comedy Sci-Fi
1,1,4144,In the Mood For Love (Fa yeung nin wa) (2000),Drama Romance
2,1,5767,Teddy Bear (Mis) (1981),Comedy Crime
3,1,6711,Lost in Translation (2003),Scarlett Johansson Bill Murray Akiko Takeshita...
4,1,7318,"Passion of the Christ, The (2004)",Jim Caviezel Maia Morgenstern Christo Jivkov F...


In [50]:
# View the shape of the first chunk in train dataset
merge_chunk_1.shape

(5260, 5)

In [65]:
# View the shape of the first chunk in test dataset
test_merge_chunk_1.shape

(2748, 4)

In [66]:
# View the shape of the last chunk in train dataset
merge_chunk_1623.shape

(5915, 5)

In [67]:
# View the shape of the last chunk in test dataset
test_merge_chunk_1623.shape

(3031, 4)

In [49]:
# Save the merge table for the train dataset
for index, merge_name in enumerate(merge_chunks_name):
    directory = './data/chunked_train_data/'+merge_name+'.csv'
    globals()[merge_name].to_csv(directory,index=False)

In [68]:
# Save the merge table for the test dataset
for index, merge_name in enumerate(test_merge_chunks_name):
    directory = './data/chunked_test_data/'+merge_name+'.csv'
    globals()[merge_name].to_csv(directory,index=False)

In [69]:
merge_chunk_1.loc[0,'key_words']

"Ellen Burstyn Jared Leto Jennifer Connelly Marlon Wayans Christopher McDonald Louise Lasser Marcia Jean Kurtz Janet Sarno Suzanne Shepherd Joanne Gordon Charlotte Aronofsky Mark Margolis Michael Kaycheck Jack O'Connell Chas Mastin Hubert Selby Jr. drug addiction heroin sex show sex scene Drama"

#### Free up memory space

For us to proceed to the next, which is CPU intensive, we want to free up some memory space

In [70]:
# Free up memory space 

train = None
genome_scores = None
genome_tags = None
links = None

print(train)

None


**TfidfVectorizer**
We now need a mechanism to convert these textual features into a format which enables us to compute their relative similarities to one another.
This will allow us to translate our string-based collection of title_cast, director, plot_keywords, genres, key_words into numerical vectors to achieve this, we make use of **TfidfVectorizer**.

In [42]:
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,2),min_df=0, stop_words='english')

vector_list = []

for index, merge_name in enumerate(merge_chunks_name):
    tf_matrix = "tf_matrix_{0}".format(index + 1)
    
    globals()[tf_matrix] = tf.fit_transform(globals()[merge_name]['key_words'])
    vector_list.append(tf_matrix)

In [59]:
tf_matrix_1.shape

(5260, 57409)

In [60]:
for data in merge_chunks_name:
    del globals()[data]

In [None]:
# Create first 100 consine similarities 
cosine_sim_list = []

real_index = 0
stop_count = 300
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

print (cosine_sim_1.shape)

In [62]:
# Print cosine similarity value for the 200th chunk
cosine_sim_100.shape

(6094, 6094)

In [63]:
# View the next start point
real_index

100

In [64]:
# Create next 200 consine similarities 

stop_count = 150
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_150.shape)


(6023, 6023)


In [65]:
# View the next start point
real_index

150

In [66]:
# Create next 200 consine similarities 

stop_count = 200
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_200.shape)

(5638, 5638)


In [67]:
# View the next start point
real_index

200

In [68]:
# Create next 200 consine similarities 

stop_count = 250
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_250.shape)

(5638, 5638)


In [70]:
# View the next start point
real_index

250

In [None]:
# Create next 200 consine similarities 

stop_count = 300
real_vector_list = vector_list[real_index:stop_count]


for index, tf_matrix in enumerate(real_vector_list):
    cosine_sim = "cosine_sim_{0}".format(real_index + 1)
    
    globals()[cosine_sim] = cosine_similarity(globals()[tf_matrix], 
                                        globals()[tf_matrix])
    cosine_sim_list.append(cosine_sim)
    
    real_index = real_index + 1
    if real_index == stop_count :
        break

# View the last cosine similarity value generated 
print (cosine_sim_300.shape)

#### Spliting of dataset
We split our transformed dataset into 50 part, converting same into dataframe and save to local storage. This process
gives 199957 rows for the first 49 chunks and 199952 rows for the last chunk

In [None]:
def create_chunk_list(obj, limit):
    chunk = []
    chunks = []
    obj_len = len(obj)
    
    for index, value in enumerate(obj) :
        chunk.append(value)
        if ( (len(chunk) == limit ) | ((index +1) == obj_len )) :
            chunks.append(chunk)
            chunk = []
    
    return chunks
            

####  Convert vectorised data set back to dataframe form

In [None]:
vectoried_df = pd.DataFrame(data=tf_authTags_matrix.toarray(),columns = vector.get_feature_names())
vectoried_df.head()

We now can compute the similarity between each vector within our matrix. This is done by making use of the `cosine_similarity` function provided to us by `sklearn`.

In [None]:
cosine_sim_authTags = cosine_similarity(tf_authTags_matrix, 
                                        tf_authTags_matrix)
print (cosine_sim_authTags.shape)

### 4.2 Collaborative filtering

### 4.3 Rating Prediction

As motivated previously, in some cases we may wish to directly calculate what rating a user _would_ give a book that they haven't read yet. 

We can modify our content-based filtering algorithm to do this in the following manner: 

   1. Select a reference user from the database and a reference item (movie) they have _not_ rated. 
   2. For the user, gather the similarity values between the reference item and each item the user _has_ rated. 
   3. Sort the gathered similarity values in descending order. 
   4. Select the $k$ highest similarity values which are above a given threshold value, creating a collection $K$. 
   5. Compute a weighted average rating from these values, which is the sum of the similarity values of each item multiplied by its assigned user-rating, divided by the sum of the similarity values. This can be expressed in formula as:
   
   $$ \hat{R}_{ju} = \frac{\sum_{i \in K} s_{ij} \times r_{iu}}{\sum_{i \in K} s_{ij}}   $$
   
   where $\hat{R}_{ju}$ is the weighted average computed for the reference item $j$ and reference user $u$, $K$ is the collection of items, $s_{ij}$ is the similarity computed between items $i$ and $j$, and $r_{iu}$ is the known rating user $u$ has given item $i$.
   6. We return the weighted average $\hat{R}_{ju}$ as the prediction for our reference item.
   
   
We implement this algorithmic process in the function below:

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the thee hour load shortfall. |

---

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic

Run the next cell to make sure the experiment as ended. It notifies comit.

In [None]:
experiment.end()