# Creating the model's dataset

In [1]:
import pandas as pd
#import nltk 
import numpy as np
from nltk.corpus import stopwords
import re
from nltk.stem import WordNetLemmatizer
from IPython.display import display
import pyspark
from pyspark.sql import SparkSession
from pyspark import SQLContext
from pyspark.sql import Window as W
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.types import *
from pyspark.sql import *
spark = SparkSession \
    .builder \
    .getOrCreate()
sc = spark.sparkContext  # the active SparkContext instance, not the class

## Import datasets

In [2]:
tag = pd.read_csv('data/'+'tag'+'.csv',header=0, parse_dates=['timestamp'], dtype = {"userId" : "str","movieId" : "str"})
display(tag.head())

rating = pd.read_csv('data/'+'rating'+'.csv',header=0, parse_dates=['timestamp'], dtype = {"userId" : "str","movieId" : "str"})
display(rating.head())

BASE = "genome_tags_relevance"
genome_tags_relevance=pd.read_pickle("data/" + BASE + ".pkl")
display(genome_tags_relevance)

movie = pd.read_csv('data/'+'movie'+'.csv',header=0, dtype = {"title" : "str","movieId" : "str"})
display(movie.head())

Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18


Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


Unnamed: 0,movieId,tagId,relevance,avg_relevance,tag
0,1,6,0.21700,0.160223,1950s
1,1,8,0.26275,0.160223,1970s
2,1,9,0.26200,0.160223,1980s
3,1,11,0.57700,0.160223,3d
4,1,13,0.18800,0.160223,80s
...,...,...,...,...,...
3571640,131170,1109,0.20875,0.162884,wilderness
3571641,131170,1110,0.47125,0.162884,wine
3571642,131170,1116,0.31125,0.162884,women
3571643,131170,1123,0.64375,0.162884,writers


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Cleaning, filters & feature engineering

For these tasks we'll use my own library, developed for this project. Its name sounds a little like 'testing', which is what we are 'mathematically' doing here. It also reflects my love for movies, comparable only to the joy of sitting in front of your favorite food, say a fine steak or a couple of cochinita pibil tacos: you don't just eat it, you have been tasting it long before the first bite... same with movies :)

<br> 

- The package documentation will be released soon. In the meantime, I hope the following explanation, the outputs of the method calls and the code itself are useful. Comments, questions and issues are welcome on this repo or at my personal e-mail.

- Any contribution / pull request will be highly valued

In [3]:
import tastingmovies

**Initialize the class `prepmovies` with the following parameters:**

- The original datasets given: `movie, rating, tag`

- `genome_tags_relevance`: the join of the genome datasets.

- `minimun_wordfreq`: the minimum frequency a word must reach across the tags to be kept. It sets how many words we are willing to lose in order to drop the ones that appear too rarely and would just be noise. Here I chose a 'big' number to keep processing time short and iterate faster, since I decided to focus on the nature of the features and the engineering of new ones. It costs an important part of the dataset, but it works well to show the concept of the solution.

- `minimun_to_hr`: the minimum rating at which we consider a movie highly rated (or, in more casual words, approved by the user).
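
To make the two thresholds concrete, here is a minimal pandas sketch of what they control. It is not the code inside `tastingmovies`: the helper names `frequent_words` and `flag_high_rating` are hypothetical, the toy data is made up, and using `>=` for the word-frequency cut is an assumption. For the rating cut, `>=` is consistent with the sample output further down, where a 4.0 gets `high_rating = 1` and a 3.5 gets 0 with `minimun_to_hr=4`.

```python
from collections import Counter

import pandas as pd

# Toy data; the column names mirror the notebook, the values are made up.
tags = pd.DataFrame({"tag": ["dark hero", "dark comedy", "noir thriller", "hero"]})
ratings = pd.DataFrame({"rating": [3.5, 4.0, 5.0, 2.0]})


def frequent_words(tag_series, min_wordfreq):
    """Return the set of words that appear at least `min_wordfreq` times."""
    counts = Counter(word for tag in tag_series for word in tag.split())
    return {word for word, n in counts.items() if n >= min_wordfreq}


def flag_high_rating(rating_series, min_to_hr):
    """Return 1 where the rating reaches the threshold, 0 otherwise."""
    return (rating_series >= min_to_hr).astype(int)


# Hypothetical helpers, not part of the tastingmovies API.
kept_words = frequent_words(tags["tag"], min_wordfreq=2)
ratings["high_rating"] = flag_high_rating(ratings["rating"], min_to_hr=4)
```

With a threshold of 2 only `dark` and `hero` survive in this toy example, which illustrates the trade-off: a higher threshold shrinks the vocabulary (and the dataset) in exchange for faster processing.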


**Assumptions:**

- The genome datasets come from historical data, so they hold no information from the future (the period in which the ratings take place). For this project we will not use the knowledge gained from training to update the genome datasets; that could be left for future versions.

- The EDA showed that, for many user-movie pairs, the rating's timestamp does not match the tag's timestamp. In general, we'll assume that the user watched the movie, and therefore knew everything needed to rate it, at the earlier of the two (the smaller timestamp). This prevents the model from learning from future information.

- For a movie, we consider a tag truly relevant when its relevance is greater than the average relevance of all the tags associated with that movie.
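
The last two assumptions can be sketched in a few lines of pandas. This is only an illustration under stated assumptions, not the library's implementation: the columns `first_seen` and `is_relevant` are hypothetical names, the values are illustrative, and taking the minimum per user-movie pair (rather than per row) is an assumption, although it is consistent with the `timestamp_movie` column in the output of `join_tags_and_genome_ratings`.

```python
import pandas as pd

# Toy rows; column names follow the notebook output, values are illustrative.
df = pd.DataFrame({
    "userId": ["100074", "100074", "100031"],
    "movieId": ["2338", "2338", "1250"],
    "timestamp": pd.to_datetime(
        ["2012-01-17 03:12:52", "2012-01-17 03:12:57", "2006-10-04 02:14:24"]),
    "timestamp_rating": pd.to_datetime(
        ["2012-01-17 03:13:14", "2012-01-17 03:13:14", "2006-10-04 02:12:54"]),
    "relevance": [float("nan"), 0.10, 0.958],
    "avg_relevance": [float("nan"), 0.15, 0.169697],
})

# Row-wise earlier timestamp between the tag and the rating.
df["first_seen"] = df["timestamp"].where(
    df["timestamp"] <= df["timestamp_rating"], df["timestamp_rating"])

# Earliest moment known for each user-movie pair.
df["timestamp_movie"] = (
    df.groupby(["userId", "movieId"])["first_seen"].transform("min"))

# A tag is flagged as relevant only when it beats the movie's average relevance;
# rows with no genome match (NaN) compare as False and get 0.
df["is_relevant"] = (df["relevance"] > df["avg_relevance"]).astype(int)
```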


**Philosophy**

- We want to *'squeeze the juice'* out of the variables as much as possible. Hence, this library helps us create many new features: averages, cumulative averages, indicators of the presence of a characteristic (for example, the relevance of a tag for a movie), interaction variables, etc.

- The objective is to build a kind of **ID** or, even better, a ***DNA*** chain for the user and another for the movie. Their *'genetics'* give us an idea of the tastes and characteristics of each one just before 'they meet' each other. With this we'll have an ML model learn which user profile would *go out* with (give a high rating to) which movie profile.

- The time-dependent variables were built respecting chronology (from past to future): at training and prediction time, the only information known is what does not change over time, plus what has happened up until 'a second' before the user watches a new movie.





In [4]:
pm = tastingmovies.prepmovies(tag_init=tag,minimun_wordfreq=2000,rating=rating,
                              genome_tags_relevance=genome_tags_relevance,movie=movie,
                              spark=spark,
                              minimun_to_hr=4)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jalfredomb/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
pm.join_tags_and_genome_ratings()

join_tags_and_genome_ratings


Unnamed: 0,userId,movieId,timestamp_movie,tag,timestamp,tagId,relevance,avg_relevance,rating_usr,timestamp_rating,title,genres_list,tag_relevance_movie
0,100031,1250,2006-10-04 02:12:54,war,2006-10-04 02:14:24,1096,0.95800,0.169697,5.0,2006-10-04 02:12:54,"Bridge on the River Kwai, The (1957)","[Adventure, Drama, War]",0.95800
1,100074,1701,2012-01-17 02:46:38,Woody Allen,2012-01-17 03:04:49,,,,5.0,2012-01-17 02:46:38,Deconstructing Harry (1997),"[Comedy, Drama]",0.01000
2,100074,1701,2012-01-17 02:46:38,hilarious,2012-01-17 03:04:56,505,0.56175,0.119156,5.0,2012-01-17 02:46:38,Deconstructing Harry (1997),"[Comedy, Drama]",0.56175
3,100074,2338,2012-01-17 03:12:52,white guy with Jamaican/Caribbean accent,2012-01-17 03:12:52,,,,2.0,2012-01-17 03:13:14,I Still Know What You Did Last Summer (1998),"[Horror, Mystery, Thriller]",0.01000
4,100074,2338,2012-01-17 03:12:52,Jack Black,2012-01-17 03:12:57,,,,2.0,2012-01-17 03:13:14,I Still Know What You Did Last Summer (1998),"[Horror, Mystery, Thriller]",0.01000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
391440,99998,1921,2015-02-28 23:57:47,insanity,2015-02-28 23:58:18,547,0.98050,0.188004,4.5,2015-02-28 23:57:47,Pi (1998),"[Drama, Sci-Fi, Thriller]",0.98050
391441,99998,3504,2015-02-28 23:23:21,Sidney Lumet,2015-02-28 23:23:29,,,,5.0,2015-02-28 23:23:21,Network (1976),"[Comedy, Drama]",0.01000
391442,99998,3730,2015-02-28 23:26:12,psychological,2015-02-28 23:26:33,823,0.93900,0.181968,4.0,2015-02-28 23:26:12,"Conversation, The (1974)","[Drama, Mystery]",0.93900
391443,99998,3730,2015-02-28 23:26:12,character study,2015-02-28 23:26:37,194,0.98075,0.181968,4.0,2015-02-28 23:26:12,"Conversation, The (1974)","[Drama, Mystery]",0.98075


In [6]:
pm.process_tags01()

---Remove all the special characters
------Remove all single characters
---------Remove single characters from the start
------------Substituting multiple spaces with single space
---------------Removing prefixed "b" 
---------------------Converting to Lowercase
------------------------Lemmatization
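
The log above lists the cleaning steps in order. A minimal sketch of such a pipeline with the standard `re` module might look like the following. The function name `clean_tag` and the exact regular expressions are assumptions, not the code inside `process_tags01`, and the lemmatizer is passed in as an optional callable so the sketch runs without the NLTK corpora.

```python
import re


def clean_tag(text, lemmatize=None):
    """Apply the cleaning steps from the log to a single tag string."""
    text = re.sub(r"\W", " ", str(text))         # special characters -> spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # single characters between spaces
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single character at the start
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    text = re.sub(r"^b\s+", "", text)            # leftover "b" prefix, if any
    text = text.lower().strip()                  # lowercase and trim the ends
    if lemmatize is not None:
        # Lemmatize word by word with whatever callable was provided.
        text = " ".join(lemmatize(word) for word in text.split())
    return text


print(clean_tag("Sci-Fi!!  a  classic"))  # sci fi classic
```

To reproduce the last step of the log, one could pass `WordNetLemmatizer().lemmatize` from `nltk.stem` as the `lemmatize` argument, which requires the `wordnet` corpus to be downloaded.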


### User's DNA

**User's behaviour related to tags over time**

In [7]:
pm.cumsum_tags_by_user()

pivoting table with tags + rating


**User's behaviour related to movie genres over time**

In [8]:
pm.cumsum_genres_by_user()

pivoting ratings + movies´s genres


**Final transformations to get the user's cumulative average ratings**

In [9]:
pm.cumulative_ratings_x_user()
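
As a rough idea of what a cumulative average per user looks like, here is a sketch in long format with pandas. It assumes one row per user, genre and rating; the library itself appears to work with pivoted tables, so the layout and the column name `avg_rtng_acum` are illustrative assumptions.

```python
import pandas as pd

# Toy ratings, deliberately out of chronological order.
ratings = pd.DataFrame({
    "userId": ["1", "1", "1", "2", "2"],
    "timestamp": pd.to_datetime(
        ["2005-01-03", "2005-01-01", "2005-01-02", "2006-05-01", "2006-05-02"]),
    "genre": ["Drama", "Drama", "Comedy", "Drama", "Drama"],
    "rating": [5.0, 3.0, 4.0, 2.0, 4.0],
})

# Chronological order matters: the expanding mean must only see the past.
ratings = ratings.sort_values(["userId", "timestamp"]).reset_index(drop=True)

# Running mean of the ratings each user has given to each genre so far,
# including the current row.
ratings["avg_rtng_acum"] = (
    ratings.groupby(["userId", "genre"])["rating"]
    .transform(lambda s: s.expanding().mean())
)
```

Note that this value still includes the current rating; it only becomes safe to use as a feature once it is lagged by one observation.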

### Movie's DNA

Here we calculate both types of features: time-dependent and time-independent.

- Time-dependent:

    - Cumulative average rating


- Time-independent:

    - The indicators for the movie's genres and for each tag's relevance to that particular movie

In [10]:
pm.cumulative_ratings_x_movie_and_rlvnc_genre()

Calculating the cumulative rating per movie through chronology-ratings
Calculating: catalogue/ dictionary by pivoting table with tags + relevance by movie, not userId hence, the relevance is static and does not change with the users neither with time
Calculating: catalogue/ dictionary by pivoting table with genres by movie, since not userId, the presence of a genre is static and does not change with the users neither with time
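
The static genre indicators can be sketched with `Series.str.get_dummies`, which splits the pipe-separated `genres` string into one 0/1 column per genre. The `genre_` prefix matches the column names in the final dataset, but the rest, including the name `movie_dna`, is an assumption about how the library builds them.

```python
import pandas as pd

# Toy catalogue in the same shape as movie.csv.
movie = pd.DataFrame({
    "movieId": ["1", "2", "3"],
    "genres": ["Adventure|Children|Fantasy", "Comedy|Romance", "Comedy"],
})

# One indicator column per genre; it depends on neither the user nor time.
genre_flags = movie["genres"].str.get_dummies(sep="|").add_prefix("genre_")

# Hypothetical name for the per-movie table of static features.
movie_dna = pd.concat([movie[["movieId"]], genre_flags], axis=1)
```

The tag-relevance catalogue follows the same idea, with the relevance score in place of the 0/1 flag.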


### The final touch for the final dataset

This is where the method `dataset_for_model` joins all the user variables and movie variables, applying the corresponding `lag` so that the model cannot learn from future information.

In [11]:
pm.dataset_for_model()
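
The lag idea can be illustrated with a grouped `shift`. The sketch below assumes a table that already holds a cumulative average per user (the column `Drama_avg_rtng_acum` and its values are made up) and derives the `_prev` version that a model could safely use. It is an illustration of the technique, not the body of `dataset_for_model`.

```python
import pandas as pd

# Toy table: a cumulative average that still includes the current rating.
features = pd.DataFrame({
    "userId": ["1", "1", "1", "2"],
    "timestamp_movie": pd.to_datetime(
        ["2005-01-01", "2005-01-02", "2005-01-03", "2006-05-01"]),
    "Drama_avg_rtng_acum": [3.0, 4.0, 4.0, 2.0],
})

features = features.sort_values(["userId", "timestamp_movie"])

# Shift by one observation within each user: every row now holds the value
# known just before that rating happened.
features["Drama_avg_rtng_acum_prev"] = (
    features.groupby("userId")["Drama_avg_rtng_acum"].shift(1)
)
```

A user's first observation has no history, so its lagged value is missing. This is one plausible source of the empty `_prev` cells in the final dataset, alongside users who had never rated a given genre before.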

**Dataset with IDs & timestamp**

In [13]:
BASE = "tbl_usrmovie_t0"
pm.tbl_usrmovie_t0.to_pickle("data/" + BASE + ".pkl")
pm.tbl_usrmovie_t0

Unnamed: 0,movieId,id_user_movie,userId,timestamp_movie,rating_usr,genre_(no genres listed),genre_Action,genre_Adventure,genre_Animation,genre_Children,...,Mystery_avg_rtng_acum_prev,Sci-Fi_avg_rtng_acum_prev,Thriller_avg_rtng_acum_prev,Western_avg_rtng_acum_prev,Action_avg_rtng_acum_prev,Comedy_avg_rtng_acum_prev,Fantasy_avg_rtng_acum_prev,Romance_avg_rtng_acum_prev,War_avg_rtng_acum_prev,high_rating
0,1,2,1741,2002-11-27 18:31:23,4.0,0,0,1,1,1,...,5.00,4.35,4.71,3.00,4.60,4.79,5.00,5.00,5.00,1
1,1,3,20388,2003-01-15 06:33:48,5.0,0,0,1,1,1,...,3.38,3.07,3.00,3.00,2.96,3.42,3.53,3.47,3.80,1
2,1,4,123297,2004-03-30 18:59:56,5.0,0,0,1,1,1,...,,5.00,5.00,,5.00,5.00,5.00,,,1
3,1,5,72073,2005-09-30 12:30:36,5.0,0,0,1,1,1,...,4.33,4.54,4.63,4.50,4.50,4.63,4.69,4.56,4.38,1
4,1,6,86768,2006-01-13 01:55:17,4.0,0,0,1,1,1,...,,5.00,5.00,,5.00,5.00,5.00,5.00,5.00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42727,99917,10,134864,2015-01-30 00:58:52,4.0,0,0,0,0,0,...,,5.00,5.00,,,,,5.00,,1
42728,99996,2,42429,2013-08-06 03:59:59,2.5,0,0,0,0,0,...,3.77,3.69,3.84,4.13,3.85,3.71,3.97,3.57,3.50,0
42729,99996,3,36085,2013-11-26 23:24:02,4.0,0,0,0,0,0,...,3.77,3.38,3.58,,3.50,3.50,3.25,3.23,3.50,1
42730,99996,4,102853,2014-12-09 19:24:36,3.5,0,0,0,0,0,...,4.20,4.00,4.43,,4.58,4.14,4.50,4.33,4.00,0


**Dataset without IDs & timestamp**

- Finally we get our dataset. It has 42,732 observations, 102 features and the target variable `high_rating`.

*(this is the one we'll use to train/test the model)*

In [14]:
BASE = "ds_tastingmovies"
pm.ds_tastingmovies.to_pickle("data/" + BASE + ".pkl")
pm.ds_tastingmovies

Unnamed: 0,genre_(no genres listed),genre_Action,genre_Adventure,genre_Animation,genre_Children,genre_Comedy,genre_Crime,genre_Documentary,genre_Drama,genre_Fantasy,...,Mystery_avg_rtng_acum_prev,Sci-Fi_avg_rtng_acum_prev,Thriller_avg_rtng_acum_prev,Western_avg_rtng_acum_prev,Action_avg_rtng_acum_prev,Comedy_avg_rtng_acum_prev,Fantasy_avg_rtng_acum_prev,Romance_avg_rtng_acum_prev,War_avg_rtng_acum_prev,high_rating
0,0,0,1,1,1,1,0,0,0,1,...,5.00,4.35,4.71,3.00,4.60,4.79,5.00,5.00,5.00,1
1,0,0,1,1,1,1,0,0,0,1,...,3.38,3.07,3.00,3.00,2.96,3.42,3.53,3.47,3.80,1
2,0,0,1,1,1,1,0,0,0,1,...,,5.00,5.00,,5.00,5.00,5.00,,,1
3,0,0,1,1,1,1,0,0,0,1,...,4.33,4.54,4.63,4.50,4.50,4.63,4.69,4.56,4.38,1
4,0,0,1,1,1,1,0,0,0,1,...,,5.00,5.00,,5.00,5.00,5.00,5.00,5.00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42727,0,0,0,0,0,0,0,0,0,0,...,,5.00,5.00,,,,,5.00,,1
42728,0,0,0,0,0,1,0,0,1,0,...,3.77,3.69,3.84,4.13,3.85,3.71,3.97,3.57,3.50,0
42729,0,0,0,0,0,1,0,0,1,0,...,3.77,3.38,3.58,,3.50,3.50,3.25,3.23,3.50,1
42730,0,0,0,0,0,1,0,0,1,0,...,4.20,4.00,4.43,,4.58,4.14,4.50,4.33,4.00,0
