## TFRS Feature Engineering


### 1. Introduction

In this notebook, we'll improve our cleaned DataFrame, which originally came from the TensorFlow Datasets catalog: [movie_lens/1m-ratings](https://www.tensorflow.org/datasets/catalog/movie_lens#movie_lens1m-ratings). Our focus will mainly be on:

 * Fixing **"movie_genres"**: let's make sure the genres are stored as a list for easy access.
 * Fixing **"user_occupation_label"**: the label "10" is missing, so both 'K-12 student' and 'college/grad student' are labeled "17". Here we'll assign "10" to 'K-12 student'.
 * Adding 7 more features to the original dataset: **'cast', 'director', 'cast_size', 'crew_size', 'imdb_id', 'release_date' and 'movie_lens_movie_id'**. We'll get these features from 2 datasets from the [Movielens website](https://grouplens.org/datasets/movielens/):
   * **movies_metadata.csv**
   * **credits.csv**

 * Fixing wrong (or, in some cases, misspelled) movie titles.
 * Removing all special characters and letter accents from movie titles, cast and director.
 * Adding a movie id that matches the original movie id in the original MovieLens dataset (for some reason, the movie id from the TensorFlow dataset does not match).
 * Fixing duplicate movie_title values with the same movie_id.
 * After fixing the items above, converting the Pandas dataframe to a TensorFlow dataset.

### 2. Import relevant libraries

In [203]:
# Import the necessary libraries:

import os
import pprint
import tempfile
from typing import Dict, Text

import folium
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
from folium.plugins import MarkerCluster
from wordcloud import WordCloud

plt.style.use('ggplot')

In [None]:
# Check Current Directory:
os.getcwd()

In [None]:
# List files/folders in the cd:
os.listdir()

### 3. Fixing the Existing TensorFlow Dataset (movie_lens/1m-ratings)

Let's start with:

 * Fixing "movie_genres": let's make sure the genres are stored as a list for easy access.

In [182]:
# Let's load the cleaned dataframe that we converted from a TensorFlow dataset to Pandas:
df = pd.read_csv('1m_movielens_metadata_ratings.csv')

In [183]:
df.shape

(1000209, 12)

In [184]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 12 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   bucketized_user_age    1000209 non-null  float64
 1   movie_genres           1000209 non-null  object 
 2   movie_id               1000209 non-null  int64  
 3   movie_title            1000209 non-null  object 
 4   timestamp              1000209 non-null  object 
 5   user_gender            1000209 non-null  object 
 6   user_id                1000209 non-null  int64  
 7   user_occupation_label  1000209 non-null  int64  
 8   user_occupation_text   1000209 non-null  object 
 9   user_rating            1000209 non-null  float64
 10  user_zip_code          1000209 non-null  int64  
 11  movie_release_year     1000209 non-null  object 
dtypes: float64(2), int64(4), object(6)
memory usage: 91.6+ MB


In [185]:
df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
0,25.0,[3 4],586,Home Alone (1990),2000-12-04 02:31:40,m,595,6,executive/managerial,4.0,10019,1990-01-01 00:00:00
1,35.0,[ 0 1 4 14],1197,"Princess Bride, The (1987)",2000-10-29 03:36:20,f,2804,11,other/not specified,5.0,46234,1987-01-01 00:00:00
2,25.0,[ 4 14],2502,Office Space (1999),2000-11-20 21:51:54,m,1457,0,academic/educator,4.0,95472,1999-01-01 00:00:00
3,18.0,[1 3 8],2,Jumanji (1995),2000-08-09 07:30:08,m,3887,17,college/grad student,3.0,80513,1995-01-01 00:00:00
4,35.0,[10 16],1717,Scream 2 (1997),2000-12-10 18:55:33,m,329,6,executive/managerial,2.0,2115,1997-01-01 00:00:00


In [186]:
# First, let's remove the square brackets:
df['movie_genres'] = df['movie_genres'].str.strip("[")

In [187]:
df['movie_genres'] = df['movie_genres'].str.strip("]")

In [188]:
df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
0,25.0,3 4,586,Home Alone (1990),2000-12-04 02:31:40,m,595,6,executive/managerial,4.0,10019,1990-01-01 00:00:00
1,35.0,0 1 4 14,1197,"Princess Bride, The (1987)",2000-10-29 03:36:20,f,2804,11,other/not specified,5.0,46234,1987-01-01 00:00:00
2,25.0,4 14,2502,Office Space (1999),2000-11-20 21:51:54,m,1457,0,academic/educator,4.0,95472,1999-01-01 00:00:00
3,18.0,1 3 8,2,Jumanji (1995),2000-08-09 07:30:08,m,3887,17,college/grad student,3.0,80513,1995-01-01 00:00:00
4,35.0,10 16,1717,Scream 2 (1997),2000-12-10 18:55:33,m,329,6,executive/managerial,2.0,2115,1997-01-01 00:00:00


In [189]:
# Let's remove any spaces at the start/end of the string:
df['movie_genres'] = df['movie_genres'].str.strip()

In [190]:
df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
0,25.0,3 4,586,Home Alone (1990),2000-12-04 02:31:40,m,595,6,executive/managerial,4.0,10019,1990-01-01 00:00:00
1,35.0,0 1 4 14,1197,"Princess Bride, The (1987)",2000-10-29 03:36:20,f,2804,11,other/not specified,5.0,46234,1987-01-01 00:00:00
2,25.0,4 14,2502,Office Space (1999),2000-11-20 21:51:54,m,1457,0,academic/educator,4.0,95472,1999-01-01 00:00:00
3,18.0,1 3 8,2,Jumanji (1995),2000-08-09 07:30:08,m,3887,17,college/grad student,3.0,80513,1995-01-01 00:00:00
4,35.0,10 16,1717,Scream 2 (1997),2000-12-10 18:55:33,m,329,6,executive/managerial,2.0,2115,1997-01-01 00:00:00


In [191]:
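# Finally, let's convert the genres to a list (note: splitting on a single space leaves empty strings wherever the text had consecutive spaces):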
df['movie_genres'] = df['movie_genres'].apply(lambda x: list(map(str.lstrip,x.split(' '))))

In [192]:
df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
0,25.0,"[3, 4]",586,Home Alone (1990),2000-12-04 02:31:40,m,595,6,executive/managerial,4.0,10019,1990-01-01 00:00:00
1,35.0,"[0, , 1, , 4, 14]",1197,"Princess Bride, The (1987)",2000-10-29 03:36:20,f,2804,11,other/not specified,5.0,46234,1987-01-01 00:00:00
2,25.0,"[4, 14]",2502,Office Space (1999),2000-11-20 21:51:54,m,1457,0,academic/educator,4.0,95472,1999-01-01 00:00:00
3,18.0,"[1, 3, 8]",2,Jumanji (1995),2000-08-09 07:30:08,m,3887,17,college/grad student,3.0,80513,1995-01-01 00:00:00
4,35.0,"[10, 16]",1717,Scream 2 (1997),2000-12-10 18:55:33,m,329,6,executive/managerial,2.0,2115,1997-01-01 00:00:00
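The empty entries in rows such as `[0, , 1, , 4, 14]` most likely come from consecutive spaces in the printed array, which `split(' ')` turns into empty strings. A minimal sketch on toy data (the `genres` Series below is hypothetical and only mimics the bracket-stripped values) shows the difference between `split(' ')` and `split()`:

```python
import pandas as pd

# Hypothetical toy values mimicking the bracket-stripped genre strings;
# the double spaces are assumed to come from right-aligned array printing.
genres = pd.Series(["3 4", "0  1  4 14", "10 16"])

# split(' ') cuts on every single space, so double spaces yield '' entries.
naive = genres.apply(lambda s: s.split(" "))

# split() with no argument cuts on runs of whitespace and drops empties;
# converting to int also makes the genre ids numeric.
clean = genres.apply(lambda s: [int(g) for g in s.split()])

print(naive.tolist())
print(clean.tolist())
```

If the notebook is re-run, swapping `x.split(' ')` for `x.split()` in the cell above should remove the empty entries without any further changes.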


Now, let's get rid of the release year from the movie_title:

In [194]:
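# Let's keep all characters except the last 6 characters of the movie_title (the year in parentheses):
df['movie_title'] = df['movie_title'].str[:-6]

In [195]:
# Let's remove leading/trailing spaces:
df['movie_title'] = df['movie_title'].str.strip()

In [196]:
# Note: strip('') removes nothing, so this cell is a no-op; the previous cell already trimmed the whitespace.
df['movie_title'] = df['movie_title'].str.strip('')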

In [197]:
df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
0,25.0,"[3, 4]",586,Home Alone,2000-12-04 02:31:40,m,595,6,executive/managerial,4.0,10019,1990-01-01 00:00:00
1,35.0,"[0, , 1, , 4, 14]",1197,"Princess Bride, The",2000-10-29 03:36:20,f,2804,11,other/not specified,5.0,46234,1987-01-01 00:00:00
2,25.0,"[4, 14]",2502,Office Space,2000-11-20 21:51:54,m,1457,0,academic/educator,4.0,95472,1999-01-01 00:00:00
3,18.0,"[1, 3, 8]",2,Jumanji,2000-08-09 07:30:08,m,3887,17,college/grad student,3.0,80513,1995-01-01 00:00:00
4,35.0,"[10, 16]",1717,Scream 2,2000-12-10 18:55:33,m,329,6,executive/managerial,2.0,2115,1997-01-01 00:00:00
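Slicing off the last 6 characters assumes every title ends with a year in parentheses. As a hedged alternative, a regular expression removes the year only when it is present; the `titles` Series below is hypothetical toy data, not the real column:

```python
import pandas as pd

# Hypothetical toy titles; the last one has no year suffix on purpose.
titles = pd.Series(["Home Alone (1990)", "Princess Bride, The (1987)", "Jumanji"])

# Remove a trailing "(YYYY)" together with surrounding whitespace,
# leaving titles without a year untouched.
stripped = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(stripped.tolist())
```

With fixed slicing, a title that lacks the year suffix would lose 6 real characters, which is the main reason to prefer the pattern-based version.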


In [198]:
df.to_csv(path_or_buf = "path" + "/1m_movielens_metadata_ratings_v1.csv", index=False)

 * Fixing "user_occupation_label": it turns out that the label "10" is missing, and both 'K-12 student' and 'college/grad student' are labeled "17". Here we'll assign "10" to 'K-12 student'.

In [199]:
# Ok, as we can see below, user_occupation_label "10" is missing:
df.user_occupation_label.unique()

array([ 6, 11,  0, 17, 16, 18, 20,  1, 21, 14,  4, 12, 13,  9, 15,  2,  8,
        3, 19,  7], dtype=int64)

In [201]:
# Let's check the 'K-12 student' label: it is 17
df[df['user_occupation_text'] == 'K-12 student']

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
24,1.0,"[0, 10, 15, 16]",1320,Alien\xc2\xb3,2000-08-04 18:13:22,m,4102,17,K-12 student,2.0,48655,1992-01-01 00:00:00
188,1.0,"[3, 4]",2384,Babe: Pig in the City,2000-12-02 16:33:48,f,634,17,K-12 student,1.0,49512,1998-01-01 00:00:00
217,18.0,"[0, 16]",2334,"Siege, The",2000-11-25 07:09:13,m,1917,17,K-12 student,3.0,92704,1998-01-01 00:00:00
274,1.0,"[2, , 3, , 8, 12]",2087,Peter Pan,2000-12-22 19:35:37,f,119,17,K-12 student,3.0,77515,1953-01-01 00:00:00
284,1.0,"[7, 13]",2686,"Red Violin, The (Le Violon rouge)",2002-02-24 07:09:48,m,2168,17,K-12 student,5.0,60202,1998-01-01 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
999972,1.0,"[4, 7]",2858,American Beauty,2000-05-08 16:03:48,f,5844,17,K-12 student,4.0,2131,1999-01-01 00:00:00
1000052,1.0,"[2, 3]",2123,All Dogs Go to Heaven,2000-07-24 17:56:56,f,4572,17,K-12 student,3.0,17036,1989-01-01 00:00:00
1000053,35.0,"[0, 1, 4]",2723,Mystery Men,2000-08-28 02:04:22,m,4562,17,K-12 student,5.0,94133,1999-01-01 00:00:00
1000084,1.0,"[13, 16]",904,Rear Window,2000-12-20 19:07:26,f,1088,17,K-12 student,5.0,98103,1954-01-01 00:00:00


In [202]:
# Let's check the 'college/grad student' label: 17 again!
df[df['user_occupation_text'] ==  'college/grad student']

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
3,18.0,"[1, 3, 8]",2,Jumanji,2000-08-09 07:30:08,m,3887,17,college/grad student,3.0,80513,1995-01-01 00:00:00
9,18.0,"[0, , 1, 15]",2640,Superman,2000-11-02 19:25:21,m,2756,17,college/grad student,4.0,97331,1978-01-01 00:00:00
14,25.0,[4],344,Ace Ventura: Pet Detective,2000-07-01 15:25:05,m,5083,17,college/grad student,2.0,16803,1994-01-01 00:00:00
22,25.0,"[1, 4]",2265,Nothing But Trouble,2000-11-22 01:30:59,f,1184,17,college/grad student,3.0,92354,1991-01-01 00:00:00
36,18.0,[0],3635,"Spy Who Loved Me, The",2000-06-16 18:28:31,m,5264,17,college/grad student,4.0,15217,1977-01-01 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
1000139,18.0,"[2, 3]",2761,"Iron Giant, The",2000-08-16 16:56:53,m,3659,17,college/grad student,4.0,29649,1999-01-01 00:00:00
1000155,18.0,[7],2546,"Deep End of the Ocean, The",2000-12-01 17:10:38,m,714,17,college/grad student,4.0,76013,1999-01-01 00:00:00
1000163,18.0,"[2, , 3, 12]",1032,Alice in Wonderland,2000-11-19 21:30:40,f,2073,17,college/grad student,5.0,13148,1951-01-01 00:00:00
1000169,25.0,"[4, 10, 12, 15]",2657,"Rocky Horror Picture Show, The",2000-11-20 06:58:36,m,1757,17,college/grad student,4.0,75206,1975-01-01 00:00:00


In [203]:
# alright, let's fix this, by assigning user_occupation_label=10 to 'K-12 student':
df.loc[df['user_occupation_text'] == 'K-12 student', 'user_occupation_label'] = 10

In [204]:
# let's confirm that user_occupation_label for 'K-12 student' is : 10
df[df['user_occupation_text'] == 'K-12 student']

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year
24,1.0,"[0, 10, 15, 16]",1320,Alien\xc2\xb3,2000-08-04 18:13:22,m,4102,10,K-12 student,2.0,48655,1992-01-01 00:00:00
188,1.0,"[3, 4]",2384,Babe: Pig in the City,2000-12-02 16:33:48,f,634,10,K-12 student,1.0,49512,1998-01-01 00:00:00
217,18.0,"[0, 16]",2334,"Siege, The",2000-11-25 07:09:13,m,1917,10,K-12 student,3.0,92704,1998-01-01 00:00:00
274,1.0,"[2, , 3, , 8, 12]",2087,Peter Pan,2000-12-22 19:35:37,f,119,10,K-12 student,3.0,77515,1953-01-01 00:00:00
284,1.0,"[7, 13]",2686,"Red Violin, The (Le Violon rouge)",2002-02-24 07:09:48,m,2168,10,K-12 student,5.0,60202,1998-01-01 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
999972,1.0,"[4, 7]",2858,American Beauty,2000-05-08 16:03:48,f,5844,10,K-12 student,4.0,2131,1999-01-01 00:00:00
1000052,1.0,"[2, 3]",2123,All Dogs Go to Heaven,2000-07-24 17:56:56,f,4572,10,K-12 student,3.0,17036,1989-01-01 00:00:00
1000053,35.0,"[0, 1, 4]",2723,Mystery Men,2000-08-28 02:04:22,m,4562,10,K-12 student,5.0,94133,1999-01-01 00:00:00
1000084,1.0,"[13, 16]",904,Rear Window,2000-12-20 19:07:26,f,1088,10,K-12 student,5.0,98103,1954-01-01 00:00:00


In [205]:
df.user_occupation_label.unique()

array([ 6, 11,  0, 17, 16, 18, 20,  1, 21, 14,  4, 10, 12, 13,  9, 15,  2,
        8,  3, 19,  7], dtype=int64)

In [200]:
df['user_occupation_text'].unique()

array(['executive/managerial', 'other/not specified', 'academic/educator',
       'college/grad student', 'self-employed', 'technician/engineer',
       'unemployed', 'artist', 'writer', 'sales/marketing',
       'doctor/health care', 'K-12 student', 'programmer', 'retired',
       'lawyer', 'scientist', 'clerical/admin', 'homemaker',
       'customer service', 'tradesman/craftsman', 'farmer'], dtype=object)

OK, user_occupation_label is fixed now: each of the 21 occupations has its own label :)
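Rather than comparing the two `unique()` arrays by eye, the label/text mapping can be checked programmatically. The sketch below uses a small hypothetical frame `occ` with the same two column names and counts how many distinct occupation texts share each label, before and after the fix:

```python
import pandas as pd

# Hypothetical toy frame reproducing the collision on label 17.
occ = pd.DataFrame({
    "user_occupation_text": ["K-12 student", "college/grad student", "artist", "K-12 student"],
    "user_occupation_label": [17, 17, 1, 17],
})

# Labels that are shared by more than one occupation text.
texts_per_label = occ.groupby("user_occupation_label")["user_occupation_text"].nunique()
shared = texts_per_label[texts_per_label > 1].index.tolist()

# Same fix as in the notebook: give 'K-12 student' its own label.
occ.loc[occ["user_occupation_text"] == "K-12 student", "user_occupation_label"] = 10

# After the fix, every label should map to exactly one text.
texts_after = occ.groupby("user_occupation_label")["user_occupation_text"].nunique()

print(shared)
print(texts_after.to_dict())
```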

### 4. Add seven more features to the original dataset: **'cast', 'director', 'cast_size', 'crew_size', 'imdb_id', 'release_date' and 'movie_lens_movie_id'**

We'll get the features above from 2 datasets from the [Movielens website](https://grouplens.org/datasets/movielens/):

   * **movies_metadata.csv**
   * **credits.csv**


Alright, let's first load **'credits.csv'** to get 'cast', 'director', 'cast_size' and 'crew_size':

In [369]:
#loading credits.csv: cast_dir
cast_dir = pd.read_csv('credits.csv')

In [238]:
cast_dir.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


 **1. Crew:** From the crew, we will only pick the director as our feature since the others don't contribute that much to the feel of the movie.
 
 **2. Cast:** Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 4 actors that appear in the credits list.

In [239]:
# Let's use literal_eval from ast (Abstract Syntax Trees) to safely parse the stringified Python literals in 'cast' and 'crew' (lists of dictionaries, as shown above) into real Python objects.
from ast import literal_eval

In [240]:
# Let's cast "id" to int:
cast_dir['id'] = cast_dir['id'].astype('int')

In [241]:
# let's apply literal_eval to both 'cast' & 'crew'
cast_dir['cast'] = cast_dir['cast'].apply(literal_eval)
cast_dir['crew'] = cast_dir['crew'].apply(literal_eval)
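To make the parsing step concrete, here is a small standalone sketch of what `literal_eval` does to one cell. The string `raw` is a shortened, hypothetical stand-in for a real 'cast' value (the actual entries carry more keys):

```python
from ast import literal_eval

# Shortened, hypothetical example of a stringified 'cast' cell.
raw = "[{'cast_id': 14, 'character': 'Woody (voice)', 'name': 'Tom Hanks'}]"

# literal_eval turns the text into a real list of dictionaries.
parsed = literal_eval(raw)
print(type(parsed).__name__, parsed[0]["name"])

# Unlike eval, literal_eval refuses anything that is not a plain literal.
rejected = False
try:
    literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print(rejected)
```

This is why `literal_eval` is a reasonable choice for columns read from a CSV file: it only accepts literals such as lists, dicts, strings and numbers.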

In [242]:
# Now, let's create 2 features, 'cast_size' & 'crew_size':
cast_dir['cast_size'] = cast_dir['cast'].apply(len)
cast_dir['crew_size'] = cast_dir['crew'].apply(len)

In [243]:
cast_dir.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4
3,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7


Ok, now let's get the director!!!

In [244]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [245]:
cast_dir['director'] = cast_dir['crew'].apply(get_director)
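As an illustration of how the lookup behaves, the sketch below runs a slightly more defensive variant on toy crew lists. The name `get_director_safe` and the sample crew entries are hypothetical; the only change from the function above is the use of `dict.get`, so a crew entry without a 'job' key would not raise a `KeyError`:

```python
import numpy as np

def get_director_safe(crew):
    """Return the first crew member whose job is 'Director', else NaN."""
    for member in crew:
        # .get avoids a KeyError if an entry has no 'job' key.
        if member.get("job") == "Director":
            return member.get("name", np.nan)
    return np.nan

# Hypothetical crew lists for illustration only.
crew_with_director = [
    {"job": "Producer", "name": "A. Producer"},
    {"job": "Director", "name": "John Lasseter"},
]
crew_without_director = [{"job": "Editor", "name": "An Editor"}]

print(get_director_safe(crew_with_director))
print(get_director_safe(crew_without_director))
```

Movies whose crew list has no director end up as NaN, which matches the missing values later reported for the 'director' column.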

Finally, let's get our cast!!!

In [246]:
# Keep only the actor names, then the first 4 of them (slicing is safe for shorter lists):
cast_dir['cast'] = cast_dir['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
cast_dir['cast'] = cast_dir['cast'].apply(lambda x: x[:4])

In [247]:
cast_dir.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size,director
0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106,John Lasseter
1,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16,Joe Johnston
2,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4,Howard Deutch
3,"[Whitney Houston, Angela Bassett, Loretta Devi...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10,Forest Whitaker
4,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7,Charles Shyer


In [252]:
cast_dir.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   cast       45476 non-null  object
 1   crew       45476 non-null  object
 2   id         45476 non-null  int32 
 3   cast_size  45476 non-null  int64 
 4   crew_size  45476 non-null  int64 
 5   director   44589 non-null  object
dtypes: int32(1), int64(2), object(3)
memory usage: 1.9+ MB


In [271]:
# let's change id data type:
cast_dir['id'] = cast_dir['id'].astype('int64')

In [272]:
cast_dir.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   cast       45476 non-null  object
 1   crew       45476 non-null  object
 2   id         45476 non-null  int64 
 3   cast_size  45476 non-null  int64 
 4   crew_size  45476 non-null  int64 
 5   director   44589 non-null  object
dtypes: int64(3), object(3)
memory usage: 2.1+ MB


Let's now load movies_metadata.csv so we can merge it with the cast_dir dataframe:

In [248]:
# load movies_metadata.csv: m_meta
m_meta = pd.read_csv('movies_metadata.csv')



In [249]:
m_meta.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [253]:
m_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [274]:
# As shown above, before we merge the 2 dataframes we need to change the id datatype from object to int:
m_meta['id'] = m_meta['id'].astype('int64')

ValueError: invalid literal for int() with base 10: '1997-08-20'

As shown above, the conversion failed because some ids are in a date format, so let's try to locate them!!!

In [278]:
# Let's filter the rows whose "id" is in a date format (longer than 6 characters):
m_meta[m_meta.id.str.len() >6]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22,,,,,,,,,


That's an easy one: it looks like we have only 3 such rows, so let's drop them!!!

In [279]:
# Let's keep only the rows where id has 6 or fewer characters:
m_meta = m_meta[m_meta.id.str.len() <= 6]

In [None]:
# confirm:
m_meta[m_meta.id.str.len() >6]

In [282]:
# Let's try the id conversion from str to int once again:
m_meta['id'] = m_meta['id'].astype('int64')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)
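The length-based filter works for these rows, but it relies on the malformed ids being longer than 6 characters. A hedged alternative is to let `pd.to_numeric` decide which ids are numeric, and to take an explicit copy so the later assignment does not act on a slice. The `meta` frame below is hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy metadata with one date-like id, as in the rows found above.
meta = pd.DataFrame({
    "id": ["862", "8844", "1997-08-20"],
    "title": ["Toy Story", "Jumanji", None],
})

# errors="coerce" turns anything non-numeric into NaN instead of raising.
numeric_id = pd.to_numeric(meta["id"], errors="coerce")
bad_rows = meta[numeric_id.isna()]

# .copy() makes the filtered frame independent, so assigning a column
# to it does not trigger the chained-assignment warning.
meta_clean = meta[numeric_id.notna()].copy()
meta_clean["id"] = meta_clean["id"].astype("int64")

print(bad_rows["id"].tolist())
print(meta_clean.dtypes["id"])
```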


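OK, the conversion was successful! The warning above is typically raised because m_meta is a filtered slice of the original dataframe. Now let's merge the 2 dataframes on 'id' into a new dataframe: new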

In [283]:
# merge:
new = pd.merge(cast_dir, m_meta, on='id', how='left')

In [284]:
new.shape

(45538, 29)
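The merged frame has 45,538 rows although cast_dir has 45,476. With a left join, extra rows appear only when a key matches more than one row on the right side, which suggests that some 'id' values are repeated. A minimal sketch on hypothetical toy frames shows the effect and one way to guard against it with the `validate` argument of `merge`:

```python
import pandas as pd

# Hypothetical toy frames where id 8844 appears twice on both sides.
credits = pd.DataFrame({
    "id": [862, 8844, 8844],
    "director": ["John Lasseter", "Joe Johnston", "Joe Johnston"],
})
meta = pd.DataFrame({
    "id": [862, 8844, 8844],
    "title": ["Toy Story", "Jumanji", "Jumanji"],
})

# Each duplicated left row matches every duplicated right row: 1 + 2 * 2 rows.
merged = credits.merge(meta, on="id", how="left")

# List the ids that occur more than once.
dup_ids = credits.loc[credits["id"].duplicated(), "id"].unique().tolist()

# Dropping duplicates first keeps the join one-to-one; validate raises
# a MergeError if that assumption is ever violated.
deduped = credits.drop_duplicates("id").merge(
    meta.drop_duplicates("id"), on="id", how="left", validate="one_to_one"
)

print(len(merged), dup_ids, len(deduped))
```

Whether dropping duplicates is appropriate for the real files depends on whether the repeated rows are exact copies, which this sketch does not check.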

In [285]:
new.head()

Unnamed: 0,cast,crew,id,cast_size,crew_size,director,adult,belongs_to_collection,budget,genres,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,13,106,John Lasseter,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,26,16,Joe Johnston,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,7,4,Howard Deutch,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,"[Whitney Houston, Angela Bassett, Loretta Devi...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,10,10,Forest Whitaker,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,12,7,Charles Shyer,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [286]:
new.columns

Index(['cast', 'crew', 'id', 'cast_size', 'crew_size', 'director', 'adult',
       'belongs_to_collection', 'budget', 'genres', 'homepage', 'imdb_id',
       'original_language', 'original_title', 'overview', 'popularity',
       'poster_path', 'production_companies', 'production_countries',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'status',
       'tagline', 'title', 'video', 'vote_average', 'vote_count'],
      dtype='object')

Let's keep only the columns that make sense for our recommendation system:

In [297]:
# Let's keep only the important features and drop the rest:
cast_df = new[['cast', 'id', 'cast_size', 'crew_size', 'director', 'adult',
               'budget', 'imdb_id', 'original_title', 'popularity',
               'release_date', 'revenue', 'title']]

In [298]:
cast_df.head()

Unnamed: 0,cast,id,cast_size,crew_size,director,adult,budget,imdb_id,original_title,popularity,release_date,revenue,title
0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney]",862,13,106,John Lasseter,False,30000000,tt0114709,Toy Story,21.946943,1995-10-30,373554033.0,Toy Story
1,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",8844,26,16,Joe Johnston,False,65000000,tt0113497,Jumanji,17.015539,1995-12-15,262797249.0,Jumanji
2,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...",15602,7,4,Howard Deutch,False,0,tt0113228,Grumpier Old Men,11.7129,1995-12-22,0.0,Grumpier Old Men
3,"[Whitney Houston, Angela Bassett, Loretta Devi...",31357,10,10,Forest Whitaker,False,16000000,tt0114885,Waiting to Exhale,3.859495,1995-12-22,81452156.0,Waiting to Exhale
4,"[Steve Martin, Diane Keaton, Martin Short, Kim...",11862,12,7,Charles Shyer,False,0,tt0113041,Father of the Bride Part II,8.387519,1995-02-10,76578911.0,Father of the Bride Part II


In [299]:
# check missing values:
cast_df.isnull().sum()

cast                0
id                  0
cast_size           0
crew_size           0
director          887
adult               0
budget              0
imdb_id            17
original_title      0
popularity          3
release_date       87
revenue             3
title               3
dtype: int64

In [306]:
#let's drop missing values!!!
cast_df.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
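The warning above is the usual sign that an in-place operation ran on a column selection of another dataframe. A common way to sidestep it, sketched here on a hypothetical toy frame `merged_toy`, is to copy the selection and reassign the result of `dropna` instead of using `inplace=True`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the merged dataframe.
merged_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "director": ["A", np.nan, "C"],
    "title": ["x", "y", np.nan],
    "crew": [[], [], []],
})

# .copy() makes the column selection an independent dataframe.
subset = merged_toy[["id", "director", "title"]].copy()

# Reassigning (rather than inplace=True) keeps the data flow explicit;
# `subset=` limits the check to the columns that must be present.
subset = subset.dropna(subset=["director", "title"]).reset_index(drop=True)

print(subset)
```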


In [300]:
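# Check the dataframe info (per the execution counts, this cell ran before the dropna cell above, so missing values still show):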
cast_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45538 entries, 0 to 45537
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   cast            45538 non-null  object 
 1   id              45538 non-null  int64  
 2   cast_size       45538 non-null  int64  
 3   crew_size       45538 non-null  int64  
 4   director        44651 non-null  object 
 5   adult           45538 non-null  object 
 6   budget          45538 non-null  object 
 7   imdb_id         45521 non-null  object 
 8   original_title  45538 non-null  object 
 9   popularity      45535 non-null  object 
 10  release_date    45451 non-null  object 
 11  revenue         45535 non-null  float64
 12  title           45535 non-null  object 
dtypes: float64(1), int64(3), object(9)
memory usage: 4.9+ MB


Let's save this dataframe so we can compare it with the original dataframe from TensorFlow, or with the one we cleaned recently!!

In [362]:
# Saving ...
cast_df.to_csv(path_or_buf = "path" + "/cast_df.csv", index=False)

In [353]:
# Now, let's look at the final dataframe, where we merged the TensorFlow dataset with movies_metadata.csv and credits.csv.
# Let's make sure to use the encoding argument to avoid encoding errors!!!
main_df = pd.read_csv('1m_movielens_metadata_ratings_v7.csv', encoding='ISO-8859-1')

In [354]:
# Now let's look at the new features we added: 'movie_imdb_id', 'cast', 'director', 'cast_size', 'crew_size', 'imdb_id' and 'release_date'.
# Also, please note that all misspelled movie titles have been corrected!!!!
main_df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,movie_imdb_id,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year,cast,director,cast_size,crew_size,imdb_id,release_date
0,50,['7'],1251,Eight and half,422,11/13/2000 4:23,f,2497,14,sales/marketing,3,37922,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,2/14/1963
1,18,['7'],1251,Eight and half,422,4/8/2001 9:30,m,671,17,college/grad student,5,61761,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,2/14/1963
2,45,['7'],1251,Eight and half,422,6/3/2000 22:38,f,5590,12,programmer,2,94117,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,2/14/1963
3,25,['7'],1251,Eight and half,422,1/25/2002 21:12,m,1851,20,unemployed,5,59602,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,2/14/1963
4,35,['7'],1251,Eight and half,422,7/8/2000 23:52,f,5526,1,artist,5,27514,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,2/14/1963


In [355]:
# Let's check the rows: the original movie_lens from TensorFlow had 1,000,209 rows; after cleanup and keeping high-impact (correct) movies, we have 1,000,085 rows.
main_df.shape

(1000085, 19)

In [356]:
# Let's view the final dataframe:
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000085 entries, 0 to 1000084
Data columns (total 19 columns):
 #   Column                 Non-Null Count    Dtype 
---  ------                 --------------    ----- 
 0   bucketized_user_age    1000085 non-null  int64 
 1   movie_genres           1000085 non-null  object
 2   movie_id               1000085 non-null  int64 
 3   movie_title            1000085 non-null  object
 4   movie_imdb_id          1000085 non-null  int64 
 5   timestamp              1000085 non-null  object
 6   user_gender            1000085 non-null  object
 7   user_id                1000085 non-null  int64 
 8   user_occupation_label  1000085 non-null  int64 
 9   user_occupation_text   1000085 non-null  object
 10  user_rating            1000085 non-null  int64 
 11  user_zip_code          1000085 non-null  int64 
 12  movie_release_year     1000085 non-null  object
 13  cast                   1000085 non-null  object
 14  director               1000085 non

In [357]:
# let's check the statistics of the final dataframe:
pd.set_option('float_format', '{:.1f}'.format)
main_df.describe()

Unnamed: 0,bucketized_user_age,movie_id,movie_imdb_id,user_id,user_occupation_label,user_rating,user_zip_code,cast_size,crew_size
count,1000085.0,1000085.0,1000085.0,1000085.0,1000085.0,1000085.0,1000085.0,1000085.0,1000085.0
mean,29.7,1865.6,9305.8,3024.5,10.9,3.6,54230.0,25.9,29.3
std,11.8,1096.0,17111.5,1728.4,6.5,1.1,32090.4,21.2,29.5
min,1.0,1.0,5.0,1.0,0.0,1.0,231.0,0.0,1.0
25%,25.0,1030.0,769.0,1506.0,6.0,3.0,23185.0,14.0,11.0
50%,25.0,1835.0,6620.0,3070.0,11.0,4.0,55129.0,18.0,18.0
75%,35.0,2770.0,11113.0,4476.0,17.0,4.0,90004.0,30.0,36.0
max,56.0,3952.0,438108.0,6040.0,21.0,5.0,99945.0,213.0,181.0


In [358]:
# Let's convert timestamp and release_date from object to datetime64:
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'])
main_df['release_date'] = pd.to_datetime(main_df['release_date'])
main_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000085 entries, 0 to 1000084
Data columns (total 19 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   bucketized_user_age    1000085 non-null  int64         
 1   movie_genres           1000085 non-null  object        
 2   movie_id               1000085 non-null  int64         
 3   movie_title            1000085 non-null  object        
 4   movie_imdb_id          1000085 non-null  int64         
 5   timestamp              1000085 non-null  datetime64[ns]
 6   user_gender            1000085 non-null  object        
 7   user_id                1000085 non-null  int64         
 8   user_occupation_label  1000085 non-null  int64         
 9   user_occupation_text   1000085 non-null  object        
 10  user_rating            1000085 non-null  int64         
 11  user_zip_code          1000085 non-null  int64         
 12  movie_release_year     10000
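
Passing an explicit `format` to `pd.to_datetime` pins the month/day order instead of leaving it to inference. A minimal sketch, assuming every value follows the two formats visible in the `main_df.head()` output (the sample frame below is hypothetical):

```python
import pandas as pd

# Hypothetical sample mirroring the string formats shown in main_df.head().
sample = pd.DataFrame({
    "timestamp": ["11/13/2000 4:23", "4/8/2001 9:30"],
    "release_date": ["2/14/1963", "2/14/1963"],
})

# An explicit format fixes month-first parsing rather than letting pandas guess.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%m/%d/%Y %H:%M")
sample["release_date"] = pd.to_datetime(sample["release_date"], format="%m/%d/%Y")

print(sample.dtypes)
```

If some rows used a different layout, this call would raise instead of silently misparsing them, which is usually the safer failure mode.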

In [359]:
# Let's strip accents and other non-ASCII characters from every string column in the dataframe.
# np.object was removed from recent NumPy releases, so we select object columns by name:
cols = main_df.select_dtypes(include=['object']).columns
main_df[cols] = main_df[cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
main_df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,movie_imdb_id,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year,cast,director,cast_size,crew_size,imdb_id,release_date
0,50,['7'],1251,Eight and half,422,2000-11-13 04:23:00,f,2497,14,sales/marketing,3,37922,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,1963-02-14
1,18,['7'],1251,Eight and half,422,2001-04-08 09:30:00,m,671,17,college/grad student,5,61761,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,1963-02-14
2,45,['7'],1251,Eight and half,422,2000-06-03 22:38:00,f,5590,12,programmer,2,94117,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,1963-02-14
3,25,['7'],1251,Eight and half,422,2002-01-25 21:12:00,m,1851,20,unemployed,5,59602,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,1963-02-14
4,35,['7'],1251,Eight and half,422,2000-07-08 23:52:00,f,5526,1,artist,5,27514,1/1/1963 0:00,"['Marcello Mastroianni', 'Claudia Cardinale', ...",Federico Fellini,17,16,tt0056801,1963-02-14


In [360]:
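# Let's get a general view of main_df (unique values, dtypes and missing values per column):
df_missing = pd.concat([main_df.nunique(), main_df.dtypes, main_df.isnull().sum(), 100*main_df.isnull().mean()], axis=1)
# A flat list gives plain column names; a nested list would create a MultiIndex.
df_missing.columns = ['count', 'data_type', 'missing_count', 'missing%']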
df_missing

Unnamed: 0,count,data_type,missing_count,missing%
bucketized_user_age,7,int64,0,0.0
movie_genres,302,object,0,0.0
movie_id,3656,int64,0,0.0
movie_title,3619,object,0,0.0
movie_imdb_id,3656,int64,0,0.0
timestamp,174246,datetime64[ns],0,0.0
user_gender,2,object,0,0.0
user_id,6040,int64,0,0.0
user_occupation_label,21,int64,0,0.0
user_occupation_text,21,object,0,0.0


**That's good, we have everything in place.** The unique count of movie_title (3,619) is lower than that of movie_imdb_id/imdb_id (3,656 unique movie_imdb_id values) because several distinct movies share the same title.
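
One way to confirm that the gap comes from shared titles is to count distinct ids per title. A minimal sketch on a hypothetical miniature frame (the column names match `main_df`, the rows are made up):

```python
import pandas as pd

# Hypothetical rows: two different films (ids 1 and 2) share one title.
df = pd.DataFrame({
    "movie_id": [1, 1, 2, 3],
    "movie_title": ["Hamlet", "Hamlet", "Hamlet", "Heat"],
})

# Count distinct movie_ids per title and keep the titles with more than one.
ids_per_title = df.groupby("movie_title")["movie_id"].nunique()
shared_titles = ids_per_title[ids_per_title > 1]
print(shared_titles)
```

Running the same two lines on `main_df` should list the titles responsible for the difference between the two unique counts.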

### 5. Convert Pandas Dataset to Tensorflow Dataset

We need to convert our cleaned pandas dataframe to a TensorFlow dataset that TFRS can read:

 * From the 'cast' feature, let's drop all supporting cast members and keep only the star of the movie, in a new feature called "star".
 * Let's make sure to keep only the important columns.
 * Change the data types of the important features to fit the TensorFlow Recommenders (TFRS) library.
   * Keep in mind **tfds** currently does not support **float64**, so we'll be using **int64 or float32** depending on the data.
   * We'll wrap the **pandas dataframe** into a **tf.data.Dataset** object using **tf.data.Dataset.from_tensor_slices** (for other options, see [here](https://www.srijan.net/resources/blog/building-a-high-performance-data-pipeline-with-tensorflow#gs.f33srf)).
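
If a frame does contain float64 columns, the dtype change mentioned above can be sketched as follows. The frame is hypothetical (the ratings here are stored as integers in this notebook), and only the float64 columns are touched:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one float64 column and one int64 column.
df = pd.DataFrame({"user_rating": [3.0, 5.0, 2.0], "user_id": [2497, 671, 5590]})

# Downcast every float64 column to float32, leaving other dtypes untouched.
float_cols = df.select_dtypes(include=["float64"]).columns
df = df.astype({col: np.float32 for col in float_cols})

print(df.dtypes)
```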

 

 

In [361]:
# Let's apply literal_eval to 'cast' so each value becomes a Python list instead of a string:
from ast import literal_eval
main_df['cast'] = main_df['cast'].apply(literal_eval)

In [362]:
# Now let's split main_df['cast'], whose values are lists of 4 strings, into a dataframe with one column per actor:
main_star = main_df["cast"].apply(pd.Series)

In [363]:
#let's view the data:
main_star.head()

Unnamed: 0,0,1,2,3
0,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
1,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
2,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
3,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
4,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo


In [364]:
#Let's rename first column which is the main star
main_star.rename(columns = {list(main_star)[0]: 'star'}, inplace = True)
main_star.head()

Unnamed: 0,star,1,2,3
0,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
1,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
2,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
3,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo
4,Marcello Mastroianni,Claudia Cardinale,Anouk Aimee,Sandra Milo


In [365]:
#drop all other columns:
main_star = main_star[['star']]

In [366]:
#view final dataframe:
main_star.head()

Unnamed: 0,star
0,Marcello Mastroianni
1,Marcello Mastroianni
2,Marcello Mastroianni
3,Marcello Mastroianni
4,Marcello Mastroianni
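
Expanding every list into columns with `apply(pd.Series)` can be slow on a million rows. A possible shortcut, sketched on a hypothetical frame whose 'cast' column already holds Python lists, is to index into each list directly:

```python
import pandas as pd

# Hypothetical frame whose 'cast' column already holds Python lists.
df = pd.DataFrame({
    "cast": [
        ["Marcello Mastroianni", "Claudia Cardinale"],
        ["Steve Guttenberg"],
        [],
    ]
})

# .str[0] takes the first element of each list; empty lists yield NaN
# instead of raising an IndexError.
df["star"] = df["cast"].str[0]
print(df["star"].tolist())
```

This yields the same "star" column without building the intermediate per-actor dataframe.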


In [367]:
main_star.shape

(1000085, 1)

In [368]:
main_df.shape

(1000085, 19)

In [369]:
# Now, let's add the star column to our main dataframe:
main_df = pd.concat([main_df, main_star], axis=1)
main_df.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,movie_imdb_id,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,movie_release_year,cast,director,cast_size,crew_size,imdb_id,release_date,star
0,50,['7'],1251,Eight and half,422,2000-11-13 04:23:00,f,2497,14,sales/marketing,3,37922,1/1/1963 0:00,"[Marcello Mastroianni, Claudia Cardinale, Anou...",Federico Fellini,17,16,tt0056801,1963-02-14,Marcello Mastroianni
1,18,['7'],1251,Eight and half,422,2001-04-08 09:30:00,m,671,17,college/grad student,5,61761,1/1/1963 0:00,"[Marcello Mastroianni, Claudia Cardinale, Anou...",Federico Fellini,17,16,tt0056801,1963-02-14,Marcello Mastroianni
2,45,['7'],1251,Eight and half,422,2000-06-03 22:38:00,f,5590,12,programmer,2,94117,1/1/1963 0:00,"[Marcello Mastroianni, Claudia Cardinale, Anou...",Federico Fellini,17,16,tt0056801,1963-02-14,Marcello Mastroianni
3,25,['7'],1251,Eight and half,422,2002-01-25 21:12:00,m,1851,20,unemployed,5,59602,1/1/1963 0:00,"[Marcello Mastroianni, Claudia Cardinale, Anou...",Federico Fellini,17,16,tt0056801,1963-02-14,Marcello Mastroianni
4,35,['7'],1251,Eight and half,422,2000-07-08 23:52:00,f,5526,1,artist,5,27514,1/1/1963 0:00,"[Marcello Mastroianni, Claudia Cardinale, Anou...",Federico Fellini,17,16,tt0056801,1963-02-14,Marcello Mastroianni


In [370]:
# So let's review the data types and which columns to keep:
main_df.columns

Index(['bucketized_user_age', 'movie_genres', 'movie_id', 'movie_title',
       'movie_imdb_id', 'timestamp', 'user_gender', 'user_id',
       'user_occupation_label', 'user_occupation_text', 'user_rating',
       'user_zip_code', 'movie_release_year', 'cast', 'director', 'cast_size',
       'crew_size', 'imdb_id', 'release_date', 'star'],
      dtype='object')

In [371]:
# Data types:
main_df.dtypes

bucketized_user_age               int64
movie_genres                     object
movie_id                          int64
movie_title                      object
movie_imdb_id                     int64
timestamp                datetime64[ns]
user_gender                      object
user_id                           int64
user_occupation_label             int64
user_occupation_text             object
user_rating                       int64
user_zip_code                     int64
movie_release_year               object
cast                             object
director                         object
cast_size                         int64
crew_size                         int64
imdb_id                          object
release_date             datetime64[ns]
star                             object
dtype: object

OK, so let's drop the following columns: 'movie_imdb_id', 'cast_size', 'crew_size', 'imdb_id', 'cast' and 'movie_release_year'. We keep 'timestamp', since it is converted to a Unix epoch below.

In [372]:
# Let's keep only the important columns and name our final dataframe: ratings
ratings = main_df.drop(['movie_imdb_id', 'cast_size','crew_size', 'imdb_id', 'movie_release_year', 'cast'], axis=1)


In [373]:
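# Let's get a general view of ratings (unique values, dtypes and missing values per column):
ratings_missing = pd.concat([ratings.nunique(), ratings.dtypes, ratings.isnull().sum(), 100*ratings.isnull().mean()], axis=1)
# A flat list gives plain column names; a nested list would create a MultiIndex.
ratings_missing.columns = ['count', 'data_type', 'missing_count', 'missing%']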
ratings_missing

Unnamed: 0,count,data_type,missing_count,missing%
bucketized_user_age,7,int64,0,0.0
movie_genres,302,object,0,0.0
movie_id,3656,int64,0,0.0
movie_title,3619,object,0,0.0
timestamp,174246,datetime64[ns],0,0.0
user_gender,2,object,0,0.0
user_id,6040,int64,0,0.0
user_occupation_label,21,int64,0,0.0
user_occupation_text,21,object,0,0.0
user_rating,5,int64,0,0.0


In [374]:
ratings.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,director,release_date,star
0,50,['7'],1251,Eight and half,2000-11-13 04:23:00,f,2497,14,sales/marketing,3,37922,Federico Fellini,1963-02-14,Marcello Mastroianni
1,18,['7'],1251,Eight and half,2001-04-08 09:30:00,m,671,17,college/grad student,5,61761,Federico Fellini,1963-02-14,Marcello Mastroianni
2,45,['7'],1251,Eight and half,2000-06-03 22:38:00,f,5590,12,programmer,2,94117,Federico Fellini,1963-02-14,Marcello Mastroianni
3,25,['7'],1251,Eight and half,2002-01-25 21:12:00,m,1851,20,unemployed,5,59602,Federico Fellini,1963-02-14,Marcello Mastroianni
4,35,['7'],1251,Eight and half,2000-07-08 23:52:00,f,5526,1,artist,5,27514,Federico Fellini,1963-02-14,Marcello Mastroianni


Alright, based on the output above, let's change the data types as follows to **fit the tensorflow-recommenders library**:
 * **From int64 to tf.string**: 'movie_id', 'user_id' and 'user_zip_code'
 * **From datetime64 to tf.int64**: 'timestamp' and 'release_date' (converted to Unix epoch, UTC, in seconds)
 * **From str to bool**: 'user_gender'

OK, let's first start with changing the dates ...

In [375]:
# calculate unix datetime: 'timestamp'
ratings['timestamp'] = ((ratings['timestamp'])- pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')  

In [376]:
# calculate unix datetime: 'release_date'
ratings['release_date'] = ((ratings['release_date'])- pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')  

In [377]:
# Check the converted values in the last rows:
ratings.tail()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,director,release_date,star
1000080,35,['3'],1426,Zeus & Roxanne,965500860,f,4048,0,academic/educator,4,89431,George T. Miller,854064000,Steve Guttenberg
1000081,35,['3'],1426,Zeus & Roxanne,990617760,m,4079,18,technician/engineer,5,26505,George T. Miller,854064000,Steve Guttenberg
1000082,25,['3'],1426,Zeus & Roxanne,968520540,f,3224,14,sales/marketing,3,93428,George T. Miller,854064000,Steve Guttenberg
1000083,1,['3'],1426,Zeus & Roxanne,959391660,m,5558,10,K-12 student,1,2446,George T. Miller,854064000,Steve Guttenberg
1000084,18,['3'],1426,Zeus & Roxanne,974650020,m,2116,17,college/grad student,2,49546,George T. Miller,854064000,Steve Guttenberg
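
The epoch arithmetic can be checked on two dates that appear in this notebook's output. A small sketch:

```python
import pandas as pd

# Two datetimes taken from the rows shown above.
dates = pd.Series([
    pd.Timestamp("1963-02-14"),
    pd.Timestamp("2000-11-13 04:23:00"),
])

# Subtract the Unix epoch and floor-divide by one second to get integer seconds.
# Dates before 1970 come out negative.
epoch_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
print(epoch_seconds.tolist())  # [-217123200, 974089380]
```

The negative value for the 1963 release date is expected and matches what `ratings.head()` reports after the conversion.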


Now, let's convert the following columns to strings:

In [378]:
# movie_id from int to str:
ratings['movie_id'] = ratings['movie_id'].astype('str')

In [379]:
# user_id from int to str:
ratings['user_id'] = ratings['user_id'].astype('str')

In [380]:
# user_zip_code from int to str:
ratings['user_zip_code'] = ratings['user_zip_code'].astype('str')
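
Because the zip codes were read as integers, any leading zeros are gone (for example, 2446 in the tail output above). If five-character codes are wanted, they can be padded back; this is a sketch that assumes every value is a five-digit US ZIP code, which the notebook does not verify:

```python
import pandas as pd

# Zip codes as integers, as they appear after reading the CSV.
zips = pd.Series([37922, 2446, 231])

# Cast to string, then left-pad with zeros to five characters.
zip_str = zips.astype(str).str.zfill(5)
print(zip_str.tolist())  # ['37922', '02446', '00231']
```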

Finally, let's change user_gender from str to bool (True for 'm', False for 'f'):

In [381]:
# user_gender from str to bool:
ratings.loc[ratings['user_gender'] == 'm', 'user_gender'] = True
ratings.loc[ratings['user_gender'] == 'f', 'user_gender'] = False


In [382]:
# Let's change the user_gender dtype from object to bool:
ratings['user_gender'] = ratings['user_gender'].astype('bool')
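
The same conversion can be written as a single mapping that also catches unexpected labels. A minimal sketch on a hypothetical Series, where `mapped` and `is_male` are illustrative names:

```python
import pandas as pd

gender = pd.Series(["f", "m", "f"])

# Map each label explicitly; any label outside the dict would become NaN.
mapped = gender.map({"m": True, "f": False})
assert mapped.notna().all(), "unexpected gender label"

is_male = mapped.astype(bool)
print(is_male.tolist())  # [False, True, False]
```

The explicit check matters because `astype(bool)` on its own would turn any leftover non-empty string into True.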

In [383]:
#Confirm dtypes:
ratings.dtypes

bucketized_user_age       int64
movie_genres             object
movie_id                 object
movie_title              object
timestamp                 int64
user_gender                bool
user_id                  object
user_occupation_label     int64
user_occupation_text     object
user_rating               int64
user_zip_code            object
director                 object
release_date              int64
star                     object
dtype: object

In [384]:
#Let's check our data:
ratings.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,director,release_date,star
0,50,['7'],1251,Eight and half,974089380,False,2497,14,sales/marketing,3,37922,Federico Fellini,-217123200,Marcello Mastroianni
1,18,['7'],1251,Eight and half,986722200,True,671,17,college/grad student,5,61761,Federico Fellini,-217123200,Marcello Mastroianni
2,45,['7'],1251,Eight and half,960071880,False,5590,12,programmer,2,94117,Federico Fellini,-217123200,Marcello Mastroianni
3,25,['7'],1251,Eight and half,1011993120,True,1851,20,unemployed,5,59602,Federico Fellini,-217123200,Marcello Mastroianni
4,35,['7'],1251,Eight and half,963100320,False,5526,1,artist,5,27514,Federico Fellini,-217123200,Marcello Mastroianni


In [385]:
# Let's remove the quotation marks from movie_genres:
ratings['movie_genres'] = ratings['movie_genres'].str.replace("'", "")
ratings.head()

Unnamed: 0,bucketized_user_age,movie_genres,movie_id,movie_title,timestamp,user_gender,user_id,user_occupation_label,user_occupation_text,user_rating,user_zip_code,director,release_date,star
0,50,[7],1251,Eight and half,974089380,False,2497,14,sales/marketing,3,37922,Federico Fellini,-217123200,Marcello Mastroianni
1,18,[7],1251,Eight and half,986722200,True,671,17,college/grad student,5,61761,Federico Fellini,-217123200,Marcello Mastroianni
2,45,[7],1251,Eight and half,960071880,False,5590,12,programmer,2,94117,Federico Fellini,-217123200,Marcello Mastroianni
3,25,[7],1251,Eight and half,1011993120,True,1851,20,unemployed,5,59602,Federico Fellini,-217123200,Marcello Mastroianni
4,35,[7],1251,Eight and half,963100320,False,5526,1,artist,5,27514,Federico Fellini,-217123200,Marcello Mastroianni
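
At this point `movie_genres` is still a string such as `"[7]"`, not a Python list. If a list of integer genre ids is needed downstream, it could be parsed along these lines (`parse_genres` is a hypothetical helper, and the multi-genre sample is made up):

```python
from ast import literal_eval

# Genre strings in the two shapes seen above: quoted ids and bare ids.
raw = ["['7']", "[7]", "['4', '7']"]

def parse_genres(text):
    """Return the genre ids contained in `text` as a list of ints."""
    return [int(genre) for genre in literal_eval(text)]

print([parse_genres(t) for t in raw])  # [[7], [7], [4, 7]]
```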


In [395]:
# Saving ...
ratings.to_csv(path_or_buf = "path" + "/ratings.csv", index=False)

In [386]:
# Let's wrap the pandas dataframe into a tf.data.Dataset object using tf.data.Dataset.from_tensor_slices:
rating = tf.data.Dataset.from_tensor_slices(dict(ratings))

In [387]:
type(rating)

tensorflow.python.data.ops.dataset_ops.TensorSliceDataset

In [388]:
# View the first record of the rating dataset:
for x in rating.take(1).as_numpy_iterator():
    pprint.pprint(x)

{'bucketized_user_age': 50,
 'director': b'Federico Fellini',
 'movie_genres': b'[7]',
 'movie_id': b'1251',
 'movie_title': b'Eight and half ',
 'release_date': -217123200,
 'star': b'Marcello Mastroianni',
 'timestamp': 974089380,
 'user_gender': False,
 'user_id': b'2497',
 'user_occupation_label': 14,
 'user_occupation_text': b'sales/marketing',
 'user_rating': 3,
 'user_zip_code': b'37922'}


In [389]:
# Let's select the necessary attributes.
# tf.cast makes the integer conversions explicit (inside map, int() on a tensor is converted to the same int32 cast):

rating = rating.map(lambda x: {
    "movie_id": x["movie_id"],
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"],
    "user_gender": tf.cast(x["user_gender"], tf.int32),
    "release_date": tf.cast(x["release_date"], tf.int32),
    "user_zip_code": x["user_zip_code"],
    "user_occupation_text": x["user_occupation_text"],
    "director": x["director"],
    "star": x["star"],
    "movie_genres": x["movie_genres"],
    "bucketized_user_age": tf.cast(x["bucketized_user_age"], tf.int32),
})

In [390]:
len(rating)

1000085

In [391]:
# Let's use a random split, putting 75% of the ratings in the train set and 25% in the test set.
# Assign seed=42 for consistency of results and reproducibility:
seed = 42
l = len(rating)

tf.random.set_seed(seed)
shuffled = rating.shuffle(l, seed=seed, reshuffle_each_iteration=False)

# Save 75% of the data for training and the remaining 25% for testing:
train_ = int(0.75 * l)
test_ = l - train_  # remainder, so rounding does not drop a row

train = shuffled.take(train_)
test = shuffled.skip(train_).take(test_)
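
The split sizes can be sanity-checked on plain row indices without TensorFlow. This sketch uses NumPy's generator rather than `tf.data` shuffling, so the actual row assignment differs from the cell above; only the arithmetic carries over:

```python
import numpy as np

n_rows = 1000085  # row count reported by len(rating)
train_size = int(0.75 * n_rows)
test_size = n_rows - train_size  # remainder, so the two parts cover every row

# Shuffle row indices once with a fixed seed, then slice.
rng = np.random.default_rng(42)
indices = rng.permutation(n_rows)
train_idx, test_idx = indices[:train_size], indices[train_size:]

print(train_size, test_size)  # 750063 250022
```

Note that `int(0.75 * n_rows) + int(0.25 * n_rows)` equals 1,000,084 here, one row short of the total, which is why the test size is computed as a remainder.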

In [392]:
# Now, let's find out how many unique users and movies we have:
user_ids = rating.batch(l).map(lambda x: x["user_id"])
movie_titles = rating.batch(l).map(lambda x: x["movie_title"])

# Unique movies:
unique_movie_titles = np.unique(np.concatenate(list(movie_titles)))

# Unique users:
unique_user_ids = np.unique(np.concatenate(list(user_ids)))

# take a look at the movies:
unique_movie_titles[:10]

array([b' Pret-a-Porter (Ready to Wear)', b"'night, Mother",
       b'...And God Created Woman (Et Dieu... crea la femme)',
       b'...And Justice for All', b'1-900', b'10 Things I Hate About You',
       b'101 Dalmatians', b'12 Angry Men', b'2 Days in the Valley',
       b'2 or 3 Things I Know About Her'], dtype=object)

In [393]:
# Number of unique movies:
len_films = len(unique_movie_titles)
print(len_films) 

3619


In [394]:
# Number of unique users:
len_users = len(unique_user_ids)
print(len_users) 

6040
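
The same vocabularies can also be built straight from the pandas frame, which avoids batching the whole dataset into one tensor. A sketch on a hypothetical miniature of `ratings`; the values are encoded to bytes so they look like what `tf.data` yields for string features:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ratings frame.
ratings_sample = pd.DataFrame({
    "user_id": ["2497", "671", "2497"],
    "movie_title": ["Eight and half", "Eight and half", "Zeus & Roxanne"],
})

# np.unique returns the sorted distinct values of each column.
unique_movie_titles = np.unique(
    ratings_sample["movie_title"].str.encode("utf-8").to_numpy()
)
unique_user_ids = np.unique(
    ratings_sample["user_id"].str.encode("utf-8").to_numpy()
)
print(len(unique_movie_titles), len(unique_user_ids))  # 2 2
```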


In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))