### Data Preparation
The data for this exercise comes from the Kaggle link below. The dataset file is named “movies_metadata.csv”.

The dataset holds a wide range of information about movies and needs relatively little preprocessing. We load it with Pandas and then prepare it for the later steps.

In [6]:
import pandas as pd

metadata = pd.read_csv('movies_metadata.csv', low_memory=False)

metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [9]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object

### Data Exploration

In [10]:
metadata.describe()

Unnamed: 0,revenue,runtime,vote_average,vote_count
count,45460.0,45203.0,45460.0,45460.0
mean,11209350.0,94.128199,5.618207,109.897338
std,64332250.0,38.40781,1.924216,491.310374
min,0.0,0.0,0.0,0.0
25%,0.0,85.0,5.0,3.0
50%,0.0,95.0,6.0,10.0
75%,0.0,107.0,6.8,34.0
max,2787965000.0,1256.0,10.0,14075.0
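The summary above covers only four columns because, per the info() output, columns such as budget, id and popularity load as object (string) dtype. A minimal sketch of one way to bring such a column into describe(), assuming the non-numeric entries can simply be treated as missing; the toy frame and its values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy frame standing in for `metadata`; the values are illustrative only.
sample = pd.DataFrame({
    "budget": ["30000000", "65000000", "not_a_number"],
    "revenue": [373554033.0, 262797249.0, 0.0],
})

# errors="coerce" turns strings that cannot be parsed into NaN instead of raising.
sample["budget"] = pd.to_numeric(sample["budget"], errors="coerce")

print(sample.dtypes)
print(sample.describe())
```

Whether coercing bad entries to NaN is acceptable depends on how the column is used afterwards; this notebook does not rely on budget, so the conversion is optional here.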


In [12]:
metadata.shape

(45466, 24)

### Shape of the dataframe
The dataset contains 45,466 rows and 24 columns.
Next, let us check how many values are missing in each column.

In [7]:
metadata.isnull().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64
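Raw counts are easier to compare when expressed as a share of all rows. The sketch below shows one way to do that on a small stand-in frame (the column names mirror the dataset, the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-in for `metadata`; values are illustrative only.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": ["http://example.com", np.nan, np.nan, np.nan],
    "tagline": ["t1", "t2", np.nan, "t4"],
})

# isnull().mean() is the fraction of missing values per column;
# multiply by 100 for a percentage and sort the worst columns first.
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the real dataframe, the same expression would rank columns such as belongs_to_collection and homepage at the top, consistent with the counts above.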

### Missing Values
Several columns have a large share of missing values, so we drop them.
We also drop a few other columns to keep the data simple, and we remove the rows that have no imdb_id or poster_path, since both are needed later to build links.

The code below removes these rows and columns from the dataset.

In [2]:
metadata = metadata.dropna(subset=['imdb_id','poster_path'])

metadata = metadata.drop(['belongs_to_collection','homepage','popularity','tagline','status'],axis=1)

metadata = metadata.drop(['runtime','release_date','original_language','production_countries','production_companies','spoken_languages','video'],axis=1)
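The cell above mixes two different operations: dropna(subset=...) removes rows, while drop(..., axis=1) removes columns. The sketch below separates the two on a toy frame and picks the columns to drop from a cutoff instead of a hand-written list. The 0.5 cutoff is a hypothetical value chosen for illustration, not one used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `metadata`; values are illustrative only.
sample = pd.DataFrame({
    "imdb_id": ["tt1", np.nan, "tt3", "tt4"],
    "homepage": ["http://example.com", np.nan, np.nan, np.nan],
    "overview": ["o1", "o2", np.nan, "o4"],
})

# Step 1: remove ROWS that lack a value we cannot do without.
rows_kept = sample.dropna(subset=["imdb_id"])

# Step 2: remove COLUMNS whose missing share exceeds a cutoff (hypothetical: 50%).
max_missing = 0.5
missing_share = rows_kept.isnull().mean()
to_drop = missing_share[missing_share > max_missing].index.tolist()
reduced = rows_kept.drop(columns=to_drop)

print(to_drop)
print(reduced.shape)
```

A cutoff makes the choice reproducible, but a manual list like the one above is still needed for columns that are dropped for simplicity and not for missingness.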

### Exploring Genres available in the dataset

In [13]:
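# None removes the column-width limit; the older value -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)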
print(metadata['genres'])

0        [{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]                                    
1        [{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]                                   
2        [{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]                                                                    
3        [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]                                       
4        [{'id': 35, 'name': 'Comedy'}]                                                                                                      
5        [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]            
6        [{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]                                                                    
7     

### Encoding lists
Each genres entry is stored as a string that represents a list of dictionaries.

To turn it into a clean, comma-separated string of genre names, we parse each entry with ast.literal_eval and join the names, using inline lambda functions.


In [14]:
import ast
metadata['genres'] = metadata['genres'].apply(lambda x: ast.literal_eval(x))

metadata['genres'] = metadata['genres'].apply(lambda x: ', '.join([d['name'] for d in x]))

print(metadata['genres'].head())

0    Animation, Comedy, Family 
1    Adventure, Fantasy, Family
2    Romance, Comedy           
3    Comedy, Drama, Romance    
4    Comedy                    
Name: genres, dtype: object
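The lambdas above assume every entry parses into a list of dictionaries. A more defensive variant, sketched here with a hypothetical helper named genre_names, returns an empty string for entries that are missing or malformed instead of raising:

```python
import ast
import pandas as pd

def genre_names(raw):
    """Return 'Name1, Name2' for a stringified list of dicts, or '' if it cannot be parsed."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return ""
    if not isinstance(parsed, list):
        return ""
    # Keep only well-formed dictionaries that carry a 'name' key.
    return ", ".join(d["name"] for d in parsed if isinstance(d, dict) and "name" in d)

genres = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[]",
    "not a list",
    None,
])
cleaned = genres.apply(genre_names)
print(cleaned.tolist())
```

ast.literal_eval only evaluates Python literals, which is why it is a safer choice than eval for strings read from a file.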


The step below creates the IMDB, TMDB and image URL links, which will be returned in the response to the Chanakya slackbot for movie recommendations.

In [15]:
metadata['imdbURL'] = 'https://www.imdb.com/title/' + metadata['imdb_id'] + '/'
metadata['tmdbURL'] = 'https://www.themoviedb.org/movie/' + metadata['id']
metadata['ImageURL'] = 'https://image.tmdb.org/t/p/w92' + metadata['poster_path']
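The concatenation above works because id is loaded as an object (string) column. If id were numeric, adding it to a string would fail, so a cast with astype(str) is a cheap safeguard. A small sketch on a toy frame, where the two ids and IMDB ids come from the sample rows above and the poster paths are hypothetical placeholders:

```python
import pandas as pd

sample = pd.DataFrame({
    "id": [862, 8844],                               # integers here, unlike the loaded data
    "imdb_id": ["tt0114709", "tt0113497"],
    "poster_path": ["/poster1.jpg", "/poster2.jpg"],  # hypothetical paths
})

sample["imdbURL"] = "https://www.imdb.com/title/" + sample["imdb_id"] + "/"
# astype(str) makes the concatenation safe whatever dtype `id` has.
sample["tmdbURL"] = "https://www.themoviedb.org/movie/" + sample["id"].astype(str)
sample["ImageURL"] = "https://image.tmdb.org/t/p/w92" + sample["poster_path"]

print(sample[["imdbURL", "tmdbURL", "ImageURL"]])
```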

In [4]:
metadata.isnull().sum()

adult               0
budget              0
genres              0
id                  0
imdb_id             0
original_title      0
overview          913
poster_path         0
revenue             3
title               3
vote_average        3
vote_count          3
imdbURL             0
tmdbURL             0
ImageURL            0
dtype: int64

### Writing out the data to a CSV file

The step below writes the metadata dataframe to a CSV file, ready to be used by the next step of the process.

Running the data preparation code creates "metadata_prep.csv", which is later used by the NLP models to train the movie recommendation system.

In [5]:
metadata.to_csv('metadata_prep.csv')
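By default, to_csv also writes the dataframe index as the first column, and that column shows up as "Unnamed: 0" when the file is read back. If the index carries no meaning, passing index=False avoids this. The sketch below demonstrates the round trip with an in-memory buffer standing in for the file, so nothing is written to disk:

```python
import io
import pandas as pd

sample = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["Animation, Comedy, Family", "Adventure, Fantasy, Family"],
})

# An in-memory buffer stands in for 'metadata_prep.csv'.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)   # index=False: do not write the row index

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Whether to keep the index depends on what the next step expects; if the downstream code reads the file with index_col=0, the default behaviour is the one to keep.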

