# 1. Project Title: [Data Classification]
___

#### a. Introduction

- **Objective:** Clearly state the goal of your project. What problem are you trying to solve?
- **Background:** Provide context on why this problem is important or interesting. Mention any relevant research, datasets, or industry relevance.
- **Scope:** Define the boundaries of your project. What will be included, and what will be out of scope?

#### b. Project Overview

- **Project Summary:** A brief overview of the project, including the main steps you will take to achieve the objective.
- **Milestones:** Outline the key milestones or phases of the project. For example:
  - Data Collection
  - Data Preprocessing
  - Model Selection
  - Model Training and Evaluation
  - Results and Conclusion


#### c. About the Author

- **Name:** [Ahmed Ferganey]
- **Background:** Junior Data Scientist and Machine Learning Engineer with a strong foundation in embedded systems, industrial engineering, and supply chain management. Knowledgeable in statistical analysis, NLP, Computer Vision, and deep learning, with hands-on experience in Python, SQL, and Docker.
- **Motivation:** Why are you interested in this project? What do you hope to learn or achieve?
- **Contact:** [LinkedIn profile](https://www.linkedin.com/in/ahmed-ferganey/)



#### d. Tools and Technologies

- **Programming Languages:** List the programming languages you will use (e.g., Python).
- **Libraries and Frameworks:** List the specific libraries and frameworks you will use (e.g., TensorFlow, scikit-learn).
- **Software and Tools:** Mention any software or tools necessary for the project (e.g., Jupyter Notebook, Git).

#### e. Dataset Description

- **Dataset Name:** [Name of the Dataset]
- **Source:** Where did you obtain the dataset? Include a link if possible.
- **Description:** Briefly describe the dataset, including the number of features, the target variable, and any other important details.
- **Data Preprocessing:** Outline any preprocessing steps you anticipate, such as data cleaning, normalization, or feature engineering.

#### f. Methodology

- **Model Selection:** Describe the types of models you are considering and why.
- **Evaluation Metrics:** Define how you will evaluate your models' performance (e.g., accuracy, F1-score).
- **Validation Strategy:** Explain how you will validate your models, such as cross-validation or a 


### 2. Importing Libraries
___



In [1]:
import os 
import io
import sys
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile, SelectKBest, chi2, f_classif
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from tqdm import tqdm

### 3. Reading the Raw Data
___

In [2]:
base_path = '/media/ahmed-ferganey/AI4/01-Learning_AI'
sub_folder_one = 'Movies'
sub_folder_two = 'Json'
MoviesFilesDir = f'{base_path}/MyGitHub/HeshamAsem_ML_App3/{sub_folder_one}'
JsonFilesDir = f'{base_path}/MyGitHub/HeshamAsem_ML_App3/{sub_folder_two}'
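
The directory strings above could also be built with the standard-library `pathlib` module, which joins path segments without hand-written separators. This is a minimal sketch that reuses the notebook's folder names; `project_dir` and the `csv_path` helper are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# Same folders as in the cell above, expressed as Path objects.
base_path = Path('/media/ahmed-ferganey/AI4/01-Learning_AI')
project_dir = base_path / 'MyGitHub' / 'HeshamAsem_ML_App3'  # hypothetical name
MoviesFilesDir = project_dir / 'Movies'
JsonFilesDir = project_dir / 'Json'


def csv_path(directory: Path, stem: str) -> Path:
    """Return the path of `<stem>.csv` inside `directory` (hypothetical helper)."""
    return directory / f'{stem}.csv'


keywords_path = csv_path(MoviesFilesDir, 'keywords')
print(keywords_path)
```

`pd.read_csv` accepts `Path` objects directly, so the rest of the loading code would not need to change.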


In [3]:
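def StringToDict(f):
    """Parse a stringified list of dicts into a list of dicts of strings.

    Naive parser: it splits on ',' and ':', so values that contain either
    character are cut apart, and every value is kept as a string.
    """
    # Remove the outer '[{' / '}]' wrapper, then split into one chunk per dict.
    f = f.strip('"[{').strip('}]"')
    f = f.split('}, {')
    FinalList = []

    for item in f:
        # Split each chunk into [key, value, ...] pieces.
        NewList = [i.strip().split(':') for i in item.split(',')]
        # Keep pieces that have both a key and a value; drop single quotes.
        FinalList.append({i[0].replace("'", ""): i[1].replace("'", "") for i in NewList if len(i) > 1})

    return FinalList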
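
Because the function above splits on commas and colons, a value that itself contains one of those characters is cut apart, and numbers remain strings. Since these columns hold text that looks like Python literals, the standard-library `ast.literal_eval` may be a more robust option. The sketch below assumes each cell is a string containing a valid Python list literal; `parse_literal_list` is a hypothetical name, not part of the original notebook.

```python
import ast


def parse_literal_list(text):
    """Parse a string such as "[{'id': 931, 'name': 'jealousy'}]" into a list of dicts.

    Hypothetical alternative to StringToDict; returns [] when the text is not
    a valid Python literal or does not evaluate to a list.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return []
    return value if isinstance(value, list) else []


# Made-up sample: the second name contains a comma on purpose.
sample = "[{'id': 931, 'name': 'jealousy'}, {'id': 818, 'name': 'based on novel, part 1'}]"
parsed = parse_literal_list(sample)
print(parsed[0]['id'], parsed[1]['name'])
```

With this approach the IDs come back as integers and an empty list `[]` stays an empty list instead of becoming `[{}]`.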


In [4]:
os.listdir(MoviesFilesDir)


['.ipynb_checkpoints',
 'credits.csv',
 'keywords.csv',
 'links.csv',
 'links_small.csv',
 'Movies.ipynb',
 'movies_metadata.csv',
 'ratings.csv',
 'ratings_small.csv']

In [5]:
# File paths
keywords_path = f'{MoviesFilesDir}/keywords.csv'
credits_path = f'{MoviesFilesDir}/credits.csv'
links_path = f'{MoviesFilesDir}/links.csv'
links_small_path = f'{MoviesFilesDir}/links_small.csv'
movies_metadata_path = f'{MoviesFilesDir}/movies_metadata.csv'
ratings_path = f'{MoviesFilesDir}/ratings.csv'
ratings_small_path = f'{MoviesFilesDir}/ratings_small.csv'



# Reading CSV files
keywords        = pd.read_csv(keywords_path)
credits         = pd.read_csv(credits_path)
links           = pd.read_csv(links_path)
links_small     = pd.read_csv(links_small_path)
movies_metadata = pd.read_csv(movies_metadata_path)
ratings         = pd.read_csv(ratings_path)
ratings_small   = pd.read_csv(ratings_small_path)




In [6]:
credits

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506


In [7]:
credits['castlist'] = credits['cast'].apply( lambda x : StringToDict(x))
credits['crewlist'] = credits['crew'].apply( lambda x : StringToDict(x))

In [8]:
credits

Unnamed: 0,cast,crew,id,castlist,crewlist
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'cast_id': ' 14', 'character': ' Woody (voic...","[{'credit_id': ' 52fe4284c3a36847f8024f49', 'd..."
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'cast_id': ' 1', 'character': ' Alan Parrish...","[{'credit_id': ' 52fe44bfc3a36847f80a7cd1', 'd..."
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[{'cast_id': ' 2', 'character': ' Max Goldman'...","[{'credit_id': ' 52fe466a9251416c75077a89', 'd..."
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[{'cast_id': ' 1', 'character': ' ""Savannah Va...","[{'credit_id': ' 52fe44779251416c91011acb', 'd..."
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[{'cast_id': ' 1', 'character': ' George Banks...","[{'credit_id': ' 52fe44959251416c75039ed7', 'd..."
...,...,...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050,"[{'cast_id': ' 0', 'character': ' ', 'credit_i...","[{'credit_id': ' 5894a97d925141426c00818c', 'd..."
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109,"[{'cast_id': ' 1002', 'character': ' Sister An...","[{'credit_id': ' 52fe4af1c3a36847f81e9b15', 'd..."
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758,"[{'cast_id': ' 6', 'character': ' Emily Shaw',...","[{'credit_id': ' 52fe4776c3a368484e0c8387', 'd..."
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506,"[{'cast_id': ' 2', 'character': ' ', 'credit_i...","[{'credit_id': ' 533bccebc3a36844cf0011a7', 'd..."


In [9]:
keywords

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."
...,...,...
46414,439050,"[{'id': 10703, 'name': 'tragic love'}]"
46415,111109,"[{'id': 2679, 'name': 'artist'}, {'id': 14531,..."
46416,67758,[]
46417,227506,[]


In [10]:
keywords['keylist'] = keywords['keywords'].apply( lambda x : StringToDict(x))
keywords

Unnamed: 0,id,keywords,keylist
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'id': ' 931', 'name': ' jealousy'}, {'id': '..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'id': ' 10090', 'name': ' board game'}, {'id..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[{'id': ' 1495', 'name': ' fishing'}, {'id': '..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...","[{'id': ' 818', 'name': ' based on novel'}, {'..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[{'id': ' 1009', 'name': ' baby'}, {'id': ' 15..."
...,...,...,...
46414,439050,"[{'id': 10703, 'name': 'tragic love'}]","[{'id': ' 10703', 'name': ' tragic love'}]"
46415,111109,"[{'id': 2679, 'name': 'artist'}, {'id': 14531,...","[{'id': ' 2679', 'name': ' artist'}, {'id': ' ..."
46416,67758,[],[{}]
46417,227506,[],[{}]


In [11]:
data = [keywords,credits,links,movies_metadata,ratings,ratings_small]
data

[           id                                           keywords  \
 0         862  [{'id': 931, 'name': 'jealousy'}, {'id': 4290,...   
 1        8844  [{'id': 10090, 'name': 'board game'}, {'id': 1...   
 2       15602  [{'id': 1495, 'name': 'fishing'}, {'id': 12392...   
 3       31357  [{'id': 818, 'name': 'based on novel'}, {'id':...   
 4       11862  [{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...   
 ...       ...                                                ...   
 46414  439050             [{'id': 10703, 'name': 'tragic love'}]   
 46415  111109  [{'id': 2679, 'name': 'artist'}, {'id': 14531,...   
 46416   67758                                                 []   
 46417  227506                                                 []   
 46418  461257                                                 []   
 
                                                  keylist  
 0      [{'id': ' 931', 'name': ' jealousy'}, {'id': '...  
 1      [{'id': ' 10090', 'name': ' board game'}, 

In [12]:
credits

Unnamed: 0,cast,crew,id,castlist,crewlist
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'cast_id': ' 14', 'character': ' Woody (voic...","[{'credit_id': ' 52fe4284c3a36847f8024f49', 'd..."
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'cast_id': ' 1', 'character': ' Alan Parrish...","[{'credit_id': ' 52fe44bfc3a36847f80a7cd1', 'd..."
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[{'cast_id': ' 2', 'character': ' Max Goldman'...","[{'credit_id': ' 52fe466a9251416c75077a89', 'd..."
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[{'cast_id': ' 1', 'character': ' ""Savannah Va...","[{'credit_id': ' 52fe44779251416c91011acb', 'd..."
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[{'cast_id': ' 1', 'character': ' George Banks...","[{'credit_id': ' 52fe44959251416c75039ed7', 'd..."
...,...,...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050,"[{'cast_id': ' 0', 'character': ' ', 'credit_i...","[{'credit_id': ' 5894a97d925141426c00818c', 'd..."
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109,"[{'cast_id': ' 1002', 'character': ' Sister An...","[{'credit_id': ' 52fe4af1c3a36847f81e9b15', 'd..."
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758,"[{'cast_id': ' 6', 'character': ' Emily Shaw',...","[{'credit_id': ' 52fe4776c3a368484e0c8387', 'd..."
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506,"[{'cast_id': ' 2', 'character': ' ', 'credit_i...","[{'credit_id': ' 533bccebc3a36844cf0011a7', 'd..."


In [13]:
keywords


Unnamed: 0,id,keywords,keylist
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'id': ' 931', 'name': ' jealousy'}, {'id': '..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'id': ' 10090', 'name': ' board game'}, {'id..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[{'id': ' 1495', 'name': ' fishing'}, {'id': '..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...","[{'id': ' 818', 'name': ' based on novel'}, {'..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[{'id': ' 1009', 'name': ' baby'}, {'id': ' 15..."
...,...,...,...
46414,439050,"[{'id': 10703, 'name': 'tragic love'}]","[{'id': ' 10703', 'name': ' tragic love'}]"
46415,111109,"[{'id': 2679, 'name': 'artist'}, {'id': 14531,...","[{'id': ' 2679', 'name': ' artist'}, {'id': ' ..."
46416,67758,[],[{}]
46417,227506,[],[{}]


In [14]:
links = links.rename(columns={'tmdbId':'id'})
links

Unnamed: 0,movieId,imdbId,id
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
45838,176269,6209470,439050.0
45839,176271,2028550,111109.0
45840,176273,303758,67758.0
45841,176275,8536,227506.0


In [15]:
movies_metadata

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [16]:
ratings = ratings.rename(columns={'movieId':'id'})

ratings

Unnamed: 0,userId,id,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
...,...,...,...,...
26024284,270896,58559,5.0,1257031564
26024285,270896,60069,5.0,1257032032
26024286,270896,63082,4.5,1257031764
26024287,270896,64957,4.5,1257033990


In [17]:
ratings_small = ratings_small.rename(columns={'movieId':'id'})

ratings_small

Unnamed: 0,userId,id,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


In [18]:
ratings['id'].nunique()

45115

In [19]:
ratings_small['id'].nunique()

9066

In [20]:
data = [credits,keywords,links,movies_metadata,ratings]
data

[                                                    cast  \
 0      [{'cast_id': 14, 'character': 'Woody (voice)',...   
 1      [{'cast_id': 1, 'character': 'Alan Parrish', '...   
 2      [{'cast_id': 2, 'character': 'Max Goldman', 'c...   
 3      [{'cast_id': 1, 'character': "Savannah 'Vannah...   
 4      [{'cast_id': 1, 'character': 'George Banks', '...   
 ...                                                  ...   
 45471  [{'cast_id': 0, 'character': '', 'credit_id': ...   
 45472  [{'cast_id': 1002, 'character': 'Sister Angela...   
 45473  [{'cast_id': 6, 'character': 'Emily Shaw', 'cr...   
 45474  [{'cast_id': 2, 'character': '', 'credit_id': ...   
 45475                                                 []   
 
                                                     crew      id  \
 0      [{'credit_id': '52fe4284c3a36847f8024f49', 'de...     862   
 1      [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...    8844   
 2      [{'credit_id': '52fe466a9251416c75077a89', 'de...  

In [21]:
ratings

Unnamed: 0,userId,id,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
...,...,...,...,...
26024284,270896,58559,5.0,1257031564
26024285,270896,60069,5.0,1257032032
26024286,270896,63082,4.5,1257031764
26024287,270896,64957,4.5,1257033990


### 4. Data Analysis
___

In [22]:
ratings[ratings['id'] == 110]['rating'].mean()

4.016057252826558

In [23]:
RatingMeansWithID = ratings.groupby(['id']).mean()

### 5. Data Cleaning
___

In [24]:
RatingMeansWithID['id'] = RatingMeansWithID.index


In [25]:
RatingMeansWithID.drop(['timestamp','userId'], axis=1, inplace=True)
RatingMeansWithID

id,rating,id
1,3.888157,1
2,3.236953,2
3,3.175550,3
4,2.875713,4
5,3.079565,5
...,...,...
176267,4.000000,176267
176269,3.500000,176269
176271,5.000000,176271
176273,1.000000,176273
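
The three steps above (group by `id`, copy the index into a column, drop the unused columns) can likely be collapsed into a single call. This is a minimal sketch on made-up ratings; `toy_ratings` and `rating_means` are hypothetical names used only for illustration.

```python
import pandas as pd

# Made-up ratings (illustration only), same columns as the renamed `ratings` frame.
toy_ratings = pd.DataFrame({
    'userId':    [1, 2, 3, 1],
    'id':        [110, 110, 147, 147],
    'rating':    [1.0, 5.0, 4.5, 3.5],
    'timestamp': [10, 20, 30, 40],
})

# Select only `rating` before aggregating and keep `id` as a regular column.
rating_means = toy_ratings.groupby('id', as_index=False)['rating'].mean()
print(rating_means)
```

Keeping `id` as a plain column rather than as both the index and a column avoids a frame in which the same name refers to two things.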


In [26]:
data = [credits,keywords,links,movies_metadata,RatingMeansWithID]
[i.shape for i in data]

[(45476, 5), (46419, 3), (45843, 3), (45466, 24), (45115, 2)]

In [27]:
CList = set(credits['id'].tolist())
KList = set(keywords['id'].tolist())
LList = set(links['id'].tolist())
MList = set(movies_metadata['id'].tolist())
RList = set(RatingMeansWithID['id'].tolist())

len(CList), len(KList), len(LList), len(MList), len(RList)

(45432, 45432, 45813, 45436, 45115)

In [28]:
CommonIDList= list(set(CList).intersection(KList).intersection(LList).intersection(MList).intersection(RList))
len(CommonIDList)

0
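
An empty intersection here most likely points to a type mismatch rather than a real absence of shared IDs. Python compares set members by equality, so the string `'862'` never equals the integer `862`, whereas `862 == 862.0` holds. A small sketch with made-up ID sets:

```python
# Made-up ID collections mimicking three representations of the same IDs.
int_ids = {862, 8844, 15602}              # parsed as integers
float_ids = {862.0, 8844.0, 31357.0}      # a column with missing values becomes float
str_ids = {'862', '8844', '1997-08-20'}   # a column read as text

# int and float compare equal, so they intersect as expected.
print(int_ids & float_ids)

# Strings never equal numbers, so this intersection is empty.
print(int_ids & str_ids)

# Converting the numeric-looking strings restores the overlap.
converted = {int(s) for s in str_ids if s.isdigit()}
print(int_ids & converted)
```

Checking the `dtype` of each `id` column before building the sets is a quick way to spot this kind of mismatch.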

In [29]:
CList = set(credits['id'].tolist())
KList = set(keywords['id'].tolist())
LList = set(links['id'].tolist())
MList = set(movies_metadata['id'].tolist())
RList = set(RatingMeansWithID['id'].tolist())

# Convert every ID to float so that the sets are comparable.
# movies_metadata IDs are strings; skip the date-like entries that contain '-'.
CList = [float(i) for i in CList]
KList = [float(i) for i in KList]
LList = [float(i) for i in LList]
MList = [float(i) for i in MList if '-' not in i]
RList = [float(i) for i in RList]

len(CList), len(KList), len(LList), len(MList), len(RList)

(45432, 45432, 45813, 45433, 45115)

In [30]:
CommonIDList= list(set(CList).intersection(KList).intersection(LList).intersection(MList).intersection(RList))
len(CommonIDList)

7565
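
One caveat: the `links` output above pairs each `movieId` with a different TMDB identifier (for example, `movieId` 1 with `id` 862), which suggests that the `movieId` values in `ratings` are not in the same ID space as the `id` columns of `credits`, `keywords`, and `movies_metadata`. If so, part of this overlap may be coincidental. One possible way to align the tables, sketched here on made-up rows with the original column names, is to translate `movieId` through `links` first. `toy_links`, `toy_ratings`, and `mean_by_tmdb` are hypothetical names.

```python
import pandas as pd

# Made-up rows (illustration only) using the original column names.
toy_links = pd.DataFrame({'movieId': [1, 2, 6], 'tmdbId': [862.0, 8844.0, 949.0]})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2],
    'movieId': [1, 2, 1, 6],
    'rating':  [4.0, 3.0, 5.0, 4.5],
})

# Average rating per movieId, then translate movieId -> tmdbId through links.
mean_by_movie = toy_ratings.groupby('movieId', as_index=False)['rating'].mean()
mean_by_tmdb = (
    mean_by_movie
    .merge(toy_links.dropna(subset=['tmdbId']), on='movieId', how='inner')
    .assign(id=lambda df: df['tmdbId'].astype(int))
    [['id', 'rating']]
)
print(mean_by_tmdb)
```

The resulting `id` column can then be intersected with the other tables on equal terms.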

In [31]:
CommonIDList

[2.0,
 3.0,
 131074.0,
 5.0,
 6.0,
 11.0,
 12.0,
 13.0,
 14.0,
 15.0,
 ...]

##### 5.1 Finding Nulls

##### 5.2 Outliers

##### 5.3 Feature Extraction

##### 5.4 Feature Selection

In [32]:
data = [credits,keywords,links,movies_metadata,RatingMeansWithID]
[i.shape for i in data]

[(45476, 5), (46419, 3), (45843, 3), (45466, 24), (45115, 2)]

In [33]:
credits.shape

(45476, 5)

In [34]:
credits

Unnamed: 0,cast,crew,id,castlist,crewlist
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'cast_id': ' 14', 'character': ' Woody (voic...","[{'credit_id': ' 52fe4284c3a36847f8024f49', 'd..."
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'cast_id': ' 1', 'character': ' Alan Parrish...","[{'credit_id': ' 52fe44bfc3a36847f80a7cd1', 'd..."
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[{'cast_id': ' 2', 'character': ' Max Goldman'...","[{'credit_id': ' 52fe466a9251416c75077a89', 'd..."
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[{'cast_id': ' 1', 'character': ' ""Savannah Va...","[{'credit_id': ' 52fe44779251416c91011acb', 'd..."
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[{'cast_id': ' 1', 'character': ' George Banks...","[{'credit_id': ' 52fe44959251416c75039ed7', 'd..."
...,...,...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050,"[{'cast_id': ' 0', 'character': ' ', 'credit_i...","[{'credit_id': ' 5894a97d925141426c00818c', 'd..."
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109,"[{'cast_id': ' 1002', 'character': ' Sister An...","[{'credit_id': ' 52fe4af1c3a36847f81e9b15', 'd..."
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758,"[{'cast_id': ' 6', 'character': ' Emily Shaw',...","[{'credit_id': ' 52fe4776c3a368484e0c8387', 'd..."
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506,"[{'cast_id': ' 2', 'character': ' ', 'credit_i...","[{'credit_id': ' 533bccebc3a36844cf0011a7', 'd..."


In [35]:
CommonIDList = [int(i) for i in CommonIDList]
CommonIDList

[2,
 3,
 131074,
 5,
 6,
 11,
 12,
 13,
 14,
 15,
 ...]


In [36]:
Final_Credit= credits[credits['id'].isin(CommonIDList)]
Final_Keyword= keywords[keywords['id'].isin(CommonIDList)]
Final_Links= links[links['id'].isin(CommonIDList)]


#Common_ID_Data = [Final_Credit,Final_Keyword,Final_Links,Final_Movies_metadata,Final_RatingMeansWithID]


In [37]:
Final_Credit

Unnamed: 0,cast,crew,id,castlist,crewlist
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'cast_id': ' 14', 'character': ' Woody (voic...","[{'credit_id': ' 52fe4284c3a36847f8024f49', 'd..."
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'cast_id': ' 1', 'character': ' Alan Parrish...","[{'credit_id': ' 52fe44bfc3a36847f80a7cd1', 'd..."
5,"[{'cast_id': 25, 'character': 'Lt. Vincent Han...","[{'credit_id': '52fe4292c3a36847f802916d', 'de...",949,"[{'cast_id': ' 25', 'character': ' Lt. Vincent...","[{'credit_id': ' 52fe4292c3a36847f802916d', 'd..."
9,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '52fe426ec3a36847f801e14b', 'de...",710,"[{'cast_id': ' 1', 'character': ' James Bond',...","[{'credit_id': ' 52fe426ec3a36847f801e14b', 'd..."
14,"[{'cast_id': 1, 'character': 'Morgan Adams', '...","[{'credit_id': '52fe42f4c3a36847f802f69f', 'de...",1408,"[{'cast_id': ' 1', 'character': ' Morgan Adams...","[{'credit_id': ' 52fe42f4c3a36847f802f69f', 'd..."
...,...,...,...,...,...
45416,"[{'cast_id': 1001, 'character': 'Masha', 'cred...","[{'credit_id': '52fe4a1c9251416c750de11b', 'de...",98604,"[{'cast_id': ' 1001', 'character': ' Masha', '...","[{'credit_id': ' 52fe4a1c9251416c750de11b', 'd..."
45443,"[{'cast_id': 12, 'character': 'princezna Helen...","[{'credit_id': '52fe440fc3a36847f807ffa3', 'de...",5589,"[{'cast_id': ' 12', 'character': ' princezna H...","[{'credit_id': ' 52fe440fc3a36847f807ffa3', 'd..."
45446,"[{'cast_id': 1, 'character': 'Gillian Grady', ...","[{'credit_id': '52fe46c7c3a36847f81119a1', 'de...",45527,"[{'cast_id': ' 1', 'character': ' Gillian Grad...","[{'credit_id': ' 52fe46c7c3a36847f81119a1', 'd..."
45460,"[{'cast_id': 3, 'character': 'All the members ...","[{'credit_id': '52fe478dc3a36847f813bd6b', 'de...",49280,"[{'cast_id': ' 3', 'character': ' All the memb...","[{'credit_id': ' 52fe478dc3a36847f813bd6b', 'd..."


In [38]:
Final_Keyword

Unnamed: 0,id,keywords,keylist
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'id': ' 931', 'name': ' jealousy'}, {'id': '..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'id': ' 10090', 'name': ' board game'}, {'id..."
5,949,"[{'id': 642, 'name': 'robbery'}, {'id': 703, '...","[{'id': ' 642', 'name': ' robbery'}, {'id': ' ..."
9,710,"[{'id': 701, 'name': 'cuba'}, {'id': 769, 'nam...","[{'id': ' 701', 'name': ' cuba'}, {'id': ' 769..."
14,1408,"[{'id': 911, 'name': 'exotic island'}, {'id': ...","[{'id': ' 911', 'name': ' exotic island'}, {'i..."
...,...,...,...
46359,98604,[],[{}]
46386,5589,"[{'id': 3205, 'name': 'fairy tale'}, {'id': 13...","[{'id': ' 3205', 'name': ' fairy tale'}, {'id'..."
46389,45527,[],[{}]
46403,49280,[],[{}]


In [39]:
Final_Links

# Note: here `id` is a float, not an int.

Unnamed: 0,movieId,imdbId,id
0,1,114709,862.0
1,2,113497,8844.0
5,6,113277,949.0
9,10,113189,710.0
14,15,112760,1408.0
...,...,...,...
45783,176143,2147597,98604.0
45810,176201,232750,5589.0
45813,176207,1331329,45527.0
45827,176237,135453,49280.0
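
The `id` column is probably a float because it contains missing values, which a plain integer column in pandas cannot hold. A sketch of two ways to get integer IDs back, using made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values (illustration only): one missing ID forces a float column.
toy_ids = pd.Series([862.0, 8844.0, np.nan, 949.0], name='id')
print(toy_ids.dtype)

# The nullable integer dtype keeps the missing entry as <NA>.
nullable_ids = toy_ids.astype('Int64')
print(nullable_ids.dtype)

# Alternatively, drop the missing rows and use a plain integer dtype.
plain_ids = toy_ids.dropna().astype(int)
print(plain_ids.tolist())
```

Which option fits depends on whether rows without an ID should be kept for later steps.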


In [40]:
#Final_Movies_metadata= movies_metadata[movies_metadata['id'].isin(CommonIDList)]
#Final_RatingMeansWithID= RatingMeansWithID[RatingMeansWithID['id'].isin(CommonIDList)]


In [41]:
#Final_Movies_metadata

In [42]:
Final_Movies_metadata = movies_metadata[movies_metadata['id'].isin(CommonIDList)]


In [43]:
Final_Movies_metadata.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


In [44]:
print(movies_metadata['id'].head())  # Print the first few IDs in movies_metadata
    

0      862
1     8844
2    15602
3    31357
4    11862
Name: id, dtype: object


In [45]:
non_numeric_ids = movies_metadata[~movies_metadata['id'].str.isnumeric()]
print(non_numeric_ids[['id', 'title']])


               id title
19730  1997-08-20   NaN
29503  2012-09-29   NaN
35587  2014-01-01   NaN
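
Rows like these could also be handled in one pass with `pd.to_numeric(..., errors='coerce')`, which turns values that cannot be parsed into `NaN`. This is a sketch on made-up rows, not the notebook's actual data; `toy_meta` and `clean_meta` are hypothetical names.

```python
import pandas as pd

# Made-up rows (illustration only): one ID is actually a date string.
toy_meta = pd.DataFrame({
    'id':    ['862', '1997-08-20', '8844'],
    'title': ['Toy Story', None, 'Jumanji'],
})

# Coerce non-numeric IDs to NaN, keep the valid rows, then cast to int.
# .copy() makes the filtered frame independent of the original,
# which avoids pandas' SettingWithCopyWarning on the assignment below.
numeric_id = pd.to_numeric(toy_meta['id'], errors='coerce')
clean_meta = toy_meta[numeric_id.notna()].copy()
clean_meta['id'] = numeric_id[numeric_id.notna()].astype(int)
print(clean_meta)
```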


In [46]:
movies_metadata = movies_metadata[movies_metadata['id'].str.isnumeric()]


In [47]:
movies_metadata['id'] = movies_metadata['id'].astype(int)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_metadata['id'] = movies_metadata['id'].astype(int)
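
The warning above appears because the filtered frame may still be tied to the original one. One common way to avoid it is to take an explicit copy when filtering. A minimal sketch on a hypothetical frame named `meta`:

```python
import pandas as pd

# Hypothetical stand-in for movies_metadata with one malformed id.
meta = pd.DataFrame({"id": ["862", "1997-08-20", "8844"]})

# .copy() makes the filtered result an independent DataFrame.
meta = meta[meta["id"].str.isnumeric()].copy()

# Assigning a column on the copy does not raise SettingWithCopyWarning.
meta["id"] = meta["id"].astype(int)

print(meta["id"].tolist())
```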


In [48]:
common_ids_in_metadata = set(CommonIDList) & set(movies_metadata['id'].tolist())
print(f'Number of common IDs: {len(common_ids_in_metadata)}')

# Filter the movies_metadata based on CommonIDList
Final_Movies_metadata = movies_metadata[movies_metadata['id'].isin(CommonIDList)]
Final_Movies_metadata


Number of common IDs: 7565


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",17.924927,/zMyfPUelumio3tiDKPffaUpsQTD.jpg,"[{'name': 'Regency Enterprises', 'id': 508}, {...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0
9,False,"{'id': 645, 'name': 'James Bond Collection', '...",58000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",http://www.mgm.com/view/movie/757/Goldeneye/,710,tt0113189,en,GoldenEye,James Bond must unmask the mysterious head of ...,14.686036,/5c0ovjT41KnYIHYuF4AWsTe3sKh.jpg,"[{'name': 'United Artists', 'id': 60}, {'name'...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",1995-11-16,352194034.0,130.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,No limits. No fears. No substitutes.,GoldenEye,False,6.6,1194.0
14,False,,98000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,1408,tt0112760,en,Cutthroat Island,"Morgan Adams and her slave, William Shaw, are ...",7.284477,/odM9973kIv9hcjfHPp6g6BlyTIJ.jpg,"[{'name': 'Le Studio Canal+', 'id': 183}, {'na...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",1995-12-22,10017322.0,119.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Course Has Been Set. There Is No Turning B...,Cutthroat Island,False,5.7,137.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45406,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",,98604,tt2147597,ru,Zolushka,"Masha Krapivina - is yet beautiful, and not th...",0.803588,/cBFOyxe5HzIOIjJhipKQuslZsuV.jpg,"[{'name': 'Channel One Russia', 'id': 1039}, {...","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",2012-02-14,0.0,91.0,"[{'iso_639_1': 'ru', 'name': 'Pусский'}]",Released,,Cinderella,False,4.6,6.0
45433,False,,0,"[{'id': 10402, 'name': 'Music'}, {'id': 35, 'n...",,5589,tt0232750,cs,Šíleně smutná princezna,No overview found.,0.375001,/9h5eegOHh1zoaHJCwu9meyVdQJk.jpg,"[{'name': 'Filmové Studio Barrandov', 'id': 19...","[{'iso_3166_1': 'CZ', 'name': 'Czech Republic'}]",1968-06-07,0.0,73.0,"[{'iso_639_1': 'cs', 'name': 'Český'}]",Released,,Šíleně smutná princezna,False,6.1,4.0
45436,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 9648, 'n...",,45527,tt1331329,en,The Final Storm,A stranger named Silas flees from a devastatin...,1.270832,/sEP5bLtK5IyQxOFtq5AYz4kUpzD.jpg,[{'name': 'Boll Kino Beteiligungs GmbH & Co. K...,"[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",2010-01-01,0.0,92.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Action, Horror",The Final Storm,False,3.7,11.0
45450,False,,0,"[{'id': 14, 'name': 'Fantasy'}, {'id': 28, 'na...",,49280,tt0135453,fr,L'Homme orchestre,A band-leader has arranged seven chairs for th...,1.109068,/ZLOgI7KjtWby1NEg2pjU2Id60W.jpg,"[{'name': 'Star Film Company', 'id': 45867}]","[{'iso_3166_1': 'FR', 'name': 'France'}]",1900-01-01,0.0,1.0,"[{'iso_639_1': 'xx', 'name': 'No Language'}]",Released,,The One-Man Band,False,6.5,22.0


In [49]:
print(Final_Movies_metadata['id'].head()) 


0      862
1     8844
5      949
9      710
14    1408
Name: id, dtype: int64


In [50]:
Final_RatingMeansWithID = RatingMeansWithID[RatingMeansWithID['id'].isin(CommonIDList)]


In [51]:
print(RatingMeansWithID.head())


      rating  id
id              
1   3.888157   1
2   3.236953   2
3   3.175550   3
4   2.875713   4
5   3.079565   5


In [52]:
Final_RatingMeansWithID

Unnamed: 0_level_0,rating,id
id,Unnamed: 1_level_1,Unnamed: 2_level_1
2,3.236953,2
3,3.175550,3
5,3.079565,5
6,3.841764,6
11,3.660591,11
...,...,...
176077,3.500000,176077
176085,3.000000,176085
176143,3.500000,176143
176167,3.000000,176167
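
In the output above, `id` is both the index name and a column. Operations that look up `id` by name, such as a merge on that key, can treat this as ambiguous. A sketch of two ways to remove the duplication, using a hypothetical two-row frame called `ratings`:

```python
import pandas as pd

# Hypothetical miniature with "id" as both index name and column.
ratings = pd.DataFrame(
    {"rating": [3.888157, 3.236953], "id": [1, 2]},
    index=pd.Index([1, 2], name="id"),
)

# Option A: keep the index values but drop the index name.
ratings_flat = ratings.rename_axis(None)

# Option B: discard the redundant index and use a default RangeIndex.
ratings_reset = ratings.reset_index(drop=True)

print(ratings_flat.index.name)
print(ratings_reset.index.tolist())
```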


In [53]:
""""

Now we need to create one dataset that combines all of these:

                Final_RatingMeansWithID        
                Final_Credit      
                Final_Keyword       
                Final_Links      
                Final_Movies_metadata

"""

DataPrepared = pd.DataFrame()
DataPrepared




In [54]:
DataPrepared['id'] = CommonIDList
DataPrepared

Unnamed: 0,id
0,2
1,3
2,131074
3,5
4,6
...,...
7560,131027
7561,32728
7562,98273
7563,98277
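
The ids above are not in ascending order, which is what one would expect if `CommonIDList` was built from a set intersection (an assumption; its construction is not shown in this part of the notebook). Sorting the ids gives a reproducible row order. A sketch with hypothetical id sets:

```python
import pandas as pd

# Hypothetical id sets standing in for the ids of two source tables.
credit_ids = {862, 2, 131074, 5}
keyword_ids = {5, 2, 862, 131074, 99}

# Set iteration order is not sorted; sorted() makes the order deterministic.
common_id_list = sorted(credit_ids & keyword_ids)

data_prepared = pd.DataFrame({"id": common_id_list})
print(data_prepared["id"].tolist())
```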


In [55]:
Final_Credit

Unnamed: 0,cast,crew,id,castlist,crewlist
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[{'cast_id': ' 14', 'character': ' Woody (voic...","[{'credit_id': ' 52fe4284c3a36847f8024f49', 'd..."
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[{'cast_id': ' 1', 'character': ' Alan Parrish...","[{'credit_id': ' 52fe44bfc3a36847f80a7cd1', 'd..."
5,"[{'cast_id': 25, 'character': 'Lt. Vincent Han...","[{'credit_id': '52fe4292c3a36847f802916d', 'de...",949,"[{'cast_id': ' 25', 'character': ' Lt. Vincent...","[{'credit_id': ' 52fe4292c3a36847f802916d', 'd..."
9,"[{'cast_id': 1, 'character': 'James Bond', 'cr...","[{'credit_id': '52fe426ec3a36847f801e14b', 'de...",710,"[{'cast_id': ' 1', 'character': ' James Bond',...","[{'credit_id': ' 52fe426ec3a36847f801e14b', 'd..."
14,"[{'cast_id': 1, 'character': 'Morgan Adams', '...","[{'credit_id': '52fe42f4c3a36847f802f69f', 'de...",1408,"[{'cast_id': ' 1', 'character': ' Morgan Adams...","[{'credit_id': ' 52fe42f4c3a36847f802f69f', 'd..."
...,...,...,...,...,...
45416,"[{'cast_id': 1001, 'character': 'Masha', 'cred...","[{'credit_id': '52fe4a1c9251416c750de11b', 'de...",98604,"[{'cast_id': ' 1001', 'character': ' Masha', '...","[{'credit_id': ' 52fe4a1c9251416c750de11b', 'd..."
45443,"[{'cast_id': 12, 'character': 'princezna Helen...","[{'credit_id': '52fe440fc3a36847f807ffa3', 'de...",5589,"[{'cast_id': ' 12', 'character': ' princezna H...","[{'credit_id': ' 52fe440fc3a36847f807ffa3', 'd..."
45446,"[{'cast_id': 1, 'character': 'Gillian Grady', ...","[{'credit_id': '52fe46c7c3a36847f81119a1', 'de...",45527,"[{'cast_id': ' 1', 'character': ' Gillian Grad...","[{'credit_id': ' 52fe46c7c3a36847f81119a1', 'd..."
45460,"[{'cast_id': 3, 'character': 'All the members ...","[{'credit_id': '52fe478dc3a36847f813bd6b', 'de...",49280,"[{'cast_id': ' 3', 'character': ' All the memb...","[{'credit_id': ' 52fe478dc3a36847f813bd6b', 'd..."


In [56]:
Final_Keyword

Unnamed: 0,id,keywords,keylist
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'id': ' 931', 'name': ' jealousy'}, {'id': '..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'id': ' 10090', 'name': ' board game'}, {'id..."
5,949,"[{'id': 642, 'name': 'robbery'}, {'id': 703, '...","[{'id': ' 642', 'name': ' robbery'}, {'id': ' ..."
9,710,"[{'id': 701, 'name': 'cuba'}, {'id': 769, 'nam...","[{'id': ' 701', 'name': ' cuba'}, {'id': ' 769..."
14,1408,"[{'id': 911, 'name': 'exotic island'}, {'id': ...","[{'id': ' 911', 'name': ' exotic island'}, {'i..."
...,...,...,...
46359,98604,[],[{}]
46386,5589,"[{'id': 3205, 'name': 'fairy tale'}, {'id': 13...","[{'id': ' 3205', 'name': ' fairy tale'}, {'id'..."
46389,45527,[],[{}]
46403,49280,[],[{}]
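
If the `keywords` column is stored as a stringified Python list (an assumption based on how it prints above), `ast.literal_eval` can parse it into real dictionaries without introducing stray spaces. The sample rows and the `keyword_names` helper below are hypothetical:

```python
import ast

import pandas as pd

# Hypothetical rows: keyword lists stored as strings, one of them empty.
keywords = pd.DataFrame({
    "id": [1, 2],
    "keywords": [
        "[{'id': 10, 'name': 'alpha'}, {'id': 20, 'name': 'beta'}]",
        "[]",
    ],
})


def keyword_names(raw):
    """Parse a stringified list of dicts and return only the names."""
    return [item["name"] for item in ast.literal_eval(raw)]


keywords["keyword_names"] = keywords["keywords"].apply(keyword_names)
print(keywords["keyword_names"].tolist())
```

`ast.literal_eval` only evaluates Python literals, so it is safer than `eval` for data read from a file.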


In [57]:
# Lookup tables from movie id to the parsed cast, crew, and keyword lists
IDtoCastList = dict(zip(Final_Credit['id'].tolist(), Final_Credit['castlist'].tolist()))
IDtoCrewList = dict(zip(Final_Credit['id'].tolist(), Final_Credit['crewlist'].tolist()))
IDtoKeyList = dict(zip(Final_Keyword['id'].tolist(), Final_Keyword['keylist'].tolist()))

In [58]:
Final_Credit[Final_Credit['id'] == 2] 

Unnamed: 0,cast,crew,id,castlist,crewlist
4342,"[{'cast_id': 3, 'character': 'Taisto Olavi Kas...","[{'credit_id': '52fe420dc3a36847f800001f', 'de...",2,"[{'cast_id': ' 3', 'character': ' Taisto Olavi...","[{'credit_id': ' 52fe420dc3a36847f800001f', 'd..."


In [59]:
DataPrepared['castlist'] = DataPrepared['id'].map(IDtoCastList)
DataPrepared['crewlist'] = DataPrepared['id'].map(IDtoCrewList)
DataPrepared['keylist'] = DataPrepared['id'].map(IDtoKeyList)

DataPrepared

Unnamed: 0,id,castlist,crewlist,keylist
0,2,"[{'cast_id': ' 3', 'character': ' Taisto Olavi...","[{'credit_id': ' 52fe420dc3a36847f800001f', 'd...","[{'id': ' 240', 'name': ' underdog'}, {'id': '..."
1,3,"[{'cast_id': ' 5', 'character': ' Nikander', '...","[{'credit_id': ' 52fe420dc3a36847f8000077', 'd...","[{'id': ' 1361', 'name': ' salesclerk'}, {'id'..."
2,131074,"[{'cast_id': ' 2', 'character': ' Chris Lloyd'...","[{'credit_id': ' 52fe4b6dc3a368484e18897b', 'd...","[{'id': ' 9937', 'name': ' suspense'}, {'id': ..."
3,5,"[{'cast_id': ' 42', 'character': ' Ted the Bel...","[{'credit_id': ' 52fe420dc3a36847f800011b', 'd...","[{'id': ' 612', 'name': ' hotel'}, {'id': ' 61..."
4,6,"[{'cast_id': ' 7', 'character': ' Frank Wyatt'...","[{'credit_id': ' 52fe420dc3a36847f800023d', 'd...","[{'id': ' 520', 'name': ' chicago'}, {'id': ' ..."
...,...,...,...,...
7560,131027,"[{'cast_id': ' 4', 'character': ' Csontváry / ...","[{'credit_id': ' 52fe4b6cc3a368484e188547', 'd...","[{'id': ' 437', 'name': ' painter'}, {'id': ' ..."
7561,32728,"[{'cast_id': ' 3', 'character': ' Himself - wr...","[{'credit_id': ' 52fe44e69251416c91020be1', 'd...","[{'id': ' 9682', 'name': ' history'}, {'id': '..."
7562,98273,"[{'cast_id': ' 2', 'character': ' Martha', 'cr...","[{'credit_id': ' 52fe4a0a9251416c750dbab5', 'd...","[{'id': ' 13142', 'name': ' gangster'}, {'id':..."
7563,98277,"[{'cast_id': ' 11', 'character': ' Nora', 'cre...","[{'credit_id': ' 52fe4a0a9251416c750dbb21', 'd...","[{'id': ' 90', 'name': ' paris'}, {'id': ' 254..."
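
The same combination can also be expressed as a chain of left merges on `id`. This is an alternative sketch, not the approach used above; the frames below are hypothetical miniatures of `DataPrepared`, `Final_Credit`, and `Final_Keyword`, and `validate="one_to_one"` is added so that a duplicated id raises an error rather than silently multiplying rows:

```python
import pandas as pd

# Hypothetical miniatures of the id list and the two source tables.
ids = pd.DataFrame({"id": [2, 3, 5]})
credit = pd.DataFrame({
    "id": [5, 2, 3],
    "castlist": ["cast5", "cast2", "cast3"],
    "crewlist": ["crew5", "crew2", "crew3"],
})
keyword = pd.DataFrame({"id": [3, 5, 2], "keylist": ["key3", "key5", "key2"]})

# Left merges keep the row order of `ids` and align the other columns by id.
prepared = (
    ids
    .merge(credit[["id", "castlist", "crewlist"]], on="id", how="left",
           validate="one_to_one")
    .merge(keyword[["id", "keylist"]], on="id", how="left",
           validate="one_to_one")
)

print(prepared)
```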


### 6. Visualization
___

### 7. Building the Model
___

### 8. Evaluating the Model
___