### Data Cleaning 

Let's import the dataset.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../../dataset/TMDB_movie_dataset_v11.csv')
pd.set_option('display.max_columns', None)

In [3]:
df.shape

(1084383, 24)

In [4]:
df.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,160000000,https://www.warnerbros.com/movies/inception,tt1375666,en,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,165000000,http://www.interstellarmovie.net/,tt0816692,en,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,185000000,https://www.warnerbros.com/movies/dark-knight/,tt0468569,en,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,237000000,https://www.avatar.com/movies/avatar,tt0499549,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ..."
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,220000000,https://www.marvel.com/movies/the-avengers,tt0848228,en,The Avengers,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com..."


Let's get some information about our dataset

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1084383 entries, 0 to 1084382
Data columns (total 24 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1084383 non-null  int64  
 1   title                 1084370 non-null  object 
 2   vote_average          1084383 non-null  float64
 3   vote_count            1084383 non-null  int64  
 4   status                1084383 non-null  object 
 5   release_date          925357 non-null   object 
 6   revenue               1084383 non-null  int64  
 7   runtime               1084383 non-null  int64  
 8   adult                 1084383 non-null  bool   
 9   backdrop_path         294328 non-null   object 
 10  budget                1084383 non-null  int64  
 11  homepage              116376 non-null   object 
 12  imdb_id               591742 non-null   object 
 13  original_language     1084383 non-null  object 
 14  original_title        1084370 non-

Some columns are stored with the generic object dtype, so we must assign the right type to them.

In [6]:
if (df == "-").any().any():
    print("The dataset contains '-' values.")
else:
    print("The dataset does not contain '-' values.")

The dataset contains '-' values.


The second problem is that some cells contain '-' as a placeholder value, so we fix this by replacing '-' with np.nan.

In [7]:
mask = df == "-"
rows_with_dash = df[mask.any(axis=1)]  # Rows with any "-" in them
rows_with_dash

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
252501,1165214,OFF DUTY,10.0,1,Released,,0,11,False,/h46msCu7fHoc1ocwk23eF5ylLFD.jpg,300,https://vimeo.com/user164306955,,pa,OFF DUTY,It's an ordinary winter's day when disillusion...,0.6,/8auhKxNfy8dXrsZnWnBGGLsvEBM.jpg,-,Drama,,India,Punjabi,"police, nationalism, india, martyrdom, punjab,..."
558748,1263489,Prokleta krv,0.0,0,Released,,0,12,False,,1086,https://www.youtube.com/watch?v=FPEh71e_qgQ,tt30327382,en,Prokleta krv,"After the sudden death of his father, Drago st...",0.0,,-,,,,,
587917,1330399,Le cri Barrymore,0.0,0,Released,,0,0,False,,0,https://www.amazon.com/True-Story-Barrymores-D...,tt23650606,en,Le cri Barrymore,"""The True Story of the Barrymores"" is the stor...",0.6,,-,,,,,
624547,1170368,-,0.0,0,Rumored,,0,0,False,,0,,,ur,-,,0.6,/4Xhbin1cZZZq0GJeG4SrDbYhK6m.jpg,,,,Pakistan,,
634698,1177123,Take After Take,0.0,0,Released,2023-09-11,0,40,False,/tPdYZmnwIwKy0BmQLG220HKig8v.jpg,0,,,th,Take After Take,"As they prepare for their upcoming film, the d...",1.117,/x6e92KgRObkrhDVkPVH4G53SVUA.jpg,-,Drama,,,Thai,
634831,1177462,To Thine Own Self Be True,0.0,0,Released,,0,33,False,,0,,,en,To Thine Own Self Be True,An experimental study of nature and how we hav...,0.6,,-,,,,,"greed, tragedy, desire, destiny, ego, imagery,..."
655291,1161111,The George,0.0,0,Released,,0,55,False,,0,,,en,The George,Documentary by filmmaker Clark Harding. The fi...,0.6,,-,,,,,
669366,1217323,Maahiset,0.0,0,Released,,0,90,False,,0,https://www.youtube.com/channel/UCcM2tgbHTTSKZ...,,en,Maahiset,Maahiset is a full-length drama comedy dealing...,0.0,/uh0EgTjP1mmzbocabnsZ4gJFmfx.jpg,-,,,,Finnish,


In [8]:
df.replace({'-': np.nan, "(empty)": np.nan}, inplace=True)
# Recompute the mask: the earlier one was built before the replacement
rows_with_dash = df[(df == "-").any(axis=1)]
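The placeholder replacement can also be done while loading the file, through the `na_values` parameter of `pd.read_csv`. The sketch below uses a tiny inline CSV (made-up rows, not the real dataset) to show the idea; `df_demo` and `csv_text` are illustrative names.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; the rows are illustrative only
csv_text = (
    "title,tagline\n"
    "OFF DUTY,-\n"
    "Inception,Your mind is the scene of the crime.\n"
    "-,(empty)\n"
)

# Cells that are exactly '-' or '(empty)' are read as NaN
df_demo = pd.read_csv(io.StringIO(csv_text), na_values=["-", "(empty)"])

n_missing = int(df_demo.isna().sum().sum())
still_has_dash = bool((df_demo == "-").any().any())
print(n_missing, still_has_dash)
```

Only cells that match a placeholder exactly are affected, so a dash inside a longer title is left alone.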

### Now it is all fixed

In [9]:
df.nunique().sort_values(ascending=False)

id                      1083717
original_title           965205
title                    932117
overview                 851678
poster_path              753175
imdb_id                  590506
backdrop_path            291457
production_companies     200860
keywords                 172787
tagline                  148013
homepage                 109362
release_date              42535
popularity                19888
revenue                   14258
genres                    13091
production_countries       9945
spoken_languages           6987
budget                     5568
vote_average               5024
vote_count                 3598
runtime                     744
original_language           174
status                        6
adult                         2
dtype: int64

No column has a distinct value in every row, and no column has the same value in every row, so none can be dropped on those grounds alone.

On the other hand, we have only 932117 distinct titles, which is odd considering that we have more than 1M movies/TV shows. Let's start by checking whether any row has no title.

In [10]:
missing_titles = df['title'].isna().sum()
missing_titles

15

Let's delete these rows since they won't help us.

In [11]:
df = df.dropna(subset=['title'])

Let's look for duplicates now.

In [12]:
duplicate_titles = df[df.duplicated(subset=['title'], keep=False)]
print(f"Number of duplicate titles: {len(duplicate_titles)}")

Number of duplicate titles: 209989


Let's remove the duplicates, keeping the first occurrence of each title.

In [13]:
df = df.drop_duplicates(subset=['title'])
duplicate_titles = df[df.duplicated(subset=['title'], keep=False)]
print(f"Number of duplicate titles: {len(duplicate_titles)}")

Number of duplicate titles: 0
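Note that `duplicated(keep=False)` counts every row that shares a title, while `drop_duplicates` keeps one row per title by default. Deduplicating on the title alone also merges distinct films that happen to share a name. The toy frame below (made-up rows) sketches the difference; using `release_date` as a second key is only one possible choice, not what this notebook does.

```python
import pandas as pd

# Illustrative rows: two different "Hamlet" films plus one exact repeat
movies = pd.DataFrame({
    "title": ["Hamlet", "Hamlet", "Hamlet", "Avatar"],
    "release_date": ["1948-05-04", "1990-12-19", "1990-12-19", "2009-12-15"],
})

# Every row whose title appears more than once
rows_sharing_title = int(movies.duplicated(subset=["title"], keep=False).sum())

# One row per title (first occurrence kept)
by_title = movies.drop_duplicates(subset=["title"])

# One row per (title, release_date) pair
by_title_and_date = movies.drop_duplicates(subset=["title", "release_date"])

print(rows_sharing_title, len(by_title), len(by_title_and_date))
```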


Now we will remove these columns since there is no need for them:
- id

- original_title

- vote_count

- revenue

- runtime

- backdrop_path

- budget

- homepage

- imdb_id

- overview

- popularity

- poster_path

- tagline

- spoken_languages

- keywords

In [14]:
# Drop all unneeded columns in a single call
columns_to_drop = [
    'id', 'original_title', 'vote_count', 'revenue', 'runtime',
    'backdrop_path', 'budget', 'homepage', 'imdb_id', 'overview',
    'popularity', 'poster_path', 'tagline', 'spoken_languages', 'keywords',
]
df.drop(columns=columns_to_drop, inplace=True)

In [15]:
print(df.shape)
df.nunique().sort_values(ascending=False)

(932117, 9)


title                   932117
production_companies    176464
release_date             41840
genres                   12065
production_countries      8924
vote_average              4970
original_language          173
status                       6
adult                        2
dtype: int64

### Fix data types of the misinterpreted columns

In [16]:
df.head()

Unnamed: 0,title,vote_average,status,release_date,adult,original_language,genres,production_companies,production_countries
0,Inception,8.364,Released,2010-07-15,False,en,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America"
1,Interstellar,8.417,Released,2014-11-05,False,en,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America"
2,The Dark Knight,8.512,Released,2008-07-16,False,en,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America"
3,Avatar,7.573,Released,2009-12-15,False,en,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom"
4,The Avengers,7.71,Released,2012-04-25,False,en,"Science Fiction, Action, Adventure",Marvel Studios,United States of America


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 932117 entries, 0 to 1084381
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   title                 932117 non-null  object 
 1   vote_average          932117 non-null  float64
 2   status                932117 non-null  object 
 3   release_date          796639 non-null  object 
 4   adult                 932117 non-null  bool   
 5   original_language     932117 non-null  object 
 6   genres                566403 non-null  object 
 7   production_companies  433270 non-null  object 
 8   production_countries  530919 non-null  object 
dtypes: bool(1), float64(1), object(7)
memory usage: 64.9+ MB


In [18]:
dtype_convert_dict = {
    "status": "category",
    "original_language": "category"
}
df = df.astype(dtype_convert_dict)
df['release_date'] = pd.to_datetime(df['release_date'], format='%Y-%m-%d')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 932117 entries, 0 to 1084381
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   title                 932117 non-null  object        
 1   vote_average          932117 non-null  float64       
 2   status                932117 non-null  category      
 3   release_date          796639 non-null  datetime64[ns]
 4   adult                 932117 non-null  bool          
 5   original_language     932117 non-null  category      
 6   genres                566403 non-null  object        
 7   production_companies  433270 non-null  object        
 8   production_countries  530919 non-null  object        
dtypes: bool(1), category(2), datetime64[ns](1), float64(1), object(4)
memory usage: 53.3+ MB
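The date conversion above assumes every non-missing value matches `%Y-%m-%d`. If some values might be malformed, `errors="coerce"` turns them into `NaT` instead of raising. A minimal sketch on made-up values:

```python
import pandas as pd

# Illustrative values: one valid date, one missing, one malformed
raw = pd.Series(["2010-07-15", None, "not a date"])

# Unparseable entries become NaT rather than raising an error
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

n_unparsed = int(parsed.isna().sum())
first_year = int(parsed.dt.year.iloc[0])
print(n_unparsed, first_year)
```

Whether silently coercing bad dates is acceptable depends on the analysis; comparing the `NaT` count before and after conversion shows how many values were affected.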


## Let's get some more information about our dataset

In [19]:
df["status"].value_counts()

Released           914684
In Production        6653
Post Production      5674
Planned              4645
Rumored               238
Canceled              223
Name: status, dtype: int64

Obviously, we are going to keep only the released ones.

In [20]:
df = df[df['status'] == "Released"]
df["status"].value_counts()

Released           914684
Canceled                0
In Production           0
Planned                 0
Post Production         0
Rumored                 0
Name: status, dtype: int64
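The zero counts appear because a categorical column remembers all of its categories even after filtering. If the column were kept, `cat.remove_unused_categories()` would tidy it up. A small sketch on made-up values:

```python
import pandas as pd

# Illustrative categorical column with three distinct statuses
status = pd.Series(["Released", "Rumored", "Released", "Canceled"], dtype="category")

released = status[status == "Released"]

# value_counts still lists the filtered-out categories with a count of 0
before = released.value_counts()

# Dropping unused categories leaves only 'Released'
after = released.cat.remove_unused_categories().value_counts()

print(len(before), len(after))
```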

In [21]:
df = df.drop('status',axis=1)
df = df.drop('vote_average',axis=1) # no need to take it into consideration

In [22]:
df["adult"].value_counts()

False    822993
True      91691
Name: adult, dtype: int64

We need to encode this as 1s and 0s.

In [23]:
df_one_hot = pd.get_dummies(df, columns=['adult'], drop_first=True)
df_one_hot.head()

Unnamed: 0,title,release_date,original_language,genres,production_companies,production_countries,adult_True
0,Inception,2010-07-15,en,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America",0
1,Interstellar,2014-11-05,en,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",0
2,The Dark Knight,2008-07-16,en,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America",0
3,Avatar,2009-12-15,en,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom",0
4,The Avengers,2012-04-25,en,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,0
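For a boolean column, the same 0/1 encoding can be obtained by casting directly. Also, recent pandas versions return boolean dummy columns by default, so passing `dtype=int` to `pd.get_dummies` keeps the 0/1 form shown above. A sketch on a made-up two-row frame:

```python
import pandas as pd

# Illustrative frame with a boolean 'adult' column
flags = pd.DataFrame({"title": ["A", "B"], "adult": [False, True]})

# Same call shape as in the notebook, with an explicit integer dtype
encoded = pd.get_dummies(flags, columns=["adult"], drop_first=True, dtype=int)

# Equivalent shortcut for a boolean column
direct = flags["adult"].astype(int)

print(encoded["adult_True"].tolist(), direct.tolist())
```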


In [24]:
df = df_one_hot

In [25]:
df_original_language_values = df["original_language"].value_counts()
print(len(df_original_language_values))
df_original_language_values[0:20]

173


en    485924
fr     56516
de     47460
es     47103
ja     43719
zh     27231
pt     26988
it     20480
ru     18457
ko     10487
cs      9165
nl      7365
sv      7215
ar      7092
tr      6472
hi      6165
pl      5565
tl      5195
da      4396
cn      4260
Name: original_language, dtype: int64

We will keep only the 10 most frequent languages and group all the others under a single 'Other_langage' label, since each of them is far less frequent than the top 10.

In [26]:
top_10 = df['original_language'].value_counts().nlargest(10).index
df['Language_encoded'] = df['original_language'].apply(lambda x: x if x in top_10 else 'Other_langage')
df["Language_encoded"].value_counts()
df.drop(['original_language'],axis=1,inplace=True)

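Let's one-hot encode the original language. Because drop_first=True drops the first category in sort order, 'Other_langage' becomes the baseline and gets no column of its own.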

In [27]:
df_one_hot = pd.get_dummies(df, columns=['Language_encoded'], drop_first=True)
df = df_one_hot
df.head()

Unnamed: 0,title,release_date,genres,production_companies,production_countries,adult_True,Language_encoded_de,Language_encoded_en,Language_encoded_es,Language_encoded_fr,Language_encoded_it,Language_encoded_ja,Language_encoded_ko,Language_encoded_pt,Language_encoded_ru,Language_encoded_zh
0,Inception,2010-07-15,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America",0,0,1,0,0,0,0,0,0,0,0
1,Interstellar,2014-11-05,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",0,0,1,0,0,0,0,0,0,0,0
2,The Dark Knight,2008-07-16,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America",0,0,1,0,0,0,0,0,0,0,0
3,Avatar,2009-12-15,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom",0,0,1,0,0,0,0,0,0,0,0
4,The Avengers,2012-04-25,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,0,0,1,0,0,0,0,0,0,0,0
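The top-N grouping can also be written without `apply`, using `isin` and `where`. The sketch below uses a made-up series and a top 2 instead of a top 10; the `"other"` label is an arbitrary choice.

```python
import pandas as pd

# Illustrative language codes
langs = pd.Series(["en", "en", "en", "fr", "fr", "de", "xx"])

# The two most frequent codes
top_2 = langs.value_counts().nlargest(2).index

# Keep frequent codes, replace everything else with a single label
grouped = langs.where(langs.isin(top_2), "other")

# One indicator column per remaining label
dummies = pd.get_dummies(grouped, dtype=int)

print(grouped.tolist())
print(list(dummies.columns))
```

Keeping every dummy column (no `drop_first`) makes the "other" group explicit, at the cost of one redundant column.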


In [28]:
df["genres"].value_counts()[:20]

Documentary           117630
Drama                  79343
Comedy                 50900
Animation              24400
Music                  23321
Horror                 16122
Drama, Romance          8351
Comedy, Drama           7563
Action                  6550
Romance                 6358
Documentary, Music      5017
Drama, Comedy           4939
Thriller                4833
Comedy, Romance         4617
Western                 4025
Romance, Drama          3388
Crime                   3231
Family                  3145
Music, Documentary      2838
Drama, TV Movie         2791
Name: genres, dtype: int64

These counts are misleading because many rows list several genres in a single string.

In [29]:
split_values = df['genres'].str.split(',').explode().str.strip() #str.strip to ignore spaces before some values
# Count occurrences of each unique value
value_counts = split_values.value_counts()
print(len(value_counts))
value_counts

19


Drama              172005
Documentary        144031
Comedy             117216
Animation           45777
Music               43572
Romance             42772
Horror              38877
Action              36404
Thriller            33326
Crime               26620
Family              23829
TV Movie            19606
Adventure           19403
Fantasy             17939
Science Fiction     16528
Mystery             15040
History             14302
War                  8884
Western              7415
Name: genres, dtype: int64

Let's build something similar to one-hot encoding, with one indicator column per genre.

In [30]:
one_hot_encoded = pd.get_dummies(split_values).groupby(level=0).sum()
df = df.drop(columns=['genres']).join(one_hot_encoded)
df.head()

Unnamed: 0,title,release_date,production_companies,production_countries,adult_True,Language_encoded_de,Language_encoded_en,Language_encoded_es,Language_encoded_fr,Language_encoded_it,Language_encoded_ja,Language_encoded_ko,Language_encoded_pt,Language_encoded_ru,Language_encoded_zh,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,Inception,2010-07-15,"Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America",0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,Interstellar,2014-11-05,"Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
2,The Dark Knight,2008-07-16,"DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America",0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0
3,Avatar,2009-12-15,"Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom",0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
4,The Avengers,2012-04-25,Marvel Studios,United States of America,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
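A more compact route to the same kind of multi-label encoding is `Series.str.get_dummies`, which splits and encodes in one step. This is a sketch on made-up rows, and it assumes the separator is always exactly `", "`; the split-and-strip approach used above is more tolerant of inconsistent spacing.

```python
import pandas as pd

# Illustrative genre strings, including a missing value
genres = pd.Series(
    ["Action, Science Fiction, Adventure", "Drama", None],
    index=[10, 11, 12],
)

# One 0/1 column per genre; a missing value yields an all-zero row
multi_hot = genres.str.get_dummies(sep=", ")

print(multi_hot)
```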


In [31]:
df["production_countries"].value_counts()[:20]

United States of America    146111
Japan                        36045
Germany                      28808
France                       27787
United Kingdom               26376
India                        16170
Canada                       15149
Brazil                       13403
Italy                        11444
Mexico                        9537
Spain                         9442
China                         7941
Soviet Union                  7416
Russia                        6819
South Korea                   6694
Argentina                     6182
Czechoslovakia                5732
Philippines                   5654
Sweden                        5468
Czech Republic                4972
Name: production_countries, dtype: int64

Let's apply the same split-and-count approach, starting with production companies.

In [32]:
split_values = df['production_companies'].str.split(',').explode().str.strip() #str.strip to ignore spaces before some values
# Count occurrences of each unique value
value_counts = split_values.value_counts()
print(len(value_counts))
value_counts[:15]

139449


Evil Angel               2854
Warner Bros. Pictures    2796
BBC                      2631
Columbia Pictures        2531
Metro-Goldwyn-Mayer      2482
ARTE                     2426
ONF | NFB                2202
Universal Pictures       2188
Paramount                2138
Toei Company             2093
ZDF                      2093
Nikkatsu Corporation     1710
20th Century Fox         1514
ARD                      1464
Canal+                   1347
Name: production_companies, dtype: int64

Here we notice that it is hard to take the production companies into consideration: there are a great many of them (139449), and each one produced relatively few movies.

Now let's move on to production countries.

In [33]:
df.drop(['production_companies'],axis=1,inplace=True)

In [34]:
split_values = df['production_countries'].str.split(',').explode().str.strip() #str.strip to ignore spaces before some values
# Count occurrences of each unique value
value_counts = split_values.value_counts()
print(len(value_counts))
value_counts[:20]

249


United States of America    158370
France                       37792
Japan                        37464
Germany                      36451
United Kingdom               32492
Canada                       19481
India                        16877
Italy                        16047
Brazil                       14402
Spain                        12630
Mexico                       10870
China                         9096
Russia                        7774
Soviet Union                  7672
Argentina                     7457
South Korea                   7279
Sweden                        6746
Philippines                   5991
Czech Republic                5980
Czechoslovakia                5951
Name: production_countries, dtype: int64

We will keep only the production countries with at least 7000 movies and group the rest under 'other'.

In [35]:
df['production_countries'] = df['production_countries'].str.split(',')
# Missing values stay NaN after str.split, so map anything that is not a list to an empty list
df['production_countries'] = df['production_countries'].apply(lambda x: [item.strip() for item in x] if isinstance(x, list) else [])

In [36]:
all_values = [item for sublist in df['production_countries'] for item in sublist]
value_counts = pd.Series(all_values).value_counts()
df['production_countries'] = df['production_countries'].apply(lambda x: [item if value_counts[item] >= 7000 else 'other' for item in x])
df['production_countries'] = df['production_countries'].apply(lambda x: ', '.join(x))

In [37]:
split_values = df['production_countries'].str.split(',').explode().str.strip() #str.strip to ignore spaces before some values
# Count occurrences of each unique value
value_counts = split_values.value_counts()
print(len(value_counts))
value_counts

18


                            394825
United States of America    158370
other                       139282
France                       37792
Japan                        37464
Germany                      36451
United Kingdom               32492
Canada                       19481
India                        16877
Italy                        16047
Brazil                       14402
Spain                        12630
Mexico                       10870
China                         9096
Russia                        7774
Soviet Union                  7672
Argentina                     7457
South Korea                   7279
Name: production_countries, dtype: int64

In [38]:
one_hot_encoded = pd.get_dummies(split_values).groupby(level=0).sum()
df = df.drop(columns=['production_countries']).join(one_hot_encoded)
df.head()

Unnamed: 0,title,release_date,adult_True,Language_encoded_de,Language_encoded_en,Language_encoded_es,Language_encoded_fr,Language_encoded_it,Language_encoded_ja,Language_encoded_ko,Language_encoded_pt,Language_encoded_ru,Language_encoded_zh,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,Unnamed: 33,Argentina,Brazil,Canada,China,France,Germany,India,Italy,Japan,Mexico,Russia,South Korea,Soviet Union,Spain,United Kingdom,United States of America,other
0,Inception,2010-07-15,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
1,Interstellar,2014-11-05,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
2,The Dark Knight,2008-07-16,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
3,Avatar,2009-12-15,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
4,The Avengers,2012-04-25,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
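In the output above, rows with no production country end up as an empty string after the join, which is why an unnamed column appears among the indicators. One way to avoid that column is to leave missing values as NaN while grouping rare countries. The sketch below uses made-up rows and a threshold of 2 instead of 7000; `min_count` and the other names are illustrative.

```python
import pandas as pd

# Illustrative country strings, including a missing value
countries = pd.Series([
    "United Kingdom, United States of America",
    "United States of America",
    None,
    "Malta",
])

# One row per (movie, country); the missing value stays NaN
exploded = countries.str.split(",").explode().str.strip()

min_count = 2  # stand-in for the 7000 threshold
counts = exploded.value_counts()
frequent = set(counts[counts >= min_count].index)

# Keep frequent countries and NaN, relabel the rest as 'other'
grouped = exploded.map(lambda c: c if (pd.isna(c) or c in frequent) else "other")

# NaN produces an all-zero row, so no empty-named column is created
multi_hot = pd.get_dummies(grouped, dtype=int).groupby(level=0).sum()

print(multi_hot)
```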


Let's now focus on release_date

In [39]:
def categorize_release_date(year):
    if year < 1930:
        return 'Silent Era'
    elif 1930 <= year < 1950:
        return 'Golden Age of Hollywood'
    elif 1950 <= year < 1970:
        return 'Post-War and Classic Cinema'
    elif 1970 <= year < 1990:
        return 'New Hollywood and Blockbuster Era'
    elif 1990 <= year < 2010:
        return 'Modern Cinema'
    else:
        # Also reached when year is NaN (missing release_date), since every comparison above is False
        return 'Contemporary Cinema'

In [40]:
df['release_date'] = df['release_date'].dt.year
df['release_date'] = df['release_date'].apply(categorize_release_date)
df = pd.get_dummies(df, columns=['release_date'])
df.head()

Unnamed: 0,title,adult_True,Language_encoded_de,Language_encoded_en,Language_encoded_es,Language_encoded_fr,Language_encoded_it,Language_encoded_ja,Language_encoded_ko,Language_encoded_pt,Language_encoded_ru,Language_encoded_zh,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,Unnamed: 32,Argentina,Brazil,Canada,China,France,Germany,India,Italy,Japan,Mexico,Russia,South Korea,Soviet Union,Spain,United Kingdom,United States of America,other,release_date_Contemporary Cinema,release_date_Golden Age of Hollywood,release_date_Modern Cinema,release_date_New Hollywood and Blockbuster Era,release_date_Post-War and Classic Cinema,release_date_Silent Era
0,Inception,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0
1,Interstellar,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0
2,The Dark Knight,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0
3,Avatar,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0
4,The Avengers,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
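The same era buckets can be expressed with `pd.cut`, which leaves a missing year as NaN rather than assigning it to a bucket. This is a sketch on made-up years, reusing the boundaries and labels from `categorize_release_date`; `right=False` makes each interval include its lower bound, matching the `<` comparisons.

```python
import numpy as np
import pandas as pd

# Illustrative years, including a missing one
years = pd.Series([1925, 1930, 2008, 2014, np.nan])

bins = [-np.inf, 1930, 1950, 1970, 1990, 2010, np.inf]
labels = [
    "Silent Era",
    "Golden Age of Hollywood",
    "Post-War and Classic Cinema",
    "New Hollywood and Blockbuster Era",
    "Modern Cinema",
    "Contemporary Cinema",
]

# Intervals are [lower, upper), e.g. 1930 falls in 'Golden Age of Hollywood'
eras = pd.cut(years, bins=bins, labels=labels, right=False)

print(eras.tolist())
```

Passing the result to `pd.get_dummies` would then give an all-zero row for movies without a release date.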
