# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you´ll have the opportunity to code major parts of Project 3 on your own. If you need any help or inspiration, have a look at the Videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it´s all about __getting the right results/conclusions__. It´s not about finding the identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it´s very unlikely that we have the very same code. 

## First Steps 

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified json data.

In [1]:
import pandas as pd
import json
import numpy as np

In [2]:
df_raw=pd.read_csv("movies_metadata.csv")
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

  df_raw=pd.read_csv("movies_metadata.csv")


In [3]:
df_raw.describe()

Unnamed: 0,revenue,runtime,vote_average,vote_count
count,45460.0,45203.0,45460.0,45460.0
mean,11209350.0,94.128199,5.618207,109.897338
std,64332250.0,38.40781,1.924216,491.310374
min,0.0,0.0,0.0,0.0
25%,0.0,85.0,5.0,3.0
50%,0.0,95.0,6.0,10.0
75%,0.0,107.0,6.8,34.0
max,2787965000.0,1256.0,10.0,14075.0


## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [4]:
df_dropped=df_raw.drop(['adult','imdb_id', 'original_title', 'video','homepage'], axis=1)

## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [5]:
df_evaluated=df_dropped.copy()
df_evaluated['belongs_to_collection']=df_evaluated['belongs_to_collection'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df_evaluated['genres']=df_evaluated['genres'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df_evaluated['production_countries']=df_evaluated['production_countries'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df_evaluated['production_companies']=df_evaluated['production_companies'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df_evaluated['spoken_languages']=df_evaluated['spoken_languages'].apply(lambda x: eval(x) if isinstance(x,str) else x)
df_evaluated

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,0,"[{'id': 18, 'name': 'Drama'}]",111109,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,9.0,3.0
45463,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,[],227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,0.0,0.0


## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

In [445]:
df_flattened=df_evaluated.copy()
df_flattened['belongs_to_collection']=df_evaluated['belongs_to_collection'].apply(lambda x: x.get('name') if isinstance(x,dict) else x)
df_flattened

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,0,"[{'id': 18, 'name': 'Drama'}]",111109,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,9.0,3.0
45463,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,[],227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,0.0,0.0


5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __seperate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

In [446]:
def get_items_from_dicts(ls: list, key)->str:
    item_str=''
    for dict_ in ls:
        item_str+=dict_.get(key,'')+'|'
    return item_str.strip('|')
    
df_flattened['genres']=df_evaluated['genres'].apply(lambda x: get_items_from_dicts(x, 'name') if isinstance(x,list) else x)
df_flattened

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30000000,Animation|Comedy|Family,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0
1,,65000000,Adventure|Fantasy|Family,8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,0,Romance|Comedy,15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,Comedy|Drama|Romance,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,0,Comedy,11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,Drama|Family,439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,0,Drama,111109,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,9.0,3.0
45463,,0,Action|Drama|Thriller,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,,227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,0.0,0.0


6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __seperate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

In [447]:
df_flattened['spoken_languages']=df_evaluated['spoken_languages'].apply(lambda x: get_items_from_dicts(x, 'name') if isinstance(x,list) else x)
df_flattened

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30000000,Animation|Comedy|Family,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65000000,Adventure|Fantasy|Family,8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,0,Romance|Comedy,15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,Comedy|Drama|Romance,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,0,Comedy,11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,Drama|Family,439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,0,Drama,111109,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,0,Action|Drama|Thriller,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,,227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,,Released,,Satan Triumphant,0.0,0.0


7. __Extract__ all __production countries names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __seperate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

In [448]:
df_flattened['production_countries']=df_evaluated['production_countries'].apply(lambda x: get_items_from_dicts(x, 'name') if isinstance(x,list) else x)
df_flattened

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30000000,Animation|Comedy|Family,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]",United States of America,1995-10-30,373554033.0,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65000000,Adventure|Fantasy|Family,8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",United States of America,1995-12-15,262797249.0,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,0,Romance|Comedy,15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",United States of America,1995-12-22,0.0,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,Comedy|Drama|Romance,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,United States of America,1995-12-22,81452156.0,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,0,Comedy,11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...",United States of America,1995-02-10,76578911.0,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,Drama|Family,439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],Iran,,0.0,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,0,Drama,111109,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]",Philippines,2011-11-17,0.0,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,0,Action|Drama|Thriller,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]",United States of America,2003-08-01,0.0,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,,227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]",Russia,1917-10-21,0.0,87.0,,Released,,Satan Triumphant,0.0,0.0


8. __Extract__ all __production companies names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __seperate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'

In [449]:
df_flattened['production_companies']=df_evaluated['production_companies'].apply(lambda x: get_items_from_dicts(x, 'name') if isinstance(x,list) else x)
df_flattened

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30000000,Animation|Comedy|Family,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373554033.0,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65000000,Adventure|Fantasy|Family,8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262797249.0,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,0,Romance|Comedy,15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,0.0,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,Comedy|Drama|Romance,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81452156.0,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,0,Comedy,11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76578911.0,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,Drama|Family,439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,,0.0,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,0,Drama,111109,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,0.0,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,0,Action|Drama|Thriller,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,0.0,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,,227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,0.0,87.0,,Released,,Satan Triumphant,0.0,0.0


9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [450]:
for col in df_flattened.columns:
    print(df_flattened[col].value_counts())
    print(str(df_flattened[col].dtype)+'\n')
#replace all empty strings for NaN
df_flattened.replace('',np.nan, inplace=True)
df_flattened

The Bowery Boys                       29
Totò Collection                       27
James Bond Collection                 26
Zatôichi: The Blind Swordsman         26
The Carry On Collection               25
                                      ..
No Good Middle-Aged Yakuza Trilogy     1
The Spiritual Boxer Collection         1
East Side Kids                         1
Seed Collection                        1
Red Lotus Collection                   1
Name: belongs_to_collection, Length: 1698, dtype: int64
object

0           36573
5000000       286
10000000      259
20000000      243
2000000       242
            ...  
923             1
72500000        1
2160000         1
4439832         1
1254040         1
Name: budget, Length: 1226, dtype: int64
object

Drama                              5000
Comedy                             3621
Documentary                        2723
                                   2442
Drama|Romance                      1301
                                   ...

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30000000,Animation|Comedy|Family,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373554033.0,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65000000,Adventure|Fantasy|Family,8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262797249.0,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,0,Romance|Comedy,15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,0.0,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,Comedy|Drama|Romance,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81452156.0,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,0,Comedy,11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76578911.0,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0,Drama|Family,439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,,0.0,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,0,Drama,111109,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,0.0,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,0,Action|Drama|Thriller,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,0.0,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,0,,227506,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,0.0,87.0,,Released,,Satan Triumphant,0.0,0.0


## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

### Converting budget to numeric

In [451]:
df_flattened[~df_flattened['budget'].apply(lambda x: x.isdigit())]

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19730,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,Carousel Productions|Vision View Entertainment...,1997-08-20,104.0,Released,,Midnight Man,False,6.0,1,,,,,,,,
29503,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,Aniplex|GoHands|BROSTA TV|Mardock Scramble Pro...,2012-09-29,68.0,Released,,Mardock Scramble: The Third Exhaust,False,7.0,12,,,,,,,,
35587,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,Odyssey Media|Pulser Productions|Rogue State|T...,2014-01-01,82.0,Released,Beware Of Frost Bites,Avalanche Sharks,False,4.3,22,,,,,,,,


In [452]:
#Some rows have values shuffled across columns, so we are going to arrange them
#ONLY RUN THIS 1 TIME
indices_to_change=df_flattened[~df_flattened['budget'].apply(lambda x: x.isdigit())].index
columns_ordered=['budget','poster_path','production_companies','release_date','runtime','status','tagline','title','belongs_to_collection','vote_average', 'vote_count','spoken_languages',\
'revenue','production_countries','popularity','id','original_language','genres','overview']
for idx in indices_to_change:
    df_flattened.loc[idx,columns_ordered]=df_flattened.loc[idx,:].values

In [453]:
df_flattened.loc[indices_to_change]

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19730,False,0.065736,,,,,,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,Carousel Productions|Vision View Entertainment...,,1997-08-20,,104.0,,Released,,Midnight Man,6.0,1
29503,False,1.931659,,,,,,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,Aniplex|GoHands|BROSTA TV|Mardock Scramble Pro...,,2012-09-29,,68.0,,Released,,Mardock Scramble: The Third Exhaust,7.0,12
35587,False,2.185485,,,,,,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,Odyssey Media|Pulser Productions|Rogue State|T...,,2014-01-01,,82.0,,Released,Beware Of Frost Bites,Avalanche Sharks,4.3,22


In [454]:
df_flattened['budget']=pd.to_numeric(df_flattened['budget'])
df_flattened['budget'].dtype

dtype('float64')

### Converting id to numeric

In [455]:
df_flattened['id']=pd.to_numeric(df_flattened['id'])
df_flattened['id'].dtype

dtype('float64')

### Converting popularity to numeric

In [456]:
df_flattened['popularity']=pd.to_numeric(df_flattened['popularity'])
df_flattened['popularity'].dtype

dtype('float64')

11. __Analyze__ the columns __"budget"__ and __"revenue"__ and __"runtime"__. Analyze movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

In [457]:
df_flattened['budget'].replace(0.0,np.nan,inplace=True)
df_flattened['revenue'].replace(0.0,np.nan,inplace=True)
df_flattened['runtime'].replace(0.0,np.nan,inplace=True)

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

In [458]:
df_flattened['budget']=df_flattened['budget']/1000000
df_flattened['revenue']=df_flattened['revenue']/1000000

13. __Analyze__ movies with a __vote_count of 0__. What´s the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

In [459]:
df_flattened.vote_average[(df_flattened.vote_count==0)].any() #Al menos es consistente que todos los vote_average son 0 para las pelis que no tienen ningún voto

False

In [460]:
df_flattened.vote_average.replace(0,np.nan, inplace=True)

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values as NaN.

In [461]:
df_flattened['release_date']=pd.to_datetime(df_flattened['release_date'])

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace as NaN__ (np.nan)!

### cleaning overview

In [462]:
#check values of 3 words or less:
df_valid_overviews=df_flattened.copy()
df_overviews_str=df_valid_overviews[df_valid_overviews.overview.apply(lambda x: True if isinstance(x,str) else False)]
df_overviews_less_3=df_overviews_str[df_overviews_str.overview.apply(lambda x: True if len(x.split())<=3  else False)]
df_overviews_less_3.overview.value_counts().index

Index(['No overview found.', 'No Overview', ' ', 'Bollywood 2009',
       'No overview yet.', 'Feature film.', 'English gangster flick.',
       'Netflix Special', 'No overview.', 'Bigfoot documentary',
       'Plot unknown', 'x', 'Documentary', 'French Film', 'japanese movie',
       'Finnish documentary.', 'Tupla-Uuno', 'no overview yet',
       'Finnish Road Movie', 'No overview', 'A skeleton dances.',
       'American Documentary', '...', 'Discovery Channel documentary',
       'Return of Django', 'Directed by', 'Finnish Documentary',
       'Black and White', 'Not Available', 'German Comedy', 'Dom är tillbaka!',
       'Italian film', '2013 Italian comedy', 'No overview found',
       '70s Italian comedy.', 'Italian comedy', 'Scandal: Sex@students.edu'],
      dtype='object')

In [463]:
no_overview_pos=[0,1,2,4,8,10,11,17,19,22,25,28,33]
incorrect_overviews=df_overviews_less_3.overview.value_counts().index[no_overview_pos].to_list()
incorrect_overviews

['No overview found.',
 'No Overview',
 ' ',
 'No overview yet.',
 'No overview.',
 'Plot unknown',
 'x',
 'no overview yet',
 'No overview',
 '...',
 'Directed by',
 'Not Available',
 'No overview found']

In [464]:
ids_to_modify=df_overviews_str[df_overviews_str.overview.isin(incorrect_overviews)].index
ids_to_modify

Int64Index([  104,   800,   972,  1613,  2230,  3238,  3288,  3763,  4246,
             4538,
            ...
            43239, 43333, 43786, 43836, 44142, 44607, 45251, 45272, 45292,
            45433],
           dtype='int64', length=156)

In [465]:
df_valid_overviews.loc[ids_to_modify,'overview']=np.nan
df_valid_overviews

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,Drama|Family,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,NaT,,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,,Drama,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,,Action|Drama|Thriller,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,,87.0,,Released,,Satan Triumphant,,0.0


### cleaning tagline

In [466]:
#Valid taglines are naturally short so it will need different tactics. 
#First we will check for very short ones to capture invalid taglines
df_valid_taglines=df_valid_overviews.copy()
df_taglines_str=df_valid_taglines[df_valid_taglines.tagline.apply(lambda x: True if isinstance(x,str) else False)]
df_taglines_less_3=df_taglines_str[(df_taglines_str.tagline.str.split().apply(len)<=2) & (df_taglines_str.tagline.str.len()<=15)]
df_taglines_less_3.tagline.unique()
# '   ' and '-' are the only invalid ones
incorrect_taglines=['-','   ']

In [467]:
#Now lets check for taglines that have the word 'no' to catch possible phrases like 'no data', etc
df_taglines_contains_no=df_taglines_str[(df_taglines_str.tagline.str.contains('No ', case=False)) & (df_taglines_str.tagline.str.split().apply(len)<=4)]
df_taglines_contains_no.tagline.unique()
#There are no taglines with 'no data' type of phrases

array(['There is no defense.', 'No Guards. No Walls.', 'Trust No One',
       "This Ain't No Game.", 'Trust no one.', 'Love knows no limits.',
       'Love Knows No Bounds.', 'Know life. No limits.',
       'Vengeance takes no prisoners.', 'There is no substitute.',
       'Passion has no limits', 'Vengeance knows no boundaries.',
       'No ordinary love story...', 'This is no game.',
       'No Money. Still Funny.', "It's no picnic!",
       'Leave No Man Behind.', 'No Soul Is Safe.', 'Hell hath no fury...',
       "It's No Laughing Matter.", 'No king rules forever.',
       'No Static At All!', 'No gut, no glory', 'No Sleep To Brooklyn!',
       'Desire knows no bounds.', 'Serve No Master', 'No day but today.',
       'Vengeance Knows No Mercy', 'Possession Knows No Bounds',
       'No Gut... No Glory.', 'Courage know no limit',
       'Americans...they have no imagination.', "...It's No Fairy Tale",
       'There is no escape.', 'No body is safe.', 'No goats. No glory.',
       'No

In [468]:
ids_to_modify=df_taglines_less_3[df_taglines_less_3.tagline.isin(incorrect_taglines)].index
df_valid_taglines.loc[ids_to_modify,'tagline']=np.nan
df_valid_taglines

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,Drama|Family,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,NaT,,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,,Drama,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,,Action|Drama|Thriller,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,,87.0,,Released,,Satan Triumphant,,0.0


## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [469]:
df_valid_taglines[df_valid_taglines.duplicated(keep=False, subset='id')]

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
676,,,Drama|Romance,105045.0,de,"East-Berlin, 1961, shortly after the erection ...",0.122178,/5WFIrBhOOgc0jGmoLxMZwWqCctO.jpg,Studio Babelsberg|Centre National de la Cinéma...,Germany,1995-02-16,,115.0,Deutsch,Released,"A love, a hope, a wall.",The Promise,5.0,1.0
838,,,Drama,132641.0,ja,"Ten years into a marriage, the wife is disappo...",0.096079,/z9WiHt5uQjs8L8tyBpRBKzlheF2.jpg,Toho Company,Japan,1953-04-29,,89.0,日本語,Released,,Wife,,0.0
949,,4e-06,Drama|Romance|War,22649.0,en,British nurse Catherine Barkley (Helen Hayes) ...,1.914697,/k0MF0IIbJ2PfOIku2KyraXL72d8.jpg,Paramount Pictures,United States of America,1932-12-08,2.5e-05,89.0,English,Released,Every woman who has loved will understand,A Farewell to Arms,6.2,29.0
1465,,,Drama|Romance,105045.0,de,"East-Berlin, 1961, shortly after the erection ...",0.122178,/5WFIrBhOOgc0jGmoLxMZwWqCctO.jpg,Studio Babelsberg|Centre National de la Cinéma...,Germany,1995-02-16,,115.0,Deutsch,Released,"A love, a hope, a wall.",The Promise,5.0,1.0
2564,,,Documentary,84198.0,en,"Using personal stories, this powerful document...",0.501046,/jn8L1QdWWX5c0NUOLjzaSXtZrbt.jpg,,United States of America,2012-03-22,,84.0,English,Released,One Nation. Underfed.,A Place at the Table,6.9,7.0
4114,Pokémon Collection,16.0,Adventure|Fantasy|Animation|Action|Family,10991.0,ja,When Molly Hale's sadness of her father's disa...,10.264597,/5ILjS6XB5deiHop8SXPsYxXWVPE.jpg,TV Tokyo|4 Kids Entertainment|Nintendo|Pikachu...,Japan,2000-07-08,68.411275,93.0,English,Released,Pokémon: Spell of the Unknown,Pokémon: Spell of the Unknown,6.0,143.0
4356,,3.512454,Drama,110428.0,fr,"Winter, 1915. Confined by her family to an asy...",0.134014,/sGMPDg6je1zKi0TiX9b4pP6yN02.jpg,Canal+|Arte France Cinéma|3B Productions|C.R.R...,France,2013-03-13,0.11586,95.0,Français,Released,,Camille Claudel 1915,7.0,20.0
5130,,26.0,Adventure|Family|Science Fiction|Thriller,15028.0,en,"Until now, Zak Gibbs' greatest challenge has b...",5.373623,/xGhDPrBz9mJN8CsIjA23jQSd3sc.jpg,Paramount Pictures|Nickelodeon Movies|Valhalla...,United States of America,2002-03-17,38.793283,94.0,Český|English,Released,"The adventure of a lifetime, in a few mere sec...",Clockstoppers,4.9,89.0
5535,Pokémon Collection,,Adventure|Fantasy|Animation|Science Fiction|Fa...,12600.0,ja,"All your favorite Pokémon characters are back,...",7.072301,/bqL0PVHbQ8Jmw3Njcl38kW0CoeM.jpg,,Japan|United States of America,2001-07-06,28.023563,75.0,日本語,Released,,Pokémon 4Ever: Celebi - Voice of the Forest,5.7,82.0
5710,,,Drama,109962.0,en,Two literary women compete for 20 years: one w...,12.180836,/tOflyY8eUFWubLKJH7fKg4KwpCl.jpg,Metro-Goldwyn-Mayer (MGM)|Jaquet,United States of America,1981-09-23,,115.0,English,Released,"From the very beginning, they knew they'd be f...",Rich and Famous,4.9,7.0


In [470]:
df_no_duplicates=df_valid_taglines[~df_valid_taglines.duplicated(keep='first',subset='id')]
df_no_duplicates

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,Drama|Family,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,NaT,,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,,Drama,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,,Action|Drama|Thriller,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,,87.0,,Released,,Satan Triumphant,,0.0


## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

In [471]:
#Unknown ids
df_no_duplicates[df_no_duplicates.id.isna()]

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19730,False,6.5736e-08,,,,,,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,Carousel Productions|Vision View Entertainment...,,1997-08-20,,104.0,,Released,,Midnight Man,6.0,1


In [472]:
#Unknown titles
df_no_duplicates[df_no_duplicates.title.isna()]

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19729,,,Action|Thriller|Drama,82663.0,en,British soldiers force a recently captured IRA...,,,,,NaT,,,,,,,,
29502,Mardock Scramble Collection,,Animation|Science Fiction,122662.0,ja,Third film of the Mardock Scramble series.,,,,,NaT,,,,,,,,
35586,,,TV Movie|Action|Horror|Science Fiction,249260.0,en,A group of skiers are terrorized during spring...,,,,,NaT,,,,,,,,


18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [473]:
df_dropped_ids=df_no_duplicates.dropna(subset='id')
df_dropped_title=df_dropped_ids.dropna(subset='title')
df_dropped_title

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,Released,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,Drama|Family,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,NaT,,90.0,فارسی,Released,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,,Drama,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,,360.0,,Released,,Century of Birthing,9.0,3.0
45463,,,Action|Drama|Thriller,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,,90.0,English,Released,A deadly game of wits.,Betrayal,3.8,6.0
45464,,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,,87.0,,Released,,Satan Triumphant,,0.0


## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

In [474]:
df_released=df_dropped_title[df_dropped_title.status=='Released'].copy()
df_released.drop('status', axis=1, inplace=True)
df_released

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,tagline,title,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,United States of America,1995-10-30,373.554033,81.0,English,,Toy Story,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,1995-12-15,262.797249,104.0,English|Français,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,United States of America,1995-12-22,,101.0,English,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16.0,Comedy|Drama|Romance,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,United States of America,1995-12-22,81.452156,127.0,English,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,Father of the Bride Collection,,Comedy,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,United States of America,1995-02-10,76.578911,106.0,English,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,Drama|Family,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,Iran,NaT,,90.0,فارسی,Rising and falling between a man and woman,Subdue,4.0,1.0
45462,,,Drama,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,Philippines,2011-11-17,,360.0,,,Century of Birthing,9.0,3.0
45463,,,Action|Drama|Thriller,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,United States of America,2003-08-01,,90.0,English,A deadly game of wits.,Betrayal,3.8,6.0
45464,,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,Russia,1917-10-21,,87.0,,,Satan Triumphant,,0.0


20. The Order of the columns should be as follows: 

In [475]:
ordered_columns=["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

In [476]:
df_released.rename(columns={'budget':'budget_musd','revenue':'revenue_musd'}, inplace=True)
df_released=df_released[ordered_columns]
df_released

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862.0,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,United States of America,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844.0,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602.0,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,United States of America,92.0,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,English,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,31357.0,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,United States of America,34.0,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,11862.0,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,United States of America,173.0,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,/e64sOI48hQXyru7naBFyssKFxVd.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,439050.0,Subdue,Rising and falling between a man and woman,NaT,Drama|Family,,fa,,,,Iran,1.0,4.0,0.072051,90.0,Rising and falling between a man and woman.,فارسی,/jldsYflnId4tTWPx8es3uzsB1I8.jpg
45462,111109.0,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,Philippines,3.0,9.0,0.178241,360.0,An artist struggles to finish his work while a...,,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg
45463,67758.0,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,United States of America,6.0,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",English,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg
45464,227506.0,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,Russia,0.0,,0.003503,87.0,"In a small town live two brothers, one a minis...",,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg


21. __Reset__ the Index and create a __RangeIndex__.

In [477]:
df_released.reset_index(inplace=True)
df_released.index

RangeIndex(start=0, stop=44985, step=1)

22. __Save__ the cleaned dataset in a __csv-file__.

In [478]:
df_released.to_csv('clean_movies_dataset.csv')