# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 3 on your own. If you need any help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about writing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, your code will very likely differ from ours.

## First Steps 

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified JSON data.

In [154]:
import pandas as pd
import numpy as np
import ast
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import json

In [155]:
df = pd.read_csv('movies_metadata.csv')

In [156]:
df.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [157]:
df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
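For task 1, the nested columns can be spotted by eye in `df.head()`, but a quick programmatic check is also possible. The following is a minimal sketch on a made-up toy frame; the helper `looks_nested` is hypothetical (not part of pandas) and only looks at the first non-missing value of each column, so it is a heuristic rather than a guarantee.

```python
import pandas as pd

# Toy frame standing in for movies_metadata.csv (values are illustrative only).
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "belongs_to_collection": ["{'id': 10194, 'name': 'Toy Story Collection'}", None],
    "budget": ["30000000", "65000000"],
})

def looks_nested(series):
    """Hypothetical helper: True if the first non-missing value is a string starting with '[' or '{'."""
    values = series.dropna()
    if values.empty:
        return False
    first = values.iloc[0]
    return isinstance(first, str) and first.lstrip().startswith(("[", "{"))

# Collect the names of all columns that look like stringified lists/dicts.
nested_cols = [col for col in toy.columns if looks_nested(toy[col])]
print(nested_cols)
```

On the real dataset, the same loop would run over `df.columns` instead of `toy.columns`.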

## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [158]:
df = df.drop(['adult', 'imdb_id', 'original_title', 'video', 'homepage'], axis = 1)

## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [159]:
json_col = ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] 

In [160]:
for col in json_col:
    df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x,str) else np.nan)
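The reason for `ast.literal_eval()` rather than `json.loads()` is that the strings use single quotes, which makes them Python literals but not valid JSON. The sketch below illustrates this on one sample string. The wrapper `safe_eval` is a hypothetical helper that additionally returns NaN for strings that cannot be parsed, which is an assumption about how malformed entries should be treated.

```python
import ast
import json

import numpy as np

# Single-quoted keys and values: a valid Python literal, but not valid JSON.
raw = "[{'id': 16, 'name': 'Animation'}]"

try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval safely evaluates Python literals (lists, dicts, strings, numbers, ...).
parsed = ast.literal_eval(raw)

def safe_eval(value):
    """Hypothetical helper: parse a stringified literal, return NaN for anything else."""
    if not isinstance(value, str):
        return np.nan
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return np.nan

print(json_ok, parsed)
```

`safe_eval` could be passed to `apply` in place of the lambda if the column is suspected to contain malformed strings.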

In [161]:
df.head()

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,"Cheated on, mistreated and stepped on, the wom...",3.85949,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,Just when George Banks has recovered from his ...,8.38752,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0


## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

7. __Extract__ all __production country names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

8. __Extract__ all __production company names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'.

9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [162]:
df.belongs_to_collection

0        {'id': 10194, 'name': 'Toy Story Collection', ...
1                                                      NaN
2        {'id': 119050, 'name': 'Grumpy Old Men Collect...
3                                                      NaN
4        {'id': 96871, 'name': 'Father of the Bride Col...
                               ...                        
45461                                                  NaN
45462                                                  NaN
45463                                                  NaN
45464                                                  NaN
45465                                                  NaN
Name: belongs_to_collection, Length: 45466, dtype: object

In [163]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: x['name'] if isinstance(x,dict) else np.nan)

In [164]:
df.genres = df.genres.apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x,list) else np.nan)
df.genres = df.genres.replace("", np.nan)

In [165]:
df.spoken_languages = df.spoken_languages.apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [166]:
df.spoken_languages = df.spoken_languages.replace("", np.nan)

In [167]:
df.production_companies = df.production_companies.apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [168]:
df.production_companies = df.production_companies.replace("", np.nan)
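Note that the cells above do not cover task 7: "production_countries" is never flattened, which is why it still shows lists of dicts in the final output further down. A minimal sketch of how it could be handled in the same way as the other columns follows. The helper `join_names` and the toy data are assumptions for illustration; treating an empty list as NaN is a design choice that mirrors the `replace("", np.nan)` step used above.

```python
import numpy as np
import pandas as pd

def join_names(value, sep="|"):
    """Hypothetical helper: join the 'name' entries of a non-empty list of dicts, NaN otherwise."""
    if isinstance(value, list) and value:
        return sep.join(item["name"] for item in value)
    return np.nan

# Toy column in the shape produced by ast.literal_eval (values are illustrative only).
countries = pd.Series([
    [{"iso_3166_1": "US", "name": "United States of America"}],
    [{"iso_3166_1": "DE", "name": "Germany"},
     {"iso_3166_1": "US", "name": "United States of America"}],
    [],
    np.nan,
], dtype="object")

flat = countries.apply(join_names)
print(flat.tolist())
```

On the real data this would read `df.production_countries = df.production_countries.apply(join_names)`; the same helper would also work for "genres", "spoken_languages" and "production_companies".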

## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

11. __Analyze__ the columns __"budget"__, __"revenue"__ and __"runtime"__. Analyze movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

13. __Analyze__ movies with a __vote_count of 0__. What's the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

In [169]:
df.budget = pd.to_numeric(df.budget, errors = 'coerce').replace(0, np.nan)
    


In [170]:
df.id = pd.to_numeric(df.id, errors = 'coerce', downcast = 'integer').replace(0, np.nan)

In [171]:
df.popularity = pd.to_numeric(df.popularity, errors = 'coerce').replace(0, np.nan)

In [174]:
df.budget = df.budget.div(1000000)

In [178]:
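df.revenue = df.revenue.div(1000000)
# Caution: this replaces 0 with NaN in every column of df (including vote_count and vote_average), not only in revenue
df.replace(0, np.nan, inplace = True)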

In [187]:
df.runtime = df.runtime.replace(0, np.nan)

In [217]:
df.loc[df['vote_count'] == 0, 'vote_average']

Series([], Name: vote_average, dtype: float64)
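The empty result above is a side effect of the earlier `df.replace(0, np.nan, inplace = True)`, which had already turned every vote_count of 0 into NaN before this check ran. To actually answer task 13, the inspection has to happen before any replacement. The following sketch uses made-up toy data. Keeping vote_count at 0 while setting only vote_average to NaN is one possible choice (an assumption here), whereas the cells above set both to NaN.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two movies that nobody has voted on.
votes = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [120.0, 0.0, 0.0],
    "vote_average": [6.5, 0.0, 0.0],
})

# Inspect first: which rating is stored for movies without any votes?
no_votes = votes["vote_count"] == 0
print(votes.loc[no_votes, "vote_average"].value_counts())

# An average over zero votes is undefined, so mark it as missing.
votes.loc[no_votes, "vote_average"] = np.nan
print(votes)
```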

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values as NaT (the datetime counterpart of NaN).

In [200]:
df.release_date = pd.to_datetime(df.release_date, errors = 'coerce')
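To see what `errors = 'coerce'` does, here is a small sketch on a made-up Series: values that cannot be parsed as dates become `NaT`, the missing-value marker for datetime columns, instead of raising an error.

```python
import pandas as pd

# Toy input (illustrative only): one valid date, one invalid string, one missing value.
dates = pd.Series(["1995-10-30", "not a date", None])

# Unparseable entries are coerced to NaT rather than raising an exception.
parsed = pd.to_datetime(dates, errors="coerce")
print(parsed)

# Count how many values could not be converted (or were missing to begin with).
print(parsed.isna().sum())
```

Comparing `df.release_date.isna().sum()` before and after the conversion shows how many entries were invalid rather than merely missing.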

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace as NaN__ (np.nan)!

In [216]:
df.overview.value_counts()

King Lear, old and tired, divides his kingdom among his daughters, giving great importance to their protestations of love for him. When Cordelia, youngest and most honest, refuses to idly flatter the old man in return for favor, he banishes her and turns for support to his remaining daughters. But Goneril and Regan have no love for him and instead plot to take all his power from him. In a parallel, Lear's loyal courtier Gloucester favors his illegitimate son Edmund after being told lies about his faithful son Edgar. Madness and tragedy befall both ill-starred fathers.                                                         3
Recovering from a nail gun shot to the head and 13 months of coma, doctor Pekka Valinta starts to unravel the mystery of his past, still suffering from total amnesia.                                                                                                                                                                                                         

In [215]:
df.overview = df.overview.replace(['No overview found.', 'No overview', 'No Overview', ' ', '..', '','No movie overview available.'], np.nan)

In [221]:
df.tagline.value_counts()

Based on a true story.                                                                                        7
Trust no one.                                                                                                 4
Be careful what you wish for.                                                                                 4
How far would you go?                                                                                         3
Some doors should never be opened.                                                                            3
                                                                                                             ..
In a time of crisis, America will call Reno 911!                                                              1
A Latino comic's near-death experience forces him to revisit his personal and professional highs and lows.    1
Driven By Love...And Bank Robbing                                                                       

In [220]:
df.tagline = df.tagline.replace(['-', ' ', ''], np.nan)
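Scanning a long `value_counts()` output by eye can miss placeholders. One possible heuristic, sketched below on made-up data, is to look at values that repeat and at values that are very short after stripping whitespace. The length threshold of 5 is an arbitrary assumption, and the candidates still need manual review because a repeated value can be legitimate (for example, a genuinely shared tagline).

```python
import numpy as np
import pandas as pd

# Toy column (illustrative only) with two kinds of disguised missing values.
overview = pd.Series([
    "Led by Woody, Andy's toys live happily in his room.",
    "No overview found.",
    " ",
    "No overview found.",
    "A family wedding reignites an ancient feud.",
    np.nan,
])

# Candidates: values that occur more than once ...
counts = overview.value_counts()
repeated = set(counts[counts > 1].index)

# ... or that are shorter than 5 characters once whitespace is stripped.
non_missing = overview.dropna()
short = set(non_missing[non_missing.str.strip().str.len() < 5])

candidates = repeated | short
print(candidates)

# After reviewing the candidates, replace the confirmed placeholders with NaN.
cleaned = overview.replace(list(candidates), np.nan)
```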

## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [234]:
df = df.drop_duplicates(subset = ['id'])
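The cell above removes duplicates directly. For the "identify" part of task 16, `duplicated()` can be used first. The sketch below runs on made-up data; `keep=False` marks every member of a duplicate group so that the affected rows can be compared side by side.

```python
import pandas as pd

# Toy data (illustrative only): id 862 appears twice.
movies = pd.DataFrame({
    "id": [862, 8844, 862, 15602],
    "title": ["Toy Story", "Jumanji", "Toy Story", "Grumpier Old Men"],
})

# keep=False flags all rows that share an id with another row.
dupes = movies[movies.duplicated(subset="id", keep=False)]
print(dupes)

# drop_duplicates keeps the first row per id by default (keep="first").
deduped = movies.drop_duplicates(subset="id")
print(len(movies), len(deduped))
```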

## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [238]:
df.isna().sum()

belongs_to_collection    40946
budget                   36554
genres                    2442
id                           1
original_language           11
overview                  1103
popularity                  70
poster_path                386
production_companies     11872
production_countries         3
release_date                88
revenue                  38036
runtime                   1819
spoken_languages          3954
status                      85
tagline                  25037
title                        4
vote_average              2999
vote_count                2900
dtype: int64

In [240]:
df = df.dropna(subset = ['id', 'title'])

In [242]:
df = df.dropna(thresh = 10)
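The `thresh` parameter of `dropna()` is the minimum number of non-NaN values a row needs in order to be kept. A small sketch on made-up data, using a threshold of 2 instead of 10 to keep the example short:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only): rows have 3, 2 and 1 non-missing values.
frame = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, 3, 3],
})

# Number of non-missing values per row.
filled = frame.notna().sum(axis=1)
print(filled.tolist())

# thresh=2 keeps only rows with at least two non-NaN values.
kept = frame.dropna(thresh=2)
print(kept)
```

The same per-row count, `df.notna().sum(axis = 1)`, can be used afterwards to confirm that no remaining row has fewer than 10 non-NaN values.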

In [247]:
df.id = df.id.astype('int')

In [252]:
df.notna().sum(axis = 1).value_counts()

15    12858
16    11586
14     6021
17     4322
18     3867
13     2726
12     1428
19     1132
11      785
10      415
dtype: int64

## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

In [253]:
df.status.value_counts()

Released           44711
Rumored              227
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

In [257]:
df = df.loc[df.status == 'Released'].copy()

In [261]:
df.drop('status', axis = 1, inplace = True)

20. The order of the columns should be as follows: 

In [264]:
order = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget", "revenue", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

In [268]:
df = df[order]
df

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget,revenue,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,"[{'iso_3166_1': 'US', 'name': 'United States o...",5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,"[{'iso_3166_1': 'US', 'name': 'United States o...",2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,"[{'iso_3166_1': 'US', 'name': 'United States o...",92.0,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,English,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,31357,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,"[{'iso_3166_1': 'US', 'name': 'United States o...",34.0,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,11862,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,"[{'iso_3166_1': 'US', 'name': 'United States o...",173.0,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,/e64sOI48hQXyru7naBFyssKFxVd.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,439050,Subdue,Rising and falling between a man and woman,NaT,Drama|Family,,fa,,,,"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",1.0,4.0,0.072051,90.0,Rising and falling between a man and woman.,فارسی,/jldsYflnId4tTWPx8es3uzsB1I8.jpg
45462,111109,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,"[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",3.0,9.0,0.178241,360.0,An artist struggles to finish his work while a...,,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg
45463,67758,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,"[{'iso_3166_1': 'US', 'name': 'United States o...",6.0,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",English,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg
45464,227506,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,"[{'iso_3166_1': 'RU', 'name': 'Russia'}]",,,0.003503,87.0,"In a small town live two brothers, one a minis...",,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg


21. __Reset__ the Index and create a __RangeIndex__.

22. __Save__ the cleaned dataset in a __csv-file__.

In [275]:
df.reset_index(drop = True, inplace = True)

In [278]:
df.to_csv('cleaned_csv.csv', index = False)

# +++++++++ See some Hints below +++++++++++++

# ++++++++++++++++ Hints++++++++++++++++++++

__Hints for 3.__ <br>
Apply ast.literal_eval() to all stringified elements (this requires `import ast`):

In [None]:
# example:
df.stringified_column = df.stringified_column.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

__Hints for 4., 5., 6., 7., 8.__<br> 
Apply an appropriate lambda function to all column elements.

__Hints for 9.__<br>
Replace all __""__ (empty strings) in the above columns by NaN (__np.nan__)

__Hints for 10.__<br>
Use pd.to_numeric() and "coerce" errors

__Hints for 11.__<br>
Replace the value 0 by NaN (__np.nan__)

__Hints for 13.__<br>
Replace the value 0 by NaN (__np.nan__)

__Hints for 14.__<br>
Use pd.to_datetime() and "coerce" errors

__Hints for 16.__<br>
There cannot be two or more movies with the same movie id.