<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises exploring movie rating data from IMDb.

In these exercises, you will conduct basic exploratory data analysis on IMDb's movie data to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [89]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [90]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |


#### Check the number of rows and columns.

In [91]:
print(f'there are {len(movies)} rows and {len(movies.columns)} columns')

there are 979 rows and 6 columns


#### Check the data type of each column.

In [92]:
movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

#### Calculate the average movie duration.

In [93]:
print(f"the average movie duration is {int(movies['duration'].mean())} minutes")

the average movie duration is 120 minutes


#### Sort the DataFrame by duration to find the shortest and longest movies.

In [94]:
movies.sort_values('duration', ascending = False)

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 476 | 7.8 | Hamlet | PG-13 | Drama | 242 | [u'Kenneth Branagh', u'Julie Christie', u'Dere... |
| 157 | 8.2 | Gone with the Wind | G | Drama | 238 | [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit... |
| 78 | 8.4 | Once Upon a Time in America | R | Crime | 229 | [u'Robert De Niro', u'James Woods', u'Elizabet... |
| 142 | 8.3 | Lagaan: Once Upon a Time in India | PG | Adventure | 224 | [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell... |
| 445 | 7.9 | The Ten Commandments | APPROVED | Adventure | 220 | [u'Charlton Heston', u'Yul Brynner', u'Anne Ba... |
| ... | ... | ... | ... | ... | ... | ... |
| 293 | 8.1 | Duck Soup | PASSED | Comedy | 68 | [u'Groucho Marx', u'Harpo Marx', u'Chico Marx'] |
| 88 | 8.4 | The Kid | NOT RATED | Comedy | 68 | [u'Charles Chaplin', u'Edna Purviance', u'Jack... |
| 258 | 8.1 | The Cabinet of Dr. Caligari | UNRATED | Crime | 67 | [u'Werner Krauss', u'Conrad Veidt', u'Friedric... |
| 338 | 8.0 | Battleship Potemkin | UNRATED | History | 66 | [u'Aleksandr Antonov', u'Vladimir Barsky', u'G... |
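
When only the extremes matter, `nlargest` and `nsmallest` return them directly instead of requiring a scan of the full sorted table. A minimal sketch, assuming a small `sample` frame built from a few rows of the sorted output rather than the full `movies` data:

```python
import pandas as pd

# A few (title, duration) pairs copied from the sorted output; not the full dataset.
sample = pd.DataFrame({
    "title": ["Hamlet", "Gone with the Wind", "Duck Soup", "Battleship Potemkin"],
    "duration": [242, 238, 68, 66],
})

# One row each for the longest and the shortest movie.
longest = sample.nlargest(1, "duration")
shortest = sample.nsmallest(1, "duration")

print(longest)
print(shortest)
```

On the real data, the same calls would be made on `movies` with a larger `n` if several candidates are of interest.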


#### Create a histogram of duration, choosing an "appropriate" number of bins.

In [95]:
# Answer:
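
One possible approach, sketched on made-up durations rather than the real `movies['duration']` column: pick a bin width that is easy to read (10 minutes here, an arbitrary choice) and derive the number of bins from the data range. The name `sample_durations` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up durations in minutes, standing in for movies['duration'].
sample_durations = pd.Series(
    [66, 68, 95, 102, 110, 118, 120, 126, 142, 154, 175, 200, 242],
    name="duration",
)

# Aim for bins roughly 10 minutes wide by deriving the bin count from the range.
bin_width = 10
n_bins = int((sample_durations.max() - sample_durations.min()) // bin_width) + 1

fig, ax = plt.subplots()
sample_durations.plot(kind="hist", bins=n_bins, ax=ax,
                      title="Distribution of movie duration")
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Number of movies")
```

Whether 10 minutes is "appropriate" depends on the data; trying a few widths and comparing the shapes is a reasonable way to decide.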

#### Use a box plot to display that same data.

In [96]:
# Answer:
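
A box plot of the same column could look like the sketch below. It uses made-up durations in a hypothetical `sample` frame, so the quartiles and outliers it draws say nothing about the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up durations in minutes, standing in for movies['duration'].
sample = pd.DataFrame(
    {"duration": [66, 68, 95, 102, 110, 118, 120, 126, 142, 154, 175, 200, 242]}
)

fig, ax = plt.subplots()
# The box spans the quartiles; points beyond the whiskers are drawn as outliers.
sample["duration"].plot(kind="box", ax=ax, title="Movie duration")
ax.set_ylabel("Duration (minutes)")
```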

## Intermediate level

#### Count how many movies have each of the content ratings.

In [97]:
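# Note: this groups by genre, so the table below shows the number of movies per genre.
# For the content-rating counts the exercise asks for, use:
#   movies['content_rating'].value_counts()
movies.groupby('genre').count().sort_values('duration', ascending=False)[['title']]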

| genre | title |
|---|---|
| Drama | 278 |
| Comedy | 156 |
| Action | 136 |
| Crime | 124 |
| Biography | 77 |
| Adventure | 75 |
| Animation | 62 |
| Horror | 29 |
| Mystery | 16 |
| Western | 9 |
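
Counting movies per content rating is a one-liner with `value_counts`. The sketch below runs on a made-up `sample` frame (the titles and ratings are hypothetical), so the counts are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for the movies DataFrame.
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D", "Film E", "Film F"],
    "content_rating": ["R", "R", "PG-13", "PG", "R", "PG-13"],
})

# value_counts() counts each distinct rating and sorts from most to least common.
rating_counts = sample["content_rating"].value_counts()
print(rating_counts)
```

Note that `value_counts` skips missing values by default; passing `dropna=False` would include them in the tally.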


#### Use a visualization to display that same data, including a title and x and y labels.

In [98]:
# Answer:
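
A bar chart is one natural fit for per-category counts. This is a sketch on made-up ratings in a hypothetical `sample` frame; the title and axis labels are suggestions, not prescribed wording.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings standing in for movies['content_rating'].
sample = pd.DataFrame(
    {"content_rating": ["R", "R", "PG-13", "PG", "R", "PG-13"]}
)
rating_counts = sample["content_rating"].value_counts()

fig, ax = plt.subplots()
# One bar per content rating, ordered from most to least common.
rating_counts.plot(kind="bar", ax=ax, title="Movies per content rating")
ax.set_xlabel("Content rating")
ax.set_ylabel("Number of movies")
```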

#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [99]:
import numpy as np
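# Restrict the replacement to content_rating so matching strings in other columns are left alone.
movies['content_rating'] = movies['content_rating'].replace(['NOT RATED', 'APPROVED', 'PASSED', 'GP'], 'UNRATED')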
movies.groupby('content_rating').count()['title'].sort_values(ascending = False)

content_rating
R          460
PG-13      189
UNRATED    160
PG         123
G           32
NC-17        7
X            4
TV-MA        1
Name: title, dtype: int64

#### Convert the following content ratings to "NC-17": X, TV-MA.

In [100]:
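# Restrict the replacement to content_rating so matching strings in other columns are left alone.
movies['content_rating'] = movies['content_rating'].replace(['X', 'TV-MA'], 'NC-17')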
movies.groupby('content_rating').count()['title'].sort_values(ascending = False)

content_rating
R          460
PG-13      189
UNRATED    160
PG         123
G           32
NC-17       12
Name: title, dtype: int64

#### Count the number of missing values in each column.

In [101]:
movies.isnull().sum()

star_rating       0
title             0
content_rating    3
genre             0
duration          0
actors_list       0
dtype: int64

#### If there are missing values: examine them, then fill them in with "reasonable" values.

In [102]:
movies[movies.isnull().any(axis=1)]

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 187 | 8.2 | Butch Cassidy and the Sundance Kid | NaN | Biography | 110 | [u'Paul Newman', u'Robert Redford', u'Katharin... |
| 649 | 7.7 | Where Eagles Dare | NaN | Action | 158 | [u'Richard Burton', u'Clint Eastwood', u'Mary ... |
| 936 | 7.4 | True Grit | NaN | Adventure | 128 | [u'John Wayne', u'Kim Darby', u'Glen Campbell'] |


In [112]:
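# fillna() targets the missing values directly. Note that 'reasonable' is a literal
# placeholder string; an existing category such as 'UNRATED' would be a more meaningful fill value.
movies['content_rating'] = movies['content_rating'].fillna('reasonable')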

In [115]:
movies.iloc[[187,649,936]]

| | star_rating | title | content_rating | genre | duration | actors_list |
|---|---|---|---|---|---|---|
| 187 | 8.2 | Butch Cassidy and the Sundance Kid | reasonable | Biography | 110 | [u'Paul Newman', u'Robert Redford', u'Katharin... |
| 649 | 7.7 | Where Eagles Dare | reasonable | Action | 158 | [u'Richard Burton', u'Clint Eastwood', u'Mary ... |
| 936 | 7.4 | True Grit | reasonable | Adventure | 128 | [u'John Wayne', u'Kim Darby', u'Glen Campbell'] |


#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

In [142]:
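# `duration <= 120` flags movies that run at most 2 hours, so the column is named for what it holds.
# Note: the exercise's split ("2 hours or longer" vs. "shorter than 2 hours") would use
# `duration >= 120`, which moves the exactly-120-minute movies into the other group.
movies['at_most_2h'] = movies['duration'] <= 120
# Select star_rating before aggregating so non-numeric columns are never averaged.
movies.groupby('at_most_2h')[['star_rating']].mean()

| at_most_2h | star_rating |
|---|---|
| False | 7.95367 |
| True | 7.83849 |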
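
For the split exactly as the exercise words it (2 hours or longer versus shorter than 2 hours), the comparison is `>= 120`. A minimal sketch on made-up values, with a hypothetical `sample` frame and a hypothetical group name `two_hours_or_longer`:

```python
import pandas as pd

# Made-up durations and ratings; the 120-minute movie is the boundary case.
sample = pd.DataFrame({
    "duration": [90, 119, 120, 150],
    "star_rating": [7.5, 7.7, 8.0, 8.4],
})

# True for movies of 2 hours or longer, so a 120-minute movie counts as "long".
is_long = (sample["duration"] >= 120).rename("two_hours_or_longer")

# Group by the boolean Series directly; no helper column is added to the frame.
avg_by_length = sample.groupby(is_long)["star_rating"].mean()
print(avg_by_length)
```

Grouping by a separate boolean Series keeps the original frame unchanged, which avoids leaving a helper column behind for later cells.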


#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [None]:
# Answer:
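
A scatter plot, optionally backed by a correlation coefficient, is a common way to look for such a relationship. The sketch below uses made-up values in a hypothetical `sample` frame, so the trend it shows is not a finding about the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up durations and ratings standing in for the movies DataFrame.
sample = pd.DataFrame({
    "duration": [90, 100, 120, 150, 180],
    "star_rating": [7.6, 7.8, 7.7, 8.1, 8.4],
})

fig, ax = plt.subplots()
# alpha < 1 makes overlapping points visible when there are many movies.
sample.plot(kind="scatter", x="duration", y="star_rating", alpha=0.5, ax=ax,
            title="Star rating vs. duration")
ax.set_xlabel("Duration (minutes)")
ax.set_ylabel("Star rating")

# Pearson correlation as a numeric summary of the linear relationship.
corr = sample["duration"].corr(sample["star_rating"])
print(f"correlation: {corr:.2f}")
```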

#### Calculate the average duration for each genre.

In [128]:
# Select duration before aggregating so non-numeric columns are never averaged.
movies.groupby('genre')['duration'].mean().astype(np.int64)

genre
Action       126
Adventure    134
Animation     96
Biography    131
Comedy       107
Crime        122
Drama        126
Family       107
Fantasy      112
Film-Noir     97
History       66
Horror       102
Mystery      115
Sci-Fi       109
Thriller     114
Western      136
Name: duration, dtype: int64

## Advanced level

#### Visualize the relationship between content rating and duration.

In [None]:
# Answer:
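
Because content rating is categorical and duration is numeric, one box per rating is a reasonable choice. A sketch on made-up rows in a hypothetical `sample` frame:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the movies DataFrame.
sample = pd.DataFrame({
    "content_rating": ["PG", "PG", "PG", "R", "R", "R"],
    "duration": [95, 105, 120, 110, 140, 175],
})

fig, ax = plt.subplots()
# One box per content rating, all drawn on the same duration axis.
sample.boxplot(column="duration", by="content_rating", ax=ax)
ax.set_title("Duration by content rating")
ax.set_xlabel("Content rating")
ax.set_ylabel("Duration (minutes)")
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." super-title
```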

#### Determine the top rated movie (by star rating) for each genre.

In [224]:
# Note: this returns the highest star rating per genre, not the title of the movie that earned it.
movies.groupby('genre')[['star_rating']].max()

| genre | star_rating |
|---|---|
| Action | 9.0 |
| Adventure | 8.9 |
| Animation | 8.6 |
| Biography | 8.9 |
| Comedy | 8.6 |
| Crime | 9.3 |
| Drama | 8.9 |
| Family | 7.9 |
| Fantasy | 7.7 |
| Film-Noir | 8.3 |
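
To recover the title along with the rating, one option is `idxmax`, which returns the row label of each genre's highest-rated movie. The sketch below uses a hypothetical `sample` frame built from a handful of rows shown earlier in this notebook, so its result covers only those rows.

```python
import pandas as pd

# A few rows copied from earlier outputs in this notebook; not the full dataset.
sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight",
              "Pulp Fiction", "Hamlet", "Gone with the Wind"],
    "genre": ["Crime", "Crime", "Action", "Crime", "Drama", "Drama"],
    "star_rating": [9.3, 9.2, 9.0, 8.9, 7.8, 8.2],
})

# idxmax() gives the index label of the top-rated row within each genre.
top_idx = sample.groupby("genre")["star_rating"].idxmax()

# Look those rows up to get the title next to the rating.
top_per_genre = sample.loc[top_idx, ["genre", "title", "star_rating"]].set_index("genre")
print(top_per_genre)
```

When several movies in a genre tie for the top rating, `idxmax` keeps only the first one it encounters.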


#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [None]:
# Answer:
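
A possible two-step check: first find titles that occur more than once, then test whether entire rows repeat. The `sample` frame and its titles are made up for illustration.

```python
import pandas as pd

# Made-up rows: "Title A" is an exact duplicate, "Title B" is two different movies.
sample = pd.DataFrame({
    "title": ["Title A", "Title A", "Title B", "Title B", "Title C"],
    "duration": [100, 100, 120, 95, 110],
    "star_rating": [8.0, 8.0, 7.9, 7.6, 8.1],
})

# keep=False marks every occurrence of a repeated title, not just the later ones.
same_title = sample[sample["title"].duplicated(keep=False)].sort_values("title")

# Rows count as true duplicates only if every column matches.
full_duplicates = sample[sample.duplicated(keep=False)]

print(same_title)
print(full_duplicates)
```

If `same_title` has rows but `full_duplicates` is empty, the shared titles belong to different movies, such as remakes.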

#### Calculate the average star rating for each genre, but only include genres with at least 10 movies.


#### Option 1: manually create a list of relevant genres, then filter using that list

In [None]:
# Answer:
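
A sketch of the manual approach. The list below is read off the genre counts shown earlier in this notebook (every genre with 10 or more movies), while the `sample` frame is made up so the block runs on its own.

```python
import pandas as pd

# Genres with at least 10 movies, taken from the genre counts shown earlier.
relevant_genres = ["Drama", "Comedy", "Action", "Crime", "Biography",
                   "Adventure", "Animation", "Horror", "Mystery"]

# Made-up rows standing in for the movies DataFrame.
sample = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Western"],
    "star_rating": [8.0, 8.2, 8.4, 7.5, 7.7, 8.9],
})

# Keep only rows whose genre is in the list, then average per genre.
filtered = sample[sample["genre"].isin(relevant_genres)]
avg_rating = filtered.groupby("genre")["star_rating"].mean()
print(avg_rating)
```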

#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [None]:
# Answer:
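
A sketch of the automatic variant. Because the made-up `sample` frame is tiny, the hypothetical `min_movies` threshold is set to 2 here; on the real data it would be 10.

```python
import pandas as pd

# Made-up rows standing in for the movies DataFrame.
sample = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Western"],
    "star_rating": [8.0, 8.2, 8.4, 7.5, 7.7, 8.9],
})
min_movies = 2  # would be 10 on the real data

# Save the counts, then keep the genres that meet the threshold.
genre_counts = sample["genre"].value_counts()
relevant_genres = genre_counts[genre_counts >= min_movies].index

avg_rating = (
    sample[sample["genre"].isin(relevant_genres)]
    .groupby("genre")["star_rating"]
    .mean()
)
print(avg_rating)
```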

#### Option 3: calculate the average star rating for all genres, then filter using a boolean Series

In [None]:
# Answer:
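
A sketch of the boolean-Series variant, again on a made-up `sample` frame with a hypothetical `min_movies` threshold of 2 (10 on the real data). Both Series come from a `groupby` on the same column, so they share the same index.

```python
import pandas as pd

# Made-up rows standing in for the movies DataFrame.
sample = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Western"],
    "star_rating": [8.0, 8.2, 8.4, 7.5, 7.7, 8.9],
})
min_movies = 2  # would be 10 on the real data

# Average rating for every genre, including the rare ones.
genre_means = sample.groupby("genre")["star_rating"].mean()

# Boolean Series indexed by genre: True where the genre has enough movies.
has_enough = sample.groupby("genre").size() >= min_movies

avg_rating = genre_means[has_enough]
print(avg_rating)
```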

#### Option 4: aggregate by count and mean, then filter using the count

In [None]:
# Answer:
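
A sketch of the single-aggregation variant: `agg` computes the count and the mean together, and the count column then serves as the filter. The `sample` frame is made up and the hypothetical `min_movies` threshold is 2 (10 on the real data).

```python
import pandas as pd

# Made-up rows standing in for the movies DataFrame.
sample = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Western"],
    "star_rating": [8.0, 8.2, 8.4, 7.5, 7.7, 8.9],
})
min_movies = 2  # would be 10 on the real data

# One pass over the groups yields both statistics as columns.
genre_stats = sample.groupby("genre")["star_rating"].agg(["count", "mean"])

# Filter on the count column; the mean column holds the answer.
avg_rating = genre_stats[genre_stats["count"] >= min_movies]
print(avg_rating)
```

Keeping the count next to the mean also makes it easy to see how much data each average rests on.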

## Bonus

#### Figure out something "interesting" using the actors data!
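
One possible starting point: `actors_list` stores each cast as a string that looks like a Python list, so it has to be parsed before actors can be counted. The sketch below assumes every value is a well-formed list literal. The first three rows of `sample` are copied from outputs shown earlier; the last row is hypothetical and exists only so that one actor appears twice.

```python
import ast

import pandas as pd

sample = pd.DataFrame({
    "title": ["The Godfather", "Duck Soup", "True Grit", "Hypothetical Film"],
    "actors_list": [
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']",
        "[u'Groucho Marx', u'Harpo Marx', u'Chico Marx']",
        "[u'John Wayne', u'Kim Darby', u'Glen Campbell']",
        "[u'Al Pacino', u'Hypothetical Actor']",  # made-up row
    ],
})

# literal_eval safely turns the string into a real Python list of names.
sample["actors"] = sample["actors_list"].apply(ast.literal_eval)

# explode() gives one row per (movie, actor); value_counts() then ranks actors.
actor_counts = sample.explode("actors")["actors"].value_counts()
print(actor_counts.head())
```

From here, the exploded frame could be grouped by actor to compare, for example, average star ratings, provided the rating column is carried along.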