<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises exploring movie rating data from IMDb.

For these exercises, you will conduct basic exploratory data analysis on IMDb's movie data, looking to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [None]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

#### Check the number of rows and columns.

In [None]:
rows, cols = movies.shape  # shape is a (rows, columns) tuple

print("Rows:", rows)
print("Cols:", cols)

#### Check the data type of each column.

In [None]:
movies.dtypes

#### Calculate the average movie duration.

In [None]:
movies['duration'].mean()

#### Sort the DataFrame by duration to find the shortest and longest movies.

In [None]:
movies = movies.sort_values('duration')
movies[['title', 'duration']].iloc[[0, -1]]  # first row: shortest movie, last row: longest

#### Create a histogram of duration, choosing an "appropriate" number of bins.

In [None]:
movies['duration'].hist(bins=20)  # 20 bins is a judgment call; adjust until the shape of the distribution is clear
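
One way to settle on an "appropriate" number of bins is to let NumPy suggest bin edges using a standard rule. The sketch below uses made-up durations as a stand-in for `movies['duration']`; the Freedman-Diaconis rule (`'fd'`) is just one of several built-in options and is an illustrative choice, not the only reasonable one.

```python
import numpy as np

# Hypothetical durations in minutes, standing in for movies['duration']
durations = np.array([64, 80, 88, 95, 101, 110, 118, 124, 131, 142, 165, 242])

# 'fd' applies the Freedman-Diaconis rule; 'sturges' and 'auto' are other built-in choices
edges = np.histogram_bin_edges(durations, bins='fd')
n_bins = len(edges) - 1

print(n_bins)  # candidate value for the bins= argument of .hist()
```

The resulting `n_bins` is only a starting point; it is still worth plotting a few nearby values and comparing them by eye.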

#### Use a box plot to display that same data.

In [None]:
movies['duration'].plot(kind='box', figsize=(12,6))

## Intermediate level

#### Count how many movies have each of the content ratings.

In [None]:
movies['content_rating'].value_counts()

#### Use a visualization to display that same data, including a title and x and y labels.

In [None]:
movies['content_rating'].value_counts().sort_index().plot(kind='bar', figsize=(12,6)) 
plt.xlabel('Content Rating', fontsize=13)
plt.ylabel('Number of Movies', fontsize=13)
plt.title('Number of Movies per Content Rating', fontsize=16)

#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [None]:
# assign the result back: replace(inplace=True) on a column accessed this way may not update movies in newer pandas
movies['content_rating'] = movies['content_rating'].replace(['NOT RATED', 'APPROVED', 'PASSED', 'GP'], 'UNRATED')

#### Convert the following content ratings to "NC-17": X, TV-MA.

In [None]:
# the label to convert is 'TV-MA' (with a hyphen), as stated in the exercise
movies['content_rating'] = movies['content_rating'].replace(['X', 'TV-MA'], 'NC-17')
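
The two conversions can also be expressed as a single lookup table. This is a minimal sketch on a made-up Series standing in for `movies['content_rating']`; the name `rating_map` is hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for movies['content_rating']
ratings = pd.Series(['R', 'NOT RATED', 'X', 'GP', 'TV-MA', 'PG-13'])

# One dictionary maps every old label to its replacement
rating_map = {
    'NOT RATED': 'UNRATED', 'APPROVED': 'UNRATED', 'PASSED': 'UNRATED', 'GP': 'UNRATED',
    'X': 'NC-17', 'TV-MA': 'NC-17',
}
cleaned = ratings.replace(rating_map)

print(cleaned.tolist())
```

Keeping the mapping in one dictionary makes a misspelled label easier to spot, since a label that matches nothing is silently left unchanged by `replace`.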


#### Count the number of missing values in each column.

In [None]:
movies.isnull().sum()

#### If there are missing values: examine them, then fill them in with "reasonable" values.

In [None]:
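# 'PG-13' is one judgment call for the missing ratings; 'UNRATED' would be another defensible choice
movies['content_rating'] = movies['content_rating'].fillna('PG-13')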
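
The exercise asks to examine the missing values before filling them. A possible sketch on made-up data: list the affected rows first, then fill with the most common rating. Using the mode as the fill value is an assumption for illustration, not necessarily the best choice for the real data.

```python
import pandas as pd

# Hypothetical rows standing in for the movies DataFrame
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'content_rating': ['R', 'R', None, 'PG', None],
})

# Step 1: look at the rows with a missing rating
missing_rows = toy[toy['content_rating'].isnull()]
print(missing_rows)

# Step 2: fill with the most frequent rating (one possible "reasonable" value)
fill_value = toy['content_rating'].mode()[0]
toy['content_rating'] = toy['content_rating'].fillna(fill_value)
```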


#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

In [None]:
mask_1 = movies['duration'] >= 120
avg_rating_long = movies[mask_1]['star_rating'].mean()

mask_2 = movies['duration'] < 120
avg_rating_short = movies[mask_2]['star_rating'].mean()

print('Long movies rating: ', avg_rating_long)
print("Short movies rating: ", avg_rating_short)
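
The same comparison can be made in one step by grouping on the boolean condition itself. A small sketch with made-up numbers; `is_long` is a hypothetical name.

```python
import pandas as pd

# Hypothetical data standing in for the movies DataFrame
toy = pd.DataFrame({
    'duration': [90, 100, 130, 150],
    'star_rating': [7.0, 8.0, 8.0, 9.0],
})

# True for movies of 2 hours or more, False otherwise
is_long = (toy['duration'] >= 120).rename('is_long')

# Mean star rating for each of the two groups
avg_by_length = toy.groupby(is_long)['star_rating'].mean()
print(avg_by_length)
```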

#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [None]:
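import seaborn as sns

# correlate only the two numeric columns of interest; corr() on a frame with text columns raises in pandas 2.x
plt.figure(figsize=(8, 6))
sns.heatmap(movies[['duration', 'star_rating']].corr(), vmin=-1, vmax=1, cmap='coolwarm', annot=True)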
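
A single number can back up what the plot suggests. The sketch below computes the Pearson correlation between two columns of a made-up DataFrame; on the real data the two columns would be `movies['duration']` and `movies['star_rating']`.

```python
import pandas as pd

# Hypothetical data standing in for the movies DataFrame
toy = pd.DataFrame({
    'duration': [90, 100, 110, 120, 130],
    'star_rating': [7.5, 7.7, 7.6, 8.0, 8.2],
})

# Pearson correlation: +1 is a perfect positive linear relationship, 0 is none, -1 is perfectly negative
r = toy['duration'].corr(toy['star_rating'])
print(round(r, 2))
```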

#### Calculate the average duration for each genre.

In [None]:
movies.groupby('genre')['duration'].describe()

## Advanced level

#### Visualize the relationship between star rating and duration.

In [None]:
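# a scatter plot puts both variables on one chart; alpha reveals where points overlap
movies.plot(kind='scatter', x='duration', y='star_rating', alpha=0.3, figsize=(12,6))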

#### Determine the top rated movie (by star rating) for each genre.

In [None]:
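# sort so the best-rated movie comes first within each genre, then keep that row
# (max() on each column separately would pair the top rating with the alphabetically last title)
movies.sort_values('star_rating', ascending=False).groupby('genre')[['title', 'star_rating']].first()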
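
Another way to approach this task is `idxmax`, which returns the row label of the highest rating in each group so that the title and rating come from the same row. A sketch on made-up data (titles and ratings are invented):

```python
import pandas as pd

# Hypothetical data standing in for the movies DataFrame
toy = pd.DataFrame({
    'title': ['Zebra Run', 'Apex', 'Big Laugh', 'Up High'],
    'genre': ['Action', 'Action', 'Comedy', 'Comedy'],
    'star_rating': [7.8, 8.5, 7.3, 8.3],
})

# Row label of the highest-rated movie within each genre (ties resolve to the first occurrence)
top_idx = toy.groupby('genre')['star_rating'].idxmax()
top_per_genre = toy.loc[top_idx, ['genre', 'title', 'star_rating']]
print(top_per_genre)
```

Note that taking the maximum of each column independently would report 'Zebra Run' with 8.5 for Action here, a combination that exists in no row.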


#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [None]:
print(movies['title'].duplicated().sum())  # number of repeated titles
movies[movies.duplicated(['title', 'genre', 'duration'])]  # if empty, the repeated titles differ in genre or duration, so they are different movies
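
To inspect the repeated titles side by side, `duplicated(keep=False)` flags every copy rather than only the later ones. A minimal sketch with invented rows:

```python
import pandas as pd

# Hypothetical data standing in for the movies DataFrame
toy = pd.DataFrame({
    'title': ['Same Name', 'Solo', 'Same Name'],
    'duration': [100, 90, 120],
})

# keep=False marks all rows that share a title, not just the second and later ones
same_title = toy[toy['title'].duplicated(keep=False)].sort_values('title')
print(same_title)

# Rows that match in every column are true duplicates
n_true_duplicates = toy.duplicated().sum()
print(n_true_duplicates)
```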

#### Calculate the average star rating for each genre, but only include genres with at least 10 movies


#### Option 1: manually create a list of relevant genres, then filter using that list

In [None]:
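print(movies.groupby('genre')['title'].count())  # inspect the counts to pick genres with at least 10 movies
genre_list = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Drama', 'Horror', 'Mystery']
movies[movies['genre'].isin(genre_list)].groupby('genre')['star_rating'].mean()  # average rating for those genres only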

#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [None]:
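genre_count = movies['genre'].value_counts()
genre_list = genre_count[genre_count >= 10].index  # genres with at least 10 movies

movies[movies['genre'].isin(genre_list)].groupby('genre')['star_rating'].mean()  # average rating for those genres only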
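
A third possibility, sketched here on made-up data, is to compute the count and the mean together and then filter on the count. The threshold is lowered to 2 so the toy example has something to exclude; the exercise itself uses 10. The name `min_movies` is hypothetical.

```python
import pandas as pd

# Hypothetical data standing in for the movies DataFrame
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Western', 'Comedy', 'Comedy'],
    'star_rating': [8.0, 9.0, 7.0, 8.5, 7.0, 8.0],
})

min_movies = 2  # the exercise uses 10

# One pass gives both the number of movies and the mean rating per genre
stats = toy.groupby('genre')['star_rating'].agg(['count', 'mean'])

# Keep only genres that meet the minimum count
result = stats[stats['count'] >= min_movies]
print(result)
```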

## Bonus

#### Figure out something "interesting" using the actors data!

In [None]:
import ast

# ast.literal_eval accepts the u'...' prefix in Python 3, so no string clean-up is needed
# (stripping "u'" would also clip any name that ends in "u" right before its closing quote)
parsed_actors = movies['actors_list'].apply(ast.literal_eval)

print(movies['actors_list'].duplicated().sum())  # 0 means no two movies share the exact same cast

# unique actor names, in order of first appearance
actors = list(dict.fromkeys(name for cast in parsed_actors for name in cast))

In [None]:

actors_df =  pd.DataFrame(columns=['name', 'movie_count'])
actors_df['name'] = actors
actors_df

In [None]:
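from collections import Counter

# count from the parsed lists instead of str.contains, which treats each name as a regular
# expression and searches the raw string, so names with special characters can be missed
casts = movies['actors_list'].apply(ast.literal_eval)
movie_counts = Counter(name for cast in casts for name in set(cast))
count_list = [movie_counts[name] for name in actors]
actors_dict = dict(zip(actors, count_list))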

In [None]:
actors_dict # using for quick validation

In [None]:
actors_df['movie_count'] = count_list

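print(actors_df['movie_count'].describe())
# sanity check: every actor comes from some movie's cast, so any rows shown here point to a
# matching problem (in the original run, all of them were names with special characters)
actors_df[actors_df['movie_count'] == 0]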
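
As a cross-check on the counts, the cast strings can be parsed and tallied with the standard library alone. This sketch uses invented cast strings in what is assumed to be the same format as `movies['actors_list']`; the actor names are placeholders.

```python
import ast
from collections import Counter

# Hypothetical rows in the same string format as movies['actors_list']
raw_casts = [
    "[u'Actor One', u'Actor Two']",
    "[u'Actor Two', u\"Actor O'Three\"]",
    "[u'Actor One', u'Actor Two']",
]

# Parse each string into a real list, then count how many casts each name appears in
counts = Counter(name for raw in raw_casts for name in ast.literal_eval(raw))

print(counts.most_common(1))
```

Because the names are compared as whole parsed strings, apostrophes and other special characters need no escaping.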