<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises exploring movie rating data from IMDb.

For these exercises, you will conduct basic exploratory data analysis on IMDb's movie data, looking to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [220]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [273]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


#### Check the number of rows and columns.

In [9]:
# Answer: 
print(f"{len(movies.index)} rows. {movies.shape[1]} columns.")

979 rows. 6 columns.


#### Check the data type of each column.

In [11]:
# Answer:
movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

The star rating is a float, the movie duration (in minutes) is an integer, and the remaining columns have the `object` dtype, which pandas uses here for strings.

#### Calculate the average movie duration.

In [19]:
# Answer:
print(f"Average movie duration: {round(movies['duration'].mean(), 2)} minutes.")

Average movie duration: 120.98 minutes.


#### Sort the DataFrame by duration to find the shortest and longest movies.

In [39]:
# Answer:
movies.sort_values(['duration'], ascending=False)[['title', 'duration']].iloc[:5].style.set_caption("Longest 5 Movies")

    

Unnamed: 0,title,duration
476,Hamlet,242
157,Gone with the Wind,238
78,Once Upon a Time in America,229
142,Lagaan: Once Upon a Time in India,224
445,The Ten Commandments,220


In [43]:
movies.sort_values(['duration'], ascending=False)[['title', 'duration']].iloc[-5:].style.set_caption("Shortest 5 Movies")

Unnamed: 0,title,duration
293,Duck Soup,68
88,The Kid,68
258,The Cabinet of Dr. Caligari,67
338,Battleship Potemkin,66
389,Freaks,64
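
As an alternative to sorting the whole DataFrame, pandas also offers `nlargest` and `nsmallest`. The following is a minimal sketch on a small stand-in table; the `toy` frame is hypothetical and only reuses a few titles and durations from the outputs above, not the full data set:

```python
import pandas as pd

# Hypothetical stand-in for `movies`, built from a few rows shown above.
toy = pd.DataFrame({
    'title': ['Hamlet', 'Gone with the Wind', 'Duck Soup', 'Freaks'],
    'duration': [242, 238, 68, 64],
})

longest = toy.nlargest(2, 'duration')    # rows with the largest durations first
shortest = toy.nsmallest(2, 'duration')  # rows with the smallest durations first

print(longest[['title', 'duration']])
print(shortest[['title', 'duration']])
```

Unlike the `iloc[-5:]` slice of a descending sort, `nsmallest` returns the shortest movie first.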


#### Create a histogram of duration, choosing an "appropriate" number of bins.

I did a small amount of research and decided to use the Freedman-Diaconis rule for determining histogram bin size. This rule uses the **interquartile range (IQR)** and the number of data points to determine a bin size. I decided to use this method rather than Sturges' rule because Sturges' rule can be inaccurate for large (>200) data sets.


In [64]:
print(f"Statistical values for Duration column: \n{movies['duration'].describe()}")

Statistical values for Duration column: 
count    979.000000
mean     120.979571
std       26.218010
min       64.000000
25%      102.000000
50%      117.000000
75%      134.000000
max      242.000000
Name: duration, dtype: float64


$\mathrm{IQR} = 75^{\text{th}}\ \text{percentile} - 25^{\text{th}}\ \text{percentile} = 134 - 102 = 32$

$n = 979$


**Freedman-Diaconis rule:**

Bin Size = $2 \frac{\mathrm{IQR}(x)}{\sqrt[3]{n}}$

Bin Size = $2 \cdot \frac{32}{9.93} \approx 6.45$
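
The arithmetic can be double-checked in a few lines. This is only a sketch that hard-codes the quartiles, count, minimum, and maximum from the `describe()` output above; the variable names are illustrative:

```python
import math

# Values copied from the describe() output for the duration column.
q1, q3 = 102, 134
n = 979
shortest, longest = 64, 242

iqr = q3 - q1
bin_width = 2 * iqr / n ** (1 / 3)    # Freedman-Diaconis bin width
rounded_width = math.ceil(bin_width)  # round up to a whole number of minutes

# Number of bins needed to span the observed range at that width.
num_bins = math.ceil((longest - shortest) / rounded_width)

print(f"bin width: {bin_width:.2f}, rounded: {rounded_width}, bins: {num_bins}")
```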

Rounding the bin size up to 7, I would need about 26 bins ((242 - 64) / 7 ≈ 25.4, rounded up) to cover the range between the min and max durations in the data set:

In [251]:
# Answer:
fig1 = go.Figure()
fig1.add_trace(go.Histogram(
    x=movies['duration'],
    histnorm='',
    xbins=dict(
        start=50,
        end=250,
        size=7
    ),
    marker_color='green',
    opacity=0.9
))
fig1.update_layout(
    title_text='Movies by Duration', # title of plot
    xaxis_title_text='Duration (in minutes)', # xaxis label
    yaxis_title_text='Number of Movies', # yaxis label
)
fig1.show()

#### Use a box plot to display that same data.

In [288]:
# Answer:
fig2 = px.box(movies, y="duration", title="Movies by Duration", labels={'duration':'Duration'})
fig2.update_traces(marker_color = 'green')
fig2.show()

## Intermediate level

#### Count how many movies have each of the content ratings.

In [146]:
# Answer:
content_ratings = movies['content_rating'].value_counts()
content_ratings


R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64

#### Use a visualization to display that same data, including a title and x and y labels.

In [283]:
# Answer:
fig3 = px.bar(movies['content_rating'].value_counts(), title="Number of Movies per Content Rating",  labels={'index': 'Content Rating', 'value':'# of Movies'})
fig3.update_layout(showlegend=False)
fig3.update_traces(marker_color='green')
fig3.show()



#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [284]:
# Answer:
movies['content_rating'] = np.where((movies.content_rating == 'NOT RATED') | (movies.content_rating == 'APPROVED') |
                                    (movies.content_rating == 'PASSED') | (movies.content_rating == 'GP'), 'UNRATED', 
                                    movies['content_rating'])


In [149]:
content_ratings_converted = movies['content_rating'].value_counts()
content_ratings_converted

R          460
PG-13      189
UNRATED    160
PG         123
G           32
NC-17        7
X            4
TV-MA        1
Name: content_rating, dtype: int64

#### Convert the following content ratings to "NC-17": X, TV-MA.

In [285]:
# Answer:
movies['content_rating'] = np.where((movies.content_rating == 'X') | (movies.content_rating == 'TV-MA'), 'NC-17', 
                                    movies['content_rating'])

In [151]:
content_ratings_converted = movies['content_rating'].value_counts()
content_ratings_converted

R          460
PG-13      189
UNRATED    160
PG         123
G           32
NC-17       12
Name: content_rating, dtype: int64
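
Both conversions could also be written as a single lookup table passed to `Series.replace`. The sketch below runs on a short made-up Series rather than the real column; `ratings` and `rating_map` are illustrative names:

```python
import pandas as pd

# Hypothetical sample of content ratings, not the full column.
ratings = pd.Series(['R', 'NOT RATED', 'X', 'GP', 'PG', 'TV-MA', 'APPROVED', 'PASSED'])

# Each old label maps to the category it should be folded into.
rating_map = {
    'NOT RATED': 'UNRATED', 'APPROVED': 'UNRATED',
    'PASSED': 'UNRATED', 'GP': 'UNRATED',
    'X': 'NC-17', 'TV-MA': 'NC-17',
}

# Labels that are not keys in the mapping are left unchanged.
converted = ratings.replace(rating_map)
print(converted.value_counts())
```

Keeping the mapping in one dictionary makes it easier to see at a glance which labels were merged.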

#### Count the number of missing values in each column.

In [179]:
# Answer:
movies[movies.isnull().any(axis=1)]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
187,8.2,Butch Cassidy and the Sundance Kid,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
649,7.7,Where Eagles Dare,,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ..."
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"
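
The cell above lists the rows that contain a missing value. To get a count per column instead, `isnull()` can be combined with `sum()`. A minimal sketch on a hypothetical frame (the `toy` data is made up for illustration):

```python
import pandas as pd

# Hypothetical frame with two missing content ratings.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'content_rating': ['R', None, 'PG', None],
})

# isnull() yields a boolean frame; summing counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```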


#### If there are missing values: examine them, then fill them in with "reasonable" values.

The only values I'm missing are in the Content Rating column. Similar to an exercise above, I will convert those missing values to 'UNRATED', bringing the unrated total to 163:

In [424]:
# Answer:
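# Fill only the content_rating column so other columns can never be touched by accident.
movies['content_rating'] = movies['content_rating'].fillna('UNRATED')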
content_ratings_converted = movies['content_rating'].value_counts()
content_ratings_converted

R          460
PG-13      189
UNRATED    163
PG         123
G           32
NC-17       12
Name: content_rating, dtype: int64

Updated bar chart with condensed content rating categories:

In [286]:
fig4 = px.bar(movies['content_rating'].value_counts(), title="Number of Movies per Content Rating",  labels={'index': 'Content Rating', 'value':'# of Movies'})
fig4.update_layout(showlegend=False)
fig4.update_traces(marker_color='green')
fig4.show()

#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

First, I want to confirm that my filtering method works properly. I'll split the data into movies that are greater than or equal to 120 minutes, and movies that are less than 120 minutes. The sum of these two groups should be the total number of movies (979).

In [202]:
print(f"{movies[movies['duration']>=120].shape[0]} movies 120 minutes or longer\n")
print(f"{movies[movies['duration']<120].shape[0]} movies shorter than 120 minutes")

454 movies 120 minutes or longer

525 movies shorter than 120 minutes


In [200]:
# Answer:
long_movies = movies[movies['duration']>=120]
short_movies = movies[movies['duration']<120]

print(f"Average Rating for movies 120 minutes or longer: {round(long_movies['star_rating'].mean(),2)} stars\n")
print(f"Average Rating for movies shorter than 120 minutes: {round(short_movies['star_rating'].mean(),2)} stars")

Average Rating for movies 120 minutes or longer: 7.95 stars

Average Rating for movies shorter than 120 minutes: 7.84 stars


There is a *slight* difference between the two values (about 0.11 stars), but it appears small. A linear regression can show whether there's a statistical relationship between movie duration and star rating:

#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [289]:
# Answer:

fig5 = px.scatter(movies, x='duration', y='star_rating', trendline='ols')
fig5.update_traces(marker_color='green')
fig5.show()

The R-squared value for the line above is about 0.05, meaning that duration explains only about 5% of the variance in star rating. This suggests that there is not a strong linear relationship between a movie's duration and its star rating.
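
For a straight-line fit with a single predictor, R-squared equals the square of the Pearson correlation coefficient, so the value can be cross-checked without a plot. The sketch below uses made-up numbers; the two arrays are hypothetical and not taken from the data set:

```python
import numpy as np

# Hypothetical durations (minutes) and star ratings, for illustration only.
duration = np.array([90, 100, 110, 120, 130, 140])
star_rating = np.array([7.8, 7.6, 8.0, 7.7, 8.1, 7.9])

# R-squared as the squared Pearson correlation.
r = np.corrcoef(duration, star_rating)[0, 1]
r_squared = r ** 2

# The same quantity from an explicit least-squares line.
slope, intercept = np.polyfit(duration, star_rating, 1)
predicted = slope * duration + intercept
ss_res = np.sum((star_rating - predicted) ** 2)
ss_tot = np.sum((star_rating - star_rating.mean()) ** 2)
r_squared_fit = 1 - ss_res / ss_tot

print(round(r_squared, 4), round(r_squared_fit, 4))
```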

#### Calculate the average duration for each genre.

In [267]:
# Answer:
round(movies.groupby('genre')['duration'].mean().sort_values(),2)

genre
History       66.00
Animation     96.60
Film-Noir     97.33
Horror       102.52
Family       107.50
Comedy       107.60
Sci-Fi       109.00
Fantasy      112.00
Thriller     114.20
Mystery      115.62
Crime        122.30
Action       126.49
Drama        126.54
Biography    131.84
Adventure    134.84
Western      136.67
Name: duration, dtype: float64

## Advanced level

#### Visualize the relationship between content rating and duration.

In [688]:
# Answer:
fig6 = px.bar(round(movies.groupby('content_rating')['duration'].mean(),2), title="Average Movie Duration by Content Rating", labels={'content_rating':'Content Rating','value':'Average Duration (minutes)'})
fig6.update_traces(marker_color='green')
fig6.update_layout(showlegend=False)
fig6.show()

#### Determine the top rated movie (by star rating) for each genre.

First, I'll use the aggregate function to determine the max rating for each genre. Then I'll save that result as a new dataframe and use it to do a join with my main movies list. If I join and match based on both genre and rating, I should be able to easily find the top rated movies by genre:

In [455]:
# Best Ratings by Genre:

best_rated = movies.groupby('genre').agg({'star_rating':'max'}).sort_values(by='star_rating').reset_index()
best_rated


Unnamed: 0,genre,star_rating
0,Fantasy,7.7
1,Family,7.9
2,History,8.0
3,Thriller,8.0
4,Sci-Fi,8.2
5,Film-Noir,8.3
6,Animation,8.6
7,Comedy,8.6
8,Horror,8.6
9,Mystery,8.6


In [416]:
# Answer:
best_movies = movies.merge(best_rated, how='inner', on=['genre', 'star_rating'])
best_movies    

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
2,8.9,12 Angry Men,UNRATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
3,8.9,Fight Club,R,Drama,139,"[u'Brad Pitt', u'Edward Norton', u'Helena Bonh..."
4,8.9,"The Good, the Bad and the Ugly",UNRATED,Western,161,"[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ..."
5,8.9,The Lord of the Rings: The Return of the King,PG-13,Adventure,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK..."
6,8.9,Schindler's List,R,Biography,195,"[u'Liam Neeson', u'Ralph Fiennes', u'Ben Kings..."
7,8.6,Life Is Beautiful,PG-13,Comedy,116,"[u'Roberto Benigni', u'Nicoletta Braschi', u'G..."
8,8.6,City Lights,UNRATED,Comedy,87,"[u'Charles Chaplin', u'Virginia Cherrill', u'F..."
9,8.6,Modern Times,G,Comedy,87,"[u'Charles Chaplin', u'Paulette Goddard', u'He..."


#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [425]:
# Answer:
movies[movies.duplicated(subset='title', keep=False)].sort_values(by='title')


Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
703,7.6,Dracula,UNRATED,Horror,85,"[u'Bela Lugosi', u'Helen Chandler', u'David Ma..."
905,7.5,Dracula,R,Horror,128,"[u'Gary Oldman', u'Winona Ryder', u'Anthony Ho..."
678,7.7,Les Miserables,PG-13,Drama,158,"[u'Hugh Jackman', u'Russell Crowe', u'Anne Hat..."
924,7.5,Les Miserables,PG-13,Crime,134,"[u'Liam Neeson', u'Geoffrey Rush', u'Uma Thurm..."
466,7.9,The Girl with the Dragon Tattoo,R,Crime,158,"[u'Daniel Craig', u'Rooney Mara', u'Christophe..."
482,7.8,The Girl with the Dragon Tattoo,R,Crime,152,"[u'Michael Nyqvist', u'Noomi Rapace', u'Ewa Fr..."
662,7.7,True Grit,PG-13,Adventure,110,"[u'Jeff Bridges', u'Matt Damon', u'Hailee Stei..."
936,7.4,True Grit,UNRATED,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"


I used a built-in pandas method called **duplicated** to find duplicates. By setting the keep parameter to False, I was able to list all occurrences of duplicated titles, not just the ones after the first occurrence. Inspecting the list, it's clear that all of these are unique movies. The biggest indication is that the cast members are totally different. Also, the movies have different durations, sometimes significantly so. Finally, two of the pairs have different content ratings. I would expect actual duplicates to have the same cast and very nearly the same, if not identical, runtimes.
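
The effect of the `keep` parameter is easiest to see on a tiny example. The frame below is hypothetical and only borrows a few titles for illustration:

```python
import pandas as pd

toy = pd.DataFrame({'title': ['Dracula', 'Heat', 'Dracula', 'True Grit']})

# Default keep='first': only the later occurrence is flagged.
later_only = toy.duplicated(subset='title')

# keep=False: every row that shares its title with another row is flagged.
all_occurrences = toy.duplicated(subset='title', keep=False)

print(later_only.tolist())
print(all_occurrences.tolist())
```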

#### Calculate the average star rating for each genre, but only include genres with at least 10 movies


#### Option 1: manually create a list of relevant genres, then filter using that list

In [None]:
# Answer:
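
This cell was left empty. One possible sketch of the manual approach is shown below on a hypothetical frame; `toy` and `relevant_genres` are illustrative names. For the real data, the Option 4 output further down suggests the list would be Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Horror, and Mystery.

```python
import pandas as pd

# Hypothetical data: two genres with at least 10 movies, one with only 3.
toy = pd.DataFrame({
    'genre': ['Drama'] * 12 + ['Comedy'] * 10 + ['Western'] * 3,
    'star_rating': [8.0] * 12 + [7.5] * 10 + [8.5] * 3,
})

# Hand-written list of genres known to have at least 10 movies.
relevant_genres = ['Drama', 'Comedy']

# Keep only those genres, then average the star rating per genre.
filtered = toy[toy['genre'].isin(relevant_genres)]
manual_means = filtered.groupby('genre')['star_rating'].mean()
print(manual_means)
```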

#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [None]:
# Answer:
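
This cell was also left empty. A possible sketch, again on a hypothetical frame, derives the genre list from `value_counts` instead of typing it by hand:

```python
import pandas as pd

# Hypothetical data: two genres with at least 10 movies, one with only 3.
toy = pd.DataFrame({
    'genre': ['Drama'] * 12 + ['Comedy'] * 10 + ['Western'] * 3,
    'star_rating': [8.0] * 12 + [7.5] * 10 + [8.5] * 3,
})

# Count movies per genre and keep the labels that reach the threshold.
genre_counts = toy['genre'].value_counts()
relevant_genres = genre_counts[genre_counts >= 10].index

# Filter to those genres, then average the star rating per genre.
auto_means = toy[toy['genre'].isin(relevant_genres)].groupby('genre')['star_rating'].mean()
print(auto_means)
```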

#### Option 3: calculate the average star rating for all genres, then filter using a boolean Series

In [541]:
# Answer:
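
No answer was recorded here either. One way this could look, sketched on a hypothetical frame, is to compute the mean for every genre and then index it with a boolean Series built from the per-genre counts:

```python
import pandas as pd

# Hypothetical data: two genres with at least 10 movies, one with only 3.
toy = pd.DataFrame({
    'genre': ['Drama'] * 12 + ['Comedy'] * 10 + ['Western'] * 3,
    'star_rating': [8.0] * 12 + [7.5] * 10 + [8.5] * 3,
})

grouped = toy.groupby('genre')['star_rating']
all_means = grouped.mean()         # average rating for every genre
enough_movies = grouped.count() >= 10  # boolean Series with the same genre index

# Boolean indexing keeps only genres where the mask is True.
filtered_means = all_means[enough_movies]
print(filtered_means)
```

Taking both the means and the counts from the same `groupby` keeps their indexes in the same order, so the mask lines up row for row.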

#### Option 4: aggregate by count and mean, then filter using the count

In [540]:
# Answer:
# Select star_rating before aggregating so 'mean' is never applied to the text columns.
genres_ratings = movies.groupby('genre')['star_rating'].agg(['count', 'mean'])
round(genres_ratings[(genres_ratings['count'] >= 10)],2)

Unnamed: 0_level_0,count,mean
genre,Unnamed: 1_level_1,Unnamed: 2_level_1
Action,136,7.88
Adventure,75,7.93
Animation,62,7.91
Biography,77,7.86
Comedy,156,7.82
Crime,124,7.92
Drama,278,7.9
Horror,29,7.81
Mystery,16,7.98


## Bonus

#### Figure out something "interesting" using the actors data!

In [659]:
import ast

actors_dict = dict()

for index, row in movies.iterrows():
    # actors_list is stored as a string such as "[u'Tim Robbins', u'Morgan Freeman']".
    # literal_eval parses it into a real list, so names that contain
    # apostrophes or commas stay intact.
    actors = ast.literal_eval(row['actors_list'])

    for actor in actors:
        # Map each actor to the index labels of the movies they appear in.
        actors_dict.setdefault(actor, []).append(index)
    


In [660]:
actors_dict


{'Tim Robbins': [0, 365, 611, 693, 819],
 'Morgan Freeman': [0, 24, 119, 227, 549, 621, 943, 962],
 'Bob Gunton': [0],
 'Marlon Brando': [1, 51, 122, 284],
 'Al Pacino': [1, 2, 115, 135, 278, 374, 423, 436, 463, 560, 561, 711, 889],
 'James Caan': [1, 496],
 'Robert De Niro': [2,
  18,
  78,
  92,
  124,
  135,
  156,
  166,
  321,
  383,
  475,
  571,
  579,
  580,
  780,
  826,
  898,
  931],
 'Robert Duvall': [2, 51, 761, 772, 813, 884, 914],
 'Christian Bale': [3, 43, 53, 113, 446, 504, 555, 589, 702, 732, 815],
 'Heath Ledger': [3, 628],
 'Aaron Eckhart': [3, 714],
 'John Travolta': [4],
 'Uma Thurman': [4, 198, 354, 499, 924],
 'Samuel L. Jackson': [4, 194, 378, 517, 778, 836],
 'Henry Fonda': [5, 26, 188],
 'Lee J. Cobb': [5, 122],
 'Martin Balsam': [5],
 'Clint Eastwood': [6,
  107,
  119,
  162,
  227,
  276,
  421,
  515,
  649,
  691,
  704,
  753,
  851,
  865],
 'Eli Wallach': [6],
 'Lee Van Cleef': [6, 107],
 'Elijah Wood': [7, 10, 14, 830, 868],
 'Viggo Mortensen': [7, 1

In [683]:
def common_movies(actor1, actor2):
    """Return the rows of `movies` in which both actors appear."""
    try:
        a1_movies = actors_dict[actor1]
        a2_movies = actors_dict[actor2]
    except KeyError:
        # At least one of the names is not a key in actors_dict.
        return "Invalid selection"

    # Index labels shared by both actors.
    common = set(a1_movies) & set(a2_movies)

    if not common:
        return "No movies in common"

    return movies[movies.index.isin(list(common))]
    

In [687]:
common_movies('Al Pacino', 'Robert De Niro')

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
135,8.3,Heat,R,Action,170,"[u'Al Pacino', u'Robert De Niro', u'Val Kilmer']"
