<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Analyzing IMDb Data

_Author: Kevin Markham (DC)_

---

For project two, you will complete a series of exercises exploring movie rating data from IMDb.

For these exercises, you will conduct basic exploratory data analysis on IMDb's movie data to answer questions such as:

- What is the average rating per genre?
- How many different actors are in a movie?

This process will help you practice your data analysis skills while becoming comfortable with Pandas.

## Basic level

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#### Read in 'imdb_1000.csv' and store it in a DataFrame named movies.

In [3]:
movies = pd.read_csv('./data/imdb_1000.csv')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


#### Check the number of rows and columns.

In [4]:
# Answer:

# shape returns a (rows, columns) tuple
movies.shape




#### Check the data type of each column.

In [None]:
# Answer:

dataTypeSeries = movies.dtypes

#print data type:
print(dataTypeSeries)


#### Calculate the average movie duration.

In [None]:
# Answer:

movies.duration.mean()

# Result: 120.97957099080695

#### Sort the DataFrame by duration to find the shortest and longest movies.

In [None]:
# Answer:

# Shortest movie (sort_values replaces the removed DataFrame.sort)
movies.sort_values('duration').head(1)

# Longest movie
movies.sort_values('duration').tail(1)

#### Create a histogram of duration, choosing an "appropriate" number of bins.

In [None]:
# Answer:

movies.duration.plot(kind='hist', bins=20)


#### Use a box plot to display that same data.

In [None]:
# Answer:

movies.duration.plot(kind='box')


## Intermediate level

#### Count how many movies have each of the content ratings.

In [None]:
# Answer:

movies.content_rating.value_counts()

#### Use a visualization to display that same data, including a title and x and y labels.

In [None]:
# Answer:

movies.content_rating.value_counts().plot(kind='bar', title='Top 1000 Movies by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Number of Movies')

#### Convert the following content ratings to "UNRATED": NOT RATED, APPROVED, PASSED, GP.

In [None]:
# Answer:

movies['content_rating'] = movies.content_rating.replace(['NOT RATED', 'APPROVED', 'PASSED', 'GP'], 'UNRATED')


#### Convert the following content ratings to "NC-17": X, TV-MA.

In [None]:
# Answer:

movies['content_rating'] = movies.content_rating.replace(['X', 'TV-MA'], 'NC-17')


#### Count the number of missing values in each column.

In [None]:
# Answer:

movies.isnull().sum()


#### If there are missing values: examine them, then fill them in with "reasonable" values.

In [None]:
# Answer:

# Identify the rows with a missing content rating
movies[movies.content_rating.isnull()]

# Fill the missing content ratings with 'UNRATED'
movies['content_rating'] = movies.content_rating.fillna('UNRATED')
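
The examine-then-fill workflow can be sketched on a small stand-in table. The `toy` DataFrame and its titles are hypothetical placeholders, not rows from `imdb_1000.csv`; only the column names mirror the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `movies` DataFrame.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'content_rating': ['R', np.nan, 'PG'],
    'duration': [100, 110, 120],
})

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()

# Show every row that has at least one missing value, in any column.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)

# Fill the gaps by assigning the result back to the column.
toy['content_rating'] = toy['content_rating'].fillna('UNRATED')
```

Assigning the filled column back, rather than relying on `inplace=True`, tends to behave more predictably across pandas versions.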

#### Calculate the average star rating for movies 2 hours or longer, and compare that with the average star rating for movies shorter than 2 hours.

In [None]:
# Answer:

# Average star rating for movies 2 hours or longer
movies[movies.duration >= 120].star_rating.mean()

# Average star rating for movies shorter than 2 hours
movies[movies.duration < 120].star_rating.mean()


#### Use a visualization to detect whether there is a relationship between duration and star rating.

In [None]:
# Answer:

movies.plot(kind='scatter', x='star_rating', y='duration', alpha=0.2)
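
A scatter plot can be complemented with a single number. The sketch below computes the Pearson correlation between duration and star rating on a hypothetical toy DataFrame; the values are made up for illustration and say nothing about the real IMDb data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `movies` DataFrame.
toy = pd.DataFrame({
    'duration': [90, 100, 120, 150, 180],
    'star_rating': [7.5, 7.6, 7.9, 8.2, 8.6],
})

# Pearson correlation: values near 0 suggest little linear relationship,
# values near +1 or -1 suggest a strong one.
corr = toy['duration'].corr(toy['star_rating'])
print(round(corr, 3))
```

On the real data, the same call would be `movies.duration.corr(movies.star_rating)`.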


#### Calculate the average duration for each genre.

In [8]:
# Answer:
movies.groupby('genre').duration.mean()

# Results:
genre
Action       126.485294
Adventure    134.840000
Animation     96.596774
Biography    131.844156
Comedy       107.602564
Crime        122.298387
Drama        126.539568
Family       107.500000
Fantasy      112.000000
Film-Noir     97.333333
History       66.000000
Horror       102.517241
Mystery      115.625000
Sci-Fi       109.000000
Thriller     114.200000
Western      136.666667
Name: duration, dtype: float64


## Advanced level

#### Visualize the relationship between content rating and duration.

In [None]:
# Answer:

# Box plot of duration for each content rating
movies.boxplot(column='duration', by='content_rating')

# Histogram of duration for each content rating, on a shared x-axis
movies.duration.hist(by=movies.content_rating, sharex=True)

#### Determine the top rated movie (by star rating) for each genre.

In [None]:
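# Answer:
# Sort by star rating (highest first), then take the first title in each genre
movies.sort_values('star_rating', ascending=False).groupby('genre').title.first()

# Shortcut: gives the same answer only if the file is already sorted by star rating
movies.groupby('genre').title.first()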
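
An alternative that does not depend on row order is to look up the index of the maximum rating in each genre. This is a minimal sketch on a hypothetical toy DataFrame (the titles and ratings are placeholders), not necessarily the approach the exercise intends.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `movies` DataFrame.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'genre': ['Crime', 'Crime', 'Action', 'Action'],
    'star_rating': [8.1, 9.0, 8.5, 7.9],
})

# idxmax returns the row label of the highest star_rating within each genre.
top_idx = toy.groupby('genre')['star_rating'].idxmax()

# Use those labels to pull the full rows for the top movie per genre.
top_movies = toy.loc[top_idx, ['genre', 'title', 'star_rating']]
print(top_movies)
```

Note that `idxmax` keeps only the first row when several movies in a genre tie for the highest rating.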

#### Check if there are multiple movies with the same title, and if so, determine if they are actually duplicates.

In [None]:
# Answer:

dupe_titles = movies[movies.title.duplicated()].title
movies[movies.title.isin(dupe_titles)]
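
Listing rows that share a title is only the first half of the task. One way to check whether they are actual duplicates is to compare the remaining columns as well. The sketch below uses a hypothetical toy DataFrame with placeholder titles.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `movies` DataFrame.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie B', 'Movie B', 'Movie C'],
    'genre': ['Drama', 'Drama', 'Crime', 'Action', 'Comedy'],
    'duration': [120, 120, 95, 130, 101],
})

# Rows that share a title with at least one other row.
same_title = toy[toy.title.duplicated(keep=False)]

# Rows that match another row in every column (true duplicates).
exact_dupes = toy[toy.duplicated(keep=False)]

print(same_title.sort_values('title'))
print(exact_dupes)
```

Here "Movie B" appears twice but with a different genre and duration, so it is a shared title rather than a duplicate. On the real file, an index-like column such as `Unnamed: 0` differs on every row, so it would need to be left out by passing the other column names to `duplicated(subset=...)`.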


#### Calculate the average star rating for each genre, but only include genres with at least 10 movies.


#### Option 1: manually create a list of relevant genres, then filter using that list

In [None]:
# Answer:

movies.genre.value_counts()
top_genres = ['Drama', 'Comedy', 'Action', 'Crime', 'Biography', 'Adventure', 'Animation', 'Horror', 'Mystery']
movies[movies.genre.isin(top_genres)].groupby('genre').star_rating.mean()

#### Option 2: automatically create a list of relevant genres by saving the value_counts and then filtering

In [None]:
# Answer:

genre_counts = movies.genre.value_counts()
top_genres = genre_counts[genre_counts >= 10].index
movies[movies.genre.isin(top_genres)].groupby('genre').star_rating.mean()

#### Option 3: calculate the average star rating for all genres, then filter using a boolean Series

In [None]:
# Answer:

movies.groupby('genre').star_rating.mean()[movies.genre.value_counts() >= 10]


#### Option 4: aggregate by count and mean, then filter using the count

In [None]:
# Answer:

genre_ratings = movies.groupby('genre').star_rating.agg(['count', 'mean'])
genre_ratings[genre_ratings['count'] >= 10]

## Bonus

#### Figure out something "interesting" using the actors data!
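
One possible starting point, offered as a sketch rather than the expected answer: the `actors_list` column holds each cast as a string that looks like a Python list, so it can be parsed with `ast.literal_eval` and then counted. The toy rows below are hypothetical; apart from the cast of The Godfather shown in the `head()` output, the titles and actor names are placeholders.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical toy rows; 'Movie X', 'Movie Y', and 'Actor A/B/C' are placeholders.
toy = pd.DataFrame({
    'title': ['The Godfather', 'Movie X', 'Movie Y'],
    'actors_list': [
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']",
        "[u'Al Pacino', u'Actor A', u'Actor B']",
        "[u'Actor A', u'Actor C', u'Al Pacino']",
    ],
})

# Parse each string into a real Python list (literal_eval accepts the u'' prefix).
toy['actors'] = toy['actors_list'].apply(ast.literal_eval)

# Count how many movies each actor appears in.
actor_counts = Counter(actor for actors in toy['actors'] for actor in actors)
print(actor_counts.most_common(2))
```

From here, questions such as "which actors appear most often in the top 1000?" reduce to inspecting `actor_counts`.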