# Analysis on IMDb Movie Data

## Group Information
* 1430003004 Jeff 傅永升
* 1430003011 Covey 刘克盾
* 1430003029 Garfield 邬嘉祺
* 1430003030 Frank 邬可夫
* 1430003045 Bill 钟钧儒
## Collecting Data
The data comes from a Kaggle dataset that contains records for 1,000 movies.
### Kaggle
Kaggle is a platform for predictive modelling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users.[Wikipedia]
![](kaggle-data.png)
### IMDb
IMDb, formerly known as Internet Movie Database, is an online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews, operated by IMDb.com, Inc., a subsidiary of Amazon.[Wikipedia]
![](imdb-mainpage.png)
### Movies on IMDb
![](imdb-movie.png)

## Analysis
Five analyses were carried out by members of the group.
### Analysis on Directors
This analysis was contributed by **Junru (Bill) Zhong**, with the following aims:
* Find out which directors earned the most.
* Find out the genres of the films made by the five highest-earning directors.
* Estimate how much each of them will earn from their next film.
### Technologies
* Python
* Package: pandas

In [1]:
# Import package
import pandas as pd
# Read data
df = pd.read_csv('../Dataset/IMDB-Movie-Data.csv')
print(df.head())

   Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1     2               Prometheus  Adventure,Mystery,Sci-Fi   
2     3                    Split           Horror,Thriller   
3     4                     Sing   Animation,Comedy,Family   
4     5            Suicide Squad  Action,Adventure,Fantasy   

                                         Description              Director  \
0  A group of intergalactic criminals are forced ...            James Gunn   
1  Following clues to the origin of mankind, a te...          Ridley Scott   
2  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan   
3  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet   
4  A secret government agency recruits some of th...            David Ayer   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121

### Arranging Data
* Use the `groupby` method provided by pandas to select every director together with the revenue of their films.
* Aggregate this data into the total revenue and the film count for each director (only films with revenue data are counted).
* Drop the rows that contain null values.

In [4]:
# Total revenue and number of films with revenue data, per director.
directorList = df.groupby('Director')['Revenue (Millions)'].agg(['sum', 'count']).reset_index()
# Drop directors that have no revenue data.
directorList = directorList.dropna()
# Print out result
print(directorList.head())

              Director     sum  count
0           Aamir Khan    1.20      1
1  Abdellatif Kechiche    2.20      1
3           Adam McKay  438.14      4
4        Adam Shankman  157.33      2
5         Adam Wingard   21.07      2
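
A note on pandas versions: the output above (index 2 is missing after `dropna`) appears to rely on `sum` returning NaN for a director whose films all lack revenue data. Recent pandas releases return 0 for such a group by default, so `dropna` would no longer remove it. The following is a minimal sketch of one way to keep the same behavior, using a small made-up DataFrame; the directors `A`, `B`, `C`, the revenue values, and the names `toy` and `toyList` are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the column names match the dataset, the values do not.
toy = pd.DataFrame({
    'Director': ['A', 'A', 'B', 'C', 'C'],
    'Revenue (Millions)': [10.0, 30.0, np.nan, 5.0, np.nan],
})

grouped = toy.groupby('Director')['Revenue (Millions)']
toyList = pd.DataFrame({
    # min_count=1 makes the sum NaN when a director has no revenue values at all.
    'sum': grouped.sum(min_count=1),
    # count() only counts films whose revenue is not null.
    'count': grouped.count(),
}).reset_index()

# Director B has no revenue data, so this row is dropped.
toyList = toyList.dropna()
print(toyList)
```

Filtering on `count > 0` would be another way to get the same rows.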


### Sort Data
Sort the directors by the **average revenue** per film.
* Calculate the average revenue.
* Sort the list.
* Keep the top five directors.

In [6]:
# Get average revenue of each director.
directorList['avg'] = directorList['sum'] / directorList['count']
print(directorList.head())

              Director     sum  count      avg
0           Aamir Khan    1.20      1    1.200
1  Abdellatif Kechiche    2.20      1    2.200
3           Adam McKay  438.14      4  109.535
4        Adam Shankman  157.33      2   78.665
5         Adam Wingard   21.07      2   10.535
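
Because both `sum` and `count` skip null values, the ratio computed above should match what the built-in `mean` aggregation returns. A small sketch on made-up numbers illustrates this; the DataFrame `toy` and the name `stats` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; director C has one film without revenue data.
toy = pd.DataFrame({
    'Director': ['A', 'A', 'C', 'C'],
    'Revenue (Millions)': [10.0, 30.0, 5.0, np.nan],
})

stats = (
    toy.groupby('Director')['Revenue (Millions)']
    .agg(['sum', 'count', 'mean'])
    .reset_index()
)
# Same calculation as in the notebook cell: total revenue divided by film count.
stats['avg'] = stats['sum'] / stats['count']
print(stats)
```

The film with missing revenue affects neither the numerator nor the denominator, so it is ignored rather than treated as zero.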


In [7]:
# Sort by average revenue (descending) and keep the top five rows.
firstFive = directorList.sort_values('avg', ascending=False).head()
print(firstFive)

            Director      sum  count      avg
261    James Cameron   760.51      1  760.510
114  Colin Trevorrow   652.18      1  652.180
341      Joss Whedon  1082.27      2  541.135
377      Lee Unkrich   414.98      1  414.980
208        Gary Ross   408.00      1  408.000
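
For reference, pandas also offers `nlargest`, which picks the top rows directly. The sketch below rebuilds a few rows with `sum` and `count` values copied from the outputs above; it is an illustration on a partial table (named `sample` here, a hypothetical name), not the full `directorList`.

```python
import pandas as pd

# A handful of rows copied from the printed outputs above, in arbitrary order.
sample = pd.DataFrame({
    'Director': ['Adam McKay', 'Adam Wingard', 'Gary Ross', 'Joss Whedon',
                 'James Cameron', 'Lee Unkrich', 'Colin Trevorrow'],
    'sum': [438.14, 21.07, 408.00, 1082.27, 760.51, 414.98, 652.18],
    'count': [4, 2, 1, 2, 1, 1, 1],
})
sample['avg'] = sample['sum'] / sample['count']

# Select the five rows with the largest average revenue.
top5 = sample.nlargest(5, 'avg')
print(top5)
```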


### Genres of Directors' Work
Go back to the original dataset and find every film directed by these five directors.

In [8]:
# Look up the films of the top five directors.
directorNames = firstFive['Director']
for directorName in directorNames:
    for index, row in df.iterrows():
        if row['Director'] == directorName:
            print(row['Director'] + ', ' + row['Genre'] + ', ' + row['Title'])

James Cameron, Action,Adventure,Fantasy, Avatar
Colin Trevorrow, Action,Adventure,Sci-Fi, Jurassic World
Joss Whedon, Action,Sci-Fi, The Avengers
Joss Whedon, Action,Adventure,Sci-Fi, Avengers: Age of Ultron
Lee Unkrich, Animation,Adventure,Comedy, Toy Story 3
Gary Ross, Adventure,Sci-Fi,Thriller, The Hunger Games
Gary Ross, Action,Biography,Drama, Free State of Jones
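
The nested loop works, but it scans the whole dataset once per director. A possible alternative is to filter with `isin` in a single pass. The sketch below uses a tiny stand-in for `df` built from rows shown in the outputs above; `films`, `names`, `order`, and the helper column `rank` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Small stand-in for df, with rows copied from the outputs above.
films = pd.DataFrame({
    'Title': ['Guardians of the Galaxy', 'Avatar', 'The Avengers',
              'Jurassic World', 'Avengers: Age of Ultron'],
    'Genre': ['Action,Adventure,Sci-Fi', 'Action,Adventure,Fantasy', 'Action,Sci-Fi',
              'Action,Adventure,Sci-Fi', 'Action,Adventure,Sci-Fi'],
    'Director': ['James Gunn', 'James Cameron', 'Joss Whedon',
                 'Colin Trevorrow', 'Joss Whedon'],
})
# Stand-in for firstFive['Director'], already in ranking order.
names = ['James Cameron', 'Colin Trevorrow', 'Joss Whedon']

# Keep only the films whose director is in the list.
matches = films[films['Director'].isin(names)].copy()

# Order the rows by the directors' ranking; mergesort is stable,
# so films by the same director keep their dataset order.
order = {name: position for position, name in enumerate(names)}
matches['rank'] = matches['Director'].map(order)
matches = matches.sort_values('rank', kind='mergesort')

for row in matches.itertuples(index=False):
    print(row.Director + ', ' + row.Genre + ', ' + row.Title)
```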


*One of **Gary Ross**'s two films was left out of his average because it has no revenue data.*

### Summary
* The top five directors account for seven films in the dataset.
* Five of the seven films have the genre **Action**, which suggests that action films tend to earn well.
* Four of the seven films have the genre **Sci-Fi**, so this genre is also popular.
![](first-two.png)
![](third.png)
![](fourth-fifth.png)
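
The genre tallies can be checked programmatically instead of by hand. The sketch below re-enters the seven films printed earlier and counts how often each genre appears; `topFilms` and `genreCounts` are hypothetical names used only for this illustration.

```python
import pandas as pd

# The seven films listed in the output above.
topFilms = pd.DataFrame({
    'Title': ['Avatar', 'Jurassic World', 'The Avengers', 'Avengers: Age of Ultron',
              'Toy Story 3', 'The Hunger Games', 'Free State of Jones'],
    'Genre': ['Action,Adventure,Fantasy', 'Action,Adventure,Sci-Fi', 'Action,Sci-Fi',
              'Action,Adventure,Sci-Fi', 'Animation,Adventure,Comedy',
              'Adventure,Sci-Fi,Thriller', 'Action,Biography,Drama'],
})

# Split the comma-separated genres, put each genre on its own row, then count.
genreCounts = topFilms['Genre'].str.split(',').explode().value_counts()
print(genreCounts)
```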

### Analysis on Directors by Rating
This analysis was contributed by **Kefu (Frank) Wu**, with the following aims:
* Find out which directors are rated most highly.
* Find out the genres of the films made by the five highest-rated directors.
* Estimate the rating a new film may get based on its genres and its director.
### Technologies
* Python
* Package: pandas

In [3]:
# Show the original data
print(df.head())

   Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1     2               Prometheus  Adventure,Mystery,Sci-Fi   
2     3                    Split           Horror,Thriller   
3     4                     Sing   Animation,Comedy,Family   
4     5            Suicide Squad  Action,Adventure,Fantasy   

                                         Description              Director  \
0  A group of intergalactic criminals are forced ...            James Gunn   
1  Following clues to the origin of mankind, a te...          Ridley Scott   
2  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan   
3  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet   
4  A secret government agency recruits some of th...            David Ayer   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121

### Arranging Data
* Use the `groupby` method provided by pandas to select every director together with the ratings of their films.
* Aggregate this data into the sum of ratings and the film count for each director.

In [4]:
# Sum of ratings and number of films, per director.
directorList = df.groupby('Director')['Rating'].agg(['sum', 'count']).reset_index()
# Print out result
print(directorList.head())

              Director   sum  count
0           Aamir Khan   8.5      1
1  Abdellatif Kechiche   7.8      1
2            Adam Leon   6.5      1
3           Adam McKay  28.0      4
4        Adam Shankman  12.6      2


### Sort Data
Sort the directors by the **average rating** per film.
* Calculate the average rating.
* Sort the list.
* Keep the five directors with the highest average rating.

In [5]:
# Get average rating of each director.
directorList['avg'] = directorList['sum'] / directorList['count']
print(directorList.head())

              Director   sum  count  avg
0           Aamir Khan   8.5      1  8.5
1  Abdellatif Kechiche   7.8      1  7.8
2            Adam Leon   6.5      1  6.5
3           Adam McKay  28.0      4  7.0
4        Adam Shankman  12.6      2  6.3


In [6]:
# Sort by average rating (descending) and keep the top five rows.
firstFive = directorList.sort_values('avg', ascending=False).head()
print(firstFive)

                             Director   sum  count   avg
465                     Nitesh Tiwari   8.8      1  8.80
108                 Christopher Nolan  43.4      5  8.68
392                    Makoto Shinkai   8.6      1  8.60
470                   Olivier Nakache   8.6      1  8.60
194  Florian Henckel von Donnersmarck   8.5      1  8.50


### Genres of Directors' Work
Go back to the original dataset and find every film directed by these five directors.

In [23]:
# Look up the films of the top five directors.
directorNames = firstFive['Director']
for directorName in directorNames:
    for index, row in df.iterrows():
        if row['Director'] == directorName:
            print(row['Director'] + ': \n\t' + row['Genre'] + ', ' + row['Title'])

Nitesh Tiwari: 
	Action,Biography,Drama, Dangal
Christopher Nolan: 
	Adventure,Drama,Sci-Fi, Interstellar
Christopher Nolan: 
	Action,Crime,Drama, The Dark Knight
Christopher Nolan: 
	Drama,Mystery,Sci-Fi, The Prestige
Christopher Nolan: 
	Action,Adventure,Sci-Fi, Inception
Christopher Nolan: 
	Action,Thriller, The Dark Knight Rises
Makoto Shinkai: 
	Animation,Drama,Fantasy, Kimi no na wa
Olivier Nakache: 
	Biography,Comedy,Drama, The Intouchables
Florian Henckel von Donnersmarck: 
	Drama,Thriller, The Lives of Others
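
In the output above the director's name is repeated for every film. If one heading per director is preferred, grouping the matched rows is one option. The sketch below re-enters the nine films printed above; `topRated` and `lines` are hypothetical names used only for this illustration.

```python
import pandas as pd

# The nine films listed in the output above.
topRated = pd.DataFrame({
    'Director': ['Nitesh Tiwari'] + ['Christopher Nolan'] * 5 +
                ['Makoto Shinkai', 'Olivier Nakache', 'Florian Henckel von Donnersmarck'],
    'Genre': ['Action,Biography,Drama', 'Adventure,Drama,Sci-Fi', 'Action,Crime,Drama',
              'Drama,Mystery,Sci-Fi', 'Action,Adventure,Sci-Fi', 'Action,Thriller',
              'Animation,Drama,Fantasy', 'Biography,Comedy,Drama', 'Drama,Thriller'],
    'Title': ['Dangal', 'Interstellar', 'The Dark Knight', 'The Prestige', 'Inception',
              'The Dark Knight Rises', 'Kimi no na wa', 'The Intouchables',
              'The Lives of Others'],
})

lines = []
# sort=False keeps the directors in order of first appearance (the ranking order).
for director, group in topRated.groupby('Director', sort=False):
    lines.append(director + ':')
    for row in group.itertuples(index=False):
        lines.append('\t' + row.Genre + ', ' + row.Title)

print('\n'.join(lines))
```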


### Summary
* We found the five directors whose films are rated most highly by audiences.
* These five directors account for nine films in the dataset.
* Among them, Christopher Nolan is the only director with more than one film in the dataset, and he still maintains a high average rating (8.68 over five films).
* Three of Nolan's five films are action films, which suggests that Nolan is good at directing action films.
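
An average based on a single film is fragile, which is why the point about Nolan stands out. One possible refinement, offered here as an assumption rather than as part of the original analysis, is to require a minimum number of films before ranking. The sketch uses `sum` and `count` values copied from the outputs above; `ratings`, `robust`, and the threshold `minFilms` are hypothetical names.

```python
import pandas as pd

# A few rows copied from the printed outputs above.
ratings = pd.DataFrame({
    'Director': ['Nitesh Tiwari', 'Christopher Nolan', 'Makoto Shinkai',
                 'Adam McKay', 'Adam Shankman'],
    'sum': [8.8, 43.4, 8.6, 28.0, 12.6],
    'count': [1, 5, 1, 4, 2],
})
ratings['avg'] = ratings['sum'] / ratings['count']

# Hypothetical threshold: only rank directors with at least this many films.
minFilms = 2
robust = ratings[ratings['count'] >= minFilms].sort_values('avg', ascending=False)
print(robust)
```

The choice of threshold is a judgment call: a higher value gives steadier averages but excludes more directors.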