# Project: What makes movies more successful?

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This notebook investigates the TMDB movies dataset, with the aim of answering the following questions:

1. How has the popularity of various movie genres evolved over time?
2. What movie features are associated with higher profit?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [2]:
# loading the dataset
df_movies = pd.read_csv('tmdb-movies.csv')

In [3]:
print("Dataset shape:", df_movies.shape)

Dataset shape: (10866, 21)


In [4]:
# Show data types and non-null counts for each column
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              1

#### Exploring first rows and checking for null values:

In [5]:
df_movies.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


In [6]:
print("Number of missing values in the dataset:\n{}".format(df_movies.isnull().sum()))

Number of missing values in the dataset:
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64


Some columns are irrelevant to the research questions, and some have missing values. These need to be cleaned.

In [None]:
# build a list of irrelevant columns
not_needed = ['overview', 'tagline', 'homepage', 'keywords', 'production_companies', 'imdb_id', 'budget', 'revenue', 'cast', 'director']
df = df_movies.drop(not_needed, axis=1)

# drop movies with missing genres
df.dropna(subset=['genres'], inplace=True)

In [None]:
df.info()

In [None]:
print("Number of missing values in the dataset:\n{}".format(df.isnull().sum()))

<a id='eda'></a>
## Exploratory Data Analysis


### What are the most popular genres over time?

In [None]:
# get a list of all genres
all_genres = []
def split_genres(genres_str):
    genres = genres_str.split('|')
    all_genres.extend(genres)
    return genres

df['genre_split'] = df.genres.apply(split_genres)
all_genres = set(all_genres) # convert to a set to remove duplicates
print(all_genres)


In [None]:
# convert the set back to a list and sort it
all_genres = list(all_genres)
all_genres.sort()

# create a boolean column for each genre
for g in all_genres:
    df[g] = df.genres.str.contains(g)
    
df[all_genres].head()
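
The loop above relies on `str.contains`, which does substring (regex) matching, so a genre whose name is contained in another genre's name would be flagged for both. A minimal sketch of an alternative that matches whole genre names, using `Series.str.get_dummies` on a made-up three-row series (the toy data is an assumption for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the `genres` column (pipe-separated genre names).
genres = pd.Series(["Action|Adventure", "Drama", "Comedy|Drama"])

# One indicator column per distinct genre, split on the '|' separator,
# then converted from 0/1 integers to booleans.
genre_flags = genres.str.get_dummies(sep="|").astype(bool)
print(genre_flags)
```

On the real frame the same call would be applied to `df.genres`, and the resulting columns could be joined back onto `df`.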

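#### The number of movies produced in each genre:
Let's see how many movies fall into each genre. The chart shows the average number of movies per release year, which ranks the genres in the same order as the total count over the entire dataset.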

In [None]:
genre_counts = df[all_genres + ['release_year']].groupby('release_year').sum()
genre_counts.mean(axis=0).plot(kind='bar', figsize=(15,8), grid=True);

Clearly the top genres are: Drama, Comedy, Thriller, and Action.

Now, let's see how this evolved over time:

In [None]:
genre_counts.plot(figsize=(15,8), grid=True);

The graph shows that in recent years the number of drama movies has been increasing while the number of comedy movies has been declining. Interestingly, for most of the late 80s the opposite was true.
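
Raw counts can rise or fall simply because the total number of releases per year changes, which this notebook does not check. As a hedged sketch, the counts can be divided by the number of movies released each year to get each genre's share; the toy frame and the names `movies_per_year` and `genre_share` are assumptions for illustration:

```python
import pandas as pd

# Toy data: one row per movie, boolean genre flags plus the release year.
toy = pd.DataFrame({
    "release_year": [1985, 1985, 2010, 2010, 2010, 2010],
    "Comedy": [True, True, True, False, False, False],
    "Drama": [False, True, True, True, True, False],
})

# Movies per genre per year (same layout as `genre_counts` in the notebook).
counts = toy.groupby("release_year")[["Comedy", "Drama"]].sum()

# Total number of movies released each year, used as the denominator.
movies_per_year = toy.groupby("release_year").size()

# Fraction of each year's movies that carry each genre label.
genre_share = counts.div(movies_per_year, axis=0)
print(genre_share)
```

Because a movie can belong to several genres, the shares in one year need not sum to 1.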

Now, let's check if this relates to the popularity of each genre:

In [None]:
genre_ppl = df[all_genres].multiply(df['popularity'], axis='index')
genre_ppl['release_year'] = df['release_year']
genre_ppl[all_genres].mean().plot(kind='bar', figsize=(15,8), grid=True);

In general, the most popular genres are the same ones with the largest number of produced movies, with a few exceptions.
Action movies tend to have slightly higher popularity than Thriller movies. Also, Adventure and Science Fiction movies have relatively higher popularity compared to other genres with fewer movies.
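
One caveat about the computation: multiplying the boolean genre columns by `popularity` leaves a 0 for every movie outside a genre, so the mean is taken over all movies and grows with the genre's size as well as with its popularity. A minimal sketch on made-up data comparing that with the mean taken only over the movies in each genre (`mean_over_all` and `mean_within` are hypothetical names):

```python
import pandas as pd

# Toy data: popularity scores and boolean genre flags for four movies.
toy = pd.DataFrame({
    "popularity": [4.0, 2.0, 1.0, 3.0],
    "Action": [True, False, False, True],
    "Drama": [False, True, True, True],
})
genre_cols = ["Action", "Drama"]

# Notebook approach: non-members contribute 0, so the mean is over all movies.
mean_over_all = toy[genre_cols].multiply(toy["popularity"], axis="index").mean()

# Alternative: select only the rows flagged for a genre, then average them.
mean_within = pd.Series(
    {g: toy.loc[toy[g], "popularity"].mean() for g in genre_cols}
)

print(mean_over_all)
print(mean_within)
```

In this toy example Action averages 1.75 over all movies but 3.5 within the genre, so the two measures can differ noticeably.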


In [None]:
genre_ppl.groupby('release_year').mean().plot(figsize=(15,8), grid=True);

Plotting genre popularity over time shows some interesting points:
* The difference in popularity between the most popular genres is not large, and their ranking oscillates over time.
* Drama movies have been more popular since 2004, while comedy movies were the most popular in the mid and late 80s.
* Action movies had a spike in popularity during the period 2000-2005.

### What features are associated with higher profit?

First, we need to calculate the profit for each movie by subtracting the adjusted budget (`budget_adj`) from the adjusted revenue (`revenue_adj`).

In [None]:
df['profit'] = df['revenue_adj'] - df['budget_adj']
df.profit.describe()
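
The `isnull()` check above shows no nulls in the budget and revenue columns, but a column can still hold placeholder values such as 0. Whether zeros in this dataset stand for unknown figures is an assumption that this notebook does not verify. A small sketch on made-up numbers of how one might count such zeros and compute profit only where both figures are present:

```python
import pandas as pd

# Toy data: the second row has no budget figure, the third no revenue figure.
toy = pd.DataFrame({
    "budget_adj": [1.0e7, 0.0, 5.0e6],
    "revenue_adj": [3.0e7, 2.0e6, 0.0],
})

# Count zero entries per column.
zero_counts = (toy[["budget_adj", "revenue_adj"]] == 0).sum()
print(zero_counts)

# Keep only rows where both figures are positive, then compute profit.
known = toy[(toy["budget_adj"] > 0) & (toy["revenue_adj"] > 0)].copy()
known["profit"] = known["revenue_adj"] - known["budget_adj"]
print(known)
```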

Now, we calculate the correlation coefficient between profit and the other features, and print the sorted values:

In [None]:
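# keep only numeric and boolean columns so that corr() also works on recent pandas versions
other_cols = df.drop(['id', 'release_year'], axis=1).select_dtypes(include=['number', 'bool'])
profit_corr = other_cols.corr()['profit'].sort_values(ascending=False)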
profit_corr

For visualization, let's create a bar chart:

In [None]:
profit_corr.plot(kind='bar', figsize=(15, 8))

From these correlations, we find that the most promising features for predicting profit for a new movie with unknown popularity and votes are the budget and genre. Genres such as Adventure, Action, and Fantasy are more strongly correlated with higher profit.

Also, movies with large budgets tend to make higher profit than movies with small budgets.
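
A correlation coefficient summarizes the budget-profit relationship in one number. As a complementary, hedged sketch, movies can be grouped into budget quartiles with `pd.qcut` and the median profit compared across groups; the toy values and the bin labels are made up for illustration:

```python
import pandas as pd

# Toy data: eight movies with increasing budgets and assorted profits.
toy = pd.DataFrame({
    "budget_adj": [1, 2, 3, 4, 5, 6, 7, 8],
    "profit": [0, -1, 2, 1, 5, 3, 9, 12],
})

# Split movies into four equally sized budget groups (hypothetical labels).
toy["budget_bin"] = pd.qcut(
    toy["budget_adj"], q=4, labels=["low", "mid-low", "mid-high", "high"]
)

# Median profit within each budget group.
median_profit = toy.groupby("budget_bin", observed=True)["profit"].median()
print(median_profit)
```

The median is used here because it is less sensitive to a few very large hits than the mean.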

<a id='conclusions'></a>
## Conclusions
* In recent years, the number of drama movies has been increasing while the number of comedy movies has been declining. Interestingly, for most of the late 80s the opposite was true.
* In general, the most popular genres are the same ones with the largest number of produced movies, with a few exceptions. Action movies tend to have slightly higher popularity than Thriller movies.
* Adventure and Science Fiction movies have relatively higher popularity compared to other genres with fewer movies.
* The difference in popularity between the most popular genres is not large, and their ranking oscillates over time.
* Drama movies have been more popular since 2004, while comedy movies were the most popular in the mid and late 80s.
* Action movies had a spike in popularity during the period 2000-2005.
* The most promising features for predicting profit for a new movie with unknown popularity and votes are the budget and genre. Genres such as Adventure, Action, and Fantasy are more strongly correlated with higher profit.
* Movies with large budgets tend to make higher profit than movies with small budgets.

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])