# Data Analysis

At this point, we were ready to begin analyzing the data. Our ultimate goal was to recommend to the Microsoft team which genre and release month would be best for their next movie production. The two main questions we want to answer are: What is the best genre for their next movie? In which month should they release it?

To recommend the best genre, we will look at the distribution of films by genre in our database, and then compare gross revenue, IMDB ratings, and TMDB popularity per genre.

To recommend the release month, we will analyze gross revenue per month for each genre. We conjecture that horror movies earn the most revenue in the months around Halloween, but we will base our recommendation on the data.

## Distribution of Gross Revenue

We looked at the distribution of gross revenue to determine whether the median or the mean would be a better measure of central tendency. Based on the plot below, the distribution is skewed, so we determined that the median is the better measure: in a skewed distribution the mean is pulled toward the extreme values.

In [3]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
sns.set_context("poster")
sns.set_style("darkgrid")

In [4]:
combined = pd.read_csv("data/combined.csv")

In [15]:
# fig = plt.figure(figsize = (20, 10))
# sns.histplot(combined.gross_us, kde = True, stat = "density")  # distplot is deprecated since seaborn 0.11
# plt.title("Distribution of Gross Revenue")
# plt.xlabel("Gross Revenue ($M)")
# plt.savefig("images/gross_earnings_distribution.jpg")
# plt.show()

<img src="images/gross_earnings_distribution.jpg">
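To make the mean-versus-median comparison concrete, here is a minimal sketch on made-up numbers (the values are illustrative and are not taken from `combined.csv`). On a skewed sample, a few large values pull the mean well above the median; running the same three calls on `combined.gross_us` would quantify this for our data.

```python
import pandas as pd

# Hypothetical gross revenues in $M: many small earners and one blockbuster.
toy_gross = pd.Series([2.0, 5.0, 8.0, 12.0, 20.0, 35.0, 850.0])

toy_mean = toy_gross.mean()      # pulled upward by the single large value
toy_median = toy_gross.median()  # middle value, unaffected by the outlier
toy_skew = toy_gross.skew()      # positive when the long tail is on the right

print(f"mean = {toy_mean:.2f}, median = {toy_median:.2f}, skew = {toy_skew:.2f}")
```
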

## Distribution of Films by Genre

We looked at the distribution of films by genre to get an initial sense of how many movies of each genre have been produced. Based on the plot below, the three most common genres are drama, thriller, and comedy.

In [6]:
combo = combined.explode('genre_ids')
combo.head()

| Unnamed: 0 | genre_ids | popularity | release_date | title | year | pg_rating | imdb_id | runtime | genre | imdb_rating | votes | gross_us | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ['Drama', 'Crime'] | 43.166 | 1994-09-23 | The Shawshank Redemption | 1994 | R | 111161 | 142 | Drama | 9.3 | 2353109 | 28.34 | 9 |
| 1 | ['Drama', 'Crime'] | 37.301 | 1972-03-14 | The Godfather | 1972 | R | 68646 | 175 | Crime, Drama | 9.2 | 1628373 | 134.97 | 3 |
| 2 | ['Drama', 'History', 'War'] | 27.598 | 1993-11-30 | Schindler's List | 1993 | R | 108052 | 195 | Biography, Drama, History | 8.9 | 1218144 | 96.9 | 11 |
| 3 | ['Romance', 'Animation', 'Drama'] | 108.69 | 2016-08-26 | Your Name. | 2016 | PG | 5311514 | 106 | Animation, Drama, Fantasy | 8.4 | 197768 | 5.02 | 8 |
| 4 | ['Drama', 'Crime'] | 29.973 | 1974-12-20 | The Godfather: Part II | 1974 | R | 71562 | 202 | Crime, Drama | 9.0 | 1134978 | 57.3 | 12 |
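In the output above, each row of `genre_ids` still shows a whole list, which suggests the column was read from the CSV as text: `explode` only splits real list-like values and returns strings unchanged. A minimal sketch of one way to handle this, assuming the column holds list-like strings such as `"['Drama', 'Crime']"`, is to parse it with `ast.literal_eval` before exploding. The small frame below is a stand-in for `combined`, not the real data.

```python
import ast
import pandas as pd

# Stand-in for `combined`: genre_ids stored as text, as it would be after read_csv.
toy = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "genre_ids": ["['Drama', 'Crime']", "['Drama', 'History', 'War']"],
})

# explode on plain strings changes nothing: still one row per film.
unparsed = toy.explode("genre_ids")

# Parse each string into a real list, then explode to one row per (film, genre).
parsed = toy.assign(genre_ids=toy["genre_ids"].apply(ast.literal_eval))
exploded = parsed.explode("genre_ids")

print(len(unparsed), len(exploded))
print(exploded["genre_ids"].tolist())
```

After this step each film appears once per genre, so any per-genre count or median counts multi-genre films in every genre they belong to.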


In [9]:
combo.gross_us.max()

858.37

In [12]:
# Y = combo.genre_ids.value_counts(normalize = True)
# X = Y.index
# fig = plt.figure(figsize = (20, 12))
# g = sns.barplot(x = X, y = Y, palette = "Set2")
# g.set(title = 'Distribution of Genres', ylabel = "Proportion", xlabel = "Genre")
# plt.xticks(rotation=90, ha='right')
# plt.tight_layout()
# fig.savefig("images/Distribution_genre.png")
# plt.show()

<img src="images/Distribution_genre.png">

## Median Gross Revenue per Genre

We looked at the median gross revenue per genre to determine which genres generate the most revenue at the box office. By median gross revenue, the two highest-earning genres are Adventure and Family.

In [14]:
# fig = plt.figure(figsize = (20, 10))
# g = sns.boxplot(x = 'genre_ids', y = 'gross_us', data = combo, 
#                 showfliers = False, palette="Set2", linewidth = 5, fliersize= 1.5 )
# g.set(title = 'Median Gross Revenue per Genre', 
#       ylabel = "Median Gross Revenue ($M)", xlabel = "Genre")
# plt.axhline(combo.gross_us.median(), ls='--', lw = 1, color = 'black')
# plt.xticks(rotation=90, ha='right')
# plt.tight_layout()
# fig.savefig("images/Median_Gross_Earnings.png")
# plt.show()

<img src="images/Median_Gross_Earnings.png">
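The ranking behind a box plot like this one can also be read off numerically. The sketch below uses a tiny made-up frame in the one-row-per-genre shape (column names follow `combo`; the values are invented) and sorts genres by median `gross_us`.

```python
import pandas as pd

# Invented rows in the same shape as `combo` after explode.
toy_combo = pd.DataFrame({
    "genre_ids": ["Adventure", "Adventure", "Adventure", "Drama", "Drama", "Drama"],
    "gross_us":  [50.0, 120.0, 400.0, 5.0, 20.0, 300.0],
})

# Median gross per genre, highest first.
median_by_genre = (
    toy_combo.groupby("genre_ids")["gross_us"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_genre)
```
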

## Average IMDB Ratings Per Genre

In [None]:
# fig = plt.figure(figsize = (20, 10))
# g = sns.boxplot(x = 'genre_ids', y = 'imdb_rating', data = combo, 
#                 showfliers = False, palette="Set2", linewidth = 5, fliersize= 1.5 )
# g.set(title = 'Average IMDB Ratings per Genre', 
#       ylabel = "Average IMDB Ratings", xlabel = "Genre")
# plt.axhline(combo.imdb_rating.mean(), ls='--', lw = 1, color = 'black')
# plt.xticks(rotation=90, ha='right')
# plt.tight_layout()
# fig.savefig("images/Average_IMDB_Ratings.png")
# plt.show()

<img src="images/Average_IMDB_Ratings.png">

## Average TMDB Popularity per Genre

We looked at the average TMDB popularity per genre to determine which genres score highest on TMDB's popularity measure. Animation and Family have the highest popularity scores on that site.

In [None]:
# fig = plt.figure(figsize = (20, 10))
# g = sns.boxplot(x = 'genre_ids', y = 'popularity', data = combo, 
#                 showfliers = False, palette="Set2", linewidth = 5, fliersize= 1.5 )
# g.set(title = 'Average TMDB Popularity per Genre', 
#       ylabel = "Average TMDB Popularity", xlabel = "Genre")
# plt.axhline(combo.popularity.mean(), ls='--', lw = 1, color = 'black')
# plt.xticks(rotation=90, ha='right')
# plt.tight_layout()
# fig.savefig("images/Average_TMDB_Popularity.png")
# plt.show()

<img src="images/Average_TMDB_Popularity.png">

## Median Gross Earnings Per Month for each Genre

We wanted to use our data to recommend a release month to Microsoft. Given a preferred genre, we can look for the month with the highest median gross revenue for that genre. Conversely, we can find the highest revenue-generating genre for each month.

### Highest revenue-generating genre for each month (approximate median gross revenue, $M):

- Jan: War (60)
- Feb: Adventure (250)
- Mar: Family (200)
- Apr: SciFi, Adventure (400)
- May: Animation, Family (250)
- Jun: Family, Fantasy, Animation (200)
- Jul: Adventure, Action, War (200)
- Aug: Music (160)
- Sep: Western (50)
- Oct: Fantasy, Adventure (200)
- Nov: Animation, Family, Adventure (200)
- Dec: SciFi, Adventure (200)

### Month of highest revenues for each genre:

- Western: Mar
- War: Jul
- SciFi: Apr
- Family: May
- Adventure: Apr
- Action: Jul
- Fantasy: Oct
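Both lists can be derived from the month-by-genre table of medians. A minimal sketch, using invented numbers and the same `month`, `genre_ids`, and `gross_us` column names as `combo_2`: `idxmax` down each month column gives the top genre for that month, and `idxmax` across each genre row gives that genre's best month.

```python
import pandas as pd

# Invented rows; only the column names match `combo_2`.
toy = pd.DataFrame({
    "month":     [1, 1, 1, 1, 4, 4, 4, 4],
    "genre_ids": ["War", "War", "SciFi", "SciFi", "War", "War", "SciFi", "SciFi"],
    "gross_us":  [50.0, 70.0, 10.0, 30.0, 20.0, 40.0, 350.0, 450.0],
})

# Rows are genres, columns are month numbers, cells are median gross.
table = toy.groupby(["genre_ids", "month"])["gross_us"].median().unstack()

top_genre_per_month = table.idxmax(axis=0)   # index: month, value: genre
best_month_per_genre = table.idxmax(axis=1)  # index: genre, value: month

print(top_genre_per_month)
print(best_month_per_genre)
```

Note that `idxmax` reports only the first maximum, so months where several genres are effectively tied would still need a look at the full table.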

In [None]:
combo_2 = combo[combo.genre_ids != 'TV Movie']

In [None]:

# month_genre_gross_median = combo_2.groupby(['month', 'genre_ids'])['gross_us'].median().unstack().transpose()
# fig, axes = plt.subplots(3, 4, figsize = (50, 30))
# fig.suptitle('Median Gross Revenue for each Genre', fontsize=40)

# month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
#                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# colors = ['gray', 'red', 'orange', 'green', 'pink', 'dodgerblue',
#           'black', 'purple', 'yellow', 'lime', 'brown', 'silver']

# # One horizontal bar chart per month; the table's columns are month numbers 1-12
# x = month_genre_gross_median.index
# for i, ax in enumerate(axes.flat):
#     ax.barh(x, month_genre_gross_median[i + 1], color = colors[i])
#     ax.set_title(month_names[i])
#     ax.set_xlabel('Median gross USD in mil')
#     ax.set_ylabel('Genre')

# plt.show()

<img src="images/Median_Gross_Revenue.png">


## Runtime Through the Decades

We wanted to recommend how long Microsoft's first film should be. We looked at how movie runtimes have changed through the decades to find an ideal length.

In [None]:
# fig = plt.figure(figsize = (20, 10))
# g = sns.lineplot(x='year', y='runtime', ci=None, data=combo, lw=1)
# g.set(title = 'Runtime through the Decades', 
#           ylabel = "Runtime (minutes)", 
#           xlabel = "Year")
# plt.tight_layout()
# fig.savefig("images/Runtime_decades.png")
# plt.show()


<img src="images/Runtime_decades.png">
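Since the question is framed in decades, one option is to bin `year` into decades and compare a typical runtime per bin. The sketch below is only an illustration: it uses the `year` and `runtime` values of the five films shown in the `head()` output earlier, and `decade` is a column name introduced here, not one in our data.

```python
import pandas as pd

# year / runtime for the five films in the head() output (one row per film).
toy = pd.DataFrame({
    "year":    [1994, 1972, 1993, 2016, 1974],
    "runtime": [142, 175, 195, 106, 202],
})

# Floor each year to the start of its decade, e.g. 1994 -> 1990.
toy["decade"] = (toy["year"] // 10) * 10

median_runtime_by_decade = toy.groupby("decade")["runtime"].median()
print(median_runtime_by_decade)
```

Five films are far too few to draw a conclusion; the same two lines applied to the full table (one row per film, before `explode`) would give the per-decade picture.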