## Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


### BUSINESS UNDERSTANDING

##### KEY QUESTIONS FOR MOVIE STUDIO STRATEGY
1. Which movie genres generate the highest revenue? By analyzing total earnings across genres, we can determine which types of films are more profitable and likely to attract large audiences.
2. Does movie rating or critic score correlate with higher box office success? Using ratings from Rotten Tomatoes and the movie database, we can measure whether higher-rated movies earn more revenue.
3. Does a longer runtime affect rating or revenue? By analyzing the runtimes in the dataset, we can identify which runtimes are associated with higher earnings for different types of movies.
4. Does the release timing affect the earnings? By examining release months and their associated revenue, we can identify the optimal times to launch films for maximum success.

# Movie Studio Market Analysis
## Stakeholder
Head of the New Movie Studio

## Problem Statement
Our company is launching a new movie studio and needs to decide what types of films to produce in order to maximize box office success. However, the company currently lacks historical knowledge about which movie characteristics lead to strong financial performance.

Importing libraries for data loading

In [1]:
#python libraries
import sqlite3
import pandas as pd

Loading the datasets for Box Office Mojo and Rotten Tomatoes

In [5]:
# Load the Box Office Mojo dataset
BoxOfficeMojo = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
BoxOfficeMojo.head()

                                         title studio  domestic_gross  \
0                                  Toy Story 3     BV     415000000.0   
1                   Alice in Wonderland (2010)     BV     334200000.0   
2  Harry Potter and the Deathly Hallows Part 1     WB     296000000.0   
3                                    Inception     WB     292600000.0   
4                          Shrek Forever After   P/DW     238700000.0   

  foreign_gross  year  
0     652000000  2010  
1     691300000  2010  
2     664300000  2010  
3     535700000  2010  
4     513900000  2010  


In [6]:
# Load the Rotten Tomatoes reviews dataset (tab-separated, latin-1 encoded)
RottenTomatoes = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', compression='gzip', encoding='latin-1')
print(RottenTomatoes.head())


   id                                             review rating   fresh  \
0   3  A distinctly gallows take on contemporary fina...    3/5   fresh   
1   3  It's an allegory in search of a meaning that n...    NaN  rotten   
2   3  ... life lived in a bubble in financial dealin...    NaN   fresh   
3   3  Continuing along a line introduced in last yea...    NaN   fresh   
4   3             ... a perverse twist on neorealism...     NaN   fresh   

           critic  top_critic         publisher               date  
0      PJ Nabarro           0   Patrick Nabarro  November 10, 2018  
1  Annalee Newitz           0           io9.com       May 23, 2018  
2    Sean Axmaker           0  Stream on Demand    January 4, 2018  
3   Daniel Kasman           0              MUBI  November 16, 2017  
4             NaN           0      Cinema Scope   October 12, 2017  
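The `rating` column in the preview above holds fraction strings such as `3/5` alongside missing values, so it cannot be compared numerically as it stands. The following is a minimal sketch of one way to normalise fraction-style ratings to a score between 0 and 1. `parse_fraction_rating` is a hypothetical helper that is not part of the notebook, and treating every non-fraction entry (letter grades, for instance) as missing is an assumption.

```python
import math


def parse_fraction_rating(value):
    """Convert a 'numerator/denominator' string to a score between 0 and 1.

    Anything that is not a simple, valid fraction (missing values, letter
    grades, a zero denominator, a numerator above the denominator) becomes
    NaN so it can be handled separately.
    """
    if not isinstance(value, str) or "/" not in value:
        return math.nan
    numerator, _, denominator = value.partition("/")
    try:
        num, den = float(numerator), float(denominator)
    except ValueError:
        return math.nan
    if den <= 0 or num < 0 or num > den:
        return math.nan
    return num / den


# Made-up inputs mirroring the shapes seen in the preview.
examples = ["3/5", None, "B+", "7/10"]
scores = [parse_fraction_rating(v) for v in examples]
```

On the notebook's data this could be applied as `RottenTomatoes['rating'].apply(parse_fraction_rating)`.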


Inspecting the columns

In [7]:
print(BoxOfficeMojo.columns)


Index(['title', 'studio', 'domestic_gross', 'foreign_gross', 'year'], dtype='object')


In [8]:
print(RottenTomatoes.columns)

Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')


In [9]:
#check the info of the datasets
print(BoxOfficeMojo.info())
print(RottenTomatoes.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)

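The BoxOfficeMojo dataset has 5 columns and the RottenTomatoes dataset has 8 columns. Both have missing values: `studio`, `domestic_gross`, and `foreign_gross` in BoxOfficeMojo, and `review`, `rating`, `critic`, and `publisher` in RottenTomatoes. Note also that `foreign_gross` is stored as `object` rather than as a number.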
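Since `info()` reports `foreign_gross` as `object` while `domestic_gross` is `float64`, the column would need converting before any revenue arithmetic. The sketch below shows one possible conversion on made-up values. The comma-formatted entry is an assumption about why pandas did not parse the column as numeric, and `foreign_gross_num` is a hypothetical column name.

```python
import pandas as pd

# Made-up values; the comma-formatted string is an assumed cause of the
# object dtype, not something confirmed by the output above.
sample = pd.DataFrame({"foreign_gross": ["652000000", "1,131.6", None]})

# Strip thousands separators, then convert. errors="coerce" turns anything
# that still cannot be parsed into NaN instead of raising.
cleaned = sample["foreign_gross"].str.replace(",", "", regex=False)
sample["foreign_gross_num"] = pd.to_numeric(cleaned, errors="coerce")
```

Using `errors="coerce"` keeps the conversion from failing on unexpected strings, at the cost of silently turning them into missing values, so it is worth comparing the null count before and after.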

Check for missing values

In [10]:
#Box office Mojo
print(BoxOfficeMojo.isnull().sum())
#Rotten Tomatoes
print(RottenTomatoes.isnull().sum())

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64
id                0
review         5563
rating        13517
fresh             0
critic         2722
top_critic        0
publisher       309
date              0
dtype: int64
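Raw counts are easier to judge as a share of rows: `foreign_gross` is missing in 1350 of 3387 rows (about 40%), and `rating` in 13517 of 54432 (about 25%). A small sketch of a helper that reports both figures follows; `missing_summary` is a hypothetical name and the sample frame is made up.

```python
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages for columns with gaps."""
    counts = frame.isnull().sum()
    percent = (frame.isnull().mean() * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns that actually have missing values, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up frame standing in for BoxOfficeMojo.
sample = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d"],
        "studio": ["x", None, "y", "z"],
        "foreign_gross": [None, None, "1", "2"],
    }
)
summary = missing_summary(sample)
```

On the notebook's data this would be called as `missing_summary(BoxOfficeMojo)` and `missing_summary(RottenTomatoes)`.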


In [7]:
df.duplicated().sum()

0

`df.duplicated().sum()` returns 0, so no row is an exact duplicate of another across all columns.
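One caveat: at this point `df` still contains the `Unnamed: 0` column, which the summary statistics later in the notebook show running from 0 to 26516 over 26517 rows. If that column is a row counter, as it appears to be, every row is trivially unique and a full-row check cannot find repeats. The sketch below, on made-up data, contrasts the full-row check with a check restricted to identifying columns. Choosing `id` and `title` as the key is an assumption.

```python
import pandas as pd

# Made-up frame: the last two rows describe the same movie but differ
# in the row-counter column.
sample = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "id": [10, 20, 20],
        "title": ["A", "B", "B"],
    }
)

# Full-row comparison: the counter column makes every row distinct.
full_row_dupes = int(sample.duplicated().sum())

# Comparison on the assumed key columns only.
key_dupes = int(sample.duplicated(subset=["id", "title"]).sum())
```

Here `full_row_dupes` is 0 while `key_dupes` is 1, which is the pattern to look for in the real data.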

Generating summary statistics

In [8]:
df.describe()

       Unnamed: 0             id  popularity  vote_average  vote_count
count     26517.0        26517.0     26517.0       26517.0     26517.0
mean      13258.0   295050.15326    3.130912      5.991281  194.224837
std    7654.94288  153661.615648    4.355229      1.852946  960.961095
min           0.0           27.0         0.6           0.0         1.0
25%        6629.0       157851.0         0.6           5.0         2.0
50%       13258.0       309581.0       1.374           6.0         5.0
75%       19887.0       419542.0       3.694           7.0        28.0
max       26516.0       608444.0      80.773          10.0     22186.0


Getting genre distribution

In [9]:
df['genre_ids'].value_counts().head()

genre_ids
[99]    3700
[]      2479
[18]    2268
[35]    1660
[27]    1145
Name: count, dtype: int64
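`value_counts()` here counts whole lists, so a movie tagged with several genres is counted as its own combination rather than once per genre. A sketch of per-genre counting follows. It assumes `genre_ids` holds string representations of lists (the column is `object` and prints as `[99]`, `[]`), and the sample values are made up.

```python
import ast

import pandas as pd

# Made-up values in the same shape as the column above.
sample = pd.DataFrame({"genre_ids": ["[99]", "[]", "[18, 35]", "[35]"]})

# Turn each string into a real Python list without using eval().
parsed = sample["genre_ids"].apply(ast.literal_eval)

# explode() gives one row per genre id; empty lists become NaN and are dropped.
genre_counts = parsed.explode().dropna().astype(int).value_counts()
```

With this sample, genre 35 is counted twice (once from `[18, 35]` and once from `[35]`), which the list-level count would not show.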

Dropping columns that we do not need

In [10]:
df = df.drop(columns=['Unnamed: 0'])

Checking for updated data types in our dataset

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 1.8+ MB
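`release_date` is still `object` here, so it would need converting before release timing can be analysed by month. A minimal sketch on made-up values is below. The ISO `YYYY-MM-DD` format is an assumption, and `release_month` is a hypothetical column name.

```python
import pandas as pd

# Made-up dates; the last entry shows how unparseable values are handled.
sample = pd.DataFrame({"release_date": ["2010-11-19", "2010-07-16", "not a date"]})

# Parse with an explicit format; errors="coerce" yields NaT for bad values.
sample["release_date"] = pd.to_datetime(
    sample["release_date"], format="%Y-%m-%d", errors="coerce"
)

# Month number (1 to 12); NaN where the date could not be parsed.
sample["release_month"] = sample["release_date"].dt.month
```

Grouping on `release_month` afterwards gives a per-month view of whichever metric is of interest.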


Creating cleaned copy

In [12]:
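# Select the columns needed for EDA; .copy() makes eda_df independent of df
# so later edits do not trigger SettingWithCopyWarning
eda_df = df[
    [
        'id',
        'title',
        'original_language',
        'genre_ids',
        'release_date',
        'popularity',
        'vote_average',
        'vote_count'
    ]
].copy()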