![example](images/director_shot.jpeg)

# Project Title

**Authors:** Jonathan, Matt, Nate, Roshni
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

Microsoft would like to open a new movie studio to create original video content and wants to know which types of films are currently doing best at the box office. We have generated three actionable insights that will help the head of Microsoft's new movie studio decide which types of films to create.

* **Which methods did we use?**

* **What are the recommendations?**

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

Which type of films should Microsoft's new movie studio create?

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

### Data Prep The Movies Database

#### Importing the data as a dataframe
- Using the first column as the index, since it holds no relevant information
- Parsing `release_date` to convert it to datetime/timestamp

In [2]:
df_tmdb = pd.read_csv('./zippedData/tmdb.movies.csv.gz', index_col = 0, parse_dates=['release_date'])

#### Getting a general idea of what the dataset looks like

In [3]:
df_tmdb

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.920,2010-07-16,Inception,8.3,22186
...,...,...,...,...,...,...,...,...,...
26512,"[27, 18]",488143,en,Laboratory Conditions,0.600,2018-10-13,Laboratory Conditions,0.0,1
26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,"[14, 28, 12]",381231,en,The Last One,0.600,2018-10-01,The Last One,0.0,1
26515,"[10751, 12, 28]",366854,en,Trailer Made,0.600,2018-06-22,Trailer Made,0.0,1


In [4]:
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   genre_ids          26517 non-null  object        
 1   id                 26517 non-null  int64         
 2   original_language  26517 non-null  object        
 3   original_title     26517 non-null  object        
 4   popularity         26517 non-null  float64       
 5   release_date       26517 non-null  datetime64[ns]
 6   title              26517 non-null  object        
 7   vote_average       26517 non-null  float64       
 8   vote_count         26517 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 2.0+ MB


**Key Takeaways:**
- 26517 rows with 9 columns of information
- No null values
- Need to find a movie genre key to figure out what the genres mean
- Limit the data to the most recent 10 years to stay relevant
    - Can maybe also plot vote_avg vs time to see trends that way?
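
The last takeaway, looking at `vote_average` over time, could be sketched roughly as follows. The rows in `toy` are made-up stand-ins rather than the real `df_tmdb`, and the grouping by release year is an assumption about how the trend would be summarised.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook would use df_tmdb instead.
toy = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2010-03-26", "2010-07-16", "2011-05-06", "2011-11-18"]
    ),
    "vote_average": [7.7, 8.3, 6.6, 6.0],
})

# Average rating per release year, indexed by the year as an integer.
yearly_mean = toy.groupby(toy["release_date"].dt.year)["vote_average"].mean()
print(yearly_mean)
```

Calling `yearly_mean.plot()` afterwards would draw the trend line with matplotlib.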

### Data Cleaning: The Movies Database
#### Removing Unnecessary Columns & Filtering Data
**Columns to drop**
- `popularity`
   - popularity is based on current popularity, not how well the movie performed when released or how it was reviewed
        - https://developers.themoviedb.org/3/getting-started/popularity
    
- `id`
    - just a unique identifier, not relevant

- `original_title`
    - there are two columns for title, we will keep the final title column

In [5]:
df_cleaning = df_tmdb.drop(['popularity', 'id', 'original_title'], axis=1)

**Columns to filter**
- `original_language`
    - Microsoft would most likely release films in English, as the company is based in the USA
    
- `vote_count`
    - Microsoft is only interested in a successful movie
    - It can be reasonably assumed that movies with fewer votes are less successful
    - We are dropping movies whose vote count is below the mean
    
- `release_date`
    - Microsoft is interested in current data trends
        - We will limit the data to the last 10 years of available data (2010-2019)

Keeping only English-language movies, then dropping the `original_language` column as it is no longer needed.

In [6]:
df_filtered = df_cleaning[df_cleaning['original_language'] == 'en']
df_filtered = df_filtered.drop('original_language', axis=1)

**Drop Certain Movies**

Dropping movies with 200 votes or fewer.
(I chose this value as it is roughly the mean number of votes in the dataset)

In [7]:
df_filtered['vote_count'].describe()

count    23291.000000
mean       209.307887
std       1016.214512
min          1.000000
25%          1.000000
50%          4.000000
75%         24.000000
max      22186.000000
Name: vote_count, dtype: float64

In [8]:
df_filtered = df_filtered[df_filtered['vote_count'] > 200]
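
Because the vote counts are heavily right-skewed (the median above is 4 while the mean is about 209), a mean-based cutoff keeps only a small slice of the data. The sketch below illustrates that effect on made-up numbers; `vote_counts` is hypothetical and not drawn from the dataset.

```python
import statistics

# Hypothetical, heavily right-skewed vote counts (not taken from the dataset).
vote_counts = [1, 1, 2, 4, 4, 10, 24, 150, 900, 12000]

mean_votes = statistics.mean(vote_counts)      # pulled upward by a few large values
median_votes = statistics.median(vote_counts)  # closer to the "typical" movie

# Movies that survive a strictly-greater-than-mean cutoff.
kept = [v for v in vote_counts if v > mean_votes]
share_kept = len(kept) / len(vote_counts)

print(f"mean={mean_votes:.1f}, median={median_votes}, kept={share_kept:.0%}")
```

Checking the retained share this way can help confirm that the cutoff leaves enough movies for the later analysis.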

Limiting the movies to those released after January 1, 2010.

In [9]:
df_filtered = df_filtered[df_filtered['release_date'] > pd.Timestamp(2010, 1, 1)]
df_filtered

Unnamed: 0,genre_ids,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",2010-05-07,Iron Man 2,6.8,12368
4,"[28, 878, 12]",2010-07-16,Inception,8.3,22186
5,"[12, 14, 10751]",2010-02-11,Percy Jackson & the Olympians: The Lightning T...,6.1,4229
...,...,...,...,...,...
24369,"[18, 36, 35]",2017-11-22,The Man Who Invented Christmas,6.6,323
24383,[27],2018-10-05,Malevolent,5.0,236
24409,"[9648, 53]",2017-10-27,All I See Is You,4.9,311
24422,"[35, 18]",2018-02-16,The Party,6.4,229


We are left with the table `df_filtered` that has 2318 rows and 5 columns of relevant information
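
The date filter above uses a strict `>` comparison, so a movie released exactly on 2010-01-01 would be excluded. The sketch below, on made-up rows, shows how the strict and inclusive comparisons differ at that boundary; `toy`, `strict`, `inclusive`, and `by_year` are illustrative names only.

```python
import pandas as pd

# Hypothetical release dates chosen to sit on either side of the cutoff.
toy = pd.DataFrame({
    "title": ["Old", "New Year", "Later"],
    "release_date": pd.to_datetime(["2009-12-31", "2010-01-01", "2010-01-02"]),
})

cutoff = pd.Timestamp(2010, 1, 1)
strict = toy[toy["release_date"] > cutoff]           # drops the 2010-01-01 release
inclusive = toy[toy["release_date"] >= cutoff]       # keeps it
by_year = toy[toy["release_date"].dt.year >= 2010]   # same rows as `inclusive`

print(strict["title"].tolist(), inclusive["title"].tolist())
```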

### Understanding the Data

**Making sense of the genre_ids**

We can see that the genre IDs are in order of best fit rather than numerical or alphabetical order
- For index 0: `genre_ids` = [12, 14, 10751]
- For index 1: `genre_ids` = [14, 12, 16, 10751]

We will take the primary and secondary IDs from each `genre_ids` list to get a better idea of which genres relate to the count

**Determining the data type for genre_ids**

Checking the type of data for genre_ids below. Goal is to create new columns with primary and secondary genres.

In [10]:
print(df_filtered['genre_ids'][0])
print(type(df_filtered['genre_ids'][0]))

[12, 14, 10751]
<class 'str'>


The data in genre_ids looks like a list, but is really a string including brackets. Let's clean this up.

**Created a for-loop that runs through each row of the column 'genre_ids'**

The for-loop:
 - First strips the brackets, quotes, and spaces from the string and splits it into a list of ID strings.
 - Then it takes each ID and appends it to a new list corresponding to its position in the original `genre_ids` list.
 - If there is no value at that position, it appends `None` to the list instead.
 - The `None` placeholders keep each new list aligned with the correct row index.

In [11]:
char_remove = ["'", " ", "[", "]"]

first_genre = []
second_genre = []

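for x in df_filtered['genre_ids']:
    # Strip quotes, spaces, and brackets, then split into a list of ID strings
    row = x
    for char in char_remove:
        row = row.replace(char, '')
    row = row.split(',')
    first_genre.append(row[0])

    # Keep second_genre aligned with the rows: None when there is no second ID
    if len(row) >= 2:
        second_genre.append(row[1])
    else:
        second_genre.append(None)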
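
The same parsing could also be sketched with `ast.literal_eval`, which safely evaluates the bracketed string as a Python list. `split_genres` is a hypothetical helper, not part of the notebook. Unlike the loop above, it returns `None` rather than an empty string when the list is empty.

```python
import ast

def split_genres(raw):
    """Return (primary, secondary) genre IDs as strings; None where missing."""
    ids = [str(i) for i in ast.literal_eval(raw)]  # "[12, 14]" -> ["12", "14"]
    primary = ids[0] if len(ids) >= 1 else None
    secondary = ids[1] if len(ids) >= 2 else None
    return primary, secondary

# Hypothetical inputs covering three, one, and zero genre IDs.
examples = ["[12, 14, 10751]", "[27]", "[]"]
parsed = [split_genres(raw) for raw in examples]
print(parsed)
```

The IDs are kept as strings so that they still match the string keys of the genre lookup used later.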

**Now that we have lists of primary and secondary genres we can turn them into columns.**

In [12]:
df_filtered['primary_genre'] = first_genre
df_filtered['secondary_genre'] = second_genre
df_filtered

Unnamed: 0,genre_ids,release_date,title,vote_average,vote_count,primary_genre,secondary_genre
0,"[12, 14, 10751]",2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,12,14
1,"[14, 12, 16, 10751]",2010-03-26,How to Train Your Dragon,7.7,7610,14,12
2,"[12, 28, 878]",2010-05-07,Iron Man 2,6.8,12368,12,28
4,"[28, 878, 12]",2010-07-16,Inception,8.3,22186,28,878
5,"[12, 14, 10751]",2010-02-11,Percy Jackson & the Olympians: The Lightning T...,6.1,4229,12,14
...,...,...,...,...,...,...,...
24369,"[18, 36, 35]",2017-11-22,The Man Who Invented Christmas,6.6,323,18,36
24383,[27],2018-10-05,Malevolent,5.0,236,27,
24409,"[9648, 53]",2017-10-27,All I See Is You,4.9,311,9648,53
24422,"[35, 18]",2018-02-16,The Party,6.4,229,35,18


**Converting the genre_ids codes to something more understandable**

I found the below movie genre key on The MovieDB website. The key is important so that we can understand what the genre_id code actually means.

(found at: https://www.themoviedb.org/talk/5daf6eb0ae36680011d7e6ee)

In [13]:
backwards_key = {
'Action' : '28',
'Adventure' : '12',
'Animation' : '16',
'Comedy' : '35',
'Crime' : '80',
'Documentary' : '99',
'Drama' : '18',
'Family' : '10751',
'Fantasy' : '14',
'History' : '36',
'Horror' : '27',
'Music' : '10402',
'Mystery' : '9648',
'Romance' : '10749',
'Science Fiction' : '878',
'TV Movie' : '10770',
'Thriller' : '53',
'War' : '10752',
'Western' : '37'
}

genre_key = {v: k for k, v in backwards_key.items()}
# print(genre_key)

**Now to put the key to work and change the columns from numbers to English**

In [14]:
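# Assign the result back; an in-place replace on a selected column may not update the DataFrame
df_filtered["primary_genre"] = df_filtered["primary_genre"].replace(genre_key)
df_filtered["secondary_genre"] = df_filtered["secondary_genre"].replace(genre_key)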
df_filtered

Unnamed: 0,genre_ids,release_date,title,vote_average,vote_count,primary_genre,secondary_genre
0,"[12, 14, 10751]",2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,Adventure,Fantasy
1,"[14, 12, 16, 10751]",2010-03-26,How to Train Your Dragon,7.7,7610,Fantasy,Adventure
2,"[12, 28, 878]",2010-05-07,Iron Man 2,6.8,12368,Adventure,Action
4,"[28, 878, 12]",2010-07-16,Inception,8.3,22186,Action,Science Fiction
5,"[12, 14, 10751]",2010-02-11,Percy Jackson & the Olympians: The Lightning T...,6.1,4229,Adventure,Fantasy
...,...,...,...,...,...,...,...
24369,"[18, 36, 35]",2017-11-22,The Man Who Invented Christmas,6.6,323,Drama,History
24383,[27],2018-10-05,Malevolent,5.0,236,Horror,
24409,"[9648, 53]",2017-10-27,All I See Is You,4.9,311,Mystery,Thriller
24422,"[35, 18]",2018-02-16,The Party,6.4,229,Comedy,Drama
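
One thing worth checking after the lookup is whether any ID was missing from the key. The sketch below, on made-up values, contrasts `Series.replace`, which leaves unmapped IDs unchanged, with `Series.map`, which turns them into `NaN` so they are easy to spot. `genre_key_demo` and the ID `"99999"` are hypothetical.

```python
import pandas as pd

genre_key_demo = {"12": "Adventure", "27": "Horror"}  # small subset of the key
ids = pd.Series(["12", "27", "99999"])                # "99999" is a made-up ID

replaced = ids.replace(genre_key_demo)  # unmapped IDs pass through unchanged
mapped = ids.map(genre_key_demo)        # unmapped IDs become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```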


**For our later data analysis, let's combine the two genres into a list of lists that we will add as a third column**

We are going to sort these so that our genre combos come out in the same order, regardless of primary/secondary for easy charting.

1) Merge primary and secondary into a list of lists

In [15]:
merged_genre = [list(x) for x in zip(list(df_filtered['primary_genre']), list(df_filtered['secondary_genre']))]

2) Remove any None type from the list of lists for the secondary genres

In [16]:
for genre in merged_genre:
    if genre[1] is None:
        genre.pop()

3) Sort the list based on the values for each list (each row) within the total list

In [17]:
for genre in merged_genre:
    genre.sort()

**Now that we have a list of the combined genres, we can turn it into a column.**

In [18]:
df_filtered['combined_genres'] = merged_genre
df_filtered

Unnamed: 0,genre_ids,release_date,title,vote_average,vote_count,primary_genre,secondary_genre,combined_genres
0,"[12, 14, 10751]",2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,Adventure,Fantasy,"[Adventure, Fantasy]"
1,"[14, 12, 16, 10751]",2010-03-26,How to Train Your Dragon,7.7,7610,Fantasy,Adventure,"[Adventure, Fantasy]"
2,"[12, 28, 878]",2010-05-07,Iron Man 2,6.8,12368,Adventure,Action,"[Action, Adventure]"
4,"[28, 878, 12]",2010-07-16,Inception,8.3,22186,Action,Science Fiction,"[Action, Science Fiction]"
5,"[12, 14, 10751]",2010-02-11,Percy Jackson & the Olympians: The Lightning T...,6.1,4229,Adventure,Fantasy,"[Adventure, Fantasy]"
...,...,...,...,...,...,...,...,...
24369,"[18, 36, 35]",2017-11-22,The Man Who Invented Christmas,6.6,323,Drama,History,"[Drama, History]"
24383,[27],2018-10-05,Malevolent,5.0,236,Horror,,[Horror]
24409,"[9648, 53]",2017-10-27,All I See Is You,4.9,311,Mystery,Thriller,"[Mystery, Thriller]"
24422,"[35, 18]",2018-02-16,The Party,6.4,229,Comedy,Drama,"[Comedy, Drama]"
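
Python lists are not hashable, so a column of lists cannot be used directly as a grouping or counting key. The sketch below, on made-up pairs, shows one possible workaround: freezing each sorted combo as a tuple before counting. `pairs` and `combo_counts` are illustrative names only.

```python
from collections import Counter

# Hypothetical (primary, secondary) pairs; None marks a missing secondary genre.
pairs = [
    ("Adventure", "Fantasy"),
    ("Fantasy", "Adventure"),
    ("Horror", None),
    ("Action", "Adventure"),
]

# Drop missing values, sort, and freeze each combo as a tuple so it is hashable.
combos = [tuple(sorted(g for g in pair if g is not None)) for pair in pairs]
combo_counts = Counter(combos)
print(combo_counts.most_common())
```

Sorting before freezing means "Adventure, Fantasy" and "Fantasy, Adventure" are counted as the same combination, matching the intent of the steps above.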


### Data Prep The Numbers Database

Opening up the database and examining the tables:

In [24]:
import sqlite3

con = sqlite3.connect('zippedData/im.db')
query = """  SELECT * FROM sqlite_master  """
tables = pd.read_sql(query, con)
tables

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movie_basics,movie_basics,2,"CREATE TABLE ""movie_basics"" (\n""movie_id"" TEXT..."
1,table,directors,directors,3,"CREATE TABLE ""directors"" (\n""movie_id"" TEXT,\n..."
2,table,known_for,known_for,4,"CREATE TABLE ""known_for"" (\n""person_id"" TEXT,\..."
3,table,movie_akas,movie_akas,5,"CREATE TABLE ""movie_akas"" (\n""movie_id"" TEXT,\..."
4,table,movie_ratings,movie_ratings,6,"CREATE TABLE ""movie_ratings"" (\n""movie_id"" TEX..."
5,table,persons,persons,7,"CREATE TABLE ""persons"" (\n""person_id"" TEXT,\n ..."
6,table,principals,principals,8,"CREATE TABLE ""principals"" (\n""movie_id"" TEXT,\..."
7,table,writers,writers,9,"CREATE TABLE ""writers"" (\n""movie_id"" TEXT,\n ..."
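
For a quick list of table names alone, the `sqlite_master` query can be narrowed to the `name` column. The sketch below uses a throwaway in-memory database with two simplified tables, not the real `im.db`, so the column definitions are assumptions.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")  # throwaway database, not im.db
con_demo.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
con_demo.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

# sqlite_master holds one row per schema object; keep only the tables.
rows = con_demo.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
con_demo.close()

print(table_names)
```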


**Tables of initial interest:**
1. movie_basics
2. movie_ratings

#### Examine Movie Basics table

In [26]:
query = """  SELECT * FROM movie_basics  """
movie_basics = pd.read_sql(query, con)
movie_basics.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [27]:
movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


**Notes:**
- This table has the movie titles, release year, and genres. The primary key is the `movie_id`, which is referenced in other tables such as `movie_ratings`.
- There appears to be a movie_id, primary title, and start_year for all rows.

#### Examine Movie Ratings table

In [28]:
query = """  SELECT * FROM movie_ratings  """
movie_ratings = pd.read_sql(query, con)
movie_ratings

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
...,...,...,...
73851,tt9805820,8.1,25
73852,tt9844256,7.5,24
73853,tt9851050,4.7,14
73854,tt9886934,7.0,5


In [29]:
movie_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


**Notes:**
- Movie ratings are only given for roughly 74K of the 146K movies.
- Since the ratings are the most important quantitative information from this file, I would suggest we 

#### Merge Tables
- merging `movie_ratings` with `movie_basics` 

In [30]:
movies_with_ratings = pd.merge(movie_ratings, movie_basics, on='movie_id')
movies_with_ratings

Unnamed: 0,movie_id,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres
0,tt10356526,8.3,31,Laiye Je Yaarian,Laiye Je Yaarian,2019,117.0,Romance
1,tt10384606,8.9,559,Borderless,Borderless,2019,87.0,Documentary
2,tt1042974,6.4,20,Just Inès,Just Inès,2010,90.0,Drama
3,tt1043726,4.2,50352,The Legend of Hercules,The Legend of Hercules,2014,99.0,"Action,Adventure,Fantasy"
4,tt1060240,6.5,21,Até Onde?,Até Onde?,2011,73.0,"Mystery,Thriller"
...,...,...,...,...,...,...,...,...
73851,tt9805820,8.1,25,Caisa,Caisa,2018,84.0,Documentary
73852,tt9844256,7.5,24,Code Geass: Lelouch of the Rebellion - Glorifi...,Code Geass: Lelouch of the Rebellion Episode III,2018,120.0,"Action,Animation,Sci-Fi"
73853,tt9851050,4.7,14,Sisters,Sisters,2019,,"Action,Drama"
73854,tt9886934,7.0,5,The Projectionist,The Projectionist,2019,81.0,Documentary
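
`pd.merge` defaults to an inner join, so movies without a rating drop out of the merged table. The sketch below, on made-up rows, shows how the `validate` and `indicator` parameters could be used to confirm the key is unique and to see which movies have no rating. `basics`, `ratings`, and the `tt1`-style IDs are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
basics = pd.DataFrame({"movie_id": ["tt1", "tt2", "tt3"],
                       "primary_title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movie_id": ["tt1", "tt3"],
                        "averagerating": [7.1, 5.4]})

# Inner join; validate raises an error if movie_id repeats on either side.
inner = pd.merge(ratings, basics, on="movie_id", how="inner",
                 validate="one_to_one")

# Left join with an indicator column to find movies that have no rating.
left = pd.merge(basics, ratings, on="movie_id", how="left", indicator=True)
unrated = left.loc[left["_merge"] == "left_only", "movie_id"].tolist()

print(len(inner), unrated)
```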


#### Filter Movies
- We can be more confident in a rating if it is based on a higher number of votes
- Keeping only movies whose `numvotes` is at or above the mean number of votes

In [31]:
vote_threshold = movies_with_ratings['numvotes'].mean()
focusMovies = movies_with_ratings[(movies_with_ratings['numvotes'] >= vote_threshold)]
focusMovies.sort_values('averagerating', ascending=False)

Unnamed: 0,movie_id,averagerating,numvotes,primary_title,original_title,start_year,runtime_minutes,genres
63149,tt7131622,9.7,5600,Once Upon a Time ... in Hollywood,Once Upon a Time ... in Hollywood,2019,159.0,"Comedy,Drama"
12174,tt5963218,9.5,6509,Aloko Udapadi,Aloko Udapadi,2017,113.0,"Drama,History"
4461,tt7738784,9.4,9629,Peranbu,Peranbu,2018,147.0,Drama
49629,tt2170667,9.3,17308,Wheels,Wheels,2014,115.0,Drama
10198,tt5354160,9.3,18470,Aynabaji,Aynabaji,2016,147.0,"Crime,Mystery,Thriller"
...,...,...,...,...,...,...,...,...
17877,tt4458206,1.5,26723,Code Name: K.O.Z.,Kod Adi K.O.Z.,2015,114.0,"Crime,Mystery"
9185,tt4009460,1.5,14221,Saving Christmas,Saving Christmas,2014,79.0,"Comedy,Family"
54659,tt6038600,1.4,7383,Smolensk,Smolensk,2016,120.0,"Drama,Thriller"
9965,tt4404474,1.3,6249,Potato Salad,Kartoffelsalat,2015,81.0,"Comedy,Horror"


## PAUSE, NEED TO INSERT OTHER DATACLEANING

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***