![example](images/director_shot.jpeg)

# Microsoft Movie Studios Analysis

**Authors:** Armun Shakeri
***

## Overview


This project analyzes current movie trends, budgets, gross income, and ratings to help Microsoft Studios decide which movies to produce at its new studio. The analysis indicates that producing movies in genres that are in high demand should result in a positive gross profit.




## Business Problem


Microsoft is seeking to enter the movie industry but does not know what kinds of movies to create. We need to analyze which types of movies are currently trending, the most popular genres, the highest-grossing movies of all time, the highest-budgeted movies, and basic title information. For Microsoft's new movie studio to be profitable, we need to pick a genre that is currently in demand and identify which movies had the highest gross incomes. Doing so makes it more likely that the movie will have a positive reception and be profitable.


## Data Understanding


The files imported below come from various film rating institutions and will help identify what type of movie Microsoft Studios should create next. They include information on income, movie basics (such as genres), ratings, and movie budgets. We intend to use mostly variables related to domestic gross income, since we want Microsoft's first film to be profitable within the United States.

In [6]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [7]:
# Load the four compressed CSV files, skipping any malformed rows.
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0.

income = pd.read_csv('zippedData/bom.movie_gross.csv.gz', compression='gzip', on_bad_lines='skip')
basics = pd.read_csv('zippedData/imdb.title.basics.csv.gz', compression='gzip', on_bad_lines='skip')
ratings = pd.read_csv('zippedData/imdb.title.ratings.csv.gz', compression='gzip', on_bad_lines='skip')
budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz', compression='gzip', on_bad_lines='skip')


In [None]:
income.info()
# the target variables here are title and domestic_gross

In [128]:
basics.head()
# the target variables are primary_title and genres

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [129]:
ratings.head()

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [None]:
budgets.info()
#the target variables are movie, production_budget, domestic_gross, and worldwide_gross

## Data Preparation


From the income data we drop studio, since Microsoft will be using its own studio; year, since it is not relevant to analyzing gross profit in this case; and foreign_gross, since the analysis focuses on domestic gross income.

In [8]:
# Drop the columns that are not needed for the domestic gross analysis
income.drop(['studio', 'year', 'foreign_gross'], axis=1, inplace=True)

In [9]:
income.head(10)

Unnamed: 0,title,domestic_gross
0,Toy Story 3,415000000.0
1,Alice in Wonderland (2010),334200000.0
2,Harry Potter and the Deathly Hallows Part 1,296000000.0
3,Inception,292600000.0
4,Shrek Forever After,238700000.0
5,The Twilight Saga: Eclipse,300500000.0
6,Iron Man 2,312400000.0
7,Tangled,200800000.0
8,Despicable Me,251500000.0
9,How to Train Your Dragon,217600000.0
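
Before relying on domestic_gross, it may be worth checking whether any rows are missing a value. The following is a minimal sketch of one way to do that, not the method used in this notebook. `income_demo`, `missing_counts`, and `income_clean` are hypothetical names, and the third row is a made-up title added only to illustrate a missing value.

```python
import numpy as np
import pandas as pd

# Small stand-in for the income DataFrame; the last row is invented for illustration.
income_demo = pd.DataFrame({
    "title": ["Toy Story 3", "Inception", "Hypothetical Film"],
    "domestic_gross": [415000000.0, 292600000.0, np.nan],
})

# Count missing values per column.
missing_counts = income_demo.isna().sum()
print(missing_counts)

# One possible choice (an assumption): drop rows with no domestic gross,
# since they cannot contribute to a gross income comparison.
income_clean = income_demo.dropna(subset=["domestic_gross"]).reset_index(drop=True)
print(income_clean)
```

Dropping rows is only one option; whether it is appropriate depends on how many rows are affected in the real data.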


The new film will focus on the domestic US market, so for the budgets data we drop release_date and worldwide_gross, along with the id column, which is not needed for the analysis.

In [10]:
budgets.drop(['id', 'release_date', 'worldwide_gross'], axis=1, inplace=True)

In [11]:
budgets.head(10)

Unnamed: 0,movie,production_budget,domestic_gross
0,Avatar,"$425,000,000","$760,507,625"
1,Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875"
2,Dark Phoenix,"$350,000,000","$42,762,350"
3,Avengers: Age of Ultron,"$330,600,000","$459,005,868"
4,Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382"
5,Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225"
6,Avengers: Infinity War,"$300,000,000","$678,815,482"
7,Pirates of the Caribbean: At World's End,"$300,000,000","$309,420,425"
8,Justice League,"$300,000,000","$229,024,295"
9,Spectre,"$300,000,000","$200,074,175"
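
The production_budget and domestic_gross columns shown above are strings such as "$425,000,000", so they need to be converted to numbers before any profit can be calculated. The sketch below shows one way to do this on three rows copied from the preview. `budgets_demo`, `currency_to_int`, and the `domestic_profit` column are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Three rows copied from the budgets preview; the real DataFrame has many more.
budgets_demo = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix", "Spectre"],
    "production_budget": ["$425,000,000", "$350,000,000", "$300,000,000"],
    "domestic_gross": ["$760,507,625", "$42,762,350", "$200,074,175"],
})


def currency_to_int(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from strings like '$425,000,000' and cast to int64."""
    return series.str.replace(r"[$,]", "", regex=True).astype("int64")


for col in ["production_budget", "domestic_gross"]:
    budgets_demo[col] = currency_to_int(budgets_demo[col])

# Hypothetical helper column: domestic gross minus production budget.
budgets_demo["domestic_profit"] = (
    budgets_demo["domestic_gross"] - budgets_demo["production_budget"]
)
print(budgets_demo)
```

With numeric columns in place, the same subtraction could be applied to the full budgets DataFrame to compare domestic profit across movies.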


We will need to combine basics and ratings using their common variable, 'tconst'. Doing so lets us analyze the ratings of different movies within specific genres, which in turn helps us decide which genre Microsoft Studios should focus on when creating the new movie.

In [12]:
basics.drop(['start_year', 'runtime_minutes', 'original_title'], axis=1, inplace=True)
ratings.drop(['numvotes'], axis=1, inplace=True)

In [13]:
basics.head()

Unnamed: 0,tconst,primary_title,genres
0,tt0063540,Sunghursh,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,Drama
3,tt0069204,Sabse Bada Sukh,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,"Comedy,Drama,Fantasy"


In [14]:
ratings.head()

Unnamed: 0,tconst,averagerating
0,tt10356526,8.3
1,tt10384606,8.9
2,tt1042974,6.4
3,tt1043726,4.2
4,tt1060240,6.5


In order to accurately understand the ratings of each title we will need to combine basics and ratings by tconst. 

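In [20]:
import sqlite3

# Load both DataFrames into an in-memory SQLite database so they can be joined with SQL
conn = sqlite3.connect(':memory:')
basics.to_sql('basics', conn, index=False, if_exists='replace')
ratings.to_sql('ratings', conn, index=False, if_exists='replace')

q = """
SELECT primary_title, genres, averagerating
FROM basics
JOIN ratings
    USING(tconst)
LIMIT 10
"""
pd.read_sql(q, conn)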
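
The join on tconst can also be done directly in pandas, and the comma-separated genres column can then be split so that each title counts once per genre. The following is a minimal sketch under stated assumptions, not the notebook's own method. `basics_demo`, `ratings_demo`, `titles`, and `by_genre` are hypothetical names, the basics rows are copied from the preview above, and the rating values are placeholders invented for illustration.

```python
import pandas as pd

# Rows copied from the basics preview.
basics_demo = pd.DataFrame({
    "tconst": ["tt0063540", "tt0066787", "tt0069049"],
    "primary_title": [
        "Sunghursh",
        "One Day Before the Rainy Season",
        "The Other Side of the Wind",
    ],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama"],
})

# Placeholder ratings (made up) keyed by the same tconst values.
ratings_demo = pd.DataFrame({
    "tconst": ["tt0063540", "tt0066787", "tt0069049"],
    "averagerating": [7.0, 8.0, 6.0],
})

# Inner join keeps only titles that have a rating.
titles = basics_demo.merge(ratings_demo, on="tconst", how="inner")

# Split "Action,Crime,Drama" into one row per genre, then average per genre.
by_genre = (
    titles.assign(genre=titles["genres"].str.split(","))
    .explode("genre")
    .groupby("genre")["averagerating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(by_genre)
```

Reporting the count next to the mean helps flag genres whose average rests on very few titles.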

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [52]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***