![image](microsoft_logo.jpeg)

# Microsoft Movie Studio
#### Business Recommendations 

**Authors:** Valeria Viscarra Fossati, Olgert Hasko, Sally Heinzel, Czarina Luna 

##### December 2021

***


## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Microsoft is creating a new movie studio and, to prepare, seeks to understand the film industry. We, the advisory team, explore data on movie records to help the head of the Microsoft movie studio decide which kinds of films the studio should create.

The advisory team will examine which movies are doing well at the box office, which are not, and which features contribute to these films' successes or losses. To guide our analysis, we ask the following questions:

1. What are production budgets...
2. What are genres...
3. What...

## Data Understanding

The data available to us included datasets from The Numbers and IMDb (Internet Movie Database), both online databases of information about movies. The dataset from The Numbers included variables necessary for our analysis, such as production budget and gross earnings, and the datasets from IMDb included other variables that are important to our analysis, such as genres. However, merging The Numbers dataset with the IMDb datasets resulted in a significant loss of movies that did not match values in the IMDb dataset.
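
One way to quantify how many movies fail to match during a merge is pandas' `indicator` option. The sketch below uses small hypothetical frames (`tn`, `imdb`, and their contents are illustrative assumptions, not the project's data) and assumes the join key is the movie title.

```python
import pandas as pd

# Hypothetical toy frames standing in for The Numbers and IMDb data.
tn = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "production_budget": [100, 200, 300],
})
imdb = pd.DataFrame({
    "movie": ["Film A", "Film C", "Film D"],
    "genres": ["Drama", "Action", "Comedy"],
})

# A left join keeps every row of tn; indicator=True adds a "_merge" column
# recording whether each row found a match in imdb.
merged = tn.merge(imdb, on="movie", how="left", indicator=True)

# Rows marked "left_only" are the ones an inner join would drop.
lost = merged[merged["_merge"] == "left_only"]
print(f"{len(lost)} of {len(tn)} movies have no match")
```

Counting the `left_only` rows before committing to an inner join makes the size of the loss explicit.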

***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

We imported the Python libraries required for the data analysis. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Read the datasets

df1 = pd.read_csv("data/tngross.csv")
df2 = pd.read_csv("data/tnproduction.csv")

In [3]:
df1.head()

Unnamed: 0.1,Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,0,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564"
1,1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802"
2,2,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,395,316,979"
3,3,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,064,615,817"
4,4,"Apr 25, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,044,540,523"


In [4]:
df1.tail()

Unnamed: 0.1,Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross
6095,6095,"Mar 17, 2015",Closure,"$100,000",$0,$0
6096,6096,"Aug 29, 2015",Lunch Time Heroes,"$100,000",$0,$0
6097,6097,"Mar 25, 2015",Open Secret,"$100,000",$0,$0
6098,6098,"Nov 10, 2015",The Night Visitor,"$100,000",$0,$0
6099,6099,"Jul 7, 2015",Tiger Orange,"$100,000",$0,$0


In [5]:
df2.head()

Unnamed: 0.1,Unnamed: 0,runtime_minutes,genres,production_company,production_country
0,0,181 minutes,Action,Marvel Studios,United States
1,1,136 minutes,Adventure,Walt Disney Pictures,United States
2,2,141 minutes,Action,Marvel Studios,United States
3,3,136 minutes,Adventure,"Lucasfilm, Bad Robot",United States
4,4,156 minutes,Action,Marvel Studios,United States


In [6]:
df2.tail()

Unnamed: 0.1,Unnamed: 0,runtime_minutes,genres,production_company,production_country
6095,6095,90 minutes,Drama,,United States
6096,6096,88 minutes,Adventure,Phebean Films,Nigeria
6097,6097,,Documentary,,United States
6098,6098,,Horror,,United States
6099,6099,81 minutes,Drama,,United States


## Data Preparation
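First, we merged the datasets on the column "Unnamed: 0" to match the movies from the first dataset to the additional information in the second dataset. Then we dropped that column and used the merged dataframe's index.

We checked for duplicates: several titles appear more than once, but no two rows share both the same title and the same release date, so no rows were dropped as duplicates.

To clean the columns, we converted the release date to a datetime. For the variables with dollar amounts, we removed the dollar sign and commas from the string and converted the string into an integer.

We removed rows with a missing worldwide gross (recorded as 0) because ...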

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [7]:
df = df1.merge(df2, on="Unnamed: 0")

In [8]:
df = df.drop("Unnamed: 0", axis=1)
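
The merge above assumes that each value of the key column appears exactly once in both datasets. A minimal sketch of how that assumption could be checked, using hypothetical toy frames (`left`, `right`, and the `row_id` column are illustrative names, not the project's data):

```python
import pandas as pd

# Hypothetical toy frames sharing a row identifier.
left = pd.DataFrame({"row_id": [0, 1, 2], "movie": ["Film A", "Film B", "Film C"]})
right = pd.DataFrame({"row_id": [0, 1, 2], "genres": ["Action", "Drama", "Horror"]})

# validate="one_to_one" asks pandas to raise an error if a key repeats on either side.
checked = left.merge(right, on="row_id", validate="one_to_one")

# Append a copy of the first row so that key 0 appears twice on the right side.
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(right_dup, on="row_id", validate="one_to_one")
    duplicate_detected = False
except ValueError:  # pandas raises MergeError, a ValueError subclass
    duplicate_detected = True

print(len(checked), duplicate_detected)
```

With this option a repeated key fails loudly instead of silently multiplying rows.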

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6100 entries, 0 to 6099
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   release_date        6100 non-null   object
 1   movie               6100 non-null   object
 2   production_budget   6100 non-null   object
 3   domestic_gross      6100 non-null   object
 4   worldwide_gross     6100 non-null   object
 5   runtime_minutes     6100 non-null   object
 6   genres              6100 non-null   object
 7   production_company  6100 non-null   object
 8   production_country  6100 non-null   object
dtypes: object(9)
memory usage: 476.6+ KB


##### Check for Duplicates

In [10]:
duplicate_df = df[df['movie'].duplicated(keep=False)]
duplicate_df

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country
13,"Jul 11, 2019",The Lion King,"$260,000,000","$543,638,043","$1,654,367,425",118 minutes,Adventure,"Walt Disney Pictures, Fairview Entertainment",United States
27,"Apr 25, 2012",The Avengers,"$225,000,000","$623,357,910","$1,515,100,211",143 minutes,Action,"Marvel Studios, Paramount Pictures",United States
40,"May 14, 2010",Robin Hood,"$210,000,000","$105,487,148","$322,459,006",139 minutes,Action,"United Artists, Fairbanks","United Kingdom, United States"
41,"Dec 14, 2005",King Kong,"$207,000,000","$218,080,025","$550,517,357",189 minutes,Adventure,Wingnut Films,"New Zealand, United States"
54,"Mar 4, 2010",Alice in Wonderland,"$200,000,000","$334,191,110","$1,025,491,110",108 minutes,Adventure,"Walt Disney Pictures, Roth Films, Zancuk Company",United States
...,...,...,...,...,...,...,...,...,...
6056,"Mar 1, 2002",The Calling,"$160,000","$32,092","$32,092",,Drama,,
6066,"Oct 18, 1974",The Texas Chainsaw Massacre,"$140,000","$26,572,439","$26,572,439",83 minutes,Horror,,United States
6069,"Nov 16, 1942",Cat People,"$134,000","$4,000,000","$8,000,000",73 minutes,Drama,,United States
6077,"Oct 1, 1968",Night of the Living Dead,"$114,000","$12,087,064","$30,087,064",96 minutes,Horror,,United States


In [11]:
duplicate_df[duplicate_df.duplicated(subset=['movie','release_date'], keep=False)]

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country
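
The two checks above differ in what they treat as a duplicate: the first flags any repeated title, the second only rows that repeat both title and release date. A small sketch on hypothetical data (the `films` frame is an illustrative assumption) shows the difference:

```python
import pandas as pd

# Hypothetical data: "Film A" was released twice, for example as a remake.
films = pd.DataFrame({
    "movie": ["Film A", "Film A", "Film B"],
    "release_date": ["Jan 1, 1950", "Jan 1, 2010", "Jun 5, 2015"],
})

# keep=False marks every member of a duplicated group, not just the later ones.
same_title = films[films["movie"].duplicated(keep=False)]

# Requiring the release date to match as well leaves no duplicates here.
same_title_and_date = films[films.duplicated(subset=["movie", "release_date"], keep=False)]

print(len(same_title), len(same_title_and_date))
```

An empty second result, as in the output above, suggests that repeated titles are distinct films rather than repeated records.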


##### Clean up the columns

In [12]:
df.head(2)

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country
0,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564",181 minutes,Action,Marvel Studios,United States
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,071,802","$1,045,713,802",136 minutes,Adventure,Walt Disney Pictures,United States


In [13]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce', format='%b %d, %Y')
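
Because `errors='coerce'` silently turns unparseable dates into `NaT`, it can be worth counting how many values were coerced. A minimal sketch on a hypothetical series (the `raw` values, including the unparseable "Unknown", are illustrative):

```python
import pandas as pd

# Hypothetical raw dates; the last value does not follow the "%b %d, %Y" pattern.
raw = pd.Series(["Apr 23, 2019", "May 20, 2011", "Unknown"])

# Values that do not match the format become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce", format="%b %d, %Y")

# Count how many entries could not be parsed.
n_unparsed = int(parsed.isna().sum())
print(n_unparsed)
```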

In [14]:
def dollar_to_int(column):
    # Strip the literal "$" and "," characters, then cast to a 64-bit integer
    return column.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('int64')

money_columns = ['production_budget', 'domestic_gross', 'worldwide_gross']
df[money_columns] = df[money_columns].apply(dollar_to_int)
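
As an alternative to the conversion above, both characters can be removed with one regular expression, with `pd.to_numeric` handling the cast. This is only a sketch on a hypothetical series (`amounts` is illustrative), not the method used in this notebook:

```python
import pandas as pd

# Hypothetical dollar strings in the same format as the dataset.
amounts = pd.Series(["$400,000,000", "$100,000", "$0"])

# The character class [$,] matches either a dollar sign or a comma.
stripped = amounts.str.replace(r"[$,]", "", regex=True)

# to_numeric infers an integer dtype when every value is a whole number.
cleaned = pd.to_numeric(stripped)
print(cleaned.tolist())
```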

In [15]:
df

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country
0,2019-04-23,Avengers: Endgame,400000000,858373000,2797800564,181 minutes,Action,Marvel Studios,United States
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,379000000,241071802,1045713802,136 minutes,Adventure,Walt Disney Pictures,United States
2,2015-04-22,Avengers: Age of Ultron,365000000,459005868,1395316979,141 minutes,Action,Marvel Studios,United States
3,2015-12-16,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2064615817,136 minutes,Adventure,"Lucasfilm, Bad Robot",United States
4,2018-04-25,Avengers: Infinity War,300000000,678815482,2044540523,156 minutes,Action,Marvel Studios,United States
...,...,...,...,...,...,...,...,...,...
6095,2015-03-17,Closure,100000,0,0,90 minutes,Drama,,United States
6096,2015-08-29,Lunch Time Heroes,100000,0,0,88 minutes,Adventure,Phebean Films,Nigeria
6097,2015-03-25,Open Secret,100000,0,0,,Documentary,,United States
6098,2015-11-10,The Night Visitor,100000,0,0,,Horror,,United States


##### Remove missing values for worldwide_gross

In [16]:
df = df[df['worldwide_gross'] != 0].copy()

##### Create new columns

In [17]:
df['earnings'] = df["worldwide_gross"] - df["production_budget"]
df['earnings_ratio'] = df["worldwide_gross"] / df["production_budget"]
df.head()

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country,earnings,earnings_ratio
0,2019-04-23,Avengers: Endgame,400000000,858373000,2797800564,181 minutes,Action,Marvel Studios,United States,2397800564,6.994501
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,379000000,241071802,1045713802,136 minutes,Adventure,Walt Disney Pictures,United States,666713802,2.759139
2,2015-04-22,Avengers: Age of Ultron,365000000,459005868,1395316979,141 minutes,Action,Marvel Studios,United States,1030316979,3.822786
3,2015-12-16,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2064615817,136 minutes,Adventure,"Lucasfilm, Bad Robot",United States,1758615817,6.747111
4,2018-04-25,Avengers: Infinity War,300000000,678815482,2044540523,156 minutes,Action,Marvel Studios,United States,1744540523,6.815135
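
To make the two new columns concrete, the arithmetic can be worked by hand for two movies that appear in this notebook's tables, Avengers: Endgame and Cheap Thrills. The variable names below are illustrative.

```python
# Figures taken from the tables in this notebook.
endgame_budget, endgame_gross = 400_000_000, 2_797_800_564
thrills_budget, thrills_gross = 100_000, 59_424

# earnings = worldwide gross minus production budget
endgame_earnings = endgame_gross - endgame_budget
thrills_earnings = thrills_gross - thrills_budget

# earnings_ratio = worldwide gross divided by production budget
endgame_ratio = endgame_gross / endgame_budget
thrills_ratio = thrills_gross / thrills_budget

print(endgame_earnings, round(endgame_ratio, 6))
print(thrills_earnings, round(thrills_ratio, 6))
```

A ratio below 1, as for Cheap Thrills, means the worldwide gross did not cover the production budget, which corresponds to negative earnings.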


#### Create new tables for analysis of each variable

- budget vs earnings:  movie_cash_flow
- genres: genres_df
- dates: dates_df
- runtime: runtime_df

In [18]:
movie_cash_flow = df[["movie", "production_budget", "worldwide_gross", "domestic_gross", "earnings", "earnings_ratio"]]

In [19]:
# Rows with zero worldwide gross were already removed above; this filter is a safeguard
movie_cash_flow = movie_cash_flow[movie_cash_flow["worldwide_gross"] != 0]
movie_cash_flow

Unnamed: 0,movie,production_budget,worldwide_gross,domestic_gross,earnings,earnings_ratio
0,Avengers: Endgame,400000000,2797800564,858373000,2397800564,6.994501
1,Pirates of the Caribbean: On Stranger Tides,379000000,1045713802,241071802,666713802,2.759139
2,Avengers: Age of Ultron,365000000,1395316979,459005868,1030316979,3.822786
3,Star Wars Ep. VII: The Force Awakens,306000000,2064615817,936662225,1758615817,6.747111
4,Avengers: Infinity War,300000000,2044540523,678815482,1744540523,6.815135
...,...,...,...,...,...,...
6087,Penitentiary,100000,287000,287000,187000,2.870000
6088,The Lost Skeleton of Cadavra,100000,110536,110536,10536,1.105360
6089,Cheap Thrills,100000,59424,59424,-40576,0.594240
6090,The Past is a Grotesque Animal,100000,20056,20056,-79944,0.200560


In [20]:
genres_df = df[df.genres != 'None']
genres_df

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country,earnings,earnings_ratio
0,2019-04-23,Avengers: Endgame,400000000,858373000,2797800564,181 minutes,Action,Marvel Studios,United States,2397800564,6.994501
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,379000000,241071802,1045713802,136 minutes,Adventure,Walt Disney Pictures,United States,666713802,2.759139
2,2015-04-22,Avengers: Age of Ultron,365000000,459005868,1395316979,141 minutes,Action,Marvel Studios,United States,1030316979,3.822786
3,2015-12-16,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2064615817,136 minutes,Adventure,"Lucasfilm, Bad Robot",United States,1758615817,6.747111
4,2018-04-25,Avengers: Infinity War,300000000,678815482,2044540523,156 minutes,Action,Marvel Studios,United States,1744540523,6.815135
...,...,...,...,...,...,...,...,...,...,...,...
6086,2017-07-07,A Ghost Story,100000,1594798,2769782,93 minutes,Drama,"Sailor Bear, Zero Trans Fat Productions, Ideam...",United States,2669782,27.697820
6087,1980-05-10,Penitentiary,100000,287000,287000,,Drama,,United States,187000,2.870000
6088,2004-02-06,The Lost Skeleton of Cadavra,100000,110536,110536,90 minutes,Comedy,,United States,10536,1.105360
6089,2014-03-21,Cheap Thrills,100000,59424,59424,85 minutes,Thriller/Suspense,Snowfort Pictures,United States,-40576,0.594240


In [22]:
dates_df = df[~df['release_date'].isna()]
dates_df

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country,earnings,earnings_ratio
0,2019-04-23,Avengers: Endgame,400000000,858373000,2797800564,181 minutes,Action,Marvel Studios,United States,2397800564,6.994501
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,379000000,241071802,1045713802,136 minutes,Adventure,Walt Disney Pictures,United States,666713802,2.759139
2,2015-04-22,Avengers: Age of Ultron,365000000,459005868,1395316979,141 minutes,Action,Marvel Studios,United States,1030316979,3.822786
3,2015-12-16,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2064615817,136 minutes,Adventure,"Lucasfilm, Bad Robot",United States,1758615817,6.747111
4,2018-04-25,Avengers: Infinity War,300000000,678815482,2044540523,156 minutes,Action,Marvel Studios,United States,1744540523,6.815135
...,...,...,...,...,...,...,...,...,...,...,...
6087,1980-05-10,Penitentiary,100000,287000,287000,,Drama,,United States,187000,2.870000
6088,2004-02-06,The Lost Skeleton of Cadavra,100000,110536,110536,90 minutes,Comedy,,United States,10536,1.105360
6089,2014-03-21,Cheap Thrills,100000,59424,59424,85 minutes,Thriller/Suspense,Snowfort Pictures,United States,-40576,0.594240
6090,2014-06-19,The Past is a Grotesque Animal,100000,20056,20056,90 minutes,Documentary,Oscilloscope Pictures,United States,-79944,0.200560


In [24]:
runtime_df = df[(df['runtime_minutes'] != 'None')]
runtime_df

Unnamed: 0,release_date,movie,production_budget,domestic_gross,worldwide_gross,runtime_minutes,genres,production_company,production_country,earnings,earnings_ratio
0,2019-04-23,Avengers: Endgame,400000000,858373000,2797800564,181 minutes,Action,Marvel Studios,United States,2397800564,6.994501
1,2011-05-20,Pirates of the Caribbean: On Stranger Tides,379000000,241071802,1045713802,136 minutes,Adventure,Walt Disney Pictures,United States,666713802,2.759139
2,2015-04-22,Avengers: Age of Ultron,365000000,459005868,1395316979,141 minutes,Action,Marvel Studios,United States,1030316979,3.822786
3,2015-12-16,Star Wars Ep. VII: The Force Awakens,306000000,936662225,2064615817,136 minutes,Adventure,"Lucasfilm, Bad Robot",United States,1758615817,6.747111
4,2018-04-25,Avengers: Infinity War,300000000,678815482,2044540523,156 minutes,Action,Marvel Studios,United States,1744540523,6.815135
...,...,...,...,...,...,...,...,...,...,...,...
6084,1920-09-17,Over the Hill to the Poorhouse,100000,3000000,3000000,110 minutes,,,United States,2900000,30.000000
6086,2017-07-07,A Ghost Story,100000,1594798,2769782,93 minutes,Drama,"Sailor Bear, Zero Trans Fat Productions, Ideam...",United States,2669782,27.697820
6088,2004-02-06,The Lost Skeleton of Cadavra,100000,110536,110536,90 minutes,Comedy,,United States,10536,1.105360
6089,2014-03-21,Cheap Thrills,100000,59424,59424,85 minutes,Thriller/Suspense,Snowfort Pictures,United States,-40576,0.594240
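
The `runtime_minutes` column still holds strings such as "181 minutes", so a numeric analysis of runtime would likely need a conversion step first. One possible sketch, on a hypothetical series and assuming missing runtimes are stored as the string 'None' as the filter above suggests:

```python
import pandas as pd

# Hypothetical runtime strings in the same format as the dataset.
runtime = pd.Series(["181 minutes", "90 minutes", "None"])

# Extract the leading digits; strings without digits yield NaN.
digits = runtime.str.extract(r"(\d+)", expand=False)

# Convert the extracted text to numbers, keeping NaN for missing runtimes.
minutes = pd.to_numeric(digits, errors="coerce")
print(minutes.tolist())
```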


## Data Analysis

We start with the analysis of earnings versus budget, followed by:

- Analysis of genres
- Analysis of release date
- Analysis of runtime

Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***