![example](images/director_shot.jpeg)

# Microsoft Movie Analysis

**Author:** Volha Puzikava
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [2]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [4]:
# Load the Box Office Mojo gross data and preview the first rows
df1 = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
df1.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [5]:
# The Rotten Tomatoes movie info file is tab-separated
df2 = pd.read_csv('zippedData/rt.movie_info.tsv.gz', delimiter='\t')
df2.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


In [6]:
# Load the TMDB movie data (titles, popularity, vote averages)
df3 = pd.read_csv('zippedData/tmdb.movies.csv.gz')
df3.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [7]:
# Load The Numbers budget data (production budget and gross figures)
df4 = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
df4.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [10]:
import sqlite3

# Connect to the IMDb SQLite database and list the tables it contains
conn = sqlite3.connect('zippedData/im.db')
cur = conn.cursor()
cur.execute("""SELECT name FROM sqlite_master WHERE type = 'table';""")
table_names = cur.fetchall()
table_names

[('movie_basics',),
 ('directors',),
 ('known_for',),
 ('movie_akas',),
 ('movie_ratings',),
 ('persons',),
 ('principals',),
 ('writers',)]
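
Knowing the table names is only the first step; the column names of each table are needed before any query can be written. The sketch below is one possible way to list them with SQLite's `PRAGMA table_info`. The helper name `table_columns` is hypothetical, and the in-memory database with its two toy tables and column names is an assumption standing in for `zippedData/im.db`, whose real columns are not shown here.

```python
import sqlite3


def table_columns(conn):
    """Map each table name in the database to its list of column names."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table';")
    tables = [row[0] for row in cur.fetchall()]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; position 1 holds the name
        cur.execute(f'PRAGMA table_info("{table}");')
        schema[table] = [row[1] for row in cur.fetchall()]
    return schema


# Toy in-memory database (hypothetical columns), used so the sketch runs anywhere
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

schema = table_columns(demo)
for table, columns in schema.items():
    print(table, columns)
```

Passing the notebook's own `conn` instead of `demo` should give the same kind of mapping for the eight tables listed above.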

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [9]:
# Inspect column types and non-null counts in the budget data
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB
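
The summary above reports the three money columns as `object`, because values such as `$425,000,000` are stored as text. Before any arithmetic they would need to become numbers. The following is a minimal sketch of one way to do that, assuming every value follows the `$1,234` pattern seen in the preview; the helper name `currency_to_int` and the two-row sample frame are illustrative only.

```python
import pandas as pd


def currency_to_int(series):
    """Convert strings such as '$425,000,000' to 64-bit integers."""
    # Remove dollar signs and thousands separators, then cast
    return series.str.replace(r"[$,]", "", regex=True).astype("int64")


# Two rows copied from the preview of the budget data
sample = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$350,000,000"],
    "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
})

for col in ["production_budget", "worldwide_gross"]:
    sample[col] = currency_to_int(sample[col])

print(sample.dtypes)
```

If some rows deviate from the assumed pattern, the cast raises an error rather than silently producing wrong numbers, which is usually the safer failure mode during cleaning.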


In [10]:
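# Keep the columns of interest; .copy() makes the subset independent of df4
df4_mod = df4.loc[:, ['release_date', 'movie', 'domestic_gross', 'worldwide_gross']].copy()
# The year is the last four characters of dates such as 'Dec 18, 2009';
# slicing with x[7:] leaves a leading space when the day has two digits
df4_mod['release_date'] = df4_mod['release_date'].str[-4:]
df4_sort = df4_mod.sort_values('release_date', ascending=False)
df4_sort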

Unnamed: 0,release_date,movie,domestic_gross,worldwide_gross
3633,2019,The Best of Enemies,"$10,205,616","$10,205,616"
3915,2019,El Chicano,"$700,261","$700,261"
580,2019,The Secret Life of Pets 2,"$63,795,655","$113,351,496"
496,2019,Shazam!,"$139,606,856","$362,899,733"
95,2019,Captain Marvel,"$426,525,952","$1,123,061,550"
...,...,...,...,...
4984,1927,Wings,$0,$0
5606,1925,The Big Parade,"$11,000,000","$22,000,000"
4569,1925,Ben-Hur: A Tale of the Christ,"$9,000,000","$9,000,000"
5683,1920,Over the Hill to the Poorhouse,"$3,000,000","$3,000,000"


In [11]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [12]:
df1_mod = df1.loc[:, ['year', 'title', 'domestic_gross', 'foreign_gross']]
df1_sort = df1_mod.sort_values('year', ascending = False)
df1_sort

Unnamed: 0,year,title,domestic_gross,foreign_gross
3386,2018,An Actor Prepares,1700.0,
3183,2018,On the Basis of Sex,24600000.0,13600000
3176,2018,Tyler Perry's Acrimony,43500000.0,2900000
3177,2018,Mary Queen of Scots,16500000.0,29900000
3178,2018,The Possession of Hannah Grace,14800000.0,28200000
...,...,...,...,...
220,2010,After.Life,109000.0,1900000
221,2010,Cairo Time,1600000.0,391000
222,2010,Flipped,1800000.0,
223,2010,Guzaarish,1000000.0,695000


In [13]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [19]:
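# Keep the columns of interest; .copy() makes the subset independent of df3
df3_mod = df3.loc[:, ['release_date', 'title', 'vote_average']].copy()
# Dates are formatted as 'YYYY-MM-DD', so the first four characters are the year
df3_mod['release_date'] = df3_mod['release_date'].str[:4]
df3_sort = df3_mod.sort_values('release_date', ascending=False)
df3_sort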

Unnamed: 0,release_date,title,vote_average
26057,2020,Murdery Christmas,0.0
24384,2019,Piercing,5.9
25429,2019,Bilby,5.0
24933,2019,Late Afternoon,7.7
24764,2019,Holiday,5.5
...,...,...,...
11192,1946,The Best Years of Our Lives,7.8
26345,1939,How Walt Disney Cartoons Are Made,7.3
3580,1936,Le Bonheur,8.7
21758,1933,The Vampire Bat,5.6


In [15]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
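
The non-null counts above show that several columns are sparse: `box_office` has 340 values out of 1560 rows, so roughly 78% of it is missing. A quick way to rank columns by how incomplete they are is sketched below. The helper name `missing_share` and the small sample frame are hypothetical and only illustrate the calculation.

```python
import pandas as pd


def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Small made-up frame with the same kind of gaps as the Rotten Tomatoes data
sample = pd.DataFrame({
    "genre": ["Drama", "Comedy", None, "Drama"],
    "director": ["A", None, None, "B"],
    "box_office": [None, None, None, "600,000"],
})

share = missing_share(sample)
print(share)
```

Calling `missing_share(df2)` in the notebook should give the same ranking for the real columns, which can help decide which variables are complete enough to keep.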


In [16]:
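df2_mod = df2.loc[:, ['theater_date', 'genre', 'director']]
# Note: theater_date is still text here, so this sort is alphabetical
# (e.g. 'Sep 9, 1992' comes first), not chronological
df2_sort = df2_mod.sort_values('theater_date', ascending=False)
df2_sort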

Unnamed: 0,theater_date,genre,director
716,"Sep 9, 1992",Action and Adventure|Comedy|Drama|Mystery and ...,Phil Alden Robinson
1374,"Sep 9, 1988",Comedy,Paul Mazursky
715,"Sep 9, 1986",Drama,Fielder Cook
1356,"Sep 8, 2000",Comedy|Drama,Neil LaBute
1434,"Sep 8, 1996",Drama,Michael Rhodes
...,...,...,...
1543,,,
1547,,Comedy,Phil Alden Robinson
1548,,Comedy,Les Rose
1549,,Art House and International|Drama,
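
In the output above, `Sep 9, 1992` appears before `Sep 8, 2000` because `theater_date` is compared as text. One possible fix, sketched here under the assumption that all dates follow the `Mon D, YYYY` pattern seen in the preview, is to parse the column with `pd.to_datetime` before sorting. The sample frame is illustrative only.

```python
import pandas as pd

# A few dates in the same format as the Rotten Tomatoes theater_date column
sample = pd.DataFrame({
    "theater_date": ["Sep 9, 1992", "Aug 17, 2012", None, "Oct 9, 1971"],
    "genre": ["Comedy", "Drama", "Romance", "Drama"],
})

# Parse the text into datetimes; unparseable or missing values become NaT
sample["theater_date"] = pd.to_datetime(
    sample["theater_date"], format="%b %d, %Y", errors="coerce"
)

# Newest release first, with missing dates kept at the bottom
chronological = sample.sort_values(
    "theater_date", ascending=False, na_position="last"
)
print(chronological)
```

Once the column is a datetime, the release year is also available directly through `.dt.year`, without string slicing.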


## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***