# Objective

Help Microsoft choose what type of movies it should produce.

To decide which movies are best for Microsoft to start with, we should consider:

**movie genres**

**production budget**

**revenue(gross)**

**votes/ratings**

## Methodology

Some areas you can examine are movie genres (Thriller, Drama, Comedy, etc.), movie ratings, budget, social media discussion, and critic or user reviews. Your team gets to define its _own questions_ about the movie industry and then use its knowledge of descriptive statistics and the EDA process to try to answer those questions. <br>
Questions to consider:
- How are you defining _success_?
 - Return on investment?
 - Revenue?
 - Guaranteed box-office hit?
 - Social media buzz?

In [1]:
# Import all libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re

In [6]:
# Creating data-frames out of the existing files and assigning them names
movie_gross_df = pd.read_csv("Data/bom.movie_gross.csv.gz")
name_basics_df = pd.read_csv("Data/imdb.name.basics.csv.gz")
akas_df = pd.read_csv("Data/imdb.title.akas.csv.gz")
basics_df = pd.read_csv("Data/imdb.title.basics.csv.gz")
crew_df = pd.read_csv("Data/imdb.title.crew.csv.gz")
principals_df = pd.read_csv("Data/imdb.title.principals.csv.gz")
ratings_df = pd.read_csv("Data/imdb.title.ratings.csv.gz")
movies_df = pd.read_csv("Data/tmdb.movies.csv.gz")
movie_budget_df = pd.read_csv("Data/tn.movie_budgets.csv.gz")
rt_info_df = pd.read_csv("Data/rt.movie_info.tsv.gz", sep='\t')
rt_reviews_df = pd.read_csv("Data/rt.reviews.tsv.gz", sep='\t', encoding="unicode_escape")

In [7]:
movie_gross_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


In [11]:
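def remove_format(dataframe):
    # Strip "$", "," and ")" from every string cell, in place.
    # DataFrame.replace already covers all columns, so no loop is needed;
    # note that text columns such as release_date lose their commas too.
    dataframe.replace(r'[\$,)]', '', regex=True, inplace=True)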

In [14]:
remove_format(movie_budget_df)

In [26]:
movie_budget_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null float64
domestic_gross       5782 non-null float64
worldwide_gross      5782 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 271.2+ KB


In [25]:
movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
title             3387 non-null object
studio            3382 non-null object
domestic_gross    3359 non-null float64
foreign_gross     2037 non-null object
year              3387 non-null int64
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [24]:
# Convert the cleaned currency strings to floats so they can be used in arithmetic
movie_budget_df['production_budget'] = movie_budget_df['production_budget'].astype(float)
movie_budget_df['domestic_gross'] = movie_budget_df['domestic_gross'].astype(float)
movie_budget_df['worldwide_gross'] = movie_budget_df['worldwide_gross'].astype(float)

In [None]:
# foreign_gross has null values (only 2037 of 3387 rows are non-null) and is stored as object, so it needs cleaning
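
One possible way to clean `foreign_gross` is sketched below on a small made-up frame (the sample values are illustrative, not taken from the dataset). It assumes the column holds strings that may contain thousands separators, strips them, and converts with `pd.to_numeric`, leaving missing entries as `NaN` rather than guessing a fill value.

```python
import pandas as pd

# Hypothetical sample mimicking movie_gross_df; values are illustrative only
sample_gross_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "domestic_gross": [1000.0, 2000.0, 3000.0],
    "foreign_gross": ["652000000", "1,131.6", None],
})

# Remove thousands separators, then convert; unparseable values become NaN
cleaned = sample_gross_df["foreign_gross"].str.replace(",", "", regex=False)
sample_gross_df["foreign_gross"] = pd.to_numeric(cleaned, errors="coerce")

# Count how many rows still lack a foreign gross figure
missing_count = sample_gross_df["foreign_gross"].isna().sum()
print(sample_gross_df["foreign_gross"].dtype, missing_count)
```

Whether the remaining `NaN` rows should be dropped or filled depends on how the revenue question is framed, so that choice is left open here.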

# Data Description

**movie_budget_df** 
- release date 
- movie name
- production budget
- domestic gross
- worldwide gross

**movie_gross_df**
- movie name
- studio name
- domestic gross
- foreign gross
- year of movie

**movies_df**
- genre ids
- the id
- original language filmed in
- movie name
- popularity
- release date
- vote
- vote count

**name_basics_df**

**akas_df**

**basics_df**
- primary and original title
- start year
- runtime in mins
- genre

**crew_df**
- directors
- writers

**principals_df**

**ratings_df**
- average rating
- num. of votes

**rt_reviews_df**
- review
- rating
- fresh/rotten
- critic/top critic
- publisher
- date

**rt_info_df**
- synopsis
- movie rating
- genre
- director
- writer
- release date
- dvd date
- currency
- box office
- runtime
- studio

# Data that are similar

movie_budget_df & movie_gross_df

movies_df & ratings_df

# Data To Use
**rt_info_df**

**movie_gross_df**

**movie_budget_df**

# What To Show
Name of Movie

**Genre**

**Movie Rating**

Release Date - Not a priority

**Production Budget/Gross Revenue/ROI**

Ratings

In [79]:
# data_on_movies = pd.concat([movie_gross_df, name_basics_df, crew_df, principals_df, ratings_df, movies_df, movie_budget_df], sort=True)

In [80]:
# Place movie_budget and movie_gross side by side
# (axis=1 aligns rows by index position, not by movie title)
budget_df = pd.concat([movie_budget_df, movie_gross_df], axis=1, join="inner")
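
A key-based alternative to the positional join above is sketched below. It assumes the `movie` column of the budget table and the `title` column of the gross table hold comparable names; the sample frames and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for movie_budget_df and movie_gross_df
sample_budget_df = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "production_budget": [100.0, 50.0, 20.0],
    "worldwide_gross": [300.0, 40.0, 80.0],
})
sample_gross_df = pd.DataFrame({
    "title": ["Film C", "Film A", "Film D"],
    "studio": ["S1", "S2", "S3"],
    "year": [2012, 2010, 2015],
})

# Inner merge keeps only films whose names appear in both tables
merged_df = sample_budget_df.merge(
    sample_gross_df, left_on="movie", right_on="title", how="inner"
)
print(merged_df[["movie", "studio", "production_budget"]])
```

Exact string matching will miss titles that are formatted differently in the two sources, for example `Alice in Wonderland (2010)` in the gross table, so some title normalization may be needed first.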

In [81]:
# New column with ROI using worldwide gross revenue and production budget
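
A minimal sketch of the ROI column follows. It assumes ROI is defined as (worldwide gross minus production budget) divided by production budget; the notebook does not state a formula, so this definition is an assumption. The three sample rows are copied from the `movie_budget_df.head()` output.

```python
import pandas as pd

# Rows copied from the movie_budget_df.head() output
sample_budget_df = pd.DataFrame({
    "movie": [
        "Avatar",
        "Pirates of the Caribbean: On Stranger Tides",
        "Dark Phoenix",
    ],
    "production_budget": [425000000.0, 410600000.0, 350000000.0],
    "worldwide_gross": [2776345279.0, 1045663875.0, 149762350.0],
})

# Assumed definition: ROI = (worldwide gross - budget) / budget
sample_budget_df["roi"] = (
    sample_budget_df["worldwide_gross"] - sample_budget_df["production_budget"]
) / sample_budget_df["production_budget"]

print(sample_budget_df[["movie", "roi"]].round(2))
```

Under this definition a negative ROI means the film grossed less worldwide than it cost to produce, as with Dark Phoenix in the sample.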

In [4]:
# Strip the "$" and "," formatting so currency strings become plain digit strings
def remove_format(dataframe):
    # DataFrame.replace covers every column at once, so no loop is needed
    dataframe.replace(r"[\$,)]", "", regex=True, inplace=True)
        
# -----------------------Second Way-------------------------
        
# # Un-format string "$" and "," to a plain string of numbers
# movie_budget_df['production_budget'] = movie_budget_df['production_budget'].replace(r'[\$,)]', "", regex=True)
# movie_budget_df['worldwide_gross'] = movie_budget_df['worldwide_gross'].replace(r'[\$,)]', "", regex=True)
# movie_budget_df['domestic_gross'] = movie_budget_df['domestic_gross'].replace(r'[\$,)]', "", regex=True)

In [5]:
remove_format(movie_budget_df)

In [6]:
movie_budget_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,Dec 18 2009,Avatar,425000000,760507625,2776345279
1,2,May 20 2011,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
2,3,Jun 7 2019,Dark Phoenix,350000000,42762350,149762350
3,4,May 1 2015,Avengers: Age of Ultron,330600000,459005868,1403013963
4,5,Dec 15 2017,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


In [7]:
rt_info_df.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,
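
Because the `genre` column packs several genres into one `|`-separated string, comparing genres usually needs one row per movie-genre pair. A sketch of one way to do that is below, using rows copied from the output above; the `genre_list` column name is introduced here for illustration only.

```python
import pandas as pd

# Rows copied from the rt_info_df.head() output
sample_info_df = pd.DataFrame({
    "id": [1, 3, 7],
    "genre": [
        "Action and Adventure|Classics|Drama",
        "Drama|Science Fiction and Fantasy",
        "Drama|Romance",
    ],
})

# Split on "|" into lists, then give each genre its own row
sample_info_df["genre_list"] = sample_info_df["genre"].str.split("|")
genres_long_df = sample_info_df.explode("genre_list")

# Count how many movies carry each genre label
genre_counts = genres_long_df["genre_list"].value_counts()
print(genre_counts)
```

A movie with several genres is counted once per genre in this layout, which is worth keeping in mind when averaging budget or revenue by genre.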


In [8]:
rt_reviews_df.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
