# Objective

Help Microsoft choose what types of movies it should produce.

To help choose which movies are best for Microsoft to start with, we should consider using:

**movie genres**

**production budget**

**revenue (gross)**

**votes/ratings**

## Methodology

Some areas you can look to examine are movie genres (Thriller, Drama, Comedy, etc.), movie ratings, budget, social media discussion, and critic or user reviews. Your team gets to define its _own questions_ about the movie industry and then use its knowledge of descriptive statistics and the EDA process to try and answer those questions. <br>
Questions to consider:
- How are you defining _success_ ?
 - Return on investment?
 - Revenue?
 - Guaranteed box-office hit?
 - Social media buzz?

In [211]:
# Import all libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re

In [212]:
# Creating data-frames out of the existing files and assigning them names
movie_gross_df = pd.read_csv("Data/bom.movie_gross.csv.gz")
name_basics_df = pd.read_csv("Data/imdb.name.basics.csv.gz")
akas_df = pd.read_csv("Data/imdb.title.akas.csv.gz")
basics_df = pd.read_csv("Data/imdb.title.basics.csv.gz")
crew_df = pd.read_csv("Data/imdb.title.crew.csv.gz")
principals_df = pd.read_csv("Data/imdb.title.principals.csv.gz")
ratings_df = pd.read_csv("Data/imdb.title.ratings.csv.gz")
movies_df = pd.read_csv("Data/tmdb.movies.csv.gz")
movie_budget_df = pd.read_csv("Data/tn.movie_budgets.csv.gz")
rt_info_df = pd.read_csv("Data/rt.movie_info.tsv.gz", sep='\t')
rt_reviews_df = pd.read_csv("Data/rt.reviews.tsv.gz", sep='\t', encoding="unicode_escape")

# Data Description

**movie_budget_df** 
- release date 
- movie name
- production budget
- domestic gross
- worldwide gross

**movie_gross_df**
- movie name
- studio name
- domestic gross
- foreign gross
- year of movie

**movies_df**
- genre ids
- the id
- original language filmed in
- movie name
- popularity
- release date
- vote
- vote count

**name_basics_df**

**akas_df**

**basics_df**
- primary and original title
- start year
- runtime in mins
- genre

**crew_df**
- directors
- writers

**principals_df**

**ratings_df**
- average rating
- num. of votes

**rt_reviews_df**
- review
- rating
- fresh/rotten
- critic/top critic
- publisher
- date

**rt_info_df**
- synopsis
- movie rating
- genre
- director
- writer
- release date
- dvd date
- currency
- box office
- runtime
- studio

# Data that are similar

movie_budget_df & movie_gross_df

movies_df & ratings_df

# Data To Use

**movie_gross_df**

**movie_budget_df**

# What To Show
Name of Movie

**Genre**

**Movie Rating**

Release Date - Not a priority

**Production Budget/Gross Revenue/ROI**

Ratings

# **Code Starts Here**

### What Data-Frames Were Chosen
We chose two dataframes to use, **movie_budget_df** and **movie_gross_df**

## Problem
The monetary columns in **movie_budget_df** contain the special characters "$" and ",", which **movie_gross_df** does not.

These differences prevent us from merging the two dataframes cleanly.

## Solution
We created a function that takes any dataframe and removes the commas and dollar signs.

In [213]:
# Getting rid of "$" and "," in a dataframe
def remove_format(dataframe):
    # A single regex replace already covers every column, so no loop is needed
    dataframe.replace(r'[\$,]', '', regex=True, inplace=True)

In [214]:
# Removing specified characters in this dataframe
remove_format(movie_budget_df)
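Because the replacement runs over every column, it also strips commas from text columns such as movie titles and release dates. A minimal sketch of a column-targeted variant follows, assuming we only want to clean the monetary columns. The name `remove_currency_format` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for movie_budget_df; the values are illustrative only.
toy_budget_df = pd.DataFrame({
    "movie": ["Movie A, Part 2", "Movie B"],
    "production_budget": ["$1,000,000", "$250,000"],
    "worldwide_gross": ["$3,500,000", "$100,000"],
})

def remove_currency_format(dataframe, columns):
    """Return a copy with '$' and ',' stripped from the given columns only."""
    cleaned = dataframe.copy()
    for column in columns:
        cleaned[column] = cleaned[column].str.replace(r"[\$,]", "", regex=True)
    return cleaned

cleaned_df = remove_currency_format(
    toy_budget_df, ["production_budget", "worldwide_gross"]
)
print(cleaned_df)
```

Returning a copy instead of modifying in place keeps the raw data available for comparison, and the comma in the movie title is left untouched.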

# Changing Data Types

## Problem
There is more inconsistency between our dataframes. Even with the special characters removed, the monetary columns in **movie_budget_df** are still stored as strings (object dtype), while `domestic_gross` in **movie_gross_df** has a dtype of float.

In order to merge, we need our data to be as consistent as possible.

## Solution
We select the monetary columns of **movie_budget_df** and change their dtype to float.

Now these columns have a float dtype, just like `domestic_gross` in **movie_gross_df**.

In [215]:
# Converting the monetary columns to float so they match movie_gross_df
movie_budget_df['production_budget'] = movie_budget_df['production_budget'].astype(float)
movie_budget_df['domestic_gross'] = movie_budget_df['domestic_gross'].astype(float)
movie_budget_df['worldwide_gross'] = movie_budget_df['worldwide_gross'].astype(float)
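`astype(float)` raises an error if any value cannot be parsed. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN so they can be inspected later. The toy dataframe and the `"unknown"` entry are assumptions for illustration; the real file may contain no such values.

```python
import pandas as pd

# Toy data: one entry cannot be parsed as a number (illustrative only).
toy_df = pd.DataFrame({
    "production_budget": ["1000000", "250000", "unknown"],
    "domestic_gross": ["500000", "0", "125000"],
})

money_columns = ["production_budget", "domestic_gross"]
for column in money_columns:
    # Unparseable strings become NaN instead of raising a ValueError
    toy_df[column] = pd.to_numeric(toy_df[column], errors="coerce").astype(float)

print(toy_df.dtypes)
print(toy_df["production_budget"].isna().sum())  # number of values that failed to parse
```

Looping over a list of column names also avoids repeating the same conversion line for each column.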

In [216]:
movie_budget_df.info() # Checking to see if info matches in order to merge

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null float64
domestic_gross       5782 non-null float64
worldwide_gross      5782 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 271.2+ KB


In [217]:
movie_gross_df.info() # Checking to see if info matches in order to merge

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
title             3387 non-null object
studio            3382 non-null object
domestic_gross    3359 non-null float64
foreign_gross     2037 non-null object
year              3387 non-null int64
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


# Dropping Missing Values

## Problem
We've found missing data in **movie_gross_df**

## Solution
We use the dropna() method with inplace=True, which removes every row that contains at least one missing value.

Setting inplace=True applies the change directly to the dataframe instead of returning a modified copy, so the result persists without reassignment.

In [218]:
# Dropping all rows with missing values (NaN) from the dataframe
movie_gross_df.dropna(inplace=True)

In [219]:
movie_gross_df.isna().sum() # Counts the NaN values in each column

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64
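Calling `dropna()` with no arguments discards a row if any column is missing. The `info()` output above lists only 2037 non-null `foreign_gross` values out of 3387 rows, so this step removes a large share of the data. The sketch below compares that behavior with restricting the check to selected columns via `subset`. The toy dataframe is hypothetical, and which columns are truly required is an assumption that depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movie_gross_df with gaps in different columns.
toy_gross_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["S1", None, "S2", "S3"],
    "domestic_gross": [1.0e6, 2.0e6, np.nan, 4.0e6],
    "foreign_gross": ["500000", "700000", "900000", None],
})

rows_before = len(toy_gross_df)

# Drops a row if ANY column is missing
dropped_any = toy_gross_df.dropna()

# Drops a row only if domestic_gross is missing
dropped_subset = toy_gross_df.dropna(subset=["domestic_gross"])

print(rows_before, len(dropped_any), len(dropped_subset))
```

Comparing row counts before and after makes it easy to see how much data each choice costs.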

In [220]:
movie_budget_df.isna().sum() # Counts the NaN values in each column

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

In [221]:
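# Renaming columns to match movie_budget_df
# Note: foreign_gross excludes domestic revenue, so it is not a true worldwide total
movie_gross_df = movie_gross_df.rename(columns={"title": "movie", "foreign_gross": "worldwide_gross"})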
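With a shared `movie` column in place, the two dataframes can be joined. The merge itself is not part of this section, so what follows is only a minimal sketch under the assumption that `movie` is the join key. The toy dataframes are hypothetical. The `indicator=True` option adds a `_merge` column that reports which side each row came from, which helps spot titles that fail to match.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative values only).
toy_budget_df = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie C"],
    "production_budget": [1.0e6, 2.0e6, 3.0e6],
})
toy_gross_df = pd.DataFrame({
    "movie": ["Movie A", "Movie C", "Movie D"],
    "studio": ["S1", "S2", "S3"],
})

# An outer join keeps unmatched rows from both sides so they can be inspected
merged_df = toy_budget_df.merge(toy_gross_df, on="movie", how="outer", indicator=True)
print(merged_df["_merge"].value_counts())

# Keep only the movies found in both sources
matched_df = merged_df[merged_df["_merge"] == "both"].drop(columns="_merge")
print(matched_df)
```

Joining on title alone can pair different films that share a name, so adding the release year to the key may be worth considering.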

In [222]:
movie_budget_df.head()

| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross |
|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18 2009 | Avatar | 425000000.0 | 760507625.0 | 2776345000.0 |
| 1 | 2 | May 20 2011 | Pirates of the Caribbean: On Stranger Tides | 410600000.0 | 241063875.0 | 1045664000.0 |
| 2 | 3 | Jun 7 2019 | Dark Phoenix | 350000000.0 | 42762350.0 | 149762400.0 |
| 3 | 4 | May 1 2015 | Avengers: Age of Ultron | 330600000.0 | 459005868.0 | 1403014000.0 |
| 4 | 5 | Dec 15 2017 | Star Wars Ep. VIII: The Last Jedi | 317000000.0 | 620181382.0 | 1316722000.0 |


In [223]:
movie_budget_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null float64
domestic_gross       5782 non-null float64
worldwide_gross      5782 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 271.2+ KB


In [238]:
percent = 100 # Multiplier used to express a ratio as a percentage

# Creating new columns that show ROI, computed here as gross divided by budget, in percent
movie_budget_df["worldwide_roi"] = (movie_budget_df.worldwide_gross / movie_budget_df.production_budget) * percent
movie_budget_df["domestic_roi"] = (movie_budget_df.domestic_gross / movie_budget_df.production_budget) * percent

In [239]:
movie_budget_df.head()

| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross | worldwide_roi | domestic_roi |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18 2009 | Avatar | 425000000.0 | 760507625.0 | 2776345000.0 | 653.257713 | 178.942971 |
| 1 | 2 | May 20 2011 | Pirates of the Caribbean: On Stranger Tides | 410600000.0 | 241063875.0 | 1045664000.0 | 254.667286 | 58.71015 |
| 2 | 3 | Jun 7 2019 | Dark Phoenix | 350000000.0 | 42762350.0 | 149762400.0 | 42.789243 | 12.217814 |
| 3 | 4 | May 1 2015 | Avengers: Age of Ultron | 330600000.0 | 459005868.0 | 1403014000.0 | 424.384139 | 138.84025 |
| 4 | 5 | Dec 15 2017 | Star Wars Ep. VIII: The Last Jedi | 317000000.0 | 620181382.0 | 1316722000.0 | 415.369636 | 195.640815 |
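The ROI columns above express gross as a percentage of budget, so a value of 100 means the movie only earned back its budget. ROI is often defined as net profit over cost instead, which makes break-even equal to 0 and losses negative. The sketch below shows both on toy numbers. The column names `gross_to_budget_pct` and `net_roi_pct` are hypothetical, and the calculation ignores marketing costs and revenue splits, which this data does not include.

```python
import pandas as pd

# Toy data with round numbers so the results are easy to check by hand.
toy_df = pd.DataFrame({
    "movie": ["Movie A", "Movie B"],
    "production_budget": [100.0, 200.0],
    "worldwide_gross": [250.0, 150.0],
})

percent = 100

# Gross as a percentage of budget (the calculation used in the notebook)
toy_df["gross_to_budget_pct"] = (
    toy_df["worldwide_gross"] / toy_df["production_budget"] * percent
)

# Net ROI: profit relative to budget, so a loss shows up as a negative number
toy_df["net_roi_pct"] = (
    (toy_df["worldwide_gross"] - toy_df["production_budget"])
    / toy_df["production_budget"]
    * percent
)

print(toy_df)
```

The two measures differ by exactly 100 percentage points, so rankings stay the same; only the interpretation of the numbers changes.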
