![example](images/director_shot.jpeg)

# Project Title

**Authors:** Student 1, Student 2, Student 3
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions you plan to answer in order to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Here you run your code to explore the data

In [3]:
imdb_title_basics = pd.read_csv('zippedData/imdb.title.basics.csv.gz')
imdb_title_basics

| | tconst | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |
| ... | ... | ... | ... | ... | ... | ... |
| 146139 | tt9916538 | Kuambil Lagi Hatiku | Kuambil Lagi Hatiku | 2019 | 123.0 | Drama |
| 146140 | tt9916622 | Rodolpho Teóphilo - O Legado de um Pioneiro | Rodolpho Teóphilo - O Legado de um Pioneiro | 2015 | NaN | Documentary |
| 146141 | tt9916706 | Dankyavar Danka | Dankyavar Danka | 2013 | NaN | Comedy |
| 146142 | tt9916730 | 6 Gunn | 6 Gunn | 2017 | 116.0 | NaN |


In [4]:
# Normalize title names in preparation for the join:
# lowercase, then strip colons and hyphens
imdb_title_basics['primary_title'] = (
    imdb_title_basics['primary_title']
    .str.lower()
    .str.replace(":", "", regex=False)
    .str.replace("-", "", regex=False)
)
imdb_title_basics['primary_title'].head()

0                          sunghursh
1    one day before the rainy season
2         the other side of the wind
3                    sabse bada sukh
4           the wandering soap opera
Name: primary_title, dtype: object

In [5]:
# Create new column with title + year to minimize duplicates when joining
imdb_title_basics['title_year'] = imdb_title_basics['primary_title'] + " " + imdb_title_basics['start_year'].astype('str')
imdb_title_basics.head()

| | tconst | primary_title | original_title | start_year | runtime_minutes | genres | title_year |
|---|---|---|---|---|---|---|---|
| 0 | tt0063540 | sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama | sunghursh 2013 |
| 1 | tt0066787 | one day before the rainy season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama | one day before the rainy season 2019 |
| 2 | tt0069049 | the other side of the wind | The Other Side of the Wind | 2018 | 122.0 | Drama | the other side of the wind 2018 |
| 3 | tt0069204 | sabse bada sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama | sabse bada sukh 2018 |
| 4 | tt0100275 | the wandering soap opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy | the wandering soap opera 2017 |


In [6]:
# Identify unreleased movies: rows with a null runtime and a start year of 2020 or later
unreleased = imdb_title_basics['runtime_minutes'].isna() & (imdb_title_basics['start_year'] >= 2020)
imdb_drop_rows = imdb_title_basics.index[unreleased].tolist()
imdb_drop_rows[:5]

[33, 93, 229, 289, 386]

In [7]:
imdb_title_basics.drop(imdb_drop_rows, inplace=True)

In [8]:
len(imdb_title_basics)

145170

In [9]:
imdb_title_basics.duplicated(subset=['title_year']).sum()

2160

In [10]:
imdb_title_basics.duplicated(subset=['primary_title']).sum()

10251
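
The two counts above differ because many titles repeat across years, so the `title_year` key collides far less often than the bare title. Before dropping anything, it can help to look at the colliding rows themselves. A minimal sketch, assuming invented rows in a hypothetical `toy` frame (none of these values come from the IMDB file):

```python
import pandas as pd

# Hypothetical stand-in for imdb_title_basics; rows are invented for illustration.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "primary_title": ["alpha", "alpha", "beta", "alpha"],
    "start_year": [2015, 2015, 2016, 2018],
})
toy["title_year"] = toy["primary_title"] + " " + toy["start_year"].astype(str)

# keep=False flags every member of a duplicate group, not only the repeats,
# so both colliding rows can be compared side by side.
dupes = toy[toy.duplicated(subset=["title_year"], keep=False)]
dupes = dupes.sort_values("title_year")
print(dupes)
```

Inspecting these rows shows whether keeping the first occurrence is a reasonable rule or whether another column (for example runtime) should break the tie.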

In [11]:
# Dropping duplicate titles
clean_imdb = imdb_title_basics.drop_duplicates(subset=['title_year'], keep="first")
clean_imdb.duplicated(subset=['title_year']).sum()

0

In [12]:
len(clean_imdb)

143010

In [None]:
bom_movie_gross = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
bom_movie_gross

In [None]:
# Normalize title names in preparation for the join:
# lowercase, then strip colons and hyphens
bom_movie_gross['title'] = (
    bom_movie_gross['title']
    .str.lower()
    .str.replace(":", "", regex=False)
    .str.replace("-", "", regex=False)
)
bom_movie_gross['title'].head()

In [None]:
# Create new column with title + year to minimize duplicates when joining
bom_movie_gross['title_year'] = bom_movie_gross['title'] + " " + bom_movie_gross['year'].astype('str')
bom_movie_gross = bom_movie_gross.reset_index()
bom_movie_gross.head()
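
With a `title_year` key on both tables, the join itself could look something like the sketch below. This is an illustration on invented rows, not the notebook's actual join: `imdb_toy` and `bom_toy` are hypothetical stand-ins for `clean_imdb` and `bom_movie_gross`, and the choice of a left join is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for clean_imdb and bom_movie_gross; values are invented.
imdb_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "title_year": ["alpha 2015", "beta 2016", "gamma 2017"],
})
bom_toy = pd.DataFrame({
    "title_year": ["alpha 2015", "gamma 2017", "delta 2018"],
    "domestic_gross": [1.5e8, 2.0e6, 3.0e7],
})

# how="left" keeps every IMDB row; indicator=True adds a _merge column that
# records whether each row found a match; validate raises an error if the
# key is duplicated on either side.
merged = imdb_toy.merge(
    bom_toy,
    on="title_year",
    how="left",
    indicator=True,
    validate="one_to_one",
)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

The `_merge` counts give a quick read on how many titles failed to match after normalization. If `validate="one_to_one"` raises on the real data, that indicates the key still has duplicates on one side and needs the same de-duplication step applied there.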

In [None]:
bom_movie_gross.info()

In [None]:
bom_movie_gross[bom_movie_gross['domestic_gross'] > 100000000]

In [None]:
imdb_title_ratings = pd.read_csv('zippedData/imdb.title.ratings.csv.gz')
imdb_title_ratings

In [None]:
imdb_name_basics = pd.read_csv('zippedData/imdb.name.basics.csv.gz')
imdb_name_basics

In [None]:
imdb_title_akas = pd.read_csv('zippedData/imdb.title.akas.csv.gz')
imdb_title_akas

In [None]:
imdb_title_crew = pd.read_csv('zippedData/imdb.title.crew.csv.gz')
imdb_title_crew

In [None]:
imdb_title_principals = pd.read_csv('zippedData/imdb.title.principals.csv.gz')
imdb_title_principals

In [None]:
rt_movie = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep="\t", encoding="iso8859-1")
rt_movie

In [None]:
rt_reviews = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep="\t", encoding="iso8859-1")
rt_reviews

In [None]:
tmdb_movies = pd.read_csv('zippedData/tmdb.movies.csv.gz')
tmdb_movies

In [None]:
tn_movie_budgets = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
tn_movie_budgets
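
If the money columns in this file turn out to be stored as currency strings rather than numbers (an assumption worth checking with `tn_movie_budgets.dtypes`), they need converting before any arithmetic. A minimal sketch on invented rows; the `toy_budgets` frame and the `production_budget` column name are assumptions for illustration:

```python
import pandas as pd

# Hypothetical rows; the column name and string format are assumptions.
toy_budgets = pd.DataFrame({
    "movie": ["Alpha", "Beta"],
    "production_budget": ["$1,500,000", "$20,000"],
})

# Strip the currency symbol and thousands separators as literal text
# (regex=False), then convert; errors="coerce" turns anything that still
# fails to parse into NaN instead of raising.
toy_budgets["production_budget"] = pd.to_numeric(
    toy_budgets["production_budget"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False),
    errors="coerce",
)
print(toy_budgets.dtypes)
```

Counting the NaN values after conversion is a cheap way to spot entries that did not follow the expected format.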

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***
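
One possible starting point for the missing-value and outlier questions above is to count nulls per column and flag values outside the 1.5 × IQR fences. This is a sketch on invented data, and the IQR rule is only one common convention, not necessarily the right one for this project:

```python
import pandas as pd

# Hypothetical runtimes; one value is missing and one is implausibly large.
toy = pd.DataFrame(
    {"runtime_minutes": [90.0, 95.0, 100.0, 105.0, None, 600.0]}
)

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Interquartile range fences; quantile() skips NaN by default
q1 = toy["runtime_minutes"].quantile(0.25)
q3 = toy["runtime_minutes"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Comparisons against NaN evaluate to False, so missing rows are not flagged
is_outlier = (toy["runtime_minutes"] < lower) | (toy["runtime_minutes"] > upper)
print(missing_per_column)
print(toy[is_outlier])
```

Whether flagged rows should be dropped, capped, or kept depends on the business question, so the flags are best treated as a list of rows to review rather than rows to delete.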

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***