![example](images/director_shot.jpeg)

# Project Title

**Authors:** Student 1, Student 2, Student 3
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Here you run your code to explore the data

imdb_title = pd.read_csv('./data/zippedData/imdb.title.basics.csv.gz')
imdb_title_akas = pd.read_csv('./data/zippedData/imdb.title.akas.csv.gz')
imdb_ratings = pd.read_csv('./data/zippedData/imdb.title.ratings.csv.gz')
bom_gross = pd.read_csv('./data/zippedData/bom.movie_gross.csv.gz')



In [3]:
imdb_title.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


**OBSERVATION:** Three columns have null values: original_title (21 missing), runtime_minutes (31,739 missing) and genres (5,408 missing).
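The `info()` output reports non-null counts, so the missing values have to be worked out by subtraction. A minimal sketch of counting them directly is below. It uses a hypothetical three-row stand-in (`sample_titles`, with made-up rows) rather than the real `imdb_title` table, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for imdb_title: same column names, made-up rows
sample_titles = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "original_title": ["Example A", None, "Example C"],
    "runtime_minutes": [90.0, np.nan, 80.0],
    "genres": ["Drama", "Comedy", None],
})

# Count missing values per column
null_counts = sample_titles.isna().sum()

# Keep only the columns that actually have missing values
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

The same two lines should work on any of the four tables loaded above by swapping in the real DataFrame name.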

In [4]:
imdb_title_akas.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331703 entries, 0 to 331702
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   title_id           331703 non-null  object 
 1   ordering           331703 non-null  int64  
 2   title              331703 non-null  object 
 3   region             278410 non-null  object 
 4   language           41715 non-null   object 
 5   types              168447 non-null  object 
 6   attributes         14925 non-null   object 
 7   is_original_title  331678 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 20.2+ MB


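**OBSERVATION:** Five of the eight columns have null values: region, language, types, attributes and is_original_title. language and attributes are mostly empty (41,715 and 14,925 non-null values out of 331,703).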

In [5]:
imdb_ratings.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   tconst         73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


**OBSERVATION:** No column has null values.

In [6]:
bom_gross.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


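**OBSERVATION:** title and year are complete; studio, domestic_gross and foreign_gross have null values (foreign_gross most of all, with 2,037 non-null values out of 3,387). foreign_gross is also stored as object rather than as a numeric type.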
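In the `info()` output, `foreign_gross` has dtype `object` while `domestic_gross` is `float64`, so it will need converting before any arithmetic. The sketch below shows one possible conversion. It assumes the cause is thousands separators in some values, which has not been checked against the real file, and the rows in `sample_gross` are made up.

```python
import pandas as pd

# Hypothetical stand-in for bom_gross; the comma in the second value is an assumption
sample_gross = pd.DataFrame({
    "title": ["Example A", "Example B", "Example C"],
    "foreign_gross": ["652000000", "1,131.6", None],
})

# Remove thousands separators (assumed), then convert to numbers.
# errors="coerce" turns anything that still cannot be parsed into NaN.
cleaned = sample_gross["foreign_gross"].str.replace(",", "", regex=False)
sample_gross["foreign_gross_num"] = pd.to_numeric(cleaned, errors="coerce")
print(sample_gross)
```

Writing to a new column keeps the original strings available, which makes it easier to inspect any values that were coerced to NaN.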

**IMDB TITLES DATA**

In [7]:
imdb_title.head()

| | tconst | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |


In [9]:
imdb_title['tconst'].value_counts()

tt4932514    1
tt6146894    1
tt1835951    1
tt1975179    1
tt3037950    1
            ..
tt5894274    1
tt9173250    1
tt2631390    1
tt3923472    1
tt4299026    1
Name: tconst, Length: 146144, dtype: int64

**OBSERVATION:** Every tconst value is unique (146144 distinct values for 146144 rows).
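Reading uniqueness off a truncated `value_counts()` listing works, but it can also be checked explicitly. A small sketch on a hypothetical `sample_ids` Series (made-up identifiers standing in for `imdb_title['tconst']`):

```python
import pandas as pd

# Hypothetical stand-in for imdb_title['tconst']
sample_ids = pd.Series(["tt0000001", "tt0000002", "tt0000003"], name="tconst")

# True when no value appears more than once
all_unique = sample_ids.is_unique

# Number of rows that repeat an earlier value (0 when all_unique is True)
duplicate_count = int(sample_ids.duplicated().sum())
print(all_unique, duplicate_count)
```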

In [10]:
imdb_title['primary_title'].value_counts()


Home              24
Broken            20
The Return        20
Alone             16
Homecoming        16
                  ..
GH5                1
Anumati            1
Blackwood          1
A Real Vermeer     1
Balkan Spirit      1
Name: primary_title, Length: 136071, dtype: int64

**Note to me:** What is the difference between the primary and the original title?
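One way to explore this question is to look at the rows where the two columns disagree. The sketch below reuses three rows from the `imdb_title.head()` output above; `sample_titles` is only a small stand-in for the full table. In these rows the original title looks like the title in the film's own language and the primary title like an English one, but that reading is a guess based on a handful of rows.

```python
import pandas as pd

# Three rows copied from the imdb_title.head() output
sample_titles = pd.DataFrame({
    "primary_title": [
        "Sunghursh",
        "One Day Before the Rainy Season",
        "The Wandering Soap Opera",
    ],
    "original_title": [
        "Sunghursh",
        "Ashad Ka Ek Din",
        "La Telenovela Errante",
    ],
})

# Boolean mask: True where the two titles are not identical
differs = sample_titles["primary_title"] != sample_titles["original_title"]
print(sample_titles[differs])
print("rows that differ:", int(differs.sum()))
```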

In [11]:
imdb_title['original_title'].value_counts()


Broken                                    19
Home                                      18
The Return                                17
Freedom                                   13
Homecoming                                13
                                          ..
Wolf is Coming                             1
Tone, javi se!                             1
PICO: Un parlante de Africa en America     1
The Thorn                                  1
Balkan Spirit                              1
Name: original_title, Length: 137773, dtype: int64

In [12]:
imdb_title['start_year'].value_counts()


2017    17504
2016    17272
2018    16849
2015    16243
2014    15589
2013    14709
2012    13787
2011    12900
2010    11849
2019     8379
2020      937
2021       83
2022       32
2023        5
2024        2
2027        1
2026        1
2025        1
2115        1
Name: start_year, dtype: int64

**OBSERVATION:** How can some start years lie beyond the current year (e.g. 2115)?
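Future years might be announced but unreleased titles, or data-entry errors; the data alone does not say which. Either way, they can be separated out before analysis. A minimal sketch, where `cutoff_year` is a hypothetical choice and `sample_years` is a made-up stand-in for `imdb_title['start_year']`:

```python
import pandas as pd

# Hypothetical stand-in for imdb_title['start_year']
sample_years = pd.Series([2013, 2019, 2027, 2115], name="start_year")

# Hypothetical cutoff: the last year treated as already released
cutoff_year = 2019

released = sample_years[sample_years <= cutoff_year]
planned = sample_years[sample_years > cutoff_year]
print("released:", list(released))
print("planned or suspicious:", list(planned))
```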

In [15]:
imdb_title['runtime_minutes'].value_counts()


90.0     7131
80.0     3526
85.0     2915
100.0    2662
95.0     2549
         ... 
382.0       1
724.0       1
808.0       1
287.0       1
540.0       1
Name: runtime_minutes, Length: 367, dtype: int64

**Note to me:** Is this information necessary for my project? It would be good to know whether movies should be long or short.

In [16]:
imdb_title['genres'].value_counts()


Documentary                32185
Drama                      21486
Comedy                      9177
Horror                      4372
Comedy,Drama                3519
                           ...  
Animation,Crime                1
Comedy,Musical,Sport           1
Comedy,Romance,Short           1
Animation,Family,Sci-Fi        1
Drama,News,Sci-Fi              1
Name: genres, Length: 1085, dtype: int64
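The counts above treat each comma-separated combination as its own category, which is why there are 1085 distinct values. To count individual genres instead, the strings can be split and exploded. A sketch on a hypothetical `sample_genres` Series built from the genre strings in the `head()` output plus one missing value:

```python
import pandas as pd

# Genre strings from the imdb_title.head() output, plus one missing value
sample_genres = pd.Series(
    ["Action,Crime,Drama", "Biography,Drama", "Drama", None, "Comedy,Drama"],
    name="genres",
)

# Drop missing values, split on commas, and give each genre its own row
genre_counts = (
    sample_genres.dropna()
    .str.split(",")
    .explode()
    .value_counts()
)
print(genre_counts)
```

With this approach a movie tagged with three genres contributes to three counts, so the totals no longer sum to the number of movies.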

THINGS TO CONSIDER (imdb_title_akas):
- Each title id has several rows, with different regions, languages, etc.
- The same movie has several different titles; when combining this table with the others I may have to select the original title.
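The second point could be handled by keeping one row per title id before joining. The sketch below assumes that `is_original_title == 1.0` marks the original title, which still needs to be confirmed against the data. `sample_akas` is a made-up stand-in for `imdb_title_akas`.

```python
import pandas as pd

# Hypothetical stand-in for imdb_title_akas: two titles, two rows each
sample_akas = pd.DataFrame({
    "title_id": ["tt0000001", "tt0000001", "tt0000002", "tt0000002"],
    "title": ["Example A", "Ejemplo A", "Example B", "Beispiel B"],
    "region": [None, "ES", None, "DE"],
    "is_original_title": [1.0, 0.0, 1.0, 0.0],
})

# Keep only rows flagged as the original title (assumed meaning of the flag)
original_rows = sample_akas[sample_akas["is_original_title"] == 1.0]

# Guard against a title id being flagged more than once
one_per_title = original_rows.drop_duplicates(subset="title_id")
print(one_per_title)
```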

In [None]:
imdb_title_akas['title_id'].value_counts()


In [None]:
imdb_title_akas['ordering'].value_counts()


**Note to me:** What is ordering?


In [None]:
imdb_title_akas['title'].value_counts()


In [None]:
imdb_title_akas['region'].value_counts()


In [None]:
imdb_title_akas['language'].value_counts()


In [None]:
imdb_title_akas['types'].value_counts()


**Note to me:** Do we need all these types? Alternative?

In [None]:
imdb_title_akas['attributes'].value_counts()


In [None]:
imdb_title_akas['is_original_title'].value_counts()


This indicates that there are 44700 original titles.

**IMDB RATINGS DATA**

In [None]:
imdb_ratings.head()


## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***