## Resolving Conflicts

As noted in [data understanding](#Data-Understanding), there are a number of conflicts that need to be resolved before 

In [10]:
cl.prep_Data(df_title_basics, df_movie_budgets).head()

AttributeError: 'NoneType' object has no attribute 'head'

### Datasets
For the purposes of this analysis we focused primarily on data from the Internet Movie Database (IMDB) and The-Numbers.com (TN), two sources that focus on the film industry. Specifically, we used one dataset containing title, release year, and genre data, and another containing title, release date, production budget, and box office figures. Below is a summary of the data pertinent to our analysis, broken down by file.

| imdb.title.basics.csv | tn.movie_budgets.csv |
| --- | --- |
| primary_title | movie |
| start_year | release_date |
| genres |  |
|  | production_budget |
|  | domestic_gross |
|  | worldwide_gross |


In this table you can see our understanding of how columns in the two datasets "match": they provide the same kind of data but may use different formats. For example, `start_year` in `imdb.title.basics.csv` is formatted as `YYYY`, whereas `release_date` in `tn.movie_budgets.csv` is formatted as `MMM DD, YYYY`. Examples of this discrepancy are shown below.

In [1]:
import CustomLibrary as cl
from CustomLibrary import df_title_basics, df_movie_budgets

df_title_basics.head(3)

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama


In [16]:
df_movie_budgets.head(3)

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"


Further, the data types limit how the data can be used. The previously discussed `release_date` and `start_year` columns are stored as `object` and `int64`, respectively, and the budget and box office columns are `object` values that can't be added or subtracted until they are converted. How we dealt with them can be seen in the [data preparation](#Data-Preparation) section.

In [4]:
cl.df_info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
tconst             146144 non-null object
primary_title      146144 non-null object
original_title     146123 non-null object
start_year         146144 non-null int64
runtime_minutes    114405 non-null float64
genres             140736 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB
imdb.title.basics.csv
 None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
id                   5782 non-null int64
release_date         5782 non-null object
movie                5782 non-null object
production_budget    5782 non-null object
domestic_gross       5782 non-null object
worldwide_gross      5782 non-null object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB
tn._movie_budgets.csv
 None
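
As a minimal sketch of one way to reconcile the two date columns (not necessarily the approach taken in the data preparation section), the `MMM DD, YYYY` strings can be parsed with `pd.to_datetime` and reduced to an integer year that is comparable with `start_year`. The sample values are copied from the preview above; the variable names are illustrative only.

```python
import pandas as pd

# Sample values copied from the tn.movie_budgets.csv preview above
release_date = pd.Series(["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"])

# Parse the "MMM DD, YYYY" strings into datetimes, then keep only the year
parsed = pd.to_datetime(release_date, format="%b %d, %Y")
release_year = parsed.dt.year.astype("int64")

print(release_year.tolist())
print(release_year.dtype)
```

Passing an explicit `format` makes the parse fail loudly if a row deviates from the expected layout, rather than silently guessing.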


## Key Data for Merging Datasets and Analysis

Another key issue for this data is identifying which columns we can use to perform a merge. Although `imdb.title.basics.csv` contains a unique identifier column, `tconst`, the `tn.movie_budgets.csv` dataset contains no matching unique IDs. We worked around that issue by using the title variables `primary_title` and `movie` and the release date variables `start_year` and `release_date` as our keys. As mentioned before, these variables do not match in format, so, for example, we extracted the year string from `release_date` to match it against `start_year`.
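
The matching described above can be sketched on toy data as follows. The TN rows are taken from the preview shown earlier; the IMDB rows, including the `tconst` values and the deliberately mismatched year, are hypothetical and exist only to show how a non-matching row drops out of an inner merge.

```python
import pandas as pd

# Two rows from the tn.movie_budgets.csv preview
tn = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "release_date": ["Dec 18, 2009", "Jun 7, 2019"],
})
# Year string at the end of release_date, cast to int to match start_year
tn["start_year"] = tn["release_date"].str[-4:].astype("int64")

# Hypothetical IMDB-style rows; the second year is wrong on purpose
imdb = pd.DataFrame({
    "tconst": ["id_1", "id_2"],            # placeholder IDs, not real tconst values
    "primary_title": ["Avatar", "Dark Phoenix"],
    "start_year": [2009, 2018],
})

# Align the title column name, then merge on title and year together
merged = tn.merge(
    imdb.rename(columns={"primary_title": "movie"}),
    how="inner",
    on=["movie", "start_year"],
)
print(merged[["movie", "start_year", "tconst"]])
```

Only the row whose title and year both agree survives, which is why the key columns need the same dtype and format before merging.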

Although the datasets contain key variables such as `genres`, `production_budget`, and `worldwide_gross`, they lack the total cost of producing each movie (including costs like marketing) as well as more granular genre data. As seen in [data preparation](#Data-Preparation), we made key assumptions, calculations, and manipulations to make the dataframe usable for our analysis of ROI, genre, and budgets.
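
This section does not state the ROI formula, so the sketch below assumes a simple definition: ROI = (worldwide gross - production budget) / production budget. Because marketing and other costs are missing from the data, a figure computed this way would overstate the true return. The numbers are the sample rows shown earlier, already converted to integers.

```python
import pandas as pd

df = pd.DataFrame({
    "movie": ["Avatar", "Dark Phoenix"],
    "production_budget": [425_000_000, 350_000_000],
    "worldwide_gross": [2_776_345_279, 149_762_350],
})

# Assumed definition: profit relative to the production budget only
df["roi"] = (df["worldwide_gross"] - df["production_budget"]) / df["production_budget"]

print(df[["movie", "roi"]].round(2))
```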

# Cleaning commands and functions

In [None]:
# Unzipping csv.gz and tsv.gz files
!find . -name '*.csv.gz' -exec gzip -d {} \;
!find . -name '*.tsv.gz' -exec gzip -d {} \;

In [None]:
def indicator_str_parser(dataframe, parsed_column_str, list_of_strs):
    
    # If the column to be parsed has missing values, set only those values to 'none'
    dataframe[parsed_column_str] = dataframe[parsed_column_str].fillna("none")
    
    # Create indicator columns for columns with no string to be parsed
    dataframe[parsed_column_str + '_not_parsed_id'] = [1 if x == "none"
                                                       else 0 
                                                       for x in dataframe[parsed_column_str]]
    
    # starts list of created series to be used as arguments
    list_of_series = []
    
    # Loop over elements in list
    for genre in list_of_strs:
        
        # Make a new indicator column from the parsed column and the element to be searched for
        dataframe[parsed_column_str + '_' + genre + '_id'] = [1 if genre in x 
                                                            else 0 
                                                            for x in dataframe[parsed_column_str]]
        
        # Include new column in list to be fed into function
        list_of_series.append(dataframe[parsed_column_str + '_' + genre + '_id'])
        
    # Unpack list_of_series to be fed as arguments into zip function for unique tuples of parsed indicators
    dataframe[parsed_column_str + '_tuple'] = list(zip(*list_of_series))
    
    # return value counts showing how many strings in the column were parsed
    return dataframe[parsed_column_str + '_not_parsed_id'].value_counts()
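
For comparison, pandas also provides `Series.str.get_dummies`, which builds indicator columns by splitting on a separator instead of testing for substrings. The sketch below is an alternative to the function above, not a description of it. The first three genre strings come from the `imdb.title.basics.csv` preview; the missing value and the `Comedy,Musical` entry are illustrative. One practical difference is that a substring test such as `'Music' in x` would also fire on `Musical`, whereas splitting on commas does not.

```python
import pandas as pd

# Comma-separated genre strings; None stands in for a missing value
genres = pd.Series(["Action,Crime,Drama", "Biography,Drama", "Drama", None, "Comedy,Musical"])

# Label missing values explicitly, then split on commas into 0/1 columns
indicators = genres.fillna("none").str.get_dummies(sep=",")

# Keep only the genres of interest; genres never seen get a column of zeros
wanted = ["Comedy", "Drama", "Action", "Music"]
subset = indicators.reindex(columns=wanted, fill_value=0)

print(subset)
```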

# Misc Exploration

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
home = 'data/zippedData/'

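df_movie_budgets = pd.read_csv(home + 'tn.movie_budgets.csv')
# Take the four-digit year from the end of release_date and cast it to int
# so it has the same dtype as start_year in imdb.title.basics.csv
df_movie_budgets['start_year'] = df_movie_budgets['release_date'].str[-4:].astype(int)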

df_movie_budgets.head()

In [None]:
df_title_basics = pd.read_csv(home + 'imdb.title.basics.csv')
df_title_basics.head()
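# rename returns a new dataframe, so assign the result back
df_title_basics = df_title_basics.rename(columns={'primary_title': 'movie'})
# Fill only the missing genres, leaving the other columns in those rows intact
df_title_basics['genres'] = df_title_basics['genres'].fillna('none')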
df_title_basics

In [None]:
indicator_str_parser(df_title_basics, 'genres', ['Comedy', 'Drama', 'Action'])

In [None]:
df_movie_gross = pd.read_csv(home + 'bom.movie_gross.csv')
df_movie_gross.head()

In [None]:
df_budget_merge = pd.merge(df_movie_budgets, df_title_basics, how='inner', on=['movie', 'start_year'])
df_budget_merge.head()

In [None]:
def clean_dollars(dataframe, column_str):
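    # Strip thousands separators and dollar signs (literal match, not regex), then cast to integer
    dataframe[column_str] = (dataframe[column_str]
                             .str.replace(',', '', regex=False)
                             .str.replace('$', '', regex=False)
                             .astype('int64'))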
    return dataframe

In [None]:
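# Flag each row whose genres string contains 'Comedy' (column is 'genres'; values are capitalized)
df_budget_merge['comedy_id'] = df_budget_merge['genres'].str.contains('Comedy', na=False).astype(int)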