# Table of Contents

* [Data Sources](#Data-Sources)
* [Gather the Data](#Gather-the-Data)
    * [Merge Dataframes / Tables](#Merge-Dataframes-/-Tables)
* [Explore the Data](#Explore-the-Data)
* [Model the Data](#Model-the-Data)
* [Visualize the Results](#Visualize-the-Results)


<hr>

## Data Sources

Description of the IMDB data: https://www.imdb.com/interfaces/

IMDB Data Sources: https://datasets.imdbws.com/

<hr>

## Gather the Data

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

In [3]:
sns.set(rc={'figure.figsize': (12, 10), "lines.markeredgewidth": 0.5 })

In [4]:
#--------------------------------------------------------
#--  Input File 1:  name.basics.tsv
#--------------------------------------------------------
print('Reading name.basics.tsv')
nameBasics = pd.read_csv("../Data/name.basics.tsv/data.tsv", sep='\t')
print('Complete - 1 of 3')
print(nameBasics.head(5))

#--------------------------------------------------------
#--  Input File 6:  title.principals.tsv
#--------------------------------------------------------
print('Reading title.principals.tsv')
titlePrincipals = pd.read_csv("../Data/title.principals.tsv/data.tsv", sep='\t')
print('Complete - 2 of 3')
print(titlePrincipals.head(5))

#--------------------------------------------------------
#--  Input File 7:  title.ratings.tsv
#--------------------------------------------------------
print('Reading title.ratings.tsv')
titleRatings = pd.read_csv("../Data/title.ratings.tsv/data.tsv", sep='\t',dtype={"tconst": object, "averageRating": float, "numVotes": int})
print('CompletitleRatingste - 3 of 3')
print(titleRatings.head(5))

print('\n-----all data loaded -----')

Reading name.basics.tsv
Complete - 1 of 3
      nconst      primaryName birthYear deathYear  \
0  nm0000001     Fred Astaire      1899      1987   
1  nm0000002    Lauren Bacall      1924      2014   
2  nm0000003  Brigitte Bardot      1934        \N   
3  nm0000004     John Belushi      1949      1982   
4  nm0000005   Ingmar Bergman      1918      2007   

                primaryProfession                           knownForTitles  
0  soundtrack,actor,miscellaneous  tt0043044,tt0072308,tt0050419,tt0045537  
1              actress,soundtrack  tt0037382,tt0038355,tt0071877,tt0117057  
2     actress,soundtrack,producer  tt0054452,tt0057345,tt0049189,tt0059956  
3         actor,writer,soundtrack  tt0072562,tt0080455,tt0078723,tt0077975  
4           writer,director,actor  tt0083922,tt0050986,tt0050976,tt0060827  
Reading title.principals.tsv
Complete - 2 of 3
      tconst  ordering     nconst         category                      job  \
0  tt0000001         1  nm1588970             self 

### Dataset Descriptions

**name.basics.tsv.gz** - Contains the following information for names:

- nconst (string) – alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for


**title.principals.tsv** - Contains the principal cast/crew for titles:

- tconst (string) – alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) – alphanumeric unique identifier of the name/person
- category (string) – the category of job that person was in
- job (string) – the specific job title if applicable, else '\N'
- characters (string) – the name of the character played if applicable, else '\N' 

### Merge Dataframes / Tables

In [5]:
print(len(titlePrincipals))
titlePrincipals.head(5)

29289842


Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Herself""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N


In [None]:
# drop unused columns
titlePrincipals = titlePrincipals.drop(['knownForTitles'], axis=1)

In [21]:
# replace values with '\N' with the pandas 'NaN'
# source: https://stackoverflow.com/a/49406417
titlePrincipals = titlePrincipals.replace({'\\N': np.nan})

In [6]:
print(len(nameBasics))
nameBasics.head()

8749442


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0043044,tt0072308,tt0050419,tt0045537"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0037382,tt0038355,tt0071877,tt0117057"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0054452,tt0057345,tt0049189,tt0059956"
3,nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0072562,tt0080455,tt0078723,tt0077975"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0050986,tt0050976,tt0060827"


In [33]:
# replace values with '\N' with the pandas 'NaN'
nameBasics_subset = nameBasics.replace({'\\N': np.nan})

In [7]:
# returns a left join of both dataframes
principal_data = pd.merge(titlePrincipals, nameBasics, how='left', on=['nconst'])

# Check the length of the resulting join
print(len(principal_data))

principal_data.head()

29289842


Unnamed: 0,tconst,ordering,nconst,category,job,characters,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,tt0000001,1,nm1588970,self,\N,"[""Herself""]",Carmencita,\N,\N,soundtrack,"tt0057728,tt0000001"
1,tt0000001,2,nm0005690,director,\N,\N,William K.L. Dickson,1860,1935,"cinematographer,director,producer","tt0219560,tt1428455,tt6687694,tt1496763"
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N,William Heise,\N,1910,"cinematographer,director,producer","tt0241393,tt0285863,tt0241715,tt0229665"
3,tt0000002,1,nm0721526,director,\N,\N,Émile Reynaud,1844,1918,director,"tt2184231,tt0000003,tt2184201,tt0413219"
4,tt0000002,2,nm1335271,composer,\N,\N,Gaston Paulin,1839,1903,composer,"tt0000003,tt0000004,tt0000002"


<hr> 

## Explore the Data

**Decision Needed:** Do we limit the analysis of cast and crew to just movies? Or do we take into account their experience on other projects such as television?

**Decision Needed:** Do we remove cast/crew that are no longer living? The goal is to decide who to hire.

In [None]:
# drop unused columns
principal_data = principal_data.drop(['knownForTitles', 'birthYear'], axis=1)

In [19]:
len(principal_data)

29289842

In [21]:
# replace values with '\N' with the pandas 'NaN'
# source: https://stackoverflow.com/a/49406417
principal_data = principal_data.replace({'\\N': np.nan})

In [22]:
principal_data.deathYear.unique()

array([nan, '1935', '1910', '1918', '1903', '1931', '1933', '1896',
       '1936', '1951', '1928', '1940', '1948', '1939', '1970', '1963',
       '1954', '1907', '1943', '1921', '1941', '1911', '1920', '1925',
       '1938', '1905', '1859', '1944', '1930', '1956', '1926', '1917',
       '1924', '1915', '1968', '1957', '1950', '1959', '1955', '1932',
       '1958', '1703', '1870', '1953', '1832', '1929', '1780', '1923',
       '1945', '1971', '1616', '1961', '1937', '1984', '1981', '1916',
       '1947', '1949', '1976', '1962', '1942', '1969', '1902', '1946',
       '1979', '1919', '1898', '1745', '1901', '1882', '1966', '1890',
       '1899', '1913', '1885', '1934', '1973', '1909', '1974', '1960',
       '1912', '1852', '1863', '1965', '1991', '1922', '1964', '1895',
       '1985', '1883', '1892', '1980', '1977', '1967', '1927', '1893',
       '1865', '1877', '1873', '1972', '1983', '1914', '1986', '1975',
       '1875', '1880', '1952', '1837', '1999', '1851', '1908', '1673',
       '1

In [25]:
# check what rows are missing data
principal_data.isnull().sum()

tconst                      0
ordering                    0
nconst                      0
category                    0
job                  24437520
characters           14176704
primaryName              7753
deathYear            25671366
primaryProfession     1495813
dtype: int64

In [None]:
principal_data = principal_data.drop([principal_data.deathYear.isnull() == False])

In [23]:
principal_data.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters,primaryName,deathYear,primaryProfession
0,tt0000001,1,nm1588970,self,,"[""Herself""]",Carmencita,,soundtrack
1,tt0000001,2,nm0005690,director,,,William K.L. Dickson,1935.0,"cinematographer,director,producer"
2,tt0000001,3,nm0374658,cinematographer,director of photography,,William Heise,1910.0,"cinematographer,director,producer"
3,tt0000002,1,nm0721526,director,,,Émile Reynaud,1918.0,director
4,tt0000002,2,nm1335271,composer,,,Gaston Paulin,1903.0,composer


In [None]:
# only keep the row if the person is still living
principal_data = principal_data.drop([principal_data.deathYear != '\\N'])

In [None]:
principal_data.to_csv("../Data/principals_data.csv", sep='\t', index=False)

In [7]:
# returns a left join of both dataframes
#principal_data = pd.merge(principal_data, titleRatings, how='left', on=['tconst'])

# Check the length of the resulting join
print(len(principal_data))

principal_data.head()

29289842


Unnamed: 0,tconst,ordering,nconst,category,job,characters,averageRating,numVotes
0,tt0000001,1,nm1588970,self,\N,"[""Herself""]",5.8,1393.0
1,tt0000001,2,nm0005690,director,\N,\N,5.8,1393.0
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N,5.8,1393.0
3,tt0000002,1,nm0721526,director,\N,\N,6.5,163.0
4,tt0000002,2,nm1335271,composer,\N,\N,6.5,163.0


In [10]:
principals = principal_data.copy()

<hr> 

## Model the Data

<hr> 

## Visualize the Results