Notebook containing initial analyses and data handling pipelines. We will grade the correctness, quality of code, and quality of textual descriptions.

In [35]:
import pandas as pd



In [36]:
DATA_PATH = 'Data/'
PLOT_SUMMARIES_FILENAME = 'plot_summaries.txt'
CORENLP_PLOT_SUMMARIES_FILENAME = 'corenlp_plot_summaries.tar'  # TODO: download separately
MOVIE_METADATA_FILENAME = 'movie.metadata.tsv'
CHARACTER_METADATA_FILENAME = 'character.metadata.tsv'
CHARACTER_TYPES_FILENAME = 'tvtropes.clusters.txt'  # character types
CHARACTER_NAMES_FILENAME = 'name.clusters.txt'  # unique character names used in at least 2 movies

Loading data

In [37]:
col_names_plot_summaries = ['Wikipedia movie ID', 'Summary']
df_plot_summaries = pd.read_csv(DATA_PATH+PLOT_SUMMARIES_FILENAME, sep='\t', header=None, names=col_names_plot_summaries, index_col=0)
df_plot_summaries.head()

Wikipedia movie ID,Summary
23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
31186339,The nation of Panem consists of a wealthy Capi...
20663735,Poovalli Induchoodan is sentenced for six yea...
2231378,"The Lemon Drop Kid , a New York City swindler,..."
595909,Seventh-day Adventist Church pastor Michael Ch...


In [38]:
col_names_movie_metadata = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie name', 'Movie release date', 
                            'Movie box office revenue', 'Movie runtime', 'Movie languages (Freebase ID:name tuples)', 
                            'Movie countries (Freebase ID:name tuples)', 'Movie genres (Freebase ID:name tuples)']
df_movie_metadata = pd.read_csv(DATA_PATH + MOVIE_METADATA_FILENAME, sep='\t', header=None, names=col_names_movie_metadata, index_col=0)
df_movie_metadata.head()


Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie box office revenue,Movie runtime,Movie languages (Freebase ID:name tuples),Movie countries (Freebase ID:name tuples),Movie genres (Freebase ID:name tuples)
975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
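
The language, country, and genre columns hold JSON strings that map Freebase IDs to names, which makes them awkward to filter or count directly. The following is a minimal sketch of one way to turn such a cell into a plain list of names. The helper `freebase_names`, the `Genre names` column, and the small sample frame are hypothetical and not part of the notebook; the sample strings only mimic the format visible in the output above.

```python
import json

import pandas as pd


def freebase_names(cell):
    """Return the names from a JSON string mapping Freebase IDs to names.

    Missing cells yield an empty list (an assumption about how to treat them).
    """
    if pd.isna(cell):
        return []
    return list(json.loads(cell).values())


# Hypothetical sample in the same format as the genre column.
sample_movies = pd.DataFrame({
    'Movie genres (Freebase ID:name tuples)': [
        '{"/m/07s9rl0": "Drama"}',
        '{"/m/01jfsb": "Thriller", "/m/02n4kr": "Mystery"}',
        None,
    ]
})
sample_movies['Genre names'] = (
    sample_movies['Movie genres (Freebase ID:name tuples)'].apply(freebase_names)
)
print(sample_movies['Genre names'].tolist())
```

Keeping the result as a list per row leaves the door open for `Series.explode()` if one row per (movie, genre) pair is needed later.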


In [39]:
col_names_character_metadata = ['Wikipedia movie ID', 'Freebase movie ID', 'Movie release date', 'Character name', 'Actor date of birth', 'Actor gender', 'Actor height (in meters)', 'Actor ethnicity (Freebase ID)', 'Actor name', 'Actor age at movie release', 'Freebase character/actor map ID', 'Freebase character ID', 'Freebase actor ID']
df_character_metadata = pd.read_csv(DATA_PATH + CHARACTER_METADATA_FILENAME, sep='\t', header=None, names=col_names_character_metadata, index_col=0)
df_character_metadata.head()

Wikipedia movie ID,Freebase movie ID,Movie release date,Character name,Actor date of birth,Actor gender,Actor height (in meters),Actor ethnicity (Freebase ID),Actor name,Actor age at movie release,Freebase character/actor map ID,Freebase character ID,Freebase actor ID
975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.62,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.78,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.75,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.65,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg


In [40]:
col_names_character_types = ['Character type', 'Instances']
df_character_types = pd.read_csv(DATA_PATH+CHARACTER_TYPES_FILENAME, sep='\t', header=None, names=col_names_character_types)
df_character_types.head()

,Character type,Instances
0,absent_minded_professor,"{""char"": ""Professor Philip Brainard"", ""movie"":..."
1,absent_minded_professor,"{""char"": ""Professor Keenbean"", ""movie"": ""Richi..."
2,absent_minded_professor,"{""char"": ""Dr. Reinhardt Lane"", ""movie"": ""The S..."
3,absent_minded_professor,"{""char"": ""Dr. Harold Medford"", ""movie"": ""Them!..."
4,absent_minded_professor,"{""char"": ""Daniel Jackson"", ""movie"": ""Stargate""..."
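
Each `Instances` cell is a JSON object stored as a string, so the character and movie are not directly usable as columns. A possible way to structure it, sketched under assumptions: the helper `expand_json_column` is hypothetical, and the sample row uses only the `char` and `movie` keys that are visible in the truncated output above (the real rows contain further keys that are cut off there).

```python
import json

import pandas as pd


def expand_json_column(df, json_col):
    """Return a copy of df where json_col is replaced by one column per JSON key."""
    parsed = df[json_col].apply(json.loads)
    expanded = pd.json_normalize(parsed.tolist())
    expanded.index = df.index  # keep rows aligned with the original frame
    return pd.concat([df.drop(columns=[json_col]), expanded], axis=1)


# Hypothetical one-row sample; only the visible keys are reproduced.
sample_types = pd.DataFrame({
    'Character type': ['absent_minded_professor'],
    'Instances': ['{"char": "Daniel Jackson", "movie": "Stargate"}'],
})
sample_expanded = expand_json_column(sample_types, 'Instances')
print(sample_expanded)
```

Because `pd.json_normalize` derives the columns from whatever keys it finds, the sketch does not need to know the full key set in advance.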


In [41]:
col_names_character_names = ['Character name', 'Freebase character/actor map ID']
df_character_names = pd.read_csv(DATA_PATH+CHARACTER_NAMES_FILENAME, sep='\t', header=None, names=col_names_character_names)
df_character_names.head()

,Character name,Freebase character/actor map ID
0,Stuart Little,/m/0k3w9c
1,Stuart Little,/m/0k3wcx
2,Stuart Little,/m/0k3wbn
3,John Doe,/m/0jyg35
4,John Doe,/m/0k2_zn


Dealing with missing data

In [56]:
def print_columns_with_null_values(df, df_name):
    """Print the name of every column of df that contains at least one null value."""
    columns_with_null_values = df.columns[df.isnull().any()].tolist()
    print("Columns of %s with at least one null value:" % df_name, *columns_with_null_values, sep='\n', end='\n\n')

print_columns_with_null_values(df_movie_metadata, 'df_movie_metadata')
print_columns_with_null_values(df_character_metadata, 'df_character_metadata')
print_columns_with_null_values(df_plot_summaries, 'df_plot_summaries')
print_columns_with_null_values(df_character_names, 'df_character_names')
print_columns_with_null_values(df_character_types, 'df_character_types')

Columns of df_movie_metadata with at least one null value:
Movie release date
Movie box office revenue
Movie runtime

Columns of df_character_metadata with at least one null value:
Movie release date
Character name
Actor date of birth
Actor gender
Actor height (in meters)
Actor ethnicity (Freebase ID)
Actor name
Actor age at movie release
Freebase character ID
Freebase actor ID

Columns of df_plot_summaries with at least one null value:

Columns of df_character_names with at least one null value:

Columns of df_character_types with at least one null value:
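
The listing above shows which columns contain nulls but not how many. A small sketch of how the extent could be quantified per column; the helper `missing_value_summary` and the sample frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def missing_value_summary(df):
    """Return missing counts and fractions for columns with at least one null."""
    summary = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_fraction': df.isnull().mean(),
    })
    summary = summary[summary['missing_count'] > 0]
    return summary.sort_values('missing_count', ascending=False)


# Hypothetical sample: one complete column and one with half its values missing.
sample_metadata = pd.DataFrame({
    'Movie name': ['Ghosts of Mars', 'Brun bitter', 'White Of The Eye', 'A Woman in Flames'],
    'Movie runtime': [98.0, None, 110.0, None],
})
sample_summary = missing_value_summary(sample_metadata)
print(sample_summary)
```

The fraction is often more informative than the raw count when deciding whether a column can still support an analysis.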



In [None]:
#TODO: import dates in datetime format
#TODO: find a way to structure character_names and character_types in useful format
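
As a first step toward the date TODO, note that the release dates appear both as full dates (`2001-08-24`) and as bare years (`1988`). The sketch below extracts only the year, which is available in both forms; it is one possible approach, not the notebook's chosen method. The helper `add_release_year` and the `Movie release year` column are hypothetical names.

```python
import pandas as pd


def add_release_year(df, date_col='Movie release date'):
    """Return a copy of df with a nullable-integer 'Movie release year' column."""
    out = df.copy()
    # Take the leading four digits; anything else (including missing) becomes NaN.
    year = out[date_col].astype(str).str.extract(r'^(\d{4})', expand=False)
    out['Movie release year'] = pd.to_numeric(year, errors='coerce').astype('Int64')
    return out


# Hypothetical sample mirroring the two formats seen in the metadata, plus a gap.
sample_dates = pd.DataFrame({'Movie release date': ['2001-08-24', '1988', None]})
sample_with_year = add_release_year(sample_dates)
print(sample_with_year)
```

The nullable `Int64` dtype keeps years as integers while still allowing missing values, which a plain `int64` column cannot represent.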

In [16]:
df_movie_metadata.describe()

,Movie box office revenue,Movie runtime
count,8401.0,61291.0
mean,47993630.0,111.8192
std,112175300.0,4360.07
min,10000.0,0.0
25%,2083193.0,81.0
50%,10639690.0,93.0
75%,40716960.0,106.0
max,2782275000.0,1079281.0


In [17]:
df_character_metadata.describe()

,Actor height (in meters),Actor age at movie release
count,154824.0,292556.0
mean,1.788893,37.788523
std,4.37994,20.58787
min,0.61,-7896.0
25%,1.6764,28.0
50%,1.75,36.0
75%,1.83,47.0
max,510.0,103.0
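
The summary above contains values that cannot be real measurements, such as a height of 510 m and an age of -7896. One option is to treat such entries as missing. The sketch below does this with explicit bounds; the helper `mask_out_of_range` is hypothetical, and the bounds (0.5 to 2.5 m, 0 to 110 years) are assumptions chosen for illustration, not values taken from the notebook.

```python
import pandas as pd


def mask_out_of_range(series, lower, upper):
    """Return series with values outside [lower, upper] replaced by NaN."""
    return series.where(series.between(lower, upper))


# Sample values taken from the min/median/max rows of the summary above.
sample_heights = pd.Series([0.61, 1.75, 510.0])
sample_ages = pd.Series([-7896.0, 36.0, 103.0])

# Bounds are illustrative assumptions.
clean_heights = mask_out_of_range(sample_heights, 0.5, 2.5)
clean_ages = mask_out_of_range(sample_ages, 0, 110)
print(clean_heights.tolist(), clean_ages.tolist())
```

Masking rather than dropping rows keeps the other columns of an affected row available for analyses that do not depend on the implausible value.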


In [18]:
df_character_names.describe()

,Character name,Freebase character/actor map ID
count,2666,2666
unique,970,2661
top,Daffy Duck,/m/0gcy23_
freq,42,2


In [19]:
df_character_types.describe()

,Character type,Instances
count,501,501
unique,72,447
top,crazy_jealous_guy,"{""char"": ""Captain Jack Sparrow"", ""movie"": ""Pir..."
freq,25,5
