# Data Pipeline

This notebook describes the data pipeline for milestone 3 of our project. It gives an overview of how we intend to load, clean, parse, and use our data for storytelling.

## Imports

In [1]:
import pandas as pd
import json
import numpy as np

# Loading and cleaning the data

Given the size of the data, we load everything into pandas DataFrames.

## Handling data

In [2]:
CHARACTER_DATA_PATH = "./data/character.metadata.tsv"
MOVIE_DATA_PATH = "./data/movie.metadata.tsv"
SUMMARIES_DATA_PATH = "./data/plot_summaries.txt"
NAME_DATA_PATH = "./data/name.clusters.txt"
TYPE_DATA_PATH = "./data/tvtropes.clusters.txt"

## Characters data

In [3]:
CHARACTER_HEADER = [
    "Wikipedia_movie_ID",
    "Freebase_movie_ID",
    "Movie_release_date",
    "Character_name",
    "Actor_date_of_birth",
    "Actor_gender",
    "Actor_height",
    "Actor_ethnicity",
    "Actor_name",
    "Actor_age_at_movie_release",
    "Freebase_character/actor_map_ID",
    "Freebase_character_ID",
    "Freebase_actor_ID",
]
character = pd.read_table(CHARACTER_DATA_PATH, header=None, names=CHARACTER_HEADER)

In [4]:
character.describe()

Unnamed: 0,Wikipedia_movie_ID,Actor_height,Actor_age_at_movie_release
count,450669.0,154824.0,292556.0
mean,13969750.0,1.788893,37.788523
std,10796620.0,4.37994,20.58787
min,330.0,0.61,-7896.0
25%,3759292.0,1.6764,28.0
50%,11890650.0,1.75,36.0
75%,23665010.0,1.83,47.0
max,37501920.0,510.0,103.0


First, we clean the dataset:
- negative ages are replaced with NaN
- actor heights above 2.5 m are replaced with NaN as well, since they are not plausible heights for real actors
- dates are converted to datetime format

In [5]:
character["Actor_age_at_movie_release"] = character["Actor_age_at_movie_release"].apply(
    lambda x: np.nan if x < 0 else x
)
character["Actor_height"] = character["Actor_height"].apply(
    lambda x: np.nan if x > 2.5 else x
)
character["Movie_release_date"] = pd.to_datetime(
    character["Movie_release_date"], format="mixed", utc=True, errors="coerce"
)
character["Actor_date_of_birth"] = pd.to_datetime(
    character["Actor_date_of_birth"], format="mixed", utc=True, errors="coerce"
)
character.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Freebase_movie_ID,Movie_release_date,Character_name,Actor_date_of_birth,Actor_gender,Actor_height,Actor_ethnicity,Actor_name,Actor_age_at_movie_release,Freebase_character/actor_map_ID,Freebase_character_ID,Freebase_actor_ID
337724,20523067,/m/051z47t,2009-01-17 00:00:00+00:00,Jason,1983-09-25 00:00:00+00:00,M,1.78,,Donald Glover,25.0,/m/0cry6nt,/m/0h6dzx0,/m/02778wm
431780,27175131,/m/0bwkjdw,1965-02-03 00:00:00+00:00,Duke of Orgagna,1913-05-06 00:00:00+00:00,M,1.9,,Stewart Granger,51.0,/m/0g9g0jx,/m/0h5nm0y,/m/026d77
210883,4989559,/m/0cyq_t,1997-01-01 00:00:00+00:00,Daisy Duke,1954-03-01 00:00:00+00:00,F,1.73,/m/01g7zj,Catherine Bach,42.0,/m/0k4zp7,/m/02qnzg,/m/028f6l
138147,2187291,/m/06tftd,2005-01-01 00:00:00+00:00,,1976-11-07 00:00:00+00:00,M,,,Jonas Talkington,28.0,/m/0gc_3z9,,/m/0gcxdym
175354,20689529,/m/05202kg,2008-03-18 00:00:00+00:00,Ricky,1978-08-17 00:00:00+00:00,F,,,Karena Lam,29.0,/m/0gw1hw5,/m/0k6djm_,/m/08rd9l
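
The same cleaning rules can also be written without `apply`. The following is a minimal sketch on a toy frame, not the notebook's own code: it assumes the column names used above and uses `Series.mask`, which replaces entries with NaN where a condition holds. The toy values reuse the extremes reported by `describe()`.

```python
import numpy as np
import pandas as pd

# Toy rows with the notebook's column names; -7896 and 510 are the
# extreme values reported by describe() above, the rest is illustrative.
toy_character = pd.DataFrame(
    {
        "Actor_age_at_movie_release": [25.0, -7896.0, np.nan],
        "Actor_height": [1.78, 510.0, np.nan],
    }
)

# Series.mask sets entries to NaN where the condition is True.
# Comparisons involving NaN are False, so missing values stay missing.
toy_character["Actor_age_at_movie_release"] = toy_character[
    "Actor_age_at_movie_release"
].mask(toy_character["Actor_age_at_movie_release"] < 0)
toy_character["Actor_height"] = toy_character["Actor_height"].mask(
    toy_character["Actor_height"] > 2.5
)

print(toy_character)
```

Only the first row keeps both values; the out-of-range and missing entries end up as NaN.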


## TODO
We should map each ethnicity to its corresponding Freebase entry (the column currently holds Freebase IDs such as /m/01g7zj). See https://pypi.org/project/freebase/

## Movies data

In [6]:
MOVIE_HEADER = [
    "Wikipedia_movie_ID",
    "Freebase_movie_ID",
    "Movie_name",
    "Movie_release_date",
    "Movie_box_office_revenue",
    "Movie_runtime",
    "Movie_languages",
    "Movie_countries",
    "Movie_genres",
]

movie = pd.read_table(MOVIE_DATA_PATH, header=None, names=MOVIE_HEADER)

The next cell parses the JSON dictionaries stored in the genre, country, and language columns and keeps only their values as a list. Empty dictionaries become NaN.

In [7]:
def format_dict(x):
    n = len(x)
    if n == 0:
        return np.nan
    else:
        return list(x.values())


try:
    movie["Movie_genres"] = movie["Movie_genres"].apply(json.loads).apply(format_dict)
    movie["Movie_countries"] = (
        movie["Movie_countries"].apply(json.loads).apply(format_dict)
    )
    movie["Movie_languages"] = (
        movie["Movie_languages"].apply(json.loads).apply(format_dict)
    )
except TypeError:
    print("Data has already been parsed and modified.")
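
To make the transformation concrete, here is a small worked example on a single cell. It is a sketch: the helper is redefined so the block runs on its own, and the Freebase keys in the sample string are illustrative rather than copied from the data file.

```python
import json

import numpy as np


def format_dict(x):
    # Same behaviour as the helper above: keep the values, NaN if empty.
    return list(x.values()) if len(x) > 0 else np.nan


# Illustrative raw cell: a JSON object mapping Freebase IDs to labels.
raw_genres = '{"/m/01jfsb": "Thriller", "/m/03npn": "Horror"}'

parsed_genres = format_dict(json.loads(raw_genres))
empty_genres = format_dict(json.loads("{}"))

print(parsed_genres)  # ['Thriller', 'Horror']
print(empty_genres)   # nan
```

Running the cell a second time hands `json.loads` a list or a float instead of a string, which raises `TypeError`; this is the case the `try`/`except` above catches.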

There is a bit of cleaning to perform:
- runtimes (in minutes) below 1 or above 15000 are replaced with NaN; the longest runtime we could confirm as genuine lies below this upper bound
- dates are converted to datetime format

In [8]:
movie["Movie_runtime"] = movie["Movie_runtime"].apply(
    lambda x: np.nan if 1 > x or x > 15000 else x
)
movie["Movie_release_date"] = pd.to_datetime(
    movie["Movie_release_date"], format="mixed", utc=True, errors="coerce"
)
movie.sample(5)

Unnamed: 0,Wikipedia_movie_ID,Freebase_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_runtime,Movie_languages,Movie_countries,Movie_genres
55319,32842014,/m/0hr6gzb,Royal Rumble,2012-01-01 00:00:00+00:00,,,,,[Action]
62016,7275660,/m/0kv33z,The Cocoon and the Butterfly,1914-01-01 00:00:00+00:00,,,"[Silent film, English Language]",[United States of America],"[Short Film, Silent film, Indie, Black-and-white]"
17885,33161490,/m/0h65b7f,Ware ni utsu yoi ari,1990-11-17 00:00:00+00:00,,106.0,,,"[Action, Erotica]"
49228,31153496,/m/0gh8j2h,Chittu Kuruvi,1978-01-01 00:00:00+00:00,,,[Tamil Language],[India],
947,36036213,/m/0d6np27,Buried Alive,1990-05-09 00:00:00+00:00,,93.0,[English Language],[United States of America],"[Thriller, Horror]"
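
The two cleaning steps can be illustrated on a handful of toy values. This is a sketch under assumptions: the runtimes and date strings are invented for the example, `format="mixed"` requires pandas 2.0 or later, and `Series.where` with `between` is used here as a vectorised equivalent of the lambda above.

```python
import numpy as np
import pandas as pd

# Invented values covering the cases handled by the cleaning cell.
toy_movie = pd.DataFrame(
    {
        "Movie_runtime": [93.0, 0.0, 14400.0, 1079281.0, np.nan],
        "Movie_release_date": ["1990-05-09", "1914", "1978", "not a date", None],
    }
)

# Keep runtimes within [1, 15000]; everything else (including NaN) is NaN.
toy_movie["Movie_runtime"] = toy_movie["Movie_runtime"].where(
    toy_movie["Movie_runtime"].between(1, 15000)
)

# format="mixed" infers the format per element, so year-only strings and
# full dates can coexist; errors="coerce" turns unparseable entries into NaT.
toy_movie["Movie_release_date"] = pd.to_datetime(
    toy_movie["Movie_release_date"], format="mixed", utc=True, errors="coerce"
)

print(toy_movie)
```

Year-only strings are parsed as January 1 of that year, which explains the many `-01-01` dates in the sample above.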


In [9]:
movie.describe()

Unnamed: 0,Wikipedia_movie_ID,Movie_box_office_revenue,Movie_runtime
count,81741.0,8401.0,61233.0
mean,17407840.0,47993630.0,94.298795
std,10987910.0,112175300.0,91.294848
min,330.0,10000.0,1.0
25%,7323695.0,2083193.0,81.0
50%,17778990.0,10639690.0,93.0
75%,27155730.0,40716960.0,106.0
max,37501920.0,2782275000.0,14400.0


## Summaries data

In [10]:
summaries = pd.read_table(
    SUMMARIES_DATA_PATH, header=None, names=["Wikipedia_movie_ID", "Summary"]
)
summaries.head()

Unnamed: 0,Wikipedia_movie_ID,Summary
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


## Character names data

In [11]:
name = pd.read_table(
    NAME_DATA_PATH,
    header=None,
    names=["Character_name", "Freebase_character/actor_map_ID"],
)
name.head()

Unnamed: 0,Character_name,Freebase_character/actor_map_ID
0,Stuart Little,/m/0k3w9c
1,Stuart Little,/m/0k3wcx
2,Stuart Little,/m/0k3wbn
3,John Doe,/m/0jyg35
4,John Doe,/m/0k2_zn


## Character types data

In [12]:
character_type = pd.read_table(
    TYPE_DATA_PATH, header=None, names=["Character_type", "Instance"]
)
character_type = pd.concat(
    [character_type, character_type["Instance"].apply(json.loads).apply(pd.Series)],
    axis=1,
)
character_type.drop(columns="Instance", inplace=True)
character_type.rename(
    columns={
        "char": "Character_name",
        "movie": "Movie_name",
        "id": "Freebase_character/actor_map_ID",
        "actor": "Actor_name",
    },
    inplace=True,
)
character_type.head()

Unnamed: 0,Character_type,Character_name,Movie_name,Freebase_character/actor_map_ID,Actor_name
0,absent_minded_professor,Professor Philip Brainard,Flubber,/m/0jy9q0,Robin Williams
1,absent_minded_professor,Professor Keenbean,Richie Rich,/m/02vchl3,Michael McShane
2,absent_minded_professor,Dr. Reinhardt Lane,The Shadow,/m/0k6fkc,Ian McKellen
3,absent_minded_professor,Dr. Harold Medford,Them!,/m/0k6_br,Edmund Gwenn
4,absent_minded_professor,Daniel Jackson,Stargate,/m/0k3rhh,James Spader
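
The JSON expansion above can also be done by building the new columns in a single `DataFrame` call, which tends to be faster than `.apply(pd.Series)` on large frames. The sketch below is an alternative, not the notebook's code. It uses one toy row reconstructed from the output shown above; the exact key order in the raw file is an assumption.

```python
import json

import pandas as pd

# One toy row; the JSON string is rebuilt from the displayed output.
toy_types = pd.DataFrame(
    {
        "Character_type": ["absent_minded_professor"],
        "Instance": [
            '{"char": "Professor Philip Brainard", "movie": "Flubber", '
            '"id": "/m/0jy9q0", "actor": "Robin Williams"}'
        ],
    }
)

# Parse every JSON string, then turn the list of dicts into columns at once.
parsed = pd.DataFrame(
    toy_types["Instance"].map(json.loads).tolist(), index=toy_types.index
)

# Replace the raw column with the parsed ones and apply the same renaming.
expanded = (
    toy_types.drop(columns="Instance")
    .join(parsed)
    .rename(
        columns={
            "char": "Character_name",
            "movie": "Movie_name",
            "id": "Freebase_character/actor_map_ID",
            "actor": "Actor_name",
        }
    )
)

print(expanded)
```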
