## Link Quotebank and our movie dataset
Once again, we chose not to upload the Quotebank datasets to GitHub because they are too big. This notebook filters the quotes to keep only those whose author appears in `movie_data_2015_2020.csv`.

The two data sources are:
- the **Quotebank datasets** ([link](https://zenodo.org/record/4277311)) for the years 2015 to 2020 (quotation-centric versions), which have to be downloaded and placed in the `data` folder. They contain quotes along with a list of potential authors and their probabilities.
- our own **movie dataset** extracted in `movieProcessing`.

The directory layout required to run this notebook is the following:
```
    .
    ├── mergeDataSets
    │   └── link_quotes_to_movies.ipynb   # this notebook 
    └── data/                             # the datasets we need
        ├── movie_data_2015_2020.csv      # from the moviePreprocessing notebook
        ├── quotes-2015.json.bz2
        ├── quotes-2016.json.bz2
        ├── quotes-2017.json.bz2
        ├── quotes-2018.json.bz2
        ├── quotes-2019.json.bz2
        └── quotes-2020.json.bz2
```
where each quotes file comes from the quotation-centric version of the Quotebank datasets, described at the link above.

We simply link the quotes to the actors and crew members from our movie dataset, using the most probable author.
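
As a rough illustration of this linking step, the sketch below filters a toy quotes table against a toy crew table. The rows are made up for the example; only the column names `speaker` and `primaryName` come from the code used later in this notebook.

```python
import pandas as pd

# Toy stand-ins for the real data (all values are invented for illustration)
toy_quotes = pd.DataFrame({
    "quotation": ["Quote A", "Quote B", "Quote C"],
    "speaker": ["John Goodman", "Some Politician", "Lindsey Weber"],
})
toy_crew = pd.DataFrame({"primaryName": ["John Goodman", "Lindsey Weber"]})

# Keep only the quotes whose speaker appears in the movie dataset
linked = toy_quotes[toy_quotes["speaker"].isin(toy_crew["primaryName"])]
print(linked)
```

Note that the match is done on the name string only, so two different people sharing the same name cannot be told apart by this filter.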

**The goal of this notebook is to generate datasets containing the filtered quotes. These datasets will be named `movie_{year}_crew_quotes.csv.gz` (where `{year}` is the year the quote was made) and located in the `data` folder.**

In [1]:
# Initial imports
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import multiprocessing

# Constants to easily update and read the code
DATA_DIR = "../data/"
MOVIE_DATASET = "movie_data_2015_2020.csv"
START_YEAR = 2015
END_YEAR = 2020
YEARS = range(START_YEAR, END_YEAR + 1)

# Auxiliary functions
def quotes_dataset(year):
    return f'quotes-{year}.json.bz2'

def output_dataset(year):
    return f"movie_{year}_crew_quotes.csv.gz"

In [2]:
# Open movie data set
moviedata = pd.read_csv(DATA_DIR + MOVIE_DATASET)
moviedata.head()

Unnamed: 0.1,Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,...,ordering_y,nconst,category,job,characters,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,0,tt1179933,movie,10 Cloverfield Lane,10 Cloverfield Lane,0.0,2016.0,,103.0,"Action,Drama,Horror",...,10.0,nm6618222,producer,producer,,Lindsey Weber,,,producer,"tt4530422,tt2660888,tt2548396,tt1179933"
1,1,tt4530422,movie,Overlord,Overlord,0.0,2018.0,,110.0,"Action,Horror,Sci-Fi",...,9.0,nm6618222,producer,producer,,Lindsey Weber,,,producer,"tt4530422,tt2660888,tt2548396,tt1179933"
2,2,tt1179933,movie,10 Cloverfield Lane,10 Cloverfield Lane,0.0,2016.0,,103.0,"Action,Drama,Horror",...,1.0,nm0000422,actor,,"[""Howard""]",John Goodman,1952.0,,"actor,soundtrack,producer","tt0101410,tt1179933,tt1024648,tt1907668"
3,3,tt2406566,movie,Atomic Blonde,Atomic Blonde,0.0,2017.0,,115.0,"Action,Thriller",...,3.0,nm0000422,actor,,"[""Emmett Kurzfeld""]",John Goodman,1952.0,,"actor,soundtrack,producer","tt0101410,tt1179933,tt1024648,tt1907668"
4,4,tt5968394,movie,Captive State,Captive State,0.0,2019.0,,109.0,"Action,Horror,Sci-Fi",...,1.0,nm0000422,actor,,"[""William Mulligan""]",John Goodman,1952.0,,"actor,soundtrack,producer","tt0101410,tt1179933,tt1024648,tt1907668"


In [3]:
# Remove some categories, otherwise we would have
# e.g. Joe Biden as an actor because of archive footage

# Remove archive footage credits
moviedata = moviedata[moviedata["category"] != "archive_footage"]
# Remove credits where the person appears as themselves (character listed as "Self")
moviedata = moviedata[moviedata["characters"] != "[\"Self\"]"]
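
To make the effect of these two filters concrete, here is a minimal sketch on a toy frame (the names and rows are hypothetical). It also shows that rows with a missing `characters` value, such as producers, are kept, because comparing `NaN` with `!=` evaluates to `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical credits: only the column names match the movie dataset
toy = pd.DataFrame({
    "primaryName": ["A", "B", "C", "D"],
    "category": ["actor", "archive_footage", "self", "producer"],
    "characters": ['["Howard"]', '["Self"]', '["Self"]', np.nan],
})

# Same two conditions as in the cell above, combined into one mask
mask = (toy["category"] != "archive_footage") & (toy["characters"] != '["Self"]')
kept = toy[mask]
print(kept["primaryName"].tolist())
```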

We run the code separately for each year of interest.

In [4]:
# Reads the given year's quotes and keeps those whose speaker is in the movie dataset
# (whether they are an actor, the director, or any other crew member)
def handleYear(year):
    # Collect the matching rows of each chunk and concatenate once at the end
    # (calling pd.concat inside the loop copies the accumulated frame every time)
    kept = []
    n_quotes = 0
    # We read the dataset in chunks because it is too big to load at once
    # Warning: it takes a long time
    with pd.read_json(DATA_DIR + quotes_dataset(year), lines=True, chunksize=10000) as df_reader:
        # Use tqdm to display a progress bar
        pbar = tqdm(df_reader)
        for chunk in pbar:
            matches = chunk[chunk["speaker"].isin(moviedata["primaryName"])]
            kept.append(matches)
            n_quotes += matches.shape[0]
            pbar.set_description(f"({year}) Number of quotes extracted = {n_quotes}")

    quotes = pd.concat(kept) if kept else pd.DataFrame()
    print(f"Year {year}: {quotes.shape[0]} quotes extracted")
    #quotes.head()

    # Save to the data folder
    quotes.to_csv(DATA_DIR + output_dataset(year), compression='gzip')

    # Display the speakers and their citation count, in descending order
    # (just to check)
    #quotes.groupby("speaker").agg(count=("speaker","count")).sort_values(["count"],ascending=False)
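
The chunked-reading pattern can be tried on a small scale before launching the full run. The sketch below writes a tiny bz2-compressed JSON-lines file to a temporary folder and filters it chunk by chunk. The file name, the records, and the `quoteID` values are invented for the example, and the context-manager form of `pd.read_json` assumes a reasonably recent pandas version.

```python
import bz2
import json
import os
import tempfile

import pandas as pd

# Invented records: one JSON object per line, as in a JSON-lines file
records = [
    {"quoteID": f"q{i}", "speaker": "John Goodman" if i % 2 == 0 else "Someone Else"}
    for i in range(5)
]

# Write them to a temporary .json.bz2 file (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "toy-quotes.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

known_names = pd.Series(["John Goodman"])

# Read two lines at a time; compression is inferred from the file extension
kept_chunks = []
with pd.read_json(path, lines=True, chunksize=2) as reader:
    for chunk in reader:
        kept_chunks.append(chunk[chunk["speaker"].isin(known_names)])

filtered = pd.concat(kept_chunks)
print(filtered)
```

Only one chunk is held in memory at a time, plus the rows that passed the filter.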

In [5]:
# Run all jobs in parallel
# The progress bars don't render well but it's better than nothing
# (and specifying a position for the bars doesn't render well either)

jobs = []
for year in YEARS:
    # Pass the year explicitly through `args` instead of a lambda
    # closing over the loop variable
    job = multiprocessing.Process(target=handleYear, args=(year,))
    jobs.append(job)
    job.start()

# Wait for each job to finish.
for job in jobs:
    job.join()

(2020) Number of quotes extracted = 137760: : 525it [08:16,  1.06it/s]


Year 2020: 137760 quotes extracted


(2016) Number of quotes extracted = 339622: : 1387it [29:12,  1.26s/it]


Year 2016: 339622 quotes extracted


(2015) Number of quotes extracted = 502457: : 2088it [38:37,  1.11s/it]


Year 2015: 502457 quotes extracted


(2019) Number of quotes extracted = 549496: : 2177it [41:01,  1.13s/it]


Year 2019: 549496 quotes extracted


(2018) Number of quotes extracted = 662315: : 2723it [48:57,  1.08s/it]


Year 2018: 662315 quotes extracted


(2017) Number of quotes extracted = 652761: : 2662it [49:58,  1.13s/it]


Year 2017: 652761 quotes extracted
