## Link QuoteBank and our movie dataset
As before, we did not upload the QuoteBank datasets to GitHub because they are too large. This notebook filters the quotes to keep only those whose author appears in `movie_data_2015_2020.csv`.

The two data sources are:
- the **QuoteBank datasets** ([link](https://zenodo.org/record/4277311)) for the years 2015 to 2020 (quotation-centric versions), which must be downloaded and placed in the `data` folder. Each quote comes with a list of potential authors and their probabilities.
- our own **movie dataset** extracted in `movieProcessing`.

The directory layout required to run this notebook is the following:
```
    .
    ├── mergeDataSets
    │   └── link_quotes_to_movies.ipynb   # this notebook 
    └── data/                             # the datasets we need
        ├── movie_data_2015_2020.csv      # from the moviePreprocessing notebook
        ├── quotes-2015.json.bz2
        ├── quotes-2016.json.bz2
        ├── quotes-2017.json.bz2
        ├── quotes-2018.json.bz2
        ├── quotes-2019.json.bz2
        └── quotes-2020.json.bz2
```
where each quotes file comes from the quotation-centric version of the QuoteBank datasets, described at the link above.

We simply link the quotes to the actors and crew members from our movie dataset, using the most probable author.
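
To illustrate what "most probable author" means, the sketch below picks the top candidate from a single record. It assumes that each record carries a `probas` field holding `[name, probability]` pairs, with probabilities stored as strings. That schema and the sample record are assumptions made for illustration. The rest of this notebook relies only on the `speaker` column, which we take to already hold that top candidate.

```python
import json

# Hypothetical record, shaped like one line of a quotation-centric file.
# The field names `speaker` and `probas` are assumptions about the schema.
line = json.dumps({
    "quotation": "We had a great time on set.",
    "speaker": "Jane Doe",
    "probas": [["Jane Doe", "0.81"], ["None", "0.15"], ["John Roe", "0.04"]],
})

def most_probable_author(record):
    """Return the (name, probability) pair with the highest probability."""
    name, proba = max(record["probas"], key=lambda pair: float(pair[1]))
    return name, float(proba)

record = json.loads(line)
best_name, best_proba = most_probable_author(record)
print(best_name, best_proba)
```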

**The goal of this notebook is to generate datasets containing the filtered quotes. These datasets will be named `movie_{year}_crew_quotes.csv.gz` (where `{year}` is the year the quote was made) and located in the `data` folder.**

In [13]:
# Initial imports
import pandas as pd
import matplotlib.pyplot as plt
import csv
from tqdm import tqdm
import multiprocessing
import datetime

# Constants to easily update and read the code
DATA_DIR = "../data/"
MOVIE_DATASET = "movie_data_2015_2020.csv"
START_YEAR = 2015
END_YEAR = 2020
YEARS = range(START_YEAR, END_YEAR + 1)

# Auxiliary functions
def quotes_dataset(year):
    return f'quotes-{year}.json.bz2'

def output_dataset(year):
    return f"movie_{year}_crew_quotes.csv.gz"

In [14]:
# Open movie data set
moviedata = pd.read_csv(DATA_DIR + MOVIE_DATASET)
moviedata.head()

Unnamed: 0.1,Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,...,ordering_y,nconst,category,job,characters,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,0,tt1179933,movie,10 Cloverfield Lane,10 Cloverfield Lane,0.0,2016.0,,103.0,"Action,Drama,Horror",...,10.0,nm6618222,producer,producer,,Lindsey Weber,,,producer,"tt4530422,tt2660888,tt2548396,tt1179933"
1,1,tt4530422,movie,Overlord,Overlord,0.0,2018.0,,110.0,"Action,Horror,Sci-Fi",...,9.0,nm6618222,producer,producer,,Lindsey Weber,,,producer,"tt4530422,tt2660888,tt2548396,tt1179933"
2,2,tt1179933,movie,10 Cloverfield Lane,10 Cloverfield Lane,0.0,2016.0,,103.0,"Action,Drama,Horror",...,1.0,nm0000422,actor,,"[""Howard""]",John Goodman,1952.0,,"actor,soundtrack,producer","tt0101410,tt1179933,tt1024648,tt1907668"
3,3,tt2406566,movie,Atomic Blonde,Atomic Blonde,0.0,2017.0,,115.0,"Action,Thriller",...,3.0,nm0000422,actor,,"[""Emmett Kurzfeld""]",John Goodman,1952.0,,"actor,soundtrack,producer","tt0101410,tt1179933,tt1024648,tt1907668"
4,4,tt5968394,movie,Captive State,Captive State,0.0,2019.0,,109.0,"Action,Horror,Sci-Fi",...,1.0,nm0000422,actor,,"[""William Mulligan""]",John Goodman,1952.0,,"actor,soundtrack,producer","tt0101410,tt1179933,tt1024648,tt1907668"


In [15]:
# Remove some categories; otherwise we would have,
# e.g., Joe Biden listed as an actor because of archive footage

# Remove archive footage appearances
moviedata = moviedata[moviedata["category"] != "archive_footage"]
# Remove appearances where the person plays themselves (character listed as "Self")
moviedata = moviedata[moviedata["characters"] != "[\"Self\"]"]

We run the code separately for each year of interest.

In [None]:
# Reads the given year's quotes and keeps those whose speaker is in the movie dataset
# (whether it's an actor, the director, or any other crew member)
def handleYear(year):
    # We read the dataset in chunks because it is too big to load at once
    # Warning: it takes a long time
    with pd.read_json(DATA_DIR + quotes_dataset(year), lines=True, chunksize=10000) as df_reader:

        # Concatenate the quotes whose speaker is in the movie dataset
        quotes = pd.DataFrame()
        # Use tqdm to display a progress bar
        pbar = tqdm(df_reader)
        for chunk in pbar:
            quotes = pd.concat((quotes,chunk[chunk["speaker"].isin(moviedata["primaryName"])]))
            pbar.set_description(f"({year}) Number of quotes extracted = {quotes.shape[0]}")

    print(f"Year {year}: {quotes.shape[0]} quotes extracted")
    #quotes.head()
    
    # Save to the data folder
    quotes.to_csv(DATA_DIR + output_dataset(year), compression='gzip')
    
    # Display the speakers and their citation count, in descending order
    # (just to check)
    #quotes.groupby("speaker").agg(count=("speaker","count")).sort_values(["count"],ascending=False)
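
The chunked filtering above can be tried on a small in-memory sample before launching the full run. The sketch below is only an illustration: the names and quotes are made up, and it collects the matching rows in a list and concatenates once at the end, which is a variation on the cell above rather than what the notebook does.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the movie dataset and one QuoteBank file.
crew_names = pd.Series(["Jane Doe", "John Roe"], name="primaryName")
raw = "\n".join([
    '{"speaker": "Jane Doe", "quotation": "first"}',
    '{"speaker": "Another Person", "quotation": "second"}',
    '{"speaker": "John Roe", "quotation": "third"}',
    '{"speaker": "Someone Else", "quotation": "fourth"}',
    '{"speaker": "Jane Doe", "quotation": "fifth"}',
])

kept = []
# With chunksize set, read_json yields DataFrames of at most 2 rows each
with pd.read_json(io.StringIO(raw), lines=True, chunksize=2) as reader:
    for chunk in reader:
        # Keep only the rows whose speaker is one of the crew names
        kept.append(chunk[chunk["speaker"].isin(crew_names)])

# A single concatenation avoids re-copying the accumulated rows at every chunk
quotes = pd.concat(kept, ignore_index=True)
print(quotes)
```

Growing a DataFrame with `pd.concat` inside the loop copies all previously kept rows at each step, so the list-based variant may help when many chunks contain matches.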

In [None]:
# Run all jobs in parallel
# The progress bars don't render well but it's better than nothing
# (and specifying a position for the bars doesn't render well either).

jobs = []
for year in YEARS:
    # Pass the year as an argument instead of capturing the loop variable in a lambda
    job = multiprocessing.Process(target=handleYear, args=(year,))
    jobs.append(job)
    job.start()

# Wait for each job to finish.
for job in jobs:
    job.join()

In [16]:
# Ignore the columns related to the crew and other uninteresting columns.
moviedata.drop(['Unnamed: 0', 'startYear', 'endYear', 'titleType', 'nconst', 'category', 'job', 'characters', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles'], axis=1, inplace=True)
# Only keep one row per movie.
moviedata.drop_duplicates(subset="tconst", keep="first", inplace=True)

In [17]:
# Extract the gross value as an integer, to compare movies easily.
moviedata['Total Gross'] = moviedata['Total Gross'].map(lambda x: int(x[1:].replace(',', '')))
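
As a possible alternative, the same conversion can be written with vectorized string methods. This is a sketch under the assumption that every value looks like `$936,662,225` (a currency symbol followed by comma-separated digits); the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values; the "$1,234" shape is inferred from the cell above.
gross = pd.Series(["$936,662,225", "$858,373,000", "$700,059,566"])

# Drop every non-digit character, then convert to integers.
gross_int = gross.str.replace(r"\D", "", regex=True).astype("int64")
print(gross_int.tolist())
```

Note that this variant, like the original, would fail on missing values, so those would have to be dropped or filled first.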

In [18]:
# Extract the top 50 movies ranked by their box-office results.
moviedata.sort_values('Total Gross', ascending=False).head(50)["primaryTitle"].tolist()

['Star Wars: Episode VII - The Force Awakens',
 'Avengers: Endgame',
 'Black Panther',
 'Avengers: Infinity War',
 'Jurassic World',
 'Star Wars: Episode VIII - The Last Jedi',
 'Incredibles 2',
 'The Lion King',
 'Rogue One: A Star Wars Story',
 'Star Wars: Episode IX - The Rise of Skywalker',
 'Beauty and the Beast',
 'Finding Dory',
 'Frozen II',
 'Avengers: Age of Ultron',
 'Toy Story 4',
 'Captain Marvel',
 'Jurassic World: Fallen Kingdom',
 'Wonder Woman',
 'Captain America: Civil War',
 'Jumanji: Welcome to the Jungle',
 'Spider-Man: Far from Home',
 'Guardians of the Galaxy Vol. 2',
 'The Secret Life of Pets',
 'The Jungle Book',
 'Deadpool',
 'Inside Out',
 'Aladdin',
 'Furious 7',
 'American Sniper',
 'Zootopia',
 'The Hunger Games: Mockingjay - Part 1',
 'Minions',
 'Joker',
 'Aquaman',
 'Spider-Man: Homecoming',
 'Batman v Superman: Dawn of Justice',
 'It',
 'Suicide Squad',
 'Jumanji: The Next Level',
 'Deadpool 2',
 'Thor: Ragnarok',
 'The Hunger Games: Mockingjay - Part 

In [24]:
# Create a list of movie titles with their patterns to look for in the quotes
# (if a given pattern is found, it is considered that the quote mentions the movie)
alttitles = [
    ('Star Wars: Episode VII - The Force Awakens', ['Star Wars', 'The force awakens']),
    ('Avengers: Endgame', ['Avengers', 'Endgame']),
    ('Black Panther', ['Black Panther']),
    ('Avengers: Infinity War', ['Infinity War', 'Avengers']),
    ('Jurassic World', ['Jurassic']), # Probably a bad idea but let's see
    ('Star Wars: Episode VIII - The Last Jedi', ['Star Wars', 'Last Jedi']),
    ('Incredibles 2', ['Incredibles']), # Possibly too broad
    ('The Lion King', ['Lion King']),
    ('Rogue One: A Star Wars Story', ['Rogue One', 'Star Wars']),
    ('Star Wars: Episode IX - The Rise of Skywalker', ['Star Wars', 'Skywalker']),
    ('Beauty and the Beast', ['Beauty and the Beast']),
    ('Finding Dory', ['Finding Dory']),
    ('Frozen II', ['Frozen']), # Probably a bad idea but let's see
    ('Avengers: Age of Ultron', ['Avengers', 'Ultron']),
    ('Toy Story 4', ['Toy Story']),
    ('Captain Marvel', ['Marvel']),
    ('Jurassic World: Fallen Kingdom', ['Jurassic', 'Fallen Kingdom']),
    ('Captain America: Civil War', ['Captain America', 'Civil War']),
    ('Jumanji: Welcome to the Jungle', ['Jumanji', 'Welcome to the Jungle']),
    #('Spider-Man: Far from Home', ['Spider-Man', 'Spider Man', 'Far from Home']),
    ('Guardians of the Galaxy Vol. 2', ['Guardians of the Galaxy']),
    ('The Secret Life of Pets', ['Secret Life of Pets']),
    ('The Jungle Book', ['Jungle Book']),
    ('Deadpool', ['Deadpool']),
    ('Inside Out', ['Inside Out']),
    ('Aladdin', ['Aladdin']),
    ('Furious 7', ['Furious']), # Probably a bad idea but let's see
    ('American Sniper', ['American Sniper']),
    ('Zootopia', ['Zootopia']),
    ('The Hunger Games: Mockingjay - Part 1', ['Hunger Games', 'Mockingjay']),
    ('Minions', ['Minions']), # Possibly too broad
    ('Joker', ['Joker']),
    ('Aquaman', ['Aquaman']),
    ('Spider-Man: Homecoming', ['Spider-Man', 'Spider Man', 'Homecoming']),
    ('Batman v Superman: Dawn of Justice', ['Batman', 'Superman', 'Dawn of Justice']),
    #('It', []), # just not possible
    ('Suicide Squad', ['Suicide Squad']),
    ('Jumanji: The Next Level', ['Jumanji', 'The Next Level']),
    ('Deadpool 2', ['Deadpool']),
    ('Thor: Ragnarok', ['Thor', 'Ragnarok']),
    ('The Hunger Games: Mockingjay - Part 2', ['Hunger Games', 'Mockingjay']),
    ('The Grinch', ['Grinch']),
    ('Sing', ['Sing']), # Probably a bad idea but let's see
    ('Despicable Me 3', ['Despicable Me']),
    ('The Hobbit: The Battle of the Five Armies', ['Hobbit', 'Battle of the Five Armies']),
    ('Moana', ['Moana']),
    ('Fantastic Beasts and Where to Find Them', ['Fantastic Beasts']),
    ('Doctor Strange', ['Doctor Strange']),
    ('Justice League', ['Justice League'])
]

In [20]:
# Get the full release date as a datetime
moviedata["release"] = moviedata["Release Date"] + " " + moviedata["Year"].astype(int).astype(str)
moviedata["release"] = pd.to_datetime(moviedata["release"])
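
The date reconstruction can be checked on a couple of rows. The sketch below assumes that `Release Date` holds an abbreviated month and a day (such as `Feb 12`), which is a guess about the column's format; the rows are hypothetical. Passing an explicit `format` is one option to make unexpected strings fail loudly instead of being guessed.

```python
import pandas as pd

# Hypothetical rows; the "Feb 12" shape of `Release Date` is an assumption.
movies = pd.DataFrame({
    "Release Date": ["Feb 12", "Dec 18"],
    "Year": [2016.0, 2015.0],
})

# Build strings such as "Feb 12 2016" from the two columns
combined = movies["Release Date"] + " " + movies["Year"].astype(int).astype(str)
# Parse them with an explicit month-day-year format
movies["release"] = pd.to_datetime(combined, format="%b %d %Y")
print(movies["release"].tolist())
```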

In [21]:
# Returns the list of years (between START_YEAR and END_YEAR) that overlap
# the window of +/- timedelta around the release of the movie.
# timedelta must be less than one year, for convenience.
def get_years(movie, timedelta):
    row = moviedata[moviedata["primaryTitle"] == movie].iloc[0]
    release = row["release"]
    year = int(row["Year"])
    years = [year]
    if release - datetime.datetime(year, 1, 1) < timedelta:
        years.append(year-1)
    if datetime.datetime(year, 12, 31) - release < timedelta:
        years.append(year+1)
    # Return a list rather than a lazy filter object, so the result can be reused
    return [y for y in years if START_YEAR <= y <= END_YEAR]
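
To make the year selection concrete, here is a standalone sketch of the same logic that takes the release date directly instead of looking it up in `moviedata`. The function name `years_near_release` and the example dates are hypothetical.

```python
import datetime

START_YEAR, END_YEAR = 2015, 2020

def years_near_release(release, timedelta):
    """Years overlapping the window release +/- timedelta (timedelta < 1 year)."""
    year = release.year
    years = [year]
    # The window reaches back into the previous year
    if release - datetime.datetime(year, 1, 1) < timedelta:
        years.append(year - 1)
    # The window reaches forward into the next year
    if datetime.datetime(year, 12, 31) - release < timedelta:
        years.append(year + 1)
    # Keep only the years for which we have a quotes file
    return sorted(y for y in years if START_YEAR <= y <= END_YEAR)

window = datetime.timedelta(weeks=25)
early = years_near_release(datetime.datetime(2016, 2, 12), window)
late = years_near_release(datetime.datetime(2019, 11, 22), window)
edge = years_near_release(datetime.datetime(2015, 1, 10), window)
print(early, late, edge)
```

With a 25-week window, a mid-February release also needs the previous year's file, a late-November release also needs the next year's file, and years outside 2015 to 2020 are dropped.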

In [22]:
# Returns the list of quotes that mention one of the titles, and that
# are within the time delta of the release date of the movie.
# timedelta must be less than one year. 
def findQuotes(movie, titles, timedelta, export=True):
    years = get_years(movie, timedelta)
    quotes = pd.DataFrame()
    row = moviedata[moviedata["primaryTitle"] == movie].iloc[0]
    release = row["release"]
    
    # Generate the regex that will be searched to match the movie
    titleregex = "|".join(titles)

    def valid_date(quotedate):
        return release - timedelta < quotedate["date"] < release + timedelta 
    
    for year in years:
        with pd.read_json(DATA_DIR + quotes_dataset(year), lines=True, chunksize=100000) as df_reader:
            pbar = tqdm(df_reader)
            count = 0
            for chunk in pbar:
                count += len(chunk)
                pbar.set_description(f"{movie} / {year}: progress = {count}")
                
                # Fast filter of the chunk, but works only if the quote dataset is sorted!
                # (which it is not for now, but if we need to extract the quotes for lots of movie
                # it would be convenient to sort them)
                #if not valid_date(chunk.iloc[0]) and not valid_date(chunk.iloc[len(chunk)-1]):
                #    continue
                
                # filter by dates
                chunk = chunk[(chunk["date"] > release - timedelta) & (chunk["date"] < release + timedelta)]
                # filter those containing one of the titles
                chunk = chunk[chunk["quotation"].str.contains(titleregex, na=False, case=False, regex=True)]
                # append the selected quotes
                quotes = pd.concat((quotes, chunk))

            # Note: this count is cumulative over the years processed so far for this movie
            print(f"{movie} / {year}: {quotes.shape[0]} quotes extracted")
            #quotes.head()

    # Specify the movie
    quotes["primaryTitle"] = movie
    
    if export:
        # Save to the data folder
        quotes.to_csv(DATA_DIR + "movies/" + movie + ".csv.gz", compression='gzip')
    else:
        return quotes
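
The pattern matching can be explored on a few made-up quotes to see why short patterns such as `Sing` are risky. In the sketch below, the `title_mask` helper and its `whole_words` option are hypothetical additions and not part of this notebook: escaping the patterns and requiring word boundaries are possible refinements, not what `findQuotes` currently does.

```python
import re
import pandas as pd

# Hypothetical quotes; only the matching logic mirrors the cell above.
quotes = pd.DataFrame({"quotation": [
    "I loved working on Toy Story.",
    "She keeps singing in the shower.",
    "SING was a joy to record.",
    None,
]})

def title_mask(frame, patterns, whole_words=False):
    """Boolean mask of the rows whose quotation contains any of the patterns."""
    # Escape the patterns so characters like "." are matched literally
    regex = "|".join(re.escape(p) for p in patterns)
    if whole_words:
        # Only accept matches delimited by word boundaries
        regex = rf"\b(?:{regex})\b"
    return frame["quotation"].str.contains(regex, case=False, na=False, regex=True)

loose = title_mask(quotes, ["Sing"])
strict = title_mask(quotes, ["Sing"], whole_words=True)
print(loose.tolist(), strict.tolist())
```

The loose match also flags the quote about "singing", while the word-boundary variant keeps only the quote that uses the word on its own. Neither can tell the movie title apart from the ordinary verb, so such patterns remain noisy.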

In [23]:
# Beware: this takes a very long time, even with 8 processes (in my case 6~7h for 48 movies).
# A possibly faster way would be to split the years among
# the processes to avoid reading the dataset several times (assuming
# that reading from disk is the bottleneck, and not comparing strings),
# or to sort the dataset so that whole chunks can be skipped directly.

# For each movie in the alttitles list, find the quotes that mention this movie
# within the given time delta of the release date.
timedelta = datetime.timedelta(weeks=25)

manager = multiprocessing.Manager()
# Use a shared list to split the workload among the worker processes
movies = manager.list(alttitles)

# While there are movies in `movies` to process, pop one and look for its quotes
def worker(threadnum):
    print(f"Thread {threadnum+1}: start")
    try:
        while True:
            movie, titles = movies.pop()
            print(f"Thread {threadnum+1}: processing {movie}")
            findQuotes(movie, titles, timedelta)
    except IndexError:
        # The list is empty
        print(f"Thread {threadnum+1}: done")
        return

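jobs = []
# Start 8 worker processes which will, for each movie, find its quotes
for threadnum in range(8):
    # Pass the worker number as an argument instead of capturing the loop variable in a lambda
    job = multiprocessing.Process(target=worker, args=(threadnum,))
    jobs.append(job)
    job.start()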

# Wait for each job to finish.
for job in jobs:
    job.join()

Thread 1: start
Thread 2: start
Thread 1: processing American Sniper
Thread 3: start
Thread 2: processing Aladdin
Thread 4: start
Thread 3: processing Inside Out
Thread 5: start
Thread 4: processing Deadpool
Thread 5: processing The Jungle Book
Thread 6: start
Thread 7: start
Thread 6: processing The Secret Life of Pets
Thread 7: processing Guardians of the Galaxy Vol. 2
Thread 8: start
Thread 8: processing Jumanji: Welcome to the Jungle


Deadpool / 2016: progress = 13862129: : 139it [26:11, 11.31s/it]t]t [26:10, 12.47s/it]


Deadpool / 2016: 668 quotes extracted


The Jungle Book / 2016: progress = 13862129: : 139it [26:15, 11.34s/it].11s/it]


The Jungle Book / 2016: 211 quotes extracted


The Secret Life of Pets / 2016: progress = 13862129: : 139it [26:33, 11.47s/it]08s/it]


The Secret Life of Pets / 2016: 38 quotes extracted
Thread 6: processing Captain America: Civil War


American Sniper / 2015: progress = 20874338: : 209it [37:36, 10.79s/it]0.20s/it]1s/it]


American Sniper / 2015: 49 quotes extracted


Inside Out / 2015: progress = 20874338: : 209it [37:50, 10.87s/it]t [37:49, 12.09s/it]


Inside Out / 2015: 1233 quotes extracted
Thread 3: processing Wonder Woman


Aladdin / 2019: progress = 21763302: : 218it [40:59, 11.28s/it]it]21, 10.11s/it]0s/it]


Aladdin / 2019: 681 quotes extracted


Captain America: Civil War / 2016: progress = 13862129: : 139it [23:25, 10.11s/it]/it]


Captain America: Civil War / 2016: 3562 quotes extracted


Guardians of the Galaxy Vol. 2 / 2017: progress = 26611588: : 267it [54:59, 12.36s/it]


Guardians of the Galaxy Vol. 2 / 2017: 841 quotes extracted


Jumanji: Welcome to the Jungle / 2017: progress = 26611588: : 267it [55:14, 12.41s/it]


Jumanji: Welcome to the Jungle / 2017: 203 quotes extracted


Deadpool / 2015: progress = 20874338: : 209it [34:09,  9.24s/it]79s/it]1, 10.06s/it]
Deadpool / 2015: progress = 20874338: : 209it [34:09,  9.81s/it]


Deadpool / 2015: 935 quotes extracted
The Jungle Book / 2015: 227 quotes extracted
Thread 5: processing Jurassic World: Fallen Kingdom
Thread 4: processing Captain Marvel


American Sniper / 2016: progress = 13862129: : 139it [23:13, 10.02s/it]2, 10.87s/it]


American Sniper / 2016: 62 quotes extracted
Thread 1: processing Toy Story 4


Guardians of the Galaxy Vol. 2 / 2016: progress = 13862129: : 139it [23:10, 10.00s/it]


Guardians of the Galaxy Vol. 2 / 2016: 906 quotes extracted
Thread 7: processing Avengers: Age of Ultron


Captain America: Civil War / 2015: progress = 20874338: : 209it [33:31,  9.62s/it]/it]


Captain America: Civil War / 2015: 4598 quotes extracted


Avengers: Age of Ultron / 2015: progress = 3100000: : 31it [05:20, 10.47s/it]

Thread 6: processing Frozen II


Aladdin / 2018: progress = 27228451: : 273it [47:35, 10.46s/it]24, 10.15s/it]0.68s/it]


Aladdin / 2018: 760 quotes extracted
Thread 2: processing Finding Dory


Wonder Woman / 2017: progress = 26611588: : 267it [52:10, 11.72s/it]t]9:38, 10.87s/it]


Wonder Woman / 2017: 3143 quotes extracted


Captain Marvel / 2019: progress = 21763302: : 218it [36:49, 10.14s/it]0.07s/it]92s/it]


Captain Marvel / 2019: 8512 quotes extracted


Jurassic World: Fallen Kingdom / 2018: progress = 20400000: : 204it [36:51, 10.93s/it]


Toy Story 4 / 2019: 513 quotes extracted


Jumanji: Welcome to the Jungle / 2018: progress = 27228451: : 273it [49:43, 10.93s/it]


Jumanji: Welcome to the Jungle / 2018: 365 quotes extracted
Thread 8: processing Beauty and the Beast


Jurassic World: Fallen Kingdom / 2018: progress = 27228451: : 273it [49:06, 10.79s/it]


Jurassic World: Fallen Kingdom / 2018: 1136 quotes extracted


Finding Dory / 2016: progress = 13862129: : 139it [23:17, 10.05s/it]  9.99s/it]s/it]


Finding Dory / 2016: 73 quotes extracted


Wonder Woman / 2016: progress = 13862129: : 139it [22:46,  9.83s/it]7s/it]11.20s/it]


Wonder Woman / 2016: 3227 quotes extracted
Thread 3: processing Star Wars: Episode IX - The Rise of Skywalker


Avengers: Age of Ultron / 2015: progress = 20874338: : 209it [35:21, 10.15s/it][00:42, 10.94s/it]


Avengers: Age of Ultron / 2015: 1574 quotes extracted
Thread 7: processing Rogue One: A Star Wars Story


Frozen II / 2019: progress = 21763302: : 218it [36:32, 10.06s/it]s/it], 10.61s/it]t]:06, 10.39s/it]


Frozen II / 2019: 2581 quotes extracted


Frozen II / 2020: progress = 5244449: : 53it [09:48, 11.11s/it]08s/it], 12.18s/it]16:58, 11.87s/it]


Frozen II / 2020: 3720 quotes extracted


Toy Story 4 / 2018: progress = 18200000: : 182it [32:38, 12.29s/it]

Thread 6: done


Rogue One: A Star Wars Story / 2016: progress = 13862129: : 139it [25:24, 10.97s/it]26:03, 10.17s/it]


Rogue One: A Star Wars Story / 2016: 2514 quotes extracted


Toy Story 4 / 2018: progress = 27228451: : 273it [48:16, 10.61s/it]it]000: : 184it [32:41, 10.27s/it]


Toy Story 4 / 2018: 514 quotes extracted


Captain Marvel / 2018: progress = 27200000: : 272it [48:18, 10.40s/it]

Thread 1: done


Captain Marvel / 2018: progress = 27228451: : 273it [48:21, 10.63s/it]


Captain Marvel / 2018: 12145 quotes extracted
Thread 4: done


Finding Dory / 2015: progress = 20874338: : 209it [34:19,  9.85s/it]00000: : 189it [33:22,  8.26s/it]


Finding Dory / 2015: 76 quotes extracted
Thread 2: done


Star Wars: Episode IX - The Rise of Skywalker / 2019: progress = 21763302: : 218it [36:56, 10.17s/it]


Star Wars: Episode IX - The Rise of Skywalker / 2019: 3156 quotes extracted


Beauty and the Beast / 2017: progress = 26611588: : 267it [49:07, 11.04s/it]4s/it]04:17,  7.49s/it]


Beauty and the Beast / 2017: 592 quotes extracted


Jurassic World: Fallen Kingdom / 2017: progress = 26611588: : 267it [46:34, 10.47s/it]5,  7.50s/it]


Jurassic World: Fallen Kingdom / 2017: 1148 quotes extracted
Thread 5: done


Star Wars: Episode IX - The Rise of Skywalker / 2020: progress = 5244449: : 53it [06:32,  7.41s/it]


Star Wars: Episode IX - The Rise of Skywalker / 2020: 4038 quotes extracted
Thread 3: done


Beauty and the Beast / 2016: progress = 13862129: : 139it [13:24,  5.79s/it].74s/it]


Beauty and the Beast / 2016: 640 quotes extracted
Thread 8: done


Rogue One: A Star Wars Story / 2017: progress = 26611588: : 267it [33:57,  7.63s/it]


Rogue One: A Star Wars Story / 2017: 4425 quotes extracted
Thread 7: done


In [31]:
# Merge all produced datasets into one
allquotes = pd.DataFrame()
for movie, _ in alttitles:
    quotes = pd.read_csv(DATA_DIR + "movies/" + movie + ".csv.gz")
    #print(movie, quotes.shape[0])
    allquotes = pd.concat((allquotes, quotes))

allquotes.to_csv(DATA_DIR + "50moviesquotes.csv.gz", compression='gzip')