# Mounting the Google Drive

You can mount your Google Drive in Colab if you need additional storage or want to use files stored there. To do so, run the following code cell (click the play button or press Command/Ctrl+Enter):

In [3]:
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'
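
The `ModuleNotFoundError` above is what the cell produces outside Colab, where the `google.colab` package is not installed. Below is a minimal sketch of a guard, assuming the notebook should fall back to the current directory in that case. The helper name `get_data_root` and the fallback directory are illustrative choices, not part of the original notebook.

```python
from pathlib import Path


def get_data_root(mount_point="/content/drive"):
    """Return the folder that holds the data files (hypothetical helper)."""
    try:
        # Only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        # Assumption: outside Colab the data lives next to the notebook
        return Path(".")
    drive.mount(mount_point)
    return Path(mount_point) / "MyDrive"


data_root = get_data_root()
print(data_root)
```

Paths built from `data_root` (for example `data_root / "imdb"`) would then resolve both inside and outside Colab, provided the files are placed accordingly.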

For feasibility (and runtime), we focus on only the top movies of each year, which we also expect to attract the most quotes. We first use the IMDb dataset of movies and ratings to derive the top 10 movies per year, and then filter the quotes related to these movies.

In [4]:
import pandas as pd
import pickle
import bz2
import json
import re

We first load the dataset and then parse the publication year into a number. Initially, some years were stored as strings and others as numerical values, which broke our groupby.

In [2]:
movie_df = pd.read_csv("./drive/MyDrive/imdb/IMDb movies.csv")

FileNotFoundError: [Errno 2] No such file or directory: './drive/MyDrive/imdb/IMDb movies.csv'

In [15]:
movie_df['year'] = pd.to_numeric(movie_df['year'], errors='coerce') # coerce: give NaN when parsing did not work
movie_df.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894.0,1894-10-09,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906.0,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911.0,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912.0,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911.0,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0
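
To see why the coercion matters, here is a small sketch on made-up values (the three entries below are illustrative, not taken from the IMDb file). With mixed strings and numbers, `'2019'` and `2019` count as different keys, while `pd.to_numeric(..., errors='coerce')` maps both to the same number and turns unparseable entries into NaN.

```python
import pandas as pd

# Illustrative raw column: a string year, an integer year, and an unparseable entry
raw_years = pd.Series(["2019", 2019, "TV Movie 2019"], dtype=object)

# Before parsing, the string and the integer are distinct values
print(raw_years.nunique())

# After parsing, both become 2019.0 and the unparseable entry becomes NaN
parsed_years = pd.to_numeric(raw_years, errors="coerce")
print(parsed_years.tolist())
print(parsed_years.nunique())  # NaN is not counted
```

Because the result contains NaN, the column ends up as a float column, which is why the years in the table above are shown as `1894.0`, `1906.0`, and so on.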


In [16]:
movie_counts = movie_df.groupby('year')['imdb_title_id'].count()

In [17]:
movie_counts.loc[2015:2020].sum()  # number of movies released from 2015 to 2020 (label slice, both ends inclusive)

16331

To define a "top movie" for a year, we use its worldwide gross income, which we treat as a sort of "democratic vote" on whether the movie is worth watching. We therefore first drop NaN values in the 'worlwide_gross_income' [sic] column and then keep only the movies whose income is given in US dollars. It turns out that all worldwide successful movies report their income in US dollars. Converting local currencies to US dollars did not seem worthwhile, as most of the affected movies were only locally successful productions.
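
The currency filter described above can be sketched on plain strings as follows. This is only an illustration under stated assumptions: the helper name `parse_usd` is hypothetical, `'$ 424617855'` follows the format seen in the dataset, and the `'INR 1000000'` value is made up to stand in for a non-dollar entry.

```python
def parse_usd(value):
    """Return the amount as float if `value` looks like '$ 123', else None."""
    if not isinstance(value, str) or not value.startswith("$"):
        return None  # missing value (NaN) or an income in another currency
    # Drop the leading dollar sign and the space that follows it
    return float(value.lstrip("$ "))


samples = ["$ 424617855", "INR 1000000", float("nan")]
for sample in samples:
    print(repr(sample), "->", parse_usd(sample))
```

Entries that are not in US dollars simply yield no value, which mirrors the decision not to convert local currencies.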

In [18]:
movie_df_nona = movie_df.dropna(subset=['worlwide_gross_income'])
movie_df_nona.shape

(31016, 22)

In [19]:
movie_counts_nona = movie_df_nona.groupby('year')['imdb_title_id'].count()

In [20]:
movie_counts_nona.loc[2015:2020].sum()  # movies from 2015 to 2020 that have an income entry

8331
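
The two counts above can be put in relation with a line of arithmetic. The sketch below simply reuses the printed numbers; the variable names are illustrative.

```python
movies_2015_2020 = 16331       # all movies released from 2015 to 2020 (output above)
with_income_2015_2020 = 8331   # of these, movies with a worldwide gross income entry

share = with_income_2015_2020 / movies_2015_2020
print(f"{share:.1%} of the 2015-2020 movies have income information")
```

Roughly half of the recent movies therefore drop out at this step, before the currency filter is applied.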

In [22]:
movie_df_nona = movie_df_nona.copy()  # explicit copy avoids the SettingWithCopyWarning
is_usd = movie_df_nona['worlwide_gross_income'].astype(str).str.startswith('$')  # keep US-dollar incomes only
movie_df_nona['worldwide_gross_income'] = movie_df_nona.loc[is_usd, 'worlwide_gross_income'].str.replace('$ ', '', regex=False).astype(float)


In [23]:
movie_df_nona[movie_df_nona['year'] == 2020].sort_values(by='worldwide_gross_income', ascending=False).head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,writer,production_company,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,worldwide_gross_income
58734,tt1502397,Bad Boys for Life,Bad Boys for Life,2020.0,2020-02-20,"Action, Comedy, Crime",124,"USA, Mexico","English, Spanish","Adil El Arbi, Bilall Fallah","Peter Craig, Joe Carnahan",Columbia Pictures,"Will Smith, Martin Lawrence, Vanessa Hudgens, ...",Miami detectives Mike Lowrey and Marcus Burnet...,6.6,111557,$ 90000000,$ 204417855,$ 424617855,59.0,1072.0,238.0,424617855.0
72664,tt3794354,Sonic - Il film,Sonic the Hedgehog,2020.0,2020-02-13,"Action, Adventure, Comedy",99,"USA, Japan, Canada","English, French",Jeff Fowler,"Pat Casey, Josh Miller",Paramount Pictures,"Ben Schwartz, James Marsden, Jim Carrey, Tika ...","After discovering a small, blue, fast hedgehog...",6.5,74639,$ 85000000,$ 146066470,$ 306766470,47.0,1018.0,250.0,306766470.0
82258,tt7294150,Ba Bai,Ba Bai,2020.0,2020-08-21,War,147,China,Chinese,Hu Guan,"Hu Guan, Rui Ge",Beijing Diqi Yinxiang Entertainment,"Zhi-zhong Huang, Zhang Junyi, Hao Ou, Wu Jiang...",From the acclaimed filmmaker behind Mr. Six co...,7.2,387,$ 80000000,$ 111552,$ 277207347,,8.0,4.0,277207347.0
80994,tt6673612,Dolittle,Dolittle,2020.0,2020-01-30,"Adventure, Comedy, Family",101,"USA, China, UK, Japan","English, French",Stephen Gaghan,"Stephen Gaghan, Dan Gregor",Universal Pictures,"Robert Downey Jr., Antonio Banderas, Michael S...",A physician who can talk to animals embarks on...,5.6,39245,$ 175000000,$ 77047065,$ 245891166,26.0,680.0,180.0,245891166.0
83078,tt7713068,Birds of Prey e la fantasmagorica rinascita di...,Birds of Prey: And the Fantabulous Emancipatio...,2020.0,2020-02-06,"Action, Adventure, Crime",109,USA,"English, Chinese",Cathy Yan,"Christina Hodson, Paul Dini",Clubhouse Pictures (II),"Margot Robbie, Rosie Perez, Mary Elizabeth Win...","After splitting with the Joker, Harley Quinn j...",6.1,137373,$ 84500000,$ 84158461,$ 201858461,60.0,2222.0,372.0,201858461.0


Out of these successful movies, we now create a dictionary with the year as key and the list of the most successful movie titles as value. We save this dictionary as a pickle file and, because it is rather small, also hardcode it in this notebook for convenience.

In [24]:
def get_most_succesful_movies_per_year(df, year, number=10):
  return df[df['year'] == year].sort_values(by='worldwide_gross_income', ascending=False).head(number)['original_title'].to_list()

In [25]:
most_successful_movies = dict()
for year in range(2015,2021):
  most_successful_movies[year] = get_most_succesful_movies_per_year(movie_df_nona, year)
most_successful_movies[2015]
with open('./drive/MyDrive/most_successful_movies.pickle', 'wb') as f:  
  pickle.dump(most_successful_movies, f)

In [26]:
most_successful_movies = {2015: ['Star Wars: Episode VII - The Force Awakens',
  'Jurassic World',
  'Fast & Furious 7',
  'Avengers: Age of Ultron',
  'Minions',
  'Spectre',
  'Inside Out',
  'Mission: Impossible - Rogue Nation',
  'The Hunger Games: Mockingjay - Part 2',
  'The Martian'],
 2016: ['Captain America: Civil War',
  'Rogue One',
  'Finding Dory',
  'Zootopia',
  'The Jungle Book',
  'The Secret Life of Pets',
  'Batman v Superman: Dawn of Justice',
  'Fantastic Beasts and Where to Find Them',
  'Deadpool',
  'Suicide Squad'],
 2017: ['Star Wars: Episode VIII - The Last Jedi',
  'Beauty and the Beast',
  'The Fate of the Furious',
  'Despicable Me 3',
  'Jumanji: Welcome to the Jungle',
  'Spider-Man: Homecoming',
  'Zhan lang II',
  'Guardians of the Galaxy Vol. 2',
  'Thor: Ragnarok',
  'Wonder Woman'],
 2018: ['Avengers: Infinity War',
  'Black Panther',
  'Jurassic World: Fallen Kingdom',
  'Incredibles 2',
  'Aquaman',
  'Bohemian Rhapsody',
  'Venom',
  'Mission: Impossible - Fallout',
  'Deadpool 2',
  'Fantastic Beasts: The Crimes of Grindelwald'],
 2019: ['Avengers: Endgame',
  'The Lion King',
  'Frozen II',
  'Spider-Man: Far from Home',
  'Captain Marvel',
  'Joker',
  'Star Wars: Episode IX - The Rise of Skywalker',
  'Toy Story 4',
  'Aladdin',
  'Jumanji: The Next Level'],
 2020: ['Bad Boys for Life',
  'Sonic the Hedgehog',
  'Ba Bai',
  'Dolittle',
  'Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn',
  'Onward',
  'The Invisible Man',
  'The Call of the Wild',
  'Tenet',
  'Tolo Tolo']}

In [27]:
with open('./drive/MyDrive/most_successful_movies.pickle', 'rb') as f:  
  most_successful_movies = pickle.load(f)
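
A quick way to convince yourself that the pickle round trip preserves the dictionary is sketched below. It writes to a temporary directory instead of the Drive, and the two-entry dictionary is a shortened sample; both are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Shortened sample of the dictionary above
top_movies = {2020: ["Bad Boys for Life", "Sonic the Hedgehog"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "most_successful_movies.pickle"
    with open(path, "wb") as f:
        pickle.dump(top_movies, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == top_movies)
print(type(next(iter(restored))))  # keys are still integers
```

One reason to prefer pickle over JSON here is that pickle keeps the integer year keys, whereas JSON would turn them into strings.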

Final step before filtering the quotes: it turns out that people tend not to use the full title when talking about a movie (e.g. people would rather refer to 'the new Star Wars' or 'The Rise of Skywalker' than to 'Star Wars: Episode IX - The Rise of Skywalker'). Therefore, we split each movie title into multiple subparts, basically at ':' and ' - '. The one exception is 'Mission: Impossible': splitting it into 'Mission' and 'Impossible' is a terrible idea, as basically any politician in the world loves to stress the importance of his or her 'Mission', which leads to many unrelated quotes. For these movies, we only split into 'Mission: Impossible' and the respective subtitle of the movie.
As we could see in the analysis of the box office data, many movies are released in December (especially Star Wars). Hence, we filter the quotes of year X by movies released in years X and X-1, which also lets us capture reactions to a movie some weeks or months after its release.
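
The year window described above can be sketched as a small helper. The function name `movies_for_quote_year` and the shortened sample dictionary are illustrative assumptions, not the notebook's own code.

```python
def movies_for_quote_year(top_movies, year):
    """Titles relevant for quotes of `year`: releases from `year` and `year - 1`."""
    titles = list(top_movies.get(year, []))
    # A December release of the previous year is still discussed in the new year
    titles += top_movies.get(year - 1, [])
    return titles


# Shortened sample of the dictionary of top movies per year
top_movies = {
    2019: ["Star Wars: Episode IX - The Rise of Skywalker", "Joker"],
    2020: ["Bad Boys for Life", "Tenet"],
}

print(movies_for_quote_year(top_movies, 2020))
print(movies_for_quote_year(top_movies, 2019))  # no 2018 entry in this sample
```

Using `dict.get` with an empty default means the first available year needs no special case: it just has no previous-year titles to add.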

In [10]:
year = 2015
# Movies relevant for quotes of year X: releases from X and, if available, X-1
movies = list(most_successful_movies[year])
if year - 1 in most_successful_movies:
  movies += most_successful_movies[year - 1]
# Split each title at ':' and ' - ' to get the sub-titles people actually use
movies_dict = {key: [x.strip() for x in re.split(':| - ', key)] for key in movies}
# Exception: never split 'Mission: Impossible' into 'Mission' and 'Impossible'
for title, parts in [('Mission: Impossible - Fallout', ['Mission: Impossible', 'Fallout']),
                     ('Mission: Impossible - Rogue Nation', ['Mission: Impossible', 'Rogue Nation'])]:
  if title in movies_dict:
    movies_dict[title] = parts
movies_dict.items()

dict_items([('Bad Boys for Life', ['Bad Boys for Life']), ('Sonic the Hedgehog', ['Sonic the Hedgehog']), ('Ba Bai', ['Ba Bai']), ('Dolittle', ['Dolittle']), ('Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn', ['Birds of Prey', 'And the Fantabulous Emancipation of One Harley Quinn']), ('Onward', ['Onward']), ('The Invisible Man', ['The Invisible Man']), ('The Call of the Wild', ['The Call of the Wild']), ('Tenet', ['Tenet']), ('Tolo Tolo', ['Tolo Tolo'])])
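
The splitting rule, including the 'Mission: Impossible' exception, can be tried out in isolation. The sketch below is a generalized variant, not the notebook's code: the helper `split_title` and the `PROTECTED_PREFIXES` tuple are hypothetical names introduced for illustration.

```python
import re

# Prefixes that must stay in one piece (assumption: only this franchise needs it)
PROTECTED_PREFIXES = ("Mission: Impossible",)


def split_title(title):
    """Split a title at ':' and ' - ', keeping protected prefixes intact."""
    for prefix in PROTECTED_PREFIXES:
        if title.startswith(prefix):
            rest = title[len(prefix):]
            parts = [p.strip() for p in re.split(":| - ", rest)]
            # Drop the empty piece left in front of the first separator
            return [prefix] + [p for p in parts if p]
    return [p.strip() for p in re.split(":| - ", title)]


print(split_title("Star Wars: Episode IX - The Rise of Skywalker"))
print(split_title("Mission: Impossible - Fallout"))
print(split_title("Spider-Man: Homecoming"))
```

Note that the hyphen in 'Spider-Man' is left alone, because only a hyphen surrounded by spaces counts as a separator.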

In [11]:
def filter_quotes(path_to_file, path_to_out, movies_dict):
  """Copy every quote that mentions one of the movie (sub-)titles into a new bz2 file."""
  # Opening the output in 'wb' mode overwrites an existing file, so no explicit removal is needed
  with bz2.open(path_to_file, 'rb') as s_file, bz2.open(path_to_out, 'wb') as d_file:
    for i, line in enumerate(s_file, start=1):
      if i % 100000 == 0:
        print(i) # progress indicator
      instance = json.loads(line) # loading a sample (one JSON object per line)
      quotation = instance['quotation'] # load quotation
      # Tag the quote with the first movie whose title parts occur in it
      movie = next((m for m, words in movies_dict.items()
                    if any(word in quotation for word in words)), None)
      if movie is not None:
        instance['movie'] = movie
        d_file.write((json.dumps(instance)+'\n').encode('utf-8')) # writing in the new file
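
One caveat of the plain substring test (`word in quotation`) is that short titles also match inside longer words. The following sketch shows a stricter alternative based on a regular expression that only matches whole words. This is a swapped-in technique for comparison, not what the notebook does; the helper `build_matcher` and the sample quotations are made up.

```python
import re


def build_matcher(words):
    """Compile a pattern that matches any of `words` only as whole words."""
    # Longer alternatives first, so a long title wins over its own prefix
    alternatives = sorted((re.escape(w) for w in words), key=len, reverse=True)
    # Lookarounds require that no word character touches the match on either side
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)")


matcher = build_matcher(["Venom", "Joker"])

unrelated = "Venomous snakes are dangerous."
related = "I loved Venom."

print("Venom" in unrelated)                 # substring test matches
print(matcher.search(unrelated) is None)    # whole-word test does not
print(matcher.search(related).group(0))
```

The substring test is simpler and keeps more quotes, while the whole-word variant trades some recall for fewer unrelated matches. Which one fits better depends on how much manual cleaning is acceptable afterwards.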

In [12]:
path_to_file = '/content/drive/MyDrive/Quotebank/quotes-{}.json.bz2'.format(year) 
path_to_out = '/content/drive/MyDrive/quotes-{}-movies.json.bz2'.format(year)

filter_quotes(path_to_file, path_to_out, movies_dict)

100000
200000
...
12300000
1