# Research Question 2 - sentiment, ratings and box office  

This notebook lays the groundwork for answering the two subquestions of research question 2 (stated in the README). The problem and subquestions we aim to address are:      
    
    Sentiment analysis of quotes about movies over time. Additionally, relate sentiment to box office sales and see whether positive/negative media coverage affects ticket sales.    
    
    - RQ2.1: Does the opinion of the media/quoted speakers on a certain movie affect the number of tickets sold?     
    - RQ2.2: Does the sentiment seen in quotes relate to the rating on IMDb? 
    
Conclusions will be drawn here and also used in the `milestone3` file in the main directory of the repo.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import plotly.express as px
import plotly
import re

## Load Box Office, IMDb and Quotebank

We start by loading the three data sets from their respective pickle / CSV files, which were investigated in the exploratory notebooks. We load the Quotebank data together with the associated sentiment scores, as these are needed to address this RQ.

In [2]:
data_dir = os.getcwd() + os.sep + 'data'

# load pickled data
df_Quotebank = pd.read_pickle(rf"{data_dir}{os.sep}Quotebank_sentiment.pkl") 
df_boxOffice = pd.read_pickle(rf"{data_dir}{os.sep}boxOffice.pkl")   

# load IMDb csv-files and merge
movies = pd.read_csv(rf"{data_dir}{os.sep}IMDb{os.sep}IMDB movies.csv", low_memory=False)
ratings = pd.read_csv(rf"{data_dir}{os.sep}IMDb{os.sep}IMDB ratings.csv")    
df_imdb = movies.merge(ratings, on='imdb_title_id')
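
The merge above uses the pandas default (an inner join), so titles without a rating row are dropped silently. The following is a minimal sketch, on made-up toy frames rather than the real IMDb files, of how one might check that `imdb_title_id` is unique on both sides and see which rows fail to match. The names `movies_demo`, `ratings_demo` and the values in them are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two IMDb tables (hypothetical ids and values).
movies_demo = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt2", "tt3"],
    "original_title": ["A", "B", "C"],
})
ratings_demo = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt2"],
    "weighted_average_vote": [7.1, 6.4],
})

# validate raises MergeError if the key is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from.
merged_demo = movies_demo.merge(
    ratings_demo,
    on="imdb_title_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Movies that have no matching rating row.
unmatched = merged_demo.loc[merged_demo["_merge"] == "left_only", "imdb_title_id"].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses nothing.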


In [11]:
# Sanity check: collect the year of every quote about 'The Force Awakens'
a = [date.split("-")[0] for date in df_Quotebank.date[df_Quotebank.movie == 'Star Wars: Episode VII - The Force Awakens']]

In [13]:
pd.Series(a).unique()

array(['2015'], dtype=object)
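
The single-movie check above could be extended to every movie at once. This is a rough sketch on a toy frame, assuming (as the `split("-")` call suggests) that `date` holds ISO-formatted strings; `quotes_demo` and its rows are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_Quotebank (hypothetical rows; dates are ISO strings).
quotes_demo = pd.DataFrame({
    "movie": ["Movie A", "Movie A", "Movie B"],
    "date": ["2015-12-31", "2015-11-15", "2016-01-02"],
})

# The first four characters of an ISO date string are the year.
quotes_demo["year"] = quotes_demo["date"].str[:4]

# Distinct quote years for each movie.
years_by_movie = quotes_demo.groupby("movie")["year"].unique()
print(years_by_movie)
```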

We check that the three data sets were loaded correctly by looking at the first rows of each dataframe.

In [3]:
df_Quotebank.head(3)

Unnamed: 0,quotation,speaker,qids,date,numOccurrences,probas,urls,movie,shared_ID,AFINN_label,AFINN_score,VADER_label,VADER_score,BERT_label,BERT_score,positive_BERT_score,scaledReverted_BERT_score
0,Is Ferguson like Mockingjay?,Laci Green,[Q16843606],2015-11-15,1,"[[Laci Green, 0.9013], [None, 0.0987]]",[http://www.dailykos.com/story/2015/11/15/1450...,The Hunger Games: Mockingjay - Part 2,1412,POSITIVE,0.5,POSITIVE,0.3612,NEGATIVE,0.989802,0.010198,-0.541032
1,I want to clarify my interview on the `Charlie...,George Lucas,"[Q1507803, Q38222]",2015-12-31,7,"[[George Lucas, 0.5327], [None, 0.4248], [Char...",[http://www.escapistmagazine.com/news/view/165...,Star Wars: Episode VII - The Force Awakens,700,POSITIVE,0.165563,POSITIVE,0.991,POSITIVE,0.999293,0.999293,0.787788
2,Is Daredevil joining the Avengers for Infinity...,Scott Davis,"[Q16195496, Q18202175, Q7436225, Q7436228, Q12...",2015-12-10,2,"[[None, 0.4806], [Scott Davis, 0.4017], [Antho...",[http://www.flickeringmyth.com/2015/12/is-dare...,Avengers: Age of Ultron,999,NEGATIVE,-0.153846,NEGATIVE,-0.6369,NEGATIVE,0.833872,0.166128,-0.208302


In [4]:
df_boxOffice.head()

Unnamed: 0,days,dow,rank,daily,theaters,special events,movie
0,2019-05-24,Friday,1,31358935.0,4476,,Aladdin
1,2019-05-25,Saturday,1,30013295.0,4476,,Aladdin
2,2019-05-26,Sunday,1,30128699.0,4476,,Aladdin
3,2019-05-27,Monday,1,25305033.0,4476,Memorial Day,Aladdin
4,2019-05-28,Tuesday,1,12014982.0,4476,,Aladdin


In [5]:
df_imdb.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,5.7,13.0,4.5,4.0,5.7,34.0,6.4,51.0,6.0,70.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,6.2,23.0,6.6,14.0,6.4,66.0,6.0,96.0,6.2,331.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,5.8,4.0,6.8,7.0,5.4,32.0,6.2,31.0,5.9,123.0
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,5.5,14.0,6.1,21.0,4.9,57.0,5.5,207.0,4.7,105.0
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,7.3,82.0,7.4,77.0,6.9,139.0,7.0,488.0,7.0,1166.0


In [6]:
overlap = np.intersect1d(df_imdb.original_title.unique(), df_Quotebank.movie.unique())
print(overlap)
print(f"\nsize of overlap of movie titles: {len(overlap)}")

['Aladdin' 'Aquaman' 'Avengers: Age of Ultron' 'Avengers: Endgame'
 'Avengers: Infinity War' 'Bad Boys for Life'
 'Batman v Superman: Dawn of Justice' 'Beauty and the Beast'
 'Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn'
 'Black Panther' 'Bohemian Rhapsody' 'Captain America: Civil War'
 'Captain Marvel' 'Deadpool' 'Deadpool 2' 'Despicable Me 3' 'Dolittle'
 'Fantastic Beasts and Where to Find Them'
 'Fantastic Beasts: The Crimes of Grindelwald' 'Fast & Furious 7'
 'Finding Dory' 'Frozen II' 'Guardians of the Galaxy Vol. 2'
 'Incredibles 2' 'Inside Out' 'Joker' 'Jumanji: The Next Level'
 'Jumanji: Welcome to the Jungle' 'Jurassic World'
 'Jurassic World: Fallen Kingdom' 'Minions'
 'Mission: Impossible - Fallout' 'Mission: Impossible - Rogue Nation'
 'Onward' 'Rogue One' 'Sonic the Hedgehog' 'Spectre'
 'Spider-Man: Far from Home' 'Spider-Man: Homecoming'
 'Star Wars: Episode IX - The Rise of Skywalker'
 'Star Wars: Episode VII - The Force Awakens'
 'Star Wars: Epis
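
Exact string matching only finds titles that are spelled identically in both sources. A small sketch of how the unmatched Quotebank titles could be listed with `np.setdiff1d`, using toy arrays; the spelling mismatch shown here is a made-up example, not a finding from the data.

```python
import numpy as np

# Toy title lists (hypothetical; the 'Furious 7' mismatch is invented).
quotebank_titles = np.array(["Aladdin", "Joker", "Furious 7"])
imdb_titles = np.array(["Aladdin", "Joker", "Fast & Furious 7"])

# Titles present in both sources (sorted, unique).
matched = np.intersect1d(imdb_titles, quotebank_titles)

# Quotebank titles with no exact match in IMDb.
missing = np.setdiff1d(quotebank_titles, imdb_titles)

print(matched)
print(missing)
```

Any title that shows up in `missing` would need a manual mapping before it can be related to IMDb ratings.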

In [7]:
df_imdb = df_imdb[np.isin(df_imdb.original_title, overlap)]

In [8]:
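# Years covered by the Quotebank quotes, as strings (e.g. '2015')
Quotebank_years = pd.Series([date.split("-")[0] for date in df_Quotebank.date.unique()]).unique()
# Keep only IMDb entries from those years; compare as strings so the dtypes match
df_imdb = df_imdb[np.isin(df_imdb.year.astype(str), Quotebank_years)]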
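
The year filter presumably serves to drop older films that share a title with a quoted movie. The toy example below illustrates that effect under this assumption; `imdb_demo` and `quote_years` are hypothetical stand-ins, and the year column is deliberately numeric to show that casting to strings keeps the comparison well defined.

```python
import numpy as np
import pandas as pd

# Toy IMDb table: two films share the title 'Aladdin' (hypothetical rows).
imdb_demo = pd.DataFrame({
    "original_title": ["Aladdin", "Aladdin", "Joker"],
    "year": [1992, 2019, 2019],
})

# Years seen in the quotes, stored as strings as in the notebook.
quote_years = np.array(["2019", "2020"])

# Cast the year column to strings before testing membership.
mask = imdb_demo["year"].astype(str).isin(quote_years)
filtered = imdb_demo[mask]
print(filtered)
```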

In [9]:
df_imdb.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes
43822,tt0369610,Jurassic World,Jurassic World,2015,2015-06-11,"Action, Adventure, Sci-Fi",124,USA,English,Colin Trevorrow,...,7.2,32243.0,7.3,6381.0,6.8,762.0,7.1,72876.0,6.8,217418.0
47292,tt0451279,Wonder Woman,Wonder Woman,2017,2017-06-01,"Action, Adventure, Fantasy",141,"USA, China, Hong Kong","English, German, Dutch, French, Spanish, Chine...",Patty Jenkins,...,7.6,29053.0,7.8,7220.0,7.0,723.0,7.7,67709.0,7.2,190337.0
53124,tt1051906,L'uomo invisibile,The Invisible Man,2020,2020-03-27,"Horror, Mystery, Sci-Fi",124,"Canada, Australia, USA",English,Leigh Whannell,...,7.1,6778.0,7.2,1603.0,6.9,373.0,7.2,15568.0,7.0,49787.0
56282,tt1270797,Venom,Venom,2018,2018-10-04,"Action, Adventure, Sci-Fi",112,"China, USA","English, Mandarin, Malay",Ruben Fleischer,...,7.0,14589.0,7.2,3063.0,6.3,556.0,6.6,35509.0,6.6,125428.0
57584,tt1386697,Suicide Squad,Suicide Squad,2016,2016-08-13,"Action, Adventure, Fantasy",123,USA,"English, Japanese, Spanish",David Ayer,...,6.2,27359.0,6.4,4816.0,5.7,740.0,5.9,60869.0,5.9,203451.0


In [10]:
# attributes of interest
aoi = [
    'title', 'genre', 'description',
    'avg_vote', 'votes',
    'reviews_from_users', 'reviews_from_critics',
    'weighted_average_vote', 'total_votes', 'mean_vote', 'median_vote',
    'top1000_voters_rating', 'top1000_voters_votes',
    'us_voters_rating', 'us_voters_votes',
]
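
Selecting a list of columns fails with a `KeyError` if any name is misspelled or absent. A minimal sketch of checking the list against the dataframe first, on a toy frame; `imdb_demo`, `wanted` and the values shown are assumptions for illustration.

```python
import pandas as pd

# Toy frame with only some of the wanted columns (made-up values).
imdb_demo = pd.DataFrame({
    "title": ["Some Movie"],
    "genre": ["Drama"],
    "avg_vote": [7.0],
})
wanted = ["title", "genre", "avg_vote", "median_vote"]

# Split the wanted names into those present and those missing.
missing_cols = [c for c in wanted if c not in imdb_demo.columns]
present_cols = [c for c in wanted if c in imdb_demo.columns]

# Select only the columns that actually exist.
subset = imdb_demo[present_cols]
print(missing_cols)
```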