# Interpreting Taste in Movies

Today, we will be testing Bourdieu’s theory of taste in contemporary movies. Specifically, in the past ten years, is good taste in movies still something that is learned and replicated in elite codes, through the institution of the movie critic? To empirically answer this question, we will address the following:

1. In the past ten years, how much have critic reviews influenced which movies are nominated for or win awards?
    * Theoretically: Critical approval (good taste, SV) conventionally indexes prestige and distinction (O), which is then associated with the act of appreciating a film (I).
    * Empirically: Does the probability of being nominated for an award go up with critical approval (i.e. critic scores for a movie) and down with mass approval (user scores for a movie)? Historically, the stamp of approval and distinction in the movie industry has been awards like the Oscars. Therefore, we will use the receipt of these awards as our outcome variable, assuming that good taste in movies can be interpreted through the movies that win awards.

2. Do critics still stand in opposition to movies that the masses enjoy? I.e. are critics still the gateways for elite codes?
    * As above, critical approval indexes elite distinction from the masses
    * Empirically: if this theory holds, critic scores should be either negatively correlated or uncorrelated with box-office numbers (which reflect mass enjoyment)

3. Are “difficult” genres and words used to describe films that are nominated for awards?
    * Theoretically: The incorporation of difficult themes and language in a movie (SV) indexes a certain ideal of high culture and sophistication (O), which is then associated with the act of appreciating a film and being sophisticated enough to understand it (I) in contrast to the "masses" who only enjoy “low brow” films
    * Empirically: does the probability of being nominated for an award go up for "serious" genres such as drama (in comparison to comedy, for example)? How about for longer movie runtimes? For "PG-13" and "R" rated movies? For plot descriptions that closely mimic a movie critic's writing style?
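
The empirical side of question 1 can be written as a logistic regression. This is only a sketch in our own notation: the coefficients $\beta_k$ and the subscripted variable names are assumptions introduced here, with $\text{metascore}_i$ standing for the critic score and $\text{userRating}_i$ for the user score of movie $i$.

$$
\log\frac{P(\text{award}_i = 1)}{1 - P(\text{award}_i = 1)} = \beta_0 + \beta_1\,\text{metascore}_i + \beta_2\,\text{userRating}_i
$$

If critics still act as the arbiters of elite taste, we would expect $\beta_1 > 0$ and $\beta_2 \le 0$. Question 2 has the analogous expectation $\operatorname{corr}(\text{metascore}, \text{boxOffice}) \le 0$, and question 3 would add terms for genre, runtime, and rating to the right-hand side.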
    
To address these questions, we will rely on large text datasets from [IMDb](https://www.imdb.com/interfaces/) to identify all of the listed movies from the past 10 years, along with basic metadata about them (such as user reviews on IMDb). IMDb, however, does not give us access to awards information, critical review data, or plot description information in their available text files. For this, we will turn to the [OMDb API](http://www.omdbapi.com/) (Open Movie Database), which allows us to download all of this information, using the IMDb IDs for each movie to link movie information.
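
The linking step can be sketched without touching the network. The snippet below builds the query URL for one IMDb ID and pulls out the five fields used later. It makes several assumptions: `build_omdb_url` and `extract_fields` are hypothetical helper names, the ID, key, and response body are placeholders, and OMDb is assumed to mark missing values with the string `N/A`.

```python
import json
from urllib.parse import urlencode

OMDB_URL = 'http://www.omdbapi.com/'

def build_omdb_url(imdb_id, api_key):
    """Build the OMDb query URL for one IMDb ID, requesting the full plot."""
    query = urlencode({'i': imdb_id, 'plot': 'full', 'apikey': api_key})
    return OMDB_URL + '?' + query

def extract_fields(response_text):
    """Pull the fields of interest from an OMDb JSON response.

    Assumption: OMDb uses the string 'N/A' for missing values, mapped here to None.
    """
    movie = json.loads(response_text, strict=False)
    keys = ['Rated', 'Plot', 'Metascore', 'BoxOffice', 'Awards']
    return {k: (None if movie.get(k) in (None, 'N/A') else movie.get(k)) for k in keys}

# Placeholder ID, key, and response body, for illustration only
url = build_omdb_url('tt0000000', 'YOUR_KEY')
sample = '{"Rated": "PG-13", "Metascore": "N/A", "Awards": "N/A"}'
fields = extract_fields(sample)
print(url)
print(fields)
```

Mapping `N/A` to `None` at this stage means later null checks (such as `isnull()`) treat missing scores and awards as genuinely missing rather than as text.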

In [3]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import re
import omdb
import requests
import json
import nltk
import string
from gensim import corpora, models

def logit_and_plot(formula, data, plot_title):
    # Fit Logistic Regression Model and print summary:
    logitfit = smf.logit(formula = formula, data = data).fit()
    print(logitfit.summary())
    
    # Plot only if I'm plotting for a single variable logistic regression
    
    # Make Logistic Regression Predictions for each set of ratings:
    probabilities = pd.DataFrame(logitfit.predict(), index=data.index)
    df = pd.DataFrame()
    if len(formula.split(" ")) > 3:
        return logitfit
    elif 'averageRating' in formula.split(" "):
        df = pd.DataFrame({
            'Rating':data['averageRating'],
            'P(Award)':probabilities[0]
            })
    elif 'polarization' in formula.split(" "):
        df = pd.DataFrame({
            'Rating':data['polarization'],
            'P(Award)':probabilities[0]
            })
    else:
        df = pd.DataFrame({
            'Rating':data['metascore'],
            'P(Award)':probabilities[0]
            })
    
    # Plot mean predictions for each value if only plotting for a single variable
    df.groupby('Rating').agg({'P(Award)': 'mean'}).plot()
    plt.title(plot_title)
    plt.xlabel('Rating')
    plt.ylabel('Probability')
    plt.show();
    
    # Return the logistic regression model itself
    return logitfit

def get_movie_data(imdb_id):
    # Get Movie Dictionary via OMDb API, using IMDb ID #, note omdb api key needs to be filled in:
    movie = requests.get('http://www.omdbapi.com/?i=' + imdb_id + '&plot=full&apikey=########', timeout=None)
    movie = json.loads(movie.text, strict=False)

    # Get Desired Data entries from OMDb dictionary:
    rated = movie.get('Rated')
    plot = movie.get('Plot')
    metascore = movie.get('Metascore')
    box_office = movie.get('BoxOffice')
    awards = movie.get('Awards')
    
    # Return Entries as a Series to be added as new DataFrame rows
    return pd.Series([rated, plot, metascore, box_office, awards])

def get_wordnet_pos(word):

    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": nltk.corpus.wordnet.ADJ,
                "N": nltk.corpus.wordnet.NOUN,
                "V": nltk.corpus.wordnet.VERB,
                "R": nltk.corpus.wordnet.ADV}

    return tag_dict.get(tag, nltk.corpus.wordnet.NOUN)

def get_lemmas(text):

    stop = nltk.corpus.stopwords.words('english') + list(string.punctuation)
    tokens = [i for i in nltk.word_tokenize(text.lower()) if i not in stop]
    lemmas = [nltk.stem.WordNetLemmatizer().lemmatize(t, get_wordnet_pos(t)) for t in tokens]
    return lemmas

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
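
`logit_and_plot` above decides whether to draw a plot by counting the space-separated tokens in the formula string: anything longer than three tokens is treated as a multi-predictor model and returned without plotting. The short check below illustrates that rule. `is_single_predictor` is a hypothetical name used only for this illustration.

```python
def is_single_predictor(formula):
    """Mirror the token-count rule in logit_and_plot: 'y ~ x' splits into 3 tokens."""
    return len(formula.split(" ")) <= 3

print(is_single_predictor('Any_Award ~ averageRating'))              # True
print(is_single_predictor('Any_Award ~ averageRating + metascore'))  # False
print(is_single_predictor('Any_Award ~ averageRating+metascore'))    # True: no spaces around '+'
```

The last case shows that the rule depends on spacing, so formulas passed to `logit_and_plot` should keep spaces around every operator.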

In [2]:
# tsv is tab separated values, can read in with csv if you tell it that separator is a tab
imdb_list = pd.read_csv('imdb_list.tsv', sep='\t')
imdb_info = pd.read_csv('imdb_info.tsv', sep='\t')
imdb_user_ratings = pd.read_csv('imdb_ratings.tsv', sep='\t')
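
IMDb's dataset files mark missing values with `\N`, which can leave columns such as `startYear` with mixed types. One possible way to load them more predictably is sketched below. The three-line table is a made-up stand-in for the real file, and whether these options suit the full files is an assumption to verify.

```python
import io
import pandas as pd

# Made-up stand-in for a few rows of an IMDb-style TSV file
sample_tsv = (
    "tconst\ttitleType\tstartYear\tisAdult\n"
    "tt0000001\tmovie\t2012\t0\n"
    "tt0000002\tmovie\t\\N\t0\n"
)

df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep='\t',
    na_values='\\N',     # treat the \N marker as missing
    low_memory=False,    # infer each column's type from the whole file at once
)
print(df.dtypes)
```

With `\N` parsed as missing, `startYear` comes in as a numeric column and `isAdult` stays an integer, so comparisons like `isAdult == 0` keep working.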



------------------

In [None]:
# logic for what terms we're going to filter out (e.g. non-movies)
imdb_movies = imdb_list[(imdb_list['titleType'] == 'movie') & # imdb_list is list of what entries are
                       (pd.to_numeric(imdb_list['startYear'], errors='coerce') >= 2009) & # start year is a string, so transform it into numeric; keep the past ten years (assumed cutoff: 2009 onward)
                       (imdb_list['isAdult'] == 0)
                       ] 
# cutting data down to 100,000 entries
imdb_movie_info = pd.merge(imdb_movies, imdb_info, how='inner', 
                          left_on='tconst', right_on='titleId') # merge other df where imdb ID number corresponds with number on other dataframe we created above (two dfs have diff column names for IDs but can specify that)
imdb_movies_usa = imdb_movie_info[imdb_movie_info['region'] == 'US'].drop_duplicates('tconst') # drop duplicates b/c multiple releases for US region for some movies
imdb_movies_usa.head()
# but don't have user ratings in this merged df - now have to add another df 
imdb_movies_usa_w_ratings = pd.merge(imdb_movies_usa, imdb_user_ratings,
                                    how = 'inner', on='tconst') # here same name for IDs
#get rid of any movie that has too few ratings (less than ten user reviews)
imdb_movies_final = imdb_movies_usa_w_ratings[imdb_movies_usa_w_ratings['numVotes'] >= 10]
# now don't have critic reviews - imdb wants us to pay for pro service to access box office and critic info -> go to omdb
# what he did (takes a long time b/c 25,000 entries so 25,000 API calls):
def get_movie_data(imdb_id):
    # Get Movie Dictionary via OMDb API, using IMDb ID #, note omdb api key needs to be filled in:
    movie = requests.get('http://www.omdbapi.com/?i=' + imdb_id + '&plot=full&apikey=########', timeout=None)
    movie = json.loads(movie.text, strict=False)

    # Get Desired Data entries from OMDb dictionary:
    rated = movie.get('Rated')
    plot = movie.get('Plot')
    metascore = movie.get('Metascore') #critic average
    box_office = movie.get('BoxOffice') # gross revenue
    awards = movie.get('Awards') 
    
    # Return Entries as a Series to be added as new DataFrame rows
    return pd.Series([rated, plot, metascore, box_office, awards])

# load in resulting file
imdb_movies_final = pd.read_pickle('imdb_omdb_movies_final.pkl') # pickle brings df in as is, nicer than reading in csv
# first question: is metascore influential in terms of nominated/win awards?
# expect correlation between metascore and awards but not with user ratings
# first just take out columns we need
movie_ratings = imdb_movies_final[['tconst', 'averageRating', 'metascore', 'awards']][~imdb_movies_final['metascore'].isnull()] # ~ is NOT; only return ones where there is a metascore
# award info is in string format - need to change it
# identify whether row has any award (NA vs. value)
def any_award(awards):
    if pd.isnull(awards):
        return 0
    else:
        return 1
# identify whether oscar is there
def oscar(awards): # if oscar is in awards where awards are lower case and split into a list
    if 'oscar' in re.sub(r'[^\w\s]', '', str(awards)).lower().split(): # strip punctuation with a regular expression (keep only word characters and spaces), then split into words
        return 1
    else:
        return 0
    
movie_ratings['Any_Award'] = movie_ratings['awards'].apply(any_award) # apply any award function over the column and create new column in df
movie_ratings['Oscar'] = movie_ratings['awards'].apply(oscar)
# average rating is ten point scale, metascore is on a 100 point scale so multiply average rating by ten to get on same scale
movie_ratings['averageRating'] = movie_ratings['averageRating']*10
# metascore is represented as string so need to turn into numbers
movie_ratings['metascore'] = pd.to_numeric(movie_ratings['metascore'])
# interested in polarization effect (metascore-average score) people vs. critics
movie_ratings['polarization'] = movie_ratings['metascore'].subtract(movie_ratings['averageRating']) # equivalent to using the minus operator
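
The `oscar` helper above is meant to strip punctuation and then look for the exact token `oscar`. If the awards text uses the plural "Oscars", an exact-token match would miss it. The sketch below uses a word-boundary regex that accepts both forms. The sample strings are invented to resemble award summaries, and `mentions_oscar` is a hypothetical name.

```python
import re

OSCAR_PATTERN = re.compile(r'\boscars?\b', flags=re.IGNORECASE)

def mentions_oscar(awards):
    """Return 1 if the awards text mentions 'Oscar' or 'Oscars', else 0."""
    if awards is None:
        return 0
    return int(bool(OSCAR_PATTERN.search(str(awards))))

# Invented examples, for illustration only
samples = [
    'Won 1 Oscar. 5 wins & 3 nominations total',
    'Nominated for 2 Oscars. 10 wins',
    '3 wins & 1 nomination',
    None,
]
print([mentions_oscar(s) for s in samples])  # [1, 1, 0, 0]
```

It could be applied in the same way as the original helper, for example `movie_ratings['awards'].apply(mentions_oscar)`.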

# ready to do some modeling!
# scatter plot to see if any obvious relationships
pd.plotting.scatter_matrix(movie_ratings, figsize=(20,20)); # big figure size

# logistic regression - model log odds of binary phenomenon 
# odds = prob of A over prob of B
# bigger odds = bigger probability of winning over not winning an Oscar
# coefficients for diff. predictors will tell us whether increases or decreases odds of getting an oscar
# complication: metascore is correlated with average user score + polarization - so need to look at interaction effects (effect to which a given variable is changed by presence of another one)
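
The relationship between log-odds, odds, and probability described in the comments above can be made concrete. The intercept and coefficient below are hypothetical numbers chosen for illustration, not estimates from this data.

```python
import math

def probability_from_log_odds(log_odds):
    """Invert the logit: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical fitted values, for illustration only
intercept = -3.0
beta_metascore = 0.04

for metascore in (50, 75, 100):
    log_odds = intercept + beta_metascore * metascore
    odds = math.exp(log_odds)                   # P(award) / P(no award)
    p = probability_from_log_odds(log_odds)
    print(metascore, round(odds, 3), round(p, 3))

# Each extra metascore point multiplies the odds by exp(beta)
odds_ratio_per_point = math.exp(beta_metascore)
```

A positive coefficient gives an odds ratio above 1 (the odds rise with the predictor), and a negative coefficient gives one below 1.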

# write models as strings and then plug in
f_any_user = 'Any_Award ~ averageRating' # probs of getting any award given score of average rating - should be 0 or negative if theory is correct
f_any_critic = 'Any_Award ~ metascore' # should be positive
f_any_simple = 'Any_Award ~ averageRating + metascore'
f_any_interaction = 'Any_Award ~ averageRating + metascore + averageRating*metascore' # take into account that they might not have unique contributions (overlap b/c of correlation, e.g. when one is high the other might be particularly low)

# use top function to print out relationships and plot (plots not as meaningful for multiple variables)
logit_and_plot(formula = f_any_user, data = movie_ratings, plot_title = f_any_user)
# repeat for all the models above - more strategic to put in a loop

# now repeat all for oscars

# now polarization
# add in polarization as potential interaction effect for both
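
The "repeat for all the models" notes above could be handled by generating the formula strings in a loop. This is one possible layout, not the notebook's own: the dictionary keys are hypothetical labels, and the `polarization` entry is just one way to bring that variable in.

```python
outcomes = ['Any_Award', 'Oscar']
predictor_sets = {
    'user': 'averageRating',
    'critic': 'metascore',
    'simple': 'averageRating + metascore',
    'interaction': 'averageRating * metascore',  # a * b expands to a + b + a:b in formula syntax
    'polarization': 'polarization',
}

# One formula per (outcome, predictor set) pair
formulas = {
    f'{outcome}_{name}': f'{outcome} ~ {rhs}'
    for outcome in outcomes
    for name, rhs in predictor_sets.items()
}

for label, formula in formulas.items():
    print(label, '->', formula)
    # In the notebook, each formula could then be fit with:
    # logit_and_plot(formula=formula, data=movie_ratings, plot_title=formula)
```

Because `polarization` is defined as `metascore - averageRating`, a model that includes all three as main effects would be perfectly collinear, so it is kept in its own formula here.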

# movie difficulty as fancier/elite
movie_difficulty = imdb_movies_final[['tconst', 'runtimeMinutes', 'genres', 'rated', 'plot']][
    (~imdb_movies_final['rated'].isnull()) & (~imdb_movies_final['plot'].isnull())]# don't want entries with no rating or no plot
# merge difficulty df with ratings df to identify whether won oscar
ratings_difficulty = pd.merge(movie_difficulty, movie_ratings, on = 'tconst')
ratings_difficulty['runtimeMinutes'] = pd.to_numeric(ratings_difficulty['runtimeMinutes'])

# use the type function to check whether a value is a string or a number: floats are numbers with decimal points, ints are integers, and strings are text
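
A quick way to see why the conversion matters is shown below. The values are made up, and `\N` stands in for a missing runtime on the assumption that the IMDb marker survives into this column. If it does, `errors='coerce'` turns it into `NaN` instead of raising an error.

```python
import pandas as pd

runtimes = pd.Series(['120', '95', '\\N'])  # made-up values; '\N' stands in for a missing entry
print(type(runtimes.iloc[0]))               # <class 'str'>

numeric = pd.to_numeric(runtimes, errors='coerce')  # unparseable entries become NaN
print(numeric.dtype)                        # float64
print(numeric.isna().sum())                 # 1
```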

