# Angad Kalra
# SN: 1005134999


# Primary Questions
* Q1: For all active full-time NYT movie critics, how often do they correctly predict box office hits?
* Q2: Does movie review sentiment align with whether the movie was a critic's pick or not?
* Q3: What characteristics are common among movies that do really well? Among movies that do poorly?

# TO DO:
* Get all data into dataframes
* Go through each Q and complete
* Write Report

# Data Collection

In [32]:
import numpy as np
import pandas as pd
import requests, json, os, sys

# Sample data collection code

# API urls & keys
nyt_url = "https://api.nytimes.com/svc/movies/v2"
nyt_apikey = "72aaefb1f009451e986a0e446468f649"

tmdb_url = "https://api.themoviedb.org/3"
tmdb_apikey = "49200255c3dc5d6af15e04656ea5f7c4"

# List of critics
url = nyt_url + "/critics/full-time.json"
critics_res = requests.get(url, params={"api-key": nyt_apikey})
if critics_res.status_code == 200:
    critics_res = json.loads(critics_res.text)
    critics = [x["display_name"] for x in critics_res["results"]]
else:
    print(json.loads(critics_res.text))

# List of genres
genres_url = tmdb_url + "/genre/movie/list"
genres_resp = requests.get(genres_url, params={"api_key": tmdb_apikey})
if genres_resp.status_code == 200:
    genres = json.loads(genres_resp.text)
    genres = [x["name"] for x in genres["genres"]]
else:
    print(json.loads(genres_resp.text))

In [None]:
# Table of movies + financial outcome
url = nyt_url + "/reviews/search.json"
critics_list = []
movie_titles = []
box_office_total = []
box_office_hit = []

# Get critic pick movies for each critic + info and create dataframe.
for c in critics:
    resp = requests.get(url, params={"api-key": nyt_apikey, "critics-pick": "Y", "reviewer": c})
    if resp.status_code == 200:
        resp = json.loads(resp.text)
    else:
        continue
        
    results = resp["results"]
    
    for r in results:
        tmdb_resp = requests.get(tmdb_url + "/search/movie", 
                            params={"api_key": tmdb_apikey, "query": r["display_title"], 
                                    "primary_release_year": int(r["publication_date"][0:4]) })
        
        if tmdb_resp.status_code == 200:
            tmdb_resp = json.loads(tmdb_resp.text)

            if (tmdb_resp["total_results"] > 0):
                mid = tmdb_resp["results"][0]["id"]
                movie_info = requests.get(tmdb_url + "/movie/{}".format(mid), params={"api_key": tmdb_apikey})

                if movie_info.status_code == 200:
                    movie_info = json.loads(movie_info.text)
                    
                    if (movie_info["revenue"] > 0):
                        critics_list.append(c)
                        movie_titles.append(movie_info["title"])
                        box_office_total.append(movie_info["revenue"])
                        # 1 if the movie grossed at least $100M, otherwise 0
                        box_office_hit.append(1 if movie_info["revenue"] >= 100000000 else 0)


In [None]:
# Create DataFrame from critics, movies, box office revenue. 
# The builtin bool is used because the np.bool alias was removed in NumPy 1.24.
df_dict = {"critic": pd.Series(critics_list, dtype=str), "movie_title": pd.Series(movie_titles, dtype=str), 
                        "box_office_total": pd.Series(box_office_total, dtype=np.int64), "box_office_hit": pd.Series(box_office_hit, dtype=bool)}
box_office_movies = pd.DataFrame(df_dict)

In [138]:
# Table of movies + review sentiment
from bs4 import BeautifulSoup
from textblob import TextBlob

url = nyt_url + "/reviews/search.json"
movie_titles = []
review_sentiment = []
critic_pick = []

for c in critics:
    resp = requests.get(url, params={"api-key": nyt_apikey, "reviewer": c})
    if resp.status_code == 200:
        resp = json.loads(resp.text)
    else:
        continue
        
    results = resp["results"]
    
    # For each review by critic, get the review using url
    for r in results:
        review_url = r["link"]["url"]
        resp = requests.get(review_url)
        
        if resp.status_code == 200:
            resp = resp.text
            soup = BeautifulSoup(resp, 'html.parser')
            article = soup.find('section', attrs={'name': "articleBody"})
            # Skip pages with no article body; soup.find returns None in that case
            paragraphs = article.find_all('p', class_="css-1xl4flh e2kc3sl0") if article else []
            
            if (len(paragraphs) > 0):
                review = []
                
                for p in paragraphs:
                    review.append(p.text)
                
                # Join with a space so words at paragraph boundaries are not fused together
                review = " ".join(review)
                review = TextBlob(review)
                
                if review.sentiment.polarity > 0.10:
                    review_sentiment.append("positive")
                elif review.sentiment.polarity >= -0.10:
                    review_sentiment.append("neutral")
                else:
                    review_sentiment.append("negative")
                
                movie_titles.append(r["display_title"])
                critic_pick.append(r["critics_pick"])

In [139]:
# Create DataFrame from movies, review sentiment and critic's pick. 
# The builtin bool is used because the np.bool alias was removed in NumPy 1.24.
df_dict = {"movie_title": pd.Series(movie_titles, dtype=str), 
               "review_sentiment": pd.Series(review_sentiment, dtype=str), 
                   "critic_pick": pd.Series(critic_pick, dtype=bool)}
movie_review_sentiment = pd.DataFrame(df_dict)

In [29]:
# Top 100 revenue movie characteristics
import time

top100_url = tmdb_url + "/discover/movie"
top100_movies = []

# Fetch pages 1-5 of the highest-revenue movies released since 2016.
for page in range(1, 6):
    # Pause before pages 3 and 5 so the requests are spread out over time.
    if page in (3, 5):
        time.sleep(15)

    resp = requests.get(top100_url, params={"api_key": tmdb_apikey, "sort_by": "revenue.desc", "page": page,
                                           "primary_release_date.gte": "2016-01-01"})
    if resp.status_code != 200:
        print(json.loads(resp.text))
        continue

    for r in json.loads(resp.text)["results"]:
        mid = r["id"]
        movie_info = requests.get(tmdb_url + "/movie/{}".format(mid), 
                                  params={"api_key": tmdb_apikey})

        if movie_info.status_code == 200:
            movie_info = json.loads(movie_info.text)
            top100_movies.append({"budget": movie_info["budget"], "genres": movie_info["genres"], 
                                  "release_date": movie_info["release_date"],
                                  "revenue": movie_info["revenue"], "title": movie_info["title"]})
        else:
            print("movie request with id {} didn't work".format(mid))


movie request with id 381890 didn't work
movie request with id 246655 didn't work
movie request with id 283366 didn't work
movie request with id 402900 didn't work


In [31]:
top100_movies[50]

{'budget': 105000000,
 'genres': [{'id': 12, 'name': 'Adventure'},
  {'id': 16, 'name': 'Animation'},
  {'id': 10751, 'name': 'Family'},
  {'id': 35, 'name': 'Comedy'}],
 'release_date': '2016-06-23',
 'revenue': 408579038,
 'title': 'Ice Age: Collision Course'}

In [24]:
# Bottom 100 revenue movie characteristics
import time

bottom100_url = tmdb_url + "/discover/movie"
bottom100_movies = []

# Fetch pages 1-5 of the lowest-revenue movies released since 2016.
for page in range(1, 6):
    # Pause before pages 3 and 5 so the requests are spread out over time.
    if page in (3, 5):
        time.sleep(15)

    resp = requests.get(bottom100_url, params={"api_key": tmdb_apikey, "sort_by": "revenue.asc", "page": page,
                                           "primary_release_date.gte": "2016-01-01"})
    if resp.status_code != 200:
        print(json.loads(resp.text))
        continue

    for r in json.loads(resp.text)["results"]:
        mid = r["id"]
        movie_info = requests.get(tmdb_url + "/movie/{}".format(mid), 
                                  params={"api_key": tmdb_apikey})

        if movie_info.status_code == 200:
            movie_info = json.loads(movie_info.text)
            bottom100_movies.append({"budget": movie_info["budget"], "genres": movie_info["genres"], 
                                     "release_date": movie_info["release_date"],
                                     "revenue": movie_info["revenue"], "title": movie_info["title"]})
        else:
            print("movie request with id {} didn't work".format(mid))


In [28]:
bottom100_movies[79]

{'budget': 300,
 'genres': [{'id': 53, 'name': 'Thriller'},
  {'id': 18, 'name': 'Drama'},
  {'id': 27, 'name': 'Horror'}],
 'release_date': '2018-08-03',
 'revenue': 0,
 'title': 'Heartbreaker'}

* For question #1, I'm going to create a table of each critic's movies along with how much money each movie made in theatres. I will use the NYT and Box Office Mojo APIs for this info.
* For question #2, I'm going to scrape the NYT movie reviews from the URL provided in the API response and create a dataset that has the movie info, the review, its sentiment analysis result, and whether it was a critic's pick. I will use the NYT API for this info.
* For question #3, I'll create two datasets: one for movies that did well and one for movies that performed poorly. I will use the NYT, OMDb, and Box Office Mojo APIs for this.

# EDA

* Q1: I'm going to create a list of picks for each critic and see how many of their picks performed well at the box office. I will define a "box office hit" as a movie that made more than 100 million dollars domestically.
* Q2: After scraping the reviews off the web, I will run sentiment analysis on the reviews using popular Python packages (e.g. TextBlob) and classify them. I will then compare the results to whether or not the movie was a critic's pick. Plotting the data seems like a good idea right now because it will allow me to see if there is a somewhat sigmoid relationship between the two variables. My plot will be 2D, with the x-axis being the different sentiment results and the y-axis being binary (+1 for critic pick, 0 otherwise).
* Q3: For each table (movies that did well, movies that did poorly), I will perform something similar to a set intersection among their characteristics and see what they have in common. I will also remove any characteristics that appear in both tables.
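
The Q1 and Q3 steps above can be sketched in plain Python. This is only a rough illustration under stated assumptions: the function names `hit_rate_by_critic` and `distinctive_genres` are hypothetical, the sample rows are made up, and for Q3 the sketch reads "something similar to a set intersection" loosely, as collecting the genres seen in each table and then dropping the ones that appear in both.

```python
from collections import defaultdict


def hit_rate_by_critic(rows):
    """Share of each critic's picks that were box office hits (Q1).

    rows: iterable of (critic, box_office_hit) pairs, with the hit flag 0/1.
    """
    picks = defaultdict(int)
    hits = defaultdict(int)
    for critic, is_hit in rows:
        picks[critic] += 1
        hits[critic] += int(bool(is_hit))
    return {critic: hits[critic] / picks[critic] for critic in picks}


def distinctive_genres(top_movies, bottom_movies):
    """Genres seen in only one of the two tables (Q3).

    Each movie is a dict with a "genres" list of {"id", "name"} dicts,
    the same shape as the entries stored in top100_movies above.
    """
    top = {g["name"] for m in top_movies for g in m["genres"]}
    bottom = {g["name"] for m in bottom_movies for g in m["genres"]}
    # Genres present in both tables carry no signal, so they are removed.
    return top - bottom, bottom - top


# Made-up rows, for illustration only
rows = [("Critic A", 1), ("Critic A", 0), ("Critic B", 0), ("Critic A", 1)]
rates = hit_rate_by_critic(rows)

top = [{"title": "T1", "genres": [{"id": 12, "name": "Adventure"},
                                  {"id": 35, "name": "Comedy"}]}]
bottom = [{"title": "B1", "genres": [{"id": 27, "name": "Horror"},
                                     {"id": 35, "name": "Comedy"}]}]
only_top, only_bottom = distinctive_genres(top, bottom)
```

With real data, the `rows` input could be built from the `critic` and `box_office_hit` columns of `box_office_movies`, and the two movie lists would be `top100_movies` and `bottom100_movies`.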

# Modelling

* Q1: I don't think I need any formal model here. I'm just going to look at the accuracy for each critic.

* Q2: Perform logistic regression. 

* Q3: I'm thinking of using PCA to find which "features" of movies are the most influential when it comes to determining revenue. 
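
As a rough idea of what the Q2 logistic regression could look like, here is a minimal sketch that fits a one-feature model by batch gradient descent. It is written without scikit-learn only to stay self-contained; a library implementation would be the more likely choice in practice. The names `sigmoid` and `fit_logistic`, the learning rate, the epoch count, and the data are all assumptions for illustration, and the sketch uses the raw TextBlob polarity score as the feature rather than the three sentiment labels.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. z
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Made-up polarity scores and critic's-pick labels, for illustration only
polarity = [-0.30, -0.15, -0.05, 0.00, 0.05, 0.12, 0.20, 0.35]
picked = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(polarity, picked)
p_low = sigmoid(w * -0.30 + b)   # predicted pick probability for a negative review
p_high = sigmoid(w * 0.35 + b)   # predicted pick probability for a positive review
```

A positive fitted `w` would mean that more positive reviews are more likely to be critic's picks, which is the sigmoid-shaped relationship the EDA plot is meant to check.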