**Project 2 to-dos:**
1. Write functions to pull title urls - **DONE**
2. Test functions on American/Japanese animated movie search (do they pull all the necessary IMDb movie URLs?) - **DONE**
3. Read existing literature on box office revenue drivers
4. Create a list of necessary features
5. Create functions to scrape all the necessary data from each URL
6. Put all data into dataframe
7. Train models
8. Cross-validate models 
9. Test models
10. Explain/visualize results
11. Create PPT

**Project 2 Steps:**    
1. Literature review
2. Web scraping 
3. Data cleaning

**Project 2 Questions:**
1. What to do about animated films made in both Japan and the US? Take them out of the dataset, or include them in both the Japanese and the American regression?
    * Take them out of the dataset to isolate the differences between Japanese and American animated films

**List of features:**
* Title - **DONE**
* Runtime - **DONE**
* Budget - **DONE**
* MPAA rating (e.g. G, PG, PG-13, R) - **DONE**
* Release date - **DONE**
    * Christmas/New Year release (December and first week of January)
    * Summer release (June, July, August) 
* Sub-genre? - leave out for now
* IMDb user rating (a weighted average score of user ratings; the weights are not disclosed) - **DONE**
* \# of IMDb user ratings
* \# of Oscar wins
* Number of awards won in general (since there are a lot of animation specific awards)
* Number of nominations in general
* Metascore (a weighted average score of critic ratings; the weights are not disclosed)
* What are the American and Japanese animation equivalents of the Oscars?
    * Annie Awards 
    * Japan Academy Awards
* Sequel - not available on IMDb
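
The two release-window features listed under "Release date" can be derived from a parsed date. The following is a minimal sketch, assuming that "first week of January" means January 1 through 7; the function name `release_flags` and the keys `holiday_release` and `summer_release` are hypothetical and not part of the notebook.

```python
from datetime import date


def release_flags(release_date):
    """Return the two seasonal indicator features for a release date."""
    # Christmas/New Year: all of December plus January 1-7 (assumed cutoff).
    holiday = release_date.month == 12 or (
        release_date.month == 1 and release_date.day <= 7)
    # Summer: June, July, August.
    summer = release_date.month in (6, 7, 8)
    return {'holiday_release': holiday, 'summer_release': summer}


release_flags(date(2016, 8, 26))
```

Keeping both flags as booleans means they can be passed to a regression as 0/1 dummy variables without further encoding.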

In [56]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import datetime
from bs4 import BeautifulSoup
import requests
import re

%config InlineBackend.figure_format = 'svg'
%matplotlib inline
sns.set(color_codes=True)
plt.style.use('seaborn-colorblind')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 2)

Search: https://www.imdb.com/search/title/  
Title type: Feature film  
Release date: up to July 1, 2020  
Genres: Animation  
Countries: Japan; United States (the countries need to be searched separately)  
Display options: Detailed, 100 per page    
Adult titles: Exclude  
Sorted by: Release date descending  

Search: https://pro.imdb.com/find?q=&showRVCWidget=true&ref_=hm_nv_search#keyspace=TITLE&sort=ranking  
Type: Movie  
Status: Released  
Country: United States, Japan  
**Note to self:** I can use beautifulsoup to find the individual links for each movie (from a search page) and then do a request for each of those individual links  

In [13]:
japan_base_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
           '2020-07-01&genres=animation&countries=jp&sort=release_date,desc&count=' \
           '100&view=simple'

In [14]:
japan_base_url

'https://www.imdb.com/search/title/?title_type=feature&release_date=,2020-07-01&genres=animation&countries=jp&sort=release_date,desc&count=100&view=simple'

In [3]:
american_base_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
                    '2020-07-01&genres=animation&countries=us&sort=release_date,desc&count=' \
                    '100&view=simple'

In [4]:
japan_next_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
                '2020-07-01&genres=animation&countries=jp&view=simple&sort=release_date,desc&count=' \
                '100&start=101&ref_=adv_nxt'

In [5]:
american_next_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
                    '2020-07-01&genres=animation&countries=us&view=simple&sort=release_date,' \
                    'desc&count=100&start=101&ref_=adv_nxt'

In [6]:
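def get_search_urls(base_url, next_url):
    # The first two result pages (start=1 and start=101) are passed in explicitly
    search_urls = [base_url, next_url]
    num_titles = get_num_titles(base_url)

    # Remaining pages start at 201, 301, ...; (num_titles-1)//100 avoids
    # requesting an empty page when the count is an exact multiple of 100
    for i in range((num_titles-1)//100-1):
        search_urls.append(
            next_url.replace('start=101', 'start=' + str(201+100*i)))
    return search_urls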
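
The page arithmetic can also be expressed directly as a list of `start` offsets, which makes the edge cases (fewer than 100 titles, or an exact multiple of 100) easier to check. This is a sketch only; `page_starts` and its `per_page` parameter are hypothetical names, not part of the notebook.

```python
import math


def page_starts(num_titles, per_page=100):
    """Return the 1-based `start` value of every results page."""
    # Round up so that a partially filled last page is still requested.
    num_pages = math.ceil(num_titles / per_page)
    return [1 + per_page * page for page in range(num_pages)]


starts = page_starts(1361)
len(starts), starts[:3], starts[-1]
```

For the 1361 Japanese titles this gives 14 pages, the last one starting at 1301.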

In [7]:
def get_num_titles(base_url):
    soup = create_soup(base_url)
    mini_header = soup.find('div', class_='desc').findNext().text.split()
    num_titles = int(
        mini_header[mini_header.index('titles.')-1].replace(',', ''))
    return num_titles
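
Locating the count by token position depends on the exact wording of the header. A regular expression is a slightly more tolerant alternative; the sketch below assumes header text of the form `'1-100 of 1,361 titles.'` (the format is inferred from the code above, not confirmed), and `parse_num_titles` is a hypothetical helper.

```python
import re


def parse_num_titles(header_text):
    """Extract the total from text such as '1-100 of 1,361 titles.'"""
    # Capture the number (with optional thousands separators) before 'titles'.
    match = re.search(r'([\d,]+)\s+titles', header_text)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))


parse_num_titles('1-100 of 1,361 titles.')
```

Returning `None` when nothing matches lets the caller decide how to handle a changed page layout instead of failing with a `ValueError` from `list.index`.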

In [8]:
def create_soup(url):
    response_text = requests.get(url).text
    soup = BeautifulSoup(response_text, 'html5lib')
    return soup

In [9]:
def create_soups(urls):
    # Reuse create_soup so the request/parse logic lives in one place
    return [create_soup(url) for url in urls]

In [10]:
# I was missing the 's' in title_urls! Be careful about variable names (keep them short and simple)
def get_title_urls(soups):
    titles_urls = []
    for soup in soups:
        title_spans = soup.find(
            'div', class_='lister-list').find_all('span', class_='lister-item-header')
        for element in title_spans:
            titles_urls.append(element.find('a').get('href'))
    return titles_urls

In [11]:
# Checking if my functions work to pull all Japanese animated film urls

japan_search_urls = get_search_urls(japan_base_url, japan_next_url)
japan_search_soups = create_soups(japan_search_urls)
japan_title_urls = get_title_urls(japan_search_soups)
len(japan_title_urls)
japan_title_urls[9]
japan_title_urls[100]

1361

'/title/tt12478494/'

'/title/tt12280576/'

In [12]:
# Checking if my functions work to pull all American animated film urls

american_search_urls = get_search_urls(american_base_url, american_next_url)
american_search_soups = create_soups(american_search_urls)
american_title_urls = get_title_urls(american_search_soups)
len(american_title_urls)
american_title_urls[25]
american_title_urls[100]

1484

'/title/tt12203840/'

'/title/tt1620981/'
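
The hrefs collected above are site-relative paths, while the title pages are requested with full URLs. One way to bridge the two is `urllib.parse.urljoin`; in this sketch `IMDB_ROOT` and `to_absolute_urls` are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

IMDB_ROOT = 'https://www.imdb.com'


def to_absolute_urls(title_paths, root=IMDB_ROOT):
    """Turn relative paths like '/title/tt12478494/' into full URLs."""
    return [urljoin(root, path) for path in title_paths]


to_absolute_urls(['/title/tt12478494/', '/title/tt12280576/'])
```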

In [23]:
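your_name_url = 'https://www.imdb.com/title/tt5311514/'
toy_story_2_url = 'https://www.imdb.com/title/tt0120363/'
your_name_soup = create_soup(your_name_url)  # used by the checks below
toy_story_2_soup = create_soup(toy_story_2_url)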

In [119]:
def get_title(soup):
    if soup.find('h1'):
        raw_text = soup.find('h1').text.strip()
        return clean_title(raw_text)
    else:
        return None    

In [61]:
def get_runtime(soup):
    if soup.find('time'):
        return soup.find('time').text.strip()
    else:
        return None    
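
`get_runtime` returns the raw text of the `time` tag, which has to become a number before modelling. The sketch below assumes the text looks like `'1h 46min'`, `'46min'` or `'2h'`; that format is an assumption about the page layout, and `runtime_to_minutes` is a hypothetical helper.

```python
import re


def runtime_to_minutes(runtime):
    """Convert a string such as '1h 46min' to total minutes."""
    if runtime is None:
        return None
    # Both the hour part and the minute part are optional.
    match = re.fullmatch(r'(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?', runtime.strip())
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return 60 * hours + minutes


runtime_to_minutes('1h 46min')
```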

In [None]:
def get_MPAA_rating(soup):
    # The rating is taken to be the first token of the 'subtext' line
    if soup.find('div', class_='subtext'):
        return soup.find('div', class_='subtext').text.strip().split()[0]
    else:
        return None

In [None]:
def get_user_rating(soup):
    if soup.find('span', itemprop='ratingValue'):
        return soup.find('span', itemprop='ratingValue').text
    else:
        return None

In [58]:
def clean_title(string):
    # Drop the first non-breaking space and everything after it
    return re.sub(r'\xa0.+', '', string)

In [94]:
your_name_soup.find(text='Budget:').findParent().findParent().text.strip()

'Budget:JPY370,000,000\n            (estimated)'

In [112]:
def get_budget(soup):
    if soup.find(text='Budget:'):
        raw_text = soup.find(
            text='Budget:').findParent().findParent().text.strip()
        return clean_budget(raw_text)
    else:
        return None

In [111]:
def clean_budget(string):
    string = string.replace('Budget:', '').replace(',', '')
    return re.sub('\\n.+', '', string)
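
After cleaning, the budget is still a string with a currency prefix, such as `'JPY370000000'` for the example above. Budgets in different currencies cannot be compared directly, so it may help to separate the two parts first. This is a sketch; `split_budget` is a hypothetical helper and the dollar-denominated input in the comment is an illustrative value, not scraped data.

```python
import re


def split_budget(budget):
    """Split 'JPY370000000' or '$90000000' into (currency, amount)."""
    if budget is None:
        return None, None
    # Lazily take the non-digit prefix as the currency, then the digits.
    match = re.fullmatch(r'(\D*?)\s*(\d+)', budget.strip())
    if match is None:
        return None, None
    currency, amount = match.groups()
    return currency, int(amount)


split_budget('JPY370000000')
```

Currency conversion itself is left out here, since it needs an exchange-rate source that the notebook does not specify.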

In [121]:
def get_user_rating_count(soup):
    if soup.find(itemprop='ratingCount'):
        raw_text = soup.find(itemprop='ratingCount').text
        return remove_commas(raw_text)
    else:
        return None

In [120]:
def remove_commas(string):
    return string.replace(',', '')

In [122]:
get_user_rating_count(your_name_soup)

'173449'

In [123]:
get_user_rating_count(toy_story_2_soup)

'511459'
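
To-dos 5 and 6 (scrape every feature from each URL, then build a dataframe) could be organised around a mapping from column name to getter function. The sketch below uses stand-in getters on plain strings so that it runs on its own; `scrape_title`, `scrape_titles` and `demo_getters` are hypothetical names. In the notebook the mapping would hold functions such as `get_title` and `get_budget`, the inputs would be soups, and `pd.DataFrame(rows)` would turn the list of dicts into a dataframe.

```python
def scrape_title(page, getters):
    """Apply each getter to one page and collect the results in a dict."""
    return {name: getter(page) for name, getter in getters.items()}


def scrape_titles(pages, getters):
    """Build one row (dict) per page."""
    return [scrape_title(page, getters) for page in pages]


# Stand-in getters operating on strings instead of BeautifulSoup objects.
demo_getters = {'length': len, 'first': lambda s: s[0]}
rows = scrape_titles(['abc', 'de'], demo_getters)
rows
```

Because every getter above returns `None` when an element is missing, rows built this way keep the same keys for every title, which keeps the dataframe columns aligned.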