**Project 2 to-dos:**
* Write functions to pull title urls - **DONE**
* Test functions on American/Japanese animated movie search (do they pull all the necessary IMDb movie URLs?) - **DONE**
* Read existing literature on box office revenue drivers - **DONE**
* Create a list of necessary features
* Create functions to scrape all the necessary data from each URL
* Put all data into dataframe
* Put functions into .py files
* Read through linear regression notebooks
* Train models
* Cross-validate models 
* Test models
* Explain/visualize results
* Create PPT
    * In the appendix, note that I wasn't able to parse the exact release dates for Japanese movies: sometimes the release date shown for a Japanese movie was the American release date, which tends to come a year or two after the Japanese one. The Japanese release date has to be parsed from a separate 'release date' URL, and I ran out of time.

**Project 2 Steps:**    
1. Literature review
2. Web scraping 
3. Data cleaning

**Project 2 Questions:**
1. What should I do about animated films made in both Japan and the US? Take them out of the dataset, or include them in both the Japanese and the American regressions?
    * Take them out of the dataset to isolate the differences between Japanese and American animated films
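
A minimal sketch of that filtering step, applied to the scraped records before they go into a dataframe. It assumes each record carries a hypothetical `countries` list with every production country (the `get_country` function in this notebook returns only the first country listed), and that the United States is labelled `'USA'`; both are assumptions for illustration.

```python
def drop_coproductions(movies):
    """Keep only films produced in exactly one of Japan or the United States."""
    targets = {'Japan', 'USA'}  # assumed country labels
    kept = []
    for movie in movies:
        # 'countries' is a hypothetical key holding all production countries
        produced_in = targets & set(movie['countries'])
        if len(produced_in) == 1:
            kept.append(movie)
    return kept


# Placeholder records, not scraped data
sample = [
    {'title': 'Film A', 'countries': ['Japan']},
    {'title': 'Film B', 'countries': ['USA', 'Japan']},
    {'title': 'Film C', 'countries': ['USA', 'Canada']},
]
print([m['title'] for m in drop_coproductions(sample)])
```

Co-productions with a third country (such as Film C) are kept here, since only Japan/US overlap blurs the comparison.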

**List of features:**
* Title - **DONE**
* Runtime - **DONE**
* Budget - **DONE**
* MPAA rating (e.g. G, PG, PG-13, R) - **DONE**
* Release date - **DONE**
    * Christmas/New Year release (December and first week of January)
    * Summer release (June, July, August) 
    * \# of years since release
    * Differentiate in my larger function call between US and Japanese release date depending on country of the film
* Genres - **DONE**
* IMDb user rating (a weighted average score of user ratings; the weights are not disclosed) - **DONE**
* \# of IMDb user ratings - **DONE**
* \# of Oscar wins - **DONE**
* \# of total awards (since there are a lot of animation specific awards) - **DONE**
* \# of Oscar nominations - **LEAVE OUT FOR NOW**
* \# of total nominations - **LEAVE OUT FOR NOW**
* Country - **DONE**
* Metascore (a weighted average score of critic ratings; the weights are not disclosed)
* What are the American and Japanese animation equivalents of the Oscars?
    * Annie Awards 
    * Japan Academy Awards
* Sequel - **NOT AVAILABLE ON IMDb**
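
The release-date features in the list above could be derived roughly as follows. This is a sketch, not the final feature code: the function name, the output keys, and the `as_of` reference date (set to the search cutoff of July 1, 2020) are assumptions, and years since release is counted as a difference in calendar years.

```python
import datetime


def release_features(release_date, as_of=datetime.date(2020, 7, 1)):
    """Derive holiday, summer, and age features from a release date."""
    month, day = release_date.month, release_date.day
    return {
        # December and the first week of January
        'holiday_release': month == 12 or (month == 1 and day <= 7),
        # June, July, August
        'summer_release': month in (6, 7, 8),
        # Difference in calendar years, not exact elapsed time
        'years_since_release': as_of.year - release_date.year,
    }


print(release_features(datetime.datetime(2001, 7, 20)))
```

The function accepts either a `datetime.date` or a `datetime.datetime`, so it works on the values returned by `dateutil.parser.parse`.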

In [325]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import datetime
from bs4 import BeautifulSoup
import requests
import re
import dateutil.parser

%config InlineBackend.figure_format = 'svg'
%matplotlib inline
sns.set(color_codes=True)
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 2)

Search: https://www.imdb.com/search/title/  
Title type: Feature film  
Release date: up to July 1, 2020  
Genres: Animation  
Countries: Japan; United States (the countries need to be searched separately)  
Display options: Detailed, 100 per page    
Adult titles: Exclude  
Sorted by: Release date descending  
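
As a sketch, the same search could be assembled from its parameters instead of typed out by hand. The helper name `build_search_url` is hypothetical. The parameter values mirror the query strings used in this notebook's code cells (which use `view=simple`), and the `start` offset is optional for later result pages.

```python
from urllib.parse import urlencode


def build_search_url(country, start=None):
    """Assemble an IMDb advanced-search URL for animated feature films."""
    params = {
        'title_type': 'feature',
        'release_date': ',2020-07-01',  # open start, up to July 1, 2020
        'genres': 'animation',
        'countries': country,           # 'jp' or 'us', searched separately
        'sort': 'release_date,desc',
        'count': 100,
        'view': 'simple',
    }
    if start is not None:
        params['start'] = start         # 1-based index of the first result
    # safe=',' leaves the commas in the values unescaped
    return 'https://www.imdb.com/search/title/?' + urlencode(params, safe=',')


print(build_search_url('jp'))
```

Building the URL from a dict keeps the two country searches identical apart from the `countries` value.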

Search: https://pro.imdb.com/find?q=&showRVCWidget=true&ref_=hm_nv_search#keyspace=TITLE&sort=ranking  
Type: Movie  
Status: Released  
Country: United States, Japan  
**Note to self:** I can use beautifulsoup to find the individual links for each movie (from a search page) and then do a request for each of those individual links  

In [13]:
japan_base_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
    '2020-07-01&genres=animation&countries=jp&sort=release_date,desc&count=' \
    '100&view=simple'

In [14]:
japan_base_url

'https://www.imdb.com/search/title/?title_type=feature&release_date=,2020-07-01&genres=animation&countries=jp&sort=release_date,desc&count=100&view=simple'

In [3]:
american_base_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
                    '2020-07-01&genres=animation&countries=us&sort=release_date,desc&count=' \
                    '100&view=simple'

In [4]:
japan_next_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
    '2020-07-01&genres=animation&countries=jp&view=simple&sort=release_date,desc&count=' \
    '100&start=101&ref_=adv_nxt'

In [5]:
american_next_url = 'https://www.imdb.com/search/title/?title_type=feature&release_date=,' \
                    '2020-07-01&genres=animation&countries=us&view=simple&sort=release_date,' \
                    'desc&count=100&start=101&ref_=adv_nxt'

In [6]:
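def get_search_urls(base_url, next_url):
    # The first two result pages are passed in; later pages are built by
    # replacing the 'start=101' offset with 201, 301, ...
    search_urls = [base_url, next_url]
    num_titles = get_num_titles(base_url)
    # Ceiling division (100 titles per page) avoids requesting an empty
    # extra page when the count is an exact multiple of 100
    num_pages = -(-num_titles // 100)

    for i in range(num_pages - 2):
        search_urls.append(
            next_url.replace('start=101', 'start=' + str(201 + 100 * i)))
    return search_urls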

In [7]:
def get_num_titles(base_url):
    soup = create_soup(base_url)
    mini_header = soup.find('div', class_='desc').findNext().text.split()
    num_titles = int(
        mini_header[mini_header.index('titles.')-1].replace(',', ''))
    return num_titles

In [8]:
def create_soup(url):
    response_text = requests.get(url).text
    soup = BeautifulSoup(response_text, 'html5lib')
    return soup

In [9]:
def create_soups(urls):
    # Reuse create_soup so the fetching and parsing logic lives in one place
    return [create_soup(url) for url in urls]

In [10]:
# I was missing the 's' in title_urls! Be careful about variable names (keep them short and simple)
def get_title_urls(soups):
    title_urls = []
    for soup in soups:
        title_spans = soup.find(
            'div', class_='lister-list').find_all('span', class_='lister-item-header')
        for element in title_spans:
            title_urls.append(element.find('a').get('href'))
    return title_urls

In [304]:
# Checking if my functions work to pull all Japanese animated film urls

japan_search_urls = get_search_urls(japan_base_url, japan_next_url)
japan_search_soups = create_soups(japan_search_urls)
japan_title_urls = get_title_urls(japan_search_soups)
len(japan_title_urls)
japan_title_urls[9]
japan_title_urls[100]

1361

'/title/tt12478494/'

'/title/tt12280576/'

In [305]:
# Checking if my functions work to pull all American animated film urls

american_search_urls = get_search_urls(american_base_url, american_next_url)
american_search_soups = create_soups(american_search_urls)
american_title_urls = get_title_urls(american_search_soups)
len(american_title_urls)
american_title_urls[25]
american_title_urls[100]

1485

'/title/tt12203840/'

'/title/tt1620981/'

In [137]:
your_name_url = 'https://www.imdb.com/title/tt5311514/'
toy_story_2_url = 'https://www.imdb.com/title/tt0120363/'
toy_story_3_url = 'https://www.imdb.com/title/tt0435761/'

In [337]:
def get_title(soup):
    if soup.find('h1'):
        raw_text = soup.find('h1').text.strip()
        return clean_title(raw_text)
    return None

In [338]:
def get_runtime(soup):
    if soup.find('time'):
        raw_runtime = soup.find('time').text.strip()
        return runtime_to_minutes(raw_runtime)
    return None

In [355]:
def get_mpaa_rating(soup):
    ratings = ['G', 'PG', 'PG-13', 'R', 'TV-PG', 'TV-MA']
    if soup.find('div', class_='subtext'):
        rating = soup.find('div', class_='subtext').text.strip().split()[0]
        if rating in ratings:
            return rating
    return None

In [340]:
def get_user_rating(soup):
    if soup.find('span', itemprop='ratingValue'):
        return float(soup.find('span', itemprop='ratingValue').text)
    return None

In [58]:
def clean_title(string):
    return re.sub('\\xa0.+', '', string)

In [394]:
def get_budget(soup):
    if soup.find(text='Budget:'):
        raw_text = soup.find(
            text='Budget:').findParent().findParent().text.strip()
        budget = clean_budget(raw_text)
        if 'JPY' in budget:
            return yen_to_int(budget)
        return dollars_to_int(budget)
    return None

In [393]:
def clean_budget(string):
    string = string.replace('Budget:', '')
    string = remove_commas(string)
    return re.sub('\\n.+', '', string)

In [342]:
def get_user_rating_count(soup):
    if soup.find(itemprop='ratingCount'):
        raw_text = soup.find(itemprop='ratingCount').text
        return int(remove_commas(raw_text))
    return None

In [120]:
def remove_commas(string):
    return string.replace(',', '')

In [343]:
def get_genres(soup):
    if soup.find(text='Genres:'):
        raw_text = soup.find(
            text='Genres:').findParent().findParent().text.strip()
        return clean_genres(raw_text)
    return None

In [134]:
def clean_genres(string):
    string = string.replace('Genres:', '')
    string = re.sub('\\n ', '', string)
    return re.sub(r'\xa0\|', ', ', string)

In [349]:
def get_oscar_wins(soup):
    blurb = soup.find('span', class_='awards-blurb')
    if blurb:
        raw_text = blurb.text.strip()
        # Require 'Oscar' as well, so a 'Won ...' blurb about a different
        # award is not counted as Oscar wins
        if 'Won' in raw_text and 'Oscar' in raw_text:
            for s in raw_text.split():
                if s.isdigit():
                    return int(s)
    return None

In [194]:
toy_story_3_soup.find(
    'span', class_='awards-blurb').findNextSibling().text.strip().split()

['Another', '60', 'wins', '&', '95', 'nominations.']

In [353]:
def get_non_oscar_wins(soup):
    blurb = soup.find('span', class_='awards-blurb')
    if not blurb:
        return None
    # When the blurb is about Oscars, the other wins are in the next sibling
    if 'Oscar' in blurb.text:
        blurb = blurb.findNextSibling()
        if not blurb:
            return None
    raw_text = blurb.text.strip()
    if 'win' in raw_text:
        for s in raw_text.split():
            if s.isdigit():
                return int(s)
    return None

In [228]:
def get_metascore(soup):
    if soup.find('div', class_='metacriticScore'):
        return int(soup.find('div', class_='metacriticScore').text.strip())
    else:
        return None

In [328]:
def get_usa_release_date(soup):
    if soup.find(title='See more release dates'):
        raw_text = soup.find(
            title='See more release dates').text.strip().replace(' (USA)', '')
        return to_datetime(raw_text)
    else:
        return None

In [272]:
for element in spirited_away_release_soup.find_all('a'):
    if 'Japan' in element.text:
        print(element)

<a href="/calendar/?region=jp">Japan
</a>


In [277]:
spirited_away_release_soup.find(href='/calendar/?region=jp').findNext().text

'20 July 2001'

In [278]:
your_name_release_url = 'https://www.imdb.com/title/tt5311514/releaseinfo'
your_name_release_soup = create_soup(your_name_release_url)

In [279]:
your_name_release_soup.find(href='/calendar/?region=jp').findNext().text

'26 August 2016'

In [327]:
def get_japan_release_date(soup):
    if soup.find(href='/calendar/?region=jp'):
        raw_text = soup.find(href='/calendar/?region=jp').findNext().text
        return to_datetime(raw_text)
    else:
        return None

In [281]:
get_japan_release_date(your_name_release_soup)

'26 August 2016'

In [282]:
get_japan_release_date(spirited_away_release_soup)

'20 July 2001'

In [285]:
get_usa_release_date(toy_story_3_soup)

'18 June 2010'

In [298]:
def get_country(soup):
    for element in soup.find_all('h4'):
        if 'Country:' in element:
            return element.findNext().text
    return None

In [378]:
def get_global_gross(soup):
    for element in soup.find_all('h4'):
        if 'Cumulative Worldwide Gross:' in element:
            raw_text = element.findParent().text.strip()
            raw_text = raw_text.replace('Cumulative Worldwide Gross: ', '')
            raw_text = remove_commas(raw_text)
            return dollars_to_int(raw_text)
    return None

In [316]:
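def runtime_to_minutes(raw_runtime):
    # Accept '2h 5min', hour-only ('2h'), and minute-only ('45min') strings;
    # the split-based version raised IndexError on the last two
    match = re.fullmatch(
        r'\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*', raw_runtime)
    if not match or not any(match.groups()):
        return None
    hours, minutes = match.groups(default='0')
    return int(hours) * 60 + int(minutes)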

In [332]:
def to_datetime(datestring):
    return dateutil.parser.parse(datestring)

In [377]:
def dollars_to_int(dollars_string):
    dollars_string = dollars_string.replace('$', '')
    return int(dollars_string)

In [392]:
def yen_to_int(yen_string):
    # Convert a JPY budget string to whole US dollars at a fixed rate
    yen_conversion = 106.9  # yen per US dollar
    yen_string = yen_string.replace('JPY', '')
    return round(int(yen_string) / yen_conversion)

In [391]:
yen_to_int('JPY370000000')

3461179

In [333]:
print(to_datetime('March 6, 1998'))

1998-03-06 00:00:00


In [314]:
runtime_to_minutes('2h 5min')

125

In [372]:
def get_movie_dict(link):
    base_url = 'https://www.imdb.com'
    url = base_url + link
    soup = create_soup(url)

    headers = ['title', 'country', 'runtime_minutes', 'budget', 'global_gross',
               'mpaa_rating', 'japan_release_date', 'usa_release_date', 'genres',
               'imdb_user_rating', 'imdb_user_rating_count', 'oscar_wins',
               'non_oscar_wins']

    title = get_title(soup)
    country = get_country(soup)
    runtime = get_runtime(soup)
    budget = get_budget(soup)
    global_gross = get_global_gross(soup)
    mpaa_rating = get_mpaa_rating(soup)

    if country == 'Japan':
        release_info_url = url + 'releaseinfo'
        release_info_soup = create_soup(release_info_url)
        japan_release_date = get_japan_release_date(release_info_soup)
        usa_release_date = None
    else:
        usa_release_date = get_usa_release_date(soup)
        japan_release_date = None

    genres = get_genres(soup)
    imdb_user_rating = get_user_rating(soup)
    imdb_user_rating_count = get_user_rating_count(soup)
    oscar_wins = get_oscar_wins(soup)
    non_oscar_wins = get_non_oscar_wins(soup)

    movie_dict = dict(
        zip(headers, [title, country, runtime, budget, global_gross,
                      mpaa_rating, japan_release_date,
                      usa_release_date, genres, imdb_user_rating,
                      imdb_user_rating_count, oscar_wins, non_oscar_wins]))

    return movie_dict

In [397]:
get_movie_dict('/title/tt0245429/')

{'title': 'Spirited Away',
 'country': 'Japan',
 'runtime_minutes': 125,
 'budget': 19000000,
 'global_gross': 350657645,
 'mpaa_rating': 'PG',
 'japan_release_date': datetime.datetime(2001, 7, 20, 0, 0),
 'usa_release_date': None,
 'genres': 'Animation, Adventure, Family, Fantasy, Mystery',
 'imdb_user_rating': 8.6,
 'imdb_user_rating_count': 619086,
 'oscar_wins': 1,
 'non_oscar_wins': 57}

In [396]:
get_movie_dict('/title/tt5311514/')

{'title': 'Your Name.',
 'country': 'Japan',
 'runtime_minutes': 106,
 'budget': 3461179,
 'global_gross': 358922706,
 'mpaa_rating': 'PG',
 'japan_release_date': datetime.datetime(2016, 8, 26, 0, 0),
 'usa_release_date': None,
 'genres': 'Animation, Drama, Fantasy, Romance',
 'imdb_user_rating': 8.4,
 'imdb_user_rating_count': 173534,
 'oscar_wins': None,
 'non_oscar_wins': 15}

In [375]:
get_movie_dict('/title/tt11433900/')

{'title': 'Gekijoban G No Reconguista II: Bellri Gekishin',
 'country': 'Japan',
 'runtime_minutes': 95,
 'budget': None,
 'global_gross': None,
 'mpaa_rating': None,
 'japan_release_date': datetime.datetime(2020, 2, 21, 0, 0),
 'usa_release_date': None,
 'genres': 'Animation',
 'imdb_user_rating': None,
 'imdb_user_rating_count': None,
 'oscar_wins': None,
 'non_oscar_wins': None}

In [395]:
get_movie_dict('/title/tt10981202/')

{'title': 'Her Blue Sky',
 'country': 'Japan',
 'runtime_minutes': 106,
 'budget': None,
 'global_gross': 4736031,
 'mpaa_rating': None,
 'japan_release_date': datetime.datetime(2019, 10, 11, 0, 0),
 'usa_release_date': None,
 'genres': 'Animation, Drama, Family, Fantasy, Music, Romance',
 'imdb_user_rating': 6.7,
 'imdb_user_rating_count': 206,
 'oscar_wins': None,
 'non_oscar_wins': None}