# Introduction
*What's this Python notebook all about?*

`Hello world!`

**This notebook will scrape some movie websites to get data, then make a model using that data.**

To be specific, here's what this Python 3 notebook will do:
1. Scrape a list of movies from IMDB
2. Scrape each movie's stats from Rotten Tomatoes
3. Make a Pandas DataFrame from scraped data
4. Make a linear regression model

---

# `import` everything!
*How to import all the necessary libraries (and install them if you haven't!)*

**1. The `requests` library is for getting the HTML code of a web page.**

Installation: `pip install requests`.

In [1]:
import requests

**2. `BeautifulSoup` makes parsing through the HTML a LOT easier!**

Installation: `pip install beautifulsoup4`.

In [2]:
from bs4 import BeautifulSoup

**3. `Pandas` is one of the most iconic Python libraries for handling data.**

Installation: `pip install pandas`

In [3]:
import pandas as pd

**4. `numpy` is another very iconic Python library.**

This helps make Python more powerful in dealing with numbers and lists of numbers.

In [4]:
import numpy as np

**5. `re` is the regex library**

It's a built-in python library for implementing [regular expressions](https://www.regular-expressions.info/).

In [5]:
import re

**6. `statsmodels` was chosen as the library of choice for modeling.**

It's not the fastest regression tool, but the output is most helpful for our current usage. (Our dataset is also small enough, too!)

In [6]:
import statsmodels.api as sm

  from pandas.core import datetools


---

# Let's compile a list of movies.
*Because the IMDB API wouldn't cooperate.*

**To perform a study related to movies, we'll first need a list of movies to gather data about.**

Since the IMDB API didn't work, we have to look for another way to get a list of movies.

Check out one of the top Google results I got:

![Check this out](1-look-the-url.png)

It's a list of movies for the year 2017! Apparently, it's just a specific search query.

Let's observe what we get when we go to the next page:

![Woah, pattern!](1-look-theres-a-pattern-labelled.png)

When you go to the next page, the URL actually follows a certain pattern based on its structure!

**We can use this to our advantage, to easily prepare a list of movies!**

Since all pages follow this pattern, we can just scrape as many pages as we want, and just change the value of page in the query!

**Let's store that website in a variable. *Note the curly braces in lieu of a number!***

We do this so we can dynamically change that number as needed later on with the `format` method!

In [7]:
movies_2017_url = 'https://www.imdb.com/search/title?year=2017,2017&title_type=feature&sort=moviemeter,asc&page={}&ref_=adv_nxt'

**To demonstrate, let's go to the first page of 2017 movies.**

We'll start with doing a get request to the page, and parsing it with BeautifulSoup.

*Note: These are only the first 1000 characters of the HTML file; the whole HTML is much too big. [Click here to see the complete HTML file.](https://pastebin.com/WhngJxbE)*

*Alternatively, use the developer tools (ctrl/cmd + Shift + C) from your favorite web browser to browse through the HTML file [of the original link](https://www.imdb.com/search/title?year=2017,2017&title_type=feature&sort=moviemeter,asc&page=1&ref_=adv_nxt) instead.*

In [8]:
page = BeautifulSoup(requests.get(movies_2017_url.format(1)).text, 'lxml')
print(page.prettify()[:1000])

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   IMDb: Most Popular Feature Films Released 2017-01-01 to 2017-12-31 - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
  

**Looking through the source code, we can observe that each row in the list is in a `lister-item-content` html element.**

Using the BeautifulSoup's `find_all` method allows us to get a list of all HTML elements with `lister-item-content` as their tag compiled as a Python list.

Protip: To make understanding the HTML file easier, check out Google Chrome's [Developer Tools](https://developers.google.com/web/tools/chrome-devtools/).

In [9]:
rows = page.find_all(class_='lister-item-content')

**Let's see what the first element of our list looks like.**

Another wonderful part of `BeautifulSoup` is that we can still use the `prettify` method with smaller components!

In [10]:
first_row = rows[0]
print(first_row.prettify()[:500])

<div class="lister-item-content">
 <h3 class="lister-item-header">
  <span class="lister-item-index unbold text-primary">
   1.
  </span>
  <a href="/title/tt1485796/?ref_=adv_li_tt">
   The Greatest Showman
  </a>
  <span class="lister-item-year text-muted unbold">
   (2017)
  </span>
 </h3>
 <p class="text-muted ">
  <span class="certificate">
   PG-13
  </span>
  <span class="ghost">
   |
  </span>
  <span class="runtime">
   105 min
  </span>
  <span class="ghost">
   |
  </span>
  <span cla


**We can easily get the title of the movie from the HTML.**

`BeautifulSoup` to the rescue!

In [11]:
first_title = first_row.a.text
first_title

'The Greatest Showman'

**We need to convert titles to follow the Rotten Tomatoes URL format.**

All lowercase letters, no punctuation marks, only underscores. The Rotten Tomatoes link of this movie is https://www.rottentomatoes.com/m/jumanji_welcome_to_the_jungle/.

In [12]:
linkified = first_title.lower()                 # 1: make the text lowercase
linkified = re.sub('[!?.\'",:]', '', linkified) # 2: remove punctuation marks
linkified = re.sub('[ -]', '_', linkified)      # 3: turn spaces or dashes into underscores
linkified                                       # 4: voila.

'the_greatest_showman'

**We can start building the Rotten Tomatoes URL of the movie!**

All movies on Rotten Tomatoes start with this format.

In [13]:
base_url = 'https://www.rottentomatoes.com/m/'

**However, there are two possible Rotten Tomatoes URLs; so, we need to try both.**

Some movies have years at the end to disambiguate different movie versions so we'll need to try multiple versions of links. (An example is Wonder Woman, which has [a recent version](https://www.rottentomatoes.com/m/wonder_woman_2017) and [an older animated version](https://www.rottentomatoes.com/m/wonder_woman).)

We'll first try the link with `_2017` at the end -- if it works, then that's the one we need. Otherwise, we'll try the title as is; if this still fails, then it doesn't follow the "base format" and will need more careful parsing.

In [14]:
url_a = base_url + linkified + '_2017'
url_b = base_url + linkified

# if the first one works, then that's the URL of the movie.
if requests.get(url_a).status_code == 200:
    url = url_a

# otherwise, try the link without the year.
elif requests.get(url_b).status_code == 200:
    url = url_b
    
# if both didn't work, we'll have to prepare the link manually.
else:
    url = 'no link'
    
print(url)

https://www.rottentomatoes.com/m/the_greatest_showman_2017


**Let's put the code we've written into a loop.**

We'll repeat what we did for the first link unto every other link in the page.

In [15]:
links = []
no_links = []

for i,row in enumerate(rows):
    # get the title
    title = row.a.text    
    
    # so we know what's happening
    print('processing {:>2}/{:>2}: {}'.format(i+1, len(rows), title))
    
    # process the link
    linkified = title.lower()
    linkified = re.sub('[!?.\'",:]', '', linkified)
    linkified = re.sub('[ -]', '_', linkified)
    
    # try the URLs
    url_a = base_url + linkified + '_2017'
    url_b = base_url + linkified

    # if the first one works, then that's the URL of the movie.
    if requests.get(url_a).status_code == 200:
        links.append(url_a)

    # otherwise, try the link without the year.
    elif requests.get(url_b).status_code == 200:
        links.append(url_b)

    # if both didn't work, we'll have to prepare the link manually.
    else:
        print('=== '+title+' failed ===')
        no_links.append(row.a)
        
print('Movies with links: {}'.format(len(links)))
print('Others (no links): {}'.format(len(no_links)))

processing  1/50: The Greatest Showman
processing  2/50: You Were Never Really Here
processing  3/50: Jumanji: Welcome to the Jungle
processing  4/50: Star Wars: Episode VIII - The Last Jedi
=== Star Wars: Episode VIII - The Last Jedi failed ===
processing  5/50: Thor: Ragnarok
processing  6/50: The Shape of Water
processing  7/50: Hostiles
processing  8/50: Justice League
processing  9/50: Coco
processing 10/50: Ghost Stories
processing 11/50: Molly's Game
processing 12/50: Hot Summer Nights
processing 13/50: Chappaquiddick
processing 14/50: Three Billboards Outside Ebbing, Missouri
processing 15/50: All the Money in the World
processing 16/50: Pitch Perfect 3
processing 17/50: The Post
processing 18/50: Murder on the Orient Express
processing 19/50: Spider-Man: Homecoming
processing 20/50: It
processing 21/50: Blade Runner 2049
processing 22/50: Órbita 9
processing 23/50: Ferdinand
processing 24/50: Call Me by Your Name
processing 25/50: Get Out
processing 26/50: Wonder
processing 27

**Now, let's loop through all of the pages.**

Let's expand our loop!

In [16]:
NUM_PAGES = 15  # 1 page = 50 movies

links = []
no_links = []

In [17]:
for i in range(1,NUM_PAGES+1):   # recall that range(1,4) yields 1,2,3.
    alert = 'now on iteration {}/{}'.format(i, NUM_PAGES)
    print(alert)
    
    page = BeautifulSoup(requests.get(movies_2017_url.format(i)).text, 'lxml')
    rows = page.find_all(class_='lister-item-content')
    
    for j,row in enumerate(rows):
        # get the title
        title = row.a.text    

        # so we know what's happening
        if j%25==0:
            print('processing {:>2}/{:>2}: {}'.format(j+1, len(rows), title))

        # process the link
        linkified = title.lower()
        linkified = re.sub('[!?.\'",:]', '', linkified)
        linkified = re.sub('[ -]', '_', linkified)

        # try the URLs
        url_a = base_url + linkified + '_2017'
        url_b = base_url + linkified

        # if the first one works, then that's the URL of the movie.
        if requests.get(url_a).status_code == 200:
            links.append(url_a)

        # otherwise, try the link without the year.
        elif requests.get(url_b).status_code == 200:
            links.append(url_b)

        # if both didn't work, we'll have to prepare the link manually.
        else:
            no_links.append(row.a)
        
        if len(links)==5000:
            break
            
    print()
    
print('Movies with links: {}'.format(len(links)))
print('Others (no links): {}'.format(len(no_links)))

now on iteration 1/15
processing  1/50: The Greatest Showman
processing 26/50: Wonder

now on iteration 2/15
processing  1/50: The Hitman's Bodyguard
processing 26/50: King Arthur: Legend of the Sword

now on iteration 3/15
processing  1/50: The Big Sick
processing 26/50: A Dog's Purpose

now on iteration 4/15
processing  1/50: Okja
processing 26/50: Newness

now on iteration 5/15
processing  1/50: The Meyerowitz Stories (New and Selected)
processing 26/50: My Little Pony: The Movie

now on iteration 6/15
processing  1/50: Aftermath
processing 26/50: Rings

now on iteration 7/15
processing  1/50: The Man from Earth: Holocene
processing 26/50: Woody Woodpecker

now on iteration 8/15
processing  1/50: First They Killed My Father
processing 26/50: Midnighters

now on iteration 9/15
processing  1/50: Gong fu yu jia
processing 26/50: iBoy

now on iteration 10/15
processing  1/50: Fun Mom Dinner
processing 26/50: Hampstead

now on iteration 11/15
processing  1/50: Qarib Qarib Singlle
process

**We now have a lot of Rotten Tomatoes links!**

Let's check what the top 5 links look like.

In [18]:
for link in links[:5]:
    print(link)

https://www.rottentomatoes.com/m/the_greatest_showman_2017
https://www.rottentomatoes.com/m/you_were_never_really_here_2017
https://www.rottentomatoes.com/m/jumanji_welcome_to_the_jungle
https://www.rottentomatoes.com/m/thor_ragnarok_2017
https://www.rottentomatoes.com/m/the_shape_of_water_2017


*Nice;* the links work perfectly!

**We can save our links so we won't have to run this loop again in the future.**

The `w` argument means we will be writing to the file.

In [19]:
# Save our links to a file for safe keeping
with open('links_found.txt', 'w') as output:
    for val in links:
        output.write(str(val) + '\n')
        
with open('no_links_found.txt', 'w') as output:
    for val in no_links:
        output.write(str(val) + '\n')
        
# # How to read the files you just saved.
# with open('links_found_15.txt') as f:
#     links = f.read().split('\n')

---

# Time to get the data!

We now have a list of Rotten Tomatoes URLs! Our next task is to scrape each of those web pages.

But first, to demonstrate what each iteration of our loop will do, let's process a single movie first; and we'll try one that is not in our list: Black Panther.

Let's see the [features](https://en.wikipedia.org/wiki/Feature_(machine_learning)) we can get from the pages:

![Black Panther was the hit Marvel movie at the time this was made.](2-black-panther-pic-labelled.png)

*Black Panther Rotten Tomatoes page*

---

![Scrolling down the Black Panther page](3-black-panther-pic-2-labelled.png)

*That same page, scrolled down.*

---

We can see we have 17 features -- add the link, and we'll have 18 features to play with!

## What each iteration of the loop looks like

Again, we'll be using a movie that isn't in our list of 2017 movies.

In [20]:
link = 'https://www.rottentomatoes.com/m/black_panther_2018'

**Let's make some soup.**

A lot of these processes will be similar to the previous step's.

In [21]:
soup = BeautifulSoup(requests.get(link).text, 'lxml')

**Time to extract each feature we wanted.**

Some knowledge about [regular expressions](https://www.regular-expressions.info/) and some knowledge of [CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) (specifically the [nth-of-type](https://developer.mozilla.org/en-US/docs/Web/CSS/%3Anth-of-type)) will aid you greatly.

### Feature 0: Title of the movie

There are multiple choices to obtain this, but the one from the `title` tag of the web page was chosen.

*Note: `xa0` is part of `soup.title.text.split` for some reason, so that extra bit of code is to get _just_ the title of the movie.*

In [22]:
# title of the movie
movie_title = soup.title.text.split('\xa0')[0]
movie_title

'Black Panther'

**The first half of features are all in a single `div` element called `scorePanel`.**

So, let's extract that first to make our job easier.

In [23]:
score_panel = soup.find(id='scorePanel')
print(score_panel.prettify()[:400])

<div class="score_panel col-sm-17 col-xs-15" id="scorePanel">
 <div class="col-sm-16 col-xs-12 tmeter-panel">
  <ul class="pull-right hidden-xs" role="tablist">
   <li class="active pull-left critics-score">
    <a class="articleLink unstyled smaller gray superPageFontColor" data-toggle="tab" href="#all-critics-numbers" role="tab">
     All Critics
    </a>
   </li>
   <li class="pull-left superPa


**Tomatometer and Audience Score**

Both of these `span` elements share the same `class`; we can use that to our advantage.

In [24]:
meter_values = score_panel.find_all(class_='meter-value')
meter_values

[<span class="meter-value superPageFontColor"><span>96</span>%</span>,
 <span class="meter-value superPageFontColor"><span>100</span>%</span>,
 <div class="meter-value">
 <span class="superPageFontColor" style="vertical-align:top">79%</span>
 </div>]

### Feature 1: Tomatometer Score

It's the first `span` element within `meter_values`.

In [25]:
tom = int(meter_values[0].span.text)
tom

96

### Feature 2: Audience Score

It's the _last_ `span` element within the same `meter_values`.

(The middle `span` element is for the red meter visual above the Critics Consensus.)

In [26]:
aud = meter_values[-1].span.text # returns 79%
aud = int(aud.replace('%', ''))  # remove %, parse to an int
aud

79

**Features 3-6 are in a `div` element called `scoreStats`.**

That means we can get Average Rating, Reviews Counted, and numbers of Fresh and Rotten in one go!

In [27]:
score_spans = score_panel.find(id='scoreStats').find_all('span')
score_spans

[<span class="subtle superPageFontColor">Average Rating: </span>,
 <span class="subtle superPageFontColor">Reviews Counted: </span>,
 <span>367</span>,
 <span class="subtle superPageFontColor audience-info">Fresh: </span>,
 <span>354</span>,
 <span class="subtle superPageFontColor audience-info">Rotten: </span>,
 <span>13</span>]

### Feature 3: Tomatometer Average Rating

The first `span` element contained the number.

We'll extract just the actual score and convert it into a `float`.

In [28]:
tom_ave_rating = float(score_spans[0].next_sibling.strip().split('/')[0])
tom_ave_rating

8.2

### Feature 4: Number of Reviews in Tomatometer

We get our next feature from the `span` at index _2_, not 1. Don't forget to parse to `int`.

In [29]:
tom_num_reviews = int(score_spans[2].text)
tom_num_reviews

367

### Feature 5: Number of "Fresh" Ratings

_Fresh, fresh, baby!_

In [30]:
tom_fresh = int(score_spans[4].text)
tom_fresh

354

### Feature 6: Number of "Rotten" Ratings

`raw + 10`

In [31]:
tom_rotten = int(score_spans[6].text)
tom_rotten

13

**The accompanying content of Audience Score are in the last `audience-info`.**

There are other `div` elements that have the same `class`, so we must specify that we want the last.

In [32]:
audience_panel = score_panel.find_all(class_='audience-info')[-1]
audience_panel

<div class="audience-info hidden-xs superPageFontColor">
<div>
<span class="subtle superPageFontColor">Average Rating:</span>
            4.1/5
                </div>
<div>
<span class="subtle superPageFontColor">User Ratings:</span>
        75,883</div>
</div>

**The numbers are not in any HTML tags!**

We can observe, however, that both numbers are after span elements. So let's start with selecting the span elements.

In [33]:
# audience average rating
audience_panel = audience_panel.find_all('span')
audience_panel

[<span class="subtle superPageFontColor">Average Rating:</span>,
 <span class="subtle superPageFontColor">User Ratings:</span>]

### Feature 7: Audience Score Average Rating

We access the label of "Average Rating:" and get the text right after it using `next_sibling`, which are our numbers. Don't forget to parse afterwards!

In [34]:
aud_ave_rating = audience_panel[0].next_sibling.strip().split('/')[0]
aud_ave_rating = float(aud_ave_rating)
aud_ave_rating

4.1

### Feature 8: Audience Score Number of Ratings

We do the same thing as in the previous feature.

In [35]:
# audience number of ratings
aud_num_ratings = audience_panel[1].next_sibling.strip().split('/')[0]
aud_num_ratings = int(aud_num_ratings.replace(',', ''))
aud_num_ratings

75883

**The rest of our features are a bit funkier.**

This next part will need to be a bit more dynamic, as some of these information might be missing for certain movies -- especially the ones that're less mainstream, or some movies that were later found to be streaming movies, but still part of this list.

---

**In fact, even a popular movie like Coco can lack movie information!**

Here's Jumanji's movie information:

![Check out Jumanji's movie info.](4-jumanji-movie-info.png)

Coco:

![Check out Coco's movie info.](5-coco-movie-info.png)

---

**So, we need to account for both the label (e.g. "Rating"), and the actual value (e.g. "R-18").**

But a solution will slowly reveal itself after looking through the source code!

**All the headers are in `meta-label`.**

After some string processing, we can put them into a neat list.

In [36]:
movie_info_headers = soup.find_all(class_='meta-label')
movie_info_headers = [x.text.split(':')[0] for x in movie_info_headers] # split the string at : and get the first.
movie_info_headers

['Rating',
 'Genre',
 'Directed By',
 'Written By',
 'In Theaters',
 'Box Office',
 'Runtime',
 'Studio']

**All the values, meanwhile, are in `meta-value`!**

Let's put them into a list as well!

In [37]:
movie_info_values = soup.find_all(class_='meta-value')
movie_info_values = [x.text.strip() for x in movie_info_values]
movie_info_values

['PG-13 (for prolonged sequences of action violence, and a brief rude gesture)',
 'Action & Adventure, \n                        Drama, \n                        Science Fiction & Fantasy',
 'Ryan Coogler',
 'Joe Robert Cole, Ryan Coogler',
 'Feb 16, 2018\n\xa0wide',
 '$501,105,037',
 '135 minutes',
 'Marvel Studios']

### Headers 9 to N.

We can look at the headers and values in tandem using `zip`.

Afterwards, we need `if` statements because if we just assume everything's there, it'll cause our script to crash.

In [38]:
# a list helps us so that if a feature is missing, it's
# automatically a None (aka NaN).
row = [None for x in range(8)]

for header,value in zip(movie_info_headers, movie_info_values):
    lowercase_header = header.lower()
    
    if 'rating' in lowercase_header:
        row[0] = value.split(' ')[0]

    elif 'genre' in lowercase_header:
        # \s+ means "any whitespace". So,
        # ',\s+' means a comma followed by any whitespace.
        row[1] = re.split(',\s+', value)
    
    elif 'directed' in lowercase_header:
        row[2] = value
        
    elif 'written' in lowercase_header:
        row[3] = value.split(',\s+')
        
    elif 'theaters' in lowercase_header:
        row[4] = value.split('\n')[0]
    
    elif 'box' in lowercase_header:
        row[5] = int(re.sub('\$|,', '', value))
        
    elif 'runtime' in lowercase_header:
        row[6] = int(value.split(' ')[0])
        
    elif 'studio' in lowercase_header:
        row[7] = value

row

['PG-13',
 ['Action & Adventure', 'Drama', 'Science Fiction & Fantasy'],
 'Ryan Coogler',
 ['Joe Robert Cole, Ryan Coogler'],
 'Feb 16, 2018',
 501105037,
 135,
 'Marvel Studios']

## Let's now fill up our whole table!

*No pandas were hurt in the preparation of this table.*

In [39]:
# for which movie pages does our script to get the
# tomatometer fail?
# is our algorithm acceptable enough?

table = []

# a set is used here so we can keep track of what error and where!
err = {}

for i,link in enumerate(links):
    if i%20==0:
        print('now processing {:>3}/{:>3}: {}'.format(i+1, len(links), link))

    row = []
    
    try:
        # == make soup ==
        soup = BeautifulSoup(requests.get(link).text, 'lxml')

        # === 0: title of the movie ===
        movie_title = soup.title.text.split('\xa0')[0]
        row.append(movie_title)
        
        # our first features are all in a div, scorePanel.
        score_panel = soup.find(id='scorePanel')
        
        # get tomatometer score and audience score (first and last in meter-value)
        meter_values = score_panel.find_all(class_='meter-value')

        # === 1: tomatometer ===
        tom = int(meter_values[0].span.text)
        row.append(tom)
        
        # === 2: audience meter ===
        aud = meter_values[-1].span.text # returns 79%
        aud = int(aud.replace('%', ''))  # remove %, parse to an int
        row.append(aud)
        
        # scoreStats contains the other numbers underneath the tomatometer score
        score_spans = score_panel.find(id='scoreStats').find_all('span')

        # === 3: tomatometer average rating ===
        tom_ave_rating = float(score_spans[0].next_sibling.strip().split('/')[0])
        row.append(tom_ave_rating)
        
        # === 4: tomatometer number of reviews ===
        tom_num_reviews = int(score_spans[2].text)
        row.append(tom_num_reviews)
        
        # === 5: tomatometer number of fresh reviews===
        tom_fresh = int(score_spans[4].text)
        row.append(tom_fresh)
        
        # === 6: tomatometer number ===
        tom_rotten = int(score_spans[6].text)
        row.append(tom_rotten)
        
        # the last audience-info contains the other numbers underneath the audience score.
        audience_panel = score_panel.find_all(class_='audience-info')[-1].find_all('span')
        
        # === 7: audience average rating ===
        aud_ave_rating = audience_panel[0].next_sibling.strip().split('/')[0]
        aud_ave_rating = float(aud_ave_rating)
        row.append(aud_ave_rating)
        
        # === 8: audience number of ratings ===
        aud_num_ratings = audience_panel[1].next_sibling.strip().split('/')[0]
        aud_num_ratings = int(aud_num_ratings.replace(',', ''))
        row.append(aud_num_ratings)
        
        # we need to dynamically scrape movie info by checking both headers...
        movie_info_headers = soup.find_all(class_='meta-label')
        movie_info_headers = [x.text.split(':')[0] for x in movie_info_headers]

        # ...and values.
        movie_info_values = soup.find_all(class_='meta-value')
        movie_info_values = [x.text.strip() for x in movie_info_values]

        # so we have a set number of elements, even if some stuff are missing.
        info = [None for x in range(8)]

        for header,value in zip(movie_info_headers, movie_info_values):
            lowercase_header = header.lower()

            if 'rating' in lowercase_header:
                info[0] = value.split(' ')[0]

            elif 'genre' in lowercase_header:
                # \s+ means "any whitespace". So,
                # ',\s+' means a comma followed by any whitespace.
                info[1] = re.split(',\s+', value)

            elif 'directed' in lowercase_header:
                info[2] = value

            elif 'written' in lowercase_header:
                info[3] = value.split(',\s+')

            elif 'theaters' in lowercase_header:
                info[4] = value.split('\n')[0]

            elif 'box' in lowercase_header:
                info[5] = int(re.sub('\$|,', '', value))

            elif 'runtime' in lowercase_header:
                info[6] = int(value.split(' ')[0])

            elif 'studio' in lowercase_header:
                info[7] = value
                
        # we're using extend rather than append
        row.extend(info)
        
        # let's put the link so we can check easily later
        row.append(link)
        
    except IndexError:
        err[i] = 'index'
        continue
        
    except AttributeError:
        err[i] = 'attribute'
        continue
        
    except ValueError:
        err[i] = 'value'
        continue
    
    table.append(row)

print()
print('num. of successes: {:>4}'.format(len(table)))
print('num. of crashes:   {:>4}'.format(len(err)))
print('index errors:      {:>4}'.format(len([error for error in err if err[error]=='index'])))
print('attribute errors:  {:>4}'.format(len([error for error in err if err[error]=='attribute'])))
print('value errors:      {:>4}'.format(len([error for error in err if err[error]=='value'])))

now processing   1/645: https://www.rottentomatoes.com/m/the_greatest_showman_2017
now processing  21/645: https://www.rottentomatoes.com/m/órbita_9
now processing  41/645: https://www.rottentomatoes.com/m/the_florida_project_2017
now processing  61/645: https://www.rottentomatoes.com/m/john_wick_chapter_2
now processing  81/645: https://www.rottentomatoes.com/m/xxx_return_of_xander_cage
now processing 101/645: https://www.rottentomatoes.com/m/ghost_in_the_shell_2017
now processing 121/645: https://www.rottentomatoes.com/m/annabelle_creation
now processing 141/645: https://www.rottentomatoes.com/m/finding_your_feet
now processing 161/645: https://www.rottentomatoes.com/m/goodbye_christopher_robin
now processing 181/645: https://www.rottentomatoes.com/m/cold_skin_2017
now processing 201/645: https://www.rottentomatoes.com/m/war_machine
now processing 221/645: https://www.rottentomatoes.com/m/the_lego_ninjago_movie
now processing 241/645: https://www.rottentomatoes.com/m/woman_walks_ahea

In [40]:
headers = [
    'title',
    'tom',
    'aud',
    'tom_ave_rating',
    'tom_num_reviews',
    'tom_fresh',
    'tom_rotten',
    'aud_ave_rating',
    'aud_num_ratings',
    'age_rating',
    'genre',
    'director',
    'writers',
    'rel_date',
    'box',
    'runtime',
    'studio',
    'link'
]

df = pd.DataFrame(table, columns=headers)
df.head(n=10)

Unnamed: 0,title,tom,aud,tom_ave_rating,tom_num_reviews,tom_fresh,tom_rotten,aud_ave_rating,aud_num_ratings,age_rating,genre,director,writers,rel_date,box,runtime,studio,link
0,The Greatest Showman,55,88,6.0,205,113,92,4.4,21776,PG,"[Drama, Musical & Performing Arts]",Michael Gracey,"[Jenny Bicks, Bill Condon]","Dec 20, 2017",164616443.0,105.0,20th Century Fox,https://www.rottentomatoes.com/m/the_greatest_...
1,You Were Never Really Here,88,72,8.2,137,120,17,3.7,2183,R,"[Drama, Mystery & Suspense]",Lynne Ramsay,[Lynne Ramsay],"Apr 6, 2018",,89.0,Amazon Studios,https://www.rottentomatoes.com/m/you_were_neve...
2,Jumanji: Welcome to the Jungle,76,88,6.1,185,141,44,4.4,35896,PG-13,"[Action & Adventure, Drama, Kids & Family, Sci...",Jake Kasdan,"[Chris McKenna, Erik Sommers, Scott Rosenberg,...","Dec 20, 2017",393201353.0,112.0,Columbia Pictures,https://www.rottentomatoes.com/m/jumanji_welco...
3,Thor: Ragnarok,92,87,7.5,332,305,27,4.2,88083,PG-13,"[Action & Adventure, Drama, Science Fiction & ...",Taika Waititi,[Eric Pearson],"Nov 3, 2017",314971245.0,130.0,Walt Disney Pictures,https://www.rottentomatoes.com/m/thor_ragnarok...
4,The Shape of Water,92,74,8.4,346,318,28,3.8,22328,R,"[Drama, Science Fiction & Fantasy, Romance]",Guillermo del Toro,"[Guillermo del Toro, Vanessa Taylor]","Dec 22, 2017",57393976.0,119.0,Fox Searchlight Pictures,https://www.rottentomatoes.com/m/the_shape_of_...
5,Hostiles,73,72,6.8,180,131,49,3.7,4009,R,"[Action & Adventure, Drama, Western]",Scott Cooper,[Scott Cooper],"Jan 26, 2018",29472340.0,135.0,Entertainment Studios Motion Pictures,https://www.rottentomatoes.com/m/hostiles
6,Justice League,40,75,5.3,313,126,187,3.9,123478,PG-13,"[Action & Adventure, Drama, Science Fiction & ...",Zack Snyder,"[Chris Terrio, Joss Whedon]","Nov 17, 2017",227032490.0,110.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/justice_leagu...
7,Coco,97,94,8.2,273,265,8,4.6,23818,PG,"[Action & Adventure, Animation, Comedy]","Lee Unkrich, Adrian Molina","[Matthew Aldrich, Adrian Molina]","Nov 22, 2017",208487719.0,,Disney/Pixar,https://www.rottentomatoes.com/m/coco_2017
8,Molly's Game,82,84,7.2,241,198,43,3.9,10147,R,[Drama],Aaron Sorkin,[Aaron Sorkin],"Jan 5, 2018",28744803.0,140.0,STXfilms,https://www.rottentomatoes.com/m/mollys_game_2017
9,Chappaquiddick,80,75,7.0,93,74,19,3.9,1009,PG-13,[Mystery & Suspense],John Curran,"[Taylor Allen, Andrew Logan]","Apr 6, 2018",,101.0,Entertainment Studios Motion Pictures,https://www.rottentomatoes.com/m/chappaquiddic...


In [41]:
df.to_csv('movies_df_15.csv', index=False)

# # here is how to read the csv you just wrote. With pandas,
# #   it'll conveniently be read as a DataFrame!
# df = pd.read_csv('movies_df_15.csv')

# Let's inspect our data.

How did our scraping do?

In [42]:
print('number of nulls per column\n')
for label in df:
    print('{:<16} {:>3}'.format(label+':', len(df[df[label].isnull()])))

number of nulls per column

title:             0
tom:               0
aud:               0
tom_ave_rating:    0
tom_num_reviews:   0
tom_fresh:         0
tom_rotten:        0
aud_ave_rating:    0
aud_num_ratings:   0
age_rating:        0
genre:             1
director:          7
writers:          18
rel_date:         57
box:             248
runtime:          62
studio:           29
link:              0


# Let's model!

We'll do some linear regression with Python's `statsmodels` library.

*Note: this is not the best model performance-wise; but for this tutorial, it will be the most helpful as results are already in a tabular form!*

**Let's predict box office revenues!**

We will base our prediction on:
* tomatometer score
* audience score
* average tomatometer rating
* average audience rating

In [43]:
import statsmodels.api as sm
import numpy as np

df_model = df[df['box']>=1000000]
data_to_model = df_model[['tom', 'aud', 'tom_ave_rating', 'aud_ave_rating', 'tom_num_reviews', 'aud_num_ratings']]
target_column = df_model[['box']]

# Note the order of arguments
model = sm.OLS(target_column, data_to_model).fit()

# Print out the statistics. Summary2 gives it in non-exponential format.
model.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.783
Dependent Variable:,box,AIC:,5760.1077
Date:,2018-04-18 10:29,BIC:,5778.1716
No. Observations:,150,Log-Likelihood:,-2874.1
Df Model:,6,F-statistic:,91.34
Df Residuals:,144,Prob (F-statistic):,1.4e-46
R-squared:,0.792,Scale:,2677500000000000.0

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
tom,2107912.1629,649682.3825,3.2445,0.0015,823766.1650,3392058.1608
aud,-854161.2729,737759.2328,-1.1578,0.2489,-2312397.7770,604075.2312
tom_ave_rating,-47332304.5300,13364125.5752,-3.5417,0.0005,-73747501.9185,-20917107.1416
aud_ave_rating,55255245.2428,21516220.4917,2.5681,0.0112,12726820.3083,97783670.1773
tom_num_reviews,145957.1495,86549.0200,1.6864,0.0939,-25113.4854,317027.7843
aud_num_ratings,2465.3351,252.2317,9.7741,0.0000,1966.7802,2963.8900

0,1,2,3
Omnibus:,67.502,Durbin-Watson:,1.681
Prob(Omnibus):,0.0,Jarque-Bera (JB):,386.841
Skew:,1.496,Prob(JB):,0.0
Kurtosis:,10.276,Condition No.:,203394.0


---

# Conclusion

And we're done! We hope you've learned a thing or two from this detailed notebook.

With web scraping, we can prepare our own datasets to play with, and you'll only be limited by what data is on the internet -- which, if you ask us, is quite a lot! :)

---

Prepared for
**Data Science 1**

*(An internal lecture conducted for PLDT/Smart)*

---

by:

Nicholas _"Lodi Nick"_ Huber  <c-nehuber@pldt.com.ph>

Andre _"dTanMan"_ Tan  <attan@pldt.com.ph>

Mark _"Markee-joke-lang-Mark-lang"_ Herrera  <mnherrera@talas.com.ph>

Brent _"Pun de Manila"_ Carbonera  <bbcarbonera@pldt.com.ph>

---

# Appendix

Other code snippets that may help you when you're playing around with this notebook

## One-cell snippet to scrape one movie

This is one complete iteration of the for loop. You can use this for debugging error pages, as you'll see on which part of the loop the code crashed.

In [44]:
# == prepare link ==
link = 'https://www.rottentomatoes.com/m/black_panther_2018'

# == make soup ==
soup = BeautifulSoup(requests.get(link).text, 'lxml')

# === 0: title of the movie ===
movie_title = soup.title.text.split('\xa0')[0]

# get tomatometer score and audience score (first and last in meter-value)
meter_values = score_panel.find_all(class_='meter-value')

# === 1: tomatometer ===
tom = int(meter_values[0].span.text)

# === 2: audience meter ===
aud = meter_values[-1].span.text # returns 79%
aud = int(aud.replace('%', ''))  # remove %, parse to an int

# scoreStats contains the other numbers underneath the tomatometer score
score_spans = score_panel.find(id='scoreStats').find_all('span')

# === 3: tomatometer average rating
tom_ave_rating = float(score_spans[0].next_sibling.strip().split('/')[0])

# === 4: tomatometer number of reviews
tom_num_reviews = int(score_spans[2].text)

# === 5: tomatometer number of fresh reviews
tom_fresh = int(score_spans[4].text)

# === 6: tomatometer number 
tom_rotten = int(score_spans[6].text)

# the last audience-info contains the other numbers underneath the audience score.
audience_panel = score_panel.find_all(class_='audience-info')[-1].find_all('span')

# === 7: audience average rating
aud_ave_rating = audience_panel[0].next_sibling.strip().split('/')[0]
aud_ave_rating = float(aud_ave_rating)

# === 8: audience number of ratings
aud_num_ratings = audience_panel[1].next_sibling.strip().split('/')[0]
aud_num_ratings = int(aud_num_ratings.replace(',', ''))

# we need to dynamically scrape movie info by checking both headers...
movie_info_headers = soup.find_all(class_='meta-label')
movie_info_headers = [x.text.split(':')[0] for x in movie_info_headers]

# ...and values.
movie_info_values = soup.find_all(class_='meta-value')
movie_info_values = [x.text.strip() for x in movie_info_values]

# so we have a set number of Nones
info = [None for x in range(8)]

for header,value in zip(movie_info_headers, movie_info_values):
    lowercase_header = header.lower()

    if 'rating' in lowercase_header:
        info[0] = value.split(' ')[0]

    elif 'genre' in lowercase_header:
        # \s+ means "any whitespace". So,
        # ',\s+' means a comma followed by any whitespace.
        info[1] = re.split(',\s+', value)

    elif 'directed' in lowercase_header:
        info[2] = value

    elif 'written' in lowercase_header:
        info[3] = value.split(',\s+')

    elif 'theaters' in lowercase_header:
        info[4] = value.split('\n')[0]

    elif 'box' in lowercase_header:
        info[5] = int(re.sub('\$|,', '', value))

    elif 'runtime' in lowercase_header:
        info[6] = int(value.split(' ')[0])

    elif 'studio' in lowercase_header:
        info[7] = value

ValueError: invalid literal for int() with base 10: '91%'

## Processing `genre` column if it became a string

The `genre` column of our `DataFrame` contained lists; these are converted into `string`s when the `DataFrame` is written to a .csv file. Here's how to change it back.

In [45]:
app_df = pd.read_csv('movies_df_15.csv')

In [46]:
# remove brackets from genres
# you'll need this if you got the df from file
app_df['genre'] = app_df['genre'].str.replace('\[|\]', '')

# we want complete genres
app_df = app_df[~app_df['genre'].isnull()]

app_df.head(n=10)

Unnamed: 0,title,tom,aud,tom_ave_rating,tom_num_reviews,tom_fresh,tom_rotten,aud_ave_rating,aud_num_ratings,age_rating,genre,director,writers,rel_date,box,runtime,studio,link
0,The Greatest Showman,55,88,6.0,205,113,92,4.4,21776,PG,"'Drama', 'Musical & Performing Arts'",Michael Gracey,"['Jenny Bicks, Bill Condon']","Dec 20, 2017",164616443.0,105.0,20th Century Fox,https://www.rottentomatoes.com/m/the_greatest_...
1,You Were Never Really Here,88,72,8.2,137,120,17,3.7,2183,R,"'Drama', 'Mystery & Suspense'",Lynne Ramsay,['Lynne Ramsay'],"Apr 6, 2018",,89.0,Amazon Studios,https://www.rottentomatoes.com/m/you_were_neve...
2,Jumanji: Welcome to the Jungle,76,88,6.1,185,141,44,4.4,35896,PG-13,"'Action & Adventure', 'Drama', 'Kids & Family'...",Jake Kasdan,"['Chris McKenna, Erik Sommers, Scott Rosenberg...","Dec 20, 2017",393201353.0,112.0,Columbia Pictures,https://www.rottentomatoes.com/m/jumanji_welco...
3,Thor: Ragnarok,92,87,7.5,332,305,27,4.2,88083,PG-13,"'Action & Adventure', 'Drama', 'Science Fictio...",Taika Waititi,['Eric Pearson'],"Nov 3, 2017",314971245.0,130.0,Walt Disney Pictures,https://www.rottentomatoes.com/m/thor_ragnarok...
4,The Shape of Water,92,74,8.4,346,318,28,3.8,22328,R,"'Drama', 'Science Fiction & Fantasy', 'Romance'",Guillermo del Toro,"['Guillermo del Toro, Vanessa Taylor']","Dec 22, 2017",57393976.0,119.0,Fox Searchlight Pictures,https://www.rottentomatoes.com/m/the_shape_of_...
5,Hostiles,73,72,6.8,180,131,49,3.7,4009,R,"'Action & Adventure', 'Drama', 'Western'",Scott Cooper,['Scott Cooper'],"Jan 26, 2018",29472340.0,135.0,Entertainment Studios Motion Pictures,https://www.rottentomatoes.com/m/hostiles
6,Justice League,40,75,5.3,313,126,187,3.9,123478,PG-13,"'Action & Adventure', 'Drama', 'Science Fictio...",Zack Snyder,"['Chris Terrio, Joss Whedon']","Nov 17, 2017",227032490.0,110.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/justice_leagu...
7,Coco,97,94,8.2,273,265,8,4.6,23818,PG,"'Action & Adventure', 'Animation', 'Comedy'","Lee Unkrich, Adrian Molina","['Matthew Aldrich, Adrian Molina']","Nov 22, 2017",208487719.0,,Disney/Pixar,https://www.rottentomatoes.com/m/coco_2017
8,Molly's Game,82,84,7.2,241,198,43,3.9,10147,R,'Drama',Aaron Sorkin,['Aaron Sorkin'],"Jan 5, 2018",28744803.0,140.0,STXfilms,https://www.rottentomatoes.com/m/mollys_game_2017
9,Chappaquiddick,80,75,7.0,93,74,19,3.9,1009,PG-13,'Mystery & Suspense',John Curran,"['Taylor Allen, Andrew Logan']","Apr 6, 2018",,101.0,Entertainment Studios Motion Pictures,https://www.rottentomatoes.com/m/chappaquiddic...


## Converting genres from one list to multiple [boolean](https://en.wikipedia.org/wiki/Boolean_expression) columns

If we wanted to consider genres in our regression model, we can't prepare the model with a list as a feature; instead, we'll convert each genre into a column that will contain a 1 if that movie is of that genre and 0 if otherwise.

**Let's see how many genres there actually are in our data set.**

Notice that a `set` was used as it will not contain duplicates, which is useful for our current purpose. An alternative is to use a `Counter` from the `collections` library.

In [47]:
genres = set()

for list_per_movie in app_df['genre']:
    for genre in list_per_movie.split(', '):
        genres.add(genre)

genres

{"'Action & Adventure'",
 "'Animation'",
 "'Art House & International'",
 "'Classics'",
 "'Comedy'",
 "'Cult Movies'",
 "'Documentary'",
 "'Drama'",
 "'Horror'",
 "'Kids & Family'",
 "'Musical & Performing Arts'",
 "'Mystery & Suspense'",
 "'Romance'",
 "'Science Fiction & Fantasy'",
 "'Special Interest'",
 "'Sports & Fitness'",
 "'Western'"}

**We only have 16 possible genres!**

We can make one column for each genre in our `DataFrame`; a 1 on a genre column will mean it's of that genre; a 0 will mean otherwise.

We're using numbers so they can still be part of a numerical model!

In [48]:
genre_headers = ['horror', 'classics', 'western', 'art', 'musical', 'romance', 'comedy', 'special', 'scifi_fantasy', 'kids', 'animation', 'action', 'drama', 'documentary', 'sports', 'mystery']
lists = app_df['genre']

for header,genre in zip(genre_headers, list(genres)):
    app_df[header] = [(1 if genre in x else 0) for x in lists]
    
df.head(n=10)

Unnamed: 0,title,tom,aud,tom_ave_rating,tom_num_reviews,tom_fresh,tom_rotten,aud_ave_rating,aud_num_ratings,age_rating,genre,director,writers,rel_date,box,runtime,studio,link
0,The Greatest Showman,55,88,6.0,205,113,92,4.4,21776,PG,"[Drama, Musical & Performing Arts]",Michael Gracey,"[Jenny Bicks, Bill Condon]","Dec 20, 2017",164616443.0,105.0,20th Century Fox,https://www.rottentomatoes.com/m/the_greatest_...
1,You Were Never Really Here,88,72,8.2,137,120,17,3.7,2183,R,"[Drama, Mystery & Suspense]",Lynne Ramsay,[Lynne Ramsay],"Apr 6, 2018",,89.0,Amazon Studios,https://www.rottentomatoes.com/m/you_were_neve...
2,Jumanji: Welcome to the Jungle,76,88,6.1,185,141,44,4.4,35896,PG-13,"[Action & Adventure, Drama, Kids & Family, Sci...",Jake Kasdan,"[Chris McKenna, Erik Sommers, Scott Rosenberg,...","Dec 20, 2017",393201353.0,112.0,Columbia Pictures,https://www.rottentomatoes.com/m/jumanji_welco...
3,Thor: Ragnarok,92,87,7.5,332,305,27,4.2,88083,PG-13,"[Action & Adventure, Drama, Science Fiction & ...",Taika Waititi,[Eric Pearson],"Nov 3, 2017",314971245.0,130.0,Walt Disney Pictures,https://www.rottentomatoes.com/m/thor_ragnarok...
4,The Shape of Water,92,74,8.4,346,318,28,3.8,22328,R,"[Drama, Science Fiction & Fantasy, Romance]",Guillermo del Toro,"[Guillermo del Toro, Vanessa Taylor]","Dec 22, 2017",57393976.0,119.0,Fox Searchlight Pictures,https://www.rottentomatoes.com/m/the_shape_of_...
5,Hostiles,73,72,6.8,180,131,49,3.7,4009,R,"[Action & Adventure, Drama, Western]",Scott Cooper,[Scott Cooper],"Jan 26, 2018",29472340.0,135.0,Entertainment Studios Motion Pictures,https://www.rottentomatoes.com/m/hostiles
6,Justice League,40,75,5.3,313,126,187,3.9,123478,PG-13,"[Action & Adventure, Drama, Science Fiction & ...",Zack Snyder,"[Chris Terrio, Joss Whedon]","Nov 17, 2017",227032490.0,110.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/justice_leagu...
7,Coco,97,94,8.2,273,265,8,4.6,23818,PG,"[Action & Adventure, Animation, Comedy]","Lee Unkrich, Adrian Molina","[Matthew Aldrich, Adrian Molina]","Nov 22, 2017",208487719.0,,Disney/Pixar,https://www.rottentomatoes.com/m/coco_2017
8,Molly's Game,82,84,7.2,241,198,43,3.9,10147,R,[Drama],Aaron Sorkin,[Aaron Sorkin],"Jan 5, 2018",28744803.0,140.0,STXfilms,https://www.rottentomatoes.com/m/mollys_game_2017
9,Chappaquiddick,80,75,7.0,93,74,19,3.9,1009,PG-13,[Mystery & Suspense],John Curran,"[Taylor Allen, Andrew Logan]","Apr 6, 2018",,101.0,Entertainment Studios Motion Pictures,https://www.rottentomatoes.com/m/chappaquiddic...
