# Introduction
*What's this Python notebook all about?*

`Hello world!`

**This notebook will scrape some movie websites to get data, then make a model using that data.**

To be specific, here's what this Python 3 notebook will do:
1. Scrape a list of movies from IMDB
2. Scrape each movie's stats from Rotten Tomatoes
3. Make a Pandas DataFrame from scraped data
4. Make a linear regression model

---

# `import` everything!
*How to import all the necessary libraries (and install them if you haven't!)*

**1. The `requests` library is for getting the HTML code of a web page.**

Installation: `pip install requests`.

In [2]:
import requests

**2. `BeautifulSoup` makes parsing through the HTML a LOT easier!**

Installation: `pip install beautifulsoup4`.

In [3]:
from bs4 import BeautifulSoup

**3. `Pandas` is one of the most iconic Python libraries for handling data.**

Installation: `pip install pandas`.

In [4]:
import pandas as pd

**4. `numpy` is another very iconic Python library.**

It adds fast array types and vectorized math, which makes Python much better at handling numbers and lists of numbers.

Installation: `pip install numpy`.

In [5]:
import numpy as np

**5. `re` is the regex library.**

It's a built-in Python library for working with [regular expressions](https://www.regular-expressions.info/), so there is nothing to install.

In [6]:
import re
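
As a small illustration of where `re` tends to help in scraping work: scraped figures often arrive as display strings rather than numbers. The sketch below assumes a box-office string shaped like `$164,616,443` (the exact format on the scraped page is an assumption), and `parse_money` is a hypothetical helper written for this example, not a function from the earlier notebook.

```python
import re

def parse_money(text):
    """Return the dollar amount found in `text` as a float, or None if absent."""
    # A dollar sign, then digits that may be grouped with commas,
    # optionally followed by a decimal part.
    match = re.search(r'\$\s*(\d[\d,]*(?:\.\d+)?)', text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return float(match.group(1).replace(',', ''))

print(parse_money('Box Office: $164,616,443'))
print(parse_money('Box Office: not available'))
```

Returning `None` for strings without an amount keeps missing values explicit, which is how they end up as nulls once the data is loaded into a DataFrame.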

**6. `statsmodels` is our library of choice for modeling.**

It's not the fastest regression tool, but its output is the most helpful for our purposes, and our dataset is small enough that speed doesn't matter.

Installation: `pip install statsmodels`.

In [7]:
import statsmodels.api as sm



# Get the data

In Data Science 1, we scraped movies and their stats from IMDB and Rotten Tomatoes. Instead of repeating that process, we'll load the results saved as a CSV file.

([Here is the previous notebook](https://bit.ly/paceds1notebook).)

In [23]:
df = pd.read_csv('movies_df.csv')
df

Unnamed: 0,title,tom,aud,tom_ave_rating,tom_num_reviews,tom_fresh,tom_rotten,aud_ave_rating,aud_num_ratings,age_rating,genre,director,writers,rel_date,box,runtime,studio,link
0,The Greatest Showman,56,87,6.0,217,121,96,4.4,23841,PG,"['Drama', 'Musical & Performing Arts']",Michael Gracey,"['Jenny Bicks, Bill Condon']","Dec 20, 2017",164616443.0,105.0,20th Century Fox,https://www.rottentomatoes.com/m/the_greatest_...
1,It,85,84,7.2,321,273,48,4.1,64421,R,"['Drama', 'Horror', 'Mystery & Suspense']",Andy Muschietti,"['Chase Palmer, Cary Joji Fukunaga, Gary Daube...","Sep 8, 2017",326898358.0,135.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/it_2017
2,Thor: Ragnarok,92,87,7.5,347,320,27,4.2,89985,PG-13,"['Action & Adventure', 'Drama', 'Science Ficti...",Taika Waititi,['Eric Pearson'],"Nov 3, 2017",314971245.0,130.0,Walt Disney Pictures,https://www.rottentomatoes.com/m/thor_ragnarok...
3,Spider-Man: Homecoming,92,88,7.7,327,301,26,4.2,103926,PG-13,"['Action & Adventure', 'Drama', 'Science Ficti...",Jon Watts,"['John Francis Daley, Christopher Ford, Erik S...","Jul 7, 2017",334166825.0,,Sony Pictures,https://www.rottentomatoes.com/m/spider_man_ho...
4,Justice League,40,74,5.3,322,129,193,3.9,124762,PG-13,"['Action & Adventure', 'Drama', 'Science Ficti...",Zack Snyder,"['Chris Terrio, Joss Whedon']","Nov 17, 2017",227032490.0,110.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/justice_leagu...
5,Terrifier,67,65,6.1,6,4,2,3.6,366,NR,['Horror'],Damien Leone,['Damien Leone'],"Mar 15, 2018",,85.0,Epic Pictures,https://www.rottentomatoes.com/m/terrifier
6,The Children Act,71,75,6.3,63,45,18,3.9,348,R,"['Art House & International', 'Drama']",Richard Eyre,['Ian McEwan'],"Sep 14, 2018",,105.0,A24 and DIRECTV,https://www.rottentomatoes.com/m/the_children_act
7,Murder On The Orient Express,57,54,6.1,247,141,106,3.4,24234,PG-13,"['Drama', 'Mystery & Suspense']",Kenneth Branagh,['Michael Green'],"Nov 10, 2017",101555644.0,114.0,20th Century Fox,https://www.rottentomatoes.com/m/murder_on_the...
8,Jumanji: Welcome to the Jungle,76,87,6.2,196,149,47,4.3,37660,PG-13,"['Action & Adventure', 'Drama', 'Kids & Family...",Jake Kasdan,"['Erik Sommers, Scott Rosenberg, Jeff Pinkner']","Dec 20, 2017",393201353.0,112.0,Columbia Pictures,https://www.rottentomatoes.com/m/jumanji_welco...
9,"Three Billboards Outside Ebbing, Missouri",92,86,8.5,339,312,27,4.1,21181,R,"['Comedy', 'Drama']",Martin McDonagh,['Martin McDonagh'],"Dec 1, 2017",52000189.0,115.0,Fox Searchlight Pictures,https://www.rottentomatoes.com/m/three_billboa...


In [42]:
print('number of nulls per column\n')
for label in df:
    print('{:<16} {:>3}'.format(label+':', len(df[df[label].isnull()])))

number of nulls per column

title:             0
tom:               0
aud:               0
tom_ave_rating:    0
tom_num_reviews:   0
tom_fresh:         0
tom_rotten:        0
aud_ave_rating:    0
aud_num_ratings:   0
age_rating:        0
genre:             0
director:         10
writers:          19
rel_date:         65
box:             290
runtime:          66
studio:           28
link:              0
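
The same per-column count can also be obtained without an explicit loop. The sketch below uses a made-up three-row table (`toy` and its values are illustrative, not taken from `movies_df.csv`) to show the `isnull().sum()` idiom.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies table; the values are invented for illustration.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'box': [1200000.0, np.nan, np.nan],
    'runtime': [105.0, np.nan, 98.0],
})

# isnull() marks each missing cell as True; summing counts the Trues per column.
null_counts = toy.isnull().sum()
print(null_counts)
```

The loop version above gives more control over the printed layout; this version is handy when a quick look is enough.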


In [17]:
# Keep only the movies with at least 500k in box office.
df[df['box']>=500000]

Unnamed: 0,title,tom,aud,tom_ave_rating,tom_num_reviews,tom_fresh,tom_rotten,aud_ave_rating,aud_num_ratings,age_rating,genre,director,writers,rel_date,box,runtime,studio,link
0,The Greatest Showman,56,87,6.0,217,121,96,4.4,23841,PG,"['Drama', 'Musical & Performing Arts']",Michael Gracey,"['Jenny Bicks, Bill Condon']","Dec 20, 2017",164616443.0,105.0,20th Century Fox,https://www.rottentomatoes.com/m/the_greatest_...
1,It,85,84,7.2,321,273,48,4.1,64421,R,"['Drama', 'Horror', 'Mystery & Suspense']",Andy Muschietti,"['Chase Palmer, Cary Joji Fukunaga, Gary Daube...","Sep 8, 2017",326898358.0,135.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/it_2017
2,Thor: Ragnarok,92,87,7.5,347,320,27,4.2,89985,PG-13,"['Action & Adventure', 'Drama', 'Science Ficti...",Taika Waititi,['Eric Pearson'],"Nov 3, 2017",314971245.0,130.0,Walt Disney Pictures,https://www.rottentomatoes.com/m/thor_ragnarok...
3,Spider-Man: Homecoming,92,88,7.7,327,301,26,4.2,103926,PG-13,"['Action & Adventure', 'Drama', 'Science Ficti...",Jon Watts,"['John Francis Daley, Christopher Ford, Erik S...","Jul 7, 2017",334166825.0,,Sony Pictures,https://www.rottentomatoes.com/m/spider_man_ho...
4,Justice League,40,74,5.3,322,129,193,3.9,124762,PG-13,"['Action & Adventure', 'Drama', 'Science Ficti...",Zack Snyder,"['Chris Terrio, Joss Whedon']","Nov 17, 2017",227032490.0,110.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/justice_leagu...
7,Murder On The Orient Express,57,54,6.1,247,141,106,3.4,24234,PG-13,"['Drama', 'Mystery & Suspense']",Kenneth Branagh,['Michael Green'],"Nov 10, 2017",101555644.0,114.0,20th Century Fox,https://www.rottentomatoes.com/m/murder_on_the...
8,Jumanji: Welcome to the Jungle,76,87,6.2,196,149,47,4.3,37660,PG-13,"['Action & Adventure', 'Drama', 'Kids & Family...",Jake Kasdan,"['Erik Sommers, Scott Rosenberg, Jeff Pinkner']","Dec 20, 2017",393201353.0,112.0,Columbia Pictures,https://www.rottentomatoes.com/m/jumanji_welco...
9,"Three Billboards Outside Ebbing, Missouri",92,86,8.5,339,312,27,4.1,21181,R,"['Comedy', 'Drama']",Martin McDonagh,['Martin McDonagh'],"Dec 1, 2017",52000189.0,115.0,Fox Searchlight Pictures,https://www.rottentomatoes.com/m/three_billboa...
11,mother!,69,50,6.8,314,216,98,2.9,23678,R,"['Drama', 'Horror', 'Mystery & Suspense']",Darren Aronofsky,['Darren Aronofsky'],"Sep 15, 2017",17297289.0,115.0,Paramount Pictures,https://www.rottentomatoes.com/m/mother_2017
12,Blade Runner 2049,87,81,8.2,364,315,49,4.1,57813,R,"['Action & Adventure', 'Drama', 'Science Ficti...",Denis Villeneuve,"['Michael Green, Hampton Fancher']","Oct 6, 2017",91800042.0,164.0,Warner Bros. Pictures,https://www.rottentomatoes.com/m/blade_runner_...
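
Notice that rows 5 and 6, whose `box` value is missing, are gone from the filtered table. That is because comparing `NaN` with any number evaluates to `False`, so the filter removes nulls as a side effect. A minimal sketch on invented data (the `toy` table is illustrative only):

```python
import numpy as np
import pandas as pd

# Three made-up movies: one above the threshold, one below, one with no data.
toy = pd.DataFrame({
    'title': ['big', 'small', 'unknown'],
    'box': [164616443.0, 12000.0, np.nan],
})

# NaN >= 500000 is False, so the 'unknown' row is filtered out as well.
mask = toy['box'] >= 500000
kept = toy[mask]

print(mask.tolist())
print(kept['title'].tolist())
```

This is convenient here, since the regression cannot use rows without a target value anyway.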


# Let's model!

We'll do some linear regression with Python's `statsmodels` library.

*Note: this is not the best model performance-wise, but for this tutorial it will be the most helpful, as the results already come in tabular form!*

**Let's predict box office revenues!**

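We will base our prediction on:
* tomatometer score (`tom`)
* audience score (`aud`)
* average tomatometer rating (`tom_ave_rating`)
* average audience rating (`aud_ave_rating`)
* number of critic reviews (`tom_num_reviews`)
* number of audience ratings (`aud_num_ratings`)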
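
One detail worth keeping in mind for the fit that follows: `sm.OLS` does not add an intercept on its own, so passing only the feature columns fits a model forced through the origin (`sm.add_constant` is the usual way to add one). The sketch below illustrates the difference on synthetic data. It uses `np.linalg.lstsq` as a stand-in for the least-squares fit, and the data and variable names are invented for the example.

```python
import numpy as np

# Synthetic data: y = 10 + 2*x exactly, so the true intercept is 10.
x = np.arange(1.0, 6.0)
y = 10.0 + 2.0 * x

# Without a constant column, the fitted line is forced through the origin.
X_no_const = x.reshape(-1, 1)
coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a column of ones, the first coefficient acts as the intercept.
X_const = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(coef_no_const)  # a single slope, steeper than 2 to make up for the missing intercept
print(coef_const)     # intercept and slope
```

Whether to include an intercept is a modeling choice; for this tutorial, the point is only that the choice is explicit in `statsmodels`.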

In [43]:
import statsmodels.api as sm
import numpy as np

df_model = df[df['box']>=500000]
data_to_model = df_model[['tom', 'aud', 'tom_ave_rating', 'aud_ave_rating', 'tom_num_reviews', 'aud_num_ratings']]
target_column = df_model[['box']]

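# Note the argument order: the target comes first, then the features.
# No constant column is added, so this model has no intercept.
model = sm.OLS(target_column, data_to_model).fit()

# Print the statistics. summary2() shows them in non-exponential notation.
model.summary2()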

Model:,OLS,Adj. R-squared:,0.782
Dependent Variable:,box,AIC:,5798.3431
Date:,2018-09-17 16:50,BIC:,5816.4468
No. Observations:,151,Log-Likelihood:,-2893.2
Df Model:,6,F-statistic:,91.27
Df Residuals:,145,Prob (F-statistic):,1e-46
R-squared:,0.791,Scale:,2675300000000000.0

,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
tom,1714114.1672,587419.3625,2.9180,0.0041,553103.5510,2875124.7834
aud,-450430.9828,669286.3851,-0.6730,0.5020,-1773248.4588,872386.4931
tom_ave_rating,-39685485.5125,12206199.7118,-3.2513,0.0014,-63810545.6561,-15560425.3689
aud_ave_rating,42890256.4395,19367990.1787,2.2145,0.0284,4610207.0420,81170305.8370
tom_num_reviews,111314.7690,81785.5427,1.3611,0.1756,-50331.0490,272960.5871
aud_num_ratings,2496.0309,251.8672,9.9101,0.0000,1998.2256,2993.8363

Omnibus:,72.691,Durbin-Watson:,1.856
Prob(Omnibus):,0.0,Jarque-Bera (JB):,451.021
Skew:,1.599,Prob(JB):,0.0
Kurtosis:,10.839,Condition No.:,185175.0
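
To make the summary less of a black box, a couple of its numbers can be re-derived by hand. The sketch below copies values from the tables above and recomputes the t statistic for `tom` (coefficient divided by standard error) and the adjusted R-squared. The adjusted R-squared formula used here is the variant for a model without a constant term, which is our understanding of how `statsmodels` defines it in that case; treat it as an assumption rather than a quote from the library.

```python
# Values copied from the summary tables above.
coef_tom = 1714114.1672
stderr_tom = 587419.3625
r_squared = 0.791
n_obs = 151
df_resid = 145

# t statistic: the coefficient divided by its standard error.
t_tom = coef_tom / stderr_tom

# Adjusted R-squared for a model without a constant (assumed formula):
# 1 - (n_obs / df_resid) * (1 - R^2)
adj_r_squared = 1 - (n_obs / df_resid) * (1 - r_squared)

print(round(t_tom, 4))
print(round(adj_r_squared, 3))
```

Both results agree with the reported `t` of 2.9180 and `Adj. R-squared` of 0.782, up to the rounding of the inputs.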


---

# Conclusion

And we're done! We hope you've learned a thing or two from this detailed notebook.

With web scraping, we can prepare our own datasets to play with, and you'll only be limited by what data is on the internet -- which, if you ask us, is quite a lot! :)

---

Prepared for
**Data Science 1**

*(An internal lecture conducted for PLDT/Smart)*

---

by:

Nicholas _"Lodi Nick"_ Huber  <c-nehuber@pldt.com.ph>

Andre _"dTanMan"_ Tan  <attan@pldt.com.ph>

Mark _"Markee-joke-lang-Mark-lang"_ Herrera  <mnherrera@talas.com.ph>

Brent _"Pun de Manila"_ Carbonera  <bbcarbonera@pldt.com.ph>
