# Introduction
*What's this Python notebook all about?*

`Hello world!`

**This notebook is an abridged version of the previous session's notebook.**

It demonstrates how to use a Jupyter notebook via Google Colab by:
1. Reading in data that was scraped [in a previous notebook](https://bit.ly/paceds1notebook).
2. Playing around with the data using `pandas`.
3. Deploying a simple linear regression model.

---

# `import` everything!
*How to import all the necessary libraries (and install them if you haven't!)*

**1. `pandas` is one of the most iconic Python libraries for handling data.**

Installation: `pip install pandas`

In [None]:
import pandas as pd

**2. `numpy` is another very iconic Python library.**

It makes Python faster and more convenient when working with numbers and arrays of numbers.

In [None]:
import numpy as np
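As a small illustration of what "dealing with lists of numbers" looks like in practice, here is a minimal sketch. The revenue figures are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical box office figures in dollars (made up for illustration).
revenues = np.array([1_200_000, 450_000, 3_000_000])

# Arithmetic applies to every element at once, with no explicit loop.
revenues_in_millions = revenues / 1_000_000

# Comparisons are also element-wise and return an array of booleans.
is_prominent = revenues >= 500_000

print(revenues_in_millions)
print(is_prominent)
```

The same element-wise behavior is what makes the `pandas` filters later in this notebook work.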

**3. `statsmodels` is our library of choice for modeling.**

It's not the fastest regression tool, but its output is the most helpful for our current usage. (Our dataset is also small enough!)

In [None]:
import statsmodels.api as sm

**4. Upload the file to this Colab notebook.**

Since this notebook technically runs on Google's servers, we need to upload our dataset there explicitly. Download the data from [here](https://github.com/dTanMan/modeling-pace-ds-notebook/blob/master/movies_df.csv) and upload it to this notebook.

(Uncomment the lines below with cmd+/ or ctrl+/ if you're on Colab.)

In [None]:
# from google.colab import files

# uploaded = files.upload()

# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(
#       name=fn, length=len(uploaded[fn])))
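If you would rather read the uploaded file straight from memory, the sketch below shows one way to do it. It assumes `uploaded` maps each filename to its raw bytes, as the byte count printed above suggests. The helper name `read_uploaded_csv` and the stand-in dictionary are hypothetical, so the example also runs outside Colab.

```python
import io
import pandas as pd

def read_uploaded_csv(uploaded, filename):
    """Parse one uploaded file (raw bytes) into a DataFrame."""
    return pd.read_csv(io.BytesIO(uploaded[filename]))

# Stand-in for the dict returned by files.upload(); the contents are made up.
fake_uploaded = {'movies_df.csv': b'title,box\nMovie A,1200000\nMovie B,\n'}

uploaded_df = read_uploaded_csv(fake_uploaded, 'movies_df.csv')
print(uploaded_df)
```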

# Get the data

In Data Science 1, we scraped movies and data from IMDB and Rotten Tomatoes. Let's not repeat the process.

([Here is the previous notebook.](https://bit.ly/paceds1notebook))

**Read in the csv using the `read_csv` method of `pandas`.**

If you're using Google Colab, you should've uploaded your file in the previous section before running this code. Otherwise, this will work normally if the csv is in the same folder as this notebook.

In [None]:
df = pd.read_csv('movies_df.csv')
df
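To see what `read_csv` does without needing the real file, here is a minimal sketch that parses a tiny CSV from a string. The titles and values are invented for illustration; only the column names `tom`, `aud`, and `box` mirror the ones used later in this notebook.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for movies_df.csv.
csv_text = """title,tom,aud,box
Movie A,90,85,1200000
Movie B,40,55,
Movie C,75,80,650000
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)
print(sample_df.dtypes)
```

Note that the empty `box` cell for "Movie B" is read in as a null (`NaN`), which is exactly the kind of value the data quality check looks for.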

**Check data quality**

Let's see how much of our data is actually empty junk.

In [None]:
print('number of nulls per column\n')
for label in df:
    # isnull().sum() counts the missing values in this column
    print('{:<16} {:>3}'.format(label + ':', df[label].isnull().sum()))
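The same count can also be computed for every column in a single call. The sketch below uses a toy DataFrame with made-up values rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data with some missing values (made up for illustration).
toy = pd.DataFrame({
    'tom': [90, 40, np.nan],
    'box': [1_200_000, np.nan, np.nan],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)
```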

**Removing the null rows**

For simplicity, let's drop the rows that have null `box` values.

In [None]:
# Keep only the rows where 'box' is not null
df = df.dropna(subset=['box'])
df
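To make it clear that only the `box` column decides which rows are dropped, here is a small sketch on made-up data. It uses `dropna` with the `subset` argument, which has the same effect as filtering on a null mask.

```python
import numpy as np
import pandas as pd

# Toy data (made up): row 1 is missing 'box', row 2 is missing 'tom'.
toy = pd.DataFrame({
    'tom': [90, 40, np.nan],
    'box': [1_200_000, np.nan, 650_000],
})

# Only nulls in 'box' cause a row to be dropped.
kept = toy.dropna(subset=['box'])
print(kept)
```

Row 2 survives even though its `tom` value is null, so other columns may still need attention before modeling.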

**Filtering movies?**

Maybe we should restrict our study to more prominent movies -- here, that means movies with at least $500k in revenue.

Let's see if we'll have enough data if we do that.

In [None]:
df[df['box']>=500000]

And we do!
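Rather than eyeballing the filtered table, the number of rows that pass the threshold can be counted directly. This is a minimal sketch on made-up revenue values.

```python
import pandas as pd

# Toy revenue figures in dollars (made up for illustration).
toy = pd.DataFrame({'box': [1_200_000, 450_000, 650_000, 500_000]})

# A boolean mask: True where the movie meets the $500k threshold.
mask = toy['box'] >= 500_000

# Summing a boolean mask counts the True values.
n_kept = int(mask.sum())
share_kept = n_kept / len(toy)

print(n_kept, share_kept)
```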

---

# Let's model!

We'll do some linear regression with Python's `statsmodels` library.

*Note: this is not the best model performance-wise, but for this tutorial it will be the most helpful, as the results already come in tabular form!*

**Let's predict box office revenues!**

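We will base our prediction on:
* tomatometer score (`tom`)
* audience score (`aud`)
* average tomatometer rating (`tom_ave_rating`)
* average audience rating (`aud_ave_rating`)
* number of tomatometer reviews (`tom_num_reviews`)
* number of audience ratings (`aud_num_ratings`)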

In [None]:
import statsmodels.api as sm
import numpy as np

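# Keep only movies with at least $500k in box office revenue
df_model = df[df['box']>=500000]

# Predictors (exog) and target (endog)
data_to_model = df_model[['tom', 'aud', 'tom_ave_rating', 'aud_ave_rating', 'tom_num_reviews', 'aud_num_ratings']]
target_column = df_model[['box']]

# Note the order of arguments: the target comes first, then the predictors.
# sm.OLS does not add an intercept on its own; wrap the predictors in
# sm.add_constant(...) if you want one.
model = sm.OLS(target_column, data_to_model).fit()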

# Print out the statistics. Summary2 gives it in non-exponential format.
model.summary2()

---

# Conclusion

And we're done! We hope you've learned a thing or two from this detailed notebook. :)

Here we showed you how to deploy a simple model -- it is not the best of models in terms of performance, but its output *is* easy to read and interpret.

And just like that, you have made your first model!

---

Prepared for
**Data Science 2**

*(An internal lecture conducted for PLDT/Smart)*

---

by:


Andre _"dTanMan"_ Tan  <attan@pldt.com.ph>

---