# Goals

Good data science projects start with a clear understanding of the data and where it comes from. In this tutorial, we will use HTTP requests and BeautifulSoup to write a web scraper. We will then use NumPy, pandas, matplotlib, scikit-learn, and Jupyter notebooks to explore that data carefully and eventually develop an effective model.

The goal of this part of the tutorial is to build your skills in:

- asking good questions
- finding good data
- scraping a web page
- cleaning and manipulating data
- using pandas
- exploratory data visualization
- thinking about data and what features might be used to make predictions

## We're going to make a movie! 

Our first goal today is to learn what makes a box office hit. To start off, let's pretend that we are some (very data-savvy) movie producers who want to make a movie that will earn us a metric ton of money. Which features of a movie might correlate with making that much money? Is it the budget? The actors?

##### Example 1.  Black Panther may be a contender for best movie EVARR

<img src="http://www.nerdcoremovement.com/wp-content/uploads/Black-Panther-1-938x535.jpg"
     alt="Example Great Movie"
     style="float: left; margin-right: 10px;"/>
    

### Try it out! 

List some of the features that you think you might want to consider as you make **The Best Movie Ever**.

*type your answer here*

# Asking good questions  

*"A good question is one that you can answer!"*

Think about   

@help can someone come up with some exercises for this? Maybe have them rewrite some causal statements as correlational?

*type your answer here*

# Getting good data 

## What are some good places to get data? 

There are a lot of good places to get data! If you want to start looking at some data sets, here are a few sources.

- [Kaggle](https://www.kaggle.com/datasets)
- [FiveThirtyEight](https://data.fivethirtyeight.com/)
- [Government Data](https://catalog.data.gov/dataset)
- APIs and API-like Python packages (e.g. [IMDbPY](https://imdbpy.sourceforge.io/))
- Scraping the Internet!
    
### What sites should you scrape? 

When you want a certain type of data, look for sites where it is already structured (e.g. Wikipedia). Tables, headings, and text boxes stand out in the HTML, which makes them easier to locate and extract. **Also, remember to read the Terms of Service for the site that you plan to scrape.**
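
Alongside the Terms of Service, many sites publish a `robots.txt` file that states which paths automated clients may fetch. As a minimal sketch (assuming Python 3, and using a made-up `robots.txt` and the placeholder site `example.com` rather than any real site's rules), the standard library can check a URL against those rules. The helper name `is_allowed` is invented for this example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration only; a real one lives at
# https://<site>/robots.txt and should be read from the site itself.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/movies/2018"))
print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

A `robots.txt` check does not replace reading the Terms of Service; it is only one more signal about what the site owner expects.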

### Try it out! 

We are going to use data to make **The Best Movie Ever**, but what are some good places for us to get that data? 

*type your answer here*

## Start scraping 

To get started, we need to import several packages. The comments in the cell below explain what several of these imports do.

In [8]:
# The %... line is an IPython magic command, not part of the Python language.
# In this case we're just telling the plotting library to draw things in
# the notebook, instead of in a separate window.
%matplotlib inline 
# See all the "as ..." constructs below? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().
from matplotlib import rcParams # special matplotlib argument for improved plots
from collections import defaultdict 
#from imdb import IMDb

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pickle  # cPickle exists only in Python 2; pickle works in Python 2 and 3
# import seaborn as sns
# sns.set_style("whitegrid")
# sns.set_context("poster")

import io 
import time
import requests
import sklearn
import warnings
warnings.filterwarnings('ignore')

The commented-out [Seaborn](http://seaborn.pydata.org/) lines are optional. If you uncomment them, Seaborn gives us a nicer default color palette, with plots drawn at a large (poster) size on a white-grid background.

### Scrape Box Office Mojo 

To get the text from the website to your local machine, we will use a GET request, which is available in the [Requests](http://docs.python-requests.org/en/master/) library. 

To get started exploring the text that you bring to your local machine, we will use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/). If you are familiar with another scraping library like [PyQuery](https://pythonhosted.org/pyquery/) or [Scrapy](https://scrapy.org/), feel free to do this exercise using those instead (or in addition! :)). 
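
To get a feel for what Beautiful Soup does before pointing it at a real page, here is a minimal sketch that parses a tiny, made-up HTML table. The HTML string, the movie titles, and the helper name `table_rows` are placeholders for illustration, not Box Office Mojo's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; real pages are much larger and messier.
EXAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>Movie A</td></tr>
  <tr><td>2</td><td>Movie B</td></tr>
</table>
"""

def table_rows(html):
    """Return each table row as a list of its cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both collected.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

for row in table_rows(EXAMPLE_HTML):
    print(row)
```

The same pattern (find the rows, then read the cells in each row) tends to carry over to real pages, although the tags and nesting you need to search for will differ from site to site.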

In [9]:
from bs4 import BeautifulSoup
# The "requests" library makes working with HTTP requests easier
# than the built-in urllib libraries.
import requests

Here, we access a webpage and download the HTML using requests. When we make a GET request, we get an HTTP response object back. 

In [10]:
r_2018 = requests.get("http://www.boxofficemojo.com/yearly/chart/?view=releasedate&view2=domestic&page=1&yr=2018")
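
The query string in the URL above carries the settings for the chart, including the page number and the year. As a rough sketch (assuming Python 3; the helper name `with_params` is made up for this example), the standard library can split those parameters apart and rebuild the URL with a different year. Whether the site actually serves every URL built this way is an assumption that needs checking.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

URL_2018 = ("http://www.boxofficemojo.com/yearly/chart/"
            "?view=releasedate&view2=domestic&page=1&yr=2018")

def with_params(url, **changes):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params.update({key: str(value) for key, value in changes.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))

# Show the parameters in the original URL, then a variant for another year.
print(dict(parse_qsl(urlsplit(URL_2018).query)))
print(with_params(URL_2018, yr=2017))
```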

You should get an HTTP 200 response, which means that the request went through without issue. If you get a different status code, you can look it up in [this list](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) to determine what it means. 

Alternatively, if you like your HTTP status codes illustrated as cat gifs, you can look up your codes using [http.cat](https://http.cat/).
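
If you would rather look a code up without leaving the notebook, Python 3's standard library includes the registered status codes. A small sketch (the helper name `describe_status` is made up for this example):

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '404 Not Found' for a status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes outside the standard registry have no entry.
        return "{} (unrecognized status code)".format(code)
    return "{} {}".format(status.value, status.phrase)

for code in (200, 404, 503):
    print(describe_status(code))
```

With a Requests response in hand, the number to pass in is its `status_code` attribute, for example `describe_status(r_2018.status_code)`.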

In [11]:
print(r_2018)

<Response [200]>


There are a lot of awesome things going on in this response object. Most importantly for us, it contains all the text of the page we requested, so we can look at it on our local machine. 
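
A quick way to get oriented is to print a few attributes instead of the whole page. The sketch below uses a stand-in object so that it runs offline; `summarize_response` and `fake_response` are made-up names, while `status_code`, `headers`, `encoding`, and `text` are attributes that a Requests response object provides.

```python
from types import SimpleNamespace

def summarize_response(response, preview_chars=60):
    """Collect a few basic facts about an HTTP response-like object."""
    return {
        "status_code": response.status_code,
        "content_type": response.headers.get("Content-Type"),
        "encoding": response.encoding,
        "length": len(response.text),
        "preview": response.text[:preview_chars],
    }

# Stand-in with made-up values; in the notebook, pass r_2018 instead.
fake_response = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "text/html; charset=utf-8"},
    encoding="utf-8",
    text="<html><head><title>Example</title></head><body></body></html>",
)

for key, value in summarize_response(fake_response).items():
    print(key, "->", value)
```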

In [12]:
## uncomment the line below to see the HTML returned by the response 
#print(r_2018.text)

While this blob of text is not difficult for our computer to search through, it can be a little difficult for us to wrap our human brains around. Other ways that we can look at the text, from inside a web browser, are: 

(a) Right click > View Source. 

@help insert screen shot 

(b) Right click > Inspect Item 

@help insert screen shot 

(c) View > Developer > Developer Tools 

@help insert screen shot 

### Understanding the HTML 

Which parts of this text do we need to get information about these movies? How can we pull out only the information that we want? 

*double-click to type your answer here*

# Exploratory Data Analysis Preview 

## Example correlation plots 