## Goals 

The goal of this part of the tutorial is to build your skills in: 

- web page scraping
- data cleaning and manipulation
- pandas
- exploratory data visualization
- thinking about data and what features might be used to make predictions

## We're going to make a movie! 

@help Insert a cool movie photo. Maybe a black panther scene?? Make sure to include an image credit. 

Our first goal today is to learn what makes a box office hit. To start off, let's pretend that we are (very data-savvy) movie producers who want to make a movie that will earn us a metric ton of money. Which features of a movie might correlate with making a ton of money? Is it the budget? The actors? 

List some of the features that you think you might want to consider as you make **The Best Movie Ever**.

*double-click to type your answer here*

## What are some good places to get data? 

We are going to use data to make **The Best Movie Ever**, but what are some good places for us to get that data? 

*double-click to type your answer here*

## Time to start scraping the web

To get started, we need to import several packages. The comments in the cell below explain what some of these imports do. 

In [None]:
# The %... is an IPython thing, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things on
# the notebook, instead of on a separate window.
%matplotlib inline 
# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call methods like plt.plot() instead of matplotlib.pyplot.plot().
from matplotlib import rcParams # matplotlib's global dictionary of default plot settings
from collections import defaultdict 
from imdb import IMDb

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
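try:
    import cPickle as pickle  # Python 2: faster C implementation of pickle
except ImportError:
    import pickle  # Python 3: the C implementation is used automatically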
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

import io 
import time
import requests
import sklearn
import warnings
warnings.filterwarnings('ignore')

We use [Seaborn](http://seaborn.pydata.org/) to give our plots a nicer default color palette, a large ("poster") size, and a white-grid background. 

### Scraping Box Office Mojo 

To get the text from the website to your local machine, we will use a GET request, which is available in the [Requests](http://docs.python-requests.org/en/master/) library. 

To get started exploring the text that you bring to your local machine, we will use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/). If you are familiar with another scraping library like [PyQuery](https://pythonhosted.org/pyquery/) or [Scrapy](https://scrapy.org/), feel free to do this exercise using those instead (or in addition! :)). 

In [2]:
from bs4 import BeautifulSoup
# The "requests" library makes working with HTTP requests easier
# than the built-in urllib libraries.
import requests

Here, we access a webpage and download the HTML using requests. When we make a GET request, we get an HTTP response object back. 

In [6]:
r_2018 = requests.get("http://www.boxofficemojo.com/yearly/chart/?view=releasedate&view2=domestic&page=1&yr=2018")

You should get an HTTP 200 response, which means that the request went through without issue. If you get a different status code, you can look it up in [this list](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) to determine what it means. 

Alternatively, if you like your HTTP status codes illustrated as cat gifs, you can look up your codes using [http.cat](https://http.cat/).

In [8]:
print(r_2018)

<Response [200]>
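
If you would rather look up a status code without leaving the notebook, the Python standard library ships a table of the registered codes. The sketch below assumes Python 3 (in Python 2 a similar table lives in `httplib.responses`), and `describe_status` is a hypothetical helper written for this tutorial, not part of Requests.

```python
# Assumes Python 3. `describe_status` is a hypothetical helper, not a Requests API.
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # Codes that are not in the standard registry have no HTTPStatus entry.
        return "{} (unrecognized status code)".format(code)


for code in (200, 404, 503):
    print(describe_status(code))
```

With the response from above, you could call `describe_status(r_2018.status_code)`; `status_code` is the attribute where Requests stores the numeric code.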


There are a lot of awesome things going on in this response object. Most importantly for us, it contains all the text of the page that we requested, so we can look at it on our local machine. 

In [7]:
print(r_2018.text)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<HEAD>
<TITLE>2018 Yearly Box Office Results - Box Office Mojo</TITLE>
<META NAME="keywords" CONTENT="2018, year, yearly, box, office, result, list, movie, movies, listing, listings, top movies, all time, film">
<META NAME="description" CONTENT="Yearly box office results for 2018.">
<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="screen" title="no title" charset="utf-8">
<link rel="stylesheet" href="/css/mojo.css?1" type="text/css" media="print" title="no title" charset="utf-8"></head>
<body>
	<iframe id="sis_pixel_sitewide" width="1" height="1" frameborder="0" marginwidth="0" marginheight="0" style="display: none;"></iframe>
<script>
    setTimeout(function(){
        try{
            //sis3.0 pixel
            var cacheBust = Math.random() * 10000000000000000,
                url_sis3 = 'http://s.amazon-adsystem.com/iu3?',
                params
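
Printing the whole page produces a lot of output, so it can help to look at just the first few lines. Here is a minimal sketch of that idea, assuming Python 3; `preview` is a hypothetical helper and `sample_html` is a made-up stand-in for `r_2018.text` so that the cell runs without a network connection.

```python
# `sample_html` is a made-up stand-in for the real page text (r_2018.text above).
sample_html = """<html lang="en">
<head>
<title>2018 Yearly Box Office Results - Box Office Mojo</title>
</head>
<body>
<p>placeholder body</p>
</body>
</html>"""


def preview(text, max_lines=3):
    """Return only the first `max_lines` lines of a long string."""
    return "\n".join(text.splitlines()[:max_lines])


print(preview(sample_html))
```

In the notebook you would call `preview(r_2018.text, max_lines=20)` instead, raising or lowering `max_lines` until you can see the part of the page you care about.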

While this blob of text is not difficult for our computer to search through, it can be a little difficult for us to wrap our human brains around. Other ways to look at the text are: 

(a) Right click > View Source. 

@help insert screen shot 

(b) Right click > Inspect Item 

@help insert screen shot 

(c) View > Developer > Developer Tools 

@help insert screen shot 

### Understanding the HTML 

Which parts of this text do we need to get information about these movies? How can we pull out only the information that we want? 

*double-click to type your answer here*

## Try it out! Scraping your own data. 

We downloaded one page of HTML text from Box Office Mojo. But we want to be able to see all of the data from the last three years. Can you do this? 
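
One possible starting point, if you get stuck: the URL we requested earlier carries the year in its `yr` query parameter, so you can build one URL per year. The sketch below assumes Python 3, assumes that "the last three years" means 2016 through 2018, and only asks for `page=1` of each year (a year may span more pages). `build_yearly_url` is a hypothetical helper name.

```python
# Assumptions: Python 3, years 2016-2018, and only the first results page per year.
from urllib.parse import urlencode

BASE_URL = "http://www.boxofficemojo.com/yearly/chart/"


def build_yearly_url(year, page=1):
    """Build the yearly chart URL in the same shape as the one requested above."""
    query = urlencode([
        ("view", "releasedate"),
        ("view2", "domestic"),
        ("page", page),
        ("yr", year),
    ])
    return BASE_URL + "?" + query


urls = [build_yearly_url(year) for year in (2016, 2017, 2018)]
for url in urls:
    print(url)
```

Each of these URLs can be passed to `requests.get` just as before. It is polite to pause between requests, for example with `time.sleep(1)`, so that you do not hammer the site.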

## Exploring the text using Beautiful Soup 