## Working With API Data & Web Scraping

**Class Agenda:**

 - Understand the basics of connecting to an API
 - Practice making API calls that return specific types of results
 - Learn how to update data automatically via API calls
 - Learn the basics of web scraping with Beautiful Soup

### API:  Application Programming Interface

 An API is how a software program makes its information available to users.
 APIs come in two different forms:

  - pre-defined function calls you can make (e.g., scikit-learn's `fit()`, `predict()`, etc.)
  - website endpoints that users can access to dynamically pull information from a database

We've spent most of this course working with the first type of API; this class covers the second.

The basic process is fairly generic:

 - get an access token (if necessary)
 - establish a request with your specific endpoint
 - add a query, if necessary, to grab specific information
 - receive the information, usually as JSON

The process is very similar to what was discussed in our class on dictionaries, because the JSON response loads into Python as a dictionary.
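
To make the link to dictionaries concrete, here is a minimal sketch of the last step. It assumes the API has already sent back JSON text; the `raw` string is a hand-trimmed copy of a few fields from the Fight Club response shown later in this lesson, not a live call.

```python
import json

# A trimmed copy of the kind of JSON text an API sends back
raw = (
    '{"id": 550, "original_title": "Fight Club", "budget": 63000000, '
    '"genres": [{"id": 18, "name": "Drama"}]}'
)

# json.loads turns JSON text into ordinary Python dicts and lists
movie = json.loads(raw)

print(type(movie))                 # <class 'dict'>
print(movie["original_title"])     # Fight Club
print(movie["genres"][0]["name"])  # Drama
```

When you call `.json()` on a `requests` response, it does this conversion for you.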

### Case Study:  The Movie Database

**Take 5 Minutes:** Go to https://themoviedb.org

 - Create an account
 - Settings --> API --> Create an account token
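
Once you have a key, one common habit is to keep it out of your notebook and read it from an environment variable instead. The sketch below is only one way to do that; the variable name `TMDB_API_KEY` and the helper `get_api_key` are hypothetical names chosen for this example.

```python
import os

def get_api_key(env_var="TMDB_API_KEY"):
    """Return the API key stored in an environment variable.

    TMDB_API_KEY is a hypothetical name chosen for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

After setting the variable in your shell, `api_key = get_api_key()` gives you the key without it ever appearing in your code.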

### A Simple Connection

In [1]:
# an example endpoint
import requests

url  = 'https://api.themoviedb.org/3/movie/550?api_key=c6c78f5b65558f9fc6ab9a3ef2d8ba7d'
data = requests.get(url).json()

In [2]:
# data returns to you as a dictionary
data

{'adult': False,
 'backdrop_path': '/mMZRKb3NVo5ZeSPEIaNW9buLWQ0.jpg',
 'belongs_to_collection': None,
 'budget': 63000000,
 'genres': [{'id': 18, 'name': 'Drama'}],
 'homepage': 'http://www.foxmovies.com/movies/fight-club',
 'id': 550,
 'imdb_id': 'tt0137523',
 'original_language': 'en',
 'original_title': 'Fight Club',
 'overview': 'A ticking-time-bomb insomniac and a slippery soap salesman channel primal male aggression into a shocking new form of therapy. Their concept catches on, with underground "fight clubs" forming in every town, until an eccentric gets in the way and ignites an out-of-control spiral toward oblivion.',
 'popularity': 36.011,
 'poster_path': '/adw6Lq9FiC9zjYEpOqfq03ituwp.jpg',
 'production_companies': [{'id': 508,
   'logo_path': '/7PzJdsLGlR7oW4J0J5Xcd0pHGRg.png',
   'name': 'Regency Enterprises',
   'origin_country': 'US'},
  {'id': 711,
   'logo_path': '/tEiIH5QesdheJmDAqQwvtN60727.png',
   'name': 'Fox 2000 Pictures',
   'origin_country': 'US'},
  {'id': 205

### Anatomy of a Request:

 - **base url:** the URL that you attach the various queries to, in our case `https://api.themoviedb.org`
 - **add-on urls**: subdirectories of the URL that are used for different portions of the API.  In our case it's `/3/movie`, which denotes that we're accessing the v3 REST API, in the `movie` subdirectory.
 - **query string**: the portion of the endpoint that holds the arguments for the dynamic parts of the data we would like to access.  In our example it is `550?api_key=c6c78f5b65558f9fc6ab9a3ef2d8ba7d`
  - this encodes the fact that we're searching for the movie with `id` set to 550, using the specified API key.  Strictly speaking, `550` is the last piece of the URL path, and only the portion after the `?` is the query string.
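
If you want to see these pieces programmatically, Python's standard library can split a URL for you. This is a small sketch using `urllib.parse` on the example endpoint; the variable names are just for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://api.themoviedb.org/3/movie/550?api_key=c6c78f5b65558f9fc6ab9a3ef2d8ba7d'
parts = urlsplit(url)

# scheme + host make up the base url
base_url = f'{parts.scheme}://{parts.netloc}'
# everything between the host and the ? is the path
path = parts.path
# everything after the ? is parsed into a dict of lists
query = parse_qs(parts.query)

# the last piece of the path is the movie id
movie_id = int(path.rsplit('/', 1)[-1])

print(base_url)   # https://api.themoviedb.org
print(path)       # /3/movie/550
print(query)      # {'api_key': ['c6c78f5b65558f9fc6ab9a3ef2d8ba7d']}
print(movie_id)   # 550
```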

### Other Components:

 - `?`: you see these in URLs a lot; it marks the start of the search parameters being sent to the database
 - `&`: connects different search parameters, e.g.
  - `cast_id=1234&api_key=58576fgdghkt`, and so on
   - parameters are usually insensitive to order
  - for certain arguments you can use `|` and `,` as `OR` and `AND` operators
   - these details can differ from API to API, so be sure to read the docs carefully
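
Rather than gluing `?` and `&` together by hand, you can let the standard library build the query string from a dictionary. The sketch below reuses the placeholder values from the bullet above; `with_genres` is assumed here purely for illustration, so check the API docs before relying on it.

```python
from urllib.parse import urlencode, parse_qs

# keys and values are the placeholders from the bullet above
params = {'cast_id': 1234, 'api_key': '58576fgdghkt'}
query = '?' + urlencode(params)
print(query)   # ?cast_id=1234&api_key=58576fgdghkt

# the same parameters written in the other order parse to the same thing
swapped = 'api_key=58576fgdghkt&cast_id=1234'
same = parse_qs(swapped) == parse_qs(query[1:])
print(same)    # True

# special characters such as | are percent-encoded automatically
encoded = urlencode({'with_genres': '18|35'})
print(encoded)   # with_genres=18%7C35
```

`requests.get` also accepts a `params=` dictionary and performs the same encoding for you.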

### More involved example:

Go to https://developers.themoviedb.org/3, and choose the `Discover` tab.  

We're going to find all movies released in 2007 that run at least 3 hours (180 minutes).

The Movie DB API has built-in parameters to search for exactly this:

 - `primary_release_year`
 - `with_runtime.gte`

In [3]:
# setup the query string
api_key  = 'c6c78f5b65558f9fc6ab9a3ef2d8ba7d'
# the url we are going to attach everything else to
url_base = 'https://api.themoviedb.org/3/discover/movie'
# specific query string we're going to use -- include arguments for
# primary release year, with_runtime.gte, and our api key
query   = f'?primary_release_year=2007&with_runtime.gte=180&api_key={api_key}'

data = requests.get(url_base+query).json()

In [4]:
# and our results
data

{'page': 1,
 'total_results': 153,
 'total_pages': 8,
 'results': [{'popularity': 11.371,
   'vote_count': 1992,
   'video': False,
   'poster_path': '/7Yjzttt0VfPphSsUg8vFUO9WaEt.jpg',
   'id': 1992,
   'adult': False,
   'backdrop_path': '/a2zIKDg5QGFc2vzdaPXT7uZKipe.jpg',
   'original_language': 'en',
   'original_title': 'Planet Terror',
   'genre_ids': [28, 27, 53],
   'title': 'Planet Terror',
   'vote_average': 6.6,
   'overview': 'Two doctors find their graveyard shift inundated with townspeople ravaged by sores. Among the wounded is Cherry, a dancer whose leg was ripped from her body. As the invalids quickly become enraged aggressors, Cherry and her ex-boyfriend Wray lead a team of accidental warriors into the night.',
   'release_date': '2007-04-06'},
  {'popularity': 7.776,
   'vote_count': 142,
   'video': False,
   'poster_path': '/huVxyPbe5XYZjSVGI68Y7RBo6I9.jpg',
   'id': 10247,
   'adult': False,
   'backdrop_path': '/a94rYKkW0wtko06ogdT2zQ0CHRF.jpg',
   'original_langu
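
Notice that the response reports `'total_pages': 8`, so a single call only returns the first page of matches. A rough sketch of collecting every page follows. `collect_results` and `fake_fetch` are hypothetical helpers written for this example, and the fake fetcher stands in for the real API so the code runs offline.

```python
def collect_results(fetch_page):
    """Gather the 'results' list from every page of a paginated response.

    fetch_page is any function that takes a page number and returns the
    parsed JSON dictionary for that page.
    """
    first = fetch_page(1)
    movies = list(first['results'])
    for page in range(2, first['total_pages'] + 1):
        movies.extend(fetch_page(page)['results'])
    return movies


# A stand-in for the real API: three pages with two made-up ids each
def fake_fetch(page):
    return {
        'page': page,
        'total_pages': 3,
        'results': [{'id': page * 10 + i} for i in range(2)],
    }


all_movies = collect_results(fake_fetch)
print(len(all_movies))   # 6
```

Against the real endpoint, `fetch_page` could be something like `lambda page: requests.get(url_base + query + f'&page={page}').json()`, assuming the API accepts a `page` parameter as its docs describe.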

In [5]:
# note the actual movies are stored inside the results key, with each item in the list
# a dictionary with info about its movie
data['results']

[{'popularity': 11.371,
  'vote_count': 1992,
  'video': False,
  'poster_path': '/7Yjzttt0VfPphSsUg8vFUO9WaEt.jpg',
  'id': 1992,
  'adult': False,
  'backdrop_path': '/a2zIKDg5QGFc2vzdaPXT7uZKipe.jpg',
  'original_language': 'en',
  'original_title': 'Planet Terror',
  'genre_ids': [28, 27, 53],
  'title': 'Planet Terror',
  'vote_average': 6.6,
  'overview': 'Two doctors find their graveyard shift inundated with townspeople ravaged by sores. Among the wounded is Cherry, a dancer whose leg was ripped from her body. As the invalids quickly become enraged aggressors, Cherry and her ex-boyfriend Wray lead a team of accidental warriors into the night.',
  'release_date': '2007-04-06'},
 {'popularity': 7.776,
  'vote_count': 142,
  'video': False,
  'poster_path': '/huVxyPbe5XYZjSVGI68Y7RBo6I9.jpg',
  'id': 10247,
  'adult': False,
  'backdrop_path': '/a94rYKkW0wtko06ogdT2zQ0CHRF.jpg',
  'original_language': 'en',
  'original_title': 'He Was a Quiet Man',
  'genre_ids': [35, 18, 10749],
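
Because `data['results']` is just a list of dictionaries, ordinary list comprehensions are enough to pull out what you need. The sketch below uses a hand-trimmed copy of the two movies shown above (only three keys each) so that it runs without a network call.

```python
# a trimmed copy of the response above, keeping only a few keys
sample = {
    'results': [
        {'original_title': 'Planet Terror', 'popularity': 11.371, 'vote_count': 1992},
        {'original_title': 'He Was a Quiet Man', 'popularity': 7.776, 'vote_count': 142},
    ]
}

# grab one field from every movie
titles = [movie['original_title'] for movie in sample['results']]
print(titles)       # ['Planet Terror', 'He Was a Quiet Man']

# keep only movies with at least 1,000 votes
well_known = [m['original_title'] for m in sample['results'] if m['vote_count'] >= 1000]
print(well_known)   # ['Planet Terror']
```

If you prefer a table, `pd.DataFrame(data['results'])` builds a DataFrame with one row per movie.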
 

### Web Scraping

Web scraping is a way of connecting to the web so that HTML documents become structured data in your console.

We're going to discuss two separate ways of doing this:

 - `pd.read_html`
 - `BeautifulSoup`

### pd.read_html

The most straightforward way to bring web data into your console.

 - uses an HTML parser under the hood (`lxml` by default, falling back to `bs4` with `html5lib`)
 - only reads HTML data inside a `<table>` tag
 - returns a list of DataFrames, one for each `<table>` it finds, so the first table on the page is at index 0
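
To see what `pd.read_html` hands back without depending on a live website, here is a small sketch that parses an inline HTML string. The HTML is made up for this example (the two price rows are copied from the output later in this section), and it assumes one of the parsers pandas supports is installed.

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>Dec 18, 2019</td><td>154.37</td></tr>
  <tr><td>Dec 17, 2019</td><td>154.69</td></tr>
</table>
<p>Text outside a table is ignored.</p>
<table>
  <tr><th>Ticker</th></tr>
  <tr><td>MSFT</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table>
tables = pd.read_html(StringIO(html))

print(len(tables))               # 2
print(list(tables[0].columns))   # ['Date', 'Close']
print(tables[1].iloc[0, 0])      # MSFT
```

Wrapping the string in `StringIO` keeps the call working across pandas versions, since newer releases discourage passing raw HTML text directly.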

### Quick Example:  Downloading Microsoft Stock Data

Go to this website and copy its url:  'https://finance.yahoo.com/quote/MSFT/history?p=MSFT'

In [6]:
import pandas as pd

# load into df
msft = pd.read_html('https://finance.yahoo.com/quote/MSFT/history?p=MSFT')

# msft is a list -- first item is a dataframe
msft[0]

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume
0,"Dec 18, 2019",154.30,155.21,154.19,154.37,154.37,18926677.0
1,"Dec 17, 2019",155.45,155.71,154.45,154.69,154.69,25425600.0
2,"Dec 16, 2019",155.11,155.90,154.82,155.53,155.53,24144200.0
3,"Dec 13, 2019",153.00,154.89,152.83,154.53,154.53,23845400.0
4,"Dec 12, 2019",151.65,153.44,151.02,153.24,153.24,24612100.0
5,"Dec 11, 2019",151.54,151.87,150.33,151.70,151.70,18856600.0
6,"Dec 10, 2019",151.29,151.89,150.76,151.13,151.13,16476100.0
7,"Dec 09, 2019",151.07,152.21,150.91,151.36,151.36,16687400.0
8,"Dec 06, 2019",150.99,151.87,150.27,151.75,151.75,16403500.0
9,"Dec 05, 2019",150.05,150.32,149.48,149.93,149.93,17869100.0


### Beautiful Soup

More fully featured web scraper.  Allows you to create searchable data structures from `html` tags.  

In [7]:
# import it the following way
from bs4 import BeautifulSoup

url  = 'https://finance.yahoo.com/quote/MSFT/history?p=MSFT'
msft = requests.get(url)

# create a BeautifulSoup object from the text of the website
# naming the parser explicitly avoids a warning and keeps results consistent
doc  = BeautifulSoup(msft.text, 'html.parser')

In [8]:
# the loaded website is its own custom data type
type(doc)

bs4.BeautifulSoup

In [9]:
# you can grab any tag from the website as if it were an attribute
# for example, this is the contents of the title tag
doc.title

<title>Microsoft Corporation (MSFT) Stock Historical Prices &amp; Data - Yahoo Finance</title>

In [10]:
# or the body
doc.body

<body><div id="app"><div class="" data-react-checksum="-672278274" data-reactid="1" data-reactroot=""><div data-reactid="2"><div class="render-target-active render-target-default" data-reactid="3" id="render-target-default"><div class="finance US en-US H(100%) uh-search-open_Ovy(h) uh-search-open_H(100vh)" data-reactid="4"><div class="YDC-MainCanvas Bgc($bg-body) Mih(100%)" data-reactid="5" id="YDC-MainCanvas" style="padding-top:54px;margin-bottom:auto;"><div class="YDC-UH lw-nav-open_D(n)" data-reactid="6" id="YDC-UH" style="height:54px;"><div class="YDC-UH-Stack Z(10) End(0) Start(0) T(0) Pos(f) uh-search-open_Pos(a) uh-mobile-nav-open_D(n) lw-nav-open_D(n)" data-reactid="7" id="YDC-UH-Stack"><div data-reactid="8" id="YDC-UH-Stack-Composite"><div data-reactid="9"><div data-locator="subtree-root" id="mrt-node-UH-0-UH"><div data-react-checksum="-279790627" data-reactid="1" data-reactroot="" id="UH-0-UH-Proxy"><div data-reactid="2"><div data-reactid="3"><div class="C(#fff) Fz(13px) H(22

In [11]:
# or its immediate predecessor in the tree
doc.body.parent

<html class="NoJs featurephone" id="atomic" lang="en-US"><head prefix="og: http://ogp.me/ns#"><script>window.performance && window.performance.mark && window.performance.mark('PageStart');</script><meta charset="utf-8"/><title>Microsoft Corporation (MSFT) Stock Historical Prices &amp; Data - Yahoo Finance</title><meta content="MSFT, Microsoft Corporation, MSFT historical prices, Microsoft Corporation historical prices, historical prices, stocks, quotes, finance" name="keywords"/><meta content="on" http-equiv="x-dns-prefetch-control"/><meta content="on" property="twitter:dnt"/><meta content="90376669494" property="fb:app_id"/><meta content="#400090" name="theme-color"/><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="Discover historical prices for MSFT stock on Yahoo Finance. View daily, weekly or monthly format back to when Microsoft Corporation stock was issued." lang="en-US" name="description"/><meta content="guce.yahoo.com" name="oath:guce:consent-

In [12]:
# the find_all method allows you to grab all instances of a particular tag
all_links = doc.find_all('a')

In [13]:
# first link in the page
all_links[0]

<a class="Bgpx(0) Bgr(nr) Cur(p) D(b) H(35px) Bgz(702px) Mx(a)! W(92px)" data-reactid="12" href="https://finance.yahoo.com/" id="uh-logo"><b class="Hidden" data-reactid="13">Yahoo</b></a>

In [14]:
# second, and so on
all_links[1]

<a class="Pos(r) D(ib) Ta(s) Td(n):h" data-reactid="45" href="https://mail.yahoo.com/?.intl=us&amp;.lang=en-US&amp;.partner=none&amp;.src=finance" id="uh-mail"><svg class="Cur(p)" data-icon="NavMail" data-reactid="46" height="35" style="fill:#400090;stroke:#400090;stroke-width:0;vertical-align:bottom;" viewbox="0 0 512 512" width="30"><path d="M460.586 91.31H51.504c-10.738 0-19.46 8.72-19.46 19.477v40.088l224 104.03 224-104.03v-40.088c0-10.757-8.702-19.478-19.458-19.478M32.046 193.426V402.96c0 10.758 8.72 19.48 19.458 19.48h409.082c10.756 0 19.46-8.722 19.46-19.48V193.428l-224 102.327-224-102.327z" data-reactid="47"></path></svg><b class="Lh(userNavTextLh) D(ib) C($c-fuji-purple-1-c) Fz(14px) Fw(b) Va(t) Mstart(6px)" data-reactid="48">Mail</b></a>
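
Usually you want a tag's attributes or text rather than the raw tag. The sketch below shows one way to pull those out. It runs on a small made-up HTML string (loosely modeled on the two links above) instead of the live page, so the exact markup is an assumption.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo page</title></head>
<body>
  <a id="uh-logo" href="https://finance.yahoo.com/"><b>Yahoo</b></a>
  <a id="uh-mail" href="https://mail.yahoo.com/"><b>Mail</b></a>
  <a>no href here</a>
</body></html>
"""
doc = BeautifulSoup(html, 'html.parser')

# .get returns None when the attribute is missing, instead of raising an error
hrefs = [link.get('href') for link in doc.find_all('a')]
# get_text collects the visible text inside each tag
labels = [link.get_text(strip=True) for link in doc.find_all('a')]

# find returns only the first match; keyword arguments filter on attributes
mail = doc.find('a', id='uh-mail')

print(hrefs)         # ['https://finance.yahoo.com/', 'https://mail.yahoo.com/', None]
print(labels)        # ['Yahoo', 'Mail', 'no href here']
print(mail['href'])  # https://mail.yahoo.com/
```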

### Case Study:  Scraping the General Assembly Website

Please see accompanying notebook in folder!