# Getting Data from the Web

This class will provide an introduction to programmatically accessing data from websites and APIs using Python. 

## Table of Contents
1. [APIs](#1)
  1. [What are they?](#1)
  2. [Movies](#1B)
  3. [Color Palette Data](#1C)
2. [Scraping](#2)
  1. [Weather](#2A)
  2. [UFO Sightings](#2B)

## API (Application Programming Interface) <a id="1"></a>

What is an API?
- Structured way to expose specific functionality and data to users
- Web APIs usually follow the "REST" architectural style (each request is stateless)

How to interact with an API:
- Make a "request" to a specific URL (an "endpoint"), and get the data back in a "response"
- Most relevant request method for us is GET (other methods: POST, PUT, DELETE)
- Response is often JSON or XML format
- Web console is sometimes available (allows you to explore an API)

### Movie Data from the OMDb API <a id="1B"></a>

In [3]:
import pandas as pd
import requests

### Using the Requests Library
We will submit a GET request for a specific movie to the URL `http://www.omdbapi.com`

In [4]:
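API_KEY = "53bfc95d" # <- Super Secret Shhhh
title = "Jurassic Park" # Search for a movie you like
url = 'http://www.omdbapi.com?'

# Query parameters: 't' is the title to look up, 'apikey' authenticates us
payload = {'t': title,
           'apikey': API_KEY}

# Reuse the url variable so later cells hit the same endpoint
r = requests.get(url, params=payload)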

In [5]:
# check the status: 200 means success, 4xx or 5xx means error
r.status_code

200

In [6]:
r.url

'http://www.omdbapi.com/?t=Jurassic+Park&apikey=53bfc95d'
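
The query string in `r.url` is the `payload` dictionary in URL-encoded form. As a rough illustration, the sketch below reproduces that encoding step with the standard library's `urllib.parse.urlencode`. This is not necessarily what `requests` does internally, and the key `demo` is a placeholder rather than a real API key.

```python
from urllib.parse import urlencode

# Placeholder values for illustration; 'demo' is not a real API key
payload = {'t': 'Jurassic Park', 'apikey': 'demo'}

# urlencode joins key=value pairs with '&' and encodes spaces as '+'
query = urlencode(payload)
full_url = 'http://www.omdbapi.com/?' + query

print(query)
print(full_url)
```

Building the query from a dictionary avoids hand-escaping spaces and punctuation in movie titles.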

We know from the documentation on omdbapi.com that the response is a `JSON` object.

In [7]:
r.json()

{'Title': 'Jurassic Park',
 'Year': '1993',
 'Rated': 'PG-13',
 'Released': '11 Jun 1993',
 'Runtime': '127 min',
 'Genre': 'Adventure, Sci-Fi, Thriller',
 'Director': 'Steven Spielberg',
 'Writer': 'Michael Crichton (novel), Michael Crichton (screenplay), David Koepp (screenplay)',
 'Actors': 'Sam Neill, Laura Dern, Jeff Goldblum, Richard Attenborough',
 'Plot': 'During a preview tour, a theme park suffers a major power breakdown that allows its cloned dinosaur exhibits to run amok.',
 'Language': 'English, Spanish',
 'Country': 'USA',
 'Awards': 'Won 3 Oscars. Another 32 wins & 25 nominations.',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMjM2MDgxMDg0Nl5BMl5BanBnXkFtZTgwNTM2OTM5NDE@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '8.1/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '91%'},
  {'Source': 'Metacritic', 'Value': '68/100'}],
 'Metascore': '68',
 'imdbRating': '8.1',
 'imdbVotes': '784,720',
 'imdbID': 'tt0107290',
 'Type': 'movie',
 'DVD

In [8]:
#We can call out specific elements
r.json()['Actors']

'Sam Neill, Laura Dern, Jeff Goldblum, Richard Attenborough'
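
Note that every value in the response is a string, even the numeric ones. A minimal sketch of converting a few of the fields shown above into more useful Python types (the field values are copied from the output; the variable names are only illustrative):

```python
# A few fields copied from the response shown above
movie = {
    'Actors': 'Sam Neill, Laura Dern, Jeff Goldblum, Richard Attenborough',
    'Runtime': '127 min',
    'imdbVotes': '784,720',
    'imdbRating': '8.1',
}

# Split the comma-separated actor string into a list of names
actors = [name.strip() for name in movie['Actors'].split(',')]

# '127 min' -> 127
runtime_minutes = int(movie['Runtime'].split()[0])

# '784,720' -> 784720 (drop the thousands separator first)
votes = int(movie['imdbVotes'].replace(',', ''))

rating = float(movie['imdbRating'])

print(actors)
print(runtime_minutes, votes, rating)
```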

What happens if the movie isn't found?

In [9]:
payload = {'apikey': API_KEY, 
          't': 'Philip is a dope dude'}

r = requests.get(url, params=payload)
print(r.status_code)
r.json()

200


{'Response': 'False', 'Error': 'Movie not found!'}
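
Because the status code is still 200 here, checking `r.status_code` alone will not catch a failed lookup. One possible approach, sketched below with a hypothetical helper named `omdb_error`, is to inspect the `Response` field of the parsed JSON instead:

```python
def omdb_error(data):
    """Return the API's error message, or None if no error is reported."""
    if data.get('Response') == 'False':
        return data.get('Error', 'Unknown error')
    return None

# The first dictionary matches the output above; the second is a trimmed success case
not_found = {'Response': 'False', 'Error': 'Movie not found!'}
found = {'Title': 'Jurassic Park', 'Year': '1993'}

print(omdb_error(not_found))
print(omdb_error(found))
```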

### Exercise
Python strongly emphasizes the *Don't Repeat Yourself (DRY)* programming principle, and accessing APIs can be a very repetitive process when gathering lots of data. This repetition motivates people to write language-specific *clients* or *thin client wrappers* that make calls to an API easier by using functions native to that programming language.

Define a function to return the year of release of a given movie title, return None if no movie found.

In [10]:
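def get_movie_year(title):
    
    payload = {
        'apikey': API_KEY,
        't': title
    }
    
    r = requests.get(url, params=payload)
    # A missing movie still returns status 200 but has no 'Year' key,
    # so .get() returns None instead of raising a KeyError
    year = r.json().get('Year')
    
    return year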

In [11]:
get_movie_year("Howl's Moving Castle")

'2004'
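
To make the idea of a *thin client wrapper* concrete, here is a minimal sketch. The class name `MovieClient`, its methods, and the injected `fetch` callable are all hypothetical and not part of any published client; `fake_fetch` stands in for a real HTTP call so the example runs offline.

```python
class MovieClient:
    """Hypothetical thin wrapper around a movie-lookup API (illustrative only)."""

    def __init__(self, api_key, fetch, base_url='http://www.omdbapi.com/'):
        self.api_key = api_key
        self.fetch = fetch        # callable(url, params) -> dict of parsed JSON
        self.base_url = base_url
        self._cache = {}          # avoid repeating identical lookups

    def by_title(self, title):
        if title not in self._cache:
            params = {'apikey': self.api_key, 't': title}
            self._cache[title] = self.fetch(self.base_url, params)
        return self._cache[title]

    def year(self, title):
        # Returns None when the response has no 'Year' key
        return self.by_title(title).get('Year')


# Stand-in for a real HTTP call; records which titles were requested
calls = []

def fake_fetch(url, params):
    calls.append(params['t'])
    if params['t'] == "Howl's Moving Castle":
        return {'Title': params['t'], 'Year': '2004'}
    return {'Response': 'False', 'Error': 'Movie not found!'}

client = MovieClient('demo', fake_fetch)
print(client.year("Howl's Moving Castle"))
print(client.year("Howl's Moving Castle"))  # second call is served from the cache
print(client.year('Not a real movie'))
```

With `requests`, the `fetch` argument could be something like `lambda url, params: requests.get(url, params=params).json()`.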

### Business Use Case Discussion
* How might managers or clients ask us to leverage text data such as this to inform business decisions?
* What could we do if we mashed this movie meta data up with movie review data from a source such as Rotten Tomatoes?

## Color Wheel <a id='1C'></a>
In this example, we access data on some of the colors in the Crayola® palette via the Smithsonian Cooper Hewitt's API. Archivists at Cooper Hewitt use this palette to tag images of objects by color. Downstream, these tags allow for greater accuracy in information retrieval for users looking for objects of a certain hue.

### Business Use Case
A major online retail aggregator has asked Deloitte to help it increase the effectiveness of its search engine. They found that users are searching their catalog of over 100,000 items by color. However, many of the items don't have color tags, making the search process frustrating for users.

Our role will be to leverage the Cooper Hewitt's extensive online catalog of design objects tagged with colors to build a training set for automated tagging for the client with machine learning. 

If our future algorithm is successful, it will save our client thousands of hours of manual labor tagging the images on their website.


*__Note:__ This API authenticates requests using OAuth2 access tokens. I've created a token for us ahead of time.*

[Article on Cooper Hewitt's API](https://labs.cooperhewitt.org/2014/the-api-at-the-center-of-the-museum/)

In [12]:
key = "84976de03204c1d366ae0224bf21d103"
token = "2f49c9d05b2faf779d11420637d99f57"

In [13]:
base_url = 'https://api.collection.cooperhewitt.org/rest/?method=cooperhewitt.colors.palettes.getInfo&access_token=%s&palette=crayola' % token

In [14]:
r = requests.get(base_url)

In [15]:
r.json()

{'palette': 'crayola',
 'colors': {'#fc89ac': {'name': 'Tickle Me Pink'},
  '#1f75fe': {'name': 'Blue'},
  '#a8e4a0': {'name': 'Granny Smith Apple'},
  '#fc74fd': {'name': 'Pink Flamingo'},
  '#7366bd': {'name': 'Blue Violet'},
  '#18a7b5': {'name': 'Teal Blue'},
  '#1164b4': {'name': 'Green Blue'},
  '#b2ec5d': {'name': 'Inchworm'},
  '#58427c': {'name': 'Cyber Grape'},
  '#bf4f51': {'name': 'Bittersweet Shimmer'},
  '#5d76cb': {'name': 'Indigo'},
  '#c5e384': {'name': 'Yellow Green'},
  '#8fd400': {'name': 'Sheen Green'},
  '#4a646c': {'name': 'Deep Space Sparkle'},
  '#ffbcd9': {'name': 'Cotton Candy'},
  '#ff7f49': {'name': 'Burnt Orange'},
  '#fefe22': {'name': 'Laser Lemon'},
  '#bc5d58': {'name': 'Chestnut'},
  '#9fe2bf': {'name': 'Sea Green'},
  '#000000': {'name': 'Black'},
  '#414a4c': {'name': 'Outer Space'},
  '#7851a9': {'name': 'Royal Purple'},
  '#ace5ee': {'name': 'Blizzard Blue'},
  '#a2add0': {'name': 'Wild Blue Yonder'},
  '#dd9475': {'name': 'Copper'},
  '#ffffff': 

In [17]:
_hex = list(r.json()['colors'].keys())
names = [k['name'] for k in list(r.json()['colors'].values())]

In [18]:
# DataFrame of the results
crayola= pd.DataFrame({'hex': _hex, 'name':names})

crayola.head()

Unnamed: 0,hex,name
0,#fc89ac,Tickle Me Pink
1,#1f75fe,Blue
2,#a8e4a0,Granny Smith Apple
3,#fc74fd,Pink Flamingo
4,#7366bd,Blue Violet


### Red/Green/Blue
The hex code is by design very dense information. Let's parse out the individual color components from the data. 

In [19]:
def hex_to_rbg(_hexcode):
    h = _hexcode.lstrip('#')
    rbg = tuple(int(h[i:i+2], 16) for i in (0, 2 ,4))
    return rbg

In [20]:
rbg = [hex_to_rbg(h) for h in crayola['hex'].tolist()]

In [21]:
rbg = pd.DataFrame(rbg, columns=['red', 'green', 'blue'])
crayola = pd.concat([crayola, rbg], axis=1)

In [22]:
crayola['red'] = [hex_to_rbg(x)[0] for x in crayola['hex']]

In [23]:
crayola.head()

Unnamed: 0,hex,name,red,green,blue
0,#fc89ac,Tickle Me Pink,252,137,172
1,#1f75fe,Blue,31,117,254
2,#a8e4a0,Granny Smith Apple,168,228,160
3,#fc74fd,Pink Flamingo,252,116,253
4,#7366bd,Blue Violet,115,102,189
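
With the colors split into numeric components, we can start comparing them. The sketch below finds the palette entry closest to an arbitrary RGB value using squared Euclidean distance. This is only one simple way to define "closest" and is not necessarily how Cooper Hewitt computes its tags; the function `closest_color` and the query values are made up for illustration.

```python
def hex_to_rgb(hexcode):
    """Convert '#rrggbb' into a (red, green, blue) tuple of ints."""
    h = hexcode.lstrip('#')
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

# The first five palette entries from the table above
palette = {
    '#fc89ac': 'Tickle Me Pink',
    '#1f75fe': 'Blue',
    '#a8e4a0': 'Granny Smith Apple',
    '#fc74fd': 'Pink Flamingo',
    '#7366bd': 'Blue Violet',
}

def closest_color(rgb, palette):
    """Return (hex, name) of the palette color nearest to rgb."""
    def sq_dist(hexcode):
        return sum((a - b) ** 2 for a, b in zip(rgb, hex_to_rgb(hexcode)))
    best = min(palette, key=sq_dist)
    return best, palette[best]

print(closest_color((250, 130, 170), palette))
print(closest_color((20, 100, 240), palette))
```

Distance in raw RGB is a crude proxy for how similar two colors look to a person, so treat this as a starting point only.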


In [23]:
crayola.to_csv('../data/crayola_colors.csv', index=False)

In [24]:
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode(connected=True)

In [25]:
trace = go.Scatter3d(
    x = crayola['red'],
    y = crayola['green'],
    z = crayola['blue'],
    mode = 'markers',
    marker = dict(
        color = crayola['hex'].tolist(),
        size = 5,
        symbol = 'circle',
        opacity = 1))

layout = go.Layout(margin=dict(l=0, r=0, b=0, t=0))

In [27]:
fig = go.Figure(data=[trace],layout=layout)
py.iplot(fig, filename='Crayola-Scatter')

### Your Turn!
*30-40 minutes*

Build out our training dataset by studying the API documentation on the Cooper Hewitt website. We need a dataset with the objects the museum currently has on display (only 100 items), the images associated with those items, and the color(s) of those items.

* Store the names of the objects and other metadata in a csv called `current_collection.csv`
  * Remember to use a Python dictionary to structure the API parameters like we did in the first exercise. This will help you structure the URL automatically.
* Place the images in a folder named `collection_images` inside the day_5 folder, and add the file name of each image to the previous csv
  * You can use the response attribute `content` to access the raw bytes and write them to a file.
  * Raw Example: `open('image.jpg', 'wb').write(r.content)`, where `r` is the response
* Grab the color information and place it in another csv, `current_collection_colors.csv`
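
As a hint for the image step, here is a minimal sketch of writing raw bytes to a folder that may not exist yet. The helper name `save_image` is hypothetical, and placeholder bytes are used in place of a real download so the example runs offline.

```python
import os
import tempfile

def save_image(content, folder, object_id):
    """Write raw image bytes to <folder>/<object_id>.jpg and return the path."""
    # Create the folder first; open() fails if the directory is missing
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, '%s.jpg' % object_id)
    with open(path, 'wb') as img:
        img.write(content)
    return path

# Placeholder bytes instead of a real download (in practice: response.content)
fake_bytes = b'\xff\xd8\xff\xe0 not a real image'

# A temporary directory keeps this example from touching your project folders
folder = os.path.join(tempfile.mkdtemp(), 'collection_images')
saved = save_image(fake_bytes, folder, 18065649)

print(os.path.basename(saved))
```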

The API documentation is available [here](https://collection.cooperhewitt.org/api/methods/). You will want to use the following endpoints: 
1. [`getOnDisplay`](https://collection.cooperhewitt.org/api/methods/cooperhewitt.objects.getOnDisplay)
2. [`getImages`](https://collection.cooperhewitt.org/api/methods/cooperhewitt.objects.getImages)
3. [`getColors`](https://collection.cooperhewitt.org/api/methods/cooperhewitt.objects.getColors)
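
All three endpoints share the same base URL and differ only in the `method` parameter, so one small helper can build every payload. The sketch below assumes a hypothetical helper `build_payload` and a placeholder token; the `object_id` parameter name is the one used later in this notebook.

```python
from urllib.parse import urlencode

API_ROOT = 'https://api.collection.cooperhewitt.org/rest/?'

def build_payload(method, token, **extra):
    """Build the query parameters for a cooperhewitt.objects.* method."""
    payload = {'method': 'cooperhewitt.objects.' + method,
               'access_token': token}
    payload.update(extra)
    return payload

token = 'YOUR_TOKEN'  # placeholder, not a real access token

on_display_payload = build_payload('getOnDisplay', token)
images_payload = build_payload('getImages', token, object_id=18065649)

# Show the URL that would be requested for the images call
print(API_ROOT + urlencode(images_payload))
```

Each payload can then be passed to `requests.get(API_ROOT, params=...)` as in the earlier cells.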

In [28]:
api = 'https://api.collection.cooperhewitt.org/rest/?'

def on_display(token, n=100):
    payload = {'access_token': token,
              'method': 'cooperhewitt.objects.getOnDisplay'}
    
    r = requests.get(api, params=payload)
    
    objs = pd.DataFrame.from_records(r.json()['objects'])
    
    # Return at most n objects
    return objs.head(n)

objects = on_display(token, n=100)

In [29]:
objects.head()

Unnamed: 0,accession_number,creditline,date,decade,department_id,description,dimensions,dimensions_raw,gallery_text,has_no_known_copyright,...,title_raw,tms:id,type_id,url,videos,woe:country,woe:country_id,year_acquired,year_end,year_start
0,1921-6-402,,,,35347493,,,,,1,...,Essai de papilloneries humaines (Ideas for Sce...,48,35256543,https://collection.cooperhewitt.org/objects/18...,,,,,,
1,1906-21-245,,,,35347493,,,,,1,...,"Design for Embroidery, Corner of Gentleman's W...",14463,35237093,https://collection.cooperhewitt.org/objects/18...,,23424819.0,23424819.0,,,
2,1909-13-100,,,,35347493,,,,,1,...,Design for Woven Paisley,16906,35237093,https://collection.cooperhewitt.org/objects/18...,,,,,,
3,1909-13-102,,,,35347493,,,,,1,...,Designs for Woven Paisley,16908,35237093,https://collection.cooperhewitt.org/objects/18...,,,,,,
4,1909-13-103,,,,35347493,,,,,1,...,Design for Woven Paisley,16909,35237093,https://collection.cooperhewitt.org/objects/18...,,,,,,


In [30]:
list(objects)

['accession_number',
 'creditline',
 'date',
 'decade',
 'department_id',
 'description',
 'dimensions',
 'dimensions_raw',
 'gallery_text',
 'has_no_known_copyright',
 'id',
 'inscribed',
 'is_loan_object',
 'justification',
 'label_text',
 'markings',
 'media_id',
 'medium',
 'on_display',
 'period_id',
 'provenance',
 'signed',
 'title',
 'title_raw',
 'tms:id',
 'type_id',
 'url',
 'videos',
 'woe:country',
 'woe:country_id',
 'year_acquired',
 'year_end',
 'year_start']

In [31]:
test_id = objects['id'][0]
print(test_id)

18065649


In [32]:
import os

# Make sure the output folder exists before writing into it
os.makedirs('../collection_images/', exist_ok=True)

for obj in objects['id']:

    payload = {'method': 'cooperhewitt.objects.getImages',
            'access_token': token,
            'object_id': obj}

    r = requests.get(api, params=payload)

    # Take the 'b' size of the first image listed for the object
    image_url = r.json()['images'][0]['b']['url']
    
    image = requests.get(image_url)
    
    out = os.path.join('../collection_images/', "%s.jpg" % obj)
        
    with open(out, 'wb') as img:
        img.write(image.content)

In [35]:
print(r.json().keys())

dict_keys(['object_id', 'count_images', 'images', 'stat'])


In [36]:
import os

def grab_images(object_ids, token, output_path):
    
    # Create the output folder if it does not exist yet
    os.makedirs(output_path, exist_ok=True)
    
    for i in object_ids:
        
        payload = {'method': 'cooperhewitt.objects.getImages',
                  'access_token': token,
                  'object_id': i}
        
        r = requests.get(api, params=payload)
        
        # Take the 'b' size of the first image listed for the object
        image_url = r.json()['images'][0]['b']['url']
        
        image = requests.get(image_url)
        out = os.path.join(output_path, "%s.jpg" % i)
        
        with open(out, 'wb') as img:
            img.write(image.content)

In [37]:
grab_images(objects['id'], token, '../collection_images')

In [38]:
def grab_colors(object_ids, token):
    
    data = []
    
    for i in object_ids:
    
        payload = {'method': 'cooperhewitt.objects.getColors',
                  'access_token': token,
                  'object_id': i}

        r = requests.get(api, params=payload)

        for color in r.json()['colors']:
            temp = {'id': i,
                   'true_color': color['color'],
                   'crayola': color['closest_crayola']}
            
            data.append(temp)
            
    return pd.DataFrame.from_records(data)

In [55]:
colors = grab_colors(objects['id'], token)

In [None]:
colors.head()

## Web Scraping <a id=2></a>

Oftentimes data is not available in the neat and tidy formats we are used to from databases and APIs. We need to go out into the world and capture the data. 

Enter web scraping: the process of crawling one or more websites and extracting structured information from their pages. 

There are a whole host of ethical concerns with web scraping. Make sure to read a site's `robots.txt` before initiating a web scraping project. 
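
Python's standard library can interpret `robots.txt` rules for you. The sketch below feeds a made-up rule set to `urllib.robotparser.RobotFileParser` and asks whether two example URLs may be fetched. The rules and the `example.com` URLs are hypothetical; a real project would load the target site's own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(robots_lines)

# Ask whether a generic crawler ('*') may fetch each URL
allowed_home = rp.can_fetch('*', 'https://example.com/index.html')
allowed_private = rp.can_fetch('*', 'https://example.com/private/data.html')

print(allowed_home)
print(allowed_private)
```

Passing this check does not settle every ethical or legal question, so also review the site's terms of use and keep your request rate modest.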

In [44]:
import re #Regular expressions
from bs4 import BeautifulSoup # a python HTML parser
import requests

### Weather Data <a id=2A></a>

Let's focus on grabbing general weather data & forecasts

In [45]:
url = "https://forecast.weather.gov/MapClick.php?lat=38.89435000000003&lon=-77.07514989999999#.XFzl9s9KiCc"
r = requests.get(url)
r.status_code

200

In [46]:
#Let's make some soup
soup = BeautifulSoup(r.content, 'html.parser')

In [47]:
seven_day = soup.find(id="seven-day-forecast")

In [48]:
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    Fort Myer VA	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Overnight<br/><br/></p>
<p><img alt="Overnight: Partly cloudy, with a low around 70. Light south wind. " class="forecast-icon" src="newimages/medium/nsct.png" title="Overnight: Partly cloudy, with a low around 70. Light south wind. "/></p><p class="short-desc">Partly Cloudy</p><p class="temp temp-low">Low: 70 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tuesday<br/><br/></p>
<p><img alt="Tuesday: Mostly sunny and hot, with a high near 97. Southwest wind 5 to 8 mph. " class="forecast-icon" src="newimages/medium/hot.png" t

In [49]:
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Overnight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Overnight: Partly cloudy, with a low around 70. Light south wind. " class="forecast-icon" src="newimages/medium/nsct.png" title="Overnight: Partly cloudy, with a low around 70. Light south wind. "/>
 </p>
 <p class="short-desc">
  Partly Cloudy
 </p>
 <p class="temp temp-low">
  Low: 70 °F
 </p>
</div>


##### Extracting information from the page

As you can see, the forecast item `tonight` contains all the information we want. There are 4 pieces of information we can extract:

* The name of the forecast item (in this case, Tonight).
* The description of the conditions, which is stored in the `title` attribute of the `img` tag.
* A short description of the conditions.
* The low temperature.

We'll extract the name of the forecast item, the short description, and the temperature first, since they're all similar:

In [84]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Tonight
PatchyDrizzle andPatchy Fog
Low: 50 °F


Now, we can extract the `title` attribute from the `img` tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:

In [85]:
img = tonight.find("img")
desc = img['title']

print(desc)

Tonight: Patchy drizzle and fog.  Mostly cloudy, with a low around 50. Southeast wind around 6 mph becoming southwest after midnight. 


##### Extracting all the information from the page
Now that we know how to extract each individual piece of information, we can combine our knowledge with css selectors and list comprehensions to extract everything at once.

In the code below, we:

* Select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.
* Use a list comprehension to call the `get_text` method on each `BeautifulSoup` object.

In [86]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Tonight',
 'Friday',
 'FridayNight',
 'Saturday',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight']

In [87]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['PatchyDrizzle andPatchy Fog', 'Slight ChanceShowers thenSunny andBreezy', 'Mostly Clear', 'Sunny', 'Mostly Clear', 'Mostly Sunny', 'Rain/SnowLikely', 'ChanceRain/Snow', 'Rain/SnowLikely']
['Low: 50 °F', 'High: 57 °F⇓', 'Low: 25 °F', 'High: 36 °F', 'Low: 23 °F', 'High: 40 °F', 'Low: 32 °F', 'High: 41 °F', 'Low: 32 °F']
['Tonight: Patchy drizzle and fog.  Mostly cloudy, with a low around 50. Southeast wind around 6 mph becoming southwest after midnight. ', 'Friday: A slight chance of showers before noon.  Mostly cloudy, then gradually becoming sunny, with a temperature rising to near 57 by 9am, then falling to around 47 during the remainder of the day. Breezy, with a west wind 8 to 13 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 34 mph.  Chance of precipitation is 20%.', 'Friday Night: Mostly clear, with a low around 25. Northwest wind 14 to 16 mph, with gusts as high as 24 mph. ', 'Saturday: Sunny, with a high near 36. Northwest wind 10 to 14 mph, with 
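
Values such as `FridayNight` and `PatchyDrizzle andPatchy Fog` run words together, most likely because `<br/>` tags in the markup contribute no text of their own. A small sketch on a made-up HTML fragment shows how the `separator` and `strip` arguments of `get_text` can keep the words apart:

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking a period name split across a line break
html = '<p class="period-name">Friday<br/>Night</p>'
tag = BeautifulSoup(html, 'html.parser').find(class_='period-name')

joined = tag.get_text()                  # pieces are concatenated directly
spaced = tag.get_text(' ', strip=True)   # pieces are joined with a space

print(joined)
print(spaced)
```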

### Exercise
Combine all the newly scraped data and analyze it. In order to do this, we'll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary. Each dictionary key will become a column in the DataFrame, and each list will become the values in the column.

In [None]:
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })

In [None]:
weather.head()

### Analyzing Weather

In [None]:
# Use the Series.str.extract method to insert a regular expression to pull out numeric temperature values
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums
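
To see what the regular expression is doing without pandas, here is a rough equivalent using the standard `re` module on the temperature strings scraped above. The extra `kind` group, which captures whether a reading is a high or a low, is an addition for illustration.

```python
import re

# Temperature strings copied from the scraped output above
temps = ['Low: 50 °F', 'High: 57 °F⇓', 'Low: 25 °F', 'High: 36 °F', 'Low: 23 °F',
         'High: 40 °F', 'Low: 32 °F', 'High: 41 °F', 'Low: 32 °F']

# Named groups: 'kind' is High or Low, 'temp_num' is the run of digits
pattern = re.compile(r'(?P<kind>High|Low):\s*(?P<temp_num>-?\d+)')

parsed = []
for t in temps:
    m = pattern.search(t)
    parsed.append((m.group('kind'), int(m.group('temp_num'))))

mean_temp = sum(n for _, n in parsed) / len(parsed)
lows = [n for kind, n in parsed if kind == 'Low']

print(parsed[:2])
print(round(mean_temp, 2))
print(lows)
```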

In [None]:

# Find the mean of this week's temperature
weather["temp_num"].mean()

In [None]:
# Select rows that occur only at night
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

In [None]:
weather[is_night]

### UFO Sightings <a id="2B"></a>

In [None]:
r = requests.get("http://www.nuforc.org/webreports/ndxe201608.html")
b = BeautifulSoup(r.text, 'html.parser')
r.status_code

In [None]:
# Let's take a look at the first sighting
for tr in b.findAll('tr', attrs = {'valign':'TOP'})[:1]:
    # the findChildren method returns all children underneath it
    for child in tr.findChildren():
        print(child.text)

In [None]:
# OK, it's a bit messy. Let's clean it up.
# Looks like the first element is the date, the 4th is the city, the 6th is the state, the 8th is the shape (this one's blank),
# and the 13th is the summary

ufo_sightings = {
        'Date':[],
        'City':[],
        'State':[],
        'Shape':[],
        'Summary':[]
    }

for tr in b.findAll('tr', attrs = {'valign':'TOP'}):
    # the findChildren method returns all children underneath it
    ufo_sighting_info = []
    for child in tr.findChildren():
        ufo_sighting_info.append(child.text)
    ufo_sightings['Date'].append(ufo_sighting_info[0])
    ufo_sightings['City'].append(ufo_sighting_info[3])
    ufo_sightings['State'].append(ufo_sighting_info[5])
    ufo_sightings['Shape'].append(ufo_sighting_info[7])
    ufo_sightings['Summary'].append(ufo_sighting_info[12])

pd.DataFrame(ufo_sightings).head()
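
Indexing by fixed position raises an `IndexError` if any row has fewer child elements than expected. One defensive variation is sketched below; the helper `row_to_record` is hypothetical, the positions are the ones used in the loop above, and the sample rows are made up rather than taken from the site.

```python
# Positions of each field within a row's list of child texts (from the loop above)
FIELD_POSITIONS = {'Date': 0, 'City': 3, 'State': 5, 'Shape': 7, 'Summary': 12}

def row_to_record(cells):
    """Map a list of cell strings to a record, or None if the row is too short."""
    if len(cells) <= max(FIELD_POSITIONS.values()):
        return None
    return {field: cells[pos].strip() for field, pos in FIELD_POSITIONS.items()}

# Made-up rows standing in for the text of each child element
good_row = (['8/31/16 23:00'] + [''] * 2 + ['Springfield', '', 'IL', '', 'Light']
            + [''] * 4 + ['Bright light moving fast.'])
short_row = ['8/31/16 23:00', 'Springfield']

records = [rec for rec in (row_to_record(r) for r in (good_row, short_row)) if rec]
print(records)
```

A list of records like this can be passed straight to `pd.DataFrame`, and malformed rows are skipped instead of stopping the scrape.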