<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Using APIs

_Authors: Dave Yerrington (SF), Sam Stack (DC)_

---

In this lab, we'll practice using some popular APIs to retrieve and store data.

In [1]:
# Imports at the top.
import json
import urllib
import pandas as pd
import numpy as np
import requests
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1: Get Data From Sheetsu

---

[Sheetsu](https://sheetsu.com/) is an online service that allows you to access any Google spreadsheet from an API. This can be a powerful way to share a data set with colleagues, as well as create mini, centralized data storage that is simpler to edit than a database.

A Google spreadsheet with wine data can be found [here](https://docs.google.com/spreadsheets/d/1pBwap3K4Blwbx3Su07HAxxZCyy0lOGAiwBrUIvuDbsE).

It can be accessed through the Sheetsu API at this endpoint: https://sheetsu.com/apis/v1.0/1a4050d2ae98.

**Questions:**

1) Use the `requests` library to access the document. Inspect the response text. What kind of data is it?
2) Check the status code of the response object. What code is it?
3) Use the appropriate libraries and read functions to read the response into a Pandas DataFrame.
4) Once you've imported the data into a DataFrame, check the value of the fifth row. What's the price?

In [2]:
# You can either post or get information from this API.
api_base_url = 'https://sheetsu.com/apis/v1.0/1a4050d2ae98'

In [3]:
# What kind of data is this returning?
api_response = requests.get(api_base_url)
api_response.text[:100]

'[{"Color":"W","Region":"Portugal","Country":"Portugal","Vintage":"2013","Vinyard":"Vinho Verde","Nam'

In [4]:
api_response.headers

{'Server': 'nginx', 'Date': 'Tue, 09 Oct 2018 01:13:43 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'ETag': 'W/"f1727cc148169dbcf4d0a8e58d1d6c32"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'X-Request-Id': '2fd5f90a-8b71-4264-93f1-afb16bb2a9ec', 'X-Runtime': '1.893247', 'Vary': 'Origin'}

In [5]:
# 1) It's a JSON string.

In [6]:
reponse = json.loads(api_response.text)

In [7]:
type(reponse)

list

In [8]:
reponse

[{'Color': 'W',
  'Region': 'Portugal',
  'Country': 'Portugal',
  'Vintage': '2013',
  'Vinyard': 'Vinho Verde',
  'Name': '',
  'Grape': '',
  'Consumed In': '2015',
  'Score': '4',
  'Price': ''},
 {'Color': 'W',
  'Region': 'France',
  'Country': 'France',
  'Vintage': '2013',
  'Vinyard': 'Peyruchet',
  'Name': '',
  'Grape': '',
  'Consumed In': '2015',
  'Score': '3',
  'Price': '17.8'},
 {'Color': 'W',
  'Region': 'Oregon',
  'Country': 'Oregon',
  'Vintage': '2013',
  'Vinyard': 'Abacela',
  'Name': '',
  'Grape': '',
  'Consumed In': '2015',
  'Score': '3',
  'Price': '20'},
 {'Color': 'W',
  'Region': 'Spain',
  'Country': 'Spain',
  'Vintage': '2012',
  'Vinyard': 'Ochoa',
  'Name': '',
  'Grape': 'chardonay',
  'Consumed In': '2015',
  'Score': '2.5',
  'Price': '7'},
 {'Color': 'R',
  'Region': '',
  'Country': 'US',
  'Vintage': '2012',
  'Vinyard': 'Heartland',
  'Name': 'Spice Trader',
  'Grape': 'chiraz, cab',
  'Consumed In': '2015',
  'Score': '3',
  'Price': '6'},


In [9]:
api_response.status_code

200

In [10]:
# 2) The response code is 200.

In [11]:
wine_df = pd.DataFrame(reponse)
wine_df.head()

|   | Color | Consumed In | Country | Grape | Name | Price | Region | Score | Vintage | Vinyard |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | W | 2015 | Portugal |  |  |  | Portugal | 4.0 | 2013 | Vinho Verde |
| 1 | W | 2015 | France |  |  | 17.8 | France | 3.0 | 2013 | Peyruchet |
| 2 | W | 2015 | Oregon |  |  | 20.0 | Oregon | 3.0 | 2013 | Abacela |
| 3 | W | 2015 | Spain | chardonay |  | 7.0 | Spain | 2.5 | 2012 | Ochoa |
| 4 | R | 2015 | US | chiraz, cab | Spice Trader | 6.0 |  | 3.0 | 2012 | Heartland |


In [12]:
# Alternatively:
wine_df = pd.read_json(api_response.text)
wine_df.head(2)

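|   | Color | Consumed In | Country | Grape | Name | Price | Region | Score | Vintage | Vinyard |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | W | 2015 | Portugal |  |  |  | Portugal | 4.0 | 2013 | Vinho Verde |
| 1 | W | 2015 | France |  |  | 17.8 | France | 3.0 | 2013 | Peyruchet |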


In [13]:
wine_df.iloc[4, :]
# The price for the fifth row is six. 

Color                     R
Consumed In            2015
Country                  US
Grape           chiraz, cab
Name           Spice Trader
Price                     6
Region                     
Score                     3
Vintage                2012
Vinyard           Heartland
Name: 4, dtype: object
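
The same steps can be traced without pandas. The following is a minimal sketch, assuming a hard-coded excerpt of the response (`sample_text`, a hypothetical name, trimmed to a few fields) rather than a live request. It illustrates that the JSON array parses to a list of dictionaries and that every value arrives as a string, so a price has to be converted before any arithmetic.

```python
import json

# Hypothetical excerpt of the API response text: the first and fifth records
# shown above, trimmed to a few fields.
sample_text = (
    '[{"Color": "W", "Country": "Portugal", "Score": "4", "Price": ""},'
    ' {"Color": "R", "Country": "US", "Score": "3", "Price": "6"}]'
)

records = json.loads(sample_text)  # a JSON array becomes a list of dicts


def to_float(value):
    """Convert a numeric string to float; return None for an empty string."""
    return float(value) if value != "" else None


prices = [to_float(record["Price"]) for record in records]
print(type(records).__name__, prices)
```

The helper `to_float` is an illustration only; in pandas, `pd.to_numeric` with `errors='coerce'` plays a similar role for a whole column.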

## Exercise 2: IMDb TV Shows

---

Sometimes an API doesn't provide all of the information we'd like and we need to get creative.

Here we'll use a combination of scraping and API calls to find the ratings and networks of famous television shows.

### 2.A) Get the Top TV Shows

IMDb contains data about movies and TV shows. Unfortunately, it doesn't have a public API.

The page http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 contains the list of the top 250 television shows of all time. Retrieve the page using the `requests` library and then parse the HTML to obtain a list of the `television_ids` for these shows. You can parse it with a regular expression or with a library like `BeautifulSoup`.

> **Hint:** television_ids look like this: `tt2582802`.
> _Everything after "/title/" and before "/?"_

In [14]:
response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
html = response.text

In [15]:
response.headers

{'Server': 'Server', 'Date': 'Tue, 09 Oct 2018 01:13:59 GMT', 'Content-Type': 'text/html;charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Security-Policy': "frame-ancestors 'self' imdb.com *.imdb.com *.media-imdb.com withoutabox.com *.withoutabox.com amazon.com *.amazon.com amazon.co.uk *.amazon.co.uk amazon.de *.amazon.de translate.google.com images.google.com www.google.com www.google.co.uk search.aol.com bing.com www.bing.com", 'Content-Language': 'en-US', 'Set-Cookie': 'uu=BCYpZA_amB7sw4QBrKJIn_HqjNdDxXJx3YSVGHn9hQjLWv69vSIqVX0FPRmTe4JWFgduq-bEOZ-f%0D%0AGzjRezJjsglME1F5Y5r_YodJa45x5JEiCstYRBYhH8qWhFrkIo4VMGQ63ZCekQZWZS9Tet-Fdd0f%0D%0AiXnfPXZrnIE42tre40iSwTwDasn3paX3poCeTnsiFzaGbUEaXyDlL62_dJbAhAbn1879dEEjxXyL%0D%0AtdO9Lxl_SMzNhXEQR9FnPil8fsHNrkXtaf358VGbm5OzByo1GCArJw%0D%0A; Domain=.imdb.com; Expires=Sun, 27-Oct-2086 04:28:06 GMT; Path=/; Secure, session-id=000-0000000-0000000; Domain=.imdb.com; Expires=Sun, 27-Oc

In [16]:
from bs4 import BeautifulSoup


In [17]:
# Solution using BeautifulSoup.

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    if '/title/' in a.get('href', ''):
        print(a.get('href').split('/')[2])

tt0266697
tt0266697
tt5491994
tt5491994
tt0185906
tt0185906
tt0795176
tt0795176
tt0944947
tt0944947
tt0903747
tt0903747
tt0306414
tt0306414
tt2395695
tt2395695
tt2861424
tt2861424
tt0081846
tt0081846
tt6769208
tt6769208
tt0141842
tt0141842
tt0071075
tt0071075
tt0417299
tt0417299
tt1533395
tt1533395
tt1475582
tt1475582
tt1806234
tt1806234
tt0052520
tt0052520
tt1877514
tt1877514
tt0092337
tt0092337
tt0098769
tt0098769
tt0303461
tt0303461
tt2356777
tt2356777
tt1355642
tt1355642
tt1508238
tt1508238
tt2802850
tt2802850
tt3530232
tt3530232
tt0103359
tt0103359
tt0877057
tt0877057
tt0296310
tt0296310
tt0213338
tt0213338
tt2092588
tt2092588
tt4508902
tt4508902
tt2085059
tt2085059
tt0112130
tt0112130
tt0063929
tt0063929
tt0081834
tt0081834
tt2571774
tt2571774
tt0108778
tt0108778
tt4574334
tt4574334
tt0367279
tt0367279
tt1856010
tt1856010
tt0098904
tt0098904
tt0081912
tt0081912
tt0475784
tt0475784
tt3718778
tt3718778
tt0098936
tt0098936
tt1865718
tt1865718
tt2707408
tt2707408
tt0193676
tt0193676


In [18]:
def get_top_250():
    response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
    html = response.text
    # Non-greedy match: capture everything after "/title/" up to the next forward slash in each <a href> element.
    entries = re.findall("<a href.*?/title/(.*?)/", html)
    # Deduplicate the IDs (each show is linked more than once). Note that set() does not preserve the chart order.
    return list(set(entries))

In [19]:
entries = get_top_250()

In [20]:
len(entries)

251

In [21]:
entries[0]

'tt0417373'
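
Wrapping the matches in `set` removes duplicates but also discards the chart order, so `entries[0]` is not necessarily the top-ranked show. Below is a minimal sketch of an order-preserving alternative. The HTML snippet (`sample_html`) is made up for illustration and is not IMDb's actual markup, and the stricter pattern (`tt` followed by digits) is an assumption based on the hint above.

```python
import re

# Made-up snippet that mimics the repeated links on the chart page; not real IMDb markup.
sample_html = '''
<a href="/title/tt0944947/?ref_=example_i_1"><img src="poster.jpg"></a>
<a href="/title/tt0944947/?ref_=example_t_1">Game of Thrones</a>
<a href="/title/tt0903747/?ref_=example_i_2"><img src="poster.jpg"></a>
<a href="/title/tt0903747/?ref_=example_t_2">Another show</a>
<a href="/chart/top/">Some other link</a>
'''

# Match only IDs of the form "tt" + digits that sit between "/title/" and the next "/".
matches = re.findall(r'/title/(tt\d+)/', sample_html)

# dict.fromkeys keeps the first occurrence of each key in insertion order (Python 3.7+).
ordered_ids = list(dict.fromkeys(matches))
print(ordered_ids)
```

Restricting the capture group to `tt\d+` also keeps non-title links out of the result, which a looser `(.*?)` group might let through.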

### 2.B) Get Data on the Top TV Shows

Although IMDb doesn't have a public API, an open API exists at http://www.tvmaze.com/api.

Use this API to retrieve information about each of the 250 TV shows you extracted in the previous step.
1) Check the documentation of TVmaze's API to learn how to request show data by ID.
2) Define a function that returns a Python object with select information for a given ID:
    - Show name.
    - Rating (avg).
    - Genre(s).
    - Network name.
    - Premiere date.
    - Status.
3) Store the gathered information in a Pandas DataFrame.

> Tip: The JSON object can easily be converted into a Python dictionary.

Because the target information is in a JSON format, you'll need `json.loads(res.text)` in order to gather it.

Here's an example of the information and how we can interact with it:

In [23]:
# Example URL.
res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=tt0944947')

# Status code.
print(res.status_code)

# Parse the JSON once and reuse the resulting dictionary.
show = json.loads(res.text)

# Just the contents of the name element.
print(show.get('name'))

# The entire contents.
print(show)

200
Game of Thrones
{'id': 82, 'url': 'http://www.tvmaze.com/shows/82/game-of-thrones', 'name': 'Game of Thrones', 'type': 'Scripted', 'language': 'English', 'genres': ['Drama', 'Adventure', 'Fantasy'], 'status': 'Running', 'runtime': 60, 'premiered': '2011-04-17', 'officialSite': 'http://www.hbo.com/game-of-thrones', 'schedule': {'time': '21:00', 'days': ['Sunday']}, 'rating': {'average': 9.4}, 'weight': 99, 'network': {'id': 8, 'name': 'HBO', 'country': {'name': 'United States', 'code': 'US', 'timezone': 'America/New_York'}}, 'webChannel': {'id': 22, 'name': 'HBO Go', 'country': {'name': 'United States', 'code': 'US', 'timezone': 'America/New_York'}}, 'externals': {'tvrage': 24493, 'thetvdb': 121361, 'imdb': 'tt0944947'}, 'image': {'medium': 'http://static.tvmaze.com/uploads/images/medium_portrait/143/359013.jpg', 'original': 'http://static.tvmaze.com/uploads/images/original_untouched/143/359013.jpg'}, 'summary': '<p>Based on the bestselling book series <i>A Song of Ice and Fire</i> 

In [24]:
# Function to pull information from the API using the dictionary .get() method.
def get_entry(entry):
    res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + entry)
    if res.status_code == 200:
        # Parse the JSON once instead of once per field.
        results = json.loads(res.text)

        # .get() returns None for a missing top-level key, so these lookups cannot raise.
        status = results.get('status')
        title = results.get('name')
        premier = results.get('premiered')
        genres = results.get('genres')

        # Nested lookups raise AttributeError when the outer value is None.
        try:
            rating = results.get('rating').get('average')
        except AttributeError:
            rating = 'NA'

        try:
            network = results.get('network').get('name')
        except AttributeError:
            network = 'NA'

        # Append a row in the same order as the shows_df columns.
        shows_df.loc[len(shows_df)] = [title, rating, genres, network, premier, status]

In [25]:
# Function to pull information from the API by converting the JSON into a Python dictionary.
def get_entry(entry):
    print(entry)
    res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + entry)
    if res.status_code == 200:
        results = json.loads(res.text)

        # KeyError covers a missing key; TypeError covers indexing into a None value.
        try:
            status = results['status']
        except (TypeError, KeyError):
            status = 'NA'
        try:
            rating = results['rating']['average']
        except (TypeError, KeyError):
            rating = 'NA'
        try:
            network = results['network']['name']
        except (TypeError, KeyError):
            network = 'NA'
        try:
            title = results['name']
        except (TypeError, KeyError):
            title = 'NA'
        try:
            genres = results['genres']
        except (TypeError, KeyError):
            genres = 'NA'
        try:
            premier = results['premiered']
        except (TypeError, KeyError):
            premier = 'NA'
        # Append a row in the same order as the shows_df columns.
        shows_df.loc[len(shows_df)] = [title, rating, genres, network, premier, status]

In [26]:
# In both functions, we look up specific elements. If an element is missing or null, a nested lookup raises an error,
# hence the try and except statements.
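
One way to avoid repeating try/except blocks is to walk the nested keys with a small helper and collect each show as a dictionary. The sketch below is one possible approach, not the lab's solution. `dig`, `extract_fields`, `sample_show`, and `no_network_show` are hypothetical names; the first payload is a trimmed copy of the Game of Thrones response above, and the second is an invented variant with a null `network`.

```python
def dig(data, *keys, default='NA'):
    """Follow keys through nested dicts; return default if any step is missing or None."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current


def extract_fields(show):
    """Pick the six fields used in the lab from one parsed show payload."""
    return {
        'show_name': dig(show, 'name'),
        'rating_avg': dig(show, 'rating', 'average'),
        'genres': dig(show, 'genres'),
        'network': dig(show, 'network', 'name'),
        'premiere_date': dig(show, 'premiered'),
        'status': dig(show, 'status'),
    }


# Trimmed copy of the example response above.
sample_show = {
    'name': 'Game of Thrones',
    'genres': ['Drama', 'Adventure', 'Fantasy'],
    'status': 'Running',
    'premiered': '2011-04-17',
    'rating': {'average': 9.4},
    'network': {'id': 8, 'name': 'HBO'},
}
# Invented variant with a null network, to exercise the fallback.
no_network_show = dict(sample_show, network=None)

rows = [extract_fields(sample_show), extract_fields(no_network_show)]
print(rows[0]['network'], rows[1]['network'])
```

A list of such dictionaries can be passed directly to `pd.DataFrame(rows)`, which builds the whole table in one step instead of appending rows one at a time.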

In [27]:
shows_df = pd.DataFrame(columns=['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])

for entry in entries[:30]:
    get_entry(entry)

tt0417373
tt3322312
tt0098825
tt1474684
tt5071412
tt0056751
tt6478318
tt0187664
tt0061287
tt1227926
tt0163507
tt4093826
tt0460681
tt0214341
tt0088484
tt0795176
tt0417349
tt1305826
tt7259746
tt4082744
tt0388629
tt6468322
tt0319969
tt0182629
tt2356777
tt0280249
tt1355642
tt4158110
tt0092455
tt0995832


In [29]:
shows_df

|   | show_name | rating_avg | genres | network | premiere_date | status |
|---|---|---|---|---|---|---|
| 0 | The Venture Bros. | 9.0 | [Comedy, Adventure] | Adult Swim | 2004-08-07 | Running |
| 1 | Marvel's Daredevil | 8.4 | [Drama, Action, Crime] |  | 2015-04-10 | Running |
| 2 | House of Cards | 8.3 | [Drama] | BBC One | 1990-11-18 | Ended |
| 3 | Luther | 8.8 | [Drama, Crime, Mystery] | BBC One | 2010-05-04 | Running |
| 4 | Ozark | 8.4 | [Drama, Crime, Thriller] |  | 2017-07-21 | To Be Determined |
| 5 | Doctor Who | 9.0 | [Action, Adventure, Science-Fiction] | BBC One | 1963-11-23 | Ended |
| 6 | Masum |  | [Drama, Crime, Mystery] |  | 2017-01-26 | Ended |
| 7 | Spaced | 9.5 | [Comedy] | Channel 4 | 1999-09-24 | Ended |
| 8 | The Prisoner | 9.3 | [Science-Fiction, Mystery] | ITV | 1967-10-01 | Ended |
| 9 | Dr. Horrible's Sing-Along Blog | 8.4 | [Comedy, Music, Science-Fiction] |  | 2008-07-15 | Ended |
