<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Using APIs

_Authors: Dave Yerrington (SF), Sam Stack (DC)_

---

In this lab, we'll practice using some popular APIs to retrieve and store data.

In [1]:
# Imports at the top.
import json
import urllib
import pandas as pd
import numpy as np
import requests
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1: Get Data From Sheetsu

---

[Sheetsu](https://sheetsu.com/) is an online service that allows you to access any Google spreadsheet from an API. This can be a powerful way to share a data set with colleagues, as well as create mini, centralized data storage that is simpler to edit than a database.

A Google spreadsheet with wine data can be found [here](https://docs.google.com/spreadsheets/d/1pBwap3K4Blwbx3Su07HAxxZCyy0lOGAiwBrUIvuDbsE).

It can be accessed through the Sheetsu API at this endpoint: https://sheetsu.com/apis/v1.0/1a4050d2ae98.

**Questions:**

1) Use the `requests` library to access the document. Inspect the response text. What kind of data is it?
- Check the status code of the response object. What code is it?
- Use the appropriate libraries and read functions to read the response into a Pandas DataFrame.
- Once you've imported the data into a DataFrame, check the value of the fifth line. What's the price?

In [2]:
# You can either post or get information from this API.
api_base_url = 'https://sheetsu.com/apis/v1.0/1a4050d2ae98'

In [3]:
# What kind of data is this returning?
api_response = requests.get(api_base_url)
api_response.text[:100]

u'[{"Color":"W","Region":"Portugal","Country":"Portugal","Vintage":"2013","Vinyard":"Vinho Verde","Nam'

In [4]:
api_response.headers

{'X-Request-Id': '8c78c3e3-541e-4c21-884f-7dda801888de', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'Transfer-Encoding': 'chunked', 'Vary': 'Origin', 'X-Runtime': '2.318054', 'Server': 'nginx', 'Connection': 'keep-alive', 'ETag': 'W/"f1727cc148169dbcf4d0a8e58d1d6c32"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Date': 'Tue, 03 Jul 2018 06:13:52 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Type': 'application/json; charset=utf-8'}

In [5]:
# Answer: it's a JSON string (the Content-Type header is application/json).

In [6]:
reponse = json.loads(api_response.text)

In [7]:
type(reponse)

list

In [8]:
reponse

[{u'Color': u'W',
  u'Consumed In': u'2015',
  u'Country': u'Portugal',
  u'Grape': u'',
  u'Name': u'',
  u'Price': u'',
  u'Region': u'Portugal',
  u'Score': u'4',
  u'Vintage': u'2013',
  u'Vinyard': u'Vinho Verde'},
 {u'Color': u'W',
  u'Consumed In': u'2015',
  u'Country': u'France',
  u'Grape': u'',
  u'Name': u'',
  u'Price': u'17.8',
  u'Region': u'France',
  u'Score': u'3',
  u'Vintage': u'2013',
  u'Vinyard': u'Peyruchet'},
 {u'Color': u'W',
  u'Consumed In': u'2015',
  u'Country': u'Oregon',
  u'Grape': u'',
  u'Name': u'',
  u'Price': u'20',
  u'Region': u'Oregon',
  u'Score': u'3',
  u'Vintage': u'2013',
  u'Vinyard': u'Abacela'},
 {u'Color': u'W',
  u'Consumed In': u'2015',
  u'Country': u'Spain',
  u'Grape': u'chardonay',
  u'Name': u'',
  u'Price': u'7',
  u'Region': u'Spain',
  u'Score': u'2.5',
  u'Vintage': u'2012',
  u'Vinyard': u'Ochoa'},
 {u'Color': u'R',
  u'Consumed In': u'2015',
  u'Country': u'US',
  u'Grape': u'chiraz, cab',
  u'Name': u'Spice Trader',
  u'Pr

In [9]:
api_response.status_code

200

In [10]:
# Answer: the response status code is 200 (OK), so the request succeeded.
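
Status codes other than 200 come up as soon as an endpoint moves or a request is malformed. As a minimal sketch (assuming Python 3, where the standard library ships `http.HTTPStatus`), the hypothetical helper `describe_status` below turns a numeric code into a readable label. It is an illustration only, not part of the lab's solution.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '200 OK (success)' for a known HTTP status code."""
    # HTTPStatus raises ValueError for codes the standard library does not know.
    status = HTTPStatus(code)
    if code < 200:
        category = "informational"
    elif code < 300:
        category = "success"
    elif code < 400:
        category = "redirection"
    elif code < 500:
        category = "client error"
    else:
        category = "server error"
    return "{} {} ({})".format(code, status.phrase, category)


for code in (200, 404, 500):
    print(describe_status(code))
```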

In [11]:
wine_df = pd.DataFrame(reponse)
wine_df.head()

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3.0,2013,Peyruchet
2,W,2015,Oregon,,,20.0,Oregon,3.0,2013,Abacela
3,W,2015,Spain,chardonay,,7.0,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6.0,,3.0,2012,Heartland


In [12]:
# Alternatively:
wine_df = pd.read_json(api_response.text)
wine_df.head(2)

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3.0,2013,Peyruchet


In [13]:
wine_df.iloc[4, :]
# The price for the fifth row is six. 

Color                     R
Consumed In            2015
Country                  US
Grape           chiraz, cab
Name           Spice Trader
Price                     6
Region                     
Score                     3
Vintage                2012
Vinyard           Heartland
Name: 4, dtype: object
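
The raw response shown earlier stores every value as a string, including `Price` and `Score`, and leaves missing prices as empty strings. The sketch below shows one way to convert such columns to numbers with `pd.to_numeric`. It uses two records copied from that response (trimmed to three fields), so it runs without calling the API. The names `sample_text` and `sample_df` are made up for this example.

```python
import json

import pandas as pd

# Two records copied from the response shown earlier (fields trimmed for brevity).
sample_text = (
    '[{"Vinyard": "Vinho Verde", "Price": "", "Score": "4"},'
    ' {"Vinyard": "Peyruchet", "Price": "17.8", "Score": "3"}]'
)

sample_df = pd.DataFrame(json.loads(sample_text))

# errors="coerce" turns values that cannot be parsed (such as "") into NaN.
for column in ["Price", "Score"]:
    sample_df[column] = pd.to_numeric(sample_df[column], errors="coerce")

print(sample_df.dtypes)
print(sample_df)
```

With numeric columns, operations such as `sample_df["Price"].mean()` skip the missing value instead of failing on a string.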

## Exercise 2: IMDb TV Shows

---

Sometimes an API doesn't provide all of the information we'd like and we need to get creative.

Here we'll use a combination of scraping and API calls to find the ratings and networks of famous television shows.

### 2.A) Get the Top TV Shows

IMDb contains data about movies and TV shows. Unfortunately, it doesn't have a public API.

The page http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 contains the list of the top 250 television shows of all time. Retrieve the page using the `requests` library and then parse the HTML to obtain a list of the `television_ids` for these shows. You can parse it with a regular expression or with a library such as `BeautifulSoup`.

> **Hint:** television_ids look like this: `tt2582802`.
> _Everything after "/title/" and before "/?"_
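
To make the hint concrete, here is a small sketch of the regular-expression approach. It runs on a hand-written HTML snippet rather than the live page. The markup, class names, and `ref_` values in `sample_html` are assumptions for illustration, and the real page may be structured differently.

```python
import re

# Hypothetical snippet imitating the chart's links; the real page markup may differ.
sample_html = '''
<td class="titleColumn"><a href="/title/tt2582802/?ref_=example_1">Show A</a></td>
<td class="posterColumn"><a href="/title/tt2582802/?ref_=example_2"><img src="a.jpg"></a></td>
<td class="titleColumn"><a href="/title/tt0944947/?ref_=example_3">Show B</a></td>
<a href="/chart/top/">Another chart</a>
'''

# "tt" followed by digits, captured between "/title/" and the next "/".
ID_PATTERN = re.compile(r'/title/(tt\d+)/')

television_ids = ID_PATTERN.findall(sample_html)
print(television_ids)
```

Requiring `tt\d+` rather than matching any characters keeps unrelated links out of the result. Duplicates still appear when a show is linked more than once.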

In [14]:
response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
html = response.text

In [15]:
response.headers

{'Content-Language': 'en-US', 'Content-Security-Policy': "frame-ancestors 'self' imdb.com *.imdb.com *.media-imdb.com withoutabox.com *.withoutabox.com amazon.com *.amazon.com amazon.co.uk *.amazon.co.uk amazon.de *.amazon.de translate.google.com images.google.com www.google.com www.google.co.uk search.aol.com bing.com www.bing.com", 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'uu=BCYvWktpybrIabiW03YGotRIE75MYWJiKyIaWu1dvJ5Qu3dlDISajyeiqIEUz1__8PtTIizK8E8p%0D%0AXtWxGSSEi3rcJJ9wmyp5l3XsHc9sd3JTzh-yyw2HSuUdjHJ2osrzB3UvclYWCdGitdVyHmRU2vyk%0D%0AJmAkFfVZ8HMORjTSBO3ffpRSZchZ6ipVJ32fLWIOmHy6r4BsckCXs-1x7ARUiMTTapy5GTUB-qkp%0D%0AUP1KWsw6yc95rXwG8ORRBkpt0504EqMzRY-MCcHSBAiYYR-Z2VlRxg%0D%0A; Domain=.imdb.com; Expires=Sun, 21-Jul-2086 09:28:34 GMT; Path=/; Secure, session-id=000-0000000-0000000; Domain=.imdb.com; Expires=Sun, 21-Jul-2086 09:28:34 GMT; Path=/; Secure, session-id-time=2161318466; Domain=.imdb.com; Expires=Sun, 21-Jul-2086 09:28:34 GMT; Path=/; Secure'

In [16]:
from bs4 import BeautifulSoup


In [17]:
# Solution using BeautifulSoup.

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    # The ID is the path segment right after "/title/", e.g. /title/tt0245429/...
    if '/title/' in a.get('href', ''):
        print(a.get('href').split('/')[2])

tt0245429
tt0245429
tt5491994
tt5491994
tt0185906
tt0185906
tt0795176
tt0795176
tt0944947
tt0944947
tt0903747
tt0903747
tt0306414
tt0306414
tt2395695
tt2395695
tt2861424
tt2861424
tt0081846
tt0081846
tt0141842
tt0141842
tt0071075
tt0071075
tt0417299
tt0417299
tt1533395
tt1533395
tt6769208
tt6769208
tt1475582
tt1475582
tt1806234
tt1806234
tt0052520
tt0052520
tt0098769
tt0098769
tt0092337
tt0092337
tt0303461
tt0303461
tt2356777
tt2356777
tt1355642
tt1355642
tt3530232
tt3530232
tt2802850
tt2802850
tt0103359
tt0103359
tt0877057
tt0877057
tt0296310
tt0296310
tt0213338
tt0213338
tt4508902
tt4508902
tt2085059
tt2085059
tt0063929
tt0063929
tt0112130
tt0112130
tt0081834
tt0081834
tt2571774
tt2571774
tt2092588
tt2092588
tt4574334
tt4574334
tt0367279
tt0367279
tt0475784
tt0475784
tt0108778
tt0108778
tt1856010
tt1856010
tt7221388
tt7221388
tt0098904
tt0098904
tt0081912
tt0081912
tt3718778
tt3718778
tt0098936
tt0098936
tt2707408
tt2707408
tt1865718
tt1865718
tt0193676
tt0193676
tt0074006
tt0074006


In [18]:
def get_top_250():
    response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
    html = response.text
    # Non-greedy match: capture everything after "/title/" up to the next forward slash in each <a href> element.
    entries = re.findall("<a href.*?/title/(.*?)/", html)
    # Deduplicate the IDs with a set (note that this discards the chart order).
    return list(set(entries))

In [19]:
entries = get_top_250()

In [20]:
len(entries)

251

In [21]:
entries[0]

u'tt2674806'
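
`list(set(entries))` in `get_top_250` removes the duplicate IDs, but a set has no defined order, so the chart ranking is lost. If the ranking matters, one alternative is to drop duplicates while keeping the first occurrence of each ID. The helper name `unique_in_order` is made up for this sketch.

```python
from collections import OrderedDict


def unique_in_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each, in order."""
    return list(OrderedDict.fromkeys(ids))


# The first few IDs printed by the BeautifulSoup solution, each appearing twice.
scraped = ['tt0245429', 'tt0245429', 'tt5491994', 'tt5491994', 'tt0185906', 'tt0185906']
print(unique_in_order(scraped))
```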

### 2.B) Get Data on the Top TV Shows

Although IMDb doesn't have a public API, an open API exists at http://www.tvmaze.com/api.

Use this API to retrieve information about each of the 250 TV shows you extracted in the previous step.
1) Check the documentation of TVmaze's API to learn how to request show data by ID.
- Define a function that returns a Python object with select information for a given ID.
    - Show name.
    - Rating (avg).
    - Genre(s).
    - Network name.
    - Premiere date.
    - Status.
> Tip: The JSON object can easily be converted into a Python dictionary.

- Store the gathered information in a Pandas DataFrame.

Because the target information is in a JSON format, you'll need `json.loads(res.text)` in order to gather it.

Here's an example of the information and how we can interact with it:

In [22]:
# Example URL.
res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=tt0944947')

# Status code.
print(res.status_code)

# Parse the JSON body once and reuse the resulting dictionary.
show = json.loads(res.text)

# Just the contents of the name element.
print(show.get('name'))

# The entire contents.
print(show)

200
Game of Thrones
{u'status': u'Running', u'rating': {u'average': 9.4}, u'genres': [u'Drama', u'Adventure', u'Fantasy'], u'weight': 99, u'updated': 1529403380, u'name': u'Game of Thrones', u'language': u'English', u'schedule': {u'days': [u'Sunday'], u'time': u'21:00'}, u'url': u'http://www.tvmaze.com/shows/82/game-of-thrones', u'officialSite': u'http://www.hbo.com/game-of-thrones', u'externals': {u'thetvdb': 121361, u'tvrage': 24493, u'imdb': u'tt0944947'}, u'premiered': u'2011-04-17', u'summary': u'<p>Based on the bestselling book series <i>A Song of Ice and Fire</i> by George R.R. Martin, this sprawling new HBO drama is set in a world where summers span decades and winters can last a lifetime. From the scheming south and the savage eastern lands, to the frozen north and ancient Wall that protects the realm from the mysterious darkness beyond, the powerful families of the Seven Kingdoms are locked in a battle for the Iron Throne. This is a story of duplicity and treachery, nobility 

In [23]:
# Version 1: pull information from the API, reading each field with dictionary .get() calls.
def get_entry(entry):
    res=requests.get('http://api.tvmaze.com/lookup/shows?imdb='+entry)
    if res.status_code == 200:
        try:
            status = json.loads(res.text).get('status')
        except AttributeError:
            status = 'NA'
        try: 
            rating = json.loads(res.text).get('rating').get('average')
        except AttributeError:
            rating = 'NA'
            
        try:
            network = json.loads(res.text).get('network').get('name')
        except AttributeError:
            network = 'NA'
            
        try:
            title = json.loads(res.text).get('name')
        except AttributeError:
            title = 'NA'
            
        try:
            premier = json.loads(res.text).get('premiered')
        except AttributeError:
            premier = 'NA'
            
        try:
            genres = json.loads(res.text).get('genres')
        except AttributeError:
            genres = 'NA'

        # Append the gathered values as a new row of the global shows_df DataFrame.
        shows_df.loc[len(shows_df)] = [title, rating, genres, network, premier, status]

In [24]:
# Version 2: parse the JSON once into a Python dictionary and index it directly.
# Note that this definition replaces the get_entry defined in the previous cell.
def get_entry(entry):
    print(entry)
    res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + entry)
    if res.status_code == 200:
        results = json.loads(res.text)

        # KeyError covers a missing key; TypeError covers a null parent (e.g. network is None).
        try:
            status = results['status']
        except (KeyError, TypeError):
            status = 'NA'
        try:
            rating = results['rating']['average']
        except (KeyError, TypeError):
            rating = 'NA'
        try:
            network = results['network']['name']
        except (KeyError, TypeError):
            network = 'NA'
        try:
            title = results['name']
        except (KeyError, TypeError):
            title = 'NA'
        try:
            genres = results['genres']
        except (KeyError, TypeError):
            genres = 'NA'
        try:
            premier = results['premiered']
        except (KeyError, TypeError):
            premier = 'NA'
        # Append the gathered values as a new row of the global shows_df DataFrame.
        shows_df.loc[len(shows_df)] = [title, rating, genres, network, premier, status]

In [25]:
# In both functions, we look up specific elements. When an element is missing or null, the lookup raises an
# exception (AttributeError from calling .get() on None, KeyError or TypeError from indexing), hence the
# try and except statements.
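
The repeated try/except blocks can also be avoided by substituting an empty dictionary whenever a nested element is missing or null. The sketch below is an alternative, not the lab's solution. `extract_show_fields` is a hypothetical helper, it returns `None` instead of `'NA'` for missing values, and the sample dictionary is a trimmed copy of the Game of Thrones response shown earlier, with a made-up null `network` to exercise the missing-data path.

```python
def extract_show_fields(show):
    """Pick the fields of interest from a parsed TVmaze show dictionary."""
    # "or {}" replaces both a missing key and an explicit null with an empty dict.
    rating = show.get('rating') or {}
    network = show.get('network') or {}
    return {
        'show_name': show.get('name'),
        'rating_avg': rating.get('average'),
        'genres': show.get('genres'),
        'network': network.get('name'),
        'premiere_date': show.get('premiered'),
        'status': show.get('status'),
    }


# Trimmed copy of the response shown earlier; the null network is an assumption
# added here only to demonstrate the fallback.
sample_show = {
    'name': 'Game of Thrones',
    'status': 'Running',
    'rating': {'average': 9.4},
    'genres': ['Drama', 'Adventure', 'Fantasy'],
    'premiered': '2011-04-17',
    'network': None,
}
print(extract_show_fields(sample_show))
```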

In [26]:
shows_df = pd.DataFrame(columns=['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])

for entry in entries[:30]:
    get_entry(entry)

tt2674806
tt0081912
tt1475582
tt6205862
tt1831164
tt1439629
tt0436992
tt0096657
tt5555260
tt0248654
tt4288182
tt2244495
tt0118266
tt3655448
tt0203082
tt4156586
tt1298820
tt0086661
tt0384766
tt0314979
tt0081834
tt5290382
tt3530232
tt0773262
tt4270492
tt0106028
tt0275137
tt0080306
tt0403778
tt7016936


In [27]:
shows_df.head()

Unnamed: 0,show_name,rating_avg,genres,network,premiere_date,status
0,Inside No. 9,8.3,"[Comedy, Thriller, Supernatural]",BBC Two,2014-02-05,Running
1,Only Fools and Horses,8.6,[Comedy],BBC One,1981-09-08,Ended
2,Sherlock,9.2,"[Drama, Crime, Mystery]",BBC One,2010-07-25,To Be Determined
3,Leyla ile Mecnun,,"[Drama, Comedy, Adventure]",TRT1,2011-02-09,Ended
4,Community,8.3,[Comedy],,2009-09-17,Ended
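
Appending to a global DataFrame one row at a time works for 30 shows, but a common alternative is to collect one dictionary per show and build the DataFrame in a single step. The sketch below assumes each API result has already been reduced to such a dictionary. The two rows are copied from the output above, and the names `rows` and `shows_sample` are made up for illustration.

```python
import pandas as pd

columns = ['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status']

# Two rows copied from the output above, standing in for per-show API results.
rows = [
    {'show_name': 'Only Fools and Horses', 'rating_avg': 8.6, 'genres': ['Comedy'],
     'network': 'BBC One', 'premiere_date': '1981-09-08', 'status': 'Ended'},
    {'show_name': 'Sherlock', 'rating_avg': 9.2, 'genres': ['Drama', 'Crime', 'Mystery'],
     'network': 'BBC One', 'premiere_date': '2010-07-25', 'status': 'To Be Determined'},
]

# Build the frame once; `columns` fixes the column order.
shows_sample = pd.DataFrame(rows, columns=columns)

# Parse the premiere dates so they can be sorted and filtered as dates.
shows_sample['premiere_date'] = pd.to_datetime(shows_sample['premiere_date'])

print(shows_sample.sort_values('rating_avg', ascending=False))
```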
