# Lab 7
This week's topic continues to be getting data, in this case using an API to access structured data. 

In this lab notebook you will gain experience reading data from and posting to an API. 


## Lab Setup

In [30]:
import requests
import json
import datetime
import time
from io import StringIO
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl 
%matplotlib inline  

import otter
grader = otter.Notebook()

## API Getting Data

So far we have seen examples of getting data from an API.  These examples make use of GET requests from the API/server. 

An HTTP GET request can be made with several Python libraries, including: 

* httplib (named `http.client` in Python 3) 
* urllib 
* requests 

We have been using the `requests` module.
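
For comparison, here is a minimal sketch of preparing the same kind of GET request with the standard library's `urllib`. It only builds the request object and does not contact any server. The endpoint and ISBN are the Google Books values used later in this lab, and the variable names are chosen just for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint and query value taken from the Google Books example in this lab
base_url = "https://www.googleapis.com/books/v1/volumes"

# urlencode turns a dict of parameters into a query string
query = urlencode({"q": "isbn:0553386794"})

# Build (but do not send) a GET request for the full URL
req = Request(f"{base_url}?{query}", method="GET")

print(req.get_method())
print(req.full_url)

# To actually send it, one would call urllib.request.urlopen(req)
```

With `requests`, the encoding step and the request itself are handled by a single `requests.get(url, params)` call, which is one reason we use it here.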

Let's look at another example.

## Example: Google Books

Here we will examine using the Google Books API:  
https://developers.google.com/books/docs/overview


We will be using the "volumes" resource which does not require authentication.  
https://developers.google.com/books/docs/v1/getting_started#background-operations

Specifically, we will be using the query function to search by ISBN or book numbers. 
https://developers.google.com/books/docs/v1/using#PerformingSearch




In [31]:
# api-endpoint 
url = "https://www.googleapis.com/books/v1/volumes"
  
isbn = "isbn:0553386794"

# set the parameters to be sent to the API
params = {'q': isbn}

resp = requests.get(url, params)

What does the response look like? 

How do we then extract the data?

In [32]:
resp

<Response [200]>

In [35]:
dat = resp.json()
#dat

# Pretty-print the JSON so the nested structure is easier to read
print(json.dumps(dat, indent=4))

{
    "kind": "books#volumes",
    "totalItems": 2,
    "items": [
        {
            "kind": "books#volume",
            "id": "hXNvadj27ekC",
            "etag": "jenFhvAFelk",
            "selfLink": "https://www.googleapis.com/books/v1/volumes/hXNvadj27ekC",
            "volumeInfo": {
                "title": "A Game of Thrones (HBO Tie-in Edition)",
                "subtitle": "A Song of Ice and Fire: Book One",
                "authors": [
                    "George R. R. Martin"
                ],
                "publisher": "Bantam",
                "publishedDate": "2011-03-22",
                "description": "NOW THE ACCLAIMED HBO SERIES GAME OF THRONES\u2014THE MASTERPIECE THAT BECAME A CULTURAL PHENOMENON Winter is coming. Such is the stern motto of House Stark, the northernmost of the fiefdoms that owe allegiance to King Robert Baratheon in far-off King\u2019s Landing. There Eddard Stark of Winterfell rules in Robert\u2019s name. There his family dwells in peace and 

There is a lot of information here.  Explore the structure of the JSON information. 
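
One way to get an overview before drilling in by hand is to print only the keys and value types down to a chosen depth. The helper below is a sketch, not part of any library. `outline` and `sample` are hypothetical names, and `sample` is a hand-trimmed stand-in for the real response.

```python
def outline(value, depth=0, max_depth=2):
    """Return lines describing the keys and value types of nested JSON data."""
    lines = []
    indent = "  " * depth
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{indent}{key}: {type(child).__name__}")
            if depth < max_depth:
                lines.extend(outline(child, depth + 1, max_depth))
    elif isinstance(value, list) and value:
        # Only describe the first element; lists here hold similar items
        lines.append(f"{indent}[0]: {type(value[0]).__name__} (list of {len(value)})")
        if depth < max_depth:
            lines.extend(outline(value[0], depth + 1, max_depth))
    return lines


# Hand-trimmed stand-in for the Google Books response (not the full data)
sample = {
    "kind": "books#volumes",
    "totalItems": 2,
    "items": [
        {"id": "hXNvadj27ekC",
         "volumeInfo": {"title": "A Game of Thrones (HBO Tie-in Edition)"}}
    ],
}

print("\n".join(outline(sample)))
```

In the notebook, calling `outline(dat)` on the real response should give a similar summary of its keys.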

In [36]:
dat.keys()

dict_keys(['kind', 'totalItems', 'items'])

In [37]:
dat['kind']

'books#volumes'

In [38]:
dat['totalItems']

2

In [39]:
type(dat['items'])

list

In [40]:
# We can look at the first item on the list 
dat['items'][0]

{'kind': 'books#volume',
 'id': 'hXNvadj27ekC',
 'etag': 'jenFhvAFelk',
 'selfLink': 'https://www.googleapis.com/books/v1/volumes/hXNvadj27ekC',
 'volumeInfo': {'title': 'A Game of Thrones (HBO Tie-in Edition)',
  'subtitle': 'A Song of Ice and Fire: Book One',
  'authors': ['George R. R. Martin'],
  'publisher': 'Bantam',
  'publishedDate': '2011-03-22',
  'description': 'NOW THE ACCLAIMED HBO SERIES GAME OF THRONES—THE MASTERPIECE THAT BECAME A CULTURAL PHENOMENON Winter is coming. Such is the stern motto of House Stark, the northernmost of the fiefdoms that owe allegiance to King Robert Baratheon in far-off King’s Landing. There Eddard Stark of Winterfell rules in Robert’s name. There his family dwells in peace and comfort: his proud wife, Catelyn; his sons Robb, Brandon, and Rickon; his daughters Sansa and Arya; and his bastard son, Jon Snow. Far to the north, behind the towering Wall, lie savage Wildings and worse—unnatural things relegated to myth during the centuries-long summer

In [41]:
'''We can investigate the keys where information is stored for each item'''
dat['items'][0].keys()

dict_keys(['kind', 'id', 'etag', 'selfLink', 'volumeInfo', 'saleInfo', 'accessInfo', 'searchInfo'])

In [42]:
# You can start building pretty long lines of code to access information deep 
#  in the structure. 
# Print out the first industry identifier for the book 
#  (here it is the 13-digit ISBN; the entry's 'type' field says which kind it is)
dat['items'][0]['volumeInfo']['industryIdentifiers'][0]['identifier']

'9780553386790'
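
Indexing with `[0]` returns whichever identifier happens to be listed first. A more defensive sketch is to pick the identifier by its `type` and to use `.get` so that missing keys do not raise errors. `find_identifier` is a hypothetical helper, and `sample_item` is a hand-made stand-in that assumes each identifier entry carries `type` and `identifier` fields.

```python
# Hand-made stand-in for one entry of dat['items'] (trimmed to the relevant keys)
sample_item = {
    "volumeInfo": {
        "title": "A Game of Thrones (HBO Tie-in Edition)",
        "industryIdentifiers": [
            {"type": "ISBN_13", "identifier": "9780553386790"},
            {"type": "ISBN_10", "identifier": "0553386794"},
        ],
    }
}


def find_identifier(item, id_type):
    """Return the identifier of the requested type, or None if it is absent."""
    identifiers = item.get("volumeInfo", {}).get("industryIdentifiers", [])
    for entry in identifiers:
        if entry.get("type") == id_type:
            return entry.get("identifier")
    return None


print(find_identifier(sample_item, "ISBN_10"))
print(find_identifier(sample_item, "ISBN_13"))
```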

## Exercise 1 

Which of the Game of Thrones books is longest?

Get information about each book and print out the title and number of pages.  Then, report the title and number of pages for the longest book.  

*Note, the API may return multiple entries for each isbn.  You may use the first entry for information.  If the information is missing a page number it is likely an audiobook, and you should then use the next entry for information.  If no entry has the title and page number information return the title as "no title" and the number of pages as '-1'.*

Collect the book information -- title, number of pages -- in a nested list, `ex1list` in the for loop. 

Create a DataFrame `ex1df` from this nested list with columns of `Title` and `NumPages`. 

For the book with the most pages, report its title `longestBookTitle` and number of pages `longestBookNumPages`. 

In [43]:
''' The following are the ISBN codes for the Game of Thrones books. '''

isbns = ['0553386794', '0345535421', '9780345543981', '0553390570', '1101886048']

In [44]:
'''
Iterate over the ISBNs to find the title and number of pages for each book. 
Collect this information in a list. 
Look to use "volumeInfo" to gather the information needed.
Print the title + the number of pages in the loop. 

Outside the loop:
- Convert the list to a DataFrame, ex1df, column names 'Title' and 'NumPages' 
- Report longestBookTitle and longestBookNumPages.
'''

ex1list = []
for isbn in isbns:
    resp = requests.get(url, params={'q': 'isbn:' + isbn})
    dat = resp.json()
    # Defaults used when no entry has both a title and a page count
    # (-1 is kept as an integer so the NumPages column stays numeric)
    title, pages = 'no title', -1
    # Use the first entry with both fields; audiobook entries lack 'pageCount'
    for item in dat.get('items', []):
        info = item.get('volumeInfo', {})
        if 'title' in info and 'pageCount' in info:
            title, pages = info['title'], int(info['pageCount'])
            break
    print(title, pages)
    ex1list.append([title, pages])

ex1df = pd.DataFrame(ex1list, columns=['Title', 'NumPages'])

# Select the row with the largest page count (not the longest title string)
longest = ex1df.loc[ex1df['NumPages'].idxmax()]
longestBookTitle = longest['Title']
longestBookNumPages = int(longest['NumPages'])

In [45]:
grader.check("q1")

## Example: iTunes Content 

Apple has a simple [API](https://developer.apple.com/library/archive/documentation/AudioVideo/Conceptual/iTuneSearchAPI/Searching.html#//apple_ref/doc/uid/TP40017632-CH5-SW1) for looking up iTunes content.

In [46]:
# api-endpoint
url = 'https://itunes.apple.com/search'

# For example let's search for lord of the rings ebooks 
params = {'term': 'lord+of+the+rings', 'entity': 'ebook', 
         'limit': 3}

resp = requests.get(url, params)
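
One detail worth checking when passing a search term through `params` is how spaces and `+` signs are encoded. The sketch below uses the standard library's `urlencode` to show the difference. `requests` is assumed here to encode `params` in much the same way, which can be confirmed by printing `resp.url` after a request.

```python
from urllib.parse import urlencode

# A term written with spaces: each space becomes '+' in the query string
with_spaces = urlencode({"term": "lord of the rings", "entity": "ebook"})

# A term already written with '+': each '+' is escaped as a literal plus sign
with_pluses = urlencode({"term": "lord+of+the+rings", "entity": "ebook"})

print(with_spaces)
print(with_pluses)
```

If the search returns fewer results than expected, writing the term with plain spaces and letting the library do the encoding is worth trying.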

In [47]:
resp

<Response [200]>

In [48]:
resp.json()

{'resultCount': 1,
 'results': [{'releaseDate': '2001-10-26T07:00:00Z',
   'trackId': 739542595,
   'trackName': 'Lord of the Rings',
   'genreIds': ['10084', '38', '9031'],
   'artistIds': [482333908],
   'kind': 'ebook',
   'currency': 'USD',
   'description': '" With New Line Cinema\'s production of The Lord of the Rings film trilogy, the popularity of the works of J.R.R. Tolkien is unparalleled. Tolkien\'s books continue to be bestsellers decades after their original publication. An epic in league with those of Spenser and Malory, The Lord of the Rings trilogy, begun during Hitler\'s rise to power, celebrates the insignificant individual as hero in the modern world. Jane Chance\'s critical appraisal of Tolkien\'s heroic masterwork is the first to explore its "mythology of power"–that is, how power, politics, and language interact. Chance looks beyond the fantastic, self-contained world of Middle-earth to the twentieth-century parallels presented in the trilogy.',
   'trackCensoredN

## Exercise 2

Search for 50 "The Expanse" e-books (the search may return fewer or slightly more). 

Create a DataFrame from the responses containing the `TrackName`, `TrackID`, `Price`, `AveRating`, `NumRating`. 

Sort the results from highest to lowest `AveRating`, then by `NumRating`.

If any of the information you are meant to collect is missing, replace it with `NaN`.

In [49]:
url = 'https://itunes.apple.com/search'

# Search for "The Expanse" ebooks

params = {'term': 'expanse', 'entity': 'ebook', 'limit': 50}
resp = requests.get(url, params) 

resp.json()

{'resultCount': 60,
 'results': [{'artistIds': [433411981],
   'artistId': 433411981,
   'artistName': 'James S. A. Corey',
   'genres': ['Science Fiction',
    'Books',
    'Sci-Fi & Fantasy',
    'Science Fiction & Literature',
    'Adventure Sci-Fi'],
   'price': 9.99,
   'trackId': 395522188,
   'trackName': 'Leviathan Wakes',
   'releaseDate': '2011-06-15T07:00:00Z',
   'genreIds': ['10063', '38', '9020', '10064', '11006'],
   'kind': 'ebook',
   'currency': 'USD',
   'description': '<b>From a <i>New York Times</i> bestselling and Hugo award-winning author comes a modern masterwork of science fiction, introducing a captain, his crew, and a detective as they unravel a horrifying solar system wide conspiracy that begins with a single missing girl.&#xa0;</b><b>With over 10 million copies sold, The Expanse has become one of the biggest science fiction phenomenons of the decade.&#xa0;</b><br /><br /><b>Now a Prime Original series.&#xa0;</b><br /><br /><b>HUGO AWARD WINNER FOR BEST SERI

In [50]:
obj = json.loads(resp.text)
#obj       # uncomment to explore, leave commented before submission

Try using at least two approaches to create the DataFrame, e.g., 

* *Method 1* - Keep track of rows in a list, convert nested lists to DataFrame.  Note, do not create an empty DataFrame and append entries in an iterator (this is not scalable)  
https://stackoverflow.com/questions/13784192/creating-an-empty-pandas-dataframe-and-then-filling-it/41529411#41529411
* *Method 2* - Use pandas `read_json` function to convert JSON to pandas object
* *Method 3* - Use `json_normalize` function to read in JSON to a flat table. 
The `json_normalize` function normalizes a semi-structured JSON data object into a flat table.   
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
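
As a small offline illustration of Methods 1 and 3, the sketch below builds the same table both ways from a hand-made list of records and sorts it from highest to lowest rating. `sample_results` is a stand-in for `obj['results']`. Its first two records reuse values seen in this lab's output, and the third is hypothetical and deliberately lacks rating fields.

```python
import numpy as np
import pandas as pd

# Stand-in for obj['results']; the last record is hypothetical and has no ratings
sample_results = [
    {"trackName": "The Bastard Legion", "trackId": 1198098757, "price": 2.99,
     "averageUserRating": 3.5, "userRatingCount": 7},
    {"trackName": "Leviathan Wakes", "trackId": 395522188, "price": 9.99,
     "averageUserRating": 4.5, "userRatingCount": 2662},
    {"trackName": "Hypothetical Unrated Book", "trackId": 1, "price": 0.0},
]

# Maps each API field name to the column name the exercise asks for
fields = {"trackName": "TrackName", "trackId": "TrackID", "price": "Price",
          "averageUserRating": "AveRating", "userRatingCount": "NumRating"}

# Method 1: collect rows in a nested list, filling missing fields with NaN
rows = [[rec.get(key, np.nan) for key in fields] for rec in sample_results]
df_rows = pd.DataFrame(rows, columns=list(fields.values()))

# Method 3: flatten with json_normalize, then select and rename the columns
df_norm = (pd.json_normalize(sample_results)
           .reindex(columns=list(fields))
           .rename(columns=fields))

# Highest AveRating first, ties broken by NumRating; NaN rows sort last
sort_cols = ["AveRating", "NumRating"]
df_rows = df_rows.sort_values(by=sort_cols, ascending=False)
df_norm = df_norm.sort_values(by=sort_cols, ascending=False)

print(df_rows)
```

Both approaches create the DataFrame in a single step rather than growing it row by row.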

<!-- BEGIN QUESTION -->



In [51]:
# State which method you are using: 
#  Method 1

df_method1 = pd.DataFrame(
    columns=['TrackName', 'TrackID', 'Price', 'AveRating','NumRating'],
    data=[[o['trackName'], o['trackId'], o.get('price', np.nan), o.get('averageUserRating', np.nan), o.get('userRatingCount', np.nan) ] for o in obj['results']])
# Sort from highest to lowest, as the exercise asks
q2df1 = df_method1.sort_values(by=['AveRating', 'NumRating'], ascending=False)

print(q2df1.shape)
q2df1.head()

(60, 5)


Unnamed: 0,TrackName,TrackID,Price,AveRating,NumRating
0,Leviathan Wakes,395522188,9.99,4.5,2662.0
1,Caliban's War,469995594,11.99,4.5,1834.0
7,Tiamat's Wrath,1367091224,11.99,4.5,1782.0
2,Abaddon's Gate,576438197,11.99,4.5,1522.0
4,Nemesis Games,926360337,11.99,4.5,1438.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->



In [52]:
# State which method you are using: 
#  Method 3
q2df2 = pd.json_normalize(obj['results'])

# Map API field names to the requested column names, then select, rename, and sort
columns = {'trackName': 'TrackName', 'trackId': 'TrackID', 'price': 'Price',
           'averageUserRating': 'AveRating', 'userRatingCount': 'NumRating'}
q2df2 = (q2df2[list(columns)]
         .rename(columns=columns)
         .sort_values(by=['AveRating', 'NumRating'], ascending=False))
print(q2df2.shape)
q2df2.head()

(60, 5)


Unnamed: 0,TrackName,TrackID,Price,AveRating,NumRating
0,Leviathan Wakes,395522188,9.99,4.5,2662.0
1,Caliban's War,469995594,11.99,4.5,1834.0
7,Tiamat's Wrath,1367091224,11.99,4.5,1782.0
2,Abaddon's Gate,576438197,11.99,4.5,1522.0
4,Nemesis Games,926360337,11.99,4.5,1438.0


<!-- END QUESTION -->

## Example: TV Shows 

Here we use an API for TV show information:  
http://api.tvmaze.com/

In [53]:
# We can find the tvmaze id for a show based on the IMDB id. 
id_bcs = 'tt3032476'
resp = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + id_bcs)

In [54]:
resp.json()

{'id': 618,
 'url': 'https://www.tvmaze.com/shows/618/better-call-saul',
 'name': 'Better Call Saul',
 'type': 'Scripted',
 'language': 'English',
 'genres': ['Drama', 'Crime', 'Legal'],
 'status': 'Ended',
 'runtime': 60,
 'averageRuntime': 64,
 'premiered': '2015-02-08',
 'ended': '2022-08-15',
 'officialSite': 'https://www.amc.com/shows/better-call-saul--1002228',
 'schedule': {'time': '21:00', 'days': ['Monday']},
 'rating': {'average': 8.6},
 'weight': 98,
 'network': {'id': 20,
  'name': 'AMC',
  'country': {'name': 'United States',
   'code': 'US',
   'timezone': 'America/New_York'},
  'officialSite': None},
 'webChannel': None,
 'dvdCountry': None,
 'externals': {'tvrage': 37780, 'thetvdb': 273181, 'imdb': 'tt3032476'},
 'image': {'medium': 'https://static.tvmaze.com/uploads/images/medium_portrait/501/1253515.jpg',
  'original': 'https://static.tvmaze.com/uploads/images/original_untouched/501/1253515.jpg'},
 'summary': '<p><b>Better Call Saul</b> is the prequel to the award-win

## Exercise 3

Let's consider the 5 most viewed shows on Netflix (from their [2024 engagement report](https://www.tvguide.com/galleries/the-most-watched-netflix-shows-2024/)) as well as several shows that won Emmys in 2024. 

For each show get information on the episodes. 

Consider using the endpoint - http://www.tvmaze.com/api#show-episode-list

Create a DataFrame, `q3df`, that reports for each show and season the number of episodes, the min, mean, and max running time as well as the min, mean, and max rating over the episodes that season. 

The DataFrame should have columns: `ShowName`, `Season`, `Num_Eps`, `Min_Run`, `Mean_Run`, `Max_Run`, `Min_Rating`, `Mean_Rating`, `Max_Rating`.  
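
One possible way to compute these per-season statistics is `groupby` with named aggregations. The sketch below works on a hand-made, hypothetical episode list, and it assumes each episode has `season`, `runtime`, and a nested `rating.average`, as in the TVmaze output shown above.

```python
import pandas as pd

# Hypothetical episodes, shaped like entries from the episode list endpoint
sample_episodes = [
    {"season": 1, "runtime": 30, "rating": {"average": 8.0}},
    {"season": 1, "runtime": 40, "rating": {"average": 9.0}},
    {"season": 2, "runtime": 50, "rating": {"average": 7.0}},
]

# json_normalize flattens the nested rating into a 'rating.average' column
episodes = pd.json_normalize(sample_episodes)

# One output row per season; each keyword names an output column
summary = (
    episodes.groupby("season")
    .agg(
        Num_Eps=("runtime", "size"),
        Min_Run=("runtime", "min"),
        Mean_Run=("runtime", "mean"),
        Max_Run=("runtime", "max"),
        Min_Rating=("rating.average", "min"),
        Mean_Rating=("rating.average", "mean"),
        Max_Rating=("rating.average", "max"),
    )
    .reset_index()
    .rename(columns={"season": "Season"})
)

# Add the show name as the first column (placeholder name for this sketch)
summary.insert(0, "ShowName", "Hypothetical Show")

print(summary)
```

The per-show summaries can then be combined with a single `pd.concat` call after the loop over shows.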

In your solution, put in a cooling period of 2-5 seconds between API calls. You may want to look at using `time.sleep`.
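
A cooling period can be sketched as a loop that sleeps before every call except the first. `fetch_all` below is a hypothetical helper. The `fetch` and `sleep` arguments are injectable so the pattern can be tried offline, and in the lab `fetch` would wrap a `requests.get` call.

```python
import time


def fetch_all(keys, fetch, pause_seconds=2.0, sleep=time.sleep):
    """Call fetch(key) for each key, pausing between consecutive calls."""
    results = []
    for position, key in enumerate(keys):
        if position > 0:
            sleep(pause_seconds)  # cooling period before every call after the first
        results.append(fetch(key))
    return results


# Offline demonstration: a stand-in fetch function and a sleep that only records
pauses = []
shows = fetch_all(["tt5611024", "tt8740790"], fetch=str.upper,
                  pause_seconds=2, sleep=pauses.append)

print(shows)
print(pauses)
```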


In [24]:
imdb_ids = ['tt5611024', 'tt8740790', 'tt13649112', 'tt13210838', 'tt9018736', 
           'tt11815682', 'tt2788316', 'tt5875444', 'tt14452776']

In [25]:
# Create a DataFrame "q3df"
# Collect one row per show and season in a list, then build the DataFrame once.
columns = ['ShowName', 'Season', 'Num_Eps', 'Min_Run', 'Mean_Run', 'Max_Run',
           'Min_Rating', 'Mean_Rating', 'Max_Rating']
rows = []
for imdb_id in imdb_ids:
    # Look up the TVmaze show record from the IMDb id
    show = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + imdb_id).json()
    time.sleep(2)  # cooling period between API calls
    # Fetch all episodes and flatten nested fields such as rating.average
    resp2 = requests.get('http://api.tvmaze.com/shows/' + str(show['id']) + '/episodes')
    episodes = pd.json_normalize(resp2.json())
    for season in episodes['season'].unique():
        eachSeas = episodes[episodes['season'] == season]
        run, rating = eachSeas['runtime'], eachSeas['rating.average']
        rows.append([show['name'], season, len(eachSeas),
                     run.min(), run.mean(), run.max(),
                     rating.min(), rating.mean(), rating.max()])
    time.sleep(2)

# Reverse so the most recently fetched rows come first, as in the output below
q3df = pd.DataFrame(rows[::-1], columns=columns)
q3df

Unnamed: 0,ShowName,Season,Num_Eps,Min_Run,Mean_Run,Max_Run,Min_Rating,Mean_Rating,Max_Rating
0,The Bear,3,10,27,34.9,43,7.0,7.52,8.0
1,The Bear,2,10,26,36.1,66,7.9,8.23,8.8
2,The Bear,1,8,20,30.25,48,7.5,8.0375,8.6
3,Slow Horses,4,6,40,44.833333,52,7.8,8.4,8.7
4,Slow Horses,3,6,41,43.333333,45,8.4,8.616667,8.8
5,Slow Horses,2,6,41,46.0,53,7.9,8.216667,8.7
6,Slow Horses,1,6,41,46.666667,53,7.3,7.866667,8.3
7,Shōgun,1,10,53,59.2,70,8.8,9.06,9.2
8,Hacks,3,9,29,33.666667,36,6.8,7.666667,8.5
9,Hacks,2,8,29,32.5,35,6.5,7.3125,7.8


In [26]:
grader.check("q3")

## Congratulations! You have finished Lab 7! 

### Submission Instructions

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Submit this file to the Lab 7 assignment on Gradescope. 



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)