
# Using APIs

In [1]:
# Please include your names below
# Also, please edit the name of the file and include the names of the two (or three) people answering

# Pair answering the assignment: Jialu Liu, Michael Blum
# Pair giving feedback: ..., ...

In [2]:
import requests
import json
import os
from pprint import pprint

## 1. Google books
### Setting the request URL parameters
You can set the parameters of the API request according to the documentation below. The first part of the request is always the same, and the "q" parameter (short for "query") takes the search terms and keywords. <br>
More documentation can be found at: https://developers.google.com/books/docs/v1/using?hl=vi#ids

<img src="parameters.png" width=80%> </img>

**Example**: `https://www.googleapis.com/books/v1/volumes?q=isbn:9780141909882` returns information about the book with the given ISBN (Everything Is Illuminated by Jonathan Safran Foer).

**Exercise**

1. Using the above parameters create the following request URLs!
    1. Requesting books that have "potter" in the title
    2. Requesting books that have "doyle" as author
    3. With isbn "1904633684"
    4. With id "2bGdK8CRKoEC"
    5. Second result page when searching books that have "detective" listed in the category list
    6. Second result page when searching books that have "potter" in the title but showing 40 results in one page, not 10. 
<br>

Try and see whether they work with `requests.get()`!

In [3]:
base_url = 'https://www.googleapis.com/books/v1/volumes?q=' 
result_A = base_url + 'intitle:potter'
result_B = base_url + 'inauthor:doyle'
result_C = base_url + 'isbn:1904633684'
result_D = base_url + 'id:2bGdK8CRKoEC'
result_E = base_url + 'subject:detective&startIndex=10'
result_F = base_url + 'intitle:potter&startIndex=40&maxResults=40'  # with 40 results per page, the second page starts at index 40

#check with requests

text_A = requests.get(result_A).text
text_B = requests.get(result_B).text
text_C = requests.get(result_C).text
text_D = requests.get(result_D).text
text_E = requests.get(result_E).text
text_F = requests.get(result_F).text

#continue to check all your URLs this way!
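The query strings above can also be assembled programmatically. The following is a minimal sketch using only the standard library. The helper name `build_volumes_url` and its `page` and `page_size` arguments are hypothetical and not part of the Google Books API. It assumes result pages are addressed through `startIndex` and `maxResults`, so page p (counting from 1) with n results per page starts at index (p - 1) * n.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.googleapis.com/books/v1/volumes'

def build_volumes_url(q, page=1, page_size=10):
    """Build a search URL for a query and a 1-based result page (hypothetical helper)."""
    params = {
        'q': q,
        'startIndex': (page - 1) * page_size,  # index of the first result on that page
        'maxResults': page_size,
    }
    # keep ':' unescaped so the keyword syntax (e.g. intitle:potter) stays readable
    return BASE_URL + '?' + urlencode(params, safe=':')

print(build_volumes_url('subject:detective', page=2))
print(build_volumes_url('intitle:potter', page=2, page_size=40))
```

The resulting strings can be passed to `requests.get()` in the same way as the hand-written URLs.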

1G. Define a function that sends a request to the Google Books API with the URL parameters as inputs to the function. Try to incorporate as many of the variables as possible and build the URL according to the settings you want to have. Don't forget to write a docstring explaining how the function works. Docstrings are explanations of functions, describing the input, output, and purpose of the function. If you haven't used them before, you can find more examples at: https://www.geeksforgeeks.org/python-docstrings/

In [4]:
def request_books_api(intitle=False, inauthor=False, inpublisher=False, subject=False, isbn=False):
    """ 
    Sends a Google Books API request with optional search queries 
  
    Parameters: 
    intitle (string): string that needs to be in title
    inauthor (string): string that needs to be in author
    inpublisher (string): string that needs to be in publisher
    subject (string): string that needs to be in category list
    isbn (string): isbn that needs to match book isbn
  
    Returns: 
    requests.Response: response from the Google Books API (read it with .text or .json()) 
    """
    url = 'https://www.googleapis.com/books/v1/volumes?q=' 
    terms = []
    if intitle: terms.append('intitle:' + str(intitle))
    if inauthor: terms.append('inauthor:' + str(inauthor))
    if inpublisher: terms.append('inpublisher:' + str(inpublisher))
    if subject: terms.append('subject:' + str(subject))
    if isbn: terms.append('isbn:' + str(isbn))
    
    # Join the search terms with '+' so that several terms do not run together
    return requests.get(url + '+'.join(terms))

test = request_books_api(intitle='potter').text
    
    

#### Status codes

Responses carry information even before we look at their textual content. The response object tells us the URL we requested, the date, its status, the content type, and the size of the response.

The most important of these for us is the status, which tells us whether our request was successful. You can find a list of HTTP status codes at https://en.wikipedia.org/wiki/List_of_HTTP_status_codes.
Or you can always check HTTP Status Cats: https://www.flickr.com/photos/girliemac/sets/72157628409467125

The most important status codes for us are:

- successful call: code 200
- client error: 4xx, e.g. 401: Unauthorized, 404: Not found
- server error: 5xx, e.g. 500: Internal Server Error, 502: Bad Gateway
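As a small illustration of the grouping above, the sketch below labels a status code by its class. The function name `describe_status` is hypothetical; the reason phrases come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for an HTTP status code, grouped by class (hypothetical helper)."""
    if 200 <= code < 300:
        kind = 'success'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'other'
    try:
        phrase = HTTPStatus(code).phrase   # standard reason phrase, e.g. 'Not Found'
    except ValueError:
        phrase = 'Unknown'                 # code is not in the standard enum
    return '%d %s (%s)' % (code, phrase, kind)

for code in (200, 401, 404, 500, 502):
    print(describe_status(code))
```

With `requests`, the code of a response is available as `response.status_code`, so `describe_status(response.status_code)` would label a real request.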

### 2. Parsing json

1. Using the previously defined function, query the book with isbn number 1904633684 and print the text of the result. 

In [5]:
request_1 = request_books_api(isbn=1904633684)
# print(request_1.text)

2. Now load the previous response into a json object:

In [6]:
json_object = json.loads(request_1.text)
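To make the step concrete without a network call, here is a hedged sketch that parses a hand-written string. The string only mimics the rough shape of a Books API response; its values are placeholders chosen for illustration, not real API output.

```python
import json

# hand-written stand-in for response.text (placeholder values)
sample_text = ('{"kind": "books#volumes", "totalItems": 1, '
               '"items": [{"volumeInfo": {"title": "The Case-book of Sherlock Holmes"}}]}')

sample = json.loads(sample_text)   # JSON text -> nested Python dicts and lists

print(type(sample))
print(list(sample))                               # top-level keys
print(sample['items'][0]['volumeInfo']['title'])  # drill down key by key
```

A `requests` response also offers `response.json()`, which performs the same parsing in one call.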

3. What are the highest level keys of the json object?

In [7]:
for key in json_object:
    print(key)

kind
totalItems
items


4. What is the type of the value of 'items' key?

In [8]:
print(type(json_object['items']))

<class 'list'>


5. Parse the following information from the json object

In [9]:
### Total number of items returned by the request
number_of_items = len(json_object['items'])
print('number of Items:', number_of_items)

### Title of the book
book = json_object['items'][0]
print('Title of book:', book['volumeInfo']['title'])

### Authors of the book
print('Authors of book:', book['volumeInfo']['authors'])

### Date of publishing
print('publishing date of book:', book['volumeInfo']['publishedDate'])

### Page Count
print('page count of book:', book['volumeInfo']['pageCount'])

### Categories
print('categories of book:', book['volumeInfo']['categories'])

### Average Rating
print('average Rating of book:', book['volumeInfo']['averageRating'])

### Rating Count
print('rating count of book:', book['volumeInfo']['ratingsCount'])

### Is it available as Epublication (Epub)
print('available as epub:', book['accessInfo']['epub']['isAvailable'])

number of Items: 1
Title of book: The Case-book of Sherlock Holmes
Authors of book: ['Arthur Conan Doyle']
publishing date of book: 2004
page count of book: 304
categories of book: ['Detective and mystery stories']
average Rating of book: 3.5
rating count of book: 23
available as epub: False


Unlike in the case of requesting books by IDs, the requests in which you search for author or title usually have more than one book as a result. Try searching for books that contain a specific word in their title.

6. Once you obtain the result of the request as a json object, loop through all books in the json and print out the **title** of all the books. 

In [10]:
books = json.loads(request_books_api(intitle='chess').text)
for book in books['items']:
    print(book['volumeInfo']['title'])


The American Chess Magazine
The British Chess Magazine
The Chess Journal
The Theory of Chess
The Chess Player's Chronicle
British Miscellany, and Chess Player's Chronicle
The Chess Players Chronicle
Chess Player's Chronicle
The Chess Player's Magazine
Chess Player


7. Now search for books with a category and print out the authors

In [11]:
books = json.loads(request_books_api(subject='adventure').text)
for book in books['items']:
    print(book['volumeInfo']['authors'])

['Gary Paulsen']
['Matt Doeden']
['Ross Van Zyl']
['Gordon Korman']
['Alison Lester']
['Jon Mayhew']
['Belinda Murrell']
['R. E. Taylor']
['Claire Saxby']
['Bindi Irwin', 'Marisa Nathar']


8. Define a function that, given an item in the json object (the meta information about one book), returns a list with the following attributes: `title, authors, publishedDate, pageCount, categories, averageRating, ratingsCount, epub`. 
<br>
Note that **not** every book has all the required features. If a piece of information is missing, your code should write NaN in place of the value. 

In [12]:
def parse_json(j):
    '''the function takes a book item as an input and returns a tuple of the extracted features;
    features missing from the item are replaced by the string 'NaN' '''
    
    # use the function argument j, not a global variable, so any book item can be parsed
    info = j.get('volumeInfo', {})
    fields = ['title', 'authors', 'publishedDate', 'pageCount',
              'categories', 'averageRating', 'ratingsCount']
    values = [info.get(field, 'NaN') for field in fields]
    
    # epub availability sits in a different branch of the item
    epub = j.get('accessInfo', {}).get('epub', {}).get('isAvailable', 'NaN')
    
    return (*values, epub)

# parse a single book item: the last book from the 'adventure' search above
parsed = parse_json(books['items'][-1])
print(parsed)

('An Island Escape', ['Bindi Irwin', 'Marisa Nathar'], '2012', 185, ['Juvenile Fiction'], 'NaN', 'NaN', False)


### 3. New York Times API

Your task in this exercise will be to compare the number of Brexit-, Trump- and Corona-related articles in the last 6 months, using an API that the New York Times provides. 

Start by creating an API key on the NYT API website. As you can see, the NYT provides multiple functionalities/APIs. For this exercise we will use the one that allows you to search among articles, so when you sign up for the API key, make sure to pick that one. 

Here's the documentation for using this API; it explains the syntax of queries: https://developer.nytimes.com/docs/articlesearch-product/1/overview 

1. How can you specify a keyword to search for in the URL?
2. How can you specify a date or multiple dates to search for?
3. Write a function that takes a query to the API as an input and returns the number of hits (number of article results) that this query returns. 
4. How many results are in a response json by default? How would you collect all results for a specific search? Either write example code that in fact collects all articles for a query (in an example that has more than one page of results) or explain in detail how you would automate this process. You can use the function you created in 3. to automatically determine how many pages you have to loop through. 
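For questions 1 and 2, a request URL can be put together from its parameters. The sketch below is an assumption-laden illustration: the helper `build_nyt_url` is hypothetical, `YOUR_API_KEY` is a placeholder, and the parameter names (`q`, `api-key`, `begin_date`, `end_date`, `page`) and the `YYYYMMDD` date format should be checked against the documentation linked above.

```python
from urllib.parse import urlencode

NYT_SEARCH_URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

def build_nyt_url(query, api_key, begin_date=None, end_date=None, page=None):
    """Build an Article Search URL; dates are YYYYMMDD strings or ints (hypothetical helper)."""
    params = {'q': query, 'api-key': api_key}
    # optional filters are only added when they are given
    if begin_date is not None:
        params['begin_date'] = begin_date
    if end_date is not None:
        params['end_date'] = end_date
    if page is not None:
        params['page'] = page
    return NYT_SEARCH_URL + '?' + urlencode(params)

print(build_nyt_url('Brexit', 'YOUR_API_KEY', begin_date=20200301, end_date=20200331))
```

`urlencode` also escapes spaces and special characters, which matters for multi-word queries.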

Now you have all the pieces together to write a function that collects all results for a specific topic (keyword) written on a specific date. Remember, that our original question was how the appearance of 3 topics changed over time in the last 6 months. 

5. Loop through all dates in the last 6 months and figure out how many articles there were in each of the 3 topics. You can aggregate into weekly or monthly buckets. You can also include synonyms of the given words (e.g. "Covid-19" for "Corona") or also search other terms that interest you.
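One way to get monthly buckets is to generate the first and last day of each month up front. The following is a sketch under the assumption that dates are passed as `YYYYMMDD` integers; the helper name `month_ranges` is hypothetical. It handles the year rollover and leap years through `calendar.monthrange`.

```python
import calendar

def month_ranges(year, month, n_months):
    """Return (begin_date, end_date) pairs as YYYYMMDD ints for n_months consecutive months."""
    ranges = []
    for _ in range(n_months):
        last_day = calendar.monthrange(year, month)[1]  # number of days in this month
        ranges.append((year * 10000 + month * 100 + 1,
                       year * 10000 + month * 100 + last_day))
        # move on to the next month, rolling over into the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return ranges

for begin, end in month_ranges(2019, 10, 6):
    print(begin, end)
```

Each pair can then be used as the begin and end date of one query per topic.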

Using the below code, you can do a visualization of your findings. 

Trick for pretty printing json:
When dealing with large json objects and trying to understand them, it is often difficult to read them on the screen. Use the pprint library to see a nicer version of these jsons (`from pprint import pprint`, then e.g. `pprint(json_object)`).

#### Visualization

To help you out with the visualization, we have created the code below. In the description of the function you can find instructions on how to use it. There is also an example of a call underneath the function.

You need to give two parameters to the function. The first one is a dictionary whose keys are the three search query terms that you have used (given as strings); the value for each term is a list with the number of hits in each time block considered. The second parameter is a list of strings with the names of the time periods being considered. 

Important note: the lengths of the lists must match. It is assumed that for each query there is a vector having the number of hits per each period specified in the list of the second parameter. This means that the three lists in the dictionary and the list given as the second parameter must have equal lengths.
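The length requirement can be checked before plotting. This is an optional sketch; `check_plot_inputs` is a hypothetical helper and not part of the provided plotting code.

```python
def check_plot_inputs(dictionary_results, periods):
    """Raise ValueError unless there are three terms and every list matches len(periods)."""
    if len(dictionary_results) != 3:
        raise ValueError('expected exactly three query terms, got %d'
                         % len(dictionary_results))
    for term, counts in dictionary_results.items():
        if len(counts) != len(periods):
            raise ValueError('%r has %d values but there are %d periods'
                             % (term, len(counts), len(periods)))

# matching lengths: the check passes silently
check_plot_inputs({'Brexit': [250, 200], 'Trump': [100, 75], 'Corona': [300, 400]},
                  ['February', 'March'])
```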

In [None]:
# This is a pre-implemented function for creating the visualisation
# You don't have to modify this

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def plot_no_articles(dictionary_results, periods):
    '''
    Plots the number of articles per query term for each time period.
    
    dictionary_results = dictionary of the form query_term: [no_articles_for_period_1, no_articles_for_period_2, ...]
        e.g. {'Brexit':[250, 200], 'Trump':[100, 75], 'Corona':[300, 400]}
             if you group articles by month periods, and 
             you have looked only at the past two months, and
             there were 250 hits for Brexit in February, and 200 in March, and
             there were 100 hits for Trump in February, and 75 in March, and
             there were 300 hits for Corona in February, and 400 in March
    periods = list of time periods used for the investigation
        e.g. ['February', 'March']
             if you have considered the past two months
    '''
    d = dictionary_results
    labels = periods
    query_terms = list(d.keys())
    list_0 = d[query_terms[0]]
    list_1 = d[query_terms[1]]
    list_2 = d[query_terms[2]]
    
    # locations for labels
    x = np.arange(0, len(labels))
    # width per bar
    width = 0.3
    
    # Building the subplots
    fig, ax = plt.subplots(figsize=(18,5))
    rects1 = ax.bar(x - width, list_0, width, label=query_terms[0])
    rects2 = ax.bar(x, list_1, width, label=query_terms[1])
    rects3 = ax.bar(x + width, list_2, width, label=query_terms[2])

    # Labeling
    ax.set_xlabel('Time periods')
    ax.set_ylabel('Number of articles')
    ax.set_title('Number of articles by query')
    ax.set_xticks(x)
    ax.set_xticklabels(labels)
    ax.autoscale()
    xmin = -2*width
    xmax = max(np.arange(len(labels)))+2*width
    ymin = 0
    ymax = max(list_0+list_1+list_2)*1.1 
    ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax))
    ax.legend(loc='best')


    def autolabel(rects):
        """Attach a text label above each bar in *rects*, displaying its height."""
        for rect in rects:
            height = rect.get_height()
            ax.annotate('{}'.format(height),
                        xy=(rect.get_x() + rect.get_width() /2, height),
                        xytext=(0, 3),  # 3 points vertical offset
                        textcoords="offset points",
                        ha='center', va='bottom')


    autolabel(rects1)
    autolabel(rects2)
    autolabel(rects3)
    
    fig.autofmt_xdate()

    fig.tight_layout()

    plt.show()

# Below there is one example of how to use the above plot function
dict_results2 = {'Brexit':[250, 200], 'Trump':[100, 75], 'Corona':[300, 400]}
plot_no_articles(dict_results2, ['February', 'March'])


#### Your solution for the third exercise

In [None]:
### Code for task 3
from urllib import request, parse
import json

def count_hits(query, api_key):
    """Return the number of hits the NYT Article Search API reports for a query."""
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&api-key=%s" % (parse.quote(query), api_key)
    with request.urlopen(url) as h:
        obj = json.loads(h.read().decode('utf-8'))
    return obj['response']['meta']['hits']

query="COVID-19" # Replace the word with your query
api_key="YgiAaTTsbsSRlBGmnR6kW6AlCkhMZuX6" 
print(count_hits(query, api_key))
# The number printed is the "hits" in NYT api.

In [None]:
### Code for task 4

query="COVID-19"
api_key="YgiAaTTsbsSRlBGmnR6kW6AlCkhMZuX6" 
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&api-key=%s" % (query, api_key)
h = request.urlopen(url)
html = h.read()
text=html.decode('utf-8')
obj = json.loads(text)
a=len(obj['response']['docs'])  # number of articles contained in one response
print(a)


# We can see that the number of results returned in one response (one page) is 10

# To collect all the results for a specific search, loop over the result pages with the "page" parameter.
# The number of pages is ceil(hits / 10), where hits is the number obtained in task 3.


"""
from urllib import request
import json
import math
import time

query="COVID-19"
api_key="YgiAaTTsbsSRlBGmnR6kW6AlCkhMZuX6" 
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&api-key=%s" % (query, api_key)
h = request.urlopen(url)
html = h.read()
text=html.decode('utf-8')
obj = json.loads(text)
docs=list(obj['response']['docs'])   # articles from the first page
b=obj['response']['meta']['hits']
c=int(math.ceil(b/10))               # total number of pages

for page in range(1,c):
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&api-key=%s&page=%d" % (query, api_key,page)
    time.sleep(5)                    # pause between requests to avoid hitting the rate limit
    h = request.urlopen(url)
    html = h.read()
    text=html.decode('utf-8')
    obj = json.loads(text) 
    docs.extend(obj['response']['docs'])   # append this page's articles instead of overwriting earlier ones

print(docs)
"""
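The paging logic can be tried out without calling the API by passing the page-fetching step in as a function. This is a sketch, not the NYT client: `collect_all_docs` and `fake_fetch` are hypothetical names, and it assumes 10 results per page with pages numbered from 0.

```python
import math

def collect_all_docs(fetch_page, hits, page_size=10):
    """Gather the docs of every result page; fetch_page(page) returns one page's list of docs."""
    n_pages = math.ceil(hits / page_size)
    all_docs = []
    for page in range(n_pages):            # assumed: pages are numbered from 0
        all_docs.extend(fetch_page(page))  # accumulate, do not overwrite
    return all_docs

# stand-in for the real API call: 23 fake articles served 10 per page
fake_articles = ['article %d' % i for i in range(23)]

def fake_fetch(page):
    return fake_articles[page * 10:(page + 1) * 10]

print(len(collect_all_docs(fake_fetch, hits=23)))
```

In real use, `fetch_page` would request one page from the API, parse the JSON, and return `obj['response']['docs']`, with a pause between calls.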

In [None]:
### Code for task 5

from urllib import request
import json
import matplotlib.pyplot as plt
import numpy as np
import calendar
import time


query=["COVID-19","Trump","Brexit"]
api_key="YgiAaTTsbsSRlBGmnR6kW6AlCkhMZuX6" 
beg_date=20191001
end_date=20200331
Mon=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# Split the dates into year and month (the whole first and last months are counted)
yy1, mon1 = beg_date//10000, beg_date//100%100
yy2, mon2 = end_date//10000, end_date//100%100

# Build the list of (year, month) pairs covered by the date range
months=[]
yy, mm = yy1, mon1
while (yy, mm) <= (yy2, mon2):
    months.append((yy, mm))
    yy, mm = (yy+1, 1) if mm==12 else (yy, mm+1)

# Labels for the plot
M=[Mon[mm-1] for yy, mm in months]

dict_results = {}
for q in query:
    b = []
    for yy, mm in months:
        # first and last day of the month, both in the same year
        sta=yy*10000+mm*100+1
        ove=yy*10000+mm*100+calendar.monthrange(yy, mm)[1]
        url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q=%s&api-key=%s&begin_date=%d&end_date=%d" % (q, api_key,sta,ove)
        time.sleep(5)  # pause between requests to avoid hitting the rate limit
        h = request.urlopen(url)
        html = h.read()
        text=html.decode('utf-8')
        obj = json.loads(text)
        b.append(obj['response']['meta']['hits'])
    dict_results[q]=b

plot_no_articles(dict_results, M)


Congratulations on completing the second notebook! Now it’s time for feedback.
1.	Pass your solution to the other pair in your group.
2.	Include your feedback in the other pair’s notebook. Don’t forget to add your names at the top.
3.	Return the notebook with feedback to the original pairs.
4.	Upload your notebook, with the feedback included by the other pair on OLAT.

You can think of/suggest (among other things)
 - improvements in the code (e.g. readability, efficiency)
 - improvements in the answers (e.g. are they easy to understand, are they correct, how can they be improved?)
 - point out differences (e.g. are there any differences between the responses of the two pairs? if yes what are they, what is the cause, and in which way can they be useful?)
 
Not all suggestions about the type of feedback apply to all types of questions. Try to give feedback in a meaningful and constructive way.

In [None]:
# Below there is space for giving feedback. This space should be used only by the other pair in your group.

'''
Feedback here
'''