## Introduction to Web APIs in Python
- UMN LATIS & Libraries workshop, Oct 30 2020
- Cody Hennesy (chennesy@umn.edu) and Michael Beckstrand (mjbeckst@umn.edu)

In this workshop we’ll use Python to query and download data using the NY Times API.

This workshop will cover how to:
* Use Python 3 in a JupyterLab computing environment
* Read API documentation to build successful API queries
* Use the Requests and JSON Python libraries to download data from the NY Times API
* Use built-in Python functions such as type, len, and dir to explore API data
* Explore API data in Python using dictionaries

Credits: Content for this workshop was adapted from the [DHSI 2019 APIs class](https://github.com/szweibel/DHSI-API-workshop) and from [Software Carpentry Python lessons](http://swcarpentry.github.io/python-novice-inflammation/).

### Why Python? 
- Reproducible
- Repeatable
- Extensible - popular libraries (packages such as pandas, matplotlib)
- Great for data access and data cleaning

### What's Jupyter?
- Web-based, so easy to share
- Easy to read, easy to run
- Run code piece by piece

### Brief intro to Python and Jupyter
- You can use Python as a calculator. 
- To "run" a Jupyter cell, hold down Shift and press Return/Enter, or choose the "play icon" (right-facing triangle) from the Jupyter menu above. 

In [None]:
2 + 3 * 5

But to do more interesting things, we will want to assign values to *variables*.

In [None]:
weight_kg = 60

In [None]:
print(weight_kg)

In Python, variable names:

* can include letters, digits, and underscores
* cannot start with a digit
* are case sensitive.

You can do calculations while printing:
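
For example, a minimal sketch that reuses the `weight_kg` variable from above (the 2.2 pounds-per-kilogram factor is an approximation chosen for illustration):

```python
weight_kg = 60

# print() evaluates the expression first, then displays the result
print('weight in pounds:', 2.2 * weight_kg)
```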

## Importing Libraries

In [None]:
import requests

Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.

In [None]:
requests.get('https://lib.umn.edu')

### Libraries and functions
The expression ```requests.get(...)``` is a function call that asks Python to run the function ```get``` which belongs to the ```requests``` library. 

This dotted notation is used everywhere in Python: the thing that appears before the dot contains the thing that appears after.

As an example, we could use the dot notation to write the relationship between Minneapolis and Minnesota as ```Minnesota.Minneapolis```, just as *get* is a function that belongs to the *requests* library.
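
The same pattern works with any library. As a small illustration, here is the dot notation with Python's built-in `math` library, used only as a convenient stand-in because it needs no internet connection:

```python
import math

# sqrt is a function that belongs to the math library
root = math.sqrt(16)
print(root)

# pi is a value (not a function) stored in the same library
print(math.pi)
```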

#### What did we do above?
1. Created a Python HTTP request object for a GET
2. Sent the HTTP request to the web server at lib.umn.edu
3. Received the response ```[200]``` from lib.umn.edu - [what's that mean?](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
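
Status codes are three-digit numbers with standard meanings. As a rough sketch, Python's built-in `http` library can look up the short description for a code without making any request (the codes below are chosen only as examples):

```python
from http import HTTPStatus

# Look up the standard phrase for a few common status codes
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

ok_phrase = HTTPStatus(200).phrase
```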

In Jupyter notebooks using Python you can explore functions of a library using the *tab* key.

And to understand each function you can get information by putting a question mark after it:

In [None]:
requests.get?

You can store the data that is returned from the get request in a variable:

In [None]:
umn_site = requests.get('https://lib.umn.edu')
print(umn_site)

Now you can explore the attributes of the data object, umn_site, using the same dot notation. Use tab to explore the options, and the question mark to read more about the attribute.

```umn_site.text```, for example.

In [None]:
#print(umn_site.headers)

### NYT API request
Let's apply what we've learned about Python and the Requests library to the NY Times API.

1. Create a variable called nyt_articles, and use requests.get to make an API call to the NY Times.
2. Make sure the URL includes:
 - your keywords (```q=```)
 - the ```begin_date``` and ```end_date``` filters, to make sure you don't go over your daily limit
 - your API key (```api-key=```)
 
Check out the [Developer site for an example URL](https://developer.nytimes.com/docs/articlesearch-product/1/overview).

We can also separate different filters out into their own variables and concatenate the full URL using the plus sign!

In [None]:
nyt_articles = requests.get("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=facebook&api-key=[INSERTKEY]&begin_date=20190101&end_date=20190102")

### String concatenation
We can concatenate strings together using the + operator. 
```
string_1 = "Hello"
string_2 = "world!"
# + does not add a space for us, so we include one ourselves
combined = string_1 + " " + string_2
print(combined)
```

"Hello world!"

In [None]:
base_url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?"
query = "q=facebook"
api_key = "&api-key="  # add your API key after the equals sign
dates = "&begin_date=20200101&end_date=20200102"

url = base_url+query+api_key+dates
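
Gluing strings together works, but it is easy to forget an `&` or to mishandle a space in a search term. One alternative, sketched here with the standard library's `urllib.parse.urlencode`, builds the query string from a dictionary (a data type covered later in this workshop). The key value `YOUR_KEY` is a placeholder, not a real key, and `example_url` is a name used only for this illustration.

```python
from urllib.parse import urlencode

base_url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?"

# Each filter becomes a key:value pair (YOUR_KEY is a placeholder)
params = {
    "q": "north korea",
    "api-key": "YOUR_KEY",
    "begin_date": "20200101",
    "end_date": "20200102",
}

# urlencode joins the pairs with & and escapes the space in the search term
example_url = base_url + urlencode(params)
print(example_url)
```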

In [None]:
nyt_articles = requests.get(url)

We can use a built-in Python function called type() to explore the results.

In [None]:
type(nyt_articles)

This means that nyt_articles is a Response object as defined in the requests library. But what is a Python object? And what can we do with it?

Objects in Python (and other programming languages) are basically containers that can hold data and/or functions inside them. When a function is inside an object, we usually call the function a "method." When data is inside an object, we usually call it an "attribute." The terminology isn't that important, though. What we do need to know is that you can access these "methods" and "attributes" with a . (a dot or period).
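
To see the difference between an attribute and a method without needing an API response, here is a small sketch using a date object from Python's built-in `datetime` library (the date itself is just an example value):

```python
from datetime import date

workshop_day = date(2020, 10, 30)

# An attribute is data inside the object: no parentheses
print(workshop_day.year)

# A method is a function inside the object: called with parentheses
print(workshop_day.isoformat())
```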

### Explore the NYT API Response object
- We stored the requests response from the NY Times API in a variable called ```nyt_articles```.
- Let's use Jupyter's tab autocomplete feature after the variable (```nyt_articles.```) to explore the different attributes of the response object. 

In [None]:
#nyt_articles.text

When you encounter an object, how can you learn its methods and attributes so you can use them? There are two main ways. The first, and likely the most practical, is to read the documentation of the library you're using.

It often happens, though, that the docs for a library you're using are confusing, nonexistent, or inaccurate. In these cases, you can try using the dir() function, which will tell you which methods and attributes are available in an object.

When using dir(), you'll mostly want to ignore the methods and attributes that have underscores around them. They mainly have to do with the internals of the Python language.
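
If the underscore names get in the way, one possible approach is to filter them out. This sketch runs dir() on an ordinary string rather than on the API response, and keeps only the names that do not begin with an underscore:

```python
sample = "Minnesota"

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(sample) if not name.startswith("_")]
print(public_names)
```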

In [None]:
dir(nyt_articles)

### Working with JSON and dictionaries
Let's import the json Python library to be able to dig more deeply into the data we acquired from the NY Times.

In [None]:
import json

The json library includes a function called ```loads``` that we can use to convert JSON text into something called a Python dictionary.

Let's create a variable to hold the dictionary, and then apply json.loads to our nyt_articles.text.

In [None]:
articles_dict = json.loads(nyt_articles.text)
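
To see what json.loads does on a small scale, here is a sketch with a short JSON string. The string is invented for this example and only loosely imitates the shape of an API response:

```python
import json

# A short JSON string, invented for this example
sample_text = '{"status": "OK", "response": {"docs": [], "meta": {"hits": 0}}}'

# loads ("load string") turns the text into Python data
sample_dict = json.loads(sample_text)
print(type(sample_dict))
print(sample_dict["status"])
```

The Requests library also offers a `.json()` method on the response object that performs the same conversion in one step.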

### Ignore: for workshop prep

In [None]:
with open('articles_dict.json', 'w') as fp:
    json.dump(articles_dict, fp)

### Load file if student does not have API key

In [None]:
with open('articles_dict.json', 'r') as fp:
    articles_dict = json.load(fp)

We can see that the results are something called a dict, by using our old friend type().

In [None]:
type(articles_dict)

### Python Dictionaries 

A Python dictionary is a way to hold a collection of items, using something called a 'key:value' pair. Each value is looked up by its key rather than by its position.

You can create an empty dictionary using curly brackets:

In [None]:
my_dictionary = {}

Or you could manually assign it keys and values, like so:

In [None]:
my_dictionary = {'Monday':'Apples', 'Tuesday':'Oranges'} 
my_dictionary

In [None]:
my_dictionary.keys()

In [None]:
my_dictionary.values()
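
Dictionaries can also grow after they are created. A brief sketch, using a separate dictionary called `menu` (a name made up for this example) so that `my_dictionary` stays unchanged:

```python
menu = {'Monday': 'Apples', 'Tuesday': 'Oranges'}

# Add a new key:value pair by assigning to a key that does not exist yet
menu['Wednesday'] = 'Pears'

# .get() returns a fallback value instead of raising an error for a missing key
friday_item = menu.get('Friday', 'nothing planned')

print(menu)
print(friday_item)
```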

In [None]:
articles_dict

You can explore the keys of any dictionary using the notation:

```my_dict.keys()```

In [None]:
articles_dict.keys()

You can look at the value of a specific key using square brackets:

```my_dict['key_name']```

In [None]:
articles_dict['status']

In [None]:
articles_dict['copyright']

In [None]:
articles_dict['response']

Wait a minute - that's a ton of information in a format we've seen before. 

Let's check out what type of object is inside of the response object value of this dictionary:

In [None]:
type(articles_dict['response'])

Whoa, it's a dictionary inside a dictionary. So now we can explore the keys and values in the response object.

In [None]:
articles_dict['response'].keys()

To look deeper inside of the dictionary object, we can just add another bracketed statement pointing to the key.

In [None]:
articles_dict['response']['meta']

In [None]:
#articles_dict['response']['docs']

Ok, so here's the content we want - the metadata about the actual articles! Let's assign this all to another variable so that it's a little easier to work with going forward.

In [None]:
articles_all = articles_dict['response']['docs']

In [None]:
articles_all.keys()

Why can't we look at the keys for the articles? 

If we take a closer look at the content, it looks like it starts with a square bracket, not a curly bracket. This is a sign that we're dealing with a list and not a dictionary!
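
Another way to check is to ask Python directly. A minimal sketch on a made-up structure with the same nesting as the API response (all values are invented):

```python
# Invented data with the same nesting as the API response
sample = {
    'response': {
        'docs': [{'web_url': 'https://example.com/a'},
                 {'web_url': 'https://example.com/b'}],
        'meta': {'hits': 2},
    }
}

docs = sample['response']['docs']
print(type(docs))              # a list, so it has no .keys() method
print(isinstance(docs, list))  # True when docs is a list
print(type(docs[0]))           # each item in the list is a dictionary
```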

We can use a new variable name to make that clearer.

In [None]:
articles_list = articles_dict['response']['docs']

You can use len() to see how many items are in the list:

In [None]:
len(articles_list)

You can also use len to count other things: to see how many characters are in a string, for example.

In [None]:
len("Minnesota")

### API Limits and paging
Why are there only ten articles in our list, when the response object told us there were 33 hits?

From the [NY Times API documentation](https://developer.nytimes.com/docs/articlesearch-product/1/overview): 

"The Article Search API returns a max of 10 results at a time. The meta node in the response contains the total number of matches ("hits") and the current offset. Use the page query parameter to paginate thru results (page=0 for results 1-10, page=1 for 11-20, ...). You can paginate thru up to 100 pages (1,000 results). If you get too many results try filtering by date range."

And from [their FAQ](https://developer.nytimes.com/faq#a11): 

"There are two rate limits per API: 4,000 requests per day and 10 requests per minute. You should sleep 6 seconds between calls to avoid hitting the per minute rate limit. If you need a higher rate limit, please contact us at code@nytimes.com."

So if we want to collect all of these responses, we'll need to page through them ten articles at a time, with a pause of at least six seconds between requests. We can do that later on using a "for loop."

### Python Lists

The most popular kind of data collection in Python is the list, which takes the place of arrays in programming languages like C and Fortran.
Lists have two important characteristics:
1. They are mutable, i.e., they can be changed after they are created.
2. They are heterogeneous, i.e., they can store values of many different types.

To create a new list, you can just put some values in square brackets with commas in between.

In [None]:
my_list = ['red', 'orange', 'yellow']
my_list

To fetch the element at a specific location, put the *index* of that location in square brackets.

In [None]:
my_list[1]

Why isn't this 'red'? Because Python list indexes start at 0!

In [None]:
my_list[3]

See, there's nothing at index 3 in this list, because the three elements are at indexes 0, 1, and 2. 

You can also display multiple items from the list using a slice, which just adds a colon between the range you'd like to show. 


In [None]:
my_list[0:2]

But why doesn't this show 'yellow'? I thought that was in the index location number 2? 

The 2 above means to end at the third element in the list, which is ```my_list[2]```, *but not to include it*. So if we want to show the whole list we give one above the final index location:

In [None]:
my_list[0:3]
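
Slices also have a few shortcuts. A short sketch, using a separate list called `colors` (a name chosen for this example):

```python
colors = ['red', 'orange', 'yellow']

first_two = colors[:2]   # leaving out the start means "from the beginning"
last_two = colors[1:]    # leaving out the end means "through the last item"
last_one = colors[-1]    # negative indexes count from the end

print(first_two, last_two, last_one)
```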

So let's go back to our list of articles.

In [None]:
articles_list[0]

This looks familiar - could it be another dictionary?!

In [None]:
type(articles_list[0])

In [None]:
articles_list[0].keys()

In [None]:
headline = articles_list[0]['headline']

In [None]:
headline

In [None]:
type(headline)

In [None]:
headline = articles_list[0]['headline']['main']

In [None]:
headline

### Loops

Ugh, this data is pretty gnarly: dictionaries inside of lists inside of dictionaries. How can we put it into a more accessible format?

One thing we can do is loop through each article and only grab the information that we care about.
We can store that information in lists.

### Note the syntax: 

```
for x in y:
    do_something # the code in the loop needs to be indented
    do_another_thing
```

In [None]:
#this is a list
headlines = []
n = 0
for article in articles_list:
    # Extract the title from the dictionary
    headline = article['headline']['main']
    print("loop", n, headline)
    # Add the title to the output list
    headlines.append(headline)
    n += 1

In [None]:
headlines

### What about the full set of 33 articles?

If you remember from before, the Article Search API will only return 10 articles at a time, and requires you to page through the responses in sets of ten, with a delay of six seconds (probably to keep their servers from getting overloaded with huge requests).

So we can set up another loop, to page through your search one page at a time. To do that, first we'll need a couple of handy Python libraries: math and time. 

In [None]:
# to set a timer and cycle through the pages, we need a couple of other Python libraries
import math
import time

Next, let's see if our search url is still ready to go.

In [None]:
print(url)

In [None]:
# if you need to load a search url:
url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=facebook&api-key=[INSERTKEY]&begin_date=20200101&end_date=20200102"

First, let's make another request for the url, and convert the data into a dictionary.

In [None]:
# make request
r = requests.get(url)

# convert to a dictionary
data=json.loads(r.text)
print(r.url)
print()

Next, let's get the number of articles/hits and calculate the number of pages we'll need to page through by dividing the number of hits by 10. We'll use a function of the math library called .ceil() to help us round that number up to the next largest integer. If there are 33 hits, for example, we'll need to request 4 pages of results, so want 33/10 to round up to 4.

In [None]:
# get number of hits
hits = data['response']['meta']['hits']
print("number of hits:", str(hits))

# get number of pages
pages = int(math.ceil(hits/10))
print("number of pages:", pages)
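
To get a feel for the rounding, here is a sketch that tries math.ceil on a few made-up hit counts. It also applies the 100-page ceiling mentioned in the API documentation quoted earlier, using the built-in min() function; `sample_hits`, `pages_for_33`, and `capped_pages` are names invented for this example.

```python
import math

# Try the rounding on a few made-up hit counts
for sample_hits in (33, 30, 1, 0):
    print(sample_hits, "hits ->", math.ceil(sample_hits / 10), "pages")

pages_for_33 = math.ceil(33 / 10)

# A very large search would need more pages than the API allows,
# so take whichever is smaller: the pages needed or the 100-page cap
capped_pages = min(math.ceil(2500 / 10), 100)
print("capped pages:", capped_pages)
```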

Now let's create an empty list where we can collect all of the articles one page at a time using a for loop. 

We can loop through the four pages by using the range() function, which returns a sequence of numbers that starts at 0 by default, increases by 1 by default, and stops before a specified number (in our case, the number of pages).

Inside of the for loop we'll first pause for seven seconds using time.sleep(), so that we don't go over the limit of sending requests faster than every six seconds.

Then we'll make our request, adding the page number from our range() index, i, to the end of the URL as a new parameter, ```&page=```. The ampersand separates it from the parameters already in the URL.

Next we'll grab the data we want, the .text attribute, and convert it from JSON into a Python dictionary.

Then we'll add each doc to the list, all_docs, using the extend() method, which adds all of the members of a list to the end of an existing list.
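
Before involving the API, the looping and collecting pattern can be tried offline. In this sketch the "pages" are small invented lists rather than real responses, and `collected` and `page_items` are names made up for the example:

```python
collected = []

# range(3) produces 0, 1, 2
for i in range(3):
    # stand-in for one page of results (invented values)
    page_items = ["page " + str(i) + " item a", "page " + str(i) + " item b"]
    # extend() adds each item; append() would add the whole list as one item
    collected.extend(page_items)

print(len(collected))
print(collected)
```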


In [None]:
# make an empty list where we'll hold all of our docs for every page
all_docs = [] 

# now we're ready to loop through the pages
for i in range(pages):
    print("collecting page", str(i))
    time.sleep(7)

    # make request
    r = requests.get(url+'&page='+str(i))

    # get text and convert to a dictionary
    data=json.loads(r.text)

    # get just the docs
    docs = data['response']['docs']

    # add those docs to the big list
    all_docs.extend(docs)

In [None]:
len(all_docs)

### Functions

Now let's define a function ```format_articles``` that cycles through our ```all_docs``` list, and grabs all of the data we care about.

The function definition opens with the keyword ```def``` followed by the name of the function (format_articles) and a parenthesized list of parameter names (unformatted_docs). The body of the function — the statements that are executed when it runs — is indented below the definition line. The body concludes with a return keyword followed by the value we want to take from the function.
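
As a warm-up, here is a much smaller function with the same parts: the ```def``` keyword, a name, parameters, an indented body, and a return. The function `shorten` is invented for this example.

```python
def shorten(text, length):
    # the body is indented below the definition line
    short = text[0:length]
    # return hands the value back to whoever called the function
    return short

preview = shorten("Introduction to Web APIs in Python", 12)
print(preview)
```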

In [None]:
def format_articles(unformatted_docs):
    formatted = []
    counter = 0
    for i in unformatted_docs:
        print("formatting article number", counter)
        dic = {}
        dic['web_url'] = i['web_url']
        dic['headline'] = i['headline']['main']
        dic['abstract'] = i['abstract']
        dic['date'] = i['pub_date'][0:10] # cutting time of day.
        if i['lead_paragraph']:
            dic['lead_paragraph'] = i['lead_paragraph']
        dic['word_count'] = i['word_count']
        formatted.append(dic)
        counter += 1
    return(formatted) 
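
The function above assumes that every article contains every field; if a key were missing, Python would raise a KeyError. Whether the API ever omits a field is not confirmed here, but a defensive variation can be sketched with the dictionary .get() method, which returns None (or a fallback you choose) for a missing key. The helper `format_one` and the sample record are hypothetical.

```python
def format_one(doc):
    """Hypothetical helper: format one article, tolerating missing fields."""
    return {
        'web_url': doc.get('web_url'),
        # fall back to an empty dictionary so the second .get() still works
        'headline': doc.get('headline', {}).get('main'),
        # fall back to an empty string so the slice still works
        'date': doc.get('pub_date', '')[0:10],
    }

# Invented record with no 'headline' field
incomplete = {'web_url': 'https://example.com/a',
              'pub_date': '2020-04-01T10:00:00+0000'}
result = format_one(incomplete)
print(result)
```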

In [None]:
all_formatted = format_articles(all_docs)

In [None]:
all_formatted[4]

In [None]:
all_formatted[4]['headline']

### Putting it all together: NY Times API Function
Because the NYTimes API is consistently structured, we can build a function that does all of the work of the API call for us from a simple series of inputs. We can then call the function at any time by providing keywords and start and end dates, and Python will do the rest of the work.

Specifically, we want a function that will:
1. Build the url and send the API request to api.nytimes.com
2. Pull down the json data, loop through multiple pages of results (so we can view more than ten articles)
3. Run the whole thing on a timer so that you do *not* hit the API limit

In [None]:
### Get API data function

def get_api_data(term, date_s, date_e):
    # set base url
    base_url="http://api.nytimes.com/svc/search/v2/articlesearch.json"
    key = "" #add your api key here
    
    # set search parameters
    search_params = {"q":term,
                     "api-key":key,
                     "begin_date": str(date_s), 
                     "end_date":str(date_e)}

    # make request
    r = requests.get(base_url, params=search_params)

    # convert to a dictionary
    data=json.loads(r.text)
    print(r.url)
    print()
    
    # get number of hits
    hits = data['response']['meta']['hits']
    print("number of hits:", str(hits))

    # get number of pages
    pages = int(math.ceil(hits/10))

    # make an empty list where we'll hold all of our docs for every page
    all_docs = [] 

    # now we're ready to loop through the pages
    for i in range(pages):
        print("collecting page", str(i))
        time.sleep(7)
        # set the page parameter
        search_params['page'] = i

        # make request
        r = requests.get(base_url, params=search_params)
        
        # get text and convert to a dictionary
        data=json.loads(r.text)
        
        # get just the docs
        docs = data['response']['docs']

        # add those docs to the big list
        all_docs = all_docs + docs
        
    return(all_docs)

In [None]:
all_north = get_api_data("north korea", 20200401, 20200430) # add your search terms between quotes; dates must be in YYYYMMDD format

In [None]:
#how many results are there?
len(all_north)

In [None]:
#all_north[1]

Now let's format those articles using our old ```format_articles``` function!

In [None]:
formatted_articles = format_articles(all_north)

In [None]:
formatted_articles[0]

### Save to JSON
When you've collected a lot of data like this it can be helpful to save it for later use. Which format you want to use to save data in Python depends on the structure of the data, but common formats are CSVs, JSON, and pickle files. 

Since our formatted_articles object is a list of dictionaries, let's save it to JSON. First we need to open a blank JSON file using a "with open" statement. We'll tell the open() function the name of the file and pass the 'w' parameter to say we want to write to the file ('r' would be to read a file). fp will be the variable name for the file inside of the with statement. 

Then we'll use a json function called dump() to take the formatted_articles list and write it to the fp file. 

In [None]:
with open('formatted_articles.json', 'w') as fp:
    json.dump(formatted_articles, fp)

Now that your data is saved you can quit Jupyter/Python and rest assured your data is available for later. The next time you start your work you can load up the JSON file by reading it ('r') and then saving the content via json.load() into a new variable. 

In [None]:
with open('formatted_articles.json', 'r') as fp:
    formatted_articles = json.load(fp)
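
CSV is another of the formats mentioned above, and it opens easily in a spreadsheet. The following is a rough sketch using Python's built-in `csv` library on a couple of invented records shaped like the output of ```format_articles```; the file name and sample values are made up for illustration.

```python
import csv
import os
import tempfile

# Invented records shaped like the output of format_articles
sample_articles = [
    {'headline': 'First headline', 'date': '2020-04-01', 'word_count': 500},
    {'headline': 'Second headline', 'date': '2020-04-02', 'word_count': 750},
]

# Write to a temporary folder so the example does not clutter your working folder
csv_path = os.path.join(tempfile.mkdtemp(), 'sample_articles.csv')

with open(csv_path, 'w', newline='') as fp:
    writer = csv.DictWriter(fp, fieldnames=['headline', 'date', 'word_count'])
    writer.writeheader()               # column names go on the first row
    writer.writerows(sample_articles)  # one row per dictionary

# Read the file back in; each row comes back as a dictionary
with open(csv_path, 'r', newline='') as fp:
    rows = list(csv.DictReader(fp))

print(rows[0]['headline'])
```

Note that CSV files store everything as text, so a number such as the word count comes back as a string when the file is read.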

### Challenge
Write a loop to print the headlines, urls, and dates for each article in your data.


1. Within your formatted_articles list, each item is a dictionary of key:value pairs. To display a value, use the syntax ```variable_name['key_name']``` 
2. A good way to start the loop would be: ```for article in formatted_articles:```
3. It's helpful to display the article number as you print each article. You can use a counter variable to do that, and use ```x += 1``` at the end of the loop to keep the counter in step with the loop. (There's also a function in Python called [enumerate()](https://www.w3schools.com/python/ref_func_enumerate.asp) that you could use instead of a manual counter, if you want)
4. Within ```print()``` you can call multiple variables using commas: ```print(variable_one, '\n', variable_two['key_name'])```
 - Note that ```'\n'``` will add a line break to the output, which can make things easier to read.


In [None]:
x = 0
for article in formatted_articles:
    print(x, article['headline'], '\n', article['web_url'], '\n', article['date'], '\n')
    x += 1
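
The enumerate() function mentioned in the hints can replace the manual counter. A possible version, shown on a couple of invented records so that it runs without an API key:

```python
# Invented records shaped like formatted_articles
sample_articles = [
    {'headline': 'First headline', 'web_url': 'https://example.com/a', 'date': '2020-04-01'},
    {'headline': 'Second headline', 'web_url': 'https://example.com/b', 'date': '2020-04-02'},
]

lines = []
# enumerate() hands back a running number along with each item
for number, article in enumerate(sample_articles):
    line = str(number) + " " + article['headline'] + " (" + article['date'] + ")"
    lines.append(line)
    print(line, '\n', article['web_url'], '\n')
```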