## Introduction to Web APIs in Python
- UMN LATIS & Libraries workshop, Oct 30 2020
- Cody Hennesy (chennesy@umn.edu) and Michael Beckstrand (mjbeckst@umn.edu)

In this workshop we’ll use Python to query and download data using the NY Times API.

This workshop will cover how to:
* Use Python 3 in a JupyterLab computing environment
* Read API documentation to build successful API queries
* Use the Requests and JSON Python libraries to download data from the NY Times API
* Use built-in Python functions such as type, len, and dir to explore API data
* Explore API data in Python using dictionaries

Credits: Content for this workshop was adapted from the [DHSI 2019 APIs class](https://github.com/szweibel/DHSI-API-workshop) and from [Software Carpentry Python lessons](http://swcarpentry.github.io/python-novice-inflammation/).

### Why Python? 
- Reproducible
- Repeatable
- Extensible - popular libraries (packages such as pandas, matplotlib)
- Great for data access and data cleaning

### What's Jupyter?
- Web-based, so easy to share
- Easy to read, easy to run
- Run code piece by piece

### Brief intro to Python and Jupyter
- You can use Python as a calculator. 
- To "run" a Jupyter cell, hold down Shift and press Return/Enter, or choose the "play icon" (right-facing triangle) from the Jupyter menu above. 

In [1]:
2 + 3 * 5

17

But to do more interesting things, we will want to assign values to *variables*.

In [2]:
weight_kg = 60

In [3]:
print(weight_kg)

60


In Python, variable names:

* can include letters, digits, and underscores
* cannot start with a digit
* are case sensitive.

You can do calculations while printing:

In [4]:
print(2.2 * weight_kg)

132.0


But it might make more sense to create a variable to help out:

In [5]:
weight_lb = 2.2 * weight_kg

In [6]:
print(weight_lb)

132.0


In [7]:
print(weight_kg)

60


In [8]:
weight_kg = 65.0
print('kg:', weight_kg)
print('lbs: ', weight_lb)

kg: 65.0
lbs:  132.0


Oh no, what happened? The weight in pounds didn't update.

```weight_lb``` stored the result of the calculation at the moment it was assigned; it does not keep a link to ```weight_kg```. To update it, you need to re-assign the variable:

In [9]:
weight_lb = 2.2 * weight_kg
print('lbs: ', weight_lb)

lbs:  143.0


### String concatenation

Be careful though: you can also concatenate strings together using a plus sign, so the + operator has multiple uses.

In [10]:
base_url = "http://google.com/search?q="
query = "ada lovelace"
search = base_url + query
print(search)

http://google.com/search?q=ada lovelace
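Note that the URL printed above contains a literal space, which is not valid inside a URL. As a minimal sketch (the workshop itself does not use it), the standard library's ```urllib.parse.quote_plus``` can encode the query before it is concatenated:

```python
from urllib.parse import quote_plus

base_url = "http://google.com/search?q="
query = "ada lovelace"

# quote_plus replaces spaces with '+' and escapes other unsafe characters
search = base_url + quote_plus(query)
print(search)
```

This should print the same URL with ```ada+lovelace``` in place of ```ada lovelace```.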


## Importing Libraries

In [11]:
import requests

Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.

In [12]:
requests.get('https://lib.umn.edu')

<Response [200]>

### Libraries and functions
The expression ```requests.get(...)``` is a function call that asks Python to run the function ```get``` which belongs to the ```requests``` library. 

This dotted notation is used everywhere in Python: the thing that appears before the dot contains the thing that appears after.

As an example, we could use the dot notation to write the relationship between Minneapolis and Minnesota as ```Minnesota.Minneapolis```, just as *get* is a function that belongs to the *requests* library.

#### What did we do above?
1. Created a Python HTTP request object for a GET request
2. Sent the HTTP request to the web server at lib.umn.edu
3. Received the response ```[200]``` from lib.umn.edu - [what does that mean?](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
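To get a feel for what status codes such as ```200``` mean without leaving Python, here is a small sketch using the standard library's ```http.HTTPStatus```. The three codes are just illustrative picks: 200 is the one we received above, 404 is the familiar "page not found", and 429 is commonly sent when a client makes too many requests.

```python
from http import HTTPStatus

# Look up the standard phrase for a few common status codes
phrases = {}
for code in (200, 404, 429):
    phrases[code] = HTTPStatus(code).phrase
    print(code, phrases[code])
```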

In Jupyter notebooks using Python you can explore functions of a library using the *tab* key.

And to understand each function you can get information by putting a question mark after it:

In [13]:
requests.get?

Signature: requests.get(url, params=None, **kwargs)
Docstring:
Sends a GET request.

:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
:return: :class:`Response <Response>` object
:rtype: requests.Response
File:      /anaconda3/lib/python3.7/site-packages/requests/api.py
Type:      function


You can store the data that is returned from the get request in a variable:

In [14]:
umn_site = requests.get('https://lib.umn.edu')
print(umn_site)

<Response [200]>


Now you can explore the attributes of the data object, umn_site, using the same dot notation. Use tab to explore the options, and the question mark to read more about the attribute.

```umn_site.text```, for example.

In [None]:
#print(umn_site.headers)

### NYT API request
Let's apply what we've learned about Python and the Requests library to the NY Times API.

1. Create a variable called nyt_articles, and use requests.get to make an API call to the NY Times.
2. Make sure the URL includes your:
 - keywords (```q=```)
 - the ```begin_date``` and ```end_date``` filters to make sure you don't go over your daily limit
 - your API key (```api-key=```)
 
Check out the [Developer site for an example URL](https://developer.nytimes.com/docs/articlesearch-product/1/overview).

We can also separate different filters out into their own variables and concatenate the full URL using the plus sign!

In [15]:
nyt_articles = requests.get("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=facebook&api-key=[INSERTKEY]&begin_date=20190101&end_date=20190102")

In [16]:
base_url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?"
query = "q=facebook"
api_key = "&api-key=[INSERTKEY]"
dates = "&begin_date=20190101&end_date=20190102"

url = base_url+query+api_key+dates

In [17]:
nyt_articles = requests.get(url)

We can use a built in Python function called type() to explore the results.

In [18]:
type(nyt_articles)

requests.models.Response

This means that nyt_articles is a Response object as defined in the requests library. But what is a Python object? And what can we do with it?

Objects in Python (and other programming languages) are basically containers that can hold data and/or functions inside them. When a function is inside an object, we usually call the function a "method." When data is inside an object, we usually call it an "attribute." The terminology isn't that important, though. What we do need to know is that you can access these "methods" and "attributes" with a . (a dot or period).
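As a quick illustration of the difference, here is a sketch using a ```date``` object from the standard library (chosen only because it needs no network access; it is not part of the NY Times workflow). Attributes are read without parentheses, while methods are called with them:

```python
from datetime import date

workshop_day = date(2020, 10, 30)

# An attribute holds data: no parentheses
year = workshop_day.year
print(year)

# A method is a function inside the object: call it with parentheses
iso_text = workshop_day.isoformat()
print(iso_text)
```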

### Explore the NYT API Response object
- We stored the requests response from the NY Times API in a variable called ```nyt_articles.```
- Let's use Jupyter's tab autocomplete feature after the variable (```nyt_articles.```) to explore the different attributes of the response object. 

In [20]:
#nyt_articles.text

When you encounter an object, how can you learn its methods and attributes so you can use them? There are two main ways. The first, and likely the most practical, is to read the documentation of the library you're using.

It often happens, though, that the docs for a library you're using are confusing, nonexistent, or inaccurate. In these cases, you can try using the dir() function, which will tell you which methods and attributes are available in an object.

When using dir(), you'll mostly want to ignore the methods and attributes that have underscores around them. They mainly have to do with the internals of the Python language.

In [21]:
dir(nyt_articles)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']
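If the underscore names get in the way, one option is to filter them out. The sketch below does this for a plain list so that it runs without an API key; the same line should work with ```nyt_articles``` in place of ```sample```. The variable names are made up for this example.

```python
sample = ['red', 'orange', 'yellow']

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(sample) if not name.startswith('_')]
print(public_names)
```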

### Working with JSON and dictionaries
Let's import the json python library to be able to dig more deeply into the data we acquired from the NY Times.

In [22]:
import json

json includes a function called ```.loads``` that we can use to assign the json data to something called a Python dictionary.

Let's create a variable to hold the dictionary, and then apply json.loads to our nyt_articles.text.

In [23]:
articles_dict = json.loads(nyt_articles.text)

We can see that the results are something called a dict, by using our old friend type().

In [24]:
type(articles_dict)

dict
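To see what ```json.loads``` does on its own, here is a small sketch that converts a hand-written JSON string into a dictionary. The text in ```sample_text``` is invented for illustration and is much simpler than a real API response.

```python
import json

# Hypothetical JSON text, loosely shaped like an API response
sample_text = '{"status": "OK", "copyright": "Example", "response": {}}'

sample_dict = json.loads(sample_text)
print(type(sample_dict))
print(sample_dict['status'])
```

The input is a string and the output is a dictionary, which is why ```type()``` reports ```dict``` above.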

### Python Dictionaries 

A Python dictionary is a collection of items stored as 'key:value' pairs: you look up each value by its key rather than by its position.

You can create an empty dictionary using curly brackets:

In [25]:
my_dictionary = {}

Or you could manually assign it keys and values, like so:

In [26]:
my_dictionary = {'Monday':'Apples', 'Tuesday':'Oranges'} 
my_dictionary

{'Monday': 'Apples', 'Tuesday': 'Oranges'}

In [27]:
my_dictionary.keys()

dict_keys(['Monday', 'Tuesday'])

In [28]:
my_dictionary.values()

dict_values(['Apples', 'Oranges'])
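A few other everyday dictionary operations, sketched on the same example: adding a new key, checking whether a key exists, and using ```.get()``` to look up a key that might be missing without raising an error. The 'Wednesday' and 'Friday' entries are invented for this example.

```python
my_dictionary = {'Monday': 'Apples', 'Tuesday': 'Oranges'}

# Add a new key:value pair
my_dictionary['Wednesday'] = 'Pears'

# Check whether a key exists
has_wednesday = 'Wednesday' in my_dictionary
print(has_wednesday)

# .get() returns None (or a default you choose) when the key is missing
missing = my_dictionary.get('Friday')
fallback = my_dictionary.get('Friday', 'nothing planned')
print(missing, fallback)
```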

In [30]:
articles_dict

You can explore the keys of any dictionary using the notation:

```my_dict.keys()```

In [31]:
articles_dict.keys()

dict_keys(['status', 'copyright', 'response'])

You can look at the value of a specific key using square brackets:

```my_dict['key_name']```

In [32]:
articles_dict['status']

'OK'

In [33]:
articles_dict['copyright']

'Copyright (c) 2020 The New York Times Company. All Rights Reserved.'

In [35]:
articles_dict['response']

Wait a minute - that's a ton of information in a format we've seen before. 

Let's check what type of object is stored under the ```'response'``` key of this dictionary:

In [36]:
type(articles_dict['response'])

dict

Whoa, it's a dictionary inside a dictionary. So now we can explore the keys and values in the response object.

In [37]:
articles_dict['response'].keys()

dict_keys(['docs', 'meta'])

To look deeper inside of the dictionary object, we can just add another bracketed statement pointing to the key.

In [38]:
articles_dict['response']['meta']

{'hits': 36, 'offset': 0, 'time': 37}

In [None]:
#articles_dict['response']['docs']

Ok, so here's the content we want - the metadata about the actual articles! Let's assign this all to another variable so that it's a little easier to work with going forward.

In [39]:
articles_all = articles_dict['response']['docs']

In [40]:
articles_all.keys()

AttributeError: 'list' object has no attribute 'keys'

Why can't we look at the keys for the articles? 

If we take a closer look at the content, it looks like it starts with a square bracket, not a curly bracket. This is a sign that we're dealing with a list and not a dictionary!

We can assign the same data to a new variable whose name makes that clearer.

In [41]:
articles_list = articles_dict['response']['docs']

### Python Lists

The most popular kind of data collection in Python is the list, which takes the place of arrays in programming languages like C and Fortran.
Lists have two important characteristics:
1. They are mutable, i.e., they can be changed after they are created.
2. They are heterogeneous, i.e., they can store values of many different types.
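Both characteristics can be seen in a short sketch (the values are arbitrary):

```python
# Heterogeneous: strings, numbers, and None can share one list
mixed = ['red', 3, 2.5, None]

# Mutable: we can add items and replace existing ones
mixed.append({'key': 'value'})
mixed[0] = 'blue'

print(mixed)
print(len(mixed))
```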

To create a new list, you can just put some values in square brackets with commas in between.

In [42]:
my_list = ['red', 'orange', 'yellow']
my_list

['red', 'orange', 'yellow']

To fetch the element at a specific location, put the *index* of that location in square brackets.

In [43]:
my_list[1]

'orange'

Why isn't this 'red'? Because Python list indexes start at 0!

In [44]:
my_list[3]

IndexError: list index out of range

See, there's nothing at index 3, because the three elements are at indexes 0, 1, and 2. 

You can also display multiple items from the list using a slice, which just adds a colon between the range you'd like to show. 


In [45]:
my_list[0:2]

['red', 'orange']

But why doesn't this show 'yellow'? I thought that was in the index location number 2? 

The 2 above means to end at the third element in the list, which is ```my_list[2]```, *but not to include it*. So if we want to show the whole list we give one above the final index location:

In [46]:
my_list[0:3]

['red', 'orange', 'yellow']
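Slices also have a few shortcuts worth knowing. In this sketch, leaving out the start or end of the range means "from the beginning" or "to the end", and a negative index counts from the end of the list:

```python
my_list = ['red', 'orange', 'yellow']

first_two = my_list[:2]     # same as my_list[0:2]
from_second = my_list[1:]   # from index 1 to the end
last_item = my_list[-1]     # the final element

print(first_two)
print(from_second)
print(last_item)
```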

You can use len to see how many items are in the list:

In [47]:
len(my_list)

3

You can also use len to count other things: to see how many characters are in a string, for example.

In [48]:
len("Minnesota")

9

So let's go back to our list of articles.

In [50]:
articles_list[0]

This looks familiar - could it be another dictionary?!

In [51]:
type(articles_list[0])

dict

In [52]:
articles_list[0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section', 'print_page', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

In [53]:
headline = articles_list[0]['headline']

In [54]:
headline

{'main': 'Big Tech May Look Troubled, but It’s Just Getting Started',
 'kicker': None,
 'content_kicker': None,
 'print_headline': 'Did Troubles Clip Wings Of Big Tech? Not Just Yet',
 'name': None,
 'seo': None,
 'sub': None}

In [55]:
type(headline)

dict

In [56]:
headline = articles_list[0]['headline']['main']

In [57]:
headline

'Big Tech May Look Troubled, but It’s Just Getting Started'

### Loops

Ugh, this data is pretty gnarly: dictionaries inside of lists inside of dictionaries. How can we put it into a more accessible format?

One thing we can do is loop through each article and only grab the information that we care about.
We can store that information in lists.

### Note the syntax: 

```
for x in y:
    do_something # the code in the loop needs to be indented
    do_another_thing
```

In [58]:
#this is a list
headlines = []
n = 0
for article in articles_list:
    # Extract the title from the dictionary
    headline = article['headline']['main']
    print("loop", n, headline)
    # Add the title to the output list
    headlines.append(headline)
    n += 1

loop 0 Big Tech May Look Troubled, but It’s Just Getting Started
loop 1 Was I Wrong to Facebook-Friend My Nephew’s Girlfriend?
loop 2 Happy New Year
loop 3 The Lives They Loved: Janet Cunningham
loop 4 Mitt Romney, Piling On Trump
loop 5 Gillian Anderson Reads ‘Maternal Wisdom (5 Pounds’ Worth)’
loop 6 How Tech Surprised and Scared Us in 2018
loop 7 Will 2019 Be a Good Year for Investors? Here Are 4 Key Factors Affecting Stocks and the Economy
loop 8 DealBook Briefing: What Could Go Wrong in 2019? Plenty
loop 9 What to Cook Right Now


In [59]:
headlines

['Big Tech May Look Troubled, but It’s Just Getting Started',
 'Was I Wrong to Facebook-Friend My Nephew’s Girlfriend?',
 'Happy New Year',
 'The Lives They Loved: Janet Cunningham',
 'Mitt Romney, Piling On Trump',
 'Gillian Anderson Reads ‘Maternal Wisdom (5 Pounds’ Worth)’',
 'How Tech Surprised and Scared Us in 2018',
 'Will 2019 Be a Good Year for Investors? Here Are 4 Key Factors Affecting Stocks and the Economy',
 'DealBook Briefing: What Could Go Wrong in 2019? Plenty',
 'What to Cook Right Now']
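Once the loop pattern feels comfortable, the same extraction can be written more compactly as a list comprehension. This is only a sketch on invented sample data (```sample_articles``` stands in for ```articles_list``` so the cell runs without an API call):

```python
# Hypothetical stand-in for articles_list
sample_articles = [
    {'headline': {'main': 'First sample headline'}},
    {'headline': {'main': 'Second sample headline'}},
]

# Build the list of headlines in a single expression
sample_headlines = [article['headline']['main'] for article in sample_articles]
print(sample_headlines)
```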

### Putting it all together
#### Python functions

We’d like a way to package our code so that it is easier to reuse, and Python provides for this by letting us define things called ‘functions’ — a shorthand way of re-executing longer pieces of code. 

Let’s start by defining a function ```format_articles``` that cycles through our ```articles_list```, and grabs all of the data we care about.

The function definition opens with the keyword ```def``` followed by the name of the function (format_articles) and a parenthesized list of parameter names (unformatted_docs). The body of the function — the statements that are executed when it runs — is indented below the definition line. The body concludes with a return keyword followed by the value we want to take from the function.

In [60]:
def format_articles(unformatted_docs):
    formatted = []
    counter = 0
    for i in unformatted_docs:
        print("formatting article number", counter)
        dic = {}
        dic['web_url'] = i['web_url']
        dic['headline'] = i['headline']['main']
        dic['abstract'] = i['abstract']
        dic['date'] = i['pub_date'][0:10] # cutting time of day.
        if i['lead_paragraph']:
            dic['lead_paragraph'] = i['lead_paragraph']
        dic['word_count'] = i['word_count']
        formatted.append(dic)
        counter += 1
    return(formatted) 
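
The function above assumes that every article contains every key it asks for; a missing key would raise a ```KeyError```. As a hedged variation (not part of the original workshop), the sketch below formats a single article using ```.get()``` so that missing fields become ```None``` instead. The name ```format_article_safe``` and the sample data are invented for illustration.

```python
def format_article_safe(doc):
    """Format one article dictionary, tolerating missing keys."""
    headline = doc.get('headline') or {}
    return {
        'web_url': doc.get('web_url'),
        'headline': headline.get('main'),
        'abstract': doc.get('abstract'),
        # keep only the YYYY-MM-DD part of the date, if there is one
        'date': (doc.get('pub_date') or '')[0:10],
        'word_count': doc.get('word_count'),
    }

# Hypothetical article with some fields missing
sample_doc = {
    'web_url': 'https://example.com/article',
    'headline': {'main': 'Sample headline'},
    'pub_date': '2019-01-02T10:00:00+0000',
}

formatted_sample = format_article_safe(sample_doc)
print(formatted_sample)
```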

In [61]:
all_formatted = format_articles(articles_list)

formatting article number 0
formatting article number 1
formatting article number 2
formatting article number 3
formatting article number 4
formatting article number 5
formatting article number 6
formatting article number 7
formatting article number 8
formatting article number 9


In [62]:
all_formatted[4]

{'web_url': 'https://www.nytimes.com/2019/01/02/opinion/romney-trump-republican-party.html',
 'headline': 'Mitt Romney, Piling On Trump',
 'abstract': 'The president is facing a level of intraparty criticism that has no recent precedent.',
 'date': '2019-01-02',
 'lead_paragraph': 'This article is part of the Opinion Today newsletter. You can sign up here to receive the newsletter each weekday.',
 'word_count': 575}

In [63]:
all_formatted[4]['headline']

'Mitt Romney, Piling On Trump'

### NY Times API Function
Because the NYTimes API is consistently structured, we can build a function that does all of the work of the API call for us from a simple series of inputs. We can then call the function at any time by providing keywords and start and end dates, and Python will do the rest of the work.

Specifically, we want a function that will:
1. Build the url and send the API request to api.nytimes.com
2. Pull down the json data, loop through multiple pages of results (so we can view more than ten articles)
3. Run the whole thing on a timer so that you do *not* hit the API limit

In [64]:
# to set a timer and cycle through the pages, we need a couple of other Python libraries
import math
import time

In [68]:
### Get API data function

def get_api_data(term, date_s, date_e):
    # set base url
    base_url="http://api.nytimes.com/svc/search/v2/articlesearch.json"
    key = "" #add your api key here
    
    # set search parameters
    search_params = {"q":term,
                     "api-key":key,
                     "begin_date": str(date_s), 
                     "end_date":str(date_e)}

    # make request
    r = requests.get(base_url, params=search_params)

    # convert to a dictionary
    data=json.loads(r.text)
    print(r.url)
    print()
    
    # get number of hits
    hits = data['response']['meta']['hits']
    print("number of hits:", str(hits))

    # get number of pages
    pages = int(math.ceil(hits/10))

    # make an empty list where we'll hold all of our docs for every page
    all_docs = [] 

    # now we're ready to loop through the pages
    for i in range(pages):
        print("collecting page", str(i))
        time.sleep(7)
        # set the page parameter
        search_params['page'] = i

        # make request
        r = requests.get(base_url, params=search_params)
        
        # get text and convert to a dictionary
        data=json.loads(r.text)
        
        # get just the docs
        docs = data['response']['docs']

        # add those docs to the big list
        all_docs = all_docs + docs
        
    return(all_docs)
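
The page calculation inside the function is worth a closer look. The sketch below isolates it in a small helper (```count_pages``` is a made-up name), assuming ten results per page as the ```hits/10``` in the function implies. Rounding up with ```math.ceil``` makes sure a partly filled last page is still collected.

```python
import math

def count_pages(hits, per_page=10):
    """Return how many pages are needed to hold the given number of hits."""
    return math.ceil(hits / per_page)

for hits in (0, 9, 10, 13, 36):
    print(hits, "hits ->", count_pages(hits), "page(s)")
```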

In [69]:
all_north = get_api_data("north korea", 20200401, 20200405) # add your search terms between quotes; dates must be in YYYYMMDD format

http://api.nytimes.com/svc/search/v2/articlesearch.json?q=north+korea&api-key=25ixfQ4setpYPVT9KmVeyh9aKM9Cmn44&begin_date=20200401&end_date=20200405

number of hits: 13
collecting page 0
collecting page 1
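
The printed URL shows that the ```search_params``` dictionary was turned into a query string, with the space in "north korea" encoded as ```+```. The standard library's ```urlencode``` produces the same kind of encoding, so it is a handy way to preview a query string without sending a request. This is a sketch with a placeholder key, not a description of how Requests works internally.

```python
from urllib.parse import urlencode

search_params = {
    "q": "north korea",
    "api-key": "YOUR_KEY_HERE",  # placeholder, not a real key
    "begin_date": "20200401",
    "end_date": "20200405",
}

query_string = urlencode(search_params)
print(query_string)
```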


In [70]:
#how many results are there?
len(all_north)

13

In [None]:
#all_north[1]

Now let's format those articles using our old ```format_articles``` function!

In [71]:
formatted_articles = format_articles(all_north)

formatting article number 0
formatting article number 1
formatting article number 2
formatting article number 3
formatting article number 4
formatting article number 5
formatting article number 6
formatting article number 7
formatting article number 8
formatting article number 9
formatting article number 10
formatting article number 11
formatting article number 12


In [72]:
formatted_articles[0]

{'web_url': 'https://www.nytimes.com/interactive/2020/04/03/briefing/coronavirus-ventilators-ellis-marsalis-jr-news-quiz.html',
 'headline': 'News Quiz: Coronavirus, Ventilators, Ellis Marsalis Jr.',
 'abstract': 'Did you follow the headlines this week?',
 'date': '2020-04-03',
 'lead_paragraph': 'Did you follow the headlines this week?',
 'word_count': 0}

### Putting it all together
We can write a loop to print the headlines, urls, and dates for each article in your data.


1. Within your formatted_articles list, each item is represented in a dictionary key:value pair. To display the value, use the syntax ```variable_name['key_name']``` 
2. A good way to start the loop would be: ```for article in formatted_articles:```
3. It's helpful to display the article number as you print each article. You can use a counter variable to do that, and use ```x += 1``` at the end of the loop to keep the counter in step with the loop. (There's also a function in Python called [enumerate()](https://www.w3schools.com/python/ref_func_enumerate.asp) that you could use instead of a manual counter, if you want)
4. Within ```print()``` you can call multiple variables using commas: ```print(variable_one, '\n', variable_two['key_name'])```
 - Note that ```'\n'``` will add a line break to the output, which can make things easier to read.
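
For the ```enumerate()``` option mentioned in step 3, here is a minimal sketch on invented sample data (```sample_formatted``` stands in for ```formatted_articles```). ```enumerate()``` hands back the position and the item together, so no manual counter is needed:

```python
# Hypothetical stand-in for formatted_articles
sample_formatted = [
    {'headline': 'First sample headline', 'date': '2020-04-01'},
    {'headline': 'Second sample headline', 'date': '2020-04-02'},
]

lines = []
for number, article in enumerate(sample_formatted):
    line = f"{number} {article['headline']} ({article['date']})"
    lines.append(line)
    print(line)
```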


In [73]:
x = 0
for article in formatted_articles:
    print(x, article['headline'], '\n', article['web_url'], '\n', article['date'], '\n')
    x += 1

0 News Quiz: Coronavirus, Ventilators, Ellis Marsalis Jr. 
 https://www.nytimes.com/interactive/2020/04/03/briefing/coronavirus-ventilators-ellis-marsalis-jr-news-quiz.html 
 2020-04-03 

1 Prepare for War or Fight Coronavirus? U.S. Military Battles Competing Instincts 
 https://www.nytimes.com/2020/04/01/us/politics/coronavirus-aircraft-carrier-roosevelt.html 
 2020-04-01 

2 Empire State Building Coronavirus Tribute Rang a False Alarm, Fallon Jokes 
 https://www.nytimes.com/2020/04/01/arts/television/empire-state-building-coronavirus.html 
 2020-04-01 

3 Coronavirus, Donald Trump, Asia: Your Wednesday Briefing 
 https://www.nytimes.com/2020/03/31/briefing/coronavirus-trump-asia.html 
 2020-04-01 

4 U.N. Security Council ‘Missing In Action’ in Coronavirus Fight 
 https://www.nytimes.com/2020/04/02/world/americas/coronavirus-united-nations-guterres.html 
 2020-04-02 

5 He Led a Top Navy Ship. Now He Sits in Quarantine, Fired and Infected. 
 https://www.nytimes.com/2020/04/05/us/poli