### GESIS Fall Seminar in Computational Social Science 2022
### Introduction to Computational Social Science with Python
# Day 3-3: JSON and working with APIs

## Overview

* The Anatomy of APIs
* Authentication
* Pagination

## The Anatomy of APIs
* **A**pplication **P**rogramming **I**nterfaces allow your code to interact with an external service.
* JSON is the most widely used format for delivering data when querying an API.
* We've already seen JSON files and how they can be natively read into dictionary/list structures in Python.
* APIs are available for Twitter, GitHub, Wikipedia, Skyscanner, Alexa, Spotify, and many other services.
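
As a quick reminder of how JSON maps onto Python objects, here is a minimal sketch. The JSON string below is made up for illustration and does not come from any real API; it only mimics the nested dictionary/list shape that API responses often have.

```python
import json

# A made-up response body (illustrative only, not from a real API)
raw = '{"results": [{"name": "Ada", "languages": ["en", "de"]}], "info": {"page": 1}}'

# json.loads turns JSON objects into dicts and JSON arrays into lists
data = json.loads(raw)

# Navigate the nested structure with ordinary dict keys and list indices
first_person = data["results"][0]
print(type(data).__name__)           # the top level is a dict
print(first_person["name"])          # value stored under the "name" key
print(first_person["languages"][1])  # second item of a nested list
print(data["info"]["page"])          # JSON numbers become Python numbers
```

The same indexing pattern applies to the real responses we request later in this notebook.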

### APIs
* We will focus on REST (REpresentational State Transfer) APIs, using HTTP requests to query for the data we require.
* Essentially, this means constructing a URL with certain keywords, sending it to a service, and receiving back a JSON response with the data we need.
* Many APIs also have official and unofficial Python packages, which handle the task of constructing requests on the user's behalf.

### Requests and Responses
When we use web APIs, we (the client) construct a request to send to a server, and receive a response from the server.
* Requests contain relevant data regarding your API request call:
    - **Base URL** - The base web address for the service
    - **API endpoint** - The more specific API address
    - **HTTP method** - The type of request (see below)
    - **Headers** - Metadata like identification info, authentication, preferred response formats
    - **Query parameters** - Custom tags to specify the data we require
* Responses contain relevant data returned by the server: 
    - **Content** - The data we requested
    - **Status code** - Tells you whether and why the request was successful or not
    - **Headers** - Metadata like server info, response encoding
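
To make the parts of a request concrete, the sketch below assembles a full URL from a base URL, an endpoint, and query parameters using only the standard library. The helper name `build_url` is hypothetical; in practice the `requests` library does this assembly for us when we pass `params`.

```python
from urllib.parse import urlencode, urljoin

base_url = "https://randomuser.me/"       # base web address of the service
endpoint = "api/"                          # the more specific API address
query = {"gender": "male", "nat": "de"}   # query parameters

def build_url(base_url, endpoint, query=None):
    """Hypothetical helper: join base URL and endpoint, then append the query string."""
    url = urljoin(base_url, endpoint)
    if query:
        url = f"{url}?{urlencode(query)}"
    return url

full_url = build_url(base_url, endpoint, query)
print(full_url)
```

Everything after the `?` is the query string: `key=value` pairs separated by `&`.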


### HTTP Methods
There are several different kinds of requests we can make, depending on how we want to interact with the server.
* **GET** - Retrieve information (we only use GET today, and this is what you will mostly use when interacting with APIs)
* **POST** - Create new information on the server (e.g. post a tweet)
* **PUT** - Modify information on the server (e.g. modify a GitHub repo)
* **DELETE** - Delete information from the server (e.g. delete a file)


### HTTP Response codes

* Success Codes
     - **200** OK - Success (common response to GET)
     - **201** Created - New resource created (POST/PUT)
     - **204** No Content - Success but no content returned (not an issue)

* Client Error Codes
     - **400** Bad Request - Request not understood due to bad syntax (fix your JSON)
     - **401** Unauthorized - Authentication required
     - **403** Forbidden - Exists, but not authorised with current authentication
     - **404** Not Found - Does not exist

* Server Error Codes
     - **500** Internal Server Error - there's something wrong with the server
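
The groupings above follow from the first digit of the code. As a rough sketch, the hypothetical helper `describe_status` below uses the standard library's `http.HTTPStatus` to look up the official phrase and then labels the code by its range.

```python
from http import HTTPStatus

def describe_status(code):
    """Hypothetical helper: return the code, its standard phrase, and its category."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not a registered HTTP status
        phrase = "Unknown"
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"

for code in (200, 204, 401, 404, 500):
    print(describe_status(code))
```

With `requests`, the number to pass in would be `response.status_code`.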

In [None]:
# Constructing a basic query using the requests library and Random User API
# Base url: https://randomuser.me/
# Endpoint: api/

import requests

response = requests.get("https://randomuser.me/api/")

In [None]:
# Let's see some info about the request we sent
request = response.request
print(request.url)
print(request.path_url)
print(request.method)
print(request.headers)

In [None]:
# Let's see some info about the response we received

print(response.content) # a bytestring of the content
print(response.text) # a text string of the content
print()
print(response.status_code)
print()
print(response.headers)
data = response.json() # Convert the JSON content to a python object

In [None]:
# We can explore the data that was returned:

data

In [None]:
# Let's construct a query and request a German man
# our query parameters here are ?gender=male&nat=de
germanman = requests.get("https://randomuser.me/api/?gender=male&nat=de").json()
germanman

Opening the link in a web browser will return the same content: [https://randomuser.me/api/?gender=male&nat=de](https://randomuser.me/api/?gender=male&nat=de).
API documentation frequently has examples and sandboxes where you can try different queries.

In [None]:
# We can alternatively pass a dictionary of parameters
parameters = {'gender':'male', 'nat':'de'}

germanman = requests.get("https://randomuser.me/api/", params=parameters).json()
germanman

## 🏋️‍♀️ PRACTICE

In [None]:
# Q1: Use the random user API to generate details of a Canadian woman



In [None]:
# Q2: With a single query, use the random user API to generate details for 10 people
# from a random selection of British, German, and French nationalities
# How many people from each location are there (calculate using Python)?


## Authentication
Many APIs require some level of identification and authentication to use:
* None - No credentials needed, although a header identifying you (username/email) is occasionally required or requested.
* API Key - Unique text credentials, typically tied to an account with the service. Supplied in the query parameters or headers.
* OAuth - Token-based authentication, more sophisticated, supports multi-user apps, controlled access scopes.
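
Where the credential goes differs between services. The sketch below shows three common placements; the function `with_api_key` is hypothetical, and the names `apikey` and `X-API-Key` are assumptions for illustration, since each API documents its own parameter or header name.

```python
def with_api_key(api_key, style="query", params=None, headers=None):
    """Hypothetical helper: return (params, headers) with the credential attached."""
    params = dict(params or {})    # copy so the caller's dicts are not modified
    headers = dict(headers or {})
    if style == "query":
        params["apikey"] = api_key                      # key sent as a query parameter
    elif style == "header":
        headers["X-API-Key"] = api_key                  # key sent in a custom header
    elif style == "bearer":
        headers["Authorization"] = f"Bearer {api_key}"  # token-style header, as used with OAuth
    else:
        raise ValueError(f"Unknown style: {style}")
    return params, headers

# "NOT-A-REAL-KEY" is a placeholder, never a real credential
params, headers = with_api_key("NOT-A-REAL-KEY", style="query", params={"t": "Hackers"})
print(params)
print(headers)
```

The returned dictionaries could then be passed to `requests.get(url, params=params, headers=headers)`.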

In [None]:
# The Wikipedia API doesn't require authentication, but does request identifying info in the user-agent header

headers = {'User-Agent': 'add your username / email / contact details here'}

# Let's query Wikipedia for the links on the page "Mannheim"
# https://en.wikipedia.org/wiki/Mannheim

url = "https://en.wikipedia.org/w/api.php"

parameters = {
    "action": "query",
    "titles": "Mannheim",
    "format":"json",
    "prop": "links"}

response = requests.get(url=url, params=parameters, headers=headers) # add the headers to the query
data = response.json()

data

In [None]:
# This isn't very many links...
# By default only the first 10 links are returned
# Let's increase the pllimit parameter to return more

parameters = {
    "action": "query",
    "titles": "Mannheim",
    "format":"json",
    "prop": "links",
    "pllimit":500}

response = requests.get(url=url, params=parameters, headers=headers) # add the headers to the query
data = response.json()

data

# Better, but not everything...

## Pagination
* Due to API rate limits and file sizes, sometimes results are split into different pages.
* By default just the first page is returned, but typically the option to continue to the next page is provided with a follow up query.
* The exact implementation depends on the API - read the documentation.
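
The general pattern can be sketched without any network access. Below, `FAKE_PAGES` and `fetch_page` are made-up stand-ins for a remote service that hands back a continuation token with each page; the `max_pages` cap is an added safety measure so the loop cannot run forever.

```python
# Fake paged data standing in for a remote service (purely illustrative)
FAKE_PAGES = {
    None: {"items": ["a", "b"], "continue": "page2"},
    "page2": {"items": ["c", "d"], "continue": "page3"},
    "page3": {"items": ["e"]},  # no "continue" key: this is the last page
}

def fetch_page(token=None):
    """Stand-in for an HTTP request; returns one page of results."""
    return FAKE_PAGES[token]

def fetch_all(fetch, max_pages=100):
    """Follow continuation tokens until none is returned, or max_pages is reached."""
    items = []
    token = None  # the first request carries no continuation token
    for _ in range(max_pages):
        page = fetch(token)
        items.extend(page["items"])
        token = page.get("continue")  # None when the service has no more pages
        if token is None:
            break
    return items

all_items = fetch_all(fetch_page)
print(all_items)
```

Other APIs use page numbers or offsets instead of tokens, but the loop has the same shape: request, collect, check whether there is more.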

In [None]:
# The Wikipedia API returns "continue" parameters
# If we feed these to a follow up query, we get the next page of results

parameters = {
    "action": "query",
    "titles": "Mannheim",
    "format":"json",
    "prop": "links",
    "pllimit":500,
    'plcontinue': '99627|14|Articles_to_be_expanded_from_June_2017',
    'continue': '||'}

response2 = requests.get(url=url, params=parameters, headers=headers) # add the headers to the query
data2 = response2.json()

data2


In [None]:
# Great, we got everything, but what about articles whose links span many pages of results?
# We need to write a function to automatically cycle through pages...

def wikiquery(parameters, headers=None, url="https://en.wikipedia.org/w/api.php"): # headers=None avoids a mutable default argument
    result = [] # create list for response data pages
    lastcontinue = {} # create placeholder for continue parameters
    while True: # keep querying until...
        p = parameters.copy() # copy original params
        p.update(lastcontinue) # update params with continue parameters
        data = requests.get(url=url, params=p, headers=headers).json() # make query and get data
        result.append(data['query']) # add query response for page to results list
        if 'continue' in data: # if 'continue' param in response, update the continue placeholder, query again 
            lastcontinue = data['continue']
        else: # keep querying until... ...there is no continue parameter in the response
            return result
        
parameters = {
    "action": "query",
    "titles": "London",
    "format":"json",
    "prop": "links",
    "pllimit":500}

data = wikiquery(parameters, headers)

data

## 🏋️‍♀️ PRACTICE

In [None]:
# Q3: Part of the task of using APIs is in parsing the output.
# a) Parse the output from the Wikipedia "London" links search above, the output should look like a dictionary:
# {'London':['.london', '101 Dalmatians (1996 film)', '10 Downing Street', ...]}
# b) Create a function that can do this with any article title


In [None]:
# Q4: Repeat the task of getting all the links on the Mannheim Wikipedia article, but across 10 different languages
# e.g. de.wikipedia.org, fr.wikipedia.org, ...
# Create a dictionary with the language as keys and the number of links as values


In [None]:
# Q5: Write a query function to get data from all pages of a World Bank API query.
# Documentation here: https://datahelpdesk.worldbank.org/knowledgebase/articles/898581-api-basic-call-structures
# The mechanism for iterating through pages is different from Wikipedia's!
# Test with this API call: http://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL?date=2020&format=json
# You should be able to iterate through all 6 pages of data and save it


### API key authentication
* In order to control access to APIs, or allow access to personal user data, services may require you to use a unique API key.
* You typically need to login to the service website and request a key, which you then add to your HTTP API requests.
* Do not share your API key details, or accidentally upload them to GitHub!
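
One way to reduce the risk of leaking a key is to read it from an environment variable and only ever display a masked version. This is a minimal sketch: the helper names `load_api_key` and `mask`, and the variable name `DEMO_API_KEY`, are hypothetical, and the value is set inside the code purely so the example runs on its own.

```python
import os

def load_api_key(name):
    """Read a key from the environment and fail loudly if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key

def mask(key, visible=4):
    """Hide all but the last few characters so the key is safe to print."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]

# Demo only: a fake value set in code. A real key should never be hard-coded.
os.environ["DEMO_API_KEY"] = "abcd1234efgh"

demo_key = load_api_key("DEMO_API_KEY")
print(mask(demo_key))  # confirms the key loaded without revealing it
```

Printing the masked form also keeps the full key out of the saved notebook output.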

In [None]:
# Sign up for an API key from https://www.omdbapi.com/
# Make sure to click the link in the follow-up email to activate your key
# It will then take a few minutes to register with the server

# The OMDB API asks that we simply send the API key as a parameter:

parameters = {'t':'Hackers', 'apikey': 'type your api key here'}

data = requests.get('https://www.omdbapi.com/', params=parameters).json()
data

In [None]:
# We don't want to share our API keys, so common practice is to hide them in a .env file
# We also want to make sure the .env file is included in the .gitignore, so we don't upload it to github
# Create a text file in your current directory with the line OMDB_KEY=[your key]
# Be careful of hidden file extensions when saving the .env file

import os
from dotenv import load_dotenv # you may need to conda install python-dotenv

load_dotenv() # reads the .env file

API_KEY = os.getenv('OMDB_KEY') # loads the OMDB_KEY as a variable
print(API_KEY) # If this correctly prints your key, then it worked. If not, then there is a filename/filepath error.
# (You will also want to clear this cell, as the output is saved to the .ipynb file)

In [None]:
# Run a query with our API key
parameters = {'t':'The Net', 'apikey':API_KEY}

data = requests.get('https://www.omdbapi.com/', params=parameters).json()
data

### Bonus API Tips
* Be mindful of API limits! Limits can apply to the total number of queries and/or the rate of queries.
* Uncertain page ranges mean while loops are regularly used, but make sure your code doesn't get trapped.
* `try`/`except` statements, or conditionals based on the response status code are also useful.
* If querying for large amounts of data, it is good practice to save progress as the code runs, rather than all at the end.
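
The tips above can be combined into a rough sketch. Everything here is illustrative: `FlakyService` is a made-up stand-in that fails twice before succeeding, `fetch_with_retries` and `append_jsonl` are hypothetical helper names, and the payload is invented. The retry loop is capped so it cannot get trapped, and each result is written to disk as soon as it arrives.

```python
import json
import os
import tempfile
import time

def fetch_with_retries(fetch, max_attempts=3, wait_seconds=1.0):
    """Call fetch() until it reports status 200 or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            status, payload = fetch()
        except ConnectionError:
            status, payload = None, None      # treat a network error as a failed attempt
        if status == 200:
            return payload
        if attempt < max_attempts:
            time.sleep(wait_seconds * attempt)  # wait a little longer after each failure
    return None                                 # give up rather than loop forever

def append_jsonl(path, record):
    """Append one record per line, so earlier progress survives a later crash."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

class FlakyService:
    """Made-up service: network error, then a 500, then success."""
    def __init__(self):
        self.calls = 0
    def __call__(self):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("network hiccup")
        if self.calls == 2:
            return 500, None
        return 200, {"title": "Hackers"}

service = FlakyService()
result = fetch_with_retries(service, wait_seconds=0)  # no waiting in this demo
out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
if result is not None:
    append_jsonl(out_path, result)
print(result, service.calls)
```

With a real API, `fetch` would wrap a `requests.get` call and return `response.status_code` together with `response.json()`.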

## 🏋️‍♀️ PRACTICE

In [None]:
# Q6: Use OMDB API "search" to get all Movie type results for the query "Lord of the Rings"
# Make sure to go through all pages of results


In [None]:
# Q7: Browse the GitHub list of public APIs:
# Select one that requires an API key, sign up to receive an API key (you may want to take Q8 into account).
# Explore the documentation and write some of your own queries.
# Save the initial JSON data you receive, as well as a cleaned version of the data.



In [None]:
# Q8: Create a dataset by linking information from 2 different APIs.
# e.g. country-level weather data from a weather API compared with agricultural data from the World Bank API,
# cryptocurrency prices compared to stock prices, Wikipedia page views compared with Steam game store stats, etc...
# Clean the data and save it locally.
# Can you tell us some basic stats about the data collected?
