## UBC Intro to Machine Learning

###  APIs
Instructor: Socorro Dominguez  
February 05, 2022

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/sedv8808/APIs_UBC_ML/main?labpath=APIs.ipynb)

**Agenda:**

* Where does our data come from? (10 minutes)
* APIs (20 minutes)
    * What is it?
    * Examples of APIs
    * Applications
    * Demo using Python
    * Example of a notebook to get information using APIs (10 minutes)

## How is Data Science related to the Web?

- In order for you to be able to perform Data Science, you need the raw material: DATA
- Where do you think you can get data from?

- Publicly available datasets (e.g. check out [Kaggle](https://www.kaggle.com/competitions))
    - Good for benchmarking, but limited for real use cases
- Company’s database (e.g. transaction history)
    - SQL, MongoDB, etc.
- From the web
    - Collected manually (scraping)
    - Collected automatically (APIs)

### Task: You work for Company A. You want to know what your customers are saying about you on Twitter. How do you retrieve the data?

![img](img/api_img.png)

Copying and pasting everything from Twitter and then building a dataframe would be hard.
An easier way could be scraping. This is what a website looks like when you view its raw source code.

![img](img/xml_img.png)

F12 for Windows Users   
Right click + Inspect for Mac users

### What if we wanted to collect this data for analysis?

In [None]:
# Import libraries
import requests

URL = "https://twitter.com/search?q=%23DataScience&src=typeahead_click"
res = requests.get(URL).text
res

Now, this looks hard - like a bad headache.

We would need to create a Data Mining program that goes to URLs and parses the HTML to extract data.
- Effortful
    - HTML is difficult to parse - how do you clean the above data?
    - Almost all information is irrelevant - are all these `hreflang="vi" href="` necessary?
    - Websites often require interaction - look at all the links...
    - When websites update, your code will break - not even going there...
    - Every website is different - now I want to use a different hashtag...
    - Companies try to stop data miners
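
To make the effort concrete, here is a minimal sketch of pulling only the links out of a page with Python's built-in `html.parser`. The HTML snippet and the `LinkCollector` class are made up for illustration; a real page would be far larger and messier.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up HTML: most of it is irrelevant to what we actually want
sample_html = """
<html><head><link rel="alternate" hreflang="vi" href="https://example.com/vi"></head>
<body>
  <div class="nav"><a href="/home">Home</a><a href="/explore">Explore</a></div>
  <p>Loving <a href="/hashtag/DataScience">#DataScience</a> today!</p>
</body></html>
"""

parser = LinkCollector()
parser.feed(sample_html)
links = parser.links

# Even after parsing, we still have to filter out navigation links by hand
hashtag_links = [link for link in links if link.startswith("/hashtag/")]
print(hashtag_links)
```

Note how much of the code exists only to throw information away, and how the filter rule would have to change for every new website.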

## What is an API?

**A**pplication  
**P**rogramming  
**I**nterface  
  
* We will be mostly using RESTful APIs.   

**RE**presentational  
**S**tate  
**T**ransfer  

Easy explanation:

![API](img/bfa.jpeg)

- Programmer-friendly version of websites
- Go to a URL composed of
    - API root endpoint
    - API function
    - API key (like a login)
    - Parameter keys
    - Parameter values
    - Returns data (JSON, XML, csv, etc.)
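
As a rough sketch, the pieces listed above can be glued together into a single URL. The root endpoint, function name, and key below are hypothetical placeholders, not a real service; `build_url` is a helper invented for this example.

```python
from urllib.parse import urlencode

# All values below are hypothetical placeholders, not a real service
API_ROOT = "https://api.example.com/v1"   # API root endpoint
API_FUNCTION = "stops"                     # API function (resource)


def build_url(root, function, api_key, **params):
    """Assemble root + function + key + parameters into one request URL."""
    # urlencode turns the key/value pairs into 'key=value&key=value'
    query = urlencode({"apikey": api_key, **params})
    return f"{root}/{function}?{query}"


url = build_url(API_ROOT, API_FUNCTION, "MY_KEY", lat=49.248523, long=-123.1088)
print(url)
```

Changing a parameter value changes the data you get back, without touching the rest of the URL.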

### Characteristics?

Client-server, typically HTTP-based, stateless server

- In order to use them, you might have to sign up and create a Developer account. 
- There might be screenings.
- Some APIs will not be free or might have "premium" versions.   
[Twitter API](https://developer.twitter.com/en)

### What representation will DATA be in?

### JSON
- **J**ava**S**cript **O**bject **N**otation
- textual description of Python (JavaScript, actually) objects
- arrays and dictionaries <- the reason why Module 5 in PPDS is so important

```
[{
"library": [
           {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"},
           {"title": "Trump: The Art of the Deal", "author": "Good Question"}
           ]
}
, ... ]
```
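
A small sketch of how such text becomes Python objects, using the standard `json` module. The string `raw` is made-up data shaped like the example above; note that valid JSON uses double quotes.

```python
import json

# Made-up response text, shaped like the library example
raw = """
[
  {"library": [
      {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"},
      {"title": "Trump: The Art of the Deal", "author": "Good Question"}
  ]}
]
"""

data = json.loads(raw)          # JSON array -> list, JSON object -> dict
books = data[0]["library"]      # a list of dictionaries
titles = [book["title"] for book in books]
print(titles)
```

Once parsed, the data is navigated with ordinary list indexing and dictionary keys.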

**LESSON:** Even when an API delivers the data in a nicer form than a raw web page, you must still write wrapper functions to get it into the shape you need.

### Using a Web API

Provider defines:
* message format for requests and responses
    * usually in both JSON and XML (XML is used less often)
* registration and authentication
    * usually using OAuth, a delegated authorization framework for REST APIs that lets apps obtain limited access to a user's data without the user giving away their password

### Language integration / Wrapper Function

* might be provided, or you might have to write it yourself
* if provided, usually by someone other than the data source
* library API for various languages like Python, R, ...
* you write a Python program that calls library procedures
* the library formats messages, sends them to the web provider, and translates responses into return values
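
One possible shape for such a wrapper is sketched below. `TransitClient`, its `estimates` method, and the `api.example.com` root are all hypothetical; the `fetch` argument is an assumption added so the sketch can be tried offline with a stand-in function instead of a real provider.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url):
    """Send a GET request asking for JSON and parse the reply (needs network)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


class TransitClient:
    """Hypothetical wrapper: hides URL building and parsing behind methods."""

    def __init__(self, api_key, root="https://api.example.com/v1", fetch=fetch_json):
        self.api_key = api_key
        self.root = root
        self.fetch = fetch  # injectable so the wrapper can be tried offline

    def estimates(self, stop_id):
        """Return the parsed estimates for one stop."""
        query = urlencode({"apikey": self.api_key})
        return self.fetch(f"{self.root}/stops/{stop_id}/estimates?{query}")


# Offline stand-in for the web provider, so the example runs without a key
def fake_fetch(url):
    return [{"requested": url}]


client = TransitClient("MY_KEY", fetch=fake_fetch)
result = client.estimates(61935)
print(result)
```

The caller only sees `client.estimates(61935)`; the message formatting and response translation stay inside the library.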

### When you write the wrapper functions

When the functions are not provided, you will need to load the following libraries:
- `import requests` [Documentation](https://docs.python-requests.org/en/latest/)
- `import json` [Documentation](https://docs.python.org/3/library/json.html)

Also, if you are in Chrome, get the [JSON formatter](https://chrome.google.com/webstore/detail/json-formatter/bcjindcccaagfpapjjmafapmmgkkhgoa?hl=en) extension.

### Getting JSON Data

We need to select the output format when calling the API:
* e.g., HTTP header: `accept: application/json`

Use `requests.get`
* calling `.json()` on the response returns a Python list or dictionary
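
The sketch below shows both halves without touching the network: a request object carrying the `Accept` header, and two made-up response bodies being parsed. It uses the standard library's `urllib.request.Request` only so that it runs anywhere; the endpoint and the response contents are hypothetical.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; the request is built but never sent
request = Request(
    "https://api.example.com/v1/stops/61935/estimates",
    headers={"Accept": "application/json"},
)
accept = request.get_header("Accept")

# Made-up response bodies: one JSON array, one JSON object
as_list = json.loads('[{"route": "25"}]')
as_dict = json.loads('{"error": "not found"}')

print(accept, type(as_list).__name__, type(as_dict).__name__)
```

Whether you get a list or a dictionary depends on the top-level structure the provider sends, so it is worth checking before indexing into the result.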

## Demo with Translink

1. Get your own API token from [Translink](https://developer.translink.ca)
2. In the config file (`config.py`), store your API key; the code below reads it as `cfg.translink['key']`.

Hint: If you are on Chrome, use the JSON formatter extension.

In [None]:
import requests
import config as cfg

# Get your own API token from developer.translink.ca
# Open config.py and store your API key under translink['key']
apikey = cfg.translink['key']

# Request the estimates for stop 61945 as JSON
requests.get(
    'http://api.translink.ca/rttiapi/v1/stops/61945/estimates',
    params={'apikey': apikey},
    headers={'accept': 'application/JSON'},
).json()


In [None]:
# Same kind of request for stop 61935, keeping the parsed result
x = requests.get(
    'http://api.translink.ca/rttiapi/v1/stops/61935/estimates',
    params={'apikey': apikey},
    headers={'accept': 'application/JSON'},
).json()

In [None]:
data = x[0]['Schedules']
data
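
The cell above keeps only the schedules of the first entry, `x[0]`. A hedged sketch of flattening every entry into one list of rows follows; the sample data and its field names (`route`, `leaves_in_min`) are made up and are not the provider's documented schema. Only the `Schedules` key is taken from the code above.

```python
# Made-up data shaped as a list of routes, each holding a list of schedules.
# The field names are placeholders, not the provider's documented schema.
sample = [
    {"route": "25", "Schedules": [{"leaves_in_min": 4}, {"leaves_in_min": 19}]},
    {"route": "33", "Schedules": [{"leaves_in_min": 11}]},
]

# One flat dictionary per schedule, tagged with the route it belongs to
rows = [
    {"route": route["route"], **schedule}
    for route in sample
    for schedule in route["Schedules"]
]
print(rows)
```

A list of flat dictionaries like `rows` can be passed directly to `pd.DataFrame`.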

## Pandas to the Rescue

In [None]:
import pandas as pd
df = pd.DataFrame.from_dict(data)

In [None]:
df

In [None]:
# Look up stops by latitude and longitude
request_url = 'https://api.translink.ca/rttiapi/v1/stops'
response = requests.get(
    request_url,
    params={'apikey': apikey, 'lat': 49.248523, 'long': -123.108800},
    headers={'accept': 'application/JSON'},
)
response


In [None]:
pd.DataFrame.from_dict(response.json())

In [None]:
data = response.json()
len(data)

In [None]:
# Collect the name of every stop in the response
new_list = []

for stop in data:
    new_list.append(stop['Name'])
new_list

## HTTP Requests
- Hypertext Transfer Protocol

- When you access a website (through a URL), you are:
    - "sending an HTTP GET request to the server to retrieve data"
    - "data" can be a webpage that is displayed, or it can be JSON from an API



- When you access a website, you know it worked if it loaded
    - Status codes are helpful when you're working with code

- Common HTTP status codes:
    - 200 OK
    - 400 Bad Request
    - 401 Unauthorized
    - 404 Not Found
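
For reference, Python's standard library knows these codes. The sketch below looks up the official phrase for each one and groups it by range; `describe_status` is a helper written for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"


labels = [describe_status(code) for code in (200, 400, 401, 404)]
for label in labels:
    print(label)
```

In practice, checking the status code before parsing the body avoids confusing errors later on.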

In [None]:
# Something with a bad request
requests.get('https://api.github.com/users/sedv8809')

In [None]:
# Something with a good request
requests.get('https://api.github.com/users/sedv8808')

### The Anatomy Of A Request

It’s important to know that a request is made up of four things:

1. The endpoint <- the URL you are pointing to.

2. The method <- `get`; there are others (e.g. `post`), but we will not cover them today

3. The headers <- e.g. `accept: application/json`

4. The data (or body) <- what we use to work with
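
To see all four parts in one place, the sketch below builds a request object without sending it. It uses a `POST` so that there is a body to show (a plain `GET` usually has none); the endpoint and payload are hypothetical, and the standard library's `urllib.request.Request` is used only so the example runs offline.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload; the request is built but never sent
payload = {"name": "my-new-repo"}

request = Request(
    url="https://api.example.com/v1/repos",        # 1. the endpoint
    method="POST",                                 # 2. the method
    headers={                                      # 3. the headers
        "Accept": "application/json",
        "Content-Type": "application/json",
    },
    data=json.dumps(payload).encode("utf-8"),      # 4. the data (body)
)

parts = {
    "endpoint": request.full_url,
    "method": request.get_method(),
    "headers": dict(request.header_items()),
    "data": json.loads(request.data),
}
print(parts)
```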

## Use Cases

## 1. Getting a README from GitHub

In [None]:
import base64

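def get_readme(url, token):
    '''Return the README.md contents of a GitHub repository as bytes.

    url: repository page, e.g. 'https://github.com/<owner>/<repo>'
    token: GitHub API token
    Returns the string "Missing" if the README cannot be retrieved.
    '''
    url_to_api_endpoint = url.replace('https://github.com/', '')
    new_url = 'https://api.github.com/repos/' + url_to_api_endpoint + '/contents/README.md'
    headers = {'Authorization': f'token {token}', 'accept': 'application/JSON'}

    try:
        readme = requests.get(new_url, headers=headers).json()
        # The file body arrives base64-encoded in the 'content' field
        readme = base64.b64decode(readme['content'])
    except (requests.RequestException, KeyError, TypeError, ValueError):
        # Network failure, no 'content' field (e.g. no README.md), or a bad payload
        readme = "Missing"

    return readme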
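
The two transformations inside `get_readme` can be tried offline. The sketch below fakes a response with a base64-encoded `content` field and decodes it, then repeats the URL rewrite. The README text and `fake_response` are made up for illustration.

```python
import base64

# Made-up payload imitating a base64-encoded 'content' field
original_text = "# My Project\n\nA short description.\n"
fake_response = {
    "encoding": "base64",
    "content": base64.encodebytes(original_text.encode("utf-8")).decode("ascii"),
}

decoded_bytes = base64.b64decode(fake_response["content"])  # -> bytes
decoded_text = decoded_bytes.decode("utf-8")                # -> str

# The same string rewrite get_readme applies to the repository URL
repo_url = "https://github.com/UBC-MDS/exploratory-data-viz"
api_url = (
    "https://api.github.com/repos/"
    + repo_url.replace("https://github.com/", "")
    + "/contents/README.md"
)
print(api_url)
print(decoded_text)
```

Because `base64.b64decode` returns bytes, a final `.decode("utf-8")` is needed if you want ordinary text.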

In [None]:
url = 'https://github.com/UBC-MDS/exploratory-data-viz'
token = cfg.github_api['secret']

In [None]:
get_readme(url = url, token=token)

## 2. Getting Tweets with a Specific Hashtag

In [None]:
import twitter
import json

Using the wrapper methods of the [python-twitter](https://python-twitter.readthedocs.io/en/latest/twitter.html) library

In [None]:
api = twitter.Api(consumer_key = cfg.twitter_api['consumer_key'],
                  consumer_secret = cfg.twitter_api['consumer_secret'],
                  access_token_key = cfg.twitter_api['access_token'],
                  access_token_secret = cfg.twitter_api['access_token_secret'])

In [None]:
# The following function collects real-time tweets and saves them on our computer

# Data returned will be for any tweet mentioning strings in the list FILTER
FILTER = ['datascience']

# LANGUAGES is a list of language codes to filter tweets by.
# With ['en'], only tweets in English are returned.
LANGUAGES = ['en']


def retrieve_tweets(path, FILTER, LANGUAGES):
    with open(path + 'output.txt', 'a') as f:
        # api.GetStreamFilter will return a generator that yields one status
        # message (i.e., Tweet) at a time as a JSON dictionary.
        counter = 0
        for line in api.GetStreamFilter(track=FILTER, languages=LANGUAGES):
            counter += 1
            f.write(json.dumps(line))
            f.write('\n')
            print(counter)
            if counter == 5:
                break

In [None]:
retrieve_tweets(path='', FILTER=FILTER, LANGUAGES=LANGUAGES)
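
The output file holds one JSON document per line, so it can be read back line by line. The sketch below writes two made-up records in the same way as `retrieve_tweets` and loads them again; the records and the temporary file location are assumptions for illustration, not real tweets.

```python
import json
import os
import tempfile

# Made-up records standing in for streamed tweets
fake_tweets = [
    {"id": 1, "text": "Learning #datascience"},
    {"id": 2, "text": "APIs are neat #datascience"},
]

# Write one JSON document per line, as retrieve_tweets does
path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(path, "a") as f:
    for tweet in fake_tweets:
        f.write(json.dumps(tweet))
        f.write("\n")

# Read the file back: parse each non-empty line separately
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

texts = [tweet["text"] for tweet in loaded]
print(texts)
```

Calling `json.load` on the whole file would fail here, because the file is a sequence of JSON documents rather than a single one.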

In [None]:
?api.GetStreamFilter