## UBC Intro to Machine Learning

###  APIs
Instructor: Socorro Dominguez  
February 05, 2022


**Agenda:**

* Where does our data come from? (10 minutes)
* APIs (20 minutes)
    * What is it?
    * Examples of APIs
    * Applications
    * Demo using Python
    * Example of a notebook to get information using APIs (10 minutes)

## How is Data Science related to the Web?

- In order for you to be able to perform Data Science, you need the raw material: DATA
- Where do you think you can get data from?

- Publicly available datasets (e.g. check out [Kaggle](https://www.kaggle.com/competitions))
    - Good for benchmarking, but limited for real use cases
- Company’s database (e.g. transaction history)
    - SQL, MongoDB, etc.
- From the web
    - Collected manually (scraping)
    - Collected automatically (APIs)

### Task: You work for Company A. You want to know what your customers are saying about you on Twitter. How do you retrieve the data?

![img](img/api_img.png)

It would be hard to copy and paste everything on Twitter and then build a dataframe...
An easier way could be scraping. This is what a website looks like when you view it in its raw "code" format.

![img](img/xml_img.png)

F12 for Windows Users   
Right click + Inspect for Mac users

### What if we wanted to collect this data for analysis?

In [1]:
# Import libraries
import requests

URL = "https://twitter.com/search?q=%23DataScience&src=typeahead_click"
res = requests.get(URL).text
res



Now, this looks hard - like a bad headache.

We would need to create a Data Mining program that goes to URLs and parses the HTML to extract data.
- Effortful
    - HTML is difficult to parse - how do you clean the above data?
    - Almost all information is irrelevant - are all these `hreflang="vi" href="` necessary?
    - Websites often require interaction - look at all the links...
    - When websites update, your code will break - not even going there...
    - Every website is different - now I want to use a different hashtag...
    - Companies try to stop data miners
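
To give a feel for the parsing effort listed above, here is a minimal sketch using Python's built-in `html.parser`. The HTML fragment, the `class="text"` convention, and the `PostTextParser` name are all made up for illustration; a real page is far noisier and changes without warning.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified page fragment (not real Twitter markup).
SAMPLE_HTML = """
<html><head><link hreflang="vi" href="https://example.com/vi"></head>
<body>
  <div class="post"><p class="text">Learning #DataScience today</p></div>
  <div class="ad"><p>Buy now</p></div>
  <div class="post"><p class="text">APIs make this easier</p></div>
</body></html>
"""


class PostTextParser(HTMLParser):
    """Collect the text inside <p class="text"> elements and ignore the rest."""

    def __init__(self):
        super().__init__()
        self.in_post_text = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "text") in attrs:
            self.in_post_text = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_post_text = False

    def handle_data(self, data):
        # Keep only non-empty text found inside the paragraphs we care about
        if self.in_post_text and data.strip():
            self.posts.append(data.strip())


parser = PostTextParser()
parser.feed(SAMPLE_HTML)
posts = parser.posts
print(posts)
```

Even for this toy page we need a custom class, and the code breaks as soon as the site renames `class="text"`.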

## What is an API?

**A**pplication  
**P**rogramming  
**I**nterface  
  
* We will be mostly using RESTful APIs.   

**RE**presentational  
**S**tate  
**T**ransfer  

Easy explanation:

![API](img/bfa.jpeg)

- Programmer-friendly version of websites
- Go to a URL composed of
    - API root endpoint
    - API function
    - API key (like a login)
    - Parameter keys
    - Parameter values
- Returns data (JSON, XML, CSV, etc.)
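
The URL components listed above can be assembled as in the sketch below. The root endpoint `https://api.example.com/v1`, the `stops` function, and the `build_api_url` helper are hypothetical; real APIs document their own endpoints and parameter names.

```python
from urllib.parse import urlencode


def build_api_url(root, function, params):
    """Join the root endpoint, the function path, and the query parameters."""
    query = urlencode(params)  # percent-encodes parameter keys and values
    return f"{root.rstrip('/')}/{function.lstrip('/')}?{query}"


url = build_api_url(
    root="https://api.example.com/v1",  # API root endpoint (hypothetical)
    function="stops",                   # API function (hypothetical)
    params={
        "apikey": "YOUR_API_KEY",       # API key (placeholder, not a real key)
        "lat": 49.248523,               # parameter key and value
        "long": -123.1088,
    },
)
print(url)
```

Using `urlencode` rather than string concatenation keeps special characters in parameter values from breaking the URL.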

### Characteristics?

Client-server, typically HTTP-based, stateless server

- In order to use them, you might have to sign up and create a Developer account. 
- There might be screenings.
- Some APIs will not be free or might have "premium" versions.   
[Twitter API](https://developer.twitter.com/en)

### What representation will DATA be in?

### JSON
- **J**ava**S**cript **O**bject **N**otation
* a textual representation of JavaScript objects, which map closely onto Python objects
* arrays and dictionaries <- the reason why Module 5 in PPDS is so important

```
[{
"library": [
           {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"},
           {"title": "Trump: The Art of the Deal", "author": "Good Question"}
           ]
}
, ... ]
```
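
As a small sketch of how this maps onto Python, the standard `json` module turns JSON arrays into lists and JSON objects into dictionaries. The `raw` string below is a shortened copy of the library example, written with the double quotes that JSON requires.

```python
import json

# JSON text as it might arrive from an API (shortened version of the example)
raw = """
[
  {"library": [
    {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"}
  ]}
]
"""

parsed = json.loads(raw)  # JSON array -> list, JSON object -> dict

# Walk the nested structure: list of shelves -> "library" list -> book dicts
titles = [book["title"] for shelf in parsed for book in shelf["library"]]
print(type(parsed).__name__, type(parsed[0]).__name__, titles)
```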

**LESSON:** Even when an API returns data in a nicer form than a raw web page, you still need to write wrapper functions to get it into the shape you want.

### Using a Web API

Provider defines:
* message format for requests and responses
    * usually available in both JSON and XML (XML is used less often)
* registration and authentication
    * usually OAuth, a delegated authorization framework for REST APIs that lets apps obtain limited access to a user's data without the user giving away their password

### Language integration / Wrapper Function

* might be provided or you might have to do it yourself
* if provided, it is usually written by someone other than the data source
* library API for various languages like python, R, ...
* you write a python program that calls library procedures
* library formats messages, sends them to web provider, translates responses as return values
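
The division of labour described above might look like the sketch below. `StopsClient`, its `estimates` method, the `api.example.com` endpoint, and `fake_fetch` are all hypothetical names for illustration. The network call is passed in as a function so the example runs offline; a real wrapper would typically call `requests.get` there.

```python
import json
from typing import Callable


class StopsClient:
    """Hypothetical wrapper library around a transit web API."""

    def __init__(self, api_key: str, fetch: Callable[[str], str]):
        self.api_key = api_key
        self._fetch = fetch  # function that takes a URL and returns response text

    def estimates(self, stop_no: int) -> list:
        # 1. Format the message (here: build the URL)
        url = (
            "https://api.example.com/v1/stops/"
            f"{stop_no}/estimates?apikey={self.api_key}"
        )
        # 2. Send it to the provider, 3. translate the response into Python objects
        return json.loads(self._fetch(url))


def fake_fetch(url: str) -> str:
    """Stand-in for the network: returns canned JSON and echoes the URL."""
    return json.dumps([{"RouteNo": "372", "requested": url}])


client = StopsClient(api_key="YOUR_API_KEY", fetch=fake_fetch)
result = client.estimates(61945)
print(result[0]["RouteNo"])
```

The caller only sees `client.estimates(61945)`; the URL format and JSON decoding stay hidden inside the wrapper.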

### When you write the wrapper functions

When the functions are not provided, you will need to load the following libraries:
- `import requests` [Documentation](https://docs.python-requests.org/en/latest/)
- `import json` [Documentation](https://docs.python.org/3/library/json.html)

Also, if you are in Chrome, get the [JSON formatter](https://chrome.google.com/webstore/detail/json-formatter/bcjindcccaagfpapjjmafapmmgkkhgoa?hl=en) extension.

### Getting JSON Data

We need to select the output format when calling the API:
* e.g., HTTP header: `accept: application/json`

Use `requests.get`
* this returns a `Response` object; calling `.json()` on it returns a Python list or dictionary

## Demo with Translink

1. Get your own API token from [Translink](https://developer.translink.ca)
2. In the config file (`config.py`), enter your API key; the code below reads it as `cfg.translink['key']`.

Hint: If you are on Chrome, use the JSON formatter extension.

In [2]:
import requests
import config as cfg

# Get your own API token from developer.translink.ca
# open config.py and write in your API key
apikey = cfg.translink['key']

requests.get('http://api.translink.ca/rttiapi/v1/stops/61945/estimates?apikey={}'\
             .format(apikey),headers={'accept': 'application/JSON'}).json()


[{'RouteNo': '372',
  'RouteName': 'CLAYTON HEIGHTS/LANGLEY CENTRE',
  'Direction': 'EAST',
  'RouteMap': {'Href': 'https://nb.translink.ca/geodata/372.kmz'},
  'Schedules': [{'Pattern': 'EB1',
    'Destination': 'LANGLEY CTR',
    'ExpectedLeaveTime': '10:33pm 2022-02-04',
    'ExpectedCountdown': 7,
    'ScheduleStatus': '*',
    'CancelledTrip': False,
    'CancelledStop': False,
    'AddedTrip': False,
    'AddedStop': False,
    'LastUpdate': '11:08:12 pm'}]}]
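
The response above is a list of route dictionaries, each holding a nested list of schedules. The sketch below shows one way to walk that structure. `response_data` is an abbreviated copy of the output above, and `next_departures` is a helper name invented for this example.

```python
# Abbreviated copy of the response shown above (some fields omitted)
response_data = [
    {
        "RouteNo": "372",
        "RouteName": "CLAYTON HEIGHTS/LANGLEY CENTRE",
        "Direction": "EAST",
        "Schedules": [
            {
                "Destination": "LANGLEY CTR",
                "ExpectedLeaveTime": "10:33pm 2022-02-04",
                "ExpectedCountdown": 7,
                "CancelledTrip": False,
            }
        ],
    }
]


def next_departures(routes):
    """Flatten routes -> schedules into (route, destination, minutes) tuples."""
    rows = []
    for route in routes:
        # .get with a default keeps the loop safe if a route has no schedules
        for schedule in route.get("Schedules", []):
            rows.append(
                (route["RouteNo"], schedule["Destination"], schedule["ExpectedCountdown"])
            )
    return rows


departures = next_departures(response_data)
print(departures)
```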

In [3]:
x = requests.get('http://api.translink.ca/rttiapi/v1/stops/61935/estimates?apikey={}'\
                 .format(apikey),headers={'accept': 'application/JSON'}).json()

In [4]:
data = x[0]['Schedules']
data

[{'Pattern': 'E1',
  'Destination': "COMM'L-BDWAY STN",
  'ExpectedLeaveTime': '10:26pm 2022-02-04',
  'ExpectedCountdown': 0,
  'ScheduleStatus': '-',
  'CancelledTrip': False,
  'CancelledStop': False,
  'AddedTrip': False,
  'AddedStop': False,
  'LastUpdate': '10:23:58 pm'},
 {'Pattern': 'E1',
  'Destination': "COMM'L-BDWAY STN",
  'ExpectedLeaveTime': '10:32pm 2022-02-04',
  'ExpectedCountdown': 6,
  'ScheduleStatus': '*',
  'CancelledTrip': False,
  'CancelledStop': False,
  'AddedTrip': False,
  'AddedStop': False,
  'LastUpdate': '09:32:10 pm'},
 {'Pattern': 'E8FL2',
  'Destination': 'TO BOUNDARY B-LINE',
  'ExpectedLeaveTime': '10:42pm 2022-02-04',
  'ExpectedCountdown': 16,
  'ScheduleStatus': '*',
  'CancelledTrip': False,
  'CancelledStop': False,
  'AddedTrip': False,
  'AddedStop': False,
  'LastUpdate': '09:42:19 pm'},
 {'Pattern': 'E1',
  'Destination': "COMM'L-BDWAY STN",
  'ExpectedLeaveTime': '10:52pm 2022-02-04',
  'ExpectedCountdown': 26,
  'ScheduleStatus': '*',
 

## Pandas to the Rescue

In [5]:
import pandas as pd
df = pd.DataFrame.from_dict(data)

In [6]:
df

| | Pattern | Destination | ExpectedLeaveTime | ExpectedCountdown | ScheduleStatus | CancelledTrip | CancelledStop | AddedTrip | AddedStop | LastUpdate |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E1 | COMM'L-BDWAY STN | 10:26pm 2022-02-04 | 0 | - | False | False | False | False | 10:23:58 pm |
| 1 | E1 | COMM'L-BDWAY STN | 10:32pm 2022-02-04 | 6 | * | False | False | False | False | 09:32:10 pm |
| 2 | E8FL2 | TO BOUNDARY B-LINE | 10:42pm 2022-02-04 | 16 | * | False | False | False | False | 09:42:19 pm |
| 3 | E1 | COMM'L-BDWAY STN | 10:52pm 2022-02-04 | 26 | * | False | False | False | False | 09:52:12 pm |
| 4 | E1 | COMM'L-BDWAY STN | 11:04pm 2022-02-04 | 38 | | False | False | False | False | 10:04:09 pm |
| 5 | E1 | COMM'L-BDWAY STN | 11:16pm 2022-02-04 | 50 | | False | False | False | False | 10:16:00 pm |


In [7]:
from IPython.display import JSON
request_url = 'https://api.translink.ca/rttiapi/v1/stops?apikey={}&lat={}&long={}'.format(apikey, 49.248523, -123.108800)
response = requests.get(request_url, headers={'accept': 'application/JSON'})
response


<Response [200]>

In [8]:
pd.DataFrame.from_dict(response.json())

| | StopNo | Name | BayNo | City | OnStreet | AtStreet | Latitude | Longitude | WheelchairAccess | Distance | Routes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51516 | EB W KING EDWARD AVE FS MANITOBA ST | N | VANCOUVER | W KING EDWARD AVE | MANITOBA ST | 49.248819 | -123.107042 | 1 | 132 | |
| 1 | 51573 | WB W KING EDWARD AVE FS COLUMBIA ST | N | VANCOUVER | W KING EDWARD AVE | COLUMBIA ST | 49.249023 | -123.110516 | 1 | 136 | 025 |
| 2 | 51514 | EB W KING EDWARD AVE FS YUKON ST | N | VANCOUVER | W KING EDWARD AVE | YUKON ST | 49.24882 | -123.11172 | 1 | 215 | |
| 3 | 51572 | WB W KING EDWARD AVE FS ONTARIO ST | N | VANCOUVER | W KING EDWARD AVE | ONTARIO ST | 49.248922 | -123.105388 | 1 | 252 | 025 |
| 4 | 60335 | EB E KING EDWARD AVE FS ONTARIO ST | N | VANCOUVER | E KING EDWARD AVE | ONTARIO ST | 49.248771 | -123.104633 | 1 | 304 | 025 |
| 5 | 51513 | KING EDWARD STN BAY 3 | 3 | VANCOUVER | KING EDWARD STN | BAY 3 | 49.248866 | -123.114707 | 1 | 430 | 025 |
| 6 | 51517 | EB E KING EDWARD AVE FS QUEBEC ST | N | VANCOUVER | E KING EDWARD AVE | QUEBEC ST | 49.248736 | -123.102742 | 0 | 440 | |
| 7 | 61372 | KING EDWARD STATION PLATFORM 1 | N | VANCOUVER | KING EDWARD STATION | PLATFORM 1 | 49.249172 | -123.115228 | 0 | 472 | |
| 8 | 50475 | KING EDWARD STN BAY 2 | 2 | VANCOUVER | KING EDWARD STN | BAY 2 | 49.249415 | -123.115245 | 1 | 478 | 015, 033, N15 |
| 9 | 61253 | NB CAMBIE ST FS W 27 AVE | N | VANCOUVER | CAMBIE ST | W 27 AVE | 49.247293 | -123.115175 | 1 | 483 | |


In [9]:
data = response.json()
len(data)

10

In [10]:
# Collect the name of every stop in the response.
# (Looping over range(0, len(data)-1) would skip the last stop.)
new_list = [stop['Name'] for stop in data]
new_list

['EB W KING EDWARD AVE FS MANITOBA ST',
 'WB W KING EDWARD AVE FS COLUMBIA ST',
 'EB W KING EDWARD AVE FS YUKON ST',
 'WB W KING EDWARD AVE FS ONTARIO ST',
 'EB E KING EDWARD AVE FS ONTARIO ST',
 'KING EDWARD STN BAY 3',
 'EB E KING EDWARD AVE FS QUEBEC ST',
 'KING EDWARD STATION PLATFORM 1',
 'KING EDWARD STN BAY 2',
 'NB CAMBIE ST FS W 27 AVE']

## HTTP Requests
- Hypertext Transfer Protocol

- When you access a website (through a URL), you are:
    - "sending an HTTP GET request to the server to retrieve data"
    - "data" can be a webpage that is displayed, or it can be JSON from an API



- When you access a website, you know it worked if it loaded
    - Status codes are helpful when you're working with code

- Common HTTP status codes:
    - 200 OK
    - 400 Bad Request
    - 401 Unauthorized
    - 404 Not Found

In [11]:
# A request for a username that does not exist (404 Not Found)
requests.get('https://api.github.com/users/sedv8809')

<Response [404]>

In [12]:
# Something with a good request
requests.get('https://api.github.com/users/sedv8808')

<Response [200]>

### The Anatomy Of A Request

It’s important to know that a request is made up of four things:

1. The endpoint <- the URL you are pointing to.

2. The method <- `get`; there are others, but not today

3. The headers <- `accept: application/json`

4. The data (or body) <- the information you send to the server (usually empty for a `get`)
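
The four parts can be inspected without sending anything. This sketch builds a request object with the standard library's `urllib.request.Request` (used here instead of `requests` only so that nothing goes over the network) and reads each part back. The `anatomy` dictionary is just for display.

```python
from urllib.request import Request

# Build the request object; nothing is sent until it is passed to urlopen()
req = Request(
    url="https://api.github.com/users/sedv8808",  # 1. the endpoint
    method="GET",                                 # 2. the method
    headers={"Accept": "application/json"},       # 3. the headers
    data=None,                                    # 4. the data (body), empty for GET
)

anatomy = {
    "endpoint": req.full_url,
    "method": req.get_method(),
    "headers": dict(req.header_items()),
    "data": req.data,
}
print(anatomy)
```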

## Use Cases

## 1. Getting a README from GitHub

In [13]:
import base64

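def get_readme(url, token):
    '''Return the decoded README.md of a GitHub repository as bytes.

    url: repository URL, e.g. https://github.com/<owner>/<repo>
    token: GitHub API token
    Returns the string "Missing" if the README cannot be retrieved.
    '''
    url_to_api_endpoint = url.replace('https://github.com/', '')
    new_url = 'https://api.github.com/repos/' + url_to_api_endpoint + '/contents/README.md'
    headers = {'Authorization': f'token {token}', 'accept': 'application/JSON'}

    try:
        readme = requests.get(new_url, headers=headers).json()
        # The file body arrives base64-encoded in the 'content' field
        readme = base64.b64decode(readme['content'])
    except (requests.RequestException, KeyError, ValueError):
        # Network failure, no 'content' key, or a response that cannot be decoded
        readme = "Missing"

    return readme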

In [14]:
url = 'https://github.com/UBC-MDS/exploratory-data-viz'
token = cfg.github_api['secret']

In [15]:
get_readme(url = url, token=token)

b'## Key Capabilities in Data Science: Data Visualization\n\n[![Netlify Status](https://api.netlify.com/api/v1/badges/17c9c1dc-7623-4871-bcb5-543d3e0a8952/deploy-status)](https://app.netlify.com/sites/exploratory-data-visualization/deploys)\n\nHosted here: [https://exploratory-data-visualization.netlify.app/](https://exploratory-data-visualization.netlify.app/)\n\n### View locally with Docker\n\nTo view the course locally with Docker:\n\n1. Install docker and docker-compose\n\n2. Clone this GitHub repo, and then from the root of this project repo type: `docker-compose up`\n\n3. Then copy this url to the browser: `http://localhost:8000`\n'
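
The result above is a `bytes` object (note the `b'...'` prefix), because `base64.b64decode` returns bytes. A minimal sketch of the last step, turning it into readable text, follows. The `payload` dictionary is a made-up stand-in for the API response, and we assume the README is UTF-8 encoded.

```python
import base64

# Hypothetical payload shaped like the response used above:
# the file body sits base64-encoded in the 'content' field.
sample_text = "## My Project\n\nHello, README!\n"
payload = {
    "content": base64.b64encode(sample_text.encode("utf-8")).decode("ascii")
}

raw_bytes = base64.b64decode(payload["content"])  # bytes, like the output above
readme_text = raw_bytes.decode("utf-8")           # str, assuming UTF-8 text
print(readme_text)
```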

## 2. Getting Tweets with a Specific #

In [16]:
import twitter
import json

Using the wrapper methods of the [python-twitter](https://python-twitter.readthedocs.io/en/latest/twitter.html) library

In [20]:
api = twitter.Api(consumer_key = cfg.twitter_api['consumer_key'],
                  consumer_secret = cfg.twitter_api['consumer_secret'],
                  access_token_key = cfg.twitter_api['access_token'],
                  access_token_secret = cfg.twitter_api['access_token_secret'])

In [31]:
# The following function collects real-time tweets and saves them on our computer

# Data is returned for any tweet mentioning one of the strings in the list FILTER
FILTER = ['datascience']

# Languages to filter tweets by, given as a list. Twitter joins the list
# and returns only tweets in those languages (here, English).
LANGUAGES = ['en']


def retrieve_tweets(path, FILTER, LANGUAGES):
    with open(path + 'output.txt', 'a') as f:
        # api.GetStreamFilter will return a generator that yields one status
        # message (i.e., Tweet) at a time as a JSON dictionary.
        counter = 0
        for line in api.GetStreamFilter(track=FILTER, languages=LANGUAGES):
            counter += 1
            f.write(json.dumps(line))
            f.write('\n')
            print(counter)
            if counter == 5:
                break

In [32]:
retrieve_tweets(path='', FILTER=FILTER, LANGUAGES=LANGUAGES)

1
2
3
4
5
6
7
8
9
10
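
Because `retrieve_tweets` writes one JSON document per line, the file can be read back line by line. The sketch below shows the round trip with two made-up records (real tweets have many more fields) and a temporary directory, so it does not touch your own `output.txt`.

```python
import json
import os
import tempfile

# Hypothetical stand-ins for tweets; real status messages are much larger
records = [
    {"id": 1, "text": "Loving #datascience"},
    {"id": 2, "text": "APIs are handy"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.txt")

    # Write: one JSON document per line, as retrieve_tweets does
    with open(path, "a") as f:
        for record in records:
            f.write(json.dumps(record))
            f.write("\n")

    # Read: parse each non-empty line back into a dictionary
    with open(path) as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print([tweet["text"] for tweet in loaded])
```

The resulting list of dictionaries can then be passed to `pd.DataFrame`, as in the Translink demo.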


In [33]:
?api.GetStreamFilter