# LIS 875 Text Mining: Week 10 (Web I/O using OAuth2)
* OAuth2 authentication
* https://www.reddit.com/prefs/apps
* https://www.reddit.com/dev/api

* redirect URL: http://localhost:8080

## Register for a Reddit developer API key
* Sign up for an account: https://www.reddit.com/
* Register an API key: https://www.reddit.com/prefs/apps ("create app"-->"script")
* Replace the CLIENT_ID and CLIENT_SECRET in the code with your own

OAuth2 is one of the most popular web API authentication methods. Here we use Reddit only as an example, but you can use a similar method to request web services from other sites, e.g., Microsoft's Bing search API.

You should always keep your keys confidential. Some services (e.g., Bing and Amazon's web services API) will charge you based on the usage of your keys.
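
One way to keep keys out of a shared notebook is to read them from environment variables instead of typing them into a cell. The sketch below assumes two hypothetical variable names, REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET; Reddit does not define these names, so pick whatever suits your setup.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; they are not defined by Reddit.
    """
    try:
        return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending empty keys
        raise RuntimeError(f"Set the environment variable {missing} first") from None


# Usage (after setting both variables in your shell):
# CLIENT_ID, CLIENT_SECRET = load_credentials()
```

Passing a plain dictionary as env makes the function easy to try out without touching your real environment.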

Note that the "script" API key may expire every few hours, so you may need to create a new one often.


![](https://drive.google.com/uc?export=view&id=1XckhPEbnreUoDq-cjfXXtbvdOz8ksH0K)


In [None]:
# make sure to replace these with your own CLIENT_ID and CLIENT_SECRET

CLIENT_ID = "ffOmssP-iEL5hxyTzfS5Bw"
CLIENT_SECRET = "kfDT9-7qr2RhTAqmeahs8vVGtULdaA"

## Connect and Authenticate

In [None]:
import requests

auth = requests.auth.HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET)

# change the following to your username and password
data = {'grant_type': 'password',
        'username': 'Your Reddit username',
        'password': 'Your Reddit password'}

# set up our header info, which gives Reddit a brief description of our app
headers = {'User-Agent': 'teaching demo'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, 'Authorization': f"bearer {TOKEN}"}

# while the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

<Response [200]>
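
If the username, password, or keys are wrong, the token response will not contain an access_token and the dictionary lookup above fails with a KeyError. A minimal sketch of a more explicit check follows. build_auth_headers is a hypothetical helper, and it assumes a failed request returns JSON with an error field, which you should confirm against a real response.

```python
def build_auth_headers(token_payload, user_agent="teaching demo"):
    """Turn the parsed JSON from the access_token endpoint into request headers."""
    if "access_token" not in token_payload:
        # Assumption: a failed request carries an 'error' field instead of a token
        reason = token_payload.get("error", "unknown error")
        raise ValueError(f"No access token in response: {reason}")
    return {
        "User-Agent": user_agent,
        "Authorization": f"bearer {token_payload['access_token']}",
    }


# Usage with the response from the token request:
# headers = build_auth_headers(res.json())
```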

## Check the API documentation to see what services are available, then request one and inspect the output
* https://www.reddit.com/dev/api

In [None]:
res = requests.get("https://oauth.reddit.com/r/python", headers=headers)

data = res.json()
data

In [None]:
data['data']

In [None]:
data['data']['children'][2]
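
The nesting explored in these cells (data, then children, then each child's own data) can be wrapped in a small helper. In this sketch, post_titles is a hypothetical name and sample_listing is made-up data that only mimics the shape of a listing response; real posts carry many more fields.

```python
def post_titles(listing):
    """Return the title of every post in a Reddit listing response."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


# Made-up data shaped like a listing response (not real Reddit output)
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_hypothetical",
        "children": [
            {"kind": "t3", "data": {"title": "First made-up post", "score": 10}},
            {"kind": "t3", "data": {"title": "Second made-up post", "score": 3}},
        ],
    },
}

print(post_titles(sample_listing))
```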

## URL parameters
* Many web applications and APIs pass parameters in the URL.
* The format is: baseurl?param1=value1&param2=value2

For example, when you search for the query "uw-madison" in Google, you can look at the URL of the search results page to get an idea of Google's parameters:

https://www.google.com/search?q=uw-madison&biw=1295&bih=1287&ei=1dQgYqXFA9XK0PEP5L21sAY&ved=0ahUKEwjlg8C8l6r2AhVVJTQIHeReDWYQ4dUDCA4&oq=uw-madison&gs_lcp=Cgdnd3Mtd2l6EAxKBAhBGABKBAhGGABQAFgAYABoAHABeACAAQCIAQCSAQCYAQA&sclient=gws-wiz

Go to the second page of search results and see how the URL changes (note the added start=10 parameter):

https://www.google.com/search?q=uw-madison&ei=G9QgYt7NCKHB0PEPzMOZwAg&start=10&sa=N&ved=2ahUKEwiex-zjlqr2AhWhIDQIHcxhBogQ8NMDegQIAhBF&biw=1295&bih=1287&dpr=1
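
To see the baseurl?param1=value1&param2=value2 structure directly, the URL can be split with Python's standard urllib.parse module. This sketch uses a shortened copy of the second-page URL, with the long ei and ved values left out for readability.

```python
from urllib import parse

# Shortened copy of the second-page URL (ei and ved parameters omitted)
url = "https://www.google.com/search?q=uw-madison&start=10&sa=N&biw=1295&bih=1287&dpr=1"

parts = parse.urlsplit(url)

# Everything before the '?' is the base URL
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs maps each parameter name to a list of its values
params = parse.parse_qs(parts.query)

print(base_url)
print(params["q"], params["start"])
```

parse_qs returns lists because a parameter name may legally appear more than once in a query string.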

In [None]:
# let's take a look at another API -- reddit's search API

from urllib import parse

paramstr = parse.urlencode({
    'q':'list comprehension'
    }
)

url = "https://oauth.reddit.com/r/python/search?" + paramstr

url

'https://oauth.reddit.com/r/python/search?q=list+comprehension'

In [None]:

res = requests.get(url, headers=headers)

data = res.json()
data['data']['children'][2]

## In-class Exercise
* Take a look at UW-Madison's subreddit: https://www.reddit.com/r/UWMadison/
* Use the API to request the newest 100 posts
* Summarize the 50 most frequent words in the posts (remove stop words and punctuation; apply stemming and case-folding)
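
Before reaching for an NLP library, it can help to see the counting steps on their own. The sketch below uses only the standard library on made-up titles. STOP_WORDS is a tiny hypothetical list, and stemming is deliberately left out, so treat it as an outline of the pipeline rather than a solution to the exercise.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list; a real one is much longer
STOP_WORDS = {"a", "an", "and", "the", "for", "in", "is", "of", "on", "to"}


def word_counts(titles):
    """Count case-folded words in titles, skipping stop words and punctuation."""
    counts = Counter()
    for title in titles:
        # Lowercase first, then keep runs of letters, digits, and apostrophes
        tokens = re.findall(r"[a-z0-9']+", title.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# Made-up titles, not real posts
titles = [
    "Best study spots on campus?",
    "Study abroad: is the deadline today?",
    "Campus parking for the game",
]

print(word_counts(titles).most_common(5))
```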

In [None]:
# make sure the required python packages are installed

# install nltk (we'll use 3.6.7 in Spring 2022)
!pip install nltk==3.6.7 --upgrade

# install spacy (we'll use 3.2.1 in Spring 2022)
!pip install spacy==3.2.1 --upgrade

# download the spacy en_core_web_sm model (3.2.0 version)
!python -m spacy download en_core_web_sm-3.2.0 --direct

In [None]:
# identify the API you want to use

from urllib import parse

paramstr = parse.urlencode({
      'limit':100
    }
)

url = "https://oauth.reddit.com/r/UWMadison/new?" + paramstr

url

'https://oauth.reddit.com/r/UWMadison/new?limit=100'
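
The exercise only needs 100 posts, which fits in a single request. If more were needed, one option is to follow the after cursor that a listing response includes and pass it back as the after parameter. The sketch below only builds the next URL; next_page_url is a hypothetical helper, and the cursor value in the usage line is made up.

```python
from urllib import parse


def next_page_url(base_url, listing, limit=100):
    """Build the URL for the page after `listing`, or None if there is no cursor."""
    cursor = listing["data"].get("after")
    if cursor is None:
        # No cursor means there is nothing further to request
        return None
    return base_url + "?" + parse.urlencode({"limit": limit, "after": cursor})


# Usage with a made-up cursor value
fake_listing = {"data": {"after": "t3_abc123", "children": []}}
print(next_page_url("https://oauth.reddit.com/r/UWMadison/new", fake_listing))
```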

In [None]:
# send request to API and get data back

from datetime import datetime

res = requests.get(url, headers=headers)

data = res.json()
# pair each post's creation time (in the local time zone) with its title
[(datetime.fromtimestamp(post['data']['created']).strftime("%m/%d/%Y, %H:%M:%S"),
  post['data']['title'])
 for post in data['data']['children']]

In [None]:
# count word frequency

import spacy
from collections import Counter

nlp = spacy.load( "en_core_web_sm", disable=["parser", "ner"] )

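# lemmatize and lowercase each title token, skipping stop words and punctuation
# (lemmatization stands in for stemming here)
tokens = [token.lemma_.lower()
          for post in data['data']['children']
          for token in nlp(post['data']['title'])
          if not token.is_stop and not token.is_punct]

Counter(tokens).most_common(50)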


## In-class Exercise
* Explore the other parameters of Reddit's search API
* Read the API documentation and change your request parameters to see how the results differ
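
As a starting point for the exploration, here is a sketch of a URL builder for the search endpoint. search_url is a hypothetical helper. The parameter names restrict_sr (limit results to the subreddit), sort, and t (time window) reflect my reading of the API documentation, so check the accepted values there before relying on them.

```python
from urllib import parse


def search_url(subreddit, query, **options):
    """Build a search URL for one subreddit; extra options become URL parameters."""
    # Assumption: restrict_sr=on keeps results inside the given subreddit
    params = {"q": query, "restrict_sr": "on", **options}
    return f"https://oauth.reddit.com/r/{subreddit}/search?" + parse.urlencode(params)


print(search_url("UWMadison", "ischool", sort="top", t="year", limit=5))
```

Changing one option at a time makes it easier to see which parameter caused a difference in the results.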

In [None]:
# identify the API you want to use

from urllib import parse

paramstr = parse.urlencode({
      'q':'ischool',
      'limit':5,
      'sort':'new'
    }
)

url = "https://oauth.reddit.com/r/UWMadison/search?" + paramstr

url

'https://oauth.reddit.com/r/UWMadison/search?q=ischool&limit=5&sort=new'

In [None]:
res = requests.get(url, headers=headers)

data = res.json()

[ post['data']['title'] for post in data['data']['children'] ]

['my grad school application hub/ tracker',
 'Switching to online school halfway through senior year',
 'Spring and Summer 2023 sublease',
 'Enrollment in restricted class',
 'Tuition Increase Fall 2023']