# Accessing remote resources from Python and web scraping

In [None]:
from __future__ import print_function, unicode_literals

## Dependencies

To install dependencies for this tutorial, if you have conda:

    conda install requests beautifulsoup4
    pip install twitter ads pygithub
    
and if not using conda:

    pip install requests beautifulsoup4 twitter ads pygithub

## Introduction to GET, POST, and PUT

When we access a web address (URL) through a browser, we are sending requests to a server and receiving responses back from it. There are different kinds of requests, and we will look at three of these here:

### GET requests

This is the default type of request when you open a URL in a browser. For example, if you access http://www.google.com, you are implicitly doing a GET request. Here is what Google returns when you send a

    GET http://www.google.com
    
request:

    200 OK
    Date:  Mon, 02 Nov 2015 05:58:03 GMT
    Expires:  -1
    Cache-Control:  private, max-age=0
    Content-Type:  text/html; charset=UTF-8
    p3p:  CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
    Content-Encoding:  gzip
    Server:  gws
    X-XSS-Protection:  1; mode=block
    X-Frame-Options:  SAMEORIGIN
    Set-Cookie:  PREF=ID=1111111111111111:FF=0:TM=1446443883:LM=1446443883:V=1:S=MArUR2er4w4bbp8V; expires=Thu, 31-Dec-2015 16:02:17 GMT; path=/; domain=.google.com.au NID=73=Ge4XbDeJ8ahg7gLQOb3tlZPb-54GTW8SQmEifTRC9RpYnKywKCJh0zg-yiW3kL5MVgn6iMS9zmdIK-FBLjkGaC_yt4zIPlDFoiT5NTUZ-k_yeH28_1jHwgYWXugdMJtV1OCp3VUNlxE7A67vQdihcnsZuzcixAe5; expires=Tue, 03-May-2016 05:58:03 GMT; path=/; domain=.google.com.au; HttpOnly
    alternate-protocol:  443:quic,p=1
    Alt-Svc:  quic="www.google.com:443"; p="1"; ma=600,quic=":443"; p="1"; ma=600
    X-Firefox-Spdy:  h2

    <!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-AU"><head><meta content="/logos/doodles/2015/george-booles-200th-birthday-5636122663190528.2-hp.gif" itemprop="image"><link href="/images/branding/product
    etc.
    
This is the full response (a status line, the headers, and then the start of the HTML body), which a browser normally hides from us. Note the content type:

    Content-Type:  text/html; charset=UTF-8

which we'll look at again later. A GET request can also include data in the URL: https://www.google.com.au/?q=dotastronomy

A GET request should not have any side effects, i.e. it should not change anything on the server.
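
As a minimal sketch of how data ends up in a GET URL, the query string can be built and taken apart with Python's standard library. This assumes Python 3 (where the module is called `urllib.parse`); nothing is sent over the network here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build the query string from a dictionary and append it to the base URL
base_url = 'https://www.google.com.au/'
url = base_url + '?' + urlencode({'q': 'dotastronomy'})
print(url)

# Characters that are not URL-safe are escaped; a space becomes '+'
spaced = urlencode({'q': 'dot astronomy'})
print(spaced)

# Going the other way: recover the data from the URL.
# parse_qs returns a list for each key because keys may be repeated.
parsed = parse_qs(urlsplit(url).query)
print(parsed)
```

Because everything is visible in the URL, data sent this way ends up in browser history and server logs, which is one reason GET is kept for requests that only read data.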

### POST and PUT requests

Unlike GET requests, POST and PUT requests are used to send data that may, for example, be stored on the server, so they explicitly write data to the server. (The distinction between the two is beyond the scope of this tutorial; you can read people discussing it [here](http://stackoverflow.com/questions/630453/put-vs-post-in-rest).)

In this case the data is not encoded in the URL but is instead sent in the body of the request. There is no easy way to send a POST request by hand from the browser, but we can do this from Python.

## The requests library

Python includes a library to get and post data to the web, [urllib](https://docs.python.org/3.5/library/urllib.html), but it is not straightforward to use, so a group of developers made [requests](http://docs.python-requests.org), a Python library that does *HTTP for Humans*.

In [None]:
import requests

In [None]:
r = requests.get('http://www.google.com')

In [None]:
r.status_code

In [None]:
r.headers

In [None]:
r.headers['content-type']

In [None]:
r.content[:1000]

In [None]:
r.text[:1000]

Including data in the GET request is easy:

In [None]:
r = requests.get('http://www.google.com', params={'q': 'dotastronomy'})

In [None]:
r.status_code

In [None]:
r.request.url

To send a POST request instead, we can simply replace ``requests.get`` with ``requests.post``.
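
To make the difference between the two concrete, the sketch below builds (but does not send) a GET and a POST request carrying the same data, using `requests.Request(...).prepare()`. The URL `http://example.com/search` is a placeholder chosen for illustration, not a real endpoint.

```python
import requests

payload = {'q': 'dotastronomy'}

# Prepare the requests without sending them, so we can inspect them offline.
# 'http://example.com/search' is a hypothetical URL used only for illustration.
get_req = requests.Request('GET', 'http://example.com/search', params=payload).prepare()
post_req = requests.Request('POST', 'http://example.com/search', data=payload).prepare()

# GET: the data is part of the URL and there is no body
print(get_req.url)
print(get_req.body)

# POST: the URL is unchanged and the data travels in the request body
print(post_req.url)
print(post_req.body)
print(post_req.headers['Content-Type'])
```

In everyday use `requests.get(url, params=...)` and `requests.post(url, data=...)` do the preparing and sending in one step.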

## Scraping HTML

Let's say we want to extract ('scrape') data off an ADS webpage:

http://adsabs.harvard.edu/abs/2013A%26A...558A..33A

We can start off by getting the source for the page:

In [None]:
r = requests.get('http://adsabs.harvard.edu/abs/2013A%26A...558A..33A')

In [None]:
print(r.text[:1000].strip())

We can now use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) to parse this:

In [None]:
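import bs4
# The 'lxml' parser requires the separate lxml package (pip install lxml);
# 'html.parser' is a built-in alternative that needs no extra install.
soup = bs4.BeautifulSoup(r.content, 'lxml')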

In [None]:
soup.title

In [None]:
link = soup.find_all('a')[1]
link

In [None]:
link.attrs

In [None]:
link.contents

In [None]:
table = soup.find_all('table')[0]

In [None]:
len(soup.find_all('a'))

In [None]:
len(soup.find_all('a', {'class':'oa'}))

Note that the above example is for demonstration only. In practice, there is a proper way to access the ADS data that does not involve scraping (see further down).
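
Since the live ADS page may change over time, here is a small self-contained sketch of the same BeautifulSoup calls applied to an inline HTML string. The HTML snippet is made up for illustration, and the built-in `'html.parser'` is used so that nothing beyond `beautifulsoup4` is needed.

```python
import bs4

# A made-up HTML document standing in for a downloaded page
html = """
<html><head><title>Example paper</title></head>
<body>
  <a href="/abs/1" class="oa">Open access copy</a>
  <a href="/abs/2">Publisher copy</a>
  <table><tr><td>Authors</td><td>A. Author</td></tr></table>
</body></html>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# The <title> element and its text
print(soup.title.string)

# All links, and the attributes and text of the first one
links = soup.find_all('a')
first = links[0]
print(len(links), first['href'], first.get_text())

# Only the links that carry the CSS class 'oa'
oa_links = soup.find_all('a', {'class': 'oa'})
print(len(oa_links))

# The text of every cell in the first table
cells = [td.get_text() for td in soup.find_all('table')[0].find_all('td')]
print(cells)
```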

## APIs (Application programming interfaces)

In practice, scraping HTML code is hard, and should be avoided whenever possible. A number of websites now offer APIs, which are documented ways of accessing machine-readable data.

For example, GitHub offers a way to access data about users and repositories through an API that is described [here](https://developer.github.com/v3/).

Many APIs require authentication, which I won't cover here, but there are a few that do not.

Let's take a look at one of the examples for GitHub which does not require authentication: https://developer.github.com/v3/users/

In [None]:
r = requests.get('https://api.github.com/users/astrofrog')

In [None]:
r.content

In [None]:
r.headers['content-type']

It looks like the data was returned in JSON. Let's take a look in more detail!

## JSON (JavaScript Object Notation)

JSON is a very common data format used by many APIs. A JSON document is a string whose contents map naturally onto Python strings, numbers, lists, and dictionaries. We can easily transform a JSON response into an actual Python object with the requests library:

In [None]:
data = r.json()

In [None]:
data

This is now a Python dictionary! We can access keys with:

In [None]:
data['name']

In [None]:
data['location']
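
When calling an API like this from a script rather than interactively, it can help to check that the request succeeded before trusting the body. The helper below is one possible sketch: the name `fetch_json` and its `session` parameter are hypothetical (introduced here for illustration), while `timeout`, `raise_for_status()` and `json()` are part of requests.

```python
import requests


def fetch_json(url, session=None, timeout=10):
    """Fetch `url` and return the decoded JSON body, raising on HTTP errors.

    `session` is a hypothetical hook: any object with a requests-style
    `get` method can be passed in, e.g. a `requests.Session`.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.json()


# Usage (needs network access):
# data = fetch_json('https://api.github.com/users/astrofrog')
# print(data['name'])
```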

In general, APIs such as this can also accept POST requests for actions that change the repository or the user (see e.g. [this](https://developer.github.com/v3/repos/releases/#create-a-release) for an example of a possible POST request). The data can be passed to ``requests.post`` as a normal Python dictionary, using the ``data`` argument for form-encoded data or the ``json`` argument when the API expects a JSON body.

Some APIs also use PUT instead of POST - in that case, use ``requests.put``. 

If you need to parse JSON manually, you can use the built-in [json](https://docs.python.org/3.5/library/json.html) library:

In [None]:
import json
d = json.loads('{"a":1}')
d['a']
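
As a slightly larger sketch of how JSON types map onto Python types, the example below parses a made-up record (the field values are invented for illustration, not taken from a real API response) and converts it back to a string.

```python
import json

# A made-up JSON document containing the main JSON types
text = ('{"login": "example-user", "public_repos": 3, "hireable": null, '
        '"languages": ["Python", "C"], "plan": {"private": false}}')

record = json.loads(text)

# object -> dict, array -> list, number -> int/float,
# true/false -> True/False, null -> None
print(type(record))
print(record['public_repos'] + 1)
print(record['hireable'] is None)
print(record['languages'][0])
print(record['plan']['private'])

# json.dumps goes the other way, from Python objects back to a JSON string
back = json.dumps(record, sort_keys=True)
print(back)
```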

## Specialized libraries

In some cases, you don't actually need to use the APIs directly, but you can use existing packages that will provide a 'Pythonic' interface to various websites.

### GitHub

For example, we can use [PyGithub](https://github.com/PyGithub/PyGithub):

In [None]:
from github import Github

In [None]:
g = Github()

In [None]:
user = g.get_user('astrofrog')

In [None]:
user.name

In [None]:
user.location

### Twitter

There are also libraries for e.g. [Twitter](http://www.twitter.com):

In [None]:
from twitter import Twitter, oauth_dance, OAuth

In [None]:
# Normally these should be secret but I am using a dummy account, so this is fine
CONSUMER_KEY = 'igqeyII44KCFOFLGqrYYjU6qo'
CONSUMER_SECRET = '88vrcN4xjrDPyMNr8jgpB82ar3nddFtRfjqOjBjTwG7iSXkZr2'

oauth_token, oauth_token_secret = oauth_dance("Test Bot", CONSUMER_KEY, CONSUMER_SECRET)

In [None]:
oauth_token, oauth_token_secret

In [None]:
t = Twitter(auth=OAuth(oauth_token, oauth_token_secret, CONSUMER_KEY, CONSUMER_SECRET))

In [None]:
t.statuses.home_timeline()

In [None]:
result = t.statuses.update(status="Testing, 1, 2, 3")

In [None]:
tweets = t.search.tweets(q='#dotastro')

In [None]:
len(tweets['statuses'])

In [None]:
tweets['statuses'][1]['user']['screen_name']

In [None]:
tweets['statuses'][1]['text']

In [None]:
response = t.statuses.update(status='@davidwhogg FRA ✈ SYD #dotastro')
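
For a real account, the consumer key and secret should not be written into the notebook. One common approach, sketched here under the assumption that the keys have been exported as environment variables, is to read them at run time. The variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` and the helper `load_credentials` are hypothetical names chosen for this example.

```python
import os


def load_credentials(environ=os.environ):
    """Return (consumer_key, consumer_secret) read from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    try:
        return (environ['TWITTER_CONSUMER_KEY'],
                environ['TWITTER_CONSUMER_SECRET'])
    except KeyError as exc:
        # exc.args[0] is the name of the missing variable
        raise RuntimeError('Missing environment variable: {0}'.format(exc.args[0]))


# Usage, once the variables are set in the shell that starts Python:
# CONSUMER_KEY, CONSUMER_SECRET = load_credentials()
```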

### ADS

A final example is the [ads](https://github.com/andycasey/ads) library (``pip install ads``). To use this, we need an API key, which we get by creating an account [here](https://ui.adsabs.harvard.edu) and which we then save in the file ``~/.ads/dev_key``.

In [None]:
import ads

In [None]:
papers = ads.SearchQuery(q="author:hogg,d.", sort="citation_count")

In [None]:
for paper in papers:
    print(paper.title[0], paper.citation_count)

## Python on the server

If you are interested in developing a web server that *runs* Python, you can look into [Django](https://www.djangoproject.com/) and [Flask](http://flask.pocoo.org/). You can then host these types of apps on services like [Heroku](https://www.heroku.com/).