# Web scraping and JSON APIs

The tasks that this will demo:
  
* Downloading websites with a given URL
* Using predictable patterns to construct URLs for downloading
    * Single pages
    * Page ranges (e.g. parsing through a list of search results)
    * Feeding a series of URLs into Python to download to disk

# This is all still just text

APIs are nifty and using them is often another matter of manipulating text.

Each one works a bit differently, so you'll need to wrap your head around how that particular API wants its text put together.  Read the documentation with that question in mind, and expect to do a substantial amount of experimentation as well.

Let's play with one that:

* doesn't want a log in to mess with
* isn't super restricted
* can handle a load of humans whacking on it

Once you know how to craft the queries, you can use Python to make it happen.  These queries are really just URLs with other information in them.  You'll use string methods to construct those URLs and then hand them to Python to send off.
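
As a minimal sketch of the idea that a query is just a URL with extra information in it, the block below builds one from a base address and a dictionary of parameters using only the standard library. The base URL and the parameter names are made up for illustration; a real API's documentation will tell you what it actually expects.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only.
base_url = "https://api.example.org/search"
params = {"query": "tuatara", "page": 2}

# urlencode turns the dictionary into "key=value" pairs joined by "&",
# escaping any characters that aren't safe to put in a URL.
url = base_url + "?" + urlencode(params)
print(url)
```

Plain string concatenation works too; `urlencode` mostly earns its keep when a search term contains spaces or punctuation.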

## Before you start anything, you need to know:

* Are there any restrictions on how many times you can hit their servers?
* What delay should you have between requests?
* How should the URLs be constructed?
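
One way to keep the delay question from being forgotten is to put the pause in a single helper. This is only a sketch: `fetch_politely` and `fake_fetch` are hypothetical names, and the stand-in download function exists so the example runs without touching anyone's server.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, pausing between calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in for a real download function, so this sketch runs offline.
def fake_fetch(url):
    return "contents of " + url


pages = fetch_politely(
    ["http://example.org/a", "http://example.org/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

In real use you would pass in a function that wraps your actual download call, and set the delay to whatever the service's documentation asks for.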

## Lorem ipsum API

We're going to use this one to explore the mechanism of putting in a request, receiving results, and saving the content.  Then we'll move on to...

## DataCite

This is a service that deals with DOIs for datasets.  I don't want to get too much into the weeds about this because it isn't important. What you need to know is that they also have an API to query and serve up metadata records about those datasets.

Let's write a program that takes in a search term and gets all the data records out of it.

We aren't going to be focusing too much on how they work because every API works differently.  

## What's the end point?

Here's how this one works:

* There's a base URL that you can plunk a query text into and it'll serve up results, but the pages will be numbered.
* It'll serve the results as JSON, and the metadata XML payload will be in there as base64.
* It'll also state which page we're on and how many pages there are.
* We want to hit their servers as little as possible, so we'll grab the results pages once and then get the metadata out from our local copies.
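
To make the "XML inside base64 inside JSON" layering concrete, here is a small offline sketch. The response is built by hand, and the field names (`page`, `total_pages`, `records`, `xml`) are assumptions for illustration, not the names DataCite uses.

```python
import base64
import json

# Build a made-up response so the sketch runs offline. The field names
# ("page", "total_pages", "records", "xml") are assumptions, not DataCite's.
xml_text = "<resource><title>Tuatara survey</title></resource>"
fake_response = json.dumps({
    "page": 1,
    "total_pages": 2,
    "records": [
        {"xml": base64.b64encode(xml_text.encode("utf-8")).decode("ascii")}
    ],
})

parsed = json.loads(fake_response)
print("page", parsed["page"], "of", parsed["total_pages"])

# base64 decoding gives bytes; decode those bytes to get the XML back as text.
decoded_xml = base64.b64decode(parsed["records"][0]["xml"]).decode("utf-8")
print(decoded_xml)
```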


We're going to use two packages that will be new to us:

1. `requests` to download the URLs
2. `json` to work with the data being served to us and get a few points of information out of it

## Getting just one page

The first step here is just to get one page into memory.  Websites are usually just text, or at least what's being served to you is text.  So we have a choice:  we can keep it in memory or we can write it out to a file.

We'll start with keeping it in memory as a string.  Once we have the data as a string we can pass it into something to parse it.  For example, sometimes it'll be JSON or XML data, and while that's just a plain text file (and thus can be stored as a string), the string structure doesn't know the ins and outs of that data format.  Meaning that while we can see it, write it, and manipulate it as text, we can't query it with that data format's native methods.
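
A quick sketch of that difference, using a small made-up JSON document: as a string it only supports string operations, but once parsed with the standard `json` module it becomes a dictionary that we can query by key.

```python
import json

# A small JSON document held as a plain string (made-up content).
raw = '{"title": "Snake diets", "year": 2015, "tags": ["reptile", "ecology"]}'

# As a string, we can only do string things, like searching for characters.
print(raw.find("year"))

# After parsing, it's a dictionary, and we can look values up by key.
record = json.loads(raw)
print(record["year"])
print(record["tags"][0])
```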

There are special considerations if you need to deal with a page or API that requires a key or password, and those will not be covered here.  However, everything below still applies in that case.

### Getting it into memory

We're going to use the `requests` module, which is not part of the standard library, but is one of the gold standard packages for dealing with this stuff and should come with your normal Anaconda installer.

There are two phases here:  

1. Have requests make a connection to that website.
2. Extract what you want out of there.

There are nice ways to check the status codes and other HTTPish things, but we're going to focus on grabbing the results first.
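
For when you do want to check, here is a minimal sketch of a status check. Response objects from `requests` expose the HTTP status as `.status_code` and the body as `.text`; the stand-in objects below mimic just those two attributes so the example runs without a network, and `text_if_ok` is a hypothetical helper name.

```python
from types import SimpleNamespace


def text_if_ok(response):
    """Return the body text for a 200 response, otherwise None."""
    # 200 is the HTTP status code for a successful request.
    if response.status_code == 200:
        return response.text
    print("request failed with status", response.status_code)
    return None


# Stand-ins for real responses, so this sketch runs without a network.
good = SimpleNamespace(status_code=200, text="Lorem ipsum")
missing = SimpleNamespace(status_code=404, text="Not Found")

print(text_if_ok(good))
print(text_if_ok(missing))
```

With a real response you can also call its `raise_for_status()` method, which raises an exception for error statuses instead of returning quietly.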

We'll use a lorem ipsum API to play with first to get the hang of getting stuff back from requests, and then we'll play with DataCite.

We'll want to use this URL:  https://loripsum.net/api/1/plaintext/short

Broken down, this URL will give us 1 short paragraph formatted in plain text.  You can read about the options at https://loripsum.net/, which has a system where you can play with all the options and see the results.

This gives plain text back.  Let's first get that to print out.

#### Making the request

Before you do anything you must import the requests module.

In [48]:
import requests

Now we can make our first request.  We give `requests.get` a URL and it gives us back a response object, which we can later ask for more information.

In [51]:
import requests

url = "https://loripsum.net/api/1/plaintext/short"
result = requests.get(url)

print(result)

<Response [200]>


So what we're seeing here is a successful connection, but not the text.  We have to ask for that explicitly from our result object.

We do this with `.text` (no parens!).  This asks for a value stored on our object rather than calling a function.  Some objects just work this way, and we learn which is which by looking at the documentation or a tutorial.

In [52]:
print(result.text)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Primum in nostrane potestate est, quid meminerimus? Primum in nostrane potestate est, quid meminerimus? 




Cool! We can play with constructing URLs in a loop here.  

This tool allows you to specify some parameters in the URL, separated by `/` characters.  Options include having it be short, medium, long, verylong.  Let's loop through these options and look at what's returned.  Since we'll be making multiple calls, we also need to add a time delay.

In [55]:
import requests
import time

options = ['short', 'medium', 'long', 'verylong']

for length in options:
    print("here's", length)
    url = "https://loripsum.net/api/1/plaintext/" + length
    result = requests.get(url)
    print(result.text)
    time.sleep(2)

here's short
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Summum a vobis bonum voluptas dicitur. In schola desinis. Num quid tale Democritus? Omnis enim est natura diligens sui. Stoicos roga. Cum audissem Antiochum, Brute, ut solebam, cum M. Sed ille, ut dixi, vitiose. Haec igitur Epicuri non probo, inquam. 


here's medium
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quare obscurentur etiam haec, quae secundum naturam esse dicimus, in vita beata; Expectoque quid ad id, quod quaerebam, respondeas. Sed ad bona praeterita redeamus. Atqui reperies, inquit, in hoc quidem pertinacem; Et quidem iure fortasse, sed tamen non gravissimum est testimonium multitudinis. Duo Reges: constructio interrete. Hoc est vim afferre, Torquate, sensibus, extorquere ex animis cognitiones verborum, quibus inbuti sumus. 


here's long
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Alia quaedam dicent, credo, magna antiquorum esse peccata, quae ille veri investigandi cupidus nul

As you can see here, the URL is a string, so we can use string methods to change it as part of an iteration.

There are other times when a URL has numbers in it, such as a page number, that you need to loop through.  In these cases, you can generate the numbers that you want from a for loop, recast them to a string, and then add them to the URL string that you need.

For example:

In [57]:
for i in range(1, 6):
    print("http://www.something.com/page=" + str(i))

http://www.something.com/page=1
http://www.something.com/page=2
http://www.something.com/page=3
http://www.something.com/page=4
http://www.something.com/page=5


Let's change gears and take a look at the DataCite API.

Here's an example search:  https://search.datacite.org/works?query=tuatara

This gives you two pages of results, and when we click on the next page, the URL reveals to us the page number structure.

So here's page 2's url:  https://search.datacite.org/works?query=tuatara&page=2

So we can try changing that 2 to 1: https://search.datacite.org/works?query=tuatara&page=1 and see if that indeed gives us the first page of results.

So if we look at a larger search:  https://search.datacite.org/works?query=snake

We can look at how many pages there are, and at the bottom of the page it looks like the last one is page 40.  Let's try going there.  Indeed that looks right.

So how do we generate these 40 URLs?  Well, let's think:  how can we generate the numbers 1-40?  

* range(1, 41) will do that.

In [59]:
for i in range(1, 41):
    print("https://search.datacite.org/works?query=snake&page=" + str(i))

https://search.datacite.org/works?query=snake&page=1
https://search.datacite.org/works?query=snake&page=2
https://search.datacite.org/works?query=snake&page=3
https://search.datacite.org/works?query=snake&page=4
https://search.datacite.org/works?query=snake&page=5
https://search.datacite.org/works?query=snake&page=6
https://search.datacite.org/works?query=snake&page=7
https://search.datacite.org/works?query=snake&page=8
https://search.datacite.org/works?query=snake&page=9
https://search.datacite.org/works?query=snake&page=10
https://search.datacite.org/works?query=snake&page=11
https://search.datacite.org/works?query=snake&page=12
https://search.datacite.org/works?query=snake&page=13
https://search.datacite.org/works?query=snake&page=14
https://search.datacite.org/works?query=snake&page=15
https://search.datacite.org/works?query=snake&page=16
https://search.datacite.org/works?query=snake&page=17
https://search.datacite.org/works?query=snake&page=18
https://search.datacite.org/works?que

Once you can generate these URLs, you can send them through requests to get the data.

The next programming chunk that we will be covering will be working with JSON and getting the data out of it.  We can change part of the URL (`search` becomes `api`) to grab the JSON output.

In [63]:
for i in range(1, 41):
    print("https://api.datacite.org/works?query=snake&page=" + str(i))

https://api.datacite.org/works?query=snake&page=1
https://api.datacite.org/works?query=snake&page=2
https://api.datacite.org/works?query=snake&page=3
https://api.datacite.org/works?query=snake&page=4
https://api.datacite.org/works?query=snake&page=5
https://api.datacite.org/works?query=snake&page=6
https://api.datacite.org/works?query=snake&page=7
https://api.datacite.org/works?query=snake&page=8
https://api.datacite.org/works?query=snake&page=9
https://api.datacite.org/works?query=snake&page=10
https://api.datacite.org/works?query=snake&page=11
https://api.datacite.org/works?query=snake&page=12
https://api.datacite.org/works?query=snake&page=13
https://api.datacite.org/works?query=snake&page=14
https://api.datacite.org/works?query=snake&page=15
https://api.datacite.org/works?query=snake&page=16
https://api.datacite.org/works?query=snake&page=17
https://api.datacite.org/works?query=snake&page=18
https://api.datacite.org/works?query=snake&page=19
https://api.datacite.org/works?query=sna

Now we can use the `i` in our loop to help make file names.  Note that the request below spells the page parameter `page[number]` rather than `page`.

In [71]:
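import os
import time

import requests

folder = "results/"
os.makedirs(folder, exist_ok=True)  # the folder must exist before we can write into it

for i in range(1, 41):
    filename = folder + "result_page_" + str(i) + ".json"
    result = requests.get("https://api.datacite.org/works?query=snake&page[number]=" + str(i))
    print(result)
    # save the raw JSON text so we can parse it later without hitting the server again
    with open(filename, 'w', encoding='utf-8') as fout:
        fout.write(result.text)
    time.sleep(2)  # pause between requests to be polite to the server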

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>


In this way we are hard coding how many pages there are, which might be completely fine for our purposes.
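
If hard coding isn't fine, the endpoint description above says the response states how many pages there are, so the count could be read from the first page instead. The sketch below works on a made-up first page; the `meta` and `total_pages` names are placeholders for whatever the real API calls them.

```python
import json

# A made-up first page of results; the "meta" and "total_pages" names are
# placeholders for whatever the real API calls them.
first_page_text = '{"meta": {"page": 1, "total_pages": 3}, "data": []}'

first_page = json.loads(first_page_text)
total_pages = first_page["meta"]["total_pages"]

# Page 1 is already in hand, so only pages 2 onward still need requesting.
remaining_pages = list(range(2, total_pages + 1))
print(remaining_pages)
```

The page numbers in `remaining_pages` can then be fed into the same URL-building loop used above.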

# Block 10: Bio/break 2:30-2:45

# Block 11: Working with APIs

* Downloading websites with a given URL
* Using predictable patterns to construct URLs for downloading
    * Single pages
    * Page ranges (e.g. parsing through a list of search results)
* Discussion:  what have you tried or wanted to try before?

