# Interacting with the Web

The Internet is a gigantic data dump. There is all the social networking data from Facebook, Twitter, and so on. There is the news from all the traditional media sources plus Quartz, Vox, and so on. Then there is the data from organizations such as the World Bank, the Bureau of Labor Statistics, the US Census, or Chicago's Data Portal. Finally, you have all your scientific data sources: the National Cancer Institute, the Protein Data Bank, or the Kyoto Encyclopedia of Genes and Genomes.

How can you use Python to access those sites and retrieve data for your research, your business, or your hobby?

There are two main approaches to retrieving data from web sources. The preferred approach is using **Application Programming Interfaces**, or APIs. If an organization has decided to share its data, and it has the forethought and resources to do so, it will develop an API that lets you interact with its data.

If the organization does not have the forethought or resources to create an API (or if they do not want to share their data), then you have to **crawl** their website and **scrape** their data.

**Note:** Many organizations do not allow data scraping from their websites. In such cases there may be no API for retrieving the data, and the usual web-scraping libraries may not work either.

In this notebook, we are going to see a demo of retrieving real-time data using a small API.

# Application Programming Interfaces (APIs)


APIs simplify the process of obtaining specific information from a data source. You do not have to worry about figuring out the **format** in which the information is stored, or **where** the information is stored. All of those matters are handled seamlessly by the API.

But convenience is not the only advantage of an API. APIs are also particularly useful when:

* You want a small piece of a much larger set of data. **Reddit comments** are one example. What if you want to just pull your own comments on Reddit? It doesn’t make much sense to download the entire Reddit database, then filter just your own comments.
    
* There is repeated computation involved. **Spotify has an API that can tell you the genre of a piece of music**. You could theoretically create your own classifier, and use it to categorize music, but you’ll never have as much data as Spotify does.
    
* The data is changing quickly. An example of this is **stock price data**. It doesn’t really make sense to regenerate a dataset and download it every minute – this will take a lot of bandwidth, and be pretty slow.

* Many big companies maintain well-managed APIs through which we can retrieve their data. Since each website and its data structures may be different, it is important to read the documentation of the API you are using. For example, here is the link to the Twitter API documentation: https://developer.twitter.com/en/docs/twitter-api
    
    
    
### Making a request

In order to learn how APIs work, we will first use the APIs developed to retrieve data on the **International Space Station (ISS)**.  The relevant APIs can be found at http://open-notify.org/.  We will first consider the API for retrieving the location (latitude and longitude) of the ISS (http://open-notify.org/Open-Notify-API/ISS-Location-Now/). The API is hosted at http://api.open-notify.org/iss-now.json.

So, how do we make requests for information with this API?

Like standard webpages, APIs are hosted on web servers. When you type http://www.google.com in your browser’s address bar, your computer is actually asking the http://www.google.com server for a webpage, which the server then returns to your browser for display. That action is called a `request`. APIs work much the same way, except instead of your web browser asking for a webpage, your program asks for **data**. This data is usually returned in JSON format.

There are many possible types of requests. The most common, and the one we will be using throughout this unit, is the `GET` request. A `GET` request simply asks the server for the resource (a webpage or, for an API, data) found at the URL you specify.
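To make the idea of a request more concrete, the sketch below splits the ISS URL into the server being asked and the resource being asked for, then builds (without sending) the text of a minimal `GET` request. It uses only the standard library. The helper name `describe_get_request` is made up for illustration, and real clients such as `requests` send more headers than the single one shown here.

```python
from urllib.parse import urlsplit


def describe_get_request(url):
    """Return (host, path, request_text) for a GET request to `url`.

    Hypothetical helper for illustration only: nothing is sent over the network.
    """
    parts = urlsplit(url)
    host = parts.netloc          # the server that would receive the request
    path = parts.path or "/"     # the resource we are asking that server for
    if parts.query:
        path += "?" + parts.query
    # Minimal HTTP/1.1 request: a request line, a Host header, and a blank line.
    request_text = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
    return host, path, request_text


host, path, request_text = describe_get_request("http://api.open-notify.org/iss-now.json")
print(host)
print(path)
print(request_text)
```

The host is the web server and the path is what you want from it; a library such as `requests` takes care of assembling and sending this text for you.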

We will use the [`requests`](http://docs.python-requests.org/en/latest/user/quickstart/) package to crawl (load) webpages and scrape (download) their contents.

In [1]:
import requests  # third-party HTTP library: a collection of functions for making web requests
import json

response = requests.get("http://api.open-notify.org/iss-now.json")
print(response)
print(response.status_code)

<Response [200]>
200


Methods from the `requests` package return `Response` objects. One of the most important attributes of a response is its status code, which appears when the object is printed (`<Response [200]>`) and which we can also get explicitly through `response.status_code`.

Here are some of the most common status codes you might encounter:
* 200, **OK**. Standard response for successful HTTP requests. The actual response will depend on the request method used.
* 301, **Moved Permanently**. The server is redirecting you to a different endpoint. This and all future requests should be directed to the given URL. This can happen when a company switches domain names, or an endpoint name is changed.
* 303, **See Other**. The response to the request can be found under another URI using a GET method. When received in response to a POST (or PUT/DELETE), the client should presume that the server has received the data and should issue a separate GET request to the given URI. Your web browser fetches the new URL automatically, and `requests` follows redirects by default, but other crawling tools may not do so unless you tell them to.
* 400, **Bad Request**. The server cannot or will not process the request due to an apparent client error (e.g., malformed request syntax, invalid request message framing, or deceptive request routing).
* 401, **Unauthorized**. Similar to `403 Forbidden`, but specifically for use when authentication is required and has failed or has not yet been provided. The response must include a WWW-Authenticate header field containing a challenge applicable to the requested resource.
* 403, **Forbidden**. The request was a valid request, but the server is refusing to respond to it. `403` error semantically means "unauthorized", i.e. the user does not have the necessary permissions for the resource.
* 404, **Not Found**. The requested resource could not be found but may be available in the future. Subsequent requests by the client are permissible.
* 500, **Internal Server Error**. A generic error message, given when an `unexpected` condition was encountered and no more specific message is suitable.
* 503, **Service Unavailable**. The server is currently unavailable (because it is overloaded or down for maintenance). Generally, this is a temporary state.
* 504, **Gateway Timeout**. The server was acting as a gateway or proxy and did not receive a timely response from the upstream server.



More codes: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

The status code of our request was **200**. It means that all went well -- we successfully connected to the web address we wanted and downloaded its contents.
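As a rough illustration of how a program might interpret these codes, the sketch below looks up the standard phrase for a code with the standard library's `http.HTTPStatus` and groups codes by their first digit. The function name `describe_status` and the `CATEGORIES` table are assumptions made for this example, not part of `requests`.

```python
from http import HTTPStatus

# Coarse meaning of each status-code family, keyed by the first digit.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    phrase = HTTPStatus(code).phrase   # raises ValueError for codes Python does not know
    category = CATEGORIES[code // 100]
    return f"{code} {phrase} ({category})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

With a `requests` response you could call `describe_status(response.status_code)`. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.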

But the status code is not the only attribute available on a response:

In [2]:
print(response.url)

http://api.open-notify.org/iss-now.json


In [3]:
print(response.text)

{"message": "success", "iss_position": {"longitude": "-107.9806", "latitude": "50.9244"}, "timestamp": 1752983069}
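The body printed above is JSON text, so it still has to be parsed before the values can be used. The following sketch parses a copy of that text offline; it assumes, as the output suggests, that the coordinates arrive as strings and that `timestamp` is Unix time (seconds since 1970-01-01 UTC). The variable names are chosen for illustration.

```python
import json
from datetime import datetime, timezone

# The response body shown above, copied here so the example runs without a network call.
raw_text = (
    '{"message": "success", '
    '"iss_position": {"longitude": "-107.9806", "latitude": "50.9244"}, '
    '"timestamp": 1752983069}'
)

data = json.loads(raw_text)  # JSON text -> Python dict

# The coordinates are strings in the response, so convert them before doing arithmetic.
latitude = float(data["iss_position"]["latitude"])
longitude = float(data["iss_position"]["longitude"])

# Assumption: the timestamp is Unix time, interpreted here in UTC.
observed_at = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)

print(latitude, longitude)
print(observed_at.isoformat())
```

With a live response, `response.json()` does the parsing step for you and returns the same kind of dictionary.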


In [7]:
import requests  # third-party HTTP library: a collection of functions for making web requests
import json
import time

# Getting information in near real time
for i in range(1, 100):  # at most 99 readings
    # timeout=10 stops the request from hanging forever if the server does not answer
    response = requests.get("http://api.open-notify.org/iss-now.json", timeout=10)

    data = json.loads(response.text)  # parse the JSON body into a Python dict

    print("The ISS current position is {} of latitude and {} of longitude.".format(
        data['iss_position']['latitude'],
        data['iss_position']['longitude']))

    time.sleep(5)  # stay idle for 5 seconds before asking again

# This cell prints the ISS position every 5 seconds, up to 99 times.
# To stop it earlier, interrupt the cell execution.

The ISS current position is 44.5657 of latitude and -83.1466 of longitude.
The ISS current position is 44.1640 of latitude and -82.2148 of longitude.
The ISS current position is 44.0011 of latitude and -81.8458 of longitude.


KeyboardInterrupt: 
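One possible way to make a polling loop like this easier to reuse and test is to separate "how to fetch one reading" from "how often to ask". The sketch below is a minimal version of that idea, not the notebook's own method: `poll` and its parameters are hypothetical names, and a stand-in fetcher replays two of the readings printed above so the example runs offline.

```python
import time


def poll(fetch, count, delay_seconds, sleep=time.sleep):
    """Call `fetch()` `count` times, pausing `delay_seconds` between calls.

    `fetch` is any zero-argument callable returning the parsed JSON dict.
    Returns a list of (latitude, longitude) tuples as floats.
    """
    positions = []
    for attempt in range(count):
        data = fetch()
        position = data["iss_position"]
        positions.append((float(position["latitude"]), float(position["longitude"])))
        if attempt < count - 1:  # no need to wait after the last reading
            sleep(delay_seconds)
    return positions


# Stand-in for the real API (values copied from the output above).
fake_readings = iter([
    {"iss_position": {"latitude": "44.5657", "longitude": "-83.1466"}},
    {"iss_position": {"latitude": "44.1640", "longitude": "-82.2148"}},
])

track = poll(lambda: next(fake_readings), count=2, delay_seconds=0)
print(track)
```

To poll the live API instead, you could pass something like `lambda: requests.get("http://api.open-notify.org/iss-now.json", timeout=10).json()` as `fetch`, with `delay_seconds=5`.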

In [6]:
data = json.loads(response.text)

print( "The ISS current position is {} of latitude and {} of longitude.".format(
        data['iss_position']['latitude'],
        data['iss_position']['longitude']))

print( "The ISS current position is "+str(data['iss_position']['latitude'])+" of latitude and "+str(data['iss_position']['longitude'])+" of longitude.")

The ISS current position is 47.3593 of latitude and -90.7995 of longitude.
The ISS current position is 47.3593 of latitude and -90.7995 of longitude.


Another way to scrape data from websites is to use web-scraping libraries such as `BeautifulSoup` and `Scrapy`. The additional reading includes a notebook on web scraping with these libraries.