# Session 6.1: Introduction to Web Access through Python

In the data access module, we will study how to use Python to access data on the web. In this basic session, we will learn:
    1. The Internet: How is information structured and transmitted on the Internet?
    2. HTML: How does a browser work? How do we access basic text information on the Internet?
    3. How do we leverage Python packages to extract key information from a website?


### Point72 Guidelines on Scraping

1. No webscraping may infringe upon a lawful copyright.
2. No webscraping may have an adverse impact on the site.
3. No webscraping is permitted on sites protected with CAPTCHA programs.
4. Webscraping that would compile data or information for competitive use is prohibited. 
5. Webscraping that will collect personally identifiable information about individuals, including but not limited to phone numbers, proper names, account numbers, etc., is prohibited.
6. Any access you obtain must be consistent with applicable Robots Exclusion Protocols (robots.txt).
7. Notify Compliance immediately if the program is blocked by the target site.




## The Internet
As we discussed, the Internet has four layers: Network Interface, Network, Transport, and Application. The Network Interface layer is the physical layer that encodes and decodes electromagnetic waves into digital information. The Network layer allows devices in a network to identify each other and communicate with each other. The Transport layer sits on top of the Network layer and makes sure that the communication between devices is smooth and error-free. The highest layer is the Application layer, which links applications together (such as email clients and email servers). In this session, we will only look at the Transport layer and the Application layer.

Here is a graph view of the Internet:

<img src="http://fiberbit.com.tw/wp-content/uploads/2013/12/TCP-IP-model-vs-OSI-model.png">

At each layer, there are numerous protocols that allow devices to communicate. A **protocol** is a set of rules that all parties follow so we can predict each other's behavior. In this context, protocols allow all parties on the Internet to agree on how to share and access information. The most important set of protocols in the Network and Transport layers is the **Internet protocol suite**.


The **Internet protocol suite** is a conceptual framework that contains multiple protocols allowing users to share information on the Internet. It can be decomposed into two major protocols:
    1. Internet Protocol (IP): The Internet Protocol is the original principal communications protocol in the Internet protocol suite. Its routing function enables inter-device communication across networks, and essentially establishes the Internet.
    2. Transmission Control Protocol (TCP): TCP provides reliable, ordered, and error-checked delivery of a stream of data between applications running on hosts communicating over an IP network.

Because of these two major protocols, the Internet protocol suite is commonly referred to as **TCP/IP**.

*Bonus Knowledge*: Applications that do not require reliable data stream service may use the User Datagram Protocol (UDP), which provides a connectionless datagram service that emphasizes reduced latency over reliability. UDP is heavily used in the high-frequency trading environment.

**TCP/IP**: TCP/IP gives each device an address (its IP address) that identifies it on the network. You can check your IP address by searching "What is my IP" on Google: [Your IP](https://www.google.com/#q=what+is+my+ip&*)


On top of TCP/IP, each application has a specific socket (port) number on the device. For instance, if you use an email client and a browser at the same time on your machine, they will have the same IP address but different port numbers.

Here are some common default port numbers:
    1. Telnet (Login): 23
    2. SSH (Secure Login): 22
    3. HTTP (Browser): 80
    4. HTTPS (Secure Browser): 443
    5. SMTP (Mail): 25
    6. FTP (File transfer): 21
    7. DNS (Domain Name): 53

Even though TCP/IP is very complicated, Python provides a very nice package for sending TCP/IP messages directly over the network. The package is called **[socket](https://docs.python.org/3/library/socket.html)**.

### Example 2.1.1 Google's robots.txt (socket)
Let us use Python's socket package to get the robots.txt file from Google.


In [None]:
%%bash 
cat Data/google_robots.txt
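
A raw-socket request of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: it talks plain HTTP on port 80, it uses HTTP/1.0 so that the server closes the connection when it is done, and the helper names `build_get_request`, `split_response`, and `fetch_over_socket` are hypothetical names chosen here, not part of the socket package.

```python
import socket


def build_get_request(host, path):
    """Build the raw bytes of a minimal HTTP/1.0 GET request (hypothetical helper)."""
    lines = [
        "GET %s HTTP/1.0" % path,   # request line: method, path, protocol version
        "Host: %s" % host,          # tells the server which site we want
        "Connection: close",        # ask the server to close the connection afterwards
        "",                         # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (header text, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("iso-8859-1"), body


def fetch_over_socket(host, path, port=80, timeout=10):
    """Open a TCP connection, send the GET request, and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:           # empty bytes means the server closed the connection
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))


# Usage (requires network access, so it is left commented out):
# headers, body = fetch_over_socket("www.google.com", "/robots.txt")
# print(headers.splitlines()[0])
# print(body.decode("utf-8", errors="replace")[:200])
```

The first part returned by `split_response` is the HTTP response header, and the second part is the content of the requested file.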

## The Application Layer - HTTP and urllib

In this subsection, we will focus on a specific application-layer protocol: **Hypertext Transfer Protocol (HTTP)**. HTTP is an application protocol for distributed, collaborative, hypermedia information systems. It is the foundation of data communication for the World Wide Web, and the most useful application protocol for scraping data from the web (since most data is still stored on webpages).

HTTP has two basic requests, **GET** and **POST**:
    1. GET: Requests data from a specified resource
    2. POST: Submits data to be processed to a specified resource

A web communication starts with a client sending an HTTP request, normally a GET request. The request goes to the server, and the server responds to it. The response normally contains the response to the HTTP request (information used by the HTTP protocol) as well as the content itself. For example, in Exercise 1.1, we send an HTTP request to www.mit.edu/robots.txt, and our response contains two parts. The first part is a response to the HTTP request, while the second part is the content of the requested document. There are also several other types of methods in HTTP. For more information, see [here](https://www.w3schools.com/tags/ref_httpmethods.asp).
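
To make the difference between GET and POST concrete, here is a small sketch using the standard library's `urllib.request.Request` class. Nothing is sent over the network; the code only builds the request objects. The URL `http://example.com/search` and the form fields `make` and `page` are made-up values for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# A request without a body defaults to the GET method.
get_request = Request("http://example.com/search")

# Attaching a body (here, URL-encoded form fields) makes it a POST request.
form = urlencode({"make": "toyota", "page": 1}).encode("ascii")
post_request = Request("http://example.com/search", data=form)

print(get_request.get_method())
print(post_request.get_method())
print(post_request.data)
```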

Another important feature of HTTP is that each kind of response has a specific status code. For instance, in the previous example, we get the HTTP response code **"200 OK"**, which means that our request succeeded. There are also other codes:
    * 100 Continue: The server has received the request headers, and the client should proceed to send the request body.
    * 200 OK: The request is OK (this is the standard response for successful HTTP requests).
    * 400 Bad Request: The request cannot be fulfilled due to bad syntax.
    * 500 Internal Server Error: A generic error message, given when no more specific message is suitable.
    
These HTTP status codes are very important for debugging your future scraper. We will come back to this later.
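
Python's standard library already knows these codes through `http.HTTPStatus`. The sketch below looks up the codes listed above and groups them by their first digit; the helper name `describe_status` is a hypothetical name used only for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError for codes the library does not know
    if code < 200:
        category = "informational"
    elif code < 300:
        category = "success"
    elif code < 400:
        category = "redirection"
    elif code < 500:
        category = "client error"
    else:
        category = "server error"
    return "%d %s (%s)" % (status.value, status.phrase, category)


for code in (100, 200, 400, 500):
    print(describe_status(code))
```

In a scraper, the first digit is often enough to decide what to do: 4xx usually means the request itself needs fixing, while 5xx usually means the problem is on the server side.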





Putting everything together, we have:
    1. TCP/IP: a series of numbers or characters that allows us to identify devices on the network.
    2. Socket number: a number that allows us to communicate with a particular program on a device.
    3. HTTP: a protocol that defines a set of methods that allow us to communicate with programs handling webpages.
    

These three pieces are combined in a **Uniform Resource Locator (URL)**. A URL, commonly and informally termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. URLs most commonly reference web pages (http), but are also used for file transfer (ftp), email (mailto), database access (JDBC), and many other applications. Every HTTP URL conforms to the syntax of a generic URI. A generic URI is of the form **scheme://host:port/path**, where the port (socket number) may be omitted when the default for the scheme is used.
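
The parts of a URL can be pulled apart with `urllib.parse.urlsplit` from the standard library. The sketch below is illustrative only: the helper `describe_url` and the small `DEFAULT_PORTS` table are assumptions made for this example, used to fill in the port when the URL does not state one.

```python
from urllib.parse import urlsplit

# Default ports taken from the port list earlier in this session.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def describe_url(url):
    """Split a URL into scheme, host, port, and path (hypothetical helper)."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not spell out a port.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
    }


print(describe_url("https://nyautogiant.com/advanced-search/page-1"))
print(describe_url("http://www.google.com:8080/robots.txt"))
```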
 
Everything sounds very complicated, right? What if I tell you Python can make our lives much easier?
<img src="https://imgs.xkcd.com/comics/python.png">


Let us introduce the Python package that saves our lives: **urllib**. In particular, we focus on the sub-module called **urllib.request**, which is a Python module for fetching URLs. It offers a very simple interface, in the form of the urlopen function, which is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.
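
A minimal sketch of the urlopen interface is shown below. The helper name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example rather than anything urllib requires.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Fetch a URL with urlopen and return its body as text, or None on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            # Use the charset announced by the server; fall back to UTF-8 (assumption).
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset)
    except (URLError, ValueError) as err:
        # URLError covers network and HTTP errors; ValueError covers malformed URLs.
        print("Error: cannot fetch %s (%s)" % (url, err))
        return None


# Usage (requires network access, so it is left commented out):
# text = fetch_text("https://www.google.com/robots.txt")
# print(text[:200])
```

Because urlopen understands several URL schemes, the same function can be tried offline with a `data:` URL, for example `fetch_text("data:text/plain,hello%20world")`.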

### Example Get NYAutoGiants Search Page (requests)
For example, the following code blocks use the third-party requests package, which wraps the same fetch-a-URL idea in an even simpler interface, to get https://nyautogiant.com/advanced-search/page-1


In [4]:
import requests

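url = "http://www.google1.com/robots.txt"
try:
    response = requests.get(url)
    print(response.text)
except requests.exceptions.RequestException:
    # Covers connection failures, timeouts, and other request errors.
    print("Error: Cannot get information from %s" % url)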
    

Error: Cannot get information from http://www.google1.com/robots.txt


In [None]:
import requests
response = requests.get('https://nyautogiant.com/advanced-search/page-1')
html = response.text
print(html)

The above program runs in the following order:

1. The first line imports requests into the Python program.
2. The second line uses a function in requests called [get](https://requests.readthedocs.io/) to open the URL, and stores the result in the variable response.
3. The third line takes the response from requests.get() and saves its text content into the variable html.


### Try and Except

Sometimes the scraper will fail for reasons outside your control, such as the site being down. You do not want your scraper to crash when that happens. Therefore, you need to use a try and except statement:

    try:
        ....
    except:
        ....

In [None]:
response = requests.get('https://nyautogiant232.com/advanced-search/page-1')
html = response.text
html

In [None]:
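url = 'https://nyautogiant232.com/advanced-search/page-1'
try:
    response = requests.get(url)
    html = response.text
except requests.exceptions.RequestException:
    # Catch only request-related failures so unrelated bugs are not hidden.
    print("Error: cannot scrape url: %s" % url)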
