# Webscraping 
Concept: web "scraping" is "the construction of an agent to download, parse, and organize data from the web in an automated manner."

Instead of a human end user clicking away in a web browser and copy-pasting interesting parts
into a spreadsheet, web scraping offloads this task to a computer program that can
execute it much faster, and more correctly, than a human can.

-----------------------------

**Why** do we need webscraping?

- Many interesting data sources are spread across different, unstructured websites. Web browsers are good at showing images, displaying animations, and laying out pages, but they make it hard to export the underlying data.

- Webscraping serves a similar purpose to an Application Programming Interface (API). Nowadays, many websites provide APIs that let users access their data repositories in a structured way.

- Webscraping can help you collect rich datasets automatically.

------------------------------

**Rule of thumb**: 

- First, look for an API and use that if you can. Ex., Twitter, Facebook, LinkedIn, Spotify, and Google all have their own APIs.

- Webscraping is preferred in the following circumstances:

    - The target website doesn't offer an API.
    - The API is not free.
    - The API has a rate limit, ex., it only allows a certain number of connections per minute or per day.
    - You need more than the API can provide. In many cases, an API does not expose all the data on the website/app, so you need to collect the rest yourself.


## Some examples of webscraping

- Many of Google's products have benefited from Google's core business of crawling the web. Google Translate, for instance, utilizes text stored on the web to train and improve itself.

- Scraping is being applied a lot in HR and employee analytics. The San Francisco-based startup hiQ specializes in selling employee analyses by collecting and examining public profile information, for instance, from LinkedIn (which was not happy about this but, following a court case, was so far unable to prevent the practice).

- In one study, messages scraped from Twitter, blogs, and other social media were used to construct a data set for building a predictive model that identifies patterns of depression and suicidal thoughts.

- In another study, web scraping was used to extract information from job sites, to get an idea regarding the popularity of different data science- and analytics-related tools in the workplace (spoiler: Python and R were both rising steadily).

- Last, one study used web scraping to monitor news outlets and web forums in order to track public sentiment regarding Bitcoin.

## Setting up
- You will need `pip`, Python's package manager.
- Recent versions of Python 3 come with pip installed.
- Still, let's make sure `pip` is up to date by upgrading it to the most recent version.
    - `python -m pip install -U pip` on Windows
    - `pip install -U pip` on Linux or macOS

- Another option: manually installing pip.
    - Refer to the following page to install it on your system (under "Installing with get-pip.py"): https://pip.pypa.io/en/stable/installation/.

You can check which installed packages have a newer version available.
- On Windows, use the magic command `%%cmd` to run a cmd command.
- On Mac, use `%%bash`.

In [6]:
%%cmd
pip list --outdated

Microsoft Windows [Version 10.0.22621.1992]
(c) Microsoft Corporation. All rights reserved.

C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping>pip list --outdate
Package                  Version     Latest Type
------------------------ ----------- ------ -----
anyio                    3.6.2       3.7.1  wheel
attrs                    22.1.0      23.1.0 wheel
beautifulsoup4           4.11.1      4.12.2 wheel
bleach                   5.0.1       6.0.0  wheel
charset-normalizer       3.1.0       3.2.0  wheel
comm                     0.1.2       0.1.3  wheel
debugpy                  1.6.4       1.6.7  wheel
exceptiongroup           1.1.1       1.1.2  wheel
fastjsonschema           2.16.2      2.17.1 wheel
ipykernel                6.19.2      6.24.0 wheel
ipython                  8.7.0       8.14.0 wheel
jsonpointer              2.3         2.4    wheel
jsonschema               4.17.3      4.18.4 wheel
jupyter_client           7.4.8       8.3.0  wheel
jupyter_core          


[notice] A new release of pip is available: 23.1.2 -> 23.2
[notice] To update, run: python.exe -m pip install --upgrade pip



C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping>

The message shows that a newer version of `pip` is available, so let's upgrade it.
- update pip from 23.1.2 to 23.2 (07/18/2023)
- if you install a package from within Jupyter Notebook, you may have to restart the kernel before you can use the newly installed package.

In [8]:
%%cmd
python -m pip install -U pip

Microsoft Windows [Version 10.0.22621.1992]
(c) Microsoft Corporation. All rights reserved.

C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping># use Magic command to run cmd command


'#' is not recognized as an internal or external command,
operable program or batch file.
'#' is not recognized as an internal or external command,
operable program or batch file.



C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping># update pip from 23.1.2 to 23.2 (06/07/2022)

C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping>python -m pip install -U pip
Collecting pip
  Downloading pip-23.2-py3-none-any.whl (2.1 MB)
     ---------------------------------------- 2.1/2.1 MB 14.7 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.2

C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping>

## Webscraping I : establishing connections over the internet

- HyperText Transfer Protocol (HTTP).

- Protocol: a standard agreement regarding what messages between communicating parties should look like. 

- `requests` library: perform HTTP requests and retrieve websites with Python.

-------------------------------
When you access a website, ex., "www.google.com", the following happens:

1. The Domain Name System (DNS) protocol translates domain names like "www.google.com" into an IP address (a series of numbers).
    - Usually, the web browser keeps the addresses of websites you have visited in its cache (short-term memory) and reuses them every time you go to those websites again.
    - If the address is not cached, the browser asks the OS (ex., Windows) whether it knows the address.
    - If the OS doesn't know, the browser sends a DNS request to your router, which also has its own DNS cache.
    - If the router doesn't know, it sends a number of data packets to known DNS servers (typically maintained by your Internet Service Provider, or ISP). The IP addresses of these DNS servers are known and stored in the router.
    - The DNS server then replies with a response: the IP address of "www.google.com" is 172.217.17.68.
    - If the DNS server of the ISP doesn't know, it can ask other DNS servers (located higher in the DNS hierarchy).

2. The browser now establishes a connection to 172.217.17.68, Google's web server. A number of protocols are combined to construct a message.
    - IEEE 802.3 (Ethernet) protocol: used to communicate with machines on the same network. We don't deal with it directly in this case.
    - Internet Protocol (IP): embeds another message indicating that we want to contact the server at address 172.217.17.68.
    - Transmission Control Protocol (TCP): provides a general, reliable way to deliver network messages and also includes functionality for error checking and for splitting messages up into smaller packets.
    - Inside the TCP messages, there is another message formatted according to HTTP (HyperText Transfer Protocol), which is used to request and receive web pages.

3. Google's web server sends back an HTTP reply, which contains the textual content of the page, formatted using HTML (HyperText Markup Language).
    - The browser renders this HTML text into the actual page we are reading and makes sure everything is arranged as specified by the HTML content.
    - The webpage may contain pieces of content for which the browser initiates new HTTP requests, ex., to get the contents of an image (which is raw, binary data). As a result, rendering one web page can involve a large number of HTTP requests.
    - The browser sends out multiple requests in parallel to speed up this process, so a page often loads in less than a second.
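
Step 1 above (the DNS lookup) can be tried from Python. The following is a minimal sketch using the standard `socket` module; the helper name `resolve` is made up for illustration, and it simply asks the operating system's resolver, so it does not show which cache or DNS server answered.

```python
import socket


def resolve(hostname):
    """Return the distinct IPv4 addresses the OS resolver reports for hostname."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (ip_address, port) pair.
    return sorted({info[4][0] for info in infos})


# "localhost" is resolved locally, so this works without internet access.
print(resolve("localhost"))
```

Calling `resolve("www.google.com")` needs an internet connection, and the address you get back will likely differ from the 172.217.17.68 used in the walkthrough, since large sites answer with different addresses depending on location and time.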

-------------------------------------------------

To standardize the large number of protocols that form the web, the International Organization for Standardization (ISO) maintains the Open Systems Interconnection (OSI) model, which organizes computer communication into 7 layers.

1. Physical Layer: Includes the Ethernet protocol, but also USB, Bluetooth, and other radio protocols.
2. Data Link Layer: Includes the Ethernet protocol.
3. Network Layer: Includes IP (Internet Protocol).
4. Transport Layer: TCP, but also protocols such as UDP (User Datagram Protocol), which does not offer the advanced error checking and recovery mechanisms of TCP for the sake of speed.
5. Session Layer: Includes protocols for opening/closing and managing sessions.
6. Presentation Layer: Includes protocols to format and translate data.
7. Application Layer: HTTP and DNS, for instance.

------------------------------------------

**Note**: not all network communications need to use protocols from all these layers.
- To request a web page, for instance, layers 1 (physical), 2 (Ethernet), 3 (IP), 4 (TCP), and 7 (HTTP) are involved, but the layers are constructed so that each protocol found at a higher level can be contained inside the message of a lower-layer protocol.

- When you request a secure web page, for instance, the HTTP message (layer 7) will be encoded in an encrypted message (layer 6) (this is what happens if you surf to an "https" address).

- The lower the layer you aim for when programming networked applications, the more functionality and complexity you need to deal with.

- **For webscraping, we only deal with the topmost layer, HTTP, and leave all complexities regarding TCP, IP, Ethernet, and the Domain Name System up to the Python library and the OS.**
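
The idea that a higher-layer message travels inside a lower-layer one can be pictured with a toy sketch. This is only an illustration: the bracketed labels stand in for real TCP, IP, and Ethernet headers, which are binary and much richer, and the function name `encapsulate` is made up for this example.

```python
def encapsulate(payload, layers):
    """Wrap payload in one labelled pseudo-header per layer, innermost first."""
    message = payload
    for name in layers:
        # Each lower layer treats everything it receives as opaque content.
        message = f"[{name}]".encode("ascii") + message
    return message


http_message = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"

# HTTP (layer 7) goes inside TCP (4), inside IP (3), inside Ethernet (2).
frame = encapsulate(http_message, ["TCP", "IP", "Ethernet"])
print(frame)
```

The HTTP text is untouched at the end of the frame, which is why a scraper can work at the HTTP level and ignore the wrappers.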

### The HyperText Transfer Protocol: HTTP
The core component in the exchange of messages consists of an HTTP request message to a web server, followed by an HTTP
response, which can be rendered by the browser.

A client (ex., web browser) and web server will communicate by sending plain text messages. The client sends requests to the server and the server sends responses. A request message consists of the following: 
- A request line
- A number of request headers, each on their own line
- An empty line
- An optional message body, which can also take up multiple lines

Each line in an HTTP message must end with `<CR><LF>` (the ASCII characters 0D
and 0A). The empty line is simply `<CR><LF>` with no other additional white space. 

Note: `<CR>` and `<LF>` are two special characters that indicate a new line should be started (Windows uses both, classic Mac OS used `<CR>`, Linux and modern macOS use `<LF>`).
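
To make the line structure concrete, here is a small sketch that assembles a request message by hand. The helper `build_request` is a made-up name for illustration; in practice a library such as `requests` formats this text for you.

```python
CRLF = "\r\n"  # <CR><LF>, the ASCII characters 0D and 0A


def build_request(method, path, headers, body=""):
    """Assemble a plain-text HTTP/1.1 request message."""
    request_line = f"{method} {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # Request line, one line per header, an empty line, then the optional body.
    return CRLF.join([request_line, *header_lines, "", body])


message = build_request(
    "GET", "/", {"Host": "example.com", "Connection": "keep-alive"}
)
print(repr(message))
```

Printing with `repr` makes the `\r\n` after every line visible, including the empty line that separates the headers from the (here empty) body.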


#### HTTP request message 

The following code fragment shows a full HTTP request message as executed by a
web browser (we don't show the "`<CR><LF>`" after each line, except for the last,
blank line):
    
```
GET / HTTP/1.1
Host: example.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Referer: https://www.google.com/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8,nl;q=0.6
<CR><LF>
```


- "GET / HTTP/1.1" is the request line. It contains the HTTP "verb" or "method" we want to execute ("GET" in the example above), the URL we want to retrieve ("/"), and the HTTP version we understand ("HTTP/1.1").
- "GET" means: "get the contents of this URL for me." Every time you enter a URL in your address bar and press enter, your browser performs a GET request.
- Request headers: each on their own line. In this example, we already have quite a few of them. Each header consists of a name, followed by a colon and the actual value of the header.
- "Host": indicates the domain name for which the server should retrieve the page.
    - In HTTP 1.1, the same server can serve multiple websites from the same IP address. Ex., the server responsible for "example.com" might also be the one serving pages belonging to "example.org".
- "Connection: keep-alive": asks the server to keep the connection open for subsequent requests if it can.
- "User-Agent": information about the browser (type, version).

- Polite requests: the browser indicates what formats it understands or prefers, but the web server may still ignore them.
    - "Accept" tells the server which forms of content the browser prefers to get back.
    - "Accept-Encoding" tells the server that the browser is also able to get back compressed content.
- The "Referer" header (a deliberate misspelling) tells the server which page the browser comes from (in this case, a link was clicked on "google.com", sending the browser to "example.com").

#### HTTP response message

The web server will process our request and send back an HTTP reply. These look very similar to HTTP requests and contain:

- A status line that includes the status code and a status message;

- A number of response headers, again each on their own line;

- An empty line;

- An optional message body.

As such, we might get the following response following our request above:


```
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html;charset=utf-8
Date: Mon, 28 Aug 2017 10:57:42 GMT
Server: Apache v1.3
Vary: Accept-Encoding
Transfer-Encoding: chunked
<CR><LF>
<html>
<body>Welcome to My Web Page</body>
</html>
```



- The first line indicates the status of the request. It opens by listing the HTTP version the server understands ("HTTP/1.1"), followed by a status code ("200") and a status message ("OK").
    - If everything goes well, the status is 200.
    - There are other HTTP status codes, ex., 404 indicates that the URL listed in the request could not be retrieved, i.e., it was "not found" on the server.

- Other headers: Date, Server, Content-Type.
    - Content-Type tells the browser what the content included in the reply looks like. It's HTML text in this case, but it could also be binary image data, movie data, etc.

- After the empty line (`<CR><LF>`), you see the optional message body, which contains the content of the reply.
    - Ex., HTML text saying "Sorry, this page could not be found" written by the website you are visiting. The body is optional, but almost all responses have one.
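
The parts of a response can be pulled apart with a few string operations. The sketch below works on a simplified copy of the reply shown earlier: the compression and chunking headers are left out, because a real compressed body would be binary rather than readable text. The function name `parse_response` is made up for illustration.

```python
from http import HTTPStatus

raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html;charset=utf-8\r\n"
    "Server: Apache v1.3\r\n"
    "\r\n"
    "<html>\r\n<body>Welcome to My Web Page</body>\r\n</html>"
)


def parse_response(message):
    """Split a plain-text HTTP response into status line parts, headers, and body."""
    # The first empty line separates the head from the body.
    head, _, body = message.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(raw_response)
print(code, reason, headers["Content-Type"])

# The standard library knows the usual message for each status code.
print(HTTPStatus(404).phrase)
```

With `requests`, the same pieces are available as `r.status_code`, `r.reason`, `r.headers`, and `r.text`, so this parsing never has to be written by hand.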

### HTTP in Python: the requests library

Several Python libraries can take care of HTTP for us.

- Python 3 comes with a built-in module called "urllib," which can deal with all things HTTP (see https://docs.python.org/3/library/urllib.html). The module was heavily revised compared to its counterpart in Python 2, where HTTP functionality was split across "urllib" and "urllib2" and was somewhat cumbersome to work with.

- "httplib2" (see https://github.com/httplib2/httplib2): a small, fast HTTP client library. Originally developed by Googler Joe Gregorio, and now community supported.

- "urllib3" (see https://urllib3.readthedocs.io/): a powerful HTTP client for Python, used by the requests library below.

- "requests" (see http://docs.python-requests.org/): an elegant and simple HTTP library for Python, built "for human beings."

- "grequests" (see https://pypi.python.org/pypi/grequests): extends requests to deal with asynchronous, concurrent HTTP requests.

- "aiohttp" (see http://aiohttp.readthedocs.io/): another library focusing on asynchronous HTTP.

-------------------------------

**Why** `requests`:

- `urllib` provides solid HTTP functionality, but using it involves lots of boilerplate code, making the module less pleasant to use and not very elegant to read.

- `urllib3` (not part of the standard Python modules) extends the Python ecosystem regarding HTTP with some advanced features, but it also doesn't really focus that much on being elegant or concise.

- `requests` builds on top of `urllib3`, but it allows you to tackle the majority of HTTP use cases in code that is short, pretty, and easy to use.

- Both `grequests` and `aiohttp` are more modern libraries that aim to make HTTP with Python more asynchronous. This is useful when you need to make lots of HTTP requests as quickly as possible. But asynchronous programming is a challenging topic, so here we discuss traditional ways of speeding up a web scraping program. You can easily move on to these two libraries on your own later.

Let's install the `requests` library first.

- Installing the requests library
    - in Windows prompt or terminal (on Mac)
        - `pip install requests`
    - in Anaconda prompt
        - `conda install requests`
- The "-U" argument makes sure to update an existing version of requests should there already be one
    - in Windows prompt or terminal (on Mac)
        - `pip install -U requests`
    - in Anaconda prompt
        - `conda update requests`

1. Check if I have `requests` installed. The results showed the version is 2.31.0.

In [9]:
%%cmd
pip show requests

Microsoft Windows [Version 10.0.22621.1992]
(c) Microsoft Corporation. All rights reserved.

C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping>pip show requests
Name: requests
Version: 2.31.0
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache 2.0
Location: C:\Users\YuxiaoLuo\AppData\Local\Programs\Python\Python311\Lib\site-packages
Requires: certifi, charset-normalizer, idna, urllib3
Required-by: 

C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping>
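
The same check can also be done from inside Python, without a shell magic. This is a small sketch using the standard `importlib.metadata` module (available since Python 3.8); the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string such as the one reported by `pip show`,
# or None when requests is not installed in this environment.
print(installed_version("requests"))
```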

2. Show all the outdated packages. Upgrade them based on your needs.

In [10]:
%%cmd
pip list -o

Microsoft Windows [Version 10.0.22621.1992]
(c) Microsoft Corporation. All rights reserved.

C:\Users\YuxiaoLuo\Documents\python3\Analytics_Python\Web_Scraping>pip list -o
Package                  Version     Latest Type
------------------------ ----------- ------ -----
anyio                    3.6.2       3.7.1  wheel
attrs                    22.1.0      23.1.0 wheel
beautifulsoup4           4.11.1      4.12.2 wheel
bleach                   5.0.1       6.0.0  wheel
charset-normalizer       3.1.0       3.2.0  wheel
comm                     0.1.2       0.1.3  wheel
debugpy                  1.6.4       1.6.7  wheel
exceptiongroup           1.1.1       1.1.2  wheel
fastjsonschema           2.16.2      2.17.1 wheel
ipykernel                6.19.2      6.24.0 wheel
ipython                  8.7.0       8.14.0 wheel
jsonpointer              2.3         2.4    wheel
jsonschema               4.17.3      4.18.4 wheel
jupyter_client           7.4.8       8.3.0  wheel
jupyter_core             5.1.

In [17]:
%%cmd
REM cmd does not treat "#" as a comment marker, so comments use REM
REM update the outdated requests package
pip install -U requests

REM alternative 1
REM python -m pip install -U requests

REM alternative 2
REM py -m pip install -U requests

REM alternative 3
REM python3 -m pip install -U requests

Microsoft Windows [Version 10.0.19044.1706]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Yuxiao Luo\Documents\python3\Analytics_Python\Web_Scraping>
C:\Users\Yuxiao Luo\Documents\python3\Analytics_Python\Web_Scraping>pip install --upgrade requests
Collecting requests
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
     ---------------------------------------- 63.1/63.1 kB 3.3 MB/s eta 0:00:00
Collecting charset-normalizer~=2.0.0
  Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Installing collected packages: charset-normalizer, requests
  Attempting uninstall: requests
    Found existing installation: requests 2.25.1
    Uninstalling requests-2.25.1:
      Successfully uninstalled requests-2.25.1
Successfully installed charset-normalizer-2.0.12 requests-2.27.1

C:\Users\Yuxiao Luo\Documents\python3\Analytics_Python\Web_Scraping>

### Code Demonstration
Let's play around with the library using two examples.
1. We will use the example in the textbook.
    - First, import the module.
    - Try to open http://www.webscrapingfordatascience.com/basichttp/ in your browser. You'll see "Hello from the web!" appear on the page. This is what we want to extract using Python.
    - We use the `requests.get` method to perform an "HTTP GET" request to the specified URL. Requests will make sure to format a proper HTTP request message.
    - The `requests.get` method returns a `requests.Response` Python object containing lots of information regarding the HTTP reply that was retrieved.
    - `r.text` contains the HTTP response content body in a textual form.
2. We will scrape the Baruch College page on Wikipedia (https://en.wikipedia.org/wiki/Baruch_College).
3. Other practice websites
    - http://books.toscrape.com/
    - https://webscraper.io/test-sites/tables
    - https://www.scrapethissite.com/pages/

#### Example

In [28]:
import requests 

url = 'https://www.google.com'

# requests.get() returns a requests.Response object, which contains a lot of the retrieved info
r = requests.get(url)

# this does the same job
# r = requests.request('GET', url)

# what is r
print(type(r))

# which HTTP status code from the server
print(r.status_code)

# textual status code
print(r.reason)

# HTTP response headers
print(r.headers)

<class 'requests.models.Response'>
200
OK
{'Date': 'Wed, 19 Jul 2023 16:05:18 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'Content-Security-Policy-Report-Only': "object-src 'none';base-uri 'self';script-src 'nonce-jXvzz11FuKbzLK4C0khd7Q' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/other-hp", 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '0', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2023-07-19-16; expires=Fri, 18-Aug-2023 16:05:18 GMT; path=/; domain=.google.com; Secure, AEC=Ad49MVFwqi1SSqoqhh1kvk9XUQKGk94eo8JfgRJ3iZaGjHpHgkSoDodqfyI; expires=Mon, 15-Jan-2024 16:05:18 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax, NID=511=utC7PQYc6maQeSjQoUIQyAqFiCdEif53m5CUdm0-FXIy3CKed8Oh4-bSZ_rMML0TOucuD_-CcGqjy0GsEL6XBerJo2eOCE0XAo7UQU

In [29]:
# the request info is saved in r.request (Python object)
print(r.request)
print(type(r.request))

# HTTP request headers
dict(r.request.headers)

<PreparedRequest [GET]>
<class 'requests.models.PreparedRequest'>


{'User-Agent': 'python-requests/2.31.0',
 'Accept-Encoding': 'gzip, deflate',
 'Accept': '*/*',
 'Connection': 'keep-alive'}

In [30]:
# HTTP response content
print(r.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="jXvzz11FuKbzLK4C0khd7Q">(function(){var _g={kEI:'vgm4ZM29Eb_L1sQP65C58Ak',kEXPI:'0,18168,774942,566299,6059,206,4804,2316,383,246,5,1129120,1749,1195989,380753,16114,19398,9286,22431,1361,12313,17586,4998,13124,3951,38444,2872,2891,3926,214,7614,606,50059,10631,15324,781,1244,1,16916,2652,4,1528,2304,29062,13065,11442,2216,2980,1457,9358,13216,6663,7596,1,42154,2,16395,342,23024,5679,1021,31121,4569,6258,23418,1252,5835,19300,7484,445,2,2,1,24626,2006,8155,7381,2,3,1474,14491,872,

- Looking at the HTTP status code and status message --> everything went well
    - `status_code`: 200
    - `reason`: OK

- Other headers are included in the HTTP reply from the server

- To get information regarding the HTTP request, you can access the `request` attribute of a `requests.Response` object. This attribute is itself a `requests.PreparedRequest` object, containing information about the HTTP request that was prepared.

- We can access the `headers` attribute of this object as well to get a dictionary representing the headers that were included by requests.
    - User-Agent
    - Accept-Encoding
    - Accept
    - Connection

#### Practice

In [31]:
import requests 

#url = 'https://httpbin.org/get'
url = 'https://en.wikipedia.org/wiki/Baruch_College'

r_school = requests.get(url)

print(r_school.status_code) #check status code

print(r_school.reason) #check textual status code

200
OK


In [32]:
print(type(r_school))
print(type(r_school.request))

<class 'requests.models.Response'>
<class 'requests.models.PreparedRequest'>


In [193]:
# check other HTTP response headers
dict(r_school.headers)

{'date': 'Thu, 09 Jun 2022 17:10:25 GMT',
 'vary': 'Accept-Encoding,Cookie,Authorization',
 'server': 'ATS/8.0.8',
 'x-content-type-options': 'nosniff',
 'content-language': 'en',
 'last-modified': 'Mon, 06 Jun 2022 00:51:00 GMT',
 'content-type': 'text/html; charset=UTF-8',
 'content-encoding': 'gzip',
 'age': '6587',
 'x-cache': 'cp1087 miss, cp1085 hit/3',
 'x-cache-status': 'hit-front',
 'server-timing': 'cache;desc="hit-front", host;desc="cp1085"',
 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload',
 'report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }',
 'nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}',
 'set-cookie': 'WMF-Last-Access=09-Jun-2022;Path=/;HttpOnly;secure;Expires=Mon, 11 Jul 2022 12:00:00 GMT, WMF-Last-Access-Global=09-Jun-20

In [195]:
#HTTP request message

#check info from the HTTP request
#info is stored in the object r_school.request

dict(r_school.request.headers) #check HTTP request headers

{'User-Agent': 'python-requests/2.27.1',
 'Accept-Encoding': 'gzip, deflate',
 'Accept': '*/*',
 'Connection': 'keep-alive'}

In [58]:
# check the scraped text from the webpage
print(r_school.text)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Baruch College - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"c80b9687-614e-4828-a3a1-1bdd45cb30fc","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Baruch_College","wgTitle":"Baruch College","wgCurRevisionId":1088159113,"wgRevisionId":1088159113,"wgArticleId":484019,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","CS1 errors: generic title","All articles with bare URLs for citations","Articles with bare URLs for citations from 

#### Web forms and the HTTP POST method

- GET is used to request data from a specified resource. Any input, ex., the query string (name/value pairs), is sent in the URL of a GET request, which has two drawbacks:
    - URLs are limited in terms of length
    - There might be sensitive information (private information, passcodes, etc.) in the requests


- A better way of providing input and sending that input to a web server: web forms.
    - Web forms are shown in the browser using tags inside the HTML.
    - Each web form on a page corresponds to a block of HTML code enclosed in `<form>` tags:
    
    ```
    <form>
     [...]
    </form>
    ```

- Inside the web form, there are tags representing the form fields themselves.
    - Most are provided through an `<input>` tag, with the `type` attribute specifying what kind of field it should represent.
    - `<input>` does not have a closing tag.


- Here is a short list of these tags:
    - `<input type="text">` for simple text fields;
    - `<input type="password">` for password entry fields;
    - `<input type="button">` for general-purpose buttons;
    - `<input type="reset">` for a "reset" button (when clicked, the browser will reset all form values to their initial state, but this button is rarely encountered these days);
    - `<input type="submit">` for a "submit" button (more on this later);
    - `<input type="checkbox">` for check boxes;
    - `<input type="radio">` for radio buttons;
    - `<input type="hidden">` for hidden fields, which will not be shown to the user but can still contain a value.

- Other form fields are defined with pairs of opening and closing tags.
    - `<button>...</button>` as another way to define buttons;
    - `<select>...</select>` for drop-down lists. Within these, every choice is defined by using `<option>...</option>` tags;
    - `<textarea>...</textarea>` for larger text entry fields.


- We will use GET to scrape a web form on the testing website: https://httpbin.org/forms/post.


- POST is used to send data to a server to create/update a resource (submit web forms). The data sent to the server with POST is stored in the request body of the HTTP request.
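
The form tags listed above can be picked out of a page programmatically. The following sketch uses the standard library's `html.parser`; the HTML snippet is a hypothetical form written for this example, not the actual markup of any website, and the class name `FormFieldParser` is also made up.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect (tag, type, name) for every form field encountered."""

    FIELD_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELD_TAGS:
            attributes = dict(attrs)
            self.fields.append((tag, attributes.get("type"), attributes.get("name")))


# Hypothetical form, for illustration only.
sample_form = """
<form method="post" action="/post">
  <input type="text" name="custname">
  <input type="radio" name="size" value="small">
  <textarea name="comments"></textarea>
  <button>Submit order</button>
</form>
"""

parser = FormFieldParser()
parser.feed(sample_form)
for field in parser.fields:
    print(field)
```

The `name` attributes found this way are the keys a server expects when the form is submitted, which is what you need to know before sending a POST request yourself.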

Difference between GET and POST ([Ref](https://www.w3schools.com/tags/ref_httpmethods.asp))

- GET: Used when typing in a URL in the address bar and pressing enter, clicking a link, or submitting a GET form. Here, the assumption is that the same request can be executed multiple times without "damaging" the user's experience. For instance, it's fine to refresh the URL "search.html?query=Test", but perhaps not the URL "doMoneyTransfer?to=Bart&from=Seppe&amount=100".


- POST: Used when submitting a POST form. Here, the assumption is that this request will lead to an action being undertaken that should not be executed multiple times. Most browsers will actually warn you if you try refreshing a page that resulted from a POST request ("you'll be resubmitting the same information again; are you sure this is what you want to do?").


|                             | GET                                                                                                                                            | POST                                                                                                           |
|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| BACK button/Reload          | Harmless                                                                                                                                       | Data will be re-submitted (the browser should alert the user that the data are about to be re-submitted)       |
| Bookmarked                  | Can be bookmarked                                                                                                                              | Cannot be bookmarked                                                                                           |
| Cached                      | Can be cached                                                                                                                                  | Not cached                                                                                                     |
| Encoding type               | application/x-www-form-urlencoded                                                                                                              | application/x-www-form-urlencoded or multipart/form-data. Use multipart encoding for binary data               |
| History                     | Parameters remain in browser history                                                                                                           | Parameters are not saved in browser history                                                                    |
| Restrictions on data length | Yes, when sending data, the GET method adds the data to the URL; and the length of a URL is limited (maximum URL length is 2048 characters)    | No restrictions                                                                                                |
| Restrictions on data type   | Only ASCII characters allowed                                                                                                                  | No restrictions. Binary/numeric/... data is also allowed                                                                   |
| Security                    | GET is less secure compared to POST because data sent is part of the URL. Never use GET when sending passwords or other sensitive information! | POST is a little safer than GET because the parameters are not stored in browser history or in web server logs |
| Visibility                  | Data is visible to everyone in the URL                                                                                                         | Data is not displayed in the URL                                                                               |
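
The "Visibility" and "Encoding type" rows can be checked without sending anything over the network. The sketch below assumes the `requests` library used throughout these notes and only *prepares* two requests against httpbin.org. The payload values are made up for illustration.

```python
import requests

# Hypothetical form data, used only for illustration
payload = {"username": "Jack", "note": "hello world"}

# Build (but do not send) a GET and a POST request carrying the same data
get_req = requests.Request("GET", "https://httpbin.org/get", params=payload).prepare()
post_req = requests.Request("POST", "https://httpbin.org/post", data=payload).prepare()

# GET: the data is appended to the URL as a query string; there is no body
print(get_req.url)
print(get_req.body)

# POST: the URL stays clean; the data travels in the request body
print(post_req.url)
print(post_req.body)
print(post_req.headers["Content-Type"])
```

With GET the payload is visible in the URL itself, which is why it ends up in bookmarks and browser history. With POST it is carried in the form-encoded body instead.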

In [36]:
# HTTP POST method:
# sends data to the server (mostly through a form) to create or update a resource

# Syntax:
# requests.post(url, data={key: value}, json={key: value}, **kwargs)

in_values = {'username': 'Jack',
             'password': 'Hello'}

r = requests.post('https://httpbin.org/post', data=in_values)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "password": "Hello", 
    "username": "Jack"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "28", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-64b80c9e-0cd3d66d2c76682e0601cf11"
  }, 
  "json": null, 
  "origin": "150.210.231.129", 
  "url": "https://httpbin.org/post"
}
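
In the response above, the values appear under "form" because they were passed through `data=`. As a small sketch of how `data=` and `json=` differ, the following prepares both kinds of request without sending them. It reuses the `in_values` dictionary from the cell above and assumes only the `requests` library.

```python
import json
import requests

in_values = {"username": "Jack", "password": "Hello"}

# data= produces a form-encoded body; json= produces a JSON document
form_req = requests.Request("POST", "https://httpbin.org/post", data=in_values).prepare()
json_req = requests.Request("POST", "https://httpbin.org/post", json=in_values).prepare()

print(form_req.headers["Content-Type"])
print(form_req.body)

print(json_req.headers["Content-Type"])
print(json.loads(json_req.body))  # decode the JSON body back into a dict
```

Which one to use depends on what the server expects: HTML forms usually correspond to `data=`, while many web APIs expect `json=`.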



In [35]:
url1 = "http://www.webscrapingfordatascience.com/paramhttp/?query=cabddd8&^"
url2 = "https://httpbin.org/forms/post"
r1 = requests.get(url2)

print(r1.text)

<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
  <!-- Example form from HTML5 spec http://www.w3.org/TR/html5/forms.html#writing-a-form's-user-interface -->
  <form method="post" action="/post">
   <p><label>Customer name: <input name="custname"></label></p>
   <p><label>Telephone: <input type=tel name="custtel"></label></p>
   <p><label>E-mail address: <input type=email name="custemail"></label></p>
   <fieldset>
    <legend> Pizza Size </legend>
    <p><label> <input type=radio name=size value="small"> Small </label></p>
    <p><label> <input type=radio name=size value="medium"> Medium </label></p>
    <p><label> <input type=radio name=size value="large"> Large </label></p>
   </fieldset>
   <fieldset>
    <legend> Pizza Toppings </legend>
    <p><label> <input type=checkbox name="topping" value="bacon"> Bacon </label></p>
    <p><label> <input type=checkbox name="topping" value="cheese"> Extra Cheese </label></p>
    <p><label> <input type=checkbox name="topping" value="onion"> 

### Query strings: URL with Parameters

A URL (Uniform Resource Locator) is a unique identifier used to locate a resource on the Internet. 

When a server receives an HTTP request for such URLs, it may run a program that uses the parameters included in the query string (the "URL parameters") to render different content.

For example:
- https://www.google.com/search?dcr=0&source=hp&q=test&oq=test
- http://www.webscrapingfordatascience.com/paramhttp/?query=test
- http://www.webscrapingfordatascience.com/paramhttp/?query=anothertest

The optional "?…" part in URLs is called the "query string," and it is meant to contain data that does not fit within a URL's normal hierarchical path structure.

------------------------

Query strings in URLs should adhere to the following conventions:
- A query string comes at the end of a URL, starting with a single question mark, "?".
- Parameters are provided as key-value pairs and separated by an ampersand, "&".
- The key and value are separated using an equals sign, "=".
- Since some characters cannot be part of a URL or have a special meaning (the characters "/", "?", "&", and "=" for instance), URL "encoding" needs to be applied to properly format such characters when using them inside of a URL. Try this out using the URL http://www.webscrapingfordatascience.com/paramhttp/?query=another%20test%3F%26, which sends "another test?&" as the value for the "query" parameter to the server in an encoded form.
- Other exact semantics are not standardized. 
    - In general, web servers do not take into account the order in which URL parameters are specified, though some might. 
    - Many web servers can also handle URL parameters without a value, for example, http://www.example.com/?noparam=&anotherparam. 
    - Since the full URL is included in the request line of an HTTP request, the web server can decide how to parse and deal with these.
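
The conventions above can be explored with the standard library's `urllib.parse` module. The following is a minimal sketch that splits one of the example URLs into its parts and reads the key-value pairs. It shows how Python parses a query string, which is not necessarily how every web server does it.

```python
from urllib.parse import urlsplit, parse_qs, parse_qsl

url = "https://www.google.com/search?dcr=0&source=hp&q=test&oq=test"
parts = urlsplit(url)

print(parts.path)   # the hierarchical path
print(parts.query)  # everything after the "?"

# parse_qs maps each key to a list of values, since a key may appear more than once
params = parse_qs(parts.query)
print(params)

# Parameters without a value are dropped unless keep_blank_values=True
blank = parse_qsl("noparam=&anotherparam", keep_blank_values=True)
print(blank)
```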
    
-------------------------
**URL encoding** (or percent encoding) is a method to encode arbitrary data in a Uniform Resource Identifier (URI) using only the limited US-ASCII characters legal within a URI. It is often used in the submission of HTML form data in HTTP requests. More details can be found [here](https://en.wikipedia.org/wiki/URL_encoding). For example:
- `%20` represents a space ` `.
- `%3F` represents a question mark `?`.
- `%26` represents an ampersand `&`.
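
To make the rule behind these codes concrete, here is a small sketch in which `percent_encode_char` is a hypothetical helper written for illustration: each byte of the character's UTF-8 representation becomes `%` followed by two hexadecimal digits. The standard library's `quote` and `unquote` do the same job in practice.

```python
from urllib.parse import quote, unquote

def percent_encode_char(ch):
    """Percent-encode one character by hand: '%' plus the hex value of each UTF-8 byte."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))

for ch in [" ", "?", "&", "é"]:
    # The hand-rolled helper and urllib.parse.quote should agree
    print(repr(ch), percent_encode_char(ch), quote(ch, safe=""))

# Decoding reverses the process
decoded = unquote("another%20test%3F%26")
print(decoded)
```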

In [93]:
# URL query strings
# Issue: requests cannot always apply smart URL encoding, so some special characters may not be sent as intended

# "?query=test"
url0 = "http://www.webscrapingfordatascience.com/paramhttp/?query=test"

# If the URL query has reserved characters, you need to use URL encoding
url1 = "http://www.webscrapingfordatascience.com/paramhttp/?query=another%20test%3F%26"

# URL encoding
# Sometimes, requests will try to help you out and encode some characters for you
url2 = "http://www.webscrapingfordatascience.com/paramhttp/?query=a querywith spaces"

# For an ambiguous URL,
# requests is unsure whether you meant "?&" to belong to the actual URL as is or whether you wanted to encode it.
# Hence, the special characters ? and & are left unencoded
url3 = "http://www.webscrapingfordatascience.com/paramhttp/?query=complex?&"

for url in [url0, url1, url2, url3]:
    r = requests.get(url)
    print(r.request.url)
    print(r.text)

http://www.webscrapingfordatascience.com/paramhttp/?query=test
I don't have any information on "test"
http://www.webscrapingfordatascience.com/paramhttp/?query=another%20test%3F%26
I don't have any information on "another test?&"
http://www.webscrapingfordatascience.com/paramhttp/?query=a%20querywith%20spaces
I don't have any information on "a querywith spaces"
http://www.webscrapingfordatascience.com/paramhttp/?query=complex?&
I don't have any information on "complex?"
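
The last result drops the "&" because an unencoded ampersand is read as a parameter separator rather than as data. The sketch below uses the standard library's `parse_qs` as a stand-in for a server-side parser (an assumption, since real servers may parse differently) to show the difference that encoding makes.

```python
from urllib.parse import parse_qs, quote_plus

# Unencoded: the trailing "&" is treated as a separator, so it is not part of the value
raw_query = "query=complex?&"
raw_parsed = parse_qs(raw_query)
print(raw_parsed)

# Encoded: "?" and "&" are now unambiguously part of the value
encoded_query = "query=" + quote_plus("complex?&")
encoded_parsed = parse_qs(encoded_query)
print(encoded_query)
print(encoded_parsed)
```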


Let's play around with the URL query in the Google search engine. 
- Try adding a query string to the URL https://www.google.com/ and see what happens in the search box. 
- Use `requests` to send the HTTP request and inspect the requested URL.

In [14]:
import requests
url0 = "https://www.google.com/?query=baruch college > < >>"

r = requests.get(url0)
print(r.request.url)

https://www.google.com/?query=baruch%20college%20%3E%20%3C%20%3E%3E


#### Encode special characters with `urllib.parse` functions

Sometimes, the URL is too ambiguous for `requests` to encode correctly. Then, we can use the `quote` and `quote_plus` functions.
   - `quote` is meant for the path section of URLs and encodes special characters using percent encoding ("%XX"), including spaces. 
       - By default it doesn't encode the slash `/`, as it is meant to be used on URL paths.
   - `quote_plus` is generally used to encode query strings.
       - It replaces spaces with plus signs and also percent-encodes slashes (`%2F`).

In [97]:
# Properly encoding URLs for use with the requests library
# using the urllib.parse.quote or urllib.parse.quote_plus function

from urllib.parse import quote, quote_plus

raw_string = 'a query with /, spaces and?&'

# urllib.parse.quote
# encodes special characters using percent ("%XX") encoding, including spaces
# doesn't encode the slash / by default, as it's meant to be used on URL paths
print(quote(raw_string))

# urllib.parse.quote_plus
# similar to quote, but
# encodes spaces as plus signs and also percent-encodes slashes
print(quote_plus(raw_string))

a%20query%20with%20/%2C%20spaces%20and%3F%26
a+query+with+%2F%2C+spaces+and%3F%26


##### Summary 
- As long as our query parameter does not contain slashes, both functions can be used to encode query strings.

- If our query string does include a slash and we want to use `quote`, we can override its `safe` argument (`safe=''`).
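
As a quick check of this summary, the sketch below encodes the same string both ways and decodes it again. It uses only `urllib.parse`. The point to notice is that each encoder has a matching decoder (`unquote` for `quote`, `unquote_plus` for `quote_plus`).

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

raw_string = 'a query with /, spaces and?&'

encoded_quote = quote(raw_string, safe='')  # nothing is treated as safe
encoded_plus = quote_plus(raw_string)
print(encoded_quote)
print(encoded_plus)

# The two encodings differ only in how they write a space
same_except_spaces = encoded_quote.replace('%20', '+') == encoded_plus
print(same_except_spaces)

# Each encoder has a matching decoder that restores the original string
print(unquote(encoded_quote) == raw_string)
print(unquote_plus(encoded_plus) == raw_string)

# Mixing them up leaves the plus signs in place
mixed = unquote(encoded_plus)
print(mixed)
```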


In [98]:
from urllib.parse import quote, quote_plus
help(quote)
# The default for the safe arg is '/'

Help on function quote in module urllib.parse:

quote(string, safe='/', encoding=None, errors=None)
    quote('abc def') -> 'abc%20def'
    
    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted. The
    quote function offers a cautious (not minimal) way to quote a
    string for most of these parts.
    
    RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
    the following (un)reserved characters.
    
    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
    
    Each of the reserved characters is reserved in some component of a URL,
    but not necessarily in all of them.
    
    The quote function %-escapes all characters that are neither in the
    unreserved chars ("always safe") nor

In [203]:
raw_string = 'a query with /, spaces and?&'

url = 'http://www.webscrapingfordatascience.com/paramhttp/?query='
print('Using quote:')

# Nothing is safe, not even '/' characters, so encode everything
r = requests.get(url + quote(raw_string, safe=''))

print(r.url)
print(r.text)

Using quote:
http://www.webscrapingfordatascience.com/paramhttp/?query=a%20query%20with%20%2F%2C%20spaces%20and%3F%26
<html>
<head>
<title>Web Page Blocked</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<style>
#content{border:3px solid#aaa;background-color:#fff;margin:40;padding:40;font-family:Tahoma,Helvetica,Arial,sans-serif;font-size:12px;}
  h1{font-size:20px;font-weight:bold;color:#196390;}
  b{font-weight:bold;color:#196390;}
</style>
</head>
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
<p><b>User:</b> 10.29.130.30 </p>
<p><b>URL:</b> www.webscrapingfordatascience.com/paramhttp/?query=a%20query%20with%20/,%20spaces%20and?%26 </p>
<p><b>Category:</b> malware </p>
</div>
</body>
</html

In [100]:
# quote_plus can get you the same result

raw_string = 'a query with /, spaces and?&'

print('Using quote_plus:')
r = requests.get(url + quote_plus(raw_string))
print(r.url)
print(r.text)

Using quote_plus:
http://www.webscrapingfordatascience.com/paramhttp/?query=a+query+with+%2F%2C+spaces+and%3F%26
I don't have any information on "a query with /, spaces and?&"


Let's practice with Google.

In [17]:
from urllib.parse import quote, quote_plus

google = "https://www.google.com/?query="
query_string = 'the full list of URL encoding?'

print('Using quote_plus:')
r = requests.get(google + quote_plus(query_string))
print(r.url)
print(r.text)

Using quote_plus:
https://www.google.com/?search=the+full+list+of+URL+encoding%3F
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/logos/doodles/2023/2023-womens-world-cup-opening-day-6753651837110060.5-law.gif" itemprop="image"><meta content="2023 Women's World Cup - Opening Day!" property="twitter:title"><meta content="Let the games begin! #GoogleDoodle" property="twitter:description"><meta content="Let the games begin! #GoogleDoodle" property="og:description"><meta content="summary_large_image" property="twitter:card"><meta content="@GoogleDoodles" property="twitter:site"><meta content="https://www.google.com/logos/doodles/2023/2023-wome

##### Use `params` argument in `requests.get`

Pass a Python dictionary with your non-encoded URL parameters and requests will take
care of encoding them for you.

In [205]:
help(requests.get)

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response



In [103]:
# 2. Using the params argument in requests.get()

url = 'http://www.webscrapingfordatascience.com/paramhttp/'

# Create a dictionary with non-encoded URL parameters
# (a dict cannot hold the same key twice; a repeated key silently keeps only the last value)
parameters = {'query': 'a query with /, spaces and?&'}

# Or use a list of tuples, which also allows repeated keys
# parameters = [('query', 'a query with /, spaces and ?&'),
#               ('query', '1234')]

# parameters = [('query', '\\&^'), ('query', 'a query with /, spaces and?&')]

r = requests.get(url, params=parameters)

print(r.url)
print(r.text)

http://www.webscrapingfordatascience.com/paramhttp/?query=a+query+with+%2F%2C+spaces+and%3F%26
I don't have any information on "a query with /, spaces and?&"


#### What to pass in `params`
- Empty parameters, for example `params={'query': ''}`, will end up in the URL with an equals sign included, that is, "?query=".

- You can also pass a list to `params` in which every element is a tuple or list with two elements, representing the key and value of a parameter respectively. In that case the order of the list will be respected.

- You can also pass an `OrderedDict` object (provided by the `collections` module in Python 3) that will retain the ordering.

- You can also pass a string representing your query string part. In this case, `requests` will prepend the question mark ("?") for you but will, once again, not be able to provide smart URL encoding, so you are responsible for making sure your query string is encoded properly.
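
The input shapes listed above can be tried out offline with the standard library's `urlencode`, which accepts dictionaries, ordered dictionaries, and lists of tuples. This is a stand-in sketch for experimenting with the encodings, not a statement about how `requests` is implemented. The keys `a` and `b` are made up for illustration.

```python
from collections import OrderedDict
from urllib.parse import urlencode

# An empty value still produces the equals sign
empty = urlencode({'query': ''})
print(empty)

# A list of tuples keeps its order and allows the same key more than once
repeated = urlencode([('query', '\\&^'),
                      ('query', 'a query with /, spaces and?&')])
print(repeated)

# An OrderedDict keeps the insertion order of its keys
ordered = urlencode(OrderedDict([('b', 2), ('a', 1)]))
print(ordered)

# The encoded string can then be appended to a base URL by hand
base = 'http://www.webscrapingfordatascience.com/paramhttp/'
full_url = base + '?' + empty
print(full_url)
```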

In [110]:
parameters = [('query', '\\&^'), ('query', 'a query with /, spaces and?&')]
r = requests.get(url, params=parameters)
print(r.url)
print(r.text)

http://www.webscrapingfordatascience.com/paramhttp/?query=a+query+with+%2F%2C+spaces+and%3F%26
I don't have any information on "a query with /, spaces and?&"


In [111]:
parameters = [('query', '\\&^')]
r = requests.get(url, params=parameters)
print(r.url)
print(r.text)

http://www.webscrapingfordatascience.com/paramhttp/?query=%5C%26%5E
I don't have any information on "\&^"


**Practice**

Let's use the `params` argument in `requests.get` to send URL queries to https://httpbin.org/get.

In [19]:
payload = {'key1': 'value1', 
           'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)

print(r.url)
print(r.text)

https://httpbin.org/get?key1=value1&key2=value2
{
  "args": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-64b9680a-6ce8a3686acee3db46559aeb"
  }, 
  "origin": "150.210.231.129", 
  "url": "https://httpbin.org/get?key1=value1&key2=value2"
}



##### Silencing requests completely
- In rare situations, a very picky web server might nevertheless expect URLs to come in unencoded. 
- This is extremely rare, but if it happens you will need to override `requests` as follows:

In [29]:
import requests
url = 'http://www.example.com/?spaces |pipe'
r = requests.get(url)
print(r.url)

http://www.example.com/?spaces%20%7Cpipe


We need to use the `unquote` function, which reverts percent encoding.

In [None]:
from urllib.parse import unquote
unquote('http://www.example.com/?spaces%20%7Cpipe')

Then, we need to override `requests` and avoid automatic encoding.

In [125]:
import requests
from urllib.parse import unquote

class NonEncodedSession(requests.Session):
    # Override the default send method
    def send(self, *a, **kw):
        # Revert the encoding which was applied
        a[0].url = unquote(a[0].url)
        return requests.Session.send(self, *a, **kw)

my_requests = NonEncodedSession()
url = 'http://www.example.com/?spaces |pipe'
r = my_requests.get(url)
print(r.url)
# Will show: http://www.example.com/?spaces |pipe

http://www.example.com/?spaces |pipe


##### Exercise
Head over to http://www.webscrapingfordatascience.com/calchttp/. Play around with the "a," "b," and "op" URL parameters. You should be able to work out what the following code does:

In [134]:
def calc(a, b, op):
    url = 'http://www.webscrapingfordatascience.com/calchttp/'
    params = {'a': a, 'b': b, 'op': op}
    r = requests.get(url, params=params)
    return r.text

print(calc(4, 6, '*'))

#print(calc(4, 6, '/'))

# calc('a','b','+')

24
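
To see why passing the operator through `params` matters, the sketch below builds the query string for several operators with the standard library's `urlencode`. `calc_url` is a hypothetical helper written for this illustration, and nothing is sent to the server.

```python
from urllib.parse import urlencode

def calc_url(a, b, op):
    """Return the URL that a GET request with these parameters would target."""
    base = 'http://www.webscrapingfordatascience.com/calchttp/'
    return base + '?' + urlencode({'a': a, 'b': b, 'op': op})

for op in ['*', '/', '+', '-']:
    print(calc_url(4, 6, op))
```

Operators such as `+` and `/` have a special meaning inside URLs (a literal `+` in a query string is conventionally decoded as a space), so they need to be percent-encoded to reach the server as intended.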


#### Another URL issue

Some web servers are clever enough to route URLs to the proper page even when no URL parameters are provided.

In [212]:
import requests
# https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687

url1 = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

url2 = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'
r1,r2 = map(requests.get, [url1,url2])

print(r1.url, r2.url, sep = '\n')

https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687
https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes


- We're using the "oldid" URL parameter here to obtain a specific version of the "List of Game of Thrones episodes" page, to make sure that our subsequent examples will keep working.
  
- Both of the following links lead to the same page, but the latter uses URL parameters while the former does not. 
    - https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes
    - https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes
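
The two URL styles carry the same information in different places: one in the path, the other in the query string. As a minimal sketch, the hypothetical helper `wikipedia_title` below recovers the page title from either form using `urllib.parse`. The `/wiki/` prefix is taken from the example URLs above.

```python
from urllib.parse import urlsplit, parse_qs, unquote

def wikipedia_title(url):
    """Return the page title from either URL style, or None if neither matches."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if 'title' in query:
        # Style 1: the title is passed as a URL parameter
        return query['title'][0]
    prefix = '/wiki/'
    if parts.path.startswith(prefix):
        # Style 2: the title is part of the path
        return unquote(parts.path[len(prefix):])
    return None

url1 = ('https://en.wikipedia.org/w/index.php'
        '?title=List_of_Game_of_Thrones_episodes&oldid=802553687')
url2 = 'https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes'

print(wikipedia_title(url1))
print(wikipedia_title(url2))
print(parse_qs(urlsplit(url1).query)['oldid'])
```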

#### Practice (HTTP requests)

Try sending an HTTP request to each of the following webpages:
- https://www.reddit.com/r/Baruch/
- https://www.baruch.cuny.edu/
- https://www.google.com/
- https://www.google.com/search?q=baruch+college&sxsrf=ALiCzsbfuw20w2QzkizVWy1CYK3MIVCsZw%3A1654792541686&source=hp&ei=XSGiYvfwJY22ggec-6TwBQ&iflsig=AJiK0e8AAAAAYqIvbfeKXieXpugQYk44b9A6wkNKwg8r&ved=0ahUKEwi3k_vM5qD4AhUNm-AKHZw9CV4Q4dUDCAk&uact=5&oq=baruch+college&gs_lcp=Cgdnd3Mtd2l6EAMyBAgjECcyCAguEIAEELEDMgoIABCABBCHAhAUMgUIABCABDIFCAAQgAQyCwguEIAEEMcBEK8BMgUIABCABDIFCAAQgAQyCwguEIAEEMcBEK8BMgUIABCABDoHCCMQ6gIQJzoICC4QsQMQgwE6EQguEIAEELEDEIMBEMcBEKMCOgsILhCxAxCDARDUAjoLCC4QgAQQxwEQowI6CwguEIAEEMcBENEDOgQIABBDOg4ILhCABBCxAxDHARCjAjoRCC4QgAQQsQMQxwEQowIQ1AI6CAguEIAEENQCOgsILhCABBCxAxDUAjoLCAAQgAQQsQMQyQM6BQgAEJIDOhAILhCxAxCDARDHARCjAhBDOhEILhCABBCxAxCDARDHARCvAToHCAAQsQMQQzoLCAAQgAQQsQMQgwFQ3AlYqBVgpxZoAXAAeACAAWmIAYwKkgEEMTEuM5gBAKABAbABCg&sclient=gws-wiz
- https://twitter.com/BaruchCollege?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor

In [44]:
import requests

url = 'https://twitter.com/BaruchCollege?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor'
r = requests.get(url)

print(r.text)

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" /><link rel="preconnect" href="//abs.twimg.com" /><link rel="dns-prefetch" href="//abs.twimg.com" /><link rel="preconnect" href="//api.twitter.com" /><link rel="dns-prefetch" href="//api.twitter.com" /><link rel="preconnect" href="//pbs.twimg.com" /><link rel="dns-prefetch" href="//pbs.twimg.com" /><link rel="preconnect" href="//t.co" /><link rel="dns-prefetch" href="//t.co" /><link rel="preconnect" href="//video.twimg.com" /><link rel="dns-prefetch" href="//video.twimg.com" /><link nonce="NmJhYjUxM2YtZTcyOC00ZTFhLWFlY2EtOTAxNjU2MDlhNWUx" rel="preload" as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.836eaeda.js" /><link nonce="NmJhYjUxM2YtZTcyOC00ZTFhLWFlY2EtOTAxNjU2MDlhNWUx" rel="preload" as="script" crossorigin="anonymous" href="https:/

In [47]:
print(r.status_code)
print(r.reason)
print(r.headers)

200
OK
{'date': 'Thu, 20 Jul 2023 19:21:12 GMT', 'perf': '7626143928', 'expiry': 'Tue, 31 Mar 1981 05:00:00 GMT', 'pragma': 'no-cache', 'server': 'tsa_b', 'set-cookie': 'ct0=; Max-Age=-1689880871; Expires=Thu, 01 Jan 1970 00:00:01 GMT; Path=/; Domain=.twitter.com; Secure; SameSite=Lax', 'content-type': 'text/html; charset=utf-8', 'x-powered-by': 'Express', 'cache-control': 'no-cache, no-store, must-revalidate, pre-check=0, post-check=0', 'last-modified': 'Thu, 20 Jul 2023 19:21:12 GMT', 'x-frame-options': 'DENY', 'x-transaction-id': 'ac2e1d266f585fbe', 'x-xss-protection': '0', 'x-content-type-options': 'nosniff', 'content-security-policy': "connect-src 'self' blob: https://*.pscp.tv https://*.video.pscp.tv https://*.twimg.com https://api.twitter.com https://api-stream.twitter.com https://ads-api.twitter.com https://aa.twitter.com https://caps.twitter.com https://pay.twitter.com https://sentry.io https://ton.twitter.com https://twitter.com https://upload.twitter.com https://www.google-a

#### The fragment identifier
Apart from the query string, there is another part of the URL that you might have encountered before: the fragment identifier, or "hash," as it is sometimes called.
- It is prepended by a `#` and comes at the end of a URL, after the query string. Ex., http://www.example.org/about.htm?p=8#contact
- This part of the URL is meant to identify a portion of the document corresponding to the URL. For instance, a web page can include a link with a fragment identifier that, if you click on it, immediately scrolls your view to the corresponding part of the page.
- The fragment identifier is different from other parts of the URL, as it is processed exclusively by the web browser with no participation from the web server.
- A proper web browser should not include the fragment identifier in the HTTP requests it sends when fetching a resource from a web server. 
- The browser waits until the web server has sent its reply and then uses the fragment identifier to scroll to the correct part of the page.

- Most web servers ignore a fragment identifier if it's in a request URL. 
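
When collecting links during scraping, it can be handy to separate the fragment from the rest of the URL, for instance so that the same page is not fetched twice under different fragments. A minimal sketch with the standard library, using the example URL from the list above:

```python
from urllib.parse import urldefrag, urlsplit

url = 'http://www.example.org/about.htm?p=8#contact'

# urlsplit exposes the query string and the fragment as separate fields
parts = urlsplit(url)
print(parts.query)
print(parts.fragment)

# urldefrag returns the URL without its fragment, plus the fragment itself
url_without_fragment, fragment = urldefrag(url)
print(url_without_fragment)
print(fragment)
```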

In [33]:
url = 'https://en.wikipedia.org/wiki/Baruch_College'
r = requests.get(url)

print(r.text)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Baruch College - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^|; )enwikimwclientpref

### Play with HTML
The `bs4` package (Beautiful Soup) can help us pull data out of HTML.

In [34]:
from bs4 import BeautifulSoup

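# Name the parser explicitly to avoid a warning and get consistent results across machines
html_soup = BeautifulSoup(r.text, 'html.parser')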

In [35]:
type(html_soup)

bs4.BeautifulSoup
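
Before working with the full Wikipedia page, it can help to see how tag and class lookups behave on a small document. The sketch below parses a made-up HTML string (not taken from any real page) with the same `bs4` package. It assumes Python's built-in `html.parser` as the parser.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML document used only for illustration
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>Intro text.</p>
    <cite class="citation web">First source.</cite>
    <cite class="note">Not a citation.</cite>
    <cite class="citation">Second source.</cite>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Navigate to a tag by name and read its text
page_title = sample_soup.title.get_text()
print(page_title)

# find_all returns every <cite> tag whose class list contains "citation"
citations = [c.get_text() for c in sample_soup.find_all('cite', class_='citation')]
print(citations)
```

Note that `class_='citation'` also matches the first tag, whose `class` attribute holds several values, because Beautiful Soup compares against each class individually.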

The `prettify()` method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

In [36]:
print(html_soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Baruch College - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^|; 

We can use `BeautifulSoup` methods such as `find_all` to select elements by tag name and class in the HTML code.

In [37]:
cites = html_soup.find_all('cite',class_='citation')
# print(type(cites))

for cite in cites:
    print(cite.get_text(),'\n')
    

"Annual Report 2020-21" (PDF). www.alumni.baruch.cuny.edu. 2022. Retrieved April 7, 2022. 

Office of Institutional Research. "Fact Sheet: Fall 2020, Student Total Enrollment" (PDF). Baruch College, Office of Institutional Research. Retrieved December 3, 2021. 

"CUNY – Baruch College". Colleges.usnews.rankingsandreviews.com. 

"Kathleen Waldron, Baruch's New President, Announces Historic Gifts of $53.5 Million". Retrieved December 29, 2022. 

"Baruch Means Business Capital Campaign". Archived from the original on September 18, 2009. Retrieved October 12, 2009. 

Speri, Alice; Phillips, Anna M. (November 21, 2011). "CUNY Students Clash With Police in Manhattan". The New York Times. 

"CUNY Appoints Next President Of Baruch College". Retrieved December 29, 2022. 

"Academic Degree Programs - Baruch College". www.baruch.cuny.edu. 

"The Zicklin School of Business". Zicklin.baruch.cuny.edu. Retrieved April 29, 2014. 

"The Weissman School of Arts and Sciences". Baruch.cuny.edu. April 24