In this notebook, we will explore the Requests library: GET requests (with headers and parameters), POST requests (with JSON), status codes, authentication, and timeouts.

Finally, we will work through a small BeautifulSoup example that parses the HTML content of the Hindi Wikipedia main page.

# Requests 

## What is HTML?

HTML (Hypertext Markup Language) is the foundation of every webpage.
* It provides the *structure* and *content* of the webpage.

* Tags: HTML uses tags (e.g., \<p\>, \<h1\>, \<a\>) to define different elements. 
These tags indicate what each part of the page is (paragraph, heading, link).

* Elements: An HTML element consists of an opening tag, content, and a closing tag 
(e.g. \<p\>This is a paragraph\</p\>). In other words, an element is everything from the start tag to the end tag.

* Attributes: Tags can have attributes that provide additional information about the element 
(e.g. \<a href="https://example.com"\>Link\</a\>).


The purpose of a **web browser** (Chrome, Edge, Firefox, Safari) is to read HTML documents and display them correctly.

* A browser does not display the HTML tags, but uses them to determine how to display the document.

* Every web page you visit or scrape has an underlying HTML structure.

In [None]:
<!DOCTYPE html>
<html>
<head>
<title>My Webpage</title>
</head>
<body>

<h1>Welcome to My Webpage</h1>
<p>This is a paragraph on my webpage.</p>

</body>
</html>

HTML resource for beginners: https://www.w3schools.com/html/html_intro.asp 
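
To make the terms tag, element, and attribute concrete, here is a minimal sketch that walks a small HTML snippet with Python's built-in `html.parser` module and records each opening tag together with its attributes. The snippet (the page above plus one link) and the class name `TagLister` are illustrative assumptions.

```python
from html.parser import HTMLParser

# Illustrative snippet (an assumption): the page above plus one link.
SAMPLE_HTML = """
<html>
<head><title>My Webpage</title></head>
<body>
<h1>Welcome to My Webpage</h1>
<p>This is a paragraph on my webpage.</p>
<a href="https://example.com">Link</a>
</body>
</html>
"""

class TagLister(HTMLParser):
    """Collect (tag, attributes) pairs for every opening tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.tags.append((tag, dict(attrs)))

lister = TagLister()
lister.feed(SAMPLE_HTML)

for tag, attrs in lister.tags:
    print(tag, attrs)
```

Only the link carries an attribute here, so every other tag is printed with an empty dictionary.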

### Demo of a locally stored website

<font color='red'>Jump to index.html demo here.</font>

In [None]:
* While HTML provides structure, CSS is responsible for styling: the look and feel of the site 
and the positioning of its elements.

* HTML is also used with JavaScript for adding interactivity and dynamic behavior.

<b>When using the Requests library in Python, you're essentially retrieving the HTML code of a website. </b>

### APIs

Note: While Requests works well for many websites, some sites have highly dynamic content 
(imagine tweets arriving at around 6,000 per second and constantly changing the page, and think about the sheer volume as well) 
or complex page structures (such as the layout of a YouTube page). These sites call for a more structured approach, such as using their APIs. Examples include Twitter, Reddit, Google Maps, and YouTube.



API stands for Application Programming Interface. 
It acts as a contract between two applications, 
allowing them to communicate by sending requests and receiving responses. 

It is a more structured way to interact with websites.
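
As a hedged illustration of why API responses are easier to work with than raw HTML: most web APIs return JSON, which maps directly onto Python dictionaries and lists. The response text below is a made-up example, not the output of any real API.

```python
import json

# Hypothetical JSON text, shaped like what a web API might return.
api_response_text = (
    '{"user": "example_user", '
    '"posts": [{"id": 1, "title": "Hello"}, {"id": 2, "title": "World"}]}'
)

data = json.loads(api_response_text)  # JSON text -> Python dict
titles = [post["title"] for post in data["posts"]]
print(titles)
```

No HTML parsing is needed; the fields are addressed by name.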

# What is HTTP? 🌐

The Hypertext Transfer Protocol (HTTP) 
is designed to enable communications between clients and servers.

HTTP works as a request-response protocol between a client and server.

Example: A client (browser) sends an HTTP request to the server; 
then the server returns a response to the client. 
The response contains status information about the request 
and may also contain the requested content.

<img src="https://support.safe.com/hc/article_attachments/25411178860173" />
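
The request-response exchange is plain text underneath. As a rough sketch, the snippet below splits a hand-written HTTP response (an assumed example, not captured from a real server) into its status line, headers, and body.

```python
# A hand-written HTTP response (assumed example). Lines end with CRLF,
# and an empty line separates the headers from the body.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<p>Hello!</p>"
)

head, body = raw_response.split("\r\n\r\n", 1)
status_line, *header_lines = head.split("\r\n")

version, status_code, reason = status_line.split(" ", 2)
headers = dict(line.split(": ", 1) for line in header_lines)

print(status_code, reason)  # the status information about the request
print(headers)
print(body)                 # the requested content
```

Libraries such as Requests do this parsing for you and expose the pieces as attributes of a response object.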

In [None]:
HTTP Methods:

* GET
* POST
* PUT
* HEAD
...
* CONNECT


The two most common HTTP methods are: GET and POST.

## GET Method

GET is used to request data from a specified resource.

For example, you would use GET to fetch multiple pages of a blog when building a dataset.

Note that the query string (name/value pairs) is sent in the URL of a GET request:

In [None]:

/test/demo_form.php?name1=value1&name2=value2 
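
As a small sketch of how such a query string is put together and taken apart, the standard library's `urllib.parse` module can encode a dictionary into name/value pairs and decode them again. The path is the one from the example above; the variable names are arbitrary.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"name1": "value1", "name2": "value2"}

# Build the URL: the query string follows the '?'
url = "/test/demo_form.php?" + urlencode(params)
print(url)

# Take it apart again
parts = urlsplit(url)
print(parts.path)             # the resource being requested
print(parse_qs(parts.query))  # each name maps to a list of values
```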

## POST Method

POST is used to send data to a server to create/update a resource.

For example, submitting a form on a website uses POST.


The data sent to the server with POST is stored in the request body of the HTTP request:

In [None]:
POST /test/demo_form.php HTTP/1.1
Host: w3schools.com

name1=value1&name2=value2 
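
The form body uses the same name/value encoding as a query string; it just travels in the request body instead of the URL. Below is a minimal sketch that assembles the request text shown above by hand. The `Content-Type` and `Content-Length` headers are assumptions about what a typical client adds; they do not appear in the shortened example.

```python
from urllib.parse import urlencode

form_data = {"name1": "value1", "name2": "value2"}
body = urlencode(form_data)

request_lines = [
    "POST /test/demo_form.php HTTP/1.1",
    "Host: w3schools.com",
    # The two headers below are assumptions about what a typical client adds.
    "Content-Type: application/x-www-form-urlencoded",
    f"Content-Length: {len(body.encode('utf-8'))}",
    "",    # blank line separates the headers from the body
    body,
]
raw_request = "\r\n".join(request_lines)
print(raw_request)
```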

In [None]:
We can use a GET request to retrieve information, and a POST request to submit data.

# Requests Python library

In [None]:
# use this import statement to import the module
import requests

## GET requests ←

In [None]:
r = requests.get('https://xkcd.com/353/')
print(r)  # prints the Response object, e.g. <Response [200]>

In [5]:
# the response data is stored in attributes of the Response object

dir(r)  # dir() lists the attributes of a Python object

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [None]:
help(r)  # gives a more detailed description
# r.content holds the raw response body as bytes (used below to save an image)
# r.text holds the response body decoded into a Unicode string

In [7]:
r.text  # the raw HTML as a string; use an HTML parser to extract information from it

'<!DOCTYPE html>\n<html>\n<head>\n<link rel="stylesheet" type="text/css" href="/s/7d94e0.css" title="Default"/>\n<title>xkcd: Python</title>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<link rel="shortcut icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="icon" href="/s/919f27.ico" type="image/x-icon"/>\n<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="/atom.xml"/>\n<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/rss.xml"/>\n<!-- <script type="text/javascript" src="/s/b66ed7.js" async></script>\n<script type="text/javascript" src="/s/1b9456.js" async></script> -->\n\n<meta property="og:site_name" content="xkcd">\n\n<meta property="og:title" content="Python">\n<meta property="og:url" content="https://xkcd.com/353/">\n<meta property="og:image" content="https://imgs.xkcd.com/comics/python.png">\n<meta name="twitter:card" content="summary_large_image">\n\n</head>\n<body>\n<div id="topContainer">\n<div id="topLeft">\n<ul

In [17]:
r = requests.get('https://imgs.xkcd.com/comics/python.png')
print(r.content)

b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02\x06\x00\x00\x02L\x08\x00\x00\x00\x00\xa4\x971v\x00\x00\x00\x04gAMA\x00\x00\xb1\x8e|\xfbQ\x93\x00\x00\x00 cHRM\x00\x00z%\x00\x00\x80\x83\x00\x00\xf9\xff\x00\x00\x80\xe9\x00\x00u0\x00\x00\xea`\x00\x00:\x98\x00\x00\x17o\x92_\xc5F\x00\x00\x00\tpHYs\x00\x00\x0b\x13\x00\x00\x0b\x13\x01\x00\x9a\x9c\x18\x00\x01bIIDATx\xda\xdc\xbde@U]\xb7\xfe\xbd\x10\xc5\xa0\xb1\x03\xb1\xb1\x03;\xb1;\xb1\xbb\xbb\xbb\xbb\xbb\xbb\xbbE\x051P\x04E\x01\xa5K\xba\xbbkw\xfc\xde\x0f\x1b\x15\x15\xe3~n\xcf9\xcf\xfb__\xd8{\xaf\xc9\x8a9\xc7\x1cq\x8d\x12\x84"\xda\xda\xdaZBq#}\xed\x02\x87\x96PD[[[\xbbHq\xe3\xe2E\xb4\x05-}]m-A\xd0)\xa5\xad%\x08\xc5\rJi\t\xa5\x0c\xf5\x04A\xdb\xc0PK\xd0\xd2+QD[K(jX\xaa\x88 h\x1b\x95\xd0\x12\x84bF\xa5\xb4\x04\xa1\x94\xbe\xb6\xb6\x96P\xc4@O\x10\xb4\xf4\xf4\x8ahk\tE\x8d\xf4\xb5\xb5\xb4\x8a\x18\x96\xd0."\x08E\x8dJ\x16\xd1\xd6\x12\x8a\x19\xeb\x08B\t\xc3\xa2\xda\xda\xdaZ\x82\x8e\x91n\x11mmA(\xa2o\xa0\xa5\xa5\xa7\xa7\xad-\x08:F%\x8ahN\x19hN\x14-"\x08\xd

In [9]:
with open('comic.png', 'wb') as f:  # 'wb' opens the file for writing in binary mode
    f.write(r.content)

In [10]:
print(r.status_code)

200


### HTTP status codes

HTTP response status codes indicate whether a specific HTTP request has been successfully completed. 
Responses are grouped in five classes:

* Informational responses (100 – 199)
* Successful responses (200 – 299)
* Redirection messages (300 – 399)
* Client error responses (400 – 499), for example when you don't have permission
* Server error responses (500 – 599), for example when a site crashes
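
A small sketch of how these classes can be derived from a code, using the standard library's `http.HTTPStatus` enum for the standard reason phrases. The helper name `status_class` is made up for this example.

```python
from http import HTTPStatus

def status_class(code):
    """Return the class of an HTTP status code based on its first digit."""
    classes = {
        1: "Informational",
        2: "Successful",
        3: "Redirection",
        4: "Client error",
        5: "Server error",
    }
    return classes.get(code // 100, "Unknown")

for code in (200, 301, 401, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```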

In [15]:
r.ok  # True if the status code is less than 400

True

### Headers

In [18]:
r.headers

{'Connection': 'keep-alive', 'Content-Length': '90835', 'Server': 'nginx', 'Content-Type': 'image/png', 'ETag': '"4b66d225-162d3"', 'Expires': 'Thu, 02 Jan 2025 10:14:06 GMT', 'Cache-Control': 'max-age=300', 'Accept-Ranges': 'bytes', 'Date': 'Thu, 02 Jan 2025 13:31:33 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-bom4745-BOM', 'X-Cache': 'HIT', 'X-Cache-Hits': '2', 'X-Timer': 'S1735824693.917207,VS0,VE527'}

### Sending requests with parameters

In [9]:
payload = {'page':2, 'count':25} #dictionary of url parameters
r = requests.get('https://httpbin.org/get', params=payload)

print(r.url)  # note that .get() builds the final URL from the URL string and the params dictionary

https://httpbin.org/get?page=2&count=25


In [10]:
print(r.headers)  # response headers returned by httpbin for this GET request

{'Date': 'Thu, 02 Jan 2025 19:05:08 GMT', 'Content-Type': 'application/json', 'Content-Length': '363', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}


In [11]:
# httpbin's /get endpoint returns a JSON object, unlike xkcd's website, which returns an HTML document.
# JSON responses are common when working with APIs.
print(r.text)

{
  "args": {
    "count": "25", 
    "page": "2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.32.2", 
    "X-Amzn-Trace-Id": "Root=1-6776e362-7b6543953b8d09a62dfcbcdf"
  }, 
  "origin": "152.59.196.233", 
  "url": "https://httpbin.org/get?page=2&count=25"
}
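
Requests can also send custom request headers through the `headers` argument. The sketch below builds a request without sending it, using `requests.Request(...).prepare()`, so the final URL and headers can be inspected offline. The `User-Agent` value is a made-up placeholder.

```python
import requests

payload = {'page': 2, 'count': 25}
custom_headers = {'User-Agent': 'my-notebook/0.1'}  # placeholder value (assumption)

# Build the request but do not send it yet
req = requests.Request('GET', 'https://httpbin.org/get',
                       params=payload, headers=custom_headers)
prepared = req.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers)

# To actually send it (needs network access):
# r = requests.get('https://httpbin.org/get', params=payload,
#                  headers=custom_headers, timeout=5)
```

When the request is really sent, httpbin echoes the headers it received, so the custom `User-Agent` would show up in the `"headers"` part of the response.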



## POST requests →

In [12]:
payload = {'username':'JonDoe', 'password':'testing'} #dictionary of form data
r = requests.post('https://httpbin.org/post', data=payload)

print(r.text) 

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "password": "testing", 
    "username": "JonDoe"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "32", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.32.2", 
    "X-Amzn-Trace-Id": "Root=1-6776e804-26c9932513331b3c0f0241c7"
  }, 
  "json": null, 
  "origin": "152.59.196.233", 
  "url": "https://httpbin.org/post"
}



In [14]:
r.json()  # parse the JSON response into a Python dictionary

{'args': {},
 'data': '',
 'files': {},
 'form': {'password': 'testing', 'username': 'JonDoe'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '32',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.32.2',
  'X-Amzn-Trace-Id': 'Root=1-6776e804-26c9932513331b3c0f0241c7'},
 'json': None,
 'origin': '152.59.196.233',
 'url': 'https://httpbin.org/post'}

In [16]:
type(r.text), type(r.json())

(str, dict)

In [19]:
r_dict = r.json()
r_dict['headers']['User-Agent'] #check the POST json's User-Agent

'python-requests/2.32.2'
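
The `data` argument used above sends form-encoded data. To send JSON instead, Requests accepts a `json` argument, which serialises the dictionary and sets the `Content-Type` header to `application/json`. The sketch below prepares such a request without sending it so the body can be inspected offline; the payload reuses the one from the form example.

```python
import json
import requests

payload = {'username': 'JonDoe', 'password': 'testing'}

# Build (but do not send) a POST request with a JSON body
prepared = requests.Request('POST', 'https://httpbin.org/post', json=payload).prepare()

print(prepared.headers['Content-Type'])
print(prepared.body)

# The body is JSON text that decodes back to the original dictionary
decoded = json.loads(prepared.body)
print(decoded == payload)

# To actually send it (needs network access):
# r = requests.post('https://httpbin.org/post', json=payload, timeout=5)
```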

## Authorization 🔍

Let's demonstrate HTTP Basic authentication, the username-and-password prompt that a browser shows for a protected page.

In [27]:
# deliberately wrong username: the endpoint expects JonDoe / testing
r = requests.get('https://httpbin.org/basic-auth/JonDoe/testing', auth=('JonDoe123', 'testing'))

In [28]:
print(r.ok)

False


In [29]:
print(r)  # 401 Unauthorized: the credentials were rejected

<Response [401]>


In [24]:
r = requests.get('https://httpbin.org/basic-auth/JonDoe/testing', auth=('JonDoe', 'testing'))

In [25]:
print(r.ok)

True


In [26]:
print(r.text) #print the text of the authenticated request

{
  "authenticated": true, 
  "user": "JonDoe"
}
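
Under the hood, HTTP Basic authentication joins the username and password with a colon, Base64-encodes the result, and sends it in an `Authorization` header. A minimal standard-library sketch follows; the helper name `basic_auth_header` is an assumption for illustration.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"

header_value = basic_auth_header("JonDoe", "testing")
print(header_value)

# Decoding the token recovers the credentials: Base64 is an encoding, not encryption
scheme, token = header_value.split(" ", 1)
print(base64.b64decode(token).decode("utf-8"))
```

Because the credentials are only encoded, Basic authentication should be used over HTTPS.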



### Timeout 🕑

In [None]:
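If a server does not respond, or takes too long to respond, your program should not wait for it forever. By default Requests sets no timeout, so pass the timeout argument (in seconds) to raise an exception when the server is too slow.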



In [30]:
r = requests.get('https://httpbin.org/basic-auth/JonDoe/testing', timeout=3) 

In [34]:
r = requests.get('https://httpbin.org/delay/6')  # no timeout: waits for the delayed response

In [35]:
r = requests.get('https://httpbin.org/delay/6', timeout=3)  # gives up after 3 seconds and raises ReadTimeout

ReadTimeout: HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out. (read timeout=3)
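
In a script, an unhandled timeout stops the program, so it is common to catch the exception. The sketch below wraps `requests.get` in a hypothetical helper, `fetch_or_none`, that returns `None` when the request times out. `requests.exceptions.Timeout` covers both connect and read timeouts.

```python
import requests

def fetch_or_none(url, timeout=3):
    """Return the response, or None if the server is too slow to answer."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        print(f"No response from {url} within {timeout} seconds")
        return None

# Example usage (needs network access):
# r = fetch_or_none('https://httpbin.org/delay/6', timeout=3)
# print(r)
```

Returning `None` is only one possible design; retrying or re-raising a custom error are equally reasonable choices.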

# Parsing HTML documents 👀

After fetching the requested page, we find elements by their tag type and by attributes such as id and class. 

Here, let's try fetching the daily information boxes on the Hindi Wikipedia main page.

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML, which is useful for web scraping.

In [41]:
! pip install beautifulsoup4 

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading soupsieve-2.6-py3-none-any.whl.metadata (4.6 kB)
Downloading beautifulsoup4-4.12.3-py3-none-any.whl (147 kB)
Downloading soupsieve-2.6-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.3 soupsieve-2.6


In [42]:
from bs4 import BeautifulSoup

In [49]:
r = requests.get('https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0')
print(r.text)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-disabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-not-available" lang="hi" dir="ltr">
<head>
<meta charset="UTF-8">
<title>विकिपीडिया</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1

In [50]:
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a warning
print(soup)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-disabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-not-available" dir="ltr" lang="hi">
<head>
<meta charset="utf-8"/>
<title>विकिपीडिया</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-

In [54]:
mydivs = soup.find_all("div", {"class": "mcContingut"})
print(mydivs[2])

<div class="mcContingut">
<div class="mcPestanya" id="mc2ps1" style="display:block;visibility:visible;"><figure class="mw-halign-right" typeof="mw:File"><a class="mw-file-description" href="/wiki/%E0%A4%9A%E0%A4%BF%E0%A4%A4%E0%A5%8D%E0%A4%B0:Nazia_Hassan.jpg" title="नाज़िया हसन"><img alt="नाज़िया हसन" class="mw-file-element" data-file-height="211" data-file-width="152" decoding="async" height="100" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Nazia_Hassan.jpg/72px-Nazia_Hassan.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Nazia_Hassan.jpg/108px-Nazia_Hassan.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/04/Nazia_Hassan.jpg/144px-Nazia_Hassan.jpg 2x" width="72"/></a><figcaption>नाज़िया हसन</figcaption></figure>
<div style="font-size:100%;border:none;margin: 0;padding:.1em;color:#">
<ul><li>...कि <b><a href="/wiki/%E0%A4%86%E0%A4%88_%E0%A4%B9%E0%A5%88%E0%A4%B5_%E0%A4%8F_%E0%A4%A1%E0

In [59]:
r = requests.get('https://en.wikipedia.org/wiki/English_Wikipedia')

soup = BeautifulSoup(r.text, 'html.parser')
mydivs = soup.find_all("h1")  # fetch the page heading; find_all returns a list

In [60]:
mydivs

[<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">English Wikipedia</span></h1>]

In [65]:
# list all section headings of the Wikipedia page
soup.find_all('div', {'class':'mw-heading'})

[<div class="mw-heading mw-heading2"><h2 id="Articles">Articles</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="Bureaucracy">Bureaucracy</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="Wikipedians">Wikipedians</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="Criticism">Criticism</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="Controversies">Controversies</h2></div>,
 <div class="mw-heading mw-heading3"><h3 id="English_varieties">English varieties</h3></div>,
 <div class="mw-heading mw-heading3"><h3 id="Disputed_articles">Disputed articles</h3></div>,
 <div class="mw-heading mw-heading3"><h3 id="Threats_against_high_schools">Threats against high schools</h3></div>,
 <div class="mw-heading mw-heading2"><h2 id="WikiProjects_and_assessment">WikiProjects and assessment</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="Internal_news_publications">Internal news publications</h2></div>,
 <div class="mw-heading mw-heading2"><h2 id="See_also">See also</h
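
The results of `find_all` are tag objects, so the text and attributes can be pulled out of them. The sketch below runs on a short hard-coded snippet modelled on the output above (so it works offline) and uses Python's built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Short snippet modelled on the Wikipedia output above
html = """
<div class="mw-heading mw-heading2"><h2 id="Articles">Articles</h2></div>
<div class="mw-heading mw-heading2"><h2 id="Bureaucracy">Bureaucracy</h2></div>
<div class="mw-heading mw-heading3"><h3 id="English_varieties">English varieties</h3></div>
"""

soup = BeautifulSoup(html, 'html.parser')

headings = []
for div in soup.find_all('div', {'class': 'mw-heading'}):
    heading_tag = div.find(['h2', 'h3'])  # the heading element inside the div
    headings.append((heading_tag.name, heading_tag['id'], heading_tag.get_text()))

for name, anchor, text in headings:
    print(name, anchor, text)
```

`get_text()` returns the visible text, while indexing the tag like a dictionary (`heading_tag['id']`) reads an attribute.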

References:

[YouTube video - Python Requests Tutorial: Request Web Pages, Download Images, POST Data, Read JSON, and More](https://www.youtube.com/watch?v=tb8gHvYlCFs&t=1200s&ab_channel=CoreySchafer)