## Web Scraping with Python

### Requests Library

*requests* is a popular library for sending HTTP requests and interacting with APIs and websites.

Let's get started with it.

First, install the library if you haven't already:

In [1]:
!pip install requests



Then import it into your script or shell:

In [2]:
import requests

And voilà, you're ready to make Python talk to the internet.

### Using requests on APIs

To see how you can send HTTP requests with the requests library, let's make a GET request to https://en.wikipedia.org and check the response.



In [None]:
# GET - Retrieve data from the server
# POST - Send data to the server
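To see how GET and POST differ before anything is sent, here is a minimal sketch that builds both requests with `requests.Request(...).prepare()`, which does not touch the network. The endpoint `https://example.org/search` and the parameter name `q` are placeholders made up for illustration.

```python
import requests

# Hypothetical endpoint used only for illustration; nothing is sent over the network.
demo_url = "https://example.org/search"

# GET: parameters travel in the URL's query string.
get_request = requests.Request("GET", demo_url, params={"q": "pakistan"}).prepare()
print(get_request.method, get_request.url)

# POST: form data travels in the request body instead of the URL.
post_request = requests.Request("POST", demo_url, data={"q": "pakistan"}).prepare()
print(post_request.method, post_request.url, post_request.body)
```

The same `params=` and `data=` arguments can be passed straight to `requests.get` and `requests.post` when you actually want to send the request.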

In [3]:
# Sending an HTTP request to a simple website
url = 'https://en.wikipedia.org'
response = requests.get(url)

# Checking the status of the response
response.status_code

200
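A status code of 200 means the request succeeded. As a rough sketch of how to read other codes, the helper below (`describe_status` is a made-up name, not part of requests) uses the standard library's `http.HTTPStatus` to look up the official phrase and groups codes by their first digit.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a short description such as '200 OK (success)'."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for codes HTTPStatus does not know
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```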

In [4]:
response.content

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-disabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-not-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Wikipedia, the free encyclopedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clie

Evidently, the response is an HTML document.
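The `b'...'` prefix in the output shows that `response.content` is raw bytes, while `response.text` is the same content decoded into a string. The sketch below mimics that with a short hard-coded byte string instead of a live response, and assumes UTF-8, which is what the page's `<meta charset>` tag declares.

```python
# Stand-in for response.content: a short, hard-coded byte string (not a live response).
raw_bytes = (
    b'<!DOCTYPE html>\n<html lang="en">\n<head>\n'
    b'<title>Wikipedia, the free encyclopedia</title>\n</head>\n</html>'
)

# Decoding the bytes gives a str, which is the kind of value response.text holds.
html_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__)
print(type(html_text).__name__)
print(html_text.splitlines()[3])  # the <title> line
```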

### Parsing and querying HTML code with BeautifulSoup

Once again, there is a handy library for this: *BeautifulSoup* parses markup-language files and strings and lets you query them with search methods or CSS selectors. (BeautifulSoup itself does not support XPath; libraries such as lxml do.)

We'll work mainly with CSS selectors. In short:

- To select elements by ID, you use #
- To select elements by classes, you use .
- And to select elements by attributes, you use [ ]

Let's try to parse the scraped HTML text using BeautifulSoup. We'll do the following:

- Install the BeautifulSoup library using pip install beautifulsoup4
- Import the BeautifulSoup parser in your script using from bs4 import BeautifulSoup
- Pass your HTML code to the BeautifulSoup class and specify which parser to use

Voilà, you can now use BeautifulSoup to find HTML elements using the *find_all* and *find* methods.
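To make the difference between the two search methods concrete, here is a small sketch on a hand-written HTML snippet (the snippet and the `sample_soup` name are made up for illustration). `find` returns the first match or `None`, while `find_all` returns a list of every match.

```python
from bs4 import BeautifulSoup

# Small hand-written HTML used only for illustration.
sample_html = """
<html><body>
  <h1>Olympic flag bearers</h1>
  <h2>Summer Games</h2>
  <h2>Winter Games</h2>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
first_h2 = sample_soup.find("h2")
print(first_h2.text)

# find_all returns a list of every matching tag.
all_h2 = [tag.text for tag in sample_soup.find_all("h2")]
print(all_h2)

# The snippet has no table, so find gives back None here.
missing = sample_soup.find("table")
print(missing)
```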

In [5]:
!pip install beautifulsoup4



In [6]:
from bs4 import BeautifulSoup

In [7]:
url = 'https://en.wikipedia.org'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

In [9]:
url = "https://en.wikipedia.org/wiki/List_of_flag_bearers_for_Pakistan_at_the_Olympics"
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')
soup

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of flag bearers for Pakistan at the Olympics - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature

In [14]:
for heading in soup.find_all('h1'):
    print(heading.text)

List of flag bearers for Pakistan at the Olympics


In [15]:
for heading in soup.find_all('h2'):
    print(heading.text)

Contents
See also
References


In [None]:
# CSS
# class -> styling shared by many elements
# id -> styling for one specific element

#firstHeading {
    color: blue;
    font-family: Lato;
    font-size: 14px;
}

Alternatively, you can use the *select* or *select_one* methods to extract data with CSS selectors.

![image.png](attachment:image.png)

In [19]:
soup.select_one("#firstHeading").text

'List of flag bearers for Pakistan at the Olympics'
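The three selector forms mentioned earlier (`#` for IDs, `.` for classes, `[ ]` for attributes) can be tried out on a small snippet. The HTML below is hand-written for illustration; only the `firstHeading` ID is borrowed from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for illustration.
sample_html = """
<div id="content">
  <h1 id="firstHeading" class="title">Sample page</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
  <a href="https://example.org/report.pdf" title="Report">Report</a>
  <a href="https://example.org/home">Home</a>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# '#' selects by ID; select_one returns the first match only.
heading = sample_soup.select_one("#firstHeading").text

# '.' selects by class; select returns a list of all matches.
notes = [p.text for p in sample_soup.select(".note")]

# '[ ]' selects by attribute; here, only links that carry a title attribute.
titled_links = [a["href"] for a in sample_soup.select("a[title]")]

print(heading)
print(notes)
print(titled_links)
```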

### Fetch tables from website

Now, let's try to fetch table data from a Wikipedia page.

In [20]:
import requests
from bs4 import BeautifulSoup

In [51]:
url = "https://en.wikipedia.org/wiki/Taxation_in_Pakistan"

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

In [52]:
table = soup.find('table', {'class':'wikitable'})
print(table)

<table class="wikitable">
<caption>
</caption>
<tbody><tr>
<th>Fiscal Year
</th>
<th>Tax Collected
<p>(In Trillion Rs)
</p>
</th></tr>
<tr>
<td>2003-2004
</td>
<td>520.8
</td></tr>
<tr>
<td>2004-2005
</td>
<td>590.4
</td></tr>
<tr>
<td>2005-2006
</td>
<td>713.5
</td></tr>
<tr>
<td>2006-2007
</td>
<td>847.2
</td></tr>
<tr>
<td>2007-2008
</td>
<td>1008.1
</td></tr>
<tr>
<td>2008-2009
</td>
<td>1161.2
</td></tr>
<tr>
<td>2009-2010
</td>
<td>1327.4
</td></tr>
<tr>
<td>2010-2011
</td>
<td>1558
</td></tr>
<tr>
<td>2011-2012
</td>
<td>1882.7
</td></tr>
<tr>
<td>2012-2013
</td>
<td>1946.4
</td></tr>
<tr>
<td>2013-2014
</td>
<td>2254.5
</td></tr>
<tr>
<td>2014-2015
</td>
<td>2590
</td></tr>
<tr>
<td>2015-2016
</td>
<td>3112.5
</td></tr>
<tr>
<td>2016-2017
</td>
<td>3367.9
</td></tr>
<tr>
<td>2017-2018
</td>
<td>3843.8
</td></tr>
<tr>
<td>2018-2019
</td>
<td>3828.5
</td></tr>
<tr>
<td>2019-2020
</td>
<td>3996.7
</td></tr>
<tr>
<td>2020-2021
</td>
<td>4734
</td></tr>
<tr>
<td>2021-2022
</td>
<td>

In [53]:
columns_heading = []
for t in table.find_all('th'):
    columns_heading.append(t.text.strip().replace("\n",""))
    
print(columns_heading)

['Fiscal Year', 'Tax Collected(In Trillion Rs)']
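Notice that removing the newline glues the two parts of the second header together ("Tax Collected(In Trillion Rs)"). One possible alternative, sketched below on the header row copied from the table output, is `get_text(" ", strip=True)`, which strips each piece of text and joins the pieces with a space. The variable names here are made up so they don't overwrite the notebook's `table`.

```python
from bs4 import BeautifulSoup

# Header row copied from the printed table; the rest of the table is left out.
header_html = """
<table class="wikitable">
<tbody><tr>
<th>Fiscal Year
</th>
<th>Tax Collected
<p>(In Trillion Rs)
</p>
</th></tr>
</tbody></table>
"""
header_table = BeautifulSoup(header_html, "html.parser").find("table")

# Strip every text fragment inside each <th> and join the fragments with one space.
clean_headers = [th.get_text(" ", strip=True) for th in header_table.find_all("th")]
print(clean_headers)
```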


In [42]:
rows = []

for row in table.find_all('tr'):
    table_data = row.find_all('td')
    if len(table_data) > 0: 
        row = [cell.text.strip() for cell in table_data]
        print(row)
        rows.append(row)

['2003-2004', '520.8']
['2004-2005', '590.4']
['2005-2006', '713.5']
['2006-2007', '847.2']
['2007-2008', '1008.1']
['2008-2009', '1161.2']
['2009-2010', '1327.4']
['2010-2011', '1558']
['2011-2012', '1882.7']
['2012-2013', '1946.4']
['2013-2014', '2254.5']
['2014-2015', '2590']
['2015-2016', '3112.5']
['2016-2017', '3367.9']
['2017-2018', '3843.8']
['2018-2019', '3828.5']
['2019-2020', '3996.7']
['2020-2021', '4734']
['2021-2022', '6126.1']
['2022-2023', '7163.8']
['2023-2024', '9.285']


In [None]:
rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) > 0:
        row = [cell.text.strip() for cell in cells]
        rows.append(row)

In [45]:
string = "Hello \nWorld".replace("\n","")
print(string)

Hello World


In [50]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Example URL containing the table
url = "https://en.wikipedia.org/wiki/Taxation_in_Pakistan"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the specific table (inspect the page to ensure correct selection)
table = soup.find('table', {'class': 'wikitable'})

# Extract headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())
    
headers[1] = headers[1].replace("\n","")
    
# Extract rows
rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) > 0:
        row = [cell.text.strip() for cell in cells]
        rows.append(row)

# Create DataFrame
df = pd.DataFrame(rows, columns=headers)

# Display the DataFrame
df

Unnamed: 0,Fiscal Year,Tax Collected(In Trillion Rs)
0,2003-2004,520.8
1,2004-2005,590.4
2,2005-2006,713.5
3,2006-2007,847.2
4,2007-2008,1008.1
5,2008-2009,1161.2
6,2009-2010,1327.4
7,2010-2011,1558.0
8,2011-2012,1882.7
9,2012-2013,1946.4
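Every scraped cell is still a string, so the numbers need converting before you can do any maths with them. The sketch below uses a few rows copied from the scraped output; `to_number` is a made-up helper, and the "100 times smaller than the previous year" check is only an arbitrary rule of thumb. It is worth having because the last scraped row ('9.285') appears to be on a different scale from the rows before it.

```python
# A few rows copied from the scraped output; every cell is a string.
sample_rows = [
    ["2003-2004", "520.8"],
    ["2010-2011", "1558"],
    ["2022-2023", "7163.8"],
    ["2023-2024", "9.285"],
]

def to_number(text):
    """Convert a scraped cell to float, returning None if it is not numeric."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None

converted = [(year, to_number(value)) for year, value in sample_rows]
for year, value in converted:
    print(year, value)

# Flag years whose value is over 100x smaller than the previous row's value.
suspicious = [
    year
    for (_, prev), (year, value) in zip(converted, converted[1:])
    if prev is not None and value is not None and value < prev / 100
]
print(suspicious)
```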


### Fetch Image from website

What about images? Yes, we can extract them too. Let's work on that!

In [56]:
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup

# Direct URL of the image file hosted on Wikimedia Commons
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/220px-Flag_of_Pakistan.svg.png"
image_response = requests.get(image_url)
img = Image.open(BytesIO(image_response.content))

# Save the image
# img.save("pakistan_flag.png")

# Display the image
img.show()

#### Class Activity

Instructions:

- Get the HTML from https://www.sbp.org.pk/l_frame/index2.asp using the requests library
- Parse it using BeautifulSoup
- Find all links to PDF documents
  - Hint: href.endswith('.pdf')
- Do any processing required to turn them into proper links
  - Hint: Each link should start with http
  - Hint: Base URL should be https://www.sbp.org.pk/l_frame/index2.asp/........
- Print all the PDF document links
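One possible way to handle the "proper links" step is `urllib.parse.urljoin` from the standard library, which leaves absolute links alone and resolves relative ones against the page URL. The href values below are made up, so treat this as a sketch: how the real page's links should resolve is an assumption you should check against the site.

```python
from urllib.parse import urljoin

page_url = "https://www.sbp.org.pk/l_frame/index2.asp"

# Made-up href values standing in for what the page might contain.
sample_hrefs = [
    "https://www.sbp.org.pk/reports/annual.pdf",  # already absolute
    "/reports/quarterly.pdf",                     # relative to the site root
    "docs/notice.pdf",                            # relative to the page's folder
    "about.asp",                                  # not a PDF, so it is skipped
]

# Keep only PDF links and make each one absolute.
pdf_links = [
    urljoin(page_url, href)
    for href in sample_hrefs
    if href.lower().endswith(".pdf")
]

for link in pdf_links:
    print(link)
```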

### Challenge

Instructions:
    
- Extract the following table from PSX's Website
  - Hint: Find DIV with ID marketmainboard and then find ONE table in it
  
![image.png](attachment:image.png)
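The pattern in the hint (find a DIV by ID, then find one table inside it) can be practised on made-up markup first. The HTML below is entirely hypothetical: only the `marketmainboard` ID comes from the hint, and the real PSX page will look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the symbols and prices are invented for illustration.
sample_html = """
<div id="other"><table><tr><td>ignore me</td></tr></table></div>
<div id="marketmainboard">
  <table>
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>AAA</td><td>10.5</td></tr>
    <tr><td>BBB</td><td>20.0</td></tr>
  </table>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: narrow the search to the DIV with the given ID.
board = sample_soup.find("div", id="marketmainboard")

# Step 2: take the first table inside that DIV only.
board_table = board.find("table")

# Step 3: read every row, treating header and data cells alike.
board_rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in board_table.find_all("tr")
]
print(board_rows)
```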