# Web Scraping

We do not always have access to a neat, organized dataset available in .csv format.
Sometimes the data we need is only available on a web page, and we have to be able to collect it ourselves.
Luckily for us, Python has a solution in the form of the Beautiful Soup package.

In [1]:
# import libraries

import requests
import pandas as pd
from bs4 import BeautifulSoup

After importing the necessary libraries, we have to download the actual HTML of the site.

In [10]:
# Download the page using requests

# url
url = "https://pt.wikipedia.org/wiki/Lista_de_bairros_de_Manaus"

# get the url content
response = requests.get(url)
content = response.content

# creating a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')
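The download step above assumes the request succeeds. A minimal sketch of a slightly more defensive version follows. The helper names `fetch_html` and `parse_html` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return response.content


def parse_html(content):
    """Parse raw HTML with the same parser used in this notebook."""
    return BeautifulSoup(content, "html.parser")


# offline demonstration on an inline snippet, so no network access is needed
demo_soup = parse_html(b"<html><body><table class='wikitable'></table></body></html>")
demo_table_count = len(demo_soup.find_all("table"))
```

With a real page, `parse_html(fetch_html(url))` would play the role of the `soup` object created above.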

# Find the Table

We now have the HTML of the page, so we need to find the table we want. We could retrieve the first table available, but there is the possibility the page contains more than one table, which is common in Wikipedia pages. For this reason, we have to look at all tables and find the correct one. We cannot advance blindly, though. Let us have a look at the structure of the HTML.


 Unfortunately, the tables do not have a title, but they do have a class attribute. We can use this information to pick the correct table.

In [11]:
# verify tables and their classes
print('Classes of each table: ')
for table in soup.find_all('table'):
    print(table.get('class'))

Classes of each table: 
['box-Desatualizado', 'plainlinks', 'metadata', 'ambox', 'ambox-content']
['wikitable', 'sortable']
['nowraplinks', 'collapsible', 'collapsed', 'navbox-inner']


We have seen the tables available, and the one we want is the second one, whose classes are 'wikitable' and 'sortable'.

In [12]:
# creating a list of all tables
tables = soup.find_all('table')

# looking for our designated table
table = soup.find('table', class_ = 'wikitable sortable')
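Passing a string with two class names to `class_` matches the exact value of the `class` attribute, so it depends on the order in which the page lists the classes. The sketch below uses a made-up inline HTML snippet (not the Wikipedia page) to compare this with two order-independent alternatives.

```python
from bs4 import BeautifulSoup

# made-up snippet: the classes are deliberately listed as "sortable wikitable"
html = """
<table class="ambox metadata"><tr><td>notice</td></tr></table>
<table class="sortable wikitable"><tr><td>data</td></tr></table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# exact match on the whole attribute string: order matters, so nothing is found
exact = soup_demo.find("table", class_="wikitable sortable")

# a single class name matches any table that carries that class
by_one_class = soup_demo.find("table", class_="wikitable")

# CSS selector: both classes are required, in any order
by_selector = soup_demo.select_one("table.wikitable.sortable")
```

Here `exact` is `None`, while the other two lookups return the second table. On the page used in this notebook the attribute is listed as `wikitable sortable`, which is why the exact match works there.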

Once we have the correct table, we can extract its data to create our very own dataframe.

In [21]:
# defining the columns' headers
header = ['Neighborhood', 'Zone', 'Area', 'Population', 'Density', 'Homes_count']

# Collecting the data
rows = []
for row in table.tbody.find_all('tr'):
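    # find all data cells in this row
    columns = row.find_all('td')
    
    if columns:  # skip header rows, which have no <td> cells
        neighborhood = columns[0].text.strip()
        zone = columns[1].text.strip()
        # numeric values are read from the first child of each cell's <span>
        # caution: strip('&0.') removes any of '&', '0' and '.' from both ends,
        # so a value that ends in 0 also loses that digit
        area = columns[2].span.contents[0].strip('&0.')
        population = columns[3].span.contents[0].strip('&0.')
        density = columns[4].span.contents[0].strip('&0.')
        homes_count = columns[5].span.contents[0].strip('&0.')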
        
        # append to rows
        rows.append([neighborhood, zone, area, population, density, homes_count])

# dataframe
df = pd.DataFrame(data = rows, columns = header)

# display the dataframe
df.head()

Unnamed: 0,Neighborhood,Zone,Area,Population,Density,Homes_count
0,Adrianópolis,Centro-Sul,248.45,10459,3560.88,3224
1,Aleixo,Centro-Sul,618.34,24417,3340.4,6101
2,Alvorada,Centro-Oeste,553.18,76392,11681.73,18193
3,Armando Mendes,Leste,307.65,33441,9194.86,7402
4,Betânia,Sul,52.51,1294,20845.55,3119
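One detail of the extraction above deserves care: `str.strip('&0.')` treats its argument as a set of characters, not as a prefix, so it can also remove significant trailing zeros. The sketch below illustrates this with a hypothetical padded sort key; the exact key format is an assumption, and `parse_sort_key` is an illustrative helper, not part of the original notebook.

```python
def parse_sort_key(raw):
    """Convert a padded key such as '&0000000000010460.000000' to a float.

    The key format is an assumption made for illustration.
    """
    # drop only the leading '&'; float() copes with the zero padding
    return float(raw.lstrip("&"))


raw = "&0000000000010460.000000"  # hypothetical sort key for the value 10460

stripped = raw.strip("&0.")   # '1046': the final 0 of 10460 is stripped too
parsed = parse_sort_key(raw)  # 10460.0: every significant digit is kept
```

Converting to a number also leaves the dataframe columns ready for arithmetic, whereas the stripped values are still strings.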


# A different url

# Countries in the world by population (2024)

Let us scrape the table on this page. Luckily, it contains only one table.

In [40]:
# no need to import the libraries again; we did it earlier

# url of the webpage
url = 'https://www.worldometers.info/world-population/population-by-country/'

# fetch the content
response = requests.get(url)
content = response.content

# parse the content with BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

# get the table
table = soup.find('table')

# extract the table's headers for columns definition
headers = []
for header in table.find_all('th'):
    header = header.text
    headers.append(header)

# print headers first
print(headers)

# extract the table's rows
rows = []
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    
    # strip surrounding whitespace from each cell
    tdatas = []
    for data in columns:
        tdata = data.text.strip()
        tdatas.append(tdata)

    # append to rows
    rows.append(tdatas)

# define our own dataframe
df = pd.DataFrame(rows, columns=headers)

# drop the id column of the page table
df = df.drop(columns=['#']) 


# display 5 rows of the df table
df.head()

['#', 'Country (or dependency)', 'Population (2023)', 'Yearly Change', 'Net Change', 'Density (P/Km²)', 'Land Area (Km²)', 'Migrants (net)', 'Fert. Rate', 'Med. Age', 'Urban Pop %', 'World Share']


Unnamed: 0,Country (or dependency),Population (2023),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,India,1428627663,0.81 %,11454490,481,2973190,-486136,2.0,28,36 %,17.76 %
1,China,1425671352,-0.02 %,-215985,152,9388211,-310220,1.2,39,65 %,17.72 %
2,United States,339996563,0.50 %,1706706,37,9147420,999700,1.7,38,83 %,4.23 %
3,Indonesia,277534122,0.74 %,2032783,153,1811570,-49997,2.1,30,59 %,3.45 %
4,Pakistan,240485658,1.98 %,4660796,312,770880,-165988,3.3,21,35 %,2.99 %


# Simplified version

In [42]:
url = 'https://www.worldometers.info/world-population/population-by-country/'
response = requests.get(url)
content = response.content

soup = BeautifulSoup(content, 'html.parser')

table = soup.find('table')

headers = [header.text for header in table.find_all('th')]

rows = []
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    columns = [cell.text.strip() for cell in columns]
    rows.append(columns)

df = pd.DataFrame(rows, columns=headers)

df = df.drop(columns=['#'])

df.head()

Unnamed: 0,Country (or dependency),Population (2023),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,India,1428627663,0.81 %,11454490,481,2973190,-486136,2.0,28,36 %,17.76 %
1,China,1425671352,-0.02 %,-215985,152,9388211,-310220,1.2,39,65 %,17.72 %
2,United States,339996563,0.50 %,1706706,37,9147420,999700,1.7,38,83 %,4.23 %
3,Indonesia,277534122,0.74 %,2032783,153,1811570,-49997,2.1,30,59 %,3.45 %
4,Pakistan,240485658,1.98 %,4660796,312,770880,-165988,3.3,21,35 %,2.99 %
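The simplified version can be wrapped in a reusable function. The following is a minimal sketch, assuming a table with a single header row of `<th>` cells and no nested tables; `table_to_dataframe` is a hypothetical helper name, and the inline HTML is a small sample built from the first two rows shown above.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(table):
    """Build a DataFrame from a <table> tag (hypothetical helper)."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    # iterate over every <tr>: html.parser does not insert a <tbody>
    # when the source HTML has none, so table.tbody may be None
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows have no <td> cells and are skipped
            rows.append(cells)
    return pd.DataFrame(rows, columns=headers)


# small inline sample, so the sketch runs without network access
html = """
<table>
  <tr><th>#</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>India</td><td>1428627663</td></tr>
  <tr><td>2</td><td>China</td><td>1425671352</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")
demo_df = table_to_dataframe(demo_table).drop(columns=["#"])
```

Looping over `table.find_all('tr')` instead of `table.tbody.find_all('tr')` is a deliberate choice here: it keeps the function working on pages whose markup has no explicit `<tbody>`.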


# Edpuzzle | BeautifulSoup + Requests | Web Scraping in Python

Scored 5/5, no problems.