## **Wikipedia Web Scraping**
- **Source**: Wikipedia (2025). List of U.S. states by intentional homicide rate.
- **URL**: https://en.wikipedia.org/wiki/List_of_U.S._states_by_intentional_homicide_rate
- **Date**: 10/01/24
- **Goal**: Extract a basic table from a webpage

In [1]:
# Step 0. Load libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
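# Step 1. Get the data from the website
# 1.1 Get the HTML
url = "https://en.wikipedia.org/wiki/List_of_U.S._states_by_intentional_homicide_rate"
# Identify the client with a User-Agent header
headers = {'User-Agent': 'MyWebScraper/1.0'}
# Use a timeout so the request cannot hang indefinitely
response = requests.get(url, headers=headers, timeout=30)
# Stop here on an HTTP error instead of parsing an error page
response.raise_for_status()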

In [3]:
# 1.2 Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table')
print(f'There are {len(tables)} tables in the webpage')


There are 7 tables in the webpage


In [4]:
# 1.3 Keep only the tables with the 'wikitable' class and inspect the first one
wiki_tables = soup.find_all('table', class_='wikitable')
wiki_tables[0]

<table class="wikitable sortable static-row-numbers sort-under col1left sticky-table-head sticky-table-col1 mw-datatable" style="text-align:right">
<caption>Homicides per 100,000 people by year. <a href="/wiki/Federal_Bureau_of_Investigation" title="Federal Bureau of Investigation">FBI</a>
</caption>
<tbody><tr>
<th>Location</th>
<th>2018</th>
<th>2019</th>
<th>2020</th>
<th>2021</th>
<th>2022
</th></tr>
<tr class="static-row-numbers-norank">
<td><span class="flagicon" style="padding-left:25px;"> </span><b>United States</b></td>
<td>5.0</td>
<td>5.1</td>
<td>6.5</td>
<td>6.8</td>
<td>6.3
</td></tr>
<tr class="static-row-numbers-norank">
<td style="text-align:left"><span class="nowrap"><span class="flagicon" style="display:inline-block;width:25px;text-align:left"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="1000" data-file-width="2000" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/0/03/Flag_
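Indexing `wiki_tables[0]` assumes the table of interest stays first on the page. As a minimal sketch of a less position-dependent alternative, the table can be selected by its caption text. The helper name `find_table_by_caption` and the sample HTML below are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
from bs4 import BeautifulSoup


def find_table_by_caption(soup, keyword):
    """Return the first 'wikitable' whose <caption> contains keyword, else None.

    Hypothetical helper for illustration; the match is case-insensitive.
    """
    for table in soup.find_all('table', class_='wikitable'):
        caption = table.find('caption')
        if caption is not None and keyword.lower() in caption.get_text().lower():
            return table
    return None


# Small inline sample (an assumption) so the sketch runs without a network call
sample_html = """
<table class="wikitable sortable">
  <caption>Population by year</caption>
  <tr><th>Region</th><th>2022</th></tr>
</table>
<table class="wikitable sortable">
  <caption>Homicides per 100,000 people by year. FBI</caption>
  <tr><th>Location</th><th>2022</th></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
homicide_table = find_table_by_caption(sample_soup, 'homicides')
missing_table = find_table_by_caption(sample_soup, 'unemployment')
print(homicide_table.find('th').text.strip())
print(missing_table)
```

In the notebook, the same helper could be called as `find_table_by_caption(soup, 'Homicides per 100,000')`, assuming the caption wording on the page does not change.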

In [5]:
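# 1.4 Extract headers
# Note: this reuses the name `headers`, replacing the request headers dict from Step 1.1
rows = wiki_tables[0].find_all('tr')
# The first row holds the column titles in <th> cells
headers = [th.text.strip() for th in rows[0].find_all('th')]
headers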

['Location', '2018', '2019', '2020', '2021', '2022']

In [6]:
# 1.5 Extract rows
data = []
for row in rows[1:]:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
data[:5]

[['United States', '5.0', '5.1', '6.5', '6.8', '6.3'],
 ['District of Columbia', '22.8', '23.4', '28.2', '41.0', '29.3'],
 ['Louisiana', '11.4', '11.7', '15.8', '19.6', '16.1'],
 ['New Mexico', '7.3', '8.8', '7.8', '12.9', '12.0'],
 ['South Carolina', '8.1', '8.8', '10.5', '11.4', '11.2']]
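Steps 1.4 and 1.5 can be combined into one reusable function. The following is a sketch under stated assumptions: the function name `parse_table` and the sample HTML are hypothetical, and the length check that skips rows whose cell count differs from the header count is an added safeguard that the original cells do not perform.

```python
from bs4 import BeautifulSoup


def parse_table(table):
    """Return (headers, rows) as lists of stripped strings for one <table> tag.

    Hypothetical helper: assumes the first <tr> holds the <th> column titles.
    """
    table_rows = table.find_all('tr')
    column_names = [th.text.strip() for th in table_rows[0].find_all('th')]
    parsed = []
    for row in table_rows[1:]:
        cells = [td.text.strip() for td in row.find_all('td')]
        # Added safeguard: keep only rows that line up with the header
        if len(cells) == len(column_names):
            parsed.append(cells)
    return column_names, parsed


# Small inline sample (an assumption) using values shown in the output above
sample_html = """
<table class="wikitable">
  <tr><th>Location</th><th>2018</th><th>2019
</th></tr>
  <tr><td><b>United States</b></td><td>5.0</td><td>5.1
</td></tr>
  <tr><td>Louisiana</td><td>11.4</td><td>11.7
</td></tr>
  <tr><td>Incomplete row</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
sample_headers, sample_rows = parse_table(sample_table)
print(sample_headers)
print(sample_rows)
```

In the notebook, `parse_table(wiki_tables[0])` would return the same kind of header and row lists as the two cells above, minus any rows that fail the length check.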

In [7]:
# 1.6 Create DataFrame
df = pd.DataFrame(data, columns=headers)
df.sample(5, random_state=2025)

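|    | Location   | 2018 | 2019 | 2020 | 2021 | 2022 |
|---:|:-----------|-----:|-----:|-----:|-----:|-----:|
|  8 | Alaska     |  6.4 |  9.4 |  6.7 |  6.1 |  9.5 |
| 39 | Vermont    |  1.8 |  1.8 |  2.2 |  1.4 |  3.4 |
|  2 | Louisiana  | 11.4 | 11.7 | 15.8 | 19.6 | 16.1 |
| 26 | California |  4.4 |  4.3 |  5.6 |  6.0 |  5.7 |
|  6 | Arkansas   |  7.4 |  7.8 | 10.6 | 11.0 | 10.2 |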
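Because every cell was scraped as text, the year columns in `df` hold strings rather than numbers, so sorting or averaging them would not behave numerically. A minimal sketch of one way to convert them follows. The small `sample` frame is an assumption built from values shown above, and using `errors='coerce'` (which turns unparseable cells into `NaN`) is a design choice, not something the original notebook does.

```python
import pandas as pd

# Small stand-in for df (an assumption), with string cells as produced by scraping
sample = pd.DataFrame(
    [['United States', '5.0', '6.3'],
     ['Louisiana', '11.4', '16.1'],
     ['Vermont', '1.8', '3.4']],
    columns=['Location', '2018', '2022'],
)

# Every column except the location label is treated as a year column
year_cols = [col for col in sample.columns if col != 'Location']

# Convert strings to floats; cells that cannot be parsed become NaN
sample[year_cols] = sample[year_cols].apply(pd.to_numeric, errors='coerce')

print(sample.dtypes)
# Numeric sort now works as expected: highest 2022 rate first
ranked = sample.sort_values('2022', ascending=False).reset_index(drop=True)
print(ranked)
```

The same two lines (building `year_cols` and applying `pd.to_numeric`) could be run on `df` directly.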


### **References**
[1] Real Python (2025). Beautiful Soup: Build a Web Scraper With Python. Retrieved from https://realpython.com/beautiful-soup-web-scraper-python/
