## Part 1 - **Data Acquisition**:

Using a Python web scraping library, extract and build a dataset from a website (it should contain data of mining interest and significance, such as weather, population, diseases, etc.). Once you have extracted your dataset, apply the necessary cleansing steps to prepare the data for your model.

Specify the line where you have saved your data in a CSV or XLS file.

In [1]:
# Install the required packages (requests, Beautiful Soup, and the lxml parser used below)
!pip install requests beautifulsoup4 lxml

# Import libraries
import requests   # for sending HTTP requests to websites
from bs4 import BeautifulSoup   # for parsing and navigating HTML
import re   # for handling regular expressions
import dateutil   # for parsing dates
import pandas as pd   # for tabular data




In [2]:
results = requests.get("https://en.wikipedia.org/wiki/List_of_infectious_diseases") # send a GET request to the Wikipedia page I chose for my dataset

In [3]:
assert results.status_code==200 # ensure the page was loaded successfully
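
The status check above can also be written with `raise_for_status()`, which reports the actual HTTP error instead of a bare `AssertionError`. A minimal sketch, assuming a hypothetical helper named `fetch_html`, a placeholder `User-Agent` string, and a 10-second timeout (none of these are part of the original notebook):

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as bytes, raising if the request fails."""
    if session is None:
        session = requests.Session()
    # Placeholder User-Agent (an assumption); replace with your own project details.
    headers = {"User-Agent": "course-scraper/0.1 (educational project)"}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.content


# Usage (performs a real network request):
# src = fetch_html("https://en.wikipedia.org/wiki/List_of_infectious_diseases")
```

The timeout keeps the cell from hanging if the site does not respond, and passing a `session` makes the helper easy to exercise without network access.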

In [4]:
src = results.content  # raw HTML content of the page, as bytes
document = BeautifulSoup(src, 'lxml')   # parse the HTML using BeautifulSoup with the 'lxml' parser

In [5]:
# Get the first HTML table to make sure that I loaded the correct Wikipedia page

table = document.find("table")
table

<table class="sortable wikitable">
<tbody><tr>
<th>Infectious agent
</th>
<th>Common name
</th>
<th>Diagnosis
</th>
<th>Treatment
</th>
<th>Vaccine(s)
</th></tr>
<tr>
<td><i><a href="/wiki/Acinetobacter_baumannii" title="Acinetobacter baumannii">Acinetobacter baumannii</a></i>
</td>
<td><i><a href="/wiki/Acinetobacter" title="Acinetobacter">Acinetobacter</a></i> infections
</td>
<td>Culture
</td>
<td>Supportive care
</td>
<td class="table-no" style="background:#FFC7C7;color:black;vertical-align:middle;text-align:center;">No
</td></tr>
<tr>
<td><i><a href="/wiki/Actinomyces_israelii" title="Actinomyces israelii">Actinomyces israelii</a></i>, <i><a href="/wiki/Actinomyces_gerencseriae" title="Actinomyces gerencseriae">Actinomyces gerencseriae</a></i> and <i><a class="mw-redirect" href="/wiki/Propionibacterium_propionicus" title="Propionibacterium propionicus">Propionibacterium propionicus</a></i>
</td>
<td><a href="/wiki/Actinomycosis" title="Actinomycosis">Actinomycosis</a>
</td>
<td>Hi

In [6]:
print(table.find("th").get_text()) # print the first header cell to check what kind of table I'll be working on (and to double-check that I got the right table)

Infectious agent
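
Since the first `<th>` matches, the same idea can read every column name from the header row instead of typing them by hand later. A minimal sketch, using a small inline copy of the header row shown above and Python's built-in `html.parser` (an assumption made so the example runs without `lxml`; the notebook itself uses `lxml`):

```python
from bs4 import BeautifulSoup

# Inline copy of the header row from the table output above.
header_html = (
    "<table><tbody><tr>"
    "<th>Infectious agent\n</th><th>Common name\n</th><th>Diagnosis\n</th>"
    "<th>Treatment\n</th><th>Vaccine(s)\n</th>"
    "</tr></tbody></table>"
)

sample_table = BeautifulSoup(header_html, "html.parser").find("table")

# Take the <th> cells of the first row and strip the trailing newlines.
header_names = [th.get_text(strip=True) for th in sample_table.find("tr").find_all("th")]
print(header_names)
```

On the real page, `sample_table` would simply be the `table` object already found in the notebook.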



In [7]:
# Gather all rows of the table, including the header row and the content rows

rows = table.find_all("tr")   # find all rows in the table
rows

[<tr>
 <th>Infectious agent
 </th>
 <th>Common name
 </th>
 <th>Diagnosis
 </th>
 <th>Treatment
 </th>
 <th>Vaccine(s)
 </th></tr>,
 <tr>
 <td><i><a href="/wiki/Acinetobacter_baumannii" title="Acinetobacter baumannii">Acinetobacter baumannii</a></i>
 </td>
 <td><i><a href="/wiki/Acinetobacter" title="Acinetobacter">Acinetobacter</a></i> infections
 </td>
 <td>Culture
 </td>
 <td>Supportive care
 </td>
 <td class="table-no" style="background:#FFC7C7;color:black;vertical-align:middle;text-align:center;">No
 </td></tr>,
 <tr>
 <td><i><a href="/wiki/Actinomyces_israelii" title="Actinomyces israelii">Actinomyces israelii</a></i>, <i><a href="/wiki/Actinomyces_gerencseriae" title="Actinomyces gerencseriae">Actinomyces gerencseriae</a></i> and <i><a class="mw-redirect" href="/wiki/Propionibacterium_propionicus" title="Propionibacterium propionicus">Propionibacterium propionicus</a></i>
 </td>
 <td><a href="/wiki/Actinomycosis" title="Actinomycosis">Actinomycosis</a>
 </td>
 <td>Histologic fin

In [10]:
# Select the first table on the page that has the "wikitable" class
tables = document.find_all("table", class_="wikitable")
target_table = tables[0]

# Store all rows into a list
rows = target_table.find_all("tr")
data = []

# Loop through each row except the header (starting from index 1)
for row in rows[1:]:
    cells = row.find_all(["td", "th"])
    if len(cells) == 5:   #5 columns for agent, common name, diagnosis, treatment, vaccine

        cells_text = [cell.get_text(strip=True).replace('\xa0', ' ') for cell in cells]

        # Skip rows that look like repeated headers or missing values
        if all(cells_text) and not any("agent" in c.lower() for c in cells_text):
            data.append(cells_text)
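
One thing to watch in this loop: `get_text(strip=True)` strips every text fragment and joins them with no separator, so a cell such as `<i><a>Acinetobacter</a></i> infections` comes out as `Acinetobacterinfections`. A minimal sketch of one possible fix, assuming a hypothetical helper named `cell_text` and two inline sample cells modelled on the table HTML shown earlier:

```python
import re

from bs4 import BeautifulSoup

# Two sample cells modelled on the Wikipedia table (links shortened to "#").
sample_html = (
    "<table><tr>"
    '<td><i><a href="#">Acinetobacter</a></i> infections\n</td>'
    '<td><a href="#">Penicillin</a>, <a href="#">doxycycline</a>, '
    'and <a href="#">sulfonamides</a>\n</td>'
    "</tr></table>"
)
sample_cells = BeautifulSoup(sample_html, "html.parser").find_all("td")


def cell_text(cell):
    """Join text fragments with a space, then tidy spacing around punctuation."""
    text = cell.get_text(" ", strip=True).replace("\xa0", " ")
    text = re.sub(r"\s+([,.;:)])", r"\1", text)  # drop the space left before punctuation
    return re.sub(r"\s+", " ", text)             # collapse repeated whitespace


glued = [c.get_text(strip=True) for c in sample_cells]
spaced = [cell_text(c) for c in sample_cells]
print(glued)
print(spaced)
```

Swapping `cell.get_text(strip=True)` for a helper like this in the list comprehension would keep words separated; whether the extra punctuation cleanup is wanted depends on how the text is used by the model.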

In [11]:
# Create DataFrame
columns = ["Infectious agent", "Common name", "Diagnosis", "Treatment", "Vaccine(s)"]
df = pd.DataFrame(data, columns=columns)

# Drop duplicates or clean common formatting issues
df["Common name"] = df["Common name"].str.replace('\n', ' ').str.strip()
df = df.drop_duplicates()

# Preview
df.head(10)

| | Infectious agent | Common name | Diagnosis | Treatment | Vaccine(s) |
|---|---|---|---|---|---|
| 0 | Acinetobacter baumannii | Acinetobacterinfections | Culture | Supportive care | No |
| 1 | Actinomyces israelii,Actinomyces gerencseriaea... | Actinomycosis | Histologic findings | Penicillin,doxycycline, andsulfonamides | No |
| 2 | Adenoviridae | Adenovirus infection | Antigendetection,polymerase chain reactionassa... | Most infections are mild and require no therap... | Under research[1] |
| 3 | Trypanosoma brucei | African sleeping sickness(African trypanosomia... | Identification of trypanosomes in a sample by ... | Fexinidazoleby mouth orpentamidineby injection... | Under research[2] |
| 4 | HIV (Human immunodeficiency virus) | AIDS (acquired immunodeficiency syndrome) | Antibody test, p24 antigen test, PCR | Treatment is typically anon-nucleoside reverse... | Under research[3] |
| 5 | Anaplasmaspecies | Anaplasmosis | indirect immunofluorescence antibody assay for... | Tetracycline drugs (includingtetracycline,chlo... | No |
| 6 | Angiostrongylus | Angiostrongyliasis | Lumbar puncture, brain imaging, serology | Albendazole | No |
| 7 | Anisakis | Anisakiasis | Gastroscopic examination, or histopathologic e... | Albendazole | No |
| 8 | Bacillus anthracis | Anthrax | Culture, PCR | Large doses of intravenous and oral antibiotic... | Yes |
| 9 | Arcanobacterium haemolyticum | Arcanobacterium haemolyticuminfection | Culture in human bloodagarplates | erythromycin(proposed as the first-line drug),... | No |
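
The `Vaccine(s)` column in the preview carries Wikipedia footnote markers such as `[1]`, so `Under research[1]` and `Under research[2]` would be treated as different categories. A possible extra cleansing step, sketched on a small hypothetical sample rather than the real `df`:

```python
import pandas as pd

# Hypothetical sample mirroring the values seen in the preview.
sample = pd.DataFrame(
    {"Vaccine(s)": ["No", "Under research[1]", "Yes", "Under research[2]"]}
)

# Remove bracketed numeric footnote markers like "[1]" and trim leftover spaces.
sample["Vaccine(s)"] = (
    sample["Vaccine(s)"].str.replace(r"\[\d+\]", "", regex=True).str.strip()
)

print(sample["Vaccine(s)"].unique())
```

The same `str.replace` call could be applied to the other text columns if they turn out to contain footnote markers as well.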


In [13]:
# Downloading the pre-processed CSV file to local machine

# Save the DataFrame to a CSV file first
df.to_csv("infections_list_cleaned.csv", index=False)

from google.colab import files
files.download("infections_list_cleaned.csv")

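
Because `google.colab` only exists inside Colab, the download step fails elsewhere. A minimal sketch of saving the CSV, reading it back as a sanity check, and downloading only when Colab is available; the two-row `sample` frame and the temporary output folder are assumptions for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample standing in for the scraped DataFrame.
sample = pd.DataFrame(
    {"Common name": ["Anthrax", "Anisakiasis"], "Vaccine(s)": ["Yes", "No"]}
)

out_path = os.path.join(tempfile.mkdtemp(), "infections_list_cleaned.csv")
sample.to_csv(out_path, index=False, encoding="utf-8")  # this line saves the data

# Read the file back to confirm the rows and columns survived the round trip.
loaded = pd.read_csv(out_path)
print(loaded.shape)

# files.download is Colab-only, so fall back to printing the path elsewhere.
try:
    from google.colab import files
    files.download(out_path)
except ImportError:
    print("Not running in Colab; file saved at", out_path)
```

Reading the file back is a cheap way to catch problems such as an unwanted index column before the data is handed to the model.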