# Milestone 3
**Name:** Eula Fullerton  
**Due Date:** Oct 20, 2024  
**Class:** DSC540-T302 Data Preparation  
**Professor:** Professor Williams 

Cleaning/Formatting Website Data

Perform at least 5 data transformation and/or cleansing steps to your website data. The below examples are not required - they are just potential transformations you could do. If your data doesn't work for these scenarios, complete different transformations. You can do the same transformation multiple times if needed to clean your data. The goal is a clean dataset at the end of the milestone. As a reminder - you cannot export your website data to CSV to work with it, you must do all the work directly against the HTML source.

Examples:
- Replace Headers
- Format data into a more readable format
- Identify outliers and bad data
- Find duplicates
- Fix casing or inconsistent values
- Conduct Fuzzy Matching

Make sure you clearly number and label each transformation step (Step #1, Step #2, etc.) in your code and describe what it is doing in 1-2 sentences.

You must submit the following:
- Jupyter Notebook File or PDF of your code with Milestone # listed.
- Each transformation should be labeled with description or what it is doing.
- Human readable dataset after all transformations should be printed at the end of your notebook.
- 1 paragraph of the ethical implications of data wrangling specific to your datasource and the steps you completed answering the following questions:
  - What changes were made to the data?
  - Are there any legal or regulatory guidelines for your data or project topic?
  - What risks could be created based on the transformations done?
  - Did you make any assumptions in cleaning/transforming the data?
  - How was your data sourced / verified for credibility?
  - Was your data acquired in an ethical way?
  - How would you mitigate any of the ethical implications you have identified?

In [3]:
# Import necessary libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np

## Viewing the Data

In my project, I will work with Table 0 and Table 3 (zero-based positions in the list of tables extracted from the page).

In [6]:
# Retrieve the HTML content
url = "https://en.wikipedia.org/wiki/Pregnancy_category"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

In [8]:
# Extract all tables into a list of DataFrames
tables = pd.read_html(url)

In [10]:
# Display the number of tables extracted
print(f"Number of tables found: {len(tables)}")

Number of tables found: 5


In [12]:
# Display the first few rows of each table to identify them
for i, table in enumerate(tables):
    print(f"\nTable {i}:")
    print(table.head())  


Table 0:
  Pregnancy Category                                        Description
0                  A  No risk in controlled human studies: Adequate ...
1                  B  No risk in other studies: Animal reproduction ...
2                  C  Risk not ruled out: Animal reproduction studie...
3                  D  Positive evidence of risk: There is positive e...
4                  X  Contraindicated in pregnancy: Studies in anima...

Table 1:
  Pregnancy Category  \
0                  A   
1                 B1   
2                 B2   
3                 B3   
4                  C   

  Australian categorisation system for prescribing medicines in pregnancy  
0  Drugs which have been taken by many pregnant w...                       
1  Drugs which have been taken by only a limited ...                       
2  Drugs which have been taken by only a limited ...                       
3  Drugs which have been taken by only a limited ...                       
4  Drugs which, owing t
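
Picking tables by position (0 and 3) depends on the page layout staying the same. As a hedged alternative, the sketch below selects a table by the column labels it must contain. The helper name `find_table_by_columns` and the tiny stand-in tables are hypothetical and only illustrate the idea; they are not part of the scraped data.

```python
import pandas as pd


def find_table_by_columns(tables, required):
    """Return (index, table) for the first table whose column labels include all of `required`."""
    required = set(required)
    for i, table in enumerate(tables):
        # For multi-level headers, compare against the innermost level only
        if isinstance(table.columns, pd.MultiIndex):
            labels = set(table.columns.get_level_values(-1))
        else:
            labels = set(table.columns)
        if required <= labels:
            return i, table
    raise LookupError(f"No table has columns {sorted(required)}")


# Tiny stand-ins for the scraped tables (illustrative values only)
sample_tables = [
    pd.DataFrame({"Pregnancy Category": ["A"], "Description": ["No risk: example text"]}),
    pd.DataFrame(
        [["Caffeine", "A", "?"]],
        columns=pd.MultiIndex.from_product(
            [["Classification"], ["Pharmaceutical agent", "Australia", "United States"]]
        ),
    ),
]

idx, agents = find_table_by_columns(sample_tables, ["Pharmaceutical agent", "United States"])
print(idx)
```

With the real `tables` list, the same call would be expected to return the table that this notebook reaches through `tables[3]`, provided the page still uses those header labels.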

In [16]:
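# Extract the relevant table (assuming it is the first wikitable on the page)
from io import StringIO

table = soup.find("table", {"class": "wikitable"})
# Wrap the HTML string in StringIO; passing literal HTML to read_html is deprecated in recent pandas
df = pd.read_html(StringIO(str(table)))[0]

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)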

## 1. Replace Headers

In [19]:
# Step 1: display first few rows of Table 0
print("Original Data Headers:")
print(df.head()) 

Original Data Headers:
  Pregnancy Category                                        Description
0                  A  No risk in controlled human studies: Adequate ...
1                  B  No risk in other studies: Animal reproduction ...
2                  C  Risk not ruled out: Animal reproduction studie...
3                  D  Positive evidence of risk: There is positive e...
4                  X  Contraindicated in pregnancy: Studies in anima...


In [21]:
# Step 2: Replace Headers of Table 0
df.columns = df.columns.str.replace("Pregnancy Category", "FDA Pregnancy Risk Category")  # Renaming for clarity
df.columns = df.columns.str.strip()  # Stripping whitespace from headers

In [23]:
# Step 3: Display new headers 
print("New Headers:")
print(df.head())

New Headers:
  FDA Pregnancy Risk Category  \
0                           A   
1                           B   
2                           C   
3                           D   
4                           X   

                                         Description  
0  No risk in controlled human studies: Adequate ...  
1  No risk in other studies: Animal reproduction ...  
2  Risk not ruled out: Animal reproduction studie...  
3  Positive evidence of risk: There is positive e...  
4  Contraindicated in pregnancy: Studies in anima...  
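
Note that the rename above is applied to `df`, while the later steps start again from `tables[0]`, so the new header does not automatically carry over to `table_0`. A minimal sketch of applying the same rename with `DataFrame.rename`, assuming a small stand-in frame in place of the scraped table (the padded header and the rows are illustrative only):

```python
import pandas as pd

# Stand-in for tables[0]; the padded header imitates stray whitespace
table_0 = pd.DataFrame(
    {
        " Pregnancy Category ": ["A", "X"],
        "Description": [
            "No risk in controlled human studies: example text",
            "Contraindicated in pregnancy: example text",
        ],
    }
)

# Strip whitespace first so the exact-match rename below can find the header
table_0.columns = table_0.columns.str.strip()
table_0 = table_0.rename(columns={"Pregnancy Category": "FDA Pregnancy Risk Category"})

print(table_0.columns.tolist())
```

Unlike `str.replace`, `rename` with a dictionary only touches headers that match a key exactly, which is why the whitespace is stripped before renaming.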


## 2. Create new columns by splitting the description

The "Description" column of Table 0 is split at the colon into two new columns, "Risk" and "Detailed Description".

In [26]:
# Step 1: Get the column names of table_0
table_0 = tables[0]  
table_0_columns = table_0.columns.tolist() 
print("Column names in Table 0:")
print(table_0_columns)

Column names in Table 0:
['Pregnancy Category', 'Description']


In [28]:
# Step 2: Split the 'Description' column values at the first ':' and create new columns
# (n=1 guarantees exactly two columns even if a description contains another colon)
split_column = table_0['Description'].str.split(':', n=1, expand=True)

# Step 3: Rename the new columns to custom names
split_column.columns = ['Risk', 'Detailed Description']

# Step 4: Concatenate the original DataFrame with the new columns
table_0 = pd.concat([table_0, split_column], axis=1)

# Step 5: Drop the original 'Description' column if no longer needed
table_0 = table_0.drop(columns=['Description'])

# Step 6: Display the updated DataFrame
print("Updated Table 0:")
print(table_0.head())

Updated Table 0:
  Pregnancy Category                                 Risk  \
0                  A  No risk in controlled human studies   
1                  B             No risk in other studies   
2                  C                   Risk not ruled out   
3                  D            Positive evidence of risk   
4                  X         Contraindicated in pregnancy   

                                Detailed Description  
0   Adequate and well-controlled human studies ha...  
1   Animal reproduction studies have failed to de...  
2   Animal reproduction studies have shown an adv...  
3   There is positive evidence of human fetal ris...  
4   Studies in animals or humans have demonstrate...  
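
The output above shows a leading space in "Detailed Description", left over from the text after the colon. The sketch below, using made-up description strings, splits only at the first colon with `n=1` and then strips the surrounding whitespace from both new columns.

```python
import pandas as pd

# Illustrative descriptions; the second one contains two colons on purpose
descriptions = pd.Series(
    [
        "No risk in controlled human studies: Adequate studies show no risk.",
        "Risk not ruled out: Animal studies show an effect: no human studies exist.",
    ]
)

# n=1 splits only at the first colon, so there are always exactly two columns
parts = descriptions.str.split(":", n=1, expand=True)
parts.columns = ["Risk", "Detailed Description"]

# Remove the space left behind after the colon
parts = parts.apply(lambda col: col.str.strip())

print(parts.loc[1, "Detailed Description"])
```

The second colon stays inside the detailed description instead of creating a third column.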


## 3. Remove unnecessary columns

My project will focus on the US column, so I omitted the Australia column from Table 3. (The code below calls this table `table_4` because it is the fourth table on the page, at index 3.)

In [31]:
# Step 1: Access Table 4 (index 3)
table_4 = tables[3]  

In [33]:
# Step 2: Display the original Table 4
print("Original Table 4:")
print(table_4.head()) 

Original Table 4:
  Classification of some agents, based on different national bodies            \
                                               Pharmaceutical agent Australia   
0                     Acetylsalicylic acid (aspirin)                        C   
1                                            Alcohol                        ?   
2                                        Amoxicillin                        A   
3                                           Caffeine                        A   
4                   Amoxicillin with clavulanic acid                       B1   

                      
       United States  
0  D third trimester  
1                  X  
2                  B  
3                  ?  
4                  B  


In [35]:
# Step 3: Display the column names
print("Column names in Table 4:")
print(table_4.columns.tolist())  

Column names in Table 4:
[('Classification of some agents, based on different national bodies', 'Pharmaceutical agent'), ('Classification of some agents, based on different national bodies', 'Australia'), ('Classification of some agents, based on different national bodies', 'United States')]


In [37]:
# Step 4: Remove the "Australia" column
column_to_drop = ('Classification of some agents, based on different national bodies', 'Australia')  

# Step 5: Check if the column exists before trying to drop it
if column_to_drop in table_4.columns:
    table_4 = table_4.drop(columns=[column_to_drop])  
    print(f"Column '{column_to_drop}' has been removed.")
else:
    print(f"Column '{column_to_drop}' not found in Table 4.")

# Step 6: Display the updated DataFrame to confirm the column has been removed
print("\nUpdated Table 4:")
print(table_4.head())  


Column '('Classification of some agents, based on different national bodies', 'Australia')' has been removed.

Updated Table 4:
  Classification of some agents, based on different national bodies  \
                                               Pharmaceutical agent   
0                     Acetylsalicylic acid (aspirin)                  
1                                            Alcohol                  
2                                        Amoxicillin                  
3                                           Caffeine                  
4                   Amoxicillin with clavulanic acid                  

                      
       United States  
0  D third trimester  
1                  X  
2                  B  
3                  ?  
4                  B  
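
Every column in this table shares the same long top-level caption, which is why each label has to be written as a tuple. One possible simplification, sketched here on a two-row stand-in frame, is to drop that top level with `droplevel` so that columns can be addressed by their short names. This is an optional alternative, not what the notebook does.

```python
import pandas as pd

top = "Classification of some agents, based on different national bodies"

# Stand-in for tables[3] with the same two-level header
table_4 = pd.DataFrame(
    [["Amoxicillin", "A", "B"], ["Alcohol", "?", "X"]],
    columns=pd.MultiIndex.from_product(
        [[top], ["Pharmaceutical agent", "Australia", "United States"]]
    ),
)

# Drop the shared top-level caption so each column has a single, short label
flat = table_4.droplevel(0, axis=1)

# Columns can now be dropped by their plain name
flat = flat.drop(columns=["Australia"])

print(flat.columns.tolist())
```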


## 4. Replace Values 
Research was conducted to find the missing values for the caffeine and nicotine FDA classifications in the US. According to drugs.com, caffeine is classified as category C, while nicotine is classified as category D.

In [40]:
# Step 1: Replace values in the United States column for Caffeine and Nicotine with different values
table_4.loc[
    table_4[('Classification of some agents, based on different national bodies', 'Pharmaceutical agent')] == 'Caffeine',
    ('Classification of some agents, based on different national bodies', 'United States')
] = 'C'  

table_4.loc[
    table_4[('Classification of some agents, based on different national bodies', 'Pharmaceutical agent')] == 'Nicotine',
    ('Classification of some agents, based on different national bodies', 'United States')
] = 'D'  

# Step 2: Display the updated DataFrame to confirm the changes
print("Updated Table 4 after replacing values in the 'United States' column for Caffeine and Nicotine:")
print(table_4.head()) 

Updated Table 4 after replacing values in the 'United States' column for Caffeine and Nicotine:
  Classification of some agents, based on different national bodies  \
                                               Pharmaceutical agent   
0                     Acetylsalicylic acid (aspirin)                  
1                                            Alcohol                  
2                                        Amoxicillin                  
3                                           Caffeine                  
4                   Amoxicillin with clavulanic acid                  

                      
       United States  
0  D third trimester  
1                  X  
2                  B  
3                  C  
4                  B  
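
Because `head()` shows only the first five rows, the Nicotine change is not visible in the output above. The sketch below is one way to list the rows that hold the "?" placeholder, apply the replacements from a dictionary, and print only the edited rows. It assumes flattened column names and a small stand-in frame; the "?" shown for Nicotine is an assumption made for the illustration.

```python
import pandas as pd

# Stand-in with flattened column names; rows are illustrative
table_4 = pd.DataFrame(
    {
        "Pharmaceutical agent": ["Alcohol", "Caffeine", "Nicotine", "Amoxicillin"],
        "United States": ["X", "?", "?", "B"],
    }
)

# Replacement categories as stated in the text above (sourced from drugs.com)
replacements = {"Caffeine": "C", "Nicotine": "D"}

# List the agents that still hold the "?" placeholder before changing anything
missing = table_4.loc[table_4["United States"] == "?", "Pharmaceutical agent"].tolist()
print("Agents with '?':", missing)

for agent, category in replacements.items():
    mask = table_4["Pharmaceutical agent"] == agent
    # Fail loudly if the name does not match exactly one row
    if mask.sum() != 1:
        raise ValueError(f"Expected one row for {agent!r}, found {mask.sum()}")
    table_4.loc[mask, "United States"] = category

# Show only the edited rows, which head() may not include
print(table_4[table_4["Pharmaceutical agent"].isin(list(replacements))])
```

Checking the match count guards against a silent no-op when a drug name is spelled differently on the page.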


## 5. Replace Header
Since the key relationship in this study will be through drug names prescribed or associated with pregnancy outcomes, I changed the header "Pharmaceutical agent" to "Drug".

In [43]:
# Step 1: Create a list to rename the columns
new_columns = [
    ('Classification of some agents, based on different national bodies', 'Drug') 
    if col == ('Classification of some agents, based on different national bodies', 'Pharmaceutical agent') 
    else col 
    for col in table_4.columns
]

# Step 2: Update the DataFrame with the new column names
table_4.columns = pd.MultiIndex.from_tuples(new_columns)

# Step 3: Display the updated column names
print("Column names in Table 4 after renaming:")
print(table_4.columns.tolist())

Column names in Table 4 after renaming:
[('Classification of some agents, based on different national bodies', 'Drug'), ('Classification of some agents, based on different national bodies', 'United States')]
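
The same rename can be written more compactly with `DataFrame.rename` and its `level` argument, which limits the mapping to one level of a multi-level header. A minimal sketch on a one-row stand-in frame:

```python
import pandas as pd

top = "Classification of some agents, based on different national bodies"

# Stand-in for table_4 after the Australia column has been removed
table_4 = pd.DataFrame(
    [["Caffeine", "C"]],
    columns=pd.MultiIndex.from_product([[top], ["Pharmaceutical agent", "United States"]]),
)

# level=1 restricts the rename to the inner header level
table_4 = table_4.rename(columns={"Pharmaceutical agent": "Drug"}, level=1)

print(table_4.columns.tolist())
```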


## Concluding Remarks

In the process of data wrangling, several changes were made to enhance the clarity and usability of the datasets from the Wikipedia page. The 5 changes were:

1. The header "Pregnancy Category" was updated to "FDA Pregnancy Risk Category" to better reflect its content and regulatory context.
2. Two new columns, "Risk" and "Detailed Description", were created by splitting the "Description" column, allowing for a more granular analysis of the data.
3. Unnecessary columns, such as the Australian classification column, were removed.
4. The missing US values for "nicotine" and "caffeine" were replaced with the FDA risk categories reported on drugs.com.
5. The column name "Pharmaceutical agent" was changed to "Drug" for conciseness and clarity when merging the 3 datasets for the project.

These transformations raise ethical implications, particularly regarding the accuracy and integrity of the data from the Wikipedia page. Legal and regulatory guidelines, such as those established by the FDA for drug safety, must be considered, as misrepresenting the data could have serious consequences. Although the data wrangling process did not include major transformations, the associated risks include potential misinterpretation of the data and a loss of context that may lead to incorrect conclusions. Assumptions made during the cleaning process, such as the consistency of drug classifications, could also affect the reliability of the findings. It is important to point out, however, that the replaced values were sourced from credible sites that adhere to ethical standards for data collection.

## Download Final tables

In [53]:
# Save table_0 to a CSV file
table_0.to_csv('table_0_final.csv', index=False)

# Save table_4 to a CSV file
table_4.to_csv('table_4_final.csv', index=False)

print("Tables have been saved as CSV files.")

Tables have been saved as CSV files.
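
With a two-level header, `to_csv` writes two header rows, which can be awkward to read back. The sketch below is one optional way to check an export: it keeps only the inner header level, writes to an in-memory buffer instead of a file, and reads the text back to confirm the values survive the round trip. The stand-in rows are illustrative.

```python
import io

import pandas as pd

top = "Classification of some agents, based on different national bodies"

# Stand-in for the final table_4
table_4 = pd.DataFrame(
    [["Caffeine", "C"], ["Alcohol", "X"]],
    columns=pd.MultiIndex.from_product([[top], ["Drug", "United States"]]),
)

# Keep only the inner header level so the CSV has a single header row
export = table_4.droplevel(0, axis=1)

# Write to an in-memory buffer instead of disk for this demonstration
buffer = io.StringIO()
export.to_csv(buffer, index=False)

# Read the CSV text back and compare it with what was written
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```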
