<h1><span style="color: lightblue;">Performing Data Wrangling ⚡️</span></h1>

## 🌟 Step 1
<h2><span style="color: yellow;">Data Gathering 🌎 </span></h2> 

Data can be gathered from various sources:<br>
- `CSV files` <br>
- `APIs` <br>
- `Web Scraping` <br>
- `Databases` <br>

In [106]:
import re
import pandas as pd
import numpy as np

In [109]:
patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')
treatments_cut = pd.read_csv('treatments_cut.csv')

- The data for this notebook is loaded from four CSV files.
- This practice dataset comes from CampusX: https://github.com/campusx-official/data-wrangling/blob/master/Data%20Wrangling.ipynb

## 🌟 Step 2
<h2><span style="color: yellow;">Data Assessment 🌎 </span></h2>  
- In this step, you build a deeper understanding of the data. Before choosing any cleaning method, you need a clear idea of what the data contains.<br>
- The result is essentially a complete summary of the data.<br>
- Data assessment is often an iterative process.<br>
    
<h2><span style="color: red;"> 🛑 Step 1 Discover </span></h2>  
- View datasets.<br>
- Check shape of the data.<br>
    
### 🔥 Automatic Assessments 
- Programmatic.<br>
- Using Pandas.<br>
  - `head and tail`
  - `describe`
  - `sample`
  - `info`
  - `isnull`
  - `duplicated`
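
As a minimal sketch of the programmatic checks listed above, the snippet below runs each of them on a small table. The `toy` DataFrame and its values are invented for illustration; they are not rows from the trial files.

```python
import pandas as pd

# Hypothetical toy table standing in for one of the trial CSVs.
toy = pd.DataFrame({
    "given_name": ["zoe", "tim", "tim", "jae"],
    "surname": ["wellish", "neudorf", "neudorf", "debord"],
    "hba1c_change": [0.34, 0.92, 0.92, None],
})

print(toy.head(2))                      # first rows
print(toy.tail(2))                      # last rows
print(toy.sample(2, random_state=0))    # random rows (seeded for repeatability)
toy.info()                              # dtypes and non-null counts
print(toy.describe())                   # summary statistics for numeric columns

null_counts = toy.isnull().sum()         # missing values per column
duplicate_rows = toy.duplicated().sum()  # rows that repeat an earlier row exactly
print(null_counts)
print(duplicate_rows)
```

Running the same calls on each real table gives the non-null counts and duplicate counts that feed the documentation step.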

In [32]:
patients

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7


In [7]:
treatments.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,veronika,jindrová,41u - 48u,-,7.63,7.2,
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.97
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32


In [8]:
adverse_reactions.head()

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation


In [9]:
treatments_cut.head()

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change
0,jožka,resanovič,22u - 30u,-,7.56,7.22,0.34
1,inunnguaq,heilmann,57u - 67u,-,7.85,7.45,
2,alwin,svensson,36u - 39u,-,7.78,7.34,
3,thể,lương,-,61u - 64u,7.64,7.22,0.92
4,amanda,ribeiro,36u - 44u,-,7.85,7.47,0.38


In [10]:
treatments_cut.shape

(70, 7)

In [16]:
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   given_name        34 non-null     object
 1   surname           34 non-null     object
 2   adverse_reaction  34 non-null     object
dtypes: object(3)
memory usage: 948.0+ bytes


In [17]:
treatments_cut.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    70 non-null     object 
 1   surname       70 non-null     object 
 2   auralin       70 non-null     object 
 3   novodra       70 non-null     object 
 4   hba1c_start   70 non-null     float64
 5   hba1c_end     70 non-null     float64
 6   hba1c_change  42 non-null     float64
dtypes: float64(3), object(4)
memory usage: 4.0+ KB


In [18]:
treatments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    280 non-null    object 
 1   surname       280 non-null    object 
 2   auralin       280 non-null    object 
 3   novodra       280 non-null    object 
 4   hba1c_start   280 non-null    float64
 5   hba1c_end     280 non-null    float64
 6   hba1c_change  171 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.4+ KB


In [19]:
treatments.describe()

Unnamed: 0,hba1c_start,hba1c_end,hba1c_change
count,280.0,280.0,171.0
mean,7.985929,7.589286,0.546023
std,0.568638,0.569672,0.279555
min,7.5,7.01,0.2
25%,7.66,7.27,0.34
50%,7.8,7.42,0.38
75%,7.97,7.57,0.92
max,9.95,9.58,0.99


In [16]:
patients.describe()

Unnamed: 0,patient_id,zip_code,weight,height,bmi
count,503.0,491.0,503.0,503.0,503.0
mean,252.0,49084.118126,173.43499,66.634195,27.483897
std,145.347859,30265.807442,33.916741,4.411297,5.276438
min,1.0,1002.0,48.8,27.0,17.1
25%,126.5,21920.5,149.3,63.0,23.3
50%,252.0,48057.0,175.3,67.0,27.2
75%,377.5,75679.0,199.5,70.0,31.75
max,503.0,99701.0,255.9,79.0,37.7


In [31]:
adverse_reactions[adverse_reactions.duplicated()].count()

given_name          0
surname             0
adverse_reaction    0
dtype: int64
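
Filtering with `duplicated()` and then calling `.count()` reports non-null counts per column. A shorter check, sketched below on invented rows, sums the boolean mask to get a single number, and uses `subset=` with `keep=False` to list every row that shares a patient name. The `toy_treatments` table is hypothetical.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first one.
toy_treatments = pd.DataFrame({
    "given_name": ["joseph", "lena", "joseph"],
    "surname": ["day", "baer", "day"],
    "hba1c_start": [7.70, 7.56, 7.70],
})

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(toy_treatments.duplicated().sum())

# Every row whose (given_name, surname) pair appears more than once.
repeated_patients = toy_treatments[
    toy_treatments.duplicated(subset=["given_name", "surname"], keep=False)
]

print(n_duplicates)
print(repeated_patients)
```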

### 🔥 Manual Assessments 
- Export the data to Excel or Google Sheets and look through it manually; this can take hours.

In [None]:
with pd.ExcelWriter('clinical_trials.xlsx') as writer:
  patients.to_excel(writer,sheet_name='patients')
  treatments.to_excel(writer,sheet_name='treatments')
  treatments_cut.to_excel(writer,sheet_name='treatment_cut')
  adverse_reactions.to_excel(writer,sheet_name='adverse_reactions')

<h2><span style="color: red;"> 🛑 Step 2 Document </span></h2> 
- Summary<br>
- Document the issues found in the dataset and combine them into a single write-up.<br>

### 1.🟨 Write a summary for your data 
- This is a dataset about 503 patients storing their records and also a treatment/treatment_cuts of which 280 + 70 = 350 patients participated in a clinical trial. 
- None of the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being researched) as their primary source of insulin before.
- All were experiencing elevated HbA1c levels(Sugar level in desi language).
- All 350 patients were treated with Novodra and Auralin to record the difference in change in start and end of the trails.
- Data about patients feeling some adverse effects after the trails were also recorded.

### 2.🟨 Column Description 
- **<span style="color: red;">💥 Table</span>** - `patients`<br>
1. **patient_id**: Unique identifier for each patient.
2. **assigned_sex**: Sex assigned to the patient at birth (e.g., male, female, other).
3. **given_name**: First name of the patient.
4. **surname**: Last name (family name) of the patient.
5. **address**: Street address where the patient resides.
6. **city**: City where the patient resides.
7. **state**: State or province where the patient resides.
8. **zip_code**: Postal code for the patient's address.
9. **country**: Country where the patient resides.
10. **contact**: Contact information for the patient, typically a phone number or email address.
11. **birthdate**: Date of birth of the patient.
12. **weight**: Weight of the patient, recorded in pounds in the raw data (converted to kilograms during cleaning).
13. **height**: Height of the patient; the values (e.g., 66, 71) appear to be in inches.
14. **bmi**: Body Mass Index (BMI) of the patient, a value derived from the weight and height.

- **<span style="color: red;">💥 Table</span>** - `treatments`<br>
1. **given_name**: First name of the patient.
2. **surname**: Last name (family name) of the patient.
3. **auralin**:  Dosage of the medication Auralin prescribed to the patient, measured in units (e.g., 22u - 30u).
4. **novodra**:  Dosage of the medication Novodra prescribed to the patient, measured in units (e.g., 22u - 30u).
5. **hba1c_start**: Initial HbA1c level (percentage) at the beginning of the monitoring period.
6. **hba1c_end**: Final HbA1c level (percentage) at the end of the monitoring period.
7. **hba1c_change**: Change in HbA1c level (percentage) from the start to the end of the monitoring period.

- **<span style="color: red;">💥 Table</span>** - `treatments_cut`<br>
1. **given_name**: First name of the patient.
2. **surname**: Last name (family name) of the patient.
3. **auralin**:  Dosage of the medication Auralin prescribed to the patient, measured in units (e.g., 22u - 30u).
4. **novodra**:  Dosage of the medication Novodra prescribed to the patient, measured in units (e.g., 22u - 30u).
5. **hba1c_start**: Initial HbA1c level (percentage) at the beginning of the monitoring period.
6. **hba1c_end**: Final HbA1c level (percentage) at the end of the monitoring period.
7. **hba1c_change**: Change in HbA1c level (percentage) from the start to the end of the monitoring period.

- **<span style="color: red;">💥 Table</span>** - `adverse_reactions`<br>
1. **given_name**: First name of the patient.
2. **surname**: Last name (family name) of the patient.
3. **adverse_reaction**: Description of any negative or unintended reactions experienced by the patient, typically as a result of medication or treatment.

### 3.🟨 Add any additional information
- Insulin resistance varies from person to person, which is why both the starting and the ending median daily dose are recorded, i.e., to calculate the change in dose.
- It is important to test drugs and medical products in the people they are meant to help. 
- People of different ages, races, sexes, and ethnic groups must be included in clinical trials; this diversity is reflected in the `patients` table.

### 4.🟨 Issues with the dataset 🛑
#### ⭐️ Step 1 Dirty Data
- Duplicated data 
- Missing Data 
- Corrupt Data
- Inaccurate Data

## Documented Description 🌫️
- **<span style="color: red;">💥 Table</span>** - `patients`<br>
            1. Has no duplicate rows or columns.<br>
            2. Has 12 missing or null values in the following columns.<br>
                - `address`<br>
                - `city`<br>
                - `state`<br>
                - `zip_code`<br>
                - `country`<br>
                - `contact`<br>
            3. The following columns contain corrupted data.<br>
                - `zip_code` : Some zip codes have only 4 digits (e.g., 7095 for a New Jersey address).<br>
                - `contact` : Phone number and email address are combined in one field (e.g., format: 951-719-9170email@gmail.com).<br>
                - `state` : Some rows use the abbreviation instead of the full name (e.g., NY for New York).<br>
            4. Inaccurate data:<br>
                - `zip_code` : Stored with the wrong datatype (a float instead of a string).<br>
                - `weight` : Some weights were entered inaccurately and show up as outliers.<br>
                - `height` : Some heights were entered inaccurately and show up as outliers.<br>
            
- **<span style="color: red;">💥 Table</span>** - `treatments`<br>
            1. Has 1 duplicate row.<br>
            2. Has 109 missing or null values in the following column.<br>
                - `hba1c_change`<br>
            3. The following columns contain corrupted data.<br>
                - `auralin` : A dash ( '-' ) is entered where the patient does not use Auralin.<br>
                - `novodra` : A dash ( '-' ) is entered where the patient does not use Novodra.<br>
            4. Inaccurate data:<br>
                - `auralin` : The dose is stored as text with the unit `u` attached.<br>
                - `novodra` : The dose is stored as text with the unit `u` attached.<br>

- **<span style="color: red;">💥 Table</span>** - `treatments_cut`<br>
            1. Has 0 duplicate rows.<br>
            2. Has 28 missing or null values in the following column.<br>
                - `hba1c_change`<br>
            3. The following columns contain corrupted data.<br>
                - `auralin` : A dash ( '-' ) is entered where the patient does not use Auralin.<br>
                - `novodra` : A dash ( '-' ) is entered where the patient does not use Novodra.<br>
            4. Inaccurate data:<br>
                - `auralin` : The dose is stored as text with the unit `u` attached.<br>
                - `novodra` : The dose is stored as text with the unit `u` attached.<br>

- **<span style="color: red;">💥 Table</span>** - `adverse_reactions`<br>
            1. Has 0 duplicate rows.<br>
            2. Has 0 missing or null values.<br>
            3. No corrupted data.<br>
            4. No inaccurate data.<br>
   

#### ⭐️ Step 2 Messy Data  
   - Messy data has structural issues. In tidy data, each variable forms a single column.
   - Each observation forms a row.
   - Each observational unit forms a table.
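
To make the "each variable forms a single column" rule concrete, here is a minimal sketch that reshapes two treatment columns into one `treatment` column and one `dose` column with `DataFrame.melt`. The two rows reuse values shown in `treatments.head()`; the output column names `treatment` and `dose` are assumptions chosen for this example, not necessarily the names used later in the notebook.

```python
import pandas as pd

# Two rows in the wide layout: one column per drug, '-' when the drug was not used.
wide = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
})

# Stack the two drug columns into (treatment, dose) pairs.
tidy = wide.melt(
    id_vars=["given_name", "surname"],
    value_vars=["auralin", "novodra"],
    var_name="treatment",
    value_name="dose",
)

# Keep only the drug each patient actually received.
tidy = tidy[tidy["dose"] != "-"].reset_index(drop=True)
print(tidy)
```

After the reshape, each row is one patient with one treatment and one dose string, which is easier to filter and group than the two-column layout.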

## Documented Description 🌫️
- **<span style="color: red;">💥 Table</span>** - `patients`<br>
            1.  Structural issues:<br>
                - `contact` : Phone number and email address are combined in one column (e.g., format: 951-719-9170email@gmail.com). <br>
                - `address`, `city`, `state`, `country` could be moved to a separate table.<br>

- **<span style="color: red;">💥 Table</span>** - `treatments`<br>
            1.  Structural issues:<br>
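                - `auralin` and `novodra` are two columns that together hold one variable (the treatment) and its dose; they should be combined into a treatment column and a dose column.<br>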
       
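- **<span style="color: red;">💥 Table</span>** - `treatments_cut`<br>
            1.  Structural issues:<br>
                - `auralin` and `novodra` are two columns that together hold one variable (the treatment) and its dose; they should be combined into a treatment column and a dose column.<br>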

- **<span style="color: red;">💥 Table</span>** - `adverse_reactions`<br>
            1.  Already well structured; no issues found.<br>

### 5.🟨 Providing solutions of issues within the dataset 🛑
- **<span style="color: red;">💥 Table</span>** - `patients`<br>
            1. **No Duplicate Rows and Columns:**<br>
                - Ensure the table has no duplicate rows and columns by using the Pandas functions `drop_duplicates()` for rows and checking column names for duplicates.<br>
            2. **Missing or Null Values:**<br>
                - Columns with missing values: `address`, `city`, `state`, `zip_code`, `country`, `contact`<br>
                - Handle missing values using methods like `fillna()` or `dropna()` based on the context.<br>
            3. **Corrupted Data:**<br>
                - **`zip_code`**: Ensure all zip codes are 5 digits long by applying a transformation to add leading zeros if necessary.<br>
                - **`contact`**: Split combined contact information into separate `phone_number` and `email` columns using string manipulation functions.<br>
                - **`state`**: Standardize state names by converting abbreviations to full state names using a mapping dictionary.<br>
            4. **Inaccurate Data:**<br>
                - **`zip_code`**: Correct the datatype by converting the column to string type using `astype(str)`.<br>
                - **`weight`**: Identify and handle outliers by using statistical methods such as Z-scores or IQR to filter out anomalous values.<br>
                - **`height`**: Similarly, identify and handle outliers using statistical methods to ensure accurate entries.<br>
 
- **<span style="color: red;">💥 Table</span>** - `treatments`<br>
            1. **Duplicate Row:**<br>
                - Remove duplicate rows using `drop_duplicates()`.<br>
            2. **Missing or Null Values:**<br>
                - Column with missing values: `hba1c_change`<br>
                - Handle missing values by recomputing them as `hba1c_start - hba1c_end`; filling with a statistic (mean, median) or removing rows is a fallback for values that cannot be derived.<br>
            3. **Corrupted Data:**<br>
                - **`auralin`**: Replace ( - ) with `NaN` or `0` to indicate no usage.<br>
                - **`novodra`**: Replace ( - ) with `NaN` or `0` to indicate no usage.<br>
            4. **Inaccurate Data:**<br>
                - **`auralin`**: Remove the unit `u` from dosage values and ensure the column is numeric using string replacement and conversion.<br>
                - **`novodra`**: Similarly, remove the unit `u` from dosage values and ensure the column is numeric.<br>

- **<span style="color: red;">💥 Table</span>** - `treatments_cut`<br>
            1. **No Duplicate Rows:**<br>
               - Ensure the table has no duplicate rows.<br>
            2. **Missing or Null Values:**<br>
               - Column with missing values: `hba1c_change`<br>
               - Handle missing values by recomputing them as `hba1c_start - hba1c_end`, or with `fillna()` / `dropna()` where a value cannot be derived.<br>
            3. **Corrupted Data:**<br>
                - **`auralin`**: Replace ( - ) with `NaN` or `0` to indicate no usage.<br>
                - **`novodra`**: Replace ( - ) with `NaN` or `0` to indicate no usage.<br>
            4. **Inaccurate Data:**<br>
                - **`auralin`**: Remove the unit `u` from dosage values and ensure the column is numeric using string replacement and conversion.<br>
                - **`novodra`**: Similarly, remove the unit `u` from dosage values and ensure the column is numeric.<br>

- **<span style="color: red;">💥 Table</span>** - `adverse_reactions`<br>
            1. **No Duplicate Rows:**<br>
                - Ensure the table has no duplicate rows.<br>
            2. **No Missing or Null Values:**<br>
                - Verify that there are no missing or null values.<br>
            3. **No Corrupted Data:**<br>
                - Confirm that all data entries are correct and uncorrupted.<br>
            4. **No Inaccurate Data:**<br>
                - Verify the accuracy of all data entries.<br>
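
The `patients` fixes described above can be sketched on a few values taken from the rows displayed earlier. This is an illustration under assumptions, not the notebook's final cleaning code: `STATE_NAMES` is a deliberately partial, hypothetical mapping, and the 1.5 × IQR rule is one common outlier convention among several.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["California", "NJ", "AL"],
    "zip_code": [92390.0, 7095.0, None],
})

# zip_code: float -> 5-character string; missing values stay missing.
toy["zip_code"] = (
    toy["zip_code"]
    .astype("Int64")     # nullable integer drops the trailing ".0"
    .astype("string")
    .str.zfill(5)        # restore the lost leading zero
)

# state: expand abbreviations with a (partial, illustrative) mapping.
STATE_NAMES = {"NJ": "New Jersey", "AL": "Alabama"}
toy["state"] = toy["state"].replace(STATE_NAMES)

# height: flag values outside 1.5 * IQR of the quartiles.
heights = pd.Series([66, 66, 71, 70, 27, 72, 70, 67, 67])
q1, q3 = heights.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = heights[~heights.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(toy)
print(outliers.tolist())
```

Flagging outliers rather than dropping them keeps the decision visible; a height of 27 next to values in the 60s and 70s looks like a data-entry error worth checking before removal.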

## 🌟 Step 3
<h2><span style="color: yellow;">Data Cleaning or Data Quality Dimensions✨</span></h2> 

- Follow the steps below in the same order.<br>
- Each step consists of three parts:<br>
   `Define`, `Code`, `Test` <br>
    

In [110]:
patients_df = patients.copy()
treatments_df = treatments.copy()
treatments_cut_df = treatments_cut.copy()
adverse_reactions_df = adverse_reactions.copy()

`Always make sure to create a copy of your pandas dataframe before you start the cleaning process`

<h2><span style="color: red;"> 🛑 Step 1 Completeness </span></h2>  

## 1.🟨 Define 
- Replace all missing values in the patients DataFrame with `'No data'`.<br>
- Subtract `hba1c_end` from `hba1c_start` to fill in every `hba1c_change` value.<br>
- In the patients table, use regex to separate email and phone.<br>

## 2.🟨 Code

In [113]:
patients_df.fillna('No data',inplace=True) 

## 3.🟨 Test

In [114]:
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   patient_id    503 non-null    int64  
 1   assigned_sex  503 non-null    object 
 2   given_name    503 non-null    object 
 3   surname       503 non-null    object 
 4   address       503 non-null    object 
 5   city          503 non-null    object 
 6   state         503 non-null    object 
 7   zip_code      503 non-null    object 
 8   country       503 non-null    object 
 9   contact       503 non-null    object 
 10  birthdate     503 non-null    object 
 11  weight        503 non-null    float64
 12  height        503 non-null    int64  
 13  bmi           503 non-null    float64
dtypes: float64(2), int64(2), object(10)
memory usage: 55.1+ KB
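
Note that `zip_code` now has dtype `object` in the output above, because the placeholder string was written into a numeric column. A possible alternative, sketched here on invented rows, fills only the text columns so numeric columns keep their dtype. The `text_columns` list is an assumption for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["York", None],
    "zip_code": [68467.0, None],
})

# Fill only the columns where a text placeholder makes sense.
text_columns = ["city"]
filled = toy.copy()
filled[text_columns] = filled[text_columns].fillna("No data")

print(filled)
print(filled.dtypes)
```

With this variant the missing zip code stays as `NaN`, so later numeric or string conversions on that column do not have to special-case the placeholder text.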


- Apply the same Define, Code, Test steps to the remaining tables.

In [115]:
treatments_df['hba1c_change'] = treatments_df['hba1c_start'] - treatments_df['hba1c_end'] 
treatments_cut_df['hba1c_change'] = treatments_cut_df['hba1c_start'] - treatments_cut_df['hba1c_end'] 

In [116]:
treatments_cut_df['hba1c_change'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 70 entries, 0 to 69
Series name: hba1c_change
Non-Null Count  Dtype  
--------------  -----  
70 non-null     float64
dtypes: float64(1)
memory usage: 692.0 bytes
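
Recomputing the column also overwrites the values that were already present. Before doing that, it can be useful to compare the recorded values with the recomputed ones, as in this sketch built from the five rows shown in `treatments.head()`.

```python
import numpy as np
import pandas as pd

# Values copied from the first five rows of treatments.head().
sample = pd.DataFrame({
    "hba1c_start": [7.63, 7.56, 7.68, 7.97, 7.78],
    "hba1c_end": [7.20, 7.09, 7.25, 7.62, 7.46],
    "hba1c_change": [np.nan, 0.97, np.nan, 0.35, 0.32],
})

recomputed = (sample["hba1c_start"] - sample["hba1c_end"]).round(2)
recorded = sample["hba1c_change"]

# A mismatch is a recorded (non-missing) value that differs from start - end.
mismatch = recorded.notna() & ~np.isclose(recorded, recomputed)

print(sample.assign(recomputed=recomputed, mismatch=mismatch))
```

In these rows the recorded 0.97 disagrees with the recomputed 0.47, which suggests that some recorded changes were wrong and not only missing.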


<h2><span style="color: red;"> 🛑 Step 2 Tidiness and Validity </span></h2>  

## 1.🟨 Define

In [117]:
patients_df

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7


## 2.🟨 Code

In [123]:
def find_contact_details(text: str) -> tuple:
    # if the value is NaN, return a (phone, email) pair of NaN so callers can still index it
    if pd.isna(text):
        return np.nan, np.nan

    # create the phone number pattern: optional country code, then the number itself
    phone_number_pattern = re.compile(r"(\+[\d]{1,3}\s)?(\(?[\d]{3}\)?\s?-?[\d]{3}\s?-?[\d]{4})")
    # the pattern has two capturing groups, so re.findall returns a list with one
    # (country_code, number) tuple per match
    phone_number = re.findall(phone_number_pattern, text)

    # if the list is empty, the regex found no phone number, so use NaN
    if len(phone_number) <= 0:
        phone_number = np.nan
    # if the text contains two or more matches, take the second one
    elif len(phone_number) >= 2:
        phone_number = phone_number[1]
    # otherwise take the only match; note that it is still a
    # (country_code, number) tuple, not a plain string
    else:
        phone_number = phone_number[0]

    # remove the matched phone number (with or without country code) from the text;
    # what remains is most likely the email address
    possible_email_add = re.sub(phone_number_pattern, "", text).strip()

    # return the phone number match and the email address
    return phone_number, possible_email_add

In [124]:
# parse each contact string once, then split the (phone, email) pairs into two columns
contact_details = patients_df["contact"].apply(find_contact_details)
patients_df['phone'] = contact_details.apply(lambda x: x[0])
patients_df['email'] = contact_details.apply(lambda x: x[1])

## 3.🟨 Test

In [125]:
patients_df

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,contact,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,951-719-9170ZoeWellish@superrito.com,7/10/1976,121.7,66,19.6,"(, 951-719-9170)",ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,PamelaSHill@cuvox.de+1 (217) 569-3204,4/3/1967,118.8,66,19.2,"(+1 , (217) 569-3204)",PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,402-363-6804JaeMDebord@gustr.com,2/19/1980,177.8,71,24.8,"(, 402-363-6804)",JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,PhanBaLiem@jourrapide.com+1 (732) 636-8246,7/26/1951,220.9,70,31.7,"(+1 , (732) 636-8246)",PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,334-515-7487TimNeudorf@cuvox.de,2/18/1928,192.3,27,26.1,"(, 334-515-7487)",TimNeudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,207-477-0579MustafaLindstrom@jourrapide.com,4/10/1959,181.1,72,24.6,"(, 207-477-0579)",MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,928-284-4492RumanBisliev@gustr.com,3/26/1948,239.6,70,34.4,"(, 928-284-4492)",RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,816-223-6007JinkedeKeizer@teleworm.us,1/13/1971,171.2,67,26.8,"(, 816-223-6007)",JinkedeKeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,ChidaluOnyekaozulu@jourrapide.com1 360 443 2060,2/13/1952,176.9,67,27.7,"(, 360 443 2060)",ChidaluOnyekaozulu@jourrapide.com1
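
The `phone` column above holds tuples such as `(+1 , (217) 569-3204)`. This follows from how `re.findall` behaves when a pattern has more than one capturing group: it returns one tuple per match with one item per group. The sketch below reproduces that with the same pattern and shows one possible way to get a plain string, using `search` and `group(2)`.

```python
import re

pattern = re.compile(r"(\+[\d]{1,3}\s)?(\(?[\d]{3}\)?\s?-?[\d]{3}\s?-?[\d]{4})")
text = "PamelaSHill@cuvox.de+1 (217) 569-3204"

# Two capturing groups -> findall yields (country_code, number) tuples.
matches = pattern.findall(text)
print(matches)   # [('+1 ', '(217) 569-3204')]

# search() gives a match object; group(2) is the number without the country code.
match = pattern.search(text)
number = match.group(2) if match else None
print(number)    # (217) 569-3204
```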


- Apply the same approach to the rest of the data.

In [131]:
def extract_phone_number(data):
    # Define regex pattern for phone number
    phone_pattern = r'\+?\d[\d\s()-]{8,}\d'
    
    # Find phone number in the data
    phone = re.search(phone_pattern, data)
    
    # Extract and clean the result
    phone = phone.group(0).replace(' ', '').replace('(', '').replace(')', '') if phone else None
    
    return phone

In [132]:
patients_df['phone'] = patients_df["contact"].apply(lambda x: extract_phone_number(x))
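
The function above removes spaces and parentheses but keeps hyphens and the country code, so the results come in mixed formats (for example `951-719-9170`, `+1217569-3204`, `13604432060`). A possible follow-up, assuming all numbers are US numbers, is to keep digits only and drop a leading `1`. `normalize_us_phone` is a hypothetical helper, not part of the original notebook.

```python
import re

def normalize_us_phone(raw):
    """Return a 10-digit string, or None if the input does not look like a US number."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop the US country code
    return digits if len(digits) == 10 else None

examples = ["951-719-9170", "+1217569-3204", "13604432060"]
normalized = [normalize_us_phone(p) for p in examples]
print(normalized)   # ['9517199170', '2175693204', '3604432060']
```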

In [None]:
patients_df.drop(columns='contact',inplace=True)

In [136]:
patients_df

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,7/10/1976,121.7,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,4/3/1967,118.8,66,19.2,+1217569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,2/19/1980,177.8,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,7/26/1951,220.9,70,31.7,+1732636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,2/18/1928,192.3,27,26.1,334-515-7487,TimNeudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,4/10/1959,181.1,72,24.6,207-477-0579,MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,3/26/1948,239.6,70,34.4,928-284-4492,RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,1/13/1971,171.2,67,26.8,816-223-6007,JinkedeKeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,2/13/1952,176.9,67,27.7,13604432060,ChidaluOnyekaozulu@jourrapide.com1


In [137]:
treatments_df = pd.concat([treatments_df,treatments_cut_df])

In [139]:
treatments_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 350 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   given_name    350 non-null    object 
 1   surname       350 non-null    object 
 2   auralin       350 non-null    object 
 3   novodra       350 non-null    object 
 4   hba1c_start   350 non-null    float64
 5   hba1c_end     350 non-null    float64
 6   hba1c_change  350 non-null    float64
dtypes: float64(3), object(4)
memory usage: 21.9+ KB
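
The index line above reads `350 entries, 0 to 69`: `pd.concat` keeps each frame's original row labels, so labels 0 to 69 now appear twice. If unique labels are wanted, `ignore_index=True` renumbers the rows. A small sketch with invented two-row and one-row frames:

```python
import pandas as pd

first = pd.DataFrame({"surname": ["jindrová", "richardson"]})
second = pd.DataFrame({"surname": ["resanovič"]})

stacked = pd.concat([first, second])                         # labels 0, 1, 0
renumbered = pd.concat([first, second], ignore_index=True)   # labels 0, 1, 2

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Duplicate labels matter for label-based operations such as `.loc[0]`, which would return more than one row on the stacked frame.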


In [142]:
def lb_to_kg(pounds):
    return pounds * 0.453592

In [143]:
patients_df['weight'] = patients_df["weight"].apply(lambda x: lb_to_kg(x))

In [144]:
patients_df

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,7/10/1976,55.202146,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,4/3/1967,53.886730,66,19.2,+1217569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,2/19/1980,80.648658,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,7/26/1951,100.198473,70,31.7,+1732636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,2/18/1928,87.225742,27,26.1,334-515-7487,TimNeudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,4/10/1959,82.145511,72,24.6,207-477-0579,MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,3/26/1948,108.680643,70,34.4,928-284-4492,RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,1/13/1971,77.654950,67,26.8,816-223-6007,JinkedeKeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,2/13/1952,80.240425,67,27.7,13604432060,ChidaluOnyekaozulu@jourrapide.com1
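
The conversion above goes through `apply`, which calls the function once per row. As a hedged alternative, the sketch below multiplies the whole column at once. The sample weights are the first three pound values shown earlier in the notebook, and the constant name `LB_TO_KG` is an assumption introduced for illustration.

```python
import pandas as pd

LB_TO_KG = 0.453592  # same factor as in lb_to_kg

# Sample weights in pounds (first three patients).
weights_lb = pd.Series([121.7, 118.8, 177.8], name="weight")

# Row-by-row conversion, as in the cell above.
via_apply = weights_lb.apply(lambda x: x * LB_TO_KG)

# Vectorized conversion: one multiplication over the whole column.
vectorized = weights_lb * LB_TO_KG

print(vectorized.round(6).tolist())
```

Both forms give the same values; the vectorized one is usually shorter and faster on large frames.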


In [165]:
# A '-' in the auralin column means the patient received novodra.
treatments_df['insulin'] = treatments_df['auralin'].apply(lambda x: 'novodra' if x == '-' else 'auralin')

In [166]:
treatments_df

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change,insulin
0,veronika,jindrová,41u - 48u,-,7.63,7.20,0.43,auralin
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47,novodra
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43,novodra
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35,auralin
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32,novodra
...,...,...,...,...,...,...,...,...
65,rovzan,kishiev,32u - 37u,-,7.75,7.41,0.34,auralin
66,jakob,jakobsen,-,28u - 26u,7.96,7.51,0.45,novodra
67,bernd,schneider,48u - 56u,-,7.74,7.44,0.30,auralin
68,berta,napolitani,-,42u - 44u,7.68,7.21,0.47,novodra


In [168]:
# Start from the auralin dose; rows still marked '-' are filled from novodra in a later cell.
treatments_df['dose'] = treatments_df['auralin']

In [191]:
# Where dose is still '-', take the value from the novodra column instead.
treatments_df.loc[treatments_df['dose'] == "-", 'dose'] = treatments_df.loc[treatments_df['dose'] == "-", 'novodra']

In [192]:
treatments_df

Unnamed: 0,given_name,surname,auralin,novodra,hba1c_start,hba1c_end,hba1c_change,insulin,dose
0,veronika,jindrová,41u - 48u,-,7.63,7.20,0.43,auralin,41u - 48u
1,elliot,richardson,-,40u - 45u,7.56,7.09,0.47,novodra,40u - 45u
2,yukitaka,takenaka,-,39u - 36u,7.68,7.25,0.43,novodra,39u - 36u
3,skye,gormanston,33u - 36u,-,7.97,7.62,0.35,auralin,33u - 36u
4,alissa,montez,-,33u - 29u,7.78,7.46,0.32,novodra,33u - 29u
...,...,...,...,...,...,...,...,...,...
65,rovzan,kishiev,32u - 37u,-,7.75,7.41,0.34,auralin,32u - 37u
66,jakob,jakobsen,-,28u - 26u,7.96,7.51,0.45,novodra,28u - 26u
67,bernd,schneider,48u - 56u,-,7.74,7.44,0.30,auralin,48u - 56u
68,berta,napolitani,-,42u - 44u,7.68,7.21,0.47,novodra,42u - 44u


In [193]:
treatments_df.drop(columns='auralin',inplace=True)

In [194]:
treatments_df.drop(columns='novodra',inplace=True)

In [195]:
treatments_df

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,insulin,dose
0,veronika,jindrová,7.63,7.20,0.43,auralin,41u - 48u
1,elliot,richardson,7.56,7.09,0.47,novodra,40u - 45u
2,yukitaka,takenaka,7.68,7.25,0.43,novodra,39u - 36u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u
4,alissa,montez,7.78,7.46,0.32,novodra,33u - 29u
...,...,...,...,...,...,...,...
65,rovzan,kishiev,7.75,7.41,0.34,auralin,32u - 37u
66,jakob,jakobsen,7.96,7.51,0.45,novodra,28u - 26u
67,bernd,schneider,7.74,7.44,0.30,auralin,48u - 56u
68,berta,napolitani,7.68,7.21,0.47,novodra,42u - 44u
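
The cells from In [165] to In [194] build the `insulin` and `dose` columns by hand and then drop `auralin` and `novodra`. A similar result can be sketched in one pass with `DataFrame.melt`. The frame `wide` below is a hypothetical two-row sample in the same shape as `treatments_df`, not the real data, and this is one possible approach rather than the notebook's own method.

```python
import pandas as pd

# Hypothetical rows shaped like treatments_df before the reshaping.
wide = pd.DataFrame({
    "given_name": ["veronika", "elliot"],
    "surname": ["jindrová", "richardson"],
    "auralin": ["41u - 48u", "-"],
    "novodra": ["-", "40u - 45u"],
    "hba1c_start": [7.63, 7.56],
})

# Unpivot the two drug columns into (insulin, dose) pairs.
melted = wide.melt(
    id_vars=["given_name", "surname", "hba1c_start"],
    value_vars=["auralin", "novodra"],
    var_name="insulin",
    value_name="dose",
)

# Each patient now has one row per drug; keep only the drug actually given.
tidy = melted[melted["dose"] != "-"].reset_index(drop=True)
print(tidy)
```

Note that `melt` places the new columns at the end and orders rows by drug, so the row order differs from the step-by-step version.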


In [199]:
treatments_df = treatments_df.drop_duplicates()

In [200]:
treatments_df

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,insulin,dose
0,veronika,jindrová,7.63,7.20,0.43,auralin,41u - 48u
1,elliot,richardson,7.56,7.09,0.47,novodra,40u - 45u
2,yukitaka,takenaka,7.68,7.25,0.43,novodra,39u - 36u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u
4,alissa,montez,7.78,7.46,0.32,novodra,33u - 29u
...,...,...,...,...,...,...,...
65,rovzan,kishiev,7.75,7.41,0.34,auralin,32u - 37u
66,jakob,jakobsen,7.96,7.51,0.45,novodra,28u - 26u
67,bernd,schneider,7.74,7.44,0.30,auralin,48u - 56u
68,berta,napolitani,7.68,7.21,0.47,novodra,42u - 44u
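
The truncated display above looks the same before and after `drop_duplicates()`, so it does not show how many rows were removed. A small sketch, on a hypothetical three-row frame `df`, of counting the duplicates first and renumbering the rows afterwards:

```python
import pandas as pd

# Hypothetical frame with one fully repeated row.
df = pd.DataFrame({
    "given_name": ["joseph", "joseph", "berta"],
    "surname": ["day", "day", "napolitani"],
    "hba1c_start": [7.70, 7.70, 7.68],
})

# duplicated() flags rows that are identical to an earlier row.
n_dupes = int(df.duplicated().sum())

# Drop the flagged rows and renumber the index from 0.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(df), len(deduped))
```

Comparing `len()` before and after is a quick way to confirm that the step changed the data as expected.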


In [201]:
treatments_df.to_csv('treatments_cleaned_data.csv', index=False)
patients_df.to_csv('patients_cleaned_data.csv', index=False)
adverse_reactions_df.to_csv('adverse_reactions_data.csv', index=False)

In [202]:
patients = pd.read_csv('patients_cleaned_data.csv')
treatments = pd.read_csv('treatments_cleaned_data.csv')
adverse_reactions = pd.read_csv('adverse_reactions_data.csv')
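
Writing with `index=False` and reading back is why the reloaded `treatments` frame is numbered from 0 again instead of keeping the repeated labels seen earlier. The sketch below illustrates the difference on a small hypothetical frame, using in-memory buffers (`io.StringIO`) in place of files on disk.

```python
import io
import pandas as pd

# Hypothetical frame with non-default row labels.
df = pd.DataFrame(
    {"given_name": ["berta", "lena"],
     "adverse_reaction": ["injection site discomfort", "hypoglycemia"]},
    index=[68, 3],
)

# Write once with the index and once without it.
with_index = io.StringIO()
df.to_csv(with_index)
without_index = io.StringIO()
df.to_csv(without_index, index=False)

# Rewind the buffers and read them back.
with_index.seek(0)
without_index.seek(0)
a = pd.read_csv(with_index)
b = pd.read_csv(without_index)

print(a.columns.tolist())  # the old labels come back as an extra column
print(b.columns.tolist())  # only the data columns
print(b.index.tolist())    # fresh 0..n-1 index
```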

In [203]:
patients

Unnamed: 0,patient_id,assigned_sex,given_name,surname,address,city,state,zip_code,country,birthdate,weight,height,bmi,phone,email
0,1,female,Zoe,Wellish,576 Brown Bear Drive,Rancho California,California,92390.0,United States,7/10/1976,55.202146,66,19.6,951-719-9170,ZoeWellish@superrito.com
1,2,female,Pamela,Hill,2370 University Hill Road,Armstrong,Illinois,61812.0,United States,4/3/1967,53.886730,66,19.2,+1217569-3204,PamelaSHill@cuvox.de
2,3,male,Jae,Debord,1493 Poling Farm Road,York,Nebraska,68467.0,United States,2/19/1980,80.648658,71,24.8,402-363-6804,JaeMDebord@gustr.com
3,4,male,Liêm,Phan,2335 Webster Street,Woodbridge,NJ,7095.0,United States,7/26/1951,100.198473,70,31.7,+1732636-8246,PhanBaLiem@jourrapide.com
4,5,male,Tim,Neudorf,1428 Turkey Pen Lane,Dothan,AL,36303.0,United States,2/18/1928,87.225742,27,26.1,334-515-7487,TimNeudorf@cuvox.de
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,499,male,Mustafa,Lindström,2530 Victoria Court,Milton Mills,ME,3852.0,United States,4/10/1959,82.145511,72,24.6,207-477-0579,MustafaLindstrom@jourrapide.com
499,500,male,Ruman,Bisliev,494 Clarksburg Park Road,Sedona,AZ,86341.0,United States,3/26/1948,108.680643,70,34.4,928-284-4492,RumanBisliev@gustr.com
500,501,female,Jinke,de Keizer,649 Nutter Street,Overland Park,MO,64110.0,United States,1/13/1971,77.654950,67,26.8,816-223-6007,JinkedeKeizer@teleworm.us
501,502,female,Chidalu,Onyekaozulu,3652 Boone Crockett Lane,Seattle,WA,98109.0,United States,2/13/1952,80.240425,67,27.7,13604432060,ChidaluOnyekaozulu@jourrapide.com1


In [204]:
treatments

Unnamed: 0,given_name,surname,hba1c_start,hba1c_end,hba1c_change,insulin,dose
0,veronika,jindrová,7.63,7.20,0.43,auralin,41u - 48u
1,elliot,richardson,7.56,7.09,0.47,novodra,40u - 45u
2,yukitaka,takenaka,7.68,7.25,0.43,novodra,39u - 36u
3,skye,gormanston,7.97,7.62,0.35,auralin,33u - 36u
4,alissa,montez,7.78,7.46,0.32,novodra,33u - 29u
...,...,...,...,...,...,...,...
344,rovzan,kishiev,7.75,7.41,0.34,auralin,32u - 37u
345,jakob,jakobsen,7.96,7.51,0.45,novodra,28u - 26u
346,bernd,schneider,7.74,7.44,0.30,auralin,48u - 56u
347,berta,napolitani,7.68,7.21,0.47,novodra,42u - 44u


In [205]:
adverse_reactions

Unnamed: 0,given_name,surname,adverse_reaction
0,berta,napolitani,injection site discomfort
1,lena,baer,hypoglycemia
2,joseph,day,hypoglycemia
3,flavia,fiorentino,cough
4,manouck,wubbels,throat irritation
5,jasmine,sykes,hypoglycemia
6,louise,johnson,hypoglycemia
7,albinca,komavec,hypoglycemia
8,noe,aranda,hypoglycemia
9,sofia,hermansen,injection site discomfort
