## **Preprocess dataset**

After the EDA step, we preprocess the dataset as follows:
- Check for missing values
- Check for duplicated rows
- Remove 'Other' from the `gender` column
- Remove the features `smoking_history` and `clinical_notes`
- Check `object`-type columns for values that are written differently but mean the same thing (e.g. HCM City vs. Saigon City)
- Combine some values
- Add a BMI class to the dataset


### **Import necessary libraries**

In [1]:
import pandas as pd
import numpy as np
import os
import sys
sys.path.append(os.path.abspath(os.path.join('..')))

from utils.function import *
from utils.constants import *

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('../data/diabetes_dataset_with_notes.csv')

### **Check missing values**

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 17 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   year                  100000 non-null  int64  
 1   gender                100000 non-null  object 
 2   age                   100000 non-null  float64
 3   location              100000 non-null  object 
 4   race:AfricanAmerican  100000 non-null  int64  
 5   race:Asian            100000 non-null  int64  
 6   race:Caucasian        100000 non-null  int64  
 7   race:Hispanic         100000 non-null  int64  
 8   race:Other            100000 non-null  int64  
 9   hypertension          100000 non-null  int64  
 10  heart_disease         100000 non-null  int64  
 11  smoking_history       100000 non-null  object 
 12  bmi                   100000 non-null  float64
 13  hbA1c_level           100000 non-null  float64
 14  blood_glucose_level   100000 non-null  int64  
 15  d

Based on the information above, this dataset has no missing values.
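`df.info()` only reports non-null counts, so the conclusion has to be read off the table. A more direct check counts missing values per column. The sketch below uses a small toy frame (illustrative values, not the real dataset) to show the pattern:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "age": [32.0, np.nan, 18.0],
    "bmi": [27.32, 19.95, 23.76],
})

# Number of missing values in each column; 0 means the column is complete.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One boolean answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print(f"Any missing values: {has_missing}")
```

Applied to the notebook's `df`, `df.isna().sum()` should show 0 for every column if the reading of `df.info()` above is right.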

### **Check duplicated values**

In [4]:
duplicated = df.duplicated().sum()
print(f'Number of Duplicated values: {duplicated}')

Number of Duplicated values: 14


In [5]:
df.drop_duplicates(inplace=True)

In [6]:
df.reset_index(drop=True, inplace=True)

In [7]:
df.shape

(99986, 17)

In [8]:
duplicated = df.duplicated().sum()
print(f'Number of Duplicated values: {duplicated}')

Number of Duplicated values: 0
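Before dropping duplicates it can be useful to look at the affected rows, and the drop and index reset can be combined into one call. A minimal sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values only).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": [32.0, 41.0, 32.0, 18.0],
})

# keep=False flags every member of a duplicate group, not only the repeats,
# which makes it easy to inspect what is about to be removed.
dup_rows = toy[toy.duplicated(keep=False)]
print(dup_rows)

# Drop the repeats and rebuild a clean 0..n-1 index in a single step.
deduped = toy.drop_duplicates(ignore_index=True)
print(deduped.shape)
```

Returning a new frame instead of using `inplace=True` also makes it harder to end up modifying a view of another DataFrame by accident.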


### **Remove "Other" from the `gender` feature**

In [9]:
df = df[df["gender"] != "Other"]
df.shape

(99968, 17)

### **Remove features `smoking_history` and `clinical_notes`**

In [10]:
df = df.drop(columns=['smoking_history', 'clinical_notes']).reset_index(drop=True)
df.shape

(99968, 15)

### **Check some columns with `object` type**

We will check the `location` feature.

In [11]:
df["location"].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho', 'Illinois',
       'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'United States', 'Utah',
       'Vermont', 'Virgin Islands', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

The `location` values look consistent: no two categories are different spellings of the same place. However, the EDA showed that Virgin Islands, Wisconsin, and Wyoming have fewer entries than the other locations, so we combine them into a single location named "Others".

### **Combine some values**

In [12]:
mapping = {
    "Virgin Islands": "Others",
    "Wisconsin": "Others",
    "Wyoming": "Others",
}

df['location'] = df['location'].replace(mapping)

In [13]:
df['location'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho', 'Illinois',
       'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'United States', 'Utah',
       'Vermont', 'Others', 'Virginia', 'Washington', 'West Virginia'],
      dtype=object)
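The mapping above is hard-coded from what the EDA showed. An alternative is to derive the rare categories from their counts. The sketch below assumes a hypothetical threshold `MIN_COUNT` (not a value taken from this notebook) and uses a toy series in place of `df['location']`:

```python
import pandas as pd

# Toy location column (illustrative values only).
toy = pd.Series(
    ["Alabama"] * 5 + ["Texas"] * 4 + ["Wyoming"] + ["Wisconsin"],
    name="location",
)

MIN_COUNT = 3  # hypothetical threshold, chosen only for this example

# Categories that occur fewer than MIN_COUNT times.
counts = toy.value_counts()
rare = counts[counts < MIN_COUNT].index

# Keep frequent labels as they are; replace rare ones with "Others".
combined = toy.where(~toy.isin(rare), "Others")
print(combined.value_counts())
```

A threshold keeps the rule reproducible if the data changes, at the cost of having to justify the cut-off.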

### **Add BMI class to the dataset**

Convert BMI values into four categories (Underweight, Normal weight, Overweight, Obese).

In [14]:
adults = df[df['age'] >= 20] 
children = df[df['age'] < 20]

In [15]:
df.loc[:, 'age_scaled'] = (
    df['age'].apply(lambda x: years_to_years_months(x) if pd.notna(x) else pd.NA)
)
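`years_to_years_months` is imported from `utils.function` and its source is not shown here. Judging only from the `age_scaled` values in the `df.head()` output (for example `32-0`), it appears to turn a fractional age into a `years-months` string. The function below is a hypothetical stand-in written under that assumption; the real helper may round or format differently.

```python
def years_months_label(age_years: float) -> str:
    """Hypothetical stand-in: format a fractional age as 'years-months'."""
    # Work in whole months so that e.g. 1.5 years becomes 18 months.
    total_months = int(round(age_years * 12))
    years, months = divmod(total_months, 12)
    return f"{years}-{months}"


print(years_months_label(32.0))
print(years_months_label(1.5))
```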

In [16]:
df["bmi_class"] = df.apply(
    lambda row: classify_bmi_row(row, bmi_percentile), axis=1
)
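`classify_bmi_row` and `bmi_percentile` also come from the `utils` modules and are not shown. For adults, the commonly used BMI cut-offs are below 18.5 (underweight), 18.5 to under 25 (normal weight), 25 to under 30 (overweight), and 30 or above (obese). The sketch below applies only those adult cut-offs to a toy frame and leaves rows under age 20 unclassified; it assumes the notebook's helper handles younger ages separately via `bmi_percentile`, which is a guess, and it is not the helper's actual implementation.

```python
import pandas as pd

# Toy frame (illustrative values only); the last row is a child.
toy = pd.DataFrame({
    "age": [32.0, 29.0, 41.0, 60.0, 12.0],
    "bmi": [27.32, 19.95, 17.0, 31.5, 18.0],
})

# Adult cut-offs as left-closed intervals: [0, 18.5), [18.5, 25), [25, 30), [30, inf).
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal weight", "Overweight", "Obese"]

adult_class = pd.cut(toy["bmi"], bins=bins, labels=labels, right=False).astype("object")

# Keep the label only for adults (age >= 20); other rows stay missing.
toy["bmi_class"] = adult_class.where(toy["age"] >= 20)
print(toy)
```

`pd.cut` is vectorized, so on a frame of roughly 100,000 rows it is typically much faster than a row-wise `apply`.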


In [17]:
df.head()

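| | year | gender | age | location | race:AfricanAmerican | race:Asian | race:Caucasian | race:Hispanic | race:Other | hypertension | heart_disease | bmi | hbA1c_level | blood_glucose_level | diabetes | age_scaled | bmi_class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Female | 32.0 | Alabama | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 27.32 | 5.0 | 100 | 0 | 32-0 | Overweight |
| 1 | 2015 | Female | 29.0 | Alabama | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 19.95 | 5.0 | 90 | 0 | 29-0 | Normal weight |
| 2 | 2015 | Male | 18.0 | Alabama | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 23.76 | 4.8 | 160 | 0 | 18-0 | Normal weight |
| 3 | 2015 | Male | 41.0 | Alabama | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 27.32 | 4.0 | 159 | 0 | 41-0 | Overweight |
| 4 | 2016 | Female | 52.0 | Alabama | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 23.75 | 6.5 | 90 | 0 | 52-0 | Normal weight |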


In [18]:
df = df.drop(columns=["age_scaled"]).reset_index(drop=True)
df.shape

(99968, 16)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99968 entries, 0 to 99967
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   year                  99968 non-null  int64  
 1   gender                99968 non-null  object 
 2   age                   99968 non-null  float64
 3   location              99968 non-null  object 
 4   race:AfricanAmerican  99968 non-null  int64  
 5   race:Asian            99968 non-null  int64  
 6   race:Caucasian        99968 non-null  int64  
 7   race:Hispanic         99968 non-null  int64  
 8   race:Other            99968 non-null  int64  
 9   hypertension          99968 non-null  int64  
 10  heart_disease         99968 non-null  int64  
 11  bmi                   99968 non-null  float64
 12  hbA1c_level           99968 non-null  float64
 13  blood_glucose_level   99968 non-null  int64  
 14  diabetes              99968 non-null  int64  
 15  bmi_class          

### **Store the preprocessed data**

In [20]:
df.to_csv('../data/preprocessed_diabetes.csv', index=False)
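After writing the file, a quick round-trip check can confirm that it reads back with the expected shape and columns. The sketch below uses a toy frame and a temporary directory so that it does not touch the real `../data` folder:

```python
import os
import tempfile

import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({"gender": ["Female", "Male"], "bmi": [27.32, 19.95]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # index=False keeps the row index out of the file, so no
    # "Unnamed: 0" column appears when the file is read back.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```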