[Home](../../README.md)

### Data Wrangling

This is a demonstration of data wrangling using [Pandas](https://pandas.pydata.org/), a Python library for data analysis and manipulation.

This Jupyter Notebook demonstrates different processes you can apply to your data to prepare it for feature engineering and model training. For this demonstration, we will wrangle the diabetes data set you previewed in the last Jupyter Notebook.

> [!Note]
> None of these processes change the source CSV, as long as you save the modified data to a new CSV.

#### Load the required dependencies

In [29]:
# Import frameworks
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

#### Store the data as a local variable

A data frame is a Pandas object that holds tabular data in labelled rows and columns. The `read_csv()` call loads the complete data set into memory, so it is ready for preprocessing.

In [30]:
data_frame = pd.read_csv("2.1.2.Diabeties_Sample_Data.csv")

#### Dealing with null values

Null values during data analysis can cause runtime errors and unexpected results. It is important to identify null values and deal with them appropriately before training a model.

The `isnull().sum()` method call returns the number of null values in each column.

In [31]:
data_frame.isnull().sum()

DoB       0
DoT       0
SEX       1
BMI       0
BP        0
TC        0
BGU       0
FDR       0
Target    1
dtype: int64
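The per-column counts tell you how many values are missing, but not which rows they are in. As a minimal sketch, assuming a small made-up data frame named `toy` (it is not the diabetes data set), the rows containing nulls can be listed with `isnull().any(axis=1)`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "SEX": ["male", None, "female"],
    "Target": [100.0, 150.0, np.nan],
})

# Count the nulls in each column
null_counts = toy.isnull().sum()
print(null_counts.to_dict())           # {'SEX': 1, 'Target': 1}

# Select only the rows that contain at least one null
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls.index.tolist())  # [1, 2]
```

Looking at the affected rows first can help you decide whether to remove or replace each missing value.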

If you have null data, there are many ways to deal with the empty/null values. These are the two most common approaches:
1. Remove any row with a null value with a `dropna()` method call.
2. Replace missing values with another value with a `fillna()` method call. Generally, we use the mean value for numerical columns because it leaves the column mean unchanged and keeps the original size of the data.
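
The two approaches above can be compared side by side on a small example. This is a sketch under the assumption of a made-up data frame named `toy`; the column names mirror the diabetes data, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "SEX": ["male", None, "female", "female"],
    "Target": [100.0, 150.0, np.nan, 200.0],
})

# Approach 1: drop rows where the categorical column is missing
dropped = toy.dropna(subset=["SEX"])

# Approach 2: fill the missing numerical value with the column mean
filled = dropped.copy()
filled["Target"] = filled["Target"].fillna(filled["Target"].mean())

print(len(toy), len(dropped))     # 4 3
print(filled["Target"].tolist())  # [100.0, 150.0, 200.0]
```

Dropping shrinks the data by one row, while filling keeps the row count and uses the mean of the remaining values (150.0 here).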

Students should reflect on why this example removes the row with the null 'SEX' but replaces the null 'Target' with the mean.

In [32]:
# Remove Null values
data_frame = data_frame.dropna(subset=['SEX'])
data_frame.isnull().sum()

DoB       0
DoT       0
SEX       0
BMI       0
BP        0
TC        0
BGU       0
FDR       0
Target    1
dtype: int64

In [33]:
# Replace Null values with the mean value for the column
data_frame['Target'] = data_frame['Target'].fillna(data_frame['Target'].mean())
data_frame.isnull().sum()

DoB       0
DoT       0
SEX       0
BMI       0
BP        0
TC        0
BGU       0
FDR       0
Target    0
dtype: int64

#### Remove Duplicates

Duplicate data can have detrimental effects on your machine learning models and outcomes, such as reducing data diversity and representativeness, which can lead to overfitting or biased models.

The `duplicated().sum()` method call returns the count of duplicate rows in the data frame.

In [34]:
data_frame.duplicated().sum()

np.int64(5)

The `drop_duplicates()` method call returns a copy of the data frame with the duplicate rows removed. Storing the result back in the `data_frame` variable removes the duplicates from the data we are working with.

In [35]:
data_frame = data_frame.drop_duplicates()
data_frame.duplicated().sum()

np.int64(0)
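
By default, `duplicated()` does not count the first occurrence of a repeated row, which is why the count equals the number of rows that `drop_duplicates()` removes. A short sketch on a made-up data frame (`toy` is an assumption, not the diabetes data) shows the difference between the default and `keep=False`:

```python
import pandas as pd

# Hypothetical toy data: the row (22.0, 80) appears three times
toy = pd.DataFrame({
    "BMI": [22.0, 22.0, 30.5, 22.0],
    "BP": [80, 80, 95, 80],
})

# Default keep='first': only the repeats are flagged
print(toy.duplicated().sum())            # 2

# keep=False: every member of a duplicated group is flagged
print(toy.duplicated(keep=False).sum())  # 3

# Drop the repeats and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))                      # 2
```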

#### Replace data

We can run a lambda function on a column to modify its values. As a simple example, let’s convert the `SEX` column to lowercase. To run a function over a complete column, we can use the `apply()` method, which calls the function on each value in the column and returns the results.

In [36]:
data_frame['SEX'] = data_frame['SEX'].apply(lambda x: x.lower())
data_frame['SEX'].head()

0    female
1    female
2      male
3      male
4      male
Name: SEX, dtype: object
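
For simple string operations, Pandas also provides vectorised string methods through the `.str` accessor. The sketch below uses a made-up series named `sex` (an assumption for illustration) and shows that `.str.lower()` leaves a missing value in place. By contrast, `apply(lambda x: x.lower())` can raise an error on a missing value, because a float `NaN` has no `lower()` method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column containing mixed case and one missing value
sex = pd.Series(["Female", "MALE", np.nan, "male"])

# Vectorised lowercase conversion; the missing value stays missing
lowered = sex.str.lower()
print(lowered.tolist())
```

In this notebook the null `SEX` row was already removed, so both approaches give the same result.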

We can check for data entry errors with the `unique()` method call, which lists each distinct value in the column.

In [37]:
data_frame['SEX'].unique()

array(['female', 'male', 'girl'], dtype=object)

The output shows one data entry error, 'girl'. The next cell keeps 'male' and replaces every other value with 'female'.

In [38]:
data_frame['SEX'] = data_frame['SEX'].apply(lambda gender: 'male' if gender.lower() == 'male' else 'female')
data_frame['SEX'].unique()

array(['female', 'male'], dtype=object)
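
Mapping every value that is not 'male' to 'female' works for this data, but it would also silently relabel a typo such as 'mle'. A more defensive alternative, sketched here with a made-up series and a hypothetical `synonyms` table, is to replace only known variants and then list anything unexpected:

```python
import pandas as pd

# Hypothetical toy column; 'unknown' stands in for an unexpected entry
sex = pd.Series(["female", "male", "girl", "unknown"])

# Assumed synonym table: extend it with the variants found by unique()
synonyms = {"girl": "female", "boy": "male"}
cleaned = sex.replace(synonyms)

# Flag any value that is still not one of the expected categories
valid = {"female", "male"}
unexpected = cleaned[~cleaned.isin(valid)]

print(cleaned.tolist())     # ['female', 'male', 'female', 'unknown']
print(unexpected.tolist())  # ['unknown']
```

Unexpected values can then be reviewed by hand instead of being assigned to a category automatically.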

#### Remove outliers

Outliers can skew your analysis of numerical columns, so it is often worth removing them. We can use the 25th and 75th percentiles of a numerical column (the first quartile, Q1, and the third quartile, Q3) to calculate the interquartile range (IQR = Q3 - Q1). This gives an acceptable range, and we can then filter out any values outside it. By the common 1.5 × IQR rule, outliers are values more than 1.5 times the IQR below Q1 or above Q3.

In [39]:
# Get the interquartile range of the BP column
print(data_frame['BP'].describe())
Q1 = data_frame['BP'].quantile(0.25)
Q3 = data_frame['BP'].quantile(0.75)
IQR = Q3 - Q1
print(f'Outliers are a BP above {Q3 + IQR * 1.5} or below {Q1 - IQR * 1.5}')


count    442.000000
mean      94.687738
std       14.224409
min       51.000000
25%       84.000000
50%       93.000000
75%      105.000000
max      141.000000
Name: BP, dtype: float64
Outliers are a BP above 136.5 or below 52.5


In [40]:
# Keep only the BP values within the acceptable range
data_frame = data_frame[(data_frame['BP'] >= Q1 - 1.5 * IQR) & (data_frame['BP'] <= Q3 + 1.5 * IQR)]
print(data_frame['BP'].describe())

count    439.000000
mean      94.583098
std       13.790260
min       62.000000
25%       84.000000
50%       93.000000
75%      105.000000
max      133.000000
Name: BP, dtype: float64
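
The same IQR rule can be wrapped in a small function so it can be reused for other numerical columns. This is a sketch only: the function name `iqr_bounds`, its parameter `k`, and the sample values are assumptions made for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical readings with one low and one high extreme value
bp = pd.Series([60, 84, 90, 93, 100, 105, 200])

lower, upper = iqr_bounds(bp)
print(lower, upper)   # 63.75 125.75

# between() is inclusive at both ends, like the >= and <= filter above
kept = bp[bp.between(lower, upper)]
print(kept.tolist())  # [84, 90, 93, 100, 105]
```

Here Q1 is 87.0 and Q3 is 102.5, so the IQR is 15.5 and both 60 and 200 fall outside the limits.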


#### Scaling features to a common range

Scaling features to a common range, such as 0 to 1, makes it easier for many machine learning algorithms to find the optimal solution, because features with large numeric ranges no longer dominate features with small ones.

In [41]:
scale_feature = 'BP'

#the minimum value with space for outliers
MIN_BP = 55

#the maximum value with space for outliers
MAX_BP = 140

#scale features
data_frame[scale_feature] = [(X - MIN_BP) / (MAX_BP - MIN_BP) for X in data_frame[scale_feature]]

data_frame.describe()

| | BMI | BP | TC | BGU | FDR | Target |
| --- | --- | --- | --- | --- | --- | --- |
| count | 439.0 | 439.0 | 439.0 | 439.0 | 439.0 | 439.0 |
| mean | 26.361276 | 0.465684 | 4.069112 | 91.275626 | 1.066059 | 152.030328 |
| std | 4.428303 | 0.162238 | 1.294492 | 11.492468 | 0.831849 | 77.298096 |
| min | 18.0 | 0.082353 | 2.0 | 58.0 | 0.0 | 25.0 |
| 25% | 23.15 | 0.341176 | 3.0 | 83.5 | 0.0 | 86.5 |
| 50% | 25.7 | 0.447059 | 4.0 | 91.0 | 1.0 | 140.0 |
| 75% | 29.25 | 0.588235 | 5.0 | 98.0 | 2.0 | 213.0 |
| max | 42.2 | 0.917647 | 9.09 | 124.0 | 3.0 | 346.0 |
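
The notebook imports `MinMaxScaler` from scikit-learn but scales by hand with fixed limits. For comparison, the sketch below shows how `MinMaxScaler` could be used on a made-up data frame (`toy` and the `BP_scaled` column are assumptions). Note the difference: the scaler learns the minimum and maximum from the data it is fitted on, so there is no headroom for outliers unless you add it yourself.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"BP": [62.0, 93.0, 133.0]})

# fit_transform expects a 2D input, hence the double brackets
scaler = MinMaxScaler()
toy["BP_scaled"] = scaler.fit_transform(toy[["BP"]]).ravel()

# The fitted limits are the observed minimum and maximum
print(scaler.data_min_, scaler.data_max_)
print(toy["BP_scaled"].round(3).tolist())  # [0.0, 0.437, 1.0]
```

Either approach performs the same min-max calculation; they differ only in where the limits come from.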


> [!important]
> Save the minimum and maximum values you use to scale each feature. You must apply exactly the same scaling to new values before using them for prediction.
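
One possible way to keep the scaling values is to store them in a small JSON file next to the model. This is a minimal sketch, not a required method: the file name `scaling_bounds.json`, the helper `scale`, and the temporary directory are assumptions, while the limits are the ones used in this notebook.

```python
import json
import os
import tempfile

# Limits used in this notebook for each scaled feature
scaling_bounds = {
    "BP": {"min": 55, "max": 140},
    "BMI": {"min": 12, "max": 46},
    "BGU": {"min": 50, "max": 132},
}

def scale(value, bounds):
    """Apply min-max scaling with previously saved limits."""
    return (value - bounds["min"]) / (bounds["max"] - bounds["min"])

# Save the limits (a temporary directory is used here for the demo)
path = os.path.join(tempfile.mkdtemp(), "scaling_bounds.json")
with open(path, "w") as f:
    json.dump(scaling_bounds, f)

# Later, at prediction time: reload the limits and scale a new reading
with open(path) as f:
    loaded = json.load(f)

new_bp = 93
print(round(scale(new_bp, loaded["BP"]), 6))  # 0.447059
```

A BP of 93 scales to 0.447059, which matches the median BP in the scaled data above.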

Scale BMI

In [43]:
scale_feature = 'BMI'

#the minimum value with space for outliers
MIN_BMI = 12

#the maximum value with space for outliers
MAX_BMI = 46

#scale features
data_frame[scale_feature] = [(X - MIN_BMI) / (MAX_BMI - MIN_BMI) for X in data_frame[scale_feature]]

data_frame.describe()

| | BMI | BP | TC | BGU | FDR | Target |
| --- | --- | --- | --- | --- | --- | --- |
| count | 439.0 | 439.0 | 439.0 | 439.0 | 439.0 | 439.0 |
| mean | 0.42239 | 0.465684 | 4.069112 | 91.275626 | 1.066059 | 152.030328 |
| std | 0.130244 | 0.162238 | 1.294492 | 11.492468 | 0.831849 | 77.298096 |
| min | 0.176471 | 0.082353 | 2.0 | 58.0 | 0.0 | 25.0 |
| 25% | 0.327941 | 0.341176 | 3.0 | 83.5 | 0.0 | 86.5 |
| 50% | 0.402941 | 0.447059 | 4.0 | 91.0 | 1.0 | 140.0 |
| 75% | 0.507353 | 0.588235 | 5.0 | 98.0 | 2.0 | 213.0 |
| max | 0.888235 | 0.917647 | 9.09 | 124.0 | 3.0 | 346.0 |


Scale BGU

In [44]:
scale_feature = 'BGU'

#the minimum value with space for outliers
MIN_BGU = 50

#the maximum value with space for outliers
MAX_BGU = 132

#scale features
data_frame[scale_feature] = [(X - MIN_BGU) / (MAX_BGU - MIN_BGU) for X in data_frame[scale_feature]]

data_frame.describe()

| | BMI | BP | TC | BGU | FDR | Target |
| --- | --- | --- | --- | --- | --- | --- |
| count | 439.0 | 439.0 | 439.0 | 439.0 | 439.0 | 439.0 |
| mean | 0.42239 | 0.465684 | 4.069112 | 0.503361 | 1.066059 | 152.030328 |
| std | 0.130244 | 0.162238 | 1.294492 | 0.140152 | 0.831849 | 77.298096 |
| min | 0.176471 | 0.082353 | 2.0 | 0.097561 | 0.0 | 25.0 |
| 25% | 0.327941 | 0.341176 | 3.0 | 0.408537 | 0.0 | 86.5 |
| 50% | 0.402941 | 0.447059 | 4.0 | 0.5 | 1.0 | 140.0 |
| 75% | 0.507353 | 0.588235 | 5.0 | 0.585366 | 2.0 | 213.0 |
| max | 0.888235 | 0.917647 | 9.09 | 0.902439 | 3.0 | 346.0 |
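
The three scaling cells repeat the same calculation with different limits. As a sketch of one way to reduce the repetition, the limits can be kept in a dictionary and applied in a loop with vectorised arithmetic. The dictionary name `bounds` and the two-row `toy` data frame are assumptions; the rows hold the minimum and maximum values reported for each feature after outlier removal.

```python
import pandas as pd

# (minimum, maximum) limits used in this notebook
bounds = {"BP": (55, 140), "BMI": (12, 46), "BGU": (50, 132)}

# Hypothetical two-row frame holding each feature's smallest and largest value
toy = pd.DataFrame({
    "BP": [62.0, 133.0],
    "BMI": [18.0, 42.2],
    "BGU": [58.0, 124.0],
})

# Apply min-max scaling to every listed column
for column, (low, high) in bounds.items():
    toy[column] = (toy[column] - low) / (high - low)

print(toy.round(6))
```

The scaled values agree with the `min` and `max` rows of the summary above, for example 0.082353 and 0.917647 for BP.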


#### Save the wrangled data to CSV

In [45]:
data_frame.to_csv('../2.2.Feature_Engineering/2.2.1.wrangled_data.csv', index=False)
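
Passing `index=False` stops Pandas from writing the row index as an extra column, which would otherwise come back as an `Unnamed: 0` column when the file is read again. The sketch below checks this with a round trip, assuming a made-up data frame and a temporary file rather than the real output path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"SEX": ["female", "male"], "BP": [0.25, 0.5]})

# Write to a temporary file so no real data is overwritten
path = os.path.join(tempfile.mkdtemp(), "wrangled_toy.csv")
toy.to_csv(path, index=False)

# Read the file back and confirm the columns are unchanged
reloaded = pd.read_csv(path)
print(reloaded.columns.tolist())  # ['SEX', 'BP']
print(reloaded.shape)             # (2, 2)
```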