Topics: <br>
• Deleting missing values <br>
• Replacing missing values <br>

In [1]:
import pandas as pd

Let us now look at the dataset for this task. The dataset is used to predict whether a female patient has diabetes, based on parameters such as Glucose and Insulin levels.

Read the dataset into a pandas DataFrame named "df" using read_csv(), providing the correct path to the downloaded file.

In [2]:
df = pd.read_csv('diabetes.csv')

Let us look at more details about our data, such as the number of rows and columns, by using .info().

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


As we can see, there are 9 columns and 768 rows. There appear to be no null values in our data. Let us dive deeper into our dataset to find more insights.

In [5]:
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Let us look at the minimum value for each column. A minimum value of 0 for
 - Glucose
 - BloodPressure
 - SkinThickness
 - Insulin
 - BMI
 
is implausible, since these measurements cannot be 0 for a living person. This suggests that missing values are represented by 0 in this dataset.

Let us see how many 0 values there are in each of these columns.

First, use .head() to look at the first 5 rows of the dataset.

In [3]:
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [6]:
data_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin']

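Count the number of zeros in the above columns. Note that 'BMI' is not included in data_cols, so its 0 values are neither counted here nor converted in the steps that follow.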

In [7]:
(df[data_cols] == 0).sum()

Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
dtype: int64

The count of 0 values in each column is listed above. These numbers confirm that 0 is indeed being used to represent missing values.
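To put these counts in proportion, the short sketch below converts them into a percentage of the 768 rows. The counts and the row total are copied from the outputs above; the variable names are only illustrative.

```python
import pandas as pd

# Zero counts and row total copied from the outputs shown above.
zero_counts = pd.Series(
    {"Glucose": 5, "BloodPressure": 35, "SkinThickness": 227, "Insulin": 374}
)
n_rows = 768

# Share of rows (in percent) where each column holds a 0.
zero_share = (zero_counts / n_rows * 100).round(1)
print(zero_share)
```

With the dataset loaded, the same shares could presumably be obtained directly with (df[data_cols] == 0).mean() * 100. Nearly half of the 'Insulin' values are 0, while 'Glucose' is affected in under 1% of rows.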

Let us see what happens when we try to count the null values in each column using .isnull().sum().

In [9]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

It shows 0 null values for every column. This happens because null values in this case are not represented by the standard 'NaN' or 'None'.

Since null values are represented by 0 here, pandas is not able to identify any null values in the dataset.
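The small sketch below illustrates this on two hypothetical toy columns (not rows from diabetes.csv): one codes missing readings as 0, the other as NaN. Only the second is picked up by .isnull().

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings, not taken from diabetes.csv.
coded_with_zero = pd.Series([94, 0, 168, 0])
coded_with_nan = pd.Series([94, np.nan, 168, np.nan])

# 0 is an ordinary integer, so isnull() does not flag it.
nulls_in_zero_coded = int(coded_with_zero.isnull().sum())
# NaN is the marker pandas recognises as missing.
nulls_in_nan_coded = int(coded_with_nan.isnull().sum())

print(nulls_in_zero_coded)  # 0
print(nulls_in_nan_coded)   # 2
```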

Let us now replace these 0 values with 'NaN' values.

In [10]:
from numpy import nan

In [11]:
df[data_cols] = df[data_cols].replace(0, nan)

Let us now check the null values again in our dataset.

In [12]:
df.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

As we can see, null values are now being detected.

All 0s in the selected columns have been converted to null values.
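Restricting the replacement to selected columns matters because 0 can be a legitimate value elsewhere (for example, 0 pregnancies). The sketch below shows this on a hypothetical two-column toy frame; it also shows that the converted column becomes float, since NaN is a floating-point value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 0 is valid for Pregnancies but not for Glucose.
toy = pd.DataFrame({"Pregnancies": [0, 2, 0], "Glucose": [148, 0, 85]})
cols = ["Glucose"]

# Replace 0 with NaN only in the listed columns.
toy[cols] = toy[cols].replace(0, np.nan)

glucose_nulls = int(toy["Glucose"].isnull().sum())
pregnancy_zeros = int((toy["Pregnancies"] == 0).sum())

print(glucose_nulls)         # 1
print(pregnancy_zeros)       # 2, left untouched
print(toy["Glucose"].dtype)  # float64
```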

Let us view the first 10 rows of our dataset to see if any NaN values can be seen.

In [13]:
df.head(10)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,5,116.0,74.0,,,25.6,0.201,30,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
7,10,115.0,,,,35.3,0.134,29,0
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
9,8,125.0,96.0,,,0.0,0.232,54,1


## Deleting missing values
From the count of null values, it can be seen that the columns 'Glucose' and 'BloodPressure' have very few null values (5 and 35). 'BMI' shows none only because it was not part of data_cols, so its 0 values were never converted to NaN. <br>
Deleting the affected observations therefore costs little data. <br>
Let us use .dropna() to drop the rows with missing values in these columns, as shown below.

In [14]:
df = df.dropna(subset=['Glucose', 'BloodPressure', 'BMI'])

In [15]:
df.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                 0
SkinThickness               194
Insulin                     335
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

As we can observe, no null values remain in the columns 'Glucose', 'BloodPressure' and 'BMI'.

However, null values remain in the columns 'SkinThickness' and 'Insulin'.

The number of null values in these two columns has gone down (from 227 to 194 and from 374 to 335) because some of the rows dropped by .dropna() also had nulls in these columns.
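This side effect can be reproduced on a hypothetical toy frame: dropping the rows where 'Glucose' is missing also removes any 'Insulin' nulls that happened to sit in those rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row is missing both values.
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, 168.0],
})

insulin_nulls_before = int(toy["Insulin"].isnull().sum())

# Drop only the rows where Glucose is missing.
trimmed = toy.dropna(subset=["Glucose"])

insulin_nulls_after = int(trimmed["Insulin"].isnull().sum())

print(len(trimmed))          # 3 rows remain
print(insulin_nulls_before)  # 2
print(insulin_nulls_after)   # 1
```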

## Replacing missing values
The column 'Insulin' has 335 missing values, which is too many to delete, so they should be replaced instead. Let us use the mean of the column 'Insulin' to replace the missing values.
To find the mean value of the column, use the mean() function as shown.

In [16]:
mean_val = df['Insulin'].mean()

In [17]:
print(mean_val)

155.8854961832061


Let us use the fillna() function to replace all the missing values in the column 'Insulin' with the mean of the column.

In [18]:
df['Insulin'] = df['Insulin'].fillna(mean_val)


In [19]:
df.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                 0
SkinThickness               194
Insulin                       0
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
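As a small illustration of mean imputation, the sketch below uses a hypothetical toy column. By default mean() skips NaN values, so the mean is computed from the observed values only; filling with it leaves the column mean unchanged but shrinks the spread.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing readings.
toy = pd.Series([100.0, np.nan, 200.0, np.nan])

# mean() ignores NaN by default, so this averages 100 and 200.
toy_mean = toy.mean()

# Assign the result back instead of calling fillna with inplace=True.
filled = toy.fillna(toy_mean)

print(toy_mean)         # 150.0
print(filled.tolist())  # [100.0, 150.0, 200.0, 150.0]
print(toy.std() > filled.std())  # True: imputed values pull the spread in
```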

Let us now see how interpolation can be used to replace missing values. interpolate() is the pandas function for interpolation. By default, it performs linear interpolation. Let us apply interpolate() to the column 'SkinThickness'.

In [20]:
df['SkinThickness'] = df['SkinThickness'].interpolate()


Let us again check if we have any null values remaining in our dataset.

In [21]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
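To see what linear interpolation does, the sketch below applies interpolate() to a hypothetical toy column. With the default linear method, the values are treated as equally spaced, and each gap is filled along a straight line between its known neighbours.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with a two-value gap between 10 and 40.
toy = pd.Series([10.0, np.nan, np.nan, 40.0])

# Default method is linear: the gap is filled in equal steps.
filled = toy.interpolate()

print(filled.tolist())  # [10.0, 20.0, 30.0, 40.0]
```

Unlike mean imputation, the filled values depend on the neighbouring rows, so the result is sensitive to row order.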