<h2><u>Data Science Using Python</u></h2>

# Module 6: Data Manipulation

<h2>Demo 4: Cleaning Data</h2>

This demo shows how to inspect a dataset for inconsistencies, such as missing values and duplicate rows, and how to handle them so that the data is ready for analysis.

In [None]:
# Import the required libraries

import pandas as pd
import numpy as np


In [None]:
# Read a sample of diabetes data from a CSV file

df = pd.read_csv('diabetes_sample.csv')
df

,PatientID,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,PT1001,6,148,72,35,250.0,33.6,0.627,50,1
1,PT1002,1,85,66,29,300.0,26.6,0.351,31,0
2,PT1003,8,183,64,0,,23.3,0.672,32,1
3,PT1004,1,89,66,23,94.0,28.1,0.167,21,0
4,PT1005,0,137,40,35,168.0,43.1,2.288,33,1
5,PT1005,0,137,40,35,168.0,43.1,2.288,33,1
6,PT1006,3,78,50,32,88.0,31.0,0.248,26,1
7,PT1007,10,115,0,0,400.0,35.3,0.134,29,0
8,PT1008,2,197,70,45,543.0,30.5,0.158,53,1
9,PT1009,8,125,96,0,0.0,0.0,0.232,54,1


In [None]:
# Check if there are any null values in the data

print(df.isnull())

# Check for null values in column Insulin

print(df.Insulin.isnull())

    PatientID  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  \
0       False        False    False          False          False    False   
1       False        False    False          False          False    False   
2       False        False    False          False          False     True   
3       False        False    False          False          False    False   
4       False        False    False          False          False    False   
5       False        False    False          False          False    False   
6       False        False    False          False          False    False   
7       False        False    False          False          False    False   
8       False        False    False          False          False    False   
9       False        False    False          False          False    False   
10      False        False    False          False          False    False   
11      False        False    False          False          Fals
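
The full Boolean table is hard to scan once a frame has more than a few rows. A minimal sketch of a more compact check follows; it assumes a small stand-in frame (`sample`, built from three of the rows shown earlier) rather than the full `diabetes_sample.csv`, and the names `null_counts` and `rows_with_nulls` are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in frame built from three of the rows displayed above (not the full file)
sample = pd.DataFrame({
    "PatientID": ["PT1001", "PT1002", "PT1003"],
    "Glucose": [148, 85, 183],
    "Insulin": [250.0, 300.0, np.nan],
})

# Count the null values in each column instead of printing every cell
null_counts = sample.isnull().sum()
print(null_counts)

# Select only the rows that contain at least one null value
rows_with_nulls = sample[sample.isnull().any(axis=1)]
print(rows_with_nulls)
```

On the demo data, the same two calls applied to `df` would summarise the nulls per column and list only the affected rows.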

In [None]:
# Fill the null values in the Insulin column with the mean of that column

df.Insulin = df.Insulin.fillna(df.Insulin.mean())
df

# Observe that the null value is filled using the mean

,PatientID,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,PT1001,6,148,72,35,250.0,33.6,0.627,50,1
1,PT1002,1,85,66,29,300.0,26.6,0.351,31,0
2,PT1003,8,183,64,0,216.571429,23.3,0.672,32,1
3,PT1004,1,89,66,23,94.0,28.1,0.167,21,0
4,PT1005,0,137,40,35,168.0,43.1,2.288,33,1
5,PT1005,0,137,40,35,168.0,43.1,2.288,33,1
6,PT1006,3,78,50,32,88.0,31.0,0.248,26,1
7,PT1007,10,115,0,0,400.0,35.3,0.134,29,0
8,PT1008,2,197,70,45,543.0,30.5,0.158,53,1
9,PT1009,8,125,96,0,0.0,0.0,0.232,54,1
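
It may help to see where the fill value comes from. The sketch below assumes a short stand-in Series (`insulin`, using the first five Insulin values shown above, not the whole column) and shows that `mean()` skips null values by default, so the fill value is the mean of the known entries only. Note that any `0.0` readings are ordinary numbers to pandas and are included in that mean.

```python
import numpy as np
import pandas as pd

# Stand-in Series using the first five Insulin values displayed above
insulin = pd.Series([250.0, 300.0, np.nan, 94.0, 168.0], name="Insulin")

# mean() ignores NaN by default, so this averages the four known values
fill_value = insulin.mean()
print(fill_value)

# Replace the null entry with that mean and confirm no nulls remain
filled = insulin.fillna(fill_value)
print(filled)
print(filled.isnull().sum())
```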


In [None]:
# Check for duplicate rows

df.duplicated()

# Only row 5 is flagged True: it is an exact copy of row 4, and by default the first occurrence (row 4) is not marked as a duplicate


0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
dtype: bool
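
Once exact duplicates have been detected, they can be removed with `drop_duplicates()`. This is a minimal sketch on a stand-in frame (`sample`, using a few columns from rows 3 to 6 shown earlier); the name `deduped` is illustrative, and resetting the index is an optional choice made here for readability.

```python
import pandas as pd

# Stand-in frame: the two PT1005 rows are exact copies of each other
sample = pd.DataFrame({
    "PatientID": ["PT1004", "PT1005", "PT1005", "PT1006"],
    "Glucose": [89, 137, 137, 78],
    "Age": [21, 33, 33, 26],
})

# duplicated() marks every repeat after the first occurrence
print(sample.duplicated().sum())

# Drop the repeated rows, keep the first occurrence, and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)
print(deduped)
```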

In [None]:
# Check if any patient has a duplicate entry using PatientID column

df.duplicated(['PatientID'])

# Observe that row 10 is also flagged True here: its PatientID repeats an earlier one even though the full row is not an exact copy

0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14    False
dtype: bool
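
When a PatientID repeats but the rest of the row differs, it is usually worth viewing all conflicting entries before deciding which one to keep. The sketch below uses hypothetical rows (the second `PT1002` entry and its Glucose value are invented for illustration, since the source does not show row 10) and assumes that keeping the first entry per patient is acceptable, which may not hold for real data.

```python
import pandas as pd

# Hypothetical rows: the second PT1002 entry differs from the first in Glucose
sample = pd.DataFrame({
    "PatientID": ["PT1001", "PT1002", "PT1002", "PT1003"],
    "Glucose": [148, 85, 90, 183],
})

# keep=False flags every row whose PatientID occurs more than once
repeated = sample[sample.duplicated(subset=["PatientID"], keep=False)]
print(repeated)

# Keep only the first entry for each patient
one_per_patient = sample.drop_duplicates(subset=["PatientID"], keep="first")
print(one_per_patient)
```

Passing `keep="last"` instead would retain the most recent entry per patient.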

##### Conclusion: This demo shows how to detect and handle inconsistencies in a dataset, such as null values and duplicate rows, using pandas functions.