# Handling missing data

The raw data that we want to analyse often contains missing values. To clean such datasets we can take several approaches:
1. Drop every row that has any missing data
2. Replace missing data with the previous/next value
3. Replace missing data with a global/scalar value
4. Replace missing data with a predicted value

Pandas provides a few easy functions for handling such data. We shall explore some of them here.
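
Approaches 3 and 4 from the list above could be sketched roughly as follows. The `toy` DataFrame and its values are made up for illustration and only borrow column names from the cancer dataset. Using the column mode or mean as the "predicted" value is an assumption: it is the simplest possible stand-in for a real predictive model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from data_cancer.txt.
toy = pd.DataFrame({
    "age": ["40-49", np.nan, "50-59", "40-49"],
    "deg-malig": [2.0, 3.0, np.nan, 1.0],
})

# Approach 3: replace missing entries with fixed scalar values,
# one placeholder per column so that column dtypes stay consistent.
filled_scalar = toy.fillna({"age": "unknown", "deg-malig": 0})

# Approach 4 (simplest form): fill with a statistic of the column.
filled_stat = toy.copy()
# Categorical column: use the most frequent category.
filled_stat["age"] = toy["age"].fillna(toy["age"].mode()[0])
# Numeric column: use the mean of the observed values.
filled_stat["deg-malig"] = toy["deg-malig"].fillna(toy["deg-malig"].mean())

print(filled_scalar)
print(filled_stat)
```

A scalar placeholder keeps the fact that a value was missing visible, whereas a column statistic hides it but leaves the column usable for numeric work.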

In [15]:
from google.colab import files

uploaded = files.upload()  # opens an upload dialog; returns a dict of {filename: file bytes}

for fn in uploaded.keys():  # report every uploaded file
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))  # file name and size in bytes

Saving data_cancer.txt to data_cancer.txt
User uploaded file "data_cancer.txt" with length 18645 bytes


In [16]:
import pandas as pd
import numpy as np

In [17]:
df = pd.read_csv('data_cancer.txt')
cols = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
df.columns = cols  # assign readable column names

df[164:190]  # display rows 164 to 189 (the end of the slice is exclusive)

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
164,no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_up,no
165,no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_low,no
166,no-recurrence-events,40-49,ge40,40-44,15-17,yes,2,right,left_up,yes
167,no-recurrence-events,50-59,premeno,10-14,0-2,no,2,right,left_up,no
168,no-recurrence-events,40-49,ge40,30-34,0-2,no,2,left,left_up,yes
169,no-recurrence-events,30-39,premeno,20-24,3-5,yes,2,right,left_up,yes
170,no-recurrence-events,30-39,premeno,15-19,0-2,no,1,left,left_low,no
171,no-recurrence-events,60-69,ge40,30-34,6-8,yes,2,right,right_up,no
172,no-recurrence-events,50-59,ge40,20-24,3-5,yes,2,right,left_up,no
173,no-recurrence-events,50-59,premeno,25-29,3-5,yes,2,left,left_low,yes


# Check for presence of null data

In [18]:
df.notnull()[164:190]



Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
164,True,True,True,True,True,True,True,True,True,True
165,True,True,True,True,True,True,True,True,True,True
166,True,True,True,True,True,True,True,True,True,True
167,True,True,True,True,True,True,True,True,True,True
168,True,True,True,True,True,True,True,True,True,True
169,True,True,True,True,True,True,True,True,True,True
170,True,True,True,True,True,True,True,True,True,True
171,True,True,True,True,True,True,True,True,True,True
172,True,True,True,True,True,True,True,True,True,True
173,True,True,True,True,True,True,True,True,True,True
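
A full table of booleans is hard to scan, so it can help to summarise it. The sketch below counts missing cells per column and selects only the rows that contain a null. The `toy` DataFrame is a hypothetical stand-in for `df`; the same calls should work on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cancer DataFrame.
toy = pd.DataFrame({
    "node-caps": ["no", np.nan, "yes", "no"],
    "breast-quad": ["left_up", "left_low", np.nan, np.nan],
})

# Number of missing cells in each column.
null_counts = toy.isnull().sum()

# Only the rows that have at least one missing cell.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(null_counts)
print(rows_with_nulls)
```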


# Fill null data with previous value

To fill with the next value instead, use 'bfill' (backward fill).

In [19]:
df.ffill()[164:190]  # forward fill: copy the previous row's value into each gap (replaces the deprecated fillna(method='pad'))

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
164,no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_up,no
165,no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_low,no
166,no-recurrence-events,40-49,ge40,40-44,15-17,yes,2,right,left_up,yes
167,no-recurrence-events,50-59,premeno,10-14,0-2,no,2,right,left_up,no
168,no-recurrence-events,40-49,ge40,30-34,0-2,no,2,left,left_up,yes
169,no-recurrence-events,30-39,premeno,20-24,3-5,yes,2,right,left_up,yes
170,no-recurrence-events,30-39,premeno,15-19,0-2,no,1,left,left_low,no
171,no-recurrence-events,60-69,ge40,30-34,6-8,yes,2,right,right_up,no
172,no-recurrence-events,50-59,ge40,20-24,3-5,yes,2,right,left_up,no
173,no-recurrence-events,50-59,premeno,25-29,3-5,yes,2,left,left_low,yes
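
The difference between filling from the previous value and from the next value is easier to see on a small example. The Series below is made up for illustration. Note that a forward fill cannot fill a gap at the very start, because there is no earlier value to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a leading gap and a gap in the middle.
toy = pd.Series([np.nan, "premeno", np.nan, "ge40"], name="menopause")

forward = toy.ffill()   # each gap takes the previous value; the leading gap stays empty
backward = toy.bfill()  # each gap takes the next value

print(forward)
print(backward)
```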


In [24]:
dframe = pd.read_csv('data_cancer.txt')  # read the data again
cols = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
dframe.columns = cols  # assign readable column names
dframe[164:190]  # display rows 164 to 189



Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
164,no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_up,no
165,no-recurrence-events,40-49,premeno,20-24,3-5,no,2,right,left_low,no
166,no-recurrence-events,40-49,ge40,40-44,15-17,yes,2,right,left_up,yes
167,no-recurrence-events,50-59,premeno,10-14,0-2,no,2,right,left_up,no
168,no-recurrence-events,40-49,ge40,30-34,0-2,no,2,left,left_up,yes
169,no-recurrence-events,30-39,premeno,20-24,3-5,yes,2,right,left_up,yes
170,no-recurrence-events,30-39,premeno,15-19,0-2,no,1,left,left_low,no
171,no-recurrence-events,60-69,ge40,30-34,6-8,yes,2,right,right_up,no
172,no-recurrence-events,50-59,ge40,20-24,3-5,yes,2,right,left_up,no
173,no-recurrence-events,50-59,premeno,25-29,3-5,yes,2,left,left_low,yes


# Drop rows having null values

In [25]:
dframe.dropna()[164:190]  # drop rows with any null, then slice the surviving rows by position

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
167,no-recurrence-events,50-59,premeno,10-14,0-2,no,2,right,left_up,no
168,no-recurrence-events,40-49,ge40,30-34,0-2,no,2,left,left_up,yes
169,no-recurrence-events,30-39,premeno,20-24,3-5,yes,2,right,left_up,yes
170,no-recurrence-events,30-39,premeno,15-19,0-2,no,1,left,left_low,no
171,no-recurrence-events,60-69,ge40,30-34,6-8,yes,2,right,right_up,no
172,no-recurrence-events,50-59,ge40,20-24,3-5,yes,2,right,left_up,no
173,no-recurrence-events,50-59,premeno,25-29,3-5,yes,2,left,left_low,yes
174,no-recurrence-events,40-49,premeno,30-34,0-2,no,2,right,right_up,yes
175,no-recurrence-events,40-49,ge40,25-29,0-2,no,2,left,left_low,no
176,no-recurrence-events,60-69,ge40,10-14,0-2,no,2,left,left_low,no
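
`dropna()` keeps the original row labels, while a slice such as `[164:190]` selects by position. This would explain why the output above starts at label 167 rather than 164: rows dropped earlier in the table shift the remaining ones up. The sketch below shows the effect on a made-up DataFrame, along with two optional variations: `reset_index(drop=True)` to renumber the rows and `subset=` to drop only on chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; rows 1 and 2 each contain one null.
toy = pd.DataFrame({
    "age": ["40-49", np.nan, "50-59", "30-39"],
    "breast": ["left", "right", np.nan, "right"],
})

cleaned = toy.dropna()          # rows 1 and 2 are removed, labels 0 and 3 remain
sliced = cleaned[1:2]           # positional slice: the second surviving row (label 3)

renumbered = cleaned.reset_index(drop=True)  # labels become 0, 1
only_age = toy.dropna(subset=["age"])        # drop only rows where 'age' is missing

print(cleaned)
print(sliced)
print(renumbered)
print(only_age)
```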
