<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/3a_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Filling in Missing Data

In this notebook we look at different methods for filling in missing data. <br>
We use the horse-colic dataset, which has a large number of missing values.

Dataset information: [Horse-colic.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.names)

Approximately 30% of the values in the dataset are missing.

**Column 23 is our label**:<br>
outcome<br>
>what eventually happened to the horse?<br>
>>possible values:<br>
1.   lived
2.   died
3.   euthanized



In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

**Import the necessary libraries**<br>
Note the SimpleImputer class, imported from sklearn.impute.

In [None]:
from numpy import isnan
import pandas as pd
from pandas import read_csv
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.impute import SimpleImputer

In [None]:
# load the dataset, treating '?' entries as missing values (NaN)
df = read_csv('horse.csv', na_values='?')
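
The `na_values` argument tells `read_csv` which placeholder strings to parse as `NaN`. A minimal sketch of the effect on a made-up three-row CSV (the text in `csv_text` is illustrative and not taken from `horse.csv`):

```python
import io
import pandas as pd

# Hypothetical CSV text; '?' is the placeholder passed to na_values.
csv_text = "pulse,rectal_temp\n66,38.5\n?,39.2\n40,?\n"

raw = pd.read_csv(io.StringIO(csv_text))                    # '?' kept as text
clean = pd.read_csv(io.StringIO(csv_text), na_values='?')   # '?' parsed as NaN

print(raw.dtypes)            # columns containing '?' are read as text, not numbers
print(clean.dtypes)          # both columns are numeric once '?' becomes NaN
print(clean.isnull().sum())  # one missing value per column
```

Without `na_values`, a single `?` is enough to stop pandas from reading a column as numeric, which is why the placeholder is handled at load time.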

Our dataset has a lot of missing data

In [None]:
df

In [None]:
#!cat datadict.txt
df.columns

In [None]:
print(df.shape)

**Create a list of the columns that are missing data and calculate the percentage of missing data for each column**<br>


In [None]:
null = df.isnull().sum()  # count the missing values in each column
# express the missing values as a percentage of the rows
percent = null / len(df) * 100
# keep only the columns that have missing data, largest percentage first
null_t = percent[null > 0].sort_values(ascending=False)
null_t  # columns with missing data and their percentage missing

In [None]:
for col, pct in null_t.items():
  # print each column name with its percentage of missing values
  print(col, pct)
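
The same percentages can also be computed in one step, because the mean of a boolean mask is the fraction of `True` values. A small sketch on a made-up frame (`toy` and its columns are illustrative, not part of the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],   # 2 of 4 missing
    'b': [np.nan, 2.0, 3.0, 4.0],      # 1 of 4 missing
    'c': [1.0, 2.0, 3.0, 4.0],         # complete
})

pct_missing = toy.isnull().mean() * 100          # percent missing per column
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(pct_missing)                               # a: 50.0, b: 25.0

overall = toy.isnull().values.mean() * 100       # percent missing over all cells
print(overall)                                   # 25.0
```

The `overall` figure is the kind of number behind a statement such as "approximately 30% of the values are missing".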

In [None]:
# plot the percentage of missing values for each column
sb.set(font_scale=1.2)
plt.style.use('ggplot')
plt.figure(figsize=(20, 10))
sb.barplot(x=null_t.index, y=null_t)
plt.xlabel("column name")
plt.ylabel("percent missing")
plt.xticks(rotation='vertical')
plt.show()

To use the SimpleImputer with a numeric strategy such as the mean or median, the data needs to be in numerical form. <br>
Right now a number of columns are in categorical form. 

**Convert categorical data to numerical**

**Step 1**: Look at the data

In [None]:
df.head()

**Step 2**: List the unique values in each column with the unique function

In [None]:
for col in df:
    print(df[col].unique())
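
Printing every column also lists the numeric ones, which do not need converting. A hedged sketch of narrowing the output to the non-numeric columns with `select_dtypes`, shown on a made-up frame (`toy` is illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'pulse': [66.0, np.nan, 40.0],
    'pain': ['alert', np.nan, 'depressed'],
    'surgery': ['no', 'yes', 'yes'],
})

# Keep only the non-numeric columns, then list each one's categories without NaN.
text_cols = toy.select_dtypes(exclude='number').columns
for col in text_cols:
    print(col, toy[col].dropna().unique().tolist())
```

Applied to `df`, the same loop would give the list of columns that still need a category-to-number mapping.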

**Step 3**: Change the categorical data to numerical<br>
We have a lot of categorical data. <br>
We can replace the categories with numbers using the replace function.

Using the replace function: <br>
Change the categories to numbers<br>
Don't replace NaN

In [None]:
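# Assign the result back to the column: calling replace(..., inplace=True) on a
# selected column is deprecated in recent pandas and may not update df
df['surgery'] = df['surgery'].replace(['no', 'yes'], [0, 1])
df['age'] = df['age'].replace(['adult', 'young'], [0, 1])
# Unique values of temp_of_extremities: ['cool' nan 'normal' 'cold' 'warm']
df['temp_of_extremities'] = df['temp_of_extremities'].replace(['cool', 'cold', 'normal', 'warm'], [0, 1, 2, 3])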
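
To check that a conversion leaves missing entries alone, it helps to try it on a tiny Series first. The sketch below uses `Series.map` with a dictionary instead of `replace`; the Series `s` is made up for illustration. One difference to keep in mind: `map` turns any category that is not in the dictionary into `NaN`, whereas `replace` leaves unlisted values unchanged.

```python
import numpy as np
import pandas as pd

s = pd.Series(['no', 'yes', np.nan, 'yes'], name='surgery')

# map() with a dict converts the listed categories and leaves NaN as NaN.
mapping = {'no': 0, 'yes': 1}
encoded = s.map(mapping)

print(encoded.tolist())          # [0.0, 1.0, nan, 1.0]
print(encoded.isnull().sum())    # 1, the original missing entry is preserved
```

The result is a float column because `NaN` cannot be stored in an integer column of the default NumPy dtype.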

**Using one-hot encoding**<br>
Some of our categorical columns have more than two categories. <br>
For this type of data we might want to use one-hot encoding.

In [None]:
df_ohe = pd.get_dummies(df, columns = ['temp_of_extremities'])
df_ohe
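
One detail matters for a notebook about missing data: by default `get_dummies` creates no indicator for `NaN`, so a row with a missing category ends up with zeros in every new column. A minimal sketch on a made-up column (`toy` and `temp` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'temp': ['cool', np.nan, 'warm', 'cool']})

# Default: a missing value gets no indicator column, so its row is all zeros.
default = pd.get_dummies(toy, columns=['temp'], dtype=int)
print(default)

# dummy_na=True adds an explicit indicator column for missing values.
with_na = pd.get_dummies(toy, columns=['temp'], dummy_na=True, dtype=int)
print(with_na)
```

With the default setting the missing entry is no longer visible as `NaN` after encoding, so an imputer applied afterwards would have nothing to fill.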

In this notebook we concentrate on how to handle missing data. <br>
One-hot encoding is usually the better choice for columns with more than two unordered categories, but for now we simply convert every category to a number. 

In [None]:
df.columns

**To get a list of the unique values in a column use:**<br>
>dataframe_name.column_name.unique().tolist()

In [None]:
df.cp_data.unique().tolist()

**Replacing categorical data with numbers**

**All the columns that still have categorical data**

In [None]:
#All the columns that still have categorical data

#'peripheral_pulse' ['reduced' nan 'normal' 'absent' 'increased']
#'mucous_membrane' [nan 'pale_cyanotic' 'pale_pink' 'dark_cyanotic' 'normal_pink' 'bright_red' 'bright_pink']
#'capillary_refill_time' ['more_3_sec' 'less_3_sec' nan '3']
#'pain' ['extreme_pain' 'mild_pain' 'depressed' nan 'severe_pain' 'alert']
#'peristalsis' ['absent' 'hypomotile' nan 'hypermotile' 'normal']
#'abdominal_distention' ['severe' 'slight' 'none' nan 'moderate']
#'nasogastric_tube' [nan 'none' 'slight' 'significant']
#'nasogastric_reflux' [nan 'less_1_liter' 'none' 'more_1_liter']
#'rectal_exam_feces' ['decreased' 'absent' 'normal' nan 'increased']
#'abdomen' ['distend_large' 'other' 'normal' nan 'firm' 'distend_small']
#'abdomo_appearance' [nan 'cloudy' 'serosanguious' 'clear']
#'outcome' ['died' 'euthanized' 'lived']
#'surgical_lesion' ['no' 'yes']
#'cp_data' ['no' 'yes']

In [None]:
#To list a column by number
#df.iloc[:,[2]]

#for column in df:
#  print(df[column])

**Assignment:** <br>
Convert all categorical data to numerical data

In [None]:
#Assignment


In [None]:
#@title 
# Assign each result back to its column instead of using inplace=True on a selected column
df['peripheral_pulse'] = df['peripheral_pulse'].replace(['reduced', 'normal', 'absent', 'increased'], [0, 1, 2, 3])
df['mucous_membrane'] = df['mucous_membrane'].replace(['pale_cyanotic', 'pale_pink', 'dark_cyanotic',
                                                       'normal_pink', 'bright_red', 'bright_pink'],
                                                      [0, 1, 2, 3, 4, 5])
df['capillary_refill_time'] = df['capillary_refill_time'].replace(['more_3_sec', 'less_3_sec', '3'], [0, 1, 2])
df['pain'] = df['pain'].replace(['extreme_pain', 'mild_pain', 'depressed', 'severe_pain', 'alert'], [0, 1, 2, 3, 4])
df['peristalsis'] = df['peristalsis'].replace(['absent', 'hypomotile', 'hypermotile', 'normal'], [0, 1, 2, 3])
df['abdominal_distention'] = df['abdominal_distention'].replace(['severe', 'slight', 'none', 'moderate'], [0, 1, 2, 3])
df['nasogastric_tube'] = df['nasogastric_tube'].replace(['none', 'slight', 'significant'], [0, 1, 2])
df['nasogastric_reflux'] = df['nasogastric_reflux'].replace(['less_1_liter', 'none', 'more_1_liter'], [0, 1, 2])
df['rectal_exam_feces'] = df['rectal_exam_feces'].replace(['decreased', 'absent', 'normal', 'increased'], [0, 1, 2, 3])
df['abdomen'] = df['abdomen'].replace(['distend_large', 'other', 'normal', 'firm', 'distend_small'], [0, 1, 2, 3, 4])
df['abdomo_appearance'] = df['abdomo_appearance'].replace(['cloudy', 'serosanguious', 'clear'], [0, 1, 2])
df['outcome'] = df['outcome'].replace(['died', 'euthanized', 'lived'], [0, 1, 2])
df['surgical_lesion'] = df['surgical_lesion'].replace(['no', 'yes'], [0, 1])
df['cp_data'] = df['cp_data'].replace(['no', 'yes'], [0, 1])
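
With this many columns, one possible way to organise the conversion is a dictionary of per-column mappings applied in a loop, followed by a check that no text columns are left. This is a sketch on a made-up two-column frame, not the notebook's method; `toy` and `mappings` are illustrative names, and the codes shown follow the ones used in the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'surgical_lesion': ['no', 'yes', 'no'],
    'nasogastric_tube': ['none', np.nan, 'significant'],
})

# Hypothetical subset of the mappings; extend with one entry per column.
mappings = {
    'surgical_lesion': {'no': 0, 'yes': 1},
    'nasogastric_tube': {'none': 0, 'slight': 1, 'significant': 2},
}

for col, mapping in mappings.items():
    toy[col] = toy[col].map(mapping)   # NaN stays NaN

# Check that no text columns remain before imputing.
remaining = toy.select_dtypes(exclude='number').columns.tolist()
print(remaining)   # []
```

Keeping the mappings in one place makes it easier to see which code was assigned to which category when interpreting results later.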

When all the values have been converted to numbers, we can work on filling in the missing values

In [None]:
df

All of our data is now in numerical form. <br>
Let's now look at our missing data

In [None]:
null = df.isnull().sum()  # count the missing values in each column
# express the missing values as a percentage of the rows
percent = null / len(df) * 100
# keep only the columns that have missing data, largest percentage first
null_t = percent[null > 0].sort_values(ascending=False)
null_t

**SimpleImputer**<br>
The SimpleImputer is a data transform that is first configured with the statistic to calculate for each column. When it is fit, it computes that statistic, and when it transforms the data it replaces each missing value with it.<br>

Choices of strategy are: <br>
- “mean”: replace missing values using the mean of each column. **Can only be used with numeric data.**

- “median”: replace missing values using the median of each column. **Can only be used with numeric data.**

- “most_frequent”: replace missing values using the most frequent value in each column. **Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.**

- “constant”: replace missing values with fill_value. **Can be used with strings or numeric data.**
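
The four strategies can be compared on a single made-up column. This is a minimal sketch; the array `X` and the `fill_value` of -1 are illustrative choices, not values from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column with a single missing entry in row 2.
X = np.array([[1.0], [2.0], [np.nan], [2.0], [7.0]])

results = {}
for strategy in ['mean', 'median', 'most_frequent']:
    filled = SimpleImputer(strategy=strategy).fit_transform(X)
    results[strategy] = filled[2, 0]

# 'constant' needs fill_value to say what to insert.
filled = SimpleImputer(strategy='constant', fill_value=-1).fit_transform(X)
results['constant'] = filled[2, 0]

print(results)   # mean 3.0, median 2.0, most_frequent 2.0, constant -1.0
```

The mean (3.0) is pulled upward by the single large value 7, while the median and the most frequent value both stay at 2.0, which is one reason the median is often preferred for skewed columns.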

In [None]:
import numpy as np
# make sure any 'Nan' strings are treated as missing values
df.replace('Nan', np.nan, inplace=True)
# define an imputer that replaces missing values with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit on the rectal_temp column to learn its mean
imputer = imputer.fit(df[['rectal_temp']])

In [None]:
df['rectal_temp']=imputer.transform(df[['rectal_temp']])

In [None]:
print(df['rectal_temp'])
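
The fit and transform steps can also be combined with `fit_transform` and applied to several columns at once; the value learned for each column is stored in the fitted imputer's `statistics_` attribute. A sketch on a made-up frame (`toy` and its values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    'rectal_temp': [38.0, np.nan, 39.0],
    'pulse': [60.0, 80.0, np.nan],
})

cols = ['rectal_temp', 'pulse']
imputer = SimpleImputer(strategy='mean')
toy[cols] = imputer.fit_transform(toy[cols])   # learn the means and fill in one step

print(imputer.statistics_)        # learned column means: 38.5 and 70.0
print(toy.isnull().sum().sum())   # 0 missing values remain
```

Inspecting `statistics_` is a quick way to confirm which value was written into the missing cells of each column.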

In [None]:
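# make sure any 'Nan' strings are treated as missing values
df.replace('Nan', np.nan, inplace=True)
# define an imputer that replaces missing values with the median of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
# fit on the pulse column to learn its median
imputer = imputer.fit(df[['pulse']])
# fit only learns the median; the pulse column is unchanged until transform is called
print(df['pulse'])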