<a href="https://colab.research.google.com/github/drshahizan/python-tutorial/blob/main/exercise/sarahwardina/exercise3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/drshahizan/dataset/main/eda/hepatitis.csv')
df.head(10)

Unnamed: 0,age,sex,steroid,antivirals,fatigue,malaise,anorexia,liver_big,liver_firm,spleen_palpable,spiders,ascites,varices,bilirubin,alk_phosphate,sgot,albumin,protime,histology,class
0,30,male,False,False,False,False,False,False,False,False,False,False,False,1.0,85.0,18.0,4.0,,False,live
1,50,female,False,False,True,False,False,False,False,False,False,False,False,0.9,135.0,42.0,3.5,,False,live
2,78,female,True,False,True,False,False,True,False,False,False,False,False,0.7,96.0,32.0,4.0,,False,live
3,31,female,,True,False,False,False,True,False,False,False,False,False,0.7,46.0,52.0,4.0,80.0,False,live
4,34,female,True,False,False,False,False,True,False,False,False,False,False,1.0,,200.0,4.0,,False,live
5,34,female,True,False,False,False,False,True,False,False,False,False,False,0.9,95.0,28.0,4.0,75.0,False,live
6,51,female,False,False,True,False,True,True,False,True,True,False,False,,,,,,False,die
7,23,female,True,False,False,False,False,True,False,False,False,False,False,1.0,,,,,False,live
8,39,female,True,False,True,False,False,True,True,False,False,False,False,0.7,,48.0,4.4,,False,live
9,30,female,True,False,False,False,False,True,False,False,False,False,False,1.0,,120.0,3.9,,False,live


#Identify missing values
The preview already shows a problem with the dataset: some columns, for example protime, have no value in several rows. In those cells pandas stores NaN, which means that the value is missing.

To check whether our dataset contains missing values, we can use the function isna(), which returns, for each cell of the dataset, whether it is NaN or not. Then we can count how many missing values there are in each column.

In [3]:
df.isna().sum()

age                 0
sex                 0
steroid             1
antivirals          0
fatigue             1
malaise             1
anorexia            1
liver_big          10
liver_firm         11
spleen_palpable     5
spiders             5
ascites             5
varices             5
bilirubin           6
alk_phosphate      29
sgot                4
albumin            16
protime            67
histology           0
class               0
dtype: int64
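
To make the mechanics of isna().sum() concrete, here is a minimal sketch on a toy dataframe. The frame `toy` and its values are made up for illustration and are not taken from hepatitis.csv.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not rows from hepatitis.csv)
toy = pd.DataFrame({
    "age": [30, 50, 78],
    "steroid": [False, np.nan, True],
    "protime": [np.nan, 80.0, np.nan],
})

# isna() returns a boolean DataFrame of the same shape: True where a cell is missing
mask = toy.isna()

# Summing booleans counts the True values, giving the number of NaN per column
missing_per_column = mask.sum()
print(missing_per_column)
```

Because the mask has the same shape as the dataframe, it can also be used to locate the affected rows, for example with toy[mask.any(axis=1)].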

Now we can compute the percentage of missing values for each column by dividing the previous result by the length of the dataset (len(df)) and multiplying by 100.

In [4]:
df.isna().sum()/len(df)*100

age                 0.000000
sex                 0.000000
steroid             0.645161
antivirals          0.000000
fatigue             0.645161
malaise             0.645161
anorexia            0.645161
liver_big           6.451613
liver_firm          7.096774
spleen_palpable     3.225806
spiders             3.225806
ascites             3.225806
varices             3.225806
bilirubin           3.870968
alk_phosphate      18.709677
sgot                2.580645
albumin            10.322581
protime            43.225806
histology           0.000000
class               0.000000
dtype: float64
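
As a side note, the same percentages can probably be obtained a little more directly by taking the mean of the boolean mask, since the mean of True/False values is the fraction of True. The sketch below checks this on a toy frame with made-up values; `toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not rows from hepatitis.csv)
toy = pd.DataFrame({
    "age": [30, 50, 78, 31],
    "protime": [np.nan, 80.0, np.nan, 75.0],
})

# Approach used in the notebook: count of NaN divided by the number of rows
pct_by_count = toy.isna().sum() / len(toy) * 100

# Equivalent shortcut: the mean of a boolean mask is the fraction of True values
pct_by_mean = toy.isna().mean() * 100

# Sorting puts the most incomplete columns first
print(pct_by_mean.sort_values(ascending=False))
```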

#Drop missing values

Missing values can be dropped in one of two ways:

* remove the rows that contain missing values
* remove the whole column that contains missing values

We can use the function dropna() and specify the axis to consider. If we set axis=0, every row containing at least one missing value is dropped; if we set axis=1, every column containing at least one missing value is dropped. If we apply df.dropna(axis=0), 80 rows of the dataset remain. If we apply df.dropna(axis=1), only the columns age, sex, antivirals, histology and class remain.

Note that dropna() returns a new dataframe and leaves the original dataframe unchanged. To store the changes in the original dataframe df, we can pass the argument inplace=True (df.dropna(axis=1, inplace=True)).

In [5]:
df.dropna(axis=1)

Unnamed: 0,age,sex,antivirals,histology,class
0,30,male,False,False,live
1,50,female,False,False,live
2,78,female,False,False,live
3,31,female,True,False,live
4,34,female,False,False,live
...,...,...,...,...,...
150,46,female,False,True,die
151,44,female,False,True,live
152,61,female,False,True,live
153,53,male,False,True,live
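
The behaviour of the two axes, and the fact that dropna() does not modify the original dataframe, can be sketched on a toy frame. The values below are made up for illustration; the subset argument shown at the end is an extra option of dropna() that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not rows from hepatitis.csv)
toy = pd.DataFrame({
    "age": [30, 50, 78, 31],
    "sex": ["male", "female", "female", "female"],
    "protime": [np.nan, 80.0, np.nan, 75.0],
})

# axis=0 drops every row that contains at least one NaN
rows_kept = toy.dropna(axis=0)

# axis=1 drops every column that contains at least one NaN
cols_kept = toy.dropna(axis=1)

# subset restricts the check to the listed columns (here only protime)
protime_known = toy.dropna(subset=["protime"])

print(rows_kept.shape, list(cols_kept.columns))
print(toy.shape)  # the original frame still has all rows and columns
```

Dropping a column such as protime discards every observed value in it as well, so it is usually worth checking the percentage of missing values before choosing between the two axes.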


#Replace missing values
A good strategy when dealing with missing values involves their replacement with another value. Usually, the following strategies are adopted:

* for numerical values replace the missing value with the average value of the column
* for categorical values replace the missing value with the most frequent value of the column
* use other functions

To replace missing values, three functions can be used: fillna(), replace() and interpolate(). The fillna() function replaces all the NaN values with the value passed as argument. For example, all the NaN values in the numeric columns could be replaced with the average value of their column. To find out which columns are numeric, we can list the type of each column with the attribute dtypes as follows:

In [6]:
df.dtypes

age                  int64
sex                 object
steroid             object
antivirals            bool
fatigue             object
malaise             object
anorexia            object
liver_big           object
liver_firm          object
spleen_palpable     object
spiders             object
ascites             object
varices             object
bilirubin          float64
alk_phosphate      float64
sgot               float64
albumin            float64
protime            float64
histology             bool
class               object
dtype: object
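
Before applying them to the dataset, the three functions mentioned above can be compared on a small series. This is only a sketch on made-up numbers; the series `s` is hypothetical and interpolate() is used with its default linear method.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series with two gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# fillna(): every NaN receives the same value, here the mean of the observed values
filled_mean = s.fillna(s.mean())

# replace(): substitutes one value with another, here NaN with 0.0
replaced = s.replace(np.nan, 0.0)

# interpolate(): each NaN is estimated from its neighbours (linear by default)
interpolated = s.interpolate()

print(filled_mean.tolist())
print(replaced.tolist())
print(interpolated.tolist())
```

Interpolation assumes that the order of the rows is meaningful, which is typical of time series and may not hold for a table of independent patients.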

#Numeric columns

In [7]:
import numpy as np
numeric = df.select_dtypes(include=np.number)
numeric_columns = numeric.columns

In [8]:
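# Compute the mean over the numeric columns only; calling df.mean() on the
# whole dataframe also tries the non-numeric columns, which triggers a warning
# in older pandas versions and an error in newer ones.
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())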


In [9]:
df.isna().sum()/len(df)*100

age                0.000000
sex                0.000000
steroid            0.645161
antivirals         0.000000
fatigue            0.645161
malaise            0.645161
anorexia           0.645161
liver_big          6.451613
liver_firm         7.096774
spleen_palpable    3.225806
spiders            3.225806
ascites            3.225806
varices            3.225806
bilirubin          0.000000
alk_phosphate      0.000000
sgot               0.000000
albumin            0.000000
protime            0.000000
histology          0.000000
class              0.000000
dtype: float64
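
The remaining missing values sit in non-numeric columns. Following the second strategy listed earlier, they could be replaced with the most frequent value of each column. The sketch below shows one possible way to do this on a toy frame; the frame, its string labels and the variable names are assumptions made for illustration, and this is not necessarily how the notebook continues.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not rows from hepatitis.csv)
toy = pd.DataFrame({
    "age": [30, 50, 78, 31],
    "fatigue": ["no", "yes", np.nan, "no"],
    "class": ["live", np.nan, "live", "die"],
})

# Select every column that is not numeric
categorical_columns = toy.select_dtypes(exclude=np.number).columns

for col in categorical_columns:
    # mode() returns the most frequent value(s), ignoring NaN; take the first one
    most_frequent = toy[col].mode(dropna=True).iloc[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
print(toy.isna().sum())
```

When two values are equally frequent, mode() returns both and the code above simply takes the first, so ties are resolved arbitrarily.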