### Lecture 22: Handling Missing Values: Imputing a Categorical Column Using the Mode or the Word 'Missing'

Steps (Using Pandas)
1. Checking which columns have missing values (MCAR: missing completely at random)
2. Splitting data into training & testing
3. Separating the columns on which we apply imputation
  <br> a. Making sure NaN values are MCAR
4. Getting the mode of the column
5. Replacing missing values with the mode
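
The five steps above can be sketched end to end on a small toy frame. This is a minimal illustration under assumptions: the values are made up (not the lecture's dataset), and names such as `toy`, `X_tr`, and `toy_mode` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with made-up values (not the lecture's dataset)
toy = pd.DataFrame({
    "Stage_fear": ["Yes", "No", np.nan, "No", "No", "Yes", np.nan, "No", "Yes", "No"],
    "Personality": ["Introvert", "Extrovert", "Extrovert", "Extrovert", "Extrovert",
                    "Introvert", "Introvert", "Extrovert", "Introvert", "Extrovert"],
})

# Step 1: percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Step 2: split before computing any statistic
X_toy = toy.drop(columns="Personality")
y_toy = toy["Personality"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
X_tr, X_te = X_tr.copy(), X_te.copy()  # work on independent copies

# Step 3: columns that contain NaN in the training split
cols_with_nan = [c for c in X_tr.columns if X_tr[c].isnull().any()]

# Step 4: mode computed from the training split only
toy_mode = X_tr["Stage_fear"].mode()[0]

# Step 5: the same training mode fills both splits
X_tr["Stage_fear"] = X_tr["Stage_fear"].fillna(toy_mode)
X_te["Stage_fear"] = X_te["Stage_fear"].fillna(toy_mode)
```

Computing the mode on the training split and reusing it for the test split keeps information from the test rows out of the imputation value.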

Importing Useful Libraries

In [41]:
from sklearn.model_selection import train_test_split
import pandas as pd

Importing Data

In [42]:
df=pd.read_csv('personality_dataset.csv',usecols=['Stage_fear','Time_spent_Alone','Social_event_attendance','Personality'])
df.head()


   Time_spent_Alone  Stage_fear  Social_event_attendance  Personality
0               4.0          No                      4.0    Extrovert
1               9.0         Yes                      0.0    Introvert
2               9.0         Yes                      1.0    Introvert
3               0.0          No                      6.0    Extrovert
4               3.0          No                      9.0    Extrovert


Step 1

In [43]:
df.isnull().mean()*100

Time_spent_Alone           2.172414
Stage_fear                 2.517241
Social_event_attendance    2.137931
Personality                0.000000
dtype: float64

Step 2

In [44]:
X=df.drop(columns='Personality')
Y=df.Personality

In [45]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)

In [46]:
print(X_train.shape)
print(X_test.shape)

(2320, 3)
(580, 3)


Step 3

In [47]:
col=[var for var in X_train.columns if X_train[var].isnull().mean()>0]

In [48]:
print(col)


['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance']


In [49]:
X_train.sample(10)

      Time_spent_Alone  Stage_fear  Social_event_attendance
2854               5.0         Yes                      0.0
2372               2.0          No                      9.0
2329               4.0         Yes                      2.0
2018               1.0          No                      7.0
1857               7.0         Yes                      2.0
1571               7.0         Yes                      3.0
2309               9.0         Yes                      2.0
2155               3.0          No                      8.0
2352               8.0         Yes                      2.0
2192               0.0          No                      NaN


Step 4

In [50]:
Stage_fear_mode=X_train['Stage_fear'].mode()[0]

In [51]:
print(Stage_fear_mode)

No


In [52]:
X_train['Stage_fear']

2078     No
163     Yes
1938    Yes
252     Yes
2232    Yes
       ... 
1638     No
1095     No
1130    Yes
1294    Yes
860     Yes
Name: Stage_fear, Length: 2320, dtype: object
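
The `[0]` in Step 4 is needed because `Series.mode()` returns a Series rather than a single value. The sketch below, using made-up values, shows that NaN is ignored by default and that a tie returns every tied value in sorted order, so `[0]` then picks the first one alphabetically.

```python
import numpy as np
import pandas as pd

# NaN appears three times, "No" twice, "Yes" once (made-up values)
s = pd.Series(["Yes", "No", "No", np.nan, np.nan, np.nan])

modes = s.mode()        # NaN is excluded by default (dropna=True)
single_mode = modes[0]  # first (here: only) entry of the returned Series

# A tie: each value occurs once
tied = pd.Series(["Yes", "No", np.nan])
tied_modes = tied.mode()  # all tied values, sorted
```

Here `single_mode` is `"No"` even though NaN is the most common entry, and `tied_modes` holds both `"No"` and `"Yes"`.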

Step 5

In [53]:
X_train['Stage_fear']=X_train['Stage_fear'].fillna(Stage_fear_mode)
X_test['Stage_fear']=X_test['Stage_fear'].fillna(Stage_fear_mode)

In [54]:
print(X_train.isnull().mean()*100)

Time_spent_Alone           2.068966
Stage_fear                 0.000000
Social_event_attendance    2.068966
dtype: float64
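
Mode imputation adds every filled row to the most frequent category, so it shifts the category proportions. The sketch below exaggerates this with a made-up column that is 50% missing. In the lecture's data only about 2.5% of `Stage_fear` is missing, so the shift there should be much smaller.

```python
import numpy as np
import pandas as pd

# Made-up column: 3 "No", 2 "Yes", 5 missing
before = pd.Series(["No", "No", "No", "Yes", "Yes"] + [np.nan] * 5)
after = before.fillna(before.mode()[0])

# Proportions among non-missing values before and after imputation
share_before = before.value_counts(normalize=True)  # NaN excluded by default
share_after = after.value_counts(normalize=True)
```

In this toy case the share of `"No"` rises from 0.6 to 0.8, which is why comparing `value_counts` before and after is a reasonable sanity check when the missing fraction is large.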


Steps (Using scikit-learn)
1. Checking which columns have missing values
2. Splitting data into training & testing
3. Separating the columns on which we apply imputation
 * Used when values are MCAR
4. Creating instances of the SimpleImputer class
5. Fitting on the training data, then transforming both training and test data
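
Steps 4 and 5 can be sketched on a single toy column. The values and the names `train`, `test`, and `imputer` are assumptions for illustration. The frames use object dtype so the example does not depend on which string dtype pandas picks by default.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test columns
train = pd.DataFrame({"Stage_fear": ["No", "Yes", "No", np.nan]}, dtype=object)
test = pd.DataFrame({"Stage_fear": [np.nan, "Yes"]}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns the mode from the training data and fills it in
train_filled = imputer.fit_transform(train[["Stage_fear"]])

# transform reuses the learned mode; nothing is re-fitted on the test data
test_filled = imputer.transform(test[["Stage_fear"]])

# The learned fill value per column is stored in statistics_
learned_mode = imputer.statistics_[0]
```

Both results are NumPy arrays, and `learned_mode` is the training mode that was used to fill the test rows as well.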

Importing Useful Libraries

In [55]:

from sklearn.impute import SimpleImputer

Step 4

In [56]:
# SimpleImputer instance that fills NaN with the mode (most frequent value)
SI_mode=SimpleImputer(strategy='most_frequent')

# SimpleImputer instance that fills NaN with the word 'Missing'
SI_missing=SimpleImputer(strategy='constant',fill_value='Missing')
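
The two strategies are alternatives for the same column. A minimal side-by-side sketch on a made-up column (the array `col` is hypothetical) shows the difference:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up categorical column with a single missing entry
col = np.array([["No"], ["Yes"], ["No"], [np.nan]], dtype=object)

# Strategy 1: fill with the most frequent value
mode_result = SimpleImputer(strategy="most_frequent").fit_transform(col)

# Strategy 2: fill with a fixed label, which keeps missingness visible as its own category
const_result = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(col)
```

The last row becomes `"No"` under the first strategy and `"Missing"` under the second.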

In [57]:
# Saving the column names so the NumPy array returned by fit_transform can be turned back into a DataFrame
columns=X_train.columns

Step 5

In [58]:
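# For mode: strategy='most_frequent' fills every column passed in,
# so the numeric columns are also filled with their most frequent value

X_train=SI_mode.fit_transform(X_train)
X_test=SI_mode.transform(X_test)

# For 'Missing' word: an alternative to the mode imputer.
# Run after the mode imputer, as here, it finds no NaN left to fill

X_train=SI_missing.fit_transform(X_train)
X_test=SI_missing.transform(X_test)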


In [59]:
X_train=pd.DataFrame(X_train,columns=columns)
X_test=pd.DataFrame(X_test,columns=columns)
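
`pd.DataFrame(array, columns=...)` numbers the rows from 0, so the original row labels from the split are not carried over. If those labels matter, one option is to pass `index=` when rebuilding, and to give the imputer only the categorical column so the numeric columns stay untouched. This is a sketch with made-up values. `X_demo`, `filled`, and `X_out` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with non-default row labels
idx = [2078, 163, 1938]
X_demo = pd.DataFrame({
    "Time_spent_Alone": pd.Series([4.0, np.nan, 9.0], index=idx),
    "Stage_fear": pd.Series(["No", np.nan, "No"], index=idx, dtype=object),
})

imputer = SimpleImputer(strategy="most_frequent")

# Impute only the categorical column and keep the original row labels
filled = pd.DataFrame(
    imputer.fit_transform(X_demo[["Stage_fear"]]),
    columns=["Stage_fear"],
    index=X_demo.index,
)

X_out = X_demo.copy()
X_out["Stage_fear"] = filled["Stage_fear"]
```

In `X_out` the row labels are unchanged, `Stage_fear` has no NaN, and the NaN in `Time_spent_Alone` is left for a numeric imputation method.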


In [60]:
X_train.isnull().mean()*100

Time_spent_Alone           0.0
Stage_fear                 0.0
Social_event_attendance    0.0
dtype: float64

In [61]:
X_train['Stage_fear'].sample(20)

1310     No
1872     No
1815    Yes
2193    Yes
1457     No
1516    Yes
2275     No
2273     No
1062    Yes
1497     No
866      No
1750    Yes
501      No
1533     No
1370    Yes
610     Yes
1016     No
2183     No
567      No
1772     No
Name: Stage_fear, dtype: object