## Classification Machine Learning Problem with an Imbalanced Dataset

In [2]:
# About the Dataset
# https://archive.ics.uci.edu/ml/datasets/Adult

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.preprocessing import StandardScaler,MinMaxScaler,LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score,GridSearchCV,RepeatedStratifiedKFold
from sklearn.compose import ColumnTransformer

import collections

In [4]:
# Import the Dataset
df = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Adult_Census_Income/adult.csv")

In [5]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [6]:
df.shape

(32561, 15)

In [7]:
# As we can see in the data above, there are many '?' entries. These are missing values, so we replace them with NaN.
# We will drop these rows, but first we will check how many rows contain missing values.
df[df == '?'] = np.nan
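
As an alternative to replacing the placeholders after loading, the `'?'` markers can be converted to NaN while the file is parsed. The following is a minimal sketch on a hypothetical three-row sample (the in-memory CSV and the name `sample` are assumptions for illustration, not part of the original notebook):

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the '?' placeholders in adult.csv.
sample_csv = io.StringIO(
    "age,workclass,occupation\n"
    "90,?,?\n"
    "82,Private,Exec-managerial\n"
    "66,?,?\n"
)

# na_values tells the parser to treat '?' as missing while reading.
sample = pd.read_csv(sample_csv, na_values="?")

# Per-column count of missing values.
print(sample.isnull().sum())
```

The same `na_values="?"` argument could be passed to the `pd.read_csv` call that loads the full dataset.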

In [8]:
# check the null value counts
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     583
income               0
dtype: int64

In [9]:
# Let's check the value counts in the columns that contain nulls
for i in ['workclass','occupation','native.country']:
    print("Unique Value count :\n",df[i].value_counts(),"\n\n")

Unique Value count :
 Private             22696
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64 


Unique Value count :
 Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: occupation, dtype: int64 


Unique Value count :
 United-States                 29170
Mexico                          643
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba   

As we can see, these categorical columns have many unique values, so we cannot simply replace the null values with the most frequent value. I have created a video on how to replace missing values in categorical columns; you can refer to it and try imputing these nulls.
Here I am simply dropping the rows with null values, as we have a large dataset.
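
For readers who want to try imputation instead of dropping rows, here is a minimal sketch of two common options on a hypothetical toy frame (the frame `toy` and the `"Unknown"` label are assumptions for illustration; neither option is what this notebook does):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing categorical values.
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
})

# Option 1: fill each column with its most frequent value (the mode).
mode_filled = toy.fillna(toy.mode().iloc[0])

# Option 2: keep missingness as its own explicit category.
flag_filled = toy.fillna("Unknown")

print(mode_filled)
print(flag_filled)
```

Option 2 avoids inflating the dominant category, which matters when one value (such as `Private`) already accounts for most rows.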

In [10]:
# lets remove these null values
df1 = df.dropna(axis=0)

In [11]:
# Lets check the columns "education" and "education.num"
df1.education.value_counts()

HS-grad         9840
Some-college    6678
Bachelors       5044
Masters         1627
Assoc-voc       1307
11th            1048
Assoc-acdm      1008
10th             820
7th-8th          557
Prof-school      542
9th              455
12th             377
Doctorate        375
5th-6th          288
1st-4th          151
Preschool         45
Name: education, dtype: int64

In [12]:
df1["education.num"].value_counts()

9     9840
10    6678
13    5044
14    1627
11    1307
7     1048
12    1008
6      820
4      557
15     542
5      455
8      377
16     375
3      288
2      151
1       45
Name: education.num, dtype: int64

We can see that the columns "education" and "education.num" are redundant: "education.num" is an ordinal (numeric) representation of "education", and the value counts of the two columns match exactly. Let's remove the "education" column.
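
Matching value counts suggest redundancy, but a direct check is to confirm that each label maps to exactly one code and vice versa. A minimal sketch on a hypothetical five-row frame (the frame `toy` is an assumption; on the real data the same two `groupby` calls would be applied to `df1`):

```python
import pandas as pd

# Hypothetical sample of the two columns.
toy = pd.DataFrame({
    "education": ["HS-grad", "Some-college", "HS-grad", "7th-8th", "Some-college"],
    "education.num": [9, 10, 9, 4, 10],
})

# Count distinct codes per label and distinct labels per code.
codes_per_label = toy.groupby("education")["education.num"].nunique()
labels_per_code = toy.groupby("education.num")["education"].nunique()

# The columns are redundant if the mapping is one-to-one in both directions.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```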

In [13]:
# Let's remove the "education" column (columns= already implies axis=1)
df1 = df1.drop(columns="education")

In [14]:
# Lets check the Target column "income" distribution
df1.income.value_counts()

<=50K    22654
>50K      7508
Name: income, dtype: int64

In [15]:
# Let's calculate the proportion of each class (look up by label, not by position)
print("Class '<=50K' proportion is: ",df1.income.value_counts()["<=50K"]/df1.shape[0])
print("Class '>50K' proportion is: ",df1.income.value_counts()[">50K"]/df1.shape[0])

Class '<=50K' proportion is:  0.7510775147536636
Class '>50K' proportion is:  0.24892248524633645


We can see that the classes in the target variable are not balanced, which means this is an imbalanced dataset.
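
The class proportions can also be obtained in one call with `value_counts(normalize=True)`. A minimal sketch that rebuilds a label series from the class counts reported above (the series `income` is a stand-in constructed for illustration, not the real column):

```python
import pandas as pd

# Stand-in series built from the class counts shown in the output above.
income = pd.Series(["<=50K"] * 22654 + [">50K"] * 7508, name="income")

# normalize=True returns fractions that sum to 1 instead of raw counts.
proportions = income.value_counts(normalize=True)

# Ratio of the majority class to the minority class.
imbalance_ratio = proportions.max() / proportions.min()

print(proportions.round(3))
print("majority/minority ratio:", round(imbalance_ratio, 2))
```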

In [16]:
df1.head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
5,34,Private,216864,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
6,38,Private,150601,6,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K


As we can see, the dataset has both numerical and categorical columns. The categorical information includes both nominal and ordinal variables.

In [17]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 1 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30162 non-null  int64 
 1   workclass       30162 non-null  object
 2   fnlwgt          30162 non-null  int64 
 3   education.num   30162 non-null  int64 
 4   marital.status  30162 non-null  object
 5   occupation      30162 non-null  object
 6   relationship    30162 non-null  object
 7   race            30162 non-null  object
 8   sex             30162 non-null  object
 9   capital.gain    30162 non-null  int64 
 10  capital.loss    30162 non-null  int64 
 11  hours.per.week  30162 non-null  int64 
 12  native.country  30162 non-null  object
 13  income          30162 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.5+ MB


In [18]:
df1

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
5,34,Private,216864,9,Divorced,Other-service,Unmarried,White,Female,0,3770,45,United-States,<=50K
6,38,Private,150601,6,Separated,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


In [19]:
# Select the Categorical and Numerical columns in Features
cat_cols = df1.drop("income",axis=1).select_dtypes(include="object").columns.tolist()

In [20]:
num_cols = df1.drop("income",axis=1).select_dtypes(exclude="object").columns.tolist()

In [21]:
print("Categorical Columns",cat_cols)
print("Numerical Columns",num_cols)

Categorical Columns ['workclass', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
Numerical Columns ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']


In [22]:
df1.shape

(30162, 14)

In [23]:
df1.head(2)

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K


In [24]:
# As we have categorical columns, we have to encode them.
# We also have numerical columns, so we have to scale them.
# We will first one-hot encode the categorical columns and then normalize the numerical ones.

In [25]:
df_cat = df1[cat_cols]

In [26]:
onehot = OneHotEncoder()
onehot.fit(df_cat)

OneHotEncoder()

In [27]:
onehot_np = onehot.transform(df_cat).toarray()

In [28]:
onehot_np.shape

(30162, 82)

In [29]:
type(onehot_np)

numpy.ndarray

In [30]:
onehot_np

array([[0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.]])

In [31]:
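# Note: get_feature_names() was removed in scikit-learn 1.2. On newer versions use
# get_feature_names_out(), which prefixes names with the column name (e.g. 'workclass_Private')
# instead of 'x0', 'x1', ... when the encoder was fit on a DataFrame.
onehot_cols = onehot.get_feature_names()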

In [32]:
onehot_cols

array(['x0_Federal-gov', 'x0_Local-gov', 'x0_Private', 'x0_Self-emp-inc',
       'x0_Self-emp-not-inc', 'x0_State-gov', 'x0_Without-pay',
       'x1_Divorced', 'x1_Married-AF-spouse', 'x1_Married-civ-spouse',
       'x1_Married-spouse-absent', 'x1_Never-married', 'x1_Separated',
       'x1_Widowed', 'x2_Adm-clerical', 'x2_Armed-Forces',
       'x2_Craft-repair', 'x2_Exec-managerial', 'x2_Farming-fishing',
       'x2_Handlers-cleaners', 'x2_Machine-op-inspct', 'x2_Other-service',
       'x2_Priv-house-serv', 'x2_Prof-specialty', 'x2_Protective-serv',
       'x2_Sales', 'x2_Tech-support', 'x2_Transport-moving', 'x3_Husband',
       'x3_Not-in-family', 'x3_Other-relative', 'x3_Own-child',
       'x3_Unmarried', 'x3_Wife', 'x4_Amer-Indian-Eskimo',
       'x4_Asian-Pac-Islander', 'x4_Black', 'x4_Other', 'x4_White',
       'x5_Female', 'x5_Male', 'x6_Cambodia', 'x6_Canada', 'x6_China',
       'x6_Columbia', 'x6_Cuba', 'x6_Dominican-Republic', 'x6_Ecuador',
       'x6_El-Salvador', 'x6_Englan

In [33]:
# Create a DataFrame from the one-hot encoded data
df_onehot = pd.DataFrame(onehot_np,columns=onehot_cols)

In [34]:
df_onehot.shape

(30162, 82)

In [35]:
df_onehot.head()

Unnamed: 0,x0_Federal-gov,x0_Local-gov,x0_Private,x0_Self-emp-inc,x0_Self-emp-not-inc,x0_State-gov,x0_Without-pay,x1_Divorced,x1_Married-AF-spouse,x1_Married-civ-spouse,...,x6_Portugal,x6_Puerto-Rico,x6_Scotland,x6_South,x6_Taiwan,x6_Thailand,x6_Trinadad&Tobago,x6_United-States,x6_Vietnam,x6_Yugoslavia
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [36]:
# Scale the Numerical Dataset
minmax = MinMaxScaler()
minmax.fit(df1[num_cols])

MinMaxScaler()

In [37]:
scale_np = minmax.transform(df1[num_cols])

In [38]:
scale_np.shape

(30162, 6)

In [39]:
# Create a DataFrame from the scaled data
df_scale = pd.DataFrame(scale_np,columns=num_cols)

In [40]:
df_scale.shape

(30162, 6)

In [41]:
df_scale.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
0,0.890411,0.08097,0.533333,0.0,1.0,0.173469
1,0.506849,0.086061,0.2,0.0,0.895317,0.397959
2,0.328767,0.170568,0.6,0.0,0.895317,0.397959
3,0.232877,0.138072,0.533333,0.0,0.865473,0.44898
4,0.287671,0.093024,0.333333,0.0,0.865473,0.397959
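
With its default range, `MinMaxScaler` maps each column to [0, 1] using x' = (x - min) / (max - min). The scaled ages above are consistent with a minimum age of 17 and a maximum of 90, for example (82 - 17) / (90 - 17) ≈ 0.890. A minimal sketch that checks the formula against the scaler on hypothetical ages (the array `ages` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages spanning 17 to 90; any numeric column works the same way.
ages = np.array([[17.0], [54.0], [82.0], [90.0]])

# Scale with scikit-learn (default feature_range is (0, 1)).
scaled = MinMaxScaler().fit_transform(ages)

# Scale by hand with the min-max formula.
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

print(np.allclose(scaled, manual))
print(scaled.ravel().round(6))
```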


In [42]:
# Concat the df_scale and df_onehot
X = pd.concat([df_onehot,df_scale],axis=1).values

In [43]:
# Encode the target "Income"
y = LabelEncoder().fit_transform(df1["income"])

In [44]:
collections.Counter(y)

Counter({0: 22654, 1: 7508})
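
`LabelEncoder` assigns integer codes in sorted label order, so it is worth confirming which class became 0 and which became 1. A minimal sketch on three hypothetical labels (the list and the name `mapping` are assumptions for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Fit on a few example labels taken from the income column's two classes.
encoder = LabelEncoder()
encoded = encoder.fit_transform(["<=50K", ">50K", "<=50K"])

# classes_ holds the original labels; the position of each label is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Here `'<=50K'` sorts before `'>50K'`, which agrees with the counter above: class 0 is the majority class with 22654 rows.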

In [45]:
print(X.shape)
print(y.shape)

(30162, 88)
(30162,)
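
The manual encode, scale and concatenate steps above can also be expressed with `ColumnTransformer`, which is imported at the top of the notebook but not used. The following is a sketch only, on a hypothetical six-row miniature of the census frame; the use of `Pipeline`, `handle_unknown="ignore"` and `n_neighbors=3` are assumptions for illustration rather than the notebook's method:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical miniature of the census frame.
toy = pd.DataFrame({
    "age": [82, 54, 41, 34, 38, 22],
    "hours.per.week": [18, 40, 40, 45, 40, 40],
    "workclass": ["Private", "Private", "State-gov", "Private", "Local-gov", "Private"],
    "sex": ["Female", "Female", "Female", "Female", "Male", "Male"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"],
})
cat_cols = ["workclass", "sex"]
num_cols = ["age", "hours.per.week"]

# One-hot encode the categorical columns and min-max scale the numerical ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("scale", MinMaxScaler(), num_cols),
])

# Chain preprocessing and the classifier so they are fit together.
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

X_toy = toy.drop(columns="income")
y_toy = toy["income"]
model.fit(X_toy, y_toy)

# 3 workclass levels + 2 sex levels + 2 numeric columns = 7 features.
X_encoded = model.named_steps["preprocess"].transform(X_toy)
print(X_encoded.shape)
```

One reason to prefer this design: when such a pipeline is passed to `cross_val_score`, the encoder and scaler are refit on each training split, so statistics from the held-out fold do not leak into preprocessing.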


We will use cross-validation with RepeatedStratifiedKFold.

- As our dataset is imbalanced, plain K-fold can produce folds whose class mix differs from that of the full dataset.
- In StratifiedKFold, the class distribution of the dataset is preserved in the training and test splits.
- Let the ratio of class 0 to class 1 be about 3:1 (in our case 22654/7508 ≈ 3.02).
- Take a toy dataset of 16 points (12 from class 0 and 4 from class 1) and set k=4:

    - Then each test set includes 1 data point from class 1 and 3 data points from class 0.
    - Each training set includes 9 data points from class 0 and 3 data points from class 1.

- Stratified means that each fold will contain the same mixture of examples by class, that is about 75 percent to 25 percent for the majority and minority classes respectively. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.


This means a single model will be fit and evaluated 8 * 3 = 24 times (n_splits=8 and n_repeats=3 in the evaluation function below), and the mean of these runs will be reported.

- This can be achieved using the RepeatedStratifiedKFold scikit-learn class.

- Please refer to the blog post below for the difference between K-fold and StratifiedKFold:
https://towardsdatascience.com/how-to-train-test-split-kfold-vs-stratifiedkfold-281767b93869

- A typical, simplified data science workflow looks like this:

    - Get the training data
    - Clean/preprocess/transform the data
    - Train a machine learning model
    - Evaluate and optimise the model
    - Clean/preprocess/transform new data
    - Use the fitted model on the new data to make predictions.
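
The difference between plain and stratified splitting described above can be seen on a 16-point toy label vector with a 3:1 class ratio. This is a minimal sketch; the arrays `X_toy` and `y_toy` and the helper `fold_class_counts` are hypothetical names introduced for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels with a 3:1 ratio: 12 points of class 0, then 4 points of class 1.
y_toy = np.array([0] * 12 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting


def fold_class_counts(splitter):
    """Return [class0, class1] counts for the test part of every fold."""
    return [
        np.bincount(y_toy[test_idx], minlength=2).tolist()
        for _, test_idx in splitter.split(X_toy, y_toy)
    ]


stratified_counts = fold_class_counts(StratifiedKFold(n_splits=4))
plain_counts = fold_class_counts(KFold(n_splits=4))

print("StratifiedKFold:", stratified_counts)
print("KFold:          ", plain_counts)
```

With unshuffled, label-sorted data like this, plain `KFold` produces test folds that contain only one class, while `StratifiedKFold` keeps 3 points of class 0 and 1 point of class 1 in every test fold.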

In [46]:
# Create the classifier model using the KNN
knncf = KNeighborsClassifier()

In [47]:
# Training the model using the "RepeatedStratifiedKFold" Cross Validation

# Define the Model Evaluation Function

def model_evaluate(X,y,model):
    # Define Evaluation Procedure
    cv_stkfold = RepeatedStratifiedKFold(n_splits=8,n_repeats=3,random_state=42)
    # evaluate the model
    model_score = cross_val_score(model,X,y,scoring='accuracy',cv = cv_stkfold)
    return model_score

In [48]:
scores = model_evaluate(X,y,knncf)

In [49]:
print("model mean score: \n ",np.mean(scores))

model mean score: 
  0.8234092295640728
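
Since roughly 75 percent accuracy is already achievable by always predicting the majority class, it can help to report the spread of the fold scores and a class-aware metric next to plain accuracy. The sketch below uses synthetic data from `make_classification` as a stand-in for the census features; the dataset, the choice of `balanced_accuracy` and the use of `cross_validate` are assumptions for illustration, not results from this notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with a roughly 75/25 class split.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.75, 0.25], random_state=42
)

# Same evaluation procedure as above: 8 folds repeated 3 times = 24 fits.
cv = RepeatedStratifiedKFold(n_splits=8, n_repeats=3, random_state=42)
results = cross_validate(
    KNeighborsClassifier(), X_demo, y_demo, cv=cv,
    scoring=["accuracy", "balanced_accuracy"],
)

# Report mean and standard deviation of each metric across the 24 runs.
for name in ("accuracy", "balanced_accuracy"):
    fold_scores = results[f"test_{name}"]
    print(f"{name}: mean={np.mean(fold_scores):.3f} std={np.std(fold_scores):.3f}")
```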
