# SMOTENC (numeric and categorical features)
**OPIM 5512: Data Science Using Python - University of Connecticut**

-----------------------------------------------
Remember that SMOTE only works with numeric features in a classification problem. That is a real limitation in practice: real-world datasets often contain a mix of categorical and numeric features.

Fortunately, there is an extension to SMOTE called SMOTENC (SMOTE for Nominal and Continuous features) which accounts for categorical features. You just need to pass a list of the column indices of the categorical features. Let's give it a try.



# Import modules and read data

In [None]:
# this is to get rid of some annoying future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# the usual suspects...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for our nearest neighbor algorithm
from sklearn.neighbors import NearestNeighbors
# random number generator
import random
# counter
from collections import Counter


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/shrikant-temburwar/Loan-Prediction-Dataset/master/train.csv')
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


Let's just quickly drop that annoying ID column.

In [None]:
df.drop('Loan_ID', axis=1, inplace=True)
df.shape

(614, 12)

And let's see if there are any missing values - there are, so we drop them for simplicity right now.

In [None]:
#check for missing values
df.isna().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [None]:
df.dropna(inplace=True, axis=0)
print(df.shape) # eek, we lost quite a few rows!

(480, 12)


Now let's check for the class distribution of the target variable.

In [None]:
df['Loan_Status'].value_counts() # imbalanced! uses pandas

Y    332
N    148
Name: Loan_Status, dtype: int64

In [None]:
Counter(df['Loan_Status']) # same result! uses Counter

Counter({'N': 148, 'Y': 332})

# Split into X and y

In [None]:
y = df['Loan_Status']
X = df.drop('Loan_Status', axis=1)
print(X.shape, y.shape)

(480, 11) (480,)


# SMOTENC: for continuous and categorical features
It is not possible to calculate a ‘midpoint’ between two points of binary or categorical data. SMOTENC, an extension of the SMOTE method, handles such features by giving each synthetic sample the most frequently occurring category among the nearest neighbours of a minority class point, while numeric features are still interpolated as in plain SMOTE.
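
To make the idea concrete, here is a minimal sketch of how one synthetic sample could be built from a minority point and its neighbours. It is a simplified illustration, not imblearn's actual implementation: the helper `synth_sample` is hypothetical, the neighbours are made-up toy values rather than the result of a real nearest-neighbour search, and the way SMOTENC measures distance between mixed-type rows is left out entirely.

```python
from collections import Counter

import numpy as np


def synth_sample(base_num, neighbor_nums, neighbor_cats, rng):
    """Build one synthetic sample (hypothetical helper, illustration only)."""
    # numeric part: pick one neighbour at random and interpolate towards it
    j = rng.integers(len(neighbor_nums))
    gap = rng.random()  # random fraction in [0, 1)
    new_num = base_num + gap * (neighbor_nums[j] - base_num)

    # categorical part: most common category among the neighbours, per column
    new_cat = [Counter(col).most_common(1)[0][0] for col in zip(*neighbor_cats)]
    return new_num, new_cat


rng = np.random.default_rng(42)

# toy minority point and three toy neighbours (two numeric features each)
base_num = np.array([4583.0, 128.0])
neighbor_nums = np.array([[3000.0, 66.0],
                          [2583.0, 120.0],
                          [6000.0, 141.0]])
# toy categorical values of the same three neighbours (two categorical features)
neighbor_cats = [("Male", "Urban"), ("Male", "Urban"), ("Female", "Rural")]

new_num, new_cat = synth_sample(base_num, neighbor_nums, neighbor_cats, rng)
print(new_num, new_cat)
```

The numeric values land somewhere on the line between the base point and the chosen neighbour, while each categorical value is simply the majority vote of the neighbours.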

In [None]:
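from imblearn.over_sampling import SMOTENC
# column positions of the categorical features in X (hardcoded for now)
sm = SMOTENC(categorical_features=[0, 1, 2, 3, 4, 10], random_state=42)
# oversample the minority class until both classes have the same count
X_res, y_res = sm.fit_resample(X, y)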
Counter(y_res)

Counter({'N': 332, 'Y': 332})
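
As a quick sanity check on those counts, the arithmetic below works out how many rows were added per class. It only uses the class counts printed in this notebook; the variable names `before`, `after` and `added` are just for illustration.

```python
from collections import Counter

before = Counter({'Y': 332, 'N': 148})  # class counts of y, before resampling
after = Counter({'Y': 332, 'N': 332})   # class counts of y_res, after resampling

# rows added per class: only the minority class 'N' grows
added = {label: after[label] - before[label] for label in after}
print(added)                # {'Y': 0, 'N': 184}
print(sum(after.values()))  # 664
```

So the resampled data has 664 rows: the 480 original rows plus 184 synthetic 'N' rows.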

In [None]:
# it's categorical!
X['Dependents'].value_counts()

0     274
2      85
1      80
3+     41
Name: Dependents, dtype: int64

In [None]:
X.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban
5,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban


SMOTENC is easy to use, but you must tell it which columns are categorical. Rather than hardcoding the indices, you can derive them from the column dtypes. You may find the following helpful...

In [None]:
# keep only the columns stored as strings (pandas 'object' dtype)
tmp = X.select_dtypes(include=["object"])
tmp

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,Property_Area
1,Male,Yes,1,Graduate,No,Rural
2,Male,Yes,0,Graduate,Yes,Urban
3,Male,Yes,0,Not Graduate,No,Urban
4,Male,No,0,Graduate,No,Urban
5,Male,Yes,2,Graduate,Yes,Urban
...,...,...,...,...,...,...
609,Female,No,0,Graduate,No,Rural
610,Male,Yes,3+,Graduate,No,Rural
611,Male,Yes,1,Graduate,No,Urban
612,Male,Yes,2,Graduate,No,Urban


In [None]:
cols = tmp.columns
# look up each categorical column's position in X (the frame passed to SMOTENC)
catList = [X.columns.get_loc(c) for c in cols]
catList # voila!

[0, 1, 2, 3, 4, 10]
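
As an alternative to a list of positions, imblearn's documentation also describes a boolean mask of shape `(n_features,)` as an accepted form of `categorical_features`. The sketch below builds such a mask on a small toy frame with the same column layout as `X`. The frame `X_toy` and its two rows are made up for illustration, and the sketch selects non-numeric columns with `exclude="number"`, which is a different selection rule from the one used in the cell above.

```python
import numpy as np
import pandas as pd

# toy stand-in for X with the same columns (values are illustrative only)
X_toy = pd.DataFrame({
    "Gender": ["Male", "Male"],
    "Married": ["Yes", "Yes"],
    "Dependents": ["1", "3+"],
    "Education": ["Graduate", "Graduate"],
    "Self_Employed": ["No", "No"],
    "ApplicantIncome": [4583, 3000],
    "CoapplicantIncome": [1508.0, 0.0],
    "LoanAmount": [128.0, 66.0],
    "Loan_Amount_Term": [360.0, 360.0],
    "Credit_History": [1.0, 1.0],
    "Property_Area": ["Rural", "Urban"],
})

# every column that is not numeric is treated as categorical
cat_cols = X_toy.select_dtypes(exclude="number").columns

# True where the column is categorical, False otherwise
cat_mask = X_toy.columns.isin(cat_cols)
print(cat_mask)

# the mask and the index list carry the same information
print(np.flatnonzero(cat_mask).tolist())
```

Under that assumption, `cat_mask` could be passed as `categorical_features` in place of `catList`.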

Now re-run SMOTENC with the computed list instead of the hardcoded one!

In [None]:
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(categorical_features = catList, random_state=42)
X_res, y_res = sm.fit_resample(X, y)
Counter(y_res)

Counter({'N': 332, 'Y': 332})

Deriving the indices from the data is more robust than hardcoding a list of values: the code keeps working if columns are added or reordered. If for some reason you didn't have access to SMOTENC, you could one-hot encode your categorical variables so that they look numeric, but this is not ideal, because interpolating between one-hot vectors produces values that correspond to no real category. Use the right tool for the job. Later on, you can learn how to blend all of these topics together in a model.
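
To see why one-hot encoding followed by plain interpolation is not ideal, consider the small sketch below. It one-hot encodes two toy `Property_Area` values and takes the point halfway between them, which is what SMOTE-style interpolation with a gap of 0.5 would produce. The frame `areas` is made up for illustration.

```python
import pandas as pd

# two toy rows with different categories
areas = pd.DataFrame({"Property_Area": ["Urban", "Rural"]})

# one-hot encode so the column "looks numeric"
onehot = pd.get_dummies(areas, dtype=float)
print(onehot)

# the point halfway between the two encoded rows
midpoint = onehot.mean()
print(midpoint)
```

The midpoint is 0.5 in both dummy columns, that is, half Rural and half Urban, which is not a valid category. SMOTENC avoids this by voting on a category instead of interpolating.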

## Look at what you made!

These are the last ten rows of the resampled data, all of them synthetic.

In [None]:
# the synthetic rows are appended after the original ones, so look at the tail
tmp = X_res.tail(10)
tmp

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
654,Male,Yes,0,Graduate,No,4455,8816.731652,351.599997,360.0,0.699512,Semiurban
655,Male,No,0,Not Graduate,No,6244,0.0,117.584941,360.0,0.0,Rural
656,Male,No,2,Graduate,No,3532,0.0,80.867255,203.894176,1.0,Urban
657,Male,Yes,1,Not Graduate,No,2775,1915.354991,144.720548,360.0,0.0,Rural
658,Male,Yes,0,Graduate,No,1952,2927.568644,107.291902,360.0,0.714595,Urban
659,Male,No,0,Graduate,Yes,9965,0.0,215.232026,360.0,1.0,Urban
660,Male,Yes,0,Graduate,No,4543,2128.768003,155.088432,360.0,1.0,Rural
661,Male,No,0,Graduate,No,5776,4068.740765,288.023401,360.0,1.0,Rural
662,Male,No,0,Graduate,Yes,10511,0.0,201.202132,360.0,1.0,Urban
663,Male,Yes,1,Not Graduate,No,2613,1662.563266,102.512047,269.896737,0.749139,Rural
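
Notice that `Credit_History` takes values such as 0.699512 and `Loan_Amount_Term` values such as 203.894176 in the rows above: both columns are stored as floats, so they were interpolated like any other numeric feature. If you would rather treat `Credit_History` as a 0/1 flag (an assumption about what the column means, not something this notebook states), one option is to add its position to the categorical list. The sketch below only computes the extended list; the names `extra` and `catList_plus` are hypothetical.

```python
# column order of X, as shown in X.head()
columns = ["Gender", "Married", "Dependents", "Education", "Self_Employed",
           "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
           "Loan_Amount_Term", "Credit_History", "Property_Area"]

catList = [0, 1, 2, 3, 4, 10]  # positions found from the dtypes earlier

# numeric-looking columns we choose to treat as categorical (an assumption)
extra = ["Credit_History"]

# merge both sets of positions and keep them sorted
catList_plus = sorted(set(catList) | {columns.index(c) for c in extra})
print(catList_plus)  # [0, 1, 2, 3, 4, 9, 10]
```

Passing `catList_plus` as `categorical_features` should then make SMOTENC pick a neighbour's `Credit_History` value instead of interpolating one.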


Of course, all of the samples you are generating are for the MINORITY class - that's the whole point of the Synthetic Minority Over-sampling Technique (SMOTE)!

In [None]:
# labels of the same last ten (synthetic) rows, shown as a DataFrame
tmp = y_res.tail(10).to_frame()
tmp

Unnamed: 0,Loan_Status
654,N
655,N
656,N
657,N
658,N
659,N
660,N
661,N
662,N
663,N


It's just so cool that none of these points exist in the original data... yet they may help your model learn the minority class! Try to see the big picture - you can combine this with everything you have learned on sampling.

# Resources
* https://pythonhealthcare.org/tag/smotenc/