## SMOTE-N

In this notebook, we will cover the essentials of SMOTE-N and the Value Difference Metric.

- First, we will use the Value Difference Metric (VDM) to calculate the distance between categorical values and between observations.
- Second, we will implement SMOTE-N with imbalanced-learn.

In [1]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs
from sklearn.preprocessing import OrdinalEncoder

# from imblearn
from imblearn.over_sampling import SMOTEN
from imblearn.metrics.pairwise import ValueDifferenceMetric

## Distance between Values

In [6]:
# let's create a dataset with just 1 feature

X = np.array(['green'] * 10 + ['red'] * 10 + ['blue'] * 10)
y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]

print(X)
print(y)

['green' 'green' 'green' 'green' 'green' 'green' 'green' 'green' 'green'
 'green' 'red' 'red' 'red' 'red' 'red' 'red' 'red' 'red' 'red' 'red'
 'blue' 'blue' 'blue' 'blue' 'blue' 'blue' 'blue' 'blue' 'blue' 'blue']
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


In [8]:
type(X[0])

numpy.str_

In [11]:
# we also need to reshape X into a 2D array so that it can be passed to the encoder
X = X.reshape(-1,1)
X

array([['green'],
       ['green'],
       ['green'],
       ['green'],
       ['green'],
       ['green'],
       ['green'],
       ['green'],
       ['green'],
       ['green'],
       ['red'],
       ['red'],
       ['red'],
       ['red'],
       ['red'],
       ['red'],
       ['red'],
       ['red'],
       ['red'],
       ['red'],
       ['blue'],
       ['blue'],
       ['blue'],
       ['blue'],
       ['blue'],
       ['blue'],
       ['blue'],
       ['blue'],
       ['blue'],
       ['blue']], dtype='<U5')

In [12]:
# the function "ValueDifferenceMetric" works
# only with encoded variables, so we need to transform
# the strings into numbers first

encoder = OrdinalEncoder(dtype=np.int32)
X_enc = encoder.fit_transform(X)

In [13]:
# Now we can learn the distances
# we set r=1
vdm = ValueDifferenceMetric(r=1).fit(X_enc, y)

In [14]:
# the conditional probabilities of each class given a
# value are stored, one row per value
vdm.proba_per_class_

[array([[0.9, 0.1],
        [0.2, 0.8],
        [0.3, 0.7]])]

In [21]:
# which value does each row correspond to?
# rows follow the order of the encoder's categories_ attribute

encoder.categories_

[array(['blue', 'green', 'red'], dtype='<U5')]

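- Each row of proba_per_class_ holds the class probabilities for one value, in the order given by categories_. The first row is blue: the probability of belonging to class 0 is 0.9 and the probability of belonging to class 1 is 0.1. The second and third rows read the same way for green and red.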
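
To see where these numbers come from, the table can be rebuilt by counting. The sketch below is not imbalanced-learn's implementation: it recreates the toy data from above and uses a hypothetical helper, `class_proba_per_value`, which estimates P(class | value) as the share of each class among the observations that take a given value.

```python
import numpy as np

# Recreate the single-feature toy data used above.
X = np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10)
y = np.array([1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1])


def class_proba_per_value(values, target):
    """Hypothetical helper: estimate P(class | value) by counting.

    Returns the sorted values, the sorted classes, and a matrix with
    one row per value and one column per class.
    """
    categories = np.unique(values)  # sorted alphabetically
    classes = np.unique(target)
    proba = np.zeros((len(categories), len(classes)))
    for i, value in enumerate(categories):
        mask = values == value
        for j, cls in enumerate(classes):
            # share of class `cls` among the rows that take `value`
            proba[i, j] = np.mean(target[mask] == cls)
    return categories, classes, proba


categories, classes, proba = class_proba_per_value(X, y)
for value, row in zip(categories, proba):
    print(value, row)
```

Each printed row should sum to 1, which is a quick way to confirm that the table holds probabilities of the class given the value and not the other way round.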

In [23]:
# let's find the distances between the 3 values: blue, green and red
# create a test array

test_arr = np.array(["red", "green", "blue"]).reshape(-1,1)

# transform the test data using our encoder
X_test_enc = encoder.transform(test_arr)

# calculate the distances
vdm.pairwise(X_test_enc)

array([[0. , 0.2, 1.2],
       [0.2, 0. , 1.4],
       [1.2, 1.4, 0. ]])

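- The rows and columns follow the order of the test array: red, green, blue
- The diagonal holds the distance between each value and itself, which is 0
- 0.2 is the distance between red and green
- 1.2 is the distance between red and blue, and 1.4 is the distance between green and blue
- So this is the pairwise distance matrix between the 3 values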
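
Each entry can be reproduced by hand from the probability table. The sketch below assumes that the distance between two values is the sum, over the classes, of the absolute difference between their class probabilities, raised to the power `r`. With `r=1` this reproduces the matrix above. The helper `value_distance` is hypothetical and is not part of imbalanced-learn.

```python
import numpy as np

# P(class | value) as printed by vdm.proba_per_class_ above.
proba = {
    "blue": np.array([0.9, 0.1]),
    "green": np.array([0.2, 0.8]),
    "red": np.array([0.3, 0.7]),
}


def value_distance(a, b, r=1):
    """Hypothetical helper: (sum over classes of |P(c|a) - P(c|b)|) ** r."""
    return np.abs(proba[a] - proba[b]).sum() ** r


order = ["red", "green", "blue"]  # same order as test_arr
dist = np.array([[value_distance(a, b) for b in order] for a in order])
print(dist.round(2))
```

For example, red versus blue gives |0.3 - 0.9| + |0.7 - 0.1| = 1.2.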

## Distance between vectors

Now, instead of comparing single values, we will determine the distances between vectors, that is, observations described by more than one categorical feature.

In [24]:
# We create a dataframe that contains 2 features

# 2 features
X = pd.concat([
    pd.Series(np.array(["green"] * 10 + ["red"] * 10 + ["blue"] * 10)),
    
    pd.Series(np.array(["used"] + ["new"] + ["used"] + ["new"] * 2 +
                       ["used"] * 2 + ["new"] * 3 + ["used"] * 4 + 
                       ["new"] * 6 + ["used"] * 6 + ["new"] * 4)),
    ], axis=1)

X.columns = ['colour', 'condition']

# target
y = [1] * 8 + [0] * 5 + [1] * 7 + [0] * 9 + [1]

X.head()

   colour  condition
0   green       used
1   green        new
2   green       used
3   green        new
4   green        new


In [25]:
# the function "ValueDifferenceMetric" works
# only with encoded variables, so we need to transform
# the strings into numbers first

encoder = OrdinalEncoder(dtype=np.int32)
X_enc = encoder.fit_transform(X)

In [31]:
X_enc

array([[1, 1],
       [1, 0],
       [1, 1],
       [1, 0],
       [1, 0],
       [1, 1],
       [1, 1],
       [1, 0],
       [1, 0],
       [1, 0],
       [2, 1],
       [2, 1],
       [2, 1],
       [2, 1],
       [2, 0],
       [2, 0],
       [2, 0],
       [2, 0],
       [2, 0],
       [2, 0],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

In [33]:
# now we can learn the distances using VDM,
# this time with r = 2
# first, instantiate
vdm = ValueDifferenceMetric(r=2)

# fit it using the encoded variable data
vdm.fit(X_enc, y)
vdm

ValueDifferenceMetric()

In [34]:
# we can find the probabilities per class as before

vdm.proba_per_class_

[array([[0.9, 0.1],
        [0.2, 0.8],
        [0.3, 0.7]]),
 array([[0.3125    , 0.6875    ],
        [0.64285714, 0.35714286]])]

In [35]:
# we can see the order of the values for the two features
encoder.categories_

[array(['blue', 'green', 'red'], dtype=object),
 array(['new', 'used'], dtype=object)]

In [36]:
# Now we create some new data with some vector
# combinations of the 2 variables

X_test = pd.concat([
    pd.Series(np.array(["green"]+["green"]+["red"]+["red"])),
    pd.Series(np.array(["used"] + ["new"] + ["used"] + ["new"])),
], axis=1)


X_test.columns = ['colour', 'condition']

X_test

   colour  condition
0   green       used
1   green        new
2     red       used
3     red        new


In [40]:
# now before we calculate the distance between these vectors, we will need to encode the categorical variables first

X_test_enc = encoder.transform(X_test)

X_test_enc

array([[1, 1],
       [1, 0],
       [2, 1],
       [2, 0]])

In [41]:
# so now we can calculate the distance
vdm.pairwise(X_test_enc)

array([[0.        , 0.43654337, 0.04      , 0.47654337],
       [0.43654337, 0.        , 0.47654337, 0.04      ],
       [0.04      , 0.47654337, 0.        , 0.43654337],
       [0.47654337, 0.04      , 0.43654337, 0.        ]])

In [43]:
# Now we create another test set with different
# combinations of the 2 variables

X_test = pd.concat([
    pd.Series(np.array(["green"]+["green"]+["blue"]+["red"] + ["blue"])),
    pd.Series(np.array(["used"] + ["new"] + ["used"] + ["new"] + ["used"])),
], axis=1)


X_test.columns = ['colour', 'condition']

X_test

   colour  condition
0   green       used
1   green        new
2    blue       used
3     red        new
4    blue       used


In [44]:
# encode
X_test_enc = encoder.transform(X_test)

In [45]:
# calculate distance
vdm.pairwise(X_test_enc)

array([[0.        , 0.43654337, 1.96      , 0.47654337, 1.96      ],
       [0.43654337, 0.        , 2.39654337, 0.04      , 2.39654337],
       [1.96      , 2.39654337, 0.        , 1.87654337, 0.        ],
       [0.47654337, 0.04      , 1.87654337, 0.        , 1.87654337],
       [1.96      , 2.39654337, 0.        , 1.87654337, 0.        ]])

- The above results correspond to the rows of our test data, in order:
- [green, used], [green, new], [blue, used], [red, new] and [blue, used]
- The third and fifth rows are both [blue, used], so the distance between them is 0
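
The vector distances can also be rebuilt from the per-feature probability tables. The sketch below assumes that each feature contributes the sum of absolute differences in class probabilities, raised to the power `r`, and that the contributions of the features are added up. This form was chosen because it reproduces the matrix above with `r=2`. The helper `vdm_pairwise` is hypothetical, and the encoded rows are copied from the test data.

```python
import numpy as np

# P(class | value) per feature, as printed by vdm.proba_per_class_ above.
proba_per_feature = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]),   # colour: blue, green, red
    np.array([[5 / 16, 11 / 16], [9 / 14, 5 / 14]]),  # condition: new, used
]


def vdm_pairwise(X_enc, r=2):
    """Hypothetical helper: add up the per-feature distances, each raised to r."""
    X_enc = np.asarray(X_enc)
    n = X_enc.shape[0]
    dist = np.zeros((n, n))
    for f, proba in enumerate(proba_per_feature):
        p = proba[X_enc[:, f]]  # class probabilities of each row's value
        # sum of absolute differences between every pair of rows
        per_feature = np.abs(p[:, None, :] - p[None, :, :]).sum(axis=2)
        dist += per_feature ** r
    return dist


# [green, used], [green, new], [blue, used], [red, new], [blue, used]
X_test_enc = [[1, 1], [1, 0], [0, 1], [2, 0], [0, 1]]
result = vdm_pairwise(X_test_enc)
print(result.round(8))
```

For instance, [green, used] and [blue, used] differ only in colour, so their distance is (|0.2 - 0.9| + |0.8 - 0.1|) ** 2 = 1.96.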

## SMOTE-N

[SMOTE-N](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTEN.html)
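
For intuition on how SMOTE-N builds a synthetic observation: in the original SMOTE paper, the nearest neighbours of a minority sample are found with the VDM, and each feature of the new sample takes the most frequent category among those neighbours. The sketch below illustrates only that majority-vote step. The neighbour rows are made up, `majority_vote_sample` is a hypothetical helper, and this is not imbalanced-learn's code.

```python
from collections import Counter


def majority_vote_sample(neighbours):
    """Hypothetical helper: build one synthetic row by taking, for each
    feature, the most common category among the neighbour rows.

    Ties are resolved in favour of the category seen first.
    """
    n_features = len(neighbours[0])
    return [
        Counter(row[j] for row in neighbours).most_common(1)[0][0]
        for j in range(n_features)
    ]


# Made-up nearest neighbours of a minority sample (5 rows, 3 features).
neighbours = [
    ["Red", "Used", "Luxus"],
    ["Red", "New", "Luxus"],
    ["Blue", "Used", "Small"],
    ["Red", "Used", "Smart"],
    ["Green", "Used", "Luxus"],
]

synthetic = majority_vote_sample(neighbours)
print(synthetic)  # ['Red', 'Used', 'Luxus']
```

Because every feature is filled by a vote, the synthetic row only ever contains categories that already exist in the data, which is why no numeric interpolation is needed.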

In [48]:
rng = np.random.RandomState(42)

rng.choice(['a','b','c'], size = 10).astype(object)

array(['c', 'a', 'c', 'c', 'a', 'a', 'c', 'b', 'c', 'c'], dtype=object)

In [77]:
rng.binomial(p=0.2, n=3, size=10)

array([0, 1, 0, 1, 0, 0, 1, 0, 0, 1])

In [78]:
# Create some data

rng = np.random.RandomState(42)
num_samples = 1600

X = pd.concat([
    pd.Series(rng.choice(['Blue', 'Green', 'Red'], size=num_samples).astype(object)),
    pd.Series(rng.choice(['New', 'Used'], size=num_samples).astype(object)),
    pd.Series(rng.choice(['Classic', 'Luxus', 'Smart', 'Small'], size=num_samples).astype(object)),
], axis=1)

X.columns = ['Colour', 'Condition', 'Model']

y = pd.Series(rng.binomial(p=0.1, n=1, size=num_samples))

# display size
X.shape, y.shape

((1600, 3), (1600,))

In [79]:
X.head()

   Colour  Condition  Model
0     Red       Used  Luxus
1    Blue        New  Small
2     Red       Used  Luxus
3     Red        New  Small
4    Blue       Used  Luxus


In [80]:
y.value_counts()

0    1443
1     157
dtype: int64

In [83]:
y.value_counts(normalize=True)

0    0.901875
1    0.098125
dtype: float64

In [84]:
y.value_counts()/len(y)

0    0.901875
1    0.098125
dtype: float64

In [82]:
# relative frequency of each category, per feature

for var in X.columns:
    print(X[var].value_counts(normalize=True))

Blue     0.344375
Red      0.328750
Green    0.326875
Name: Colour, dtype: float64
Used    0.51125
New     0.48875
Name: Condition, dtype: float64
Small      0.256250
Classic    0.255625
Smart      0.251875
Luxus      0.236250
Name: Model, dtype: float64


In [85]:
# setting up SMOTE-N

smote = SMOTEN(sampling_strategy='auto',  # oversample every class except the majority
               k_neighbors=5,             # number of nearest neighbours
               n_jobs=2,                  # number of parallel jobs
               random_state=0)

X_res, y_res = smote.fit_resample(X,y)
X_res.shape, y_res.shape

((2886, 3), (2886,))
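
The new row count follows from the class counts. With `sampling_strategy='auto'`, the minority class is oversampled until it matches the majority class, so the arithmetic can be checked with a short sketch. The counts are copied from `y.value_counts()` above.

```python
# Class counts before resampling, copied from y.value_counts() above.
counts = {0: 1443, 1: 157}

majority = max(counts.values())

# Number of synthetic rows each class needs to reach the majority count.
n_synthetic = {label: majority - n for label, n in counts.items()}

# Original rows plus synthetic rows.
total_after = sum(counts.values()) + sum(n_synthetic.values())

print(n_synthetic)  # {0: 0, 1: 1286}
print(total_after)  # 2886
```

So 1286 of the 2886 rows are synthetic, and all of them belong to class 1.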

In [87]:
# let's check the value counts after resampling

for var in X.columns:
    print(X_res[var].value_counts(normalize=True))

Blue     0.343728
Green    0.331254
Red      0.325017
Name: Colour, dtype: float64
Used    0.517325
New     0.482675
Name: Condition, dtype: float64
Smart      0.273042
Classic    0.258143
Small      0.241511
Luxus      0.227304
Name: Model, dtype: float64


In [90]:
# check the balance ratio for y_res 
y_res.value_counts()

0    1443
1    1443
dtype: int64