In [88]:
import pandas as pd

df_cat = pd.DataFrame(data=[
    ['green', 'M', 10.1, 'class1'],
    ['blue', 'L', 20.1, 'class2'],
    ['white', 'M', 30.1, 'class1']])
df_cat.columns = ['colour','size','price','classlabel']
print(df_cat)
size_mapping = {'M':1,'L':2}
df_cat['size'] = df_cat['size'].map(size_mapping)
df_cat

  colour size  price classlabel
0  green    M   10.1     class1
1   blue    L   20.1     class2
2  white    M   30.1     class1


  colour  size  price classlabel
0  green     1   10.1     class1
1   blue     2   20.1     class2
2  white     1   30.1     class1


In [89]:
df_cat

  colour  size  price classlabel
0  green     1   10.1     class1
1   blue     2   20.1     class2
2  white     1   30.1     class1


In [90]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
df_cat_2 = df_cat.copy()
df_cat['classlabel'] = class_le.fit_transform(df_cat['classlabel'].values)
print(df_cat_2)
df_cat

  colour  size  price classlabel
0  green     1   10.1     class1
1   blue     2   20.1     class2
2  white     1   30.1     class1


  colour  size  price  classlabel
0  green     1   10.1           0
1   blue     2   20.1           1
2  white     1   30.1           0
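
The integer codes produced by `LabelEncoder` can be mapped back to the original labels when they are needed for reporting. A minimal sketch, assuming the same three class labels as in the cell above; the names `labels`, `encoded` and `decoded` are illustrative and not part of the notebook:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative copy of the class labels used above.
labels = ['class1', 'class2', 'class1']

class_le = LabelEncoder()
# The distinct labels are sorted and numbered from 0.
encoded = class_le.fit_transform(labels)
# inverse_transform maps the integer codes back to the original strings.
decoded = class_le.inverse_transform(encoded)

print(class_le.classes_)
print(encoded)
print(decoded)
```

The fitted encoder keeps the label order in `classes_`, so the same object should be reused to decode predictions later.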


In [91]:
df_cat_2 = pd.get_dummies(df_cat_2[['colour','size','price','classlabel']])

In [92]:
df_cat_2

   size  price  colour_blue  colour_green  colour_white  classlabel_class1  classlabel_class2
0     1   10.1            0             1             0                  1                  0
1     2   20.1            1             0             0                  0                  1
2     1   30.1            0             0             1                  1                  0


# Multicollinearity
### Multicollinearity occurs when features in our dataset are strongly dependent on each other.

#### Its main impact is that the fitted model becomes unstable: small changes in the data can shift the decision boundary noticeably. Additionally, we can no longer use the weight vector to calculate feature importance, because correlated features can trade weight with one another.

It can, however, be detected by the Variance Inflation Factor (VIF).
<img src="https://i.imgur.com/G31R4vW.png"> 
where R² is the coefficient of determination obtained when one feature is regressed on all the other features (the proportion of that feature's variance that is predictable from the other features).

If VIF ≈ 1 or 2 → little or no multicollinearity

If VIF ≈ 20 → multicollinearity is very likely
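
The VIF of each feature can be computed by hand to see these thresholds in action. The sketch below assumes the standard definition VIF = 1 / (1 − R²) and uses synthetic data; the helper `vif` and the features `x1`, `x2`, `x3` are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of X (rows are samples)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    result = []
    for j in range(n_features):
        y = X[:, j]                        # feature treated as the target
        others = np.delete(X, j, axis=1)   # all remaining features
        # Intercept column so that R² is measured around the mean of y.
        A = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
        ss_res = residual @ residual
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        # A perfect fit means the feature is an exact linear combination.
        result.append(np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return np.array(result)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                  # independent of x1
x3 = x1 + 0.1 * rng.normal(size=200)       # almost a copy of x1

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Here `x1` and `x3` should receive large VIF values because each is almost fully predictable from the other, while `x2` stays close to 1.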

Multicollinearity exists whenever <b>two or more of the predictors in a regression
model are moderately or highly correlated</b>.

## Two types of Multicollinearity:

• <b>Structural multicollinearity</b> is a mathematical artifact caused by creating new
predictors from other predictors, such as creating the predictor x² from the
predictor x.

• <b>Data-based multicollinearity</b>, on the other hand, is a result of a poorly
designed experiment, reliance on purely observational data, or the inability to
manipulate the system on which the data are collected.
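
Structural multicollinearity is easy to reproduce. The sketch below uses a made-up predictor `x` on the range 1 to 10 and measures how strongly it correlates with its own square. Centring `x` before squaring is shown as one common remedy; that step is an assumption added here and is not something the text above prescribes.

```python
import numpy as np

# Hypothetical predictor taking only positive values.
x = np.linspace(1.0, 10.0, 50)

# Correlation between x and the derived predictor x².
corr_raw = np.corrcoef(x, x ** 2)[0, 1]

# Centre x first, then square: for values symmetric about the mean,
# the linear relationship between the two predictors disappears.
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc ** 2)[0, 1]

print(corr_raw, corr_centered)
```

With these values the raw correlation is close to 1, while the centred version is essentially zero.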

## Avoidance of Multicollinearity
• Dropping feature(s) that are collinear

• Linearly combining predictors, such as adding them together

• Feature extraction / feature engineering

<img src='https://i.imgur.com/oHN5lup.png'>
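
One-hot encoding is a typical source of collinear features: the dummy columns for one categorical variable always sum to 1, so any one of them is fully determined by the rest. A small sketch, assuming only the `colour` column from the example data (`colours`, `full` and `reduced` are illustrative names):

```python
import pandas as pd

colours = pd.DataFrame({'colour': ['green', 'blue', 'white']})

# Full one-hot encoding: one column per colour.
full = pd.get_dummies(colours, columns=['colour'], dtype=int)
# drop_first=True removes the first dummy column, breaking the exact dependency.
reduced = pd.get_dummies(colours, columns=['colour'], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1.
print(full.sum(axis=1).tolist())
print(list(full.columns))
print(list(reduced.columns))
```

Dropping one column loses no information: a row with zeros in all remaining dummy columns belongs to the dropped category.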

In [97]:
df_cat = pd.get_dummies(df_cat[['colour','size','price']], drop_first=True)

In [101]:
df_cat

   size  price  colour_green  colour_white
0     1   10.1             1             0
1     2   20.1             0             0
2     1   30.1             0             1


