<h1><strong>Feature Engineering</strong></h1>

Feature engineering is the process of converting raw data into meaningful features that improve model performance.

01. Understand the problem
02. Collect the raw data
03. Clean the data
04. Handle missing values
05. Convert words (categories) into numbers
06. Create features (derive new information)
07. Scale the data (so that all features are on a level playing field)
08. Remove unnecessary information
09. Check that the features make sense

Install scikit-learn with the following command: <br>
<code>pip install scikit-learn</code>

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(
    {
        "age":[25,24,30,np.nan,25],
        "income":[43000,23000,np.nan,50000,26500],
        "city":["Delhi","Chandigarh","Delhi",np.nan,"Mumbai"],
        "baught":[0,1,1,0,1]
    }
)

In [3]:
df

    age   income        city  baught
0  25.0  43000.0       Delhi       0
1  24.0  23000.0  Chandigarh       1
2  30.0      NaN       Delhi       1
3   NaN  50000.0         NaN       0
4  25.0  26500.0      Mumbai       1


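The feature columns are <code>age</code>, <code>income</code> and <code>city</code>; <code>baught</code> is the target column.

Separate the feature columns from the target column.

As in y = mx + c, X holds the inputs (features) and Y holds the output (target) to predict.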

In [None]:
X = df.drop("baught",axis=1)
Y = df["baught"]

In [5]:
X

    age   income        city
0  25.0  43000.0       Delhi
1  24.0  23000.0  Chandigarh
2  30.0      NaN       Delhi
3   NaN  50000.0         NaN
4  25.0  26500.0      Mumbai


In [6]:
Y

0    0
1    1
2    1
3    0
4    1
Name: baught, dtype: int64

Separate the numerical and categorical column names, because each group is preprocessed differently.

In [None]:
x_num = ["age","income"]
x_cat = ["city"]
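With more columns, typing the two lists by hand becomes error-prone. The following is a minimal sketch of deriving them from the column dtypes with `select_dtypes`. It rebuilds the same feature frame so that it runs on its own, and it assumes that every non-numeric column should be treated as categorical, which holds for this small example but not necessarily for other datasets.

```python
import numpy as np
import pandas as pd

# Same feature frame as above (target column already dropped).
X = pd.DataFrame(
    {
        "age": [25, 24, 30, np.nan, 25],
        "income": [43000, 23000, np.nan, 50000, 26500],
        "city": ["Delhi", "Chandigarh", "Delhi", np.nan, "Mumbai"],
    }
)

# Columns with a numeric dtype (int or float) form the numerical group.
x_num = X.select_dtypes(include="number").columns.tolist()
# Assumption: everything that is not numeric is categorical.
x_cat = X.select_dtypes(exclude="number").columns.tolist()

print(x_num)
print(x_cat)
```

Numeric codes that are really categories (for example a postal code stored as an integer) would land in the wrong list, so the result is worth checking by eye.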

<h3>Imputation</h3>

In [8]:
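from sklearn.impute import SimpleImputer

# Numerical gaps are filled with the column median,
# categorical gaps with the most frequent value.
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Reuse the column lists defined above instead of repeating the names.
imputed_num = num_imputer.fit_transform(X[x_num])
imputed_cat = cat_imputer.fit_transform(X[x_cat])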

In [9]:
imputed_num

array([[2.500e+01, 4.300e+04],
       [2.400e+01, 2.300e+04],
       [3.000e+01, 3.475e+04],
       [2.500e+01, 5.000e+04],
       [2.500e+01, 2.650e+04]])

In [10]:
imputed_cat

array([['Delhi'],
       ['Chandigarh'],
       ['Delhi'],
       ['Delhi'],
       ['Mumbai']], dtype=object)
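To see where the filled-in values come from, the fitted imputers can be inspected. The sketch below refits the two imputers on the same data and prints their `statistics_` attribute, which holds the value learned for each column. The variable `manual_medians` is only introduced here for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame(
    {
        "age": [25, 24, 30, np.nan, 25],
        "income": [43000, 23000, np.nan, 50000, 26500],
        "city": ["Delhi", "Chandigarh", "Delhi", np.nan, "Mumbai"],
    }
)

num_imputer = SimpleImputer(strategy="median").fit(X[["age", "income"]])
cat_imputer = SimpleImputer(strategy="most_frequent").fit(X[["city"]])

# One learned fill value per column.
print(num_imputer.statistics_)
print(cat_imputer.statistics_)

# The same medians computed by hand, ignoring the missing entries.
manual_medians = np.nanmedian(X[["age", "income"]].to_numpy(), axis=0)
print(manual_medians)
```

The median age of 25, 24, 30, 25 is 25, and the median income of 23000, 26500, 43000, 50000 is the midpoint 34750, which matches the values that appear in rows 4 and 3 of `imputed_num`.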

<h3>Encoding</h3>

In [12]:
from sklearn.preprocessing import OneHotEncoder

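# sparse_output=False returns a dense NumPy array; handle_unknown="ignore" encodes
# categories not seen during fit as all zeros instead of raising an error.
encoder = OneHotEncoder(sparse_output=False,handle_unknown="ignore")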

encoded_cat = encoder.fit_transform(imputed_cat)

In [13]:
encoded_cat

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

<h3>Scaling</h3>

In [14]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaledData = scaler.fit_transform(imputed_num)

In [15]:
scaledData

array([[-0.37463432,  0.75177429],
       [-0.84292723, -1.23968078],
       [ 1.9668302 , -0.06970093],
       [-0.37463432,  1.44878357],
       [-0.37463432, -0.89117615]])
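`StandardScaler` subtracts the column mean and divides by the column standard deviation, so each column ends up with mean 0 and standard deviation 1. The sketch below reproduces the result by hand on the imputed numbers. The names `mean`, `std` and `manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The imputed numerical columns (age, income) from above.
imputed_num = np.array(
    [[25, 43000], [24, 23000], [30, 34750], [25, 50000], [25, 26500]],
    dtype=float,
)

scaled = StandardScaler().fit_transform(imputed_num)

# z = (x - mean) / std, using the population standard deviation (ddof=0).
mean = imputed_num.mean(axis=0)
std = imputed_num.std(axis=0)
manual = (imputed_num - mean) / std

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

For the first row, the age 25 with mean 25.8 and standard deviation of about 2.135 gives roughly -0.375, the value shown in `scaledData`.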

<h3>Combine the Features</h3>

In [16]:
np.hstack([scaledData,encoded_cat])

array([[-0.37463432,  0.75177429,  0.        ,  1.        ,  0.        ],
       [-0.84292723, -1.23968078,  1.        ,  0.        ,  0.        ],
       [ 1.9668302 , -0.06970093,  0.        ,  1.        ,  0.        ],
       [-0.37463432,  1.44878357,  0.        ,  1.        ,  0.        ],
       [-0.37463432, -0.89117615,  0.        ,  0.        ,  1.        ]])