<h1><strong>Feature Engineering</strong></h1>

Feature engineering is the process of converting raw data into meaningful features that improve model performance.

<h2>Steps of feature engineering</h2>

1. Understand the problem
2. Collect the raw data
3. Clean the data
4. Handle missing data
5. Convert words into numbers (encoding)
6. Create new features (adding new information)
7. Scale the data (put all features on an equal playing field)
8. Remove unnecessary information
9. Finally, check that the features make sense

In [None]:
%pip install scikit-learn

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(
    {
        "age": [25, 32, 24, np.nan, 22],
        "income": [43000, 23000, np.nan, 50000, 23000],
        "city": ["Delhi", "Chandigarh", "Delhi", np.nan, "Phagwara"],
        "bought": [1, 0, 1, 0, 1],
    }
)

In [3]:
df

    age   income        city  bought
0  25.0  43000.0       Delhi       1
1  32.0  23000.0  Chandigarh       0
2  24.0      NaN       Delhi       1
3   NaN  50000.0         NaN       0
4  22.0  23000.0    Phagwara       1


In [None]:
df.info()
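
Alongside `df.info()`, it can help to count the missing values per column before deciding how to impute them. The following is a minimal sketch that rebuilds the same DataFrame so it runs on its own; the name `missing_counts` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this block is self-contained.
df = pd.DataFrame(
    {
        "age": [25, 32, 24, np.nan, 22],
        "income": [43000, 23000, np.nan, 50000, 23000],
        "city": ["Delhi", "Chandigarh", "Delhi", np.nan, "Phagwara"],
        "bought": [1, 0, 1, 0, 1],
    }
)

# isna() marks each missing cell as True; sum() counts them per column.
missing_counts = df.isna().sum()
print(missing_counts)
```

Here each of `age`, `income`, and `city` has one missing value, while the target `bought` has none.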

Separate the feature and target columns

In [6]:
x = df.drop("bought", axis=1)  # feature columns
y = df["bought"]  # target column

In [9]:
x

    age   income        city
0  25.0  43000.0       Delhi
1  32.0  23000.0  Chandigarh
2  24.0      NaN       Delhi
3   NaN  50000.0         NaN
4  22.0  23000.0    Phagwara


Separate the numerical and categorical feature columns

In [10]:
x_num = ["age","income"]
x_cat = ["city"]

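Imputation: fill missing numerical values with the column median and missing categorical values with the most frequent category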

In [11]:
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

imputed_num = num_imputer.fit_transform(x[x_num])
imputed_cat = cat_imputer.fit_transform(x[x_cat])


In [12]:
imputed_num

array([[2.50e+01, 4.30e+04],
       [3.20e+01, 2.30e+04],
       [2.40e+01, 3.30e+04],
       [2.45e+01, 5.00e+04],
       [2.20e+01, 2.30e+04]])

In [13]:
imputed_cat

array([['Delhi'],
       ['Chandigarh'],
       ['Delhi'],
       ['Delhi'],
       ['Phagwara']], dtype=object)
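
To see which values the imputers actually filled in, one option is to inspect the fitted `statistics_` attribute of each `SimpleImputer`. This is a small sketch on the same toy data; the names `num_fill` and `cat_fill` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Feature columns only, rebuilt so this block is self-contained.
x = pd.DataFrame(
    {
        "age": [25, 32, 24, np.nan, 22],
        "income": [43000, 23000, np.nan, 50000, 23000],
        "city": ["Delhi", "Chandigarh", "Delhi", np.nan, "Phagwara"],
    }
)

num_imputer = SimpleImputer(strategy="median").fit(x[["age", "income"]])
cat_imputer = SimpleImputer(strategy="most_frequent").fit(x[["city"]])

# statistics_ holds one fill value per fitted column.
num_fill = num_imputer.statistics_  # medians of age and income
cat_fill = cat_imputer.statistics_  # most frequent city
print(num_fill, cat_fill)
```

The fill values should match the imputed rows above: 24.5 for `age`, 33000 for `income`, and `Delhi` for `city`.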

Encoding: convert the categorical column into one-hot (0/1) columns

In [14]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded_cat = encoder.fit_transform(imputed_cat)

In [15]:
encoded_cat

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Scaling: standardize the numerical columns to zero mean and unit variance

In [16]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(imputed_num)


In [17]:
scaled_data

array([[-0.14680505,  0.79904121],
       [ 1.90846571, -1.05919416],
       [-0.44041516, -0.13007648],
       [-0.29361011,  1.44942359],
       [-1.02763538, -1.05919416]])
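
`StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation (`ddof=0`). As a rough check, the same result can be reproduced by hand with NumPy; the variable names below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Imputed numerical columns (age, income) from above.
imputed_num = np.array(
    [
        [25.0, 43000.0],
        [32.0, 23000.0],
        [24.0, 33000.0],
        [24.5, 50000.0],
        [22.0, 23000.0],
    ]
)

# Column-wise mean and population standard deviation (NumPy's default ddof=0).
mean = imputed_num.mean(axis=0)
std = imputed_num.std(axis=0)
manual = (imputed_num - mean) / std

# Compare with scikit-learn's result.
scaled = StandardScaler().fit_transform(imputed_num)
print(np.allclose(manual, scaled))
```

After scaling, `age` and `income` are on comparable ranges, so neither dominates a distance-based or gradient-based model purely because of its units.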

Combine the features

In [18]:
np.hstack([scaled_data, encoded_cat])

array([[-0.14680505,  0.79904121,  0.        ,  1.        ,  0.        ],
       [ 1.90846571, -1.05919416,  1.        ,  0.        ,  0.        ],
       [-0.44041516, -0.13007648,  0.        ,  1.        ,  0.        ],
       [-0.29361011,  1.44942359,  0.        ,  1.        ,  0.        ],
       [-1.02763538, -1.05919416,  0.        ,  0.        ,  1.        ]])