<h1><strong>Feature Engineering</strong></h1>

What we are trying to do:
1. You have data about people: age, income, city.
2. You want to predict whether they will buy a product (target: 0 = No, 1 = Yes).
3. The raw data is messy: missing values, text, and different scales.

In [1]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({
    "age": [25, 32, np.nan, 40, 29],
    "income": [40000, 52000, 61000, np.nan, 45000],
    "city": ["Delhi", "Mumbai", "Delhi", np.nan, "Bangalore"],
    "bought": [0, 1, 0, 1, 0]
})

In [None]:
print(df)

<h3>Separating feature and output columns</h3>

In [None]:
x = df.drop("bought",axis=1)
y = df["bought"]

In [None]:
print(x)


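<h3>Handling missing values (Imputation)</h3>
Missing values must be handled before training because most models cannot work with them directly. Here the numeric columns are filled with the column mean and the categorical column with its most frequent value.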

In [None]:
# scikit-learn (the package name on PyPI is "scikit-learn"; it is imported as "sklearn")
!python -m pip install scikit-learn

In [None]:
from sklearn.impute import SimpleImputer

numeric_features = ["age","income"]
categorical_features = ["city"]

num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

x_num = num_imputer.fit_transform(x[numeric_features])
x_cat = cat_imputer.fit_transform(x[categorical_features])

print("Numeric feature after imputation: ",x_num)
print("Categorical feature after imputation: ",x_cat)
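
To see what an imputer actually learned, the fitted `statistics_` attribute can be inspected. The following is a minimal sketch that re-creates the two numeric toy columns so it runs on its own; the names `x_demo` and `demo_imputer` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric toy columns re-created here so the cell is self-contained
x_demo = pd.DataFrame({
    "age": [25, 32, np.nan, 40, 29],
    "income": [40000, 52000, 61000, np.nan, 45000],
})

demo_imputer = SimpleImputer(strategy="mean")
x_demo_filled = demo_imputer.fit_transform(x_demo)

# statistics_ holds the fill value learned for each column (here: the column means)
print("Fill values: ", demo_imputer.statistics_)
# pandas skips NaN when averaging, so these should match the fill values
print("Pandas means:", x_demo.mean().to_numpy())
```

The missing age is replaced by 31.5 and the missing income by 49500, the means of the observed values in each column.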


<h3>Encoding categorical variables</h3>

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False,handle_unknown="ignore")
x_cat_encoded = encoder.fit_transform(x_cat)

print("encoded column",x_cat_encoded)
print("Categories",encoder.categories_)

<h3>Scaling</h3>

Scaling makes numeric features comparable in magnitude and helps many algorithms converge more reliably. StandardScaler rescales each column to zero mean and unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(x_num)

print("scaled numeric values :",X_num_scaled)
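
As a sanity check on what standardization does, the same result can be reproduced by hand with z = (x - mean) / std. This is a sketch on a small made-up array of complete rows; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative array: columns are age and income, no missing values
values = np.array([[25.0, 40000.0],
                   [32.0, 52000.0],
                   [40.0, 61000.0],
                   [29.0, 45000.0]])

# z = (x - column mean) / column std, using the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)
from_sklearn = StandardScaler().fit_transform(values)

print("Manual and sklearn agree:", np.allclose(manual, from_sklearn))
print("Column means after scaling:", from_sklearn.mean(axis=0).round(6))
```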


<h3>Putting numeric and categorical features together</h3>

In [None]:
x_processed = np.hstack([X_num_scaled,x_cat_encoded])

print("final feature matrix :")
print(x_processed)


<h2>Ordinal Encoder</h2>

In [2]:
df1 = pd.DataFrame(
    {
        "education":["High School","Bachelor","Master","Phd"],
        "age":[22,27,25,24],
        "salary":[20000,30000,25000,34000]
    }
)

In [4]:
from sklearn.preprocessing import OrdinalEncoder

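# Categories are listed from lowest to highest, so the codes 0..3 follow that order
education_order = [["High School","Bachelor","Master","Phd"]]

ord_enc = OrdinalEncoder(categories=education_order)

# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it into one column
df1["education_ord"] = ord_enc.fit_transform(df1[["education"]]).ravel()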

In [None]:
df1

<h2>Min-Max Scaler</h2>

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))

num_cols = ["age","salary"]
df_scaled = scaler.fit_transform(df1[num_cols])

print(df_scaled)
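
Min-max scaling to the range (0, 1) amounts to (x - min) / (max - min) per column. The sketch below checks that formula against `MinMaxScaler` on the same age and salary values, copied into a standalone array so it runs independently.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Age and salary values copied from df1 so this cell is self-contained
values = np.array([[22.0, 20000.0],
                   [27.0, 30000.0],
                   [25.0, 25000.0],
                   [24.0, 34000.0]])

col_min = values.min(axis=0)
col_max = values.max(axis=0)

# Each column is shifted to start at 0 and divided by its range
manual = (values - col_min) / (col_max - col_min)
from_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(values)

print("Manual and sklearn agree:", np.allclose(manual, from_sklearn))
```

Note that `fit_transform` returns a NumPy array, so `df1` itself is left unchanged unless the result is assigned back to its columns.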

In [None]:
df1

<h2><strong>Dealing with Date and Time</strong></h2>

In [None]:
df = pd.DataFrame(
    {
        "order_time":["2025-01-01 10:30:00","2025-01-02 22:15:00","2025-03-15 06:45:00"]
    }
)

df.info()

In [None]:
# Convert the string column to a datetime64 column so the .dt accessor can be used
df["order_time"] = pd.to_datetime(df["order_time"])

In [None]:
df.info()

In [None]:
# Calendar features

df["year"] = df["order_time"].dt.year
df["month"] = df["order_time"].dt.month
df["day"] = df["order_time"].dt.day
df["dayofweek"] = df["order_time"].dt.dayofweek  # Monday = 0, Sunday = 6
df["hour"] = df["order_time"].dt.hour
df["minute"] = df["order_time"].dt.minute

In [None]:
df.info()

In [28]:
# If the column holds only dates (no time component)
df = pd.DataFrame(
    {
        "order_time":["2025-01-01","2025-01-02","2025-03-15"]
    }
)

df["order_time"] = pd.to_datetime(df["order_time"])

In [30]:
df["year"] = df["order_time"].dt.year
df["month"] = df["order_time"].dt.month
df["day"] = df["order_time"].dt.day
df["dayofweek"] = df["order_time"].dt.dayofweek  # Monday = 0, Sunday = 6

In [None]:
df

Dealing with durations in a format like "2h 12m"

In [32]:
df = pd.DataFrame({
    "duration":["2h 12m","0h 45m","1h 05m","3h 00m"]
})

In [33]:
df

  duration
0   2h 12m
1   0h 45m
2   1h 05m
3   3h 00m


In [41]:
# Everything before "h" is the hour part, e.g. "2h 12m" -> "2"
df["hour"] = df["duration"].str.split("h").str[0]

In [49]:
# Take the token after the space ("12m") and drop the trailing "m"
df["minute"] = df["duration"].str.split(" ").str[1].str.split("m").str[0]

In [53]:
df["hour"] = df["hour"].astype(int)
df["minute"] = df["minute"].astype(int)

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   duration  4 non-null      object
 1   hour      4 non-null      int64 
 2   minute    4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes
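
An alternative to chained `split` calls is a single regular expression with `str.extract`, which can also feed a combined numeric feature. The sketch below assumes every value follows the "Nh Mm" pattern exactly; the `total_minutes` column and the `durations` frame are introduced here only for illustration.

```python
import pandas as pd

# Same duration strings, re-created so the cell runs on its own
durations = pd.DataFrame({"duration": ["2h 12m", "0h 45m", "1h 05m", "3h 00m"]})

# Named groups become the column names of the extracted frame
parts = durations["duration"].str.extract(r"(?P<hour>\d+)h\s*(?P<minute>\d+)m").astype(int)

# A single numeric feature is often easier for a model than two separate columns
durations["total_minutes"] = parts["hour"] * 60 + parts["minute"]
print(durations)
```

If some rows might not match the pattern, `str.extract` returns NaN for them and the `astype(int)` call would fail, so real data would need a check or a fill step first.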
