
### 🔹 1. What is Feature Engineering?
Feature engineering is the process of preparing and transforming the data to make it more useful for machine learning models.  
Examples include:
- Handling missing values  
- Encoding categorical variables  
- Creating new features like `family_size = sibsp + parch`  
- Scaling numerical features
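
As a rough illustration of the "creating new features" bullet, the sketch below builds `family_size` on a tiny hand-made frame. The rows are made up for the example, and the `is_alone` column is a hypothetical extra feature that the list above does not mention.

```python
import pandas as pd

# Toy rows that reuse the Titanic column names (values are illustrative only)
toy = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 0, 2]})

# Siblings/spouses plus parents/children aboard
toy["family_size"] = toy["sibsp"] + toy["parch"]

# Hypothetical derived flag: passenger travels with no relatives
toy["is_alone"] = toy["family_size"] == 0

print(toy)
```

Derived columns like these can give a model a signal that the raw columns only carry indirectly.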

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [4]:
df.isnull().sum()

survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


In [5]:
df.describe()

,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
df['embarked'].value_counts()

embarked,count
S,644
C,168
Q,77


### 🔹 2. Why is Missing Data a Problem in ML?
Most machine learning models cannot work with missing values directly.  
Some models raise an error when they see `NaN`, and others give misleading results.  
We usually fix this by:
- Filling numeric columns with the median or mean, and categorical columns with the mode  
- Dropping columns with too many missing values
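
The two fixes above can be sketched on a small made-up frame. The 50% cutoff for dropping a column is an assumption chosen for this example, not a rule from the notebook, and the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are illustrative only)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 80.0],
    "embarked": ["S", None, "S", "C"],
    "deck": [None, None, None, "C"],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Drop columns where more than half of the values are missing (assumed threshold)
toy = toy.drop(columns=missing_share[missing_share > 0.5].index)

# Median for the numeric column, mode for the categorical one
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(toy)
```

The median is often preferred over the mean for skewed columns because a few extreme values move it much less.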

In [None]:
df['age'] = df['age'].fillna(df['age'].median()) # Fill 'age' with the median

In [9]:
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0]) # Fill 'embarked' with the mode (most frequent)

In [10]:
df.drop(columns=['deck'],inplace=True) # Drop 'deck' column (too many missing values)

### 🔹 3. Difference Between One-Hot & Label Encoding

| Encoding Type    | Description                                   | Example                                        |
|------------------|-----------------------------------------------|------------------------------------------------|
| Label Encoding   | Maps each category to an integer              | male → 0, female → 1                           |
| One-Hot Encoding | Creates one 0/1 column per category           | embarked → embarked_C, embarked_Q, embarked_S  |
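
To make the difference concrete, here is a minimal sketch on a made-up three-row frame. The rows are illustrative, and `dtype=int` is only there so the indicator columns print as 0/1 instead of booleans.

```python
import pandas as pd

# Toy rows (values are illustrative only)
toy = pd.DataFrame({"sex": ["male", "female", "male"],
                    "embarked": ["S", "C", "Q"]})

# Label-style encoding: one integer per category, still a single column
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["embarked"], dtype=int)

print(encoded)
```

Passing `drop_first=True` to `pd.get_dummies` would remove the first indicator column (`embarked_C` here), since it can be inferred from the other two.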


In [11]:
df['sex']=df['sex'].map({'male':0,'female':1}) # Binary encoding

In [12]:
df = pd.get_dummies(df, columns=['embarked'], drop_first=True) # One-hot encoding; drop_first removes embarked_C as the baseline

### 🔹 4. Why Do We Scale Numerical Features?
Some models like Logistic Regression, SVM, and KNN are sensitive to feature scale.  
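If features are on very different scales (here `fare` goes up to about 512 while `age` stops at 80), the larger one can dominate distance calculations and slow down gradient-based training.  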
We use `StandardScaler` to:
- Center the data (mean = 0)  
- Normalize the spread (std = 1)
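
The two bullets above correspond to the z-score formula `z = (x - mean) / std`. The sketch below applies it by hand with NumPy to the five `fare` values shown in `df.head()`. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default.

```python
import numpy as np

# The five fares from df.head()
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Center on the mean, then divide by the population standard deviation
z = (fare - fare.mean()) / fare.std()

print(z.mean(), z.std())
```

After the transform the values have a mean of about 0 and a standard deviation of about 1, up to floating-point error.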

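In [14]:
from sklearn.model_selection import train_test_split
# 'alive' is the target stored as text, so it must not stay in the features
X = df.drop(columns=['survived', 'alive'])
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train, X_test = X_train.copy(), X_test.copy()
# Fit on the training split only so test-set statistics do not leak into the scaling
X_train[['age', 'fare']] = scaler.fit_transform(X_train[['age', 'fare']])
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])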

# **🔹 Overall confidence: 7.5 / 10**


