Step 3: Feature Engineering, one of the most critical steps before training your ML model.

In [1]:
df['Age'].fillna(df['Age'].median(), inplace=True)


NameError: name 'df' is not defined

In [2]:
import pandas as pd

# Reload your Titanic dataset
df = pd.read_csv("train.csv")


In [3]:
df['Age'].fillna(df['Age'].median(), inplace=True)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(df['Age'].median(), inplace=True)


In [5]:
df['Age'] = df['Age'].fillna(df['Age'].median())
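The warning above names two replacements for the chained inplace call. A minimal sketch of both, run on a toy frame (the `toy` DataFrame and its values are made up for illustration and are not rows from train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
})

# Option A: compute the filled column and assign it back.
by_assignment = toy.copy()
by_assignment["Age"] = by_assignment["Age"].fillna(by_assignment["Age"].median())

# Option B: call fillna on the DataFrame with a {column: value} mapping.
by_mapping = toy.fillna({
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
})

print(by_assignment["Age"].tolist())    # [22.0, 26.0, 38.0, 26.0]
print(by_mapping["Embarked"].tolist())  # ['S', 'C', 'S', 'S']
```

Both options modify a named object directly, so neither depends on whether `df['Age']` is a view or a copy.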


In [7]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


In [8]:
# Safe missing value handling
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


In [9]:
df['Age'].isnull().sum()


np.int64(0)

In [10]:
import numpy as np
val = np.int64(0)
print(val)           # Output: 0
print(type(val))     # Output: <class 'numpy.int64'>



0
<class 'numpy.int64'>


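Why does the output show np.int64 instead of a regular int?
pandas returns NumPy scalars from reductions such as .sum(), which is why In [9] printed np.int64(0) rather than 0.
Feature	int (Python)	np.int64 (NumPy)
Type	Built-in Python type	NumPy scalar type
Where it comes from	Python literals and arithmetic	Results of NumPy/pandas operations
Size	Arbitrary precision	Fixed 64 bits
Compatibility	Works everywhere	Behaves like int in arithmetic; use int(val) when a plain int is required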
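To make the point concrete, here is a small sketch showing that a missing-value count comes back as a NumPy integer and how to convert it. The `ages` Series is a made-up example, not the real Age column:

```python
import numpy as np
import pandas as pd

# Invented values: one missing entry out of three.
ages = pd.Series([22.0, np.nan, 38.0])

missing = ages.isnull().sum()              # NumPy integer scalar, not a Python int
print(isinstance(missing, np.integer))     # True

missing_int = int(missing)                 # plain Python int, e.g. for JSON output
print(missing_int)                         # 1
```

The conversion only matters when some other code insists on a built-in int; for comparisons and arithmetic the NumPy scalar works as-is.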

In [11]:
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
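One thing to watch with `.map()`: any label that is not a key in the dictionary becomes NaN. A short sketch on an invented Series (the inconsistent "Male" label is an assumption added to show the effect):

```python
import pandas as pd

# "Male" is a deliberately inconsistent label, invented for this example.
sex = pd.Series(["male", "female", "Male"])

encoded = sex.map({"male": 0, "female": 1})

# Count labels that the mapping did not cover.
unmapped = encoded.isna().sum()
print(unmapped)  # 1
```

Checking the unmapped count right after encoding catches typos or unexpected categories before they reach the model.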


In [12]:
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
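To see what `drop_first=True` does, here is a sketch on a toy column holding the three ports (the `toy` frame is invented; `dtype=int` is added only to make the printed values 0/1 instead of booleans):

```python
import pandas as pd

# Invented frame containing each Embarked category at least once.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

# Categories are sorted, so "C" is the first one and gets dropped.
print(encoded.columns.tolist())  # ['Embarked_Q', 'Embarked_S']

# A passenger who embarked at "C" is encoded as zeros in both columns.
print(encoded.iloc[1].tolist())  # [0, 0]
```

Dropping one category avoids a redundant column, since the dropped value is implied when all remaining dummies are 0.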


In [1]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1


NameError: name 'df' is not defined

In [2]:
import pandas as pd

# Load the Titanic dataset (make sure the path is correct)
df = pd.read_csv("train.csv")

# Optional: Preview first 5 rows
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [3]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
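A quick sketch of how FamilySize behaves, together with the IsAlone flag derived from it. The `toy` frame is invented, and the vectorised `.astype(int)` form is one possible way to build the flag:

```python
import pandas as pd

# Invented rows: a couple, a solo traveller, and a larger family.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# 1 when the passenger travels alone, 0 otherwise.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy["FamilySize"].tolist())  # [2, 1, 6]
print(toy["IsAlone"].tolist())     # [0, 1, 0]
```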


In [4]:
# Raw string (r'...') so that \. is passed to the regex engine unchanged
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'] = df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                   'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df = pd.get_dummies(df, columns=['Title'], drop_first=True)
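To check what the title pattern captures, here is a sketch on three names from the preview above plus one invented name without a title (the last entry is hypothetical and only there to show the NaN case):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "No Title Here",  # hypothetical entry with no "Word." token
])

# A space, then letters, then a literal dot; the letters are the captured title.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(titles.tolist()[:3])    # ['Mr', 'Miss', 'Mrs']
print(titles.isna().sum())    # 1
```

Names that do not match the pattern yield NaN, so it is worth counting missing titles before one-hot encoding.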


In [5]:
df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
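Because notebook cells tend to get re-run, a drop that tolerates already-missing columns can save a KeyError. A sketch under that assumption, using an invented `toy` frame and the `errors="ignore"` option of `DataFrame.drop`:

```python
import pandas as pd

# Invented frame; note that it has no "Ticket" column.
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["A", "B"],  # placeholder names
    "Fare": [7.25, 71.2833],
})

cols_to_drop = ["PassengerId", "Name", "Ticket"]

# errors="ignore" skips labels that are not present instead of raising.
trimmed = toy.drop(columns=cols_to_drop, errors="ignore")

# Running the same step again is a no-op rather than an error.
trimmed = trimmed.drop(columns=cols_to_drop, errors="ignore")

print(trimmed.columns.tolist())  # ['Fare']
```

Assigning the result to a new name also leaves the original frame untouched, which helps if Name is still needed for title extraction.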



📘 Step 3: Feature Engineering
🚀 Goal: Clean and transform raw data into meaningful features that improve machine learning model performance.

✅ 🔧 Actions You Performed
1. Handled Missing Values
Age → Filled missing values with median (robust to outliers):

    df['Age'] = df['Age'].fillna(df['Age'].median())

Embarked → Filled missing values with mode (most frequent):

    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

2. Converted Categorical Columns
Sex → Converted to numeric (male: 0, female: 1):

    df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

Embarked → One-hot encoded (categories C, Q, S; drop_first=True drops the first one, C):

    df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

3. Created New Features
FamilySize → Total number of family members on board, including the passenger:

    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

IsAlone → Binary feature (1 if alone, 0 otherwise):

    df['IsAlone'] = 0
    df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

4. (Optional) Extracted Titles from Names
Extracted Mr, Mrs, Miss, etc. from names and grouped rare titles (the pattern is a raw string so that \. reaches the regex engine unchanged):

    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Title'] = df['Title'].replace(['Mme'], 'Mrs')
    df['Title'] = df['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Sir', 'Lady', 'Countess', 'Capt', 'Jonkheer', 'Don', 'Dona'], 'Rare')
    df = pd.get_dummies(df, columns=['Title'], drop_first=True)

5. Dropped Unnecessary Columns
Removed ID and text-based columns that don’t help models:

    df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)

🧠 Important Points to Remember
🔑 Point	💡 Insight
Feature engineering	Can improve accuracy and expose patterns the raw columns hide
Fill missing values deliberately	Use median for skewed numeric columns, mode for categorical ones
Encode categorical variables	Use .map() or pd.get_dummies() to get numeric, ML-ready columns
Create useful new features	FamilySize, IsAlone, and Title may add predictive power
Drop irrelevant features	Columns like PassengerId, Name, and Ticket rarely help once Title is extracted
Avoid inplace=True on chained selections	df[col].fillna(..., inplace=True) acts on a copy; assign the result back instead
Keep transformations reusable	The same steps have to be applied to new/test data
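One way to keep the steps reusable is to wrap them in a single function. The sketch below is only an illustration: `engineer_features`, its parameters, and the toy `train` frame are hypothetical, and the toy values are partly altered from the preview rows to create gaps. Passing the fill values in as arguments is an assumption that lets statistics computed on the training data be reused on test data.

```python
import pandas as pd

RARE_TITLES = ["Lady", "Countess", "Capt", "Col", "Don", "Dr",
               "Major", "Rev", "Sir", "Jonkheer", "Dona"]


def engineer_features(raw, age_median, embarked_mode):
    """Hypothetical helper: apply the Step 3 transformations to a copy of `raw`."""
    out = raw.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Embarked"] = out["Embarked"].fillna(embarked_mode)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    title = out["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    title = title.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
    out["Title"] = title.replace(RARE_TITLES, "Rare")
    out = pd.get_dummies(out, columns=["Embarked", "Title"], drop_first=True)
    return out.drop(columns=["PassengerId", "Name", "Ticket"])


# Toy frame; Age and Embarked values are partly altered for illustration.
train = pd.DataFrame({
    "PassengerId": [1, 3, 4, 5],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
             "Allen, Mr. William Henry"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 26.0, 35.0, None],
    "SibSp": [1, 0, 1, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "113803", "373450"],
    "Embarked": ["S", "C", None, "S"],
})

features = engineer_features(
    train,
    age_median=train["Age"].median(),
    embarked_mode=train["Embarked"].mode()[0],
)
print(features["Age"].tolist())      # [22.0, 26.0, 35.0, 26.0]
print(features["IsAlone"].tolist())  # [0, 1, 0, 1]
```

Note that `pd.get_dummies` builds columns from whatever categories are present, so a test set with different titles or ports may need its columns aligned to the training columns afterwards.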