# **Feature Engineering**
## **Objective**
Feature engineering enhances the dataset by creating meaningful new features, encoding categorical data, and preparing the dataset for machine learning.

## **Key Steps**
- Create new features
- Encode categorical variables
- Normalize and scale numerical data
- Select important features

Let's start by loading the dataset.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load dataset
file_path = "METABRIC_RNA_Mutation.csv"
df = pd.read_csv(file_path)

# Display initial dataset shape
df.shape


(1904, 693)

## **Creating New Features**
We derive two new features from existing columns: an age bracket (`age_group`) and the mutation count per year of age (`mutation_rate`).

In [2]:
# Example: bin age at diagnosis into 20-year brackets
# (pd.cut intervals are right-closed by default, e.g. (20, 40])
df['age_group'] = pd.cut(df['age_at_diagnosis'], bins=[20, 40, 60, 80, 100], labels=['20-40', '40-60', '60-80', '80-100'])

# Example: mutations per year of age; rows with a missing mutation_count yield NaN
df['mutation_rate'] = df['mutation_count'] / (df['age_at_diagnosis'] + 1)  # +1 avoids division by zero

# Display new features
df[['age_group', 'mutation_rate']].head()


| | age_group | mutation_rate |
|---|---|---|
| 0 | 60-80 | NaN |
| 1 | 40-60 | 0.045259 |
| 2 | 40-60 | 0.040104 |
| 3 | 40-60 | 0.020542 |
| 4 | 60-80 | 0.025651 |
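
Row 0 above has no `mutation_rate` because its `mutation_count` is missing. The following is a minimal sketch of one way to handle that before deriving the ratio. It assumes median imputation is acceptable for this column, which the notebook does not state. The `toy` frame and the `mutation_count_filled` column are hypothetical stand-ins, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two source columns (illustrative values only)
toy = pd.DataFrame({
    "age_at_diagnosis": [75.65, 43.19, 48.87, 47.68, 76.97],
    "mutation_count": [np.nan, 2.0, 2.0, 1.0, 2.0],
})

# Count missing inputs before deriving anything from them
missing_counts = toy[["age_at_diagnosis", "mutation_count"]].isna().sum()

# Assumption: filling gaps with the column median is acceptable here
median_count = toy["mutation_count"].median()
toy["mutation_count_filled"] = toy["mutation_count"].fillna(median_count)

# Same ratio as in the cell above, now defined for every row
toy["mutation_rate"] = toy["mutation_count_filled"] / (toy["age_at_diagnosis"] + 1)
```

Whether to impute, drop the affected rows, or leave the gaps for a model that tolerates missing values depends on the downstream task.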


## **Encoding Categorical Variables**
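We convert categorical variables into integer codes for modeling. `LabelEncoder` assigns codes in sorted order of the string labels, and because each column is first cast with `astype(str)`, missing values become the literal string `"nan"` and receive their own code.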

In [3]:
# Encode categorical variables using Label Encoding
cat_cols = ['type_of_breast_surgery', 'cancer_type', 'hormone_therapy', 'chemotherapy', 'age_group']
label_encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# Check transformed categorical variables
df[cat_cols].head()


| | type_of_breast_surgery | cancer_type | hormone_therapy | chemotherapy | age_group |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 2 |
| 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 1 | 1 |
| 3 | 1 | 0 | 1 | 1 | 1 |
| 4 | 1 | 0 | 1 | 1 | 2 |
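
Integer codes imply an ordering, which suits `age_group` but is arbitrary for nominal columns such as `type_of_breast_surgery`. As a possible alternative (not the approach used in this notebook), the sketch below one-hot encodes a nominal column with `pd.get_dummies` and encodes the ordinal column with an explicit category order. The `toy` frame, its category values, and the `surgery` prefix are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in with one nominal and one ordinal column (illustrative values only)
toy = pd.DataFrame({
    "type_of_breast_surgery": ["MASTECTOMY", "BREAST CONSERVING", "MASTECTOMY", None],
    "age_group": ["60-80", "40-60", "40-60", "60-80"],
})

# Nominal column: one indicator per category, plus one for missing values
surgery_dummies = pd.get_dummies(
    toy["type_of_breast_surgery"], prefix="surgery", dummy_na=True, dtype=int
)

# Ordinal column: state the order explicitly so the codes follow age
age_order = ["20-40", "40-60", "60-80", "80-100"]
toy["age_group_code"] = pd.Categorical(
    toy["age_group"], categories=age_order, ordered=True
).codes

encoded = pd.concat([toy[["age_group_code"]], surgery_dummies], axis=1)
```

One-hot encoding adds a column per category, so it is best kept to columns with few distinct values.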


## **Normalizing and Scaling Numerical Features**
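We standardize numerical features so that each has zero mean and unit variance. `StandardScaler` does not bound values to a fixed range; it puts the columns on a comparable scale. Missing values are ignored during fitting and remain missing in the output, as in row 0 below.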

In [4]:
# Scale numerical features using StandardScaler
scaler = StandardScaler()
num_cols = ['age_at_diagnosis', 'mutation_count', 'overall_survival_months', 'mutation_rate']
df[num_cols] = scaler.fit_transform(df[num_cols])

# Display scaled data
df[num_cols].head()


| | age_at_diagnosis | mutation_count | overall_survival_months | mutation_rate |
|---|---|---|---|---|
| 0 | 1.122359 | NaN | 0.201518 | NaN |
| 1 | -1.379317 | -0.91128 | -0.530544 | -0.706721 |
| 2 | -0.941562 | -0.91128 | 0.505525 | -0.777933 |
| 3 | -1.033275 | -1.157725 | 0.521686 | -1.048173 |
| 4 | 1.224091 | -0.91128 | -1.097499 | -0.9776 |
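
To check what the scaling step does, the following sketch applies `StandardScaler` to a small toy frame and inspects the result. The `toy` values are illustrative assumptions rather than the full dataset. The checks show the per-column mean and standard deviation after scaling, that missing values pass through unchanged, and that `inverse_transform` recovers the original units.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two of the numeric columns (illustrative values only)
toy = pd.DataFrame({
    "age_at_diagnosis": [75.65, 43.19, 48.87, 47.68, 76.97],
    "mutation_count": [np.nan, 2.0, 2.0, 1.0, 2.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Each column now has mean ~0 and population standard deviation ~1
col_means = scaled.mean()
col_stds = scaled.std(ddof=0)

# Missing values are skipped when fitting and stay missing after transforming
still_missing = scaled["mutation_count"].isna().sum()

# The fitted statistics allow mapping back to the original units
restored = scaler.inverse_transform(scaled)
```

Keeping the fitted `scaler` object around matters if the same transformation has to be applied to new data later.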
