# Lab 6
### Advanced Data Preprocessing and Feature Engineering

In this lab, we focus on key techniques for preparing data and engineering features to enhance the performance of machine learning models. The tasks include:

1. **Handling Categorical Data with One-Hot Encoding**  
   Transform categorical data into numerical form by applying one-hot encoding to the `Color` column.

2. **Normalizing Numerical Data**  
   Normalize numerical features like `Age` and `Salary` using Min-Max scaling to ensure values are in a uniform range.

3. **Removing Highly Correlated Features**  
   Identify and remove redundant features by computing a correlation matrix and eliminating one of the highly correlated variables.

4. **Creating New Features from Existing Data**  
   Derive a new categorical feature `AgeGroup` based on the `Age` column, classifying individuals into groups such as "Young", "Adult", and "Senior".

These techniques are essential for effective data preprocessing, ensuring the data is clean, well-scaled, and relevant for building robust models.


## LAB TASK 1
### Feature Extraction: Handling Categorical Data with One-Hot Encoding
- **Task:** 
  - You are given a dataset with a categorical column `Color` containing values `Red`, `Blue`, and `Green`.
  - Apply one-hot encoding to the `Color` column and add the resulting indicator columns to the dataset.
  
  **Dataset Example:**
  ```python
  data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
  ```


In [1]:
import pandas as pd

# Initial dataset
data = {'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Apply one-hot encoding
one_hot_encoded_df = pd.get_dummies(df, columns=['Color'], prefix='Color')

# Result
print(one_hot_encoded_df)


   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False        False       True
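
Note that `pd.get_dummies(df, columns=['Color'])` replaces the `Color` column with the indicator columns. If the original column should stay next to the new ones, one option is to encode it separately and concatenate the result. The following is a minimal sketch under that assumption; the names `dummies` and `df_with_dummies` are illustrative, and `dtype=int` is an optional choice that yields 0/1 values instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Encode only the Color column; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)

# Keep the original column and append the indicator columns next to it
df_with_dummies = pd.concat([df, dummies], axis=1)

print(df_with_dummies)
```

Each row has exactly one indicator set, so the indicator columns sum to 1 per row.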


## LAB TASK 2
### Feature Extraction: Normalizing Numerical Data

#### Task:
Given the dataset below, normalize the numerical features **Age** and **Salary** using Min-Max scaling.

**Dataset Example:**
```python
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 45],
        'Salary': [50000, 60000, 70000, 80000, 90000]}

# Min-Max scaling function
def min_max_scaling(column):
    min_value = min(column)
    max_value = max(column)
    return [(x - min_value) / (max_value - min_value) for x in column]
```


In [2]:
# Dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 45],
        'Salary': [50000, 60000, 70000, 80000, 90000]}

# Convert to DataFrame
df = pd.DataFrame(data)

# Min-Max scaling function
def min_max_scaling(column):
    min_value = min(column)
    max_value = max(column)
    return [(x - min_value) / (max_value - min_value) for x in column]

# Apply scaling
df['Age'] = min_max_scaling(df['Age'])
df['Salary'] = min_max_scaling(df['Salary'])

# Result
print(df)


      Name   Age  Salary
0    Alice  0.00    0.00
1      Bob  0.25    0.25
2  Charlie  0.50    0.50
3    David  0.75    0.75
4      Eve  1.00    1.00
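
The list-based function above divides by zero when every value in a column is identical. A vectorized variant with a guard for that case might look like the sketch below. The helper name `min_max_scale` is hypothetical, and mapping a constant column to 0.0 is an assumption rather than a fixed convention.

```python
import pandas as pd

def min_max_scale(column: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; a constant column maps to 0.0."""
    value_range = column.max() - column.min()
    if value_range == 0:
        # Assumption: avoid division by zero by returning all zeros
        return pd.Series(0.0, index=column.index, name=column.name)
    return (column - column.min()) / value_range

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 30, 35, 40, 45],
                   'Salary': [50000, 60000, 70000, 80000, 90000]})

# Scale both numeric columns in one step; Name is left untouched
numeric_columns = ['Age', 'Salary']
df[numeric_columns] = df[numeric_columns].apply(min_max_scale)

print(df)
```

Operating on whole Series avoids the Python-level loop and keeps the index aligned with the original DataFrame.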


## LAB TASK 3
### Feature Selection: Removing Highly Correlated Features
- **Task:** 
  - You are given a dataset with three features, **Feature1**, **Feature2**, and **Feature3**. Compute the correlation matrix and, for each pair of highly correlated features (correlation > 0.9 or < -0.9), drop one feature of the pair.
  
- **Dataset Example:**
  ```python
  data = {'Feature1': [1, 2, 3, 4, 5],
          'Feature2': [5, 4, 3, 2, 1],
          'Feature3': [1, 2, 3, 4, 5],
          'Target': [0, 1, 0, 1, 0]}
  ```


In [3]:
import pandas as pd

# Dataset
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [5, 4, 3, 2, 1],
        'Feature3': [1, 2, 3, 4, 5],
        'Target': [0, 1, 0, 1, 0]}
df = pd.DataFrame(data)

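# Compute the correlation matrix for the features only (the target is excluded)
correlation_matrix = df.drop(columns=['Target']).corr()

# Visit each pair of features once and mark the second feature of any
# highly correlated pair (correlation > 0.9 or < -0.9) for removal
columns = list(correlation_matrix.columns)
dropped = set()
for pos, i in enumerate(columns):
    if i in dropped:
        continue  # i is already being removed, so it cannot justify another drop
    for j in columns[pos + 1:]:
        if abs(correlation_matrix.loc[i, j]) > 0.9:
            dropped.add(j)

# Drop one feature from each highly correlated pair
df_reduced = df.drop(columns=sorted(dropped))

# Result
print(df_reduced)


   Feature1  Target
0         1       0
1         2       1
2         3       0
3         4       1
4         5       0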
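
Another way to express this selection is to inspect only the strict upper triangle of the absolute correlation matrix, so that each pair is considered once and the first feature of a correlated group is kept. The sketch below assumes the same dataset; `threshold`, `upper`, and `to_drop` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Feature1': [1, 2, 3, 4, 5],
                   'Feature2': [5, 4, 3, 2, 1],
                   'Feature3': [1, 2, 3, 4, 5],
                   'Target': [0, 1, 0, 1, 0]})

threshold = 0.9

# Absolute correlations between features only (the target is left out)
corr = df.drop(columns=['Target']).corr().abs()

# Keep the strict upper triangle so each pair is inspected once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)
print(df_reduced.columns.tolist())
```

Because the mask removes the diagonal and the lower triangle, a feature is never compared with itself and no pair is counted twice.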


## LAB TASK 4
 
### Feature Extraction: Creating New Features from Existing Data
- **Task:** 
  - Given the dataset, create a new feature **AgeGroup** based on the **Age** column, where:
    - If **Age** is less than 30, the **AgeGroup** is "Young".
    - If **Age** is between 30 and 50 (inclusive), the **AgeGroup** is "Adult".
    - If **Age** is greater than 50, the **AgeGroup** is "Senior".
  
- **Dataset Example:**
  ```python
  data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
          'Age': [25, 30, 35, 40, 55]}
  ```




In [4]:
import pandas as pd

# Dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 55]}
df = pd.DataFrame(data)

# Create AgeGroup feature
def categorize_age(age):
    if age < 30:
        return "Young"
    elif 30 <= age <= 50:
        return "Adult"
    else:
        return "Senior"

df['AgeGroup'] = df['Age'].apply(categorize_age)

# Result
print(df)



      Name  Age AgeGroup
0    Alice   25    Young
1      Bob   30    Adult
2  Charlie   35    Adult
3    David   40    Adult
4      Eve   55   Senior
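
For larger datasets, the row-by-row `apply` can be replaced with a vectorized rule. One possible sketch uses `numpy.select`, which evaluates the conditions in order and takes the first match. The variable names `conditions` and `labels` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 30, 35, 40, 55]})

# Conditions are checked in order; the first one that holds wins
conditions = [df['Age'] < 30, df['Age'] <= 50]
labels = ['Young', 'Adult']

# Rows matching neither condition (Age > 50) fall through to "Senior"
df['AgeGroup'] = np.select(conditions, labels, default='Senior')

print(df)
```

Since the second condition is only reached when `Age` is at least 30, it covers the inclusive range 30 to 50 without restating the lower bound.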
