# Feature Engineering: Creating New Features

## What is Feature Engineering?

Feature engineering is the process of transforming raw data into meaningful features that help improve the performance of machine learning models. In other words, you take the information already in your data and reshape it so that a model can learn from it more easily.

Imagine you're trying to predict the price of a house. You have a lot of raw information like the size of the house, number of rooms, and the year it was built. But sometimes, just using this information directly may not give the best results. Feature engineering helps us to create new pieces of information (features) from the existing data that might be more useful for making predictions.

## Why is Feature Engineering Important?

- **Improves Model Performance**: By creating better features, we give the model more useful information, which helps it to make better predictions.
- **Simplifies the Problem**: It can make the data easier to interpret, so the model can focus on the most important patterns.
- **Faster Training**: Sometimes a few well-chosen features can stand in for many raw ones, so the model has less data to process and can train faster.

## Example of Feature Engineering

Let’s consider a simple dataset for predicting the price of a house:

| House Size (sq ft) | Number of Rooms | Year Built |
|-------------------|----------------|------------|
| 1000              | 3              | 2000       |
| 1500              | 4              | 2010       |
| 800               | 2              | 1995       |
| 1200              | 3              | 2015       |

Now, suppose we want to predict the price of these houses based on the above features. The model might struggle because the number of rooms and the year built might not directly relate to the price. So, we can create **new features** that might give better insights. Here’s how:

### 1. **Age of the House**
Instead of using the "Year Built" directly, we can create a new feature that shows how old the house is:

- **Age of House** = Current Year - Year Built

For example, if the current year is 2024:
- The house built in 2000 is **24 years old**.
- The house built in 2010 is **14 years old**.

### 2. **Rooms Per Square Foot**
Another useful feature can be the number of rooms relative to the size of the house. This can show how spacious the house feels.

- **Rooms per Square Foot** = Number of Rooms / House Size (in sq ft)

For example:
- House 1: 3 rooms / 1000 sq ft = 0.003 rooms per sq ft
- House 2: 4 rooms / 1500 sq ft = 0.0027 rooms per sq ft

### Updated Table with New Features

| House Size (sq ft) | Number of Rooms | Year Built | Age of House | Rooms per Square Foot |
|-------------------|----------------|------------|--------------|-----------------------|
| 1000              | 3              | 2000       | 24           | 0.003                 |
| 1500              | 4              | 2010       | 14           | 0.0027                |
| 800               | 2              | 1995       | 29           | 0.0025                |
| 1200              | 3              | 2015       | 9            | 0.0025                |

## How Does This Help?

- **Age of House**: Older houses might have lower prices, and newer houses might be more expensive. This new feature helps capture that trend.
- **Rooms per Square Foot**: This tells the model how the floor area is divided up. A lower value means fewer, larger rooms for the same area, and such houses might be more expensive because they feel more spacious.

## Summary

Feature engineering is about thinking creatively to create new features from existing data that might help your model make better predictions. It requires understanding the problem you're solving and using your judgment to transform the data in a way that adds more value.

By creating new features like the age of the house or rooms per square foot, you can give your model better insights to make more accurate predictions!


```python
import pandas as pd
from datetime import datetime

# Creating the original dataset
data = {
    'House Size (sq ft)': [1000, 1500, 800, 1200],
    'Number of Rooms': [3, 4, 2, 3],
    'Year Built': [2000, 2010, 1995, 2015]
}

# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)

# Show the original DataFrame
print("Original DataFrame:")
print(df)

# Feature Engineering

# 1. Create 'Age of House' by subtracting 'Year Built' from the current year
# (the output shown below was produced in 2024; the ages will differ in later years)
current_year = datetime.now().year
df['Age of House'] = current_year - df['Year Built']

# 2. Create 'Rooms per Square Foot' by dividing 'Number of Rooms' by 'House Size (sq ft)'
df['Rooms per Square Foot'] = df['Number of Rooms'] / df['House Size (sq ft)']

# Show the DataFrame with new features
print("\nDataFrame with New Features:")
print(df)
```


```
Original DataFrame:
   House Size (sq ft)  Number of Rooms  Year Built
0                1000                3        2000
1                1500                4        2010
2                 800                2        1995
3                1200                3        2015

DataFrame with New Features:
   House Size (sq ft)  Number of Rooms  Year Built  Age of House  \
0                1000                3        2000            24
1                1500                4        2010            14
2                 800                2        1995            29
3                1200                3        2015             9

   Rooms per Square Foot
0               0.003000
1               0.002667
2               0.002500
3               0.002500
```


# Feature Selection: Removing Irrelevant Features

## What is Feature Selection?

**Feature Selection** is the process of choosing the **most important** and **relevant** pieces of information (features) from your dataset while **removing** the **irrelevant**, **redundant**, or **noisy** features.

Think of it like cooking a dish: if you have a lot of ingredients, some of them may not help the taste or may even spoil the dish. You only need to keep the **important ingredients** and remove the **unnecessary ones**. In machine learning, these "important ingredients" are the **features** that help predict the target you're interested in (like house prices, product sales, etc.).

For example, if you're trying to predict the **price of a house**, features like the **size of the house**, **number of rooms**, and **location** are important. However, features like **color of the door** or **name of the street** may not help in predicting the price, so they can be removed.

---

## Why is Feature Selection Important?

1. **Improves Accuracy**: Irrelevant features add noise. Keeping only the relevant ones can help your model make more accurate predictions.
2. **Reduces Overfitting**: If you use too many features, your model might get too specific to the training data and fail to work well on new, unseen data. Removing irrelevant features reduces this risk.
3. **Speeds up the Model**: Fewer features mean the model can be trained and tested faster.
4. **Simplifies the Model**: A simpler model is easier to understand, explain, and interpret, especially for non-technical people.

---

## How to Perform Feature Selection?

Feature selection can be done using different methods. Here are a few simple techniques to help you understand how this works.

### 1. **Removing Irrelevant Features**
Some features do not affect the target variable (the thing you're trying to predict). For example, if you are predicting the price of a house, the **color of the door** may not have any impact on the price, so it's irrelevant. In this case, you would **remove** this feature.

**Example**: 

If you have the following dataset:

| House Size (sq ft) | Number of Rooms | Year Built | Color of the Door | House Price |
|--------------------|-----------------|------------|-------------------|-------------|
| 1000               | 3               | 2000       | Red               | $300,000    |
| 1500               | 4               | 2010       | Blue              | $400,000    |
| 800                | 2               | 1995       | Green             | $250,000    |
| 1200               | 3               | 2015       | Yellow            | $350,000    |

Here, **Color of the Door** is unlikely to tell us anything about the price of the house, so it can be removed.
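
As a minimal sketch, assuming the table above has been loaded into a pandas DataFrame (the variable names `houses` and `houses_relevant` are illustrative and not part of the original example), removing the column could look like this:

```python
import pandas as pd

# The example table above, rebuilt as a DataFrame
houses = pd.DataFrame({
    "House Size (sq ft)": [1000, 1500, 800, 1200],
    "Number of Rooms": [3, 4, 2, 3],
    "Year Built": [2000, 2010, 1995, 2015],
    "Color of the Door": ["Red", "Blue", "Green", "Yellow"],
    "House Price": [300000, 400000, 250000, 350000],
})

# Drop the column we judged irrelevant; drop() returns a new DataFrame
houses_relevant = houses.drop(columns=["Color of the Door"])

print(houses_relevant.columns.tolist())
```

Because `drop` returns a new DataFrame, the original `houses` still contains the column if you need it later.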

---

### 2. **Removing Redundant Features**
Some features are just a duplicate or a rescaled version of another feature. For example, if you have both **"Size in square feet"** and **"Size in square meters"**, one is simply a fixed multiple of the other (1 square meter is about 10.76 square feet). Keeping both is redundant because they carry the same information, so you should keep just one.

**Example**: 

If you have both **"Height"** (recorded in one unit) and **"Height in meters"**, you can remove one of them because they give the same information in different units.
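
One way to spot this kind of redundancy is to check how strongly two columns are correlated. The sketch below assumes hypothetical column names `Size (sq ft)` and `Size (sq m)` and builds the second column from the first, so the two are perfectly correlated by construction:

```python
import pandas as pd

SQFT_PER_SQM = 10.7639  # approximate conversion factor

# Hypothetical data: the same house sizes expressed in two units
sizes = pd.DataFrame({"Size (sq ft)": [1000, 1500, 800, 1200]})
sizes["Size (sq m)"] = sizes["Size (sq ft)"] / SQFT_PER_SQM

# A linear rescaling gives a Pearson correlation of 1 (up to rounding error)
corr = sizes["Size (sq ft)"].corr(sizes["Size (sq m)"])
print(round(corr, 6))

# The columns carry the same information, so keep only one of them
sizes_deduplicated = sizes.drop(columns=["Size (sq m)"])
print(sizes_deduplicated.columns.tolist())
```

In real data the correlation is rarely exactly 1, so how close is "close enough" to drop a column is a judgment call.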

---

### 3. **Using Domain Knowledge**
Domain knowledge refers to your understanding of the problem you're working on. If you know what affects the target variable, you can make better decisions about which features are important.

**Example**: In predicting house prices:
- **Number of Rooms** and **House Size** are likely important.
- **Age of the House** may be important (because older houses might be cheaper).
- **Location** might be a key factor.
- **Color of the Door** might not be important, so it can be removed.

By applying what you know about the problem, you can make better choices about which features to keep.

---

### 4. **Checking Feature Importance with Statistical Tests**
In some cases, you can apply statistical tests to measure how strongly each feature is related to the target variable (what you're predicting). A feature that shows no meaningful relationship with the target is a candidate for removal, although a feature that looks weak on its own can still be useful in combination with others.

For example:
- If you’re predicting **house prices**, you could run tests to see which features like **size**, **number of rooms**, and **year built** are most important in predicting the price.
- Features that have **no significant relationship** with the target can be dropped.
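
A hedged sketch of one such test uses scikit-learn's `f_regression`, which runs a univariate F-test between each feature and the target. It reuses the four example houses from this page. With only four rows the p-values are not reliable, so treat this purely as an illustration of the mechanics (the variable names are illustrative):

```python
import pandas as pd
from sklearn.feature_selection import f_regression

# The numeric columns of the example dataset
houses = pd.DataFrame({
    "House Size (sq ft)": [1000, 1500, 800, 1200],
    "Number of Rooms": [3, 4, 2, 3],
    "Year Built": [2000, 2010, 1995, 2015],
    "House Price": [300000, 400000, 250000, 350000],
})

X = houses.drop(columns=["House Price"])  # input features
y = houses["House Price"]                 # target

# One F score and one p-value per feature, each tested on its own
f_scores, p_values = f_regression(X, y)

# Rank features from strongest to weakest linear relationship with the target
results = pd.DataFrame(
    {"F score": f_scores, "p-value": p_values}, index=X.columns
).sort_values("F score", ascending=False)
print(results)
```

A higher F score (and lower p-value) suggests a stronger linear relationship with the price. The test looks at each feature separately, so it cannot detect features that only matter in combination.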

---

## Example to Understand Feature Selection

Let’s break it down with an example:

We have a dataset of houses, and we want to predict the **price of the house** based on different features:

| House Size (sq ft) | Number of Rooms | Year Built | Color of the Door | Distance to Nearest School | House Price |
|--------------------|-----------------|------------|-------------------|----------------------------|-------------|
| 1000               | 3               | 2000       | Red               | 1 km                       | $300,000    |
| 1500               | 4               | 2010       | Blue              | 2 km                       | $400,000    |
| 800                | 2               | 1995       | Green             | 0.5 km                     | $250,000    |
| 1200               | 3               | 2015       | Yellow            | 1.5 km                     | $350,000    |

### Step 1: **Remove Irrelevant Features**
- **Color of the Door**: This feature is **irrelevant** to predicting house prices, so it can be removed.

### Step 2: **Check for Redundancy**
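- **House Size** and **Number of Rooms** are related. Generally, larger houses tend to have more rooms. If they provided nearly identical information, we might decide to keep just **House Size** or **Number of Rooms**, depending on which one seems more useful. In this example they are related but not duplicates (the 1000 sq ft and 1200 sq ft houses both have 3 rooms), so we keep both.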

### Step 3: **Domain Knowledge**
- We know that **Location** (e.g., Distance to Nearest School) and **Year Built** could affect the house price. Therefore, we decide to **keep** them.

### Step 4: **Final Features**
After applying feature selection, our dataset might look like this:

| House Size (sq ft) | Number of Rooms | Year Built | Distance to Nearest School | House Price |
|--------------------|-----------------|------------|----------------------------|-------------|
| 1000               | 3               | 2000       | 1 km                       | $300,000    |
| 1500               | 4               | 2010       | 2 km                       | $400,000    |
| 800                | 2               | 1995       | 0.5 km                     | $250,000    |
| 1200               | 3               | 2015       | 1.5 km                     | $350,000    |

Now, we have only the **relevant features** to predict **house price**.
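
The four steps above might be sketched in pandas as follows. The distance column is stored as a plain number of kilometers (an assumption; the table shows it as text such as "1 km"), and the variable names are illustrative:

```python
import pandas as pd

houses = pd.DataFrame({
    "House Size (sq ft)": [1000, 1500, 800, 1200],
    "Number of Rooms": [3, 4, 2, 3],
    "Year Built": [2000, 2010, 1995, 2015],
    "Color of the Door": ["Red", "Blue", "Green", "Yellow"],
    "Distance to Nearest School (km)": [1.0, 2.0, 0.5, 1.5],
    "House Price": [300000, 400000, 250000, 350000],
})

# Step 1: remove the feature judged irrelevant
selected = houses.drop(columns=["Color of the Door"])

# Step 2: check how strongly size and room count move together
size_rooms_corr = selected["House Size (sq ft)"].corr(selected["Number of Rooms"])
print(f"Size vs. rooms correlation: {size_rooms_corr:.3f}")

# Steps 3 and 4: keep the remaining features and separate out the target
features = selected.drop(columns=["House Price"])
target = selected["House Price"]
print(features.columns.tolist())
```

Size and room count are strongly but not perfectly correlated here, so this sketch keeps both, matching the final table above.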

---

## Conclusion

Feature selection is an important step in building a machine learning model because it helps you focus on the **most important information**. By removing irrelevant or redundant features, you can:
- Improve the accuracy of your model.
- Reduce the time it takes to train the model.
- Make the model simpler and easier to understand.

By using your **domain knowledge**, **statistical methods**, and logical reasoning, you can carefully choose the best features for your machine learning task!


```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import LabelEncoder

# Sample dataset with a feature that has zero variance
# (unlike the tables above, every door here is 'Red', so the column never changes)
data = {
    'House Size (sq ft)': [1000, 1500, 800, 1200],
    'Number of Rooms': [3, 4, 2, 3],
    'Year Built': [2000, 2010, 1995, 2015],
    # Zero variance: All values are 'Red'
    'Color of the Door': ['Red', 'Red', 'Red', 'Red'],
    'Distance to Nearest School': [1, 2, 0.5, 1.5],
    'House Price': [300000, 400000, 250000, 350000]
}

# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(data)

# Show the original DataFrame
print("Original DataFrame:")
print(df)

# Encode the categorical column ('Color of the Door') as integers,
# because VarianceThreshold only works on numeric data
label_encoder = LabelEncoder()
df['Color of the Door'] = label_encoder.fit_transform(df['Color of the Door'])

# Feature Selection: Remove features with zero variance
# Exclude 'House Price' as it's the target
input_features = df.drop(columns=['House Price'])

# Apply VarianceThreshold to remove features with zero variance
selector = VarianceThreshold(threshold=0)  # Remove features with variance = 0
# fit_transform returns a NumPy array, so integer columns come back as floats
df_selected = selector.fit_transform(input_features)

# Convert the result back to a DataFrame, using only the remaining columns
df_selected = pd.DataFrame(
    df_selected, columns=input_features.columns[selector.get_support()])

# Show the DataFrame after dropping the zero-variance feature
print("\nDataFrame after Feature Selection (with VarianceThreshold):")
print(df_selected)

# Display the target variable separately (House Price)
target = df['House Price']
print("\nHouse Price (Target):")
print(target)
```

Original DataFrame:
   House Size (sq ft)  Number of Rooms  Year Built Color of the Door  \
0                1000                3        2000               Red   
1                1500                4        2010               Red   
2                 800                2        1995               Red   
3                1200                3        2015               Red   

   Distance to Nearest School  House Price  
0                         1.0       300000  
1                         2.0       400000  
2                         0.5       250000  
3                         1.5       350000  

DataFrame after Feature Selection (with VarianceThreshold):
   House Size (sq ft)  Number of Rooms  Year Built  Distance to Nearest School
0              1000.0              3.0      2000.0                         1.0
1              1500.0              4.0      2010.0                         2.0
2               800.0              2.0      1995.0                         0.5
3              12