# **Feature Engineering in Machine Learning**

Feature engineering is the process of selecting, transforming, or creating features (input variables) to improve the performance of machine learning models. Well-chosen features often improve model accuracy and can reduce computational cost, which makes feature engineering central to most machine learning projects.

## 1. **What is a Feature?**

- A **feature** is an individual measurable property or characteristic of a phenomenon being observed.
- Features serve as inputs to machine learning models.
- Examples:
  - For a house price prediction model, features might include the size of the house, number of bedrooms, and location.
  - For a text classification task, features might be word counts or TF-IDF scores.

---

## 2. **Importance of Feature Engineering**

- Improves model performance by providing more informative inputs.
- Reduces overfitting by eliminating irrelevant or redundant features.
- Simplifies the problem by reducing dimensionality.
- Enhances interpretability of the model.

---

## 3. **Types of Feature Engineering**

![Feature Engineering Techniques.png](../images/demention_reduction.png)

### 3.1 Feature Selection

- The process of identifying and retaining only the **most relevant features** to improve model performance and reduce complexity.

#### Techniques:

1. **Filter Methods**:
   - Use statistical tests to rank features.
   - **Examples**: Chi-Square Test, ANOVA, Correlation Coefficient.

2. **Wrapper Methods**:
   - Use a machine learning model to evaluate feature subsets.
   - **Examples**: Forward Selection, Backward Elimination, Recursive Feature Elimination (RFE).

3. **Embedded Methods**:
   - Features are selected during the model training process.
   - **Examples**: Lasso Regression, Decision Trees.
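
As a rough illustration of the filter and embedded families above, the sketch below uses scikit-learn's `SelectKBest` (ANOVA F-test) and `SelectFromModel` wrapped around a Lasso. The bundled iris and diabetes datasets, `k=2`, and `alpha=0.1` are assumptions chosen only for this example, not recommendations.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import Lasso

# Filter method: score each feature with the ANOVA F-test and keep the top 2.
X_clf, y_clf = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_filtered = selector.fit_transform(X_clf, y_clf)
filter_idx = selector.get_support(indices=True)
print("Filter method kept columns:", filter_idx)

# Embedded method: the L1 penalty in Lasso shrinks some coefficients to zero
# during training; SelectFromModel keeps features whose coefficients are
# (effectively) non-zero.
X_reg, y_reg = load_diabetes(return_X_y=True)
embedded = SelectFromModel(Lasso(alpha=0.1))  # alpha chosen arbitrarily
X_embedded = embedded.fit_transform(X_reg, y_reg)
print("Embedded method kept", X_embedded.shape[1], "of", X_reg.shape[1], "features")
```

Filter methods never train the downstream model, so they are cheap; embedded methods get selection as a by-product of fitting.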

#### Python Implementation: Feature Selection with Recursive Feature Elimination (RFE)

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Example data
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 0]

# Repeatedly fit the model and drop the weakest feature until 2 remain
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=2)
fit = rfe.fit(X, y)
print(f"Selected Features: {fit.support_}")
print("Feature Rankings: ", fit.ranking_)
```

```
Selected Features: [ True False  True]
Feature Rankings:  [1 2 1]
```


[More About Feature Selection](./8.1%20-%20feature_selection_methods.ipynb)

---

### 3.2 Feature Extraction
- The process of creating new features from raw data to improve its representation.
- [Principal Component Analysis (PCA)](../Unsupervised%20Learning/05%20-%20Principal%20Component%20Analysis%20(PCA).ipynb) is widely used in feature extraction as a dimensionality-reduction technique.
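
A minimal sketch of PCA as a feature extractor, assuming scikit-learn's bundled iris dataset and an arbitrary choice of two components. The standardization step is included because PCA is sensitive to feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The two new columns are linear combinations of the original features, not a subset of them, which is what distinguishes extraction from selection.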

#### Examples:

- **Text Data**:
  - Bag of Words (BoW), TF-IDF, Word Embeddings.
- **Image Data**:
  - Edge detection, feature maps using CNNs.
- **Time Series Data**:
  - Extracting trends, seasonality, and autocorrelations.

---

### 3.3 Feature Transformation
- Modifying features to make them suitable for machine learning algorithms.

#### Common Techniques:

1. **Scaling**:

```python
import numpy as np

# Two features on very different scales
data = np.array([
    [26, 50000],
    [29, 70000],
    [34, 55000],
    [31, 41000]
])
data
```

```
array([[   26, 50000],
       [   29, 70000],
       [   34, 55000],
       [   31, 41000]])
```

   - **Normalization**: Scales data to a range of [0, 1].
   <br><br>
   ![Normalization.png](../images/normalization.png)

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)

normalized_data
```

```
array([[0.        , 0.31034483],
       [0.375     , 1.        ],
       [1.        , 0.48275862],
       [0.625     , 0.        ]])
```

   - **Standardization**: Rescales each feature to have mean 0 and standard deviation 1. <br><br>
   ![Standardization.png](../images/standardization.png)

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

standardized_data
```

```
array([[-1.37198868, -0.3805212 ],
       [-0.34299717,  1.52208478],
       [ 1.37198868,  0.0951303 ],
       [ 0.34299717, -1.23669388]])
```
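
To make the two scaling formulas concrete, the sketch below recomputes them by hand with NumPy and checks the results against the scikit-learn scalers. It reuses the same four-row example array; the variable names `manual_norm` and `manual_std` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[26, 50000], [29, 70000], [34, 55000], [31, 41000]], dtype=float)

# Min-max normalization: (x - min) / (max - min), computed per column.
manual_norm = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# Standardization: (x - mean) / std, computed per column.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
manual_std = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(manual_norm, MinMaxScaler().fit_transform(data)))
print(np.allclose(manual_std, StandardScaler().fit_transform(data)))
```

For example, the second value in the first column normalizes to (29 - 26) / (34 - 26) = 0.375.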

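2. **Log Transformation**:

   - **Reduces right skew in positive-valued data** by compressing large values. Adding 1 before taking the log keeps zero values defined.

```python
# log(1 + x); np.log1p is the numerically safer equivalent of np.log(data + 1)
transformed_data = np.log1p(data)
transformed_data
```

```
array([[ 3.29583687, 10.81979828],
       [ 3.40119738, 11.15626481],
       [ 3.55534806, 10.91510665],
       [ 3.4657359 , 10.62135174]])
```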
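
The effect on skewness is easier to see on a larger sample. The sketch below assumes a synthetic log-normal "income" variable (invented for this example) and compares sample skewness before and after the transform using `scipy.stats.skew`.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=1000)  # synthetic right-skewed data

log_income = np.log1p(income)

skew_before = skew(income)
skew_after = skew(log_income)
print("skew before:", skew_before)
print("skew after: ", skew_after)
```

A skewness near 0 indicates a roughly symmetric distribution; the log-transformed values should sit much closer to 0 than the raw ones.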

3. **Polynomial Features**:
   - **Generate higher-degree features**.

```python
from sklearn.preprocessing import PolynomialFeatures

X = [[2], [3], [4]]

# degree=2 produces the columns [1, x, x^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly)
```

```
[[ 1.  2.  4.]
 [ 1.  3.  9.]
 [ 1.  4. 16.]]
```


4. **Encoding Categorical Variables**:
[Read More](05%20-%20data_preparation.ipynb#encoding-categorical-variables)

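   - **One-Hot Encoding**:
     ```python
     from sklearn.preprocessing import OneHotEncoder

     colors = [["red"], ["green"], ["blue"]]  # example categorical column
     encoder = OneHotEncoder()
     encoded_data = encoder.fit_transform(colors).toarray()  # one 0/1 column per category
     ```
   - **Label Encoding** (intended for target labels rather than input features):
     ```python
     from sklearn.preprocessing import LabelEncoder

     labels = ["cat", "dog", "cat"]  # example target labels
     encoder = LabelEncoder()
     encoded_labels = encoder.fit_transform(labels)  # array([0, 1, 0])
     ```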
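
One practical detail with one-hot encoding is what happens when a category shows up at prediction time that was never seen during fitting. The sketch below assumes a made-up color column and uses `handle_unknown="ignore"`, which encodes unseen categories as an all-zero row instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

train_colors = [["red"], ["green"], ["blue"]]
new_colors = [["red"], ["purple"]]  # "purple" never appeared during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)

encoded_new = encoder.transform(new_colors).toarray()
print(encoder.categories_)  # learned categories, sorted alphabetically
print(encoded_new)          # second row is all zeros for the unseen category
```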

---

## 4. **Data Splitting**

Dividing data into training, validation, and test sets is critical for evaluating model performance and detecting overfitting.

### 4.1 Training Set
- Used to train the model.
- Typically constitutes 60-80% of the data.

### 4.2 Validation Set
- Used to tune hyperparameters and evaluate model performance during training.
- Typically constitutes 10-20% of the data.

### 4.3 Testing Set
- Used to evaluate model performance on unseen data.
- Typically constitutes 10-20% of the data.

#### Example:
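```python
import numpy as np
from sklearn.model_selection import train_test_split

# Example data: 100 samples, 3 features
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

# 70% train; the remaining 30% is split evenly into validation and test (15% each)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
```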
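
Splitting interacts with feature transformation: statistics such as a scaler's mean and standard deviation should be learned from the training set only, then reused on the validation and test sets. The sketch below is one way to arrange that with a scikit-learn pipeline; the iris dataset, the `stratify` argument, and the logistic regression model are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 70% train, 15% validation, 15% test; stratify keeps class proportions similar.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics when scoring the validation and test sets.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print("validation accuracy:", val_accuracy)
print("test accuracy:", model.score(X_test, y_test))
```

Fitting the scaler on the full dataset before splitting would leak information from the validation and test sets into training.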

---

## 5. **Challenges in Feature Engineering**

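- **Curse of Dimensionality**: As the number of features grows, data become sparse relative to the feature space, so models need more samples and overfit more easily.
- **Domain Knowledge Dependency**: Requires understanding of the dataset and problem domain.
- **Computational Complexity**: Large datasets can make feature engineering time-consuming.
- **Collinearity**: Highly correlated features carry redundant information; they can make linear-model coefficients unstable and feature importances harder to interpret.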
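
A simple way to spot collinearity is to inspect pairwise feature correlations. The sketch below builds synthetic data in which one feature is nearly a linear copy of another and flags pairs above a cutoff; the data and the 0.9 threshold are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # nearly a linear copy of x1
x3 = rng.normal(size=200)                      # independent feature
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)  # 3 x 3 feature correlation matrix

threshold = 0.9  # arbitrary cutoff for this example
n_features = X.shape[1]
high_pairs = [
    (i, j, round(float(corr[i, j]), 3))
    for i in range(n_features)
    for j in range(i + 1, n_features)
    if abs(corr[i, j]) > threshold
]
print(high_pairs)  # pairs of column indices with |correlation| above the cutoff
```

Pairwise correlation only catches two-feature relationships; dependence among three or more features needs other diagnostics.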

---

## 6. **Best Practices**

1. Understand the dataset and domain thoroughly.
2. Visualize features to identify patterns and outliers.
3. Use automated tools like `sklearn.feature_selection` for efficiency.
4. Experiment with different feature sets to find the optimal configuration.
5. Regularly validate the impact of engineered features on model performance.
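
Practice 5 can be sketched as a cross-validated comparison of a baseline model against the same model with an engineered feature added. The quadratic synthetic target below is an assumption, chosen so that a squared feature is known to help.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # synthetic quadratic target

baseline = LinearRegression()
engineered = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# Mean R^2 over 5 cross-validation folds, with and without the squared feature.
baseline_r2 = cross_val_score(baseline, X, y, cv=5).mean()
engineered_r2 = cross_val_score(engineered, X, y, cv=5).mean()

print(f"baseline R^2:   {baseline_r2:.3f}")
print(f"engineered R^2: {engineered_r2:.3f}")
```

Putting the feature step inside the pipeline means it is refit within each fold, so the comparison is not inflated by leakage.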

---

## 7. **Advanced Topics**

### Feature Engineering for Deep Learning
- Use **Convolutional Neural Networks (CNNs)** for image feature extraction.
- Use **Recurrent Neural Networks (RNNs)** for sequential data like time series or text.

### Automated Feature Engineering
- Tools: `Featuretools`, `auto-sklearn`, `H2O.ai`.

---

## Summary

| Type                   | Examples                              | Libraries/Tools               |
|------------------------|---------------------------------------|-------------------------------|
| Feature Selection      | RFE, Lasso, Chi-Square               | Scikit-learn, Statsmodels     |
| Feature Extraction     | TF-IDF, CNN Feature Maps             | Scikit-learn, TensorFlow      |
| Feature Transformation | Scaling, Polynomial Features         | Scikit-learn, NumPy           |

---

## References
- [Scikit-learn Feature Selection Guide](https://scikit-learn.org/stable/modules/feature_selection.html)
- [Featuretools Documentation](https://www.featuretools.com/)
