
# Feature Engineering Exercise

In this notebook, you will practice various feature engineering techniques.
You'll explore methods such as missing value imputation, encoding categorical variables, scaling, feature selection, and more.

## Dataset
The dataset describes individual trees and the environment they grow in. It contains the following columns:
- `Tree_Type`: Type of tree (e.g., Oak, Pine, Maple)
- `Height`: Height of the tree (in meters)
- `Age`: Age of the tree (in years)
- `Leaf_Size`: Average size of the leaves (in cm²)
- `Temperature`: Average temperature in the region where the tree grows (in Celsius)
- `Rainfall`: Average annual rainfall in the region (in mm)
- `Soil_Type`: Type of soil (e.g., Loam, Sand, Clay)
- `Health`: Overall health of the tree (Good, Moderate, Poor)
- `Observation_Date`: Date when the tree was last observed
- `Latitude`: Latitude of the region
- `Longitude`: Longitude of the region

Each exercise will guide you through a specific feature engineering technique. Answer the questions to the best of your ability.

## Dataset Creation

```python
import pandas as pd
import numpy as np
import datetime as dt

data = {
    'Tree_Type': ['Oak', 'Pine', 'Maple', 'Oak', 'Maple', 'Pine', 'Oak', 'Maple', 'Pine', 'Oak'],
    'Height': [25, np.nan, 35, 40, 30, 20, np.nan, 33, 29, 38],
    'Age': [100, 50, 80, 120, 70, 60, np.nan, 85, 40, 110],
    'Leaf_Size': [200, 150, 180, np.nan, 210, 160, 170, 190, 140, 220],
    'Temperature': [15, 10, 12, 17, 16, 9, 14, 13, 11, np.nan],
    'Rainfall': [500, 600, 550, 620, 580, np.nan, 540, 570, 610, 590],
    'Soil_Type': ['Loam', 'Sand', 'Clay', 'Loam', 'Clay', 'Sand', 'Loam', 'Clay', 'Sand', 'Loam'],
    'Health': ['Good', 'Moderate', 'Good', 'Poor', 'Moderate', 'Good', 'Poor', 'Good', 'Moderate', 'Good'],
    'Observation_Date': pd.date_range(start='1/1/2020', periods=10, freq='M'),
    'Latitude': [52.1, 46.2, 47.5, 49.8, 48.9, 50.3, 51.7, 47.9, 45.6, 49.3],
    'Longitude': [-1.3, -0.7, 0.5, -1.0, 1.3, -0.9, -1.1, 0.2, -0.8, 0.4]
}

df = pd.DataFrame(data)
df
```

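|   | Tree_Type | Height | Age | Leaf_Size | Temperature | Rainfall | Soil_Type | Health | Observation_Date | Latitude | Longitude |
|---|-----------|--------|-----|-----------|-------------|----------|-----------|--------|------------------|----------|-----------|
| 0 | Oak | 25.0 | 100.0 | 200.0 | 15.0 | 500.0 | Loam | Good | 2020-01-31 | 52.1 | -1.3 |
| 1 | Pine | NaN | 50.0 | 150.0 | 10.0 | 600.0 | Sand | Moderate | 2020-02-29 | 46.2 | -0.7 |
| 2 | Maple | 35.0 | 80.0 | 180.0 | 12.0 | 550.0 | Clay | Good | 2020-03-31 | 47.5 | 0.5 |
| 3 | Oak | 40.0 | 120.0 | NaN | 17.0 | 620.0 | Loam | Poor | 2020-04-30 | 49.8 | -1.0 |
| 4 | Maple | 30.0 | 70.0 | 210.0 | 16.0 | 580.0 | Clay | Moderate | 2020-05-31 | 48.9 | 1.3 |
| 5 | Pine | 20.0 | 60.0 | 160.0 | 9.0 | NaN | Sand | Good | 2020-06-30 | 50.3 | -0.9 |
| 6 | Oak | NaN | NaN | 170.0 | 14.0 | 540.0 | Loam | Poor | 2020-07-31 | 51.7 | -1.1 |
| 7 | Maple | 33.0 | 85.0 | 190.0 | 13.0 | 570.0 | Clay | Good | 2020-08-31 | 47.9 | 0.2 |
| 8 | Pine | 29.0 | 40.0 | 140.0 | 11.0 | 610.0 | Sand | Moderate | 2020-09-30 | 45.6 | -0.8 |
| 9 | Oak | 38.0 | 110.0 | 220.0 | NaN | 590.0 | Loam | Good | 2020-10-31 | 49.3 | 0.4 |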



## Exercise 1: Missing Value Imputation
Perform missing value imputation on the `Height`, `Age`, `Leaf_Size`, `Temperature`, and `Rainfall` columns.

- Use mean imputation for numerical columns.
- Think about whether a different strategy might be better for any specific column (e.g., median or mode).
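
As a starting point, here is a minimal sketch of mean imputation next to median imputation, using only pandas. The variable names `df_mean` and `df_median` are illustrative choices, and the frame is rebuilt here with just the affected columns so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the exercise dataset, restricted to the columns with gaps.
df = pd.DataFrame({
    'Height': [25, np.nan, 35, 40, 30, 20, np.nan, 33, 29, 38],
    'Age': [100, 50, 80, 120, 70, 60, np.nan, 85, 40, 110],
    'Leaf_Size': [200, 150, 180, np.nan, 210, 160, 170, 190, 140, 220],
    'Temperature': [15, 10, 12, 17, 16, 9, 14, 13, 11, np.nan],
    'Rainfall': [500, 600, 550, 620, 580, np.nan, 540, 570, 610, 590],
})
num_cols = ['Height', 'Age', 'Leaf_Size', 'Temperature', 'Rainfall']

# Mean imputation: each NaN is replaced by the mean of its own column.
df_mean = df.copy()
df_mean[num_cols] = df_mean[num_cols].fillna(df_mean[num_cols].mean())

# Median imputation for comparison: less sensitive to extreme values.
df_median = df.copy()
df_median[num_cols] = df_median[num_cols].fillna(df_median[num_cols].median())

print(df_mean.isna().sum())
```

With only ten rows the mean and median are close for most columns, so comparing the two filled frames side by side is a quick way to judge whether the choice matters here.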


## Exercise 2: Binning
Create bins for the `Height` column and group the trees into categories: 'Short', 'Medium', and 'Tall'.

- What binning strategy would you choose?
- How does binning help in feature engineering?
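
One possible approach, sketched below, uses `pd.cut` for equal-width bins and `pd.qcut` for equal-frequency bins. Which one suits the data is part of the question; the column names `Height_Bin` and `Height_QBin` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Height': [25, np.nan, 35, 40, 30, 20, np.nan, 33, 29, 38]})
labels = ['Short', 'Medium', 'Tall']

# Equal-width binning: the observed range (20 to 40 m) is split into
# three intervals of the same width. Missing heights stay missing.
df['Height_Bin'] = pd.cut(df['Height'], bins=3, labels=labels)

# Equal-frequency binning: each bin receives roughly the same number of trees.
df['Height_QBin'] = pd.qcut(df['Height'], q=3, labels=labels)

print(df)
```

Missing heights remain `NaN` in both binned columns, so imputation (Exercise 1) would normally come first.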


## Exercise 3: One-Hot Encoding
Perform one-hot encoding on the `Tree_Type` and `Soil_Type` columns.

- How does one-hot encoding affect the number of features in the dataset?
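
A minimal sketch with `pd.get_dummies` follows. The names `encoded` and `encoded_reduced` are illustrative; `drop_first=True` is shown as one common way to avoid a redundant column per variable.

```python
import pandas as pd

df = pd.DataFrame({
    'Tree_Type': ['Oak', 'Pine', 'Maple', 'Oak', 'Maple', 'Pine', 'Oak', 'Maple', 'Pine', 'Oak'],
    'Soil_Type': ['Loam', 'Sand', 'Clay', 'Loam', 'Clay', 'Sand', 'Loam', 'Clay', 'Sand', 'Loam'],
})

# One indicator column per category: 3 tree types + 3 soil types = 6 columns.
encoded = pd.get_dummies(df, columns=['Tree_Type', 'Soil_Type'], dtype=int)

# Dropping the first level of each variable leaves 2 + 2 = 4 columns;
# the dropped level is implied when all remaining indicators are 0.
encoded_reduced = pd.get_dummies(
    df, columns=['Tree_Type', 'Soil_Type'], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded_reduced.columns.tolist())
```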


## Exercise 4: Label Encoding
Perform label encoding on the `Health` column.

- How does label encoding differ from one-hot encoding, and when is it more appropriate?
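
Because `Health` has a natural order, one option is an explicit mapping, sketched below. The integer codes (Poor = 0, Moderate = 1, Good = 2) are an assumption made for illustration. Note that scikit-learn's `LabelEncoder` would instead number the classes in sorted (alphabetical) order, which does not match the health ranking.

```python
import pandas as pd

df = pd.DataFrame({
    'Health': ['Good', 'Moderate', 'Good', 'Poor', 'Moderate',
               'Good', 'Poor', 'Good', 'Moderate', 'Good'],
})

# Assumed ordering: a higher code means a healthier tree.
health_order = {'Poor': 0, 'Moderate': 1, 'Good': 2}
df['Health_Encoded'] = df['Health'].map(health_order)

print(df)
```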


## Exercise 5: Feature Scaling
Apply both normalization and standardization to the `Height`, `Age`, `Leaf_Size`, and `Temperature` columns.

- Compare the results of normalization and standardization. 
- Which scaling method might be more appropriate in this case?
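
Both transformations can be written directly from their formulas, as in the sketch below. It assumes min-max scaling for "normalization" and z-scores for "standardization", and uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Height': [25, np.nan, 35, 40, 30, 20, np.nan, 33, 29, 38],
    'Age': [100, 50, 80, 120, 70, 60, np.nan, 85, 40, 110],
    'Leaf_Size': [200, 150, 180, np.nan, 210, 160, 170, 190, 140, 220],
    'Temperature': [15, 10, 12, 17, 16, 9, 14, 13, 11, np.nan],
})
cols = ['Height', 'Age', 'Leaf_Size', 'Temperature']

# Normalization (min-max): rescales each column to the range [0, 1].
normalized = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())

# Standardization (z-score): each column gets mean 0 and unit variance.
standardized = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

print(normalized.round(2).head())
print(standardized.round(2).head())
```

pandas skips `NaN` when computing the statistics, so missing values pass through unchanged; in practice you would usually impute before scaling.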


## Exercise 6: Log Transformation
Perform a log transformation on the `Rainfall` column to reduce skewness.

- How does the log transformation help in dealing with skewed data?
- What should you watch out for when applying log transformations?
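
A minimal sketch of the transformation is shown below. It uses `np.log1p`, which computes log(1 + x) and therefore tolerates zeros; the column name `Rainfall_Log` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Rainfall': [500, 600, 550, 620, 580, np.nan, 540, 570, 610, 590]})

# Logarithms are undefined for negative inputs, so check the data first.
if (df['Rainfall'].dropna() < 0).any():
    raise ValueError('Rainfall contains negative values')

# log1p(x) = log(1 + x); missing values remain NaN.
df['Rainfall_Log'] = np.log1p(df['Rainfall'])

# Compare skewness before and after the transformation.
print('Skewness before:', df['Rainfall'].skew())
print('Skewness after: ', df['Rainfall_Log'].skew())
```

Comparing the two skewness values tells you whether the transformation actually helped for this column.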


## Exercise 7: Polynomial Features
Create polynomial features based on the `Height` and `Age` columns.

- How can polynomial features help improve model performance?
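
For two columns, the degree-2 terms can be built by hand, as in the sketch below. The column names are illustrative; scikit-learn's `PolynomialFeatures` is a common alternative when there are many columns or higher degrees.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Height': [25, np.nan, 35, 40, 30, 20, np.nan, 33, 29, 38],
    'Age': [100, 50, 80, 120, 70, 60, np.nan, 85, 40, 110],
})

# Degree-2 polynomial expansion of (Height, Age):
# the original terms, their squares, and their cross term.
poly = pd.DataFrame({
    'Height': df['Height'],
    'Age': df['Age'],
    'Height^2': df['Height'] ** 2,
    'Height*Age': df['Height'] * df['Age'],
    'Age^2': df['Age'] ** 2,
})

print(poly.head())
```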


## Exercise 8: Interaction Features
Create interaction features between the `Height` and `Leaf_Size` columns.

- How do interaction features help capture relationships between variables?
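
The simplest interaction is a product of the two columns, sketched below. The name `Height_x_Leaf_Size` is an illustrative choice; ratios or differences are other possibilities.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Height': [25, np.nan, 35, 40, 30, 20, np.nan, 33, 29, 38],
    'Leaf_Size': [200, 150, 180, np.nan, 210, 160, 170, 190, 140, 220],
})

# Multiplicative interaction: large only when both inputs are large.
# A row is NaN if either input is missing.
df['Height_x_Leaf_Size'] = df['Height'] * df['Leaf_Size']

print(df)
```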

## Exercise 9: Creating Domain-Specific Features
Create a domain-specific feature that combines `Height` and `Age` to estimate a "Growth Rate" for each tree.

- How can domain knowledge be used to create new features that improve model performance?
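
One plausible definition, assumed here for illustration, is average growth in meters per year: height divided by age. The sketch below also guards against a zero age, which would otherwise produce an infinite rate.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Height': [25, np.nan, 35, 40, 30, 20, np.nan, 33, 29, 38],
    'Age': [100, 50, 80, 120, 70, 60, np.nan, 85, 40, 110],
})

# Assumed definition: Growth_Rate = Height / Age (meters per year).
# An age of 0 is treated as missing to avoid division by zero.
safe_age = df['Age'].replace(0, np.nan)
df['Growth_Rate'] = df['Height'] / safe_age

print(df)
```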


## Exercise 10: Target Encoding
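Perform target encoding on the `Soil_Type` and `Tree_Type` columns, using the `Health` column as the target. Because `Health` is categorical, map it to a numeric score first (e.g., Poor = 0, Moderate = 1, Good = 2) so that a mean can be computed per category.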

- How does target encoding differ from one-hot encoding?
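
A minimal sketch of mean target encoding is shown below. The numeric health scores and the `_Target` column suffix are assumptions made for illustration. For simplicity the means are computed on the full dataset; in a real modeling workflow they should come from the training data only, to avoid leaking the target.

```python
import pandas as pd

df = pd.DataFrame({
    'Tree_Type': ['Oak', 'Pine', 'Maple', 'Oak', 'Maple', 'Pine', 'Oak', 'Maple', 'Pine', 'Oak'],
    'Soil_Type': ['Loam', 'Sand', 'Clay', 'Loam', 'Clay', 'Sand', 'Loam', 'Clay', 'Sand', 'Loam'],
    'Health': ['Good', 'Moderate', 'Good', 'Poor', 'Moderate',
               'Good', 'Poor', 'Good', 'Moderate', 'Good'],
})

# Assumed numeric scoring of the target.
health_score = {'Poor': 0, 'Moderate': 1, 'Good': 2}
df['Health_Score'] = df['Health'].map(health_score)

# Replace each category with the mean target score of its group.
for col in ['Soil_Type', 'Tree_Type']:
    df[f'{col}_Target'] = df.groupby(col)['Health_Score'].transform('mean')

print(df[['Soil_Type', 'Soil_Type_Target', 'Tree_Type', 'Tree_Type_Target']])
```

Unlike one-hot encoding, this produces a single numeric column per variable regardless of how many categories it has.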


## Exercise 11: Date-Time Feature Extraction
Extract useful features from the `Observation_Date` column, such as year, month, and season.

- How can date-time features be valuable in a model?
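
The `.dt` accessor exposes the calendar parts directly, as sketched below. The season mapping assumes Northern Hemisphere meteorological seasons (December to February is winter), which fits the latitudes in this dataset; the dates are written out explicitly so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Observation_Date': pd.to_datetime([
        '2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30', '2020-05-31',
        '2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30', '2020-10-31',
    ]),
})

df['Year'] = df['Observation_Date'].dt.year
df['Month'] = df['Observation_Date'].dt.month

# month % 12 // 3 maps Dec/Jan/Feb to 0, Mar-May to 1, Jun-Aug to 2, Sep-Nov to 3.
season_names = {0: 'Winter', 1: 'Spring', 2: 'Summer', 3: 'Autumn'}
df['Season'] = (df['Month'] % 12 // 3).map(season_names)

print(df)
```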


## Exercise 12: Discretization
Discretize the `Age` column into bins (e.g., Young, Mature, Old).

- How does discretization help in simplifying continuous features?
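
When the categories have a real-world meaning, fixed bin edges are often easier to interpret than data-driven ones. The sketch below uses `pd.cut` with explicit edges; the thresholds (60 and 100 years) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [100, 50, 80, 120, 70, 60, np.nan, 85, 40, 110]})

# Hypothetical thresholds: (0, 60] Young, (60, 100] Mature, (100, inf) Old.
# Intervals are closed on the right, so a 60-year-old tree counts as Young.
age_edges = [0, 60, 100, np.inf]
age_labels = ['Young', 'Mature', 'Old']
df['Age_Group'] = pd.cut(df['Age'], bins=age_edges, labels=age_labels)

print(df)
```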