# Feature Engineering & Selection

Implemented so far:

**Feature engineering**

- New features created:
    - Wilderness density feature
    - Mean distance features (pairwise averages of distance columns)
- Other:
    - Merge all soil types into one column


**Feature Selection**
- Remove features weakly correlated with the target (absolute correlation < 0.005)


### Data Import

In [67]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load the training data
train_path = '../data/train.csv'
train = pd.read_csv(train_path)

# load the testing data
test_path = '../data/test-full.csv'
test = pd.read_csv(test_path)

### Feature Engineering

#### Create Wilderness Density Feature

In [68]:
# First, we create a DataFrame to hold the counts of each cover type within each wilderness area
wilderness_cover_counts = train.groupby('Cover_Type').agg({
    'Wilderness_Area1': 'sum',
    'Wilderness_Area2': 'sum',
    'Wilderness_Area3': 'sum',
    'Wilderness_Area4': 'sum'
}).reset_index()

# Now, we calculate the density for each cover type within each wilderness area
total_counts = train[['Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4']].sum()
wilderness_cover_density = wilderness_cover_counts.set_index('Cover_Type').div(total_counts)

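# Function to apply the density-based feature to each row.
# Note: the lookup uses the row's own Cover_Type, so the feature depends on the target
# and cannot be computed this way for data without a Cover_Type column.
def apply_density_feature(row):
    cover_type = row['Cover_Type']
    for i in range(1, 5):
        wilderness_area = f'Wilderness_Area{i}'
        if row[wilderness_area] == 1:  # If the observation is in wilderness area i
            return wilderness_cover_density.loc[cover_type, wilderness_area]
    return None  # The row is not assigned to any wilderness area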

# Apply the function to create the new density-based feature
train['Cover_Type_Density'] = train.apply(apply_density_feature, axis=1)
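Because the lookup in the cell above is keyed on each row's own `Cover_Type`, it is worth checking how much the new column reveals about the label. The sketch below repeats the same computation on a small made-up frame (the `toy` data and the `labels_per_value` name are hypothetical, not part of the notebook) and counts how many distinct cover types sit behind each (area, density) pair.

```python
import pandas as pd

# Hypothetical toy data: two wilderness areas, three cover types.
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 1, 1, 0, 0, 0],
    "Wilderness_Area2": [0, 0, 0, 1, 1, 1],
    "Cover_Type":       [1, 1, 2, 2, 3, 3],
})
area_cols = ["Wilderness_Area1", "Wilderness_Area2"]

# Same idea as the notebook cell: share of each cover type within each area.
counts = toy.groupby("Cover_Type")[area_cols].sum()
density = counts.div(toy[area_cols].sum())

# Look up the density for each row's (cover type, area) pair.
area = toy[area_cols].idxmax(axis=1)
toy["Cover_Type_Density"] = [
    density.loc[c, a] for c, a in zip(toy["Cover_Type"], area)
]

# Count the distinct cover types behind each (area, density) pair.
labels_per_value = (
    toy.assign(area=area)
       .groupby(["area", "Cover_Type_Density"])["Cover_Type"]
       .nunique()
)
print(labels_per_value)
```

In this toy case every (area, density) pair maps to exactly one cover type, so a model could read the label straight off the feature. Whether the mapping is that clean on the real training data has not been checked here, but the same check can be run on `train`.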


#### Create mean distance features

In [69]:
# Calculate new features based on mean distances
train['Mean_Elevation_Vertical_Distance_Hydrology'] = (train['Elevation'] + train['Vertical_Distance_To_Hydrology']) / 2
train['Mean_Distance_Hydrology_Firepoints'] = (train['Horizontal_Distance_To_Hydrology'] + train['Horizontal_Distance_To_Fire_Points']) / 2
train['Mean_Distance_Hydrology_Roadways'] = (train['Horizontal_Distance_To_Hydrology'] + train['Horizontal_Distance_To_Roadways']) / 2
train['Mean_Distance_Firepoints_Roadways'] = (train['Horizontal_Distance_To_Fire_Points'] + train['Horizontal_Distance_To_Roadways']) / 2

print(f"Shape of the dataset after adding new features: {train.shape}")


Shape of the dataset after adding new features: (15120, 61)
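The features above are pairwise averages. If a true Euclidean distance is wanted, one option is to combine the horizontal and vertical distances to hydrology into a straight-line distance. This is only a sketch on made-up values; the column name `Euclidean_Distance_To_Hydrology` is an assumption and does not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the dataset's two hydrology distance columns.
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3.0, 0.0, 120.0],
    "Vertical_Distance_To_Hydrology": [4.0, -5.0, 50.0],
})

# Straight-line distance: sqrt(horizontal^2 + vertical^2).
# np.hypot handles the negative vertical offsets correctly.
toy["Euclidean_Distance_To_Hydrology"] = np.hypot(
    toy["Horizontal_Distance_To_Hydrology"],
    toy["Vertical_Distance_To_Hydrology"],
)
print(toy)
```

The same expression applies unchanged to `train` and `test`, since it uses no label information.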


#### Remove low-correlation features

In [70]:
import pandas as pd
import numpy as np

# Assuming 'train' is your DataFrame and includes both features and the target variable 'Cover_Type'
correlation_matrix = train.corr()  # Compute the correlation matrix

# Calculate the correlation of each feature with 'Cover_Type'
feature_correlation = correlation_matrix['Cover_Type'].drop('Cover_Type')  # Exclude self-correlation

# Threshold for low correlation: below 0.005 in absolute value
low_correlation_features = feature_correlation[feature_correlation.abs() < 0.005].index.tolist()

# Record the shape before dropping so the two printed shapes are distinct
shape_before = train.shape

# Drop these low correlation features from the dataset
train = train.drop(low_correlation_features, axis=1)

print(f"Removed features with low correlation to 'Cover_Type': {low_correlation_features}")
print(f"Shape of the dataset before removal: {shape_before}")
print(f"Shape of the dataset after removal: {train.shape}")


Removed features with low correlation to 'Cover_Type': ['Horizontal_Distance_To_Hydrology', 'Soil_Type11', 'Soil_Type26', 'Soil_Type34']
Shape of the dataset before removal: (15120, 61)
Shape of the dataset after removal: (15120, 57)
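The list of dropped columns has to be chosen on the training data and then reused on the test data. A minimal sketch of that pattern follows. The helper name `low_correlation_features_for` and the toy frames are hypothetical, and `corrwith` is used here in place of the full correlation matrix.

```python
import pandas as pd

def low_correlation_features_for(df, target, threshold=0.005):
    """Return feature names whose absolute Pearson correlation with
    the target column is below the threshold."""
    corr = df.drop(columns=[target]).corrwith(df[target])
    return corr[corr.abs() < threshold].index.tolist()

# Hypothetical toy data: 'signal' tracks the target, 'noise' is uncorrelated.
train_toy = pd.DataFrame({
    "signal": [1, 2, 3, 4],
    "noise": [1, -1, -1, 1],
    "Cover_Type": [1, 2, 3, 4],
})
test_toy = pd.DataFrame({"signal": [2, 3], "noise": [1, -1]})

# Decide on the training data, then apply the same decision to both frames.
to_drop = low_correlation_features_for(train_toy, "Cover_Type")
train_toy = train_toy.drop(columns=to_drop)
test_toy = test_toy.drop(columns=to_drop, errors="ignore")
print(to_drop, list(test_toy.columns))
```

Note that Pearson correlation treats the integer class codes in `Cover_Type` as an ordered quantity, so this filter is at best a rough heuristic for a multi-class target.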


### Merging Soil Types

In [71]:
# Extract soil type columns
soil_columns = [col for col in train.columns if col.startswith('Soil_Type')]
# Combine the soil type columns into a single feature by identifying the active soil type
train['Soil_Type_Combined'] = train[soil_columns].idxmax(axis=1).str.extract(r'(\d+)', expand=False).astype(int)

# Drop the original soil type columns
train = train.drop(soil_columns, axis=1)
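The same merge would need to run on the test data as well, so it can help to wrap it in a function. The sketch below is one possible shape for that; the helper `merge_one_hot` and the toy frame are hypothetical.

```python
import pandas as pd

def merge_one_hot(df, prefix="Soil_Type", new_col="Soil_Type_Combined"):
    """Collapse one-hot columns named <prefix><number> into one integer column."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    out = df.copy()
    # idxmax returns the name of the active column; the regex pulls out its number.
    out[new_col] = (
        out[cols].idxmax(axis=1).str.extract(r"(\d+)", expand=False).astype(int)
    )
    return out.drop(columns=cols)

# Hypothetical toy data with three soil-type indicator columns.
toy = pd.DataFrame({
    "Elevation": [2500, 2700, 3100],
    "Soil_Type1": [1, 0, 0],
    "Soil_Type2": [0, 0, 1],
    "Soil_Type10": [0, 1, 0],
})
merged = merge_one_hot(toy)
print(merged)
```

One caveat: a row with no active soil column would silently receive the number of the first soil column, because `idxmax` returns the first position when all values are tied.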

# 3. Baseline Model

Random Forest Scores
- No feature engineering: 86.44%
- Merging Soil Types: 86.87%
- Merging Soil Types + Removing Low Corr.: 86.28%
- Merging Soil Types + Removing Low Corr. + Adding Distances: 87.07%
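- Merging Soil Types + Removing Low Corr. + Adding Distances + Adding Wilderness Density: 99.74%. This jump is suspicious: `Cover_Type_Density` is looked up using each row's own `Cover_Type`, so it likely leaks the target into the features.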


In [72]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# Prepare the data
X = train.drop(['Id', 'Cover_Type'], axis=1)
y = train['Cover_Type']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the validation set
y_pred = rf_model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
accuracy


0.9973544973544973
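The score above comes from a single 80/20 split. A stratified k-fold estimate is usually more stable and makes comparisons between feature sets easier. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`, so the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's X and y (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Running the same loop with and without `Cover_Type_Density` in `X` would be one way to see how much of the 99.7% depends on that single column.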