# Feature Engineering

In [None]:
import pandas as pd

train_df = pd.read_csv('Data/train.csv')
test_df = pd.read_csv('Data/test.csv')

train_df.head()

## Create New Features
In this section, we create new features that may improve the model's performance and reveal more structure in the data: interaction features (products of two columns), polynomial features (squares), a log transformation, and a ratio feature. Each feature is computed identically for the training and test sets.

In [None]:
import numpy as np

# Interaction Features
train_df['Energy_Mood_Interaction'] = train_df['Energy'] * train_df['MoodScore']
test_df['Energy_Mood_Interaction'] = test_df['Energy'] * test_df['MoodScore']

train_df['Rhythm_Energy_Interaction'] = train_df['RhythmScore'] * train_df['Energy']
test_df['Rhythm_Energy_Interaction'] = test_df['RhythmScore'] * test_df['Energy']

train_df['Loudness_Energy_Interaction'] = train_df['AudioLoudness'] * train_df['Energy']
test_df['Loudness_Energy_Interaction'] = test_df['AudioLoudness'] * test_df['Energy']

train_df['Rhythm_Loudness_Interaction'] = train_df['RhythmScore'] * train_df['AudioLoudness']
test_df['Rhythm_Loudness_Interaction'] = test_df['RhythmScore'] * test_df['AudioLoudness']

# Polynomial Features
train_df['Energy_sq'] = train_df['Energy']**2
test_df['Energy_sq'] = test_df['Energy']**2

train_df['MoodScore_sq'] = train_df['MoodScore']**2
test_df['MoodScore_sq'] = test_df['MoodScore']**2

train_df['RhythmScore_sq'] = train_df['RhythmScore']**2
test_df['RhythmScore_sq'] = test_df['RhythmScore']**2

train_df['AudioLoudness_sq'] = train_df['AudioLoudness']**2
test_df['AudioLoudness_sq'] = test_df['AudioLoudness']**2

# Feature Transformations
train_df['TrackDurationMs_log'] = np.log1p(train_df['TrackDurationMs'])
test_df['TrackDurationMs_log'] = np.log1p(test_df['TrackDurationMs'])

# Ratio Features
# Note: a TrackDurationMs of 0 yields inf (or NaN for 0/0) in this column
train_df['Vocal_Duration_Ratio'] = train_df['VocalContent'] / train_df['TrackDurationMs']
test_df['Vocal_Duration_Ratio'] = test_df['VocalContent'] / test_df['TrackDurationMs']


train_df.head()
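
Every feature above is computed twice, once per DataFrame, so the two copies can drift apart when a line is edited. One way to avoid this, shown here as a minimal sketch, is to collect the transformations in a single function and call it on both sets. The function name `add_engineered_features` and the toy values are assumptions made for illustration; the column names are the ones used above. The sketch also maps a zero `TrackDurationMs` to `NaN` before dividing, a design choice that is not part of the original cell.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the engineered columns added (hypothetical helper)."""
    out = df.copy()

    # Interaction features: pairwise products of existing columns
    out['Energy_Mood_Interaction'] = out['Energy'] * out['MoodScore']
    out['Rhythm_Energy_Interaction'] = out['RhythmScore'] * out['Energy']
    out['Loudness_Energy_Interaction'] = out['AudioLoudness'] * out['Energy']
    out['Rhythm_Loudness_Interaction'] = out['RhythmScore'] * out['AudioLoudness']

    # Polynomial features: squares of the same four base columns
    for col in ['Energy', 'MoodScore', 'RhythmScore', 'AudioLoudness']:
        out[f'{col}_sq'] = out[col] ** 2

    # Log transformation: log1p(x) = log(1 + x)
    out['TrackDurationMs_log'] = np.log1p(out['TrackDurationMs'])

    # Ratio feature. Assumption: a zero duration is treated as missing,
    # so the ratio becomes NaN instead of inf.
    duration = out['TrackDurationMs'].replace(0, np.nan)
    out['Vocal_Duration_Ratio'] = out['VocalContent'] / duration
    return out


# Toy data (made-up values) with the same column names as the notebook
toy = pd.DataFrame({
    'Energy': [0.5, 0.8],
    'MoodScore': [0.2, 0.4],
    'RhythmScore': [0.6, 0.9],
    'AudioLoudness': [-8.0, -5.0],
    'VocalContent': [0.1, 0.3],
    'TrackDurationMs': [200000.0, 0.0],
})
toy_features = add_engineered_features(toy)
print(toy_features.columns.tolist())
```

In the notebook, this would be applied as `train_df = add_engineered_features(train_df)` and `test_df = add_engineered_features(test_df)`.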

## Evaluate New Features
In this section, we evaluate the new features in two ways: by their correlation with the target variable, `BeatsPerMinute`, and by their importance in a random forest regressor fitted on the training data.
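
A heatmap of the full correlation matrix can be hard to read for a single column. As a complementary sketch, the correlation of every feature with the target can be extracted and sorted by absolute value. The helper name `rank_by_target_correlation` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`,
    sorted by absolute value, largest first (hypothetical helper)."""
    numeric = df.select_dtypes(include='number')
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic data: 'strong' drives the target, 'weak' barely does
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
demo = pd.DataFrame({
    'strong': strong,
    'weak': weak,
    'target': 3.0 * strong + 0.1 * weak + rng.normal(scale=0.1, size=n),
})
ranking = rank_by_target_correlation(demo, 'target')
print(ranking)
```

In the notebook, the equivalent call would be `rank_by_target_correlation(train_df, 'BeatsPerMinute')`. Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship.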

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

# Correlation Analysis
plt.figure(figsize=(20, 15))
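# numeric_only=True (pandas >= 1.5) skips non-numeric columns, which would otherwise raise an error in pandas 2.x
correlation_matrix = train_df.corr(numeric_only=True)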
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix of All Features')
plt.show()

# Feature Importance Analysis
X = train_df.drop(['id', 'BeatsPerMinute'], axis=1)
y = train_df['BeatsPerMinute']

# Replace infinite values (possible when a ratio divides by zero) with NaN,
# then fill every NaN, including any pre-existing missing values, with 0
X = X.replace([np.inf, -np.inf], np.nan).fillna(0)


model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X, y)

feature_importances = pd.DataFrame({'feature': X.columns, 'importance': model.feature_importances_})
feature_importances = feature_importances.sort_values('importance', ascending=False)

plt.figure(figsize=(12, 10))
sns.barplot(x='importance', y='feature', data=feature_importances)
plt.title('Feature Importances')
plt.show()
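
The importances plotted above are the random forest's built-in impurity-based scores, which are computed from the training data. One possible cross-check, which is not part of the original analysis, is permutation importance measured on a held-out split. The sketch below uses synthetic data, and the names `X_demo`, `y_demo`, `rf`, and `perm_importances` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: one informative column, one pure-noise column
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    'informative': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y_demo = 2.0 * X_demo['informative'] + rng.normal(scale=0.1, size=n)

# Hold out 25% of the rows so importance is measured on unseen data
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_fit, y_fit)

# Shuffle each column in turn and record the mean drop in the validation score
result = permutation_importance(rf, X_val, y_val, n_repeats=5, random_state=42)
perm_importances = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importances)
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`. Features that rank highly in both the impurity-based plot and the permutation ranking are the more convincing candidates.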

## Conclusion
In this notebook, we performed feature engineering to explore new relationships in the data and potentially improve model performance. We created several new features, including interaction terms, polynomial features, transformations, and ratios.

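The correlation matrix and feature importance plot indicate which of these new features are most promising. In the heatmap, `Loudness_Energy_Interaction` and `Rhythm_Loudness_Interaction` appear to correlate noticeably with the target variable, `BeatsPerMinute`. The heatmap is drawn without annotations, so the exact coefficients should be read from `correlation_matrix` before relying on this impression. The feature importance plot also ranks some of the new features among the more important inputs to the random forest. Both analyses use only the training data, so they suggest candidates and do not show that the features improve predictions on unseen data.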

These new features can now be used in a machine learning model to see if they improve the prediction of `BeatsPerMinute`. Further exploration could involve creating more complex features or using different techniques for feature selection.