
Objective 1: Feature Engineering and Preprocessing

1. Load the data into a pandas DataFrame and explore it.
2. Check for missing or null values.
3. Check the correlation between each feature and the target.
4. Remove any feature whose absolute correlation with the target is below 0.1.
5. Normalize the data. The cell below centers each feature on its mean and divides by its range (maximum minus minimum); the StandardScaler version is left commented out.
6. Split the dataset into training and validation sets.
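
As a minimal sketch of the correlation filter in isolation, the block below builds a small synthetic DataFrame and keeps only the columns whose absolute Pearson correlation with the target reaches 0.1. The frame `demo` and its column names `strong` and `noise` are hypothetical stand-ins for illustration, not columns of train.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data; all names here are hypothetical.
rng = np.random.default_rng(0)
n = 5000
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": signal + 0.1 * rng.normal(size=n),  # closely tracks the target
    "noise": rng.normal(size=n),                  # generated independently of the target
    "target": signal,
})

# Absolute Pearson correlation of every column with the target.
corr = demo.corr()["target"].abs()

# Keep columns at or above the threshold; the target correlates
# perfectly with itself, so it always survives the filter.
kept = corr[corr >= 0.1].index.tolist()
filtered = demo[kept]
print(kept)
```

Selecting the columns to keep, rather than dropping in place, leaves the original frame untouched, which makes it easier to try a different threshold later.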

In [25]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv("train.csv")

# Explore the data
print(df.head())

# Check for missing values
print(df.isnull().sum())

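# Check the absolute correlation of each column with the target
# (assumes every column is numeric)
correlation = df.corr()['target'].abs()
print(correlation)

# Remove features whose absolute correlation with the target is below 0.1
# (the target correlates perfectly with itself, so it is never dropped)
df.drop(columns=correlation[correlation < 0.1].index, inplace=True)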

# Separate the features from the target variable
features = df.drop('target', axis=1)

# Center each feature by subtracting its mean
centered_data = features - features.mean()

# Alternative (disabled): standardize with StandardScaler,
# assuming the target is the last column
# scaler = StandardScaler()
# df[df.columns[:-1]] = scaler.fit_transform(df[df.columns[:-1]])

# Scale each centered feature by its range (maximum - minimum)
normalized_data = centered_data / (features.max() - features.min())

# Calculate and print the variance of each normalized feature
var = normalized_data.var()
print(var)

   acc_rate  track         m     n  current_pitch  current_roll  \
0       -31     22  1.127497  0.03           1.44          -0.7   
1       185    -40  1.363425 -0.10           0.30          -1.3   
2       170    -17  0.794534  0.12           0.31           0.6   
3      -399     -8  0.697676 -0.13           0.80          -0.7   
4       -24    -18  1.127497  0.18           0.85          -1.8   

   absoluate_roll  climb_delta  roll_rate_delta  climb_delta_diff  ...  \
0              -9            6            0.011              -0.9  ...   
1             -12           16            0.022               0.1  ...   
2              -6            9           -0.006              -2.5  ...   
3              -7           15            0.008               0.1  ...   
4             -18           -3            0.019              -0.2  ...   

   time8_delta  time9_delta  time10_delta  time11_delta  time12_delta  \
0          0.0          0.0           0.0           9.0           0.0   
1     

In [26]:
# Optional (disabled): remove features whose normalized variance is below 0.1
# df.drop(columns=var[var < 0.1].index, inplace=True)

# Split the data into training and validation sets
X = normalized_data  # alternative: df.drop(columns=['target'])
y = df['target']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
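
The cell above normalizes the full dataset before splitting, so the validation rows contribute to the means and ranges used for scaling. One common alternative, sketched here on synthetic data, is to split first and fit StandardScaler on the training rows only. The arrays `X_demo` and `y_demo` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and non-unit spread (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Split first, so the validation rows play no part in the scaling statistics.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std come from training rows only
X_va_scaled = scaler.transform(X_va)      # validation rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The scaled validation columns will generally not have exactly zero mean or unit variance, which is expected: they are transformed with the training statistics.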

Objective 2: Model Selection

1. Define a list of candidate models to test.
2. Create a function that takes a model, trains it on the training set, predicts on the validation set, and returns the root mean squared error (RMSE) of the predictions.
3. Test each model using the function from step 2 and select the one with the lowest RMSE.

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Define candidate models
models = [
    LinearRegression(),
    RandomForestRegressor(),
    GaussianProcessRegressor(),
    KNeighborsRegressor(),
    AdaBoostRegressor(),
    XGBRegressor()
]

# Train a model on the training set and return its validation RMSE
def evaluate_model(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    return rmse


# Test each model and select the best one
best_model = None
best_rmse = float('inf')
for model in models:
    rmse = evaluate_model(model)
    if rmse < best_rmse:
        best_model = model
        best_rmse = rmse

print(f'Best model: {best_model.__class__.__name__}')


Best model: XGBRegressor
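
The loop above reports only the winning model. A minimal sketch of the same comparison that also keeps every model's RMSE is shown below, assuming synthetic data from `make_regression` and a reduced, scikit-learn-only candidate set; the helper `validation_rmse` and the `candidates` dictionary are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic linear regression problem (hypothetical stand-in for the real data).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=1.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)


def validation_rmse(model, X_tr, y_tr, X_va, y_va):
    """Fit on the training split and return RMSE on the validation split."""
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_va, model.predict(X_va))))


# Reduced candidate set for illustration only.
candidates = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
}

scores = {name: validation_rmse(model, X_tr, y_tr, X_va, y_va)
          for name, model in candidates.items()}

# Print every score, lowest RMSE first, then pick the winner.
for name, rmse in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rmse:.4f}")
best_name = min(scores, key=scores.get)
```

Keeping the full score table shows how close the runners-up are, which matters when the gap is small enough that a single validation split could flip the ranking.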


Objective 3: Hyperparameter Optimization

1. Define a dictionary of hyperparameters to test for the selected model.
2. Use GridSearchCV from scikit-learn to perform a grid search over the hyperparameters and find the combination that gives the lowest RMSE.
3. Train the selected model with the best hyperparameters on all the data passed to the search (GridSearchCV does this automatically, since refit=True by default).

In [28]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameters to test
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'learning_rate': [0.1, 0.01, 0.001]
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(
    best_model, param_grid,
    scoring='neg_mean_squared_error', cv=5, n_jobs=-1)

grid_search.fit(X, y)

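# Print the best parameters and the best cross-validated score.
# The scorer is 'neg_mean_squared_error', so the negated best_score_
# is an MSE; take np.sqrt of it to obtain the RMSE.
print('Best parameters:', grid_search.best_params_)
print('Best MSE score:', -grid_search.best_score_)


Best parameters: {'learning_rate': 0.1, 'max_depth': None, 'n_estimators': 100}
Best MSE score: 0.02712537739721931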
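
To have the search report RMSE directly, scikit-learn offers the `neg_root_mean_squared_error` scorer. The sketch below illustrates it on synthetic data, with GradientBoostingRegressor standing in for the selected model so that the example needs only scikit-learn; the reduced grid `small_grid` and the 3-fold setting are assumptions chosen to keep the run short, not the settings used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical stand-in for the real features and target).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=1.0, random_state=0)

# Reduced grid for illustration only.
small_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.1, 0.01],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    small_grid,
    scoring="neg_root_mean_squared_error",  # negated RMSE, so higher is better
    cv=3,
)
search.fit(X_demo, y_demo)

# Flip the sign to recover the mean cross-validated RMSE of the best setting.
best_rmse = -search.best_score_
print("Best parameters:", search.best_params_)
print("Best CV RMSE:", best_rmse)

# refit=True by default, so best_estimator_ is already trained on all of X_demo.
final_model = search.best_estimator_
```

Note that this value is the mean of the per-fold RMSEs, which is generally not equal to the square root of the mean per-fold MSE reported by `neg_mean_squared_error`.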
