# Model Selection

In this notebook, we compare three linear regression models (ordinary least squares, Ridge, and Lasso) on a held-out validation set, using RMSE and R-squared as evaluation metrics. We then save the model with the lowest validation RMSE so that it can be deployed on AWS SageMaker.
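For reference, both metrics can be computed directly from the residuals. The sketch below uses small made-up arrays (`y_true` and `y_hat` are illustrative values, not taken from the training data) and checks the hand-computed results against scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only (assumption): not drawn from train.csv
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# RMSE: square root of the mean squared residual, in the target's units
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both should agree with the scikit-learn implementations
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)
print(f'RMSE: {rmse_manual:.4f} (sklearn: {rmse_sklearn:.4f})')
print(f'R-squared: {r2_manual:.4f} (sklearn: {r2_sklearn:.4f})')
```

A lower RMSE is better and is expressed in the units of the target, while R-squared is unitless and better the closer it is to 1.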

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# Load the training data
train_data = pd.read_csv('../data/train.csv')

# Display the first few rows of the training data
train_data.head()

In [2]:
# Prepare features and target variable
X = train_data.drop('target_column', axis=1)  # Replace 'target_column' with the actual target column name
y = train_data['target_column']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),  # L2 penalty, default alpha=1.0 (untuned)
    'Lasso Regression': Lasso()   # L1 penalty, default alpha=1.0 (untuned)
}

# Dictionary to store evaluation metrics
model_metrics = {}

# Train and evaluate each model
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    r2 = r2_score(y_val, y_pred)
    model_metrics[model_name] = {'RMSE': rmse, 'R-squared': r2}
    print(f'{model_name} - RMSE: {rmse:.2f}, R-squared: {r2:.2f}')
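A single 80/20 split can give a noisy ranking when the models score close to one another. One possible cross-check, sketched here under assumptions, is k-fold cross-validation. The data is a synthetic stand-in generated with `make_regression`, and the names `X_demo`, `y_demo`, `candidates`, and `cv_rmse` are hypothetical, not part of the notebook above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real features and target (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

candidates = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
}

# Shuffled 5-fold split with a fixed seed so every model sees the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_rmse = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated RMSE so that higher is better; flip the sign back
    scores = -cross_val_score(
        estimator, X_demo, y_demo, cv=cv, scoring='neg_root_mean_squared_error'
    )
    cv_rmse[name] = scores.mean()
    print(f'{name} - CV RMSE: {scores.mean():.2f} (+/- {scores.std():.2f})')
```

If the cross-validated ranking differs from the single-split ranking, the gap between the models is probably within the noise of the split.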

In [3]:
# Identify the best model: the one with the lowest validation RMSE
best_model_name = min(model_metrics, key=lambda x: model_metrics[x]['RMSE'])
best_model = models[best_model_name]

# Save the best model (as fitted on the 80% training split)
joblib.dump(best_model, 'best_model.pkl')

# Display the best model
print(f'The best model is: {best_model_name}')
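Before handing the file to a deployment step, it can be worth confirming that the saved model reloads and predicts identically. The following is a minimal sketch of that round trip, assuming a synthetic dataset and a temporary directory. `X_demo`, `y_demo`, `fitted`, and `reloaded` are hypothetical names, and `Ridge` merely stands in for whichever model was selected.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in data and model (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
fitted = Ridge().fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'best_model.pkl')
    joblib.dump(fitted, model_path)
    reloaded = joblib.load(model_path)

# The reloaded estimator should reproduce the original predictions
same_predictions = np.allclose(fitted.predict(X_demo), reloaded.predict(X_demo))
print(f'Reloaded model reproduces predictions: {same_predictions}')
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that wrote them, so it helps to record the library versions alongside `best_model.pkl`.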

## Next Steps

The next notebook outlines the steps for deploying the selected model to AWS SageMaker and includes code for making predictions.