# Modeling
For this analysis, let's consider a range of regression models:

* Linear Regression: A simple baseline model.
* Ridge Regression: Linear regression with L2 regularization.
* Lasso Regression: Linear regression with L1 regularization.
* Random Forest Regressor: A decision tree-based ensemble method.
* Gradient Boosting Regressor: Boosting-based ensemble method.
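
To make the difference between the L1 and L2 penalties listed above concrete, here is a minimal sketch on synthetic data. The dataset from `make_regression` and the `alpha=2.0` value are assumptions chosen for illustration, not the data or settings used in this analysis. It counts how many coefficients each penalty drives to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features, only 3 informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Fit both penalized models with the same (illustrative) regularization strength
ridge = Ridge(alpha=2.0).fit(X, y)
lasso = Lasso(alpha=2.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", n_zero_ridge)
print("Lasso zero coefficients:", n_zero_lasso)
```

The L2 penalty shrinks coefficients toward zero without eliminating them, while the L1 penalty can remove uninformative features entirely, which is why Lasso is sometimes used as a feature selector.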

## Getting the pre-processed data

In [14]:
%run utilspro.py
execute_notebook("3_data_preprocessing.ipynb")


## Setting up the pipelines

In [13]:
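# Imports for the models, pipeline, and metric
# (redundant if the preprocessing notebook already provides them)
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline

# Define the models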
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest Regressor': RandomForestRegressor(random_state=42),
    'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=42)
}

# Wrap each model in a single-step pipeline (the inputs are the already-scaled features)
pipelines = {name: Pipeline([('model', model)]) for name, model in models.items()}

# Train and evaluate each pipeline
results = {}
for name, pipeline in pipelines.items():
    # Train the model
    pipeline.fit(X_train_scaled, y_train)
    
    # Predict on the test set
    predictions = pipeline.predict(X_test_scaled)
    
    # Compute the mean squared error
    mse = mean_squared_error(y_test, predictions)
    results[name] = mse

results


{'Linear Regression': 96.72539177942704,
 'Ridge Regression': 96.80781600694735,
 'Lasso Regression': 97.24170054684595,
 'Random Forest Regressor': 99.14816629889401,
 'Gradient Boosting Regressor': 85.90483782690822}

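The Gradient Boosting Regressor has the lowest test MSE (about 85.9, roughly 11% below the Linear Regression baseline at 96.7), making it the best-performing model among those evaluated. The three linear models score within roughly 0.5 MSE of each other, and the Random Forest Regressor scores slightly worse than the linear baseline (99.1). These figures come from a single train/test split with default hyperparameters.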
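
A ranking based on one train/test split can be sensitive to how the split falls. As a hedged sketch of a more robust comparison, the snippet below scores two of the models with 5-fold cross-validation. The synthetic data from `make_regression` and the names `candidates` and `cv_results` are assumptions for illustration; in this notebook the call would be applied to `X_train_scaled` and `y_train` instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; replace with X_train_scaled, y_train)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)

candidates = {
    'Linear Regression': LinearRegression(),
    'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MSE is reported negated
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_mean_squared_error')
    # Store the mean MSE across folds and its spread
    cv_results[name] = (-scores.mean(), scores.std())

for name, (mean_mse, std_mse) in cv_results.items():
    print(f"{name}: MSE = {mean_mse:.2f} (std {std_mse:.2f})")
```

Comparing the mean MSE together with its standard deviation across folds shows whether the gap between two models is larger than the fold-to-fold noise.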

### Hyperparameter tuning

Let's focus on hyperparameter tuning for the best-performing model, the Gradient Boosting Regressor. The hyperparameters to tune are:

* n_estimators: The number of boosting stages to be run.
* learning_rate: Shrinks the contribution of each tree to the final prediction; lower values usually need more boosting stages.
* max_depth: The maximum depth of the individual regression estimators.
* min_samples_split: The minimum number of samples required to split an internal node.
* min_samples_leaf: The minimum number of samples required to be at a leaf node.

We will define a grid for these hyperparameters and perform a grid search to find the best combination.
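
Before launching a grid search it helps to know how many model fits it implies. The sketch below counts the combinations with scikit-learn's `ParameterGrid`, using the same grid values and the 5-fold setting that appear in the search cell of this notebook. The variable names `n_candidates`, `n_folds`, and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as used in the search cell of this notebook
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2],
}

# Every combination of values is one candidate model
n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 * 2 * 2

# Each candidate is fitted once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 108 540
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set after the search.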

In [None]:
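from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for Gradient Boosting Regressor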
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}

# Create a GridSearchCV object for Gradient Boosting Regressor
grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)

# Fit the model
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and the corresponding cross-validated MSE
# (best_score_ is the negated MSE, so flip the sign)
best_params = grid_search.best_params_
best_score = -grid_search.best_score_

best_params, best_score
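
The score returned by the search is a cross-validated MSE on the training data, so it is not directly comparable to the test MSE values reported earlier. A minimal sketch of the follow-up step, scoring the refitted best model on a held-out test set, is shown below. The synthetic data, the reduced grid `small_grid`, and `cv=3` are assumptions chosen to keep the example fast; in this notebook the same calls would use `grid_search`, `X_test_scaled`, and `y_test`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Deliberately small grid so the sketch runs quickly
small_grid = {'n_estimators': [50, 100], 'learning_rate': [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    small_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
search.fit(X_train, y_train)

# best_estimator_ is the best candidate refitted on the full training set
best_model = search.best_estimator_
cv_mse = -search.best_score_
test_mse = mean_squared_error(y_test, best_model.predict(X_test))

print("Best parameters:", search.best_params_)
print(f"Cross-validated MSE: {cv_mse:.2f}")
print(f"Held-out test MSE:   {test_mse:.2f}")
```

Reporting the test MSE of the tuned model alongside the untuned result makes it possible to judge whether the tuning actually improved generalization.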
