# Modeling
This analysis compares a range of regression models:

* Linear Regression: A simple baseline model.
* Ridge Regression: Linear regression with L2 regularization.
* Lasso Regression: Linear regression with L1 regularization.
* Random Forest Regressor: A decision tree-based ensemble method.
* Gradient Boosting Regressor: Boosting-based ensemble method.
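
To make the difference between the two penalties concrete, here is a minimal sketch on synthetic data (not the dataset used in this notebook). The `alpha` values and the `make_regression` settings are illustrative assumptions. The point is only that the L1 penalty can drive coefficients to exactly zero, whereas the L2 penalty shrinks them without zeroing them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which carry signal (assumed setup)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", n_zero_ridge)
print("Lasso zero coefficients:", n_zero_lasso)
```

Because Lasso zeroes out some coefficients, it also acts as a form of feature selection; Ridge keeps every feature in the model.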

## Getting the pre-processed data

In [19]:
%run utilspro.py

In [None]:
%run "3_data_preprocessing.ipynb"
#execute_notebook("3_data_preprocessing.ipynb")

## Setting up the pipelines

In [24]:
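# Imports used below (redundant if utilspro.py already provides them)
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Define the models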
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
    'Random Forest Regressor': RandomForestRegressor(random_state=42),
    'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=42)
}

# Create a pipeline for each model (the features are already scaled, so the model is the only step)
pipelines = {name: Pipeline([('model', model)]) for name, model in models.items()}

# Train and evaluate each pipeline
results = {}
for name, pipeline in pipelines.items():
    # Train the model
    pipeline.fit(X_train_scaled, y_train)
    
    # Predict on the test set
    predictions = pipeline.predict(X_test_scaled)
    
    # Compute the mean squared error
    mse = mean_squared_error(y_test, predictions)
    results[name] = mse

results


{'Linear Regression': 88.27013283650295,
 'Ridge Regression': 88.02320099992671,
 'Lasso Regression': 87.36392537963899,
 'Random Forest Regressor': 87.89870580270997,
 'Gradient Boosting Regressor': 78.91591652938544}

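The Gradient Boosting Regressor has the lowest test-set MSE (78.92); the other four models all fall between 87.4 and 88.3. On this single train/test split, it is therefore the best-performing of the models we evaluated.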
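
MSE is expressed in squared target units, which makes the gap between models hard to read. As a small sketch, the reported values can be converted to RMSE and to a relative improvement over the linear baseline. The numbers are copied from the output above; choosing Linear Regression as the reference point is an assumption made for illustration.

```python
import math

# Test-set MSE values copied from the results above
results = {
    'Linear Regression': 88.27013283650295,
    'Ridge Regression': 88.02320099992671,
    'Lasso Regression': 87.36392537963899,
    'Random Forest Regressor': 87.89870580270997,
    'Gradient Boosting Regressor': 78.91591652938544,
}

# RMSE is in the same units as the target, which makes it easier to interpret
rmse = {name: math.sqrt(mse) for name, mse in results.items()}

best_name = min(results, key=results.get)
baseline = results['Linear Regression']
relative_gain = (baseline - results[best_name]) / baseline

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {value:.2f}")
print(f"{best_name} reduces MSE by {relative_gain:.1%} relative to Linear Regression")
```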

### Hyperparameter tuning

Let's focus on hyperparameter tuning for the best-performing model, the Gradient Boosting Regressor. The following hyperparameters are tuned:

* n_estimators: The number of boosting stages to be run.
* learning_rate: Shrinks the contribution of each tree to the final prediction; lower values generally need more boosting stages.
* max_depth: The maximum depth of the individual regression estimators.
* min_samples_split: The minimum number of samples required to split an internal node.
* min_samples_leaf: The minimum number of samples required to be at a leaf node.

We will define a grid for these hyperparameters and perform a grid search to find the best combination.
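
Conceptually, a grid search evaluates every combination in the grid with cross-validation and keeps the one with the best mean score. The loop below is a minimal sketch of that idea, not the notebook's actual search: it uses synthetic data as a stand-in for `X_train_scaled` and `y_train`, and `tiny_grid` is a hypothetical, deliberately small grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic stand-in for X_train_scaled / y_train (assumption for illustration)
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# A deliberately tiny grid so the sketch runs in seconds
tiny_grid = {'n_estimators': [20, 40], 'max_depth': [2, 3]}

best_mse, best_params = float('inf'), None
for params in ParameterGrid(tiny_grid):
    model = GradientBoostingRegressor(random_state=42, **params)
    # scikit-learn maximizes scores, so MSE is reported negated
    scores = cross_val_score(model, X, y, cv=3, scoring='neg_mean_squared_error')
    mse = -scores.mean()
    if mse < best_mse:
        best_mse, best_params = mse, params

print(best_params, round(best_mse, 2))
```

`GridSearchCV` runs the same kind of loop (optionally in parallel via `n_jobs`) and, by default, refits the best configuration on the full training set afterwards.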

In [25]:
# Define the hyperparameter grid for Gradient Boosting Regressor
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}

# Create a GridSearchCV object for Gradient Boosting Regressor
grid_search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and the corresponding score
best_params = grid_search.best_params_
best_score = -grid_search.best_score_

best_params, best_score

({'learning_rate': 0.1,
  'max_depth': 4,
  'min_samples_leaf': 1,
  'min_samples_split': 2,
  'n_estimators': 50},
 77.36852687070196)
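
The best score alone does not show how close the runner-up settings are, or how much the score varies between folds. One way to check this is to rank all candidates from `cv_results_`. The sketch below does so on synthetic data with a small hypothetical grid; the same attribute is available on the fitted `grid_search` object.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption for illustration)
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    {'n_estimators': [20, 40], 'learning_rate': [0.05, 0.1]},
    cv=3,
    scoring='neg_mean_squared_error',
)
search.fit(X, y)

# Rank candidates by mean CV MSE and show the spread across folds
mean_mse = -search.cv_results_['mean_test_score']
std_mse = search.cv_results_['std_test_score']
order = np.argsort(mean_mse)
for i in order:
    print(search.cv_results_['params'][i],
          f"MSE = {mean_mse[i]:.2f} +/- {std_mse[i]:.2f}")
```

If the top candidates differ by less than the fold-to-fold standard deviation, the choice between them is unlikely to matter much.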

In [26]:
# Define a reduced hyperparameter grid for Gradient Boosting Regressor
reduced_param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 4],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}

# Create a GridSearchCV object for Gradient Boosting Regressor with reduced grid
reduced_grid_search = GridSearchCV(GradientBoostingRegressor(random_state=42), reduced_param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model with reduced grid
reduced_grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and the corresponding score from the reduced grid search
reduced_best_params = reduced_grid_search.best_params_
reduced_best_score = -reduced_grid_search.best_score_

reduced_best_params, reduced_best_score

({'learning_rate': 0.1,
  'max_depth': 3,
  'min_samples_leaf': 1,
  'min_samples_split': 4,
  'n_estimators': 100},
 77.84541133733364)

In [27]:
# Create a RandomizedSearchCV object for Gradient Boosting Regressor
random_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=10,  # Number of parameter settings that are sampled
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42
)

# Fit the model with random search
random_search.fit(X_train_scaled, y_train)

# Get the best parameters and the corresponding score from the randomized search
random_best_params = random_search.best_params_
random_best_score = -random_search.best_score_

random_best_params, random_best_score

({'n_estimators': 150,
  'min_samples_split': 4,
  'min_samples_leaf': 1,
  'max_depth': 3,
  'learning_rate': 0.1},
 79.00452234510185)
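
The randomized search above samples 10 settings from the fixed lists in `param_grid`. `RandomizedSearchCV` also accepts distributions, which lets it try values that are not on a predefined grid. The following is a sketch on synthetic data; the ranges passed to `randint` and `loguniform` are illustrative assumptions, not tuned recommendations.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption for illustration)
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Distributions instead of fixed lists; the ranges are illustrative assumptions
param_distributions = {
    'n_estimators': randint(20, 61),         # integers in [20, 60]
    'learning_rate': loguniform(0.01, 0.3),  # log-uniform over [0.01, 0.3]
    'max_depth': randint(2, 5),              # integers in [2, 4]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```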

In [30]:
refined_param_grid = {
    'n_estimators': [50, 100, 200, 300, 400, 500],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.5, 1],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 4, 6, 8, 10, 0.01, 0.05, 0.1, 0.2, 0.3],
    'min_samples_leaf': [1, 2, 3, 4, 5, 0.01, 0.05, 0.1, 0.2]
}
# Create a GridSearchCV object for Gradient Boosting Regressor with refined grid
refined_grid_search = GridSearchCV(GradientBoostingRegressor(random_state=42), refined_param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model with the refined grid
refined_grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and the corresponding score from the refined grid search
refined_best_params = refined_grid_search.best_params_
# Negate the score, since 'neg_mean_squared_error' returns the negative MSE
refined_best_score = -refined_grid_search.best_score_

refined_best_params, refined_best_score
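
Before launching an exhaustive search over a grid this large, it helps to count how many model fits it implies. The sketch below uses `ParameterGrid` to count the candidates in the refined grid (copied from the cell above) and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

# Grid copied from the cell above
refined_param_grid = {
    'n_estimators': [50, 100, 200, 300, 400, 500],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.5, 1],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'min_samples_split': [2, 4, 6, 8, 10, 0.01, 0.05, 0.1, 0.2, 0.3],
    'min_samples_leaf': [1, 2, 3, 4, 5, 0.01, 0.05, 0.1, 0.2]
}

cv_folds = 3
n_candidates = len(ParameterGrid(refined_param_grid))
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
# 25920 candidates x 3 folds = 77760 fits
```

For comparison, the first grid has 108 candidates (540 fits with `cv=5`). At this size, a randomized search with a fixed `n_iter` budget is usually the more practical option.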

In [None]:
import pickle

# Save the fitted search object to the local file system
# (random_search is the object persisted here; swap in another search object to save a different model)
with open('data/finalized_model.sav', 'wb') as f:
    pickle.dump(random_search, f)
# Save the scaler
with open('data/finalized_scaler.sav', 'wb') as f:
    pickle.dump(scaler, f)
# Save the encoder
with open('data/finalized_encoder.sav', 'wb') as f:
    pickle.dump(encoder, f)
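
For completeness, here is a sketch of the reverse step: reloading pickled artifacts and using them for prediction. It is self-contained, so it fits a tiny stand-in model and a `StandardScaler` on random data and writes to a temporary directory; the file names and the choice of scaler are assumptions, not the notebook's actual artifacts.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

# Tiny stand-ins for the fitted scaler and model (assumption for illustration)
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
y = X_raw @ np.array([1.0, -2.0, 0.5])
scaler = StandardScaler().fit(X_raw)
model = GradientBoostingRegressor(n_estimators=20, random_state=42)
model.fit(scaler.transform(X_raw), y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.sav')    # hypothetical file names
    scaler_path = os.path.join(tmp_dir, 'scaler.sav')
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    with open(scaler_path, 'wb') as f:
        pickle.dump(scaler, f)

    # Reload both artifacts
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, 'rb') as f:
        loaded_scaler = pickle.load(f)

# Apply the scaler first, then the model, in the same order as at training time
new_rows = rng.normal(size=(5, 3))
original_pred = model.predict(scaler.transform(new_rows))
reloaded_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
print(np.allclose(original_pred, reloaded_pred))
```

Pickled estimators should be loaded with the same scikit-learn version that created them, and only from trusted sources, since unpickling can execute arbitrary code.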