<h1 align="center">MACHINE LEARNING REGRESSION MODEL FOR PCRF RESERVATIONS</h1>

After viewing and analyzing the data, we can (and should) build a Machine Learning model that predicts attendance. We need to extract, clean, and process the data to find the best model for the job.

## IMPORTING LIBRARIES

We'll import the libraries needed for preprocessing, building, and evaluating the Machine Learning models:

In [1]:
import os
from time import time
import numpy as np
import pandas as pd
import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from scipy.stats import randint, uniform
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
import warnings

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
warnings.filterwarnings("ignore")

## EXTRACTING THE DATA

Let's read the df_filtered file to build our model:

In [4]:
project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
data_dir = os.path.join(project_dir, "data", "processed")
path = os.path.join(data_dir, 'df_filtered.csv')
df = pd.read_csv(path)

In [5]:
df.sample(3)

Unnamed: 0,index,facility_name,facility_type,center_longitude,center_latitude,site_name,permit_year,permit_month,event_start_year,event_start_month,event_start_time,day_of_week,attendance,hours_reserved,schedule_type,residency_flag,customer_gender,permit_hour,event_start_mounth_numeric,event_start_day_numeric,event_end_day_numeric,event_duration,day_of_week_numeric,permit_month_numeric
167825,167825,Tennis Court 07 (MTPC),Court - Tennis,-111.741478,33.449963,Mesa Tennis & Pickleball Center,2019,August,2020,February,1900-01-01 10:30:00,Wednesday,4,2.0,Reservation: Billable,False,Female,12,2,5,5,2.0,3,8
289839,289839,Tennis Court 08 (MTPC),Court - Tennis,-111.741478,33.449963,Mesa Tennis & Pickleball Center,2019,December,2020,February,1900-01-01 18:30:00,Thursday,4,2.5,Reservation: Billable,True,Female,14,2,20,20,2.5,4,12
119186,119186,Group Fitness,Room - Fitness,-111.662972,33.432725,Recreation Centers,2019,June,2020,September,1900-01-01 12:00:00,Tuesday,1,0.75,Reservation: Internal,False,Mixed,11,9,1,1,0.75,2,6


In [6]:
df.shape

(434216, 24)

## DATA PREPROCESSING

For our model, we need to drop the columns that don't add value, reduce the outliers, and transform the categorical variables into numeric ones without increasing the number of columns.

### DROPPING THE COLUMNS

Let's see the columns (again):

In [7]:
df.head(3)

Unnamed: 0,index,facility_name,facility_type,center_longitude,center_latitude,site_name,permit_year,permit_month,event_start_year,event_start_month,event_start_time,day_of_week,attendance,hours_reserved,schedule_type,residency_flag,customer_gender,permit_hour,event_start_mounth_numeric,event_start_day_numeric,event_end_day_numeric,event_duration,day_of_week_numeric,permit_month_numeric
0,0,Group Fitness,Room - Fitness,-111.662972,33.432725,Recreation Centers,2019,June,2022,January,1900-01-01 10:00:00,Thursday,1,1.0,Reservation: Internal,False,Mixed,14,1,27,27,1.0,4,6
1,1,Group Fitness,Room - Fitness,-111.662972,33.432725,Recreation Centers,2019,June,2022,January,1900-01-01 11:00:00,Thursday,1,0.75,Reservation: Internal,False,Mixed,14,1,27,27,0.75,4,6
2,2,Group Fitness,Room - Fitness,-111.662972,33.432725,Recreation Centers,2019,June,2022,January,1900-01-01 12:00:00,Thursday,1,0.75,Reservation: Internal,False,Mixed,14,1,27,27,0.75,4,6


The columns 'index', 'facility_name', 'site_name', 'event_start_time', 'event_start_month', 'permit_month', 'event_start_day_numeric', 'event_end_day_numeric', and 'day_of_week' don't add value to the model, so we can drop them:

In [8]:
df.drop(columns = ['index', 'facility_name','site_name', 'event_start_time', 'event_start_month', 'permit_month',
                    'event_start_day_numeric', 'event_end_day_numeric', 'day_of_week'], inplace=True)
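
A name in a drop list may not exist in the frame, in which case `DataFrame.drop` raises a `KeyError`. The sketch below, on a toy frame, drops only the columns that are present and reports the rest. `drop_existing` is a hypothetical helper introduced for illustration; it is not part of the notebook.

```python
import pandas as pd

def drop_existing(frame, columns):
    """Drop only the columns present in the frame; also return the names that were missing."""
    present = [c for c in columns if c in frame.columns]
    missing = [c for c in columns if c not in frame.columns]
    return frame.drop(columns=present), missing

# Toy data standing in for the real reservations frame
toy = pd.DataFrame({"index": [0, 1], "site_name": ["A", "B"], "attendance": [4, 1]})

cleaned, missing = drop_existing(toy, ["index", "site_name", "event_end_time"])
print(list(cleaned.columns))  # columns that remain
print(missing)                # requested names that were not in the frame
```

Returning a new frame instead of using `inplace=True` also leaves the original `df` available for comparison.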

### REDUCING OUTLIERS

In the EDA, we found that 'hours_reserved' has outliers greater than 22 hours. We revisit that issue here and keep only the records with 20 hours or less:

In [9]:
df_2 = df[df['hours_reserved']<=20]
df_2.shape

(434197, 15)

In the same way, we filter on 'attendance', keeping only the records with 50 attendees or fewer:

In [10]:
df_3 = df_2[df_2['attendance']<=50].copy()  # copy so that columns added later do not write into a slice of df_2
df_3.shape

(373533, 15)
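
To see how much data each threshold costs, the two filters can be wrapped in a small function that reports the number of rows removed. This is a minimal sketch on toy data; `filter_max` is a hypothetical helper, and the thresholds simply mirror the ones used in the cells above.

```python
import pandas as pd

def filter_max(frame, column, max_value):
    """Keep rows where column <= max_value and report how many rows were removed."""
    kept = frame[frame[column] <= max_value].copy()
    removed = len(frame) - len(kept)
    print(f"{column} <= {max_value}: removed {removed} of {len(frame)} rows")
    return kept

# Toy data standing in for the real reservations frame
toy = pd.DataFrame({
    "hours_reserved": [1.0, 2.5, 24.0, 0.75],
    "attendance": [4, 120, 10, 30],
})

step_1 = filter_max(toy, "hours_reserved", 20)
step_2 = filter_max(step_1, "attendance", 50)
```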

### TRANSFORMING INTO NUMERIC VALUES

To avoid widening the dataset (as one-hot encoding would), we will use label encoding:

In [11]:
le= LabelEncoder()

In [12]:
def label_encoder_function(dataframe, column):
    # Refit the shared encoder on this column and return the integer codes
    column_encoded = le.fit_transform(dataframe[column])
    return column_encoded

def add_new_column(dataframe, column):
    # Store the encoded values as '<column>_numeric' next to the original column
    dataframe.loc[:,f'{column}_numeric']= label_encoder_function(dataframe, column)
    return dataframe

In [13]:
add_new_column(df_3, 'facility_type')
add_new_column(df_3, 'schedule_type')
add_new_column(df_3, 'residency_flag')
add_new_column(df_3, 'customer_gender')
df_3.shape

(373533, 19)

Now we must drop the original categorical columns:

In [14]:
df_3 = df_3.drop(columns=['facility_type', 'schedule_type', 'residency_flag', 'customer_gender'])
df_3.shape

(373533, 15)
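
Because a single `LabelEncoder` is refitted for every column, only the mapping of the last column remains available afterwards. If the integer codes need to be translated back to labels later, one option is to keep one encoder per column. The following is a sketch under that assumption, on toy data; `encode_columns` is a hypothetical helper, not the notebook's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_columns(frame, columns):
    """Return a copy with '<column>_numeric' added, plus one fitted encoder per column."""
    encoded = frame.copy()
    encoders = {}
    for column in columns:
        encoder = LabelEncoder()
        encoded[f"{column}_numeric"] = encoder.fit_transform(encoded[column])
        encoders[column] = encoder
    return encoded, encoders

# Toy data standing in for the real reservations frame
toy = pd.DataFrame({
    "customer_gender": ["Female", "Mixed", "Female", "Male"],
    "schedule_type": ["Billable", "Internal", "Internal", "Billable"],
})

encoded, encoders = encode_columns(toy, ["customer_gender", "schedule_type"])
print(encoded)

# Classes are sorted, so the codes follow alphabetical order
print(list(encoders["customer_gender"].classes_))
# Map integer codes back to the original labels
print(list(encoders["customer_gender"].inverse_transform([0, 2])))
```

Note that label encoding assigns an arbitrary order to the categories. Tree-based models usually cope with that, while linear models may read the order as meaningful.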

In [15]:
df_3.sample(3)

Unnamed: 0,center_longitude,center_latitude,permit_year,event_start_year,attendance,hours_reserved,permit_hour,event_start_mounth_numeric,event_duration,day_of_week_numeric,permit_month_numeric,facility_type_numeric,schedule_type_numeric,residency_flag_numeric,customer_gender_numeric
374376,-111.662972,33.432725,2023,2023,10,0.9167,9,6,0.916667,3,5,48,1,1,3
370123,-111.741478,33.449963,2023,2024,2,1.0,13,1,1.0,4,11,11,2,0,3
309929,-111.612456,33.399881,2017,2018,30,3.0,17,2,3.0,1,11,39,0,1,1


## GENERATING A PREDICTIVE MACHINE LEARNING MODEL

### FUNCTION TO PREPARE DATA

We'll create a function that splits the data chronologically, standardizes the features, tunes and trains each model, and saves the one with the best weighted R² score:

In [None]:
def tscv_with_weighted_best_model(df, date_column, n_splits, target_column, models, n_iter, param_distributions):
    # Sort chronologically so that every test fold comes after its training fold
    df = df.sort_values(by=date_column, ascending=True)

    tscv = TimeSeriesSplit(n_splits=n_splits)
    scaler = StandardScaler()

    split_number = 1
    model_scores = {name: [] for name in models.keys()}
    training_times = {name: [] for name in models.keys()}
    fitted_models = {}  # tuned estimator from the most recent split, per model name

    total_start_time = time()

    for train_index, test_index in tscv.split(df):
        print(f"\n🔹 Split {split_number}")

        train, test = df.iloc[train_index], df.iloc[test_index]
        X_train, X_test = train.drop(columns=[date_column, target_column]), test.drop(columns=[date_column, target_column])
        y_train, y_test = train[target_column], test[target_column]

        # Fit the scaler on the training fold only, then reuse it on the test fold
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        for name, model in models.items():
            print(f"\nEvaluating {name}:")
            start_time = time()

            random_search = RandomizedSearchCV(model, param_distributions=param_distributions[name], n_iter=n_iter, cv=5,
                                               scoring='r2', n_jobs=8)
            random_search.fit(X_train_scaled, y_train)

            best_candidate = random_search.best_estimator_
            y_pred = best_candidate.predict(X_test_scaled)
            r2 = r2_score(y_test, y_pred)

            print(f" Best R² Score ({name}): {r2:.4f}")

            model_scores[name].append(r2)
            training_times[name].append(time() - start_time)
            fitted_models[name] = best_candidate

        split_number += 1

    # Later splits (trained on more data) weigh more; the first split gets weight 0
    weights = np.linspace(0, 1, num=n_splits)
    weighted_scores = {name: np.average(model_scores[name], weights=weights) for name in models.keys()}

    best_overall_model_name = max(weighted_scores, key=weighted_scores.get)
    # Use the fitted estimator from the last split; models[name] itself was never trained
    best_overall_model = fitted_models[best_overall_model_name]

    total_training_time = time() - total_start_time

    print("\n📊 Weighted Scores Calculation:")
    for name, score in weighted_scores.items():
        print(f" {name}: {score:.4f} (weighted)")

    print(f"\n🎯 The best overall model is {best_overall_model_name} with weighted R² Score: {weighted_scores[best_overall_model_name]:.4f}")

    print(f"\n⏳ Total training time: {total_training_time:.2f} seconds")

    project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
    model_dir = os.path.join(project_dir, "models")
    os.makedirs(model_dir, exist_ok=True)

    best_model_path = os.path.join(model_dir, "best_weighted_model.pkl")
    joblib.dump(best_overall_model, best_model_path)
    print(f"✅ Model saved in: {best_model_path}")

    return best_overall_model
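
To make the splitting step concrete, the sketch below shows how `TimeSeriesSplit` walks over rows that are already sorted by year. The training window grows with every split and the test window always comes later. The `years` array is toy data chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy year column, already sorted in ascending order
years = np.array([2015, 2015, 2016, 2016, 2017, 2017, 2018, 2018])

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(years))

for number, (train_index, test_index) in enumerate(folds, start=1):
    train_years = years[train_index].tolist()
    test_years = years[test_index].tolist()
    print(f"Split {number}: train {train_years} -> test {test_years}")
```

Since the splitter cuts by row position, a real dataset with many rows per year can end up with rows from the same year on both sides of a boundary when the date column holds only the year.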


We'll compare several machine learning regression models to find the best fit without overfitting. Let's define the models and their hyperparameter search spaces:

In [17]:
model_params = {
    'LinearRegression': {},

    'Ridge': {
        'alpha': uniform(0.1, 10),
        'solver': ['auto', 'svd', 'lsqr', 'sparse_cg', 'sag'],
        'max_iter': randint(100, 1000),
        'tol': uniform(1e-6, 1e-2),
    },

    'Lasso': {
        'alpha': uniform(0.1, 10),
        'tol': uniform(1e-6, 1e-2),
        'selection': ['cyclic', 'random']
    },

    'ElasticNet': {
        'alpha': uniform(0.1, 10),
        'l1_ratio': uniform(0, 1),
        'tol': uniform(1e-6, 1e-2),
        'selection': ['cyclic', 'random']
    },
    'RandomForestRegressor': {
        'n_estimators': randint(50, 500),
        'max_depth': randint(10, 100),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'max_features': ['sqrt', 'log2'],
    },

    'XGBRegressor': {
        'n_estimators': randint(50, 500),
        'learning_rate': uniform(0.01, 0.3),
        'max_depth': randint(3, 10),
        'min_child_weight': randint(1, 10),
        'gamma': uniform(0, 1),
        'subsample': uniform(0.1, 0.9),  # uniform(loc, scale) samples from [loc, loc + scale]; subsample must not exceed 1
        'colsample_bytree': uniform(0.1,0.8),
        'reg_alpha': uniform(0, 1),
        'reg_lambda': uniform(0, 1),
    }

}

models={'LinearRegression': LinearRegression(),
        'Ridge': Ridge(random_state=42),        
        'Lasso': Lasso(random_state=42),
        'ElasticNet': ElasticNet(random_state=42),
        'RandomForestRegressor': RandomForestRegressor(random_state=42),
        'XGBRegressor': XGBRegressor(tree_method='hist', device='cuda:0', random_state=42)  # XGBoost >= 2.0: 'gpu_hist' is deprecated in favor of 'hist' plus device
         }
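
The `scipy.stats` distributions used in these search spaces are parameterized as `uniform(loc, scale)` and `randint(low, high)`, so `uniform(a, b)` spans `[a, a + b]` rather than `[a, b]`, and `randint` excludes its upper bound. This matters for bounded parameters such as XGBoost's `subsample`, which has to stay within (0, 1]. A quick way to check a range before launching a long search is sketched below; the variable names are illustrative.

```python
from scipy.stats import randint, uniform

learning_rate = uniform(0.01, 0.3)  # loc=0.01, scale=0.3
max_depth = randint(3, 10)          # integers from 3 up to, but not including, 10

lr_low, lr_high = learning_rate.support()
depth_low, depth_high = max_depth.support()
print(f"learning_rate range: [{lr_low:.2f}, {lr_high:.2f}]")
print(f"max_depth range: {int(depth_low)} to {int(depth_high)}")

# Draw a reproducible sample to confirm the bounds empirically
samples = learning_rate.rvs(size=1000, random_state=0)
print(samples.min() >= lr_low, samples.max() <= lr_high)
```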

Now, let's train and evaluate the models to find the best one. Because the splits are time-ordered, we remove the first and last years so that every remaining year (2015 through 2023) has complete data:

In [18]:
df_to_ml = df_3[(df_3["event_start_year"] > 2014) & (df_3["event_start_year"] < 2024)]
print(df_to_ml["event_start_year"].value_counts())

event_start_year
2023    57577
2022    55407
2021    52674
2019    35149
2015    32795
2020    31327
2016    30875
2017    28970
2018    27735
Name: count, dtype: int64


In [19]:
df_to_ml.shape

(352509, 15)

To keep this first comparison of all models manageable, we use only a few search iterations per model:

In [20]:
n_iter = 2
n_splits = 5

date_column = 'event_start_year'
target_column = 'attendance'

print(f"Data size: {round(df_to_ml.shape[0])}")
print(f"Number of iterations: {n_iter}")

tscv_with_weighted_best_model(df_to_ml, date_column, n_splits, target_column, models,  n_iter, model_params)

Data size: 352509
Number of iterations: 2

🔹 Split 1

Evaluating LinearRegression:
 Best R² Score (LinearRegression): 0.1261

Evaluating Ridge:
 Best R² Score (Ridge): 0.1261

Evaluating Lasso:
 Best R² Score (Lasso): 0.1701

Evaluating ElasticNet:
 Best R² Score (ElasticNet): 0.1698

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.5524

Evaluating XGBRegressor:
 Best R² Score (XGBRegressor): 0.4094

🔹 Split 2

Evaluating LinearRegression:
 Best R² Score (LinearRegression): -0.2087

Evaluating Ridge:
 Best R² Score (Ridge): -0.2102

Evaluating Lasso:
 Best R² Score (Lasso): -0.0755

Evaluating ElasticNet:
 Best R² Score (ElasticNet): -0.1162

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.5251

Evaluating XGBRegressor:
 Best R² Score (XGBRegressor): 0.4805

🔹 Split 3

Evaluating LinearRegression:
 Best R² Score (LinearRegression): 0.0025

Evaluating Ridge:
 Best R² Score (Ridge): 0.0020

Evaluating Lasso:
 Best R² Score (Lasso): 
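
The ranking applied once all splits have run can be sketched as follows. The weights grow linearly from 0 to 1, so later splits, which train on more data, count more, and the first split does not count at all. The scores here are hypothetical values chosen for illustration; they are not taken from the run above.

```python
import numpy as np

n_splits = 5
weights = np.linspace(0, 1, num=n_splits)  # 0, 0.25, 0.5, 0.75, 1

# Hypothetical per-split R² scores for two models
scores = {
    "model_a": [0.9, 0.5, 0.5, 0.5, 0.5],  # strong only on the first split
    "model_b": [0.1, 0.6, 0.6, 0.6, 0.6],  # weak first split, better afterwards
}

plain = {name: float(np.mean(values)) for name, values in scores.items()}
weighted = {name: float(np.average(values, weights=weights)) for name, values in scores.items()}
best = max(weighted, key=weighted.get)

print("Plain means:   ", plain)
print("Weighted means:", weighted)
print("Best by weighted score:", best)
```

In this toy case the plain mean favors `model_a` while the weighted mean favors `model_b`, which illustrates how the weighting shifts the choice toward recent performance.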

The best model is **RandomForestRegressor**, so we can concentrate on it with more iterations:

In [21]:
model_params_2 = {

    'RandomForestRegressor': {
        'n_estimators': randint(50, 500),
        'max_depth': randint(10, 100),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'max_features': ['sqrt', 'log2'],
    },

}

models_2={
        'RandomForestRegressor': RandomForestRegressor(),
         }

In [26]:
n_iter = 20
n_splits = 15

date_column = 'event_start_year'
target_column = 'attendance'

print(f"Data size: {round(df_to_ml.shape[0])}")
print(f"Number of iterations: {n_iter}")

tscv_with_weighted_best_model(df_to_ml, date_column, n_splits, target_column, models_2,  n_iter, model_params_2)

Data size: 352509
Number of iterations: 20

🔹 Split 1

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.6707

🔹 Split 2

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.7753

🔹 Split 3

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.6666

🔹 Split 4

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.6478

🔹 Split 5

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.6357

🔹 Split 6

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.7844

🔹 Split 7

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.6141

🔹 Split 8

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.6184

🔹 Split 9

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.8473

🔹 Split 10

Evaluating RandomForestRegressor:
 Best R² Score (RandomForestRegressor): 0.8904

🔹 Split 11

Evaluating Ra

We finally select the RandomForestRegressor model as the best model for the problem.
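
The function above trains on standardized features, so anything that reloads the saved model needs to apply the same scaling before predicting. One possible way to handle this, shown here as a sketch on synthetic data and not as what the notebook does, is to bundle the scaler and the regressor in a scikit-learn `Pipeline` and persist them together. The file name and the data are assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the reservation features and the attendance target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Scaler and model travel together, so prediction code never has to rescale by hand
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=42),
)
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "attendance_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline reproduces the original predictions
same = np.allclose(pipeline.predict(X[:5]), restored.predict(X[:5]))
print("Predictions match after reload:", same)
```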