## Model Training  

This notebook focuses on transforming the dataset and training predictive models to estimate the unbiased OFF motor score for Parkinson’s Disease (PD) patients.  

### Steps:  
1. **Feature Transformation**: Apply necessary scaling, encoding, and polynomial interactions if needed.  
2. **Model Selection & Training**: Train multiple regression models, including linear regression, decision trees, and ensemble methods.  
3. **Hyperparameter Tuning**: Optimize model performance using grid search or cross-validation.  
4. **Evaluation & Interpretation**: Assess models using RMSE, feature importance, and residual analysis.  

### Objective:  
- Develop an interpretable, robust model for unbiased OFF score prediction.  
- Compare different regressors and select the most effective one for the task.  


In [53]:
import pandas as pd
import numpy as np

In [54]:
X = pd.read_csv("..\\data\\X_train.csv")

In [55]:
X.head()

Unnamed: 0,age_at_diagnosis,age,ledd,time_since_intake_on,time_since_intake_off,on,off,time_since_diagnosis,disease_duration,on_off_diff,ledd_per_year,time_ratio_on_off,short_time_since_on,medium_time_since_on,long_time_since_on,age_ledd_interaction,time_on_ledd_interaction,duration_on_interaction,gene_No Mutation,cohort_B
0,48.5,52.1,607.0,1.9,14.111042,7.0,20.656235,3.6,3.6,-13.656235,168.611111,0.134646,0,1,0,31624.7,1153.3,25.2,False,False
1,48.5,53.0,666.0,1.9,17.6,12.0,44.0,4.5,4.5,-32.0,148.0,0.107955,0,1,0,35298.0,1265.4,54.0,False,False
2,48.5,53.9,717.0,1.2,14.111042,6.0,20.656235,5.4,5.4,-14.656235,132.777778,0.08504,0,1,0,38646.3,860.4,32.4,False,False
3,48.5,54.8,770.0,1.5,14.111042,11.0,20.656235,6.3,6.3,-9.656235,122.222222,0.1063,0,1,0,42196.0,1155.0,69.3,False,False
4,48.5,56.9,885.0,0.3,14.111042,24.0,20.656235,8.4,8.4,3.343765,105.357143,0.02126,1,0,0,50356.5,265.5,201.6,False,False


In [56]:
y = pd.read_csv("..\\data\\y_train_lXj6X5y.csv") # Target variable

In [57]:
X.shape

(55603, 20)

In [58]:
X.head()

Unnamed: 0,age_at_diagnosis,age,ledd,time_since_intake_on,time_since_intake_off,on,off,time_since_diagnosis,disease_duration,on_off_diff,ledd_per_year,time_ratio_on_off,short_time_since_on,medium_time_since_on,long_time_since_on,age_ledd_interaction,time_on_ledd_interaction,duration_on_interaction,gene_No Mutation,cohort_B
0,48.5,52.1,607.0,1.9,14.111042,7.0,20.656235,3.6,3.6,-13.656235,168.611111,0.134646,0,1,0,31624.7,1153.3,25.2,False,False
1,48.5,53.0,666.0,1.9,17.6,12.0,44.0,4.5,4.5,-32.0,148.0,0.107955,0,1,0,35298.0,1265.4,54.0,False,False
2,48.5,53.9,717.0,1.2,14.111042,6.0,20.656235,5.4,5.4,-14.656235,132.777778,0.08504,0,1,0,38646.3,860.4,32.4,False,False
3,48.5,54.8,770.0,1.5,14.111042,11.0,20.656235,6.3,6.3,-9.656235,122.222222,0.1063,0,1,0,42196.0,1155.0,69.3,False,False
4,48.5,56.9,885.0,0.3,14.111042,24.0,20.656235,8.4,8.4,3.343765,105.357143,0.02126,1,0,0,50356.5,265.5,201.6,False,False


In [59]:
y = y['target']


In [60]:
y.shape

(55603,)

### Data Splitting, Transformation, and Encoding Approach

In this approach, we perform the following steps:

1. **Data Splitting with GroupShuffleSplit:**
   - We use `GroupShuffleSplit` to split our data into 5 different train-test splits. The splitting is done based on `patient_id` to ensure that all data for a particular patient appears in either the train or the test set, preventing data leakage between the splits. The test set size is 20% of the data for each split, and the splitting process is repeated 5 times to obtain 5 distinct training and testing sets.
   - The splits ensure that the model is trained on different subsets of the data and tested on unseen data, improving the generalizability of the model.

2. **Feature Transformation:**
   - **Numerical Features Scaling:** We apply `StandardScaler` to scale numerical features (e.g., age, ledd, on/off times) to have zero mean and unit variance. This is important because many machine learning models perform better when numerical features are on the same scale.
   - **Categorical Features Encoding:** We apply `OneHotEncoder` to convert categorical features (e.g., `cohort`, `gene`) into a numerical format. This transformation creates binary columns for each unique category value, allowing the model to handle these features effectively.

3. **Handling 'sexM' Feature:**
   - The `sexM` column is already binary (0/1), so it is added manually to the transformed data after scaling and encoding the features. This ensures that the column is kept in its original form and included in the final transformed dataset.

4. **Storing Transformed Data:**
   - The transformed training and testing datasets (`X_train_transformed` and `X_test_transformed`) for each fold are stored in separate lists. Similarly, the target variables (`y_train` and `y_test`) for each fold are also stored.
   - This allows for easy access to the transformed data for each fold during model training and evaluation.

5. **Why We Are Doing This:**
   - **Cross-validation:** By using `GroupShuffleSplit`, we ensure that we evaluate the model using 5 different splits of the data, which helps in assessing the model's robustness and generalizability across different subsets of the dataset.
   - **Feature Transformation:** Scaling numerical features ensures that they are on a comparable scale, preventing any single feature from dominating the learning process. Encoding categorical variables allows the model to learn from them effectively.
   - **Manual Handling of 'sexM':** Since `sexM` is a binary feature, we ensure that it remains intact by adding it back to the transformed data after encoding the other features.
   - **Efficient Use of Data:** By storing the transformed data and target variables in lists for each fold, we ensure that the data is ready for model training and evaluation, avoiding data leakage and ensuring that each fold is properly handled during training.

This approach is critical for ensuring that the model is evaluated properly and generalizes well to unseen data. By transforming and encoding the data correctly, we also improve the model's ability to learn from both numerical and categorical features. 


In [61]:
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Load original data to get patient IDs
original_df = pd.read_csv('..\\data\\X_train_6ZIKlTY.csv')

# Get patient IDs from original data and match to current training data
# Assuming the rows in X_train maintain the same order as in the original dataset
# If not, you'll need to use some matching logic based on common columns

# Get the first n rows from original df where n is the length of X_train
patient_ids = original_df['patient_id'].values[:len(X)]

# Initialize lists to store transformed data for all splits
X_train_transformed_list = []
X_test_transformed_list = []
y_train_transformed_list = []
y_test_transformed_list = []

# For tracking patients
train_patient_ids_list = []
test_patient_ids_list = []

# Create a temporary dataframe with patient_ids for cross-validation
X_for_cv = X.copy()
X_for_cv['patient_id'] = patient_ids

# Initialize the GroupShuffleSplit
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for train_idx, test_idx in gss.split(X_for_cv, y, groups=X_for_cv['patient_id']):
    # Split data
    X_train_fold = X_for_cv.iloc[train_idx].copy()
    X_test_fold = X_for_cv.iloc[test_idx].copy()
    y_train_fold = y.iloc[train_idx].copy()
    y_test_fold = y.iloc[test_idx].copy()
    
    # Extract patient IDs for reference
    train_patient_ids = X_train_fold['patient_id'].values
    test_patient_ids = X_test_fold['patient_id'].values
    train_patient_ids_list.append(train_patient_ids)
    test_patient_ids_list.append(test_patient_ids)
    
    # Remove patient_id from features
    X_train_fold = X_train_fold.drop('patient_id', axis=1)
    X_test_fold = X_test_fold.drop('patient_id', axis=1)
    
    # Scale the features
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train_fold),
        columns=X_train_fold.columns
    )
    X_test_scaled = pd.DataFrame(
        scaler.transform(X_test_fold),
        columns=X_test_fold.columns
    )
    
    # Store the transformed data
    X_train_transformed_list.append(X_train_scaled)
    X_test_transformed_list.append(X_test_scaled)
    y_train_transformed_list.append(y_train_fold)
    y_test_transformed_list.append(y_test_fold)

# Display information about each fold
for i in range(5):
    print(f"Fold {i+1}:")
    print(f"X_train shape: {X_train_transformed_list[i].shape}, X_test shape: {X_test_transformed_list[i].shape}")
    print(f"y_train shape: {y_train_transformed_list[i].shape}, y_test shape: {y_test_transformed_list[i].shape}")
    print(f"Number of training patients: {len(np.unique(train_patient_ids_list[i]))}")
    print(f"Number of testing patients: {len(np.unique(test_patient_ids_list[i]))}")
    print()

Fold 1:
X_train shape: (44422, 20), X_test shape: (11181, 20)
y_train shape: (44422,), y_test shape: (11181,)
Number of training patients: 5576
Number of testing patients: 1395

Fold 2:
X_train shape: (44481, 20), X_test shape: (11122, 20)
y_train shape: (44481,), y_test shape: (11122,)
Number of training patients: 5576
Number of testing patients: 1395

Fold 3:
X_train shape: (44314, 20), X_test shape: (11289, 20)
y_train shape: (44314,), y_test shape: (11289,)
Number of training patients: 5576
Number of testing patients: 1395

Fold 4:
X_train shape: (44714, 20), X_test shape: (10889, 20)
y_train shape: (44714,), y_test shape: (10889,)
Number of training patients: 5576
Number of testing patients: 1395

Fold 5:
X_train shape: (44364, 20), X_test shape: (11239, 20)
y_train shape: (44364,), y_test shape: (11239,)
Number of training patients: 5576
Number of testing patients: 1395



In [62]:
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

models = {
                "Random Forest": RandomForestRegressor(),
                "Decision Tree": DecisionTreeRegressor(),
                "Gradient Boosting": GradientBoostingRegressor(),
                "Linear Regression": LinearRegression(),
                "K-Neighbors Regressor": KNeighborsRegressor(),
                "XGB Regressor": XGBRegressor(),
                "CatBoosting Regressor": CatBoostRegressor(),
                "Adaboost Regressor": AdaBoostRegressor(),
                "LightGBM": LGBMRegressor()
            }


# Initialize variables to track the best model and fold
best_model_name = None
best_model = None
best_rmse = np.inf
best_fold = None


# Loop through the 5 folds
for i in range(5):
    print(f"Evaluating Fold {i+1}:")
    
    # Get the transformed X_train, X_test, y_train, y_test for this fold
    X_train, X_test = X_train_transformed_list[i], X_test_transformed_list[i]
    y_train, y_test = y_train_transformed_list[i], y_test_transformed_list[i]

    # Loop through each model and evaluate its RMSE on the current fold
    for model_name, model in models.items():
        print(f"Training {model_name}...")
        
        # Fit the model
        model.fit(X_train, y_train)
        
        # Make predictions and calculate RMSE
        y_pred = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        
        print(f"{model_name} RMSE: {rmse}")
        

        # Update best model if this one is better
        if rmse < best_rmse:
            best_rmse = rmse
            best_model_name = model_name
            best_model = model
            best_fold = i+1

# Output the best model and fold
print(f"\nBest Model: {best_model_name} with RMSE: {best_rmse} (Fold {best_fold})")

Evaluating Fold 1:
Training Random Forest...
Random Forest RMSE: 7.5291444645586205
Training Decision Tree...
Decision Tree RMSE: 10.432885504324178
Training Gradient Boosting...
Gradient Boosting RMSE: 7.472982316526827
Training Linear Regression...
Linear Regression RMSE: 10.293236918066198
Training K-Neighbors Regressor...
K-Neighbors Regressor RMSE: 8.20643277317051
Training XGB Regressor...
XGB Regressor RMSE: 7.469905144623316
Training CatBoosting Regressor...
Learning rate set to 0.074559
0:	learn: 15.6639427	total: 146ms	remaining: 2m 26s
1:	learn: 14.8897684	total: 150ms	remaining: 1m 14s
2:	learn: 14.1729319	total: 153ms	remaining: 50.8s
3:	learn: 13.5213793	total: 155ms	remaining: 38.7s
4:	learn: 12.9376135	total: 158ms	remaining: 31.4s
5:	learn: 12.4084753	total: 160ms	remaining: 26.6s
6:	learn: 11.9269548	total: 164ms	remaining: 23.2s
7:	learn: 11.4966523	total: 167ms	remaining: 20.7s
8:	learn: 11.1026408	total: 169ms	remaining: 18.6s
9:	learn: 10.7483536	total: 171ms	rema

In [63]:
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
# Get the best fold's training data (Fold 3 as identified)
X_train, X_test = X_train_transformed_list[2], X_test_transformed_list[2]
y_train, y_test = y_train_transformed_list[2], y_test_transformed_list[2]

# Initialize the CatBoost Regressor
best_model = CatBoostRegressor()

# Train the model on the full dataset of the best fold
best_model.fit(X_train, y_train)

# Make predictions and calculate RMSE
y_pred = best_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(rmse)


Learning rate set to 0.07453
0:	learn: 15.6844449	total: 4.19ms	remaining: 4.18s
1:	learn: 14.9091415	total: 7.01ms	remaining: 3.5s
2:	learn: 14.1884763	total: 10.4ms	remaining: 3.47s
3:	learn: 13.5386672	total: 12.9ms	remaining: 3.22s
4:	learn: 12.9535122	total: 16.2ms	remaining: 3.22s
5:	learn: 12.4360789	total: 18.5ms	remaining: 3.07s
6:	learn: 11.9544337	total: 21.6ms	remaining: 3.06s
7:	learn: 11.5262724	total: 24.5ms	remaining: 3.04s
8:	learn: 11.1333477	total: 28.1ms	remaining: 3.1s
9:	learn: 10.7793194	total: 30.6ms	remaining: 3.03s
10:	learn: 10.4553699	total: 33.6ms	remaining: 3.02s
11:	learn: 10.1932078	total: 36ms	remaining: 2.97s
12:	learn: 9.9365683	total: 39.3ms	remaining: 2.98s
13:	learn: 9.7007349	total: 41.6ms	remaining: 2.93s
14:	learn: 9.5015152	total: 45ms	remaining: 2.96s
15:	learn: 9.3106613	total: 48.5ms	remaining: 2.98s
16:	learn: 9.1444577	total: 51.6ms	remaining: 2.98s
17:	learn: 8.9952374	total: 54.2ms	remaining: 2.96s
18:	learn: 8.8683115	total: 57.5ms	rema

## LSTM Implementation (Only with fold 3)

In [64]:
from keras.models import Sequential
from keras.layers import LSTM, Dense


# Step 1: Reshape the data for LSTM (3D format: [samples, time_steps, features])

X_train_reshaped = X_train.values.reshape(X_train.shape[0], 1, X_train.shape[1])  # Reshape for LSTM
X_test_reshaped = X_test.values.reshape(X_test.shape[0], 1, X_test.shape[1])  # Reshape for LSTM

# Step 2: Build the LSTM Model
model = Sequential()
model.add(LSTM(units=50, activation='relu', input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])))
model.add(Dense(units=1))  # Output layer for regression (1 output value)

# Step 3: Compile the Model
model.compile(optimizer='adam', loss='mean_squared_error')

# Step 4: Train the Model
model.fit(X_train_reshaped, y_train, epochs=10, batch_size=32)

# Step 5: Evaluate and Predict
# Predict the test set results
y_pred = model.predict(X_test_reshaped)

# Evaluate the model
rmse_lstm = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Root Mean Squared Error on Test Set: {rmse_lstm}')



Epoch 1/10


  super().__init__(**kwargs)


[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 1ms/step - loss: 875.4548
Epoch 2/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - loss: 91.7777
Epoch 3/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 963us/step - loss: 74.8524
Epoch 4/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 69.3433
Epoch 5/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 930us/step - loss: 65.9816
Epoch 6/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 973us/step - loss: 64.6693
Epoch 7/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 953us/step - loss: 62.5973
Epoch 8/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 933us/step - loss: 60.9454
Epoch 9/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 939us/step - loss: 61.1343
Epoch 10/10
[1m1385/1385[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[3

## Hyperparameter Tuning for the models catboost and LSTM

In [65]:
from catboost import CatBoostRegressor
from sklearn.model_selection import RandomizedSearchCV

# Enhanced parameter distribution for RandomizedSearchCV
param_dist = {
    # Training parameters
    'iterations': [500, 1000, 1500, 2000, 3000],  # Added more iterations for potential better convergence
    'learning_rate': [0.001, 0.003, 0.005, 0.01, 0.03, 0.05],  # Added smaller learning rates for finer optimization
    
    # Tree parameters
    'depth': [4, 6, 8, 10],  # Modified depth range
    'l2_leaf_reg': [1, 3, 5, 7, 9],  # L2 regularization
    'random_strength': [1, 3, 5, 10],  # Controls randomization in tree structure
    
    # Sampling parameters
    'subsample': [0.65, 0.7, 0.75, 0.8, 0.85, 0.9],  # More granular subsampling
    'colsample_bylevel': [0.65, 0.7, 0.75, 0.8, 0.85, 0.9],  # More granular feature subsampling
    
    # Boosting parameters
    'boosting_type': ['Ordered', 'Plain'],  # Try different boosting types
    'bootstrap_type': ['Bayesian', 'Bernoulli', 'MVS'],  # Different bootstrap schemes
    'bagging_temperature': [0.0, 0.5, 1.0, 2.0],  # For Bayesian bootstrap
    
    # Feature processing
    'max_bin': [200, 300, 400, 500],  # Controls feature discretization
    'min_data_in_leaf': [1, 3, 5, 7, 10],  # Helps prevent overfitting
    
    # Loss function specific
    'loss_function': ['RMSE'],  # Keep RMSE as the objective
}

# Initialize CatBoost with additional base parameters
model = CatBoostRegressor(
    cat_features=[],
    verbose=0,
    early_stopping_rounds=50,  # Add early stopping
    task_type='CPU',  # Specify task type
    thread_count=-1  # Use all available CPU cores
)

# Enhanced RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=50,  # Increased number of iterations
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    random_state=42,
    verbose=2
)

# Perform randomized search to find the best parameters
random_search.fit(X_train, y_train)

# Print the best parameters found
print("Best parameters found: ", random_search.best_params_)

# Make predictions with the best model found
best_model_catboost = random_search.best_estimator_
y_pred = best_model_catboost.predict(X_test)

# Calculate RMSE for the test set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE of the best model on the test set: {rmse}')


Fitting 5 folds for each of 50 candidates, totalling 250 fits


190 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\cilek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\cilek\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\catboost\core.py", line 5873, in fit
    return self._fit(X, y, cat_features, text_features, embedding_features, None, graph, sample_weight, None, None, None, None, baseli

Best parameters found:  {'subsample': 0.7, 'random_strength': 1, 'min_data_in_leaf': 7, 'max_bin': 400, 'loss_function': 'RMSE', 'learning_rate': 0.005, 'l2_leaf_reg': 5, 'iterations': 3000, 'depth': 10, 'colsample_bylevel': 0.8, 'bootstrap_type': 'MVS', 'boosting_type': 'Plain', 'bagging_temperature': 0.0}
RMSE of the best model on the test set: 7.271105537298448


In [None]:
#Best parameters found:  {'subsample': 0.7, 'random_strength': 1, 'min_data_in_leaf': 7, 'max_bin': 400, 'loss_function': 'RMSE',
# 'learning_rate': 0.005, 'l2_leaf_reg': 5, 'iterations': 3000, 'depth': 10, 'colsample_bylevel': 0.8, 'bootstrap_type': 'MVS', 
# 'boosting_type': 'Plain', 'bagging_temperature': 0.0}
#RMSE of the best model on the test set: 7.271105537298448


best_model_catboost.fit(X_train, y_train)

y_pred = best_model_catboost.predict(X_test)

# Calculate RMSE for the test set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE of the best model on the test set: {rmse}')

RMSE of the best model on the test set: 7.271105537298448


In [67]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau
from sklearn.model_selection import RandomizedSearchCV
from sklearn.base import BaseEstimator, RegressorMixin

# Simpler LSTM wrapper class
class SimpleLSTMRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, units=100, dropout_rate=0.2, activation='relu', batch_size=32, epochs=50):
        self.units = units
        self.dropout_rate = dropout_rate
        self.activation = activation
        self.batch_size = batch_size
        self.epochs = epochs
        self.model = None

    def build_model(self, input_shape):
        model = Sequential([
            LSTM(units=self.units, activation=self.activation, input_shape=input_shape),
            Dropout(self.dropout_rate),
            Dense(64, activation=self.activation),
            Dropout(self.dropout_rate),
            Dense(1)
        ])
        
        model.compile(optimizer=Adam(), loss='mean_squared_error')
        return model

    def fit(self, X, y):
        input_shape = (X.shape[1], X.shape[2])
        self.model = self.build_model(input_shape)
        
        # Learning rate scheduler
        lr_scheduler = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5, min_lr=1e-5, verbose=1)
        
        self.model.fit(X, y, epochs=self.epochs, batch_size=self.batch_size, verbose=0, callbacks=[lr_scheduler])
        return self

    def predict(self, X):
        return self.model.predict(X, verbose=0)

# Reshape data
X_train_reshaped = X_train.values.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test_reshaped = X_test.values.reshape(X_test.shape[0], 1, X_test.shape[1])

# Simplified parameter distribution
param_dist = {
    'units': [64, 128, 256],
    'dropout_rate': [0.1, 0.2, 0.3],
    'batch_size': [32, 64],
    'epochs': [30, 50],
    'activation': ['relu', 'tanh']
}

# Initialize model
model = SimpleLSTMRegressor()

# RandomizedSearchCV with fewer iterations
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_dist,
    n_iter=10,  # Reduced number of iterations
    cv=3,       # Reduced number of folds
    scoring='neg_root_mean_squared_error',
    n_jobs=1,
    random_state=42,
    verbose=1
)

# Fit the model
random_search.fit(X_train_reshaped, y_train)

# Get best model and make predictions
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test_reshaped)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print results
print("\nBest Parameters:", random_search.best_params_)
print("Best CV Score:", -random_search.best_score_)
print("Test Set RMSE:", rmse)


Fitting 3 folds for each of 10 candidates, totalling 30 fits


  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)



Epoch 38: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.


  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)



Epoch 49: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.


  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)



Epoch 41: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.


  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)
  super().__init__(**kwargs)



Best Parameters: {'units': 128, 'epochs': 50, 'dropout_rate': 0.2, 'batch_size': 64, 'activation': 'relu'}
Best CV Score: 7.451579361940983
Test Set RMSE: 7.395590000639331


## Saving the best model (Catboost)

With all cross-validations and RMSE scores, we have decided to go with the catboost regressor model with the best parameters and dumping the model in .pkl file for further predictions.

In [68]:
import pickle

model_path = "..\\model\\catboost_model.pkl"

# Save the trained model to a .pkl file
with open(model_path, 'wb') as f:
    pickle.dump(best_model_catboost, f)

print("Best CatBoost model retrained and saved as 'catboost_model.pkl'.")

Best CatBoost model retrained and saved as 'catboost_model.pkl'.


## Predictions for the challenge

In [69]:
test = pd.read_csv("..\\data\\X_test.csv")

test.head()

Unnamed: 0,age_at_diagnosis,age,ledd,time_since_intake_on,time_since_intake_off,on,off,time_since_diagnosis,disease_duration,on_off_diff,ledd_per_year,time_ratio_on_off,short_time_since_on,medium_time_since_on,long_time_since_on,age_ledd_interaction,time_on_ledd_interaction,duration_on_interaction,gene_No Mutation,cohort_B
0,43.8,48.5,791.0,1.7,14.111042,25.0,20.656235,4.7,4.7,4.343765,168.297872,0.120473,0,1,0,38363.5,1344.7,117.5,True,False
1,43.8,49.1,826.0,2.8,14.111042,25.0,20.656235,5.3,5.3,4.343765,155.849057,0.198426,0,1,0,40556.6,2312.8,132.5,True,False
2,43.8,49.5,848.0,0.7,13.8,33.0,41.0,5.7,5.7,-8.0,148.77193,0.050725,1,0,0,41976.0,593.6,188.1,True,False
3,43.8,50.0,874.0,5.2,14.111042,34.0,20.656235,6.2,6.2,13.343765,140.967742,0.368506,0,0,1,43700.0,4544.8,210.8,True,False
4,43.8,50.8,924.0,3.4,13.7,30.0,47.0,7.0,7.0,-17.0,132.0,0.248175,0,0,1,46939.2,3141.6,210.0,True,False


In [70]:
original_test = pd.read_csv("..\\data\\X_test_oiZ2ukx.csv")
index = original_test['Index'] 

In [71]:
test.shape

(23672, 20)

In [72]:
index.head()

0    0
1    1
2    2
3    3
4    4
Name: Index, dtype: int64

In [73]:
index.shape

(23672,)

In [74]:
test_scaled = pd.DataFrame(
        scaler.fit_transform(test),
        columns=test.columns
    )

In [75]:
test_scaled.head()

Unnamed: 0,age_at_diagnosis,age,ledd,time_since_intake_on,time_since_intake_off,on,off,time_since_diagnosis,disease_duration,on_off_diff,ledd_per_year,time_ratio_on_off,short_time_since_on,medium_time_since_on,long_time_since_on,age_ledd_interaction,time_on_ledd_interaction,duration_on_interaction,gene_No Mutation,cohort_B
0,-1.207818,-1.199883,1.03424,-0.102269,-0.014336,0.317543,-0.251609,-0.225707,-0.225707,0.466597,-0.042826,-0.112888,-0.324882,0.517213,-0.361502,0.022388,0.332703,-0.115978,0.743486,-0.343475
1,-1.207818,-1.146944,1.231919,1.0775,-0.014336,0.317543,-0.251609,-0.091039,-0.091039,0.466597,-0.124114,1.022205,-0.324882,0.517213,-0.361502,0.173517,1.660023,-0.008627,0.743486,-0.343475
2,-1.207818,-1.111651,1.356175,-1.174786,-0.246657,1.212514,1.28836,-0.00126,-0.00126,-0.468041,-0.170326,-1.12851,3.078041,-1.93344,-0.361502,0.271329,-0.697098,0.389289,0.743486,-0.343475
3,-1.207818,-1.067535,1.503022,3.651541,-0.014336,1.324386,-0.251609,0.110963,0.110963,1.148053,-0.221286,3.498772,-0.324882,-1.93344,2.766233,0.390132,4.720221,0.551747,0.743486,-0.343475
4,-1.207818,-0.996949,1.785421,1.72101,-0.321348,0.8769,1.742544,0.29052,0.29052,-1.149498,-0.279843,1.746612,-0.324882,-1.93344,2.766233,0.613349,2.796354,0.546022,0.743486,-0.343475


In [76]:
predictions= best_model_catboost.predict(test_scaled)

predictions

array([44.96841571, 45.33511987, 44.94503261, ..., 57.96628022,
       50.73549627, 57.41866998])

In [83]:
submission_df = pd.DataFrame({'Index': range(len(predictions)), 'target': predictions})

submission_df.head()

Unnamed: 0,Index,target
0,0,44.968416
1,1,45.33512
2,2,44.945033
3,3,54.635862
4,4,48.911223


## İKİ SUBMİSSİON ARASI FARKA BAKMAYI UNUTMA!

In [84]:
# Save to CSV in the "submission.csv" file
submission_df.to_csv("..\\data\\submission.csv", index=False)

print("CSV file saved successfully!")

CSV file saved successfully!
