# Task
Split the provided notebook cell into multiple cells, adding code explanations and refining the code to be more professional. The goal is to improve the readability and structure of the notebook while maintaining the functionality of the original code, which involves training an AdaBoost model for classification and generating a submission file.

## Split and refine imports and data loading

### Subtask:
Create separate cells for imports, mounting Google Drive, defining file paths, and loading data with explanations.


**Reasoning**:
Create the first cell for all the necessary imports with explanations.



In [1]:
# -----------------------------
# 1️⃣ Imports
# -----------------------------
# pandas is used for data manipulation and analysis.
import pandas as pd

# AdaBoostClassifier is the boosting algorithm.
# DecisionTreeClassifier is used as the base estimator for AdaBoost.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# train_test_split is used to split the data into training and validation sets.
# RandomizedSearchCV is used for hyperparameter tuning of the model.
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# accuracy_score and classification_report are used to evaluate the model's performance.
from sklearn.metrics import accuracy_score, classification_report

# drive is used to mount Google Drive to access files stored there.
from google.colab import drive

**Reasoning**:
Create a new cell to mount Google Drive and add a comment explaining the purpose.



In [2]:
# -----------------------------
# 2️⃣ Mount Google Drive
# -----------------------------
# Mount Google Drive to access files stored in 'MyDrive'.
drive.mount('/content/drive')

Mounted at /content/drive


**Reasoning**:
Create a new cell to define the file paths for the datasets and add comments explaining each path.



In [3]:
# -----------------------------
# 3️⃣ Define File Paths
# -----------------------------
# Define file paths based on the project structure within Google Drive.
train_processed_path = '/content/drive/MyDrive/ml_project/data/train_preprocessed.csv'
test_processed_path = '/content/drive/MyDrive/ml_project/data/test_preprocessed.csv'
sample_submission_path = '/content/drive/MyDrive/ml_project/data/sample_submission.csv'

**Reasoning**:
Create the final cell for loading the datasets into pandas DataFrames and include a quick check of their shapes.



In [4]:
# -----------------------------
# 4️⃣ Load Data
# -----------------------------
# Load the preprocessed training and test datasets, and the sample submission file into pandas DataFrames.
train_df = pd.read_csv(train_processed_path)
test_df = pd.read_csv(test_processed_path)
sample_submission_df = pd.read_csv(sample_submission_path)

# Optional: quick check to display the shapes of the loaded dataframes.
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("Sample submission shape:", sample_submission_df.shape)

Train shape: (15533, 22)
Test shape: (5225, 21)
Sample submission shape: (13840, 2)


## Explain feature and target preparation

### Subtask:
Create a cell with code and explanations for preparing the features (X) and target (y) variables, including the train-validation split.


**Reasoning**:
Create a cell to prepare features and target, and split the data into training and validation sets with explanations.



In [5]:
# -----------------------------
# 5️⃣ Prepare Features, Target, and Split Data
# -----------------------------
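# Separate the features (X) from the target variable (y).
# 'WeightCategory' is the target variable we want to predict.
# Note: X still contains the 'id' column here (21 columns); it is dropped only in the
# retraining step, so the search and validation below treat 'id' as a feature.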
X = train_df.drop('WeightCategory', axis=1)
y = train_df['WeightCategory']

# Split the data into training and validation sets.
# X_train, y_train: Data used for training the model.
# X_val, y_val: Data used for evaluating the model's performance during development.
# test_size=0.2: Allocate 20% of the data for validation.
# random_state=42: Ensure the split is reproducible.
# stratify=y: Maintain the same proportion of target classes in both train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)

X_train shape: (12426, 21)
X_val shape: (3107, 21)
y_train shape: (12426,)
y_val shape: (3107,)
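
The effect of `stratify=y` can be checked by comparing class counts in each split. The following is a minimal sketch on a made-up, imbalanced toy target (`X_toy` and `y_toy` are hypothetical names, not the notebook's data); in the notebook, `y`, `y_train`, and `y_val` would take their place.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy target with a 60/30/10 class mix.
y_toy = pd.Series([0] * 60 + [1] * 30 + [2] * 10, name="WeightCategory")
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class proportions per split; with stratify they follow the full data.
mix = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True).sort_index(),
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "val": y_va.value_counts(normalize=True).sort_index(),
})
print(mix)
```

If the three columns agree closely, the validation metrics are computed on the same class balance the model was trained on.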


## Detail AdaBoost model definition and parameter grid

### Subtask:
Create a cell to define the AdaBoost model and explain the parameter grid used for hyperparameter tuning.


**Reasoning**:
Create a cell to define the AdaBoost model and its parameter grid with explanations.



In [6]:
# -----------------------------
# 6️⃣ Define AdaBoost Model and Parameter Grid
# -----------------------------

# Define the base estimator for the AdaBoost model.
# A Decision Tree Classifier is commonly used as a weak learner.
# random_state=42 ensures reproducibility of the base estimator.
base_estimator = DecisionTreeClassifier(random_state=42)

# Define the AdaBoost Classifier model.
# It boosts the performance of the base estimator.
# estimator: The weak learner to boost (this argument was named base_estimator before scikit-learn 1.2).
# random_state=42 ensures reproducibility of the AdaBoost model.
ada_model = AdaBoostClassifier(
    estimator=base_estimator,
    random_state=42
)

# Define the parameter grid for hyperparameter tuning using RandomizedSearchCV.
param_grid = {
    # n_estimators: Maximum number of weak learners to fit. More estimators can improve performance but increase training time.
    'n_estimators': [50, 100, 200, 300],
    # learning_rate: Weight applied to each weak learner's contribution. A lower learning rate usually needs more estimators.
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    # estimator__max_depth: Maximum depth of the decision tree base estimators. The 'estimator__' prefix routes the value
    # to the nested tree. Shallow trees (small max_depth) are typical for boosting.
    'estimator__max_depth': [1, 2, 3, 4]
}

print("AdaBoost model defined.")
print("Parameter grid defined for tuning:")
print(param_grid)

AdaBoost model defined.
Parameter grid defined for tuning:
{'n_estimators': [50, 100, 200, 300], 'learning_rate': [0.01, 0.05, 0.1, 0.2], 'estimator__max_depth': [1, 2, 3, 4]}


**Reasoning**:
Create a cell for RandomizedSearchCV to find the best hyperparameters for the AdaBoost model.



In [7]:
# -----------------------------
# 7️⃣ Perform Randomized Search for Hyperparameter Tuning
# -----------------------------

# Initialize RandomizedSearchCV to find the best hyperparameters.
# estimator: The model to tune (AdaBoostClassifier).
# param_distributions: The grid of hyperparameters to sample from.
# n_iter: Number of parameter settings sampled. More iterations cover more of the grid but take longer.
# cv: Number of folds for cross-validation.
# scoring: Metric to evaluate the model performance (accuracy).
# n_jobs=-1: Use all available CPU cores for faster processing.
# verbose=1: Display progress during the search.
# random_state=42: Ensure reproducibility of the random sampling.
rand_search_ada = RandomizedSearchCV(
    estimator=ada_model,
    param_distributions=param_grid,
    n_iter=30,        # sample 30 of the 64 grid combinations to keep the search fast
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1,
    random_state=42
)

# Fit RandomizedSearchCV to the training data to find the best hyperparameters.
rand_search_ada.fit(X_train, y_train)

# Print the best hyperparameters found and the corresponding best cross-validation score.
print("\nBest hyperparameters found by RandomizedSearchCV:", rand_search_ada.best_params_)
print("Best cross-validation accuracy:", rand_search_ada.best_score_)

Fitting 3 folds for each of 30 candidates, totalling 90 fits

Best hyperparameters found by RandomizedSearchCV: {'n_estimators': 200, 'learning_rate': 0.1, 'estimator__max_depth': 4}
Best cross-validation accuracy: 0.883148237566393
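
To put `n_iter=30` in context, the size of the full grid can be counted with `ParameterGrid`. This is a small sketch that repeats the grid from the cell above; the variable names `n_combinations` and `n_iter` are chosen here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "estimator__max_depth": [1, 2, 3, 4],
}

# Every combination an exhaustive grid search would try.
n_combinations = len(ParameterGrid(param_grid))
n_iter = 30

print(f"Grid size: {n_combinations}")
print(f"Share sampled with n_iter={n_iter}: {n_iter / n_combinations:.0%}")
```

With 4 x 4 x 4 = 64 combinations, the randomized search evaluates a little under half of the grid, so the reported best parameters are the best among the sampled candidates rather than a guaranteed optimum over the grid.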


**Reasoning**:
Create a cell to evaluate the best AdaBoost model on the validation set and display the performance metrics.



In [8]:
# -----------------------------
# 8️⃣ Evaluate the Best Model on the Validation Set
# -----------------------------

# Get the best estimator found by RandomizedSearchCV.
best_ada = rand_search_ada.best_estimator_

# Make predictions on the validation set using the best model.
y_val_pred = best_ada.predict(X_val)

# Evaluate the model's performance on the validation set.
# Calculate and print the accuracy score.
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))

# Print the classification report, which includes precision, recall, F1-score, and support for each class.
print("\nClassification Report:\n", classification_report(y_val, y_val_pred))

Validation Accuracy: 0.8818796266495011

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.95      0.92       374
           1       0.87      0.85      0.86       469
           2       0.86      0.83      0.84       441
           3       0.94      0.97      0.95       481
           4       0.99      1.00      0.99       597
           5       0.74      0.75      0.74       369
           6       0.78      0.76      0.77       376

    accuracy                           0.88      3107
   macro avg       0.87      0.87      0.87      3107
weighted avg       0.88      0.88      0.88      3107
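
The report shows the lowest F1-scores for classes 5 and 6 (0.74 and 0.77). A confusion matrix is one way to see which classes those samples are assigned to instead. The sketch below uses short made-up label lists, not the notebook's results; in the notebook, `y_val` and `y_val_pred` would replace `y_true` and `y_pred`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = [5, 5, 5, 6, 6, 6, 1, 1]
y_pred = [5, 6, 5, 6, 5, 6, 1, 5]

labels = [1, 5, 6]
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=[f"true_{c}" for c in labels],    # rows: actual class
    columns=[f"pred_{c}" for c in labels],  # columns: predicted class
)
print(cm)
```

Off-diagonal cells count the misclassifications, so large values there point to the class pairs the model has trouble separating.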



**Reasoning**:
Retrain the best AdaBoost model on the full training dataset and then predict on the test set. An earlier attempt at this step raised a traceback: the `AdaBoostClassifier` constructor received an unexpected keyword argument `estimator__max_depth`. That hyperparameter belongs to the base estimator (`DecisionTreeClassifier`), not to `AdaBoostClassifier` itself, so the best parameters must be split and passed to the respective estimators.



In [17]:
# -----------------------------
# 9️⃣ Retrain on Full Training Data and Predict on Test Set
# -----------------------------

# Get the best parameters from the randomized search result.
best_params = rand_search_ada.best_params_

# Split the best parameters: keys prefixed with 'estimator__' belong to the base decision tree (e.g. max_depth),
# while the remaining keys (n_estimators, learning_rate) belong to AdaBoost itself.
base_estimator_params = {key.split('__', 1)[1]: value for key, value in best_params.items() if key.startswith('estimator__')}
ada_params = {key: value for key, value in best_params.items() if '__' not in key}

# Define the base estimator with its best parameters.
base_estimator_full = DecisionTreeClassifier(random_state=42, **base_estimator_params)

# Retrain the AdaBoost model using the optimal hyperparameters found by RandomizedSearchCV
# on the entire training dataset.
# Drop the 'id' column from the full training data before retraining.
X_full_train = X.drop('id', axis=1)

best_ada_full = AdaBoostClassifier(
    estimator=base_estimator_full, # Pass the configured base estimator
    random_state=42,              # Ensure reproducibility
    **ada_params                  # Unpack the AdaBoost specific parameters
)
best_ada_full.fit(X_full_train, y)

# Make predictions on the preprocessed test dataset (test_df) using the retrained model.
# Drop the 'id' column from the test_df before prediction.
X_test = test_df.drop('id', axis=1)
y_test_pred = best_ada_full.predict(X_test)

print("Model retrained on full training data (excluding 'id').")
print("Predictions made on the test set (excluding 'id').")

Model retrained on full training data (excluding 'id').
Predictions made on the test set (excluding 'id').


**Reasoning**:
The previous steps have trained the model and generated predictions. The next step is to map the numeric predictions back to the original string labels and then prepare the submission file in the required format for Kaggle.



In [18]:
# -----------------------------
# 10️⃣ Map Numeric Predictions to String Labels
# -----------------------------

# Define the mapping from the numeric class labels (as encoded in the preprocessed data)
# back to the original string labels representing the weight categories.
# This mapping must match the encoding applied during preprocessing.
label_map = {
    0: 'Insufficient_Weight',
    1: 'Normal_Weight',
    2: 'Obesity_Type_I',
    3: 'Obesity_Type_II',
    4: 'Obesity_Type_III',
    5: 'Overweight_Level_I',
    6: 'Overweight_Level_II'
}

# Map the numeric predictions from the test set (y_test_pred) to their corresponding string labels
# using a list comprehension and the defined label_map.
y_test_labels = [label_map[num] for num in y_test_pred]

print("Numeric predictions mapped to string labels.")

# -----------------------------
# 11️⃣ Prepare Kaggle Submission File
# -----------------------------

# Create a pandas DataFrame for the submission file.
# It should contain two columns: 'id' and 'WeightCategory'.
# The 'id' column is taken from the original test_df (which still contains the ids).
# The 'WeightCategory' column contains the predicted string labels.
submission = pd.DataFrame({
    'id': test_df['id'],  # Use the 'id' column from the original test_df
    'WeightCategory': y_test_labels
})

# Define the path where the submission file will be saved in Google Drive.
submission_path = '/content/drive/MyDrive/ml_project/data/ada_submission.csv'

# Save the submission DataFrame to a CSV file.
# index=False prevents pandas from writing the DataFrame index as a column in the CSV.
submission.to_csv(submission_path, index=False)

print(f"AdaBoost submission saved successfully to: {submission_path}")


Numeric predictions mapped to string labels.
AdaBoost submission saved successfully to: /content/drive/MyDrive/ml_project/data/ada_submission.csv
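
Before uploading, a few basic checks on the submission frame can catch formatting problems early. Note that the sample submission loaded earlier has 13,840 rows while the test set has 5,225, so the sketch below compares ids against the test set rather than comparing row counts against the sample. The helper `check_submission` and the small frames are hypothetical and exist only for illustration; in the notebook, `submission`, `sample_submission_df`, and `test_df['id']` would be passed in.

```python
import pandas as pd


def check_submission(submission, sample_submission, test_ids):
    """Return a list of problems; an empty list means the basic checks passed."""
    if list(submission.columns) != list(sample_submission.columns):
        return ["column names differ from the sample submission"]
    problems = []
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(submission["id"]) != set(test_ids):
        problems.append("ids do not match the test set")
    if submission.isna().any().any():
        problems.append("missing values")
    return problems


# Made-up frames for illustration only.
sample = pd.DataFrame({"id": [1, 2], "WeightCategory": ["Normal_Weight", "Normal_Weight"]})
good = pd.DataFrame({"id": [10, 11], "WeightCategory": ["Normal_Weight", "Obesity_Type_I"]})
bad = pd.DataFrame({"id": [10, 10], "WeightCategory": ["Normal_Weight", None]})

print(check_submission(good, sample, test_ids=[10, 11]))
print(check_submission(bad, sample, test_ids=[10, 11]))
```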
