## Training the Model for Modality Selection 🚀

If you have features extracted from the Camel dataset in a file named `final.csv`, you're ready to train your machine learning model for modality selection!

### Dataset Path Configuration:

Make sure to set the path according to where the Camel dataset is located:

```python
path = '/content/drive/MyDrive/Camel/'
```

### Model Training Options:

Decide whether you want to train the model on:

- Random 30% of all ROIs
- Hand-picked sequences

Once you've made your choice, execute the appropriate code cell below.

Happy training! 🌟

In [3]:
import numpy as np
import pandas as pd
import joblib
import os

from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, f1_score, precision_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier


In [4]:
current_directory = os.getcwd()
current_directory

'C:\\Users\\dujsi\\Desktop\\Tracker-main\\Tracker-main'

In [5]:
path = os.path.join(current_directory, 'Camel')
path

'C:\\Users\\dujsi\\Desktop\\Tracker-main\\Tracker-main\\Camel'

In [8]:
final = pd.read_csv(os.path.join(path, 'final.csv'))

In [9]:
final['IR_better'] = final['IoU_ir'] > final['IoU']
tree_data = final.copy()
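
The label built above can be checked on a small example. This is only a sketch: the toy table stands in for `final.csv`, and its values are invented for illustration. It shows that the comparison is strict, so a tie between `IoU_ir` and `IoU` is labelled `False`.

```python
import pandas as pd

# Toy stand-in for final.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "seq": [1, 1, 2, 2],
    "IoU": [0.80, 0.40, 0.55, 0.10],
    "IoU_ir": [0.60, 0.70, 0.55, 0.30],
})

# Strict comparison: the tie in the third row is labelled False.
toy["IR_better"] = toy["IoU_ir"] > toy["IoU"]

# Share of positive labels, a quick check for class imbalance.
positive_share = toy["IR_better"].mean()
print(toy)
print("share of IR_better:", positive_share)
```

Checking the share of positive labels before training helps when interpreting accuracy later, since a heavily imbalanced label makes accuracy look better than it is.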

🔍 **Manual Sequence Selection:**
To train the model on hand-picked sequences, list their sequence numbers in the variable `seqs` as shown below:
```python
seqs = [3, 7, 8, 9, 13, 22, 29]
```

In [10]:
seqs = [3, 7, 8, 9, 13, 22, 29]

In [11]:
# Columns to keep away from the classifier
columns_to_keep_away = ['frame_number', 'track_id', 'IoU', 'IoU_ir', 'seq', 'IR_better', 'y', 'x', 'w', 'h']
# Sequences listed in seqs form the training set; all remaining sequences form the test set
train = tree_data[tree_data['seq'].isin(seqs)]
test = tree_data[~tree_data['seq'].isin(seqs)]
X = tree_data.drop(columns=columns_to_keep_away)
data_to_keep_away = tree_data[columns_to_keep_away]

# Extract the target variable and the feature matrix for each split
y_train = train['IR_better']
y_test = test['IR_better']
X_train = train.drop(columns=columns_to_keep_away)
X_test = test.drop(columns=columns_to_keep_away)

imputer = SimpleImputer(strategy='mean')

# Impute missing values in X
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(X_train)

# Transform both the training and test data
X_train_normalized = scaler.transform(X_train)
X_test_normalized = scaler.transform(X_test)
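
Splitting by sequence keeps every frame of a given sequence on one side of the split. The sketch below illustrates this on toy data; the `feature_a` column and all values are hypothetical and only stand in for the real feature table.

```python
import pandas as pd

# Toy frame table; "feature_a" is a hypothetical feature column.
toy = pd.DataFrame({
    "seq": [3, 3, 7, 10, 10, 12],
    "feature_a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
seqs = [3, 7]

# Rows from the listed sequences go to training, the rest to testing.
in_train = toy["seq"].isin(seqs)
train, test = toy[in_train], toy[~in_train]

# No sequence should appear on both sides of the split.
shared = set(train["seq"]) & set(test["seq"])
print(len(train), len(test), shared)
```

An empty `shared` set confirms that the test frames come from sequences the model never saw during training.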

🎲 **Random Selection:**
To train the model on a random 30% subset from all sequences, run the corresponding cell below.

In [12]:
# Create a DataFrame containing only the columns you want to keep away from the classifier
columns_to_keep_away = ['frame_number', 'track_id', 'IoU', 'IoU_ir', 'seq', 'IR_better','y', 'x', 'w', 'h']
data_to_keep_away = tree_data[columns_to_keep_away]

# Remove the columns you want to keep away from the original dataset
X = tree_data.drop(columns=columns_to_keep_away)

# Extract the target variable (y)
y = tree_data['IR_better']

# Initialize SimpleImputer with mean strategy
imputer = SimpleImputer(strategy='mean')

# Impute missing values in X
X_imputed = imputer.fit_transform(X)

# Convert back to DataFrame after imputation
X = pd.DataFrame(X_imputed, columns=X.columns)

# Split data into training and testing sets (test_size=0.7 leaves a random 30% for training)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=42)

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(X_train)

# Transform both the training and test data
X_train_normalized = scaler.transform(X_train)
X_test_normalized = scaler.transform(X_test)
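
As a possible variation on the random split, `train_test_split` accepts a `stratify` argument that keeps the class proportions similar in both parts. The notebook does not use it; the sketch below is an assumption about how it could be applied, shown on synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20% positive labels (invented for illustration).
X = np.arange(200).reshape(100, 2)
y = np.array([True] * 20 + [False] * 80)

# test_size=0.7 leaves roughly 30% of the samples for training;
# stratify=y keeps the share of positives similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print("positive share in train:", y_train.mean())
print("positive share in test:", y_test.mean())
```

Stratification matters most when one class is rare, because a purely random 30% sample could otherwise contain very few positive examples.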

## Classifiers Overview 📊

### Description:
A list of classifiers is defined to be tried out for the task. Each classifier is instantiated with specific parameters.

### Classifiers:
- **🌲 Random Forest:** Utilizes a forest of decision trees, each trained with a random subset of the data.
- **🚀 AdaBoost:** Combines multiple weak classifiers to form a strong classifier.
- **🌈 Gradient Boosting:** Builds a series of decision trees, where each tree corrects the errors of the previous one.
- **KNN (K-Nearest Neighbors):** Assigns a class label based on the majority class among its k nearest neighbors.
- **MLP (Multi-Layer Perceptron):** A neural network with multiple layers, trained with backpropagation.

### Training and Evaluation:
Each classifier is trained and evaluated using the following metrics:
- **Accuracy:** Measures the proportion of correctly classified instances.
- **Recall:** Calculates the proportion of actual positive instances correctly predicted.
- **ROC AUC Score:** Represents the area under the ROC curve, indicating the classifier's ability to discriminate between positive and negative classes.
- **F1 Score:** Harmonic mean of precision and recall, providing a balance between them.
- **Precision:** Measures the proportion of correctly predicted positive instances among all predicted positive instances.

Results are stored for each classifier for further analysis and comparison.
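
The metrics listed above can be illustrated on a handful of hand-made predictions. The labels and scores below are invented for illustration. The sketch also shows that ROC AUC generally differs depending on whether it receives continuous scores (for example from `predict_proba`) or hard 0/1 predictions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented ground truth and classifier scores for eight samples.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7])

# Hard predictions with a 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC AUC from continuous scores uses the full ranking of the samples;
# from hard labels it collapses to a single operating point.
roc_from_scores = roc_auc_score(y_true, y_score)
roc_from_labels = roc_auc_score(y_true, y_pred)

print(accuracy, recall, precision, f1, roc_from_scores, roc_from_labels)
```

Because the score-based value reflects the ranking at every threshold, it is usually the more informative of the two when the classifier exposes probabilities.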


In [13]:
# Define a list of classifiers to try out
classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu', solver='adam', max_iter=500, random_state=42),

}


# Train and evaluate each classifier
results = {}
for name, clf in classifiers.items():
    # Train classifier
    clf.fit(X_train_normalized, y_train)

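    # Make predictions and evaluate
    y_pred = clf.predict(X_test_normalized)
    # ROC AUC is computed from the predicted probability of the positive class, not from hard labels
    y_score = clf.predict_proba(X_test_normalized)[:, 1]
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_score)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)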

    # Store results
    results[name] = {"f1": f1, "accuracy": accuracy, "recall": recall, "precision": precision,"roc": roc}


## Finding the Best Classifier based on F1 Score 🔍

```python
best_classifier = max(results, key=lambda x: results[x]["f1"])
print("Best Classifier based on F1 Score:", best_classifier)

best_clf = classifiers[best_classifier]
```


In [14]:
# Find the best classifier based on F1 score
best_classifier = max(results, key=lambda x: results[x]["f1"])
print("Best Classifier based on f1:", best_classifier)

# Retrieve the best classifier (already fitted in the loop above)
best_clf = classifiers[best_classifier]

Best Classifier based on f1: Random Forest


## Saving the Trained Model, Scaler and Imputer 📦

In [None]:
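# Save the trained model, scaler, and imputer to files inside the dataset folder.
# os.path.join adds the path separator, so this works whether or not `path` ends with one.
joblib.dump(best_clf, os.path.join(path, 'model.pkl'))
joblib.dump(scaler, os.path.join(path, 'scaler.pkl'))
joblib.dump(imputer, os.path.join(path, 'imputer.pkl'))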
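
Once saved, the three objects have to be applied to new data in the same order as during training: imputer, then scaler, then classifier. The sketch below shows that round trip. It is a minimal example under stated assumptions: the data is synthetic, the model is a small random forest, and a temporary directory stands in for the Camel folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Synthetic training data with one missing value (invented for illustration).
rng = np.random.default_rng(0)
X = rng.random((40, 3))
X[0, 1] = np.nan
y = X[:, 0] > 0.5

# Fit the same three-step pipeline as in the notebook.
imputer = SimpleImputer(strategy="mean").fit(X)
scaler = MinMaxScaler().fit(imputer.transform(X))
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(scaler.transform(imputer.transform(X)), y)

new_rows = np.array([[0.9, np.nan, 0.2], [0.1, 0.5, 0.5]])

# A temporary directory stands in for the dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    for name, obj in [("model", clf), ("scaler", scaler), ("imputer", imputer)]:
        joblib.dump(obj, os.path.join(tmp, f"{name}.pkl"))

    loaded_clf = joblib.load(os.path.join(tmp, "model.pkl"))
    loaded_scaler = joblib.load(os.path.join(tmp, "scaler.pkl"))
    loaded_imputer = joblib.load(os.path.join(tmp, "imputer.pkl"))

# Apply the steps in training order: impute, scale, predict.
pred = loaded_clf.predict(
    loaded_scaler.transform(loaded_imputer.transform(new_rows))
)
print(pred)
```

Saving the imputer and scaler alongside the model matters because the classifier was trained on imputed, normalized features; feeding it raw features would silently give unreliable predictions.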