# Content
- ANN Model Implementation for Classification
    - Conclusion from the Results
        - Overview of Metrics
        - Training Set Classification Report
        - Test Set Classification Report
    - Key Insights
    - Recommendations
- Random Forest with the OvR strategy

## ANN Model Implementation for Classification

This section trains an Artificial Neural Network (ANN) with TensorFlow and Keras to predict the four one-hot encoded target columns (Segmentation_A, Segmentation_B, Segmentation_C, Segmentation_D), treating the problem as multi-class classification.

In [71]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Set seeds for reproducibility
import tensorflow as tf
import random
seed_value = 42
tf.random.set_seed(seed_value)
np.random.seed(seed_value)
random.seed(seed_value)

# Load the dataset
df = pd.read_csv('train_split_dimentionality_reducted.csv')

# Separate features and target columns
X = df.drop(['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D'], axis=1)
y = df[['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D']]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the ANN model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(64, activation='relu'),
    layers.Dense(4, activation='softmax')  # 4 output nodes for the 4 target classes
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))


Epoch 1/50
153/153 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.3924 - loss: 1.2842 - val_accuracy: 0.4730 - val_loss: 1.1736
Epoch 2/50
153/153 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.4788 - loss: 1.1628 - val_accuracy: 0.4902 - val_loss: 1.1359
Epoch 3/50
153/153 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.4972 - loss: 1.1239 - val_accuracy: 0.5041 - val_loss: 1.1181
Epoch 4/50
153/153 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.5146 - loss: 1.1016 - val_accuracy: 0.5090 - val_loss: 1.1085
Epoch 5/50
153/153 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.5219 - loss: 1.0864 - val_accuracy: 0.5114 - val_loss: 1.1037
Epoch 6/50
153/153 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.5307 - loss: 1.0743 - val_accuracy: 0.5123 - val_loss: 1.1005
Epoch 7/50
153/153 ━━━━━━━

<keras.src.callbacks.history.History at 0x1db751101d0>
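
The `History` object returned by `model.fit` holds the per-epoch metrics printed in the training log. A minimal sketch of how they could be inspected, assuming the return value is captured as `history = model.fit(...)` (the cell above discards it). `summarize_history` is a hypothetical helper, and the numbers are the first six epochs copied from the log:

```python
def summarize_history(history_dict):
    """Return the best validation epoch and the train/validation loss gap there.

    `history_dict` is expected to look like Keras' `history.history`:
    a dict mapping metric names to per-epoch lists.
    """
    val_loss = history_dict["val_loss"]
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_idx + 1,  # epochs are 1-indexed in the log
        "best_val_loss": val_loss[best_idx],
        "loss_gap": val_loss[best_idx] - history_dict["loss"][best_idx],
    }


# First six epochs, copied from the training log.
logged = {
    "loss": [1.2842, 1.1628, 1.1239, 1.1016, 1.0864, 1.0743],
    "val_loss": [1.1736, 1.1359, 1.1181, 1.1085, 1.1037, 1.1005],
}
summary = summarize_history(logged)
print(summary)
```

If the validation loss stops falling while the training loss keeps dropping in later epochs, that is the usual sign of overfitting.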

In [72]:
# Evaluate the model on the training set
train_loss, train_accuracy = model.evaluate(X_train, y_train)
print('\nTrain loss:', train_loss)
print('\nTrain accuracy:', train_accuracy)

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print('\nTest loss:', test_loss)
print('\nTest accuracy:', test_accuracy)

# Predict the classes for training data
y_train_pred = model.predict(X_train)
y_train_pred_labels = np.argmax(y_train_pred, axis=1)
y_train_labels = np.argmax(np.array(y_train), axis=1)

# Predict the classes for test data
y_test_pred = model.predict(X_test)
y_test_pred_labels = np.argmax(y_test_pred, axis=1)
y_test_labels = np.argmax(np.array(y_test), axis=1)

# Class names, in the same column order as the one-hot targets
target_names = list(y.columns)

# Print the classification report for the training set
print("\nTraining Set Classification Report")
print(classification_report(y_train_labels, y_train_pred_labels, target_names=target_names))

# Print the classification report for the test set
print("\nTest Set Classification Report")
print(classification_report(y_test_labels, y_test_pred_labels, target_names=target_names))


153/153 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6050 - loss: 0.9096

Train loss: 0.8917937278747559

Train accuracy: 0.616830050945282
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4813 - loss: 1.1781

Test loss: 1.1720653772354126

Test accuracy: 0.5008170008659363
153/153 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
39/39 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step

Training Set Classification Report
                precision    recall  f1-score   support

Segmentation_A       0.58      0.54      0.56      1230
Segmentation_B       0.55      0.47      0.51      1160
Segmentation_C       0.65      0.59      0.62      1159
Segmentation_D       0.66      0.84      0.74      1347

      accuracy                           0.62      4896
     macro avg       0.61      0.61      0.61      4896
  weighted avg       0.61      0.62      0.61      4896


Test Set Class

## Conclusion from the Results

### Overview of Metrics
- **Training Loss**: 0.8782
- **Training Accuracy**: 0.6242
- **Test Loss**: 1.1511
- **Test Accuracy**: 0.5172
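
For reference when reading the loss values, categorical cross-entropy over $N$ samples and $K = 4$ classes is commonly written as follows. This is the standard definition rather than something taken from the notebook; $y_{ik}$ denotes the one-hot target and $\hat{p}_{ik}$ the softmax output:

$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik}
$$

A model that always predicts the uniform distribution $\hat{p}_{ik} = 1/4$ would score $\mathcal{L} = \ln 4 \approx 1.386$, so a test loss of 1.1511 is only modestly better than that baseline.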

### Training Set Classification Report
- **Overall Accuracy**: 0.62
- **Precision, Recall, F1-Score**:
  - Segmentation_A: Precision (0.57), Recall (0.63), F1-Score (0.60)
  - Segmentation_B: Precision (0.58), Recall (0.40), F1-Score (0.47)
  - Segmentation_C: Precision (0.62), Recall (0.66), F1-Score (0.64)
  - Segmentation_D: Precision (0.70), Recall (0.78), F1-Score (0.74)

### Test Set Classification Report
- **Overall Accuracy**: 0.52
- **Precision, Recall, F1-Score**:
  - Segmentation_A: Precision (0.41), Recall (0.46), F1-Score (0.43)
  - Segmentation_B: Precision (0.45), Recall (0.33), F1-Score (0.38)
  - Segmentation_C: Precision (0.59), Recall (0.60), F1-Score (0.59)
  - Segmentation_D: Precision (0.60), Recall (0.66), F1-Score (0.63)
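
The F1-score is the harmonic mean of precision and recall, which is why the low recall of Segmentation_B pulls its F1-score below that of Segmentation_A despite slightly higher precision. A small sketch that recomputes F1 from the rounded test-set values listed above. `f1_from` is a hypothetical helper, and results can differ from the report in the last digit because the inputs are already rounded:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded test-set precision and recall from the list above.
test_report = {
    "Segmentation_A": (0.41, 0.46),
    "Segmentation_B": (0.45, 0.33),
    "Segmentation_C": (0.59, 0.60),
    "Segmentation_D": (0.60, 0.66),
}
f1_scores = {name: f1_from(p, r) for name, (p, r) in test_report.items()}
for name, score in f1_scores.items():
    print(f"{name}: {score:.2f}")
```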

## Key Insights

1. **Overfitting**: The training accuracy (0.6242) is higher than the test accuracy (0.5172), and the training loss (0.8782) is lower than the test loss (1.1511). This suggests that the model may be overfitting to the training data, capturing noise and patterns that do not generalize well to the test data.

2. **Per-Class Differences**:
    - **Segmentation_D** has the highest precision, recall, and F1-score in both the training and test sets, so the model predicts this class most reliably.
    - **Segmentation_B** has the lowest recall and F1-score in both sets, so the model misses many instances of this class.
    - The classes are roughly balanced (1,159 to 1,347 training samples each), so these differences are unlikely to come from class imbalance alone.

3. **Model Performance**:
    - **Training Performance**: The model reaches 62% accuracy on the training set, with per-class F1-scores ranging from 0.47 to 0.74.
    - **Test Performance**: Accuracy drops to 51.7% on the test set, with per-class F1-scores ranging from 0.38 to 0.63, so how well the model generalizes varies considerably between segments.
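
To see which segments are mistaken for one another, a confusion matrix can complement the per-class scores. A minimal sketch using scikit-learn's `confusion_matrix`. The label arrays here are small made-up examples; in the notebook they would be `y_test_labels` and `y_test_pred_labels` from the evaluation cell:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

target_names = ["Segmentation_A", "Segmentation_B", "Segmentation_C", "Segmentation_D"]

# Made-up integer labels for illustration only (0 = A, 1 = B, 2 = C, 3 = D).
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
y_pred = np.array([0, 1, 0, 1, 2, 2, 2, 3, 3, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])

# Zero the diagonal so only the errors remain, then report the most
# frequent wrong prediction for every true class that has any errors.
errors = cm.copy()
np.fill_diagonal(errors, 0)
most_confused = {}
for i, name in enumerate(target_names):
    if errors[i].sum() > 0:
        most_confused[name] = target_names[int(errors[i].argmax())]

print(cm)
print(most_confused)
```

Passing `cm` to `seaborn.heatmap` (already imported in the notebook) would give a visual version of the same table.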

## Recommendations

1. **Address Overfitting**:
   - **Regularization**: Apply dropout or L1/L2 weight regularization to reduce overfitting.
   - **Early Stopping**: Use early stopping during training to halt training once the model's performance on a validation set stops improving.

2. **Hyperparameter Tuning**:
   - Tune hyperparameters such as layer width, learning rate, and batch size with GridSearchCV or RandomizedSearchCV. Both expect a scikit-learn compatible estimator, so a Keras model needs a wrapper such as SciKeras.

3. **Model Complexity**:
   - Training accuracy is itself only about 62%, so the network may be underfitting as well as overfitting. Try wider or deeper layers and compare the validation loss.
   - If the gap between training and validation loss widens instead, simplify the architecture.

4. **Cross-Validation**:
   - Implement cross-validation to ensure that the model's performance is consistent across different subsets of the data.
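
The regularization and early-stopping suggestions above could look roughly like this in Keras. This is a sketch under assumptions, not the notebook's model: the dropout rate (0.3), L2 factor (1e-4), and patience (5) are illustrative values, the data is random stand-in data with the same shape as the notebook's (20 features, 4 one-hot targets), and validation data is carved out of the training set with `validation_split` so the test set stays untouched:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

keras.utils.set_random_seed(42)

# Random stand-in data: 20 features and 4 one-hot encoded classes.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 20)).astype("float32")
y_demo = keras.utils.to_categorical(rng.integers(0, 4, size=400), num_classes=4)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Stop when validation loss has not improved for 5 epochs and
# roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)

history = model.fit(
    X_demo, y_demo,
    epochs=50, batch_size=32,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=0,
)
epochs_run = len(history.history["loss"])
print(f"Stopped after {epochs_run} epochs")
```

Monitoring a validation split rather than the test set keeps the final test score an unbiased estimate of generalization.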

By addressing these areas, you can improve the model's performance and its ability to generalize to unseen data.
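
Cross-validation and hyperparameter search can be combined in one step. The sketch below uses scikit-learn's `MLPClassifier` as a lightweight stand-in for the Keras network (a substitution made here for brevity, not the notebook's model), with `RandomizedSearchCV` over stratified folds. The data is synthetic and the parameter ranges are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic 4-class data with 20 features, standing in for the real dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    n_classes=4, random_state=42,
)

# With one-hot targets like the notebook's, integer labels for
# stratification could be obtained via np.argmax(y.values, axis=1).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    estimator=MLPClassifier(max_iter=300, early_stopping=True, random_state=42),
    param_distributions={
        "hidden_layer_sizes": [(32,), (64,), (64, 64)],
        "alpha": [1e-5, 1e-4, 1e-3, 1e-2],  # L2 penalty strength
        "learning_rate_init": [1e-3, 3e-3, 1e-2],
    },
    n_iter=5,
    scoring="accuracy",
    cv=cv,
    random_state=42,
)
search.fit(X_demo, y_demo)

# Accuracy of the best candidate on each of the five folds.
fold_scores = [
    search.cv_results_[f"split{i}_test_score"][search.best_index_]
    for i in range(cv.get_n_splits())
]
print("Best parameters:", search.best_params_)
print("Per-fold accuracy of best candidate:", np.round(fold_scores, 3))
```

A large spread between the per-fold scores would suggest that a single 80/20 split gives an unreliable picture of performance.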


## Random Forest with the OvR strategy 

In [79]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, multilabel_confusion_matrix, accuracy_score, log_loss, roc_auc_score
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set the random seed for reproducibility
np.random.seed(42)

# Load the dataset
df = pd.read_csv('train_split_dimentionality_reducted.csv')

# Separate features and target columns
X = df.drop(['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D'], axis=1)
y = df[['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D']]

# Debug: Check shapes of X and y
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Debug: Check shapes of train and test sets
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Build and train the RandomForest model with OvR strategy
model = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = model.predict(X_train)

# Make predictions on the test set
y_test_pred = model.predict(X_test)

# Calculate train and test accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Calculate train and test log loss
train_loss = log_loss(y_train, model.predict_proba(X_train))
test_loss = log_loss(y_test, model.predict_proba(X_test))



Features shape: (6120, 20)
Target shape: (6120, 4)
X_train shape: (4896, 20), y_train shape: (4896, 4)
X_test shape: (1224, 20), y_test shape: (1224, 4)


In [80]:
# Print train and test accuracy and loss
print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")
print(f"Train Loss: {train_loss}")
print(f"Test Loss: {test_loss}")

# Print the classification report for the training set
print("Training Set Classification Report")
print(classification_report(y_train, y_train_pred, target_names=['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D']))

# Print the classification report for the test set
print("Test Set Classification Report")
print(classification_report(y_test, y_test_pred, target_names=['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D']))


Train Accuracy: 0.9483251633986928
Test Accuracy: 0.34477124183006536
Train Loss: 0.28930229061852214
Test Loss: 1.6616638840644908
Training Set Classification Report
                precision    recall  f1-score   support

Segmentation_A       0.98      0.95      0.97      1230
Segmentation_B       0.98      0.92      0.95      1160
Segmentation_C       0.97      0.93      0.95      1159
Segmentation_D       0.99      0.98      0.98      1347

     micro avg       0.98      0.95      0.96      4896
     macro avg       0.98      0.95      0.96      4896
  weighted avg       0.98      0.95      0.96      4896
   samples avg       0.95      0.95      0.95      4896

Test Set Classification Report
                precision    recall  f1-score   support

Segmentation_A       0.42      0.26      0.32       309
Segmentation_B       0.37      0.20      0.26       283
Segmentation_C       0.53      0.37      0.43       292
Segmentation_D       0.67      0.55      0.60       340

     micro av



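The very high training accuracy (0.948) against a low test accuracy (0.345) indicates overfitting: the model fits the training data closely but performs poorly on unseen test data. The test accuracy is also not directly comparable to the ANN's. Because `y` is passed as a four-column one-hot matrix, `OneVsRestClassifier` treats the task as multilabel, and `accuracy_score` then counts a row as correct only when all four columns match. Rows for which none of the binary forests predicts a positive count as errors, which is consistent with test recall being lower than precision for every class.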
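
Because the targets are one-hot columns, one way to get a single predicted segment per row is to collapse them to integer labels before fitting. A minimal sketch on synthetic data, assuming exactly one active column per row as in the notebook. The `max_depth` and `min_samples_leaf` values are illustrative settings for limiting tree growth, not tuned results:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

names = ["Segmentation_A", "Segmentation_B", "Segmentation_C", "Segmentation_D"]

# Synthetic stand-in: 20 features, 4 classes, targets stored one-hot.
X_demo, labels = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=4, random_state=42,
)
y_onehot = pd.DataFrame(np.eye(4, dtype=int)[labels], columns=names)

# Collapse the one-hot columns to one integer label per row.
y_single = np.argmax(y_onehot.to_numpy(), axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_single, test_size=0.2, random_state=42, stratify=y_single
)

# With 1-D labels, OvR picks the class whose binary forest is most confident,
# so every row receives exactly one prediction.
model = OneVsRestClassifier(
    RandomForestClassifier(
        n_estimators=100, max_depth=8, min_samples_leaf=5, random_state=42
    )
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With integer labels, `accuracy_score` is ordinary multi-class accuracy and can be compared directly with the ANN's figure.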