# Content
- Pre-processing Test Set
    - Imputation of missing values
    - Categorical Columns Encoding
    - Target Encoding
    - Scaling
    - Drop non-significant features
- Predict
- Conclusion
- Model Summary Breakdown

# Pre-processing Test Set

## Imputation of missing values

In [54]:
import pandas as pd
# Load your test data
test_split = pd.read_csv('test_split.csv')

In [55]:
# Load imputation statistics
imputation_stats = pd.read_csv('imputation_stats.csv')
imputation_dict = imputation_stats.set_index('Column')['Value'].to_dict()

# Redundant ID column removal
if 'ID' in test_split.columns:
    test_split.drop(columns=['ID'], inplace=True)

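# Apply imputation statistics to test data
for column, value in imputation_dict.items():
    if test_split[column].dtype != 'object':
        value = float(value)  # numeric statistics may be read from the CSV as strings
    # Assign the result back; fillna(inplace=True) on a column selection may not update test_split
    test_split[column] = test_split[column].fillna(value)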

# Confirming imputation
print(test_split.isnull().sum())

Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
Segmentation       0
dtype: int64
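
The loop above assumes that every column listed in `imputation_stats.csv` also exists in the test set. As a minimal sketch of the same idea, the hypothetical helper `apply_imputation` below skips absent columns and works on a copy. The toy values are made up for illustration and are not taken from the real data.

```python
import pandas as pd


def apply_imputation(df, stats):
    """Fill missing values using precomputed per-column statistics.

    `stats` maps column name -> fill value (values may arrive as strings
    when read from CSV). Columns absent from `df` are skipped.
    """
    out = df.copy()
    for column, value in stats.items():
        if column not in out.columns:
            continue  # e.g. an ID column that was dropped earlier
        if pd.api.types.is_numeric_dtype(out[column]):
            value = float(value)  # CSV-loaded statistics may be strings
        out[column] = out[column].fillna(value)
    return out


# Toy data for illustration only; not the real test split.
toy = pd.DataFrame({
    "Age": [25.0, None, 40.0],
    "Profession": ["Artist", None, "Doctor"],
})
toy_stats = {"Age": "38.0", "Profession": "Artist", "ID": "0"}

filled = apply_imputation(toy, toy_stats)
print(filled.isnull().sum())
```

Returning a new DataFrame leaves the input untouched, which makes it easier to compare the data before and after imputation.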


## Categorical Columns Encoding

In [57]:
# One-hot encoding for categorical variables
one_hot_columns = ['Gender', 'Profession', 'Ever_Married', 'Graduated', 'Var_1']
test_split = pd.get_dummies(test_split, columns=one_hot_columns, drop_first=True)

# Ordinal encoding for Spending_Score
spending_score_mapping = {'Average': 2, 'High': 1, 'Low': 3}
test_split['Spending_Score'] = test_split['Spending_Score'].map(spending_score_mapping)

# Convert boolean columns to 0 and 1 (astype replaces the deprecated DataFrame.applymap)
boolean_columns = [col for col in test_split.columns if test_split[col].dtype == 'bool']
test_split[boolean_columns] = test_split[boolean_columns].astype(int)
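
`pd.get_dummies` only creates columns for the categories present in the data it is given, so a test split that lacks a category can end up with fewer columns than the training set. The sketch below shows one way to realign the columns. It assumes the training feature names were saved somewhere; `train_columns` and the toy data are hypothetical and do not come from this notebook.

```python
import pandas as pd

# Hypothetical list of feature columns saved at training time.
train_columns = ["Age", "Gender_Male", "Profession_Doctor", "Profession_Engineer"]

# Toy test data in which the 'Engineer' category never occurs.
toy_test = pd.DataFrame({
    "Age": [30, 45],
    "Gender": ["Male", "Female"],
    "Profession": ["Artist", "Doctor"],
})

encoded = pd.get_dummies(toy_test, columns=["Gender", "Profession"], drop_first=True)

# Add any missing training columns (filled with 0), drop extras, and fix the order.
aligned = encoded.reindex(columns=train_columns, fill_value=0).astype(int)
print(aligned)
```

Reindexing against the training columns also fixes the column order, which matters because the model receives features by position.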


## Target Encoding

In [59]:
# Apply one-hot encoding to the target column for multi-class classification
test_split = pd.get_dummies(test_split, columns=['Segmentation'], drop_first=False)

# Convert boolean columns to 0 and 1 (astype replaces the deprecated DataFrame.applymap)
boolean_columns = [col for col in test_split.columns if test_split[col].dtype == 'bool']
test_split[boolean_columns] = test_split[boolean_columns].astype(int)

# Verify the one-hot encoded target variable columns
print(test_split[['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D']].head())


   Segmentation_A  Segmentation_B  Segmentation_C  Segmentation_D
0               0               0               1               0
1               0               1               0               0
2               1               0               0               0
3               0               0               0               1
4               1               0               0               0
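
The verification above works because all four segments occur in the test split. If a class were absent, `pd.get_dummies` would omit its column. The sketch below is one possible safeguard on toy data (not the real split): declaring the categories up front keeps a column for every class.

```python
import pandas as pd

segments = ["A", "B", "C", "D"]

# Toy target column in which segment 'C' happens to be absent.
toy = pd.DataFrame({"Segmentation": ["A", "D", "B", "A"]})

# Declaring the categories up front keeps a column for every class.
toy["Segmentation"] = pd.Categorical(toy["Segmentation"], categories=segments)
one_hot = pd.get_dummies(toy, columns=["Segmentation"], drop_first=False).astype(int)
print(one_hot)
```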



## Scaling

In [61]:
import pandas as pd
import pickle
from sklearn.preprocessing import StandardScaler

# Numerical features to be scaled
numerical_features = ['Age', 'Work_Experience', 'Family_Size', 'Spending_Score']

# Load the scaler using pickle
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# Scale the numerical features in the test set
test_split[numerical_features] = scaler.transform(test_split[numerical_features])
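
Calling `transform` on the loaded scaler reuses the mean and scale learned from the training data; nothing is re-estimated from the test set. The sketch below reproduces that arithmetic with NumPy. The statistics are made up and stand in for the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np

# Made-up training statistics, standing in for scaler.mean_ and scaler.scale_.
train_mean = np.array([40.0, 3.0])
train_scale = np.array([10.0, 2.0])

# Toy test rows: columns are Age and Work_Experience.
toy_test = np.array([[50.0, 1.0],
                     [30.0, 5.0]])

# Standardize with the training statistics; nothing is re-estimated from test data.
scaled = (toy_test - train_mean) / train_scale
print(scaled)
```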


## Drop non-significant features (non-significant for all 4 target classes)

In [63]:
# Non-significant features for all target classes
non_significant_features = ['Var_1_Cat_5', 'Var_1_Cat_7']

# Drop non-significant features from the DataFrame
test_split.drop(non_significant_features, axis=1, inplace=True)


# Predict

In [65]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from scikeras.wrappers import KerasClassifier
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import tensorflow as tf

# Separate features and target columns for test data
X_test = test_split.drop(['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D'], axis=1)
y_test = test_split[['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D']]

# Convert the one-hot encoded target DataFrame to a NumPy array
y_test = np.array(y_test)

# Load the pre-trained model
model = keras.models.load_model('pretrained_ann_classification_model.keras')
print('Model loaded')

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', test_loss)
print('Test accuracy:', test_accuracy)

# Predict on the test set
y_test_pred = model.predict(X_test)

# Convert the predicted probabilities to class labels for the test set
y_test_pred_labels = np.argmax(y_test_pred, axis=1)
y_test_labels = np.argmax(y_test, axis=1)

# Print the classification report for the test set
print("Test Set Classification Report")
print(classification_report(y_test_labels, y_test_pred_labels, target_names=['Segmentation_A', 'Segmentation_B', 'Segmentation_C', 'Segmentation_D']))

Model loaded
Test loss: 1.1334367990493774
Test accuracy: 0.5022860765457153
48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Test Set Classification Report
                precision    recall  f1-score   support

Segmentation_A       0.44      0.48      0.46       381
Segmentation_B       0.37      0.26      0.31       371
Segmentation_C       0.54      0.51      0.52       370
Segmentation_D       0.60      0.74      0.66       409

      accuracy                           0.50      1531
     macro avg       0.49      0.50      0.49      1531
  weighted avg       0.49      0.50      0.49      1531
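
The per-class recall in the report can be traced back to a confusion matrix. The sketch below builds one with NumPy from toy labels. In the notebook, `y_test_labels` and `y_test_pred_labels` would take the place of `y_true` and `y_pred`.

```python
import numpy as np

# Toy labels for illustration; in the notebook these would be
# y_test_labels and y_test_pred_labels.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 0, 2, 2, 3, 0])

n_classes = 4
cm = np.zeros((n_classes, n_classes), dtype=int)
# Rows are true classes, columns are predicted classes.
np.add.at(cm, (y_true, y_pred), 1)

# Recall per class: correct predictions divided by the number of true samples.
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(recall)
```

Reading along a row shows which classes a given segment is most often mistaken for, which the classification report alone does not reveal.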



# Conclusion

The model performs consistently across the training, validation, and unseen test sets, which suggests it generalizes rather than overfits, although accuracy stays between roughly 50% and 53% on this four-class problem. The key metrics are summarized in the table below:

**Performance Metrics**

| Dataset           | Loss    | Accuracy | Precision (Weighted Avg) | Recall (Weighted Avg) | F1-Score (Weighted Avg) |
|-------------------|---------|----------|--------------------------|-----------------------|-------------------------|
| **Training**      | 1.1058  | 0.5335   | 0.52                     | 0.53                  | 0.52                    |
| **Validation**    | 1.1231  | 0.5261   | 0.52                     | 0.53                  | 0.52                    |
| **Unseen Test**   | 1.1334  | 0.5023   | 0.49                     | 0.50                  | 0.49                    |

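Overall, performance is similar across all three datasets, with accuracy about 3 percentage points lower on the unseen test set than on the training set (0.5023 vs. 0.5335). Per-class results are uneven, however: on the test set, Segmentation_D reaches a recall of 0.74 and an F1-score of 0.66, while Segmentation_B reaches only 0.26 and 0.31. There is clear room for improvement, especially in the recall and F1-scores for Segmentation_B and Segmentation_A.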
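
To put the test accuracy in context, it can be compared with a classifier that always predicts the most frequent class. The sketch below uses only the support and recall values from the test-set classification report; the variable names are illustrative.

```python
# Per-class support and recall copied from the test-set classification report.
support = {"A": 381, "B": 371, "C": 370, "D": 409}
recall = {"A": 0.48, "B": 0.26, "C": 0.51, "D": 0.74}

total = sum(support.values())
# Accuracy of always predicting the most frequent class.
majority_baseline = max(support.values()) / total
# Unweighted mean of per-class recall (the report's macro-average recall).
macro_recall = sum(recall.values()) / len(recall)

print(f"samples: {total}")
print(f"majority-class baseline: {majority_baseline:.3f}")
print(f"macro recall: {macro_recall:.2f}")
```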


# Model Summary Breakdown

In [69]:
model.summary()

##### Model Name
- **Model Name**: `"sequential_12"` 
  - This indicates it is a sequential model, where layers are stacked one after another in a linear fashion.

##### Layers and Parameters

###### Layer 1: Dense (`dense_36`)
- **Type**: `Dense`
- **Output Shape**: `(None, 64)`
  - `None` indicates the batch size, which is not fixed and can vary.
  - `64` is the number of neurons in this layer.
- **Param #**: `1,344`
  - Parameters in a dense layer are calculated as:
    - `(input_dim + 1) * output_dim`
    - Here, `input_dim` is the number of input features. With `input_dim = 20`, the value consistent with the reported parameter count, the calculation is:
    - `(20 + 1) * 64 = 21 * 64 = 1,344`

###### Layer 2: Dropout (`dropout_24`)
- **Type**: `Dropout`
- **Output Shape**: `(None, 64)`
  - Same as the previous layer because dropout does not change the shape of the data.
- **Param #**: `0`
  - Dropout does not have parameters.

###### Layer 3: Dense (`dense_37`)
- **Type**: `Dense`
- **Output Shape**: `(None, 64)`
  - `64` neurons, same as the previous dense layer.
- **Param #**: `4,160`
  - Calculation is similar to the first dense layer but now with `64` inputs from the previous layer:
  - `(64 + 1) * 64 = 65 * 64 = 4,160`

###### Layer 4: Dropout (`dropout_25`)
- **Type**: `Dropout`
- **Output Shape**: `(None, 64)`
  - Same as the previous layer because dropout does not change the shape of the data.
- **Param #**: `0`
  - Dropout does not have parameters.

###### Layer 5: Dense (`dense_38`)
- **Type**: `Dense`
- **Output Shape**: `(None, 4)`
  - `4` neurons, one for each class in the multiclass classification.
- **Param #**: `260`
  - Calculation is:
  - `(64 + 1) * 4 = 65 * 4 = 260`
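
The three dense-layer counts above can be checked with a few lines of Python. This is a small sketch: `dense_params` is a hypothetical helper, and `input_dim = 20` is inferred from the reported 1,344 parameters rather than read from the model.

```python
def dense_params(input_dim, output_dim):
    """Weights (input_dim * output_dim) plus one bias per output unit."""
    return (input_dim + 1) * output_dim


# (input_dim, output_dim) per dense layer; input_dim = 20 is inferred
# from the reported 1,344 parameters, not read from the model.
layers = [(20, 64), (64, 64), (64, 4)]
counts = [dense_params(i, o) for i, o in layers]
print(counts, sum(counts))
```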

##### Summary of Parameters
- **Total params**: `17,294 (67.56 KB)`
  - In this summary, the total counts the model's own parameters together with the optimizer's state variables: `5,764 + 11,530 = 17,294`.
  - The model's layers alone contribute `1,344 + 4,160 + 260 = 5,764` parameters, which matches the trainable count below.
- **Trainable params**: `5,764 (22.52 KB)`
  - These are parameters that will be updated during training.
- **Non-trainable params**: `0 (0.00 B)`
  - There are no non-trainable parameters in this model.

##### Optimizer Params
- **Optimizer params**: `11,530 (45.04 KB)`
  - These are state variables kept by the Adam optimizer rather than model weights. The count is consistent with two moment estimates per trainable parameter plus two scalar state variables: `2 * 5,764 + 2 = 11,530`. The exact number depends on the optimizer and its configuration.
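
The relationship between the reported figures can be checked with simple arithmetic. The sketch below assumes that Adam keeps two moment estimates per trainable parameter plus two scalar state variables. That assumption reproduces the numbers in the summary, but it is not read from the optimizer itself.

```python
trainable = 1_344 + 4_160 + 260

# Assumption: Adam stores two moment estimates per trainable parameter,
# plus two scalar state variables (step counter and learning rate).
optimizer_state = 2 * trainable + 2

print(trainable, optimizer_state, trainable + optimizer_state)
```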

##### Key Points
- **Output Shape `(None, x)`**: `None` signifies a variable batch size, while `x` is the dimension of the output of the layer.
- **Parameter Calculation**: For dense layers, parameters include weights and biases, calculated as `(input_dim + 1) * output_dim`.
- **Dropout Layers**: Do not have parameters but help prevent overfitting by randomly setting a fraction of input units to 0 at each update during training.
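
The dropout behaviour described above can be illustrated with NumPy. This is a simplified sketch of inverted dropout, not the Keras implementation, and the rate of `0.5` is illustrative because the model's actual dropout rate is not shown here.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return x  # inference: the layer is an identity
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((2, 64))

train_out = dropout(activations, rate=0.5, rng=rng)
infer_out = dropout(activations, rate=0.5, rng=rng, training=False)
print(train_out.shape, infer_out.shape)
```

The output shape matches the input shape in both modes, and the function has no learned weights, in line with the `0` parameter count reported for the dropout layers.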

