# Content
- Get Data
- Split Data
- Preprocessing
    - Clean the Data
    - Feature Engineering
- Dimensionality Reduction
- Model Implementation
- Fine-tuning
    - Address Overfitting (Regularization, Early Stopping)
    - Address Underfitting (RandomizedSearchCV with Keras)
    - Further Fine-tuning
- Predict on unseen Data

# 1. Get Data

The dataset for this project comes from the [Customer Segmentation Classification](https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation) task on Kaggle.
In this project, we will implement an ANN to demonstrate its capabilities.

**Context**
An automobile company has plans to enter new markets with their existing products (P1, P2, P3, P4, and P5). After intensive market research, they’ve deduced that the behavior of the new market is similar to their existing market.

In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). Then, they performed segmented outreach and communication for a different segment of customers. This strategy has worked exceptionally well for them. They plan to use the same strategy for the new markets and have identified 2627 new potential customers.

You are required to help the manager to predict the right group of the new customers.

- **train.csv**: Contains the following columns

| Variable         | Definition                                                       |
|------------------|------------------------------------------------------------------|
| **ID**           | Unique ID                                                        |
| **Gender**       | Gender of the customer                                           |
| **Ever_Married** | Marital status of the customer                                   |
| **Age**          | Age of the customer                                              |
| **Graduated**    | Is the customer a graduate?                                      |
| **Profession**   | Profession of the customer                                       |
| **Work_Experience** | Work Experience in years                                       |
| **Spending_Score**  | Spending score of the customer                                  |
| **Family_Size**     | Number of family members for the customer (including the customer) |
| **Var_1**           | Anonymised Category for the customer                            |
| **Segmentation**    | (target) Customer Segment of the customer                       |

For more details, refer to the [Customer Segmentation Classification](https://www.kaggle.com/datasets/kaushiksuresh147/customer-segmentation).

- **Initial EDA**
    - The dataset contains **8,068 entries** and **11 columns**.
    - The 'Segmentation' column shows a fairly **balanced distribution**.
    - Some features are imbalanced.
    - There are duplicates; they were dropped at the beginning.
    - There are missing values.
    - There are categorical features.
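
The checks above can be sketched with pandas. This is a minimal illustration on a tiny made-up frame, not the project's notebook code; the values in `df` are invented and only the column names come from the table above.

```python
import pandas as pd

# Invented stand-in rows; only the column names match train.csv.
df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 4],
    "Profession": ["Artist", "Doctor", "Doctor", None, "Artist"],
    "Segmentation": ["A", "B", "B", "D", "D"],
})

# Count fully duplicated rows, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Missing values per column and the share of each target class.
missing_per_column = df.isna().sum()
class_shares = df["Segmentation"].value_counts(normalize=True)

print(n_duplicates)
print(missing_per_column)
print(class_shares)
```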

# 2. Split Data
After removing duplicates, 7951 rows remained.
We used an 80/20 split, a common choice for a dataset of this size: it leaves most of the data for training the ANN while keeping a test set large enough for a meaningful evaluation.

- The `train_split` dataset contains **6,120 entries** and **11 columns**.
- The `test_split` dataset contains **1,531 entries** and **11 columns**.

We applied a plain random split without stratification. After the split, the distributions of the imbalanced features in `train_split` and `test_split` were compared.
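
To make the difference between a plain and a stratified split concrete, here is a small sketch using scikit-learn's `train_test_split`. The data is synthetic and the `random_state` is an arbitrary choice; the project itself used the plain variant.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in data: 1,000 rows and 4 segments with unequal frequencies.
X = rng.normal(size=(1000, 3))
y = rng.choice(["A", "B", "C", "D"], size=1000, p=[0.25, 0.23, 0.24, 0.28])

# Plain random 80/20 split (no stratification).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified variant: class proportions are preserved in both parts.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

def class_shares(labels):
    """Return the share of each class as a dict."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

print(len(X_train), len(X_test))
print(class_shares(y))
print(class_shares(ys_test))
```

With a fairly balanced target, as here, a plain split usually lands close to the original proportions, which is why comparing distributions after the split is a reasonable check.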

# 3. Preprocessing
## 3.1 Clean the Data
This stage includes handling missing values, removing duplicates, and correcting errors.

- Imputation of missing values:
    - **mode** imputation (Ever_Married, Graduated, Var_1)
    - imputation with a **constant** value, *Unknown* (Profession)
    - **median** imputation (Work_Experience, Family_Size)

- Duplicate removal
    - Duplicate rows had already been dropped in the previous step
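
The imputation rules listed above could look roughly like this in pandas. The helper `fit_imputation_values` is a hypothetical name introduced for illustration, and the rows are invented; the idea of learning the fill values from the training data only is a common practice assumed here, not something the project states.

```python
import numpy as np
import pandas as pd

# Tiny invented frame using the column names from train.csv.
df = pd.DataFrame({
    "Ever_Married": ["Yes", None, "Yes", "No"],
    "Graduated": ["No", "No", None, "Yes"],
    "Var_1": ["Cat_6", "Cat_6", None, "Cat_4"],
    "Profession": ["Artist", None, "Doctor", "Artist"],
    "Work_Experience": [1.0, np.nan, 3.0, 8.0],
    "Family_Size": [2.0, 4.0, np.nan, 3.0],
})

def fit_imputation_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column from the training data."""
    fill = {c: train[c].mode()[0] for c in ["Ever_Married", "Graduated", "Var_1"]}
    fill["Profession"] = "Unknown"  # constant placeholder category
    fill.update({c: train[c].median() for c in ["Work_Experience", "Family_Size"]})
    return fill

fill_values = fit_imputation_values(df)
df_imputed = df.fillna(value=fill_values)
print(df_imputed)
```

The same `fill_values` dictionary can then be reused on any held-out data, so that no statistics are computed from it.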

## 3.2 Feature Engineering
This stage prepares the structured data for machine learning models. The following transformations were applied:
- **Categorical features encoding**:
    - **One-hot encoding** was applied to Gender, Profession, Ever_Married, and Graduated, as they are nominal variables with distinct categories.
    - One-hot encoding was also applied to Var_1. We have no information on what each category in Var_1 represents, so treating the categories as nominal is a reasonable default.
    - **Ordinal encoding** was applied to Spending_Score: 'Average' = 2, 'High' = 1, 'Low' = 3.
- **Target Encoding**:
    - "Segmentation" represents multiple classes without an ordinal relationship, so one-hot encoding is preferred.
- **Scaling**:
    - Only the continuous numerical features were scaled. The one-hot encoded categorical features were left unchanged, since they are already in a form suitable for an ANN.
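
A compact sketch of these transformations follows. Two details are assumptions: `drop_first=True` is inferred from feature names such as `Gender_Male` that appear later in this document, and `StandardScaler` stands in for whichever scaler the project used. The rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Spending_Score": ["Low", "Average", "High", "Low"],
    "Age": [22, 38, 67, 40],
    "Segmentation": ["D", "A", "B", "C"],
})

# Ordinal mapping exactly as listed above.
spending_map = {"Average": 2, "High": 1, "Low": 3}
train["Spending_Score"] = train["Spending_Score"].map(spending_map)

# One-hot encode a nominal feature (drop_first is an assumption, see lead-in).
train = pd.get_dummies(train, columns=["Gender"], drop_first=True, dtype=int)

# One-hot encode the target, keeping all four classes.
y = pd.get_dummies(train.pop("Segmentation"), prefix="Segmentation", dtype=int)

# Scale only the continuous column; dummies and ordinal codes stay untouched.
scaler = StandardScaler()
train["Age"] = scaler.fit_transform(train[["Age"]]).ravel()

print(train)
print(y)
```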

# 4. Dimensionality Reduction
The following was done at this stage:
- Low Variance Analysis
    - No columns have a variance below the threshold of 0.01. This suggests that every feature varies enough to be potentially informative for the model.
    - The smallest threshold that flags at least one low-variance column is 0.02. The columns with the lowest variance are 'Profession_Unknown' and 'Var_1_Cat_5'.
- Feature Importance using F-Statistics
    - Selecting a single target column Segmentation_A for the F-statistics test
    - Selecting a single target column Segmentation_B for the F-statistics test
    - Selecting a single target column Segmentation_C for the F-statistics test
    - Selecting a single target column Segmentation_D for the F-statistics test
    - Finding common non-significant features for all 4 target classes
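
The low-variance check can be sketched with scikit-learn's `VarianceThreshold`. The helper `low_variance_columns` and the three columns below are hypothetical. The rare dummy is built so that 2% of its values are 1, giving a variance of 0.02 × 0.98 = 0.0196, which is kept at a threshold of 0.01 and flagged at 0.02.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Deterministic stand-in features (not the project's data).
X = pd.DataFrame({
    "Age_scaled": np.linspace(-1.7, 1.7, 500),
    "Gender_Male": np.tile([0, 1], 250),
    "rare_dummy": np.r_[np.ones(10), np.zeros(490)],
})

def low_variance_columns(frame: pd.DataFrame, threshold: float) -> list:
    """Return the columns whose variance does not exceed the threshold."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(frame)
    return list(frame.columns[~selector.get_support()])

print(low_variance_columns(X, 0.01))
print(low_variance_columns(X, 0.02))
```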

*Names of the features that are not statistically significant (p-value >= 0.05) from the F-test results*:
- for predicting **Segmentation_A**: 'Gender_Male', 'Profession_Homemaker', 'Profession_Unknown', 'Ever_Married_Yes', 'Graduated_Yes', 'Var_1_Cat_3', 'Var_1_Cat_5', 'Var_1_Cat_7'
- for predicting **Segmentation_B**: 'Gender_Male', 'Profession_Doctor', 'Profession_Entertainment', 'Profession_Homemaker', 'Profession_Lawyer', 'Var_1_Cat_2', 'Var_1_Cat_3', 'Var_1_Cat_4', 'Var_1_Cat_5', 'Var_1_Cat_6', 'Var_1_Cat_7'
- for predicting **Segmentation_C**: 'Gender_Male', 'Profession_Doctor', 'Profession_Lawyer', 'Profession_Unknown', 'Var_1_Cat_2', 'Var_1_Cat_5', 'Var_1_Cat_7'
- for predicting **Segmentation_D**: 'Profession_Doctor', 'Profession_Engineer', 'Var_1_Cat_2', 'Var_1_Cat_5', 'Var_1_Cat_7'

*Common non-significant features for all 4 targets:*
- 'Var_1_Cat_5', 'Var_1_Cat_7'

As a result, the features that are non-significant for all 4 target classes were dropped.
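
One way to reproduce this procedure is to run scikit-learn's `f_classif` once per one-hot target column and intersect the non-significant sets. The sketch below uses synthetic data, and the helper name `non_significant` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(1)
n = 2000
labels = rng.integers(0, 4, size=n)  # 0..3 stand for segments A..D

# Synthetic features: one depends on the segment, two are pure noise.
X = pd.DataFrame({
    "signal": labels + rng.normal(scale=1.0, size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
segments = pd.Series(labels).map(dict(enumerate("ABCD")))
Y = pd.get_dummies(segments, prefix="Segmentation", dtype=int)

def non_significant(features: pd.DataFrame, y_binary, alpha: float = 0.05) -> set:
    """Features whose F-test p-value is at or above alpha for one binary target."""
    _, p_values = f_classif(features, y_binary)
    return set(features.columns[p_values >= alpha])

# One F-test per one-hot target column, then the intersection across targets.
per_target = {col: non_significant(X, Y[col]) for col in Y.columns}
common = set.intersection(*per_target.values())
print(per_target)
print(common)
```

Dropping only the intersection is conservative: a feature is kept as long as it is significant for at least one segment.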

# 5. Model Implementation
The following models were created:
- ANN Model Implementation for Classification
- Random Forest with the OvR strategy

**ANN results**
1. **Overfitting**: The training accuracy (0.6242) is higher than the test accuracy (0.5172), and the training loss (0.8782) is lower than the test loss (1.1511). This suggests that the model may be overfitting to the training data, capturing noise and patterns that do not generalize well to the test data.

2. **Class Imbalance Impact**:
    - **Segmentation_D** has the highest precision, recall, and F1-score in both the training and test sets, suggesting that the model is better at predicting this class.
    - **Segmentation_B** has the lowest recall and F1-score in both sets, indicating that the model struggles to correctly identify instances of this class.

3. **Model Performance**:
    - **Training Performance**: The model has a decent performance on the training set with an accuracy of 62%. The F1-scores for the different classes range from 0.47 to 0.74.
    - **Test Performance**: The model's performance drops on the test set with an accuracy of 51.7%. The F1-scores range from 0.38 to 0.63, indicating variability in the model's ability to generalize across different segments.

**Random Forest with the OvR strategy results**

The results were much worse than for the ANN: training accuracy was very high while test accuracy was relatively low, which points to even stronger overfitting. We therefore dropped this model.
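
For reference, a one-vs-rest random forest can be set up as below. This is a sketch under assumptions: the project does not show how the OvR strategy was implemented, so wrapping the forest in scikit-learn's `OneVsRestClassifier` is one plausible reading, and the data here is synthetic with only weak signal. Fully grown trees tend to memorize the training set, which produces the same train/test gap described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(7)
n = 1200
y = rng.integers(0, 4, size=n)
# One weakly informative feature plus five pure-noise features.
X = np.column_stack([y + rng.normal(scale=2.0, size=n), rng.normal(size=(n, 5))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# One binary forest per class; the predicted class is the most confident one.
ovr = OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=7))
ovr.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, ovr.predict(X_tr))
test_acc = accuracy_score(y_te, ovr.predict(X_te))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```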

# 6. Fine-tuning
## 6.1 Address Overfitting (Regularization, Early Stopping)
- **Regularization**: Dropout and L1/L2 regularization were added to prevent overfitting.
- **Early stopping**: Training halts once the model's performance on a validation set stops improving.

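
The early stopping rule itself is simple enough to show without any framework. The function below is a simplified sketch of a patience-based rule, not Keras's implementation, and the validation losses are invented.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) as 0-based epoch indices.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. `best_epoch` is the epoch whose weights
    would be restored afterwards.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ends at the last epoch.
    return len(val_losses) - 1, best_epoch

val_losses = [1.30, 1.22, 1.18, 1.17, 1.19, 1.18, 1.20, 1.21]
print(early_stopping_epoch(val_losses, patience=3))   # (6, 3)
print(early_stopping_epoch(val_losses, patience=10))  # (7, 3)
```

A larger patience lets training continue through short plateaus, at the cost of more epochs before stopping.
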
**Conclusion from the Results**

**Overview of Metrics**

| Metric               | Previous Result | New Result                              |
|----------------------|-----------------|-----------------------------------------|
| **Training Loss**    | 0.8918          | 1.1601                                  |
| **Training Accuracy**| 0.6168          | 0.5076                                  |
| **Test Loss**        | 1.1721          | 1.1638                                  |
| **Test Accuracy**    | 0.5008          | 0.5229                                  |

**Training Set Classification Report (old vs new)**

| Segmentation      | Precision (Previous) | Recall (Previous) | F1-Score (Previous) | Support (Previous) | Precision (New) | Recall (New) | F1-Score (New) | Support (New) |
|-------------------|----------------------|-------------------|---------------------|--------------------|-----------------|--------------|----------------|---------------|
| Segmentation_A    | 0.58                 | 0.54              | 0.56                | 1230               | 0.43            | 0.43         | 0.43           | 1230          |
| Segmentation_B    | 0.55                 | 0.47              | 0.51                | 1160               | 0.41            | 0.23         | 0.30           | 1160          |
| Segmentation_C    | 0.65                 | 0.59              | 0.62                | 1159               | 0.54            | 0.54         | 0.54           | 1159          |
| Segmentation_D    | 0.66                 | 0.84              | 0.74                | 1347               | 0.58            | 0.78         | 0.66           | 1347          |
| **Overall Accuracy** |                   |                   | 0.62                | 4896               |                 |              | 0.51           | 4896          |
| **Macro Avg**     | 0.61                 | 0.61              | 0.61                | 4896               | 0.49            | 0.50         | 0.48           | 4896          |
| **Weighted Avg**  | 0.61                 | 0.62              | 0.61                | 4896               | 0.49            | 0.51         | 0.49           | 4896          |

**Test Set Classification Report (old vs new)**

| Segmentation      | Precision (Previous) | Recall (Previous) | F1-Score (Previous) | Support (Previous) | Precision (New) | Recall (New) | F1-Score (New) | Support (New) |
|-------------------|----------------------|-------------------|---------------------|--------------------|-----------------|--------------|----------------|---------------|
| Segmentation_A    | 0.42                 | 0.38              | 0.40                | 309                | 0.46            | 0.45         | 0.45           | 309           |
| Segmentation_B    | 0.39                 | 0.34              | 0.36                | 283                | 0.44            | 0.25         | 0.32           | 283           |
| Segmentation_C    | 0.59                 | 0.53              | 0.56                | 292                | 0.57            | 0.57         | 0.57           | 292           |
| Segmentation_D    | 0.56                 | 0.72              | 0.63                | 340                | 0.57            | 0.78         | 0.66           | 340           |
| **Overall Accuracy** |                   |                   | 0.50                | 1224               |                 |              | 0.52           | 1224          |
| **Macro Avg**     | 0.49                 | 0.49              | 0.49                | 1224               | 0.51            | 0.51         | 0.50           | 1224          |
| **Weighted Avg**  | 0.49                 | 0.50              | 0.49                | 1224               | 0.51            | 0.52         | 0.51           | 1224          |

**Key insights**
1. *Overfitting Mitigation*:
   - Regularization and early stopping appear to have reduced overfitting: training and test metrics are now much closer (accuracy 0.5076 vs 0.5229, loss 1.1601 vs 1.1638). However, training accuracy fell from 0.6168 to 0.5076, suggesting that the model may now be underfitting.

2. *Class Performance*:
   - *Segmentation_D*: Maintains relatively high precision, recall, and F1-score, so it remains the class the model predicts best.
   - *Segmentation_B*: Shows a drop in recall and F1-score (test recall fell from 0.34 to 0.25), suggesting that the model still struggles to correctly identify instances of this class.

3. *Model Performance*:
   - *Training Performance*: The new model shows lower performance on the training set compared to the previous one, indicating the impact of regularization.
   - *Test Performance*: Test accuracy rose slightly (0.5008 to 0.5229) but remains low. Together with the drop in training accuracy, this might indicate underfitting, either because the regularization is too strong or because more epochs are needed.
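
The overfitting argument can be made explicit by computing the train/test gaps from the "Overview of Metrics" table. The snippet below only does that arithmetic; the helper name is hypothetical.

```python
# Values copied from the "Overview of Metrics" table in this section.
metrics = {
    "previous": {"train_acc": 0.6168, "test_acc": 0.5008,
                 "train_loss": 0.8918, "test_loss": 1.1721},
    "new": {"train_acc": 0.5076, "test_acc": 0.5229,
            "train_loss": 1.1601, "test_loss": 1.1638},
}

def generalization_gaps(m):
    """Positive gaps mean the model does better on training data than on test data."""
    return {
        "accuracy_gap": round(m["train_acc"] - m["test_acc"], 4),
        "loss_gap": round(m["test_loss"] - m["train_loss"], 4),
    }

gaps = {name: generalization_gaps(m) for name, m in metrics.items()}
for name, gap in gaps.items():
    print(name, gap)
```

The accuracy gap shrinks from about 0.116 to about -0.015 and the loss gap from about 0.280 to about 0.004, which is the "more consistent performance" referred to above.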

## 6.2 Address Underfitting (RandomizedSearchCV with Keras)

The search covered different dropout rates and L2 regularization strengths. Early stopping patience was increased from 10 to 20.
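
To illustrate how a randomized search draws its candidates, the sketch below uses scikit-learn's `ParameterSampler`, which is the sampler `RandomizedSearchCV` relies on internally. The value grids are assumptions chosen for illustration; the project's actual search space is not listed here.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space for the two regularization hyperparameters.
param_distributions = {
    "dropout_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "l2_reg": [0.0001, 0.001, 0.01],
}

# Draw 5 of the 15 possible combinations; each would be trained and
# cross-validated in a full randomized search.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```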

**Training Set Classification Report (old vs new)**

| Segmentation  | Precision (Previous) | Recall (Previous) | F1-Score (Previous) | Support (Previous) | Precision (New) | Recall (New) | F1-Score (New) | Support (New) |
|---------------|----------------------|-------------------|---------------------|--------------------|-----------------|--------------|----------------|---------------|
| Segmentation_A| 0.43                 | 0.43              | 0.43                | 1230               | 0.45            | 0.46         | 0.46           | 1230          |
| Segmentation_B| 0.41                 | 0.23              | 0.30                | 1160               | 0.44            | 0.26         | 0.33           | 1160          |
| Segmentation_C| 0.54                 | 0.54              | 0.54                | 1159               | 0.57            | 0.53         | 0.55           | 1159          |
| Segmentation_D| 0.58                 | 0.78              | 0.66                | 1347               | 0.58            | 0.81         | 0.68           | 1347          |
| Accuracy      |                      |                   | 0.51                | 4896               |                 |              | 0.53           | 4896          |
| Macro Avg     | 0.49                 | 0.50              | 0.48                | 4896               | 0.51            | 0.52         | 0.50           | 4896          |
| Weighted Avg  | 0.49                 | 0.51              | 0.49                | 4896               | 0.51            | 0.53         | 0.51           | 4896          |

**Test Set Classification Report (old vs new)**

| Segmentation  | Precision (Previous) | Recall (Previous) | F1-Score (Previous) | Support (Previous) | Precision (New) | Recall (New) | F1-Score (New) | Support (New) |
|---------------|----------------------|-------------------|---------------------|--------------------|-----------------|--------------|----------------|---------------|
| Segmentation_A| 0.46                 | 0.45              | 0.45                | 309                | 0.43            | 0.44         | 0.43           | 309           |
| Segmentation_B| 0.44                 | 0.25              | 0.32                | 283                | 0.44            | 0.26         | 0.33           | 283           |
| Segmentation_C| 0.57                 | 0.57              | 0.57                | 292                | 0.60            | 0.55         | 0.57           | 292           |
| Segmentation_D| 0.57                 | 0.78              | 0.66                | 340                | 0.58            | 0.80         | 0.67           | 340           |
| Accuracy      |                      |                   | 0.52                | 1224               |                 |              | 0.52           | 1224          |
| Macro Avg     | 0.51                 | 0.51              | 0.50                | 1224               | 0.51            | 0.51         | 0.50           | 1224          |
| Weighted Avg  | 0.51                 | 0.52              | 0.51                | 1224               | 0.51            | 0.52         | 0.51           | 1224          |
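
The macro and weighted averages in these reports follow directly from the per-class scores and supports. As a quick check, the snippet below recomputes both averages for the new test-set F1-scores in the table above. Because it starts from rounded per-class values, such a recomputation can in general differ from the reported average in the last digit.

```python
# New test-set F1-scores and supports, copied from the table above.
f1 = {"A": 0.43, "B": 0.33, "C": 0.57, "D": 0.67}
support = {"A": 309, "B": 283, "C": 292, "D": 340}

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.5 0.51
```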

**Overview of Metrics**

| Metric               | Previous Result | New Result                              |
|----------------------|-----------------|-----------------------------------------|
| **Training Loss**    | 1.1601          | 1.1262                                  |
| **Training Accuracy**| 0.5076          | 0.5255                                  |
| **Test Loss**        | 1.1638          | 1.1389                                  |
| **Test Accuracy**    | 0.5229          | 0.5237                                  |

**Conclusion**

- *Training Set*:
  - The new model shows a slight improvement in overall accuracy from 0.51 to 0.53.
  - Precision, recall, and F1-scores have slightly increased for Segmentation_A and Segmentation_B. Segmentation_C and Segmentation_D also improved or held steady, apart from a small dip in Segmentation_C recall (0.54 to 0.53).
  - The macro and weighted averages for precision, recall, and F1-score have also improved slightly.

- *Test Set*:
  - The overall accuracy on the test set was essentially unchanged at 0.52 (0.5229 to 0.5237).
  - Segmentation_D improved slightly on all three metrics, and Segmentation_B gained a point in recall and F1-score. Segmentation_C gained precision but lost recall, and Segmentation_A decreased slightly on all three.
  - Macro and weighted averages for precision, recall, and F1-score were unchanged.

The slight improvements, mostly on the training set, indicate that the chosen best hyperparameters (dropout rate and L2 regularization) had a small positive effect on the model, while its ability to classify the segments on the test set stayed essentially the same.

## 6.3 Further Fine-tuning
The search was extended to dropout rate, L2 regularization strength, activation function, number of epochs, and batch size. In addition, the patience in `reduce_lr` was increased from 5 to 10.

**Training Set Classification Report (old vs new)**

| Segmentation  | Precision (Previous) | Recall (Previous) | F1-Score (Previous) | Support (Previous) | Precision (New) | Recall (New) | F1-Score (New) | Support (New) |
|---------------|----------------------|-------------------|---------------------|--------------------|-----------------|--------------|----------------|---------------|
| Segmentation_A| 0.45                 | 0.46              | 0.46                | 1230               | 0.47            | 0.49         | 0.48           | 1230          |
| Segmentation_B| 0.44                 | 0.26              | 0.33                | 1160               | 0.44            | 0.32         | 0.37           | 1160          |
| Segmentation_C| 0.57                 | 0.53              | 0.55                | 1159               | 0.58            | 0.54         | 0.56           | 1159          |
| Segmentation_D| 0.58                 | 0.81              | 0.68                | 1347               | 0.61            | 0.76         | 0.68           | 1347          |
| **Accuracy**  |                      |                   | 0.53                | 4896               |                 |              | 0.54           | 4896          |
| **Macro Avg** | 0.51                 | 0.52              | 0.50                | 4896               | 0.52            | 0.53         | 0.52           | 4896          |
| **Weighted Avg** | 0.51              | 0.53              | 0.51                | 4896               | 0.53            | 0.54         | 0.53           | 4896          |

**Test Set Classification Report (old vs new)**

| Segmentation  | Precision (Previous) | Recall (Previous) | F1-Score (Previous) | Support (Previous) | Precision (New) | Recall (New) | F1-Score (New) | Support (New) |
|---------------|----------------------|-------------------|---------------------|--------------------|-----------------|--------------|----------------|---------------|
| Segmentation_A| 0.43                 | 0.44              | 0.43                | 309                | 0.43            | 0.47         | 0.45           | 309           |
| Segmentation_B| 0.44                 | 0.26              | 0.33                | 283                | 0.42            | 0.27         | 0.33           | 283           |
| Segmentation_C| 0.60                 | 0.55              | 0.57                | 292                | 0.58            | 0.56         | 0.57           | 292           |
| Segmentation_D| 0.58                 | 0.80              | 0.67                | 340                | 0.60            | 0.75         | 0.67           | 340           |
| **Accuracy**  |                      |                   | 0.52                | 1224               |                 |              | 0.52           | 1224          |
| **Macro Avg** | 0.51                 | 0.51              | 0.50                | 1224               | 0.51            | 0.51         | 0.50           | 1224          |
| **Weighted Avg** | 0.51              | 0.52              | 0.51                | 1224               | 0.51            | 0.52         | 0.51           | 1224          |

**Conclusion**

- *Training Set*:
  - The overall accuracy on the training set has slightly increased from 0.53 to 0.54.
  - Precision, recall, and F1-scores have slightly improved for most segments, with Segmentation_A and Segmentation_B showing the most notable improvements. For Segmentation_D, precision rose while recall fell, leaving the F1-score unchanged at 0.68.
  - The macro and weighted averages for precision, recall, and F1-score have shown slight improvements.

- *Test Set*:
  - The overall accuracy on the test set has remained stable at 0.52.
  - Precision, recall, and F1-scores have remained relatively stable for Segmentation_B, Segmentation_C, and Segmentation_D, with slight improvements in Segmentation_A.
  - Macro and weighted averages for precision, recall, and F1-score have remained stable.

The new model, with adjusted hyperparameters and learning rate adjustments, has shown a slight improvement in training performance but has maintained stable test performance, with slight improvements in Segmentation_A and no significant improvements in overall performance.

**Best parameters are:**
- `{'model__learning_rate': 0.001, 'model__l2_reg': 0.01, 'model__dropout_rate': 0.2, 'model__activation': 'relu', 'epochs': 100, 'batch_size': 32}`
- `early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)`
- `reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=10, min_lr=0.0001)`
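
To see what the `reduce_lr` settings imply, the sketch below simulates a simplified reduce-on-plateau rule with the same `factor`, `patience`, and `min_lr`, starting from the best learning rate of 0.001. It is a framework-free approximation that ignores details such as cooldown and minimum delta, and the loss sequence is invented.

```python
def simulate_reduce_lr(val_losses, initial_lr=0.001, factor=0.2,
                       patience=10, min_lr=0.0001):
    """Return the learning rate used in each epoch under a plateau rule."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        history.append(lr)  # learning rate in effect during this epoch
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the rate, but never below the floor.
                lr = max(lr * factor, min_lr)
                wait = 0
    return history

# One improving epoch followed by a long plateau.
val_losses = [1.20] + [1.25] * 25
history = simulate_reduce_lr(val_losses)
print(history[0], history[11], history[21])
```

After the first plateau of 10 epochs the rate drops from 0.001 to about 0.0002, and after the second it would drop to 0.00004 but is clipped to the floor of 0.0001.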

# 7. Predict on unseen Data
The following steps were applied to the dataset that was set aside right after dropping duplicates.
- Pre-processing Test Set
    - Imputation of missing values
    - Categorical Columns Encoding
    - Target Encoding
    - Scaling
    - Drop non-significant features
- Predict
- Conclusion
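
One practical detail when encoding a held-out set is that it may not contain every category seen during training, so its one-hot columns can differ from the training columns. A common remedy, sketched below on invented rows, is to reindex the encoded frame to the training feature list. Whether the project handled it this way is not stated, so treat this as an assumption.

```python
import pandas as pd

# Invented rows; the unseen data lacks the "Lawyer" category.
train_raw = pd.DataFrame({"Profession": ["Artist", "Doctor", "Lawyer"],
                          "Age": [30, 45, 60]})
unseen_raw = pd.DataFrame({"Profession": ["Doctor", "Artist"],
                           "Age": [25, 52]})

# Encode the training data and remember its feature columns.
train_X = pd.get_dummies(train_raw, columns=["Profession"], dtype=int)
feature_columns = list(train_X.columns)

# Encode the unseen data, then align it to the training columns.
# Missing dummy columns are filled with 0; extra columns would be dropped.
unseen_X = pd.get_dummies(unseen_raw, columns=["Profession"], dtype=int)
unseen_X = unseen_X.reindex(columns=feature_columns, fill_value=0)

print(unseen_X)
```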

**Conclusion:**
The results show that the model has a consistent performance between the training, validation, and unseen test datasets, indicating good generalizability. Here are the key points summarized in a table:

**Performance Metrics**

| Dataset           | Loss    | Accuracy | Precision (Weighted Avg) | Recall (Weighted Avg) | F1-Score (Weighted Avg) |
|-------------------|---------|----------|--------------------------|-----------------------|-------------------------|
| **Training**      | 1.1058  | 0.5335   | 0.52                     | 0.53                  | 0.52                    |
| **Validation**    | 1.1231  | 0.5261   | 0.52                     | 0.53                  | 0.52                    |
| **Unseen Test**   | 1.1334  | 0.5023   | 0.49                     | 0.50                  | 0.49                    |

Overall, the model maintains similar performance across all datasets, with slightly lower accuracy on the unseen test set. The performance metrics indicate that the model is balanced, though there is room for improvement, especially in the recall and F1-scores for some segments.
