## Main Notebook for Benchmark Analysis - Training and Evaluation 

- Note this Jupyetr Noebook handles this initial data of the competition with only basic NLP preprocess without data augmented dut to the discouraged results in the competition score. 

- This Jupyter Notebook contains the benchmark analysis based firstly on "title" as input and secondly on "text" as input. 
- Our aim is to detect the best model (`LogisticRegression`, `Random Forest` ans `X-Boost`) with the best Input "title" or "text".
- Then we will try to improve based on hyperparameter tuning technoques only the model that was detected as the best one.
    - Note: Additioanlly, baselined models (majority and radnom classifers) created in order that we detct if a model predicts based on exactly random ness or on mode / frequent values. 
        - We want our models to outpermorf the evaluations metrics of these baselines.
---
> Evangelia P. Panourgia, Master Student in Data Science, AUEB <br />
> Department of Informatics, Athens University of Economics and Business <br />
> eva.panourgia@aueb.gr <br/><br/>


In [72]:
!pip install  xgboost



In [1]:
import os
import pandas as pd
# import nltk
import string
import random
import numpy as np
from sklearn.metrics import f1_score, classification_report
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import LabelEncoder
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import joblib  # For saving and loading models
from sklearn.ensemble import RandomForestClassifier

### Load Data 
- We will load the preprocessed data (`data_nlp_incidents_train.csv`) being pre-processed with only basic nlp preprocess.
- Furthermore, we will load the unlabeleed data of the competition in ordeer to predict them (`incidents.csv`).

In [2]:
df_initial = pd.read_csv('data/data_nlp_incidents_train.csv') # load data after data augmentation
testset_competition = pd.read_csv('data/incidents.csv', index_col=0) # load testing data (conception phase, unlabeled):

In [3]:
df_initial = df_initial[['title','text','hazard-category','product-category','hazard','product']]
print(df_initial.shape[0])
df_initial.head(3) # preview preproccessed data 

5046


Unnamed: 0,title,text,hazard-category,product-category,hazard,product
0,recal notif fsis-024-94,case number 024-94 date open 07/01/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,smoked sausage
1,recal notif fsis-033-94,case number 033-94 date open 10/03/1994 date c...,biological,"meat, egg and dairy products",listeria spp,sausage
2,recal notif fsis-014-94,case number 014-94 date open 03/28/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices


- The augmentred data were `~13000`

## Baselines
- Benchmark analysis is crucial for evaluating classification performance in multiclass imbalance settings because it provides reference points for how well your model is performing relative to simple baseline classifiers. The `Random Classifier` and `Majority Classifier` are commonly used as benchmarks for the following reasons:

### Random Classifier 
- A Random Classifier predicts class labels randomly, with **uniform** based on the distribution of classes. It sets a minimal baseline and helps understand:

- `Baseline Performance`: This represents the expected performance `without learning from the data`. `If a model performs worse than a random classifier, it indicates either issues in the model or unsuitable features`.

- `Chance Levels`: It shows what performance you'd `get by chance alone`, especially useful for imbalanced datasets where metrics like accuracy can be misleading.


### Majority Classifier

- A Majority Classifier always **predicts the majority class** (`the class with the highest frequency in the training data`). 

- It helps understand:

    - `Handling Imbalance`: In multiclass imbalanced datasets, accuracy can be dominated by the majority class. The majority classifier provides a baseline to compare how well your model captures minority classes.
    - `Baseline of Naïve Solutions`: The majority classifier reflects the simplest possible rule for prediction. If a model's performance is close to that of a majority classifier, it suggests the model is failing to generalize or adapt to the minority classes.
    - `Focus on Class Imbalance`: Metrics like weighted accuracy, balanced accuracy, or macro-F1 score should be significantly better than those achieved by the majority classifier to indicate that a model is addressing imbalance effectively.

- Note in the following code cell I implement the code for Random and Majority Classifier, in order to have a high level of "logic" we added the split steps of trainingtest set, but for example for the Random Classifier it is useless as it is not affected from the input, dont learn from data.
    - Hoever, this "skeleton" is useful for the reamaining algorythms to buils in (both traditional and advanced) 

- More specifically, 

    - Random Classifier  Effect of X: The X values (features) **do not influence the random classifier's predictions**. It does not learn from the data in the feature column. Its predictions are purely random, so changing X will not alter its performance.
    - Majority Classifier Effect of X: The feature column X is ignored by the majority classifier, as it does not use features for prediction. Instead, it looks only at the distribution of y in the training data.

### Regarding the Implementation 
- The `DummyClassifier in scikit-learn` is a baseline model designed to evaluate classification algorithms by comparing them against simplistic strategies. These strategies provide minimal logic to make predictions and are often used as benchmarks to understand how well a more complex model performs.
    - `strategy="uniform"` (for Random Classifier): 
        - Predicts a class randomly and uniformly across all possible classes.
        - Each class has an equal probability of being selected, irrespective of the class distribution in the training data.
        - Use Case: Ideal for scenarios where you want to simulate random guessing.
    - `strategy="most_frequent"` (for Majoriry Classification)
        - Always predicts the most frequent class observed in the training data.
        - Ignores the input features entirely and focuses only on the training set's class distribution.
        - Use Case: Useful for understanding how well a naive baseline would perform if you simply predicted the majority class.

In [5]:
def evaluate_baselines(dataframe, feature_column):
    """
    Function to evaluate random and majority classifiers on a given dataframe.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.
    """
    np.random.seed(42)  # For reproducibility

    # Train-test split with optional stratification
    trainset, testset = train_test_split(
        dataframe, 
        test_size=0.2, 
        random_state=2024, 
        # "skeleton" for the main algo here add stratisfy to hold proportion of classes 
    )
   
    # Random and Majority classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Evaluating for label: {label}")

        # Features and target
        X_train = trainset[feature_column]
        y_train = trainset[label]
        X_test = testset[feature_column]
        y_test = testset[label]

        # Random Classifier
        random_clf = DummyClassifier(strategy="uniform", random_state=2024)
        random_clf.fit(X_train, y_train) # it is uselless X stimulate the logic of a real algo. 
        testset['predictions-random-' + label] = random_clf.predict(X_test)

        # Majority Classifier
        majority_clf = DummyClassifier(strategy="most_frequent")
        majority_clf.fit(X_train, y_train)# it is uselless X stimulate the logic of a real algo. 
        testset['predictions-majority-' + label] = majority_clf.predict(X_test)

        # Compute F1 scores
        random_f1 = f1_score(y_test, testset['predictions-random-' + label], average='macro', zero_division=0)
        majority_f1 = f1_score(y_test, testset['predictions-majority-' + label], average='macro', zero_division=0)

        print(f"F1 Score for Random Classifier ({label}): {random_f1:.3f}")
        print(f"F1 Score for Majority Classifier ({label}): {majority_f1:.3f}")

        # Generate and save classification reports
        os.makedirs('reports_initial/random', exist_ok=True)
        os.makedirs('reports_initial/majority', exist_ok=True)

        random_report = classification_report(y_test, testset['predictions-random-' + label], zero_division=0)
        majority_report = classification_report(y_test, testset['predictions-majority-' + label], zero_division=0)

        with open(f'reports_initial/random/random_classifier_report_{label}.txt', 'w') as random_file:
            random_file.write(f"Classification Report for Random Classifier ({label}):\n")
            random_file.write(random_report)

        with open(f'reports_initial/majority/majority_classifier_report_{label}.txt', 'w') as majority_file:
            majority_file.write(f"Classification Report for Majority Classifier ({label}):\n")
            majority_file.write(majority_report)
        
        
    
    # Custom metric score calculation
    def compute_score(hazards_true, products_true, hazards_pred, products_pred):
        """
        Custom scoring function to compute the macro F1 score for hazards and products.
        
        Args:
            hazards_true: Ground truth labels for hazards.
            products_true: Ground truth labels for products.
            hazards_pred: Predicted labels for hazards.
            products_pred: Predicted labels for products.
        
        Returns:
            A float representing the combined macro F1 score.
        """
        f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)
        f1_products = f1_score(
            products_true[hazards_pred == hazards_true],
            products_pred[hazards_pred == hazards_true],
            average='macro', 
            zero_division=0
        )
        return (f1_hazards + f1_products) / 2.

    # Example of calculating scores for Sub-Tasks (if needed):
    # Uncomment the following lines to compute scores for tasks
    print(f"Score Sub-Task 1 - Random Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-random-hazard-category'], testset['predictions-random-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Random Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-random-hazard'], testset['predictions-random-product-category']):.3f}")
    print(f"Score Sub-Task 1 - Majority Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-majority-hazard-category'], testset['predictions-majority-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Majority Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-majority-hazard'], testset['predictions-majority-product']):.3f}")

# Call the function with the required dataframe 
evaluate_baselines(df_initial, feature_column='text')

Evaluating for label: hazard-category
F1 Score for Random Classifier (hazard-category): 0.063
F1 Score for Majority Classifier (hazard-category): 0.060
Evaluating for label: product-category
F1 Score for Random Classifier (product-category): 0.028
F1 Score for Majority Classifier (product-category): 0.022
Evaluating for label: hazard
F1 Score for Random Classifier (hazard): 0.005
F1 Score for Majority Classifier (hazard): 0.002
Evaluating for label: product
F1 Score for Random Classifier (product): 0.000
F1 Score for Majority Classifier (product): 0.000
Score Sub-Task 1 - Random Classifier: 0.051
Score Sub-Task 2 - Random Classifier: 0.002
Score Sub-Task 1 - Majority Classifier: 0.039
Score Sub-Task 2 - Majority Classifier: 0.002


- The analysis of scores are simlliar with that conatined to teh jupyter `augmented_training_process`
    - We want our main models to have better performance of these scoresin order to predict better than randomeness and better than predicrting only the dominant (mode/frequent) values. 

## Strategy for Model Selection and Evaluation

### Overview
Our approach involves `systematically evaluating three machine learning algorithms on two input types`, **"title"** and **"text"**. The goal is to identify the `b`est-performing model based on evaluation metrics and a custom competition evaluation metric`. Due to time and memory constraints, we set specific parameter values for each algorithm after initial manual investigation. This allows us to efficiently gain an overview of model performance and make informed decisions about **further optimization**.

---

### Step-by-Step Strategy

#### 1. Initial Model Evaluation with "Title" Input
- **Algorithms Tested**:
  - `Logistic Regression`
  - `Random Forest`
  - `XGBoost`
- **Parameter Setting**:
  - Parameters for each algorithm are manually tuned based on preliminary analysis to balance performance and computational efficiency.
- **Evaluation**:
  - Models are assessed using:
    - Standard evaluation metrics (e.g., accuracy, precision, recall, F1-score).
    - A custom evaluation metric provided by the competition.
- **Objective**:
  - Identify the most `promising algorithm` based on "title" input.

---

#### 2. Evaluation with "Text" Input
- **Algorithms Tested**:
  - Logistic Regression
  - Random Forest
  - XGBoost
- **Evaluation**:
  - The same evaluation metrics and competition-specific metric are applied as in the "title" input analysis.
- **Objective**:
  - Determine the best-performing algorithm for the "text" input.

---

#### 3. Comparison of Best Models
- The top-performing models from the **"title"** and **"text"** input evaluations are compared.
- **Selection**:
  - Based on their performance across all metrics, the superior model is selected.

---

#### 4. Optimization of the Final Model
- The chosen algorithm undergoes parameter optimization to refine its performance.
- **Constraints**:
  - Due to time limitations, cross-validation (e.g., K-fold validation) will not be applied.
  - Instead, a streamlined validation approach is used to ensure efficient optimization without excessive computational overhead.

---

#### 5. Baseline Comparison
- Throughout the process, model performance is benchmarked against baseline models:
  - **Random Prediction**: A model that predicts randomly.
  - **Majority Class Prediction**: A model that always predicts the most frequent class.
- **Objective**:
  - Provide context for evaluating the added value of the trained algorithms.

---

### Summary
This structured methodology ensures a thorough evaluation of multiple algorithms across different input types, with a focus on balancing computational efficiency and performance. By the end of this process, the goal is to identify the best-performing algorithm with the best input title or text and optimize it for deployment within the constraints of time and resources.

### Part A. Benchmark Analysis Title


- Firstly,we will include the custom evaluation metric provided by the competition page.
    - It will be used for as evaluation part of all algorythms.

In [40]:
# ## Helping function for the calculation of the custom evaluation for the subtasks.
# def compute_score(hazards_true, products_true, hazards_pred, products_pred):
#     """
#     Compute a custom F1 score that considers hazards and products together.
#     """
#     # Reset indices to ensure alignment
#     hazards_true = hazards_true.reset_index(drop=True)
#     products_true = products_true.reset_index(drop=True)
#     hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
#     products_pred = pd.Series(products_pred).reset_index(drop=True)

#     # Compute F1 for hazards
#     f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)

#     # Compute F1 for products, only where hazards predictions match ground truth
#     mask = hazards_pred == hazards_true
#     f1_products = f1_score(
#         products_true[mask],
#         products_pred[mask],
#         average='macro',
#         zero_division=0
#     )

#     # Return the combined metric
#     return (f1_hazards + f1_products) / 2.


def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    """
    Compute a custom F1 score that considers hazards and products together.
    """
    # Ensure all inputs are pandas Series
    if not isinstance(hazards_true, pd.Series):
        hazards_true = pd.Series(hazards_true)
    if not isinstance(products_true, pd.Series):
        products_true = pd.Series(products_true)
    if not isinstance(hazards_pred, pd.Series):
        hazards_pred = pd.Series(hazards_pred)
    if not isinstance(products_pred, pd.Series):
        products_pred = pd.Series(products_pred)

    # Reset indices to ensure alignment
    hazards_true = hazards_true.reset_index(drop=True)
    products_true = products_true.reset_index(drop=True)
    hazards_pred = hazards_pred.reset_index(drop=True)
    products_pred = products_pred.reset_index(drop=True)

    # Compute F1 for hazards
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)

    # Compute F1 for products, only where hazards predictions match ground truth
    mask = hazards_pred == hazards_true
    f1_products = f1_score(
        products_true[mask],
        products_pred[mask],
        average='macro',
        zero_division=0
    )

    # Return the combined metric
    return (f1_hazards + f1_products) / 2


- Due to few instances of classes we can not apply heree stratisfy, also we will not use SMOTE because of limited rememory and time resources.
    - SMOTE in combiantion with TF-idf is slow and generates a large space of X.

In [7]:
# def train_log_regression_classifiers(dataframe, feature_column):
#     """
#     Train multinomial logistic regression classifiers for four labels and calculate custom metrics on test data.

#     Args:
#         dataframe: The input dataframe containing the dataset.
#         feature_column: The name of the column in the dataframe to be used as features.

#     Returns:
#         classifiers: A dictionary containing trained classifiers for each label.
#         vectorizers: A dictionary containing TF-IDF vectorizers for each label.
#         custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
#     """
#     np.random.seed(42)  # For reproducibility

#     classifiers = {}  # Dictionary to store the trained classifiers
#     vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
#     custom_metrics = {}  # Dictionary to store custom metric scores

#     # Dictionaries to store test data for each category
#     test_data = {}

#     # Train classifiers for each label
#     for label in ('hazard-category', 'product-category', 'hazard', 'product'):
#         print(f"Training classifier for label: {label}")

#         # Train-test split with stratification based on the current label
#         trainset, testset = train_test_split(
#             dataframe,
#             test_size=0.2,
#             random_state=2024,
#             # stratify=dataframe[label] # hold proportion of classes distribution
#         )

#         # Extract train and test features
#         X_train = trainset[feature_column]
#         X_test = testset[feature_column]

#         # Target
#         y_train = trainset[label]
#         y_test = testset[label]

#         # Define TfidfVectorizer for the current label
#         vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000) # limit to 5000
#         vectorizers[label] = vectorizer

#         # Transform features using the label-specific vectorizer
#         X_train_tfidf = vectorizer.fit_transform(X_train)
#         X_test_tfidf = vectorizer.transform(X_test)

#         # Define and train Logistic Regression classifier
#         classifier = LogisticRegression(max_iter=100, random_state=2024, multi_class='multinomial') # limit to max iter 
#         classifier.fit(X_train_tfidf, y_train)

#         # Store the trained classifier
#         classifiers[label] = classifier

#         # Store test data
#         test_data[label] = {
#             'X_test_tfidf': X_test_tfidf,
#             'y_test': y_test
#         }

#         # Predict and evaluate
#         predictions = classifier.predict(X_test_tfidf)
#         logreg_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
#         print(f"F1 Score for {label}: {logreg_f1:.3f}")

#         # Generate classification report
#         report = classification_report(y_test, predictions, zero_division=0)
#         print(report)

#         # Save the report
#         os.makedirs('reports_initial/logreg', exist_ok=True)
#         with open(f'reports_initial/logreg/logreg_classifier_report_{label}_{feature_column}.txt', 'w') as logreg_file:
#             logreg_file.write(f"Classification Report for {label}_{feature_column}:\n")
#             logreg_file.write(report)
#             logreg_file.write(f"F1 Score: {logreg_f1:.3f}\n")

#     # Compute the custom metric for hazards and products using test data only
#     custom_metrics['subtask_1'] = compute_score(
#     test_data['hazard-category']['y_test'],
#     test_data['product-category']['y_test'],
#     classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
#     classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
#     )

#     custom_metrics['subtask_2'] = compute_score(
#     test_data['hazard']['y_test'],
#     test_data['product']['y_test'],
#     classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
#     classifiers['product'].predict(test_data['product']['X_test_tfidf'])
#     )


#     print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
#     print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

#     return classifiers, vectorizers, custom_metrics

In [41]:
def train_log_regression_classifiers(dataframe, feature_column):
    """
    Train multinomial logistic regression classifiers for four labels and calculate custom metrics on test data,
    with special handling for rare classes.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Separate rare classes
        rare_classes = dataframe[label].value_counts()[dataframe[label].value_counts() == 1].index
        rare_data = dataframe[dataframe[label].isin(rare_classes)]
        common_data = dataframe[~dataframe[label].isin(rare_classes)]

        # Train-test split with stratification for common classes
        train_common, test_common = train_test_split(
            common_data,
            test_size=0.2,
            random_state=2024,
            stratify=common_data[label] # maintain proportion of classes equally compared with augmented data
        )

        # Add all rare classes to either training or test set
        trainset = pd.concat([train_common, rare_data])  # Include rare classes in training
        testset = test_common

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)  # limit to 5000
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train Logistic Regression classifier
        classifier = LogisticRegression(max_iter=100, random_state=2024, multi_class='multinomial')  # limit to max iter 
        classifier.fit(X_train_tfidf, y_train)

        # Store the trained classifier
        classifiers[label] = classifier

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        logreg_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {logreg_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test, predictions, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports_initial/logreg', exist_ok=True)
        with open(f'reports_initial/logreg/logreg_classifier_report_{label}_{feature_column}.txt', 'w') as logreg_file:
            logreg_file.write(f"Classification Report for {label}_{feature_column}:\n")
            logreg_file.write(report)
            logreg_file.write(f"F1 Score: {logreg_f1:.3f}\n")

    # Compute the custom metric for hazards and products using test data only
    custom_metrics['subtask_1'] = compute_score(
        test_data['hazard-category']['y_test'],
        test_data['product-category']['y_test'],
        classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
        classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
        test_data['hazard']['y_test'],
        test_data['product']['y_test'],
        classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
        classifiers['product'].predict(test_data['product']['X_test_tfidf'])
    )

    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics


In [42]:
classifiers, vectorizers, custom_metrics = train_log_regression_classifiers(df_initial, 'title')
print("Custom Metric Scores on Test Data:")
print(custom_metrics)

Training classifier for label: hazard-category




F1 Score for hazard-category: 0.422
                                precision    recall  f1-score   support

                     allergens       0.82      0.93      0.87       371
                    biological       0.79      0.93      0.86       344
                      chemical       0.87      0.46      0.60        57
food additives and flavourings       0.00      0.00      0.00         5
                foreign bodies       0.82      0.73      0.77       111
                         fraud       0.91      0.58      0.71        74
                     migration       0.00      0.00      0.00         1
          organoleptic aspects       0.00      0.00      0.00        10
                  other hazard       0.67      0.15      0.25        26
              packaging defect       1.00      0.09      0.17        11

                      accuracy                           0.81      1010
                     macro avg       0.59      0.39      0.42      1010
                  weighted



F1 Score for product-category: 0.355
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.00      0.00      0.00        12
                      cereals and bakery products       0.57      0.76      0.65       133
     cocoa and cocoa preparations, coffee and tea       0.62      0.55      0.58        42
                                    confectionery       0.67      0.06      0.11        34
dietetic foods, food supplements, fortified foods       0.71      0.58      0.64        26
                                    fats and oils       0.00      0.00      0.00         4
                                   feed materials       0.00      0.00      0.00         1
                   food additives and flavourings       0.00      0.00      0.00         2
                           food contact materials       0.00      0.00      0.00         1
                            fruits and vegetables   



F1 Score for hazard: 0.149
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         2
                                   abnormal smell       0.00      0.00      0.00         1
                                  alcohol content       0.00      0.00      0.00         1
                                        alkaloids       0.00      0.00      0.00         1
                                        allergens       0.00      0.00      0.00         3
                                           almond       0.91      0.77      0.83        13
                           antibiotics, vet drugs       0.00      0.00      0.00         1
                                    bacillus spp.       0.00      0.00      0.00         3
                             bad smell / off odor       0.00      0.00      0.00         1
                                    bone fragment       0.00  



F1 Score for product: 0.083
                                                         precision    recall  f1-score   support

                                 Catfishes (freshwater)       0.67      1.00      0.80         2
                                  Fishes not identified       0.20      0.29      0.24         7
                               Not classified pork meat       0.00      0.00      0.00         2
                             Pangas catfishes (generic)       0.00      0.00      0.00         1
                    Precooked cooked pork meat products       0.00      0.00      0.00         2
                                          Veggie Burger       0.00      0.00      0.00         1
                                    alcoholic beverages       0.00      0.00      0.00         1
                                        alfalfa sprouts       1.00      1.00      1.00         2
                                                  algae       0.00      0.00      0.00         1
 

### Random Forest  TF-IDF Title

### High-Level Explanation of `train_random_forest_classifiers`

### Objective
- Train Random Forest classifiers for four labels and compute custom metrics to evaluate performance.

### Inputs
- **`dataframe`**: Dataset containing features and target labels.
- **`feature_column`**: The column in the dataframe to be used for feature extraction.

### Outputs
- **`classifiers`**: A dictionary of trained Random Forest classifiers for each label.
- **`vectorizers`**: A dictionary of TF-IDF vectorizers used for feature extraction.
- **`custom_metrics`**: A dictionary of custom evaluation scores for subtasks on test data.

---

### Key Steps

1. **Initialization**:
   - Set a random seed for reproducibility.
   - Prepare dictionaries to store classifiers, vectorizers, and custom metrics.

2. **Label-Specific Training**:
   - Iterate over four labels: 
     - `hazard-category`, 
     - `product-category`, 
     - `hazard`, 
     - `product`.
   - For each label:
     - Perform a stratified train-test split to maintain class distribution.
     - Use `TfidfVectorizer` to extract character-based n-gram features (2-5) with a maximum of 5000 features.
     - Train a Random Forest classifier with 100 decision trees using the extracted TF-IDF features.
     - Evaluate the model using F1 score and generate a classification report.
     - Save the trained classifier, vectorizer, and test data.

3. **Custom Metric Calculation**:
   - Compute task-specific metrics for two subtasks:
     - **Subtask 1**: Evaluate the relationship between `hazard-category` and `product-category`.
     - **Subtask 2**: Evaluate the relationship between `hazard` and `product`.
   - Combine F1 scores for hazards and products to compute the final metric for each subtask.

4. **Logging and Output**:
   - Print F1 scores for each label and custom metric scores for subtasks.
   - Save classification reports to a dedicated directory.
   - Return the trained classifiers, vectorizers, and custom metrics.

---

## Key Techniques
- **TF-IDF Vectorization**: Convert text into feature vectors using character n-grams.
- **Random Forest Classifier**: Train ensemble models with 100 decision trees for robust predictions.
- **Custom Metric Calculation**: Evaluate subtasks by combining F1 scores for hazards and products.

This function enables multi-label text classification and provides task-specific evaluations with a focus on hazards and products.


In [11]:
# def train_random_forest_classifiers(dataframe, feature_column):
#     """
#     Train Random Forest classifiers for four labels and calculate custom metrics on test data.

#     Args:
#         dataframe: The input dataframe containing the dataset.
#         feature_column: The name of the column in the dataframe to be used as features.

#     Returns:
#         classifiers: A dictionary containing trained classifiers for each label.
#         vectorizers: A dictionary containing TF-IDF vectorizers for each label.
#         custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
#     """
#     np.random.seed(42)  # For reproducibility

#     classifiers = {}  # Dictionary to store the trained classifiers
#     vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
#     custom_metrics = {}  # Dictionary to store custom metric scores

#     # Dictionaries to store test data for each category
#     test_data = {}

#     # Train classifiers for each label
#     for label in ('hazard-category', 'product-category', 'hazard', 'product'):
#         print(f"Training classifier for label: {label}")

#         # Train-test split with stratification based on the current label
#         trainset, testset = train_test_split(
#             dataframe,
#             test_size=0.2,
#             random_state=2024,
#             # stratify=dataframe[label]  # hold proportion of class distribution
#         )

#         # Extract train and test features
#         X_train = trainset[feature_column]
#         X_test = testset[feature_column]

#         # Target
#         y_train = trainset[label]
#         y_test = testset[label]

#         # Define TfidfVectorizer for the current label
#         vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)  # limit to 5000
#         vectorizers[label] = vectorizer

#         # Transform features using the label-specific vectorizer
#         X_train_tfidf = vectorizer.fit_transform(X_train)
#         X_test_tfidf = vectorizer.transform(X_test)

#         # Define and train Random Forest classifier
#         classifier = RandomForestClassifier(n_estimators=100, random_state=2024, n_jobs=-1)  # Using 100 trees
#         classifier.fit(X_train_tfidf, y_train)

#         # Store the trained classifier
#         classifiers[label] = classifier

#         # Store test data
#         test_data[label] = {
#             'X_test_tfidf': X_test_tfidf,
#             'y_test': y_test
#         }

#         # Predict and evaluate
#         predictions = classifier.predict(X_test_tfidf)
#         rf_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
#         print(f"F1 Score for {label}: {rf_f1:.3f}")

#         # Generate classification report
#         report = classification_report(y_test, predictions, zero_division=0)
#         print(report)

#         # Save the report
#         os.makedirs('reports_initial/random_forest', exist_ok=True)
#         with open(f'reports_initial/random_forest/rf_classifier_report_{label}_{feature_column}.txt', 'w') as rf_file:
#             rf_file.write(f"Classification Report for {label}_{feature_column}:\n")
#             rf_file.write(report)
#             rf_file.write(f"F1 Score: {rf_f1:.3f}\n")

#     # Compute the custom metric for hazards and products using test data only
#     custom_metrics['subtask_1'] = compute_score(
#         test_data['hazard-category']['y_test'],
#         test_data['product-category']['y_test'],
#         classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
#         classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
#     )

#     custom_metrics['subtask_2'] = compute_score(
#         test_data['hazard']['y_test'],
#         test_data['product']['y_test'],
#         classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
#         classifiers['product'].predict(test_data['product']['X_test_tfidf'])
#     )

#     print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
#     print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

#     return classifiers, vectorizers, custom_metrics

In [46]:
def train_random_forest_classifiers(dataframe, feature_column):
    """
    Train Random Forest classifiers for four labels and calculate custom metrics on test data,
    with special handling for rare classes.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Separate rare classes
        rare_classes = dataframe[label].value_counts()[dataframe[label].value_counts() == 1].index
        rare_data = dataframe[dataframe[label].isin(rare_classes)]
        common_data = dataframe[~dataframe[label].isin(rare_classes)]

        # Train-test split with stratification for common classes
        train_common, test_common = train_test_split(
            common_data,
            test_size=0.2,
            random_state=2024,
            stratify=common_data[label]
        )

        # Add all rare classes to either training or test set
        trainset = pd.concat([train_common, rare_data])  # Include rare classes in training
        testset = test_common

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)  # limit to 5000
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train Random Forest classifier
        classifier = RandomForestClassifier(n_estimators=100, random_state=2024, n_jobs=-1)  # Using 100 trees
        classifier.fit(X_train_tfidf, y_train)

        # Store the trained classifier
        classifiers[label] = classifier

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        rf_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {rf_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test, predictions, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports_initial/random_forest', exist_ok=True)
        with open(f'reports_initial/random_forest/rf_classifier_report_{label}_{feature_column}.txt', 'w') as rf_file:
            rf_file.write(f"Classification Report for {label}_{feature_column}:\n")
            rf_file.write(report)
            rf_file.write(f"F1 Score: {rf_f1:.3f}\n")

    # Compute the custom metric for hazards and products using test data only
    custom_metrics['subtask_1'] = compute_score(
        test_data['hazard-category']['y_test'],
        test_data['product-category']['y_test'],
        classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
        classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
        test_data['hazard']['y_test'],
        test_data['product']['y_test'],
        classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
        classifiers['product'].predict(test_data['product']['X_test_tfidf'])
    )

    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")


In [47]:
train_random_forest_classifiers(df_initial, 'title')

Training classifier for label: hazard-category
F1 Score for hazard-category: 0.563
                                precision    recall  f1-score   support

                     allergens       0.86      0.92      0.89       371
                    biological       0.79      0.94      0.86       344
                      chemical       0.81      0.60      0.69        57
food additives and flavourings       1.00      0.60      0.75         5
                foreign bodies       0.84      0.73      0.78       111
                         fraud       0.88      0.57      0.69        74
                     migration       0.00      0.00      0.00         1
          organoleptic aspects       1.00      0.20      0.33        10
                  other hazard       0.60      0.23      0.33        26
              packaging defect       1.00      0.18      0.31        11

                      accuracy                           0.83      1010
                     macro avg       0.78      0.50

### Advanced X-Boost TF-IDF Title 

### High-Level Explanation of `train_xgboost_classifiers`

### Objective
- Train XGBoost classifiers for four labels and compute custom metrics to evaluate performance.

### Inputs
- **`dataframe`**: Dataset containing features and target labels.
- **`feature_column`**: The column in the dataframe to be used for feature extraction.

### Outputs
- **`classifiers`**: A dictionary of trained XGBoost classifiers for each label, including label encoders.
- **`vectorizers`**: A dictionary of TF-IDF vectorizers used for feature extraction.
- **`custom_metrics`**: A dictionary of custom evaluation scores for subtasks on test data.

---

### Key Steps

1. **Initialization**:
   - Set a random seed for reproducibility.
   - Prepare dictionaries to store classifiers, vectorizers, and custom metrics.

2. **Label-Specific Training**:
   - Iterate over four labels:
     - `hazard-category`,
     - `product-category`,
     - `hazard`,
     - `product`.
   - For each label:
     - Perform a stratified train-test split to maintain class distribution.
     - Use `TfidfVectorizer` to extract character-based n-gram features (2-5) with a maximum of 2000 features.
     - Encode target labels into numeric values using `LabelEncoder`.
     - Train an XGBoost classifier with the following parameters:
       - Maximum depth of 6.
       - 50 estimators.
       - Learning rate of 0.2.
     - Evaluate the model using F1 score and generate a classification report.
     - Save the trained classifier, vectorizer, label encoder, and test data.

3. **Custom Metric Calculation**:
   - Compute task-specific metrics for two subtasks:
     - **Subtask 1**: Evaluate the relationship between `hazard-category` and `product-category`.
     - **Subtask 2**: Evaluate the relationship between `hazard` and `product`.
   - Combine F1 scores for hazards and products to compute the final metric for each subtask.

4. **Logging and Output**:
   - Print F1 scores for each label and custom metric scores for subtasks.
   - Save classification reports to a dedicated directory.
   - Return the trained classifiers, vectorizers, and custom metrics.

---

### Key Techniques
- **TF-IDF Vectorization**: Convert text into feature vectors using character n-grams.
- **XGBoost Classifier**: Train scalable and efficient tree-based classifiers with softmax multi-class objectives.
- **Label Encoding**: Map categorical target labels to numeric values for compatibility with XGBoost.
- **Custom Metric Calculation**: Evaluate subtasks by combining F1 scores for hazards and products.

This function enables multi-label text classification and provides task-specific evaluations with a focus on hazards and products using a high-performance gradient boosting model.


In [43]:
# import os
# import numpy as np
# import pandas as pd
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import LabelEncoder
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.metrics import f1_score, classification_report
# from xgboost import XGBClassifier

# def train_xgboost_classifiers(dataframe, feature_column):
#     """
#     Train XGBoost classifiers for four labels and calculate custom metrics on test data.

#     Args:
#         dataframe: The input dataframe containing the dataset.
#         feature_column: The name of the column in the dataframe to be used as features.

#     Returns:
#         classifiers: A dictionary containing trained classifiers for each label.
#         vectorizers: A dictionary containing TF-IDF vectorizers for each label.
#         custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
#     """
#     np.random.seed(42)  # For reproducibility

#     classifiers = {}  # Dictionary to store the trained classifiers
#     vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
#     custom_metrics = {}  # Dictionary to store custom metric scores

#     # Dictionaries to store test data for each category
#     test_data = {}

#     # Train classifiers for each label
#     for label in ('hazard-category', 'product-category', 'hazard', 'product'):
#         print(f"Training classifier for label: {label}")

#         # Train-test split with stratification based on the current label
#         trainset, testset = train_test_split(
#             dataframe,
#             test_size=0.2,
#             random_state=2024
#         )

#         # Extract train and test features
#         X_train = trainset[feature_column]
#         X_test = testset[feature_column]

#         # Target
#         y_train = trainset[label]
#         y_test = testset[label]

#         # Encode target labels into numeric values
#         label_encoder = LabelEncoder()
#         y_train_encoded = label_encoder.fit_transform(y_train)

#         # Handle unseen labels in test set
#         y_test_mapped = y_test.map(lambda x: x if x in label_encoder.classes_ else None).dropna()
#         y_test_encoded = label_encoder.transform(y_test_mapped)

#         # Filter test data to exclude rows with unseen labels
#         valid_test_indices = y_test.index[y_test.isin(label_encoder.classes_)]
#         X_test = X_test.loc[valid_test_indices]

#         # Define TfidfVectorizer for the current label
#         vectorizer = TfidfVectorizer(
#             strip_accents='unicode', analyzer='char', ngram_range=(2, 5),
#             max_df=0.5, min_df=5, max_features=2000
#         )
#         vectorizers[label] = vectorizer

#         # Transform features using the label-specific vectorizer
#         X_train_tfidf = vectorizer.fit_transform(X_train)
#         X_test_tfidf = vectorizer.transform(X_test)

#         # Define and train XGBoost classifier
#         classifier = XGBClassifier(
#             use_label_encoder=False,
#             eval_metric='mlogloss',
#             objective='multi:softmax',
#             max_depth=6,
#             n_estimators=50,
#             learning_rate=0.2,
#             random_state=2024
#         )
#         classifier.fit(X_train_tfidf, y_train_encoded)

#         # Store the trained classifier
#         classifiers[label] = {
#             'model': classifier,
#             'label_encoder': label_encoder
#         }

#         # Store test data
#         test_data[label] = {
#             'X_test_tfidf': X_test_tfidf,
#             'y_test': y_test_encoded
#         }

#         # Predict and evaluate
#         predictions = classifier.predict(X_test_tfidf)
#         xgb_f1 = f1_score(y_test_encoded, predictions, average='macro', zero_division=0)
#         print(f"F1 Score for {label}: {xgb_f1:.3f}")

#         # Filter target names to match present classes
#         present_classes = np.unique(y_test_encoded)
#         filtered_target_names = [label_encoder.classes_[i] for i in present_classes]

#         # Generate classification report
#         report = classification_report(
#             y_test_encoded, 
#             predictions, 
#             zero_division=0, 
#             target_names=filtered_target_names, 
#             labels=present_classes
#         )
#         print(report)

#         # Save the report
#         os.makedirs('reports_initial/xgboost', exist_ok=True)
#         with open(f'reports_initial/xgboost/xgboost_classifier_report_{label}_{feature_column}.txt', 'w') as xgb_file:
#             xgb_file.write(f"Classification Report for {label}_{feature_column}:\n")
#             xgb_file.write(report)
#             xgb_file.write(f"F1 Score: {xgb_f1:.3f}\n")

#     # Compute the custom metric for hazards and products using test data only
#     custom_metrics['subtask_1'] = compute_score(
#         pd.Series(test_data['hazard-category']['y_test']),
#         pd.Series(test_data['product-category']['y_test']),
#         pd.Series(classifiers['hazard-category']['model'].predict(test_data['hazard-category']['X_test_tfidf'])),
#         pd.Series(classifiers['product-category']['model'].predict(test_data['product-category']['X_test_tfidf']))
#     )

#     custom_metrics['subtask_2'] = compute_score(
#         pd.Series(test_data['hazard']['y_test']),
#         pd.Series(test_data['product']['y_test']),
#         pd.Series(classifiers['hazard']['model'].predict(test_data['hazard']['X_test_tfidf'])),
#         pd.Series(classifiers['product']['model'].predict(test_data['product']['X_test_tfidf']))
#     )

#     print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
#     print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

#     return classifiers, vectorizers, custom_metrics


In [84]:
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import os


def train_xgboost_classifiers(dataframe, feature_column):
    """
    Train XGBoost classifiers for multiclass labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}
    vectorizers = {}
    custom_metrics = {}

    # Store test data for evaluation
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Separate rare and common classes with one instance
        rare_classes = dataframe[label].value_counts()[dataframe[label].value_counts() == 1].index
        rare_data = dataframe[dataframe[label].isin(rare_classes)]
        common_data = dataframe[~dataframe[label].isin(rare_classes)]

        # Train-test split for common classes
        train_common, test_common = train_test_split(
            common_data,
            test_size=0.2,
            random_state=2024,
            stratify=common_data[label]
        )

        # Combine rare data with training data
        trainset = pd.concat([train_common, rare_data], ignore_index=True)
        testset = test_common.reset_index(drop=True)

        # Extract features and labels
        X_train = trainset[feature_column]
        X_test = testset[feature_column]
        y_train = trainset[label]
        y_test = testset[label]

        # Ensure LabelEncoder sees all classes (both train and test)
        label_encoder = LabelEncoder()
        label_encoder.fit(pd.concat([y_train, y_test], ignore_index=True))
        y_train_encoded = label_encoder.transform(y_train)
        y_test_encoded = label_encoder.transform(y_test)

        # Define TfidfVectorizer
        vectorizer = TfidfVectorizer(
            strip_accents='unicode',
            analyzer='char',
            ngram_range=(2, 5),
            max_df=0.5,
            min_df=5,
            max_features=5000
        )
        vectorizers[label] = vectorizer

        # Transform features using TF-IDF
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train the XGBoost classifier
        classifier = XGBClassifier(
            eval_metric='mlogloss',
            objective='multi:softmax',
            max_depth=6,
            n_estimators=50,
            learning_rate=0.2,
            random_state=2024
        )
        classifier.fit(X_train_tfidf, y_train_encoded)

        # Store the trained classifier and label encoder
        classifiers[label] = {
            'model': classifier,
            'label_encoder': label_encoder
        }

        # Store test data for custom metric computation
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test_encoded,
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        xgb_f1 = f1_score(y_test_encoded, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {xgb_f1:.3f}")

        # Decode predictions and test labels
        predictions_decoded = label_encoder.inverse_transform(predictions)
        y_test_decoded = label_encoder.inverse_transform(y_test_encoded)

        # Generate and print classification report
        report = classification_report(y_test_decoded, predictions_decoded, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports_initial/xgboost', exist_ok=True)
        report_path = f'reports_initial/xgboost/xgb_classifier_report_{label}_{feature_column}.txt'
        with open(report_path, 'w') as report_file:
            report_file.write(f"Classification Report for {label}_{feature_column}:\n")
            report_file.write(report)
            report_file.write(f"F1 Score: {xgb_f1:.3f}\n")

    # Compute custom metrics for subtasks
    custom_metrics['subtask_1'] = compute_score(
        test_data['hazard-category']['y_test'],
        test_data['product-category']['y_test'],
        classifiers['hazard-category']['model'].predict(test_data['hazard-category']['X_test_tfidf']),
        classifiers['product-category']['model'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
        test_data['hazard']['y_test'],
        test_data['product']['y_test'],
        classifiers['hazard']['model'].predict(test_data['hazard']['X_test_tfidf']),
        classifiers['product']['model'].predict(test_data['product']['X_test_tfidf'])
    )

    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics


In [85]:
classifiers_title, vectorizers_title, custom_metrics_title = train_xgboost_classifiers(df_initial, 'title') # todo 2 

Training classifier for label: hazard-category
F1 Score for hazard-category: 0.666
                                precision    recall  f1-score   support

                     allergens       0.86      0.91      0.88       371
                    biological       0.78      0.94      0.85       344
                      chemical       0.91      0.51      0.65        57
food additives and flavourings       1.00      0.60      0.75         5
                foreign bodies       0.82      0.72      0.77       111
                         fraud       0.87      0.54      0.67        74
                     migration       1.00      1.00      1.00         1
          organoleptic aspects       0.80      0.40      0.53        10
                  other hazard       0.50      0.19      0.28        26
              packaging defect       0.67      0.18      0.29        11

                      accuracy                           0.82      1010
                     macro avg       0.82      0.60

### Genarl Notes 
- All classifiers has better performance than the baselines (that is majority and random classifier via obsering per sub task and f1 scores).
- Due to the Imbalance all classifiers seem to have overfit s1-score per class has low numbers in combiantion with low accuracy.
- Regarding the comparison of f1 scores and custom evaluation the Logistic Regression has the lowest performance. Random Forest in sub task 2 ahs a few better score, but x-boost har more in sub task 1.
- `Out best classifier for the input title is x-boost`.

### Quick Comment in comparison with results of augmebted data 
- Despite the fact that classifier performed better performance in reports, and in competition quite low score, we can assume that the dta augmented strategy we followed "mislead" the model from teh real data.

### Part B. Benchmark Analysis Text

In [50]:
classifiers, vectorizers, custom_metrics = train_log_regression_classifiers(df_initial, 'text')

Training classifier for label: hazard-category




F1 Score for hazard-category: 0.465
                                precision    recall  f1-score   support

                     allergens       0.94      0.97      0.95       371
                    biological       0.87      0.96      0.91       344
                      chemical       0.73      0.61      0.67        57
food additives and flavourings       1.00      0.20      0.33         5
                foreign bodies       0.77      0.96      0.86       111
                         fraud       0.81      0.53      0.64        74
                     migration       0.00      0.00      0.00         1
          organoleptic aspects       0.00      0.00      0.00        10
                  other hazard       0.29      0.08      0.12        26
              packaging defect       1.00      0.09      0.17        11

                      accuracy                           0.87      1010
                     macro avg       0.64      0.44      0.46      1010
                  weighted



F1 Score for product-category: 0.261
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.00      0.00      0.00        12
                      cereals and bakery products       0.39      0.65      0.49       133
     cocoa and cocoa preparations, coffee and tea       0.60      0.50      0.55        42
                                    confectionery       0.50      0.03      0.06        34
dietetic foods, food supplements, fortified foods       0.57      0.15      0.24        26
                                    fats and oils       0.00      0.00      0.00         4
                                   feed materials       0.00      0.00      0.00         1
                   food additives and flavourings       0.00      0.00      0.00         2
                           food contact materials       0.00      0.00      0.00         1
                            fruits and vegetables   



F1 Score for hazard: 0.164
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         2
                                   abnormal smell       0.00      0.00      0.00         1
                                  alcohol content       0.00      0.00      0.00         1
                                        alkaloids       0.00      0.00      0.00         1
                                        allergens       0.00      0.00      0.00         3
                                           almond       0.57      0.62      0.59        13
                           antibiotics, vet drugs       0.00      0.00      0.00         1
                                    bacillus spp.       0.00      0.00      0.00         3
                             bad smell / off odor       0.00      0.00      0.00         1
                                    bone fragment       0.00  



F1 Score for product: 0.045
                                                         precision    recall  f1-score   support

                                 Catfishes (freshwater)       0.00      0.00      0.00         2
                                  Fishes not identified       0.09      0.43      0.15         7
                               Not classified pork meat       0.00      0.00      0.00         2
                             Pangas catfishes (generic)       0.00      0.00      0.00         1
                    Precooked cooked pork meat products       0.00      0.00      0.00         2
                                          Veggie Burger       0.00      0.00      0.00         1
                                    alcoholic beverages       0.00      0.00      0.00         1
                                        alfalfa sprouts       0.00      0.00      0.00         2
                                                  algae       0.00      0.00      0.00         1
 

In [51]:
train_random_forest_classifiers(df_initial, 'text')

Training classifier for label: hazard-category
F1 Score for hazard-category: 0.518
                                precision    recall  f1-score   support

                     allergens       0.95      0.99      0.97       371
                    biological       0.91      0.97      0.94       344
                      chemical       0.75      0.63      0.69        57
food additives and flavourings       1.00      0.20      0.33         5
                foreign bodies       0.83      0.99      0.90       111
                         fraud       0.78      0.58      0.67        74
                     migration       0.00      0.00      0.00         1
          organoleptic aspects       0.25      0.10      0.14        10
                  other hazard       0.50      0.31      0.38        26
              packaging defect       1.00      0.09      0.17        11

                      accuracy                           0.89      1010
                     macro avg       0.70      0.49

- We have save in dictioanry format the 4 classifiers (both models and label encoder), vectorizers_text and custom metrics. 
    - So we will call teh function and we  will assign the return values to corresponding variables. 
    - `classifiers_text, vectorizers_text, custom_metrics_text = train_xgboost_classifiers(df_initial, 'text') # here todo 3`
- We will need the first two to predict teh unlabeled publish_display_data


In [87]:
classifiers_text, vectorizers_text, custom_metrics_text = train_xgboost_classifiers(df_initial, 'text') # here todo 3 

Training classifier for label: hazard-category
F1 Score for hazard-category: 0.601
                                precision    recall  f1-score   support

                     allergens       0.95      0.99      0.97       371
                    biological       0.92      0.97      0.94       344
                      chemical       0.84      0.82      0.83        57
food additives and flavourings       1.00      0.20      0.33         5
                foreign bodies       0.92      0.98      0.95       111
                         fraud       0.84      0.64      0.72        74
                     migration       0.00      0.00      0.00         1
          organoleptic aspects       0.67      0.20      0.31        10
                  other hazard       0.54      0.50      0.52        26
              packaging defect       1.00      0.27      0.43        11

                      accuracy                           0.91      1010
                     macro avg       0.77      0.56

- Comparing the custom evaluation scores and fi-macro average scores between the text clsssifers we can observe that again x-boost has better performance.
- Furthermore comparing X-boost witH Input text and X-boost with Input title  we can observe thatX-boost wit h input "text" has better custom scores than "title".

### Predict for the best model without additional tuning to have an overall view of the score


In [88]:
classifiers_text

{'hazard-category': {'model': XGBClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bynode=None,
                colsample_bytree=None, device=None, early_stopping_rounds=None,
                enable_categorical=False, eval_metric='mlogloss',
                feature_types=None, gamma=None, grow_policy=None,
                importance_type=None, interaction_constraints=None,
                learning_rate=0.2, max_bin=None, max_cat_threshold=None,
                max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
                max_leaves=None, min_child_weight=None, missing=nan,
                monotone_constraints=None, multi_strategy=None, n_estimators=50,
                n_jobs=None, num_parallel_tree=None, objective='multi:softmax', ...),
  'label_encoder': LabelEncoder()},
 'product-category': {'model': XGBClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bynod

- Here we can see each indivisual classifiers used x-boost with text data what hyperparametrs has. 
    - because we did not applyied hyperparameter tunign the parameters are known beforehad as we defined them.

### Predict Unlabelled Data

- We will use the fisrt returns variable taht is `classifiers _text`
    - it is nested dictionary so we will assign its values to each individual classifier with corressponding names.

In [115]:
#predict for hazard category : # Access specific classifiers of X-Boost - Tuned : 
hazard_classifier_text = classifiers_text['hazard'] # classifier for hazard 
product_classifier_text = classifiers_text['product'] # classifier for produvt 
hazard_category_classifier_text = classifiers_text['hazard-category'] # classifier for hazard-cat
product_category_classifier_text = classifiers_text['product-category'] # xlassifier for product cat 

### Predict ST2 (X-Boost Text)

- Firstly, we will predict st2 (the vectors for product and hazard).

### Predict Hazard (ST2 part)

In [119]:
vectorizer_text_hazard = vectorizers_text['hazard']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
vectorizer_text_hazard.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
X_val_hazard_text=vectorizer_text_hazard.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
model_hazard_classifier_text = hazard_classifier_text['model'] # save trained model X-Boost 
label_encoder_hazard_classifier_text = hazard_classifier_text['label_encoder'] # Save trained model label_encoder 
model_hazard_classifier_text.predict(X_val_hazard_text)

array([ 57,  57,  57,  57,  57,  73,  17,  17,  75, 109,  73,  73,  87,
        87,  87,  17,  57,  17,  57, 100,  57,  57,  73,  73,  57,  17,
        57,  17,  73,  90,  57,  57,  58,  36,  17,  17,  57,  17,  87,
        75,  87,  57,  87,  57,  73,  57,  17,  87,  17,  70,  87,  17,
        85,  17,  73,  57,  57,  73,  73,  57,  17,  73,  57,  98,  98,
        98,  98,  98,  73,  57,  57,  57,  55,  57,  98,  98,  70,  57,
        98, 109,  87,  73,  57,  57,  98,  55,  17,  17,  57,  17,  98,
        57,  57,  73,  57,  98,  73,  98,  17,  36,  98,  17,  57,  57,
        57,  98,  57,  97,  57,  55,  17,  57,  55,  55,  98,  17,  17,
        98,  98,  73,  98,  98,  98,  98,  98,  87,  55,  57,  98,  57,
        97,  98,  55,  55,  87,  87,  57,  73,  55,  17,  87,  98,  98,
        57,  55,   5,  55,  57,  98,  55,  17,  17,  55,  55,  55,  55,
        57,  98,  57,  57,  57,  55,  36,  98,  98,  55,  98,  17,  98,
        98,  57,  57,  55,  17,  97,  98,  87,  57,  87,  55,  5

In [121]:
model_hazard_classifier_text.predict(X_val_hazard_text).shape[0]
predictions_hazard_text= model_hazard_classifier_text.predict(X_val_hazard_text)
predictions_named_hazard_text = label_encoder_hazard_classifier_text.inverse_transform(predictions_hazard_text)  # Get original class names        predictions_dict[label] = predictions_named  # Store predictions with names
print(predictions_named_hazard_text.shape[0])
predictions_named_hazard_text

565


array(['metal fragment', 'metal fragment', 'metal fragment',
       'metal fragment', 'metal fragment', 'other',
       'cereals containing gluten and products thereof',
       'cereals containing gluten and products thereof',
       'other not classified', 'sulphur dioxide and sulphites', 'other',
       'other', 'phenylpyrazole', 'phenylpyrazole', 'phenylpyrazole',
       'cereals containing gluten and products thereof', 'metal fragment',
       'cereals containing gluten and products thereof', 'metal fragment',
       'soybeans and products thereof', 'metal fragment',
       'metal fragment', 'other', 'other', 'metal fragment',
       'cereals containing gluten and products thereof', 'metal fragment',
       'cereals containing gluten and products thereof', 'other',
       'plastic fragment', 'metal fragment', 'metal fragment',
       'microbiological contamination', 'escherichia coli',
       'cereals containing gluten and products thereof',
       'cereals containing gluten and pr

### Predict Product (ST2)

In [122]:
vectorizer_text_product = vectorizers_text['product']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
vectorizer_text_product.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
X_val_product_text=vectorizer_text_product.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
model_product_classifier_text = product_classifier_text['model'] # save trained model X-Boost 
label_encoder_product_classifier_text = product_classifier_text['label_encoder'] # Save trained model label_encoder 
model_product_classifier_text.predict(X_val_hazard_text)

array([ 690,  820,  690,  530,  690,   67,  711,   67,  114,  167,  260,
        814,  303,  303,  782,  117,  530,  540,  303,  114,  820,   67,
        731,   49,  728,   67,  270,  260,  617,   67,   67,  260,  617,
         67,   67,  863,  114,  260,   92,  260,   67,  260,   88,  863,
         88,  530,  114,  114,  260,  690,  690,  502,  108, 1018,   58,
        728, 1018,  114,  270,  114,  734,  502,  782,  114,  711,  117,
        814,  814,  814,  742,  303,  150,   67,  138,  711,  150,   82,
        260,  870,  111,  303,  734,   67,  260,  260,  711,   67,  530,
        530,  530,  117, 1018,  530,  114,  863,  260,   19,  138,  530,
        530,  117,  711,  260,  711,  530,  530,  540,  167,  530,  863,
        530,   67,  530,  530,  649,   67,  530,  711,  117,  690,  430,
        114,  782,  114,  530,  530,  530,  782,  863,  114,  167,  711,
        530,  530,  530,  679,  502,  114,  530,  530,  810,  530,  863,
        260,  530,  167,  530,  814,  439,  530,  7

- X-boost return the predisctiona in labeled encoded format
- We need to re-transofrm to strign format. (taht is teh usage of label encodered saved value)

In [123]:

model_product_classifier_text.predict(X_val_product_text).shape[0]
predictions_product_text= model_product_classifier_text.predict(X_val_hazard_text)
predictions_named_product_text = label_encoder_product_classifier_text.inverse_transform(predictions_hazard_text)  # Get original class names        predictions_dict[label] = predictions_named  # Store predictions with names
print(predictions_named_product_text.shape[0])
predictions_named_product_text

565


array(['beans', 'beans', 'beans', 'beans', 'beans', 'biscuits',
       'almond milk', 'almond milk', 'black caviar', 'brussel sprouts',
       'biscuits', 'biscuits', 'bottled mineral water',
       'bottled mineral water', 'bottled mineral water', 'almond milk',
       'beans', 'almond milk', 'beans', 'brie cheese', 'beans', 'beans',
       'biscuits', 'biscuits', 'beans', 'almond milk', 'beans',
       'almond milk', 'biscuits', 'bovine meat and offal', 'beans',
       'beans', 'beef', 'baby food pouches', 'almond milk', 'almond milk',
       'beans', 'almond milk', 'bottled mineral water', 'black caviar',
       'bottled mineral water', 'beans', 'bottled mineral water', 'beans',
       'biscuits', 'beans', 'almond milk', 'bottled mineral water',
       'almond milk', 'beverage base of non-fruit origin, liquid',
       'bottled mineral water', 'almond milk', 'bolognese sauce',
       'almond milk', 'biscuits', 'beans', 'beans', 'biscuits',
       'biscuits', 'beans', 'almond milk', '

- Fine they are converted to string format.

### Save CSV Product and Hazard  (ST2)

- Save the `csv` predictions for the unlabeled for the sub task 1. 

In [129]:
# predictions_named_hazard_text, predictions_named_product_text
# Create folder structure if it doesn't exist
base_dir = "data_initial_submission/st2_initial"
os.makedirs(base_dir, exist_ok=True)
# Hazard product data
data_st2 = {
    "hazard": predictions_named_hazard_text,
    "product": predictions_named_product_text
}
df_st2 = pd.DataFrame(data_st2)
# Save to a CSV
filecsv_file_path = os.path.join(base_dir, "submission.csv")
df_st2.to_csv(filecsv_file_path, index=False)

### Predict ST1 (Product and Hazard Category)

- firstly find predicited labels for `hazard-category`

In [130]:
vectorizer_text_hazard_category = vectorizers_text['hazard-category']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
vectorizer_text_hazard_category.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
X_val_hazard_category_text=vectorizer_text_hazard_category.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
model_hazard_category_classifier_text = hazard_category_classifier_text['model'] # save trained model X-Boost 
label_encoder_hazard_category_classifier_text = hazard_category_classifier_text['label_encoder'] # Save trained model label_encoder 
model_hazard_category_classifier_text.predict(X_val_hazard_category_text)

array([4, 4, 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5,
       4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1,
       4, 1, 1, 1, 2, 1, 4, 1, 2, 1, 5, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 4, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 1, 1, 1, 1, 1,
       1, 4, 1, 1, 4, 1, 1, 1, 4, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1,
       1, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 5, 1, 1, 1, 4, 4, 1, 1, 1, 4,
       1, 1, 2, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 4,
       1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 1, 1, 1,
       1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 4, 1, 4, 1, 5, 4, 1, 1, 1,
       1, 1, 1, 1, 1, 4, 1, 1, 4, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 4, 1, 4, 1,
       1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 4,

- again the values are encoded duw to label encoder that x-boost internally used.
- we will re-call label encoder to convert to string the digits.

In [131]:
predictions_hazard_category_text= model_hazard_category_classifier_text.predict(X_val_hazard_category_text)
predictions_named_hazard_category_text = label_encoder_hazard_category_classifier_text.inverse_transform(predictions_hazard_category_text)  # Get original class names        predictions_dict[label] = predictions_named  # Store predictions with names
print(predictions_named_hazard_text.shape[0])
predictions_named_hazard_category_text

565


array(['foreign bodies', 'foreign bodies', 'foreign bodies', 'biological',
       'foreign bodies', 'biological', 'biological', 'biological',
       'biological', 'biological', 'biological', 'biological',
       'biological', 'biological', 'biological', 'biological',
       'biological', 'biological', 'biological', 'biological',
       'biological', 'fraud', 'foreign bodies', 'biological',
       'biological', 'biological', 'biological', 'biological',
       'biological', 'biological', 'biological', 'biological',
       'biological', 'biological', 'biological', 'biological',
       'biological', 'biological', 'biological', 'biological',
       'biological', 'foreign bodies', 'biological', 'biological',
       'foreign bodies', 'biological', 'biological', 'biological',
       'chemical', 'biological', 'foreign bodies', 'biological',
       'chemical', 'biological', 'fraud', 'biological', 'biological',
       'biological', 'fraud', 'biological', 'biological', 'biological',
       'biolog

- secodnly find predictive values  for product category 

- firstly find predicited labels for `product-category`

In [132]:
vectorizer_text_product_category = vectorizers_text['product-category']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
vectorizer_text_product_category.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
X_val_product_category_text=vectorizer_text_product_category.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
model_product_category_classifier_text = product_category_classifier_text['model'] # save trained model X-Boost 
label_encoder_product_category_classifier_text = product_category_classifier_text['label_encoder'] # Save trained model label_encoder 
model_product_category_classifier_text.predict(X_val_product_category_text)

array([13, 13,  2, 13, 13, 18, 13, 13, 13, 13, 13, 13, 13, 13, 13,  1, 12,
       13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
       12, 12, 18, 13, 13, 13, 13, 19, 13, 12, 13,  1,  9, 13, 13, 13, 13,
       13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 12, 12,  9, 20, 12,
       13,  1, 13, 13, 13, 12, 13, 13, 13, 13,  9, 13,  9, 18, 13, 19, 13,
       13, 13, 13, 12, 12, 13, 13, 13, 13,  9, 13, 19, 13, 13,  1, 13, 20,
       13,  9, 13, 18, 13,  9, 13, 13,  9, 13, 13, 13, 13, 13, 13, 13, 13,
       18, 12, 13, 12, 12, 18, 13, 12, 12,  9, 19,  9, 12, 12, 13, 13, 13,
       13, 13, 12,  1, 13, 12, 12,  9, 13, 13, 12, 13, 13, 12, 13, 13, 12,
       12, 12, 12,  1,  1,  1,  1,  1, 12, 13, 13, 12, 12, 13,  1, 20,  1,
        1,  1, 12, 13, 13, 12, 13,  1, 13, 12, 13, 13,  2, 13,  1,  1, 12,
       13,  1,  1, 13, 13, 13,  1, 13, 13, 13,  1, 18, 13, 13, 13, 12, 13,
       13, 13, 18, 18,  9, 13, 13, 12, 18, 19, 13, 18, 12, 12, 13, 12, 13,
       12,  1, 12,  9, 12

- again the format is in encoded 
- we need to racall label encoder of x-boost to convert digits to string 

In [134]:
predictions_product_category_text= model_product_category_classifier_text.predict(X_val_product_category_text)
predictions_named_product_category_text = label_encoder_product_category_classifier_text.inverse_transform(predictions_product_category_text)  # Get original class names        predictions_dict[label] = predictions_named  # Store predictions with names
print(predictions_product_category_text.shape[0])
predictions_named_product_category_text

565


array(['meat, egg and dairy products', 'meat, egg and dairy products',
       'cocoa and cocoa preparations, coffee and tea',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'prepared dishes and snacks', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'cereals and bakery products', 'ices and desserts',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products

### Save ST1 (Hazard-Category Product Category)

In [135]:
# predictions_named_hazard_text, predictions_named_product_text

# Create folder structure if it doesn't exist
base_dir = "data_initial_submission/st1_initial"
os.makedirs(base_dir, exist_ok=True)

# Hazard product data
data_st1 = {
    "hazard-category": predictions_named_hazard_category_text,
    "product-category": predictions_named_product_category_text
}
df_st1 = pd.DataFrame(data_st1)

# Save to a CSV
filecsv_file_path = os.path.join(base_dir, "submission.csv")
df_st1.to_csv(filecsv_file_path, index=False)

x boost -> (ValueError: y contains previously unseen labels: 'meat preparations') need startisfy 
- leave-one-out scenario for rare classes in your dataset. and second reason for consistency

# Understanding the Difference in Behavior Between Random Forest and XGBoost

When working with machine learning models like Random Forest and XGBoost, their behavior can vary significantly, especially regarding handling class distributions, missing labels, and unseen data. Here's a breakdown of these differences:

---

## 1. Random Forest Behavior Without Stratify

### How It Handles Classes
- Random Forest does not explicitly require all classes in the training data to match those in the test data.
- It learns a set of decision trees, and if a class is underrepresented or absent in the training data, the model will simply predict probabilities for the known classes.
- It doesn't fail if a label in the test set wasn’t present in training, though predictions for that class will be meaningless.

### Why No Error Without Stratify
- Even if `stratify` is not used, Random Forest trains and predicts based on the classes it has seen in training.
- If a class is missing in training but appears in the test set, Random Forest will not raise an error; however, predictions for that unseen class will be invalid.
- Random Forest does not validate test labels against training labels, so the absence of a class in training doesn’t cause an issue.

---

## 2. XGBoost Behavior Without Stratify

### Strict Label Handling
- XGBoost requires that all classes in the test set are present in the training data.
- If it encounters a label during prediction that was not seen during training, it raises a `ValueError` because it cannot map the unseen label to its internal structure of classes.

### Why It Fails Without Stratify
- Without `stratify`, your train-test split can result in some classes being entirely absent from the training set.
- If a label (e.g., `'meat preparations'`) is missing in training but appears in the test set, XGBoost fails with a `ValueError` since it expects a complete mapping of all labels.

---

## Key Differences Between Random Forest and XGBoost

| **Aspect**                | **Random Forest**                                                                 | **XGBoost**                                                                 |
|---------------------------|-----------------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| **Class Dependency**       | Can train and predict even if some classes are missing in training.               | Fails if a test class was not seen during training.                         |
| **Error on Unseen Labels** | Does not raise an error, but predictions for unseen classes will be invalid.      | Raises a `ValueError` if test labels contain classes not in training.       |
| **Handling Class Imbalance** | Works well even with imbalance; doesn’t require `stratify`.                     | Sensitive to imbalance; not using `stratify` can lead to unseen classes and errors. |
| **Internal Representation** | Does not require all classes to be explicitly listed or mapped.                  | Requires a strict mapping of all labels between training and test data.     |

---

## Takeaways
- **Random Forest** is more flexible but may produce invalid predictions for unseen labels.
- **XGBoost** is strict about label consistency, requiring all test labels to be represented during training.
- To avoid issues, always use **stratification** during train-test splitting when dealing with imbalanced datasets or small class distributions.
- Similarly for logistic regression.

- Scores Competition 
    - Sub Task 1 : `0.0710` (stemming from competition score data 27 nOVEMEBR 2024) 
    - Sub Task 2 : `0.0057` (stemming from competition score data 27 nOVEMEBR 2024) 
- better than fake data that were both scores extremely close to 0, but still they are low scores  with input the intiial data. 

### Hyperparameter Tuning for X-Boost Text 