## Main Notebook for Benchmark Analysis - Training and Evaluation 

- Note this Jupyetr Noebook handles this initial data of the competition with only basic NLP preprocess without data augmented dut to the discouraged results in the competition score. 

- This Jupyter Notebook contains the benchmark analysis based firstly on "title" as input and secondly on "text" as input. 
- Our aim is to detect the best model (`LogisticRegression`, `Random Forest` ans `X-Boost`) with the best Input "title" or "text".
- Then we will try to improve based on hyperparameter tuning technoques only the model that was detected as the best one.
    - Note: Additioanlly, baselined models (majority and radnom classifers) created in order that we detct if a model predicts based on exactly random ness or on mode / frequent values. 
        - We want our models to outpermorf the evaluations metrics of these baselines.
---
> Evangelia P. Panourgia, Master Student in Data Science, AUEB <br />
> Department of Informatics, Athens University of Economics and Business <br />
> eva.panourgia@aueb.gr <br/><br/>


In [1]:
!pip install  xgboost



In [2]:
import os
import pandas as pd
# import nltk
import string
import random
import numpy as np
from sklearn.metrics import f1_score, classification_report
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import LabelEncoder
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import joblib  # For saving and loading models
from sklearn.ensemble import RandomForestClassifier

### Load Data 
- We will load the preprocessed data (`data_nlp_incidents_train.csv`) being pre-processed with only basic nlp preprocess.
- Furthermore, we will load the unlabeleed data of the competition in ordeer to predict them (`incidents.csv`).

In [3]:
df_initial = pd.read_csv('data/data_nlp_incidents_train.csv') # load data after data augmentation
testset_competition = pd.read_csv('data/incidents.csv', index_col=0) # load testing data (conception phase, unlabeled):

In [4]:
df_initial = df_initial[['title','text','hazard-category','product-category','hazard','product']]
print(df_initial.shape[0])
df_initial.head(3) # preview preproccessed data 

5046


Unnamed: 0,title,text,hazard-category,product-category,hazard,product
0,recal notif fsis-024-94,case number 024-94 date open 07/01/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,smoked sausage
1,recal notif fsis-033-94,case number 033-94 date open 10/03/1994 date c...,biological,"meat, egg and dairy products",listeria spp,sausage
2,recal notif fsis-014-94,case number 014-94 date open 03/28/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices


- The augmentred data were `~13000`

## Baselines
- Benchmark analysis is crucial for evaluating classification performance in multiclass imbalance settings because it provides reference points for how well your model is performing relative to simple baseline classifiers. The `Random Classifier` and `Majority Classifier` are commonly used as benchmarks for the following reasons:

### Random Classifier 
- A Random Classifier predicts class labels randomly, with **uniform** based on the distribution of classes. It sets a minimal baseline and helps understand:

- `Baseline Performance`: This represents the expected performance `without learning from the data`. `If a model performs worse than a random classifier, it indicates either issues in the model or unsuitable features`.

- `Chance Levels`: It shows what performance you'd `get by chance alone`, especially useful for imbalanced datasets where metrics like accuracy can be misleading.


### Majority Classifier

- A Majority Classifier always **predicts the majority class** (`the class with the highest frequency in the training data`). 

- It helps understand:

    - `Handling Imbalance`: In multiclass imbalanced datasets, accuracy can be dominated by the majority class. The majority classifier provides a baseline to compare how well your model captures minority classes.
    - `Baseline of Naïve Solutions`: The majority classifier reflects the simplest possible rule for prediction. If a model's performance is close to that of a majority classifier, it suggests the model is failing to generalize or adapt to the minority classes.
    - `Focus on Class Imbalance`: Metrics like weighted accuracy, balanced accuracy, or macro-F1 score should be significantly better than those achieved by the majority classifier to indicate that a model is addressing imbalance effectively.

- Note in the following code cell I implement the code for Random and Majority Classifier, in order to have a high level of "logic" we added the split steps of trainingtest set, but for example for the Random Classifier it is useless as it is not affected from the input, dont learn from data.
    - Hoever, this "skeleton" is useful for the reamaining algorythms to buils in (both traditional and advanced) 

- More specifically, 

    - Random Classifier  Effect of X: The X values (features) **do not influence the random classifier's predictions**. It does not learn from the data in the feature column. Its predictions are purely random, so changing X will not alter its performance.
    - Majority Classifier Effect of X: The feature column X is ignored by the majority classifier, as it does not use features for prediction. Instead, it looks only at the distribution of y in the training data.

### Regarding the Implementation 
- The `DummyClassifier in scikit-learn` is a baseline model designed to evaluate classification algorithms by comparing them against simplistic strategies. These strategies provide minimal logic to make predictions and are often used as benchmarks to understand how well a more complex model performs.
    - `strategy="uniform"` (for Random Classifier): 
        - Predicts a class randomly and uniformly across all possible classes.
        - Each class has an equal probability of being selected, irrespective of the class distribution in the training data.
        - Use Case: Ideal for scenarios where you want to simulate random guessing.
    - `strategy="most_frequent"` (for Majoriry Classification)
        - Always predicts the most frequent class observed in the training data.
        - Ignores the input features entirely and focuses only on the training set's class distribution.
        - Use Case: Useful for understanding how well a naive baseline would perform if you simply predicted the majority class.

In [5]:
def evaluate_baselines(dataframe, feature_column):
    """
    Function to evaluate random and majority classifiers on a given dataframe.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.
    """
    np.random.seed(42)  # For reproducibility

    # Train-test split with optional stratification
    trainset, testset = train_test_split(
        dataframe, 
        test_size=0.2, 
        random_state=2024, 
        # "skeleton" for the main algo here add stratisfy to hold proportion of classes 
    )
   
    # Random and Majority classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Evaluating for label: {label}")

        # Features and target
        X_train = trainset[feature_column]
        y_train = trainset[label]
        X_test = testset[feature_column]
        y_test = testset[label]

        # Random Classifier
        random_clf = DummyClassifier(strategy="uniform", random_state=2024)
        random_clf.fit(X_train, y_train) # it is uselless X stimulate the logic of a real algo. 
        testset['predictions-random-' + label] = random_clf.predict(X_test)

        # Majority Classifier
        majority_clf = DummyClassifier(strategy="most_frequent")
        majority_clf.fit(X_train, y_train)# it is uselless X stimulate the logic of a real algo. 
        testset['predictions-majority-' + label] = majority_clf.predict(X_test)

        # Compute F1 scores
        random_f1 = f1_score(y_test, testset['predictions-random-' + label], average='macro', zero_division=0)
        majority_f1 = f1_score(y_test, testset['predictions-majority-' + label], average='macro', zero_division=0)

        print(f"F1 Score for Random Classifier ({label}): {random_f1:.3f}")
        print(f"F1 Score for Majority Classifier ({label}): {majority_f1:.3f}")

        # Generate and save classification reports
        os.makedirs('reports_initial/random', exist_ok=True)
        os.makedirs('reports_initial/majority', exist_ok=True)

        random_report = classification_report(y_test, testset['predictions-random-' + label], zero_division=0)
        majority_report = classification_report(y_test, testset['predictions-majority-' + label], zero_division=0)

        with open(f'reports_initial/random/random_classifier_report_{label}.txt', 'w') as random_file:
            random_file.write(f"Classification Report for Random Classifier ({label}):\n")
            random_file.write(random_report)

        with open(f'reports_initial/majority/majority_classifier_report_{label}.txt', 'w') as majority_file:
            majority_file.write(f"Classification Report for Majority Classifier ({label}):\n")
            majority_file.write(majority_report)
        
        
    
    # Custom metric score calculation
    def compute_score(hazards_true, products_true, hazards_pred, products_pred):
        """
        Custom scoring function to compute the macro F1 score for hazards and products.
        
        Args:
            hazards_true: Ground truth labels for hazards.
            products_true: Ground truth labels for products.
            hazards_pred: Predicted labels for hazards.
            products_pred: Predicted labels for products.
        
        Returns:
            A float representing the combined macro F1 score.
        """
        f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)
        f1_products = f1_score(
            products_true[hazards_pred == hazards_true],
            products_pred[hazards_pred == hazards_true],
            average='macro', 
            zero_division=0
        )
        return (f1_hazards + f1_products) / 2.

    # Example of calculating scores for Sub-Tasks (if needed):
    # Uncomment the following lines to compute scores for tasks
    print(f"Score Sub-Task 1 - Random Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-random-hazard-category'], testset['predictions-random-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Random Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-random-hazard'], testset['predictions-random-product-category']):.3f}")
    print(f"Score Sub-Task 1 - Majority Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-majority-hazard-category'], testset['predictions-majority-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Majority Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-majority-hazard'], testset['predictions-majority-product']):.3f}")

# Call the function with the required dataframe 
evaluate_baselines(df_initial, feature_column='text')

Evaluating for label: hazard-category
F1 Score for Random Classifier (hazard-category): 0.063
F1 Score for Majority Classifier (hazard-category): 0.060
Evaluating for label: product-category
F1 Score for Random Classifier (product-category): 0.028
F1 Score for Majority Classifier (product-category): 0.022
Evaluating for label: hazard
F1 Score for Random Classifier (hazard): 0.005
F1 Score for Majority Classifier (hazard): 0.002
Evaluating for label: product
F1 Score for Random Classifier (product): 0.000
F1 Score for Majority Classifier (product): 0.000
Score Sub-Task 1 - Random Classifier: 0.051
Score Sub-Task 2 - Random Classifier: 0.002
Score Sub-Task 1 - Majority Classifier: 0.039
Score Sub-Task 2 - Majority Classifier: 0.002


- The analysis of scores are simlliar with that conatined to teh jupyter `augmented_training_process`
    - We want our main models to have better performance of these scoresin order to predict better than randomeness and better than predicrting only the dominant (mode/frequent) values. 

## Strategy for Model Selection and Evaluation

### Overview
Our approach involves `systematically evaluating three machine learning algorithms on two input types`, **"title"** and **"text"**. The goal is to identify the `b`est-performing model based on evaluation metrics and a custom competition evaluation metric`. Due to time and memory constraints, we set specific parameter values for each algorithm after initial manual investigation. This allows us to efficiently gain an overview of model performance and make informed decisions about **further optimization**.

---

### Step-by-Step Strategy

#### 1. Initial Model Evaluation with "Title" Input
- **Algorithms Tested**:
  - `Logistic Regression`
  - `Random Forest`
  - `XGBoost`
- **Parameter Setting**:
  - Parameters for each algorithm are manually tuned based on preliminary analysis to balance performance and computational efficiency.
- **Evaluation**:
  - Models are assessed using:
    - Standard evaluation metrics (e.g., accuracy, precision, recall, F1-score).
    - A custom evaluation metric provided by the competition.
- **Objective**:
  - Identify the most `promising algorithm` based on "title" input.

---

#### 2. Evaluation with "Text" Input
- **Algorithms Tested**:
  - Logistic Regression
  - Random Forest
  - XGBoost
- **Evaluation**:
  - The same evaluation metrics and competition-specific metric are applied as in the "title" input analysis.
- **Objective**:
  - Determine the best-performing algorithm for the "text" input.

---

#### 3. Comparison of Best Models
- The top-performing models from the **"title"** and **"text"** input evaluations are compared.
- **Selection**:
  - Based on their performance across all metrics, the superior model is selected.

---

#### 4. Optimization of the Final Model
- The chosen algorithm undergoes parameter optimization to refine its performance.
- **Constraints**:
  - Due to time limitations, cross-validation (e.g., K-fold validation) will not be applied.
  - Instead, a streamlined validation approach is used to ensure efficient optimization without excessive computational overhead.

---

#### 5. Baseline Comparison
- Throughout the process, model performance is benchmarked against baseline models:
  - **Random Prediction**: A model that predicts randomly.
  - **Majority Class Prediction**: A model that always predicts the most frequent class.
- **Objective**:
  - Provide context for evaluating the added value of the trained algorithms.

---

### Summary
This structured methodology ensures a thorough evaluation of multiple algorithms across different input types, with a focus on balancing computational efficiency and performance. By the end of this process, the goal is to identify the best-performing algorithm with the best input title or text and optimize it for deployment within the constraints of time and resources.

### Part A. Benchmark Analysis Title


- Firstly,we will include the custom evaluation metric provided by the competition page.
    - It will be used for as evaluation part of all algorythms.

In [6]:
## Helping function for the calculation of the custom evaluation for the subtasks.
def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    """
    Compute a custom F1 score that considers hazards and products together.
    """
    # Reset indices to ensure alignment
    hazards_true = hazards_true.reset_index(drop=True)
    products_true = products_true.reset_index(drop=True)
    hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
    products_pred = pd.Series(products_pred).reset_index(drop=True)

    # Compute F1 for hazards
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)

    # Compute F1 for products, only where hazards predictions match ground truth
    mask = hazards_pred == hazards_true
    f1_products = f1_score(
        products_true[mask],
        products_pred[mask],
        average='macro',
        zero_division=0
    )

    # Return the combined metric
    return (f1_hazards + f1_products) / 2.

- Due to few instances of classes we can not apply heree stratisfy, also we will not use SMOTE because of limited rememory and time resources.
    - SMOTE in combiantion with TF-idf is slow and generates a large space of X.

In [7]:
def train_log_regression_classifiers(dataframe, feature_column):
    """
    Train multinomial logistic regression classifiers for four labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024,
            # stratify=dataframe[label] # hold proportion of classes distribution
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000) # limit to 5000
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train Logistic Regression classifier
        classifier = LogisticRegression(max_iter=100, random_state=2024, multi_class='multinomial') # limit to max iter 
        classifier.fit(X_train_tfidf, y_train)

        # Store the trained classifier
        classifiers[label] = classifier

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        logreg_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {logreg_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test, predictions, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports_initial/logreg', exist_ok=True)
        with open(f'reports_initial/logreg/logreg_classifier_report_{label}_{feature_column}.txt', 'w') as logreg_file:
            logreg_file.write(f"Classification Report for {label}_{feature_column}:\n")
            logreg_file.write(report)
            logreg_file.write(f"F1 Score: {logreg_f1:.3f}\n")

    # Compute the custom metric for hazards and products using test data only
    custom_metrics['subtask_1'] = compute_score(
    test_data['hazard-category']['y_test'],
    test_data['product-category']['y_test'],
    classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
    classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
    test_data['hazard']['y_test'],
    test_data['product']['y_test'],
    classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
    classifiers['product'].predict(test_data['product']['X_test_tfidf'])
    )


    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics

In [8]:
classifiers, vectorizers, custom_metrics = train_log_regression_classifiers(df_initial, 'title')
print("Custom Metric Scores on Test Data:")
print(custom_metrics)

Training classifier for label: hazard-category




F1 Score for hazard-category: 0.448
                                precision    recall  f1-score   support

                     allergens       0.80      0.89      0.84       374
                    biological       0.75      0.94      0.84       340
                      chemical       0.82      0.46      0.58        68
food additives and flavourings       0.00      0.00      0.00         8
                foreign bodies       0.85      0.69      0.77       101
                         fraud       0.85      0.51      0.64        67
          organoleptic aspects       0.00      0.00      0.00        10
                  other hazard       0.67      0.15      0.25        26
              packaging defect       1.00      0.06      0.12        16

                      accuracy                           0.79      1010
                     macro avg       0.64      0.41      0.45      1010
                  weighted avg       0.78      0.79      0.76      1010

Training classifier for l



F1 Score for product-category: 0.376
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.00      0.00      0.00         9
                      cereals and bakery products       0.60      0.76      0.67       133
     cocoa and cocoa preparations, coffee and tea       0.58      0.51      0.54        37
                                    confectionery       0.92      0.27      0.42        41
dietetic foods, food supplements, fortified foods       0.81      0.61      0.69        28
                                    fats and oils       0.00      0.00      0.00         4
                                   feed materials       0.00      0.00      0.00         1
                   food additives and flavourings       0.00      0.00      0.00         1
                           food contact materials       0.00      0.00      0.00         2
                            fruits and vegetables   



F1 Score for hazard: 0.135
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         2
                                   abnormal smell       0.00      0.00      0.00         1
                                        alkaloids       0.00      0.00      0.00         1
                                        allergens       0.00      0.00      0.00         2
                                           almond       0.90      0.53      0.67        17
                                        amygdalin       0.00      0.00      0.00         1
                           antibiotics, vet drugs       0.00      0.00      0.00         2
                                    bacillus spp.       0.00      0.00      0.00         6
                             bad smell / off odor       0.00      0.00      0.00         1
                                    bone fragment       0.00  



F1 Score for product: 0.076
                                                      precision    recall  f1-score   support

                              Catfishes (freshwater)       1.00      1.00      1.00         4
                                     Dried pork meat       0.00      0.00      0.00         1
                               Fishes not identified       0.25      0.25      0.25         4
                            Not classified pork meat       0.00      0.00      0.00         1
                          Pangas catfishes (generic)       0.00      0.00      0.00         1
                                       Veggie Burger       0.00      0.00      0.00         1
                                     adobo seasoning       0.00      0.00      0.00         1
                                     alfalfa sprouts       1.00      0.67      0.80         3
                                               algae       0.00      0.00      0.00         1
                               

### Random Forest  TF-IDF Title

### High-Level Explanation of `train_random_forest_classifiers`

### Objective
- Train Random Forest classifiers for four labels and compute custom metrics to evaluate performance.

### Inputs
- **`dataframe`**: Dataset containing features and target labels.
- **`feature_column`**: The column in the dataframe to be used for feature extraction.

### Outputs
- **`classifiers`**: A dictionary of trained Random Forest classifiers for each label.
- **`vectorizers`**: A dictionary of TF-IDF vectorizers used for feature extraction.
- **`custom_metrics`**: A dictionary of custom evaluation scores for subtasks on test data.

---

### Key Steps

1. **Initialization**:
   - Set a random seed for reproducibility.
   - Prepare dictionaries to store classifiers, vectorizers, and custom metrics.

2. **Label-Specific Training**:
   - Iterate over four labels: 
     - `hazard-category`, 
     - `product-category`, 
     - `hazard`, 
     - `product`.
   - For each label:
     - Perform a stratified train-test split to maintain class distribution.
     - Use `TfidfVectorizer` to extract character-based n-gram features (2-5) with a maximum of 5000 features.
     - Train a Random Forest classifier with 100 decision trees using the extracted TF-IDF features.
     - Evaluate the model using F1 score and generate a classification report.
     - Save the trained classifier, vectorizer, and test data.

3. **Custom Metric Calculation**:
   - Compute task-specific metrics for two subtasks:
     - **Subtask 1**: Evaluate the relationship between `hazard-category` and `product-category`.
     - **Subtask 2**: Evaluate the relationship between `hazard` and `product`.
   - Combine F1 scores for hazards and products to compute the final metric for each subtask.

4. **Logging and Output**:
   - Print F1 scores for each label and custom metric scores for subtasks.
   - Save classification reports to a dedicated directory.
   - Return the trained classifiers, vectorizers, and custom metrics.

---

## Key Techniques
- **TF-IDF Vectorization**: Convert text into feature vectors using character n-grams.
- **Random Forest Classifier**: Train ensemble models with 100 decision trees for robust predictions.
- **Custom Metric Calculation**: Evaluate subtasks by combining F1 scores for hazards and products.

This function enables multi-label text classification and provides task-specific evaluations with a focus on hazards and products.


In [11]:
def train_random_forest_classifiers(dataframe, feature_column):
    """
    Train Random Forest classifiers for four labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024,
            # stratify=dataframe[label]  # hold proportion of class distribution
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)  # limit to 5000
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train Random Forest classifier
        classifier = RandomForestClassifier(n_estimators=100, random_state=2024, n_jobs=-1)  # Using 100 trees
        classifier.fit(X_train_tfidf, y_train)

        # Store the trained classifier
        classifiers[label] = classifier

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        rf_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {rf_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test, predictions, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports_initial/random_forest', exist_ok=True)
        with open(f'reports_initial/random_forest/rf_classifier_report_{label}_{feature_column}.txt', 'w') as rf_file:
            rf_file.write(f"Classification Report for {label}_{feature_column}:\n")
            rf_file.write(report)
            rf_file.write(f"F1 Score: {rf_f1:.3f}\n")

    # Compute the custom metric for hazards and products using test data only
    custom_metrics['subtask_1'] = compute_score(
        test_data['hazard-category']['y_test'],
        test_data['product-category']['y_test'],
        classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
        classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
        test_data['hazard']['y_test'],
        test_data['product']['y_test'],
        classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
        classifiers['product'].predict(test_data['product']['X_test_tfidf'])
    )

    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics

In [12]:
train_random_forest_classifiers(df_initial, 'title')

Training classifier for label: hazard-category
F1 Score for hazard-category: 0.541
                                precision    recall  f1-score   support

                     allergens       0.82      0.90      0.86       374
                    biological       0.78      0.95      0.86       340
                      chemical       0.89      0.47      0.62        68
food additives and flavourings       1.00      0.12      0.22         8
                foreign bodies       0.83      0.73      0.78       101
                         fraud       0.82      0.55      0.66        67
          organoleptic aspects       1.00      0.10      0.18        10
                  other hazard       0.75      0.35      0.47        26
              packaging defect       1.00      0.12      0.22        16

                      accuracy                           0.81      1010
                     macro avg       0.88      0.48      0.54      1010
                  weighted avg       0.82      0.81

({'hazard-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'hazard': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product': RandomForestClassifier(n_jobs=-1, random_state=2024)},
 {'hazard-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'hazard': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode')},
 {'subtask_1': np.float64(0.5351616428920356),
  'subtask_2': np.float64(0.28639627720486455)}

### Advanced X-Boost TF-IDF Title 

### High-Level Explanation of `train_xgboost_classifiers`

### Objective
- Train XGBoost classifiers for four labels and compute custom metrics to evaluate performance.

### Inputs
- **`dataframe`**: Dataset containing features and target labels.
- **`feature_column`**: The column in the dataframe to be used for feature extraction.

### Outputs
- **`classifiers`**: A dictionary of trained XGBoost classifiers for each label, including label encoders.
- **`vectorizers`**: A dictionary of TF-IDF vectorizers used for feature extraction.
- **`custom_metrics`**: A dictionary of custom evaluation scores for subtasks on test data.

---

### Key Steps

1. **Initialization**:
   - Set a random seed for reproducibility.
   - Prepare dictionaries to store classifiers, vectorizers, and custom metrics.

2. **Label-Specific Training**:
   - Iterate over four labels:
     - `hazard-category`,
     - `product-category`,
     - `hazard`,
     - `product`.
   - For each label:
     - Perform a stratified train-test split to maintain class distribution.
     - Use `TfidfVectorizer` to extract character-based n-gram features (2-5) with a maximum of 2000 features.
     - Encode target labels into numeric values using `LabelEncoder`.
     - Train an XGBoost classifier with the following parameters:
       - Maximum depth of 6.
       - 50 estimators.
       - Learning rate of 0.2.
     - Evaluate the model using F1 score and generate a classification report.
     - Save the trained classifier, vectorizer, label encoder, and test data.

3. **Custom Metric Calculation**:
   - Compute task-specific metrics for two subtasks:
     - **Subtask 1**: Evaluate the relationship between `hazard-category` and `product-category`.
     - **Subtask 2**: Evaluate the relationship between `hazard` and `product`.
   - Combine F1 scores for hazards and products to compute the final metric for each subtask.

4. **Logging and Output**:
   - Print F1 scores for each label and custom metric scores for subtasks.
   - Save classification reports to a dedicated directory.
   - Return the trained classifiers, vectorizers, and custom metrics.

---

### Key Techniques
- **TF-IDF Vectorization**: Convert text into feature vectors using character n-grams.
- **XGBoost Classifier**: Train scalable and efficient tree-based classifiers with softmax multi-class objectives.
- **Label Encoding**: Map categorical target labels to numeric values for compatibility with XGBoost.
- **Custom Metric Calculation**: Evaluate subtasks by combining F1 scores for hazards and products.

This function enables multi-label text classification and provides task-specific evaluations with a focus on hazards and products using a high-performance gradient boosting model.


In [43]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, classification_report
from xgboost import XGBClassifier

def train_xgboost_classifiers(dataframe, feature_column):
    """
    Train XGBoost classifiers for four labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Encode target labels into numeric values
        label_encoder = LabelEncoder()
        y_train_encoded = label_encoder.fit_transform(y_train)

        # Handle unseen labels in test set
        y_test_mapped = y_test.map(lambda x: x if x in label_encoder.classes_ else None).dropna()
        y_test_encoded = label_encoder.transform(y_test_mapped)

        # Filter test data to exclude rows with unseen labels
        valid_test_indices = y_test.index[y_test.isin(label_encoder.classes_)]
        X_test = X_test.loc[valid_test_indices]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(
            strip_accents='unicode', analyzer='char', ngram_range=(2, 5),
            max_df=0.5, min_df=5, max_features=2000
        )
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train XGBoost classifier
        classifier = XGBClassifier(
            use_label_encoder=False,
            eval_metric='mlogloss',
            objective='multi:softmax',
            max_depth=6,
            n_estimators=50,
            learning_rate=0.2,
            random_state=2024
        )
        classifier.fit(X_train_tfidf, y_train_encoded)

        # Store the trained classifier
        classifiers[label] = {
            'model': classifier,
            'label_encoder': label_encoder
        }

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test_encoded
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        xgb_f1 = f1_score(y_test_encoded, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {xgb_f1:.3f}")

        # Filter target names to match present classes
        present_classes = np.unique(y_test_encoded)
        filtered_target_names = [label_encoder.classes_[i] for i in present_classes]

        # Generate classification report
        report = classification_report(
            y_test_encoded, 
            predictions, 
            zero_division=0, 
            target_names=filtered_target_names, 
            labels=present_classes
        )
        print(report)

        # Save the report
        os.makedirs('reports_initial/xgboost', exist_ok=True)
        with open(f'reports_initial/xgboost/xgboost_classifier_report_{label}_{feature_column}.txt', 'w') as xgb_file:
            xgb_file.write(f"Classification Report for {label}_{feature_column}:\n")
            xgb_file.write(report)
            xgb_file.write(f"F1 Score: {xgb_f1:.3f}\n")

    # Compute the custom metric for hazards and products using test data only
    custom_metrics['subtask_1'] = compute_score(
        pd.Series(test_data['hazard-category']['y_test']),
        pd.Series(test_data['product-category']['y_test']),
        pd.Series(classifiers['hazard-category']['model'].predict(test_data['hazard-category']['X_test_tfidf'])),
        pd.Series(classifiers['product-category']['model'].predict(test_data['product-category']['X_test_tfidf']))
    )

    custom_metrics['subtask_2'] = compute_score(
        pd.Series(test_data['hazard']['y_test']),
        pd.Series(test_data['product']['y_test']),
        pd.Series(classifiers['hazard']['model'].predict(test_data['hazard']['X_test_tfidf'])),
        pd.Series(classifiers['product']['model'].predict(test_data['product']['X_test_tfidf']))
    )

    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics


In [22]:
classifiers, vectorizers, custom_metrics = train_xgboost_classifiers(df_initial, 'title')

Training classifier for label: hazard-category


Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard-category: 0.549
                                precision    recall  f1-score   support

                     allergens       0.82      0.88      0.85       374
                    biological       0.75      0.93      0.83       340
                      chemical       0.79      0.46      0.58        68
food additives and flavourings       1.00      0.25      0.40         8
                foreign bodies       0.83      0.69      0.76       101
                         fraud       0.78      0.48      0.59        67
          organoleptic aspects       1.00      0.30      0.46        10
                  other hazard       0.54      0.27      0.36        26
              packaging defect       1.00      0.06      0.12        16

                      accuracy                           0.78      1010
                     macro avg       0.83      0.48      0.55      1010
                  weighted avg       0.79      0.78      0.77      1010

Training classifier for l

Parameters: { "use_label_encoder" } are not used.



F1 Score for product-category: 0.530
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.80      0.44      0.57         9
                      cereals and bakery products       0.62      0.74      0.68       133
     cocoa and cocoa preparations, coffee and tea       0.49      0.49      0.49        37
                                    confectionery       0.87      0.32      0.46        41
dietetic foods, food supplements, fortified foods       0.81      0.46      0.59        28
                                    fats and oils       1.00      0.75      0.86         4
                                   feed materials       0.00      0.00      0.00         1
                   food additives and flavourings       1.00      1.00      1.00         1
                           food contact materials       0.00      0.00      0.00         2
                            fruits and vegetables   

Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard: 0.278
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         2
                                   abnormal smell       0.00      0.00      0.00         1
                                        alkaloids       0.00      0.00      0.00         1
                                        allergens       0.00      0.00      0.00         2
                                           almond       0.65      0.65      0.65        17
                                        amygdalin       0.00      0.00      0.00         1
                           antibiotics, vet drugs       0.00      0.00      0.00         2
                                    bacillus spp.       0.00      0.00      0.00         6
                             bad smell / off odor       0.00      0.00      0.00         1
                                    bone fragment       0.00  

Parameters: { "use_label_encoder" } are not used.



F1 Score for product: 0.265
                                                   precision    recall  f1-score   support

                           Catfishes (freshwater)       1.00      0.75      0.86         4
                            Fishes not identified       0.67      0.50      0.57         4
                         Not classified pork meat       0.00      0.00      0.00         1
                       Pangas catfishes (generic)       0.00      0.00      0.00         1
                                    Veggie Burger       0.00      0.00      0.00         1
                                  alfalfa sprouts       1.00      0.33      0.50         3
                                            algae       0.50      1.00      0.67         1
                            all purpose seasoning       1.00      1.00      1.00         1
                                    almond powder       0.00      0.00      0.00         1
                                  almond products       1.00 

### Genarl Notes 
- All classifiers has better performance than the baselines (that is majority and random classifier via obsering per sub task and f1 scores).
- Due to the Imbalance all classifiers seem to have overfit s1-score per class has low numbers in combiantion with low accuracy.
- Regarding the comparison of f1 scores and custom evaluation the Logistic Regression has the lowest performance. Random Forest in sub task 2 ahs a few better score, but x-boost har more in sub task 1.
- `Out best classifier for the input title is x-boost`.

### Quick Comment in comparison with results of augmebted data 
- Despite the fact that classifier performed better performance in reports, and in competition quite low score, we can assume that the dta augmented strategy we followed "mislead" the model from teh real data.

### Part B. Benchmark Analysis Text

In [23]:
classifiers, vectorizers, custom_metrics = train_log_regression_classifiers(df_initial, 'text')

Training classifier for label: hazard-category




F1 Score for hazard-category: 0.527
                                precision    recall  f1-score   support

                     allergens       0.93      0.97      0.95       374
                    biological       0.88      0.96      0.92       340
                      chemical       0.82      0.66      0.73        68
food additives and flavourings       1.00      0.12      0.22         8
                foreign bodies       0.68      0.89      0.77       101
                         fraud       0.68      0.54      0.60        67
          organoleptic aspects       0.00      0.00      0.00        10
                  other hazard       1.00      0.19      0.32        26
              packaging defect       1.00      0.12      0.22        16

                      accuracy                           0.86      1010
                     macro avg       0.78      0.50      0.53      1010
                  weighted avg       0.86      0.86      0.84      1010

Training classifier for l



F1 Score for product-category: 0.286
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.00      0.00      0.00         9
                      cereals and bakery products       0.46      0.76      0.57       133
     cocoa and cocoa preparations, coffee and tea       0.52      0.43      0.47        37
                                    confectionery       0.89      0.20      0.32        41
dietetic foods, food supplements, fortified foods       0.75      0.21      0.33        28
                                    fats and oils       0.00      0.00      0.00         4
                                   feed materials       0.00      0.00      0.00         1
                   food additives and flavourings       0.00      0.00      0.00         1
                           food contact materials       0.00      0.00      0.00         2
                            fruits and vegetables   



F1 Score for hazard: 0.165
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         2
                                   abnormal smell       0.00      0.00      0.00         1
                                        alkaloids       0.00      0.00      0.00         1
                                        allergens       0.00      0.00      0.00         2
                                           almond       0.55      0.65      0.59        17
                                        amygdalin       0.00      0.00      0.00         1
                           antibiotics, vet drugs       0.00      0.00      0.00         2
                                    bacillus spp.       0.00      0.00      0.00         6
                             bad smell / off odor       0.00      0.00      0.00         1
                                    bone fragment       0.00  



F1 Score for product: 0.039
                                                      precision    recall  f1-score   support

                              Catfishes (freshwater)       0.00      0.00      0.00         4
                                     Dried pork meat       0.00      0.00      0.00         1
                               Fishes not identified       0.00      0.00      0.00         4
                            Not classified pork meat       0.00      0.00      0.00         1
                          Pangas catfishes (generic)       0.00      0.00      0.00         1
                                       Veggie Burger       0.00      0.00      0.00         1
                                     adobo seasoning       0.00      0.00      0.00         1
                                     alfalfa sprouts       0.00      0.00      0.00         3
                                               algae       0.00      0.00      0.00         1
                               

In [24]:
train_random_forest_classifiers(df_initial, 'text')

Training classifier for label: hazard-category
F1 Score for hazard-category: 0.609
                                precision    recall  f1-score   support

                     allergens       0.93      0.98      0.96       374
                    biological       0.91      0.97      0.94       340
                      chemical       0.72      0.60      0.66        68
food additives and flavourings       0.80      0.50      0.62         8
                foreign bodies       0.78      0.94      0.85       101
                         fraud       0.78      0.57      0.66        67
          organoleptic aspects       0.00      0.00      0.00        10
                  other hazard       0.58      0.42      0.49        26
              packaging defect       1.00      0.19      0.32        16

                      accuracy                           0.88      1010
                     macro avg       0.72      0.58      0.61      1010
                  weighted avg       0.87      0.88

({'hazard-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'hazard': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product': RandomForestClassifier(n_jobs=-1, random_state=2024)},
 {'hazard-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'hazard': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode')},
 {'subtask_1': np.float64(0.5194938555486676),
  'subtask_2': np.float64(0.24101826281615074)}

In [44]:
classifiers, vectorizers, custom_metrics = train_xgboost_classifiers(df_initial, 'text')

Training classifier for label: hazard-category


Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard-category: 0.677
                                precision    recall  f1-score   support

                     allergens       0.94      0.98      0.96       374
                    biological       0.91      0.96      0.94       340
                      chemical       0.79      0.72      0.75        68
food additives and flavourings       1.00      0.38      0.55         8
                foreign bodies       0.80      0.91      0.85       101
                         fraud       0.75      0.60      0.67        67
          organoleptic aspects       0.83      0.50      0.62        10
                  other hazard       0.63      0.46      0.53        26
              packaging defect       1.00      0.12      0.22        16

                      accuracy                           0.89      1010
                     macro avg       0.85      0.63      0.68      1010
                  weighted avg       0.89      0.89      0.88      1010

Training classifier for l

Parameters: { "use_label_encoder" } are not used.



F1 Score for product-category: 0.509
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.67      0.44      0.53         9
                      cereals and bakery products       0.47      0.66      0.55       133
     cocoa and cocoa preparations, coffee and tea       0.47      0.59      0.52        37
                                    confectionery       0.69      0.22      0.33        41
dietetic foods, food supplements, fortified foods       0.67      0.36      0.47        28
                                    fats and oils       0.67      0.50      0.57         4
                                   feed materials       0.00      0.00      0.00         1
                   food additives and flavourings       1.00      1.00      1.00         1
                           food contact materials       0.00      0.00      0.00         2
                            fruits and vegetables   

Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard: 0.361
                                                   precision    recall  f1-score   support

                                        Aflatoxin       1.00      0.50      0.67         2
                                   abnormal smell       0.00      0.00      0.00         1
                                        alkaloids       0.00      0.00      0.00         1
                                        allergens       0.00      0.00      0.00         2
                                           almond       0.62      0.76      0.68        17
                                        amygdalin       0.00      0.00      0.00         1
                           antibiotics, vet drugs       0.00      0.00      0.00         2
                                    bacillus spp.       0.75      0.50      0.60         6
                             bad smell / off odor       0.00      0.00      0.00         1
                                    bone fragment       0.00  

Parameters: { "use_label_encoder" } are not used.



F1 Score for product: 0.205
                                                   precision    recall  f1-score   support

                           Catfishes (freshwater)       0.75      0.75      0.75         4
                            Fishes not identified       0.10      0.25      0.14         4
                         Not classified pork meat       0.00      0.00      0.00         1
                       Pangas catfishes (generic)       0.00      0.00      0.00         1
                                    Veggie Burger       0.00      0.00      0.00         1
                                  alfalfa sprouts       1.00      0.67      0.80         3
                                            algae       0.00      0.00      0.00         1
                            all purpose seasoning       1.00      1.00      1.00         1
                                    almond powder       0.00      0.00      0.00         1
                                  almond products       0.50 

- Comparing the custom evaluation scores and fi-macro average scores between the text clsssifers we can observe that again x-boost has better performance.
- Furthermore comparing X-boost witH Input text and X-boost with Input title  we can observe thatX-boost wit h input "text" has better custom scores than "title".

### Predict for the best model without additional tuning to have an overall view of the score


In [45]:
classifiers

{'hazard-category': {'model': XGBClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bynode=None,
                colsample_bytree=None, device=None, early_stopping_rounds=None,
                enable_categorical=False, eval_metric='mlogloss',
                feature_types=None, gamma=None, grow_policy=None,
                importance_type=None, interaction_constraints=None,
                learning_rate=0.2, max_bin=None, max_cat_threshold=None,
                max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
                max_leaves=None, min_child_weight=None, missing=nan,
                monotone_constraints=None, multi_strategy=None, n_estimators=50,
                n_jobs=None, num_parallel_tree=None, objective='multi:softmax', ...),
  'label_encoder': LabelEncoder()},
 'product-category': {'model': XGBClassifier(base_score=None, booster=None, callbacks=None,
                colsample_bylevel=None, colsample_bynod

### Predict Unlabelled Data

In [47]:
#predict for hazard category : 
# Access specific classifiers of X-Boost - Tuned 
hazard_classifier = classifiers['hazard']
product_classifier = classifiers['product']
hazard_category_classifier = classifiers['hazard-category']
product_category_classifier = classifiers['product-category']

### Predict Hazard (ST2 part)

In [48]:
vectorizer = vectorizers['hazard']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
vectorizer.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
X_val=vectorizer.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
X_val # vectorized X input of unlabeled data the column "text"

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 305381 stored elements and shape (565, 2000)>

In [49]:
model_hazard_classifier = hazard_classifier['model'] # save trained model X-Boost 
label_encoder_hazard_classifier = hazard_classifier['label_encoder'] # Save trained model label_encoder

In [50]:
model_hazard_classifier = hazard_classifier['model'] # save trained model X-Boost 
label_encoder_hazard_classifier = hazard_classifier['label_encoder'] # Save trained model label_encoder 

In [51]:
model_hazard_classifier

In [52]:
model_hazard_classifier.predict(X_val)


array([ 73,  64,  73,  68,  73,  55,  55,  55,  55,  55,  55,  55,  55,
        55,  55,  68,  55,  55,  55,  68,  55,  55,  55,  55,  55,  55,
        34,  55,  55,  55,  55,  55,  55,  68,  98,  98,  55,  55,  55,
        55,  55,  68,  55,  55,  55,  55,  55,  55,  68,  55,  55,  55,
        55,  55,  55,  55,  55,  55,  55,  73,  55,  55,  55,  85,  85,
        68,  85,  85,  90,  55,  55,  55,  55,  55,  68,  68,  55,  55,
        68,  55,  39,  73,  55,  68,  55,  55,  73,  34,  55,  55,  68,
        55,  55,  55,  55,  55,  68,  55,  68,  55,  85,  68,  55,  68,
        55,   5,  55,  68,  68,  55,  68,  55,  55,  55,  85,  68,  68,
        68,  68,  68,  85,  85,  85,  85,  85,  34,  55,  68,  55,  85,
        68,  55,  98,  55,  68,  55,  55,  68,  55,  90,  68,  55,  55,
        98,  55,  98,  55,  55,  55,  55,  55,  98,  55,  55,  55,  55,
        98,  98,  98,  98,  98,  55,  34,  98,  55,  55,  98,  98,  68,
        68,  68,  68,  55,  55,  68,  55,  55,  68,  68,  55,  5

In [53]:
model_hazard_classifier.predict(X_val).shape[0]
predictions_hazard= model_hazard_classifier.predict(X_val)

In [54]:
predictions_named_hazard = label_encoder_hazard_classifier.inverse_transform(predictions_hazard)  # Get original class names        predictions_dict[label] = predictions_named  # Store predictions with names

In [55]:
predictions_named_hazard

array(['other', 'moulds', 'other', 'norovirus', 'other',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes', 'norovirus',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'norovirus', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'eggs and products thereof',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes', 'norovirus',
       'salmonella', 'salmonella', 'listeria monocytogenes',
       'listeria monocytogenes', 'listeria monocytogenes',
       'lister

In [56]:
predictions_named_hazard.shape[0]

565

### Predict Product (ST2)

In [57]:
# vectorizer = vectorizers['product']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
# vectorizer.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
# X_val=vectorizer.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
# model_product_classifier = product_classifier['model'] # save trained model X-Boost 
# label_encoder_product_classifier = product_classifier['label_encoder'] # Save trained model label_encoder 
# predictions_product= model_product_classifier.predict(X_val)
# product_predictions_named = label_encoder_product_classifier.inverse_transform(predictions_product)  # Get original class names predictions_dict[label] = predictions_named  # Store predictions with names
# product_predictions_named

array(['cookies', 'frozen hash browns', 'cookies', 'soup',
       'chicken based products', 'cookies', 'cookies', 'cookies',
       'cookies', 'cookies', 'cookies', 'cookies', 'cookies', 'cookies',
       'cookies', 'pesto', 'sauce', 'cookies', 'cookies', 'pesto',
       'cookies', 'cookies', 'cookies', 'cookies', 'cookies', 'cookies',
       'cookies', 'cookies', 'cakes', 'cookies', 'cookies', 'cookies',
       'biscuits', 'pesto', 'cookies', 'chicken based products', 'cheese',
       'cookies', 'cookies', 'cookies', 'cookies', 'salads', 'cookies',
       'cookies', 'cookies', 'tahini', 'cookies', 'cookies',
       'ready to eat - cook meals', 'cookies', 'cookies', 'cookies',
       'cheese', 'cookies', 'cookies', 'cookies', 'cookies', 'cheese',
       'cookies', 'ready to eat - cook meals', 'cookies', 'cookies',
       'cookies', 'sandwiches', 'sandwiches', 'peanuts',
       'ready to eat - cook meals', 'cookies', 'pesto', 'cookies',
       'cookies', 'soup', 'ready to eat - cook mea

### Save CSV Product and Hazard  (ST2)

In [59]:
# import zipfile
# # Your data
# data_st2 = {"hazard": predictions_named_hazard, "product": product_predictions_named}
# df_st2 = pd.DataFrame(data_st2)
# # Define the folder and file path
# base_folder = "initial_data_submission"
# sub_folder = "st2_initial"
# csv_file_name = "submission.csv"
# # Create the directories if they do not exist
# output_path = os.path.join(base_folder, sub_folder)
# os.makedirs(output_path, exist_ok=True)
# # Full file path
# filecsv_file_path = os.path.join(output_path, csv_file_name)
# # Save the DataFrame to a CSV file
# df_st2.to_csv(filecsv_file_path, index=False)
# # Create a zip file containing the CSV
# zip_file_path = os.path.join(base_folder, "st2_initial.zip")
# with zipfile.ZipFile(zip_file_path, 'w') as zipf:
#     zipf.write(filecsv_file_path, arcname=os.path.join(sub_folder, csv_file_name))
# print(f"CSV file saved to: {filecsv_file_path}")
# print(f"Zip file created at: {zip_file_path}")

CSV file saved to: initial_data_submission/st2_initial/submission.csv
Zip file created at: initial_data_submission/st2_initial.zip


### Predict Hazard-Category (ST1)

In [71]:
# vectorizer = vectorizers['hazard-category']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
# vectorizer.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
# X_val=vectorizer.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
# model_hazard_category_classifier = hazard_category_classifier['model'] # save trained model X-Boost 
# label_encoder_hazard_category_classifier = hazard_category_classifier['label_encoder'] # Save trained model label_encoder 
# predictions_hazard_category= model_hazard_category_classifier.predict(X_val)
# hazard_category_predictions_named = label_encoder_hazard_category_classifier.inverse_transform(predictions_product)  # Get original class names predictions_dict[label] = predictions_named  # Store predictions with names
# # hazard_category_predictions_named

### Predict Product-Category (ST1)

In [None]:
# vectorizer = vectorizers['product']#pass TF-idf of training TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=2000)
# vectorizer.fit_transform(df_initial['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
# X_val=vectorizer.transform(testset_competition['text']) # for the column "text" as it has better performance X-boost simple (menaing without tuning)
# model_product_classifier = product_classifier['model'] # save trained model X-Boost 
# label_encoder_product_classifier = product_classifier['label_encoder'] # Save trained model label_encoder 
# predictions_product= model_product_classifier.predict(X_val)
# product_predictions_named = label_encoder_product_classifier.inverse_transform(predictions_product)  # Get original class names predictions_dict[label] = predictions_named  # Store predictions with names
# product_predictions_named