## Main Notebook for Benchmark Analysis - Training and Evaluation

- This Jupyter Notebook contains the benchmark analysis, first with "title" as input and then with "text" as input.
- Our aim is to find the best model (`LogisticRegression`, `Random Forest` and `XGBoost`) together with the best input, "title" or "text".
- We will then apply hyperparameter tuning techniques only to the model identified as the best one.
    - Note: Additionally, baseline models (majority and random classifiers) are created so that we can detect whether a model is predicting purely at random or simply predicting the mode (most frequent value).
        - We want our models to outperform these baselines on the evaluation metrics.
---
> Evangelia P. Panourgia, Master Student in Data Science, AUEB <br />
> Department of Informatics, Athens University of Economics and Business <br />
> eva.panourgia@aueb.gr <br/><br/>


### Install Libraries

In [1]:
!pip install nltk optuna xgboost



### Setting the Scene 
- We will import all the needed libraries.

In [3]:
import os
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
import string
import random
import numpy as np
from sklearn.metrics import f1_score, classification_report
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import LabelEncoder
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import joblib  # For saving and loading models
from sklearn.ensemble import RandomForestClassifier

[nltk_data] Downloading package punkt to /Users/evangelia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/evangelia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/evangelia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Load Data 
- We will load the preprocessed data (`data_augmented__nlp_incidents_train.csv`), which has been processed with data augmentation (generation of synthetic data using synonyms) and basic NLP preprocessing.
- Furthermore, we will load the unlabeled competition data (`incidents.csv`) in order to predict its labels.

In [4]:
df_augmented= pd.read_csv('data/data_augmented__nlp_incidents_train.csv') # load data after data augmentation
testset_competition = pd.read_csv('data/incidents.csv', index_col=0) # load testing data (conception phase, unlabeled):

- Note: We load the already transformed (preprocessed and augmented) dataframe because the main code splits the data into training and test sets with the `stratify` argument in order to preserve the class proportions, and classes with only one instance would be a problem.
    More specifically:
    - What Stratification Does: Stratification ensures that the proportions of classes in the training and test sets reflect the proportions in the original dataset. This is especially useful when you have imbalanced classes.
    - The Problem with Single Instances: If a class has only one instance, stratified splitting cannot place it in both the training and test sets while maintaining class proportions, and scikit-learn's `train_test_split` raises an error in that case.
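
The failure mode described above can be reproduced on a toy frame. This is a minimal sketch, not part of the original pipeline: the `toy` dataframe and its labels are made up for illustration, and the error shown is the `ValueError` that `train_test_split` raises when a class used for stratification has fewer than two members.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: class "c" has a single instance
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(9)],
    "label": ["a"] * 4 + ["b"] * 4 + ["c"],
})

# Find classes that are too rare to be stratified
counts = toy["label"].value_counts()
singletons = counts[counts < 2].index.tolist()
print("Classes with a single instance:", singletons)

# Stratified splitting fails while such classes are present
try:
    train_test_split(toy, test_size=0.25, random_state=0, stratify=toy["label"])
    split_failed = False
except ValueError as err:
    split_failed = True
    print("Stratified split failed:", err)
```

Checking `value_counts()` before splitting is a cheap way to confirm that every class has at least two instances after augmentation.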

In [5]:
df_augmented = df_augmented[['title','text','hazard-category','product-category','hazard','product']]
df_augmented.head(3) # preview preprocessed data

Unnamed: 0,title,text,hazard-category,product-category,hazard,product
0,recal notif fsis-024-94,case number 024-94 date open 07/01/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,smoked sausage
1,recal notif fsis-033-94,case number 033-94 date open 10/03/1994 date c...,biological,"meat, egg and dairy products",listeria spp,sausage
2,recal notif fsis-014-94,case number 014-94 date open 03/28/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices


In [5]:
testset_competition.head(3)# preview test data 

Unnamed: 0,year,month,day,country,title,text
0,1994,5,5,us,Recall Notification: FSIS-017-94,Case Number: 017-94 \n Date Opene...
1,1994,5,12,us,Recall Notification: FSIS-048-94,Case Number: 048-94 \n Date Opene...
2,1995,4,16,us,Recall Notification: FSIS-032-95,Case Number: 032-95 \n Date Opene...


## Baselines
- Benchmark analysis is crucial for evaluating classification performance in multiclass imbalance settings because it provides reference points for how well your model is performing relative to simple baseline classifiers. The `Random Classifier` and `Majority Classifier` are commonly used as benchmarks for the following reasons:

### Random Classifier 
- A Random Classifier predicts class labels **uniformly** at random, independently of the class distribution. It sets a minimal baseline and helps understand:

- `Baseline Performance`: This represents the expected performance `without learning from the data`. `If a model performs worse than a random classifier, it indicates either issues in the model or unsuitable features`.

- `Chance Levels`: It shows what performance you'd `get by chance alone`, especially useful for imbalanced datasets where metrics like accuracy can be misleading.


### Majority Classifier

- A Majority Classifier always **predicts the majority class** (`the class with the highest frequency in the training data`). 

- It helps understand:

    - `Handling Imbalance`: In multiclass imbalanced datasets, accuracy can be dominated by the majority class. The majority classifier provides a baseline to compare how well your model captures minority classes.
    - `Baseline of Naïve Solutions`: The majority classifier reflects the simplest possible rule for prediction. If a model's performance is close to that of a majority classifier, it suggests the model is failing to generalize or adapt to the minority classes.
    - `Focus on Class Imbalance`: Metrics like weighted accuracy, balanced accuracy, or macro-F1 score should be significantly better than those achieved by the majority classifier to indicate that a model is addressing imbalance effectively.

- Note: the following code cell implements the Random and Majority Classifiers. To keep the same high-level logic as the later models, we include the train/test split step, even though for the Random Classifier, for example, it is unnecessary: it is not affected by the input and does not learn from the data.
    - However, this "skeleton" is useful because the remaining algorithms (both traditional and advanced) build on it.

- More specifically, 

    - Random Classifier  Effect of X: The X values (features) **do not influence the random classifier's predictions**. It does not learn from the data in the feature column. Its predictions are purely random, so changing X will not alter its performance.
    - Majority Classifier Effect of X: The feature column X is ignored by the majority classifier, as it does not use features for prediction. Instead, it looks only at the distribution of y in the training data.

### Regarding the Implementation 
- The `DummyClassifier in scikit-learn` is a baseline model designed to evaluate classification algorithms by comparing them against simplistic strategies. These strategies provide minimal logic to make predictions and are often used as benchmarks to understand how well a more complex model performs.
    - `strategy="uniform"` (for Random Classifier): 
        - Predicts a class randomly and uniformly across all possible classes.
        - Each class has an equal probability of being selected, irrespective of the class distribution in the training data.
        - Use Case: Ideal for scenarios where you want to simulate random guessing.
    - `strategy="most_frequent"` (for Majority Classifier):
        - Always predicts the most frequent class observed in the training data.
        - Ignores the input features entirely and focuses only on the training set's class distribution.
        - Use Case: Useful for understanding how well a naive baseline would perform if you simply predicted the majority class.
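
The two strategies can be illustrated on a toy label vector. This is a minimal sketch under stated assumptions: the labels `a`, `b`, `c` and the all-zero feature matrix are hypothetical and only serve to show that neither strategy looks at `X`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 6 x "a", 3 x "b", 1 x "c"
y = np.array(["a"] * 6 + ["b"] * 3 + ["c"])
X = np.zeros((len(y), 1))  # features are ignored by both strategies

majority_clf = DummyClassifier(strategy="most_frequent").fit(X, y)
random_clf = DummyClassifier(strategy="uniform", random_state=2024).fit(X, y)

majority_pred = majority_clf.predict(X)
random_pred = random_clf.predict(X)

# The majority baseline scores zero F1 on every class except "a",
# so its macro F1 is F1("a") divided by the number of classes.
majority_macro_f1 = f1_score(y, majority_pred, average="macro", zero_division=0)
random_macro_f1 = f1_score(y, random_pred, average="macro", zero_division=0)
print(f"Majority macro F1: {majority_macro_f1:.3f}")
print(f"Random macro F1:   {random_macro_f1:.3f}")
```

Because macro F1 averages over all classes, the majority baseline is penalised for every class it never predicts, which is why its macro F1 can fall below that of uniform guessing.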

In [6]:
def evaluate_baselines(dataframe, feature_column):
    """
    Function to evaluate random and majority classifiers on a given dataframe.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.
    """
    np.random.seed(42)  # For reproducibility

    # Train-test split (no stratification here)
    trainset, testset = train_test_split(
        dataframe, 
        test_size=0.2, 
        random_state=2024, 
        # "skeleton" for the main algorithms: add `stratify` here to preserve class proportions
    )
    testset = testset.copy()  # avoid SettingWithCopyWarning when adding prediction columns
   
    # Random and Majority classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Evaluating for label: {label}")

        # Features and target
        X_train = trainset[feature_column]
        y_train = trainset[label]
        X_test = testset[feature_column]
        y_test = testset[label]

        # Random Classifier
        random_clf = DummyClassifier(strategy="uniform", random_state=2024)
        random_clf.fit(X_train, y_train)  # X is ignored; passed only to mirror the workflow of a real algorithm
        testset['predictions-random-' + label] = random_clf.predict(X_test)

        # Majority Classifier
        majority_clf = DummyClassifier(strategy="most_frequent")
        majority_clf.fit(X_train, y_train)  # X is ignored; passed only to mirror the workflow of a real algorithm
        testset['predictions-majority-' + label] = majority_clf.predict(X_test)

        # Compute F1 scores
        random_f1 = f1_score(y_test, testset['predictions-random-' + label], average='macro', zero_division=0)
        majority_f1 = f1_score(y_test, testset['predictions-majority-' + label], average='macro', zero_division=0)

        print(f"F1 Score for Random Classifier ({label}): {random_f1:.3f}")
        print(f"F1 Score for Majority Classifier ({label}): {majority_f1:.3f}")

        # Generate and save classification reports
        os.makedirs('reports/random', exist_ok=True)
        os.makedirs('reports/majority', exist_ok=True)

        random_report = classification_report(y_test, testset['predictions-random-' + label], zero_division=0)
        majority_report = classification_report(y_test, testset['predictions-majority-' + label], zero_division=0)

        with open(f'reports/random/random_classifier_report_{label}.txt', 'w') as random_file:
            random_file.write(f"Classification Report for Random Classifier ({label}):\n")
            random_file.write(random_report)

        with open(f'reports/majority/majority_classifier_report_{label}.txt', 'w') as majority_file:
            majority_file.write(f"Classification Report for Majority Classifier ({label}):\n")
            majority_file.write(majority_report)
        
        
    
    # Custom metric score calculation
    def compute_score(hazards_true, products_true, hazards_pred, products_pred):
        """
        Custom scoring function to compute the macro F1 score for hazards and products.
        
        Args:
            hazards_true: Ground truth labels for hazards.
            products_true: Ground truth labels for products.
            hazards_pred: Predicted labels for hazards.
            products_pred: Predicted labels for products.
        
        Returns:
            A float representing the combined macro F1 score.
        """
        f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)
        f1_products = f1_score(
            products_true[hazards_pred == hazards_true],
            products_pred[hazards_pred == hazards_true],
            average='macro', 
            zero_division=0
        )
        return (f1_hazards + f1_products) / 2.

    # Compute the custom scores for both sub-tasks
    print(f"Score Sub-Task 1 - Random Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-random-hazard-category'], testset['predictions-random-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Random Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-random-hazard'], testset['predictions-random-product']):.3f}")
    print(f"Score Sub-Task 1 - Majority Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-majority-hazard-category'], testset['predictions-majority-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Majority Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-majority-hazard'], testset['predictions-majority-product']):.3f}")

# Call the function with the required dataframe (e.g., df_augmented or any other dataframe)
evaluate_baselines(df_augmented, feature_column='text')
# Uncomment the following line to use a different feature column
# evaluate_baselines(df_augmented, feature_column='title')

Evaluating for label: hazard-category
F1 Score for Random Classifier (hazard-category): 0.076
F1 Score for Majority Classifier (hazard-category): 0.047
Evaluating for label: product-category


F1 Score for Random Classifier (product-category): 0.034
F1 Score for Majority Classifier (product-category): 0.019
Evaluating for label: hazard
F1 Score for Random Classifier (hazard): 0.005
F1 Score for Majority Classifier (hazard): 0.001
Evaluating for label: product
F1 Score for Random Classifier (product): 0.000
F1 Score for Majority Classifier (product): 0.000
Score Sub-Task 1 - Random Classifier: 0.057
Score Sub-Task 2 - Random Classifier: 0.003
Score Sub-Task 1 - Majority Classifier: 0.031
Score Sub-Task 2 - Majority Classifier: 0.001


- Saved results (the cell runs quickly, but they are recorded here so that the analysis can be replicated):

- Evaluating for label: hazard-category
    - F1 Score for Random Classifier (hazard-category): 0.076
    - F1 Score for Majority Classifier (hazard-category): 0.047
- Evaluating for label: product-category
    - F1 Score for Random Classifier (product-category): 0.034
    - F1 Score for Majority Classifier (product-category): 0.019
- Evaluating for label: hazard
    - F1 Score for Random Classifier (hazard): 0.005
    - F1 Score for Majority Classifier (hazard): 0.001
- Evaluating for label: product
    - F1 Score for Random Classifier (product): 0.000
    - F1 Score for Majority Classifier (product): 0.000

- Score Sub-Task 1 - Random Classifier: 0.057
- Score Sub-Task 2 - Random Classifier: 0.003
- Score Sub-Task 1 - Majority Classifier: 0.031
- Score Sub-Task 2 - Majority Classifier: 0.001

###  Results and Observations
- Label: `hazard-category`
    - Random Classifier F1: 0.076
    - Majority Classifier F1: 0.047
- Performance is slightly better for the Random Classifier, but both are low. Macro F1 averages over all classes, and the Majority Classifier scores zero on every class except the most frequent one, while uniform random guessing rarely matches the true labels.

- Label: `product-category`
    - Random Classifier F1: 0.034
    - Majority Classifier F1: 0.019
- Performance drops further here, which suggests more classes or higher imbalance for this label.

- Label: `hazard`
    - Random Classifier F1: 0.005
    - Majority Classifier F1: 0.001
    - Both scores are extremely low, possibly due to:
        - Large number of classes.
        - Sparse distribution of classes.
        - Poor representation of these classes in the Random Classifier's uniform predictions or Majority Classifier's mode.

- Label: `product`
    - Random Classifier F1: 0.000
    - Majority Classifier F1: 0.000
    - Both baselines score a macro F1 that rounds to zero for this label. This is expected for predictors that ignore the input when the label has a very large number of classes, many of them rare.

- Sub-Tasks
    - Score Sub-Task 1: hazard-category & product-category
        - Random Classifier Score: 0.057
        - Majority Classifier Score: 0.031
        - Indicates the overall performance when combining macro F1 scores for hazard-category and product-category. Random guessing outperforms predicting the most frequent class, but both are weak.
    - Score Sub-Task 2: hazard & product
        - Random Classifier Score: 0.003
        - Majority Classifier Score: 0.001
        - Reflects the severe challenge for these labels. The performance is near zero, affirming the labels require more sophisticated approaches.

-  `Conclusions` : 
- Baseline as a Benchmark:

    - **The poor F1 scores highlight the challenging nature of the task and dataset**.
    - These results provide a benchmark to evaluate future models. Any model achieving significantly higher F1 scores would demonstrate effective learning.

- Dataset Imbalance:

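    - The Majority Classifier's low macro F1 is expected, since it scores zero on every class except the most frequent one. On its own it therefore reflects the number of classes as much as the imbalance, so the class distributions should be inspected directly.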
    - Future models should address this using strategies like stratified sampling, oversampling, or weighted loss functions.

- Complexity of Labels:

    - The complexity increases from hazard-category and product-category to hazard and product, as reflected in the declining F1 scores.

- Actionable Insights:

    - Preprocessing: Investigate the class distributions and apply balancing techniques.
    - Feature Engineering: Consider enhancing the feature column (e.g., using embeddings).
    - Advanced Models: Apply models capable of handling imbalance, such as tree-based methods, ensemble models, or neural networks.
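
As an illustration of the weighted-loss idea mentioned under Dataset Imbalance, the sketch below uses `compute_class_weight`, which is already imported at the top of this notebook. The label vector is hypothetical, and this is only one possible way to derive class weights, not the approach used in the results above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 x "a", 3 x "b", 1 x "c"
y = np.array(["a"] * 6 + ["b"] * 3 + ["c"])
classes = np.unique(y)

# "balanced" weights follow n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

A dictionary like this can be passed as the `class_weight` argument of `LogisticRegression` or `RandomForestClassifier`; passing `class_weight="balanced"` directly has the same effect.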


### Traditional and Advanced Algorithms


## Strategy for Model Selection and Evaluation

### Overview
Our approach involves `systematically evaluating three machine learning algorithms on two input types`, **"title"** and **"text"**. The goal is to identify the `best-performing model` based on evaluation metrics and a custom competition evaluation metric. Due to time and memory constraints, we set specific parameter values for each algorithm after initial manual investigation. This allows us to efficiently gain an overview of model performance and make informed decisions about **further optimization**.

---

### Step-by-Step Strategy

#### 1. Initial Model Evaluation with "Title" Input
- **Algorithms Tested**:
  - `Logistic Regression`
  - `Random Forest`
  - `XGBoost`
- **Parameter Setting**:
  - Parameters for each algorithm are manually tuned based on preliminary analysis to balance performance and computational efficiency.
- **Evaluation**:
  - Models are assessed using:
    - Standard evaluation metrics (e.g., accuracy, precision, recall, F1-score).
    - A custom evaluation metric provided by the competition.
- **Objective**:
  - Identify the most `promising algorithm` based on "title" input.

---

#### 2. Evaluation with "Text" Input
- **Algorithms Tested**:
  - Logistic Regression
  - Random Forest
  - XGBoost
- **Evaluation**:
  - The same evaluation metrics and competition-specific metric are applied as in the "title" input analysis.
- **Objective**:
  - Determine the best-performing algorithm for the "text" input.

---

#### 3. Comparison of Best Models
- The top-performing models from the **"title"** and **"text"** input evaluations are compared.
- **Selection**:
  - Based on their performance across all metrics, the superior model is selected.

---

#### 4. Optimization of the Final Model
- The chosen algorithm undergoes parameter optimization to refine its performance.
- **Constraints**:
  - Due to time limitations, cross-validation (e.g., K-fold validation) will not be applied.
  - Instead, a streamlined validation approach is used to ensure efficient optimization without excessive computational overhead.

---

#### 5. Baseline Comparison
- Throughout the process, model performance is benchmarked against baseline models:
  - **Random Prediction**: A model that predicts randomly.
  - **Majority Class Prediction**: A model that always predicts the most frequent class.
- **Objective**:
  - Provide context for evaluating the added value of the trained algorithms.

---

### Summary
This structured methodology ensures a thorough evaluation of multiple algorithms across different input types, with a focus on balancing computational efficiency and performance. By the end of this process, the goal is to identify the best-performing algorithm together with the best input ("title" or "text") and optimize it for deployment within the constraints of time and resources.

### Part A. Benchmark Analysis Title


- First, we include the custom evaluation metric provided on the competition page.
    - It will be used in the evaluation of all algorithms.

In [None]:
# Helper function for computing the custom evaluation metric for the sub-tasks.
def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    """
    Compute a custom F1 score that considers hazards and products together.
    """
    # Reset indices to ensure alignment
    hazards_true = hazards_true.reset_index(drop=True)
    products_true = products_true.reset_index(drop=True)
    hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
    products_pred = pd.Series(products_pred).reset_index(drop=True)

    # Compute F1 for hazards
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)

    # Compute F1 for products, only where hazards predictions match ground truth
    mask = hazards_pred == hazards_true
    f1_products = f1_score(
        products_true[mask],
        products_pred[mask],
        average='macro',
        zero_division=0
    )

    # Return the combined metric
    return (f1_hazards + f1_products) / 2.
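
To make the masking step concrete, here is a small worked example of the metric on four made-up incidents. The helper is repeated so the snippet runs on its own; the hazard and product labels are hypothetical.

```python
import pandas as pd
from sklearn.metrics import f1_score

def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    # Same logic as the helper above, repeated so the example is self-contained
    hazards_true = hazards_true.reset_index(drop=True)
    products_true = products_true.reset_index(drop=True)
    hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
    products_pred = pd.Series(products_pred).reset_index(drop=True)
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)
    mask = hazards_pred == hazards_true
    f1_products = f1_score(products_true[mask], products_pred[mask],
                           average='macro', zero_division=0)
    return (f1_hazards + f1_products) / 2.

# Hypothetical toy labels for four incidents
hazards_true = pd.Series(["biological", "biological", "chemical", "chemical"])
hazards_pred = ["biological", "chemical", "chemical", "chemical"]
products_true = pd.Series(["meat", "fish", "meat", "nuts"])
products_pred = ["meat", "fish", "fish", "nuts"]

# The second incident has a wrong hazard prediction, so its (correct)
# product prediction is excluded from the product F1.
score = compute_score(hazards_true, products_true, hazards_pred, products_pred)
print(f"Toy custom score: {score:.3f}")
```

The example shows that a correct product prediction earns no credit when the hazard for the same incident is wrong, so hazard quality bounds the overall score.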

### Logistic Regression TF-IDF Title 

### High-Level Explanation of `train_log_regression_classifiers`

### Objective
- Train multinomial logistic regression classifiers for four labels and compute custom metrics to evaluate performance.

### Inputs
- **`dataframe`**: Dataset containing features and target labels.
- **`feature_column`**: The column in the dataframe to be used for feature extraction.

### Outputs
- **`classifiers`**: A dictionary of trained logistic regression classifiers for each label.
- **`vectorizers`**: A dictionary of TF-IDF vectorizers used for feature extraction.
- **`custom_metrics`**: A dictionary of custom evaluation scores for subtasks on test data.

---

### Key Steps

1. **Initialization**:
   - Set a random seed for reproducibility.
   - Prepare dictionaries to store classifiers, vectorizers, and custom metrics.

2. **Label-Specific Training**:
   - Iterate over four labels: 
     - `hazard-category`, 
     - `product-category`, 
     - `hazard`, 
     - `product`.
   - For each label:
     - Perform a stratified train-test split to maintain class distribution.
     - Use `TfidfVectorizer` to extract character-based n-gram features (2-5) with a maximum of 5000 features.
     - Train a multinomial logistic regression classifier with a maximum of 100 iterations.
     - Evaluate the model using F1 score and generate a classification report.
     - Save the trained classifier, vectorizer, and test data.

3. **Custom Metric Calculation**:
   - Compute task-specific metrics for two subtasks:
     - **Subtask 1**: Score the `hazard-category` and `product-category` predictions together.
     - **Subtask 2**: Score the `hazard` and `product` predictions together.
   - Combine F1 scores for hazards and products to compute the final metric for each subtask.

4. **Logging and Output**:
   - Print F1 scores for each label and custom metric scores for subtasks.
   - Save classification reports to a dedicated directory.
   - Return the trained classifiers, vectorizers, and custom metrics.

---

### Key Techniques
- **TF-IDF Vectorization**: Convert text into feature vectors using character n-grams.
- **Logistic Regression Classifier**: Train multinomial classifiers with limited iterations for efficient predictions.
- **Custom Metric Calculation**: Evaluate subtasks by combining F1 scores for hazards and products.

This function trains one multiclass text classifier per label and provides task-specific evaluations with a focus on hazards and products.
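
The core of the approach, character n-gram TF-IDF followed by logistic regression, can be sketched on a handful of made-up titles. This is a simplified illustration, not the full function: the titles and labels are hypothetical, `min_df` and `max_df` are left at their defaults because the toy corpus is too small for the thresholds used in the notebook, and the `multi_class` argument is left at its default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy titles and hazard categories
titles = [
    "listeria in smoked sausage",
    "listeria in ham slices",
    "listeria in soft cheese",
    "undeclared milk in cookies",
    "undeclared peanut in chocolate",
    "undeclared egg in pasta",
]
labels = ["biological"] * 3 + ["allergens"] * 3

# Character n-grams of length 2 to 5, capped at 5000 features
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char',
                             ngram_range=(2, 5), max_features=5000)
X = vectorizer.fit_transform(titles)

classifier = LogisticRegression(max_iter=100, random_state=2024)
classifier.fit(X, labels)

# Classify an unseen title with the fitted vectorizer and classifier
new_title = ["listeria in salami"]
prediction = classifier.predict(vectorizer.transform(new_title))[0]
print(f"{X.shape[1]} n-gram features, prediction: {prediction}")
```

Character n-grams let the model share evidence between spelling variants and truncated words, which suits short, noisy titles.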


In [13]:
def train_log_regression_classifiers(dataframe, feature_column):
    """
    Train multinomial logistic regression classifiers for four labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024,
            stratify=dataframe[label] # hold proportion of classes distribution
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000) # limit to 5000
        #TODO space reduction ???
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train Logistic Regression classifier
        classifier = LogisticRegression(max_iter=100, random_state=2024, multi_class='multinomial') # limit to max iter 
        classifier.fit(X_train_tfidf, y_train)

        # Store the trained classifier
        classifiers[label] = classifier

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        logreg_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {logreg_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test, predictions, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports/logreg', exist_ok=True)
        with open(f'reports/logreg/logreg_classifier_report_{label}_{feature_column}.txt', 'w') as logreg_file:
            logreg_file.write(f"Classification Report for {label}_{feature_column}:\n")
            logreg_file.write(report)
            logreg_file.write(f"F1 Score: {logreg_f1:.3f}\n")

    # Compute the custom metric for hazards and products using test data only.
    # Note: each label was split with its own stratification, so the test rows
    # of the paired labels are matched by position, not by original row.
    custom_metrics['subtask_1'] = compute_score(
        test_data['hazard-category']['y_test'],
        test_data['product-category']['y_test'],
        classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
        classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
        test_data['hazard']['y_test'],
        test_data['product']['y_test'],
        classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
        classifiers['product'].predict(test_data['product']['X_test_tfidf'])
    )


    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics

- Note: I manually tried several values for the `max_features` parameter of `TfidfVectorizer`. With the default value combined with `max_iter=1000` in Logistic Regression, I observed that training was quite slow.
- So, I settled on `TfidfVectorizer(max_features=5000)` and `LogisticRegression(max_iter=100)`.
- This manual experimentation was crucial due to limited local resources.
- Furthermore, instead of `multi_class='multinomial'` in `LogisticRegression`, we manually tried the `one-vs-all` strategy, but it was computationally costly, so we settled on `multinomial`.

In [8]:
classifiers, vectorizers, custom_metrics = train_log_regression_classifiers(df_augmented, 'title')
print("Custom Metric Scores on Test Data:")
print(custom_metrics)

Training classifier for label: hazard-category




F1 Score for hazard-category: 0.683
                                precision    recall  f1-score   support

                     allergens       0.82      0.93      0.87       863
                    biological       0.84      0.92      0.88       771
                      chemical       0.80      0.83      0.82       329
food additives and flavourings       1.00      0.21      0.35        14
                foreign bodies       0.86      0.75      0.80       253
                         fraud       0.84      0.71      0.77       280
                     migration       1.00      0.93      0.96        29
          organoleptic aspects       1.00      0.23      0.38        39
                  other hazard       0.92      0.46      0.61       103
              packaging defect       0.83      0.26      0.39        39

                      accuracy                           0.83      2720
                     macro avg       0.89      0.62      0.68      2720
                  weighted



F1 Score for product-category: 0.688
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.95      0.50      0.66        38
                      cereals and bakery products       0.65      0.70      0.67       252
     cocoa and cocoa preparations, coffee and tea       0.77      0.71      0.74        99
                                    confectionery       0.81      0.37      0.51        91
dietetic foods, food supplements, fortified foods       0.79      0.78      0.78        95
                                    fats and oils       1.00      0.62      0.77        32
                                   feed materials       1.00      0.73      0.84        11
                   food additives and flavourings       1.00      0.70      0.82        10
                           food contact materials       1.00      0.89      0.94        35
                            fruits and vegetables   



F1 Score for hazard: 0.476
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         5
                                   abnormal smell       1.00      0.43      0.60         7
                                  alcohol content       0.82      0.90      0.86        10
                                        alkaloids       1.00      0.67      0.80         9
                                        allergens       0.00      0.00      0.00         7
                                           almond       0.60      0.48      0.54        31
             altered organoleptic characteristics       1.00      0.90      0.95        10
                                        amygdalin       0.62      0.89      0.73         9
                           antibiotics, vet drugs       0.00      0.00      0.00         6
                                    bacillus spp.       1.00  



F1 Score for product: 0.344
                                                                        precision    recall  f1-score   support

                                                Catfishes (freshwater)       0.40      0.67      0.50         3
                                                       Dried pork meat       0.00      0.00      0.00         1
                                                 Fishes not identified       0.17      0.44      0.24         9
                                                    Groupers (generic)       0.00      0.00      0.00         1
                                              Not classified pork meat       1.00      0.75      0.86         4
                                            Pangas catfishes (generic)       0.00      0.00      0.00         2
                                   Precooked cooked pork meat products       0.00      0.00      0.00         6
                                    Torpedo-shaped catfishes (generic)     

- **Important Note**: In this cell we explain in depth how we interpret the `evaluation report`. Due to time limitations, we will not repeat this for every model. In the evaluations of the other models that follow, we focus on the custom evaluation scores of the competition subtasks and on the macro avg f1-score.
  - General comment : `Utilize metrics like **macro average F1 score** to emphasize balanced performance across all classes`.

Notes for evaluation of `Logistic Regression` (input `title`): 
   - The Logistic Regression model outperforms both the majority and random classifier baselines, which is a positive outcome. 
        - If the model’s evaluation metrics were close to those of a random classifier, it would indicate that the training process was ineffective, as the model would essentially be making predictions at random. 
        - Similarly, if the model’s evaluation metrics were close to those of the majority classifier, it would suggest that the model requires further optimization, as it primarily predicts the most frequent class instead of learning meaningful patterns from the data. 
   - Analysis of Evaluation Report : 
   ### General Observations:
   - **Overall Accuracy:** The model achieves an accuracy of **83%**, which is decent but insufficient as a standalone metric, especially for imbalanced datasets.
   - **Macro Avg (F1 Score):** The macro average F1 score is **0.68**, highlighting significant performance variation across categories. This is much lower than the weighted F1 score of **0.82**, suggesting better performance on frequently occurring categories.
   - **Weighted Avg (F1 Score):** The weighted F1 score is **0.82**, as it accounts for the class distribution, showing the model is biased toward the majority classes.
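
The gap between the macro and the weighted average can be reproduced on a toy example. The sketch below is illustrative only: the ten labels and the majority-class predictor are assumptions made for the demonstration, not data from this benchmark.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 8 samples of a frequent class, 2 of a rare one.
y_true = np.array(["biological"] * 8 + ["migration"] * 2)
X_dummy = np.zeros((len(y_true), 1))  # DummyClassifier ignores the features

# A majority-class baseline always predicts the most frequent label.
majority = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = majority.predict(X_dummy)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# The rare class is never predicted, so it pulls the macro average down
# far more than the weighted average.
print(f"accuracy={accuracy:.3f}  macro F1={macro_f1:.3f}  weighted F1={weighted_f1:.3f}")
```

Accuracy and weighted F1 stay high because they are dominated by the frequent class, while macro F1 counts every class equally. This is why macro F1 is the metric we prioritise.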

   ### Class-wise Performance:
   ### Strong Performers:
   - **Allergens (F1: 0.87):** High precision (**0.82**) and recall (**0.93**), indicating consistent performance.
   - **Biological (F1: 0.88):** Strong performance with balanced precision and recall.
   - **Chemical (F1: 0.82):** Good overall performance.
   - **Migration (F1: 0.96):** Excellent results, though support is very low (**29 samples**), raising concerns about generalizability.

   ### Weak Performers:
   - **Food Additives and Flavourings (F1: 0.35):** Perfect precision (**1.00**) but extremely poor recall (**0.21**), indicating very conservative predictions.
   - **Organoleptic Aspects (F1: 0.38):** Similar to the above, precision is high but recall is very low.
   - **Packaging Defect (F1: 0.39):** Poor recall (**0.26**) suggests difficulty identifying these cases.
   - **Other Hazard (F1: 0.61):** Moderate performance with low recall (**0.46**).

   ### Overfitting Indicators:
   - **Precision-Recall Disparity:** The significant gap between **macro average precision (0.89)** and **macro average recall (0.62)** suggests potential overfitting. The model predicts the dominant classes confidently but misses many minority-class samples.
   - **Low Recall for Minority Classes:** For low-support categories like "food additives and flavourings" and "organoleptic aspects," the model achieves perfect precision but very low recall (**0.21** and **0.23**), which may indicate memorization rather than generalization.

   ### Model's Strengths and Weaknesses:
   ### Strengths:
   - Excellent performance on majority classes like allergens and biological hazards.
   - High precision in predicting categories with substantial support.

   ### Weaknesses:
   - Poor performance on minority classes due to class imbalance.
   - Bias toward majority classes, with low recall for underrepresented categories.

   ### Next steps for improvement if it is verified as the best model:
     1. **Address Class Imbalance:**
   - Use techniques such as oversampling minority classes, undersampling majority classes, or employing weighted loss functions during training.
    - Note: Due to memory and time limitations, oversampling is not feasible here. SMOTE, for example, would further enlarge the already large feature matrix that results from vectorizing the text with TF-IDF.

     2. **Focus on Evaluation Metrics:**
   - Utilize metrics like **macro average F1 score** to emphasize balanced performance across all classes.
      - Note: `We will prioritise` this metric.
   - `Pay closer attention to **recall**, especially for minority classes, to improve generalization`.

     3. **Introduce Regularization:**
   - Apply regularization techniques such as L2 regularization to reduce overfitting (dropout applies only to neural networks).
   - Monitor performance on a validation set to detect overfitting during training.
    - Note: In Logistic Regression, regularization is controlled by the parameter `C` (the inverse of the regularization strength): lower values of `C` mean stronger regularization.
      - If Logistic Regression proves to be our best model at first glance compared to XGBoost (advanced algorithm) and Random Forest (simple algorithm), we will apply hyperparameter tuning with emphasis on the possible values of this parameter.
      - We do not tune it beforehand, as our first aim is to quickly detect our best model.
- Similarly, we interpret the evaluation reports for the remaining labels.

- Regarding the scores of the competition custom metrics, we have: <br> 
    - `subtask_1`: `~0.689`, <br> 
    - `subtask_2`: `~0.425`<br> 
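
The class-imbalance and regularization suggestions above can be sketched as follows. This is a minimal illustration on hypothetical labels and synthetic data (`make_classification`), not the tuning we actually ran: it shows how `balanced` class weights are derived and how a lower `C` shrinks the Logistic Regression coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 8 frequent vs. 2 rare samples.
y_toy = np.array(["allergens"] * 8 + ["migration"] * 2)
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weights = dict(zip(classes, weights))
print(class_weights)  # the rare class receives the larger weight

# Synthetic data, used only to show the effect of C on the learned coefficients.
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.9, 0.1], random_state=0
)

coef_norms = {}
for C in (0.01, 100.0):
    # class_weight="balanced" applies the weighting shown above inside the loss.
    model = LogisticRegression(C=C, class_weight="balanced", max_iter=1000)
    model.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

# Lower C means stronger L2 regularization, hence a smaller coefficient norm.
print(coef_norms)
```

Unlike oversampling, class weighting does not add rows to the TF-IDF matrix, so it avoids the memory problem noted for SMOTE.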



In [9]:
# predict for hazard category : 
# Access specific classifiers
# hazard_classifier = classifiers['hazard']
# product_classifier = classifiers['product']
# hazard_category_classifier = classifiers['hazard-category']
# product_category_classifier = classifiers['product-category']

In [10]:
# vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5,max_features=5000)
# vectorizer.fit_transform(df_augmented['title'])
# X_val=vectorizer.transform(testset_competition['title'])

In [11]:
# hazard_classifier.predict(X_val)

In [12]:
# product_classifier.predict(X_val)

In [13]:
# hazard_category_classifier.predict(X_val)

In [14]:
# product_category_classifier.predict(X_val)

### Random Forest TF-IDF Title

### High-Level Explanation of `train_random_forest_classifiers`

### Objective
- Train Random Forest classifiers for four labels and compute custom metrics to evaluate performance.

### Inputs
- **`dataframe`**: Dataset containing features and target labels.
- **`feature_column`**: The column in the dataframe to be used for feature extraction.

### Outputs
- **`classifiers`**: A dictionary of trained Random Forest classifiers for each label.
- **`vectorizers`**: A dictionary of TF-IDF vectorizers used for feature extraction.
- **`custom_metrics`**: A dictionary of custom evaluation scores for subtasks on test data.

---

### Key Steps

1. **Initialization**:
   - Set a random seed for reproducibility.
   - Prepare dictionaries to store classifiers, vectorizers, and custom metrics.

2. **Label-Specific Training**:
   - Iterate over four labels: 
     - `hazard-category`, 
     - `product-category`, 
     - `hazard`, 
     - `product`.
   - For each label:
     - Perform a stratified train-test split to maintain class distribution.
     - Use `TfidfVectorizer` to extract character-based n-gram features (2-5) with a maximum of 5000 features.
     - Train a Random Forest classifier with 100 decision trees using the extracted TF-IDF features.
     - Evaluate the model using F1 score and generate a classification report.
     - Save the trained classifier, vectorizer, and test data.

3. **Custom Metric Calculation**:
   - Compute task-specific metrics for two subtasks:
     - **Subtask 1**: Score the `hazard-category` and `product-category` predictions jointly.
     - **Subtask 2**: Score the `hazard` and `product` predictions jointly.
   - For each subtask, average the macro F1 score for hazards with the macro F1 score for products, where the latter is computed only on samples whose hazard prediction is correct.

4. **Logging and Output**:
   - Print F1 scores for each label and custom metric scores for subtasks.
   - Save classification reports to a dedicated directory.
   - Return the trained classifiers, vectorizers, and custom metrics.
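
The custom metric of step 3 can be illustrated on toy data. The function below mirrors the `compute_score` helper that appears commented out in the code cell of this section. The four-sample labels are hypothetical and serve only to make the arithmetic easy to follow.

```python
import pandas as pd
from sklearn.metrics import f1_score


def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    """Mean of the hazard macro F1 and the product macro F1, the latter
    restricted to rows whose hazard is predicted correctly."""
    # Reset indices so that the boolean mask below aligns row by row.
    hazards_true = pd.Series(hazards_true).reset_index(drop=True)
    products_true = pd.Series(products_true).reset_index(drop=True)
    hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
    products_pred = pd.Series(products_pred).reset_index(drop=True)

    f1_hazards = f1_score(hazards_true, hazards_pred, average="macro", zero_division=0)

    # Products are scored only where the hazard prediction matches the truth.
    mask = hazards_pred == hazards_true
    f1_products = f1_score(
        products_true[mask], products_pred[mask], average="macro", zero_division=0
    )
    return (f1_hazards + f1_products) / 2.0


# Hypothetical ground truth and predictions (the last hazard is wrong).
hazards_true = ["biological", "biological", "chemical", "chemical"]
hazards_pred = ["biological", "biological", "chemical", "biological"]
products_true = ["meat", "fish", "meat", "fish"]
products_pred = ["meat", "fish", "fish", "fish"]

score = compute_score(hazards_true, products_true, hazards_pred, products_pred)
print(f"custom score: {score:.3f}")
```

Here the hazard macro F1 is 11/15 and the product macro F1 on the three rows with a correct hazard is 2/3, so the combined score is 0.7. The fourth row does not contribute to the product score at all, because its hazard prediction is wrong.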

---

### Key Techniques
- **TF-IDF Vectorization**: Convert text into feature vectors using character n-grams.
- **Random Forest Classifier**: Train ensemble models with 100 decision trees for robust predictions.
- **Custom Metric Calculation**: Evaluate subtasks by combining F1 scores for hazards and products.

This function trains one multi-class text classifier per label and provides task-specific evaluations with a focus on hazards and products.
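
To make the feature extraction concrete, here is a minimal sketch of the same idea (character n-gram TF-IDF followed by a Random Forest) on four hypothetical titles. The `max_df`, `min_df` and `max_features` settings of the real function are left out on purpose, since they would filter out every n-gram on such a tiny corpus.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical titles and labels, for illustration only.
titles = [
    "Salmonella detected in frozen chicken",
    "Listeria found in smoked salmon",
    "Undeclared milk in chocolate bar",
    "Undeclared peanuts in cookies",
]
labels = ["biological", "biological", "allergens", "allergens"]

pipeline = Pipeline([
    # Character n-grams of length 2 to 5, as in the function described above.
    ("tfidf", TfidfVectorizer(strip_accents="unicode", analyzer="char", ngram_range=(2, 5))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=2024)),
])
pipeline.fit(titles, labels)

# The vocabulary maps each character n-gram to a column of the TF-IDF matrix.
vocabulary = pipeline.named_steps["tfidf"].vocabulary_
print(len(vocabulary), "character n-grams, e.g.", sorted(vocabulary)[:5])

prediction = pipeline.predict(["Undeclared egg in pasta"])[0]
print(prediction)
```

Wrapping the vectorizer and the classifier in a `Pipeline` keeps the two fitted objects together, which is an alternative to storing them in separate dictionaries.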


In [6]:
# def compute_score(hazards_true, products_true, hazards_pred, products_pred):
#     """
#     Compute a custom F1 score that considers hazards and products together.
#     """
#     # Reset indices to ensure alignment
#     hazards_true = hazards_true.reset_index(drop=True)
#     products_true = products_true.reset_index(drop=True)
#     hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
#     products_pred = pd.Series(products_pred).reset_index(drop=True)

#     # Compute F1 for hazards
#     f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)

#     # Compute F1 for products, only where hazards predictions match ground truth
#     mask = hazards_pred == hazards_true
#     f1_products = f1_score(
#         products_true[mask],
#         products_pred[mask],
#         average='macro',
#         zero_division=0
#     )

#     # Return the combined metric
#     return (f1_hazards + f1_products) / 2.


from sklearn.ensemble import RandomForestClassifier  # not part of the import cell at the top


def train_random_forest_classifiers(dataframe, feature_column):
    """
    Train Random Forest classifiers for four labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024,
            stratify=dataframe[label]  # hold proportion of class distribution
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)  # limit to 5000
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train Random Forest classifier
        classifier = RandomForestClassifier(n_estimators=100, random_state=2024, n_jobs=-1)  # Using 100 trees
        classifier.fit(X_train_tfidf, y_train)

        # Store the trained classifier
        classifiers[label] = classifier

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        rf_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {rf_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test, predictions, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports/random_forest', exist_ok=True)
        with open(f'reports/random_forest/rf_classifier_report_{label}_{feature_column}.txt', 'w') as rf_file:
            rf_file.write(f"Classification Report for {label}_{feature_column}:\n")
            rf_file.write(report)
            rf_file.write(f"F1 Score: {rf_f1:.3f}\n")

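    # Compute the custom metric for hazards and products using test data only.
    # Caveat: each label was split with its own `stratify` column, so the two test
    # sets passed to compute_score are not guaranteed to contain the same rows.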
    custom_metrics['subtask_1'] = compute_score(
        test_data['hazard-category']['y_test'],
        test_data['product-category']['y_test'],
        classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
        classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
        test_data['hazard']['y_test'],
        test_data['product']['y_test'],
        classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
        classifiers['product'].predict(test_data['product']['X_test_tfidf'])
    )

    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics
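
Because `train_test_split` is called once per label with a different `stratify` column, the test rows of two labels need not coincide, while the custom metric compares hazards and products row by row. The sketch below, on a hypothetical 20-row dataframe, checks this and shows one possible alternative: a single shared split whose indices are reused for every label. Stratifying the shared split on `hazard` is an assumption made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with two label columns.
df = pd.DataFrame({
    "title": [f"recall notice {i}" for i in range(20)],
    "hazard": ["biological", "chemical"] * 10,
    "product": ["meat"] * 10 + ["fish"] * 10,
})

# Per-label splits, as in the function above: same seed, different stratify column.
_, test_hazard = train_test_split(df, test_size=0.2, random_state=2024, stratify=df["hazard"])
_, test_product = train_test_split(df, test_size=0.2, random_state=2024, stratify=df["product"])
print("Per-label splits select identical rows:", test_hazard.index.equals(test_product.index))

# Shared split: draw the row indices once and reuse them for every label.
train_idx, test_idx = train_test_split(
    df.index.to_numpy(), test_size=0.2, random_state=2024, stratify=df["hazard"]
)
y_test_hazard = df.loc[test_idx, "hazard"]
y_test_product = df.loc[test_idx, "product"]
print("Shared split keeps rows aligned:", y_test_hazard.index.equals(y_test_product.index))
```

With a shared split, row `i` of the hazard test labels and row `i` of the product test labels always describe the same sample. The trade-off is that only one label can drive the stratification.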


In [7]:
train_random_forest_classifiers(df_augmented, 'title')

Training classifier for label: hazard-category


F1 Score for hazard-category: 0.742
                                precision    recall  f1-score   support

                     allergens       0.86      0.94      0.90       863
                    biological       0.84      0.94      0.89       771
                      chemical       0.85      0.85      0.85       329
food additives and flavourings       1.00      0.43      0.60        14
                foreign bodies       0.82      0.75      0.78       253
                         fraud       0.92      0.72      0.81       280
                     migration       1.00      0.93      0.96        29
          organoleptic aspects       1.00      0.26      0.41        39
                  other hazard       0.92      0.55      0.69       103
              packaging defect       1.00      0.36      0.53        39

                      accuracy                           0.86      2720
                     macro avg       0.92      0.67      0.74      2720
                  weighted

({'hazard-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'hazard': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product': RandomForestClassifier(n_jobs=-1, random_state=2024)},
 {'hazard-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'hazard': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode')},
 {'subtask_1': np.float64(0.7602824625659861),
  'subtask_2': np.float64(0.7210352952785184)})

### Advanced XGBoost TF-IDF Title

### High-Level Explanation of `train_xgboost_classifiers`

### Objective
- Train XGBoost classifiers for four labels and compute custom metrics to evaluate performance.

### Inputs
- **`dataframe`**: Dataset containing features and target labels.
- **`feature_column`**: The column in the dataframe to be used for feature extraction.

### Outputs
- **`classifiers`**: A dictionary of trained XGBoost classifiers for each label, including label encoders.
- **`vectorizers`**: A dictionary of TF-IDF vectorizers used for feature extraction.
- **`custom_metrics`**: A dictionary of custom evaluation scores for subtasks on test data.

---

### Key Steps

1. **Initialization**:
   - Set a random seed for reproducibility.
   - Prepare dictionaries to store classifiers, vectorizers, and custom metrics.

2. **Label-Specific Training**:
   - Iterate over four labels:
     - `hazard-category`,
     - `product-category`,
     - `hazard`,
     - `product`.
   - For each label:
     - Perform a stratified train-test split to maintain class distribution.
     - Use `TfidfVectorizer` to extract character-based n-gram features (2-5) with a maximum of 2000 features.
     - Encode target labels into numeric values using `LabelEncoder`.
     - Train an XGBoost classifier with the following parameters:
       - Maximum depth of 6.
       - 50 estimators.
       - Learning rate of 0.2.
     - Evaluate the model using F1 score and generate a classification report.
     - Save the trained classifier, vectorizer, label encoder, and test data.

3. **Custom Metric Calculation**:
   - Compute task-specific metrics for two subtasks:
     - **Subtask 1**: Score the `hazard-category` and `product-category` predictions jointly.
     - **Subtask 2**: Score the `hazard` and `product` predictions jointly.
   - For each subtask, average the macro F1 score for hazards with the macro F1 score for products, where the latter is computed only on samples whose hazard prediction is correct.

4. **Logging and Output**:
   - Print F1 scores for each label and custom metric scores for subtasks.
   - Save classification reports to a dedicated directory.
   - Return the trained classifiers, vectorizers, and custom metrics.

---

### Key Techniques
- **TF-IDF Vectorization**: Convert text into feature vectors using character n-grams.
- **XGBoost Classifier**: Train scalable and efficient tree-based classifiers with softmax multi-class objectives.
- **Label Encoding**: Map categorical target labels to numeric values for compatibility with XGBoost.
- **Custom Metric Calculation**: Evaluate subtasks by combining F1 scores for hazards and products.

This function trains one multi-class text classifier per label and provides task-specific evaluations with a focus on hazards and products using a high-performance gradient boosting model.
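
The label-encoding step can be sketched in isolation. The labels and the integer predictions below are hypothetical. The point is that a model trained on encoded targets returns integers, and the stored `LabelEncoder` is needed to map them back to class names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string targets.
y_train = ["fraud", "allergens", "biological", "fraud"]

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

# Classes are sorted alphabetically and numbered from 0.
print(list(label_encoder.classes_))   # ['allergens', 'biological', 'fraud']
print(y_train_encoded.tolist())       # [2, 0, 1, 2]

# Hypothetical integer predictions, as a model trained on encoded labels would return.
predicted_ids = np.array([0, 2, 1])
predicted_names = label_encoder.inverse_transform(predicted_ids)
print(predicted_names.tolist())       # ['allergens', 'fraud', 'biological']
```

This is why the function stores the encoder next to each model: without it, predictions on new data could not be reported as hazard or product names.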


In [8]:
# def compute_score(hazards_true, products_true, hazards_pred, products_pred):
#     """
#     Compute a custom F1 score that considers hazards and products together.
#     """
#     # Reset indices to ensure alignment
#     hazards_true = hazards_true.reset_index(drop=True)
#     products_true = products_true.reset_index(drop=True)
#     hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
#     products_pred = pd.Series(products_pred).reset_index(drop=True)

#     # Compute F1 for hazards
#     f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)

#     # Compute F1 for products, only where hazards predictions match ground truth
#     mask = hazards_pred == hazards_true
#     f1_products = f1_score(
#         products_true[mask],
#         products_pred[mask],
#         average='macro',
#         zero_division=0
#     )

#     # Return the combined metric
#     return (f1_hazards + f1_products) / 2.

def train_xgboost_classifiers(dataframe, feature_column):
    """
    Train XGBoost classifiers for four labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024,
            stratify=dataframe[label]  # Maintain class distribution
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Encode target labels into numeric values
        label_encoder = LabelEncoder()
        y_train_encoded = label_encoder.fit_transform(y_train)
        y_test_encoded = label_encoder.transform(y_test)

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(
            strip_accents='unicode', analyzer='char', ngram_range=(2, 5),
            max_df=0.5, min_df=5, max_features=2000
        )
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train XGBoost classifier
        classifier = XGBClassifier(
            use_label_encoder=False,  # ignored by recent XGBoost versions (triggers the "not used" warning)
            eval_metric='mlogloss',
            objective='multi:softmax',
            max_depth=6,
            n_estimators=50,
            learning_rate=0.2,
            random_state=2024
        )
        classifier.fit(X_train_tfidf, y_train_encoded)

        # Store the trained classifier
        classifiers[label] = {
            'model': classifier,
            'label_encoder': label_encoder
        }

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test_encoded
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        xgb_f1 = f1_score(y_test_encoded, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {xgb_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test_encoded, predictions, zero_division=0, target_names=label_encoder.classes_)
        print(report)

        # Save the report
        os.makedirs('reports/xgboost', exist_ok=True)
        with open(f'reports/xgboost/xgboost_classifier_report_{label}_{feature_column}.txt', 'w') as xgb_file:
            xgb_file.write(f"Classification Report for {label}_{feature_column}:\n")
            xgb_file.write(report)
            xgb_file.write(f"F1 Score: {xgb_f1:.3f}\n")

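    # Compute the custom metric for hazards and products using test data only.
    # Caveat: each label was split with its own `stratify` column, so the two test
    # sets passed to compute_score are not guaranteed to contain the same rows.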
    custom_metrics['subtask_1'] = compute_score(
        pd.Series(test_data['hazard-category']['y_test']),
        pd.Series(test_data['product-category']['y_test']),
        pd.Series(classifiers['hazard-category']['model'].predict(test_data['hazard-category']['X_test_tfidf'])),
        pd.Series(classifiers['product-category']['model'].predict(test_data['product-category']['X_test_tfidf']))
    )

    custom_metrics['subtask_2'] = compute_score(
        pd.Series(test_data['hazard']['y_test']),
        pd.Series(test_data['product']['y_test']),
        pd.Series(classifiers['hazard']['model'].predict(test_data['hazard']['X_test_tfidf'])),
        pd.Series(classifiers['product']['model'].predict(test_data['product']['X_test_tfidf']))
    )

    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics


- We manually set `n_estimators` and `learning_rate` so that training runs in a reasonable time.

In [9]:
classifiers, vectorizers, custom_metrics = train_xgboost_classifiers(df_augmented, 'title')

Training classifier for label: hazard-category


Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard-category: 0.725
                                precision    recall  f1-score   support

                     allergens       0.84      0.93      0.89       863
                    biological       0.78      0.91      0.84       771
                      chemical       0.81      0.81      0.81       329
food additives and flavourings       1.00      0.43      0.60        14
                foreign bodies       0.87      0.71      0.78       253
                         fraud       0.91      0.69      0.78       280
                     migration       1.00      0.93      0.96        29
          organoleptic aspects       1.00      0.31      0.47        39
                  other hazard       0.90      0.43      0.58       103
              packaging defect       0.94      0.38      0.55        39

                      accuracy                           0.83      2720
                     macro avg       0.90      0.65      0.73      2720
                  weighted

Parameters: { "use_label_encoder" } are not used.



F1 Score for product-category: 0.749
                                                   precision    recall  f1-score   support

                              alcoholic beverages       1.00      0.66      0.79        38
                      cereals and bakery products       0.68      0.63      0.65       252
     cocoa and cocoa preparations, coffee and tea       0.74      0.71      0.72        99
                                    confectionery       0.87      0.53      0.66        91
dietetic foods, food supplements, fortified foods       0.91      0.79      0.85        95
                                    fats and oils       0.88      0.72      0.79        32
                                   feed materials       0.91      0.91      0.91        11
                   food additives and flavourings       1.00      0.60      0.75        10
                           food contact materials       0.94      0.89      0.91        35
                            fruits and vegetables   

Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard: 0.668
                                                   precision    recall  f1-score   support

                                        Aflatoxin       1.00      0.60      0.75         5
                                   abnormal smell       1.00      1.00      1.00         7
                                  alcohol content       0.91      1.00      0.95        10
                                        alkaloids       1.00      0.89      0.94         9
                                        allergens       1.00      0.29      0.44         7
                                           almond       0.82      0.58      0.68        31
             altered organoleptic characteristics       0.82      0.90      0.86        10
                                        amygdalin       0.88      0.78      0.82         9
                           antibiotics, vet drugs       1.00      0.33      0.50         6
                                    bacillus spp.       0.80  

Parameters: { "use_label_encoder" } are not used.



F1 Score for product: 0.632
                                                                        precision    recall  f1-score   support

                                                Catfishes (freshwater)       1.00      1.00      1.00         3
                                                       Dried pork meat       1.00      1.00      1.00         1
                                                 Fishes not identified       0.38      0.33      0.35         9
                                                    Groupers (generic)       1.00      1.00      1.00         1
                                              Not classified pork meat       1.00      0.75      0.86         4
                                            Pangas catfishes (generic)       1.00      0.50      0.67         2
                                   Precooked cooked pork meat products       0.75      0.50      0.60         6
                                    Torpedo-shaped catfishes (generic)     

Random Forest performs better than Logistic Regression. Because XGBoost is a powerful algorithm, we also tried other parameter values to see whether it can outperform Random Forest.

With the following parameters:

        classifier = XGBClassifier(
            use_label_encoder=False,
            eval_metric='mlogloss',
            objective='multi:softmax',
            max_depth=3,
            n_estimators=50,
            learning_rate=0.2,
            random_state=2024
        )

we obtained:

- Custom Metric for Subtask 1 (Test Data): 0.676
- Custom Metric for Subtask 2 (Test Data): 0.646

Changing `max_depth` from 3 to 6:

TODO: results

### Part B. Benchmark Analysis Text

In [14]:
classifiers, vectorizers, custom_metrics = train_log_regression_classifiers(df_augmented, 'text')

Training classifier for label: hazard-category




F1 Score for hazard-category: 0.730
                                precision    recall  f1-score   support

                     allergens       0.93      0.98      0.95       863
                    biological       0.93      0.96      0.94       771
                      chemical       0.81      0.93      0.86       329
food additives and flavourings       1.00      0.29      0.44        14
                foreign bodies       0.83      0.86      0.84       253
                         fraud       0.77      0.69      0.73       280
                     migration       1.00      1.00      1.00        29
          organoleptic aspects       1.00      0.18      0.30        39
                  other hazard       0.86      0.55      0.67       103
              packaging defect       0.94      0.38      0.55        39

                      accuracy                           0.89      2720
                     macro avg       0.91      0.68      0.73      2720
                  weighted



F1 Score for product-category: 0.659
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.89      0.66      0.76        38
                      cereals and bakery products       0.51      0.63      0.56       252
     cocoa and cocoa preparations, coffee and tea       0.73      0.46      0.57        99
                                    confectionery       0.82      0.41      0.54        91
dietetic foods, food supplements, fortified foods       0.77      0.71      0.74        95
                                    fats and oils       0.86      0.56      0.68        32
                                   feed materials       1.00      0.73      0.84        11
                   food additives and flavourings       1.00      0.70      0.82        10
                           food contact materials       1.00      0.97      0.99        35
                            fruits and vegetables   



F1 Score for hazard: 0.500
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         5
                                   abnormal smell       1.00      0.71      0.83         7
                                  alcohol content       0.83      1.00      0.91        10
                                        alkaloids       1.00      0.67      0.80         9
                                        allergens       0.00      0.00      0.00         7
                                           almond       0.73      0.52      0.60        31
             altered organoleptic characteristics       1.00      0.80      0.89        10
                                        amygdalin       0.75      1.00      0.86         9
                           antibiotics, vet drugs       0.00      0.00      0.00         6
                                    bacillus spp.       1.00  
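
The headline score for `hazard-category` (0.730) is well below the accuracy (0.89) in the same report. The reported score matches the `macro avg` row, which weights every class equally, so rare classes with low recall pull it down. The sketch below recomputes both averages from the rounded per-class values printed in the Logistic Regression report above; because the inputs are rounded to two decimals, the results are approximate.

```python
# Per-class F1 and support copied from the hazard-category report
# (Logistic Regression, 'text' input).
report = {
    "allergens": (0.95, 863),
    "biological": (0.94, 771),
    "chemical": (0.86, 329),
    "food additives and flavourings": (0.44, 14),
    "foreign bodies": (0.84, 253),
    "fraud": (0.73, 280),
    "migration": (1.00, 29),
    "organoleptic aspects": (0.30, 39),
    "other hazard": (0.67, 103),
    "packaging defect": (0.55, 39),
}

# Macro average: plain mean over classes, regardless of class size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
total_support = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"classes: {len(report)}, samples: {total_support}")
print(f"macro F1:    {macro_f1:.2f}")  # 0.73, matching the macro avg row
print(f"weighted F1: {weighted_f1:.2f}")
```

The four smallest-scoring classes (`food additives and flavourings`, `organoleptic aspects`, `packaging defect`, `other hazard`) account for fewer than 200 of the 2720 samples, yet each counts as one tenth of the macro average.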



In [11]:
train_random_forest_classifiers(df_augmented, 'text')

Training classifier for label: hazard-category
F1 Score for hazard-category: 0.770
                                precision    recall  f1-score   support

                     allergens       0.94      1.00      0.97       863
                    biological       0.96      0.96      0.96       771
                      chemical       0.83      0.92      0.87       329
food additives and flavourings       1.00      0.29      0.44        14
                foreign bodies       0.82      0.91      0.87       253
                         fraud       0.84      0.76      0.80       280
                     migration       1.00      1.00      1.00        29
          organoleptic aspects       1.00      0.23      0.38        39
                  other hazard       0.94      0.66      0.78       103
              packaging defect       0.95      0.49      0.64        39

                      accuracy                           0.91      2720
                     macro avg       0.93      0.72

({'hazard-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product-category': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'hazard': RandomForestClassifier(n_jobs=-1, random_state=2024),
  'product': RandomForestClassifier(n_jobs=-1, random_state=2024)},
 {'hazard-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product-category': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'hazard': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode'),
  'product': TfidfVectorizer(analyzer='char', max_df=0.5, max_features=5000, min_df=5,
                  ngram_range=(2, 5), strip_accents='unicode')},
 {'subtask_1': np.float64(0.7841164075998285),
  'subtask_2': np.float64(0.7580341615326311)})
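
The vectorizers in the output above use `analyzer='char'` with `ngram_range=(2, 5)`, meaning that features are overlapping character sequences of length 2 to 5 rather than whole words. The following is a rough, pure-Python sketch of that extraction step only; `char_ngrams` is a hypothetical helper, and it does not reproduce accent stripping, the `min_df` / `max_df` / `max_features` filtering, or the TF-IDF weighting that `TfidfVectorizer` applies.

```python
def char_ngrams(text, min_n=2, max_n=5):
    """Return all character n-grams of `text` with min_n <= n <= max_n."""
    # Lowercase and collapse runs of whitespace into single spaces.
    text = " ".join(text.lower().split())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the text.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


print(char_ngrams("pork"))
# ['po', 'or', 'rk', 'por', 'ork', 'pork']
```

Character n-grams of this kind tend to be robust to spelling variants and inflections, which is one plausible reason for choosing them for short, noisy product and hazard descriptions.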

In [10]:
classifiers, vectorizers, custom_metrics = train_xgboost_classifiers(df_augmented, 'text')

Training classifier for label: hazard-category


Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard-category: 0.840
                                precision    recall  f1-score   support

                     allergens       0.95      0.99      0.97       863
                    biological       0.96      0.97      0.97       771
                      chemical       0.88      0.96      0.91       329
food additives and flavourings       1.00      0.57      0.73        14
                foreign bodies       0.87      0.94      0.90       253
                         fraud       0.91      0.78      0.84       280
                     migration       1.00      0.97      0.98        29
          organoleptic aspects       1.00      0.46      0.63        39
                  other hazard       0.90      0.74      0.81       103
              packaging defect       0.91      0.51      0.66        39

                      accuracy                           0.93      2720
                     macro avg       0.94      0.79      0.84      2720
                  weighted

Parameters: { "use_label_encoder" } are not used.



F1 Score for product-category: 0.788
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.96      0.66      0.78        38
                      cereals and bakery products       0.60      0.74      0.66       252
     cocoa and cocoa preparations, coffee and tea       0.81      0.70      0.75        99
                                    confectionery       0.89      0.46      0.61        91
dietetic foods, food supplements, fortified foods       0.91      0.67      0.78        95
                                    fats and oils       1.00      0.66      0.79        32
                                   feed materials       1.00      0.82      0.90        11
                   food additives and flavourings       1.00      0.80      0.89        10
                           food contact materials       1.00      0.91      0.96        35
                            fruits and vegetables   

Parameters: { "use_label_encoder" } are not used.



F1 Score for hazard: 0.823
                                                   precision    recall  f1-score   support

                                        Aflatoxin       1.00      0.80      0.89         5
                                   abnormal smell       1.00      1.00      1.00         7
                                  alcohol content       0.91      1.00      0.95        10
                                        alkaloids       0.90      1.00      0.95         9
                                        allergens       1.00      0.57      0.73         7
                                           almond       0.84      0.84      0.84        31
             altered organoleptic characteristics       1.00      1.00      1.00        10
                                        amygdalin       1.00      1.00      1.00         9
                           antibiotics, vet drugs       1.00      0.67      0.80         6
                                    bacillus spp.       1.00  

Parameters: { "use_label_encoder" } are not used.



F1 Score for product: 0.697
                                                                        precision    recall  f1-score   support

                                                Catfishes (freshwater)       0.33      0.33      0.33         3
                                                       Dried pork meat       1.00      1.00      1.00         1
                                                 Fishes not identified       0.00      0.00      0.00         9
                                                    Groupers (generic)       1.00      1.00      1.00         1
                                              Not classified pork meat       0.50      0.75      0.60         4
                                            Pangas catfishes (generic)       1.00      0.50      0.67         2
                                   Precooked cooked pork meat products       1.00      0.83      0.91         6
                                    Torpedo-shaped catfishes (generic)     
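
To compare the three models on the `text` input, the macro F1 scores printed above can be collected in one place. The sketch below uses only the values visible in the (truncated) outputs, so the ranking is restricted to the labels for which every model has a reported score; the variable names are illustrative.

```python
# Macro F1 per label, copied from the 'text' outputs above.
# Labels whose score is not visible in the truncated output are omitted.
scores = {
    "LogisticRegression": {"hazard-category": 0.730, "product-category": 0.659,
                           "hazard": 0.500},
    "RandomForest": {"hazard-category": 0.770},
    "XGBoost": {"hazard-category": 0.840, "product-category": 0.788,
                "hazard": 0.823, "product": 0.697},
}

# Only compare on labels that every model reports.
common = set.intersection(*(set(per_label) for per_label in scores.values()))

for label in sorted(common):
    ranking = sorted(scores, key=lambda model: scores[model][label], reverse=True)
    print(label)
    for model in ranking:
        print(f"  {model:<20}{scores[model][label]:.3f}")

best_model = ranking[0]
print("best on shared labels:", best_model)
```

On the only label shared by all three outputs, `hazard-category`, XGBoost ranks first, followed by Random Forest and Logistic Regression.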