## Main Notebook for Training and Evaluation

---
> Evangelia P. Panourgia, Master Student in Data Science, AUEB <br />
> Department of Informatics, Athens University of Economics and Business <br />
> eva.panourgia@aueb.gr <br/><br/>


### Install Libraries

In [1]:
!pip install nltk optuna xgboost



### Setting the Scene 
- We will import all the needed libraries.

In [42]:
import os
import random
import string

import joblib  # For saving and loading models
import nltk
import numpy as np
import optuna
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from optuna.exceptions import TrialPruned
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    StratifiedShuffleSplit,
    cross_val_score,
    train_test_split,
)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight
from xgboost import XGBClassifier

# Download the NLTK resources used for tokenization, stop-word removal and lemmatization
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/evangelia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/evangelia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/evangelia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Load Data 
- We will load the preprocessed training data (`data_augmented__nlp_incidents_train.csv`), which has already gone through data augmentation (synthetic samples generated with synonyms) and basic NLP preprocessing.
- Furthermore, we will load the unlabeled competition data (`incidents.csv`) in order to predict its labels.

In [43]:
df_augmented = pd.read_csv('data/data_augmented__nlp_incidents_train.csv')  # load the training data after data augmentation
testset_competition = pd.read_csv('data/incidents.csv', index_col=0)  # load the competition test data (conception phase, unlabeled)

In [44]:
df_augmented = df_augmented[['title','text','hazard-category','product-category','hazard','product']]
df_augmented.head(3)  # preview preprocessed data

Unnamed: 0,title,text,hazard-category,product-category,hazard,product
0,recal notif fsis-024-94,case number 024-94 date open 07/01/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,smoked sausage
1,recal notif fsis-033-94,case number 033-94 date open 10/03/1994 date c...,biological,"meat, egg and dairy products",listeria spp,sausage
2,recal notif fsis-014-94,case number 014-94 date open 03/28/1994 date c...,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices


In [45]:
testset_competition.head(3)# preview test data 

Unnamed: 0,year,month,day,country,title,text
0,1994,5,5,us,Recall Notification: FSIS-017-94,Case Number: 017-94 \n Date Opene...
1,1994,5,12,us,Recall Notification: FSIS-048-94,Case Number: 048-94 \n Date Opene...
2,1995,4,16,us,Recall Notification: FSIS-032-95,Case Number: 032-95 \n Date Opene...


## Baselines
- Benchmark analysis is crucial for evaluating classification performance in multiclass imbalance settings because it provides reference points for how well your model is performing relative to simple baseline classifiers. The `Random Classifier` and `Majority Classifier` are commonly used as benchmarks for the following reasons:

### Random Classifier 
- A Random Classifier predicts class labels **uniformly** at random, so every class is equally likely regardless of the class distribution in the data. It sets a minimal baseline and helps understand:

- `Baseline Performance`: This represents the expected performance `without learning from the data`. `If a model performs worse than a random classifier, it indicates either issues in the model or unsuitable features`.

- `Chance Levels`: It shows what performance you'd `get by chance alone`, especially useful for imbalanced datasets where metrics like accuracy can be misleading.


### Majority Classifier

- A Majority Classifier always **predicts the majority class** (`the class with the highest frequency in the training data`). 

- It helps understand:

    - `Handling Imbalance`: In multiclass imbalanced datasets, accuracy can be dominated by the majority class. The majority classifier provides a baseline to compare how well your model captures minority classes.
    - `Baseline of Naïve Solutions`: The majority classifier reflects the simplest possible rule for prediction. If a model's performance is close to that of a majority classifier, it suggests the model is failing to generalize or adapt to the minority classes.
    - `Focus on Class Imbalance`: Metrics like weighted accuracy, balanced accuracy, or macro-F1 score should be significantly better than those achieved by the majority classifier to indicate that a model is addressing imbalance effectively.

- Note: the following code cell implements the Random and Majority Classifiers. To keep the logic close to that of a real model, we include the train/test split step, even though it is not needed for the Random Classifier, which is unaffected by the input and does not learn from the data.
    - However, this "skeleton" is useful as a base on which to build the remaining algorithms (both traditional and advanced).

- More specifically, 

    - Random Classifier  Effect of X: The X values (features) **do not influence the random classifier's predictions**. It does not learn from the data in the feature column. Its predictions are purely random, so changing X will not alter its performance.
    - Majority Classifier Effect of X: The feature column X is ignored by the majority classifier, as it does not use features for prediction. Instead, it looks only at the distribution of y in the training data.
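
To make the majority baseline concrete: under macro averaging, a classifier that always predicts one class scores zero F1 on every other class. With K classes and a majority share p (precision p, recall 1 on the majority class), its macro-F1 is therefore (1/K) · 2p/(1+p). The sketch below checks this on synthetic labels; the class names and counts are made up for illustration, and the score is computed on the same data the classifier was fitted on, purely to verify the arithmetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: 3 classes with shares 60% / 30% / 10%.
y = np.array(["a"] * 60 + ["b"] * 30 + ["c"] * 10)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)

# Macro-F1 as reported by scikit-learn.
observed = f1_score(y, pred, average="macro", zero_division=0)

# Closed form: only the majority class has a non-zero F1 (precision p, recall 1).
p, n_classes = 0.6, 3
expected = (2 * p / (1 + p)) / n_classes

print(f"observed={observed:.4f} expected={expected:.4f}")
```

The same arithmetic explains why the majority baseline's macro-F1 shrinks quickly as the number of classes grows, even when the majority class is large.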

### Regarding the Implementation 
- The `DummyClassifier in scikit-learn` is a baseline model designed to evaluate classification algorithms by comparing them against simplistic strategies. These strategies provide minimal logic to make predictions and are often used as benchmarks to understand how well a more complex model performs.
    - `strategy="uniform"` (for Random Classifier): 
        - Predicts a class randomly and uniformly across all possible classes.
        - Each class has an equal probability of being selected, irrespective of the class distribution in the training data.
        - Use Case: Ideal for scenarios where you want to simulate random guessing.
    - `strategy="most_frequent"` (for Majority Classifier):
        - Always predicts the most frequent class observed in the training data.
        - Ignores the input features entirely and focuses only on the training set's class distribution.
        - Use Case: Useful for understanding how well a naive baseline would perform if you simply predicted the majority class.
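
The two strategies can be tried on a toy example. This is a minimal sketch, assuming made-up labels and random features (none of these values come from the competition data); it illustrates that both baselines ignore the feature values and only depend on the training labels and the number of test rows.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Hypothetical training labels: "biological" is the majority class.
y_train = np.array(["biological"] * 7 + ["chemical"] * 2 + ["fraud"])
X_train = rng.normal(size=(len(y_train), 3))

# Two test sets with the same number of rows but unrelated feature values.
X_test_a = rng.normal(size=(5, 3))
X_test_b = rng.normal(size=(5, 3)) * 100

uniform_clf = DummyClassifier(strategy="uniform", random_state=2024).fit(X_train, y_train)
majority_clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

random_pred_a = uniform_clf.predict(X_test_a)
random_pred_b = uniform_clf.predict(X_test_b)
majority_pred = majority_clf.predict(X_test_a)

# With a fixed integer seed, the random predictions do not change with the features.
features_ignored = np.array_equal(random_pred_a, random_pred_b)

print("uniform:", random_pred_a)
print("most_frequent:", majority_pred)
print("features ignored:", features_ignored)
```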

In [6]:
def evaluate_baselines(dataframe, feature_column):
    """
    Function to evaluate random and majority classifiers on a given dataframe.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.
    """
    np.random.seed(42)  # For reproducibility

    # Train-test split (the baselines are not stratified)
    trainset, testset = train_test_split(
        dataframe, 
        test_size=0.2, 
        random_state=2024, 
        # "Skeleton" for the main algorithms: add stratify=... here to preserve the class proportions
    )
   
    # Random and Majority classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Evaluating for label: {label}")

        # Features and target
        X_train = trainset[feature_column]
        y_train = trainset[label]
        X_test = testset[feature_column]
        y_test = testset[label]

        # Random Classifier
        random_clf = DummyClassifier(strategy="uniform", random_state=2024)
        random_clf.fit(X_train, y_train)  # X is ignored; it is passed only to mimic the interface of a real model
        testset['predictions-random-' + label] = random_clf.predict(X_test)

        # Majority Classifier
        majority_clf = DummyClassifier(strategy="most_frequent")
        majority_clf.fit(X_train, y_train)  # X is ignored; it is passed only to mimic the interface of a real model
        testset['predictions-majority-' + label] = majority_clf.predict(X_test)

        # Compute F1 scores
        random_f1 = f1_score(y_test, testset['predictions-random-' + label], average='macro', zero_division=0)
        majority_f1 = f1_score(y_test, testset['predictions-majority-' + label], average='macro', zero_division=0)

        print(f"F1 Score for Random Classifier ({label}): {random_f1:.3f}")
        print(f"F1 Score for Majority Classifier ({label}): {majority_f1:.3f}")

        # Generate and save classification reports
        os.makedirs('reports/random', exist_ok=True)
        os.makedirs('reports/majority', exist_ok=True)

        random_report = classification_report(y_test, testset['predictions-random-' + label], zero_division=0)
        majority_report = classification_report(y_test, testset['predictions-majority-' + label], zero_division=0)

        with open(f'reports/random/random_classifier_report_{label}.txt', 'w') as random_file:
            random_file.write(f"Classification Report for Random Classifier ({label}):\n")
            random_file.write(random_report)

        with open(f'reports/majority/majority_classifier_report_{label}.txt', 'w') as majority_file:
            majority_file.write(f"Classification Report for Majority Classifier ({label}):\n")
            majority_file.write(majority_report)
        
        
    
    # Custom metric score calculation
    def compute_score(hazards_true, products_true, hazards_pred, products_pred):
        """
        Custom scoring function to compute the macro F1 score for hazards and products.
        
        Args:
            hazards_true: Ground truth labels for hazards.
            products_true: Ground truth labels for products.
            hazards_pred: Predicted labels for hazards.
            products_pred: Predicted labels for products.
        
        Returns:
            A float representing the combined macro F1 score.
        """
        f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)
        f1_products = f1_score(
            products_true[hazards_pred == hazards_true],
            products_pred[hazards_pred == hazards_true],
            average='macro', 
            zero_division=0
        )
        return (f1_hazards + f1_products) / 2.

    # Sub-task scores: Sub-Task 1 pairs hazard-category with product-category,
    # Sub-Task 2 pairs hazard with product
    print(f"Score Sub-Task 1 - Random Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-random-hazard-category'], testset['predictions-random-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Random Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-random-hazard'], testset['predictions-random-product']):.3f}")
    print(f"Score Sub-Task 1 - Majority Classifier: {compute_score(testset['hazard-category'], testset['product-category'], testset['predictions-majority-hazard-category'], testset['predictions-majority-product-category']):.3f}")
    print(f"Score Sub-Task 2 - Majority Classifier: {compute_score(testset['hazard'], testset['product'], testset['predictions-majority-hazard'], testset['predictions-majority-product']):.3f}")

# Call the function with the required dataframe (e.g., df_augmented or any other dataframe)
evaluate_baselines(df_augmented, feature_column='text')
# Uncomment the following line to use a different feature column
# evaluate_baselines(df_augmented, feature_column='title')

Evaluating for label: hazard-category
F1 Score for Random Classifier (hazard-category): 0.076
F1 Score for Majority Classifier (hazard-category): 0.047
Evaluating for label: product-category
F1 Score for Random Classifier (product-category): 0.034
F1 Score for Majority Classifier (product-category): 0.019
Evaluating for label: hazard
F1 Score for Random Classifier (hazard): 0.005
F1 Score for Majority Classifier (hazard): 0.001
Evaluating for label: product
F1 Score for Random Classifier (product): 0.000
F1 Score for Majority Classifier (product): 0.000
Score Sub-Task 1 - Random Classifier: 0.057
Score Sub-Task 2 - Random Classifier: 0.003
Score Sub-Task 1 - Majority Classifier: 0.031
Score Sub-Task 2 - Majority Classifier: 0.001


- Saved results (the cell runs quickly, but the values are recorded here so that the analysis can be replicated):

- Evaluating for label: hazard-category
    - F1 Score for Random Classifier (hazard-category): 0.076
    - F1 Score for Majority Classifier (hazard-category): 0.047
- Evaluating for label: product-category
    - F1 Score for Random Classifier (product-category): 0.034
    - F1 Score for Majority Classifier (product-category): 0.019
- Evaluating for label: hazard
    - F1 Score for Random Classifier (hazard): 0.005
    - F1 Score for Majority Classifier (hazard): 0.001
- Evaluating for label: product
    - F1 Score for Random Classifier (product): 0.000
    - F1 Score for Majority Classifier (product): 0.000

- Score Sub-Task 1 - Random Classifier: 0.057
- Score Sub-Task 2 - Random Classifier: 0.003
- Score Sub-Task 1 - Majority Classifier: 0.031
- Score Sub-Task 2 - Majority Classifier: 0.001

###  Results and Observations
- Label: `hazard-category`
    - Random Classifier F1: 0.076
    - Majority Classifier F1: 0.047
    - The Random Classifier performs slightly better, but both scores are low: neither random guessing nor always predicting the most frequent class aligns well with the true labels, which is consistent with an imbalanced dataset.

- Label: `product-category`
    - Random Classifier F1: 0.034
    - Majority Classifier F1: 0.019
    - Performance drops further here. It suggests more complexity or higher imbalance in this label.

- Label: `hazard`
    - Random Classifier F1: 0.005
    - Majority Classifier F1: 0.001
    - Both scores are extremely low, possibly due to:
        - Large number of classes.
        - Sparse distribution of classes.
        - Poor representation of these classes in the Random Classifier's uniform predictions or Majority Classifier's mode.

- Label: `product`
    - Random Classifier F1: 0.000
    - Majority Classifier F1: 0.000
    - Both scores round to zero. Neither baseline learns from the features, so this most likely reflects a very large number of product classes with few samples each (extreme imbalance).

- Sub-Tasks
    - Score Sub-Task 1: hazard-category & product-category
        - Random Classifier Score: 0.057
        - Majority Classifier Score: 0.031
        - Indicates the overall performance when combining macro F1 scores for hazard-category and product-category. Random guessing outperforms predicting the most frequent class, but both are weak.
    - Score Sub-Task 2: hazard & product
        - Random Classifier Score: 0.003
        - Majority Classifier Score: 0.001
        - Reflects the severe challenge for these labels. The performance is near zero, affirming the labels require more sophisticated approaches.
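
The sub-task score used above averages two macro-F1 values: one for the hazard label over all rows, and one for the product label restricted to the rows where the hazard prediction is correct. The following worked example is a sketch on four made-up rows (the label values are hypothetical); the function body mirrors `compute_score` from the baseline cell.

```python
import pandas as pd
from sklearn.metrics import f1_score


def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    # Macro-F1 of the hazard label over all rows.
    f1_hazards = f1_score(hazards_true, hazards_pred, average="macro", zero_division=0)
    # Macro-F1 of the product label, only on rows with a correct hazard prediction.
    mask = hazards_pred == hazards_true
    f1_products = f1_score(
        products_true[mask], products_pred[mask], average="macro", zero_division=0
    )
    return (f1_hazards + f1_products) / 2.0


# Hypothetical toy data: the second hazard prediction is wrong.
hazards_true = pd.Series(["bio", "bio", "chem", "chem"])
hazards_pred = pd.Series(["bio", "chem", "chem", "chem"])
products_true = pd.Series(["meat", "fish", "fruit", "fruit"])
products_pred = pd.Series(["meat", "meat", "fruit", "meat"])

# Hazards: F1(bio) = 2/3, F1(chem) = 4/5, macro = 11/15.
# Products on rows 0, 2, 3: F1(meat) = 2/3, F1(fruit) = 2/3, macro = 2/3.
score = compute_score(hazards_true, products_true, hazards_pred, products_pred)
print(f"{score:.3f}")  # (11/15 + 10/15) / 2 = 0.700
```

Note that row 1 contributes nothing to the product term: its product prediction is only evaluated when the hazard prediction for the same row is correct.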

- `Conclusions`:
- Baseline as a Benchmark:

    - **The poor F1 scores highlight the challenging nature of the task and dataset**.
    - These results provide a benchmark to evaluate future models. Any model achieving significantly higher F1 scores would demonstrate effective learning.

- Dataset Imbalance:

    - The Majority Classifier's low macro-F1 shows that predicting only the most frequent class leaves every other class at zero F1, which is consistent with severe class imbalance across all labels.
    - Future models should address this using strategies like stratified sampling, oversampling, or weighted loss functions.

- Complexity of Labels:

    - The complexity increases from hazard-category and product-category to hazard and product, as reflected in the declining F1 scores.

- Actionable Insights:

    - Preprocessing: Investigate the class distributions and apply balancing techniques.
    - Feature Engineering: Consider enhancing the feature column (e.g., using embeddings).
    - Advanced Models: Apply models capable of handling imbalance, such as tree-based methods, ensemble models, or neural networks.
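
Two of the strategies mentioned above, stratified sampling and class weighting, can be sketched as follows. The labels are synthetic (the counts are made up), and the resulting dictionary is the kind of object that can be passed as `class_weight` to scikit-learn estimators such as `LogisticRegression`; whether this helps on the actual dataset is not tested here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels (80 / 16 / 4 samples).
y = np.array(["allergens"] * 80 + ["fraud"] * 16 + ["migration"] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Stratified split: every class keeps its share in both train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2024, stratify=y
)
test_counts = dict(zip(*np.unique(y_test, return_counts=True)))

# "balanced" weights: n_samples / (n_classes * class_count), so rare classes weigh more.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

print("test counts:", test_counts)
print("class weights:", class_weight)
```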


### Traditional and Advanced Approach - Design Decisions (Function Skeleton) and Implementation Limitations (Resources)

- For the traditional ML approach we will run `Logistic Regression`, and for the advanced algorithm `XGBoost`.
- This is because these approaches reflect different things.
    - `Logistic Regression`: serves as a simple, interpretable way of understanding how well the data can be modeled with `linear relationships`.
    - `XGBoost`: if XGBoost significantly outperforms logistic regression, it suggests that the data has complex patterns that require `non-linear modeling`.

- Skeleton function: we spent time writing a function "skeleton" so that it is easy to adapt if we change the model. For both logistic regression and XGBoost the code follows the same logic:
    1. Stratified train/test split.
    2. One-vs-all (one-vs-rest) classification.
    3. Optuna tuning of the regularization strength `C` with cross-validation.
    4. TF-IDF pipeline.
Key note: hyperparameter tuning was constrained by time and space (memory) limitations.

- Limitations
    1. SMOTE: increases memory (space) usage.
    2. Cross-validation.
    3. Optuna with k-fold.
    4. scikit-learn does not use the GPU.
    5. Logistic regression was used instead of SVM, because SVM is more complex and needs more resources.
    6. One-vs-all and TF-IDF increase the computational cost; TF-IDF is applied externally, not inside the pipeline.
- Alternative usage of

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
import numpy as np
import os

def evaluate_log_regression_with_multinomial(dataframe, feature_column, testset_competition):
    """
    Function to evaluate multinomial logistic regression classifier on a given dataframe
    without applying SMOTE.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.
        testset_competition: The unlabeled competition dataframe (not used inside this function yet).
    """
    np.random.seed(42)  # For reproducibility

    # Dictionary to store predictions for each label
    predictions_dict = {}

    # Define TfidfVectorizer and Logistic Regression once
    vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)
    classifier = LogisticRegression(max_iter=30, random_state=2024, multi_class='multinomial')

    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Evaluating for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024,
            stratify=dataframe[label]
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Transform features using the pre-defined vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Fit and predict using the multinomial Logistic Regression Classifier
        classifier.fit(X_train_tfidf, y_train)
        predictions = classifier.predict(X_test_tfidf)

        # Store predictions separately
        predictions_dict[label] = {
            "y_test": y_test,
            "predictions": predictions
        }

        # Compute F1 scores
        logreg_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for Multinomial Logistic Regression ({label}): {logreg_f1:.3f}")

        # Generate and save classification reports
        os.makedirs('reports/logreg', exist_ok=True)

        logreg_report = classification_report(y_test, predictions, zero_division=0)

        with open(f'reports/logreg/logreg_classifier_report_{label}.txt', 'w') as logreg_file:
            logreg_file.write(f"Classification Report for Multinomial Logistic Regression ({label}):\n")
            logreg_file.write(logreg_report)
            logreg_file.write(f"F1 Score for Multinomial Logistic Regression ({label}): {logreg_f1:.3f}")

    # Custom metric score calculation
    def compute_score(hazards_true, products_true, hazards_pred, products_pred):
        """
        Custom scoring function to compute the macro F1 score for hazards and products.

        Args:
            hazards_true: Ground truth labels for hazards.
            products_true: Ground truth labels for products.
            hazards_pred: Predicted labels for hazards.
            products_pred: Predicted labels for products.

        Returns:
            A float representing the combined macro F1 score.
        """
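        # Work on NumPy arrays so that the comparison and the mask are positional
        # (the y_test Series carry split-specific pandas indices that cannot be aligned)
        hazards_true, hazards_pred = np.asarray(hazards_true), np.asarray(hazards_pred)
        products_true, products_pred = np.asarray(products_true), np.asarray(products_pred)
        f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)
        mask = hazards_pred == hazards_true
        f1_products = f1_score(
            products_true[mask],
            products_pred[mask],
            average='macro',
            zero_division=0
        )
        return (f1_hazards + f1_products) / 2.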

    # Example of calculating scores for Sub-Tasks (if needed):
    print(f"Score Sub-Task 1 - Multinomial Logistic Regression: {compute_score(predictions_dict['hazard-category']['y_test'], predictions_dict['product-category']['y_test'], predictions_dict['hazard-category']['predictions'], predictions_dict['product-category']['predictions']):.3f}")
    print(f"Score Sub-Task 2 - Multinomial Logistic Regression: {compute_score(predictions_dict['hazard']['y_test'], predictions_dict['product']['y_test'], predictions_dict['hazard']['predictions'], predictions_dict['product']['predictions']):.3f}")


In [11]:
evaluate_log_regression_with_multinomial(df_augmented, 'title', testset_competition)

Evaluating for label: hazard-category


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 Score for Multinomial Logistic Regression (hazard-category): 0.613
Evaluating for label: product-category


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 Score for Multinomial Logistic Regression (product-category): 0.616
Evaluating for label: hazard


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 Score for Multinomial Logistic Regression (hazard): 0.418
Evaluating for label: product




KeyboardInterrupt: 
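
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` messages above mean that the solver stopped at `max_iter` before converging. A quick way to see the effect of `max_iter` is to inspect the fitted `n_iter_` attribute and check whether a `ConvergenceWarning` was raised. The sketch below uses synthetic data from `make_classification`, not the TF-IDF features, and the helper name `fit_and_count` is made up for illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem standing in for the real feature matrix.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=2024
)


def fit_and_count(max_iter):
    """Fit a logistic regression and report iterations used and whether the limit was hit."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(max_iter=max_iter, random_state=2024).fit(X, y)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(clf.n_iter_.max()), hit_limit


results = {max_iter: fit_and_count(max_iter) for max_iter in (5, 30, 1000)}
for max_iter, (n_iter, hit_limit) in results.items():
    print(f"max_iter={max_iter}: iterations used={n_iter}, hit limit={hit_limit}")
```

A model that stopped at the limit can still give usable predictions, but its coefficients are not the optimum of the objective, so scores obtained with different `max_iter` values are not directly comparable.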

### Logistic Regression TF-IDF Title 

In [50]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
import numpy as np
import os


def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    """
    Compute a custom F1 score that considers hazards and products together.
    """
    # Reset indices to ensure alignment
    hazards_true = hazards_true.reset_index(drop=True)
    products_true = products_true.reset_index(drop=True)
    hazards_pred = pd.Series(hazards_pred).reset_index(drop=True)
    products_pred = pd.Series(products_pred).reset_index(drop=True)

    # Compute F1 for hazards
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro', zero_division=0)

    # Compute F1 for products, only where hazards predictions match ground truth
    mask = hazards_pred == hazards_true
    f1_products = f1_score(
        products_true[mask],
        products_pred[mask],
        average='macro',
        zero_division=0
    )

    # Return the combined metric
    return (f1_hazards + f1_products) / 2.


def train_log_regression_classifiers(dataframe, feature_column):
    """
    Train multinomial logistic regression classifiers for four labels and calculate custom metrics on test data.

    Args:
        dataframe: The input dataframe containing the dataset.
        feature_column: The name of the column in the dataframe to be used as features.

    Returns:
        classifiers: A dictionary containing trained classifiers for each label.
        vectorizers: A dictionary containing TF-IDF vectorizers for each label.
        custom_metrics: A dictionary containing the custom metric score for each pair of labels on test data.
    """
    np.random.seed(42)  # For reproducibility

    classifiers = {}  # Dictionary to store the trained classifiers
    vectorizers = {}  # Dictionary to store the TF-IDF vectorizers
    custom_metrics = {}  # Dictionary to store custom metric scores

    # Dictionaries to store test data for each category
    test_data = {}

    # Train classifiers for each label
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(f"Training classifier for label: {label}")

        # Train-test split with stratification based on the current label
        trainset, testset = train_test_split(
            dataframe,
            test_size=0.2,
            random_state=2024,
            stratify=dataframe[label]
        )

        # Extract train and test features
        X_train = trainset[feature_column]
        X_test = testset[feature_column]

        # Target
        y_train = trainset[label]
        y_test = testset[label]

        # Define TfidfVectorizer for the current label
        vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)
        vectorizers[label] = vectorizer

        # Transform features using the label-specific vectorizer
        X_train_tfidf = vectorizer.fit_transform(X_train)
        X_test_tfidf = vectorizer.transform(X_test)

        # Define and train Logistic Regression classifier
        classifier = LogisticRegression(max_iter=100, random_state=2024, multi_class='multinomial')
        classifier.fit(X_train_tfidf, y_train)

        # Store the trained classifier
        classifiers[label] = classifier

        # Store test data
        test_data[label] = {
            'X_test_tfidf': X_test_tfidf,
            'y_test': y_test
        }

        # Predict and evaluate
        predictions = classifier.predict(X_test_tfidf)
        logreg_f1 = f1_score(y_test, predictions, average='macro', zero_division=0)
        print(f"F1 Score for {label}: {logreg_f1:.3f}")

        # Generate classification report
        report = classification_report(y_test, predictions, zero_division=0)
        print(report)

        # Save the report
        os.makedirs('reports/logreg', exist_ok=True)
        with open(f'reports/logreg/logreg_classifier_report_{label}.txt', 'w') as logreg_file:
            logreg_file.write(f"Classification Report for {label}:\n")
            logreg_file.write(report)
            logreg_file.write(f"F1 Score: {logreg_f1:.3f}\n")

    # Compute the custom metric for hazards and products using test data only.
    # NOTE: each label above uses its own stratified split, so the test rows differ between labels.
    # compute_score pairs them by position after resetting the indices, so these combined scores
    # should be read with caution.
    custom_metrics['subtask_1'] = compute_score(
        test_data['hazard-category']['y_test'],
        test_data['product-category']['y_test'],
        classifiers['hazard-category'].predict(test_data['hazard-category']['X_test_tfidf']),
        classifiers['product-category'].predict(test_data['product-category']['X_test_tfidf'])
    )

    custom_metrics['subtask_2'] = compute_score(
        test_data['hazard']['y_test'],
        test_data['product']['y_test'],
        classifiers['hazard'].predict(test_data['hazard']['X_test_tfidf']),
        classifiers['product'].predict(test_data['product']['X_test_tfidf'])
    )


    print(f"Custom Metric for Subtask 1 (Test Data): {custom_metrics['subtask_1']:.3f}")
    print(f"Custom Metric for Subtask 2 (Test Data): {custom_metrics['subtask_2']:.3f}")

    return classifiers, vectorizers, custom_metrics
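
The sub-task score compares hazard and product predictions row by row, so both labels need to be evaluated on the same test rows. One possible way to guarantee this, sketched here on a made-up toy frame (the column values are hypothetical, and stratifying on a single column is an assumption rather than the notebook's method), is to split once and reuse that split for every label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame that mimics two of the label columns in df_augmented.
toy = pd.DataFrame({
    "title": [f"recall notice {i}" for i in range(40)],
    "hazard-category": ["biological", "allergens"] * 20,
    "product-category": ["meat", "meat", "dairy", "dairy"] * 10,
})

# One split shared by all labels, stratified on a single column.
train_shared, test_shared = train_test_split(
    toy, test_size=0.25, random_state=2024, stratify=toy["hazard-category"]
)
y_hazard = test_shared["hazard-category"]
y_product = test_shared["product-category"]
rows_aligned = y_hazard.index.equals(y_product.index)

# For comparison: one stratified split per label, which selects rows independently.
_, test_by_hazard = train_test_split(
    toy, test_size=0.25, random_state=2024, stratify=toy["hazard-category"]
)
_, test_by_product = train_test_split(
    toy, test_size=0.25, random_state=2024, stratify=toy["product-category"]
)
shared_rows = len(set(test_by_hazard.index) & set(test_by_product.index))

print("shared split, rows aligned:", rows_aligned)
print(f"per-label splits, rows in common: {shared_rows} of {len(test_by_hazard)}")
```

The trade-off is that a single split can only be stratified on one column (or on a combination of columns), so the class proportions of the other labels are no longer guaranteed in the test set.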

In [51]:
classifiers, vectorizers, custom_metrics = train_log_regression_classifiers(df_augmented, 'title')

print("Custom Metric Scores on Test Data:")
print(custom_metrics)

Training classifier for label: hazard-category




F1 Score for hazard-category: 0.683
                                precision    recall  f1-score   support

                     allergens       0.82      0.93      0.87       863
                    biological       0.84      0.92      0.88       771
                      chemical       0.80      0.83      0.82       329
food additives and flavourings       1.00      0.21      0.35        14
                foreign bodies       0.86      0.75      0.80       253
                         fraud       0.84      0.71      0.77       280
                     migration       1.00      0.93      0.96        29
          organoleptic aspects       1.00      0.23      0.38        39
                  other hazard       0.92      0.46      0.61       103
              packaging defect       0.83      0.26      0.39        39

                      accuracy                           0.83      2720
                     macro avg       0.89      0.62      0.68      2720
                  weighted



F1 Score for product-category: 0.688
                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.95      0.50      0.66        38
                      cereals and bakery products       0.65      0.70      0.67       252
     cocoa and cocoa preparations, coffee and tea       0.77      0.71      0.74        99
                                    confectionery       0.81      0.37      0.51        91
dietetic foods, food supplements, fortified foods       0.79      0.78      0.78        95
                                    fats and oils       1.00      0.62      0.77        32
                                   feed materials       1.00      0.73      0.84        11
                   food additives and flavourings       1.00      0.70      0.82        10
                           food contact materials       1.00      0.89      0.94        35
                            fruits and vegetables   



F1 Score for hazard: 0.476
                                                   precision    recall  f1-score   support

                                        Aflatoxin       0.00      0.00      0.00         5
                                   abnormal smell       1.00      0.43      0.60         7
                                  alcohol content       0.82      0.90      0.86        10
                                        alkaloids       1.00      0.67      0.80         9
                                        allergens       0.00      0.00      0.00         7
                                           almond       0.60      0.48      0.54        31
             altered organoleptic characteristics       1.00      0.90      0.95        10
                                        amygdalin       0.62      0.89      0.73         9
                           antibiotics, vet drugs       0.00      0.00      0.00         6
                                    bacillus spp.       1.00  



F1 Score for product: 0.344
                                                                        precision    recall  f1-score   support

                                                Catfishes (freshwater)       0.40      0.67      0.50         3
                                                       Dried pork meat       0.00      0.00      0.00         1
                                                 Fishes not identified       0.17      0.44      0.24         9
                                                    Groupers (generic)       0.00      0.00      0.00         1
                                              Not classified pork meat       1.00      0.75      0.86         4
                                            Pangas catfishes (generic)       0.00      0.00      0.00         2
                                   Precooked cooked pork meat products       0.00      0.00      0.00         6
                                    Torpedo-shaped catfishes (generic)     
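
The printed F1 score for hazard-category (0.683) lines up with the `macro avg` row of its report (0.68) rather than with the accuracy (0.83), which suggests the score is macro-averaged. The toy sketch below shows why macro F1 falls below accuracy when a rare class is predicted poorly. The labels are hypothetical and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: one frequent class and one rare class.
y_true = ["allergens"] * 8 + ["migration"] * 2
# The model gets every frequent-class sample right but misses one of the two rare ones.
y_pred = ["allergens"] * 8 + ["allergens", "migration"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging gives every class the same weight, regardless of its support.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Weighted averaging weights each class by its support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Because each class contributes equally to the macro average, the low-support classes in the reports above (for example those with recall near 0.2) pull the macro F1 down even though overall accuracy stays high.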

In [53]:
# Retrieve the trained classifier for each label from `classifiers`
hazard_classifier = classifiers['hazard']
product_classifier = classifiers['product']
hazard_category_classifier = classifiers['hazard-category']
product_category_classifier = classifiers['product-category']

In [54]:
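# Character n-gram TF-IDF vectorizer, fitted on the training titles.
# NOTE: its vocabulary must match the one the classifiers were trained with.
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5), max_df=0.5, min_df=5, max_features=5000)
vectorizer.fit(df_augmented['title'])
# Transform the competition test titles with the fitted vectorizer.
X_val = vectorizer.transform(testset_competition['title'])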
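
A classifier only gives meaningful predictions if the features at prediction time come from the same fitted vectorizer (same vocabulary and column order) that produced its training features. Whether the vectorizer re-fitted in the cell above matches the ones used inside `train_log_regression_classifiers` depends on how that function splits the data, which is not shown in this part of the notebook. One way to rule out a mismatch is to bundle the vectorizer and the classifier in a single `Pipeline`. The sketch below assumes toy titles and labels (hypothetical, not from the dataset) and drops `min_df`/`max_df` because the toy corpus is too small for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the training titles and one label column.
train_titles = [
    "recall of ice cream due to undeclared milk",
    "undeclared peanut in chocolate biscuits",
    "salmonella detected in chicken breast",
    "listeria found in smoked salmon",
]
train_labels = ["allergens", "allergens", "biological", "biological"]

# The pipeline fits the vectorizer once and reuses it for every later predict call.
model = Pipeline([
    ("tfidf", TfidfVectorizer(strip_accents="unicode", analyzer="char", ngram_range=(2, 5))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_titles, train_labels)

# Raw text goes in; the fitted vectorizer is applied automatically.
new_titles = ["undeclared milk in biscuits"]
predictions = model.predict(new_titles)
print(predictions)
```

With this layout the fitted pipeline can also be saved and reloaded as one object, so the vocabulary cannot drift apart from the classifier.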

In [55]:
hazard_category_classifier.predict(X_val)

array(['fraud', 'biological', 'biological', 'allergens', 'fraud',
       'biological', 'chemical', 'biological', 'biological', 'biological',
       'allergens', 'allergens', 'allergens', 'biological', 'biological',
       'allergens', 'allergens', 'biological', 'biological', 'chemical',
       'biological', 'allergens', 'allergens', 'biological', 'allergens',
       'allergens', 'allergens', 'biological', 'allergens', 'allergens',
       'biological', 'allergens', 'allergens', 'fraud', 'biological',
       'biological', 'allergens', 'allergens', 'biological', 'allergens',
       'biological', 'allergens', 'allergens', 'biological', 'allergens',
       'chemical', 'allergens', 'allergens', 'biological', 'allergens',
       'allergens', 'allergens', 'biological', 'allergens', 'allergens',
       'allergens', 'allergens', 'biological', 'allergens', 'allergens',
       'fraud', 'allergens', 'biological', 'biological', 'biological',
       'biological', 'allergens', 'biological', 'biologica

In [56]:
product_classifier.predict(X_val)

array(['ice cream', 'ice cream', 'ice cream', 'chicken based products',
       'ice cream', 'cakes', 'other dairy products',
       'chicken based products', 'cakes', 'salmon', 'biscuits',
       'ice cream', 'ready to eat - cook meals', 'beer', 'beer',
       'ice cream', 'ready to eat - cook meals', 'biscuits',
       'ready to eat - cook meals', 'cakes', 'cakes',
       'chicken based products', 'salmon', 'chicken based products',
       'cakes', 'ready to eat - cook meals', 'ready to eat - cook meals',
       'cakes', 'sausage', 'cakes', 'ice cream', 'chicken based products',
       'cakes', 'cakes', 'chicken based products',
       'chicken based products', 'ice cream', 'cheese',
       'chicken based products', 'sausage', 'cakes',
       'ready to eat - cook meals', 'ready to eat - cook meals',
       'ice cream', 'cakes', 'fresh pork', 'cakes', 'cakes', 'ice cream',
       'ice cream', 'cakes', 'ready to eat - cook meals',
       'chicken based products', 'cakes', 'ice cream', '

In [58]:
product_category_classifier.predict(X_val)

array(['meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'fruits and vegetables',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'fruits and vegetables',
       'cereals and bakery products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products', 'meat, egg and dairy products',
       'meat, egg and dairy products',
       'soups, broths, sauces and condiments',
       'meat, egg and dairy products', 'meat, egg and dairy products',
      

### Logistic Regression TF-IDF Text 