<div style="width: 30%; float: right; margin: 10px; margin-right: 5%;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/FHNW_Logo.svg/2560px-FHNW_Logo.svg.png" width="500" style="float: left; filter: invert(50%);"/>
</div>

<h1 style="text-align: left; margin-top: 10px; float: left; width: 60%;">
    npr Mini-Challenge 1: <br>TFIDF-HGBC
</h1>

<p style="clear: both; text-align: left;">
    Prepared by Florin Barbisch, Gabriel Torres Gamez, and Jan Zwicky in the autumn semester (HS) 2023.
</p>

## Model Explanation

This notebook combines a TF-IDF representation of the tweets with a histogram-based gradient boosting classifier (HGBC). scikit-learn's `TfidfVectorizer` turns each tweet into a vector of term weights, limited here to the 1000 most frequent terms of the training corpus (`max_features=1000`). scikit-learn's `HistGradientBoostingClassifier` is then trained on these vectors to predict the `target` label.
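
To make the TF-IDF half of TFIDF-HGBC more concrete, the following minimal sketch computes TF-IDF weights by hand. It assumes the smoothed idf formula and L2 normalisation that `TfidfVectorizer` uses with its default settings. The toy corpus, the whitespace tokenisation, and the helper name `tfidf_vector` are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the dataset).
docs = [
    "fire in the forest",
    "forest walk in the sun",
    "the sun is out",
]

# Simplification: split on whitespace instead of using a regex tokenizer.
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    # Smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in sorted(vec.items())})
```

Terms that occur in every document, such as "the", receive the lowest idf, so rarer terms like "fire" dominate the vector of the document they appear in.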

## Requirements, Imports, and Settings
The required Python packages are imported here and the plot settings are configured.

In [1]:
# All Imports
import sys
import html
import scipy
import torch
import sklearn
import sklearn.metrics  # loaded explicitly because Pipeline.score calls sklearn.metrics
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Versions of the packages used
print(f"Python Version: {sys.version}")
print(f"PyTorch Version: {torch.__version__}")
print(f"Numpy Version: {np.__version__}")
print(f"Scipy Version: {scipy.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"Matplotlib Version: {plt.matplotlib.__version__}")
print(f"Sklearn Version: {sklearn.__version__}")
print(f"Seaborn Version: {sns.__version__}")

# Warnings Settings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Numpy Settings
np.set_printoptions(precision=2, suppress=True)
np.random.seed(42)

# Matplotlib Settings
plt.rcParams["figure.figsize"] = (24, 12)

Python Version: 3.10.13 (main, Aug 24 2023, 12:59:26) [Clang 15.0.0 (clang-1500.0.40.1)]
PyTorch Version: 2.0.1
Numpy Version: 1.23.5
Scipy Version: 1.9.3
Pandas Version: 2.1.1
Matplotlib Version: 3.6.3
Sklearn Version: 1.3.1
Seaborn Version: 0.12.2


## Helper Functions

Functions and classes used by the models are defined here.

In [2]:
# Metrics, shortcuts, constants, etc.
class Pipeline:
    """Pipeline for customizable text classification model."""

    def __init__(self, vectorizer, classifier, reducer=None) -> None:
        self.vectorizer = vectorizer
        self.classifier = classifier
        self.reducer = reducer

    def fit(self, features, labels) -> "Pipeline":
        """
        Fit the model to the given training data.

        Parameters
        ----------
        features : list of str
            Training data features.
        labels : np.ndarray or pandas.Series
            Training data labels.

        Returns
        -------
        self
            The fitted model.
        """
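        # Fit the vectorizer and convert the sparse matrix to a dense array,
        # since the reducer and classifier are given dense input.
        transformed_features = self.vectorizer.fit_transform(features).toarray()
        if self.reducer is not None: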
            self.reducer.fit(transformed_features)
            transformed_features = self.reducer.transform(transformed_features)
        self.classifier.fit(transformed_features, labels)
        return self

    def predict(self, features) -> np.ndarray:
        """
        Predict labels for the given features.

        Parameters
        ----------
        features : list of str
            Features to predict labels for.

        Returns
        -------
        np.ndarray
            Predicted labels for the given features.
        """
        transformed_features = self.vectorizer.transform(features).toarray()
        if self.reducer:
            transformed_features = self.reducer.transform(transformed_features)
        return self.classifier.predict(transformed_features)

    def score(self, features, labels, metric="acc") -> np.float64:
        """
        Score the model on the given features and labels.

        Parameters
        ----------
        features : list of str
            Features to score the model on.
        labels : np.ndarray or pandas.Series
            Labels to score the model on.
        metric : str, optional
            Metric to use for scoring, by default "acc"

        Returns
        -------
        np.float64
            Score of the model on the given features and labels.
        """
        match metric:
            case "acc":
                return sklearn.metrics.accuracy_score(labels, self.predict(features))
            case "f1_macro":
                return sklearn.metrics.f1_score(labels, self.predict(features), average="macro")
            case _:
                raise ValueError(f"Unknown metric: {metric}")

## Loading the Processed Dataset

In [3]:
train = pd.read_csv('data/processed/train.csv', index_col=0)
val = pd.read_csv('data/processed/val.csv', index_col=0)

X_train = [html.unescape(tweet) for tweet in train["text"]]
X_val = [html.unescape(tweet) for tweet in val["text"]]

y_train = train["target"]
y_val = val["target"]
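
The `html.unescape` calls above convert HTML entities back into plain characters before vectorization. The short sketch below shows the effect on two made-up strings. Whether and how often such entities occur in the actual tweets is an assumption here, since the data is not shown.

```python
import html

# Made-up sample strings; the actual tweets are not shown in this notebook.
raw_tweets = [
    "Fire &amp; smoke near the highway",
    "Temperature &gt; 40 degrees today",
]

# Same list comprehension pattern as used for X_train and X_val.
clean_tweets = [html.unescape(tweet) for tweet in raw_tweets]
print(clean_tweets)
```

Without this step, a word tokenizer would likely pick up entity names such as `amp` as if they were ordinary tokens.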

## Classification Model

In [4]:
# Define Pipeline
TFIDF_HGBC = Pipeline(
    vectorizer=TfidfVectorizer(max_features=1000),
    reducer=None, # None, PCA, NMF, etc.
    classifier=HistGradientBoostingClassifier(),
)

# Fit Pipeline
TFIDF_HGBC.fit(X_train, y_train)

print(f"Train Accuracy: {TFIDF_HGBC.score(X_train, y_train, metric='acc'):.4f}")
print(f"Train F1 Macro: {TFIDF_HGBC.score(X_train, y_train, metric='f1_macro'):.4f}")

Train Accuracy: 0.8786
Train F1 Macro: 0.8727


### Model Evaluation

In [5]:
print(f"Accuracy: {TFIDF_HGBC.score(X_val, y_val.to_numpy(), metric='acc'):.4f}")
print(f"F1 Macro: {TFIDF_HGBC.score(X_val, y_val, metric='f1_macro'):.4f}")

Accuracy: 0.7756
F1 Macro: 0.7650


In [6]:
kaggle_test = pd.read_csv('data/processed/test.csv', index_col="id", encoding="utf-8")
X_test = [html.unescape(tweet) for tweet in kaggle_test["text"]]

submission = TFIDF_HGBC.predict(X_test)
submission = pd.DataFrame(submission, index=kaggle_test.index, columns=["target"])
submission.to_csv("data/submissions/TFIDF_HGBC.csv", index=True, index_label="id")

## Evaluation
In the evaluation, we describe which metric was used and why it suits the use case, and we discuss the results of the experiments along with a few predictions on individual test samples.
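
As a reference for the metric discussion, the sketch below computes the two scores reported in this notebook, accuracy and macro-averaged F1, by hand in plain Python. The label lists are made up for illustration and the helper names are hypothetical. The notebook itself relies on `sklearn.metrics.accuracy_score` and `sklearn.metrics.f1_score(..., average="macro")`.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score when `cls` is treated as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels: three positives, five negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

print(f"Accuracy: {accuracy(y_true, y_pred):.4f}")  # Accuracy: 0.6250
print(f"F1 Macro: {macro_f1(y_true, y_pred):.4f}")  # F1 Macro: 0.5636
```

In this toy case the minority class is predicted poorly (F1 of 0.4), which pulls the macro average below the accuracy. Macro F1 weights every class equally, so it is more sensitive to weak performance on a smaller class than accuracy is.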

## Findings

## Conclusion