# Data Analysis

In [1]:
from model.dataset import load_tweet_sentiment_csv_file
import pandas as pd



In [21]:
train_file_path = "data/training.csv"
train_df = load_tweet_sentiment_csv_file(train_file_path)
val_file_path = "data/validation.csv"
val_df = load_tweet_sentiment_csv_file(val_file_path)
val_df[:7]

Unnamed: 0,tweet_id,tag,sentiment,text
0,3364,Facebook,Irrelevant,"I mentioned on Facebook that I was struggling for motivation to go for a run the other day, whic..."
1,352,Amazon,Neutral,BBC News - Amazon boss Jeff Bezos rejects claims company acted like a 'drug dealer' bbc.co.uk/ne...
2,8312,Microsoft,Negative,@Microsoft Why do I pay for WORD when it functions so poorly on my @SamsungUS Chromebook? 🙄
3,4371,CS-GO,Negative,"CSGO matchmaking is so full of closet hacking, it's a truly awful game."
4,4433,Google,Neutral,Now the President is slapping Americans in the face that he really did commit an unlawful act af...
5,6273,FIFA,Negative,Hi @EAHelp I’ve had Madeleine McCann in my cellar for the past 13 years and the little sneaky th...
6,7925,MaddenNFL,Positive,Thank you @EAMaddenNFL!! \n\nNew TE Austin Hooper in the ORANGE & BROWN!! \n\n#Browns | @AustinH...


In [3]:
print(train_df.shape)
print(train_df["sentiment"].unique())

(74682, 4)
['Positive' 'Neutral' 'Negative' 'Irrelevant']


In [4]:
train_df["sentiment"].value_counts()

Negative      22542
Positive      20832
Neutral       18318
Irrelevant    12990
Name: sentiment, dtype: int64

The training and validation datasets are slightly imbalanced. In addition to Positive and Negative there are the classes Neutral and Irrelevant, which is unusual and not covered by pretrained reference models.
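To put a number on the imbalance, the counts can be turned into relative frequencies. The sketch below is illustrative only: it re-enters the four counts from the `value_counts()` output by hand instead of loading the CSV file.

```python
import pandas as pd

# Class counts copied by hand from the value_counts() output above.
counts = pd.Series(
    {"Negative": 22542, "Positive": 20832, "Neutral": 18318, "Irrelevant": 12990},
    name="sentiment",
)

# Share of each class in the training data.
shares = counts / counts.sum()
print(shares.round(3))

# Ratio between the most and the least frequent class.
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

On the real frame, `train_df["sentiment"].value_counts(normalize=True)` should give the same shares directly.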

In [5]:
train_df[train_df["text"].isna()]

Unnamed: 0,tweet_id,tag,sentiment,text
61,2411,Borderlands,Neutral,
553,2496,Borderlands,Neutral,
589,2503,Borderlands,Neutral,
745,2532,Borderlands,Positive,
1105,2595,Borderlands,Positive,
...,...,...,...,...
73972,9073,Nvidia,Positive,
73973,9073,Nvidia,Positive,
74421,9154,Nvidia,Positive,
74422,9154,Nvidia,Positive,


In [18]:
for row in train_df[train_df["tweet_id"]==2404]["text"]:
    print(row)

that was the first borderlands session in a long time where i actually had a really satisfying combat experience. i got some really good kills
this was the first Borderlands session in a long time where i actually had a really satisfying fighting experience. i got some really good kills
that was the first borderlands session in a long time where i actually had a really satisfying combat experience. i got some really good kills
that was the first borderlands session in a long time where i actually enjoyed a really satisfying combat experience. i got some rather good kills
that I was the first real borderlands session in a nice long wait time where i actually had a really satisfying combat experience. and i got some really good kills
that was the first borderlands session in a hot row where i actually had a really bad combat experience. i did some really good kills


The training dataset contains additional augmented samples, which can be grammatically incorrect or empty. It appears that the first sample for each tweet_id contains the original data.
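One way to check how many variants each tweet has, and to keep only the first one, is sketched below on a small hand-made frame. The toy data and ids are made up, and treating the first row per `tweet_id` as the original is an assumption based on the observation above, not a verified property of the dataset.

```python
import pandas as pd

# Toy stand-in for train_df: two tweets with several augmented variants each.
toy_df = pd.DataFrame(
    {
        "tweet_id": [1, 1, 1, 2, 2],
        "text": [
            "that was the first borderlands session in a long time",
            "this was the first Borderlands session in a long time",
            None,  # augmented samples can be empty
            "I love u guys",
            "I love you guys",
        ],
    }
)

# Number of rows (original plus augmentations) per tweet.
variants_per_tweet = toy_df.groupby("tweet_id").size()
print(variants_per_tweet)

# Keep only the first row of every tweet_id, assumed to be the original.
originals = toy_df.drop_duplicates(subset="tweet_id", keep="first")
print(originals)
```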

In [7]:
train_df["tweet_id"].nunique()

12447

The training dataset contains 12447 unique tweets, which should be enough for meaningful fine-tuning.

In [9]:
# Compare against the column values; `x in series` would test the index instead.
val_df["tweet_id"].isin(train_df["tweet_id"]).sum()

1000

All tweets from the validation dataset also appear in the training dataset. This leak should be fixed. Since the tweet_ids are given, we can delete the validation samples directly from the training dataset.
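A value-based check for the overlap, and the corresponding removal, could look like the sketch below. The two small frames are made up for illustration; `isin` compares against the values of the column, which is what a leak check needs.

```python
import pandas as pd

# Toy frames; ids and texts are invented for this example.
toy_train = pd.DataFrame({"tweet_id": [10, 10, 11, 12, 12, 13], "text": list("abcdef")})
toy_val = pd.DataFrame({"tweet_id": [11, 13], "text": ["c", "f"]})

# Count validation rows whose tweet_id also occurs in the training data.
leaked = toy_val["tweet_id"].isin(toy_train["tweet_id"])
print(leaked.sum())

# Drop every training row (including augmentations) that shares an id with validation.
clean_train = toy_train[~toy_train["tweet_id"].isin(toy_val["tweet_id"])]
print(clean_train["tweet_id"].tolist())
```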

In [10]:
pd.set_option('display.max_colwidth', 100)
merged_df = pd.merge(train_df, val_df, on='tweet_id', suffixes=('_train', '_val'))
merged_df = merged_df[~merged_df.duplicated(subset="tweet_id", keep="first")]
merged_df[["text_train", "text_val"]][10:20]

Unnamed: 0,text_train,text_val
60,Wine drunk playing the new Borderlands . . . Goddess life is a fun life.. . . findom,Wine drunk playing the new Borderlands 😩\n\nGoddess life is a fun life.\n\n✨ findom
66,I love u guys,I love u guys
72,Atleast I have Borderlands to cheer me up :(,Atleast I have Borderlands to cheer me up :(
78,Chris loves me in borderlands one and two.,Chris loves me in borderlands one and two.
84,This cricket has been the worst hivemind of fandom I have done this more times than I would love...,This cricket has been the worst hivemind of fandom I have done this more times than I would love...
90,"5 games, 5 Mutuals. . Pokémon. Borderlands (p much all). Sims 4 (haven't played in forever tho)....","5 games, 5 Mutuals\n\nPokémon\nBorderlands (p much all)\nSims 4 (haven't played in forever tho)\..."
96,I want to thank,I want to thank #SSKYWILDKATSSS for letting me run the new Borderlands 3 DLC with him last night...
102,"So after the past 9 days of streaming on the bounce, and last nights insanely brilliant session ...","So after the past 9 days of streaming on the bounce, and last nights insanely brilliant session ..."
108,Today sucked so it’s time to drink wine n play borderlands until the sun comes up so I can hate ...,Today sucked so it’s time to drink wine n play borderlands until the sun comes up so I can hate ...
114,Shitting around | Borderlands 3 | Part 5.5 twitch.tv/slayer3000bot,Shitting around | Borderlands 3 | Part 5.5 twitch.tv/slayer3000bot


Comparing the validation samples with the training samples that share the same tweet_id also reveals a few additional modifications of the training dataset. I found the following:
- Line breaks removed
- Some emojis removed
- Most hashtags removed
- Some links removed

Before evaluation we should apply the same modifications to the validation data, so that the model input matches the training data more closely.
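As a rough sketch, three of the observed modifications can be mimicked with regular expressions. The function name `align_with_training_text` is hypothetical, and the patterns only approximate what was seen in the comparison; they are not a reconstruction of how the training data was actually produced. Emoji removal is left out because it is unclear which emojis were dropped.

```python
import re


def align_with_training_text(text: str) -> str:
    """Approximates the modifications seen in the training texts (hypothetical helper)."""
    text = re.sub(r"\n+", " ", text)  # line breaks become a single space
    text = re.sub(r"#\w+", "", text)  # drop hashtags
    text = re.sub(r"http\S+|www\.\S+", "", text)  # drop links
    return text


example = "I want to thank #SSKYWILDKATSSS\n\nfor the stream https://example.com/abc"
cleaned = align_with_training_text(example)
print(repr(cleaned))
```

Leftover double spaces are kept on purpose here; whether to collapse them is a separate decision.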

# Data Preparation

In [None]:
# %load model/dataset.py
from typing import Dict, Tuple

import pandas as pd
import re
from datasets import Dataset
from transformers import PreTrainedTokenizer


def load_tweet_sentiment_csv_file(file_name: str) -> pd.DataFrame:
    """
    Load a sentiment file for the coding challenge from the disk and add some fitting column names.
    """
    column_names = ["tweet_id", "tag", "sentiment", "text"]
    return pd.read_csv(file_name, header=None, names=column_names)


def clean_text(text: str) -> str:
    """Applies the text cleaning steps to a single string (the serving app imports this helper)."""
    text = re.sub(r"\n", "", text)  # \n only appears in validation data
    text = re.sub(r"#\w+", "", text)  # Remove hashtags
    text = re.sub(r"http\S+|www\.\S+", "", text)  # Remove URLs
    return text


def prepare_text_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Does some basic cleaning on the data: removes invalid or too short samples and unwanted text
    segments.
    """
    df = df[~df["text"].isna()]
    # Copy the filtered frame so the assignment below does not write to a view of the input.
    df = df[df["text"].str.count("[a-zA-Z]") >= 5].copy()
    df["text"] = df["text"].apply(clean_text)
    return df


def prepare_labels(df: pd.DataFrame, label_map: Dict[str, int]):
    """Creates the column for the labels, created by a map from the "sentiment" entry."""
    df["label"] = [label_map[label_name] for label_name in df["sentiment"]]
    return df


def remove_leaked_training_samples(train_df: pd.DataFrame, val_df: pd.DataFrame) -> pd.DataFrame:
    """Removes the samples that appear in both training and validation data from the training data."""
    return train_df[~train_df["tweet_id"].isin(val_df["tweet_id"])]


def remove_augmented_training_samples(train_df: pd.DataFrame) -> pd.DataFrame:
    """Removes all augmented versions of a training sample, except the first one."""
    train_df = train_df[~train_df.duplicated(subset="tweet_id", keep="first")]
    return train_df


def create_datasets(
    train_file_name: str,
    val_file_name: str,
    label_definitions: Dict[str, int],
    tokenizer: PreTrainedTokenizer,
) -> Tuple[Dataset, Dataset]:
    """
    Loads the string data from the disk, cleans and preprocesses the entries, converts to a Dataset and applies the
    given tokenizer.
    """
    train_df = load_tweet_sentiment_csv_file(train_file_name)
    val_df = load_tweet_sentiment_csv_file(val_file_name)

    train_df = remove_leaked_training_samples(train_df, val_df)
    train_df = remove_augmented_training_samples(train_df)

    train_df = prepare_text_data(train_df)
    val_df = prepare_text_data(val_df)

    train_df = prepare_labels(train_df, label_definitions)
    val_df = prepare_labels(val_df, label_definitions)

    train_dataset = Dataset.from_pandas(train_df[["text", "label"]])
    val_dataset = Dataset.from_pandas(val_df[["text", "label"]])

    tokenizer_fn = lambda examples: tokenizer(examples["text"], truncation=True)
    tokenized_train = train_dataset.map(tokenizer_fn, batched=True)
    tokenized_test = val_dataset.map(tokenizer_fn, batched=True)
    return tokenized_train, tokenized_test


Before creating the training dataset, we clean it to fix the problems identified above. For the augmented samples I tested both versions, with and without augmentations (see further below).

I deliberately skipped further modifications of the text input (such as deleting extra spaces or replacing special characters), because the BERT tokenizer we use handles such cases sensibly and the data stays better aligned with the pretrained model.

# Model Training

I want to use a pretrained model and fine-tune it on the 4 given classes. A pretrained NLP model should already have learned the general concepts of language that it needs to evaluate a sentence. The fine-tuning aims to specialize the model for the exact domain, sentiment data from Twitter.

The base model is DistilBERT, a smaller, "distilled" but capable version of the general language model BERT, which is available on Hugging Face. I also use the Hugging Face library to fine-tune the model. I did not perform any hyperparameter tuning and use the hyperparameters recommended by Hugging Face for transfer learning.

In [None]:
# %load model/trainer
from model.config import train_file_name, val_file_name, label_definitions, output_dir
from model.dataset import create_datasets
from model.metrics import compute_metrics
from model.plots import plot_confusion_matrix
from transformers import (
    TrainingArguments,
    AutoModelForSequenceClassification,
    Trainer,
    DataCollatorWithPadding,
    AutoTokenizer,
)


def train():
    """
    Main training pipeline, load and pre-process the data, initialize the trainer from a pretrained model and start
    the training loop.
    """
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    train_dataset, val_dataset = create_datasets(train_file_name, val_file_name, label_definitions, tokenizer)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    id2label = {id: label for label, id in label_definitions.items()}
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=4,
        id2label=id2label,
        label2id=label_definitions,
        ignore_mismatched_sizes=True,
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=2e-5,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=5,
        weight_decay=0.01,
        save_strategy="epoch",
        push_to_hub=False,
        evaluation_strategy="epoch",
        eval_steps=1,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    trainer.train()


# Evaluation

In [11]:
from model.evaluation import eval_model_summary

model_names = [
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "christian-git-md/distilbert-base-uncased-finetuned-twitter-noleak",
    "christian-git-md/distilbert-base-uncased-finetuned-twitter-noleak-noduplicates",
    "christian-git-md/distilbert-base-uncased-finetuned-twitter-leak",
]
eval_model_summary(model_names)




Unnamed: 0,model_name,accuracy,binary_accuracy,f1
0,distilbert-base-uncased-finetuned-sst-2-english,0.438,0.807,0.285
1,distilbert-base-uncased-finetuned-twitter-noleak,0.647,0.862,0.632
2,distilbert-base-uncased-finetuned-twitter-noleak-noduplicates,0.641,0.86,0.628
3,distilbert-base-uncased-finetuned-twitter-leak,0.972,0.991,0.971


The models that were trained and evaluated are:
- <b>distilbert-base-uncased-finetuned-sst-2-english:</b> A general sentiment model, evaluated as a reference. Since this model was trained on only two classes, only the "binary_accuracy", i.e. the accuracy on positive / negative samples, is relevant here.
- <b>distilbert-base-uncased-finetuned-twitter-noleak:</b> The model trained as described above, with all augmented samples.
- <b>distilbert-base-uncased-finetuned-twitter-noleak-noduplicates:</b> Trained without the augmented samples.
- <b>distilbert-base-uncased-finetuned-twitter-leak:</b> Trained without removing the validation samples from the training data. As expected, this results in unrealistically high performance.
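The metric code itself is not shown in this notebook, so the following is only a sketch of how the two accuracy columns could be computed. It assumes that `binary_accuracy` means accuracy restricted to samples whose true label is Positive or Negative; the function names and the toy labels are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Accuracy on the subset whose true label is Positive or Negative (assumed definition)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in ("Positive", "Negative")]
    return sum(t == p for t, p in pairs) / len(pairs)


y_true = ["Positive", "Negative", "Neutral", "Irrelevant", "Negative"]
# A two-class model can never predict Neutral or Irrelevant.
y_pred = ["Positive", "Negative", "Positive", "Negative", "Positive"]

acc = accuracy(y_true, y_pred)
bin_acc = binary_accuracy(y_true, y_pred)
print(acc, bin_acc)
```

Under this definition a two-class reference model is penalized on the four-class accuracy for every Neutral or Irrelevant sample, while its binary accuracy only reflects the classes it was trained on.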

# Serving

A Dockerfile for the container that serves the model through a FastAPI app on port 9090.

Apart from the endpoints, the app only implements a simple queue so that text from requests is evaluated sequentially by the model.
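The queue pattern can be illustrated without FastAPI or a real model. In the sketch below, `fake_model`, `worker` and `ask` are hypothetical names: a single worker task pulls one request at a time from a shared queue and answers on a per-request response queue, which is the same idea the app uses.

```python
import asyncio


def fake_model(text: str) -> dict:
    """Stand-in for the sentiment pipeline (hypothetical, not a real model)."""
    return {"label": "Positive" if "love" in text else "Negative"}


async def worker(in_queue: asyncio.Queue) -> None:
    """Takes one request at a time from the queue and answers on its response queue."""
    while True:
        text, response_queue = await in_queue.get()
        await response_queue.put(fake_model(text))


async def ask(in_queue: asyncio.Queue, text: str) -> dict:
    """What an endpoint does: enqueue the text and wait for the result."""
    response_queue: asyncio.Queue = asyncio.Queue()
    await in_queue.put((text, response_queue))
    return await response_queue.get()


async def main() -> list:
    in_queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(in_queue))
    results = await asyncio.gather(ask(in_queue, "I love donuts"), ask(in_queue, "meh"))
    task.cancel()  # stop the worker loop once the demo is done
    return results


# In a notebook, use `await main()` instead, since an event loop is already running.
results = asyncio.run(main())
print(results)
```

Because only the worker ever calls the model, concurrent requests are serialized even though the endpoints themselves are asynchronous.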

In [None]:
# %load docker/Dockerfile
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
RUN pip install 'transformers[torch]' \
 fastapi \
 uvicorn \
 python-multipart \
 scikit-learn \
 emoji \
 datasets \
 huggingface_hub \
 evaluate \
 matplotlib \
 jupyterlab
RUN apt-get update && apt-get install -y git-lfs
COPY download_models.py .
RUN python3 download_models.py
COPY entrypoint.sh /usr/src/app/entrypoint.sh
RUN chmod +x /usr/src/app/entrypoint.sh
ENTRYPOINT ["/usr/src/app/entrypoint.sh"]

In [None]:
# %load docker/entrypoint.sh
#!/bin/bash
uvicorn serving.app.main:app --host 0.0.0.0 --port 9090

In [None]:
# %load serving/app/main.py
from asyncio import Queue

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import asyncio
import logging

from transformers import pipeline

from model.dataset import clean_text

app = FastAPI()
logger = logging.getLogger(__name__)


@app.on_event("startup")
async def start_model_server_loop():
    """
    Start the model server loop upon application startup.

    Roughly mimicking the Starlette setup from https://huggingface.co/docs/transformers/main/en/pipeline_webserver.
    """
    q = asyncio.Queue()
    app.state.model_queue = q
    asyncio.create_task(model_server_loop(q))


@app.post("/")
@app.post("/evaluate_text_sentiment")
async def evaluate_text_sentiment(request: Request):
    """
    Evaluate the sentiment of the provided text. This endpoint accepts text data as raw payload, processes it through
    the sentiment analysis model, and returns the sentiment.
    """
    payload = await request.body()
    string = payload.decode("utf-8")
    response_q = asyncio.Queue()
    await request.app.state.model_queue.put((string, response_q))
    return await response_q.get()


async def model_server_loop(in_queue: Queue):
    """
    A simple server loop, so requests can be processed sequentially by the model.
    """
    model = pipeline(
        task="sentiment-analysis", model="christian-git-md/distilbert-base-uncased-finetuned-twitter-noleak"
    )
    while True:
        string, response_queue = await in_queue.get()
        try:
            string = clean_text(string)
            out = model(string)
            await response_queue.put(JSONResponse(content=out))
        except (KeyboardInterrupt, SystemExit):
            exit()
        except Exception:
            logger.exception("An error occurred during model processing.")
            await response_queue.put(JSONResponse(status_code=500, content={"error": "Internal server error"}))


# uvicorn serving.app.main:app --host 0.0.0.0 --port 9090


In [12]:
!curl -X POST http://34.32.61.134:9090/ -d "I love donuts"

[{"label":"Positive","score":0.9768452644348145}]

I deployed the distilbert-base-uncased-finetuned-twitter-noleak model on a VM. Sentiment can be evaluated via POST request.
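The same request can be sent from Python with the standard library. This is a minimal sketch: `SERVICE_URL` is a placeholder address and the helper names are hypothetical; only the endpoint path and the raw-text body follow the app shown above.

```python
import json
import urllib.request

# Hypothetical address; replace with the host and port where the container runs.
SERVICE_URL = "http://localhost:9090/evaluate_text_sentiment"


def build_request(text: str, url: str = SERVICE_URL) -> urllib.request.Request:
    """Builds a POST request with the raw text as UTF-8 body, as the endpoint expects."""
    return urllib.request.Request(url, data=text.encode("utf-8"), method="POST")


def query_sentiment(text: str, url: str = SERVICE_URL) -> list:
    """Sends the request and decodes the JSON list returned by the pipeline."""
    with urllib.request.urlopen(build_request(text, url), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("I love donuts")
print(request.get_method(), request.full_url)
# query_sentiment("I love donuts") needs a running server, so it is not called here.
```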

# Interpretation / Discussion

Coming on Wednesday :)!