# Baseline system for call-to-action detection

The notebook contains the code for the baseline system for the automatic detection of **Calls-to-action** (**subtask 1** of the Shared Task on Harmful Content Detection). A gradient boosting algorithm was chosen for classification, using sentence embeddings and a polarity score as features. The notebook covers the training of the system as well as the prediction on the test data and the evaluation. 

The notebook was tested with Python 3.12.9. Running the following two cells writes a `requirements.txt` file and installs all required packages.

In [1]:
%%writefile requirements.txt

pandas==2.2.3
spacy==3.8.2
scikit-learn==1.6.1
textblob==0.15.3
textblob-de==0.4.3
sentence-transformers==4.1.0
nltk==3.9.1
numpy==2.0.1

Overwriting requirements.txt


In [None]:
%pip install -r requirements.txt

## 1. Importing training data

First, the training data was read in. 

In [3]:
# Reading in training data 
import pandas as pd

filename = "c2a_train.csv" # Path needs to be adjusted 
train_c2a = pd.read_csv(filename, sep=';')
train_c2a.head()

                 id                                        description    C2A
0   874458912592533  Wenn Du Wert drauf legst, dann erwarten wir di...  False
1   855617527810005  Macht was ihr wollt, aber schreibt nicht "Wir ...  False
2  1111407378897684  Tja, wohl etwas schwierig, jetzt das Geld per ...  False
3   945847662120324  UInd das Glaube ich nicht,,,in Hotels in Arfri...  False
4   993541110684312                                      und kein TTIP  False


The class distribution of the training data was analysed. 

In [4]:
# Absolute number of instances in each class 
class_counts_c2a = train_c2a["C2A"].value_counts()

# Relative number of instances in each class 
class_percent_c2a = train_c2a['C2A'].value_counts(normalize=True) * 100

# Summarise into a dataframe
class_table_c2a = pd.DataFrame({
    'Frequency': class_counts_c2a,
    'Percentage': class_percent_c2a.round(2)
})

print(class_table_c2a)

       Frequency  Percentage
C2A                         
False       6177       90.31
True         663        9.69
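
This imbalance makes plain accuracy a weak yardstick. As a rough illustration (a sketch computed from the training counts above, not part of the baseline itself), a hypothetical classifier that always predicts `False` would reach about 90 % accuracy, while its macro-averaged F1 score (the unweighted mean of the per-class F1 scores) would stay below 0.5:

```python
# Class counts taken from the table above
n_false, n_true = 6177, 663
total = n_false + n_true

def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical majority-class baseline: every instance is predicted as False
accuracy = n_false / total
f1_false = f1(precision=n_false / total, recall=1.0)
# No instance is predicted as True, so precision and recall are treated as 0
f1_true = f1(precision=0.0, recall=0.0)
macro_f1 = (f1_false + f1_true) / 2

print(f"accuracy = {accuracy:.4f}, macro F1 = {macro_f1:.4f}")
```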


## 2. Data preprocessing and undersampling

Since the classes are imbalanced, random undersampling was applied to the training data: the majority class (`False`) was reduced to the size of the minority class (`True`).

In [5]:
from sklearn.utils import resample

minority_class = train_c2a[train_c2a['C2A'] == True]
majority_class = train_c2a[train_c2a['C2A'] == False]

majority_undersampled = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=42)
train_c2a = pd.concat([minority_class, majority_undersampled])

print(train_c2a['C2A'].value_counts())

C2A
True     663
False    663
Name: count, dtype: int64
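
The idea behind this step can be sketched without pandas or scikit-learn. The example below is an illustration only: it uses toy labels with the same class sizes as the training data and Python's own random generator, so it does not select the same rows as `resample`:

```python
import random
from collections import Counter

# Toy labels with the same class sizes as the training data
labels = [False] * 6177 + [True] * 663

minority_idx = [i for i, label in enumerate(labels) if label]
majority_idx = [i for i, label in enumerate(labels) if not label]

# Draw as many majority instances as there are minority instances,
# without replacement and with a fixed seed for reproducibility
rng = random.Random(42)
majority_sampled = rng.sample(majority_idx, k=len(minority_idx))

balanced_idx = minority_idx + majority_sampled
balanced_counts = Counter(labels[i] for i in balanced_idx)
print(balanced_counts)
```

Undersampling discards most of the negative instances, which is the price paid for a balanced training set.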


The training data was then preprocessed. This included basic cleaning steps, specifically the removal of URLs, hashtags and mentions.

In [6]:
import re

def clean_text(text):
    url_pattern = r'https?://\S+|www\.\S+'
    mention_pattern = r'@\w+'
    hashtag_pattern = r'#\w+'
    combined_pattern = f'({url_pattern})|({mention_pattern})|({hashtag_pattern})'

    cleaned_text = re.sub(combined_pattern, '', text)
    return cleaned_text
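
A short usage sketch shows what the function does to a single text. The sample sentence is made up, and the final whitespace normalisation is an optional extra step that is not part of the baseline:

```python
import re

def clean_text(text):
    # Same three patterns as in the cell above, written as one expression
    pattern = r'(https?://\S+|www\.\S+)|(@\w+)|(#\w+)'
    return re.sub(pattern, '', text)

# Made-up example text (not taken from the data set)
sample = "Schaut mal @user https://example.com/x #TTIP stoppen"
cleaned = clean_text(sample)
print(repr(cleaned))

# Removed tokens leave their surrounding spaces behind;
# collapsing them is optional and not done in the baseline
normalized = " ".join(cleaned.split())
print(repr(normalized))
```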

The texts were then lemmatised and tokenised. 

In [7]:
# Download the spaCy pipeline for the German language
! python -m spacy download de_core_news_md

Collecting de-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.8.0/de_core_news_md-3.8.0-py3-none-any.whl (44.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')


In [8]:
import spacy

# Loading the spaCy pipeline for the German language
nlp = spacy.load('de_core_news_md')

In [9]:
# Defining the lemmatisation function 
def text_lemmatize_tokenize(texts):
    lemmatized = []
    for doc in nlp.pipe(texts, batch_size = 50):
        tokens = [token.lemma_.lower() for token in doc if not token.is_punct]
        lemmatized.append(' '.join(tokens))
    return lemmatized

The two functions for removing certain tokens and for lemmatisation were applied to the training data. 

In [10]:
# Removing URLs, hashtags and mentions 
train_c2a['description'] = train_c2a['description'].apply(clean_text)
print(train_c2a.head())

                  id                                        description   C2A
13  1043633089008447  die aus sicheren herkunftsländern nach hause s...  True
26  1105436602828095  es ist schade um die armen schweineköpfe, dafü...  True
31   982671735104583  Frag doch mal das Serbische Volk wieviel Flüch...  True
35  1105439379494484  vergrabt ein totes Schwein, hat in Spanien auc...  True
39   946634062041684  Wenn die auf dem Mittelmeer von der Marine auf...  True


In [11]:
# Lemmatisation and tokenisation 
train_c2a['description'] = text_lemmatize_tokenize(train_c2a['description'].tolist())
print(train_c2a.head())

                  id                                        description   C2A
13  1043633089008447  der aus sicher herkunftsländer nach hause schi...  True
26  1105436602828095  es sein schade um der arm schweineköpfe dafür ...  True
31   982671735104583  frag doch mal der serbisch volk wieviel flücht...  True
35  1105439379494484  vergraben ein tot schwein haben in spanien auc...  True
39   946634062041684  wenn der auf der mittelmeer von der marine auf...  True


## 3. Feature Engineering

Next, the tweets from the training data were converted into a feature representation. Specifically, a polarity score was extracted as a feature using the TextBlob library (through its German extension `textblob-de`).

In [12]:
from textblob_de import TextBlobDE

# Function for determining the polarity of a tweet
def add_polarity(df):
    def calculate_sentiment_features(text):
        blob = TextBlobDE(text)
        return blob.sentiment.polarity

    # Compute the polarity of every text and store it in a new column
    df['polarity'] = df['description'].apply(calculate_sentiment_features)

    return df

In addition, the tweets were represented as sentence embeddings using a Sentence-BERT model. The feature vector of each tweet consists of the polarity value followed by the components of its sentence embedding.

In [13]:
from sentence_transformers import SentenceTransformer
# Load the pre-trained Sentence-BERT model
sentence_model = SentenceTransformer('distiluse-base-multilingual-cased-v2')

# Function for representing tweets as embeddings 
def add_semantic_features(df, model):
    texts = df['description'].astype(str).values.tolist()

    embeddings = model.encode(texts, show_progressbar=True)
    embeddings_df = pd.DataFrame(embeddings, columns=[f'embedding_{i}' for i in range(embeddings.shape[1])])

    df = pd.concat([df.reset_index(drop=True), embeddings_df.reset_index(drop=True)], axis=1)
    return df

The features were extracted from the tweets in the training data. 

In [14]:
import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to \\na2.hs-
[nltk_data]     mittweida.de\felser\Wappscfg\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to \\na2.hs-
[nltk_data]     mittweida.de\felser\Wappscfg\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt_tab to \\na2.hs-
[nltk_data]     mittweida.de\felser\Wappscfg\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [15]:
# Extraction of the polarity score
train_c2a = add_polarity(train_c2a)
print(train_c2a.head())

                  id                                        description   C2A  \
13  1043633089008447  der aus sicher herkunftsländer nach hause schi...  True   
26  1105436602828095  es sein schade um der arm schweineköpfe dafür ...  True   
31   982671735104583  frag doch mal der serbisch volk wieviel flücht...  True   
35  1105439379494484  vergraben ein tot schwein haben in spanien auc...  True   
39   946634062041684  wenn der auf der mittelmeer von der marine auf...  True   

    polarity  
13       0.7  
26      -0.9  
31       0.0  
35      -1.0  
39       0.0  


In [17]:
# Extraction of embedding features 
train_c2a = add_semantic_features(train_c2a, sentence_model)
print(train_c2a.head())

                 id                                        description   C2A  \
0  1043633089008447  der aus sicher herkunftsländer nach hause schi...  True   
1  1105436602828095  es sein schade um der arm schweineköpfe dafür ...  True   
2   982671735104583  frag doch mal der serbisch volk wieviel flücht...  True   
3  1105439379494484  vergraben ein tot schwein haben in spanien auc...  True   
4   946634062041684  wenn der auf der mittelmeer von der marine auf...  True   

   polarity  embedding_0  embedding_1  embedding_2  embedding_3  embedding_4  \
0       0.7    -0.022992    -0.002712     0.018320    -0.000925     0.031636   
1      -0.9    -0.003322     0.006351     0.033073     0.018905     0.022041   
2       0.0    -0.042495     0.020839     0.050303     0.010994    -0.010169   
3      -1.0    -0.011467    -0.021485    -0.011638    -0.002952     0.046747   
4       0.0     0.031033    -0.015447    -0.019592    -0.017471     0.069267   

   embedding_5  ...  embedding_502  em

## 4. Training the model 

A gradient boosting classifier was then trained on the extracted features, using scikit-learn's default hyperparameters and a fixed random seed.

In [18]:
# Restrict training data set to features 
X_train = train_c2a.drop(columns=['id', 'description', 'C2A'])
# Extract labels from training data 
y_train = train_c2a['C2A']

In [None]:
# Training the model 
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

## 5. Prediction on the test data 

The trained classifier was then used to predict labels for the test data. To do this, the test data was preprocessed in the same way as the training data.

In [20]:
# Importing the test data
filename = "c2a_test.csv" # Path needs to be adjusted 
test_c2a = pd.read_csv(filename, sep=';')

In [21]:
# Removing URLs, hashtags and mentions 
test_c2a['description'] = test_c2a['description'].apply(clean_text)

# Lemmatisation and tokenisation 
test_c2a['description'] = text_lemmatize_tokenize(test_c2a['description'].tolist())

Subsequently, the polarity score was also determined for the tweets in the test data and they were converted into sentence embeddings. 

In [22]:
# Extraction of the polarity score
test_c2a = add_polarity(test_c2a)

# Extraction of embedding features 
test_c2a = add_semantic_features(test_c2a, sentence_model)

The feature representation was passed to the gradient boosting model to make a prediction. 

In [23]:
# Restrict test data set to features 
X_test = test_c2a.drop(columns=['id', 'description'])
y_test_pred = model.predict(X_test)

## 6. Evaluation of results 

The predictions based on the test data were compared with the gold standard and evaluation metrics were calculated. The results achieved serve as a guide and baseline for the competition participants. 

In [24]:
# Importing the gold standard 
filename = "c2a_gold.csv" # Path needs to be adjusted 
gold_c2a = pd.read_csv(filename, sep=';')

In [25]:
# Check that the IDs from the test data and the gold standard are in the same order. 
gold_c2a["id"].tolist() == test_c2a["id"].tolist()

True
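
If this comparison returned `False`, the gold labels would have to be aligned with the predictions by ID before scoring. A minimal sketch of such an alignment in plain Python follows; the helper `align_labels_by_id` and the toy IDs are hypothetical and not part of the baseline:

```python
def align_labels_by_id(pred_ids, gold_ids, gold_labels):
    """Return the gold labels reordered to follow the order of pred_ids."""
    if len(set(gold_ids)) != len(gold_ids):
        raise ValueError("gold IDs are not unique")
    label_by_id = dict(zip(gold_ids, gold_labels))
    missing = [i for i in pred_ids if i not in label_by_id]
    if missing:
        raise KeyError(f"{len(missing)} test IDs have no gold label")
    return [label_by_id[i] for i in pred_ids]

# Toy example with made-up IDs
pred_ids = [3, 1, 2]
gold_ids = [1, 2, 3]
gold_labels = [False, True, False]
aligned = align_labels_by_id(pred_ids, gold_ids, gold_labels)
print(aligned)
```

With the data frames of this notebook, the arguments would be `test_c2a['id'].tolist()`, `gold_c2a['id'].tolist()` and `gold_c2a['C2A'].tolist()`.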

In [26]:
# The true label of the tweets was extracted from the gold standard. 
y_true = gold_c2a['C2A']

The macro-averaged F1 score is used as the evaluation metric for the ranking in the competition leaderboard. In addition, precision, recall and F1 are reported for the individual classes, together with their macro and weighted averages.

In [27]:
from sklearn.metrics import classification_report
test_report = classification_report(y_true, y_test_pred)
print("Test Classification Report:")
print(test_report)

Test Classification Report:
              precision    recall  f1-score   support

       False       0.97      0.72      0.83      2693
        True       0.23      0.78      0.36       289

    accuracy                           0.73      2982
   macro avg       0.60      0.75      0.59      2982
weighted avg       0.90      0.73      0.78      2982
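
As a sanity check, the averaged F1 scores in the report can be reproduced from the per-class precision and recall. The sketch below uses the rounded values printed above, so it can only be expected to match the report to two decimal places:

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded per-class values copied from the classification report
f1_false = f1(0.97, 0.72)
f1_true = f1(0.23, 0.78)

# Macro average: unweighted mean over both classes
macro_f1 = (f1_false + f1_true) / 2

# Weighted average: mean weighted by class support (2693 False, 289 True)
weighted_f1 = (2693 * f1_false + 289 * f1_true) / (2693 + 289)

print(f"F1(False) = {f1_false:.2f}, F1(True) = {f1_true:.2f}")
print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

Because the macro average gives both classes equal weight, the low F1 of the `True` class lowers it far more than it lowers the weighted average.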

