<a href="https://colab.research.google.com/github/gabsioussema/doctors_sentiment_analysis/blob/main/doctors_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis of Doctor Reviews

## Introduction

This Colab notebook aims to develop a sentiment analysis model for categorizing reviews of doctors by patients as either positive or negative based on the text provided in the reviews. The dataset used for this task is a subset of the [German-Language Reviews of Doctors by Patients 2021](https://data.world/mc51/german-language-reviews-of-doctors-by-patients-2021) dataset. Each review includes a rating (positive or negative) and a corresponding comment.

## Tasks Overview

### Task 1: Data Splitting
The first step involves splitting the dataset into suitable training and testing sets.

### Task 2: Data Cleaning and Preprocessing
This section may involve tasks such as text cleaning, tokenization, and handling missing values.

### Task 3: Model Training
In this task, we will explore various machine learning approaches to train a sentiment analysis model. We have the flexibility to experiment with different algorithms and pre-trained models to identify the one that performs well. External data sources and embeddings can also be considered to enhance model performance.

### Task 4: Model Evaluation
The final task involves evaluating the performance of the selected sentiment analysis model on the test set using appropriate evaluation metrics. This step helps us assess how well the model generalizes to unseen data.

## Deliverables

### Jupyter Notebook and Python Scripts
The project will be organized into Jupyter notebooks and Python scripts that cover the following aspects:
- Data cleaning and preprocessing
- Model selection and training
- Model evaluation

In [None]:
from google.colab import drive
import os
from google.colab import data_table

In [None]:
import re
import pickle
import numpy as np
import pandas as pd
import nltk
from pathlib import Path
from tqdm import tqdm
from IPython.display import display


In [None]:
!pip install transformers
!pip install sentence_transformers
!pip install setfit
!pip install bertopic

In [None]:
from datasets import load_dataset
from datasets import Dataset


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import sklearn

import plotly.express as px

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer


### Mount Drive

In [None]:
gdrive='/content/gdrive/MyDrive/Sentiment_Analysis'
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
# Importing Colab's addon for displaying pandas dataframes in interactive displays
data_table.enable_dataframe_formatter()

# 1. Data Splitting

In [None]:
# reading the data
FILE_REVIEWS = Path(gdrive  + "/Data/reviews.csv")
data = pd.read_csv(FILE_REVIEWS, sep=';', na_values=[""])
data.head(5)

Unnamed: 0,rating,comment
0,positive,Ich liebe Herrn Dr. Scheeser er nimmt dich imm...
1,negative,Die Behandlungen von Dr. Brede dauern im Schni...
2,negative,Hilfe bei Zahnschmerzen. <br />\r\nNicht diese...
3,negative,Ich bin unzufrieden und kann Frau Dr. Frankenb...
4,negative,"Der arzt ist okay,aber es ist für mich offensi..."


Split the data into train and test sets


In [None]:
train, test = train_test_split(data, random_state=2, test_size=0.3, shuffle=True)
print(f"Number of samples in train set: {len(train)}")
print(f"Number of samples in test set: {len(test)}")

Number of samples in train set: 14000
Number of samples in test set: 6000
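
The split above shuffles the data but does not stratify it, so the class ratio in each part can drift slightly. As a minimal sketch (on a hypothetical toy frame, not the reviews data), passing `stratify` keeps the positive/negative ratio identical in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the reviews data (same column names).
toy = pd.DataFrame({
    "rating": ["positive"] * 10 + ["negative"] * 10,
    "comment": [f"Kommentar {i}" for i in range(20)],
})

# stratify=... preserves the class proportions of `rating` in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=2, shuffle=True, stratify=toy["rating"]
)

train_counts = toy_train["rating"].value_counts().to_dict()
test_counts = toy_test["rating"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Whether stratification matters here is a judgment call: with 20,000 roughly balanced reviews, a plain shuffled split is already close to balanced.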


The plot below shows a balanced distribution of sentiment labels in the training set, with nearly the same number of negative and positive ratings. Because neither class dominates, accuracy is a meaningful metric and the model can be assessed fairly on both negative and positive reviews.

In [None]:
fig = px.histogram(train, x='rating', nbins=2, labels={'rating': 'Sentiment'},
                   title='Distribution of Sentiment Ratings in Training Data')
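# Address the ticks by category name so the labels cannot be swapped by the order of appearance in the data
fig.update_xaxes(tickvals=['negative', 'positive'], ticktext=['Negative', 'Positive'])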
fig.update_layout(width=700, height=400)
fig.show()


## BERTopic


Using the embeddings from the German BERT sentence transformer (the `german_bert_embedding` column computed in the German Semantic Model section below), we fit a BERTopic model to explore the data and to check whether sentiment information can be extracted from the identified topics alone.

The intertopic distance map shows a few large topics, drawn as the larger circles. On inspection, these major topics appear to correspond to positive and negative sentiment.

In [None]:
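# embedding_model=None: reuse the precomputed embeddings instead of encoding the comments again
topic_model = BERTopic(embedding_model=None)
topics, probs = topic_model.fit_transform(data['comment'].tolist(),
                                          embeddings=np.array(data['german_bert_embedding'].tolist()))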

In [None]:
topic_model.visualize_topics()


# 2. Cleaning and preprocessing


We use the SentiWS word lists to preprocess the text for sentiment analysis. Words that appear in the positive list are replaced with the token 'positive' and words that appear in the negative list with the token 'negative', which gives the model an explicit sentiment signal. The files can be found under this [link](https://www.kaggle.com/datasets/rtatman/german-sentiment-analysis-toolkit).
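
To illustrate the idea, here is a minimal sketch of lexicon-based token replacement. The two mini-lexicons and the helper `tag_sentiment_words` are hypothetical and exist only for this example; the notebook builds the real word sets from the SentiWS files.

```python
# Hypothetical mini-lexicons; the notebook derives the real ones from SentiWS.
toy_pos = {"freundlich", "kompetent"}
toy_neg = {"unfreundlich", "schlecht"}


def tag_sentiment_words(text, pos_words, neg_words):
    """Replace lexicon words with the tokens 'positive' / 'negative'."""
    tagged = []
    for word in text.split():
        key = word.lower()  # the lexicons are stored in lowercase
        if key in pos_words:
            tagged.append("positive")
        elif key in neg_words:
            tagged.append("negative")
        else:
            tagged.append(word)
    return " ".join(tagged)


example = tag_sentiment_words(
    "Sehr freundlich aber schlecht organisiert", toy_pos, toy_neg
)
print(example)
```

Words outside both lexicons pass through unchanged, so the rest of the comment stays available to the vectorizer.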

In [None]:
# Load the SentiWS lists of positive and negative German words (columns: word|POS tag, score, inflected forms)
positive_data = pd.read_csv(gdrive+'/Data/SentiWS_v1.8c_Positive.txt', sep='\t', header=None)
negative_data = pd.read_csv(gdrive+'/Data/SentiWS_v1.8c_Negative.txt', sep='\t', header=None)

# Print the first 5 rows of positive_data to display a sample of positive words.
print("Sample of Positive Words:")
display(positive_data.head(5))

# Print the first 5 rows of negative_data to display a sample of negative words.
print("\nSample of Negative Words:")
display(negative_data.head(5))


Sample of Positive Words:


Unnamed: 0,0,1,2
0,Abmachung|NN,0.004,Abmachungen
1,Abschluß|NN,0.004,"Abschlüße,Abschlußs,Abschlußes,Abschlüßen"
2,Abstimmung|NN,0.004,Abstimmungen
3,Agilität|NN,0.004,
4,Aktivität|NN,0.004,Aktivitäten



Sample of Negative Words:


Unnamed: 0,0,1,2
0,Abbau|NN,-0.058,"Abbaus,Abbaues,Abbauen,Abbaue"
1,Abbruch|NN,-0.0048,"Abbruches,Abbrüche,Abbruchs,Abbrüchen"
2,Abdankung|NN,-0.0048,Abdankungen
3,Abdämpfung|NN,-0.0048,Abdämpfungen
4,Abfall|NN,-0.0048,"Abfalles,Abfälle,Abfalls,Abfällen"


In [None]:
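# Extract strongly polarized words (|score| > 0.2) together with their inflected forms
def extract_words(data, pos=1):
    data = data.rename(columns={0: 'word', 1: 'score', 2: 'list'})
    # pos=1 keeps scores > 0.2, pos=-1 keeps scores < -0.2
    data = data[pos * data.score > 0.2].copy()
    # Drop the part-of-speech tag, e.g. "Abbau|NN" -> "Abbau"
    data['word'] = data['word'].apply(lambda x: x.split('|')[0])
    # Append the inflected forms; entries without inflections keep their base word
    data['word'] = data['word'].str.cat(data['list'].fillna(''), sep=',')
    # Lowercase everything because clean_text looks words up in lowercase
    data['word'] = data['word'].apply(lambda x: [w.lower() for w in x.split(',') if w])
    return set(data.explode('word')['word'].tolist()) # set of words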

In [None]:
neg_set = extract_words(negative_data, -1)
pos_set = extract_words(positive_data, pos=1)

The `replace_emoji` function prepares the text for sentiment analysis by substituting emoticons with the sentiment tokens 'positive' or 'negative'.

In [None]:
def replace_emoji(comment):
    emoji_categories = {
        r'(:\s?\(|:\s-\s\(|\)\s?:|\)\s-\s:| :,\(|:\'\(|:"\()| :/': 'negative',
        r'(:\s?\)|;-\)|:\s-\s\)|\(\s?:|\(-:|:\'\)|:\s?D|:-D|x-?D|X-?D|<3|:\*)': 'positive',
    }
    for emoji_pattern, category in emoji_categories.items():
        comment = re.sub(emoji_pattern, ' ' + category + ' '  , comment)
    return comment

In [None]:
sentences = [
    "I'm feeling so happy today! :-D",
    "I had a terrible day at work. : - (",
    "I love pizza! <3",
    "This movie was hilarious! LOL :-D",
    "I'm  happyyyyyy! :))))))))))))",
    "Hello, im not happy! :/"
]

# Iterate through the sentences and print the original and modified sentences
for sentence in sentences:
    modified_sentence = replace_emoji(sentence)
    print(f"Original: {sentence}")
    print(f"Modified: {modified_sentence}\n")

Original: I'm feeling so happy today! :-D
Modified: I'm feeling so happy today!  positive 

Original: I had a terrible day at work. : - (
Modified: I had a terrible day at work.  negative 

Original: I love pizza! <3
Modified: I love pizza!  positive 

Original: This movie was hilarious! LOL :-D
Modified: This movie was hilarious! LOL  positive 

Original: I'm  happyyyyyy! :))))))))))))
Modified: I'm  happyyyyyy!  positive )))))))))))

Original: Hello, im not happy! :/
Modified: Hello, im not happy! negative 



In [None]:
nltk.download('punkt')  # tokenizer models required by word_tokenize
stemmer = SnowballStemmer("german")  # German stemmer

nltk.download('stopwords')  # stop-word lists, used to remove German stop words
stop_words = set(stopwords.words("german"))

def clean_text(text):
    RE_WSPACE = re.compile(r"\s+", re.IGNORECASE)
    RE_TAGS = re.compile(r"<[^>]+>")
    RE_NON_LETTER = re.compile(r"[^A-Za-zÀ-ž]", re.IGNORECASE)
    RE_SINGLECHAR = re.compile(r"\b[A-Za-zÀ-ž]\b", re.IGNORECASE)

    # Replace emoticons first; residuals such as ")))" are removed by the steps below.
    text = replace_emoji(text)
    # Remove HTML tags such as "<br />"
    text = re.sub(RE_TAGS, " ", text)
    # Remove everything that is not a letter (digits, punctuation, symbols)
    text = re.sub(RE_NON_LETTER, " ", text)
    # Remove all single characters
    text = re.sub(RE_SINGLECHAR, " ", text)
    # Collapse repeated whitespace
    text = re.sub(RE_WSPACE, " ", text)

    words_tokens = word_tokenize(text)

    words_filtered = []
    for word in words_tokens:
        word_lower = word.lower()
        # Skip stop words (the NLTK list is lowercase)
        if word_lower in stop_words:
            continue
        # Replace lexicon words with the token "negative" or "positive"
        if word_lower in neg_set:
            word = "negative"
        elif word_lower in pos_set:
            word = "positive"
        # Stem last, so the sentiment tokens become "negativ" / "positiv"
        words_filtered.append(stemmer.stem(word))
    text_clean = " ".join(words_filtered)
    return text_clean

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
%%time
# Clean and process each comment using our clean_text function
train["comment_clean"] = train["comment"].apply(clean_text)
test["comment_clean"] = test["comment"].apply(clean_text)

CPU times: user 7.29 s, sys: 0 ns, total: 7.29 s
Wall time: 7.75 s


In [None]:
# Save the processed data in drive
train.to_csv(gdrive+'/Data/train_processed_data.csv')
test.to_csv(gdrive+'/Data/test_processed_data.csv')

### Analysis

In [None]:
# Display the comments before and after applying our processing function
train_sample = train.sample(n=7, random_state=0)
display(train_sample)

Unnamed: 0,rating,comment,comment_clean
12806,positive,Ich fand die Therapie hilfreich und der Hund v...,ich fand therapi hilfreich hund fr bos suss
14243,positive,"Sehr nette und freundliche Praxis,kurze Termin...",sehr nett freundlich praxis kurz termin ausges...
7176,positive,"Gründliche Ärztin, die auch ihre Kontakte spie...",grundlich arztin kontakt spiel lasst spezialis...
110,negative,"Man hat nicht den Eindruck, trotz frischer Ste...",man eindruck trotz frisch stent irgendein weis...
5089,negative,Hat ungefragt unnütze Behandlungen durchgeführ...,hat ungefragt unnutz behandl durchgefuhrt abge...
15818,negative,Gibt sich fast immer freundlich und zuvorkomme...,gibt fast imm freundlich zuvorkomm krag platzt...
17759,positive,Ich bin bisher sehr zufrieden mit Frau Mertes....,ich bish zufried frau mert wenn kost behandl n...


In [None]:
train_clean = train.copy()
test_clean = test.copy()

In [None]:
word_freq = pd.Series(" ".join(train_clean['comment_clean']).split()).value_counts()
word_freq_top40 = word_freq.iloc[:40]  # the 40 most frequent words
word_freq_df = pd.DataFrame({'Word': word_freq_top40.index, 'Frequency': word_freq_top40.values})
fig = px.bar(word_freq_df, x='Word', y='Frequency', title='Word Frequency of Top 40 Common Words in Comments')
fig.update_xaxes(tickangle=60)
fig.update_layout(width=700, height=400)
fig.show()


### Feature creation with TF-IDF
Using unigrams and bigrams (`ngram_range=(1, 2)`), we build TF-IDF vectors whose vocabulary excludes very rare terms (appearing in fewer than 10 comments, `min_df=10`) and very frequent terms (appearing in more than 30% of comments, `max_df=0.3`).
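
As a small illustration of how `min_df` and `max_df` prune the vocabulary, the sketch below fits a `TfidfVectorizer` on a hypothetical four-document corpus. The thresholds are scaled down to suit the toy data and are not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of already-cleaned comments.
docs = [
    "arzt sehr nett",
    "arzt sehr kompetent",
    "arzt wartezeit lang",
    "arzt wartezeit kurz",
]

# min_df=2: drop terms that occur in fewer than 2 documents.
# max_df=0.75: drop terms that occur in more than 75% of documents
# ("arzt" occurs in all four).
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.75, ngram_range=(1, 2))
toy_vectorizer.fit(docs)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
```

Only terms and bigrams that appear in at least two documents, and not in nearly all of them, survive the filter.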

In [None]:
vectorizer_tf_idf = TfidfVectorizer(analyzer="word", max_df=0.3, min_df=10, ngram_range=(1, 2), norm="l2")
# Fit the vocabulary and IDF weights on the training set only; the test set is only transformed later
vectorizer_tf_idf.fit(train_clean["comment_clean"])

In [None]:
positiv_idx = vectorizer_tf_idf.vocabulary_.get("positiv")
print(f"Vocabulary index of the token *positiv*: {positiv_idx}")

negativ_idx = vectorizer_tf_idf.vocabulary_.get("negativ")
print(f"Vocabulary index of the token *negativ*: {negativ_idx}")

Vocabulary index of the token *positiv*: 1007
Vocabulary index of the token *negativ*: 915


Here, we transform each comment in the training set into a numeric vector using TF-IDF.

In [None]:
X_train = train_clean["comment_clean"]
Y_train = train_clean["rating"]
X_train_vec = vectorizer_tf_idf.transform(X_train)
X_train_vec.get_shape()

(14000, 1490)

In [None]:
X_test = test_clean["comment_clean"]
Y_test = test_clean["rating"]
X_test_vec = vectorizer_tf_idf.transform(X_test)

### Feature creation with BOW (Bag of words)

In the Bag of Words (BoW) section, we are essentially following the same steps to represent our data as we did when using TF-IDF.

In [None]:
vectorizer_bow = CountVectorizer()

In [None]:
# Fit the vocabulary on the training set only; the test set is only transformed later
vectorizer_bow.fit(train_clean["comment_clean"])

In [None]:
positiv_idx = vectorizer_bow.vocabulary_.get("positiv")
print(f"Vocabulary index of the token *positiv*: {positiv_idx}")

negativ_idx = vectorizer_bow.vocabulary_.get("negativ")
print(f"Vocabulary index of the token *negativ*: {negativ_idx}")

Vocabulary index of the token *positiv*: 5942
Vocabulary index of the token *negativ*: 5436


In [None]:
X_train = train_clean["comment_clean"]
Y_train = train_clean["rating"]
X_train_vec_bow = vectorizer_bow.transform(X_train)
X_train_vec_bow.get_shape()


(14000, 1490)

In [None]:
X_test = test_clean["comment_clean"]
Y_test = test_clean["rating"]
X_test_vec_bow = vectorizer_bow.transform(X_test)

# 3. Models Training and Evaluation

## Logistic Regression

### Logistic regression using tf-idf embeddings

In [None]:
logis_tf_idf = LogisticRegression(solver="sag", random_state=1)

In [None]:
logis_tf_idf.fit(X_train_vec, Y_train)
prediction = logis_tf_idf.predict(X_test_vec)
logistic_report = sklearn.metrics.classification_report(Y_test, prediction)

In [None]:
print(logistic_report)

              precision    recall  f1-score   support

    negative       0.89      0.92      0.91      3006
    positive       0.92      0.89      0.90      2994

    accuracy                           0.91      6000
   macro avg       0.91      0.91      0.91      6000
weighted avg       0.91      0.91      0.91      6000



### Logistic Regression using BOW embeddings

In [None]:
logis_bow = LogisticRegression(solver="sag", random_state=1)

In [None]:
logis_bow.fit(X_train_vec_bow, Y_train)
prediction = logis_bow.predict(X_test_vec_bow)
logistic_report_bow = sklearn.metrics.classification_report(Y_test, prediction)

In [None]:
print(logistic_report_bow)

              precision    recall  f1-score   support

    negative       0.90      0.91      0.90      3006
    positive       0.91      0.90      0.90      2994

    accuracy                           0.90      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.90      0.90      0.90      6000



## Ridge Regression

In Ridge Regression, we perform a grid search to fine-tune the hyperparameter alpha, aiming to obtain the best classifier based on this optimization.
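
One way to organize such a search is to wrap the vectorizer and the classifier in a `Pipeline`, so that each cross-validation fold fits the vocabulary only on its own training part. The sketch below is an alternative layout on hypothetical toy data, not the setup used in the cells that follow; the texts, the alpha grid and `cv=2` are assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four positive and four negative cleaned comments.
texts = [
    "sehr gut nett", "gut kompetent", "nett freundlich gut", "super freundlich",
    "schlecht unfreundlich", "lange wartezeit schlecht",
    "unfreundlich arrogant", "schlecht arrogant",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ridge", RidgeClassifier()),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
toy_prediction = search.predict(["gut nett"])
print(toy_prediction)
```

After fitting, `search` behaves like the best pipeline refit on all training data, so `search.predict` accepts raw cleaned text directly.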

### Ridge regression with tf-idf

In [None]:
ridge = RidgeClassifier()

In [None]:
ridge.fit(X_train_vec, Y_train)
prediction = ridge.predict(X_test_vec)
ridge_report = sklearn.metrics.classification_report(Y_test, prediction)
print(ridge_report)

              precision    recall  f1-score   support

    negative       0.90      0.92      0.91      3006
    positive       0.91      0.89      0.90      2994

    accuracy                           0.90      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.90      0.90      0.90      6000



### Performing GridSearch for hyperparameter alpha with 5 cv

In [None]:
parameters = {'alpha':[0.00001,0.001,0.01, 0.1, 1, 5, 4.5, 10, 100]}
grid_search = GridSearchCV(ridge, parameters,cv=5)
Y_train = np.array(Y_train)
grid_search.fit(X_train_vec,Y_train)

print(grid_search.best_params_)
predictions = grid_search.best_estimator_.predict(X_test_vec)
# Report the tuned model's own predictions
print(f"Results for GridSearch with parameter Tuning on Ridge Regression:\n {sklearn.metrics.classification_report(Y_test, predictions)}")

{'alpha': 4.5}
Results for GridSearch with parameter Tuning on Ridge Regression:
               precision    recall  f1-score   support

    negative       0.91      0.90      0.90      3006
    positive       0.90      0.91      0.90      2994

    accuracy                           0.90      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.90      0.90      0.90      6000



### Ridge Regression with BOW embeddings

In [None]:
parameters = {'alpha':[0.00001,0.001,0.01, 0.1, 1, 5, 4.5, 10, 100]}
grid_search = GridSearchCV(ridge, parameters,cv=5)
Y_train_bow = np.array(Y_train)
grid_search.fit(X_train_vec_bow,Y_train_bow)
print(grid_search.best_params_)
predictions = grid_search.best_estimator_.predict(X_test_vec_bow)
# Report the tuned model's own predictions
print(f"Results for GridSearch with parameter Tuning on Ridge Regression:\n {sklearn.metrics.classification_report(Y_test, predictions)}")

{'alpha': 1}
Results for GridSearch with parameter Tuning on Ridge Regression:
               precision    recall  f1-score   support

    negative       0.90      0.91      0.90      3006
    positive       0.91      0.90      0.90      2994

    accuracy                           0.90      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.90      0.90      0.90      6000



## SVC

Another algorithm we evaluate is the linear Support Vector Classifier (`LinearSVC`), a linear SVM that is generally a strong baseline on sparse text features such as Bag of Words (BoW) and TF-IDF.

### SVC tf-idf embeddings

In [None]:
svm_tf_idf = LinearSVC(random_state=1)

svm_tf_idf.fit(X_train_vec, Y_train)
prediction_svm_tf_idf = svm_tf_idf.predict(X_test_vec)
svm_tf_idf_report = sklearn.metrics.classification_report(Y_test, prediction_svm_tf_idf)


In [None]:
print(svm_tf_idf_report)

              precision    recall  f1-score   support

    negative       0.90      0.91      0.90      3006
    positive       0.91      0.90      0.90      2994

    accuracy                           0.90      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.90      0.90      0.90      6000



### SVC BOW embeddings

In [None]:
linear_svc_bow = LinearSVC(random_state=1)
linear_svc_bow.fit(X_train_vec_bow, Y_train)
linear_svc_bow_pred = linear_svc_bow.predict(X_test_vec_bow)
linear_svc_bow_report = sklearn.metrics.classification_report(Y_test, linear_svc_bow_pred)

In [None]:
print(linear_svc_bow_report)

              precision    recall  f1-score   support

    negative       0.90      0.91      0.90      3006
    positive       0.91      0.90      0.90      2994

    accuracy                           0.90      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.90      0.90      0.90      6000



## German Semantic Model (built on BERT)

Here, we use a German BERT-based sentence embedding model. Our own cleaning and stemming steps are not needed in this case, because the model's tokenizer processes the raw text when embedding it.

In [None]:
# Bert base german model
model = SentenceTransformer('aari1995/German_Semantic_STS_V2')



In [None]:
sentences = data['comment'].to_list()
encoded_sentences = [model.encode([sentence])[0] for sentence in tqdm(sentences, desc="Encoding")]

Encoding: 100%|██████████| 20000/20000 [11:11<00:00, 29.78it/s]


In [None]:
data['german_bert_embedding'] = encoded_sentences

In [None]:
# Saving Bert embeddings
data.to_csv(gdrive+'/Data/bert_processed_data.csv')

In [None]:
X = data['german_bert_embedding'].tolist()
Y = data['rating'].values
X_train_bert, X_test_bert, Y_train_bert, Y_test_bert = train_test_split(X, Y, test_size=0.3, random_state=1)

### Logistic Regression Bert Embeddings

In [None]:
bert_logistic = LogisticRegression(solver="sag")
bert_logistic.fit(X_train_bert, Y_train_bert)
y_pred = bert_logistic.predict(X_test_bert)
bert_logistic_report = sklearn.metrics.classification_report(Y_test_bert, y_pred)

In [None]:
print(bert_logistic_report)

              precision    recall  f1-score   support

    negative       0.95      0.95      0.95      3037
    positive       0.95      0.94      0.95      2963

    accuracy                           0.95      6000
   macro avg       0.95      0.95      0.95      6000
weighted avg       0.95      0.95      0.95      6000



### Ridge Regression Bert Embeddings

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'alpha':[0.00001,0.001,0.01, 0.1, 1, 5, 4.5, 10, 100]}
grid_search = GridSearchCV(ridge, parameters,cv=5)
grid_search.fit(X_train_bert, Y_train_bert)

print(grid_search.best_params_)
predictions = grid_search.best_estimator_.predict(X_test_bert)
# Evaluate against the BERT test labels, using the tuned model's own predictions
print(f"Results for GridSearch with parameter Tuning on Ridge Regression:\n {sklearn.metrics.classification_report(Y_test_bert, predictions)}")

In [None]:
print(f"Results for GridSearch with parameter Tuning on Ridge Regression:\n {sklearn.metrics.classification_report(Y_test_bert, predictions)}")

Results for GridSearch with parameter Tuning on Ridge Regression:
               precision    recall  f1-score   support

    negative       0.90      0.91      0.90      3006
    positive       0.91      0.90      0.90      2994

    accuracy                           0.90      6000
   macro avg       0.90      0.90      0.90      6000
weighted avg       0.90      0.90      0.90      6000



### SVC Bert Embedding

In [None]:
svm_bert = LinearSVC(random_state=42, max_iter=100000)
svm_bert.fit(X_train_bert, Y_train_bert)
prediction_bert = svm_bert.predict(X_test_bert)
svm_bert_report = sklearn.metrics.classification_report(Y_test_bert, prediction_bert)

In [None]:
print(svm_bert_report)

              precision    recall  f1-score   support

    negative       0.51      0.51      0.51      3037
    positive       0.50      0.50      0.50      2963

    accuracy                           0.50      6000
   macro avg       0.50      0.50      0.50      6000
weighted avg       0.50      0.50      0.50      6000



# Overall accuracy reports + Conclusion

In our sentiment analysis project focused on doctor reviews, we evaluated various combinations of embeddings and models. Here are the key takeaways:

| *Embedding \ Model* | *Logistic Regression* | *Ridge Regression* | *SVC* |
|---------------------|-----------------------|--------------------|-------|
| TF-IDF              | 91%                   | 90%                | 90%   |
| Bag of Words        | 90%                   | 90%                | 90%   |
| BERT Embedding      | 95%                   | 90%                | 50%   |

TF-IDF and Bag of Words (BoW): Both TF-IDF and BoW embeddings, coupled with logistic or ridge regression, delivered consistent and competitive accuracies of around 90-91%.

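BERT (German_Semantic_STS_V2) Embedding: Sentence embeddings from this German BERT-based model, combined with logistic regression, gave the highest accuracy at 95%. The BERT experiments use a different train/test split (`random_state=1`) than the TF-IDF and BoW experiments (`random_state=2`), so the comparison is indicative rather than exact.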

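Machine Learning Models: Logistic regression was the most consistent model across embeddings. Two figures in the BERT row should be re-checked before drawing conclusions: the 90% for ridge regression comes from a report whose class supports (3006/2994) match the TF-IDF/BoW test split rather than the BERT test split (3037/2963), and the 50% for SVC is chance level on a balanced binary task.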

We primarily used accuracy as the evaluation metric due to the balanced nature of our dataset. It provides a straightforward measure of correct predictions and is easy to interpret.
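
For reference, the sketch below shows how accuracy relates to the per-class precision, recall and F1 values printed in the classification reports. The confusion counts are hypothetical and are not results from this notebook.

```python
# Hypothetical confusion counts for a binary test set (not the notebook's results).
tp, fn = 90, 10   # positive reviews classified correctly / incorrectly
tn, fp = 80, 20   # negative reviews classified correctly / incorrectly

# Accuracy: share of all reviews that were classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision, recall and F1 for the positive class.
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.3f} precision={precision_pos:.3f} "
      f"recall={recall_pos:.3f} f1={f1_pos:.3f}")
```

On a balanced dataset, accuracy and macro-averaged F1 tend to stay close to each other, which is why accuracy alone is a reasonable summary here.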

# Few Shot Learning

Sentiment analysis models are traditionally trained on large labeled datasets, but acquiring such labels is resource-intensive and limits scalability. Few-shot learning aims to reduce the amount of labeled data needed without giving up much accuracy.

We use SetFit, a few-shot fine-tuning approach for sentence transformers based on contrastive learning, to answer one question: can we reach comparable sentiment analysis performance using only 1% of our labeled data (200 of the 20,000 reviews)?

The following cells train SetFit on that 1% and evaluate it on the remaining 99%. The resulting accuracy of about 93% suggests that a small labeled sample can be enough for this task.
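
With so few training examples, it can help to guarantee that both classes are represented equally. A minimal sketch on hypothetical toy data, using an absolute `train_size` together with `stratify` (an assumption for illustration, not the split used in the next cell):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 1,000 balanced reviews.
toy = pd.DataFrame({
    "comment": [f"Kommentar {i}" for i in range(1000)],
    "rating": ["positive", "negative"] * 500,
})

# Draw 10 labeled examples (1%) for training; stratify keeps them balanced.
few_shot, rest = train_test_split(
    toy, train_size=10, random_state=2, shuffle=True, stratify=toy["rating"]
)

few_shot_counts = few_shot["rating"].value_counts().to_dict()
print(len(few_shot), len(rest))
print(few_shot_counts)
```

The remaining rows in `rest` play the role of the evaluation set.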

In [None]:
# Keep 1% of the data for few-shot training; the remaining 99% is used for evaluation
df_train_setfit, df_test_setfit = train_test_split(data, random_state=2, test_size=0.99, shuffle=True)

In [None]:
df_train_setfit['rating_setfit'] = df_train_setfit['rating'].map({'positive' : 1, 'negative': 0})
df_test_setfit['rating_setfit'] = df_test_setfit['rating'].map({'positive' : 1, 'negative': 0})

In [None]:
model = SetFitModel.from_pretrained('aari1995/German_Semantic_STS_V2')

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [None]:
trainer = SetFitTrainer(
    model=model,
    train_dataset=Dataset.from_pandas(df_train_setfit),
    eval_dataset=Dataset.from_pandas(df_test_setfit),
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,
    num_epochs=1,
    column_mapping = {'comment' : 'text', 'rating_setfit': 'label'}
)

In [None]:
trainer.train()

Applying column mapping to training dataset


***** Running training *****
  Num examples = 8000
  Num epochs = 1
  Total optimization steps = 500
  Total train batch size = 16


In [None]:
trainer.evaluate()

Applying column mapping to evaluation dataset
***** Running evaluation *****


{'accuracy': 0.9326262626262626}