This example uses a Romanian-language dataset to train and evaluate a sentiment analysis model.

We use the datasets library to load the "ro_sent" dataset from the public [HuggingFace](https://huggingface.co/) repository.

The dataset consists of phrases labeled as positive (1) or negative (0).

In [64]:
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

In [21]:
# download the dataset
ro_sentence_sentiment = load_dataset("ro_sent")

Reusing dataset ro_sent (/Users/andreiterecoasa/.cache/huggingface/datasets/ro_sent/default/1.0.0/45a32ef8c65b2b93a8602bd67cc295b1e760cf89cb32de2e3805999fafaa1c96)



In [22]:
# The downloaded dataset is already split into train and test sets, so we use those.
# We could also merge them and use train_test_split from sklearn.model_selection to split by a custom ratio.
train = ro_sentence_sentiment["train"].to_pandas().drop(columns=["original_id", "id"])
test = ro_sentence_sentiment["test"].to_pandas().drop(columns=["original_id", "id"])
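The alternative mentioned in the comment above (merging the two parts and re-splitting by a custom ratio) could look roughly like the sketch below. The two small frames and the names `toy_train`, `toy_test`, `new_train` and `new_test` are made up for illustration; they only stand in for the real `train` and `test` frames.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the real train/test frames.
toy_train = pd.DataFrame(
    {"sentence": ["bun", "rau", "super", "slab"], "label": [1, 0, 1, 0]}
)
toy_test = pd.DataFrame(
    {"sentence": ["excelent", "groaznic", "minunat", "oribil"], "label": [1, 0, 1, 0]}
)

# Merge both parts into one frame, then re-split with a custom ratio.
merged = pd.concat([toy_train, toy_test], ignore_index=True)
new_train, new_test = train_test_split(
    merged,
    test_size=0.25,            # 25% of the rows go to the test set
    random_state=42,           # fixed seed so the split is reproducible
    stratify=merged["label"],  # keep the positive/negative ratio in both parts
)

print(len(new_train), len(new_test))
```

Stratifying on the label is optional, but it keeps a small test set from ending up with only one class.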

In [23]:
# A machine learning model works with numbers: it learns patterns among numeric features and uses them to predict.
# To train the model, we first have to turn the sentences into numbers.
# To do this we use CountVectorizer from sklearn.feature_extraction.text.
# From the docs: CountVectorizer converts a collection of text documents to a matrix of token counts.
vectorizer = CountVectorizer()

In [24]:
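# The .fit method makes the vectorizer learn the vocabulary (all tokens, i.e. words) of the dataset.
# Series.append was removed in pandas 2.0, so the train and test sentences are joined as plain lists.
vectorizer.fit(list(train["sentence"]) + list(test["sentence"]))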

CountVectorizer()
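The difference between fitting and transforming can be seen on a toy example. In this sketch (with the made-up names `demo_vectorizer` and `seen`), a word that was not in the fitted vocabulary is silently dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["good food"])  # vocabulary is now: food, good

# "pizza" was never seen during fit, so it gets no column and is ignored.
seen = demo_vectorizer.transform(["good pizza"]).toarray()

print(demo_vectorizer.vocabulary_)
print(seen)
```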

In [50]:
# Preparing the inputs for the model training.

# Previously the vectorizer learned the vocabulary of the dataset. Now we apply the transformation to the train set (sentences to a matrix of token counts).

X_train = vectorizer.transform(train["sentence"])
y_train = train["label"]

# X and y are the conventional names for model inputs: X is the feature matrix and y is the target vector.

In [51]:
# For the model, we use LogisticRegression.
# This model suits problems where the output is one of two classes (yes/no, true/false, 1/0, etc.).

sentiment_model = LogisticRegression(max_iter=1000)
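As a rough intuition for how a binary logistic regression reaches a yes/no answer: it computes a weighted sum of the input features plus a bias, squashes that sum into a probability with the sigmoid function, and predicts class 1 when the probability is above 0.5. The sketch below uses only the standard library; the weights, the bias and the helper `predict_label` are invented for illustration and are not what scikit-learn learns on this dataset.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(counts, weights, bias):
    """Return (label, probability of class 1) for one row of token counts."""
    z = bias + sum(c * w for c, w in zip(counts, weights))
    probability = sigmoid(z)
    return (1 if probability > 0.5 else 0), probability


# Hypothetical weights for a two-word vocabulary: "bun" (good), "rau" (bad).
toy_weights = [1.5, -2.0]
toy_bias = 0.0

print(predict_label([1, 0], toy_weights, toy_bias))  # sentence containing "bun"
print(predict_label([0, 1], toy_weights, toy_bias))  # sentence containing "rau"
```

Training amounts to finding weights and a bias that make these probabilities agree with the labels in the training set as closely as possible.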

In [52]:
# Fit the model to the training data.
# This step is the actual model training.
sentiment_model.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [63]:
# Now we're ready to make our first prediction
sent_map = ["Negative", "Positive"]
# our input phrases
input_phrases = ["doamne ce bun a fost kfc-ul asta", "saramalele sunt ok", "e o vreme de cacat. nu vezi nimic afara"]
# transform the input to a matrix of token counts, as we did before for the training data
my_test = vectorizer.transform(input_phrases)

# predict the labels for the input_phrases
res = sentiment_model.predict(my_test)
for phrase, label in zip(input_phrases, res):
    print(f"'{phrase}' ====> {sent_map[label]}")

'doamne ce bun a fost kfc-ul asta' ====> Positive
'saramalele sunt ok' ====> Positive
'e o vreme de cacat. nu vezi nimic' ====> Positive
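Besides the hard label, LogisticRegression can also report how confident it is through `predict_proba`. The sketch below trains on four invented phrases, so the numbers it prints say nothing about the real model; the names `proba_vectorizer`, `proba_model` and `proba` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set: "bun" = good, "rau" = bad, "foarte" = very.
toy_sentences = ["foarte bun", "bun", "foarte rau", "rau"]
toy_labels = [1, 1, 0, 0]

proba_vectorizer = CountVectorizer()
proba_model = LogisticRegression(max_iter=1000)
proba_model.fit(proba_vectorizer.fit_transform(toy_sentences), toy_labels)

queries = ["bun", "rau"]
# Each row holds [P(negative), P(positive)], in the order of proba_model.classes_.
proba = proba_model.predict_proba(proba_vectorizer.transform(queries))

for phrase, (p_negative, p_positive) in zip(queries, proba):
    print(f"'{phrase}': negative={p_negative:.2f}, positive={p_positive:.2f}")
```

A probability close to 0.5 is a hint that the model is unsure, which is useful when inspecting surprising predictions.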


In [45]:
# We can use the .score method of LogisticRegression to measure the accuracy of the model.
# We first use the vectorizer to transform the test set and then simply call the .score method.
X_test = vectorizer.transform(test["sentence"])
acc = sentiment_model.score(X_test, test["label"])
f"Accuracy is {acc}"

'Accuracy is 0.8529759200363471'
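For a classifier, the value returned by `.score` is the mean accuracy: the fraction of test examples whose predicted label matches the true label. A minimal sketch of the same computation in plain Python, with a hypothetical helper `accuracy` and invented labels:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Invented labels: 3 of the 5 predictions are correct.
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]

print(accuracy(toy_true, toy_pred))  # 0.6
```

Accuracy alone does not show which class the errors fall into, so it can hide a model that does poorly on one of the two labels.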

While 85% accuracy looks good, and you will find plenty of examples for which the prediction is correct, bear in mind that this is just a simple example and does not offer a complete solution. The third phrase above, for instance, is clearly negative yet was labeled Positive.