***Custom Text Classifier using Bag-of-Words and Logistic Regression***

This notebook demonstrates a simple text classification pipeline on a hand-written dataset of 20 sentences, 10 positive and 10 negative. It converts the text to Bag-of-Words count vectors with scikit-learn's CountVectorizer and trains a Logistic Regression classifier to predict the sentiment label.

Main steps:

1. Define a small labeled dataset of positive and negative text samples
2. Transform the text into numerical vectors using CountVectorizer (BoW)
3. Split the data into training and test sets
4. Train a Logistic Regression model on the training set
5. Evaluate the model using accuracy and the classification report metrics

This example illustrates how to build a custom sentiment classifier using classical NLP methods without relying on pre-trained models.
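
The steps above can be sketched end to end with a scikit-learn pipeline. This is a minimal sketch rather than the notebook's own code: the sentences are made up for illustration, and it deviates from the listed order in one respect by splitting the raw text before vectorizing, so that the vocabulary is learned from the training sentences only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not the dataset used in this notebook
texts = [
    "i love this", "what a great day", "this makes me happy", "i feel wonderful",
    "i hate this", "what a terrible day", "this makes me sad", "i feel awful",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Split the raw sentences first so the test text never influences the vocabulary
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits CountVectorizer and LogisticRegression on the training data only
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# At prediction time the pipeline reuses the training vocabulary
predictions = model.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

Wrapping both stages in one pipeline also means new sentences can be passed to `model.predict` as plain strings, without a separate vectorization step.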



In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [2]:
data = pd.DataFrame([("i love spending time with my friends and family", "positive"),
                     ("that was the best meal i've ever had in my life", "positive"),
                     ("i feel so grateful for everything i have in my life", "positive"),
                     ("i received a promotion at work and i couldn't be happier", "positive"),
                     ("watching a beautiful sunset always fills me with joy", "positive"),
                     ("my partner surprised me with a thoughtful gift and it made my day", "positive"),
                     ("i am so proud of my daughter for graduating with honors", "positive"),
                     ("listening to my favorite music always puts me in a good mood", "positive"),
                     ("i love the feeling of accomplishment after completing a challenging task", "positive"),
                     ("i am excited to go on vacation next week", "positive"),
                     ("i feel so overwhelmed with work and responsibilities", "negative"),
                     ("the traffic during my commute is always so frustrating", "negative"),
                     ("i received a parking ticket and it ruined my day", "negative"),
                     ("i got into an argument with my partner and we're not speaking", "negative"),
                     ("i have a headache and i feel terrible", "negative"),
                     ("i received a rejection letter for the job i really wanted", "negative"),
                     ("my car broke down and it's going to be expensive to fix", "negative"),
                     ("i'm feeling sad because i miss my friends who live far away", "negative"),
                     ("i'm frustrated because i can't seem to make progress on my project", "negative"),
                     ("i'm disappointed because my team lost the game", "negative")
                    ],
                    columns=['text', 'sentiment'])

In [3]:
# Shuffle the rows. No random_state is set, so the order differs on every run.

data = data.sample(frac=1).reset_index(drop=True)
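
Because the shuffle above is unseeded, rerunning the notebook gives a different row order and therefore a different train/test split and score. A minimal sketch of a reproducible shuffle follows, assuming a fixed seed is acceptable; the tiny frame and the seed value 42 are illustrative choices, not part of the notebook.

```python
import pandas as pd

# Hypothetical four-row frame standing in for the notebook's data
df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "sentiment": ["positive", "positive", "negative", "negative"],
})

# The same random_state always yields the same permutation
shuffled_once = df.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_once.equals(shuffled_again))
```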

In [4]:
# Split features and labels

x = data['text']
y = data['sentiment']

In [5]:
# Convert text into Bag-of-Words vectors

countvec = CountVectorizer()

In [6]:
countvec_fit = countvec.fit_transform(x)

In [7]:
bag_of_words = pd.DataFrame(countvec_fit.toarray(), columns=countvec.get_feature_names_out())

In [8]:
bag_of_words

,accomplishment,after,always,am,an,and,argument,at,away,be,...,vacation,ve,wanted,was,watching,we,week,who,with,work
0,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
6,0,0,0,0,1,1,1,0,0,0,...,0,0,0,0,0,1,0,0,1,0
7,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Train-test split: hold out 30% of the rows (6 of 20) for testing

X_train, X_test, y_train, y_test = train_test_split(bag_of_words, y, test_size=0.3, random_state=7)
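
A purely random split of 20 rows can leave the test set badly unbalanced; the classification report at the end of this notebook shows one negative and five positive test samples. One option, sketched here as an assumption rather than as the notebook's method, is to pass `stratify` so that both splits keep the 50/50 class ratio. The integer list stands in for the feature rows.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Same class balance as the notebook's dataset: 10 positive, 10 negative
labels = ["positive"] * 10 + ["negative"] * 10
samples = list(range(len(labels)))  # placeholders for the Bag-of-Words rows

X_tr, X_te, y_tr, y_te = train_test_split(
    samples, labels, test_size=0.3, stratify=labels, random_state=7
)

test_counts = Counter(y_te)
print(test_counts)  # three samples of each class
```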

In [10]:
# Train logistic regression classifier

lr = LogisticRegression(random_state=1).fit(X_train, y_train)

In [11]:
# Predict and evaluate

y_pred_lr = lr.predict(X_test)

In [12]:
accuracy_score(y_test, y_pred_lr)

0.16666666666666666

In [13]:
print(classification_report(y_test, y_pred_lr, zero_division=0))

              precision    recall  f1-score   support

    negative       0.17      1.00      0.29         1
    positive       0.00      0.00      0.00         5

    accuracy                           0.17         6
   macro avg       0.08      0.50      0.14         6
weighted avg       0.03      0.17      0.05         6
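
The numbers in this report are consistent with the classifier predicting "negative" for all six test sentences: recall is 1.00 for the single negative sample and 0.00 for the five positive ones. The sketch below reconstructs the metrics from label lists built to match the support column; the lists are inferred from the report, not taken from the actual run.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Test labels matching the support column: 1 negative, 5 positive
y_true = ["negative"] + ["positive"] * 5
# Predictions implied by the recall values: everything labeled negative
y_hat = ["negative"] * 6

acc = accuracy_score(y_true, y_hat)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_hat, labels=["negative", "positive"], zero_division=0
)

print(round(acc, 2))            # 0.17, i.e. 1 correct out of 6
print(precision.round(2))       # negative: 1/6, positive: 0 (never predicted)
print(recall.round(2))          # negative: 1/1, positive: 0/5
print(f1.round(2))              # negative: 2/7, positive: 0
```

With only six test sentences, each prediction shifts accuracy by about 17 percentage points, so this score says little about how well the approach works in general.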

