Problem:
We have a small dataset of sentences labeled as positive or negative, and we want to train a basic machine learning model to classify the emotional tone (sentiment) of each sentence.


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

A very basic, custom-generated dataset is used here.

In a real-world scenario, data can be collected from sources such as Twitter, Reddit, or web scraping, then cleaned and labeled according to the task (e.g., sentiment, topic). The result is typically stored in a CSV file or DataFrame with "text" and "label" columns for model training.
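As a minimal sketch of the "text"/"label" layout described above, the snippet below reads a tiny CSV and runs a few sanity checks before training. The CSV content and the name `df_example` are hypothetical and only serve as an illustration; in practice the data would come from a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real project would read a file instead.
csv_text = '''text,label
"Great value, works well",1
"Stopped working after a week",0
'''

df_example = pd.read_csv(io.StringIO(csv_text))

# Check that the expected columns are present.
missing_columns = {"text", "label"} - set(df_example.columns)
if missing_columns:
    raise ValueError(f"Missing columns: {missing_columns}")

# Drop rows with missing text or label, then inspect the class balance.
df_example = df_example.dropna(subset=["text", "label"])
label_counts = df_example["label"].value_counts()
print(label_counts)
```

Checking the class balance early matters for small datasets, where a skewed split can easily leave one class underrepresented.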

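TfidfVectorizer is used for tokenization and feature extraction

TfidfVectorizer converts a collection of text documents into a matrix of TF-IDF features, where each value reflects how important a word is to a document relative to the entire corpus. A common textbook form of the TF-IDF score is:

  TF-IDF(t, d) = TF(t, d) × log(N / (1 + DF(t)))

where TF(t, d) is the term frequency of term t in document d, DF(t) is the number of documents containing t, and N is the total number of documents. Note that scikit-learn's default settings use a smoothed variant, IDF(t) = ln((1 + N) / (1 + DF(t))) + 1, and then L2-normalize each document vector, so the values it produces differ from this formula.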
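The formula above can be worked through by hand on a toy corpus. The sketch below is only an illustration of that formula, not of scikit-learn's internals; the corpus `docs` and the helper `tf_idf` are hypothetical names introduced for this example.

```python
import math

# Hypothetical toy corpus, already lowercased and split on whitespace.
docs = [
    "love this product".split(),
    "hate this product".split(),
    "love it".split(),
]

def tf_idf(term, doc, corpus):
    """TF-IDF(t, d) = TF(t, d) * log(N / (1 + DF(t)))."""
    tf = doc.count(term)                      # raw count of the term in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    n = len(corpus)                           # total number of documents
    return tf * math.log(n / (1 + df))

# "hate" appears in 1 of 3 documents: 1 * log(3 / 2), roughly 0.405
score_hate = tf_idf("hate", docs[1], docs)

# "this" appears in 2 of 3 documents: 1 * log(3 / 3) = 0
score_this = tf_idf("this", docs[1], docs)

print(score_hate, score_this)
```

The rarer word ("hate") receives the higher score, while a word shared by most documents ("this") is weighted down to zero in this variant.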

In [17]:
# 1. Create a small dataset
data = {
    "text": [
        "I love this product, it's amazing!", 
        "This is the worst thing I ever bought.",
        "Absolutely fantastic experience!", 
        "I hate it, completely useless.", 
        "Best purchase ever, highly recommend.", 
        "Terrible, waste of money.", 
        "So happy with this!", 
        "Awful quality, very disappointed."
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative
}

df = pd.DataFrame(data)
print(df)

# Load the larger dataset from CSV (this replaces the small example DataFrame above)
df = pd.read_csv("Custom_Sentiment_Dataset.csv")

# 2. Preprocess the text (lowercase, remove special characters)
def preprocess(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)  # Remove non-word characters
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

df['clean_text'] = df['text'].apply(preprocess)
print("------------------------------------------------")
print("Printing clean text")
print(df[['clean_text', 'label']])


# print number of samples in each class
print("------------------------------------------------")
print("Number of samples in each class")
print(df['label'].value_counts())

# 3. Convert text to numerical features using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = np.array(df['label'])


                                     text  label
0      I love this product, it's amazing!      1
1  This is the worst thing I ever bought.      0
2        Absolutely fantastic experience!      1
3          I hate it, completely useless.      0
4   Best purchase ever, highly recommend.      1
5               Terrible, waste of money.      0
6                     So happy with this!      1
7       Awful quality, very disappointed.      0
------------------------------------------------
Printing clean text
                            clean_text  label
0   it s perfect exactly what i wanted      1
1      highly recommend it to everyone      1
2                  not worth it at all      0
3   impressive results would buy again      1
4                    hate this product      0
..                                 ...    ...
95              it broke after one use      0
96              it broke after one use      0
97               worst experience ever      0
98             completely disa

Training with LogisticRegression

In [20]:

# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# .shape is (number of samples, number of TF-IDF features)
print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)

# 5. Train a simple Logistic Regression model
# Create and train a logistic regression model with explicit parameters
model = LogisticRegression(
    penalty='l2',        # L2 regularization (the default)
    C=1.0,               # inverse of regularization strength (smaller values = stronger regularization)
    solver='lbfgs',      # optimization algorithm
    max_iter=200,        # maximum number of solver iterations
    random_state=42      # for reproducibility
)

model.fit(X_train, y_train)

Shape of X_train:  (50, 64)
Shape of X_test:  (50, 64)


Evaluating on the test split and on new sentences

In [26]:
# 6. Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# 7. Test with new sentences
new_sentences = ["I absolutely love this!", "Horrible experience, never again.", "No never", "Waste of money.", "waste of food"]
new_sentences_clean = [preprocess(sent) for sent in new_sentences]
new_features = vectorizer.transform(new_sentences_clean)
predictions = model.predict(new_features)

# Print Predictions
for sent, pred in zip(new_sentences, predictions):
    sentiment = "Positive" if pred == 1 else "Negative"
    print(f"'{sent}' → {sentiment}")

Model Accuracy: 0.84
'I absolutely love this!' → Positive
'Horrible experience, never again.' → Positive
'No never' → Positive
'Waste of money.' → Negative
'waste of food' → Negative


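The resulting accuracy (0.84 on the test split) is limited by the simplicity of the model and the small amount of data: only 50 samples are used for training, and with only 50 test samples the estimate itself is noisy.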
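With so little data, a single train-test split gives a fairly unstable accuracy figure. One option, not used in the notebook above, is k-fold cross-validation. The sketch below assumes the eight example sentences from the small dataset and a pipeline built with `make_pipeline`; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# The eight example sentences from the small dataset above.
texts = [
    "I love this product, it's amazing!",
    "This is the worst thing I ever bought.",
    "Absolutely fantastic experience!",
    "I hate it, completely useless.",
    "Best purchase ever, highly recommend.",
    "Terrible, waste of money.",
    "So happy with this!",
    "Awful quality, very disappointed.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so the test fold never influences the vocabulary.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=200))

# 4-fold cross-validation; for classifiers, an integer cv uses stratified folds.
scores = cross_val_score(pipeline, texts, labels, cv=4, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the per-fold scores gives a rough sense of how much the accuracy depends on which samples happen to land in the test set.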

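The second sentence is predicted as "positive" but should have been "negative"; the third sentence ("No never") is probably misclassified in the same way. A likely reason is that the training data contains few or no examples with words such as "horrible" and "never". Words that are not in the vectorizer's vocabulary are ignored at prediction time, so they cannot influence the result.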
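To illustrate why unseen words contribute nothing to a prediction, the sketch below fits a TfidfVectorizer on a hypothetical three-sentence corpus and then transforms sentences containing out-of-vocabulary words. The training texts here are made up for the example and are not the notebook's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus; note that "horrible" and "never" do not appear.
train_texts = ["love this product", "hate this product", "waste of money"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts)

# Every word here is outside the learned vocabulary.
unseen = vectorizer.transform(["horrible never"])

# Only "waste" is in the vocabulary; "horrible" is dropped.
partly_seen = vectorizer.transform(["horrible waste"])

print("'horrible' in vocabulary:", "horrible" in vectorizer.vocabulary_)
print("Non-zero features (all unseen words):", unseen.nnz)
print("Non-zero features (one known word):", partly_seen.nnz)
```

A sentence made up entirely of unseen words becomes an all-zero feature vector, so the classifier can only fall back on its intercept. Adding more varied training examples is the usual remedy.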