# Logistic Regression using scikit-learn

This notebook is based on the sentiment classification example in Jurafsky and Martin's textbook *Speech and Language Processing*.

(C) 2025 by [Damir Cavar](https://damir.cavar.me/)

In [23]:
import numpy as np
import os
import math
import csv
import ast
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We use the VADER lexicon to look up sentiment scores for individual words:

In [13]:
vader_filename = "data/vader_lexicon.txt"
vader_data = {}
if os.path.exists(vader_filename):
    with open(vader_filename, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line: token, mean rating, standard deviation, list of raw ratings
            fields = line.strip().split('\t')
            if len(fields) != 4:
                continue  # skip malformed lines
            vader_data[fields[0]] = (float(fields[1]), float(fields[2]), ast.literal_eval(fields[3]))
else:
    print(f"File {vader_filename} does not exist.")

Check that two sample entries were loaded correctly:

In [15]:
print("annoying", vader_data["annoying"])
print("admirable", vader_data["admirable"])

annoying (-1.7, 0.64031, [-1, -2, -1, -2, -1, -1, -2, -2, -3, -2])
admirable (2.6, 0.66332, [2, 3, 3, 3, 4, 3, 2, 2, 2, 2])
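
Each lexicon line holds four tab-separated fields: the token, its mean rating, the standard deviation, and the list of raw ratings. As a minimal sketch, the parsing step can be tried on a single line. The helper name `parse_lexicon_line` is hypothetical, and the sample line is reconstructed from the output printed above rather than read from the file.

```python
import ast

def parse_lexicon_line(line: str):
    """Parse one tab-separated lexicon line into (token, (mean, std, ratings)).

    Returns None for lines that do not have exactly four fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    token, mean, std, ratings = fields
    # ast.literal_eval safely turns the text "[-1, -2, ...]" into a Python list.
    return token, (float(mean), float(std), ast.literal_eval(ratings))

# Sample line reconstructed from the printed entry for "annoying".
sample = "annoying\t-1.7\t0.64031\t[-1, -2, -1, -2, -1, -1, -2, -2, -3, -2]"
token, entry = parse_lexicon_line(sample)
print(token, entry)
```

The first value of each entry (the mean rating) is the only one the feature function below relies on; its sign decides whether a token counts as positive or negative.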


This is the vectorization function as defined in the textbook:

In [16]:
def generate_feature_vector(text: str) -> np.ndarray:
    tokens = word_tokenize(text)
    # Look up each token's mean VADER score; unknown tokens get a neutral 0.
    scores = [vader_data.get(t, (0.0,))[0] for t in tokens]
    positive_terms = sum(1 for s in scores if s > 0)
    negative_terms = sum(1 for s in scores if s < 0)
    no_in_text = int("no" in tokens)
    # Matching is case-sensitive, so "You" or "Your" are not counted.
    pronouns = {"I", "you", "me", "your", "mine"}
    count_pronouns = sum(1 for t in tokens if t in pronouns)
    excl_in_text = int("!" in tokens)
    # Guard against empty input, where log(0) would raise a ValueError.
    log_length = math.log(len(tokens)) if tokens else 0.0
    return np.array([positive_terms, negative_terms, no_in_text,
                     count_pronouns, excl_in_text, log_length])

In the textbook, the feature vector consists of the following six values:
- number of positive terms in the text
- number of negative terms in the text
- 1 if there is a *no* in the text, 0 otherwise
- number of 1st and 2nd person pronouns
- 1 if there is a *!* in the text, 0 otherwise
- the log of the number of tokens
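
To make the six features concrete, here is a small self-contained sketch that computes them for one sentence. It is not the notebook's function: the lexicon `TOY_LEXICON` and its scores are made up for illustration, and tokenization is a plain whitespace split instead of `word_tokenize`.

```python
import math

# Hypothetical toy lexicon (token -> polarity score); the notebook uses VADER instead.
TOY_LEXICON = {"great": 3.1, "love": 3.2, "boring": -1.3}
PRONOUNS = {"I", "you", "me", "your", "mine"}

def toy_feature_vector(text: str) -> list:
    # Assumption: tokens are already separated by spaces.
    tokens = text.split()
    scores = [TOY_LEXICON.get(t, 0.0) for t in tokens]
    return [
        sum(1 for s in scores if s > 0),            # positive terms
        sum(1 for s in scores if s < 0),            # negative terms
        int("no" in tokens),                        # 1 if "no" occurs
        sum(1 for t in tokens if t in PRONOUNS),    # pronoun count
        int("!" in tokens),                         # 1 if "!" occurs
        math.log(len(tokens)) if tokens else 0.0,   # natural log of token count
    ]

features = toy_feature_vector("I love it , no boring parts , great !")
print(features)
```

The sentence has ten tokens, two of them positive (*love*, *great*) and one negative (*boring*), so the first five features are 2, 1, 1, 1, 1 and the last is the natural log of 10.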

The following cell reads the review texts and their sentiment labels from the `reviews.csv` file. Each text will later be converted into a feature vector.

In [17]:
experiment_data = []
with open(os.path.join('.', 'data', 'reviews.csv'), newline='') as csvfile:
    datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
    header = next(datareader)
    for row in datareader:
        if len(row) == 2:
            experiment_data.append( [row[0].strip(), int(row[1].strip())] )

Make sure that we have all the examples and that there are 25,000 positive and 25,000 negative examples:

In [18]:
count_positive = sum([ 1 for x in experiment_data if x[1] == 1 ])
count_negative = sum([ 1 for x in experiment_data if x[1] == 0 ])
print(f"Positive: {count_positive}\t Negative: {count_negative}")
print("Total reviews:", len(experiment_data))

Positive: 25000	 Negative: 25000
Total reviews: 50000


Each data point is a two-element list containing the review text and its sentiment label (1 for positive, 0 for negative):

In [19]:
experiment_data[0]

["Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",
 0]

We create the data-point vectors using our `generate_feature_vector` function:

In [20]:

X = np.array([ generate_feature_vector(x[0]) for x in experiment_data ])
y = np.array([ x[1] for x in experiment_data ])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
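
With `test_size=0.2`, the 50,000 reviews are divided into 40,000 training and 10,000 test examples, and `random_state=42` makes the shuffle reproducible. The idea can be sketched with the standard library alone. The helper `simple_train_test_split` is hypothetical and does not reproduce scikit-learn's exact shuffling, only the principle of a seeded shuffle followed by a cut.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed and cut off a test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 50,000 reviews: just their indices.
train, test = simple_train_test_split(range(50000))
print(len(train), len(test))
```

Because the split is random rather than stratified, the two classes are only approximately balanced within each portion.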

We use the scikit-learn `LogisticRegression` class and fit the model on the training data:

In [21]:
model = LogisticRegression()
model.fit(X_train, y_train)
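
A fitted binary logistic regression model scores a feature vector by taking the weighted sum of the features plus a bias and passing it through the sigmoid function. The sketch below shows that computation by hand. The weights, bias, and feature values are illustrative numbers only, not the parameters learned by the model above (those are stored in `model.coef_` and `model.intercept_`).

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(x, w, b):
    """Probability of the positive class: sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative values only; NOT the parameters learned by the notebook's model.
w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # one weight per feature
b = 0.1                                 # bias term
x = [3, 2, 1, 3, 0, 4.19]               # positive, negative, no, pronouns, !, log length

p = predict_proba_positive(x, w, b)
label = 1 if p > 0.5 else 0             # decision threshold of 0.5
print(round(p, 2), label)
```

With these numbers the weighted sum is 0.833, which the sigmoid maps to a probability of about 0.70, so the example is classified as positive. Note that `LogisticRegression()` with default settings applies L2 regularization during fitting, which affects the learned weights but not this prediction formula.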

We evaluate the model on the held-out test portion of the data:

In [22]:
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.7048
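
Accuracy is the fraction of test examples whose predicted label matches the true label. As a minimal sketch, it can be computed by hand as follows. The helper `accuracy` and the toy label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels, not taken from the review data.
y_true_toy = [1, 0, 1, 1, 0]
y_pred_toy = [1, 0, 0, 1, 1]
print(accuracy(y_true_toy, y_pred_toy))
```

On the 10,000 test reviews, an accuracy of 0.7048 corresponds to 7,048 correct predictions. Since the full data set is balanced, always guessing one class would score roughly 0.5.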

