# Scikit-learn Logistic Regression

(C) 2023-2024 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 1.2, September 2024

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks).

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

This tutorial was developed as part of the course material for the course Advanced Natural Language Processing at [Indiana University](https://www.indiana.edu/).

**Prerequisites:**

In [None]:
!pip install -U scikit-learn

In [None]:
!pip install -U numpy

In [None]:
!pip install -U nltk

### TOC

- [Introduction](#introduction)
- [Learning Weights](#learning-weights)
- [Gradient Descent](#gradient-descent)
- [Using Scikit-Learn](#using-scikit-learn)

## Introduction <a class="anchor" id="introduction"></a>

The example problems are taken from chapter 5 on Logistic Regression in the textbook [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) by Dan Jurafsky and James H. Martin (2023 draft). The code is written by [Damir Cavar](http://damir.cavar.me/) and simplified for use in the Advanced Natural Language Processing course taught at Indiana University in Fall 2023 and 2024.

In the following code we import all the modules used. Make sure that [Scikit-learn](https://scikit-learn.org/stable/), NumPy, and [NLTK](https://www.nltk.org/) are installed. You will also need to implement a sigmoid function in a file called *secret.py* in the local folder. Since this is part of an assignment, it is not shared here yet.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from collections import Counter
import os
import csv
import math
import random
from secret import sigmoid
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import zipfile
import ast

In [4]:
import nltk
print(nltk.data.path)

['/home/damir/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


The Vader lexicon file can be found in the NLTK data in `nltk_data/sentiment/vader_lexicon.zip`. It contains a list of tokens with sentiment ratings. Each line represents one token, and the tab-separated values are:
- token
- the mean of the human sentiment ratings
- the standard deviation of the ratings
- the list of 10 human ratings taken during experiments

The lexicon ships with the NLTK data in your `nltk_data` folder. On Linux systems this folder is by default in your home directory. On Windows it is in your `AppData\Roaming` folder.

The code below expects the lexicon as a text file at `data/vader_lexicon.txt`, that is, extracted from the NLTK zip file into a local `data` folder. If the file is there, as it should be, the code reads it into the dictionary `vader_data`.

In [7]:
vader_filename = "data/vader_lexicon.txt"
vader_data = {}
if os.path.exists(vader_filename):
	with open(vader_filename, 'r', encoding='utf-8') as f:
		for l in f:
			tokens = l.strip().split('\t')
			if len(tokens) != 4:
				continue
			vader_data[tokens[0]] = (float(tokens[1]), float(tokens[2]), ast.literal_eval(tokens[3]))
else:
	print(f"File {vader_filename} does not exist.")
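The cell above reads an extracted text file. As an alternative, the lexicon could be read directly from the NLTK zip archive with the standard `zipfile` module. The following is a minimal sketch, not the approach used in this notebook: the function name `read_vader_from_zip` is hypothetical, and the archive path and member name in the usage comment are assumptions about a typical NLTK installation.

```python
import ast
import zipfile


def read_vader_from_zip(zip_path: str, member: str) -> dict:
    """Read token -> (mean, std, ratings) from a lexicon file inside a zip."""
    lexicon = {}
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            for line in raw.read().decode("utf-8").splitlines():
                fields = line.strip().split("\t")
                if len(fields) != 4:
                    continue  # skip malformed lines, as in the cell above
                token, mean, std, ratings = fields
                lexicon[token] = (float(mean), float(std), ast.literal_eval(ratings))
    return lexicon


# Hypothetical usage; both paths are assumptions about the local setup:
# vader_data = read_vader_from_zip(
#     "/home/<user>/nltk_data/sentiment/vader_lexicon.zip",
#     "vader_lexicon/vader_lexicon.txt")
```

Passing the member name as a parameter keeps the sketch independent of how a particular archive is laid out.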

We can now request the scores for existing tokens from the `vader_data` dictionary:

In [8]:
print(vader_data["admirable"])

(2.6, 0.66332, [2, 3, 3, 3, 4, 3, 2, 2, 2, 2])


We can assume that a positive mean score indicates that the token is typical for positive sentiment, while a negative mean score indicates negative sentiment. We can see this, for example, when pulling the scores for the token `annoying`.

In [9]:
print(vader_data["annoying"])

(-1.7, 0.64031, [-1, -2, -1, -2, -1, -1, -2, -2, -3, -2])


In the textbook the feature vector is generated using the following scores:
- number of positive terms in text
- number of negative terms
- 1, if there is a *no* in the text, 0 if there is none
- number of pronouns, all variants of 1st and 2nd person
- 1 if there is a *!* in the text, 0 if there is none
- the log of the number of tokens

The following function generates a feature vector from some text:

In [10]:
def generate_feature_vector(text: str) -> list:
    tokens = word_tokenize(text)
    scores = [ vader_data.get(t, [0, 0]) for t in tokens ]
    negative_terms = sum(1 for i in scores if i[0] < 0)
    positive_terms = sum(1 for i in scores if i[0] > 0)
    if "no" in tokens:
        no_in_text = 1
    else:
        no_in_text = 0
    pronouns = set( ("I", "you", "me", "your", "mine") )
    count_pronouns = sum(1 for i in tokens if i in pronouns)
    if "!" in tokens:
        excl_in_text = 1
    else:
        excl_in_text = 0
    return np.array([positive_terms, negative_terms, no_in_text, count_pronouns, excl_in_text, math.log(len(tokens))])

Use some sample text and generate a feature vector for it:

In [11]:
sample_text = """It's hokey. There are virtually no surprises, and the writing is second-rate.
So why was it so enjoyable? For one thing, the cast is great.
Another nice touch is the music.
I was overcome with the urge to get off the couch and start dancing.
It sucked me in, and it'll do the same to you."""

The feature vector is:

In [12]:
sample_text_vector = generate_feature_vector(sample_text)
print(sample_text_vector)

[4.         2.         1.         3.         0.         4.21950771]


The textbook approach uses a different vocabulary and different lexicon entries, so its feature vector for the same text differs slightly:

In [13]:
sample_text_vector_textbook = np.array([3, 2, 1, 3, 0, 4.19])

Assume the following weights and bias:

In [14]:
weights = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
b = 0.1

The sigmoid scores for the feature vector generated from the Vader lexicon are computed as follows:

In [15]:
sigmoid_positive = sigmoid( np.dot(weights, sample_text_vector) + b )
sigmoid_negative = 1 - sigmoid_positive

The text is classified as `positive sentiment` with a probability of about 0.97:

In [16]:
print(sigmoid_positive, sigmoid_negative)

0.9662243326599138 0.03377566734008619


If we use the textbook scores for the text vector, our sigmoid values are:

In [17]:
sigmoid_positive = sigmoid( np.dot(weights, sample_text_vector_textbook) + b )
sigmoid_negative = 1 - sigmoid_positive

The text is still judged as `positive sentiment`:

In [18]:
print(sigmoid_positive, sigmoid_negative)

0.6969888901292717 0.3030111098707283
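Both probabilities can be reproduced by hand, which makes the role of each weight easier to see. The sketch below uses only the standard library. The helper `sigmoid_sketch` is a hypothetical stand-in for the `sigmoid` function imported from *secret.py* (it assumes the usual logistic function $1 / (1 + e^{-z})$), and the Vader feature vector is copied from the rounded output above.

```python
import math


def sigmoid_sketch(z: float) -> float:
    """Logistic function 1 / (1 + e^-z); a stand-in for secret.sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


weights = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
b = 0.1
x_vader = [4, 2, 1, 3, 0, 4.21950771]   # printed feature vector (rounded)
x_textbook = [3, 2, 1, 3, 0, 4.19]

for name, x in (("Vader", x_vader), ("textbook", x_textbook)):
    # z is the weighted sum of the features plus the bias
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + b
    print(f"{name}: z = {z:.4f}, P(positive) = {sigmoid_sketch(z):.4f}")
```

The single extra positive term in the Vader vector adds 2.5 to $z$, which is why the two probabilities differ so much.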


## Learning Weights <a class="anchor" id="learning-weights"></a>

The weights in the previous section have been set manually. In the following we will go over a strategy to learn those weights using the Cross-entropy Loss Function and Stochastic Gradient Descent.

For the problem discussed in the textbook, the Cross-entropy Loss Function is defined as:

$L_{CE}(\hat{y},y) = -\, log\, p(y|x) = -\, [y\, log(\hat{y}) + (1-y)\, log(1-\hat{y})] $

If `y=1`, the second summand in the equation becomes `0`, thus we only look at:

$L_{CE}(\hat{y},1) = -\, log\, p(1|x) = -\, 1\, log(\hat{y}) = -\, log(\hat{y})$

For `y=0`, the first summand in the equation becomes `0`, thus we only look at:

$L_{CE}(\hat{y},0) = -\, log\, p(0|x) = -\, (1-0)\, log(1-\hat{y}) = -\, log(1-\hat{y}) $

In [19]:
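def cross_entropy_loss(y_hat, y):
  # y_hat must be the probability that the model assigns to the true class y:
  # pass P(y=1|x) when y=1 and P(y=0|x) = 1 - P(y=1|x) when y=0.
  return -np.log(y_hat)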

$\hat{y}$ is the `sigmoid` of the dot-product of the weight and feature vector after adding the bias value to it. That is, we can compute the `cross-entropy` from the `sigmoid` scores above:

In [20]:
cel_positive = cross_entropy_loss(sigmoid_positive, 1)
print(f"sigmoid {sigmoid_positive} for y={1} Loss positive: {cel_positive}")

sigmoid 0.6969888901292717 for y=1 Loss positive: 0.3609858079049309


In [21]:
cel_negative = cross_entropy_loss(sigmoid_negative, 0)
print(f"sigmoid {sigmoid_negative} for y={0} Loss negative: {cel_negative}")

sigmoid 0.3030111098707283 for y=0 Loss negative: 1.1939858079049306
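In the cells above, `cross_entropy_loss` receives the probability of the true class, so the caller has to pass `1 - sigmoid_positive` for `y=0`. A variant that follows the full formula and always takes $\hat{y} = p(1|x)$ could look like the sketch below. The name `cross_entropy_full` is hypothetical, and the probability is copied from the textbook-vector output above.

```python
import math


def cross_entropy_full(y_hat: float, y: int) -> float:
    """Binary cross-entropy where y_hat is always P(y = 1 | x)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


p_positive = 0.6969888901292717  # textbook-vector sigmoid score from above
print(cross_entropy_full(p_positive, 1))  # loss if the gold label is 1
print(cross_entropy_full(p_positive, 0))  # loss if the gold label is 0
```

Both calls should agree with the two loss values printed above, up to floating-point rounding.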


## Gradient Descent <a class="anchor" id="gradient-descent"></a>

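The loss function is parametrized by the weights and the bias, $\theta = (w,b)$. Gradient descent searches for the parameters that minimize the loss by repeatedly moving them a small step against the gradient of the loss.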
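For a single training example $(x, y)$ with $\hat{y} = \sigma(w \cdot x + b)$, the gradient of the cross-entropy loss given in the textbook chapter, and the resulting update step with learning rate $\eta$, can be written as follows (the notation $\eta$ and $\theta^{t}$ is chosen here for illustration):

$\frac{\partial L_{CE}(\hat{y},y)}{\partial w_j} = (\hat{y} - y)\, x_j \qquad \frac{\partial L_{CE}(\hat{y},y)}{\partial b} = \hat{y} - y$

$\theta^{t+1} = \theta^{t} - \eta\, \nabla_{\theta} L_{CE}(\hat{y},y)$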

In [22]:
data = [
    ("""It's hokey. There are virtually no surprises, and the writing is second-rate.
So why was it so enjoyable? For one thing, the cast is great.
Another nice touch is the music.
I was overcome with the urge to get off the couch and start dancing.
It sucked me in, and it'll do the same to you.""", 1)
]

The stochastic gradient descent function in our simple classification task can be defined as:

In [23]:
def stochastic_gradient_descent(data):
    w = np.array([0, 0, 0, 0, 0, 0])
    b = 0
    learning_rate = 0.1
    for text, y in data:
        x = generate_feature_vector(text)
        print("x:", x)
        y_hat = sigmoid( np.dot(w, x) + b )
        print("y_hat:", y_hat)
        gradient_b = y_hat - y
        print("gradient b:", gradient_b)
        b = b - learning_rate * gradient_b
        print("new bias b:", b)
        gradient_w = (y_hat - y) * x
        print("gradient w:", gradient_w)
        w = w - learning_rate * gradient_w
        print("new weights w:", w)
    return w, b

In [24]:
w, b = stochastic_gradient_descent(data)

x: [4.         2.         1.         3.         0.         4.21950771]
y_hat: 0.5
gradient b: -0.5
new bias b: 0.05
gradient w: [-2.         -1.         -0.5        -1.5        -0.         -2.10975385]
new weights w: [0.2        0.1        0.05       0.15       0.         0.21097539]
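The numbers printed above can be checked with plain Python arithmetic. The sketch below repeats the single update step for the printed (rounded) feature vector; the inline logistic formula is an assumption standing in for the `sigmoid` function from *secret.py*.

```python
import math

x = [4.0, 2.0, 1.0, 3.0, 0.0, 4.21950771]  # feature vector printed above
y = 1
w = [0.0] * 6
b = 0.0
learning_rate = 0.1

z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b   # 0.0 for all-zero parameters
y_hat = 1.0 / (1.0 + math.exp(-z))                 # sigmoid(0) = 0.5
error = y_hat - y                                  # -0.5

# step against the gradient: w <- w - lr * (y_hat - y) * x, b <- b - lr * (y_hat - y)
w = [w_i - learning_rate * error * x_i for w_i, x_i in zip(w, x)]
b = b - learning_rate * error
print(w, b)
```

Since the gold label is 1 and the initial prediction is 0.5, every weight moves in the direction of its feature value, scaled by 0.05.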


### Training on Corpus

In [25]:
def stochastic_gradient_descent_silent(data):
    w = np.array([0, 0, 0, 0, 0, 0])
    b = 0
    learning_rate = 0.1
    for text, y in data:
        x = generate_feature_vector(text)
        y_hat = sigmoid( np.dot(w, x) + b )
        gradient_b = y_hat - y
        b = b - learning_rate * gradient_b
        gradient_w = (y_hat - y) * x
        w = w - learning_rate * gradient_w
    return w, b

In [26]:
experiment_data = []

In [27]:
with open(os.path.join('.', 'data', 'reviews.csv'), newline='') as csvfile:
    datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
    header = next(datareader)
    for row in datareader:
        if len(row) == 2:
            experiment_data.append( [row[0].strip(), int(row[1].strip())] )

In [28]:
count_positive = sum([ 1 for x in experiment_data if x[1] == 1 ])
count_negative = sum([ 1 for x in experiment_data if x[1] == 0 ])
print(f"Positive: {count_positive}\t Negative: {count_negative}")
print("Total reviews:", len(experiment_data))

Positive: 25000	 Negative: 25000
Total reviews: 50000


In [29]:
print(experiment_data[:3])
print(w, b)

[["Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.", 0], ["This is an example of why the majority of action films are the same. Generic and boring,

In [30]:
w, b = stochastic_gradient_descent_silent(experiment_data[:50])

In [31]:
w, b = stochastic_gradient_descent_silent(experiment_data) # [:50])
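The training cells above each make a single pass over the data in file order. A common variation, sketched here as an assumption and not as the method used in this notebook, is to shuffle the examples and repeat the pass for several epochs. The function `sgd_epochs`, its parameters, and the toy data are hypothetical, and the sketch works on precomputed feature lists to stay independent of NLTK.

```python
import math
import random


def sgd_epochs(examples, epochs=5, learning_rate=0.1, seed=1234):
    """SGD for logistic regression over (feature_list, label) pairs."""
    rng = random.Random(seed)
    examples = list(examples)          # copy, so the caller's list keeps its order
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(examples)          # new random order in every epoch
        for x, y in examples:
            z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            y_hat = 1.0 / (1.0 + math.exp(-z))
            error = y_hat - y
            w = [w_i - learning_rate * error * x_i for w_i, x_i in zip(w, x)]
            b -= learning_rate * error
    return w, b


# Hypothetical toy data: feature 0 marks positive, feature 1 marks negative examples
toy = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 10
w_toy, b_toy = sgd_epochs(toy)
print(w_toy, b_toy)
```

On the toy data the first weight ends up positive and the second negative, which is the behavior one would hope for from the sentiment features as well.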

In [32]:
print(w, b)

[ 1.6508892   0.49448536 -0.0074577   0.84723691 -0.00302566  0.96923873] 0.206158278233459


In [33]:
def test(data, w, b):
    res = []
    for text, y in data:
        x = generate_feature_vector(text)
        y_hat = sigmoid( np.dot(w, x) + b )
        if y_hat > .5:
            y_hat = 1
        else:
            y_hat = 0
        res.append( (y, y_hat) )
    return res

In [34]:
#result = test(experiment_data[-20:], w, b)
result = test(experiment_data, w, b) #-200:], w, b)

In [35]:
counts = Counter(result)

In [36]:
print(counts)

Counter({(0, 1): 25000, (1, 1): 25000})
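Each key in the counter is a pair of gold label and predicted label. The sketch below turns the counts printed above into an accuracy value; the numbers are copied from that output.

```python
from collections import Counter

# (gold label, predicted label) counts, copied from the output above
counts = Counter({(0, 1): 25000, (1, 1): 25000})

correct = sum(n for (gold, predicted), n in counts.items() if gold == predicted)
total = sum(counts.values())
predicted_positive = sum(n for (_, predicted), n in counts.items() if predicted == 1)
accuracy = correct / total
print(f"accuracy: {accuracy:.2f}, predicted positive: {predicted_positive} of {total}")
```

The classifier labels every review as positive, so the accuracy on this balanced corpus is 0.5, and this is measured on the same data the weights were trained on. One plausible reading is that all feature values are non-negative while the learned bias and the larger weights are positive, so the weighted sum rarely drops below zero.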


## Using Scikit-Learn <a class="anchor" id="using-scikit-learn"></a>

We will use the data from the reviews data set above. For that we will create two lists containing the text and the label respectively.

In [37]:
data = []
data_labels = []
for e in experiment_data:
    data.append(e[0])
    if e[1] == 1:
        data_labels.append('pos')
    else:
        data_labels.append('neg')

The data is transformed into bag-of-words feature vectors that hold the frequency of each token in a text. With the settings below, `CountVectorizer` neither removes function words (no `stop_words` list is set) nor normalizes the tokens to lowercase.

In [38]:
vectorizer = CountVectorizer(analyzer='word', lowercase=False)
features = vectorizer.fit_transform(data)
features_nd = features.toarray() # for easy usage

We split the corpus into a training and test corpus, using 20% for testing.

In [None]:
X_train, X_test, y_train, y_test  = train_test_split(
        features_nd, 
        data_labels,
        train_size=0.80, 
        random_state=1234)


Create a Logistic Regression model:

In [None]:
log_model = LogisticRegression()

Fit the model to the training data:

In [37]:
log_model = log_model.fit(X=X_train, y=y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
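The warning above reports that the solver stopped at its iteration limit before converging. One option it suggests is a larger `max_iter`. The sketch below shows that setting on a tiny made-up corpus and also passes the sparse matrix from `CountVectorizer` directly to the model, which avoids the dense copy. The value 1000 and the example texts are assumptions for illustration, not tuned settings for the reviews corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical corpus, only for illustration
texts = ["a great and enjoyable film", "great cast and nice music",
         "a boring and annoying film", "boring plot and annoying cast"]
labels = ["pos", "pos", "neg", "neg"]

toy_vectorizer = CountVectorizer(analyzer="word", lowercase=False)
toy_features = toy_vectorizer.fit_transform(texts)   # stays a SciPy sparse matrix

toy_model = LogisticRegression(max_iter=1000)        # raised iteration limit
toy_model.fit(toy_features, labels)

# new texts must go through the same fitted vectorizer
new_text = toy_vectorizer.transform(["an enjoyable film with great music"])
prediction = toy_model.predict(new_text)[0]
print(prediction)
```

Whether a higher limit or feature scaling is the better remedy for the full corpus would have to be tested on the data itself.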


Run the prediction on the 20% test data:

In [1]:
y_pred = log_model.predict(X_test)

Print the results for a few random examples:

In [53]:
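j = random.randint(0, len(X_test) - 7)
feature_rows = features_nd.tolist()  # convert once instead of in every iteration
for i in range(j, j + 7):
    print(y_pred[i])  # prediction for test example i, not always the first one
    # find the original text by looking up the identical feature vector (slow)
    ind = feature_rows.index(X_test[i].tolist())
    print(data[ind].strip())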

pos
Creakiness and atmosphere this film has, but so unfortunately does the print I just viewed. Raymond Massey provides a laid back Sherlock Holmes, almost comically so in early scenes in his bathrobe, which he trades in for a laborer's garb to investigate the creepy mansion of Dr. Rylott (Lyn Harding). What wasn't clear to me was why Rylott would have wanted his stepdaughters dead. If as in the case of Helen (Angela Baddeley), he didn't want her to run off to get married, he would have accomplished the same thing by having her dispatched. Other curiosities abound as well. After setting an early wedding date with Helen, the fiancée is no longer heard from for the rest of the picture. The presence of a band of gypsies at the time of Violet Stoner's death provides merely a diversion, and what could have been an interesting murder tool, a poisonous snake, is diluted by the fact that it was not a cobra, the musical renderings of the Indian man servant notwithstanding. Athole Stewart compe
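Looking up each text by searching for its feature vector is slow and returns the first match if two texts share the same vector. Since `train_test_split` accepts any number of parallel sequences, the texts could be split together with the features and labels. The sketch below shows this on made-up data; the `toy_` names and the `t_train`/`t_test` variables are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical parallel lists: texts, labels, and feature vectors
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = ["pos" if i % 2 == 0 else "neg" for i in range(10)]
toy_vectors = [[i, i * 2] for i in range(10)]

# all three sequences are split with the same random permutation
X_tr, X_te, y_tr, y_te, t_train, t_test = train_test_split(
    toy_vectors, toy_labels, toy_texts, train_size=0.80, random_state=1234)

for x, y, text in zip(X_te, y_te, t_test):
    print(y, text, x)
```

With this layout, `t_test[i]` is the text that belongs to `X_te[i]`, so no search over the full feature matrix is needed.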

Print the overall accuracy:

In [55]:
print(accuracy_score(y_test, y_pred))

0.8864


**(C) 2023-2024 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**