# Scikit-learn Logistic Regression

(C) 2023 by [Damir Cavar](http://damir.cavar.me/)

The example problems are taken from the textbook Dan Jurafsky and James H. Martin (2023 draft) [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) chapter 5 on Logistic Regression. The code is written by [Damir Cavar](http://damir.cavar.me/) and simplified for use in the Advanced Natural Language Processing course taught at Indiana University in Fall 2023.

In [113]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from collections import Counter
import os
import csv
import math
from secret import sigmoid
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import zipfile
import ast

The Vader Lexicon file can be found in the NLTK data in `nltk-data/sentiment/vader_lexicon.zip`. It contains a list of tokens with sentiment ratings. Each line represents one token and the tab-seperated values are:
- token
- the mean of the human sentiment ratings
- the Standard Deviation of the token
- the list of 10 human ratings taken during experiments

In the following the assumption is that the Vader lexicon is located in your `nltk-data`-folder. On Linux systems this is per default in your home directory. On Windows this is in your `AppData\roaming` folder.

We can read the Vader lexicon into a dictionary structure as follows:

In [65]:
nltk_data_folder = "/home/damir/nltk_data"
vader_filename = "vader_lexicon/vader_lexicon.txt"
vader_data = {}
with zipfile.ZipFile(os.path.join(nltk_data_folder, "sentiment", 'vader_lexicon.zip')) as z:
    if vader_filename in z.namelist():
        with z.open(vader_filename) as f:
            for l in f:
                tokens = l.decode(encoding='utf-8').strip().split('\t')
                if len(tokens) != 4: continue
                vader_data[tokens[0]] = (float(tokens[1]), float(tokens[2]), ast.literal_eval(tokens[3]))

We can now request the scores for existing tokens from the `vadar_data` dictionary:

In [66]:
print(vader_data["admirable"])

(2.6, 0.66332, [2, 3, 3, 3, 4, 3, 2, 2, 2, 2])


We can assume that positive scores indicate that the token is typical for positive sentiment, while negative scores represent negative sentiment. We can see that for example when pulling the scores for token `annoying`.

In [67]:
print(vader_data["annoying"])

(-1.7, 0.64031, [-1, -2, -1, -2, -1, -1, -2, -2, -3, -2])


In the textbook the feature vector is generated using the following scores:
- number of positive terms in text
- number of negative terms
- 1, if there is a *no* in the text, 0 if there is none
- number of pronouns, all variants of 1st and 2nd person
- 1 if there os a *!* in the text, 0 if there is none
- the log of the number of tokens

The following function generates a feature vector from some text:

In [68]:
def generate_feature_vector(text: str) -> list:
    tokens = word_tokenize(text)
    scores = [ vader_data.get(t, [0, 0]) for t in tokens ]
    negative_terms = sum(1 for i in scores if i[0] < 0)
    positive_terms = sum(1 for i in scores if i[0] > 0)
    if "no" in tokens:
        no_in_text = 1
    else:
        no_in_text = 0
    pronouns = set( ("I", "you", "me", "your", "mine") )
    count_pronouns = sum(1 for i in tokens if i in pronouns)
    if "!" in tokens:
        excl_in_text = 1
    else:
        excl_in_text = 0
    return np.array([positive_terms, negative_terms, no_in_text, count_pronouns, excl_in_text, math.log(len(tokens))])

Use some sample text and generate a feature vector for it:

In [69]:
sample_text = """It's hokey. There are virtually no surprises, and the writing is second-rate.
So why was it so enjoyable? For one thing, the cast is great.
Another nice touch is the music.
I was overcome with the urge to get off the couch and start dancing.
It sucked me in, and it'll do the same to you."""

The feature vector is:

In [70]:
sample_text_vector = generate_feature_vector(sample_text)
print(sample_text_vector)

[4.         2.         1.         3.         0.         4.21950771]


The textbook approach uses different vocabulary and entries

In [71]:
sample_text_vector_textbook = np.array([3, 2, 1, 3, 0, 4.19])

Assume weights

In [72]:
weights = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
b = 0.1

Computing the sigmoid scores for the feature vectors generated from the Vader lexicon is given in the following:

In [73]:
sigmoid_positive = sigmoid( np.dot(weights, sample_text_vector) + b )
sigmoid_negative = 1 - sigmoid_positive

The text is classified as `positive sentiment` with 96% likelihood:

In [75]:
print(sigmoid_positive, sigmoid_negative)

0.9662243326599138 0.03377566734008619


If we use the textbook scores for the text vector, our sigmoid values are:

In [76]:
sigmoid_positive = sigmoid( np.dot(weights, sample_text_vector_textbook) + b )
sigmoid_negative = 1 - sigmoid_positive

The text is still judged as `positive sentiment`:

In [77]:
print(sigmoid_positive, sigmoid_negative)

0.6969888901292717 0.3030111098707283


## Learning Weights

The weights in the previos section have been manually set. In the following we will go over a strategy to learn those weights using the Cross-entropy Loss Function and Stochastic Gradient Descent.

For the problem discussed in the textbook, the Cross-entropy Loss Function is defined as:

$L_{CE}(\hat{y},y) = -\, log\, p(y|x) = -\, [y\, log(\hat{y}) + (1-y)\, log(1-\hat{y})] $

If `y=1`, the second summand in the equation become `0`, thus we only look at:

$L_{CE}(\hat{y},1) = -\, log\, p(1|x) = -\, 1\, log(\hat{y}) = -\, log(\hat{y})$

For `y=0`, the first summand in the equation becomes `0`, this we only look at:

$L_{CE}(\hat{y},0) = -\, log\, p(0|x) = -\, (1-0)\, log(1-\hat{y}) = -\, log(1-\hat{y}) $

In [78]:
def cross_entropy_loss(y):
  return -np.log(y)

$\hat{y}$ is the `sigmoid` of the dot-product of the weight and feature vector after adding the bias value to it. That is, we can compute the `cross-entropy` from the `sigmoid` scores above" 

In [79]:
cel_positive = cross_entropy_loss(sigmoid_positive)
print(f"sigmoid {sigmoid_positive} for y={1} Loss positive: {cel_positive}")

sigmoid 0.6969888901292717 for y=1 Loss positive: 0.3609858079049309


In [80]:
cel_negative = cross_entropy_loss(sigmoid_negative)
print(f"sigmoid {sigmoid_negative} for y={0} Loss positive: {cel_negative}")

sigmoid 0.3030111098707283 for y=0 Loss positive: 1.1939858079049306


## Gradient Descent

Loss function is paretrized by the weights $\theta = (w,b)$

In [81]:
data = [
    ("""It's hokey. There are virtually no surprises, and the writing is second-rate.
So why was it so enjoyable? For one thing, the cast is great.
Another nice touch is the music.
I was overcome with the urge to get off the couch and start dancing.
It sucked me in, and it'll do the same to you.""", 1)
]

The stochastic gradient descent function in our simple classification task can be defined as:

In [101]:
def stochastic_gradient_descent(data):
    w = np.array([0, 0, 0, 0, 0, 0])
    b = 0
    learning_rate = 0.1
    for text, y in data:
        x = generate_feature_vector(text)
        print("x:", x)
        y_hat = sigmoid( np.dot(w, x) + b )
        print("y_hat:", y_hat)
        gradient_b = y_hat - y
        print("gradient b:", gradient_b)
        b = b - learning_rate * gradient_b
        print("new gradient b:", b)
        gradient_w = (y_hat - y) * x
        print("gradient w:", gradient_w)
        w = gradient_w - learning_rate * gradient_w
        print("new gradient w:", w)
    return w, b

In [102]:
w, b = stochastic_gradient_descent(data)

x: [4.         2.         1.         3.         0.         4.21950771]
y_hat: 0.5
gradient b: -0.5
new gradient b: 0.05
gradient w: [-2.         -1.         -0.5        -1.5        -0.         -2.10975385]
new gradient w: [-1.8        -0.9        -0.45       -1.35        0.         -1.89877847]


### Training on Corpus

In [100]:
def stochastic_gradient_descent_silent(data):
    w = np.array([0, 0, 0, 0, 0, 0])
    b = 0
    learning_rate = 0.1
    for text, y in data:
        x = generate_feature_vector(text)
        y_hat = sigmoid( np.dot(w, x) + b )
        gradient_b = y_hat - y
        b = b - learning_rate * gradient_b
        gradient_w = (y_hat - y) * x
        w = gradient_w - learning_rate * gradient_w
    return w, b

In [85]:
experiment_data = []

In [103]:
with open(os.path.join('.', 'data', 'reviews.csv'), newline='') as csvfile:
    datareader = csv.reader(csvfile, delimiter=',', quotechar='"')
    header = next(datareader)
    for row in datareader:
        if len(row) == 2:
            experiment_data.append( [row[0].strip(), int(row[1].strip())] )

In [127]:
count_positive = sum([ 1 for x in experiment_data if x[1] == 1 ])
count_negative = sum([ 1 for x in experiment_data if x[1] == 0 ])
print(f"Positive: {count_positive}\t Negative: {count_negative}")
print("Total reviews:", len(experiment_data))

Positive: 50000	 Negative: 50000
Total reviews: 100000


In [104]:
print(experiment_data[:3])
print(w, b)

[["Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.", 0], ["This is an example of why the majority of action films are the same. Generic and boring,

In [108]:
w, b = stochastic_gradient_descent_silent(experiment_data[:50])

In [109]:
print(w, b)

[5.4        5.4        0.         2.7        0.         4.59535093] -4.94999999999999


In [118]:
def test(data, w, b):
    res = []
    for text, y in data:
        x = generate_feature_vector(text)
        y_hat = sigmoid( np.dot(w, x) + b )
        if y_hat > .5:
            y_hat = 1
        else: y_hat = 0
        res.append( (y, y_hat) )
    return res

In [123]:
result = test(experiment_data[-20:], w, b)

In [124]:
counts = Counter(result)

In [125]:
print(counts)

Counter({(1, 1): 20})
