### Logistic Regression

Now we are going to apply what we have learned to logistic regression with 2 predictors/features.

First we will generate some random data for an imagined sentiment classification task. We can think of our two features as being the log of the counts of positive words (e.g. good, excellent) and the log of the counts of negative words (e.g. bad, rubbish). The label we are trying to predict is either 1 (positive sentiment text) or 0 (negative sentiment text).

In [None]:
## Create simulated data
np.random.seed(10)
w1_center = (2, 3)
w2_center = (3, 2)
batch_size=50

x = np.zeros((batch_size, 2))
y = np.zeros(batch_size)
for i in range(batch_size):
    if np.random.random() > 0.5:
        x[i] = np.random.normal(loc=w1_center)
    else:
        x[i] = np.random.normal(loc=w2_center)
        y[i] = 1

x=x.T

We can visualise the data as follows. The stars are the texts with label 0 and the circles are the texts with label 1.

In [None]:
plt.scatter(x[0][y==0], x[1][y==0], marker='*', s=100)
plt.scatter(x[0][y==1], x[1][y==1], marker='o', s=100)
plt.xlabel("log count of negative words")
plt.ylabel("log count of positive words")
plt.xlim((0,5))
plt.ylim((0,5))


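To see why we might want to take the log, we can exponentiate the log counts (reversing the log function) to give raw counts. The raw counts are heavily skewed, with most points bunched near the origin and a few very large values, which makes them worse for visualisation and modelling purposes.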

In [None]:
x_exp=np.exp(x)
plt.scatter(x_exp[0][y==0], x_exp[1][y==0], marker='*', s=100)
plt.scatter(x_exp[0][y==1], x_exp[1][y==1], marker='o', s=100)
plt.xlabel("count of negative words")
plt.ylabel("count of positive words")
plt.xlim((0,150))
plt.ylim((0,150))

Our goal in logistic regression is to find a line that allows us to estimate the probability that any text has positive sentiment. If that probability is greater than 0.5 then we will say that it is a positive text, and if it is lower then we will say it is a negative text.

In logistic regression we first estimate a value z as a linear function of our predictors, just as in linear regression. With two features this is:

$$z_i = b + w_1 x_{i,1} + w_2 x_{i,2}$$

where b is the bias and w_1 and w_2 are the weights. We then use the sigmoid function to convert this z value to a probability:

$$p(y_i = 1) = \frac{1}{1 + \exp(-z_i)}$$
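As a minimal sketch of these two steps for a single text, the snippet below computes z and then the probability. The names `example_weights`, `example_bias` and `example_x`, and their values, are hypothetical and chosen only for illustration; they are not the values used elsewhere in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and feature values (illustration only)
example_weights = np.array([0.5, -0.25])
example_bias = 0.1
example_x = np.array([2.0, 3.0])  # the two feature values for one text

z = example_bias + example_x.dot(example_weights)  # linear score
p = sigmoid(z)                                     # p(y=1)
print(z, p)
```

A score of z = 0 maps to a probability of exactly 0.5, positive scores map to probabilities above 0.5, and negative scores map to probabilities below 0.5.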


We can start by setting some random weights and an arbitrary bias.

In [None]:
np.random.seed(10)
num_features=2
weights = np.random.rand(num_features)
bias=0

We can add the line defined by these weights and this bias to our plot of the data. It should cut across the items so that items above the line are mostly positive sentiment texts and items below it are mostly negative sentiment texts.
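A brief sketch of where this line comes from, writing b for the bias and w_1, w_2 for the two weights (notation assumed here for compactness). The predicted probability is exactly 0.5 when z = 0, so the boundary between the two classes is the set of points where z = 0. Rearranging for the second feature gives a straight line:

```latex
\begin{aligned}
b + w_1 x_1 + w_2 x_2 &= 0 \\
x_2 &= -\frac{w_1}{w_2}\, x_1 - \frac{b}{w_2}
\end{aligned}
```

The slope is therefore -w_1/w_2 and the intercept is -b/w_2, which correspond to `m` and `c` in the plotting code, with `weights[0]` and `weights[1]` in the roles of w_1 and w_2. This assumes w_2 is not zero.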

In [None]:
plt.scatter(x[0][y==0], x[1][y==0], marker='*', s=100)
plt.scatter(x[0][y==1], x[1][y==1], marker='o', s=100)
plt.xlim((-5,5))
plt.ylim((-5,5))
c = -bias/weights[1]
m = -weights[0]/weights[1]
xmin, xmax = 0, 5
ymin, ymax = 0, 5
xd = np.array([xmin, xmax])
yd = m*xd + c
plt.plot(xd, yd, 'k', lw=1, ls='--')

Our random line does not do this. So we will use gradient descent to find the line of best fit.

For logistic regression we use a cross-entropy loss function. I have included this in the code (see the lecture for details).

To calculate the gradient of the loss function with respect to the bias term, we first calculate the difference between each predicted probability q[i] and the true value y[i]. We then average these differences by summing them and dividing the result by N, the number of data points:

$$db = \frac{1}{N} \sum_{i=1}^{N} (q_i - y_i)$$

To calculate the gradient of the loss function with respect to each weight, we again first calculate the difference between each predicted probability and the true y value. We then take the dot product of this vector of differences with the vector of x values for the relevant feature and divide the result by N:

$$dw = \frac{1}{N} \sum_{i=1}^{N} x_i (q_i - y_i)$$

Here x is the vector of values for the feature that the weight belongs to. Each weight needs its own gradient, calculated using the x values of its own feature.
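One way to convince yourself that these formulas are right is to compare them with finite-difference estimates on a tiny example. The sketch below does this under some assumptions: the dataset and parameter values (`toy_x`, `toy_y`, `toy_weights`, `toy_bias`) are hypothetical, and the loss is the mean cross-entropy with natural logs. The loss in the notebook code is a sum using log base 2, which only rescales the gradient by a constant factor.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_cross_entropy(weights, bias, x, y):
    """Mean natural-log cross-entropy; x has shape (num_features, N)."""
    q = sigmoid(bias + weights.dot(x))
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

# Tiny hypothetical dataset: 2 features (rows) by 4 data points (columns)
toy_x = np.array([[1.0, 2.0, 3.0, 4.0],
                  [4.0, 3.0, 2.0, 1.0]])
toy_y = np.array([0.0, 0.0, 1.0, 1.0])
toy_weights = np.array([0.3, -0.2])
toy_bias = 0.1
N = len(toy_y)

# Analytic gradients from the formulas above
q = sigmoid(toy_bias + toy_weights.dot(toy_x))
db = np.sum(q - toy_y) / N
dw = toy_x.dot(q - toy_y) / N  # one entry per weight

# Central finite-difference estimates for comparison
h = 1e-6
db_num = (mean_cross_entropy(toy_weights, toy_bias + h, toy_x, toy_y)
          - mean_cross_entropy(toy_weights, toy_bias - h, toy_x, toy_y)) / (2 * h)
dw_num = np.zeros(2)
for j in range(2):
    step = np.zeros(2)
    step[j] = h
    dw_num[j] = (mean_cross_entropy(toy_weights + step, toy_bias, toy_x, toy_y)
                 - mean_cross_entropy(toy_weights - step, toy_bias, toy_x, toy_y)) / (2 * h)

print(db, db_num)
print(dw, dw_num)
```

The analytic and numerical values should agree to several decimal places. If they do not, the gradient code is the first place to look for a bug.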


Problem 6: Complete the code below so that it finds the line of best fit.

Note: For the sigmoid function you will need to exponentiate -z. You can do this using the function np.exp(-z).

In [None]:
n_iters = 2500
num_features = 2
num_samples = len(y)
lr=0.1
logistic_loss=[]

for i in range(n_iters):
    z=????
    q =????
    loss = sum(-(y*np.log2(q)+(1-y)*np.log2(1-q)))
    logistic_loss.append(loss)
    dw1 = ????
    dw2 = ????
    db = ????
    weights[0] = ?????
    weights[1] = ??????
    bias = ??????
plt.plot(range(1,n_iters),logistic_loss[1:])
plt.xlabel("number of epochs")
plt.ylabel("loss")


Once this is working we can add the resulting line to our data and it should separate the two classes of items.

In [None]:
plt.scatter(x[0][y==0], x[1][y==0], marker='*', s=100)
plt.scatter(x[0][y==1], x[1][y==1], marker='o', s=100)
plt.xlim((-5,5))
plt.ylim((-5,5))
c = -bias/weights[1]
m = -weights[0]/weights[1]
xmin, xmax = 0, 5
ymin, ymax = -5, 5
xd = np.array([xmin, xmax])
yd = m*xd + c
plt.plot(xd, yd, 'k', lw=1, ls='--')

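Problem 7: Calculate p(y=1) for a) a text that contains 2 positive words and 3 negative words and b) a text that contains 10 positive words and 1 negative word.

To calculate this you will need to know the bias and the weights, which are printed below. You will also need to use the sigmoid function. Remember that the features are the logs of the counts and that, following the plot axes above, the first feature is the log count of negative words and the second is the log count of positive words.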

In [None]:
print("BIAS: " + str(bias))
print("WEIGHT 1: " + str(weights[0]))
print("WEIGHT 2: " + str(weights[1]))
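The sketch below shows one possible way to organise this kind of calculation. The function name `prob_positive` and the values of `demo_weights` and `demo_bias` are hypothetical: they are not the fitted values, and the word counts are different from those in the problem. It assumes natural logs and the feature order (negative, positive) suggested by the plot axes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_positive(n_negative, n_positive, weights, bias):
    """p(y=1) for a text with the given raw word counts.

    The features are natural-log counts in the order (negative, positive).
    """
    features = np.log(np.array([n_negative, n_positive], dtype=float))
    return sigmoid(bias + features.dot(weights))

# Hypothetical parameters for illustration only, not the fitted values
demo_weights = np.array([1.5, -1.5])
demo_bias = 0.0

print(prob_positive(4, 4, demo_weights, demo_bias))
print(prob_positive(8, 2, demo_weights, demo_bias))
```

To use this with the fitted model, pass the `weights` and `bias` printed above in place of the demo values.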

In [None]:
!wget https://github.com/cbannard/lela60331_24-25_data/archive/refs/heads/main.zip
!unzip main.zip
!gunzip lela60331_24-25_data-main/reviews_with_splits_lite.csv.gz
!mv lela60331_24-25_data-main/reviews_with_splits_lite.csv .
!rm -r lela60331_24-25_data-main
!rm main.zip

In [None]:
import csv
import re
from collections import Counter

import numpy as np

labels = []
sents = []
split = []
this_text = ""
# Each row is read as (label, review text, split)
with open('reviews_with_splits_lite.csv') as csvfile:
    sent_dat = csv.reader(csvfile, delimiter=",")
    for row in sent_dat:
        this_text += row[1] + " "
        labels.append(row[0])
        sents.append(row[1])
        split.append(row[2])

# Tokenise on spaces and count how often each word type occurs
tokens = re.findall("[^ ]+", this_text)
counts = Counter(tokens)
# Sort the types by ascending frequency and keep the 5000 most frequent
so = sorted(counts.items(), key=lambda item: item[1])
so = list(zip(*so))[0]
type_list = so[len(so)-5000:len(so)]
type_count = len(type_list)




In [None]:
# Binary document-term matrix: M[i, j] = 1 if word type j occurs in review i
M = np.zeros((len(sents), 5000))
# Look up column indices in a dictionary rather than scanning type_list for every review
type_index = {t: j for j, t in enumerate(type_list)}
for i, sent in enumerate(sents):
    for t in set(re.findall("[^ ]+", sent)):
        j = type_index.get(t)
        if j is not None:
            M[i, j] = 1



In [None]:
#np.savetxt("reviews_one_hot.txt.gz", M)
#with open("review_vocab.txt", "w") as txt_file:
#    for line in type_list:
#        txt_file.write(" ".join(line) + "\n")
#with open("review_labels.txt", "w") as txt_file:
#    for line in labels:
#        txt_file.write(" ".join(line) + "\n")
#with open("review_split.txt", "w") as txt_file:
#    for line in split:
#        txt_file.write(" ".join(line) + "\n")

In [None]:
# Skip row 0 (assumed to be the CSV header row) and keep the next 56,000 rows
M2 = M[1:56001]

In [None]:
num_features = 5000
# 1 for a positive review, 0 otherwise; row 0 is skipped (assumed to be the CSV header)
y = np.array([int(l == "positive") for l in labels[1:56001]])
weights = np.random.rand(num_features)
bias = np.random.rand(1)
n_iters = 5000
lr = 0.4
num_samples = len(y)
eps = 0.00001  # avoids log2(0) when q reaches exactly 0 or 1
for i in range(n_iters):
    # Forward pass: linear score, then sigmoid to get p(y=1)
    z = M2.dot(weights) + bias
    q = 1 / (1 + np.exp(-z))
    # Training accuracy with a 0.5 decision threshold
    y_pred = (q > 0.5).astype(int)
    print(np.mean(y_pred == y))
    # Cross-entropy loss (base 2), summed over all reviews
    loss = -np.sum(y*np.log2(q+eps) + (1-y)*np.log2(1-q+eps))
    # Gradient step: the differences q-y are averaged over the reviews
    dw = M2.transpose().dot(q-y)/num_samples
    db = np.sum(q-y)/num_samples
    weights = weights - lr*dw
    bias = bias - lr*db
    print(loss)

Streaming output truncated to the last 5000 lines.
0.8994285714285715
21486.564244413115
0.8994642857142857
21483.56005603433
0.8994464285714285
21480.557795757653
0.8994464285714285
21477.55746159021
0.8994464285714285
21474.559051541528
0.8994821428571429
21471.5625636251
0.8995
21468.567995855097
0.8995357142857143
21465.57534625003
0.8995178571428571
21462.584612829785
0.8995178571428571
21459.59579361744
0.8995178571428571
21456.608886639428
0.8995178571428571
21453.62388992281
0.8995
21450.640801499107
0.8994821428571429
21447.65961940174
0.8994642857142857
21444.680341666957
0.8994821428571429
21441.70296633285
0.8994821428571429
21438.727491441558
0.8995
21435.753915036355
0.8995357142857143
21432.782235163726
0.8995357142857143
21429.812449872556
0.8995714285714286
21426.844557214983
0.899625
21423.878555244853
0.8996071428571428
21420.914442018846
0.899625
21417.95221559652
0.8996071428571428
21414.991874038955
0.899625
21412.033415411883
0.8996428571428572
2140