# Week 11 Seminar notebook: Embeddings

Today we are going to look at how to use embeddings to represent words in a simple classifier.

### Using Word2vec embeddings in a classifier

We are going to build a sentiment classifier using the Yelp sentiment classification data we first saw in week 7. Instead of one-hot encoding we will use embeddings to represent our words.

First we are going to import the pre-built embeddings and combine them to represent our texts. Note: the first block is included for reference only and will not be used in the seminar; we will instead import precompiled representations of our reviews. DO NOT RUN THE FIRST BLOCK IN THE SEMINAR.

In [None]:
import re
import numpy as np
import matplotlib.pyplot as plt

# Download w2v embeddings
import gensim.downloader as api
w = api.load('word2vec-google-news-300')
vocab=[x for x in w.key_to_index.keys()]

!wget https://raw.githubusercontent.com/cbannard/lela60331_24-25/refs/heads/main/data/yelp_reviews.txt

# Create lists
reviews=[]
labels=[]

with open("yelp_reviews.txt") as f:
    # iterate over the lines in the file, skipping the first line
    for line in f.readlines()[1:]:
        # split the current line into a list of two elements - the review and the label
        fields = line.rstrip().split('\t')
        # put the current review in the reviews list
        reviews.append(fields[0])
        # put the current sentiment rating in the labels list
        labels.append(fields[1])

tokenized_sents = [re.findall("[^ ]+",txt) for txt in reviews]
# Collapse all tokens into a single list
tokens=[]
for s in tokenized_sents:
    tokens.extend(s)
# Get the set of unique tokens (the types) in the tokens list
types=set(tokens)

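# Keep only the types that have a Word2vec embedding.
# Looking words up in the key_to_index dictionary is much faster than searching the vocab list.
types_inc=[x for x in types if x in w.key_to_index]
indices=[w.key_to_index[x] for x in types_inc]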
M=w[indices]
M.shape

embeddings=[]
# Map each included type to its row in M so that lookups are fast
row_of={t: i for i, t in enumerate(types_inc)}
# iterate over the reviews
for rev in reviews:
    # Tokenise the current review
    tokens = re.findall("[^ ]+",rev)
    # Start from a zero vector and add the embedding of each known token
    this_vec = np.zeros((1, 300))
    for t in tokens:
        if t in row_of:
            this_vec = this_vec + M[row_of[t]]
    embeddings.append(this_vec)
# Stack the review vectors into a single (number of reviews, 300) array
embeddings=np.array(embeddings).squeeze()
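
The summing step in the block above can be tried out safely on a toy vocabulary. The following is only a minimal sketch: the 3-dimensional vectors in `toy_vectors` and the helper `embed_review` are made up for illustration and are not part of the seminar code (the real Word2vec vectors have 300 dimensions).

```python
import re
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for this example
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "food": np.array([0.0, 1.0, 1.0]),
}

def embed_review(text, vectors, dim):
    """Sum the vectors of all tokens that have an embedding; other tokens are skipped."""
    total = np.zeros(dim)
    for token in re.findall("[^ ]+", text):
        if token in vectors:
            total = total + vectors[token]
    return total

# "here" has no vector in the toy vocabulary, so it contributes nothing
toy_embedding = embed_review("good food here", toy_vectors, 3)
print(toy_embedding)
```

Note that a review containing no known tokens is represented by the all-zero vector, exactly as in the full block.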

To save time in the seminar we are going to import prebuilt embedding representations of our texts.

In [None]:
!wget https://raw.githubusercontent.com/cbannard/lela60331_24-25/refs/heads/main/review_embeddings.csv.gz
!gunzip review_embeddings.csv.gz
!wget https://raw.githubusercontent.com/cbannard/lela60331_24-25/refs/heads/main/data/yelp_reviews.txt

We can build training and test data from the imported data as follows:

In [None]:
import math
import matplotlib.pyplot as plt
import numpy as np

labels=[]

with open("yelp_reviews.txt") as f:
    # iterate over the lines in the file, skipping the first line
    for line in f.readlines()[1:]:
        # split the current line into a list of two elements - the review and the label
        fields = line.rstrip().split('\t')
        # put the current sentiment rating in the labels list
        labels.append(fields[1])

embeddings=[]
with open("review_embeddings.csv") as f:
    # iterate over the lines in the file (one review per line)
    for line in f.readlines():
        # split the current line into its comma-separated values - the elements of the review's embedding
        fields = line.rstrip().split(',')
        # convert the values from strings to floats
        emb = [float(x) for x in fields]
        embeddings.append(emb)


embeddings=np.array(embeddings).squeeze()
train_ints=np.random.choice(len(embeddings),int(len(embeddings)*0.8),replace=False)
test_ints=list(set(range(0,len(embeddings))) - set(train_ints))
# These are the training embeddings
M_train_emb = embeddings[train_ints,]
# These are the test embeddings
M_test_emb = embeddings[test_ints,]
# These are the training labels
labels_train = [labels[i] for i in train_ints]
# These are the test labels
labels_test = [labels[i] for i in test_ints]
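
Because `np.random.choice` is called without a seed, the split above will differ every time the cell is run, and so will the precision and recall you obtain. If you want repeatable numbers, one option is a seeded random generator. The following is a sketch on stand-in data: the seed value `0`, the size `n_items` and the `demo_` names are arbitrary assumptions chosen so that the real `train_ints` and `test_ints` are not overwritten.

```python
import numpy as np

# A seeded generator gives the same "random" split on every run (the seed 0 is arbitrary)
rng = np.random.default_rng(0)
n_items = 10  # stand-in for len(embeddings)

# Sample 80% of the indices without replacement for training
demo_train_ints = rng.choice(n_items, int(n_items * 0.8), replace=False)
# Everything that was not sampled goes into the test set
demo_test_ints = sorted(set(range(n_items)) - set(demo_train_ints.tolist()))

print(len(demo_train_ints), len(demo_test_ints))
```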



Problem 1. Fit a logistic regression model using the embedding-based representation of the reviews data. Train on the training data, then calculate precision and recall for the classifier on the test data. You should have example code for all the steps from previous weeks that will just need tweaking. Note that the embeddings are 300 elements long, so a weight vector of length 300 is needed.

In [None]:
num_features=300
weights = np.random.rand(num_features)
bias=np.random.rand(1)

In [None]:
import math
import matplotlib.pyplot as plt

y=[int(l == "positive") for l in labels_train]
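
As a reminder of the general shape of the training loop, here is a minimal sketch of logistic regression fitted by batch gradient descent. It runs on synthetic stand-in data rather than the reviews, and it is not necessarily identical to the code from previous weeks: the function names, the learning rate `lr`, the number of iterations `n_iters` and the seeds are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, n_iters=500, lr=0.5, seed=0):
    """Fit weights and bias by batch gradient descent on the mean cross-entropy loss."""
    rng = np.random.default_rng(seed)
    weights = rng.random(X.shape[1])  # one weight per feature
    bias = rng.random(1)
    y = np.asarray(y, dtype=float)
    for _ in range(n_iters):
        probs = sigmoid(X @ weights + bias)  # predicted probability of the positive class
        error = probs - y                    # gradient of the loss with respect to the logits
        weights = weights - lr * (X.T @ error) / len(y)
        bias = bias - lr * error.mean()
    return weights, bias

# Synthetic, linearly separable stand-in data: 200 items with 5 features each
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

w_demo, b_demo = train_logistic_regression(X_demo, y_demo)
# Classify as positive when the predicted probability exceeds 0.5
preds_demo = (sigmoid(X_demo @ w_demo + b_demo) > 0.5).astype(int)
accuracy_demo = (preds_demo == y_demo).mean()
print(accuracy_demo)
```

For the reviews, the feature matrix would be `M_train_emb` and the targets would be the 0/1 list `y`; the learning rate and number of iterations will probably need adjusting.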

Problem 2. Fit a logistic regression model using one-hot representations of the reviews, based on the following prebuilt one-hot encodings. Test this model in the same way and compare its performance with that of the embedding-based model. Note that the one-hot vectors are 5000 elements long (the chosen vocabulary size was 5000; see week 7).

In [None]:
!wget https://raw.githubusercontent.com/cbannard/lela60331_24-25/refs/heads/main/reviews_onehot.csv.gz
!gunzip reviews_onehot.csv.gz

In [None]:
onehot=[]
with open("reviews_onehot.csv") as f:
    # iterate over the lines in the file (one review per line)
    for line in f.readlines():
        # split the current line into its comma-separated values - the elements of the review's one-hot vector
        fields = line.rstrip().split(',')
        # convert the values from strings to floats
        oh = [float(x) for x in fields]
        onehot.append(oh)
onehot=np.array(onehot).squeeze()
# These are the training set one-hot vectors
M_train_oh = onehot[train_ints,]
# These are the test set one-hot vectors
M_test_oh = onehot[test_ints,]
# The labels are the same as for the embeddings as the reviews are the same
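
To compare the two models on the same footing, it helps to compute precision and recall with one shared function. The sketch below is one possible way to do this: the helper name `precision_recall` and the toy label lists are invented for illustration, and labels are assumed to be coded as 1 (positive) and 0 (negative).

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class, with labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when there are no predicted or no actual positives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy example: 2 true positives, 1 false positive, 1 false negative
demo_true = [1, 1, 1, 0, 0]
demo_pred = [1, 1, 0, 1, 0]
demo_precision, demo_recall = precision_recall(demo_true, demo_pred)
print(demo_precision, demo_recall)
```

Calling the same function on the test-set predictions of the embedding model and of the one-hot model gives directly comparable numbers, since both are evaluated on the same test reviews.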