<a href="https://colab.research.google.com/github/brandonowens24/Word_Embeddings/blob/main/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Assignment 2: Word_Embeddings

##Task 2.1 Obtaining Dataset

**Installing Necessary Modules for Obtaining Word Embeddings**

In [None]:
# Install Necessary Modules
! pip install --upgrade gensim
! pip install datasets
! pip install apache_beam

**Grabbing Wikipedia Data from Hugging Face**

In [None]:
from datasets import load_dataset
from tqdm import tqdm

# Load the Simple English Wikipedia dump (2022-03-01) from Hugging Face
dataset = load_dataset("wikipedia", "20220301.simple")
documents = dataset['train']['text']


**Creating a Basic Tokenization Function**

In [176]:
import nltk
nltk.download('punkt')

def tokenize(document):
    """Split a document into sentences; return one list of lowercased tokens per sentence."""
    doc_tokens = []
    sentences = nltk.sent_tokenize(document)
    for sentence in sentences:
        sent_tokens = nltk.word_tokenize(sentence)
        sent_tokens = [word.lower() for word in sent_tokens if word]
        doc_tokens.append(sent_tokens)
    return doc_tokens


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Tokenizing All Wikipedia Rows in the Dataset**

In [None]:
# Word2Vec in gensim wants lists of lists of tokens for sentences
all_sent_tokenized = []

for document in tqdm(documents):
  doc_tokens = tokenize(document)
  all_sent_tokenized += doc_tokens

100%|██████████| 205328/205328 [07:47<00:00, 438.91it/s]


Gensim's Word2Vec expects its training corpus as a list of sentences, where each sentence is itself a list of tokens. Document boundaries are not kept.
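
A minimal sketch of that corpus shape. The regex-based `simple_tokenize` is a hypothetical stand-in for the NLTK tokenizer so that the example runs without downloads. It is cruder than `nltk.sent_tokenize` (it would split on abbreviations, for instance), but the nesting it produces is the same: one inner list per sentence.

```python
import re

def simple_tokenize(document):
    """Hypothetical stand-in for the NLTK-based tokenize(): split on
    sentence-ending punctuation, then extract lowercase word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences if s]

docs = ["Paris is in France. It is a city.", "Cats purr!"]

# Flatten documents away: the corpus is a list of sentences,
# and each sentence is a list of tokens.
corpus = []
for doc in docs:
    corpus += simple_tokenize(doc)

print(corpus)
```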

##Task 2.2 Training Word Embeddings


**Creating Models and Loading in Pretrained Models**

In [3]:
from gensim.models import Word2Vec
import gensim.downloader as api

# # CBOW
# model_CBOW = Word2Vec(sentences=all_sent_tokenized, vector_size=300, window=3, workers=2, hs=1, negative=0, epochs=50, compute_loss=True)
# model_CBOW.save("CBOWword2vec.model")

# # Skip_Gram
# model_SG = Word2Vec(sentences=all_sent_tokenized, vector_size=300, window=3, workers=2, sg=1, hs=1, negative=0, epochs=50, compute_loss=True)
# model_SG.save("SGword2vec.model")

# Glove
glove_model = api.load('glove-wiki-gigaword-50')
glove_model.save("glove.model")

# Google
google_model = api.load('word2vec-google-news-300')
google_model.save("google.model")




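Training took roughly 1.5 hours for CBOW and roughly 3.5 hours for skip-gram (SG).

Both trained models were saved and downloaded, which is why the training lines above are commented out.
The pretrained GloVe and Google News models still have to be loaded for comparison.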
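
The two commented-out training calls differ only in `sg=1`. As a rough illustration of what that flag changes, the sketch below enumerates the training examples each variant would draw from one sentence with `window=3`. `training_examples` is a hypothetical helper, not gensim code. It ignores details of the real implementation such as frequent-word subsampling and any randomisation of the effective window.

```python
def training_examples(tokens, window=3, skip_gram=False):
    """Enumerate (input, target) examples for one sentence.

    CBOW: one example per position, the context words predict the centre word.
    Skip-gram: one example per (centre word, context word) pair.
    """
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if skip_gram:
            examples += [(centre, c) for c in context]
        else:
            examples.append((context, centre))
    return examples

sentence = ["the", "cat", "sat", "on", "the", "mat"]
cbow = training_examples(sentence, window=3)
sg = training_examples(sentence, window=3, skip_gram=True)
print(len(cbow), len(sg))
```

Skip-gram yields several examples per position where CBOW yields one, which is one plausible contributor to its longer training time.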

**Downloading Model for Personal Use**

In [None]:
from google.colab import files
# files.download('CBOWword2vec.model')
# files.download('CBOWword2vec.model.syn1.npy')
# files.download('CBOWword2vec.model.wv.vectors.npy')
# files.download('SGword2vec.model')
# files.download('SGword2vec.model.syn1.npy')
# files.download('SGword2vec.model.wv.vectors.npy')
CBOWmodel = Word2Vec.load("CBOWword2vec.model")
SGmodel = Word2Vec.load("SGword2vec.model")




Loading the saved model files avoids retraining the CBOW and SG models in every session.

##Task 2.3 Comparing Word Embeddings with Queries

In [5]:
# QUERY 1 (Most Similar Words and Their Similarity Scores To: "calculator")
print("CBOW Model: ", CBOWmodel.wv.most_similar('calculator')[:3])
print("SG Model: ", SGmodel.wv.most_similar('calculator')[:3])
print("Glove Model: ", glove_model.most_similar('calculator')[:3])
print("Google Model: ", google_model.most_similar('calculator')[:3])

CBOW Model:  [('calculators', 0.3898669481277466), ('computer', 0.34476712346076965), ('device', 0.3213653266429901)]
SG Model:  [('calculators', 0.5116651654243469), ('computer', 0.39925017952919006), ('device', 0.39494720101356506)]
Glove Model:  [('calculators', 0.8655334711074829), ('graphing', 0.7849180102348328), ('computerized', 0.6740948557853699)]
Google Model:  [('calculators', 0.7862768769264221), ('Calculator', 0.7237094640731812), ('Calculators', 0.5793079137802124)]
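
The scores above are cosine similarities between the query vector and every other vector in the vocabulary. A minimal sketch of that ranking on made-up 3-dimensional vectors (the `toy` dictionary and the `most_similar` helper are illustrative assumptions, not the gensim implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, only for illustration
toy = {
    "calculator": [1.0, 0.2, 0.0],
    "calculators": [0.9, 0.3, 0.1],
    "computer": [0.5, 0.8, 0.1],
    "candle": [0.0, 0.1, 1.0],
}
ranking = most_similar("calculator", toy)
print(ranking)
```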


In [6]:
# QUERY 2 ("faster" - "fast" + "strong" = )

print("CBOW Model: ", CBOWmodel.wv.most_similar(positive=['faster', 'strong'], negative=['fast'], topn=3))
print("SG Model: ", SGmodel.wv.most_similar(positive=['faster', 'strong'], negative=['fast'], topn=3))
print("Glove Model: ", glove_model.most_similar(positive=['faster', 'strong'], negative=['fast'], topn=3))
print("Google Model: ", google_model.most_similar(positive=['faster', 'strong'], negative=['fast'], topn=3))

CBOW Model:  [('stronger', 0.679090678691864), ('bigger', 0.5708271265029907), ('less', 0.5614640712738037)]
SG Model:  [('stronger', 0.6459370851516724), ('weaker', 0.5052216053009033), ('weak', 0.48325246572494507)]
Glove Model:  [('stronger', 0.856702983379364), ('strength', 0.7947880625724792), ('contrast', 0.7903764843940735)]
Google Model:  [('stronger', 0.7650706171989441), ('weaker', 0.6070874333381653), ('better', 0.5499424338340759)]
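
With `positive` and `negative` lists, the query is built by adding and subtracting unit-length word vectors. The remaining vocabulary is then ranked by cosine similarity to the result, with the query words themselves left out. A sketch on made-up vectors (the `toy` values and the `analogy` helper are assumptions for illustration):

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors, topn=3):
    """Additive analogy: sum unit vectors of `positive`, subtract those of
    `negative`, then rank the other words by cosine with the result."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    used = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in used  # query words are excluded from the answers
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up axes: [speed, strength, comparative]
toy = {
    "fast": [1.0, 0.0, 0.0],
    "faster": [1.0, 0.0, 1.0],
    "strong": [0.0, 1.0, 0.0],
    "stronger": [0.0, 1.0, 1.0],
    "weaker": [0.0, -1.0, 1.0],
    "weak": [0.0, -1.0, 0.0],
}
result = analogy(positive=["faster", "strong"], negative=["fast"], vectors=toy)
print(result)
```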


In [27]:
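# QUERY 3 ("england" - "london" + "madrid" = )
# Note: the CBOW line uses most_similar_cosmul (the multiplicative 3CosMul objective),
# so its scores are not on the same scale as the most_similar scores of the other models.
# The Google News vectors are case-sensitive, hence the capitalised query words.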

print("CBOW Model: ", CBOWmodel.wv.most_similar_cosmul(positive=['madrid', 'england'], negative=['london'], topn=3))
print("SG Model: ", SGmodel.wv.most_similar(positive=['madrid', 'england'], negative=['london'], topn=3 ))
print("Glove Model: ", glove_model.most_similar(positive=['madrid', 'england'], negative=['london'], topn=3))
print("Google Model: ", google_model.most_similar(positive=['Madrid', 'England'], negative=['London'], topn=3))

CBOW Model:  [('spain', 0.8290620446205139), ('portugal', 0.797576367855072), ('navarre', 0.7728044390678406)]
SG Model:  [('spain', 0.6017550230026245), ('zaragoza', 0.5119850635528564), ('valladolid', 0.503682017326355)]
Glove Model:  [('valencia', 0.8629282116889954), ('sevilla', 0.8293460607528687), ('juventus', 0.8229055404663086)]
Google Model:  [('Spain', 0.6087851524353027), ('Paraguay', 0.570982813835144), ('Real_Madrid', 0.5681555867195129)]
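
The CBOW line in Query 3 calls `most_similar_cosmul`, which combines similarities multiplicatively instead of additively. A sketch of that 3CosMul score as I understand it: each cosine is shifted into [0, 1], the positive terms are multiplied, and the product is divided by the negative terms plus a small epsilon. The `toy` vectors, the helper name, and the epsilon value are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosmul_scores(positive, negative, vectors, eps=1e-6):
    """Score every non-query word with the multiplicative (3CosMul) objective."""
    def shifted(a, b):
        # Map cosine from [-1, 1] to [0, 1] so that products stay non-negative
        return (1 + cosine(vectors[a], vectors[b])) / 2

    used = set(positive) | set(negative)
    scores = {}
    for candidate in vectors:
        if candidate in used:
            continue
        numerator = math.prod(shifted(candidate, w) for w in positive)
        denominator = math.prod(shifted(candidate, w) for w in negative) + eps
        scores[candidate] = numerator / denominator
    return scores

# Made-up axes: [english, spanish, city]
toy = {
    "england": [1.0, 0.0, 0.0],
    "london": [1.0, 0.0, 1.0],
    "madrid": [0.0, 1.0, 1.0],
    "spain": [0.0, 1.0, 0.0],
    "valencia": [0.0, 1.0, 1.2],
}
scores = cosmul_scores(["madrid", "england"], ["london"], toy)
best = max(scores, key=scores.get)
print(best, scores)
```

Because the score is a ratio of shifted cosines, it is not a cosine itself and should not be compared directly with the other three models' scores.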


In [None]:
# QUERY 4 (Which Doesn't Belong: 'finance', 'business', 'market', 'candle', 'stock')

print("CBOW Model: ", CBOWmodel.wv.doesnt_match(['finance', 'business', 'market', 'candle', 'stock']))
print("SG Model: ", SGmodel.wv.doesnt_match(['finance', 'business', 'market', 'candle', 'stock']))
print("Glove Model: ", glove_model.doesnt_match(['finance', 'business', 'market', 'candle', 'stock']))
print("Google Model: ", google_model.doesnt_match(['finance', 'business', 'market', 'candle', 'stock']))
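
A sketch of the idea behind `doesnt_match`: normalise each word vector, average them into a centroid, and return the word whose vector has the lowest cosine similarity to that centroid. The `toy` vectors are made up, and the helper is an approximation of the approach, not the library code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(words, vectors):
    """Return the word farthest (by cosine) from the centroid of the group."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    centroid = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], centroid)))

# Made-up 2-dimensional vectors: the first axis stands for "finance-like"
toy = {
    "finance": [1.0, 0.1],
    "business": [0.9, 0.2],
    "market": [1.0, 0.2],
    "candle": [0.1, 1.0],
    "stock": [0.8, 0.1],
}
odd_one_out = doesnt_match(["finance", "business", "market", "candle", "stock"], toy)
print(odd_one_out)
```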

In [None]:
# QUERY 5 (Testing Similarity Scores of Two Sentences: "I picked you up from the airport" and "I got you from your flight yesterday")
# The CBOW, SG and GloVe vocabularies are lowercase, so the sentences are lowercased
# for those models to avoid a KeyError on "I". The Google News vectors are case-sensitive.
sent_1 = "I picked you up from the airport"
sent_2 = "I got you from your flight yesterday"

print("CBOW Model: ", CBOWmodel.wv.n_similarity(sent_1.lower().split(), sent_2.lower().split()))
print("SG Model: ", SGmodel.wv.n_similarity(sent_1.lower().split(), sent_2.lower().split()))
print("Glove Model: ", glove_model.n_similarity(sent_1.lower().split(), sent_2.lower().split()))
print("Google Model: ", google_model.n_similarity(sent_1.split(), sent_2.split()))
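
`n_similarity` compares two token lists by averaging the vectors in each list and taking the cosine of the two means. The sketch below uses a made-up lowercase vocabulary (`toy`) and shortened sentences. It also shows why token case matters: a token missing from the vocabulary raises a `KeyError`.

```python
import math

def mean_vector(tokens, vectors):
    """Average the vectors of `tokens`; a missing token raises KeyError."""
    rows = [vectors[tok] for tok in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]

def n_similarity(tokens_1, tokens_2, vectors):
    """Cosine similarity between the mean vectors of two token lists."""
    u = mean_vector(tokens_1, vectors)
    v = mean_vector(tokens_2, vectors)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up lowercase vocabulary, as produced by a lowercasing tokenizer
toy = {
    "i": [1.0, 0.0],
    "picked": [0.5, 0.5],
    "got": [0.6, 0.4],
    "airport": [0.0, 1.0],
    "flight": [0.1, 0.9],
}

s1, s2 = "I picked airport", "I got flight"
similarity = n_similarity(s1.lower().split(), s2.lower().split(), toy)

try:
    n_similarity(s1.split(), s2.split(), toy)  # "I" is not in the vocabulary
    uppercase_failed = False
except KeyError:
    uppercase_failed = True

print(similarity, uppercase_failed)
```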

##Task 2.4: Bias in Word Embeddings

In [None]:
! pip install wefe

**Using wefe's WEAT Metric to Measure Bias in the Four Models**

In [129]:
import wefe
from wefe.query import Query
from wefe.metrics import WEAT
from wefe.utils import run_queries
from wefe.word_embedding_model import WordEmbeddingModel
import numpy as np

# Decided to try to test for racism in the Wikipedia Dataset
diverse = ["black", "hispanic"]
nondiverse = ["white", "caucasian"]

negative_attributes = ["criminal", "thug", "violent", "dangerous"]
non_negative_attributes = ["educated", "skilled", "professional", "successful"]

query = Query(
    target_sets=[
        diverse, nondiverse
    ],
    attribute_sets=[
        negative_attributes, non_negative_attributes
    ],
    target_sets_names=['Diverse', 'Non-Diverse'],
    attribute_sets_names=['Negative', 'Positive'])

weat = WEAT()

# Have to wrap models in wefe WordEmbeddingModel
CBOWmodel_wrapper = WordEmbeddingModel(
    CBOWmodel.wv, "CBOWword2vec.model"
)

SGmodel_wrapper = WordEmbeddingModel(
    SGmodel.wv, "SGword2vec.model"
)

glove_wrapper = WordEmbeddingModel(
    glove_model, "glove.model"
)

google_wrapper = WordEmbeddingModel(
    google_model, "google.model"
)

# Running individual model queries to find bias
CBOWres = run_queries(
    WEAT,
    [query],
    [CBOWmodel_wrapper],
    metric_params={"preprocessors": [{"lowercase": True}]},
    warn_not_found_words=True
).T.round(2)

SGres = run_queries(
    WEAT,
    [query],
    [SGmodel_wrapper],
    metric_params={"preprocessors": [{"lowercase": True}]},
    warn_not_found_words=True
).T.round(2)

gloveres = run_queries(
    WEAT,
    [query],
    [glove_wrapper],
    metric_params={"preprocessors": [{"lowercase": True}]},
    warn_not_found_words=True
).T.round(2)

googleres = run_queries(
    WEAT,
    [query],
    [google_wrapper],
    metric_params={"preprocessors": [{"lowercase": True}]},
    warn_not_found_words=True
).T.round(2)

In [130]:
CBOWres

| query_name | CBOWword2vec.model |
|---|---|
| Diverse and Non-Diverse wrt Negative and Positive | -0.1 |


In [131]:
SGres

| query_name | SGword2vec.model |
|---|---|
| Diverse and Non-Diverse wrt Negative and Positive | -0.04 |


In [132]:
gloveres

| query_name | glove.model |
|---|---|
| Diverse and Non-Diverse wrt Negative and Positive | -0.06 |


In [133]:
googleres

| query_name | google.model |
|---|---|
| Diverse and Non-Diverse wrt Negative and Positive | 0.03 |
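
For reference, a sketch of the WEAT quantities as usually defined: the association of a word with two attribute sets, the test statistic over two target sets, and the effect size. Whether wefe reports exactly this statistic by default, and which standard deviation it uses for the effect size, are assumptions here. The 2-dimensional vectors are made up.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to set B."""
    return mean([cosine(w, a) for a in A]) - mean([cosine(w, b) for b in B])

def weat_statistic(X, Y, A, B):
    """Sum of associations over target set X minus the sum over target set Y."""
    return sum(association(x, A, B) for x in X) - sum(association(y, A, B) for y in Y)

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations, scaled by the spread over X and Y
    (sample standard deviation is an assumption)."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)

# Made-up vectors: X leans towards A, Y leans towards B
X = [[1.0, 0.1], [0.9, 0.2]]
Y = [[0.1, 1.0], [0.2, 0.9]]
A = [[1.0, 0.0], [0.9, 0.1]]
B = [[0.0, 1.0], [0.1, 0.9]]

statistic = weat_statistic(X, Y, A, B)
effect = weat_effect_size(X, Y, A, B)
print(statistic, effect)
```

Under this definition, a positive value means the first target set is more associated with the first attribute set than the second target set is. A value near zero, as in the four tables above, indicates little measured difference for these particular word lists.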


##Task 2.5: Text Classification

###BOW Model

**Used the IMDb Dataset from Hugging Face and Trained a Logistic Regression to Classify Positive/Negative Reviews**

In [165]:
# Logistic Regression - BOW
imdb = load_dataset("imdb")
imdb_train = imdb['train']

In [166]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(input="content", stop_words="english")
vectors = vectorizer.fit_transform(imdb_train['text'])
vectors

<25000x74538 sparse matrix of type '<class 'numpy.float64'>'
	with 2241793 stored elements in Compressed Sparse Row format>
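
A small pure-Python sketch of the weighting behind each row of that matrix, assuming `TfidfVectorizer`'s default settings: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows. Tokenisation is simplified to whitespace splitting and stop-word removal is left out. `tfidf_rows` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf_rows(["good movie", "bad movie"])
print(rows)
```

The term shared by both documents ("movie") ends up with a lower weight than the terms that distinguish them.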

In [168]:
import random
from scipy.special import expit

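def sgd_for_lr_with_ce(X, y, num_passes=15, learning_rate=0.1):
    """Fit logistic regression with per-example SGD on the cross-entropy loss.

    X is a scipy sparse matrix (one row per document); y holds the 0/1 labels.
    Returns the weights w and the bias b.
    """
    num_data_points = X.shape[0]
    num_features = X.shape[1]

    # Initialize the parameters (theta) to zero
    w = np.zeros(num_features)
    b = 0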

    # repeat until done
    # how to define "done"? let's just make it num passes for now
    # we can also do norm of gradient and when it is < epsilon (something tiny)

    for current_pass in tqdm(range(num_passes)):

        # iterate through entire dataset in random order
        order = list(range(num_data_points))
        random.shuffle(order)
        for i in order:

            # compute y-hat for this value of i given y_i and x_i
            x_i = X[i].todense()
            y_i = y[i]

            # need to compute y_hat based on w and b, and x
            # sigmoid(w dot x + b)
            # use expit for sigmoid

            z = np.dot(x_i.squeeze(),w.T) + b
            y_hat_i = expit(z)

            # for each w (and b), modify by -lr * (y_hat_i - y_i) * x_i
            w = w - learning_rate * (y_hat_i - y_i) * x_i
            b = b - learning_rate * (y_hat_i - y_i)

    # return theta
    return w,b



Code Help for the Stochastic Gradient Descent provided by Dr. Steve Wilson.
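
The update inside the loop follows from the gradient of the cross-entropy loss for a single example. A short derivation, writing $\eta$ for `learning_rate` and $\sigma$ for the sigmoid (`expit`):

```latex
\begin{aligned}
z_i &= w^\top x_i + b, \qquad \hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}},\\
L_i &= -\bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\bigr],\\
\frac{\partial L_i}{\partial z_i} &= \hat{y}_i - y_i
  \qquad \text{since } \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr),\\
\frac{\partial L_i}{\partial w} &= (\hat{y}_i - y_i)\, x_i,
  \qquad \frac{\partial L_i}{\partial b} = \hat{y}_i - y_i,\\
w &\leftarrow w - \eta\,(\hat{y}_i - y_i)\, x_i,
  \qquad b \leftarrow b - \eta\,(\hat{y}_i - y_i).
\end{aligned}
```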

In [169]:
# Obtained Training Weights and Biases for BOW LR
X = vectors
y = imdb_train['label']
w,b = sgd_for_lr_with_ce(X,y)

100%|██████████| 15/15 [04:26<00:00, 17.75s/it]


In [170]:
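def predict_y_lr(w,b,X,threshold=0.5):
  # y_hat is the raw score w.x + b (a logit), not a probability:
  # no sigmoid is applied, so the threshold is compared against the logit.
  y_hat = X.dot(w.reshape(-1,1)) + b
  preds = np.where(y_hat>threshold, 1, 0)

  return preds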
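
`predict_y_lr` compares `w·x + b` directly with `threshold`. A quick check of what a threshold of 0.5 on that raw score means in probability terms, using a plain-Python sigmoid (the helper names are illustrative):

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score (logit) to a probability."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: maps a probability to a raw score."""
    return math.log(p / (1 - p))

# A cut-off of 0.5 on the raw score corresponds to this probability cut-off
prob_cutoff_for_score_half = sigmoid(0.5)

# A cut-off of 0.5 on the probability corresponds to this raw-score cut-off
score_cutoff_for_prob_half = logit(0.5)

print(prob_cutoff_for_score_half, score_cutoff_for_prob_half)
```

So thresholding the raw score at 0.5 predicts class 1 only when the modelled probability exceeds roughly 0.62. If the intended cut-off is a probability of 0.5, the equivalent raw-score threshold would be 0.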

**Trained BOW LR and Delivered Classification Report on Training Data**

In [171]:
from sklearn.metrics import classification_report

training_preds = predict_y_lr(w,b,X)
print(classification_report(y, training_preds))

              precision    recall  f1-score   support

           0       0.90      0.97      0.94     12500
           1       0.97      0.90      0.93     12500

    accuracy                           0.94     25000
   macro avg       0.94      0.94      0.94     25000
weighted avg       0.94      0.94      0.94     25000



**Delivered Classification Report on Testing Data**

In [172]:
X_test = imdb['test']['text']
y_test = imdb['test']['label']

X_test_transformed = vectorizer.transform(X_test)
testing_preds = predict_y_lr(w,b,X_test_transformed)
print(classification_report(y_test, testing_preds))

              precision    recall  f1-score   support

           0       0.82      0.94      0.87     12500
           1       0.93      0.79      0.85     12500

    accuracy                           0.86     25000
   macro avg       0.87      0.86      0.86     25000
weighted avg       0.87      0.86      0.86     25000



###CBOW Model

In [214]:
# Logistic Regression on averaged word embeddings from the CBOW model

# Builds a matrix X with one row per document: the mean of the embedding vectors
# of the document's tokens that are present in the model's vocabulary
def determine_features(data, word_vectors):
    row_vectors = []
    for each_rating in tqdm(data):
        tmp_row_vec = []
        rating_tokens = tokenize(each_rating)
        rating_tokens = sum(rating_tokens, [])
        for tok in rating_tokens:
            if tok in word_vectors:
                tmp_row_vec.append(word_vectors[tok])
        array = np.array(tmp_row_vec)
        row_vectors.append(np.mean(array, axis=0))

    X = np.matrix(row_vectors)
    return X
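
A sketch of the per-document averaging on plain lists, with one addition that `determine_features` does not have: a fallback for documents in which no token is in the vocabulary, where the mean would otherwise be taken over an empty array. Returning a zero vector in that case is an assumption, and `average_embedding` is a hypothetical helper.

```python
def average_embedding(tokens, word_vectors, dim):
    """Mean of the vectors of the in-vocabulary tokens.

    Assumption: a document with no known tokens maps to the zero vector.
    """
    found = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]

# Made-up 2-dimensional vocabulary
toy = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

known = average_embedding(["good", "movie", "zzz"], toy, dim=2)  # "zzz" is skipped
unknown = average_embedding(["zzz", "qqq"], toy, dim=2)          # nothing found
print(known, unknown)
```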


In [217]:
from scipy.sparse import csr_matrix

Xtrain = determine_features(imdb_train['text'], CBOWmodel.wv)
# SGD function wants sparse matrix
Xtrain_ = csr_matrix(Xtrain)
ytrain = imdb_train['label']
w,b = sgd_for_lr_with_ce(Xtrain_,ytrain)

100%|██████████| 25000/25000 [01:33<00:00, 267.92it/s]


**Trained CBOW LR and Delivered Classification Report on Training Data**

In [221]:
training_preds = predict_y_lr(w,b,Xtrain_)
print(classification_report(ytrain, training_preds))

              precision    recall  f1-score   support

           0       0.71      0.91      0.80     12500
           1       0.88      0.62      0.73     12500

    accuracy                           0.77     25000
   macro avg       0.79      0.77      0.76     25000
weighted avg       0.79      0.77      0.76     25000



In [None]:
Xtest = determine_features(imdb['test']['text'], CBOWmodel.wv)
Xtest_ = csr_matrix(Xtest)
ytest = imdb['test']['label']



**Delivered Classification Report on Testing Data**

In [223]:
testing_preds = predict_y_lr(w,b,Xtest_)
print(classification_report(ytest, testing_preds))

              precision    recall  f1-score   support

           0       0.70      0.91      0.79     12500
           1       0.87      0.62      0.72     12500

    accuracy                           0.76     25000
   macro avg       0.79      0.76      0.76     25000
weighted avg       0.79      0.76      0.76     25000

