Lecture 11: Neural Nets + Word Embeddings
===============

10/23/2023, CS 4/6120 Natural Language Processing, Muzny

Task 1: Is a single word positive or negative?
-----------------

We'll start by trying to train a neural net to recognize if a word is positive or negative based on its word embedding.

In [None]:
import numpy as np

# if you are inspired to graph things :)
import matplotlib.pyplot as plt

%matplotlib inline
# seed random number generation so that you can
# track the same numbers as each other
# (the seed value 42 is an arbitrary choice)
np.random.seed(42)

In [None]:
# if you are running on google colab, you'll want the following code
# so we can load data in
# (upload the train/dev files to your google drive first)

# from google.colab import drive
# drive.mount('/content/drive')

You'll need to upload the data files that you're reading from to your Google Drive so that you can access them after the drive is mounted and given permissions. You can find your drive files under `/content/drive/MyDrive`.

In [None]:
# we'll get our word embeddings from the gensim package
import gensim.downloader

In [None]:
# go get some pre-trained word embeddings with 50 dimensions
# GloVe is a different algorithm from word2vec for creating the vectors,
# but the resulting vectors follow the same structure:
# dense vectors representing words in n-dimensional space

# we're using this set because it's the smallest
# pre-trained set available from gensim
# it is still 66 MB!
word2vec_vectors = gensim.downloader.load('glove-wiki-gigaword-50')

In [None]:
# accessing a specific word vector
print(word2vec_vectors['computer'])

# TODO: measure the similarity between a few of the vectors in this set
# documentation for vectors overall:
# https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors
# documentation for specific methods:
# https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.similarity
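
The `similarity` method linked above returns the cosine similarity between two word vectors. As a minimal sketch of what that number means, the cell below computes cosine similarity by hand with numpy. The three-dimensional `toy_vectors` are made up for illustration and are not real GloVe values; with the loaded embeddings, the equivalent call would be something like `word2vec_vectors.similarity('computer', 'laptop')`.

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
toy_vectors = {
    "good": np.array([1.0, 2.0, 0.0]),
    "great": np.array([2.0, 4.0, 0.0]),   # same direction as "good"
    "bad": np.array([-1.0, -2.0, 0.0]),   # opposite direction from "good"
}

# vectors pointing the same way score close to 1
print(cosine_similarity(toy_vectors["good"], toy_vectors["great"]))
# vectors pointing opposite ways score close to -1
print(cosine_similarity(toy_vectors["good"], toy_vectors["bad"]))
```

Cosine similarity ignores vector length, which is why "good" and "great" score as (nearly) identical here even though one vector is twice as long.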

In [None]:
# this will tell you how many words are in your vocabulary with these vectors
print("Number of word vectors available:", len(word2vec_vectors))

In [None]:
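def load_data(fname: str) -> list:
    """Read (word, label) pairs from a comma-separated file, one pair per line."""
    data = []
    # the with-block closes the file for us, even if an error occurs
    with open(fname, 'r') as data_f:
        for line in data_f:
            pieces = line.strip().split(",")
            data.append((pieces[0], int(pieces[1])))
    return data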


# Load in some training and testing data
train = load_data("train_sentiment.txt")
dev = load_data("dev_sentiment.txt")


print("num in train:", len(train))
print("num in dev:", len(dev))
# what does the data look like?
print(sorted(train)[:5])
print(sorted(dev)[:5])

In [None]:
# TODO: what is the overlap between the train and the dev sets?
# how many words appear in both?
# (hint: you'll need to extract just the word-parts of the train/dev sets to do this test)
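
One possible way to approach an overlap check like this is with Python sets. The sketch below uses small made-up lists (`toy_train` and `toy_dev` are hypothetical stand-ins in the same `(word, label)` format that `load_data` returns), so the numbers say nothing about the real data.

```python
# toy (word, label) data, made up for illustration
toy_train = [("good", 1), ("awful", 0), ("fresh", 1)]
toy_dev = [("awful", 0), ("lovely", 1)]

# keep just the word part of each tuple
train_words = {word for word, _ in toy_train}
dev_words = {word for word, _ in toy_dev}

# set intersection: words that appear in both sets
overlap = train_words & dev_words
print(len(overlap), sorted(overlap))
```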


In [None]:
# TODO: what is the distribution of labels in the train and dev sets?
# how many positive and negative examples are there in each?
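
A label distribution can be tallied with `collections.Counter`. This is a minimal sketch on made-up data (`toy_train` is a hypothetical stand-in, and it assumes that 1 means positive and 0 means negative); swap in the real `train` and `dev` lists to answer the question above.

```python
from collections import Counter

# toy (word, label) data, made up for illustration
toy_train = [("good", 1), ("awful", 0), ("fresh", 1)]

# count how many times each label occurs
label_counts = Counter(label for _, label in toy_train)
print(label_counts)
```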

In [None]:

def create_input_array(examples: list, 
                       word_vectors: gensim.models.keyedvectors.KeyedVectors) -> tuple:
    """
    Convert examples to input and label arrays.

    Parameters:
    examples (list): list of (word, label) tuples
    word_vectors (gensim.models.keyedvectors.KeyedVectors): dense word vectors to match with words

    Returns:
    tuple of (numpy array of inputs, numpy array of corresponding labels,
    list of the words themselves, for debugging)
    """
    # this will be our y
    labels = []
    
    # this will be our input data
    X = []
    words = []
    for ex in examples:
        word = ex[0]
        label = ex[1]
        # ignore words that we don't have vectors for
        if word in word_vectors:
            # TODO: get the word vector that matches this target word 
            
            
            X.append(YOUR WORD VECTOR HERE)  
            
            labels.append(label)
            
            # collect the word too, just so that we can debug/investigate better
            words.append(word)
            
    # TODO: make y into the correct shape (see below)
    y = np.array([labels])
    return np.array(X), y, words

X, y, words = create_input_array(train, word2vec_vectors)

# goal:
# shape of inputs: (3342, 50)
# Example embedding: [ 0.14702  -0.79382  -0.15014    ...
# shape of labels: (3342, 1)
# Example label: [1]
# Example word: fresher

print("shape of inputs:", X.shape)
print("Example embedding:", X[0])
print("shape of labels:", y.shape)
print("Example label:", y[0])
print("Example word:", words[0])
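
The difference between a `(1, n)` and an `(n, 1)` label array is easy to miss. The sketch below uses a short made-up `labels` list to show how wrapping a list in extra brackets compares with reshaping into a column; it is one possible approach, not the only one.

```python
import numpy as np

# toy labels, made up for illustration
labels = [1, 0, 1, 1]

# wrapping the list in brackets gives ONE row with n columns
row = np.array([labels])
# reshape(-1, 1) gives n rows with ONE column (-1 means "infer this dimension")
column = np.array(labels).reshape(-1, 1)

print("row shape:", row.shape)
print("column shape:", column.shape)
print("first label:", column[0])
```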

Task 2: Using NN libraries
----------------

Now we'll look at a common library for building classifiers with neural nets: [`keras`](https://keras.io/), which provides a nice API for implementing neural nets. Older versions of Keras could run on top of TensorFlow, CNTK, or Theano; we'll use [`tensorflow`](https://github.com/tensorflow/tensorflow) as our backend.

Installation of component libraries (NOTE: the process is different if you are on a Mac with an M1 or M2 chip! See the [Apple developer page](https://developer.apple.com/metal/tensorflow-plugin/) and follow the instructions for "Apple silicon"):

```
pip3 install tensorflow
pip3 install keras
```

In [None]:
from keras.models import Sequential
from keras.layers import Dense

In [None]:
hidden_units = # YOUR NUMBERS HERE
num_epochs = 20


# set up the basis for a feed forward network
model = Sequential()
# hidden layer -- same number of hidden units as above
model.add(Dense(units=hidden_units, activation='relu', input_dim=X.shape[1]))
# output layer
model.add(Dense(units=1, activation='sigmoid'))

# configure the learning process
model.compile(loss='binary_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])


model.fit(X, y, epochs=num_epochs, verbose=1)
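
To make the two `Dense` layers less of a black box, here is a rough numpy sketch of the forward pass that a network of this shape computes for one word: a ReLU hidden layer followed by a single sigmoid output. The weights are random stand-ins rather than trained values, the input is a random vector rather than a real embedding, and `n_hidden = 8` is an arbitrary choice.

```python
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
embedding_dim, n_hidden = 50, 8  # n_hidden is an arbitrary choice

# random stand-ins for the weights and biases that training would learn
W1 = rng.normal(scale=0.1, size=(embedding_dim, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

# a random stand-in for one word's 50-dimensional embedding
x = rng.normal(size=(1, embedding_dim))

hidden = relu(x @ W1 + b1)          # hidden layer activations
output = sigmoid(hidden @ W2 + b2)  # a single value between 0 and 1

print("hidden shape:", hidden.shape)
print("output shape:", output.shape)
```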

1. What is one single input to this network? __YOUR ANSWER HERE__
2. What is one single output of this network? __YOUR ANSWER HERE__
3. Test training your network with a few different numbers of hidden units. Try with a number __much smaller__ than the number of dimensions in your word vectors, a number __about the same__ as in your word vectors, and a number that is __much larger__. What numbers did you use and what are the final accuracies reported by your model? 
    1. much smaller: __YOUR NUM HIDDEN HERE__, __YOUR ACCURACY HERE__
    2. about the same: __YOUR NUM HIDDEN HERE__, __YOUR ACCURACY HERE__
    3. much larger: __YOUR NUM HIDDEN HERE__, __YOUR ACCURACY HERE__
4. What loss functions are available for these models? [documentation](https://keras.io/api/models/model_training_apis/) __YOUR ANSWER HERE__
5. Why is the loss function here `binary_crossentropy`? __YOUR ANSWER HERE__
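
As background for the questions about loss functions, the following sketch computes binary cross-entropy by hand in numpy. It is an illustration of the formula, not Keras's actual implementation; the function name, the `eps` clipping value, and the example probabilities are all assumptions made for this sketch.

```python
import numpy as np


def binary_cross_entropy(y_true: np.ndarray, y_prob: np.ndarray, eps: float = 1e-7) -> float:
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    # clip so that we never take log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# made-up gold labels and predicted probabilities
y_true = np.array([1.0, 0.0, 1.0])
mostly_right = np.array([0.9, 0.1, 0.8])
mostly_wrong = np.array([0.1, 0.9, 0.2])

print("loss when mostly right:", binary_cross_entropy(y_true, mostly_right))
print("loss when mostly wrong:", binary_cross_entropy(y_true, mostly_wrong))
```

The loss is small when the predicted probability is close to the gold label and grows quickly when the model is confidently wrong.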

In [None]:
# evaluate on our development/test data
# TODO: Get inputs and outputs for your dev set using code we've already implemented


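# if you just want predictions
# (with a sigmoid output these are probabilities between 0 and 1,
# so you still need to threshold them to get 0/1 labels)
# y_hat = model.predict(X_dev)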

# if you want score/accuracy
score, acc = model.evaluate(X_dev, y_dev)

# score is an evaluation of the loss function
print("Score on dev:", score)
# Accuracy is "standard accuracy" of the model on this dev set
print("Accuracy on dev:", acc)
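
A sigmoid output layer produces one probability per example, so getting hard 0/1 labels takes one more step. The sketch below thresholds made-up probabilities at 0.5 (a common but not mandatory choice) and computes accuracy by hand; `probs` and `gold` are hypothetical stand-ins for the model's predictions and the dev labels.

```python
import numpy as np

# made-up probabilities in the (n, 1) shape that a sigmoid output layer produces
probs = np.array([[0.91], [0.35], [0.50], [0.08]])
# made-up gold labels in the same shape
gold = np.array([[1], [0], [0], [0]])

# probabilities at or above the threshold become label 1, the rest label 0
predicted_labels = (probs >= 0.5).astype(int)

# accuracy is the fraction of predictions that match the gold labels
accuracy = float(np.mean(predicted_labels == gold))

print("predicted labels:", predicted_labels.ravel())
print("accuracy:", accuracy)
```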

6. How does your model perform on dev vs. train sets? __YOUR ANSWER HERE__

In [None]:
# Finally, pick out some words that you'd like to test to see what
# labels they are assigned.

# come up with 3 - 5 test words of your own
# these words must have word embeddings but need not have appeared in the train set
test_words = # FILL IN HERE, you want a list of tuples, like train and dev

# TODO: make into corresponding X and y


# TODO: make some predictions


# TODO: make sure that you can see which predictions were correct vs. incorrect


Task 3: playing with `yield`
-------

In your final HW, you will write a function that uses the Python keyword `yield` to produce batches of input data for the models that you're training. Play around with the code below to get a better sense of how it works.

[documentation for `yield`](https://docs.python.org/3/reference/simple_stmts.html#the-yield-statement)

In [None]:
def generate_squares(n):
    for i in range(n):
        yield i*i 
    
    
# TODO: play around with the below code
# add more! 
# how would you see the number 16?
# what happens after 16?
result = generate_squares(5)
print(next(result))
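
The same pattern extends to batching. Here is a minimal sketch of a generator that yields fixed-size slices of a list; `generate_batches` and its parameters are hypothetical names chosen for illustration, not the signature that the HW requires.

```python
def generate_batches(examples: list, batch_size: int):
    """Yield consecutive slices of examples, each at most batch_size long."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# 7 toy examples in batches of 3: the last batch is smaller than the rest
batches = generate_batches(list(range(7)), batch_size=3)
print(next(batches))
print(next(batches))
print(next(batches))

# once the loop inside the generator finishes, next() raises StopIteration
try:
    next(batches)
except StopIteration:
    print("generator is exhausted")
```

Because the generator only builds one batch at a time, the full set of batches never has to sit in memory at once.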