# LIN 371 Machine Learning for Text Analysis

# Homework 4 - due Wednesday Apr 17 2024 at 11:59pm

For this homework you will hand in (upload) to canvas:
- a notebook renamed ``hw4_YourEID.ipynb``

**Note**: This is a small homework and a perfect solution to this homework will be worth **100** points.

__Before submitting__, please reset your kernel and rerun everything from the beginning (`Kernel` >> `Restart and Run All`) and ensure your code outputs the correct answer.

For programming tasks, make sure that your code can run using Python 3.5+. If you cannot complete a problem, include as much pseudocode as possible for partial credit. However, make sure it does not have any output errors. **If there are any output errors, half of the points for that problem will be automatically deducted.**

Collaboration: you are free to discuss the homework assignments with other students and work towards solutions together.  However, all of the code you write must be your own! There is a channel on Slack where you can look for a study group.

Review extension and academic dishonesty policy here: https://jessyli.com/courses/lin371

For typing up your answers to non-programming problems, information can be found about Markdown cells for Jupyter Notebooks here: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html


### Please list any collaborators here:
- Rylan Vachon



## Problem 1: Multi-layer perceptron going linear (10 points)

Show mathematically that a multi-layer perceptron using the _linear (identity) activation function_ $y=z$ is still a linear classifier. You can assume that (1) each example has two features $x_1$ and $x_2$; (2) there are two hidden layers; (3) each hidden layer has two nodes; (4) this is for a binary classification task.
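
A sketch of one possible derivation under the stated assumptions. The notation is introduced here for illustration and is not from the assignment: $W^{(1)}, W^{(2)} \in \mathbb{R}^{2 \times 2}$ and $b^{(1)}, b^{(2)} \in \mathbb{R}^{2}$ are the hidden-layer weights and biases, and $w^{(3)} \in \mathbb{R}^{2}$, $b^{(3)} \in \mathbb{R}$ belong to a single output unit. With the identity activation, each layer only applies an affine map to $x = (x_1, x_2)^\top$:

$$
\begin{aligned}
h^{(1)} &= W^{(1)} x + b^{(1)} \\
h^{(2)} &= W^{(2)} h^{(1)} + b^{(2)} \\
z &= w^{(3)\top} h^{(2)} + b^{(3)}
\end{aligned}
$$

Substituting the first two lines into the third gives

$$
z = \underbrace{w^{(3)\top} W^{(2)} W^{(1)}}_{\tilde{w}^\top}\, x
  + \underbrace{w^{(3)\top}\left(W^{(2)} b^{(1)} + b^{(2)}\right) + b^{(3)}}_{\tilde{b}}
  = \tilde{w}_1 x_1 + \tilde{w}_2 x_2 + \tilde{b}
$$

so the whole network is equivalent to a single linear unit. Predicting the positive class when $z > 0$ (or, equivalently, when a sigmoid of $z$ exceeds 0.5) yields the decision boundary $\tilde{w}_1 x_1 + \tilde{w}_2 x_2 + \tilde{b} = 0$, which is a straight line in the $(x_1, x_2)$ plane.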

## Problem 2: Concepts (30 points)
Briefly answer the following questions:

**(1)** What is batching in neural networks, and why is batching used?

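Batching in neural networks means splitting the training data into small subsets ("mini-batches") and updating the weights once per batch, rather than once per example or once per pass over the entire training set.

It is used for several reasons:
- Averaging over a batch gives a less noisy gradient estimate than a single example
- The less noisy updates create smoother convergence
- The more stable gradient estimates allow for larger learning rates
- Batches can be processed in parallel on GPUs, which improves speed, and memory use stays bounded because the full dataset never has to be processed at once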
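
As a minimal sketch of the splitting step described above, the generator below shuffles example indices and yields fixed-size mini-batches. The function name `iterate_minibatches` and the toy data are hypothetical, chosen only for illustration; real training code would run one gradient update per yielded batch.

```python
import random


def iterate_minibatches(examples, batch_size, shuffle=True, seed=0):
    """Yield successive mini-batches (lists) drawn from `examples`."""
    indices = list(range(len(examples)))
    if shuffle:
        # Shuffle once per pass so that batches differ between epochs
        # when a different seed is supplied.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


data = list(range(10))  # stand-in for 10 training examples
batches = list(iterate_minibatches(data, batch_size=4))
print([len(b) for b in batches])  # the last batch holds the remainder
```

Every example appears in exactly one batch per pass, so one pass over all batches corresponds to one epoch.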

**(2)** Name and describe two ways to prevent overfitting in a deep neural network.

1. Dropout is a method used to prevent overfitting in deep neural networks. During training, it randomly sets some activations in each layer to zero, commonly about half of each hidden layer. This forces the network to rely less on any single node.
2. Early stopping is another method used to prevent overfitting in deep neural networks. It requires an additional set of data called a "development" or "validation" set. The model is evaluated on this set after each training iteration; once validation performance stops improving, the model is starting to overfit, so training is halted at that point instead of continuing to fit the training set.
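
The early-stopping rule can be sketched as a small loop over per-epoch validation losses. This is an illustrative sketch only: the function `early_stopping`, the `patience` parameter, and the loss values are assumptions made up for the example, not part of the assignment.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_at) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the weights from `best_epoch` would be kept.
    """
    best_epoch, best_loss = 0, float("inf")
    stopped_at = len(val_losses) - 1  # default: ran through every epoch
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            stopped_at = epoch
            break
    return best_epoch, stopped_at


# Hypothetical validation losses: improving at first, then rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stopped_at = early_stopping(val_losses, patience=2)
print(best_epoch, stopped_at)  # 3 5
```

Here the loss bottoms out at epoch 3, and training halts at epoch 5 after two epochs without improvement.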

**(3)** Name at least one benefit of using static word embeddings (e.g., glove, word2vec) over bag-of-words representations, and at least one aspect of language that word embeddings do not account for.

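- One benefit of using static word embeddings compared to bag-of-words representations is that static word embeddings take semantic relationships into account. They map each word to a dense vector so that similar words lie close together in the embedding space, whereas bag-of-words treats every word as an unrelated dimension.
- One aspect of language that static word embeddings do not account for is context-dependent meaning: each word gets a single vector, so different senses of the same word (polysemy) share one representation.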
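
To make "proximity in the embedding space" concrete, here is a small sketch that compares cosine similarity under two representations. The three-dimensional vectors in `toy_embeddings` are invented for illustration and are not real GloVe or word2vec values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical dense embeddings: "cat" and "dog" point in similar directions.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# One-hot (bag-of-words style) vectors: every word is its own dimension.
one_hot = {
    "cat": [1, 0, 0],
    "dog": [0, 1, 0],
    "car": [0, 0, 1],
}

print(cosine(toy_embeddings["cat"], toy_embeddings["dog"]))  # close to 1
print(cosine(toy_embeddings["cat"], toy_embeddings["car"]))  # much lower
print(cosine(one_hot["cat"], one_hot["dog"]))                # exactly 0.0
```

With one-hot vectors every pair of distinct words is equally dissimilar, which is the limitation the dense embeddings address.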

**(4)** What's a key difference between static word embeddings (e.g., glove, word2vec) and contextualized word embeddings (e.g., BERT, Elmo)?

- Contextualized word embeddings take into consideration the context of the text surrounding each word (river *bank* vs financial *bank*), so the same word receives a different vector in each context. Static word embeddings have a single fixed (static) representation of each word, so both senses of *bank* share one vector and the model cannot distinguish different contextual uses of the same word.

**(5)** Suppose we are considering a tweet classification task on 8000 examples. A common way to obtain a representation of one tweet via word embeddings is to take the average of the embeddings of each token within this tweet. Suppose we are designing a neural network with:
* Word embeddings of 300 dimensions; 
* A first hidden layer with 100 hidden states;
* A second hidden layer with 10 hidden states;
* A final binary classification layer.

How many parameters will the network learn? Show your work.
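
One way to work through the count, as a sketch under explicit assumptions that the problem statement does not spell out: every layer is fully connected with a bias term, the final layer is a single output unit, and the pretrained word embeddings are kept fixed (averaging them adds no parameters). The helper `dense_params` is a hypothetical name used only here. The number of examples (8000) does not affect the count.

```python
def dense_params(n_in, n_out, bias=True):
    """Number of weights (plus optional biases) in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Assumed architecture: 300-dim averaged embedding -> 100 -> 10 -> 1 output.
layer_sizes = [300, 100, 10, 1]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [30100, 1010, 11]
print(total)      # 31121
```

Under a different assumption, for example a two-unit softmax output layer, the last term becomes $10 \times 2 + 2 = 22$ and the total becomes 31132.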

## Problem 3: Bias in word embeddings (15 points)

It's important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

Here, we shall explore the embeddings produced by GloVe. Please revisit the class notes and lecture slides for more details on the word2vec and GloVe algorithms.

Then run the following cells to load the GloVe vectors into memory. **Note**: If this is your first time running these cells (i.e., downloading the embedding model), they will take a couple of minutes to run. If you have run these cells before, rerunning them will load the model without redownloading it, which takes about 1 to 2 minutes.

In [1]:
import gensim.downloader as api

def load_embedding_model():
    """ Load GloVe Vectors
        Return:
            wv_from_bin: All 400000 embeddings, each length 200
    """
    
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    print("Loaded vocab size %i" % len(wv_from_bin.key_to_index))
    return wv_from_bin

# -----------------------------------
# Run Cell to Load Word Vectors
# Note: This will take a couple minutes
# -----------------------------------
wv_from_bin = load_embedding_model()

Loaded vocab size 400000


**(1)** Run the cell below to examine (a) which terms are most similar to "woman" and "worker" and most dissimilar to "man", and (b) which terms are most similar to "man" and "worker" and most dissimilar to "woman". **Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it reflects gender bias.**

In [2]:
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
print(wv_from_bin.most_similar(positive=['woman', 'worker'], negative=['man']))
print()
print(wv_from_bin.most_similar(positive=['man', 'worker'], negative=['woman']))
print()

[('employee', 0.6375863552093506), ('workers', 0.6068920493125916), ('nurse', 0.5837947130203247), ('pregnant', 0.5363885164260864), ('mother', 0.5321308970451355), ('employer', 0.5127025842666626), ('teacher', 0.5099576711654663), ('child', 0.5096741318702698), ('homemaker', 0.5019454956054688), ('nurses', 0.4970572590827942)]

[('workers', 0.611325740814209), ('employee', 0.5983108878135681), ('working', 0.5615329742431641), ('laborer', 0.5442320108413696), ('unemployed', 0.536851704120636), ('job', 0.5278826355934143), ('work', 0.5223963856697083), ('mechanic', 0.5088937282562256), ('worked', 0.5054520964622498), ('factory', 0.4940454363822937)]
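
A simplified illustration of the idea behind a `positive`/`negative` query: add the (unit-normalized) positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. This is a sketch and not gensim's implementation. The function `toy_most_similar` and the two-dimensional vectors are hypothetical, and the toy vectors have the gender association built in by construction, so the sketch shows only the mechanism, not evidence of bias.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


# Hypothetical 2-d vectors: axis 0 ~ "gender", axis 1 ~ "occupation".
toy = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "worker": [0.0, 1.0],
    "nurse": [-0.7, 0.7],
    "mechanic": [0.7, 0.7],
}


def toy_most_similar(positive, negative, vectors=toy):
    """Rank words (excluding the query words) by similarity to the
    combined query vector: sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, sign in [(w, 1) for w in positive] + [(w, -1) for w in negative]:
        for i, value in enumerate(unit(vectors[word])):
            query[i] += sign * value
    candidates = [w for w in vectors if w not in positive + negative]
    scored = [(w, cosine(query, vectors[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(toy_most_similar(["woman", "worker"], ["man"]))
print(toy_most_similar(["man", "worker"], ["woman"]))
```

In the toy space the first query ranks "nurse" highest and the second ranks "mechanic" highest, mirroring the kind of contrast the GloVe output above is being examined for.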



In [3]:
# Your answer goes here
# Note: before submitting the file, make sure the answer cell is a markdown cell

**(2)** Now, use the `most_similar` function to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.

In [4]:
# Note: before submitting the file, make sure this answer cell is a code cell
# write code
# put your explanation in the print statement:
# e.g. print("Explanation: xxxx")

print(wv_from_bin.most_similar(positive=['white', 'housing'], negative=['black']))
print()
print(wv_from_bin.most_similar(positive=['black', 'housing'], negative=['white']))
print()


[('apartments', 0.5706000924110413), ('residential', 0.5501178503036499), ('administration', 0.5297499895095825), ('homes', 0.5222154259681702), ('house', 0.5085390210151672), ('hud', 0.49044468998908997), ('houses', 0.48522552847862244), ('building', 0.48337918519973755), ('buildings', 0.47960007190704346), ('multifamily', 0.47858700156211853)]

[('residential', 0.5735326409339905), ('urban', 0.5647248029708862), ('apartments', 0.5408486127853394), ('sector', 0.5248250961303711), ('neighborhoods', 0.508280336856842), ('construction', 0.5015413165092468), ('rural', 0.49821096658706665), ('affordable', 0.49628469347953796), ('employment', 0.49382278323173523), ('communities', 0.4924072027206421)]



**(3)** Give one explanation of how bias gets into the word vectors. What is an experiment that you could do to test for or to measure this source of bias?

In [5]:
# Note: before submitting the file, make sure this answer cell is a markdown cell

## Problem 4: Multi-layer perceptron in sklearn (45 points)

(a) (9 points) Using the Logistic Regression model you implemented in homework 3 as a starting point, modify this to use the [Multi-layer Perceptron model provided by sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) as the learning algorithm. How do the two models compare in terms of predictive performance (accuracy)? 

In [6]:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

dataset = load_files("movie-reviews/")
# generate train/test subsets
docs_train, docs_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size = 0.2, random_state = 3)
vectorizer = CountVectorizer(stop_words = "english")
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

In [7]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
# You need to implement logistic regression model (lr_model) and multi-layer perceptron model (mlp_model) on the movie-reviews dataset.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

#train the LR model
lr_model = LogisticRegression(random_state=3, solver='liblinear')
lr_model.fit(X_train, y_train)
lr_y_scores = lr_model.predict(X_test)
accuracy_score_lr = metrics.accuracy_score(y_test, lr_y_scores)

#train the mlp model
mlp_model = MLPClassifier(random_state=3)
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)

print(accuracy_score_lr, accuracy_score_mlp)

0.8475 0.8725


In [8]:
assert(lr_model.solver == 'liblinear')
assert(lr_model.random_state == 3)
assert(mlp_model.random_state == 3)

(b) (9x4=36 points) Note that when you build the MLPClassifier, you will see a series of parameters printed out with the classifier. Consider the following parameters, modify them, and see how performance and/or training speed changes:

**Note** that you should modify _only one_ parameter at a time! Additionally,  clearly label each time you test a parameter.

(b.1) Turn early stopping on or off. **Also answer: what does early stopping do?**

In [9]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
mlp_model = MLPClassifier(random_state=3, early_stopping=True)
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)

0.8375
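
Part (b) asks about training speed as well as accuracy. A minimal sketch of one way to record both is shown below; the helper `fit_and_score` is a hypothetical name, and synthetic data from `make_classification` stands in for the movie-reviews matrices so that the block runs on its own. For context, with `early_stopping=True`, scikit-learn's `MLPClassifier` sets aside a fraction of the training data (`validation_fraction`, 0.1 by default) and stops training when the validation score has not improved by at least `tol` for `n_iter_no_change` consecutive epochs.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def fit_and_score(model, X_tr, y_tr, X_te, y_te):
    """Fit `model`, then return (test accuracy, fit seconds, epochs run)."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    accuracy = accuracy_score(y_te, model.predict(X_te))
    return accuracy, elapsed, model.n_iter_


# Synthetic stand-in data (assumption: NOT the movie-reviews dataset).
X, y = make_classification(n_samples=400, n_features=20, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

results = {}
for use_early_stopping in (False, True):
    model = MLPClassifier(random_state=3, max_iter=300,
                          early_stopping=use_early_stopping)
    results[use_early_stopping] = fit_and_score(model, X_tr, y_tr, X_te, y_te)
    accuracy, elapsed, epochs = results[use_early_stopping]
    print(f"early_stopping={use_early_stopping}: "
          f"accuracy={accuracy:.3f}, fit time={elapsed:.2f}s, epochs={epochs}")
```

The same helper could be reused for the other settings in part (b) by changing one constructor argument at a time.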




(b.2) Play with other activation functions. How does that affect performance? (identity, logistic, tanh)

In [10]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
mlp_model = MLPClassifier(random_state=3, activation='identity')
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)

0.865


In [14]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
mlp_model = MLPClassifier(random_state=3, activation='logistic')
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)

0.8725


In [15]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
mlp_model = MLPClassifier(random_state=3, activation='tanh')
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)

0.87


(b.3) By default, MLPClassifier uses the *adam* algorithm to find weights, which is an adaptive method with changing learning rates. You can also choose stochastic gradient descent (sgd). Try it; how does sgd perform, compared with adam?

In [13]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
mlp_model = MLPClassifier(random_state=3, solver='sgd')
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)

0.8325




(b.4) Experiment with different learning rates (parameter `learning_rate_init`), for example, try 0.1, 0.01, 0.0001.

In [16]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
mlp_model = MLPClassifier(random_state=3, learning_rate_init=0.1)
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)

In [None]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
mlp_model = MLPClassifier(random_state=3, learning_rate_init=0.01)
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)

In [None]:
# Note: before submitting this file, make sure the answer cell is a code cell
# put your text explanation in the print statement:
# e.g. print("xxx")
# 0.0001 as suggested in (b.4); 0.001 is already MLPClassifier's default
mlp_model = MLPClassifier(random_state=3, learning_rate_init=0.0001)
mlp_model.fit(X_train, y_train)
mlp_y_scores = mlp_model.predict(X_test)
accuracy_score_mlp = metrics.accuracy_score(y_test, mlp_y_scores)
print(accuracy_score_mlp)