# Natural Language Processing (72 pts)

For this problem set, you'll also need the file conll2003train.txt if you are working in Google Colab.

If you are working in the SCC:

* List of modules to load: miniconda academic-ml/fall-2025

* Pre-Launch Command: conda activate fall-2025-pyt 

* When you would load 'conll2003train.txt', instead load '/projectnb/ds340/materials/conll2003train.txt'.

In [None]:
# ===> You need to **restart the notebook** after running this cell to get the right module versions
!pip install gensim

# Part I.  Named Entity Recognition (25 pts)

Named entity recognition (NER) is a classic NLP task where proper nouns and their types must be extracted from text.  The CONLL 2003 dataset labels entities in text as PER (person), ORG (organization), LOC (location), MISC (proper noun but none of the above), or O (nothing).  A classifier trained on this data can label each word in a sentence as belonging to one of these categories.

In this section, we'll use word2vec vectors to classify each word.  Word2vec doesn't use any context from the rest of the sentence, but the task of identifying proper nouns as places or people may not need a lot of context.

In [None]:
# CONLL (Computational Natural Language Learning) 2003
# data from:
# https://data.deepai.org/conll2003.zip
# description of data:
# https://huggingface.co/datasets/eriktks/conll2003

from google.colab import files
# Pick conll2003train.txt for full training
uploaded = files.upload()


In [None]:
!ls

In [None]:
import csv

# Load the data as list of tuples of lists:
# in each tuple, the zeroth element is the list of words
# and the first element is the list of labels.
#
# Original data has line breaks to separate sentences and
# -DOCSTART- at the beginning of lines separating documents;
# on each line, item 0 is the word and item 3 is the NER label.
# We don't care about the distinction between B- and I-
# (begin and intra) for the NER labels, so we just keep
# the category and turn it to a number
# (0 = nothing, 1 = PER, 2 = ORG, 3 = LOC, 4 = MISC)
def load_ner_data(filename):
  lines = []
  with open(filename, mode='r') as myfile:
    spacereader = csv.reader(myfile, delimiter=' ')
    working_sentence = []
    working_ner_tags = []
    for row in spacereader:
      if len(row) == 0:
        if len(working_sentence) > 0:
          lines.append((working_sentence, working_ner_tags))
          working_sentence = []
          working_ner_tags = []
      elif len(row) == 4:
        if row[0] != '-DOCSTART-':
          working_sentence.append(row[0])
          working_ner_tags.append(process_ner_tag(row[3]))
  return lines

def process_ner_tag(tag):
  if tag == 'O': # Not 0, for some odd reason...
    return 0
  tag = tag[2:] # Ignore B-, I-
  tag_dict = {
      'PER': 1,
      'ORG': 2,
      'LOC': 3,
      'MISC': 4
  }
  # Intentionally error if we get none of the above
  return tag_dict[tag]

The following line should load the relevant data as a list of tuples, where the first element of each tuple is a list of words in a sentence, and the second element of each tuple is a list of the words' proper numerical labels.

In [None]:
all_tuples = load_ner_data('conll2003train.txt')

In [None]:
len(all_tuples)

In [None]:
print(all_tuples[0])

For faster processing, we'll just work with the first 1000 sentences.

In [None]:
MAX_SENTENCES = 1000
tuples = all_tuples[:MAX_SENTENCES]

I.1, 10 points) Write a function words_to_word2vec_matrix() that converts the conll2003 data into a feature matrix and label array of the kind expected by scikit-learn.

The first argument to the function should be a list of tuples of the kind produced by load_ner_data().

The second argument to the function should be a model of the kind returned by gensim.downloader.load().  A call creating one of these objects has been provided.

The first return value should be a $W \times 300$ feature matrix, where $W$ is the number of words in all the input sentences combined, and 300 is the number of elements in each vector returned by the word vector model.  If a word can't be found in the word model, the corresponding row should be all zeros.

The second return value should be a 1d $W$-element array of the labels of the words.

In [None]:
# Approach #1:  Convert every word to a vector using word2vec,
# and train a scikit-learn classifier on these vectors.

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

In [None]:
# TODO words_to_word2vec_matrix(tuple_list, wv)

In [None]:
# Test of words_to_word2vec_matrix
features, labels = words_to_word2vec_matrix([(['Sonic', 'is', 'fast'], [1, 0, 0])], wv)
print(features.shape) # expect (3, 300)
print(labels.shape) # expect (3,)

I.2, 5 points) Now, perform a train/test split on your feature matrix and labels (with test_size = 0.1 and random_state=340) and measure the accuracy of a RandomForestClassifier with 200 estimators (and also random_state=340, other settings default) that uses your word2vec matrix as its features.  You can expect roughly 94% accuracy.

In [None]:
# TODO Train a random forest on the matrix
# Use random seed 340 for train_test_split() and random forest;
# test size 10% for train_test_split; 200 trees for forest
# and otherwise defaults

I.3, 4 pts) This is a task where "not a proper noun" is a common category and a pretty good guess, inflating accuracy.  Call sklearn.metrics.precision_recall_fscore_support to see precision, recall, and f-scores for each class.
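
To see why accuracy alone can mislead here, the following is a minimal sketch with made-up labels (not CONLL data) that computes per-class precision and recall by hand.  The helper name `precision_recall` is hypothetical; in the assignment itself you would call sklearn.metrics.precision_recall_fscore_support instead.

```python
# Toy labels, made up for illustration: 0 = not an entity, 1 = PER, 3 = LOC
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 3, 3]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 3, 0]

def precision_recall(y_true, y_pred, cls):
    """Return (precision, recall) for one class, computed from raw counts."""
    # True positives: the class was predicted and was also the correct answer
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)  # times the class was guessed
    actual = sum(1 for t in y_true if t == cls)     # times the class was correct
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
for cls in sorted(set(y_true)):
    print("class", cls, "precision, recall:", precision_recall(y_true, y_pred, cls))
```

In this toy data the classifier guesses the common class 0 whenever it is unsure, so accuracy stays fairly high even though half of the PER and LOC words are missed.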

In [None]:
# TODO precision, recall, f-scores

I.4, 6 pts)  Identify which class had the lowest *precision* and what it means to have that precision.  Then do the same for *recall*.

**TODO precision**

**TODO recall**

# Part II.  Attention (15 pts)

We are now going to walk through an example of how attention could be computed in a sentence.  This omits the multiplication by learned matrices, but captures how the main mechanism of attention alters word embeddings.

II.1, 2 pts)  Use the word vector model that you used in the last problem to look up a list of vectors for the following two sentences:

* "Turkey closed its borders today."
* "Turkey is a Thanksgiving tradition."

As before, if a word isn't in the model, use a 300 element zero vector.

In [None]:
import numpy as np

sentence1 = ['Turkey', 'closed', 'its', 'borders', 'today']
sentence2 = ['Turkey', 'is', 'a', 'Thanksgiving', 'tradition']

# TODO transform to lists of vectors

II.2, 3 pts) For just the word "Turkey", find the dot product of its vector with each other vector in each sentence.  Report which word has the largest dot product in each sentence (besides the word itself).

In [None]:
# TODO sentence 1 dots

**TODO note word with largest dot**

In [None]:
# TODO sentence 2 dots

**TODO note word with largest dot**

II.3, 3 pts) Use the softmax formula, $\frac{e^{x_i}}{\sum_j e^{x_j}}$, on each element $x_i$ of each of the dot product lists to create a list of weights that sum to 1 in each case.  Don't include the "Turkey" vector dot product in the calculation.
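
As a rough illustration of the formula, here is a small softmax sketch in plain Python.  The scores are made-up numbers rather than real dot products, and subtracting the maximum before exponentiating is an optional numerical-stability step that leaves the result unchanged.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    # Shifting by the max avoids overflow in exp() and does not change the output
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Four made-up scores, standing in for the four non-"Turkey" dot products
toy_scores = [1.0, 2.0, 0.5, 3.0]
weights = softmax(toy_scores)
print(weights)
print(sum(weights))
```

The largest score receives the largest weight, and the gap between weights grows quickly as the gap between scores grows.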

In [None]:
# TODO softmax result 1

In [None]:
# TODO softmax result 2

II.4, 3 pts) Create a single vector for each sentence that is Turkey's attention vector:  the weighted sum of the four vectors that don't correspond to the word "Turkey", where the weights were created by the softmax.
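
The weighted sum can be sketched on toy data as follows.  The 3-element vectors and the weights are invented for illustration (the real vectors have 300 elements and the real weights come from your softmax), and `weighted_sum` is a hypothetical helper name.

```python
# Four made-up 3-element "word vectors" and four weights that sum to 1
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
toy_weights = [0.1, 0.2, 0.3, 0.4]

def weighted_sum(vectors, weights):
    """Combine several vectors into one, scaling each by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]

attention_vector = weighted_sum(toy_vectors, toy_weights)
print(attention_vector)  # approximately [0.5, 0.6, 0.7]
```

The result has the same number of elements as each input vector, so it can be added to the original word's vector.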

In [None]:
# TODO attention vector 1

In [None]:
# TODO attention vector 2

II.5, 4 pts) Run the classifier from Part I on 3 vectors:  the plain "Turkey" vector, the "Turkey" vector with WEIGHT times the sentence 1 attention vector added, and the "Turkey" vector with WEIGHT times the sentence 2 attention vector added.  Experiment with values of WEIGHT until you find a setting where the first sentence's attention-modified Turkey vector is classified as a location, but the second is not.  (The learned value matrices in actual attention can accomplish what WEIGHT is doing here, and more.)

In [None]:
# TODO code that finds weight that causes classifier to label one Turkey a location, the other not

# Part III.  Using a pretrained language model "off-the-shelf" (24 points)

Now, we'll try using a model that is a step up from word vectors - a BERT model that has been trained to produce a vector for each word in the sentence that is informed by attention.  We'll also change tasks, since the Google-News-trained word2vec seemed pretty good already for the CONLL2003 NER task.

The JNLPBA dataset is like CONLL 2003, but labels words as to whether they are words for DNA, RNA, proteins, cell lines, or cell types.

In [None]:
"""
Dataset description: https://huggingface.co/datasets/jnlpba/jnlpba

NLP for labeling biological terms.  Original labels (thanks to
https://medium.com/@raj.pulapakura/fine-tune-your-own-bert-token-classification-model-06b1153fbf56):

    0: O => ordinary word
    1: B-DNA => beginning of a “DNA” term
    2: I-DNA => continuation of a “DNA” term
    3: B-RNA => beginning of an “RNA” term
    4: I-RNA => continuation of an “RNA” term
    5: B-protein => beginning of a “protein” term
    6: I-protein => continuation of a “protein” term
    7: B-cell_line => beginning of a “cell line” term
    8: I-cell_line => continuation of a “cell line” term
    9: B-cell_type => beginning of a “cell type” term
    10: I-cell_type => continuation of a “cell type” term

    We will lump B- and I- labels together - it'll be an easier task if
    the classifier doesn't have to figure out word position in the sentence.
"""

In [None]:
!pip install transformers datasets evaluate seqeval

In [None]:
# This time, we'll also make use of Huggingface Datasets.

from datasets import load_dataset

raw_dataset = load_dataset("siddharthtumre/jnlpba-split")

Let's take a look at what a HuggingFace dataset looks like:

In [None]:
raw_dataset

In [None]:
raw_dataset['train'][0]

In [None]:
# Some BERT code adapted from
# https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

import transformers as ppb

WEIGHTS = 'distilbert-base-uncased'
def get_tokenizer():
    return ppb.DistilBertTokenizer.from_pretrained(WEIGHTS)

tokenizer = get_tokenizer()

Note that the BERT tokenizer may break a word into parts if it doesn't recognize the whole word.  "Ohioization" becomes two tokens, "Ohio" and "##ization."

The tokenizer thinks of everything as a sentence and concatenates begin and end tokens to its tokenizations.  The following function strips these.
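
To give a feel for how a word can be broken into parts, here is a toy greedy longest-match splitter in the style of WordPiece.  The function `wordpiece_split` and the tiny vocabulary are hypothetical and only illustrate the idea; the real tokenizer has its own vocabulary and preprocessing (for example, lowercasing in an uncased model).

```python
def wordpiece_split(word, vocab):
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary for illustration only
toy_vocab = {"Ohio", "##ization", "##ation", "the"}
print(wordpiece_split("Ohioization", toy_vocab))
```

A consequence worth keeping in mind is that one word in the dataset may correspond to several token ids, while the dataset provides only one label per word.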

In [None]:
# Convert a word into BERT tokens
def get_tokens(word, tokenizer):
    token_list = tokenizer.encode(word)
    return token_list[1:-1] # Strip begin and end tokens

III.1, 9 points) Write a function dataset_to_bert_input_and_labels that takes one of the Huggingface datasets (the 'train' set), the tokenizer, and a maximum number of sentences.  Its first return value should be a 2D array with one row per sentence in the data (or max_sentences rows, whichever is smaller) and as many columns as the longest token list needs, plus two.  Each token list should start with 101 (the CLS token) and end with 102 (the end token) - hence the +2.

In addition, force all the B- labels (odd numbers) to be the corresponding I- labels (even numbers one larger).

You can construct your first output as a list-of-lists at first, and in a second pass through the sentences, pad each of your lists with 0's, so that they are all the same length.  Then convert to 2D array.

So, for example, `dataset_to_bert_input_and_labels(raw_dataset['train'], tokenizer, 2)` should produce two return values - the first, a 2D array that looks like np.array([[101, ..., 0], [101, ..., 102]]), and the second, a list of label lists where the two lists are composed of 8's, 10's, and 0's.
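
The padding and label-merging steps described above can be sketched on toy data like this.  The token ids and labels are made up, `merge_b_into_i` is a hypothetical helper name, and this is not a full solution since it skips tokenization entirely.

```python
# Made-up token id lists, each already wrapped in 101 (CLS) and 102 (end)
toy_token_lists = [
    [101, 7592, 102],
    [101, 2023, 2003, 1037, 102],
]

# Second pass: pad every list with 0's up to the longest length
max_len = max(len(tokens) for tokens in toy_token_lists)
padded = [tokens + [0] * (max_len - len(tokens)) for tokens in toy_token_lists]
print(padded)

def merge_b_into_i(label):
    """Map an odd B- label to the even I- label that is one larger."""
    return label + 1 if label % 2 == 1 else label

toy_labels = [0, 5, 6, 0, 9]
merged = [merge_b_into_i(label) for label in toy_labels]
print(merged)  # [0, 6, 6, 0, 10]
```

Once every row has the same length, the list of lists can be converted to a 2D array, and the 0 entries can later be recognized as padding.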

In [None]:
# Turn the lists of words into padded arrays of token numbers
# TODO dataset_to_bert_input_and_labels(dataset, tokenizer, max_sentences)


In [None]:
dataset_to_bert_input_and_labels(raw_dataset['train'], tokenizer, 2) # See instructions

The following code should then produce a $800 \times 180 \times 768$ tensor - 800 sentences with at most 180 tokens, each of which has a 768 element vector associated with it.  This may take a little while as each token list is run through the pretrained BERT network.

In [None]:
import torch

# Grab a trained DistilBERT model
def get_model():
    return ppb.DistilBertModel.from_pretrained(WEIGHTS)

def get_bert_vectors(model, padded_tokens):
    # Mask the 0's padding from attention
    mask = torch.tensor(np.where(padded_tokens != 0, 1, 0))
    with torch.no_grad():
        word_vecs = model(torch.tensor(padded_tokens).to(torch.int64), attention_mask=mask)
    # The middle index of the return value determines which word we're talking about
    # (starting with 1 since 0 is the CLS token)
    return word_vecs[0][:,:,:].numpy()

train_input, labels = dataset_to_bert_input_and_labels(raw_dataset['train'], tokenizer, 800)
model = get_model()
bert_result = get_bert_vectors(model, train_input)

In [None]:
print(bert_result.shape) # Expect (800, 180, 768)

III.2, 3 points) In your own words, what is the use of the vector associated with the 0th token, the CLS token?

**TODO**

III.3, 8 points) Now you'll write the glue that connects the BERT part of the pipeline to some off-the-shelf ML.  Write `labels_and_bert_to_sklearn(labels, bert_result)`, which takes the label list-of-lists you produced earlier and the tensor that was the result of the get_bert_vectors() call, and produces a single $W \times 768$ features matrix and a $W$-element labels array, such that both could be used as features and labels in scikit-learn.  (By $W$, we mean the total number of words in the data.)

In [None]:
# Take a list of label lists from dataset_to_bert_input_and_labels()
# and a bert result, and create the scikit-learn features and label list
# TODO labels_and_bert_to_sklearn(labels, bert_result)

In [None]:
bert_features_train, bert_labels_train = labels_and_bert_to_sklearn(labels, bert_result)

III.4, 4 points) Call the following code block to get test data as well.  Then train a scikit-learn RandomForestClassifier with 200 estimators and random state 340 - this will take about 6 minutes on Colab - and evaluate the classifier on the test set.  You can expect an accuracy of about 80%.

In [None]:
test_input, test_labels = dataset_to_bert_input_and_labels(raw_dataset['validation'], tokenizer, 100)
bert_result_test = get_bert_vectors(model, test_input)
bert_features_test, bert_labels_test = labels_and_bert_to_sklearn(test_labels, bert_result_test)

In [None]:
# TODO scikit-learn on the bert vectors

# Part IV.  Off-the-shelf fine-tuned model (6 points)

If you want to fine-tune a BERT model, you can follow a web tutorial [here](https://learnopencv.com/fine-tuning-bert/#aioseo-fine-tuning-bert-on-the-arxiv-abstract-classification-dataset), but this takes a while.  Let's just see how much better we can do if we have a fine-tuned BERT model, versus the word2vec approach we started with.  For common datasets like CONLL2003, it's possible to find models others have already trained on the HuggingFace website.

IV.1, 6 pts) Make a prediction to yourself about what kind of F1 scores you might expect to see from this large fine-tuned transformer model.  Then run the two code boxes below to load a fine-tuned CONLL2003 NER model from HuggingFace ("dbmdz/bert-large-cased-finetuned-conll03-english") and evaluate it on the CONLL2003 test data.  For each class (besides O = ordinary), compare the f1 score to the f1 score for the same class in the word2vec classifier of Part I.

* Roughly how much of a bump in f1 score do we see for each classification, and on average?

* Is the model's final performance better or worse than you expected, and why?

In [None]:
# More on fine-tuning token-classification models at https://medium.com/@raj.pulapakura/fine-tune-your-own-bert-token-classification-model-06b1153fbf56

from transformers import pipeline

token_classifier = pipeline(
  "token-classification",
  "dbmdz/bert-large-cased-finetuned-conll03-english",
  aggregation_strategy="simple",  # replaces the deprecated grouped_entities=True
)

In [None]:
# https://huggingface.co/docs/evaluate/en/base_evaluator

from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

data = load_dataset("eriktks/conll2003", split="test", revision="refs/convert/parquet").shuffle(seed=340).select(range(1000))
task_evaluator = evaluator("token-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline="dbmdz/bert-large-cased-finetuned-conll03-english",
    data=data,
    metric="seqeval"
)

print(eval_results)

* **TODO F1 differences**
* **TODO compare to your predictions**

# AI Statement (2 pts)

Please briefly describe whether and how you used generative AI for this assignment.  You will not be penalized for your answer - this is mostly so the course can adapt to AI use.

**TODO**