
# Assignment 1: Tri-gram Language Model and NER Tagging
Welcome to your first assignment of CSE-476! Your goal in this assignment is to implement a trigram language model, and then use its predictions as features to train an NER model with the provided perceptron implementation.

In [1]:
'''
Initial loading of the data file and the NLTK tokenizer.
Please do not modify this section.
You need to run this first.
'''
import nltk
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from collections import defaultdict
nltk.download('punkt_tab')
nltk.download('brown')

# load brown corpus
def load_corpus():
    corpus = list(brown.sents())
    for i in range(len(corpus)):
        corpus[i] = " ".join(corpus[i])
    return corpus

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


## Task 1: Implement the TrigramLM class.

You are provided with some starting code. You are free to modify the starting code, as long as you meet all requirements as specified by the comments below, and your class can be used in the following way:


```
lm = TrigramLM("vocab.txt")
lm.train()
ranking = lm.next_word_ranking("this is a")
# Expected format of 'ranking':
# [("good", 0.04), ("matter", 0.03)....]
```

A quick review of a key concept:

- Unknown tokens are tokens that are not in the vocabulary.
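
As a small illustration of the point above, the sketch below maps out-of-vocabulary tokens to an `<UNK>` placeholder. The toy vocabulary and the helper name `replace_unknown` are assumptions made for this example, and whitespace splitting stands in for NLTK's `word_tokenize()` so the snippet runs without any downloads.

```python
# Toy vocabulary (assumption); the real one is loaded from vocab.txt.
vocab = {"this", "is", "a", "good", "day"}
UNK = "<UNK>"

def replace_unknown(tokens, vocab, unk=UNK):
    """Lower-case each token and replace anything outside the vocabulary with unk."""
    return [t.lower() if t.lower() in vocab else unk for t in tokens]

tokens = "This is a splendid day".split()
print(replace_unknown(tokens, vocab))
# prints ['this', 'is', 'a', '<UNK>', 'day']
```

Keeping the vocabulary in a `set` makes each membership check constant-time, which may matter when tokenizing a corpus as large as Brown.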

In [2]:
class TrigramLM:

    def __init__(self, vocab_file):
        self.vocabulary = []
        self.bigram_count_table = defaultdict(int)
        '''
        TODO
        Other than the given bigram_count_table, what else would you need?
        '''

        # We also need trigram counts to implement a trigram LM
        self.trigram_count_table = defaultdict(int)

        # For convenience, also keep counts of unigrams
        self.unigram_count_table = defaultdict(int)
        self.start_vocab = "<start>"
        self.end_vocab = "<end>"
        self.unknown_vocab = "<UNK>"
        self.load_vocab(vocab_file)

    def load_vocab(self, vocab_file):
        with open(vocab_file, 'r') as f:
            for line in f:
                self.vocabulary.append(line.strip())
        self.vocabulary.append(self.unknown_vocab)
        self.vocabulary.append(self.start_vocab)
        self.vocabulary.append(self.end_vocab)
        print(f"vocab loaded, size = {len(self.vocabulary)}")

    '''
    TODO
    Implement the tokenize function.
    @text is a string, e.g., text="Today is a good day"
    Return a list of strings of tokens, such as ["today", "is"...]
    1. You MUST use NLTK's word_tokenize() function to split text into tokens.
    2. You MUST implement an uncased LM. That is, the vocabulary entries in the given file are all lower-cased, so you should lower-case all tokens here too.
    3. Think about what you need to do with unknown_vocab.
    '''
    def tokenize(self, text):
        # Your code here.
        tokens = word_tokenize(text)               # 1) tokenization
        tokens = [t.lower() for t in tokens]       # 2) lower-casing
        # 3) handle unknown tokens
        processed_tokens = []
        for t in tokens:
            if t in self.vocabulary:
                processed_tokens.append(t)
            else:
                processed_tokens.append(self.unknown_vocab)
        return processed_tokens

    '''
    TODO
    Finish implementing the training function.
    This function takes the corpus and iteratively builds all the counts that the model may need to rank next words.
    Think about what you need to do with the start_vocab and end_vocab when loading data.
    '''
    def train(self):
        corpus = load_corpus()
        print(f"corpus loaded, size = {len(corpus)}")
        # Your code here.

        for sent in corpus:
            tokens = self.tokenize(sent)
            tokens = [self.start_vocab, self.start_vocab] + tokens + [self.end_vocab]

            # update counts
            for i in range(len(tokens)):
                self.unigram_count_table[tokens[i]] += 1
                if i < len(tokens) - 1:
                    bigram = (tokens[i], tokens[i+1])
                    self.bigram_count_table[bigram] += 1
                if i < len(tokens) - 2:
                    trigram = (tokens[i], tokens[i+1], tokens[i+2])
                    self.trigram_count_table[trigram] += 1


    '''
    Implement the function that produces a list of the top_n next words, given 'prior_context', with their probabilities.
    @prior_context: a string of the current context, e.g., "This is a", and the function tries to predict the next word.
    @top_n: returns the top-N most likely words.
    Return a list of top_n words with their probabilities, in the format of [("good", 0.04), ("matter", 0.03)....]
    Think about what you need to do for start_vocab and end_vocab.
    '''
    def next_word_ranking(self, prior_context, top_n=10):
        # Your code here.

        # 1) tokenize prior_context
        context_tokens = self.tokenize(prior_context)

        # If not enough tokens, pad with <start>
        if len(context_tokens) < 2:
            # We will pad at the front so we can always index last two
            context_tokens = [self.start_vocab]*(2 - len(context_tokens)) + context_tokens

        # 2) get the last two tokens
        token_minus_2, token_minus_1 = context_tokens[-2], context_tokens[-1]

        # 3) compute probabilities for each word in the vocabulary
        candidates = []
        bigram_count = self.bigram_count_table[(token_minus_2, token_minus_1)]
        if bigram_count == 0:
            # If we never saw this bigram in training, all next words have prob = 0
            return []

        for w in self.vocabulary:
            trigram = (token_minus_2, token_minus_1, w)
            trigram_count = self.trigram_count_table[trigram]
            if trigram_count > 0:
                # probability = trigram_count / bigram_count
                prob = trigram_count / bigram_count
                candidates.append((w, prob))

        # sort by probability descending
        candidates.sort(key=lambda x: x[1], reverse=True)
        # return top_n
        return candidates[:top_n]
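
The ranking above relies on the maximum-likelihood estimate P(w | u, v) = count(u, v, w) / count(u, v). The following is a minimal, self-contained sketch of that estimate on a three-sentence toy corpus. The helper names `count_ngrams` and `trigram_prob` and the toy sentences are assumptions for illustration and are not part of the assignment code.

```python
from collections import defaultdict

def count_ngrams(sentences):
    """Count bigrams and trigrams over sentences padded with two <start> tokens and one <end>."""
    bigrams, trigrams = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tokens = ["<start>", "<start>"] + sent + ["<end>"]
        for i in range(len(tokens) - 1):
            bigrams[(tokens[i], tokens[i + 1])] += 1
        for i in range(len(tokens) - 2):
            trigrams[(tokens[i], tokens[i + 1], tokens[i + 2])] += 1
    return bigrams, trigrams

def trigram_prob(bigrams, trigrams, u, v, w):
    """Maximum-likelihood estimate P(w | u, v) = count(u, v, w) / count(u, v)."""
    context_count = bigrams.get((u, v), 0)
    if context_count == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return trigrams.get((u, v, w), 0) / context_count

toy_corpus = [
    ["this", "is", "a", "test"],
    ["this", "is", "a", "cat"],
    ["this", "is", "fine"],
]
bigrams, trigrams = count_ngrams(toy_corpus)
print(trigram_prob(bigrams, trigrams, "is", "a", "test"))  # 1 of the 2 "is a" contexts
print(trigram_prob(bigrams, trigrams, "this", "is", "a"))  # 2 of the 3 "this is" contexts
```

Padding with two `<start>` tokens means the first real word of a sentence is also predicted from a two-token context, and the single `<end>` token lets the model assign probability to ending a sentence.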

## Task 2: Using the TrigramLM Predictions as Feature for NER.
You are given a data loading helper function that loads the training data for NER.

In [3]:
!pip install datasets
from datasets import load_dataset

'''
Loading NER training data.
Please do not modify this section.
'''

def load_conll2003(split):
    # ignore other tags
    NER_tags={
        0: "O",
        1: "B-PER",
        2: "I-PER",
        3: "B-ORG",
        4: "I-ORG",
        5: "B-LOC",
        6: "I-LOC",
    }
    dataset = load_dataset("eriktks/conll2003")
    data = []
    for text in dataset[split]:
        if len(data) > 10000:
            break
        for i in range(len(text["tokens"])):
            token = text["tokens"][i]
            tag = text["ner_tags"][i]
            if tag not in NER_tags:
                tag = NER_tags[0]
            else:
                tag = NER_tags[tag]
            data.append((token, tag))
    return data

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 25.8 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 9.3 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl (

You are asked to finish implementing the NERTagger class. You can modify the contents of this class, as long as you meet the requirements specified by the comments below, and your NERTagger can be used as follows:

```
lm = TrigramLM("vocab.txt")
lm.train()
tagger = NERTagger(lm)
tagger.train()
ner_output = tagger.predict("John works at Microsoft and he loves it.")

```



In [4]:
from sklearn.linear_model import Perceptron
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
import numpy as np


def get_training_data():
    return load_conll2003("train")


class NERTagger:
    def __init__(self, trained_lm):
        self.lm = trained_lm
        # For more documentation on this usage, refer to
        # https://scikit-learn.org/dev/modules/generated/sklearn.pipeline.make_pipeline.html
        self.model = make_pipeline(DictVectorizer(), Perceptron(max_iter=1000, early_stopping=True), verbose=True)

    """
    TODO
    Trains the perceptron model on the training data.
    You will need to implement the missing part that extracts the features and labels from the training data.
    """
    def train(self):
        # training_data returns a list of (token, label) tuples.
        training_data = get_training_data()
        features = []
        labels = []

        # Only the last two LM tokens of the context affect the trigram ranking,
        # so a short window of preceding tokens is enough; rebuilding the full
        # history for every position would be quadratic in the data size.
        for i in range(len(training_data)):
            # build context from the few tokens preceding position i
            context_tokens = [tok for tok, _ in training_data[max(0, i - 5):i]]
            context = " ".join(context_tokens)
            word, label = training_data[i]
            feature = self._extract_features(context, word)
            features.append(feature)
            labels.append(label)

        self.model.fit(features, labels)


    """
    TODO
    Extracts features for the perceptron model.
    The features include:
    1) the top ten predictions (unordered) of the current word based on the context from your trained TrigramLM
    2) the current word
    Assume the context is "Jeff lives in" and the current_word is "Japan", and TrigramLM predicts the top next words to be ["us", "england", ...].
    The features should include "us_in_next_word": True, "england_in_next_word": True, ..., and either "current_word_is_japan": True or "current_word": "Japan".
    @context : str : The previous context in the sentence
    @current_word : str : The current word to be predicted (the next word of the context, not included in @context)
    Returns: dict : A dictionary of features
    """
    def _extract_features(self, context, current_word):
        features = {}

        # 1) top ten predictions from LM
        ranking = self.lm.next_word_ranking(context, top_n=10)
        for (w, prob) in ranking:
            features[f"{w}_in_next_word"] = True

        # 2) current word as a feature
        # e.g. feats["current_word"] = current_word
        features["current_word"] = current_word

        return features

    """
    TODO: fill some of the missing pieces.
    Predicts the named entities in the given sentence.
    @sentence : str : An input sentence for NER tagging
    Returns: list : A list of tuples containing the word and its predicted NER tag
    """
    def predict(self, sentence):

        words = word_tokenize(sentence)
        tags = []

        for i in range(len(words)):
            context = " ".join(words[:i])
            current_word = words[i]
            feature = self._extract_features(context, current_word)
            tag = self.model.predict([feature])[0]
            tags.append((current_word, tag))

        return tags

## Task 3: Self Evaluation

Congratulations! You have completed all the parts that implement the models. Now you need to do some testing to check whether your implementation is correct.

Your implemented LM or tagger may not perform very well because of the limited model and data sizes. There is no need to improve model performance as long as the basic implementation is correct; you will not receive any extra credit for doing so.


In [5]:
# train language model and tagger
lm = TrigramLM("vocab.txt")
lm.train()
ner_tagger = NERTagger(lm)
ner_tagger.train()

vocab loaded, size = 26997
corpus loaded, size = 57340


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

conll2003.py:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

The repository for eriktks/conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/eriktks/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

[Pipeline] .... (step 1 of 2) Processing dictvectorizer, total=   0.1s
[Pipeline] ........ (step 2 of 2) Processing perceptron, total=   0.1s


In [9]:
'''
You MUST not modify any of the functions in this section, except self_evaluate().
DANGER: Modifying anything outside of self_evaluate will lead to an automatic zero for this assignment.
'''
def make_prediction_lm(lm, data):
    predictions = []
    for context in data:
        pred = lm.next_word_ranking(context)
        if len(pred) == 0:
            pred = []
        else:
            pred = [x[0] for x in pred]
        predictions.append(pred)
    return predictions

def make_prediction_tagger(tagger, text):
    pred = ner_tagger.predict(text)
    return [d[1] for d in pred]

# load eval news articles
def load_eval_news_data():
    texts = []
    with open("eval_news.txt", "r") as f:
        for line in f:
            texts.append(line.strip())
    data = []
    # tokenize and create a list of contexts
    for text in texts:
        tokens = word_tokenize(text.lower())
        for i in range(2, len(tokens)):
            context = " ".join(tokens[:i])
            data.append(context)
    print(f"data len = {len(data)}")
    return data

# run lm and tagger and generate prediction files.
def evaluate(lm, tagger):
    # evaluate lm
    pred_lm = make_prediction_lm(lm, load_eval_news_data())
    with open("lm_predictions.txt", "w") as f:
        for pred in pred_lm:
            f.write(" ".join(pred) + "\n")

    # evaluate tagger
    eval_ner_data = load_conll2003("test")[:200]
    text = " ".join([x[0] for x in eval_ner_data])
    pred_ner = make_prediction_tagger(tagger, text)
    with open("ner_predictions.txt", "w") as f:
        for pred in pred_ner:
            f.write(pred + "\n")

'''
Feel free to modify this part to check if your lm and NER tagger are doing the right thing.
Note that this part will not be graded.
'''
def self_evaluate(lm, tagger):
    # evaluate lm on toy data
    pred_lm = make_prediction_lm(lm, ["A", "A student", "A student at"])
    # expected output
    labl_lm = [['<UNK>', 'few', 'man', 'new', 'second', 'number', 'good', 'little', 'third', 'couple'],
            ['at', 'council', 'in', 'orator', 'was', 'of', 'you', 'organization', 'who', 'to'],
            ['the', '<UNK>', 'georgia', 'arms', 'harvard', 'brown']]
    assert len(pred_lm) == len(labl_lm)
    for i in range(3):
        assert pred_lm[i] == labl_lm[i]

    # evaluate ner tagger
    pred_tagger = make_prediction_tagger(tagger, "Jeff lives in Japan")
    # expected output
    labl_tagger = ['B-PER', 'O', 'O', 'B-LOC']
    assert pred_tagger == labl_tagger
    print("self evaluation passed!")

In [10]:
# self evaluate
self_evaluate(lm, ner_tagger)

AssertionError: 

The final step is to run the evaluate() function below to generate lm_predictions.txt and ner_predictions.txt. These two files will need to be submitted.

In [11]:
# generate prediction files
evaluate(lm, ner_tagger)

data len = 207


## Final Question
Is accuracy a good metric for NER? Why or why not? What other metrics should we use to better evaluate model performance? Write your response below.

--> Accuracy is often not a good metric for NER, for two reasons. First, there is severe class imbalance: most tokens carry the "O" (non-entity) tag, so a model that naively predicts "O" for every token can still achieve high accuracy. Second, NER cares about the exact boundaries and types of named entities, and raw token-level matches can hide an entity whose boundary or type is wrong. We typically use precision, recall, and especially F1 score at the entity level. Precision measures how many of the predicted entities are correct, and recall measures how many of the true entities are found, which gives a clearer picture of real performance on the NER task.
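
To make the contrast concrete, here is a minimal sketch that compares token accuracy with entity-level precision, recall, and F1 on a made-up BIO sequence. The helper names `extract_entities` and `entity_prf` and the toy tags are assumptions for illustration; the sketch uses exact-match scoring on (start, end, type) spans and treats a stray `I-` tag as the start of a new entity, which is one possible convention rather than the only one.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence (end exclusive)."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, ent_type = i, tag[2:]
    return entities

def entity_prf(gold, pred):
    """Exact-match entity-level precision, recall, and F1."""
    gold_ents, pred_ents = extract_entities(gold), extract_entities(pred)
    tp = len(gold_ents & pred_ents)
    precision = tp / len(pred_ents) if pred_ents else 0.0
    recall = tp / len(gold_ents) if gold_ents else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def token_accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O", "O", "O"]
all_o = ["O"] * len(gold)                      # predicts no entities at all
partial = ["B-PER", "O"] + gold[2:]            # truncates the PER entity

print(token_accuracy(gold, all_o), entity_prf(gold, all_o))
# prints 0.7 (0.0, 0.0, 0.0)
print(token_accuracy(gold, partial), entity_prf(gold, partial))
# prints 0.9 (0.5, 0.5, 0.5)
```

The all-"O" prediction scores 70% token accuracy while finding no entities, and the truncated PER span costs only one token of accuracy but a full entity under exact-match scoring.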

## Final Submission
Please answer the final question above and submit the completed notebook with intermediate running logs, as well as lm_predictions.txt and ner_predictions.txt.

