# LAB 3: Natural Language Generation

In [1]:
!git clone https://github.com/elenipapadopulos/NLP_LAB3_Datasets.git

Cloning into 'NLP_LAB3_Datasets'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 3), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (11/11), 6.23 KiB | 6.23 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt
import math
import pandas as pd

In this exercise, you are asked to use a Large Language Model (GPT2) to detect typos in sentences.

Let's start by loading our dataset: it is a collection of 52 sentences of various types, half of which are grammatically sound.

In [3]:
df = pd.read_csv("/content/NLP_LAB3_Datasets/typo_dataset1.csv")

You can see that these sentences contain typos (spelling errors or missing letters) as well as grammatical errors such as misused homophones.

In [4]:
print(f"Text: {df['text'][26]} \nLabel: {df['label'][26]} (Typo)")

Text: <s> If the owewr of the vehicle is not licensed to drive, the owner's license to drive mya be suspended. 
Label: 0 (Typo)


In [5]:
print(f"Text: {df['text'][21]} \nLabel: {df['label'][21]} (Correct)")

Text: <s> I finally submitted my project 💻 after working on it for days, and now I’m treating myself to some ice cream. 
Label: 1 (Correct)


Let's import GPT2.

In [6]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We will use the token `<s>` as our marker for the beginning of a sentence. First, we need to add it to the tokenizer's vocabulary, then set it as the `bos_token`, and finally resize the model's embedding matrix.


In [7]:
bos_token = "<s>"
tokenizer.add_tokens([bos_token])
tokenizer.bos_token = bos_token
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 768)

**Ex 3.1** Write a function that returns the log-probability assigned to each generated token.

You can use the `get_next_word_probs` function as a reference, but remember that this time we're focusing on the probability of each token **within** the sentence, rather than on the probability distribution of the next token. You can follow the comments we left in the box to guide you in the implementation.
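
The alignment this exercise relies on can be illustrated without a model. In a causal language model, the logits at position t score the token at position t+1, so the last row of logits is dropped and the first input token (the BOS marker) receives no score. The sketch below uses a made-up four-token vocabulary and hand-written logits; all values and the helper `log_softmax` are illustrative assumptions, not GPT-2 outputs.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

# Toy setup (assumed values): vocabulary ids 0..3, id 0 plays the role of <s>.
input_ids = [0, 2, 1, 3]          # <s> followed by three tokens
logits = [                        # one row per input position
    [0.1, 0.2, 2.0, 0.3],         # after <s>: scores for the 2nd token
    [0.0, 1.5, 0.1, 0.2],         # after token 2: scores for the 3rd token
    [0.3, 0.1, 0.2, 0.1],         # after token 1: scores for the 4th token
    [1.0, 1.0, 1.0, 1.0],         # after the last token: unused
]

targets = input_ids[1:]           # drop <s>: it is never predicted
rows = logits[:-1]                # drop the last row: it predicts beyond the sentence

# Row i of the shifted logits is paired with the token that actually follows it.
token_logprobs = [log_softmax(row)[tok] for row, tok in zip(rows, targets)]
print(token_logprobs)
```

Both lists end up one element shorter than the input, which is why the function below returns one log-probability per token after `<s>`.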


In [8]:
def get_token_logprobs(sentence):

    ## tokenize the sentence, compute the output and retrieve logits
    input_ids = tokenizer.encode(sentence, return_tensors='pt')
    with torch.no_grad():  # inference only: no gradients needed
        logits = model(input_ids).logits

    ## remember: logits.shape is (1, sen_len, vocab_size)
    ## remove the logits relative to the last token: we are not interested in next token generation
    ## expected shape: (sen_len - 1, vocab_size)
    logits = logits[0, :-1]

    ## retrieve the indices (input_ids) of the sentence
    ## hint: remove the input_id relative to the bos
    ## expected shape: (sen_len - 1)
    indices = input_ids[0, 1:]

    ## compute log-probabilities
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

    ## retrieve the log-probabilities of the tokens
    token_logprobs = log_probs[torch.arange(len(indices)), indices]

    ## convert input ids to tokens to obtain a list of tokens
    tokens = [tokenizer.decode(idx) for idx in indices]

    return tokens, token_logprobs.tolist()

**Ex 3.2** Write a function that returns the cumulative log-probability **up to each token** in a sentence, using the probabilities computed before.

Recall that, since we are working with log-probabilities, you should **sum** the log-probabilities of the individual tokens.

Hint: you could return a list of elements like (w, cumulative_probability_up_to_w)
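
Summing works because the probability of a sequence factorizes into a product of conditional probabilities, and the logarithm turns that product into a sum. A small check with made-up probabilities (the values below are assumptions for illustration only):

```python
import math
from itertools import accumulate

# Hypothetical conditional probabilities of three consecutive tokens.
token_probs = [0.5, 0.25, 0.1]
token_logprobs = [math.log(p) for p in token_probs]

# Running sum: entry i is the log-probability of the prefix ending at token i.
cumulative = list(accumulate(token_logprobs))

# Exponentiating the last entry recovers the product of the probabilities.
joint = math.exp(cumulative[-1])
print(cumulative, joint)
```

Since every log-probability is at most zero, the running sum can only decrease as the sentence grows.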


In [9]:
def get_cumulative_token_logprobs(sentence):

    tokens, token_logprobs = get_token_logprobs(sentence)

    # running sum of log-probabilities, paired with the token it ends on
    cumulative_logprobs = []
    cum_sum = 0.0
    for token, logprob in zip(tokens, token_logprobs):
        cum_sum += logprob
        cumulative_logprobs.append((token, cum_sum))

    return cumulative_logprobs

**Ex 3.3** Write a function that determines whether a sentence contains typos or not based on the difference of log-probabilities of consecutive tokens/words. The hypothesis is that a significant drop in probability between consecutive tokens may indicate that the model did not expect that token, potentially signaling a typo.

To implement this, we can define a threshold: if the difference between the log-probabilities of consecutive tokens exceeds this threshold, we can assume the sentence likely contains a typo.

The function should take as input the sentence to be analyzed and the threshold to apply for detection, and it should return 0 if the sentence is flagged as potentially containing typos or 1 if the sentence is considered correct.

In the function you should compare consecutive log-probabilities and check whether, for at least one pair, their difference is above the set threshold: in that case, classify the whole sentence as incorrect.
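
The decision rule can be tried on hand-written numbers before involving the model. The sketch below assumes a list of per-token log-probabilities is already available; `has_large_drop` is a hypothetical helper and the values are invented.

```python
def has_large_drop(token_logprobs, threshold):
    """Return 0 (typo suspected) if any consecutive pair differs by more than threshold, else 1."""
    for prev, curr in zip(token_logprobs, token_logprobs[1:]):
        if abs(curr - prev) > threshold:
            return 0
    return 1

# Invented per-token log-probabilities; in `bumpy` the third token is very unexpected.
smooth = [-2.1, -3.0, -1.5, -2.7]
bumpy = [-2.1, -3.0, -16.4, -2.7]

print(has_large_drop(smooth, 10.0))  # no consecutive pair differs by more than 10
print(has_large_drop(bumpy, 10.0))   # -3.0 to -16.4 is a gap of 13.4
```

Using the absolute difference also flags the recovery that follows an unexpected token; a one-sided test (`prev - curr > threshold`) would flag only drops. Which variant works better on this dataset is an empirical question.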

In [10]:
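def detect_typos(sentence, threshold):

    word_logprobs = get_cumulative_token_logprobs(sentence)

    # compare each cumulative log-probability with the next one;
    # the gap between two consecutive cumulative values is the
    # magnitude of the later token's own log-probability
    for (_, prev), (_, curr) in zip(word_logprobs, word_logprobs[1:]):
        if abs(abs(curr) - abs(prev)) > threshold:
            return 0  # at least one large jump: flag as typo
    return 1  # no large jump: consider the sentence correct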

**Ex 3.4** Test your function by experimenting with different thresholds.
Select the best threshold on the whole dataset and report the accuracy on Moodle.

You should achieve an accuracy higher than 0.70.

In [11]:
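correct = 0
for i, row in df.iterrows():
    # positional access via .iloc avoids the deprecated integer-key lookup on a labeled Series
    sentence, label = row.iloc[1], row.iloc[2]
    pred = detect_typos(sentence, 11.5)

    if label == pred:
        correct += 1

print(f"Accuracy: {correct/len(df)}")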


Accuracy: 0.7884615384615384
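
Trying thresholds by hand requires a forward pass through the model for every sentence at every threshold. One way to avoid that, sketched below under assumptions, is to compute a single score per sentence once (for instance, the largest gap between consecutive values) and then sweep thresholds over the cached scores. The scores and labels here are invented, and `accuracy_at` and `best_threshold` are hypothetical helpers.

```python
def accuracy_at(scores, labels, threshold):
    """Predict 0 (typo) when the score exceeds the threshold, else 1 (correct)."""
    preds = [0 if s > threshold else 1 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def best_threshold(scores, labels, candidates):
    """Return the (threshold, accuracy) pair with the highest accuracy; ties keep the first candidate."""
    return max(((t, accuracy_at(scores, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Invented data: one cached score per sentence and its gold label (0 = typo, 1 = correct).
scores = [14.2, 9.1, 12.8, 7.5, 15.0, 10.4]
labels = [0, 1, 0, 1, 0, 1]

candidates = [t / 2 for t in range(10, 33)]  # 5.0, 5.5, ..., 16.0
threshold, accuracy = best_threshold(scores, labels, candidates)
print(threshold, accuracy)
```

A threshold selected on the same sentences it is evaluated on gives an optimistic accuracy; with only 52 sentences, the best value may not transfer to new data.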


**Error Analysis**: in this exercise, we exploited the probability distribution of a Large Language Model to detect typos.
Did you notice any linguistic or grammatical features that make detection more accurate? Do you think relying solely on threshold-based differences in log-probabilities is sufficient, or should we use a more sophisticated and complete system? Write your comments in the box below.



I think the log-probability distribution is generally useful for detecting typos. The linguistic and grammatical features that make detection more accurate are a regular sentence structure (standard patterns such as subject-verb-object and its variants), the use of common words, and the context surrounding each word.

However, relying solely on threshold-based differences in log-probabilities is not sufficient, since it can miss typos that produce a valid but unintended word ("card"->"cart"). In addition, rare words may be flagged as errors simply because they are less typical than others, even though they are not typos. Moreover, a large difference in log-probabilities does not always reflect a real typo, so the decision depends heavily on the chosen threshold.

In conclusion, this system can be used for typo detection, but it could be extended to obtain a better solution.

### Moodle Submission

Extract the code to run exercise 3 from the notebook as a .py file (you might have to add some initial import instructions). Make sure the code runs and computes the accuracy correctly. Upload the code to the activity "Lab assignment 3" on the Moodle page of the course. In the comment box on Moodle, report your accuracy and your comments related to the error analysis (box right above).