# Netherlands eScience Center
## NLP pilot workshop - Day 2

### 1. Limitations of word2vec:

- Words are processed in isolation
- Fixed vocabulary size
- Fixed vector for each item in vocabulary

### 2. The Transformer

- Encoder + Attention + Decoder
- Input: Text sequence
- Output: Text sequence
- Encoder:
    - Token embedder: Tokenizes input sequence and returns and embedding vector for each token
    - Sequence embedding: Token embeddings are aggregated (e.g., summed) to produce an embedding vector for the entire input sequence
- Attention:
    - A decoder has the opportunity can double-check in the source each time it emits the next token
    - Learns relevance of components across source and target sequences

### 3.1 BERT

- Tokenizer + Enoder
- Encoder preserves sequence of input tokens: It outputs an embedding vector for each input token
- Each input sequence has a special `CLS` token that can be used to classify the entire sequence
- Self-attention: Learns relevance within source sequences to enhance contextual learning

Import the BERT tokenizer and the BERT model:

In [1]:
from transformers import BertTokenizer, BertModel

# This might take some time on the first run
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

model = BertModel.from_pretrained("bert-base-cased")

  from .autonotebook import tqdm as notebook_tqdm


Encode an input sequence:

In [2]:
text = "Maria loves Groningen"

encoded_input = tokenizer(text, return_tensors="pt") # return the input as a PyTorch tensor
print(encoded_input)

{'input_ids': tensor([[  101,  3406,  7871,   144,  3484, 15016,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


Compare the token IDs with the original sequence:

In [3]:
print(encoded_input.input_ids.shape)
token_ids = list(encoded_input.input_ids[0].detach().numpy())
string_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(token_ids)
print(string_tokens)

torch.Size([1, 7])
[101, 3406, 7871, 144, 3484, 15016, 102]
['[CLS]', 'Maria', 'loves', 'G', '##ron', '##ingen', '[SEP]']


Encode the tokenized input with the BERT model:

In [4]:
output = model(**encoded_input)
print(output) # 'last_hidden_state' contains the output of the last encoding layer which is the final representation

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 6.3960e-02, -4.8470e-03, -8.4682e-02,  ..., -2.8042e-02,
           4.3824e-01,  2.0693e-02],
         [-3.7247e-04, -2.0076e-01,  2.5096e-01,  ...,  9.9699e-01,
          -5.4226e-01,  1.7926e-01],
         [ 5.2341e-01, -1.6954e-01, -2.9296e-01,  ...,  1.2007e-01,
           1.1869e-01,  1.6086e-01],
         ...,
         [ 7.8391e-01, -8.5551e-01,  2.2855e-01,  ..., -2.3085e-01,
          -7.9758e-02,  1.4140e-01],
         [-1.7368e-01, -8.6337e-02, -9.3972e-02,  ...,  2.5092e-01,
           3.7788e-01, -1.0323e-01],
         [ 7.1929e-01, -1.1457e-01,  1.4804e-01,  ...,  5.3051e-01,
           7.4839e-01,  7.8222e-02]]], grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.6889,  0.4869,  0.9998, -0.9888,  0.9296,  0.8637,  0.9685, -0.9851,
         -0.9547, -0.2367,  0.9661,  0.9982, -0.9969, -0.9996,  0.8415, -0.9670,
          0.9836, -0.5703, -1.0000, -0.8448, -0.1994, -0.9998,  0.2090,  0.961

Let's look at an example where the same word ("note") is used in two different contexts:

In [5]:
text_note = "Please note that this bank note is fake!"
tokenized_text = tokenizer(text_note, return_tensors="pt")
token_ids = list(tokenized_text.input_ids[0].detach().numpy())
string_tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(string_tokens)

['[CLS]', 'Please', 'note', 'that', 'this', 'bank', 'note', 'is', 'fake', '!', '[SEP]']


Get the indices of the two "note" tokens:

In [6]:
note_index_1 = 2
note_index_2 = 6
print(string_tokens[note_index_1], string_tokens[note_index_2])

note note


Get the embedding vectors of the two "note" tokens:

In [7]:
import torch

# We use 'torch.no_grad()' to prevent PyTorch from adjusting the weights in the BERT model (we always do this during inference)
with torch.no_grad():
    bert_output = model(**tokenized_text)
    
note_vector_1 = bert_output.last_hidden_state[0][note_index_1].detach().numpy()
note_vector_2 = bert_output.last_hidden_state[0][note_index_2].detach().numpy()

print(note_vector_1[:10])
print(note_vector_2[:10])

[ 1.0170376   0.9369125   0.3057146   0.33091134  0.7309374  -0.43299702
  0.6208724  -0.25355926 -0.11151288  0.09412687]
[ 0.17840008  0.65847856  0.22412625  0.21162093  0.5393074  -0.02996005
  0.11301914 -0.29698443 -0.56909984 -0.2501469 ]


Helper function to create a pretty output:

In [8]:
def pretty_print_outputs(sentences, model_outputs):
    for i, model_out in enumerate(model_outputs):
        print("\n=====\t",sentences[i])
        for label_scores in model_out:
            print(label_scores)

Use the transformers pipeline:

In [9]:
from transformers import pipeline

nlp = pipeline(task="fill-mask", model="bert-base-cased", tokenizer="bert-base-cased")
sentences = [
    "Paris is the [MASK] of France",
    "I want to eat a cold [MASK] this afternoon",
    "Maria [MASK] Groningen",
]

model_outputs = nlp(sentences, top_k=5)
pretty_print_outputs(sentences, model_outputs)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0



=====	 Paris is the [MASK] of France
{'score': 0.9808057546615601, 'token': 2364, 'token_str': 'capital', 'sequence': 'Paris is the capital of France'}
{'score': 0.00451317522674799, 'token': 6299, 'token_str': 'Capital', 'sequence': 'Paris is the Capital of France'}
{'score': 0.00428184075281024, 'token': 2057, 'token_str': 'center', 'sequence': 'Paris is the center of France'}
{'score': 0.0028482081834226847, 'token': 2642, 'token_str': 'centre', 'sequence': 'Paris is the centre of France'}
{'score': 0.0022805905900895596, 'token': 1331, 'token_str': 'city', 'sequence': 'Paris is the city of France'}

=====	 I want to eat a cold [MASK] this afternoon
{'score': 0.19168125092983246, 'token': 13473, 'token_str': 'pizza', 'sequence': 'I want to eat a cold pizza this afternoon'}
{'score': 0.14800795912742615, 'token': 25138, 'token_str': 'turkey', 'sequence': 'I want to eat a cold turkey this afternoon'}
{'score': 0.14621137082576752, 'token': 14327, 'token_str': 'sandwich', 'sequence': 

### 3.2 BERT for text classification

Using BERT for sentiment analysis:

In [10]:
classifier = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=3)

sentences = [
    "I am not having a great day.",
    "This is a lovely and innocent sentence.",
    "Maria loves Groningen."
]

model_outputs = classifier(sentences)
pretty_print_outputs(sentences, model_outputs)

Device set to use mps:0



=====	 I am not having a great day.
{'label': 'disappointment', 'score': 0.5044488906860352}
{'label': 'sadness', 'score': 0.34694328904151917}
{'label': 'annoyance', 'score': 0.08485794812440872}

=====	 This is a lovely and innocent sentence.
{'label': 'admiration', 'score': 0.7667419910430908}
{'label': 'approval', 'score': 0.4139408767223358}
{'label': 'love', 'score': 0.11303544044494629}

=====	 Maria loves Groningen.
{'label': 'love', 'score': 0.9160982370376587}
{'label': 'neutral', 'score': 0.07024713605642319}
{'label': 'approval', 'score': 0.02595525234937668}


Evaluation of BERT for sentiment classification on custom dataset:

In [11]:
from transformers import pipeline
pipe = pipeline(task='text-classification',
                model='tabularisai/multilingual-sentiment-analysis')

sentences = [
    "I love this product! It's amazing and works perfectly",
    "The movie was a bit boring, I could predict the ending since minute 1.",
    "Mary Shelley wrote this book around 1816",
    "Everything suuuucks."
]

gold_labels = [
    "Very Positive",
    "Negative",
    "Neutral",
    "Very Negative"
]

result = pipe(sentences)

predicted_labels = []
for res in result:
    print(res)
    predicted_labels.append(res['label'])

from sklearn.metrics import classification_report

print(classification_report(y_true=gold_labels, y_pred=predicted_labels))

Device set to use mps:0


{'label': 'Very Positive', 'score': 0.6238982081413269}
{'label': 'Negative', 'score': 0.9448591470718384}
{'label': 'Neutral', 'score': 0.9033873081207275}
{'label': 'Negative', 'score': 0.5178337097167969}
               precision    recall  f1-score   support

     Negative       0.50      1.00      0.67         1
      Neutral       1.00      1.00      1.00         1
Very Negative       0.00      0.00      0.00         1
Very Positive       1.00      1.00      1.00         1

     accuracy                           0.75         4
    macro avg       0.62      0.75      0.67         4
 weighted avg       0.62      0.75      0.67         4



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


## 4.1 LLMs

Running a (small) LLM from code:

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model.num_parameters())

llm = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Where is Groningen located?"
print(tokenizer(prompt))
print(tokenizer.convert_ids_to_tokens(tokenizer(prompt)["input_ids"]))

response = llm(prompt)
print(response)

Device set to use mps:0


134515008
{'input_ids': [9576, 314, 452, 992, 45670, 3807, 47], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
['Where', 'ƒ†is', 'ƒ†G', 'ron', 'ingen', 'ƒ†located', '?']
[{'generated_text': 'Where is Groningen located? I want to know the main attractions and the time of year that you can visit.'}]


In [13]:
messages = [
    {"role": "system", "content": "You are a helpful assistant. Give short straight answers."},
    {"role": "user", "content": "Where is Groningen located?"}
]

response = llm(messages)
print(response)

[{'generated_text': [{'role': 'system', 'content': 'You are a helpful assistant. Give short straight answers.'}, {'role': 'user', 'content': 'Where is Groningen located?'}, {'role': 'assistant', 'content': 'Groningen is located in the northeastern part of the Netherlands. It is a city in the province of North Holland, in the state of North Holland bordering the province of Holland.'}]}]


If you are interested in a looping chat conversation [here](https://github.com/carpentries-incubator/Natural-language-processing/blob/main/episodes/notebooks/chatbot.ipynb) is a notebook with code for that.

## 4.2 LLM Hyperparameters

What are LLM hyperparameters and how do they affect results?

In [14]:
messages = [
    {"role": "user", "content": "You are a helpful assistant. Only tell me 'yes' or 'no' and a one-sentence explanation."},
    {"role": "system", "content": "Is NLP the best research field?"}
]

response = llm(
    messages,
    max_new_tokens=50,
    do_sample=True,
    top_k=5,
    # top_p=0.9,
    temperature=0.7
)

print(response)

[{'generated_text': [{'role': 'user', 'content': "You are a helpful assistant. Only tell me 'yes' or 'no' and a one-sentence explanation."}, {'role': 'system', 'content': 'Is NLP the best research field?'}, {'role': 'assistant', 'content': 'Yes, NLP is a highly regarded research field in the field of linguistics and computer science. It focuses on the development, application, and interpretation of language and language processing techniques. NLP has numerous applications in various fields including natural language processing, computer vision,'}]}]


Use Ollama to download and use larger models on your laptop / PC:

In [15]:
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model = "llama3.2:1b",
    temperature=0.95,
    num_predict=100,
    top_k=5,
    top_p=0.9
)

messages = [
    ("system", "You are a helpful assistant. Give short straight answers."),
    ("human", "Where is Groningen located?")
]

response = llm.invoke(messages)

print(response.content)

Groningen is a city located in the province of Groningen, in the Netherlands.


### 4.3 LLM pitfalls

Various unexpected LLM behaviours to keep in mind and safeguard against when using them:

#### 4.3.1 Hallucination / guard rails for hallucination in LLMs

In [16]:
llm = ChatOllama(
    model = "llama3.2:1b",
    temperature=1.3,
    num_predict=500,
    top_p=0.9
)

halluc_prompt = "Give me the bio for Railen Ackerby"
response = llm.invoke(halluc_prompt)
print(response.content)

Railen Ackerby was a Swedish-born Australian journalist and radio presenter, best known as the first female radio host in Australia. Unfortunately, I could not find much information on her. If you're looking for more detailed or up-to-date information about Railen's life and career, I suggest checking with sources such as Radio Times Australia, which may have archives of past interviews and articles featuring her.


#### 4.3.2 Gender bias LLMs

In [17]:
bias_prompt = "Write a two paragraph story where a nurse, a pilot, and a CEO are having lunch together."
response = llm.invoke(bias_prompt)
print(response.content)

As they sat down at the elegant table in the upscale restaurant, Sarah, a seasoned nurse, couldn't help but notice the stark contrast between her colleagues, Jack, a rugged pilot, and Rachel, the chief executive officer. Jack, sporting his aviator sunglasses and a hint of a tan from flying, sipped on a glass of red wine while Rachel, impeccably dressed in a tailored suit, surveyed the menu with an air of precision. The two men, though vastly different in their professions and backgrounds, had been brought together by a chance meeting at a industry conference.

As they ordered their meals, the conversation flowed effortlessly between them, covering everything from business strategies to personal anecdotes. Sarah found herself drawn to Rachel's infectious enthusiasm, while Jack was captivated by her sharp intellect and dry sense of humor. At one point, Jack regaled them with tales of his latest flight adventure, exaggerating just enough to make them all laugh, while Rachel responded with

#### 4.3.3 Information bias LLMs

In [18]:
biased_prompt = "Who was the second president of the United States?"
response = llm.invoke(biased_prompt)
print(response.content)
print()
biased_prompt = "Who was the second president of Mexico?"
response = llm.invoke(biased_prompt)
print(response.content)
print()

The second President of the United States was John Adams. He served from 1797 to 1801, during the early years of the American presidency under George Washington and Thomas Jefferson's presidency, before moving on to become a U.S. Senator and later serving as the first Vice President under Jefferson in 1793.

The second president of Mexico was √Ålvaro Obreg√≥n. He served as president from April 4, 1929, to November 1, 1937.



#### 4.3.4 Outdated knowledge LLMs

In [19]:
outdated_prompt = "Who is the president of the United States?"
response = llm.invoke(outdated_prompt)
print(response.content)

As of my last update in 2023, the President of the United States is Joe Biden. He took office on January 20, 2021. Please note that political positions can change, and I'll do my best to provide the most up-to-date information.


In [20]:
outdated_prompt = "When was the last time Argentina won the World Cup?"
response = llm.invoke(outdated_prompt)
print(response.content)

Argentina's national football team, also known as La Albiceleste, has not won the FIFA World Cup since their victory in 1978. However, they did win the Copa Am√©rica title on four separate occasions: 1976, 1981, 1993, and 2016.


## 4.4 Sentiment Analysis with LLMs

Using LLMs to solve standardized NLP tasks. We can also evaluate the accuracy of responses using the same process we did for BERT.

In [21]:
sentiment_llm = ChatOllama(
    model="llama3.2:1b",
    temperature=0, # Want to be as deterministic as possible
    num_predict=10, # Keep the answer very short
    top_k=1, # Only consider the next most likely token (Greedy)
)

sentiment_texts = [
    "I love this movie! It was absolutely fantastic and made my day.",
    "This product is terrible. I hate everything about it.",
    "Nothing says quality like a phone that dies after 20 minutes.",
    "The movie was exactly what I was hoping for.",
    "The food was delicious, but the service was painfully slow."
]


gold_labels = [
    "POSITIVE",
    "NEGATIVE",
    "NEGATIVE",
    "POSITIVE",
    "NEUTRAL"
]

predicted_labels = []

general_prompt = """Analyze the sentiment expressed in this text and classify it into exactly one of three categories: POSITIVE, NEUTRAL, or NEGATIVE. Output only the label in uppercase."""

for text in sentiment_texts:
    messages = [("system", general_prompt), ("human", text)]
    response = sentiment_llm.invoke(messages)
    print(f"Example: {text}")
    print(f"Response: {response.content}")
    if 'POSITIVE' in response.content:
        predicted_labels.append('POSITIVE')
    elif 'NEGATIVE' in response.content:
        predicted_labels.append('NEGATIVE')
    else:
        predicted_labels.append('NEUTRAL')
    print("------")

Example: I love this movie! It was absolutely fantastic and made my day.
Response: APOSITIVE
------
Example: This product is terrible. I hate everything about it.
Response: NEGATIVE
------
Example: Nothing says quality like a phone that dies after 20 minutes.
Response: NEGATIVE

* The sentiment is negative due to
------
Example: The movie was exactly what I was hoping for.
Response: POSITIVE

THESE SENTIMENTS ARE EX
------
Example: The food was delicious, but the service was painfully slow.
Response: POSITIVE
------


In [22]:
from sklearn.metrics import classification_report

print(classification_report(y_true=gold_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

    NEGATIVE       1.00      1.00      1.00         2
     NEUTRAL       0.00      0.00      0.00         1
    POSITIVE       0.67      1.00      0.80         2

    accuracy                           0.80         5
   macro avg       0.56      0.67      0.60         5
weighted avg       0.67      0.80      0.72         5



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


## üîß Exercises

### Challenge 3.1: How does context affect word meaning? (polysemy)

Think of words (at least 2) that can have more than one meaning depending on the context. Come up with one simple sentence per meaning and explain what they mean in each context. Discuss: How do you know which of the possible meanings does the word have when you see it?

- intelligence: intellect vs spy agency
    - Human intellence is boundless!
    - Are intelligence agencies spying on us?

**shot**
1. My supervisor is a big shot.
1. I took a few shots yesterday.
1. The police have a shotgun.
1. I should never have shot down my workstation, according to the ICT service staff.

- She is the kind of person that is kind 
    - Grammatically you could find out which one means which. 
    - She is a kind person // she is the kind of person that is nice to others. 
- We are working in a shared space./ NASA hopes to explore new areas of space in the coming years.
    - Dependent on the context of knowing that NASA implies 'outer space', and that work space usually refers in this context probably refers to a room.
- I bought a new watch. I want to watch the new Disney movie.
    - Watch for determining time, and "watch" meaning "viewing".
- Be the change you want to see in the world. Do you have some spare change?
- Book (reading a book) Book (book a hotel)
- what is the cube of 2? and what is inside the cube?
    - Kind of related and of course there can be an eight in the cube.
- Eats shoots and leaves. Eats, shoots and leaves.
    - shoots: young plant (noun, pl) or 'to shoot' (verb)
    - leaves: plant leaf (noun, pl) or 'to leave' (verb)

- 'bank' in Dutch (1. furniture related 2. money related)

- Bar: (I passed the bar, I went to the bar for a drink)
- Game: (I played a game, I shot some guy)
- Plant: (I watered the plants, I work at the power plant)

- Scale (Please step on the scale; This is an issue of scale; The fish lost a scale)



### Challenge 3.2: Mapping words inside translated sentences (attention)

Pair with a person who speaks a language different from English (we will call it language B). Think of 1 or 2 simple sentences in English and come up with their translations in the second language. In a piece of paper write down both sentences (one on top of the other with some distance in between) and try to:

1. Draw a mapping of words or phrases from language B to English. Is it always possible to do this one-to-one for words?
2. Think of how this might relate to attention in transformers?

#### Solution!
For the solution look at the image in the slide!

It is an example of a sentence in English and its translation into Spanish. We can look at the final mapping and observe that:

1. Even though they are closely related languages, the translation is not linear
2. There is also not a direct word-to-word mapping between the sentences
3. Some words present in the source are not present in the target (and vice versa)
4. Some words are not translations of each other but they are still very relevant to understand the context

### Challenge 3.3: Play with BERT fill-mask 

Play with the `fill-mask` pipeline and try to find examples where the model gives bad predictions and examples where the predictions are very good. You can try: 

- Changing the `top_k` parameter
- Search for bias in completions. For example, compare predictions for "This man works as a [MASK]." vs. "This woman works as a [MASK].".
- Test the multilingual BERT model to compare. To do this, you should change the `model` and `tokenizer` parameter name to `bert-base-multilingual-cased`

### Challenge 3.4: Run a BERT sentiment classifier

Now it is time to scale things a little bit more... Use the same pipeline from the given toy example to run predictions over 100 examples of short book reviews. Then print the classification report for the given *test set*. These examples are given in the `data/sentiment_film_data.tsv` file.

You can use the following helper functions, the first one helps you read the file and the second one normalizes the 5-class predictions into the 3-class annotations given in the test set:

In [23]:
def load_data(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()[1:] # skip header
    sentences, labels = zip(*(line.strip().split('\t') for line in lines))
    return list(sentences), list(labels)

def get_normalized_labels(predictions):
    # predicitons is a list with dicts such as {'label': 'positive', 'score': 0.95}
    # We also need to normalize the labels to match the true labels (which are only 'positive' and 'negative')
    normalized = []
    for pred in predictions:
        label = pred['label'].lower()
        if 'positive' in label:
            normalized.append('positive')
        elif 'negative' in label:
            normalized.append('negative')
        else:
            normalized.append('neutral')
    return normalized

#### Solution

In [24]:
from sklearn.metrics import classification_report, precision_recall_fscore_support
import matplotlib.pyplot as plt

sentences, labels = load_data('data/sentiment_film_data.tsv')
# The labels from our dataset
y_true = labels
# Run the model to get predictions per sentence
y_pred = pipe(sentences)
# Normalize the labels to match the gold standard
y_pred = get_normalized_labels(y_pred)

# Detailed report with all metrics
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

    negative       0.57      1.00      0.73        23
     neutral       0.53      0.22      0.31        37
    positive       0.69      0.78      0.73        40

    accuracy                           0.62       100
   macro avg       0.60      0.66      0.59       100
weighted avg       0.61      0.62      0.57       100



## Resources
- NL eScience Center [Digital Skills Programme](https://www.esciencecenter.nl/digital-skills/) & newsletter
- [Research Software Training](https://researchsoftwaretraining.nl/): Network of research software trainers in the Netherlands
- [RSE-NL](https://nl-rse.org/): Community of research software engineers in the Netherlands
- [TikTokenizer](https://tiktokenizer.vercel.app/): a tokenization visualization tool designed for large language models (LLMs) such as GPT, Llama, and Qwen.
- [SpaCy models](https://spacy.io/models/en#en_core_web_sm): Available trained pipelines in SpaCy
- [SpaCy entity visualizer](https://spacy.io/usage/visualizers#ent): The entity visualizer, ent, highlights named entities and their labels in a text.
- Lesson Material:
    - [Course](https://carpentries-incubator.github.io/Natural-language-processing/)
    - [Slides](https://github.com/carpentries-incubator/Natural-language-processing/raw/refs/heads/main/instructors/slides/nlp-fundamentals.pptx)
- Paper: [The Risks of Using Large Language Models for Text Annotation in Social Science Research](https://osf.io/preprints/socarxiv/79qu8_v1)
- Webpage: [sklearn metrics info page](https://scikit-learn.org/stable/api/sklearn.metrics.html)
- Webpage: [sbert](https://www.sbert.net/)
