# 🧠 NLP with Pre-trained Models

This notebook demonstrates how to use pre-trained models for common NLP tasks using the Hugging Face `transformers` library and related tools.

Before running it, install the dependencies with `poetry add transformers torch peft`.

To use notebooks with Poetry, you may also need to install Jupyter and register a kernel:
```
poetry add jupyter ipykernel
poetry run python -m ipykernel install --name youarebot-quickstart --user --display-name youarebot-quickstart
```

You may also need to install the Jupyter extensions for VS Code and reload VS Code.

NOTE: Running today's notebooks downloads many large models (several are 1 GB or more) to your laptop, so check your free disk space first.

In [1]:
# Import Required Libraries
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch

## Quick question: which NLP tasks do you know?
Some examples:
- Classification (sentiment and topic)
- NER (entity extraction)
- Translation
- Summarization
- Zero-shot classification

## 2. Text Classification Demo

Let's classify the sentiment of some example sentences using a pre-trained model.

In [2]:
# Sentiment classification examples
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
texts = [
    "I love using pre-trained models!",
    "This product is terrible and I want a refund.",
    "The movie was okay, not great but not bad either."
]
for text in texts:
    result = classifier(text)
    print(f"Text: {text}\nResult: {result}\n")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0


Text: I love using pre-trained models!
Result: [{'label': 'POSITIVE', 'score': 0.9993071556091309}]

Text: This product is terrible and I want a refund.
Result: [{'label': 'NEGATIVE', 'score': 0.9997045397758484}]

Text: The movie was okay, not great but not bad either.
Result: [{'label': 'POSITIVE', 'score': 0.9919201135635376}]
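
The log above warns against relying on the pipeline's default model. A minimal sketch of pinning the checkpoint explicitly follows, using the model name and revision printed in that log. The constant `PINNED_SENTIMENT_MODEL` and the helper `build_sentiment_classifier` are illustrative names, not part of `transformers`.

```python
# Model name and revision copied from the pipeline log above.
PINNED_SENTIMENT_MODEL = {
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_sentiment_classifier():
    """Create the sentiment pipeline with an explicit checkpoint (hypothetical helper)."""
    # Imported lazily so the constant above can be reused without loading transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", **PINNED_SENTIMENT_MODEL)
```

Calling `build_sentiment_classifier()` should return a pipeline that behaves like `classifier` above, but the checkpoint no longer changes if the library's default does.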



## 3. Named Entity Recognition (NER) Demo

Extract named entities from text using a pre-trained NER model.

In [3]:
# NER example
from transformers import pipeline
ner = pipeline('ner', aggregation_strategy='simple')  # replaces the deprecated grouped_entities=True
text = "Apple is looking at buying U.K. startup for $1 billion."
entities = ner(text)
print(f"Text: {text}\nEntities: {entities}")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use mps:0


Text: Apple is looking at buying U.K. startup for $1 billion.
Entities: [{'entity_group': 'ORG', 'score': np.float32(0.9990897), 'word': 'Apple', 'start': 0, 'end': 5}, {'entity_group': 'LOC', 'score': np.float32(0.999718), 'word': 'U', 'start': 27, 'end': 28}, {'entity_group': 'LOC', 'score': np.float32(0.9987226), 'word': 'K', 'start': 29, 'end': 30}]
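
In the output above, `U.K.` comes back as two separate `LOC` entities (`U` and `K`) because the period between them is not tagged. One possible post-processing step, sketched here as an assumption rather than a feature of the pipeline, is to merge neighbouring entities of the same group when only joiner characters separate them. The helper `merge_adjacent_entities` and its `joiners` parameter are hypothetical; the entity list is copied from the output above with the scores written as plain floats.

```python
def merge_adjacent_entities(text, entities, joiners=".-'"):
    """Merge consecutive same-group entities separated only by joiner characters."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if prev is not None and prev["entity_group"] == ent["entity_group"]:
            # Characters between the previous entity and this one in the original text
            gap = text[prev["end"]:ent["start"]]
            if all(ch in joiners for ch in gap):
                prev["end"] = ent["end"]
                prev["word"] = text[prev["start"]:prev["end"]]
                # Keep the lower score as a conservative confidence for the merged span
                prev["score"] = min(prev["score"], ent["score"])
                continue
        merged.append(dict(ent))  # copy so the input list is not mutated
    return merged


text = "Apple is looking at buying U.K. startup for $1 billion."
entities = [
    {"entity_group": "ORG", "score": 0.9990897, "word": "Apple", "start": 0, "end": 5},
    {"entity_group": "LOC", "score": 0.999718, "word": "U", "start": 27, "end": 28},
    {"entity_group": "LOC", "score": 0.9987226, "word": "K", "start": 29, "end": 30},
]
merged = merge_adjacent_entities(text, entities)
print([(e["entity_group"], e["word"]) for e in merged])
# prints [('ORG', 'Apple'), ('LOC', 'U.K')]
```

The trailing period is not recovered because it lies after the last tagged character; the `start` and `end` offsets are what make it possible to slice the original text at all.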


## 4. Text Translation Demo

Translate English text to French using a pre-trained translation model.

In [4]:
# Translation example
from transformers import pipeline
translator = pipeline('translation_en_to_fr')
text = "Machine learning is amazing."
translation = translator(text)
print(f"Original: {text}\nFrench: {translation[0]['translation_text']}")

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0


Original: Machine learning is amazing.
French: L'apprentissage par machine est étonnant.


## 5. Summarization Demo

Summarize a long text using a pre-trained summarization model.

In [5]:
# Summarization example
from transformers import pipeline

summarizer = pipeline('summarization')
long_text = (
    "Machine learning is a field of artificial intelligence that uses statistical techniques "
    "to give computer systems the ability to learn from data, without being explicitly programmed. "
    "It is seen as a part of artificial intelligence. Machine learning algorithms build a model "
    "based on sample data, known as training data, in order to make predictions or decisions "
    "without being explicitly programmed to do so."
)
summary = summarizer(long_text, max_length=40, min_length=10, do_sample=False)
print(f"Summary: {summary[0]['summary_text']}")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps:0


Summary:  Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to learn from data without being explicitly programmed . Machine learning algorithms build a model based on sample data


## 6. Zero-shot Classification Demo

Classify text into arbitrary categories using a zero-shot model like `facebook/bart-large-mnli`.

In [6]:
# Zero-shot classification example
from transformers import pipeline
zero_shot = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
text = "This is a fantastic new phone with a great camera."
labels = ["technology", "sports", "politics", "entertainment"]
result = zero_shot(text, candidate_labels=labels)
print(f"Text: {text}\nLabels: {labels}\nResult: {result}")

Device set to use mps:0


Text: This is a fantastic new phone with a great camera.
Labels: ['technology', 'sports', 'politics', 'entertainment']
Result: {'sequence': 'This is a fantastic new phone with a great camera.', 'labels': ['technology', 'entertainment', 'sports', 'politics'], 'scores': [0.8937835693359375, 0.09825354814529419, 0.006223480682820082, 0.0017393830930814147]}
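
The result dictionary returns `labels` and `scores` as parallel lists, sorted from most to least likely. A small sketch of reading it, with the values copied from the output above (the variable names are illustrative):

```python
result = {
    "sequence": "This is a fantastic new phone with a great camera.",
    "labels": ["technology", "entertainment", "sports", "politics"],
    "scores": [0.8937835693359375, 0.09825354814529419,
               0.006223480682820082, 0.0017393830930814147],
}

# Pair each label with its score; the first pair is the top prediction
ranked = list(zip(result["labels"], result["scores"]))
top_label, top_score = ranked[0]
print(f"Top label: {top_label} ({top_score:.1%})")
for label, score in ranked:
    print(f"{label:<15}{score:.4f}")

# In the default single-label mode the scores are normalized across the
# candidate labels, so they add up to approximately 1.
total = sum(result["scores"])
print(f"Sum of scores: {total:.6f}")
```

Because the scores are normalized across the candidates, they depend on which labels you supply: adding or removing a label changes every score. Passing `multi_label=True` to the pipeline scores each label independently instead.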


## 7. Text Generation Demo

Generate text using a sequence-to-sequence model such as T5.

In [7]:
# Text generation example with T5
from transformers import pipeline
generator = pipeline('text2text-generation', model='t5-small')
prompt = "summarize: The quick brown fox jumps over the lazy dog. This sentence is often used to demonstrate fonts and test typewriters."
generated = generator(prompt, max_length=40)
print(f"Prompt: {prompt}\nGenerated: {generated[0]['generated_text']}")

Device set to use mps:0
Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Prompt: summarize: The quick brown fox jumps over the lazy dog. This sentence is often used to demonstrate fonts and test typewriters.
Generated: sentence often used to demonstrate fonts and test typewriters .
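
The warning above appears because the pipeline already carries a default `max_new_tokens` and the call adds `max_length` on top of it. A sketch of the same call that sets only `max_new_tokens`, assuming the same `transformers` version as this notebook; the helper name `summarize_with_t5` and the constant `GENERATION_KWARGS` are hypothetical:

```python
# Cap on the number of generated tokens; replaces max_length=40 from the cell above
GENERATION_KWARGS = {"max_new_tokens": 40}


def summarize_with_t5(prompt, model_name="t5-small"):
    """Run the text2text pipeline with an explicit max_new_tokens (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without loading the model
    from transformers import pipeline

    generator = pipeline("text2text-generation", model=model_name)
    return generator(prompt, **GENERATION_KWARGS)[0]["generated_text"]
```

With only one length limit set, the generated text should be capped at 40 tokens rather than the default 256 that took precedence in the run above.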


## 8. Fine-tuning a Model with PEFT/LoRA (Overview & Example)

Fine-tuning adapts a pre-trained model to your own data. PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that make this cheaper, and LoRA (Low-Rank Adaptation) is one of the most widely used PEFT methods.

- **PEFT**: Only a small set of parameters is updated while the rest of the model stays frozen, reducing compute and memory needs.
- **LoRA**: Adds trainable low-rank matrices next to selected weight matrices (typically the attention projections), so only those small matrices are trained.
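
To make the savings concrete, here is a back-of-the-envelope sketch of LoRA's parameter count for a single weight matrix. It assumes a 768 × 768 projection (768 is the hidden size of `bert-base-uncased`) and uses rank `r = 8` and `lora_alpha = 32` as illustrative values. The function names are hypothetical, and the update rule in the comment is the standard LoRA formulation rather than anything specific to this notebook.

```python
def full_params(d_out, d_in):
    """Parameters in a dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out, d_in, r):
    """Parameters in the LoRA factors: B is (d_out, r) and A is (r, d_in)."""
    return r * (d_out + d_in)


d_out = d_in = 768          # assumed projection size
r, lora_alpha = 8, 32       # illustrative LoRA hyperparameters

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, r)
scaling = lora_alpha / r    # effective weight: W + scaling * (B @ A)

print(f"Full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({lora / full:.2%} of the full matrix)")
print(f"Update scaling (lora_alpha / r): {scaling}")
```

For this one matrix the adapter trains 12,288 parameters instead of 589,824, roughly 2% of the original, while the frozen matrix `W` is left untouched.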

Below is a minimal example using the `peft` library. It only wraps the model with LoRA adapters; the dataset and the training loop are omitted:

In [8]:
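# Minimal PEFT/LoRA setup (model wrapping only; training loop omitted)
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification

# Load the base model with a freshly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
# r is the adapter rank; the low-rank update is scaled by lora_alpha / r
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=32, lora_dropout=0.1)
peft_model = get_peft_model(model, lora_config)
# Report how many parameters are trainable versus frozen
peft_model.print_trainable_parameters()
# peft_model can now be trained as usual with your data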

ModuleNotFoundError: No module named 'peft'

## 9. Limitations of Pre-trained NLP Models

While pre-trained models are powerful, they have important limitations:

- **Context window size**: Most models have a maximum input length (e.g., 512 or 1024 tokens). Longer texts have to be truncated or split into chunks; otherwise the model may fail or ignore part of the input.
- **Domain-specific vocabulary**: Performance drops on specialized jargon or rare terms not seen during pre-training.
- **Bias and fairness**: Models may reflect biases present in their training data.
- **Need for fine-tuning**: For best results on your data, fine-tuning is often required.

**Example:**
- Try classifying a very long text or a text with medical/legal jargon and observe the results.

In [None]:
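# Limitation demo: context window and domain vocabulary
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
# About 1,000 tokens, well above the 512-token limit of the default DistilBERT model
long_text = " ".join(["This is a sentence."] * 200)
specialized_text = "The patient was administered 5mg of apixaban for atrial fibrillation."
print("Long text classification:")
try:
    # Without truncation, an over-length input typically raises an error inside the model
    print(classifier(long_text))
except Exception as exc:
    print(f"Failed without truncation: {type(exc).__name__}")
# truncation=True keeps only the first tokens up to the model limit; the rest is ignored
print(classifier(long_text, truncation=True))
print("\nMedical jargon classification:")
print(classifier(specialized_text))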
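
A common workaround for the context-window limit is to split a long text into overlapping chunks and classify each chunk separately. The sketch below uses whitespace-separated words as a rough stand-in for model tokens, which is an assumption: a real tokenizer usually produces more tokens than words, so the chunk size should leave headroom below the model limit. The function `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words, sharing overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


long_text = " ".join(["This is a sentence."] * 200)  # 800 words
chunks = chunk_words(long_text)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
# prints 5 [200, 200, 200, 200, 80]
```

Each chunk can then be passed to `classifier(chunk)` and the per-chunk results combined, for example by majority vote or by averaging scores. Which aggregation works best depends on the task.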