## <b>Summary of the Notebook</b>

This Jupyter Notebook provides a comprehensive exploration of fundamental Natural Language Processing (NLP) techniques.

It includes practical implementations using popular libraries and frameworks such as Hugging Face Transformers, spaCy, and NLTK. The notebook is structured into multiple sections, each focusing on a specific NLP concept:

1. **Tokenization**:
   - Introduction to tokenization and its significance in NLP tasks. 
   - Demonstrated tokenization using Hugging Face's BERT tokenizer.
   - Explanation of subword tokenization and handling out-of-vocabulary (OOV) words.

2. **Lemmatization**:
   - Definition and examples of lemmatization.
   - Implementation using spaCy to convert words to their base or dictionary forms.
   - Analysis of challenges and potential errors in lemmatization.

3. **Stemming**:
   - Explanation of stemming and its differences compared to lemmatization.
   - Demonstration of stemming using NLTK's PorterStemmer.
   - Side-by-side comparison of stemming and lemmatization results.

4. **Stopwords Removal**:
   - Importance of stopword removal in text preprocessing.
   - Example implementation using NLTK's stopwords list.

5. **Named Entity Recognition (NER)**:
   - Introduction to NER and its applications.
   - Examples of NER using spaCy and Hugging Face Transformers.
   - Exploration of longer, multi-entity passages, including cases where the models make mistakes.

6. **References**:
   - Links to foundational research papers and key resources for further reading.
---

## Tokenization

1. Tokenization Using Hugging Face's Transformers (BERT Tokenizer)

<b>Definition</b>: Tokenization is the process of breaking a piece of text into smaller units called tokens. Tokens can be words, sentences, or even characters, depending on the application.

Hugging Face's transformers library provides tokenizers designed for transformer models such as BERT and GPT. These tokenizers often split text into subwords rather than whole words.
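
To make the three granularities concrete before moving to subwords, here is a minimal sketch using only Python's standard library. The regular expressions are simplifying assumptions (they are not what BERT or NLTK use) and will mis-split abbreviations such as "Dr.".

```python
import re

text = "Hugging Face makes NLP easy! It really does."

# Sentence-level tokens: naive split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word-level tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens: every character, including spaces.
chars = list(text)

print(sentences)
print(words)
print(chars[:7])
```

The same text yields 2 sentence tokens, 10 word tokens, or one token per character, which is why the right granularity depends on the task.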


In [None]:
from transformers import BertTokenizer

# Load pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Hugging Face makes NLP easy!"
tokens = tokenizer.tokenize(text)
print(tokens)


***NOTICE:***

The `##` prefix marks a subword that continues the previous token.
Here `nl` starts the word and `##p` attaches to it, so together they form "nlp".


Why doesn't BERT know some words as one token?

The short answer is: because BERT has a fixed-size vocabulary, and that vocabulary doesn't include every possible word or acronym — only the most common ones seen during pretraining.

Here’s why:

1. <b>BERT is pretrained with a limited vocabulary</b>

    BERT was trained on large datasets like BooksCorpus and English Wikipedia.

    To keep the model size manageable, it uses a vocabulary of about 30,000 tokens.

    These tokens are selected by the WordPiece algorithm, which tries to balance:

    - Common full words (like "science", "computer")
    - Frequent subwords (like "##ing", "##tion")

    Words that are too rare or appear in too many variations may not be stored as full tokens.

2. <b>It’s more efficient to break rare or unseen words</b>

    If BERT tried to store every possible word (including "NLP", "OpenAI", "PySide"), the vocabulary would be massive and impractical.

    So instead, it learns to split rare/unseen words into known subwords.

    This helps BERT generalize to words it never saw in training!
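
The splitting behaviour described above can be sketched as a greedy longest-match-first search over the vocabulary. This is a simplified illustration, not the library's implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and real tokenizers also handle casing, punctuation, and length limits.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical; the real BERT vocabulary has about 30,000 entries).
toy_vocab = {"hugging", "face", "makes", "easy", "nl", "##p", "token", "##ization"}

for word in ["makes", "nlp", "tokenization", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A word that is in the vocabulary comes back whole, a rare word is covered by known pieces, and only a word with no matching pieces falls back to the unknown token.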

**Reference:**

* Sproat, R., & Shih, C. (1996). "A Statistical Method for Tokenization." *Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL)*, 1996.

  This paper discusses statistical methods for tokenization and how tokenization is essential for preprocessing in NLP tasks.

**Link:** [ACL 1996 Paper on Tokenization](https://aclanthology.org/P96-1010/)

---

## Lemmatization

<b>Definition</b>: Lemmatization is the process of converting a word to its base or dictionary form (lemma), taking into account the context and grammar.

Example:

Words: "am", "are", "is" → Lemma: "be"

Words: "running", "ran" → Lemma: "run"

Lemmatization uses vocabulary and <b>morphological</b> analysis to return real words.

Lemmatization with spaCy (Modern NLP Pipeline)

spaCy is one of the most popular NLP libraries, offering high-performance lemmatization along with part-of-speech tagging and dependency parsing.


In [2]:
import spacy

# Load pretrained spaCy model for English
nlp = spacy.load("en_core_web_sm")

# Example sentence
doc = nlp("The quick brown foxes are jumping over the lazy dogs.")

# Lemmatization
lemmas = [token.lemma_ for token in doc]
print(lemmas)


['the', 'quick', 'brown', 'fox', 'be', 'jump', 'over', 'the', 'lazy', 'dog', '.']


In **Natural Language Processing (NLP)**, **morphology** refers to the study of the internal structure and formation of words. It involves analyzing how words are built from smaller meaningful units called **morphemes**, which are the smallest grammatical units in a language.

### Key Concepts in Morphology:
1. **Morphemes**:
   - The smallest meaningful units in a language.
   - Examples:  
     - "Unbreakable" → "un-" (prefix, negates meaning) + "break" (root) + "-able" (suffix, means "capable of").
     - "Cats" → "cat" (root) + "-s" (plural suffix).

2. **Types of Morphemes**:
   - **Free Morphemes**: Can stand alone as words (e.g., "book", "run").
   - **Bound Morphemes**: Must attach to other morphemes (e.g., "-ing", "un-").

3. **Morphological Processes**:
   - **Inflection**: Modifies a word to fit grammatical rules without changing its core meaning (e.g., "run" → "runs", "running").
   - **Derivation**: Creates new words with new meanings (e.g., "happy" → "unhappy", "quick" → "quickly").
   - **Compounding**: Combining words to form new ones (e.g., "blackboard", "sunflower").
   - **Cliticization**: Attaching reduced forms of words (e.g., "I'm" → "I" + "am").

4. **Morphological Analysis in NLP**:
   - **Tokenization**: Splitting text into words or subwords.
   - **Stemming**: Reducing words to their base form (e.g., "running" → "run").
   - **Lemmatization**: Converting words to their dictionary form (e.g., "better" → "good").
   - **Morphological Segmentation**: Breaking words into morphemes (useful for agglutinative languages like Turkish or Finnish).

### Importance in NLP:
- Helps in **text normalization**, **machine translation**, **spell checking**, and **information retrieval**.
- Improves handling of **rare or unseen words** by breaking them into morphemes.
- Essential for languages with **rich morphology** (e.g., Arabic, Persian, German, Turkish).
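
As a rough illustration of morphological segmentation, the sketch below peels one known prefix and one known suffix off a word. The affix lists and the `segment` helper are hypothetical and deliberately tiny; a rule this naive over-segments words like "read" (which is not "re-" + "ad"), which is why real systems rely on lexicons or learned models.

```python
PREFIXES = ("un", "re")
SUFFIXES = ("able", "ing", "s")


def segment(word):
    """Peel at most one known prefix and one known suffix off a word."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    # Only strip a suffix if at least three characters of root remain.
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) - len(s) >= 3), ""
    )
    root = rest[: len(rest) - len(suffix)] if suffix else rest

    parts = []
    if prefix:
        parts.append(prefix + "-")
    parts.append(root)
    if suffix:
        parts.append("-" + suffix)
    return parts


for word in ["unbreakable", "cats", "book"]:
    print(word, "->", segment(word))
```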

In [3]:
import numpy as np

nlp = spacy.load("en_core_web_sm")
# Note: the stray "'" after "branches." is part of the input and appears as its own token below.
text = "The geese's feet were hurting as they ran quickly through the leaves, avoiding fallen branches. '" \
"She thought the mice had hidden better in previous searches, but her analysis showed otherwise."

doc = nlp(text)
lemmas = [token.lemma_ for token in doc]

# Print the lemmas as 6 rows; reshape only works when the token count is divisible by 6.
lemmas = np.reshape(lemmas, (6, len(lemmas) // 6))
for lem in lemmas:
    print(lem)


['the' 'goose' "'s" 'foot' 'be' 'hurt']
['as' 'they' 'run' 'quickly' 'through' 'the']
['leave' ',' 'avoid' 'fall' 'branch' '.']
["'" 'she' 'think' 'the' 'mouse' 'have']
['hide' 'well' 'in' 'previous' 'search' ',']
['but' 'her' 'analysis' 'show' 'otherwise' '.']


***Analysis:***

'leave'    ❌

    "leaves" → "leave": Incorrect!

        Expected: "leaf" (noun, "tree leaves").

        Why? spaCy likely misinterpreted "leaves" as the verb ("she leaves the room"). Context matters!

'well'    ❓

    "better" → "well" (technically correct, but contextually "good" might be expected).

        "Better" is the comparative of both "good" and "well". spaCy defaults to "well".

***Key Observations:***

    Ambiguity Errors:

        "leaves" → "leave" (incorrect due to noun/verb ambiguity).

            Fix: Override the lemma for this specific word only (e.g., "leaf" if token.lower_ == "leaves" else token.lemma_), since a condition on the POS tag alone would rewrite every verb in the sentence.

    Comparative Adjectives:

        "better" → "well" (not "good"):

            This is technically correct but may not fit the semantic intent ("good" is the base form for "better" in many contexts).

    Possessive ('s):

        Tokenized separately ("goose" + "'s"), which is standard in NLP pipelines.

***How to Improve Accuracy:***

    Override Known Errors

`lemmas = ["leaf" if token.lower_ == "leaves" else token.lemma_ for token in doc]`

(A manual override for this sentence; it would be wrong for the verb in "she leaves the room".)

***Custom Rules:***
Add exceptions for words like "better" if you prefer "good":

`lemma = "good" if token.text == "better" else token.lemma_` 


<p style="color:red"><b>Neither override can be automated in general: both depend on knowing the intended sense in advance.</b></p>
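
One way to keep such manual fixes manageable is a small override table keyed by both the word and its part of speech, so that a noun fix does not touch the verb. This is a sketch under assumptions: `LEMMA_OVERRIDES` and `apply_overrides` are hypothetical names, the input triples stand in for `(token.text, token.pos_, token.lemma_)` from spaCy, and the override only fires when the tagger assigned the expected POS in the first place.

```python
# Hypothetical override table keyed by (lowercased text, coarse POS tag).
LEMMA_OVERRIDES = {
    ("leaves", "NOUN"): "leaf",
    ("better", "ADJ"): "good",
}


def apply_overrides(tokens):
    """tokens: iterable of (text, pos, lemma) triples from an NLP pipeline."""
    return [
        LEMMA_OVERRIDES.get((text.lower(), pos), lemma)
        for text, pos, lemma in tokens
    ]


sample = [
    ("leaves", "NOUN", "leave"),  # mis-lemmatized noun, gets corrected
    ("leaves", "VERB", "leave"),  # genuine verb, left untouched
    ("better", "ADV", "well"),    # adverb use, left untouched
    ("better", "ADJ", "well"),    # adjective use, mapped to "good"
]
print(apply_overrides(sample))
```

The table still has to be written by hand, so this organizes the exceptions rather than automating them.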

**Reference:**

* Finkel, J. R., Grenager, T., & Manning, C. D. (2005). "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling." *Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL)*, 2005.

  This paper addresses information extraction and named entity recognition rather than lemmatization itself; it is listed here as background on a downstream task that lemmatization is often combined with.

**Link:** [ACL 2005 Paper (Finkel et al.)](https://aclanthology.org/P05-1045/)

---

## Stemming 

<b>Definition:</b>
Stemming is the process of reducing a word to its root form by cutting off prefixes or suffixes. 

Unlike lemmatization, stemming does not consider grammar or context and might produce non-real words.

Example:
Words: "running", "runs" → Stem: "run" (a cruder stemmer might produce "runn"; the Porter stemmer leaves "runner" unchanged)

Stemmers often produce more aggressive cuts (e.g., "studies" → "studi").
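
To show what "cutting off suffixes" means mechanically, here is a deliberately crude suffix stripper. It is not the Porter algorithm: the three rules in `SUFFIX_RULES` and the `toy_stem` name are illustrative assumptions, and Porter applies many more steps (for example, it also undoubles the final consonant so that "running" ends up as "run").

```python
# (suffix, replacement) pairs, tried in order; the first match wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]


def toy_stem(word):
    """Strip one suffix if at least three characters of stem remain."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


for word in ["running", "runs", "studies", "runner", "is"]:
    print(word, "->", toy_stem(word))
```

The results "runn" and "studi" are not dictionary words, which is the trade-off stemming accepts in exchange for speed and simplicity.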


<b>Using NLTK (Traditional Method)</b>

For practicing traditional stemming methods, you can use NLTK's PorterStemmer. While it's less accurate than lemmatization, it's still widely used for tasks where high precision isn't as important.

In [4]:
from nltk.stem import PorterStemmer

# Initialize stemmer
stemmer = PorterStemmer()

words = ["running", "runner", "ran", "happily"]
stems = [stemmer.stem(word) for word in words]
print(stems)


['run', 'runner', 'ran', 'happili']


***Stemming vs. lemmatization***

Stemming and lemmatization are both text normalization techniques in NLP, but they operate differently.

🔹 Stemming

Reduces a word to its root form by chopping off suffixes.

It may not produce real words.

Typically faster and more aggressive.

Example: running, runs → run

🔹 Lemmatization

Reduces a word to its dictionary base form (lemma) using vocabulary and grammar.

Produces valid words.

Requires POS (part-of-speech) tagging for accuracy.

Example: better → good, running → run

In [6]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Download required data (only once)
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /home/abdh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/abdh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/abdh/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:

# Sample words
words = ["running", "flies", "better", "studies", "children"]

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("Word\t\tStemmed\t\tLemmatized")
print("-" * 40)
for word in words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word)  # Default is noun lemmatization
    print(f"{word:<12}{stemmed:<12}{lemmatized:<12}")

Word		Stemmed		Lemmatized
----------------------------------------
running     run         running     
flies       fli         fly         
better      better      better      
studies     studi       study       
children    children    child       


***Note:***

stemmer.stem() removes suffixes mechanically.

lemmatizer.lemmatize() returns valid dictionary words,

but you can improve accuracy by specifying POS (e.g., lemmatize(word, pos='v')).

In [9]:
out_ = lemmatizer.lemmatize("running", pos="v")  # Output: run

print(out_)

out_ = lemmatizer.lemmatize("better", pos="a")   # Output: good (with WordNet enhancement)

print(out_)


run
good



In lemmatization, the `pos` parameter stands for **Part of Speech**, which tells the lemmatizer what **grammatical role** the word plays (e.g., verb, noun, adjective). This is important because the same word can have different lemmas depending on its role in a sentence.

Here are common POS tags used in `nltk.WordNetLemmatizer`:

| POS Tag | Meaning         | Example         | Lemmatized Form |
|---------|------------------|------------------|------------------|
| `'n'`   | Noun             | `cars` → `car`   | `car`            |
| `'v'`   | Verb             | `running` → `run`| `run`            |
| `'a'`   | Adjective        | `better` → `good`| `good`           |
| `'r'`   | Adverb (less common) | `quickly` → `quickly` | `quickly`   |


Without the correct POS, the lemmatizer **assumes the word is a noun** by default. That can lead to incorrect results.

In [10]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Default (noun)
print(lemmatizer.lemmatize("running"))           # ➜ running
# With verb POS
print(lemmatizer.lemmatize("running", pos='v'))  # ➜ run

running
run


> Notice: Without `pos='v'`, `"running"` doesn't get reduced to `"run"`.

### Summary of POS Tags for Lemmatization

| Symbol | Full Part of Speech | Used For            |
|--------|---------------------|----------------------|
| `'n'`  | Noun                | `cat`, `house`, `idea` |
| `'v'`  | Verb                | `run`, `is`, `studying` |
| `'a'`  | Adjective           | `big`, `better`, `beautiful` |
| `'r'`  | Adverb              | `quickly`, `silently` |
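
In practice the POS usually comes from a tagger rather than being typed by hand. The sketch below maps Penn Treebank style tags (the tag set `nltk.pos_tag` returns by default) to the four letters in the table above. The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for everything else is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (defaulting to noun)."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "r"
    return "n"                 # nouns and anything unrecognized


for tag in ["VBG", "JJR", "RB", "NNS", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With such a helper, a call like `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` can be applied to every tagged token instead of hard-coding `pos='v'`.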

`More practice:` Evaluate how stemming affects the accuracy of downstream tasks such as classification.

**Reference:**

* Porter, M. F. (1980). "An Algorithm for Suffix Stripping." *Program*, 14(3), 130-137.

  This is the foundational paper for the **Porter Stemmer**, one of the most widely used 
  
  stemming algorithms. It explains how stemming works by stripping off word suffixes.

**Link:** [Porter 1980 Stemming Paper](https://www.aclweb.org/anthology/J80-2003/)

---

## Stopwords

<b>Definition:</b> Stopwords are common words that are often removed from text during preprocessing because they carry little meaningful information for tasks like classification or search.

Examples:
"the", "is", "in", "and", "to", "of", "a", "on"

These are filtered out to reduce `noise` and focus on meaningful content.


 ***Stopwords removal using nltk (Classic Approach)***

Stopwords removal is a simple but effective preprocessing step. You can use NLTK's list of stopwords for English, or create custom lists based on your task.

In [1]:
from nltk.corpus import stopwords

# Load NLTK's stopword list
stop_words = set(stopwords.words('english'))

# Example sentence
sentence = "This is an example sentence with stopwords."

# Remove stopwords
filtered_sentence = [word for word in sentence.split() if word.lower() not in stop_words]
print(filtered_sentence)


['example', 'sentence', 'stopwords.']
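
Because the cell above splits on whitespace, the final token keeps its trailing period (`'stopwords.'`). A minimal sketch of one way to avoid that, assuming a regex tokenizer and a small hand-written stopword set (`STOP_WORDS` here is illustrative and much shorter than NLTK's English list):

```python
import re

# Small illustrative stopword set; NLTK's English list is much longer.
STOP_WORDS = {"this", "is", "an", "with", "the", "a", "of", "and", "to", "in", "on"}


def remove_stopwords(sentence, stop_words=STOP_WORDS):
    # Keep only alphabetic runs (plus apostrophes), which drops punctuation.
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return [tok for tok in tokens if tok.lower() not in stop_words]


print(remove_stopwords("This is an example sentence with stopwords."))
```

Tokenizing before filtering also prevents a stopword followed by punctuation (such as "is,") from slipping through the lookup.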


**Reference:**

* Luhn, H. P. (1958). "The Automatic Creation of Literature Abstracts." *IBM Journal of Research and Development*, 2(2), 159-165.

  Luhn discusses the role of stopwords in information retrieval systems and introduces methods for filtering out irrelevant words to focus on content-bearing terms.

**Link:** [Luhn 1958 Paper on Stopwords](https://ieeexplore.ieee.org/document/5393933)

---

## Named Entity Recognition (NER)

Named Entity Recognition (NER) allows us to detect entities like person names, organizations, dates, etc., in the text. spaCy provides a fast and efficient NER model.

***Named Entity Recognition (NER) Using spaCy***

***Example (1)***

In [2]:
import spacy

# Load pretrained spaCy model for NER
nlp = spacy.load("en_core_web_sm")

# Example sentence with entities
doc = nlp("Apple is looking to buy a startup in the UK for $1 billion.")

# Extract named entities
for ent in doc.ents:
    print(ent.text, ent.label_)


Apple ORG
UK GPE
$1 billion MONEY


***Example (2)***

In [2]:
import spacy

# Load spaCy's large English model (better accuracy)
nlp = spacy.load("en_core_web_lg")

# Example: a longer passage with many entities of different types
text = """
On April 25, 2024, OpenAI CEO Sam Altman visited the European Parliament in Brussels to discuss AI policy with Ursula von der Leyen.
Later, he met with Elon Musk at the Tesla headquarters in Palo Alto, California.
"""

# Process the text
doc = nlp(text)

# Print entities with labels and their explanations
for ent in doc.ents:
    print(f"{ent.text:<30}  {ent.label_:<10}  {spacy.explain(ent.label_)}")


April 25, 2024                  DATE        Absolute or relative dates or periods
Sam Altman                      PERSON      People, including fictional
the European Parliament         ORG         Companies, agencies, institutions, etc.
Brussels                        GPE         Countries, cities, states
Ursula von der                  PERSON      People, including fictional
Leyen                           GPE         Countries, cities, states
Elon Musk                       PERSON      People, including fictional
Tesla                           ORG         Companies, agencies, institutions, etc.
Palo Alto                       GPE         Countries, cities, states
California                      GPE         Countries, cities, states


This example shows several strengths and limits of spaCy's NER:

| Feature | Explanation |
|---------|-------------|
| Multiple Entity Types | spaCy identifies several types, including PERSON, ORG, GPE, and DATE. |
| Adjacent Entities | In "the European Parliament in Brussels", the ORG and the GPE sit side by side and are extracted as separate spans. The spans in `doc.ents` do not nest or overlap. |
| Ambiguity Handling | "Tesla" can refer to a person or a company. From context, spaCy labels it ORG. |
| Segmentation Errors | "Ursula von der Leyen" is split into "Ursula von der" (PERSON) and "Leyen" (GPE), and "OpenAI" is not tagged at all, so the output still needs checking. |

***Reference:***

"spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing"

Authors: Matthew Honnibal and Ines Montani

Conference: To appear in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Abstract: This paper presents the architecture behind spaCy, including its NER system, which uses a transition-based parser and deep learning (CNNs) to accurately detect named entities.


**Link:** [Read the Paper (via arXiv)](https://arxiv.org/abs/1907.08237)

***Example (3)***

In [6]:
import spacy

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

# Complex sentence with multiple entity types
text = ("On January 15th, 2023, OpenAI announced a strategic partnership "
        "with Microsoft in San Francisco, investing $10 billion to accelerate "
        "AI research and deployment worldwide.")

# Process the text
doc = nlp(text)

# Display named entities
for ent in doc.ents:
    print(f"{ent.text:<25} | {ent.label_:<10} | {spacy.explain(ent.label_)}")


January 15th, 2023        | DATE       | Absolute or relative dates or periods
OpenAI                    | ORG        | Companies, agencies, institutions, etc.
Microsoft                 | ORG        | Companies, agencies, institutions, etc.
San Francisco             | GPE        | Countries, cities, states
$10 billion               | MONEY      | Monetary values, including unit
AI                        | GPE        | Countries, cities, states


***Analysis:***

DATE: Recognizes the specific date when the event occurred.

ORG: Detects organizations involved in the event, i.e., OpenAI and Microsoft.

GPE (Geo-Political Entity): Correctly detects San Francisco as a city.

MONEY: Detects a large investment amount: $10 billion.

Errors: "AI" is labeled GPE, which is incorrect, and "worldwide" is not tagged as an entity at all.

This example shows how NER models capture structured information (like date, money, orgs, locations) from unstructured text, which is crucial for tasks like:

Information extraction

Knowledge base population

Event detection

Question answering
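
As a small illustration of the information extraction step, the sketch below turns the `(text, label)` pairs printed in Example (3) into a record grouped by entity type. The `group_by_label` helper is hypothetical; the pairs are copied from the output above, including the questionable `AI` / `GPE` prediction.

```python
from collections import defaultdict

# (text, label) pairs copied from the Example (3) output above.
entities = [
    ("January 15th, 2023", "DATE"),
    ("OpenAI", "ORG"),
    ("Microsoft", "ORG"),
    ("San Francisco", "GPE"),
    ("$10 billion", "MONEY"),
    ("AI", "GPE"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


record = group_by_label(entities)
print(record)
```

The mislabeled "AI" ends up in the GPE bucket next to "San Francisco", which shows why extracted records usually need validation before they populate a knowledge base.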

***Reference***

The underlying method spaCy uses is inspired by statistical and neural models for sequence labeling. A foundational and widely cited scientific paper is:
🔹 Lample et al., 2016

"Neural Architectures for Named Entity Recognition"

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer

Proceedings of NAACL 2016

Abstract Summary: This paper proposes a deep learning model using Bidirectional LSTMs and CRF (BiLSTM-CRF) for NER, achieving state-of-the-art performance without hand-crafted features.

**Links:**
 
[ACL Anthology](https://aclanthology.org/N16-1030/)

[ arXiv preprint](https://arxiv.org/abs/1603.01360)


***Named Entity Recognition (NER) Using BERT***

***Example (4)***

In [None]:
# uncomment this line if needed
#%pip install transformers datasets torch

***Load NER model (online)***

In [4]:
from transformers import pipeline

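# Load transformer-based NER model
# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")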

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [5]:
# Test sentence
text = ("On January 15th, 2023, OpenAI announced a strategic partnership "
        "with Microsoft in San Francisco, investing $10 billion to accelerate "
        "AI research and deployment worldwide.")

# Run NER
results = ner(text)

# Print results
for entity in results:
    print(f"{entity['word']:20} {entity['entity_group']:10} Score: {entity['score']:.2f}")


OpenAI               ORG        Score: 1.00
Microsoft            ORG        Score: 1.00
San Francisco        LOC        Score: 1.00
AI                   MISC       Score: 0.99


In [None]:
# Test sentence
text = "Mr. Worldwide performed in Miami and Lisbon during his tour."

# Run NER
results = ner(text)

# Print results
for entity in results:
    print(f"{entity['word']:20} {entity['entity_group']:10} Score: {entity['score']:.2f}")

Mr                   PER        Score: 0.84
Worldwide            PER        Score: 0.44
Miami                LOC        Score: 1.00
Lisbon               LOC        Score: 1.00
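
The pipeline returns whole entities because it merges per-token predictions. The sketch below illustrates the idea with BIO-style tags and `##` continuation pieces. It is a simplified approximation, not the library's implementation: `group_entities`, the example pieces, and the way "OpenAI" is split are all hypothetical.

```python
def group_entities(tagged_pieces):
    """Merge (piece, tag) pairs into entities using B-/I- tags and '##' pieces."""
    groups = []
    current = None  # entity being built, or None when outside an entity
    for piece, tag in tagged_pieces:
        if tag == "O":
            current = None  # an untagged token closes the running entity
            continue
        prefix, label = tag.split("-", 1)
        if piece.startswith("##") and current is not None:
            current["word"] += piece[2:]        # glue subword onto previous piece
        elif prefix == "I" and current is not None and current["label"] == label:
            current["word"] += " " + piece      # same entity, next word
        else:
            current = {"word": piece, "label": label}
            groups.append(current)
    return groups


# Hypothetical token-level predictions for part of the test sentence.
pieces = [
    ("Open", "B-ORG"), ("##AI", "I-ORG"), ("announced", "O"),
    ("San", "B-LOC"), ("Francisco", "I-LOC"),
]
print(group_entities(pieces))
```

Under this scheme any token tagged `O` between two entity tokens splits them apart, which is one plausible reason "Mr" and "Worldwide" appear as two separate PER entities above (the period in "Mr." may not have been tagged as part of the name).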


## Text Classification Using Pretrained BERT (Fine-tuning)

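You can fine-tune a pretrained BERT model on a text classification task. This involves training the model on a labeled dataset for a specific task, such as sentiment analysis or topic categorization.

Code Sample (Fine-tuning): note that the sample below demonstrates the `Trainer` workflow on GPT-2 with a causal language modeling objective rather than on a BERT classifier.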

In [3]:
#%pip install accelerate>=0.26.0
%pip install --upgrade transformers


Note: you may need to restart the kernel to use updated packages.


In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import torch

# Load GPT-2 and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 doesn't have a pad token; use eos_token as a workaround
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Load a small dataset (replace with your dataset)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split='train[:1%]')

# Tokenization function
def tokenize_function(example):
    # For causal LM the labels are a copy of the input ids (the model shifts them internally)
    tokens = tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Training arguments
# `evaluation_strategy` was renamed `eval_strategy` in newer transformers releases,
# where the old name raises a TypeError. Evaluation is off by default, so it is omitted.
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=10,
    save_steps=50,
    save_total_limit=2,
    remove_unused_columns=False,
    fp16=torch.cuda.is_available(),  # use mixed precision if possible
)

# Trainer
# The inputs are already padded to max_length, so the default collator is enough;
# the `tokenizer=` argument is deprecated in recent releases and is left out.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Train the model
trainer.train()
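
The tokenization function above sets `labels` to a copy of `input_ids`. That works for causal language modeling because the model shifts the labels by one position internally, so each position is trained to predict the next token. A tiny sketch of that pairing, using made-up token ids:

```python
input_ids = [464, 2068, 7586, 21831]  # hypothetical token ids
labels = list(input_ids)              # same copy as in tokenize_function

# Position t is scored against the token at position t + 1.
pairs = list(zip(input_ids[:-1], labels[1:]))
print(pairs)
```

Each pair reads as "given this token (and everything before it), predict that one"; the last token has no target, and the first token is never predicted.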

## Question Answering (QA)

Using Hugging Face's Pretrained Models

Another exciting task to explore is question answering using pretrained models like BERT or T5.

You can use models trained on QA tasks like SQuAD to answer questions based on given context.
Code Sample:

In [None]:

from transformers import pipeline

# Load pretrained question answering pipeline
qa_pipeline = pipeline("question-answering")

# Example context and question
context = "I am a data science graduate, math teacher, and book lover. " \
"I am passionate about combining advanced analytics with educational excellence to create innovative learning experiences. " \
"With expertise in both fields, I strive to make complex concepts accessible and engaging for students, " \
"and to improve educational outcomes using data-driven approaches."
question1 = "What is my purpose using datascience approaches?"

# Get answer
result = qa_pipeline(question=question1, context=context)
print(result)

question2 = "What is my job?"

# Get answer
result = qa_pipeline(question=question2, context=context)
print(result)

question3 = "Which is my main job? data science or math teacher"

# Get answer
result = qa_pipeline(question=question3, context=context)
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.8413013219833374, 'start': 292, 'end': 320, 'answer': 'improve educational outcomes'}
{'score': 0.29887500405311584, 'start': 7, 'end': 42, 'answer': 'data science graduate, math teacher'}
{'score': 0.2886771857738495, 'start': 48, 'end': 58, 'answer': 'book lover'}


**Analysis** 

<b>Code Summary</b>

This loads a **pretrained QA model**, here the pipeline default `distilbert/distilbert-base-cased-distilled-squad` (as the log above shows), fine-tuned on the **SQuAD dataset** (Stanford Question Answering Dataset). It can extract answers from a given context based on a natural-language question.


<b>Input Context</b>

You describe yourself as:

* A **data science graduate**, **math teacher**, and **book lover**.
* Passionate about combining **analytics** and **education**.
* Aiming to make **learning more engaging** and **outcome-focused** using **data**.



***Output Analysis***

🔹 Question 1:

**"What is my purpose using datascience approaches?"**

```json
{'score': 0.841, 'answer': 'improve educational outcomes'}
```

**Interpretation**:

* The model correctly identifies our **goal**: using data science to **improve educational outcomes**.
* High confidence (`score: 0.84`) indicates strong match.

✅ **Good answer**, aligned with our intent.

🔹 Question 2:

**"What is my job?"**

```json
{'score': 0.298, 'answer': 'data science graduate, math teacher'}
```

**Interpretation**:

* It correctly picks up **professions/roles** from the context.
* The score is **low (0.29)** → the model is **uncertain**, probably because:

  * The context lists multiple roles (graduate, teacher, book lover).
  * "Job" is ambiguous: Are you asking for **current** job, **main** job, or **background**?

**Acceptable answer**, but not confident.

🔹 Question 3:

**"Which is my main job? data science or math teacher"**

```json
{'score': 0.288, 'answer': 'book lover'}
```

**Interpretation**:

* This is a **multi-option question**, requiring **comparison/reasoning**, which the QA model isn’t trained to handle well.
* It extracts **'book lover'**, which isn't even a valid choice.
* The **low score (0.28)** shows the model is **guessing**.

**Incorrect answer** due to:

* Question being outside the model’s capability (it doesn't reason between alternatives well).
* No clear signal in context for "main" job.

---

***Overall Observations***

| Question                    | Answer                                | Score | Quality |
| --------------------------- | ------------------------------------- | ----- | ------- |
| Purpose of data science use | `improve educational outcomes`        | 0.84  | ✅ Good  |
| What is my job?             | `data science graduate, math teacher` | 0.29  | 🟡 Okay |
| Which is my main job?       | `book lover`                          | 0.28  | 🔴 Poor |
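
Since the two weak answers both score below 0.3, one simple safeguard is to reject low-confidence answers instead of displaying them. A minimal sketch using the scores from the output above; the `MIN_SCORE` cut-off of 0.5 and the `accept` helper are hypothetical and would need tuning on real data.

```python
# Scores and answers copied (rounded) from the pipeline output above.
results = [
    {"score": 0.8413, "answer": "improve educational outcomes"},
    {"score": 0.2989, "answer": "data science graduate, math teacher"},
    {"score": 0.2887, "answer": "book lover"},
]

MIN_SCORE = 0.5  # hypothetical cut-off; tune on your own data


def accept(result, min_score=MIN_SCORE):
    """Return the answer text, or None when the model is not confident enough."""
    return result["answer"] if result["score"] >= min_score else None


print([accept(r) for r in results])
```

A threshold cannot make the model reason better, but it keeps a clearly wrong span such as "book lover" from being presented as an answer.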

---

*Insights*

1. **Extractive QA models select text spans**; they don’t *infer* or *reason* well.
2. Questions like “which is better?” or “what is the main…” often need **reasoning**, which basic models like `distilbert` can’t handle.
3. Context clarity matters. More explicit job role info would help.
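
The first insight can be made concrete. An extractive model typically assigns each context token a start score and an end score, and the answer is the span whose combined score is highest. The sketch below shows that selection with made-up numbers; `best_span`, the scores, and the `max_len` limit are illustrative assumptions, not values from DistilBERT.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return ((start, end), score) for the highest-scoring valid span."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        # The end index must not precede the start or exceed the length limit.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


tokens = ["I", "improve", "educational", "outcomes", "daily"]
start = [0.1, 2.5, 0.3, 0.2, 0.0]  # made-up start scores
end = [0.0, 0.2, 0.4, 2.8, 0.1]    # made-up end scores

(i, j), score = best_span(start, end)
print(" ".join(tokens[i:j + 1]))
```

Because the answer is always a slice of the context, a question whose answer requires comparing options has no correct span to select, which matches the "book lover" failure above.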

---

***Reference***

The model is based on:

> **DistilBERT**, trained on **SQuAD v1.1**
> (Rajpurkar et al., 2016 – *SQuAD: 100,000+ Questions for Machine Comprehension of Text*)
> [https://aclanthology.org/D16-1264/](https://aclanthology.org/D16-1264/)

---

***Recommendation***

For better performance:

* Use a more advanced model like `deepset/roberta-base-squad2` or `bert-large-uncased-whole-word-masking-finetuned-squad`.
* Structure your context more explicitly (e.g., “My primary job is…”).
* Consider using an **LLM QA system** (like OpenAI’s `gpt-3.5` or `gpt-4`) if reasoning is needed.


## Using GPT for Text Generation

GPT models can generate coherent and contextually relevant text based on a given prompt. 

You can experiment with GPT-style models for tasks like text completion, summarization, and creative writing. The sample below uses the openly available GPT-2 rather than OpenAI's hosted GPT-3 models.

Code Sample (Text Generation):

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pretrained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

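# Encode prompt (returns input_ids and attention_mask)
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text; GPT-2 has no pad token, so reuse the end-of-sequence token
output = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)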

# Decode and print output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
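
With no sampling options set, `generate` typically decodes greedily: at each step it appends the single most likely next token. The loop below sketches that idea with a toy lookup table in place of a neural network. `NEXT_TOKEN_SCORES` and `greedy_generate` are hypothetical, and a real model conditions on the whole prefix rather than only the last token.

```python
# Hypothetical next-token scores for a tiny vocabulary.
NEXT_TOKEN_SCORES = {
    "Once": {"upon": 0.9, "again": 0.1},
    "upon": {"a": 0.8, "the": 0.2},
    "a": {"time": 0.7, "hill": 0.3},
    "time": {"<eos>": 0.6, "there": 0.4},
}


def greedy_generate(prompt_tokens, max_length=10):
    """Repeatedly append the highest-scoring next token until <eos> or max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # the model chose to stop
        tokens.append(next_token)
    return tokens


print(" ".join(greedy_generate(["Once"])))
```

Greedy decoding is deterministic, so the same prompt always yields the same text; sampling-based strategies trade that repeatability for variety.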
