# **Practical Natural Language Processing Use Cases**

## **What's Covered?**

1. Loading the text data
2. Classification
    - What is Classification?
    - Types of Classification
    - Example Usage - Text Classification
    - Example Usage - Token Classification
3. Summarization
    - What is Summarization?
    - Types of Summarization
    - Example Usage - Abstractive Summarization
4. Translation
    - What is Machine Translation?
    - Example Usage - Machine Translation
    - Understanding Language–Script Codes in Multilingual AI Systems
        - Language_Script Code Format
        - Why Script Matters?
        - How the Model Sees This Difference?
        - Common Script Codes
        - Codes for Indian Languages
        - Codes for Popular International Languages
        - Key Takeaway
5. Question Answering
    - What is QA Task?
    - Example Usage - QA Task
6. Text Generation
    - What is Text Generation?
    - Example Usage - Text Generation
7. Fill-Mask
    - What is Fill-Mask Task?
    - Example Usage - Fill Mask
8. Feature Extraction
    - What is Feature Extraction?
    - Example Usage - Feature Extraction
9. Sentence Similarity
    - What is Sentence Similarity?
    - Installing sentence-transformers
    - Example Usage - Sentence Similarity

### **Loading the text data**

In [1]:
with open("text/email.txt") as f:
    email = f.read()

print(email)

Congratulations Alice – Welcome to the GenAI Internship Program!

Dear Alice,

Congratulations! 🎉

We are thrilled to inform you that you have been selected for the GenAI Internship Program, starting on 25th of this month. Your application stood out among thousands, and we’re excited to have you on board as part of this prestigious program.

The official offer letters will be shared with all selected candidates on 20th of this month. Please keep an eye on your inbox and reach out in case you do not receive it by the end of that day.

We look forward to your active participation and can’t wait to see the incredible work you’ll do during this internship!

Best regards,
Program Coordinator
GenAI Internship Team



In [2]:
with open("text/product_review.txt") as f:
    product_review = f.read()

print(product_review)

I bought this product from Flipkart website.
This product is very worst and replacement policy is very bad. Even I went to their New Delhi support center.
I used this laptop only for 30 minute and suddenly it turn off and it will never turn on.
And Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware.


## **Classification**
### **What is Classification?**
Classification is the task of predicting a discrete label for an input sequence using a learned representation of that sequence.

### **Types of Classification**

- **Text classification:** You give the whole sentence to the model. The model gives one label for the entire sentence.
- **Token classification:** You give the sentence to the model. The model gives a label for every word (token) inside that sentence.

| Task                 | What gets classified                  |
| -------------------- | ------------------------------------- |
| Text classification  | The whole sentence / document         |
| Token classification | Each word / token inside the sentence |

### **Example Usage - Text Classification**

Text Classification is the task of assigning a label or class to a given text. Some use cases are sentiment analysis, natural language inference, and assessing grammatical correctness.

In [3]:
# Import pipeline
from transformers import pipeline
import torch

# Specify the inference task
classifier = pipeline(
    task="text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
)

Device set to use mps:0


In [4]:
# Pass the input to the pipeline
classifier(email)

[{'label': 'POSITIVE', 'score': 0.9998281002044678}]

In [5]:
# Pass the input to the pipeline
classifier(product_review)

[{'label': 'NEGATIVE', 'score': 0.9983298182487488}]
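The `score` field in these outputs is a probability. The following is a minimal sketch of how such a score is commonly obtained, assuming the model emits one raw logit per label and a softmax turns the logits into probabilities. The logit values are made up for illustration and are not taken from the run above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the two labels (illustrative values only)
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.2, 4.5]

probs = softmax(logits)

# The predicted label is the one with the highest probability
best = max(range(len(labels)), key=lambda i: probs[i])
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

A score close to 1.0 means the gap between the two logits is large. It does not by itself guarantee that the prediction is correct.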

### **Example Usage - Token Classification**

Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. NER models could be trained to identify specific entities in a text, such as dates, individuals and places; and PoS tagging would identify, for example, which words in a text are verbs, nouns, and punctuation marks.

One interesting use case is **Information Extraction from Invoices**.

You can extract entities of interest from invoices automatically using Named Entity Recognition (NER) models. Invoices can be read with Optical Character Recognition models and the output can be used to do inference with NER models. In this way, important information such as date, company name, and other named entities can be extracted.

In [6]:
# Import pipeline
from transformers import pipeline
import torch

# Specify the inference task
ner_tagger = pipeline(
    task="token-classification",
    model="dslim/bert-base-NER",
    torch_dtype=torch.bfloat16,
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [7]:
# Pass the input to the pipeline
ner_tagger(email)[:4]

[{'entity': 'B-PER',
  'score': 0.945688,
  'index': 3,
  'word': 'Alice',
  'start': 16,
  'end': 21},
 {'entity': 'B-ORG',
  'score': 0.9159807,
  'index': 8,
  'word': 'Gen',
  'start': 39,
  'end': 42},
 {'entity': 'I-ORG',
  'score': 0.8153882,
  'index': 9,
  'word': '##A',
  'start': 42,
  'end': 43},
 {'entity': 'I-ORG',
  'score': 0.7519947,
  'index': 10,
  'word': '##I',
  'start': 43,
  'end': 44}]

In [8]:
# Pass the input to the pipeline
ner_tagger(product_review)[:4]

[{'entity': 'B-ORG',
  'score': 0.9920954,
  'index': 6,
  'word': 'F',
  'start': 27,
  'end': 28},
 {'entity': 'B-ORG',
  'score': 0.6834046,
  'index': 7,
  'word': '##lip',
  'start': 28,
  'end': 31},
 {'entity': 'I-ORG',
  'score': 0.98762983,
  'index': 8,
  'word': '##kar',
  'start': 31,
  'end': 34},
 {'entity': 'I-ORG',
  'score': 0.94635123,
  'index': 9,
  'word': '##t',
  'start': 34,
  'end': 35}]
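The output above splits `Flipkart` into four word pieces (`F`, `##lip`, `##kar`, `##t`). The sketch below shows one way to merge such pieces back into whole entities. `merge_word_pieces` is a hypothetical helper written for this example, and its merging rule (attach `##` pieces and `I-` tokens to the previous entity of the same type) is an assumption, not the library's own algorithm.

```python
def merge_word_pieces(tokens):
    """Merge consecutive word pieces of the same entity type into one span."""
    entities = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "B-ORG" -> "ORG"
        piece = tok["word"]
        is_subword = piece.startswith("##")
        continues = is_subword or tok["entity"].startswith("I-")
        last = entities[-1] if entities else None
        if last is not None and continues and last["label"] == label:
            # Sub-word pieces attach directly; a new word gets a space
            last["word"] += piece[2:] if is_subword else " " + piece
            last["end"] = tok["end"]
            last["scores"].append(tok["score"])
        else:
            entities.append({
                "label": label,
                "word": piece,
                "start": tok["start"],
                "end": tok["end"],
                "scores": [tok["score"]],
            })
    # Report the mean score of the merged pieces
    for ent in entities:
        scores = ent.pop("scores")
        ent["score"] = sum(scores) / len(scores)
    return entities


# Tokens copied from the product review output above
tokens = [
    {"entity": "B-ORG", "score": 0.9920954, "word": "F", "start": 27, "end": 28},
    {"entity": "B-ORG", "score": 0.6834046, "word": "##lip", "start": 28, "end": 31},
    {"entity": "I-ORG", "score": 0.98762983, "word": "##kar", "start": 31, "end": 34},
    {"entity": "I-ORG", "score": 0.94635123, "word": "##t", "start": 34, "end": 35},
]

merged = merge_word_pieces(tokens)
print(merged)
```

The token classification pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping inside the pipeline. Its exact behavior depends on the strategy chosen, so check the documentation for your installed version.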

## **Summarization**

### **What is Summarization?**
Summarization is a **sequence-to-sequence generative task** where a model learns to generate a shorter, semantically faithful text conditioned on a longer input document. 

Summarization requires:
- information selection
- redundancy removal
- content compression
- coherence preservation
- rephrasing

In simple terms, summarization is the task of producing a shorter version of a document while preserving its important information. Some models can extract text from the original input, while other models can generate entirely new text.


### **Types of Summarization**
- **Extractive summarization:** The summary is built by selecting and reusing parts of the original text. Technically, this can be framed as a sentence-level or token-level selection problem.
- **Abstractive summarization:** The summary is newly generated text. It may contain words or phrases not present in the original document. Modern LLM-based summarization is almost always abstractive summarization.
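To make the extractive framing concrete, here is a minimal sketch that treats summarization as sentence selection. It scores each sentence by the average frequency of its words in the document and keeps the top ones. The scoring rule, the `extractive_summary` helper, and the sample document are all assumptions chosen for illustration, not a method used by the models in this notebook.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


# Illustrative document (not one of the notebook's text files)
document = (
    "The laptop shut down after thirty minutes. "
    "The seller refused to replace the laptop. "
    "I visited the support center. "
    "The support center could not repair the laptop."
)

summary = extractive_summary(document, num_sentences=2)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model is free to rephrase.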

### **Example Usage - Abstractive Summarization**

In [9]:
# Import pipeline
from transformers import pipeline
import torch

# Specify the inference task
summarizer = pipeline(
    task="summarization",
    model="facebook/bart-large-cnn",
    torch_dtype=torch.bfloat16
)

Device set to use mps:0


In [10]:
# Pass the input to the pipeline
summarizer(email)

[{'summary_text': 'Alice has been selected for the GenAI Internship Program. Her internship will start on 25th of this month. The official offer letters will be shared with all selected candidates on 20th of the month. We can’t wait to see the incredible work you’ll do during this internship!'}]

In [11]:
# Pass the input to the pipeline
summarizer(product_review, max_length=60)

[{'summary_text': 'I used this laptop only for 30 minute and suddenly it turn off and it will never turn on. Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware. Even I went to their New Delhi support center. I bought this product'}]

## **Translation**

### **What is Machine Translation?**
Machine translation is a sequence-to-sequence learning task where a neural model encodes a source sentence and autoregressively decodes a target sentence in another language conditioned on the encoded source representation.

### **Example Usage - Machine Translation**

In [12]:
# Import pipeline
from transformers import pipeline
import torch

# Specify the inference task
translator = pipeline(
    task="translation",
    model="facebook/nllb-200-distilled-600M",
    torch_dtype=torch.bfloat16
) 

# NLLB: No Language Left Behind: 'nllb-200-distilled-600M'

Device set to use mps:0


In [13]:
# Pass the input to the pipeline
text_translated = translator(
    email, 
    src_lang="eng_Latn",
    tgt_lang="hin_Deva"
)

In [14]:
print(text_translated)

[{'translation_text': 'बधाई एलिस  जीएनएआई इंटर्नशिप प्रोग्राम में आपका स्वागत है! प्रिय एलिस, बधाई!  हमें आपको यह बताते हुए खुशी हो रही है कि आपको इस महीने की 25 तारीख से शुरू होने वाले जीएनएआई इंटर्नशिप प्रोग्राम के लिए चुना गया है। आपके आवेदन हजारों में से बाहर खड़े हुए, और हम आपको इस प्रतिष्ठित कार्यक्रम के हिस्से के रूप में बोर्ड पर रखने के लिए उत्साहित हैं। आधिकारिक ऑफर पत्र इस महीने की 20 तारीख

### **Understanding Language–Script Codes in Multilingual AI Systems**

When working with multilingual NLP and Generative AI systems, you will often encounter language identifiers such as:

* `hin_Deva`
* `kan_Knda`
* `kas_Arab`
* `kas_Deva`
* `eng_Latn`

These identifiers follow a standard and production-friendly convention that helps AI systems correctly interpret how a language is written.

> **Important:**
> To explore additional supported languages and codes, refer to the official FLORES+ language coverage page hosted on
> **Hugging Face**:
> [https://huggingface.co/datasets/openlanguagedata/flores_plus#language-coverage](https://huggingface.co/datasets/openlanguagedata/flores_plus#language-coverage)

#### **Language_Script Code Format**

All such identifiers follow the same structure:

```text
<language>_<script>
```

* The **language** part specifies the language.
* The **script** part specifies the writing system used for that language.

This distinction is critical in real-world multilingual and GenAI pipelines.
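A minimal sketch of splitting such an identifier into its two parts. The pattern (three lowercase letters, an underscore, then a four-letter capitalized script code) is inferred from the examples in this section, and `parse_lang_script` is a hypothetical helper, not part of any library.

```python
import re
from typing import NamedTuple


class LangScript(NamedTuple):
    language: str  # three-letter language code, e.g. "hin"
    script: str    # four-letter script code, e.g. "Deva"


# Assumed shape of the identifiers shown in this section
_CODE_PATTERN = re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")


def parse_lang_script(code: str) -> LangScript:
    """Split '<language>_<script>' into its two components."""
    match = _CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a <language>_<script> code: {code!r}")
    return LangScript(*match.groups())


hindi = parse_lang_script("hin_Deva")
kashmiri_arabic = parse_lang_script("kas_Arab")
kashmiri_devanagari = parse_lang_script("kas_Deva")

print(hindi)
# Same language, different writing systems
print(kashmiri_arabic.language == kashmiri_devanagari.language)
print(kashmiri_arabic.script == kashmiri_devanagari.script)
```

Validating the code before passing it as `src_lang` or `tgt_lang` catches typos such as `hin-Deva` or `Hin_deva` early.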

#### **Why Script Matters?**

The `script` tells us **how a particular language is written**.

For example, consider the following two sentences:

```text
मुझे किताब चाहिए
```

```text
mujhe kitaab chahiye
```

Both sentences are in **Hindi**, but:
* the first is written in **Devanagari script** (`Deva`)
* the second is written in **Latin script** (`Latn`)

So, conceptually:
* `hin_Deva` → Hindi written in Devanagari
* `hin_Latn` → Hindi written in Latin characters (romanized form)

#### **How the Model Sees This Difference?**

From the model’s point of view:
* the **character set is completely different**
* the **token IDs are completely different**
* the **subword segmentation is completely different**

Even though humans perceive both examples as the same language, the model treats them as very different inputs.
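The character-level difference can be checked with the Python standard library alone. The sketch below reads the script from each character's Unicode name, which is a simple heuristic chosen for this example (`scripts_used` is a hypothetical helper). It does not reproduce any model's tokenizer.

```python
import unicodedata


def scripts_used(text):
    """Return the script names in `text`, taken from the first word of each Unicode character name."""
    return {unicodedata.name(ch).split()[0] for ch in text if not ch.isspace()}


hindi_devanagari = "मुझे किताब चाहिए"
hindi_latin = "mujhe kitaab chahiye"

print(scripts_used(hindi_devanagari))
print(scripts_used(hindi_latin))

# No non-space character is shared between the two spellings
shared = (set(hindi_devanagari) & set(hindi_latin)) - {" "}
print(shared)

# The UTF-8 byte sequences a tokenizer starts from also differ in length
print(len(hindi_devanagari.encode("utf-8")), len(hindi_latin.encode("utf-8")))
```

Because the two strings share no characters, any subword vocabulary built over characters or bytes will segment them into entirely different pieces.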

#### **Common Script Codes**

| Script code | Meaning        | Writing system                                                     |
| ----------- | -------------- | ------------------------------------------------------------------ |
| **Deva**    | Devanagari     | Used for Hindi, Marathi, Sanskrit, Nepali, etc.                    |
| **Knda**    | Kannada script | Used for the Kannada language                                      |
| **Arab**    | Arabic script  | Used for Arabic, Urdu, Kashmiri (one form), Persian, etc.          |
| **Latn**    | Latin script   | Used for English, French, Spanish, German and many other languages |

#### **Codes for Indian Languages**

| Language  | FLORES Code | Script     | Notes                        |
| --------- | ----------- | ---------- | ---------------------------- |
| Hindi     | hin_Deva    | Devanagari | Standard Hindi writing       |
| Bengali   | ben_Beng    | Bengali    | Used for Bangla              |
| Tamil     | tam_Taml    | Tamil      | Native Tamil script          |
| Telugu    | tel_Telu    | Telugu     | Native Telugu script         |
| Marathi   | mar_Deva    | Devanagari | Same script as Hindi         |
| Gujarati  | guj_Gujr    | Gujarati   | Native Gujarati script       |
| Kannada   | kan_Knda    | Kannada    | Native Kannada script        |
| Malayalam | mal_Mlym    | Malayalam  | Native Malayalam script      |
| Punjabi   | pan_Guru    | Gurmukhi   | Punjabi (India)              |
| Odia      | ory_Orya    | Odia       | Earlier called Oriya         |
| Assamese  | asm_Beng    | Bengali    | Assamese uses Bengali script |
| Urdu      | urd_Arab    | Arabic     | Perso-Arabic script          |
| Kashmiri  | kas_Arab    | Arabic     | Common in Kashmir            |
| Kashmiri  | kas_Deva    | Devanagari | Alternate script             |
| Nepali    | npi_Deva    | Devanagari | Same script family as Hindi  |
| Sindhi    | snd_Arab    | Arabic     | Common form                  |
| Sindhi    | snd_Deva    | Devanagari | Alternate form               |
| Sanskrit  | san_Deva    | Devanagari | Classical usage              |


#### **Codes for Popular International Languages**

| Language              | FLORES Code | Script            | Notes                  |
| --------------------- | ----------- | ----------------- | ---------------------- |
| English               | eng_Latn    | Latin             | Global default         |
| Spanish               | spa_Latn    | Latin             |                        |
| French                | fra_Latn    | Latin             |                        |
| German                | deu_Latn    | Latin             |                        |
| Portuguese            | por_Latn    | Latin             |                        |
| Italian               | ita_Latn    | Latin             |                        |
| Russian               | rus_Cyrl    | Cyrillic          |                        |
| Arabic                | arb_Arab    | Arabic            | Modern Standard Arabic |
| Chinese (Simplified)  | zho_Hans    | Han (Simplified)  | Mainland China usage   |
| Chinese (Traditional) | zho_Hant    | Han (Traditional) | Taiwan / HK usage      |
| Japanese              | jpn_Jpan    | Japanese          | Mixed writing system   |
| Korean                | kor_Hang    | Hangul            |                        |
| Turkish               | tur_Latn    | Latin             |                        |
| Indonesian            | ind_Latn    | Latin             |                        |
| Vietnamese            | vie_Latn    | Latin             |                        |
| Thai                  | tha_Thai    | Thai              |                        |
| Hebrew                | heb_Hebr    | Hebrew            |                        |

#### **Key Takeaway**
In modern multilingual AI systems, a language is **not uniquely identified by its language name alone**.
Instead, it is defined as:
```text
(language, script)
```
This is why:
```
kas_Arab  ≠  kas_Deva
hin_Deva  ≠  eng_Latn
```
Even when the spoken language may be the same, the writing system directly affects tokenization, embeddings, retrieval quality, and model behavior.

## **Question Answering**

### **What is QA Task?**
Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document.

### **Example Usage - QA Task**

In [15]:
# Import pipeline
from transformers import pipeline
import torch

# Specify the inference task
reader = pipeline(
    task="question-answering",
    model="deepset/roberta-base-squad2",
    torch_dtype=torch.bfloat16,
)

Device set to use mps:0


In [16]:
# Pass the input to the pipeline
question = "When is the Offer Letter going to be shared?"

reader(question=question, context=email)

{'score': 0.7018205523490906,
 'start': 418,
 'end': 436,
 'answer': '20th of this month'}

In [17]:
# Pass the input to the pipeline
question = "When is internship starting?"

reader(question=question, context=email)

{'score': 0.9124733209609985,
 'start': 203,
 'end': 221,
 'answer': '25th of this month'}

In [18]:
# Pass the input to the pipeline
question = "Where did the customer buy the product?"

reader(question=question, context=product_review)

{'score': 0.7776859402656555,
 'start': 27,
 'end': 43,
 'answer': 'Flipkart website'}
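The `start` and `end` values in these results are character offsets into the context, so the answer can be recovered by slicing. The sketch below checks this against the last result, using the first line of the product review as the context. The `answer_or_none` helper and its 0.5 threshold are assumptions added for illustration.

```python
# First line of the product review, and the pipeline result shown above
context = "I bought this product from Flipkart website."
result = {
    "score": 0.7776859402656555,
    "start": 27,
    "end": 43,
    "answer": "Flipkart website",
}


def answer_span(context, result):
    """Recover the answer text from the character offsets."""
    return context[result["start"]:result["end"]]


def answer_or_none(result, threshold=0.5):
    """Return the answer only if the score clears a (hypothetical) threshold."""
    return result["answer"] if result["score"] >= threshold else None


print(answer_span(context, result))
print(answer_or_none(result))
print(answer_or_none(result, threshold=0.9))
```

Keeping the offsets is useful when the answer has to be highlighted in the original document rather than shown on its own.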

## **Text Generation**

Text generation is the task of producing new text from a given input text. These models can, for example, complete unfinished text or paraphrase it. A text generation model is also known as a **causal language model**.

**Use cases**
- **Instruction Models:** Some of the most powerful instruction-tuned open-access models include Mixtral 8x7B, Cohere Command R+, and Meta Llama3 70B.
- **Code Generation:** One of the most popular open-source models for code generation is StarCoder, which can generate code in 80+ languages.
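A causal language model predicts the next token from the tokens before it, then repeats the step with its own output appended. The toy sketch below imitates that loop with a word-level bigram table and greedy selection. It is a deliberately simplified stand-in for a neural model, and the corpus and function names are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which word follows which across the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens=5):
    """Greedy autoregressive loop: append the most frequent follower each step."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing ever followed this word in the corpus
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)


# Tiny illustrative corpus
corpus = [
    "the laptop turned off",
    "the laptop turned off suddenly",
    "the laptop was replaced",
]

counts = train_bigram(corpus)
print(generate(counts, "the", max_new_tokens=3))
```

A real model conditions on the whole preceding context rather than only the last word, and usually samples from a probability distribution instead of always taking the top choice.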

### **Example Usage - Text Generation**

In [19]:
# Import pipeline
from transformers import pipeline
import torch

# Specify the inference task
generator = pipeline(
    task="text-generation",
    model="Qwen/Qwen3-0.6B",
    torch_dtype=torch.bfloat16,
)

Device set to use mps:0


In [20]:
prompt = f"""You are a helpful AI assistant. Can you help me write the response for the following product review:
### PRODUCT REVIEW ###
{product_review}

### CUSTOMER SERVICE RESPONSE ###
Dear Customer, I am sorry to hear this. Please be assured that 
"""

print(prompt)

You are a helpful AI assistant. Can you help me write the response for the following product review:
### PRODUCT REVIEW ###
I bought this product from Flipkart website.
This product is very worst and replacement policy is very bad. Even I went to their New Delhi support center.
I used this laptop only for 30 minute and suddenly it turn off and it will never turn on.
And Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware.

### CUSTOMER SERVICE RESPONSE ###
Dear Customer, I am sorry to hear this. Please be assured that 



In [21]:
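# Pass the input to the pipeline
# Note: the pipeline's default `max_new_tokens` (256) takes precedence over `max_length`,
# as the warning in the output reports; pass `max_new_tokens` to control the output length
response = generator(prompt, max_length=200)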

response

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'You are a helpful AI assistant. Can you help me write the response for the following product review:\n### PRODUCT REVIEW ###\nI bought this product from Flipkart website.\nThis product is very worst and replacement policy is very bad. Even I went to their New Delhi support center.\nI used this laptop only for 30 minute and suddenly it turn off and it will never turn on.\nAnd Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware.\n\n### CUSTOMER SERVICE RESPONSE ###\nDear Customer, I am sorry to hear this. Please be assured that \nwe take all the issues that you have with our products. If you have any further questions, please feel free to reach out to our customer support team. \n\nThank you for your patience and for taking the time to reach out.\n### \n\n### RECOMMENDATIONS \nI recommend that users should check the product description and warranty information before purchasing. \n\n### \n### \n### \n### \n\n##

In [22]:
print(response[0]["generated_text"])

You are a helpful AI assistant. Can you help me write the response for the following product review:
### PRODUCT REVIEW ###
I bought this product from Flipkart website.
This product is very worst and replacement policy is very bad. Even I went to their New Delhi support center.
I used this laptop only for 30 minute and suddenly it turn off and it will never turn on.
And Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware.

### CUSTOMER SERVICE RESPONSE ###
Dear Customer, I am sorry to hear this. Please be assured that 
we take all the issues that you have with our products. If you have any further questions, please feel free to reach out to our customer support team. 

Thank you for your patience and for taking the time to reach out.
### 

### RECOMMENDATIONS 
I recommend that users should check the product description and warranty information before purchasing. 

### 
### 
### 
### 

### 
### 
### 

### 
### 
### 

### 
### 
###
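The generated text repeats the prompt and then trails off into `###` markers. The sketch below shows one way to keep only the reply, assuming the reply starts right after the prompt and ends at the first `###`. `extract_reply` is a hypothetical helper, and the strings are shortened versions of the prompt and output above.

```python
def extract_reply(generated_text, prompt, stop_marker="###"):
    """Drop the echoed prompt, then cut the text at the first stop marker."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    return reply.split(stop_marker, 1)[0].strip()


# Shortened stand-ins for the prompt and the model output shown above
prompt = (
    "### CUSTOMER SERVICE RESPONSE ###\n"
    "Dear Customer, I am sorry to hear this. Please be assured that \n"
)
generated_text = prompt + (
    "we take all the issues that you have with our products.\n\n"
    "Thank you for your patience.\n"
    "### \n\n### RECOMMENDATIONS \n"
)

reply = extract_reply(generated_text, prompt)
print(reply)
```

The text generation pipeline also has a `return_full_text` argument that can omit the prompt from the output. Trimming at a stop marker would still be needed for the trailing `###` lines.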

## **Fill-Mask**

### **What is Fill-Mask Task?**
Masked language modeling (i.e., Fill-Mask) is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want a statistical understanding of the language on which the model was trained.

### **Example Usage - Fill-Mask Task**
Use cases: correcting misprinted words in a book, guessing lost words in ancient manuscripts, and so on.

In [23]:
# Import pipeline
from transformers import pipeline
import torch

# Specify the inference task
unmasker = pipeline(
    task='fill-mask',
    model="google-bert/bert-base-uncased",
    torch_dtype=torch.bfloat16,
)

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [25]:
# Pass the input to the pipeline
unmasker("Artificial Intelligence [MASK] take over the world.")

[{'score': 0.318359375,
  'token': 2064,
  'token_str': 'can',
  'sequence': 'artificial intelligence can take over the world.'},
 {'score': 0.1806640625,
  'token': 2097,
  'token_str': 'will',
  'sequence': 'artificial intelligence will take over the world.'},
 {'score': 0.058837890625,
  'token': 2000,
  'token_str': 'to',
  'sequence': 'artificial intelligence to take over the world.'},
 {'score': 0.0458984375,
  'token': 2052,
  'token_str': 'would',
  'sequence': 'artificial intelligence would take over the world.'},
 {'score': 0.0458984375,
  'token': 2015,
  'token_str': '##s',
  'sequence': 'artificial intelligences take over the world.'}]
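The last candidate above, `##s`, is a word piece that attaches to the previous word instead of filling the gap with a new word. When only whole-word replacements are wanted, such candidates can be filtered out. This is a minimal sketch that assumes the `##` prefix convention seen in the output and uses the scores copied from it.

```python
# Candidates copied from the output above (token and score only)
predictions = [
    {"token_str": "can", "score": 0.318359375},
    {"token_str": "will", "score": 0.1806640625},
    {"token_str": "to", "score": 0.058837890625},
    {"token_str": "would", "score": 0.0458984375},
    {"token_str": "##s", "score": 0.0458984375},
]

# Keep only candidates that are whole words, not continuation pieces
whole_words = [p for p in predictions if not p["token_str"].startswith("##")]

# The best remaining candidate is the one with the highest score
best = max(whole_words, key=lambda p: p["score"])

print([p["token_str"] for p in whole_words])
print(best["token_str"])
```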

## **Feature Extraction**

### **What is Feature Extraction?**
Feature extraction is the task of extracting the features (hidden representations) that a model has learned for an input. This process is also known as **Vectorization/Embedding**.

### **Example Usage - Feature Extraction**

In [None]:
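from transformers import pipeline

# Load the feature extraction pipeline and model
# NOTE: the model name is blank in the original; set it to a checkpoint ID before running
pipe = pipeline(
    task="feature-extraction",
    model="",
)

In [None]:
# Example input (illustrative; the original does not define `sentence`)
sentence = "Natural language processing turns text into vectors."

# Get embeddings
embedding = pipe(sentence)

In [None]:
# The output is nested as [batch][token][hidden_size]
# Print the first 10 values of the first token's vector
first_token_values = embedding[0][0][0:10]
print(first_token_values)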
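The feature extraction pipeline returns one vector per token, nested as `[batch][token][hidden_size]`. To get a single vector for the whole sentence, the token vectors have to be combined. The sketch below uses mean pooling, which is one common choice and an assumption here. The small nested list stands in for real pipeline output.

```python
def mean_pool(token_vectors):
    """Average token vectors column-wise into a single sentence vector."""
    num_tokens = len(token_vectors)
    return [sum(column) / num_tokens for column in zip(*token_vectors)]


# Hypothetical output for one sentence: 3 tokens, hidden size 4
embedding = [[
    [0.1, 0.2, 0.3, 0.4],
    [0.3, 0.0, 0.1, 0.2],
    [0.2, 0.4, 0.2, 0.0],
]]

sentence_vector = mean_pool(embedding[0])
print(len(sentence_vector))
print(sentence_vector)
```

The result has the same length as the hidden size, regardless of how many tokens the sentence contains, which is what makes sentence vectors comparable to each other.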

## **Sentence Similarity**

### **What is Sentence Similarity?**
Sentence similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information, then measure how close those vectors are to each other. This task is particularly useful for information retrieval and clustering/grouping.
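Closeness between two embeddings is often measured with cosine similarity: the dot product of the vectors divided by the product of their lengths. Here is a minimal pure-Python sketch with small made-up vectors. Real embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite direction: similarity close to -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

When both vectors are already normalized to length 1, the denominator is 1 and cosine similarity reduces to a plain dot product.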

### **Installing sentence-transformers**
The Sentence Transformers library computes embeddings for sentences, paragraphs, and entire documents. An embedding is a vector representation of a text, which makes it useful for measuring how similar two texts are.

```
! pip install -U sentence-transformers
```

In [1]:
# ! pip install sentence-transformers

### **Example Usage - Sentence Similarity**

In [3]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

In [4]:
data = [
    "Apples - High in fiber, support digestion, and promote heart health.",
    "Bananas - Rich in potassium, help regulate blood pressure and muscle function.",
    "Oranges - Packed with vitamin C, boost immunity, and promote skin health.",
    "Blueberries - High in antioxidants, improve brain function and reduce inflammation.",
    "Strawberries - Support heart health and contain anti-aging antioxidants.",
    "Watermelon - Hydrating fruit with lycopene, good for heart and skin health.",
    "Pineapple - Contains bromelain, aids digestion, and reduces inflammation.",
    "Avocado - Loaded with healthy fats, supports brain and heart health.",
    "Papaya - Rich in enzymes for digestion and boosts skin health.",
    "Pomegranate - Full of antioxidants, improves blood circulation, and heart health.",
    "Carrots - High in beta-carotene, improve eye health and skin glow.",
    "Spinach - Rich in iron, good for blood health and energy levels.",
    "Broccoli - Contains sulforaphane, which has anti-cancer properties.",
    "Tomatoes - Packed with lycopene, supports heart health and skin protection.",
    "Bell Peppers - High in vitamin C, boosts immunity, and reduces inflammation.",
    "Cucumber - Hydrating vegetable, aids in digestion, and supports skin health.",
    "Garlic - Has antibacterial properties, supports heart health and immunity.",
    "Ginger - Anti-inflammatory, helps with digestion and nausea relief.",
    "Beets - Improve blood flow, support endurance, and detox the liver.",
    "Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion."
]

# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Computing embeddings for the dataset
data_embeddings = model.encode(
    data,
    convert_to_numpy=True,
    normalize_embeddings=True
)

data_embeddings

array([[ 0.0432431 ,  0.007915  ,  0.02221592, ...,  0.00474837,
         0.08790506,  0.03770212],
       [-0.06636767,  0.0239995 ,  0.00552235, ...,  0.00689973,
        -0.03125972,  0.03236982],
       [-0.05473284,  0.03603868, -0.03761066, ...,  0.00829173,
         0.04758519,  0.03967499],
       ...,
       [ 0.0090486 , -0.04935008, -0.03463788, ...,  0.0069707 ,
         0.00719365,  0.07223017],
       [-0.00118566, -0.00170601,  0.04594418, ..., -0.01025474,
         0.10647611, -0.03625185],
       [ 0.04819277, -0.13161905, -0.03929338, ...,  0.05015412,
         0.02368984, -0.06484961]], dtype=float32)

### **Calculating Sentence Similarity**

In [5]:
query = "Which fruit is good for digestion and reduces inflammation?"

query_embedding = model.encode(
    query,
    convert_to_numpy=True,
    normalize_embeddings=True
)

query_embedding

array([ 7.77969360e-02, -1.30056413e-02, -1.58315280e-03,  6.25351071e-02,
       -9.57588141e-04,  4.90093194e-02,  1.67138968e-02,  2.11121961e-02,
        3.18224095e-02, -2.74740485e-03,  7.46253179e-03,  7.31462985e-03,
        1.43662486e-02, -5.87841123e-03, -3.72029445e-03,  4.27569821e-02,
        3.76452096e-02,  6.18550554e-02,  5.34261926e-04, -1.20859131e-01,
       -4.11252771e-03,  2.53534131e-02,  2.24291664e-02,  4.50075837e-03,
       -3.97906043e-02, -8.67136866e-02,  6.52504265e-02, -2.09393185e-02,
       -7.12348521e-02, -7.58382753e-02,  1.52243190e-02, -2.08580196e-02,
        1.14972107e-01,  1.42562622e-02, -7.50390589e-02, -4.60609645e-02,
        7.01426864e-02, -5.20316511e-02,  2.21953797e-03, -7.68518671e-02,
       -2.00004157e-04,  7.40764514e-02,  3.65690663e-02, -5.87832257e-02,
        3.47544253e-03, -2.50390917e-02, -3.77727766e-03,  2.30258536e-02,
        6.32770211e-02,  2.04979144e-02, -1.02413327e-01, -3.03632300e-02,
       -6.11401908e-02, -

In [10]:
from sentence_transformers import util

cosine_scores = util.cos_sim(query_embedding, data_embeddings)

print(cosine_scores)

tensor([[0.6068, 0.4714, 0.5217, 0.5700, 0.5221, 0.5584, 0.6845, 0.4721, 0.5069,
         0.4233, 0.3236, 0.4451, 0.4094, 0.4368, 0.4153, 0.4201, 0.3507, 0.5387,
         0.3945, 0.4669]])


In [13]:
# Get best match
best_idx = int(cosine_scores.argmax())

print("\nBest match:")
print(data[best_idx])
print("Score:", float(cosine_scores[0][best_idx]))


Best match:
Pineapple - Contains bromelain, aids digestion, and reduces inflammation.
Score: 0.6845415830612183
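Retrieval usually returns several candidates, not only the single best match. The sketch below ranks the cosine scores printed above and returns the top three. The `names` list is just the first word or two of each entry in `data`, and `top_k` is a hypothetical helper.

```python
# Cosine scores copied from the output above, in the same order as `data`
scores = [
    0.6068, 0.4714, 0.5217, 0.5700, 0.5221, 0.5584, 0.6845, 0.4721, 0.5069,
    0.4233, 0.3236, 0.4451, 0.4094, 0.4368, 0.4153, 0.4201, 0.3507, 0.5387,
    0.3945, 0.4669,
]

# Short names for the 20 entries in `data`
names = [
    "Apples", "Bananas", "Oranges", "Blueberries", "Strawberries",
    "Watermelon", "Pineapple", "Avocado", "Papaya", "Pomegranate",
    "Carrots", "Spinach", "Broccoli", "Tomatoes", "Bell Peppers",
    "Cucumber", "Garlic", "Ginger", "Beets", "Sweet Potatoes",
]


def top_k(scores, k=3):
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


top_indices = top_k(scores, k=3)
for i in top_indices:
    print(f"{names[i]}: {scores[i]:.4f}")
```

Looking past the first hit shows how sharply the scores fall off, which helps when deciding on a cutoff for what counts as a relevant match.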
