# **Hugging Face 🤗**

Hugging Face is a company and a community built around open-source ML projects, best known for its NLP tooling.  

The **Hugging Face Hub** hosts:
1. Models
2. Datasets


Key libraries include:
1. `datasets`: download and load datasets directly from the Hub
2. `transformers`: work with pipelines, tokenizers, models, etc.
3. `evaluate`: compute evaluation metrics


## **LLM Use Cases**
1. Classification
2. Named Entity Recognition
3. Question Answering
4. Summarization
5. Translation
6. Text Generation
7. Zero-Shot Classification
8. Few-Shot Learning


## **🤗 Transformers**

🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:

**📝 Natural Language Processing:** text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.  
**🖼️ Computer Vision:** image classification, object detection, and segmentation.  
**🗣️ Audio:** automatic speech recognition and audio classification.  
**🐙 Multimodal:** table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.  

🤗 Transformers supports framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framework at each stage of a model’s life: train a model in three lines of code in one framework, and load it for inference in another. Models can also be exported to formats such as ONNX and TorchScript for deployment in production environments.

## **Setup**

In [None]:
# ! pip install tensorflow
# ! pip install tf-keras

In [1]:
! pip install transformers





In [2]:
! pip install datasets



In [3]:
import pandas as pd

from transformers import pipeline
from datasets import load_dataset

## **Pipelines**

The `pipeline()` makes it simple to use any model from the `Hub` for inference on any language, computer vision, speech, and multimodal tasks. Even if you don’t have experience with a specific modality or aren’t familiar with the underlying code behind the models, you can still use them for inference with the `pipeline()`! 

### **Pipeline usage**
1. Start by creating a `pipeline()` and specify the inference task.
```python
from transformers import pipeline
classifier = pipeline(task="text-classification")
```
2. Pass the input to the `pipeline()`.
```python
classifier(input_text)
```

Explore more on:  
https://huggingface.co/docs/transformers/main_classes/pipelines

In [41]:
email = """Congratulations! You've Secured Your Spot in the Innomatics Data Science Internship Jan '24 Program! 🌟
Dear Applicant,

Congratulations!



We are thrilled to inform you that your profile has been shortlisted for the prestigious Innomatics Data Science Internship Program, January 2024. Welcome to the journey of learning, growth, and exciting opportunities!

Let's celebrate your achievement with some impressive numbers. This year, we received a staggering 20,000 applications, and you stand out as one of the top 10% who made it through our rigorous screening process. Your dedication and skills truly set you apart!

To formalize your acceptance and proceed with the onboarding process, we kindly ask you to complete your profile by filling out the following form: Form Link. If you've already submitted the form, please disregard this message. The deadline to fill the form is January 18, 2024 by 5:00 PM IST.

Important Dates to Remember:
1. Internship Offer Letters and Zoom link to join the internship will be shared via email on January 22, 2024
2. Internship Start Date is January 24, 2024
   - Timing: 6:00 PM to 7:00 PM IST

Feel free to share this exciting news with your friends and family. We look forward to welcoming you to the Innomatics community and embarking on this rewarding journey together.

Congratulations once again, and get ready for an incredible experience!

Best regards,
Innomatics Data Science Internship Team
"""

In [42]:
product_review = """I bought this product from Flipkart website.
This product is very worst and replacement policy is very bad. Even I went to their New Delhi support center.
I used this laptop only for 30 minute and suddenly it turn off and it will never turn on.
And Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware.
"""

## 1. Classification

In [39]:
classifier = pipeline(task="text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")


In [43]:
classifier(email)

[{'label': 'POSITIVE', 'score': 0.9997299313545227}]

In [44]:
classifier(product_review)

[{'label': 'NEGATIVE', 'score': 0.9983296990394592}]
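The classifier returns a list with one dict per input, each holding a `label` and a `score`. As a small sketch of how those results could be folded into a single signed number for sorting or plotting, the helper below maps `POSITIVE` to `+score` and `NEGATIVE` to `-score`. The name `signed_sentiment` is hypothetical and not part of 🤗 Transformers; the sample values are the two outputs recorded above.

```python
def signed_sentiment(prediction):
    """Map one pipeline result to a value in [-1, 1].

    Hypothetical helper: assumes the label is 'POSITIVE' or 'NEGATIVE',
    as produced by the SST-2 model used above.
    """
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Outputs copied from the two cells above.
email_pred = [{"label": "POSITIVE", "score": 0.9997299313545227}]
review_pred = [{"label": "NEGATIVE", "score": 0.9983296990394592}]

print(signed_sentiment(email_pred[0]))   # close to +1
print(signed_sentiment(review_pred[0]))  # close to -1
```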

## 2. Named Entity Recognition

In [None]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
ner_tagger(product_review)

[{'entity_group': 'ORG',
  'score': 0.98347884,
  'word': 'Flipkart',
  'start': 27,
  'end': 35},
 {'entity_group': 'LOC',
  'score': 0.99955785,
  'word': 'New Delhi',
  'start': 129,
  'end': 138},
 {'entity_group': 'ORG',
  'score': 0.9879432,
  'word': 'Flipkart',
  'start': 249,
  'end': 257},
 {'entity_group': 'ORG',
  'score': 0.9986192,
  'word': 'Apple',
  'start': 339,
  'end': 344},
 {'entity_group': 'ORG',
  'score': 0.99235153,
  'word': 'Alienware',
  'start': 348,
  'end': 357}]
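Each entity carries `start` and `end` character offsets into the input text, so the span can be recovered by slicing. The sketch below checks that for the first two entities and then groups entity words by type. The helper name `group_entities` and the 0.9 score cut-off are illustrative assumptions; only the first two lines of `product_review` are reproduced, and scores are rounded.

```python
from collections import defaultdict

# First two lines of `product_review`, enough to cover the first two entities.
text = (
    "I bought this product from Flipkart website.\n"
    "This product is very worst and replacement policy is very bad. "
    "Even I went to their New Delhi support center.\n"
)

# Entities copied from the output above (scores rounded).
entities = [
    {"entity_group": "ORG", "score": 0.9835, "word": "Flipkart", "start": 27, "end": 35},
    {"entity_group": "LOC", "score": 0.9996, "word": "New Delhi", "start": 129, "end": 138},
]


def group_entities(entities, min_score=0.9):
    """Group entity words by type, keeping only confident predictions.

    `min_score` is an illustrative threshold, not a library default.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# The offsets index directly into the original string.
for ent in entities:
    print(ent["word"], "->", repr(text[ent["start"]:ent["end"]]))

print(group_entities(entities))
```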

In [None]:
ner_tagger(email)

[{'entity_group': 'MISC',
  'score': 0.94690365,
  'word': 'Inn',
  'start': 49,
  'end': 52},
 {'entity_group': 'ORG',
  'score': 0.51610214,
  'word': '##oma',
  'start': 52,
  'end': 55},
 {'entity_group': 'MISC',
  'score': 0.67638415,
  'word': '##tics Data Science Internship Jan',
  'start': 55,
  'end': 87},
 {'entity_group': 'MISC',
  'score': 0.6539702,
  'word': '24 Program',
  'start': 89,
  'end': 99},
 {'entity_group': 'MISC',
  'score': 0.87135464,
  'word': 'Innomatics Data Science Internship Program',
  'start': 229,
  'end': 271},
 {'entity_group': 'ORG',
  'score': 0.9500925,
  'word': 'Innomatics',
  'start': 1245,
  'end': 1255},
 {'entity_group': 'ORG',
  'score': 0.94063216,
  'word': 'Innomatics Data Science Inter',
  'start': 1404,
  'end': 1433},
 {'entity_group': 'ORG',
  'score': 0.8345173,
  'word': 'Team',
  'start': 1439,
  'end': 1443}]
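In the output above, the unfamiliar word "Innomatics" is split into `Inn`, `##oma`, and `##tics ...` because the subword pieces received different labels. One possible clean-up, sketched below under stated assumptions, is to merge a fragment that starts with `##` into the previous entity when their character offsets touch. The helper `merge_subword_fragments` is hypothetical; keeping the first fragment's label and the lowest score are design choices made for this sketch, not library behavior.

```python
def merge_subword_fragments(entities):
    """Merge '##'-prefixed fragments into the preceding adjacent entity."""
    merged = []
    for ent in entities:
        word = ent["word"]
        if merged and word.startswith("##") and ent["start"] == merged[-1]["end"]:
            prev = merged[-1]
            prev["word"] += word[2:]          # drop the '##' marker
            prev["end"] = ent["end"]
            prev["score"] = min(prev["score"], ent["score"])  # be conservative
        else:
            merged.append(dict(ent))          # copy so the input is untouched
    return merged


# First four entities copied from the output above.
entities = [
    {"entity_group": "MISC", "score": 0.94690365, "word": "Inn", "start": 49, "end": 52},
    {"entity_group": "ORG", "score": 0.51610214, "word": "##oma", "start": 52, "end": 55},
    {"entity_group": "MISC", "score": 0.67638415,
     "word": "##tics Data Science Internship Jan", "start": 55, "end": 87},
    {"entity_group": "MISC", "score": 0.6539702, "word": "24 Program", "start": 89, "end": 99},
]

for ent in merge_subword_fragments(entities):
    print(ent["entity_group"], ent["word"], ent["start"], ent["end"])
```

The low merged score is a useful signal in itself: it marks the span as one the model was unsure about.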

## 3. Question Answering

In [None]:
reader = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
question = "When are we going to recieve a Offer Letter?"

reader(question=question, context=email)

{'score': 0.6580012440681458,
 'start': 1038,
 'end': 1056,
 'answer': 'January 22, 2024\n2'}
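The extracted span runs past the date into the next list item, which is why the answer ends with `\n2`. A minimal post-processing sketch, assuming a hypothetical helper `trim_to_first_line`, keeps only the first line of the answer and adjusts `end` to match:

```python
def trim_to_first_line(result):
    """Return a copy of a QA result whose answer stops at the first newline."""
    first_line = result["answer"].split("\n", 1)[0].rstrip()
    trimmed = dict(result)
    trimmed["answer"] = first_line
    # The answer is the slice context[start:end], so `end` shrinks with it.
    trimmed["end"] = result["start"] + len(first_line)
    return trimmed


# Result copied from the output above.
result = {
    "score": 0.6580012440681458,
    "start": 1038,
    "end": 1056,
    "answer": "January 22, 2024\n2",
}

print(trim_to_first_line(result)["answer"])
```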

In [None]:
question = "What is my rank in the internship exam?"

reader(question=question, context=email)

{'score': 0.009091172367334366, 'start': 516, 'end': 519, 'answer': '10%'}
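The email says nothing about an exam rank, yet the pipeline still returns a span, this time with a score below 0.01. A sketch of treating such results as "no answer", assuming a hypothetical helper `answer_or_none` and an arbitrary cut-off of 0.3 that would need tuning on real data:

```python
def answer_or_none(result, min_score=0.3):
    """Return the answer text, or None when the model's score is too low.

    `min_score` is an illustrative value, not a library default.
    """
    if result["score"] < min_score:
        return None
    return result["answer"]


# Results copied from the outputs above.
offer_letter = {"score": 0.6580012440681458, "answer": "January 22, 2024\n2"}
exam_rank = {"score": 0.009091172367334366, "answer": "10%"}

print(answer_or_none(offer_letter))
print(answer_or_none(exam_rank))
```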

In [None]:
question = "Where did the customer buy the product?"

reader(question=question, context=product_review)

{'score': 0.8565688729286194,
 'start': 27,
 'end': 43,
 'answer': 'Flipkart website'}

## 4. Summarization

In [None]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
summarizer(email)

[{'summary_text': " Congratulations! You've Secured Your Spot in the Innomatics Data Science Internship Jan '24 Program! The deadline to fill the form is January 18, 2024 by 5:00 PM IST . The internship start date is January 24, 2024 - Timing: 6:00pm to 7pm IST ."}]

In [None]:
summarizer(product_review, max_length=60)

[{'summary_text': ' The product is very worst and replacement policy is very bad . Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware. Even I went to their New Delhi support center. I used this laptop only for 30 minute and suddenly it turn'}]
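With `max_length=60` the summary is cut off mid-sentence ("suddenly it turn"), and the model output also contains stray spaces before punctuation. A rough clean-up sketch, assuming a hypothetical helper `tidy_summary`, removes those spaces and drops any trailing incomplete sentence. It relies on a simple punctuation heuristic and does not handle abbreviations or decimals specially.

```python
import re


def tidy_summary(text):
    """Normalize spacing and keep text up to the last sentence terminator."""
    text = re.sub(r"\s+([.!?,])", r"\1", text.strip())
    last = max(text.rfind(ch) for ch in ".!?")
    return text[: last + 1] if last != -1 else text


# Summary copied from the output above.
raw = (
    " The product is very worst and replacement policy is very bad . "
    "Flipkart website does not replace this product. "
    "I should have gone for better brands like Apple or Alienware. "
    "Even I went to their New Delhi support center. "
    "I used this laptop only for 30 minute and suddenly it turn"
)

print(tidy_summary(raw))
```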

## 5. Translation

In [None]:
translator = pipeline("translation_en_to_fr")

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [None]:
translator(email, max_length=1000)

[{'translation_text': "Félicitations! Vous avez obtenu votre place dans le programme de stagiaires en science des données d'Innomatics janv. 24!  Cher candidat, Félicitations! Nous sommes ravis de vous annoncer que votre profil a été sélectionné pour le prestigieux programme de stagiaires en science des données d'Innomatics janv. 2024. Bienvenue au voyage d'apprentissage, de croissance et"}]
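The translation above stops partway through the email. One possible mitigation, which the notebook itself does not apply, is to split long inputs into smaller chunks and translate each chunk separately. The sketch below groups paragraphs under a character budget; `split_into_chunks` and `max_chars=400` are hypothetical, and character count is only a rough stand-in for the model's token limit.

```python
def split_into_chunks(text, max_chars=400):
    """Group blank-line-separated paragraphs into chunks of at most `max_chars`.

    A single paragraph longer than `max_chars` is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Dear Applicant,\n\nCongratulations!\n\nImportant Dates to Remember:"
for chunk in split_into_chunks(sample, max_chars=35):
    print(repr(chunk))
```

Each chunk could then be passed to `translator` on its own and the translated pieces joined in order.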

## 6. Text Generation

In [None]:
generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
response = "Dear Customer, I am sorry to hear this. Please be assured that"

prompt = product_review + "\n\nCustomer service response:\n" + response

In [None]:
generator(prompt, max_length=200)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I bought this product from Flipkart website.\nThis product is very worst and replacement policy is very bad. Even I went to their New Delhi support center.\nI used this laptop only for 30 minute and suddenly it turn off and it will never turn on.\nAnd Flipkart website does not replace this product. I should have gone for better brands like Apple or Alienware.\n\n\nCustomer service response:\nDear Customer, I am sorry to hear this. Please be assured that you may get similar results as I have mentioned above which is for the Lenovo S7 and Lenovo Nexus 7 models.\n\nYou will find no replacement of this product. The problem is that there are no replacement cases, but even if you use a new case, your computer will be able to continue functioning normally and you cannot upgrade the case again. We have never received any complaints so please do let us know and we can send you a replacement case.\n\nAnd finally you have received the Lenovo'}]
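The `generated_text` field repeats the whole prompt before the new text. A small sketch for keeping only the continuation, assuming a hypothetical helper `extract_continuation`; the strings here are shortened stand-ins for the prompt and output above. The text-generation pipeline also accepts `return_full_text=False`, which should have a similar effect.

```python
def extract_continuation(prompt, generated_text):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt was altered on the way through.
    return generated_text


# Shortened stand-ins for the prompt and model output shown above.
prompt = "Customer service response:\nDear Customer, I am sorry to hear this. Please be assured that"
generated = prompt + " you may get similar results as I have mentioned above"

print(extract_continuation(prompt, generated))
```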

## 7. Fill-Mask

In [1]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')

unmasker("Artificial Intelligence [MASK] take over the world.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.3182411789894104,
  'token': 2064,
  'token_str': 'can',
  'sequence': 'artificial intelligence can take over the world.'},
 {'score': 0.18299679458141327,
  'token': 2097,
  'token_str': 'will',
  'sequence': 'artificial intelligence will take over the world.'},
 {'score': 0.056001417338848114,
  'token': 2000,
  'token_str': 'to',
  'sequence': 'artificial intelligence to take over the world.'},
 {'score': 0.04519499093294144,
  'token': 2015,
  'token_str': '##s',
  'sequence': 'artificial intelligences take over the world.'},
 {'score': 0.04515307396650314,
  'token': 2052,
  'token_str': 'would',
  'sequence': 'artificial intelligence would take over the world.'}]
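The fourth candidate, `##s`, is a subword piece that attaches to the previous token ("intelligences") rather than filling the gap with a separate word. If only whole-word fillers are wanted, the pieces can be filtered out. The helper `whole_word_candidates` below is a hypothetical sketch; the predictions are copied from the output above with scores rounded.

```python
def whole_word_candidates(predictions):
    """Keep fill-mask candidates that are whole words, not '##' subword pieces."""
    return [p["token_str"] for p in predictions if not p["token_str"].startswith("##")]


# Candidates copied from the output above (scores rounded).
predictions = [
    {"score": 0.3182, "token_str": "can"},
    {"score": 0.1830, "token_str": "will"},
    {"score": 0.0560, "token_str": "to"},
    {"score": 0.0452, "token_str": "##s"},
    {"score": 0.0452, "token_str": "would"},
]

print(whole_word_candidates(predictions))
```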