#**Hugging Face Models**
In this notebook, we will cover the following types of models from Hugging Face:

1. Summarization Models
2. Question Answering Models
3. Text Classification Models
4. Embedding Models
5. Text Generation Models
6. Named Entity Recognition (NER) Models
7. Translation Models

To explore more models, see https://huggingface.co/models

###**Install Dependencies**

In [1]:
!pip install transformers sentence_transformers

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence_transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.11.0->sentence_transformers)
 

###**Download the `hrdataset.zip` file from the CloudYuga GitHub repo**

The file is saved in the notebook's current working directory (e.g., /content/ in Google Colab).

In [2]:
!wget https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip

--2025-05-26 06:20:19--  https://github.com/cloudyuga/mastering-genai-w-python/raw/refs/heads/main/hrdataset.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip [following]
--2025-05-26 06:20:19--  https://raw.githubusercontent.com/cloudyuga/mastering-genai-w-python/refs/heads/main/hrdataset.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9530 (9.3K) [application/zip]
Saving to: ‘hrdataset.zip.1’


2025-05-26 06:20:20 (12.4 MB/s) - ‘hrdataset.zip.1’ saved [9530/9530]



###**Unzip `hrdataset.zip` file**
- This automatically creates the **`hrdataset`** folder in the current working directory (/content/ in Google Colab).

In [3]:
!unzip hrdataset.zip

Archive:  hrdataset.zip
replace hrdataset/policies/leave_policies.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/policies/training_and_development.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/policies/employee_benefits.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/policies/holiday_calendar.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/policies/events_calendar.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/surveys/Employee_Culture_Survey_Responses.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/108_Rajesh_Kulkarni.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/106_Neha_Malhotra.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/103_Anjali_Das.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/105_Sunita_Patil.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace hrdataset/employees/101_Priya_Sharma.md? [y]es, [n]o, [A]ll
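
The `replace ...?` prompts above appear when the archive has already been extracted into the same directory. As an alternative, the following is a minimal sketch using Python's standard `zipfile` module, which extracts without prompting. The helper name `extract_archive` and its arguments are illustrative and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir, overwriting existing files.

    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

In Colab, a call such as `extract_archive("hrdataset.zip", "/content")` should recreate the `hrdataset` folder, assuming the archive was downloaded to the working directory.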

In [4]:
!ls /content/hrdataset/employees


101_Priya_Sharma.md  105_Sunita_Patil.md     109_Meera_Iyer.md
102_Rohit_Mehra.md   106_Neha_Malhotra.md    110_Aditya_Jain.md
103_Anjali_Das.md    107_Amit_Verma.md	     payroll_information.md
104_Karan_Kapoor.md  108_Rajesh_Kulkarni.md


In [5]:
import os
import pandas as pd

# Directory containing the employee markdown files
markdown_dir = "/content/hrdataset/employees"

# Fields to extract; each appears in a file as "- **Field:** value"
fields = ["Name", "Role", "Department", "Joining Date"]

# Parse markdown files into a list of profile dictionaries
employee_data = []
for filename in os.listdir(markdown_dir):
    if not filename.endswith(".md"):
        continue
    profile = {}
    with open(os.path.join(markdown_dir, filename), "r", encoding="utf-8") as f:
        for line in f:
            for field in fields:
                prefix = f"- **{field}:**"
                if line.startswith(prefix):
                    # Keep everything after the prefix so values containing ":" stay intact
                    profile[field] = line[len(prefix):].strip()
    # Files without any of these fields produce an empty profile and are skipped
    if profile:
        employee_data.append(profile)

# Convert to DataFrame
employee_df = pd.DataFrame(employee_data)
print(employee_df)

              Name                Role       Department Joining Date
0  Rajesh Kulkarni                 CTO        Executive   2017-11-15
1       Amit Verma                 CEO        Executive   2016-02-01
2       Meera Iyer   Marketing Manager        Marketing   2020-02-20
3      Aditya Jain    Senior Developer               IT   2019-06-10
4      Rohit Mehra   Logistics Analyst        Logistics   2020-08-22
5       Anjali Das        HR Executive  Human Resources   2021-05-10
6    Neha Malhotra      Junior Analyst        Logistics   2023-07-01
7     Karan Kapoor    Fleet Supervisor        Logistics   2018-11-03
8     Priya Sharma  Operations Manager       Operations   2019-03-15
9     Sunita Patil   Finance Executive          Finance   2022-01-17
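
The parsing cell relies on each profile line having the form `- **Field:** value`. As a small, file-free illustration of that assumption, the sketch below applies a regular expression to an in-memory string. The helper `parse_profile` is hypothetical, and the sample text is reconstructed from one row of the table above; the real files may contain additional lines.

```python
import re

# Matches lines such as "- **Role:** Operations Manager"
FIELD_PATTERN = re.compile(r"^- \*\*(?P<field>[^:*]+):\*\*\s*(?P<value>.*)$")


def parse_profile(markdown_text, wanted=("Name", "Role", "Department", "Joining Date")):
    """Return a dict of the wanted fields found in markdown_text."""
    profile = {}
    for line in markdown_text.splitlines():
        match = FIELD_PATTERN.match(line.strip())
        if match and match.group("field") in wanted:
            profile[match.group("field")] = match.group("value").strip()
    return profile


# Sample text reconstructed from the DataFrame output, for illustration only
sample = """- **Name:** Priya Sharma
- **Role:** Operations Manager
- **Department:** Operations
- **Joining Date:** 2019-03-15
"""
print(parse_profile(sample))
```

Lines that do not match the pattern, or whose field is not in `wanted`, are ignored.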


##**1. Summarization Models**
These models are pre-trained for text summarization tasks.

* facebook/bart-large-cnn: A widely used summarization model, fine-tuned on the CNN/DailyMail news dataset.

* t5-small, t5-base, t5-large: Versatile text-to-text models that can perform summarization and many other tasks.
  **Usage:** Prefix input with "summarize: ".

* google/pegasus-xsum (Google Research): Specialized in abstractive summarization.
* google/long-t5-tglobal-base: Designed for summarizing longer documents.
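
To illustrate the "summarize: " prefix mentioned for T5, here is a minimal sketch that builds the prefixed input and passes it to a T5 checkpoint through `AutoTokenizer` and `AutoModelForSeq2SeqLM`. The helper names `add_task_prefix` and `summarize_with_t5`, and the default `max_new_tokens=40`, are assumptions made for this example.

```python
def add_task_prefix(text, prefix="summarize: "):
    """Prepend a T5 task prefix unless the text already starts with it."""
    text = text.strip()
    return text if text.startswith(prefix) else prefix + text


def summarize_with_t5(text, model_name="t5-small", max_new_tokens=40):
    """Summarize text with a T5 checkpoint (downloads the model on first use)."""
    # Imported here so add_task_prefix can be used without loading transformers
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(add_task_prefix(text), return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The same `add_task_prefix` helper can be reused with other T5 prefixes, for example `prefix="translate English to German: "`.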

In [6]:
from transformers import pipeline

# Load pre-trained summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Combine employee details for the Executive department
executive_data = " ".join([
    f"{row['Name']} is the {row['Role']} in {row['Department']}, joined on {row['Joining Date']}."
    for _, row in employee_df.iterrows() if row['Department'] == "Executive"
])

# Summarize the data
summary = summarizer(executive_data, max_length=50, min_length=10, do_sample=False)
print("Summary:", summary[0]['summary_text'])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cpu
Your max_length is set to 50, but your input_length is only 40. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)


Summary: Rajesh Kulkarni is the CTO in Executive. Amit Verma is the CEO in Executive, joined on 2016-02-01.
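
The warning in the output above notes that `max_length` (50) exceeds the input length (40 tokens). One possible way to avoid it is to derive the length bounds from the input's token count. The sketch below is only a heuristic: the function name `summary_length_bounds`, the 0.5 ratio, and the default minimum of 10 are assumptions, not values recommended by the library.

```python
def summary_length_bounds(input_tokens, ratio=0.5, min_length=10):
    """Return (min_length, max_length) scaled to the input's token count.

    max_length is a fraction of the input length, and min_length is
    lowered when it would otherwise exceed max_length.
    """
    max_length = max(int(input_tokens * ratio), 1)
    return min(min_length, max_length), max_length


# For the 40-token input reported in the warning above
print(summary_length_bounds(40))  # (10, 20)
```

The token count could be obtained with something like `len(summarizer.tokenizer(text)["input_ids"])`, and the two values passed as `min_length` and `max_length` to the summarizer.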


##**2. Question Answering Models**
These models are trained to answer questions based on a provided context.

* distilbert-base-uncased-distilled-squad: A lightweight DistilBERT QA model fine-tuned on the SQuAD dataset.

* bert-large-uncased-whole-word-masking-finetuned-squad: A more powerful BERT-based QA model.

* deepset/roberta-base-squad2: Trained on SQuAD2.0, supports unanswerable questions.

In [7]:
# Load a pre-trained extractive question-answering model
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

# Use the summary from the previous cell as the context
# (alternative: context = "Priya Sharma is the Operations Manager who joined the company in 2019.")
context = summary[0]['summary_text']
question = "Who is the CEO?"
answer = qa_pipeline(question=question, context=context)
print(answer)

Device set to use cpu


{'score': 0.9975013732910156, 'start': 41, 'end': 51, 'answer': 'Amit Verma'}
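
The `start` and `end` values in the result are character offsets into the context string, so the answer can be recovered by slicing. The sketch below reuses the context and result printed above. The `answer_or_none` helper and its 0.5 threshold are hypothetical additions showing one way to discard low-confidence answers.

```python
# Context and result copied from the cell outputs above
recorded_context = (
    "Rajesh Kulkarni is the CTO in Executive. "
    "Amit Verma is the CEO in Executive, joined on 2016-02-01."
)
recorded_answer = {"score": 0.9975013732910156, "start": 41, "end": 51, "answer": "Amit Verma"}

# 'start' and 'end' index characters in the context (end is exclusive)
span = recorded_context[recorded_answer["start"]:recorded_answer["end"]]
print(span)  # Amit Verma


def answer_or_none(result, threshold=0.5):
    """Return the answer text only if the model's score reaches the threshold."""
    return result["answer"] if result["score"] >= threshold else None
```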


##**3. Text Classification Models**
For sentiment analysis, topic classification, and other classification tasks.

* distilbert-base-uncased-finetuned-sst-2-english: Sentiment analysis model.
* cardiffnlp/twitter-roberta-base-sentiment: Specifically designed for sentiment analysis on tweets.
* facebook/bart-large-mnli: Can be used for zero-shot text classification.
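
As a sketch of the zero-shot use mentioned for facebook/bart-large-mnli, the following wraps the `zero-shot-classification` pipeline, which scores a text against labels supplied at call time. The helper names `classify_zero_shot` and `top_label` are illustrative, and any candidate labels are the caller's choice.

```python
def classify_zero_shot(text, candidate_labels, model_name="facebook/bart-large-mnli"):
    """Score text against arbitrary labels without task-specific fine-tuning."""
    # Imported here so top_label can be used without loading transformers
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    return classifier(text, candidate_labels=candidate_labels)


def top_label(result):
    """Return the (label, score) pair with the highest score.

    Expects the pipeline's result dict, which holds parallel
    'labels' and 'scores' lists.
    """
    return max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
```

A call might look like `top_label(classify_zero_shot("How many leave days do I get?", ["leave", "benefits", "training"]))`; note that this model is large and slow to download and run on CPU.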

In [8]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("The employee feedback has been overwhelmingly positive!")
print(result)


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9997122883796692}]


##**4. Embedding Models**
For similarity search, clustering, and other embedding-based tasks.

* sentence-transformers/all-MiniLM-L6-v2: A lightweight embedding model for generating sentence embeddings.
* openai/clip-vit-base-patch32: Generates embeddings for both text and images.
* bert-base-uncased: You can extract hidden states to use as embeddings.

In [9]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["This is a test sentence.", "Another example."])
print(embeddings)

[[ 8.42964798e-02  5.79537116e-02  4.49332502e-03  1.05821133e-01
   7.08347186e-03 -1.78446826e-02 -1.68880690e-02 -1.52283777e-02
   4.04731482e-02  3.34225185e-02  1.04327612e-01 -4.70358692e-02
   6.88471459e-03  4.10179794e-02  1.87119059e-02 -4.14923131e-02
   2.36474667e-02 -5.65018319e-02 -3.36961895e-02  5.09910360e-02
   6.93032816e-02  5.47842421e-02 -9.78836883e-03  2.36971844e-02
   1.99965108e-02  9.71729215e-03 -5.88991717e-02  7.30747916e-03
   4.70264405e-02 -4.51009814e-03 -5.57996780e-02 -4.15945984e-03
   6.47570491e-02  4.80762906e-02  1.70207825e-02 -3.18336999e-03
   5.74024208e-02  3.52318995e-02 -5.88387111e-03  1.48328794e-02
   1.15763210e-02 -1.07480787e-01  1.91041604e-02  2.20857337e-02
   1.08645940e-02  3.78195709e-03 -3.19403596e-02  1.07277585e-02
  -4.84234374e-03 -2.83362307e-02 -5.25735430e-02 -7.05868527e-02
  -5.75557537e-02 -1.36328777e-02  5.68219274e-03  2.30746306e-02
   3.56978178e-02  1.49984220e-02  4.97427844e-02  4.26283479e-02
  -3.45888
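
Embeddings like the ones printed above are usually compared with cosine similarity for the similarity-search use case mentioned earlier. The following is a minimal pure-Python sketch; the helper names `cosine_similarity` and `rank_by_similarity` are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

With the array from the cell above, `cosine_similarity(embeddings[0], embeddings[1])` would compare the two sentences. For larger workloads, a vectorized routine such as `sentence_transformers.util.cos_sim` is likely a better fit.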

##**5. Text Generation Models**
For generating text, completing sentences, or creating creative outputs.

* gpt2, distilgpt2: Lightweight generative models for text generation.
* EleutherAI/gpt-neo-2.7B: A larger (2.7B-parameter) open-source model built as an alternative to GPT-3-style models.
* bigscience/bloom: A multilingual generative model.
* facebook/opt-13b: A 13B-parameter open-source large language model from Meta.

In [10]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
text = generator("Once upon a time, there was a curious employee who", max_length=50)
print(text[0]['generated_text'])

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, there was a curious employee who turned up with a lot of notes on the keyboard. He claimed that all of them were completely empty, only five or nine at a time, which was an almost unheard of occurrence. There were


##**6. Named Entity Recognition (NER) Models**
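For extracting entities such as person names, organizations, and locations from text. The CoNLL-2003 label set used by the models below covers PER, ORG, LOC, and MISC, so dates are not tagged.

* dbmdz/bert-large-cased-finetuned-conll03-english: Fine-tuned on CoNLL-2003 for NER.
* dslim/bert-base-NER: A smaller BERT-base model, also fine-tuned for NER.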

In [11]:
from transformers import pipeline

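# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True argument)
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english", aggregation_strategy="simple")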
text = "Neependra Khare founded CloudYuga in December 2015"
entities = ner(text)
print(entities)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER', 'score': np.float32(0.9995653), 'word': 'Neependra Khare', 'start': 0, 'end': 15}, {'entity_group': 'ORG', 'score': np.float32(0.97774523), 'word': 'CloudYuga', 'start': 24, 'end': 33}]
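
The pipeline returns one dictionary per entity. For downstream use it can be convenient to collect the words by label. The sketch below does this for the result printed above; the helper `group_entities` and its `min_score` parameter are illustrative additions.

```python
from collections import defaultdict

# Entities copied from the cell output above (scores written as plain floats)
recorded_entities = [
    {"entity_group": "PER", "score": 0.9995653, "word": "Neependra Khare", "start": 0, "end": 15},
    {"entity_group": "ORG", "score": 0.97774523, "word": "CloudYuga", "start": 24, "end": 33},
]


def group_entities(entities, min_score=0.0):
    """Map each entity label to the list of words found with that label."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


print(group_entities(recorded_entities))
# {'PER': ['Neependra Khare'], 'ORG': ['CloudYuga']}
```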


##**7. Translation Models**
For translating text between languages.

- Helsinki-NLP/opus-mt-en-de: English-to-German translation.
- t5-small, t5-large: Can also be used for translation by prefixing the input with "translate English to French: ".

In [12]:
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
translation = translator("The employee feedback has been overwhelmingly positive!")
print(translation[0]['translation_text'])

Device set to use cpu


Das Feedback der Mitarbeiter war überwältigend positiv!
