
**Exploring Models Programmatically with Hugging Face**



In [None]:
!pip install huggingface_hub



In [None]:
from huggingface_hub import HfApi

api = HfApi()
models = list(api.list_models(limit=3))
print(models)

[ModelInfo(id='deepseek-ai/DeepSeek-Prover-V2-671B', author=None, sha=None, created_at=datetime.datetime(2025, 4, 30, 6, 14, 35, tzinfo=datetime.timezone.utc), last_modified=None, private=False, disabled=None, downloads=3471, downloads_all_time=None, gated=None, gguf=None, inference=None, inference_provider_mapping=None, likes=688, library_name='transformers', tags=['transformers', 'safetensors', 'deepseek_v3', 'text-generation', 'conversational', 'custom_code', 'autotrain_compatible', 'text-generation-inference', 'endpoints_compatible', 'fp8', 'region:us'], pipeline_tag='text-generation', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, trending_score=688, siblings=None, spaces=None, safetensors=None, security_repo_status=None, xet_enabled=None), ModelInfo(id='nari-labs/Dia-1.6B', author=None, sha=None, created_at=datetime.datetime(2025, 4, 20, 5, 36, 20, tzinfo=datetime.timezone.utc), last_modified=None, private=False, disabled
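
The raw `ModelInfo` repr is hard to scan. Below is a minimal sketch of a helper that keeps only a few of the fields visible in the output above (`id`, `pipeline_tag`, `downloads`, `likes`). The name `summarize_models` is hypothetical, and the helper only assumes that each object exposes those attributes.

```python
def summarize_models(models):
    """Return one compact dict per model, keeping a few ModelInfo fields.

    `models` can be any iterable of objects exposing `id` and, optionally,
    `pipeline_tag`, `downloads`, and `likes` (the attributes printed above).
    """
    rows = []
    for m in models:
        rows.append({
            "id": m.id,
            "task": getattr(m, "pipeline_tag", None),
            "downloads": getattr(m, "downloads", None),
            "likes": getattr(m, "likes", None),
        })
    return rows
```

With the `api` object from the cell above, `summarize_models(api.list_models(limit=3))` would return three short dictionaries instead of the full repr.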

**Running a Basic Pipeline**

In [None]:
from transformers import pipeline

my_pipeline = pipeline(
    task="text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

print(my_pipeline("DataCamp is awesome"))

[{'label': 'POSITIVE', 'score': 0.999854326248169}]
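
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. Here is a hedged sketch of post-processing that output. The helper name `confident_label` is hypothetical, and the 0.9 threshold is an arbitrary value chosen for illustration. It returns the label only when the score clears the threshold.

```python
def confident_label(prediction, threshold=0.9):
    """Return the label if its score is at least `threshold`, else None.

    `prediction` is one dict from a text-classification pipeline result,
    e.g. {"label": "POSITIVE", "score": 0.9998}.
    """
    if prediction["score"] >= threshold:
        return prediction["label"]
    return None


# The output printed above, copied by hand rather than re-computed.
sample_output = [{"label": "POSITIVE", "score": 0.999854326248169}]
print(confident_label(sample_output[0]))  # prints POSITIVE
```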


**Adjusting Pipeline Parameters**

In [None]:
from transformers import pipeline

my_pipeline = pipeline("text-generation", model="gpt2")

results = my_pipeline("What if AI", max_length=10, num_return_sequences=2)

for result in results:
    print(result["generated_text"])

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What if AI could create everything instead? What if
What if AI could see you? How quickly this
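
`max_length` caps the total sequence length, prompt tokens included, whereas `max_new_tokens` counts only the generated tokens. The arithmetic can be sketched as follows. `new_token_budget` is a hypothetical helper, and the prompt length of 3 tokens is an assumption for illustration that should be checked with the model's tokenizer.

```python
def new_token_budget(prompt_token_count, max_length):
    """Number of tokens left for generation when `max_length` caps the total."""
    return max(0, max_length - prompt_token_count)


# Assumption: the prompt encodes to 3 tokens; verify with the tokenizer.
print(new_token_budget(prompt_token_count=3, max_length=10))  # prints 7
```

If the goal is a fixed amount of new text regardless of prompt length, passing `max_new_tokens` instead of `max_length` expresses that directly.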


**When and why to save models**<br>

Save models locally for:<br>
* Offline access
* Custom modifications
* Large-scale deployments



In [None]:
my_pipeline = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
my_pipeline.save_pretrained("/path/to/saved_model_directory")

reloaded_pipeline = pipeline("text-classification", model="/path/to/saved_model_directory")

print(reloaded_pipeline("Hugging Face is great!"))
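
For offline use it can help to prefer a saved copy when one exists and fall back to the Hub ID otherwise. This is a minimal sketch: `resolve_model_source` is a hypothetical helper, and it assumes that a saved model directory contains a `config.json` file, which `save_pretrained` writes for the model.

```python
from pathlib import Path


def resolve_model_source(local_dir, hub_id):
    """Return the local directory if it looks like a saved model, else the Hub ID.

    Assumption: a usable saved directory contains `config.json`.
    """
    local_path = Path(local_dir)
    if (local_path / "config.json").is_file():
        return str(local_path)
    return hub_id
```

The result can be passed as the `model` argument, for example `pipeline("text-classification", model=resolve_model_source("/path/to/saved_model_directory", "distilbert-base-uncased-finetuned-sst-2-english"))`.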

**Inspecting a dataset**<br>


In [None]:
!pip install datasets
from datasets import load_dataset_builder

# Load dataset metadata
data_builder = load_dataset_builder("imdb")

# Access dataset size
dataset_size_mb = data_builder.info.dataset_size / (1024 ** 2)

print(f"Dataset size: {round(dataset_size_mb, 2)} MB")

Dataset size: 127.02 MB
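
Beyond total size, the builder's `info` object also carries per-split metadata. Here is a hedged sketch that reads example counts from `info.splits`. `split_sizes` is a hypothetical helper, and it assumes that `info.splits` is a mapping whose values expose `num_examples`. The mapping may be empty or `None` when the dataset card lacks that metadata.

```python
def split_sizes(info):
    """Map each split name to its number of examples.

    Returns an empty dict when no split metadata is available.
    """
    splits = getattr(info, "splits", None) or {}
    return {name: split.num_examples for name, split in splits.items()}
```

With the builder from the cell above, `split_sizes(data_builder.info)` would list the splits and their sizes without downloading the data files.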


**Downloading a dataset with the split parameter**

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
print(dataset[0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

**Data Manipulation**

In [None]:
imdb = load_dataset("imdb", split="train")

# Keep only the negative reviews (label 0)
filtered = imdb.filter(lambda row: row["label"] == 0)

# Slicing
sliced = filtered.select(range(2))

print(sliced)
print(sliced[0]['text'])
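
Before filtering on `label`, it is often useful to check how the labels are distributed. This minimal sketch uses only the standard library. `label_counts` is a hypothetical helper that works on any iterable of row dictionaries, which is what iterating over a `Dataset` yields. The toy rows are made up for illustration.

```python
from collections import Counter


def label_counts(rows, label_key="label"):
    """Count how many rows carry each label value."""
    return Counter(row[label_key] for row in rows)


# Made-up rows standing in for dataset examples.
toy_rows = [
    {"text": "dull and slow", "label": 0},
    {"text": "a delight", "label": 1},
    {"text": "not for me", "label": 0},
]
print(label_counts(toy_rows))  # prints Counter({0: 2, 1: 1})
```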

**Using min_length and max_length**

In [None]:
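# `original_text` is assumed to be a string, defined earlier, holding the text to summarize

# Create a short summarizer (min_length and max_length are measured in tokens)
short_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=1, max_length=10)

# Summarize the input text
short_summary_text = short_summarizer(original_text)

# Print the short summary
print(short_summary_text[0]["summary_text"])

# Repeat for a long summarizer
long_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=50, max_length=150)
long_summary_text = long_summarizer(original_text)
print(long_summary_text[0]["summary_text"])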

**Auto Models and Tokenizers**

In [None]:
# Import necessary library for tokenization
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Split input text into tokens
tokens = tokenizer.tokenize("AI: Making robots smarter and humans lazier!")

# Display the tokenized output
print(f"Tokenized output: {tokens}")
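
The cell above imports `AutoModelForSequenceClassification` but only uses the tokenizer. A sequence-classification model returns raw logits, one per class, and a softmax turns them into the kind of score the pipeline reports. The following is a minimal pure-Python sketch of that last step. `logits_to_prediction` is a hypothetical helper, the logits are made-up numbers, and the `id2label` mapping is assumed here rather than read from the model config.

```python
import math


def logits_to_prediction(logits, id2label):
    """Apply a softmax to `logits` and return the top label with its probability."""
    # Subtract the max logit for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability (first one wins on ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Made-up logits and an assumed label mapping, for illustration only.
prediction = logits_to_prediction([-4.0, 4.5], {0: "NEGATIVE", 1: "POSITIVE"})
print(prediction["label"])  # prints POSITIVE
```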

**Document Q&A with Hugging Face**<br>

Use cases for document Q&A: <br>
* Legal: Identify contract clauses
* Finance: Extract key figures
* Support: Retrieve answers from manuals

**Extracting text with pypdf**

In [None]:
# Extracting text with pypdf

from pypdf import PdfReader

# Load the PDF file
reader = PdfReader("US-Employee_Policy.pdf")

# Extract text from all pages
document_text = ""
for page in reader.pages:
    document_text += page.extract_text()

**Creating a Q&A pipeline**

In [None]:
# Load the question-answering pipeline
qa_pipeline = pipeline(
    task="question-answering",
    model="distilbert-base-cased-distilled-squad",
)

question = "How many volunteer days are offered annually?"

# Get the answer from the QA pipeline
result = qa_pipeline(question=question, context=document_text)

# Print the answer
print(f"Answer: '{result['answer']}'")
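
The same pipeline can be reused for several questions against one document. Here is a hedged sketch: `ask_all` is a hypothetical helper that accepts any callable taking `question` and `context` keyword arguments and returning a dictionary with `answer` and `score` keys, as the question-answering pipeline does for a single question.

```python
def ask_all(qa, questions, context):
    """Run each question against `context` and collect (answer, score) pairs."""
    answers = {}
    for question in questions:
        result = qa(question=question, context=context)
        answers[question] = (result["answer"], result["score"])
    return answers
```

For example, `ask_all(qa_pipeline, [question], document_text)` would return a dictionary keyed by question. A low score is a hint that the answer may not be present in the document.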