---
title: "Hugging Face Transformers: Getting Started"
author: "Mohammed Adil Siraju"
date: "2025-09-26"
categories: [huggingface, transformers, nlp, bert]
description: "Introduction to using Hugging Face datasets and transformers for NLP tasks."
---
This notebook demonstrates how to use Hugging Face's `datasets` and `transformers` libraries to work with pre-trained models like BERT.

## Installing Required Libraries

First, install the Hugging Face `datasets` library, which provides access to pre-built datasets.

In [1]:
%pip install datasets

Note: you may need to restart the kernel to use updated packages.

## Loading a Dataset

Load the IMDB movie reviews dataset. It contains 50,000 labeled reviews, split evenly into 25,000 for training and 25,000 for testing, each labeled as negative (0) or positive (1). A further 50,000 unlabeled reviews are available in the `unsupervised` split.

In [2]:
from datasets import load_dataset

ds = load_dataset('imdb')

print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Converting to Pandas DataFrame

Convert the Hugging Face dataset to a pandas DataFrame for easier exploration and analysis.

In [6]:
import pandas as pd

df = pd.DataFrame(ds['train'])
df

| | text | label |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |
| ... | ... | ... |
| 24995 | A hit at the time but now better categorised a... | 1 |
| 24996 | I love this movie like no other. Another time ... | 1 |
| 24997 | This film and it's sequel Barry Mckenzie holds... | 1 |
| 24998 | 'The Adventures Of Barry McKenzie' started lif... | 1 |
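
Once the data is in a DataFrame, a couple of quick checks are usually worth running: class balance and review length. The sketch below uses a made-up four-row frame (`df_toy`, hypothetical) with the same columns, so it runs without the download. The same calls should work on the real `df`.

```python
import pandas as pd

# Hypothetical mini-sample with the same columns as the IMDB training DataFrame.
df_toy = pd.DataFrame({
    "text": [
        "A wonderful film.",
        "Dull and overlong.",
        "Loved every minute.",
        "Not worth the ticket.",
    ],
    "label": [1, 0, 1, 0],
})

# Class balance: number of reviews per label.
label_counts = df_toy["label"].value_counts().sort_index()

# Review length in whitespace-separated words, a rough proxy for token count.
df_toy["n_words"] = df_toy["text"].str.split().str.len()

# Average review length per sentiment class.
mean_len = df_toy.groupby("label")["n_words"].mean()

print(label_counts)
print(mean_len)
```

Word counts only approximate what a subword tokenizer will produce, but they give an early hint about how many reviews may exceed a model's maximum input length.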


## Installing Transformers Library

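Install the `transformers` library to access pre-trained models and tokenizers. The later cells request PyTorch tensors (`return_tensors='pt'`), so PyTorch must also be installed.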

In [7]:
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


## Loading Pre-trained BERT Model

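Load the `bert-base-uncased` checkpoint and its matching tokenizer. BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only Transformer pre-trained on unlabeled text with masked language modeling; the uncased variant lowercases its input before tokenization. `AutoModel` loads the bare encoder without a task-specific head, so it returns hidden states rather than class predictions.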

In [8]:
from transformers import AutoModel, AutoTokenizer

model_name = 'bert-base-uncased'
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
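
To give a feel for what a BERT-style tokenizer does with a word, here is a simplified sketch of greedy longest-match-first subword splitting in the spirit of WordPiece. It is not the library's implementation: `TOY_VOCAB` and the `wordpiece` helper are hypothetical, and the real `bert-base-uncased` vocabulary has 30,522 entries, so real splits will differ.

```python
# Hypothetical miniature vocabulary, made up for illustration only.
TOY_VOCAB = {"[UNK]", "hug", "##ging", "face", "hello", "##s", "token", "##izer"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab entry fits.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("hugging", TOY_VOCAB))     # ['hug', '##ging']
print(wordpiece("tokenizers", TOY_VOCAB))  # ['token', '##izer', '##s']
print(wordpiece("xyz", TOY_VOCAB))         # ['[UNK]']
```

With the real tokenizer, `tokenizer.tokenize(...)` returns the actual pieces for a string, which is the easiest way to see how a given word is split.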

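## Using the Model and Tokenizer

Tokenize a sample sentence and pass it through the BERT model to obtain contextual embeddings. The printed shape of `last_hidden_state` is `(batch_size, sequence_length, hidden_size)`: here one sentence, seven tokens (including the special `[CLS]` and `[SEP]` tokens that the tokenizer adds), and 768 hidden dimensions per token.

In [9]:
# Tokenize one sentence and return PyTorch tensors
inputs = tokenizer('Hello, Hugging Face!', return_tensors='pt')
# Run the encoder; the tokenizer output is unpacked as keyword arguments
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)

torch.Size([1, 7, 768])


## Summary

This notebook demonstrated the basics of using Hugging Face libraries:
- Loading datasets with `datasets`
- Working with pre-trained models using `transformers`
- Tokenizing text and generating embeddings

These are fundamental building blocks for many NLP tasks!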
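
The hidden states printed in this notebook contain one vector per token. A common way to reduce them to a single vector per sentence is mean pooling over the attention mask. This is one option among several, not something the notebook itself does. The sketch below uses random tensors as stand-ins for real model output so that it runs without downloading BERT, and the helper name `mean_pool` is hypothetical.

```python
import torch

torch.manual_seed(0)

# Stand-ins for real model output: a batch of 2 sequences, 7 positions, 768 dims.
# In the notebook these would be outputs.last_hidden_state and inputs['attention_mask'].
hidden = torch.randn(2, 7, 768)
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1, 1, 1],  # 7 real tokens
    [1, 1, 1, 1, 0, 0, 0],  # 4 real tokens followed by 3 padding positions
])

def mean_pool(hidden_states, mask):
    """Average token vectors over real (non-padding) positions only."""
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)         # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)            # (batch, 1), avoids divide-by-zero
    return summed / counts

sentence_vectors = mean_pool(hidden, attention_mask)
cls_vectors = hidden[:, 0, :]  # alternative: take the first ([CLS]) position

print(sentence_vectors.shape)
print(cls_vectors.shape)
```

With the real model, the equivalent call would be `mean_pool(outputs.last_hidden_state, inputs['attention_mask'])`. Masking matters once several sentences of different lengths are batched with padding.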