<a href="https://colab.research.google.com/github/abdul9870/abdul9870/blob/main/project%3D1Copy_of_day1_text_classifier_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Day 1: Building a Text Classifier with Transformers
**Instructor:** Mihir Inamdar, ML Engineer @ Quoppo Ventures

## Tutorial Overview
In this tutorial, we'll cover the **data preparation** steps for a text classification pipeline built with Hugging Face Transformers: loading a dataset, exploring it, and tokenizing it for a model.

### Pipeline Architecture
```
┌─────────────┐     ┌────────────┐     ┌───────────────┐
│ Raw Text    │ ──▶ │ Dataset    │ ──▶ │ Tokenization  │
│ Data Source │     │ Loading    │     │ (Tokenizer)   │
└─────────────┘     └────────────┘     └───────────────┘
                                         │
                                         ▼
                                  ┌───────────────┐
                                  │ Preprocessed  │
                                  │ Dataset       │
                                  └───────────────┘
```


## 1. Import Necessary Libraries
Install and import the libraries used for:
- Loading datasets
- Tokenization with Transformers
- Utilities for data handling

In [None]:
# Install the required dependencies
!pip install datasets transformers torch numpy pandas

In [None]:
# Step 1: Importing libraries
# Datasets library to fetch NLP datasets
from datasets import load_dataset

# Hugging Face Transformers for tokenization
from transformers import AutoTokenizer

# PyTorch for tensor operations (can switch to TensorFlow if preferred)
import torch

# Standard utilities
import numpy as np
import pandas as pd

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Configuration & Constants
Define the model checkpoint and dataset name once, so that every later cell uses the same values.

In [None]:
# Model checkpoint for tokenizer and future model loading
MODEL_CHECKPOINT = "distilbert-base-uncased"

# Dataset choice: 'ag_news' for topic classification; 'imdb' for sentiment analysis
DATASET_NAME = "ag_news"  # Change to "imdb" for sentiment tasks

print(f"Configured MODEL_CHECKPOINT={MODEL_CHECKPOINT}, DATASET_NAME={DATASET_NAME}")

Configured MODEL_CHECKPOINT=distilbert-base-uncased, DATASET_NAME=ag_news


## 3. Load the Dataset
Fetch the dataset from the Hugging Face Hub and inspect its splits.

In [None]:
# Load the dataset from HF Datasets
print(f"Loading '{DATASET_NAME}' dataset...")
raw_datasets = load_dataset(DATASET_NAME)

# Display available splits and sample counts
print(raw_datasets)

Loading 'ag_news' dataset...


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


## 4. Explore the Dataset
Inspect the dataset's structure and features, and preview a few examples.

In [None]:
# Print features and a few examples
print('Dataset structure:', raw_datasets)

if 'train' in raw_datasets:
    print('\nFeatures of train split:', raw_datasets['train'].features)
    print('\nSample examples:')
    for idx in range(3):
        print(raw_datasets['train'][idx])

Dataset structure: DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Features of train split: {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)}

Sample examples:
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}
{'text': 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.', 'label': 2}
{'text': "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are exp
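The `label` column stores integers, and the `ClassLabel` feature printed above lists the matching names in order. As a minimal sketch, the lookup tables below are built by hand from that printed list; the variable names `id2label` and `label2id` are chosen here for illustration and are not part of the original notebook.

```python
# Label names copied from the ClassLabel feature printed above.
label_names = ["World", "Sports", "Business", "Sci/Tech"]

# Build both lookup directions (integer ID <-> readable name).
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

# The first two training examples shown above carry label 2.
print(id2label[2])           # Business
print(label2id["Sci/Tech"])  # 3
```

With the real dataset loaded, `raw_datasets['train'].features['label'].int2str(2)` should give the same name without a hand-written list.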

## 5. Initialize the Tokenizer
Load the tokenizer corresponding to the chosen model checkpoint.

In [None]:
# Load the tokenizer from the specified checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
print(f"Tokenizer loaded: {tokenizer.__class__.__name__}")

Tokenizer loaded: DistilBertTokenizerFast
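DistilBERT checkpoints use a WordPiece vocabulary, which splits words it does not know into smaller pieces marked with `##`. The sketch below is a toy greedy longest-match splitter meant only to illustrate the idea; the function `wordpiece` and the tiny `toy_vocab` are hypothetical and are not the library's implementation or its real vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical mini-vocabulary, for illustration only.
toy_vocab = {"nl", "##p", "token", "##ization"}

print(wordpiece("nlp", toy_vocab))           # ['nl', '##p']
print(wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

This is why one word can map to more than one input ID once text has passed through the tokenizer.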


## 6. Tokenization Demonstration
Tokenize a sample sentence and inspect input IDs and tokens.

In [None]:
# Sample text for demonstration
sample_text = "Transformers make NLP tasks easier and more efficient!"
print("Original text:", sample_text)

# Truncate and pad to the model's maximum sequence length
tokenized = tokenizer(
    sample_text,
    truncation=True,
    padding='max_length',
    max_length=tokenizer.model_max_length,
)
print("\nTokenized output keys:", tokenized.keys())
print("Input IDs:", tokenized['input_ids'])
print("Tokens:", tokenizer.convert_ids_to_tokens(tokenized['input_ids']))

Original text: Transformers make NLP tasks easier and more efficient!

Tokenized output keys: dict_keys(['input_ids', 'attention_mask'])
Input IDs: [101, 19081, 2191, 17953, 2361, 8518, 6082, 1998, 2062, 8114, 999, 102, 0, 0, 0, ... (output truncated; the visible remainder is padding 0s)
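The output shows the two effects of `padding='max_length'`: real token IDs come first, then `0`s fill the rest, while `attention_mask` marks which positions are real. A minimal sketch of that bookkeeping follows, reusing the twelve non-padding IDs printed above. The helper `pad_or_truncate` and the short `max_length` values are assumptions for illustration, not the tokenizer's own code.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Force a list of token IDs to exactly max_length entries."""
    if len(ids) > max_length:
        # Drop content from the end but keep the closing special token.
        ids = ids[:max_length - 1] + ids[-1:]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 = real token, 0 = padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


# The non-padding IDs from the output above ([CLS] ... [SEP]).
sample_ids = [101, 19081, 2191, 17953, 2361, 8518, 6082, 1998, 2062, 8114, 999, 102]

padded = pad_or_truncate(sample_ids, max_length=16)
print(padded["input_ids"])       # 12 real IDs followed by four 0s
print(padded["attention_mask"])  # twelve 1s followed by four 0s

short = pad_or_truncate(sample_ids, max_length=8)
print(short["input_ids"])        # [101, 19081, 2191, 17953, 2361, 8518, 6082, 102]
```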

## 7. Define Preprocessing Function
Define the function that `.map()` will apply to tokenize batches of examples.

In [None]:
# Define preprocessing for batched tokenization
def preprocess_function(examples):
    # Tokenize batch of texts
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',       # pad to max length for uniform input size
        max_length=tokenizer.model_max_length
    )

print("Preprocessing function ready.")

Preprocessing function ready.
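Padding every example to the model's maximum length gives uniform inputs, but it also stores many padding tokens for short texts. The sketch below compares that choice with padding each batch only to its own longest sequence. The helper `padded_token_count`, the token counts in `lengths`, and the fixed width of 512 (the usual limit for DistilBERT checkpoints) are assumptions for illustration.

```python
def padded_token_count(lengths, batch_size, pad_to=None):
    """Total tokens stored after padding.

    pad_to=None pads each batch to its own longest sequence;
    otherwise every sequence is padded to the fixed width pad_to.
    """
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total


# Hypothetical token counts for eight short news snippets.
lengths = [41, 60, 35, 52, 48, 39, 70, 44]

fixed = padded_token_count(lengths, batch_size=4, pad_to=512)
dynamic = padded_token_count(lengths, batch_size=4)

print(fixed)    # 4096
print(dynamic)  # 520
```

Fixed-width padding keeps this tutorial simple; per-batch padding is an alternative to consider when memory or speed becomes a concern.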


## 8. Apply Preprocessing to Dataset
Use `.map()` with `batched=True` to tokenize every split; the function then receives many examples per call instead of one.

In [None]:
# Apply preprocessing in batched mode
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

# Inspect tokenized dataset
print(tokenized_datasets['train'].column_names)
print("First tokenized example:", tokenized_datasets['train'][0])

['text', 'label', 'input_ids', 'attention_mask']
First tokenized example: {'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2, 'input_ids': [101, 2813, 2358, 1012, 6468, 15020, 2067, 2046, 1996, 2304, 1006, 26665, 1007, 26665, 1011, 2460, 1011, 19041, 1010, 2813, 2395, 1005, 1055, 1040, 11101, 2989, 1032, 2316, 1997, 11087, 1011, 22330, 8713, 2015, 1010, 2024, 3773, 2665, 2153, 1012, 102, 0, 0, 0, ... (output truncated; the visible remainder is padding 0s)
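The column list above shows that the mapped dataset keeps `text` and `label` and gains the columns returned by the preprocessing function. A toy sketch of that batched behaviour is given below. `batched_map` and `toy_preprocess` are hypothetical stand-ins written for illustration, not the `datasets` implementation, and a word count replaces real tokenization.

```python
def batched_map(columns, fn, batch_size=1000):
    """Apply fn to slices of a column-oriented dataset and merge the results."""
    n_rows = len(next(iter(columns.values())))
    out = {name: [] for name in columns}
    for start in range(0, n_rows, batch_size):
        # fn sees a dict of lists, one list per column.
        batch = {name: vals[start:start + batch_size] for name, vals in columns.items()}
        merged = {**batch, **fn(batch)}  # returned columns are added or overwrite
        for name, vals in merged.items():
            out.setdefault(name, []).extend(vals)
    return out


def toy_preprocess(examples):
    # Stand-in for the tokenizer: count whitespace-separated words.
    return {"n_words": [len(text.split()) for text in examples["text"]]}


data = {"text": ["a b c", "d e", "f"], "label": [0, 1, 2]}
result = batched_map(data, toy_preprocess, batch_size=2)

print(list(result))       # ['text', 'label', 'n_words']
print(result["n_words"])  # [3, 2, 1]
```

Because the function receives lists rather than single strings, it can hand a whole batch of texts to the tokenizer in one call.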

## 9. Summary & Next Steps
**What we've done:**
- Loaded and explored the dataset
- Initialized the tokenizer and tokenized a sample sentence
- Defined and applied the preprocessing function

**Next (Day 2):**
1. Load a pre-trained Transformer model for classification
2. Set up training arguments and a `DataCollator`
3. Fine-tune the model on the tokenized dataset
4. Evaluate performance and analyze the results
