<a href="https://colab.research.google.com/github/amit306/machineLearning/blob/main/00_Setup_and_Installation_COMPLETE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Experiment 0: Setup and Installation
### Model Cascading and Token-Aware Batching for LLM Serving

**Purpose:** Install the required libraries, download and cache the datasets, and pre-download the tokenizers used by later notebooks

**Execution Time:** ~10-15 minutes

**Run this notebook FIRST before all other experiments**

**UPDATED VERSION:** Includes proper dataset caching for all experiments

## Step 1: Install Required Libraries

In [13]:
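%%capture
# Install core libraries
# Note: %%capture suppresses all pip output, including errors. If a pinned
# version cannot be installed on this runtime, the failure is silent, so
# compare the versions printed in Step 2 against the pins below.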
!pip install transformers==4.36.0
!pip install torch==2.1.0
!pip install datasets==2.16.0
!pip install scikit-learn==1.3.2
!pip install accelerate==0.25.0
!pip install bitsandbytes==0.41.3
!pip install sentencepiece==0.1.99
!pip install protobuf==3.20.3

In [14]:
# Install visualization and analysis libraries
!pip install matplotlib==3.8.2 seaborn==0.13.0 pandas==2.1.4 numpy==1.26.3
!pip install scipy==1.11.4 tqdm==4.66.1



In [15]:
# Install simulation library for batching experiments
!pip install simpy==4.1.1



## Step 2: Verify Installation

In [16]:
import torch
import transformers
import sklearn
import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("✓ All core libraries imported successfully")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"\nGPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

✓ All core libraries imported successfully
PyTorch version: 2.9.0+cpu
Transformers version: 4.36.0
Datasets version: 2.16.0
Scikit-learn version: 1.3.2

GPU Available: False
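
The recorded output reports PyTorch 2.9.0+cpu although Step 1 pins `torch==2.1.0`, which suggests that pin was not applied on this runtime. A minimal sketch of an explicit check is shown below. The names `check_pins`, `PINS`, and `reported` are illustrative and not part of the notebook; the lookup function is injectable so the example can replay the versions printed above instead of querying the live environment.

```python
from importlib import metadata


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not met."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # Drop a local build tag such as "+cpu" before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Hypothetical subset of the Step 1 pins, keyed by pip distribution name.
PINS = {"torch": "2.1.0", "transformers": "4.36.0", "datasets": "2.16.0"}

# Stand-in lookup that replays the versions printed in Step 2.
reported = {"torch": "2.9.0+cpu", "transformers": "4.36.0", "datasets": "2.16.0"}
for package, (wanted, found) in check_pins(PINS, reported.get).items():
    print(f"{package}: pinned {wanted}, found {found}")
```

Calling `check_pins(PINS)` without the second argument would query the installed packages instead of the stand-in dictionary.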


## Step 3: Download and Cache Datasets

This step downloads all required datasets and caches them for use in experiments.

In [17]:
from datasets import load_dataset
from sklearn.datasets import fetch_20newsgroups
import pickle
import os

print("Downloading datasets... This may take a few minutes.\n")

# Dataset 1: AG News (News Classification)
print("1. Downloading AG News...")
ag_news = load_dataset("ag_news")
print(f"   ✓ AG News: {len(ag_news['train'])} train, {len(ag_news['test'])} test samples")

# Dataset 2: TREC (Question Classification)
print("\n2. Downloading TREC...")
trec = load_dataset("trec")
print(f"   ✓ TREC: {len(trec['train'])} train, {len(trec['test'])} test samples")

# Dataset 3: SST-2 (Sentiment Analysis)
print("\n3. Downloading SST-2...")
sst2 = load_dataset("glue", "sst2")
print(f"   ✓ SST-2: {len(sst2['train'])} train, {len(sst2['validation'])} validation samples")

# Dataset 4: 20 Newsgroups (Multi-class Classification)
print("\n4. Downloading 20 Newsgroups...")

newsgroups_loaded = False

# Try Method 1: sklearn (fastest if it works)
try:
    print("   → Trying sklearn fetch_20newsgroups...")
    newsgroups_train = fetch_20newsgroups(
        subset='train',
        remove=('headers', 'footers', 'quotes'),
        random_state=42
    )
    newsgroups_test = fetch_20newsgroups(
        subset='test',
        remove=('headers', 'footers', 'quotes'),
        random_state=42
    )
    print(f"   ✓ 20 Newsgroups (sklearn): {len(newsgroups_train.data)} train, {len(newsgroups_test.data)} test samples")
    newsgroups_loaded = True
except Exception as e:
    print(f"   ✗ sklearn failed: {e}")
    newsgroups_loaded = False

# Try Method 2: HuggingFace (reliable fallback)
if not newsgroups_loaded:
    try:
        print("   → Trying HuggingFace datasets...")
        hf_newsgroups = load_dataset("SetFit/20_newsgroups")

        # Wrap the splits in sklearn's Bunch (the same container type that
        # fetch_20newsgroups returns) so the pickles written in Step 4 can be
        # loaded in other notebooks without redefining a custom class.
        from sklearn.utils import Bunch

        def make_bunch(split):
            labels = split['label']
            return Bunch(
                data=split['text'],
                target=labels,
                # Integer placeholders: class names are not read from this dataset
                target_names=list(range(max(labels) + 1))
            )

        newsgroups_train = make_bunch(hf_newsgroups['train'])
        newsgroups_test = make_bunch(hf_newsgroups['test'])

        print(f"   ✓ 20 Newsgroups (HuggingFace): {len(newsgroups_train.data)} train, {len(newsgroups_test.data)} test samples")
        newsgroups_loaded = True
    except Exception as e:
        print(f"   ✗ HuggingFace failed: {e}")
        newsgroups_loaded = False

if not newsgroups_loaded:
    print("\n   ⚠️  WARNING: Could not download 20 Newsgroups dataset")
    print("   Experiment 01 will not work without this dataset.")
    print("   Please check your internet connection and try again.")
else:
    print("\n✓ All datasets downloaded successfully!")

Downloading datasets... This may take a few minutes.

1. Downloading AG News...
   ✓ AG News: 120000 train, 7600 test samples

2. Downloading TREC...

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.

   ✓ TREC: 5452 train, 500 test samples

3. Downloading SST-2...
   ✓ SST-2: 67349 train, 872 validation samples

4. Downloading 20 Newsgroups...
   → Trying sklearn fetch_20newsgroups...
   ✗ sklearn failed: HTTP Error 403: Forbidden
   → Trying HuggingFace datasets...

Repo card metadata block was not found. Setting CardData to empty.

   ✓ 20 Newsgroups (HuggingFace): 11314 train, 7532 test samples

✓ All datasets downloaded successfully!
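
The sklearn → HuggingFace fallback used for 20 Newsgroups follows a general pattern: try each source in order and keep the first one that succeeds. The sketch below isolates that pattern. `load_with_fallback` and the two stand-in loaders are hypothetical names that mimic the recorded run; they do not download anything.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in order and return the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # download errors differ between libraries
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all loaders failed: " + "; ".join(errors))


# Stand-in loaders that mimic the run recorded above.
def from_sklearn():
    raise OSError("HTTP Error 403: Forbidden")


def from_huggingface():
    return {"train": 11314, "test": 7532}


source, sizes = load_with_fallback(
    [("sklearn", from_sklearn), ("HuggingFace", from_huggingface)]
)
print(source, sizes)
```

Raising an error when every loader fails, rather than setting a flag, makes a missing dataset hard to overlook in later cells. Whether that is preferable to the flag used in this notebook is a design choice.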


## Step 4: Cache Datasets for Experiments

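**CRITICAL STEP:** Save the 20 Newsgroups and AG News datasets to pickle files so Experiments 01 and 02 can load them without re-downloading. TREC and SST-2 are not pickled here; they remain in the HuggingFace datasets cache populated in Step 3.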

In [18]:
print("Caching datasets for experiments...\n")

# Create experiment_results directory
os.makedirs('experiment_results', exist_ok=True)

# Cache 20 Newsgroups (for Experiment 01)
if newsgroups_loaded:
    print("1. Caching 20 Newsgroups...")
    with open('experiment_results/newsgroups_train.pkl', 'wb') as f:
        pickle.dump(newsgroups_train, f)
    with open('experiment_results/newsgroups_test.pkl', 'wb') as f:
        pickle.dump(newsgroups_test, f)
    print(f"   ✓ Saved {len(newsgroups_train.data)} train samples")
    print(f"   ✓ Saved {len(newsgroups_test.data)} test samples")
    print("   → Experiment 01 will use this cached version")
else:
    print("1. ⚠️  Skipping 20 Newsgroups cache (download failed)")

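# Cache AG News (for Experiment 02)
# Note: ag_news['train'] and ag_news['test'] are HuggingFace Dataset objects
# backed by Arrow files in the local HF cache. Pickling them may store
# references to those files rather than the rows themselves (Step 9 reports
# this file as 0.00 MB), so the pickle likely depends on the HF cache.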
print("\n2. Caching AG News...")
ag_news_cache = {
    'train': ag_news['train'],
    'test': ag_news['test']
}
with open('experiment_results/ag_news.pkl', 'wb') as f:
    pickle.dump(ag_news_cache, f)
print(f"   ✓ Saved {len(ag_news['train'])} train samples")
print(f"   ✓ Saved {len(ag_news['test'])} test samples")
print("   → Experiment 02 will use this cached version")

print("\n✓ All datasets cached successfully!")
print("   Location: experiment_results/")
print("   Files: newsgroups_train.pkl, newsgroups_test.pkl, ag_news.pkl")

Caching datasets for experiments...

1. Caching 20 Newsgroups...
   ✓ Saved 11314 train samples
   ✓ Saved 7532 test samples
   → Experiment 01 will use this cached version

2. Caching AG News...
   ✓ Saved 120000 train samples
   ✓ Saved 7600 test samples
   → Experiment 02 will use this cached version

✓ All datasets cached successfully!
   Location: experiment_results/
   Files: newsgroups_train.pkl, newsgroups_test.pkl, ag_news.pkl
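
If a pickle must be usable without the HuggingFace cache, one option is to copy the columns into plain Python lists before pickling. The sketch below illustrates this on a toy stand-in for a split. `to_plain_columns`, `dump_and_reload`, and `toy_split` are hypothetical names, and the column names `text` and `label` are assumed to match the AG News schema.

```python
import os
import pickle
import tempfile


def to_plain_columns(split, columns):
    """Copy the named columns of a split into ordinary Python lists."""
    return {name: list(split[name]) for name in columns}


def dump_and_reload(obj, path):
    """Pickle obj to path, then return (size_in_bytes, reloaded_object)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return os.path.getsize(path), pickle.load(f)


# Stand-in for ag_news['train']; a real split would be indexed the same way.
toy_split = {"text": ["Stocks rally", "Team wins final"], "label": [2, 1]}

plain = to_plain_columns(toy_split, ["text", "label"])
with tempfile.TemporaryDirectory() as tmp:
    size, restored = dump_and_reload(plain, os.path.join(tmp, "toy.pkl"))

print(size > 0, restored == plain)
```

The trade-off is file size: plain lists for 120,000 AG News rows would produce a much larger pickle than the one written above.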


## Step 5: Download and Cache Models

In [19]:
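from transformers import AutoTokenizer

# Note: this cell downloads tokenizers only. Model weights are not fetched
# here; they download the first time a later notebook loads the model.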

print("Downloading models... This will take several minutes.\n")

# Model 1: DistilBERT (small model for cascading)
print("1. Downloading DistilBERT (small model)...")
distilbert_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
print("   ✓ DistilBERT tokenizer cached")

# Model 2: TinyBERT (tiny model alternative)
print("\n2. Downloading TinyBERT (tiny model)...")
try:
    tinybert_tokenizer = AutoTokenizer.from_pretrained('huawei-noah/TinyBERT_General_4L_312D')
    print("   ✓ TinyBERT tokenizer cached")
except Exception:  # optional model, so any load failure is non-fatal
    print("   ⚠️  TinyBERT unavailable (optional)")

# Model 3: FLAN-T5 (LLM for cascading)
print("\n3. Downloading FLAN-T5-base (LLM)...")
flant5_tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')
print("   ✓ FLAN-T5 tokenizer cached")

print("\n✓ All models downloaded and cached!")
print("   Models are cached in HuggingFace's default cache directory")
print("   They will load quickly in subsequent notebooks")

Downloading models... This will take several minutes.

1. Downloading DistilBERT (small model)...
   ✓ DistilBERT tokenizer cached

2. Downloading TinyBERT (tiny model)...
   ✓ TinyBERT tokenizer cached

3. Downloading FLAN-T5-base (LLM)...
   ✓ FLAN-T5 tokenizer cached

✓ All models downloaded and cached!
   Models are cached in HuggingFace's default cache directory
   They will load quickly in subsequent notebooks


## Step 6: Create Directory Structure

In [20]:
import os

# Create all necessary directories
directories = [
    'experiment_results',
    'models',
    'figures'
]

print("Creating directory structure...\n")
for directory in directories:
    os.makedirs(directory, exist_ok=True)
    print(f"✓ Created: {directory}/")

print("\n✓ Directory structure ready:")
print("  - experiment_results/ (for CSV results and cached data)")
print("  - models/ (for saving trained models)")
print("  - figures/ (for saving plots)")

Creating directory structure...

✓ Created: experiment_results/
✓ Created: models/
✓ Created: figures/

✓ Directory structure ready:
  - experiment_results/ (for CSV results and cached data)
  - models/ (for saving trained models)
  - figures/ (for saving plots)


## Step 7: Save Experiment Metadata

In [21]:
import pickle

# Save experiment metadata
metadata = {
    'torch_version': torch.__version__,
    'transformers_version': transformers.__version__,
    'datasets_version': datasets.__version__,
    'sklearn_version': sklearn.__version__,
    'gpu_available': torch.cuda.is_available(),
    'datasets_downloaded': ['ag_news', 'trec', 'sst2', '20newsgroups'],
    'models_cached': ['distilbert-base-uncased', 'flan-t5-base'],
    'newsgroups_loaded': newsgroups_loaded,
    'cache_location': 'experiment_results/'
}

with open('experiment_results/setup_metadata.pkl', 'wb') as f:
    pickle.dump(metadata, f)

print("✓ Experiment metadata saved to experiment_results/setup_metadata.pkl")

✓ Experiment metadata saved to experiment_results/setup_metadata.pkl
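
A later notebook could read this file back before running. The sketch below is one possible way to do that, not code from the experiment notebooks. `load_setup_metadata` and `missing_datasets` are hypothetical helpers. Because `datasets_downloaded` is written as a fixed list in the cell above, the sketch also consults the `newsgroups_loaded` flag.

```python
import os
import pickle
import tempfile


def load_setup_metadata(path):
    """Load the metadata pickle, with a clear error if setup has not run."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} not found; run the setup notebook first")
    with open(path, "rb") as f:
        return pickle.load(f)


def missing_datasets(metadata, needed):
    """Return the needed datasets that the metadata does not vouch for."""
    recorded = set(metadata.get("datasets_downloaded", []))
    missing = [name for name in needed if name not in recorded]
    # 'datasets_downloaded' is a fixed list, so consult the flag as well.
    if "20newsgroups" in needed and not metadata.get("newsgroups_loaded", False):
        if "20newsgroups" not in missing:
            missing.append("20newsgroups")
    return missing


# Round trip with a small stand-in dictionary that reuses the Step 7 keys.
example = {"datasets_downloaded": ["ag_news", "20newsgroups"], "newsgroups_loaded": False}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "setup_metadata.pkl")
    with open(path, "wb") as f:
        pickle.dump(example, f)
    loaded = load_setup_metadata(path)

print(missing_datasets(loaded, ["ag_news", "20newsgroups"]))
```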


## Step 8: Test Basic Functionality

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

print("Testing basic ML pipeline...\n")

if newsgroups_loaded:
    # Small test with 20 newsgroups
    X_train_sample = newsgroups_train.data[:100]
    y_train_sample = newsgroups_train.target[:100]
    X_test_sample = newsgroups_test.data[:20]
    y_test_sample = newsgroups_test.target[:20]

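    # Create simple TF-IDF + LogReg pipeline
    # Smoke test only: with 100 training samples spread over 20 classes and
    # 20 test samples, the accuracy printed below is not a meaningful metric.
    # The goal is to confirm that the pipeline runs end to end.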
    vectorizer = TfidfVectorizer(max_features=1000)
    X_train_vec = vectorizer.fit_transform(X_train_sample)
    X_test_vec = vectorizer.transform(X_test_sample)

    clf = LogisticRegression(max_iter=100, random_state=42)
    clf.fit(X_train_vec, y_train_sample)

    score = clf.score(X_test_vec, y_test_sample)
    print(f"✓ Test classification accuracy: {score:.2%}")
    print("✓ Basic ML pipeline working correctly")
else:
    print("⚠️  Skipping ML test (20 Newsgroups not available)")
    print("   Experiment 01 will need this dataset to run")

Testing basic ML pipeline...

✓ Test classification accuracy: 0.00%
✓ Basic ML pipeline working correctly
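
An accuracy of 0.00% on 20 test samples is not by itself evidence of a broken pipeline. As a rough worked calculation, assuming a classifier that does no better than uniform guessing over 20 classes and treating the test samples as independent, the probability of getting all 20 wrong is about one in three:

```python
# Assumption: a classifier no better than uniform guessing over 20 classes,
# scored on 20 independent test samples.
num_classes = 20
num_test_samples = 20

p_correct = 1 / num_classes
p_all_wrong = (1 - p_correct) ** num_test_samples

print(f"P(0 correct out of {num_test_samples}) = {p_all_wrong:.3f}")
```

This prints a probability of 0.358. The calculation does not show that the classifier is working either; it only indicates that this sample is too small to tell.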


## Step 9: Verify Cache Files

In [23]:
import os

print("Verifying cached files...\n")

cache_files = [
    'experiment_results/newsgroups_train.pkl',
    'experiment_results/newsgroups_test.pkl',
    'experiment_results/ag_news.pkl',
    'experiment_results/setup_metadata.pkl'
]

all_present = True
for filepath in cache_files:
    if os.path.exists(filepath):
        size_mb = os.path.getsize(filepath) / (1024 * 1024)
        print(f"✓ {filepath} ({size_mb:.2f} MB)")
    else:
        print(f"✗ {filepath} (MISSING)")
        all_present = False

if all_present:
    print("\n✓ All cache files present and ready!")
else:
    print("\n⚠️  Some cache files are missing")
    print("   You may need to re-run the setup steps")

Verifying cached files...

✓ experiment_results/newsgroups_train.pkl (13.22 MB)
✓ experiment_results/newsgroups_test.pkl (7.93 MB)
✓ experiment_results/ag_news.pkl (0.00 MB)
✓ experiment_results/setup_metadata.pkl (0.00 MB)

✓ All cache files present and ready!
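
The check above tests only that each file exists, so a file reported as 0.00 MB still passes. A stricter variant could also flag files below a size threshold. The sketch below is a hypothetical helper (`check_cache` and the `min_bytes` default are assumptions), demonstrated on temporary files rather than the real cache.

```python
import os
import tempfile


def check_cache(paths, min_bytes=1024):
    """Classify each path as 'ok', 'too small', or 'missing'."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            report[path] = "missing"
        elif os.path.getsize(path) < min_bytes:
            report[path] = "too small"
        else:
            report[path] = "ok"
    return report


# Demonstration on temporary files rather than the real cache.
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big.pkl")
    tiny = os.path.join(tmp, "tiny.pkl")
    absent = os.path.join(tmp, "absent.pkl")
    with open(big, "wb") as f:
        f.write(b"x" * 4096)
    with open(tiny, "wb") as f:
        f.write(b"x" * 10)
    report = check_cache([big, tiny, absent])
    statuses = [report[big], report[tiny], report[absent]]

print(statuses)
```

A single threshold is a simplification: `setup_metadata.pkl` is legitimately small, so a real check would likely need a per-file minimum.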


## Step 10: System Information Summary

In [24]:
import platform
import psutil

print("=" * 60)
print("SYSTEM INFORMATION SUMMARY")
print("=" * 60)

print(f"\nPlatform: {platform.platform()}")
print(f"Python Version: {platform.python_version()}")
print(f"CPU Cores: {psutil.cpu_count(logical=False)} physical, {psutil.cpu_count(logical=True)} logical")
print(f"RAM: {psutil.virtual_memory().total / 1e9:.1f} GB total, {psutil.virtual_memory().available / 1e9:.1f} GB available")

print(f"\nGPU: {'Yes (' + torch.cuda.get_device_name(0) + ')' if torch.cuda.is_available() else 'No (CPU only)'}")

print("\n" + "=" * 60)
print("SETUP COMPLETE - READY FOR EXPERIMENTS")
print("=" * 60)

print("\nNext steps:")
if newsgroups_loaded:
    print("  ✓ Run Experiment 1: Model Cascading - Basic (Notebook 01)")
else:
    print("  ⚠️  Experiment 1 requires 20 Newsgroups (download failed)")
    print("     Try re-running this setup notebook")
print("  ✓ Run Experiment 2: Model Cascading - Advanced (Notebook 02)")
print("  ✓ Run Experiment 3: Token-Aware Batching (Notebook 03)")
print("  ✓ Run Analysis: Results Visualization (Notebook 04)")

print("\nCached datasets available in: experiment_results/")
print("All experiments will load from cache (fast!)")

SYSTEM INFORMATION SUMMARY

Platform: Linux-6.6.105+-x86_64-with-glibc2.35
Python Version: 3.12.12
CPU Cores: 1 physical, 2 logical
RAM: 13.6 GB total, 12.1 GB available

GPU: No (CPU only)

SETUP COMPLETE - READY FOR EXPERIMENTS

Next steps:
  ✓ Run Experiment 1: Model Cascading - Basic (Notebook 01)
  ✓ Run Experiment 2: Model Cascading - Advanced (Notebook 02)
  ✓ Run Experiment 3: Token-Aware Batching (Notebook 03)
  ✓ Run Analysis: Results Visualization (Notebook 04)

Cached datasets available in: experiment_results/
All experiments will load from cache (fast!)


## Troubleshooting

If you encounter issues:

1. **Out of Memory**: Restart runtime and run this notebook again
2. **Download Failures**:
   - Check internet connection
   - Re-run the affected download cell
   - Only 20 Newsgroups has an automatic fallback (sklearn → HuggingFace); AG News, TREC, and SST-2 are loaded from HuggingFace only
3. **Import Errors**: Ensure all cells in "Step 1" completed successfully
4. **GPU Not Available**: This is fine - experiments will run on CPU (slower but functional)
5. **20 Newsgroups Failed**:
   - This warning appears only when both the sklearn and HuggingFace downloads failed
   - Check internet connection
   - Try restarting runtime and re-running setup
   - Experiment 01 cannot run without this dataset
   - Experiments 02-04 will still work

**Important for Colab Users:**
- Free Colab has ~12GB RAM and may disconnect after 12 hours
- If disconnected, you'll need to re-run this setup notebook
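- Cache files persist only while the runtime is alive, so re-running within the same session is fast; after the runtime is reset, the files are gone and everything is downloaded again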
- Consider Colab Pro if you need more resources or longer sessions
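
To keep cached files across sessions, one option is to copy `experiment_results/` to storage that outlives the runtime. The sketch below shows only the copy step. `backup_cache` is a hypothetical helper, and the destination is a temporary folder standing in for a persistent location such as a mounted Drive directory (mounting is not shown and is an assumption about your setup).

```python
import os
import shutil
import tempfile


def backup_cache(src_dir, dst_dir):
    """Copy src_dir into dst_dir, overwriting files that already exist."""
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)
    return sorted(os.listdir(dst_dir))


# Demonstration with temporary folders standing in for the real locations.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "experiment_results")
    dst = os.path.join(tmp, "persistent", "experiment_results")
    os.makedirs(src)
    with open(os.path.join(src, "setup_metadata.pkl"), "wb") as f:
        f.write(b"placeholder")
    copied = backup_cache(src, dst)

print(copied)
```

Restoring is the same call with the arguments swapped. The `dirs_exist_ok` argument requires Python 3.8 or newer.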