# BBC Text Representations - Setup & Preprocessing

**Roll Number:** SE22UARI195

**Tasks:**
1. Create master.csv with stratified 5-fold splits
2. Generate deterministic train/dev/test split from roll number
3. Build preprocessing pipeline
4. Save processed data to cache

---

## 1. Setup & Imports

In [1]:
# Core libraries
import pandas as pd
import numpy as np
import pickle
import os
import re
import zlib
from pathlib import Path

# Preprocessing
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Sklearn
from sklearn.model_selection import StratifiedKFold

# Progress bar
from tqdm.notebook import tqdm
tqdm.pandas()

print("✅ Imports successful!")

✅ Imports successful!


In [2]:
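# Download required NLTK data.
# nltk.download() logs failures instead of raising, so the confirmation
# below prints even when a download fails; check the log for errors.
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
print("✅ NLTK data downloaded!")

✅ NLTK data downloaded!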


[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1032)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1032)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1032)>
[nltk_data] Error loading omw-1.4: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1032)>


In [3]:
# Configuration
ROLL = "SE22UARI195"  # Your roll number
SEED = 137  # Fixed seed for reproducible folds

# Paths
DATA_DIR = Path("../data")
CACHE_DIR = Path("../cache")
SRC_FILE = DATA_DIR / "bbc-text.csv"
MASTER_FILE = DATA_DIR / "master.csv"

# Create directories if they don't exist
DATA_DIR.mkdir(exist_ok=True)
CACHE_DIR.mkdir(exist_ok=True)

print(f"Roll Number: {ROLL}")
print(f"Data Directory: {DATA_DIR}")
print(f"Cache Directory: {CACHE_DIR}")

Roll Number: SE22UARI195
Data Directory: ../data
Cache Directory: ../cache


## 2. Create Master CSV with 5-Fold Splits

In [4]:
# Load master.csv if it already exists; otherwise build it from the raw BBC dataset
if MASTER_FILE.exists():
    print("⚠️  master.csv already exists. Loading existing file...")
    df = pd.read_csv(MASTER_FILE)
    print(f"Loaded {len(df)} documents from master.csv")
else:
    print("Creating master.csv...")

    # Stop here if the raw dataset is missing; every later cell needs `df`
    if not SRC_FILE.exists():
        raise FileNotFoundError(
            f"❌ {SRC_FILE} not found! Please place 'bbc-text.csv' in the data/ folder. "
            "You can download it from: [ADD DATASET LINK]"
        )

    df = pd.read_csv(SRC_FILE)
    print(f"✅ Loaded {len(df)} documents from bbc-text.csv")

    # Rename category to label and keep only the columns we need
    df = df.rename(columns={"category": "label"})
    df = df[["text", "label"]]

    # Add sequential IDs
    df["id"] = [f"bbc_{i:05d}" for i in range(len(df))]

    # Create 5 stratified folds: each document is tagged with the fold that holds it out
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    folds = np.zeros(len(df), dtype=int)
    for fold_num, (_, val_idx) in enumerate(skf.split(df["text"], df["label"])):
        folds[val_idx] = fold_num
    df["fold5"] = folds

    # Reorder columns
    df = df[["id", "text", "label", "fold5"]]

    # Sanity checks before writing to disk
    assert df["id"].is_unique, "IDs are not unique!"
    assert df["fold5"].between(0, 4).all(), "Folds not in range 0-4!"
    print("✅ Validation passed!")

    # Save master.csv
    df.to_csv(MASTER_FILE, index=False, encoding="utf-8")
    print(f"\n✅ Saved master.csv with {len(df)} documents")

⚠️  master.csv already exists. Loading existing file...
Loaded 2225 documents from master.csv


In [5]:
# Display basic statistics
print("\n📊 Dataset Statistics:")
print(f"Total documents: {len(df)}")
print("\nClass distribution:")
print(df["label"].value_counts())
print("\nFold distribution:")
print(df["fold5"].value_counts().sort_index())


📊 Dataset Statistics:
Total documents: 2225

Class distribution:
label
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

Fold distribution:
fold5
0    445
1    445
2    445
3    445
4    445
Name: count, dtype: int64


In [6]:
# Show sample documents
print("\n📄 Sample Documents:")
df.head()


📄 Sample Documents:


| | id | text | label | fold5 |
|---|---|---|---|---|
| 0 | bbc_00000 | tv future in the hands of viewers with home th... | tech | 2 |
| 1 | bbc_00001 | worldcom boss left books alone former worldc... | business | 0 |
| 2 | bbc_00002 | tigers wary of farrell gamble leicester say ... | sport | 2 |
| 3 | bbc_00003 | yeading face newcastle in fa cup premiership s... | sport | 0 |
| 4 | bbc_00004 | ocean s twelve raids box office ocean s twelve... | entertainment | 0 |


## 3. Generate Train/Dev/Test Split from Roll Number

The split is **deterministic**: the CRC32 hash of the roll number picks the dev fold and the test fold, and the remaining three folds form the training set.

In [7]:
# Calculate dev and test folds from roll number
r = zlib.crc32(ROLL.encode())
dev_fold = r % 5
test_fold = (r // 5) % 5

# Ensure dev and test folds are different
if test_fold == dev_fold:
    test_fold = (test_fold + 1) % 5

print(f"🎲 Roll Number: {ROLL}")
print(f"🎲 CRC32 Hash: {r}")
print("\n📊 Fold Assignment:")
print(f"  DEV fold:  {dev_fold}")
print(f"  TEST fold: {test_fold}")
print(f"  TRAIN folds: {[f for f in range(5) if f not in [dev_fold, test_fold]]}")

🎲 Roll Number: SE22UARI195
🎲 CRC32 Hash: 1507797122

📊 Fold Assignment:
  DEV fold:  2
  TEST fold: 4
  TRAIN folds: [0, 1, 3]
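
The fold assignment above can also be wrapped in a small function, which makes it easier to check how other roll numbers would be split. This is a minimal sketch: `assign_folds` and the `ROLL_A`/`ROLL_B`/`ROLL_C` inputs are hypothetical names used only for illustration, not part of the notebook. CRC32 is used rather than Python's built-in `hash()` because string hashes from `hash()` are salted per process, while `zlib.crc32` returns the same value on every run.

```python
import zlib


def assign_folds(roll: str, n_folds: int = 5) -> tuple[int, int, list[int]]:
    """Map a roll number to (dev_fold, test_fold, train_folds) via CRC32."""
    r = zlib.crc32(roll.encode())
    dev_fold = r % n_folds
    test_fold = (r // n_folds) % n_folds
    # Shift the test fold by one if it collides with the dev fold
    if test_fold == dev_fold:
        test_fold = (test_fold + 1) % n_folds
    train_folds = [f for f in range(n_folds) if f not in (dev_fold, test_fold)]
    return dev_fold, test_fold, train_folds


# Hypothetical roll numbers, used only to show that each input
# maps to a reproducible (dev, test, train) assignment.
for roll in ["ROLL_A", "ROLL_B", "ROLL_C"]:
    print(roll, assign_folds(roll))
```

Calling `assign_folds(ROLL)` with the notebook's roll number should reproduce the dev and test folds printed above, since it applies the same arithmetic.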


In [8]:
# Split the data
DEV = df[df.fold5 == dev_fold].copy()
TEST = df[df.fold5 == test_fold].copy()
TRAIN = df[~df.fold5.isin([dev_fold, test_fold])].copy()

print("\n📈 Split Sizes:")
print(f"  TRAIN: {len(TRAIN)} documents ({len(TRAIN)/len(df)*100:.1f}%)")
print(f"  DEV:   {len(DEV)} documents ({len(DEV)/len(df)*100:.1f}%)")
print(f"  TEST:  {len(TEST)} documents ({len(TEST)/len(df)*100:.1f}%)")
print(f"  TOTAL: {len(TRAIN) + len(DEV) + len(TEST)} documents")

# Verify no overlap
assert len(set(TRAIN.id) & set(DEV.id)) == 0, "TRAIN and DEV overlap!"
assert len(set(TRAIN.id) & set(TEST.id)) == 0, "TRAIN and TEST overlap!"
assert len(set(DEV.id) & set(TEST.id)) == 0, "DEV and TEST overlap!"
print("\n✅ No overlap between splits!")


📈 Split Sizes:
  TRAIN: 1335 documents (60.0%)
  DEV:   445 documents (20.0%)
  TEST:  445 documents (20.0%)
  TOTAL: 2225 documents

✅ No overlap between splits!


In [9]:
# Check class distribution in each split
print("\n📊 Class Distribution Across Splits:")
print("\nTRAIN:")
print(TRAIN["label"].value_counts())
print("\nDEV:")
print(DEV["label"].value_counts())
print("\nTEST:")
print(TEST["label"].value_counts())


📊 Class Distribution Across Splits:

TRAIN:
label
sport            307
business         306
politics         250
tech             241
entertainment    231
Name: count, dtype: int64

DEV:
label
sport            102
business         102
politics          83
tech              80
entertainment     78
Name: count, dtype: int64

TEST:
label
business         102
sport            102
politics          84
tech              80
entertainment     77
Name: count, dtype: int64


## 4. Text Preprocessing Pipeline

Steps:
1. Lowercase
2. Remove punctuation
3. Normalize whitespace
4. Tokenize
5. Remove stopwords
6. Lemmatize
7. Drop single-character and digit-only tokens
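
The first five steps can be sketched with the standard library alone, which is handy for checking the regex behaviour without NLTK installed. This is an illustration under stated assumptions, not the notebook's pipeline: `basic_clean` and `DEMO_STOPWORDS` are hypothetical names, `str.split` stands in for `word_tokenize`, the stopword set is a tiny hand-picked sample rather than NLTK's English list, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
DEMO_STOPWORDS = {"the", "a", "an", "of", "is", "who", "in", "and"}


def basic_clean(text: str) -> list[str]:
    """Lowercase, strip punctuation, normalize whitespace, split, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    tokens = text.split()                      # stand-in for word_tokenize
    return [t for t in tokens if t not in DEMO_STOPWORDS]


print(basic_clean("The boss, who is accused of an $11bn fraud!"))
# → ['boss', 'accused', '11bn', 'fraud']
```

Note that the punctuation regex keeps only ASCII letters and digits, so currency symbols and accented characters are replaced by spaces.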

In [10]:
# Initialize preprocessing tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

print(f"Stopwords loaded: {len(stop_words)} words")
print(f"Sample stopwords: {list(stop_words)[:10]}")

Stopwords loaded: 198 words
Sample stopwords: ['itself', 'wasn', 'who', 'its', 'each', 'doing', 'there', 'below', 'hasn', 'again']


In [11]:
def preprocess_text(text, remove_stopwords=True, lemmatize=True):
    """
    Preprocess a single text document.
    
    Args:
        text: Input text string
        remove_stopwords: Whether to remove stopwords
        lemmatize: Whether to lemmatize tokens
    
    Returns:
        Dictionary with:
        - 'raw': original text
        - 'tokens': list of processed tokens
        - 'text': space-joined processed tokens
    """
    # Store original
    raw_text = text
    
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove punctuation (keep only alphanumeric and spaces)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    
    # 3. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # 4. Tokenize
    tokens = word_tokenize(text)
    
    # 5. Remove stopwords (optional)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stop_words]
    
    # 6. Lemmatize (optional)
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    
    # Drop single-character tokens and digit-only tokens
    tokens = [t for t in tokens if len(t) > 1 and not t.isdigit()]
    
    return {
        'raw': raw_text,
        'tokens': tokens,
        'text': ' '.join(tokens)
    }

print("✅ Preprocessing function defined!")

✅ Preprocessing function defined!


In [12]:
# Test preprocessing on a sample document
sample_text = TRAIN.iloc[0]['text']
print("📄 Original Text (first 300 chars):")
print(sample_text[:300] + "...\n")

processed = preprocess_text(sample_text)
print("\n🔧 Processed Tokens (first 30):")
print(processed['tokens'][:30])
print(f"\nTotal tokens: {len(processed['tokens'])}")

print("\n📝 Processed Text (first 300 chars):")
print(processed['text'][:300] + "...")

📄 Original Text (first 300 chars):
worldcom boss  left books alone  former worldcom boss bernie ebbers  who is accused of overseeing an $11bn (£5.8bn) fraud  never made accounting decisions  a witness has told jurors.  david myers made the comments under questioning by defence lawyers who have been arguing that mr ebbers was not resp...


🔧 Processed Tokens (first 30):
['worldcom', 'bos', 'left', 'book', 'alone', 'former', 'worldcom', 'bos', 'bernie', 'ebbers', 'accused', 'overseeing', '11bn', '8bn', 'fraud', 'never', 'made', 'accounting', 'decision', 'witness', 'told', 'juror', 'david', 'myers', 'made', 'comment', 'questioning', 'defence', 'lawyer', 'arguing']

Total tokens: 187

📝 Processed Text (first 300 chars):
worldcom bos left book alone former worldcom bos bernie ebbers accused overseeing 11bn 8bn fraud never made accounting decision witness told juror david myers made comment questioning defence lawyer arguing mr ebbers responsible worldcom problem phone company coll

## 5. Process All Splits and Save to Cache

In [13]:
def process_split(split_df, split_name):
    """
    Preprocess every document in a split.

    Args:
        split_df: DataFrame with a 'text' column
        split_name: Label used in the progress and statistics messages

    Returns:
        Copy of split_df with added 'text_raw', 'tokens',
        'text_processed', and 'token_count' columns
    """
    print(f"\n🔧 Processing {split_name} split ({len(split_df)} documents)...")
    
    # Apply preprocessing
    processed = split_df['text'].progress_apply(preprocess_text)
    
    # Create new dataframe
    result_df = split_df.copy()
    result_df['text_raw'] = processed.apply(lambda x: x['raw'])
    result_df['tokens'] = processed.apply(lambda x: x['tokens'])
    result_df['text_processed'] = processed.apply(lambda x: x['text'])
    result_df['token_count'] = result_df['tokens'].apply(len)
    
    # Statistics
    print(f"\n📊 {split_name} Statistics:")
    print(f"  Total documents: {len(result_df)}")
    print(f"  Total tokens: {result_df['token_count'].sum():,}")
    print(f"  Avg tokens/doc: {result_df['token_count'].mean():.1f}")
    print(f"  Min tokens: {result_df['token_count'].min()}")
    print(f"  Max tokens: {result_df['token_count'].max()}")
    
    return result_df

print("✅ Processing function defined!")

✅ Processing function defined!


In [14]:
# Process TRAIN split
train_processed = process_split(TRAIN, "TRAIN")


🔧 Processing TRAIN split (1335 documents)...


📊 TRAIN Statistics:
  Total documents: 1335
  Total tokens: 285,829
  Avg tokens/doc: 214.1
  Min tokens: 61
  Max tokens: 1635


In [15]:
# Process DEV split
dev_processed = process_split(DEV, "DEV")


🔧 Processing DEV split (445 documents)...


📊 DEV Statistics:
  Total documents: 445
  Total tokens: 97,572
  Avg tokens/doc: 219.3
  Min tokens: 79
  Max tokens: 1769


In [16]:
# Process TEST split
test_processed = process_split(TEST, "TEST")


🔧 Processing TEST split (445 documents)...


📊 TEST Statistics:
  Total documents: 445
  Total tokens: 100,831
  Avg tokens/doc: 226.6
  Min tokens: 48
  Max tokens: 2180


In [17]:
# Build vocabulary from TRAIN only
from collections import Counter

print("\n📚 Building vocabulary from TRAIN split...")

# Flatten all tokens
all_train_tokens = []
for tokens in train_processed['tokens']:
    all_train_tokens.extend(tokens)

# Count frequencies
vocab_counter = Counter(all_train_tokens)

print("\n📊 Vocabulary Statistics:")
print(f"  Total tokens: {len(all_train_tokens):,}")
print(f"  Unique tokens: {len(vocab_counter):,}")
print("\n🔝 Top 20 most frequent tokens:")
for token, count in vocab_counter.most_common(20):
    print(f"  {token:15s} : {count:5d}")


📚 Building vocabulary from TRAIN split...

📊 Vocabulary Statistics:
  Total tokens: 285,829
  Unique tokens: 20,404

🔝 Top 20 most frequent tokens:
  said            :  4415
  year            :  1912
  mr              :  1880
  would           :  1570
  also            :  1292
  people          :  1214
  new             :  1205
  one             :  1125
  time            :   923
  could           :   922
  game            :   906
  last            :   813
  two             :   778
  first           :   773
  world           :   750
  say             :   740
  film            :   692
  company         :   679
  firm            :   666
  make            :   647


In [18]:
# Save processed data to cache
print("\n💾 Saving processed data to cache...")

cache_files = {
    'train_processed.pkl': train_processed,
    'dev_processed.pkl': dev_processed,
    'test_processed.pkl': test_processed,
    'vocab_counter.pkl': vocab_counter
}

for filename, data in cache_files.items():
    filepath = CACHE_DIR / filename
    with open(filepath, 'wb') as f:
        pickle.dump(data, f)
    print(f"  ✅ Saved: {filename}")

print("\n🎉 All data saved successfully!")


💾 Saving processed data to cache...
  ✅ Saved: train_processed.pkl
  ✅ Saved: dev_processed.pkl
  ✅ Saved: test_processed.pkl
  ✅ Saved: vocab_counter.pkl

🎉 All data saved successfully!


In [19]:
# Save split metadata
metadata = {
    'roll': ROLL,
    'dev_fold': int(dev_fold),
    'test_fold': int(test_fold),
    'train_size': len(train_processed),
    'dev_size': len(dev_processed),
    'test_size': len(test_processed),
    'vocab_size': len(vocab_counter),
    'total_train_tokens': len(all_train_tokens)
}

metadata_path = CACHE_DIR / 'metadata.pkl'
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata, f)

print("✅ Metadata saved!")
print("\n📋 Metadata:")
for key, value in metadata.items():
    print(f"  {key}: {value}")

✅ Metadata saved!

📋 Metadata:
  roll: SE22UARI195
  dev_fold: 2
  test_fold: 4
  train_size: 1335
  dev_size: 445
  test_size: 445
  vocab_size: 20404
  total_train_tokens: 285829
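
Later notebooks will need to read these cache files back. A minimal sketch of the round trip follows, assuming the same pickle format; `load_pickle` is a hypothetical helper, and the demo writes to a temporary directory so the real cache is left untouched.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def load_pickle(path: Path):
    """Read one pickled object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip demo in a temporary directory, with a toy counter as the payload
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "vocab_counter.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump(Counter({"said": 3, "year": 2}), f)
    restored = load_pickle(demo_path)

print(restored.most_common(1))
# → [('said', 3)]
```

In a later notebook the call would look like `load_pickle(CACHE_DIR / "train_processed.pkl")`. Unpickling executes code embedded in the file, so only load pickles you created yourself, and note that the pickled DataFrames need pandas installed in the reading environment.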


## 6. Summary

✅ **Completed:**
- Created master.csv with 5-fold stratified splits
- Generated train/dev/test split for roll SE22UARI195
- Preprocessed all text (lowercasing, punctuation removal, tokenization, stopword removal, lemmatization)
- Built vocabulary from TRAIN split
- Saved all processed data to cache/

**Next Steps:**
- Build sparse representations (OHE, BoW, N-grams, TF-IDF)
- Build dense representations (Word2Vec, GloVe)
- Train classifiers
- Build retrieval system

In [20]:
print("\n" + "="*60)
print("🎉 NOTEBOOK 01: SETUP & PREPROCESSING COMPLETE! 🎉")
print("="*60)
print(f"\n✅ Processed {len(train_processed) + len(dev_processed) + len(test_processed)} documents")
print(f"✅ Built vocabulary of {len(vocab_counter):,} unique tokens")
print(f"✅ Saved all data to {CACHE_DIR}")
print("\n📝 Ready for next notebook: 02_sparse_methods.ipynb")


🎉 NOTEBOOK 01: SETUP & PREPROCESSING COMPLETE! 🎉

✅ Processed 2225 documents
✅ Built vocabulary of 20,404 unique tokens
✅ Saved all data to ../cache

📝 Ready for next notebook: 02_sparse_methods.ipynb
