# Lesson 6 (Bonus): Sentence Classification with fastai

In this notebook, we will build a classifier to predict the **topic** or **meaning** of a sentence. We will use:
1. **fastai**: A high-level library built on PyTorch that simplifies training.
2. **Padding**: Handling variable-length sequences.
3. **Pretrained Embeddings**: Using a model pretrained on a large corpus (Wikipedia) to jump-start learning.

In [None]:
!pip install fastai

In [1]:
from fastai.text.all import *
import pandas as pd

## 1. The Dataset: AG News

We will use the **AG News** dataset, which consists of news articles classified into 4 topics:
1. World
2. Sports
3. Business
4. Sci/Tech

This fits the goal of "sentence meaning classification" (classifying the topic of the text).

In [2]:
# Download and extract the dataset
path = untar_data(URLs.AG_NEWS)
path.ls()

(#4) [Path('/Users/aghasi/.fastai/data/ag_news_csv/classes.txt'),Path('/Users/aghasi/.fastai/data/ag_news_csv/test.csv'),Path('/Users/aghasi/.fastai/data/ag_news_csv/readme.txt'),Path('/Users/aghasi/.fastai/data/ag_news_csv/train.csv')]

The dataset ships as CSV files. Let's inspect the training set.

In [None]:
# Load the full training set (pass nrows=... to pd.read_csv to read a subset for speed).
# The CSV files have no header row, so we name the columns explicitly.
df = pd.read_csv(path/'train.csv', header=None, names=['label', 'title', 'description'])

# Concatenate title and description for more context
df['text'] = df['title'] + " " + df['description']
df.head()

| | label | title | description | text |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again. | Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again. |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reuters) | Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market. | Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market. |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums. | Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums. |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) | Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday. | Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday. |
| 4 | 3 | Oil prices soar to all-time record, posing new menace to US economy (AFP) | AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections. | Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections. |


## 2. Data Processing: Tokenization & Padding

Neural networks process inputs in batches, and every sequence in a batch must have the same length. Text, however, is variable-length.
**fastai** handles this automatically:
- **Tokenization**: Splitting text into words/subwords.
- **Numericalization**: Mapping tokens to integer IDs.
- **Padding**: Adding a special token (e.g., `xxpad`) to make sequences in a batch the same length.
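
The three steps above can be sketched in plain Python. This is a toy illustration only: the helper names (`tokenize`, `build_vocab`, `numericalize`, `pad_batch`) are hypothetical, the whitespace tokenizer is a stand-in for fastai's real one, and fastai may place the padding differently (this sketch pads on the right).

```python
# Toy illustration only: fastai's real tokenizer and vocabulary are more elaborate.
PAD, UNK = "xxpad", "xxunk"

def tokenize(text):
    # Naive whitespace tokenizer (an assumption made for this sketch).
    return text.lower().split()

def build_vocab(token_lists):
    # Reserve index 0 for unknown tokens and index 1 for padding.
    vocab = [UNK, PAD]
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    return vocab

def numericalize(tokens, vocab):
    # Map each token to its integer ID, falling back to the unknown token.
    lookup = {tok: i for i, tok in enumerate(vocab)}
    return [lookup.get(tok, lookup[UNK]) for tok in tokens]

def pad_batch(id_lists, pad_id):
    # Right-pad every sequence to the length of the longest one in the batch.
    max_len = max(len(ids) for ids in id_lists)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in id_lists]

sentences = ["Oil prices soar", "Stocks fall as oil prices soar again"]
tokens = [tokenize(s) for s in sentences]
vocab = build_vocab(tokens)
ids = [numericalize(t, vocab) for t in tokens]
batch = pad_batch(ids, pad_id=vocab.index(PAD))
print(batch)  # [[2, 3, 4, 1, 1, 1, 1], [5, 6, 7, 2, 3, 4, 8]]
```

The shorter sentence is filled with the padding ID (1) until both rows have the same length, which is what lets them be stacked into a single batch.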

We use `TextDataLoaders` to set this up.

In [None]:
dls = TextDataLoaders.from_df(
    df, 
    text_col='text', 
    label_col='label', 
    valid_pct=0.2,   # Use 20% for validation
    bs=64            # Batch size
)

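Let's look at a batch of processed text. In fastai, `xxpad` is the padding token; the other `xx`-prefixed tokens are special markers too, such as `xxbos` (start of a text), `xxmaj` (the next word was capitalized), and `xxunk` (a word outside the vocabulary).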

In [None]:
dls.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj kyoto is xxmaj dead - xxmaj long xxmaj live xxmaj xxunk xxmaj there 's troubling news ( ft subscription xxunk , alternate copy here ) coming from xxmaj japan , where the xxmaj kyoto protocol on xxmaj greenhouse xxmaj emissions was born in 1997 . xxmaj it seems that the xxmaj japanese are n't going to be able to meet their emissions targets specified in the agreement in time . xxmaj indeed , unless they buy a "" large quantity "" of emissions credits from other countries , they 're not going to be able to meet their commitment at all . xxmaj xxunk xxmaj sugiyama , a climate expert at the xxmaj central xxmaj research xxmaj institute of xxmaj electric xxmaj power xxmaj industry in xxmaj japan , said emissions were rising 1 per cent a year due to a larger - than - expected impact from",4
1,"xxbos 2004 xxup us xxmaj senate xxmaj outlook xxmaj with all the hoopla over xxmaj bush and xxmaj kerry , some of you may not have been paying close attention to the other races going on in this loaded xxup us political season . xxmaj i 've read a good dozen or so xxmaj senate outlooks , and my blurry eyes and spinning brain kept getting lost in all the numbers and losing track of who , ultimately , was likely to control the xxmaj senate on xxmaj november third . xxmaj so i made my very own xxmaj senate outlook to figure it out ( or add further confusion , depending on what you think of my predictions ) . xxmaj the bad news is , we probably wo n't know who controls the xxmaj senate on xxmaj november third . xxmaj the good news , if you 're",4
2,"xxbos xxmaj sprint : xxmaj no comment on reported xxmaj nextel merger talks xxup washington - xxmaj rumored merger talks between xxmaj sprint and xxmaj nextel xxmaj communications xxmaj thursday were met with a "" no comment "" from xxunk > advertisement < / p><p><img src=""http : / / ad.doubleclick.net / ad / idg.us.ifw.general / solaris;sz=1x1;ord=200301151450 ? "" width=""1 "" height=""1 "" border=""0 "" / > < a href=""http : / / ad.doubleclick.net / clk;12204780;10550054;n?http : / / ad.doubleclick.net / clk;12165994;105 xxrep 3 2 95;g?http : / / xxrep 3 w .sun.com / solaris10"">solaris 10(tm ) xxup os : xxmaj position your business ten moves ahead . < / a><br / > solaris 10 xxup os has arrived and provides even more \ reasons for the world 's most demanding businesses \ to operate on this , the leading xxup unix platform . \ xxmaj like the fact you can",4


## 3. The Model: Pretrained Embeddings with AWD-LSTM

Instead of training embeddings from scratch (starting from random values), we will use a model pretrained on **Wikipedia** (Wikitext-103). This model has already learned English structure, grammar, and some semantics.

We use `text_classifier_learner`, which:
1. Loads the pretrained **AWD-LSTM** (a type of Recurrent Neural Network).
2. Replaces the last layer with a classifier for our 4 topics.
3. Freezes the body so we only train the new head first.

In [None]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

## 4. Training (Fine-tuning)

We use `fine_tune`, which trains in two phases:
1. Train only the new head (1 epoch).
2. Unfreeze and train the whole model with discriminative learning rates.
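
To make "discriminative learning rates" concrete: the idea is that earlier layers, which hold general language knowledge, get smaller learning rates than the new head. The sketch below only illustrates that idea with geometrically spaced rates; the function name `discriminative_lrs`, the number of layer groups, and the rate range are assumptions for illustration, not fastai's exact internals.

```python
def discriminative_lrs(lr_min, lr_max, n_groups):
    # Geometrically spaced learning rates: the earliest layer group gets
    # lr_min and the last group (the classifier head) gets lr_max.
    if n_groups == 1:
        return [lr_max]
    ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * ratio ** i for i in range(n_groups)]

# Hypothetical setup: 3 layer groups, rates spanning two orders of magnitude.
lrs = discriminative_lrs(lr_min=1e-5, lr_max=1e-3, n_groups=3)
for i, lr in enumerate(lrs):
    print(f"group {i}: lr = {lr:.1e}")
```

With these values the middle group lands near `1e-4`, so each group learns roughly ten times faster than the one before it.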

In [None]:
# Train for 1 epoch to save time for this demo (usually 3-4 is better)
learn.fine_tune(1, base_lr=2e-3)


## 6. Bonus: A Simpler Model (Bag of Embeddings)


Sometimes we don't need an RNN as complex as an LSTM. A simple **Bag of Embeddings** (averaging or summing word vectors) followed by a linear layer can work surprisingly well.

This approach:
1.  Takes the **pretrained** word embeddings (from the model above).
2.  **Sums** (or averages) them up for all words in the sentence.
3.  Passes the result through a single **Linear** layer to predict the 4 classes.
4.  Uses **Softmax** (implicitly via CrossEntropyLoss) to get probabilities.
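
To make step 4 concrete, here is a small pure-Python sketch of what a softmax followed by a cross-entropy loss computes from the linear layer's raw scores (logits). The logit values are made up for illustration, and the function names are hypothetical; in practice PyTorch's `CrossEntropyLoss` performs both steps internally.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.5, 0.1, -1.0]   # made-up scores for the 4 classes
probs = softmax(logits)
loss = cross_entropy(logits, target=0)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The probabilities sum to 1, and the loss shrinks as the probability of the correct class approaches 1, which is why the model only needs to output logits.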

In [None]:
class BoEWrapper(nn.Module):
    def __init__(self, vocab_size, emb_dim, n_classes, pad_idx=1):
        super().__init__()
        # 1. Embedding Layer
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        # 2. Linear Layer
        self.linear = nn.Linear(emb_dim, n_classes)
        
    def forward(self, x):
        # x shape: (batch_size, seq_len)
        embeddings = self.emb(x) # (batch, seq, emb_dim)
        
        # Mask padding (we don't want to sum the padding tokens)
        # padding_idx is 1 in fastai
        mask = (x != self.emb.padding_idx).unsqueeze(-1) # (batch, seq, 1)
        embeddings = embeddings * mask.float()
        
        # 3. Sum along sequence dimension
        summed = embeddings.sum(dim=1) # (batch, emb_dim)
        
        # 4. Linear layer -> logits
        return self.linear(summed)

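# Initialize the simple model
vocab = dls.vocab[0]            # Input (text) vocabulary; dls.vocab[1] holds the labels
vocab_size = len(vocab)
emb_dim = 400                   # AWD-LSTM uses 400-dim embeddings
n_classes = 4                   # 4 news categories

# The DataLoaders object has no `pad_idx` attribute, so look up the index
# of the padding token in the vocabulary instead (1 in fastai's default vocab).
pad_idx = list(vocab).index('xxpad')

boe_model = BoEWrapper(vocab_size, emb_dim, n_classes, pad_idx=pad_idx)

# Copy the PRETRAINED embedding weights from the AWD-LSTM learner created above.
# This ensures we start with rich word meanings!
# learn.model[0] is the sentence-encoder wrapper; the AWD-LSTM itself, with its
# `encoder` embedding layer, sits in its `.module` attribute.
encoder_weights = learn.model[0].module.encoder.weight
boe_model.emb.weight.data.copy_(encoder_weights)

# Move the model to the same device as the data (GPU if available)
boe_model = boe_model.to(dls.device)

print("Simple Bag-of-Embeddings model created and initialized with pretrained weights!")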

## 5. Inference: Classifying New Sentences

Now we can pass any sentence to the model to see its predicted topic.

In [8]:
topics = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

def predict_topic(text):
    pred, _, probs = learn.predict(text)
    # pred is the class string (e.g., '1'), convert to int if needed
    topic_id = int(pred)
    print(f"Text: '{text}'")
    print(f"Predicted Topic: {topics[topic_id]} (Confidence: {probs.max():.2f})\n")

predict_topic("The stock market crashed today due to inflation fears.")
predict_topic("Manchester United won the match against Chelsea.")
predict_topic("New AI model solves complex physics problems.")
predict_topic("Peace talks continue in the Middle East region.")

Text: 'The stock market crashed today due to inflation fears.'
Predicted Topic: Business (Confidence: 0.94)

Text: 'Manchester United won the match against Chelsea.'
Predicted Topic: Sports (Confidence: 0.99)

Text: 'New AI model solves complex physics problems.'
Predicted Topic: Sci/Tech (Confidence: 0.77)

Text: 'Peace talks continue in the Middle East region.'
Predicted Topic: World (Confidence: 0.76)
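
The confidence printed above is only the largest class probability. As a minimal sketch, the full probability vector can be turned into a ranked list of topics. The helper `rank_topics` is hypothetical, the probabilities are made-up values rather than model output, and the sketch assumes that position `i` in the vector corresponds to label `i + 1`.

```python
topics = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}

def rank_topics(probs, topics):
    # Assumption: probs[i] is the probability of label i + 1 (labels are 1-indexed).
    pairs = [(topics[i + 1], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

example_probs = [0.10, 0.03, 0.70, 0.17]  # made-up values, not model output
ranking = rank_topics(example_probs, topics)
for name, p in ranking:
    print(f"{name}: {p:.2f}")
```

Looking at the runner-up class is a quick way to spot sentences that sit between two topics, such as a business story about a technology company.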



## Summary

We successfully:
1. Loaded the **AG News** dataset.
2. Used `TextDataLoaders` to handle **tokenization** and **padding**.
3. Loaded a **pretrained** AWD-LSTM model.
4. Fine-tuned it to classify the topic of a sentence.