In [23]:
!pip install numpy pandas



In [24]:
!pip install scikit-learn



# AIG230 NLP (Week 3 Lab) — Notebook 1: Text Representation

This notebook focuses on **turning raw text into numeric features** you can use in real-world ML systems.

You will build:
- a clean **train/test split**
- **Bag-of-Words** (binary and count)
- **Document-Term Matrix** (DTM)
- **TF-IDF** (with n-grams)
- **Hashing trick** (production-friendly)
- basic **retrieval** (cosine similarity) and a **baseline classifier**
- model **persistence** (save/load)

## 0) Setup


In [25]:

import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib


## 1) A small, realistic dataset (you can replace with your own CSV)


In industry, text often comes with:
- an **ID**
- free-text **description**
- a **label** (category, priority, intent, topic) or a target (churn, fraud, etc.)

Here we create a toy dataset that looks like support tickets / ops incidents.  
Swap this section with a `pd.read_csv(...)` in your own workflows.


In [48]:

data = [
    ("T-001", "VPN keeps disconnecting every 10 minutes on Windows 11 after latest update", "network"),
    ("T-002", "Password reset link is expired and user cannot login to the portal", "auth"),
    ("T-003", "Email delivery delayed, outbound messages queued for hours", "messaging"),
    ("T-004", "Cannot install printer driver, installer fails with error code 1603", "device"),
    ("T-005", "MFA prompt never arrives on mobile app, user stuck at login", "auth"),
    ("T-006", "WiFi signal drops in meeting rooms, access point reboot helps temporarily", "network"),
    ("T-007", "Outlook search not returning results, index seems corrupted", "messaging"),
    ("T-008", "Laptop battery drains fast after BIOS update, power settings unchanged", "device"),
    ("T-009", "Portal shows 500 error when submitting form, happened after deployment", "app"),
    ("T-010", "API requests timing out, latency spike observed in last hour", "app"),
    ("T-011", "User cannot access shared drive, permission denied though in correct group", "auth"),
    ("T-012", "Teams calls have choppy audio, jitter high on corporate network", "network"),
    ("T-013", "Push notifications not working on Android for the app", "app"),
    ("T-014", "Mailbox is full and cannot receive emails, auto-archive not running", "messaging"),
    ("T-015", "Bluetooth mouse not pairing after restart, device shows as unknown", "device"),
    ("T-016", "Unable to fall asleep for hours and keep waking up multiple times at night", "sleep"),
    ("T-017", "Feeling very sleepy during the day even after sleeping 8 hours", "sleep"),
    ("T-018", "Need a meal plan with higher protein to support muscle gain and recovery", "nutrition"),
    ("T-019", "Stomach discomfort after meals, suspect certain foods causing issues", "nutrition"),
    ("T-020", "Knee pain after running workouts, need advice to adjust training safely", "exercise"),
    ("T-021", "Want a beginner strength routine to build muscle and improve overall fitness", "exercise"),
    ("T-022", "High fever and body aches for two days, not improving", "disease"),
    ("T-023", "Persistent cough and breathing discomfort, symptoms getting worse", "disease"),
    ("T-024", "Sleep schedule is inconsistent due to stress and late-night screen time", "sleep"),
    ("T-025", "Need nutrition guidance for fat loss while keeping energy for workouts", "nutrition"),

]

df = pd.DataFrame(data, columns=["ticket_id", "text", "label"])
df


Unnamed: 0,ticket_id,text,label
0,T-001,VPN keeps disconnecting every 10 minutes on Wi...,network
1,T-002,Password reset link is expired and user cannot...,auth
2,T-003,"Email delivery delayed, outbound messages queu...",messaging
3,T-004,"Cannot install printer driver, installer fails...",device
4,T-005,"MFA prompt never arrives on mobile app, user s...",auth
5,T-006,"WiFi signal drops in meeting rooms, access poi...",network
6,T-007,"Outlook search not returning results, index se...",messaging
7,T-008,"Laptop battery drains fast after BIOS update, ...",device
8,T-009,"Portal shows 500 error when submitting form, h...",app
9,T-010,"API requests timing out, latency spike observe...",app


In [49]:
#create csv file
df[["text", "label"]].to_csv("tickets.csv", index=False)

In [50]:
df = pd.read_csv("tickets.csv")
df.head()

Unnamed: 0,text,label
0,VPN keeps disconnecting every 10 minutes on Wi...,network
1,Password reset link is expired and user cannot...,auth
2,"Email delivery delayed, outbound messages queu...",messaging
3,"Cannot install printer driver, installer fails...",device
4,"MFA prompt never arrives on mobile app, user s...",auth


### Train/test split


In [51]:

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.33, random_state=42, stratify=df["label"]
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


Train size: 16
Test size: 9
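
The `stratify` argument used above keeps label proportions similar in both splits. A minimal sketch on hypothetical toy labels (the texts, labels, and variable names here are made up for illustration and are not part of the lab data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 classes with 4 examples each
toy_texts = [f"ticket {i}" for i in range(12)]
toy_labels = ["auth"] * 4 + ["network"] * 4 + ["device"] * 4

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Count how many examples of each label landed in each split
train_dist = Counter(tr_y)
test_dist = Counter(te_y)
print("train:", dict(train_dist))
print("test: ", dict(test_dist))
```

With balanced classes and a 25% test split, each class should contribute one test example and three training examples.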


## 2) Tokenization basics and normalization (lightweight, practical)


In production pipelines you typically do **minimal, safe normalization**:
- lowercase
- normalize whitespace
- optionally strip obvious punctuation
- keep numbers when they carry meaning (error codes, versions, dates)

Heavy normalization (stemming, aggressive regexes) can hurt when your text includes:
error codes, product names, IDs, or domain terminology.


In [54]:

def simple_normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["text_norm"] = df["text"].map(simple_normalize)
df[["text_norm","label"]].head()


Unnamed: 0,text_norm,label
0,vpn keeps disconnecting every 10 minutes on wi...,network
1,password reset link is expired and user cannot...,auth
2,"email delivery delayed, outbound messages queu...",messaging
3,"cannot install printer driver, installer fails...",device
4,"mfa prompt never arrives on mobile app, user s...",auth
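
The optional punctuation step from the list above can be sketched as follows. This is one possible variant, not the notebook's normalizer: the function name `normalize_keep_codes` and the choice to keep `.` and `-` (so version strings survive) are assumptions made for illustration.

```python
import re

def normalize_keep_codes(text: str) -> str:
    """Lowercase, drop most punctuation, keep digits plus '.' and '-'."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, '.' or '-'
    text = re.sub(r"[^\w\s.-]", " ", text)
    # Collapse runs of whitespace left behind by the replacement
    text = re.sub(r"\s+", " ", text).strip()
    return text

example = normalize_keep_codes("Installer fails, error code 1603!!  (v2.1-beta)")
print(example)
```

The error code `1603` and the version string `v2.1-beta` survive, while commas, exclamation marks, and parentheses are removed.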


## 3) Vocabulary + Document-Term Matrix (DTM) with CountVectorizer


**CountVectorizer** builds:
- a vocabulary (token → column index)
- a sparse matrix where rows are documents and columns are tokens

This is the classic **Document-Term Matrix** representation.


In [55]:

count_vec = CountVectorizer(
    lowercase=True,
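    # The default pattern requires 2+ characters; this one also keeps
    # 1-character tokens such as "a". "500", "1603", "mfa" are kept either way.
    token_pattern=r"(?u)\b\w+\b",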
    min_df=1
)

X_train_counts = count_vec.fit_transform(X_train)
X_test_counts  = count_vec.transform(X_test)

print("DTM shape (train):", X_train_counts.shape)
print("Vocabulary size:", len(count_vec.vocabulary_))


DTM shape (train): (16, 140)
Vocabulary size: 140


### Inspect the vocabulary and a single row


In [56]:

# Show a small slice of the vocabulary (token -> index)
vocab_items = sorted(count_vec.vocabulary_.items(), key=lambda x: x[1])[:25]
vocab_items


[('10', 0),
 ('11', 1),
 ('1603', 2),
 ('a', 3),
 ('access', 4),
 ('after', 5),
 ('and', 6),
 ('android', 7),
 ('api', 8),
 ('app', 9),
 ('archive', 10),
 ('arrives', 11),
 ('asleep', 12),
 ('at', 13),
 ('auto', 14),
 ('battery', 15),
 ('beginner', 16),
 ('bios', 17),
 ('breathing', 18),
 ('build', 19),
 ('cannot', 20),
 ('causing', 21),
 ('certain', 22),
 ('code', 23),
 ('corrupted', 24)]

In [57]:

# Look at a specific document row: non-zero entries (token counts)
row_id = 0
row = X_train_counts[row_id]
inv_vocab = {idx: tok for tok, idx in count_vec.vocabulary_.items()}

nz_cols = row.nonzero()[1]
tokens_counts = sorted([(inv_vocab[c], int(row[0, c])) for c in nz_cols], key=lambda x: -x[1])
tokens_counts[:20]


[('want', 1),
 ('a', 1),
 ('beginner', 1),
 ('strength', 1),
 ('routine', 1),
 ('to', 1),
 ('build', 1),
 ('muscle', 1),
 ('and', 1),
 ('improve', 1),
 ('overall', 1),
 ('fitness', 1)]

## 4) Binary vs Count-based Bag-of-Words


Binary BoW: token present or not (good for short texts and some classification tasks)  
Count BoW: raw frequency (baseline for many pipelines)

Both discard word order.


In [58]:
binary_vec = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")

X_train_bin = binary_vec.fit_transform(X_train)
# X_test_bin  = binary_vec.transform(X_test)


In [59]:
X_train_bin.shape

(16, 140)

In [60]:
X_train_bin

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 169 stored elements and shape (16, 140)>
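
The difference between the two representations only shows up when a token repeats within a document. A small sketch on one made-up sentence (variable names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["error error error on login"]
pattern = r"(?u)\b\w+\b"

# Same document, two encodings; the vocabulary is ['error', 'login', 'on']
count_row = CountVectorizer(token_pattern=pattern).fit_transform(doc).toarray()[0]
binary_row = CountVectorizer(binary=True, token_pattern=pattern).fit_transform(doc).toarray()[0]

print(count_row.tolist())   # [3, 1, 1]
print(binary_row.tolist())  # [1, 1, 1]
```

In the short tickets of this notebook most tokens appear once per document, so the binary and count matrices are likely to be nearly identical.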

## 5) TF-IDF (a refinement, not a replacement)


TF-IDF downweights very common tokens and upweights tokens that are more distinctive.

In industry, TF-IDF with **n-grams** is a strong baseline for:
- ticket routing
- intent detection
- spam detection
- incident clustering


In [61]:
tfidf_vec = TfidfVectorizer(
    ngram_range=(1,3),
    token_pattern=r"(?u)\b\w+\b",
    min_df=1,
    sublinear_tf=True
)

X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

In [62]:
X_train_tfidf.shape

(16, 432)

## 6) Quick retrieval: 'find similar tickets' with cosine similarity


A very common industry use case is **nearest neighbor retrieval** for:
- deduplication
- suggesting knowledge base articles
- finding similar past incidents
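
Because `TfidfVectorizer` L2-normalizes each row by default, cosine similarity between a query and a document reduces to a plain dot product. A minimal sketch on a made-up corpus (all names here are hypothetical and independent of the notebook's variables):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_corpus = [
    "vpn keeps disconnecting",
    "password reset link expired",
    "vpn connection drops",
]
toy_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_corpus = toy_vec.fit_transform(toy_corpus)
q_vec = toy_vec.transform(["vpn disconnecting"])

# Two ways to score: explicit cosine similarity vs. sparse dot product
sims_cos = cosine_similarity(q_vec, X_corpus).ravel()
sims_dot = (q_vec @ X_corpus.T).toarray().ravel()

best = int(np.argmax(sims_cos))
print(np.round(sims_cos, 3), "best match:", toy_corpus[best])
```

A document that shares no tokens with the query scores exactly zero, which is why unrelated tickets tend to show up with very small similarities in a top-k list.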


In [64]:

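# Build a search index from ALL tickets using TF-IDF.
# A separate vectorizer keeps tfidf_vec (fitted on X_train above) unchanged.
search_vec = TfidfVectorizer(
    ngram_range=(1,3),
    token_pattern=r"(?u)\b\w+\b",
    sublinear_tf=True
)
X_all = search_vec.fit_transform(df["text"])

def search_similar(query: str, top_k: int = 5):
    qv = search_vec.transform([query])
    sims = cosine_similarity(qv, X_all).ravel()
    top_idx = np.argsort(-sims)[:top_k]  # positions of the top_k highest scores
    return df.iloc[top_idx][["text","label"]].assign(similarity=sims[top_idx])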

queries = [
    "cannot sleep at night waking up often",
    "need high protein meal plan for muscle gain",
    "knee pain after running workout"
]

for q in queries:
    print("\nQUERY:", q)
    print(search_similar(q, top_k=3))



QUERY: cannot sleep at night waking up often
                                                 text  label  similarity
15  Unable to fall asleep for hours and keep wakin...  sleep    0.346746
23  Sleep schedule is inconsistent due to stress a...  sleep    0.119790
4   MFA prompt never arrives on mobile app, user s...   auth    0.055073

QUERY: need high protein meal plan for muscle gain
                                                 text      label  similarity
17  Need a meal plan with higher protein to suppor...  nutrition    0.433832
24  Need nutrition guidance for fat loss while kee...  nutrition    0.092190
21  High fever and body aches for two days, not im...    disease    0.085589

QUERY: knee pain after running workout
                                                 text      label  similarity
19  Knee pain after running workouts, need advice ...   exercise    0.539671
13  Mailbox is full and cannot receive emails, aut...  messaging    0.052155
18  Stomach discomfort after me

QUERY: cannot sleep at night waking up often
-> T-016  Unable to fall asleep for hours and keep wakin...  sleep   
 
-> The query and the top ticket share the tokens "waking", "up", "at", and "night", plus the bigrams "waking up" and "at night". Note that "sleep" does not match "asleep", because TF-IDF compares exact tokens. The shared terms are still enough to give this ticket the highest similarity score.

QUERY: need high protein meal plan for muscle gain
-> T-018  Need a meal plan with higher protein to suppor...  nutrition   

-> The query and the top ticket share the tokens "need", "meal", "plan", "protein", "muscle", and "gain", plus the bigrams "meal plan" and "muscle gain". "high" does not match "higher", but the other overlapping terms are distinctive enough to make this the most relevant match.


QUERY: knee pain after running workout
-> T-020  Knee pain after running workouts, need advice ...   exercise

-> The query matches the opening of the ticket almost word for word: "knee", "pain", "after", and "running", along with n-grams such as "knee pain" and "pain after running". "workout" does not match "workouts", but the long run of shared n-grams gives this pair the highest similarity score of the three queries.

## 7) Classification baseline (Logistic Regression)


For text classification, a strong baseline is:

**TF-IDF → Linear model (LogReg / Linear SVM)**

This is fast, reliable, easy to explain, and often hard to beat without deep learning.


In [65]:

clf = LogisticRegression(max_iter=2000)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,3),
        token_pattern=r"(?u)\b\w+\b",
        sublinear_tf=True
    )),
    ("model", clf)
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)

print(classification_report(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       1.00      1.00      1.00         1
      device       0.00      0.00      0.00         1
     disease       0.00      0.00      0.00         1
    exercise       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       0.00      0.00      0.00         1
   nutrition       0.00      0.00      0.00         1
       sleep       0.00      0.00      0.00         1

    accuracy                           0.11         9
   macro avg       0.11      0.11      0.11         9
weighted avg       0.11      0.11      0.11         9

Confusion matrix:
 [[0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0]]




## 8) Production pattern: HashingVectorizer (no stored vocab)


In production, you may need:
- constant memory usage
- privacy (no vocabulary inspection)
- streaming support
- easier deployment across services

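**HashingVectorizer** avoids building a vocabulary by hashing each token straight to a column index. Tradeoff: collisions (distinct tokens can land in the same column), and a column cannot be mapped back to its token.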
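
A small sketch of both properties, using made-up text and hypothetical variable names. `norm=None` is set here only so the raw counts are visible, and the tiny `n_features=4` is deliberately unrealistic to force collisions.

```python
from sklearn.feature_extraction.text import HashingVectorizer

pattern = r"(?u)\b\w+\b"
doc = ["vpn wifi email printer laptop portal api teams"]  # 8 distinct tokens

wide = HashingVectorizer(n_features=2**18, alternate_sign=False,
                         norm=None, token_pattern=pattern)
tiny = HashingVectorizer(n_features=4, alternate_sign=False,
                         norm=None, token_pattern=pattern)

# No fit step is needed: the hash function defines the column for each token
X_wide = wide.transform(doc)
X_tiny = tiny.transform(doc)

print("wide shape:", X_wide.shape, "non-zero columns:", X_wide.nnz)
print("tiny shape:", X_tiny.shape, "non-zero columns:", X_tiny.nnz)
```

With only 4 columns, 8 distinct tokens must share columns, so their counts are added together. A larger `n_features` makes collisions rare at the cost of a wider (but still sparse) matrix.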


In [66]:
hash_pipe = Pipeline([
    ("hash", HashingVectorizer(
        n_features=2**18,        # tune for your scale
        alternate_sign=False,    # makes features more interpretable for linear models
        ngram_range=(1,2),
        token_pattern=r"(?u)\b\w+\b"
    )),
    ("model", LogisticRegression(max_iter=2000))
])

hash_pipe.fit(X_train, y_train)
pred_hash = hash_pipe.predict(X_test)
print(classification_report(y_test, pred_hash))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       1.00      1.00      1.00         1
      device       0.00      0.00      0.00         1
     disease       0.00      0.00      0.00         1
    exercise       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       1.00      1.00      1.00         1
   nutrition       0.00      0.00      0.00         1
       sleep       0.00      0.00      0.00         1

    accuracy                           0.22         9
   macro avg       0.22      0.22      0.22         9
weighted avg       0.22      0.22      0.22         9





## 9) Save and load the model (typical deployment step)


In [67]:
import joblib

# Save the fitted pipeline (vectorizer + model) to disk
joblib.dump(pipeline, "tfidf_logreg_pipeline.joblib")
print("Model saved to: tfidf_logreg_pipeline.joblib")

Model saved to: tfidf_logreg_pipeline.joblib


In [68]:
# Load the pipeline back from disk
loaded_pipeline = joblib.load("tfidf_logreg_pipeline.joblib")
print("Loaded successfully")

Loaded successfully


In [69]:
# Check that the loaded pipeline reproduces the original test results
pred_loaded = loaded_pipeline.predict(X_test)
print("Loaded model test report:")
print(classification_report(y_test, pred_loaded))
print("Loaded confusion matrix:\n", confusion_matrix(y_test, pred_loaded))

Loaded model test report:
              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       1.00      1.00      1.00         1
      device       0.00      0.00      0.00         1
     disease       0.00      0.00      0.00         1
    exercise       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       0.00      0.00      0.00         1
   nutrition       0.00      0.00      0.00         1
       sleep       0.00      0.00      0.00         1

    accuracy                           0.11         9
   macro avg       0.11      0.11      0.11         9
weighted avg       0.11      0.11      0.11         9

Loaded confusion matrix:
 [[0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0]]
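
Comparing printed reports by eye works, but a round-trip check can also be done programmatically. A minimal, self-contained sketch, assuming a made-up four-ticket dataset and a hypothetical file name inside a temporary directory:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

toy_texts = ["vpn keeps disconnecting", "password reset link expired",
             "wifi signal drops", "mfa prompt never arrives"]
toy_labels = ["network", "auth", "network", "auth"]

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")),
    ("model", LogisticRegression(max_iter=2000)),
]).fit(toy_texts, toy_labels)

# Save and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.joblib")  # hypothetical file name
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

new_texts = ["vpn drops on wifi", "reset link never arrives"]
same_labels = np.array_equal(toy_pipe.predict(new_texts), restored.predict(new_texts))
same_probs = np.allclose(toy_pipe.predict_proba(new_texts), restored.predict_proba(new_texts))
print("identical predictions:", same_labels, "| identical probabilities:", same_probs)
```

Saving the whole `Pipeline` rather than the model alone means the fitted vectorizer travels with it, so raw text can be passed straight to `predict` after loading.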




## Exercises (do these during lab)
1) Add 10 more tickets to `data` with realistic wording and labels. Re-train and compare results.  
2) Try `ngram_range=(1,3)` and observe what changes.  
3) For retrieval, test at least 3 queries and explain why the top result makes sense.  
4) Replace the dataset with a CSV you create (columns: `text`, `label`) and rerun the notebook.
