# PICNIC CLASSIFIER

Build a simple text classification prototype that predicts a category for short text snippets such as webpage titles, article headlines, or sentences. Example categories: sports, finance, fashion, technology (or any 4–6 you choose).

NB: I did not build this pipeline with space/time complexity in mind, as I'm performing some extra steps (e.g. sorting in place).

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn import metrics
import matplotlib.pyplot as plt

from utils import load_vectorizer, load_model, predict_text

In [None]:
# LABEL_OPTIONS = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

MODEL = "Logistic Regression"
VECTORIZATION_METHOD = "TF-IDF"
LABELS = ['comp.sys.mac.hardware', 'rec.autos', 'sci.space', 'talk.politics.guns']
LABELS.sort()
REMOVE = ('headers', 'footers', 'quotes')


## STEP 1 : Load and explore the dataset

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train', categories=LABELS, remove=REMOVE)
newsgroups_test = fetch_20newsgroups(subset='test', categories=LABELS, remove=REMOVE)

In [None]:
len_train = len(newsgroups_train.data)
len_test = len(newsgroups_test.data)
len_total = len_train + len_test

print(f"For {len_total} documents in total, there are {len_train} docs in train ({(len_train / len_total) * 100:0.0f}%) and {len_test} ({(len_test / len_total) * 100:0.0f}%) docs in test.")

<h3>Labels</h3>

In [None]:
for i, name in enumerate(newsgroups_train.target_names):
    print(f"{i} : {name}")

<h3>Text</h3>


In [None]:
print(newsgroups_train.data[0][:600].strip())

## STEP 2 : Pre process the dataset

To feed text data to a model, we first need to **turn the text into vectors of numerical values**. One vectorizer we can use is scikit-learn's built-in TF-IDF vectorizer. TF-IDF is a statistical method used in NLP to evaluate how important a word is to a document relative to the rest of the corpus. It combines two components:

* TF (Term Frequency): number of times a term appears in a document / total number of words in that document
* IDF (Inverse Document Frequency): rarity of a term across the collection of documents, used to penalize common words: log(total number of documents / (1 + number of documents containing the term)), where the +1 avoids a division by zero
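
To make the two components concrete, here is a minimal pure-Python sketch of the formulas as written above. The toy corpus and the function names are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a slightly different smoothed IDF by default and L2-normalizes each document vector, so its values will not match this sketch exactly.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency, with +1 in the denominator to avoid dividing by zero."""
    n_docs = len(corpus_tokens)
    n_docs_with_term = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(n_docs / (1 + n_docs_with_term))

def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF score of `term` for one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Toy corpus (illustration only), already split into tokens.
corpus = [
    "the rocket reached orbit".split(),
    "the engine of the car stalled".split(),
    "the senate debated the bill".split(),
    "the new laptop boots fast".split(),
]

# "the" appears in every document, "rocket" in only one.
score_the = tf_idf("the", corpus[0], corpus)
score_rocket = tf_idf("rocket", corpus[0], corpus)
print(f"the: {score_the:0.3f} | rocket: {score_rocket:0.3f}")
```

The rare word "rocket" scores higher than the ubiquitous "the", which is the behaviour we want from the weighting. With this exact formula, a term present in every document even gets a negative IDF, which is one reason library implementations use a smoothed variant.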


In [None]:
vectorizer = load_vectorizer(VECTORIZATION_METHOD)

In [None]:
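# Fit on the training set: learn the vocabulary and IDF weights, then vectorize it
vectors_train = vectorizer.fit_transform(newsgroups_train.data)

# Transform only: vectorize the test set with the vocabulary learned on train
vectors_test = vectorizer.transform(newsgroups_test.data)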

## STEP 3 : Load and train the model

Here, we use a simple logistic regression. 

In [None]:
model = load_model(MODEL)

# Fit the model: find the optimal parameters of the regression on the training vectors
model.fit(vectors_train, newsgroups_train.target)

## STEP 4 : Predict 

<h4>Get predictions</h4>


In [None]:
probs = model.predict_proba(vectors_test)
# print(probs[0])

preds = model.predict(vectors_test)
# print(preds[0])
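
The two outputs are closely related. Assuming `load_model` returns a scikit-learn `LogisticRegression` (an assumption, since `utils` is not shown here), `predict` returns the class whose probability in `predict_proba` is the highest. A small sketch with made-up probabilities, where `argmax_labels` is a hypothetical helper:

```python
def argmax_labels(prob_rows, label_names):
    """Return (label, confidence) for each row of class probabilities."""
    results = []
    for row in prob_rows:
        # Index of the highest probability in this row.
        best_index = max(range(len(row)), key=lambda i: row[i])
        results.append((label_names[best_index], row[best_index]))
    return results

# Made-up probabilities for two documents over the four classes (illustration only).
toy_labels = ['comp.sys.mac.hardware', 'rec.autos', 'sci.space', 'talk.politics.guns']
toy_probs = [
    [0.10, 0.70, 0.15, 0.05],
    [0.30, 0.20, 0.45, 0.05],
]

toy_predictions = argmax_labels(toy_probs, toy_labels)
for label, confidence in toy_predictions:
    print(f"{label} ({confidence:0.2f})")
```

Keeping the probabilities around is useful in practice: a prediction made with 0.45 confidence deserves less trust than one made with 0.70.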

<h4>Metrics</h4>


In [None]:
print(f"F1 : {metrics.f1_score(newsgroups_test.target, preds, average='macro'):0.2f}")
print(f"Accuracy : {metrics.accuracy_score(newsgroups_test.target, preds):0.2f}")
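
For reference, the macro-averaged F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. Below is a minimal pure-Python sketch on made-up labels; `macro_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Made-up labels for six documents and three classes (illustration only).
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_macro_f1 = macro_f1(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"F1 : {toy_macro_f1:0.2f}")
print(f"Accuracy : {toy_accuracy:0.2f}")
```

Here the per-class F1 scores are 0.50, 0.80 and 0.67, so the macro F1 (about 0.66) differs slightly from the accuracy (about 0.67). The gap grows when classes are imbalanced.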

<h4>Confusion Matrix</h4>


In [None]:
cm = metrics.confusion_matrix(newsgroups_test.target, preds)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=LABELS)

disp.plot(cmap=plt.cm.Blues, xticks_rotation=90)
plt.title("v0 results")
plt.show()


## If you'd like to try it!

In this playground, you can either use some of the example prompts I got from Claude or insert your own to try out the classifier.

In [None]:
examples = {
    "comp.sys.mac.hardware": [
        "My old PowerBook won’t recognize the new external SCSI drive.",
        "Thanks everyone — the issue was with the Mac’s RAM card, replaced it and it boots fine!"
    ],
    "rec.autos": [
        "Just got a new Honda Civic — love how smooth the engine feels.",
        "My transmission is making a strange noise when shifting gears — could it be low fluid?"
    ],
    "sci.space": [
        "How does NASA plan to maintain communication with spacecraft beyond Mars orbit?",
        "SpaceX successfully launched another batch of Starlink satellites today."
    ],
    "talk.politics.guns": [
        "The new firearm control bill just passed — what does this mean for gun owners?",
        "FBI reports show a rise in illegal weapon sales — stricter enforcement might help."
    ]
}

In [None]:
# all_texts = [text for texts in examples.values() for text in texts]
# one_text = examples["sci.space"][0]
random_text = "Hello I'm computer Charlotte a brand new car, i have 5 seats."
predict_text(random_text, model, vectorizer, LABELS)
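
The `predict_text` helper lives in `utils` and is not shown in this notebook. As a rough idea of what such a helper could do, here is a hypothetical sketch named `predict_text_sketch`: it vectorizes one text, asks the model for class probabilities, and returns the most likely label. It only assumes that the vectorizer exposes `transform` and the model exposes `predict_proba`, as scikit-learn objects do. The stub classes exist purely so the sketch runs without a trained model.

```python
def predict_text_sketch(text, model, vectorizer, labels):
    """Hypothetical stand-in for utils.predict_text: return (label, probability)."""
    features = vectorizer.transform([text])  # one-row feature matrix
    probabilities = list(model.predict_proba(features)[0])
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index], probabilities[best_index]


# Tiny stubs so the sketch is runnable on its own (illustration only).
class StubVectorizer:
    def transform(self, texts):
        # Single made-up feature: does the text mention a car?
        return [[int("car" in t.lower())] for t in texts]

class StubModel:
    def predict_proba(self, features):
        return [[0.1, 0.7, 0.1, 0.1] if row[0] else [0.25, 0.25, 0.25, 0.25]
                for row in features]

stub_labels = ['comp.sys.mac.hardware', 'rec.autos', 'sci.space', 'talk.politics.guns']
stub_result = predict_text_sketch(
    "A brand new car with 5 seats.", StubModel(), StubVectorizer(), stub_labels
)
print(stub_result)
```

A helper like this relies on the order of `labels` matching the order of the model's classes, which is why `LABELS` is sorted at the top of the notebook.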