# **Wikipedia Text Categorization**
### Second attempt: Support Vector Machine
##### *A machine learning project by Cielo Loy for CSE 151A at UCSD*

Preprocessing steps were handled in the first model's creation. I reduced the features of the original dataset from 16 to 3 (name, abstract, infoboxes), and filtered the data into 600000 articles evenly split by 3 categories (arts/entertainment, geography, STEM) using keyword matching.

## Support Vector Machine

In [2]:
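import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.feature_extraction.text  # load submodules explicitly so sk.<submodule> resolves
import sklearn.metrics
import sklearn.model_selection
import sklearn.pipeline
import sklearn.svm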

In [None]:
df = pd.read_json('mini-dataset.jsonl', lines=True)
df["text"] = df["name"].astype(str) + " " + df["abstract"].astype(str)

X_train, X_test, y_train, y_test = sk.model_selection.train_test_split(
    df["text"], df["idx"], test_size=0.2, random_state=42, stratify=df["idx"]
)

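# SVM model: TF-IDF features over unigrams and bigrams, fed into a linear SVM
model = sk.pipeline.make_pipeline(
    sk.feature_extraction.text.TfidfVectorizer(ngram_range=(1, 2)),
    sk.svm.LinearSVC(random_state=42)
)

model.fit(X_train, y_train)

# y_pred holds predictions on the training split, y_hat on the test split
y_pred = model.predict(X_train)
y_hat = model.predict(X_test)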


In [4]:
print("Training error:", 1 - sk.metrics.accuracy_score(y_train, y_pred))
print("Testing error:", 1 - sk.metrics.accuracy_score(y_test, y_hat))

print("\nThe first five ground truths vs prediction of each split:")
print("Train:")
for i in range(5):
    print("True:", y_train.iloc[i], "Pred:", y_pred[i])

print("\nTest:")
for i in range(5):
    print("True:", y_test.iloc[i], "Pred:", y_hat[i])

print("\nClassification Report:")
print(sk.metrics.classification_report(y_test, y_hat, digits=3))

Training error: 0.11142048250386383
Testing error: 0.18250178461282562

The first five ground truths vs prediction of each split:
Train:
True: 1 Pred: 1
True: 2 Pred: 2
True: 0 Pred: 0
True: 1 Pred: 1
True: 1 Pred: 1

Test:
True: 2 Pred: 2
True: 0 Pred: 0
True: 0 Pred: 0
True: 1 Pred: 0
True: 2 Pred: 2

Classification Report:
              precision    recall  f1-score   support

           0      0.827     0.768     0.797     60000
           1      0.749     0.804     0.776     50446
           2      0.909     0.925     0.917     33843

    accuracy                          0.817    144289
   macro avg      0.829     0.832     0.830    144289
weighted avg      0.819     0.817     0.818    144289



The training and testing errors are already much better than those of the Naive Bayes model. One of the five test samples shown above is misclassified (true label 1, predicted 0), but the overall testing error of 0.18 improves on the 0.24 of the first NB model. Precision and recall are almost universally better by at least a small margin, with the most significant improvement in STEM recall (from 0.723 to 0.925).
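To see where misclassifications like the one above come from, a confusion matrix can help. The following is a minimal sketch that uses only the five test samples printed above, not the full test split; in the notebook itself one would presumably pass `y_test` and `y_hat` instead. The variable names here are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# The five test samples printed above (true label vs. prediction).
sample_true = np.array([2, 0, 0, 1, 2])
sample_pred = np.array([2, 0, 0, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(sample_true, sample_pred, labels=[0, 1, 2])

# Per-class recall: correct predictions divided by the row total.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```

On this tiny sample the only off-diagonal entry is a class 1 article predicted as class 0, which matches the pattern in the full report, where class 1 has the lowest precision.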

## Testing different parameters

With the original model, experimenting with the model parameters led to further improvements. Let's once again try restricting the features to unigrams only and to bigrams only.

In [5]:
# Unigrams only
model = sk.pipeline.make_pipeline(
    sk.feature_extraction.text.TfidfVectorizer(ngram_range=(1, 1)),
    sk.svm.LinearSVC(random_state=42)
)
model.fit(X_train, y_train)

y_pred = model.predict(X_train)
y_hat = model.predict(X_test)
print("Training error:", 1 - sk.metrics.accuracy_score(y_train, y_pred))
print("Testing error:", 1 - sk.metrics.accuracy_score(y_test, y_hat))
print("\nClassification Report:")
print(sk.metrics.classification_report(y_test, y_hat, digits=3))

Training error: 0.12545828164309125
Testing error: 0.18428293217085157

Classification Report:
              precision    recall  f1-score   support

           0      0.823     0.770     0.796     60000
           1      0.753     0.804     0.777     50446
           2      0.903     0.914     0.909     33843

    accuracy                          0.816    144289
   macro avg      0.826     0.829     0.827    144289
weighted avg      0.817     0.816     0.816    144289



In [6]:
# Bigrams only
model = sk.pipeline.make_pipeline(
    sk.feature_extraction.text.TfidfVectorizer(ngram_range=(2, 2)),
    sk.svm.LinearSVC(random_state=42)
)
model.fit(X_train, y_train)

y_pred = model.predict(X_train)
y_hat = model.predict(X_test)
print("Training error:", 1 - sk.metrics.accuracy_score(y_train, y_pred))
print("Testing error:", 1 - sk.metrics.accuracy_score(y_test, y_hat))
print("\nClassification Report:")
print(sk.metrics.classification_report(y_test, y_hat, digits=3))

Training error: 0.11090415762809358
Testing error: 0.18961251377443877

Classification Report:
              precision    recall  f1-score   support

           0      0.815     0.769     0.791     60000
           1      0.746     0.793     0.769     50446
           2      0.904     0.910     0.907     33843

    accuracy                          0.810    144289
   macro avg      0.822     0.824     0.822    144289
weighted avg      0.812     0.810     0.811    144289



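To my surprise, all three SVM models have very similar metrics regardless of their n-gram parameters: the testing error only ranges from about 0.183 (unigrams and bigrams) to 0.190 (bigrams only). I imagine the NB model was more sensitive to this setting because of its feature independence assumption, which is not exactly realistic when applied to language. A linear SVM makes no such assumption; it simply looks for the maximum-margin separating hyperplane over whatever features it is given.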
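A side-by-side comparison like this could be scripted in one loop. The sketch below is only illustrative: it runs on a tiny made-up corpus rather than the Wikipedia data, and it assumes `MultinomialNB` as the Naive Bayes variant, which may not match the first model exactly. No conclusions should be drawn from its numbers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus (assumed labels: 0 = arts, 1 = geography, 2 = STEM).
train_texts = [
    "the film won an award for best director",
    "the band released a new studio album",
    "the river flows through the northern valley",
    "the city lies on the coast of the island",
    "the theorem is proved using linear algebra",
    "the enzyme speeds up the chemical reaction",
]
train_labels = [0, 0, 1, 1, 2, 2]
test_texts = [
    "the director released a new film",
    "the valley lies north of the city",
    "the reaction is modeled with linear algebra",
]
test_labels = [0, 1, 2]

results = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    classifiers = [("NB", MultinomialNB()), ("SVM", LinearSVC(random_state=42))]
    for name, classifier in classifiers:
        # Same TF-IDF front end for both classifiers; only ngram_range varies.
        model = make_pipeline(TfidfVectorizer(ngram_range=ngram_range), classifier)
        model.fit(train_texts, train_labels)
        error = 1 - accuracy_score(test_labels, model.predict(test_texts))
        results[(name, ngram_range)] = error

for (name, ngram_range), error in results.items():
    print(name, ngram_range, "testing error:", round(error, 3))
```

Swapping in `X_train`, `X_test`, `y_train`, and `y_test` from the notebook would give the real comparison, at the cost of refitting six models on the full data.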