# LAB 7: Error analysis

Objectives
* Construct a linear text classifier using SGDClassifier
* Evaluate its performance and categorize the errors that it makes
* Examine the model's coefficients and decision function values
* Interpret model results using LIME

In [1]:
import numpy as np
import pandas as pd
from cytoolz import *
from tqdm.auto import tqdm

tqdm.pandas()

---

## Load data

In [2]:
train = pd.read_parquet(
    "s3://ling583/lab7-train.parquet", storage_options={"anon": True}
)
test = pd.read_parquet("s3://ling583/lab7-test.parquet", storage_options={"anon": True})

In [3]:
import spacy

nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tagger", "parser", "ner", "lemmatizer", "attribute_ruler"],
)


def tokenize(text):
    doc = nlp.tokenizer(text)
    return [t.norm_ for t in doc if not (t.is_space or t.is_punct or t.like_num)]

In [4]:
import multiprocessing as mp

In [5]:
with mp.Pool() as p:
    train["tokens"] = pd.Series(p.imap(tokenize, tqdm(train["text"]), chunksize=100))
    test["tokens"] = pd.Series(p.imap(tokenize, tqdm(test["text"]), chunksize=100))

  0%|          | 0/19054 [00:00<?, ?it/s]

  0%|          | 0/4764 [00:00<?, ?it/s]

The labels are: GPOL = domestic politics, GSPO = sports, GVIO = war/civil war, GJOB = labor issues

In [6]:
train["topics"].value_counts()

GPOL    7410
GSPO    5639
GVIO    3712
GJOB    2293
Name: topics, dtype: int64

---

## Baseline classifier

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.pipeline import make_pipeline

In [8]:
baseline = make_pipeline(CountVectorizer(analyzer=identity), SGDClassifier())
baseline.fit(train["tokens"], train["topics"])
base_predicted = baseline.predict(test["tokens"])
print(classification_report(test["topics"], base_predicted))

# macro avg: unweighted mean of the four per-class scores
# weighted avg: mean of the per-class scores, weighted by support (test documents per class)

              precision    recall  f1-score   support

        GJOB       0.93      0.96      0.94       573
        GPOL       0.94      0.94      0.94      1853
        GSPO       0.99      0.99      0.99      1410
        GVIO       0.91      0.89      0.90       928

    accuracy                           0.95      4764
   macro avg       0.94      0.95      0.94      4764
weighted avg       0.95      0.95      0.95      4764
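To make the difference between the two averages concrete, the sketch below recomputes them for the F1 column from the rounded per-class values in the report above. Because the inputs are rounded to two decimals, the results only approximate what `classification_report` computes internally.

```python
# Per-class F1 and support, copied from the baseline report (rounded values).
f1 = {"GJOB": 0.94, "GPOL": 0.94, "GSPO": 0.99, "GVIO": 0.90}
support = {"GJOB": 573, "GPOL": 1853, "GSPO": 1410, "GVIO": 928}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1    = {macro_f1:.4f}")
print(f"weighted F1 = {weighted_f1:.4f}")
```

The weighted figure comes out slightly higher here because the two largest classes (GPOL and GSPO) pull the average toward their scores, while the weakest class (GVIO) has less influence.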



---

## Hyperparameter search

Find an optimal set of hyperparameters for a Tfidf+SGDClassifier model

In [9]:
import mlflow
from dask_ml.model_selection import RandomizedSearchCV
from logger import log_search
from scipy.stats.distributions import loguniform, randint, uniform

In [10]:
from warnings import simplefilter

simplefilter(action="ignore", category=FutureWarning)

In [13]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:37123")
client

Client: Scheduler tcp://127.0.0.1:37123, Dashboard http://127.0.0.1:8787/status
Cluster: Workers 4, Cores 4, Memory 16.62 GB


In [14]:
mlflow.set_experiment("lab-7")
sgd = make_pipeline(
    CountVectorizer(analyzer=identity), TfidfTransformer(), SGDClassifier()
)

INFO: 'lab-7' does not exist. Creating a new experiment


In [15]:
%%time

search = RandomizedSearchCV(
    sgd,
    {
        "countvectorizer__min_df": randint(1, 20),
        "countvectorizer__max_df": uniform(0.5, 0.5),
        "tfidftransformer__use_idf": [True, False],
        "sgdclassifier__alpha": loguniform(1e-6, 1e-2),
    },
    n_iter=50,
    scoring="f1_macro",
)
search.fit(train["tokens"], train["topics"])
log_search(search)

CPU times: user 10.4 s, sys: 1.46 s, total: 11.8 s
Wall time: 3min 38s
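The ranges in the search above are easy to misread: SciPy's `randint(1, 20)` excludes the upper bound, `uniform(0.5, 0.5)` takes a start and a width (so it covers 0.5 to 1.0), and `loguniform(1e-6, 1e-2)` is uniform in the exponent. The following is a standard-library sketch that imitates those ranges. The helper `sample_params` is a hypothetical name used for illustration only, and this is not how `RandomizedSearchCV` draws its candidates internally.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_params(rng):
    """Draw one candidate with ranges mirroring the search space above."""
    return {
        # randint(1, 20): integers 1..19, upper bound excluded
        "min_df": rng.randrange(1, 20),
        # uniform(0.5, 0.5): loc=0.5, scale=0.5, i.e. the interval 0.5 to 1.0
        "max_df": 0.5 + 0.5 * rng.random(),
        # a list of options is sampled uniformly
        "use_idf": rng.choice([True, False]),
        # loguniform(1e-6, 1e-2): uniform in the exponent, so each decade
        # (1e-6..1e-5, 1e-5..1e-4, ...) is equally likely
        "alpha": 10 ** rng.uniform(-6, -2),
    }


candidates = [sample_params(rng) for _ in range(50)]  # n_iter=50
for c in candidates[:3]:
    print(c)
```

Sampling `alpha` on a log scale matters because regularization strength typically has a similar effect across each order of magnitude; a plain uniform draw would almost never try values near 1e-6.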


---

## Compare optimized model to baseline

In [16]:
sgd = make_pipeline(
    CountVectorizer(analyzer=identity, min_df=5, max_df=.75),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=1e-4),
)
sgd.fit(train["tokens"], train["topics"])
predicted = sgd.predict(test["tokens"])
print(classification_report(test["topics"], predicted))

# slightly better results than the baseline

              precision    recall  f1-score   support

        GJOB       0.97      0.94      0.95       573
        GPOL       0.94      0.96      0.95      1853
        GSPO       1.00      0.99      1.00      1410
        GVIO       0.92      0.90      0.91       928

    accuracy                           0.96      4764
   macro avg       0.96      0.95      0.95      4764
weighted avg       0.96      0.96      0.96      4764



In [18]:
base_f1 = f1_score(test["topics"], base_predicted, average="macro")
sgd_f1 = f1_score(test["topics"], predicted, average="macro")

In [19]:
base_f1, sgd_f1, sgd_f1 - base_f1

(0.9438637478338561, 0.9543250517483548, 0.010461303914498732)

In [21]:
(sgd_f1 - base_f1) / (1 - base_f1)

# relative error reduction: the optimized model removes about 19% of the baseline's remaining error (1 - F1)

0.18635558147944903
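The ratio above can be read as a relative error reduction: the absolute gain in macro F1 divided by how much room the baseline left for improvement. A minimal sketch using the two scores printed earlier (the variable names `absolute_gain`, `remaining_error`, and `relative_reduction` are introduced here for illustration):

```python
# Macro F1 scores copied from the output above.
base_f1 = 0.9438637478338561
sgd_f1 = 0.9543250517483548

absolute_gain = sgd_f1 - base_f1      # improvement in F1 points
remaining_error = 1 - base_f1         # room the baseline left below a perfect score
relative_reduction = absolute_gain / remaining_error

print(f"absolute gain:            {absolute_gain:.4f}")
print(f"baseline remaining error: {remaining_error:.4f}")
print(f"relative error reduction: {relative_reduction:.1%}")
```

Reporting both numbers is useful when scores are already high: a gain of one F1 point looks small in absolute terms but is a sizeable share of the error that was left.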

In [22]:
from scipy.stats import binom_test, wilcoxon

In [24]:
# compare the two classifiers document by document
# +1: only the optimized model is correct, -1: only the baseline is correct, 0: both correct or both wrong
diff = (predicted == test["topics"]).astype(int) - (
    base_predicted == test["topics"]
).astype(int)
sum(diff == 1), sum(diff == -1), sum(diff == 0)

(89, 41, 4634)
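The three counts above are easier to interpret with a small worked case. The sketch below applies the same subtraction to six made-up documents (the labels in `gold`, `baseline_pred`, and `optimized_pred` are toy data, not taken from the lab's test set).

```python
# Toy data for illustration only.
gold           = ["GPOL", "GSPO", "GVIO", "GJOB", "GPOL", "GSPO"]
baseline_pred  = ["GPOL", "GSPO", "GPOL", "GPOL", "GVIO", "GSPO"]
optimized_pred = ["GPOL", "GSPO", "GVIO", "GJOB", "GJOB", "GPOL"]

# 1 if the optimized model is right, minus 1 if the baseline is right.
diff = [
    int(o == g) - int(b == g)
    for g, b, o in zip(gold, baseline_pred, optimized_pred)
]

wins = sum(d == 1 for d in diff)     # only the optimized model is correct
losses = sum(d == -1 for d in diff)  # only the baseline is correct
ties = sum(d == 0 for d in diff)     # both correct or both wrong

print(diff, wins, losses, ties)
```

Note that a tie does not mean the two models agree: the fifth toy document is a tie because both models are wrong, even though they predict different labels.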

In [31]:
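# binomial (sign) test on the documents where exactly one model is correct
# null hypothesis: each model is equally likely (50/50) to be the correct one
# (binom_test was removed in SciPy 1.12; newer versions provide binomtest(...).pvalue instead)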
binom_test([sum(diff == 1), sum(diff == -1)], alternative="greater")

# the p-value is far below 0.05, so we reject the null hypothesis:
# the optimized model is the correct one significantly more often than chance would predict

1.5460961941914623e-05
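The one-sided p-value above answers a simple question: if each of the 130 decisive documents were a fair coin flip between the two models, how likely would 89 or more wins for the optimized model be? As a sketch, that tail probability can be computed directly with the standard library, assuming the counts 89 and 41 from the earlier cell:

```python
from math import comb

wins, losses = 89, 41   # counts of diff == 1 and diff == -1 from above
n = wins + losses       # documents where exactly one model is correct

# P(X >= wins) for X ~ Binomial(n, 0.5): sum the upper tail exactly.
p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2**n

print(f"n = {n}, one-sided p-value = {p_value:.3e}")
```

The 4634 ties play no role here: documents on which both models are right or both are wrong carry no information about which model is better.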

In [32]:
# Wilcoxon signed-rank test on the per-document differences (zero differences are dropped by default)
wilcoxon(diff, alternative="greater")

# significant: the nonzero differences lean positive, i.e. in favor of the optimized model

WilcoxonResult(statistic=5829.5, pvalue=1.2775403266405453e-05)

**TO DO:** Summarize your results: how much better is the optimized model? Is it significantly better than the baseline?

The optimized model is better than the baseline, though the margin is small. Accuracy rises from 0.95 to 0.96, and macro-averaged F1 rises from 0.944 to 0.954, a gain of about 0.010. Relative to the baseline's remaining error (1 - F1), that is a reduction of roughly 19%. On the test set there are 130 documents where exactly one of the two models is correct: the optimized model is the correct one on 89 of them and the baseline on 41. Both the binomial test (p ≈ 1.5e-05) and the Wilcoxon signed-rank test (p ≈ 1.3e-05) give very small p-values, so the improvement is statistically significant even though the absolute difference in scores is modest.

---

## Save model

In [33]:
import cloudpickle

In [34]:
# feed in raw text instead of pre-tokenized text by passing tokenizer=tokenize to CountVectorizer
# the results are nearly identical to the previous model (small differences come from SGD's randomness)
sgd = make_pipeline(
    CountVectorizer(preprocessor=identity, tokenizer=tokenize, min_df=5, max_df=0.75),
    TfidfTransformer(use_idf=True),
    SGDClassifier(alpha=1e-4),
)
sgd.fit(train["text"], train["topics"])
predicted = sgd.predict(test["text"])
print(classification_report(test["topics"], predicted))

              precision    recall  f1-score   support

        GJOB       0.97      0.94      0.95       573
        GPOL       0.94      0.97      0.95      1853
        GSPO       1.00      0.99      1.00      1410
        GVIO       0.93      0.90      0.91       928

    accuracy                           0.96      4764
   macro avg       0.96      0.95      0.95      4764
weighted avg       0.96      0.96      0.96      4764



In [38]:
# use a context manager so the file is flushed and closed after writing
with open("sgd.model", "wb") as f:
    cloudpickle.dump(sgd, f)