In [1]:
%load_ext lab_black

In [2]:
import requests
import json
import os
import pathlib
import spacy

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.display import clear_output
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import (
    RandomForestClassifier,
    ExtraTreesClassifier,
    StackingClassifier,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MaxAbsScaler

# Preprocessing

In [3]:
corpus = pd.read_csv("../data/corpus.csv").sample(frac=1).reset_index(drop=True) # shuffle data 

target = "satire"
features = ["title", "article_text"]

train_corpus, test_corpus = train_test_split(
    corpus, random_state=43, stratify=corpus[target]
)

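# majority-class count: the larger of the satire and non-satire totals
train_satire = train_corpus[target].sum()
test_satire = test_corpus[target].sum()
train_dummy = max(train_satire, len(train_corpus) - train_satire)
test_dummy = max(test_satire, len(test_corpus) - test_satire)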

print(f"Train set baseline: {train_dummy / len(train_corpus)}")
print(f"Test set baseline: {test_dummy / len(test_corpus)}")

del corpus

Train set baseline: 0.5
Test set baseline: 0.5


First, the data is loaded, split into training and test sets, and the baseline performance is calculated. Both sets contain an equal balance of the satire and non-satire categories, so the baseline is 50.0% accuracy.
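
As a minimal sketch of what "baseline" means here (the accuracy of always predicting the most frequent class), the calculation can be written without pandas. The function name `majority_baseline` and the toy labels are hypothetical and are not part of the notebook.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# hypothetical, perfectly balanced labels (1 = satire, 0 = not satire)
toy_labels = [1, 0, 1, 0, 1, 0]
print(majority_baseline(toy_labels))  # 0.5
```

Any model worth keeping has to beat this number on both the training and test sets.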

The following cell declares a few functions to be used in preprocessing. The first loops through each token in each article and converts it to a lemma, returning the lemmatized article with punctuation, whitespace, stop words, and out-of-vocabulary tokens removed. The second calculates the "imbalance" scores defined in the previous notebook (specifically, their absolute values) and returns a dictionary for threshold filtering. The third accepts the corpus, the lemma imbalance scores, and a threshold, and returns the corpus with every lemma whose score exceeds the threshold removed.

In [4]:
nlp = spacy.load("en_core_web_lg")


def lemmatize(corpus, nlp):
    lemmatized_lists = []
    for article in corpus:
        doc = nlp(article)

        lemmatized = [
            str(token.lemma_).lower()
            for token in doc
            if not token.is_stop
            and not token.is_punct
            and not token.is_oov
            and not token.is_space
        ]

        lemmatized_lists.append(lemmatized)

    return [" ".join(lemma_list) for lemma_list in lemmatized_lists]


def lemma_scores(lemmatized, y):
    lemma_counts = {}
    y = np.array(y)
    for n, lemmas in enumerate(lemmatized):
        for lemma in lemmas.split(" "):
            if lemma not in lemma_counts:
                lemma_counts[lemma] = [0, 0]
            lemma_counts[lemma][y[n]] += 1

    lemma_scores = {
        lemma: np.log(value[0] + value[1])
        * np.abs(value[0] - value[1])
        / (value[0] + value[1])
        for lemma, value in lemma_counts.items()
    }

    return lemma_scores


def score_filtering(lemmatized_corpus, lemma_scores, threshold):
    filtered_corpus = [
        " ".join(
            [
                lemma
                for lemma in lemmas.split(" ")
                if lemma_scores.get(lemma, 0) <= threshold
            ]
        )
        for lemmas in lemmatized_corpus
    ]

    return filtered_corpus

The data is lemmatized and "scored" exactly once before all the grid search fitting, as this is not a process that needs to be repeated. 
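
To make the scoring concrete, here is a small worked example on a toy corpus. It is a sketch that re-implements the same formula with the standard library only (`math.log` in place of `np.log`); the names `toy_lemma_scores` and `toy_score_filtering`, the documents, and the 0.5 threshold are all made up for illustration.

```python
import math


def toy_lemma_scores(lemmatized, y):
    """log(total count) * |count in class 0 - count in class 1| / total count."""
    counts = {}
    for text, label in zip(lemmatized, y):
        for lemma in text.split(" "):
            counts.setdefault(lemma, [0, 0])[label] += 1
    return {
        lemma: math.log(a + b) * abs(a - b) / (a + b)
        for lemma, (a, b) in counts.items()
    }


def toy_score_filtering(lemmatized, scores, threshold):
    """Keep only lemmas whose score is at or below the threshold."""
    return [
        " ".join(l for l in text.split(" ") if scores.get(l, 0) <= threshold)
        for text in lemmatized
    ]


# hypothetical lemmatized documents; label 1 = satire, 0 = not satire
docs = [
    "senator vote bill",
    "senator reportedly vote",
    "reportedly area man reportedly",
]
labels = [0, 0, 1]

scores = toy_lemma_scores(docs, labels)
filtered = toy_score_filtering(docs, scores, threshold=0.5)
print(filtered)
```

Here "senator" and "vote" appear twice, both times in non-satire, so each scores log(2) ≈ 0.69 and is removed at a threshold of 0.5. A lemma that appears only once always scores 0, because log(1) = 0, so very rare lemmas are never filtered out no matter how one-sided they are.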

In [5]:
train_corpus["title"] = lemmatize(train_corpus["title"], nlp)
train_corpus["article_text"] = lemmatize(train_corpus["article_text"], nlp)

title_lemma_scores = lemma_scores(train_corpus["title"], train_corpus["satire"])
article_lemma_scores = lemma_scores(
    train_corpus["article_text"], train_corpus["satire"]
)

test_corpus["title"] = lemmatize(test_corpus["title"], nlp)
test_corpus["article_text"] = lemmatize(test_corpus["article_text"], nlp)

# The Model

Below, I construct an instance of StackingClassifier with two base estimators that feed into an output estimator. One base estimator receives the title, and the other receives the article text. Designing the model this way grants a few advantages:

1. It allows me to easily interpret how important the titles are compared to the article text.
2. It keeps the titles and articles separate, so that if one is considerably easier to classify than the other, that information is presented in isolation.
3. When score thresholds are applied, words aren't removed from the title (what precious few there are) simply because they are common in the article body.

# Grid Search - Wide

The following cell declares all sets of parameters I intend to search over to find the ideal model. In this section, few arguments are modified; the goal is to find the best classifier and vectorizer rather than the best *version* of any classifier or vectorizer. A few preliminary tests were run prior to this, and in all cases unigrams alone led to better fits than unigrams and bigrams together, so only unigrams will be used in the vectorizers. Similarly, ensemble models always outperformed logistic regression and naive Bayes, so the base models will only try random forests and extra trees. Finally, to limit variance and keep the model interpretable, the output model that the forests feed into will *not* be an ensemble; only logistic regression and naive Bayes are tried there.

In [6]:
# vectorizers
cv = CountVectorizer()
cv_params = {
    "<vectorizer>": [cv],
}

tfidf = TfidfVectorizer()
tfidf_params = {
    "<vectorizer>": [tfidf],
}

# classifier models
logr = LogisticRegression()
logr_params = {
    "<model>": [logr],
    "<model>__C": np.geomspace(1e-2, 1e2, 5),
    "<model>__max_iter": [10000],
}

nb = BernoulliNB()
nb_params = {
    "<model>": [nb],
    "<model>__alpha": [1e-5, 1e-1, 1.0],
}

rfc = RandomForestClassifier()
rfc_params = {
    "<model>": [rfc],
    "<model>__n_estimators": [100],
    "<model>__max_depth": [100],
}

etc = ExtraTreesClassifier()
etc_params = {
    "<model>": [etc],
    "<model>__n_estimators": [100],
    "<model>__max_depth": [100],
}


# the following loop creates a list of all parameter combinations to try
# <model> and <vectorizer> will be replaced with the specific name for that part of the pipeline
params = []
vectorizers = [cv_params, tfidf_params]
base_models = [rfc_params, etc_params]
output_models = [logr_params, nb_params]
for title_vectorizer in vectorizers:
    title_vectorizer = {
        k.replace("<vectorizer>", "title_pipe__vectorizer__vectorizer"): v
        for k, v in title_vectorizer.items()
    }
    for article_vectorizer in vectorizers:
        article_vectorizer = {
            k.replace("<vectorizer>", "article_pipe__vectorizer__vectorizer"): v
            for k, v in article_vectorizer.items()
        }
        for title_model in base_models:
            title_model = {
                k.replace("<model>", "title_pipe__model"): v
                for k, v in title_model.items()
            }
            for article_model in base_models:
                article_model = {
                    k.replace("<model>", "article_pipe__model"): v
                    for k, v in article_model.items()
                }
                for output_model in output_models:
                    output_model = {
                        k.replace("<model>", "final_estimator"): v
                        for k, v in output_model.items()
                    }
                    params.append(
                        title_vectorizer
                        | article_vectorizer
                        | title_model
                        | article_model
                        | output_model
                    )
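
The nested loops above could also be expressed with `itertools.product`. The sketch below is one possible alternative, not the code that produced the results in this notebook; it uses plain strings as hypothetical stand-ins for the estimator objects so that it runs without scikit-learn, and the helper name `expand` is made up.

```python
from itertools import product


def expand(template, placeholder, prefix):
    """Replace a placeholder in every key of a parameter template."""
    return {k.replace(placeholder, prefix): v for k, v in template.items()}


# hypothetical stand-ins: strings where the notebook uses estimator objects
vectorizers = [{"<vectorizer>": ["cv"]}, {"<vectorizer>": ["tfidf"]}]
base_models = [
    {"<model>": ["rfc"], "<model>__max_depth": [100]},
    {"<model>": ["etc"], "<model>__max_depth": [100]},
]
output_models = [
    {"<model>": ["logr"], "<model>__C": [0.1, 1.0]},
    {"<model>": ["nb"], "<model>__alpha": [0.1, 1.0]},
]

# one entry per slot in the stacking model: (options, placeholder, real prefix)
slots = [
    (vectorizers, "<vectorizer>", "title_pipe__vectorizer__vectorizer"),
    (vectorizers, "<vectorizer>", "article_pipe__vectorizer__vectorizer"),
    (base_models, "<model>", "title_pipe__model"),
    (base_models, "<model>", "article_pipe__model"),
    (output_models, "<model>", "final_estimator"),
]

toy_params = []
for combo in product(*(options for options, _, _ in slots)):
    grid = {}
    for template, (_, placeholder, prefix) in zip(combo, slots):
        grid.update(expand(template, placeholder, prefix))
    toy_params.append(grid)

print(len(toy_params))  # 32 = 2 * 2 * 2 * 2 * 2
```

Each element of the resulting list is an independent grid, which is the list-of-dictionaries form that `GridSearchCV` accepts for `param_grid`. `product` varies the last slot fastest, so the grids come out in the same order as the nested loops.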

Below, the stacking classifier is declared. Two different pipelines are declared first, which will handle the vectorization and fitting of the titles and article bodies, respectively. These two estimators are then fed into the final estimator which makes the final prediction.

In [7]:
title_pipe = Pipeline(
    [
        (
            "vectorizer",
            ColumnTransformer([("vectorizer", cv, "title")]),
        ),  # the estimators are declared with initial vectorizers and models
        ("model", logr),  # but these will be replaced by the param dictionaries
    ]
)

article_pipe = Pipeline(
    [
        ("vectorizer", ColumnTransformer([("vectorizer", cv, "article_text")])),
        ("model", logr),
    ]
)

estimators = [("title_pipe", title_pipe), ("article_pipe", article_pipe)]
final_estimator = logr

model = StackingClassifier(estimators=estimators, final_estimator=final_estimator)

Next, for each integer threshold from one to ten, the titles and articles are filtered separately to only allow balanced-enough lemmas as defined by the threshold. A grid search is run for all parameter combinations for all of these thresholds to get an idea of how bias and variance behave under different thresholds and what vectorizers and classifiers are most effective for each piece in the model.

In [8]:
THRESHOLDS = np.linspace(1.0, 10.0, 10)

X_test = test_corpus[["article_text", "title"]].copy()
y_test = test_corpus["satire"].copy()

for threshold in THRESHOLDS:
    X = train_corpus[
        ["article_text", "title"]
    ].copy()  # X and y are declared directly from the full train data
    y = train_corpus["satire"].copy()  # so that previously removed lemmas are present

    X["title"] = score_filtering(
        X["title"], title_lemma_scores, threshold
    )  # titles and articles are filtered
    X["article_text"] = score_filtering(
        X["article_text"], article_lemma_scores, threshold
    )

    print(f"Threshold: {threshold}")

    gs = GridSearchCV(model, params, n_jobs=-1)
    gs.fit(X, y)
    print(f"Cross-Validation Accuracy: {gs.best_score_}")
    print(f"Test Set Accuracy: {gs.score(X_test, y_test)}")
    print(gs.best_params_)
    print()

Threshold: 1.0
Cross-Validation Accuracy: 0.7263080526935324
Test Set Accuracy: 0.7795809367296631
{'article_pipe__model': RandomForestClassifier(max_depth=100), 'article_pipe__model__max_depth': 100, 'article_pipe__model__n_estimators': 100, 'article_pipe__vectorizer__vectorizer': CountVectorizer(), 'final_estimator': LogisticRegression(C=10.0, max_iter=10000), 'final_estimator__C': 10.0, 'final_estimator__max_iter': 10000, 'title_pipe__model': RandomForestClassifier(max_depth=100), 'title_pipe__model__max_depth': 100, 'title_pipe__model__n_estimators': 100, 'title_pipe__vectorizer__vectorizer': TfidfVectorizer()}

Threshold: 2.0
Cross-Validation Accuracy: 0.8410024245778093
Test Set Accuracy: 0.8568200493015612
{'article_pipe__model': RandomForestClassifier(max_depth=100), 'article_pipe__model__max_depth': 100, 'article_pipe__model__n_estimators': 100, 'article_pipe__vectorizer__vectorizer': CountVectorizer(), 'final_estimator': LogisticRegression(C=10.0, max_iter=10000), 'final_esti

The above outputs make a few things clear:

1. Random forests are clearly superior to extra trees on this data.
2. Logistic regression is superior to naive Bayes as the output model on this data.
3. TfidfVectorizer performs better on the titles, and CountVectorizer performs better on the articles.

Other parameters are less clear and need closer inspection in the fine-tuned search that follows. As for the imbalance thresholds, higher thresholds significantly decreased bias but did not significantly increase variance, as I had expected they would. This is likely due to a combination of two facts: the corpus is quite large to begin with, so the risk of overfitting was already small, and integer thresholds are quite large jumps. I suspect that some interesting behavior occurs between thresholds of 3 and 5, but this will not be explored in this analysis due to time constraints. Because accuracy does not significantly increase above a threshold of 5, this is the value that will be used in the final model.
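
One way to see how coarse integer thresholds are is to work backwards from the score formula in `lemma_scores`, log(n) · |a − b| / n, where n = a + b is a lemma's total count. For a lemma that appears in only one class, the score reduces to log(n), so it is removed only when n > e^threshold. The sketch below tabulates that cutoff; the function name `min_count_to_exceed` is hypothetical, and for partially imbalanced lemmas (`imbalance` below 1) the result is only an approximation, since not every count is compatible with a given ratio.

```python
import math


def min_count_to_exceed(threshold, imbalance=1.0):
    """Smallest total count n for which log(n) * imbalance > threshold.

    imbalance is |a - b| / (a + b); 1.0 means the lemma appears in one class only.
    """
    return math.floor(math.exp(threshold / imbalance)) + 1


for t in (1.0, 3.0, 5.0, 10.0):
    print(t, min_count_to_exceed(t))
```

Under these assumptions, a threshold of 1 removes fully one-sided lemmas seen at least 3 times, a threshold of 3 requires 21 occurrences, and a threshold of 5 requires 149. Each integer step multiplies the cutoff by roughly e, which is consistent with the large changes in accuracy between neighboring thresholds.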

The two cells below apply these findings and explore more of the parameters that each piece of the model accepts. The search is still not especially exhaustive, but the model is already quite accurate.

In [9]:
# vectorizers
cv = CountVectorizer()
cv_params = {"<vectorizer>": [cv], "<vectorizer>__max_features": [None, 10000]}

tfidf = TfidfVectorizer()
tfidf_params = {"<vectorizer>": [tfidf], "<vectorizer>__max_features": [None, 1000]}

# classifier models
logr = LogisticRegression()
logr_params = {
    "<model>": [logr],
    "<model>__C": np.geomspace(1e-2, 1, 9),
    "<model>__max_iter": [10000],
}

rfc = RandomForestClassifier()
rfc_params = {
    "<model>": [rfc],
    "<model>__n_estimators": [100],
    "<model>__max_depth": [None, 100],
}

params = []
vectorizers = [cv_params, tfidf_params]
base_models = [rfc_params]
output_models = [logr_params]
for title_vectorizer in [tfidf_params]:
    title_vectorizer = {
        k.replace("<vectorizer>", "title_pipe__vectorizer__vectorizer"): v
        for k, v in title_vectorizer.items()
    }
    for article_vectorizer in [cv_params]:
        article_vectorizer = {
            k.replace("<vectorizer>", "article_pipe__vectorizer__vectorizer"): v
            for k, v in article_vectorizer.items()
        }
        for title_model in base_models:
            title_model = {
                k.replace("<model>", "title_pipe__model"): v
                for k, v in title_model.items()
            }
            for article_model in base_models:
                article_model = {
                    k.replace("<model>", "article_pipe__model"): v
                    for k, v in article_model.items()
                }
                for output_model in output_models:
                    output_model = {
                        k.replace("<model>", "final_estimator"): v
                        for k, v in output_model.items()
                    }
                    params.append(
                        title_vectorizer
                        | article_vectorizer
                        | title_model
                        | article_model
                        | output_model
                    )

In [10]:
X_test = test_corpus[["article_text", "title"]].copy()
y_test = test_corpus["satire"].copy()

threshold = 5.0

X = train_corpus[["article_text", "title"]].copy()
y = train_corpus["satire"].copy()

X["title"] = score_filtering(X["title"], title_lemma_scores, threshold)
X["article_text"] = score_filtering(X["article_text"], article_lemma_scores, threshold)

gs = GridSearchCV(model, params, n_jobs=-1, verbose=1)
gs.fit(X, y)
print(f"Cross-Validation Accuracy: {gs.best_score_}")
print(f"Test Set Accuracy: {gs.score(X_test, y_test)}")
print(gs.best_params_)
print()

Fitting 5 folds for each of 144 candidates, totalling 720 fits
Cross-Validation Accuracy: 0.9528893745339605
Test Set Accuracy: 0.952341824157765
{'article_pipe__model': RandomForestClassifier(), 'article_pipe__model__max_depth': None, 'article_pipe__model__n_estimators': 100, 'article_pipe__vectorizer__vectorizer': CountVectorizer(max_features=10000), 'article_pipe__vectorizer__vectorizer__max_features': 10000, 'final_estimator': LogisticRegression(C=0.31622776601683794, max_iter=10000), 'final_estimator__C': 0.31622776601683794, 'final_estimator__max_iter': 10000, 'title_pipe__model': RandomForestClassifier(), 'title_pipe__model__max_depth': None, 'title_pipe__model__n_estimators': 100, 'title_pipe__vectorizer__vectorizer': TfidfVectorizer(), 'title_pipe__vectorizer__vectorizer__max_features': None}



The results are in: the final model will use a max_features count of 10,000 on the article vectorizer and a regularization parameter of C = 0.316 for the logistic regression.

(NOTE: This notebook was rerun, and because I forgot to set the random state in the pandas .sample() method, these results changed slightly. Initial runs gave C = 1.0, and that is the value used in the conclusions notebook.)