This notebook documents the model fit during the first phase of *On the Books: Jim Crow and Algorithms of Resistance*, as of August 2020.

## Packages

In [1]:
import os
import re

import pandas as pd
import numpy as np
import scipy.sparse

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

from xgboost import XGBClassifier

## Data Preparation


In [2]:
train_df = pd.read_csv("../training_set/training_set_v0_clean.csv")

We performed simple preprocessing on the text:
* Replaced hyphenated and line-broken words with unbroken words.
* Removed section numbering from the law text ("section_text").
* Removed all non-ASCII characters (most of these were OCR errors).
* Converted all words to lower case.
* Removed stopwords based on `nltk`'s default list.
* Removed any words occurring in fewer than 2 or more than 1000 documents.
* Used the session ("sess") or, where that was missing, the volume identifier ("csv") to extract a numeric year. In the case of multi-year volumes (e.g. 1956-1957), the earlier year was used.
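
As a rough illustration of the text-cleaning steps listed above, the following sketch applies the same regular expressions used in this notebook to a few toy strings. The sample strings, the helper names (`repair_hyphens`, `clean`, `first_year`) and the identifier passed to `first_year` are made up for illustration and are not taken from the corpus; stopword removal is omitted because it depends on `nltk`.

```python
import re

# Toy strings, invented for illustration (not from the training set).
samples = [
    "segre- gation of the races",
    "rail-| road companies",
    "educa-\ntion of children",
    "Public Schools\u00e9",
]

def repair_hyphens(text):
    # Join words split by a hyphen followed by spaces or "|" marks.
    text = re.sub(r"-[ \|]+(?P<letter>[a-zA-Z])", lambda m: m.group("letter"), text)
    # Join words split by a hyphen at a line break; the break moves after the word.
    return re.sub(r"- *\n+(\w+ *)", r"\1\n", text)

def clean(text):
    text = repair_hyphens(text)
    text = re.sub(r"[^\x00-\x7F]", "", text)  # drop non-ASCII characters
    return text.lower()

def first_year(identifier):
    # The first four-digit run is the earlier year of a multi-year volume.
    match = re.search(r"\d{4}", identifier)
    return float(match.group()) if match else None

cleaned = [clean(s) for s in samples]
year = first_year("1956-1957 session laws")  # hypothetical identifier format

print(cleaned)
print(year)
```

Note that the line-break repair keeps the newline but moves it after the rejoined word, so `"educa-\ntion of"` becomes `"education \nof"`.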

We then converted the text into a document-term matrix, augmented with year and law type variables.

In [3]:
repl = lambda m: m.group("letter")

#Fix hyphenated words
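# regex=True is required for a callable replacement in recent pandas versions.
train_df["text"] = train_df.text.str.replace(r"-[ \|]+(?P<letter>[a-zA-Z])", repl, regex=True).astype("str")
train_df["section_text"] = train_df.section_text.str.replace(r"-[ \|]+(?P<letter>[a-zA-Z])", repl, regex=True).astype("str")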
train_df["section_text"] = [re.sub(r'- *\n+(\w+ *)', r'\1\n',r) for r in train_df["section_text"]]

#Remove section titles (e.g. "Sec. 1") from law text.
train_df["start"] = train_df.section_raw.str.len().fillna(0).astype("int")
train_df["section_text"] = train_df.apply(lambda x: x['section_text'][(x["start"]):], axis=1).str.strip()

#Remove all non-ASCII characters
train_df["section_text"] = train_df["section_text"].str.replace(r"[^\x00-\x7F]", "", regex=True)

law_list = [word_tokenize(r.lower()) for r in train_df.section_text]
stop_words = stopwords.words('english')
law_list = [[word for word in law if word not in stop_words] for law in law_list]

#Extract a numeric year variable
train_df["year"] = train_df.sess.str.slice(start = 0, stop = 4).astype("float")
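#Fall back to the first four-digit year in the volume identifier when the session is missing.
train_df.loc[train_df.sess.isna(),"year"] = train_df.csv.str.extract(r"(\d{4})", expand=False).astype("float")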

def dummy(doc):
    return doc

#Remove terms appearing in fewer than 2 or more than 1000 documents, then convert to document-term matrix.
vect = CountVectorizer(tokenizer=dummy,preprocessor=dummy, decode_error = "ignore",
                      min_df = 2, max_df = 1000)
dtm = vect.fit_transform(law_list)

#Add year and law type variables.
extra_df = train_df.loc[:,["year","type"]].copy()
extra_df = pd.get_dummies(extra_df, columns = ["type"], prefix = ["type"])
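#Cast to float so the year and dummy columns form a single numeric array before stacking.
X = scipy.sparse.hstack((dtm, extra_df.values.astype("float")))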
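
To make the document-frequency filter and the extra columns concrete, here is a minimal pure-Python sketch of the same idea on toy data. It is not `CountVectorizer`'s implementation; the function name `build_dtm`, the toy tokens, the years and the law type values are all hypothetical.

```python
from collections import Counter

# Toy pre-tokenized documents (hypothetical).
docs = [
    ["white", "school", "tax"],
    ["colored", "school"],
    ["tax", "road"],
    ["school", "tax", "tax"],
]
years = [1901.0, 1901.0, 1923.0, 1956.0]          # hypothetical years
types = ["public", "private", "public", "public"]  # hypothetical law types

def build_dtm(docs, min_df=2, max_df=1000):
    # Document frequency: number of documents containing a term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, n in doc_freq.items() if min_df <= n <= max_df)
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term in doc:
            if term in index:
                row[index[term]] += 1  # raw term count within the document
        rows.append(row)
    return vocab, rows

vocab, dtm = build_dtm(docs)

# Append the year and one indicator column per law type to each row.
type_levels = sorted(set(types))
X = [
    row + [year] + [1 if t == level else 0 for level in type_levels]
    for row, year, t in zip(dtm, years, types)
]

print(vocab)
print(X[0])
```

Here only `school` and `tax` survive the filter, because every other term appears in a single document.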

## Model Details

The `fit_params` below were selected using an 80-20 training-test split, followed by 10-fold cross validation on the training set. We will include a basic template of our model selection process later this year.
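
Until that template is available, the shape of an 80-20 split followed by 10-fold cross validation can be sketched in plain Python. This is only an illustration of the index bookkeeping, not the project's model selection code: `split_indices`, the seed, and the sample size of 1000 are assumptions, and no stratification is applied.

```python
import random

def split_indices(n, test_frac=0.2, n_folds=10, seed=0):
    # Shuffle row indices, hold out a test set, then cut the rest into folds.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(round(n * test_frac))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    folds = [train_idx[i::n_folds] for i in range(n_folds)]
    return train_idx, test_idx, folds

train_idx, test_idx, folds = split_indices(1000)

fold_sizes = []
for val_idx in folds:
    held_out = set(val_idx)
    fit_idx = [i for i in train_idx if i not in held_out]
    # A candidate parameter setting would be fit on fit_idx and scored on val_idx here.
    fold_sizes.append((len(fit_idx), len(val_idx)))

print(len(train_idx), len(test_idx), fold_sizes[0])
```

The test indices are never touched inside the loop, so they remain available for a final check of whichever parameter setting wins the cross validation.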

In [4]:
fit_params =  {'colsample_bytree': 0.3, 'gamma': 0.3, 'learning_rate': 0.3, 
               'max_depth': 20, 'min_child_weight': 1, 'n_estimators': 50, 
               'scale_pos_weight': 5}
all_mod = XGBClassifier(**fit_params)
all_modfit = all_mod.fit(X, train_df.assessment)

The XGBoost classifier outperformed the other models we considered. Read more about XGBoost in [the original paper](https://arxiv.org/abs/1603.02754).

After fitting, we used probability calibration to adjust the model probabilities to better reflect the training set.

In [5]:
calibrated_mod = CalibratedClassifierCV(all_modfit, cv=10, method="isotonic")
calibrated_modfit = calibrated_mod.fit(X, train_df.assessment)


train_df["base_labels"] = all_modfit.predict(X)
train_df["base_probs"] = all_modfit.predict_proba(X)[:,1]
train_df["calibrated_probs"] = calibrated_modfit.predict_proba(X)[:,1]
train_df["calibrated_labels"] = (train_df.calibrated_probs > 0.9).astype("int")
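
For intuition about the `method="isotonic"` option used above: isotonic calibration maps raw scores to probabilities through a non-decreasing step function fit to the observed labels. The sketch below implements the pool-adjacent-violators idea in plain Python on toy data. It is not scikit-learn's implementation, it does not treat tied scores specially, and the function name `isotonic_fit` and the toy scores and labels are assumptions.

```python
def isotonic_fit(scores, labels):
    """Return scores sorted ascending and their non-decreasing fitted values."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    # Each block stores [sum of labels, count]; its fitted value is the mean.
    blocks = []
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return xs, fitted

# Toy raw scores and 0/1 labels (hypothetical).
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 1, 0, 0, 1, 1]

xs, fitted = isotonic_fit(scores, labels)
print(list(zip(xs, fitted)))
```

In this toy case the scores 0.3, 0.4 and 0.6 are pooled into one block with a calibrated value of 1/3, since only one of those three examples is positive.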

We reported any law with a calibrated probability over 90% as a Jim Crow law with a source of "model", unless it was also later confirmed by an expert, in which case it was labeled "model and expert". We chose to be conservative at this stage to minimize false positives, and because the project will continue over the coming year, allowing us more time to fine-tune the modeling process.
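
The reporting rule described above can be written as a small function. This is a sketch only: `label_source` and its `expert_confirmed` argument are hypothetical names, and returning `None` for laws at or below the threshold is an assumption, since this notebook does not describe how those laws were recorded.

```python
def label_source(calibrated_prob, expert_confirmed, threshold=0.9):
    # Strictly greater than the threshold, matching `calibrated_probs > 0.9` above.
    if calibrated_prob > threshold:
        return "model and expert" if expert_confirmed else "model"
    # Assumption: laws not flagged by the model get no model-based source here.
    return None

examples = [(0.95, False), (0.95, True), (0.90, True), (0.40, False)]
sources = [label_source(p, confirmed) for p, confirmed in examples]
print(sources)
```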