<a href="https://colab.research.google.com/github/gilbertslade/claudette_casetext/blob/main/casetext_claudette.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Let's take a dive into the Claudette TOS dataset!**</br>
In their paper CLAUDETTE: an Automated Detector of Potentially Unfair Clauses in Online Terms of Service, the authors compare a set of classifiers trained on a professionally labeled dataset of individual sentences extracted from online terms of service documents. Notably, they find that conventional classifiers (specifically Support Vector Machines) and encodings (unigram and bigram Bag of Words) outperform several deep learning models. They do not, however, evaluate any of the pretrained contextual language models, such as BERT and ELMo, that have taken the NLP world by storm.</br>
Below, I sketch out an approach to evaluating BERT embeddings and general methods for refining and comparing classifiers on natural language data.</br>
First, let's set up our environment. The code below was written and run as a Jupyter notebook on Google Colab with GPU access enabled. We start by installing Hugging Face Transformers, a streamlined interface to transformer models and GPU-based NLP tools.

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)

Now we need to load the dataset from a Google Sheet. To run the code below yourself, save a copy of the Claudette dataset to a Google Drive you have access to, accede to the (unfair?!) Google terms of service for their SDK, and replace the URL with your own.

In [2]:
import pandas as pd
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())

In [3]:
gs_tos = gc.open_by_url('https://docs.google.com/spreadsheets/d/17cnlzyK8kZjaRhTkfgtYM1IXdw03tZxqRzFbx_8agoc/edit#gid=2143700408')
tos_data = get_as_dataframe(gs_tos.sheet1)
tos_data = tos_data.dropna(how='all', axis='columns')
tos_data = tos_data.dropna(how='all', axis='rows')

Now we'll download a pretrained BERT tokenizer and model and initialize them on the GPU (if one is available).

In [4]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased').to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [5]:
print(device)
model.eval()

cuda


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

Because of the size of the dataset and limitations on GPU usage, I'm going to stick to classical ML methods for classification. I still want to use BERT embeddings, so luckily we've got the CLS token! This is a special token that BERT prepends to every input sequence, and its final representation is commonly used to stand in for the sentence as a whole, in context.

In [6]:
import numpy as np
import time

batch_size = 12
max_seq_length = 512

def featurize(tokenizer, model, input_sents, batch_size, device):
  """Run the sentences through the model in batches and collect one value per sentence."""
  sent_list = [s.lower() for s in input_sents]
  features = []

  # Step through the sentences batch by batch; the last batch may be smaller.
  for start_ix in range(0, len(sent_list), batch_size):
    encoding = tokenizer(sent_list[start_ix:start_ix + batch_size], max_length=max_seq_length,
                         truncation=True, padding='max_length', return_tensors='pt',
                         add_special_tokens=True).to(device)
    with torch.no_grad():  # inference only, so skip gradient tracking
      outputs = model(**encoding)
    outputs = outputs[0].cpu().numpy()
    # Keep column 0 of the model's first output, one value per sentence.
    # Note that for BertForSequenceClassification the first output holds the
    # classification logits, so this is the first logit rather than the
    # 768-d [CLS] hidden state.
    features.append(outputs[:, 0])

  return np.concatenate(features) if features else np.array([])

start = time.time()

cls_tokens = featurize(tokenizer, model, tos_data.sentence1, batch_size, device)

stop = time.time()

print('CLS token generation took {} seconds'.format(stop - start))

CLS token generation took 327.43698287010193 seconds
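
For reference, the `[CLS]` representation is usually read from the encoder's hidden states, which have shape `(batch, sequence, hidden)`. One caveat worth checking: a classification wrapper such as `BertForSequenceClassification` returns logits of shape `(batch, num_labels)` as its first output, so the `[:, 0]` slice used above yields one number per sentence there rather than a 768-dimensional vector. The sketch below uses random NumPy arrays as stand-ins for real model outputs (the array names and sizes are illustrative assumptions) to show the difference in shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, num_labels = 4, 16, 768, 2

# Stand-in for an encoder's last hidden state: one vector per token.
last_hidden_state = rng.normal(size=(batch, seq_len, hidden))
# Stand-in for a classification head's logits: one score per label.
logits = rng.normal(size=(batch, num_labels))

# Position 0 of every sequence is the [CLS] token.
cls_embeddings = last_hidden_state[:, 0, :]
# The same slice on logits keeps only the first label's score.
first_logit = logits[:, 0]

print(cls_embeddings.shape)  # (4, 768)
print(first_logit.shape)     # (4,)
```

With the Hugging Face library, the hidden states are available from `BertModel` (as `last_hidden_state`) or by passing `output_hidden_states=True` to the classification model.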


In [7]:
tos_data['cls_tokens'] = pd.Series(cls_tokens)

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def print_measures(y_test, y_hats):
  print('Accuracy:  {}'.format(accuracy_score(y_test, y_hats)))
  print('Precision: {}'.format(precision_score(y_test, y_hats)))
  print('Recall:    {}'.format(recall_score(y_test, y_hats)))
  print('F1 Score:  {}'.format(f1_score(y_test, y_hats)))

Now to see what we can get from this BERT modeling. The paper's authors use full Leave One Out evaluation, but in the interest of time I'll try a 20% hold-out as a first pass.

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X,y = tos_data.cls_tokens.to_numpy().reshape(-1,1), tos_data.label.to_numpy()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
rfc = RandomForestClassifier(random_state=0)
rfc.fit(x_train, y_train)
y_hats = rfc.predict(x_test)

print_measures(y_test, y_hats)

Accuracy:  0.8125331917153479
Precision: 0.1485148514851485
Recall:    0.14218009478672985
F1 Score:  0.14527845036319612


Pretty miserable results! The accuracy here is misleadingly high given the imbalance of the labeled data: always guessing the negative label would already score ~0.89.</br></br>
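
To make the imbalance point concrete, here is a small sketch of the always-negative baseline. The labels are synthetic, built with an assumed positive rate of 11% to mirror the ~0.89 figure; they are not the Claudette labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 110 positives and 890 negatives (assumed rate, for illustration).
y_true = np.array([1] * 110 + [0] * 890)
# A "classifier" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

print('Accuracy: {:.2f}'.format(accuracy_score(y_true, y_pred)))  # 0.89
print('Recall:   {:.2f}'.format(recall_score(y_true, y_pred)))    # 0.00
```

High accuracy and zero recall can coexist, which is why the later cells focus on precision, recall, and F1.
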
With a larger dataset or a greater scope to this exercise, I wouldn't give up on BERT yet. I might try a model pre-trained on legal documents, fine-tune the base model, or use a different pooling as the classification feature, but for today let's try some other encodings. To be semi-thorough about it, let's look at TF-IDF encodings built on uni- through trigrams.
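
As one illustration of the "different pooling" idea, the sketch below averages token vectors while ignoring padding, using the attention mask. It runs on random NumPy arrays standing in for BERT hidden states; the function name `masked_mean_pool` is hypothetical, not part of any library.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, skipping padded positions."""
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1, None)                    # avoid divide-by-zero
    return summed / counts

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2, 5, 8))       # 2 sentences, 5 tokens, 8 dims
attention_mask = np.array([[1, 1, 1, 0, 0],      # 3 real tokens, 2 padding
                           [1, 1, 1, 1, 1]])     # 5 real tokens

pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled.shape)  # (2, 8)
```
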

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfs = [TfidfVectorizer(sublinear_tf=True, min_df=4, norm='l2', 
                        ngram_range=(1, n), stop_words='english')
              for n in range(1,4)]

vectors = [vec.fit_transform(tos_data.sentence1).toarray() for vec in tfidfs]
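
To see what widening the n-gram range does, here is a toy sketch on three made-up sentences (not from the dataset), with `min_df=1` and no stop-word list so that nothing is filtered out. Each step from `(1, 1)` to `(1, 3)` keeps the earlier terms and adds longer ones, so the number of columns grows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_sents = [
    'we may terminate your account at any time',
    'you may cancel your account at any time',
    'we are not liable for any damages',
]

for n in range(1, 4):
    vec = TfidfVectorizer(ngram_range=(1, n), min_df=1)
    matrix = vec.fit_transform(toy_sents)
    # rows = sentences, columns = distinct n-grams up to length n
    print(n, matrix.shape)
```

On the real data the growth is much steeper, which is one reason a `min_df` cutoff helps keep the feature matrix manageable.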

And let's see what an untuned Random Forest Classifier can do with them.

In [12]:
# Imports repeated so this cell can run on its own
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

Xs = list(vectors)  # one feature matrix per n-gram range: (1, 1), (1, 2), (1, 3)
y = tos_data.label.to_numpy()
for i, X in enumerate(Xs):
  x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
  rfc = RandomForestClassifier(random_state=0)
  rfc.fit(x_train, y_train)
  y_hats = rfc.predict(x_test)

  print('Measures for {}-grams'.format(i+1))
  print_measures(y_test, y_hats)

Measures for 1-grams
Accuracy:  0.9325544344131704
Precision: 0.9117647058823529
Recall:    0.44075829383886256
F1 Score:  0.5942492012779552
Measures for 2-grams
Accuracy:  0.9341476367498672
Precision: 0.8717948717948718
Recall:    0.4834123222748815
F1 Score:  0.6219512195121952
Measures for 3-grams
Accuracy:  0.9368029739776952
Precision: 0.896551724137931
Recall:    0.4928909952606635
F1 Score:  0.636085626911315


Even without any tuning, that's quite a step up! Accuracy is past the label-imbalance threshold and the precision is higher than any of the paper's classifiers across the board. Recall still leaves quite a bit to be desired, though, so let's tinker with that a bit, starting with the uni- through trigram encodings.</br>
As a decent general approach, I'll try running a grid search across a subset of parameters for the Random Forest Classifier. This approach generalizes to other classifiers I might throw at this, like, say, Logistic Regression and SVMs. For the sake of conciseness, I'll stick with RFC.</br>
Note that this step takes a considerable amount of time, even more than generating the BERT embeddings, because it is an exhaustive search over every combination of the parameters.
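
A quick way to gauge the cost before launching the search is to count the candidates. The sketch below uses scikit-learn's `ParameterGrid` on the parameter dictionary used in this search; the fold count of 5 is an assumption based on `GridSearchCV`'s default when `cv` is left unset in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'criterion': ('gini', 'entropy'),
              'max_features': ('auto', 'log2', 'sqrt'),
              'n_estimators': (10, 100, 150, 200)}

n_candidates = len(ParameterGrid(parameters))  # 2 * 3 * 4 combinations
n_folds = 5                                    # assumed default
print(n_candidates, 'candidates,', n_candidates * n_folds, 'model fits')
# prints: 24 candidates, 120 model fits
```
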

In [13]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

X = Xs[2]  # uni- through trigram TF-IDF features
rfc = RandomForestClassifier(random_state=0)

scorer = make_scorer(recall_score)  # the default scorer is accuracy, which is a perverse incentive in this case

# 'auto' means the same as 'sqrt' for classifiers in the scikit-learn version used here
parameters = {'criterion':('gini', 'entropy'), 'max_features':('auto', 'log2', 'sqrt'),
              'n_estimators': (10, 100, 150, 200),  # other parameters left out
              }

# GridSearchCV clones and refits the estimator itself, cross-validating on the data passed to fit
CV_random_forest = GridSearchCV(rfc, parameters, scoring=scorer)
CV_random_forest.fit(X, y)

GridSearchCV(cv=None, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=0,
                                

In [14]:
print(CV_random_forest.best_params_)
print(CV_random_forest.best_score_)

{'criterion': 'gini', 'max_features': 'auto', 'n_estimators': 200}
0.4709113080999954


Now, with a decent hint about how to configure the Random Forest Classifier, we can replicate the Leave One Out validation. Again, if the goal were to thoroughly improve upon the baseline from the paper, the code below would include other classifiers (and I would seek out the full eight-label dataset).</br>
**Note:** the last time I ran the grid search code above it timed out, so the outputs for the cells below got kludged. Rather than spend another hour or so recreating them, just imagine an F1 very similar to the result above.

In [1]:
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier

cv = LeaveOneOut()
# To apply the tuned settings, pass **CV_random_forest.best_params_ to the constructor
models = [RandomForestClassifier(random_state=0)]
num_models = len(models)
y_true, y_preds = [], [[] for _ in range(num_models)]
X, y = vectors[2], tos_data.label.to_numpy()

for train_ix, test_ix in cv.split(X):
  x_train, x_test = X[train_ix], X[test_ix]
  y_train, y_test = y[train_ix], y[test_ix]

  # refit every model on all sentences except the held-out one
  for m in models:
    m.fit(x_train, y_train)
  # predict the single held-out sentence with each model
  yhats = [m.predict(x_test) for m in models]
  # store the true label and each model's prediction
  y_true.append(y_test[0])
  for i in range(num_models):
    y_preds[i].append(yhats[i][0])


NameError: ignored

In [1]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for y_pred in y_preds:
  acc = accuracy_score(y_true, y_pred)
  print('Accuracy: %.3f' % acc)
  prec = precision_score(y_true, y_pred)
  recall = recall_score(y_true, y_pred)
  f_score = f1_score(y_true, y_pred)
  print("Precision:{}\nRecall:{}\nF1 Score:{}".format(prec,recall,f_score))

NameError: ignored
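
For a more compact version of the leave-one-out loop, scikit-learn's `cross_val_predict` can collect the held-out predictions directly. This is a sketch on synthetic data from `make_classification` (an assumption for illustration, not the TF-IDF matrix), with a small forest so it runs quickly. Leave-one-out fits one model per sample, which is why it is slow on the full dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Small synthetic, imbalanced problem standing in for the real features.
X_toy, y_toy = make_classification(n_samples=40, n_features=10, weights=[0.8, 0.2],
                                   random_state=0)

loo = LeaveOneOut()
model = RandomForestClassifier(n_estimators=10, random_state=0)

# One prediction per sample, each made by a model that never saw that sample.
y_pred = cross_val_predict(model, X_toy, y_toy, cv=loo)

print('Number of fits:', loo.get_n_splits(X_toy))
print('F1 Score: {}'.format(f1_score(y_toy, y_pred)))
```
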