# 5. Multiclass Text Classification with BERT and ktrain on Our Legal Corpora

BERT is a deep learning model that has achieved state-of-the-art results on a wide variety of natural language processing tasks. It stands for `Bidirectional Encoder Representations from Transformers`.

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word). (Source: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
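The difference between directional and non-directional reading can be sketched as a visibility mask over tokens. This is only an illustration of the idea, not BERT's actual implementation; the helper `visibility_mask` and the example tokens are made up for this sketch.

```python
def visibility_mask(n_tokens, bidirectional):
    """Return an n x n 0/1 matrix; entry [i][j] is 1 if token i may look at token j."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

# Illustrative tokens only (hypothetical example, not taken from the dataset)
tokens = ["licensor", "warrants", "that", "the", "software"]

# A left-to-right model lets each token see only itself and earlier tokens
left_to_right = visibility_mask(len(tokens), bidirectional=False)
# An encoder in the style of BERT lets every token see the whole sequence
encoder_style = visibility_mask(len(tokens), bidirectional=True)

for name, mask in [("left-to-right", left_to_right), ("bidirectional", encoder_style)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:<10} {row}")
```

In the left-to-right mask, "warrants" cannot see "software"; in the bidirectional mask it can, which is what lets the model use context on both sides of a word.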

In [35]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://miro.medium.com/max/1400/0*ViwaI3Vvbnd-CJSQ.png")

In the image above, the Transformer encoder takes a sequence of tokens as input; the tokens are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to the input token with the same index.

During pre-training, BERT is also given pairs of sentences and learns whether the second sentence follows the first. To predict if the second sentence is indeed connected to the first, the following steps are performed:

1. The entire input sequence goes through the Transformer model.
2. The output of the [CLS] token is transformed into a 2×1 shaped vector, using a simple classification layer (learned matrices of weights and biases).
3. The probability of IsNextSequence is calculated with softmax.

(Source: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)
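The classification step described above (a linear layer on the [CLS] output followed by softmax) can be sketched in plain Python. This is a toy illustration with random numbers: the hidden size `H = 8`, the weights, and the label name `NotNextSequence` are assumptions made for the example, not values from BERT.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify_cls(cls_vector, weights, biases):
    """Linear layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)

random.seed(0)
H = 8  # illustrative hidden size; far smaller than a real BERT model
cls_vector = [random.gauss(0, 1) for _ in range(H)]                     # stand-in for the [CLS] output
weights = [[random.gauss(0, 0.1) for _ in range(H)] for _ in range(2)]  # 2 classes x H
biases = [0.0, 0.0]

probs = classify_cls(cls_vector, weights, biases)
print(dict(zip(["IsNextSequence", "NotNextSequence"], probs)))
```

The same pattern, with ten output classes instead of two, is what a fine-tuned classifier head applies for the clause types in this notebook.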

In doing so, the model comes close to mimicking human reading comprehension. Just as the displaCy dependency parsing visualization suggested, we read words and grasp their contextual meaning intuitively, and BERT is able to come close to that.

We'll now find out for ourselves whether the performance lives up to the hype.

**Please install ktrain on Google Colab**:
`pip install ktrain`

In [None]:
#we import pandas as well as ktrain
import pandas as pd
import numpy as np

import ktrain
from ktrain import text

In [None]:
#we also have to prepare train_test_split
from sklearn.model_selection import train_test_split

In [None]:
#importing our dataframe again
df = pd.read_csv("./sample_data/df_clean_draft_2.csv", index_col = 0)

In [None]:
df.head()

In [None]:
#we reinstantiate X and y and call train test split
X = df["clause_text"]
y = df["clause_type"]

In [None]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
#I have to change our train test split objects into a list of strings.
X_train = X_train.values.tolist()
X_test = X_test.values.tolist()

y_train = y_train.values.tolist()
y_test = y_test.values.tolist()

print("classes to predict")
print(y.value_counts())

In [None]:
type(X_train)

In [None]:
#recalling our dictionary from the previous notebook
encoding = {'warranty':0, 
            'compliance':1, 
            'payment':2, 
            'support':3, 
            'delivery':4,
            'proprietary_rights':5, 
            'limited_liability':6, 
            'indemnity':7,
            'confidentiality':8, 
            'licenses':9}

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]

### BERT Specific Preprocessing
* The text must be preprocessed in a specific way for use with BERT. This is accomplished by setting `preprocess_mode` to `'bert'`. The BERT pre-trained model and vocabulary will be downloaded automatically.

* BERT can handle a maximum sequence length of 512 tokens, but let's use less (`maxlen=400`) to reduce memory use and improve speed.
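Before settling on a maximum length, it can help to check how many clauses would be truncated. The following is a rough sketch: the helper `share_longer_than` and the sample clauses are hypothetical, and whitespace word counts only approximate BERT's token counts (BERT splits rare words into sub-word pieces and adds special tokens, so real counts are somewhat higher).

```python
def share_longer_than(texts, maxlen):
    """Fraction of texts whose whitespace-delimited word count exceeds maxlen."""
    if not texts:
        return 0.0
    too_long = sum(1 for t in texts if len(t.split()) > maxlen)
    return too_long / len(texts)

# Hypothetical sample; in the notebook this would be the list of training clauses
sample_clauses = [
    "Licensor warrants that the Licensed Software shall perform as documented.",
    "Payment is due within thirty days of invoice. " * 100,  # artificially long clause
]

print(share_longer_than(sample_clauses, maxlen=400))
```

If only a small share of clauses exceeds the chosen length, truncation is unlikely to cost much; if the share is large, a higher value (up to 512) may be worth the extra memory.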

In [None]:
class_names = ['warranty', 'compliance', 'payment', 'support', 'delivery',
               'proprietary_rights', 'limited_liability', 'indemnity',
               'confidentiality', 'licenses']
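Because the labels were converted to integers with the `encoding` dictionary, the order of `class_names` matters: this sketch assumes that integer label `i` is reported under `class_names[i]`. A small sanity check, using only the two objects defined in this notebook, guards against the two getting out of sync. The `decoding` dictionary is a helper introduced here for illustration.

```python
encoding = {'warranty': 0, 'compliance': 1, 'payment': 2, 'support': 3,
            'delivery': 4, 'proprietary_rights': 5, 'limited_liability': 6,
            'indemnity': 7, 'confidentiality': 8, 'licenses': 9}

class_names = ['warranty', 'compliance', 'payment', 'support', 'delivery',
               'proprietary_rights', 'limited_liability', 'indemnity',
               'confidentiality', 'licenses']

# Order the label names by their integer code and compare with class_names
names_by_code = sorted(encoding, key=encoding.get)
assert names_by_code == class_names, "class_names order does not match encoding"

# Inverse mapping, handy for turning integer predictions back into names
decoding = {code: name for name, code in encoding.items()}
print(decoding[0])  # warranty
```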

In [None]:
(x_train,  y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=X_train, y_train=y_train,
                                                                       x_test=X_test, y_test=y_test,
                                                                       class_names=class_names,
                                                                       preprocess_mode='bert',
                                                                       maxlen=400, 
                                                                       max_features=100000)

### Training and Validation of BERT on Contract Clauses

In [None]:
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

In [None]:
learner = ktrain.get_learner(model, train_data=(x_train, y_train), 
                             val_data=(x_test, y_test),
                             batch_size=12)

In [None]:
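# #lr_find() runs a learning rate range test: it trains briefly while increasing the learning rate and records the loss.
# #It plays a role similar to GridSearch, but only for the learning rate.
# learner.lr_find()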

In [None]:
# #This in turn lets us plot loss against learning rate. We look for the highest learning rate
# #at which the loss is still clearly falling, before it flattens out or diverges.
# #This is similar to "elbowing"
# learner.lr_plot()
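Reading a learning rate off the plot by eye can be complemented with a simple heuristic: pick the rate at which the loss is falling fastest. The sketch below is an assumption-laden illustration, not what `lr_find()` does internally; the function `steepest_descent_lr` and the loss curve are made up for the example.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at the end of the interval where loss drops
    fastest per unit of log10(learning rate), or None if loss never drops."""
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr

# Made-up curve: flat at first, then falling, then diverging
lrs = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
losses = [2.30, 2.28, 1.90, 1.20, 1.60, 4.00]

print(steepest_descent_lr(lrs, losses))
```

In practice a rate somewhat below the suggested one is a cautious choice, since the loss starts to diverge soon after the steepest part of the curve.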

In [22]:
learner.fit_onecycle(2e-5, 5)



begin training using onecycle policy with max lr of 2e-05...
Train on 5333 samples, validate on 2628 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f08bb3fc240>
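The one-cycle policy used above varies the learning rate during training instead of holding it fixed. The following is a minimal sketch of one common variant, a linear warm-up to the maximum rate followed by a linear decay. The exact shape and the floor value that ktrain uses are not shown in this notebook and may differ; `triangular_one_cycle` and the `max_lr / 10` floor are assumptions for illustration.

```python
import math

def triangular_one_cycle(step, total_steps, max_lr, base_lr=None):
    """Learning rate at `step` for a linear warm-up followed by a linear decay."""
    if base_lr is None:
        base_lr = max_lr / 10  # assumed floor, purely illustrative
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # warming up
    else:
        frac = (total_steps - step) / half  # decaying
    return base_lr + (max_lr - base_lr) * frac

# Numbers taken from the run above: 5333 training samples, batch_size=12, 5 epochs
steps_per_epoch = math.ceil(5333 / 12)
total_steps = steps_per_epoch * 5

schedule = [triangular_one_cycle(s, total_steps, max_lr=2e-5)
            for s in range(total_steps + 1)]

print(f"start {schedule[0]:.2e}, peak {max(schedule):.2e}, end {schedule[-1]:.2e}")
```

The small rates at the start and end keep early updates gentle and let the model settle, while the higher rate in the middle speeds up learning.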

## Predicting an Unseen Contractual Clause

We first check BERT's performance on the held-out test set, and then test-drive it on an unseen clause taken from elsewhere to see whether it is correctly predicted as a warranty clause.

In [24]:
learner.validate(val_data=(x_test, y_test), class_names=class_names)

                    precision    recall  f1-score   support

          warranty       0.93      0.96      0.94       315
        compliance       0.97      0.97      0.97       276
           payment       0.98      0.98      0.98       298
           support       0.98      0.98      0.98       307
          delivery       0.96      0.97      0.97       306
proprietary_rights       0.96      0.95      0.95       256
 limited_liability       0.97      0.95      0.96       244
         indemnity       0.97      0.96      0.97       219
   confidentiality       0.97      0.99      0.98       195
          licenses       0.97      0.94      0.95       212

          accuracy                           0.96      2628
         macro avg       0.97      0.96      0.96      2628
      weighted avg       0.96      0.96      0.96      2628



array([[302,   2,   2,   2,   0,   1,   3,   1,   1,   1],
       [  2, 267,   1,   1,   1,   1,   0,   0,   0,   3],
       [  0,   0, 291,   1,   4,   2,   0,   0,   0,   0],
       [  1,   1,   2, 300,   1,   2,   0,   0,   0,   0],
       [  4,   0,   1,   2, 298,   0,   0,   0,   0,   1],
       [  7,   0,   0,   1,   0, 242,   0,   0,   4,   2],
       [  7,   0,   0,   0,   1,   1, 231,   4,   0,   0],
       [  0,   0,   0,   0,   2,   1,   5, 211,   0,   0],
       [  0,   0,   0,   0,   0,   1,   0,   0, 194,   0],
       [  3,   4,   0,   0,   3,   2,   0,   1,   0, 199]])
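The figures in the report can be recomputed directly from the confusion matrix, which makes the link between the two outputs explicit. The sketch below copies the matrix printed above and assumes rows are true classes and columns are predicted classes (the row sums match the `support` column). The helper `per_class_metrics` is introduced here for illustration.

```python
class_names = ['warranty', 'compliance', 'payment', 'support', 'delivery',
               'proprietary_rights', 'limited_liability', 'indemnity',
               'confidentiality', 'licenses']

confusion = [
    [302,   2,   2,   2,   0,   1,   3,   1,   1,   1],
    [  2, 267,   1,   1,   1,   1,   0,   0,   0,   3],
    [  0,   0, 291,   1,   4,   2,   0,   0,   0,   0],
    [  1,   1,   2, 300,   1,   2,   0,   0,   0,   0],
    [  4,   0,   1,   2, 298,   0,   0,   0,   0,   1],
    [  7,   0,   0,   1,   0, 242,   0,   0,   4,   2],
    [  7,   0,   0,   0,   1,   1, 231,   4,   0,   0],
    [  0,   0,   0,   0,   2,   1,   5, 211,   0,   0],
    [  0,   0,   0,   0,   0,   1,   0,   0, 194,   0],
    [  3,   4,   0,   0,   3,   2,   0,   1,   0, 199],
]

def per_class_metrics(cm):
    """Return a list of (precision, recall) pairs, one per class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_pos = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: everything predicted as k
        actual_k = sum(cm[k])                          # row sum: everything that really is k
        metrics.append((true_pos / predicted_k, true_pos / actual_k))
    return metrics

metrics = per_class_metrics(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / total

for name, (precision, recall) in zip(class_names, metrics):
    print(f"{name:>18}  precision={precision:.2f}  recall={recall:.2f}")
print(f"accuracy={accuracy:.2f} over {total} clauses")
```

Reading down the first column also shows where the warranty precision is lost: seven proprietary-rights clauses and seven limited-liability clauses were predicted as warranties.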

In [25]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.get_classes()

['warranty',
 'compliance',
 'payment',
 'support',
 'delivery',
 'proprietary_rights',
 'limited_liability',
 'indemnity',
 'confidentiality',
 'licenses']

In [26]:
unseen_clause = "Licensor warrants that the Licensed Software under normal use shall perform the functions specified in its documentation to be developed by Licensor. If the Licensed Software does not conform to its documentation such that its functional performance is reasonably affected and Licensor is notified in writing within 30 days. THIS WARRANTY IS EXCLUSIVE AND IN LIEU OF ALL OTHER WARRANTIES WHETHER STATUTORY, EXPRESS, OR IMPLIED INCLUDING ALL WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE."

unseen_clause

'Licensor warrants that the Licensed Software under normal use shall perform the functions specified in its documentation to be developed by Licensor. If the Licensed Software does not conform to its documentation such that its functional performance is reasonably affected and Licensor is notified in writing within 30 days. THIS WARRANTY IS EXCLUSIVE AND IN LIEU OF ALL OTHER WARRANTIES WHETHER STATUTORY, EXPRESS, OR IMPLIED INCLUDING ALL WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.'

In [27]:
import time 

start_time = time.time() 
prediction = predictor.predict(unseen_clause)

print('predicted: {} ({:.2f})'.format(prediction, (time.time() - start_time)))

predicted: warranty (0.10)


### Visualizing BERT

Neural networks are notorious for being black boxes, but a custom library named `bertviz` exists that allows us to visualize the magic behind BERT. We will import it now:

In [29]:
import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
# !rm -r bertviz_repo # Uncomment if you need a clean pull from repo
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if not 'bertviz_repo' in sys.path:
  sys.path += ['bertviz_repo']
!pip install regex
!pip install transformers

Cloning into 'bertviz_repo'...
remote: Enumerating objects: 1074, done.
remote: Total 1074 (delta 0), reused 0 (delta 0), pack-reused 1074
Receiving objects: 100% (1074/1074), 99.41 MiB | 11.74 MiB/s, done.
Resolving deltas: 100% (687/687), done.


In [30]:
from bertviz import head_view
from transformers import BertTokenizer, BertModel

In [38]:
def call_html():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))

In [40]:
#defines a function to show the bertviz visualisation
def show_head_view(model, tokenizer, sentence_a, sentence_b):
  # encode the two sentences together as a single BERT input
  inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
  token_type_ids = inputs['token_type_ids']
  input_ids = inputs['input_ids']
  # with output_attentions=True, the attention weights are the last element of the model output
  attention = model(input_ids, token_type_ids=token_type_ids)[-1]
  input_id_list = input_ids[0].tolist() # Batch index 0
  tokens = tokenizer.convert_ids_to_tokens(input_id_list)
  call_html()
  head_view(attention, tokens)

model_version = 'bert-base-uncased'
do_lower_case = True
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

sentence_a = "licensor warrants that the licensed Software under normal use"
sentence_b = "shall perform the functions specified in its documentation"
show_head_view(model, tokenizer, sentence_a, sentence_b)


In [37]:
#we create 2 sentences to display the bidirectional text reading that BERT does
model_version = 'bert-base-uncased'
do_lower_case = True
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)
sentence_a = "Licensor warrants that the Licensed Software under normal use "
sentence_b = "shall perform the functions specified in its documentation"
show_head_view(model, tokenizer, sentence_a, sentence_b)


<img src="../images/BERTviz.png">

In [None]:
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
# def get_model():
#     model = Sequential()
#     model.add(Embedding(NUM_WORDS, 50, input_length=MAXLEN))
#     model.add(GlobalAveragePooling1D())
#     model.add(Dense(1, activation='sigmoid'))
#     model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#     return model
# model = get_model()
# # rebuild the model to train from scratch 
# learner.set_model(get_model())

# # training using autofit
# learner.autofit(0.005, 2)

## Conclusion
In this project, we took a deep dive into legal corpora in the form of contractual clauses. 

We started by acknowledging that contracts can be dense, verbose and inaccessible. This is because they are written in legalese, a techno-legal language that people in commerce have no choice but to make sense of. This becomes unwieldy when one has multiple contracts and hundreds of clauses to pore over, not just for lawyers, but for compliance professionals and business stakeholders generally.

Thus, the classification of legal clauses via machine learning can be very useful for trawling through what is often seen as a necessary evil.

We observed the basic textual characteristics of contracts by analyzing word counts, clause types and top words and bigrams. 

We then explored common conceptual topics in contracts using spaCy, Blackstone, and LDA topic modelling. We found that some contractual clauses, like warranties and support obligations, can overlap. On the other hand, clauses like compliance clauses can be written with word vectors distinct enough that they form topics unto themselves.

We also realized that Bag-of-Words models are still strong enough to classify multiclass text problems, including legal corpora. Our SVC model outperformed every other sklearn-type model at `0.947` accuracy.

However, we highlighted the drawback of Bag-of-Words models: they are unnatural and unintuitive compared with how people actually read documents. In the case of highly correlated word vectors in legal corpora, we underlined the mutual dependency that words have with each other, and used displaCy to visualize this relationship through dependency parsing.

We concluded the project by exploring `BERT`, a bidirectional neural network model built on Transformers and developed specifically for these types of NLP problems. At roughly `0.96` accuracy on the held-out test set, it was even more accurate than the SVC model, and it seems to come very close to what humans do in terms of comprehending words contextually and interdependently.