<h1><center>SAP IES AICOE TAKE-HOME TEST </center></h1>
<h2><center> Machine Learning </center></h2>

# Please choose <font color="red">ONLY 1</font> part of your interest: either Part A or B

# <font color="blue">Part A</font>

## Guideline:

- File to use: **contract_dataset_v20220109.csv**
- Train a **MULTI-CLASS CLASSIFICATION MODEL**. Your final model should be able to take in a test clause (column `provision`) and predict its provision type (column `label`). 
- Perform your own train-test split.
- Choose your own evaluation metrics and explain your choice(s).
- You may refer to, and/or copy code blocks from any public Github repositories, Kaggle kernels or tutorials online. However, please add a comment indicating the source.
- You are free to conduct exploratory data analysis, write unit tests or any other additional steps as you find neccesary.
- The purpose is NOT to train the best-performing model. It is to help us assess your ability to learn and apply NLP modelling techniques.

## Answer:

<h1> Model and Evaluation metric </h1>
<h2> Model </h2>
<p> I used a Recursive Nueral Network to train my model , as the input is a series of text , where each word might affect the label prediction , I wanted to use the data shape(words) and thus decided to train my model using RNN </p>
<h2> Evaluation Metrics </h2>
<p>I used accuracy and confusion matrix as evaluation metric .As the classification is non-binary , a confusion matrix helps me visualise how well my model is doing and accuracy is a numeric factor that judges my model.</p>

In [227]:
import pandas as pd
import torch
import torchtext

FILE_PATH = "../data/contract_dataset_v20220109.csv"



In [None]:
import torch.utils.data as data 
from torch.utils.data import Dataset

df = pd.read_csv(FILE_PATH)
data_len = len(df)
labels = df['label'].unique()
label_map =  {labels[i]:i for i in range(len(labels))} # Creates label_map

df = df[['provision','label']] # Truncated useless columns

df['label'] = df['label'].map(lambda x: label_map[x]) # Maps All String labels to an integer accroding to map

#My Custom dataset that has two columns (X->Provision , Y -> numeric label)
class MyDataset(Dataset):
    def __init__(self, filename):
        df = pd.read_csv(filename)
        x = df['provision']
        y = df['label'].map(lambda x: label_map[x])

        self.x = x.tolist()
        self.y = y.tolist()
  
    def __len__(self):
        return len(self.y)
  
    def __getitem__(self, i):
        return self.x[i],self.y[i]

md = MyDataset(FILE_PATH)

#Splits Dataset
train_dataset, test_dataset = data.random_split(md, [0.7,0.3])




In [None]:
# Tokenizer
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")

#Tokenizes the Words inside one provision
def build_vocabulary(datasets):
    for dataset in datasets:
        for text,i in dataset:
            yield tokenizer(text)

vocab = build_vocab_from_iterator(build_vocabulary([train_dataset, test_dataset]), min_freq=1, specials=["<UNK>"])

vocab.set_default_index(vocab["<UNK>"])

In [None]:
# Vectorise Test and Train Loaders
import numpy as np
from torch.utils.data import DataLoader

max_words = 10 #Number of token that are analysed by the model , exceeding the limits is truncated and lower than the limit are padded by default.

def vectorize_batch(batch):
    np_batch = np.array(batch)
    X = np_batch[:,0].tolist()
    Y = np_batch[:,1].astype(int).tolist()
      
    X = [vocab(tokenizer(str(text))) if type(text) == int else vocab(tokenizer(text))  for text in X ]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.

    return torch.tensor(X, dtype=torch.int32), torch.tensor(Y)

#Prepare Test Loaders 
train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset , batch_size=1024, collate_fn=vectorize_batch)

#Test
# for X, Y in train_loader:
#     print(X.shape, Y.shape)
#     break

In [None]:
# Creating the RNN Model

from torch import nn
from torch.nn import functional as F

# The Weights of the embedded layer
embed_len = 50
#Weights of the hidden later
hidden_dim = 50
#Only running one layer of RNN 
n_layers=1

#Class for the RNN Model input-> Embedding Layer -> Hidden RNN Layer -> Linear Layer(Output)
class RNNClassifier(nn.Module):
    def __init__(self):
        super(RNNClassifier, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.rnn = nn.RNN(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(labels))


    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        output, hidden = self.rnn(embeddings, torch.randn(n_layers, len(X_batch), hidden_dim))
        return self.linear(output[:,-1])

rnn_classifier = RNNClassifier()

#Checking Layer content
# for layer in rnn_classifier.children():
#     print("Layer : {}".format(layer))
#     print("Parameters : ")
#     for param in layer.parameters():
#         print(param.shape)
#     print()

In [None]:
# Train Network

# Calculates loss
def CalcValLossAndAccuracy(model, loss_fn, val_loader):
    with torch.no_grad():
        Y_shuffled, Y_preds, losses = [],[],[]
        for idx, data in enumerate(train_loader):
            X, Y = data
            preds = model(X)
            loss = loss_fn(preds, Y)
            losses.append(loss.item())

            Y_shuffled.append(Y)
            Y_preds.append(preds.argmax(dim=-1))

        Y_shuffled = torch.cat(Y_shuffled)
        Y_preds = torch.cat(Y_preds)

        print("Valid Loss : {:.3f}".format(torch.tensor(losses).mean()))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Y_shuffled.detach().numpy(), Y_preds.detach().numpy())))

#Runs Each iteration of gradient descent
def TrainModel(model, loss_fn, optimizer, train_loader, val_loader, epochs=10):
    for i in range(1, epochs+1):
        losses = []
        for idx, data in enumerate(train_loader):
            X, Y = data
            Y_preds = model(X)

            loss = loss_fn(Y_preds, Y)
            losses.append(loss.item())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        CalcValLossAndAccuracy(model, loss_fn, val_loader)



In [None]:
# Trains the Models
from torch.optim import Adam

epochs = 15
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
rnn_classifier = RNNClassifier()
optimizer = Adam(rnn_classifier.parameters(), lr=learning_rate)

TrainModel(rnn_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

Train Loss : 1.915
Valid Loss : 1.557
Valid Acc  : 0.540
Train Loss : 1.344
Valid Loss : 1.158
Valid Acc  : 0.650
Train Loss : 1.015
Valid Loss : 0.865
Valid Acc  : 0.719
Train Loss : 0.785
Valid Loss : 0.717
Valid Acc  : 0.750
Train Loss : 0.684
Valid Loss : 0.651
Valid Acc  : 0.770
Train Loss : 0.631
Valid Loss : 0.604
Valid Acc  : 0.788
Train Loss : 0.587
Valid Loss : 0.563
Valid Acc  : 0.804
Train Loss : 0.547
Valid Loss : 0.524
Valid Acc  : 0.822
Train Loss : 0.509
Valid Loss : 0.487
Valid Acc  : 0.839
Train Loss : 0.473
Valid Loss : 0.449
Valid Acc  : 0.854
Train Loss : 0.436
Valid Loss : 0.414
Valid Acc  : 0.868
Train Loss : 0.399
Valid Loss : 0.380
Valid Acc  : 0.882
Train Loss : 0.368
Valid Loss : 0.349
Valid Acc  : 0.892
Train Loss : 0.340
Valid Loss : 0.326
Valid Acc  : 0.897
Train Loss : 0.319
Valid Loss : 0.303
Valid Acc  : 0.905


In [237]:
# Checks the statistics of the classifier 
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
def MakePredictions(model, loader):
    Y_shuffled, Y_preds = [], []
    for X, Y in loader:
        preds = model(X)
        Y_preds.append(preds)
        Y_shuffled.append(Y)
    gc.collect()
    Y_preds, Y_shuffled = torch.cat(Y_preds), torch.cat(Y_shuffled)

    return Y_shuffled.detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).detach().numpy()
Y_actual, Y_preds = MakePredictions(rnn_classifier, test_loader)

print("Test Accuracy : {}".format(accuracy_score(Y_actual, Y_preds)))
print("\nClassification Report : ")
print(classification_report(Y_actual, Y_preds, target_names=labels))
print("\nConfusion Matrix : ")
print(confusion_matrix(Y_actual, Y_preds))

Test Accuracy : 0.8890389810883829

Classification Report : 
                            precision    recall  f1-score   support

               ['waivers']       0.77      0.76      0.77       976
        ['governing laws']       0.96      0.97      0.97      3784
            ['amendments']       0.77      0.82      0.80      1420
          ['counterparts']       0.98      0.98      0.98      2782
            ['warranties']       0.20      0.01      0.01       189
          ['terminations']       0.70      0.82      0.76      1107
       ['valid issuances']       0.00      0.00      0.00        41
['government regulations']       0.20      0.06      0.09        34
       ['trade relations']       0.00      0.00      0.00        20
    ['trading activities']       0.00      0.00      0.00        11

                  accuracy                           0.89     10364
                 macro avg       0.46      0.44      0.44     10364
              weighted avg       0.87      0.89      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
# Run this function after running all the blocks above
# THIS IS MY ANSWER 
def make_predictions(X):
    # Takes in a Column of Provisions and returns the predicted label array 
    X = X[0].tolist()
    X = [vocab(tokenizer(text)) for text in X]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X]
    Y = rnn_classifier(torch.tensor(X, dtype=torch.int32))
    preds = F.softmax(Y, dim=1)
    preds = preds.detach().numpy()
    preds = np.argmax(preds, axis = 1)
    return labels[preds]

print(make_predictions([df["provision"][9000:10000]]))


["['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['waivers']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['waivers']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governing laws']"
 "['governing laws']" "['governing laws']" "['governin

# <font color="blue">Part B</font>

## Guideline:

**STEP 1: DATA PREPARATION**  

- Download the CUAD dataset here: https://www.atticusprojectai.org/cuad
- Read the CUAD's [Datasheet](https://drive.google.com/drive/u/0/folders/1Yu-JnZj1LbVBfTdPiHfMDnaKZj4eqks8) and understand the format of the data.

**STEP 2: MODELLING**  
- Train a machine learning model to extract expiry date from a given plaintext contract. To account for model explainability, you may also train a model to first extract the relevant clause from a contract by outputting start and end tokens, and then extract the expiry date using a rule-based extractor. 
- Perform your own train-test split.
- Choose your own evaluation metrics and explain your choice(s)
- Feel feel to employ whichever modelling techniques you see fit, e.g. question answering, custom-NER, etc.
- We highly recommend you to read CUAD's paper on arxiv: [CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review](https://arxiv.org/abs/2103.06268)

- You may refer to, and/or copy code blocks from any public Github repositories, Kaggle kernels or tutorials online. However, please add a comment indicating the source.
- You are free to conduct exploratory data analysis, write unit tests or any other additional steps as you find neccesary.
- The purpose is NOT to train the best-performing model. It is to help us assess your ability to learn and apply NLP modelling techniques.

## Answer:

In [None]:
# Begin your code here


