<a href="https://colab.research.google.com/github/casbdai/unboxing_sessions/blob/main/TextminingContractAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building A Contract Analyzer:

A short introduction to processing textual data and to text classification

# Basic Setup


Install the nltk library for text processing and download the extensions it requires. Also, we install the wordcloud library for plotting our results as a word cloud. TextBlob is a library for sentiment analysis.

In [1]:
!pip install nltk
import nltk
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# we import a series of specific functions from the nltk package for processing the texts.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer

# we import pandas for reading in files
import pandas as pd

## Read in the data

In [3]:
corpus = pd.read_excel("https://raw.githubusercontent.com/casbdai/datasets/main/cuad_data.xlsx")
corpus.head()

Unnamed: 0,Contract,content,Audit Rights,Covenant not to Sue,Document Name,Governing Law,Insurance,IP Ownership Assignment,Joint IP Ownership,Liquidated Damages,...,Post-termination Services,Price Restrictions,Revenue-Profit Sharing,ROFR-ROFO-ROFN,Source Code Escrow,Termination for Convenience,Third Party Beneficiary,Unlimited/All-You-Can-Eat License,Volume Restriction,Warranty Duration
0,2ThemartComInc_19990826_10-12G_EX-10.10_670028...,CO-BRANDING AND ADVERTISING AGREEMENT THIS CO...,1,0,1,1,0,0,1,0,...,1,0,1,0,0,0,0,0,0,0
1,ABILITYINC_06_15_2020-EX-4.25-SERVICES AGREEMENT,EXHIBIT 4.25 INFORMATION IN THIS EXHIBIT IDENT...,0,0,1,1,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
2,ACCELERATEDTECHNOLOGIESHOLDINGCORP_04_24_2003-...,EXHIBIT 10.13 JO...,1,0,1,1,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
3,ACCURAYINC_09_01_2010-EX-10.31-DISTRIBUTOR AGR...,Exhibit 10.31 PURSUANT TO 17 C.F.R. § 240.2...,1,0,1,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,ADAMSGOLFINC_03_21_2005-EX-10.17-ENDORSEMENT A...,REDACTED COPY CONFIDENTIAL TREATMENT REQUESTE...,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [4]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 24 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Contract                           498 non-null    object
 1   content                            498 non-null    object
 2   Audit Rights                       498 non-null    int64 
 3   Covenant not to Sue                498 non-null    int64 
 4   Document Name                      498 non-null    int64 
 5   Governing Law                      498 non-null    int64 
 6   Insurance                          498 non-null    int64 
 7   IP Ownership Assignment            498 non-null    int64 
 8   Joint IP Ownership                 498 non-null    int64 
 9   Liquidated Damages                 498 non-null    int64 
 10  Minimum Commitment                 498 non-null    int64 
 11  Most Favored Nation                498 non-null    int64 
 12  No-Solic

We extract the first document and store it in a variable named `text`.

In [5]:
text = corpus["content"][0]
print(text)

CO-BRANDING AND ADVERTISING AGREEMENT  THIS CO-BRANDING AND ADVERTISING AGREEMENT (the "Agreement") is made as of June 21, 1999 (the "Effective Date") by and between I-ESCROW, INC., with its principal place of business at 1730 S. Amphlett Blvd., Suite 233, San Mateo, California 94402 ("i-Escrow"), and 2THEMART.COM, INC. having its principal place of business at 18301 Von Karman Avenue, 7th Floor, Irvine, California 92612 ("2TheMart").  1. DEFINITIONS.  (a) "CONTENT" means all content or information, in any medium, provided by a party to the other party for use in conjunction with the performance of its obligations hereunder, including without limitation any text, music, sound, photographs, video, graphics, data or software. Content provided by 2TheMart is referred to herein as "2TheMart Content" and Content provided by i-Escrow is referred to herein as "i-Escrow Content."  (b) "CO-BRANDED SITE" means the web-site accessible through Domain Name, for the Services implemented by i-Escrow.

## Pre-Processing Textual Data

### Convert text to lower case:

In [6]:
lower_text = text.lower()
print(lower_text)

co-branding and advertising agreement  this co-branding and advertising agreement (the "agreement") is made as of june 21, 1999 (the "effective date") by and between i-escrow, inc., with its principal place of business at 1730 s. amphlett blvd., suite 233, san mateo, california 94402 ("i-escrow"), and 2themart.com, inc. having its principal place of business at 18301 von karman avenue, 7th floor, irvine, california 92612 ("2themart").  1. definitions.  (a) "content" means all content or information, in any medium, provided by a party to the other party for use in conjunction with the performance of its obligations hereunder, including without limitation any text, music, sound, photographs, video, graphics, data or software. content provided by 2themart is referred to herein as "2themart content" and content provided by i-escrow is referred to herein as "i-escrow content."  (b) "co-branded site" means the web-site accessible through domain name, for the services implemented by i-escrow.

### Tokenize text

Break the text down into tokens, i.e., split the sentences into single words for analysis.

In [7]:
word_tokens = nltk.word_tokenize(lower_text)
print(word_tokens)

['co-branding', 'and', 'advertising', 'agreement', 'this', 'co-branding', 'and', 'advertising', 'agreement', '(', 'the', '``', 'agreement', "''", ')', 'is', 'made', 'as', 'of', 'june', '21', ',', '1999', '(', 'the', '``', 'effective', 'date', "''", ')', 'by', 'and', 'between', 'i-escrow', ',', 'inc.', ',', 'with', 'its', 'principal', 'place', 'of', 'business', 'at', '1730', 's.', 'amphlett', 'blvd.', ',', 'suite', '233', ',', 'san', 'mateo', ',', 'california', '94402', '(', '``', 'i-escrow', "''", ')', ',', 'and', '2themart.com', ',', 'inc.', 'having', 'its', 'principal', 'place', 'of', 'business', 'at', '18301', 'von', 'karman', 'avenue', ',', '7th', 'floor', ',', 'irvine', ',', 'california', '92612', '(', '``', '2themart', "''", ')', '.', '1.', 'definitions', '.', '(', 'a', ')', '``', 'content', "''", 'means', 'all', 'content', 'or', 'information', ',', 'in', 'any', 'medium', ',', 'provided', 'by', 'a', 'party', 'to', 'the', 'other', 'party', 'for', 'use', 'in', 'conjunction', 'with'

We need a better tokenizer: punctuation marks and numbers are retained as tokens, and very short words also become tokens.


In [8]:
better_tokenizer = RegexpTokenizer(r'[a-zA-Z]{3,}')
# [a-zA-Z] means that only letters are retained as tokens
# {3,} means that only tokens with at least three characters are retained

In [9]:
word_tokens = better_tokenizer.tokenize(lower_text)
print(word_tokens)

['branding', 'and', 'advertising', 'agreement', 'this', 'branding', 'and', 'advertising', 'agreement', 'the', 'agreement', 'made', 'june', 'the', 'effective', 'date', 'and', 'between', 'escrow', 'inc', 'with', 'its', 'principal', 'place', 'business', 'amphlett', 'blvd', 'suite', 'san', 'mateo', 'california', 'escrow', 'and', 'themart', 'com', 'inc', 'having', 'its', 'principal', 'place', 'business', 'von', 'karman', 'avenue', 'floor', 'irvine', 'california', 'themart', 'definitions', 'content', 'means', 'all', 'content', 'information', 'any', 'medium', 'provided', 'party', 'the', 'other', 'party', 'for', 'use', 'conjunction', 'with', 'the', 'performance', 'its', 'obligations', 'hereunder', 'including', 'without', 'limitation', 'any', 'text', 'music', 'sound', 'photographs', 'video', 'graphics', 'data', 'software', 'content', 'provided', 'themart', 'referred', 'herein', 'themart', 'content', 'and', 'content', 'provided', 'escrow', 'referred', 'herein', 'escrow', 'content', 'branded', 's
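To see what the pattern `[a-zA-Z]{3,}` keeps and what it drops, the sketch below applies it to a short sentence with Python's built-in `re.findall`. By default `RegexpTokenizer` collects the matches of its pattern, so the result should mirror what the tokenizer returns. The sample sentence is an assumption made up for illustration.

```python
import re

# Hypothetical sample sentence (not read from the corpus file).
sample = "this co-branding agreement is made as of june 21, 1999 by i-escrow, inc."

# Runs of at least three letters; digits, punctuation and hyphens never match.
pattern = re.compile(r"[a-zA-Z]{3,}")
tokens = pattern.findall(sample)
print(tokens)
```

Note the side effect on hyphenated words: "co-branding" loses its two-letter prefix and becomes "branding", which is also what the tokenizer output above shows.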

### Remove stop words

Remove stop words such as "is", "the", and "a" using the nltk stop word list; they carry little information for our analysis.

In [10]:
stopword = stopwords.words('english')
print(stopword)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

To get rid of stop words, we compare each token against the words in the stop word list. This can be done easily with a list comprehension.

A list comprehension expresses a for loop in a single line. List comprehensions are considered very readable and are thus used frequently by Pythonistas.

In [11]:
clean_tokens = [word for word in word_tokens if word not in stopword]
print(clean_tokens)

['branding', 'advertising', 'agreement', 'branding', 'advertising', 'agreement', 'agreement', 'made', 'june', 'effective', 'date', 'escrow', 'inc', 'principal', 'place', 'business', 'amphlett', 'blvd', 'suite', 'san', 'mateo', 'california', 'escrow', 'themart', 'com', 'inc', 'principal', 'place', 'business', 'von', 'karman', 'avenue', 'floor', 'irvine', 'california', 'themart', 'definitions', 'content', 'means', 'content', 'information', 'medium', 'provided', 'party', 'party', 'use', 'conjunction', 'performance', 'obligations', 'hereunder', 'including', 'without', 'limitation', 'text', 'music', 'sound', 'photographs', 'video', 'graphics', 'data', 'software', 'content', 'provided', 'themart', 'referred', 'herein', 'themart', 'content', 'content', 'provided', 'escrow', 'referred', 'herein', 'escrow', 'content', 'branded', 'site', 'means', 'web', 'site', 'accessible', 'domain', 'name', 'services', 'implemented', 'escrow', 'homepage', 'web', 'site', 'visibly', 'display', 'themart', 'marks'
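The list comprehension above is shorthand for an explicit for loop. The sketch below writes out both forms side by side on a tiny example; the stop word set and the token list are hypothetical and chosen only for illustration.

```python
# Hypothetical mini stop word set and token list for illustration.
stop_words = {"and", "the", "this", "with", "its"}
tokens = ["branding", "and", "advertising", "agreement", "this", "the", "agreement"]

# Explicit loop: keep a token only if it is not a stop word.
kept_loop = []
for word in tokens:
    if word not in stop_words:
        kept_loop.append(word)

# The same logic written as a list comprehension.
kept_comprehension = [word for word in tokens if word not in stop_words]

print(kept_loop)
print(kept_loop == kept_comprehension)
```

The example stores the stop words in a set rather than a list, because membership tests on a set are faster for long documents. Wrapping the nltk list as `set(stopwords.words('english'))` would have the same effect.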

### Stemming

Often we want to map the different forms of the same word to the same root word, e.g. "walks", "walking", "walked" should all be the same as "walk".

In [12]:
print(clean_tokens)

['branding', 'advertising', 'agreement', 'branding', 'advertising', 'agreement', 'agreement', 'made', 'june', 'effective', 'date', 'escrow', 'inc', 'principal', 'place', 'business', 'amphlett', 'blvd', 'suite', 'san', 'mateo', 'california', 'escrow', 'themart', 'com', 'inc', 'principal', 'place', 'business', 'von', 'karman', 'avenue', 'floor', 'irvine', 'california', 'themart', 'definitions', 'content', 'means', 'content', 'information', 'medium', 'provided', 'party', 'party', 'use', 'conjunction', 'performance', 'obligations', 'hereunder', 'including', 'without', 'limitation', 'text', 'music', 'sound', 'photographs', 'video', 'graphics', 'data', 'software', 'content', 'provided', 'themart', 'referred', 'herein', 'themart', 'content', 'content', 'provided', 'escrow', 'referred', 'herein', 'escrow', 'content', 'branded', 'site', 'means', 'web', 'site', 'accessible', 'domain', 'name', 'services', 'implemented', 'escrow', 'homepage', 'web', 'site', 'visibly', 'display', 'themart', 'marks'

In [13]:
snowball_stemmer = SnowballStemmer('english')
stemmed_token = [snowball_stemmer.stem(word) for word in clean_tokens]
print(stemmed_token)

['brand', 'advertis', 'agreement', 'brand', 'advertis', 'agreement', 'agreement', 'made', 'june', 'effect', 'date', 'escrow', 'inc', 'princip', 'place', 'busi', 'amphlett', 'blvd', 'suit', 'san', 'mateo', 'california', 'escrow', 'themart', 'com', 'inc', 'princip', 'place', 'busi', 'von', 'karman', 'avenu', 'floor', 'irvin', 'california', 'themart', 'definit', 'content', 'mean', 'content', 'inform', 'medium', 'provid', 'parti', 'parti', 'use', 'conjunct', 'perform', 'oblig', 'hereund', 'includ', 'without', 'limit', 'text', 'music', 'sound', 'photograph', 'video', 'graphic', 'data', 'softwar', 'content', 'provid', 'themart', 'refer', 'herein', 'themart', 'content', 'content', 'provid', 'escrow', 'refer', 'herein', 'escrow', 'content', 'brand', 'site', 'mean', 'web', 'site', 'access', 'domain', 'name', 'servic', 'implement', 'escrow', 'homepag', 'web', 'site', 'visibl', 'display', 'themart', 'mark', 'escrow', 'mark', 'custom', 'mean', 'user', 'access', 'brand', 'site', 'domain', 'name', '
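Stems such as "advertis" or "busi" are not dictionary words because a stemmer only strips suffixes by rule. The toy function below illustrates that idea with three hand-picked suffixes. It is a deliberately naive sketch, not the Snowball algorithm, and `naive_stem` is a hypothetical helper.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["walks", "walking", "walked", "agreement"]
stems = [naive_stem(w) for w in words]
print(stems)
```

A real stemmer applies many more rules in several passes. This toy version would turn "business" into "busines", for example, whereas the Snowball output above shows "busi".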

# Lab Session 1: Building A Contract Analyzer

## Import Model Functions

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

## Preprocessing in Sklearn

Sklearn can do the preprocessing for us with the classes CountVectorizer or TfidfVectorizer, but it provides no stemming.

In [15]:
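# Define a CountVectorizer that applies our preprocessing steps;
# min_df=5 keeps only words that occur in at least five documents
vectorizer = CountVectorizer(lowercase=True, stop_words="english", token_pattern=r'[a-zA-Z]{3,}', min_df=5)

# Look at the term-document matrix (one row per contract, one column per word)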
X = vectorizer.fit_transform(corpus["content"])
TDM = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
TDM

Unnamed: 0,aaa,aba,abandon,abandoned,abandonment,abbreviation,abide,abilities,ability,able,...,xxiii,year,yearly,years,yield,york,zero,zip,zone,zoning
0,0,0,0,0,0,0,1,0,0,0,...,0,4,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,3,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,3,0,...,0,3,0,2,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,17,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
494,0,0,0,0,0,0,0,0,2,0,...,0,6,0,0,0,0,0,0,0,0
495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
496,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
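To make the structure of this matrix concrete, the following sketch builds a small one in plain Python: one row per document, one column per vocabulary word, with `min_df` dropping words that occur in too few documents. The three documents and the helpers `tokenize` and `build_matrix` are assumptions for illustration. The sketch skips stop word removal, and `CountVectorizer` does the same job far more efficiently with sparse matrices.

```python
import re
from collections import Counter

# Hypothetical mini corpus.
docs = [
    "The party shall pay the fee.",
    "Each party may audit the fee records.",
    "This agreement binds each party.",
]


def tokenize(text):
    # Lowercase, then keep runs of at least three letters.
    return re.findall(r"[a-zA-Z]{3,}", text.lower())


def build_matrix(documents, min_df=1):
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for c in counts for word in c)
    vocabulary = sorted(w for w, df in doc_freq.items() if df >= min_df)
    # A Counter returns 0 for words that a document does not contain.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = build_matrix(docs, min_df=2)
print(vocabulary)
for row in matrix:
    print(row)
```

With `min_df=2`, words such as "audit" or "agreement" disappear because each occurs in only one of the three documents.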


## Preprocess Train and Test sets

In [16]:
X = corpus["content"]
X

Unnamed: 0,content
0,CO-BRANDING AND ADVERTISING AGREEMENT THIS CO...
1,EXHIBIT 4.25 INFORMATION IN THIS EXHIBIT IDENT...
2,EXHIBIT 10.13 JO...
3,Exhibit 10.31 PURSUANT TO 17 C.F.R. § 240.2...
4,REDACTED COPY CONFIDENTIAL TREATMENT REQUESTE...
...,...
493,Exhibit 10.1 INTELLECTUAL PROPERTY AGREEMENT ...
494,Exhibit 10.2 CERTAIN INFORMATION (INDICATED B...
495,Exhibit 10.17(b) ...
496,"Exhibit 10.1 MANUFACTURING, DESIGN AND MARKETI..."


In [17]:
X_vectorized = vectorizer.fit_transform(X)

In [18]:
y = corpus["Third Party Beneficiary"]
y

Unnamed: 0,Third Party Beneficiary
0,0
1,0
2,0
3,0
4,0
...,...
493,0
494,0
495,0
496,0


In [19]:
classifier = MultinomialNB()
classifier.fit(X_vectorized, y)

In [20]:
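# Predict on the same documents the model was trained on;
# this is optimistic and is replaced by a held-out test set in Lab Session 2
classifier.predict(X_vectorized)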

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
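For intuition about what a multinomial Naive Bayes classifier does with word counts, here is a minimal pure-Python sketch with add-one (Laplace) smoothing. It is not sklearn's implementation; the toy documents and the helpers `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    vocabulary = sorted({word for doc in docs for word in doc})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_prob": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocabulary
            },
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        # Words that were never seen during training are skipped.
        scores[label] = params["log_prior"] + sum(
            params["log_prob"][w] for w in doc if w in params["log_prob"]
        )
    return max(scores, key=scores.get)


# Hypothetical tokenized training documents: 1 = clause present, 0 = absent.
train_docs = [
    ["third", "party", "beneficiary"],
    ["third", "party", "rights"],
    ["payment", "fee", "invoice"],
    ["fee", "payment", "term"],
]
train_labels = [1, 1, 0, 0]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["third", "party"]))
print(predict_nb(model, ["invoice", "fee"]))
```

Each class gets a score made of its prior plus the summed log probabilities of the observed words, and the class with the highest score wins.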

# Lab Session 2: Evaluating our Contract Analyzer

## Pre-Process the Data

In [21]:
# Hold out 30% of the contracts for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

We use our vectorizer to convert the text to a numeric representation. The vocabulary is fitted on the training set only and then applied unchanged to the test set.

In [22]:
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
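Calling `fit_transform` on the training texts but only `transform` on the test texts ensures that the vocabulary comes from the training data alone. The plain-Python sketch below mimics that split of responsibilities; the helpers `fit_vocabulary` and `transform` and the example texts are hypothetical. Words that appear only in the test text are dropped, which is also how `CountVectorizer.transform` treats unseen words.

```python
import re


def tokenize(text):
    # Lowercase, then keep runs of at least three letters.
    return re.findall(r"[a-zA-Z]{3,}", text.lower())


def fit_vocabulary(train_texts):
    """Learn the column order from the training texts only."""
    words = sorted({w for text in train_texts for w in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def transform(texts, vocabulary):
    """Count known words; words missing from the vocabulary are dropped."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows


# Hypothetical example texts.
train_texts = ["licensee shall pay royalty", "royalty due each quarter"]
test_texts = ["licensee shall indemnify licensor"]

vocabulary = fit_vocabulary(train_texts)
train_rows = transform(train_texts, vocabulary)
test_rows = transform(test_texts, vocabulary)
print(sorted(vocabulary))
print(test_rows)
```

Because both matrices share the same columns, a classifier trained on the training rows can score the test rows directly. No information from the test set leaks into the features.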

## Train the model

In [23]:
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

## Evaluate Predictions

In [24]:
predictions = classifier.predict(X_test_vectorized)
accuracy_score(y_test, predictions)

0.9266666666666666
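Accuracy alone can flatter a classifier when one class is rare, as the positive label appears to be in this task. The sketch below computes accuracy, precision and recall by hand and compares the model with a baseline that always predicts 0. The labels are hypothetical, not the notebook's results.

```python
# Hypothetical labels for ten test documents (not the notebook's data).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]


def accuracy(truth, pred):
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def precision_recall(truth, pred, positive=1):
    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Baseline that never predicts the positive class.
always_zero = [0] * len(y_true)

print(accuracy(y_true, y_pred))
print(accuracy(y_true, always_zero))
print(precision_recall(y_true, y_pred))
print(precision_recall(y_true, always_zero))
```

In this toy case the model and the baseline reach the same accuracy, and only precision and recall reveal the difference. For the real data, the already imported `classification_report(y_test, predictions)` reports these metrics per class. Assuming the 30% split of 498 contracts rounds up to 150 test documents, the accuracy above corresponds to 139 correct predictions.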