<a href="https://colab.research.google.com/github/casbdai/unboxing_sessions/blob/main/TextminingContractAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building A Contract Analyzer

A short introduction to processing textual data and to text classification

# Basic Setup


Install the nltk library for text processing and download the additional resources it requires: the stop word lists and the Punkt tokenizer models.

In [None]:
!pip install nltk
import nltk
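nltk.download('stopwords')  # stop word lists used further below
nltk.download('punkt')      # Punkt models used by nltk.word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead of 'punkt'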

In [None]:
# we import a series of specific functions from the nltk package for processing the texts.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer

# we import pandas for reading in files
import pandas as pd

## Read in the data

In [None]:
corpus = pd.read_excel("https://raw.githubusercontent.com/casbdai/datasets/main/cuad_data.xlsx")
corpus.head()

In [None]:
corpus.info()

We extract the first document and store it in a variable called text.

In [None]:
text = corpus["content"][0]
print(text)

## Pre-Processing Textual Data

### Convert text to lower case:

In [None]:
lower_text = text.lower()
print(lower_text)

### Tokenize text

Break the text down into tokens, i.e., split the sentences into single words for analysis.

In [None]:
word_tokens = nltk.word_tokenize(lower_text)
print(word_tokens)

We need a better tokenizer: punctuation and numbers are retained as tokens, and so are very short words.


In [None]:
better_tokenizer = RegexpTokenizer(r'[a-zA-Z]{3,}')
# [a-zA-Z] means that only letters are retained as tokens
# {3,} means that only tokens with at least three characters are retained

In [None]:
word_tokens = better_tokenizer.tokenize(lower_text)
print(word_tokens)
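
To see what the pattern keeps and drops, here is a minimal sketch using Python's built-in re module instead of nltk. It assumes that re.findall applies the pattern in the same way as the tokenizer does for this simple case; the sample sentence is made up for illustration.

```python
import re

# Made-up sample sentence, already lower-cased as in the steps above
sample = "section 12.3: the licensee shall pay us $500 by may 1."

# Same pattern as above: runs of at least three letters
tokens = re.findall(r'[a-zA-Z]{3,}', sample)
print(tokens)
# ['section', 'the', 'licensee', 'shall', 'pay', 'may']
```

Numbers, punctuation, and the two-letter words "us" and "by" are gone. The stop word "the" is still there and is handled in the next step.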

### Remove stop words

Remove uninformative words such as "is", "the", and "a" using the nltk stop word list, as they don't carry much information.

In [None]:
stopword = stopwords.words('english')
print(stopword)

To get rid of stop words, we compare each token against the words in the stop word list. This can easily be done with a list comprehension.

A list comprehension is a compact way of writing a for loop that builds a list. List comprehensions are considered very readable and are therefore used frequently by Pythonistas.

In [None]:
clean_tokens = [word for word in word_tokens if word not in stopword]
print(clean_tokens)
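
For comparison, the list comprehension above does the same work as an explicit for loop. The sketch below uses a made-up token list and a small hand-picked stop word set (both are assumptions for illustration, not the nltk list) so that it runs on its own.

```python
# Made-up tokens and a tiny hand-picked stop word set for illustration
sample_tokens = ["the", "licensee", "shall", "not", "assign", "this", "agreement"]
sample_stopwords = {"the", "shall", "not", "this"}

# Explicit for loop: append every token that is not a stop word
kept_loop = []
for word in sample_tokens:
    if word not in sample_stopwords:
        kept_loop.append(word)

# Equivalent list comprehension
kept_comprehension = [word for word in sample_tokens if word not in sample_stopwords]

print(kept_loop)                        # ['licensee', 'assign', 'agreement']
print(kept_loop == kept_comprehension)  # True
```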

### Stemming

Often we want to map the different forms of the same word to the same root word, e.g. "walks", "walking", "walked" should all be the same as "walk".

In [None]:
print(clean_tokens)

In [None]:
snowball_stemmer = SnowballStemmer('english')
stemmed_token = [snowball_stemmer.stem(word) for word in clean_tokens]
print(stemmed_token)
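
The idea of mapping word forms to a shared root can be illustrated with a deliberately crude suffix stripper. This is not the Snowball algorithm, which applies many more rules; the function name toy_stem and its suffix list are assumptions made up for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

forms = ["walks", "walking", "walked", "walk"]
stems = [toy_stem(word) for word in forms]
print(stems)
# ['walk', 'walk', 'walk', 'walk']
```

A rule set this small fails quickly on real text, which is why a proper stemmer such as the SnowballStemmer is used above.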

# Lab Session 1: Building A Contract Analyzer

## Import Model Functions

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

## Preprocessing in Sklearn

Sklearn can do the preprocessing for us with CountVectorizer or TfidfVectorizer, but neither of them provides stemming.

In [None]:
# Define a CountVectorizer with our preprocessing steps
# min_df=5 keeps only terms that occur in at least five documents
vectorizer = CountVectorizer(lowercase=True, stop_words="english", token_pattern=r'[a-zA-Z]{3,}', min_df=5)

# Look at the resulting term matrix (one row per document, one column per term)
X = vectorizer.fit_transform(corpus["content"])
TDM = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
TDM
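
To make the counting step less of a black box, the following sketch builds a small term matrix by hand with the standard library. It is an illustration of the idea, not how sklearn is implemented; the three documents are made up, and stop word removal is left out to keep the sketch short.

```python
import re
from collections import Counter

# Made-up mini corpus for illustration
docs = [
    "The licensee shall pay the fee.",
    "The licensor may terminate the agreement.",
    "The fee is due when the agreement starts.",
]

# Lower-case and tokenize with the same pattern as above
pattern = re.compile(r'[a-zA-Z]{3,}')
tokenized = [pattern.findall(doc.lower()) for doc in docs]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Keep only terms that occur in at least min_df documents
min_df = 2
vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)

# One row per document, one column per vocabulary term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocab])

print(vocab)   # ['agreement', 'fee', 'the']
print(matrix)  # [[0, 1, 2], [1, 0, 2], [1, 1, 2]]
```

The column for "the" carries the same count in every row, which is the kind of uninformative feature that stop word removal gets rid of.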

## Preprocess Train and Test sets

In [None]:
X = corpus["content"]
X

In [None]:
X_vectorized = vectorizer.fit_transform(X)

In [None]:
y = corpus["Third Party Beneficiary"]
y

In [None]:
classifier = MultinomialNB()
classifier.fit(X_vectorized, y)

In [None]:
classifier.predict(X_vectorized)

# Lab Session 2: Evaluating our Contract Analyzer

## Pre-Process the Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1)

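We use our vectorizer to convert the text to a numeric representation. The vocabulary is fitted on the training set only; the test set is then transformed with that same vocabulary.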

In [None]:
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
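
Why fit on the training set and only transform the test set? The sketch below mimics the two steps in plain Python. The helper names tokenize and transform and the example documents are made up for illustration, and the real vectorizer does more (stop words, min_df).

```python
import re

def tokenize(text):
    """Lower-case the text and keep runs of at least three letters."""
    return re.findall(r'[a-zA-Z]{3,}', text.lower())

# Made-up training and test documents
train_docs = ["licensee shall pay fee", "licensor may terminate agreement"]
test_docs = ["licensee may sublicense agreement"]

# "fit": the vocabulary is learned from the training documents only
vocab = sorted({term for doc in train_docs for term in tokenize(doc)})

def transform(doc):
    """Count how often each vocabulary term occurs in the document."""
    tokens = tokenize(doc)
    return [tokens.count(term) for term in vocab]

test_vector = transform(test_docs[0])
print(vocab)
print(test_vector)  # [1, 0, 1, 0, 1, 0, 0, 0]
```

The word "sublicense" never appeared during training, so it has no column and is ignored. Fitting on the test set as well would let information about the test documents leak into the features.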

## Train the model

In [None]:
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

## Evaluate Predictions

In [None]:
predictions = classifier.predict(X_test_vectorized)
accuracy_score(y_test, predictions)
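
Accuracy is simply the share of correct predictions. The sketch below computes it by hand, together with precision and recall for the positive class, which are among the figures that the imported classification_report function prints. The labels are invented toy values, and the 0/1 encoding is an assumption; the actual encoding of the "Third Party Beneficiary" column may differ.

```python
# Invented toy labels (1 = clause present, 0 = clause absent); not real results
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Counts for the positive class
true_pos = sum(t == 1 and p == 1 for t, p in pairs)
false_pos = sum(t == 0 and p == 1 for t, p in pairs)
false_neg = sum(t == 1 and p == 0 for t, p in pairs)

precision = true_pos / (true_pos + false_pos)  # how many predicted positives are right
recall = true_pos / (true_pos + false_neg)     # how many actual positives are found

print(accuracy)  # 0.75
print(round(precision, 2), round(recall, 2))
```

When one class is rare, accuracy alone can look good even if the classifier misses most positive cases, so precision and recall are worth checking as well.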