## This extractive text summarisation notebook is based on the links below:
* https://www.mygreatlearning.com/blog/text-summarization-in-python/#Approaches%20used%20for%20Text%20Summarization
* https://affine.ai/how-to-build-a-legal-document-summarizer/
* https://medium.com/data-science-in-your-pocket/text-summarization-using-textrank-in-nlp-4bce52c5b390
* https://colab.research.google.com/github/dipanjanS/nlp_workshop_odsc19/blob/master/Module05%20-%20NLP%20Applications/Project06%20-%20Text%20Summarization.ipynb#scrollTo=QLjP9KgbFUNi

## Evaluation - ROUGE and BLEU (BLEU is ignored here because it is mainly used for machine translation)
* Difference between ROUGE and BLEU: https://stackoverflow.com/questions/38045290/text-summarization-evaluation-bleu-vs-rouge#:~:text=Bleu%20measures%20precision%3A%20how%20much,in%20the%20machine%20generated%20summaries.
* ROUGE implementation in Python: https://github.com/pltrdy/rouge
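
To make the precision/recall distinction concrete, here is a minimal sketch of a ROUGE-1 style score. It assumes whitespace tokenisation and set-based unigram overlap; `rouge_1` is a hypothetical helper written for illustration, not the API of the `rouge` package, whose tokenisation and counting details may differ.

```python
def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 (set-based, whitespace tokens)."""
    cand = set(candidate.split())
    ref = set(reference.split())
    overlap = len(cand & ref)
    # Precision: share of candidate unigrams that also occur in the reference.
    p = overlap / len(cand) if cand else 0.0
    # Recall: share of reference unigrams that the candidate recovers.
    r = overlap / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}

scores = rouge_1("the cat sat on the mat", "the cat lay on a mat")
```

A long extractive summary tends to score high on recall and low on precision, because it covers most reference words but also contains many words the reference lacks.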

### Import Relevant Libraries

In [2]:
import nltk
from nltk.corpus import stopwords
stopWords = set(stopwords.words("english"))
from nltk.tokenize import word_tokenize, sent_tokenize

import pandas as pd
import numpy as np
import json
import re

import spacy

from gensim.models import Word2Vec
from scipy import spatial
from scipy.sparse.linalg import svds
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

# Import only the Rouge class: a bare `import rouge` would be shadowed later by `rouge = Rouge()`.
from rouge import Rouge
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

### Load Dataset

In [56]:
#CUAD
# contracts_cuad = pd.read_excel('../data/contract_new.xlsx')
# contracts_cuad['content'] = contracts_cuad['content'].apply(lambda x: x.lower())
# contracts_cuad.head()

In [3]:
#BillSum
billsum_train = pd.read_excel('../data/billsum_train.xlsx')

billsum_train['content'] = billsum_train['content'].apply(lambda x: x.lower())
billsum_train['summary'] = billsum_train['summary'].apply(lambda x: x.lower())
billsum_train.head()

| | contract | content | summary |
|---|---|---|---|
| 0 | To amend the Public Health Service Act to esta... | section 1. short title. this act may be cited ... | border hospital survival and illegal immigrant... |
| 1 | To amend the Richard B. Russell National Schoo... | section 1. short title. this act may be cited ... | farm to school improvements act of 2010 - amen... |
| 2 | A bill to amend title 38, United States Code, ... | section 1. short title. this act may be cited ... | persian gulf war illness compensation act of 2... |
| 3 | A bill to provide for additional outreach and ... | section 1. short title. this act may be cited ... | medicare part d outreach and enrollment enhanc... |
| 4 | To amend the Internal Revenue Code of 1986 to ... | section 1. short title. this act may be cited ... | seniors' retirement recovery act of 2002 - ame... |


In [4]:
input_text = str(billsum_train['content'][0])
#print(input_text)

### Summarize using Avg Sentence Score

In [14]:
#tokenise the text
words = word_tokenize(input_text)

In [15]:
#create a freq table to keep the score of each word
freqTable = dict()
for word in words:
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1
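
The same frequency table can be built more compactly with `collections.Counter`. This is a sketch under assumptions: `word_frequencies`, `demo_stop_words` and `demo_tokens` are illustrative names, and the tiny stop-word set stands in for NLTK's English list used in the notebook.

```python
from collections import Counter

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
demo_stop_words = {"the", "of", "a", "to", "is"}

def word_frequencies(tokens, stop_words):
    """Count every token that is not a stop word."""
    return Counter(tok for tok in tokens if tok not in stop_words)

demo_tokens = "the secretary of health is to reimburse the hospital hospital".split()
freq = word_frequencies(demo_tokens, demo_stop_words)
```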

In [16]:
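# Score each sentence by adding the frequency of every table word found in it.
# Note: `word in sentence` is a substring test on a string, so "act" also matches inside "practice".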
sentences = sent_tokenize(input_text)
sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence:
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
                
sumValue = 0
for sentence in sentenceValue:
    sumValue += sentenceValue[sentence]

In [17]:
# Average value of a sentence from the original text
avg = int(sumValue/len(sentenceValue))

In [18]:
# Keep the sentences that score more than 1.2x the average, in document order
summary_ass = ''
summary_list = []
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence]>(1.2*avg)):
        summary_list.append(sentence)
        summary_ass += sentence + ' '
        
print(summary_ass)

1395dd) and state laws require that, if any individual (whether or not lawfully present in the united states) comes to a hospital and the hospital determines that the individual has an emergency medical condition, the hospital must provide either, within the staff and facilities available at the hospital, for such further medical examination and such treatment as may be required to stabilize the medical condition, or, if appropriate, for transfer of the individual to another medical facility. 249) is amended by adding at the end the following: ``(d)(1) the secretary shall establish and implement a 5-year pilot program under which funds made available under paragraph (6) are used to reimburse providers for items and services described in section 411(b)(1) of the personal responsibility and work opportunity reconciliation act of 1996 (8 u.s.c. 1621(b)(1)) provided in arizona to aliens described in paragraph (3), and to reimburse suppliers of emergency ambulance services furnished to such

In [33]:
print(billsum_train['summary'][0])

border hospital survival and illegal immigrant care act - amends the public health service act to direct the secretary of health and human services to establish a five-year pilot program of health care provider reimbursement for the costs associated with providing emergency medical and ambulance services in arizona to: (1) illegal aliens who are not detained by any federal, state, or local law enforcement authority. or (2) aliens paroled into the united states for less than one year to receive emergency medical treatment.


In [19]:
print(len(sentences))
print(len(summary_list))

30
8
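
The scoring loop in In [16] matches words as substrings of the sentence string. A hedged sketch of a token-based variant follows; the regex tokeniser, the float (untruncated) average, and the helper names `score_sentences` and `select_above_average` are assumptions made here, so it will not reproduce the exact numbers above.

```python
import re
from collections import Counter

def score_sentences(sentences, stop_words):
    """Score each sentence by the summed document frequency of its distinct words."""
    tokenised = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    freq = Counter(t for toks in tokenised for t in toks if t not in stop_words)
    # Each distinct non-stop word contributes its document-wide count once,
    # mirroring the notebook's loop but matching whole tokens only.
    return [
        sum(freq[t] for t in set(toks) if t not in stop_words)
        for toks in tokenised
    ]

def select_above_average(sentences, scores, factor=1.2):
    """Keep sentences whose score exceeds factor * mean score, in original order."""
    threshold = factor * (sum(scores) / len(scores))
    return [s for s, sc in zip(sentences, scores) if sc > threshold]

demo = [
    "The act funds hospitals.",
    "Doctors practice.",
    "The act funds hospitals and hospitals report on the act.",
]
demo_scores = score_sentences(demo, {"the", "and", "on"})
picked = select_above_average(demo, demo_scores)
```

In this toy input the second sentence scores 2; with substring matching, "act" inside "practice" would have added 3 more.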


In [20]:
#Rouge Implementation
rouge = Rouge()
scores_ass = rouge.get_scores(summary_ass, billsum_train['summary'][0])

In [21]:
for score, f1 in scores_ass[0].items():
    print(f"{score}:")
    print(f"precision: {f1['p']}")
    print(f"recall: {f1['r']}")
    print(f"f1-score: {f1['f']}\n")

rouge-1:
precision: 0.2033195020746888
recall: 0.765625
f1-score: 0.3213114720937383

rouge-2:
precision: 0.0625
recall: 0.3625
f1-score: 0.10661764455017307

rouge-l:
precision: 0.17012448132780084
recall: 0.640625
f1-score: 0.26885245570029564



### TextRank Algorithm (Using PageRank)

In [22]:
sentences = sent_tokenize(input_text)

In [23]:
# remove punctuations and special characters
sentences_clean=[re.sub(r'[^\w\s]','',sentence.lower()) for sentence in sentences]

# remove stopwords
sentence_tokens=[[words for words in sentence.split(' ') if words not in stopWords] for sentence in sentences_clean]

In [24]:
# word embedding
w2v=Word2Vec(sentence_tokens, vector_size=1, min_count=1, epochs=1000)

sentence_embeddings=[[w2v.wv[word][0] for word in words] for words in sentence_tokens]
max_len=max([len(tokens) for tokens in sentence_tokens])

# padding
sentence_embeddings=[np.pad(embedding,(0,max_len-len(embedding)),'constant') for embedding in sentence_embeddings]

In [25]:
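# Plain-NumPy cosine similarity, kept for reference; the similarity loop uses scipy's spatial.distance.cosine.
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))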

In [26]:
# creating a matrix of NxN where N is the total number of sentences in the text
similarity_matrix = np.zeros([len(sentence_tokens), len(sentence_tokens)])

# calculate the similarity between every pair of sentences
for i,row_embedding in enumerate(sentence_embeddings):
    for j,column_embedding in enumerate(sentence_embeddings):
        similarity_matrix[i][j]=1-spatial.distance.cosine(row_embedding,column_embedding)
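
The padding and similarity steps can be illustrated without any libraries. This is a minimal sketch with made-up one-number-per-word embeddings; `pad` and `cosine_similarity` are hypothetical helpers, and returning 0.0 for an all-zero vector is a choice made here, not something the notebook does.

```python
import math

def pad(vec, length):
    """Right-pad a list of floats with zeros up to the given length."""
    return list(vec) + [0.0] * (length - len(vec))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Hypothetical one-number-per-word embeddings for three short sentences.
demo_embeddings = [[0.5, 0.2], [0.5, 0.2, 0.1], [0.1]]
max_len = max(len(e) for e in demo_embeddings)
padded = [pad(e, max_len) for e in demo_embeddings]

# Cosine similarity equals 1 minus the cosine distance used in the loop above.
similarity = [[cosine_similarity(a, b) for b in padded] for a in padded]
```

Because each vector position corresponds to a word position, this representation compares sentences word slot by word slot rather than by overall meaning.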

In [27]:
#convert similarity matrix to a network/graph
nx_graph = nx.from_numpy_array(similarity_matrix)

#apply pagerank
scores = nx.pagerank(nx_graph)
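
To show what the ranking step computes, here is a minimal power-iteration sketch of PageRank on a weighted similarity matrix. It is an illustration, not networkx's implementation: it assumes non-negative weights and that every row has a positive sum, and `pagerank` and `demo_similarity` are hypothetical names.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a non-negative weighted adjacency matrix."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    if any(s <= 0 for s in out_sums):
        raise ValueError("every node needs positive total outgoing weight")
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Node j receives rank from each node i in proportion to the share
        # of i's outgoing weight that points at j, plus a uniform teleport term.
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * weights[i][j] / out_sums[i] for i in range(n))
            for j in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Sentences 0 and 1 are similar to each other; sentence 2 is an outlier.
demo_similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
demo_scores = pagerank(demo_similarity)
```

Sentences that are similar to many other sentences accumulate more rank, which is why the two mutually similar sentences outrank the outlier in this toy matrix.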

In [28]:
#create a sorted dictionary with sentences and their pagerank value. pick the top 4 sentences
top_sentence={sentence:scores[index] for index,sentence in enumerate(sentences)}
top=dict(sorted(top_sentence.items(), key=lambda x: x[1], reverse=True)[:4])

In [29]:
#join the top 4 sentences in their original document order
summary_tr = ''
for sent in sentences:
    if sent in top.keys():
        summary_tr += sent + ' '
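
The dictionary in In [28] is keyed by sentence text, so repeated sentences (for example several bare "sec." fragments) collapse into a single entry. A hedged alternative is to rank by sentence position instead; `top_k_in_order` is a hypothetical helper, and `scores` may be a list or a dict keyed by index, as `nx.pagerank` returns.

```python
def top_k_in_order(sentences, scores, k=4):
    """Return the k highest-scoring sentences, kept in document order."""
    # Rank sentence positions by score, highest first (ties keep document order).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore document order among the chosen positions.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

demo_sentences = ["sec.", "first point.", "sec.", "second point.", "closing."]
demo_scores = [0.30, 0.25, 0.05, 0.20, 0.20]
demo_summary = " ".join(top_k_in_order(demo_sentences, demo_scores, k=3))
```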

In [30]:
print(summary_tr)

(2) the immigration and naturalization service does not take into custody all aliens who are unlawfully present in the united states. 1182(d)(5)) for less than one year in order to receive treatment for an emergency medical condition. each report shall contain at least the following information: ``(a) the number of aliens to whom assistance was rendered for which payment was made under this subsection during the previous year. ``(e) the feasibility and estimated cost of expanding the pilot program to items and services provided anywhere in the southwest border region of the united states. 


In [32]:
#Rouge Implementation
rouge = Rouge()
scores_tr = rouge.get_scores(summary_tr, billsum_train['summary'][0])

for score, f1 in scores_tr[0].items():
    print(f"{score}:")
    print(f"precision: {f1['p']}")
    print(f"recall: {f1['r']}")
    print(f"f1-score: {f1['f']}\n")

rouge-1:
precision: 0.375
recall: 0.421875
f1-score: 0.3970588185467128

rouge-2:
precision: 0.12087912087912088
recall: 0.1375
f1-score: 0.12865496578092422

rouge-l:
precision: 0.2916666666666667
recall: 0.328125
f1-score: 0.30882352442906574



### LSA

In [5]:
# normalising text
sentences = sent_tokenize(input_text)
def normalize_document(doc):
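    # remove special characters (keep only letters and whitespace), then lower-case;
    # flags must be passed by keyword, since the 4th positional argument of re.sub is `count`
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)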
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stopWords]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_sentences = normalize_corpus(sentences)
norm_sentences[:3]

array(['section short title',
       'act may cited border hospital survival illegal immigrant care act',
       'sec'], dtype='<U461')

In [6]:
# feature extraction
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
dt_matrix = tv.fit_transform(norm_sentences)
dt_matrix = dt_matrix.toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = tv.get_feature_names_out()
td_matrix = dt_matrix.T
print(td_matrix.shape)
pd.DataFrame(np.round(td_matrix, 2), index=vocab).head(10)

(215, 30)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
absorb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
act,0.0,0.42,0.0,0.0,0.0,0.0,0.36,0.0,0.0,0.18,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.08,0.0
acts,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0
adding,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
administration,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0
affairs,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0
alien,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.39,0.0,0.0,0.0,0.0,0.0
aliens,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.17,0.0,...,0.0,0.0,0.19,0.58,0.0,0.0,0.0,0.0,0.0,0.0
ambulance,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
amended,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
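
To show roughly how the TF-IDF weights in this matrix arise, here is a pure-Python sketch using the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each sentence row, which is scikit-learn's default behaviour. Whitespace tokenisation and the name `tfidf_rows` are assumptions made here; `TfidfVectorizer` applies its own token pattern, so values can differ on real text.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """L2-normalised TF-IDF rows using idf = ln((1 + n) / (1 + df)) + 1."""
    tokenised = [doc.split() for doc in documents]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)
    # Document frequency: in how many sentences each term occurs at least once.
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

demo_docs = ["section short title", "act may cited border hospital act", "sec"]
vocab, rows = tfidf_rows(demo_docs)
```

Each row here is one sentence; the notebook transposes the matrix so that rows are terms and columns are sentences.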


In [7]:
def low_rank_svd(matrix, singular_count=2):
    u, s, vt = svds(matrix, k=singular_count)
    return u, s, vt

In [8]:
num_sentences = 8
num_topics = 3

u, s, vt = low_rank_svd(td_matrix, singular_count=num_topics)  
print(u.shape, s.shape, vt.shape)
term_topic_mat, singular_values, topic_document_mat = u, s, vt

(215, 3) (3,) (3, 30)


In [9]:
# remove singular values below threshold                                         
sv_threshold = 0.5
min_sigma_value = max(singular_values) * sv_threshold
singular_values[singular_values < min_sigma_value] = 0

In [10]:
salience_scores = np.sqrt(np.dot(np.square(singular_values), 
                                 np.square(topic_document_mat)))
salience_scores

array([3.28447139e-01, 3.79549888e-01, 1.00000000e+00, 2.48713951e-16,
       1.14872810e-01, 2.76577898e-01, 6.60026826e-01, 5.82696370e-01,
       2.23215803e-01, 7.48100788e-01, 1.60999653e-16, 2.09484052e-01,
       1.00000000e+00, 3.49695771e-01, 8.34256991e-01, 3.75022422e-01,
       5.39183298e-01, 6.47729445e-01, 5.18552492e-01, 4.90260772e-01,
       3.56344363e-01, 1.77027702e-01, 2.60759892e-01, 2.16161894e-01,
       2.50751900e-01, 1.32702058e-01, 2.85298278e-01, 3.96789511e-01,
       3.94045729e-01, 2.85002073e-16])

In [11]:
top_sentence_indices = (-salience_scores).argsort()[:num_sentences]
top_sentence_indices.sort()

In [12]:
summary_lsa = ' '.join(np.array(sentences)[top_sentence_indices])
print(summary_lsa)

sec. (3) section 1867 of the social security act (42 u.s.c. 1395dd) and state laws require that, if any individual (whether or not lawfully present in the united states) comes to a hospital and the hospital determines that the individual has an emergency medical condition, the hospital must provide either, within the staff and facilities available at the hospital, for such further medical examination and such treatment as may be required to stabilize the medical condition, or, if appropriate, for transfer of the individual to another medical facility. (5) the southwest border region has been designated as a health professional shortage area under section 332 of the public health service act (42 u.s.c. sec. section 322 of the public health service act (42 u.s.c. 1621(b)(1)) provided in arizona to aliens described in paragraph (3), and to reimburse suppliers of emergency ambulance services furnished to such aliens for which the transportation originates in arizona (where the use of other

In [13]:
#Rouge Implementation
rouge = Rouge()
scores_lsa = rouge.get_scores(summary_lsa, billsum_train['summary'][0])

for score, f1 in scores_lsa[0].items():
    print(f"{score}:")
    print(f"precision: {f1['p']}")
    print(f"recall: {f1['r']}")
    print(f"f1-score: {f1['f']}\n")

rouge-1:
precision: 0.22627737226277372
recall: 0.484375
f1-score: 0.30845770710229947

rouge-2:
precision: 0.038135593220338986
recall: 0.1125
f1-score: 0.05696202153501067

rouge-l:
precision: 0.17518248175182483
recall: 0.375
f1-score: 0.23880596580876717



### KL-Sum

### Luhn