## Extractive Text Summarisation, based on the links below:
* https://www.mygreatlearning.com/blog/text-summarization-in-python/#Approaches%20used%20for%20Text%20Summarization
* https://affine.ai/how-to-build-a-legal-document-summarizer/
* https://medium.com/data-science-in-your-pocket/text-summarization-using-textrank-in-nlp-4bce52c5b390

### Import Relevant Libraries

In [51]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import pandas as pd
import numpy as np
import json
import re

import spacy

from gensim.models import Word2Vec
from scipy import spatial
import networkx as nx

### Load Dataset

In [56]:
#CUAD
# contracts_cuad = pd.read_excel('../data/contract_new.xlsx')
# contracts_cuad['content'] = contracts_cuad['content'].apply(lambda x: x.lower())
# contracts_cuad.head()

In [45]:
#BillSum
billsum_train = pd.read_excel('../data/billsum_train.xlsx')

billsum_train['content'] = billsum_train['content'].apply(lambda x: x.lower())
billsum_train['summary'] = billsum_train['summary'].apply(lambda x: x.lower())
billsum_train.head()

| Unnamed: 0 | contract | content | summary |
|---|---|---|---|
| 0 | To amend the Public Health Service Act to esta... | section 1. short title. this act may be cited ... | border hospital survival and illegal immigrant... |
| 1 | To amend the Richard B. Russell National Schoo... | section 1. short title. this act may be cited ... | farm to school improvements act of 2010 - amen... |
| 2 | A bill to amend title 38, United States Code, ... | section 1. short title. this act may be cited ... | persian gulf war illness compensation act of 2... |
| 3 | A bill to provide for additional outreach and ... | section 1. short title. this act may be cited ... | medicare part d outreach and enrollment enhanc... |
| 4 | To amend the Internal Revenue Code of 1986 to ... | section 1. short title. this act may be cited ... | seniors' retirement recovery act of 2002 - ame... |


### Text Pre-processing

In [55]:
input_text = str(billsum_train['content'][0])
print(input_text)

In [47]:
sentences = sent_tokenize(input_text)

In [53]:
# remove punctuation and special characters
sentences_clean=[re.sub(r'[^\w\s]','',sentence.lower()) for sentence in sentences]

# remove stopwords
stopWords = set(stopwords.words("english"))

# split() with no argument splits on any run of whitespace, so double spaces and
# newlines do not produce empty or fused tokens
sentence_tokens=[[word for word in sentence.split() if word not in stopWords] for sentence in sentences_clean]
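
The cleaning steps in this cell can be tried on a single sentence without downloading any NLTK data. The following is a minimal sketch, not the notebook's exact pipeline: the `clean_and_tokenize` helper and the tiny `STOPWORDS` set are illustrative assumptions, standing in for NLTK's English stopword list.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "this", "may", "be", "as"}

def clean_and_tokenize(sentence, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace and drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
    # split() with no argument collapses runs of spaces and newlines
    return [word for word in cleaned.split() if word not in stopwords]

example = "Section 1.  Short title: this Act may be cited\nas the 'Border Act'."
tokens = clean_and_tokenize(example)
print(tokens)
```

Splitting the same cleaned string on a single space would instead yield an empty token for the double space and one fused token for `cited\nas`, both of which would end up in the Word2Vec vocabulary.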

In [105]:
# word embedding
w2v=Word2Vec(sentence_tokens, vector_size=1, min_count=1, epochs=1000)

sentence_embeddings=[[w2v.wv[word][0] for word in words] for words in sentence_tokens]
max_len=max([len(tokens) for tokens in sentence_tokens])

# padding
sentence_embeddings=[np.pad(embedding,(0,max_len-len(embedding)),'constant') for embedding in sentence_embeddings]

In [100]:
def cosine(u, v):
#     u = np.squeeze(np.asarray(u))
#     v = np.squeeze(np.asarray(v))
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [106]:
# creating a matrix of NxN where N is the total number of sentences in the text
similarity_matrix = np.zeros([len(sentence_tokens), len(sentence_tokens)])

# calculate the similarity between every pair of sentences
for i,row_embedding in enumerate(sentence_embeddings):
    for j,column_embedding in enumerate(sentence_embeddings):
        similarity_matrix[i][j]=1-spatial.distance.cosine(row_embedding,column_embedding)
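
Cosine similarity is undefined for an all-zero vector, which is what a sentence becomes here if every token is removed during cleaning. The sketch below shows one way to guard against that and to fill the matrix using its symmetry. It is a pure-Python illustration under stated assumptions: the helper names `cosine_similarity` and `build_similarity_matrix` are hypothetical, and returning 0.0 for a zero vector is a design choice, not something the notebook does.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty sentence is treated as similar to nothing
    return dot / (norm_u * norm_v)

def build_similarity_matrix(vectors):
    """Return an N x N list-of-lists of pairwise cosine similarities."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # similarity is symmetric, so compute each pair once
            sim = cosine_similarity(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = sim
    return matrix

# Toy padded "sentence embeddings"; the last one mimics an empty sentence.
toy_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
toy_matrix = build_similarity_matrix(toy_vectors)
```

Computing only the upper triangle roughly halves the number of similarity calls compared with the double loop over all `(i, j)` pairs.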

### Entity Recognition

In [25]:
# load large English NLP model
nlp = spacy.load('en_core_web_lg')

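# parse text with spaCy
# NOTE: clean_text is not defined in any cell shown in this notebook;
# it must hold the cleaned documents before this cell is run
text=clean_text[0]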
doc = nlp(text)

In [26]:
# print entities for first document
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

section 1 (LAW)
sec (ORG)
2 (CARDINAL)
congress (ORG)
1 (CARDINAL)
2 (CARDINAL)
united states (GPE)
42 (CARDINAL)
united states (GPE)
4 (CARDINAL)
5 (CARDINAL)
332 (CARDINAL)
42 (CARDINAL)
arizona (GPE)
3 (CARDINAL)
section 322 public health service (LAW)
42 (CARDINAL)
249 (CARDINAL)
1 (CARDINAL)
5 year (DATE)
6 (CARDINAL)
411 (CARDINAL)
1 (CARDINAL)
1996 8 (DATE)
1 (CARDINAL)
arizona (GPE)
3 (CARDINAL)
arizona (GPE)
2 (CARDINAL)
3 (CARDINAL)
arizona (GPE)
1 (CARDINAL)
42 (CARDINAL)
1 (CARDINAL)
6 (CARDINAL)
united states (GPE)
24 hours (TIME)
3 (CARDINAL)
united states (GPE)
united states (GPE)
5 (CARDINAL)
8 (CARDINAL)
one year (DATE)
4 (CARDINAL)
annual (DATE)
congress (ORG)
previous year (DATE)
united states (GPE)
6 (CARDINAL)
5 fiscal years (DATE)
fiscal year (DATE)
health resources services administration department health human services (ORG)
50 000 000 year (MONEY)
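
Most of the entities printed above are bare section numbers tagged `CARDINAL`. A short sketch of how the output could be tallied and filtered follows. It assumes the entities have already been collected as `(text, label)` pairs, for example from `entity.text` and `entity.label_` in the loop above. The pairs here are copied from the first ten lines of that output so the block runs without spaCy, and the choice to drop `CARDINAL` is only an illustration.

```python
from collections import Counter

# (text, label) pairs copied from the first ten lines of the output above
entities = [
    ("section 1", "LAW"), ("sec", "ORG"), ("2", "CARDINAL"), ("congress", "ORG"),
    ("1", "CARDINAL"), ("2", "CARDINAL"), ("united states", "GPE"),
    ("42", "CARDINAL"), ("united states", "GPE"), ("4", "CARDINAL"),
]

# How often each entity label occurs
label_counts = Counter(label for _, label in entities)

# Keep only entities that are not bare numbers, without repeating duplicates
informative = []
for text, label in entities:
    if label != "CARDINAL" and (text, label) not in informative:
        informative.append((text, label))

print(label_counts.most_common())
print(informative)
```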


### Summarize using Avg Sentence Score

In [113]:
#tokenise the text
# words = word_tokenize(input_text)

In [114]:
#create a freq table to keep the score of each word
# freqTable = dict()
# for word in words:
#     if word in stopWords:
#         continue
#     if word in freqTable:
#         freqTable[word] += 1
#     else:
#         freqTable[word] = 1

In [115]:
# Create dictionary to keep the score of each sentence
# sentences = sent_tokenize(input_text)
# sentenceValue = dict()
# for sentence in sentences:
#     for word, freq in freqTable.items():
#         if word in sentence:
#             if sentence in sentenceValue:
#                 sentenceValue[sentence] += freq
#             else:
#                 sentenceValue[sentence] = freq
                
# sumValue = 0
# for sentence in sentenceValue:
#     sumValue += sentenceValue[sentence]

In [116]:
# Average value of a sentence from the original text
# avg = int(sumValue/len(sentenceValue))

In [117]:
# Storing sentences into our summary
# summary = ''
# summary_list = []
# for sentence in sentences:
#     if (sentence in sentenceValue) and (sentenceValue[sentence]>(1.2*avg)):
#         summary_list.append(sentence)
#         summary += ' '+ sentence
        
# print(summary)

In [118]:
# print(len(sentences))
# print(len(summary_list))
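
The cells in this section are commented out. As a compact, runnable illustration of the same idea (score each sentence by the frequencies of the words it contains, then keep sentences scoring above 1.2 times the average), here is a standard-library sketch. It is a variant rather than a copy of the cells above: it matches whole tokens instead of substrings, keeps the average as a float, and uses a naive regex sentence splitter and a tiny stopword set, all of which are assumptions made so the example runs without NLTK.

```python
import re
from collections import Counter

STOPWORDS = {"the", "near", "a", "of"}  # illustrative only

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def frequency_summary(text, threshold=1.2, stopwords=STOPWORDS):
    """Return sentences whose score exceeds threshold * average score."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(w for w in tokenize(text) if w not in stopwords)
    # Each distinct non-stopword in a sentence contributes its document frequency once
    scores = [sum(freq[w] for w in set(tokenize(s)) if w not in stopwords)
              for s in sentences]
    average = sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores) if score > threshold * average]

toy_text = ("The pilot program funds border hospitals. "
            "Hospitals near the border treat patients. "
            "The program ends soon.")
toy_summary = frequency_summary(toy_text)
print(toy_summary)
```

On the toy text the three sentences score 8, 6 and 4, so the average is 6 and only the first sentence clears the 7.2 cut-off.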

### TextRank Algorithm (Using PageRank)

In [107]:
#convert similarity matrix to a network/graph
nx_graph = nx.from_numpy_array(similarity_matrix)

#apply pagerank
scores = nx.pagerank(nx_graph)
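
To make the ranking step less of a black box, the following is a minimal power-iteration sketch of weighted PageRank on a small similarity matrix. It is not networkx's implementation: the `pagerank` function below is a simplified stand-in that assumes non-negative weights, uses the same default damping factor as `nx.pagerank` (0.85), and spreads the rank of nodes without outgoing weight evenly across all nodes.

```python
def pagerank(weights, damping=0.85, tol=1e-9, max_iter=200):
    """Power-iteration PageRank on a non-negative weight matrix (list of lists)."""
    n = len(weights)
    out_strength = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by nodes without outgoing weight is spread evenly over all nodes
        dangling = sum(ranks[j] for j in range(n) if out_strength[j] == 0)
        new_ranks = []
        for i in range(n):
            incoming = sum(
                ranks[j] * weights[j][i] / out_strength[j]
                for j in range(n)
                if out_strength[j] > 0
            )
            new_ranks.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_ranks, ranks)) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks

# Toy graph: sentence 0 is similar to both other sentences, which are unrelated to each other
toy_weights = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
toy_ranks = pagerank(toy_weights)
best_sentence = max(range(len(toy_ranks)), key=lambda i: toy_ranks[i])
```

The sentence that is similar to the most other sentences accumulates the highest score, which is the intuition behind using PageRank scores to pick summary sentences.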

In [108]:
#create a sorted dictionary with sentences and their pagerank value. pick the top 4 sentences
top_sentence={sentence:scores[index] for index,sentence in enumerate(sentences)}
top=dict(sorted(top_sentence.items(), key=lambda x: x[1], reverse=True)[:4])

In [110]:
#print the top 4 sentences
summary = ''
for sent in sentences:
    if sent in top.keys():
        summary += sent + ' '
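
Keying the dictionary by sentence text means that two identical sentences share one entry, so a repeated sentence can appear more than once in the summary. A possible alternative, sketched below, selects by sentence index instead and then restores document order. The helper name `top_k_in_document_order` is hypothetical. `scores` may be a list or a dict keyed by index, which is the shape `nx.pagerank` returns for a graph built from a matrix.

```python
import heapq

def top_k_in_document_order(sentences, scores, k=4):
    """Pick the k highest-scoring sentences by index, returned in original order."""
    top_indices = heapq.nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]

# Toy data with a repeated sentence
toy_sentences = ["a.", "b.", "a.", "c."]
toy_scores = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}
toy_top = top_k_in_document_order(toy_sentences, toy_scores, k=2)
toy_summary_text = " ".join(toy_top)
```

With the toy data the result contains exactly two sentences, whereas matching on sentence text would also pull in the low-scoring first copy of `a.`.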

In [111]:
print(summary)

(2) the immigration and naturalization service does not take into custody all aliens who are unlawfully present in the united states. 1182(d)(5)) for less than one year in order to receive treatment for an emergency medical condition. each report shall contain at least the following information: ``(a) the number of aliens to whom assistance was rendered for which payment was made under this subsection during the previous year. ``(e) the feasibility and estimated cost of expanding the pilot program to items and services provided anywhere in the southwest border region of the united states. 


In [112]:
print(billsum_train['summary'][0])

border hospital survival and illegal immigrant care act - amends the public health service act to direct the secretary of health and human services to establish a five-year pilot program of health care provider reimbursement for the costs associated with providing emergency medical and ambulance services in arizona to: (1) illegal aliens who are not detained by any federal, state, or local law enforcement authority. or (2) aliens paroled into the united states for less than one year to receive emergency medical treatment.


### Evaluation - ROUGE and BLEU
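
As a starting point for this section, here is a simplified sketch of ROUGE-1, which measures unigram overlap between a generated summary and a reference summary. It is an illustration only: the `rouge_1` function is a hand-rolled approximation with plain lowercase word tokens and no stemming, not the output of an established evaluation package, and BLEU is not covered by it.

```python
import re
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared word (clipped overlap)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

toy_candidate = "the pilot program funds hospitals"
toy_reference = "the program funds border hospitals in arizona"
toy_rouge = rouge_1(toy_candidate, toy_reference)
print(toy_rouge)
```

In the notebook, the candidate would be the `summary` string produced by TextRank and the reference would be `billsum_train['summary'][0]`.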