https://github.com/RamVegiraju/ExtractiveTextSummarizer/blob/master/textsummary.py


### Loading Dependencies

In [None]:
!pip install python-docx
from docx import Document 
import glob
import re
import nltk
import spacy
import string
import heapq
import unicodedata
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

Collecting python-docx
  Downloading https://files.pythonhosted.org/packages/e4/83/c66a1934ed5ed8ab1dbb9931f1779079f8bca0f6bbc5793c06c4b5e7d671/python-docx-0.8.10.tar.gz (5.5MB)
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.10-cp36-none-any.whl size=184491 sha256=be7db243aec8f7c0066a36022643b2fca4d6bb44669fbf7e624e8e83d46f680f
  Stored in directory: /root/.cache/pip/wheels/18/0b/a0/1dd62ff812c857c9e487f27d80d53d2b40531bec1acecfa47b
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.10
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Loading Data

In [None]:
# The SOW documents are hosted on GitHub; we clone the repo whenever we need them.
# This is faster than mounting Google Drive in the Colab notebook.
!git clone https://github.com/NLP-Contracts/NLP-summarization.git
%cd NLP-summarization/Sample\ SoW\ docs
list_docsNames = glob.glob('*.docx')  # use glob to collect the document file names

docs = []
st = ""
for docsName in list_docsNames:
  # use python-docx's Document to load each file and join its paragraph texts into one string
  docs.append(st.join([p.text for p in Document(docsName).paragraphs]))

Cloning into 'NLP-summarization'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 27 (delta 3), reused 25 (delta 1), pack-reused 0
Unpacking objects: 100% (27/27), done.
/content/NLP-summarization/Sample SoW docs
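
Joining paragraph texts with an empty string leaves no whitespace between the end of one paragraph and the start of the next, which is a likely source of fused text such as "ti manager.the telus manager" in the results. The sketch below illustrates the effect with made-up paragraph strings and a hypothetical, deliberately naive `split_sentences` helper; it is a stand-in for illustration, not NLTK's `sent_tokenize`, which may differ in detail.

```python
import re

# Made-up paragraph texts standing in for Document(...).paragraphs
paragraphs = [
    "the telus manager shall be regularly available.",
    "the ti manager shall cooperate with telus.",
]

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

glued = "".join(paragraphs)    # same separator as `st = ""` in the loading cell
spaced = " ".join(paragraphs)  # alternative: keep a space between paragraphs

print(len(split_sentences(glued)))   # 1: no whitespace follows the first period
print(len(split_sentences(spaced)))  # 2
```

Whether a space (or newline) separator is appropriate depends on how the later regex patterns expect the headings to appear, so this is only a pointer to where the fused sentences can come from.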


# Data Cleaning

In [None]:
# basic_cleaner is applied first, to the whole document,
# before the sections are selected with regex pattern matching
def basic_cleaner(s):
  s = s.lower()
  s = re.sub(r'\n', '', s)
  s = re.sub(r'\t', '', s)
  return s
# After each section has been selected with regex pattern matching,
# extra_cleaner removes most punctuation (keeping "." and ","),
# as well as subsection numbers and anything inside parentheses
def extra_cleaner(s):
  # s = re.sub(',', ' ', s)
  s = re.sub(r"[/:'\-<>”–]", '', s)   # drop / : ' - < > ” and –
  s = re.sub(r'\d{1,2}\.\d', '', s)   # remove subsection numbers such as "3.1" (pattern \d{1,2}\.\d)
  s = re.sub(r'\([^)]*\)', '', s)     # remove parentheses and anything inside them
  s = re.sub(r' +', ' ', s)           # collapse the runs of spaces left behind by the removals
  return s
c_docs = [basic_cleaner(_) for _ in docs]
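
To make the effect of the two cleaning passes concrete, here is a minimal, self-contained sketch that applies the same kinds of substitutions to one made-up sentence. The `clean_section` name, the sample text, and the final `.strip()` are assumptions for illustration; the notebook itself applies `basic_cleaner` to whole documents and `extra_cleaner` to extracted sections.

```python
import re

def clean_section(s):
    """Apply the cleaning steps in sequence to a single string."""
    s = s.lower()
    s = re.sub(r"[\n\t]", "", s)           # drop newlines and tabs
    s = re.sub(r"[/:'\-<>”–]", "", s)      # drop selected punctuation
    s = re.sub(r"\d{1,2}\.\d", "", s)      # drop subsection numbers like "3.1"
    s = re.sub(r"\([^)]*\)", "", s)        # drop parenthesised text
    return re.sub(r" +", " ", s).strip()   # collapse spaces, trim the ends

sample = "3.1 Services: TI shall provide day-to-day support (see Appendix A), on-site."
print(clean_section(sample))
# services ti shall provide daytoday support , onsite.
```

Note that removing hyphens fuses hyphenated words ("daytoday", "onsite"), which matches what appears in the summaries.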


# Section Selection 

A review of the SOW documents showed that some sections carry little information that would help the summarization model produce useful results. We therefore excluded the following sections:


1.   Descriptions
2.   Definitions
3.   Reports
4.   Addresses for Administration and Invoicing
5.   Appendices

In [None]:
# Section names are not always consistent across documents, so each pattern uses alternation ("|")
# to match any one of several heading variants. For instance, the heading captured by "section9_charges" appears as
# "charges, expenses and payment terms" in some documents and as "fees, expenses and payment terms" or "expenses and payment terms" in others,
# all of which denote the same section.
section3_services=[''.join(map(str, (re.findall('(?:.0services?|.sevices?| services 3.1?)(.*?) (?:term and schedule?|and schedule?|term and?)', i)))) for i in c_docs]
section4_schedule=[''.join(map(str, (re.findall('(?:term and schedule?|and schedule?|term and?)(.*?)(?:place of performance?|place of performance and hours?|performance and hours)', i)))) for i in c_docs]
section5_PPH = [''.join(map(str, (re.findall('(?:place of performance and hours?|performance and hours?)(.*?)(?:structure and roles|and roles?)', i)))) for i in c_docs]
section6_roles = [''.join(map(str, (re.findall('(?:structure and roles ?|and roles?)(.*?)(?:general responsibilities|responsibilities?)', i)))) for i in c_docs]
section7_responsibilities = [''.join(map(str, (re.findall('(?:general responsibilities)(.*?)(?:charges, expenses and payment terms|fees, expenses and payment terms?|expenses and payment terms?|milestones, deliverables, and acceptance criteria?|8.0 intentionally left blank?|.0 intentionally left blank)', i)))) for i in c_docs]
section9_charges = [''.join(map(str, (re.findall('(?:charges, expenses and payment terms|fees, expenses and payment terms?|expenses and payment terms?)(.*?)(?:specific service levels)', i)))) for i in c_docs]
section12_assumptions=[''.join(map(str, (re.findall('(?:assumptions and additional provisions?)(.*?)(?:addresses for administration and invoicing)', i)))) for i in c_docs]
section14_agreement = [''.join(map(str, (re.findall('(?:.0 agreement?)(.*?)(?:agreed and accepted?)', i)))) for i in c_docs]

# Applying the "extra_cleaner" function to take out brackets, subsection numbers and most punctuation
c_section3_services = [extra_cleaner(i) for i in section3_services]
c_section4_schedule = [extra_cleaner(i) for i in section4_schedule]
c_section5_PPH = [extra_cleaner(i) for i in section5_PPH]
c_section6_roles = [extra_cleaner(i) for i in section6_roles]
c_section7_responsibilities = [extra_cleaner(i) for i in section7_responsibilities]
c_section9_charges = [extra_cleaner(i) for i in section9_charges]
c_section12_assumptions = [extra_cleaner(i) for i in section12_assumptions]
c_section14_agreement = [extra_cleaner(i) for i in section14_agreement]
 
# This corpus is the collection of sections which we selectively chose to remain in the document
corpus = [] 
for i in range(len(docs)): # range of the loop is the number of documents that are introduced 
  corpus.append(c_section3_services[i] +"\n"+ c_section4_schedule[i] + "\n" + c_section5_PPH[i] + "\n" + c_section6_roles[i] + "\n" + c_section7_responsibilities[i] + "\n" + c_section9_charges[i] + "\n" + c_section12_assumptions[i] + "\n" + c_section14_agreement[i])
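
The section-selection idea (capture everything between one heading and the next, allowing for heading variants) can be sketched in isolation. The `extract_between` helper and the sample text below are hypothetical and only illustrate the technique; unlike the patterns in the cell above, the helper escapes the heading strings with `re.escape` so that characters such as "." are matched literally.

```python
import re

def extract_between(text, start_variants, end_variants):
    """Return the text between any start heading and the next end heading."""
    start = "|".join(re.escape(v) for v in start_variants)
    end = "|".join(re.escape(v) for v in end_variants)
    # (?:...) groups the alternatives without capturing them; (.*?) is
    # non-greedy, so the capture stops at the first end heading that follows.
    pattern = rf"(?:{start})(.*?)(?:{end})"
    return "".join(re.findall(pattern, text))

# Made-up, already lower-cased document text
doc = ("9.0 fees, expenses and payment terms rates are billed monthly. "
       "10.0 specific service levels none.")

charges = extract_between(
    doc,
    ["charges, expenses and payment terms",
     "fees, expenses and payment terms",
     "expenses and payment terms"],
    ["specific service levels"],
)
print(repr(charges))
# ' rates are billed monthly. 10.0 '
```

The number of the following section ("10.0") leaks into the capture, which is one reason a later cleaning step that strips section numbers is useful.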

# We are using corpus[0]
corpus[0], i.e. TI_SOW_58_2019_TM_MITS_Stratus_mock.docx, is the sample document we use to display our summarization results.
We tested the model on all documents, but for ease of presentation we show a single sample corpus to Mahmadul.

# TF_IDF summarization function

In [None]:
#summarizer function
# Sentences are scored with normalized term frequencies; no inverse-document-frequency term is computed.
def freq_calculator(text):
    # Count how often each non-stopword token occurs in the text.
    stop_words = set(stopwords.words())  # all languages in the NLTK corpus; built once, not once per token
    word_frequencies = {}
    for word in nltk.word_tokenize(text):
        if word not in stop_words:
            word_frequencies[word] = word_frequencies.get(word, 0) + 1

    if not word_frequencies:  # empty section: nothing to summarize
        return ''

    maximum_frequency = max(word_frequencies.values())

    # Normalize each count by the count of the most frequent word.
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / maximum_frequency

    # Score each sentence as the sum of the normalized frequencies of its words.
    sentence_scores = {}
    sentence_list = sent_tokenize(text)  # tokenize into sentences here, not words
    number_sentences = len(sentence_list)
    for sent in sentence_list:
        for word in word_tokenize(sent.lower()):
            if word in word_frequencies:
                sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

    # A chained comparison is used because `&` binds tighter than `>` and `<=`.
    if 1 < number_sentences <= 5:
        summary_sentences = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)
        summary = ' '.join(summary_sentences)
    elif number_sentences == 1:
        summary = text  # a single sentence is its own summary
    else:
        summary_sentences = heapq.nlargest(1, sentence_scores, key=sentence_scores.get)
        summary = ' '.join(summary_sentences)
    return summary.capitalize()

def mainFunc(text):
    summary = freq_calculator(text)
    return summary
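
The scoring step can be followed by hand on a tiny input. The sketch below is a simplified stand-in, not the function above: it assumes a small illustrative stopword set and naive regex tokenization in place of NLTK's `stopwords`, `word_tokenize` and `sent_tokenize`, so scores on real text will differ. The `score_sentences` name and the sample text are hypothetical.

```python
import heapq
import re

STOP_WORDS = {"the", "a", "is", "and", "of", "in", "to"}  # tiny illustrative set

def score_sentences(text):
    """Score each sentence by the sum of max-normalized word frequencies."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}  # most frequent word -> 1.0

    scores = {}
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = re.findall(r"[a-z]+", sent.lower())
        scores[sent] = sum(weights.get(t, 0) for t in tokens)
    return scores

text = ("The supplier delivers services. "
        "Services are billed monthly. "
        "The parties meet in Vancouver.")
scores = score_sentences(text)
best = heapq.nlargest(1, scores, key=scores.get)
print(scores)
print(best)  # ['Services are billed monthly.']
```

Here "services" occurs twice and gets weight 1.0, every other non-stopword gets 0.5, so the second sentence scores 2.5 against 2.0 and 1.5 for the others. Because scores are sums, longer sentences tend to score higher under this scheme.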


# summaize_full_sections function 

In [None]:
def summaize_full_sections(n_doc):
  sum_full_sec=mainFunc(corpus[n_doc])
  doc = nlp(sum_full_sec)  # run the spaCy pipeline once and reuse the result

  if doc.ents:
    # Highlight the named entities (NER labels) in the summary, if there are any,
    # to assist the reader of the summarized document
    displacy.render(doc, style="ent", jupyter=True)
  else:
    display(sum_full_sec)  # no named entities: display the summary as is
  print("\033[95m" + "Overall total words from the sectioned document After Summarization:"+ "\033[0m",(len(sum_full_sec.split())))
  print("\033[95m" + "Overall total words from the sectioned document before Summarization:"+ "\033[0m",(len(corpus[n_doc].split())))
  print("\033[95m" + "Ratio to the sectioned document: %"+ "\033[0m",(len(sum_full_sec.split())/(len(corpus[n_doc].split()))*100))
  print("\033[95m" + "Overall Original document words before Summarization:"+ "\033[0m",(len(c_docs[n_doc].split())))
  print("\033[95m" + "Ratio to the Original document: %"+ "\033[0m",(len(sum_full_sec.split())/(len(c_docs[n_doc].split()))*100))
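
The two ratios printed by this function are plain word-count percentages. As a check on the arithmetic, here is a minimal sketch with a hypothetical `compression_pct` helper, using the word counts reported for document 0 (590 summary words, 2460 words in the selected sections, 4260 words in the full cleaned document).

```python
def compression_pct(summary_words, source_words):
    """Summary length as a percentage of the source length."""
    if source_words == 0:
        raise ValueError("source text has no words")
    return summary_words / source_words * 100

print(round(compression_pct(590, 2460), 2))  # 23.98, relative to the selected sections
print(round(compression_pct(590, 4260), 2))  # 13.85, relative to the full cleaned document
```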


# Results for summarize_full_sections function

In [None]:
summaize_full_sections(0)

Overall total words from the sectioned document After Summarization: 590
Overall total words from the sectioned document before Summarization: 2460
Ratio to the sectioned document: % 23.983739837398375
Overall Original document words before Summarization: 4260
Ratio to the Original document: % 13.849765258215962


# summaize_by_section function

In [None]:
def summaize_by_section(n_doc):
  # Cleaned sections, keyed by the section name shown in the output
  original_sections = {"Services": c_section3_services,
            "Schedule": c_section4_schedule,
            "Place of Performance and Hours": c_section5_PPH,
            "Role": c_section6_roles,
            "Responsibilities": c_section7_responsibilities,
            "Charge": c_section9_charges,
            "Assumptions": c_section12_assumptions,
            "Agreement": c_section14_agreement}

  # Summarize each section of document n_doc
  sum_sections = {name: mainFunc(sections[n_doc].replace("\n", " "))
                  for name, sections in original_sections.items()}

  for name, summary in sum_sections.items():
    print("\033[95m" + name + "\033[0m")  # display the section name
    print("\t")

    doc = nlp(summary)  # run the spaCy pipeline once and reuse the result
    if doc.ents:
      # Highlight the named entities (NER labels) in the summary, if there are any,
      # to assist the reader of the summarized document
      displacy.render(doc, style="ent", jupyter=True)
    else:
      display(summary)  # no named entities: display the summary as is

    print("\t")
    print("\033[33m" + "summarized length of section" + "\033[0m", len(summary.split()))  # word count of the summary
    print("\033[33m" + "original length of section" + "\033[0m", len(original_sections[name][n_doc].split()))  # word count of the original section

  # Total word count over all section summaries
  total_words = sum(len(summary.split()) for summary in sum_sections.values())
  print("\t")
  print("\033[95m" + "Overall total words from the sectioned document in the Summarized Version:"+ "\033[0m",total_words)
  print("\033[95m" + "Overall total words from the sectioned document before Summarization:"+ "\033[0m",(len(corpus[n_doc].split())))
  print("\033[95m" + "Ratio to the sectioned document: %"+ "\033[0m",(total_words/(len(corpus[n_doc].split())))*100)
  print("\033[95m" + "Overall Original document words before Summarization:"+ "\033[0m",(len(c_docs[n_doc].split())))
  print("\033[95m" + "Ratio to the Original document: %"+ "\033[0m",(total_words/(len(c_docs[n_doc].split())))*100)
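
Per-section ratios make it easier to see where the summarizer actually shortens the text. The sketch below uses the summarized and original word counts reported for document 0; the `section_ratios` helper is hypothetical and only illustrates the bookkeeping.

```python
# (summarized, original) word counts reported for document 0
counts = {
    "Services": (74, 122),
    "Schedule": (126, 149),
    "Place of Performance and Hours": (170, 191),
    "Role": (187, 376),
    "Responsibilities": (383, 383),
    "Charge": (257, 591),
    "Assumptions": (247, 577),
    "Agreement": (52, 71),
}

def section_ratios(counts):
    """Map each section to its summary length as a percentage of the original."""
    return {name: round(s / o * 100, 1) for name, (s, o) in counts.items()}

ratios = section_ratios(counts)
total_summary = sum(s for s, _ in counts.values())
unchanged = [name for name, (s, o) in counts.items() if s == o]

print(total_summary)                # 1496
print(unchanged)                    # ['Responsibilities']
print(min(ratios, key=ratios.get))  # Assumptions
```

The total matches the reported 1496 words, and the check for unchanged sections flags "Responsibilities", whose summary has the same word count as the original section.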

    

# Results for summaize_by_section function

In [None]:
summaize_by_section(0)

Services
	


'The resources provided by ti service representatives scope of duties are directed and managed by telus manager and their scope of duties is therefore open to change and is dependent on the needs and priorities of telus requirements. subject to the agreement, the sowspecific scope of services shall include the following  this sow provides a broad set of it services that are all delivered in a time and materials and staff augmentation delivery model.'

	
summarized length of section 74
original length of section 122
Schedule
	


	
summarized length of section 126
original length of section 149
Place of Performance and Hours
	


	
summarized length of section 170
original length of section 191
Role
	


'The telus manager shall be regularly available to meet with the ti manager.the telus manager shall be responsible for providing qualified ti representatives with function or project specific training, coaching, education and skill development.the parties shall appoint the following key personnel for the sow termfor telus, as telus rep under the agreement for purposes of this sow mock super ;for ti, as ti csm under the agreement for purposes of this sow mock super as ti manager  mock super or delegates as agreed by the parties ti shall be responsible for supplying the below resource plan to telus.the following table summarizes the project scope and scale that are currently identified to provide the services under this sow. the ti manager shall cooperate with telus to perform reviews, ensure ti accomplishes the tasks, activities, services and scope outlined in this sow, manage daytoday activities, and serve as ti’s single point of contact with respect to interfacing with telus.the telus

	
summarized length of section 187
original length of section 376
Responsibilities
	


	
summarized length of section 383
original length of section 383
Charge
	


	
summarized length of section 257
original length of section 591
Assumptions
	


	
summarized length of section 247
original length of section 577
Agreement
	


'This sow and any change orders issued hereunder may be executed by the exchange of signed counterparts by facsimile transmission or electronically in pdf or similar secure format. this sow and any change orders issued hereunder may be executed in counterparts, which when taken together will constitute one and the same document.'

	
summarized length of section 52
original length of section 71
	
Overall total words from the sectioned document in the Summarized Version: 1496
Overall total words from the sectioned document before Summarization: 2460
Ratio to the sectioned document: % 60.8130081300813
Overall Original document words before Summarization: 4260
Ratio to the Original document: % 35.117370892018776


In [None]:
summaize_by_section(1)

Services
	


'Subject to the agreement, the sowspecific scope of services shall include the following  this sow provides a broad set of it services that are all delivered in a time and materials or staff augmentation delivery model. creation of poc for td frameworkcreation of wireframescreation of application architecturecreation of database architecturegrooming sessionsfoundation workanalysis and design of database for case managementthe following activities and items are specifically excluded from the scope of services under this sow naterm'

	
summarized length of section 75
original length of section 136
Schedule
	


	
summarized length of section 141
original length of section 164
Place of Performance and Hours
	


	
summarized length of section 167
original length of section 222
Role
	


	
summarized length of section 196
original length of section 361
Responsibilities
	


	
summarized length of section 373
original length of section 373
Charge
	


	
summarized length of section 296
original length of section 670
Assumptions
	


	
summarized length of section 247
original length of section 676
Agreement
	


'This sow and any change orders issued hereunder may be executed by the exchange of signed counterparts by facsimile transmission or electronically in pdf or similar secure format. this sow and any change orders issued hereunder may be executed in counterparts, which when taken together will constitute one and the same document.'

	
summarized length of section 52
original length of section 71
	
Overall total words from the sectioned document in the Summarized Version: 1547
Overall total words from the sectioned document before Summarization: 2673
Ratio to the sectioned document: % 57.87504676393566
Overall Original document words before Summarization: 4446
Ratio to the Original document: % 34.7953216374269
