In [None]:
!pip install --upgrade transformers==4.5.1
!pip install --upgrade spacy==3.0.5
!pip install --upgrade spacy-legacy==3.0.2
!pip install --upgrade spacytextblob==3.0
!pip install --upgrade PyPDF2==1.26.0
!pip install --upgrade nltk==3.6.2
# !conda install -c conda-forge sentencepiece

In [None]:
# Pandas
import pandas as pd

# Import dependencies
from PyPDF2 import PdfFileReader

# spaCy for NLP
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

# Import heapq
from heapq import nlargest

# Import HuggingFace Transformer
from transformers import pipeline


# Import Google's T5 Model Transformer
from transformers import TFAutoModelWithLMHead, AutoTokenizer


# NLTK for splitting sentences
import nltk.data

#textblob
from textblob import TextBlob


# Incompatibilities with spaCy 3.0
from spacytextblob.spacytextblob import SpacyTextBlob


#Regex
import re

# Pretty Print
import pprint


import torch

# import sentencepiece

# Extract PDF Files

In [27]:
from PyPDF2 import PdfFileReader

## Load both files
Henceforth in this notebook, 
the accenture suffix refers to 'Accenture-Disability-Inclusion-Research-Report.pdf' and
the bsr suffix refers to 'BSR_Financial_Needs_of_Garment_Workers_in_India.pdf'.

In [28]:
pdf_path_accenture='data/Accenture-Disability-Inclusion-Research-Report.pdf'
pdf_accenture = PdfFileReader(str(pdf_path_accenture))

In [29]:
pdf_path_bsr='data/BSR_Financial_Needs_of_Garment_Workers_in_India.pdf'
pdf_bsr = PdfFileReader(str(pdf_path_bsr))

In [30]:
# total num of pages in PDF Accenture
print(pdf_accenture.numPages)

17


In [31]:
# total num of pages in PDF bsr
print(pdf_bsr.numPages)

38


In [32]:
# Get the 5th page. Pages are 0-indexed, so the 5th page is at index 4.
pdf_accenture.getPage(4).extractText()

'5Accenture™s internal disability champions network of more than 16,000 employees \nworldwide helps colleagues feel included at work.\nA Market Worth Targeting\n\nwith disabilities as the third-largest market segment in the U.S., after \n\nHispanics and African-Americans. The discretionary income for working-\n\n\n\nAfrican-American and Hispanic segments combined.\n22 A hidden market: The purchasing power of working-age adults with disabilities, American Institutes for Research, April, 2018\n'

In [33]:
# Get the 12th page from the BSR PDF (index 11).
pdf_bsr.getPage(11).extractText()

'Financial Inclusion for Women\n I Introduction I \n8The effect of these changes has been mixed. On one hand, bank account \nownership increased to 63 percent from 52 percent of all adults between \n\n2014 and 2015, and the proportion of Indian women with individual accounts \n\nin formal ˜nancial institutions reached 61 percent in 2015, an increase from \n48 percent in 2014.\n17 Unfortunately, there is a gender gap in account usage \nas Indian women still lag behind by eight percentage points in account \nownership, and fewer than four out of 10 Indian women actively use their \nbank accounts.\n18 Adding insult to injury, just a year after demonetization, 99 \npercent of the banned high value notes were returned to the banking system, \n\nand cash use rose to the levels they were at before demonetization.\n19These outcomes could be attributed to the fact that government initiatives \nto encourage cashless payments did not incorporate ˜nancial education, \n\nthe targets for opening ban
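
The two extracts above show recurring extraction artifacts: `™` where an apostrophe belongs (`Accenture™s`) and `˜` where the "fi" ligature was lost (`˜nancial`). A minimal sketch of a post-processing step, assuming those two substitutions hold throughout both PDFs. `ARTIFACT_MAP` and `repair_artifacts` are hypothetical names, not part of the notebook:

```python
# Hypothetical mapping inferred from the two sample pages only.
# Assumption: neither character occurs legitimately in these reports.
ARTIFACT_MAP = {
    "™": "'",   # trademark sign where an apostrophe belongs
    "˜": "fi",  # small tilde where the "fi" ligature was lost
}

def repair_artifacts(text):
    # Apply each substitution in turn; unknown artifacts are left untouched.
    for bad, good in ARTIFACT_MAP.items():
        text = text.replace(bad, good)
    return text

print(repair_artifacts("Accenture™s ˜nancial report"))
```

Other artifacts (for example mis-encoded quotation marks) would need their own entries after inspecting more of the extracted text.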

Read and Extract the PDF
------------------------

In [35]:
def write_pdf_extract(out_filename, pdf):
    with open(out_filename,mode="w") as output_file:
        for page in pdf.pages:
            text = page.extractText()

            output_file.write(text)

In [36]:
write_pdf_extract('data/extract-accenture.txt', pdf_accenture)
write_pdf_extract('data/extract-bsr.txt', pdf_bsr)

Clean Text
----------

In [37]:
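import re

def clean_text(filename, isFile = True):
    # Read from disk when isFile is True; otherwise treat the argument as raw text.
    if isFile:
        # The context manager closes the file handle once the text is read.
        with open(filename) as fp:
            data = fp.read().replace('\n', '')
    else:
        data = filename
        
    # Pad every integer or decimal number with spaces so it separates from adjacent words.
    return re.sub(r"([0-9]+(\.[0-9]+)?)",r" \1 ", data).strip()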
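
To see what the regular expression in `clean_text` does: removing newlines can glue numbers to the neighbouring words, and the substitution pads each integer or decimal with spaces again. A small check with a made-up input string (the sample text and the name `pad_numbers` are assumptions for illustration, not taken from the PDFs):

```python
import re

def pad_numbers(text):
    # Same pattern as clean_text: wrap each integer or decimal number in spaces.
    return re.sub(r"([0-9]+(\.[0-9]+)?)", r" \1 ", text).strip()

print(pad_numbers("rose from52to63.5percent"))  # rose from 52 to 63.5 percent
```

Numbers that already have spaces around them end up with doubled spaces, which the tokenizers used later tolerate.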

Google T5 Model
---------------

In [38]:
from transformers import TFAutoModelWithLMHead, AutoTokenizer

In [39]:
model = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


## Full Text Summary with Google T5

#### Accenture

In [43]:
full_text_doc = clean_text("./data/extract-accenture.txt")
# t5-base accepts at most 512 input tokens, so truncate the article to its first 512 tokens.
inputs = tokenizer.encode("summarize: " + full_text_doc, return_tensors="tf", max_length=512, truncation=True)
outputs = model.generate(inputs, max_length=250, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
# Decoding and printing summary

GT5_summary=tokenizer.decode(outputs[0], skip_special_tokens=True)
print(GT5_summary)

leading companies are accelerating disability inclusion as the next frontier of corporate social responsibility and mission-driven investing. only 29 percent of Americans of working age with disabilities participated in the workforce, compared with 75 percent of Americans without a disability.


#### BSR

In [44]:
full_text_doc = clean_text("./data/extract-bsr.txt")
# t5-base accepts at most 512 input tokens, so truncate the article to its first 512 tokens.
inputs = tokenizer.encode("summarize: " + full_text_doc, return_tensors="tf", max_length=512, truncation=True)
outputs = model.generate(inputs, max_length=250, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
# Decoding and printing summary

GT5_summary=tokenizer.decode(outputs[0], skip_special_tokens=True)
print(GT5_summary)

report examines opportunity to expand nancial inclusion for women in india's garment sector. mobile nancial services provide several bene ts that can increase women's use of formal nancial services.


In [19]:
summary_text_t5 = ''

for ids in outputs:
    summary_text_t5 += tokenizer.decode(ids, skip_special_tokens=True)
print("\n\nSummarized T5:" + summary_text_t5)

Original: GETTING TO EQUAL:THE DISABILITY INCLUSION ADVANTAGEA research report produced jointly by 2 ﬁPersons with disabilities present business and industry with unique opportunities in labor-force diversity and corporate culture, and they™re a large consumer market eager to know which businesses authentically support their goals and dreams. Leading companies are accelerating disability inclusion as the next frontier of corporate social responsibility and mission-driven investing.ﬂŒ  Ted Kennedy, Jr.,  Disabilities Rights Attorney, Connecticut State Senator and Board Chair, American Association of People with Disabilities 3 Introductiona critical talent pool? At a time when there are more job openings in the U.S. than workers, you™d want to know more, wouldn™t you?New research from Accenture, in partnership with Disability:IN and the American Association of People with Disabilities (AAPD), reveals that companies that embrace best practices for employing and supporting more persons wit

# Page-wise Summaries
## Load Models (Google T5 ~ Base/Large, Google Pegasus, Facebook BART)


#### Load Google T5

In [46]:
model_t5b = TFAutoModelWithLMHead.from_pretrained("t5-base")
tokenizer_t5b = AutoTokenizer.from_pretrained("t5-base")

model_t5l = TFAutoModelWithLMHead.from_pretrained("t5-large")
tokenizer_t5l = AutoTokenizer.from_pretrained("t5-large")

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


#### Google Pegasus (for abstractive summarization)

In [47]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  
tokenizer_pegasus = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")

#### Facebook BART

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_bart = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model_bart = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

## Utility Functions

In [51]:
def write_to_file(out_filename, content):
    # The context manager closes the file even if the write fails.
    with open(out_filename, "w") as f:
        f.write(content)

In [49]:
def do_summarization(in_filename, out_filename, model, tokenizer):
    '''
        Summarize a PDF page by page and write the results to out_filename.
        Used mostly for Google T5 variants.
    '''
    page_num = 0
    content = ''
    pdf = PdfFileReader(str(in_filename))
    
    content += "Google T5 Summarization \n \nPDF Total Pages: " + str(pdf.numPages)
    
    for page in pdf.pages:
        page_num += 1
        raw_text = page.extractText()
        cleaned_page_text = clean_text(raw_text, False)

        inputs = tokenizer.encode("summarize: " + cleaned_page_text, return_tensors="tf", max_length=512)
        outputs = model.generate(inputs, max_length=250, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

        GT5_summary=tokenizer.decode(outputs[0], skip_special_tokens=True)
        content += "\n\nSummary for Page No. " + str(page_num) + " [Original Length: " + str(len(cleaned_page_text)) + ", Summary Length: " + str(len(GT5_summary)) + " ]\n"
        # End the summary with a newline so the separator starts on its own line
        content += "\n" + GT5_summary + "\n"
        content += "---------"*5
        
    write_to_file(out_filename, content)

In [50]:
# Google T5 Base
do_summarization('data/Accenture-Disability-Inclusion-Research-Report.pdf', 'data/summary-google-t5-base-accenture.txt', model_t5b, tokenizer_t5b)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [53]:
# Google T5 Base
do_summarization('data/BSR_Financial_Needs_of_Garment_Workers_in_India.pdf', 'data/summary-google-t5-base-bsr.txt', model_t5b, tokenizer_t5b)

In [None]:
# Google T5 Large
do_summarization('data/Accenture-Disability-Inclusion-Research-Report.pdf', 'data/summary-google-t5-large-accenture.txt', model_t5l, tokenizer_t5l)

In [None]:
# Google T5 Large
do_summarization('data/BSR_Financial_Needs_of_Garment_Workers_in_India.pdf', 'data/summary-google-t5-large-bsr.txt', model_t5l, tokenizer_t5l)

# Google Pegasus Full PDF Summary

In [6]:
# Use Google Pegasus for full PDF Summary for Accenture

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

src_text = [clean_text("./data/extract-accenture.txt")]

model_name = 'google/pegasus-xsum'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors="pt").to(device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(tgt_text[0])

Companies that embrace disability inclusion have outperformed their peers, according to research from Accenture, in partnership with Disability:IN and the American Association of People with Disabilities (AAPD).


In [55]:
# Use Google Pegasus for full PDF Summary for BSR

from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

src_text = [clean_text("./data/extract-bsr.txt")]

model_name = 'google/pegasus-xsum'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors="pt").to(device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(tgt_text[0])

Mobile nancial products and services have the potential to transform the lives of women in India.


# Google Pegasus - Pagewise Summary

In [54]:
def do_summarization_pegasus(in_filename, out_filename):
    
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
    import torch
    from PyPDF2 import PdfFileReader

    page_num = 0
    content = ''
    pdf = PdfFileReader(str(in_filename))
    
    content += "Google Pegasus Summarization \n \nPDF Total Pages: " + str(pdf.numPages)
    
    # Load the tokenizer and model once, not once per page
    model_name = 'google/pegasus-xsum'
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
    
    for page in pdf.pages:
        page_num += 1
        raw_text = page.extractText()
        src_text = [clean_text(raw_text, False)]
        
        # Summarize the current page
        batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors="pt").to(device)
        translated = model.generate(**batch)
        tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
        
        content += "\n\nSummary for Page No. " + str(page_num) + " [Original Length: " + str(len(src_text[0])) + ", Summary Length: " + str(len(tgt_text[0])) + " ]\n"
        content += "\n" + tgt_text[0] + "\n"
        content += "---------"*5
        
    write_to_file(out_filename, content)
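
Both `do_summarization` and `do_summarization_pegasus` build a near-identical per-page report entry by string concatenation. One possible refactoring, sketched under the assumption that a shared helper is acceptable (`format_page_entry` is a hypothetical name, and the layout follows the Pegasus variant), keeps the format in one place:

```python
def format_page_entry(page_num, original_text, summary):
    # Header line with character counts, then the summary, then a separator rule.
    header = (
        f"\n\nSummary for Page No. {page_num} "
        f"[Original Length: {len(original_text)}, Summary Length: {len(summary)} ]\n"
    )
    return header + "\n" + summary + "\n" + "---------" * 5

entry = format_page_entry(1, "some page text", "short")
print(entry)
```

Note that both lengths are character counts of the cleaned page text and of the summary, not token counts.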

In [19]:
# Google Pegasus
do_summarization_pegasus('data/Accenture-Disability-Inclusion-Research-Report.pdf', 'data/summary-google-pegasus-accenture.txt')

In [56]:
# Google Pegasus
do_summarization_pegasus('data/BSR_Financial_Needs_of_Garment_Workers_in_India.pdf', 'data/summary-google-pegasus-bsr.txt')

# Facebook BART Full PDF Summary

In [57]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

long_text = clean_text("./data/extract-accenture.txt")

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# tokenize without truncation
inputs_no_trunc = tokenizer(long_text, max_length=None, return_tensors='pt', truncation=False)

# get batches of tokens corresponding to the exact model_max_length
chunk_start = 0
chunk_end = tokenizer.model_max_length  # == 1024 for Bart
inputs_batch_lst = []
while chunk_start < len(inputs_no_trunc['input_ids'][0]):  # '<' avoids an empty final chunk when the length is a multiple of 1024
    inputs_batch = inputs_no_trunc['input_ids'][0][chunk_start:chunk_end]  # get batch of n tokens
    inputs_batch = torch.unsqueeze(inputs_batch, 0)
    inputs_batch_lst.append(inputs_batch)
    chunk_start += tokenizer.model_max_length  # == 1024 for Bart
    chunk_end += tokenizer.model_max_length  # == 1024 for Bart

# generate a summary on each batch
summary_ids_lst = [model.generate(inputs, num_beams=4, max_length=100, early_stopping=True) for inputs in inputs_batch_lst]

# decode the output and join into one string with one paragraph per summary batch
summary_batch_lst = []
for summary_id in summary_ids_lst:
    summary_batch = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_id]
    summary_batch_lst.append(summary_batch[0])
summary_all = '\n'.join(summary_batch_lst)

print(summary_all)

Token indices sequence length is longer than the specified maximum sequence length for this model (4169 > 1024). Running this sequence through the model will result in indexing errors


Leading companies are accelerating disability inclusion as the next frontier of corporate social responsibility and mission-driven investing. There are 15.1 million people of working age living with disabilities in the U.S., so the research suggests that if companies embrace disability inclusion, they will gain access to a new talent pool of more than 10.7 million people.
The Disability:IN benchmarking tool gives U.S. businesses an objective score on their disability inclusion policies and practices. Disability Inclusion Champions were, on average, two times more likely to outperform their peers in terms of total shareholder returns compared with the rest of the sample. While many are concerned about the costs of accommodating persons with disabilities, these are actually minimal and fruitful investments.
People with disabilities tend to be some of the most creative, innovative and, quite frankly, most loyal employees. People with disabilities should occupy roles at all levels, includi
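
The chunking loop above can be isolated from the model to check its boundaries. A minimal sketch on a plain Python list standing in for the token ids (`chunk_ids` is a hypothetical helper; the notebook code slices a tensor instead):

```python
def chunk_ids(ids, max_len):
    # Consecutive windows of at most max_len items. range() stops before len(ids),
    # so no empty trailing chunk appears when len(ids) is a multiple of max_len.
    return [ids[start:start + max_len] for start in range(0, len(ids), max_len)]

# 4169 is the token count reported in the warning above; 1024 is BART's limit.
print([len(c) for c in chunk_ids(list(range(4169)), 1024)])  # [1024, 1024, 1024, 1024, 73]
```

Chunks cut at fixed token offsets can split a sentence in half, so an individual chunk summary may start or end mid-thought.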

## Facebook BART - Chunk-wise Summary (Helper Function)

In [58]:
def do_summarization_bart(in_extracted_filename, out_filename):
    from transformers import BartTokenizer, BartForConditionalGeneration
    import torch

    long_text = clean_text(in_extracted_filename)

    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

    # tokenize without truncation
    inputs_no_trunc = tokenizer(long_text, max_length=None, return_tensors='pt', truncation=False)

    # get batches of tokens corresponding to the exact model_max_length
    chunk_start = 0
    chunk_end = tokenizer.model_max_length  # == 1024 for Bart
    inputs_batch_lst = []
    while chunk_start < len(inputs_no_trunc['input_ids'][0]):  # '<' avoids an empty final chunk when the length is a multiple of 1024
        inputs_batch = inputs_no_trunc['input_ids'][0][chunk_start:chunk_end]  # get batch of n tokens
        inputs_batch = torch.unsqueeze(inputs_batch, 0)
        inputs_batch_lst.append(inputs_batch)
        chunk_start += tokenizer.model_max_length  # == 1024 for Bart
        chunk_end += tokenizer.model_max_length  # == 1024 for Bart

    # generate a summary on each batch
    summary_ids_lst = [model.generate(inputs, num_beams=4, max_length=100, early_stopping=True) for inputs in inputs_batch_lst]

    # decode the output and join into one string with one paragraph per summary batch
    summary_batch_lst = []
    for summary_id in summary_ids_lst:
        summary_batch = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_id]
        summary_batch_lst.append(summary_batch[0])
    summary_all = '\n'.join(summary_batch_lst)

    print(summary_all)
    
    # write to file
    write_to_file(out_filename, summary_all)

In [59]:
do_summarization_bart("./data/extract-accenture.txt", 'data/summary-fb-bart-accenture.txt')

Token indices sequence length is longer than the specified maximum sequence length for this model (4169 > 1024). Running this sequence through the model will result in indexing errors


Leading companies are accelerating disability inclusion as the next frontier of corporate social responsibility and mission-driven investing. There are 15.1 million people of working age living with disabilities in the U.S., so the research suggests that if companies embrace disability inclusion, they will gain access to a new talent pool of more than 10.7 million people.
The Disability:IN benchmarking tool gives U.S. businesses an objective score on their disability inclusion policies and practices. Disability Inclusion Champions were, on average, two times more likely to outperform their peers in terms of total shareholder returns compared with the rest of the sample. While many are concerned about the costs of accommodating persons with disabilities, these are actually minimal and fruitful investments.
People with disabilities tend to be some of the most creative, innovative and, quite frankly, most loyal employees. People with disabilities should occupy roles at all levels, includi

In [60]:
do_summarization_bart("./data/extract-bsr.txt", 'data/summary-fb-bart-bsr.txt')

Token indices sequence length is longer than the specified maximum sequence length for this model (19358 > 1024). Running this sequence through the model will result in indexing errors


Financial Inclusion for WomenExpanding Mobile Financial Servicesin India™s Garment SectorResearch ReportAbout this ReportThis report examines the opportunity to expand ˜nancial inclusion for women in India's garment sector by increasing women's use of mobile financial products and services. The report is based on a combination of desk-based literature review, and qualitative research sessions conducted in India by the HERproject team.
More than 2 billion adults still do not use formal ˜nancial services such as deposit and savings accounts, payment services, loans, and insurance. In India, 58 percent of women owned at least one bank account as of 2015, but only 35 percent of these women were using that account, compared to  49 percent of men.
India has the world's largest number of ˜nancially excluded women. 280 million women in India don't have access to formal ˜nancial services. 60 percent of Indian women have at least one bank account. Only 40 percent of women actively use a ˜ nancia

## Note: Page-wise summaries with Facebook BART were not done.