# Extractive and Abstractive Text Summarization for Long Documents
##### ***by Gerson Gerard Cruz***

In an increasingly information-dependent world, the ability to deliver the most important and accurate information in the least amount of time is exceedingly valuable. Text summarization provides this value: it condenses a document into a concise summary that preserves the document's key information and meaning. 


<img src='https://drive.google.com/uc?id=1caGEZmT4ODf0R7zNZLqYh-8plpv55sIh'>


There are two general types of text summarization: Extractive and Abstractive summarization. 

#### Extractive Summarization 

Extractive summarization, as the name suggests, extracts the subset of sentences that carry the most important information in a text. Algorithms assign weights to the parts of the document and rank them by similarity and importance, and the top-ranked parts are used verbatim to form the summary. 

The general workflow for extractive summarization goes like: 

**Text input --> Get similar sentences --> Assign weights to sentences --> Rank sentences --> Choose sentences with highest ranks to form the summary**
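
To make the workflow above concrete, here is a minimal sketch of the weight-and-rank idea. It is not the BERT-based method used later in this notebook: as a simplifying assumption it weights each sentence by the average document-wide frequency of its words, and the helper names `split_sentences` and `extractive_summary` are hypothetical.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def extractive_summary(text, num_sentences=2):
    sentences = split_sentences(text)

    # Weight every word by how often it occurs in the whole document
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Rank sentences by score, keep the best ones, then restore document order
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

sample = "Cats sleep a lot. Cats eat fish and cats sleep. Dogs bark."
print(extractive_summary(sample, num_sentences=1))
```

The summary only ever contains sentences copied from the input, which is the defining property of the extractive approach.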

#### Abstractive Summarization

In contrast, abstractive summarization aims to **abstract**: based on the semantic information of the text, it may use words that did not appear in the input document. This means abstractive summarization produces a new summary. It interprets and examines the document using advanced NLP techniques and generates a new, concise summary based on the most important information in the text. 

The general workflow for abstractive summarization goes like:

**Text input --> Understand the context of the document --> Use semantic understanding --> Abstract and create a new summary**

In general, abstractive summarization is desired more than extractive summarization because it is akin to how a human would summarize a text by first understanding its meaning and putting it into his/her own words. However, given the challenges in semantic representation, extractive summarization often gives better results. 

## The Project

In addition to the trade-offs between these two summarization techniques, summarizing long text documents is difficult in its own right. For example, as the GitHub issue [Bart now enforces maximum sequence length in Summarization Pipeline](https://github.com/huggingface/transformers/issues/4224) shows, some transformer models such as BART limit the maximum length of the input text for abstractive summarization. Given this, I researched how to solve this problem and came across the paper [Combination of abstractive and extractive approaches for summarization of long scientific texts](https://arxiv.org/abs/2006.05354), which first applies extractive summarization to obtain the important information from the text and then performs abstractive summarization on the extracted summary together with the scientific paper's abstract and conclusion. 

While I won't go into as much detail as the paper, in this project I still aim to apply extractive and abstractive summarization in order to summarize long scientific documents. 

## The Dataset

The dataset I will use for this project consists of 100 scientific papers from the WING NUS group's Scisumm corpus found at this [github link](https://github.com/WING-NUS/scisumm-corpus). According to the authors of [Scisumm](https://cs.stanford.edu/~myasu/projects/scisumm_net/), a summary of a scientific paper should ideally incorporate the impact of the paper on the research community, as reflected by citations. To facilitate research in citation-aware scientific paper summarization (Scisumm), the CL-Scisumm shared task has been organized since 2014 for papers in the computational linguistics and NLP domain. 

## The Methodology

The project workflow consists of three main steps: data collection and preprocessing, modelling, and model deployment. 

### Data Collection and Preprocessing

In this step, I choose 100 scientific papers from the Scisumm corpus. I select the papers manually because the project requires papers that explicitly have an `abstract` and a `conclusion` section in the .xml file. On inspection, some papers did not have an `abstract` section; their abstract was instead found directly in the text section of the document. Such papers would lead to extraction errors, as the .xml extraction pipeline was designed for XML documents that explicitly have an `abstract` and a `conclusion` section. 

For the data preprocessing step, I create data cleaning and preprocessing functions with the following capabilities:
* Lemmatization
* Stopword removal
* Lowercase
* Punctuation cleaning
* Emoji cleaning
* Number cleaning
* Weblinks cleaning
* Unnecessary spaces removal

I give the user the freedom to choose which cleaning steps to apply by creating a unified function in which every cleaning step is controlled by a boolean flag. For the purpose of this project, I do not lemmatize, remove stopwords, lowercase, remove numbers, or remove punctuation, so that the text keeps its semantic context for summarization. 

### Model Training

For modelling, I perform both extractive and abstractive summarization. For extractive summarization, I use the BERT transformer model and customize it to use the pre-trained weights of the **sciBERT** model, which specializes in scientific texts and therefore fits our purpose. For every text, I determine the optimal number of sentences for the extracted summary.

For abstractive summarization, I first concatenate the abstract, extractive summary, and conclusion together since much of the important information can be found in them. Then, I use the **facebook-BART-large-cnn** transformer model to perform the abstraction. 
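
The two-stage idea can be sketched independently of any model. In the sketch below, `hybrid_summarize`, `first_sentence`, and `truncate_words` are hypothetical names, and the two stand-in functions are deliberately trivial placeholders (assumptions for illustration) for the sciBERT extractive step and the BART abstractive step.

```python
def hybrid_summarize(abstract, full_text, conclusion, extract_fn, abstract_fn):
    # Stage 1: shorten the long body with the extractive step
    extracted = extract_fn(full_text)

    # Stage 2: concatenate abstract + extracted summary + conclusion,
    # then hand the much shorter text to the abstractive step
    combined = " ".join(part for part in (abstract, extracted, conclusion) if part)
    return abstract_fn(combined)

# Hypothetical stand-in for the extractive model: keep the first sentence
def first_sentence(text):
    return text.split(". ")[0].rstrip(".") + "."

# Hypothetical stand-in for the abstractive model: keep the first few words
def truncate_words(text, max_words=12):
    return " ".join(text.split()[:max_words])

summary = hybrid_summarize(
    abstract="We study long-document summarization.",
    full_text="Long inputs exceed model limits. Many details follow here.",
    conclusion="Combining both approaches helps.",
    extract_fn=first_sentence,
    abstract_fn=truncate_words,
)
print(summary)
```

Passing the two steps in as functions keeps the pipeline shape separate from the choice of models, so either stage can be swapped without touching the other.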

### Model Deployment

For deployment, I use Streamlit to create a simple user interface which requires a long text input to summarize. Then, I deploy this model using the package `localtunnel` so that I can serve this project to the web. 

# Table of Contents
I. [Importing Libraries and Installing External Dependencies](#s1) <br>
II. [Data Collection and Preprocessing](#s2) <br>
III. [Modelling](#s3) <br>
IV. [Model Deployment](#s4) <br>
V. [Recommendations](#s5) <br>


## Importing Libraries and Installing External Dependencies <a name="s1"></a>

In [1]:
!pip install lxml
!pip install sentencepiece
!pip install transformers
!pip install tensorflow-gpu # For CPMTokenizer
!pip install neuralcoref
!pip install bert-extractive-summarizer

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96
Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)

In [2]:
# Data collection
from lxml import objectify
import pandas as pd
import numpy as np
import os
import glob
from glob import iglob

# Data preprocessing
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Text Summarization
from transformers import *
from summarizer import Summarizer
from summarizer.text_processors.coreference_handler import CoreferenceHandler

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


100%|██████████| 40155833/40155833 [00:01<00:00, 34448343.47B/s]


## Data Collection and Preprocessing <a name="s2"></a>

The data from Scisumm is in the .xml format. [XML](https://www.indeed.com/career-advice/career-development/xml-file), the Extensible Markup Language, is a plain-text format used to structure data for storage and transport. An XML file uses custom tags to describe the structure of the document, and the text itself sits inside those tags.

The structure of a sample paper is shown below: 


<img src='https://drive.google.com/uc?id=1wheFaobd2Bw6QSmMIn0azJIhC584lNzV'>

Each part of the paper is contained in a `SECTION` tag, and the sentences of that section are nested inside it. Each sentence is given a unique sentence identifier, or `sid`. 

In order to properly extract this data, I follow this process:
1. Get the list of all `.xml` file names.
2. Use `objectify` from the `lxml` library to extract all text contents of the data.
3. Extract the `abstract` and `conclusion` columns into separate lists for abstractive summarization. 
4. Collate the text from every section into one whole text.
5. Append the `abstract`, `entire_text`, and `conclusion` into a pandas dataframe
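
The steps above can be sketched on a tiny inline document. This sketch uses the standard library's `xml.etree.ElementTree` rather than `lxml.objectify`, and the sample XML layout (a `PAPER` root, an `ABSTRACT` element, `SECTION` elements with a `title` attribute, and `S` sentence elements with a `sid`) is an assumption based on the description above, not a copy of a real corpus file. `parse_paper` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout of one paper (illustration only)
SAMPLE_XML = """<PAPER>
  <S sid="0">A Toy Paper Title</S>
  <ABSTRACT>
    <S sid="1">We propose a toy method.</S>
    <S sid="2">It works on toy data.</S>
  </ABSTRACT>
  <SECTION title="Introduction" number="1">
    <S sid="3">Toy problems are common.</S>
  </SECTION>
  <SECTION title="Conclusion" number="2">
    <S sid="4">The toy method is simple.</S>
  </SECTION>
</PAPER>"""

def parse_paper(xml_string):
    root = ET.fromstring(xml_string)
    sections = {}
    for child in root:
        if child.tag == "ABSTRACT":
            name = "abstract"
        elif child.tag == "SECTION":
            name = child.attrib.get("title", "untitled").lower()
        else:
            continue  # skip other top-level elements, e.g. the title sentence
        # Join the text of every sentence element inside this part of the paper
        sections[name] = " ".join(s.text.strip() for s in child.iter("S") if s.text)
    return sections

sections = parse_paper(SAMPLE_XML)
full_text = " ".join(sections.values())  # collate all parts into one text
print(sections["abstract"])
print(sections["conclusion"])
```

Looking sections up by name, as done here, avoids relying on column positions, at the cost of assuming that every paper labels its abstract and conclusion consistently.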

#### Getting the list of all .xml file names. 

In [3]:
%cd ..
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

!ln -s /content/gdrive/My\ Drive/ /mydrive

/
Mounted at /content/gdrive


In [4]:
%cd /mydrive/Omdena\ School/Solving\ Business\ Problems\ with\ NLP/Capstone\ Project

/content/gdrive/My Drive/Omdena School/Solving Business Problems with NLP/Capstone Project


In [None]:
# Run this to unzip the scisumm zip file into google drive
# !unzip scisummnet_release1.1__20190413.zip

Then, I filtered the unzipped folder to only include **100 selected documents** that follow the format in the screenshot above for the purpose of this project. The 100 documents are contained in a folder named `top100`.

In [None]:
%cd scisummnet_release1.1__20190413

/content/gdrive/My Drive/Omdena School/Solving Business Problems with NLP/Capstone Project/scisummnet_release1.1__20190413


In [None]:
!ls

Dataset_Documentation.txt  log.txt  top100  top1000_complete  top100.zip


In [None]:
# # Run this to unzip the top100 zip file into the top100 folder
# !unzip top100.zip

Archive:  top100.zip
   creating: top100/
   creating: top100/A00-1031/
  inflating: top100/A00-1031/citing_sentences_annotated.json  
   creating: top100/A00-1031/Documents_xml/
  inflating: top100/A00-1031/Documents_xml/A00-1031.xml  
   creating: top100/A00-1031/summary/
  inflating: top100/A00-1031/summary/A00-1031.gold.txt  
   creating: top100/A00-1043/
  inflating: top100/A00-1043/citing_sentences_annotated.json  
   creating: top100/A00-1043/Documents_xml/
  inflating: top100/A00-1043/Documents_xml/A00-1043.xml  
   creating: top100/A00-1043/summary/
  inflating: top100/A00-1043/summary/A00-1043.gold.txt  
   creating: top100/A00-2004/
  inflating: top100/A00-2004/citing_sentences_annotated.json  
   creating: top100/A00-2004/Documents_xml/
  inflating: top100/A00-2004/Documents_xml/A00-2004.xml  
   creating: top100/A00-2004/summary/
  inflating: top100/A00-2004/summary/A00-2004.gold.txt  
   creating: top100/A00-2009/
  inflating: top100/A00-2009/citing_sentences_annotated.js

In [None]:
%cd ..

/content/gdrive/My Drive/Omdena School/Solving Business Problems with NLP/Capstone Project


In [None]:
# Get the list of paths to all .xml files
# (recursive=True is dropped: it only matters for patterns containing "**")
file_directory = glob.glob("scisummnet_release1.1__20190413/top100/*/*/*.xml")

# Check that the paths are correct
file_directory[0:5]

['scisummnet_release1.1__20190413/top100/P08-1115/Documents_xml/P08-1115.xml',
 'scisummnet_release1.1__20190413/top100/P09-1039/Documents_xml/P09-1039.xml',
 'scisummnet_release1.1__20190413/top100/P08-1102/Documents_xml/P08-1102.xml',
 'scisummnet_release1.1__20190413/top100/P08-1066/Documents_xml/P08-1066.xml',
 'scisummnet_release1.1__20190413/top100/P09-1040/Documents_xml/P09-1040.xml']

In [None]:
# Create xml file extraction function
def extract_xml(directory):
  """Parse one paper's .xml file and return (abstract, full_text, conclusion)."""
  xml_data = objectify.parse(directory)  # Parse XML data
  root = xml_data.getroot()  # Root element

  data = []
  cols = []
  for child in root.iterchildren():
      # Collect the text of every sentence element under this child
      data.append([subchild.text for subchild in child.iterchildren()])

      # SECTION elements are named by their `title` attribute;
      # every other element (e.g. the abstract) is named by its tag
      if child.tag != "SECTION":
        cols.append(child.tag)
      else:
        cols.append(child.attrib.get('title'))

  df = pd.DataFrame(data).T  # One column per top-level element
  df.columns = cols  # Update column names

  # Get the abstract column (second column)
  abstract = " ".join(df.iloc[:, 1].dropna())

  # Get the conclusion column (penultimate column)
  conclusion = " ".join(df.iloc[:, -2].dropna())

  # Drop the last column and the first column (S)
  df = df.iloc[:, 1:-1]

  # Join the sentences of every remaining section, then join the sections.
  # Columns are read by position so that repeated section titles still work.
  text_list = []
  for i in range(df.shape[1]):
    text_list.append(" ".join(df.iloc[:, i].dropna()))

  final_text = " ".join(text_list)

  return abstract, final_text, conclusion

In [None]:
%%time
abstract_list = []
full_text_list = []
conclusion_list = []

for counter, directory in enumerate(file_directory):
  abstract, full_text, conclusion = extract_xml(directory)
  abstract_list.append(abstract)
  full_text_list.append(full_text)
  conclusion_list.append(conclusion)

  print(f"XML extraction for document {counter} done! \n")

XML extraction for document 0 done! 

XML extraction for document 1 done! 

XML extraction for document 2 done! 

XML extraction for document 3 done! 

XML extraction for document 4 done! 

XML extraction for document 5 done! 

XML extraction for document 6 done! 

XML extraction for document 7 done! 

XML extraction for document 8 done! 

XML extraction for document 9 done! 

XML extraction for document 10 done! 

XML extraction for document 11 done! 

XML extraction for document 12 done! 

XML extraction for document 13 done! 

XML extraction for document 14 done! 

XML extraction for document 15 done! 

XML extraction for document 16 done! 

XML extraction for document 17 done! 

XML extraction for document 18 done! 

XML extraction for document 19 done! 

XML extraction for document 20 done! 

XML extraction for document 21 done! 

XML extraction for document 22 done! 

XML extraction for document 23 done! 

XML extraction for document 24 done! 

XML extraction for document 25 done

In [None]:
print(abstract_list[0:2])
print(full_text_list[0:2])
print(conclusion_list[0:2])

['Word lattice decoding has proven useful in spoken language translation; we argue that it provides a compelling model for translation of text genres, as well. We show that prior work in translating lattices using finite state techniques can be naturally extended to more expressive synchronous context-free grammarbased models. Additionally, we resolve a significant complication that non-linear word lattice inputs introduce in reordering models. Our experiments evaluating the approach demonstrate substantial gains for Chinese- English and Arabic-English translation.', 'We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linear program. Our formulation is able to handle non-local output features in an efficient manner; not only is it compatible with prior knowledge encoded as hard constraints, it can also learn soft constraints from data. In particular, our model is able to learn correlations among neighboring arcs (siblings and grandparents), word 

In [None]:
pd.set_option('max_colwidth', 100)

In [None]:
text_df = pd.DataFrame(list(zip(abstract_list, full_text_list, conclusion_list)), columns=["abstract", "full_text", "conclusion"])
text_df

Unnamed: 0,abstract,full_text,conclusion
0,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,We have achieved substantial gains in translation performance by decoding compact representation...
1,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We presented new dependency parsers based on concise ILP formulations. We have shown how non-loc...
2,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,"We proposed a cascaded linear model for Chinese Joint S&T. Under this model, many knowledge sour..."
3,"In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat..."
4,"We present a novel transition system for dependency parsing, which constructs arcs only between ...","We present a novel transition system for dependency parsing, which constructs arcs only between ...",We have presented a novel transition system for dependency parsing that can handle unrestricted ...
5,Morphological processes in Semitic languages deliver space-delimited words which introduce multi...,Morphological processes in Semitic languages deliver space-delimited words which introduce multi...,"The accuracy results for segmentation, tagging and parsing using our different models and our st..."
6,Previous studies of data-driven dependency parsing have shown that the distribution of parsing e...,Previous studies of data-driven dependency parsing have shown that the distribution of parsing e...,Combinations of graph-based and transition-based models for data-driven dependency parsing have ...
7,"This paper presents an unsupervised opinanalysis method for clasi.e., recognizing which stance a...","This paper presents an unsupervised opinanalysis method for clasi.e., recognizing which stance a...","This paper addresses challenges faced by opinion analysis in the debate genre. In our method, fa..."
8,We present a phrasal synchronous grammar model of translational equivalence. Unlike previous app...,We present a phrasal synchronous grammar model of translational equivalence. Unlike previous app...,We have presented a Bayesian model of SCFG induction capable of capturing phrasal units of trans...
9,Broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poo...,Broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poo...,"In this paper, we proposed a novel and effective learning scheme for transferring dependency par..."


In [None]:
!ls

'Capstone Project - {Gerson Cruz}.ipynb'  'Text Summarization Image.png'
 __MACOSX				   top100.csv
 scisummnet_release1.1__20190413	  'XML Structure.png'
 scisummnet_release1.1__20190413.zip


In [None]:
text_df.to_csv("top100.csv")

## Data Cleaning and Preprocessing

In [None]:
# Individual cleaning functions
def remove_web_links(text):
  # Strip http(s)://www.<name>.org/ links, with or without a trailing path
  text = re.sub(r'http://www.\w+.org/','', text)
  text = re.sub(r'http://www.([\w\S]+).org/\w+\W\w+','',text)
  text = re.sub(r'https://www.\w+.org/','', text)
  text = re.sub(r'https://www.([\w\S]+).org/\w+\W\w+','',text)
  # Strip links of the form https://<host>/<number>.<number>/<id>
  text = re.sub(r'https://\w+.\w+/\d+.\d+/\w\d+\W\w+','',text)
  # Strip figure labels such as "Figure 1:"
  text = re.sub(r'Figure\s\d:','', text)
  # Strip remaining www. fragments and @handles
  text = re.sub(r'\Wwww.\w+\W\w+\W','',text)
  text = re.sub("@[A-Za-z0-9]+", "", text)
  text = re.sub(r'www.\w+','',text)

  return text

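def remove_emojis(text):
  # Note: the wide \U000024C2-\U0001F251 and \U00010000-\U0010ffff ranges also
  # cover many non-emoji characters (e.g. CJK ideographs). This is acceptable
  # for the English-language papers used in this project.
  regex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters (very wide range)
        u"\U0001f926-\U0001f937"  # people gestures
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"          # female and male signs
        u"\u2600-\u2B55"          # miscellaneous symbols
        u"\u200d"                 # zero-width joiner
        u"\u23cf"                 # eject symbol
        u"\u23e9"                 # fast-forward symbol
        u"\u231a"                 # watch
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
                           "]+", flags = re.UNICODE)
  text = regex_pattern.sub('', text)

  return text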

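def remove_spaces(text):
  # Collapse newlines and runs of whitespace into single spaces so that
  # words on adjacent lines are not glued together
  text = re.sub(r'\s+', " ", text).strip()

  return text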

def remove_stopwords(text):
  stop_words=set(stopwords.words('english'))
  words=word_tokenize(text)
  sentence=[w for w in words if w not in stop_words]
  return " ".join(sentence)

def lemmatize_text(text):
  wordlist=[]
  lemmatizer = WordNetLemmatizer()
  sentences=sent_tokenize(text)
  for sentence in sentences:
      words=word_tokenize(sentence)
      for word in words:
          wordlist.append(lemmatizer.lemmatize(word))
  return ' '.join(wordlist)

def lowercase_text(text):
  return text.lower()

def remove_punctuations(text):
  additional_punctuations = ['’', '…'] # punctuations not in string.punctuation  
  for punctuation in string.punctuation:
    text = text.replace(punctuation, '')
  
  for punctuation in additional_punctuations:
    text = text.replace(punctuation, '')
    
  return text

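def remove_numbers(text):
  if text is None:
    return text

  # Remove leading list numbering such as "1. " (str.replace does not accept
  # regular expressions, so re.sub is used instead)
  text = re.sub(r'^\d+\.\s+', '', text)

  # Remove all remaining digits
  text = re.sub("[0-9]", '', text)
  return text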

# Unified boolean controlled cleaning function 
def clean_and_preprocess_data(text, lowercase=True, clean_stopwords=True, clean_punctuations=True, clean_links=True, 
                              clean_emojis=True, clean_spaces=True, clean_numbers=True,  lemmatize=True):
  
  # Links are removed first: tokenizing or stripping punctuation beforehand
  # would break the URLs so that the link patterns no longer match
  if clean_links:
    text = remove_web_links(text)

  if clean_stopwords:
    text = remove_stopwords(text)

  if clean_punctuations:
    text = remove_punctuations(text)
  
  if clean_emojis:
    text = remove_emojis(text)
  
  if clean_spaces:
    text = remove_spaces(text)
  
  if clean_numbers:
    text = remove_numbers(text)
  
  if lemmatize:
    text = lemmatize_text(text)
  
  if lowercase:
    text = lowercase_text(text)

  return text

In [None]:
%%time
# Keep word forms, numbers, stopwords, punctuation, and casing intact;
# only links, emojis, and extra spaces are cleaned
for column in ['abstract', 'full_text', 'conclusion']:
  text_df[column] = text_df[column].apply(lambda x: clean_and_preprocess_data(x, lemmatize=False, clean_numbers=False, clean_stopwords=False, clean_punctuations=False, lowercase=False))

CPU times: user 185 ms, sys: 3.89 ms, total: 189 ms
Wall time: 189 ms


In [None]:
text_df.head()

Unnamed: 0,abstract,full_text,conclusion
0,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,We have achieved substantial gains in translation performance by decoding compact representation...
1,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We presented new dependency parsers based on concise ILP formulations. We have shown how non-loc...
2,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,"We proposed a cascaded linear model for Chinese Joint S&T. Under this model, many knowledge sour..."
3,"In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat..."
4,"We present a novel transition system for dependency parsing, which constructs arcs only between ...","We present a novel transition system for dependency parsing, which constructs arcs only between ...",We have presented a novel transition system for dependency parsing that can handle unrestricted ...


In [None]:
text_df.to_csv("top100_cleaned.csv")

## Modelling: Data Explorations, Feature Extraction, Extractive Summarization, and Abstractive Summarization <a name="s3"></a>

Before moving on to our summarization, I'll first explore some important data characteristics like the average word length and token length per text.

See this reference [article](https://towardsdatascience.com/beginners-guide-for-data-cleaning-and-feature-extraction-in-nlp-756f311d8083) for a basic introduction to NLP feature extraction.

In order to perform the feature extraction, I created a `class TextSummarizer` which serves both as the feature extractor and the text summarizer. The class has the following functions and their corresponding capabilities: 
* `avg_word`: Function for getting the average length of a word in a text
* `count_punctuation`: Function for getting the number of punctuations in a text
* `get_optimal_number_sentences`: Function for getting the optimal number of sentences for extractive summarization
* `extract_text_features`: Function which returns the following features: number of stopwords, number of punctuation characters, number of numerical tokens, number of words, average word length, and stopwords-to-words ratio. 
* `extractive_summarizer`: Function for performing extractive text summarization
* `join_extracted_summary`: Function for concatenating the abstract, extractive text summary, and conclusion
* `abstractive_summarizer`: Function for performing abstractive summarization with the text. 

Using this class, I perform all the necessary explorations, and then directly proceed to performing extractive text summarization and abstractive text summarization with BERT and BART respectively. 

For extractive text summarization, I used the following references:
* [bert-extractive-summarizer](https://github.com/dmmiller612/bert-extractive-summarizer)
* [Handling coreference resolution with Python](https://kaveeshabaddage.medium.com/how-to-resolve-coreference-resolution-using-python-97fcd6b2cedb) 
* [sciBERT](https://github.com/allenai/scibert)

*Note: Due to the lack of computational resources, coreference handling could not be applied, as the memory provided by Google Colab is not enough and leads to a session crash.*

For abstractive text summarization, I used the following reference:
* [facebook-bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)

In [None]:
pd.set_option('max_colwidth', 100)
text_df = pd.read_csv("top100_cleaned.csv")
text_df.drop("Unnamed: 0", axis=1, inplace=True)
text_df.head()

Unnamed: 0,abstract,full_text,conclusion
0,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,We have achieved substantial gains in translation performance by decoding compact representation...
1,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We presented new dependency parsers based on concise ILP formulations. We have shown how non-loc...
2,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,"We proposed a cascaded linear model for Chinese Joint S&T. Under this model, many knowledge sour..."
3,"In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat..."
4,"We present a novel transition system for dependency parsing, which constructs arcs only between ...","We present a novel transition system for dependency parsing, which constructs arcs only between ...",We have presented a novel transition system for dependency parsing that can handle unrestricted ...


In [None]:
class TextSummarizer:
  def __init__(self, data):
    self.data = data

  # Helper functions

  # Get average word length in a document
  def avg_word(self, data):
    words = data.split()
    length = (sum(len(word) for word in words)/(len(words)+0.000001))

    return length
  
  # Get number of punctuations in a document
  def count_punctuation(self, data):
    punctuation_count = sum([1 for char in data if char in string.punctuation])

    return punctuation_count
  
  # Get optimal number of sentences for extractive summarization
  def get_optimal_number_sentences(self, data, model):

    optimal_num_sentences = model.calculate_optimal_k(data, k_max=10)

    return optimal_num_sentences
  
  # Extract numerical text features
  def extract_text_features(self, text_column):
    
    """
    Extracts text features such as number of stopwords, punctuations,
    numerical characters, average word length, average document length
    :param text_column: dataframe column to perform feature extraction on
    :return: dataframe with new feature columns
    """
    
    # Get number of stop words (a set makes the membership test fast)
    stop_words = set(stopwords.words('english'))
    self.data["num_stopwords"] = self.data[text_column].apply(lambda x: 
    len([word for word in x.split() if word in stop_words]))

    # Get number of punctuations
    self.data["num_punctuations"] = self.data[text_column].apply(lambda x: 
    self.count_punctuation(x))

    # Get number of purely numerical tokens
    self.data["num_numerics"] = self.data[text_column].apply(lambda x:
    len([word for word in x.split() if word.isdigit()]))

    # Get number of words in the document
    self.data["num_words"] = self.data[text_column].apply(lambda x: 
    len(str(x).split(" ")))

    # Get average word length in document

    self.data["avg_word_length"] = self.data[text_column].apply(lambda x: 
    round(self.avg_word(x),1))

    # Get the stopwords to word ratio
    self.data["stopwords_to_words_ratio"] = round(self.data["num_stopwords"] / self.data["num_words"], 3)

    return self.data
  
  def extractive_summarizer(self, model, text_column):
    
    """
    Performs extractive text summarization with BERT and allows for different 
    pretrained model loading and configurations.
    :param model: initialized pretrained model
    :param text_column: dataframe column to perform text_summarization on
    :return: dataframe with summarized text columns
    """

    self.data["extractive_summarized_text"] = self.data[text_column].apply(lambda x:
    "".join(model(x, num_sentences=self.get_optimal_number_sentences(x, model))))

    return self.data   


  def join_extracted_summary(self, abstract, extracted_summary, conclusion):

    """
    Concatenates the abstract, extractive_summarized_text, and conclusion columns
    into one column for abstractive summarization
    :param abstract: abstract column
    :param extracted_summary: extractive_summarized_text column
    :param conclusion: conclusion column
    :return: dataframe with concatenated abstract, extracted summary and conclusion 
    columns
    """

    self.data["combined_text"] = self.data[[abstract, extracted_summary, conclusion]].agg(
        " ".join, axis=1
    )

    return self.data

  def abstractive_summarizer(self, model, text_column, max_length=750, min_length=250):
    
    """
    Performs abstractive text summarization with BART using the extracted summary combined
    with the abstract and conclusion of the text.
    :param model: pipeline of the abstractive summarizer model
    :param text_column: dataframe column to perform text_summarization on
    :param max_length: maximum length (in tokens) of each generated summary
    :param min_length: minimum length (in tokens) of each generated summary
    :return: dataframe with summarized text columns
    """

    summaries_list = []
    for i in range(len(self.data[text_column])):
      text = self.data[text_column].iloc[i]
      try:
        summary = model(text, max_length = max_length, 
        min_length = min_length, do_sample=False)[-1]["summary_text"]
      except Exception:
        # If the input is too long for the model, fall back to the first
        # 1024 characters (note: this slices characters, not tokens)
        text = text[:1024]
        summary = model(text, max_length = max_length, 
        min_length = min_length, do_sample=False)[-1]["summary_text"]
      
      summaries_list.append(summary)
    
    self.data["abstractive_summaries"] = summaries_list
      
    return self.data

#### Extractive Summarization

In [None]:
text_class = TextSummarizer(text_df)

In [None]:
text_class.extract_text_features("full_text")

Unnamed: 0,abstract,full_text,conclusion,num_stopwords,num_punctuations,num_numerics,num_words,avg_word_length,stopwords_to_words_ratio
0,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,Word lattice decoding has proven useful in spoken language translation; we argue that it provide...,We have achieved substantial gains in translation performance by decoding compact representation...,1330,723,22,3826,5.4,0.348
1,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...,We presented new dependency parsers based on concise ILP formulations. We have shown how non-loc...,1615,1209,13,4683,5.1,0.345
2,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...,"We proposed a cascaded linear model for Chinese Joint S&T. Under this model, many knowledge sour...",1315,707,27,3573,5.2,0.368
3,"In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat...","In this paper, we propose a novel string-todependency algorithm for statistical machine translat...",1306,880,22,3774,5.2,0.346
4,"We present a novel transition system for dependency parsing, which constructs arcs only between ...","We present a novel transition system for dependency parsing, which constructs arcs only between ...",We have presented a novel transition system for dependency parsing that can handle unrestricted ...,1728,882,12,4416,4.9,0.391
...,...,...,...,...,...,...,...,...,...
95,We present an algorithm for anaphora res- olutkm which is a modified and extended version of tha...,We present an algorithm for anaphora res- olutkm which is a modified and extended version of tha...,Quantitative evaluation shows the anaphora resolution algorithm described here to run at a rate ...,1794,1286,44,5024,5.2,0.357
96,"I:n this paper, we describe a new corpus-based ap- proach to prepositional phrase attachment dis...","I:n this paper, we describe a new corpus-based ap- proach to prepositional phrase attachment dis...","Prel)ositioual phrase attachment disambiguation is a difficult problem. Take, for example, the s...",742,1185,31,2528,4.9,0.294
97,"computation of preferthe admissible argument values for a relation, is a well-known NLP task wit...","computation of preferthe admissible argument values for a relation, is a well-known NLP task wit...",We have presented an application of topic modeling to the problem of automatically computing sel...,1731,1049,38,4870,5.5,0.355
98,"If we take an existing supervised NLP system, a simple and general way to improve accuracy is to...","If we take an existing supervised NLP system, a simple and general way to improve accuracy is to...","Word features can be learned in advance in an unsupervised, task-inspecific, and model-agnostic ...",1568,1140,39,4740,5.4,0.331
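
To make the feature columns above easier to interpret, here is a minimal sketch that computes the same kinds of counts for a single string using only the standard library. `TOY_STOP_WORDS` and `text_features` are hypothetical names introduced for illustration; the notebook itself uses NLTK's English stop-word list, and unlike the class above this sketch lower-cases words before the stop-word check.

```python
import string

# Hypothetical, tiny stop-word list used only for this illustration;
# the notebook uses NLTK's stopwords.words('english') instead.
TOY_STOP_WORDS = {"the", "a", "an", "of", "in", "is", "we", "that", "it"}


def text_features(text, stop_words=TOY_STOP_WORDS):
    """Return per-document counts similar to extract_text_features, for one string."""
    words = text.split()
    num_words = len(words)
    # Assumption: lower-case before matching so that "We" counts as a stop word
    num_stopwords = sum(1 for word in words if word.lower() in stop_words)
    num_punctuations = sum(1 for char in text if char in string.punctuation)
    num_numerics = sum(1 for word in words if word.isdigit())
    avg_word_length = round(sum(len(w) for w in words) / num_words, 1) if num_words else 0.0
    ratio = round(num_stopwords / num_words, 3) if num_words else 0.0
    return {
        "num_stopwords": num_stopwords,
        "num_punctuations": num_punctuations,
        "num_numerics": num_numerics,
        "num_words": num_words,
        "avg_word_length": avg_word_length,
        "stopwords_to_words_ratio": ratio,
    }


sample = "We show that the parser is fast: it handles 40 sentences in 2 seconds."
features = text_features(sample)
print(features)
```

Note that punctuation attached to a word (as in `fast:`) counts toward that word's length, which is also how the `avg_word` helper in the class behaves.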


In [None]:
%%time

pretrained_model = 'allenai/scibert_scivocab_uncased'
# Load model, model config and tokenizer via Transformers
custom_config = AutoConfig.from_pretrained(pretrained_model)
custom_config.output_hidden_states=True
custom_tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
custom_model = AutoModel.from_pretrained(pretrained_model, config=custom_config)

# Create pretrained-model object
model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)


CPU times: user 1min 12s, sys: 3.85 s, total: 1min 16s
Wall time: 1min 27s


In [None]:
%%time 
extractive_summarized_text = text_class.extractive_summarizer(model, "full_text")

CPU times: user 12min 30s, sys: 2min 26s, total: 14min 57s
Wall time: 12min 24s
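
The run above asks the summarizer for an "optimal" number of sentences per document through `calculate_optimal_k`. As a rough intuition for how such a number can be chosen, the sketch below applies a generic elbow heuristic to a list of clustering inertias: it picks the `k` whose point lies farthest from the straight line joining the first and last inertia. This is an illustrative assumption, not necessarily what the library does internally; `elbow_k` and `toy_inertias` are hypothetical names and the inertia values are made up.

```python
def elbow_k(inertias):
    """Return the 1-based k at the 'elbow' of a decreasing inertia curve."""
    n = len(inertias)
    if n < 3:
        # Too few points to define an elbow; fall back to the largest k seen
        return max(n, 1)

    # End points of the line joining the first and last (k, inertia) pairs
    x1, y1 = 1, inertias[0]
    x2, y2 = n, inertias[-1]

    best_k, best_dist = 1, -1.0
    for k, y in enumerate(inertias, start=1):
        # Unnormalized perpendicular distance from (k, y) to that line;
        # the shared denominator is skipped because only the ranking matters
        dist = abs((y2 - y1) * k - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Made-up inertias for k = 1..6: a steep drop that flattens after k = 3
toy_inertias = [100.0, 60.0, 35.0, 30.0, 27.0, 25.0]
print(elbow_k(toy_inertias))  # 3
```

Because the heuristic has to evaluate several values of `k` per document, it adds to the run time of the extractive step.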


In [None]:
# Save entire dataframe 
extractive_summarized_text.to_csv("extractive_summarized_dataframe_final.csv")
extractive_summaries = extractive_summarized_text["extractive_summarized_text"]

In [None]:
extractive_summaries

0     Word lattice decoding has proven useful in spoken language translation; we argue that it provide...
1     We formulate the problem of nonprojective dependency parsing as a polynomial-sized integer linea...
2     We propose a cascaded linear model for joint Chinese word segmentation and partof-speech tagging...
3     In this paper, we propose a novel string-todependency algorithm for statistical machine translat...
4     We present a novel transition system for dependency parsing, which constructs arcs only between ...
                                                     ...                                                 
95    We present an algorithm for anaphora res- olutkm which is a modified and extended version of tha...
96    I:n this paper, we describe a new corpus-based ap- proach to prepositional phrase attachment dis...
97    computation of preferthe admissible argument values for a relation, is a well-known NLP task wit...
98    If we take an existing supervised NLP sy

#### Abstractive Summarization with BART

In [None]:
abstractive_text = pd.read_csv("extractive_summarized_dataframe_final.csv")
abstractive_text.drop("Unnamed: 0", axis=1, inplace=True)
abstractive_text.head()

Unnamed: 0,abstract,full_text,conclusion,num_stopwords,num_punctuations,num_numerics,num_words,avg_word_length,stopwords_to_words_ratio,extractive_summarized_text
0,Word lattice decoding has proven useful in spo...,Word lattice decoding has proven useful in spo...,We have achieved substantial gains in translat...,1330,723,22,3826,5.4,0.348,Word lattice decoding has proven useful in spo...
1,We formulate the problem of nonprojective depe...,We formulate the problem of nonprojective depe...,We presented new dependency parsers based on c...,1615,1209,13,4683,5.1,0.345,We formulate the problem of nonprojective depe...
2,We propose a cascaded linear model for joint C...,We propose a cascaded linear model for joint C...,We proposed a cascaded linear model for Chines...,1315,707,27,3573,5.2,0.368,We propose a cascaded linear model for joint C...
3,"In this paper, we propose a novel string-todep...","In this paper, we propose a novel string-todep...","In this paper, we propose a novel string-todep...",1306,880,22,3774,5.2,0.346,"In this paper, we propose a novel string-todep..."
4,We present a novel transition system for depen...,We present a novel transition system for depen...,We have presented a novel transition system fo...,1728,882,12,4416,4.9,0.391,We present a novel transition system for depen...


In [None]:
text_class = TextSummarizer(abstractive_text)

In [None]:
text_class.join_extracted_summary("abstract", "extractive_summarized_text", "conclusion")

Unnamed: 0,abstract,full_text,conclusion,num_stopwords,num_punctuations,num_numerics,num_words,avg_word_length,stopwords_to_words_ratio,extractive_summarized_text,combined_text
0,Word lattice decoding has proven useful in spo...,Word lattice decoding has proven useful in spo...,We have achieved substantial gains in translat...,1330,723,22,3826,5.4,0.348,Word lattice decoding has proven useful in spo...,Word lattice decoding has proven useful in spo...
1,We formulate the problem of nonprojective depe...,We formulate the problem of nonprojective depe...,We presented new dependency parsers based on c...,1615,1209,13,4683,5.1,0.345,We formulate the problem of nonprojective depe...,We formulate the problem of nonprojective depe...
2,We propose a cascaded linear model for joint C...,We propose a cascaded linear model for joint C...,We proposed a cascaded linear model for Chines...,1315,707,27,3573,5.2,0.368,We propose a cascaded linear model for joint C...,We propose a cascaded linear model for joint C...
3,"In this paper, we propose a novel string-todep...","In this paper, we propose a novel string-todep...","In this paper, we propose a novel string-todep...",1306,880,22,3774,5.2,0.346,"In this paper, we propose a novel string-todep...","In this paper, we propose a novel string-todep..."
4,We present a novel transition system for depen...,We present a novel transition system for depen...,We have presented a novel transition system fo...,1728,882,12,4416,4.9,0.391,We present a novel transition system for depen...,We present a novel transition system for depen...
...,...,...,...,...,...,...,...,...,...,...,...
95,We present an algorithm for anaphora res- olut...,We present an algorithm for anaphora res- olut...,Quantitative evaluation shows the anaphora res...,1794,1286,44,5024,5.2,0.357,We present an algorithm for anaphora res- olut...,We present an algorithm for anaphora res- olut...
96,"I:n this paper, we describe a new corpus-based...","I:n this paper, we describe a new corpus-based...",Prel)ositioual phrase attachment disambiguatio...,742,1185,31,2528,4.9,0.294,"I:n this paper, we describe a new corpus-based...","I:n this paper, we describe a new corpus-based..."
97,computation of preferthe admissible argument v...,computation of preferthe admissible argument v...,We have presented an application of topic mode...,1731,1049,38,4870,5.5,0.355,computation of preferthe admissible argument v...,computation of preferthe admissible argument v...
98,"If we take an existing supervised NLP system, ...","If we take an existing supervised NLP system, ...",Word features can be learned in advance in an ...,1568,1140,39,4740,5.4,0.331,"If we take an existing supervised NLP system, ...","If we take an existing supervised NLP system, ..."
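
One caveat with joining the three columns: `" ".join` only accepts strings, so a missing abstract or conclusion (which pandas represents as a float `NaN` after a CSV round-trip) would raise a `TypeError`. The sketch below shows one way to combine the sections while skipping missing or empty parts. `combine_sections` is a hypothetical helper, not part of the class above.

```python
def combine_sections(abstract, extracted_summary, conclusion):
    """Join the non-empty string sections with single spaces."""
    parts = [abstract, extracted_summary, conclusion]
    # Keep only real, non-blank strings; None and float NaN are dropped
    cleaned = [part.strip() for part in parts if isinstance(part, str) and part.strip()]
    return " ".join(cleaned)


print(combine_sections("An abstract.", float("nan"), "  A conclusion. "))
# An abstract. A conclusion.
```

With a dataframe, the same function could be applied row by row, for example with `DataFrame.apply(..., axis=1)`.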


In [None]:
%%time
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
abstractive_summarized_text = text_class.abstractive_summarizer(summarizer, "combined_text")

# Save to csv
abstractive_summarized_text.to_csv("abstractive_summarized_dataframe_final.csv")
abstractive_summaries = abstractive_summarized_text["abstractive_summaries"]


CPU times: user 1h 39min 20s, sys: 1min 22s, total: 1h 40min 42s
Wall time: 1h 39min 59s


In [None]:
abstractive_summaries

0     We show that prior work in translating lattice...
1     We formulate the problem of nonprojective depe...
2     We propose a cascaded linear model for joint C...
3     In this paper, we propose a novel string-todep...
4     System constructs arcs only between adjacent w...
                            ...                        
95    We present an algorithm for anaphora res- olut...
96    We describe a new corpus-based ap- proach to p...
97     computation of preferthe admissible argument ...
98    If we take an existing supervised NLP system, ...
99    In order to parse a sentence x, it suffices to...
Name: abstractive_summaries, Length: 100, dtype: object
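
The fallback in `abstractive_summarizer` shortens over-long inputs with `text[:1024]`, which keeps the first 1024 characters and can cut a word in half. A slightly gentler alternative, sketched below, truncates on word boundaries instead. `truncate_words` and its `max_words=700` budget are hypothetical choices for illustration; a word count is only a rough proxy for the model's token limit, since one word may map to several tokens.

```python
def truncate_words(text, max_words=700):
    """Keep at most max_words whitespace-separated words of text."""
    words = text.split()
    if len(words) <= max_words:
        return text
    # Re-join with single spaces; the original whitespace is not preserved
    return " ".join(words[:max_words])


long_text = "word " * 1000
shortened = truncate_words(long_text)
print(len(shortened.split()))  # 700
```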

## Model Deployment with Streamlit and Localtunnel <a name="s4"></a>

[Streamlit](https://streamlit.io/) is an open-source app framework for Machine Learning and Data Science teams. It turns data scripts into shareable web apps in minutes. 

To deploy the model with Streamlit, I created a Python file, `app.py`, containing the following code:

```
from transformers import AutoConfig, AutoModel, AutoTokenizer, pipeline
from summarizer import Summarizer
from summarizer.text_processors.coreference_handler import CoreferenceHandler
import streamlit as st

st.title('Extractive and Abstractive Text Summarization')
st.markdown('Using BERT and BART Transformer Models')

text = st.text_area('Please Input a Long Scientific Text')
abstract = st.text_area("Please Input Scientific Text Abstract")
conclusion = st.text_area("Please Input Scientific Text Conclusion")

pretrained_model = 'allenai/scibert_scivocab_uncased'

max_length = 750
min_length = 250

@st.cache(suppress_st_warning=True)
def get_summary(text, abstract, conclusion, pretrained_model):
    # Extractive Summarizer
    # Load model, model config and tokenizer via Transformers
    custom_config = AutoConfig.from_pretrained(pretrained_model)
    custom_config.output_hidden_states=True
    custom_tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
    custom_model = AutoModel.from_pretrained(pretrained_model, config=custom_config)

    # Create pretrained-model object
    extractive_model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)

    # Abstractive Summarizer
    abstractive_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    optimal_num_sentences = extractive_model.calculate_optimal_k(text, k_max=10)
    extractive_summarized_text = "".join(extractive_model(text, num_sentences=optimal_num_sentences))
    
    text_list = [abstract, extractive_summarized_text, conclusion]
    joined_text = " ".join(text_list)

    abstractive_summary = abstractive_summarizer(joined_text, max_length=max_length, min_length=min_length, 
                                                do_sample=False)[-1]["summary_text"]
    st.write("Summary")
    st.success(abstractive_summary)

if st.button("Summarize"):
    get_summary(text, abstract, conclusion, pretrained_model)
```

The code above in `app.py` performs the following:
1. Creates Streamlit text areas where the user pastes a long scientific text along with its abstract and conclusion.
2. Creates a `Summarize` button for the user to click.
3. Once `Summarize` is clicked, the model performs extractive summarization followed by abstractive summarization, and the result is presented to the user.

To share this on the web, I use [localtunnel](https://github.com/localtunnel/localtunnel). Localtunnel exposes your localhost to the internet for easy testing and sharing, so there is no need to configure DNS or deploy just to let others try your changes. It also works well with browser testing tools such as Browserling and with external API callback services such as Twilio, which require a public URL for callbacks.

With localtunnel, I am able to serve the Streamlit interface on the web. 

To achieve all this, first ensure `app.py` is in the same directory as this notebook. Then, run the code blocks below. 

In [5]:
!pip install streamlit


In [9]:
!streamlit run app.py & npx localtunnel --port 8501

2022-03-26 15:44:56.167 INFO    numexpr.utils: NumExpr defaulting to 2 threads.
npx: installed 22 in 4.559s

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.2:8501
  External URL: http://34.68.157.228:8501

your url is: https://calm-hound-63.loca.lt

These code blocks expose the app to the web and print a URL that you can share to give others access to the Streamlit app. 

After that, enjoy testing out the Streamlit app :) 

## Recommendations <a name="s5"></a>

To improve this project, I recommend the following:
1. Given more computational resources, use coreference handlers to improve the semantic capabilities of the summarizer.
2. Fine-tune the BART abstractive summarizer on the other papers in the Scisumm corpus, each paired with its human-written summary. This should improve the summarizer's performance on the Scisumm dataset and, in turn, on scientific documents. 
3. Try other transformer models and compare the resulting summaries to find the best model for this dataset.
4. Research methods to reduce inference time in the deployed Streamlit app.  
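
As one possible starting point for the last recommendation: in `app.py` the models are loaded inside `get_summary`, so they would be reloaded whenever the function body runs for a new input. The sketch below illustrates the general idea of loading once and reusing the result, using the standard library's `functools.lru_cache` as a stand-in. `load_models` and `LOAD_CALLS` are hypothetical, and the "model" here is a plain dictionary rather than a real transformer. Streamlit ships its own caching decorators for this purpose; check the documentation for the installed version.

```python
import functools

LOAD_CALLS = []  # records every simulated model load, for demonstration only


@functools.lru_cache(maxsize=None)
def load_models(pretrained_model):
    """Stand-in for the expensive from_pretrained() and pipeline() calls."""
    LOAD_CALLS.append(pretrained_model)
    return {"name": pretrained_model}


def get_summary(text, pretrained_model):
    # The loader is cached, so repeated calls reuse the same model object
    models = load_models(pretrained_model)
    # Placeholder "summary": the real app would run both summarizers here
    return f"[{models['name']}] {text[:40]}"


get_summary("first document", "allenai/scibert_scivocab_uncased")
get_summary("second document", "allenai/scibert_scivocab_uncased")
print(len(LOAD_CALLS))  # 1
```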

**Thank you very much for checking this project out! I hope you learned something, and have a nice day!**