# Document Summarization

The idea of document summarization is a
bit different from keyphrase extraction or topic modeling. In this case, the end result
is still in the form of some document, but with a few sentences based on the length we
might want the summary to be. This is similar to an abstract or an executive summary
in a research paper. The main objective of automated document summarization is
to perform this summarization without involving human input, except for running
computer programs. Mathematical and statistical models help in building and
automating the task of summarizing documents by observing their content and context.

There are two broad approaches to document summarization using automated
techniques. They are described as follows:
- __Extraction-based techniques:__ These methods use mathematical
and statistical concepts like SVD to extract some key subset of the
content from the original document such that this subset of content
contains the core information and acts as the focal point of the entire
document. This content can be words, phrases, or even sentences.
The end result from this approach is a short executive summary of a
couple of lines extracted from the original document. No new content
is generated in this technique, hence the name extraction-based.
- __Abstraction-based techniques:__ These methods are more complex
and sophisticated. They leverage language semantics to create
representations and use natural language generation (NLG)
techniques where the machine uses knowledge bases and semantic
representations to generate text on its own and create summaries
just like a human would write them. Thanks to deep learning, we can
implement these techniques easily but they require a lot of data and
compute.

We will cover extraction based methods here due to constraints of needed a lot of data + compute for abstraction based methods. But you can leverage the seq2seq models you learnt in language translation on an appropriate dataset to build deep learning based abstractive summarizers

# Extractive Summarization with Transformers

This method utilizes the HuggingFace transformers library to run extractive summarizations. 

This works by first embedding the sentences, then running a clustering algorithm, finding the sentences that are closest to the cluster's centroids. 

This library also uses coreference techniques, utilizing the https://github.com/huggingface/neuralcoref library to resolve words in summaries that need more context. The greedyness of the neuralcoref library can be tweaked in the CoreferenceHandler class.

__Library Repo:__ https://github.com/dmmiller612/bert-extractive-summarizer
__Paper:__ https://arxiv.org/abs/1906.04165


### Transformer Training Process (Already Pre-trained)

![](https://i.imgur.com/RMuSvTL.png)


### What is a Transformer?

The Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. It is a stacked layer of encoders and decoders.

![](https://i.imgur.com/e0XratS.png)

__Source:__ http://jalammar.github.io/illustrated-transformer/

### Transformer Architecture

Stacked encoder - decoder architecture with multi-head attention blocks

![](https://i.imgur.com/LUFXrLM.png)

__Source:__ https://arxiv.org/abs/1706.03762

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [49]:
DOCUMENT = """
Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power. 
It is like a lever you always have when you are driving a car. So more data leads to more predictive power. 
For sophisticated models such as gradient boosted trees and random forests, quality data and feature engineering reduce the errors drastically.
But simply having more data is not useful. The saying that businesses need a lot of data is a myth. 
Large amounts of data afford simple models much more power; if you have 1 trillion data points, outliers are easier to classify and the underlying distribution of that data is clearer. 
If you have 10 data points, this is probably not the case. You’ll have to perform more sophisticated normalization and transformation routines on the data before it is useful.

The big data paradigm is the assumption that big data is a substitute for conventional data collection and analysis. 
In other words, it’s the belief (and overconfidence) that huge amounts of data is the answer to everything and that we can just train machines to solve problems automatically. 
Data by itself is not a panacea and we cannot ignore traditional analysis.

Researchers have demonstrated that massive data can lead to lower estimation variance and hence better predictive performance. 
More data increases the probability that it contains useful information, which is advantageous. 

However, not all data is always helpful. 
A good example is clickstream data utilised by e-com companies where a user’s actions are monitored and analysed. 
Such data includes parts of the page that are clicked, keywords, cookie data, cursor positions and web page components that are visible. 
This is a lot of data coming in rapidly, but only a portion is valuable in predicting a user’s characteristics and preferences. 
The rest is noise. When data are taken from human actions, noise rates usually are high due to the limitations enforced by behavioural tendencies. 
What you ideally need is a set of data points that outline the range of variations with each class that one would like to train the ML system with. 

Having more data certainly increases the accuracy of your model, but there comes a stage where even adding infinite amounts of data cannot improve any more accuracy. 
This is what we called the natural noise of the data. When you work with different ML models, we see that certain features of the data are spread on a given variance, which is a probabilistic distribution. 
Dipanjan Sarkar, Data Science Lead at Applied Materials explains, “The standard principle in data science is that more training data leads to better machine learning models. 
However what we need to remember is the ‘Garbage In Garbage Out’ principle! It is not just big data, but good (quality) data which helps us build better performing ML models. 
If we have a huge data repository with features which are too noisy or not having enough variation to capture critical patterns in the data, any ML models will effectively be useless regardless of the data volume.”

According to research, if the model is tuned with too much to the data, then it could essentially memorise the data, and that causes model overfitting, which causes high error rates for unseen data. 
If we are overfitting, we get wrong predictions and lose the focus on what’s actually important. 
An overfitting model implies that you have low bias and high variance and more data is not going to solve your problem. 
By placing too much emphasis on each data point, data scientists have to deal with a lot of noise and, therefore, lose sight of what’s really important. 
So adding more data points to the training set will not improve the model performance. 

We need big data mostly when you have a ton of features, like image processing, where there is a need for ample data sources to train a model or language models for that matter. 
According to experts, you have to find the right parameters for fancy models that generally lead to big datasets to get high accuracy. 
There are many knobs, and you have to try enough knobs in the right parts of the space that contributes to reduced training error. 

“There are no shortcuts or direct mathematical formulae to say if we have enough data. 
The only way would be to actually get out there and build relevant ML models on the data and validate based on performance metrics (which are in-line with the business metrics & KPIs) to see if we are getting a satisfactory performance,” Dipanjan further says.

More data in principle is good. 
But actually, it matters to have the right kind of data. 
Sampling training data from your actual target domain always matters. 
Even within a domain, it matters how you sample. 
So modelling choices and data sampling approach jointly matter more than just data. 
Samples must represent real-world example data that have a good chance of being encountered in the future.

The main reason why data is desirable is that it lends more information about the dataset and thus becomes valuable. 
However, if the newly created data resemble the existing data, or simply repeated data, then there is no added value of having more data. 
For example, in an online review dataset, there is not much of a lift from the large dataset because you probably do not have a lot of variables and thousands of user reviews get you the same sample.

From a pure regression standpoint and if you have a true sample, data size beyond a point does not matter. 
There is diminishing value in adding observations from a Mean Square Error standpoint, a standard way to measure the error of a model in predicting quantitative data. 

It is explicit from previous work that more data do not surely lead to greater predictive performance. 
It has been argued that sampling (decreasing the number of instances) or transformation of the data to lower the dimensional spaces (lessening the number of features) is beneficial. 
In fact, not all areas of machine learning are associated with big data. In fact, one of the most exciting and recent areas is related to making sense out of small data.

When we think of advanced models, we assume that advanced machine learning models, everything has to be learned from the data. 
There are several use cases where few data points have worked equally well using techniques like simulation, etc., semi-supervised learning, etc.  

Practically, there is research on neural network architectures that do reasonably well with just a thousand data points. 
They are not fancy but better than some machine learning methods if you have the right problem type. 
“With the advent of innovative methodologies like transfer learning, unsupervised, self-supervised and semi-supervised learning, we are seeing new areas of research being actually adapted in the industry to build better quality ML models with less (labeled) data,” tells Dipanjan Sarkar.
There is also extensive work going on in terms of techniques that reduce the requirements for data. 
They are working on building ways to pull in human experience and knowledge rather than trying to discover everything from the raw data itself.  
Organisations are focusing on building hybrid machine learning systems that combine old fashioned rule-based systems with the underlying neural architectures, and have a bi-directional flow of information that learn from logical statements. 

For smaller firms, fewer datasets would be equally desirable or preferable to more data, and there are situations where more data present expenses that are not justified by the added value of the additional data. 
Data storage is an expense, and analysts who can work with datasets that are too large to fit in memory, with the appropriate tools, are more expensive than those who cannot. 

A collection of a small dataset is good enough for answering the question of interest, and there is no incentive to collect additional data considering the practical time and financial burdens it may create. 
Hacking and privacy breaches are other possibilities with storing too much data which demands the efforts of a malicious entity to produce adverse consequences. 
There are also examples where a company may breach a privacy regulation in its quest to acquire a large dataset.
"""

In [50]:
import re

DOCUMENT = re.sub(r'\n|\r', ' ', DOCUMENT)
DOCUMENT = re.sub(r' +', ' ', DOCUMENT)
DOCUMENT = DOCUMENT.strip()

In [4]:
!pip install transformers==2.11.0
!pip install bert-extractive-summarizer

Collecting transformers==2.11.0
[?25l  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
[K     |▌                               | 10kB 22.3MB/s eta 0:00:01[K     |█                               | 20kB 6.2MB/s eta 0:00:01[K     |█▌                              | 30kB 7.1MB/s eta 0:00:01[K     |██                              | 40kB 7.9MB/s eta 0:00:01[K     |██▍                             | 51kB 6.9MB/s eta 0:00:01[K     |███                             | 61kB 7.4MB/s eta 0:00:01[K     |███▍                            | 71kB 8.4MB/s eta 0:00:01[K     |███▉                            | 81kB 8.8MB/s eta 0:00:01[K     |████▍                           | 92kB 8.3MB/s eta 0:00:01[K     |████▉                           | 102kB 8.9MB/s eta 0:00:01[K     |█████▍                          | 112kB 8.9MB/s eta 0:00:01[K     |█████▉                          | 1

# Extractive Summarization with BERT

In [5]:
from summarizer import Summarizer

In [6]:
sm = Summarizer(model='bert-large-uncased')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=434.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1344997306.0, style=ProgressStyle(descr…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




In [7]:
result = sm(body=DOCUMENT, ratio=0.15)

In [8]:
result = '\n'.join(nltk.sent_tokenize(result))
print(result)

Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power.
So more data leads to more predictive power.
However what we need to remember is the ‘Garbage In Garbage Out’ principle!
If we have a huge data repository with features which are too noisy or not having enough variation to capture critical patterns in the data, any ML models will effectively be useless regardless of the data volume.” According to experts, you have to find the right parameters for fancy models that generally lead to big datasets to get high accuracy.
Samples must represent real-world example data that have a good chance of being encountered in the future.
In fact, one of the most exciting and recent areas is related to making sense out of small data.
When we think of advanced models, we assume that advanced machine learning models, everything has to be learned from the data.
They are working on building ways to pull in human experience 

# Extractive Summarization with DistilBERT

In [9]:
sm = Summarizer(model='distilbert-base-uncased')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=442.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=267967963.0, style=ProgressStyle(descri…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




In [10]:
result = sm(body=DOCUMENT, ratio=0.15)

In [11]:
result = '\n'.join(nltk.sent_tokenize(result))
print(result)

Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power.
Large amounts of data afford simple models much more power; if you have 1 trillion data points, outliers are easier to classify and the underlying distribution of that data is clearer.
When you work with different ML models, we see that certain features of the data are spread on a given variance, which is a probabilistic distribution.
However what we need to remember is the ‘Garbage In Garbage Out’ principle!
It is not just big data, but good (quality) data which helps us build better performing ML models.
By placing too much emphasis on each data point, data scientists have to deal with a lot of noise and, therefore, lose sight of what’s really important.
The only way would be to actually get out there and build relevant ML models on the data and validate based on performance metrics (which are in-line with the business metrics & KPIs) to see if we ar

# Abstractive Summarization with BART

BART: Bidirectional and Auto-Regressive Transformers

BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by 

- (1) corrupting text with an arbitrary noising function
- (2) learning a model to reconstruct the original text. 

It uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.

![](https://i.imgur.com/wsRp0dk.png)

__Source:__ https://arxiv.org/abs/1910.13461

## Load BART Model

In [12]:
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

BART_PATH = 'facebook/bart-large-cnn'

In [13]:
bart_model = BartForConditionalGeneration.from_pretrained(BART_PATH, output_past=True)
bart_tokenizer = BartTokenizer.from_pretrained(BART_PATH, output_past=True)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1343.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1625270765.0, style=ProgressStyle(descr…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=898823.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




## Build function to chunk text 

#### (BART etc. has limitations of 1024 tokens)

In [51]:
def nest_sentences(document):

  nested = []
  sent = []
  length = 0
  for sentence in nltk.sent_tokenize(document):
    length += len(sentence)
    if length < 1024:
      sent.append(sentence)
    else:
      nested.append(sent)
      sent = []
      length = 0

  if sent:
    nested.append(sent)

  return nested

## Chunk input document into nested list of sentences

In [52]:
nested = nest_sentences(DOCUMENT)

## Sample Summarization Pipeline on a batch of sentences

In [53]:
nested[0]

['Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power.',
 'It is like a lever you always have when you are driving a car.',
 'So more data leads to more predictive power.',
 'For sophisticated models such as gradient boosted trees and random forests, quality data and feature engineering reduce the errors drastically.',
 'But simply having more data is not useful.',
 'The saying that businesses need a lot of data is a myth.',
 'Large amounts of data afford simple models much more power; if you have 1 trillion data points, outliers are easier to classify and the underlying distribution of that data is clearer.',
 'If you have 10 data points, this is probably not the case.',
 'You’ll have to perform more sophisticated normalization and transformation routines on the data before it is useful.',
 'The big data paradigm is the assumption that big data is a substitute for conventional data collection and analys

In [54]:
device = 'cuda'

In [55]:
input_tokenized = bart_tokenizer.encode(' '.join(nested[0]), truncation=True, return_tensors='pt')
input_tokenized = input_tokenized.to(device)
input_tokenized

tensor([[    0,  5423,    16, 35528,     7,   265,  2316,     6,     8,  1058,
           414,  1836,    16,    65,     9,     5,  1049, 26948,  3277,     9,
           110,  1421,    17,    27,    29, 27930,   476,     4,    85,    16,
           101,    10, 15178,    47,   460,    33,    77,    47,    32,  1428,
            10,   512,     4,   407,    55,   414,  3315,     7,    55, 27930,
           476,     4,   286, 10364,  3092,   215,    25, 43141,  5934,  3980,
             8,  9624, 14275,     6,  1318,   414,     8,  1905,  4675,  1888,
             5,  9126, 17811,     4,   125,  1622,   519,    55,   414,    16,
            45,  5616,     4,    20,   584,    14,  1252,   240,    10,   319,
             9,   414,    16,    10, 17721,     4, 13769,  5353,     9,   414,
          4960,  2007,  3092,   203,    55,   476,   131,   114,    47,    33,
           112,  4700,   414,   332,     6, 31187,  4733,    32,  3013,     7,
         36029,     8,     5,  7482,  3854,     9,  

In [56]:
summary_ids = bart_model.to('cuda').generate(input_tokenized,
                                      length_penalty=3.0,
                                      min_length=30,
                                      max_length=100)
summary_ids

tensor([[    2,     0,   133,   380,   414, 28323,    16,     5, 15480,    14,
           380,   414,    16,    10, 10268,    13,  9164,   414,  2783,     8,
          1966,     4,  5423,    16, 35528,     7,   265,  2316,     6,     8,
          1058,   414,  1836,    16,    65,     9,     5,  1049, 26948,  3277,
             9,   110,  1421,    17,    27,    29, 27930,   476,     4,   286,
         10364,  3092,   215,    25, 43141,  5934,  3980,     6,  1318,   414,
             8,  1905,  4675,  1888,     5,  9126, 17811,     4]],
       device='cuda:0')

In [57]:
output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
output

['The big data paradigm is the assumption that big data is a substitute for conventional data collection and analysis. Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power. For sophisticated models such as gradient boosted trees, quality data and feature engineering reduce the errors drastically.']

In [58]:
nested[0]

['Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power.',
 'It is like a lever you always have when you are driving a car.',
 'So more data leads to more predictive power.',
 'For sophisticated models such as gradient boosted trees and random forests, quality data and feature engineering reduce the errors drastically.',
 'But simply having more data is not useful.',
 'The saying that businesses need a lot of data is a myth.',
 'Large amounts of data afford simple models much more power; if you have 1 trillion data points, outliers are easier to classify and the underlying distribution of that data is clearer.',
 'If you have 10 data points, this is probably not the case.',
 'You’ll have to perform more sophisticated normalization and transformation routines on the data before it is useful.',
 'The big data paradigm is the assumption that big data is a substitute for conventional data collection and analys

## Build Generic Function to Summarize

In [59]:
def generate_summary(nested_sentences):
  device = 'cuda'
  summaries = []
  for nested in nested_sentences:
    input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
    input_tokenized = input_tokenized.to(device)
    summary_ids = bart_model.to('cuda').generate(input_tokenized,
                                      length_penalty=3.0,
                                      min_length=30,
                                      max_length=100)
    output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
    summaries.append(output)
  summaries = [sentence for sublist in summaries for sentence in sublist]
  return summaries

    

## Generate 1st Level Summary

In [60]:
summ = generate_summary(nested)

In [61]:
summ

['The big data paradigm is the assumption that big data is a substitute for conventional data collection and analysis. Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power. For sophisticated models such as gradient boosted trees, quality data and feature engineering reduce the errors drastically.',
 'Data by itself is not a panacea and we cannot ignore traditional analysis. Massive data can lead to lower estimation variance and hence better predictive performance. More data increases the probability that it contains useful information, which is advantageous.',
 'The standard principle in data science is that more training data leads to better machine learning models. However what we need to remember is the ‘Garbage In Garbage Out’ principle! It is not just big data, but good (quality) data which helps us build better performing ML models.',
 'An overfitting model implies that you have low bias and high va

## Generate 2nd Level Summary

In [62]:
nested_summ = nest_sentences(' '.join(summ))
nested_summ

[['The big data paradigm is the assumption that big data is a substitute for conventional data collection and analysis.',
  'Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power.',
  'For sophisticated models such as gradient boosted trees, quality data and feature engineering reduce the errors drastically.',
  'Data by itself is not a panacea and we cannot ignore traditional analysis.',
  'Massive data can lead to lower estimation variance and hence better predictive performance.',
  'More data increases the probability that it contains useful information, which is advantageous.',
  'The standard principle in data science is that more training data leads to better machine learning models.',
  'However what we need to remember is the ‘Garbage In Garbage Out’ principle!',
  'It is not just big data, but good (quality) data which helps us build better performing ML models.'],
 ['By placing too much emphasis

In [63]:
generate_summary(nested_summ)

['Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power. Data by itself is not a panacea and we cannot ignore traditional analysis. Massive data can lead to lower estimation variance and hence better predictive performance.',
 'More data in principle is good but it matters to have the right kind of data. Samples must represent real-world example data that have a good chance of being encountered in the future. For smaller firms, fewer datasets would be equally desirable or preferable.']

In [29]:
DOCUMENT

'Data is foundational to business intelligence, and training data size is one of the main determinants of your model’s predictive power. It is like a lever you always have when you are driving a car. So more data leads to more predictive power. For sophisticated models such as gradient boosted trees and random forests, quality data and feature engineering reduce the errors drastically. But simply having more data is not useful. The saying that businesses need a lot of data is a myth. Large amounts of data afford simple models much more power; if you have 1 trillion data points, outliers are easier to classify and the underlying distribution of that data is clearer. If you have 10 data points, this is probably not the case. You’ll have to perform more sophisticated normalization and transformation routines on the data before it is useful. The big data paradigm is the assumption that big data is a substitute for conventional data collection and analysis. In other words, it’s the belief (

## Another Example

In [64]:
DOCUMENT = """
Association football, more commonly known as football or soccer, is a team sport played with a spherical ball between two teams of 11 players. 
It is played by approximately 250 million players in over 200 countries and dependencies, making it the world's most popular sport. 
The game is played on a rectangular field called a pitch with a goal at each end. The object of the game is to outscore the opposition by moving the ball beyond the goal line into the opposing goal. 
The team with the higher number of goals wins the game.
Football is played in accordance with a set of rules known as the Laws of the Game. The ball is 68–70 cm (27–28 in) in circumference and known as the football. 
The two teams each compete to get the ball into the other team's goal (between the posts and under the bar), thereby scoring a goal. 
The team that has scored more goals at the end of the game is the winner; if both teams have scored an equal number of goals then the game is a draw. 
Each team is led by a captain who has only one official responsibility as mandated by the Laws of the Game: to represent their team in the coin toss prior to kick-off or penalty kicks.
Players are not allowed to touch the ball with hands or arms while it is in play, except for the goalkeepers within the penalty area. Other players mainly use their feet to strike or pass the ball, but may also use any other part of their body except the hands and the arms. 
The team that scores most goals by the end of the match wins. If the score is level at the end of the game, either a draw is declared or the game goes into extra time or a penalty shootout depending on the format of the competition.
Football is governed internationally by the International Federation of Association Football (FIFA; French: Fédération Internationale de Football Association), which organises World Cups for both men and women every four years. 
The FIFA World Cup has taken place every four years since 1930 with the exception of 1942 and 1946 tournaments, which were cancelled due to World War II. 
Approximately 190–200 national teams compete in qualifying tournaments within the scope of continental confederations for a place in the finals. 
The finals tournament, which is held every four years, involves 32 national teams competing over a four-week period. It is the most prestigious football tournament in the world as well as the most widely viewed and followed sporting event in the world, exceeding the Olympic Games.
"""

In [65]:
DOCUMENT = re.sub(r'\n|\r', ' ', DOCUMENT)
DOCUMENT = re.sub(r' +', ' ', DOCUMENT)
DOCUMENT = DOCUMENT.strip()

## Generate 1st Level Summary

In [66]:
doc_nest = nest_sentences(DOCUMENT)
doc_nest

[['Association football, more commonly known as football or soccer, is a team sport played with a spherical ball between two teams of 11 players.',
  "It is played by approximately 250 million players in over 200 countries and dependencies, making it the world's most popular sport.",
  'The game is played on a rectangular field called a pitch with a goal at each end.',
  'The object of the game is to outscore the opposition by moving the ball beyond the goal line into the opposing goal.',
  'The team with the higher number of goals wins the game.',
  'Football is played in accordance with a set of rules known as the Laws of the Game.',
  'The ball is 68–70 cm (27–28 in) in circumference and known as the football.',
  "The two teams each compete to get the ball into the other team's goal (between the posts and under the bar), thereby scoring a goal.",
  'The team that has scored more goals at the end of the game is the winner; if both teams have scored an equal number of goals then the 

In [67]:
summ = generate_summary(doc_nest)
summ

[' association football is a team sport played with a spherical ball between two teams of 11 players. It is played by approximately 250 million players in over 200 countries and dependencies. The object of the game is to outscore the opposition by moving the ball beyond the goal line into the opposing goal. The team with the higher number of goals wins the game.',
 'The FIFA World Cup has taken place every four years since 1930 with the exception of 1942 and 1946 tournaments, which were cancelled due to World War II. Players are not allowed to touch the ball with hands or arms while it is in play, except for the goalkeepers within the penalty area. Other players mainly use their feet to strike or pass the ball.',
 "The World Cup is the most prestigious football tournament in the world. The finals tournament is held every four years. It involves 32 national teams competing over a four-week period. The tournament is the world's most widely viewed sporting event."]

In [69]:
nest_summ = nest_sentences(' '.join(summ))
nest_summ

[[' association football is a team sport played with a spherical ball between two teams of 11 players.',
  'It is played by approximately 250 million players in over 200 countries and dependencies.',
  'The object of the game is to outscore the opposition by moving the ball beyond the goal line into the opposing goal.',
  'The team with the higher number of goals wins the game.',
  'The FIFA World Cup has taken place every four years since 1930 with the exception of 1942 and 1946 tournaments, which were cancelled due to World War II.',
  'Players are not allowed to touch the ball with hands or arms while it is in play, except for the goalkeepers within the penalty area.',
  'Other players mainly use their feet to strike or pass the ball.',
  'The World Cup is the most prestigious football tournament in the world.',
  'The finals tournament is held every four years.',
  'It involves 32 national teams competing over a four-week period.',
  "The tournament is the world's most widely viewe

## Generate 2nd Level Summary

In [71]:
final_summ = generate_summary(nest_summ)
nest_sentences(' '.join(final_summ))

[[' association football is a team sport played with a spherical ball between two teams of 11 players.',
  'It is played by approximately 250 million players in over 200 countries and dependencies.',
  'The object of the game is to outscore the opposition by moving the ball beyond the goal line into the opposing goal.',
  'The team with the higher number of goals wins the game.']]