<a href="https://colab.research.google.com/github/fengyunzaidushi/Anima/blob/main/led_base_demo_token_batching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> long-form summarization with token batching & LED-base </center>

An example for [pszemraj/led-base-book-summary](https://huggingface.co/pszemraj/led-base-book-summary):

- works well for long and/or dense text summarization **because it is trained on [BookSum](https://github.com/salesforce/booksum) and has "learned" explanatory summarization**
- if you are on free-tier Colab, tips on how to adjust this notebook to ensure it runs are marked lower down in <font color="orange">**orange**</font>



by [Peter Szemraj](https://peterszemraj.ch/)


_function design/implementation for the LED decoding largely based on [this notebook](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing) by Patrick von Platen (da real MVP)_

---


In [None]:
#@title define model, text file link

#@markdown you can also try the larger `pszemraj/led-large-book-summary`

hf_tag = "pszemraj/led-base-book-summary" #@param ["pszemraj/led-large-book-summary", "pszemraj/led-base-book-summary"] {allow-input: true}

example_url = "https://www.dropbox.com/s/trm1xb6rdjdkgt9/mkdl-An%20Introduction%20to%20Deep%20Reinforcement%20Learning.txt?dl=1" #@param {type:"string"}


# setup

In [None]:
#@title GPU info
#@markdown - usage of the model requires GPU, go to Runtime -> Change runtime type if needed
!nvidia-smi

Thu Sep  8 13:13:26 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
#@markdown add auto-Colab formatting with `IPython.display`
from IPython.display import HTML, display
# colab formatting
def set_css():
    display(
        HTML(
            """
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  """
        )
    )

get_ipython().events.register("pre_run_cell", set_css)

In [None]:
!pip install -U datasets transformers ninja -q
!pip install -U sentencepiece -q
!pip install clean-text[gpl] -q



In [None]:
from transformers import pipeline
import torch
from cleantext import clean
from pathlib import Path

_device = 0 if torch.cuda.is_available() else -1


In [None]:
#@markdown setup logging
import logging
from pathlib import Path
das_logfile = Path.cwd() / "summarize_tokenbatches.log"

logging.basicConfig(
    level=logging.INFO,
    filename=das_logfile,
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",  # 24-hour clock, so AM/PM timestamps are not ambiguous
)
print(f'logfile is at:\n\n{das_logfile}')

logfile is at:

/content/summarize_tokenbatches.log


In [None]:
import re
from urllib.parse import unquote, urlparse

import requests

#@markdown define `get_filename`

def get_filename(url: str) -> str:
    """
    Parses a URL string to find the filename of the file that it downloads.
    """
    # take the last path component; urlparse drops any query string such as "?dl=1"
    # and unquote decodes percent-escapes (e.g. "%20" -> " ")
    last_part = unquote(urlparse(url).path.split('/')[-1])

    if '.' in last_part:
        # the last part already looks like a file name: split stem and suffix
        file_head, suffix = last_part.split('.', maxsplit=1)
    else:
        # no extension in the URL: infer one from the Content-Type header
        file_head = last_part
        r = requests.get(url, stream=True)
        content_type = r.headers['content-type']
        # drop parameters such as "; charset=utf-8" before taking the subtype
        suffix = content_type.split(';')[0].split('/')[-1].strip()
        r.close()

    # remove all chars that are neither alphanumeric nor whitespace
    file_stem = re.sub(r'[^\w\s]', '', file_head)
    # replace each run of whitespace with a single underscore
    file_stem = re.sub(r'\s+', '_', file_stem)

    return file_stem + '.' + suffix

## load model and tokenizer

In [None]:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(
    hf_tag,
    use_cache=False,
).to('cuda')
model.gradient_checkpointing_enable()


Some weights of LEDForConditionalGeneration were not initialized from the model checkpoint at pszemraj/led-base-book-summary and are newly initialized: ['led.encoder.embed_tokens.weight', 'led.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from datasets import load_dataset
from tqdm.auto import tqdm
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    hf_tag,
)



# Functions & Params



In [None]:
#@markdown `get_timestamp(detailed=False)`

from datetime import datetime

def get_timestamp(detailed=False):
    """
    get_timestamp - returns a timestamp in string format

    detailed : bool, optional, default False, if True, returns a timestamp with seconds
    """
    if detailed:
        return datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    else:
        return datetime.now().strftime("%b-%d-%Y")

get_timestamp(True)

'2022-09-08_13-14-45'

### parameters


In [None]:
#@markdown <font color="orange"> decrease `token_batch_length` if you run out of memory (OOM) </font>
token_batch_length = 4096 #@param ["16384", "8192", "4096", "3072", "2048"] {type:"raw"}
batch_stride = 20 #@param {type:"integer"}

session_settings = {}
session_settings['token_batch_length'] = token_batch_length
session_settings['batch_stride'] = batch_stride



In [None]:
#@title generation parameters

#@markdown - <font color="orange"> decrease `token_batch_length` if you run out of memory (OOM) </font>
#@markdown - <font color="orange"> decrease `number_beams` if you run out of memory (OOM) </font>

number_beams = 12 #@param ["16", "12", "8", "4"] {type:"raw"}
min_length = 32 #@param {type:"integer"}
max_len_ratio = 5 #@param {type:"slider", min:2, max:10, step:0.25}
length_penalty = 0.5 #@param {type:"number"}

# long batches combined with many beams are memory-hungry, so cap the beam count
if token_batch_length > 8192 and number_beams > 8:
    logging.info(f'number_beams={number_beams} is too high for token_batch_length={token_batch_length}, reducing to 8')
    number_beams = 8
settings = {
    'min_length': min_length,  # use the form value instead of a hard-coded 32
    'max_length': int(token_batch_length // max_len_ratio),
    'no_repeat_ngram_size': 3,
    'encoder_no_repeat_ngram_size': 4,
    'repetition_penalty': 3.7,
    'num_beams': number_beams,
    'length_penalty': length_penalty,
    'early_stopping': True,
    'do_sample': False,
}
logging.info(f"using textgen params:\n\n{settings}")
session_settings['num_beams'] = number_beams
session_settings['length_penalty'] = length_penalty
session_settings['max_len_ratio'] = max_len_ratio
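
How the form values turn into the final generation limits can be sketched on its own. The helper below, `derive_generation_limits`, is a hypothetical name used only for illustration. It mirrors the arithmetic of the cell above: the batch length is integer-divided by `max_len_ratio`, and the beam count is capped at 8 for batches longer than 8192 tokens.

```python
def derive_generation_limits(token_batch_length, max_len_ratio, number_beams):
    """Hypothetical helper mirroring the parameter cell above (illustration only)."""
    if max_len_ratio <= 0:
        raise ValueError("max_len_ratio must be positive")
    # same rule as the notebook: very long batches get at most 8 beams
    if token_batch_length > 8192 and number_beams > 8:
        number_beams = 8
    # the summary of one batch may be at most 1/max_len_ratio of the batch length
    max_length = int(token_batch_length // max_len_ratio)
    return {"max_length": max_length, "num_beams": number_beams}


# compare a few batch lengths using the notebook defaults (ratio 5, 12 beams)
for length in (2048, 4096, 16384):
    print(length, derive_generation_limits(length, 5, 12))
```

With the defaults (`token_batch_length = 4096`, `max_len_ratio = 5`), each batch summary is capped at 819 tokens.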

## define generation functions


In [None]:
#@markdown single-shot fns
#@markdown - def `generate_answer()`

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]



def generate_answer(batch, **kwargs):
    """Summarize batch["text"] in a single pass (input truncated/padded to 16384 tokens)."""
    inputs_dict = tokenizer(
        batch["text"],
        padding="max_length",
        max_length=16384,
        return_tensors="pt",
        truncation=True,
        add_special_tokens=False,
    )

    input_ids = inputs_dict.input_ids.to("cuda")
    attention_mask = inputs_dict.attention_mask.to("cuda")
    global_attention_mask = torch.zeros_like(attention_mask)
    # put global attention on the first token of each sequence
    # (no <s> is prepended here because add_special_tokens=False)
    global_attention_mask[:, 0] = 1

    predicted_abstract_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
        remove_invalid_values=True,  # a generate() option; it has no effect in batch_decode
        **kwargs,
    )
    batch["summary"] = tokenizer.batch_decode(
        predicted_abstract_ids,
        skip_special_tokens=True,
    )
    return batch

In [None]:
# @title token batch summarization

#@markdown - def `summarize_and_score(ids, mask, **kwargs)`
#@markdown - def `summarize_via_tokenbatches(input_text:str, batch_length=8192, batch_stride=16, **kwargs,)`
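def summarize_and_score(ids, mask, **kwargs):
    """Summarize one token batch and return (list of summaries, beam score)."""
    # add a leading batch dimension: (seq_len,) -> (1, seq_len)
    ids = ids[None, :]
    mask = mask[None, :]

    input_ids = ids.to("cuda")
    attention_mask = mask.to("cuda")
    global_attention_mask = torch.zeros_like(attention_mask)
    # put global attention on the first token of the batch
    # (no <s> is prepended here because add_special_tokens=False)
    global_attention_mask[:, 0] = 1

    summary_pred_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        global_attention_mask=global_attention_mask,
        output_scores=True,
        return_dict_in_generate=True,
        remove_invalid_values=True,  # a generate() option; it has no effect in batch_decode
        **kwargs,
    )
    summary = tokenizer.batch_decode(
        summary_pred_ids.sequences,
        skip_special_tokens=True,
    )
    # sequences_scores holds one beam-search score per returned sequence
    score = round(float(summary_pred_ids.sequences_scores.cpu().numpy()[0]), 4)

    return summary, score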

def summarize_via_tokenbatches(
        input_text:str,
        batch_length=8192,
        batch_stride=16,
        **kwargs,
    ):

    encoded_input = tokenizer(
                        input_text,
                        padding='max_length',
                        truncation=True,
                        max_length=batch_length,
                        stride=batch_stride,
                        return_overflowing_tokens=True,
                        add_special_tokens =False,
                        return_tensors='pt',
                    )

    in_id_arr, att_arr = encoded_input.input_ids, encoded_input.attention_mask
    gen_summaries = []

    pbar = tqdm(total=len(in_id_arr))

    for _id, _mask in zip(in_id_arr, att_arr):

        result, score = summarize_and_score(
            ids=_id,
            mask=_mask,
            **kwargs,
        )
        score = round(float(score),4)
        _sum = {
            "input_tokens":_id,
            "summary":result,
            "summary_score":score,
        }
        gen_summaries.append(_sum)
        print(f"\t{result[0]}\nScore:\t{score}")
        pbar.update()

    pbar.close()

    return gen_summaries

---


# summarize - single file

In [None]:
#@markdown `wget` the text file
example_path = get_filename(example_url)
example_path = Path(example_path)
!wget $example_url -O $example_path -q

In [None]:
#@markdown read in single file text as `long_text`
with open(example_path, 'r', encoding='utf-8', errors='ignore') as f:
    raw_text = f.read()

long_text = clean(raw_text, lower=False)
# original length minus cleaned length = number of chars removed
logging.info(f"removed {len(raw_text) - len(long_text)} chars via cleaning")
batch = {}
batch['text'] = long_text


encoded_input = tokenizer(
    long_text,
    padding='max_length',
    truncation=True,
    max_length=token_batch_length,
    stride=batch_stride,
    return_overflowing_tokens=True,
    add_special_tokens =False,
    return_tensors='pt',
)

In [None]:
_summaries = summarize_via_tokenbatches(
    long_text,
    batch_length=token_batch_length,
    batch_stride=batch_stride,
    **settings,
)


  0%|          | 0/2 [00:00<?, ?it/s]

	Richard Sutton and Andrew Barto present a new class in reinforcement learning, called "reinforcement learning". In this chapter, they explain how agents learn to behave in an environment by performing actions and/or seeing the results. They also discuss the Markov property, i.e., the agent's ability to predict what action is going to be taken based on the behavior of the environment around him. This information can be used to predict whether an agent will perform well in a given task or not. The next step is to discount the expected cumulative expected reward for each action taken.
Score:	-9.8031
	In this chapter, Jim explains how Reinforcement Learning works. He uses the concept of "reinforcement learning" to describe his approach to training an agent: "learning from action." In other words, he uses a neural network to learn from the environment and then discount the rewards in order to get the best possible return. When it comes time to train an agent, he/she discounts the rewards b

In [None]:
#@markdown write the `_summaries` var to a `.txt`
sum_text = [s["summary"][0] for s in _summaries]
sum_scores = [f"\n - {round(s['summary_score'],4)}" for s in _summaries]
scores_text = "\n".join(sum_scores)
full_summary = "\n\t".join(sum_text)
_outpath = f"SUM_{example_path.name}"

with open(_outpath, "w", encoding="utf-8") as fo:
    # full_summary and scores_text are single strings, so write() is the right call
    fo.write(full_summary)
    fo.write("\n" * 3)
    fo.write(f"\n\nSection Scores for {example_path.name}:\n")
    fo.write(scores_text)
    fo.write("\n\n---\n")


In [None]:
# print out the summarized output!
!cat $_outpath

Richard Sutton and Andrew Barto present a new class in reinforcement learning, called "reinforcement learning". In this chapter, they explain how agents learn to behave in an environment by performing actions and/or seeing the results. They also discuss the Markov property, i.e., the agent's ability to predict what action is going to be taken based on the behavior of the environment around him. This information can be used to predict whether an agent will perform well in a given task or not. The next step is to discount the expected cumulative expected reward for each action taken.
	In this chapter, Jim explains how Reinforcement Learning works. He uses the concept of "reinforcement learning" to describe his approach to training an agent: "learning from action." In other words, he uses a neural network to learn from the environment and then discount the rewards in order to get the best possible return. When it comes time to train an agent, he/she discounts the rewards because they are 

<font color="salmon"> note that the second summary, which is arguably of lower quality, also received the more negative score. _(for LED-base, the scores tend to be "fairly similar" though)_ </font>
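
As a rough guide to reading these scores: for beam search in `transformers`, the value in `sequences_scores` is the sum of the token log-probabilities of the finished beam, divided by the sequence length raised to `length_penalty`. The sketch below reproduces that arithmetic in plain Python. `beam_score` and the per-token probabilities are made up for illustration, and the exact length used in the denominator can differ between `transformers` versions.

```python
import math


def beam_score(token_logprobs, length_penalty=0.5):
    """Hypothetical re-implementation of a length-normalized beam score."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)


# two made-up 100-token summaries: one generated confidently, one less so
confident = [math.log(0.9)] * 100
hesitant = [math.log(0.6)] * 100

print("confident:", round(beam_score(confident), 4))
print("hesitant: ", round(beam_score(hesitant), 4))
```

Scores closer to zero mean the model assigned a higher probability to its own output. With `length_penalty` below 1, a longer summary with the same per-token confidence gets a more negative score, so scores are easiest to compare between summaries of similar length.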


---


# summarize - txt directory

> demonstrate use case of iterating over all text files in a directory


## load

In [None]:
zip_url = "https://www.dropbox.com/sh/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa?dl=1" #@param {type:"string"}
target_path = "/content/source-text" #@param {type:"string"}
target_path = Path(target_path)
zip_fname = 'temp.zip'

# -f: do not complain if the paths do not exist yet (e.g. on a first run)
!rm -rf $target_path
!rm -f $zip_fname

!wget $zip_url -O $zip_fname
!unzip -q $zip_fname -d $target_path

--2022-09-08 13:17:40--  https://www.dropbox.com/sh/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /sh/dl/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa [following]
--2022-09-08 13:17:40--  https://www.dropbox.com/sh/dl/c03o2gpcvh6v3yz/AACaxe00trjjuV4Zmdpn1JtOa
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uca6f867c4703513276a7329edfe.dl.dropboxusercontent.com/zip_download_get/BP8Xa4PvAtG8HBlmRu612wxydmnUpj3_pHuwhX18b4Fi8eRVFg130CRWx1S6wvLeDhLDlTHnV69NkG-6T7J5FIRfKUNEptGCUpnq-JLiWHmnpw?dl=1# [following]
--2022-09-08 13:17:41--  https://uca6f867c4703513276a7329edfe.dl.dr

In [None]:
import random

txt_files = [f for f in target_path.iterdir() if f.is_file() and f.suffix == '.txt']
output_dir = target_path.parent / "summarized-text"
output_dir.mkdir(exist_ok=True)

random.SystemRandom().shuffle(txt_files)
txt_files


[PosixPath('/content/source-text/mkdl-Introducing Hugging Face for Education.txt'),
 PosixPath('/content/source-text/mkdl-An Introduction to Q-Learning Part 1.txt'),
 PosixPath('/content/source-text/mkdl-Introducing Decision Transformers.txt')]

## summarize

- runs the model in a loop over each text file in the directory.
- for each text file: the text is split into batches of at most `token_batch_length` tokens, and consecutive batches overlap by the `batch_stride` defined earlier.
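
The overlapping batches described above can be sketched in plain Python. `token_windows` is a hypothetical helper, not part of `transformers`. It assumes that each new window starts `length - stride` tokens after the previous one, so that consecutive windows share `stride` tokens, and it leaves out the padding that the tokenizer applies to the last batch.

```python
def token_windows(ids, length, stride):
    """Split a list of token ids into windows of `length` that overlap by `stride`."""
    if length <= 0 or not 0 <= stride < length:
        raise ValueError("need length > 0 and 0 <= stride < length")
    step = length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(ids[start:start + length])
        if start + length >= len(ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# toy example: 10 "tokens", windows of 4 with an overlap of 1
for window in token_windows(list(range(10)), length=4, stride=1):
    print(window)
```

A small overlap such as the default `batch_stride = 20` gives the model a little shared context at each batch boundary without re-summarizing much text.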



In [None]:
#@markdown summarize loop

for f in tqdm(txt_files, total=len(txt_files)):

    with open(f, 'r', encoding='utf-8', errors='ignore') as fi:
        text = clean(fi.read(), lower=False)
    print(f"\nNow summarizing:\t{f.name}")
    _summaries = summarize_via_tokenbatches(
                    text,
                    batch_length=token_batch_length,
                    batch_stride=batch_stride,
                    **settings,
                )
    sum_text = [s['summary'][0] for s in _summaries]
    sum_scores = [f"\n - {round(s['summary_score'],4)}" for s in _summaries]
    scores_text = "\n".join(sum_scores)
    full_summary = "\n\t".join(sum_text)
    with open(output_dir / f"SUM_{f.name}", 'w', encoding='utf-8') as fo:
        # full_summary is a single string, so write() is the right call
        fo.write(full_summary)
    with open(output_dir / "SESSION_SCORES.log", 'a', encoding='utf-8') as fo:
        fo.write("\n"*3)
        fo.write(f"\n\nSection Scores for {f.name}:\n")
        fo.write(scores_text)
        fo.write("\n\n---\n")

  0%|          | 0/3 [00:00<?, ?it/s]


Now summarizing:	mkdl-Introducing Hugging Face for Education.txt


  0%|          | 0/1 [00:00<?, ?it/s]

	The launch of "Open Source and Open Science" on March 23, 2022 marks a significant shift in the direction of machine learning. The company now focuses on educating people from all walks of life about the potential for machine learning to improve their lives. To do this, they've launched an education tour in March with experts from the Machine Learning Lab at 16 universities in 16 countries. In addition, they're running a training tour in Argentina that will help students learn how to use open-source machine learning tools. There's also a one-week mini-learning tour in Buenos Aires scheduled for June 6.
Score:	-10.0766

Now summarizing:	mkdl-An Introduction to Q-Learning Part 1.txt


  0%|          | 0/2 [00:00<?, ?it/s]

	In this chapter, Thomas Simonini explains how he trains his first Q-learning agent in two different environments: Frozen Lake and non-slippery Lake. In both environments, he trains the agent to go from starting state to goal state by walking only barely on frozen tiles and avoiding holes. When the agent reaches the goal state, it will be able to take actions according to a set set of expected cumulative rewards. To train an agent to make smart decisions, he uses "policy-based methods", i.e. "training" him directly to learn which actions to take. By training these methods, he can predict whether an agent will get rewarded for making smart decisions based on what happens in the real world. Simultaneously, each method has its own set of predictable outcomes. For example, if an agent learns to play a game, there are three possible outcomes: 1) A reward is earned; 2) An action is lost; 3) An agent's reward is gained; 4) It is time to leave the game; 5) The behavior of the mouse is continuo

  0%|          | 0/1 [00:00<?, ?it/s]

	Hugging Face introduces Decision Transformers, an offline reinforcement learning method that uses sequence modeling to predict the behavior of decision-makers in order to predict future outcomes. In this paper, the authors explain how they can use a trained Decision Transformer to predict performance in real-world environments.
Score:	-6.0382


---


## export summarized dir

- this downloads a `.zip` of the summarized text files.
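
Outside Colab, where the `zip` command or `google.colab` may not be available, the same archive can be built with the standard library. This is a minimal sketch: `zip_directory` and the demo file names are hypothetical, and it uses a temporary directory in place of `/content/summarized-text`.

```python
import shutil
import tempfile
from pathlib import Path


def zip_directory(src_dir, archive_stem):
    """Zip the contents of src_dir into <archive_stem>.zip and return its path."""
    src_dir = Path(src_dir)
    if not src_dir.is_dir():
        raise NotADirectoryError(src_dir)
    # make_archive appends the ".zip" suffix itself and returns the archive name
    archive = shutil.make_archive(str(archive_stem), "zip", root_dir=src_dir)
    return Path(archive)


# demo with a throwaway directory standing in for the summarized-text folder
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "summarized-text"
    demo_dir.mkdir()
    (demo_dir / "SUM_example.txt").write_text("a short summary", encoding="utf-8")
    archive_path = zip_directory(demo_dir, Path(tmp) / "summarized_textdir_demo")
    print(archive_path.name, archive_path.exists())
```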



In [None]:
output_zip_tag = "hf_demo" #@param {type:"string"}
output_zip = f"summarized_textdir_{output_zip_tag}.zip"
output_zip = Path(output_zip)
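# separate the tag and the date with an underscore, e.g. summarized_textdir_hf_demo_Sep-08-2022.zip
output_zip = Path(f"{output_zip.stem}_{get_timestamp()}.zip")
# zip the directory the summarize loop wrote to (/content/summarized-text by default)
!zip -r -q -9 $output_zip $output_dir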

In [None]:
from google.colab import files
files.download(output_zip)
