# Summarize an article using the Chain of Density Prompting technique

Original article on [Advanced Stack - Technical Resources](https://advanced-stack.com/resources/how-to-summarize-using-chain-of-density-prompting.html)

In [1]:
# First do the boring stuff of converting PDF to text
# pip3 install PyPDF2

import unicodedata
import PyPDF2

from llm_core.splitters import TokenSplitter

# Open the PDF file
with open('../assets/a-path-towards-autonomous-machines.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)

    # Extract the text from the PDF
    pages = []
    for page in pdf_reader.pages:
        pages.append(page.extract_text())

    text = ''.join(pages)

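def cleanup_unicode(text):
    """Apply NFKC normalization to each character (e.g. the ligature "ﬁ" becomes "fi")."""
    corrected_chars = []
    for char in text:
        # Normalizing one character at a time expands compatibility forms
        # but does not recompose sequences that span several characters.
        corrected_char = unicodedata.normalize("NFKC", char)
        corrected_chars.append(corrected_char)
    return "".join(corrected_chars)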


article = cleanup_unicode(text)
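
Text extracted from PDFs often contains ligatures and no-break spaces, which is what the NFKC pass above cleans up. The following is a minimal sketch on hypothetical sample strings (not taken from the article) showing what NFKC does, and how normalizing the whole string in one call differs from the per-character loop:

```python
import unicodedata

# Hypothetical sample resembling PDF-extracted text:
# "\ufb01" is the single-character "fi" ligature, "\u00a0" is a no-break space.
raw = "ef\ufb01cient\u00a0learning"
cleaned = unicodedata.normalize("NFKC", raw)
print(cleaned)  # efficient learning

# A base letter followed by a combining accent (two code points).
decomposed = "cafe\u0301"

# Per-character normalization leaves the two code points separate...
per_char = "".join(unicodedata.normalize("NFKC", c) for c in decomposed)

# ...while whole-string normalization composes them into a single "é".
whole = unicodedata.normalize("NFKC", decomposed)

print(len(per_char), len(whole))  # 5 4
```

For ligatures and spacing characters both approaches give the same result; they only diverge when combining marks are involved.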

In [2]:
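# Display the length in tokens.
# The 'tiktoken' codec is not part of the standard library; it is presumably
# registered when llm_core.splitters is imported in the first cell.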
import codecs

len(codecs.encode(article, 'tiktoken'))

39476

In [3]:
chain_of_density_system_prompt = "You are an expert in writing rich and dense summaries in broad domains."
chain_of_density_prompt = """
  Article:
  {article}
  ----

  You will generate increasingly concise, entity-dense summaries of the
  above Article.

  Repeat the following 2 steps 5 times.

  - Step 1: Identify 1-3 informative Entities from the Article
  which are missing from the previously generated summary and are the most
  relevant.

  - Step 2: Write a new, denser summary of identical length which covers
  every entity and detail from the previous summary plus the missing
  entities.

  A Missing Entity is:

  - Relevant: to the main story
  - Specific: descriptive yet concise (5 words or fewer)
  - Novel: not in the previous summary
  - Faithful: present in the Article
  - Anywhere: located anywhere in the Article

  Guidelines:
  - The first summary should be long (4-5 sentences, approx. {length_in_words} words) yet
  highly non-specific, containing little information beyond the entities
  marked as missing.

  - Use overly verbose language and fillers (e.g. "this article discusses")
  to reach approx. {length_in_words} words.

  - Make every word count: re-write the previous summary to improve flow and
  make space for additional entities.

  - Make space with fusion, compression, and removal of uninformative
  phrases like "the article discusses"

  - The summaries should become highly dense and concise yet
  self-contained, e.g., easily understood without the Article.

  - Missing entities can appear anywhere in the new summary.

  - Never drop entities from the previous summary. If space cannot be made,
  add fewer new entities.

  > Remember to use the exact same number of words for each summary.
  > Write the missing entities in missing_entities
  > Write the summary in denser_summary
  """

In [4]:
# Let's define our target structure:

from dataclasses import dataclass
from typing import List
from llm_core.assistants import OpenAIAssistant, OpenWeightsAssistant

@dataclass
class DenseSummary:
    denser_summary: str
    missing_entities: List[str]

@dataclass
class DenserSummaryCollection:
    system_prompt = chain_of_density_system_prompt
    prompt = chain_of_density_prompt
    
    summaries: List[DenseSummary]
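
The dataclasses above describe the shape of the answer expected from the model: a list of steps, each with a summary and the entities added at that step. As a rough illustration of that shape, here is a sketch that builds the same structure by hand from JSON. The class names `SummaryStep` and `SummarySteps` are hypothetical stand-ins that mirror the fields above, the JSON payload is made up, and how `llm_core` performs this mapping internally is not shown here.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SummaryStep:  # hypothetical mirror of DenseSummary
    denser_summary: str
    missing_entities: List[str]


@dataclass
class SummarySteps:  # hypothetical mirror of DenserSummaryCollection
    summaries: List[SummaryStep]


# Made-up payload; the summary texts are placeholders.
raw_json = """
{"summaries": [
  {"denser_summary": "first summary ...", "missing_entities": ["Yann LeCun"]},
  {"denser_summary": "second summary ...", "missing_entities": ["Meta"]}
]}
"""

payload = json.loads(raw_json)
example = SummarySteps(
    summaries=[SummaryStep(**item) for item in payload["summaries"]]
)

print(example.summaries[-1].missing_entities)  # ['Meta']
```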

## Generate the summaries with OpenAI

In [14]:
# Generate the summaries iteratively

with OpenAIAssistant(DenserSummaryCollection, model='gpt-4o') as assistant:
    collection = assistant.process(article=article, length_in_words=80)
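
The keyword arguments passed to `process` share their names with the `{article}` and `{length_in_words}` placeholders in the prompt. A minimal sketch of that substitution, assuming `str.format`-style templating (the library's internals are not shown here) and using a shortened, hypothetical template:

```python
# Shortened, hypothetical stand-in for the full Chain of Density prompt.
template = """
Article:
{article}
----
Use fillers to reach approx. {length_in_words} words.
"""

# Braces inside the *value* are safe; only braces in the template
# itself are interpreted as placeholders.
filled = template.format(
    article="Some article text with {braces} in it.",
    length_in_words=80,
)

print(filled)
```

Under this assumption, any literal brace added to the template itself would need to be doubled (`{{` and `}}`).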

In [15]:
for summary in collection.summaries:
    print(summary.missing_entities)

['Yann LeCun', 'self-supervised learning', 'energy-based model']
['Meta', 'Courant Institute', 'New York University']
['energy', 'critic', 'actor']
['gradient-based learning', 'intrinsic cost', 'short-term memory']
['']
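
The last step returned a list holding a single empty string, so no new entity was added there. A small sketch (the helper name `cumulative_entities` is hypothetical) that drops empty entries and tracks which entities each summary is expected to cover, using the lists printed above:

```python
entity_lists = [
    ['Yann LeCun', 'self-supervised learning', 'energy-based model'],
    ['Meta', 'Courant Institute', 'New York University'],
    ['energy', 'critic', 'actor'],
    ['gradient-based learning', 'intrinsic cost', 'short-term memory'],
    [''],
]


def cumulative_entities(lists):
    """Return, for each step, every non-empty entity introduced so far."""
    seen = []
    steps = []
    for entities in lists:
        for entity in entities:
            entity = entity.strip()
            if entity and entity not in seen:
                seen.append(entity)
        steps.append(list(seen))
    return steps


steps = cumulative_entities(entity_lists)
print(len(steps[-1]))          # 12
print(steps[-1] == steps[-2])  # True: the last step added nothing
```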


In [16]:
print(collection.summaries[-1].denser_summary)

Yann LeCun's article, affiliated with Meta and the Courant Institute at New York University, discusses the path towards autonomous machine intelligence, focusing on how machines can learn efficiently like humans and animals. It explores the architecture and training paradigms necessary for constructing intelligent agents. The paper combines concepts such as predictive world models, intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning. The goal is to enable machines to reason, predict, and plan at multiple levels of abstraction and time horizons. Key components include energy, critic, and actor modules, gradient-based learning, intrinsic cost, and short-term memory.


In [17]:
print(collection.summaries[2].denser_summary)

Yann LeCun's article, affiliated with Meta and the Courant Institute at New York University, discusses the path towards autonomous machine intelligence, focusing on how machines can learn efficiently like humans and animals. It explores the architecture and training paradigms necessary for constructing intelligent agents. The paper combines concepts such as predictive world models, intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning. The goal is to enable machines to reason, predict, and plan at multiple levels of abstraction and time horizons.
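
The prompt asks for summaries of identical length that never drop an entity, and both constraints can be checked after the fact. Below is a minimal sketch with two hypothetical helpers, `word_count` and `missing_from`, applied to the closing sentence of the final summary. Matching is a plain case-insensitive substring test, which is an assumption and will miss paraphrased entities.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def missing_from(summary, entities):
    """Return the entities that do not appear verbatim in the summary."""
    lowered = summary.lower()
    return [e for e in entities if e and e.lower() not in lowered]


# Closing sentence of the final summary printed above.
sentence = (
    "Key components include energy, critic, and actor modules, "
    "gradient-based learning, intrinsic cost, and short-term memory."
)

print(word_count(sentence))  # 15
print(missing_from(sentence, ["energy", "critic", "actor", "energy-based model"]))
# ['energy-based model']
```

Running the same helpers over every `denser_summary` in `collection.summaries` would show whether the word count stays constant from one step to the next.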


## Generate the summaries with Llama 3.1 8B

In [None]:
# Generate the summaries iteratively

with OpenWeightsAssistant(DenserSummaryCollection, model='llama-8b-3.1-q4', loader_kwargs={"n_ctx": 50_000}) as assistant:
    collection = assistant.process(article=article, length_in_words=80)
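
The `n_ctx` value of 50,000 has to hold the whole article (39,476 tokens as counted earlier), the prompt instructions, and the generated summaries. A back-of-the-envelope sketch of that budget follows; the `prompt_overhead` and `output_reserve` figures are hypothetical, and the Llama tokenizer will produce a somewhat different count than the one measured with tiktoken.

```python
def fits_in_context(article_tokens, n_ctx, prompt_overhead=600, output_reserve=2_000):
    """Return (fits, headroom) for a single-pass summarization call.

    prompt_overhead and output_reserve are rough, assumed figures.
    """
    needed = article_tokens + prompt_overhead + output_reserve
    return needed <= n_ctx, n_ctx - needed


ok, headroom = fits_in_context(39_476, 50_000)
print(ok, headroom)  # True 7924

# A smaller window would not leave room for the article plus the output.
print(fits_in_context(39_476, 40_000))  # (False, -2076)
```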