# WS24 - Intelligente Informationssysteme

## Block 3: Retrieval Augmented Generation

**Part 10: Text Summarization**

## Abstractive Text Summarization
Abstractive summarization involves generating a concise summary that may contain words, phrases, or sentences not present in the source text. This approach relies on understanding the context and generating human-like language to convey the central ideas. Abstractive summarization methods often use advanced language models, such as Large Language Models (LLMs), to rewrite and rephrase content in a more concise form.

## Extractive Text Summarization
Extractive summarization, on the other hand, aims to select and extract the most important sentences or phrases directly from the source text to form the summary. It does not involve rephrasing or generating new sentences. Extractive summarization methods use various techniques, such as sentence scoring and ranking, to identify and extract the most salient content.

see: https://arxiv.org/pdf/2310.10449
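
The sentence scoring and ranking idea behind extractive summarization can be sketched in a few lines. This is a minimal illustration, not a method taken from the referenced paper: it assumes a naive regex sentence splitter, a tiny hypothetical stop-word list (`STOPWORDS`), and average word frequency as the salience score. The function names are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "it", "that", "this", "for", "on", "with", "as", "be", "was", "were", "by"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase word tokens without stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOPWORDS]

def extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document act as a crude salience signal.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(ranked))
```

Every sentence in the result is copied verbatim from the input, which is the defining property of the extractive approach. Applied to the transcript loaded below, a call such as `extractive_summary(text, k=5)` would give a baseline to compare against the LLM output.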

In [None]:
# Load some data: each YouTube video transcript is one document. We want a summary of each document.
import os
import json

# Path to data
base_path = os.path.join(".", "data", "vRTcE19M-KE")

# Grammar-corrected transcript of the video
with open(os.path.join(base_path, "grammar_corrected_sentences.txt"), "r") as f:
    text = f.read()
print(len(text))  # transcript length in characters

# Video metadata; its description is used further below
with open(os.path.join(base_path, "video.json"), "r") as f:
    video = json.load(f)
description = video.get("description", "")
print(description)

In [None]:
# Can we generate a meaningful summary from the grammar-corrected transcript?
import ollama

In [None]:
SYS_PROMPT = """As a professional summarizer, create a concise and comprehensive summary of the provided text, be it an article, post, conversation, or passage, while adhering to these guidelines:
* Craft a summary that is detailed, thorough, in-depth, and complex, while maintaining clarity and conciseness.
* Incorporate main ideas and essential information, eliminating extraneous language and focusing on critical aspects.
* Rely strictly on the provided text, without including external information.
* Format the summary in paragraph form for easy understanding.
* Conclude your notes with [End of Notes] to indicate completion.
"""

In [None]:
messages = [{'role': 'system', 'content': SYS_PROMPT}, 
            {'role': 'user', 'content': f"Please summarize the following: ```{text}``` \n\n output: "}]
response = ollama.chat(model='llama3.2:latest', messages=messages)

In [None]:
print(response.message.content)

In [None]:
# How can we evaluate generated summaries? Common metrics:
# Bilingual Evaluation Understudy (BLEU) score,
# Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score,
# and BERTScore (based on Bidirectional Encoder Representations from Transformers, BERT)
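
To make the ROUGE idea concrete, here is a minimal sketch of ROUGE-1, the unigram-overlap variant. It is a simplified illustration and not a replacement for an evaluation library: tokenization is a plain lowercase regex, there is no stemming, and the function names `unigrams` and `rouge1` are made up for this example.

```python
import re
from collections import Counter

def unigrams(text):
    """Bag of lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def rouge1(candidate, reference):
    """Unigram-overlap precision, recall and F1 of a candidate against a reference."""
    cand, ref = unigrams(candidate), unigrams(reference)
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

In this notebook, one option would be `rouge1(response.message.content, description)`, treating the video description as the reference summary. That choice of reference is an assumption; a description written for viewers is not necessarily a faithful summary of the transcript.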

In [None]:
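import nltk

# The tokenizers need the Punkt models; download them once if they are missing
# (newer NLTK releases use the "punkt_tab" resource instead of "punkt").
# nltk.download("punkt")

# Count the words in the video description to use as a length target
words = []
for sentence in nltk.sent_tokenize(description):
    words.extend(nltk.word_tokenize(sentence))
len(words)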

In [None]:
messages = [{'role': 'system', 'content': SYS_PROMPT}, 
            {'role': 'user', 'content': f"Please summarize the following in fewer than {len(words)} words: ```{text}``` \n\n output: "}]
response = ollama.chat(model='llama3.2:latest', messages=messages)

In [None]:
print(response.message.content)

In [None]:
messages = [{'role': 'system', 'content': SYS_PROMPT},
            {'role': 'user', 'content': f"```{text}``` TL;DR:\n"}]
response = ollama.chat(model='llama3.2:latest', messages=messages)
print(response.message.content)

In [None]:
# How can we measure the quality of the generated summaries?
# Bilingual Evaluation Understudy (BLEU) score,
# Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, and
# BERTScore (based on Bidirectional Encoder Representations from Transformers, BERT)

# Confidence scores
# see: https://medium.com/@rakesharma21/confidence-scores-in-ai-summarization-an-insightful-approach-995603c72cab
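
As a rough illustration of the BLEU score listed above, the following sketch computes a sentence-level BLEU against a single reference: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is a simplification under stated assumptions: plain regex tokenization, one reference, and no smoothing, so any n-gram order without a match gives a score of 0. The names `ngrams` and `bleu` are made up for this example.

```python
import math
import re
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    cand = re.findall(r"[a-z0-9]+", candidate.lower())
    ref = re.findall(r"[a-z0-9]+", reference.lower())
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        clipped = sum((cand_ngrams & ref_ngrams).values())
        total = sum(cand_ngrams.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

BLEU is precision-oriented and was designed for machine translation, so for summaries it is usually read together with a recall-oriented metric such as ROUGE. Short outputs often score 0 here because no 4-gram matches; lowering `max_n` or adding smoothing are common workarounds.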

In [None]:
# Backup: send the same request through Ollama's OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    base_url = 'http://localhost:11434/v1',
    api_key='ollama', # required, but unused
)

response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=messages,
)
print(response.choices[0].message.content)

In [None]:
response