# Combine LSA and LLM to Extract Information from Large Text

An experiment in extracting information from a large text that does not fit in Llama 2 13B's maximum input of 4K tokens.

In [1]:
import requests, time, os, json
from dotenv import load_dotenv, find_dotenv
from IPython.display import display, HTML
from transformers import AutoTokenizer, AutoModelForCausalLM

In [2]:
# Using Llama 2 70B on HuggingFace requires an authentication token and a HuggingFace Pro account, which costs $9 a month.
# To learn more see 
# - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf?inference_api=true
# - https://huggingface.co/pricing

# Loading authentication token from .env file
load_dotenv()
token = os.environ.get("hgf_token")


### General methods and classes used in the experiments below

In [3]:
# Object to represent an answer from Llama
class Answer:
    def __init__(self, answer, elapse):
        self.answer = answer
        self.elapse = elapse


In [4]:
def generate(prompt: str, API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-13b-chat-hf") -> str:
    
    """
    Other LLM endpoints:
    - "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1"
    - "https://slj8qtv495fb2zsf.us-east-1.aws.endpoints.huggingface.cloud"
    - "https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf"
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "content-type": "application/json",
    }
    
    parameters = {
        "max_length": 2092,
        "max_new_tokens": 500,
        "top_k": 10,
        "return_full_text": False,
        "do_sample": True,
        "num_return_sequences": 1,
        "temperature": 0.1,
        "repetition_penalty": 1.0,
        "length_penalty": 1.0,
        "use_cache": False,
    }
    
    payload = {"inputs": prompt, "parameters": parameters}
    response = requests.post(API_URL, headers=headers, json=payload)
    
    # On failure, return the error body as text so the declared return type (str) holds
    if response.status_code != 200:
        return response.text
    
    results = response.json()
    answer = results[0]['generated_text']
    return answer

In [5]:
# Wrapper function that runs generate, times the call, and returns an Answer
def run_prompt(prompt: str, url = None) -> Answer:
    start_time = time.time()
    
    if url:
        answer = generate(prompt, url)
    else:
        answer = generate(prompt)
    end_time = time.time()
    elapse = round(end_time - start_time)
    return Answer(answer, elapse)

In [6]:
# Display answer object in HTML
def display_answer(answer: Answer, header = ''):
    answer_html_template = """<h3>{HEADER} Answer - Time to Generate: {ELAPSE} seconds</h3>
    <textarea cols='100' rows={NUM_ROWS}>{ANSWER}</textarea>"""
    
    number_rows = 20 # (len(answer.answer.split(' ')) / 10)
    
    html = answer_html_template.format(ANSWER=answer.answer, ELAPSE=answer.elapse, HEADER=header, NUM_ROWS=number_rows)
    display(HTML(html))

In [7]:
# Display text in HTML
def display_text(text: str):
    html_template = """<textarea cols='100' rows={NUM_ROWS}>{TEXT}</textarea>"""
    
    number_rows = 20 # (len(text.split(' ')) / 10)
    
    html = html_template.format(TEXT=text, NUM_ROWS=number_rows)
    display(HTML(html))

### Loading the Wikipedia article about the United States

In [9]:
# Load the text we are going to use
with open('wikipedia_united_states.txt', 'r') as f:
    article = f.read()
    
# Note: this is the Mistral tokenizer, so the counts only approximate what the Llama 2 endpoint sees
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
article_number_of_tokens = len(tokenizer.encode(article))
print(f"Number of tokens in the Wikipedia article about the U.S.A is {article_number_of_tokens}")

Number of tokens in the Wikipedia article about the U.S.A is 19798


In [10]:
display_text(article)

## 1: Hitting the Input Token Limits
Let's start by seeing what happens when we send the LLM the full article of 19798 tokens.


In [18]:
p1 = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
<</SYS>>
I need you to do two things. 
1. Tell me what the article is about? 
2. Write a concise summary of the article in bullet-points. 
article: {BODY}. The article is about:  
Bullet-points summary: [/INST]""".format(BODY=article)

display_answer(run_prompt(p1))

The message returned from the HuggingFace API is misleading: it suggests that the model can handle 8K tokens, whereas according to the documentation (https://huggingface.co/docs/transformers/model_doc/llama) the maximum is 4096 tokens. If you send Llama an input of between 4K and 8K tokens, you get a different message that looks like this:  

'{"error":"Input validation error: `inputs` must have less than 4096 tokens. Given: 6299","error_type":"validation"}'

To keep this example on point, we will keep the number of input tokens <= 4096
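
When `generate` returns an error body like the one above, it can be handy to pull the numbers out of it rather than read them by eye. The following is a minimal sketch, assuming the error body keeps the exact wording quoted above; `parse_token_limit_error` is a hypothetical helper of mine, not part of any HuggingFace library.

```python
import json
import re
from typing import Optional, Tuple


def parse_token_limit_error(body: str) -> Optional[Tuple[int, int]]:
    """Return (limit, given) from a validation error body, or None if it does not match."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    message = str(payload.get("error", ""))
    # Assumes the wording "... less than <limit> tokens. Given: <n>" quoted above
    match = re.search(r"less than (\d+) tokens\. Given: (\d+)", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The error body quoted in this notebook
error_body = (
    '{"error":"Input validation error: `inputs` must have less than 4096 tokens. '
    'Given: 6299","error_type":"validation"}'
)
print(parse_token_limit_error(error_body))  # (4096, 6299)
```

The difference between the two numbers (here 6299 - 4096) is a lower bound on how many tokens have to be removed before the request can succeed.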

## Truncating the Article
There are several ways to reduce the size of the article.


1. Randomly truncate sections of the text. I don't like this option because important information will be lost. 
2. Use LangChain to divide the text into chunks, summarize them separately, stitch the summaries together, and re-summarize to get a consistent answer. This method suits huge texts (books) with logical breaks in context (chapters), but it has some challenges. Each run through the LLM gives the model an opportunity to be creative (and hallucinate), which makes the final output harder to work with. Each run also costs money and is slow.  
3. Use a statistical method like Latent Semantic Analysis (LSA) to remove excess (redundant) information from the text. I like combining LSA with an LLM because it offers the best of both worlds. LSA is good at distilling text into its essential information, but the output is hard to read. The LLM can take the LSA output, make sense of it, and write an excellent summary or perform another task. This notebook focuses on this method.
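
For comparison, the chunk-and-re-summarize idea in option 2 can be sketched without LangChain. This is only an illustration of the general pattern, not LangChain's implementation: `pack_into_chunks` and `map_reduce_summary` are hypothetical names, and `count_tokens` and `summarize` are stand-ins for a real tokenizer and a real LLM call.

```python
from typing import Callable, List


def pack_into_chunks(
    paragraphs: List[str], max_tokens: int, count_tokens: Callable[[str], int]
) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens tokens.

    A single paragraph longer than max_tokens still becomes its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        if current and current_tokens + n > max_tokens:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks


def map_reduce_summary(
    paragraphs: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[str], str],
) -> str:
    """Summarize each chunk, then summarize the stitched partial summaries."""
    chunks = pack_into_chunks(paragraphs, max_tokens, count_tokens)
    partial_summaries = [summarize(chunk) for chunk in chunks]  # one LLM call per chunk
    return summarize("\n".join(partial_summaries))  # plus one final call


# Stand-ins for demonstration only: whitespace "tokens" and a fake summarizer
demo_chunks = pack_into_chunks(
    ["a b c", "d e", "f g h i"], max_tokens=5, count_tokens=lambda t: len(t.split())
)
print(demo_chunks)  # ['a b c\nd e', 'f g h i']
```

The sketch makes the cost argument concrete: a text that packs into N chunks needs N + 1 LLM calls, and each of those calls is a chance for the model to drift from the source.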


Here are the steps we are going to follow:  
1. Determine by how many tokens the article needs to be truncated.
2. Truncate the article
3. Run the shortened article through the LLM  


## Step 1. Determine the maximum tokens allocated for the article.
Hint: it's below 4096.


In [23]:
p1 = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
<</SYS>>
For each paragraph in the article write a concise TL;DR summary in a bullet-point. article: {BODY}. TL;DR Summary:[/INST]"""

prompt_number_of_tokens = len(tokenizer.encode(p1))
max_tokens = 4096
output_tokens = 500
article_ideal_number_of_token = (max_tokens - output_tokens - prompt_number_of_tokens)

print(f"Number of tokens in the prompt {prompt_number_of_tokens}")
print(f"Maximum tokens the article can have {article_ideal_number_of_token}")
print(f"Number of tokens in the Wikipedia article about the U.S.A is {article_number_of_tokens}")

Number of tokens in the prompt 168
Maximum tokens the article can have 3428
Number of tokens in the Wikipedia article about the U.S.A is 19798
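
The budget above is plain subtraction: context window, minus the tokens reserved for the answer, minus the tokens used by the prompt template. A small sketch of the same arithmetic, with a guard for the case where nothing is left for the article (`article_token_budget` is a hypothetical helper name):

```python
def article_token_budget(
    context_window: int, reserved_for_output: int, prompt_tokens: int
) -> int:
    """Tokens left for the article after reserving room for the prompt and the answer."""
    budget = context_window - reserved_for_output - prompt_tokens
    if budget <= 0:
        raise ValueError("The prompt and reserved output leave no room for the article")
    return budget


# Numbers from the cell above: 4096 - 500 - 168
budget = article_token_budget(4096, 500, 168)
print(budget)  # 3428

# Share of the original 19798-token article that can be kept
print(f"{budget / 19798:.1%}")  # 17.3%
```

In other words, roughly five sixths of the article has to go before the LLM sees it.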


## Step 2. Truncating the Article Using LSA
In this notebook, I use the Latent Semantic Analysis (LSA) summarizer implemented in Sumy. For a detailed explanation of LSA, see the Wikipedia article linked below. 
- Library: https://github.com/miso-belica/sumy/blob/main/docs/summarizators.md#latent-semantic-analysis-lsa
- Wikipedia: https://en.wikipedia.org/wiki/Latent_semantic_analysis 


The truncate_text function uses LSA to shorten the text and the HuggingFace tokenizer to count tokens. It recursively re-processes the text until it fits within the token limit. 


In [20]:
import math
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def truncate_text(
    text: str, llm_max_tokens: int, hf_tokenizer: AutoTokenizer, LANGUAGE="english"
) -> str:
    """
    Shorten text with LSA summarization until it fits within the number of input
    tokens the LLM accepts.

    Args:
        text (str): The text that needs to be truncated.
        llm_max_tokens (int): Maximum number of tokens the LLM supports.
        hf_tokenizer (AutoTokenizer): HuggingFace tokenizer, used to count tokens.
        LANGUAGE (str): Language passed to the Sumy tokenizer.

    Returns:
        str: The shortened text.
    """
    
    summarizer = LsaSummarizer()
    
    # How many tokens does the text have?
    num_tokens = len(hf_tokenizer.encode(text))

    if num_tokens > llm_max_tokens:
        print(f"Text is too long. Splitting into chunks of {llm_max_tokens} tokens")
        parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
        num_sentences = len(parser.document.sentences)
        avg_tokens_per_sentence = int(num_tokens / num_sentences)
        excess_tokens = num_tokens - llm_max_tokens
        num_sentences_to_summarize = num_sentences - (
            math.ceil(excess_tokens / avg_tokens_per_sentence)
        )

        print(f"Number of tokens: {num_tokens}.")
        print(f"Number of sentences: {num_sentences}.")
        print(f"Average tokens per sentence: {avg_tokens_per_sentence}.")
        print(f"Excess tokens: {excess_tokens}.")
        print(f"Number of sentences to summarize: {num_sentences_to_summarize}")

        summary = summarizer(parser.document, num_sentences_to_summarize)
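        # str(sentence) avoids reaching into the private _text attribute
        summary_text = "\n".join(str(sentence) for sentence in summary)
        # Pass LANGUAGE along so a non-default language survives the recursion
        return truncate_text(summary_text, llm_max_tokens, hf_tokenizer, LANGUAGE)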

    else:
        print("Text is short enough. No need to summarizing.")
        return text

In [24]:
short_article = truncate_text(text=article, llm_max_tokens=article_ideal_number_of_token, hf_tokenizer=tokenizer)

display_text(short_article)

print(f"Short article number of tokens {len(tokenizer.encode(short_article))}")


Text is too long. Splitting into chunks of 3428 tokens
Number of tokens: 19798.
Number of sentences: 545.
Average tokens per sentence: 36.
Excess tokens: 16370.
Number of sentences to summarize: 90
Text is too long. Splitting into chunks of 3428 tokens
Number of tokens: 4316.
Number of sentences: 88.
Average tokens per sentence: 49.
Excess tokens: 888.
Number of sentences to summarize: 69
Text is too long. Splitting into chunks of 3428 tokens
Number of tokens: 3557.
Number of sentences: 69.
Average tokens per sentence: 51.
Excess tokens: 129.
Number of sentences to summarize: 66
Text is too long. Splitting into chunks of 3428 tokens
Number of tokens: 3441.
Number of sentences: 66.
Average tokens per sentence: 52.
Excess tokens: 13.
Number of sentences to summarize: 65
Text is short enough. No need to summarizing.


Short article number of tokens 3380
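
The log above can be checked by hand. The sketch below isolates the sentence-count arithmetic from truncate_text (`sentences_to_keep` is a name I introduce here for illustration) and replays the four passes using the token and sentence counts printed in the log.

```python
import math


def sentences_to_keep(num_tokens: int, num_sentences: int, max_tokens: int) -> int:
    """Same arithmetic as truncate_text: drop enough average-length sentences to fit.

    Assumes the average is at least one token per sentence.
    """
    avg_tokens_per_sentence = int(num_tokens / num_sentences)
    excess_tokens = num_tokens - max_tokens
    return num_sentences - math.ceil(excess_tokens / avg_tokens_per_sentence)


# (tokens, sentences) reported at the start of each pass in the log above
passes = [(19798, 545), (4316, 88), (3557, 69), (3441, 66)]
for num_tokens, num_sentences in passes:
    keep = sentences_to_keep(num_tokens, num_sentences, max_tokens=3428)
    print(f"{num_tokens} tokens, {num_sentences} sentences -> keep {keep}")
# keeps 90, 69, 66 and 65 sentences, matching the log
```

This also shows why one pass is not enough. The estimate assumes the surviving sentences are of average length, but the log shows the average rising from 36 to 49 tokens per sentence after the first pass, so the first summary overshoots the budget and the function has to recurse.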


## Step 3. Asking the LLM to Summarize the LSA Output

In [25]:
p1 = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
<</SYS>>
Write a short summary of the article. The summary should include key concepts, people and events mentioned in the article. 
Start the summary explaining what the article is about.   
article: {BODY}. 
Answer:
[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p1))
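
The prompts in this notebook all repeat the same Llama 2 chat wrapper (`<s>[INST] <<SYS>> ... <</SYS>> ... [/INST]`). A small helper could build it once. This is a sketch that mirrors the layout used in the cells above; `build_llama2_prompt` and `DEFAULT_SYSTEM` are hypothetical names, and the shortened system prompt is only a placeholder.

```python
DEFAULT_SYSTEM = "You are a helpful, respectful and honest assistant."


def build_llama2_prompt(user_message: str, system_prompt: str = DEFAULT_SYSTEM) -> str:
    """Wrap a system prompt and a user message in the Llama 2 chat markers."""
    return (
        "<s>[INST] <<SYS>>\n"
        + system_prompt.strip()
        + "\n<</SYS>>\n"
        + user_message.strip()
        + " [/INST]"
    )


# Plain concatenation means literal braces in the message need no escaping
prompt = build_llama2_prompt('Summarize this article: {"title": "United States"}')
print(prompt)
```

Because the helper concatenates strings instead of calling `str.format`, article text or JSON examples that contain `{` and `}` pass through unchanged, with no need to double the braces.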

### Asking the LLM to Extract Information from the Short Article

In [38]:
p2 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.

<</SYS>>

What entities, companies, people, products, and more are mentioned in the article that can generalize the topic? [/INST] 
People: 
- Barack Obama
- George Washington
- Jayson Tatum

Companies:
- Apple
- Ford 
- Macy's

Countries:
- USA
- Mexico
</s>

<s>[INST]
Extract the concepts, entities, companies, people, products, and more mentioned in the article. 
Article: {BODY}
Answer:
[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p2, url="https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf"))

In [32]:
p3 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer, please don't share false information. Do not repeat entities. 


Output answer in JSON using the following format: {{"name": name, "type": type, "explanation": explanation}}
<</SYS>>

What entities, products, and people are mentioned in the article that can generalize the topic? [/INST]
[
{{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},
{{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},
{{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}},
{{"name": "Ford", "type": "company", "explanation": "Ford is a company that builds vehicles"}},
{{"name": "John Smith", "type": "person", "explanation": "The founder of Smith Industries"}}
] </s>

<s>[INST]
Extract entities from the article and output them in valid JSON. The JSON must be valid. The output must be under 500 tokens. 
Please do not repeat entities. The answer must follow this JSON format: {{"name": name, "type": type, "explanation": [concise_explanation_of_the_entity]}}
Article: {BODY}
JSON:
[/INST]""".format(BODY=short_article)

# Note: Llama 2 13B did not do a good job of answering in valid JSON, so I used the 70B model for this one.
display_answer(run_prompt(p3, url="https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf"))
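
Since the model does not always return valid JSON, it is safer to parse the answer defensively before using it. A minimal sketch, assuming the answer contains a single JSON array, possibly surrounded by chatter; `parse_entities` is a hypothetical helper and `sample_answer` is made-up text for illustration, not real model output.

```python
import json
from typing import Any, Dict, List


def parse_entities(answer: str) -> List[Dict[str, Any]]:
    """Extract the JSON array from an LLM answer and drop repeated entity names.

    Returns an empty list when no valid JSON array is found.
    """
    start = answer.find("[")
    end = answer.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(answer[start : end + 1])
    except json.JSONDecodeError:
        return []

    seen = set()
    entities = []
    for item in items:
        if not isinstance(item, dict) or "name" not in item:
            continue
        key = str(item["name"]).strip().lower()  # case-insensitive de-duplication
        if key in seen:
            continue
        seen.add(key)
        entities.append(item)
    return entities


# Made-up answer for illustration only
sample_answer = """Sure! Here is the JSON:
[{"name": "Ford", "type": "company", "explanation": "Builds vehicles"},
 {"name": "ford", "type": "company", "explanation": "Duplicate entry"},
 {"name": "NBA", "type": "sport league", "explanation": "Basketball league"}]
I hope this helps."""
print([entity["name"] for entity in parse_entities(sample_answer)])  # ['Ford', 'NBA']
```

An empty result is a signal to retry the prompt or fall back to the larger model, rather than a reason to crash the notebook.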

## Conclusion

