# Mistral 7B: Extract Information from Large Text

An experiment in extracting information from text that is too large to fit within Mistral 7B's maximum input tokens.

In [1]:
import requests, time, os, json
from dotenv import load_dotenv, find_dotenv
from IPython.display import display, HTML
from transformers import AutoTokenizer, AutoModelForCausalLM

In [3]:
# Using Llama 2 70B on HuggingFace requires an authentication token and a HuggingFace Pro account, which costs $9 a month.
# To learn more, see:
# - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf?inference_api=true
# - https://huggingface.co/pricing

# Load the authentication token from the .env file
load_dotenv()
token = os.environ.get("hgf_token")
# The token is a secret, so it is deliberately not printed here.


### General-purpose methods and class used in the experiments below

In [8]:
# Object to represent an answer from Llama
class Answer:
    def __init__(self, answer, elapse):
        self.answer = answer
        self.elapse = elapse


In [145]:
def generate(prompt: str, API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.1") -> str:
    
    # API_URL = "https://slj8qtv495fb2zsf.us-east-1.aws.endpoints.huggingface.cloud"
    # API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf"
    headers = {
        "Authorization": f"Bearer {token}",
        "content-type": "application/json",
    }
    
    parameters = {
        "max_length": 2092,
        "max_new_tokens": 500,
        "top_k": 10,
        "return_full_text": False,
        "do_sample": True,
        "num_return_sequences": 1,
        "temperature": 0.1,
        "repetition_penalty": 1.0,
        "length_penalty": 1.0,
        "use_cache": False,
    }
    
    payload = {"inputs": prompt, "parameters": parameters}
    response = requests.post(API_URL, headers=headers, json=payload)
    
    if response.status_code != 200:
        # Return the error body as text so the return type stays str
        return response.text
    
    results = response.json()
    answer = results[0]['generated_text']
    return answer

In [30]:
# Wrapper that runs generate, times the call, and returns an Answer
def run_prompt(prompt: str, url = None) -> Answer:
    start_time = time.time()
    
    if url:
        answer = generate(prompt, url)
    else:
        answer = generate(prompt)
    end_time = time.time()
    elapse = round(end_time - start_time)
    return Answer(answer, elapse)

In [11]:
# Display answer object in HTML
def display_answer(answer: Answer, header = ''):
    answer_html_template = """<h3>{HEADER} Answer - Time to Generate: {ELAPSE} seconds</h3>
    <textarea cols='100' rows={NUM_ROWS}>{ANSWER}</textarea>"""
    
    number_rows = 20 # (len(answer.answer.split(' ')) / 10)
    
    html = answer_html_template.format(ANSWER=answer.answer, ELAPSE=answer.elapse, HEADER=header, NUM_ROWS=number_rows)
    display(HTML(html))

In [74]:
# Display text in HTML
def display_text(text: str):
    html_template = """<textarea cols='100' rows={NUM_ROWS}>{TEXT}</textarea>"""
    
    number_rows = 20 # (len(text.split(' ')) / 10)
    
    html = html_template.format(TEXT=text, NUM_ROWS=number_rows)
    display(HTML(html))

### Loading Wikipedia article about the US.   

In [12]:
# Load the text we are going to use
with open('wikipedia_united_states.txt', 'r') as f:
    text = f.read()
    
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
article_number_of_tokens = len(tokenizer.encode(text))
print(f"Number of tokens in the Wikipedia article about the U.S.A is {article_number_of_tokens}")

Number of tokens in the Wikipedia article about the U.S.A is 19798


In [25]:
text

'United States\n\nArticle\nTalk\nRead\nView source\nView history\n\nTools\nCoordinates: 40°N 100°W\nExtended-protected article\nFrom Wikipedia, the free encyclopedia\n December 3: Wikipedia is still not on the market.\nPlease don\'t skip this quick Sunday read. We\'re sorry to interrupt, but it\'s December 3, and it will soon be too late to help the nonprofit behind Wikipedia in this end-of-year fundraiser in the United States. Wikipedia is free and doesn\'t rely on ads. Just 2% of readers donate, so if Wikipedia has given you $2.75 worth of knowledge, please give. Any contribution helps, whether it\'s $2.75 or $25.\nGive $2.75\n Give a different amount\nWikimedia Foundation Logo\nProud host of Wikipedia and its sister sites\n\nMAYBE LATER I ALREADY DONATED\nCLOSE \nSeveral terms redirect here. For other uses, see America (disambiguation), US (disambiguation), USA (disambiguation), The United States of America (disambiguation), and United States (disambiguation).\nUnited States of Amer

## 1: Hitting the Input Token Limits
Let's start by seeing what happens when we send Mistral the full article, which is roughly 20K tokens.

In [56]:
p1 = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
<</SYS>>
I need you to do two things. 
1. Tell me what the article is about? 
2. Write a concise summary of the article in bullet-points. 
article: {BODY}. The artilce is about:  
Bullet-points summary: [/INST]""".format(BODY=text)

display_answer(run_prompt(p1))

The maximum number of tokens the model accepts is 18432, but our input and output tokens total 19990.
The 19990 figure is calculated as prompt + article + max_new_tokens (from the API call parameters).

We can't significantly reduce the prompt or max_new_tokens. That leaves us with truncating the article.

To process the prompt with the article, we keep the prompt as is and truncate the article.
Here are the steps we are going to follow:
1. Determine the number of tokens in the prompt without the article.
2. Determine by how many tokens the article needs to be truncated.
3. Truncate the article and try the LLM again.
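
The first two steps are plain arithmetic, and a minimal sketch may make them concrete. The helper names `article_token_budget` and `tokens_to_cut` are hypothetical, and the numbers plugged in (an 8192-token working limit, 500 new tokens, a 168-token prompt, a 19798-token article) are the ones this notebook reports in its other cells.

```python
def article_token_budget(context_limit: int, max_new_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for the article after reserving room for the prompt and the reply."""
    return context_limit - max_new_tokens - prompt_tokens


def tokens_to_cut(article_tokens: int, budget: int) -> int:
    """How many tokens the article must lose to fit; 0 if it already fits."""
    return max(0, article_tokens - budget)


# Values taken from this notebook; swap in your own limits and counts.
budget = article_token_budget(context_limit=8192, max_new_tokens=500, prompt_tokens=168)
excess = tokens_to_cut(article_tokens=19798, budget=budget)
print(f"Article budget: {budget} tokens, tokens to cut: {excess}")
```

Keeping the budget calculation in one place makes it easy to rerun when the limit, the prompt, or `max_new_tokens` changes.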

#### 1. Number of tokens in the prompt
In experiments that I ran, I found that Mistral performed best with an input of about 8192 tokens. To keep this notebook brief, I will spare you those experiments. Feel free to use this notebook to experiment with different numbers of input tokens, and let me know what you discover.

In [172]:
p1 = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
<</SYS>>
For each paragraph in the article write a concise TL;DR summary in a bullet-point. article: {BODY}. TL;DR Summary:[/INST]"""

prompt_number_of_tokens = len(tokenizer.encode(p1))
print(f"Number of tokens in the prompt {prompt_number_of_tokens}")
print(f"Maximum tokens the article can have {(8192 - 500 - prompt_number_of_tokens)}")
print(f"Number of tokens in the Wikipedia article about the U.S.A is {article_number_of_tokens}")

Number of tokens in the prompt 168
Maximum tokens the article can have 7524
Number of tokens in the Wikipedia article about the U.S.A is 19798


## Truncating the Article Using LSA
There are several techniques for reducing the 20K-token article to roughly 7.5K tokens.
The first is to use an LLM: divide the article into three sections, summarize each one, and concatenate the results.
The second is to use a statistical method such as LSA to condense the text.

In this post, we are going to focus on the second technique. In my experiments, I found it to produce better results.
An LLM is creative when summarizing text. Dividing the text into sections, summarizing each section, joining the summaries, and then re-summarizing means the LLM ends up summarizing its own output, which gives its creativity (and hallucination) more opportunities to mess things up.
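
For comparison, the first technique (split, summarize each section, join) could look roughly like the sketch below. It is not the approach used in the rest of this notebook. The function names are hypothetical, and `count_tokens` and `summarize` are passed in as callables so the sketch does not assume any particular tokenizer or model.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_tokens: int, count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily pack non-empty lines into chunks that stay within max_tokens.

    A single line longer than max_tokens becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in text.split("\n"):
        if not paragraph.strip():
            continue
        candidate = "\n".join(current + [paragraph])
        if current and count_tokens(candidate) > max_tokens:
            # The current chunk is full: close it and start a new one.
            chunks.append("\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks


def summarize_in_sections(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[str], str],
) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    chunks = split_into_chunks(text, max_tokens, count_tokens)
    return "\n".join(summarize(chunk) for chunk in chunks)
```

In this notebook, `count_tokens` could be `lambda s: len(tokenizer.encode(s))` and `summarize` could wrap `run_prompt` with a summarization prompt. Each extra LLM pass is another chance for drift, which is the concern raised above.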

In this notebook, I am going to use the Latent Semantic Analysis (LSA) summarizer implemented in Sumy. For a detailed explanation of LSA, see the Sumy documentation and the Wikipedia article:
https://github.com/miso-belica/sumy/blob/main/docs/summarizators.md#latent-semantic-analysis-lsa

https://en.wikipedia.org/wiki/Latent_semantic_analysis 





In [175]:
import math
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def truncate_text(
    text: str, llm_max_tokens: int, hf_tokenizer: AutoTokenizer, LANGUAGE="english"
) -> str:
    """
    Reduce text with LSA summarization until it fits within the number of input
    tokens the LLM accepts.

    Args:
        text (str): The text that needs to be truncated.
        llm_max_tokens (int): Maximum number of tokens the text may have.
        hf_tokenizer (AutoTokenizer): HuggingFace tokenizer used to count tokens.
        LANGUAGE (str): Language passed to the Sumy tokenizer.

    Returns:
        str: The original text if it already fits, otherwise the LSA summary.
    """
    
    summarizer = LsaSummarizer()
    num_tokens = len(hf_tokenizer.encode(text))

    if num_tokens > llm_max_tokens:
        print(f"Text is too long. Splitting into chunks of {llm_max_tokens} tokens")
        parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
        num_sentences = len(parser.document.sentences)
        avg_tokens_per_sentence = int(num_tokens / num_sentences)
        excess_tokens = num_tokens - llm_max_tokens
        num_sentences_to_summarize = num_sentences - (
            math.ceil(excess_tokens / avg_tokens_per_sentence)
        )

        print(f"Number of tokens: {num_tokens}.")
        print(f"Number of sentences: {num_sentences}.")
        print(f"Average tokens per sentence: {avg_tokens_per_sentence}.")
        print(f"Excess tokens: {excess_tokens}.")
        print(f"Number of sentences to summarize: {num_sentences_to_summarize}")

        summary = summarizer(parser.document, num_sentences_to_summarize)
        summary_text = "\n".join([sentence._text for sentence in summary])
        # Recurse until the summary fits, keeping the same language setting
        return truncate_text(summary_text, llm_max_tokens, hf_tokenizer, LANGUAGE)

    else:
        print("Text is short enough. No need to summarizing.")
        return text

Next, we use the truncate_text function to reduce the article from 19798 tokens to under 3500 tokens.

In [176]:
prompt_number_of_tokens = len(tokenizer.encode(p1))
print(f"Number of tokens in the prompt {prompt_number_of_tokens}")
article_ideal_number_of_token = (6000 - 500 - prompt_number_of_tokens)
print(f"Maximum tokens the article can have {article_ideal_number_of_token}")
print(f"Number of tokens in the Wikipedia article about the U.S.A is {article_number_of_tokens}")

Number of tokens in the prompt 7488
Maximum tokens the article can have -1988
Number of tokens in the Wikipedia article about the U.S.A is 19798


In [185]:
short_article = truncate_text(text=text, llm_max_tokens=3500, hf_tokenizer=tokenizer)

display_text(short_article)

print(f"Short article number of tokens {len(tokenizer.encode(short_article))}")


Text is too long. Splitting into chunks of 3500 tokens
Number of tokens: 19798.
Number of sentences: 545.
Average tokens per sentence: 36.
Excess tokens: 16298.
Number of sentences to summarize: 92
Text is too long. Splitting into chunks of 3500 tokens
Number of tokens: 4422.
Number of sentences: 90.
Average tokens per sentence: 49.
Excess tokens: 922.
Number of sentences to summarize: 71
Text is too long. Splitting into chunks of 3500 tokens
Number of tokens: 3637.
Number of sentences: 71.
Average tokens per sentence: 51.
Excess tokens: 137.
Number of sentences to summarize: 68
Text is too long. Splitting into chunks of 3500 tokens
Number of tokens: 3537.
Number of sentences: 68.
Average tokens per sentence: 52.
Excess tokens: 37.
Number of sentences to summarize: 67
Text is short enough. No need to summarizing.


Short article number of tokens 3484
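
The sentence counts in the log above follow directly from the estimate inside `truncate_text`. The sketch below reproduces that arithmetic in isolation. `sentences_to_keep` is a hypothetical helper name, and the two `max(1, ...)` guards are additions that the original function does not have.

```python
import math


def sentences_to_keep(num_tokens: int, num_sentences: int, max_tokens: int) -> int:
    """Estimate how many sentences fit in max_tokens, assuming average-length sentences."""
    # Guard (not in the original): avoid dividing by zero for very short sentences.
    avg_tokens_per_sentence = max(1, int(num_tokens / num_sentences))
    excess_tokens = num_tokens - max_tokens
    keep = num_sentences - math.ceil(excess_tokens / avg_tokens_per_sentence)
    # Guard (not in the original): always keep at least one sentence.
    return max(1, keep)


# (tokens, sentences) pairs copied from the four passes in the log above.
for tokens, sentences in [(19798, 545), (4422, 90), (3637, 71), (3537, 68)]:
    print(tokens, sentences, "->", sentences_to_keep(tokens, sentences, 3500))
```

The estimate assumes every sentence has the average length. In the log, the sentences LSA kept were longer than average (36 tokens per sentence before the first pass, 49 after), which is why one pass was not enough and the function had to recurse.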


In [205]:
p1 = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
<</SYS>>
Write a short summary of the article? 
article: {BODY}. 
Answer:
[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p1, url="https://api-inference.huggingface.co/models/meta-llama/Llama-2-13b-chat-hf"))

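Hmm, we hit another validation error. We still have too many input tokens. But why do we get two different limits? Previously, we were told that the maximum total is 8K tokens; now we are told that it's 4096. One likely reason is that this request went to the Llama-2-13b-chat-hf endpoint rather than Mistral, and Llama 2 models have a 4096-token context window.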
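
A pre-flight check along these lines could catch the problem before the request is sent. This is a sketch: `check_request_fits` is a hypothetical helper, the limits (4096 and 8192) are the figures mentioned in this notebook, and the 168-token prompt overhead is borrowed from the earlier measurement of a similar template, so it is only an approximation for this prompt.

```python
def check_request_fits(prompt_tokens: int, max_new_tokens: int, context_limit: int) -> int:
    """Return the unused headroom in tokens, or raise if the request is too large."""
    total = prompt_tokens + max_new_tokens
    if total > context_limit:
        raise ValueError(
            f"Request needs {total} tokens (prompt {prompt_tokens} + "
            f"max_new_tokens {max_new_tokens}) but the limit is {context_limit}."
        )
    return context_limit - total


# 3484 article tokens (from the log above) plus an assumed 168 tokens of prompt template.
prompt_tokens = 3484 + 168

print(check_request_fits(prompt_tokens, max_new_tokens=500, context_limit=8192))

try:
    check_request_fits(prompt_tokens, max_new_tokens=500, context_limit=4096)
except ValueError as error:
    print(error)
```

Under these assumptions the same request fits an 8192-token limit but exceeds a 4096-token one, so the limit has to be chosen per endpoint.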

# Experiment 2: System Message

The Llama paper describes the system message, which is used to set the stage and context for the model.
In the following example, I am using the system message. Let's see what the difference is between P3, which doesn't have a system message, and P4, which uses one.

In [54]:
p4 = """Write a concise TL;DR summary in numeric bullet-points for the following article, 
don't repeat ideas in bullet points. Limit the number of bullet-point to 5. article: {BODY}. TL;DR Summary:[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p4, "https://l4ylldfrg0b2nplx.us-east-1.aws.endpoints.huggingface.cloud"))

Both P3 and P4 are pretty good, and it's hard to see the difference that the system message made. I personally prefer P4 (system message) because the answer reads better in my opinion, but I am sure someone will argue with me on that.

# Experiment 3: Modify the System Message
The system message can be modified to better fit the task and to define the persona and context we want Llama to assume.

Changes applied to the original system message:
- Use the researcher persona and specify the task of summarizing articles.
- Remove the safety instructions; they are not needed since we are asking Llama to be truthful to the article.
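
Since the system message keeps changing between experiments, a small builder can keep the wrapper tokens in one place. This is a sketch with a hypothetical `build_prompt` helper. It mirrors the `<s>[INST] <<SYS>> ... <</SYS>> ... [/INST]` layout used by the prompts in this notebook and does not claim to be the official template for any particular model.

```python
def build_prompt(user_message: str, system_message: str = "") -> str:
    """Wrap a user message, and optionally a system message, in the notebook's prompt layout."""
    if system_message:
        return f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n{user_message} [/INST]"
    return f"<s>[INST] {user_message} [/INST]"


researcher = "You are a researcher tasked with summarizing articles and writing concise briefs."
prompt = build_prompt("Summarize the following article: ...", system_message=researcher)
print(prompt)
```

With a builder like this, comparing two system messages means changing one argument instead of editing a long prompt string by hand.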

In [125]:
p5 = """<s>[INST] <<SYS>>
You are a researcher task in summarizing and writing concise brief of articles.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
Write a concise summary in numeric bullet-points for the following article, 
Limit the number of bullet-point to 10. article: {BODY}. Bullet-points summary:[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p5))



The answer for p5 is the best so far, in my opinion. I like the intro and conclusion that Llama added.

# Experiment 4: Asking Questions about the Article

The article is about 'Mobile Game Soft Launch'. Let's ask Llama a specific question about it. The answer is pretty good.

In [126]:
p10 = """<s>[INST] <<SYS>>
You are a researcher task with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
What industries mentioned in the article? Output only the industry name. Output the answer in JSON, using format {{"industry": industry}}.
Include only valid JSON.
Article: {BODY}
[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p10))



Let's try to trick Llama by asking which sport the article focuses on.

In [48]:
p11 = """<s>[INST] <<SYS>>
You are a researcher task with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature. Don't make up answers. 
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
Name sport the article is focus on? Output only the sport name. Output the answer in JSON, using format {{"sport": sport, "explanation": explanation}}. 
Include only valid JSON.
Article: {BODY}
[/INST]""".format(BODY=text)

display_answer(run_prompt(p11))

Pretty good, but there is one problem: Llama was disregarding my request for valid JSON. The model is eager to explain why sport is null, so I added an explanation field to the JSON format, and that did the trick.

To review: to get Llama to output JSON, I needed to explicitly tell it "only include JSON", add an explanation field to the JSON object format, and change "write the answer" to "output the answer".
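
Even with these prompt tweaks, the model may still wrap the JSON in extra prose, so it can help to validate the answer in code. The sketch below uses only the standard library. `extract_json` is a hypothetical helper, and the sample answer string is made up for illustration rather than taken from a real model response.

```python
import json
from typing import Any, Optional


def extract_json(answer: str) -> Optional[Any]:
    """Return the first JSON object or array found in the answer, or None if there is none."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(answer):
        if char in "{[":
            try:
                # raw_decode parses one JSON value starting at index and ignores what follows.
                value, _ = decoder.raw_decode(answer, index)
                return value
            except json.JSONDecodeError:
                continue
    return None


# Illustrative answer text, not real model output.
sample = 'Sure! {"sport": null, "explanation": "No sport is discussed."} Hope this helps.'
print(extract_json(sample))
```

A `None` result is a signal to retry the prompt or log the raw answer for inspection.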

# Experiment 5: One-to-Many Shot Learning to Teach Llama
In this last experiment, I want to generalize the industry and sport questions by giving Llama examples of what I am looking for.

The following prompt asks Llama to identify general topics mentioned in the article. To explain to Llama what I mean by a general topic, I included examples already in JSON.
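
One way to assemble such a few-shot prompt is to build the example answer with `json.dumps`, which guarantees that the examples shown to the model are themselves valid JSON (hand-written arrays easily pick up a trailing comma, which strict parsers reject). This is a sketch: `build_few_shot_prompt` is a hypothetical helper that follows the two-turn layout used in this notebook, and f-strings avoid the `{{ }}` escaping that `str.format` requires.

```python
import json
from typing import Dict, List


def build_few_shot_prompt(
    system_message: str,
    example_question: str,
    examples: List[Dict[str, str]],
    question: str,
    article: str,
) -> str:
    """Build a two-turn prompt: one worked example turn, then the real question."""
    example_answer = json.dumps(examples, indent=2)
    return (
        f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{example_question} [/INST]\n{example_answer} </s>\n\n"
        f"<s>[INST]\n{question}\nArticle: {article}\n[/INST]"
    )


examples = [
    {"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"},
    {"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"},
]
prompt = build_few_shot_prompt(
    system_message="You are a researcher tasked with answering questions about an article.",
    example_question="What entities mentioned in the article can generalize the topic?",
    examples=examples,
    question="What entities are mentioned in the article? Answer only in JSON.",
    article="(article text goes here)",
)
print(prompt)
```

Keeping the examples as Python dictionaries also makes it easy to add or remove shots and compare the results.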

In [209]:
p12 = """<s>[INST] <<SYS>>
You are a researcher task with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.

Output answer in JSON using the following format: {{"name": name, "type": type, "explanation": explanation}}
<</SYS>>

What entities, companies, people and products mentioned in the article that can generalize the topic? [/INST] 
[
{{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},
{{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},
{{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}},
] </s>

<s>[INST]   
What entities mentioned in the article? Answer in only JSON using the following format: {{"name": [entity_name], "type": [entity_type], "explanation": [short_explanation]}}
Article: {BODY}
[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p12, url="https://api-inference.huggingface.co/models/meta-llama/Llama-2-13b-chat-hf"))

In [206]:
p14 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer, please don't share false information. Do not repeat entities. 


Output answer in JSON using the following format: {{"name": name, "type": type, "explanation": explanation}}
<</SYS>>

What entities, product, people mentioned in the article that can generalize the topic? [/INST]
[
{{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},
{{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},
{{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}},
{{"name": "Ford", "type": "company", "explanation": "Ford is a company that built vehicles"}},
{{"name": "John Smith", "type": "person", "explanation": "The founder of Smith Industries"}},
] </s>

<s>[INST]
Extract entities from the article and output them in JSON. The JSON must be valid.
Please do not repeat entities. The answer must following this JSON format: {{"name": name, "type": type, "explanation": [concise_explanation_of_the_entity]}}
Article: {BODY}
JSON:
[/INST]""".format(BODY=short_article)

display_answer(run_prompt(p14, url="https://api-inference.huggingface.co/models/meta-llama/Llama-2-13b-chat-hf"))

I am surprised by how good the results were, and how easy it was, to have Llama identify those topics.