# Llama 2: Prompt Engineering Experiments to Extract Information

This notebook documents experiments with Llama 2 prompts.

In [26]:
import requests, time, os, json
from dotenv import load_dotenv
from IPython.display import display, HTML
from sqlmodel import SQLModel, Field, ARRAY, Float, JSON, Relationship
from typing import Optional, List, Dict
from sqlalchemy import Column
import openai

In [7]:
# Using Llama 2 70B on HuggingFace requires an authentication token and a HuggingFace Pro account, which costs $9 a month.
# To learn more, see:
# - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf?inference_api=true
# - https://huggingface.co/pricing

# Load the authentication token from the .env file
load_dotenv('../.env')
together_token = os.getenv("TOGETHER_TOKEN")
# Confirm the token was loaded without echoing the secret into the notebook output
print("TOGETHER_TOKEN loaded:", together_token is not None)


TOGETHER_TOKEN loaded: True


In [45]:
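from pydantic import BaseModel

# One named entity mentioned in the article
class Entity(BaseModel):
    name: Optional[str] = None
    type: Optional[str] = None
    explanation: Optional[str] = None

# One concept or idea discussed in the article
class ConceptIdea(BaseModel):
    concept: Optional[str] = None
    explanation: Optional[str] = None

# Structured information the model is asked to extract from a document.
# Every field defaults to None so that a partial reply still validates.
class DocumentInfo(BaseModel):
    oneSentenceSummary: Optional[str] = None
    summaryInNumericBulletPoints: Optional[List[str]] = None
    entities: Optional[List[Entity]] = None
    concepts_ideas: Optional[List[ConceptIdea]] = None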
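
A model reply often wraps the JSON in extra chatter, so the object usually has to be located before it can be checked against the `DocumentInfo` fields. The sketch below is one possible way to do that with the standard library only; `extract_json_object`, `missing_keys`, and the sample `reply` are hypothetical and not part of the original notebook. The resulting dict could then presumably be passed on as `DocumentInfo(**data)`.

```python
import json

# Top-level keys copied from the DocumentInfo model in this notebook.
EXPECTED_KEYS = {
    "oneSentenceSummary",
    "summaryInNumericBulletPoints",
    "entities",
    "concepts_ideas",
}

def extract_json_object(reply: str) -> dict:
    """Return the first JSON object embedded in `reply` (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")

def missing_keys(data: dict) -> set:
    """Report which expected top-level keys the reply left out."""
    return EXPECTED_KEYS - set(data)

# Made-up reply, used only to exercise the helpers.
reply = 'Sure! JSON Output: {"oneSentenceSummary": "RAG basics.", "entities": []} Hope this helps.'
data = extract_json_object(reply)
print(data["oneSentenceSummary"])
print(sorted(missing_keys(data)))
```

Reporting the missing keys rather than raising keeps partially correct replies usable while comparing prompts.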

### General methods and classes used in the experiments below

In [8]:
# Object to represent an answer from Llama
class Answer:
    def __init__(self, answer, elapse):
        self.answer = answer
        self.elapse = elapse
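
As a rough illustration of how an `Answer` might be filled in, the sketch below times an arbitrary callable and stores the reply together with the elapsed seconds. `timed_answer` and `fake_llm` are hypothetical names; the real notebook presumably calls a hosted model instead of the stand-in function.

```python
import time

class Answer:
    """Same shape as the notebook's Answer: the reply plus elapsed seconds."""
    def __init__(self, answer, elapse):
        self.answer = answer
        self.elapse = elapse

def timed_answer(ask, prompt):
    """Call ask(prompt) and wrap the result with the wall-clock time it took."""
    start = time.perf_counter()
    result = ask(prompt)
    return Answer(result, time.perf_counter() - start)

def fake_llm(prompt):
    # Stand-in for a real API call so the example runs offline.
    return prompt.upper()

ans = timed_answer(fake_llm, "hello")
print(ans.answer, f"{ans.elapse:.4f}s")
```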


In [67]:
text = """
A first intro to Complex RAG (Retrieval Augmented Generation)
Chia Jeng Yang
Enterprise RAG

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.
"""

In [57]:
text = """
A first intro to Complex RAG (Retrieval Augmented Generation)
Chia Jeng Yang
Enterprise RAG
Chia Jeng Yang

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.

If you’re looking for a non-technical introduction to RAG, including answers to various getting-started questions and a discussion of relevant use-cases, check out our breakdown of RAG here.
In this article, we discuss various technical considerations when implementing RAG, exploring the concepts of chunking, query augmentation, hierarchies, multi-hop reasoning, and knowledge graphs. We also discuss unsolved problems & opportunities in the RAG infrastructure space, and introduce some infrastructure solutions for building RAG pipelines.
The first obstacles and design choices you will be making when building a RAG system are in how to prepare the documents for storage and information extraction. That will be the primary focus of this article.
As a refresher, here’s an overview of a RAG system architecture.
Relevance vs Similarity
When discussing effective information retrieval in RAG, it is crucial to understand the difference between “relevance” and “similarity.” Whereas similarity is about the similarity in words matching, relevance is about the connectedness of ideas. You can identify semantically close content using a vector database query, but identifying and retrieving relevant content requires more sophisticated tooling.
This is an important concept to keep in mind as we explore various RAG techniques below. If you haven’t yet, you should check out Llamaindex’s helpful video on building production RAG apps. This is a good primer for our discussion on various RAG system development techniques.

Chunking Strategy
In the context of natural language processing, “chunking” refers to the segmentation of text into small, concise, meaningful ‘chunks.’ A RAG system can more quickly and accurately locate relevant context in smaller text chunks than in large documents.
How can you ensure you’re selecting the right chunk? The effectiveness of your chunking strategy largely depends on the quality and structure of these chunks.
Determining the optimal chunk size is about striking a balance — capturing all essential information without sacrificing speed.
While larger chunks can capture more context, they introduce more noise and require more time and compute costs to process. Smaller chunks have less noise, but may not fully capture the necessary context. Overlapping chunks is a way to balance both of these constraints. By overlapping chunks, a query will likely retrieve enough relevant data across multiple vectors in order to generate a properly contextualized response.
One limitation is that this strategy assumes that all of the information you must retrieve can be found in a single document. If the required context is split across multiple different documents, you may want to consider leveraging solutions like document hierarchies and knowledge graphs.

Document Hierarchies
A document hierarchy is a powerful way of organizing your data to improve information retrieval. You can think of a document hierarchy as a table of contents for your RAG system. It organizes chunks in a structured manner that allows RAG systems to efficiently retrieve and process relevant, related data. Document hierarchies play a crucial role in the effectiveness of RAG by helping the LLM decide which chunks contain the most relevant data to extract.


"""
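
The prompt strings defined in the next cell double every literal brace (`{{`, `}}`) and carry a `[/INST]` marker, which suggests they are rendered with `str.format` and assembled into a Llama 2 chat prompt by hand. The sketch below shows both ideas under that assumption; `build_prompt`, the short `template`, and the sample article sentence are hypothetical and not taken from the notebook.

```python
import json

# Llama 2 chat markers for a single-turn prompt with a system message.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(system: str, user: str) -> str:
    """Hypothetical helper: wrap a system and a user message in Llama 2 chat markers."""
    return f"{B_INST} {B_SYS}{system.strip()}{E_SYS}{user.strip()} {E_INST}"

# Doubled braces come out of str.format as literal braces; {article} is a placeholder.
template = 'Reply with JSON like {{"oneSentenceSummary": "..."}}.\nArticle: {article}'
user = template.format(article="RAG retrieves context before generating.")

prompt = build_prompt("You are a researcher.", user)
print(prompt)

# Once the braces are unescaped, the example is ordinary JSON again.
example = json.loads('{{"oneSentenceSummary": "..."}}'.format())
print(example)
```

Parsing the rendered example with `json.loads` is a cheap way to catch typos in a few-shot JSON sample before it is sent to the model.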

In [68]:
system_content = """You are a researcher tasked with answering questions about an article.
        Please ensure that your responses are socially unbiased and positive in nature.
        If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
        If you don't know the answer, please don't share false information."""
        
user_content_1 = """Answer output must conform to this JSON format [/INST] 

        JSON Output: {{
        "oneSentenceSummary" : "Mobile game soft launch is a process of releasing a game to a limited audience for testing.",
        "summaryInNumericBulletPoints" : [
        "1. Mobile game soft launch is a process of releasing a game to a limited audience for testing.",
        "2. Mobile game soft launch is a process of releasing a game to a limited audience for testing.",
        ],
        "entities" : [
        {{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},
        {{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},
        {{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}},
        ],
        "concepts_ideas": [
            {{"concept": "mobile game soft launch", "explanation": "Mobile game soft launch is a process of releasing a game to a limited audience for testing."}},
            {{"concept": "US Civil War", "explanation": "The American Civil War was a civil war in the United States between the Union and the Confederacy, which had been formed by states that had seceded from the Union. The central cause of the war was the dispute over whether slavery would be permitted to expand into the western territories, leading to more slave states, or be prevented from doing so, which many believed would place slavery on a course of ultimate extinction."}},
            {{"concet": "Capitalism", "explanation": Capitalism is an economic system based on the private ownership of the means of production and their operation for profit. Central characteristics of capitalism include capital accumulation, competitive markets, price system, private property, property rights recognition, voluntary exchange, and wage labor."}}    
        ] 
        }}"""
user_content_2 = """Use the examples above to answer the following questions.
        1. Summarize the article in one sentence. Limit the answer to twenty words.
        2. Summarize the article in multiple bullet points. Each bullet point needs to have between ten and twenty words. Keep the number of bullet points below six.
        3. Identify ten entities (companies, people, locations, products, ...) mentioned in the article. Include a short explanation for each entity.
        4. Identify three concepts or ideas mentioned in the article. Include short explanation for each concept or idea.

        Use the JSON format above to output your answer. Only output valid JSON format."""
        
user_content_3 = """Article: {BODY}""".format(BODY=text)

client = openai.OpenAI(
    api_key=together_token,
    base_url="https://api.together.xyz/v1",
    )
chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    response_format={"type": "json_object", "schema": DocumentInfo.model_json_schema()},
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content_1},
        {"role": "user", "content": user_content_2},
        {"role": "user", "content": user_content_3},
    ],
    temperature=0.2,
    max_tokens=1024,
    top_p=0.8,
    frequency_penalty=1.0,
)


In [69]:
response = chat_completion.choices[0].message.content
print("Together response:\n", response)

Together response:
  {
"oneSentenceSummary" : "This article discusses various technical considerations when implementing Retrieval Augmented Generation (RAG) systems, including chunking strategy, document hierarchies, and the difference between relevance and similarity in information retrieval",
"summaryInNumericBulletPoints" : [
"1. The article discusses the importance of understanding the difference between 'relevance' and 'similarity' in effective information retrieval for RAG systems",
"2. The article introduces the concept of 'chunking' in natural language processing, which involves segmenting text into small, concise, meaningful chunks for faster and more accurate information retrieval",
"3. The effectiveness of the chunking strategy depends on the quality and structure of the chunks, and determining the optimal chunk size is about striking a balance between capturing essential information and reducing noise",
"4. Document hierarchies are a powerful way of organizing data to impr

In [78]:
di = DocumentInfo.model_validate_json(chat_completion.choices[0].message.content)
di.concepts_ideas

[ConceptIdea(concept='RAG system architecture', explanation='RAG system architecture refers to the design and structure of a Retrieval Augmented Generation system'),
 ConceptIdea(concept='Relevance vs Similarity', explanation='Relevance and similarity are two important concepts in information retrieval, with relevance being about the connectedness of ideas and similarity being about the similarity in words matching'),
 ConceptIdea(concept='Chunking Strategy', explanation='Chunking strategy is a technique used in natural language processing to segment text into small, concise, meaningful chunks for faster and more accurate information retrieval')]

In [11]:
# MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"  #  "togethercomputer/llama-2-13b-chat"

def together_generate(prompt, temperature = 0.2, top_p = 0.8, top_k = 70) -> dict:

    URL = "https://api.together.xyz/inference"
    
    #  "stop": ".",
    payload = {
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "prompt": prompt,
        "max_tokens": 1000,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": 1,
    }
    headers = {
        "accept": "application/json",
        "content-type": "application/json",
        "Authorization": f"Bearer {together_token}",
        "User-Agent": "Acme Benchmark",
    }

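    response = requests.post(URL, json=payload, headers=headers)
    if response.status_code != 200:
        # Fail loudly instead of returning the error text, which callers
        # would otherwise try to index like a successful JSON response.
        print(response.status_code)
        print(response.text)
        raise RuntimeError(f"Together API returned status {response.status_code}")
    return response.json()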


In [15]:
# Wrapper function that runs generate and returns an Answer
def run_prompt(prompt: str, temperature = 0.2, top_p = 0.8, top_k = 70) -> Answer:
    start_time = time.time()
    # Forward the sampling parameters instead of silently using the defaults.
    res = together_generate(prompt, temperature=temperature, top_p=top_p, top_k=top_k)
    answer = res['output']['choices'][0]['text']
    end_time = time.time()
    elapse = round(end_time - start_time)
    return Answer(answer, elapse)

In [16]:
# Display answer object in HTML
def display_answer(answer: Answer, header = ''):
    answer_html_template = """<h3>{HEADER} Answer - Time to Generate: {ELAPSE} seconds</h3>
    <textarea cols='100' rows={NUM_ROWS}>{ANSWER}</textarea>"""
    
    # The rows attribute must be a positive integer: roughly one row per ten words.
    number_rows = max(1, len(answer.answer.split()) // 10)
    
    html = answer_html_template.format(ANSWER=answer.answer, ELAPSE=answer.elapse, HEADER=header, NUM_ROWS=number_rows)
    display(HTML(html))

## The Experiments' Focus
The experiments focus on extracting information from an Andreessen Horowitz blog post about mobile game soft launches. Since Llama has a token limit, only a subset of the post is used.  
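
One rough way to take a subset of a post is to cut it at a word budget. The sketch below is only an illustration: `truncate_to_word_budget` is a hypothetical helper, the 1500-word default is an arbitrary value, and word count is only a crude stand-in for the model's real token count.

```python
def truncate_to_word_budget(text: str, max_words: int = 1500) -> str:
    """Keep at most max_words whitespace-separated words of text.

    Hypothetical helper: words are only a rough proxy for tokens, and the
    default budget is an assumption, not a documented model limit.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])


sample = "one two three four five"
print(truncate_to_word_budget(sample, max_words=3))
```

A tokenizer-based count would be more accurate, at the cost of an extra dependency.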

In [7]:
cwd = os.getcwd()
cwd = cwd.replace('/notebooks', '')
directory = os.path.join(cwd, "tests/data/")

def read_txt_files(directory):
    txt_files = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            file_path = os.path.join(directory, file)
            with open(file_path, "r") as f:
                txt_files.append(f.read())
    return txt_files

text_files = read_txt_files(directory)

In [8]:
text = text_files[0]

In [9]:
p1 =  """Write a concise summary of the main ideas in the article below in bullet points; don't repeat ideas. article: {BODY}""".format(BODY=text)

display_answer(run_prompt(p1))

In [32]:
p2 =  """How would you categorize the following text? Is it a news article, a blog, or a research paper? 
Answer JSON following this format: {{'category': 'news', 'explanation': 'text of the article'}}.
Text: {BODY}
JSON:""".format(BODY=text)

display_answer(run_prompt(p2))

In [None]:
p3 =  """Summarize the article in multiple bullet points. The number of bullet points must be below six. article: {BODY}""".format(BODY=text)
p3_answer = run_prompt(p3)
display_answer(p3_answer)

Not bad, Llama listened to me :)

# Experiment 2: System Message

The Llama paper describes a system message that sets the stage and context for the model. 
In the following example, I use the system message. Let's see what the difference is between P3, which has no system message, and P4, which uses one.   
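
The prompt layout used in this notebook wraps the system message in `<<SYS>>` tags inside the first `[INST]` block. As a minimal sketch, a small helper can assemble that layout so the tags are not retyped in every prompt. `build_llama2_prompt` is a hypothetical name, not part of any library, and the exact whitespace between tags is an assumption.

```python
def build_llama2_prompt(user_message: str, system_message: str = "") -> str:
    """Assemble a single-turn Llama 2 chat prompt (hypothetical helper).

    With a system message, it is wrapped in <<SYS>> tags inside the [INST]
    block; without one, only the user message is wrapped.
    """
    if system_message:
        return (
            f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"<s>[INST] {user_message} [/INST]"


prompt = build_llama2_prompt(
    "Summarize the article in five bullet points.",
    system_message="You are a helpful, respectful and honest assistant.",
)
print(prompt)
```

Building prompts this way also makes it harder to forget the closing `[/INST]` tag.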

In [17]:
p4 = """<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. 
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer to a question, please don't share false information.
<</SYS>>
Write a concise TL;DR summary in numeric bullet points for the following article. Only include bullet points. 
Limit the number of bullet points to five. Output the answer in JSON format [{{"1": bullet-point}}, {{"2": bullet-point}}, ...]

article: {BODY}
[/INST]""".format(BODY=text)

# p4_answer = run_prompt(p4)
# display_answer(p4_answer, "P4 - With System Message")
# Display p3 too for comparison. 

display_answer(run_prompt(p4))


Both P3 and P4 are pretty good, and it's hard to see the difference the system message made. I personally prefer P4 (system message) because the answer reads better in my opinion, but I am sure someone will argue with me on that. 

# Experiment 3: Modify the System Message
The system message can be modified to better fit the task and to define the persona and context we want Llama to assume. 

Changes applied to the original system message:
- Use the researcher persona and specify the task: summarizing articles. 
- Remove the safety instructions; they are not needed since we are asking Llama to be truthful to the article. 

In [None]:
p5 = """<s>[INST] <<SYS>>
You are a researcher tasked with summarizing articles and writing concise briefs.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
In one sentence, tell me what the main idea of the following article is. Limit the answer to twenty words. 
Article: {BODY}
[/INST]""".format(BODY=text)

display_answer(run_prompt(p5))



The answer for p5 is the best so far in my opinion. I like the intro and conclusion that Llama added. 

# Experiment 4: Asking Questions about the Article

The article is about 'Mobile Game Soft Launch'. Let's ask Llama a specific question about it. The answer is pretty good. 

You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.

In [18]:
p6 = """[INST]
You are a researcher tasked with extracting information from articles.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.

According to the article, what problems does a mobile game soft launch solve? Please include examples from the article in your answer. 
The answer should include at least 50 words.
Article: {BODY}
[/INST]
Answer:
""".format(BODY=text)

display_answer(run_prompt(p6))


This is pretty good; let's see if we can improve on it by asking Llama what the article is about and then using the answer to ask additional questions. 
Prompt 7 asks Llama what the article is about, and the answer is then used to generate prompt 8, which asks a second question.

To make it easy to use the answer programmatically, I asked Llama to output the answer in JSON. Through experiments (not included here) I discovered that Llama needs a template and needs to be told to output only valid JSON. 
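
One detail worth keeping in mind when putting a JSON template inside a prompt built with `str.format`: literal braces must be doubled, while placeholders keep single braces. The short sketch below illustrates this with made-up strings; `template` and the `ABOUT` placeholder are illustrative names only.

```python
import json

# Doubled braces become literal braces after .format(); {ABOUT} is substituted.
template = (
    'Output the answer in the following format {{"article_is_about": answer}}. '
    "Topic: {ABOUT}"
)
rendered = template.format(ABOUT="soft launch")
print(rendered)

# A JSON example written with doubled braces only becomes valid JSON once
# the string has actually gone through .format().
example = '{{"industry": "gaming"}}'.format()
print(json.loads(example))
```

A string that is never passed through `.format` keeps its doubled braces, and a placeholder written as `{{ABOUT}}` is emitted literally as `{ABOUT}` instead of being filled in.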

In [19]:
p7 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
Tell me what the article is about in one to three words. 
Output the answer in JSON in the following format {{"article_is_about": answer}}. Only output JSON.
Article: {BODY}
[/INST]""".format(BODY=text)

a7 = run_prompt(p7)
json_a7 = json.loads(a7.answer)
print(f"JSON Answer:\n {json_a7}")
about = json_a7['article_is_about']
print(about)


p8 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
According to the article, what problems does {ABOUT} solve? Give several short examples from the article in your answer. Limit the answer to fifty words.
Article: {BODY}
Answer:
[/INST]""".format(BODY=text, ABOUT=about)

display_answer(run_prompt(p8),'Prompt 8')


JSON Answer:
 {'article_is_about': "Richard Romanus' Death"}
Richard Romanus' Death


Wow, this is awesome. We can feed answers into new prompts to refine the information we are trying to extract. 
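
The chaining idea can be sketched independently of any particular API. In the example below, `chain_prompts` and `fake_generate` are hypothetical names: the generator is passed in as a plain function so the flow can be tried offline with a stub, and a real call to the model would take its place.

```python
import json
from typing import Callable


def chain_prompts(article: str, generate: Callable[[str], str]) -> str:
    """Ask what the article is about, then reuse the answer in a second prompt."""
    first = (
        "Tell me what the article is about in one to three words. "
        'Output only JSON in the format {{"article_is_about": answer}}.\n'
        "Article: {BODY}"
    ).format(BODY=article)
    about = json.loads(generate(first))["article_is_about"]

    second = (
        "According to the article, what problems does {ABOUT} solve?\n"
        "Article: {BODY}"
    ).format(ABOUT=about, BODY=article)
    return generate(second)


def fake_generate(prompt: str) -> str:
    """Stub standing in for the model, so the example runs without network access."""
    if "article_is_about" in prompt:
        return '{"article_is_about": "soft launch"}'
    return "stub answer for: " + prompt.splitlines()[0]


print(chain_prompts("some article text", fake_generate))
```

The first answer has to parse as JSON for the chain to continue, which is why the first prompt insists on JSON-only output.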

Let's try a different question: what industry is the article about?

In [None]:
p9 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
Name the industry the article focuses on. Output only the industry name.
Article: {BODY}
[/INST]""".format(BODY=text)

display_answer(run_prompt(p9))

Notice that I asked for the answer to include only the industry name, but Llama disregarded my request and wrote a sentence. 
Let's see if we can fix that by asking for the answer in JSON. It worked! 

Note: I needed to add "Include only valid JSON" to prevent Llama from adding an explanation.  
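
When the model still wraps its JSON in extra text, one defensive option is to pull the first JSON object out of the reply before parsing. This is a sketch under that assumption: `extract_json_object` is a hypothetical helper, built on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Return the first JSON object found in reply (hypothetical helper).

    Tries to decode starting at each '{' until one parses, which skips any
    explanation the model put before or after the JSON.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in reply")


print(extract_json_object('Sure! {"industry": "gaming"} Hope this helps.'))
```

This does not replace the prompt instruction; it only makes the parsing step less brittle when the instruction is ignored.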

In [None]:
p10 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.
<</SYS>>
Name the industry the article focuses on. Output only the industry name. Output the answer in JSON, using format {{"industry": industry}}.
Include only valid JSON.
Article: {BODY}
[/INST]""".format(BODY=text)

display_answer(run_prompt(p10))

Let's try to trick Llama by asking which sport the article focuses on. 

In [None]:
p11 = """[INST]
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.

Name the sport the article focuses on. Output only the sport name. Output the answer in JSON, using format {{"sport": sport, "explanation": explanation}}. 
Include only valid JSON. Make sure to close the JSON object with a curly bracket.
Article: {BODY}
[/INST]""".format(BODY=text)

display_answer(run_prompt(p11))

In [None]:
p12 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.

Output answer in JSON using the following format: {{"name": name, "type": type, "explanation": explanation}}
<</SYS>>

What entities mentioned in the article generalize the topic? [/INST] 
[
{{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},
{{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},
{{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}}
] </s>

<s>[INST]   
What entities mentioned are important to the article's subject? Limit the answer to the ten most important entities. 
Output answer in JSON using the following format: {{"name": name, "type": type, "explanation": explanation}}
Article: {BODY}
[/INST]""".format(BODY=text)

display_answer(run_prompt(p12))

In [20]:
p13 = """<s>[INST] <<SYS>>
You are a researcher tasked with answering questions about an article.  
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. 
If you don't know the answer, please don't share false information.

<</SYS>>

Answers must conform to this JSON format [/INST] 

JSON Output: {{
"oneSentenceSummary" : "Mobile game soft launch is a process of releasing a game to a limited audience for testing.",
"summaryInNumericBulletPoints" : [
"1. Mobile game soft launch is a process of releasing a game to a limited audience for testing.",
"2. Mobile game soft launch is a process of releasing a game to a limited audience for testing."
],
"entities" : [
{{"name": "semiconductor", "type": "industry", "explanation": "Companies engaged in the design and fabrication of semiconductors and semiconductor devices"}},
{{"name": "NBA", "type": "sport league", "explanation": "NBA is the national basketball league"}},
{{"name": "Ford F150", "type": "vehicle", "explanation": "Article talks about the Ford F150 truck"}}
]
}} </s>

<s>[INST]
Use the example above to answer the following questions.
1. Summarize the article in one sentence. Limit the answer to twenty words.
2. Summarize the article in multiple bullet points. Each bullet point needs to have between ten and twenty words. Keep the number of bullet points below six.
3. Identify ten entities (companies, people, locations, products, ...) mentioned in the article. Include a short explanation for each entity.

Use the JSON format above to output your answer. Only output valid JSON format.
Article: {BODY}
[/INST]""".format(BODY=text)

display_answer(run_prompt(p13))

In [None]:
# os.listdir does not expand "~", so resolve the home directory explicitly.
directory = os.path.expanduser("~/Projects/icognition-backend/tests/data/")

def read_txt_files(directory):
    txt_files = []
    for file in os.listdir(directory):
        if file.endswith(".txt"):
            file_path = os.path.join(directory, file)
            with open(file_path, "r") as f:
                txt_files.append(f.read())
    return txt_files