### LLM RAG GENERATION ###

In [None]:
# For Colab
#!pip install huggingface_hub
#!pip install accelerate

In [13]:
import transformers
import os
import torch
from timeit import default_timer as timer
import gc
from string import Template
import json
from time import sleep

# For Colab
#from huggingface_hub import notebook_login

In [None]:
# For Colab
# hugging face auth
# notebook_login() #

In [9]:
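# Limit allocator block splitting to reduce fragmentation in case GPU memory overflows
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"

def reset_cache():
    """Collect Python garbage and release cached GPU memory."""
    gc.collect()
    torch.cuda.empty_cache()
    # print(torch.cuda.memory_summary(device=None, abbreviated=False))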

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
# dtype passed to the pipeline below (torch.float16 / torch.float32 are alternatives)
torch_dtype = torch.bfloat16

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch_dtype},
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Arrrr, shiver me timbers! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas! Me be here to swab the decks with me trusty keyboard, answerin' yer questions and tellin' tales of me adventures on the high seas. So hoist the colors, me hearty, and let's set sail fer a swashbucklin' good time!
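
The call to `apply_chat_template` above turns the `messages` list into one prompt string. As a rough illustration only, the layout can be reproduced by hand. The function below is a hypothetical helper, not part of `transformers`; its token layout is inferred from the prompt echoed in the `outputs` dump of test 2, and the real template may differ in details.

```python
def format_llama3_chat(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the Llama 3 chat layout (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_prompt = format_llama3_chat([
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
])
print(demo_prompt)
```

In practice the tokenizer's own template should be preferred; this sketch only makes visible why `<|eot_id|>` is listed among the `terminators`.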


In [10]:
# Test 2: entity and relationship extraction
people_prompt_template = """
From the list of people below, extract the following Entities & relationships described in the mentioned format 
0. ALWAYS FINISH THE OUTPUT. Never send partial responses
1. First, look for these Entity types in the text and generate as comma-separated format similar to entity type.
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. Do not create new entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
    Entity Types:
    label:'Project',id:string,name:string;summary:string //Project mentioned in the profile; `id` property is the full lowercase name of the project, with no capital letters, special characters, spaces or hyphens.
    label:'Technology',id:string,name:string //Technology Entity, as listed in the "skills"-section of every person; `id` property is the name of the technology, in camel-case.
    
2. Next, generate each relationship as a triple of head, relationship and tail. To refer to the head and tail entities, use their respective `id` property. Relationship property should be mentioned within brackets as comma-separated. They should follow these relationship types below. You will have to generate as many relationships as needed as defined below:
    Relationship types:
    person|HAS_SKILLS|technology 
    project|HAS_PEOPLE|person


The output should look like :
{
    "entities": [{"label":"Person","id":string,"name":string}],
    "relationships": ["projectid|HAS_PEOPLE|personid"]
}

Case Sheet:
Full Name: Sarah Johnson
Skills: Machine Learning, Data Analytics, Azure, Python
Projects: BetaHealth Secure Healthcare Data Analytics Platform on Azure

Full Name: David Patel
Skills: AWS, Cloud Computing, DevOps, Data Warehousing
Projects: 

Full Name: Amanda Rodriguez
Skills: Data Security, Compliance, Healthcare Regulations, Azure
Projects: BetaHealth Secure Healthcare Data Analytics Platform on Azure

Full Name: Jason Mitchell
Skills: Data Analytics, Machine Learning, Azure, Data Warehousing
Projects: GammaTech Smart Logistics Platform on Azure

Full Name: Emily Turner
Skills: IoT, Real-time Data Management, Azure, Python
Projects: GammaTech Smart Logistics Platform on Azure

Full Name: Michael Clark
Skills: Data Engineering, Data Warehousing, AWS, Python
Projects: AlphaCorp AWS-Powered Supply Chain Optimization Platform

Full Name: Jessica White
Skills: Data Privacy, Security Compliance, Azure Key Vault, Healthcare Regulations
Projects: BetaHealth Secure Healthcare Data Analytics Platform on Azure

Full Name: Daniel Brown
Skills: Cloud Architecture, DevOps, AWS Lambda, Azure Functions
Projects: GammaTech Smart Logistics Platform on Azure

Full Name: Olivia Martinez
Skills: Real-time Monitoring, Azure Monitoring, Data Analytics
Projects: AlphaCorp AWS-Powered Supply Chain Optimization Platform

Full Name: William Lee
Skills: Predictive Modeling, Machine Learning, AWS SageMaker, Azure Machine Learning
Projects: BetaHealth Secure Healthcare Data Analytics Platform on Azure
"""
messages = [
    {"role": "system", "content": "You are a helpful IT-project and account management expert who extracts information from documents."},
    {"role": "user", "content": people_prompt_template},
]

start = timer()
results = []
reset_cache()

prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)

outputs = pipeline(
    prompt,
    max_new_tokens=256,  # likely too low for this task: the JSON printed below is cut off mid-entity
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])

end = timer()
print(f"Pipeline completed in {end-start} seconds")

outputs


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Here is the output in the required format:

{
    "entities": [
        {"label": "Person", "id": "sarahjohnson", "name": "Sarah Johnson"},
        {"label": "Person", "id": "davidpatel", "name": "David Patel"},
        {"label": "Person", "id": "amandarodriguez", "name": "Amanda Rodriguez"},
        {"label": "Person", "id": "jasonmitchell", "name": "Jason Mitchell"},
        {"label": "Person", "id": "emilyturner", "name": "Emily Turner"},
        {"label": "Person", "id": "michaelclark", "name": "Michael Clark"},
        {"label": "Person", "id": "jessicawhite", "name": "Jessica White"},
        {"label": "Person", "id": "danielbrown", "name": "Daniel Brown"},
        {"label": "Person", "id": "oliviamartinez", "name": "Olivia Martinez"},
        {"label": "Person", "id": "williamlee", "name": "William Lee"},
        {"label": "
Pipeline completed in 3261.1733299000007 seconds


[{'generated_text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful IT-project and account management expert who extracts information from documents.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFrom the list of people below, extract the following Entities & relationships described in the mentioned format \n0. ALWAYS FINISH THE OUTPUT. Never send partial responses\n1. First, look for these Entity types in the text and generate as comma-separated format similar to entity type.\n   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. Do not create new entity types that aren\'t mentioned below. You will have to generate as many entities as needed as per the types below:\n    Entity Types:\n    label:\'Project\',id:string,name:string;summary:string //Project mentioned in the profile; `id` property is the full lowercase name of the pr
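
The reply above stops in the middle of an entity, so it cannot be loaded as JSON. A minimal sketch of how a reply could be checked before use is shown here. `parse_json_reply` is a hypothetical helper name, and the two sample strings are shortened stand-ins for real model replies.

```python
import json


def parse_json_reply(reply):
    """Return (data, error): the parsed JSON value, or None plus a reason."""
    start = reply.find("{")
    if start == -1:
        return None, "no JSON object found"
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        data, _ = json.JSONDecoder().raw_decode(reply[start:])
    except json.JSONDecodeError as exc:
        return None, f"invalid or truncated JSON: {exc.msg}"
    return data, None


complete = (
    'Here is the output:\n{"entities": [{"label": "Person", '
    '"id": "sarahjohnson", "name": "Sarah Johnson"}], "relationships": []}'
)
truncated = (
    'Here is the output:\n{"entities": [{"label": "Person", '
    '"id": "williamlee", "name": "William Lee"}, {"label": "'
)

print(parse_json_reply(complete))
print(parse_json_reply(truncated))
```

A failed parse is a signal to rerun the generation with a larger `max_new_tokens` rather than to store the partial result.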

In [14]:
# Test 3: question/answer generation from a news file

prompt_temp = """You are a seasoned stock market analyst. Your task is to create strategic questions for companies based on relevant news and basic financials from the past weeks, then provide an answer for the questions for the companies' financial status.
0. ALWAYS FINISH THE OUTPUT. Never send partial responses
1. The output should look like :
{
    "question": "Where are Apple Inc. offices?",
    "answer": "The offices are located in ONE APPLE PARK WAY Cupertino CA"
}

Or like :
{
    "question": "When was the stock value of Apple Inc. the highest?",
    "answer": "Apple Inc.'s highest value was on 2023-12-14"
}
2. Provide at least 20 and no more than 50 questions with their corresponding answers.

$ctext"""

# Read text from file
def read_txt(file_path):
    with open(file_path, 'r') as file:
        text = file.read().replace('\n', ' ')
    return text

def process_llama(file_prompt, system_msg):
    """Run one system/user exchange through the Llama 3 pipeline and return the reply text."""
    reset_cache()
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": file_prompt},
    ]
    
    prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
    )

    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )

    sleep(8)
    return outputs[0]["generated_text"][len(prompt):]

def extract_qa(file_path, prompt_template):
    """Build the prompt from one news file and ask Llama 3 for Q&A pairs."""
    results = []
    start = timer()

    system_msg = "Generate a question and answer pair based on the following news content."

    # Read the text from a single news file (reuses read_txt instead of shadowing `file`)
    text = read_txt(file_path)

    print("Extracting questions and answers from the news content.")
    prompt = Template(prompt_template).substitute(ctext=json.dumps(text))
    result = process_llama(prompt, system_msg=system_msg)
    results.append(result)

    end = timer()
    print(f"Pipeline completed in {end - start} seconds")
    return results

In [16]:
file_path = "./news.txt"
results = extract_qa(file_path, prompt_temp)
results

Extracting questions and answers from the news content.


Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (8192). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


Pipeline completed in 8114.627988299999 seconds


['The date is 2024. The summary is 2024. The date is 2024. The headline is 2024. The date is 2024. The summary is 2024. The date is 2024. The date is 2024. The date is 2024. The summary is 2024. The date is 2024. The date is 2024. The date is 2024. The summary is 2024. The date is 2024. The summary is 2024. The date is 2024. The summary is 2024. The date is 2024. The date is 2024. The summary is 2024. The date is 2024. The date is 2024. The date is 2024. The summary is 2024. The date is 2024. The date is 2024. The date is 2024. The date is 2024. The summary is 2024. The date is 2024. The date is 2024. The date is 2024. The date is 2024. The date is 2024. The date is 2024. The date is ']
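
The warning above says the call exceeds the model's maximum length (8192), and the repetitive output suggests the whole news file was too long for a single prompt. One possible workaround, sketched here under assumptions, is to split the text into smaller overlapping pieces and prompt on each piece separately. `chunk_words` is a hypothetical helper; it counts words as a rough proxy for tokens, and the default sizes are placeholders rather than tuned values.

```python
def chunk_words(text, max_words=1500, overlap=100):
    """Split text into overlapping chunks of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo_text, max_words=4, overlap=1)
print(demo_chunks)
```

A more careful version would measure length with the model's tokenizer instead of word counts, since the prompt template and the generated tokens also count toward the limit.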

In [1]:
# Requires openai>=1.0; older releases raise the ImportError shown below
from openai import OpenAI

# Base prompt passed to the LLM to generate our base dataset
prompt_gpt = """[INST]<<SYS>>
You are a seasoned stock market analyst. Your task is to create strategic questions for companies based on relevant news and basic financials from the past weeks, then provide an answer for the questions for the companies' financial status.
0. ALWAYS FINISH THE OUTPUT. Never send partial responses
1. The output should look like :
{
    "question": "Where are Apple Inc. offices?",
    "answer": "The offices are located in ONE APPLE PARK WAY Cupertino CA"
}

Or like :
{
    "question": "When was the stock value of Apple Inc. the highest?",
    "answer": "Apple Inc.'s highest value was on 2023-12-14"
}
2. Provide at least 20 and no more than 50 questions with their corresponding answers.

<</SYS>>
$ctext
"""

def extract_qa_chatgpt(file_path, prompt_template):
    """Build the prompt from one news file and ask the OpenAI chat model for Q&A pairs."""
    results = []
    start = timer()

    system_msg = "Generate a question and answer pair based on the following news content."

    # Read the text from a single news file (avoids shadowing the parameter with the file handle)
    with open(file_path, 'r') as news_file:
        text = news_file.read().replace('\n', ' ')

    print("Extracting questions and answers from the news content.")
    prompt = Template(prompt_template).substitute(ctext=json.dumps(text))
    result = process_gpt(prompt, system_msg=system_msg)
    results.append(result)

    end = timer()
    print(f"Pipeline completed in {end - start} seconds")
    return results

client = OpenAI()  # reads the key from the OPENAI_API_KEY environment variable instead of hard-coding it

def process_gpt(file_prompt, system_msg):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # chat model to use
        max_tokens=500,
        temperature=0,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": file_prompt},
        ],
    )
    nlp_results = completion.choices[0].message.content
    sleep(8)
    return nlp_results

ImportError: cannot import name 'OpenAI' from 'openai' (c:\Users\jpbla\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\__init__.py)

In [None]:
file_path = "./news.txt"
results_gpt = extract_qa_chatgpt(file_path, prompt_gpt)
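
The prompt asks for several `{"question": ..., "answer": ...}` objects in a row, which is not a single valid JSON document. A sketch of how such a reply could be turned into dataset rows follows. `extract_qa_pairs` is a hypothetical helper, and `demo_reply` is an invented sample, not real model output.

```python
import json


def extract_qa_pairs(reply):
    """Collect every {"question": ..., "answer": ...} object found in reply."""
    decoder = json.JSONDecoder()
    pairs = []
    pos = reply.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(reply, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace (e.g. cut off); try the next one.
            pos = reply.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and {"question", "answer"} <= obj.keys():
            pairs.append({"question": obj["question"], "answer": obj["answer"]})
        pos = reply.find("{", end)
    return pairs


demo_reply = '''Here are the pairs:
{"question": "Where are Apple Inc. offices?", "answer": "ONE APPLE PARK WAY Cupertino CA"}
{"question": "Cut off", "answer": "never clos'''
demo_pairs = extract_qa_pairs(demo_reply)
print(demo_pairs)
```

Objects that are incomplete or lack either key are skipped silently, so comparing the number of pairs returned with the number requested gives a quick sanity check.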


1. Which algorithm can be used as a baseline to predict the target variables?

In this case two LLMs are used to generate the inputs and outputs. The procedure is as follows:
* Generate questions and answers from financial data using Llama 3 and ChatGPT-4.
* Train the same LLM on the questions and answers built from our financial data.
* Use the RAG (the trained LLM) to run test queries.
* Use the RAG to generate the relationships and nodes for the graph database (GraphRAG) in Neo4j.
The hypothesis is that querying the model through GraphRAG will produce much faster and better answers.
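
The last step, loading nodes and relationships into Neo4j, could start from the `entities` / `relationships` JSON requested in test 2. The sketch below only builds Cypher text and does not talk to a database. `to_cypher` is a hypothetical helper and `demo_graph` is a small invented sample in the same shape as the prompt's output format. Labels and relationship names are interpolated unchecked, so this is suitable only for trusted input; a real loader would more likely pass values as query parameters through the Neo4j driver.

```python
import json


def to_cypher(graph):
    """Turn the entities/relationships JSON into Cypher MERGE statements."""
    statements = []
    for entity in graph.get("entities", []):
        props = {k: v for k, v in entity.items() if k != "label"}
        # json.dumps gives a quoted, escaped literal for each string value.
        prop_text = ", ".join(f"{k}: {json.dumps(v)}" for k, v in props.items())
        statements.append(f"MERGE (:{entity['label']} {{{prop_text}}})")
    for triple in graph.get("relationships", []):
        # Each relationship is a "head|RELATION|tail" string of entity ids.
        head, rel, tail = triple.split("|")
        statements.append(
            f"MATCH (a {{id: {json.dumps(head)}}}), (b {{id: {json.dumps(tail)}}}) "
            f"MERGE (a)-[:{rel}]->(b)"
        )
    return statements


demo_graph = {
    "entities": [
        {"label": "Person", "id": "sarahjohnson", "name": "Sarah Johnson"},
        {"label": "Technology", "id": "python", "name": "Python"},
    ],
    "relationships": ["sarahjohnson|HAS_SKILLS|python"],
}
demo_statements = to_cypher(demo_graph)
for stmt in demo_statements:
    print(stmt)
```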

2. Can feature importance be determined for the generated model? Remember that including irrelevant features can hurt model performance and add complexity without substantial benefit.

In this case the features will be determined by the model itself.

3. Is the model underfitting or overfitting the training data?
It would be considered overfitted if it does not answer the questions adequately.

4. What is the appropriate metric for this business problem?
The metric is the accuracy of the answer, although other metrics from the literature could be used; we have not defined them yet.
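
Since the metric is not fixed yet, here is one candidate from the question-answering literature as a sketch only: a token-overlap F1 score between a generated answer and a reference answer. `token_f1` is a hypothetical helper, the tokenisation (lowercased word characters) is an assumption, and the sample strings are invented.

```python
import re
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        # Two empty answers match; one empty answer scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


demo_score = token_f1(
    "Apple Inc. highest value was on 2023-12-14",
    "The highest value was on 2023-12-14",
)
print(round(demo_score, 3))
```

A score like this rewards partial matches, which exact-match accuracy does not, but it says nothing about factual correctness beyond word overlap.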

5. What should the minimum performance be?
The minimum performance is that the question gets answered.
