# Building a simple RAG
We have learned that ChatGPT knows very little about ITU, so let's incorporate some knowledge via RAG!

For the sake of simplicity, we use LlamaIndex. You could also build a custom RAG architecture by hand, as many companies do (i.e., programming the communication between the LLM and the vector DB/SQL DB yourself).

Either way, we are aiming for a basic setup like this:
![Typical RAG pipeline](RAG_pipeline.png)

In [2]:
!pip install llama-index llama-index-vector-stores-chroma chromadb openai


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [5]:
!pip install llama-index llama-index-experimental



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### 0: IMPORTING LIBRARIES
Using pre-built class for directory reading.

In [4]:
import os
import chromadb
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader

In [6]:

import logging
import sys
from IPython.display import Markdown, display

import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine


logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [None]:
documents = SimpleDirectoryReader("preprocessed").load_data()

### 1: IMPORTING OPENAI API KEY

In [8]:

# first, load and set openaikey from a txt file I stored it in
with open('oaikey.txt') as keyfile:
    oaikey = keyfile.read().strip()
    
os.environ["OPENAI_API_KEY"] = oaikey

### 2: IMPORTING DATASET

In [9]:
df = pd.read_csv("./data/outside/outside_7_days_from_23_to_29_August/outside_7_days_from_23_to_29_August.csv") 


In [16]:
df

Unnamed: 0,A4100209,Unnamed: 1,Port1,Port1.1,Port1.2,Port1.3,Port1.4,Port1.5,Port1.6,Port1.7,...,Port1.10,Port1.11,Port1.12,Port1.13,Port1.14,Port1.15,Port1.16,Port2,Port2.1,Unnamed: 21
0,# Records: 1954,,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,...,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite,Battery,Battery,
1,Timestamps,Device RID,raw Solar Radiation,raw Precipitation,raw Drop Counts,raw Spoon Tips,raw EC,raw Wind Direction,raw Wind Speed,raw Gust Speed,...,raw Atmospheric Pressure,raw Tilt Angle,raw Min Air Temperature,raw Max Air Temperature,raw RH Sensor Temp,raw Max Precip Rate,raw VPD,raw Battery Percent,raw Battery Voltage,UTC Offset
2,08/22/2024 07:00:00 PM,65365,1292,0,0,0,0,3349,58,109,...,9808,13,8949,9037,9159,0,1.666744158708596,2100,8233,UTC+02:00
3,08/22/2024 07:05:00 PM,65366,1433,0,0,0,0,3367,67,123,...,9807,13,8959,9036,9155,0,1.6708343987898313,2100,8229,UTC+02:00
4,08/22/2024 07:10:00 PM,65367,1183,0,0,0,0,171,73,180,...,9808,13,8950,9038,9149,0,1.6720593817129479,2100,8229,UTC+02:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1951,08/29/2024 01:25:00 PM,67321,6792,0,0,0,0,3105,134,287,...,9853,13,9500,9780,9935,0,2.3099657482349034,2100,8220,UTC+02:00
1952,08/29/2024 01:30:00 PM,67322,6791,0,0,0,0,3235,106,275,...,9853,12,9564,9770,9951,0,2.346579152453595,2100,8223,UTC+02:00
1953,08/29/2024 01:35:00 PM,67323,6787,0,0,0,0,3174,156,325,...,9853,13,9574,9745,9966,0,2.303514337544168,2100,8224,UTC+02:00
1954,08/29/2024 01:40:00 PM,67324,6786,0,0,0,0,3201,136,316,...,9853,13,9583,9741,9974,0,2.299331858615904,2100,8226,UTC+02:00


In [10]:
query_engine = PandasQueryEngine(df=df, verbose=True)


In [17]:
response = query_engine.query(
    "Please provide me the average raw Atmospheric Pressure during last day",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.iloc[2:, 10].astype(float).mean()
```
> Pandas Output: 8727.226714431934


In [19]:
response = query_engine.query(
    "Please provide the raw Atmospheric Pressure values in the last two days.",
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Pandas Instructions:
```
df.iloc[2:, 10].astype(int)
```
> Pandas Output: 2       8985
3       9001
4       8999
5       8978
6       8959
        ... 
1951    9657
1952    9678
1953    9661
1954    9656
1955    9681
Name: Port1.8, Length: 1954, dtype: int64
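
Note that both generated instructions above select column index 10 (`Port1.8`) over all rows, whereas the displayed frame lists "raw Atmospheric Pressure" under `Port1.10`, and no time filter was applied. A likely cause is that the field names sit in a data row instead of the header. The following is a minimal sketch of one way to clean this up, assuming the CSV starts with three header lines as the displayed frame suggests. The in-memory `raw_csv` string is an abridged stand-in for the real file, and `clean_df`, `cutoff`, and `last_day` are names introduced here for illustration only.

```python
import io

import pandas as pd

# Abridged stand-in for the logger export: three header lines, then data rows.
# The values are copied from the frame displayed above; only four columns are kept.
raw_csv = """A4100209,,Port1,Port1
# Records: 3,,ATMOS 41W Sensor Suite,ATMOS 41W Sensor Suite
Timestamps,Device RID,raw Solar Radiation,raw Atmospheric Pressure
08/22/2024 07:00:00 PM,65365,1292,9808
08/29/2024 01:35:00 PM,67323,6787,9853
08/29/2024 01:40:00 PM,67324,6786,9853
"""

# skiprows=2 drops the first two lines, so the field names become the header row
clean_df = pd.read_csv(io.StringIO(raw_csv), skiprows=2)

# parse the timestamps so that "last day" can be expressed as a filter
clean_df["Timestamps"] = pd.to_datetime(
    clean_df["Timestamps"], format="%m/%d/%Y %I:%M:%S %p"
)

# keep only the rows from the 24 hours before the most recent record
cutoff = clean_df["Timestamps"].max() - pd.Timedelta(days=1)
last_day = clean_df[clean_df["Timestamps"] > cutoff]

mean_pressure = last_day["raw Atmospheric Pressure"].mean()
print(mean_pressure)
```

With readable column names and parsed timestamps, passing the cleaned frame to `PandasQueryEngine(df=clean_df, verbose=True)` should give the LLM a better chance of picking the right column, although the generated pandas code still needs to be checked.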


#### Wrap in LlamaIndex objects for easier handling/compatibility

In [25]:
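from llama_index.vector_stores.chroma import ChromaVectorStore
# assumed setup (the original cell is not shown): an in-memory Chroma collection named "itu"
chroma_collection = chromadb.EphemeralClient().get_or_create_collection("itu")
# ... basically, we are just specifying the storage to be used as the ChromaDB
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)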

#### Specify the embedding model 

In [26]:
from llama_index.embeddings.openai import OpenAIEmbedding

# let's use OpenAI's out-of-the-box embeddings.
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

### Start building the database from our config!
Note that you could add further processing steps here, such as:
- using an [ingestion pipeline](https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/) for further preprocessing
- [modifying chunk size and overlap](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies/#chunk-sizes) or introducing a specific chunking strategy
- and others

In [None]:
# You could modify chunk size and overlap like this
# Settings.chunk_size = 512
# Settings.chunk_overlap = 50
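
To make the two settings above more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not LlamaIndex's implementation: the library's default splitter works on tokens and tries to keep sentences intact, whereas the hypothetical helper `chunk_words` simply counts whitespace-separated words.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of `chunk_size` words, where each chunk
    repeats the last `chunk_overlap` words of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

A larger overlap makes it less likely that a fact is cut in half at a chunk boundary, at the cost of more chunks to embed and store.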

In [41]:
from llama_index.core import VectorStoreIndex
# now let's build an index for the database using pre-built functionality
# - Chunk the documents
# - Retrieve embeddings for document chunks
# - Create nodes in db based on docs/chunks
# - Index database for fast retrieval
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embed_model,
    show_progress=True
)

Parsing nodes:   0%|          | 0/41 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/65 [00:00<?, ?it/s]

In [42]:
# in case we are loading an existing collection (e.g., from disk) instead of rebuilding it, uncomment
# index2 = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)

#### Test created embedding/chunks


### Specify LLM-Chat interface
Now, we want to build the communication between an LLM and our database that resembles our typical RAG setup:
![Typical RAG pipeline](RAG_pipeline.png)



Using LlamaIndex, this is surprisingly easy.

#### Specify details about retrieval from vector db

In [45]:
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

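import logging
import sys

# Let's do some more verbose logging (DEBUG instead of INFO).
# force=True replaces the handlers configured earlier; without it, a second
# basicConfig call is ignored, and an extra handler would print every line twice.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)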

#### Configure retrieval from VectorDB 

In [46]:
# this specifies the details for retrieving the k closest elements to the user query
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5, # how many chunks (nodes) should we retrieve? Let's do 5
    verbose=True
)
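
Conceptually, the retriever configured above embeds the query and returns the k chunks whose embeddings are most similar to it. The sketch below illustrates that idea with tiny hand-made vectors. It is an illustration only: the vectors, the chunk names, and the helpers `cosine_similarity` and `top_k` are made up, cosine similarity is assumed as the measure, and a real vector store uses an index structure rather than comparing against every chunk.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# made-up 3-dimensional "embeddings" (real ones have many more dimensions)
toy_chunks = {
    "organization": [0.9, 0.1, 0.0],
    "summer_school": [0.1, 0.9, 0.1],
    "jobs": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

for chunk_id, score in top_k(toy_query, toy_chunks, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

In this toy setup, `k` plays the role of `similarity_top_k`: a larger value passes more context to the LLM, which can help recall but also adds noise and cost.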

#### Specify the used LLM
In this case, we use OpenAI's GPT-4o mini (capable and inexpensive).

In [None]:
# %pip install llama-index-llms-openai

In [52]:
from llama_index.llms.openai import OpenAI
# this is an OpenAI wrapper for llama_index
llm = OpenAI(model="gpt-4o-mini") 

#### Specify the prompt

We use a simple [RAG prompt](https://docs.llamaindex.ai/en/stable/examples/prompts/prompts_rag/).

In [59]:
from llama_index.core import PromptTemplate
from llama_index.core import get_response_synthesizer

In [60]:
# Let's specify a prompt similar to what we have learned earlier
custom_query = """
    You are an information chatbot that informs users about the Interdisciplinary Transformation University Austria (ITU) in Linz, Austria. 
    
    Here is the context information:
    ---------------------
    {context_str}
    ---------------------
    Given the context information, this prompt, and no prior knowledge, answer the query. 
    The answer must be 100 words or less.
    
    Query: {query_str}
    Answer: """

In [61]:
# this specifies how we use the retrieved chunks/text in the response
# wrap our prompt string in a LlamaIndex PromptTemplate
rag_prompt = PromptTemplate(custom_query)

# Build the response synthesizer:
# i.e., the object that fills our RAG prompt with the user query and the retrieved context and sends it to the LLM (GPT-4o-mini)
response_synthesizer = get_response_synthesizer(
    llm=llm, text_qa_template=rag_prompt, verbose=True)
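
To see what the LLM roughly receives in the end, the following sketch fills the two placeholders of a shortened version of the prompt with plain `str.format`. This is a simplification under stated assumptions: the response synthesizer may pack or split the context differently, `toy_template` is an abridged copy of the prompt above, and the two example chunks are short excerpts of the scraped ITU text.

```python
# abridged version of the RAG prompt with the same two placeholders
toy_template = (
    "Here is the context information:\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and no prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# stand-ins for the text of the retrieved chunks
retrieved_chunks = [
    "Prof. Dr.in Stefanie Lindstaedt, Founding President",
    "The Founding Convent is the strategic body of the university during the founding phase.",
]

# join the chunks into one context block and fill both placeholders
context_str = "\n\n".join(retrieved_chunks)
final_prompt = toy_template.format(
    context_str=context_str,
    query_str="Who is the founding president?",
)
print(final_prompt)
```

Printing the filled prompt like this is a cheap way to check that the instructions, the context, and the query end up in the intended order.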

#### "Assemble" query engine 
Combine the configuration from the previous steps into the query engine, i.e., the object that does the actual querying for us.

Again, we will stick to the basics here.


In [63]:
from llama_index.core.query_engine import RetrieverQueryEngine
# optional: pass node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]
# (from llama_index.core.postprocessor) to drop weakly matching chunks

# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever, # configuration for retrieval of vector chunks
    response_synthesizer=response_synthesizer, # config for synthesizing LLM prompt/response
)

#### Let's prompt away!

In [68]:
oai_response = query_engine.query('Who is the founding president?')

In [88]:
oai_response.response

'The Founding President of the Interdisciplinary Transformation University Austria (ITU) is Prof. Dr. Stefanie Lindstaedt.'

In [125]:
# which files were used as context information? 
oai_response.metadata

{'542fde3f-28d1-4690-a8a9-dd7722665809': {'file_path': '/home/jovyan/work/preprocessed/en_public-notice_provisional-bylaws-of-itu-idsa.txt',
  'file_name': 'en_public-notice_provisional-bylaws-of-itu-idsa.txt',
  'file_type': 'text/plain',
  'file_size': 36144,
  'creation_date': '2024-08-26',
  'last_modified_date': '2024-08-26'},
 '2b55da0d-4516-441d-bc60-44fb958bd8c3': {'file_path': '/home/jovyan/work/preprocessed/en_digital-transformation-university_organization.txt',
  'file_name': 'en_digital-transformation-university_organization.txt',
  'file_type': 'text/plain',
  'file_size': 2374,
  'creation_date': '2024-08-26',
  'last_modified_date': '2024-08-26'},
 '81983a30-d6ca-4fce-a39b-6b0ae3c756c5': {'file_path': '/home/jovyan/work/preprocessed/en_public-notice_provisional-bylaws-of-itu-idsa.txt',
  'file_name': 'en_public-notice_provisional-bylaws-of-itu-idsa.txt',
  'file_type': 'text/plain',
  'file_size': 36144,
  'creation_date': '2024-08-26',
  'last_modified_date': '2024-08-2

In [126]:
# some more details
example_node_id = oai_response.source_nodes[1].node_id

print(f'Gathered information from {len(oai_response.source_nodes)} text chunks, for example:\n'
      f'Node ID: {example_node_id}\n'
      f'Document: {oai_response.metadata[example_node_id]["file_name"]}\n'
      f'Text:\n{oai_response.source_nodes[1].text}')

Gathered information from 3 text chunks, for example:
Node ID: 2b55da0d-4516-441d-bc60-44fb958bd8c3
Document: en_digital-transformation-university_organization.txt
Text:
:study:careerhome : about : organization
© Felix Büchele - IT:U:organization© Lunghammer – TU GrazDipl.-Ing.in Claudia von der Linden, MBA (IMD)Chairwoman of the Founding Convent© Antje Wolm – IT:UProf. Dr.in Stefanie LindstaedtFounding President© Felix Büchele – IT:UGabriele Költringer, EMBAManaging Directorinternational strategic advisory boardfounding conventfounding presidentfounding advisory boardmanaging directorFounding ConventThe Founding Convent is the strategic body of the university during the founding phase. Two of its members were nominated by the province of Upper Austria, three by the Federal Ministry of Education, Science and Research (BMBWF), two by the Federal Ministry for Climate Protection, Environment, Energy, Mobility, Innovation and Technology (BMK), one by the Austrian Science Fund (FWF) and one
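
When several chunks come from the same file, a quick tally per source file can be easier to read than the raw metadata dictionary. The sketch below shows one way to do this. The `source_files` mapping is copied by hand from the metadata output above so that the snippet runs on its own, and the variable names are introduced here for illustration.

```python
from collections import Counter

# node id -> file name, copied from the metadata output above
source_files = {
    "542fde3f-28d1-4690-a8a9-dd7722665809": "en_public-notice_provisional-bylaws-of-itu-idsa.txt",
    "2b55da0d-4516-441d-bc60-44fb958bd8c3": "en_digital-transformation-university_organization.txt",
    "81983a30-d6ca-4fce-a39b-6b0ae3c756c5": "en_public-notice_provisional-bylaws-of-itu-idsa.txt",
}

# count how many retrieved chunks each file contributed
chunks_per_file = Counter(source_files.values())
for file_name, count in chunks_per_file.most_common():
    print(f"{count} chunk(s) from {file_name}")
```

On the live response object, the same tally could be built with `Counter(meta["file_name"] for meta in oai_response.metadata.values())`.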

#### Some other queries

In [69]:
oai_response2 = query_engine.query('Is there a summer school?')

In [86]:
oai_response2.response

"Yes, the Interdisciplinary Transformation University Austria (ITU) is hosting a Summer School in 2024, which has attracted over 200 applicants from 66 countries. The program emphasizes interdisciplinary collaboration and diverse academic backgrounds, with approximately 40 participants expected to be selected for this unique learning opportunity. The review committee is currently evaluating applications, and updates on the selection process will be provided as preparations continue. For more information, you can check the university's website."

In [82]:
oai_response3 = query_engine.query('Are they hiring?')

In [128]:
oai_response3.response

'Yes, the Interdisciplinary Transformation University Austria (ITU) is currently hiring. They have up to 12 postdoctoral positions available in the field of Computational X, as well as openings for a LMS Administrator – Full Stack Developer, Content Creator, Financial Controller, Project Controller, and Software Developer. Interested candidates can apply online and are encouraged to submit their applications, including a CV and cover letter. The application deadline for the postdoctoral positions is September 15th, 2024.'