### Analysing Olympics Data - Building a simple RAG

Building an Advanced Q&A RAG - Chain and Retriever

In [50]:
from langchain_groq import ChatGroq # Inference Engine
from langchain_community.document_loaders.csv_loader import CSVLoader # For loading CSV
from langchain.text_splitter import RecursiveCharacterTextSplitter # For chunking
from langchain_pinecone import PineconeVectorStore # To store the vector embeddings
from langchain_community.embeddings import OllamaEmbeddings # For vector embeddings
from langchain.chains.combine_documents import create_stuff_documents_chain # Type of chain
from langchain.prompts import ChatPromptTemplate # For prompts
from langchain.chains import create_retrieval_chain # Combining chain and retriever
from langchain.chains.summarize import load_summarize_chain # For summarizing
import os

Detecting the environment variables

In [22]:
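# Both keys must already be set: os.environ only accepts strings, so a missing key raises TypeError here
os.environ['GROQ_API_KEY'] = os.getenv('GROQ_API_KEY')
os.environ['PINECONE_API_KEY'] = os.getenv('PINECONE_API_KEY')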
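
If either key is missing, the assignment above fails with a `TypeError` that does not name the variable. A minimal sketch of a clearer check follows; `require_env` is a hypothetical helper (not part of LangChain), and `DEMO_API_KEY` is a made-up variable used only for the demonstration.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a readable error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration with a made-up variable name (not one of the real API keys).
os.environ["DEMO_API_KEY"] = "dummy-value"
demo_key = require_env("DEMO_API_KEY")
print(demo_key)
```

In the notebook, the same helper could be called once per key before any client is created, so a missing key is reported by name.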

Loading the Olympics data

In [5]:
data = CSVLoader('R:\\Projects\\Langchain\\olympics_rag\\medals_total.csv')
data

<langchain_community.document_loaders.csv_loader.CSVLoader at 0x1679ed317f0>

Defining Prompts

In [41]:
prompt = ChatPromptTemplate.from_template("""
                                          
                                          Answer the questions based only on the provided context.
                                          Give a precise answer.
                                          <context>
                                          {context} 
                                          </context>
                                          Question:{input}
                                          
                                          """)

In [6]:
# Despite its name, `loader` holds the loaded documents: one Document per CSV row
loader = data.load()

In [None]:
loader

In [10]:
len(loader)

80
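
The count of 80 reflects that the loader produces one document per CSV row, with each row rendered as `column: value` lines. A stdlib-only sketch of that behaviour is shown here; the two sample rows use values from the verbose output later in this notebook, and `rows_to_texts` is a hypothetical helper that only approximates what `CSVLoader` does.

```python
import csv
import io

# Two rows in the same column layout as the documents shown in this notebook.
sample_csv = """country_code,Gold Medal,Silver Medal,Bronze Medal,Total
USA,27,35,32,94
CHN,25,23,17,65
"""

def rows_to_texts(csv_text: str) -> list[str]:
    """Render each CSV row as 'column: value' lines, one text per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]

texts = rows_to_texts(sample_csv)
print(len(texts))  # one text per data row
print(texts[0])
```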

Performing Chunking

In [17]:
# Split the documents into chunks of at most 200 characters, with no overlap between chunks
chunks = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
split_data = chunks.split_documents(loader)

In [16]:
len(split_data)

80
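
The chunk count equals the document count because each row's text is already shorter than `chunk_size`, so the splitter has nothing to split. A quick check on the USA row, with the text copied from the verbose output later in this notebook and the limit taken from the `chunk_size` used above:

```python
# Text of the USA document, copied from the output shown in this notebook.
usa_row = (
    "country_code: USA\n"
    "Gold Medal: 27\n"
    "Silver Medal: 35\n"
    "Bronze Medal: 32\n"
    "Total: 94"
)

CHUNK_SIZE = 200  # same value passed to RecursiveCharacterTextSplitter

# A text no longer than chunk_size fits in a single chunk.
fits_in_one_chunk = len(usa_row) <= CHUNK_SIZE
print(len(usa_row), fits_in_one_chunk)
```

For row-per-document data like this, chunking is effectively a no-op; it matters once individual documents exceed the chunk size.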

Indexing with Pinecone

In [26]:
index_name = "langchain3"

Storing the Vector Embeddings in Pinecone Vector Database

In [27]:
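# Embed each chunk with Ollama's mxbai-embed-large model and store the vectors in the Pinecone index
vector = PineconeVectorStore.from_documents(split_data, OllamaEmbeddings(model='mxbai-embed-large'), index_name=index_name)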

Instantiating the LLM

In [32]:
groqllm = ChatGroq(model="llama3-70b-8192", temperature = 0)

Chain and Retrieval

In [35]:
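# "Stuff" chain: places all retrieved documents into the prompt's {context} slot (the name `docs` refers to this chain)
docs = create_stuff_documents_chain(groqllm, prompt)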

In [36]:
retrieval = vector.as_retriever()
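
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that ranking with a cosine metric, which is an assumption: this notebook does not state how the Pinecone index is configured. The vectors, keys and the `top_k` helper are all made up for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k stored keys whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda name: cosine_similarity(query, store[name]), reverse=True)
    return ranked[:k]

# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
toy_store = {
    "row_a": [1.0, 0.0, 0.0],
    "row_b": [0.7, 0.7, 0.0],
    "row_c": [0.0, 0.0, 1.0],
}
nearest = top_k([0.9, 0.1, 0.0], toy_store, k=2)
print(nearest)
```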

In [44]:
# Retrieve relevant chunks first, then pass them to the stuff chain to generate the answer
retrieval_chain = create_retrieval_chain(retrieval, docs)
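
The combined chain roughly does two things: fetch the chunks relevant to the question, then fill the prompt with them before calling the LLM. A plain-Python approximation of the second step follows. The blank-line separator between documents, the `stuff_prompt` helper and the choice of "retrieved" rows are assumptions for illustration, not LangChain's actual internals.

```python
# Same wording as the prompt defined in this notebook, as a plain string.
PROMPT = (
    "Answer the questions based only on the provided context.\n"
    "Give a precise answer.\n"
    "<context>\n{context}\n</context>\n"
    "Question:{input}"
)

def stuff_prompt(question: str, retrieved_texts: list[str]) -> str:
    """Join the retrieved texts and place them in the prompt's context slot."""
    context = "\n\n".join(retrieved_texts)
    return PROMPT.format(context=context, input=question)

# Illustrative rows only; which rows the real retriever returns is not shown in the notebook.
retrieved = [
    "country_code: USA\nGold Medal: 27\nSilver Medal: 35\nBronze Medal: 32\nTotal: 94",
    "country_code: CHN\nGold Medal: 25\nSilver Medal: 23\nBronze Medal: 17\nTotal: 65",
]
final_prompt = stuff_prompt("How many gold medals did the USA win?", retrieved)
print(final_prompt)
```

Because every retrieved document is placed in a single prompt, this style of chain suits small contexts such as a handful of short CSV rows.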

In [49]:
res = retrieval_chain.invoke({'input': 'How many gold medals did the USA win?'})
res['answer']

'According to the provided context, USA won 27 Gold Medals.'

Inference: The Olympics data was loaded and chunked, then embedded and stored in Pinecone, a cloud-hosted vector database. When queried, the system returned 27 gold medals for the USA, which matches the USA row in the dataset. At a later stage, this system could be developed into a chatbot assistant that answers questions about the current Olympics. Because only the retrieved rows are passed to the model for each question, the approach should be efficient.
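
The answer returned by the chain can be checked against the source data directly. A minimal sketch, assuming the CSV columns are named as in the documents shown in this notebook; the inline sample stands in for `medals_total.csv`, and `gold_for` is a hypothetical helper.

```python
import csv
import io

# Inline stand-in for the real CSV file, using values shown in this notebook.
sample_csv = """country_code,Gold Medal,Silver Medal,Bronze Medal,Total
USA,27,35,32,94
CHN,25,23,17,65
"""

def gold_for(csv_text: str, country_code: str) -> int:
    """Look up the gold-medal count for one country code."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["country_code"] == country_code:
            return int(row["Gold Medal"])
    raise KeyError(country_code)

answer = "According to the provided context, USA won 27 Gold Medals."
expected = gold_for(sample_csv, "USA")
answer_matches = str(expected) in answer
print(expected, answer_matches)
```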

Summarizing the Olympics Data

In [51]:
summarize = load_summarize_chain(groqllm, chain_type="map_reduce", verbose=True)
summarize.run(split_data)

> Entering new MapReduceDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:

"country_code: USA
Gold Medal: 27
Silver Medal: 35
Bronze Medal: 32
Total: 94"

CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:

"country_code: CHN
Gold Medal: 25
Silver Medal: 23
Bronze Medal: 17
Total: 65"

CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:

"country_code: AUS
Gold Medal: 18
Silver Medal: 12
Bronze Medal: 11
Total: 41"

CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:

"country_code: FRA
Gold Medal: 13
Silver Medal: 17
Bronze Medal: 21
Total: 51"

CONCISE SUMMARY:
Prompt after formatting:
Write a concise summary of the following:

"country_code: GBR
Gold Medal: 12
Silver Medal: 17
Bronze Medal: 20
Total: 49"

Token indices sequence length is longer than the specified maximum sequence length for this model (1653 > 1024). Running this sequence through the model will result in indexing errors

> Entering new LLMChain chain...
Prompt after formatting:
Write a concise summary of the following:

"The USA won a total of 94 Olympic medals, consisting of 27 gold, 35 silver, and 32 bronze.

China (CHN) won 65 Olympic medals: 25 gold, 23 silver, and 17 bronze.

Australia (AUS) has won a total of 41 Olympic medals: 18 gold, 12 silver, and 11 bronze.

France (FRA) won 51 Olympic medals: 13 gold, 17 silver, and 21 bronze.

Team GBR won 49 Olympic medals: 12 gold, 17 silver, and 20 bronze.

South Korea (KOR) won 27 Olympic medals: 12 gold, 8 silver, and 7 bronze.

Japan (JPN) won 31 Olympic medals: 12 gold, 6 silver, and 13 bronze.

Italy (ITA) won 27 Olympic medals: 9 gold, 10 silver, and 8 bronze.

The Netherlands (NED) won 20 Olympic medals: 9 gold, 5 silver, and 6 bronze.

Germany (GER) won 18 Olympic medals: 8 gold, 5 silver, and 5 bronze.

Canada won 19 Olympic medals: 6 gold, 4 silver, and 9 bronze.

Ireland (IRL) has won a total of 7 Olympic medals: 4 gol

'Here is a concise summary:\n\nThe USA topped the Olympic medal count with 94 medals (27 gold, 35 silver, 32 bronze). China came in second with 65 medals, followed by Australia with 41 medals. A total of 44 countries won Olympic medals, with 24 countries winning at least one gold medal.'

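Inference: The map_reduce chain summarized each row of the Olympics dataset individually and then combined those per-row summaries into a single overview, which correctly ranks the USA, China and Australia by the totals shown above. The summary's count of 44 medal-winning countries does not obviously match the 80 rows loaded, so aggregate figures like this should be verified against the CSV.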
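
The map_reduce pattern used above can be sketched without an LLM: a "map" step condenses each document on its own, and a "reduce" step merges the partial results. In the sketch below, `map_step` and `reduce_step` are hypothetical stand-ins for the two kinds of LLM call, and the row texts are copied from the verbose output above.

```python
def map_step(row_text: str) -> str:
    """Stand-in for the per-document LLM call: condense one row to one line."""
    fields = dict(line.split(": ", 1) for line in row_text.splitlines())
    return f"{fields['country_code']}: {fields['Total']} medals ({fields['Gold Medal']} gold)"

def reduce_step(partial_summaries: list[str]) -> str:
    """Stand-in for the final LLM call: combine the per-row summaries."""
    return "; ".join(partial_summaries)

rows = [
    "country_code: USA\nGold Medal: 27\nSilver Medal: 35\nBronze Medal: 32\nTotal: 94",
    "country_code: CHN\nGold Medal: 25\nSilver Medal: 23\nBronze Medal: 17\nTotal: 65",
]
partials = [map_step(row) for row in rows]
overall = reduce_step(partials)
print(overall)
```

With 80 documents, the map step means 80 separate model calls before the reduce step runs, which is the main cost of this chain type compared with stuffing everything into one prompt.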

#### Conclusion

Thus the Olympics dataset has been analysed: Q&A was performed with an advanced RAG pipeline (chain and retriever), and the data was summarized with a map_reduce summarization chain.