Loading Wikipedia documents with WikipediaLoader

In [None]:
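# In current LangChain releases the loader lives in langchain_community (it also needs the `wikipedia` package)
from langchain_community.document_loaders import WikipediaLoader
# Fetch up to 3 articles for the query, keeping at most 500 characters of each
wiki = WikipediaLoader(query="cricket", load_max_docs=3, doc_content_chars_max=500).load()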

In [9]:
len(wiki)

3

In [10]:
wiki[0]

Document(metadata={'title': 'Cricket', 'summary': 'Cricket is a bat-and-ball game played between two teams of eleven players on a field, at the centre of which is a 22-yard (20-metre; 66-foot) pitch with a wicket at each end, each comprising two bails (small sticks) balanced on three stumps. Two players from the batting team, the striker and nonstriker, stand in front of either wicket holding bats, while one player from the fielding team, the bowler, bowls the ball toward the striker\'s wicket from the opposite end of the pitch. The striker\'s goal is to hit the bowled ball with the bat and then switch places with the nonstriker, with the batting team scoring one run for each of these swaps. Runs are also scored when the ball reaches the boundary of the field or when the ball is bowled illegally.\nThe fielding team aims to prevent runs by dismissing batters (so they are "out"). Dismissal can occur in various ways, including being bowled (when the ball hits the striker\'s wicket and dis

Splitting

In [11]:
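from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of at most 180 characters; neighbouring chunks may share up to 20 characters
split = RecursiveCharacterTextSplitter(
    chunk_size=180,
    chunk_overlap=20,
    is_separator_regex=False,  # treat separators as literal strings, not regex patterns
)
chunk = split.split_documents(wiki)  # each chunk keeps the metadata of its source document
chunk[0].page_content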

'Cricket is a bat-and-ball game played between two teams of eleven players on a field, at the centre of which is a 22-yard (20-metre; 66-foot) pitch with a wicket at each end, each'

In [12]:
len(chunk)

12
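
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-width sliding-window chunker in plain Python. It is an illustration only, not how RecursiveCharacterTextSplitter is implemented: the real splitter prefers to break on separators such as paragraph breaks, newlines and spaces, so its chunks are often shorter than chunk_size and the chunk count can differ. The function name chunk_text and the sample string are assumptions made for this example.

```python
def chunk_text(text: str, chunk_size: int = 180, chunk_overlap: int = 20) -> list[str]:
    """Cut text into fixed-width windows; each window repeats the last
    `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical 500-character sample, the same length as doc_content_chars_max above
sample = "".join(chr(ord("a") + i % 26) for i in range(500))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means the tail of one chunk reappears at the head of the next, which helps when a sentence is cut at a chunk boundary.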

Embedding the chunks

In [13]:
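from langchain_huggingface import HuggingFaceEmbeddings

# Runs locally; the model weights are downloaded on first use
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

text = "Delhi is the capital of India"

# embed_query returns a plain list of floats (384 values for this model)
vector = embedding.embed_query(text)

print(vector)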

[0.043549545109272, 0.02387721836566925, -0.04524128511548042, 0.03540496155619621, -0.016651012003421783, -0.06554818898439407, 0.07626007497310638, 0.00994044914841652, -0.0019632368348538876, -0.027022695168852806, 0.007385569158941507, -0.12068236619234085, 0.06404842436313629, -0.06795038282871246, 0.03638887405395508, -0.07807772606611252, 0.03318418189883232, 0.0817556381225586, 0.07336150854825974, -0.07802224159240723, -0.02092118002474308, 0.03573280945420265, -0.008563278242945671, -0.03745512664318085, 0.0004388520901557058, 0.053464241325855255, 0.005293617025017738, -0.01687048189342022, -0.00041303757461719215, 0.0010301506845280528, 0.06669680029153824, 0.004223237279802561, -0.022522618994116783, -0.0021015736274421215, -0.055947814136743546, 0.016869986429810524, -0.1295161098241806, 0.06496334075927734, 0.17288090288639069, -0.11778350919485092, 0.03644102066755295, -0.0006774819921702147, 0.07786677032709122, -0.0281674861907959, 0.03655534237623215, -0.023698827251
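
Embedding vectors are useful because texts with similar meaning end up close together. A minimal sketch of one common closeness measure, cosine similarity, in plain Python follows. The three-dimensional toy vectors are made up for illustration (real vectors from this model have 384 values), and the vector store used later may rank by a different metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity with a zero vector is undefined; report 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values)
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

With real data, the same function could be applied to two results of embedding.embed_query to compare a question with a candidate passage.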

Vector store

In [14]:
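# In current LangChain releases FAISS lives in langchain_community (it also needs the faiss-cpu or faiss-gpu package)
from langchain_community.vectorstores import FAISS
# Embed every chunk and build an in-memory index
vector_store = FAISS.from_documents(chunk, embedding)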

In [19]:
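# "mmr" = maximal marginal relevance: balance relevance to the query against diversity; return 6 chunks
retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 6})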

In [20]:
question = "What are the top 3 facts about cricket?"
relevant_docs = retriever.invoke(question)
relevant_docs

[Document(id='b23b57d2-35a5-444f-ab3e-ed55f72c4627', metadata={'title': 'Glossary of cricket terms', 'summary': 'This is a general glossary of the terminology used in the sport of cricket. Where words in a sentence are also defined elsewhere in this article, they appear in italics. Certain aspects of cricket terminology are explained in more detail in cricket statistics and the naming of fielding positions is explained at fielding (cricket).\nCricket is known for its rich terminology. Some terms are often thought to be arcane and humorous by those not familiar with the game.', 'source': 'https://en.wikipedia.org/wiki/Glossary_of_cricket_terms'}, page_content='in italics. Certain aspects of cricket terminology are explained in more detail in cricket statistics and the naming of fielding positions is explained at fielding (cricket).'),
 Document(id='1615dd04-ada8-41b5-b07c-5df9a4a15f94', metadata={'title': 'Test cricket', 'summary': 'Test cricket is a format of the sport of cricket, cons
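
The retriever above uses search_type="mmr". As a rough sketch of the idea (not LangChain's or FAISS's actual code), maximal marginal relevance picks results one at a time, scoring each candidate by its similarity to the query minus its similarity to results already chosen. The helper names, the toy two-dimensional vectors, and the use of cosine similarity are assumptions for this example; lambda_mult mirrors the name of the LangChain search option, where 1.0 means pure relevance and lower values favour diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen greedily by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise candidates that resemble something already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Hypothetical embeddings: documents 0 and 1 are near-duplicates, document 2 is different
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
query = [1.0, 0.1]

print(mmr(query, docs, k=2, lambda_mult=0.5))  # diversity-aware selection
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

In this toy data the diversity-aware call skips the near-duplicate of its first pick, while the pure-relevance call returns both near-duplicates. That trade-off is why an MMR result list can mix passages from several articles.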

In [21]:
context_doc="\n\n".join(doc.page_content for doc in relevant_docs)
context_doc

'in italics. Certain aspects of cricket terminology are explained in more detail in cricket statistics and the naming of fielding positions is explained at fielding (cricket).\n\nthe sport with the long\n\nover a match that can last up to five days. It consists of four innings (two per team), with a minimum of ninety overs scheduled to be bowled per day, making it the sport with the\n\nTest cricket is a format of the sport of cricket, considered the game’s most prestigious and traditional form. Often referred to as the "ultimate test" of a cricketer\'s skill,\n\nat each end, each comprising two bails (small sticks) balanced on three stumps. Two players from the batting team, the striker and nonstriker, stand in front of either wicket\n\n== A ==\n\nAcross the line\nA sho'

In [22]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate(
    template=(
        "You are an expert on cricket. Answer the given question based ONLY on the context provided. "
        "DO NOT HALLUCINATE.\n"
        "Context: {context_doc}\nQuestion: {question}"
    ),
    input_variables=["context_doc", "question"],
)
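
Before sending anything to the model, it can help to look at the text the template produces. The sketch below uses plain str.format as a stand-in for the template's variable substitution, so it runs without LangChain installed. The TEMPLATE wording, the render_prompt helper, and the sample context are illustrative assumptions, not the notebook's own objects.

```python
# Stand-in template with the same two placeholders as the prompt above
TEMPLATE = (
    "You are an expert on cricket. Answer the question using ONLY the context provided. "
    "Do not hallucinate.\n"
    "Context: {context_doc}\n"
    "Question: {question}"
)


def render_prompt(context_doc: str, question: str) -> str:
    """Fill both placeholders and return the final prompt text."""
    return TEMPLATE.format(context_doc=context_doc, question=question)


rendered = render_prompt(
    context_doc="A Test match can last up to five days.",  # hypothetical context
    question="How long can a Test match last?",
)
print(rendered)
```

Printing the rendered prompt makes it easy to spot a missing space or newline between the instructions and the context.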



In [23]:
from langchain_core.output_parsers import StrOutputParser
parser=StrOutputParser()

In [25]:
import os

from langchain_google_genai import GoogleGenerativeAI

# Read the key from the environment rather than hard-coding it in the notebook
api_key = os.getenv("GOOGLE_API_KEY")

model=GoogleGenerativeAI(
    model="gemini-1.5-flash",
    google_api_key=api_key
)
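
os.getenv returns None when the variable is not set, and the resulting error may only surface later, when the model is first called. A small guard, sketched below, fails early with a clearer message. The helper name require_env is an assumption for this example and is not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage in the cell above (left commented so this sketch runs without a key):
# api_key = require_env("GOOGLE_API_KEY")
```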

In [26]:
# Pipe the steps together: fill the prompt, call the model, parse the reply to a string
chain = prompt | model | parser
res = chain.invoke({"context_doc": context_doc, "question": question})
print(res)

Based solely on the provided text, here are three facts about cricket:

1. Test cricket is considered the most prestigious and traditional form of the game.
2. A Test match can last up to five days.
3.  Each team plays two innings in a Test match.
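
The `|` operator in the chain feeds the output of each step into the next. The toy sketch below reproduces only that idea with a small wrapper class and a stubbed model, so it runs without an API key. It is not LangChain's Runnable implementation, and Step, fake_model and the other names are invented for this example.

```python
class Step:
    """Wrap a function so that steps can be chained with `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # (a | b).invoke(x) runs a first, then passes its result to b
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.fn(x)


format_prompt = Step(lambda d: f"Context: {d['context_doc']}\nQuestion: {d['question']}")
# Stub standing in for the model call: returns a dict-shaped "message"
fake_model = Step(lambda text: {"text": f"(stub answer; prompt had {len(text)} characters)"})
parse = Step(lambda message: message["text"])  # plays the role of the output parser

toy_chain = format_prompt | fake_model | parse
answer = toy_chain.invoke({"context_doc": "Test cricket lasts up to five days.",
                           "question": "How long is a Test match?"})
print(answer)
```

Swapping the stub for a real model call leaves the other steps unchanged, which is the main appeal of composing a pipeline this way.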

