In [11]:
%pip install -qU langchain langchain-community langchain-core
%pip install -qU langchain-google-genai langchain-huggingface
%pip install -qU datasets faiss-cpu tiktoken jq



In [2]:
import getpass
import os

os.environ["GOOGLE_API_KEY"] = 'AIzaSyCErEZJor01nXLjGLwTdXcqheBUkBwldnU'

In [3]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)


In [20]:
# load the dataset; each row's 'Content' field holds the text of one Wikipedia article
from datasets import load_dataset

ds = load_dataset("Binaryy/tourism-wikipedia", cache_dir='wikipedia')
texts = [sample['Content'] for sample in ds['train']]
print(texts[0])

Tourism is travel for pleasure or business, and the commercial activity of providing and supporting such travel.  The World Tourism Organization defines tourism more generally, in terms which go "beyond the common perception of tourism as being limited to holiday activity only", as people "travelling to and staying in places outside their usual environment for not more than one consecutive year for leisure and not less than 24 hours, business and other purposes". Tourism can be domestic (within the traveller's own country) or international, and international tourism has both incoming and outgoing implications on a country's balance of payments.
Tourism numbers declined as a result of a strong economic slowdown (the late-2000s recession) between the second half of 2008 and the end of 2009, and in consequence of the outbreak of the 2009 H1N1 influenza virus, but slowly recovered until the COVID-19 pandemic put an abrupt end to the growth. The United Nations World Tourism Organization est

In [14]:
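# embedding model: all-MiniLM-L6-v2 maps each text to a 384-dimensional vector
# and truncates inputs longer than 256 word pieces, so long articles should be chunked first
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")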



In [32]:
# vector store
from langchain_text_splitters import TokenTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Wrap each article in a Document, skipping rows whose content is not a string (e.g. None)
documents = [
    Document(page_content=text, metadata={"source": "huggingface"})
    for text in texts
    if isinstance(text, str)
]
print(len(documents))

# Split every article into 128-token chunks with a 32-token overlap before indexing
doc_creator = TokenTextSplitter(chunk_size=128, chunk_overlap=32)
chunks = doc_creator.split_documents(documents)

db = FAISS.from_documents(chunks, embeddings)

Tourism is travel fo
This is a bibliograp
Tourism – travel for
Aburi Botanical Gard
Allotments in the to
A destination manage
Dynamic packaging is
A geomorphosite, or 
Heli hiking is a rec
Heritage commodifica
Tourism impacts tour
Infinity des Lumière
The International As
International touris
In the study of tour
Maurice-Mollard Plaz
Overtourism is the c
A souvenir  (from Fr
A souvenir spoon is 
Terminal tourism ref
A tour operator is a
Tour-realism (T.R.) 
The term Tourism 4.0
Tourism Improvement 
A tourist sign, ofte
A tourist tax is any
Touristification is 
Touron is a derogato
Travel is the moveme
Travel technology (a
Travelers' diarrhea 
A welcome sign (or g
According to the Wor
Sítio Morrinhos ("Mo
The world's busiest 
The following active
The New York metropo
The following active
The following active
The following is a l
The following active
The following active
The following active
The following active
The following active
The following is a l
The following active
This is a lis
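
The `chunk_size=128, chunk_overlap=32` settings above describe a sliding window over tokens. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `chunk_tokens` is a hypothetical helper, and the made-up token list stands in for the tiktoken tokens a real splitter would produce.

```python
def chunk_tokens(tokens, chunk_size=128, chunk_overlap=32):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Illustrative input: 300 placeholder tokens (an assumption, not real tokenizer output)
tokens = [f"t{i}" for i in range(300)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With these settings each window starts 96 tokens after the previous one, so consecutive chunks share 32 tokens and a sentence cut at a chunk boundary still appears whole in one of the two neighbours.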

In [51]:
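# retriever
# "mmr" (maximal marginal relevance) balances relevance to the query against
# diversity among the returned documents; k is the number of documents to return
retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3},
)
retriever.invoke("space_tourism")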

[Document(metadata={'source': 'huggingface'}, page_content='Space Tourists is a feature-length documentary of the Swiss director Christian Frei. The film had its premiere at the Zurich Film Festival in 2009 and has won the "World Cinema Directing Award" at the Sundance Film Festival in 2010.')]
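
The `search_type="mmr"` option used above selects documents by maximal marginal relevance. The following is a minimal pure-Python sketch of that selection rule, assuming cosine similarity and an equal weighting (`lambda_mult=0.5`) between relevance and diversity. The function names and toy vectors are illustrative assumptions, not the FAISS or LangChain implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise similarity to anything that has already been picked
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D "embeddings": documents 0 and 1 are near-duplicates, document 2 is different
query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.12], [0.5, -0.5]]
print(mmr(query, docs, k=2))                   # diversity-aware choice
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With equal weighting the second pick skips the near-duplicate in favour of the less similar document, while `lambda_mult=1.0` reduces to a plain similarity ranking.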

In [52]:
# building whole pipeline
from langchain import hub
from langchain_core.runnables import RunnablePassthrough

prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | parser  # StrOutputParser instance created alongside the LLM
)
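
The chain above pipes a question through retrieval, prompt formatting, the model, and output parsing. The sketch below traces the same data flow with plain Python functions so each step is visible. Everything except `format_docs` is a hypothetical stand-in: `fake_retrieve`, `fake_llm`, `TEMPLATE`, and the `Doc` class are assumptions for illustration, and the template is not the actual text of `rlm/rag-prompt`.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str


def fake_retrieve(question):
    # Stand-in for `retriever`: always returns the same two passages
    return [
        Doc("The Red Sea is an inlet of the Indian Ocean between Africa and Asia."),
        Doc("It is known for shallow shelves rich in marine life and corals."),
    ]


def format_docs(docs):
    # Same helper as in the notebook: join passages with blank lines
    return "\n\n".join(doc.page_content for doc in docs)


# Illustrative template only, not the hub prompt
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def build_prompt(question):
    # Mirrors the dict step: the question feeds both the retriever and the prompt
    inputs = {"context": format_docs(fake_retrieve(question)), "question": question}
    return TEMPLATE.format(**inputs)


def fake_llm(prompt_text):
    # Stand-in for the chat model: reports what it received instead of answering
    return f"[model reply to a {len(prompt_text)}-character prompt]"


def run_chain(question):
    return fake_llm(build_prompt(question))


print(build_prompt("Explain Red Sea"))
print(run_chain("Explain Red Sea"))
```

The point of the dict step is that a single input string fans out into two prompt variables: `context` comes from retrieval plus `format_docs`, and `question` is passed through unchanged.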




In [61]:
response=rag_chain.invoke("Explain Red Sea")
print(response)

The Red Sea is an inlet of the Indian Ocean located between Africa and Asia, connected in the south by the Bab el Mandeb strait and the Gulf of Aden. It is known for its extensive shallow shelves rich in marine life and corals, with over 1,000 invertebrate species and 200 types of coral. The Red Sea is considered the world's northernmost tropical sea. 

