# Retrieval Augmented Generation
This notebook extracts data from Wikipedia or from a custom folder, stores it in a vector database (Chroma or FAISS),
and uses an OpenAI chat model to answer questions about the topic!

## Pre-process data

### Use Wikipedia as data source
In this example you will download data from Wikipedia and use it to build the knowledge base.

In [None]:
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


In [None]:
# Get data from Wikipedia
search_term = "Stanley Kubrick"
#Choose how many documents we want to load
docs = WikipediaLoader(query=search_term, load_max_docs=1).load()
print(docs)

In [None]:
#Split documents into chunks
#set up the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100, #how many chars will be in a single chunk
    chunk_overlap = 20, #how many chars we want to overlap between chunks
    is_separator_regex = False
)
#split data
data = text_splitter.split_documents(docs)
data
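
To build some intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and spaces, which this sketch ignores. The function name `sliding_window_chunks` is made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed chunk starts with the last 4 characters of the previous one, which is what the overlap setting is meant to achieve: context that straddles a boundary shows up in both chunks.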

### Use documents from a custom folder as Data Source
In this example you will use the documents from the "docs" folder and create chunks from those.

In [None]:
from langchain.document_loaders import (
    PyPDFium2Loader,
    TextLoader,
    UnstructuredMarkdownLoader,
)
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
#Useful functions
def get_file_type(uploaded_item, supported_extensions=("pdf", "md", "txt")):
    # Determine if uploaded_item is a file-like object or a string (path)
    if hasattr(uploaded_item, "name"):
        # It's a file-like object, extract the file name
        file_name = uploaded_item.name
    elif isinstance(uploaded_item, str):
        # It's a string path, use it directly
        file_name = uploaded_item
    else:
        raise ValueError("Expected a file path or a file-like object with a 'name' attribute")

    print("Filename: ", file_name)  # Debug print for file name

    # Extract the file extension (lowercased, surrounding whitespace removed)
    extension = file_name.split(".")[-1].lower().strip()
    print("Extension: ", extension)  # Debug print for extension

    # Check if the extension is in the supported list
    if extension in supported_extensions:
        return extension
    else:
        raise ValueError("File format not supported")

#function for chunking files
def chunk_file(file_path, chunk_size=100, chunk_overlap=20):
    try:
        # Check that the file has a supported format
        extension = get_file_type(file_path)

        match extension:
            case "pdf":
                loader = PyPDFium2Loader(file_path)
            case "md":
                loader = UnstructuredMarkdownLoader(file_path)
            case "txt":
                loader = TextLoader(file_path, encoding='utf-8')
            case _:
                raise ValueError("File format not supported")

        pages = loader.load()
        n_pages = len(pages)  # number of documents (pages) returned by the loader
        print("Number of pages:", n_pages)

        # Split the text into chunks using RecursiveCharacterTextSplitter
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
        )
        chunks = text_splitter.split_documents(pages)
        print(f"Number of chunks: {len(chunks)}")

        return chunks

    except ValueError as e:
        # Return an empty list so callers can always extend their results
        print(f"Error: {e}")
        return []

In [None]:
#Collecting files from directory
docs_path = "docs/data"
folder_files = [
    os.path.join(docs_path, f)
    for f in os.listdir(docs_path)
    if f.endswith((".pdf", ".txt", ".md"))
]
#Create chunks and store them in "data"
data = []
for idx, file_path in enumerate(folder_files):
    print(f"Processing file {idx + 1}/{len(folder_files)}: {file_path}")
    chunks = chunk_file(file_path, chunk_size=100, chunk_overlap=20)
    if chunks:  # chunk_file may return nothing when a file cannot be processed
        data.extend(chunks)
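
The file collection step can also be written with `pathlib`. This is only a sketch under a couple of assumptions: the helper name `find_supported_files` and the constant `SUPPORTED_SUFFIXES` are made up for illustration, and unlike the cell above it matches extensions case-insensitively and skips sub-folders. The demo runs against a temporary folder rather than the real "docs" folder.

```python
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".md", ".txt"}  # same formats handled by chunk_file


def find_supported_files(folder: str) -> list[str]:
    """Return the sorted paths of supported files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Demo on a throwaway folder with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("notes.txt", "paper.PDF", "image.png"):
        Path(tmp, name).write_text("placeholder")
    found = [Path(p).name for p in find_supported_files(tmp)]

print(found)
```

Sorting the paths keeps the processing order stable between runs, which helps when chunk IDs are derived from the position of each chunk.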

## Store chunks into a VectorDB

### Store data in ChromaDB
We will embed our chunks with an OpenAI embedding model and store them in a persistent Chroma collection.

In [None]:
from langchain_community.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings


In [None]:
#set apikey and embedding model from openai
apikey = "<OPENAI APIKEY>"
embedding_model = "text-embedding-3-small"
#Set OpenAI Embedder
embeddings = OpenAIEmbeddings(
    model=embedding_model, openai_api_key=apikey
)
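
Instead of pasting the key into the notebook, you may prefer to read it from an environment variable. The sketch below assumes the key is exported as `OPENAI_API_KEY`; the helper name `read_api_key` is made up for illustration.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise a clear error if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return key


# Usage (assumes the variable is set in the environment that launched the notebook):
# apikey = read_api_key()
```

This keeps the secret out of the saved notebook, so the file can be shared or committed without leaking the key.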

In [None]:
#Let's create a persistent Chroma collection
db_directory = "db/chroma/data"
store = Chroma.from_documents(
    data,
    embeddings,
    ids=[f"{item.metadata['source']}-{index}" for index, item in enumerate(data)],
    collection_name="CollectionName",
    persist_directory=db_directory,
)
store.persist()

### Store data with FAISS
Alternatively, we can store our chunks in a FAISS index saved to local disk.

In [None]:
from langchain_community.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings


In [None]:
#set apikey and embedding model from openai
apikey = "<OPENAI APIKEY>"
embedding_model = "text-embedding-3-small"
#Set OpenAI Embedder
embeddings = OpenAIEmbeddings(
    model=embedding_model, openai_api_key=apikey
)

In [None]:
#Build the FAISS index and save it to disk
db_directory = "db/faiss/data"
store = FAISS.from_documents(data, embeddings)
store.save_local(db_directory, index_name="data")
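
What does a vector store do at query time? Roughly, it compares the query embedding with the stored chunk embeddings and returns the closest ones. Here is a toy sketch with hand-written 3-dimensional vectors and cosine similarity. These are assumptions for illustration only: real embeddings have many more dimensions, and Chroma and FAISS use optimized indexes and may use a different distance metric (for example Euclidean distance).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "index": (chunk text, made-up embedding)
toy_index = [
    ("chunk about films", [0.9, 0.1, 0.0]),
    ("chunk about chess", [0.1, 0.9, 0.0]),
    ("chunk about photography", [0.7, 0.0, 0.3]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```

The query vector points mostly along the first dimension, so the two chunks with the largest first component come back first.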

## Asking Questions to the Virtual Assistant!
Let's use OpenAI to answer our questions using the information retrieved from the vector store (Chroma or FAISS)!

In [None]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
import pprint

In [None]:
#Customize the general prompt template as needed
template = """
                You are a Virtual Assistant that answers questions using only the context provided. 
                If there are multiple answers list them. Answer using the language of the question
                {context}
                Question: {question}
                """
#Set up the prompt
prompt = PromptTemplate(template = template, input_variables=["context", "question"])

In [None]:
#Set up the LLM
llm = ChatOpenAI(temperature=0.8, model="gpt-4o", openai_api_key=apikey)

In [None]:
#Let's create our question/answer model passing the llm, chromaDB (or faiss) and the prompt.
qa_with_source = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt":prompt},
    return_source_documents=True
)
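
The "stuff" chain type simply puts all retrieved chunks into the `{context}` slot of the prompt before calling the LLM. Here is a rough sketch of that step in plain Python, with no LangChain involved. The helper name `build_stuffed_prompt` is made up for illustration, and the blank-line separator between chunks is an assumption about the default joining behaviour.

```python
TEMPLATE = """You are a Virtual Assistant that answers questions using only the context provided.
If there are multiple answers list them. Answer using the language of the question
{context}
Question: {question}"""


def build_stuffed_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and fill them into the prompt template."""
    context = "\n\n".join(chunks)  # all chunks go into a single context block
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for retriever output
example_prompt = build_stuffed_prompt(["chunk A", "chunk B"], "Who?")
print(example_prompt)
```

Because every retrieved chunk ends up in one prompt, the number and size of chunks must fit in the model's context window.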

In [None]:
pprint.pprint(qa_with_source.invoke({"query": "Your question!"}))
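
With `return_source_documents=True` the chain returns a dictionary with the keys `query`, `result` and `source_documents`. The sketch below shows one way to turn it into a readable summary. It uses a hand-built response so it runs without an API key: `FakeDocument`, `summarize_response`, the answer text and the source path are all placeholders made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_response(response: dict, preview_chars: int = 60) -> str:
    """Format the answer followed by one line per source chunk."""
    lines = [f"Answer: {response['result']}", "Sources:"]
    for doc in response.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"- {source}: {preview}")
    return "\n".join(lines)


# Hand-built response with placeholder values, mimicking the chain's output shape
fake_response = {
    "query": "Your question!",
    "result": "A placeholder answer.",
    "source_documents": [FakeDocument("Some retrieved text.", {"source": "docs/data/example.txt"})],
}
print(summarize_response(fake_response))
```

Printing the sources next to the answer makes it easier to check that the model really answered from the retrieved context.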