<a href="https://colab.research.google.com/github/davinfalahtama/AI-Dikti/blob/main/playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install optimum
!pip install accelerate
!pip install auto-gptq
!pip install PyPDF2
!pip install langchain
!pip install langchain_google_genai
!pip install faiss-gpu
!pip install langchain_openai
!pip install langchain_community

In [None]:
import requests
import torch
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import transformers

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

In [None]:
torch.cuda.mem_get_info()

In [None]:
model_name= "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

config = AutoConfig.from_pretrained(model_name)
config.quantization_config["disable_exllama"] = True

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             config=config,
                                             revision="main")
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
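from langchain.llms import HuggingFacePipeline  # used below, before the LangChain import cell runs

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.2,
    max_new_tokens=50,
)
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)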

In [33]:
answer

'<s>[INST] What is AI [/INST]\n\nArtificial Intelligence (AI) refers to the development of computer systems or machines that can perform tasks that typically require human intelligence. These tasks include learning and adapting to new information, understanding natural language, recognizing patterns, solving problems, and making decisions.\n\nAI systems can be classified based on their ability to learn and adapt. The most basic form of AI is rule-based systems, which follow a set of predefined rules to make decisions. More advanced forms of AI include machine'

In [ ]:


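# Mistral-Instruct prompt format; the question matches the sample output shown above
prompt = "What is AI"
prompt_template = f"<s>[INST] {prompt} [/INST]"

print("*** Pipeline:")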
pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])

# Langchain Setup

In [None]:
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import google.generativeai as genai
from langchain.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.runnables import RunnablePassthrough
from langchain.llms import HuggingFacePipeline
import pandas as pd
from langchain_openai import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables import ConfigurableFieldSpec
from langchain_openai import ChatOpenAI

**Setup LLM**

In [None]:
import os

# API keys are read from environment variables; never hardcode secrets in a notebook.
#OPEN-AI
model = ChatOpenAI(temperature=0.5, openai_api_key=os.environ["OPENAI_API_KEY"])
embedding = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
#GOOGLE-GEMINI
# model = ChatGoogleGenerativeAI(model="gemini-pro",
#                                temperature=0.5,
#                                convert_system_message_to_human=True,
#                                google_api_key=os.environ["GOOGLE_API_KEY"])
# embedding = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=os.environ["GOOGLE_API_KEY"])

## Read File
The two functions below extract text from CSV and PDF files.

**Libraries used:**
- Pandas: reads the CSV file and returns the text in its `facts` column.
- PyPDF2.PdfReader: reads the PDF file and extracts the text of every page.

In [None]:
def extract_text_from_csv(file):
    """
    Function to extract text data from a CSV file.

    Args:
    - file (str): Path to the CSV file.

    Returns:
    - str: Concatenated text data from the specified column ('facts') in the CSV file.
    """
    df = pd.read_csv(file)
    return ' '.join(df['facts'])

def extract_text_from_pdf(file):
    """
    Function to extract text data from a PDF file.

    Args:
    - file (str): Path to the PDF file.

    Returns:
    - str: Extracted text data from all pages of the PDF file.
    """
    pdf_text = ""
    pdf_reader = PdfReader(file) 
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()
    return pdf_text

# Example usage: note that `file` holds the extracted text, not a file path
file = extract_text_from_pdf("docs/MTA023401.pdf")

## Split Text into Chunks

- The **RecursiveCharacterTextSplitter** splits a large text into chunks of at most a specified size, trying separators such as paragraph and line breaks first so that related text stays together.
- Chunking divides the document into smaller, more manageable sections that fit comfortably within the context window of the large language model.

More details can be found in the following links:
- [Understanding LangChain's RecursiveCharacterTextSplitter](https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846)
- [Langchain Documentation](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter)
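
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is a simplified fixed-size sliding window, not LangChain's recursive algorithm, and the function name `chunk_with_overlap` is a hypothetical helper introduced only for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(sample_chunks)
```

Each chunk begins with the last three characters of the previous one, which is how the overlap keeps context that would otherwise be cut at a chunk boundary.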

In [None]:
def split_text_into_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return text_splitter.split_text(text)

splited_text = split_text_into_chunks(file)

## Vector Store & Retriever
- A vector store is a specialized database designed to store and manage vector embeddings.
- A retriever is an interface that returns documents given an unstructured query.
- **FAISS.from_texts** takes two mandatory parameters:
   - *texts*: a list of strings
   - *embedding*: the embedding model used to turn each text into an embedding vector

More details can be found in the following links:
- [LangChain in Chains: Vector Stores](https://ai.plainenglish.io/langchain-in-chains-16-vector-stores-94ff578c1aee)
- [LangChain in Chains: Retrievers](https://ai.plainenglish.io/langchain-in-chains-17-retrievers-1c252917f68f)
- [Langchain Documentation](https://python.langchain.com/docs/integrations/vectorstores/faiss) 
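
The idea behind a similarity retriever can be sketched without any external library. The example below is illustrative only: the three-dimensional vectors are made-up stand-ins for real embeddings, `toy_store` and `retrieve` are hypothetical names, and cosine similarity is used for simplicity, whereas the actual FAISS index may rank by a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical store: text -> made-up embedding vector (assumption, not real model output)
toy_store = {
    "Batik is a traditional Indonesian textile.": [0.9, 0.1, 0.0],
    "Gamelan is a traditional music ensemble.": [0.1, 0.9, 0.0],
    "Rendang is a slow-cooked beef dish.": [0.0, 0.1, 0.9],
}


def retrieve(query_vector, store, k=1):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


# A made-up query embedding that points mostly along the first axis
query_vector = [0.8, 0.2, 0.1]
top_match = retrieve(query_vector, toy_store, k=1)
print(top_match)
```

In the notebook, the embedding model produces the vectors and FAISS performs the nearest-neighbour search; the retriever simply wraps that search behind a query-in, documents-out interface.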

In [None]:
def create_vector_store(text_chunks):
    vector_store = FAISS.from_texts(texts=text_chunks, embedding=embedding)
    return vector_store

vector_store = create_vector_store(splited_text)
retriever = vector_store.as_retriever(search_type="similarity")

## Contextualizing the question
- Define a sub-chain that takes the chat history and the latest user question, and reformulates the question if it refers to anything in that history.
- **create_history_aware_retriever** creates a chain that takes the conversation history and returns documents.
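
The control flow of a history-aware retriever can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `fake_reformulate` and `fake_retriever` are hypothetical stand-ins for the LLM call and the vector-store retriever.

```python
def fake_reformulate(chat_history, question):
    """Stand-in for the LLM that rewrites the question into a standalone one."""
    last_user_turn = chat_history[-1]
    return f"{question} (in the context of: {last_user_turn})"


def fake_retriever(query):
    """Stand-in for the vector-store retriever."""
    return [f"document matching '{query}'"]


def history_aware_retrieve(inputs):
    """With no history, retrieve on the raw question; otherwise reformulate first."""
    if not inputs.get("chat_history"):
        return fake_retriever(inputs["input"])
    standalone = fake_reformulate(inputs["chat_history"], inputs["input"])
    return fake_retriever(standalone)


no_history = history_aware_retrieve({"input": "Who are you", "chat_history": []})
with_history = history_aware_retrieve(
    {"input": "who create it", "chat_history": ["Who are you"]}
)
print(no_history)
print(with_history)
```

The first call skips the reformulation step entirely, so the model is only consulted when there is earlier conversation that the question might depend on.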

In [None]:
from langchain.chains import create_history_aware_retriever, create_retrieval_chain

def contextualize_system_prompt():
    # Adjacent string literals avoid the stray indentation that backslash
    # continuations inside a triple-quoted string would embed in the prompt.
    contextualize_q_system_prompt = (
        "Given a chat history and the latest user question "
        "which might reference context in the chat history, formulate a standalone question "
        "which can be understood without the chat history. Do NOT answer the question; "
        "just reformulate it if needed and otherwise return it as is. "
        "If there is no chat history, return the question unchanged."
    )
    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    history_aware_retriever = create_history_aware_retriever(model, retriever, contextualize_q_prompt)
    return history_aware_retriever

history_aware_retriever = contextualize_system_prompt()

## Chain with chat history

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def question_answer():
    ### Answer question ###
    prompt_template = (
        "You are a friendly chatbot named Bogu that helps answer questions about Indonesian culture.\n"
        "Answer the question in as much detail as possible from the provided context.\n"
        "If the question is about code, answer that you don't know the answer.\n\n"
        "Context:\n{context}\n"
    )
    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", prompt_template),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    question_answer_chain = create_stuff_documents_chain(model, qa_prompt)
    rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

    return rag_chain

conversational_rag_chain = question_answer()

In [None]:
from langchain_core.messages import AIMessage, HumanMessage
### Statefully manage chat history ###
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

In [None]:
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

In [None]:
conversational_rag_chain = RunnableWithMessageHistory(
        conversational_rag_chain,
        get_session_history,
        input_messages_key="input",
        history_messages_key="chat_history",
        output_messages_key="answer",
)

In [None]:
conversational_rag_chain.invoke(
    {"input": "Who are you"},
    config={"configurable": {"session_id": "abc123"}},
)

In [None]:
conversational_rag_chain.invoke(
    {"input": "who create it"},
    config={
        "configurable": {"session_id": "abc123"}
    }, 
)

In [10]:
a = store.get('111')
a.messages

[HumanMessage(content='Who are you'),
 AIMessage(content='I am Bogu, a friendly chatbot here to help answer questions about Indonesian Culture. Feel free to ask me anything related to Indonesian culture, traditions, or any other related topics!')]

In [11]:
from langchain_core.messages import AIMessage, HumanMessage  # required by the isinstance checks below

for message in a.messages:
    if isinstance(message, HumanMessage):
        print(f"User: {message.content}")
    elif isinstance(message, AIMessage):
        print(f"Assistant: {message.content}")

In [1]:
from chat_models.openai import ChatOpenAi 

In [8]:
store

{'111': ChatMessageHistory(messages=[HumanMessage(content='Who are you'), AIMessage(content='I am Bogu, a friendly chatbot here to help answer questions about Indonesian Culture. Feel free to ask me anything related to Indonesian culture, traditions, or any other related topics!')])}

In [2]:
store = {}
app = ChatOpenAi(store)

In [3]:
raw_text = app.process_files("docs/MTA023401.pdf")

In [4]:
text_chunks = app.split_text_into_chunks(raw_text)
app.create_vector_store(text_chunks)

In [5]:
app.contextualize_system_prompt()

RunnableBinding(bound=RunnableBranch(branches=[(RunnableLambda(lambda x: not x.get('chat_history', False)), RunnableLambda(lambda x: x['input'])
| VectorStoreRetriever(tags=['FAISS', 'GoogleGenerativeAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001D3801757C0>))], default=ChatPromptTemplate(input_variables=['chat_history', 'input'], input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='Given a chat history and the latest user question             which might reference context in the chat history, formulate a standalone question             which can be understood without the chat history. Do NOT

In [6]:
app.get_conversational_chain()

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableBranch(branches=[(RunnableLambda(lambda x: not x.get('chat_history', False)), RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['FAISS', 'GoogleGenerativeAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001D3801757C0>))], default=ChatPromptTemplate(input_variables=['chat_history', 'input'], input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template='Given a chat history and the latest user question             which might reference context in the chat history, formulate a standalone question 

In [7]:
app.run_invoke("Who are you","111")

{'input': 'Who are you',
 'chat_history': [],
 'context': [Document(page_content='b. Pengumpulan Data Sekunder \nData sekunder berupa literature /teori-teori dan informasi-\ninformasi yang berkaitan dengan tujuan penelitian, dan akan \ndikaitkan dengan hasil dari lapangan dengan tujuan untuk \nmenganalisis data. Adapun literature  yang di peroleh dalam \npenelitian ini berupa buku-buku, jurnal, tesis, disertasi, peraturan \npemerintah, dan website .  \n \n3. Tahap Pembahasan \nDalam menganalisis data maka peneliti harus menyesuaikan \ndata yang didapatkan dengan permasalahan dan tujuan dari penelitian \nagar kemudian dibahas. Metode analisis merupakan salah satu cara \nuntuk mencapai tujuan dari penelitian. Metode analisis data menurut \nPaton dalam (Moleong L. J., 2007) adalah proses mengatur urutan \ndata, mengorganisasikan nya dalam suatu pola, kategori, dan satuan \nuraian dasar.  \nPendekatan strukturalisme dan penelusuran sejarah dalam \nmelakukan proses pembahasan dapat diungkap

In [None]:
app.store