# **Chatbot RAG Sample Implementation**
The chatbot training data is already divided into three columns and draws on historical searches, technical documentation, and customer service response data.

**Data Overview:**
*   **Context** - String
*   **Question** - String
*   **Answer** - String




This notebook uses the ***paraphrase-MiniLM-L6-v2*** model from the ***sentence-transformers*** library.

In [1]:
import pandas as pd

In [2]:
# Load the CSV file
file_path = "https://raw.githubusercontent.com/frfusch21/Manufacture-Otomotive-Analysis/refs/heads/main/Data/RAG_Chatbot_Training_Data.csv"
df = pd.read_csv(file_path)

In [3]:
df.head()

|   | context | question | answer |
|---|---------|----------|--------|
| 0 | Manual pengguna motor X, halaman 13 | Bagaimana cara mengganti oli mesin? | Ganti oli mesin dilakukan dengan membuka baut ... |
| 1 | Panduan teknis motor Y, halaman 61 | Berapa tekanan angin ban yang disarankan? | Tekanan angin ban depan 29 psi dan belakang 33... |
| 2 | Buku servis berkala Z, halaman 12 | Kapan waktu servis berkala harus dilakukan? | Servis berkala dilakukan setiap 3.000 km atau ... |
| 3 | Modul pelatihan mekanik, halaman 93 | Apa arti kode error E01? | E01 menunjukkan adanya gangguan pada sensor su... |
| 4 | Dokumen teknis pabrikan, halaman 79 | Bagaimana prosedur pengecekan busi? | Pengecekan busi dilakukan dengan mencabut busi... |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   context   1000 non-null   object
 1   question  1000 non-null   object
 2   answer    1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB


In [None]:
!pip install sentence-transformers

In [6]:
from sentence_transformers import SentenceTransformer, util

In [7]:
# Clean data and drop duplicates
df["question_clean"] = df["question"].str.lower().str.strip()
df_cleaned = df.drop_duplicates(subset="question_clean", keep="first").reset_index(drop=True)

In [None]:
# Load sentence-transformer model
model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

In [9]:
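# Encode every question in df (not df_cleaned), so that embedding row i
# lines up with df.iloc[i] when an answer is looked up later
corpus_embeddings = model.encode(df["question_clean"].tolist(), convert_to_tensor=True)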

**Cosine Similarity**


We implement semantic search using cosine similarity. Sentence embeddings convert both the user query and the existing questions into vectors. Comparing these vectors with cosine similarity identifies the most semantically similar question in the dataset, and the answer associated with that question is returned.

If you are not familiar with Cosine Similarity, please read here: [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

Also, if you are not familiar with embeddings, please read here: [Word Embedding](https://en.wikipedia.org/wiki/Word_embedding)
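
As a minimal illustration of the comparison described above, the sketch below computes cosine similarity for small hand-made vectors in plain Python. The helper name `cosine_similarity` and the toy vectors are made up for illustration and are not part of the notebook; real sentence embeddings from the model have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
query = [1.0, 2.0, 0.0]
similar = [2.0, 4.0, 0.0]      # points in the same direction as query
unrelated = [0.0, 0.0, 3.0]    # orthogonal to query

print(cosine_similarity(query, similar))    # close to 1.0
print(cosine_similarity(query, unrelated))  # 0.0
```

Vectors that point in the same direction score near 1 regardless of their length, which is why the measure suits comparing sentences of different lengths.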

In [23]:
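# Fetch the answer of the stored question most similar to the user query
def get_answer(user_query):
    # Normalize the query the same way as the stored questions
    query_embedding = model.encode(user_query.lower().strip(), convert_to_tensor=True)
    # Cosine similarity between the query and every stored question
    similarity_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    # Keep only the single best match; its index is a row position in df
    best_score, best_idx = similarity_scores.topk(k=1)
    idx = int(best_idx[0])
    context = df.iloc[idx]["context"]
    answer = df.iloc[idx]["answer"]
    question = df.iloc[idx]["question"]

    # Indonesian template: "From {context} for the question {question}. It states {answer}"
    response = f"- Dari {context} untuk pertanyaan {question}. Dinyatakan {answer}"
    return response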
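
The function above returns only the single best match. A minimal sketch of ranking several candidates instead, written in plain Python with made-up scores: `top_k_matches` is a hypothetical helper, not part of the notebook. With the tensor used in the notebook, a comparable ranking should be available from `similarity_scores.topk(k=3)`.

```python
import heapq

def top_k_matches(scores, k=3):
    """Return (index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Hypothetical similarity scores for five stored questions
scores = [0.12, 0.87, 0.45, 0.91, 0.30]

for idx, score in top_k_matches(scores, k=3):
    print(idx, score)
```

Returning a few candidates with their scores makes it easier to see when the best match is only marginally better than the alternatives.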

In [26]:
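# Run the chatbot loop; typing 'keluar' ("exit") ends the session
print("Hi disini chatbot! Ketik 'keluar' untuk mengakhiri sesi ini.")
while True:
    user_input = input("\n\nPertanyaan: ")
    # Ignore surrounding whitespace and letter case when checking for the exit word
    if user_input.strip().lower() == "keluar":
        print("Chatbot: Sampai jumpa!")
        break
    response = get_answer(user_input)
    print("Jawaban Chatbot:\n" + response)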

Hi disini chatbot! Ketik 'keluar' untuk mengakhiri sesi ini.


Pertanyaan: kapan kita harus servis?
Jawaban Chatbot:
- Dari Buku servis berkala Z, halaman 12 untuk pertanyaan Kapan waktu servis berkala harus dilakukan?. Dinyatakan Servis berkala dilakukan setiap 3.000 km atau sesuai buku manual.


Pertanyaan: kenapa mesin cepat panas?
Jawaban Chatbot:
- Dari Dokumentasi teknis suspensi, halaman 26 untuk pertanyaan Apa penyebab mesin cepat panas?. Dinyatakan Mesin panas bisa karena oli kurang, sistem pendingin bermasalah, atau beban berlebih.


Pertanyaan: apa itu sistem injeksi?
Jawaban Chatbot:
- Dari Dokumentasi teknis suspensi, halaman 78 untuk pertanyaan Apa itu sistem injeksi bahan bakar?. Dinyatakan Sistem injeksi mengontrol bahan bakar secara elektronik untuk efisiensi.


Pertanyaan: apa arti kode error E01?
Jawaban Chatbot:
- Dari Modul pelatihan mekanik, halaman 93 untuk pertanyaan Apa arti kode error E01?. Dinyatakan E01 menunjukkan adanya gangguan pada sensor suhu mesin.



# Conclusion and Suggestions

*   Limited variety makes it hard to match paraphrased or unexpected queries. The dataset should be more varied, especially with paraphrased questions and a wider range of intents.
*   Multiple rows have the same or near-identical questions and answers because I replicated entries myself (my knowledge of the automotive domain is limited). In the future, these should be deduplicated or grouped into similar entries.
*   This is only a sample attempt at RAG: it is not connected to a language model (e.g., GPT, LLaMA) that generates responses from the retrieved context, so the result is static retrieval with no actual "generation" or reasoning. In the future, it should be connected to a language model, preferably via an API, to return better results. (That version would have to be private, since an API key is involved.)
*   If no match is close enough, it still returns something, possibly irrelevant. For example, asking how to check the "Van Belt" or "Knalpot" (exhaust) still returns an answer even though neither is covered in the dataset. Add a similarity threshold and a fallback message such as "Maaf, saya tidak yakin." ("Sorry, I am not sure.").
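
The similarity threshold suggested in the last point could look roughly like the sketch below. It is written in plain Python over a list of scores so it runs on its own. The helper `answer_or_fallback`, the placeholder answers, and the threshold value of 0.6 are all assumptions for illustration; a suitable threshold would have to be tuned against real queries.

```python
FALLBACK_MESSAGE = "Maaf, saya tidak yakin."
SIMILARITY_THRESHOLD = 0.6  # assumed value, not taken from the notebook

def answer_or_fallback(scores, answers, threshold=SIMILARITY_THRESHOLD):
    """Return the best-scoring answer, or the fallback if its score is below threshold."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_idx] < threshold:
        return FALLBACK_MESSAGE
    return answers[best_idx]

# Placeholder answers and hypothetical similarity scores
answers = ["answer A", "answer B", "answer C"]

print(answer_or_fallback([0.20, 0.85, 0.40], answers))  # confident match: answer B
print(answer_or_fallback([0.10, 0.35, 0.25], answers))  # weak match: fallback message
```

In the notebook, the same check would compare `best_score` from `get_answer` against the threshold before building the response.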
