# **Week 8 Assignment -> RAG Q&A chatbot using document retrieval and generative AI for intelligent response generation (can use any light model from Hugging Face or a licensed LLM (OpenAI, Claude, Grok, Gemini) if free credits are available)**

## **By-> Arnav Chopra**
## **CT_CSI_DS_4264**

## **Note: This program does not contain any API Keys.**

### **To run this code locally, supply your own API keys**
The author's keys are kept in a `.env` file that is listed in `.gitignore`.
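
A minimal sketch of how keys could be loaded from a local `.env` file instead of being hard-coded, using only the standard library. The helper name `load_env_file` and the `KEY=value` file layout are assumptions for illustration; the notebook does not show how its `.env` file is read, and a package such as `python-dotenv` would serve the same purpose.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Read KEY=value pairs from a file into os.environ (hypothetical helper)."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded  # nothing to load; the caller decides how to handle this
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip('"').strip("'")
        # Keep any value that is already set in the environment
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded
```

Calling `load_env_file()` before `os.getenv("GROQ_API_KEY")` would make the keys available to the rest of the notebook without them ever appearing in a cell.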

In [29]:
import pandas as pd
import numpy as np
import os
from typing import Dict, List, Tuple
import json
import warnings
warnings.filterwarnings('ignore')

!pip install sentence-transformers pinecone

from sentence_transformers import SentenceTransformer
import pinecone
from pinecone import Pinecone, ServerlessSpec



In [None]:
import os

# Placeholders only: supply your own keys (for example from a local .env file)
# and never commit real credentials. setdefault keeps keys that are already set.
os.environ.setdefault("GROQ_API_KEY", "<YOUR_GROQ_API_KEY>")
os.environ.setdefault("PINECONE_API_KEY", "<YOUR_PINECONE_API_KEY>")

## **Data Ingestion and Preprocessing**

In [30]:
df = pd.read_csv("https://raw.githubusercontent.com/ac-26/CSI-25/refs/heads/main/Loan%20Prediction%20Dataset/Training%20Dataset.csv")

In [31]:
df.shape

(614, 13)

In [32]:
df.columns.tolist()

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status']

In [33]:
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [34]:
df.isnull().sum()

Unnamed: 0,0
Loan_ID,0
Gender,13
Married,3
Dependents,15
Education,0
Self_Employed,32
ApplicantIncome,0
CoapplicantIncome,0
LoanAmount,22
Loan_Amount_Term,14


### **Handling missing values in a generic way**

In [35]:
for col in df.columns:
    if df[col].dtype == 'object':
        # Categorical column: fill with the most frequent value
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        # Numeric column: fill with the median
        df[col] = df[col].fillna(df[col].median())
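
As a toy illustration of the rule used in this cell (mode for text columns, median for numeric ones), here is a sketch in plain Python on made-up values rather than the notebook's DataFrame. The function name `impute` and the sample lists are hypothetical.

```python
from statistics import median, mode


def impute(values):
    """Fill None entries: mode for strings, median for numbers (toy sketch)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to learn a fill value from
    if all(isinstance(v, str) for v in present):
        fill = mode(present)      # most frequent category
    else:
        fill = median(present)    # middle value, less sensitive to outliers
    return [fill if v is None else v for v in values]


gender = impute(["Male", None, "Female", "Male"])
loan_amount = impute([128.0, None, 66.0, 120.0, 141.0])
```

The median is used for numeric columns because a few very large incomes or loan amounts would pull a mean-based fill value upward.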

In [36]:
df.isnull().sum()

Unnamed: 0,0
Loan_ID,0
Gender,0
Married,0
Dependents,0
Education,0
Self_Employed,0
ApplicantIncome,0
CoapplicantIncome,0
LoanAmount,0
Loan_Amount_Term,0


### **Creating text records from the columns to serve as context for our chatbot**

In [37]:
text_records = []
for idx, row in df.iterrows():
    text = f"Loan ID: {row['Loan_ID']}, "
    text += f"Gender: {row['Gender']}, "
    text += f"Married: {row['Married']}, "
    text += f"Income: {row['ApplicantIncome']}, "
    text += f"Loan Amount: {row['LoanAmount']}, "
    text += f"Credit History: {row['Credit_History']}, "
    text += f"Status: {row['Loan_Status']}"
    text_records.append(text)

In [38]:
text_records[0]

'Loan ID: LP001002, Gender: Male, Married: No, Income: 5849, Loan Amount: 128.0, Credit History: 1.0, Status: Y'

### **We need more detailed text records so the LLM can understand the semantic meaning clearly**

In [39]:
detailed_texts = []

for idx, row in df.iterrows():
    text = f"Loan Application {row['Loan_ID']}: "
    text += f"A {'married' if row['Married']=='Yes' else 'single'} {row['Gender'].lower()} "
    text += f"with {row['Dependents']} dependents applied for a loan. "
    text += f"The applicant is {'a graduate' if row['Education']=='Graduate' else 'not a graduate'} "
    text += f"and {'self-employed' if row['Self_Employed']=='Yes' else 'works for an employer'}. "
    text += f"Monthly income is ${row['ApplicantIncome']} with co-applicant income of ${row['CoapplicantIncome']}. "
    text += f"Requested loan amount: ${row['LoanAmount']*1000} for {row['Loan_Amount_Term']} months. "
    text += f"Credit history: {'Good' if row['Credit_History']==1 else 'Poor'}. "
    text += f"Property in {row['Property_Area']} area. "
    text += f"Loan was {'APPROVED' if row['Loan_Status']=='Y' else 'REJECTED'}."
    detailed_texts.append(text)

In [40]:
detailed_texts[0]

'Loan Application LP001002: A single male with 0 dependents applied for a loan. The applicant is a graduate and works for an employer. Monthly income is $5849 with co-applicant income of $0.0. Requested loan amount: $128000.0 for 360.0 months. Credit history: Good. Property in Urban area. Loan was APPROVED.'

In [41]:
len(detailed_texts)

614

### **Creating chunks from the detailed texts we made**

In [42]:
chunks = []

for i, text in enumerate(detailed_texts):
    chunk = {
        'id': f'chunk_{i}',
        'text': text,
        'metadata': {
            'source': 'loan_dataset',
            'chunk_index': i
        }
    }
    chunks.append(chunk)

In [43]:
len(chunks)

614

In [44]:
chunks[0]

{'id': 'chunk_0',
 'text': 'Loan Application LP001002: A single male with 0 dependents applied for a loan. The applicant is a graduate and works for an employer. Monthly income is $5849 with co-applicant income of $0.0. Requested loan amount: $128000.0 for 360.0 months. Credit history: Good. Property in Urban area. Loan was APPROVED.',
 'metadata': {'source': 'loan_dataset', 'chunk_index': 0}}

### **Generating embeddings and storing them in a Pinecone vector database**

In [45]:
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
pc = Pinecone(api_key=PINECONE_API_KEY)

index_name = "loan-rag-index"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

index = pc.Index(index_name)
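
The index is created with `metric='cosine'`, so matches are ranked by the cosine similarity between a query vector and each stored vector. For reference, the standard definition is given below; the symbols $q$ (query embedding) and $d$ (stored chunk embedding) are notation introduced here, with both vectors of dimension 384 to match `dimension=384`.

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
= \frac{\sum_{i=1}^{384} q_i d_i}{\sqrt{\sum_{i=1}^{384} q_i^2}\;\sqrt{\sum_{i=1}^{384} d_i^2}}
$$

A value close to 1 means the two texts point in nearly the same direction in embedding space, regardless of vector length.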

In [46]:
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [chunk['text'] for chunk in chunks]
embeddings = embedding_model.encode(texts)

In [47]:
embeddings.shape

(614, 384)

In [48]:
batch_size = 100

for i in range(0, len(chunks), batch_size):
    batch_chunks = chunks[i:i+batch_size]
    batch_embeddings = embeddings[i:i+batch_size]

    vectors = []

    for j, (chunk, embedding) in enumerate(zip(batch_chunks, batch_embeddings)):
        vectors.append({
            "id": chunk['id'],
            "values": embedding.tolist(),
            "metadata": {
                **chunk['metadata'],
                "text": chunk['text']
            }
        })

    index.upsert(vectors=vectors)

print(f"Uploaded {len(chunks)} vectors to Pinecone")

Uploaded 614 vectors to Pinecone
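
The upload loop slices the 614 chunks into groups of at most 100 before each `upsert`. The slicing on its own can be sketched as a small generator; `batched` is a hypothetical helper name, and the list of integers stands in for the real chunks.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 614 placeholder items split into batches of 100
batch_lengths = [len(batch) for batch in batched(list(range(614)), 100)]
```

Here the last batch is shorter than the others, which is why the loop relies on slicing rather than assuming every batch is full.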


### **Creating RAG chatbot**

### **We will use Groq, as it is free to use**

In [49]:
!pip install groq

from groq import Groq

GROQ_API_KEY = os.getenv("GROQ_API_KEY")
groq_client = Groq(api_key=GROQ_API_KEY)



### **RAG Query Function**

In [50]:
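def query_rag(question, top_k=3):
    # Encode a single string so the result is a 1-D vector of length 384
    question_embedding = embedding_model.encode(question)

    # Pinecone expects a flat list of floats for `vector`
    search_results = index.query(
        vector=question_embedding.tolist(),
        top_k=top_k,
        include_metadata=True
    )

    # Collect the stored chunk text from each match
    contexts = []
    for match in search_results['matches']:
        contexts.append(match['metadata']['text'])

    return contexts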
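
Conceptually, `index.query` returns the `top_k` stored vectors most similar to the question embedding. The sketch below imitates that ranking in memory with plain Python on three made-up 2-D vectors. It is not how Pinecone is implemented, and `cosine`, `top_k_matches`, and `toy_store` are hypothetical names.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_matches(query, store, top_k=3):
    """Return the ids of the `top_k` vectors in `store` closest to `query`."""
    scored = ((cosine(query, vec), vec_id) for vec_id, vec in store.items())
    return [vec_id for _, vec_id in heapq.nlargest(top_k, scored)]


toy_store = {
    "chunk_0": [1.0, 0.0],
    "chunk_1": [0.0, 1.0],
    "chunk_2": [0.9, 0.1],
}
nearest = top_k_matches([1.0, 0.05], toy_store, top_k=2)
```

With this toy data the query lies closest to `chunk_0`, then `chunk_2`, so those two ids are returned in that order.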

### **Chatbot function**

In [51]:
def chatbot(question):
    contexts = query_rag(question, top_k=5)

    if not contexts:
        return "I couldn't find any relevant loan data to answer your question. Please try rephrasing or ask about loan approvals, income, credit history, or other loan-related topics."

    context_text = "\n\n".join(contexts)
    prompt = f"""You are a helpful loan advisor assistant. Answer the question based ONLY on the loan application data provided below. If the data doesn't contain enough information to answer the question, say so clearly.

Loan Application Data:
{context_text}

Question: {question}

Important:
- Only use information from the provided data
- If you cannot answer from the given data, say "Based on the available data, I cannot provide a complete answer"
- Do not be specific about which loan applications you're referencing. Just say

Answer: """

    try:
        response = groq_client.chat.completions.create(
            model="llama3-8b-8192",
            messages=[
                {"role": "system", "content": "You are a helpful loan advisor. Answer questions based ONLY on the provided loan data. Do not make assumptions beyond what's in the data."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=300
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Sorry, I encountered an error: {str(e)}"
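
Prompt assembly can be separated from the API call so it can be checked without any network access. The following is a simplified sketch under that assumption; `build_prompt` is a hypothetical helper, its wording is shorter than the prompt used in `chatbot`, and the sample context is an abbreviated version of the first record.

```python
def build_prompt(contexts, question):
    """Assemble a grounded prompt; returns None when nothing was retrieved."""
    if not contexts:
        return None
    context_text = "\n\n".join(contexts)
    return (
        "Answer the question based ONLY on the loan application data below.\n\n"
        f"Loan Application Data:\n{context_text}\n\n"
        f"Question: {question}\n\nAnswer: "
    )


example = build_prompt(
    ["Loan Application LP001002: Credit history: Good. Loan was APPROVED."],
    "Was LP001002 approved?",
)
```

Returning `None` for an empty context list mirrors the early return in `chatbot`, where no model call is made if retrieval finds nothing.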

## **Testing**

In [53]:
test_questions = [
    "What's the approval rate for married people?",
    "How does credit history affect loan approval?",
    "What income level typically gets approved?",
    "What about car loans?"
]

for question in test_questions:
    print(f"\nQ: {question}")
    answer = chatbot(question)
    print(f"A: {answer}")
    print("-" * 50)


Q: What's the approval rate for married people?
A: Based on the available data, I can confirm that all loan applications for married individuals were APPROVED. Therefore, the approval rate for married people is 100%.
--------------------------------------------------

Q: How does credit history affect loan approval?
A: Based on the available data, credit history does not affect loan approval. All loan applications with a credit history of "Good" were APPROVED.
--------------------------------------------------

Q: What income level typically gets approved?
A: Based on the available data, the income level that typically gets approved is above $2479 (Loan Application LP001699) and below $3182 (Loan Application LP002888).
--------------------------------------------------

Q: What about car loans?
A: Based on the available data, I cannot provide a complete answer. The data only contains information about mortgage loan applications and does not mention car loans.
--------------------------------------------------
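
Because `chatbot` passes only the five retrieved records to the model, answers to aggregate questions such as an approval rate reflect those five records rather than all 614 rows. As a hedged illustration of the difference, the sketch below computes a rate directly over rows instead of over retrieved text. It uses only the five rows printed by `df.head()`, so the resulting figure is not the dataset-wide rate; `approval_rate` and `head_rows` are hypothetical names.

```python
# (Married, Loan_Status) pairs copied from the five rows printed by df.head()
head_rows = [
    ("No", "Y"),   # LP001002
    ("Yes", "N"),  # LP001003
    ("Yes", "Y"),  # LP001005
    ("Yes", "Y"),  # LP001006
    ("No", "Y"),   # LP001008
]


def approval_rate(rows, married):
    """Share of approved loans among rows with the given marital status."""
    statuses = [status for m, status in rows if m == married]
    if not statuses:
        return None  # no matching applicants
    return sum(status == "Y" for status in statuses) / len(statuses)


married_rate = approval_rate(head_rows, "Yes")  # 2 of 3 approved in this sample
```

On the full DataFrame the same calculation could be done with pandas, for example by grouping on `Married`, and the result could then be handed to the model as context.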