# AI Coding Assistant: A Hybrid RAG Chatbot

This notebook demonstrates a hybrid chatbot that works as a programming assistant. The system uses a **Retrieval-Augmented Generation (RAG)** architecture: it answers from a store of known questions when it can and generates a fresh answer when it cannot.

### Core Architecture:
-   **Retriever:** An embedding model (described here as `CodeBERT`; the initialization log shows `multi-qa-mpnet-base-dot-v1` being loaded) matches the question against a pre-computed vector database and returns the stored answer for common questions.
-   **Generator:** A generative model (`microsoft/phi-2`) writes a new answer for novel or complex questions.
-   **Hybrid Logic:** The system retrieves a stored answer when the best similarity score meets a threshold and generates a new answer otherwise.

This demo will showcase the chatbot's ability to handle both known and unknown questions.
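
As a rough illustration of the hybrid logic described above, the sketch below picks the best-matching stored answer and falls back to generation when the score is under a threshold. It is not the `HybridChatbot` implementation: `cosine_similarity`, `answer_question`, the toy vectors, and the `0.6` threshold are all assumptions made for this example, and the real system may score matches differently (for instance with a dot product).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_question(query_vec, database, generate_fn, threshold=0.6):
    """Return (answer, source, best_score).

    database: list of (embedding, stored_answer) pairs.
    generate_fn: zero-argument callable standing in for the generative model.
    threshold: hypothetical cut-off; the notebook does not state the real value.
    """
    best_score, best_answer = -1.0, None
    for stored_vec, stored_answer in database:
        score = cosine_similarity(query_vec, stored_vec)
        if score > best_score:
            best_score, best_answer = score, stored_answer

    # Retrieve only when the closest stored question is similar enough.
    if best_answer is not None and best_score >= threshold:
        return best_answer, "retrieved", best_score
    # Otherwise fall back to generation; generate_fn is only called on this path.
    return generate_fn(), "generated", best_score


# Toy 3-dimensional "embeddings" purely for illustration.
database = [
    ([1.0, 0.0, 0.0], "Use pandas.read_csv(path)."),
    ([0.0, 1.0, 0.0], "function sum($a, $b) { return $a + $b; }"),
]

known = answer_question([0.9, 0.1, 0.0], database, lambda: "freshly generated text")
novel = answer_question([0.2, 0.1, 1.0], database, lambda: "freshly generated text")
print(known[1], round(known[2], 3))
print(novel[1], round(novel[2], 3))
```

Passing the generator as a callable keeps the expensive model call off the retrieval path. In the run recorded in this notebook, a score of 0.4366 led to generation and a score of 0.7575 led to retrieval, so the threshold actually in use lies somewhere between those two values.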

In [1]:
# Import Logic and Initialize the Chatbot System

import sys
import os
import logging

# Add the 'src' folder to our Python path to import the custom modules
module_path = os.path.abspath(os.path.join(os.getcwd(), 'src'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our main HybridChatbot class from the chatbot_logic.py file
from chatbot_logic import HybridChatbot

# Configure logging for this session
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)-8s - %(message)s', force=True)

# --- Initialize the Chatbot ---
# This single line creates an instance of our chatbot.
# The __init__ method inside the class will handle loading both models and the vector DB.
# This might take a moment, especially if the models are large.
print("Initializing the chatbot system... This may take a moment.")

try:
    chatbot = HybridChatbot()
    print("\n Chatbot initialized successfully and is ready to answer questions.")
except Exception as e:
    print(f"\n An error occurred during initialization: {e}")
    print("Please ensure your models are in the 'model_cache' folder and the 'vector_database.pkl' file exists.")

2025-08-05 12:57:15,267 - INFO     - Initializing the Hybrid Chatbot...
2025-08-05 12:57:15,271 - INFO     - Vector database loaded from 'database/vector_database.pkl'
2025-08-05 12:57:15,271 - INFO     - Attempting to load local embedding model from: d:\SCIT\SmartAssistant\AI-Coding-Assistant\model_cache\multi-qa-mpnet-base-dot-v1
2025-08-05 12:57:15,280 - INFO     - Use pytorch device_name: cpu
2025-08-05 12:57:15,281 - INFO     - Load pretrained SentenceTransformer: d:\SCIT\SmartAssistant\AI-Coding-Assistant\model_cache\multi-qa-mpnet-base-dot-v1


Initializing the chatbot system... This may take a moment.


2025-08-05 12:57:17,743 - INFO     - High-performance local embedding model loaded successfully.
2025-08-05 12:57:17,743 - INFO     - Attempting to load local generative model from: d:\SCIT\SmartAssistant\AI-Coding-Assistant\model_cache\phi-2
2025-08-05 12:57:17,743 - INFO     - Using device: cpu
2025-08-05 12:57:27,720 - INFO     - Local generative model and tokenizer loaded successfully.
2025-08-05 12:57:27,721 - INFO     - Hybrid Chatbot initialized successfully.



 Chatbot initialized successfully and is ready to answer questions.


## 3. Test the Hybrid Chatbot

We now test the chatbot with two questions, one for each path of the hybrid system. The similarity score from the embedding search decides which path a question takes, so the search is expected to separate questions already in the database from new ones.

- **Test Case 1:** A common question that is likely already in our database to test the **Retrieval** mechanism.
- **Test Case 2:** A more unique, technical question that is not in our database to test the **Generative** (Phi-2) mechanism.
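
To make the expectation in each test case checkable rather than read by eye, one option is a small helper that compares the reported source against the expected path. The helper `check_routing` and the `StubChatbot` class are hypothetical and exist only so the example runs on its own; the only interface assumed from the notebook is that `get_answer(question)` returns an `(answer, source)` pair. The stub returns the two source labels that appear in this notebook's recorded output.

```python
def check_routing(chatbot, cases):
    """Compare the path each question took against the path we expected.

    cases: list of (question, expected_keyword) pairs, where expected_keyword
    is a word we expect to find in the source label, e.g. "Retrieved".
    """
    results = []
    for question, expected in cases:
        _answer, source = chatbot.get_answer(question)
        results.append({
            "question": question,
            "expected": expected,
            "source": source,
            "matched": expected.lower() in source.lower(),
        })
    return results


class StubChatbot:
    """Hypothetical stand-in that mimics the routing seen in the recorded run."""

    def get_answer(self, question):
        if "php" in question.lower():
            return "stored answer", "Retrieved from Database (High Confidence)"
        return "generated answer", "Generated by Phi-2 (Low Confidence Match)"


cases = [
    ("how to read csv file in pandas with python?", "Retrieved"),
    ("tell me how to write php function for summing tow numbers ?", "Generated"),
]

results = check_routing(StubChatbot(), cases)
for row in results:
    status = "OK" if row["matched"] else "MISMATCH"
    print(f"{status}: expected {row['expected']}, got {row['source']}")
```

With the real `chatbot` object, `check_routing(chatbot, cases)` would flag runs like the one recorded in this notebook, where the pandas question was generated and the PHP question was retrieved.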

In [2]:
# Cell 4: Run test cases with the powerful CodeBERT embeddings

# Define a list of questions to ask the chatbot
test_questions = [
    # Test Case 1: A common question, likely in the database to test RETRIEVAL
    "how to read csv file in pandas with python?",
    
    # Test Case 2: A more specific question, likely NOT in the database to test GENERATION
    # With CodeBERT, the similarity score for this should be lower and trigger Phi-2
    "tell me how to write php function for summing tow numbers ?"
]

print("--- Starting Chatbot Test with CodeBERT Embeddings ---")
logging.info(f"Testing the chatbot with {len(test_questions)} questions.")

# Loop through each test question
for i, question in enumerate(test_questions, start=1):
    print("\n\n" + "="*50)
    print(f"QUERY #{i}: \"{question}\"")
    print("="*50)
    
    # Get the answer from our chatbot
    answer, source = chatbot.get_answer(question)
    
    # Print the results
    print(f"Source of Answer: {source}")
    print(f"\nANSWER:\n{answer}")
    print("="*50)

logging.info("Chatbot test complete.")

2025-08-05 12:57:27,732 - INFO     - Testing the chatbot with 2 questions.
2025-08-05 12:57:27,734 - INFO     - Received new question: 'how to read csv file in pandas with python?'


--- Starting Chatbot Test with CodeBERT Embeddings ---


QUERY #1: "how to read csv file in pandas with python?"


Batches: 100%|██████████| 1/1 [00:00<00:00,  5.04it/s]
2025-08-05 12:57:27,943 - INFO     - Highest similarity score found: 0.4366
2025-08-05 12:57:27,943 - INFO     - Generating a new answer with Phi-2.
2025-08-05 12:58:09,502 - INFO     - Received new question: 'tell me how to write php function for summing tow numbers ?'


Source of Answer: Generated by Phi-2 (Low Confidence Match)

ANSWER:
how to read csv file in pandas with python?

I have a csv file with the following structure:

I want to read this file in pandas and get the following dataframe:

I tried the following code:
import pandas as pd
df = pd.read_csv('file.csv', delimiter=',')

But I get the following error:

ValueError: Unable to parse column names from line 1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32


QUERY #2: "tell me how to write php function for summing tow numbers ?"


Batches: 100%|██████████| 1/1 [00:00<00:00, 21.08it/s]
2025-08-05 12:58:09,556 - INFO     - Highest similarity score found: 0.7575
2025-08-05 12:58:09,556 - INFO     - Retrieving stored answer.
2025-08-05 12:58:09,557 - INFO     - Chatbot test complete.


Source of Answer: Retrieved from Database (High Confidence)

ANSWER:
Here's a simple PHP script that accepts two numbers as command-line arguments and returns their sum:

```php
<?php

if ($argc < 3) {
    echo "Usage: php sum.php number1 number2\n";
    exit(1);
}

$num1 = $argv[1];
$num2 = $argv[2];

if (!is_numeric($num1) || !is_numeric($num2)) {
    echo "Both arguments must be numbers.\n";
    exit(1);
}

$sum = $num1 + $num2;
echo "The sum of {$num1} and {$num2} is {$sum}.\n";

?>
```

To run this script, save it to a file called `sum.php` and execute it from the command line with two numbers as arguments, like so:

```
php sum.php 3 5
```

This script checks for the correct number of arguments, validates that they are both
