# 1. Project Overview
The primary goal of this project is to build an efficient and user-friendly chatbot that can handle various queries about RGU, ranging from visa requirements for international students to general campus information. By integrating advanced Natural Language Processing (NLP) techniques, we have developed a system capable of understanding and responding to user inquiries with high accuracy.

## Skills and Technologies Used
1. Natural Language Processing (NLP): We used the `sentence-transformers` library to generate text embeddings and compare them. This allows the chatbot to match user queries with relevant documents by meaning rather than by exact keywords.

2. Vector Stores: The project builds and manages a simple vector store that holds document chunks, their embeddings, and their source metadata. This keeps retrieval at query time fast and easy to reason about.

3. OpenAI GPT-4: We integrated a model from OpenAI’s GPT-4 family (the code calls `gpt-4o-mini`) to generate answers from the retrieved documents, so that responses are coherent and grounded in RGU content.

4. Error Handling and Validation: Validation checks and error handling let the chatbot handle unexpected inputs gracefully instead of failing.

5. Environment Configuration: We used environment variables, loaded from a `.env` file, to handle API keys and other sensitive information securely.

## Evolution of AI in Chatbots
The field of AI and NLP has seen significant advancements in recent years. Modern chatbots are no longer limited to keyword-based responses but can understand context and nuance, thanks to models like GPT-4. These models are trained on vast amounts of data, enabling them to generate responses that are not only accurate but also human-like.

This project applies these advances. It combines a pre-trained language model with sentence embeddings of RGU web pages, so that answers are grounded in retrieved university content rather than in the model's general knowledge alone.

# 2. Relevant Libraries
These imports cover the tasks the notebook performs: making HTTP requests, parsing HTML and XML, splitting text into chunks, generating embeddings, comparing them by similarity, and calling the OpenAI API.

```python
import os
import json
import openai  # this notebook uses the pre-1.0 interface (openai.ChatCompletion, openai.embeddings_utils)
import requests
import numpy as np
import xmltodict
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from openai.embeddings_utils import get_embedding
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import CharacterTextSplitter
from sklearn.metrics.pairwise import cosine_similarity
```

1. **`import os`**
   - **Purpose:** Provides a way to interact with the operating system. It allows you to perform operations such as reading or writing to the file system, handling environment variables, and more.
   - **Common Use Cases:** Accessing environment variables, manipulating file paths, and executing system commands.

2. **`import openai`**
   - **Purpose:** This is the official OpenAI Python client library for interacting with OpenAI's API services.
   - **Common Use Cases:** Making API calls to access OpenAI's models for tasks such as text generation, completion, or embeddings.

3. **`import requests`**
   - **Purpose:** A popular library for making HTTP requests in Python. It simplifies sending HTTP requests and handling responses.
   - **Common Use Cases:** Fetching data from APIs or web services, downloading files, and handling HTTP responses.

4. **`import numpy as np`**
   - **Purpose:** A fundamental library for numerical computations in Python. It provides support for large multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays.
   - **Common Use Cases:** Handling arrays, performing mathematical operations, and working with large datasets.

5. **`import xmltodict`**
   - **Purpose:** A library for converting XML data into Python dictionaries, making it easier to work with XML data.
   - **Common Use Cases:** Parsing XML files or responses and converting them into a more accessible dictionary format for further processing.

6. **`from bs4 import BeautifulSoup`**
   - **Purpose:** BeautifulSoup is a library for parsing HTML and XML documents. It provides methods for navigating and modifying the parse tree.
   - **Common Use Cases:** Web scraping, extracting data from HTML or XML, and cleaning up data for further analysis.

7. **`from dotenv import load_dotenv`**
   - **Purpose:** Loads environment variables from a `.env` file into the environment. This is useful for managing configuration settings and sensitive information.
   - **Common Use Cases:** Loading API keys, database credentials, or other environment-specific configurations.

8. **`from openai.embeddings_utils import get_embedding`**
   - **Purpose:** Provides utilities for working with embeddings generated by OpenAI's models. The `get_embedding` function retrieves embeddings for text inputs.
   - **Common Use Cases:** Generating and using text embeddings for natural language processing tasks, such as similarity analysis or text classification.

9. **`from sentence_transformers import SentenceTransformer`**
   - **Purpose:** A library for generating sentence embeddings using pre-trained models. It helps in obtaining dense vector representations of sentences or documents.
   - **Common Use Cases:** Semantic textual similarity, clustering, and retrieval tasks by converting sentences into vectors.

10. **`from langchain.text_splitter import CharacterTextSplitter`**
    - **Purpose:** Provides a utility for splitting text into smaller chunks based on character length. This is useful for processing large texts by breaking them into manageable parts.
    - **Common Use Cases:** Text preprocessing, document chunking for analysis, and handling long texts in NLP tasks.

11. **`from sklearn.metrics.pairwise import cosine_similarity`**
    - **Purpose:** Provides functions to compute pairwise similarity measures between vectors, specifically cosine similarity.
    - **Common Use Cases:** Measuring similarity between text embeddings or feature vectors, clustering, and information retrieval.

### Setting Up the OpenAI API Key

```python
# load environment variables from .env file
load_dotenv()

# Set up OpenAI API key
api_key = os.getenv("OPENAI_API_KEY")
openai.api_key = api_key
```

- **Purpose:** `load_dotenv()` loads environment variables from a `.env` file into the process environment.
- **Explanation:** The `.env` file typically contains configuration settings and sensitive information such as API keys, which should not be hardcoded into your source code. Loading them from `.env` keeps configuration separate from code and keeps secrets out of the source code repository, provided the `.env` file itself is not committed.

- **Purpose:** The remaining two lines set the API key for the OpenAI client library.
- **Explanation:**
  - `os.getenv("OPENAI_API_KEY")`: Retrieves the value of the environment variable named `OPENAI_API_KEY`, which `load_dotenv()` made available. It returns `None` if the variable is not set.
  - `openai.api_key = api_key`: Assigns the key to the `api_key` attribute of the `openai` module, so that all subsequent calls to OpenAI’s API authenticate with it.

> By keeping sensitive information in environment variables and not hardcoding them into your source code, you enhance the security and maintainability of your application.
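
Because `os.getenv` returns `None` when a variable is missing, a misconfigured `.env` file would otherwise only surface later as an authentication error. The following is a minimal sketch of a fail-fast check; the helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and not part of the original notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value

# Demonstration with a throwaway variable (hypothetical name, not a real key)
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In the notebook, this could replace the plain lookup, for example `openai.api_key = require_env("OPENAI_API_KEY")`.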

# 3. Web Scraping

This section shows how the chatbot collects its source material: it downloads the RGU sitemap, selects the pages whose URL contains a keyword, and extracts clean text from each of them. The code combines HTTP requests, XML and HTML parsing, and light text processing.


```python
def extract_text_from(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, features="html.parser")
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)

def fetch_sitemap(url):
    r = requests.get(url)
    xml = r.text
    raw = xmltodict.parse(xml)
    return raw

def get_relevant_pages(sitemap, keyword):
    pages = []
    for info in sitemap['urlset']['url']:
        url = info['loc']
        if keyword in url:
            pages.append({'text': extract_text_from(url), 'source': url})
    return pages
```


### **3.1. `extract_text_from(url)` Function**

```python
def extract_text_from(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, features="html.parser")
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)
```

- **Purpose:** This function extracts and cleans the text content from a given URL.
- **Process:**
  - `requests.get(url).text`: Sends an HTTP GET request to the specified URL and retrieves the HTML content of the page.
  - `BeautifulSoup(html, features="html.parser")`: Parses the HTML content using BeautifulSoup, a library for web scraping and parsing HTML/XML documents.
  - `soup.get_text()`: Extracts all the text content from the HTML, stripping away any HTML tags.
  - `text.splitlines()`: Splits the extracted text into lines.
  - `(line.strip() for line in text.splitlines())`: Strips leading and trailing whitespace from each line.
  - `'\n'.join(line for line in lines if line)`: Joins the non-empty lines into a single string, separated by newline characters.
- **Necessity:** This function is essential for web scraping as it converts raw HTML content into clean, usable text. This is particularly useful for further processing or analysis of the page content.
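
To see what the last two steps do without fetching a page, the cleaning logic can be tried on a hard-coded string. This is a small sketch; the helper name `clean_lines` and the sample text are hypothetical stand-ins for the output of `soup.get_text()`.

```python
def clean_lines(text: str) -> str:
    """Strip each line and drop the ones that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Hypothetical sample mimicking the ragged whitespace left after tag removal
sample = "\n\n   Visas and immigration  \n\t\n  Apply for your visa early.\n\n"
cleaned = clean_lines(sample)
print(cleaned)
```

The blank and whitespace-only lines disappear, leaving one stripped line per piece of visible text.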

### **3.2. `fetch_sitemap(url)` Function**

```python
def fetch_sitemap(url):
    r = requests.get(url)
    xml = r.text
    raw = xmltodict.parse(xml)
    return raw
```

- **Purpose:** This function retrieves and parses the XML sitemap from a given URL.
- **Process:**
  - `requests.get(url)`: Sends an HTTP GET request to the URL of the sitemap.
  - `r.text`: Retrieves the XML content of the sitemap.
  - `xmltodict.parse(xml)`: Parses the XML content into a Python dictionary using `xmltodict`, which simplifies the XML structure into a dictionary format.
- **Necessity:** Sitemaps are used to inform search engines and users about the structure of a website. Fetching and parsing a sitemap allows you to programmatically access the URLs listed, which can be useful for tasks like site crawling and content extraction.

### **3.3. `get_relevant_pages(sitemap, keyword)` Function**

```python
def get_relevant_pages(sitemap, keyword):
    pages = []
    for info in sitemap['urlset']['url']:
        url = info['loc']
        if keyword in url:
            pages.append({'text': extract_text_from(url), 'source': url})
    return pages
```

- **Purpose:** This function filters the URLs from a sitemap based on a keyword and extracts text from the relevant pages.
- **Process:**
  - Iterates over each URL entry in the parsed sitemap dictionary.
  - `info['loc']`: Extracts the URL from each sitemap entry.
  - `if keyword in url`: Checks if the given keyword is present in the URL.
  - `extract_text_from(url)`: Calls the previously defined function to extract and clean the text content from the relevant pages.
  - `pages.append({'text': extract_text_from(url), 'source': url})`: Stores the extracted text and URL in a list for each relevant page.
- **Necessity:** This function helps in filtering and processing web pages based on specific criteria (e.g., a keyword in the URL). This is useful for targeted data extraction, such as collecting content related to a particular topic or ensuring that only relevant pages are processed.
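
One detail worth guarding against: `xmltodict` returns a list for repeated `<url>` elements but a single dictionary when the sitemap contains only one, in which case iterating over `sitemap['urlset']['url']` would loop over dictionary keys instead of entries. The sketch below shows the filtering step on hand-written dictionaries shaped like `xmltodict` output, so it needs no network access. The helper name `matching_urls` and the example URLs are hypothetical.

```python
def matching_urls(sitemap: dict, keyword: str) -> list:
    """Return the sitemap URLs that contain the keyword."""
    entries = (sitemap.get("urlset") or {}).get("url") or []
    # A sitemap with a single <url> element parses to a dict, not a list
    if isinstance(entries, dict):
        entries = [entries]
    return [entry["loc"] for entry in entries if keyword in entry["loc"]]

# Hand-written stand-ins for parsed sitemaps (hypothetical URLs)
many = {"urlset": {"url": [
    {"loc": "https://example.org/study/international-students/visas"},
    {"loc": "https://example.org/about/campus"},
]}}
single = {"urlset": {"url": {"loc": "https://example.org/study/international-students"}}}

print(matching_urls(many, "international-students"))
print(matching_urls(single, "international-students"))
```

Separating URL selection from text extraction also makes it possible to check which pages would be fetched before any page is downloaded.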

### 3.4 How Web Scraping Skills Have Been Used

3.4.1. **Data Retrieval:**
   - The `requests` library is used to fetch HTML and XML content from the web. This involves sending HTTP requests and handling the responses, which is fundamental to web scraping.

3.4.2. **HTML Parsing:**
   - `BeautifulSoup` is employed to parse HTML documents. It extracts meaningful text from the raw HTML, allowing you to navigate and manipulate the HTML structure.

3.4.3. **XML Parsing:**
   - `xmltodict` is used to parse XML sitemaps into dictionaries. This conversion simplifies the process of accessing and processing structured data in XML format.

3.4.4. **Text Processing:**
   - The `extract_text_from` function cleans and formats text extracted from HTML. This step is crucial for ensuring that the data is in a usable format for further analysis or processing.

3.4.5. **Content Filtering:**
   - The `get_relevant_pages` function demonstrates filtering and processing based on specific criteria, showcasing how to selectively handle and extract content based on keywords or other attributes.

```python
# Fetch the RGU sitemap and scrape the pages whose URL contains 'international-students'
sitemap_url = "https://www.rgu.ac.uk/index.php?option=com_jmap&view=sitemap&format=xml"
sitemap = fetch_sitemap(sitemap_url)
pages = get_relevant_pages(sitemap, 'international-students')
```

```python
# Uncomment to inspect the first scraped page
# pages[0]
```

# 4. Data Preprocessing
The `preprocess_text` function splits the scraped text into smaller chunks and associates metadata (the source URL) with each chunk. Chunking keeps each piece small enough to embed and retrieve individually, which matters when working with large volumes of text. The splitting itself is done by `CharacterTextSplitter` from the `langchain` library.

```python
def preprocess_text(pages):
    text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
    docs, metadatas = [], []
    for page in pages:
        splits = text_splitter.split_text(page['text'])
        docs.extend(splits)
        metadatas.extend([{"source": page['source']}] * len(splits))
    return docs, metadatas

# Example usage
docs, metadatas = preprocess_text(pages)
```


### **4.1. `preprocess_text(pages)` Function**

- **Purpose:** This function processes and splits text content from a list of web pages into manageable chunks and creates metadata associated with each chunk. It prepares the text data for further analysis or use in natural language processing (NLP) tasks.

- **Process:**
  - **Text Splitter Initialization:**
    ```python
    text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
    ```
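    - **`CharacterTextSplitter`**: This is a utility from the `langchain` library. It splits the text on the separator (here a newline) and then merges the pieces back into chunks of up to `chunk_size` characters (here 1500). A single line longer than 1500 characters is kept as one oversized chunk rather than being cut mid-line, and with the default overlap setting consecutive chunks may share some text.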
    - **Reason for Chunking:** Splitting large text into smaller chunks helps in processing and analyzing text more efficiently. It ensures that text chunks are manageable in size and easier to handle for various NLP tasks.

  - **Initialize Lists:**
    ```python
    docs, metadatas = [], []
    ```
    - **`docs`**: A list that will store the split text chunks.
    - **`metadatas`**: A list that will store metadata associated with each text chunk.

  - **Process Each Page:**
    ```python
    for page in pages:
        splits = text_splitter.split_text(page['text'])
        docs.extend(splits)
        metadatas.extend([{"source": page['source']}] * len(splits))
    ```
    - **Iterate Over Pages:** For each page in the `pages` list, extract and process the text.
    - **Split Text:** `text_splitter.split_text(page['text'])` splits the text into chunks based on the specified chunk size and separator.
    - **Extend Lists:**
      - **`docs.extend(splits)`**: Adds the split text chunks to the `docs` list.
      - **`metadatas.extend([{"source": page['source']}] * len(splits))`**: Adds metadata entries to the `metadatas` list. Each metadata entry contains the source URL of the page and is repeated for each text chunk generated from that page.
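
The split-then-merge behaviour can be illustrated without `langchain`. The following is a simplified stand-in, not the library's implementation: it ignores chunk overlap, and the function name `split_by_newline`, the tiny `chunk_size`, and the sample page are assumptions chosen for illustration.

```python
def split_by_newline(text: str, chunk_size: int = 1500) -> list:
    """Greedily merge newline-separated lines into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for line in text.split("\n"):
        candidate = line if not current else current + "\n" + line
        # Keep growing the chunk while it fits; an oversized single line is kept whole
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = line
    if current:
        chunks.append(current)
    return chunks

# Hypothetical page with three short lines and a deliberately tiny chunk size
pages = [{"text": "aaaa\nbbbb\ncccc", "source": "https://example.org/a"}]
docs, metadatas = [], []
for page in pages:
    splits = split_by_newline(page["text"], chunk_size=9)
    docs.extend(splits)
    # One metadata dictionary per chunk, so both lists stay aligned by index
    metadatas.extend({"source": page["source"]} for _ in splits)

print(docs)
print(metadatas)
```

Building a fresh dictionary per chunk, rather than repeating one dictionary with `*`, avoids all chunks of a page sharing the same metadata object if it is ever modified later.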


### 4.2. Importance and Benefits

4.2.1. **Efficient Text Handling:**
   - **Chunking:** By splitting text into smaller chunks, you improve the efficiency of processing and analysis. Large texts can be unwieldy and difficult to handle in a single operation.

4.2.2. **Enhanced Search and Retrieval:**
   - **Metadata Association:** Associating metadata with each text chunk allows you to trace back the source of the information, which is valuable for tracking and analyzing data sources.

4.2.3. **Improved NLP Tasks:**
   - **Manageable Size:** Smaller chunks are easier to work with in various NLP tasks, such as text classification, clustering, or similarity analysis.

4.2.4. **Scalability:**
   - **Chunking Text:** Helps in scaling the processing pipeline, allowing for handling larger volumes of text efficiently.


```python
# Uncomment to inspect the text chunks
# docs
```

### 4.3 Embeddings
Embeddings are a fundamental component in NLP that translate text into numerical vectors, capturing semantic meaning and enabling more efficient and effective text processing. For a chatbot, embeddings enhance understanding, response generation, and similarity search capabilities. The provided code uses the `SentenceTransformer` model to generate embeddings for a list of documents, which is a crucial step in building an intelligent and responsive chatbot.

#### **4.3.1. What Are Embeddings?**

Embeddings are numerical vector representations of text data. They capture semantic meaning in a dense vector space, where similar texts have vectors that are close to each other. Embeddings are essential in natural language processing (NLP) and machine learning for tasks that involve understanding and analyzing text.

#### **4.3.2. Why Are Embeddings Necessary?**

In NLP tasks, embeddings are crucial because they provide a way to translate textual information into a format that machine learning models can process effectively. They are necessary for several reasons:

1. **Semantic Understanding:**
   - Embeddings capture the semantic meaning of words, phrases, or documents. They enable models to understand and compare the meaning of different texts, even if the exact wording is different.

2. **Dimensionality Reduction:**
   - Compared with sparse representations such as one-hot or bag-of-words vectors, whose length equals the vocabulary size, embeddings have a fixed and much smaller dimension (384 for the model used in this project). This reduces the computational complexity and storage requirements, making it easier to work with large datasets.

3. **Improved Performance:**
   - Embeddings enhance the performance of machine learning models by providing a more informative and nuanced representation of text. They allow models to better capture relationships and similarities between texts.

4. **Similarity Measurement:**
   - Embeddings facilitate the measurement of similarity between texts. By comparing embeddings, you can determine how similar different pieces of text are to each other, which is useful for various tasks such as search and retrieval, clustering, and recommendation.

#### **4.3.3. How Do Embeddings Help the Chatbot?**

For a chatbot, embeddings are particularly useful in several ways:

1. **Enhanced Understanding:**
   - Embeddings enable the chatbot to understand and interpret user queries more effectively by capturing the semantic meaning of the text. This helps in generating more relevant and accurate responses.

2. **Improved Response Generation:**
   - With embeddings, the chatbot can match user queries to the most relevant responses or information from its knowledge base. This improves the quality of the responses and the overall user experience.

3. **Contextual Matching:**
   - Embeddings allow the chatbot to handle variations in user input by mapping similar queries to the same or related responses. This helps in managing different phrasings and synonyms.

4. **Similarity Search:**
   - The chatbot can use embeddings to perform similarity searches within a knowledge base or document repository. By comparing embeddings, it can retrieve information that closely matches the user's query.

```python
def generate_embeddings(docs):
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    return [model.encode(doc) for doc in docs]

# call function
embeddings = generate_embeddings(docs)
```


#### **Code Explanation**

- **Function Definition:**
  ```python
  def generate_embeddings(docs):
      model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
      return [model.encode(doc) for doc in docs]
  ```
  - **`SentenceTransformer('paraphrase-MiniLM-L6-v2')`:** Initializes a pre-trained Sentence Transformer model that generates embeddings. The `'paraphrase-MiniLM-L6-v2'` model is optimized for capturing the semantic similarity between sentences.
  - **`model.encode(doc)`:** Converts each document (`doc`) into its vector representation (embedding). This is done for all documents in the `docs` list.

- **Function Call:**
  ```python
  embeddings = generate_embeddings(docs)
  ```
  - **Purpose:** Calls the `generate_embeddings` function with the list of documents (`docs`) to generate their embeddings. The result is a list of embeddings, where each embedding corresponds to a document.
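
`SentenceTransformer.encode` also accepts a list of sentences and encodes them in batches, which is usually faster than one call per document. The sketch below shows that variant. So that it runs without downloading model weights, it uses a hypothetical `FakeModel` stand-in whose `encode` returns toy two-number vectors rather than real embeddings; the function name `generate_embeddings_batched` is likewise an assumption.

```python
class FakeModel:
    """Stand-in for SentenceTransformer, for illustration only."""

    def encode(self, sentences):
        # Toy "embedding": [character count, word count] for each sentence
        return [[float(len(s)), float(len(s.split()))] for s in sentences]

def generate_embeddings_batched(docs, model):
    """Encode all documents in a single call instead of one call per document."""
    return model.encode(docs)

docs = ["Visa guidance for students", "Campus map"]
embeddings = generate_embeddings_batched(docs, FakeModel())
print(len(embeddings), embeddings[1])
```

With the real model, `model.encode(docs)` returns a 2-D NumPy array with one row per document, and passing the model in as an argument means it is loaded once and can be reused for queries.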


```python
# Vector store: three parallel lists that share the same index
vector_store = {"documents": docs, "embeddings": embeddings, "metadatas": metadatas}

# Number of stored chunks (len(vector_store) would only count the three keys)
len(vector_store["documents"])
```


```python
import json
import os

# Path of a previously saved vector store
file_path = 'vector_store.json'

# Load the saved store if the file exists; otherwise keep the in-memory one built above
if os.path.exists(file_path):
    with open(file_path, 'r') as file:
        vector_store = json.load(file)
```

### 4.4 vector_store
`vector_store` is a dictionary that organizes and stores three key pieces of information.

>The `vector_store` dictionary efficiently organizes and stores text data, embeddings, and metadata in a structured format. This organization facilitates various NLP tasks, including similarity searches and information retrieval, by providing quick access to text chunks, their semantic representations, and associated metadata. This setup is essential for building effective and responsive systems, such as chatbots, that rely on textual data and semantic understanding.

4.4.1. **Documents:**
   - **Key:** `"documents"`
   - **Value:** `docs` (a list of text chunks obtained from the `preprocess_text` function)
   - **Purpose:** This list contains the actual text data split into manageable chunks. These chunks are the raw content that you want to analyze or use for further processing.

4.4.2. **Embeddings:**
   - **Key:** `"embeddings"`
   - **Value:** `embeddings` (a list of vector representations of the documents obtained from the `generate_embeddings` function)
   - **Purpose:** This list contains the embeddings generated for each text chunk. Embeddings are dense vector representations that capture the semantic meaning of the text, allowing for efficient similarity comparisons and other NLP tasks.

4.4.3. **Metadatas:**
   - **Key:** `"metadatas"`
   - **Value:** `metadatas` (a list of metadata dictionaries corresponding to each document, obtained from the `preprocess_text` function)
   - **Purpose:** This list contains metadata related to each text chunk, such as the source URL of the document. Metadata helps in tracking the origin of each chunk and can be useful for context or additional information retrieval.
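
NumPy arrays are not JSON-serializable, so saving the store to a file such as `vector_store.json` requires converting each embedding to a plain list first. The following is a minimal sketch of a save and load round trip. The helper names `to_json_compatible`, `save_vector_store`, and `load_vector_store` are hypothetical, and the toy store uses plain lists so the snippet runs without NumPy.

```python
import json
import os
import tempfile

def to_json_compatible(vector_store: dict) -> dict:
    """Copy the store, turning array-like embeddings into plain lists."""
    return {
        "documents": list(vector_store["documents"]),
        # NumPy arrays expose .tolist(); plain sequences are copied as they are
        "embeddings": [e.tolist() if hasattr(e, "tolist") else list(e)
                       for e in vector_store["embeddings"]],
        "metadatas": list(vector_store["metadatas"]),
    }

def save_vector_store(vector_store: dict, file_path: str) -> None:
    with open(file_path, "w", encoding="utf-8") as file:
        json.dump(to_json_compatible(vector_store), file)

def load_vector_store(file_path: str) -> dict:
    with open(file_path, "r", encoding="utf-8") as file:
        return json.load(file)

# Round trip with a toy store (hypothetical data) in a temporary directory
toy_store = {
    "documents": ["chunk one", "chunk two"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "metadatas": [{"source": "https://example.org/a"}, {"source": "https://example.org/a"}],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vector_store.json")
    save_vector_store(toy_store, path)
    restored = load_vector_store(path)

print(restored == toy_store)
```

After loading, the embeddings are lists of floats rather than NumPy arrays; `cosine_similarity` accepts either form.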

#### **Why is `vector_store` Necessary?**

>**Organized Storage:**
   - **Purpose:** `vector_store` consolidates all relevant information (text chunks, embeddings, and metadata) into a single, organized structure. This makes it easier to manage and access the data for subsequent processing or querying.

>**Efficient Retrieval:**
   - **Purpose:** By storing documents, embeddings, and metadata together, you can efficiently retrieve and use this information when performing tasks such as similarity searches, data analysis, or generating responses. The structure allows for quick access to the text, its vector representation, and associated metadata.

>**Improved Performance:**
   - **Purpose:** The separation of text data (documents) and its vector representations (embeddings) supports efficient similarity calculations. For instance, if you want to find the most similar documents to a given query, you can compute the query’s embedding and compare it to the stored embeddings using vector operations.

>**Enhanced Functionality:**
   - **Purpose:** The metadata provides additional context that can be useful for understanding the origin or additional details about each document. For example, if the chatbot retrieves a document, the metadata might include the source URL or other relevant information that can be presented to the user.


The `vector_store` dictionary can be used in various ways, such as:

- **Similarity Search:**
  - Compute the embedding for a user query and compare it to the stored embeddings to find the most similar documents.
  
- **Information Retrieval:**
  - Retrieve the documents and their associated metadata based on the similarity search results to provide relevant responses or information to the user.

- **Contextual Analysis:**
  - Use metadata to provide additional context or details about the source of the information being presented.
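
All three uses rely on the lists in `vector_store` staying aligned by index. A minimal sketch of checking that alignment and of returning retrieved chunks together with their sources follows; the helper names `validate_store` and `results_with_sources`, and the toy data, are hypothetical.

```python
def validate_store(vector_store: dict) -> None:
    """Raise ValueError if a key is missing or the lists differ in length."""
    required = ("documents", "embeddings", "metadatas")
    missing = [key for key in required if key not in vector_store]
    if missing:
        raise ValueError(f"Vector store is missing keys: {missing}")
    lengths = {key: len(vector_store[key]) for key in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Vector store lists differ in length: {lengths}")

def results_with_sources(vector_store: dict, indices: list) -> list:
    """Pair each selected chunk with the URL it came from."""
    return [
        {"text": vector_store["documents"][i],
         "source": vector_store["metadatas"][i]["source"]}
        for i in indices
    ]

# Toy store with hypothetical content
store = {
    "documents": ["Visa guidance", "Campus map"],
    "embeddings": [[1.0, 0.0], [0.0, 1.0]],
    "metadatas": [{"source": "https://example.org/visa"},
                  {"source": "https://example.org/campus"}],
}
validate_store(store)
print(results_with_sources(store, [1]))
```

Returning the source alongside the text lets the chatbot show users where an answer came from.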


### 4.5 Similarity in Document Retrieval

#### Concept of Similarity

In the context of document retrieval and natural language processing (NLP), **similarity** refers to how closely two pieces of text (or their vector representations) align with each other. This alignment can be based on various factors such as meaning, context, or content. The core idea is to find documents that are most relevant to a given query or context by comparing how similar their representations are.

#### Types of Similarity

4.5.1. **Cosine Similarity**: 
   - **Definition**: Cosine similarity is a metric used to measure how similar two vectors are irrespective of their magnitude. It calculates the cosine of the angle between two vectors in a multi-dimensional space.
   - **Usage**: It is widely used in NLP to measure document similarity by comparing the vector embeddings of texts. It is effective for understanding the orientation of vectors (i.e., how similar the text is) rather than their magnitude.

4.5.2. **Euclidean Distance**:
   - **Definition**: Euclidean distance measures the straight-line distance between two points in multi-dimensional space.
   - **Usage**: While less common in NLP for similarity tasks, it can be used for clustering and classification where the actual distance between points is relevant.

4.5.3. **Jaccard Similarity**:
   - **Definition**: Jaccard similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union.
   - **Usage**: This is more common in text analysis for comparing the similarity of sets of words or phrases rather than vector embeddings.
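
The three measures can be compared on small hand-made inputs. This is an illustrative sketch in plain Python; the vectors and sentences are invented for the example and are not real embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_sim(text_a, text_b):
    """Size of the shared word set divided by the size of the combined word set."""
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

# Two vectors pointing in the same direction but with different lengths
a, b = [1.0, 2.0], [2.0, 4.0]
print(cosine_sim(a, b))          # close to 1: same direction
print(euclidean_distance(a, b))  # greater than 0: different magnitude

print(jaccard_sim("visa requirements for students", "visa requirements for staff"))
```

The first pair shows why cosine similarity suits embeddings: the two vectors count as maximally similar even though their Euclidean distance is not zero.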

### Application
> **Vector Representation**

Each document and each query is converted into a numerical vector (embedding) using the same `SentenceTransformer` model. These embeddings capture the semantic meaning of the text, allowing for meaningful comparison.

> **Query Encoding**

When a user submits a query, it is converted into an embedding using the same model. This ensures that the query and documents are represented in the same space, making comparison feasible.

> **Similarity Calculation**

The cosine similarity between the query embedding and each document embedding is computed. This process involves:
   - **Normalization**: Vectors are normalized to have unit length (magnitude of 1). This ensures that the cosine similarity calculation only considers the angle between vectors, which is a measure of similarity.
   - **Dot Product Calculation**: The dot product of the query vector and each document vector is computed.
   - **Similarity Score**: The cosine similarity score is derived from the dot product, indicating how closely related the query is to each document.

> **Retrieving Top Documents**

The documents with the highest similarity scores are selected as the most relevant. These top documents are then used to generate a contextual response to the user query.

In short, retrieval comes down to three steps:

- **Query Encoding**: The query is converted into an embedding vector.
- **Cosine Similarity**: The cosine similarity between the query vector and each document vector is calculated.
- **Top Documents**: The documents with the highest similarity scores are selected and returned.
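
The normalization, dot product, and ranking steps can be written out explicitly. The sketch below uses plain Python lists and toy two-dimensional vectors in place of real embeddings; the function names `normalize` and `top_n_indices` are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)

def top_n_indices(query_vec, doc_vecs, top_n=2):
    """Rank documents by the dot product of unit vectors, which equals cosine similarity."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(d))) for d in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n], scores

# Toy document vectors and a query that points mostly along the first axis
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_n_indices([0.9, 0.1], doc_vecs)
print(indices)
```

On a NumPy array of scores, an expression such as `np.argsort(similarities)[-top_n:][::-1]` performs the same selection: sort ascending, keep the last `top_n` positions, and reverse them so the best match comes first.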


```python
# Initialize the SentenceTransformer model once and reuse it for documents and queries
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def generate_embeddings(docs):
    return [model.encode(doc) for doc in docs]

def generate_query_embedding(query):
    return model.encode(query)

def find_similar(query, vector_store, model, top_n=5):
    try:
        # Ensure necessary keys are present in vector_store
        if not all(key in vector_store for key in ['documents', 'embeddings', 'metadatas']):
            raise ValueError("Vector store must contain 'documents', 'embeddings', and 'metadatas' keys.")
        if len(vector_store['embeddings']) == 0:
            raise ValueError("Vector store contains no embeddings.")

        # Encode the query with the model that was passed in
        query_embedding = model.encode(query)

        # Ensure embeddings are of the same dimension
        if len(query_embedding) != len(vector_store['embeddings'][0]):
            raise ValueError("Embedding dimensions do not match.")

        # Compute similarity between the query and every document
        similarities = cosine_similarity([query_embedding], vector_store['embeddings'])[0]

        # Get indices of the top N most similar documents, best match first
        similar_indices = np.argsort(similarities)[-top_n:][::-1]

        # Collect the most similar documents
        return [vector_store['documents'][i] for i in similar_indices]

    except Exception as e:
        print(f"An error occurred: {e}")
        return []

def generate_answer_gpt4(relevant_documents, question):
    # Combine relevant documents into a single context
    context = "\n\n".join(relevant_documents)

    # Create a prompt for OpenAI
    prompt = f"Based on the following documents, answer the question:\n\n{context}\n\nQuestion: {question}\nAnswer:"

    # Generate the response using the chat endpoint (pre-1.0 openai interface)
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        temperature=0.7
    )

    return response.choices[0].message['content'].strip()


def answer_question(query, vector_store, model, top_n=1):
    # Find similar documents
    similar_docs = find_similar(query, vector_store, model, top_n)

    # Without retrieved context the model could only answer from general knowledge
    if not similar_docs:
        return "Sorry, I could not find relevant information to answer that question."

    # Generate and return an answer
    return generate_answer_gpt4(similar_docs, query)
```

```python
# Example usage
query = "What are the visa requirements for international students?"

# Get the answer
answer = answer_question(query, vector_store, model)
print(answer)
```
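
Prompt construction can be separated from the API call so that the flow can be checked offline. The following is a sketch under stated assumptions: `build_prompt`, `answer_with`, `fake_complete`, and the `max_context_chars` limit are all hypothetical names and values, and `fake_complete` merely stands in for the OpenAI request.

```python
def build_prompt(relevant_documents, question, max_context_chars=4000):
    """Join the retrieved chunks and cap the context length (the limit is an assumed value)."""
    context = "\n\n".join(relevant_documents)[:max_context_chars]
    return (
        "Based on the following documents, answer the question:\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer_with(complete, relevant_documents, question):
    """`complete` is any callable that maps a prompt string to an answer string."""
    if not relevant_documents:
        return "Sorry, I could not find relevant information."
    return complete(build_prompt(relevant_documents, question)).strip()

def fake_complete(prompt):
    # Stand-in for the model call: reports the prompt size instead of answering
    return f"  (answer based on {len(prompt)} prompt characters)  "

docs = ["Students from outside the UK may need a Student visa."]
print(answer_with(fake_complete, docs, "Do I need a visa?"))
print(answer_with(fake_complete, [], "Do I need a visa?"))
```

Passing the completion function in as an argument keeps the retrieval and prompt logic testable without an API key or network access.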

An optional interactive variant adds a greeting shortcut and a command-line loop. It reuses `model`, `find_similar`, and `generate_answer_gpt4` as defined above.

```python
import re

def answer_question(query, vector_store, model, top_n=5):
    # Reply to greetings directly; match whole words so that "hi" does not match "which"
    greetings = {"hello", "hi", "greetings", "hey", "welcome"}
    if greetings & set(re.findall(r"[a-z]+", query.lower())):
        return "Welcome to Robert Gordon University! How can I assist you today?"

    # Find similar documents
    similar_docs = find_similar(query, vector_store, model, top_n)

    # Generate and return an answer
    answer = generate_answer_gpt4(similar_docs, query)
    return answer

def main():
    while True:
        # Prompt user for input
        query = input("Enter your query (or type 'exit' to quit): ")

        if query.strip().lower() == 'exit':
            print("Exiting the program.")
            break

        # Get the answer
        answer = answer_question(query, vector_store, model)
        print(f"Answer: {answer}")

if __name__ == "__main__":
    main()
```


# 5. Conclusion
This project demonstrates the practical application of state-of-the-art AI techniques in creating an intelligent chatbot. It showcases how integrating different technologies and methodologies can result in a powerful tool that enhances user experience and provides valuable information efficiently.

Thank you for exploring the RGU Chatbot Project. Feel free to explore the code, contribute, or provide feedback to help us further refine and enhance this tool.