#### Import libraries

In [32]:
import os
import gc
import json
import faiss
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

#### Chunking the documents

Chunking is necessary to handle the input length limitations of language models and to improve the relevance of the retrieved context.

A simple approach is to use fixed-size chunking with some overlap:
* Fixed Size: We decide on a maximum number of characters (or tokens, though character-based is simpler to start with) for each chunk. For example, we could aim for chunks of around 500-1000 characters.
* Overlap: To maintain context between consecutive chunks, we can introduce an overlap. For instance, if our chunk size is 500 characters and we use an overlap of 100 characters, the last 100 characters of one chunk are repeated as the first 100 characters of the next. This reduces the chance that a sentence or idea is cut in half at a chunk boundary with no chunk containing it whole.
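
A minimal sketch of this sliding window as a standalone helper. The function name `chunk_text` and the synthetic `demo_text` are illustrative assumptions, not part of the notebook; the slicing mirrors the loop used in the cells that follow.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Synthetic 1,824-character text (hypothetical stand-in for a document)
demo_text = "".join(str(i % 10) for i in range(1824))
demo_chunks = chunk_text(demo_text)
lengths = [len(c) for c in demo_chunks]
print(lengths)  # [1000, 1000, 224]

# The last 200 characters of one chunk are the first 200 of the next
print(demo_chunks[0][-200:] == demo_chunks[1][:200])  # True
```

The final chunk is shorter than the others because the window runs off the end of the text.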

We'll read each document we loaded earlier and split its content into chunks of a specified size with a defined overlap. For now, we can just print the chunks and their metadata to verify the chunking process is working correctly, and in the next step, we'll integrate the embedding generation and vector database storage.

In [33]:
dataset_folder = "dataset"
chunk_size = 1000  # You can adjust this
chunk_overlap = 200  # You can adjust this

# Iterate through all files in the dataset folder
for filename in os.listdir(dataset_folder):
    if filename.endswith(".txt"):
        filepath = os.path.join(dataset_folder, filename)
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
                # Split the content into chunks and process each chunk
                for i in range(0, len(content), chunk_size - chunk_overlap):
                    chunk = content[i:i + chunk_size]
                    metadata = {"source": filename}
                    # For now, let's just print the chunk and its metadata
                    print(f"Chunk from: {metadata['source']}")
                    print(f"Length: {len(chunk)}")
                    # print(f"Content: {chunk[:50]}...") # Print first 50 characters of the chunk
                    print("-" * 20)
        except Exception as e:
            print(f"Error reading file '{filename}': {e}")

print("\nChunking process complete. Ready for embedding generation.")

Chunk from: A.I._Artificial_Intelligence.txt
Length: 1000
--------------------
Chunk from: A.I._Artificial_Intelligence.txt
Length: 1000
--------------------
Chunk from: A.I._Artificial_Intelligence.txt
Length: 224
--------------------
Chunk from: Active_learning_(machine_learning).txt
Length: 1000
--------------------
Chunk from: Active_learning_(machine_learning).txt
Length: 894
--------------------
Chunk from: Active_learning_(machine_learning).txt
Length: 94
--------------------
Chunk from: Adversarial_machine_learning.txt
Length: 842
--------------------
Chunk from: Adversarial_machine_learning.txt
Length: 42
--------------------
Chunk from: Affective_computing.txt
Length: 828
--------------------
Chunk from: Affective_computing.txt
Length: 28
--------------------
Chunk from: AI_boom.txt
Length: 487
--------------------
Chunk from: AlexNet.txt
Length: 1000
--------------------
Chunk from: AlexNet.txt
Length: 456
--------------------
Chunk from: Applications_of_artificial_intellige
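
Several trailing chunks in the output above are very short (42 and 28 characters). The window advances by `chunk_size - chunk_overlap = 800` characters, so an 842-character file yields a second chunk that starts at offset 800 and contains only text already present in the first chunk. One possible way to avoid these redundant tails is to stop as soon as a chunk reaches the end of the text. The helper name `chunk_text_no_tail` is hypothetical, and this is a sketch rather than what the notebook does.

```python
def chunk_text_no_tail(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size chunking that stops once a chunk already covers the end of the text."""
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk reached the end; a further chunk would be pure overlap
    return pieces


print([len(p) for p in chunk_text_no_tail("x" * 842)])   # [842]
print([len(p) for p in chunk_text_no_tail("x" * 1694)])  # [1000, 894]
print([len(p) for p in chunk_text_no_tail("x" * 1824)])  # [1000, 1000, 224]
```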

#### Generate embeddings

* Load a Sentence Transformer Model: We'll choose a suitable pre-trained model from the sentence-transformers library. These models are specifically designed for generating sentence and text embeddings.
* Generate Embeddings for Each Chunk: We'll iterate through our documents, chunk them as before, and then use the loaded model to generate an embedding vector for each chunk.
* Store Embeddings in a Vector Database: We'll use FAISS (which we installed earlier) to create an index and store the generated embedding vectors along with their corresponding metadata (e.g., the source filename and the original chunk content). FAISS allows for efficient similarity searching later when we have a user's question.

In [51]:
# Load the pre-trained Sentence Transformer model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("Sentence Transformer model loaded successfully.")
embedding_model

Sentence Transformer model loaded successfully.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [52]:
embedding_dimension = embedding_model.get_sentence_embedding_dimension()
embedding_dimension

384

In [53]:
# Initialize FAISS index
index = faiss.IndexFlatL2(embedding_dimension)  # Using L2 distance for similarity
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x00000241DC666460> >

In [54]:
# Lists to store chunks and their metadata (for now, just source)
chunks = []
metadata = []

# Iterate through all files in the dataset folder
for filename in os.listdir(dataset_folder):
    if filename.endswith(".txt"):
        filepath = os.path.join(dataset_folder, filename)
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
                # Split the content into chunks
                for i in range(0, len(content), chunk_size - chunk_overlap):
                    chunk = content[i:i + chunk_size]
                    chunks.append(chunk)
                    metadata.append({"source": filename})
        except Exception as e:
            print(f"Error reading file '{filename}': {e}")

# Generate embeddings for all chunks
embeddings = embedding_model.encode(chunks)
# Add embeddings to the FAISS index
index.add(np.array(embeddings).astype('float32'))

print(f"Total number of chunks: {len(chunks)}")
print(f"FAISS index size: {index.ntotal}")
print("Embeddings generated and added to FAISS index.")

# Now, 'index' contains the embeddings of our document chunks.
# We can save this index for later use if needed.
faiss.write_index(index, "document_embeddings.faiss")
print("FAISS index saved to 'document_embeddings.faiss'")

# We also need to save the chunks and metadata, perhaps in a separate file
with open("chunks_metadata.json", "w", encoding="utf-8") as f:
    json.dump({"chunks": chunks, "metadata": metadata}, f)
print("Chunks and metadata saved to 'chunks_metadata.json'")

# Explicitly delete the large variables
del chunks
del embeddings
del metadata

# Trigger garbage collection
gc.collect()

Total number of chunks: 190
FAISS index size: 190
Embeddings generated and added to FAISS index.
FAISS index saved to 'document_embeddings.faiss'
Chunks and metadata saved to 'chunks_metadata.json'


52

Explanation of the above code:

* embedding_dimension = embedding_model.get_sentence_embedding_dimension(): We get the dimensionality of the embeddings produced by our chosen Sentence Transformer model. This is needed to initialize the FAISS index.
* index = faiss.IndexFlatL2(embedding_dimension): We create a FAISS index. IndexFlatL2 is a simple index that performs an exact (non-approximate) nearest neighbor search using the L2 (Euclidean) distance. For larger datasets, we might consider approximate indexes for faster search, but for our learning project, this is a good starting point.
* Loading and chunking: We load and chunk the documents as before, this time storing the chunks and their metadata in two lists instead of printing them.
* embeddings = embedding_model.encode(chunks): We use the loaded Sentence Transformer model to generate embeddings for all the chunks in one go. This returns a NumPy array of embeddings.
* index.add(np.array(embeddings).astype('float32')): We add the generated embeddings (converted to a NumPy array of float32) to our FAISS index. We print the total number of chunks and the size of the FAISS index to verify the process.
* faiss.write_index(index, "document_embeddings.faiss"): We save the FAISS index to a file so we can load it later without re-generating the embeddings. We also save the chunks and metadata to a JSON file. While the FAISS index stores the embeddings, we'll need the actual text chunks and their source information later when we retrieve context.
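
To make "exact nearest neighbor search" concrete, here is a small pure-Python sketch of the idea behind a flat L2 index: compare the query against every stored vector and keep the closest ones. The function `flat_l2_search` and the toy vectors are illustrative assumptions; FAISS does the same kind of exhaustive comparison far more efficiently and, for `IndexFlatL2`, reports squared distances.

```python
def flat_l2_search(vectors, query, k):
    """Brute-force search: return (squared distances, indices) of the k nearest vectors."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))  # squared L2 distance
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-dimensional "embeddings" (hypothetical values)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(toy_indices)  # [1, 0]
```

The returned indices are positions in the list of stored vectors, which is why the notebook keeps the chunk and metadata lists in the same order as the embeddings.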

In [55]:
# Printing the FAISS index information
print(f"Is trained: {index.is_trained}")
print(f"Number of vectors: {index.ntotal}")
print(f"Dimension of vectors: {index.d}")
print(f"Index type: {type(index)}")

Is trained: True
Number of vectors: 190
Dimension of vectors: 384
Index type: <class 'faiss.swigfaiss_avx2.IndexFlatL2'>


Based on the above output, we have a FAISS index containing 190 vectors, one per document chunk. Each vector is a 384-dimensional embedding, and the search is based on exact L2 distance.
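
The `all-MiniLM-L6-v2` pipeline printed earlier ends with a `Normalize()` step, so its embeddings should have unit length. Under that assumption, ranking by L2 distance is equivalent to ranking by cosine similarity. With $q$ the query embedding and $c$ a chunk embedding:

$$
\lVert q - c \rVert^2 = \lVert q \rVert^2 + \lVert c \rVert^2 - 2\, q \cdot c = 2 - 2\cos(q, c)
$$

A smaller distance therefore corresponds to a higher cosine similarity. Note that `IndexFlatL2` reports squared L2 distances, which is the quantity on the left-hand side.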

#### Implementing the retrieval mechanism
We now write a function that takes a user's question as input, embeds it using our embedding_model, searches the FAISS index for the most similar embeddings, and returns the corresponding text chunks.

In [56]:
# Load the chunks and metadata
with open("chunks_metadata.json", "r", encoding="utf-8") as f:
    chunks_data = json.load(f)
    chunks = chunks_data['chunks']
    metadata = chunks_data['metadata']

def retrieve_relevant_chunks(query, top_k=3):
    """
    Embeds the query and retrieves the top_k most relevant document chunks from the FAISS index.

    Args:
        query (str): The user's question.
        top_k (int): The number of top relevant chunks to retrieve.

    Returns:
        tuple: (relevant_chunks, relevant_metadata), two parallel lists holding the
            top_k most relevant text chunks and their metadata.
    """
    # 1. Embed the query
    query_embedding = embedding_model.encode([query]).astype('float32')

    # 2. Search the FAISS index
    distances, indices = index.search(query_embedding, top_k)

    # 3. Retrieve the corresponding text chunks and metadata
    relevant_chunks = [chunks[i] for i in indices[0]]
    relevant_metadata = [metadata[i] for i in indices[0]]

    return relevant_chunks, relevant_metadata

In [57]:
# Example usage in Jupyter Notebook:
query_1 = "Explain the architecture of a large language model."
relevant_chunks, relevant_metadata = retrieve_relevant_chunks(query_1)

print(f"Query: {query_1}\n")
print(f"Top {len(relevant_chunks)} relevant chunks:\n")
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i+1}:")
    print(chunk[:200] + "...") # Print the first 200 characters
    print(f"Source: {relevant_metadata[i]['source']}\n")
    print("-" * 40)

# You can try different queries here
query_2 = "What are some applications of generative AI?"
relevant_chunks_2, relevant_metadata_2 = retrieve_relevant_chunks(query_2)

print(f"\nQuery: {query_2}\n")
print(f"Top {len(relevant_chunks_2)} relevant chunks:\n")
for i, chunk in enumerate(relevant_chunks_2):
    print(f"Chunk {i+1}:")
    print(chunk[:200] + "...")
    print(f"Source: {relevant_metadata_2[i]['source']}\n")
    print("-" * 40)

Query: Explain the architecture of a large language model.

Top 3 relevant chunks:

Chunk 1:
Title: Large language models in government

Summary:
Large language models have been used by officials and politicians in a wide variety of ways.

...
Source: Large_language_models_in_government.txt

----------------------------------------
Chunk 2:
Title: Language model

Summary:
A language model is a model of natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation, natural language ...
Source: Language_model.txt

----------------------------------------
Chunk 3:
Title: List of large language models

Summary:
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are lan...
Source: List_of_large_language_models.txt

----------------------------------------

Query: What are some applications of generative AI?

Top 3 relevant chunks:

Chunk 1:


The code above does the following:
* It loads the tools: The retrieval function reuses the Sentence Transformer model and the FAISS index that are already in memory from the previous cells. The cell also reloads the original text chunks and their source metadata from chunks_metadata.json, since those variables were deleted after the index was built.
* It understands your question: When you give it a query, it uses the same Sentence Transformer model to convert your question into a numerical representation (an embedding). This way, the computer can understand the meaning of your question in a numerical form.
* It searches for similar information: It then uses this numerical representation of your question to search the FAISS index. The FAISS index is like a super-fast lookup system that finds the document embeddings that are most similar in meaning to your question's embedding. It returns the top few most similar ones (we set it to top_k=3 by default).
* It finds the original text: For each of the most similar embeddings found in the FAISS index, the code looks up the original text chunk that corresponds to that embedding. It also retrieves any extra information we saved about that chunk (like which document it came from).
* It gives you the results: Finally, the function returns two parallel lists: the most relevant text chunks and their associated metadata.
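
The lookup step can be sketched on its own, without FAISS, as a mapping from search results back to text. The helper `collect_hits` and the toy data are hypothetical. The sketch also skips negative indices, since FAISS fills missing results with `-1` when fewer than `top_k` vectors are available, and in Python `chunks[-1]` would silently return the last chunk.

```python
def collect_hits(distances, indices, chunks, metadata):
    """Pair each returned index with its chunk text, source and distance."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:
            continue  # placeholder index: no result in this slot
        hits.append({
            "chunk": chunks[idx],
            "source": metadata[idx]["source"],
            "distance": float(dist),
        })
    return hits


# Toy data standing in for one row of the search output (hypothetical values)
toy_chunks = ["alpha text", "beta text", "gamma text"]
toy_metadata = [{"source": "a.txt"}, {"source": "b.txt"}, {"source": "c.txt"}]
hits = collect_hits([0.1, 0.5, 9.9], [2, 0, -1], toy_chunks, toy_metadata)
print([h["source"] for h in hits])  # ['c.txt', 'a.txt']
```

Returning the distance alongside each chunk also makes it possible to drop weak matches with a threshold later on.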

#### Integrate a Language Model (LLM)
Next, we generate an answer using the retrieved context and the original question.

Steps:
* Load a Pre-trained Language Model: We'll use a suitable pre-trained language model from the transformers library.
* Construct a Prompt: We need to create a prompt that we will feed to the LLM. This prompt will typically include:
    * The user's question.
    * The retrieved relevant context (the text chunks from our knowledge base).
    * Instructions for the LLM on how to use the context to answer the question.
* Generate the Answer: We'll pass the constructed prompt to the LLM and ask it to generate a response.
* Format the Output: We'll then present the generated answer to the user, potentially along with the sources (the documents from which the context was retrieved).
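
The prompt construction and output formatting steps can be sketched independently of any model. The function names `build_prompt` and `format_sources` and the exact prompt wording are illustrative assumptions, not the prompt used later in this notebook.

```python
def build_prompt(query, context_chunks):
    """Assemble instructions, retrieved context and the question into one prompt string."""
    context = "\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


def format_sources(metadata_items):
    """List each source file once, in the order it was retrieved."""
    seen = []
    for item in metadata_items:
        if item["source"] not in seen:
            seen.append(item["source"])
    return "Sources: " + ", ".join(seen)


demo_prompt = build_prompt("What is FAISS?", ["FAISS is a similarity search library."])
print(demo_prompt)
print(format_sources([{"source": "a.txt"}, {"source": "b.txt"}, {"source": "a.txt"}]))
# Sources: a.txt, b.txt
```

Ending the prompt with "Answer:" is a common convention for plain text-completion models, because it signals where the continuation should begin.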

In [58]:
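# Load the pre-trained GPT-2 model and tokenizer using the question-answering pipeline.
# Note: the gpt2-large checkpoint has no trained QA head (see the warning in the output below).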
qa_pipeline = pipeline("question-answering", model="gpt2-large")
print("GPT-2 model loaded successfully for question answering.")
qa_pipeline

Some weights of GPT2ForQuestionAnswering were not initialized from the model checkpoint at gpt2-large and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


GPT-2 model loaded successfully for question answering.


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x241d9f04890>

In [59]:
def answer_question(query, context_chunks):
    """
    Uses the GPT-2 model to answer a question based on the provided context chunks.

    Args:
        query (str): The user's question.
        context_chunks (list): A list of relevant text chunks.

    Returns:
        str: The generated answer.
    """
    if not context_chunks:
        return "No relevant context found to answer the question."

    # Combine the context chunks into a single string
    context = " ".join(context_chunks)

    # Construct the input for the question-answering pipeline
    input_data = {
        'question': query,
        'context': context
    }

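    # A free-form string prompt such as the one below is meant for a text-generation pipeline;
    # the question-answering pipeline expects a question and a context.
    #prompt = f"Answer the following question based on the context provided.\n\nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"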

    # Get the answer from the pipeline
    result = qa_pipeline(input_data)
    #result = qa_pipeline(prompt)
    return result['answer']

In [60]:
query = "Explain the architecture of a large language model and its key components."
relevant_chunks, relevant_metadata = retrieve_relevant_chunks(query)

print(f"Question: {query}\n")
print("Retrieved Context Chunks:\n")
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i+1} (Source: {relevant_metadata[i]['source']}):")
    print(chunk[:200] + "...")
    print("-" * 40)

answer = answer_question(query, relevant_chunks)
print(f"\nGenerated Answer: {answer}")

Question: Explain the architecture of a large language model and its key components.

Retrieved Context Chunks:

Chunk 1 (Source: Language_model.txt):
Title: Language model

Summary:
A language model is a model of natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation, natural language ...
----------------------------------------
Chunk 2 (Source: Large_language_models_in_government.txt):
Title: Large language models in government

Summary:
Large language models have been used by officials and politicians in a wide variety of ways.

...
----------------------------------------
Chunk 3 (Source: List_of_large_language_models.txt):
Title: List of large language models

Summary:
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are lan...
----------------------------------------

Generated Answer:  character


In [61]:
query = "Explain what is a large language model in machine learning"
relevant_chunks, relevant_metadata = retrieve_relevant_chunks(query)

print(f"Question: {query}\n")
print("Retrieved Context Chunks:\n")
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i+1} (Source: {relevant_metadata[i]['source']}):")
    print(chunk[:200] + "...")
    print("-" * 40)

answer = answer_question(query, relevant_chunks)
print(f"\nGenerated Answer: {answer}")

Question: Explain what is a large language model in machine learning

Retrieved Context Chunks:

Chunk 1 (Source: List_of_large_language_models.txt):
Title: List of large language models

Summary:
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are lan...
----------------------------------------
Chunk 2 (Source: Language_model.txt):
Title: Language model

Summary:
A language model is a model of natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation, natural language ...
----------------------------------------
Chunk 3 (Source: Large_language_model.txt):
Title: Large language model

Summary:
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language mod...
----------------------------------------

Generated Answer:  character


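#### Tuning to get desired results

Both answers above came back as " character", which is not useful. The warning printed when the pipeline was loaded points to the cause: `gpt2-large` has no trained question-answering head, so `qa_outputs.weight` and `qa_outputs.bias` were newly initialized and the extracted answer span cannot be expected to be meaningful. The cells below change three things: chunks are built from whole sentences, a different embedding model is used, and GPT-2 is run through the text-generation pipeline with an explicit prompt.
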
In [110]:
!pip install nltk
!pip install huggingface_hub[hf_xet]



In [111]:
import nltk
from nltk.tokenize import sent_tokenize

# Download punkt tokenizer if not already downloaded
try:
    sent_tokenize("This is a test sentence.")
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')

In [192]:
dataset_folder = "dataset"
chunk_size = 500  # Maximum number of characters per sentence-based chunk
chunk_overlap = 50  # Kept for reference; the sentence-based loop below does not use it

all_chunks = []
all_metadata = []

# Iterate through all files in the dataset folder
for filename in os.listdir(dataset_folder):
    if filename.endswith(".txt"):
        filepath = os.path.join(dataset_folder, filename)
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                content = f.read()
                sentences = sent_tokenize(content)
                current_chunk = ""
                for sentence in sentences:
                    if len(current_chunk) + len(sentence) + 1 <= chunk_size:
                        current_chunk += sentence + " "
                    else:
                        # Avoid storing an empty chunk when a single sentence exceeds chunk_size
                        if current_chunk:
                            all_chunks.append(current_chunk.strip())
                            all_metadata.append({"source": filename})
                        current_chunk = sentence + " "
                if current_chunk:
                    all_chunks.append(current_chunk.strip())
                    all_metadata.append({"source": filename})
        except Exception as e:
            print(f"Error reading file '{filename}': {e}")
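
The loop above defines `chunk_overlap` but never applies it, so consecutive sentence-based chunks share no text. If overlap is wanted, one option is to carry the last sentence or two into the next chunk. The sketch below assumes the sentences have already been split (for example by `sent_tokenize`); the function name `pack_sentences` and the parameter `overlap_sentences` are hypothetical, and `chunk_size` acts as a soft limit because carried-over sentences can push a chunk past it.

```python
def pack_sentences(sentences, chunk_size=500, overlap_sentences=1):
    """Group sentences into chunks of roughly chunk_size characters with sentence-level overlap."""
    packed = []
    current = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            packed.append(" ".join(current))
            # Carry the last few sentences into the next chunk to preserve context
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        packed.append(" ".join(current))
    return packed


demo_sentences = ["Alpha one.", "Beta two.", "Gamma three.", "Delta four."]
demo_packed = pack_sentences(demo_sentences, chunk_size=22, overlap_sentences=1)
print(demo_packed)
# ['Alpha one. Beta two.', 'Beta two. Gamma three.', 'Gamma three. Delta four.']
```

Overlapping by whole sentences keeps every chunk readable on its own, at the cost of storing some sentences twice.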

In [193]:
# Load a different Sentence Transformer model
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
embeddings = embedding_model.encode(all_chunks)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
faiss.write_index(index, "document_embeddings_sentences_mpnet.faiss")

chunks_metadata = {"chunks": all_chunks, "metadata": all_metadata}
with open("chunks_metadata_sentences_mpnet.json", "w", encoding="utf-8") as f:
    json.dump(chunks_metadata, f)

print(f"Processed {len(all_chunks)} chunks using sentence-based splitting.")
print(f"Embeddings saved to document_embeddings_sentences_mpnet.faiss using multi-qa-mpnet-base-dot-v1")
print(f"Metadata saved to chunks_metadata_sentences_mpnet.json")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Processed 284 chunks using sentence-based splitting.
Embeddings saved to document_embeddings_sentences_mpnet.faiss using multi-qa-mpnet-base-dot-v1
Metadata saved to chunks_metadata_sentences_mpnet.json
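
One detail worth checking: as its name suggests, `multi-qa-mpnet-base-dot-v1` is tuned for dot-product scoring, while the index above still ranks by L2 distance. When embeddings are not unit-length, the two measures can disagree about which vector is the best match. The toy vectors below are hypothetical and only illustrate that disagreement.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot (inner) product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


toy_docs = [[1.0, 0.0], [3.0, 0.0]]  # same direction, different lengths
toy_query = [1.0, 0.0]

best_by_l2 = min(range(len(toy_docs)), key=lambda i: l2_squared(toy_docs[i], toy_query))
best_by_dot = max(range(len(toy_docs)), key=lambda i: dot(toy_docs[i], toy_query))
print(best_by_l2, best_by_dot)  # 0 1
```

FAISS also provides `faiss.IndexFlatIP` for exact inner-product search. Whether switching to it improves retrieval on this dataset is an assumption that would need to be tested.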


In [194]:
index = faiss.read_index("document_embeddings_sentences_mpnet.faiss")

# Load the chunks and metadata
with open("chunks_metadata_sentences_mpnet.json", "r", encoding="utf-8") as f:
    chunks_data = json.load(f)
    chunks = chunks_data['chunks']
    metadata = chunks_data['metadata']

# Load the pre-trained GPT-2 model and tokenizer using the pipeline
text_generator = pipeline("text-generation", model="gpt2-large")  # For text generation instead of question answering.
print("GPT-2 model loaded successfully.")
text_generator

Device set to use cpu


GPT-2 model loaded successfully.


<transformers.pipelines.text_generation.TextGenerationPipeline at 0x241deb43c50>

In [229]:
def retrieve_relevant_chunks(query, top_k=2):
    """
    Embeds the query and retrieves the top_k most relevant document chunks from the FAISS index.

    Args:
        query (str): The user's question.
        top_k (int): The number of top relevant chunks to retrieve.

    Returns:
        tuple: (relevant_chunks, relevant_metadata), two parallel lists holding the
            top_k most relevant text chunks and their metadata.
    """
    # 1. Embed the query
    query_embedding = embedding_model.encode([query]).astype('float32')

    # 2. Search the FAISS index
    distances, indices = index.search(query_embedding, top_k)

    # 3. Retrieve the corresponding text chunks and metadata
    relevant_chunks = [chunks[i] for i in indices[0]]
    relevant_metadata = [metadata[i] for i in indices[0]]

    return relevant_chunks, relevant_metadata

def answer_question(query, context_chunks):
    """
    Uses the GPT-2 model to answer a question based on the provided context chunks.

    Args:
        query (str): The user's question.
        context_chunks (list): A list of relevant text chunks.

    Returns:
        str: The generated answer.
    """
    if not context_chunks:
        return "No relevant context found to answer the question."

    # Combine the context chunks into a single string
    context = "\n".join(context_chunks)

    # Construct the input prompt
    prompt = (f"You are an AI chatbot. Use the information from below context to answer the question.\nContext: {context}\n\nQuestion: {query}")
    print(f"Prompt:\n{prompt}")

    # Get the answer from the pipeline
    result = text_generator(prompt, max_length=400, truncation=True, num_return_sequences=1, temperature=0.2, do_sample=True, repetition_penalty=1.15)[0]['generated_text']
    answer = result[len(prompt):].strip()
    return answer
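
Two details of the generation call are easy to miss. First, `max_length` counts the prompt tokens as well as the generated ones, so a long context leaves less room for the answer; `max_new_tokens` is the parameter that limits only the continuation. Second, the pipeline returns the prompt followed by the continuation, which is why the function slices off `len(prompt)` characters. The sketch below shows one way to post-process that continuation. The helper `extract_answer` and the sample strings are hypothetical.

```python
def extract_answer(generated_text, prompt):
    """Strip the echoed prompt and a leading 'Answer:' label, and keep the first paragraph."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text  # prompt was not echoed back
    continuation = continuation.strip()
    if continuation.lower().startswith("answer:"):
        continuation = continuation[len("answer:"):].strip()
    # Base models often keep writing new questions; cut at the first blank line
    return continuation.split("\n\n", 1)[0].strip()


demo_gen_prompt = "Context: ML is a field of AI.\n\nQuestion: What is ML?"
demo_generated = demo_gen_prompt + "\nAnswer: ML is a field.\n\nQuestion: What else?"
print(extract_answer(demo_generated, demo_gen_prompt))  # ML is a field.
```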

In [222]:
query = "What is a large language model in machine learning?"
relevant_chunks, relevant_metadata = retrieve_relevant_chunks(query)
answer = answer_question(query, relevant_chunks)
print("-"*50)
print(f"\nGenerated Output:\n {answer}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt:
You are question answer chat bot. Use the below context to answer the question.
Context: A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text. This page lists notable large language models.
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text. The largest and most capable LLMs are generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided by prompt engineering.

Question: What is a large language model in machine learning?
--------------------

Generated Answer: Answer: A large language model is a type of machine learning model d

In [230]:
query = "What is machine learning?"
relevant_chunks, relevant_metadata = retrieve_relevant_chunks(query)
answer = answer_question(query, relevant_chunks)
print("-"*50)
print(f"\nGenerated Output:\n {answer}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt:
You are an AI chatbot. Use the information from below context to answer the question.
Context: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.
Machine learning has been used for various scientific and commercial purposes including language translation, image recognition, decision-making, credit scoring, and e-commerce.

Question: What is machine learning?
--------------------------------------------------

Generated Output:
 Answer: ML is a branch of computer science that studies how computers learn. It is based on the idea that computers can be trained by observing and 

In [231]:
query = "Explain computer vision"
relevant_chunks, relevant_metadata = retrieve_relevant_chunks(query)
answer = answer_question(query, relevant_chunks)
print("-"*50)
print(f"\nGenerated Output:\n {answer}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt:
You are an AI chatbot. Use the information from below context to answer the question.
Context: Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the form of decisions. "Understanding" in this context signifies the transformation of visual images (the input to the retina) into descriptions of the world that make sense to thought processes and can elicit appropriate action.
This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory. The scientific discipline of computer vision is concerned with the theory behind artificial systems that extract information from images.

Question: Explain computer vision
--------------------------------------------------

Generated Output:
