Install the required packages:

In [1]:
!pip install faiss-cpu transformers accelerate torch sentence-transformers pandas


In [2]:
import torch
from transformers import pipeline
import faiss
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer

In [8]:
def embed_question(question, model_name):
    encoder = SentenceTransformer(model_name)
    embedding = encoder.encode([question])
    return embedding

def retrieve_indexes_from_faiss(embedding, k, database_name):
    # FAISS expects a float32 array of shape (n_queries, dim):
    embedding_array = np.array(embedding, dtype=np.float32)
    index = faiss.read_index(database_name)

    # search the index with the query embedding; both results have shape (n_queries, k):
    distances, indexes = index.search(embedding_array, k)
    # only one query was passed, so keep the first row of each result:
    indexes = indexes[0]
    distances = distances[0]
    print("indexes: ", indexes)
    print("distances: ", distances)
    return indexes

def retrieve_relevant_context_from_csv(indexes, csv_filename):
    df = pd.read_csv(csv_filename)
    context = ""
    for index in indexes:
        # get the text chunk where the id matches and add to context
        result = df.loc[df["id"] == index, "text_chunk"]
        if not result.empty:
            context = context + "\n" + result.values[0]
        else:
            print("no text chunk for id", index)

    print("retrieved context: ", context)
    return context

def construct_prompt(question, context):
    prompt_start = "Answer the following question: "
    prompt_context = " Context: " + context
    prompt = prompt_start + question + prompt_context
    return prompt


question = "what is the multi-store memory model?"
embedding = embed_question(question, "BAAI/bge-small-en-v1.5")
indexes = retrieve_indexes_from_faiss(embedding, 5, "faiss.index")
# sort the retrieved ids in ascending order before looking up their text chunks:
indexes = np.sort(indexes)
context = retrieve_relevant_context_from_csv(indexes, "text_chunks.csv")

# load the language model:
model_name = "HuggingFaceTB/SmolLM-135M"
model = pipeline("text-generation", model=model_name)

prompt = construct_prompt(question, context)

indexes:  [24 32 27 31 26]
distances:  [0.3739136  0.51471996 0.5285149  0.63307667 0.7145467 ]
retrieved context:  
Atkinson and Shiffrin (1968) devised the multi-store model (MSM) of memory. It is a cognitive approach that explains memory as information passing through a series of 3 storage systems: the sensory register, then short-term memory, and then long-term memory.
Long-term memory lasts anywhere from more than 30 seconds to an entire lifetime. There are several different types of long-term memory (see below), but all long-term memories will have originally passed through both the sensory register and short-term memory.
Once a long-term memory is stored, it can be retrieved and temporarily transferred to short term memory and manipulated (this retrieval process may also improve the duration of the memory). An example of retrieval would be remembering a pleasant experience from your childhood and thinking about how you felt that day.
The working memory model (WMM) was developed 

Device set to use cpu
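
To make the `indexes` and `distances` printed above more concrete, here is a minimal pure-Python sketch of what an exhaustive (flat) nearest-neighbour search computes. It assumes the index ranks by squared L2 distance, which the ascending `distances` values suggest but the notebook does not confirm. `brute_force_search` and the toy vectors are hypothetical and are not part of FAISS.

```python
def brute_force_search(query, vectors, k):
    """Return (distances, indexes) of the k stored vectors closest to query.

    Uses squared L2 distance and sorts ascending, so the best match comes first.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    distances = [d for d, _ in top]
    indexes = [i for _, i in top]
    return distances, indexes


# toy 2-dimensional "embeddings" (made-up data for illustration only)
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indexes = brute_force_search([0.9, 0.1], stored, k=2)
print("indexes: ", indexes)
print("distances: ", distances)
```

As in the notebook, the position of each vector in the store acts as its id, which is why the returned ids can be matched against the `id` column of the CSV.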


In [10]:
# generate the response:
output = model(prompt, max_length=1000, early_stopping=False)
output_text = output[0]["generated_text"].replace(prompt, "")
print(output_text)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



The multi-store model of long-term memory (MSM) is a cognitive approach that explains long-term memory as information passing through a series of 3 storage systems: the sensory register, short-term memory, and long-term memory.
Long-term memory lasts anywhere from 30 seconds to an entire lifetime. There are several different types of long-term memories (see below), but all long-term memories will have originally passed through both the sensory register and short-term memory.
Once a long-term memory is stored, it can be retrieved and temporarily transferred to short term memory and manipulated (this retrieval process may also improve the duration of the memory). An example of retrieval would be remembering a pleasant experience from your childhood and thinking about how you felt that day.
The working memory model (WMM) was developed by Baddeley and Hitch (1974) and builds on the multi-store model – in particular, the MSM’s model of short-term memory. Rather than replacing the MSM, the 
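
The `replace(prompt, "")` call in the cell above removes every occurrence of the prompt from the generated text, not only the leading copy. A slightly more conservative sketch strips the prompt only when it is a prefix, assuming the pipeline returns the prompt followed by the continuation (as the `replace` call implies). `strip_prompt` is a hypothetical helper name, and the strings below are made-up examples.

```python
def strip_prompt(generated_text, prompt):
    """Remove the prompt only when it appears at the start of the text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # leave the text untouched if the prompt is not a prefix
    return generated_text


# made-up strings for illustration
prompt = "Answer the following question: what is X? Context: X is Y."
generated = prompt + " X is Y. Context: X is Y."
print(strip_prompt(generated, prompt))
```

Any later text that merely resembles part of the prompt is left in place, so the printed answer keeps everything the model generated after the prompt.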