<a href="https://colab.research.google.com/github/anhtranguyen-github/RingDingDingDing/blob/main/Simple_RAG_with_GROQ_Qdrant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Problem:
Applying large language models to look up and answer questions about course materials for specialized IT subjects.



### Introduction

This course project proposes using a large language model (LLM) together with RAG (Retrieval Augmented Generation) to build a lookup and question-answering system over course materials for specialized IT subjects. The system helps users search for information, get answers to their questions, and study more effectively.


### Pipeline:


![A simple RAG system](https://research.aimultiple.com/wp-content/uploads/2023/09/RAG-Architect-612x406.png.webp)

Image source: https://research.aimultiple.com/retrieval-augmented-generation/



Retrieval Augmented Generation (RAG) is a method introduced by researchers at Meta AI for knowledge-intensive tasks. RAG combines an information retrieval component (Retrieval) with a text generation model (Generation).

Documents and knowledge from a source (e.g., Wikipedia, Google Drive, etc.) are embedded with an embedding model and indexed into a vector database to serve queries.

RAG takes the user's input and uses it to retrieve a set of relevant documents.

The retrieved documents are then added to the prompt as context (in-context learning) and passed to the generation model to produce a response.
An example prompt:

"""Use the following pieces of context to answer the question at the end.

If you don't know the answer, just say that you don't know, don't try to make up an answer.

Use three sentences maximum and keep the answer as concise as possible.

{context}

Question: {question}

Helpful Answer:"""
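
A minimal sketch of how the template above could be filled in, assuming the retrieved chunks are plain strings. The names `PROMPT_TEMPLATE` and `build_prompt`, and the sample chunks, are illustrative and not part of this notebook's code.

```python
# Illustrative only: PROMPT_TEMPLATE and build_prompt are hypothetical names.
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.

{context}

Question: {question}

Helpful Answer:"""


def build_prompt(retrieved_chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    [
        "TCP is a connection-oriented transport protocol.",
        "UDP is a connectionless transport protocol.",
    ],
    "Is TCP connection-oriented?",
)
print(prompt)
```

The resulting string is what would be sent to the generation model; how many chunks to include is a tuning choice that depends on the model's context window.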

#Install the required libraries

In [None]:
!pip install -q transformers==4.40.0
!pip install -q accelerate==0.29.3
!pip install -q huggingface-hub==0.22.2
!pip install -q auto-gptq==0.7.1




#Chunk Class

Because documents are long, a whole document cannot be stored as a single vector; it must be split into groups of sentences (chunks) for storage and retrieval.



In [None]:
class Chunk:
    def __init__(
        self,
        text: str = "",
        doc_name: str = "",
        doc_type: str = "",
        doc_uuid: str = "",
        chunk_id: str = "",
    ):
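        self._text = text
        # Not populated anywhere yet; initialized so the text_no_overlap property does not raise AttributeError
        self._text_no_overlap = ""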
        self._doc_name = doc_name
        self._doc_type = doc_type
        self._doc_uuid = doc_uuid
        self._chunk_id = chunk_id
        self._tokens = 0
        self._vector = None
        self._score = 0

    @property
    def text(self):
        return self._text

    @property
    def text_no_overlap(self):
        return self._text_no_overlap

    @property
    def doc_name(self):
        return self._doc_name

    @property
    def doc_type(self):
        return self._doc_type

    @property
    def doc_uuid(self):
        return self._doc_uuid

    @property
    def chunk_id(self):
        return self._chunk_id

    @property
    def tokens(self):
        return self._tokens

    @property
    def vector(self):
        return self._vector

    @property
    def score(self):
        return self._score

    def set_uuid(self, uuid):
        self._doc_uuid = uuid

    def set_tokens(self, token):
        self._tokens = token

    def set_vector(self, vector):
        self._vector = vector

    def set_score(self, score):
        self._score = score

    def to_dict(self) -> dict:
        """Convert the Chunk object to a dictionary."""
        return {
            "text": self.text,
            "doc_name": self.doc_name,
            "doc_type": self.doc_type,
            "doc_uuid": self.doc_uuid,
            "chunk_id": self.chunk_id,
            "tokens": self.tokens,
            "vector": self.vector,
            "score": self.score,
        }

    @classmethod
    def from_dict(cls, data: dict):
        """Construct a Chunk object from a dictionary."""
        chunk = cls(
            text=data.get("text", ""),
            doc_name=data.get("doc_name", ""),
            doc_type=data.get("doc_type", ""),
            doc_uuid=data.get("doc_uuid", ""),
            chunk_id=data.get("chunk_id", ""),
        )
        chunk.set_tokens(data.get("tokens", 0))
        chunk.set_vector(data.get("vector", None))
        chunk.set_score(data.get("score", 0))
        return chunk


#Document class


Class representing every type of document that is read and stored.

In [None]:

class Document:
    def __init__(
        self,
        text: str = "",
        type: str = "",
        name: str = "",
        path: str = "",
        link: str = "",
        timestamp: str = "",
        reader: str = "",
        meta: dict = None,
    ):
        if meta is None:
            meta = {}
        self._text = text
        self._type = type
        self._name = name
        self._path = path
        self._link = link
        self._timestamp = timestamp
        self._reader = reader
        self._meta = meta
        self.chunks: list[Chunk] = []

    @property
    def text(self):
        return self._text

    @property
    def type(self):
        return self._type

    @property
    def name(self):
        return self._name

    @property
    def path(self):
        return self._path

    @property
    def link(self):
        return self._link

    @property
    def timestamp(self):
        return self._timestamp

    @property
    def reader(self):
        return self._reader

    @property
    def meta(self):
        return self._meta

    @staticmethod
    def to_json(document) -> dict:
        """Convert the Document object to a JSON dict."""
        doc_dict = {
            "text": document.text,
            "type": document.type,
            "name": document.name,
            "path": document.path,
            "link": document.link,
            "timestamp": document.timestamp,
            "reader": document.reader,
            "meta": document.meta,
            "chunks": [chunk.to_dict() for chunk in document.chunks],
        }
        return doc_dict

    @staticmethod
    def from_json(doc_dict: dict):
        """Convert a JSON dict to a Document object."""
        document = Document(
            text=doc_dict.get("text", ""),
            type=doc_dict.get("type", ""),
            name=doc_dict.get("name", ""),
            path=doc_dict.get("path", ""),
            link=doc_dict.get("link", ""),
            timestamp=doc_dict.get("timestamp", ""),
            reader=doc_dict.get("reader", ""),
            meta=doc_dict.get("meta", {}),
        )
        # Assuming Chunk has a from_dict method
        document.chunks = [
            Chunk.from_dict(chunk_data) for chunk_data in doc_dict.get("chunks", [])
        ]
        return document


#Reader base class

Base class for readers (used to read the different document types).


In [None]:
class Reader():
    """
    Interface for Readers.
    """

    def __init__(self):
        super().__init__()
        self.file_types = []

    def load(
        self,
        bytes: list[str],
        contents: list[str],
        paths: list[str],
        fileNames: list[str],
        document_type: str,
    ) -> list[Document]:
        """
        @parameter: bytes : list[str] - List of bytes
        @parameter: contents : list[str] - List of string content
        @parameter: paths : list[str] - List of paths to files
        @parameter: fileNames : list[str] - List of file names
        @parameter: document_type : str - Document type
        @returns list[Document] - Lists of documents.
        """
        raise NotImplementedError("load method must be implemented by a subclass.")


#Implement SimpleReader

SimpleReader reads .txt, .md, .mdx, and .json files.
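
Besides file paths, `SimpleReader.load` also accepts base64-encoded file contents through its `bytes` parameter, paired one-to-one with `fileNames`. A minimal sketch of preparing such a payload with the standard library follows; the file name and text are made up for illustration.

```python
import base64

# Hypothetical file name and content, used only for illustration.
file_name = "notes.txt"
original_text = "Routers forward packets between networks."

# Encode the UTF-8 bytes as base64 text, the form the `bytes` parameter expects.
encoded = base64.b64encode(original_text.encode("utf-8")).decode("ascii")

# The reader reverses this with base64.b64decode(...).decode("utf-8").
decoded_text = base64.b64decode(encoded).decode("utf-8")
print(decoded_text == original_text)
```

The encoded string could then be passed as `reader.load(bytes=[encoded], fileNames=[file_name])`.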



In [None]:
import base64
import glob
import json
from datetime import datetime
from pathlib import Path

from wasabi import msg


class SimpleReader(Reader):
    """
    The SimpleReader reads .txt, .md, .mdx, and .json files. It can handle both paths, content and bytes.
    """

    def __init__(self):
        super().__init__()
        self.file_types = [".txt", ".md", ".mdx", ".json"]
        self.name = "SimpleReader"
        self.description = "Reads text, markdown, and json files."

    def load(
        self,
        bytes: list[str] = None,
        contents: list[str] = None,
        paths: list[str] = None,
        fileNames: list[str] = None,
        document_type: str = "Documentation",
    ) -> list[Document]:
        """
        @parameter: bytes : list[str] - List of bytes
        @parameter: contents : list[str] - List of string content
        @parameter: paths : list[str] - List of paths to files
        @parameter: fileNames : list[str] - List of file names
        @parameter: document_type : str - Document type
        @returns list[Document] - Lists of documents.
        """
        if fileNames is None:
            fileNames = []
        if paths is None:
            paths = []
        if contents is None:
            contents = []
        if bytes is None:
            bytes = []
        documents = []

        # If paths exist
        if len(paths) > 0:
            for path in paths:
                if path != "":
                    data_path = Path(path)
                    if data_path.exists():
                        if data_path.is_file():
                            documents += self.load_file(data_path, document_type)
                        else:
                            documents += self.load_directory(data_path, document_type)
                    else:
                        msg.warn(f"Path {data_path} does not exist")

        # If bytes exist
        if len(bytes) > 0 and len(bytes) == len(fileNames):
            for byte, fileName in zip(bytes, fileNames):
                decoded_bytes = base64.b64decode(byte)
                try:
                    original_text = decoded_bytes.decode("utf-8")
                except UnicodeDecodeError:
                    msg.fail(
                        f"Error decoding text for file {fileName}. The file might not be a text file."
                    )
                    continue

                if ".json" in fileName:
                    json_obj = json.loads(original_text)
                    try:
                        document = Document.from_json(json_obj)
                    except Exception as e:
                        raise Exception(f"Loading JSON failed {e}")

                else:
                    document = Document(
                        name=fileName,
                        text=original_text,
                        type=document_type,
                        timestamp=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
                        reader=self.name,
                    )
                documents.append(document)

        # If content exist
        if len(contents) > 0 and len(contents) == len(fileNames):
            for content, fileName in zip(contents, fileNames):
                document = Document(
                    name=fileName,
                    text=content,
                    type=document_type,
                    timestamp=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
                    reader=self.name,
                )
                documents.append(document)

        msg.good(f"Loaded {len(documents)} documents")
        return documents

    def load_file(self, file_path: Path, document_type: str) -> list[Document]:
        """Loads text file
        @param file_path : Path - Path to file
        @param document_type : str - Document Type
        @returns list[Document] - Lists of documents.
        """
        documents = []

        if file_path.suffix not in self.file_types:
            msg.warn(f"{file_path.suffix} not supported")
            return []

        with open(file_path, encoding="utf-8") as f:
            msg.info(f"Reading {str(file_path)}")

            if file_path.suffix == ".json":
                json_obj = json.loads(f.read())
                try:
                    document = Document.from_json(json_obj)
                except Exception as e:
                    raise Exception(f"Loading JSON failed {e}")

            else:
                document = Document(
                    text=f.read(),
                    type=document_type,
                    name=str(file_path),
                    link=str(file_path),
                    timestamp=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
                    reader=self.name,
                )
            documents.append(document)
        msg.good(f"Loaded {str(file_path)}")
        return documents

    def load_directory(self, dir_path: Path, document_type: str) -> list[Document]:
        """Loads text files from a directory and its subdirectories.

        @param dir_path : Path - Path to directory
        @param document_type : str - Document Type
        @returns list[Document] - List of documents
        """
        # Initialize an empty list to collect the loaded documents
        documents = []

        # Convert dir_path to string, in case it's a Path object
        dir_path_str = str(dir_path)

        # Loop through each file type
        for file_type in self.file_types:
            # Use glob to find all the files in dir_path and its subdirectories matching the current file_type
            files = glob.glob(f"{dir_path_str}/**/*{file_type}", recursive=True)

            # Loop through each file
            for file in files:
                msg.info(f"Reading {str(file)}")
                with open(file, encoding="utf-8") as f:
                    document = Document(
                        text=f.read(),
                        type=document_type,
                        name=str(file),
                        link=str(file),
                        timestamp=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
                        reader=self.name,
                    )

                    documents.append(document)

        msg.good(f"Loaded {len(documents)} documents")
        return documents


In [None]:
reader = SimpleReader()
path = "/content/data"

In [None]:
#import a folder from drive: https://drive.google.com/drive/u/0/folders/1ML3-o2iuPPmHsNWyxWBXTZt20qWL7xe9

!gdown --folder 1ML3-o2iuPPmHsNWyxWBXTZt20qWL7xe9 -O /content/data


Retrieving folder contents
Processing file 1P-qUbqGflFpSp6joMzxSYnp4ZPZVwzX2 25-100.pdf
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1P-qUbqGflFpSp6joMzxSYnp4ZPZVwzX2
To: /content/data/25-100.pdf
100% 799k/799k [00:00<00:00, 137MB/s]
Download completed


#Implement PDFReader

Uses PyPDF2 to read PDF files.

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:

import os  # needed by PDFReader.load to delete the temporary file

try:
    from PyPDF2 import PdfReader
except Exception:
    msg.warn("PyPDF2 not installed, your base installation might be corrupted.")


class PDFReader(Reader):
    """
    The PDFReader reads .pdf files using PyPDF2.
    """

    def __init__(self):
        super().__init__()
        self.file_types = [".pdf"]
        self.requires_library = ["PyPDF2"]
        self.name = "PDFReader"
        self.description = "Reads PDF files using the PyPDF2 library"

    def load(
        self,
        bytes: list[str] = None,
        contents: list[str] = None,
        paths: list[str] = None,
        fileNames: list[str] = None,
        document_type: str = "Documentation",
    ) -> list[Document]:
        """Load PDF documents from paths, base64-encoded bytes, or raw text contents
        @parameter: bytes : list[str] - List of bytes
        @parameter: contents : list[str] - List of string content
        @parameter: paths : list[str] - List of paths to files
        @parameter: fileNames : list[str] - List of file names
        @parameter: document_type : str - Document type
        @returns list[Document] - Lists of documents.
        """
        if fileNames is None:
            fileNames = []
        if paths is None:
            paths = []
        if contents is None:
            contents = []
        if bytes is None:
            bytes = []
        documents = []

        # If paths exist
        if len(paths) > 0:
            for path in paths:
                if path != "":
                    data_path = Path(path)
                    if data_path.exists():
                        if data_path.is_file():
                            documents += self.load_file(data_path, document_type)
                        else:
                            documents += self.load_directory(data_path, document_type)
                    else:
                        msg.warn(f"Path {data_path} does not exist")

        # If bytes exist
        if len(bytes) > 0 and len(bytes) == len(fileNames):
            for byte, fileName in zip(bytes, fileNames):
                decoded_bytes = base64.b64decode(byte)
                with open(f"{fileName}", "wb") as file:
                    file.write(decoded_bytes)

                documents += self.load_file(f"{fileName}", document_type)
                os.remove(f"{fileName}")

        # If content exist
        if len(contents) > 0 and len(contents) == len(fileNames):
            for content, fileName in zip(contents, fileNames):
                document = Document(
                    name=fileName,
                    text=content,
                    type=document_type,
                    timestamp=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
                    reader=self.name,
                )
                documents.append(document)

        msg.good(f"Loaded {len(documents)} documents")
        return documents

    def load_file(self, file_path: Path, document_type: str) -> list[Document]:
        """Loads .pdf file
        @param file_path : Path - Path to file
        @param document_type : str - Document Type
        @returns list[Document] - Lists of documents.
        """
        documents = []
        full_text = ""
        reader = PdfReader(file_path)

        for page in reader.pages:
            full_text += page.extract_text() + "\n\n"

        document = Document(
            text=full_text,
            type=document_type,
            name=str(file_path),
            link=str(file_path),
            timestamp=str(datetime.now().strftime("%Y-%m-%d %H:%M:%S")),
            reader=self.name,
        )
        documents.append(document)
        msg.good(f"Loaded {str(file_path)}")
        return documents

    def load_directory(self, dir_path: Path, document_type: str) -> list[Document]:
        """Loads .pdf files from a directory and its subdirectories.

        @param dir_path : Path - Path to directory
        @param document_type : str - Document Type
        @returns list[Document] - List of documents
        """
        # Initialize an empty list to collect the loaded documents
        documents = []

        # Convert dir_path to string, in case it's a Path object
        dir_path_str = str(dir_path)

        # Loop through each file type
        for file_type in self.file_types:
            # Use glob to find all the files in dir_path and its subdirectories matching the current file_type
            files = glob.glob(f"{dir_path_str}/**/*{file_type}", recursive=True)

            # Loop through each file
            for file in files:
                msg.info(f"Reading {str(file)}")
                # PdfReader opens the file itself in binary mode, so no text-mode open() is needed
                documents += self.load_file(file, document_type=document_type)

        msg.good(f"Loaded {len(documents)} documents")
        return documents


In [None]:
reader = PDFReader()

In [None]:

documents = reader.load_directory(
    dir_path = path,
    document_type="document",
)


ℹ Reading /content/data/25-100.pdf
✔ Loaded /content/data/25-100.pdf
✔ Loaded 1 documents


#Chunker


The Chunker class splits documents into chunks using two parameters:

1.   Units: the number of sentences in one chunk.
2.   Overlap: the number of sentences shared by two adjacent chunks, which helps avoid losing meaning at chunk boundaries.
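
The effect of the two parameters can be sketched with plain strings standing in for sentences. `sliding_windows` is a hypothetical helper written for this illustration; it uses the same stepping rule (advance by `units - overlap`) as the chunker implemented in this notebook.

```python
def sliding_windows(sentences: list[str], units: int, overlap: int) -> list[list[str]]:
    """Group sentences into windows of `units` items sharing `overlap` items."""
    if units < 1 or overlap >= units:
        raise ValueError("units must be >= 1 and overlap must be smaller than units")
    windows = []
    i = 0
    while i < len(sentences):
        end = min(i + units, len(sentences))
        windows.append(sentences[i:end])
        if end == len(sentences):
            break  # the last window reached the end of the document
        i += units - overlap  # step forward, keeping `overlap` sentences
    return windows


sentences = ["s1", "s2", "s3", "s4", "s5"]
for window in sliding_windows(sentences, units=3, overlap=2):
    print(window)
# ['s1', 's2', 's3']
# ['s2', 's3', 's4']
# ['s3', 's4', 's5']
```

With `units=3` and `overlap=2` the window moves one sentence at a time, so a document of N sentences (N >= 3) yields N - 2 chunks.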



In [None]:
class Chunker():
    """
    Interface for Chunking.
    """

    def __init__(self):
        super().__init__()
        self.default_units = 100
        self.default_overlap = 50

    def chunk(
        self, documents: list[Document], units: int, overlap: int
    ) -> list[Document]:
        """Chunk documents into chunks based on units and overlap.

        @parameter: documents : list[Document] - List of documents
        @parameter: units : int - How many units per chunk (words, sentences, etc.)
        @parameter: overlap : int - How much overlap between the chunks
        @returns list[Document] - List of documents that contain the chunks.
        """
        raise NotImplementedError("chunk method must be implemented by a subclass.")


#Implement Sentence chunker


Uses spaCy to split the text into groups of sentences.

In [None]:
import contextlib

from tqdm import tqdm
from wasabi import msg

with contextlib.suppress(Exception):
    import spacy


class SentenceChunker(Chunker):
    """
    SentenceChunker built with spaCy.
    """

    def __init__(self):
        super().__init__()
        self.name = "SentenceChunker"
        self.requires_library = ["spacy"]
        self.default_units = 3
        self.default_overlap = 2
        self.description = "Chunk documents by sentences. You can specify how many sentences should overlap between chunks to improve retrieval."
        try:
            self.nlp = spacy.blank("en")
            self.nlp.add_pipe("sentencizer")
            self.nlp.max_length = 3000000
        except Exception:
            self.nlp = None

    def chunk(
        self, documents: list[Document], units: int, overlap: int
    ) -> list[Document]:
        """Chunk documents into chunks based on units and overlap
        @parameter: documents : list[Document] - List of documents
        @parameter: units : int - How many units per chunk (words, sentences, etc.)
        @parameter: overlap : int - How much overlap between the chunks
        @returns list[Document] - List of documents that contain the chunks.
        """
        for document in tqdm(
            documents, total=len(documents), desc="Chunking documents"
        ):
            # Skip if document already contains chunks
            if len(document.chunks) > 0:
                continue

            doc = list(self.nlp(document.text).sents)

            if units > len(doc) or units < 1:
                msg.warn(
                    f"Unit value either exceeds length of actual document or is below 1 ({units}/{len(doc)})"
                )
                continue

            if overlap >= units:
                msg.warn(
                    f"Overlap value must be smaller than unit (Units {units}/ Overlap {overlap})"
                )
                continue

            i = 0
            split_id_counter = 0
            while i < len(doc):
                # Overlap
                start_i = i
                end_i = i + units
                if end_i > len(doc):
                    end_i = len(doc)  # Adjust for the last chunk

                # Join sentences with a space; sent.text drops the whitespace between sentences
                text = " ".join(sent.text for sent in doc[start_i:end_i])

                doc_chunk = Chunk(
                    text=text,
                    doc_name=document.name,
                    doc_type=document.type,
                    chunk_id=split_id_counter,
                )
                document.chunks.append(doc_chunk)
                split_id_counter += 1

                # Exit loop if this was the last possible chunk
                if end_i == len(doc):
                    break

                i += units - overlap  # Step forward, considering overlap

        return documents


In [None]:
chunker = SentenceChunker()


units = 3 # Number of sentences per chunk
overlap = 2 # Overlap between chunks


chunked_documents = chunker.chunk(documents, units, overlap)


Chunking documents: 100%|██████████| 1/1 [00:00<00:00,  1.52it/s]


In [None]:
# Print the number of chunks in each document

for document in chunked_documents:
    print(len(document.chunks))



1872


#Embedder


The Embedder class converts chunks into vectors.
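
One common way to turn per-token vectors into a single chunk vector is mean pooling, which is also what the `MiniLMEmbedder` in this notebook does with the model's last hidden state. The sketch below shows the arithmetic on tiny hand-written vectors; `mean_pool` and the numbers are illustrative, not real model output.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token vectors element-wise into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


# Three made-up 2-dimensional token vectors (the MiniLM vectors used later have 384 dimensions).
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
print(mean_pool(token_vectors))  # [2.0, 2.0]
```

The output has the same dimensionality regardless of how many tokens the chunk contains, which is what allows chunks of different lengths to live in the same vector space.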

In [None]:
import time
import re

def strip_non_letters(s: str):
    return re.sub(r"[^a-zA-Z0-9]", "_", s)

In [None]:
class Embedder():
    """
    Interface for Embedding.
    """

    def __init__(self):
        super().__init__()
        self.vectorizer = ""

    def embed(self, documents: list[Document], batch_size: int = 100) -> bool:
        """Embed documents and their chunks
        @parameter: documents : list[Document] - List of documents
        @parameter: batch_size : int - Batch Size of Input
        @returns bool - Bool whether the embedding was successful.
        """
        raise NotImplementedError("embed method must be implemented by a subclass.")

In [None]:


!mkdir sentence-transformers
!git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 sentence-transformers/all-MiniLM-L6-v2


Cloning into 'sentence-transformers/all-MiniLM-L6-v2'...
remote: Enumerating objects: 61, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 61 (delta 22), reused 54 (delta 19), pack-reused 0 (from 0)
Unpacking objects: 100% (61/61), 316.23 KiB | 6.08 MiB/s, done.
Filtering content: 100% (5/5), 433.05 MiB | 52.94 MiB/s, done.


In [None]:
class MiniLMEmbedder(Embedder):
    """
    MiniLMEmbedder
    """

    def __init__(self):
        super().__init__()
        self.name = "MiniLMEmbedder"
        self.requires_library = ["torch", "transformers"]
        self.description = "Embeds and retrieves objects using SentenceTransformer's all-MiniLM-L6-v2 model"
        self.vectorizer = "MiniLM"
        self.model = None
        self.tokenizer = None
        try:
            import torch
            from transformers import AutoModel, AutoTokenizer

            def get_device():
                if torch.cuda.is_available():
                    return torch.device("cuda")
                elif torch.backends.mps.is_available():
                    return torch.device("mps")
                else:
                    return torch.device("cpu")

            self.device = get_device()

            self.model = AutoModel.from_pretrained(
                "sentence-transformers/all-MiniLM-L6-v2", device_map=self.device
            )
            self.tokenizer = AutoTokenizer.from_pretrained(
                "sentence-transformers/all-MiniLM-L6-v2", device_map=self.device
            )
            self.model = self.model.to(self.device)

        except Exception as e:
            msg.warn(str(e))
            pass

    def embed(
        self,
        documents: list[Document],
    ) -> list[Document]:
        """Embed documents and their chunks
        @parameter: documents : list[Document] - List of documents
        @returns list[Document] - The same documents, with a vector set on every chunk.
        """
        for document in tqdm(
            documents, total=len(documents), desc="Vectorizing document chunks"
        ):
            for chunk in document.chunks:
                chunk.set_vector(self.vectorize_chunk(chunk.text))

        return documents

    def vectorize_chunk(self, chunk) -> list[float]:
        try:
            import torch

            text = chunk
            tokens = self.tokenizer.tokenize(text)

            max_length = (
                self.tokenizer.model_max_length
            )  # Get the max sequence length for the model
            batches = []
            batch = []
            token_count = 0

            for token in tokens:
                token_length = len(
                    self.tokenizer.encode(token, add_special_tokens=False)
                )
                if token_count + token_length <= max_length:
                    batch.append(token)
                    token_count += token_length
                else:
                    batches.append(" ".join(batch))
                    batch = [token]
                    token_count = token_length

            # Don't forget to add the last batch
            if batch:
                batches.append(" ".join(batch))

            embeddings = []

            for batch in batches:
                inputs = self.tokenizer(
                    batch, return_tensors="pt", padding=True, truncation=True
                )
                inputs = {k: v.to(self.device) for k, v in inputs.items()}
                with torch.no_grad():
                    outputs = self.model(**inputs)
                # Taking the mean of the hidden states to obtain an embedding for the batch
                embedding = outputs.last_hidden_state.mean(dim=1)
                embeddings.append(embedding)

            # Concatenate the embeddings to make averaging easier
            all_embeddings = torch.cat(embeddings)

            averaged_embedding = all_embeddings.mean(dim=0)

            averaged_embedding_list = averaged_embedding.tolist()

            return averaged_embedding_list

        except Exception:
            raise

    def vectorize_query(self, query: str) -> list[float]:
        return self.vectorize_chunk(query)


In [None]:
# Print the maximum sequence length of the embedder's tokenizer
embedder = MiniLMEmbedder()

print(embedder.tokenizer.model_max_length)


512


In [None]:
# Vectorize a sample chunk

chunk = "This is a chunk of text."
embedding = embedder.vectorize_chunk(chunk)
print(embedding)

[-0.014248672872781754, 0.6279491782188416, 0.07291051745414734, -0.06212655082345009, 0.2749476730823517, -0.026274532079696655, 0.10822974890470505, 0.09660141170024872, 0.3085515797138214, -0.008498645387589931, 0.043142061680555344, 0.29481783509254456, -0.10924308747053146, -0.24091914296150208, -0.011748868972063065, 0.08396703004837036, 0.179739847779274, -0.32621002197265625, -0.13620297610759735, 0.02536865696310997, 0.1376171112060547, 0.5911679863929749, 0.012456698343157768, 0.18711793422698975, 0.24485571682453156, 0.6204167604446411, -0.3848669230937958, 0.2235054224729538, 0.1716795563697815, 0.05362347140908241, -0.07576702535152435, 0.0646846666932106, 0.548338770866394, 0.11030733585357666, -0.013388832099735737, 0.2673846483230591, -0.050195179879665375, 0.3121611177921295, 0.06551774591207504, 0.0342801995575428, 0.22141079604625702, -0.33449649810791016, 0.007368538063019514, 0.4206918179988861, 0.07220632582902908, 0.013380239717662334, -0.36111459136009216, 0.019

In [None]:
embedder.embed(chunked_documents)

Vectorizing document chunks: 100%|██████████| 1/1 [00:16<00:00, 16.93s/it]


[<__main__.Document at 0x7ec9ef69dea0>]

#Install vector database


A vector database stores the vectors produced by the embedding step. Qdrant is an open-source vector database. Here I use the free API that Qdrant provides (with limited storage) to connect to a vector store that holds these vectors.
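
Conceptually, a vector database answers nearest-neighbour queries over stored vectors. The following sketch is a brute-force, in-memory stand-in and not how Qdrant is implemented; `cosine_similarity`, `top_k`, and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k stored ids most similar to the query, best first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings.
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
print(results)
```

A real vector database typically replaces this linear scan with an approximate index so that queries stay fast as the collection grows.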

In [None]:
!pip install -U -q qdrant_client

In [None]:
from qdrant_client import QdrantClient, models


In [None]:
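import os

# Read the credential from the environment instead of hard-coding it in the notebook
client = QdrantClient(
    url="https://062652c5-bc12-45a8-9bd6-573590436438.us-east4-0.gcp.cloud.qdrant.io:6333",
    api_key=os.environ["QDRANT_API_KEY"],  # assumed variable name; set it before running
)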

In [None]:
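try:
    # size=384 must match the dimensionality of the vectors produced by the embedder
    client.create_collection(
        collection_name="Computer Network",
        vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
    )
except Exception as e:
    # Creation can fail, e.g. if the collection already exists; log the error and continue
    msg.warn(str(e))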


Each vector is stored as a point; the next step saves these points to the vector database.

In [None]:
def create_point(chunk, vector):
    return models.PointStruct(
        id=chunk.chunk_id,
        vector=vector,
        payload={"text": chunk.text, "doc_name": chunk.doc_name, "doc_type": chunk.doc_type}
    )

points = []
for document in chunked_documents:
    for chunk in document.chunks:
        vector = chunk.vector
        point = create_point(chunk, vector)
        points.append(point)

try:
    client.upsert(collection_name="Computer Network", points=points)
    print("Successfully inserted vectors into Qdrant")
except Exception as e:
    print(f"Failed to insert vectors: {str(e)}")


Successfully inserted vectors into Qdrant
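
For larger collections, sending every point in a single request may become slow or run into request-size limits. One possible approach, sketched below, is to upsert in fixed-size batches. The `batched` helper and the batch size are hypothetical; the commented-out line marks where the `client.upsert` call from the cell above would go.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the list of PointStruct objects built above.
demo_points = list(range(7))
batches = [list(batch) for batch in batched(demo_points, batch_size=3)]
for batch in batches:
    # client.upsert(collection_name="Computer Network", points=batch)
    print(f"would upsert {len(batch)} points")
```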


In [None]:
query = "What is computer network"
query_vector = embedder.vectorize_query(query)

# Search for similar vectors
search_result = client.search(
    collection_name="Computer Network",
    query_vector=query_vector,
    limit=10,  # Number of results to retrieve
    with_payload=True  # Retrieve the stored payload (metadata)
)

# Retrieve metadata
metadata = []
for result in search_result:
    metadata.append(result.payload)

# Process and display results
for item in metadata:
    print(f"Document Name: {item['doc_name']}")
    print(f"Document Type: {item['doc_type']}")
    print(f"Text: {item['text']}")
    print()

Document Name: /content/data/25-100.pdf
Document Type: document
Text: 
Throughout the book we will use the term ‘‘computer network’’ to mean a col-
lection of autonomous computers interconnected by a single technology.Two
computers are said to be interconnected if they are able to exchange information.
The connection need not be via a copper wire; fiber optics, microwaves, infrared,
and communication satellites can also be used.

Document Name: /content/data/25-100.pdf
Document Type: document
Text: The design and organization of
these networks are the subjects of this book.
Throughout the book we will use the term ‘‘computer network’’ to mean a col-
lection of autonomous computers interconnected by a single technology.Two
computers are said to be interconnected if they are able to exchange information.

Document Name: /content/data/25-100.pdf
Document Type: document
Text: 
These systems are called computer networks .The design and organization of
these networks are the subjects of this
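
Because the chunks overlap, several of the hits above repeat the same sentences. One way to reduce this redundancy before prompting, shown here as a minimal sketch, is to drop any hit whose text is fully contained in a hit that was already kept. `drop_contained` and the sample payloads are assumptions for illustration; a real pipeline may prefer a different strategy.

```python
def drop_contained(payloads: list[dict]) -> list[dict]:
    """Keep payloads in ranked order, skipping texts contained in a kept text."""
    kept: list[dict] = []
    for payload in payloads:
        text = " ".join(payload["text"].split())  # normalise whitespace and line breaks
        if any(text in " ".join(k["text"].split()) for k in kept):
            continue
        kept.append(payload)
    return kept


# Hypothetical payloads shaped like the ones printed above.
sample = [
    {"doc_name": "a.pdf", "text": "A network is a collection of computers. They exchange information."},
    {"doc_name": "a.pdf", "text": "They exchange\ninformation."},
    {"doc_name": "a.pdf", "text": "Routers forward packets."},
]
unique = drop_contained(sample)
print([p["text"] for p in unique])
```

This only removes exact containment; partially overlapping chunks, like the first two hits above, would still both be kept.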

#Generator



In [None]:
!pip install -q llama-index-llms-groq

In [None]:
from llama_index.llms.groq import Groq
llm = Groq(model="mixtral-8x7b-32768", api_key="API KEY")

response = llm.complete("Explain the importance of low latency LLMs")
print(response)

LLMs, or low-latency messaging systems, are critical for applications that require real-time communication and data transfer. Low latency refers to the time it takes for a message to travel from the sender to the receiver, and low-latency LLMs aim to minimize this time as much as possible.

The importance of low latency LLMs can be explained through the following points:

1. Real-time communication: Low-latency LLMs enable real-time communication between applications, devices, and systems. This is critical for applications such as online gaming, financial trading, and industrial automation, where real-time data transfer is essential for optimal performance.
2. Improved user experience: Low latency LLMs can significantly improve the user experience by reducing the time it takes for data to be transferred between applications and devices. This can lead to faster response times, smoother interactions, and a more enjoyable user experience.
3. Increased efficiency: Low-latency LLMs can incr

In [None]:
def prompted_user_question(user_question, metadata):
    relevant_information = "" # Initialize an empty string

    # Populate relevant_information with text from metadata
    for item in metadata:
        relevant_information += f"* {item['text']}\n"

    # Prepare the prompt for the LLM
    prompt = f"""The user asked: "{user_question}".

    Based on the information retrieved from the vector database, here are the most relevant pieces of information:
    {relevant_information}
    """

    return prompt

user_question = "What is Computer network?"

# Call the function with the sample data
prompt = prompted_user_question(user_question, metadata)
print(prompt)

The user asked: "What is Computer network?".

    Based on the information retrieved from the vector database, here are the most relevant pieces of information:
    * 
Throughout the book we will use the term ‘‘computer network’’ to mean a col-
lection of autonomous computers interconnected by a single technology.Two
computers are said to be interconnected if they are able to exchange information.
The connection need not be via a copper wire; fiber optics, microwaves, infrared,
and communication satellites can also be used.
* The design and organization of
these networks are the subjects of this book.
Throughout the book we will use the term ‘‘computer network’’ to mean a col-
lection of autonomous computers interconnected by a single technology.Two
computers are said to be interconnected if they are able to exchange information.
* 
These systems are called computer networks .The design and organization of
these networks are the subjects of this book.
Throughout the book we will use th
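
The prompt above passes the retrieved text to the model but gives no instruction on how to use it. As a sketch, the template quoted in the introduction can be filled in the same way. `build_rag_prompt` and `max_context_chars` are hypothetical names, and the character budget is an arbitrary assumption rather than a tuned value.

```python
RAG_TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.

{context}

Question: {question}

Helpful Answer:"""


def build_rag_prompt(question: str, payloads: list[dict], max_context_chars: int = 2000) -> str:
    """Join retrieved texts into a context block, stopping at a character budget."""
    parts: list[str] = []
    used = 0
    for payload in payloads:
        text = payload["text"].strip()
        if used + len(text) > max_context_chars:
            break
        parts.append(f"* {text}")
        used += len(text)
    return RAG_TEMPLATE.format(context="\n".join(parts), question=question)


# Hypothetical payloads; in the notebook these would come from `metadata`.
demo_payloads = [{"text": "A computer network is a collection of interconnected computers."}]
demo_prompt = build_rag_prompt("What is a computer network?", demo_payloads)
print(demo_prompt)
```

Capping the context length is one simple way to keep the prompt within the model's context window when many chunks are retrieved.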

In [None]:
from llama_index.llms.groq import Groq
llm = Groq(model="mixtral-8x7b-32768", api_key="API KEY")

response = llm.complete(prompt)
print(response)

A computer network is a collection of autonomous computers that are interconnected using a single technology, allowing them to exchange information. This interconnection can occur through various means, such as copper wire, fiber optics, microwaves, infrared, and communication satellites. The design and organization of these networks are the focus of this book.

Computer networks can serve various purposes, including facilitating communication among employees and enabling the sharing of information and resources. They can also replace the traditional "computer center" model, where a large computer in a single room handles all of an organization's computational needs. Instead, a network of separate but interconnected computers can perform these tasks.

Networks can come in many sizes, shapes, and forms and can be connected to create larger networks, with the Internet being the most well-known example. While computer networks offer many benefits, such as easy communication and resource s

#Conclusion
1. High adaptability:

RAG can incorporate new knowledge easily without fully retraining the LLM, which helps the system keep up with information and data that change over time.
As a result, RAG can give more accurate and reliable answers than a model that relies only on its training data.
2. Greater transparency:

RAG exposes the sources the LLM used to produce its output, so users can readily check the accuracy and reliability of the information.
This matters especially in fields such as law, where the accuracy of information is critical.
3. Better explainability:

Knowing where the information came from helps users understand how the LLM reached its conclusion, so they can make better-informed assessments and decisions.
This explainability also supports the development and improvement of future LLMs.
4. Cost savings:

Updating knowledge through RAG is usually cheaper than fully retraining the LLM.
This makes generative AI applications more accessible to a wider range of users and businesses.
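
The transparency point can be made concrete by returning the retrieved sources alongside the generated answer. The sketch below assumes payloads shaped like those stored earlier (with `doc_name` and `text` keys); `answer_with_sources` is a hypothetical helper, and the answer string is a placeholder rather than real model output.

```python
def answer_with_sources(answer: str, payloads: list[dict]) -> str:
    """Append a de-duplicated, ordered list of source documents to an answer."""
    sources: list[str] = []
    for payload in payloads:
        name = payload["doc_name"]
        if name not in sources:
            sources.append(name)
    lines = [answer, "", "Sources:"]
    lines += [f"[{i}] {name}" for i, name in enumerate(sources, start=1)]
    return "\n".join(lines)


# Placeholder answer and hypothetical payloads for illustration.
demo = answer_with_sources(
    "A computer network is a collection of interconnected computers.",
    [
        {"doc_name": "/content/data/25-100.pdf", "text": "..."},
        {"doc_name": "/content/data/25-100.pdf", "text": "..."},
    ],
)
print(demo)
```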