### Loading all the Documents

In [1]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader(
    path="../datas/",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)

documents = loader.load()

print(f"Loaded {len(documents)} documents")
print(f"\nFirst document content:\n{documents[0].page_content}")
print(documents[0].page_content[:500])  # Print first 500 characters

Loaded 3 documents

First document content:
Python is a high-level, interpreted programming language known for its readability and concise syntax, which makes it ideal for beginners and professionals alike. It supports multiple paradigms, including object-oriented, functional, and procedural programming. With a vast standard library and rich ecosystem of third-party packages (like NumPy, Pandas, Django, and FastAPI), Python excels in domains such as data science, web development, automation, scripting, and AI/ML. Its strong community, cross-platform support, and ease of integration with other languages and systems make it a versatile choice for rapid development and production-ready applications.
Python is a high-level, interpreted programming language known for its readability and concise syntax, which makes it ideal for beginners and professionals alike. It supports multiple paradigms, including object-oriented, functional, and procedural programming. With a vast standard library and
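
To make the loading step less of a black box, here is a minimal standard-library sketch of roughly what the cell above does: match `*.txt` files in one directory and read each as UTF-8. It is an approximation, not the `DirectoryLoader` implementation. The helper name `load_text_files`, the dict layout, and the temporary demo files are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_text_files(directory, pattern="*.txt", encoding="utf-8"):
    """Read every file matching `pattern` directly inside `directory`.

    Returns a list of dicts with `page_content` and `metadata` keys, loosely
    mirroring the shape of the Document objects used above (an assumption
    made for this sketch, not the library's own class).
    """
    records = []
    # sorted() keeps the order deterministic across runs and platforms.
    for path in sorted(Path(directory).glob(pattern)):
        records.append(
            {
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            }
        )
    return records


# Demo on a throwaway directory so the sketch does not depend on ../datas/.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Python basics", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Machine learning basics", encoding="utf-8")
    Path(tmp, "notes.md").write_text("skipped by the *.txt pattern", encoding="utf-8")

    demo_docs = load_text_files(tmp)

print(f"Loaded {len(demo_docs)} documents")
print(demo_docs[0]["page_content"][:500])
```

Passing the encoding explicitly, as the original cell does through `loader_kwargs`, avoids depending on the platform's default encoding.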

In [2]:
# Initialize the text splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)

chunks = text_splitter.split_documents(documents=documents)

print(f"Created {len(chunks)} chunks")
print(f"\nFirst chunk content:\n{chunks[0].page_content}")
print(f"Content {chunks[0].page_content[:500]}")  # Print first 500 characters
print(f"Metadata {chunks[0].metadata}")  # Print metadata of the first chunk

Created 6 chunks

First chunk content:
Deep learning is a subset of machine learning that uses multi-layered neural networks to automatically learn hierarchical representations from data. A neural network is composed of interconnected nodes (“neurons”) organized in layers—input, hidden, and output—that transform inputs through learned weights and nonlinear activations. By training on large datasets with optimization methods like stochastic gradient descent, these models capture complex patterns for tasks such as image recognition,
Content Deep learning is a subset of machine learning that uses multi-layered neural networks to automatically learn hierarchical representations from data. A neural network is composed of interconnected nodes (“neurons”) organized in layers—input, hidden, and output—that transform inputs through learned weights and nonlinear activations. By training on large datasets with optimization methods like stochastic gradient descent, these models capture complex pat
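
To see how `chunk_size` and `chunk_overlap` interact, the sketch below uses a plain fixed-width sliding window over characters. This is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` also tries the separators in order so that chunks end on paragraph, line, or word boundaries where possible, which this version ignores. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 1200-character input, used only for the demo.
demo_text = "abcdefghij" * 120
demo_chunks = split_with_overlap(demo_text, chunk_size=500, chunk_overlap=50)

print(f"Created {len(demo_chunks)} chunks")
print([len(chunk) for chunk in demo_chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.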

In [2]:
from dotenv import load_dotenv
load_dotenv()

True
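
The `True` above only reports that `load_dotenv()` loaded something. It does not confirm that `OPENAI_API_KEY` in particular is set. A small guard like the following sketch can fail early with a clear message. The helper `require_env` and the variable `DEMO_API_KEY` are hypothetical names used only for this example.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in the shell")
    return value


# Demo with a made-up variable so no real secret is involved.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

In this notebook the same check could be applied to `OPENAI_API_KEY` before the embedding function is constructed.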

### Loading the Documents into ChromaDB

In [4]:
from dotenv import load_dotenv
load_dotenv()

import os
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


persist_directory = "./chroma_db"
client = chromadb.PersistentClient(
    path=persist_directory,
)

collection = client.get_or_create_collection(
    name="my_collection",
    embedding_function=OpenAIEmbeddingFunction(
        api_key=os.getenv("OPENAI_API_KEY"),
        model_name="text-embedding-3-small",
    ),
)

# Index the full documents (not the chunks created earlier): one vector per file.
collection.add(
    ids=[str(i) for i in range(len(documents))],
    documents=[document.page_content for document in documents],
)

print(f"Vector store created with {collection.count()} vectors")
print(f"Persisted at: {persist_directory}")

Vector store created with 3 vectors
Persisted at: ./chroma_db
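
The cell above stores the three whole documents, so the six chunks produced earlier are not used. If the chunks were indexed instead, each one would need its own unique id. The sketch below shows one possible way to derive ids and metadata per chunk. The helper `build_chunk_records`, the `source#index` id format, and the sample data are assumptions, and chunks are represented as plain `(source, text)` tuples rather than LangChain objects.

```python
from collections import defaultdict


def build_chunk_records(chunks):
    """Turn (source, text) pairs into parallel lists of ids, texts, and metadata.

    Ids have the form "<source>#<n>", where n counts chunks per source, so
    they stay unique even when several chunks come from the same file.
    """
    counters = defaultdict(int)
    ids, texts, metadatas = [], [], []
    for source, text in chunks:
        index = counters[source]
        counters[source] += 1
        ids.append(f"{source}#{index}")
        texts.append(text)
        metadatas.append({"source": source, "chunk_index": index})
    return ids, texts, metadatas


# Made-up sample data standing in for real chunks.
demo_pairs = [
    ("a.txt", "first part of a"),
    ("a.txt", "second part of a"),
    ("b.txt", "only part of b"),
]
ids, texts, metadatas = build_chunk_records(demo_pairs)

print(ids)
print(metadatas[1])
```

The three lists line up with the `ids`, `documents`, and `metadatas` arguments of `collection.add`. Ids built from the source and position are also stable across runs, which makes it easier to avoid inserting the same chunk twice into a persisted collection.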


In [21]:
query = "What is Types of Machine Learning?"

# Retrieve the 3 stored documents closest to the query embedding.
similar_docs = collection.query(
    query_texts=[query],
    n_results=3,
)
print(similar_docs["documents"])
print(similar_docs["distances"])

[['Machine learning fundamentals center on teaching computers to learn patterns from data and make predictions without being explicitly programmed. Core steps include collecting and cleaning data, selecting features, choosing a model (e.g., linear regression, decision trees, neural networks), and training it by minimizing a loss function with algorithms like gradient descent. We evaluate performance with metrics such as accuracy, precision/recall, RMSE, or AUC on validation/test sets and guard against overfitting with techniques like regularization and cross-validation. Understanding bias-variance trade-offs, data leakage, and proper model deployment/monitoring is essential for building reliable, generalizable systems.', 'Deep learning is a subset of machine learning that uses multi-layered neural networks to automatically learn hierarchical representations from data. A neural network is composed of interconnected nodes (“neurons”) organized in layers—input, hidden, and output—that tra