# Basic Text Chunking
- We use a basic text splitter to produce the chunks.
- Images and table data are ignored in this case.
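
As a rough illustration of what a basic text splitter does, the sketch below cuts a string into fixed-size character windows with an optional overlap. The function name `split_text` and its `overlap` parameter are assumptions made for illustration; they are not part of `unstructured` or LangChain, and the notebook itself relies on `partition_pdf` for the actual chunking.

```python
def split_text(text: str, max_chars: int = 1024, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Consecutive windows share `overlap` characters.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window reaches the end of the text
        if start + max_chars >= len(text):
            break
    return chunks


print(split_text("abcdefghij", max_chars=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

A splitter like this ignores sentence and section boundaries, which is the main reason to prefer structure-aware chunking when the document layout is available.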

## Setup

### Setup Vector Database to store vectors
We will be using Qdrant as our vector database. It supports vector search combined with filtering on metadata (payload), which makes it convenient to use, and it is open source.

In [1]:
# !pip install qdrant_client
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")


# Smoke test: create and delete a throwaway collection to confirm the client can reach the server
from qdrant_client.models import Distance, VectorParams

client.create_collection(
    collection_name="test_collection",
    vectors_config=VectorParams(size=4, distance=Distance.DOT),
)

client.delete_collection(collection_name="test_collection")

True

In [2]:
# !pip install langchain_text_splitters

### Setup LangChain and OpenAI API key

In [3]:
# !pip install langchain-openai
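from langchain_openai import OpenAIEmbeddings

# Maximum chunk length in characters, used by partition_pdf below
MAX_CHUNK_SIZE = 1024

# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
# Note: its chunk_size is the number of texts sent per request (batch size),
# not a character limit; the notebook's original value is kept here.
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small", chunk_size=MAX_CHUNK_SIZE)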

# Testing if the embeddings model is working
# embeddings_model.embed_documents(["Hello, world!"])

## Generate Chunks from PDF files

In this exploration, we use chunks of at most 1024 characters (`MAX_CHUNK_SIZE`).

In [4]:
from unstructured.documents.elements import Element
from unstructured.partition.pdf import partition_pdf

def generate_elements(file_path: str) -> list[Element]:
    """Partition a PDF and chunk the result by section title."""
    elements = partition_pdf(
        filename=file_path,
        chunking_strategy="by_title",
        max_characters=MAX_CHUNK_SIZE,
        # Unstructured helpers
        strategy="auto",
        infer_table_structure=True,
        model_name="yolox",
        extract_images_in_pdf=True,
        image_output_dir_path="static/pdfImages/",
    )

    return elements
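
To give a feel for what `chunking_strategy="by_title"` combined with `max_characters` does, here is a deliberately simplified sketch: consecutive elements are merged into one chunk until a title starts a new section or the size limit would be exceeded. The `(category, text)` tuples and the `chunk_by_title` helper are hypothetical stand-ins, not the `unstructured` data model or its actual algorithm, which covers more cases (for example, splitting a single oversized element).

```python
def chunk_by_title(elements, max_chars=1024):
    """Merge (category, text) pairs into chunks, starting a new chunk at each title."""
    chunks = []
    current = []      # texts collected for the chunk being built
    current_len = 0   # length of that chunk once joined with blank lines

    def flush():
        nonlocal current, current_len
        if current:
            chunks.append("\n\n".join(current))
        current, current_len = [], 0

    for category, text in elements:
        sep = 2 if current else 0  # the "\n\n" separator counts toward the limit
        if category == "Title" or current_len + sep + len(text) > max_chars:
            flush()
            sep = 0
        current.append(text)
        current_len += sep + len(text)

    flush()
    return chunks


sample = [
    ("Title", "1 Introduction"),
    ("NarrativeText", "MapReduce is a programming model."),
    ("Title", "2 Programming Model"),
    ("NarrativeText", "The computation takes a set of input pairs."),
]
for chunk in chunk_by_title(sample):
    print(repr(chunk))
```

In this simplified version an element longer than `max_chars` still ends up as one oversized chunk.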


In [5]:
pdf_path = "../../research_papers/mapreduce.pdf"

In [6]:
elements = generate_elements(pdf_path)

## Create embeddings and store them in Qdrant

In [12]:
embeddings = embeddings_model.embed_documents([element.text for element in elements])

In [21]:
if not client.collection_exists("pdf_collection"):
    client.create_collection(
        collection_name="pdf_collection",
        vectors_config=VectorParams(size=len(embeddings[0]), distance=Distance.COSINE),
    )
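
The test collection earlier used `Distance.DOT`, while this collection uses `Distance.COSINE`. The sketch below, in plain Python with made-up vectors, shows how the two measures differ: cosine similarity divides the dot product by both vector lengths, so it ignores magnitude, and the two coincide for unit-length vectors. The helper names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(dot(a, b))     # 50.0: grows with vector magnitude
print(cosine(a, b))  # 1.0: only the direction matters

# After normalization the dot product matches the cosine similarity
print(math.isclose(dot(normalize(a), normalize(b)), cosine(a, b)))  # True
```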

In [22]:
import uuid
from qdrant_client.models import PointStruct

# Build one point per chunk: a random UUID as ID, the embedding as vector,
# and the chunk text plus its page number as payload
points = []
for element, embedding in zip(elements, embeddings):
    point = PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload={"text": element.text, "page": element.metadata.page_number},
    )
    points.append(point)

client.upsert(collection_name="pdf_collection", points=points)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
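
Because each point ID above comes from `uuid.uuid4()`, re-running the cell adds a fresh copy of every chunk instead of overwriting the existing points. One possible alternative, sketched here as an assumption rather than something the notebook does, is to derive the ID from the source path and chunk index with `uuid.uuid5`, so that repeated upserts target the same IDs. The helper name `chunk_id` and the `#chunk-N` naming scheme are hypothetical.

```python
import uuid

def chunk_id(source: str, index: int) -> str:
    """Return a stable UUID string for the index-th chunk of a source document."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#chunk-{index}"))

first = chunk_id("../../research_papers/mapreduce.pdf", 0)
again = chunk_id("../../research_papers/mapreduce.pdf", 0)
other = chunk_id("../../research_papers/mapreduce.pdf", 1)

print(first == again)  # True: same input gives the same ID
print(first == other)  # False: a different chunk index gives a different ID
```

The trade-off is that these IDs are tied to chunk order: if the chunking parameters change, the same index may refer to different text.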

## Testing out search

In [23]:
question = "execution steps of map reduce"

In [24]:
# embed_query is the single-string counterpart of embed_documents
question_embedding = embeddings_model.embed_query(question)
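
Conceptually, a similarity search ranks the stored vectors by their cosine similarity to the query embedding and returns the best few. The brute-force sketch below uses toy 2-dimensional vectors and a hypothetical `top_k` helper. It only illustrates the ranking idea; Qdrant typically relies on an approximate index (HNSW) instead of scanning every point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, points, k=10):
    """Return the k (score, payload) pairs with the highest cosine similarity."""
    scored = [(cosine(query, vector), payload) for vector, payload in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy stand-ins for stored points: (vector, payload)
toy_points = [
    ([1.0, 0.0], {"text": "chunk A", "page": 1}),
    ([0.0, 1.0], {"text": "chunk B", "page": 2}),
    ([1.0, 1.0], {"text": "chunk C", "page": 3}),
]

# chunk A ranks first, then chunk C
for score, payload in top_k([1.0, 0.1], toy_points, k=2):
    print(round(score, 3), payload["text"])
```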

In [27]:
client.search(collection_name="pdf_collection", query_vector=question_embedding, limit=10)

[ScoredPoint(id='e83b17a3-cdd1-4828-9b0a-e4aef16d8177', version=0, score=0.6226226, payload={'text': '3.1 Execution Overview\n\nThe map invocations are distributed across multiple machines by auto- matically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invo- cations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.\n\nFigure 1 shows the overall flow of a MapReduce operation in our implementation. When the user program calls the MapReduce func- tion, the following sequence of actions occurs (the numbered labels in Figure 1 correspond to the numbers in the following list).\n\n1. The MapReduce library in the user program first splits the input files into M pieces of typically 16-64MB per piece (controllable by the user via an optional parameter).