# Generating Chunks

Date: 2024/10/05, 2024/11/25

## Chunks (one per paragraph)

In [1]:
import glob

# Reference documents
REF_DOCS = "./virtual_showroom/*.txt"

In [2]:
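import os

docs = {}

for path in glob.glob(REF_DOCS):
    # Use the file name without its extension as the collection name
    collection = os.path.splitext(os.path.basename(path))[0]
    # Assumes the reference documents are UTF-8 encoded (they contain characters such as "ü")
    with open(path, 'r', encoding='utf-8') as f:
        docs[collection] = f.read()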

In [3]:
docs['hansaplatz'][:1000]

"Hanseatic Square (Hanseplatz), located in Hamburg, Germany, holds great historical and cultural significance, reflecting the rich heritage of the Hanseatic League. As one of the most prominent open spaces in the city, Hanseplatz is not only a popular destination for tourists but also a key location for locals to gather, socialize, and participate in various public events. This vast public square is symbolic of Hamburg's maritime legacy and its status as a former Hanseatic city. To understand its importance, we need to explore both the history of the Hanseatic League and how this historical connection is manifested in Hanseatic Square.\n\n### Historical Context: The Hanseatic League\n\nThe Hanseatic League was a powerful economic and political alliance of merchant guilds and market towns in northern Europe, which flourished between the 13th and 17th centuries. At its peak, the League comprised over 200 cities across the Baltic Sea and North Sea regions. Hamburg, along with Lübeck and B

In [4]:
print(docs['hansaplatz'].replace('\n\n', '\n').split('\n')[:3])

["Hanseatic Square (Hanseplatz), located in Hamburg, Germany, holds great historical and cultural significance, reflecting the rich heritage of the Hanseatic League. As one of the most prominent open spaces in the city, Hanseplatz is not only a popular destination for tourists but also a key location for locals to gather, socialize, and participate in various public events. This vast public square is symbolic of Hamburg's maritime legacy and its status as a former Hanseatic city. To understand its importance, we need to explore both the history of the Hanseatic League and how this historical connection is manifested in Hanseatic Square.", '### Historical Context: The Hanseatic League', "The Hanseatic League was a powerful economic and political alliance of merchant guilds and market towns in northern Europe, which flourished between the 13th and 17th centuries. At its peak, the League comprised over 200 cities across the Baltic Sea and North Sea regions. Hamburg, along with Lübeck and 
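The split above treats every line as a candidate chunk, so blank lines and `###` headings still have to be filtered out. A minimal sketch of that idea as a reusable function follows. It is a variant, not the notebook's exact code: `split_into_chunks`, its `heading_prefix` parameter, and `sample_doc` are hypothetical names introduced for illustration, and it uses `str.splitlines()` plus `strip()` so that whitespace-only lines are dropped as well.

```python
def split_into_chunks(doc, heading_prefix="###"):
    """Return the non-empty, non-heading lines of doc as paragraph chunks."""
    chunks = []
    for line in doc.splitlines():
        line = line.strip()
        # Skip blank lines and Markdown-style headings
        if line and not line.startswith(heading_prefix):
            chunks.append(line)
    return chunks


# Made-up sample text, not taken from the reference documents
sample_doc = "Intro paragraph.\n\n### A heading\n\nSecond paragraph.\n   \nThird paragraph.\n"
print(split_into_chunks(sample_doc))
```

Stripping each line is a design choice: it keeps stray indentation out of the stored chunks, at the cost of no longer preserving the text byte for byte.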

In [5]:
all_chunks = {}

for collection, doc in docs.items():
    chunks = doc.replace('\n\n', '\n').split('\n')
    chunks = [c for c in chunks if c != "" and not c.startswith('###')]
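    # Note: max(chunks) compares strings lexicographically, so this is the length
    # of the chunk that sorts last, not the length of the longest chunk
    print(f'Max chunk size of "{collection}": {len(max(chunks))}')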
    all_chunks[collection] = chunks

Max chunk size of "hansaplatz": 436
Max chunk size of "yokohama": 462
Max chunk size of "hamburg_station": 361
Max chunk size of "takanawa_gateway_station": 404
Max chunk size of "dresden_station": 190
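To report the longest chunk per collection, the comparison has to be made on length rather than on the strings themselves. A minimal sketch, assuming the same `{collection: [chunk, ...]}` layout as `all_chunks`; the helper `longest_chunk_sizes` and the sample data are hypothetical and only serve the illustration.

```python
def longest_chunk_sizes(all_chunks):
    """Map each collection name to the length of its longest chunk."""
    return {
        # default=0 covers a collection without any chunks
        collection: max((len(c) for c in chunks), default=0)
        for collection, chunks in all_chunks.items()
    }


# Made-up sample data, not taken from the reference documents
sample = {
    "demo_a": ["zz", "a much longer paragraph"],
    "demo_b": [],
}

for collection, size in longest_chunk_sizes(sample).items():
    print(f'Max chunk size of "{collection}": {size}')
```

In `demo_a`, the string `"zz"` sorts last but is the shorter of the two chunks, which is exactly the case where a plain `max(chunks)` and a length-based maximum disagree.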


## Saving chunks in SQLite

In [6]:
#import sys
#sys.path.append("../rag")

import sqlite3

DB_PATH = "../database/chunks.db"

In [7]:
records = []

for collection, chunks in all_chunks.items():
    # Number the chunks consecutively across all collections
    offset = len(records)
    records.extend([offset + j, collection, chunk] for j, chunk in enumerate(chunks))

print(len(records), '\n', records[:1], '\n', records[-2:])

84 
 [[0, 'hansaplatz', "Hanseatic Square (Hanseplatz), located in Hamburg, Germany, holds great historical and cultural significance, reflecting the rich heritage of the Hanseatic League. As one of the most prominent open spaces in the city, Hanseplatz is not only a popular destination for tourists but also a key location for locals to gather, socialize, and participate in various public events. This vast public square is symbolic of Hamburg's maritime legacy and its status as a former Hanseatic city. To understand its importance, we need to explore both the history of the Hanseatic League and how this historical connection is manifested in Hanseatic Square."]] 
 [[82, 'dresden_station', 'Looking ahead, Dresden Hauptbahnhof is expected to play an even larger role in European transportation as high-speed rail connections across the continent continue to expand. Its location in the heart of Europe positions it to become an even more critical junction for both passengers and freight, lin

In [8]:
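# "with conn" only wraps the transaction (commit on success, rollback on error);
# it does not close the connection, so close it explicitly
conn = sqlite3.connect(DB_PATH)
try:
    with conn:
        cur = conn.cursor()
        cur.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER, collection TEXT, chunk TEXT)")
        # Clear earlier rows so that re-running the cell does not duplicate chunks
        cur.execute("DELETE FROM chunks")
        cur.executemany("INSERT INTO chunks VALUES (?, ?, ?)", records)
finally:
    conn.close()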
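
After writing, it can be useful to read the rows back and check that each collection holds the expected number of chunks. The following is a sketch under stated assumptions rather than part of the original notebook: it uses an in-memory database (`":memory:"`) and made-up `sample_records` so that it runs on its own, while the table name and column layout are the ones created above.

```python
import sqlite3
from contextlib import closing

# Made-up records in the same [id, collection, chunk] layout as above
sample_records = [
    [0, "demo_a", "first paragraph"],
    [1, "demo_a", "second paragraph"],
    [2, "demo_b", "another paragraph"],
]

# ":memory:" keeps the sketch self-contained; the notebook writes to DB_PATH instead
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # commits on success, rolls back on error
        conn.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER, collection TEXT, chunk TEXT)")
        conn.execute("DELETE FROM chunks")
        conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)", sample_records)

    # Count the stored chunks per collection
    counts = dict(conn.execute(
        "SELECT collection, COUNT(*) FROM chunks GROUP BY collection"
    ).fetchall())

    # Fetch the chunks of one collection in id order
    demo_a = [row[0] for row in conn.execute(
        "SELECT chunk FROM chunks WHERE collection = ? ORDER BY id", ("demo_a",)
    )]

print(counts)
print(demo_a)
```

The `?` placeholder in the `WHERE` clause passes the collection name as a bound parameter, the same mechanism `executemany` uses for the inserts.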