Since we don't have a GPU locally, this file is only intended for data preprocessing.

> This may change, as we managed to secure PC parts and some GPU hardware from prof bad

# What will we build

HTMX-helper: a tool to help navigate the htmx library.

Specifically:
1. Crawl the HTMX resource book by the htmx creator (done: we used Go and saved the data to JSON)
2. Format the text from the JSON for an embedding model
3. Embed all of the text chunks into numerical representations (embeddings) and store them for later use
4. Build a retrieval system that uses vector search to find the chunks of text relevant to a query
5. Create a prompt that incorporates the retrieved pieces of text (from the htmx book)
6. Generate an answer to a query based on the knowledge provided
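
As a rough illustration of step 5, here is a minimal sketch of how retrieved chunks could be folded into a prompt. The function name `build_prompt`, the template wording, and the example chunks are all hypothetical placeholders, not something taken from the video or from this notebook.

```python
def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Join retrieved chunks into a bulleted context block and append the query."""
    # One bullet per retrieved chunk keeps the context easy to inspect
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the query using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for real retrieval results
example_chunks = [
    "Hypothetical chunk one about htmx attributes.",
    "Hypothetical chunk two about hypermedia.",
]
prompt = build_prompt("What does hx-get do?", example_chunks)
print(prompt)
```

The resulting string is what would be handed to the LLM in step 6.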

We wish to run everything locally, but as of now (18/2/25) we don't have the capability to do so.

This project is based on Daniel Bourke's NutriChat video:
> [Daniel Bourke's NutriChat video](https://www.youtube.com/watch?v=qN_2fnOPY-M)

Thanks to him, and we hope to learn a lot on this journey!

## 1. Crawling htmx resource book

This is done using Go (golang) and the colly library. Check out the crawler.go file to see the process. Some notes about the process are also in the readme file.
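
For orientation, here is a small sketch of the JSON shape the rest of this notebook appears to expect. It is inferred only from the keys the loading code reads (`chapter`, `content`, `Content`); the sample values are made up, and the real file may carry extra fields.

```python
import json

# Hypothetical sample mirroring the keys read later in this notebook
sample_json = """
[
  {
    "chapter": "Example Chapter",
    "content": [
      {"Content": "First paragraph."},
      {"Content": "Second paragraph."}
    ]
  }
]
"""

data = json.loads(sample_json)
for entry in data:
    # Each chapter holds a list of blocks, each with a "Content" string
    pieces = [block["Content"] for block in entry["content"]]
    print(entry["chapter"], len(pieces))
```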

## 2. Format the text of the JSON data, ready for the embedding model
We will also choose the embedding model now.
> The embedding model can be changed based on our resources. For now we will just follow the guidance from the video.

Steps:
1. Import the JSON data
2. Process the text for embedding (e.g. split it into chunks of sentences)
3. Embed the text chunks with an embedding model
4. Save the embeddings to a file for later use

In [5]:
import json

file_path = "second-version.json"

def text_formatter(text: str):
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def read_and_load_data(file_path: str):

    with open(file_path, 'r') as file:
        data = json.load(file)

    chap_and_text =[]
    for i in data:
        chap = i['chapter']

        content = ''
        for j in i['content']:
            curr = j['Content']
            content += curr

        content = text_formatter(content)
        chap_and_text.append({"chapter": chap,
                              "page_char_count": len(content),
                              "page_word_count": len(content.split(" ")),
                              "page_sentence_count": len(content.split(". ")),
                              "page_token_count": len(content) / 4, # 1 token = ~4 characters
                              "text" : content
                              })

    return chap_and_text

In [6]:
chap_and_text = read_and_load_data(file_path)
chap_and_text

[{'chapter': 'Contents',
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'chapter': 'JSON Data APIs',
  'page_char_count': 27034,
  'page_word_count': 4548,
  'page_sentence_count': 87,
  'page_token_count': 6758.5,
  'text': 'So far we have been focusing on using hypermedia to build Hypermedia-Driven Applications. In doing so we are following and taking advantage of the native network architecture of the web, and building a RESTful system, in the original sense of that term.However, today, we should acknowledge that many web applications are often not built using this approach. Instead, they use a Single Page Application front end library such as React to build their application, and they interact with the server via a JSON API. This JSON API almost never uses hypermedia concepts. Rather JSON APIs tend to be Data APIs, that is, an API that simply returns structured domain data to the client without any hypermedia 

In [7]:
import pandas as pd

df = pd.DataFrame(chap_and_text)
df.head()

Unnamed: 0,chapter,page_char_count,page_word_count,page_sentence_count,page_token_count,text
0,Contents,0,1,1,0.0,
1,JSON Data APIs,27034,4548,87,6758.5,So far we have been focusing on using hypermed...
2,Introduction,10156,1645,41,2539.0,This is a book about building applications usi...
3,Tricks Of The Htmx Masters,31713,5227,118,7928.25,In this chapter we are going to look deeper in...
4,A Dynamic Archive UI,27528,4617,83,6882.0,Contact.app has come a long way from a traditi...


In [8]:
df.describe().round(2)

Unnamed: 0,page_char_count,page_word_count,page_sentence_count,page_token_count
count,16.0,16.0,16.0,16.0
mean,34444.81,5640.62,140.94,8611.2
std,18605.27,3033.16,111.05,4651.32
min,0.0,1.0,1.0,0.0
25%,26235.5,4333.5,86.0,6558.88
50%,37890.5,6243.0,117.5,9472.62
75%,43984.75,7481.25,145.5,10996.19
max,63171.0,10097.0,424.0,15792.75


Why do we care about token count?

Token count is important because:
1. Embedding models cannot handle an unlimited number of tokens
2. LLMs cannot handle an unlimited number of tokens

For example, an embedding model may have been trained to embed sequences of 384 tokens into numerical space; longer inputs are typically truncated.

Embedding model resource: the MTEB leaderboard on Hugging Face (hf).
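
To make the limit concrete, here is a minimal sketch that applies the same rough rule used earlier (1 token is about 4 characters) and checks it against the 384-token example. The names `MAX_TOKENS`, `estimate_tokens`, and `fits_model` are hypothetical, and the real limit should be taken from the chosen model's documentation.

```python
MAX_TOKENS = 384  # example limit from the text; check the actual model's card


def estimate_tokens(text: str) -> float:
    """Rough estimate: 1 token is about 4 characters."""
    return len(text) / 4


def fits_model(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """True if the estimated token count is within the model's limit."""
    return estimate_tokens(text) <= max_tokens


short_text = "a" * 400    # about 100 estimated tokens
long_text = "a" * 27034   # same character count as the 'JSON Data APIs' chapter

print(fits_model(short_text))  # True
print(fits_model(long_text))   # False
```

A whole chapter is far over the limit, which is why the text gets split into smaller chunks later on.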



## Further text processing

Ways to split the text into sentences:
1. We've done the simplest form of this by splitting on `". "`.
2. We can further process this with an NLP library such as spaCy or nltk.
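
To see why the `". "` split is only a rough count, here is a small standard-library sketch comparing it with a slightly smarter regular-expression split. The regex is just an illustration of the gap, not what spaCy does internally.

```python
import re

text = "Is this a sentence? Yes! This is another one. And a third."

# Naive approach: only a period followed by a space counts as a boundary
naive = text.split(". ")

# Regex approach: split on whitespace that follows '.', '!' or '?'
regex_split = re.split(r"(?<=[.!?])\s+", text)

print(len(naive))        # 2
print(len(regex_split))  # 4
```

The naive split misses sentences ending in `?` or `!`, so its counts tend to be lower than those from a proper sentence splitter.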


In [9]:
from spacy.lang.en import English

nlp = English()

# add a sentencizer to the pipeline (see the docs at spacy.io)
nlp.add_pipe("sentencizer")

# Example of sentencizer use
doc = nlp("This is a sentence. This is another sentence. I like hersheys.")
assert len(list(doc.sents)) == 3

# the sentencizer splits the big text into individual sentences
list(doc.sents)

[This is a sentence., This is another sentence., I like hersheys.]

In [10]:
for item in chap_and_text:
    item['sentences'] = list(nlp(item['text']).sents)

    # Make sure all sentences are plain strings; spaCy returns Span objects by default
    item["sentences"] = [str(sentence) for sentence in item['sentences']]

    # Count the sentences again, this time using spaCy's split
    item["sentence_count_spacy"] = len(item['sentences']) 

chap_and_text

[{'chapter': 'Contents',
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count': 1,
  'page_token_count': 0.0,
  'text': '',
  'sentences': [],
  'sentence_count_spacy': 0},
 {'chapter': 'JSON Data APIs',
  'page_char_count': 27034,
  'page_word_count': 4548,
  'page_sentence_count': 87,
  'page_token_count': 6758.5,
  'text': 'So far we have been focusing on using hypermedia to build Hypermedia-Driven Applications. In doing so we are following and taking advantage of the native network architecture of the web, and building a RESTful system, in the original sense of that term.However, today, we should acknowledge that many web applications are often not built using this approach. Instead, they use a Single Page Application front end library such as React to build their application, and they interact with the server via a JSON API. This JSON API almost never uses hypermedia concepts. Rather JSON APIs tend to be Data APIs, that is, an API that simply returns structured d

In [11]:
df = pd.DataFrame(chap_and_text)
df.describe().round(2)
# spaCy gives a different sentence count than the naive ". " split.
# We should also update the token count, but we will skip that for now.

Unnamed: 0,page_char_count,page_word_count,page_sentence_count,page_token_count,sentence_count_spacy
count,16.0,16.0,16.0,16.0,16.0
mean,34444.81,5640.62,140.94,8611.2,272.88
std,18605.27,3033.16,111.05,4651.32,167.22
min,0.0,1.0,1.0,0.0,0.0
25%,26235.5,4333.5,86.0,6558.88,200.25
50%,37890.5,6243.0,117.5,9472.62,263.0
75%,43984.75,7481.25,145.5,10996.19,325.5
max,63171.0,10097.0,424.0,15792.75,580.0


### Chunking our sentences together

As we can see, because we group our sentences per chapter, the sentence count for each chapter is ridiculously high.  
To embed this with our model, we will group every 7 sentences into one chunk of sentences.
> Note: we can use any chunk size we want, but for now we will use 7

There are frameworks such as LangChain that help with this process, but today we will hardcode it.

Reasons to chunk our text:
1. Easier to filter, process, and inspect our text.
2. Text chunks will fit into our embedding model.
3. The context passed into the LLM will be more specific.

In [12]:
num_sentence_chunk_size = 7

# create a function to split a list of texts into chunks of slice_size
# e.g. with slice_size=7: 20 items -> [7, 7, 6] or 25 items -> [7, 7, 7, 4]
def split_list(input_list: list[str],
               slice_size: int= num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6],
 [7, 8, 9, 10, 11, 12, 13],
 [14, 15, 16, 17, 18, 19, 20],
 [21, 22, 23, 24]]

In [13]:
# Loop through chapter and text and split sentences into chunks
for item in chap_and_text:
    item["sentence_chunks"] = split_list(input_list=item['sentences'],
                                         slice_size=num_sentence_chunk_size)
    item['num_chunks'] = len(item['sentence_chunks'])

In [14]:
import random

random.sample(chap_and_text, k=2)

[{'chapter': 'Components Of A Hypermedia System',
  'page_char_count': 41048,
  'page_word_count': 6489,
  'page_sentence_count': 131,
  'page_token_count': 10262.0,
  'text': 'A hypermedia system consists of a number of components, including:A hypermedia, such as HTML.A network protocol, such as HTTP.A server that presents a hypermedia API responding to network requests with hypermedia responses.A client that properly interprets those responses.In this chapter we will look at these components and their implementation in the context of the web.Once we have reviewed the major components of the web as a hypermedia system, we will look at some key ideas behind this system — especially as developed by Roy Fielding in his dissertation, “Architectural Styles and the Design of Network-based Software Architectures.” We will see where the terms REpresentational State Transfer (REST), RESTful and Hypermedia As The Engine Of Application State (HATEOAS) come from, and we will analyze these terms i

In [15]:
df = pd.DataFrame(chap_and_text)
df.describe().round(2)

Unnamed: 0,page_char_count,page_word_count,page_sentence_count,page_token_count,sentence_count_spacy,num_chunks
count,16.0,16.0,16.0,16.0,16.0,16.0
mean,34444.81,5640.62,140.94,8611.2,272.88,39.38
std,18605.27,3033.16,111.05,4651.32,167.22,23.97
min,0.0,1.0,1.0,0.0,0.0,0.0
25%,26235.5,4333.5,86.0,6558.88,200.25,28.75
50%,37890.5,6243.0,117.5,9472.62,263.0,38.0
75%,43984.75,7481.25,145.5,10996.19,325.5,47.0
max,63171.0,10097.0,424.0,15792.75,580.0,83.0


### Splitting each chunk into its own item
We'd like to embed each chunk of sentences into its own numerical representation.


In [16]:
import re

# Split each chunk into its own item
chap_and_chunks = []

for item in chap_and_text:
    for sentence_chunk in item['sentence_chunks']:
        chunk_dict = {}
        chunk_dict['chapter'] = item['chapter']

        # join the sentences back together into paragraph like structure
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A"

        chunk_dict['sentence_chunk'] = joined_sentence_chunk

        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict['chunk_word_count'] = len(joined_sentence_chunk.split(" "))
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk) / 4

        chap_and_chunks.append(chunk_dict)

len(chap_and_chunks)

630

In [17]:
random.sample(chap_and_chunks, 3)

[{'chapter': 'Client Side Scripting',
  'sentence_chunk': 'Are you a bit more bold in your technical choices?Maybe _hyperscript is worth a look. (We certainly think so.)Sometimes you might even consider picking two (or more) of these approaches within an application. Each has its own strengths and weaknesses, and all of them are relatively small and self-contained, so picking the right tool for the job at hand might be the best approach. In general, we encourage a pragmatic approach to scripting: whatever feels right is probably right (or, at least, right enough) for you. Rather than being concerned about which particular approach is taken for your scripting, we would focus on these more general concerns:Avoid communicating with the server via JSON data APIs. Avoid storing large amounts of state outside of the DOM. Favor using events, rather than hard-coded callbacks or method calls.',
  'chunk_char_count': 837,
  'chunk_word_count': 135,
  'chunk_token_count': 209.25},
 {'chapter': 'A

In [18]:
df = pd.DataFrame(chap_and_chunks)
df.describe().round(2)

Unnamed: 0,chunk_char_count,chunk_word_count,chunk_token_count
count,630.0,630.0,630.0
mean,876.74,146.19,219.18
std,261.94,44.04,65.48
min,124.0,19.0,31.0
25%,687.25,114.0,171.81
50%,849.5,142.0,212.38
75%,1039.0,174.0,259.75
max,2021.0,337.0,505.25


### Filter chunks of text for short chunks

These chunks are unlikely to contain useful information.

We will skip this step, as it is not really useful for our data compared to the data in the video.
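
For reference, here is a minimal sketch of what such a filter could look like if it is needed later. The threshold `min_token_length = 30` and the two sample chunks are hypothetical. Since the shortest chunk in the statistics above is about 31 estimated tokens, a threshold like this would drop nothing from our data.

```python
min_token_length = 30  # hypothetical threshold, tune for the data at hand

# Hypothetical chunks with the same keys used in this notebook
chunks = [
    {"sentence_chunk": "Too short.", "chunk_token_count": 2.5},
    {"sentence_chunk": "x" * 124, "chunk_token_count": 31.0},
]

# Keep only chunks whose estimated token count exceeds the threshold
kept = [c for c in chunks if c["chunk_token_count"] > min_token_length]
print(len(kept))  # 1
```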

## Embedding our text chunks

What we'll do:
- Turn text chunks into numbers (embeddings).

Embeddings -> a useful numerical representation for the computer
> Computers only understand numbers, not words
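
One reason the numbers are useful is that two vectors can be compared. Here is a toy sketch of cosine similarity on hand-made 3-dimensional vectors, using only the standard library. The vectors and the function name `cosine_similarity` are made up for illustration; real embeddings from the model have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
similar = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, similar))    # close to 1.0
print(cosine_similarity(query, unrelated))  # 0.0
```

Texts with similar meaning should end up with vectors pointing in similar directions, which is what a vector search relies on.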

In [20]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu") # we will follow the video's model choice for now
# For model choice, pick one whose token limit fits our chunk token sizes.

# Create a list of sentences
sentences = [' The sentence transformer library provide easy way to make embeddings',
             'sentences can be embedded one by one in a list',
             ' i miss japan']

# sentences are encoded/embedded by calling model.encode()

embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
  print(f"Sentence : {sentence}")
  print(f"Embedding : {embedding}")
  print("")

Sentence :  The sentence transformer library provide easy way to make embeddings
Embedding : [-2.08969861e-02  2.25286689e-02 -1.83901284e-02  6.99257627e-02
 -2.15128046e-02 -2.08300911e-03  8.15608911e-03 -5.86984456e-02
  4.91816038e-03 -1.78884510e-02  2.68452708e-02  5.05151562e-02
 -3.51441354e-02  1.57507174e-02  3.99875417e-02 -6.51047528e-02
  3.86857241e-02  1.31468056e-02 -2.84869820e-02  5.70401223e-03
  3.63794491e-02  2.90079713e-02  1.09074572e-02  3.67448144e-02
 -1.53991748e-02 -2.64061037e-02  5.52123692e-03 -4.33875099e-02
  5.72379008e-02 -2.96574295e-03 -3.28862257e-02 -1.55706517e-03
  3.78073491e-02  1.85875397e-04  9.56624490e-07  3.75024811e-03
 -5.01560532e-02 -7.46415323e-03  1.61983781e-02 -7.52421934e-03
  4.79025468e-02 -5.36020547e-02  2.08999421e-02  4.73844819e-02
 -4.73535322e-02 -5.76758478e-03  5.05656861e-02  3.14400718e-02
  8.20528790e-02  5.75641356e-02 -2.01482121e-02 -4.48154137e-02
  9.27948486e-03 -1.25863766e-02 -1.15183759e-02  2.23431587e-

In [21]:
embeddings[0].shape # each sentence is represented by a vector of 768 values. More dimensions = potentially more accurate, but more storage/RAM

(768,)
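
To get a feel for the storage cost, here is a back-of-the-envelope sketch for our 630 chunks. It assumes each value is stored as a 4-byte float32; the on-disk size will differ if the embeddings are saved as text.

```python
num_chunks = 630      # number of chunks produced earlier in this notebook
embedding_dim = 768   # vector size reported by the model
bytes_per_value = 4   # assumption: float32 values

total_bytes = num_chunks * embedding_dim * bytes_per_value
print(total_bytes)                         # 1935360
print(round(total_bytes / 1024 ** 2, 2))   # 1.85 (MiB)
```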

In [22]:
embedding = embedding_model.encode("I love moo!")
embedding

array([-1.68044008e-02,  1.37606598e-02,  7.11908704e-03, -3.29362601e-02,
        6.07296079e-02,  1.52029898e-02, -4.07691412e-02,  2.65351050e-02,
       -1.01639843e-02,  1.47313271e-02, -8.97771716e-02,  2.09716186e-02,
       -3.80742587e-02,  2.41984315e-02, -2.52292845e-02, -4.36438061e-02,
       -8.03998671e-03,  5.22163557e-03, -3.77300680e-02,  1.68232080e-02,
        4.49382998e-02,  1.65626570e-05,  5.02042053e-03,  3.41250352e-03,
        2.23594997e-02, -1.07230050e-02, -1.97291188e-02,  1.27139408e-02,
       -4.70884852e-02, -1.45507855e-02, -4.50086370e-02,  2.85994876e-02,
       -2.57534180e-02,  3.84521671e-02,  1.45520539e-06, -3.66722643e-02,
       -4.81293537e-02,  3.69038992e-02, -1.03887022e-02,  4.53571463e-03,
        3.74449184e-03, -1.71370301e-02, -5.15320823e-02,  7.62083661e-03,
        5.37400804e-02,  4.81998269e-03,  4.94585037e-02, -6.45186454e-02,
        3.44830193e-03,  4.56240512e-02,  2.50893831e-03,  4.64599580e-03,
       -4.09044549e-02, -

In [23]:
embedding_model.to("cpu")

# Embed each chunk one by one
for item in chap_and_chunks:
  item['embedding'] = embedding_model.encode(item['sentence_chunk'])
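
Encoding one chunk at a time works, but passing several texts per call is usually faster. Here is a minimal sketch of batching, written against a generic `encode_fn` so it runs without downloading a model. The helper `attach_embeddings` and the stand-in `fake_encode` are hypothetical; with the real model, `encode_fn` would be something like `embedding_model.encode`, which accepts a list of texts.

```python
def attach_embeddings(items: list[dict], encode_fn, batch_size: int = 32) -> list[dict]:
    """Encode 'sentence_chunk' texts in batches and store each vector on its item."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # encode_fn is assumed to take a list of texts and return one vector per text
        vectors = encode_fn([item["sentence_chunk"] for item in batch])
        for item, vector in zip(batch, vectors):
            item["embedding"] = vector
    return items


def fake_encode(texts: list[str]) -> list[list[float]]:
    """Stand-in encoder for illustration: a 1-value 'embedding' per text."""
    return [[float(len(text))] for text in texts]


demo_items = [{"sentence_chunk": "ab"}, {"sentence_chunk": "abc"}, {"sentence_chunk": "a"}]
attach_embeddings(demo_items, fake_encode, batch_size=2)
print([item["embedding"] for item in demo_items])  # [[2.0], [3.0], [1.0]]
```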

In [24]:
chap_and_chunks[420]

{'chapter': 'Hyperview: A Mobile Hypermedia',
 'sentence_chunk': 'Instead, the fragment will be added either after (in the case of append) or before (prepend) the children of the target element. For completeness, let’s look at the state of the screen after a user presses all four buttons:Update actions, after pressing buttonsFragment completely replaced the target using replace actionFragment replaced the children of the target using replace-inner actionFragment added as last child of the target using append actionfragment added as the first child of the target using prepend actionThe examples above show actions making GET requests to the backend. But these actions can also make POST requests by setting verb="post" on the <behavior> element. For both GET and POST requests, the data from the parent <form> element will be serialized and included in the request. For GET requests, the content will be URL-encoded and added as query params. For POST requests, the content will be form-URL enc

In [28]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(chap_and_chunks)
embeddings_df_save_path = 'text_chunks_and_embeddings_df.csv'
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)
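
One thing to keep in mind for the next notebook: `to_csv` writes every cell as text, so the `embedding` column will come back as a string when the file is read again. Here is a small sketch of turning such a string back into numbers. The helper name `parse_embedding` is hypothetical, and it accepts both space-separated and comma-separated values since the exact saved format was not checked here.

```python
def parse_embedding(text: str) -> list[float]:
    """Convert a string like '[ 0.5 -1.25 2.0]' back into a list of floats."""
    # Drop surrounding brackets, then treat commas like whitespace
    cleaned = text.strip().strip("[]").replace(",", " ")
    return [float(value) for value in cleaned.split()]


print(parse_embedding("[ 0.5 -1.25\n  2.0]"))  # [0.5, -1.25, 2.0]
print(parse_embedding("[0.5, -1.25, 2.0]"))    # [0.5, -1.25, 2.0]
```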

In [27]:
text_chunks_and_embeddings_df.head()

Unnamed: 0,chapter,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,JSON Data APIs,So far we have been focusing on using hypermed...,881,147,220.25,"[0.027365912, 0.040399387, -0.034935527, 0.018..."
1,JSON Data APIs,"Now, believe it or not, we have been creating ...",859,152,214.75,"[0.044084676, 0.038022816, -0.016651524, 0.019..."
2,JSON Data APIs,Should we include a Data API for Contact.app a...,987,156,246.75,"[0.007488669, 0.05167837, -0.030760746, -0.000..."
3,JSON Data APIs,You want programmatic access to your system vi...,659,119,164.75,"[0.021024536, 0.07608413, -0.0009299007, 0.013..."
4,JSON Data APIs,"As with the bulk-import example, this isn’t a ...",1047,181,261.75,"[0.031623088, 0.06005647, -0.013291641, -0.011..."


This is the end of this file, as we will move to a Google Drive folder for now.  
This is because we don't have the capability to run this smoothly locally, so we are using a Colab notebook instead.

> Upload the data file to the Drive folder