# Segmentation Model Testing

Testing segmentation model against:
- Jina Segmenter API
- Langchain semantic chunking

## Todo

- Batch long text to segmenter API
- Use JE3
- Reader-lm instead of reader?
- Integrate new qwen segmenter - hopefully Monday
- Convert markdown to plain text before segmenting

## Ideas

- How train small model
- Segmentation and what we did
- Benchmark would be necessary if wanted to make product. Maybe go without it?
- This is where it works, this is where it fails
- Use raw text, not markdown since training data didn't include markdown

## Segmentation notes post

- Benchmark important

## Other

- Copy Felix's tutorial rather than rag focus, ask LLM to ask questions for each chunk. Ensures Q can be answered with each chunk. But not way for us to do it. We have to make sure chunk is topical and complete. Use LLM to extract topics, chunk, generate Qs, when testing ask the Qs to diff indexes and then compare answers (use LLM to compare or Rouge or word-based answer not cosine). This is better for harder stuff like code, lists.
- Model prompt: copypaste as much as possible

## Training SLM

We just used library for param-efficient tuning. nothing special. lots of tutorials already. need angle, like ifnding right instruction, using diff data collator to ensure focus on actual text to be generated.

# How I'm doing this

1. Scrape several blog posts from Jina blog, convert to plain text, load into Documents
2. Use LLM to generate Document-level questions for each document (i.e. not chunk-level questions)
3. Send each Document to *x* different chunkers, each of which create their own index from the blog post text
4. Use RAG to ask each index the questions generated earlier
5. Ask LLM to compare question and set of answers from each index, choose the best one

# Basic setup

### Install prerequisites

In [12]:
!pip install -q pydantic llama_index
!pip install -q llama-index-embeddings-huggingface
!pip install -q llama-index llama-index-embeddings-jinaai llama-index-llms-huggingface "huggingface_hub[inference]"

### Basic configuration

This really only applies if you want to rebuild the docs and indexes yourself

In [13]:
# max characters per text to upload to segmenter (so we don't break input windows)
# MAX_CHARS = 50000

# context window for roberta segmentation model
context_window = 7168

# Load prebuilt docs and indexes

I pre-built a lot of the stuff to avoid Google Colab timing out or running out of credits in the middle of operations

Upload the following to Colab:

- `docs.pkl` - Pickled `list` of `Document` objects
- `indexes.pkl` - Pickled `list` of `Index` objects

In [4]:
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Dict

In [6]:
class BlogPost(BaseModel):
  url: HttpUrl
  filename: str = ""
  text: str = ""
  text_short: str = ""
  markdown: str = "" # todo: convert markdown to plain text, store here
  questions: List = []
  chunks: dict = {} # populated by different chunking strategies later

In [10]:
from llama_index.core import VectorStoreIndex

class Index(BaseModel):
  name: str
  index: VectorStoreIndex
  questions: list[Dict[str, str]] = [] # store q and a here

  class Config:
    arbitrary_types_allowed = True

In [14]:
import pickle
import os

docs = pickle.load(open("docs.pkl", "rb"))
indexes = pickle.load(open("indexes.pkl", "rb"))

In [15]:
# check size looks okay

import os
import glob

# Define the directory path
directory_path = './'

# Use glob to find all .pkl files in the directory
pkl_files = glob.glob(os.path.join(directory_path, "*.pkl"))

# Iterate through the files and print their sizes in MB
for file_path in pkl_files:
    file_size_bytes = os.path.getsize(file_path)  # Get the file size in bytes
    file_size_mb = file_size_bytes / (1024 * 1024)  # Convert bytes to MB
    print(f"File: {file_path}, Size: {file_size_mb:.2f} MB")

File: ./docs.pkl, Size: 0.15 MB
File: ./indexes.pkl, Size: 451.84 MB


## Ask questions to index

In [139]:
def query_index(index, question, top_k=3):
# def query_index(index, doc, top_k=3):

  # answers = []

  # configure retriever
  retriever = VectorIndexRetriever(
      index=index.index,
      similarity_top_k=top_k
      )

  # assemble query engine
  query_engine = RetrieverQueryEngine(
      retriever=retriever,
      response_synthesizer=response_synthesizer,
  )

  # for question in doc.questions:
  answer = query_engine.query(question).response.strip()
    # answers.append(
    #     {"question: ": question,
    #      "answer": answer.response.strip(),
    #      "strategy": chunking_strategy,
    #     }
    # )

  index.questions.append(
      {"question": question,
      "answer": answer
       }
  )
  return answer

In [140]:
# all questions in one list

questions = []

for doc in docs:
  for question in doc.questions:
    questions.append(question["question"])

In [142]:
for index in indexes:
  print(f"Asking {index.name}")
  for question in questions:
    # print(f"- {question}")
    query_index(index, question)

Asking jina-segmenter-api
Asking segmentation-model
Asking langchain_semantic
Asking text-seg-lm


In [143]:
pickle.dump(indexes, open("indexes.pkl", "wb"))



In [165]:
[index.name for index in indexes]

['jina-segmenter-api',
 'segmentation-model',
 'langchain_semantic',
 'text-seg-lm']

## Evaluate answers

In [169]:
def generate_evaluation_prompt(doc):
  evaluation_prompt = """
  You are an expert evaluator of RAG systems.

  Your job is to look at a given text and then look at several questions about the text.
  Each question has answers from four different RAG systems. You will evaluate these answers based on:
  - Accuracy (does it correctly answer the question based on original text)
  - Conciseness (does it get to the point? does it include any extraneous information)
  - Readability (how easy is it for an expert user to understand?)

  The RAG systems are called:
  - 'jina-segmenter-api'
  - 'segmentation-model'
  - 'langchain_semantic'
  - 'text-seg-lm'

  Give each RAG system a score out of ten for each answer it provides. At the end of your output, include a table with the final scores for each RAG system.

  Here is the text:
  <text begin>
  {}
  <text end>

  Here are the questions and answers:
  <questions and answers begin>
  {}
  <questions and answers end>
  """
  qna_text = ""

  for doc_question in doc.questions:
    # add question
    qna_text += f"Question: {doc_question['question']}\n"

    for index in indexes:
      # add each answer
      for idx_question in index.questions:
        if doc_question["question"] == idx_question["question"]:
          qna_text += f"{index.name}: {idx_question['answer']}\n"

    prompt = evaluation_prompt.format(doc.text, qna_text)

    return prompt

In [170]:
eval_prompts = []

for doc in docs:
  prompt = generate_evaluation_prompt(doc)
  eval_prompts.append(prompt)

# No prebuilt stuff? Build it below

If it's already pre-built, ignore this, skip to the RAG bit.

## Clone segmentation model repo and install requirements

⬅️ Set your `GITHUB_TOKEN` in secrets in the sidebar

In [2]:
from google.colab import userdata

GITHUB_TOKEN = userdata.get("GITHUB_TOKEN")
repo_url = f"https://{GITHUB_TOKEN}@github.com/jina-ai/text-seg.git"

In [3]:
!git clone {repo_url}

Cloning into 'text-seg'...
remote: Enumerating objects: 629, done.[K
remote: Counting objects: 100% (299/299), done.[K
remote: Compressing objects: 100% (202/202), done.[K
remote: Total 629 (delta 178), reused 203 (delta 95), pack-reused 330 (from 1)[K
Receiving objects: 100% (629/629), 176.20 KiB | 816.00 KiB/s, done.
Resolving deltas: 100% (375/375), done.


In [4]:
import os

os.chdir("text-seg")

In [5]:
!git checkout main
!git pull

Already on 'main'
Your branch is up to date with 'origin/main'.
Already up to date.


In [6]:
%%capture

# install everything and prevent output, otherwise it bulks up the notebook
!pip install -r requirements.txt
!pip uninstall -yq numpy
!pip install -q "numpy<2.0"

## Create input/output directories

In [7]:
input_dir = "inputs"
output_dir = "outputs"
pickle_dir = "pickles" # save in progress stuff in case gpu credits run out, so we don't have to redo stuff

!rm -rf ./inputs
!rm -rf ./outputs

os.makedirs(input_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(pickle_dir, exist_ok=True)

In [8]:
import pickle
from datetime import datetime

def pickle_object(pickle_path, obj, suffix=""):
  timestamp = datetime.now().strftime("%Y%m%d_%H%M%S"),
  filename = f"{pickle_path}-{timestamp[0]}-{suffix}.pkl"
  with open(filename, "wb") as f:
    pickle.dump(obj, f)

## Create input format

In [114]:
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Dict

In [115]:
class BlogPost(BaseModel):
  url: HttpUrl
  filename: str = ""
  text: str = ""
  text_short: str = ""
  markdown: str = "" # todo: convert markdown to plain text, store here
  # questions: List[str] = [] # created later by llm
  # answers: List[Dict] = [] # for each index, ask questions, store along with index name
  questions: List = []
  chunks: dict = {} # populated by different chunking strategies later
  # chunks: Dict[str, List[Chunk]] = Field(default_factory=dict)

## Get input data

Use Jina blog posts, pushed through reader API

In [116]:
reader_url = "https://r.jina.ai/"

blog_posts = [
    f"{reader_url}https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown",
    f"{reader_url}https://jina.ai/news/jina-colbert-v2-multilingual-late-interaction-retriever-for-embedding-and-reranking",
    f"{reader_url}https://jina.ai/news/late-chunking-in-long-context-embedding-models",
    f"{reader_url}https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models",
    f"{reader_url}https://jina.ai/news/rephrased-labels-improve-zero-shot-text-classification-30"
]

input_urls = blog_posts

## Retrieve previously-pickled data

Since we're in Colab, can only use so much GPU at one time. Use pickles to save state of objects between sessions

In [9]:
!git clone https://github.com/alexcg1/text-seg-test

Cloning into 'text-seg-test'...
remote: Enumerating objects: 6, done.[K
remote: Counting objects: 100% (6/6), done.[K
remote: Compressing objects: 100% (4/4), done.[K
remote: Total 6 (delta 1), reused 6 (delta 1), pack-reused 0 (from 0)[K
Receiving objects: 100% (6/6), 182.69 KiB | 745.00 KiB/s, done.
Resolving deltas: 100% (1/1), done.


In [14]:
# import pickle

# pickle_name = "docs-langchain-semantic.pkl"
# # pickle_name = "docs.pkl"

# docs = pickle.load(open(f"text-seg-test/pickles/{pickle_name}", "rb"))

## If no pickled `docs`, generate `docs` from scratch

You really don't need to run this

In [117]:
docs = []

In [118]:
%%capture

# convert markdown to text
!pip install mdplain

In [119]:
import requests
from mdplain import plain

for url in input_urls:
  doc = BlogPost(url=url)
  doc.filename = os.path.basename(url)
  doc.markdown = requests.get(url).text
  doc.text = plain(doc.markdown)
  # doc.text_short = doc.text[:MAX_CHARS]

  docs.append(doc)

In [134]:
!rm -f docs.pkl

In [135]:
pickle_object(os.path.join("./", "docs"), docs)
# !cp ./pickles/docs.pkl ./text-seg-test/pickles/

In [136]:
!mv doc* docs.pkl

### Add questions

In [101]:
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1", token=userdata.get("HF_TOKEN")
)

  mixtral_llm = HuggingFaceInferenceAPI(


In [102]:
import json

def create_questions(doc, count=3):
  prompt = f"""
  Generate {count} technical question(s) about the given text that the text itself answers. Use this format:

      [
          "What are the key differences between dense and sparse retrieval methods in RAG systems?",
          "How does a RAG model handle the integration of retrieved documents during the generation process?",
          "What techniques can be used to optimize the retrieval phase in a RAG system for large-scale datasets?"
      ]

  Present your output in only a structured JSON list of strings, with no other output or markdown formatting. Provide only the questions. Do not provide answers or context. Do not wrap your output in backticks. Text is as follows:

  {doc.text}
  """

  response = mixtral_llm.complete(prompt)
  # print(response)
  # print(type(response))
  print(response.json())

  raw_output = response.text.strip()
  # print(raw_output)

  if raw_output[0] == '`':
    print("Code fencing detected. Fixing it")
    raw_output = raw_output.splitlines()[1:-1]
    raw_output = "\n".join(raw_output)

  # print(raw_output)

  try:
    questions = json.loads(raw_output)
  except:
    print("Failed to convert output to JSON")

  [question.strip() for question in questions]

  for question in questions:
    doc.questions.append({"question": question, "answers": {}})

  # doc.questions = questions

In [108]:
docs[0].chunks

AttributeError: 'Document' object has no attribute 'chunks'

In [131]:
for doc in docs:
  try:
    create_questions(doc)
  except:
    print(f"Failed to create questions for {doc.filename}")
  # questions = create_questions(doc)

{"text":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n","additional_kwargs":{},"raw":null,"logprobs":null,"delta":null}
Failed to create questions for reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown
{"text":"\n\n  ```json\n  [\n      \"What are the main improvements of Jina-ColBERT-v2 over the original ColBERT-v2 and jina-colbert-v1-en?\",\n      \"How does Jina-ColBERT-v2 handle multilingual data in its training pipeline?\",\n      \"What is the impact of Matryoshka Representation Learning on the p

In [132]:
docs.pop(0) # mixtral consistently fails to generate qs for this doc

BlogPost(url=Url('https://r.jina.ai/https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown'), filename='reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown', text='Title: Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown\nURL Source: https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown\nPublished Time: 2024-09-11T12:25:03.000+02:00\nMarkdown Content:\njinaai/reader-lm-0.5b · Hugging Face We’re on a journey to advance and democratize artificial intelligence through open source and open science.  \njinaai/reader-lm-1.5b · Hugging Face We’re on a journey to advance and democratize artificial intelligence through open source and open science.  \nIn April 2024, we released Jina Reader, a simple API that converts any URL into LLM-friendly markdown with just a simple prefix: r.jina.ai. Despite the sophisticated network programming behind the scenes, the 

## Chunking strategy 1: inference with segmentation model (Roberta)

In [27]:
from transformers import XLMRobertaForTokenClassification
import torch

In [28]:
model = XLMRobertaForTokenClassification.from_pretrained(
    "jinaai/roberta-text-segmentation",
    use_flash_attn=True,
    torch_dtype=torch.bfloat16,
    # trust_remote_code=True
)

config.json:   0%|          | 0.00/1.82k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/567M [00:00<?, ?B/s]

Some weights of the model checkpoint at jinaai/roberta-text-segmentation were not used when initializing XLMRobertaForTokenClassification: ['roberta.emb_ln.bias', 'roberta.emb_ln.weight', 'roberta.encoder.layers.0.mixer.Wqkv.bias', 'roberta.encoder.layers.0.mixer.Wqkv.weight', 'roberta.encoder.layers.0.mixer.out_proj.bias', 'roberta.encoder.layers.0.mixer.out_proj.weight', 'roberta.encoder.layers.0.mlp.fc1.bias', 'roberta.encoder.layers.0.mlp.fc1.weight', 'roberta.encoder.layers.0.mlp.fc2.bias', 'roberta.encoder.layers.0.mlp.fc2.weight', 'roberta.encoder.layers.0.norm1.bias', 'roberta.encoder.layers.0.norm1.weight', 'roberta.encoder.layers.0.norm2.bias', 'roberta.encoder.layers.0.norm2.weight', 'roberta.encoder.layers.1.mixer.Wqkv.bias', 'roberta.encoder.layers.1.mixer.Wqkv.weight', 'roberta.encoder.layers.1.mixer.out_proj.bias', 'roberta.encoder.layers.1.mixer.out_proj.weight', 'roberta.encoder.layers.1.mlp.fc1.bias', 'roberta.encoder.layers.1.mlp.fc1.weight', 'roberta.encoder.layers.

In [30]:
from inference import inference

for doc in docs:
  print(f"Processing {doc.filename}")

  # doc.chunks["segmentation-model"] = []
  output_path = os.path.join(output_dir, doc.filename)

  chunks = inference(
  # doc.chunks["segmentation-model"] = inference(
      checkpoint="jinaai/roberta-text-segmentation",
      text=doc.text,
      output_path=output_path,
      use_flash_attention="yes",
      context_length=context_window
  )

  # for chunk in chunks:
  #   chunk_obj = Chunk(text=chunk, chunking_strategy="segmentation-model")
  #   doc.chunks["segmentation-model"].append(chunk_obj)
  doc.chunks["segmentation-model"] = chunks

Processing reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown


tokenizer_config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/964 [00:00<?, ?B/s]



Processing jina-colbert-v2-multilingual-late-interaction-retriever-for-embedding-and-reranking




Processing late-chunking-in-long-context-embedding-models




Processing the-what-and-why-of-text-image-modality-gap-in-clip-models




Processing rephrased-labels-improve-zero-shot-text-classification-30




In [31]:
pickle_object(os.path.join(pickle_dir, "docs.pkl"), docs, "segmentation-model")

## Chunking strategy 2: Jina's Segmenter API

In [32]:
jina_segmenter_url = "https://segment.jina.ai/"
JINA_TOKEN = userdata.get("JINA_TOKEN")

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {JINA_TOKEN}'
}

In [33]:
for doc in docs:
  data = {
      "content": doc.text,
      # "content": doc.text_short,
      "return_tokens": "false",
      "return_chunks": "true",
      "max_chunk_length": "10000",
  }

  response = requests.post(jina_segmenter_url, headers=headers, json=data)
  response = response.json()
  # print(response)

  doc.chunks["jina-segmenter-api"] = response["chunks"]

In [34]:
docs[0].chunks["jina-segmenter-api"][0]

'Title: Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown\n'

In [35]:
pickle_object(os.path.join(pickle_dir, "docs.pkl"), docs, "jina-segmenter-api")

## Chunking strategy 3: Langchain Semantic Chunking

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/semantic-chunker/

In [36]:
!pip install --quiet langchain_experimental langchain_openai

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m206.9/206.9 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.5/51.5 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.3/2.3 MB[0m [31m48.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m405.1/405.1 kB[0m [31m27.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m44.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.8/289.8 kB[0m [31m20.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m141.9/141.9 kB[0m [31m12.6 MB/s[0m eta [36m0:00:00[0m
[?25h

In [37]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import JinaEmbeddings




In [38]:
text_splitter = SemanticChunker(JinaEmbeddings(jina_api_key=JINA_TOKEN))

for doc in docs:
  chunks = text_splitter.create_documents([doc.text])
  doc.chunks["langchain_semantic"] = [chunk.page_content for chunk in chunks]

In [39]:
pickle_object(os.path.join(pickle_dir, "docs.pkl"), docs, "langchain_semantic")

## Chunking strategy 4: Text-Seg-LM

In [15]:
# Note: this might prompt you to restart your session, just restart it if that's the case
!pip install -q torch triton xformers

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.7/16.7 MB[0m [31m98.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m797.1/797.1 MB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 2.7.15 requires torch<2.4,>=1.10, but you have torch 2.4.1 which is incompatible.
torchaudio 2.4.0 requires torch==2.4.0, but you have torch 2.4.1 which is incompatible.
torchvision 0.19.0 requires torch==2.4.0, but you have torch 2.4.1 which is incompatible.[0m[31m
[0m

In [16]:
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --no-deps trl peft accelerate bitsandbytes

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m25.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.6/3.6 MB[0m [31m95.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.7/105.7 kB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for unsloth (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.1/280.1 kB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m296.4/296.4 kB[0m [31m25.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m137.5/137.5 MB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
# already got token previously

# from huggingface_hub import notebook_login
# notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [17]:
from unsloth import FastLanguageModel # type: ignore
from transformers import GenerationConfig
import torch

import urllib.parse
import requests  # type: ignore
import re

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [18]:
max_seq_length = 8192
max_new_tokens = 1024
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "jinaai/text-seg-lm-qwen-0.5b",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model)

gen_config = GenerationConfig.from_pretrained(
    "unsloth/Qwen2-0.5B-Instruct-bnb-4bit",
    max_length=8192,
    max_new_tokens=max_new_tokens,
)

==((====))==  Unsloth 2024.9: Fast Qwen2 patching. Transformers = 4.44.0.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/457M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/265 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/80.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/35.2M [00:00<?, ?B/s]

Unsloth 2024.9 patched 24 layers with 0 QKV layers, 24 O layers and 24 MLP layers.


In [20]:
def extract_chunks(text, chunk_headers_raw):
    chunk_headers = re.findall(r'CHUNK \d+:\s*(.*)', chunk_headers_raw)

    chunks = []
    for i in range(len(chunk_headers) - 1):
        current_header_escaped = re.escape(chunk_headers[i])
        next_header_escaped = re.escape(chunk_headers[i + 1])
        pattern = f"{current_header_escaped}(.*?){next_header_escaped}"
        match = re.search(pattern, text, re.DOTALL)
        if match:
            chunks.append(chunk_headers[i] + match.group(1).strip())

    # Handle the last chunk, capturing until the end of the text
    last_header = chunk_headers[-1]
    last_header_escaped = re.escape(last_header)
    last_chunk_pattern = f"{last_header_escaped}(.*)"

    match = re.search(last_chunk_pattern, text, re.DOTALL)
    if match:
        chunks.append(last_header + match.group(1).strip())

    return chunks

In [21]:
# adapted to use my docs format
def generate(doc):

  text = doc.text.replace("\n", " ")
  text = re.sub(r'\s+', " ", text)
  text = text.strip()

  # print(text)

  prompt = """
Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.
### Instruction:
Split the given text into chunks. Use the format "CHUNK [index]: [head]" to respond, where "[index]" is the index of each chunk and "[head]" is the beginning of each chunk (up to 50 characters).
### Input:
{}
### Response:
""".lstrip()

  prompt = prompt.format(text)

  tokenized = tokenizer(prompt, return_tensors='pt')
  input_ids = tokenized['input_ids'].cuda()
  attention_mask = tokenized['attention_mask'].cuda()

  with torch.inference_mode():
      output = model.generate(
          input_ids=input_ids,
          attention_mask=attention_mask,
          generation_config=gen_config
      )

  chunk_headers = tokenizer.decode(
      output[0][len(input_ids[0]):],
      skip_special_tokens=True
  )

  chunks = extract_chunks(text, chunk_headers)

  doc.chunks["text-seg-lm"] = chunks

  return chunks

In [22]:
for doc in docs:
  generate(doc)

In [23]:
pickle_object(os.path.join(pickle_dir, "docs.pkl"), docs, "text-seg-lm")

In [24]:
docs[1].chunks.keys()

dict_keys(['segmentation-model', 'jina-segmenter-api', 'langchain_semantic', 'text-seg-lm'])

In [25]:
docs[1].questions

[{'question': 'What are the main improvements of Jina-ColBERT-v2 over the original ColBERT-v2 and jina-colbert-v1-en?',
  'answers': {}},
 {'question': 'How does Jina-ColBERT-v2 handle multilingual data and what languages does it support?',
  'answers': {}},
 {'question': 'What is Matryoshka Representation Learning and how does it benefit Jina ColBERT v2?',
  'answers': {}}]

## Get statistics

In [None]:
# work out average word count per string in list
def average_word_count(lst):
    total_words = sum(len(s.split(" ")) for s in lst)
    return round(total_words / len(lst))

In [None]:
# get longest and shortest strings from list
def get_longest_shortest(lst):
    longest = max(lst, key=len)
    shortest = min(lst, key=len)
    return longest, shortest

In [None]:
!pip -q install rich

In [None]:
from rich.table import Table
from rich.console import Console

In [None]:
for doc in docs:
  stats = []
  for chunk_type in doc.chunks.keys():
    data = {
        "Chunking method": chunk_type,
        "Average word count": average_word_count(doc.chunks[chunk_type]),
        "Longest": len(get_longest_shortest(doc.chunks[chunk_type])[0]),
        "Shortest": len(get_longest_shortest(doc.chunks[chunk_type])[1]),
    }
    stats.append(data)

  # Create a console object
  console = Console()
  table = Table(title=doc.filename)

  # Dynamically add columns based on the keys of the first dictionary
  # This ensures that all keys will be represented as columns
  for key in stats[0].keys():
      table.add_column(key, justify="right")

  # Add rows programmatically by iterating over the list of dictionaries
  for row in stats:
      table.add_row(*[str(value) for value in row.values()])

  # Display the table in the console
  console.print(table)

## Set up RAG system

- Embeddings: Jina Embeddings v2
- LLM: Mixtral 7b-instruct 0.1

In [31]:
%%capture
!pip install -q llama-index-embeddings-huggingface
!pip install -q llama-index llama-index-embeddings-jinaai llama-index-llms-huggingface "huggingface_hub[inference]"

In [30]:
%%capture
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

hf_embedding_model = HuggingFaceEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")

Some weights of BertModel were not initialized from the model checkpoint at jinaai/jina-embeddings-v2-base-en and are newly initialized: ['embeddings.position_embeddings.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.1.output.LayerNorm.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.1.output.dense.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.11.intermediate.dense.bias', 'encoder.layer.11.intermedi

In [34]:
# defined this earlier when generating questions

from llama_index.llms.huggingface import HuggingFaceInferenceAPI

mixtral_llm = HuggingFaceInferenceAPI(
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1", token=userdata.get("HF_TOKEN")
)




  mixtral_llm = HuggingFaceInferenceAPI(


In [36]:
from llama_index.core import Settings

Settings.llm = mixtral_llm
Settings.embed_model = hf_embedding_model

In [37]:
from llama_index.core import PromptTemplate

qa_prompt_tmpl = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query. Please be brief, concise, and complete.\n"
    "If the context information does not contain an answer to the query, "
    "respond with \"No information\"."
    "Query: {query_str}\n"
    "Answer: "
)
qa_prompt = PromptTemplate(qa_prompt_tmpl)

In [None]:


# llama_index_chunks = StringIterableReader().load_data(docs[0].chunks["segmentation-model"])

In [None]:

# index = VectorStoreIndex.from_documents(
#     documents=llama_index_chunks, service_context=Settings
# )



# Create indexes

In [39]:
from llama_index.core import VectorStoreIndex

class Index(BaseModel):
  name: str
  index: VectorStoreIndex
  questions: list[Dict[str, str]] = [] # store q and a here

  class Config:
    arbitrary_types_allowed = True

In [48]:
from llama_index.core.readers import StringIterableReader
from llama_index.core.schema import Document

def create_index(chunk_strategy, name, docs):
  all_docs_chunks = []
  for doc in docs:
    all_docs_chunks += doc.chunks[chunk_strategy]


  print(all_docs_chunks)
  # load chunks
  llama_index_chunks = StringIterableReader().load_data(all_docs_chunks)

  # index chunks
  index = VectorStoreIndex.from_documents(
      documents=llama_index_chunks, service_context=Settings
  )

  output = Index(name=name, index=index)

  print(f"{name} index: {len(output.index.docstore.docs)} items")

  return output

In [53]:
index_names = [
    "jina-segmenter-api",
    "segmentation-model",
    "langchain_semantic",
    "text-seg-lm",
]

indexes = []

jina_segmenter_api_index = create_index("jina-segmenter-api", "jina-segmenter-api", docs)

# indexes.append(jina_segmenter_api_index)
# # segmentation_model_index = create_index("segmentation-model")


In [58]:
indexes[1].index.docstore.docs
for index in indexes:
  print(index.name)

jina-segmenter-api
segmentation-model
langchain_semantic
text-seg-lm


In [52]:
indexes[0]

Index(name='jina-segmenter-api', index=<llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x7ff62579b6a0>, questions=[])

In [65]:
pickle_object(os.path.join(pickle_dir, "indexes.pkl"), indexes, "indexes")



In [None]:
# segmentation_model_index = create_index("segmentation-model")

# langchain_semantic_index = create_index("langchain_semantic")



jina-segmenter-api index: 52 items


In [None]:
indexes = [
    # segmentation_model_index,
    jina_segmenter_api_index,
    # langchain_semantic_index
    ]

In [68]:
# sanity check
assert indexes[0] != indexes[1]

In [69]:
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

# configure response synthesizer
response_synthesizer = get_response_synthesizer(
    # service_context=Settings,
    text_qa_template=qa_prompt,
    response_mode="compact",
)

In [None]:
# import numpy as np

# def cosine_similarity(vector1, vector2):
#     # Ensure the vectors are numpy arrays
#     vector1 = np.array(vector1)
#     vector2 = np.array(vector2)

#     # Compute dot product
#     dot_product = np.dot(vector1, vector2)

#     # Compute norms (magnitudes)
#     norm_vector1 = np.linalg.norm(vector1)
#     norm_vector2 = np.linalg.norm(vector2)

#     # Compute cosine similarity
#     if norm_vector1 == 0 or norm_vector2 == 0:
#         return 0  # Avoid division by zero
#     cosine_sim = dot_product / (norm_vector1 * norm_vector2)

#     return cosine_sim

In [None]:
# def get_answer(index_name, question, top_k=2):

#   # configure retriever
#   retriever = VectorIndexRetriever(
#       index=index_name,
#       similarity_top_k=top_k
#       )

#   # assemble query engine
#   query_engine = RetrieverQueryEngine(
#       retriever=retriever,
#       response_synthesizer=response_synthesizer,
#   )

#   question_emb = hf_embedding_model.get_text_embedding(question)
#   retrieved_texts = retriever.retrieve(question)

#   for i, result in enumerate(retrieved_texts):
#     print(f"Text {i+1}:\n\n{result.text}")
#     doc_id = result.id_
#     embedding = index_name.vector_store.get(doc_id)
#     cosine_sim = cosine_similarity(question_emb, embedding)
#     print("Cosine similarity: ", cosine_sim)
#     print("\n")
#     scores.append(cosine_sim)

#   print("---")

#     # return scores

In [None]:
# def get_scores(index_name, questions, top_k=2):
#   scores = []

#   # configure retriever
#   retriever = VectorIndexRetriever(
#       index=index_name,
#       similarity_top_k=top_k
#       )

#   # assemble query engine
#   query_engine = RetrieverQueryEngine(
#       retriever=retriever,
#       response_synthesizer=response_synthesizer,
#   )

#   for question in questions:
#     question_emb = hf_embedding_model.get_text_embedding(question)
#     retrieved_texts = retriever.retrieve(question)

#     for i, result in enumerate(retrieved_texts):
#       print(f"Text {i+1}:\n\n{result.text}")
#       doc_id = result.id_
#       embedding = index_name.vector_store.get(doc_id)
#       cosine_sim = cosine_similarity(question_emb, embedding)
#       print("Cosine similarity: ", cosine_sim)
#       print("\n")
#       scores.append(cosine_sim)

#     print("---")

#     # return scores

## Test answering LLM-generated questions

In [None]:
# def get_answers(doc, index, chunking_strategy, top_k=3):
#   answers = []

#   # configure retriever
#   retriever = VectorIndexRetriever(
#       index=index,
#       similarity_top_k=top_k
#       )

#   # assemble query engine
#   query_engine = RetrieverQueryEngine(
#       retriever=retriever,
#       response_synthesizer=response_synthesizer,
#   )

#   for question in doc.questions:
#     answer = query_engine.query(question)
#     answers.append(
#         {"question: ": question,
#          "answer": answer.response.strip(),
#          "strategy": chunking_strategy,
#         }
#     )

#   return answers

In [None]:
# # do it question by question easier?

# def get_answer(question, index, top_k=3):
#   # answers = []

#   # configure retriever
#   retriever = VectorIndexRetriever(
#       index=index,
#       similarity_top_k=top_k
#       )

#   # assemble query engine
#   query_engine = RetrieverQueryEngine(
#       retriever=retriever,
#       response_synthesizer=response_synthesizer,
#   )

#   # for question in doc.questions:
#   answer = query_engine.query(question).response.strip()
#     # answers.append(
#     #     {"question: ": question,
#     #      "answer": answer.response.strip(),
#     #      "strategy": chunking_strategy,
#     #     }
#     # )

#   return answer

In [76]:
def query_index(index, question, top_k=3):
# def query_index(index, doc, top_k=3):

  # answers = []

  # configure retriever
  retriever = VectorIndexRetriever(
      index=index.index,
      similarity_top_k=top_k
      )

  # assemble query engine
  query_engine = RetrieverQueryEngine(
      retriever=retriever,
      response_synthesizer=response_synthesizer,
  )

  # for question in doc.questions:
  answer = query_engine.query(question).response.strip()
    # answers.append(
    #     {"question: ": question,
    #      "answer": answer.response.strip(),
    #      "strategy": chunking_strategy,
    #     }
    # )

  index.questions.append(
      {"question": question,
      "answer": answer
       }
  )
  return answer

In [71]:
# all questions in one list

questions = []

for doc in docs:
  for question in doc.questions:
    questions.append(question["question"])

In [72]:
questions

['What are the main improvements of Jina-ColBERT-v2 over the original ColBERT-v2 and jina-colbert-v1-en?',
 'How does Jina-ColBERT-v2 handle multilingual data and what languages does it support?',
 'What is Matryoshka Representation Learning and how does it benefit Jina ColBERT v2?',
 'What are the challenges of the simple RAG pipeline of chunking-embedding-retrieving-generating?',
 'How does the Late Chunking approach differ from the naive chunking strategy in generating chunk embeddings?',
 'What is the correlation between the average length of documents and the effectiveness of late chunking in improving nDCG scores?',
 'What is the purpose of multimodal models in AI, and how do they differ from single-mode models?',
 "What is the 'modality gap' in multimodal models, and how does it affect the performance of CLIP-style models?",
 'What are the three major sources behind the modality gap, as identified by Liang et al. [2022]?',
 'What are the constraints of the zero-shot setting in t

In [77]:
for index in indexes:
  print(f"Asking {index.name}")
  for question in questions:
    print(f"- {question}")
    query_index(index, question)

Asking jina-segmenter-api
- What are the main improvements of Jina-ColBERT-v2 over the original ColBERT-v2 and jina-colbert-v1-en?
Asking jina-segmenter-api
- How does Jina-ColBERT-v2 handle multilingual data and what languages does it support?
Asking jina-segmenter-api
- What is Matryoshka Representation Learning and how does it benefit Jina ColBERT v2?
Asking jina-segmenter-api
- What are the challenges of the simple RAG pipeline of chunking-embedding-retrieving-generating?
Asking jina-segmenter-api
- How does the Late Chunking approach differ from the naive chunking strategy in generating chunk embeddings?
Asking jina-segmenter-api
- What is the correlation between the average length of documents and the effectiveness of late chunking in improving nDCG scores?
Asking jina-segmenter-api
- What is the purpose of multimodal models in AI, and how do they differ from single-mode models?
Asking jina-segmenter-api
- What is the 'modality gap' in multimodal models, and how does it affect th

In [78]:
pickle_object(os.path.join(pickle_dir, "indexes.pkl"), indexes, "qna")



In [None]:
# for doc in docs:
#   if doc.questions:
#     for q in doc.questions:
#       # print(q)
#       # print(q["question"])
#       # foo = get_answer("foo", jina_segmenter_api_index)
#       # foo = get_answer(q["question"], jina_segmenter_api_index)
#       q["answers"]["jina-segmenter-api"] = get_answer(q["question"], jina_segmenter_api_index)
#       # q["answer"]["chunking_strategy"] = "jina-segmenter-api"

In [None]:
# from pprint import pprint
# for question in docs[1].questions:
#   print(question["question"])
#   pprint(question["answers"])

#   print("---")

What are the main improvements of Jina-ColBERT-v2 over the original ColBERT-v2 and jina-colbert-v1-en?
{'jina-segmenter-api': 'Jina-ColBERT-v2 introduces several improvements over '
                       'the original ColBERT-v2 and jina-colbert-v1-en:\n'
                       '\n'
                       '1. Dynamic Vocabulary: Jina-ColBERT-v2 uses a dynamic '
                       'vocabulary, which allows it to handle '
                       'out-of-vocabulary words more effectively.\n'
                       '\n'
                       '2. Adaptive Hard Negative Sampling: This technique '
                       'helps the model to focus on harder negative samples '
                       'during training, leading to better performance.\n'
                       '\n'
                       '3. Support for Multilingual Search: Jina-ColBERT-v2 '
                       'supports multilingual search, enabling users to search '
                       'in multiple languages.\n'
       

## Evaluate answers

In [None]:
def evaluate_answers(docs):
  results = []
  for doc in docs:

    questions_and_answers_string = ""
    for i, question in enumerate(doc.questions):
      print(f"Question {i}: {question['question'].upper()}")
      questions_and_answers_string += f"Question {i}: {question['question'].upper()}"
      for key in question["answers"].keys():
        print(key)
        print(questions["answers"][key])

    results.append(questions_and_answers_string)

  return results

    # prompt = f"""
    # Your job is to evaluate three students who are answering questions based on a text. The text is as follows:

    # <begin text>
    # {doc.text}
    # <end text>

    # Here is each question and the answer from the students. Which student provided the most accurate and concise answer to the question?

    # <begin questions>
    # {doc.questions}
    # <end questions>
    # """
    # # Generate {count} technical question(s) about the given text that the text itself answers. Use this format:

    # #     [
    # #         "What are the key differences between dense and sparse retrieval methods in RAG systems?",
    # #         "How does a RAG model handle the integration of retrieved documents during the generation process?",
    # #         "What techniques can be used to optimize the retrieval phase in a RAG system for large-scale datasets?"
    # #     ]

    # # Present your output in only a structured JSON list of strings, with no other output or markdown formatting. Provide only the questions. Do not provide answers or context. Do not wrap your output in backticks. Text is as follows:

    # # {doc.text}
    # # """

    # response = mixtral_llm.complete(prompt)
    # # print(response)
    # # print(type(response))
    # print(response.json())

    # raw_output = response.text.strip()
    # # print(raw_output)

    # if raw_output[0] == '`':
    #   print("Code fencing detected. Fixing it")
    #   raw_output = raw_output.splitlines()[1:-1]
    #   raw_output = "\n".join(raw_output)

    # print(raw_output)

    # try:
    #   questions = json.loads(raw_output)
    # except:
    #   print("Failed to convert output to JSON")

    # [question.strip() for question in questions]

    # for question in questions:
    #   doc.questions.append({"question": question, "answers": {}})

    # # doc.questions = questions

In [None]:
output = evaluate_answers(docs)

Question 0: WHAT ARE THE MAIN IMPROVEMENTS OF JINA-COLBERT-V2 OVER THE ORIGINAL COLBERT-V2 AND JINA-COLBERT-V1-EN?
jina-segmenter-api


TypeError: 'NoneType' object is not subscriptable

In [None]:
doc0_answers = get_answers(docs[0], jina_segmenter_api_index, "jina-segmenter-api")

In [None]:
doc0_answers

[{'question: ': 'What are the key differences between the two-stage training process for Reader-LM models?',
  'answer': 'The two-stage training process for Reader-LM models involves first training a language model on a large corpus of text, followed by fine-tuning the model on a smaller dataset of HTML-to-Markdown pairs. The key difference is that in the first stage, the model is trained to predict the next token in a sequence, while in the second stage, the model is fine-tuned to classify tokens as either `1` or `0`, depending on whether they exist in both the input and output. This two-stage process allows the model to learn general language patterns in the first stage, and then specialize to the HTML-to-Markdown task in the second stage.',
  'strategy': 'jina-segmenter-api'},
 {'question: ': 'How does the data preparation process for Reader-LM models ensure high-quality training data?',
  'answer': 'The context information does not provide specific details on how the data preparati

In [None]:
# examine returned chunks
for question in docs[0].questions:
  retrieved_texts = retriever.retrieve(question)
  print(question.upper())
  for i, rt in enumerate(retrieved_texts):
    print(f"Text {i+1}:\n\n{rt.text}\n\n")

WHAT ARE THE KEY DIFFERENCES BETWEEN THE TWO-STAGE TRAINING PROCESS FOR READER-LM MODELS?
Text 1:

[](https://jina-ai-gmbh.ghost.io/content/images/2024/09/Qualitative-Evaluation-of-Reader-LM-vs-LLMs-and-Jina-Reader-API--1-.svg)


Text 2:

In the early stages of this project, we explored using an encoder-only architecture to tackle this task. As mentioned earlier, the HTML-to-Markdown conversion task appears to be primarily a "selective-copy" task. Given a training pair (raw HTML and markdown), we can label tokens that exist in both the input and output as `1`, and the rest as `0`. This converts the problem into a token classification task, similar to what is used in Named Entity Recognition (NER).


Text 3:

4.  **Markdown Syntax Usage**: Evaluated each model’s ability to correctly convert HTML elements such as `<a>` (links), `<strong>` (bold text), and `<em>` (italics) into their appropriate markdown equivalents.


HOW DOES THE DATA PREPARATION PROCESS FOR READER-LM MODELS ENSURE HIGH

In [None]:
# all_scores = []

# for idx in indexes:
#   scores = get_answer(idx, questions)
#   all_scores.append(scores)
#   # print(idx)

In [None]:
# get_answer(jina_segmenter_api_index, questions)

Text 1:

Try It Out                                                18
Cosine similarity:  0.5623142088441102


Text 2:

bases. Some of the subjects of particular interest to farmers include
Cosine similarity:  0.5026040589389831


---


## Why segmentation model so bad?

Let's check index

In [None]:
# os.makedirs("chunks", exist_ok=True)

# for doc in docs:
#   with open(f"./chunks/{doc.filename}", "w") as file:
#     for chunk_strategy in doc.chunks.keys():
#       text = ""
#       text += f"=== {chunk_strategy.upper()} - {len(docs[0].chunks[chunk_strategy])} chunks ===\n\n"
#       # text += "\n---\n"
#       for item in doc.chunks[chunk_strategy]:
#         text += item
#         text += "\n---\n"
#       file.write(text)

In [None]:
# with open("segmentation_model_chunks_farming.txt", "w") as file:
#   text = ""
#   for item in segmentation_model_index.docstore.docs.items():
#     text += item[1].text
#     text += "\n---\n"
#   file.write(text)

The Project Gutenberg eBook of Computers on the Farm This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.
Title: Computers on the Farm Author: Deborah Takiff Smith Release date: April 20, 2019 [eBook #59316] Language: English Credits: Produced by Tom Cosmas compiled from images provided by The Internet Archive *** START OF THE PROJECT GUTENBERG EBOOK COMPUTERS ON THE FARM *** Produced by Tom Cosmas compiled from images provided by The Internet Archive Transcribers Note Text emphasis denoted as _Italics_ and =Bold=.
Computers on the Farm Farm Uses for Computers, How to Select Software and Hardware, and O