# text-seg-lm
Model list:
*   jinaai/text-seg-lm-qwen2-0.5b
*   jinaai/text-seg-lm-qwen2-0.5b-cot-topic-chunking



## Original code

This is pretty much what Andrei provided originally.

In [None]:
# Note: this might prompt you to restart your session, just restart it if that's the case
!pip install torch triton xformers

In [None]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes

In [1]:
from huggingface_hub import notebook_login
notebook_login()


In [6]:
# We load our own docs later in the notebook, so this reader helper is disabled.

# import requests
#
# def read(url):
#     url = f"https://r.jina.ai/{url}"
#     headers = {
#         "X-Return-Format": "text",
#         "X-No-Cache": "true",
#         "X-Timeout": "1000"
#     }
#     response = requests.get(url, headers=headers)
#     return response.text

In [2]:
from unsloth import FastLanguageModel # type: ignore
from transformers import GenerationConfig
import torch

import urllib.parse
import requests  # type: ignore
import re

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [71]:
max_seq_length = 8192
max_new_tokens = 1024
load_in_4bit = True

models = {
    "simple_chunking": "jinaai/text-seg-lm-qwen2-0.5b",
    "cot_topic_chunking": "jinaai/text-seg-lm-qwen2-0.5b-cot-topic-chunking",
    "summary_chunking": "jinaai/text-seg-lm-qwen2-0.5b-summary-chunking"
}

simple_chunking_prompt = """
Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.
### Instruction:
Split the given text into chunks. Use the format "CHUNK [index]: [head]" to respond, where "[index]" is the index of each chunk and "[head]" is the beginning of each chunk (up to 50 characters).
### Input:
{}
### Response:
""".lstrip()

cot_topic_chunking_prompt = """
Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.
### Instruction:
Identify the topics in the given text in this format:
### TOPICS:
TOPIC [index]: [topic]
Then, split the text into chunks in this format:
### CHUNKS:
CHUNK [index]: [chunk_head]
...
Pay attention to these details:
1. Topics should short and concise.
2. Chunk heads should be the begining text of each chunk, up to 50 characters long.

The topics and chunks should be in the same order they appear in the original text.
### Input:
{}
### Response:
""".lstrip()

summary_chunking_prompt = """
Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.
### Instruction:
Split the given text into chunks and generate a summary for each chunk.
Respond in the following format:
CHUNK 0
SUMMARY: [chunk_summary]
HEAD: [chunk_head]
CHUNK 1
SUMMARY: [chunk_summary]
HEAD: [chunk_head]
... and so on ...
Pay attention to these details:
1. Summaries should be one sentence long.
2. Chunk heads should be the begining text of each chunk, up to 50 characters long.
### Input:
{}
### Response:
""".lstrip()

prompts = {
    "simple_chunking": simple_chunking_prompt,
    "cot_topic_chunking": cot_topic_chunking_prompt,
    "summary_chunking": summary_chunking_prompt
}

extraction_regex = {
    "simple_chunking": r'CHUNK \d+:\s*(.*)',
    "cot_topic_chunking": r'CHUNK \d+:\s*(.*)',
    "summary_chunking": r'HEAD: (.*)'
}
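
To make the role of `extraction_regex` concrete, here is a minimal sketch that applies the `cot_topic_chunking` pattern to a made-up response. The `sample_response` string is an assumption written to follow the prompt's format; it is not real model output.

```python
import re

# Same pattern as extraction_regex["cot_topic_chunking"] above.
chunk_head_pattern = r'CHUNK \d+:\s*(.*)'

# Hypothetical response in the format the prompt asks for (not real model output).
sample_response = (
    "### TOPICS:\n"
    "TOPIC 0: Introduction\n"
    "TOPIC 1: Results\n"
    "### CHUNKS:\n"
    "CHUNK 0: Late chunking is a method for\n"
    "CHUNK 1: We evaluated the approach on\n"
)

# Only "CHUNK <n>:" lines match; TOPIC lines and the "### CHUNKS:" header do not.
heads = re.findall(chunk_head_pattern, sample_response)
print(heads)
```

Each captured head is later used to locate where a chunk begins in the input text.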

In [4]:
# # Three options: simple_chunking, cot_topic_chunking, summary_chunking
# model_version = "cot_topic_chunking"

# model_name = models[model_version]

# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = model_name,
#     max_seq_length = max_seq_length,
#     dtype = None,
#     load_in_4bit = load_in_4bit,
# )

# FastLanguageModel.for_inference(model)

# gen_config = GenerationConfig.from_pretrained(
#     "unsloth/Qwen2-0.5B-Instruct-bnb-4bit",
#     max_length=8192,
#     max_new_tokens=max_new_tokens,
# )

# prompt_template = prompts[model_version]
# regex = extraction_regex[model_version]

==((====))==  Unsloth 2024.9: Fast Qwen2 patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.659 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth 2024.9 patched 24 layers with 0 QKV layers, 24 O layers and 24 MLP layers.


In [64]:
def extract_chunks(text, chunk_headers_raw):
    """Split `text` into chunks using the chunk heads found in the model output.

    Relies on the global `regex` set for the current model version.
    """
    chunk_headers = re.findall(regex, chunk_headers_raw)
    if not chunk_headers:
        # Fail with a clear message instead of an IndexError further down,
        # so the caller can record the failure and rerun later
        raise ValueError("no chunk heads found in model output")

    # print("\n\n")
    # print("Chunk headers:")
    # print(chunk_headers)

    chunks = []
    # Each chunk spans from its own head up to the next head
    for i in range(len(chunk_headers) - 1):
        current_header_escaped = re.escape(chunk_headers[i])
        next_header_escaped = re.escape(chunk_headers[i + 1])
        pattern = f"{current_header_escaped}(.*?){next_header_escaped}"
        match = re.search(pattern, text, re.DOTALL)
        if match:
            chunks.append(chunk_headers[i] + match.group(1).strip())

    # Handle the last chunk, capturing until the end of the text
    last_header = chunk_headers[-1]
    last_header_escaped = re.escape(last_header)
    last_chunk_pattern = f"{last_header_escaped}(.*)"

    match = re.search(last_chunk_pattern, text, re.DOTALL)
    if match:
        chunks.append(last_header + match.group(1).strip())

    return chunks
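
`extract_chunks` searches the whole text for every pair of heads, so a head that also occurs earlier in the document can match in the wrong place. One possible variant, sketched here under the assumption that heads appear in document order, uses `str.find` with a moving cursor. The name `split_by_heads` is hypothetical and this is not the original implementation.

```python
def split_by_heads(text, heads):
    """Split text at each head, searching forward from the previous match."""
    starts = []
    cursor = 0
    for head in heads:
        pos = text.find(head, cursor)
        if pos == -1:
            # Head not found verbatim (e.g. the model paraphrased it); skip it.
            continue
        starts.append(pos)
        cursor = pos + len(head)
    # Each chunk runs from its start to the next start (or the end of the text).
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


text = "Intro to chunking. More intro. Method details here. Results follow."
heads = ["Intro to chunking", "Method details", "Results follow"]
chunks = split_by_heads(text, heads)
print(chunks)
```

Unlike the regex version, this drops nothing between two found heads, and it returns an empty list rather than raising when no head is found.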

In [61]:
def generate(text):
    # Collapse all whitespace (including newlines) so the text is a single line;
    # chunk heads returned by the model are matched against this normalized text
    text = re.sub(r'\s+', " ", text)
    text = text.strip()

    prompt = prompt_template.format(text)
    # print(prompt)

    tokenized = tokenizer(prompt, return_tensors='pt')
    input_ids = tokenized['input_ids'].cuda()
    attention_mask = tokenized['attention_mask'].cuda()

    with torch.inference_mode():
        output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=gen_config
        )

    # Decode only the newly generated tokens, skipping the prompt
    result = tokenizer.decode(
        output[0][len(input_ids[0]):],
        skip_special_tokens=True
    )

    # print(result)

    chunks = extract_chunks(text, result)

    return chunks
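
The whitespace normalization at the top of `generate` matters because chunks are later cut from the same normalized string that the model saw. A minimal sketch of that step in isolation follows; the helper name `normalize_whitespace` and the sample string are assumptions for illustration.

```python
import re


def normalize_whitespace(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


raw = "  Title\n\nFirst   paragraph.\nSecond\tparagraph. "
clean = normalize_whitespace(raw)
print(clean)
```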

In [None]:
# text = read("https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown")
# chunks = generate(text)
# print("Chunks:")
# for chunk in chunks:
#   print("\n\n")
#   print(chunk)

In [None]:
# text = read("https://jina.ai/news/jina-reranker-v2-for-agentic-rag-ultra-fast-multilingual-function-calling-and-code-search/")
# chunks = generate(text)
# print("Chunks:")
# for chunk in chunks:
#   print("\n\n")
#   print(chunk)

## Adapted

This adapts _some_ of the code above (in the "Original code" section) to work better when looping over models and docs, and to conform to the `BlogPost` class standard.
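
The `BlogPost` class itself is not shown in this notebook. The sketch below is only a guess at its shape, inferred from the attributes the code here reads and writes (`filename`, `text`, `chunks`); the real class may have more fields or a different definition.

```python
from dataclasses import dataclass, field


@dataclass
class BlogPost:
    # Hypothetical stand-in: fields inferred from how docs are used in this notebook.
    filename: str
    text: str
    # Maps a model version (e.g. "simple_chunking") to its list of chunks.
    chunks: dict = field(default_factory=dict)


post = BlogPost(filename="example-post", text="Some text.")
post.chunks["simple_chunking"] = ["Some text."]
print(post.chunks)
```

When unpickling real docs, the actual class definition has to be importable, so this stand-in is for illustration only.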

In [87]:
# load pre-populated docs (text already there)

import pickle
with open("docs-populated.pkl", "rb") as file:
    docs = pickle.load(file)

In [88]:
# different versions of qwen model for chunking in different ways
model_versions = [
    "simple_chunking",
    "cot_topic_chunking", 
    "summary_chunking"
]

In [92]:
errors = []

for model_version in model_versions:
    print(f"=== {model_version.upper()} ===")
    
    model_name = models[model_version]
    
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = model_name,
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = load_in_4bit,
    )
    
    FastLanguageModel.for_inference(model)
    
    gen_config = GenerationConfig.from_pretrained(
        "unsloth/Qwen2-0.5B-Instruct-bnb-4bit",
        max_length=8192,
        max_new_tokens=max_new_tokens,
    )
    
    prompt_template = prompts[model_version]
    regex = extraction_regex[model_version]

    for doc in docs:
        try:
            print(doc.filename)
            # Skip docs that already have chunks for this model version, so the
            # cell can be rerun after failures (CoT chunking sometimes fails)
            if model_version not in doc.chunks:
                doc.chunks[model_version] = generate(doc.text)
        except Exception as e:
            # Record which model/doc failed and why, then carry on
            errors.append(f"{model_version} - {doc.filename}: {e!r}")

print(errors)

=== SIMPLE_CHUNKING ===
==((====))==  Unsloth 2024.9: Fast Qwen2 patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3090. Max memory: 23.659 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown
jina-colbert-v2-multilingual-late-interaction-retriever-for-embedding-and-reranking
late-chunking-in-long-context-embedding-models
the-what-and-why-of-text-image-modality-gap-in-clip-models
by-hoovering-up-the-web-ai-is-poisoning-itself
what-we-learned-at-icml2024-ft-plag-xrm-tinybenchmark-magiclens-prompt-sketching-etc
jina-embeddings-and-reranker-on-azure-scalable-business-ready-ai-solutions
having-it-both-ways-combining-bm25-with-ai-reranking
smaller-faster-cheaper-jina-rerankers-turbo-and-tiny
enhancing-search-results

In [93]:
for doc in docs:
    print("="*3, doc.filename)
    for key in doc.chunks.keys():
        print(key, ":", len(doc.chunks[key]))

=== reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown
simple_chunking : 1
cot_topic_chunking : 12
summary_chunking : 12
=== jina-colbert-v2-multilingual-late-interaction-retriever-for-embedding-and-reranking
simple_chunking : 6
cot_topic_chunking : 7
summary_chunking : 7
=== late-chunking-in-long-context-embedding-models
simple_chunking : 4
cot_topic_chunking : 9
summary_chunking : 7
=== the-what-and-why-of-text-image-modality-gap-in-clip-models
simple_chunking : 9
cot_topic_chunking : 11
summary_chunking : 8
=== by-hoovering-up-the-web-ai-is-poisoning-itself
simple_chunking : 6
cot_topic_chunking : 5
summary_chunking : 7
=== what-we-learned-at-icml2024-ft-plag-xrm-tinybenchmark-magiclens-prompt-sketching-etc
simple_chunking : 1
cot_topic_chunking : 6
summary_chunking : 3
=== jina-embeddings-and-reranker-on-azure-scalable-business-ready-ai-solutions
simple_chunking : 6
cot_topic_chunking : 4
summary_chunking : 2
=== having-it-both-ways-combining-bm25-with-ai-
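
In the counts above, a result of 1 chunk for a full blog post suggests the model returned only one usable head. A small helper along these lines could flag such cases for a rerun; the function name `find_suspect_results` and the `min_chunks` threshold are assumptions, and the sample counts are copied from three of the docs listed above.

```python
def find_suspect_results(chunk_counts, min_chunks=2):
    """Return (filename, model_version) pairs with fewer than min_chunks chunks."""
    return [
        (name, version)
        for name, counts in chunk_counts.items()
        for version, n in counts.items()
        if n < min_chunks
    ]


# Counts copied from the output above for three of the docs.
chunk_counts = {
    "reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown": {
        "simple_chunking": 1, "cot_topic_chunking": 12, "summary_chunking": 12,
    },
    "late-chunking-in-long-context-embedding-models": {
        "simple_chunking": 4, "cot_topic_chunking": 9, "summary_chunking": 7,
    },
    "what-we-learned-at-icml2024-ft-plag-xrm-tinybenchmark-magiclens-prompt-sketching-etc": {
        "simple_chunking": 1, "cot_topic_chunking": 6, "summary_chunking": 3,
    },
}

suspects = find_suspect_results(chunk_counts)
print(suspects)
```

With real docs, the same counts could be built from `len(doc.chunks[key])` for each doc and key.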

In [94]:
with open("docs-qwen-chunks.pkl", "wb") as file:
    pickle.dump(docs, file)