
# Article Chunking with `unstructured`

We plan to use [unstructured](https://unstructured.io/) as our primary chunking strategy for the article body content. We expect to change the arguments of the unstructured [partitioning](https://docs.unstructured.io/open-source/core-functionality/partitioning) functions in future iterations, as we improve either our dataset curation for pre-training or fine-tuning, or the chunking strategy for our VS index.

**NOTE**: Since we are working with XML data, we are going to use the [partition-xml](https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-xml) function. Many libraries can make use of the XML tags we left in our body column, and the tags can be excluded easily with a regex or an open-source XML parsing library. We therefore left the XML in the body to allow for the discovery of new or different parsing strategies in the future.
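
As a rough illustration of how easily the tags can be excluded, here is a minimal sketch using only the Python standard library. The helper name `strip_xml_tags` and the sample fragment are hypothetical and not part of this pipeline; the synthetic root element with the `xlink` namespace mirrors the wrapper used in the cell further down.

```python
import xml.etree.ElementTree as ET

def strip_xml_tags(xml_fragment: str) -> str:
    """Return only the text content of an XML fragment (hypothetical helper)."""
    # Wrap in a synthetic root so multi-element fragments and xlink attributes parse.
    wrapped = (
        '<root xmlns:xlink="http://www.w3.org/1999/xlink">'
        + xml_fragment
        + "</root>"
    )
    root = ET.fromstring(wrapped)
    # itertext() yields element text and tails in document order, without tags.
    text = "".join(root.itertext())
    # Collapse the whitespace runs left behind by the removed markup.
    return " ".join(text.split())

# Made-up sample fragment, for illustration only.
sample = "<sec><title>Methods</title> <p>We used <italic>in vitro</italic> assays.</p></sec>"
print(strip_xml_tags(sample))  # Methods We used in vitro assays.
```

Text from adjacent elements is concatenated without a separator, so inline markup such as subscripts stays attached to its word, while block elements with no whitespace between them would run together.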

**NOTE**: Yes, we could have used the [partition-xml](https://docs.unstructured.io/open-source/core-functionality/partitioning#partition-xml) function to parse from file instead of from the `curated_articles` Delta table. As with the note above, we chose the table to make future iterative improvements faster, since reading text from files in blob storage has a much larger I/O performance cost. This was a deliberate architecture decision for future enhancements, not just a way to conform to a [Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture), although we are doing that as well.

In [0]:
%pip install unstructured
%pip install databricks_genai_inference
dbutils.library.restartPython()

In [0]:
from unstructured.partition.xml import partition_xml

# We'll just collect a few body records and parse one locally to see how partitioning behaves
xml_bodies = spark.sql("SELECT body FROM `pubmed-pipeline`.curated.articles_xml limit 5").collect()
# TODO, add xlink back from original source
xml_body = '<root xmlns:xlink="http://www.w3.org/1999/xlink">'+ \
           xml_bodies[0][0] + \
           '</root>'

print(xml_body)

In [0]:
?partition_xml

In [0]:
from unstructured.partition.xml import partition_xml
from datetime import datetime

# Create a datetime object representing the last modified time
# TODO: update to pull from source (or determine safe to exclude)
last_modified_time = datetime(2023, 5, 29, 15, 30)

#TODO: check if we have multi-language sources

body_parts = partition_xml(text=xml_body,
                           xml_keep_tags = False,
                           encoding='utf-8',
                           include_metadata=False,
                           languages=['eng',],
                           date_from_file_object=None,
                           chunking_strategy='by_title',
                           multipage_sections=True,
                           combine_text_under_n_chars=275,
                           new_after_n_chars=500,
                           max_characters=520)

In [0]:
# Let's confirm that the chunk lengths are distributed as expected
import matplotlib.pyplot as plt
import numpy as np

lengths = [len(element.text) for element in body_parts]
lengths.sort()
lengths

# Calculate the cumulative probabilities
cumulative_probs = np.arange(1, len(lengths) + 1) / len(lengths)

# Plot the empirical distribution
plt.plot(lengths, cumulative_probs, marker='o')
plt.xlabel('Length')
plt.ylabel('Cumulative Probability')
plt.title('Empirical Distribution of Chunk Lengths')
plt.grid(True)
plt.show()


This distribution was achieved with minimal arguments and modifications. However, we can see that some chunks still have fewer than 100 characters. We have to make a chunking strategy decision: either drop those chunks or force them into larger chunks with a follow-on process. We'll choose the former, which results in the following chunks for a single body example:

In [0]:
# Drop chunks shorter than 100 characters, per the decision described above
chunks = [e.text for e in body_parts if len(e.text) >= 100]
chunks
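
For comparison, the other option mentioned above (forcing short chunks into larger ones) could look roughly like the sketch below. The function `merge_short_chunks`, its defaults, and the example strings are assumptions made for illustration; this is not something the pipeline currently does.

```python
from typing import List

def merge_short_chunks(chunks: List[str], min_chars: int = 100, sep: str = " ") -> List[str]:
    """Fold any chunk shorter than min_chars into a neighbor (hypothetical helper)."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and len(chunk) < min_chars:
            # Append the short chunk to the chunk that precedes it.
            merged[-1] = merged[-1] + sep + chunk
        else:
            merged.append(chunk)
    # A short leading chunk has no predecessor, so fold it forward instead.
    if len(merged) > 1 and len(merged[0]) < min_chars:
        merged[1] = merged[0] + sep + merged[1]
        merged.pop(0)
    return merged

# Made-up example chunks, for illustration only.
example = ["Intro.", "A" * 120, "Tiny tail.", "B" * 150]
result = merge_short_chunks(example)
print([len(c) for c in result])  # [138, 150]
```

Note that merging can push a chunk past the `max_characters` limit passed to `partition_xml`, which is one reason dropping short chunks is the simpler choice here.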


It is also common to use LLMs to improve chunks, for example by summarizing them. That is left for a future improvement, but here is an example using a Databricks foundation model:

In [0]:
from databricks_genai_inference import ChatCompletion
import os

os.environ['DATABRICKS_HOST'] = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().getOrElse(None)
os.environ['DATABRICKS_TOKEN'] = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().getOrElse(None)

def dbrx_summarize(chunk: str):
    return ChatCompletion.create(model="databricks-dbrx-instruct",
                                 messages=[{"role": "system", "content": "You are a researcher who wants to summarize user text without losing technical detail. Respond to the users with only a summary of the content they provide. Try to summarize with not more than 500 characters."},
                                          {"role": "user","content": chunk}],
                                max_tokens=600).message


In [0]:
summarized_chunks = [dbrx_summarize(c) for c in chunks]

In [0]:
summarized_chunks[:3]
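
The system prompt asks for at most 500 characters, but a model may not always comply. One possible safeguard, sketched here under that assumption, is to fall back to the original chunk whenever the summary is empty or over budget. The names `summarize_or_keep` and `fake_summarize` are hypothetical; in the notebook, `dbrx_summarize` would be passed in place of the stand-in.

```python
from typing import Callable, List

def summarize_or_keep(
    chunks: List[str],
    summarize: Callable[[str], str],
    max_chars: int = 500,
) -> List[str]:
    """Summarize each chunk, keeping the original when the summary is unusable."""
    results: List[str] = []
    for chunk in chunks:
        summary = summarize(chunk).strip()
        # Accept the summary only if it is non-empty and within the character budget.
        if summary and len(summary) <= max_chars:
            results.append(summary)
        else:
            results.append(chunk)
    return results

# Stand-in summarizer for illustration only; it simply truncates the text.
def fake_summarize(text: str) -> str:
    return text[:20]

demo = summarize_or_keep(["a" * 50, "short"], fake_summarize, max_chars=10)
print([len(c) for c in demo])  # [50, 5]
```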