
feat/ enable partition_json to partition any json #1038

Open
Coniferish opened this issue Aug 3, 2023 · 9 comments
Labels
enhancement New feature or request json Related to partitioning JSON

Comments

@Coniferish
Collaborator

Currently, partition_json is intended only for deserializing unstructured's own JSON outputs/elements and is not included as a file format we accept for partitioning (see here).

The goal of this issue is to make partition_json work for any JSON file (probably similar to how partition_xml works).
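
For context, a minimal sketch of the current behavior (partition_json, partition_text, and elements_to_json are part of the public API; the exact failure mode on arbitrary JSON may vary by version):

```python
from unstructured.partition.json import partition_json
from unstructured.partition.text import partition_text
from unstructured.staging.base import elements_to_json

# Round-tripping unstructured's own output works today...
elements = partition_text(text="Hello world.")
serialized = elements_to_json(elements)
roundtripped = partition_json(text=serialized)

# ...but an arbitrary JSON payload does not, because it doesn't match the
# list-of-element-dicts schema partition_json expects.
partition_json(text='{"title": "A post", "body": "Some text"}')  # fails
```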

@Coniferish Coniferish added the enhancement New feature or request label Aug 3, 2023
@scanny scanny self-assigned this Nov 27, 2023
@scanny scanny added the json Related to partitioning JSON label Nov 27, 2023
@lironrisk

any news?

@scanny scanny removed their assignment Jan 11, 2024
@Coniferish
Collaborator Author

> any news?

Hey @lironrisk !
Apologies for the delay. We've been really busy with the launch of the paid API. @scanny has some preliminary code for this, but it is lower priority than the chunking improvements we have coming up. It has been added to our roadmap for Q2.

@orlandounstructured

orlandounstructured commented Feb 8, 2024

Over 180 days old but keeping open due to addition to roadmap, as mentioned by @Coniferish

@adrianruchti

Hello unstructured team. We would be interested in the partition_json feature.

@apmavrin

So far, we have implemented a helper function that parses the JSON, wraps it into a list, and feeds it to Unstructured.IO, so it can be parsed with the current version.

The utility function:

```python
from typing import Dict, Optional

from google.protobuf.struct_pb2 import Struct


def struct_to_dict(struct: Struct, out: Optional[Dict] = None) -> Dict:
    """Recursively convert a protobuf Struct into a plain Python dict."""
    if out is None:
        out = {}
    for key, value in struct.items():
        if isinstance(value, Struct):
            out[key] = struct_to_dict(value)
        else:
            out[key] = value
    return out
```

The main function, where we receive the JSON and feed it to Unstructured:

```python
import json

# TL;DR: Unstructured can't handle a single JSON object; it needs a list of them.
parsed_list = [struct_to_dict(json_content_from_api)]
content_for_unstructured = json.dumps(parsed_list)
```
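
A hypothetical hand-off for the string built above; partition_json's text parameter is part of the public API, but whether this particular payload partitions usefully depends on whether its keys match the element schema partition_json currently expects:

```python
from pathlib import Path

from unstructured.partition.json import partition_json

# Option 1: pass the serialized string straight to partition_json.
elements = partition_json(text=content_for_unstructured)

# Option 2: write it under the ingest pipeline's input directory
# (hypothetical path) so a local connector can pick it up.
Path("data/api_payload.json").write_text(content_for_unstructured, encoding="utf-8")
```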

@adrianruchti

adrianruchti commented Jul 22, 2024

Thank you. Can the content_for_unstructured then be parsed by the unstructured partition function in the unstructured ingest pipeline?

```python
# Import paths per the unstructured ingest docs at the time; they may differ by version.
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner

runner = LocalRunner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="local-ingest-output",
        num_processes=2,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=False,
    ),
    connector_config=SimpleLocalConfig(
        input_path="data",
        recursive=True,
    ),
    chunking_config=ChunkingConfig(
        chunk_elements=True,
    ),
    # writer=writer,
    # writer_kwargs={},
)

# Run the runner
runner.run()
```

@scanny
Collaborator

scanny commented Jul 22, 2024

@apmavrin @adrianruchti can you say a little more about your use case for this? It's not clear to me yet how a useful JSON partitioner would behave.

In particular:

  • A JSON file, assumedly a JSON array or JSON object, is arguably a data file, not a document. How would we want to discriminate between fields we want to partition and fields we want to ignore or perhaps add to metadata?
  • A JSON "document" is often the response from an HTTP API and will often contain a collection of objects, say perhaps a series of blog posts between two dates. How would it make sense to handle that? Ignore "header" information and process array items only?
  • How recurrent and stable are your use cases? Is it partitioning the same sort of JSON payload over and over (same schema, different day/data)? Or do you need to partition uncharacterized collections of JSON as encountered among a pile of other documents?

Any help you can give characterizing the use cases will help in developing a spec for something like this. Grateful for whatever help you can provide :)
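
To make the second bullet above concrete, a purely illustrative (hypothetical) payload where that decision is ambiguous:

```python
# Hypothetical HTTP API response: the narrative text lives in the array items,
# while the surrounding fields are arguably metadata or noise.
api_response = {
    "status": "ok",
    "query": {"from": "2024-01-01", "to": "2024-01-31"},
    "posts": [
        {"title": "Post one", "published": "2024-01-03", "body": "First post text ..."},
        {"title": "Post two", "published": "2024-01-17", "body": "Second post text ..."},
    ],
}
```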

@adrianruchti

@scanny The use case: 60 MB law documents (XML format). I was trying the unstructured XML parser and Azure Document Intelligence. The extraction and recognition of different element types was not satisfying with the XML parser, so I parsed the XML with lxml etree and saved it as JSON. This worked well with Azure Document Intelligence, which can partition and chunk JSON. It would be nice to get the same from unstructured, as this would be a cheaper solution. By the way: unstructured converts documents to JSON after partitioning, so why not accept this format in your partitioner? Do you have any platform where I could share the big XML and the converted JSON file in private with you, Steve? (If you are interested.)
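
A hedged sketch of that lxml-based XML-to-JSON step (the file names and the dict layout are placeholders, not the exact conversion used):

```python
import json

from lxml import etree


def element_to_dict(el: etree._Element) -> dict:
    """Recursively convert an lxml element into a plain dict (layout is illustrative)."""
    node = {"tag": etree.QName(el).localname, "attrib": dict(el.attrib)}
    text = (el.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(c) for c in el if isinstance(c.tag, str)]  # skip comments/PIs
    if children:
        node["children"] = children
    return node


tree = etree.parse("law_document.xml")  # hypothetical 60 MB source file
with open("law_document.json", "w", encoding="utf-8") as f:
    json.dump(element_to_dict(tree.getroot()), f, ensure_ascii=False, indent=2)
```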

@scanny
Collaborator

scanny commented Jul 25, 2024

Hi @adrianruchti you can DM me on the Unstructured Slack channel or reach me at the email on my GitHub profile. No need for a huge document just yet but a modest sized one might be helpful.

The big question for me is schema information: should an XML partitioner take some schema descriptors to determine what to partition and how, should it be made to do the best it can without any schema information (like partitioning all the text it finds), or possibly both?

Just glancing at some LegalXML documents online, it looks like there is a lot of metadata and only a little narrative text. So if you can help me understand what a successful outcome in your case would be maybe that will help us noodle this a bit further.

Also if you can give a sense of the diversity of XML-vocabularies/schemas you need to partition, that would help in reasoning about it.

One aspect of an approach that occurs to me is applying a legal-document-type-specific XSLT transform to produce a "standardized" XML document that partitions in a well-known way, including adding whatever "extra" metadata you might want on the partitioned elements. Not sure how that fits in but thought I'd mention in the interest of brainstorming :)
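
A minimal sketch of that idea using lxml's XSLT support (the stylesheet name, target vocabulary, and file paths are placeholders):

```python
from lxml import etree

from unstructured.partition.xml import partition_xml

# Hypothetical stylesheet mapping a LegalXML vocabulary onto simple, well-known
# elements (titles, paragraphs) that partition_xml already handles.
transform = etree.XSLT(etree.parse("legalxml_to_simple.xsl"))
standardized = transform(etree.parse("law_document.xml"))

with open("standardized.xml", "w", encoding="utf-8") as f:
    f.write(str(standardized))

elements = partition_xml(filename="standardized.xml")
```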
