
feat/ enable partition_json to partition any json #1038

Open
Coniferish opened this issue Aug 3, 2023 · 9 comments
Labels
enhancement New feature or request json Related to partitioning JSON

Comments

@Coniferish
Collaborator

Currently, partition_json is intended only for deserializing unstructured's own JSON outputs/elements and is not included as a file format we accept for partitioning (see here).

The goal of this issue is to make partition_json work for any JSON file (probably similar to how partition_xml works).
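
For context, a minimal sketch of the current behavior (partition_json, partition_text, and elements_to_json are part of the public API; the exact failure mode on arbitrary JSON may vary by version):

```python
from unstructured.partition.json import partition_json
from unstructured.partition.text import partition_text
from unstructured.staging.base import elements_to_json

# Round-tripping unstructured's own output works today...
elements = partition_text(text="Hello world.")
serialized = elements_to_json(elements)
roundtripped = partition_json(text=serialized)

# ...but an arbitrary JSON payload does not, because it doesn't match the
# list-of-element-dicts schema partition_json expects.
partition_json(text='{"title": "A post", "body": "Some text"}')  # fails
```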

@Coniferish Coniferish added the enhancement New feature or request label Aug 3, 2023
@scanny scanny self-assigned this Nov 27, 2023
@scanny scanny added the json Related to partitioning JSON label Nov 27, 2023
@lironrisk

any news?

@scanny scanny removed their assignment Jan 11, 2024
@Coniferish
Collaborator Author

> any news?

Hey @lironrisk !
Apologies for the delay. We've been really busy with the launch of the paid API. @scanny has some preliminary code for this, but it is lower priority than the chunking improvements we have coming up. It has been added to our roadmap for Q2.

@orlandounstructured

orlandounstructured commented Feb 8, 2024

Over 180 days old but keeping open due to addition to roadmap, as mentioned by @Coniferish

@adrianruchti

Hello unstructured team. We would be interested in the partition_json feature.

@apmavrin

So far, we have implemented a helper function that parses the JSON, wraps it into a list, and feeds it to Unstructured.IO, so it can be parsed with the current version.

The utility function:

```python
from typing import Dict, Optional

from google.protobuf.struct_pb2 import Struct


def struct_to_dict(struct: Struct, out: Optional[Dict] = None) -> Dict:
    """Recursively convert a protobuf Struct into a plain Python dict."""
    if out is None:
        out = {}
    for key, value in struct.items():
        if isinstance(value, Struct):
            out[key] = struct_to_dict(value)
        else:
            out[key] = value
    return out
```

The main function, where we receive the JSON and feed it to Unstructured:

```python
import json

# TL;DR: Unstructured can't handle a single JSON object; it needs a list of them.
parsed_list = [struct_to_dict(json_content_from_api)]
content_for_unstructured = json.dumps(parsed_list)
```
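
A hypothetical hand-off for the string built above; partition_json's text parameter is part of the public API, but whether this particular payload partitions usefully depends on whether its keys match the element schema partition_json currently expects:

```python
from pathlib import Path

from unstructured.partition.json import partition_json

# Option 1: pass the serialized string straight to partition_json.
elements = partition_json(text=content_for_unstructured)

# Option 2: write it under the ingest pipeline's input directory
# (hypothetical path) so a local connector can pick it up.
Path("data/api_payload.json").write_text(content_for_unstructured, encoding="utf-8")
```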

@adrianruchti

adrianruchti commented Jul 22, 2024

Thank you. Can the content_for_unstructured then be parsed by the unstructured partition function in the unstructured ingest pipeline?

```python
# Import paths per the unstructured ingest docs at the time; they may differ by version.
from unstructured.ingest.connector.local import SimpleLocalConfig
from unstructured.ingest.interfaces import (
    ChunkingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)
from unstructured.ingest.runner import LocalRunner

runner = LocalRunner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="local-ingest-output",
        num_processes=2,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=False,
    ),
    connector_config=SimpleLocalConfig(
        input_path="data",
        recursive=True,
    ),
    chunking_config=ChunkingConfig(
        chunk_elements=True,
    ),
    # writer=writer,
    # writer_kwargs={},
)

# Run the runner
runner.run()
```

@scanny
Collaborator

scanny commented Jul 22, 2024

@apmavrin @adrianruchti can you say a little more about your use case for this? It's not clear to me yet how a useful JSON partitioner would behave.

In particular:

  • A JSON file, assumedly a JSON array or JSON object, is arguably a data file, not a document. How would we want to discriminate between fields we want to partition and fields we want to ignore or perhaps add to metadata?
  • A JSON "document" is often the response from an HTTP API and will often contain a collection of objects, say perhaps a series of blog posts between two dates. How would it make sense to handle that? Ignore "header" information and process array items only?
  • How recurrent and stable are your use cases? Is it partitioning the same sort of JSON payload over and over (same schema, different day/data)? Or do you need to partition uncharacterized collections of JSON as encountered among a pile of other documents?

Any help you can give characterizing the use cases will help in developing a spec for something like this. Grateful for whatever help you can provide :)
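
To make the second bullet above concrete, a purely illustrative (hypothetical) payload where that decision is ambiguous:

```python
# Hypothetical HTTP API response: the narrative text lives in the array items,
# while the surrounding fields are arguably metadata or noise.
api_response = {
    "status": "ok",
    "query": {"from": "2024-01-01", "to": "2024-01-31"},
    "posts": [
        {"title": "Post one", "published": "2024-01-03", "body": "First post text ..."},
        {"title": "Post two", "published": "2024-01-17", "body": "Second post text ..."},
    ],
}
```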

@adrianruchti

@scanny The use case: 60 MB law documents (XML format). I was trying the unstructured XML parser and Azure Document Intelligence. The extraction and recognition of different element types was not satisfying with the XML parser, so I parsed the XML with lxml etree and saved it as JSON. This worked well with Azure Document Intelligence, which can partition and chunk JSON. It would be nice to get the same from unstructured, as this would be a cheaper solution. By the way: unstructured converts documents to JSON after partitioning, so why not accept this format in your partitioner? Do you have any platform where I could share the big XML and the converted JSON file in private with you, Steve? (If you are interested.)
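
A hedged sketch of that lxml-based XML-to-JSON step (the file names and the dict layout are placeholders, not the exact conversion used):

```python
import json

from lxml import etree


def element_to_dict(el: etree._Element) -> dict:
    """Recursively convert an lxml element into a plain dict (layout is illustrative)."""
    node = {"tag": etree.QName(el).localname, "attrib": dict(el.attrib)}
    text = (el.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(c) for c in el if isinstance(c.tag, str)]  # skip comments/PIs
    if children:
        node["children"] = children
    return node


tree = etree.parse("law_document.xml")  # hypothetical 60 MB source file
with open("law_document.json", "w", encoding="utf-8") as f:
    json.dump(element_to_dict(tree.getroot()), f, ensure_ascii=False, indent=2)
```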

@scanny
Collaborator

scanny commented Jul 25, 2024

Hi @adrianruchti you can DM me on the Unstructured Slack channel or reach me at the email on my GitHub profile. No need for a huge document just yet but a modest sized one might be helpful.

The big question for me is schema information: should an XML partitioner take some schema descriptors to determine what to partition and how, should it be made to do the best it can without any schema information (like partitioning all the text it finds), or possibly both?

Just glancing at some LegalXML documents online, it looks like there is a lot of metadata and only a little narrative text. So if you can help me understand what a successful outcome in your case would be maybe that will help us noodle this a bit further.

Also if you can give a sense of the diversity of XML-vocabularies/schemas you need to partition, that would help in reasoning about it.

One aspect of an approach that occurs to me is applying a legal-document-type-specific XSLT transform to produce a "standardized" XML document that partitions in a well-known way, including adding whatever "extra" metadata you might want on the partitioned elements. Not sure how that fits in but thought I'd mention in the interest of brainstorming :)
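
A minimal sketch of that idea using lxml's XSLT support (the stylesheet name, target vocabulary, and file paths are placeholders):

```python
from lxml import etree

from unstructured.partition.xml import partition_xml

# Hypothetical stylesheet mapping a LegalXML vocabulary onto simple, well-known
# elements (titles, paragraphs) that partition_xml already handles.
transform = etree.XSLT(etree.parse("legalxml_to_simple.xsl"))
standardized = transform(etree.parse("law_document.xml"))

with open("standardized.xml", "w", encoding="utf-8") as f:
    f.write(str(standardized))

elements = partition_xml(filename="standardized.xml")
```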
