diff --git a/api-reference/how-to/generate-schema.mdx b/api-reference/how-to/generate-schema.mdx new file mode 100644 index 00000000..f071c975 --- /dev/null +++ b/api-reference/how-to/generate-schema.mdx @@ -0,0 +1,111 @@ +--- +title: Generate a JSON schema for a file +--- + +## Task + +You want to generate a schema for a JSON file that Unstructured produces, so that you can validate, test, +and document related JSON files across your systems. + +## Approach + +Use a Python package such as [genson](https://pypi.org/project/genson/) to generate schemas for your +JSON files. + +The `genson` package is not owned or supported by Unstructured. For questions and +requests, see the [Issues](https://github.com/wolverdude/genson/issues) tab of the +`genson` repository in GitHub. + +## Generate a schema from the terminal + + + + Use [pip](https://pip.pypa.io/en/stable/installation/) to install the [genson](https://pypi.org/project/genson/) package. + + ```bash + pip install genson + ``` + + + By default, `genson` generates the JSON schema as a single string without any line breaks or indented whitespace. + + To pretty-print the schema that `genson` produces, install the [jq](https://jqlang.github.io/jq/) utility. + + The `jq` utility is not owned or supported by Unstructured. For questions and + requests, see the [Issues](https://github.com/jqlang/jq/issues) tab of the + `jq` repository in GitHub. + + + 1. Run the `genson` command, specifying the path to the input (source) JSON file, and the path to + the output (target) JSON schema file to be generated. Use `jq` to pretty-print the schema's content + into the file to be generated. + + ```bash + genson "/path/to/input/file.json" | jq '.' > "/path/to/output/schema.json" + ``` + + 2. You can find the generated JSON schema file in the output path that you specified. + + + +## Generate a schema from Python code + + + + In your Python project, install the [genson](https://pypi.org/project/genson/) package. + + ```bash + pip install genson + ``` + + + 1. Set the following local environment variables: + + - Set `LOCAL_FILE_INPUT_PATH` to the local path to the input (source) JSON file. + - Set `LOCAL_FILE_OUTPUT_PATH` to the local path to the output (target) JSON schema file to be generated. + + 2. Add the following Python code file to your project: + + ```python + import os, json + from genson import SchemaBuilder + + def json_schema_from_file( + input_file_path: str, + output_schema_path: str + ) -> None: + try: + with open(input_file_path, "r") as file: + json_data = json.load(file) + + builder = SchemaBuilder() + builder.add_object(json_data) + + schema = builder.to_schema() + + try: + with open(output_schema_path, "w") as schema_file: + json.dump(schema, schema_file, indent=2) + except IOError as e: + raise IOError(f"Error writing to output file: {e}") + + print(f"JSON schema successfully generated and saved to '{output_schema_path}'.") + except FileNotFoundError: + print(f"Error: Input file '{input_file_path}' not found.") + except IOError as e: + print(f"I/O error occurred: {e}") + except Exception as e: + print(f"An unexpected error occurred: {e}") + + if __name__ == "__main__": + json_schema_from_file( + input_file_path=os.getenv("LOCAL_FILE_INPUT_PATH"), + output_schema_path=os.getenv("LOCAL_FILE_OUTPUT_PATH") + ) + ``` + + 3. Run the Python code file. + 4. Check the path specified by `LOCAL_FILE_OUTPUT_PATH` for the generated JSON schema file. + + + diff --git a/mint.json b/mint.json index 1d31268a..67239f84 100644 --- a/mint.json +++ b/mint.json @@ -379,7 +379,8 @@ "api-reference/how-to/powerpoint", "api-reference/how-to/use-langchain-ollama", "api-reference/how-to/use-langchain-llama-3", - "api-reference/how-to/transform-schemas" + "api-reference/how-to/transform-schemas", + "api-reference/how-to/generate-schema" ] }, {