From ee47c21937217a477cf4da7e321ec4f9a06627b8 Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Mon, 9 Jun 2025 16:25:14 -0700 Subject: [PATCH 1/3] How to use jq --- docs.json | 1 + examplecode/tools/jq.mdx | 170 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 171 insertions(+) create mode 100644 examplecode/tools/jq.mdx diff --git a/docs.json b/docs.json index f94eb77f..4ffd5f77 100644 --- a/docs.json +++ b/docs.json @@ -262,6 +262,7 @@ { "group": "Tool demos", "pages": [ + "examplecode/tools/jq", "examplecode/tools/firecrawl", "examplecode/tools/langflow", "examplecode/tools/vectorshift", diff --git a/examplecode/tools/jq.mdx b/examplecode/tools/jq.mdx new file mode 100644 index 00000000..097c034a --- /dev/null +++ b/examplecode/tools/jq.mdx @@ -0,0 +1,170 @@ +--- +title: Query JSON with jq +--- + +[jq](https://jqlang.org/) is a lightweight and flexible command-line JSON processor. You can use `jq` on a local development machine to +slice, filter, map, and transform the JSON data that Unstructured outputs in much the same ways that tools such as `sed`, `awk`, and `grep` let you work with text. + +To get `jq`, see the [Download jq](https://jqlang.org/download/) page. + + + `jq` is not owned or supported by Unstructured. For questions about `jq` and + feature requests for future versions of `jq`, see the [Issues](https://github.com/jqlang/jq/issues) tab of the + `jq` repository in GitHub. + + +The following command examples use `jq` with the +[spring-weather.html.json](https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/spring-weather.html.json) file in the +**example-docs** directory within the **Unstructured-IO/unstructured** repository in GitHub. + +Find the element with a `type` of `Address`, and print the element's `text` field value. 
+ +```bash +jq '.[] + | select(.type == "Address") + | .text' spring-weather.html.json + +# Output: +# +# "Silver Spring, MD 20910" +``` + +Find all elements with a `type` of `Title`, and print the `text` field of each found element as a string in a JSON array. + +```bash +jq '[ + .[] + | select(.type == "Title") + | .text]' spring-weather.html.json + +# Output: +# +# [ +# "News Around NOAA", +# "National Program", +# "Are You Weather-Ready for the Spring?", +# "Weather.gov >", +# "News Around NOAA > Are You Weather-Ready for the Spring?", +# "US Dept of Commerce", +# "National Oceanic and Atmospheric Administration", +# "National Weather Service", +# "News Around NOAA", +# "1325 East West Highway", +# "Comments? Questions? Please Contact Us.", +# "Disclaimer", +# "Information Quality", +# "Help", +# "Glossary", +# "Privacy Policy", +# "Freedom of Information Act (FOIA)", +# "About Us", +# "Career Opportunities" +# ] +``` + +Find all elements with a `type` of `Title`. Of these, find the ones that have a `text` field that contains the phrase "Contact Us", and print the contents of each element's `metadata.link_urls` field. + +```bash +jq '.[] + | select(.type == "Title") + | select(.text + | contains("Contact Us")) + | .metadata.link_urls' spring-weather.html.json + +# Output: +# +# [ +# "https://www.weather.gov/news/contact" +# ] +``` + +Find all elements with a `type` of `ListItem`. Of these, find the ones that have a `text` field that contains the phrase "Weather Safety". +For each item in `metadata.link_texts`, print the item's value as the key, followed by the matching item in +`metadata.link_urls` as the value. Trim any leading and trailing whitespace from all values. Wrap the output in a JSON array. 
+ +```bash +jq '[ + .[] + | select(.type == "ListItem") + | select(.text | test("Weather Safety"; "i")) + | [.metadata.link_texts, .metadata.link_urls] + | transpose[] + | { + (.[0] | gsub("^\\s+|\\s+$"; "")) : (.[1] | gsub("^\\s+|\\s+$"; "")) + } +]' spring-weather.html.json + +# Output: +# +# [ +# { +# "Weather Safety": "http://www.weather.gov/safetycampaign" +# }, +# { +# "Air Quality": "https://www.weather.gov/safety/airquality" +# }, +# { +# "Beach Hazards": "https://www.weather.gov/safety/beachhazards" +# }, +# { +# "Cold": "https://www.weather.gov/safety/cold" +# }, +# { +# "Cold Water": "https://www.weather.gov/safety/coldwater" +# }, +# { +# "Drought": "https://www.weather.gov/safety/drought" +# }, +# { +# "Floods": "https://www.weather.gov/safety/flood" +# }, +# { +# "Fog": "https://www.weather.gov/safety/fog" +# }, +# { +# "Heat": "https://www.weather.gov/safety/heat" +# }, +# { +# "Hurricanes": "https://www.weather.gov/safety/hurricane" +# }, +# { +# "Lightning Safety": "https://www.weather.gov/safety/lightning" +# }, +# { +# "Rip Currents": "https://www.weather.gov/safety/ripcurrent" +# }, +# { +# "Safe Boating": "https://www.weather.gov/safety/safeboating" +# }, +# { +# "Space Weather": "https://www.weather.gov/safety/space" +# }, +# { +# "Sun (Ultraviolet Radiation)": "https://www.weather.gov/safety/heat-uv" +# }, +# { +# "Thunderstorms & Tornadoes": "https://www.weather.gov/safety/thunderstorm" +# }, +# { +# "Tornado": "https://www.weather.gov/safety/tornado" +# }, +# { +# "Tsunami": "https://www.weather.gov/safety/tsunami" +# }, +# { +# "Wildfire": "https://www.weather.gov/safety/wildfire" +# }, +# { +# "Wind": "https://www.weather.gov/safety/wind" +# }, +# { +# "Winter": "https://www.weather.gov/safety/winter" +# } +# ] +``` + +## Additional resources + +- [jq Tutorial](https://jqlang.org/tutorial/) +- [jq Manual](https://jqlang.org/manual/) +- [jq Playground](https://play.jqlang.org/) \ No newline at end of file From 
b7e9ad14d612a4ace4bdcff6f975797f8500d67a Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Mon, 9 Jun 2025 17:00:37 -0700 Subject: [PATCH 2/3] Push unsaved file --- examplecode/tools/jq.mdx | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/examplecode/tools/jq.mdx b/examplecode/tools/jq.mdx index 097c034a..2a58def9 100644 --- a/examplecode/tools/jq.mdx +++ b/examplecode/tools/jq.mdx @@ -17,7 +17,7 @@ The following command examples use `jq` with the [spring-weather.html.json](https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/spring-weather.html.json) file in the **example-docs** directory within the **Unstructured-IO/unstructured** repository in GitHub. -Find the element with a `type` of `Address`, and print the element's `text` field value. +Find the element with a `type` of `Address`, and print the element's `text` field's value. ```bash jq '.[] @@ -62,7 +62,7 @@ jq '[ # ] ``` -Find all elements with a `type` of `Title`. Of these, find the ones that have a `text` field that contains the phrase "Contact Us", and print the contents of each element's `metadata.link_urls` field. +Find all elements with a `type` of `Title`. Of these, find the ones that have a `text` field that contains the phrase `Contact Us`, and print the contents of each found element's `metadata.link_urls` field. ```bash jq '.[] @@ -78,7 +78,7 @@ jq '.[] # ] ``` -Find all elements with a `type` of `ListItem`. Of these, find the ones that have a `text` field that contains the phrase "Weather Safety". +Find all elements with a `type` of `ListItem`. Of these, find the ones that have a `text` field that contains the phrase `Weather Safety`. For each item in `metadata.link_texts`, print the item's value as the key, followed by the matching item in `metadata.link_urls` as the value. Trim any leading and trailing whitespace from all values. Wrap the output in a JSON array. 
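To see how the `transpose` step pairs each link text with its URL, the following minimal sketch runs the same filter shape against a small inline document instead of `spring-weather.html.json`. The sample element, its padded strings, and the `example.com` URLs are made up for illustration and are not taken from the real file. Only the `jq` filter itself mirrors the `ListItem` example in this page.

```bash
# Hypothetical sample: one ListItem with two links, padded with stray whitespace.
sample='[{"type":"ListItem","text":"Weather Safety Air Quality","metadata":{"link_texts":[" Weather Safety ","Air Quality "],"link_urls":[" https://example.com/safety","https://example.com/air "]}}]'

# Same filter shape as the ListItem example; -c prints the result on one line.
result=$(printf '%s' "$sample" | jq -c '[
  .[]
  | select(.type == "ListItem")
  | select(.text | test("Weather Safety"; "i"))
  | [.metadata.link_texts, .metadata.link_urls]   # two parallel arrays
  | transpose[]                                   # emit one [text, url] pair at a time
  | { (.[0] | gsub("^\\s+|\\s+$"; "")) : (.[1] | gsub("^\\s+|\\s+$"; "")) }
]')

echo "$result"
# Prints:
# [{"Weather Safety":"https://example.com/safety"},{"Air Quality":"https://example.com/air"}]
```

Note that `test("Weather Safety"; "i")` is a case-insensitive regular expression match, so it also selects elements whose text contains, for example, `weather safety`. To match the phrase with exact casing, `contains("Weather Safety")` could be used instead, as in the `Title` example.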
From 93ca73a83051f4a8d3dba1a825323391b4ec5618 Mon Sep 17 00:00:00 2001 From: Paul Cornell Date: Mon, 16 Jun 2025 08:50:54 -0700 Subject: [PATCH 3/3] Separated outputs --- examplecode/tools/jq.mdx | 204 ++++++++++++++++++++------------------- 1 file changed, 106 insertions(+), 98 deletions(-) diff --git a/examplecode/tools/jq.mdx b/examplecode/tools/jq.mdx index 2a58def9..c4b6b691 100644 --- a/examplecode/tools/jq.mdx +++ b/examplecode/tools/jq.mdx @@ -23,10 +23,12 @@ Find the element with a `type` of `Address`, and print the element's `text` fiel jq '.[] | select(.type == "Address") | .text' spring-weather.html.json +``` + +The output is: -# Output: -# -# "Silver Spring, MD 20910" +```bash +"Silver Spring, MD 20910" ``` Find all elements with a `type` of `Title`, and print the `text` field of each found element as a string in a JSON array. @@ -36,30 +38,32 @@ jq '[ .[] | select(.type == "Title") | .text]' spring-weather.html.json +``` + +The output is: -# Output: -# -# [ -# "News Around NOAA", -# "National Program", -# "Are You Weather-Ready for the Spring?", -# "Weather.gov >", -# "News Around NOAA > Are You Weather-Ready for the Spring?", -# "US Dept of Commerce", -# "National Oceanic and Atmospheric Administration", -# "National Weather Service", -# "News Around NOAA", -# "1325 East West Highway", -# "Comments? Questions? Please Contact Us.", -# "Disclaimer", -# "Information Quality", -# "Help", -# "Glossary", -# "Privacy Policy", -# "Freedom of Information Act (FOIA)", -# "About Us", -# "Career Opportunities" -# ] +```bash +[ + "News Around NOAA", + "National Program", + "Are You Weather-Ready for the Spring?", + "Weather.gov >", + "News Around NOAA > Are You Weather-Ready for the Spring?", + "US Dept of Commerce", + "National Oceanic and Atmospheric Administration", + "National Weather Service", + "News Around NOAA", + "1325 East West Highway", + "Comments? Questions? 
Please Contact Us.", + "Disclaimer", + "Information Quality", + "Help", + "Glossary", + "Privacy Policy", + "Freedom of Information Act (FOIA)", + "About Us", + "Career Opportunities" +] ``` Find all elements with a `type` of `Title`. Of these, find the ones that have a `text` field that contains the phrase `Contact Us`, and print the contents of each found element's `metadata.link_urls` field. @@ -70,12 +74,14 @@ jq '.[] | select(.text | contains("Contact Us")) | .metadata.link_urls' spring-weather.html.json +``` + +The output is: -# Output: -# -# [ -# "https://www.weather.gov/news/contact" -# ] +```bash +[ + "https://www.weather.gov/news/contact" +] ``` Find all elements with a `type` of `ListItem`. Of these, find the ones that have a `text` field that contains the phrase `Weather Safety`. @@ -93,74 +99,76 @@ jq '[ (.[0] | gsub("^\\s+|\\s+$"; "")) : (.[1] | gsub("^\\s+|\\s+$"; "")) } ]' spring-weather.html.json +``` + +The output is: -# Output: -# -# [ -# { -# "Weather Safety": "http://www.weather.gov/safetycampaign" -# }, -# { -# "Air Quality": "https://www.weather.gov/safety/airquality" -# }, -# { -# "Beach Hazards": "https://www.weather.gov/safety/beachhazards" -# }, -# { -# "Cold": "https://www.weather.gov/safety/cold" -# }, -# { -# "Cold Water": "https://www.weather.gov/safety/coldwater" -# }, -# { -# "Drought": "https://www.weather.gov/safety/drought" -# }, -# { -# "Floods": "https://www.weather.gov/safety/flood" -# }, -# { -# "Fog": "https://www.weather.gov/safety/fog" -# }, -# { -# "Heat": "https://www.weather.gov/safety/heat" -# }, -# { -# "Hurricanes": "https://www.weather.gov/safety/hurricane" -# }, -# { -# "Lightning Safety": "https://www.weather.gov/safety/lightning" -# }, -# { -# "Rip Currents": "https://www.weather.gov/safety/ripcurrent" -# }, -# { -# "Safe Boating": "https://www.weather.gov/safety/safeboating" -# }, -# { -# "Space Weather": "https://www.weather.gov/safety/space" -# }, -# { -# "Sun (Ultraviolet Radiation)": 
"https://www.weather.gov/safety/heat-uv" -# }, -# { -# "Thunderstorms & Tornadoes": "https://www.weather.gov/safety/thunderstorm" -# }, -# { -# "Tornado": "https://www.weather.gov/safety/tornado" -# }, -# { -# "Tsunami": "https://www.weather.gov/safety/tsunami" -# }, -# { -# "Wildfire": "https://www.weather.gov/safety/wildfire" -# }, -# { -# "Wind": "https://www.weather.gov/safety/wind" -# }, -# { -# "Winter": "https://www.weather.gov/safety/winter" -# } -# ] +```bash +[ + { + "Weather Safety": "http://www.weather.gov/safetycampaign" + }, + { + "Air Quality": "https://www.weather.gov/safety/airquality" + }, + { + "Beach Hazards": "https://www.weather.gov/safety/beachhazards" + }, + { + "Cold": "https://www.weather.gov/safety/cold" + }, + { + "Cold Water": "https://www.weather.gov/safety/coldwater" + }, + { + "Drought": "https://www.weather.gov/safety/drought" + }, + { + "Floods": "https://www.weather.gov/safety/flood" + }, + { + "Fog": "https://www.weather.gov/safety/fog" + }, + { + "Heat": "https://www.weather.gov/safety/heat" + }, + { + "Hurricanes": "https://www.weather.gov/safety/hurricane" + }, + { + "Lightning Safety": "https://www.weather.gov/safety/lightning" + }, + { + "Rip Currents": "https://www.weather.gov/safety/ripcurrent" + }, + { + "Safe Boating": "https://www.weather.gov/safety/safeboating" + }, + { + "Space Weather": "https://www.weather.gov/safety/space" + }, + { + "Sun (Ultraviolet Radiation)": "https://www.weather.gov/safety/heat-uv" + }, + { + "Thunderstorms & Tornadoes": "https://www.weather.gov/safety/thunderstorm" + }, + { + "Tornado": "https://www.weather.gov/safety/tornado" + }, + { + "Tsunami": "https://www.weather.gov/safety/tsunami" + }, + { + "Wildfire": "https://www.weather.gov/safety/wildfire" + }, + { + "Wind": "https://www.weather.gov/safety/wind" + }, + { + "Winter": "https://www.weather.gov/safety/winter" + } +] ``` ## Additional resources