# JSON

This guide shows how to use LangChain's `JSONLoader` to load and process JSON files. We will explore how to extract specific data from structured JSON files using `jq`-style queries.

```bash
pip install jq
```


In [2]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

load_dotenv(override=True, dotenv_path="../.env")


True

## Generate JSON Data

---

If you need sample data, the following code asks an LLM to generate JSON and saves it to `data/people.json`.


In [3]:
from langchain_core.prompts import PromptTemplate
from pathlib import Path
from pprint import pprint
import json
import os

# Initialize ChatOpenAI
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    model_kwargs={"response_format": {"type": "json_object"}}
)

# Create prompt template
prompt = PromptTemplate(
    input_variables=[],
    template="""Generate a JSON array containing detailed personal information for 5 people. 
        Include various fields like name, age, contact details, address, personal preferences, and any other interesting information you think would be relevant."""
)

# Create and invoke runnable sequence using the new pipe syntax
response = (prompt | llm).invoke({})
generated_data = json.loads(response.content)

# Save to JSON file
current_dir = Path().absolute()
data_dir = current_dir / "data"
data_dir.mkdir(exist_ok=True)

file_path = data_dir / "people.json"
with open(file_path, "w", encoding="utf-8") as f:
    json.dump(generated_data, f, ensure_ascii=False, indent=2)

print("Generated and saved JSON data:")
pprint(generated_data)

Generated and saved JSON data:
{'people': [{'address': {'city': 'Los Angeles',
                         'state': 'CA',
                         'street': '123 Main St',
                         'zip': '90001'},
             'age': 32,
             'contact_details': {'email': 'alice.johnson@example.com',
                                 'phone': '555-123-4567'},
             'interesting_information': 'Alice is a certified yoga instructor '
                                        'and volunteers at a local animal '
                                        'shelter in her free time.',
             'name': 'Alice Johnson',
             'personal_preferences': {'favorite_color': 'blue',
                                      'favorite_food': 'sushi',
                                      'hobbies': ['reading', 'hiking']}},
            {'address': {'city': 'New York',
                         'state': 'NY',
                         'street': '456 Oak St',
                         'zip': '100

In [4]:
import json
from pathlib import Path
from pprint import pprint


file_path = "data/people.json"
data = json.loads(Path(file_path).read_text(encoding="utf-8"))

pprint(data)

{'people': [{'address': {'city': 'Los Angeles',
                         'state': 'CA',
                         'street': '123 Main St',
                         'zip': '90001'},
             'age': 32,
             'contact_details': {'email': 'alice.johnson@example.com',
                                 'phone': '555-123-4567'},
             'interesting_information': 'Alice is a certified yoga instructor '
                                        'and volunteers at a local animal '
                                        'shelter in her free time.',
             'name': 'Alice Johnson',
             'personal_preferences': {'favorite_color': 'blue',
                                      'favorite_food': 'sushi',
                                      'hobbies': ['reading', 'hiking']}},
            {'address': {'city': 'New York',
                         'state': 'NY',
                         'street': '456 Oak St',
                         'zip': '10001'},
             'age': 45,
 

In [5]:
print(type(data))

<class 'dict'>
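
Because the file is produced by an LLM, the key names can differ from run to run. Before writing a `jq` schema, it may help to list the key paths that are actually present. The following is a minimal sketch using only the standard library; the helper name `key_paths` and the shortened `sample` record are illustrative assumptions, not part of LangChain.

```python
import json


def key_paths(value, prefix=""):
    """Return the sorted, dotted key paths found in a nested JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}"
            paths.add(path)
            paths |= set(key_paths(child, path))
    elif isinstance(value, list):
        # Mark list traversal with [] so the path resembles a jq selector.
        for child in value:
            paths |= set(key_paths(child, f"{prefix}[]"))
    return sorted(paths)


# Shortened, hypothetical stand-in for the contents of data/people.json.
sample = json.loads("""
{"people": [{"name": "Alice Johnson",
             "contact_details": {"email": "alice.johnson@example.com"}}]}
""")

paths = key_paths(sample)
for path in paths:
    print(path)
```

Running the same helper on the `data` dictionary loaded above would show which selectors, such as `.people[].contact_details.email`, exist in your copy of the file.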


## `JSONLoader`

---

When you want to extract specific values from JSON data, for example each entry under the `people` key, you can do so easily with `JSONLoader`, as shown below.


### Basic Usage

This example loads `people.json` with `JSONLoader` and prints the resulting documents.

In [7]:
from langchain_community.document_loaders import JSONLoader

# Create JSONLoader
loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",  # Access each item in the people array
    text_content=False,
)

# Load documents
docs = loader.load()
pprint(docs)

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "age": 32, "contact_details": {"email": "alice.johnson@example.com", "phone": "555-123-4567"}, "address": {"street": "123 Main St", "city": "Los Angeles", "state": "CA", "zip": "90001"}, "personal_preferences": {"favorite_color": "blue", "favorite_food": "sushi", "hobbies": ["reading", "hiking"]}, "interesting_information": "Alice is a certified yoga instructor and volunteers at a local animal shelter in her free time."}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 2}, page_content='{"name": "John Smith", "age": 45, "contact_details": {"email": "john.smith@example.com", "phone": "555-987-6543"}, "address": {"street": "456 Oak St", "city": "New York", "state": "NY", "zip": "10001"}, "
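
To make the output above easier to read, here is a rough plain-Python approximation of what the loader produces for `jq_schema=".people[]"` with `text_content=False`: one document per array element, with the record serialized as a JSON string and a 1-based `seq_num`. The function `load_records` and the two-person `raw` dictionary are hypothetical stand-ins; this is a sketch of the observable result, not the library's implementation.

```python
import json

# Hypothetical, shortened version of the data in people.json.
raw = {
    "people": [
        {"name": "Alice Johnson", "age": 32},
        {"name": "John Smith", "age": 45},
    ]
}


def load_records(data, source):
    """Approximate the documents produced for '.people[]' with text_content=False."""
    docs = []
    for seq_num, record in enumerate(data["people"], start=1):
        docs.append(
            {
                # Non-string records are serialized to a JSON string.
                "page_content": json.dumps(record),
                # The real loader records the resolved file path as the source.
                "metadata": {"source": source, "seq_num": seq_num},
            }
        )
    return docs


docs = load_records(raw, "data/people.json")
for doc in docs:
    print(doc)
```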

### Loading Each Person as a Separate Document

Each person object in `people.json` can be loaded as an individual document by setting `jq_schema=".people[]"`, which emits one record per element of the `people` array.

In [8]:
loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",
    text_content=False,
)

data = loader.load()
data

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "age": 32, "contact_details": {"email": "alice.johnson@example.com", "phone": "555-123-4567"}, "address": {"street": "123 Main St", "city": "Los Angeles", "state": "CA", "zip": "90001"}, "personal_preferences": {"favorite_color": "blue", "favorite_food": "sushi", "hobbies": ["reading", "hiking"]}, "interesting_information": "Alice is a certified yoga instructor and volunteers at a local animal shelter in her free time."}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 2}, page_content='{"name": "John Smith", "age": 45, "contact_details": {"email": "john.smith@example.com", "phone": "555-987-6543"}, "address": {"street": "456 Oak St", "city": "New York", "state": "NY", "zip": "10001"}, "

### Using `content_key` within `jq_schema`

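Set `content_key` to use a single field of each record selected by `jq_schema` as the document's `page_content`. In the example below, `content_key="name"` is a plain top-level key, so no extra option is needed. If `content_key` is itself a jq expression (for example a nested path), also set `is_content_key_jq_parsable=True` and make sure the expression is valid for the records that `jq_schema` returns.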

In [9]:
loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",
    content_key="name",
    text_content=False
)

data = loader.load()
data

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1}, page_content='Alice Johnson'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 2}, page_content='John Smith'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 3}, page_content='Emily Davis'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 4}, page_content='Michael Brown'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 5}, page_content='Sarah Wilson')]
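
The difference between a plain `content_key` and a jq-parsable one can be illustrated without the library. In this sketch the hypothetical helper `lookup` treats a plain key as a top-level dictionary lookup and, when `jq_parsable=True`, walks a simple dotted path such as `.contact_details.email`. That dotted-path walk is only a tiny stand-in for real jq evaluation, and the record is a shortened assumption.

```python
def lookup(record, content_key, jq_parsable=False):
    """Illustrative stand-in for content_key resolution (not the library's code)."""
    if not jq_parsable:
        # Plain mode: content_key names a top-level key of the record.
        return record.get(content_key)
    # jq-like mode: follow a dotted path, returning None if any step is missing.
    value = record
    for part in content_key.strip(".").split("."):
        value = value.get(part) if isinstance(value, dict) else None
    return value


record = {
    "name": "Alice Johnson",
    "contact_details": {"email": "alice.johnson@example.com"},
}

plain = lookup(record, "name")
nested = lookup(record, ".contact_details.email", jq_parsable=True)
# A dotted path is not a top-level key, so plain mode finds nothing.
missing = lookup(record, ".contact_details.email")

print(plain, nested, missing)
```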

### Extracting Metadata from `people.json`

Let's define a `metadata_func` to extract relevant information like name, age, and city from each person object.

In [10]:
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["age"] = record.get("age")
    metadata["city"] = record.get("address", {}).get("city")
    return metadata

loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[]",
    content_key="name",
    metadata_func=metadata_func,
    text_content=False
)

data = loader.load()
data

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1, 'name': 'Alice Johnson', 'age': 32, 'city': 'Los Angeles'}, page_content='Alice Johnson'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 2, 'name': 'John Smith', 'age': 45, 'city': 'New York'}, page_content='John Smith'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 3, 'name': 'Emily Davis', 'age': 28, 'city': 'Chicago'}, page_content='Emily Davis'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 4, 'name': 'Michael Brown', 'age': 50, 'city': 'Houston'}, page_content='Michael Brown'),
 Document(metadata={'source': '/h
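
Since `metadata_func` is an ordinary function, it can be checked on its own before it is passed to the loader. The sketch below repeats the function from the cell above and calls it with a complete record and with a record that lacks `address`; the `base` dictionary and both records are made-up inputs that imitate the `source` and `seq_num` entries the loader supplies.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["age"] = record.get("age")
    # The {} default keeps a missing address from raising an AttributeError.
    metadata["city"] = record.get("address", {}).get("city")
    return metadata


# Imitates the default metadata the loader passes in (hypothetical values).
base = {"source": "data/people.json", "seq_num": 1}

full = metadata_func(
    {"name": "Alice Johnson", "age": 32, "address": {"city": "Los Angeles"}},
    dict(base),  # copy, because metadata_func mutates its argument
)
partial = metadata_func({"name": "No Address"}, dict(base))

print(full)
print(partial)
```

Missing fields end up as `None` in the metadata rather than causing an error, which may matter if a downstream vector store rejects `None` values.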

### Understanding JSON Query Syntax

Let's explore the basic syntax of jq-style queries used in `JSONLoader`:

Basic Selectors
   - **`.`** : Current object
   - **`.key`** : Access specific key in object
   - **`.[]`** : Iterate over array elements

Pipe Operator
   - **`|`** : Pass result of left expression as input to right expression
   
Object Construction
   - **`{key: value}`** : Create new object

Example JSON:
```json
{
  "people": [
    {"name": "Alice", "age": 30, "contactDetails": {"email": "alice@example.com", "phone": "123-456-7890"}},
    {"name": "Bob", "age": 25, "contactDetails": {"email": "bob@example.com", "phone": "098-765-4321"}}
  ]
}
```

**Common Query Patterns**:
- `.people[]` : Access each array element
- `.people[].name` : Get all names
- `.people[] | {name: .name}` : Create new object with name
- `.people[] | {name, email: .contactDetails.email}` : Extract nested data
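
For readers new to jq, the query patterns can be compared with their plain-Python equivalents on the example JSON. This is only an analogy to show what each query selects; `JSONLoader` itself evaluates the schema with the `jq` package, and the variable names here are illustrative.

```python
data = {
    "people": [
        {"name": "Alice", "age": 30,
         "contactDetails": {"email": "alice@example.com", "phone": "123-456-7890"}},
        {"name": "Bob", "age": 25,
         "contactDetails": {"email": "bob@example.com", "phone": "098-765-4321"}},
    ]
}

# .people[]  ->  each element of the array
people = list(data["people"])

# .people[].name  ->  the name of each element
names = [person["name"] for person in people]

# .people[] | {name: .name}  ->  a new object per element
name_objects = [{"name": person["name"]} for person in people]

# .people[] | {name, email: .contactDetails.email}  ->  name plus a nested field
contacts = [
    {"name": person["name"], "email": person["contactDetails"]["email"]}
    for person in people
]

print(names)
print(name_objects)
print(contacts)
```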

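[Note] 
- Set `text_content=False` whenever the selected content is not a plain string (objects, arrays, numbers).
- With the default `text_content=True`, `JSONLoader` expects string content and raises an error for other types; with `text_content=False`, objects and arrays are serialized to JSON strings.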
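
The effect of `text_content` can be sketched as a small conversion function. The helper `to_page_content` below is a hypothetical approximation of how extracted content is turned into `page_content`; the exact behaviour may differ between library versions, so treat it as an illustration rather than the library's code.

```python
import json


def to_page_content(content, text_content=True):
    """Approximate how extracted content becomes page_content (assumed behaviour)."""
    if text_content and not isinstance(content, str):
        # Strict mode: anything other than a string is rejected.
        raise ValueError(
            f"Expected a string, got {type(content).__name__}; "
            "set text_content=False for non-string content"
        )
    if isinstance(content, str):
        return content
    if isinstance(content, (dict, list)):
        # Objects and arrays are serialized; empty ones become an empty string.
        return json.dumps(content) if content else ""
    # Numbers and booleans are stringified; None becomes an empty string.
    return str(content) if content is not None else ""


as_text = to_page_content("Alice Johnson")
as_json = to_page_content({"city": "Los Angeles"}, text_content=False)
as_number = to_page_content(32, text_content=False)

try:
    to_page_content({"city": "Los Angeles"})
    raised = False
except ValueError:
    raised = True

print(as_text, as_json, as_number, raised)
```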

### Advanced Queries

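Here are examples of extracting specific fields using different jq schemas. The key names in each schema must match those in the file exactly (here `contact_details`, `personal_preferences`, and `interesting_information`); jq returns `null` for a key that does not exist.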

In [11]:
# Extract only contact details (the key is `contact_details` in people.json)
contact_loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[] | {name: .name, contact: .contact_details}",
    text_content=False
)

docs = contact_loader.load()
docs

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "contact": {"email": "alice.johnson@example.com", "phone": "555-123-4567"}}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 2}, page_content='{"name": "John Smith", "contact": {"email": "john.smith@example.com", "phone": "555-987-6543"}}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.js

In [12]:
# Extract nested data (hobbies live under `personal_preferences`)
hobbies_loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[] | {name: .name, hobbies: .personal_preferences.hobbies}",
    text_content=False
)

docs = hobbies_loader.load()
docs

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "hobbies": ["reading", "hiking"]}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.js

In [13]:
# Get all interesting facts (the key is `interesting_information` in people.json)
facts_loader = JSONLoader(
    file_path="data/people.json",
    jq_schema=".people[] | {name: .name, facts: .interesting_information}",
    text_content=False
)

docs = facts_loader.load()
docs

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "facts": "Alice is a certified yoga instructor and volunteers at a local animal shelter in her free time."}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'se

In [14]:
# Extract email and phone together
contact_info = JSONLoader(
    file_path="data/people.json",
    jq_schema='.people[] | {name: .name, email: .contact_details.email, phone: .contact_details.phone}',
    text_content=False
)

# Load with the loader defined in this cell, not the earlier contact_loader
docs = contact_info.load()
docs

[Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 1}, page_content='{"name": "Alice Johnson", "email": "alice.johnson@example.com", "phone": "555-123-4567"}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.json', 'seq_num': 2}, page_content='{"name": "John Smith", "email": "john.smith@example.com", "phone": "555-987-6543"}'),
 Document(metadata={'source': '/home/dino/Documents/aidino.github.io/example_codes/building_llms_for_productions/langchain/data/people.js