## Import the Extraction Function
!pip install -r ../requirements.txt

Import the custom extraction function from the `src/extraction.py` module.

In [1]:
import sys
from pathlib import Path

# Add the parent directory to sys.path so the src package can be imported
sys.path.append(str(Path.cwd().parent))

# Import the extraction function
from src.extraction import extract_json_from_text

## Define File Paths

Specify the paths to:

- Input text file (`example.txt`)
- Ontology YAML file (`urban_air_quality.yaml`)
- Output JSON file (`extracted_knowledge.json`)
- LLM model file (`.gguf` model file)

In [4]:
# File paths used by the extraction run
input_txt_path = Path("../data/example txt/example.txt")
ontology_yaml_path = Path("../ontology/urban_air_quality.yaml")
output_json_path = Path("../data/output/extracted_knowledge.json")
model_path = Path("mistral-7b-instruct-v0.2.Q4_K_M.gguf")  # <-- replace with the path to your local .gguf model file
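
Because the model path must be replaced by hand, it can help to check the paths before starting a model load. The following is a minimal sketch, not part of `src/extraction.py`: the helper names `find_missing_inputs` and `ensure_output_dir` are hypothetical, and it assumes that the extraction function does not create the output directory itself.

```python
from pathlib import Path


def find_missing_inputs(paths):
    """Return the labels of input files that do not exist on disk.

    `paths` maps a human-readable label to a file path (hypothetical helper).
    """
    return [label for label, path in paths.items() if not Path(path).is_file()]


def ensure_output_dir(output_path):
    """Create the parent directory of the output file if it is missing."""
    parent = Path(output_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent
```

In this notebook the check could be called as `find_missing_inputs({"input text": input_txt_path, "ontology": ontology_yaml_path, "model": model_path})`, followed by `ensure_output_dir(output_json_path)`. An empty list means all three input files were found.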

## Run Extraction Function

Run the extraction function to generate the structured JSON output.

In [6]:
extract_json_from_text(
    input_txt_path=input_txt_path,
    ontology_yaml_path=ontology_yaml_path,
    output_json_path=output_json_path,
    model_path=model_path
)

llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 12287 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/nxx20/Library/Application Support/nomic.ai/GPT4All/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336

✅ JSON saved to: ../data/output/extracted_knowledge.json


## Load and View the Extracted JSON Output

Load and display the generated JSON to confirm the extraction results.

In [8]:
import json

# Load the JSON output
with open(output_json_path, "r", encoding="utf-8") as file:
    extracted_data = json.load(file)

# Display the extracted data
extracted_data

{'pollutants': [{'name': 'PM2.5', 'category': 'particulate matter'},
  {'name': 'PM10', 'category': 'particulate matter'}],
 'pollution_sources': [{'name': 'Diesel buses', 'source_type': 'stationary'},
  {'name': 'City center', 'source_type': 'stationary'}],
 'mitigation_measures': [{'name': 'Selective catalytic reduction technology',
   'measure_type': 'technology'}],
 'pollutant_source_relations': [{'pollutant': {'name': 'PM2.5'},
   'source': {'name': 'Diesel buses'}},
  {'pollutant': {'name': 'PM10'}, 'source': {'name': 'Diesel buses'}}],
 'source_mitigation_relations': [{'source': {'name': 'Diesel buses'},
   'mitigation': {'name': 'Selective catalytic reduction technology'}}],
 'meteorological_factors': [{'name': 'Temperature inversions'},
  {'name': 'Wind speed'}],
 'street_canyon_factors': [{'name': 'Street canyon geometry'}],
 'total_reduction_percentage': 0.0}
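
Beyond inspecting the output by eye, one way to confirm the result is to check that every relation refers to an entity that was actually extracted. The sketch below is an illustration only. The function name `find_dangling_relations` is hypothetical, and the key names are taken from the output shown above, not from the ontology file, so they may need adjusting if the schema differs.

```python
def find_dangling_relations(data):
    """Return (relation_key, relation) pairs that mention unknown entity names.

    Assumes the key layout seen in the output above; missing keys are
    treated as empty lists.
    """
    pollutants = {p["name"] for p in data.get("pollutants", [])}
    sources = {s["name"] for s in data.get("pollution_sources", [])}
    measures = {m["name"] for m in data.get("mitigation_measures", [])}

    dangling = []
    # Each pollutant-source relation should point at a known pollutant and source.
    for rel in data.get("pollutant_source_relations", []):
        if (rel["pollutant"]["name"] not in pollutants
                or rel["source"]["name"] not in sources):
            dangling.append(("pollutant_source_relations", rel))
    # Each source-mitigation relation should point at a known source and measure.
    for rel in data.get("source_mitigation_relations", []):
        if (rel["source"]["name"] not in sources
                or rel["mitigation"]["name"] not in measures):
            dangling.append(("source_mitigation_relations", rel))
    return dangling
```

Calling `find_dangling_relations(extracted_data)` returns an empty list when all relations are consistent with the entity lists. Any returned pair points to a relation that may need manual review.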