The advent of large language models (LLMs) has transformed many facets of society, enabling new applications across diverse fields. This project uses LLMs to analyze the zoning landscape in the United States. This repository offers a demo model that has been tested on zoning ordinances extracted from Wheaton, Illinois.
For the latest paper please see here.
For the latest data please see here.
All required files are included in the repository except the .env file, which you will need to create in the main AI-Zoning directory (the same directory as config.yaml). A guide on what to include in your .env file, as well as documentation for all the key components of the repository, can be found in the readme folder of this repository.
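As a rough illustration of how a .env file might be read, the sketch below parses simple `KEY=VALUE` lines using only the Python standard library. The helper `load_env` and the variable name `OPENAI_API_KEY` are assumptions made for illustration; the guide in the readme folder remains the reference for what the file actually needs to contain.

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    # OPENAI_API_KEY is an assumed variable name, not confirmed by the repository.
    settings = load_env(".env")
    os.environ.update(settings)
    print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```

Keeping secrets in a file outside version control, rather than in config.yaml, is a common reason for this split.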
The code in this repository was containerized using Docker. To demo the code, please refer to the following guide on running containerized code: https://drive.google.com/file/d/1BiEs74T4dKHhyQI2Je3EUJxNfzEcvsD0/view?usp=sharing
This repository is dedicated to the use of Large Language Models (LLMs) for parsing zoning documents. We introduce a generative regulatory measurement approach to decode and interpret statutes and administrative documents. This project leverages LLMs to construct a detailed assessment of U.S. zoning regulations and examines the correlation between these regulations, housing costs, and construction.
The work demonstrates the reliability of LLMs in analyzing complex regulatory datasets.
To install the Python packages required for this project, please run the following command in your terminal:
pip install -r requirements.txt

The project is divided into several key components:
- Embeddings Setup: This process creates embeddings from raw text and stores them in the defined path.
- LLM Inference: We use OpenAI models (e.g., GPT-3.5, GPT-4) to process zoning questions and generate structured answers.
- Data Preprocessing: This includes downloading shape files, merging raw housing/demographic data, and preparing municipality identifier datasets.
- Main Model Code: The core processing logic handles question-municipality pairs, organizes the data flow, and runs the model using parallel processing.
- Embeddings Workflow
- Inference Workflow
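The idea of running question-municipality pairs in parallel can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: `answer_pair` and `run_pairs` are hypothetical names, the example question is invented, and the LLM call is replaced by a placeholder that returns no answer.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def answer_pair(pair):
    """Placeholder for one LLM call on a (question, municipality) pair."""
    question, municipality = pair
    # A real implementation would build context from embeddings and query the model.
    return {"question": question, "municipality": municipality, "answer": None}


def run_pairs(questions, municipalities, max_workers=4):
    """Process every question for every municipality using a thread pool."""
    pairs = list(product(questions, municipalities))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input pairs.
        return list(pool.map(answer_pair, pairs))


if __name__ == "__main__":
    # Hypothetical question text; the real questions live in Questions.xlsx.
    results = run_pairs(["Is multifamily housing allowed?"], ["Wheaton, IL"])
    print(results)
```

Threads suit this kind of workload because each task spends most of its time waiting on a network response rather than computing.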
The codebase is organized into several key folders and files:
- Raw Data: Contains the raw zoning and housing-related data necessary for analysis.
- Processed Data: Stores the output from embedding processes, LLM inferences, and additional processed datasets.
- Code: Split into the following sections:
  - Pre-Processing Code: Prepares raw datasets for analysis, including downloading shape files and merging data.
  - Main Model Code: Runs the core model logic, including LLM inference and embeddings.
  - Tables and Figures Code: Generates the tables and figures used for analysis and reporting.
The Main Model Code is split into the following sections:
- Configuration Setup: Defines paths and settings required for the embedding and LLM processes, including API keys and paths to data directories.
- Context Building Code: Builds the context needed to process and answer zoning-related questions.
- Embedding Code: Manages embedding processes, ensuring raw text is split into manageable sections.
- GPT Functions: Helper functions for interacting with OpenAI's API, including token counting and batching logic.
- Helper Functions: Utility functions for data management and error handling.
- Question-Answer Code: Core logic for processing question-municipality pairs and managing LLM inferences.
- Question-Municipality Pairing: Manages the lifecycle of question-municipality pairs, including initialization and embedding.
- Model Batch Process: Batch processing logic for SLURM job arrays, allowing for distributed computing across nodes.
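To illustrate how a SLURM job array can distribute question-municipality pairs across nodes, the sketch below gives each array task a strided slice of the work. `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables that SLURM sets for array jobs; the function `shard_for_task`, the sample pairs, and the assumption that the array is submitted with indices starting at 0 are illustrative and not taken from the repository.

```python
import os


def shard_for_task(items, task_id, task_count):
    """Return the items assigned to one array task using a strided split."""
    if task_count < 1 or not 0 <= task_id < task_count:
        raise ValueError("task_id must be in [0, task_count)")
    # Task k takes items k, k + task_count, k + 2 * task_count, ...
    return items[task_id::task_count]


if __name__ == "__main__":
    # Defaults allow the script to run locally, outside of SLURM.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    # Hypothetical pairs; real runs would load them from the raw data files.
    pairs = [(q, m) for q in ("q1", "q2") for m in ("town_a", "town_b", "town_c")]
    print(shard_for_task(pairs, task_id, task_count))
```

A strided split keeps the shards balanced to within one item and needs no coordination between tasks, since each task derives its own share from its index alone.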
The raw_data folder contains several essential datasets:
- Sample Data.xlsx: List of municipalities and their zoning ordinances.
- Questions.xlsx: List of questions used in the analysis, including binary and numerical categories.
Processed data includes:
- Embeddings: Contains text embeddings created from raw zoning data, stored in the directory defined in config.yaml.
- Model Output: Inference results from LLMs, stored in separate folders for each model (GPT-3.5, GPT-4).
- Enriched Sample Data: A merged dataset of municipality characteristics, zoning regulations, and additional variables used for analysis.
The results folder contains the outputs of model runs and visualizations:
- Tables Folder: Contains Excel files with table outputs from the analysis.
- Figures Folder: Contains images of charts and maps generated from the zoning analysis.
Tables and figures can be recreated by running the appropriate scripts in the Tables and Figures Code section.
Contact Information
For any inquiries, please contact dm4766@stern.nyu.edu.
License
This project is licensed under the MIT License. See the LICENSE file for more details.