In [None]:
# Uncomment the following line to install all dependencies
# !pip install openai instructor pydantic tqdm

# Parsing Dataverse metadata using GPT-4

This notebook demonstrates how to use GPT-4 to extract metadata from Dataverse datasets. The dataset metadata comes from the [Harvard Dataverse](https://dataverse.harvard.edu/) repository, a platform for sharing and preserving research data. To retrieve the metadata, we used the [Dataverse API](https://guides.dataverse.org/en/latest/api/index.html) and saved the responses as JSON files.
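The saved files are read later in this notebook via `data` → `items` → `description`, which matches the shape of a Dataverse Search API response. As a minimal sketch, assuming the files were produced by a search request (the query string and the helper names are illustrative and not taken from the original workflow), a request URL can be built and a description pulled out of a response roughly like this:

```python
import json
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://dataverse.harvard.edu"  # Harvard Dataverse instance


def build_search_url(query: str, per_page: int = 1) -> str:
    """Build a Search API URL restricted to datasets (hypothetical helper)."""
    params = {"q": query, "type": "dataset", "per_page": per_page}
    return f"{BASE_URL}/api/search?{urlencode(params)}"


def first_description(payload: dict) -> Optional[str]:
    """Return the description of the first item, or None if there are no items."""
    items = payload.get("data", {}).get("items", [])
    return items[0].get("description") if items else None


# Trimmed, made-up response used only to illustrate the expected shape.
sample = json.loads(
    '{"status": "OK", "data": {"items": [{"name": "Example", '
    '"description": "Monthly rainfall grids."}]}}'
)

print(build_search_url("rainfall"))
print(first_description(sample))
```

The actual HTTP request is left out on purpose so that the sketch runs offline.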

Our goal is to use GPT-4 to turn unstructured dataset descriptions into structured metadata. We use [`instructor`](https://jxnl.github.io/instructor/) to make the model's output conform to a questionnaire model implemented as [`pydantic`](https://docs.pydantic.dev/latest/) classes. The resulting structured metadata can be validated and analyzed programmatically, and the same procedure can be applied to larger collections of datasets.
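The questionnaire model itself lives in `model.questionaires` and is not shown in this notebook. The sketch below uses a hypothetical stand-in (`ExampleResponse` and all of its fields are invented for illustration) to show the kind of `pydantic` class that can serve as a response model, and how `exclude_unset=True` keeps only the fields that were actually provided:

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class ExampleResponse(BaseModel):
    """Hypothetical questionnaire; the real ResponseModel may look different."""

    topic: Optional[str] = Field(
        default=None, description="Main subject of the dataset"
    )
    regions: List[str] = Field(
        default_factory=list, description="Geographic areas covered"
    )
    has_time_series: Optional[bool] = Field(
        default=None, description="Whether observations are repeated over time"
    )


# Validate a plain dict, similar in spirit to validating a model's JSON answer.
answer = ExampleResponse.model_validate({"topic": "rainfall", "regions": ["Kenya"]})

# Fields that were never set (here: has_time_series) are left out of the dump.
print(answer.model_dump(exclude_unset=True))
```

The field descriptions matter in this setup because they become part of the schema that the language model is asked to fill in.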

Note: to run this notebook, you need to provide an OpenAI API key via the `OPENAI_API_KEY` environment variable, and the account must have sufficient credits to use GPT-4.

In [1]:
import openai
import instructor
import glob
import os
import json
import tqdm

from model.questionaires import ResponseModel

# Patch the client so that `create` accepts a `response_model` argument.
client = instructor.patch(openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"]))

In [2]:
# Collect the description of the first item in each saved API response.
descriptions = {}
for path in glob.glob("../data/metadata/*.json"):
    fname = os.path.basename(path)
    with open(path) as f:
        payload = json.load(f)
    descriptions[fname] = {
        "description": payload["data"]["items"][0].get("description")
    }

In [4]:
system_msg = "You are a helpful assistant who understands geospatial sciences."

for key in tqdm.tqdm(descriptions.keys()):

    # Skip entries that already have a response, so the cell can be re-run.
    if descriptions[key].get("response"):
        continue

    user_msg = (
        "Extract metadata from the following dataset description: "
        + descriptions[key]["description"]
    )

    # The patched client parses the answer into a ResponseModel instance.
    response: ResponseModel = client.chat.completions.create(
        model="gpt-4",
        response_model=ResponseModel,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
    )
    
    descriptions[key]["response"] = response.model_dump(exclude_unset=True)

100%|██████████| 134/134 [15:24<00:00,  6.90s/it]
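The `continue` at the top of the loop lets the cell be re-run without repeating finished requests, but the results only live in memory until the final cell writes them out. A minimal sketch of one way to make a long run more robust is to save after every item. The `extract` callable below is a hypothetical stand-in for the GPT-4 request, and `run_with_checkpoints` is not part of the original notebook:

```python
import json
import os
import tempfile
from typing import Callable, Dict


def run_with_checkpoints(
    descriptions: Dict[str, dict],
    extract: Callable[[str], dict],
    checkpoint_path: str,
) -> Dict[str, dict]:
    """Fill in missing responses, writing all results to disk after each one."""
    for entry in descriptions.values():
        if entry.get("response"):
            continue  # already processed in an earlier run
        entry["response"] = extract(entry["description"])
        with open(checkpoint_path, "w") as f:
            json.dump(descriptions, f, indent=2)
    return descriptions


# Stand-in for the GPT-4 call so that the sketch runs offline.
def fake_extract(text: str) -> dict:
    return {"n_words": len(text.split())}


data = {
    "a.json": {"description": "Rainfall grids for Kenya"},
    "b.json": {"description": "Done already", "response": {"n_words": 2}},
}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
result = run_with_checkpoints(data, fake_extract, path)
print(result["a.json"]["response"])
```

Rewriting the whole file on each step is wasteful for large collections, but for roughly a hundred entries the cost is negligible next to the request latency.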


In [6]:
with open("../data/gpt4_on_metadata.json", "w") as f:
    json.dump(descriptions, f, indent=2)
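Because each response was dumped with `exclude_unset=True`, only the fields that were explicitly set for a dataset appear in its entry. As a rough sketch of a first analysis step, the saved results can be reloaded to count how often each field was filled in. The sample records below are made up, and the field names are placeholders rather than the real questionnaire fields:

```python
from collections import Counter
from typing import Dict


def field_coverage(results: Dict[str, dict]) -> Counter:
    """Count in how many responses each top-level field appears."""
    counts: Counter = Counter()
    for entry in results.values():
        counts.update((entry.get("response") or {}).keys())
    return counts


# In the notebook, `results` would come from json.load on gpt4_on_metadata.json.
results = {
    "a.json": {"description": "...", "response": {"topic": "rainfall", "regions": ["Kenya"]}},
    "b.json": {"description": "...", "response": {"topic": "land cover"}},
    "c.json": {"description": "..."},
}

coverage = field_coverage(results)
print(coverage.most_common())
```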