In [1]:
import json
import pandas as pd
from text2graph.llm import ask_llm, OpenSourceModel
import os

os.getcwd()

'/data/clo36/repo/text2graph_llm'

In [2]:
print(f"Supported open-source (OSS) models: {[m.value for m in OpenSourceModel]}")

Supported open-source (OSS) models: ['mixtral', 'openhermes']


- `mixtral`: top open-source LLM according to the [Chatbot Arena leaderboard](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), Elo 1118, roughly on par with GPT-3.5 Turbo.
- `openhermes`: a decent, smaller (7B) open-source model, Elo 1078. Not tested here yet; Bill is reportedly getting reasonable results with it in a similar use case.

## Extract locations based on Shanan's example

In [3]:
NEGATIVE_EXAMPLE = """
This community-vetted CO2 syn- thesis represents the most reliable data avail- able to date and a means to improve our understanding of past changes in global cli- mate and carbon cycling as well as organismal evolution. However, this effort is still incomplete. Data remain sparse during the earlier part of the record and in some instances are domi- nated by estimates from a single proxy system. Generating a paleo-CO2 record with even greater confidence will require further research using multiple proxies to fill in data gaps and increase overall data resolution, resolve discrepancies between estimates from contemporaneous proxy analyses, reduce uncertainty o
"""

POSITIVE_EXAMPLE = """
The top of the Sauk megasequence in Minnesota is at the unconformable contact of the Shakopee Formation with the St. Peter Sandstone. Younger rocks are present beneath the St. Peter Sandstone on the southern and east- ern flanks of the Ozark dome, where the upper Sauk succession includes the Roubidoux, Jefferson City, Cotter, Powell – Smithville – Black Rock, and Everton units in that stratigraphic order (Ethington et al., 2012; Palmer et al., 2012). The Shakopee Formation is equivalent to some lower part of this succession, but sparse inverte- brate faunas and long-ranging conodonts in these units preclude correlation with high resolution. The Jasper Member of the Everton Formation of northern Arkansas contains conodonts of the Histiodella holodentata Biozone, which demonstrates the latest early Whiterockian age for the top of the rocks of the GACB in that region. No faunal evidence is available there for the age of the base of the St. Peter Sandstone. The boundary between the Sauk and Tippecanoe megasequences may be a cor- relative conformity in the Reelfoot rift of southeastern Missouri and northeastern Missouri, but this has not been demonstrated.
"""

In [Shanan's example](https://chat.openai.com/share/d14a5163-ddb4-4f28-b6d4-bbf4839f62cb), he guided the model through a two-step chain of thought:

1. What locations are mentioned in the following paragraph?
2. What geological entities are associated with those locations?

We can try to replicate this exactly with our OSS models.

In [4]:
q1 = "What locations are mentioned in the following paragraph?"
q2 = "What geological entities are associated with those locations?"

message = {"role": "user", "content": f"{q1}\n{NEGATIVE_EXAMPLE}"}
response = ask_llm(messages=[message], model="mixtral")
print(response)

 The paragraph does not mention any specific locations. It discusses a community-vetted CO2 synthesis and the need for further research to improve understanding of past changes in global climate, carbon cycling, and organismal evolution. The focus is on data gaps, discrepancies between proxy estimates, and increasing overall data resolution rather than specific geographic locations.


For reference, this is the exact return from Shanan's example:
```
The paragraph does not mention specific locations. Instead, it discusses data synthesis and the need for further research to improve our understanding of past changes in global climate and carbon cycling. Therefore, specific locations are not mentioned in this paragraph.
```

Close enough; GPT perhaps prefers slightly simpler wording.

In [5]:
message = {"role": "user", "content": f"{q1}\n{POSITIVE_EXAMPLE}"}
response = ask_llm(messages=[message], model="mixtral")
print(response)

 The paragraph mentions several locations:

* The top of the Sauk megasequence in Minnesota, specifically at the unconformable contact of the Shakopee Formation with the St. Peter Sandstone.
* The southern and eastern flanks of the Ozark dome, where younger rocks are present beneath the St. Peter Sandstone.
* Northern Arkansas, where the Jasper Member of the Everton Formation contains conodonts of the Histiodella holodentata Biozone, indicating a latest early Whiterockian age for the top of the rocks of the GACB in that region.
* The Reelfoot rift of southeastern Missouri and northeastern Missouri, where the boundary between the Sauk and Tippecanoe megasequences may be a correlative conformity.

Additionally, the paragraph mentions several geological formations and units, including:

* The Shakopee Formation
* The St. Peter Sandstone
* The Roubidoux
* The Jefferson City
* The Cotter
* The Powell – Smithville – Black Rock
* The Everton units
* The Jasper Member of the Everton Formation


For reference, this is the exact return from Shanan's example:
```
The locations mentioned in the paragraph are:

Minnesota
Ozark dome (southern and eastern flanks)
Northern Arkansas
Reelfoot rift (southeastern Missouri and northeastern Missouri)
```

Simulate the conversation with a follow-up question:

In [6]:
messages = [message]  # Last message
messages.append({"role": "assistant", "content": response})  # Last response
messages.append({"role": "user", "content": q2})
response = ask_llm(messages=messages, model="mixtral")
print(response)

 The following geological entities are associated with the locations mentioned in the paragraph:

* The Sauk megasequence is a large-scale division of rock strata that represents a long period of geologic time. It is present in Minnesota, where it is represented by the Shakopee Formation and the St. Peter Sandstone, as well as in other locations such as the southern and eastern flanks of the Ozark dome, where it includes the Roubidoux, Jefferson City, Cotter, Powell – Smithville – Black Rock, and Everton units.
* The Shakopee Formation is a geologic formation that is part of the Sauk megasequence in Minnesota. It is present at the top of the Sauk megasequence and is in unconformable contact with the St. Peter Sandstone.
* The St. Peter Sandstone is a geologic formation that is part of the Sauk megasequence in Minnesota. It lies below the Shakopee Formation on the southern and eastern flanks of the Ozark dome, where it is covered by younger rocks.
* The Roubidoux, Jefferson City, Cotter

For reference, this is the exact return from Shanan's example:
```
The geological entities associated with the mentioned locations are:

Minnesota: Sauk megasequence, Shakopee Formation, St. Peter Sandstone.
Ozark dome (southern and eastern flanks): Sauk megasequence, Roubidoux Formation, Jefferson City Formation, Cotter Formation, Powell-Smithville-Black Rock Formation, Everton Formation.
Northern Arkansas: Everton Formation, Jasper Member.
Reelfoot rift (southeastern Missouri and northeastern Missouri): Sauk megasequence, Tippecanoe megasequence.
```
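
The two-turn exchange above can be generalized to any number of follow-up questions. The following is a minimal sketch, not part of `text2graph`: `run_dialogue` and its `ask` parameter are hypothetical names, and `ask` is assumed to be any callable that takes the message list and returns the reply text.

```python
from typing import Callable

Message = dict[str, str]


def run_dialogue(
    ask: Callable[[list[Message]], str],
    paragraph: str,
    questions: list[str],
) -> list[str]:
    """Ask each question in turn, feeding earlier answers back as context."""
    messages: list[Message] = []
    answers: list[str] = []
    for i, question in enumerate(questions):
        # Only the first question carries the paragraph, as in cells 5 and 6.
        content = f"{question}\n{paragraph}" if i == 0 else question
        messages.append({"role": "user", "content": content})
        answer = ask(messages)
        # Record the reply so the next question sees the full history.
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```

With the notebook's own function, the call would look like `run_dialogue(lambda m: ask_llm(messages=m, model="mixtral"), POSITIVE_EXAMPLE, [q1, q2])`.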

Interim conclusion:

In this particular use case, perhaps `mixtral` is sufficient?

More issues to consider:

1. Can we speed things up by reducing the number of query round trips from 2 to 1?
1. Can we return JSON instead of plain text?
1. How should we lemmatize locations and geo-entities? (Try traditional methods first, then an embedding-distance-based method.)
1. Test on a larger test set. Use a subset of Devesh's test set, perhaps 30 examples first, and scale up later if the results are good. This needs manual labeling, since Devesh's test set doesn't have labels.
1. Serve an API endpoint for user testing.
1. Serve a demo for user testing.
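
As a starting point for the "traditional methods" in item 3, here is a minimal sketch using only the standard library. It is plain string normalization rather than true lemmatization, and `normalize_name` is a hypothetical helper whose rules (dash unification, dropping a leading "the", case folding) are assumptions about what matters for this data.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Map spelling variants of a location or unit name to one key."""
    text = unicodedata.normalize("NFKC", name)
    # Unify hyphen and dash variants (U+2010 to U+2015) to an ASCII hyphen.
    text = re.sub(r"[\u2010-\u2015]", "-", text)
    # Remove spaces around hyphens: "Powell - Smithville" -> "Powell-Smithville".
    text = re.sub(r"\s*-\s*", "-", text)
    # Collapse runs of whitespace, trim, and compare case-insensitively.
    text = re.sub(r"\s+", " ", text).strip().casefold()
    # Drop a leading article.
    return re.sub(r"^the\s+", "", text)
```

For example, "The Powell – Smithville – Black Rock" from the `mixtral` reply and "Powell-Smithville-Black Rock" from the reference reply normalize to the same string.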

Backlog low-priority items:

1. Geo-coding for extracted locations.
1. How to use the Stratigraphy and Lithology list from MacroStrat?
1. Multi-agent setup to improve quality.
1. Logging and monitoring: collect user feedback for fine-tuning (e.g., prompt tuning (interactive or generative) or Direct Preference Optimization, which needs around 20k labeled examples).


## Speed up experiment: Zero-shot CoT

- Borrow some system-prompt ideas from Sky's [ta2-extraction example](https://github.com/DARPA-CRITICALMAAS/ta2-extraction/blob/master/prompts.py).
- Try to reduce the round trips from 2 to 1.
- Improve the response format from free text to JSON-like.


In [7]:
def get_prompt(text: str) -> list[dict]:
    """V0 geo-location prompting."""

    system_prompt = {
        "role": "system",
        "content": "You are a geology expert and you are very good in understanding mining reports. Think step by step: What locations are mentioned in the following paragraph? and What geological entities are associated with those locations? Return in json format like this: {'location1': ['entity1', 'entity2', ...], 'location2': ['entity3', 'entity4', ...]}. Return an empty dictionary if there is no location.",
    }
    user_prompt = {"role": "user", "content": text}
    return [system_prompt, user_prompt]


def experiment1(text: str) -> str:
    """V0 geo-location experiment."""
    response = ask_llm(messages=get_prompt(text), model="mixtral")
    return response

In [8]:
print(experiment1(NEGATIVE_EXAMPLE))

 {  }

The paragraph does not mention any specific locations or geological entities associated with those locations.
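
The reply above shows that `mixtral` may append prose after the JSON object, so passing the raw response straight to `json.loads` would fail. A minimal sketch of a more tolerant parser follows; `parse_llm_dict` is a hypothetical helper, not part of `text2graph`, and it assumes the reply contains at most one top-level object.

```python
import ast
import json


def parse_llm_dict(response: str) -> dict:
    """Best-effort parse of the span from the first '{' to the last '}'."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end < start:
        return {}
    candidate = response[start : end + 1]
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        # The prompt's example uses single quotes, so the model may
        # answer with a Python-style dict instead of strict JSON.
        try:
            parsed = ast.literal_eval(candidate)
        except (ValueError, SyntaxError, TypeError):
            return {}
    return parsed if isinstance(parsed, dict) else {}
```

Returning an empty dictionary on failure matches the prompt's convention for "no location", but it also hides parse errors; logging the raw response in that branch may be worthwhile.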


In [9]:
print(experiment1(POSITIVE_EXAMPLE))

 {
"Minnesota": ["Sauk megasequence", "Shakopee Formation", "St. Peter Sandstone"],
"southern and eastern flanks of the Ozark dome": ["younger rocks", "Roubidoux", "Jefferson City", "Cotter", "Powell – Smithville – Black Rock", "Everton units"],
"northern Arkansas": ["Jasper Member of the Everton Formation", "Histiodella holodentata Biozone"],
"Reelfoot rift of southeastern Missouri and northeastern Missouri": ["correlative conformity between Sauk and Tippecanoe megasequences"]
}


For reference, this is the exact return from Shanan's example:
```
The geological entities associated with the mentioned locations are:

Minnesota: Sauk megasequence, Shakopee Formation, St. Peter Sandstone.
Ozark dome (southern and eastern flanks): Sauk megasequence, Roubidoux Formation, Jefferson City Formation, Cotter Formation, Powell-Smithville-Black Rock Formation, Everton Formation.
Northern Arkansas: Everton Formation, Jasper Member.
Reelfoot rift (southeastern Missouri and northeastern Missouri): Sauk megasequence, Tippecanoe megasequence.
```
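
One rough way to quantify how close the JSON output is to the reference answer is entity-level precision and recall. The sketch below assumes both answers have already been converted to dictionaries (the reference by hand); `entity_overlap` is a hypothetical helper, and it uses exact, case-insensitive string matching across all locations.

```python
def entity_overlap(
    predicted: dict[str, list[str]],
    reference: dict[str, list[str]],
) -> dict[str, float]:
    """Precision and recall of predicted entities, ignoring location keys."""
    pred = {e.casefold() for ents in predicted.values() for e in ents}
    ref = {e.casefold() for ents in reference.values() for e in ents}
    hits = len(pred & ref)
    return {
        "precision": hits / len(pred) if pred else 0.0,
        "recall": hits / len(ref) if ref else 0.0,
    }
```

Because matching is exact, "Jasper Member of the Everton Formation" does not count as a hit for "Jasper Member", and location keys such as "southern and eastern flanks of the Ozark dome" versus "Ozark dome (southern and eastern flanks)" are not compared at all. Some form of name normalization would be needed before this becomes a fair score.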

## Micro testset (n=30)

In [10]:
# Run once
# critical_maas_knowledge_graph_testset = pd.read_parquet("data/formation_sample.parquet.gzip")
# testset_micro = critical_maas_knowledge_graph_testset.sample(30)
# testset_micro.to_parquet("data/testset_micro.parquet.gzip", compression="gzip")

testset_micro = pd.read_parquet("data/testset_micro.parquet.gzip")
testset_micro.sample(3)

Unnamed: 0,formation_name,paper_id,paragraph
1156,Coyote Butte Formation,578e49dbcf58f15c7f6149cf,The Grindstone terrane contains some of the ol...
5663,Sepur Formation,54b43252e138239d8684b90c,"At Campur, the rudist limestones of the Campur..."
2890,Murdock Mountain Formation,5c64fd1e1faed6554895cac3,Occurrence.—Hinganella felderi occurs througho...


In [11]:
results = {
    "note": "V0 geo-location experiment.",
    "testset_path": "data/testset_micro.parquet.gzip",
    "input": [],
    "output": [],
}

for _, row in testset_micro.iterrows():
    paragraph = row["paragraph"]  # avoid shadowing the built-in input()
    output = experiment1(paragraph)
    results["input"].append(paragraph)
    results["output"].append(output)

In [12]:
with open("results.json", "w") as f:
    json.dump(results, f, indent=4)

In [13]:
def print_output(index: int, results: dict) -> None:
    print(f"Input: {results['input'][index]}")
    print()
    print(f"Output: {results['output'][index]}")


for i in range(len(results["input"])):  # one entry per test paragraph
    print_output(i, results)
    print("=" * 200)

Input: (4) Depositional salinities high enough at times for the development of anhydrite crystals in fine-grained sediment (first recorded at - 27 m just below the top of the flakestone member). In the underlying strata the restriction of anhydrite pseudomorphs to permeable tempestites suggests formation during the deposition of the stromatolitic member (see later discussion).
Overall these changes indicate a transition to a lagoonal environment, i.e., shallow, nonemergent, intermittently hypersaline. Higher in the stromatolitic member (above -15 m) an intermittently emergent and more actively evaporitic setting is recorded by desiccated surfaces, abundant generation of originally rigid intraclasts, and increased abundance of ooids and anhydrite pseudomorphs. Evidence for tidal currents is best seen in the basal sandstone of the Spiral Creek Formation. Here, the combination of apparently bimodal or polymodal palaeocurrents, the size of cross-stratification sets, and the presence of des