In [None]:
import catminer
from catminer.multiturn import extract
from catminer.functions import define_client
import os

# show the current working directory; the relative paths used below are resolved against it
os.path.abspath('')

First, specify which LLM API you will be using.
Currently, only API-hosted models are supported, specifically through Fireworks AI, Amazon Bedrock, and OpenAI.
For this example, we will use Bedrock.

In [None]:
api_name = "Bedrock"

Because we are using the Bedrock API, there are four environment variables that we must define. These are:
1. MODEL_ID
2. AWS_REGION
3. AWS_ACCESS_KEY_ID
4. AWS_SECRET_ACCESS_KEY

Documentation for Amazon Bedrock, including information on how to sign up and obtain an Access Key ID and Secret Access Key, can be found here: https://docs.aws.amazon.com/bedrock/

Model IDs and supported AWS Regions are listed here: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html

Below, we define these variables with *PLACEHOLDER* text for the latter two. Once you are registered and can provide your own confidential API keys, replace the *PLACEHOLDER* text with your key strings. 

In this example, we will use Llama 3.1 8B and specify an AWS Region that works for us. Depending on your location, you may need to change the model and Region.

In [None]:
os.environ['MODEL_ID'] = "us.meta.llama3-1-8b-instruct-v1:0"

os.environ['AWS_REGION'] = "us-east-2"

# replace each "PLACEHOLDER" string with your own key
os.environ['AWS_ACCESS_KEY_ID'] = "PLACEHOLDER"

os.environ['AWS_SECRET_ACCESS_KEY'] = "PLACEHOLDER"
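
Before creating the client, it can help to confirm that all four variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and is not part of CatMiner.

```python
import os

# The four variables this example expects for the Bedrock API.
REQUIRED_VARS = ["MODEL_ID", "AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset, empty, or still set to 'PLACEHOLDER'.

    Hypothetical helper for illustration; not part of CatMiner.
    """
    env = os.environ if env is None else env
    return [n for n in names if env.get(n, "").strip() in ("", "PLACEHOLDER")]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Still to be defined:", ", ".join(missing))
```

If the list is non-empty, define the remaining variables before moving on to the next cell.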

We can now define our client variable, which is a necessary input for running the extraction workflow. Since we are using Llama 3.1 8B in this example, we will also define model_type = 'Meta'.

In [None]:
client = define_client(api_name)
model_type = 'Meta'

Lastly, we will define the final few key inputs to our extraction workflow. These include the names of the properties and conditions we want to target, and the locations of our source text and custom system prompts. Since the sample paper we will be extracting from is on the oxidative coupling of methane, the target properties will be C2(+) yield and C2(+) selectivity -- two common figures of merit for this reaction. For our condition, we will search for temperature. A single preprocessed paper has already been provided in the adjacent "downloaded_paper/" folder, so that will be our source_dir. The system prompts are optional and can be customized arbitrarily, but samples are provided in yield-sp.txt and selectivity-sp.txt. 

In addition to specifying the target properties and conditions, it is often helpful to specify *required phrases*, strings of text that must be present in a sentence in order for CatMiner to consider extracting a given property from it. These can substantially cut down on API costs with minimal losses in recall. In our case, we will require '%' when searching for our yield and selectivity properties, and common temperature units for the operating condition. 

In [None]:
source_dir = 'downloaded_paper/'

target_properties = ['C2(+) yield', 'C2(+) selectivity']
required_prop_phrases = [['%'], ['%']]

target_conditions = ['temperature']
required_cond_phrases = [' K', '°C']

sp_paths = ['yield-sp.txt', 'selectivity-sp.txt']
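
To illustrate the idea behind required phrases, here is a minimal sketch of a sentence pre-filter. It reflects an assumption about the general technique (a sentence passes if it contains at least one of the listed phrases), not CatMiner's actual implementation; the function `passes_required_phrases` and the sample sentences are hypothetical.

```python
def passes_required_phrases(sentence, phrases):
    """Return True if the sentence contains at least one required phrase.

    An empty phrase list means no filtering is applied.
    Hypothetical helper for illustration; not part of CatMiner.
    """
    return not phrases or any(p in sentence for p in phrases)


# Made-up sentences, for illustration only.
sentences = [
    "The catalyst gave a C2(+) yield of 18% under these conditions.",
    "Methane activation remains a challenging problem.",
    "The reaction was carried out at 800 °C.",
]

# Only sentences containing '%' would be passed on to the LLM for a property.
candidates = [s for s in sentences if passes_required_phrases(s, ["%"])]
print(candidates)
```

Sentences that fail the filter never reach the LLM, which is where the savings in API cost would come from.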

Finally, we call the extract function. It accepts further optional parameters: we could enable features such as abbreviation resolution or the prompting strategies tested in the associated publication, or customize the number of sentences the LLM considers when searching for property and condition values. For now, we will stick with the defaults.

In [None]:
extract(source_dir, 
        client, 
        target_properties, 
        target_conditions, 
        model_type, 
        sp_paths, 
        required_prop_phrases=required_prop_phrases, 
        required_cond_phrases=required_cond_phrases)