In [1]:
%load_ext autoreload
%autoreload 2

https://www.notion.so/scalr/bb14f178490140e0bac9016a30949b84?v=5cd436e7aea34a5daaa677ba63790450&p=4d815b136f164c429bc319f45b9db015&pm=s

# [Rule Modifier] Rule splitting

## Problem statement

- current UI can modify elements in a rule (e.g.),
- need: ability to modify the rule as a whole

## Proposed work

Create a new prompt:

- split rule based on the following two scenarios:

1. [<or> Tags](https://www.notion.so/or-Tags-83b18d9960614627b7a3e5430193689e?pvs=21)
2. [Rule too broad](https://www.notion.so/Rule-too-broad-26c11a6d02cc44c388710ce7d1a47915?pvs=21)

The resulting split rules would need:

- New IDs
  - `"BRIEFCATCH_{rand_int(40)}"`
- Rule `name` tag modified like so
  - `"BRIEFCATCH_PUNCHINESS_288.1"`
  - `"BRIEFCATCH_PUNCHINESS_288.2"`
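A minimal sketch of the renaming step. It assumes that `rand_int(40)` can be approximated by the integer form of a random UUID (the existing IDs look like 128-bit integers, but that is a guess), and the helper name `assign_split_identity` is hypothetical:

```python
import re
import uuid
from typing import List


def assign_split_identity(split_rules: List[str], base_name: str) -> List[str]:
    """Give each split rule a fresh id and a `<base_name>.<n>` name."""
    renamed = []
    for index, rule_xml in enumerate(split_rules, start=1):
        # assumption: a random UUID integer stands in for rand_int(40)
        new_id = f"BRIEFCATCH_{uuid.uuid4().int}"
        new_name = f"{base_name}.{index}"
        # only the attributes on the opening <rule ...> tag are rewritten
        rule_xml = re.sub(
            r'(<rule\b[^>]*?\bid=")[^"]*(")',
            lambda m: m.group(1) + new_id + m.group(2),
            rule_xml,
            count=1,
        )
        rule_xml = re.sub(
            r'(<rule\b[^>]*?\bname=")[^"]*(")',
            lambda m: m.group(1) + new_name + m.group(2),
            rule_xml,
            count=1,
        )
        renamed.append(rule_xml)
    return renamed
```

The replacement is passed as a function rather than a string so that no character in the new values is interpreted as a regex backreference.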

### when deploying

- Use the rule modification checker (developed in previous story) to ensure example / suggestion tags match the pattern after splitting the rules
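The checker from the previous story is not shown in this notebook. As a rough illustration of the kind of mismatch it should catch, here is a hypothetical check (`check_backref_corrections` is an invented name) that only handles plain-text suggestions using the `\1` backreference, and assumes the first whitespace-separated word of the `<marker>` is the first matched token:

```python
import xml.etree.ElementTree as ET
from typing import List


def check_backref_corrections(rule_xml: str) -> List[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    rule = ET.fromstring(rule_xml.strip())
    suggestions = rule.findall("suggestion")
    # suggestions built from <match> elements are out of scope for this sketch
    if any(len(suggestion) > 0 for suggestion in suggestions):
        return []
    templates = [suggestion.text or "" for suggestion in suggestions]
    problems = []
    for example in rule.findall("example"):
        correction = example.get("correction")
        marker = example.find("marker")
        marker_words = (marker.text or "").split() if marker is not None else []
        if correction is None or not marker_words:
            continue  # examples without a correction or marker are skipped
        # assumption: the first word of the marker is what \1 refers to
        expected = [t.replace("\\1", marker_words[0]) for t in templates]
        actual = correction.split("|")
        if expected != actual:
            problems.append(f"expected {expected}, found {actual}")
    return problems
```

For a split rule whose marker reads "inquired as to", this flags a `correction` attribute that still says "inquiry into".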

# Success criteria

The criteria that must be met in order to consider this project a success.

- UI updated to allow for the splitting of rules
  - "Separate branch of logic"
    - not sure what is meant here
  - Split option added to drop down
  - Once the user selects “Split”, I think a secondary select is needed to capture why the user wants to split the rule
    - `<or> tag` or `rule too broad` could be the options
      - If the rule contains no `<or>` tag, we could drop that option
    - `<or>` tag splitting needs no further user input and no GPT call
    - If the rule is too broad, we will need a field for user input and a GPT call to identify how best to split the rule
- Splitting a Rule prompt written and tested
- Method for creating PR for updating repo will need to change too
  - Instead of replacing a rule that has been modified or just adding in new rules, we will need to delete the original rule that has now been split into N many
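A sketch of what the changed PR step might do to the grammar file contents, assuming rules are stored as `<rule id="...">...</rule>` blocks in one XML string. The function name `replace_rule_with_splits` is hypothetical and not part of the existing PR code:

```python
import re
from typing import List


def replace_rule_with_splits(
    grammar_xml: str, original_id: str, split_rules: List[str]
) -> str:
    """Delete the original rule and put the N split rules in its place."""
    pattern = re.compile(
        r'<rule\s+id="' + re.escape(original_id) + r'".*?</rule>', re.DOTALL
    )
    match = pattern.search(grammar_xml)
    if match is None:
        raise ValueError(f"rule {original_id} not found in grammar")
    replacement = "\n".join(rule.strip() for rule in split_rules)
    # slicing (rather than re.sub) keeps backslashes such as \1 untouched
    return grammar_xml[: match.start()] + replacement + grammar_xml[match.end() :]
```

Inserting the split rules where the original sat keeps rule order stable, which may matter if the repo relies on ordering.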


Using this rule as an example

```
<rule id="BRIEFCATCH_11012406027615556274904173201077833804" name="BRIEFCATCH_PUNCHINESS_288">
    <pattern>
        <or>
            <token inflected="yes">inquire</token>
            <token>inquiry</token>
        </or>
        <token>as</token>
        <token>to</token>
    </pattern>
    <message>Would direct language...</message>
    <suggestion>\1 into</suggestion>
    <suggestion>\1 about</suggestion>
    <suggestion>\1 in</suggestion>
    <suggestion>\1 from</suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":4,"priority":"3.209","WORD":true,"OUTLOOK":true}</short>
    <example correction="inquiry into|inquiry about|inquiry in|inquiry from">The <marker>inquiry as to</marker> the strong majority of documents and testimony sought continues.</example>
</rule>
```

We would need to split the rule into two rules:

`Rule 1` - just the inflected token for `inquire` (this one requires an update to the example tag)

```
<rule id="BRIEFCATCH_11012406027615556274904173201077833804" name="BRIEFCATCH_PUNCHINESS_288.1">
    <pattern>
        <token inflected="yes">inquire</token>
        <token>as</token>
        <token>to</token>
    </pattern>
    <message>Would direct language...</message>
    <suggestion>\1 into</suggestion>
    <suggestion>\1 about</suggestion>
    <suggestion>\1 in</suggestion>
    <suggestion>\1 from</suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":4,"priority":"3.209","WORD":true,"OUTLOOK":true}</short>
    <example correction="inquired into|inquired about|inquired in|inquired from">He <marker>inquired as to</marker> the willpower of the group</example>
</rule>
```

`Rule 2` - just the token for `inquiry` (no change for the example tag)

```
<rule id="BRIEFCATCH_11012406027615556274904173201077833804" name="BRIEFCATCH_PUNCHINESS_288.2">
    <pattern>
        <token>inquiry</token>
        <token>as</token>
        <token>to</token>
    </pattern>
    <message>Would direct language...</message>
    <suggestion>\1 into</suggestion>
    <suggestion>\1 about</suggestion>
    <suggestion>\1 in</suggestion>
    <suggestion>\1 from</suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":4,"priority":"3.209","WORD":true,"OUTLOOK":true}</short>
    <example correction="inquiry into|inquiry about|inquiry in|inquiry from">The <marker>inquiry as to</marker> the strong majority of documents and testimony sought continues.</example>
</rule>
```


In [2]:
# Using this rule as an example

# raw string, so that backreferences such as \1 are not read as escape sequences
original_rule = r"""
<rule id="BRIEFCATCH_11012406027615556274904173201077833804" name="BRIEFCATCH_PUNCHINESS_288">
    <pattern>
        <or>
            <token inflected="yes">inquire</token>
            <token>inquiry</token>
        </or>
        <token>as</token>
        <token>to</token>
    </pattern>
    <message>Would direct language...</message>
    <suggestion>\1 into</suggestion>
    <suggestion>\1 about</suggestion>
    <suggestion>\1 in</suggestion>
    <suggestion>\1 from</suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":4,"priority":"3.209","WORD":true,"OUTLOOK":true}</short>
    <example correction="inquiry into|inquiry about|inquiry in|inquiry from">The <marker>inquiry as to</marker> the strong majority of documents and testimony sought continues.</example>
</rule>
"""

# We would need to split the rule into two rules:

# `Rule 1` - just the inflected token for `inquire` (this one requires an update to the example tag)

rule_1 = r"""
<rule id="BRIEFCATCH_11012406027615556274904173201077833804" name="BRIEFCATCH_PUNCHINESS_288.1">
    <pattern>
        <token inflected="yes">inquire</token>
        <token>as</token>
        <token>to</token>
    </pattern>
    <message>Would direct language...</message>
    <suggestion>\1 into</suggestion>
    <suggestion>\1 about</suggestion>
    <suggestion>\1 in</suggestion>
    <suggestion>\1 from</suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":4,"priority":"3.209","WORD":true,"OUTLOOK":true}</short>
    <example correction="inquired into|inquired about|inquired in|inquired from">He <marker>inquired as to</marker> the willpower of the group</example>
</rule>
"""

# `Rule 2` - just the token for `inquiry` (no change for the example tag)

rule_2 = r"""
<rule id="BRIEFCATCH_11012406027615556274904173201077833804" name="BRIEFCATCH_PUNCHINESS_288.2">
    <pattern>
        <token>inquiry</token>
        <token>as</token>
        <token>to</token>
    </pattern>
    <message>Would direct language...</message>
    <suggestion>\1 into</suggestion>
    <suggestion>\1 about</suggestion>
    <suggestion>\1 in</suggestion>
    <suggestion>\1 from</suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":4,"priority":"3.209","WORD":true,"OUTLOOK":true}</short>
    <example correction="inquiry into|inquiry about|inquiry in|inquiry from">The <marker>inquiry as to</marker> the strong majority of documents and testimony sought continues.</example>
</rule>
"""

# split on or operands


In [3]:
simple_rule = """
<rule id="BRIEFCATCH_331448315792705843437979608685430062094" name="BRIEFCATCH_PUNCHINESS_1872">
    <antipattern>
        <token postag="RB.*" postag_regexp="yes"/>
        <token inflected="yes">file</token>
        <token min="0"/>
        <token regexp="yes">motion|motions</token>
        <token min="0">seeking</token>
        <token>to</token>
    </antipattern>
    <pattern>
        <token inflected="yes">file<exception>filing</exception></token>
        <or>
            <token min="0" postag="PRP$"/>
            <token>a</token>
        </or>
        <token regexp="yes">motion|motions</token>
        <token min="0">seeking</token>
        <token>to</token>
    </pattern>
    <message>Would a stronger verb help engage the reader?|**Example** from Justice Kagan: "Lange **moved to suppress** all evidence obtained after the officer entered his garage[.]"|**Example** from Justice Kavanaugh: "Before trial, Edwards **moved to suppress** the videotaped confession on the ground that the confession was involuntary."|**Example** from Morgan Chu: "The defendants also **moved to transfer** another state court action to the state court considering the petitions."</message>
    <suggestion><match no="1" postag="(V.*)" postag_regexp="yes" postag_replace="$1">move</match> to</suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":1,"priority":"5.319","WORD":true,"OUTLOOK":true}</short>
    <example correction="moved to">The prosecution <marker>filed a motion seeking to</marker> have the victim.</example>
    <example>It was a properly filed motion seeking to overturn the election.</example>
</rule>
"""

In [4]:
# raw string, so that \2, \3, \4 and \$ reach the XML unchanged
product_rule = r"""
<rule id="BRIEFCATCH_245927502998399442504807079542143713153" name="BRIEFCATCH_CONCISENESS_3959">
    <pattern>
        <token inflected="yes">assist</token>
        <or>
            <token min="0" postag="JJ.*|PRP" postag_regexp="yes"/>
            <token>the</token>
        </or>
        <or>
            <token min="0" postag="JJ.*|PRP|PRP\$" postag_regexp="yes"/>
            <token>the</token>
        </or>
        <token postag="N.*|PRP" postag_regexp="yes">
            <exception>all</exception>
            <exception>are</exception>
            <exception>being</exception>
            <exception>beliefs</exception>
            <exception>but</exception>
            <exception>by</exception>
            <exception>can</exception>
            <exception>circuit</exception>
            <exception>clear</exception>
            <exception>concerning</exception>
            <exception>concerns</exception>
            <exception>dissent</exception>
            <exception>does</exception>
            <exception>due</exception>
            <exception>even</exception>
            <exception>fails</exception>
            <exception>find</exception>
            <exception>finds</exception>
            <exception>get</exception>
            <exception>given</exception>
            <exception>having</exception>
            <exception>his</exception>
            <exception>hold</exception>
            <exception>holds</exception>
            <exception>if</exception>
            <exception>in</exception>
            <exception>left</exception>
            <exception>like</exception>
            <exception>likes</exception>
            <exception>long</exception>
            <exception>make</exception>
            <exception>makes</exception>
            <exception>may</exception>
            <exception>might</exception>
            <exception>must</exception>
            <exception>no</exception>
            <exception>note</exception>
            <exception>one</exception>
            <exception>or</exception>
            <exception>other</exception>
            <exception>prior</exception>
            <exception>regarding</exception>
            <exception>see</exception>
            <exception>then</exception>
            <exception>try</exception>
            <exception>will</exception>
        </token>
        <token>in</token>
        <token postag="VBG">
            <exception>regarding</exception>
            <exception>concerning</exception>
            <exception>pending</exception>
            <exception>following</exception>
            <exception>standing</exception>
            <exception>helping</exception>
            <exception>neighboring</exception>
            <exception>neighbouring</exception>
        </token>
    </pattern>
    <message>Would using fewer words help sharpen the point?|**Example** from Justice Sotomayor: “The Affordable Care Act did this by, among other things, providing tax credits to **help people buy** insurance and establishing online marketplaces where insurers could sell plans.”</message>
    <suggestion><match no="1" postag="(V.*)" postag_regexp="yes" postag_replace="$1">help</match> \2 \3 \4 <match no="6" postag="V.*" postag_regexp="yes" postag_replace="VB"/></suggestion>
    <short>{"ruleGroup":null,"ruleGroupIdx":0,"isConsistency":false,"isStyle":true,"correctionCount":1,"priority":"6.3201","WORD":true,"OUTLOOK":true}</short>
    <example correction="Help his brother find"><marker>Assist his brother in finding</marker> an apartment.</example>
</rule>
"""
# split_rule_by_or_operands(rule)

In [5]:
from typing import List, Optional
import re


def extract_or_tag(rule_xml: str) -> Optional[str]:
    """Return the first <or>...</or> block in the rule, or None if there is none."""
    or_match = re.search(r"(<or>.*?</or>)", rule_xml, re.DOTALL)
    if not or_match:
        return None
    return or_match.group(1)


def extract_operands(or_input_string: str) -> List[str]:
    """Return every <token> element inside an <or> block, in document order."""
    # [^>]* stops the self-closing alternative from running past an opening tag
    token_pattern = r"(<token\b[^>]*/>|<token\b[^>]*>.*?</token>)"
    return re.findall(token_pattern, or_input_string, re.DOTALL)


def split_rule_by_or_operands(input_rule: str) -> List[str]:
    """
    Split a rule into one rule per operand of its first <or> tag.

    TODO: currently does not handle case where rule has two or tags.
    """
    or_content = extract_or_tag(input_rule)
    if not or_content:
        # nothing to split; still return a list so callers always get one type
        return [input_rule]
    operand_list = extract_operands(or_content)

    # partition cuts at the first occurrence only, so no trailing text is dropped
    before, _, after = input_rule.partition(or_content)
    return [f"{before}{operand_str}{after}" for operand_str in operand_list]


# split on the <or> operands and show the first of the resulting rules
split_rule_by_or_operands(simple_rule)[0]

'\n<rule id="BRIEFCATCH_331448315792705843437979608685430062094" name="BRIEFCATCH_PUNCHINESS_1872">\n    <antipattern>\n        <token postag="RB.*" postag_regexp="yes"/>\n        <token inflected="yes">file</token>\n        <token min="0"/>\n        <token regexp="yes">motion|motions</token>\n        <token min="0">seeking</token>\n        <token>to</token>\n    </antipattern>\n    <pattern>\n        <token inflected="yes">file<exception>filing</exception></token>\n        <token min="0" postag="PRP$"/>\n        <token regexp="yes">motion|motions</token>\n        <token min="0">seeking</token>\n        <token>to</token>\n    </pattern>\n    <message>Would a stronger verb help engage the reader?|**Example** from Justice Kagan: "Lange **moved to suppress** all evidence obtained after the officer entered his garage[.]"|**Example** from Justice Kavanaugh: "Before trial, Edwards **moved to suppress** the videotaped confession on the ground that the confession was involuntary."|**Example** 
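The TODO in `split_rule_by_or_operands` leaves rules with more than one `<or>` tag unhandled. One possible approach, sketched here as an assumption rather than the planned design, is to take the Cartesian product of the operands of every `<or>` block. The name `split_rule_by_all_or_operands` is hypothetical, and nested `<or>` tags are not supported:

```python
import itertools
import re
from typing import List

OR_BLOCK = re.compile(r"<or>.*?</or>", re.DOTALL)
TOKEN = re.compile(r"<token\b[^>]*/>|<token\b[^>]*>.*?</token>", re.DOTALL)


def split_rule_by_all_or_operands(rule_xml: str) -> List[str]:
    """Return one rule per combination of operands across all <or> blocks."""
    or_blocks = list(OR_BLOCK.finditer(rule_xml))
    if not or_blocks:
        return [rule_xml]

    # text that sits before each <or> block, plus whatever follows the last one
    pieces = []
    cursor = 0
    for block in or_blocks:
        pieces.append(rule_xml[cursor : block.start()])
        cursor = block.end()
    tail = rule_xml[cursor:]

    operand_lists = [TOKEN.findall(block.group(0)) for block in or_blocks]
    split_rules = []
    for combination in itertools.product(*operand_lists):
        body = "".join(
            piece + operand for piece, operand in zip(pieces, combination)
        )
        split_rules.append(body + tail)
    return split_rules
```

A rule with two `<or>` blocks of two operands each yields four rules, so the count grows quickly. The sketch also treats an `<or>` inside an `<antipattern>` like any other, which may not be what is wanted.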

# split rule that is too broad


## algorithm

- input: `input_rule`
  - assume `input_rule` $\in \{\text{broad\_rules}\}$
- assemble dynamic prompt conditioned on rule
  - POS present in `input_rule`
  - maybe regex, if present in `input_rule`
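The two conditions above could be detected along these lines. This is a rough sketch, not the project's `get_pos_tag_dicts_from_rule` or `rule_has_regex` helpers (whose behaviour is not shown here). The names `collect_pos_tags` and `uses_regex` are hypothetical, and only exact tag names are matched, so a regex value such as `JJ.*` is ignored:

```python
import re
from typing import Dict


def collect_pos_tags(rule_xml: str, pos_descriptions: Dict[str, str]) -> Dict[str, str]:
    """Map each known POS tag named in a `postag` attribute to its description."""
    found = {}
    # \b keeps postag_regexp and postag_replace from matching
    for value in re.findall(r'\bpostag="([^"]*)"', rule_xml):
        for tag in value.split("|"):
            if tag in pos_descriptions:
                found[tag] = pos_descriptions[tag]
    return found


def uses_regex(rule_xml: str) -> bool:
    """True if any token text is a regex (regexp="yes"); postag_regexp is ignored."""
    return re.search(r'\bregexp="yes"', rule_xml) is not None
```

The POS descriptions found this way would fill the part-of-speech section of the prompt, and `uses_regex` would decide whether the regex instructions are included.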


In [6]:
input_rule = product_rule
user_considerations = ""

In [7]:
import json
from utils.dynamic_prompting import get_pos_tag_dicts_from_rule, POS_MAPS
from domain.dynamic_prompting.prompt_leggo import (
    GENERAL_INSTRUCTIONS_PROMPT,
    SPLITTING_FEWSHOT_PROMPT,
    REGEX_INSTRUCTIONS_PROMPT,
)

In [8]:
# grab part of speech tag from rule
pos_tags_input_rule = get_pos_tag_dicts_from_rule(input_rule, list(POS_MAPS.keys()))
# NOTE: prompt has the following POStags, including them manually here
pos_tags_in_prompt = {
    "VB": "VB Verb, base form: eat, jump, believe, be, have",
    "VBD": "VBD Verb, past tense: ate, jumped, believed",
    "VBG": "VBG Verb, gerund/present participle: eating, jumping, believing",
    "VBN": "VBN Verb, past participle: eaten, jumped, believed",
    "VBP": "VBP Verb, non-3rd ps. sing. present: eat, jump, believe, am (as in 'I am'), are",
    "VBZ": "VBZ Verb, 3rd ps. sing. present: eats, jumps, believes, is, has",
}
all_pos = {**pos_tags_input_rule, **pos_tags_in_prompt}
_replace_pos = "\n".join(all_pos.values())
print(_replace_pos)

VBG Verb, gerund/present participle: eating, jumping, believing
VB Verb, base form: eat, jump, believe, be, have
VBD Verb, past tense: ate, jumped, believed
VBN Verb, past participle: eaten, jumped, believed
VBP Verb, non-3rd ps. sing. present: eat, jump, believe, am (as in 'I am'), are
VBZ Verb, 3rd ps. sing. present: eats, jumps, believes, is, has


In [9]:
from utils.dynamic_prompting import rule_has_regex

if rule_has_regex(input_rule):
    _replace_regex = REGEX_INSTRUCTIONS_PROMPT
else:
    _replace_regex = ""

In [10]:
_replace_general_instruction = GENERAL_INSTRUCTIONS_PROMPT.format(
    part_of_speech=_replace_pos,
    regex_rules=_replace_regex,
)
_replace_task_instruction = """
You are a language system used for modifying grammatical logic encoded as XML rules.
The user will provide you with (i) a rule that is deemed too broad (ii) some additional considerations. 
Your task is to split (i) the rule that is too broad while taking into account (ii) the provided additional considerations. 
Below I will provide you with some additional context and at the bottom of this message is an example of a rule being split.
"""

final_prompt_template = (
    """
{task_instruction}

{general_instructions}

{splitting_fewshot}

"""
    + """ Respond in JSON format. Your output should contain two fields (`rule_1` and `rule_2`) with the split rules."""
)

system_prompt = final_prompt_template.format(
    task_instruction=_replace_task_instruction,
    general_instructions=_replace_general_instruction,
    splitting_fewshot=SPLITTING_FEWSHOT_PROMPT,
)

In [11]:
user_prompt = json.dumps(
    {
        "rule_deemed_too_broad": input_rule,
        "additional_considerations": user_considerations,
    },
)

# TODO: parse gpt's response for two rule xmls
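One way the TODO above might be approached, assuming the model is expected to return each split rule as an XML string under `rule_1` and `rule_2`. The function name `parse_split_response` is hypothetical:

```python
import json
import xml.etree.ElementTree as ET
from typing import Tuple


def parse_split_response(raw_response: str) -> Tuple[str, str]:
    """Extract and sanity-check the two rule XML strings from a JSON response."""
    payload = json.loads(raw_response)
    if not isinstance(payload, dict):
        raise ValueError("response is not a JSON object")
    rules = []
    for key in ("rule_1", "rule_2"):
        value = payload.get(key)
        if not isinstance(value, str):
            raise ValueError(f"{key} is missing or is not an XML string")
        try:
            root = ET.fromstring(value.strip())
        except ET.ParseError as err:
            raise ValueError(f"{key} is not well-formed XML") from err
        if root.tag != "rule":
            raise ValueError(f"{key} has root <{root.tag}>, expected <rule>")
        rules.append(value.strip())
    return rules[0], rules[1]
```

Rejecting non-string values matters because a model asked for JSON can return the rule as a nested object instead of XML text.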


In [12]:
from typing import Dict, List, Tuple
import json
import openai

# close the file handle once the pricing table is loaded
with open("pricing.json") as pricing_file:
    pricing = json.load(pricing_file)


def compute_cost(usage: Dict[str, int], model: str) -> float:
    prices = pricing[model]
    return (
        prices["prompt"] * usage["input_tokens"] / 1000
        + prices["completion"] * usage["output_tokens"] / 1000
    )


def call_gpt(
    model: str, messages: List, temperature: float = 0.7, max_length: int = 256
) -> Tuple[str, Dict]:
    """
    Generic function to call GPT4 with specified messages
    """
    response = openai.ChatCompletion.create(
        response_format={"type": "json_object"},
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_length,
        frequency_penalty=0.0,
        top_p=1,
    )
    usage = {
        "input_tokens": response["usage"]["prompt_tokens"],
        "output_tokens": response["usage"]["completion_tokens"],
    }
    usage["cost"] = compute_cost(usage, model)

    return response["choices"][0]["message"]["content"], usage
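`compute_cost` implies that `pricing.json` maps each model name to prices per 1,000 tokens under the keys `prompt` and `completion`. A small sketch of that shape follows. The model name and prices are placeholders chosen for easy arithmetic, not real rates, and `estimate_cost` is a hypothetical stand-in that takes the table as an argument:

```python
from typing import Dict

# placeholder prices in dollars per 1,000 tokens; NOT real rates
example_pricing: Dict[str, Dict[str, float]] = {
    "example-model": {"prompt": 0.01, "completion": 0.03},
}


def estimate_cost(
    usage: Dict[str, int], model: str, pricing_table: Dict[str, Dict[str, float]]
) -> float:
    """Same formula as compute_cost, with the pricing table passed in."""
    prices = pricing_table[model]
    return (
        prices["prompt"] * usage["input_tokens"]
        + prices["completion"] * usage["output_tokens"]
    ) / 1000


# 1,000 prompt tokens and 500 completion tokens come to roughly 0.025 dollars
cost = estimate_cost(
    {"input_tokens": 1000, "output_tokens": 500}, "example-model", example_pricing
)
```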

In [21]:
import json
import os
import openai
from utils.utils import generate_simple_message

# read the key from the environment rather than committing it to the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]
openai_resp = call_gpt(
    messages=generate_simple_message(system_prompt, user_prompt),
    model="gpt-4-0125-preview",
    max_length=4096,
)
split_rules_dict = json.loads(openai_resp[0])

In [26]:
split_rules_dict

{'rule_1': {'id': 'BRIEFCATCH_245927502998399442504807079542143713153.1',
  'name': 'BRIEFCATCH_CONCISENESS_3959.1',
  'pattern': [{'token': {'inflected': 'yes', 'text': 'assist'}},
   {'or': [{'token': {'min': '0',
       'postag': 'JJ.*|PRP',
       'postag_regexp': 'yes'}},
     {'token': {'text': 'the'}}]},
   {'or': [{'token': {'min': '0',
       'postag': 'JJ.*|PRP|PRP$',
       'postag_regexp': 'yes'}},
     {'token': {'text': 'the'}}]},
   {'token': {'postag': 'N.*|PRP',
     'postag_regexp': 'yes',
     'exception': ['all',
      'are',
      'being',
      'beliefs',
      'but',
      'by',
      'can',
      'circuit',
      'clear',
      'concerning',
      'concerns',
      'dissent',
      'does',
      'due',
      'even',
      'fails',
      'find',
      'finds',
      'get',
      'given',
      'having',
      'his',
      'hold',
      'holds',
      'if',
      'in',
      'left',
      'like',
      'likes',
      'long',
      'make',
      'makes',
      'may