# Goals

This notebook downloads and examines synthetic datasets of BAD and GOOD sentences that were generated by applying style-guide rules (and some prompts that I played around with in previous notebooks).

When I first started this project, I thought that generating (bad, good) sentence pairs was important for eventually training a model to improve the rule.

However, in other notebooks, I move towards generating just BAD sentences (i.e., those that violate the rule).

## Download data from the server

(This part isn't relevant for the undergrads, who don't have access to this server.)

In [364]:
import ast
import json
from tqdm.auto import tqdm
import pandas as pd
import re
import sys
sys.path.insert(0, '../tasks/generate_synthetic_example_data/')
import run_open_source_models as r
import warnings
warnings.simplefilter(action='ignore', category=SyntaxWarning)

tok, gpt_4_model,_ = r.get_model('gpt-4-turbo', cache_dir=None)

In [250]:
input_data_df = pd.read_json('../corpora/chicago-style-guide/all-rules.jsonl', lines=True)

In [None]:
proj_dir = '/project/jonmay_231/spangher/Projects/style-guides'
# ! ssh end "ls -p $proj_dir/tasks/generate_synthetic_example_data/ | grep -v / | xargs -I {} ls {}"
! ssh end "ls -lh $proj_dir/tasks/generate_synthetic_example_data/ | grep jsonl"

In [None]:
! scp end:$proj_dir/tasks/generate_synthetic_example_data/mixtral_zeroshot.jsonl ../tasks/generate_synthetic_example_data/
! scp end:$proj_dir/tasks/generate_synthetic_example_data/gpt-4-turbo_zeroshot.jsonl ../tasks/generate_synthetic_example_data/
! scp end:$proj_dir/tasks/generate_synthetic_example_data/gpt-3.5-turbo_zeroshot.jsonl ../tasks/generate_synthetic_example_data/
! scp end:$proj_dir/tasks/generate_synthetic_example_data/command-r_zeroshot.jsonl ../tasks/generate_synthetic_example_data/

# Read these results

In [620]:
mixtral_data_df = pd.read_json('../tasks/generate_synthetic_example_data/mixtral_zeroshot.jsonl', lines=True)
command_r_data_df = pd.read_json('../tasks/generate_synthetic_example_data/command-r_zeroshot.jsonl', lines=True)
gpt4_data_df = pd.read_json('../tasks/generate_synthetic_example_data/gpt-4-turbo_zeroshot.jsonl', lines=True)
gpt3_data_df = pd.read_json('../tasks/generate_synthetic_example_data/gpt-3.5-turbo_zeroshot.jsonl', lines=True)

In [621]:
if 'parsed_output' not in mixtral_data_df.columns:
    parsed, failed = mixtral_data_df.pipe(parse_data, return_failed=True, verbose=True)
    mixtral_data_df['parsed_output'] = parsed

if 'parsed_output' not in gpt4_data_df.columns:
    parsed, failed = gpt4_data_df.pipe(parse_data, return_failed=True, verbose=True)
    gpt4_data_df['parsed_output'] = parsed

if 'parsed_output' not in command_r_data_df.columns:
    parsed, failed = command_r_data_df.pipe(parse_data, return_failed=True, verbose=True)
    command_r_data_df['parsed_output'] = parsed

  0%|          | 0/1093 [00:00<?, ?it/s]

  0%|          | 0/361 [00:00<?, ?it/s]
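`parse_data` is not defined in this part of the notebook. As a rough illustration only, the sketch below shows one way a single raw model response could be turned into a list of example dicts. The function name `parse_model_output` and the assumption that responses contain a JSON-like list (possibly wrapped in extra prose) are mine, not taken from the project code.

```python
import ast
import json
import re


def parse_model_output(text):
    """Return the list of example dicts found in `text`, or None if nothing parses."""
    # Grab the outermost [...] span, since models often wrap the list in extra prose.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        return None
    candidate = match.group(0)
    # Try strict JSON first, then fall back to Python literal syntax (single quotes).
    for loader in (json.loads, ast.literal_eval):
        try:
            parsed = loader(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, list):
            return parsed
    return None


raw = 'Here you go:\n[{"bad sentence": "a", "good sentence": "b"}]'
print(parse_model_output(raw))
```

Returning `None` on failure keeps unparseable rows easy to count separately from rows the model explicitly marked as unclear.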

In [622]:
input_data_df_w_all_annotations = (
    input_data_df
    ## gpt 4
         .merge(gpt4_data_df.drop(columns=['message']).rename(columns={'key': 'url'}))
         .assign(was_unclear_gpt4=lambda df: df['parsed_output'] == WAS_UNCLEAR_STR)
         .assign(parsed_output=lambda df: df['parsed_output'].apply(lambda x: x if x != WAS_UNCLEAR_STR else None))
         .rename(columns={'parsed_output': 'parsed_gpt4'})
    ##  mixtral
         .merge(mixtral_data_df.drop(columns=['message']).rename(columns={'key': 'url'}), how='left')
         .assign(was_unclear_mixtral=lambda df: df['parsed_output'] == WAS_UNCLEAR_STR)
         .assign(parsed_output=lambda df: df['parsed_output'].apply(lambda x: x if x != WAS_UNCLEAR_STR else None))
         .rename(columns={'parsed_output': 'parsed_mixtral'})
    ##  command-r
         .merge(command_r_data_df.drop(columns=['message']).rename(columns={'key': 'url'}), how='left')
         .assign(was_unclear_command_r=lambda df: df['parsed_output'] == WAS_UNCLEAR_STR)
         .assign(parsed_output=lambda df: df['parsed_output'].apply(lambda x: x if x != WAS_UNCLEAR_STR else None))
         .rename(columns={'parsed_output': 'parsed_command_r'})
)

In [623]:
input_data_df_w_all_annotations.head(2)

| | url | section_number | title | hierarchy | content | parsed_gpt4 | was_unclear_gpt4 | parsed_mixtral | was_unclear_mixtral | parsed_command_r | was_unclear_command_r |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /book/ed17/part2/ch05/psec001.html | 5.1 | The field of grammar | [{'href': '/book/ed17/part2/ch05/toc.html', 't... | In its usual sense, grammar is the set of rule... | | True | | False | [{'bad sentence': 'The city's mayor has not ye... | False |
| 1 | /book/ed17/part2/ch05/psec002.html | 5.2 | Schools of grammatical thought | [{'href': '/book/ed17/part2/ch05/toc.html', 't... | There are many schools of grammatical thought—... | | True | [{'bad sentence': 'A journalist should never s... | False | [{'bad sentence': 'The city's mayor has not ye... | False |


How often does each language model flag a rule as unclear? This matters because many style-guide rules are vague.

In [624]:
input_data_df_w_all_annotations.filter(like='was_unclear').mean()

was_unclear_gpt4         0.174748
was_unclear_mixtral      0.000000
was_unclear_command_r    0.000000
dtype: float64
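One caveat when reading these rates: after a left merge, rules with no model output at all have a missing `parsed_output`, and comparing a missing value to the sentinel string yields `False`. The toy sketch below illustrates this. The `UNCLEAR` value and the tiny frames are made up for illustration and are not the project's data.

```python
import pandas as pd

UNCLEAR = "UNCLEAR"  # hypothetical sentinel standing in for WAS_UNCLEAR_STR

rules = pd.DataFrame({"url": ["a", "b", "c"]})
# Rule "c" has no model output at all.
model_out = pd.DataFrame({"url": ["a", "b"], "parsed_output": [UNCLEAR, "parsed examples"]})

merged = rules.merge(model_out, how="left")
was_unclear = merged["parsed_output"] == UNCLEAR  # missing rows compare as False
is_missing = merged["parsed_output"].isna()

print(was_unclear.mean())               # rate over all rules, missing counted as "clear"
print(was_unclear[~is_missing].mean())  # rate over rules that actually have output
```

Reporting the share of missing rows next to the unclear rate makes it easier to tell "never flagged as unclear" apart from "never produced output".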

# See if the rules are correctly applied

Here, I go through and manually inspect the data.

In [625]:
t = (
    input_data_df_w_all_annotations
        .loc[lambda df: df['was_unclear_gpt4']]
        [['title', 'content', 'parsed_mixtral', 'parsed_command_r']]
        .iloc[1]
)

print(json.dumps(t.to_dict(), indent=4))

{
    "title": "Schools of grammatical thought",
    "content": "There are many schools of grammatical thought\u2014and differing vocabularies for describing grammar. Grammatical theories have been in upheaval in recent years. It seems that the more we learn, the less we know. As the illustrious editor in chief of the Oxford English Dictionary wrote in 1991: \u201cAn entirely adequate description of English grammar is still a distant target and at present seemingly an unreachable one, the complications being what they are.\u201d1 In fact, the more detailed the grammar (it can run to many large volumes), the less likely it is to be of any practical use to most writers and speakers.\n\n\n1. Robert W. Burchfield, Unlocking the English Language (New York: Hill and Wang, 1991), 22.",
    "parsed_mixtral": [
        {
            "bad sentence": "A journalist should never split an infinitive, as it is a cardinal sin of grammar.",
            "good sentence": "While some people believe that s

In [469]:
rule_title = t['title']
rule = t['content']

In [484]:
idx = 2
b, g = t['parsed_mixtral'][idx]['bad sentence'], t['parsed_mixtral'][idx]['good sentence']

In [679]:
import prompts

In [489]:
messages = [
    {
        "role": "system",
        "content": "You are a helpful editor who checks other editors."
    },
    {
        'role': 'user',
        'content': prompts.ZERO_SHOT_SINGLE_PAIRWISE_SEMANTIC_CHECK.format(rule_title=rule_title, rule=rule, b=b, g=g)
    }
]
    
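# NOTE: `gpt_35_model` is not created anywhere in this notebook as shown. It was presumably
# loaded the same way as `gpt_4_model` above, e.g. (the model name here is an assumption):
# _, gpt_35_model, _ = r.get_model('gpt-3.5-turbo', cache_dir=None)
output = gpt_35_model(messages=messages, logprobs=True, top_logprobs=5)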

In [490]:
output.choices[0]

Choice(finish_reason='stop', index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token='yes', bytes=[121, 101, 115], logprob=-0.4398991, top_logprobs=[TopLogprob(token='yes', bytes=[121, 101, 115], logprob=-0.4398991), TopLogprob(token='Yes', bytes=[89, 101, 115], logprob=-1.1569077), TopLogprob(token='No', bytes=[78, 111], logprob=-3.9005122), TopLogprob(token='no', bytes=[110, 111], logprob=-4.2488647), TopLogprob(token='Unc', bytes=[85, 110, 99], logprob=-5.7861056)])]), message=ChatCompletionMessage(content='yes', role='assistant', function_call=None, tool_calls=None))
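The top log-probabilities above split the same answer across case variants ("yes" / "Yes", "No" / "no"). A minimal sketch of how they could be collapsed into one probability per label, using the values copied from this output. The helper name `label_probabilities` and the choice to renormalise over only the top-5 candidates are my assumptions.

```python
import math
from collections import defaultdict

# Top-5 first-token candidates copied from the output above.
top_logprobs = [
    ("yes", -0.4398991),
    ("Yes", -1.1569077),
    ("No", -3.9005122),
    ("no", -4.2488647),
    ("Unc", -5.7861056),
]


def label_probabilities(candidates):
    """Sum probability mass per lower-cased label, then renormalise over the candidates."""
    mass = defaultdict(float)
    for token, logprob in candidates:
        label = token.strip().lower()
        if label not in ("yes", "no"):
            label = "other"  # e.g. the "Unc" prefix of "Unclear"
        mass[label] += math.exp(logprob)
    total = sum(mass.values())
    return {label: p / total for label, p in mass.items()}


probs = label_probabilities(top_logprobs)
print(probs)
```

For this example almost all of the mass lands on "yes", which gives a soft confidence score rather than just the sampled token.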

In [492]:
print(prompts.ZERO_SHOT_SINGLE_PAIRWISE_SEMANTIC_CHECK.format(rule_title=rule_title, rule=rule, b=b, g=g))


Please check this work. I followed this editing rule:

```Schools of grammatical thought

There are many schools of grammatical thought—and differing vocabularies for describing grammar. Grammatical theories have been in upheaval in recent years. It seems that the more we learn, the less we know. As the illustrious editor in chief of the Oxford English Dictionary wrote in 1991: “An entirely adequate description of English grammar is still a distant target and at present seemingly an unreachable one, the complications being what they are.”1 In fact, the more detailed the grammar (it can run to many large volumes), the less likely it is to be of any practical use to most writers and speakers.


1. Robert W. Burchfield, Unlocking the English Language (New York: Hill and Wang, 1991), 22.```

To edit this bad sentence:
```The passive voice is always weaker than the active voice, and should be avoided in writing.```

Into this good one: 
```While the active voice can often make writing more

In [677]:
all_synthetic_data = (
    input_data_df_w_all_annotations
        [['url', 'title', 'content', 'parsed_gpt4', 'parsed_mixtral', 'parsed_command_r']]
        .set_index(['url', 'title', 'content'])
        .stack()
        .reset_index()
        .rename(columns={'level_3': 'algorithm', 0: 'synthetic_data'})
        .explode('synthetic_data')
        .reset_index(drop=True)
        .loc[lambda df: df['synthetic_data'].str.get('bad sentence') != df['synthetic_data'].str.get('good sentence')]
)

all_single_prompts = (
    all_synthetic_data
        .apply(
            lambda x: prompts.ZERO_SHOT_SINGLE_PAIRWISE_SEMANTIC_CHECK.format(
                rule_title=x['title'],
                rule=x['content'],
                b=x['synthetic_data']['bad sentence'],
                g=x['synthetic_data']['good sentence']
            ),
            axis=1)
)

In [678]:
all_synthetic_data.head(2)

| | url | title | content | algorithm | synthetic_data |
|---|---|---|---|---|---|
| 0 | /book/ed17/part2/ch05/psec001.html | The field of grammar | In its usual sense, grammar is the set of rule... | parsed_command_r | {'bad sentence': 'The city's mayor has not yet... |
| 1 | /book/ed17/part2/ch05/psec001.html | The field of grammar | In its usual sense, grammar is the set of rule... | parsed_command_r | {'bad sentence': 'The new legislation will hop... |


In [648]:
idx = 100
k = 10
grouped_synthetic_data = (
    all_synthetic_data
        .groupby(['url', 'title', 'algorithm', 'content'])
        ['synthetic_data']
        .aggregate(list)
)
batch = grouped_synthetic_data.iloc[idx][:k]
batch = [f'OLD: "{x["bad sentence"]}" -> NEW: "{x["good sentence"]}"' for x in batch]
batch = '\n'.join(f"{i}. {line}" for i, line in enumerate(batch, start=1))
url, title, alg, rule = grouped_synthetic_data.index[idx]

In [650]:
messages = [
    {
        "role": "system",
        "content": "You are a helpful editor who checks other editors."
    },
    {
        'role': 'user',
        'content': multi_example_prompt.format(rule_title=title, rule=rule, k=k, sentences=batch)
    }
]
    
output = gpt_4_model(messages=messages, logprobs=True, top_logprobs=5)

In [656]:
grouped_synthetic_data.index[0]

('/book/ed17/part2/ch05/psec001.html',
 'The field of grammar',
 'parsed_command_r',
 'In its usual sense, grammar is the set of rules governing how words are put together in sentences to communicate ideas—or the study of these rules. Native speakers of a language learn them unconsciously. The rules govern most constructions in any given language. The small minority of constructions that lie outside these rules fall mostly into the category of idiom and customary usage.')

In [662]:
all_messages = []
for (url, title, alg, rule), examples in grouped_synthetic_data.items():
    for batch_idx, batch in enumerate(batchify(examples, n=k)):
        batch = [f'OLD: "{x["bad sentence"]}" -> NEW: "{x["good sentence"]}"' for x in batch]
        # Number sentences from 1 (continuing across batches), matching the single-batch cell above.
        batch = '\n'.join(f"{i}. {line}" for i, line in enumerate(batch, start=k * batch_idx + 1))

        messages = {
            'url': url,
            'algorithm': alg,
            'messages': [
                {
                    "role": "system",
                    "content": "You are a helpful editor who checks other editors."
                }, {
                    'role': 'user',
                    'content': multi_example_prompt.format(rule_title=title, rule=rule, k=k, sentences=batch)
                }
            ]
        }
        all_messages.append(messages)
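`batchify` is used in the loop above but is not defined in this part of the notebook. A minimal sketch of a helper with the same call shape (`batchify(examples, n=k)`), assuming it simply splits a list into consecutive chunks of at most `n` items. The project's real helper may differ.

```python
def batchify(items, n):
    """Yield consecutive slices of `items`, each with at most `n` elements."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


print(list(batchify(list(range(5)), n=2)))
```

Note that the final batch can be shorter than `n`, so a prompt that states the number of sentences may want `len(batch)` rather than `k`.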

In [675]:
output = gpt_4_model(messages=all_messages[50]['messages'])

In [686]:
print(output.choices[0].message.content)

1. no
2. no
3. no
4. no
5. no
6. no
7. no
8. no
9. no
10. no
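To use responses like the one above downstream, the numbered lines need to be mapped back to the sentences in the batch. A small sketch, assuming the model sticks to the `N. answer` format shown here. The function name `parse_numbered_answers` is hypothetical.

```python
import re


def parse_numbered_answers(text):
    """Map each item number to a lower-cased answer string, e.g. {1: 'no', 2: 'yes'}."""
    answers = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(\w+)", line)
        if match:
            answers[int(match.group(1))] = match.group(2).lower()
    return answers


print(parse_numbered_answers("1. no\n2. Yes\n3. unclear"))
```

Keying by the item number rather than by position means a skipped or reordered line in the response does not silently shift every later answer.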


In [658]:
tqdm.pandas()  # register `.progress_apply` on pandas objects
(
    pd.Series(all_messages)
     .str.get('messages')
     .pipe(lambda s: s.str.get(0).str.get('content') + ' ' + s.str.get(1).str.get('content'))
     .progress_apply(tok.encode)
     .str.len()
     .sum()
)

  0%|          | 0/2214 [00:00<?, ?it/s]

1701302

In [642]:
examples

[{'bad sentence': "The president spoke about the economy's recovery, stating, 'We have seen a remarkable turnaround in the past year.' (John Doe, Economic Trends, 2023, p. 58).",
  'good sentence': "The president spoke about the economy's recovery, stating, 'We have seen a remarkable turnaround in the past year.' (John Doe, Economic Trends [2023], 58)"},
 {'bad sentence': "In her latest book, the author describes the city as 'a vibrant hub that never sleeps.' (Jane Smith, Cityscapes, 2023, p. 112).",
  'good sentence': "In her latest book, the author describes the city as 'a vibrant hub that never sleeps.' (Jane Smith, Cityscapes [2023], 112)"},
 {'bad sentence': "The study found that 'more than 50% of respondents prefer flexible work options.' (Research Institute, Work Trends, 2023, p. 204).",
  'good sentence': "The study found that 'more than 50% of respondents prefer flexible work options.' (Research Institute, Work Trends [2023], 204)"},
 {'bad sentence': "As the climate conferenc