# Goals

The goal of this notebook is to improve the low-quality outputs that the language models generate.

While spot-checking outputs from earlier notebooks, I noticed that many of them were poor. So I decided to apply a prompting technique known as Chain of Thought (https://arxiv.org/abs/2201.11903) to this problem.

It worked well, so I continued to use it in later approaches.

# Read in the style guide rules

In [10]:
import pandas as pd
import glob

In [14]:
# Each entry is (path, outlet name, file format). Despite its name, this is a list of tuples.
file_dict = [
    ('../corpora/mother_jones/mother_jones_parsed_rules.csv', 'mother_jones', 'csv'),
    ('../corpora/buzzfeed/parsed-rules-df.csv', 'buzzfeed', 'csv'),
    ('../corpora/guardian/parsed-rules-df.csv', 'guardian', 'csv'),
    ('../corpora/chicago-style-guide/all-rules.jsonl', 'chicago', 'jsonl')
]

In [37]:
all_dfs = []
for fname, outlet_name, read_type in file_dict:
    if read_type == 'csv':
        df = pd.read_csv(fname, index_col=0)
    else:
        # The Chicago rules are JSONL: rename columns to match the CSV schema and
        # flatten the hierarchy (a list of {'text': ...} dicts) into 'A -> B -> C'.
        df = (
            pd.read_json(fname, lines=True)
            .rename(columns={
                'hierarchy': 'rule_hierarchy',
                'content': 'rule_text',
                'title': 'rule_title',
            })
            .assign(rule_hierarchy=lambda df: df['rule_hierarchy'].apply(
                lambda levels: ' -> '.join(level['text'] for level in levels)
            ))
        )
    # Tag each rule with its outlet and a 1-based id such as 'chicago_1'.
    df = df.assign(outlet_name=outlet_name).reset_index(drop=True)
    df = df.assign(idx=lambda df: outlet_name + '_' + (df.reset_index()['index'] + 1).astype(str))
    all_dfs.append(df)
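
To make the two transformations in the loop above easier to see, here is a minimal pure-Python sketch of the hierarchy flattening and the id construction. The helper names `flatten_hierarchy` and `make_rule_ids` and the example hierarchy titles are hypothetical and chosen only for illustration. The sketch assumes each hierarchy level is a dict with a `'text'` key, as the loop does.

```python
def flatten_hierarchy(levels, sep=" -> "):
    """Join the 'text' field of each hierarchy level into one breadcrumb string."""
    return sep.join(level["text"] for level in levels)


def make_rule_ids(outlet_name, n_rules):
    """Build 1-based ids such as 'chicago_1', 'chicago_2', ..."""
    return [f"{outlet_name}_{i}" for i in range(1, n_rules + 1)]


# Hypothetical record shaped like the 'hierarchy' field of one JSONL line.
example_hierarchy = [
    {"text": "Grammar and Usage"},
    {"text": "Nouns"},
    {"text": "Nouns Generally"},
]

breadcrumb = flatten_hierarchy(example_hierarchy)
rule_ids = make_rule_ids("chicago", 3)
print(breadcrumb)
print(rule_ids)
```

The breadcrumb string is what ends up in `rule_hierarchy`, and the ids are what end up in `idx`.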

In [43]:
cols = ['rule_title', 'rule_hierarchy', 'rule_text', 'outlet_name', 'idx']
all_rules_combined_df = pd.concat(all_dfs)[cols].reset_index(drop=True)

In [44]:
all_rules_combined_df['outlet_name'].value_counts()

outlet_name
guardian        4732
buzzfeed        1121
chicago         1093
mother_jones     283
Name: count, dtype: int64

# Experiment with Different Prompting Techniques

## Try a more chain-of-thought approach

In [47]:
import sys
sys.path.insert(0, '../tasks/generate_synthetic_example_data/')
import model_util

In [63]:
tok, gpt4_model, _ = model_util.get_model('gpt-4-turbo', cache_dir=None)

In [87]:
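# Pick one Chicago rule to experiment with. Note that the hierarchy breadcrumb
# (rule_hierarchy), not the rule_title column, is used as the title in the prompt.
rule_title, rule_text = (
    all_rules_combined_df
    .loc[lambda df: df['outlet_name'] == 'chicago']
    [['rule_hierarchy', 'rule_text']]
    .iloc[3]
)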

In [89]:
prompt = """You are a helpful editor's assistant. I am trying to read a style guide. Here is an entry in the style guide:

Title: {rule_title}

```{rule_text}```

The entry might specify a generic grammar definition, a specific rule, a specific spelling, or something else.

I want examples of sentences that this would be applicable to. 

Think step-by-step. 
1. Is there a rule being expressed? 
2. Is this rule something that can be violated? 

If the answer to 1 and 2 is Yes, then:
3. Simplify the rule in plain language and specify ways it can be broken.

If you have done all of that, then generate 10 example sentences that violate this rule. Output them in a python list.
Make them topically sound like news-article sentences: they should be about current events, or similar news writing.
"""

messages = [
    {'role': 'system', 'content': "You are a helpful editor's assistant."},
    {
        'role': 'user',
        'content': prompt.format(
            rule_text=rule_text,
            rule_title=rule_title,
        )
    }
]
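
Because the prompt asks for the sentences "in a python list", the response has to be parsed before the examples can be used downstream. The notebook does not show how this is done, so the following is only a rough sketch under assumptions: the helper `extract_python_list` and the sample response text are hypothetical, and the approach (take the span from the first `[` to the last `]` and parse it with `ast.literal_eval`) is one simple option, not necessarily the one used in later notebooks.

```python
import ast
import re


def extract_python_list(response_text):
    """Return a list of strings parsed from the response, or None if parsing fails."""
    # Take the span from the first '[' to the last ']' in the response.
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        # literal_eval only accepts Python literals, so no model-written code is executed.
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    if isinstance(parsed, list) and all(isinstance(s, str) for s in parsed):
        return parsed
    return None


# Hypothetical response text, made up for illustration (not real model output).
sample_response = (
    "3. The rule can be broken by treating a mass noun as countable.\n\n"
    "sentences = [\n"
    '    "The senator shared several informations with reporters.",\n'
    '    "Officials in paris announced new transit rules on Monday.",\n'
    "]\n"
)

sentences = extract_python_list(sample_response)
print(sentences)
```

Returning `None` on failure makes it easy to count how often the model ignores the requested format. A response that contains other square brackets in its reasoning would also fail to parse with this simple approach.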

In [88]:
print(rule_text)

A noun is a word that names something, whether abstract (intangible) or concrete (tangible). It may be a common noun (the name of a generic class or type of person, place, thing, process, activity, or condition) or a proper noun (the name of a specific person, place, or thing—hence capitalized). A concrete noun may be a count noun (if what it names can be counted—as with horses or cars) or a mass noun (if what it names is uncountable or collective—as with information or salt).


In [90]:
output = gpt4_model(messages=messages)

In [91]:
print(output.choices[0].message.content)

The entry from the style guide provides a general definition for nouns and distinguishes between different types of nouns (common vs. proper; concrete vs. abstract; countable vs. mass). Let's analyze if there is a specific rule and whether it can be violated:

1. **Is there a rule being expressed?**
   Yes, the rule here concerns the distinctions between types of nouns and proper usage, including capitalization for proper nouns and the correct treatment of count and mass nouns.

2. **Is this rule something that can be violated?**
   Absolutely. Violations can occur if proper nouns are not capitalized, or if count nouns and mass nouns are misused (e.g., treating a mass noun like it is countable).

3. **Simplify the rule in plain language and specify ways it can be broken.**
   - **Rule**: Capitalize proper nouns. Treat common nouns according to their nature: if they refer to individual items that can be counted, use them accordingly; if they refer to uncountable substances or concepts, 