# Goals

The goal of this demo is to download the style-guide rules and use llama3, served through Ollama, to generate synthetic data.

You'll also get to see the synthetic data that I generated with these tools.

In [None]:
## get the data

# -p creates parent directories as needed and does not fail if they already exist
! mkdir -p ../demo-data/trial-synthetic-generations
! mkdir -p ../demo-data/style-guides

! wget https://storage.googleapis.com/usc-data/style-guide-data/6-1-2024__alex__synthetic-generations.tar.gz
! wget https://storage.googleapis.com/usc-data/style-guide-data/style-guides.tar.gz
! tar -xzvf 6-1-2024__alex__synthetic-generations.tar.gz -C ../demo-data/trial-synthetic-generations --strip-components=1
! tar -xzvf style-guides.tar.gz -C ../demo-data/style-guides --strip-components=1
# clean up the downloaded archives
! rm style-guides.tar.gz
! rm 6-1-2024__alex__synthetic-generations.tar.gz
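
Before moving on, it can help to confirm that the archives unpacked where the later cells expect them. The following is a minimal sketch using only the standard library. The helper name `missing_paths` is hypothetical, and the two relative paths are simply ones that later cells in this notebook read from.

```python
from pathlib import Path

def missing_paths(base_dir, relative_paths):
    """Return the relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [p for p in relative_paths if not (base / p).exists()]

# Two of the files that later cells in this notebook read.
expected = [
    "style-guides/chicago-style-guide/all-rules.jsonl",
    "trial-synthetic-generations/llama-3-70b_zeroshot_cot_bad_only.jsonl",
]

missing = missing_paths("../demo-data", expected)
print("All files found." if not missing else f"Missing: {missing}")
```

If anything is reported missing, re-running the download cell is usually the quickest fix.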

# Let's start by looking through the synthetic data I generated
(remember, each synthetic example is meant to violate only one rule)

In [None]:
import sys
sys.path.insert(0, '../tasks/generate_synthetic_example_data/')
import glob
import pandas as pd 

style_guide_dir = '../demo-data/style-guides'
synthetic_dir = '../demo-data/trial-synthetic-generations'
orig_input_file = pd.read_json(f'{style_guide_dir}/chicago-style-guide/all-rules.jsonl', lines=True)
bad_sentence_files = glob.glob(f'{synthetic_dir}/*cot*')

mixtral_data_df = pd.read_json(f'{synthetic_dir}/mixtral_zeroshot_cot_bad_only.jsonl', lines=True)
gpt4_data_df = pd.read_json(f'{synthetic_dir}/gpt-4-turbo_zeroshot_cot_bad_only.jsonl', lines=True)
llama_data_df = pd.read_json(f'{synthetic_dir}/llama-3-70b_zeroshot_cot_bad_only.jsonl', lines=True)
gpt3_data_df = pd.read_json(f'{synthetic_dir}/gpt-3.5-turbo_zeroshot_cot_bad_only.jsonl', lines=True)
command_r_data_df = pd.read_json(f'{synthetic_dir}/command-r_zeroshot_cot_bad_only.jsonl', lines=True)
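
The five `read_json` calls above follow one file-naming pattern, so they could also be collapsed into a loop. This is only a sketch: the helper `load_generations` is hypothetical, and it assumes every file is named `<model>_zeroshot_cot_bad_only.jsonl`, as in the cell above.

```python
import pandas as pd

# Model prefixes taken from the file names used in the cell above.
MODEL_NAMES = ["mixtral", "gpt-4-turbo", "llama-3-70b", "gpt-3.5-turbo", "command-r"]

def load_generations(synthetic_dir, model_names=MODEL_NAMES):
    """Load one JSON-lines file per model into a dict of dataframes keyed by model name."""
    return {
        name: pd.read_json(f"{synthetic_dir}/{name}_zeroshot_cot_bad_only.jsonl", lines=True)
        for name in model_names
    }
```

With this, `load_generations(synthetic_dir)["llama-3-70b"]` would hold the same rows as `llama_data_df`.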

This data was generated via the following prompt:

In [34]:
import prompts
print(prompts.ZERO_SHOT_COT_BAD_SYNTHETIC_DATA)

You are a helpful editor's assistant. I am trying to read a style guide. Here is an entry in the style guide:

Title: {style_guide_rule_title}

```{style_guide_rule}```

The entry might specify a generic grammar definition, a specific rule, a specific spelling, or something else.
If the rule contains just a single word, it is likely demonstrating the correct spelling of that word.

I want examples of sentences that this would be applicable to.

Let's think about this step-by-step. 
1. Is there a rule being expressed? 
2. Is this rule something that can be violated?

If the answer to 1 or 2 is No, then say "No clear rule or preference expressed." and STOP.

If the answer to 1 and 2 is Yes, then:
3. Simplify the rule in plain language and specify ways it can be broken.
(Note: if it is a spelling rule, to violate it, simply misspell the word).

Then, generate {n} example sentences that violate this rule. Don't worry about them sounding wrong.
Make them topically sound like news-article se
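
The curly-brace fields in the prompt (`{style_guide_rule_title}`, `{style_guide_rule}`, `{n}`) are filled in with Python's `str.format`. Here is a minimal sketch with a shortened, made-up template (not the real `prompts.ZERO_SHOT_COT_BAD_SYNTHETIC_DATA`) and a made-up rule; only the placeholder names match the real prompt.

```python
# A shortened stand-in for the real prompt; only the placeholder names are the same.
TOY_TEMPLATE = (
    "Title: {style_guide_rule_title}\n"
    "Rule: {style_guide_rule}\n"
    "Generate {n} example sentences that violate this rule."
)

filled = TOY_TEMPLATE.format(
    style_guide_rule_title="Serial comma",                             # made-up title
    style_guide_rule="Use a comma before the final 'and' in a list.",  # made-up rule text
    n=3,
)
print(filled)
```

Note that `str.format` only interprets braces in the template itself, so rule text that happens to contain `{` or `}` is inserted unchanged.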

Let's put it into a dataframe so we can read it more easily:

In [23]:
input_data_df_w_all_annotations = (
    orig_input_file
    # GPT-4
    .merge(gpt4_data_df.rename(columns={'key': 'url'})).rename(columns={'message': 'message_gpt4'})
    # Mixtral
    .merge(mixtral_data_df.rename(columns={'key': 'url'}), how='left').rename(columns={'message': 'message_mixtral'})
    # Command R
    .merge(command_r_data_df.rename(columns={'key': 'url'}), how='left').rename(columns={'message': 'message_command_r'})
    # Llama 3
    .merge(llama_data_df.rename(columns={'key': 'url'}), how='left').rename(columns={'message': 'message_llama'})
    .drop_duplicates('url')
)

This dataframe has the following columns:

`['url', 'section_number', 'title', 'hierarchy']` <- These are all unique IDs for the style-guide rule being used.

`'content'` <- This is the content of the style guide rule

`['message_gpt4', 'message_mixtral', 'message_command_r', 'message_llama']` <- These all refer to outputs from different language models.

For reference:

* GPT4: this is GPT-4, from OpenAI, https://openai.com/index/gpt-4/
* Mixtral: this is Mixtral-8x7b, https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
* Command R: this is command-r-v01, https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit
* Llama: this is Llama 3 70B, https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
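
The chained merge-and-rename pattern used to build this dataframe can be sketched on toy data. The column names `key`, `url`, and `message` come from the notebook; the rows are made up, and `on="url"` is spelled out here for clarity even though the original cell relies on pandas' default of joining on shared column names.

```python
import pandas as pd

# Toy stand-ins for the rules file and two model-output files.
rules = pd.DataFrame({"url": ["/a.html", "/b.html"], "content": ["Rule A", "Rule B"]})
model_one = pd.DataFrame({"key": ["/a.html", "/b.html"], "message": ["one-A", "one-B"]})
model_two = pd.DataFrame({"key": ["/a.html"], "message": ["two-A"]})  # nothing for /b.html

merged = (
    rules
    # Rename 'message' right after each merge so the next merge does not
    # bring in a second column that is also called 'message'.
    .merge(model_one.rename(columns={"key": "url"}), on="url")
    .rename(columns={"message": "message_one"})
    # how='left' keeps a rule even when this model has no output for it.
    .merge(model_two.rename(columns={"key": "url"}), on="url", how="left")
    .rename(columns={"message": "message_two"})
)
print(merged)
```

The row for `/b.html` survives with a missing value in `message_two`, which is the same behaviour the left merges give for rules that a model skipped.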

In [36]:
input_data_df_w_all_annotations.head(2)

| | url | section_number | title | hierarchy | content | message_gpt4 | message_mixtral | message_command_r | message_llama |
|---|---|---|---|---|---|---|---|---|---|
| 0 | /book/ed17/part2/ch05/psec001.html | 5.1 | The field of grammar | `[{'href': '/book/ed17/part2/ch05/toc.html', 't...` | In its usual sense, grammar is the set of rule... | `1. **Is there a rule being expressed?**\n Th...` | `No clear rule or preference expressed.\n\nThe ...` | Yes, there are rules being expressed, and they... | `Let's break down the entry step-by-step.\n\n1....` |
| 8 | /book/ed17/part2/ch05/psec002.html | 5.2 | Schools of grammatical thought | `[{'href': '/book/ed17/part2/ch05/toc.html', 't...` | There are many schools of grammatical thought—... | `1. Is there a rule being expressed?\n\nUpon re...` | `No clear rule or preference expressed.\n\nThe ...` | Yes, there is a rule being expressed, and it i... | `Let's break it down step by step.\n\n1. Is the...` |


Here is the output from one row of the dataframe. It shows the style-guide rule followed by each LLM's generation of bad (rule-violating) synthetic data for that rule:

In [24]:
d = (input_data_df_w_all_annotations
     [['content', 'message_gpt4', 'message_mixtral', 'message_command_r', 'message_llama']]
     .iloc[700]
     # .iloc[0]
     .to_dict()
)

for k, v in d.items():
    print(k)
    print('---------------------------------')
    print(v)

    print()
    print()

content
---------------------------------
Large or complex fractions are expressed as numeric decimal fractions (cf. 9.14). When a quantity equals less than 1.00, a zero normally appears before the decimal point as an aid to readability, particularly in scientific contexts and especially if quantities greater than 1.00 appear in the same context. Note that a unit of measure with a quantity of less than one is generally treated as if it were plural (see 10.65, 10.53). See also 9.55, 9.58.

a mean of 0.73
the ratio 0.85
In Cyprus, there were 0.96 females for every male in the general population; in the sixty-five-and-over age group, the number was 1.30.

In contexts where decimal quantities must be 1.00 or less, as in probabilities, batting averages, and the like, or between −1.00 and 1.00, as in correlation coefficients, a zero is typically omitted before the decimal point. For zeros with decimal points in tables, see 3.72.


p < .05

R = .10
Ty Cobb’s career batting average was .367.
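
Because the prompt tells the model to answer "No clear rule or preference expressed." when a rule cannot be violated, one simple way to triage generations is to look for that phrase. This is a sketch that assumes the model echoes the phrase more or less verbatim; models may paraphrase it, so treat the result as a rough lower bound. The helper name is hypothetical, and the example strings are the truncated previews from the table output earlier.

```python
NO_RULE_PHRASE = "No clear rule or preference expressed."

def declined_to_generate(message):
    """True if the generation contains the prompt's opt-out phrase (case-insensitive)."""
    # Missing generations from the left merges are not strings, so they count as False.
    return isinstance(message, str) and NO_RULE_PHRASE.lower() in message.lower()

examples = [
    "No clear rule or preference expressed.\n\nThe ...",  # truncated Mixtral preview
    "Let's break it down step by step.\n\n1. Is the...",  # truncated Llama preview
    None,                                                  # a rule with no generation
]
print([declined_to_generate(m) for m in examples])  # [True, False, False]
```

Applied to a column, for example `input_data_df_w_all_annotations['message_mixtral'].map(declined_to_generate).mean()`, this gives the share of rules for which that model opted out.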



# Recreating this output

How did I do this? I ran large language models on a server. Hold off on doing this yourselves for now; we're working on getting you access. In the meantime, you can experiment with prompts locally as follows.

Please download and install Ollama: https://ollama.com/

Run this in your terminal:

`ollama run llama3`

Even the 8B version of llama3 is pretty powerful. The 70B version is better, but it may not run on your local computer.
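
Before sending prompts from Python, you may want to check that the local Ollama server is reachable. The sketch below uses only the standard library and queries Ollama's `/api/tags` endpoint, which lists locally available models. The helper name `list_local_models` is hypothetical, and the response shape noted in the comment is an assumption based on Ollama's API documentation, so adjust it if your version differs.

```python
import json
import urllib.error
import urllib.request

def list_local_models(base_url="http://localhost:11434", timeout=2.0):
    """Return the names of local Ollama models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return None
    # Assumed response shape: {"models": [{"name": "llama3:latest", ...}, ...]}
    return [m.get("name") for m in payload.get("models", [])]

models = list_local_models()
if models is None:
    print("Ollama does not seem to be running on localhost:11434.")
else:
    print("Local models:", models)
```

If llama3 is not in the list, running `ollama run llama3` in a terminal should download it first.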

Let's run the zero-shot Chain-of-Thought prompt that I printed above through your local Ollama instance. We'll pick one style-guide rule:

In [49]:
buzzfeed_style_guide = pd.read_csv(f'{style_guide_dir}/buzzfeed/parsed-rules-df.csv')

In [54]:
title, text = buzzfeed_style_guide.iloc[3][['rule_title', 'rule_text']]
print(f'Title: {title}')
print('----------------------------------')
print(f'Text: {text}')

Title: ableism (see also disability )
----------------------------------
Text: • Avoid ableist language (discriminating against people with disabilities), including figurative use of words like lame , crutch , or handicap (which historically described physical disabilities) and crazy , stupid , or insane (once used as official diagnoses for people housed in institutions) and idioms like lame duck (opt for outgoing president
), turns a blind eye , or falls on deaf ears (use willfully ignorant
instead) . See Lydia X. Z. Brown’s glossary of ableist language to consider which words and terms to avoid and “ Ableism Is Embedded In Our Language. We Can Dismantle It. ”
• Use nondisabled
or person without a disability
rather than able-bodied , which implies that people with disabilities lack “able bodies,” per NCDJ guidance .
• Use low support needs
or high support needs
when writing about disabled people. Avoid high- and low-functioning ; how someone is “functioning” can change day to day, but

In [72]:
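import requests

# Fill the prompt template with the rule selected above and ask for 3 examples.
p = prompts.ZERO_SHOT_COT_BAD_SYNTHETIC_DATA.format(
    style_guide_rule_title=title,
    style_guide_rule=text,
    n=3
)
# Send the prompt to the local Ollama server; with stream=False the reply is one JSON object.
r = requests.post(
    'http://localhost:11434/api/generate',
    json={
        "model": "llama3",
        "prompt": p,
        "stream": False
    },
)
r.raise_for_status()  # fail early if the server returned an HTTP error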

In [74]:
print(r.json()['response'])

Let's go through the steps:

1. Is there a rule being expressed? Yes
2. Is this rule something that can be violated? Yes

So, let's simplify the rule and specify ways it can be broken:

Simplify the rule: Avoid using ableist language and terms when writing about people with disabilities.

Ways to break the rule:
* Use the words "lame", "crutch", or "handicap" figuratively.
* Refer to someone as "crazy", "stupid", or "insane".
* Use idioms like "turns a blind eye" or "falls on deaf ears".

Here are three example sentences that violate this rule:

[["The new policy is going to be a lame duck," said the mayor, trying to downplay its impact.]]
[['The politician has been accused of turning a blind eye to the crisis in the city's schools.']]
[['The company's CEO has always been crazy about innovation and taking risks.']]


# interesting!!!!
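
The example sentences in the response above come back wrapped in double square brackets, with inconsistent quoting. If you want them as a Python list, a tolerant parser is one option. This is a sketch fitted to the single response shown here, so other responses may need different handling; `extract_examples` is a hypothetical helper, and the sample text is abbreviated from that response.

```python
def extract_examples(response_text):
    """Collect lines of the form [[...]] and strip the brackets and outer quotes."""
    examples = []
    for line in response_text.splitlines():
        line = line.strip()
        if line.startswith("[[") and line.endswith("]]"):
            inner = line[2:-2].strip()
            # Drop one leading and one trailing quote character, if present.
            if inner[:1] in ("'", '"'):
                inner = inner[1:]
            if inner[-1:] in ("'", '"'):
                inner = inner[:-1]
            examples.append(inner)
    return examples

# Abbreviated stand-in for r.json()['response'].
sample = "\n".join([
    "Here are three example sentences that violate this rule:",
    "",
    "[['The politician has been accused of turning a blind eye to the crisis.']]",
    "[['The CEO has always been crazy about innovation.']]",
])
print(extract_examples(sample))
```

A sentence whose quotes are unbalanced in the model output, like the first example in the response above, will keep its stray inner quote, so the results are worth a quick manual check.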