# Exp6: GPT data augmentation

In this notebook, I sample one construction from each CEFR level and use GPT-4 to generate 100 positive and 50 negative examples for it.

In [1]:
import os
import re
import sys

import pandas as pd
from openai import OpenAI

# Make the parent directory importable so the project-level config module is found
sys.path.append(os.path.dirname(os.getcwd()))
import config

client = OpenAI(api_key=config.OPENAI_API_KEY)

Read the data and sample one construction from each level.

In [26]:
egp = pd.read_csv('../dat/egponline.csv')
egp_samples = egp.groupby('Level', group_keys=False).apply(lambda x: x.sample(1, random_state=config.SEED))
egp_samples['Example'] = egp_samples['Example'].str.replace(r"\(.*\)", "", regex=True).str.strip()
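
As a side note on the cleanup step above, here is a minimal standalone sketch of what the pattern `\(.*\)` does, using plain `re` instead of pandas. The sample sentence and the helper name `strip_parentheticals` are invented for illustration and are not taken from the dataset. The pattern is greedy, so on a line with two parenthetical notes it also removes the text between them; a per-group variant is shown for comparison. Because `.` does not match newlines by default, the effect is limited to a single line.

```python
import re

def strip_parentheticals(text, greedy=True):
    """Remove parenthesised notes from text (hypothetical helper for illustration)."""
    # Greedy: from the first "(" to the last ")" on the same line.
    # Per-group: each "(...)" is removed on its own.
    pattern = r"\(.*\)" if greedy else r"\([^)]*\)"
    return re.sub(pattern, "", text).strip()

# Invented sample with two parenthetical notes on one line
sample = "I go (A1 LEARNER) to school (daily) now."

greedy_result = strip_parentheticals(sample, greedy=True)
per_group_result = strip_parentheticals(sample, greedy=False)

print(repr(greedy_result))
print(repr(per_group_result))
```

If the examples in the data only ever carry a single trailing note per line, both patterns behave the same; the difference matters only when a line contains more than one pair of parentheses.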

In [27]:
egp_samples.head(20)

Unnamed: 0.1,Unnamed: 0,#,SuperCategory,SubCategory,Level,Lexical Range,guideword,Can-do statement,Example,type
683,683,684,NOUNS,plural,A1,,FORM: PLURAL '-S',Can form plurals by adding '-s' to common coun...,We play computer games. \n\nYou have to walk j...,FORM
929,929,930,PRONOUNS,subject/ object,A2,,FORM: (SUBJECT) EMPTY 'IT',Can use 'it' as an empty subject when there is...,It's warm and windy. \n\nNow it 's 3 pm.,FORM
343,343,344,DETERMINERS,quantity,B1,,"FORM: 'SO MUCH', 'TOO MUCH' WITH UNCOUNTABLE N...",Can use 'so much' and 'too much' with uncounta...,Hello Marina. Unfortunately I have so much wor...,FORM
480,480,481,MODALITY,expressions with be,B2,,"USE: 'BE OBLIGED TO', OBLIGATION",Can use 'be obliged to' to talk about an exter...,The Sales Representatives are obliged to parti...,USE
803,803,804,PAST,past simple,C1,3.0,FORM: QUESTIONS,"Can use 'yes/no', 'wh-', tag and negative ques...",Did you spot the fact that what is expected fr...,FORM
601,601,602,MODALITY,used to,C2,,FORM: 'USED NOT TO',Can use negative form 'used not to',"I used to play until late in the evening, I us...",FORM


Let's create the prompts right away.

In [39]:
NUM_EXAMPLES = 20

# Numeric lexical range codes in the data, mapped to the wording used in the prompt
LEXICAL_RANGE_LABELS = {1: 'low', 2: 'medium', 3: 'high'}

def get_prompt(construction):
    # Add a lexical-range hint only when the construction has a known code;
    # missing (NaN) or unexpected values leave the hint empty
    lexical_range = ''
    label = LEXICAL_RANGE_LABELS.get(construction["Lexical Range"])
    if label is not None:
        lexical_range = f'Use words of {label} difficulty in the rule.'
    return f'Learn the grammar rule "{construction["Can-do statement"]}" ({construction["SuperCategory"]}, {construction["SubCategory"]}, {construction["guideword"]}). It is CEFR level {construction["Level"]}. {lexical_range}\nExamples:\n{construction["Example"]}\nCreate {NUM_EXAMPLES} more examples using that rule.'

egp_samples['prompt'] = egp_samples.apply(get_prompt, axis=1)

In [40]:
print(egp_samples['prompt'].iloc[4])

Learn the grammar rule "Can use 'yes/no', 'wh-', tag and negative question forms with a wide range of verbs.  ► Questions" (PAST, past simple, FORM: QUESTIONS ). It is CEFR level C1. Use words of high difficulty in the rule.
Examples:
Did you spot the fact that what is expected from us in Tall Trees Campsite is to help in organising the entertainment? 

How did it turn out then? 

I told you about Peter and Mary, didn't I? 

Didn't we all dream of being famous once in our lives?
Create 20 more examples using that rule.


In [8]:
def parse_numbered_lines(text):
    # Keep only lines of the form "1. ..." and strip the leading number
    return [re.sub(r'^\d+\.\s*', '', line).strip() for line in text.split('\n') if re.match(r'^\d+\.', line)]

def get_examples(construction):
    print(construction['prompt'])
    messages = [
        {"role": "system", "content": "You are an English as a foreign language teacher who is knowledgeable about grammar."},
        {"role": "user", "content": construction['prompt']}
    ]

    # Positive examples: ask the model to apply the rule
    response = client.chat.completions.create(model=config.OPENAI_MODEL, messages=messages)
    msg_content = response.choices[0].message.content
    print(f'{msg_content}\n\n')
    positive_examples = parse_numbered_lines(msg_content)

    # Keep the model's answer in the conversation so the follow-up can refer to it
    messages.append({"role": "assistant", "content": msg_content})

    # Negative examples: ask for minimal pairs that do not use the rule
    messages.append({"role": "user", "content": "Rewrite each created example as a minimal pair that does not show the usage of the given rule."})
    response = client.chat.completions.create(model=config.OPENAI_MODEL, messages=messages)
    msg_content = response.choices[0].message.content
    print(f'{msg_content}\n\n')
    negative_examples = parse_numbered_lines(msg_content)
    return positive_examples, negative_examples

def get_multiple_examples(row):
    all_positive_examples = []
    all_negative_examples = []

    # Call get_examples 5 times and concatenate results
    for _ in range(5):
        positive_examples, negative_examples = get_examples(row)
        all_positive_examples.extend(positive_examples)
        all_negative_examples.extend(negative_examples)

    return all_positive_examples, all_negative_examples

egp_samples[['augmented_examples', 'augmented_negative_examples']] = egp_samples.apply(get_multiple_examples, axis=1, result_type='expand')

Create 20 more examples for the grammatical construction on CEFR level A1 in the category "DETERMINERS: articles" with guideword "FORM: 'A' + 'VERY' + ADJECTIVES" and the rule: "Can use 'a' + 'very' + adjectives in basic noun phrases. ► adverbs ►  noun phrases"

Examples:

He's a very good man. 

It is a very beautiful place. 

It's a very nice town.

Output format:
1. [EXAMPLE 1]
2. [EXAMPLE 2]
1. She's a very talented singer. 
2. He's a very funny comedian.
3. It's a very interesting book.
4. The car is a very fast vehicle.
5. We had a very long day.
6. It was a very difficult test.
7. She has a very small apartment.
8. He's a very generous person.
9. That's a very expensive watch.
10. I had a very tasty meal.
11. She's a very intelligent student.
12. It's a very hot summer day.
13. He's a very handsome man.
14. We had a very exciting adventure.
15. It was a very loud noise.
16. She's a very kind teacher.
17. They live in a very peaceful neighborhood.
18. He's a very strong athlete.
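
The number-prefix filter used in `get_examples` can be checked offline without calling the API. The sketch below mirrors that filter on a hand-written reply; the reply text and the helper `pair_examples` are assumptions made for illustration. It also shows one way to pair positives with their minimal-pair rewrites by position, which only lines up when both replies contain the same number of numbered lines.

```python
import re

def parse_numbered_lines(text):
    """Return the text of every line that starts with '<number>.'."""
    items = []
    for line in text.split("\n"):
        if re.match(r"^\d+\.", line):
            # Drop the leading "12. " prefix and surrounding whitespace
            items.append(re.sub(r"^\d+\.\s*", "", line).strip())
    return items

def pair_examples(positives, negatives):
    """Pair examples by position, truncating to the shorter list (hypothetical helper)."""
    n = min(len(positives), len(negatives))
    return list(zip(positives[:n], negatives[:n]))

# Hand-written stand-in for a model reply, including non-numbered chatter
reply = (
    "Sure, here are some examples:\n"
    "1. She's a very talented singer. \n"
    "2. He's a very funny comedian.\n"
    "\n"
    "Let me know if you need more."
)

parsed = parse_numbered_lines(reply)
pairs = pair_examples(["a", "b", "c"], ["x", "y"])
print(parsed)
print(pairs)
```

Comparing `len(parsed)` against `NUM_EXAMPLES` after each call is a cheap way to notice replies that came back shorter than requested.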


Save the generated examples

In [9]:
egp_samples.to_json('../dat/egpaugmented_6x100.json')
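
To sanity-check the saved file without pandas, the sketch below counts examples per level using only the standard library. It assumes the default `to_json` layout for a DataFrame (`orient='columns'`, i.e. column name, then index label, then value); the sample data is hand-written and stands in for the real file, and `count_examples` is a hypothetical helper.

```python
import json

# Hand-written stand-in for the saved JSON, in the column -> index -> value layout
sample_json = json.dumps({
    "Level": {"683": "A1", "929": "A2"},
    "augmented_examples": {
        "683": ["We like cats.", "She has two dogs."],
        "929": ["It's cold today."],
    },
    "augmented_negative_examples": {
        "683": ["We like a cat."],
        "929": ["The weather is cold today."],
    },
})

def count_examples(raw):
    """Map each level to (number of positive, number of negative) examples."""
    data = json.loads(raw)
    counts = {}
    for idx, level in data["Level"].items():
        positives = data["augmented_examples"][idx]
        negatives = data["augmented_negative_examples"][idx]
        counts[level] = (len(positives), len(negatives))
    return counts

counts = count_examples(sample_json)
print(counts)
```

For the real file, the string passed to `count_examples` would be the contents of `../dat/egpaugmented_6x100.json`.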