# Exp1: Data augmentation of EGP
Let's first prompt for 20 more examples for each of 6 randomly sampled constructions (one per CEFR level), plus matching negative examples, and evaluate them manually to check the pipeline.

In [1]:
import pandas as pd
import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))
import config

In [2]:
egp = pd.read_csv('../dat/egponline.csv')
egp.head()

Unnamed: 0,#,SuperCategory,SubCategory,Level,Lexical Range,guideword,Can-do statement,Example
0,1,ADJECTIVES,combining,A1,,FORM: COMBINING TWO ADJECTIVES WITH 'AND',Can use 'and' to join a limited range of commo...,The teachers are very nice and friendly . (A1 ...
1,2,ADJECTIVES,combining,A2,,FORM: COMBINING TWO ADJECTIVES WITH 'BUT',Can use 'but' to join a limited range of commo...,The weather was cloudy but fine. (A2 WAYSTAGE;...
2,3,ADJECTIVES,combining,B1,,FORM: BEFORE THE NOUN,Can use a comma to combine two adjectives used...,"It was the biggest TV in the bazaar, with its ..."
3,4,ADJECTIVES,combining,B1,,FORM: COMBINING COMPARATIVE ADJECTIVES WITH 'AND',Can use 'and' to join a limited range of compa...,In the winter you can go to the fitness centre...
4,5,ADJECTIVES,combining,B1,,FORM: COMBINING MORE THAN TWO ADJECTIVES,Can use commas and 'and' to join more than two...,"Her name is Cassia, she is beautiful, intellig..."


Sample one construction from every level:

In [24]:
egp_samples = egp.groupby('Level', group_keys=False).apply(lambda x: x.sample(n=1, random_state=config.SEED))
egp_samples.head(n=6)

Unnamed: 0,#,SuperCategory,SubCategory,Level,Lexical Range,guideword,Can-do statement,Example
1076,1077,VERBS,patterns_with to and -ing,A1,,FORM: 'LIKE' + 'TO'-INFINITIVE OR + '-ING',Can use 'like' followed either by a 'to'-infin...,I like to work with my teacher. (A1 BREAKTHROU...
384,385,FUTURE,future with will and shall,A2,,FORM: NEGATIVE 'WILL',Can use the negative forms 'will not' and 'won...,Without them I will not be able to play footba...
1110,1111,CLAUSES,conditional,B1,,"FORM/USE: PRESENT CONTINUOUS 'IF' CLAUSE , PLA...",Can use 'if' + present continuous or 'going to...,"Billy, if you are moving you should think a lo..."
727,728,PASSIVES,passives: form,B2,,USE: SUMMARIES AND EVALUATIONS,Can use the passive with modal verbs to evalua...,"In conclusion, it can be seen that the bicycle..."
54,55,ADJECTIVES,position,C1,,FORM: PAST PARTICPLE AS ADJECTIVE,Can use the '–ed' form of a verb as an adjecti...,"First of all, I would recommend providing a wi..."
780,781,PAST,past perfect simple,C2,,"FORM/USE: AFTER 'IF ONLY', IMAGINED PAST",Can use the negative form after 'if only' to e...,"If only she had not changed, if only I had hel..."


Let's have a closer look at the structure of the Example column:

In [26]:
print(egp_samples.Example.iloc[0])

I like to work with my teacher. (A1 BREAKTHROUGH; 2007; Italian; Pass)

I don't like to eat at College. (A1 BREAKTHROUGH; 2007; Italian; Pass)

I like reading lots of magazines. (Hungary; A1 BREAKTHROUGH; 2010; Hungarian; Pass)

I don't like doing homework in college. (A1 BREAKTHROUGH; 2007; Thai; Pass)

I like learning grammar. (A1 BREAKTHROUGH; 2007; Polish; Pass)

I like travelling by car. (A1 BREAKTHROUGH; 2011; Others; Pass)


Let's remove the parenthesized metadata that follows each example sentence.

In [27]:
egp_samples['Example'] = egp_samples['Example'].str.replace(r"\(.*\)", "", regex=True).str.strip()
print(egp_samples['Example'].iloc[0])

I like to work with my teacher. 

I don't like to eat at College. 

I like reading lots of magazines. 

I don't like doing homework in college. 

I like learning grammar. 

I like travelling by car.
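
Note that `\(.*\)` is greedy: on each line it removes everything from the first opening parenthesis to the last closing one, so a sentence that contains its own parentheses would lose text. A minimal sketch of a stricter variant that strips only a trailing parenthesized group at the end of a line is shown below. The helper name `strip_metadata` and the second sample sentence are made up for illustration and are not part of the EGP data.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Matches one "(...)" group, without nested parentheses, at the end of a line.
TRAILING_META = re.compile(r"[ \t]*\([^()\n]*\)[ \t]*$", flags=re.MULTILINE)

def strip_metadata(example: str) -> str:
    """Remove a trailing parenthesized annotation from every line."""
    return TRAILING_META.sub("", example).strip()

# The second sentence is invented to show that inner parentheses survive.
raw = (
    "I like to work with my teacher. (A1 BREAKTHROUGH; 2007; Italian; Pass)\n\n"
    "I like tea (and coffee). (A1 BREAKTHROUGH; 2007; Polish; Pass)"
)
cleaned = strip_metadata(raw)
print(cleaned)
```

This variant also drops the trailing space that the original pattern leaves behind on each line.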


Let's assemble the prompt.

In [28]:
construction = egp_samples.iloc[1]
num_examples = 20
prompt = f'Create {num_examples} more examples for the grammatical construction on CEFR level {construction["Level"]} in the category "{construction["SuperCategory"]}: {construction["SubCategory"]}" with guideword "{construction["guideword"]}" and the rule: "{construction["Can-do statement"]}"\n\nExamples:\n\n{construction["Example"]}\n\nOutput format:\n1. [EXAMPLE 1]\n2. [EXAMPLE 2]'

print(prompt)

Create 20 more examples for the grammatical construction on CEFR level A2 in the category "FUTURE: future with will and shall" with guideword "FORM: NEGATIVE 'WILL'" and the rule: "Can use the negative forms 'will not' and 'won't'."

Examples:

Without them I will not be able to play football. 

Can you come at 7 o'clock? I won't be at home before.

Output format:
1. [EXAMPLE 1]
2. [EXAMPLE 2]


Let's add the prompt as a column to the dataframe:

In [29]:
def get_prompt(construction):
    return f'Create {num_examples} more examples for the grammatical construction on CEFR level {construction["Level"]} in the category "{construction["SuperCategory"]}: {construction["SubCategory"]}" with guideword "{construction["guideword"]}" and the rule: "{construction["Can-do statement"]}"\n\nExamples:\n\n{construction["Example"]}\n\nOutput format:\n1. [EXAMPLE 1]\n2. [EXAMPLE 2]'

egp_samples['prompt'] = egp_samples.apply(get_prompt, axis=1)

egp_samples['prompt'].iloc[0]

'Create 20 more examples for the grammatical construction on CEFR level A1 in the category "VERBS: patterns_with to and -ing" with guideword "FORM: \'LIKE\' + \'TO\'-INFINITIVE OR + \'-ING\'" and the rule: "Can use \'like\' followed either by a \'to\'-infinitive or an \'-ing\' form, with no change in meaning. "\n\nExamples:\n\nI like to work with my teacher. \n\nI don\'t like to eat at College. \n\nI like reading lots of magazines. \n\nI don\'t like doing homework in college. \n\nI like learning grammar. \n\nI like travelling by car.\n\nOutput format:\n1. [EXAMPLE 1]\n2. [EXAMPLE 2]'

Now let's access the OpenAI Chat Completions API.

In [37]:
from openai import OpenAI
client = OpenAI(api_key=config.OPENAI_API_KEY)
SYSTEM_PROMPT = "You are an English as a foreign language teacher who is knowledgeable about grammar."

In [31]:
response = client.chat.completions.create(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt}
    ],
    model=config.OPENAI_MODEL,
)

Let's print the response of the API.

In [32]:
msg_content = response.choices[0].message.content
msg_content

"1. She won't go to the party because she has to study for her exam.\n2. They will not buy a new car because they can't afford it.\n3. He won't eat sushi because he doesn't like raw fish.\n4. We will not go on vacation this year due to financial constraints.\n5. She won't tell anyone about the surprise party.\n6. They will not go to the concert because they don't like the band.\n7. He won't help you with your project because he's too busy with his own work.\n8. I will not be able to attend the meeting as I have a prior appointment.\n9. We won't be late for the movie if we leave now.\n10. She will not lend you her laptop because she's using it for her presentation.\n11. They won't cook dinner tonight as they're going out for a meal.\n12. He will not visit his grandparents as they're out of town.\n13. We won't forget to bring your birthday present to the party.\n14. She won't give up easily because she's determined to succeed.\n15. They will not ask for your help as they prefer to figure

Let's process the response into a list of sentences, stripping the leading numbering.

In [33]:
import re

In [34]:
lines = msg_content.split('\n')
[re.sub(r'^\d+\.\s*', '', line).strip() for line in lines]

["She won't go to the party because she has to study for her exam.",
 "They will not buy a new car because they can't afford it.",
 "He won't eat sushi because he doesn't like raw fish.",
 'We will not go on vacation this year due to financial constraints.',
 "She won't tell anyone about the surprise party.",
 "They will not go to the concert because they don't like the band.",
 "He won't help you with your project because he's too busy with his own work.",
 'I will not be able to attend the meeting as I have a prior appointment.',
 "We won't be late for the movie if we leave now.",
 "She will not lend you her laptop because she's using it for her presentation.",
 "They won't cook dinner tonight as they're going out for a meal.",
 "He will not visit his grandparents as they're out of town.",
 "We won't forget to bring your birthday present to the party.",
 "She won't give up easily because she's determined to succeed.",
 'They will not ask for your help as they prefer to figure it out 
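
Splitting on newlines keeps every line, so blank lines or an introductory sentence from the model would end up in the list as well. A possible, more defensive parser that keeps only lines that actually start with a number is sketched here. The name `parse_numbered_list` and the sample string are assumptions for illustration.

```python
import re

# Hypothetical helper: a line counts as an item only if it looks like "12. text".
ITEM = re.compile(r"^\s*(\d+)[.)]\s+(.*\S)\s*$")

def parse_numbered_list(text: str) -> list[str]:
    """Return the text of every numbered line, ignoring everything else."""
    items = []
    for line in text.splitlines():
        match = ITEM.match(line)
        if match:
            items.append(match.group(2))
    return items

# Invented sample with a preamble and a blank line between items.
sample = "Here are the examples:\n1. She won't go.\n\n2. They will not buy a car.\n"
parsed = parse_numbered_list(sample)
print(parsed)
```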

Now let's do this for every row in our dataframe and also ask for negative examples.

In [40]:
NEGATIVE_PROMPT = "Rewrite every example with the same content but without using the rule."
def get_examples(construction):
    # First turn: ask for positive examples of the construction.
    print(construction['prompt'])
    response = client.chat.completions.create(
        model=config.OPENAI_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": construction['prompt']}
        ]
    )
    msg_content = response.choices[0].message.content
    print(f'{msg_content}\n\n')
    # Strip the leading "1. " numbering and skip blank lines.
    lines = [line for line in msg_content.split('\n') if line.strip()]
    positive_examples = [re.sub(r'^\d+\.\s*', '', line).strip() for line in lines]

    # Second turn: replay the first exchange and ask for negative examples.
    print(f'{NEGATIVE_PROMPT}\n\n')
    response = client.chat.completions.create(
        model=config.OPENAI_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": construction['prompt']},
            response.choices[0].message,  # assistant reply from the first turn
            {"role": "user", "content": NEGATIVE_PROMPT}
        ]
    )
    msg_content = response.choices[0].message.content
    print(f'{msg_content}\n\n')
    lines = [line for line in msg_content.split('\n') if line.strip()]
    negative_examples = [re.sub(r'^\d+\.\s*', '', line).strip() for line in lines]
    return positive_examples, negative_examples
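
The negative examples depend on the second request replaying the first exchange, so that the model rewrites its own sentences. The sketch below builds that message list as plain dictionaries without calling the API. `build_negative_messages` is a hypothetical helper, and writing the assistant turn as a `{"role": "assistant", ...}` dictionary is assumed here to be interchangeable with passing the returned message object.

```python
def build_negative_messages(system_prompt, user_prompt, assistant_reply, negative_prompt):
    """Assemble the four-message conversation used for the second request."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
        # The model's own answer from the first turn.
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": negative_prompt},
    ]

# Placeholder strings stand in for the real prompt and response.
messages = build_negative_messages(
    "You are a teacher.",
    "Create 2 more examples ...",
    "1. She won't go.\n2. They will not stay.",
    "Rewrite every example with the same content but without using the rule.",
)
print([m["role"] for m in messages])
```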

In [41]:
egp_samples[['augmented_examples', 'augmented_negative_examples']] = egp_samples.apply(get_examples, axis=1, result_type='expand')

egp_samples['augmented_examples']

Create 20 more examples for the grammatical construction on CEFR level A1 in the category "VERBS: patterns_with to and -ing" with guideword "FORM: 'LIKE' + 'TO'-INFINITIVE OR + '-ING'" and the rule: "Can use 'like' followed either by a 'to'-infinitive or an '-ing' form, with no change in meaning. "

Examples:

I like to work with my teacher. 

I don't like to eat at College. 

I like reading lots of magazines. 

I don't like doing homework in college. 

I like learning grammar. 

I like travelling by car.

Output format:
1. [EXAMPLE 1]
2. [EXAMPLE 2]
1. She likes to dance in her free time.
2. He doesn't like to cook on weekends.
3. They like to go hiking in the mountains.
4. We like to play football after school.
5. I like to listen to music before going to bed.
6. She likes to watch movies in her spare time.
7. He doesn't like to swim in the pool.
8. They like to take long walks on the beach.
9. We like to read books during our lunch break.
10. I like to paint beautiful landscapes.
11

1. From my perspective, it can be argued that the new government policies should be revised.
2. An interesting point that can be made is that English is spoken by over a billion people worldwide.
3. It could be suggested that more funding needs to be allocated to social welfare programs.
4. Another factor that should be considered is that climate change is being exacerbated by human activities.
5. It can be concluded that the project was completed successfully, despite facing several challenges.
6. One possible explanation for this phenomena is that technology has advanced rapidly in recent years.
7. It might be inferred that the company's success is largely due to its innovative marketing strategies.
8. A strong case can be made for the importance of early childhood education in shaping future outcomes.
9. It may be deduced that the overall satisfaction rate among customers has steadily increased.
10. It can be observed that the number of people using public transportation has signifi

1. If only they hadn't missed the train, they would have arrived on time. 
2. If only he hadn't gambled away his savings, he would still have enough money to retire comfortably. 
4. If only they hadn't wasted so much time, they could have finished the project ahead of schedule. 
5. If only I hadn't forgotten my passport, I wouldn't have missed my flight. 
6. If only he hadn't dropped the glass, it wouldn't have shattered into pieces. 
7. If only they hadn't eaten so much junk food, they wouldn't have felt sick afterwards. 
8. If only she hadn't overslept, she would have made it to the meeting in time. 
9. If only he hadn't lied to her, their relationship wouldn't have ended on such bad terms. 
10. If only they hadn't lost the key, they could have unlocked the door. 
11. If only I hadn't skipped breakfast, I wouldn't have felt so hungry now. 
12. If only she hadn't quit her job, she would still have a stable source of income. 
13. If only he hadn't missed the deadline, he wouldn't be fa

1076    [She likes to dance in her free time., He does...
384     [She won't go to the party tonight., They will...
1110    [If you are planning to go camping, you should...
727     [From my perspective, it can be argued that th...
54      [I was amazed by the power of the wind when I ...
780     [If only they hadn't missed the train, they wo...
Name: augmented_examples, dtype: object
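
One of the printed responses above jumps from item 2 to item 4, so the number of returned examples cannot be taken for granted. A small check along the following lines could flag such gaps before manual evaluation. The function name `check_numbering` and the toy input are assumptions.

```python
import re

def check_numbering(text: str, expected: int) -> dict:
    """Report how many numbered items were found and which numbers are missing."""
    numbers = [
        int(m.group(1))
        for m in re.finditer(r"^\s*(\d+)\.", text, flags=re.MULTILINE)
    ]
    missing = sorted(set(range(1, expected + 1)) - set(numbers))
    return {"found": len(numbers), "missing": missing}

# Toy response in which item 3 is absent.
report = check_numbering("1. a\n2. b\n4. c\n", expected=4)
print(report)
```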

Save the augmented results for further processing and evaluation.

In [42]:
egp_samples.to_json('../dat/egpaugmented.json')