# Data augmentation of EGP
Let's first prompt for 20 more examples for each of 6 randomly sampled constructions (one per CEFR level), ask for negative examples as well, and evaluate the results manually to check the pipeline.

In [1]:
import pandas as pd

In [2]:
egp = pd.read_csv('../dat/egponline.csv')
egp.head()

Unnamed: 0,#,SuperCategory,SubCategory,Level,Lexical Range,guideword,Can-do statement,Example
0,1,ADJECTIVES,combining,A1,,FORM: COMBINING TWO ADJECTIVES WITH 'AND',Can use 'and' to join a limited range of commo...,The teachers are very nice and friendly . (A1 ...
1,2,ADJECTIVES,combining,A2,,FORM: COMBINING TWO ADJECTIVES WITH 'BUT',Can use 'but' to join a limited range of commo...,The weather was cloudy but fine. (A2 WAYSTAGE;...
2,3,ADJECTIVES,combining,B1,,FORM: BEFORE THE NOUN,Can use a comma to combine two adjectives used...,"It was the biggest TV in the bazaar, with its ..."
3,4,ADJECTIVES,combining,B1,,FORM: COMBINING COMPARATIVE ADJECTIVES WITH 'AND',Can use 'and' to join a limited range of compa...,In the winter you can go to the fitness centre...
4,5,ADJECTIVES,combining,B1,,FORM: COMBINING MORE THAN TWO ADJECTIVES,Can use commas and 'and' to join more than two...,"Her name is Cassia, she is beautiful, intellig..."


Sample one construction from every level:

In [28]:
egp_samples = egp.groupby('Level', group_keys=False).apply(pd.DataFrame.sample)
egp_samples.head(n=6)

Unnamed: 0,#,SuperCategory,SubCategory,Level,Lexical Range,guideword,Can-do statement,Example
1101,1102,VERBS,linking,A1,,FORM: 'BE' + COMPLEMENT,Can use linking verb 'be' with complements.,Because today is my birthday. (A1 BREAKTHROUGH...
871,872,PRESENT,present simple,A2,2.0,FORM: NEGATIVE,Can use the negative form with an increasing r...,I don't need the video because I've got the DV...
909,910,PRONOUNS,reflexive,B1,2.0,"FORM: SAME SUBJECT AND OBJECT, SINGULAR",Can use an increasing range of singular reflex...,I decided to talk to him and introduce myself....
1103,1104,VERBS,linking,B2,3.0,FORM: LINKING + COMPLEMENT,Can use a wide range of linking verbs with com...,There are indeed many ways of doing this and i...
272,273,CONJUNCTIONS,coordinating,C1,,FORM/USE: 'NOT ONLY … BUT ALSO' WITH INVERSION,Can use inverted auxiliary 'do' + the subject ...,"Indeed, not only did they teach us useful know..."
314,315,DETERMINERS,demonstratives,C2,,"USE: 'THIS', IN NARRATIVES",Can use 'this' with nouns and noun phrases in ...,[talking about two characters in a story] Ther...
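
The sampling above is unseeded, so every run draws different constructions, which is why later outputs in this notebook do not always show the same rows. A minimal sketch of a repeatable draw, assuming pandas 1.1 or newer (for `DataFrameGroupBy.sample`); the `toy` frame and the helper name `sample_per_level` are hypothetical stand-ins, not part of the original pipeline:

```python
import pandas as pd

# Hypothetical stand-in for the EGP table; only the columns needed here.
toy = pd.DataFrame({
    "Level": ["A1", "A1", "A2", "A2", "B1", "B1"],
    "guideword": ["g1", "g2", "g3", "g4", "g5", "g6"],
})

def sample_per_level(df, seed=0):
    # One row per CEFR level; a fixed seed makes the draw repeatable.
    return df.groupby("Level", group_keys=False).sample(n=1, random_state=seed)

first = sample_per_level(toy)
second = sample_per_level(toy)
```

With a fixed `random_state`, rerunning the cell should return the same rows, so the printed examples and prompts stay consistent across the notebook.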


Let's have a closer look at the structure of the Example column:

In [4]:
print(egp_samples.Example.iloc[0])

When I have free time, I go to the park with my dog. (A1 BREAKTHROUGH; 2007; Polish; Pass)

We have lots of vegetables and fruit in the summer. (A1 BREAKTHROUGH; 2009; Slovak; Pass)

We sometimes meet friends or go to the Sports Center at the weekend. (A1 BREAKTHROUGH; 2006; Mandarin; Pass)


Let's remove the parenthesized metadata (exam level, year, first language, result) that follows each example.

In [5]:
egp_samples['Example'] = egp_samples['Example'].str.replace(r"\(.*\)", "", regex=True).str.strip()
print(egp_samples['Example'].iloc[0])

When I have free time, I go to the park with my dog. 

We have lots of vegetables and fruit in the summer. 

We sometimes meet friends or go to the Sports Center at the weekend.
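
The pattern `\(.*\)` is greedy, so on a line where the sentence itself contains parentheses it would delete everything from the first `(` to the last `)`. A possible alternative, sketched with the standard `re` module, anchors the match at the end of each line and forbids nested parentheses. The first sentence in `raw` is invented for illustration, and `strip_metadata` is a hypothetical helper name:

```python
import re

# Matches a parenthesized block at the end of a line, e.g.
# "(A1 BREAKTHROUGH; 2007; Polish; Pass)". [^()]* cannot cross another
# parenthesis, so earlier parentheses in the sentence are left alone.
METADATA = re.compile(r"[ \t]*\([^()]*\)[ \t]*$", flags=re.MULTILINE)

def strip_metadata(example):
    # Remove the trailing metadata on every line, then trim the whole string.
    return METADATA.sub("", example).strip()

raw = (
    "I like tea (and coffee) a lot. (A1 BREAKTHROUGH; 2007; Polish; Pass)\n\n"
    "Martin is from Poland. (A1 BREAKTHROUGH; 2006; Polish; Pass)"
)
cleaned = strip_metadata(raw)
```

Because the leading whitespace is part of the match, this variant also avoids the trailing space that remains after each sentence in the output above.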


Let's assemble the prompt.

In [22]:
construction = egp_samples.iloc[1]
num_examples = 20
prompt = f'Create {num_examples} more examples for the grammatical construction on CEFR level {construction["Level"]} in the category "{construction["SuperCategory"]}: {construction["SubCategory"]}" with guideword "{construction["guideword"]}" and the rule: "{construction["Can-do statement"]}"\n\nExamples:\n\n{construction["Example"]}\n\nOutput format:\n1. [EXAMPLE 1]\n2. [EXAMPLE 2]'

print(prompt)

Create 20 more examples for the grammatical construction on CEFR level A2 in the category "MODALITY: would" with guideword "USE: WISHES AND PREFERENCES " and the rule: "Can use 'would' with a wide range of verbs to talk about wishes and preferences."

Examples:

I would prefer to go on Sunday, because on Sunday I have no school. 

[talking about going to see a film] I'd prefer to see it tomorrow. 

Bring your new CD! I'd love to hear it!

Output format:
1. [EXAMPLE 1]
2. [EXAMPLE 2]


Let's add the prompt as a column to the dataframe:

In [30]:
def get_prompt(construction):
    return f'Create {num_examples} more examples for the grammatical construction on CEFR level {construction["Level"]} in the category "{construction["SuperCategory"]}: {construction["SubCategory"]}" with guideword "{construction["guideword"]}" and the rule: "{construction["Can-do statement"]}"\n\nExamples:\n\n{construction["Example"]}\n\nOutput format:\n1. [EXAMPLE 1]\n2. [EXAMPLE 2]'

egp_samples['prompt'] = egp_samples.apply(get_prompt, axis=1)

egp_samples['prompt'].iloc[0]

'Create 20 more examples for the grammatical construction on CEFR level A1 in the category "VERBS: linking" with guideword "FORM: \'BE\' + COMPLEMENT" and the rule: "Can use linking verb \'be\' with complements."\n\nExamples:\n\nBecause today is my birthday. (A1 BREAKTHROUGH; 2008; Mandarin; Pass)\n\nMartin is from Poland. (A1 BREAKTHROUGH; 2006; Polish; Pass)\n\nMy house is very clean. (A1 BREAKTHROUGH; 2009; Bengali; Pass)\n\nOutput format:\n1. [EXAMPLE 1]\n2. [EXAMPLE 2]'
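
The prompt printed above still contains the parenthesized metadata, presumably because the sample was redrawn (cell 28) after the cleaning cell (cell 5) had run. It also depends on the global `num_examples`. A minimal sketch of a variant that takes the count as a parameter and cleans the examples itself; `build_prompt` is a hypothetical name, and `row` is a plain dict standing in for a dataframe row:

```python
import re

def build_prompt(construction, num_examples=20):
    # Strip trailing "(...)" metadata here so the prompt is clean
    # regardless of which cells ran before.
    examples = re.sub(
        r"[ \t]*\([^()]*\)[ \t]*$", "", construction["Example"], flags=re.MULTILINE
    ).strip()
    return (
        f'Create {num_examples} more examples for the grammatical construction '
        f'on CEFR level {construction["Level"]} in the category '
        f'"{construction["SuperCategory"]}: {construction["SubCategory"]}" '
        f'with guideword "{construction["guideword"]}" '
        f'and the rule: "{construction["Can-do statement"]}"\n\n'
        f'Examples:\n\n{examples}\n\n'
        'Output format:\n1. [EXAMPLE 1]\n2. [EXAMPLE 2]'
    )

# Stand-in for one row of egp_samples (values taken from the output above).
row = {
    "Level": "A1",
    "SuperCategory": "VERBS",
    "SubCategory": "linking",
    "guideword": "FORM: 'BE' + COMPLEMENT",
    "Can-do statement": "Can use linking verb 'be' with complements.",
    "Example": "Because today is my birthday. (A1 BREAKTHROUGH; 2008; Mandarin; Pass)\n\n"
               "Martin is from Poland. (A1 BREAKTHROUGH; 2006; Polish; Pass)",
}
prompt_text = build_prompt(row, num_examples=5)
```

Since the function accepts any mapping with these keys, it could be passed to `DataFrame.apply(..., axis=1)` in the same way as `get_prompt`.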

Now let's access the OpenAI ChatCompletion API.

In [10]:
import openai
import sys
import os

sys.path.append(os.path.dirname(os.getcwd()))
import config
openai.api_key = config.OPENAI_API_KEY

In [11]:
response = openai.ChatCompletion.create(
    model=config.OPENAI_MODEL,
    messages=[
        {"role": "system", "content": "You are an English as a foreign language teacher who is knowledgeable about grammar."},
        {"role": "user", "content": prompt}
    ]
)

In [13]:
msg_content = response.choices[0].message.content
msg_content

'1. We were all impressed by the performance of theirs at the concert.\n2. The teacher praised the artwork of hers displayed in the exhibition.\n3. They brought the dog of theirs to our house for a playdate.\n4. The company provided the employees with laptops of theirs.\n5. I borrowed a book from the library of hers and it was a great read.\n6. The team celebrated the victory of theirs with a big party.\n7. He received a letter from an old friend of his living abroad.\n8. The professor discussed the research findings of hers during the lecture.\n9. We visited the new house of theirs and it is beautifully decorated.\n10. I tasted the delicious homemade cookies of hers at the party.\n11. The parents were proud of the achievements of their children.\n12. She showed us the latest design sketches of hers for the fashion collection.\n13. They sold the car of theirs to buy a bigger and more comfortable one.\n14. He introduced us to the manager of his company, who offered us a job.\n15. We saw

In [14]:
# Handling a preamble before the list should no longer be necessary.
# msg_content = "Preamble bla bla bla. " + msg_content

In [15]:
import re

In [16]:
lines = msg_content.split('\n')
# Strip the leading "1. " numbering and skip empty lines.
[re.sub(r'^\d+\.\s*', '', line).strip() for line in lines if line.strip()]

['We were all impressed by the performance of theirs at the concert.',
 'The teacher praised the artwork of hers displayed in the exhibition.',
 'They brought the dog of theirs to our house for a playdate.',
 'The company provided the employees with laptops of theirs.',
 'I borrowed a book from the library of hers and it was a great read.',
 'The team celebrated the victory of theirs with a big party.',
 'He received a letter from an old friend of his living abroad.',
 'The professor discussed the research findings of hers during the lecture.',
 'We visited the new house of theirs and it is beautifully decorated.',
 'I tasted the delicious homemade cookies of hers at the party.',
 'The parents were proud of the achievements of their children.',
 'She showed us the latest design sketches of hers for the fashion collection.',
 'They sold the car of theirs to buy a bigger and more comfortable one.',
 'He introduced us to the manager of his company, who offered us a job.',
 'We saw the pai
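
The parsing above assumes that every line of the reply is a list item. If the model adds a preamble or a closing remark, those lines would end up in the list as well. A minimal sketch of a stricter parser that keeps only numbered lines; `parse_numbered_list` is a hypothetical helper, and the preamble and closing sentence in `reply` are invented for illustration:

```python
import re

# A list item: optional indent, a number, "." or ")", then the sentence.
ITEM = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")

def parse_numbered_list(text):
    # Keep only lines that start with "N." (or "N)"); preambles, blank
    # lines and closing remarks are skipped.
    items = []
    for line in text.splitlines():
        match = ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items

reply = (
    "Sure, here are the examples:\n\n"
    "1. She is a doctor.\n"
    "2. The cat is black.\n\n"
    "Let me know if you need more."
)
parsed = parse_numbered_list(reply)
```

The trade-off is that a genuine example which the model forgot to number would be dropped silently, so comparing `len(parsed)` with the requested count is a useful companion check.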

Now let's do this for every row in our dataframe and also ask for negative examples.

In [31]:
SYSTEM_MESSAGE = {"role": "system", "content": "You are an English as a foreign language teacher who is knowledgeable about grammar."}

def parse_examples(msg_content):
    # Strip the leading "1. " numbering and skip empty lines.
    return [re.sub(r'^\d+\.\s*', '', line).strip() for line in msg_content.split('\n') if line.strip()]

def get_examples(construction):
    # Positive examples: send the assembled prompt.
    print(construction['prompt'])
    messages = [SYSTEM_MESSAGE, {"role": "user", "content": construction['prompt']}]
    response = openai.ChatCompletion.create(model=config.OPENAI_MODEL, messages=messages)
    msg_content = response.choices[0].message.content
    print(f'{msg_content}\n\n')
    positive_examples = parse_examples(msg_content)

    # Negative examples: continue the same conversation with the model's
    # reply and ask for rewrites that avoid the rule.
    messages = messages + [
        response.choices[0].message,
        {"role": "user", "content": "Rewrite every example with the same content but without using the rule."},
    ]
    print(messages)
    response = openai.ChatCompletion.create(model=config.OPENAI_MODEL, messages=messages)
    msg_content = response.choices[0].message.content
    print(f'{msg_content}\n\n')
    negative_examples = parse_examples(msg_content)
    return positive_examples, negative_examples
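
The negative examples come from a second request that replays the first exchange and appends a follow-up instruction. The sketch below shows only the shape of that conversation as plain dictionaries, without calling the API; the helper names are hypothetical and the prompt string is a shortened placeholder:

```python
SYSTEM_PROMPT = (
    "You are an English as a foreign language teacher "
    "who is knowledgeable about grammar."
)
NEGATIVE_INSTRUCTION = (
    "Rewrite every example with the same content but without using the rule."
)

def positive_messages(prompt):
    # First request: system role plus the assembled prompt.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]

def negative_messages(prompt, positive_reply):
    # Second request: the model's own answer is replayed as an assistant
    # turn, followed by the follow-up instruction.
    return positive_messages(prompt) + [
        {"role": "assistant", "content": positive_reply},
        {"role": "user", "content": NEGATIVE_INSTRUCTION},
    ]

messages = negative_messages(
    "Create 2 more examples ...",  # placeholder, not a full prompt
    "1. She is a doctor.\n2. The cat is black.",
)
roles = [m["role"] for m in messages]
```

Keeping the positive examples in the context is what lets the short instruction "rewrite every example" refer back to them, and it should keep the negative list in the same order as the positive one.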

In [32]:
egp_samples[['augmented_examples', 'augmented_negative_examples']] = egp_samples.apply(get_examples, axis=1, result_type='expand')

egp_samples['augmented_examples']

Create 20 more examples for the grammatical construction on CEFR level A1 in the category "VERBS: linking" with guideword "FORM: 'BE' + COMPLEMENT" and the rule: "Can use linking verb 'be' with complements."

Examples:

Because today is my birthday. (A1 BREAKTHROUGH; 2008; Mandarin; Pass)

Martin is from Poland. (A1 BREAKTHROUGH; 2006; Polish; Pass)

My house is very clean. (A1 BREAKTHROUGH; 2009; Bengali; Pass)

Output format:
1. [EXAMPLE 1]
2. [EXAMPLE 2]
1. The weather is sunny today.
2. She is a doctor.
3. The cat is black.
4. His name is John.
5. They are students.
6. The flowers are beautiful.
7. It is hot outside.
8. The book is on the table.
9. The movie is interesting.
10. This shirt is too small.
11. The soup is delicious.
12. The dog is friendly.
13. The car is red.
14. The game is on TV tonight.
15. It is important to be kind.
16. The exams are difficult.
17. The party is at my house.
18. The music is loud.
19. The room is messy.
20. The shoes are comfortable.


[{'role': '

1101    [The weather is sunny today., She is a doctor....
871     [I don't like coffee, I prefer tea., She doesn...
909     [He bought himself a new car for his birthday....
1103    [The movie seems really interesting but ends u...
272     [Not only did she win the race, but she also b...
314     [She was telling me about this incredible adve...
Name: augmented_examples, dtype: object
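
Before the manual evaluation, it may help to check that each construction received the requested number of positive and negative examples, since the two lists are only useful as pairs. A minimal sketch, assuming the lists are aligned by position; `pair_examples` is a hypothetical helper and the negative rewrites shown are invented for illustration:

```python
def pair_examples(positive, negative, expected):
    # Refuse to pair lists whose lengths differ from the requested count,
    # since zip() would otherwise silently drop the surplus items.
    if len(positive) != expected or len(negative) != expected:
        raise ValueError(
            f"expected {expected} examples, got "
            f"{len(positive)} positive and {len(negative)} negative"
        )
    return list(zip(positive, negative))

pairs = pair_examples(
    ["She is a doctor.", "The cat is black."],
    ["She works as a doctor.", "The cat has black fur."],  # invented rewrites
    expected=2,
)
```

A `ValueError` here would point to a reply that was truncated or that contained extra lines, which is worth inspecting by hand.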

In [34]:
egp_samples.to_json('../dat/egpaugmented.json')