# Data augmentation of EGP
Let's first prompt for additional examples of 6 random constructions (one per level), as well as for negative examples, and manually evaluate them to check the pipeline.

In [1]:
import pandas as pd

In [2]:
egp = pd.read_csv('../dat/egponline.csv')
egp.head()

Unnamed: 0,#,SuperCategory,SubCategory,Level,Lexical Range,guideword,Can-do statement,Example
0,1,ADJECTIVES,combining,A1,,FORM: COMBINING TWO ADJECTIVES WITH 'AND',Can use 'and' to join a limited range of commo...,The teachers are very nice and friendly . (A1 ...
1,2,ADJECTIVES,combining,A2,,FORM: COMBINING TWO ADJECTIVES WITH 'BUT',Can use 'but' to join a limited range of commo...,The weather was cloudy but fine. (A2 WAYSTAGE;...
2,3,ADJECTIVES,combining,B1,,FORM: BEFORE THE NOUN,Can use a comma to combine two adjectives used...,"It was the biggest TV in the bazaar, with its ..."
3,4,ADJECTIVES,combining,B1,,FORM: COMBINING COMPARATIVE ADJECTIVES WITH 'AND',Can use 'and' to join a limited range of compa...,In the winter you can go to the fitness centre...
4,5,ADJECTIVES,combining,B1,,FORM: COMBINING MORE THAN TWO ADJECTIVES,Can use commas and 'and' to join more than two...,"Her name is Cassia, she is beautiful, intellig..."


Sample one construction from every level:

In [3]:
egp_samples = egp.groupby('Level', group_keys=False).apply(pd.DataFrame.sample)
egp_samples.head(n=6)

Unnamed: 0,#,SuperCategory,SubCategory,Level,Lexical Range,guideword,Can-do statement,Example
180,181,CLAUSES,declarative,A1,,"FORM: NEGATIVE DECLARATIVE, LEXICAL VERBS",Can form simple negative declarative clauses.,I don't like maths. (A1 BREAKTHROUGH; 2007; Fa...
653,654,NOUNS,noun phrases,A2,,FORM: WITH RELATIVE CLAUSE AS POSTMODIFIER,Can post-modify noun phrases by using a non-de...,"She's in a volleyball team like me, but we're ..."
857,858,PRESENT,present continuous,B1,,USE: TEMPORARY REPEATED ACTIONS,Can use the present continuous to describe tem...,I'm doing a lot of great leisure activities. (...
953,954,PRONOUNS,demonstratives,B2,,"FORM/USE: 'THOSE', SUBSTITUTION","Can use 'those' as a substitute, followed by a...","Moreover, for those who are not interested in ..."
280,281,CONJUNCTIONS,subordinating,C1,,FORM/USE: FOCUS,"Can use 'whatever', 'wherever', 'however', etc...","Whatever you choose, you'll probably have to s..."
1046,1047,VERBS,phrasal-prepositional,C2,,FORM: VERB + DIRECT OBJECT + PARTICLE + PREPOS...,Can use a direct object with some prepositiona...,It is widely accepted that being a loyal and t...
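
The sampling above draws different rows on every run, which makes the manual evaluation hard to repeat. A minimal sketch of a seeded variant follows, assuming pandas 1.1 or later for `DataFrameGroupBy.sample`; the `toy` frame, the `sample_per_level` helper and the seed value are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the EGP table; only the 'Level' column matters here.
toy = pd.DataFrame({
    "Level": ["A1", "A1", "A2", "A2", "B1", "B1"],
    "guideword": ["g1", "g2", "g3", "g4", "g5", "g6"],
})

def sample_per_level(df, n=1, seed=0):
    """Draw n rows from each CEFR level, using a fixed seed (hypothetical helper)."""
    return df.groupby("Level", group_keys=False).sample(n=n, random_state=seed)

first = sample_per_level(toy)
second = sample_per_level(toy)

# Both draws select the same rows because the seed is fixed.
print(first.equals(second))
```

With a fixed seed, the six constructions that were inspected by hand can be regenerated later.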


Let's have a closer look at the structure of the Example column:

In [4]:
print(egp_samples.Example.iloc[0])

I don't like maths. (A1 BREAKTHROUGH; 2007; Farsi; Pass)

I don't have children. (A1 BREAKTHROUGH; 2008; Polish; Pass)

I don't know. (A1 BREAKTHROUGH; 2009; French; Pass)


Let's remove the parenthesized learner metadata that follows each sentence.

In [5]:
egp_samples['Example'] = egp_samples['Example'].str.replace(r"\(.*\)", "", regex=True).str.strip()
print(egp_samples['Example'].iloc[0])

I don't like maths. 

I don't have children. 

I don't know.
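
The output above still carries a trailing space on some lines, because `.str.strip()` only trims the ends of the whole cell. One possible alternative, sketched here under the assumption that the metadata is always the last parenthesized group on a line, cleans each line separately and returns a list of sentences. The `clean_examples` helper is hypothetical.

```python
import re

def clean_examples(raw):
    """Drop the trailing '(...)' metadata from each line and keep non-empty sentences."""
    sentences = []
    for line in raw.splitlines():
        # Remove a parenthesized group at the end of the line and the whitespace before it.
        line = re.sub(r"\s*\([^()]*\)\s*$", "", line).strip()
        if line:
            sentences.append(line)
    return sentences

raw = (
    "I don't like maths. (A1 BREAKTHROUGH; 2007; Farsi; Pass)\n\n"
    "I don't know. (A1 BREAKTHROUGH; 2009; French; Pass)"
)
print(clean_examples(raw))
```

Anchoring the pattern at the end of the line leaves parentheses inside a sentence untouched.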


Let's assemble the prompt.

In [14]:
construction = egp_samples.iloc[5]
num_examples = 20
prompt = f'Create an enumerated list of {num_examples} more examples for grammatical construction on CEFR level {construction["Level"]} in the category "{construction["SuperCategory"]}: {construction["SubCategory"]}" with guideword "{construction["guideword"]}" and the rule: "{construction["Can-do statement"]}"\n\nHere are examples:\n\n{construction["Example"]}'

print(prompt)

Create an enumerated list of 20 more examples for grammatical construction on CEFR level C2 in the category "VERBS: phrasal-prepositional" with guideword "FORM: VERB + DIRECT OBJECT + PARTICLE + PREPOSITION + OBJECT" and the rule: "Can use a direct object with some prepositional verbs as well as an object of the preposition."

Here are examples:

It is widely accepted that being a loyal and trusted individual means that the people around us can trust us or let us in on a secret.


Let's add the prompt as a column to the dataframe:

In [25]:
def get_prompt(construction):
    return f'Create an enumerated list of {num_examples} more examples for grammatical construction on CEFR level {construction["Level"]} in the category "{construction["SuperCategory"]}: {construction["SubCategory"]}" with guideword "{construction["guideword"]}" and the rule: "{construction["Can-do statement"]}"\n\nHere are examples:\n\n{construction["Example"]}'

egp_samples['prompt'] = egp_samples.apply(get_prompt, axis=1)

egp_samples['prompt'].iloc[0]

'Create an enumerated list of 20 more examples for grammatical construction on CEFR level A1 in the category "CLAUSES: declarative" with guideword "FORM: NEGATIVE DECLARATIVE, LEXICAL VERBS" and the rule: "Can form simple negative declarative clauses."\n\nHere are examples:\n\nI don\'t like maths. \n\nI don\'t have children. \n\nI don\'t know.'
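
`get_prompt` reads `num_examples` from the notebook's global scope, so changing that variable silently changes every later prompt. A sketch of a variant that takes the count as a parameter is shown below. `build_prompt` and the one-row `toy` frame are illustrative names; the prompt wording is copied from the cell above.

```python
import pandas as pd

def build_prompt(construction, num_examples=20):
    """Same wording as the notebook prompt, with the count passed explicitly."""
    return (
        f'Create an enumerated list of {num_examples} more examples for '
        f'grammatical construction on CEFR level {construction["Level"]} '
        f'in the category "{construction["SuperCategory"]}: '
        f'{construction["SubCategory"]}" with guideword '
        f'"{construction["guideword"]}" and the rule: '
        f'"{construction["Can-do statement"]}"'
        f'\n\nHere are examples:\n\n{construction["Example"]}'
    )

toy = pd.DataFrame([{
    "Level": "A1",
    "SuperCategory": "CLAUSES",
    "SubCategory": "declarative",
    "guideword": "FORM: NEGATIVE DECLARATIVE, LEXICAL VERBS",
    "Can-do statement": "Can form simple negative declarative clauses.",
    "Example": "I don't know.",
}])

# Extra keyword arguments to DataFrame.apply are forwarded to the function.
prompts = toy.apply(build_prompt, axis=1, num_examples=5)
print(prompts.iloc[0])
```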

Now let's access the OpenAI ChatCompletion API. Note that `openai.ChatCompletion` is the interface of `openai` package versions before 1.0.

In [8]:
import openai
import sys
import os

sys.path.append(os.path.dirname(os.getcwd()))
import config
openai.api_key = config.OPENAI_API_KEY

In [17]:
response = openai.ChatCompletion.create(
    model=config.OPENAI_MODEL,
    messages=[
        {"role": "system", "content": "You are an English as a foreign language teacher who is knowledgeable about grammar."},
        {"role": "user", "content": prompt}
    ]
)

In [20]:
msg_content = response.choices[0].message.content

In [19]:
# Prepend a dummy preamble to check how the parsing below copes with it.
msg_content = "Preamble bla bla bla. " + msg_content

In [29]:
import re

In [30]:
lines = msg_content.split('\n')
[re.sub(r'^\d+\.\s*', '', line).strip() for line in lines]

["I can't believe she called off the meeting with just a few hours' notice.",
 'The manager deals with customer complaints by listening carefully and following up promptly.',
 'We need to get rid of these old files to make space for new ones.',
 "The students handed in their assignments to the professor's assistant.",
 'The chef cooked up a delicious meal for the guests.',
 "The teacher pointed out the mistakes in the students' essays.",
 'He put off making a decision until he had gathered more information.',
 'The team carried out an experiment to test their hypothesis.',
 'I picked up the book from the library and returned it on time.',
 'She turned down the invitation because she already had plans.',
 'We ran out of milk, so I went to the store to buy some more.',
 "They're looking for volunteers to help out at the event.",
 'The doctor advised the patient to cut out sugary drinks from their diet.',
 'The police broke up the fight before it escalated further.',
 'He took on the resp
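
The list comprehension above keeps every line of the reply, so blank lines and any preamble or closing remark end up among the examples. A minimal sketch of a stricter parser that keeps only enumerated lines is given below. The `parse_enumerated` helper and the sample `reply` are made up for illustration, and the pattern assumes items are numbered as `1.` or `1)`.

```python
import re

# A line counts as a list item if it starts with digits followed by '.' or ')'.
ENUMERATED = re.compile(r"^\s*(\d+)[.)]\s+(.*\S)\s*$")

def parse_enumerated(text):
    """Return only the items of a numbered list, ignoring all other lines."""
    items = []
    for line in text.splitlines():
        match = ENUMERATED.match(line)
        if match:
            items.append(match.group(2))
    return items

reply = (
    "Sure, here are some examples:\n\n"
    "1. He doesn't play golf.\n"
    "2) She doesn't eat meat.\n\n"
    "Hope this helps!"
)
print(parse_enumerated(reply))
```

A reply that contains no numbered lines yields an empty list, which is easy to detect before the examples are stored.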

Now let's do this for every row in our dataframe.

In [31]:
def get_examples(construction):
    # Log the prompt so each response can be matched to its request.
    print(construction['prompt'])
    response = openai.ChatCompletion.create(
        model=config.OPENAI_MODEL,
        messages=[
            {"role": "system", "content": "You are an English as a foreign language teacher who is knowledgeable about grammar."},
            {"role": "user", "content": construction['prompt']}
        ]
    )
    msg_content = response.choices[0].message.content
    print(msg_content)
    # One entry per line of the reply, with any leading "N. " enumeration removed.
    lines = msg_content.split('\n')
    return [re.sub(r'^\d+\.\s*', '', line).strip() for line in lines]
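
Applying a function like this to every row issues one network request per construction, and a single failed request aborts the whole `apply`. One way to soften that, sketched here as an assumption rather than as part of the original pipeline, is a generic retry wrapper around any zero-argument callable. `call_with_retries` and `flaky` are hypothetical names, and no OpenAI call is made in this example.

```python
import time

def call_with_retries(fn, attempts=3, delay=1.0):
    """Call fn() up to `attempts` times, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay * (2 ** attempt))
    raise last_error

# Stand-in for an API call that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

result = call_with_retries(flaky, attempts=3, delay=0)
print(result, calls["count"])
```

In the notebook, the request could be wrapped in a `lambda` and passed as `fn`. Catching only the client's own error types would be more precise than the broad `Exception` used here.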

In [32]:
egp_samples['augmented_examples'] = egp_samples.apply(get_examples, axis=1)

egp_samples['augmented_examples']

Create an enumerated list of 20 more examples for grammatical construction on CEFR level A1 in the category "CLAUSES: declarative" with guideword "FORM: NEGATIVE DECLARATIVE, LEXICAL VERBS" and the rule: "Can form simple negative declarative clauses."

Here are examples:

I don't like maths. 

I don't have children. 

I don't know.
1. He doesn't play golf.
2. She doesn't eat meat.
3. They don't speak French.
4. We don't live in the city.
5. It doesn't rain often.
6. Mary doesn't work on weekends.
7. John doesn't drive a car.
8. The dog doesn't bark at strangers.
9. He doesn't watch TV in the evening.
10. She doesn't like spicy food.
11. They don't understand the question.
12. We don't have any bananas left.
13. The cat doesn't like to be petted.
14. He doesn't go to the gym.
15. She doesn't drink coffee.
16. They don't play musical instruments.
17. We don't go on vacation during winter.
18. It doesn't snow in this region.
19. Mary doesn't like horror movies.
20. John doesn't believe in

180     [He doesn't play golf., She doesn't eat meat.,...
653     [We went to the amusement park, which had many...
857     [My sister is staying with us this week., They...
953     [I can provide examples for the given rule and...
280     [However hard you try, you can't change the pa...
1046    [She always takes care of her younger siblings...
Name: augmented_examples, dtype: object

In [33]:
egp_samples.to_json('../dat/egpaugmented.json')
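
Since the `augmented_examples` column holds Python lists, it can be worth checking that they survive the JSON round trip. The sketch below uses a toy frame and a temporary file instead of `../dat/egpaugmented.json`; both are placeholders chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, mirroring the shape of egp_samples.
toy = pd.DataFrame(
    {"Level": ["A1", "A2"], "augmented_examples": [["a", "b"], ["c"]]},
    index=[180, 653],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_augmented.json")
    toy.to_json(path)            # default orient for a DataFrame is 'columns'
    loaded = pd.read_json(path)  # read back with the matching default orient

restored = [list(item) for item in loaded["augmented_examples"]]
print(restored)
```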