Notebook for preprocessing anli datasets.

In [1]:
import os
import json
from os.path import join as oj
import pandas as pd
from tqdm import tqdm
from collections import defaultdict

Download the raw Natural Instructions data

In [1]:
!git clone https://github.com/allenai/natural-instructions.git

fatal: destination path 'natural-instructions' already exists and is not an empty directory.
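
The `fatal` message above only means the repository was already cloned on an earlier run. A minimal sketch of a re-runnable alternative, assuming a hypothetical helper `clone_if_missing` (not part of the original notebook) and that `git` is on the PATH:

```python
import os
import subprocess

def clone_if_missing(url, dest, run=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists and is non-empty.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without touching the network.
    """
    if os.path.isdir(dest) and os.listdir(dest):
        return False  # already cloned; nothing to do
    run(['git', 'clone', url, dest], check=True)
    return True

# Example call (commented out so the cell has no side effects):
# clone_if_missing('https://github.com/allenai/natural-instructions.git',
#                  'natural-instructions')
```

The function returns `True` only when it actually attempted a clone, so a rerun of the cell stays quiet.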


The list of tasks is available here: https://github.com/allenai/natural-instructions/blob/master/tasks/README.md

**Choose some tasks**
- The ideal task for inverting has a description that isn't something generic like "Answer the true or false question."
- Examples of good tasks
  - Given a country, return its currency
  - Given an item, check if it is edible or not
- Constraints
  - The task description (or at least its keyword) should not be present in the examples (e.g. this is the case for `task109_smsspamcollection_spamsmsdetection`).
  - If it is, then rename the labels in the output.
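
The keyword constraint can be checked roughly before committing to a task. The sketch below is an assumption-laden heuristic, not the notebook's method: the helper name `leaked_keywords` and the `min_len` cutoff (used to skip short, stopword-like words) are hypothetical.

```python
import re

def leaked_keywords(description, texts, min_len=4):
    """Return the description words that also occur in any of `texts`.

    `texts` can be output labels, inputs, or both. Words shorter than
    `min_len` are ignored as a crude stopword filter (an assumption).
    """
    desc_words = {
        w for w in re.findall(r"[a-z]+", description.lower())
        if len(w) >= min_len
    }
    text_words = set()
    for text in texts:
        text_words.update(re.findall(r"[a-z]+", str(text).lower()))
    return sorted(desc_words & text_words)

# Illustration with the vegetarian-dish task description used in this notebook
desc = 'Return whether the input food dish is vegetarian (yes or no).'
print(leaked_keywords(desc, ['vegetarian', 'non vegetarian']))  # ['vegetarian']
print(leaked_keywords(desc, ['yes', 'no']))                     # []
```

A non-empty result suggests the labels should be renamed, as done with the `output_remap` entries in the next cell.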

In [7]:
TASK_NAMES = [
    # cleanest tasks ############
    'task1146_country_capital',
    'task1509_evalution_antonyms', # Given a word generate its antonym	
    'task1147_country_currency',
    'task1149_item_check_edible',
    'task183_rhyme_generation', # Given an input word, generate a list of words that rhyme exactly with the input.	
    'task1191_food_veg_nonveg', # Given the name of an indian dish, classify it as non vegetarian or a vegetarian dish	

    # overlaps with arithmetic ############
    'task092_check_prime_classification', # Identify whether the number is prime or not.	    

    # a little repetitive ############
    # 'task1321_country_continent', # Given a country name, return the continent name of the given country
        
    # phrasing a little long ############
    'task088_identify_typo_verification', # Identify the typo in a sentence
    # 'task609_sbic_potentially_offense_binary_classification',
    # 'task063_first_i_elements',	# Given a list return the first i elements of the list
    # 'task064_all_elements_except_first_i', # Given a list return all the elements of the list except the first i elements	
    # 'task123_conala_sort_dictionary', # Sort a list of dictionaries based on a given key.	

    # artificially easy (answers give some hint of the keyword) ############
    'task1336_peixian_equity_evaluation_corpus_gender_classifier', # Classifying gender of speaker of sentence	   
    'task107_splash_question_to_sql', # Generate an SQL statement from a question asking for certain data.
    # 'task1506_celebrity_minimal_dob_span', # Find the date of birth of a celebrity given a sentence bio	
    # 'task114_is_the_given_word_longest', # Identify whether the word is the longest in the sentence. -- note: the phrasing contains the task descr in the middle

    # task is very hard to infer ############
    # 'task430_senteval_subject_count',
    # 'task429_senteval_tense',
]
TASK_MODIFICATIONS = {
    # This one just had unnatural phrasing
    'task1149_item_check_edible': {
        'description': 'Return whether the input item is edible (yes or no).',
        'output_remap': {
            '1': 'yes',
            '2': 'no',
        }
    },

    # Fix examples that give away the answer
    'task1191_food_veg_nonveg': {
        'description': 'Return whether the input food dish is vegetarian (yes or no).',
        'output_remap': {
            'vegetarian': 'yes',
            'non vegetarian': 'no',
        }    
    },
    'task1336_peixian_equity_evaluation_corpus_gender_classifier': {
        'description': 'Return the gender of the person in the input sentence.',
        'output_remap': {
            'male': 'M',
            'female': 'F',
        }    
    },    
}
task_defs = {}
metadata = defaultdict(list)
nli_tasks_dir = '/home/chansingh/interpretable-autoprompting/data_utils/natural-instructions/tasks'
out_dir = '/home/chansingh/interpretable-autoprompting/data_utils/anli_processed'
os.makedirs(out_dir, exist_ok=True)

# read in task brief
tabs = pd.read_html('https://github.com/allenai/natural-instructions/blob/master/tasks/README.md')
tab = tabs[0]
task_defs_brief = {}
for task_name in sorted(TASK_NAMES):

    # brief task descr
    row = tab[tab.Name == task_name]
    # print(row)
    task_defs_brief[task_name] = row.Summary.values[0]

    task_json_file = oj(nli_tasks_dir, task_name + '.json')
    with open(task_json_file, 'r') as f:
        task = json.load(f)
    task_def = task['Definition'][0]
    df = pd.DataFrame.from_dict(task['Instances'])
    df = df.drop(columns='id')
    df['output'] = df['output'].apply(lambda x: x[0])

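    # apply per-task overrides (cleaner description, renamed labels)
    if task_name in TASK_MODIFICATIONS:
        mod = TASK_MODIFICATIONS[task_name]
        task_def = mod['description']
        if 'output_remap' in mod:
            # keep labels missing from the remap unchanged instead of turning them into NaN
            remap = mod['output_remap']
            df['output'] = df['output'].map(lambda x: remap.get(x, x))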

    
    df['text'] = 'Input: ' + df['input'] + ' Answer: ' + df['output'] + '\n'
    print(task_name, '\n' + task_def)
    print('brief:', task_defs_brief[task_name])
    print(df.iloc[0].text)
    print(df['output'].value_counts())
    task_defs[task_name] = task_def
    df.to_csv(oj(out_dir, task_name + '.csv'), index=False)

    
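# save full and brief task descriptions; `with` closes the files so the JSON is flushed
with open(oj(out_dir, 'task_defs.json'), 'w') as f:
    json.dump(task_defs, f, indent=4)
with open(oj(out_dir, 'task_defs_brief.json'), 'w') as f:
    json.dump(task_defs_brief, f, indent=4)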

task088_identify_typo_verification 
The given sentence contains a typo which could be one of the following four types: (1) swapped letters of a word e.g. 'niec' is a typo of the word 'nice'. (2) missing letter in a word e.g. 'nic' is a typo of the word 'nice'. (3) extra letter in a word e.g. 'nicce' is a typo of the word 'nice'. (4) replaced letter in a word e.g 'nicr' is a typo of the word 'nice'. You need to identify the typo in the given sentence. To do this, answer with the word containing the typo.
brief: Identify the typo in a sentence.
Input: The karge motorcycle has been painted white and light blue. Answer: karge

ma          108
mabn        105
mab          90
None         76
mna          74
           ... 
Lugfage,      1
Mulitple      1
Scebne        1
kier          1
shrtless      1
Name: output, Length: 3046, dtype: int64
task092_check_prime_classification 
In this task, you need to output 'Yes' if the given number is a prime number otherwise output 'No'. A 'prime number'