Notebook for preprocessing anli datasets.

In [2]:
import os
import json
from os.path import join as oj
import pandas as pd
from tqdm import tqdm
from collections import defaultdict

Download the raw NLI data

In [1]:
!git clone https://github.com/allenai/natural-instructions.git

fatal: destination path 'natural-instructions' already exists and is not an empty directory.


The list of tasks is available here: https://github.com/allenai/natural-instructions/blob/master/tasks/README.md

**Choose some tasks**
- The ideal task for inverting has a specific description, not something generic like "Answer the true or false question."
- Examples of nice tasks
  - Given a country, return its currency
  - Given an item, check if it is edible or not
- Constraints
  - The task description (or at least its keyword) should not be present in the examples (e.g. this is the case for `task109_smsspamcollection_spamsmsdetection`).
  - If it is, rename the labels in the output.
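
The constraint above can be checked mechanically. Below is a minimal sketch, assuming a task has been loaded into a DataFrame with `input` and `output` columns. `keyword_leaks` and `rename_labels` are hypothetical helper names (not part of natural-instructions), and the toy rows are made up for illustration.

```python
import pandas as pd

def keyword_leaks(df, keywords):
    """Return the fraction of examples whose input or output contains any keyword (case-insensitive)."""
    keywords = [k.lower() for k in keywords]

    def has_keyword(s):
        s = str(s).lower()
        return any(k in s for k in keywords)

    leaked = df['input'].apply(has_keyword) | df['output'].apply(has_keyword)
    return float(leaked.mean())

def rename_labels(df, mapping):
    """Return a copy of df whose output labels are replaced according to mapping."""
    out = df.copy()
    # labels not listed in the mapping are left unchanged
    out['output'] = out['output'].map(lambda x: mapping.get(x, x))
    return out

# Toy rows (hypothetical) in which the label itself reveals the task keyword
toy = pd.DataFrame({
    'input': ['Win a free prize now', 'See you at lunch'],
    'output': ['spam', 'ham'],
})
frac_before = keyword_leaks(toy, ['spam'])
toy_renamed = rename_labels(toy, {'spam': 'Yes', 'ham': 'No'})
frac_after = keyword_leaks(toy_renamed, ['spam'])
print(frac_before, frac_after)
```

A nonzero fraction suggests the task either needs its labels renamed or should be left out of the task list.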

In [7]:
task_names = [
    # nicest tasks
    'task1146_country_capital',
    'task1147_country_currency',
    'task1321_country_continent', # Given a country name, return the continent name of the given country
    'task1149_item_check_edible',
    'task429_senteval_tense',
    'task430_senteval_subject_count',
    'task609_sbic_potentially_offense_binary_classification',
    'task063_first_i_elements', # Given a list return the first i elements of the list
    'task064_all_elements_except_first_i', # Given a list return all the elements of the list except the first i elements
    'task088_identify_typo_verification', # Identify the typo in a sentence
    'task092_check_prime_classification', # Identify whether the number is prime or not.
    'task107_splash_question_to_sql', # Generate an SQL statement from a question asking for certain data.
    'task114_is_the_given_word_longest', # Identify whether the word is the longest in the sentence.
    'task123_conala_sort_dictionary', # Sort a list of dictionaries based on a given key.
    'task183_rhyme_generation', # Given an input word, generate a list of words that rhyme exactly with the input.
    'task1191_food_veg_nonveg', # Given the name of an Indian dish, classify it as non vegetarian or a vegetarian dish
    'task1336_peixian_equity_evaluation_corpus_gender_classifier', # Classifying gender of speaker of sentence
    'task1509_evalution_antonyms', # Given a word generate its antonym

    # a little repetitive


    # artificially easy (answers give some hint of the keyword)
    'task1506_celebrity_minimal_dob_span', # Find the date of birth of a celebrity given a sentence bio	
]
task_defs = {}
metadata = defaultdict(list)
nli_tasks_dir = '/home/chansingh/interpretable-autoprompting/data_utils/natural-instructions/tasks'
out_dir = '/home/chansingh/interpretable-autoprompting/data_utils/anli_processed'
os.makedirs(out_dir, exist_ok=True)
for task_name in tqdm(task_names):
    # load the raw task file
    task_json_file = oj(nli_tasks_dir, task_name + '.json')
    with open(task_json_file, 'r') as f:
        task = json.load(f)
    task_def = task['Definition'][0]

    # one row per instance; keep only the first reference output
    df = pd.DataFrame.from_dict(task['Instances'])
    df = df.drop(columns='id')
    df['output'] = df['output'].apply(lambda x: x[0])
    df['text'] = 'Input: ' + df['input'] + ' Answer: ' + df['output'] + '\n'
    print(task_name, '\n' + task_def + '\n')
    task_defs[task_name] = task_def
    df.to_csv(oj(out_dir, task_name + '.csv'), index=False)

# save all task definitions in one file
with open(oj(out_dir, 'task_defs.json'), 'w') as f:
    json.dump(task_defs, f, indent=4)

 37%|███▋      | 7/19 [00:00<00:00, 69.60it/s]

task1146_country_capital 
In this task, you are given a country name and you need to return the capital city of the given country

task1147_country_currency 
You are given a country name and you need to return the currency of the given country.

task1321_country_continent 
In this task, you are given a country name and you need to return the continent to which the country belongs.

task1149_item_check_edible 
In this task, you are given an item and you need to check whether it is edible or not, return 1 if it is edible, else return 2.

task429_senteval_tense 
In this task you are given a sentence. You must judge whether the main verb of the sentence is in present or past tense. Label the instances as "Present" or "Past" based on your judgment. If there is no verb in the given text, answer "Present".

task430_senteval_subject_count 
In this task you are given a sentence. You must judge whether subject of the main clause is singular or plural. Label the instances as "Singular" or "Plural

100%|██████████| 19/19 [00:00<00:00, 39.83it/s]

task183_rhyme_generation 
Given an input word generate a word that rhymes exactly with the input word. If not rhyme is found return "No"

task1191_food_veg_nonveg 
In this task, you are given the name of an Indian food dish. You need to return whether the dish is "non vegetarian" or "vegetarian". Do not answer with any words other than those two.

task1336_peixian_equity_evaluation_corpus_gender_classifier 
You will be given a sentence containing a pronoun/person name and an emotion. From these implicit parameters, the main goal is to find the gender of the person (male / female).

task1509_evalution_antonyms 
In this task, you are given an adjective, and your job is to generate its antonym. An antonym of a word is a word opposite in meaning to it.

task1506_celebrity_minimal_dob_span 
Given a short bio of a person, find the minimal text span containing the date of birth of the person. The output must be the minimal text span that contains the birth date, month and year as long as they
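
The loop above relies on a particular layout of each task file: a `Definition` list whose first entry is the task description, and an `Instances` list of records with `id`, `input`, and a list-valued `output`. Below is a minimal sketch of the same transformation on an in-memory task. The records are made up for illustration, and `task_to_df` is a hypothetical helper name.

```python
import pandas as pd

def task_to_df(task):
    """Flatten a task dict into a DataFrame with input, output, and text columns."""
    df = pd.DataFrame(task['Instances'])
    df = df.drop(columns='id')
    # each instance lists its reference outputs; keep only the first one
    df['output'] = df['output'].apply(lambda x: x[0])
    df['text'] = 'Input: ' + df['input'] + ' Answer: ' + df['output'] + '\n'
    return df

# Made-up task in the assumed layout
toy_task = {
    'Definition': ['Given a country name, return its capital city.'],
    'Instances': [
        {'id': 'toy-0', 'input': 'France', 'output': ['Paris']},
        {'id': 'toy-1', 'input': 'Japan', 'output': ['Tokyo']},
    ],
}
toy_df = task_to_df(toy_task)
print(toy_task['Definition'][0])
print(toy_df['text'].tolist())
```

Note that any additional reference outputs are discarded, so tasks with several valid answers per input keep only one of them.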




In [7]:
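def check_text(s):
    """Return True if `s` is a non-empty string."""
    return isinstance(s, str) and len(s) > 0

# sanity-check the last processed task: every cell must be a non-empty string
assert df['text'].apply(check_text).all()
assert df['output'].apply(check_text).all()
df['output']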

0        No
1       Yes
2        No
3       Yes
4        No
       ... 
2941    Yes
2942     No
2943    Yes
2944     No
2945    Yes
Name: output, Length: 2946, dtype: object
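
The check above only inspects the `df` left over from the last loop iteration. Below is a sketch of applying the same check to every saved CSV, assuming the files have `text` and `output` columns as written by the preprocessing loop. `validate_dir` is a hypothetical helper, and the demo writes toy files to a temporary directory, not the real `out_dir`. Reading with `dtype=str` and `keep_default_na=False` is a deliberate choice: without them pandas would parse labels such as `1`/`2` as integers and empty cells as NaN, so the `isinstance(s, str)` test would fail for the wrong reason.

```python
import glob
import os
import tempfile

import pandas as pd

def check_text(s):
    """Return True if `s` is a non-empty string."""
    return isinstance(s, str) and len(s) > 0

def validate_dir(csv_dir):
    """Map each CSV file name to True if all of its text and output cells are non-empty strings."""
    results = {}
    for path in sorted(glob.glob(os.path.join(csv_dir, '*.csv'))):
        # read every cell as a raw string so numeric-looking labels stay strings
        df = pd.read_csv(path, dtype=str, keep_default_na=False)
        ok = df['text'].apply(check_text).all() and df['output'].apply(check_text).all()
        results[os.path.basename(path)] = bool(ok)
    return results

# Demo on two toy files (made-up rows): one valid, one with an empty output
with tempfile.TemporaryDirectory() as tmp:
    good = pd.DataFrame({'input': ['apple'], 'output': ['1'],
                         'text': ['Input: apple Answer: 1\n']})
    bad = pd.DataFrame({'input': ['rock'], 'output': [''],
                        'text': ['Input: rock Answer: \n']})
    good.to_csv(os.path.join(tmp, 'toy_good.csv'), index=False)
    bad.to_csv(os.path.join(tmp, 'toy_bad.csv'), index=False)
    results = validate_dir(tmp)
print(results)
```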

# metadata

In [11]:
tabs = pd.read_html('https://github.com/allenai/natural-instructions/blob/master/tasks/README.md')
tab = tabs[0]

In [22]:
task_defs_brief = {}
for task_name in task_names:
    # select the README table row for this task and keep its one-line summary
    row = tab[tab.Name == task_name]
    task_defs_brief[task_name] = row.Summary.values[0]
with open(oj(out_dir, 'task_defs_brief.json'), 'w') as f:
    json.dump(task_defs_brief, f, indent=4)
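
The lookup above raises an `IndexError` when a task name is missing from the README table. Below is a more defensive sketch, assuming the table has `Name` and `Summary` columns as used above. `lookup_summaries` is a hypothetical helper, and the toy table stands in for the real one.

```python
import pandas as pd

def lookup_summaries(tab, task_names):
    """Return (found, missing): summaries for names in the table, and the names that are absent."""
    # if a name appears in several rows, the last row wins
    summary_by_name = dict(zip(tab['Name'], tab['Summary']))
    found = {name: summary_by_name[name] for name in task_names if name in summary_by_name}
    missing = [name for name in task_names if name not in summary_by_name]
    return found, missing

# Toy table (made-up rows) with the assumed column names
toy_tab = pd.DataFrame({
    'Name': ['task_a', 'task_b'],
    'Summary': ['First toy summary', 'Second toy summary'],
})
found, missing = lookup_summaries(toy_tab, ['task_a', 'task_c'])
print(found, missing)
```

Collecting the missing names instead of failing on the first one makes it easier to spot renamed or removed tasks in a single pass.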