# Example: Build an Extractive QA Dataset

# 1. Without Tokenization

1. Create a `SquadGuru` object. Think of the guru as an NLP expert: it can `gather` the complex SQuAD JSON dataset and organize it into a form that is usable in an NLP task.

   - Constructor Signature

     ```python
     SquadGuru(parser: SquadParser,     # a parser that implements SquadParser
               tokenizer=None,          # optional tokenizer that implements .tokenize(text: str)
               tags=SQUAD_TAGS,         # iterable of str
               versions=SQUAD_VERSIONS  # iterable of float
     )
     ```

   - Inject a `parser`, which the guru uses to extract the X and Y data from the original SQuAD dataset.
   - Inject a `tokenizer`, which is used to create tokenized X and Y.
   - Inject an iterable of `tags` naming the dataset splits to load.
   - Inject an iterable of `versions` naming the dataset versions to load. Each version is a `float`.


   Here we use `ExtractiveQAParser`. To create an instance of it, call the static factory method: `SquadParser.from_nlp_task('EXT_QA')`

In [1]:
from prep_squad.guru import SquadGuru
from prep_squad.parser import SquadParser

squad_parser = SquadParser.from_nlp_task('EXT_QA')
guru = SquadGuru(squad_parser, tags=['dev'], versions=(1.1, 2.0))
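
The static factory call used above can be pictured with the following sketch. It is a minimal illustration of the pattern, not the library's actual implementation: the registry dict, the `register` decorator, the `parse` method, and the toy parser body are all assumptions made for this example. Only the names `SquadParser`, `ExtractiveQAParser`, and `from_nlp_task('EXT_QA')` come from the text.

```python
from abc import ABC, abstractmethod


class SquadParser(ABC):
    """Hypothetical stand-in for the library's parser base class."""

    # Assumed registry that maps task names to parser classes.
    _registry = {}

    @classmethod
    def register(cls, task_name):
        """Class decorator that records a parser class under a task name."""
        def decorator(parser_cls):
            cls._registry[task_name] = parser_cls
            return parser_cls
        return decorator

    @staticmethod
    def from_nlp_task(task_name):
        """Static factory: return a parser instance for the given task name."""
        try:
            parser_cls = SquadParser._registry[task_name]
        except KeyError:
            raise ValueError(f"unknown NLP task: {task_name!r}") from None
        return parser_cls()

    @abstractmethod
    def parse(self, record):
        ...


@SquadParser.register("EXT_QA")
class ExtractiveQAParser(SquadParser):
    def parse(self, record):
        # Toy behaviour for illustration only.
        return record


parser = SquadParser.from_nlp_task("EXT_QA")
print(type(parser).__name__)  # ExtractiveQAParser
```

The caller only needs to know the task name; which concrete class gets instantiated stays an internal detail of the factory.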

2. Call `squadGuru.gather()` to have the guru extract X and Y and keep them in memory.

   - Method Signature

     ```python
     squadGuru.gather(only_first_answer=False,
                      verbose=False)
     ```

     - Set `only_first_answer` to `True` to extract only the first answer from each question's set of answers.
     - Set `verbose` to `True` to print progress logs.

In [2]:
guru.gather(only_first_answer=True, verbose=True)

SQuAD-v1.1 dev dataset has been parsed.
SQuAD-v2.0 dev dataset has been parsed.
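
To make the effect of `only_first_answer` concrete, here is a rough sketch on a tiny hand-written record in the public SQuAD JSON layout (`data`, then `paragraphs`, then `qas`, then `answers`). The helper `iter_examples` is hypothetical and is not the library's code, and the record content is made up for illustration.

```python
# Tiny record in the SQuAD JSON layout; the content is invented for this example.
context = "The Broncos won. Denver Broncos defeated the Panthers."
squad = {
    "data": [{
        "paragraphs": [{
            "context": context,
            "qas": [{
                "question": "Who won?",
                "answers": [
                    {"text": "Broncos",
                     "answer_start": context.index("Broncos")},
                    {"text": "Denver Broncos",
                     "answer_start": context.index("Denver Broncos")},
                ],
            }],
        }],
    }],
}


def iter_examples(squad_json, only_first_answer=False):
    """Yield (context, question, answer_start, answer_text) tuples."""
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa["answers"]
                if only_first_answer:
                    # Keep at most one answer per question.
                    answers = answers[:1]
                for answer in answers:
                    yield (paragraph["context"], qa["question"],
                           answer["answer_start"], answer["text"])


all_rows = list(iter_examples(squad))
first_rows = list(iter_examples(squad, only_first_answer=True))
print(len(all_rows), len(first_rows))  # 2 1
```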


## Get DataFrame Table

In [3]:
guru.to_dataframe()

| | Input | Target |
| --- | --- | --- |
| 0 | Super Bowl 50 was an American football game to... | (177, 14) |
| 1 | Super Bowl 50 was an American football game to... | (249, 17) |
| 2 | Super Bowl 50 was an American football game to... | (403, 23) |
| 3 | Super Bowl 50 was an American football game to... | (177, 14) |
| 4 | Super Bowl 50 was an American football game to... | (488, 4) |
| ... | ... | ... |
| 16493 | The pound-force has a metric counterpart, less... | (82, 14) |
| 16494 | The pound-force has a metric counterpart, less... | (114, 8) |
| 16495 | The pound-force has a metric counterpart, less... | (274, 4) |
| 16496 | The pound-force has a metric counterpart, less... | (712, 3) |
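
Judging from the sample output, each `Target` looks like a `(start, length)` pair of character offsets into the `Input` string. That reading is an inference from the values shown here, not something this document states. A quick check on the opening sentences of the first passage (the helper `extract_span` is a hypothetical name):

```python
# Opening sentences of the first passage in the dev set.
passage = (
    "Super Bowl 50 was an American football game to determine the champion "
    "of the National Football League (NFL) for the 2015 season. "
    "The American Football Conference (AFC) champion Denver Broncos defeated "
    "the National Football Conference (NFC) champion Carolina Panthers"
)


def extract_span(text, target):
    """Slice text using an assumed (start, length) target."""
    start, length = target
    return text[start:start + length]


print(extract_span(passage, (177, 14)))  # Denver Broncos
print(extract_span(passage, (249, 17)))  # Carolina Panthers
```

Both slices line up with the answers to the first two questions, which supports the `(start, length)` interpretation for at least these rows.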


## Get NumPy Array

In [4]:
guru.to_numpy()

array([['Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50. [SEP] Which NFL team represented the AFC at Super Bowl 50?',
        (177, 14)],
       ['Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 sea

## Save to files

In [5]:
# first path: passage-question inputs
# second path: answers (targets)
guru.to_file('examples/extqa.pq.example.txt', 'examples/extqa.a.example.txt')

# 2. Applying BERT Tokenization

1. Create a `SquadGuru` object as before, this time passing it a pretrained `BertTokenizer`.

In [6]:
from prep_squad.guru import SquadGuru
from prep_squad.parser import SquadParser
from transformers import BertTokenizer

task = SquadParser.from_nlp_task('EXT_QA')
tok = BertTokenizer.from_pretrained('bert-large-cased')
guru = SquadGuru(task, tok, tags=['dev'], versions=(1.1, 2.0))

In [7]:
guru.gather(only_first_answer=True, verbose=True)

SQuAD-v1.1 dev dataset has been parsed.
SQuAD-v2.0 dev dataset has been parsed.


In [8]:
guru.to_dataframe()

| | Input | Target |
| --- | --- | --- |
| 0 | Super Bowl 50 was an American football game to... | (182, 14) |
| 1 | Super Bowl 50 was an American football game to... | (256, 17) |
| 2 | Super Bowl 50 was an American football game to... | (417, 24) |
| 3 | Super Bowl 50 was an American football game to... | (182, 14) |
| 4 | Super Bowl 50 was an American football game to... | (506, 4) |
| ... | ... | ... |
| 16493 | The pound - force has a metric counterpart , l... | (89, 22) |
| 16494 | The pound - force has a metric counterpart , l... | (135, 17) |
| 16495 | The pound - force has a metric counterpart , l... | (329, 7) |
| 16496 | The pound - force has a metric counterpart , l... | (830, 6) |
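
The offsets differ from the untokenized run (177 becomes 182 in the first row) because the tokens are joined with spaces, so `(NFL)` turns into `( NFL )` and every split adds a character. The sketch below reproduces that shift with a simple regex tokenizer standing in for `BertTokenizer`. `RegexTokenizer` and `retarget` are hypothetical names, the regex does not perform WordPiece splitting, and how the library actually recomputes targets is not shown in this document.

```python
import re


class RegexTokenizer:
    """Toy tokenizer exposing .tokenize(text), the interface the guru expects."""

    def tokenize(self, text):
        # Words and single punctuation marks become separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)


def retarget(passage, target, tokenizer):
    """Map an assumed (start, length) target onto the space-joined tokens.

    Assumes the answer begins and ends on token boundaries.
    """
    start, length = target
    join = lambda text: " ".join(tokenizer.tokenize(text))
    tok_prefix = join(passage[:start])
    tok_answer = join(passage[start:start + length])
    # One joining space separates the prefix from the answer, if a prefix exists.
    tok_start = len(tok_prefix) + 1 if tok_prefix else 0
    return join(passage), (tok_start, len(tok_answer))


passage = (
    "Super Bowl 50 was an American football game to determine the champion "
    "of the National Football League (NFL) for the 2015 season. "
    "The American Football Conference (AFC) champion Denver Broncos defeated "
    "the National Football Conference (NFC) champion Carolina Panthers"
)
tok_passage, tok_target = retarget(passage, (177, 14), RegexTokenizer())
print(tok_target)  # (182, 14)
```

With a real WordPiece tokenizer the answer length can change as well, since rare words are split into several `##` pieces, which is consistent with the longer lengths in the last rows of the table.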


In [9]:
guru.to_numpy()

array([['Super Bowl 50 was an American football game to determine the champion of the National Football League ( NFL ) for the 2015 season . The American Football Conference ( AFC ) champion Denver Broncos defeated the National Football Conference ( NFC ) champion Carolina Panthers 24 – 10 to earn their third Super Bowl title . The game was played on February 7 , 2016 , at Levi \' s Stadium in the San Francisco Bay Area at Santa Clara , California . As this was the 50th Super Bowl , the league emphasized the " golden anniversary " with various gold - themed initiatives , as well as temporarily su ##sp ##ending the tradition of naming each Super Bowl game with Roman n ##ume ##rals ( under which the game would have been known as " Super Bowl L " ) , so that the logo could prominently feature the Arabic n ##ume ##rals 50 . [SEP] Which NFL team represented the AFC at Super Bowl 50 ?',
        (182, 14)],
       ['Super Bowl 50 was an American football game to determine the champion of the 

In [10]:
# first path: tokenized passage-question inputs
# second path: answers (targets)
guru.to_file('examples/extqa.pq.tok.example.txt', 'examples/extqa.a.tok.example.txt')