# Example) Training Vectorizers with SQuAD corpus

1. Create a `SquadGuru` object, which acts as an NLP expert. Let it `gather` the complex SQuAD JSON dataset and organize it into a form usable in an NLP task.

   - Constructor Signature

     ```python
     SquadGuru(parser: SquadParser,     # parser that implements SquadParser
               tokenizer=None,          # tokenizer that implements .tokenize(text: str)
               tags=SQUAD_TAGS,         # iterable of str
               versions=SQUAD_VERSIONS  # iterable of float
     )
     ```

   - Inject a `parser`, which the guru will use to extract X and Y data from the original SQuAD dataset.
   - Inject a `tokenizer` that will be used to create tokenized X and Y.
   - Inject an iterable of `tags` that describes the tags of the dataset to load.
   - Inject an iterable of `versions` that describes the versions of the dataset to load. Each version is a `float` (for example `1.1` or `2.0`).


   Here we use `CorpusParser`. To create an instance, call the static factory method: `SquadParser.from_nlp_task('CORPUS')`

In [1]:
from prep_squad import SquadGuru
from prep_squad import SquadParser

squad_parser = SquadParser.from_nlp_task('CORPUS')
guru = SquadGuru(squad_parser)

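The constructor only requires that a `tokenizer` expose `.tokenize(text: str)`. As a minimal sketch of an object satisfying that interface, the hypothetical `SimpleTokenizer` below lowercases the text and splits it into word tokens with a regular expression. The class name, the tokenization rule, and the list-of-strings return type are assumptions for illustration and are not part of `prep_squad`. Under those assumptions, it could be passed as `SquadGuru(squad_parser, tokenizer=SimpleTokenizer())`.

```python
import re


class SimpleTokenizer:
    """Hypothetical tokenizer exposing the .tokenize(text) interface."""

    _word = re.compile(r"\w+")

    def tokenize(self, text: str) -> list:
        # Lowercase, then keep runs of letters, digits, and underscores.
        return self._word.findall(text.lower())


tokens = SimpleTokenizer().tokenize("The Main Building's gold dome")
print(tokens)  # ['the', 'main', 'building', 's', 'gold', 'dome']
```
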
2. Use `squadGuru.gather()` to let the guru remember extracted X and Y.

   - Method Signature

     ```python
     squadGuru.gather(only_first_answer=False,
                      verbose=False)
     ```
     
   - Set `only_first_answer` to `True` to extract only the first answer from each question's set of answers.
   - Set `verbose` to `True` to print progress logs.

In [2]:
guru.gather(only_first_answer=True, verbose=True)

SQuAD-v1.1 train dataset has been parsed.
SQuAD-v2.0 train dataset has been parsed.
SQuAD-v1.1 dev dataset has been parsed.
SQuAD-v2.0 dev dataset has been parsed.

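To illustrate what `only_first_answer` refers to: in the public SQuAD JSON layout, each passage (`context`) carries a list of `qas`, and each question carries a list of `answers`. The sketch below walks a tiny hand-made sample in that layout and keeps either every answer or only the first one per question. The helper `extract_pairs` and the sample data are hypothetical, and this is not necessarily how `prep_squad` handles the data (in particular, its treatment of unanswerable questions is not shown in this example).

```python
# Hand-made sample following the SQuAD JSON layout (not real SQuAD data).
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "The Grotto is a Marian place of prayer.",
            "qas": [
                {"id": "q1", "question": "What is the Grotto?",
                 "answers": [
                     {"text": "a Marian place of prayer", "answer_start": 14},
                     {"text": "place of prayer", "answer_start": 23},
                 ]},
                {"id": "q2", "question": "Who built it?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}


def extract_pairs(squad, only_first_answer=False):
    """Return (questions, answers); answers[i] lists the answer texts of questions[i]."""
    questions, answers = [], []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                texts = [answer["text"] for answer in qa["answers"]]
                if only_first_answer:
                    texts = texts[:1]  # an unanswerable question keeps an empty list
                questions.append(qa["question"])
                answers.append(texts)
    return questions, answers


questions, all_answers = extract_pairs(sample)
_, first_answers = extract_pairs(sample, only_first_answer=True)
print(all_answers)    # [['a Marian place of prayer', 'place of prayer'], []]
print(first_answers)  # [['a Marian place of prayer'], []]
```
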

## Getting the SQuAD Corpus
- Because ground-truth labels are not needed, just use `guru.X` to get the corpus from SQuAD datasets.
- It is a list containing the passage texts from the SQuAD datasets.

In [3]:
corpus = guru.X

print(len(corpus))
print(corpus[0])

41199
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


## Training Tfidf Vectorizer
- Let us train `sklearn.feature_extraction.text.TfidfVectorizer` using the SQuAD corpus.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.set_params(ngram_range=(1, 3))
vectorizer = vectorizer.fit(corpus)


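As a rough guide to what fitting computes, the sketch below reimplements the weighting in plain Python on a toy corpus, assuming the defaults described in the scikit-learn documentation: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The toy documents and the helper `tfidf` are hypothetical, and tokenization, stop-word removal, and n-gram generation are left out.

```python
import math
from collections import Counter

# Toy, already-tokenized corpus (hypothetical data).
docs = [
    ["soviet", "generals", "plans"],
    ["generals", "attack"],
    ["music", "industry"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}


def tfidf(doc):
    """Return the L2-normalized TF-IDF weights of one tokenized document."""
    counts = Counter(doc)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(docs[0])
print(weights)
```

Here `generals` appears in two of the three documents, so it receives a smaller weight in the first document than `soviet` or `plans`, which each appear in only one.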

In [5]:
doc1 = """The iPod has also been credited with accelerating shifts within the music industry. The iPod's popularization of digital music storage allows users to abandon listening to entire albums and instead be able to choose specific singles which hastened the end of the Album Era in popular music.
"""

doc2 = """Soviet generals with extensive combat experience from the Second World War were sent to North Korea as the Soviet Advisory Group. These generals completed the plans for the attack by May.
"""

vectors = vectorizer.transform([doc1, doc2])

In [6]:
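# Recover the terms present in each document (one array of strings per row).
words = vectorizer.inverse_transform(vectors)

# Pair each term with its TF-IDF weight. This relies on words[doc_id] and
# vector.data listing the row's nonzero entries in the same order.
tfidfs = []
for doc_id, vector in enumerate(vectors):  # each vector is a 1 x n_features sparse row
    tfidfs.append(dict(zip(words[doc_id], vector.data)))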

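The cell above depends on the CSR (compressed sparse row) layout of the matrix returned by `transform`: `data` holds the nonzero values, `indices` holds their column ids, and `indptr[i]:indptr[i + 1]` delimits row `i`. When a single row is sliced out, its `indptr` is simply `[0, nnz]`. The sketch below reproduces that layout with plain Python lists and a hypothetical four-term vocabulary, so the slicing can be followed without scikit-learn.

```python
vocabulary = ["attack", "generals", "music", "soviet"]  # hypothetical column names

# Two rows: row 0 has nonzeros in columns 1 and 3, row 1 in column 2.
data = [0.8, 0.6, 1.0]
indices = [1, 3, 2]
indptr = [0, 2, 3]

rows = []
for i in range(len(indptr) - 1):
    start, end = indptr[i], indptr[i + 1]
    # Map each stored column id back to its term and pair it with its value.
    rows.append({vocabulary[indices[k]]: data[k] for k in range(start, end)})

print(rows)  # [{'generals': 0.8, 'soviet': 0.6}, {'music': 1.0}]
```
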
In [7]:
tfidf = tfidfs[1]

In [8]:
sorted_tfidf = sorted(tfidf.items(), key=lambda d: d[1], reverse=True)

In [9]:
sorted_tfidf

[('generals', 0.22023447373849717),
 ('world war sent', 0.16716567545813155),
 ('war sent north', 0.16716567545813155),
 ('soviet generals extensive', 0.16716567545813155),
 ('soviet generals', 0.16716567545813155),
 ('soviet advisory group', 0.16716567545813155),
 ('soviet advisory', 0.16716567545813155),
 ('sent north korea', 0.16716567545813155),
 ('plans attack', 0.16716567545813155),
 ('north korea soviet', 0.16716567545813155),
 ('korea soviet advisory', 0.16716567545813155),
 ('group generals completed', 0.16716567545813155),
 ('group generals', 0.16716567545813155),
 ('generals extensive combat', 0.16716567545813155),
 ('generals extensive', 0.16716567545813155),
 ('generals completed plans', 0.16716567545813155),
 ('generals completed', 0.16716567545813155),
 ('extensive combat experience', 0.16716567545813155),
 ('extensive combat', 0.16716567545813155),
 ('experience second world', 0.16716567545813155),
 ('experience second', 0.16716567545813155),
 ('completed plans attack',

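Entries such as `'plans attack'` come from the phrase "plans for the attack": in scikit-learn's word analyzer, stop words are dropped first and n-grams are then formed from the remaining tokens. A minimal sketch of that order of operations follows, using a tiny hypothetical stop list rather than scikit-learn's built-in English list; the helper `word_ngrams` is illustrative only.

```python
def word_ngrams(tokens, stop_words, ngram_range=(1, 3)):
    """Drop stop words, then join each run of n consecutive kept tokens."""
    kept = [t for t in tokens if t not in stop_words]
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(kept) - n + 1):
            grams.append(" ".join(kept[i:i + n]))
    return grams


stop_words = {"the", "for"}  # hypothetical, far smaller than the real list
tokens = "completed the plans for the attack".split()
grams = word_ngrams(tokens, stop_words)
print(grams)
# ['completed', 'plans', 'attack', 'completed plans', 'plans attack',
#  'completed plans attack']
```
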
In [13]:
import pickle

with open('squad_tfidf_vectorizer.skl', 'wb') as file:
    pickle.dump(vectorizer, file)
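
To reuse the saved vectorizer later, the file can be read back with `pickle.load`. The sketch below shows the round trip with a stand-in dictionary instead of the fitted vectorizer and a temporary directory, both chosen only to keep the example self-contained. Unpickling can execute code contained in the file, so only trusted files should be loaded, and a pickled scikit-learn estimator is best loaded with the same library version that saved it.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the fitted vectorizer.
stand_in = {"ngram_range": (1, 3), "stop_words": "english"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "squad_tfidf_vectorizer.skl")
    with open(path, "wb") as file:
        pickle.dump(stand_in, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in)  # True
```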