Data EDA and pre-processing analysis
------
In this notebook, I document the preliminary data exploration and analysis steps that led to the final steps in my production source code. Some exploration up front makes the data engineering and feature engineering steps more efficient and supports informed decisions at the model training phase.

Here, we work with the training dataset, the file [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) from the SQuAD 2.0 dataset.

### Library imports

In [22]:
import pprint
import os
import json
import numpy as np

### Data ingest and EDA
We load the data and explore the structure of each datapoint. Here, 'datapoint' refers to a small dataset in its own right: all Q&A pairs, together with their context snippets (paragraphs), that relate to one Wikipedia article, such as the "iPod" article below.

In [9]:
input_filepath = "../data/raw"
in_file_name = "train-v2.0.json"
in_file = os.path.join(input_filepath, in_file_name)
with open(in_file, encoding="utf-8") as f:
    train = json.load(f)
 
pp = pprint.PrettyPrinter()
pp.pprint(train['data'][3])


{'paragraphs': [{'context': 'The iPod is a line of portable media players and '
                            'multi-purpose pocket computers designed and '
                            'marketed by Apple Inc. The first line was '
                            'released on October 23, 2001, about 8½ months '
                            'after iTunes (Macintosh version) was released. '
                            'The most recent iPod redesigns were announced on '
                            'July 15, 2015. There are three current versions '
                            'of the iPod: the ultra-compact iPod Shuffle, the '
                            'compact iPod Nano and the touchscreen iPod Touch.',
                 'qas': [{'answers': [{'answer_start': 105, 'text': 'Apple'}],
                          'id': '56cc55856d243a140015ef0a',
                          'is_impossible': False,
                          'question': 'Which company produces the iPod?'},
                         {'answer

                 'qas': [{'answers': [{'answer_start': 32,
                                       'text': 'weak bass response'}],
                          'id': '56cc643d6d243a140015ef88',
                          'is_impossible': False,
                          'question': 'What audio deficiency was found in the '
                                      '3rd gen iPods?'},
                         {'answers': [{'answer_start': 360,
                                       'text': 'high-impedance'}],
                          'id': '56cc643d6d243a140015ef8a',
                          'is_impossible': False,
                          'question': 'What kind of headphones could partially '
                                      'mitigate the bass response issues of '
                                      'the 3rd gen iPods?'},
                         {'answers': [{'answer_start': 470,
                                       'text': 'external headphone amplifier'}],
                         

                          'is_impossible': False,
                          'question': 'What was the original format for '
                                      'purchased audio files on iTunes?'},
                         {'answers': [{'answer_start': 116,
                                       'text': 'FairPlay'}],
                          'id': '56cfb8e5234ae51400d9beec',
                          'is_impossible': False,
                          'question': 'What was the name of the DRM system '
                                      'originally used by Apple and iTunes?'},
                         {'answers': [{'answer_start': 512,
                                       'text': 'iTunes Plus'}],
                          'id': '56cfb8e5234ae51400d9beed',
                          'is_impossible': False,
                          'question': 'What was the name of the premium '
                                      'service that offered higher quality and '
                         

                         {'answers': [{'answer_start': 209,
                                       'text': '$5.2 billion'}],
                          'id': '56cd73af62d2951400fa65c7',
                          'is_impossible': False,
                          'question': 'How much revenue did Apple announce for '
                                      'Q2 2007?'},
                         {'answers': [{'answer_start': 12, 'text': '2007'}],
                          'id': '56d12cc017492d1400aabb58',
                          'is_impossible': False,
                          'question': 'In which year did Apple top sales of '
                                      '100,000,000 iPods?'},
                         {'answers': [{'answer_start': 232, 'text': '32%'}],
                          'id': '56d12cc017492d1400aabb59',
                          'is_impossible': False,
                          'question': "As of 2007, what percentage of Apple's "
                                      'r

                 'qas': [{'answers': [{'answer_start': 3, 'text': '2010'}],
                          'id': '56cd828562d2951400fa6670',
                          'is_impossible': False,
                          'question': 'In what year did Chinese Foxconn '
                                      'emplyees kill themselves?'},
                         {'answers': [{'answer_start': 257,
                                       'text': 'Apple prototype'}],
                          'id': '56cd828562d2951400fa6672',
                          'is_impossible': False,
                          'question': 'What disappeared in 2009 prior to the '
                                      'suicide of a Foxconn employee?'},
                         {'answers': [{'answer_start': 3, 'text': '2010'}],
                          'id': '56d13543e7d4791d00902004',
                          'is_impossible': False,
                          'question': 'In what year did several Foxconn '
                      

Here, we see that each of these small datasets consists of a `title` field (here, "Ipod") and a list under `paragraphs`. Each paragraph holds its text under `context` and a list of questions and answers under `qas`. Each entry in `qas` has an `id`, the `question` text, a flag for whether the question is impossible to answer (`is_impossible`), and a list of `answers`. Each answer gives the ground-truth answer text (`text`) and the character offset in `context` at which that text starts (`answer_start`).
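Because `answer_start` is a character offset into `context`, one quick sanity check is to slice the context at that offset and compare the result with `text`. The sketch below does this on a one-question sample copied (and shortened) from the output above. `find_misaligned_answers` and `sample` are hypothetical names used only for illustration; they are not part of the production code.

```python
def find_misaligned_answers(dataset):
    """Return (question id, expected text, actual slice) for every answer
    whose offset does not point at its text. Hypothetical helper."""
    mismatches = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    found = context[start:start + len(text)]
                    if found != text:
                        mismatches.append((qa["id"], text, found))
    return mismatches


# Sample in the same shape as train-v2.0.json; the context is cut after
# the first sentence of the iPod paragraph shown above.
sample = {
    "data": [{
        "title": "Ipod",
        "paragraphs": [{
            "context": ("The iPod is a line of portable media players and "
                        "multi-purpose pocket computers designed and "
                        "marketed by Apple Inc."),
            "qas": [{
                "id": "56cc55856d243a140015ef0a",
                "question": "Which company produces the iPod?",
                "is_impossible": False,
                "answers": [{"answer_start": 105, "text": "Apple"}],
            }],
        }],
    }]
}

# An empty list means every offset points at its answer text.
print(find_misaligned_answers(sample))
```

Running the same function on the full `train` dictionary would be one way to confirm that the offsets can be trusted before they are used to build training targets.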


**Questions we want to answer**:

- How many of these small datasets do we have in this training dataset?
- How many Q&A pairs are there per dataset?
- How should we preprocess this data for model training?
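For the preprocessing question, one possible starting point is to flatten the nested structure into one record per question. The sketch below is an assumption about what such a step could look like, not the production pipeline. `flatten_squad`, the `toy` data, and the choice to keep only the first answer and to encode a missing answer as an empty string with offset `-1` are all illustrative.

```python
def flatten_squad(dataset):
    """Turn the nested article/paragraph/qas structure into a flat list
    of dicts, one per question. Hypothetical helper for illustration."""
    rows = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                # Keep only the first answer; None if the list is empty.
                first = answers[0] if answers else None
                rows.append({
                    "id": qa["id"],
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "is_impossible": qa["is_impossible"],
                    "answer_text": first["text"] if first else "",
                    "answer_start": first["answer_start"] if first else -1,
                })
    return rows


# Made-up data in the same shape as the real file, for illustration only.
toy = {
    "data": [{
        "title": "Toy_article",
        "paragraphs": [{
            "context": "SQuAD is a reading comprehension dataset.",
            "qas": [
                {"id": "toy-1",
                 "question": "What is SQuAD?",
                 "is_impossible": False,
                 "answers": [{"answer_start": 11,
                              "text": "reading comprehension dataset"}]},
                {"id": "toy-2",
                 "question": "Who invented SQuAD?",
                 "is_impossible": True,
                 "answers": []},
            ],
        }],
    }]
}

for row in flatten_squad(toy):
    print(row["id"], row["is_impossible"], repr(row["answer_text"]), row["answer_start"])
```

A flat list like this is easy to load into a table or pass to a tokenizer, at the cost of repeating each `context` once per question.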

In [37]:
# Running totals, plus per-article and per-paragraph counts for computing the averages.
par_count = 0
par_avg = []   # number of paragraphs in each article
qas_count = 0
qas_avg = []   # number of Q&A pairs in each paragraph
impossible_qas = 0

for d in train['data']:
    count = len(d['paragraphs'])
    par_count += count
    par_avg.append(count)
    for p in d['paragraphs']:
        count = len(p['qas'])
        for qa in p['qas']:
            if qa['is_impossible']:
                impossible_qas += 1
        qas_count += count
        qas_avg.append(count)
        

print(f"We have data on {len(train['data'])} articles, containing {par_count} combined paragraphs, \
containing {qas_count} combined Q&A pairs, {impossible_qas} of which ({(impossible_qas / qas_count):.2%}) are classified as 'impossible to answer';")
print()
print(f"On average, we have {np.average(par_avg):.1f} paragraphs per article and {np.average(qas_avg):.1f} Q&A pairs per paragraph.")

We have data on 442 articles, containing 19035 combined paragraphs, containing 130319 combined Q&A pairs, 43498 of which (33.38%) are classified as 'impossible to answer';

On average, we have 43.1 paragraphs per article and 6.8 Q&A pairs per paragraph.
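
An average alone can hide a skewed distribution, so a possible follow-up is to look at the minimum, median, and maximum of the same count lists. The sketch below uses the standard-library `statistics` module and made-up numbers. `summarize_counts` and `toy_counts` are hypothetical names; in this notebook the input would be `par_avg` or `qas_avg`.

```python
import statistics


def summarize_counts(counts):
    """Return min, median, mean and max for a list of per-item counts."""
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "mean": statistics.mean(counts),
        "max": max(counts),
    }


# Made-up counts for illustration only; one large value pulls the mean
# well above the median.
toy_counts = [2, 3, 3, 4, 40]
print(summarize_counts(toy_counts))
```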
