In [2]:
import json
import pandas as pd

In [3]:
# Open the JSON file and load it as a dictionary;
# the context manager closes the file once loading is done
with open('data/train-v2.0.json', encoding='utf-8') as f:
    data = json.load(f)

In [4]:
df = pd.DataFrame.from_dict(data)

In [5]:
df.head()

Unnamed: 0,version,data
0,v2.0,"{'title': 'Beyoncé', 'paragraphs': [{'qas': [{..."
1,v2.0,"{'title': 'Frédéric_Chopin', 'paragraphs': [{'..."
2,v2.0,{'title': 'Sino-Tibetan_relations_during_the_M...
3,v2.0,"{'title': 'IPod', 'paragraphs': [{'qas': [{'qu..."
4,v2.0,{'title': 'The_Legend_of_Zelda:_Twilight_Princ...


In [6]:
# Keep only the column holding the article records
df = df[['data']]
df.head()

Unnamed: 0,data
0,"{'title': 'Beyoncé', 'paragraphs': [{'qas': [{..."
1,"{'title': 'Frédéric_Chopin', 'paragraphs': [{'..."
2,{'title': 'Sino-Tibetan_relations_during_the_M...
3,"{'title': 'IPod', 'paragraphs': [{'qas': [{'qu..."
4,{'title': 'The_Legend_of_Zelda:_Twilight_Princ...


In [7]:
df['data'][0]

nnounced that Beyoncé is a co-owner, with various other music artists, in the music streaming service Tidal. The service specialises in lossless audio and high definition music videos. Beyoncé\'s husband Jay Z acquired the parent company of Tidal, Aspiro, in the first quarter of 2015. Including Beyoncé and Jay-Z, sixteen artist stakeholders (such as Kanye West, Rihanna, Madonna, Chris Martin, Nicki Minaj and more) co-own Tidal, with the majority owning a 3% equity stake. The idea of having an all artist owned streaming service was created by those involved to adapt to the increased demand for streaming within the current music industry, and to rival other streaming services such as Spotify, which have been criticised for their low payout of royalties. "The challenge is to get everyone to respect music again, to recognize its value", stated Jay-Z on the release of Tidal.'},
  {'qas': [{'question': "House of Dereon became known through Beyonce and which of Beyonce's relatives?",
     'id
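The truncated output above is hard to read, so here is a minimal sketch of how one record is nested. It uses a made-up miniature article: the contents of `sample_topic` and the `count_items` helper are illustrative assumptions, not taken from the dataset. Each article holds a list of paragraphs, and each paragraph holds one `context` string plus a list of `qas` entries.

```python
# Hypothetical miniature record mirroring the nesting of one article.
sample_topic = {
    "title": "Example_Article",
    "paragraphs": [
        {
            "context": "Paris is the capital of France.",
            "qas": [
                {"id": "q1", "question": "What is the capital of France?",
                 "answers": [{"text": "Paris", "answer_start": 0}],
                 "is_impossible": False},
                {"id": "q2", "question": "What is the capital of Spain?",
                 "answers": [], "is_impossible": True},
            ],
        },
        {
            "context": "The Seine flows through Paris.",
            "qas": [
                {"id": "q3", "question": "Which river flows through Paris?",
                 "answers": [{"text": "The Seine", "answer_start": 0}],
                 "is_impossible": False},
            ],
        },
    ],
}


def count_items(topic):
    """Return (number of paragraphs, number of questions) for one article."""
    n_paragraphs = len(topic["paragraphs"])
    n_questions = sum(len(p["qas"]) for p in topic["paragraphs"])
    return n_paragraphs, n_questions


n_paragraphs, n_questions = count_items(sample_topic)
print(n_paragraphs, n_questions)
```

Applying the same helper to `df['data'][0]` should give the paragraph and question counts for the first article.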

## Splitting to test and train sets

In [8]:
import random
from sklearn.model_selection import train_test_split

random.seed(100)
print('Data Count = {0}'.format(len(df)))

Data Count = 442


In [9]:
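# random.seed only seeds Python's built-in generator; scikit-learn needs
# random_state for the split to be reproducible
train, test = train_test_split(df, test_size=0.2, random_state=100)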

In [10]:
print('Train set size = {0}'.format(train.shape))
print('Test set size = {0}'.format(test.shape))

Train set size = (353, 1)
Test set size = (89, 1)
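One detail worth checking when reproducing this split: `random.seed` seeds Python's built-in `random` module, while `train_test_split` takes its randomness from the `random_state` argument (falling back to NumPy's global state when it is omitted). A minimal sketch on a made-up ten-row frame (`toy` is a hypothetical stand-in for `df`), showing that a fixed `random_state` yields the same split on every call:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for df: ten rows in a single 'data' column.
toy = pd.DataFrame({"data": list(range(10))})

# Two independent calls with the same random_state.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=100)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=100)

# The same rows end up in the same split both times.
print(train_a.index.equals(train_b.index))
print(train_a.shape, test_a.shape)
```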


### Splitting each set into 2 datasets: 1. context paragraphs and 2. questions/answers

In [11]:
contextID = 1
qa_frames_train = []
context_frames_train = []

## For training set
for topic in train['data']:
    for paras in topic["paragraphs"]:
        # One row per question, tagged with the ID of its paragraph
        questions = pd.DataFrame(paras["qas"])
        questions["contextID"] = contextID
        qa_frames_train.append(questions)

        # One row per paragraph, carrying the same ID
        context = pd.Series([paras["context"]], name="context").to_frame()
        context["contextID"] = contextID
        context_frames_train.append(context)

        contextID += 1

# Concatenate once at the end instead of on every iteration
questionAnswerData_train = pd.concat(qa_frames_train)
contextData_train = pd.concat(context_frames_train)

In [12]:
qa_frames_test = []
context_frames_test = []

## For test set
# contextID continues from the training loop, so IDs stay unique across both sets
for topic in test['data']:
    for paras in topic["paragraphs"]:
        # One row per question, tagged with the ID of its paragraph
        questions = pd.DataFrame(paras["qas"])
        questions["contextID"] = contextID
        qa_frames_test.append(questions)

        # One row per paragraph, carrying the same ID
        context = pd.Series([paras["context"]], name="context").to_frame()
        context["contextID"] = contextID
        context_frames_test.append(context)

        contextID += 1

# Concatenate once at the end instead of on every iteration
questionAnswerData_test = pd.concat(qa_frames_test)
contextData_test = pd.concat(context_frames_test)

In [13]:
print('Number of Context Paragraphs in Train = {0}'.format(contextData_train.shape))
print('Number of QA in Train = {0}'.format(questionAnswerData_train.shape))

print('Number of Context Paragraphs in Test = {0}'.format(contextData_test.shape))
print('Number of QA in Test = {0}'.format(questionAnswerData_test.shape))


Number of Context Paragraphs in Train = (14798, 2)
Number of QA in Train = (104194, 6)
Number of Context Paragraphs in Test = (4237, 2)
Number of QA in Test = (26125, 6)


In [14]:
contextData_train.columns

Index(['context', 'contextID'], dtype='object')

In [15]:
# The data/train and data/test folders must already exist
contextData_train.to_excel("data/train/train_context.xlsx", index=False)
questionAnswerData_train.to_excel("data/train/train_QA.xlsx", index=False)

contextData_test.to_excel("data/test/test_context.xlsx", index=False)
questionAnswerData_test.to_excel("data/test/test_QA.xlsx", index=False)

Note: We understand that a dataset should normally be split after the EDA process. However, since we explore the context paragraphs and the questions/answers separately, it is computationally cheaper to split the data into train and test sets at this stage as well.

Each context paragraph is linked to many questions and answers, so putting everything into a single dataframe would produce a very large, deeply nested dataframe that is computationally expensive to process. We therefore split the data into 2 datasets: the context paragraphs and the questions/answers. To trace a question/answer back to its context paragraph, we have included a contextID, which functions as a key between the 2 datasets.
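As a minimal sketch of how the contextID key can be used, the snippet below joins a questions table back to its paragraphs with `DataFrame.merge`. The frames `context_df` and `qa_df` and their rows are made-up stand-ins for the context and QA tables, not actual dataset content.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the context and QA tables.
context_df = pd.DataFrame({
    "contextID": [1, 2],
    "context": ["Paris is the capital of France.",
                "The Seine flows through Paris."],
})
qa_df = pd.DataFrame({
    "id": ["q1", "q2", "q3"],
    "question": ["What is the capital of France?",
                 "Which country is Paris in?",
                 "Which river flows through Paris?"],
    "contextID": [1, 1, 2],
})

# Many questions map to one paragraph. validate="many_to_one" makes pandas
# raise a MergeError if a contextID appears more than once in context_df.
joined = qa_df.merge(context_df, on="contextID", how="left",
                     validate="many_to_one")

print(joined[["id", "contextID", "context"]])
```

The left join keeps every question row, so a missing `context` value after the merge would point to a contextID that has no matching paragraph.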
