<a href="https://colab.research.google.com/github/alexpod1000/SQuAD-QA/blob/main/NLP_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
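# Clone repository (private repo, authenticated with a personal access token)
# https://sysadmins.co.za/clone-a-private-github-repo-with-personal-access-token/
# The token is read from the GITHUB_TOKEN environment variable (assumed to be set) instead of being hardcoded
!git clone https://$GITHUB_TOKEN:x-oauth-basic@github.com/alexpod1000/SQuAD-QA.git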
# Change current working directory to match project
%cd SQuAD-QA/
!pwd

In [7]:
# External imports
import numpy as np
import pandas as pd
import torch
from typing import Tuple, List, Dict

# Project imports
from squad_data.parser import SquadFileParser
from squad_data.utils import build_mappers_and_dataframe, add_paragraphs_spans

### Download Embedding

In [3]:
from utils.embedding_utils import EmbeddingDownloader

embedding_downloader = EmbeddingDownloader(
    "embedding_models", 
    "embedding_model.kv", 
    model_name="fasttext-wiki-news-subwords-300"
)

embedding_model = embedding_downloader.load()

Downloading embedding into /content/SQuAD-QA/embedding_models/embedding_model.kv
End!
Embedding dimension: 300


### Parse the JSON and get the data

In [4]:
parser = SquadFileParser("squad_data/data/training_set.json")
data = parser.parse_documents()

### Prepare the mappers and dataframe

In [5]:
paragraphs_mapper, questions_mapper, df = build_mappers_and_dataframe(data)
print(questions_mapper[next(iter(questions_mapper))])
print(paragraphs_mapper[next(iter(paragraphs_mapper))])
df.head()

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


| | paragraph_id | question_id | answer_id | answer_start | answer_text |
|---|---|---|---|---|---|
| 0 | 0_0 | 5733be284776f41900661182 | 0 | 515 | Saint Bernadette Soubirous |
| 1 | 0_0 | 5733be284776f4190066117f | 0 | 188 | a copper statue of Christ |
| 2 | 0_0 | 5733be284776f41900661180 | 0 | 279 | the Main Building |
| 3 | 0_0 | 5733be284776f41900661181 | 0 | 381 | a Marian place of prayer and reflection |
| 4 | 0_0 | 5733be284776f4190066117e | 0 | 92 | a golden statue of the Virgin Mary |


In [8]:
# Extend the paragraphs mapper to include spans
paragraphs_spans_mapper = add_paragraphs_spans(paragraphs_mapper)

In [10]:
print(paragraphs_spans_mapper['0_0']['text'])
print(paragraphs_spans_mapper['0_0']['spans'])

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
[(0, 16), (17, 20), (21, 27), (28, 31), (32, 33), (34, 42), (43, 53), (54, 58), (59, 62), (63, 67), (68, 78), (79, 83), (84, 88), (89, 91), (92, 93), (94, 100), (101, 107), (108, 110), (111, 114), (115, 121), (122, 127), (128, 139), (140, 142), (143, 148), (149, 151), (152, 155), (156, 160), (161, 169),
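
The printed spans look consistent with half-open `(start, end)` character offsets of whitespace-delimited tokens. The implementation of `add_paragraphs_spans` is not shown here, so the following is only a minimal sketch under that assumption. The helper names `whitespace_token_spans` and `answer_token_range` are hypothetical and not part of the project. The second helper illustrates one way a character-level `answer_start` could be mapped to token indices.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]


def whitespace_token_spans(text: str) -> List[Span]:
    """Return half-open (start, end) character offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def answer_token_range(spans: List[Span], answer_start: int, answer_text: str) -> Tuple[int, int]:
    """Return the indices of the first and last token overlapping the answer's characters."""
    answer_end = answer_start + len(answer_text)
    # A token overlaps the answer if the two half-open intervals intersect
    covered = [i for i, (start, end) in enumerate(spans)
               if start < answer_end and end > answer_start]
    if not covered:
        raise ValueError("answer does not overlap any token")
    return covered[0], covered[-1]


# First sentence of paragraph '0_0'
text = "Architecturally, the school has a Catholic character."
spans = whitespace_token_spans(text)
print(spans)

# Illustrative answer (not taken from the dataset) located by character offset
answer = "a Catholic character"
print(answer_token_range(spans, text.find(answer), answer))
```

On this sentence, the first seven spans agree with the beginning of the output printed above.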

### DataConverter and CustomQADataset

In [11]:
from data_loading.utils import DataConverter, padder_collate_fn
from data_loading.qa_dataset import CustomQADataset

data_converter = DataConverter(embedding_model, paragraphs_spans_mapper)
datasetQA = CustomQADataset(data_converter, df, paragraphs_mapper, questions_mapper)
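data_loader = torch.utils.data.DataLoader(datasetQA, collate_fn=padder_collate_fn, batch_size=10, shuffle=True)
# Fetch a single batch: with shuffle=True, each call to next(iter(data_loader)) would draw a different batch
batch = next(iter(data_loader))
print(batch[0].shape)
print(batch[2].shape)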

torch.Size([1348, 10, 300])
torch.Size([10, 2])
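
The first shape, `[1348, 10, 300]`, suggests a time-major layout (sequence length, batch size, embedding dimension) in which shorter sequences are padded up to the longest one in the batch. The source of `padder_collate_fn` is not shown, so the following is only a sketch of that padding idea in plain Python lists. The names `pad_time_major` and `shape` are hypothetical. With tensors, `torch.nn.utils.rnn.pad_sequence` with its default `batch_first=False` produces the same layout.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pad_time_major(batch: Sequence[List[Vector]], pad_value: float = 0.0) -> List[List[Vector]]:
    """Pad variable-length sequences of vectors and return them as [max_len][batch][dim]."""
    max_len = max(len(seq) for seq in batch)
    dim = len(batch[0][0])
    padded = []
    for t in range(max_len):
        # At each time step, take the real vector if the sequence is long enough, else a pad vector
        step = [list(seq[t]) if t < len(seq) else [pad_value] * dim for seq in batch]
        padded.append(step)
    return padded


def shape(nested) -> Tuple[int, ...]:
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Two toy "embedded" sequences of lengths 1 and 3, with embedding dimension 2
toy_batch = [
    [[1.0, 2.0]],
    [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
]
padded = pad_time_major(toy_batch)
print(shape(padded))
```

A time-major layout matches the default input convention of PyTorch recurrent layers (`batch_first=False`), which may be why the sequence dimension comes first here.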
