# Sequence to Sequence Prediction on Wikipedia Question Answer Data
---

**Task: Answer Questions**

**Description:**

There are three directories, one for each year of students: S08, S09, and S10.

The file "question_answer_pairs.txt" contains the questions and answers. The first line of the file contains 
column names for the tab-separated data fields in the file. This first line follows:

ArticleTitle    Question        Answer  DifficultyFromQuestioner        DifficultyFromAnswerer  ArticleFile

Field 1 is the name of the Wikipedia article from which questions and answers initially came.

Field 2 is the question.

Field 3 is the answer.

Field 4 is the prescribed difficulty rating for the question as given to the question-writer. 

Field 5 is a difficulty rating assigned by the individual who evaluated and answered the question, 
which may differ from the difficulty in field 4.

Field 6 is the relative path to the prefix of the article files. HTML (.htm) and cleaned
text (.txt) files are provided.
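
As a rough illustration of this layout, the sketch below parses a small in-memory sample with Python's standard `csv` module and keeps only the rows that have both a question and an answer. The two sample rows are copied from the table output further down in this notebook. The names `SAMPLE`, `rows`, and `pairs` are illustrative only, and the notebook itself reads the real files with pandas.

```python
import csv
import io

# Header line as documented above, followed by two rows taken from this notebook's output.
SAMPLE = (
    "ArticleTitle\tQuestion\tAnswer\tDifficultyFromQuestioner\t"
    "DifficultyFromAnswerer\tArticleFile\n"
    "Alessandro_Volta\tWas Volta an Italian physicist?\tyes\teasy\teasy\tdata/set4/a10\n"
    "Zebra\tWhat areas do the Grevy's Zebras inhabit?\t\thard\t\tdata/set1/a9\n"
)

# QUOTE_NONE disables quote handling, so any double quotes in the text are kept literally.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

# Keep only (question, answer) pairs where both fields are non-empty.
pairs = [(r["Question"], r["Answer"]) for r in rows if r["Question"] and r["Answer"]]

print(reader.fieldnames)
print(pairs)
```

The second row has an empty answer field, so it is filtered out; this mirrors the null handling applied to the full dataset later on.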

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


## Extract Downloaded Data
---

In [None]:
import os

base_path = 'gdrive/MyDrive/TSAI_END2/Session7/Assignment2'
data_path = os.path.join(base_path, 'data')
qa_data_base_path = os.path.join(data_path, 'QuestionAnswerData')
qa_data_path = os.path.join(qa_data_base_path, 'Question_Answer_Dataset_v1.2')
qa_data_tar_path = os.path.join(qa_data_base_path, 'Question_Answer_Dataset_v1.2.tar.gz')

In [None]:
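import tarfile

# Open the archive, list its members, and extract everything next to it.
# The context manager closes the archive even if extraction fails.
with tarfile.open(qa_data_tar_path) as qa_file:
  print(qa_file.getnames())
  qa_file.extractall(qa_data_base_path)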

['Question_Answer_Dataset_v1.2', 'Question_Answer_Dataset_v1.2/S08', 'Question_Answer_Dataset_v1.2/S08/question_answer_pairs.txt', 'Question_Answer_Dataset_v1.2/S08/data', 'Question_Answer_Dataset_v1.2/S08/data/set4', 'Question_Answer_Dataset_v1.2/S08/data/set4/a6.txt.clean', 'Question_Answer_Dataset_v1.2/S08/data/set4/a3.txt.clean', 'Question_Answer_Dataset_v1.2/S08/data/set4/a3.txt', 'Question_Answer_Dataset_v1.2/S08/data/set4/a5.txt', 'Question_Answer_Dataset_v1.2/S08/data/set4/a4o.htm', 'Question_Answer_Dataset_v1.2/S08/data/set4/a3.htm', 'Question_Answer_Dataset_v1.2/S08/data/set4/a9.htm', 'Question_Answer_Dataset_v1.2/S08/data/set4/a2.txt', 'Question_Answer_Dataset_v1.2/S08/data/set4/a9.txt.clean', 'Question_Answer_Dataset_v1.2/S08/data/set4/a4.htm', 'Question_Answer_Dataset_v1.2/S08/data/set4/a4.txt', 'Question_Answer_Dataset_v1.2/S08/data/set4/a4.txt.clean', 'Question_Answer_Dataset_v1.2/S08/data/set4/a2.htm', 'Question_Answer_Dataset_v1.2/S08/data/set4/a7o.htm', 'Question_Answ

## Data Exploration
---

In [None]:
import os

base_path = 'gdrive/MyDrive/TSAI_END2/Session7/Assignment2'
data_path = os.path.join(base_path, 'data')
qa_data_base_path = os.path.join(data_path, 'QuestionAnswerData')
qa_data_path = os.path.join(qa_data_base_path, 'Question_Answer_Dataset_v1.2')
qa_data_tar_path = os.path.join(qa_data_base_path, 'Question_Answer_Dataset_v1.2.tar.gz')

Folder contents:

In [None]:
os.listdir(qa_data_path)

['S09', 'S08', 'S10', 'LICENSE-S08,S09', 'README.v1.2']

There are three folders: S08, S09, and S10. The other two entries, `LICENSE-S08,S09` and `README.v1.2`, are files.

Each of S08, S09, and S10 contains:

> 1) a `data` folder containing the Wikipedia articles

> 2) a `question_answer_pairs.txt` file with the question-answer pairs

Only the question-answer text file is used here.

In [None]:
for subdir in os.listdir(qa_data_path):
  subdirectory = os.path.join(qa_data_path, subdir)
  print(subdirectory)
  if os.path.isdir(subdirectory):
    print(os.listdir(subdirectory))
  print('\n')

gdrive/MyDrive/TSAI_END2/Session7/Assignment2/data/QuestionAnswerData/Question_Answer_Dataset_v1.2/S09
['data', 'question_answer_pairs.txt']


gdrive/MyDrive/TSAI_END2/Session7/Assignment2/data/QuestionAnswerData/Question_Answer_Dataset_v1.2/S08
['data', 'question_answer_pairs.txt']


gdrive/MyDrive/TSAI_END2/Session7/Assignment2/data/QuestionAnswerData/Question_Answer_Dataset_v1.2/S10
['data', 'question_answer_pairs.txt']


gdrive/MyDrive/TSAI_END2/Session7/Assignment2/data/QuestionAnswerData/Question_Answer_Dataset_v1.2/LICENSE-S08,S09


gdrive/MyDrive/TSAI_END2/Session7/Assignment2/data/QuestionAnswerData/Question_Answer_Dataset_v1.2/README.v1.2




All three `question_answer_pairs.txt` files were read and combined into a single DataFrame. The result is below.

In [None]:
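import pandas as pd

# Read each year's question_answer_pairs.txt and combine them into one DataFrame.
frames = []
for subdir in os.listdir(qa_data_path):
  subdirectory = os.path.join(qa_data_path, subdir)
  if os.path.isdir(subdirectory):
    for txt_file in os.listdir(subdirectory):
      if txt_file.endswith('.txt'):
        # The files are tab-separated and are read as ISO-8859-1 (Latin-1).
        df = pd.read_csv(os.path.join(subdirectory, txt_file), sep='\t', encoding='ISO-8859-1')
        frames.append(df)
# Concatenate once at the end instead of growing the DataFrame inside the loop.
qa_data = pd.concat(frames).reset_index(drop=True)
qa_data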

| | ArticleTitle | Question | Answer | DifficultyFromQuestioner | DifficultyFromAnswerer | ArticleFile |
|---|---|---|---|---|---|---|
| 0 | Alessandro_Volta | Was Volta an Italian physicist? | yes | easy | easy | data/set4/a10 |
| 1 | Alessandro_Volta | Was Volta an Italian physicist? | yes | easy | easy | data/set4/a10 |
| 2 | Alessandro_Volta | Is Volta buried in the city of Pittsburgh? | no | easy | easy | data/set4/a10 |
| 3 | Alessandro_Volta | Is Volta buried in the city of Pittsburgh? | no | easy | easy | data/set4/a10 |
| 4 | Alessandro_Volta | Did Volta have a passion for the study of elec... | yes | easy | medium | data/set4/a10 |
| ... | ... | ... | ... | ... | ... | ... |
| 3993 | Zebra | What areas do the Grevy's Zebras inhabit? | | hard | | data/set1/a9 |
| 3994 | Zebra | Which species of zebra is known as the common ... | Plains Zebra (Equus quagga, formerly Equus bur... | hard | medium | data/set1/a9 |
| 3995 | Zebra | Which species of zebra is known as the common ... | Plains Zebra | hard | medium | data/set1/a9 |
| 3996 | Zebra | At what age can a zebra breed? | five or six | hard | medium | data/set1/a9 |


In [None]:
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3998 entries, 0 to 3997
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   ArticleTitle              3998 non-null   object
 1   Question                  3961 non-null   object
 2   Answer                    3422 non-null   object
 3   DifficultyFromQuestioner  3043 non-null   object
 4   DifficultyFromAnswerer    3418 non-null   object
 5   ArticleFile               3996 non-null   object
dtypes: object(6)
memory usage: 187.5+ KB


In [None]:
qa_data.describe()

| | ArticleTitle | Question | Answer | DifficultyFromQuestioner | DifficultyFromAnswerer | ArticleFile |
|---|---|---|---|---|---|---|
| count | 3998 | 3961 | 3422 | 3043 | 3418 | 3996 |
| unique | 109 | 2456 | 1828 | 4 | 5 | 57 |
| top | Amedeo_Avogadro | Was King Victor Emmanuel III there to pay homa... | Yes | easy | easy | data/set4/a8 |
| freq | 132 | 6 | 492 | 1035 | 1344 | 132 |


The data has six columns, but only the `Question` and `Answer` columns are needed.

In [None]:
qa_data = qa_data[['Question','Answer']] 
qa_data

| | Question | Answer |
|---|---|---|
| 0 | Was Volta an Italian physicist? | yes |
| 1 | Was Volta an Italian physicist? | yes |
| 2 | Is Volta buried in the city of Pittsburgh? | no |
| 3 | Is Volta buried in the city of Pittsburgh? | no |
| 4 | Did Volta have a passion for the study of elec... | yes |
| ... | ... | ... |
| 3993 | What areas do the Grevy's Zebras inhabit? | |
| 3994 | Which species of zebra is known as the common ... | Plains Zebra (Equus quagga, formerly Equus bur... |
| 3995 | Which species of zebra is known as the common ... | Plains Zebra |
| 3996 | At what age can a zebra breed? | five or six |


Both columns contain null values. These rows need to be dropped.

In [None]:
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3998 entries, 0 to 3997
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  3961 non-null   object
 1   Answer    3422 non-null   object
dtypes: object(2)
memory usage: 62.6+ KB


Final data with only the `Question` and `Answer` columns and no null values:

In [None]:
qa_data = qa_data.dropna().reset_index(drop=True)
qa_data

| | Question | Answer |
|---|---|---|
| 0 | Was Volta an Italian physicist? | yes |
| 1 | Was Volta an Italian physicist? | yes |
| 2 | Is Volta buried in the city of Pittsburgh? | no |
| 3 | Is Volta buried in the city of Pittsburgh? | no |
| 4 | Did Volta have a passion for the study of elec... | yes |
| ... | ... | ... |
| 3417 | What areas do the Grevy's Zebras inhabit? | semi-arid grasslands of Ethiopia and northern ... |
| 3418 | Which species of zebra is known as the common ... | Plains Zebra (Equus quagga, formerly Equus bur... |
| 3419 | Which species of zebra is known as the common ... | Plains Zebra |
| 3420 | At what age can a zebra breed? | five or six |


In [None]:
qa_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3422 entries, 0 to 3421
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  3422 non-null   object
 1   Answer    3422 non-null   object
dtypes: object(2)
memory usage: 53.6+ KB


## Create NLP Pipeline
---


### Import libraries and set seed
---

In [2]:
%cd 'gdrive/MyDrive/TSAI_END2/Session7/Assignment2'

/content/gdrive/MyDrive/TSAI_END2/Session7/Assignment2


In [3]:
from nlp_seq2seq_api import *

In [4]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy import data
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

In [5]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# !pip install spacy --upgrade

### Data Loading and Model Building
---

In [6]:
model_params = {'enc_emb_dim': 256, 'dec_emb_dim': 256, 'hid_dim': 512,
                'n_layers': 2, 'enc_dropout': 0.5, 'dec_dropout': 0.5}
params = {'data_path': './data/QuestionAnswerData/Question_Answer_Dataset_v1.2',
          'data_name': 'wikipedia qa', 'model_name': 'lstm encoder-decoder sequence model',
          'model_params': model_params, 'seed': SEED, 'batch_size': 128,
          'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')}

nlp_pipeline = NLPSeq2SeqPipeline(**params)

Loading data...
Number of training examples: 2395
Number of testing examples: 1027
Unique tokens in source vocabulary: 1951
Unique tokens in target vocabulary: 1279
Sample Data:-
                                                 src  trg
0                    Was Volta an Italian physicist?  yes
1                    Was Volta an Italian physicist?  yes
2         Is Volta buried in the city of Pittsburgh?   no
3         Is Volta buried in the city of Pittsburgh?   no
4  Did Volta have a passion for the study of elec...  yes
Data is loaded


Loading model...
Model Loaded...
Model Structure:- 
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(1951, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(1279, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=1279, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
T
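
The printed structure is enough to work out how many parameters the model has. The sketch below does the arithmetic in plain Python, assuming PyTorch's standard layouts: an `Embedding(n, d)` holds `n * d` weights, each unidirectional `LSTM` layer holds two weight matrices and two bias vectors covering its four gates, and `Linear(i, o)` holds `i * o` weights plus `o` biases. The helper names are illustrative and are not part of `nlp_seq2seq_api`.

```python
def embedding_params(num_embeddings, dim):
    # One dim-sized vector per vocabulary entry.
    return num_embeddings * dim

def lstm_params(input_size, hidden_size, num_layers):
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden state as input.
        in_size = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * in_size      # weight_ih (four gates)
        total += 4 * hidden_size * hidden_size  # weight_hh (four gates)
        total += 2 * 4 * hidden_size            # bias_ih + bias_hh
    return total

def linear_params(in_features, out_features):
    return in_features * out_features + out_features

# Sizes are read from the model structure printed above.
encoder = embedding_params(1951, 256) + lstm_params(256, 512, 2)
decoder = embedding_params(1279, 256) + lstm_params(256, 512, 2) + linear_params(512, 1279)
total = encoder + decoder
print(encoder, decoder, total)
```

Dropout layers contribute no parameters, so they are left out of the count.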

### Model Training
---

In [7]:
nlp_pipeline.train_model(10, './saved_models/QAData')

Epoch: 01 | Time: 1m 53s
	Train Loss: 5.193 | Train PPL: 179.977
	 Val. Loss: 3.529 |  Val. PPL:  34.074
Epoch: 02 | Time: 1m 57s
	Train Loss: 4.417 | Train PPL:  82.868
	 Val. Loss: 3.507 |  Val. PPL:  33.344
Epoch: 03 | Time: 1m 53s
	Train Loss: 4.319 | Train PPL:  75.121
	 Val. Loss: 3.499 |  Val. PPL:  33.086
Epoch: 04 | Time: 1m 56s
	Train Loss: 4.246 | Train PPL:  69.798
	 Val. Loss: 3.463 |  Val. PPL:  31.915
Epoch: 05 | Time: 1m 55s
	Train Loss: 4.207 | Train PPL:  67.173
	 Val. Loss: 3.423 |  Val. PPL:  30.663
Epoch: 06 | Time: 1m 55s
	Train Loss: 4.141 | Train PPL:  62.892
	 Val. Loss: 3.436 |  Val. PPL:  31.054
Epoch: 07 | Time: 1m 47s
	Train Loss: 4.045 | Train PPL:  57.131
	 Val. Loss: 3.329 |  Val. PPL:  27.901
Epoch: 08 | Time: 1m 53s
	Train Loss: 4.002 | Train PPL:  54.704
	 Val. Loss: 3.340 |  Val. PPL:  28.228
Epoch: 09 | Time: 1m 53s
	Train Loss: 3.915 | Train PPL:  50.170
	 Val. Loss: 3.301 |  Val. PPL:  27.138
Epoch: 10 | Time: 1m 56s
	Train Loss: 3.904 | Train PPL

### Model Evaluation
---

In [8]:
nlp_pipeline.evaluate_model()

| Test Loss: 3.279 | Test PPL:  26.562 |
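
The perplexity (PPL) figures in these logs are consistent with the exponential of the reported loss. The check below assumes that relationship, which is the usual convention but is not confirmed here because the `nlp_seq2seq_api` source is not shown. Small differences are expected, since the printed loss is rounded to three decimals.

```python
import math

def perplexity(loss):
    """Perplexity under the assumption PPL = exp(loss)."""
    return math.exp(loss)

# (loss, logged PPL) pairs copied from the training and evaluation output above:
# epoch 1 train, epoch 1 validation, and test.
logged = [(5.193, 179.977), (3.529, 34.074), (3.279, 26.562)]
for loss, ppl in logged:
    print(f"loss={loss:.3f}  exp(loss)={perplexity(loss):.3f}  logged PPL={ppl:.3f}")
```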
