# Sequence to Sequence Prediction on Commonsense QA Data
---

**Task: Answer commonsense questions**

**Description:**

CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions, each with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits.

There are three JSON Lines files: train, validate, and test.

We use only the train and validate files because the test file does not contain answers.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


## Data Exploration
---


In [1]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
import json
import pandas as pd

train_url = 'https://s3.amazonaws.com/commensenseqa/train_rand_split.jsonl'
dev_url = 'https://s3.amazonaws.com/commensenseqa/dev_rand_split.jsonl'
test_url = 'https://s3.amazonaws.com/commensenseqa/test_rand_split_no_answers.jsonl'

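from io import StringIO

resp = urlopen(train_url).read().decode()
# Wrap the text in StringIO: recent pandas versions deprecate passing a raw JSON string to read_json
data = pd.read_json(StringIO(resp), lines=True)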

# zipfile = ZipFile(BytesIO(resp.read()))
# files = zipfile.namelist()
# print(files)
# for fs in files:
#   if 'train' in fs:
#     with zipfile.open(fs) as json_file:
#       train_json_data = json.load(json_file)
#   elif 'dev' in fs:
#     with zipfile.open(fs) as json_file:
#       test_json_data = json.load(json_file)

In [2]:
df = pd.json_normalize(data.to_dict(orient='records'))

Each line in the JSON Lines file represents one record. Each record consists of:

1) answerKey:- the key (label) of the correct option.

2) id:- a unique question ID.

3) question:- a dictionary of:

> i) question_concept - the category to which the question belongs.

> ii) choices - the candidate answers, one of which is correct. It is a list of dictionaries, each containing:

>> a) label: one of A, B, C, D, or E

>> b) text: the text of the choice

> iii) stem - the question text
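As a rough illustration of the record layout described above, the sketch below builds one hypothetical record (the id, stem, and choices are made up, not taken from the dataset) and flattens its nested dictionary into dotted keys. The `flatten` helper is an illustrative stand-in that mimics how `pd.json_normalize` names columns for nested dictionaries. It is not part of pandas or of this notebook.

```python
import json

# Hypothetical record with the same shape as one line of the .jsonl file.
# The id, stem, and choices are invented for illustration only.
line = json.dumps({
    "answerKey": "B",
    "id": "example-0001",
    "question": {
        "question_concept": "bird",
        "choices": [
            {"label": "A", "text": "office"},
            {"label": "B", "text": "nest"},
            {"label": "C", "text": "garage"},
            {"label": "D", "text": "library"},
            {"label": "E", "text": "kitchen"},
        ],
        "stem": "Where does a bird usually lay its eggs?",
    },
})


def flatten(record, parent=""):
    """Flatten nested dicts into dotted keys; lists are kept as they are."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


record = json.loads(line)
flat = flatten(record)
print(sorted(flat))
```

The flattened keys correspond to the DataFrame columns: `question.choices` stays a list of label/text dictionaries, while `question.question_concept` and `question.stem` become plain string columns.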

Below is the structure when converted to a DataFrame:

In [3]:
df

Unnamed: 0,answerKey,id,question.question_concept,question.choices,question.stem
0,A,075e483d21c29a511267ef62bedc0461,punishing,"[{'label': 'A', 'text': 'ignore'}, {'label': '...",The sanctions against the school were a punish...
1,B,61fe6e879ff18686d7552425a36344c8,people,"[{'label': 'A', 'text': 'race track'}, {'label...",Sammy wanted to go to where the people were. ...
2,A,4c1cb0e95b99f72d55c068ba0255c54d,choker,"[{'label': 'A', 'text': 'jewelry store'}, {'la...",To locate a choker not located in a jewelry bo...
3,D,02e821a3e53cb320790950aab4489e85,highway,"[{'label': 'A', 'text': 'united states'}, {'la...",Google Maps and other highway and street GPS s...
4,C,23505889b94e880c3e89cff4ba119860,fox,"[{'label': 'A', 'text': 'pretty flowers.'}, {'...","The fox walked from the city into the forest, ..."
...,...,...,...,...,...
9736,E,f1b2a30a1facff543e055231c5f90dd0,going public,"[{'label': 'A', 'text': 'consequences'}, {'lab...",What would someone need to do if he or she wan...
9737,D,a63b4d0c0b34d6e5f5ce7b2c2c08b825,chair,"[{'label': 'A', 'text': 'stadium'}, {'label': ...",Where might you find a chair at an office?
9738,A,22d0eea15e10be56024fd00bb0e4f72f,jeans,"[{'label': 'A', 'text': 'shopping mall'}, {'la...",Where would you buy jeans in a place with a la...
9739,A,7c55160a4630de9690eb328b57a18dc2,well,"[{'label': 'A', 'text': 'fairytale'}, {'label'...",John fell down the well. he couldn't believe ...


In [4]:
df['question.choices']

0       [{'label': 'A', 'text': 'ignore'}, {'label': '...
1       [{'label': 'A', 'text': 'race track'}, {'label...
2       [{'label': 'A', 'text': 'jewelry store'}, {'la...
3       [{'label': 'A', 'text': 'united states'}, {'la...
4       [{'label': 'A', 'text': 'pretty flowers.'}, {'...
                              ...                        
9736    [{'label': 'A', 'text': 'consequences'}, {'lab...
9737    [{'label': 'A', 'text': 'stadium'}, {'label': ...
9738    [{'label': 'A', 'text': 'shopping mall'}, {'la...
9739    [{'label': 'A', 'text': 'fairytale'}, {'label'...
9740    [{'label': 'A', 'text': 'put in to the water'}...
Name: question.choices, Length: 9741, dtype: object

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9741 entries, 0 to 9740
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   answerKey                  9741 non-null   object
 1   id                         9741 non-null   object
 2   question.question_concept  9741 non-null   object
 3   question.choices           9741 non-null   object
 4   question.stem              9741 non-null   object
dtypes: object(5)
memory usage: 380.6+ KB


In [6]:
df['answerKey'].value_counts()

D    1985
B    1973
C    1946
E    1928
A    1909
Name: answerKey, dtype: int64

To create the answer column, answerKey is matched against the labels in question.choices, and the text of the matching choice is taken as the answer.

In [14]:
# pd.concat([pd.DataFrame(x) for x in df['question.choices']], keys=df.index).reset_index(level=1, drop=True).reset_index(drop=True)
df['answer'] = df.apply(lambda r: [x for x in r['question.choices'] if x['label']==r['answerKey']][0]['text'], axis=1)
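The same matching can be written as a small helper, which may be easier to read and test than the inline lambda. This is only a sketch: `answer_text` is a hypothetical name, the choices below are made up, and returning `None` for a missing label is a design choice of this sketch (the lambda above would raise an `IndexError` instead).

```python
def answer_text(choices, answer_key):
    """Return the text of the choice whose label equals answer_key, or None."""
    # Build a label -> text lookup from the list of choice dictionaries.
    by_label = {choice["label"]: choice["text"] for choice in choices}
    return by_label.get(answer_key)


# Hypothetical choices in the same shape as one question.choices entry.
choices = [
    {"label": "A", "text": "office"},
    {"label": "B", "text": "nest"},
    {"label": "C", "text": "garage"},
    {"label": "D", "text": "library"},
    {"label": "E", "text": "kitchen"},
]

print(answer_text(choices, "B"))
```

On the DataFrame, it could be applied row by row with `df.apply(lambda r: answer_text(r['question.choices'], r['answerKey']), axis=1)`.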

In [15]:
df

Unnamed: 0,answerKey,id,question.question_concept,question.choices,question.stem,answer
0,A,075e483d21c29a511267ef62bedc0461,punishing,"[{'label': 'A', 'text': 'ignore'}, {'label': '...",The sanctions against the school were a punish...,ignore
1,B,61fe6e879ff18686d7552425a36344c8,people,"[{'label': 'A', 'text': 'race track'}, {'label...",Sammy wanted to go to where the people were. ...,populated areas
2,A,4c1cb0e95b99f72d55c068ba0255c54d,choker,"[{'label': 'A', 'text': 'jewelry store'}, {'la...",To locate a choker not located in a jewelry bo...,jewelry store
3,D,02e821a3e53cb320790950aab4489e85,highway,"[{'label': 'A', 'text': 'united states'}, {'la...",Google Maps and other highway and street GPS s...,atlas
4,C,23505889b94e880c3e89cff4ba119860,fox,"[{'label': 'A', 'text': 'pretty flowers.'}, {'...","The fox walked from the city into the forest, ...",natural habitat
...,...,...,...,...,...,...
9736,E,f1b2a30a1facff543e055231c5f90dd0,going public,"[{'label': 'A', 'text': 'consequences'}, {'lab...",What would someone need to do if he or she wan...,telling all
9737,D,a63b4d0c0b34d6e5f5ce7b2c2c08b825,chair,"[{'label': 'A', 'text': 'stadium'}, {'label': ...",Where might you find a chair at an office?,cubicle
9738,A,22d0eea15e10be56024fd00bb0e4f72f,jeans,"[{'label': 'A', 'text': 'shopping mall'}, {'la...",Where would you buy jeans in a place with a la...,shopping mall
9739,A,7c55160a4630de9690eb328b57a18dc2,well,"[{'label': 'A', 'text': 'fairytale'}, {'label'...",John fell down the well. he couldn't believe ...,fairytale


The question column is question.stem.

In [16]:
df[['question.stem','answer']]

Unnamed: 0,question.stem,answer
0,The sanctions against the school were a punish...,ignore
1,Sammy wanted to go to where the people were. ...,populated areas
2,To locate a choker not located in a jewelry bo...,jewelry store
3,Google Maps and other highway and street GPS s...,atlas
4,"The fox walked from the city into the forest, ...",natural habitat
...,...,...
9736,What would someone need to do if he or she wan...,telling all
9737,Where might you find a chair at an office?,cubicle
9738,Where would you buy jeans in a place with a la...,shopping mall
9739,John fell down the well. he couldn't believe ...,fairytale


## Data Paths
---

In [2]:
import os

base_path = 'gdrive/MyDrive/TSAI_END2/Session7/Assignment2'
data_path = ['https://s3.amazonaws.com/commensenseqa/train_rand_split.jsonl', 'https://s3.amazonaws.com/commensenseqa/dev_rand_split.jsonl']

## Create NLP Pipeline
---


### Import libraries and set seed
---

In [3]:
%cd 'gdrive/MyDrive/TSAI_END2/Session7/Assignment2'

/content/gdrive/MyDrive/TSAI_END2/Session7/Assignment2


In [4]:
from nlp_seq2seq_api import *

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy import data
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

In [6]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
# !pip install spacy --upgrade

### Data Loading and Model Building
---

In [7]:
model_params = {'enc_emb_dim': 256, 'dec_emb_dim': 256, 'hid_dim': 512, 'n_layers': 2, 'enc_dropout': 0.5, 'dec_dropout': 0.5}
params = {'data_path': data_path, 'data_name': 'commonsense', 'model_name': 'lstm encoder-decoder sequence model', 'model_params': model_params, 'seed': SEED, 'batch_size': 128, 'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')}

nlp_pipeline = NLPSeq2SeqPipeline(**params)

Loading data...
Number of training examples: 7673
Number of testing examples: 3289
Unique tokens in source vocabulary: 3989
Unique tokens in target vocabulary: 1576
Sample Data:-
                                                 src              trg
0  The sanctions against the school were a punish...           ignore
1  Sammy wanted to go to where the people were.  ...  populated areas
2  To locate a choker not located in a jewelry bo...    jewelry store
3  Google Maps and other highway and street GPS s...            atlas
4  The fox walked from the city into the forest, ...  natural habitat
Data is loaded


Loading model...
Model Loaded...
Model Structure:- 
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(3989, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(1576, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_featur

### Model Training
---

In [8]:
nlp_pipeline.train_model(10, './saved_models/CommonsenseQAData')

Epoch: 01 | Time: 0m 2s
	Train Loss: 4.370 | Train PPL:  79.055
	 Val. Loss: 3.704 |  Val. PPL:  40.627
Epoch: 02 | Time: 0m 1s
	Train Loss: 3.981 | Train PPL:  53.564
	 Val. Loss: 3.669 |  Val. PPL:  39.217
Epoch: 03 | Time: 0m 1s
	Train Loss: 3.913 | Train PPL:  50.067
	 Val. Loss: 3.628 |  Val. PPL:  37.643
Epoch: 04 | Time: 0m 1s
	Train Loss: 3.816 | Train PPL:  45.423
	 Val. Loss: 3.659 |  Val. PPL:  38.823
Epoch: 05 | Time: 0m 1s
	Train Loss: 3.812 | Train PPL:  45.256
	 Val. Loss: 3.659 |  Val. PPL:  38.832
Epoch: 06 | Time: 0m 1s
	Train Loss: 3.742 | Train PPL:  42.173
	 Val. Loss: 3.683 |  Val. PPL:  39.761
Epoch: 07 | Time: 0m 1s
	Train Loss: 3.712 | Train PPL:  40.929
	 Val. Loss: 3.706 |  Val. PPL:  40.707
Epoch: 08 | Time: 0m 1s
	Train Loss: 3.656 | Train PPL:  38.712
	 Val. Loss: 3.758 |  Val. PPL:  42.873
Epoch: 09 | Time: 0m 1s
	Train Loss: 3.605 | Train PPL:  36.764
	 Val. Loss: 3.709 |  Val. PPL:  40.801
Epoch: 10 | Time: 0m 1s
	Train Loss: 3.534 | Train PPL:  34.275


### Model Evaluation
---

In [9]:
nlp_pipeline.evaluate_model()

| Test Loss: 3.723 | Test PPL:  41.378 |
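The loss and PPL columns in the logs appear to be related by exponentiation: the logged pairs are consistent with perplexity being `exp` of the average cross-entropy loss. The sketch below assumes that relationship (the pipeline's own code is not shown here, so this is an inference) and uses a hypothetical `perplexity` helper.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the average cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Test loss as printed in the log above (already rounded to three decimals).
test_ppl = perplexity(3.723)
print(f"Test PPL: {test_ppl:.2f}")
```

Because the logged loss is rounded, the recomputed value only approximately matches the logged 41.378.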
