# Question Generation Task using T5

In this task, the T5 model is asked to generate relevant questions when given a context. We will use Google's pretrained T5-small model and fine-tune it on our task.

## 1. Navigating to the resource folder and importing a pretrained Instance of T5-Small

In [2]:
cd /content/drive/MyDrive/Colab Notebooks/T5_Question_Generation

/content/drive/MyDrive/Colab Notebooks/T5_Question_Generation


In [2]:
ls

[0m[01;34mcache_dir[0m/  [01;34moutputs[0m/   [01;34mruns[0m/                         [01;34mwandb[0m/
[01;34mdata[0m/       [01;34mresource[0m/  T5_Question_Generation.ipynb


## 2. Importing Simple Transformers
We will be using the [Simple Transformers library](https://github.com/ThilinaRajapakse/simpletransformers), which is built on [Hugging Face Transformers](https://github.com/huggingface/transformers), to train the T5 model.
The instructions given below will install all the requirements.
- Install Anaconda or Miniconda Package Manager from [here](https://www.anaconda.com/products/individual).
- Create a new virtual environment and install packages.
  - conda create -n simpletransformers python
  - conda activate simpletransformers
  - conda install pytorch cudatoolkit=10.1 -c pytorch
- Install simpletransformers.
  - pip install simpletransformers

**NOTE** - The first two steps are necessary only if you choose to run the files on your local system.
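
Once the packages are installed, a quick check like the one below can confirm what the environment actually contains. This is a minimal sketch that uses only the standard library (Python 3.8+); the helper name `installed_version` is hypothetical and not part of Simple Transformers.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this notebook relies on.
for name in ("simpletransformers", "transformers", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```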


In [2]:
pip install simpletransformers

Collecting simpletransformers
  Downloading https://files.pythonhosted.org/packages/35/ef/0b70ae95138064d665d9298c4d96afba2edf4b86dc44f762807ceb12668e/simpletransformers-0.61.4-py3-none-any.whl (213kB)
Collecting transformers>=4.2.0
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/9d/2d/233c79d5b4e5ab1dbf111242299153f3caddddbb691219f363ad55ce783d/seqeval-1.2.2.tar.gz (43kB)
Collecting tokenizers
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)

## 3. Importing The Dataset

We will be using the [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/index.html) dataset, which contains (among other things) descriptions of the various products on Amazon and question-answer pairs related to those products.

The descriptions and the question-answer pairs have to be downloaded separately. You can either download both sets of files manually or run the provided shell script, `download_data.sh`. The categories used in this study are listed below.

1. Appliances
2. Arts_Crafts_and_Sewing
3. Automotive
4. Beauty Product

(The number of categories has been limited to 4 due to the limitations of the Google Colab GPU.)

In [None]:
!chmod +x download_data.sh

In [None]:
!./download_data.sh

--2021-04-07 06:33:22--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_All_Beauty.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10032305 (9.6M) [application/octet-stream]
Saving to: ‘meta_All_Beauty.json.gz’


2021-04-07 06:33:24 (6.02 MB/s) - ‘meta_All_Beauty.json.gz’ saved [10032305/10032305]

--2021-04-07 06:33:24--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Appliances.json.gz
Reusing existing connection to deepyeti.ucsd.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 59568961 (57M) [application/octet-stream]
Saving to: ‘meta_Appliances.json.gz’


2021-04-07 06:33:28 (16.9 MB/s) - ‘meta_Appliances.json.gz’ saved [59568961/59568961]

--2021-04-07 06:33:28--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Arts_Crafts_and_Sewing.json.gz
Reusing existing connection to deepyeti.ucsd.

## 4. Preprocessing The Data

We can process the data files and save them in a convenient format using the script given below, which is adapted from the helpful scripts given on the [Amazon Review Data page](https://nijianmo.github.io/amazon/index.html). It also splits the data into train and evaluation sets.

In the following cell, we convert our data into train and evaluation DataFrames, each containing two columns, `input_text` and `target_text`, which correspond to the product description and a related question respectively. The code also adds a constant `prefix` column, which is explained in the training section. Once the DataFrames are created, we export them to two .tsv files, `train_df.tsv` and `eval_df.tsv`, which will be used later to train and evaluate the model. The split holds out 5% of the rows for evaluation (`test_size=0.05`).
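
To make the merge step concrete, here is a minimal sketch on toy data. The rows are made up for illustration, and the variable names (`qa`, `meta`, `toy_df`) are hypothetical; only the column names and the join on `asin` follow the real script.

```python
import pandas as pd

# Toy stand-ins for one category's QA and metadata files (values are made up).
qa = pd.DataFrame({
    "asin": ["A1", "A1", "B2"],
    "question": ["Is it waterproof?", "What size is it?", "Does it need batteries?"],
    "answer": ["No", "Medium", "Yes"],
})
meta = pd.DataFrame({
    "asin": ["A1"],
    "description": ["A compact travel umbrella."],
})

# Left-join questions to descriptions on the product ID, then drop
# questions whose product has no description (B2 here).
merged = pd.merge(qa, meta, on="asin", how="left").dropna()

# Rename to the column names the T5 model expects and add the task prefix.
toy_df = merged[["question", "description"]].copy()
toy_df.columns = ["target_text", "input_text"]
toy_df["prefix"] = "ask_question"

print(toy_df)
```

Each surviving row pairs one question (`target_text`) with the description of its product (`input_text`), so a product with several questions contributes several training examples.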

In [None]:
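import ast
import gzip
import os

import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm


def parse(path):
    # Each line holds one record that is parsed as a Python literal.
    # ast.literal_eval parses literals only, so unlike eval() it cannot
    # execute arbitrary code from the downloaded files.
    with gzip.open(path, "rt", encoding="utf-8") as g:
        for line in g:
            yield ast.literal_eval(line)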


def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1

    return pd.DataFrame.from_dict(df, orient='index')


categories = [category[3:] for category in os.listdir("data") if category.endswith(".gz") and category.startswith("qa")]

for category in tqdm(categories):
    if not os.path.isfile(f"data/{category.split('.')[0]}.tsv"):
        try:
            df1 = getDF(f'data/qa_{category}')
            df2 = getDF(f'data/meta_{category}')

            df = pd.merge(df1, df2, on="asin", how="left")
            df = df[["question", "answer", "description"]]
            df = df.dropna()
            df = df.drop_duplicates(subset="answer")
            print(df.head())

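            df.to_csv(f"data/{category.split('.')[0]}.tsv", sep="\t")
        except Exception as e:
            # Report the failure instead of silently skipping the category.
            print(f"Skipping {category}: {e}")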

df = pd.concat((pd.read_csv(f"data/{f}", sep="\t") for f in os.listdir("data") if f.endswith(".tsv")))
df = df[["question", "description"]]
df["description"] = df["description"].apply(lambda x: x[2:-2])
df.columns = ["target_text", "input_text"]
df["prefix"] = "ask_question"

df.to_csv("data/data_all.tsv", sep="\t")

train_df, eval_df = train_test_split(df, test_size=0.05)

train_df.to_csv("data/train_df.tsv", sep="\t")
eval_df.to_csv("data/eval_df.tsv", sep="\t")

  0%|          | 0/3 [00:00<?, ?it/s]

                                              question  ...                                        description
112                           Is Carmol #40 available?  ...  [<P><B>CARMOL 20:<BR></B><UL><LI>20% Carbamide...
114                        do you need a prescription?  ...  [<P><B>CARMOL 20:<BR></B><UL><LI>20% Carbamide...
687                      Can this be used on wet hair?  ...  [Helen of Troy 1514 Brush Iron, White, 1 1/2 I...
689  has anyone else had a lot of the "brisles" bre...  ...  [Helen of Troy 1514 Brush Iron, White, 1 1/2 I...
691  Does it work for really long hair? I have fine...  ...  [Helen of Troy 1514 Brush Iron, White, 1 1/2 I...

[5 rows x 3 columns]
                                             question  ...                                        description
46                                  filter for vicks3  ...  [Keep your air humidifier operating at peak ef...
48  I have not purchased the humidifier that uses ...  ...  [Keep your air humidifier operat

## 5. Training The T5 Model (t5-small)

The input data to a T5 model should ideally be a DataFrame containing 3 columns as shown below.
- prefix: A string indicating the task to perform.
- input_text: The input text sequence.
- target_text: The target sequence.

Internally, Simple Transformers will build the properly formatted input and target sequences (shown below) from the Pandas DataFrame.
The input to a T5 model has the following pattern:

 `"<prefix>: <input_text> </s>"`

The target sequence has the following pattern:

`"<target_sequence> </s>"`

The prefix value specifies the task we want the T5 model to perform. To train a T5 model to perform a new task, we simply train the model while specifying an appropriate prefix. In this case, we will be using the prefix `ask_question`; that is, every row in our DataFrame will have the value `ask_question` in the prefix column.
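
The patterns above can be illustrated with a short sketch. This is only an illustration of the quoted patterns, not Simple Transformers' actual internals: the function names `format_input` and `format_target` are hypothetical, the example row is made up, and how the end-of-sequence token is really appended may differ between library versions.

```python
def format_input(prefix, input_text):
    """Build an input sequence following the pattern "<prefix>: <input_text> </s>"."""
    return f"{prefix}: {input_text} </s>"


def format_target(target_text):
    """Build a target sequence following the pattern "<target_sequence> </s>"."""
    return f"{target_text} </s>"


# A made-up row with the three columns described above.
row = {
    "prefix": "ask_question",
    "input_text": "A blue ceramic mug.",
    "target_text": "Is it dishwasher safe?",
}

print(format_input(row["prefix"], row["input_text"]))
print(format_target(row["target_text"]))
```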

The model's training loss and final evaluation loss are logged with the WandB library, and the plots can be found in the plots folder in the repository home.

The model arguments are explored in depth in other notebooks.

In [None]:
import pandas as pd

from simpletransformers.t5 import T5Model


train_df = pd.read_csv("data/train_df.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/eval_df.tsv", sep="\t").astype(str)

model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 128,
    "train_batch_size": 8,
    "num_train_epochs": 1,
    "save_eval_checkpoints": True,
    "save_steps": -1,
    "use_multiprocessing": False,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 15000,
    "evaluate_during_training_verbose": True,
    "fp16": False,

    "wandb_project": "Question Generation with T5",
}

model = T5Model("t5", "t5-small", args=model_args)

model.train_model(train_df, eval_data=eval_df)

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/242M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

  0%|          | 0/14770 [00:00<?, ?it/s]



Using Adafactor for T5


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: ··········
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Running Epoch 0 of 1:   0%|          | 0/1847 [00:00<?, ?it/s]

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg_sq_row.mul_(beta2t).add_(1.0 - beta2t, update.mean(dim=-1))


  0%|          | 0/778 [00:00<?, ?it/s]



(1847,
 {'eval_loss': [3.4606100028874924],
  'global_step': [1847],
  'train_loss': [3.479576349258423]})
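
The progress-bar totals in the output above can be cross-checked with a little arithmetic. The sketch below assumes that 14770 is the number of training rows and that the last partial batch is kept; both are inferences from the log, not something the library output states.

```python
import math

train_rows = 14770      # total shown by the first progress bar (assumed: training rows)
train_batch_size = 8    # from model_args
num_train_epochs = 1    # from model_args

# One optimizer step per batch; the final partial batch still counts as a step.
steps_per_epoch = math.ceil(train_rows / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 1847 1847
```

This matches the `0/1847` total of the epoch progress bar and the reported `global_step` of 1847. Because `evaluate_during_training_steps` is 15000, which is larger than 1847, no step-based evaluation is triggered during this run; the single `eval_loss` value was presumably computed at the end of the epoch.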

## 6. Evaluating The Model

Evaluating a language generation model is a little more complicated than evaluating something like a classification model. This is because there is no right answer you can compare against like you could with a classification model. The evaluation dataset contains descriptions and the questions that people have asked about those products, but that doesn’t mean that those are the only right questions you can ask.

So we run the model on the evaluation data and store the generated questions to see how well it performs at the task.

In [None]:
from simpletransformers.t5 import T5Model
import pandas as pd
from pprint import pprint


model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 128,
    "eval_batch_size": 128,
    "num_train_epochs": 1,
    "save_eval_checkpoints": False,
    "use_multiprocessing": False,
    "num_beams": None,
    "do_sample": True,
    "max_length": 50,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 3,
}

model = T5Model("t5","outputs/best_model", args=model_args)

df = pd.read_csv("data/eval_df.tsv", sep="\t").astype(str)
preds = model.predict(
    ["ask_question: " + description for description in df["input_text"].tolist()]
)

questions = df["target_text"].tolist()

with open("test_outputs_large/generated_questions_sampling.txt", "w") as f:
    for i, desc in enumerate(df["input_text"].tolist()):
        pprint(desc)
        pprint(preds[i])
        print()

        f.write(str(desc) + "\n\n")

        f.write("Real question:\n")
        f.write(questions[i] + "\n\n")

        f.write("Generated questions:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write("________________________________________________________________________________\n")

Generating outputs:   0%|          | 0/7 [00:00<?, ?it/s]



Decoding outputs:   0%|          | 0/2334 [00:00<?, ?it/s]

FileNotFoundError: ignored

## 7. Generating Questions

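The `predict()` method of a Simple Transformers T5 model is used to generate the predictions or, in our case, the questions. Here, we generate 3 questions for each description in the `eval_df` dataset (`num_return_sequences` is set to 3). The write step in the previous cell failed with a `FileNotFoundError`, presumably because the `test_outputs_large` directory does not exist, so the cell below reuses `preds` and `questions` and writes to the existing `outputs` folder instead.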
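
The counts in the evaluation output are consistent with this setup. As a rough check, assuming the 778 shown in the training log is the number of rows in `eval_df`:

```python
import math

eval_rows = 778             # assumed size of eval_df (from the evaluation progress bar)
eval_batch_size = 128       # from model_args
num_return_sequences = 3    # from model_args

# Descriptions are generated in batches; every description yields three sequences.
generation_batches = math.ceil(eval_rows / eval_batch_size)
decoded_outputs = eval_rows * num_return_sequences

print(generation_batches, decoded_outputs)  # 7 2334
```

These values line up with the `0/7` and `0/2334` totals of the "Generating outputs" and "Decoding outputs" progress bars. It also explains why the writing loop iterates over `preds[i]`: each entry of `preds` is a list of three generated questions for one description.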

In [None]:
with open("outputs/generated_questions_sampling.txt", "w") as f:
    for i, desc in enumerate(df["input_text"].tolist()):
        pprint(desc)
        pprint(preds[i])
        print()

        f.write(str(desc) + "\n\n")

        f.write("Real question:\n")
        f.write(questions[i] + "\n\n")

        f.write("Generated questions:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write("________________________________________________________________________________\n")

Streaming output truncated to the last 5000 lines.
 '\\n\\n\\n\\n\\n\\n\\n\\n\\n    \\n    \\n        \\n    '
 '\\n\\n\\n\\n\\n\\n    \\n        <div class="a-text-center">\\n            '
 '<a class="a-link-normal" target="_blank" rel="noopener" '
 'href="https://m.media-amazon.com/images/S/aplus-media/vc/48a7a4e7-a0ac-45c7-8494-fbab7cd99a16.jpg">\\n                '
 "View larger\\n            </a>\\n        </div>', '<a "
 'class="a-link-normal" target="_blank" rel="noopener" '
 'href="https://m.media-amazon.com/images/S/aplus-media/vc/8fc20ff9-be98-4dcd-809a-c4df26c55694.jpg">\\n            '
 '<img alt="" '
 'src="https://m.media-amazon.com/images/S/aplus-media/vc/8fc20ff9-be98-4dcd-809a-c4df26c55694._SL220__.jpg">\\n        '
 '</a>\\n    \\n    \\n\\n\\n                                        '
 '\\n\\n\\n\\n\\n\\n\\n\\n\\n    \\n    \\n        \\n    '
 '\\n\\n\\n\\n\\n\\n    \\n        <div class="a-text-center">\\n            '
 '<a class="a-link-normal" target

## 8. Model Performance

I’ve shuffled the generated questions with the actual question from the dataset. There are 4 questions for each description: 3 are generated and 1 is the original. The harder it is to tell the generated questions apart from the original one, the better the model has performed.
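
One possible way to build such a shuffled set is sketched below. The notebook does not show how the shuffling was actually done, so the helper `build_quiz` is hypothetical; the questions are taken from Sample 2.

```python
import random


def build_quiz(original_question, generated_questions, seed=None):
    """Shuffle the original question in with the generated ones.

    Returns the shuffled list and the index of the original question in it.
    If a generated question is identical to the original, the index of the
    first match is returned.
    """
    rng = random.Random(seed)
    candidates = [original_question] + list(generated_questions)
    rng.shuffle(candidates)
    return candidates, candidates.index(original_question)


original = "What are the dimensions of the two storage compartments? Thanks!"
generated = [
    'Is there any way to adjust the height in this unit or does the width need to adjust itself if my TV is not 32"?',
    "How tall are the shelves? I have a tall receiver and want to be sure it will fit.",
    "Can the drawers be removed or are they fixed?",
]

shuffled, answer_index = build_quiz(original, generated, seed=0)
for question in shuffled:
    print("-", question)
print("Original question is item", answer_index + 1)
```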

***Sample 1***

- **_Description:_**
  - Basic EZ Change system provides clear, clean, great-tasting water for fridgerator or icemaker systems. The included RC-EZ-1 basic filtration cartridge can filter 3,000 gallons and last up to 12 months. Avoid the mess of changing traditional cartridges and eliminates the need for buckets, towels, wrenches, and other tools.

- **_Questions:_**
  - Does this filter connect to this faucet with this kit?
  - Is this filter waterproof?
  - Will this filter fit my Kenmore FW2100?
  - What is the weight of this item?

***Sample 2***

- **_Description:_**
  - Elegant and sleek, this TV Stand gives a new look to your home. Finished in a dark Espresso color. Two sliding doors. Four Sections for storage.

- **_Questions:_**
  - Is there any way to adjust the height in this unit or does the width need to adjust itself if my TV is not 32"?
  - How tall are the shelves? I have a tall receiver and want to be sure it will fit.
  - What are the dimensions of the two storage compartments? Thanks!
  - Can the drawers be removed or are they fixed?

**_Answers (original questions):_**

1. Sample 1 - Does this filter connect to this faucet with this kit?
2. Sample 2 - What are the dimensions of the two storage compartments? Thanks!

## 9. Conclusion

We were able to successfully fine-tune the T5-small model for the task of question generation. Despite being trained for only a single epoch (final training loss of about 3.48 and evaluation loss of about 3.46), the model generated plausible, on-topic questions, as the samples above show. The generated questions can be found in the `outputs` folder in the file `generated_questions_sampling.txt`, and the model's training and validation loss plots can be found in the plots folder.