# Generating an artificial dataset

---

Generating a synthetic dataset with the generative QA model.

This notebook takes the output of the previous notebook as input: a zip containing `train.pkl` and `test.pkl`. Only `train.pkl` is used to generate the questions and answers.

Note that `train.pkl` contains only the `id` and the unique `context` fields.

----

#### **Input**

train.pkl :
 - id -> (string)
 - context -> (string)



---

#### **Output**

The output of this notebook will be `{dataset_name}_syntatic_train.pkl` with the following data:


 - id -> (string from the input data)
 - context -> (string from the train set)
 - question -> (string)
 - answers -> (dict with the `text` and `answer_start` keys)
  ```json
     {   
      'text': ['ability to give rise to a new individual plant'],
      'answer_start': [135]
     }
  ``` 
Note: if an entry has no answers, `text` and `answer_start` are both empty lists.
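
As a rough illustration of the format above, the sketch below builds two records, one with an answer and one without, and checks that each `answer_start` really points at its `text` inside the `context`. The record contents and the helper name `answers_are_consistent` are hypothetical and are not part of the notebook.

```python
def answers_are_consistent(record):
    """Return True if every answer text sits at its recorded offset in the context."""
    context = record["context"]
    texts = record["answers"]["text"]
    starts = record["answers"]["answer_start"]
    if len(texts) != len(starts):
        return False
    return all(context[s:s + len(t)] == t for t, s in zip(texts, starts))


# Hypothetical example data, not taken from the dataset.
context = "Seeds have the ability to sprout."
answer = "ability to sprout"

record_with_answer = {
    "id": "example-id",
    "context": context,
    "question": "What do seeds have?",
    "answers": {"text": [answer], "answer_start": [context.find(answer)]},
}

# An entry without answers keeps both lists empty.
record_without_answer = {
    "id": "example-id",
    "context": context,
    "question": "Who planted the seeds?",
    "answers": {"text": [], "answer_start": []},
}

print(answers_are_consistent(record_with_answer))
print(answers_are_consistent(record_without_answer))
```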

---

References:

 - [git with the model for generating the data](https://github.com/patil-suraj/question_generation.git)


 - Check the topics generated by the extraction model: [notebook](https://colab.research.google.com/drive/1uep0brNBf70fwVTw2_GDHDVP4dP-eke3#scrollTo=ZRYmZ40hpbGc)

---


In [None]:
## parameters ##

dataset_name = 'tweet_qa'
random_state = 42


dir_path = f'/content/drive/Shareddrives/question gen 2/pipe similarity fold extraction - gen issues - 7/{dataset_name}/'
data_dir_path = f'{dir_path}data/'


input_file_path = f'{data_dir_path}{dataset_name}_train_form.pkl'
output_file_path = f'{data_dir_path}{dataset_name}_syntatic_train.pkl'

In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%%capture
!pip install gdown
!pip install -U transformers==3.0.0
!python -m nltk.downloader punkt
!git clone https://github.com/patil-suraj/question_generation.git

In [None]:
%cd question_generation

/content/question_generation


In [None]:
import os 
import shutil
from sklearn.model_selection import train_test_split
from pipelines import pipeline
from tqdm import tqdm
import pandas as pd
import re

In [None]:
df = pd.read_pickle(input_file_path)[['id', 'context']]
contexts = df['context'].unique()

In [None]:
df.head(1)

| | id | context |
|---|---|---|
| 0 | 0c871b7e5320d0816d5b2979d67c2649 | Our prayers are with the students, educators &... |


In [None]:
nlp = pipeline("question-generation")


In [None]:
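ans_questions_map = {}
for context in tqdm(contexts):
  # Default to an empty list so a failed generation still yields an entry
  L_questions = []
  try:
    L_questions = nlp(context)
  except Exception:
    # Skip contexts the pipeline cannot handle; they keep the empty list
    pass
  ans_questions_map[context] = L_questions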

100%|██████████| 5699/5699 [42:01<00:00,  2.26it/s]
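
The loop above silently drops any context on which the pipeline raises. A possible variation, sketched below, records which contexts failed and why, so they can be inspected afterwards. `generate_all` and `fake_generate` are hypothetical names; `fake_generate` is only a stand-in for `nlp` so that the sketch runs without the model.

```python
def generate_all(contexts, generate):
    """Run `generate` on each context, recording failures instead of hiding them."""
    results, failures = {}, {}
    for context in contexts:
        try:
            results[context] = generate(context)
        except Exception as exc:
            # Keep going, but remember which context failed and why.
            results[context] = []
            failures[context] = repr(exc)
    return results, failures


def fake_generate(context):
    """Hypothetical stand-in for the question-generation pipeline."""
    if not context.strip():
        raise ValueError("empty context")
    return [{"question": "What is this?", "answer": context.split()[0]}]


results, failures = generate_all(["Hello world", "   "], fake_generate)
print(f"{len(failures)} of {len(results)} contexts failed")
```

In the notebook, `generate` would be the `nlp` pipeline object and `contexts` the array of unique contexts.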


In [None]:
L_id = []
L_context = []
L_question = []
L_possible_ans = []
for _, row in df.iterrows():
  for cur_pred in ans_questions_map[row['context']]:  
    L_id.append(row['id'])
    L_context.append(row['context'])
    L_question.append(cur_pred['question'])
    L_possible_ans.append(cur_pred['answer'])


In [None]:
df_question_gen = pd.DataFrame({'id' : L_id,
                                'context' : L_context,
                                'question' : L_question,
                                'answers': L_possible_ans})

In [None]:
df_question_gen.head(3)

| | id | context | question | answers |
|---|---|---|---|---|
| 0 | 0c871b7e5320d0816d5b2979d67c2649 | Our prayers are with the students, educators &... | Where are our prayers with students, educators... | Independence High School |
| 1 | 0c871b7e5320d0816d5b2979d67c2649 | Our prayers are with the students, educators &... | Who is the name of the #PatriotPride? | Doug Ducey |
| 2 | 0c871b7e5320d0816d5b2979d67c2649 | Our prayers are with the students, educators &... | Who is the name of the #PatriotPride? | Doug Ducey |
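
Rows 1 and 2 in this preview are identical. The notebook keeps such repeats; if they are unwanted (an assumption, not something the notebook does), they could be dropped while preserving order. The sketch below uses plain Python on hypothetical rows, and `drop_duplicate_rows` is an illustrative helper.

```python
def drop_duplicate_rows(rows, keys=("id", "question", "answers")):
    """Keep the first occurrence of each (id, question, answers) combination."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


# Hypothetical rows mirroring the shape of the preview above.
rows = [
    {"id": "a", "question": "Who?", "answers": "Doug Ducey"},
    {"id": "a", "question": "Who?", "answers": "Doug Ducey"},
    {"id": "a", "question": "Where?", "answers": "Independence High School"},
]

unique_rows = drop_duplicate_rows(rows)
print(len(rows), "->", len(unique_rows))
```

On the DataFrame itself, `df_question_gen.drop_duplicates(subset=['id', 'question', 'answers'])` should have the same effect while `answers` is still a plain string.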


## Format the data to the correct output format

---

 - id -> 5709630f200fba1400367f2b  (string)
 - context -> the context (string)
 - question -> the question (string)
 - answers :
  ```json
     {   
      'text': ['ability to give rise to a new individual plant'],
      'answer_start': [135]
     }
  ```

---


In [None]:
import numpy as np

In [None]:
def find_string_in_text(cur_sub_str, cur_text):
  """Return the index of the first case-insensitive match of cur_sub_str in cur_text, or -1."""
  lower_cur_sub_str = cur_sub_str.lower()
  lower_cur_text = cur_text.lower()
  return lower_cur_text.find(lower_cur_sub_str)

def format_answer_col(cur_df, ans_col_name, text_col_name='text'):
  """Build one {'text': [...], 'answer_start': [...]} dict per row of cur_df."""
  formated_ans_list = []

  all_ans   = cur_df[ans_col_name].values
  all_texts = cur_df[text_col_name].values

  for cur_ans_list, cur_text in zip(all_ans, all_texts):
    temp_formated_ans_list = { 'text':[], 'answer_start' : []}
    # A cell may hold several answers (list / array) or a single answer string
    if isinstance(cur_ans_list, (list, np.ndarray)):
      cur_answers = cur_ans_list
    else:
      cur_answers = [cur_ans_list]

    for cur_ans in cur_answers:
      start_pos = find_string_in_text(cur_ans, cur_text)
      # Keep only answers that actually occur in the text
      if start_pos != -1:
        temp_formated_ans_list['text'].append(cur_ans)
        temp_formated_ans_list['answer_start'].append(start_pos)

    formated_ans_list.append(temp_formated_ans_list)
  return formated_ans_list
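
To make the lookup concrete, the sketch below repeats `find_string_in_text` and applies it to a hypothetical context. It shows that the match ignores case, that a missing answer yields `-1`, and that the exact span can be recovered by slicing the context at the returned offset.

```python
def find_string_in_text(cur_sub_str, cur_text):
    """Case-insensitive search; returns the start index, or -1 if absent."""
    return cur_text.lower().find(cur_sub_str.lower())


# Hypothetical context, loosely modelled on a tweet.
context = "KAINE IS ABLE!!! Cory Booker, July 23, 2016"

# The generated answer may differ in case from the context.
start = find_string_in_text("kaine", context)

# Slicing the original context returns the span exactly as written there.
exact_span = context[start:start + len("kaine")]

# An answer that does not occur in the context is reported as -1.
missing = find_string_in_text("Doug Ducey", context)

print(start, exact_span, missing)
```

Because the search runs on lowercased copies, the stored `text` keeps the model's casing rather than the context's. For plain ASCII text the offsets carry over unchanged, but a few non-ASCII characters change length when lowercased, which could shift the reported position.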


In [None]:
df_question_gen['answers'] =  format_answer_col(df_question_gen, 'answers', 'context')
df_train_extractive = df_question_gen[df_question_gen['answers'].apply(lambda x: len(x['text']) > 0)]

In [None]:
df_train_extractive.head(5)

| | id | context | question | answers |
|---|---|---|---|---|
| 0 | 0c871b7e5320d0816d5b2979d67c2649 | Our prayers are with the students, educators &... | Where are our prayers with students, educators... | {'text': ['Independence High School'], 'answer... |
| 1 | 0c871b7e5320d0816d5b2979d67c2649 | Our prayers are with the students, educators &... | Who is the name of the #PatriotPride? | {'text': ['Doug Ducey'], 'answer_start': [140]} |
| 2 | 0c871b7e5320d0816d5b2979d67c2649 | Our prayers are with the students, educators &... | Who is the name of the #PatriotPride? | {'text': ['Doug Ducey'], 'answer_start': [140]} |
| 3 | d16eb85d141d5a87bfbc438afbcf50aa | KAINE IS ABLE!!!— Cory Booker (@CoryBooker) Ju... | What is ABLE? | {'text': ['KAINE'], 'answer_start': [0]} |
| 4 | d16eb85d141d5a87bfbc438afbcf50aa | KAINE IS ABLE!!!— Cory Booker (@CoryBooker) Ju... | When did Cory Booker arrive? | {'text': ['July 23, 2016'], 'answer_start': [44]} |


In [None]:
print(f"total size of the artificial dataset {len(df_train_extractive)} ")

total size of the artificial dataset 12944 


In [None]:
df_train_extractive[['id', 'context', 'question', 'answers']].to_pickle(output_file_path)