## Coding Portfolio Description

This code snippet is from the first coding portfolio. It sets up several pipelines for natural language processing tasks using the Hugging Face `transformers` library. These tasks include translation, paraphrasing, and question answering.

- **Translation Pipelines**: Two translation pipelines are initialized to translate between German and English using pre-trained models.
- **Paraphrase Pipeline**: A pipeline is set up to generate paraphrases using the Pegasus model.
- **Question Answering Pipeline**: A pipeline is created for answering questions using the UnifiedQA model.

The code defines various functions for formatting different types of questions and their contexts, such as extractive and abstractive question answering, multiple choice questions, yes/no questions, etc.
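
As a quick illustration of the multiple-choice layout, the sketch below reproduces the lettering scheme used by the formatting helpers in this notebook (`chr(97 + i)` yields `a`, `b`, `c`, and so on). The function name `label_choices` and the sample question are hypothetical and only serve the example.

```python
from typing import List


def label_choices(choices: List[str]) -> str:
    """Prefix each choice with a lowercase letter in parentheses: (a), (b), ..."""
    # chr(97) is "a", so index 0 maps to "(a)", index 1 to "(b)", and so on
    return " ".join(f"({chr(97 + i)}) {choice}" for i, choice in enumerate(choices))


# Hypothetical sample question, not taken from the notebook
question = "Which is the best conductor?"
formatted = f"{question} \n {label_choices(['iron', 'feather', 'wood'])}"
print(repr(formatted))
```

Printing with `repr` makes the separator between question and choices visible, which helps when checking what is actually passed to the question answering pipeline.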

Utility functions are provided for identifying similar answers and for finding the most frequent string in a list of strings.

Finally, an `answer` function is defined, which takes a question, a context (if any), and choices (if multiple choice) as input. It first translates the question, context, and choices from German to English, generates alternative phrasings of the question, formats each phrasing, and then attempts to answer all of them. If similar answers are found among the generated answers, it returns the most common answer translated back into the original language; otherwise, it prints a message indicating that no common answer was found.
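
The consensus step at the end of `answer` can be sketched in isolation. The version below is an illustration rather than the notebook's code: it assumes answers should be compared case-insensitively and without surrounding whitespace, and the name `consensus_answer` and the `min_votes` parameter are hypothetical. The default threshold of three mirrors the `similar_answers` helper defined in the code cell that follows.

```python
from collections import Counter
from typing import List, Optional


def consensus_answer(answers: List[str], min_votes: int = 3) -> Optional[str]:
    """Return the most frequent answer if it occurs at least min_votes times."""
    # Assumption: normalise case and whitespace so "Paris" and "paris " count together
    normalised = [a.strip().lower() for a in answers]
    if not normalised:
        return None
    best, votes = Counter(normalised).most_common(1)[0]
    return best if votes >= min_votes else None


print(consensus_answer(["Paris", "paris ", "PARIS", "Rome", "Berlin"]))
print(consensus_answer(["a", "b", "c", "d", "e"]))
```

Returning `None` instead of printing a message lets the caller decide how to handle the case where the paraphrased questions do not agree.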

In [None]:
from typing import Optional, List
from transformers import pipeline

pipe_translation = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
pipe_translation_original_language = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
pipe_paraphrases = pipeline("text2text-generation", model="tuner007/pegasus_paraphrase")
pipe_generate_answer = pipeline("text2text-generation", model="allenai/unifiedqa-t5-base")

def format_extractive_qa(question: str, context: Optional[str] = None):
  return f"{question} \n {context}"

def format_abstractive_qa(question: str, context: Optional[str] = None):
  return f"{question} \n {context}"

def format_multiple_choice_qa(question: str, choices: List[str]):
  formatted_choices = " ".join([f"({chr(97 + i)}) {choice}" for i, choice in enumerate(choices)])
  return f"{question} \n {formatted_choices}"

def format_multiple_choice_qa_with_context(question: str, choices: List[str], context: Optional[str] = None):
  formatted_choices = " ".join([f"({chr(97 + i)}) {choice}" for i, choice in enumerate(choices)])
  return f"{question} \n {formatted_choices} \n {context}"

def format_yes_no_qa(question: str, context: Optional[str] = None):
  return f"{question} \n {context}"

def format_question(question: str, context: Optional[str] = None, choices: Optional[List[str]] = None):
  if choices:
    if context:
      return format_multiple_choice_qa_with_context(question, choices, context)
    else:
      return format_multiple_choice_qa(question, choices)
  elif context:
    return format_extractive_qa(question, context)
  else:
    return question  # Default to just returning the question if no context or choices provided

from collections import Counter

def similar_answers(strings):
  counts = Counter(strings)

  # Check if any string occurs at least three times
  for count in counts.values():
    if count >= 3:
      return True

  return False

def most_frequent_string(strings):
  counts = Counter(strings)

  # Find the string with the highest count
  most_common_string = max(counts, key=counts.get)

  return most_common_string


def answer(question: str, context: Optional[str] = None, choices: Optional[List[str]] = None):
  question_en = pipe_translation(question)[0]["translation_text"]
  question_alternatives = pipe_paraphrases(question_en, num_return_sequences=4)
  question_alternatives_vec = [item['generated_text'] for item in question_alternatives]
  questions_vec = [question_en] + question_alternatives_vec
  # Translate context and choices
  if context:
    context = pipe_translation(context)[0]["translation_text"]
  if choices:
    choices = [pipe_translation(choice) for choice in choices]
    choices = [item[0]['translation_text'] for item in choices]

  # format questions
  questions_formatted_vec = []
  for question in questions_vec:
    formatted_question = format_question(question, context, choices)
    questions_formatted_vec.append(formatted_question)

  # answer questions
  questions_answered_vec = []
  for question in questions_formatted_vec:
    question_answered = pipe_generate_answer(question)
    questions_answered_vec.append(question_answered)
  questions_answered_vec = [item[0]['generated_text'] for item in questions_answered_vec]

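  if similar_answers(questions_answered_vec):
    # Translate the most common answer back into the original language (German)
    most_common = most_frequent_string(questions_answered_vec)
    return pipe_translation_original_language(most_common)[0]["translation_text"]
  else:
    print("No common answer found")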

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at tuner007/pegasus_paraphrase and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
!pip install transformers[torch]
!pip install datasets

Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.27.2


## Approach 1 Description

This code demonstrates an approach to generate questions and answers from a given context. It uses the NLTK library for natural language processing and the Datasets library to load a dataset. The dataset used here is the SciQ dataset from AllenAI.

- **Preprocessing**: The NLTK data is downloaded, including tokenizers and part-of-speech taggers.
- **Dataset Loading**: The SciQ dataset is loaded using the `load_dataset` function from the `datasets` library.
- **Question-Answer Generation**: The `generate_question_answer` function is defined to generate a question and its answer from a given context. It tokenizes the context, performs part-of-speech tagging, extracts nouns and proper nouns, and selects the last one as the answer. The question is built from a fixed template, "What is ...?", filled with the second-to-last whitespace-separated word of the context.

An example usage is provided where a context is given and the function generates a question and its answer, demonstrating the functionality of the approach.

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from datasets import load_dataset


# Download NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load the dataset
data = load_dataset("allenai/sciq")

def generate_question_answer(context):
    # Tokenize the context
    tokens = word_tokenize(context)

    # Part-of-speech tagging
    tagged_tokens = pos_tag(tokens)

    # Extract nouns and proper nouns from the context
    nouns = [word for word, pos in tagged_tokens if pos in ['NN', 'NNS', 'NNP', 'NNPS']]

    # Select the last noun or proper noun as the answer
    answer = nouns[-1].lower() if nouns else None

    # Generate question
    question = "What " + ("is" if answer else "are") + " " + context.split(' ')[-2] + "?"

    return question, answer

# Example usage
context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
question, answer = generate_question_answer(context)
print("Question:", question)
print("Answer:", answer)

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.18.0 dill-0.3.8 multiprocess-0.70.16


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



Generating train split:   0%|          | 0/11679 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Question: What is ocean?
Answer: basin
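
The printed question follows directly from how the subject word is chosen: the function takes the second-to-last whitespace-separated word of the context, while the final word keeps its trailing period. A short check using only the standard library (no NLTK needed) shows this:

```python
context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."

# Splitting on single spaces leaves punctuation attached to the last word
words = context.split(' ')
last_word = words[-1]   # still carries the trailing period
subject = words[-2]     # the word inserted into the question template

question = "What is " + subject + "?"
print(last_word)
print(subject)
print(question)
```

Because the subject is picked by position rather than by grammatical role, the question depends on how the context sentence happens to end.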


## Approach 2 Description with preprocessing

This code illustrates an approach with preprocessing of the SciQ dataset obtained from AllenAI. It uses the Datasets library to load the dataset.

- **Dataset Loading**: The SciQ dataset is loaded using the `load_dataset` function from the `datasets` library.

- **Preprocessing**: The code iterates over the examples in the training split of the dataset. For each example, it extracts the question, the correct answer, the distractors, and the context. It then concatenates this information into a preprocessed text format that includes the question, context, answer, and distractors. These preprocessed texts are stored in a list.

- **Saving Preprocessed Dataset**: Finally, the preprocessed dataset is saved to a file named "preprocessed_sciq_dataset.txt".

This approach preprocesses the dataset by organizing the information into a structured format, which can be useful for various downstream tasks such as training machine learning models or performing analysis on the data.
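
The record layout described above can be sketched on a single hand-written example. The record below is hypothetical (it only mimics the SciQ field names `question`, `correct_answer`, `distractor1` to `distractor3`, and `support`), and the `" | "` separator between distractors is an assumption of this sketch, not what the notebook uses: the notebook joins distractors with a plain space, which makes multi-word distractors hard to tell apart afterwards.

```python
from typing import Dict


def to_training_line(example: Dict[str, str], sep: str = " | ") -> str:
    """Flatten one SciQ-style record into a single line of training text."""
    distractors = [example[f"distractor{i}"] for i in (1, 2, 3)]
    return (
        f"Question: {example['question']} "
        f"Context: {example['support']} "
        f"Answer: {example['correct_answer']} "
        f"Distractors: {sep.join(distractors)}"
    )


# Hypothetical record that mimics the SciQ field names
example = {
    "question": "What do divergent plate boundaries produce underwater?",
    "correct_answer": "mountain ranges",
    "distractor1": "coral reefs",
    "distractor2": "sand dunes",
    "distractor3": "river deltas",
    "support": "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin.",
}

line = to_training_line(example)
print(line)
```

Passing `sep=" "` reproduces the space-joined distractor list used in the notebook.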

In [None]:
from datasets import load_dataset


# Load the SciQ dataset
sciq_dataset = load_dataset("allenai/sciq")

# Preprocess the dataset and save it to a file

preprocessed_texts = []

for example in sciq_dataset['train']:
    question = example['question']
    answer = example['correct_answer']
    distractors = [example['distractor1'], example['distractor2'], example['distractor3']]
    context = example['support']

    preprocessed_text = f"Question: {question} Context: {context} Answer: {answer} Distractors: {' '.join(distractors)}"
    preprocessed_texts.append(preprocessed_text)

# Save the preprocessed dataset to a file
with open("preprocessed_sciq_dataset.txt", "w") as file:
    file.write("\n".join(preprocessed_texts))

Training on the full dataset takes more than 20 hours, so the run below was interrupted. A small-sample version follows afterwards.

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load pre-trained tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Load the preprocessed dataset from file
file_path = "preprocessed_sciq_dataset.txt"
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=file_path,
    block_size=128
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2
)

# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./fine-tuned-model")

# Load the fine-tuned model for question generation
fine_tuned_model = GPT2LMHeadModel.from_pretrained("./fine-tuned-model")
tokenizer = GPT2Tokenizer.from_pretrained("./fine-tuned-model")

# Generate questions
context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
inputs = tokenizer.encode("Question: " + context, return_tensors="pt")
outputs = fine_tuned_model.generate(inputs, max_length=100, num_return_sequences=1, temperature=0.7)

# Decode and print the generated questions
generated_questions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
for i, question in enumerate(generated_questions):
    print(f"Generated Question {i+1}: {question}")




KeyboardInterrupt: 

The same approach, run on a small sample of ten training examples:

In [None]:
from datasets import load_dataset

# Load a subset of the SciQ dataset for testing
subset_sciq_dataset = load_dataset("allenai/sciq", split="train[:10]")  # Adjust the subset size as needed

# Preprocess and save the subset of the dataset to a file
subset_preprocessed_texts = []

for example in subset_sciq_dataset:
    question = example['question']
    answer = example['correct_answer']
    distractors = [example['distractor1'], example['distractor2'], example['distractor3']]
    context = example['support']

    # Preprocess the text as needed, e.g., concatenate question, answer, and context
    preprocessed_text = f"Question: {question} Context: {context} Answer: {answer} Distractors: {' '.join(distractors)}"
    subset_preprocessed_texts.append(preprocessed_text)

# Save the preprocessed dataset subset to a file
with open("preprocessed_subset_sciq_dataset.txt", "w") as file:
    file.write("\n".join(subset_preprocessed_texts))

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

# Load pre-trained tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Load the preprocessed dataset from file
file_path = "preprocessed_subset_sciq_dataset.txt"
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=file_path,
    block_size=128  # Adjust block size as needed
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2
)

# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./fine-tuned-model")

# Load the fine-tuned model for question generation
fine_tuned_model = GPT2LMHeadModel.from_pretrained("./fine-tuned-model")
tokenizer = GPT2Tokenizer.from_pretrained("./fine-tuned-model")

# Generate questions
context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
inputs = tokenizer.encode("Question: " + context, return_tensors="pt")
outputs = fine_tuned_model.generate(inputs, max_length=100, num_return_sequences=1, temperature=0.7)

# Decode and print the generated questions
generated_questions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
for i, question in enumerate(generated_questions):
    print(f"Generated Question {i+1}: {question}")




OSError: Can't load tokenizer for './fine-tuned-model'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './fine-tuned-model' is the correct path to a directory containing all relevant files for a GPT2Tokenizer tokenizer.

In [None]:
# Load the fine-tuned model for question generation
fine_tuned_model = GPT2LMHeadModel.from_pretrained("./fine-tuned-model")
# The Trainer was not given the tokenizer, so save_model() wrote no tokenizer
# files to "./fine-tuned-model". Fine-tuning did not change the vocabulary,
# so the original GPT-2 tokenizer can be loaded instead.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate questions
context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
inputs = tokenizer.encode("Question: " + context, return_tensors="pt")
outputs = fine_tuned_model.generate(inputs, max_length=100, num_return_sequences=1, temperature=0.7)

# Decode and print the generated questions
generated_questions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
for i, question in enumerate(generated_questions):
    print(f"Generated Question {i+1}: {question}")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Question 1: Question: Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin.

The most common type of plate is the "marine plate," which is composed of a large number of small plates that are separated by a thin layer of water. The most common type of plate is the "marine plate," which is composed of a large number of small plates that are separated by a thin layer of water.

The most common type of plate is the "marine plate," which is composed of


## Approach 3 Description

This code demonstrates an approach that uses the GPT-2 model for fine-tuning on the SciQ dataset, and then uses the fine-tuned model for question generation based on provided contexts.

- **Dataset Loading**: The SciQ dataset is loaded using the `load_dataset` function from the `datasets` library. Preprocessing steps may be applied if necessary, but are not explicitly shown in this snippet.

- **Model Loading and Tokenization**: The pre-trained GPT-2 tokenizer and language model are loaded using the `GPT2Tokenizer` and `GPT2LMHeadModel` classes from the `transformers` library.

- **Tokenization of Dataset**: A `TextDataset` is instantiated from the preprocessed text file; it tokenizes the text with the tokenizer and splits it into blocks of the given block size.

- **Training Arguments**: Training arguments such as output directory, number of epochs, batch size, and save steps are defined.

- **Data Collator**: A language modeling data collator is defined to prepare data batches for training.

- **Trainer Initialization**: The trainer is initialized with the loaded model, training arguments, and data collator, along with the training data set.

- **Fine-tuning the Model**: The model is trained using the `trainer.train()` method.

- **Saving the Fine-tuned Model**: After training, the fine-tuned model is saved.

- **Question Generation**: The fine-tuned model is loaded for question generation. A context is provided and questions are generated using the model's `generate` method.

- **Decoding and Printing Generated Questions**: The generated questions are decoded from token IDs to text using the tokenizer, and then printed.

This approach fine-tunes the GPT-2 model on the SciQ dataset so that it can generate questions from given contexts.
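
To make the role of `block_size` concrete, the sketch below mimics, in plain Python, how a long token sequence can be cut into fixed-size blocks for language-model training. It is an approximation of the idea, not the `TextDataset` implementation: the helper name `split_into_blocks` is hypothetical, the integers stand in for token IDs, and dropping the final incomplete block is an assumption of this sketch.

```python
from typing import List


def split_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Cut a flat token sequence into consecutive blocks of exactly block_size."""
    blocks = []
    # Step through the sequence one full block at a time; a trailing
    # remainder shorter than block_size is discarded (assumption of this sketch)
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks


# 300 placeholder token IDs with block_size=128 give two full blocks;
# the remaining 44 tokens do not fill a third block
blocks = split_into_blocks(list(range(300)), block_size=128)
print(len(blocks), [len(b) for b in blocks])
```

Under this scheme, a small input file yields correspondingly few training blocks, which is worth keeping in mind when fine-tuning on a small subset.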

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset


# Load the SciQ dataset
data = load_dataset("allenai/sciq")
# Preprocess the dataset as needed

# Load pre-trained tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Tokenize the dataset
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="path_to_preprocessed_dataset",
    block_size=128
)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2
)

# Define data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset
)

# Fine-tune the model
trainer.train()

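# Save the fine-tuned model together with the tokenizer; the Trainer above was
# not given the tokenizer, so save_model() alone would not write its files
trainer.save_model("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")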

# Load the fine-tuned model for question generation
fine_tuned_model = GPT2LMHeadModel.from_pretrained("./fine-tuned-model")
tokenizer = GPT2Tokenizer.from_pretrained("./fine-tuned-model")

# Generate questions
context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
inputs = tokenizer.encode("Question: " + context, return_tensors="pt")
outputs = fine_tuned_model.generate(inputs, max_length=100, num_return_sequences=1, temperature=0.7)

# Decode and print the generated questions
generated_questions = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
for i, question in enumerate(generated_questions):
    print(f"Generated Question {i+1}: {question}")





ValueError: Input file path path_to_preprocessed_dataset not found

## Approach 4 Description

This code demonstrates an approach to generating question-answer pairs using a pre-trained GPT-2 model and the SciQ dataset.

- **Dataset Loading**: The SciQ dataset is loaded using the `load_dataset` function from the `datasets` library.

- **Pipeline Initialization**: A text generation pipeline is initialized using the `pipeline` function from the `transformers` library, specifying the GPT-2 model.

- **Question Generation Function**: A `generate_question_answer` function is defined to generate a question based on the given context, answer, and distractors. It uses the initialized pipeline to generate the question.

- **Random Sampling of Examples**: Five random indices are sampled from the training split of the dataset to generate questions for these examples.

- **Example Usage**: For each randomly sampled index, the context, answer, and distractors are extracted from the dataset. The `generate_question_answer` function is then called to generate a question based on this information. The generated question is printed along with the correct answer and context.

This approach prompts a pre-trained GPT-2 model with contexts, correct answers, and distractors from the SciQ dataset to produce question-answer pairs for educational or assessment purposes.

In [None]:
import random
from transformers import pipeline
from datasets import load_dataset

# Load the SciQ dataset
sciq_dataset = load_dataset("allenai/sciq")

# Get the number of examples in the dataset
num_examples = len(sciq_dataset['train'])

# Initialize the pipeline for question generation
question_generator = pipeline("text-generation", model="gpt2", tokenizer="gpt2")

# Function to generate question-answer pairs
def generate_question_answer(context, answer, distractors):
    # Generate a question based on the context
    prompt = f"Question: {context} Answer: {answer} Distractors: {' '.join(distractors)}"
    question = question_generator(prompt, max_new_tokens=50, num_return_sequences=1)[0]['generated_text'].split("Question: ")[-1].strip()

    return question, answer

# Randomly sample 5 indices from the dataset
random_indices = random.sample(range(num_examples), 5)

# Example usage
for idx in random_indices:  # Generate questions for 5 random examples
    example = sciq_dataset['train'][idx]
    context = example['support']
    answer = example['correct_answer']
    distractors = [example['distractor1'], example['distractor2'], example['distractor3']]

    question, correct_answer = generate_question_answer(context, answer, distractors)
    print("Context:", context)
    print("Generated Question:", question)
    print("Correct Answer:", correct_answer)
    print("-" * 50)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Context: Comparing Meiosis and Mitosis Mitosis and meiosis are both forms of division of the nucleus in eukaryotic cells. They share some similarities, but also exhibit distinct differences that lead to very different outcomes (Figure 11.7). Mitosis is a single nuclear division that results in two nuclei that are usually partitioned into two new cells. The nuclei resulting from a mitotic division are genetically identical to the original nucleus. They have the same number of sets of chromosomes, one set in the case of haploid cells and two sets in the case of diploid cells. In most plants and all animal species, it is typically diploid cells that undergo mitosis to form new diploid cells. In contrast, meiosis consists of two nuclear divisions resulting in four nuclei that are usually partitioned into four new cells. The nuclei resulting from meiosis are not genetically identical and they contain one chromosome set only. This is half the number of chromosome sets in the original cell, w



Context: 
Generated Question: Answer: water vapor Distractors: rainwater ice distilled water, filtered water, and bottled water (eg Bottling, Refills, Bottles, Refills). (The latter option is called "water extraction.")

(Refilled/refilled water is used in most situations except in rare circumstances where
Correct Answer: water vapor
--------------------------------------------------




Context: All living things need energy. They need it to power the processes of life. For example, it takes energy to grow. It also takes energy to produce offspring. In fact, it takes energy just to stay alive. Remember that energy can’t be created or destroyed. It can only change form. Energy changes form as it moves through ecosystems.
Generated Question: All living things need energy. They need it to power the processes of life. For example, it takes energy to grow. It also takes energy to produce offspring. In fact, it takes energy just to stay alive. Remember that energy can’t be created or destroyed. It can only change form. Energy changes form as it moves through ecosystems. Answer: form Distractors: kind fluid shape. The human mind is filled with these. Here are some things that attract animals, especially humans. Humans, you guessed it. They do anything they can to hold on to their tiny minds. But it becomes easier and easier to change in a way
Correct Answer: form
-----------



Context: If all three bulbs are set a full intensity, the person sees __________.
Generated Question: If all three bulbs are set a full intensity, the person sees __________. Answer: two Distractors: one four three. Answer: The people see __________ when it seems all three are illuminated. Answer: The lights in the room don't turn a full color. Answer: The lights do turn a full color when the lamps have a bit of motion on them
Correct Answer: two
--------------------------------------------------
Context: Unfortunately, maternal mt(DNA)is susceptible to mutations which are a cause of inherited disease, such as breast cancer. Although, it is important to note that most mutations do not lead to defected mtDNA. Heteroplasmy is the presence of a mixture of more than one type of mtDNA. Most people have homoplasmic cells, meaning that their cells contain only normal, undefected mtDNA. However, people with both normal, undefected mtDNA and not normal, defected mtDNA, may inherit mitochondrial
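
The generated questions above begin with the context itself because a text-generation pipeline returns the prompt followed by its continuation by default, and splitting on `"Question: "` only removes that prefix. The sketch below shows the effect on a hand-written string and one possible way to keep only the continuation; `generated_text` is a hypothetical stand-in for the pipeline output, not real model output.

```python
prompt = "Question: Energy changes form. Answer: form Distractors: kind fluid shape"

# Hypothetical stand-in for the pipeline's 'generated_text' field:
# the prompt followed by whatever the model appended
generated_text = prompt + " What does energy change as it moves through ecosystems?"

# The notebook's post-processing: everything after the last "Question: "
as_in_notebook = generated_text.split("Question: ")[-1].strip()

# Alternative: drop the known prompt prefix and keep only the continuation
continuation = generated_text[len(prompt):].strip()

print(as_in_notebook)
print(continuation)
```

The `text-generation` pipeline also accepts `return_full_text=False`, which should return only the continuation without manual slicing.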

## Approach 5 Description: Tokenization

This code demonstrates an approach to generating questions based on context and answer, using tokenization techniques from the `spacy` and `nltk` libraries.

- **Tokenizer Initialization**: The English tokenizer, tagger, parser, and Named Entity Recognition (NER) components are loaded using spaCy.

- **Question Generation Function**: A function called `generate_question` is defined to generate a question based on the given context and answer.
  - First, the context is tokenized into sentences using the `sent_tokenize` function from the `nltk.tokenize` module.
  - Named entities are extracted from the context using spaCy.
  - A question is then produced by inserting the answer into a fixed template, "What type of {answer} produces in the given context?".

- **Example Usage**: An example context and answer are provided and the `generate_question` function is called to generate a question and extract named entities. The context, generated question, and named entities are printed to demonstrate the functionality of the approach.

This approach combines sentence tokenization from `nltk` and entity extraction from `spacy` with a template that inserts the answer, so that the generated question targets the answer within the given context.

In [None]:
import spacy
from nltk.tokenize import sent_tokenize

# Load English tokenizer, tagger, parser, and NER
nlp = spacy.load("en_core_web_sm")

# Function to generate questions based on context and answer
def generate_question(context, answer):
    # Tokenize context into sentences
    sentences = sent_tokenize(context)

    # Extract named entities from context using spaCy
    doc = nlp(context)
    named_entities = [ent.text for ent in doc.ents]

    # Generate a question based on the context and answer
    question = f"What type of {answer} produces in the given context?"
    return question, named_entities

# Example usage
context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
answer = "plate boundaries"
question, named_entities = generate_question(context, answer)

print("Context:", context)
print("Generated Question:", question)
print("Named Entities:", named_entities)

Context: Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin.
Generated Question: What type of plate boundaries produces in the given context?
Named Entities: ['Divergent']
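
The tokenized sentences are not used by the template above. As a minimal sketch of one way they could be used, the snippet below picks the first sentence containing the answer and blanks the answer out, giving a fill-in-the-blank question. The helpers `split_sentences` and `make_cloze_question` are hypothetical and not part of the original notebook; the regex splitter is only a rough stand-in for `sent_tokenize`, so that the example runs without downloading NLTK data.

```python
import re

def split_sentences(text):
    """Naive stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def make_cloze_question(context, answer, blank="__________"):
    """Return the first sentence containing `answer` with the answer blanked out, or None."""
    for sentence in split_sentences(context):
        if answer in sentence:
            return sentence.replace(answer, blank)
    return None

context = "Divergent plate boundaries produce huge mountain ranges underwater in every ocean basin."
print(make_cloze_question(context, "plate boundaries"))
# Divergent __________ produce huge mountain ranges underwater in every ocean basin.
```

Returning `None` when the answer does not occur in the context lets the caller decide whether to fall back to the fixed template.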


## 5th Approach Part 2 Description: Tokenization

This code is a continuation of the previous approach and demonstrates an extended version of the question generation function that substitutes named entities for the answer inside the context.

- **Replacement of the Answer**: After extracting named entities from the context using spaCy, the function replaces every occurrence of the answer string in the context with a comma-separated list of the named entities, and then inserts the modified context into the question template. If spaCy finds no named entities, the answer is replaced with an empty string, as the example output shows.

- **Example Usage**: A sample context and answer are provided, and the extended `generate_question` function is called to generate a question and extract named entities. The context, generated question, and named entities are printed to show the result.

The intent of this extension is to make the generated questions more specific by incorporating named entities from the context into the question formulation.

In [None]:
import spacy
from nltk.tokenize import sent_tokenize

# Load English tokenizer, tagger, parser, and NER
nlp = spacy.load("en_core_web_sm")

# Function to generate questions based on context and answer
def generate_question(context, answer):
    # Tokenize context into sentences
    sentences = sent_tokenize(context)

    # Extract named entities from context using spaCy
    doc = nlp(context)
    named_entities = [ent.text for ent in doc.ents]

    # Replace every occurrence of the answer string in the context with the joined named entities
    context_with_entities = context.replace(answer, ", ".join(named_entities))

    # Generate a question based on the context and answer
    question = f"What type of {context_with_entities} produces in the given context?"
    return question, named_entities

# Example usage
context = "The stored food in a seed is called endosperm . It nourishes the embryo until it can start making food on its own.."
answer = "endosperm"
question, named_entities = generate_question(context, answer)

print("Context:", context)
print("Generated Question:", question)
print("Named Entities:", named_entities)

Context: The stored food in a seed is called endosperm . It nourishes the embryo until it can start making food on its own..
Generated Question: What type of The stored food in a seed is called  . It nourishes the embryo until it can start making food on its own.. produces in the given context?
Named Entities: []
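
The output above shows what happens when no named entities are found: `", ".join([])` is an empty string, so the answer is simply deleted from the context. A minimal sketch of one possible guard follows, assuming it is acceptable to keep the answer itself when there are no entities. `build_question` is a hypothetical helper, not part of the original notebook, and it takes the entity list as an argument so that the example runs without loading a spaCy model.

```python
def build_question(context, answer, named_entities):
    """Hypothetical variant of the template step that falls back to the answer itself."""
    # ", ".join([]) is "", so replacing the answer with it would delete the answer
    replacement = ", ".join(named_entities) if named_entities else answer
    context_with_entities = context.replace(answer, replacement)
    return f"What type of {context_with_entities} produces in the given context?"

context = "The stored food in a seed is called endosperm."
print(build_question(context, "endosperm", []))
```

With an empty entity list the context is left unchanged, so the answer still appears in the generated question.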


## Medium Approach Description

This code snippet demonstrates an approach to generating multiple-choice questions (MCQs) using NLP techniques, specifically the BERT Extractive Summarizer.

- **Library Installation**: The `bert-extractive-summarizer` library is installed to enable text summarization using BERT-based models.

- **Import Library**: The library is imported to use the `Summarizer` class for text summarization.

- **Text Summarization**: A piece of text is provided and the summarizer model is initialized. The model is then applied to the text to generate a summary. Parameters such as the minimum and maximum length of the candidate sentences and the summary ratio are specified.

- **Example Usage**: An example text about mesophiles and their growth conditions is given. The text is summarized using the BERT Extractive Summarizer.

This approach uses BERT-based extractive summarization to condense a piece of text into a concise summary that can then be used to generate multiple-choice questions or for other purposes in educational or informational contexts.

For more details, see the [Medium article](https://medium.com/@pragyan_/mcq-generation-using-nlp-fe6166beaef8) that discusses this approach.

In [None]:
# Install the BERT Extractive Summarizer
!pip install bert-extractive-summarizer
# Import the summarizer class
from summarizer import Summarizer

text = "Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine."
model = Summarizer()
# Keep about 40% of the sentences; min_length/max_length filter candidate sentences by character length
result = model(text, min_length=60, max_length=500, ratio=0.4)
summarized_text = ''.join(result)

Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Installing collected packages: bert-extractive-summarizer
Successfully installed bert-extractive-summarizer-0.10.1





In [None]:
print(summarized_text)

Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine.


In [None]:
!pip install python-rake==1.4.4
# Import RAKE
import RAKE
import operator

# Rake setup with stopword directory
stop_dir = "/content/drive/MyDrive/NLP Project/SmartStoplist.txt"
rake_object = RAKE.Rake(stop_dir)
# Sample text to test RAKE
text = """Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine."""

# Extract keywords
keywords = rake_object.run(text)
print("keywords: ", keywords)

keywords:  [('normal human body temperature', 15.0), ('optimal growth temperature', 9.0), ('moderate temperature', 5.0), ('found living', 4.0), ('mesophilic organisms', 4.0), ('food preparation', 4.0), ('including cheese', 4.0), ('mesophiles grow', 3.666666666666667), ('pathogenic mesophiles', 3.666666666666667), ('mesophiles', 1.6666666666666667), ('typically', 1.0), ('bodies', 1.0), ('humans', 1.0), ('animals', 1.0), ('important', 1.0), ('yogurt', 1.0), ('beer', 1.0), ('wine', 1.0), ('25°', 0), ('40°', 0), ('77°', 0), ('104°', 0), ('37°', 0), ('98°', 0)]
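
The scores in this output are consistent with RAKE's usual scoring rule: each word is scored as its degree (co-occurrences within candidate phrases, counting the word itself) divided by its frequency, and a phrase scores the sum of its words. The snippet below is a minimal sketch of that scoring step only. It assumes the candidate phrases have already been split at stopwords and punctuation, it does not reproduce the zero scores that the output shows for numeric tokens, and `score_phrases` is a hypothetical helper rather than part of the `python-rake` API.

```python
from collections import defaultdict

def score_phrases(phrases):
    """Score candidate phrases RAKE-style: word score = degree / frequency, phrase score = sum."""
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            frequency[word] += 1
            # Co-occurrence with the other words of the phrase, plus the word itself
            degree[word] += len(words)
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    return {p: sum(word_score[w] for w in p.split()) for p in set(phrases)}

# Candidate phrases taken from the keyword output above
phrases = [
    "mesophiles grow", "moderate temperature", "mesophiles", "found living",
    "optimal growth temperature", "pathogenic mesophiles",
    "normal human body temperature",
]
scores = score_phrases(phrases)
print(scores["normal human body temperature"])  # 15.0
print(scores["optimal growth temperature"])     # 9.0
```

Long phrases built from rare words score highest, which is why "normal human body temperature" ranks first even though "mesophiles" is the topic of the text.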


In [None]:
question_model = T5ForConditionalGeneration.from_pretrained(model_path)
question_tokenizer = T5Tokenizer.from_pretrained(model_path)
question_model = question_model.to(device)

def get_question(context,answer,model,tokenizer):
  text = "context: {} answer: {}".format(context,answer)
  encoding = tokenizer.encode_plus(text,max_length=384, pad_to_max_length=False,truncation=True, return_tensors="pt").to(device)
  input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

  outs = model.generate(input_ids=input_ids,
                                  attention_mask=attention_mask,
                                  early_stopping=True,
                                  num_beams=5,
                                  num_return_sequences=1,
                                  no_repeat_ngram_size=2,
                                  max_length=72)


  dec = [tokenizer.decode(ids,skip_special_tokens=True) for ids in outs]


  Question = dec[0].replace("question:","")
  Question= Question.strip()
  return Question

NameError: name 'T5ForConditionalGeneration' is not defined

In [None]:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load a T5 model and tokenizer. Note: "t5-small" is a general-purpose checkpoint,
# not one fine-tuned for question generation.
question_model = T5ForConditionalGeneration.from_pretrained("t5-small")
question_tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Set device to GPU if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
question_model = question_model.to(device)

def get_question(context, answer, model, tokenizer):
    # Build the input in the "context: ... answer: ..." format
    text = f"context: {context} answer: {answer}"
    # Tokenize without padding, truncating to at most 384 tokens
    encoding = tokenizer(text, max_length=384, truncation=True, return_tensors="pt").to(device)
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]

    # Beam search with 5 beams; return the single best sequence
    outs = model.generate(input_ids=input_ids,
                          attention_mask=attention_mask,
                          early_stopping=True,
                          num_beams=5,
                          num_return_sequences=1,
                          no_repeat_ngram_size=2,
                          max_length=72)

    dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outs]

    # Strip the "question:" prefix if the model emits one
    question = dec[0].replace("question:", "").strip()
    return question



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
import spacy
from nltk.tokenize import sent_tokenize
from summarizer import Summarizer
import RAKE
import operator
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load English tokenizer, tagger, parser, and NER
nlp = spacy.load("en_core_web_sm")
stop_dir = "/content/drive/MyDrive/NLP Project/SmartStoplist.txt"
rake_object = RAKE.Rake(stop_dir)

# Load a T5 model and tokenizer ("t5-small" is a general-purpose checkpoint, not one fine-tuned for question generation)
question_model = T5ForConditionalGeneration.from_pretrained("t5-small")
question_tokenizer = T5Tokenizer.from_pretrained("t5-small")
device = "cuda" if torch.cuda.is_available() else "cpu"
question_model = question_model.to(device)

def generate_question(context, answer):
    # Tokenize context into sentences
    sentences = sent_tokenize(context)

    # Extract named entities from context using spaCy
    doc = nlp(context)
    named_entities = [ent.text for ent in doc.ents]

    # Replace every occurrence of the answer string in the context with the joined named entities
    context_with_entities = context.replace(answer, ", ".join(named_entities))

    # Generate a question based on the context and answer
    question = f"What type of {context_with_entities} produces in the given context?"
    return question, named_entities

def get_summary_keywords(context):
    model = Summarizer()
    result = model(context, min_length=60, max_length=500, ratio=0.4)
    summarized_text = ''.join(result)
    keywords = rake_object.run(summarized_text)
    return keywords

def get_question_and_answer(context, answer):
    # Template-based question and entities from the 5th approach (computed but not used below)
    question, named_entities = generate_question(context, answer)
    keywords = get_summary_keywords(context)

    # Use the highest-scoring RAKE keyword as the answer; this overrides the `answer` argument
    relevant_keyword = max(keywords, key=lambda x: x[1])[0]
    # Generate a question for that keyword with the T5 model (get_question is defined in the previous cell)
    generated_question = get_question(context, relevant_keyword, question_model, question_tokenizer)

    return generated_question, relevant_keyword

# Example usage
context = "Usually, testing a hypothesis requires making observations or performing experiments. In this case, we will look into existing scientific literature to see if either of these hypotheses can be disproved, or if one or both can be supported by the data."
answer = "hypothesis"

generated_question, relevant_keyword = get_question_and_answer(context, answer)
print("Context:", context)
print("Generated Question:", generated_question)
print("Answer:", relevant_keyword)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Context: Usually, testing a hypothesis requires making observations or performing experiments. In this case, we will look into existing scientific literature to see if either of these hypotheses can be disproved, or if one or both can be supported by the data.
Generated Question: True
Answer: hypothesis requires making observations
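
In `get_question_and_answer`, the `answer` argument does not influence which keyword is chosen: the highest-scoring RAKE keyword always wins. A minimal sketch of an alternative selection rule follows, assuming the caller's answer should be preferred whenever a keyword mentions it. `pick_keyword` is a hypothetical helper, not part of the original notebook, and the sample keywords are copied from the RAKE output shown earlier.

```python
def pick_keyword(keywords, answer):
    """Hypothetical rule: prefer the best-scoring keyword that mentions `answer`."""
    if not keywords:
        return answer  # nothing was extracted, so keep the caller's answer
    matching = [kw for kw in keywords if answer.lower() in kw[0].lower()]
    candidates = matching or keywords  # fall back to all keywords if none match
    return max(candidates, key=lambda kw: kw[1])[0]

keywords = [
    ("normal human body temperature", 15.0),
    ("optimal growth temperature", 9.0),
    ("mesophiles grow", 3.666666666666667),
]
print(pick_keyword(keywords, "mesophiles"))  # mesophiles grow
print(pick_keyword(keywords, "enzymes"))     # normal human body temperature
```

The fallback to the overall best keyword keeps the original behavior whenever the answer does not appear in any extracted phrase.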

