
# Generate Multiple Choice Questions from a Text

## Problem Statement
Given a passage of text, we want to automatically generate multiple-choice questions (MCQs) from it. Educators, teachers, and other professionals can use the tool to create assessments, evaluations, quizzes, and surveys at scale.

## Introduction

MCQs have several advantages, including rapid evaluation, shorter testing time, uniform scoring, and the option of an electronic evaluation. Many examinations employ MCQ-based question papers administered in a computerised setting. Manually preparing MCQs, on the other hand, is time-consuming and costly. As a result, the research community expended much effort in developing approaches for the automatic creation of MCQs. We will explore one such method here using machine learning and natural language processing.

In [None]:
# We will use following text as an example
# source: https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained
text = """ Machine learning is behind chatbots and predictive text, language translation apps, the shows Netflix suggests to you, and how your social media feeds are presented. It powers autonomous vehicles and machines that can diagnose medical conditions based on images.

When companies today deploy artificial intelligence programs, they are most likely using machine learning — so much so that the terms are often used interchangeably, and sometimes ambiguously. Machine learning is a subfield of artificial intelligence that gives computers the ability to learn without explicitly being programmed.

“In just the last five or 10 years, machine learning has become a critical way, arguably the most important way, most parts of AI are done,” said MIT Sloan professorThomas W. Malone, the founding director of the MIT Center for Collective Intelligence. “So that's why some people use the terms AI and machine learning almost as synonymous … most of the current advances in AI have involved machine learning.”
With the growing ubiquity of machine learning, everyone in business is likely to encounter it and will need some working knowledge about this field. A 2020 Deloitte survey found that 67% of companies are using machine learning, and 97% are using or planning to use it in the next year.

From manufacturing to retail and banking to bakeries, even legacy companies are using machine learning to unlock new value or boost efficiency. “Machine learning is changing, or will change, every industry, and leaders need to understand the basic principles, the potential, and the limitations,” said MIT computer science professor Aleksander Madry, director of the MIT Center for Deployable Machine Learning.

While not everyone needs to know the technical details, they should understand what the technology does and what it can and cannot do, Madry added. “I don’t think anyone can afford not to be aware of what’s happening.”

That includes being aware of the social, societal, and ethical implications of machine learning. “It's important to engage and begin to understand these tools, and then think about how you're going to use them well. We have to use these [tools] for the good of everybody,” said Dr. Joan LaRovere, MBA ’16, a pediatric cardiac intensive care physician and co-founder of the nonprofit The Virtue Foundation. “AI has so much potential to do good, and we need to really keep that in our lenses as we're thinking about this. How do we use this to do good and better the world?”

What is machine learning?
Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems.

The goal of AI is to create computer models that exhibit “intelligent behaviors” like humans, according to Boris Katz, a principal research scientist and head of the InfoLab Group at CSAIL. This means machines that can recognize a visual scene, understand a text written in natural language, or perform an action in the physical world.

Machine learning is one way to use AI. It was defined in the 1950s by AI pioneer Arthur Samuel as “the field of study that gives computers the ability to learn without explicitly being programmed.”

The definition holds true, according toMikey Shulman, a lecturer at MIT Sloan and head of machine learning at Kensho, which specializes in artificial intelligence for the finance and U.S. intelligence communities. He compared the traditional way of programming computers, or “software 1.0,” to baking, where a recipe calls for precise amounts of ingredients and tells the baker to mix for an exact amount of time. Traditional programming similarly requires creating detailed instructions for the computer to follow.

But in some cases, writing a program for the machine to follow is time-consuming or impossible, such as training a computer to recognize pictures of different people. While humans can do this task easily, it’s difficult to tell a computer how to do it. Machine learning takes the approach of letting computers learn to program themselves through experience.

Machine learning starts with data — numbers, photos, or text, like bank transactions, pictures of people or even bakery items, repair records, time series data from sensors, or sales reports. The data is gathered and prepared to be used as training data, or the information the machine learning model will be trained on. The more data, the better the program.

From there, programmers choose a machine learning model to use, supply the data, and let the computer model train itself to find patterns or make predictions. Over time the human programmer can also tweak the model, including changing its parameters, to help push it toward more accurate results. (Research scientist Janelle Shane’s website AI Weirdness is an entertaining look at how machine learning algorithms learn and how they can get things wrong — as happened when an algorithm tried to generate recipes and created Chocolate Chicken Chicken Cake.)

Some data is held out from the training data to be used as evaluation data, which tests how accurate the machine learning model is when it is shown new data. The result is a model that can be used in the future with different sets of data.

Successful machine learning algorithms can do different things, Malone wrote in a recent research brief about AI and the future of work that was co-authored by MIT professor and CSAIL director Daniela Rus and Robert Laubacher, the associate director of the MIT Center for Collective Intelligence.

“The function of a machine learning system can be descriptive, meaning that the system uses the data to explain what happened; predictive, meaning the system uses the data to predict what will happen; or prescriptive, meaning the system will use the data to make suggestions about what action to take,” the researchers wrote.

There are three subcategories of machine learning:

Supervised machine learning models are trained with labeled data sets, which allow the models to learn and grow more accurate over time. For example, an algorithm would be trained with pictures of dogs and other things, all labeled by humans, and the machine would learn ways to identify pictures of dogs on its own. Supervised machine learning is the most common type used today.

In unsupervised machine learning, a program looks for patterns in unlabeled data. Unsupervised machine learning can find patterns or trends that people aren’t explicitly looking for. For example, an unsupervised machine learning program could look through online sales data and identify different types of clients making purchases.

Reinforcement machine learning trains machines through trial and error to take the best action by establishing a reward system. Reinforcement learning can train models to play games or train autonomous vehicles to drive by telling the machine when it made the right decisions, which helps it learn over time what actions it should take.
In the Work of the Future brief, Malone noted that machine learning is best suited for situations with lots of data — thousands or millions of examples, like recordings from previous conversations with customers, sensor logs from machines, or ATM transactions. For example, Google Translate was possible because it “trained” on the vast amount of information on the web, in different languages.

In some cases, machine learning can gain insight or automate decision-making in cases where humans would not be able to, Madry said. “It may not only be more efficient and less costly to have an algorithm do this, but sometimes humans just literally are not able to do it,” he said.

Google search is an example of something that humans can do, but never at the scale and speed at which the Google models are able to show potential answers every time a person types in a query, Malone said. “That’s not an example of computers putting people out of work. It's an example of computers doing things that would not have been remotely economically feasible if they had to be done by humans.”

Machine learning is also associated with several other artificial intelligence subfields:

Natural language processing

Natural language processing is a field of machine learning in which machines learn to understand natural language as spoken and written by humans, instead of the data and numbers normally used to program computers. This allows machines to recognize language, understand it, and respond to it, as well as create new text and translate between languages. Natural language processing enables familiar technology like chatbots and digital assistants like Siri or Alexa.

Neural networks

Neural networks are a commonly used, specific class of machine learning algorithms. Artificial neural networks are modeled on the human brain, in which thousands or millions of processing nodes are interconnected and organized into layers.

In an artificial neural network, cells, or nodes, are connected, with each cell processing inputs and producing an output that is sent to other neurons. Labeled data moves through the nodes, or cells, with each cell performing a different function. In a neural network trained to identify whether a picture contains a cat or not, the different nodes would assess the information and arrive at an output that indicates whether a picture features a cat.

Deep learning

Deep learning networks are neural networks with many layers. The layered network can process extensive amounts of data and determine the “weight” of each link in the network — for example, in an image recognition system, some layers of the neural network might detect individual features of a face, like eyes, nose, or mouth, while another layer would be able to tell whether those features appear in a way that indicates a face.

Like neural networks, deep learning is modeled on the way the human brain works and powers many machine learning uses, like autonomous vehicles, chatbots, and medical diagnostics.

“The more layers you have, the more potential you have for doing complex things well,” Malone said.

Deep learning requires a great deal of computing power, which raises concerns about its economic and environmental sustainability.

How businesses are using machine learning
Machine learning is the core of some companies’ business models, like in the case of Netflix’s suggestions algorithm or Google’s search engine. Other companies are engaging deeply with machine learning, though it’s not their main business proposition.
Others are still trying to determine how to use machine learning in a beneficial way. “In my opinion, one of the hardest problems in machine learning is figuring out what problems I can solve with machine learning,” Shulman said. “There’s still a gap in the understanding.”

In a 2018 paper, researchers from the MIT Initiative on the Digital Economy outlined a 21-question rubric to determine whether a task is suitable for machine learning. The researchers found that no occupation will be untouched by machine learning, but no occupation is likely to be completely taken over by it. The way to unleash machine learning success, the researchers found, was to reorganize jobs into discrete tasks, some which can be done by machine learning, and others that require a human.

Companies are already using machine learning in several ways, including:

Recommendation algorithms. The recommendation engines behind Netflix and YouTube suggestions, what information appears on your Facebook feed, and product recommendations are fueled by machine learning. “[The algorithms] are trying to learn our preferences,” Madry said. “They want to learn, like on Twitter, what tweets we want them to show us, on Facebook, what ads to display, what posts or liked content to share with us.”

Image analysis and object detection. Machine learning can analyze images for different information, like learning to identify people and tell them apart — though facial recognition algorithms are controversial. Business uses for this vary. Shulman noted that hedge funds famously use machine learning to analyze the number of cars in parking lots, which helps them learn how companies are performing and make good bets.

Fraud detection. Machines can analyze patterns, like how someone normally spends or where they normally shop, to identify potentially fraudulent credit card transactions, log-in attempts, or spam emails.

Automatic helplines or chatbots. Many companies are deploying online chatbots, in which customers or clients don’t speak to humans, but instead interact with a machine. These algorithms use machine learning and natural language processing, with the bots learning from records of past conversations to come up with appropriate responses.

Self-driving cars. Much of the technology behind self-driving cars is based on machine learning, deep learning in particular.

Medical imaging and diagnostics. Machine learning programs can be trained to examine medical images or other information and look for certain markers of illness, like a tool that can predict cancer risk based on a mammogram.

"""

In [None]:
len(text)

5163

## Sentence Selection and Question Generation
To generate questions, we must first identify significant sentences that carry facts or knowledge. There are two broad ways to do this:

### a. Named Entity Recognition:
We can identify important names and locations and formulate questions around them.
If we want to test grammar skills, we can instead identify verbs, nouns, and other parts of speech such as adpositions.

### b. Keyword Extraction:
We identify important keywords in our text, and then formulate questions that have those keywords as answers. We will explore this method in this notebook.
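KeyBERT, which this notebook uses for keyword extraction, ranks candidate phrases by how similar their embeddings are to the embedding of the whole document. The sketch below illustrates only that ranking step. The three-dimensional vectors are hand-made toy values, not real embeddings, and `rank_candidates` is a hypothetical helper rather than part of KeyBERT.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vector, candidate_vectors, top_n=3):
    """Return the top_n (phrase, score) pairs most similar to the document vector."""
    scored = [(phrase, cosine_similarity(doc_vector, vector))
              for phrase, vector in candidate_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Toy vectors chosen by hand for illustration (assumption: not model output).
doc_vector = [1.0, 1.0, 0.0]
candidates = {
    "machine learning": [0.9, 1.0, 0.1],
    "bakery": [0.0, 0.2, 1.0],
    "neural networks": [1.0, 0.5, 0.0],
}

ranked = rank_candidates(doc_vector, candidates, top_n=2)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

In a real run the vectors would come from a sentence-embedding model, and the highest-scoring phrases would become the answer keys.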

### Question Generation:
For generating questions, we have two broad options:<br>
a. We can train a model that takes an answer and a context as input and generates a question. This is a sequence-to-sequence problem, and it is the reverse of the question answering task, where we feed the model a question and a context and it generates an answer. We can reuse the same datasets to generate questions instead. The advantage of this method is that it can produce properly phrased questions.

b. A simpler way to formulate questions is to replace the keyword in the original text with a blank. This generates only fill-in-the-blank (declarative) questions. The advantage is that it is straightforward to implement and does not need a machine learning model to form the question.

We explore both ways in this notebook, but we preferred the second method in production.
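The second option can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the keyword appears verbatim (ignoring case) in the sentence; `make_blank_question` and `BLANK` are hypothetical names used only here.

```python
import re

BLANK = "_____"

def make_blank_question(sentence, keyword, blank=BLANK):
    """Replace the first case-insensitive occurrence of keyword with a blank.

    Returns None when the keyword does not occur in the sentence.
    """
    match = re.search(re.escape(keyword), sentence, flags=re.IGNORECASE)
    if match is None:
        return None
    return sentence[:match.start()] + blank + sentence[match.end():]

sentence = "Machine learning is a subfield of artificial intelligence."
question = make_blank_question(sentence, "artificial intelligence")
print(question)
```

Replacing only the first occurrence keeps exactly one blank per question, which matters later when the blank is swapped for a single mask token.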

## Identifying Keywords

In [None]:
!pip install keybert > /dev/null
from sentence_transformers import SentenceTransformer
from keybert import KeyBERT
from nltk.tokenize import sent_tokenize
import nltk

# we use nltk library to tokenize our text
nltk.download('punkt')

# KeyBert uses BERT-embeddings and simple cosine similarity to find the sub-phrases in a document that are the most similar to the document itself.
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(sentence_model)

def get_keywords(text):
    """
    Given @input text, identify important keywords.
    Here we use Sentence Transformer to extract keywords that best describe the text
    """
    keywords_with_scores = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=5, stop_words='english')
    # extract_keywords returns (keyword, score) pairs; keep only the keywords
    return [keyword for keyword, _score in keywords_with_scores]

def tokenize_sentences(text):
    """
    Given a @text input, returns tokenized sentences
    """
    # Strip whitespace and drop sentences of 20 characters or fewer.
    return [sentence.strip() for sentence in sent_tokenize(text) if len(sentence) > 20]

def get_sentences_for_keyword(kw_model, sentences):
    """
    @kw_model: keyBERT model to extract keywords
    @sentences: list of tokenized sentences
    returns a map with keywords as keys mapped to the sentences they appear in.
    """
    keyword_sentences = {}
    for sentence in sentences:
        # keep keywords longer than 2 characters, dropping their scores
        keywords_found = [kw[0] for kw in kw_model.extract_keywords(sentence, keyphrase_ngram_range=(1, 2), top_n=10) if len(kw[0]) > 2]
        for key in keywords_found:
            keyword_sentences.setdefault(key, []).append(sentence)

    return keyword_sentences


In [None]:
# Install the required modules
!pip install keybert
!pip install nltk

# Import the required modules
from keybert import KeyBERT
import nltk
from nltk import sent_tokenize

# Download the required NLTK data
nltk.download('punkt')

# Define the function to get sentences for a keyword
def get_sentences_for_keyword(kw_model, sentences):
    """Map each extracted keyword to the sentences it was extracted from."""
    keyword_to_sentences_map = {}
    for sentence in sentences:
        # Tokenize the sentence
        sentence_tokens = nltk.word_tokenize(sentence)
        # Extract keywords from the sentence.
        # extract_keywords returns (keyword, score) tuples, and each tuple is
        # used as a map key here, so the keys printed below include the score.
        keywords = kw_model.extract_keywords(' '.join(sentence_tokens))
        # Add the sentence to the map for each extracted keyword
        for keyword in keywords:
            if keyword not in keyword_to_sentences_map:
                keyword_to_sentences_map[keyword] = []
            keyword_to_sentences_map[keyword].append(sentence)
    return keyword_to_sentences_map

# Create a KeyBERT model
kw_model = KeyBERT()

# Tokenize text into sentences
sentences = sent_tokenize(text)

# Get sentences for each keyword
keyword_to_sentences_map = get_sentences_for_keyword(kw_model, sentences)

# Print the results
for keyword in keyword_to_sentences_map:
    print(keyword, " : ", keyword_to_sentences_map[keyword], "\n")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


('chatbots', 0.5151)  :  [' Machine learning is behind chatbots and predictive text, language translation apps, the shows Netflix suggests to you, and how your social media feeds are presented.'] 

('learning', 0.4184)  :  [' Machine learning is behind chatbots and predictive text, language translation apps, the shows Netflix suggests to you, and how your social media feeds are presented.', 'It was defined in the 1950s by AI pioneer Arthur Samuel as “the field of study that gives computers the ability to learn without explicitly being programmed.”\n\nThe definition holds true, according toMikey Shulman, a lecturer at MIT Sloan and head of machine learning at Kensho, which specializes in artificial intelligence for the finance and U.S. intelligence communities.'] 

('feeds', 0.4059)  :  [' Machine learning is behind chatbots and predictive text, language translation apps, the shows Netflix suggests to you, and how your social media feeds are presented.'] 

('netflix', 0.3828)  :  [' Mac
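The keys printed above are `(keyword, score)` tuples, so the same keyword extracted with two different scores ends up in two separate entries. A minimal sketch of grouping by the keyword string alone follows. `group_sentences_by_keyword` is a hypothetical helper, and `fake_extract` is a stand-in for a real keyword extractor so that the example runs without downloading a model.

```python
from collections import defaultdict

def group_sentences_by_keyword(sentences, extract):
    """Map each keyword string to the sentences it was extracted from.

    `extract` is any callable that returns (keyword, score) pairs for a sentence.
    """
    keyword_to_sentences = defaultdict(list)
    for sentence in sentences:
        for keyword, _score in extract(sentence):
            # avoid listing the same sentence twice for one keyword
            if sentence not in keyword_to_sentences[keyword]:
                keyword_to_sentences[keyword].append(sentence)
    return dict(keyword_to_sentences)

def fake_extract(sentence):
    """Stand-in extractor (assumption): scores tracked words by 1 / word count."""
    words = sentence.lower().strip(".").split()
    tracked = ("learning", "chatbots")
    return [(word, round(1 / len(words), 4)) for word in words if word in tracked]

sentences = [
    "Machine learning is behind chatbots.",
    "Deep learning uses many stacked layers.",
]
grouped = group_sentences_by_keyword(sentences, fake_extract)
for keyword, matched in grouped.items():
    print(keyword, ":", matched)
```

Here "learning" receives a different score in each sentence, yet both sentences land under the single key "learning".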

## Generating Distractors with BERT

In [None]:
# load BERT model
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')


In [None]:
import random

def get_questions(keyword_to_sentences_map, model, k=5):
    """
    Generates questions along with distractors
    @input keyword_to_sentences_map : maps keywords to sentences they appear in
    @model: BERT model that will be used to mask the keyword and generate distractors
    @k (default 5), maximum number of questions to print
    """

    # choose answer keys randomly from the pool of keywords, without repeats
    keys = list(keyword_to_sentences_map.keys())
    keys = random.sample(keys, k=min(k, len(keys)))

    for key in keys:
        # a key may be a plain keyword or a (keyword, score) tuple
        answer = key[0] if isinstance(key, tuple) else key
        sentences = keyword_to_sentences_map[key]
        sentence = max(sentences, key=len)

        # locate the first occurrence of the answer; skip if it is not found verbatim
        a = sentence.lower().find(answer.lower())
        if a == -1:
            continue
        b = a + len(answer)
        q = sentence[:a] + '_____' + sentence[b:]

        # mask the same span so the model predicts replacements for it
        results = model(sentence[:a] + '[MASK]' + sentence[b:])

        # drop predictions that contain the answer itself
        options = [result['token_str'] for result in results if answer.lower() not in result['token_str'].lower()]

        if options:
            print(q)
            print(f'Ans: {answer}')
            print(options)
            print()


get_questions(keyword_to_sentences_map, unmasker, 10)

This displaced the _____, deep-rooted grasses that trapped soil and moisture even during dry periods and high winds.
Ans: native
['dense', 'thick', 'tall', 'coarse', 'thin']

The storms damaged the soil in around 100 million acres of land, leading to the greatest short-time migration in the American history, with approximately 3.5 million people _____ their farms and fields.
Ans: abandoning
['leaving', 'losing', 'fleeing', 'destroying']

In fact, people welcome dust storms as they bring down temperatures and herald the arrival of the _____.
Ans: monsoons
['monsoon', 'sun', 'moon', 'rain', 'comet']

But, the dust storms that have hit India since February this year have been quantitatively and qualitatively different from those in the _____.
Ans: past
['caribbean', 'himalayas', 'philippines', 'west']

Else, we will see more _____ dust storms, and a choked Delhi would be a permanent feature.
Ans: intense
['frequent', 'severe', 'recent', 'permanent']

Large-scale shelterbelt plantations, c
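In the output above, "monsoon" is offered as a distractor for the answer "monsoons", which makes that question ambiguous. One possible post-processing step is sketched below. It is a crude heuristic, not part of the original pipeline: `filter_distractors` is a hypothetical helper that drops any candidate that contains the answer or is contained in it, removes duplicates, and caps the number of options.

```python
def filter_distractors(answer, candidates, max_options=3):
    """Keep distinct candidates that do not overlap with the answer."""
    answer_norm = answer.strip().lower()
    kept = []
    for candidate in candidates:
        candidate_norm = candidate.strip().lower()
        if not candidate_norm:
            continue
        # Drop near-duplicates of the answer, e.g. 'monsoon' vs 'monsoons'.
        if candidate_norm in answer_norm or answer_norm in candidate_norm:
            continue
        if candidate_norm not in kept:
            kept.append(candidate_norm)
    return kept[:max_options]

options = filter_distractors("monsoons", ['monsoon', 'sun', 'moon', 'rain', 'comet'])
print(options)
```

The substring test is deliberately simple and will also reject very short candidates that happen to occur inside the answer; a stemmer or lemmatizer would be a more careful alternative.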

In [None]:
!mkdir -p src

In [None]:
%%writefile src/answerkey.py
from sentence_transformers import SentenceTransformer
from keybert import KeyBERT
from nltk.tokenize import sent_tokenize
import nltk
# we use nltk library to tokenize our text
nltk.download('punkt')

class AnswerKey:
    """
    Generate answers using keyword extraction, and map them to sentences they appear in
    """

    def __init__(self, text):
        self.text = text

        # KeyBert uses BERT-embeddings and simple cosine similarity to find the sub-phrases in a document that are the most similar to the document itself.
        sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.kw_model = KeyBERT(sentence_model)

    def get_keywords(self, text):
        """
        Given @input text, identify important keywords.
        Here we use Sentence Transformer to extract keywords that best describe the text
        """
        keywords_with_scores = self.kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=5, stop_words='english')
        # extract_keywords returns (keyword, score) pairs; keep only the keywords
        return [keyword for keyword, _score in keywords_with_scores]

    def tokenize_sentences(self, text):
        """
        Given a @text input, returns tokenized sentences
        """
        # Strip whitespace and drop sentences of 20 characters or fewer.
        return [sentence.strip() for sentence in sent_tokenize(text) if len(sentence) > 20]

    def get_sentences_for_keyword(self, kw_model, sentences, ngram_range=(1, 1), top_n=10):
        """
        @kw_model: keyBERT model to extract keywords
        @sentences: list of tokenized sentences
        returns a map with keywords as keys mapped to the sentences they appear in.
        """
        keyword_sentences = {}
        for sentence in sentences:
            keywords_found = [kw[0] for kw in kw_model.extract_keywords(sentence, keyphrase_ngram_range=ngram_range, top_n=top_n) if len(kw[0]) > 2]

            for key in keywords_found:
                keyword_sentences[key] = keyword_sentences.get(key, [])
                keyword_sentences[key].append(sentence)

        return keyword_sentences

    def get_answers(self, ngram_range=(1, 2), top_n=10):
        sentences = self.tokenize_sentences(self.text)
        keyword_to_sentences = self.get_sentences_for_keyword(self.kw_model, sentences, ngram_range=ngram_range, top_n=top_n)
        return keyword_to_sentences

Writing src/answerkey.py


In [None]:
%%writefile src/t5model.py

import requests

API_URL = "https://api-inference.huggingface.co/models/mrm8488/t5-base-finetuned-question-generation-ap"

# The token is read from ../.env, resolved against the current working directory.
with open('../.env') as f:
    API_TOKEN = f.read().strip()

# The file content is sent unchanged as the Authorization header value.
headers = {"Authorization": API_TOKEN}

def query(payload):
    """POST the payload to the hosted model and return the decoded JSON response."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# ping model to wake it up when this package is imported
print(query("answer: Manuel context: Manuel has created RuPERTa-base with the support of HF-Transformers and Google"))

Overwriting src/t5model.py
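The calls in this notebook send the model a single string of the form `answer: ... context: ...` and read the question from `output[0]["generated_text"]`. The sketch below wraps both steps so that an unexpected response does not raise. `build_prompt` and `extract_question` are hypothetical helpers, and the error-shaped response in the example is an assumption about what a hosted endpoint might return while a model is still loading.

```python
def build_prompt(answer, context):
    """Format the model input as 'answer: <answer> context: <context>'."""
    return f"answer: {answer} context: {context}"

def extract_question(output):
    """Return the generated question text, or None if the response is not usable."""
    if isinstance(output, list) and output and isinstance(output[0], dict):
        generated = output[0].get("generated_text")
        if isinstance(generated, str) and generated.strip():
            return generated.strip()
    return None

prompt = build_prompt("Manuel", "Manuel has created RuPERTa-base")
print(prompt)

# Hand-written example responses (assumptions, not captured API output).
print(extract_question([{"generated_text": " Who created RuPERTa-base? "}]))
print(extract_question({"error": "model is loading"}))
```

Returning `None` lets the caller fall back to the fill-in-the-blank question instead of failing on a missing key.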


In [None]:
%%writefile src/model.py

from transformers import pipeline
from .t5model import query
import random


# load BERT model once
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')


class Model:

    def get_questions(self, keyword_to_sentences_map, model, k=5, declarative=True):
        """
        Generates questions along with distractors
        @input keyword_to_sentences_map : maps keywords to sentences they appear in
        @model: BERT model that will be used to mask the keyword and generate distractors
        @k (default 5), maximum number of questions to return
        @declarative: if True, return fill-in-the-blank questions; otherwise query the T5 model
        """

        results = []

        # choose answer keys randomly from the pool of keywords, without repeats
        keywords = list(keyword_to_sentences_map.keys())
        answer_keys = random.sample(keywords, k=min(k, len(keywords)))

        for answer in answer_keys:
            sentences = keyword_to_sentences_map[answer]
            sentence = max(sentences, key=len)

            if len(sentence) < 20:
                continue

            # skip keywords that do not occur verbatim in the sentence
            start_idx = sentence.lower().find(answer)
            if start_idx == -1:
                continue
            end_idx = start_idx + len(answer)

            # replace answer in sentence with blank line to form question
            question = sentence[:start_idx] + '__________' + sentence[end_idx:]

            # generate distractors from BERT model
            distractors = model(sentence[:start_idx] + '[MASK]' + sentence[end_idx:])
            options = [option['token_str'] for option in distractors if answer not in option['token_str'].lower()]
            if not options:
                continue

            # for non-declarative questions, ask the T5 model to phrase the question
            if not declarative:
                output = query(f"answer: {answer} context: {sentence}")
                question = output[0]["generated_text"]

            results.append((question, options, answer))

        return results


Overwriting src/model.py
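`get_questions` returns `(question, options, answer)` tuples in which the options are only the distractors. To show a complete multiple-choice item, the answer has to be mixed in with them. A minimal sketch follows, assuming at most 26 options; `format_mcq` is a hypothetical helper that is not part of `src/model.py`.

```python
import random
import string

def format_mcq(question, distractors, answer, rng=None):
    """Shuffle the answer in with the distractors and letter the options.

    Returns (text, answer_letter).
    """
    rng = rng or random.Random()
    options = list(distractors) + [answer]
    rng.shuffle(options)
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"  {letter}. {option}")
    answer_letter = string.ascii_uppercase[options.index(answer)]
    return "\n".join(lines), answer_letter

text_block, correct = format_mcq(
    "Else, we will see more __________ dust storms.",
    ["frequent", "severe", "recent"],
    "intense",
    rng=random.Random(0),  # fixed seed so the example is repeatable
)
print(text_block)
print("Correct option:", correct)
```

Passing a seeded `random.Random` keeps the option order reproducible, which helps when testing or regenerating a quiz.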


In [None]:
%%writefile app.py

from src.answerkey import AnswerKey
from src.model import Model, unmasker
import streamlit as st

PAGE_CONFIG = {"page_title":"MCQ-App by Glad Nayak","page_icon":":white_check_mark:"}
st.set_page_config(**PAGE_CONFIG)

def render_input():
    """
    Renders text area for input, and button
    """
    # source of default text: https://www.fresherslive.com/online-test/reading-comprehension-test-questions-and-answers
    text = """The Dust Bowl, considered one of the greatest man-made ecological disasters, was a period of severe dust storms that lasted nearly a decade, starting 1931, and engulfed large parts of the US. The dust storms originated in the Great Plains-from states like Texas, Oklahoma, New Mexico, Colorado and Kansas. They were so severe that they choked everything and blocked out the sun for days. Sometimes, the storms travelled thousands of kilometres and blotted out monuments such as the Statue of Liberty. Citizens developed “dust pneumonia” and experienced chest pain and difficulty in breathing. The storms damaged the soil in around 100 million acres of land, leading to the greatest short-time migration in the American history, with approximately 3.5 million people abandoning their farms and fields.

    Dust storms are an annual weather pattern in the northern region of India comprising Delhi, Haryana, Punjab, Uttar Pradesh and Rajasthan and Punjab, as also in the Sindh region of Pakistan. But, they are normally low in intensity and accompanied by rains. In fact, people welcome dust storms as they bring down temperatures and herald the arrival of the monsoons. But, the dust storms that have hit India since February this year have been quantitatively and qualitatively different from those in the past. They are high-powered storms travelling long distances and destroying properties and agricultural fields. Since February, they have affected as many as 16 states and killed more than 500 people. Cities like Delhi were choked in dust for days, with air quality level reaching the “severe” category on most days.

    The Dust Bowl areas of the Great Plains are largely arid and semi-arid and prone to extended periods of drought. The US federal government encouraged settlement and development of large-scale agriculture by giving large parcels of grasslands to settlers. Waves of European settlers arrived at the beginning of the 20th century and converted grasslands into agricultural fields. At the same time, technological improvements allowed rapid mechanization of farm equipment, especially tractors and combined harvesters, which made it possible to operate larger parcels of land.

    For the next two decades, agricultural land grew manifold and farmers undertook extensive deep ploughing of the topsoil with the help of tractors to plant crops like wheat. This displaced the native, deep-rooted grasses that trapped soil and moisture even during dry periods and high winds. Then, the drought struck. Successive waves of drought, which started in 1930 and ended in 1939, turned the Great Plains into bone-dry land. As the soil was already loose due to extensive ploughing, high winds turned them to dust and blew them away in huge clouds. Does this sound familiar? The dust storm regions of India and Pakistan too are largely arid and semi-arid. But they are at a lower altitude and hence less windy compared to the Great Plains. Over the last 50 years, chemical- and water-intensive agriculture has replaced the traditional low-input agriculture. Canal irrigation has been overtaken by the groundwater irrigation. In addition, mechanized agriculture has led to deeper ploughing, loosening more and more topsoil. The result has been devastating for the soil and groundwater. In most of these areas, the soil has been depleted and groundwater levels have fallen precipitously. On top of the man-made ecological destruction, the natural climatic cycle along with climate change is affecting the weather pattern of this region.

    First, this area too is prone to prolonged drought. In fact, large parts of Haryana, Punjab, Delhi and western UP have experienced mildly dry to extremely dry conditions in the last six years. The Standardized Precipitation Index (SPI), which specifies the level of dryness or excess rains in an area, of large parts of Haryana, Punjab and Delhi has been negative since 2012. Rajasthan, on the other hand shows a positive SPI or excess rainfall. Second, this area is experiencing increasing temperatures. In fact, there seems to be a strong correlation between the dust storms and the rapid increase in temperature. Maximum temperatures across northern and western India have been far higher than normal since April this year. Last, climate change is affecting the pattern of Western Disturbances (WDs), leading to stronger winds and stronger storms. WDs are storms originating in the Mediterranean region that bring winter rain to northwestern India. But because of the warming of the Arctic and the Tibetan Plateau, indications are that the WDs are becoming unseasonal, frequent and stronger.

    The Dust Bowl led the US government to initiate a large-scale land-management and soil-conservation programme. Large-scale shelterbelt plantations, contour ploughing, conservation agriculture and establishment of conservation areas to keep millions of acres as grassland, helped halt wind erosion and dust storms. It is time India too recognizes its own Dust Bowl and initiates a large-scale ecological restoration programme to halt it. Else, we will see more intense dust storms, and a choked Delhi would be a permanent feature.
    """
    st.sidebar.subheader('Enter Text:')
    text = st.sidebar.text_area('', text.strip(), height = 275)

    ngram_range = st.sidebar.slider('answer ngram range:', value=[1, 2], min_value=1, max_value=3, step=1)
    num_questions = st.sidebar.slider("number of questions:", value=10, min_value=10, max_value=20, step=1)
    question_type_str = st.sidebar.radio('question type:', ('declarative (fill in the blanks)', 'imperative'))
    question_type = question_type_str == 'declarative (fill in the blanks)'

    button = st.sidebar.button('Generate')

    if button:
        return (text, ngram_range, num_questions, question_type)

def main():
    # Render input text area
    inputs = render_input()

    if not inputs:
        st.title('Generate Multiple Choice Questions (MCQs) from Text Automatically')
        st.subheader('Enter text, select how long a single answer should be (ngram_range), and choose the number of questions to get started.')

    else:
        with st.spinner('Loading questions and distractors using BERT model'):
            st.subheader("")
            st.title("")
            text, ngram_range, num_questions, question_type = inputs

            # Load model
            answerkeys = AnswerKey(text)
            keyword_to_sentence = answerkeys.get_answers(ngram_range, num_questions)

            model = Model()
            quizzes = model.get_questions(keyword_to_sentence, unmasker, k=num_questions, declarative=question_type)

            st.subheader('Questions')
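            for idx, quiz in enumerate(quizzes):
                question, options, answer = quiz
                st.write(question)

                # Streamlit widget keys must be unique, so build one per option
                for opt_idx, option in enumerate(options[:3]):
                    st.checkbox(option, key=f'q{idx}_opt{opt_idx}')

                ans_button = st.checkbox(answer, key=f'q{idx}_answer', value=True)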

            st.balloons()
            st.button('Save')


if __name__ == '__main__':
    main()



Overwriting app.py
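
Streamlit requires every widget key to be unique, which matters for the per-option checkboxes rendered in `main()`. The sketch below shows one possible key scheme; the helper names `widget_key` and `keys_for_quiz` are hypothetical and are not part of `app.py`.

```python
from typing import List


def widget_key(question_idx: int, option_idx: int) -> str:
    """Return a key that is unique for each (question, option) pair."""
    return f"q{question_idx}_opt{option_idx}"


def keys_for_quiz(num_questions: int, options_per_question: int) -> List[str]:
    """List the keys a quiz with the given shape would use (illustration only)."""
    return [
        widget_key(q, o)
        for q in range(num_questions)
        for o in range(options_per_question)
    ]


# Hypothetical quiz shape: 3 questions with 4 checkboxes each
keys = keys_for_quiz(num_questions=3, options_per_question=4)
print(keys[:4])
```

Because the key encodes both indices, two checkboxes collide only if they belong to the same question and the same option slot.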


## Local Deployment

In [None]:
# to test locally we install localtunnel
!pip install streamlit > /dev/null
!npm install -g localtunnel > /dev/null

[K[?25h

## Generate requirements.txt and Install Requirements

In [None]:
!pip install pipreqs

Collecting pipreqs
  Downloading pipreqs-0.4.11-py2.py3-none-any.whl (32 kB)
Collecting yarg
  Downloading yarg-0.1.9-py2.py3-none-any.whl (19 kB)
Installing collected packages: yarg, pipreqs
Successfully installed pipreqs-0.4.11 yarg-0.1.9


In [None]:
!pipreqs .

INFO: Successfully saved requirements file in ./requirements.txt


In [None]:
!pip install -r requirements.txt

Collecting keybert==0.5.0
  Downloading keybert-0.5.0.tar.gz (19 kB)
Collecting sentence_transformers==2.1.0
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting transformers==4.15.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting rich>=10.4.0
  Downloading rich-11.0.0-py3-none-any.whl (215 kB)
Collecting tokenizers>=0.10.3
  Downloading tokenizers-0.11.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)

## Run App

In [None]:
!streamlit run app.py --server.enableCORS=false &>/dev/null&

!lt --Bypass-Tunnel-Reminder --subdomain 'myapp' --port 8501 &>/dev/null&

In [None]:
# kill app and clean up memory
st_id = !pgrep streamlit
!kill {st_id[0]}

lt_id = !pgrep lt
!kill {lt_id[0]}

## Setup CI/CD Pipeline

For production deployment, we first set up a CI/CD pipeline:

1. First, we dockerize the app and push the image to a container registry (Google Container Registry, GCR) using a Dockerfile and `docker build`.

2. Then, we automate deployment with Cloud Build so that every time we push code to GitHub, the app is redeployed with the latest changes.


In [None]:
%%writefile Dockerfile
#Base Image to use
FROM python:3.7.9-slim

#Expose port 8080
EXPOSE 8080

#Optional - install git to fetch packages directly from github
RUN apt-get update && apt-get install -y git

#Copy Requirements.txt file into app directory
COPY requirements.txt app/requirements.txt
COPY env app/.env

#install all requirements in requirements.txt
RUN pip install -r app/requirements.txt

#Copy all files in current directory into app directory
COPY . /app

#Change Working Directory to app directory
WORKDIR /app

#Run the application on port 8080
ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8080", "--server.address=0.0.0.0"]

Writing Dockerfile
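
The Dockerfile hard-codes port 8080 in its `ENTRYPOINT`. Cloud Run passes the port to listen on through the `PORT` environment variable (8080 by default), so a launcher could read it instead. The following is a minimal sketch under that assumption; `streamlit_command` is a hypothetical helper and is not used by the Dockerfile above.

```python
from typing import List, Mapping


def streamlit_command(env: Mapping[str, str], default_port: int = 8080) -> List[str]:
    """Build the argv for `streamlit run`, taking the port from PORT when it is set."""
    port = int(env.get("PORT", default_port))
    return [
        "streamlit", "run", "app.py",
        f"--server.port={port}",
        "--server.address=0.0.0.0",
    ]


# With no PORT set this mirrors the ENTRYPOINT in the Dockerfile
print(streamlit_command({}))
# With PORT set (hypothetical value) the port follows the environment
print(streamlit_command({"PORT": "9090"}))
```

In a real launcher, `os.environ` would be passed as `env`; plain dictionaries are used here so the example behaves the same everywhere.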


In [None]:
%%writefile cloudbuild.yaml

 steps:
 # Build the container image
 - name: 'gcr.io/cloud-builders/docker'
   args: ['build', '-t', 'gcr.io/$PROJECT_ID/mcq-app:$COMMIT_SHA', '.']
 # Push the container image to Container Registry
 - name: 'gcr.io/cloud-builders/docker'
   args: ['push', 'gcr.io/$PROJECT_ID/mcq-app:$COMMIT_SHA']
 # Deploy container image to Cloud Run
 - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
   entrypoint: gcloud
   args:
   - 'run'
   - 'deploy'
   - 'mcq-app'
   - '--image'
   - 'gcr.io/$PROJECT_ID/mcq-app:$COMMIT_SHA'
   - '--region'
   - 'us-central1'
 images:
 - 'gcr.io/$PROJECT_ID/mcq-app:$COMMIT_SHA'

Overwriting cloudbuild.yaml
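
The `cloudbuild.yaml` above relies on the `$PROJECT_ID` and `$COMMIT_SHA` substitutions. The sketch below only mimics that expansion with Python's `string.Template` to show what the image tag looks like; it is not how Cloud Build implements substitutions, and the commit hash `abc1234` is a made-up value. `$COMMIT_SHA` is provided for builds started by a trigger, so the second case illustrates what could happen if it is empty.

```python
from string import Template
from typing import Dict, List


def expand(args: List[str], substitutions: Dict[str, str]) -> List[str]:
    """Replace $NAME placeholders in each argument (illustration only)."""
    return [Template(arg).safe_substitute(substitutions) for arg in args]


# Arguments of the first build step in cloudbuild.yaml
build_args = ['build', '-t', 'gcr.io/$PROJECT_ID/mcq-app:$COMMIT_SHA', '.']

# Trigger-style build: both values are available (hash is hypothetical)
triggered = expand(build_args, {"PROJECT_ID": "mcq-from-text", "COMMIT_SHA": "abc1234"})

# Build where COMMIT_SHA is empty: the tag ends with a bare colon
manual = expand(build_args, {"PROJECT_ID": "mcq-from-text", "COMMIT_SHA": ""})

print(triggered[2])
print(manual[2])
```

An image reference ending in `:` is not a valid tag, which is why this configuration is best run through a trigger that supplies the commit hash.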


In [None]:
from google.colab import auth
auth.authenticate_user()

# setup Google Cloud project

PROJECT_ID = 'mcq-from-text'
!gcloud config set project {PROJECT_ID}

Updated property [core/project].


In [None]:
# Package the code as a Docker image using the Dockerfile and push it to Google's Container Registry

# 1. build the image locally
!docker build --tag gcr.io/{PROJECT_ID}/mcq-app:latest .

# 2. build with Cloud Build and push to Google's Container Registry
!gcloud builds submit --timeout 30m --tag gcr.io/{PROJECT_ID}/mcq-app:latest

# 3. run the full CI/CD pipeline defined in cloudbuild.yaml
#    note: $COMMIT_SHA is filled in for trigger-started builds and may be empty on a manual submit
!gcloud builds submit --config cloudbuild.yaml .