# Building a Text Summarization Web Application in Flask

Hello Guys, 

In this tutorial, we will build a text summarization web application in Python using Flask. Before getting too technical about its inner workings, let me first demonstrate how the finished application behaves. Once it is built, it greets us with the following screen:

<img src = "https://raw.githubusercontent.com/datageekrj/ForHostingFiles/master/display1.PNG">

Take a good look at the pieces of the application shown above. There is an input field that says `Enter your text here` and a `submit` button, which sends the text to the server and returns its summarized version.

To demonstrate how the application works, I took the following piece of text from [Hyperparameter Tuning](https://towardsdatascience.com/understanding-hyperparameters-optimization-in-deep-learning-models-concepts-and-tools-357002a3338a) and pasted it into the input field.

**Model optimization is one of the toughest challenges in the implementation of machine learning solutions. Entire branches of machine learning and deep learning theory have been dedicated to the optimization of models. Typically, we think about model optimization as a process of regularly modifying the code of the model in order to minimize the testing error. However, deep learning optimization often entails fine tuning elements that live outside the model but that can heavily influence its behavior. Deep learning often refers to those hidden elements as hyperparameters as they are one of the most critical components of any machine learning application.
Hyperparameters are settings that can be tuned to control the behavior of a machine learning algorithm. Conceptually, hyperparameters can be considered orthogonal to the learning model itself in the sense that, although they live outside the model, there is a direct relationship between them.
The criteria of what defines a hyperparameter is incredibly abstract and flexible. Sure, there are well established hyperparameters such as the number of hidden units or the learning rate of a model but there are also an arbitrarily number of settings that can play the role of hyperparameters for specific models. In general, hyperparameters are very specific to the type of machine learning model you are trying to optimize. Sometimes, a setting is modeled as a hyperparameter because is not appropriate to learn it from the training set. A classic example are settings that control the capacity of a model (the spectrum of functions that the model can represent). If a deep learning algorithm learns those settings directly from the training set, then it is likely to try to optimize for that dataset which will cause the model to overfit( poor generalization).
If hyperparameters are not learned from the training set, then how does a model learn them? Remember that classic rule in machine learning models to split the input dataset in an 80/20 percent ratio between the training set and the validation set respectively? Well, part of the role of that 20% validation set is to guide the selection of hyperparameters. Technically, the validation set is used to “train” the hyperparameters prior to optimization.**

Here is how the application looks once the text is entered:

<img src = "https://raw.githubusercontent.com/datageekrj/ForHostingFiles/master/display2.PNG">

After entering the text, I hit the submit button, and the summarized version of the text is displayed like this:

<img src = "https://raw.githubusercontent.com/datageekrj/ForHostingFiles/master/display3.PNG"> 

Notice how the summarized version of the entered text is displayed above. You might be wondering how long this new text is.

By default, the summary keeps the top-scoring 30% of the sentences of the original text. You can easily change this, and I will show exactly how.

# Steps to create this Application...

**Learning Objectives:**
  * Logic of the text summarizing function using `nltk` library:
      * Take a text input
      * Extract all the sentences of the text = sentences tokenize
      * Extract all the words of the text = word tokenize
      * Create a weighted word frequency dictionary
      * Weight all the sentences; the higher the weight, the more important the sentence
      * Choose the top x% of the sentences, where x can be changed according to the use case
  * Building a function that takes any text input and returns a summary made of 30% of its sentences.
  * Finally, building the Flask Web Application and embedding the model that we have created. 

# Installing the necessary Modules

Run the following code to install all the necessary modules, so that everyone starts from the same setup before building the application.

In [35]:
!pip install pandas
!pip install flask
!pip install flask-ngrok
!pip install nltk



## Importing the necessary Modules

Before we start the actual coding, we first need to import the necessary modules that we will require in building the application.

So, let us do that.

In [0]:
from flask import Flask,request
from flask_ngrok import run_with_ngrok
import nltk
import pandas as pd
import re

# Let us briefly explain the objective of each of the above module

**Module Explanation:**
  * `flask` for building the web application server in python 
  * `pandas` is for making a data frame of the entered text and then choosing the top sentences according to their weights.
  * `nltk` for accessing the functions `sent_tokenize` and `word_tokenize`, which will be used to calculate the weights of the sentences.
  * `flask_ngrok` exposes the local Flask server through a temporary public ngrok URL. Why do we need that? By default, `Flask` serves on localhost, which is fine if you are building this application locally. But if you are running on a hosted environment such as `amazon aws`, `colab`, or `azure notebooks`, you usually cannot open that machine's localhost in your browser. For these situations, this library gives us a public URL through which we can access the Flask application.
  * `re` for text parsing and regular expression functions used in data cleaning.

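After importing all the necessary modules, we have to download a few resources.

We need to download the following two resources from the `nltk` library: the `stopwords` corpus, which provides the stopwords list, and `punkt`, the tokenizer model used by `sent_tokenize` and `word_tokenize`. If a newer NLTK release later raises a `LookupError` asking for `punkt_tab`, download that resource the same way.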

In [37]:
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Logic of the Text Summarization Application..

## Weighted Sentence Ranking Algorithm for Text Summarization using NLTK

Now, we will start talking about the logic of the text summarization application that we saw earlier.

Before actually discussing anything, first let us look at the following function `get_weighted_dictionary(words, stopwords)`


In [0]:
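def get_weighted_dictionary(words, stopwords):
  # Count how often each non-stopword occurs
  words_freq = {}
  for word in words:
    if word not in stopwords:
      if word not in words_freq:
        words_freq[word] = 1
      else:
        words_freq[word] += 1
  # Nothing to normalize (empty input or only stopwords)
  if not words_freq:
    return words_freq
  # Normalize so that the most frequent word gets the weight 1.0
  max_freq = max(words_freq.values())
  for key in words_freq.keys():
    words_freq[key] /= max_freq
  return words_freq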

Before going inside the above function, let us run it with some inputs and see what output we get.

We will first extract the stopwords from the `nltk` library, which we can then pass to the above function.

In [39]:
stopwords = nltk.corpus.stopwords.words("english")
word1 = ["Hello", "Yes", "Bye", "Hello", "Bye", "Rahul", "Bye", "Bye", "Jai"]
dict1 = get_weighted_dictionary(word1, stopwords)
dict1

{'Bye': 1.0, 'Hello': 0.5, 'Jai': 0.25, 'Rahul': 0.25, 'Yes': 0.25}

Try to understand the output we get from the function, and do not yet think about what is going on inside it.

The word "Bye" has the maximum value of 1, followed by "Hello" with 0.5. The reason is that "Bye" occurs most often in the list (4 times), and every count is divided by that maximum.

Similarly, "Hello" has the next highest frequency (2 occurrences, so 2/4 = 0.5), and the remaining words occur once each (1/4 = 0.25).

Once the output makes sense, we can talk about the inner workings of the function and see how it actually works.

* First, we initialize an empty dictionary.
* Then, we loop through the words.
* For each word, we check two things. First, we check that it is not in the stopwords list.
* Then we check whether it is already in the dictionary. If it is not, we initialize its frequency to 1; otherwise we increment it by one.
* After the loop ends, we have a dictionary with every non-stopword in the `words` list and its frequency.
* Finally, to normalize these frequencies, we divide each one by the maximum frequency.
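The same counting and normalizing steps can also be sketched with `collections.Counter` from the standard library. This is only an alternative illustration, not the function used in the rest of the tutorial; the name `weighted_frequencies` and the tiny stopword set are made up for the example.

```python
from collections import Counter

def weighted_frequencies(words, stopwords):
    """Count the non-stopwords, then divide every count by the largest count."""
    counts = Counter(word for word in words if word not in stopwords)
    if not counts:  # empty input or only stopwords
        return {}
    max_freq = max(counts.values())
    return {word: count / max_freq for word, count in counts.items()}

# Hypothetical sample data, mirroring the word list used in this tutorial
words = ["Hello", "Yes", "Bye", "Hello", "Bye", "Rahul", "Bye", "Bye", "Jai"]
print(weighted_frequencies(words, stopwords={"and", "the"}))
```

"Bye" occurs 4 times, so it gets the weight 1.0, and every other word is weighted relative to it.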

Now, let us understand the function `get_sentence_score(sentences, word_freq, stopwords)`, which takes a list of sentences and the `word_freq` dictionary and returns a list of scores, one per sentence. Before looking inside it, let us first run it on a list of sentences made from the words of the `word1` list above.

In [0]:
def get_sentence_score(sentences, word_freq, stopwords):
  sentences_score = []
  for sent in sentences:
    current_words = nltk.word_tokenize(sent)
    current_score = 0
    for word in current_words:
      if word not in stopwords:
        # Words missing from the dictionary contribute nothing instead of raising a KeyError
        current_score += word_freq.get(word, 0)
    # One score per sentence, in the same order as the input
    sentences_score.append(current_score)
  return sentences_score

Let us call this function with some sentences:

In [41]:
sent = ["Hello Rahul", "Yes", "Hello and Bye Rahul"]
get_sentence_score(sent, dict1, stopwords)

[0.75, 0.25, 1.75]

The above code produces a list of scores. The higher the score, the more say that sentence has in the final summarization.

Make sure the above output makes sense to you. Once it does, let us go further and understand how this function works.

### `get_sentence_score()`

This function takes three arguments: the `sentences` list, the `word_freq` dictionary and the `stopwords` list.

* We first make an empty list to store the sentence scores.
* We loop through each sentence and
  * tokenize it into words and set the current score to 0,
  * then, for every word that is not a stopword, look up its normalized frequency in the `word_freq` dictionary
  * and add that frequency to `current_score`.
* After the words of a sentence are processed, we append its score to the list of sentence scores, and once all sentences are done we return that list.
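As a rough sketch of the same scoring loop without NLTK, the version below splits sentences on whitespace with `str.split` (an assumption made only to keep the example self-contained; the tutorial itself uses `nltk.word_tokenize`). It looks words up with `dict.get` and a default of 0, so a word that is missing from the dictionary does not raise a `KeyError`. The name `score_sentences` and the sample data are illustrative.

```python
def score_sentences(sentences, word_freq, stopwords):
    """Sum the normalized frequencies of the non-stopwords in each sentence."""
    scores = []
    for sentence in sentences:
        score = 0
        for word in sentence.split():  # stand-in for nltk.word_tokenize
            if word not in stopwords:
                score += word_freq.get(word, 0)  # unknown words add nothing
        scores.append(score)
    return scores

# Hypothetical sample data, reusing the weights computed earlier
word_freq = {"Bye": 1.0, "Hello": 0.5, "Jai": 0.25, "Rahul": 0.25, "Yes": 0.25}
sentences = ["Hello Rahul", "Yes", "Hello and Bye Rahul", "Unknown words"]
print(score_sentences(sentences, word_freq, stopwords={"and"}))
```

The last sentence contains only words that are not in the dictionary, so it scores 0 and would be the last candidate for the summary.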

## Finally, creating a helpful function called `get_summary(text, percentage = 0.3)`.

Given a `text` string, this function produces a summary using the top 30% of the sentences by default, but you can pass whatever percentage you want. It uses the two helper functions that we created and discussed earlier.

Let us first run this function on a sample text and see how it works.

In [0]:
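def get_summary(text, percentage = 0.3):
  # Split the text into sentences; an empty text has nothing to summarize
  sentences = nltk.sent_tokenize(text)
  if not sentences:
    return ""
  stopwords = nltk.corpus.stopwords.words("english")
  words = nltk.word_tokenize(text)
  # Weight the words, then score each sentence with those weights
  word_freq = get_weighted_dictionary(words, stopwords)
  sentences_score = get_sentence_score(sentences, word_freq, stopwords)
  data = pd.DataFrame({"sentence":sentences, "score":sentences_score})
  # Keep the top `percentage` of the sentences, but never fewer than one
  n_top = max(1, int(data.shape[0] * percentage))
  top = data.sort_values(by = "score", ascending=False)["sentence"].iloc[0:n_top]
  return " ".join(top)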

In [43]:
get_summary("Machine Learning is really cool. Machine Learning finds it application in statistical and data mining application. Machine Learning takes data and returns a prediction on that data. There are many cool applications of machine learning. Machine Learning has enabled data scientists to find hidden patterns in data.")

'Machine Learning has enabled data scientists to find hidden patterns in data.'

So, I called the `get_summary(text, percentage = 0.3)` function on the following text:

**Machine Learning is really cool. Machine Learning finds it application in statistical and data mining application. Machine Learning takes data and returns a prediction on that data. There are many cool applications of machine learning. Machine Learning has enabled data scientists to find hidden patterns in data.**

The resulting summary is a single sentence, and it captures the main idea of the passage reasonably well. Now that we have seen the function at work, let us look at its inner workings.

### `get_summary(text, percentage = 0.3)`

This function takes two inputs: `text` and `percentage`, which defaults to 0.3.

* First, using nltk's `sent_tokenize` function, we get a list of sentences.
* Then, we load the English stopwords list from the `stopwords` corpus of the `nltk` library.
* Then, using nltk's `word_tokenize` function, we get a list of words.
* After that, we call the helper functions `get_weighted_dictionary()` and `get_sentence_score()` to get a score for every sentence.
* Next, we make a data frame from the two lists `sentences` and `sentences_score`, and sort it by score to get the top 30% of the sentences.
* Once we have those top sentences, we join them with a space.
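One thing to be aware of is that sorting by score also reorders the sentences in the summary. As a hedged variant (this is not how `get_summary` above orders its output), the sketch below picks the top-scoring sentences with plain Python and then restores their original order. It also keeps at least one sentence for very short inputs, since `int(n * percentage)` truncates towards zero. The name `pick_top_sentences` and the sample data are hypothetical.

```python
def pick_top_sentences(sentences, scores, percentage=0.3):
    """Join the highest-scoring sentences, in the order they appear in the text."""
    # int() truncates, so 7 sentences at 30% gives int(2.1) = 2
    k = max(1, int(len(sentences) * percentage))
    # Indices of the sentences, best score first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the k best indices, then put them back in document order
    chosen = sorted(ranked[:k])
    return " ".join(sentences[i] for i in chosen)

# Hypothetical sample data
sentences = ["a.", "b.", "c.", "d.", "e.", "f.", "g."]
scores = [1, 5, 2, 7, 0, 3, 4]
print(pick_top_sentences(sentences, scores))
```

Here "d." has the best score and "b." the second best, but the result lists "b." first because it comes first in the text.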

Now that we have a function `get_summary` which produces a summarized version of any text, we can start building the web application that uses this function to summarize any input text.


# Building the web application

To build the web application, we are going to use a Python framework called `flask`. It is one of the most lightweight frameworks out there.

In the cell below, we instantiate an `app` variable by calling the `Flask` constructor. Notice that we then pass `app` to the function `run_with_ngrok`, which makes the application available on an online (ngrok) URL in addition to the local server.

So, we will have two links.

* If you are working on your local machine, open the local link.
* If you are running on a cloud platform like `colab`, use the online link.

I will mention at the end of the tutorial which one is which.


In [0]:
app = Flask(__name__)
run_with_ngrok(app)

## Making sense of the following `html_style` variable. 

First of all, let us understand what this long string is. Look at the cell below and you will find a long `HTML` string.

If you are familiar with `HTML`, the following string is a piece of cake for you. But even if you are not, everything will start to make sense after this small discussion on `HTML`.

### HTML - The language of Web

HTML stands for `Hyper Text Markup Language` and it is the language of the web. With the help of HTML, we can define the structure of the web application that we want to build. 

`HTML` decides the structure of a web page. Whatever you see on the web page is structured because of `HTML`. Everything in our `text summarization application` is also structured using `HTML`.

But note that HTML only handles the structure of the application, while the look and feel of your application is handled by another technique called `CSS`. More on that later.

Now, what do we mean by look and feel? All the colors, positions and layout are decided by `CSS`.

To understand the following string, first of all understand that everything in `HTML` is made up of tags. 

What are tags? You might wonder..

Most HTML elements are written as a pair of tags: an opening tag and a closing tag. An opening tag looks like `<tag_name>` and a closing tag looks like `</tag_name>`.


Let us take the example of our application. It has an input box, a submit button, and various texts (at the bottom of the application, after the input box and submit button).

Also, the application has one image, on which we have the heading of the application and some orange color (which comes under CSS). 

The header on the image of the application can be handled by header tags. There are several different header tags:

* `h1` header tag
* `h2` header tag
* `h3` header tag
* `h4` header tag
* `h5` header tag
* `h6` header tag


All of these are used to place headers in the HTML content. They differ from each other in size, with `h1` being the largest.

Taking `h1` as an example, the opening tag would look like `<h1>` and the closing tag would look like `</h1>`. The content will be placed in between. 

Now, there are some tags which are self-closing, which means they have no closing tag. The `input` tag is one example of a self-closing tag.
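To see the difference between paired tags and tags without a closing tag, here is a small sketch using Python's built-in `html.parser` module. The `TagLogger` class and the sample markup are made up for this illustration; the parser simply reports each opening and closing tag it meets.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every opening and closing tag found in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

parser = TagLogger()
# <h1> comes as a pair, while <input> has no closing tag at all
parser.feed('<h1>Title</h1><input type="submit">')
print(parser.events)
```

The `h1` element produces both an "open" and a "close" event, while `input` only produces an "open" event.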

Now, if you look very closely at the following `html_style` variable, you will notice that everything is just `tags`, nothing else.

If you want to structure the HTML page, the main tag to concentrate on is `<body></body>`.

Most of the application resides inside this tag. One more important tag is the `<div></div>` tag, which is for making sensible divisions on the HTML page.

### One special tag `<style> </style>`

Inside this style tag, most of the colors and positions of the `HTML` elements are decided. If you look at the `html_style` text below, most of its content resides between `<style>` and `</style>`. If you don't notice this immediately, stop and take a moment to find it. Then, you can move on.




In [0]:
html_style = '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Text Summarizer</title>
    <style>
        *,html,body{
            margin:0px;
            padding: 0px;
        }
        #header{
            background: url(https://dtsvkkjw40x57.cloudfront.net/1350xnull/8160/uploads/2ab85f0f-bc62-417f-8a33-a064535a6ebd.jpg);
            opacity: 0.9;
            height: 300px;
            width: 100%;
            margin: 0px 0px 10px 0px;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
        }

        #header h1 {
            border: 0px solid #F49200;
            background-color: #F49200;
            border-radius: 5px;
        }
        #form-input,#output
        {
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
        }
        #output{
          display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            width:600px;
            margin:auto;
        }
        textarea{
            width: 500px;
            height: 100px;
            border: 1px solid orange;
            resize: none;
        }
        input{
            background-color: orange;
            color: white;
            border:none;
            outline: none;
            border-radius: 5px;
            width:70px;
            height: 30px;
        }
        form,#output{
            margin-top: 10px;
        }
        #output p{
            border: 1px solid orange;
            border-radius: 5px;
            padding: 2px 2px 0px 2px;
        }

        a{
            margin-top: 10px;
        }
        #desc{
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            margin-top: 30px;
        }
        ul{
            margin-top: 10px;
        }
    </style>
</head>
<body>
    <div id="header">
        <h1 style="color: white;">Welcome to Text Summarization Application</h1>
    </div>
'''

## Two states of our application - Posted or not Posted

The final application will have two different states. In the first one, the user is shown the input box and some information about the application. In the other one, the user has submitted some text and is shown its summary.

For each of these states, we have a different `HTML` string. The first, `not_posted`, is shown when the user visits the application. The second is for when the user posts some text and expects its summarized version; for that, we have created a lambda function called `posted(x)`, whose argument is nothing but the summarized version of the text posted by the user.

Notice one thing: we add `html_style` at the beginning in both states (`not_posted` and the function `posted(x)`). This is because the application style (colors, font sizes and all other styles) is the same for both states of the application.

Hence, whether the user is looking at the welcome screen or at the screen where the result is displayed after submitting the text, the style should be the same, and that style is what we declared earlier as `html_style`.

Now, remember the screen which is displayed once the user opens the app?

This is the state in which the `input` is shown. That page had a lot of things on it, like the text `Welcome to Text Summarization Application` and other texts. We also had a list informing users about the algorithm that we have used in the application.




In [0]:
not_posted = '''
<div id="form-input">
        <h3>Please Enter Your Text Here....</h3>
        <form method="POST">
            <textarea id = "text" name="text" placeholder="Enter your text here..."></textarea>
            <input type="submit" id="submit">
        </form>
    </div>'''

In [0]:
posted = lambda x : '''
<div id="output">
        <h3>Here is the summarized version of the text you provided....</h3>
        <p>''' + x + '''</p>
        <a href="/">Go Back</a>
    </div>
'''

## bottom

For the bottom part of the application, we have one more `HTML` string, which lists information about the application, like how it works and which algorithm it uses.

In [0]:
bottom = '''
<div id = "desc">
        <h2>Here is a Little description on how the algorithm works...</h2>
        <ul>
            <li>This uses a simple extractive summarization technique from Natural Language Processing: word-frequency based sentence ranking.</li>
            <li>Every word is weighted by how often it occurs in the text, and every sentence is scored by the weights of its words.</li>
            <li>You enter the text in the Input and the text data is sent to the server.</li>
            <li>Once the server receives the text, we have a function which will receive the entered text and it is going to apply the sentence ranking algorithm.</li>
            <li>Then, that function is going to produce a summarized version of the entered text.</li>
        </ul>
    </div>
</body>
</html>
'''

## Creating the 'route' for our web application

Now that we have defined the necessary HTML structure and saved it in separate variables, we need to create the route of the web application.

### What is route?

Route is just a fancy word for saying `URL`. 

A web application is just like any other website you visit online. Now, when you visit any website you enter some `URL` and that `URL` is what we are talking about here.

We want our web application to run on our own system first, without sharing it with the world, so we need some way to open it on our local machine.

**Further, with the help of `flask-ngrok`, we can also run the application online as long as the flask server on the cloud is running.**

Hence, we create a route for this purpose.

So, before the function `index()`, we have a statement which starts with `@`. In Python, this is called a `decorator`.

Inside this route we say that the URL is `'/'` and we will accept two types of methods `GET` and `POST`. 
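If decorators are new to you, the toy example below shows the general idea: a decorator is a function that receives the function defined under it, and here it records that function under a URL. This is a simplified sketch and not how Flask is implemented internally; `routes`, `route` and `home` are hypothetical names.

```python
# Toy registry mapping a URL to (view function, allowed methods)
routes = {}

def route(path, methods=("GET",)):
    """Return a decorator that registers the decorated function under `path`."""
    def register(view_func):
        routes[path] = (view_func, tuple(methods))
        return view_func  # the function itself is left unchanged
    return register

@route("/", methods=["GET", "POST"])
def home():
    return "hello"

view, allowed = routes["/"]
print(view(), allowed)
```

Writing `@route("/", ...)` above `home` is shorthand for `home = route("/", ...)(home)`.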

### Now, let us understand what do we mean by methods?

Remember when we talked about the states of the application? There were two states:

* Where you are welcomed to enter the `text` for summarization == `GET` request
* Once the user enters the text, the summarization is displayed == `POST` request

Also, if you remember, we had two different strings for the two methods. For `GET` requests we had `not_posted`, while for `POST` requests we had the lambda function `posted()`, which takes the summarized text and returns the corresponding `HTML`.

Inside the `index()` function, we test whether the request the user is making is a `POST` or a `GET`.

If the method is not `POST`, we just return the `not_posted` page. For a `POST`, we first get the text that the user entered using the `request.form.get` method. Once the text has been extracted, we call `get_summary(text, percentage = 0.3)` with it, and the function performs the sequence of steps we discussed and computes a summarized version of the text.

We store the summary in a variable called `summarized_text`, which is passed to the lambda function `posted()`; the result is concatenated between `html_style` and `bottom` and returned.
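To make the `POST` branch a bit more concrete: an HTML form without an explicit `enctype` is submitted as URL-encoded `name=value` pairs, where the name comes from the `name` attribute of the field (`text` in our `textarea`). The sketch below imitates that with the standard library; it is not Flask's own parsing code, and the sample text is made up.

```python
from urllib.parse import urlencode, parse_qs

# What the user typed into the textarea (hypothetical sample)
submitted = "Machine Learning is really cool. It finds hidden patterns & more."

# Roughly what the browser sends as the body of the POST request
body = urlencode({"text": submitted})
print(body)

# On the server, the body is decoded back into named fields
fields = parse_qs(body)
required_text = fields.get("text", [""])[0]
print(required_text)
```

This is why the key passed to `request.form.get` has to match the `name` attribute of the `textarea`.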

In [0]:
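from html import escape

@app.route("/", methods = ["GET", "POST"])
def index():
    if request.method == "POST":
        # The key matches name="text" on the textarea; fall back to "" if it is missing
        required_text = request.form.get("text", "")
        summarized_text = get_summary(required_text, percentage = 0.3)
        # Escape the summary so that user-supplied text cannot inject HTML into the page
        return html_style + posted(escape(summarized_text)) + bottom
    return html_style + not_posted + bottom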

## Running the application

Finally, once our app is ready and all the routes have been created, we can now run our application by calling the `run()` method on the app variable and we are ready to use this application..


***NOTE: Once we run the application, because we have wrapped the app with flask-ngrok, it is going to point us to two links. One is always the local address http://127.0.0.1:5000/ (the same as http://localhost:5000/), and the other is different on every run but includes the name ngrok in the URL, so that you can tell it is an ngrok URL. With this URL, your application is temporarily exposed through ngrok and you can access it from there. This is useful when you are not running this notebook on your local system; otherwise, the localhost URL will work just fine.***

In [50]:
if __name__ == "__main__":
    app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)


 * Running on http://1fcf3bc0.ngrok.io
 * Traffic stats available on http://127.0.0.1:4040


127.0.0.1 - - [28/Apr/2020 11:23:39] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [28/Apr/2020 11:23:40] "GET /favicon.ico HTTP/1.1" 404 -
