# T-725 Natural Language Processing: Lab 7
In today's lab, we will work with spaCy and Hugging Face on a variety of tasks. We'll also learn how to use Gradio to quickly build convenient user interfaces.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* **Select `"Runtime" > "Change runtime type"`, and make sure that you have "Hardware accelerator" set to "GPU"**
* Select `"Runtime" > "Run all"` to run the code in this notebook.

## spaCy

[spaCy](https://spacy.io) is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

### Features

Name | Description
---|---
**Tokenization** | Segmenting text into words, punctuation marks, etc.
**Part-of-speech (POS) Tagging** | Assigning word types to tokens, like verb or noun.
**Dependency Parsing** | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
**Lemmatization** | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
**Sentence Boundary Detection (SBD)** | Finding and segmenting individual sentences.
**Named Entity Recognition (NER)** | Labelling named “real-world” objects, like persons, companies or locations.
**Entity Linking (EL)** | Disambiguating textual entities to unique identifiers in a knowledge base.
**Similarity** | Comparing words, text spans and documents and how similar they are to each other.
**Text Classification** | Assigning categories or labels to a whole document, or parts of a document.
**Rule-based Matching** | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
**Training** | Updating and improving a statistical model’s predictions.
**Serialization** | Saving objects to files or byte strings.

### Trained Pipelines

While some of spaCy’s features work independently, others require [trained pipelines](https://spacy.io/models) to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data. spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules.

### Summarization Example
Let's take a look at some of the functionality of spaCy through the example of [automatic summarization](https://medium.com/luisfredgs/automatic-text-summarization-with-machine-learning-an-overview-68ded5717a25). There are two main types of summarization: extractive and abstractive. Extractive summarization selects a subset of sentences from the text to form a summary; abstractive summarization reorganizes the language in the text and adds novel words/phrases into the summary if necessary.

For this example we'll be doing automatic [extractive summarization](https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744).

First install spaCy:

In [1]:
!pip install -U spacy



Then import all necessary modules:

In [2]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

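Trained pipelines are available for many [other languages](https://spacy.io/usage/models). Here we load the small English pipeline, `en_core_web_sm` (if it is not already installed, run `!python -m spacy download en_core_web_sm` first):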

In [3]:
nlp = spacy.load('en_core_web_sm')



Choose some text to be summarized and store it in a variable:

In [4]:
long_text = """
Machine learning (ML) is the scientific study of algorithms and statistical models
that computer systems use to progressively improve their performance on a specific
task. Machine learning algorithms build a mathematical model of sample data, known as
“training data”, in order to make predictions or decisions without being explicitly
programmed to perform the task. Machine learning algorithms are used in the applications
of email filtering, detection of network intruders, and computer vision, where it
is infeasible to develop an algorithm of specific instructions for performing the task.
Machine learning is closely related to computational statistics, which focuses on
making predictions using computers. The study of mathematical optimization delivers
methods, theory and application domains to the field of machine learning. Data mining
is a field of study within machine learning and focuses on exploratory data analysis
through unsupervised learning. In its application across business problems, machine
learning is also referred to as predictive analytics.
"""

Pass the text to the `nlp` function:

In [5]:
doc = nlp(long_text)

At this point, the text has been processed, i.e., tokenized, lemmatized, tagged with parts-of-speech, and parsed. A variety of linguistic features are accessible via the `doc` object, e.g.:

* Lemmas
* Parts of speech
* Dependency parse
* Named entities
* Chunks
* Is alphabet character
* Is capitalized
* Is in the stop list

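The following prints several of these attributes for every token in the original text, one token per line: the text, lemma, coarse and fine-grained part-of-speech tags, dependency label, word shape, and the alphabetic and stop-word flags.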

In [6]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)


 
 SPACE _SP dep 
 False False
Machine Machine PROPN NNP compound Xxxxx True False
learning learning NOUN NN nsubj xxxx True False
( ( PUNCT -LRB- punct ( False False
ML ML PROPN NNP appos XX True False
) ) PUNCT -RRB- punct ) False False
is be AUX VBZ ROOT xx True True
the the DET DT det xxx True True
scientific scientific ADJ JJ amod xxxx True False
study study NOUN NN attr xxxx True False
of of ADP IN prep xx True True
algorithms algorithm NOUN NNS pobj xxxx True False
and and CCONJ CC cc xxx True True
statistical statistical ADJ JJ amod xxxx True False
models model NOUN NNS conj xxxx True False

 
 SPACE _SP dep 
 False False
that that SCONJ IN mark xxxx True True
computer computer NOUN NN compound xxxx True False
systems system NOUN NNS nsubj xxxx True False
use use VERB VBP relcl xxx True False
to to PART TO aux xx True True
progressively progressively ADV RB advmod xxxx True False
improve improve VERB VB xcomp xxxx True False
their their PRON PRP$ poss xxxx True True
performanc

Next, we'll use this information to filter keywords from the original text.

* Define the keywords list
* Choose the parts-of-speech that are likely to be important ([pos tags in spaCy](https://spacy.io/usage/linguistic-features/#pos-tagging))
* Skip tokens that are in the stop list
* Add tokens that have the part-of-speech we care about to the keywords list

In [7]:
keywords = []
pos_tags = ["PROPN", "ADJ", "NOUN", "VERB"]
for token in doc:
  if token.is_stop:
    continue
  if token.pos_ in pos_tags:
    keywords.append(token.text)
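
The filtering logic does not depend on spaCy itself, so it can be sketched on plain tuples. In the snippet below, the `(text, pos, is_stop)` triples are copied by hand from the first lines of the token printout above and stand in for real spaCy tokens; the names `toy_tokens`, `wanted_pos` and `toy_keywords` are made up for this illustration. The comprehension applies the same two checks as the loop.

```python
# (text, pos, is_stop) triples copied from the token printout above.
# They stand in for spaCy Token objects in this illustration.
toy_tokens = [
    ("Machine", "PROPN", False),
    ("learning", "NOUN", False),
    ("is", "AUX", True),
    ("the", "DET", True),
    ("scientific", "ADJ", False),
    ("study", "NOUN", False),
    ("of", "ADP", True),
    ("algorithms", "NOUN", False),
]

wanted_pos = {"PROPN", "ADJ", "NOUN", "VERB"}

# Keep a token only if it is not a stop word and has a wanted part of speech.
toy_keywords = [
    text for text, pos, is_stop in toy_tokens
    if not is_stop and pos in wanted_pos
]

print(toy_keywords)
```

Using a set for the wanted tags is a small design choice: membership tests read the same as with a list, but do not depend on the order of the tags.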

Next, we count how often each keyword occurs using `Counter` and store the result in `freq_words`.

To view the top `n` most frequent words, the `most_common(n)` method can be used:

In [8]:
freq_words = Counter(keywords)
freq_words.most_common(5)

[('learning', 8), ('Machine', 4), ('study', 3), ('algorithms', 3), ('task', 3)]

To make the counts comparable, we normalise them by dividing each keyword's frequency by the maximum frequency, so that the most frequent keyword gets a weight of 1.0:

In [9]:
max_freq = freq_words.most_common(1)[0][1]
for word in freq_words.keys():
  freq_words[word] = (freq_words[word]/max_freq)

freq_words.most_common(5)

[('learning', 1.0),
 ('Machine', 0.5),
 ('study', 0.375),
 ('algorithms', 0.375),
 ('task', 0.375)]
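
As a quick check of the arithmetic, the normalisation can be reproduced on the top three counts from the `most_common` output above. This is a minimal sketch; `toy_counts` and `toy_weights` are names introduced here, and the weights go into a new dictionary so the raw counts stay available.

```python
from collections import Counter

# Top three raw counts taken from the most_common output above.
toy_counts = Counter({"learning": 8, "Machine": 4, "study": 3})

# The largest count becomes the divisor, so the top word scores 1.0.
top_count = max(toy_counts.values())
toy_weights = {word: count / top_count for word, count in toy_counts.items()}

print(toy_weights)
```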

Next, we score each sentence by summing the normalised frequencies of the keywords it contains. The result is stored in the dictionary `sent_strength`, where the keys are the sentences and the values are their scores:

In [10]:
sent_strength = {}
for sent in doc.sents:
  for word in sent:
    if word.text in freq_words.keys():
      if sent in sent_strength.keys():
        sent_strength[sent] += freq_words[word.text]
      else:
        sent_strength[sent] = freq_words[word.text]

print(sent_strength)

{
Machine learning (ML) is the scientific study of algorithms and statistical models
that computer systems use to progressively improve their performance on a specific
task.: 4.125, Machine learning algorithms build a mathematical model of sample data, known as
“training data”, in order to make predictions or decisions without being explicitly
programmed to perform the task.: 4.625, Machine learning algorithms are used in the applications
of email filtering, detection of network intruders, and computer vision, where it
is infeasible to develop an algorithm of specific instructions for performing the task.
: 4.25, Machine learning is closely related to computational statistics, which focuses on
making predictions using computers.: 2.625, The study of mathematical optimization delivers
methods, theory and application domains to the field of machine learning.: 3.125, Data mining
is a field of study within machine learning and focuses on exploratory data analysis
through unsupervised learn
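
The accumulation step can also be written without the nested `if`/`else`, using `dict.get` with a default of 0. The sketch below runs on plain word lists instead of spaCy spans; the sentences, weights and names (`toy_sentences`, `toy_strength`) are assumptions made up for illustration.

```python
# Illustrative keyword weights and tokenized sentences (assumed values).
toy_weights = {"learning": 1.0, "Machine": 0.5, "study": 0.375}
toy_sentences = {
    "s1": ["Machine", "learning", "is", "the", "study", "of", "algorithms"],
    "s2": ["Data", "mining", "is", "a", "field", "of", "study"],
}

toy_strength = {}
for name, words in toy_sentences.items():
    for word in words:
        if word in toy_weights:
            # dict.get supplies 0 the first time a sentence is seen.
            toy_strength[name] = toy_strength.get(name, 0) + toy_weights[word]

print(toy_strength)
```

As in the notebook cell, a sentence that contains no keywords never receives an entry, so it cannot be selected for the summary.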

Next, the `nlargest` function is used to select the sentences for the summary. Here it takes 3 arguments:

* The number of elements to extract
* An iterable (here a dictionary, which iterates over its keys)
* A key function that returns the value used to rank each element

With these arguments, `nlargest` returns a list containing the 4 sentences with the highest sentence strength scores calculated in the previous step.

We store this output in `summarized_sentences`:

In [11]:
summarized_sentences = nlargest(4, sent_strength, key=sent_strength.get)

print(summarized_sentences)

[Machine learning algorithms build a mathematical model of sample data, known as
“training data”, in order to make predictions or decisions without being explicitly
programmed to perform the task., Machine learning algorithms are used in the applications
of email filtering, detection of network intruders, and computer vision, where it
is infeasible to develop an algorithm of specific instructions for performing the task.
, Data mining
is a field of study within machine learning and focuses on exploratory data analysis
through unsupervised learning., 
Machine learning (ML) is the scientific study of algorithms and statistical models
that computer systems use to progressively improve their performance on a specific
task.]
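
To see how `nlargest` behaves on a dictionary in isolation, the sketch below uses the first five scores from the `sent_strength` printout, keyed by hypothetical labels `s1` to `s5` instead of the sentences themselves. The printout is cut off, so this is only a partial sample of the real scores.

```python
from heapq import nlargest

# The first five scores printed above, keyed by hypothetical labels s1..s5.
# The printed output is cut off, so this is only a partial sample.
toy_scores = {"s1": 4.125, "s2": 4.625, "s3": 4.25, "s4": 2.625, "s5": 3.125}

# Iterating a dict yields its keys; key=toy_scores.get ranks them by value.
top_three = nlargest(3, toy_scores, key=toy_scores.get)

print(top_three)
```

The result is ordered from highest to lowest score, not by position in the text.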


Lastly, join the text of the sentences in `summarized_sentences` into a single string and print it:

In [12]:
final_sentences = [w.text for w in summarized_sentences]
summary = ' '.join(final_sentences)
print(summary)

Machine learning algorithms build a mathematical model of sample data, known as
“training data”, in order to make predictions or decisions without being explicitly
programmed to perform the task. Machine learning algorithms are used in the applications
of email filtering, detection of network intruders, and computer vision, where it
is infeasible to develop an algorithm of specific instructions for performing the task.
 Data mining
is a field of study within machine learning and focuses on exploratory data analysis
through unsupervised learning. 
Machine learning (ML) is the scientific study of algorithms and statistical models
that computer systems use to progressively improve their performance on a specific
task.
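
Two things stand out in this output: the sentences appear in score order rather than document order (the opening definition comes last), and the line breaks of the original text carry over. One possible refinement, sketched here on made-up sentences and scores, is to select sentence indices instead of sentences, sort the indices, and collapse whitespace before joining. All names below are hypothetical.

```python
from heapq import nlargest

# Hypothetical sentences in document order, with made-up scores.
toy_sentences = ["First\nsentence.", "Second sentence.", "Third\nsentence."]
toy_scores = [2.0, 1.0, 3.0]

# Pick the indices of the two highest-scoring sentences...
best = nlargest(2, range(len(toy_sentences)), key=toy_scores.__getitem__)

# ...then sort the indices so the summary follows the original order,
# and collapse line breaks and repeated spaces inside each sentence.
ordered = [" ".join(toy_sentences[i].split()) for i in sorted(best)]
toy_summary = " ".join(ordered)

print(toy_summary)
```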


This example only shows a very limited application of [spaCy](https://spacy.io). The package has many powerful tools to create NLP applications.

## Gradio
[Gradio](https://gradio.app) is a fast way to demo your machine learning model with a nice web interface so that anyone can use it. The possibilities with Gradio are vast, and this lab only scratches the surface.

Here's the setup for a very basic UI:

* First, define a function that does your main processing when users click the 'Submit' button in the UI.
* Then define a gradio `Interface` called `demo`. This constructor has several arguments:
  * The first in this example is the name of the function you defined
  * The second is the type of inputs you want to capture (one text input in this case)
  * The third is the type of output (also text)
* Lastly, call the `Interface` object's `launch()` function to render the UI.

In [13]:
!pip install gradio



In [14]:
import gradio as gr

In [15]:
def greet(name):
    return "Hello " + name + "!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")

demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://584134c8de835b6924.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




You can do any kind of processing inside your main function, like call other functions. Let's create a quick summarization tool using [sumy](https://github.com/miso-belica/sumy).

* First, install the necessary packages for sumy.
* Then import the modules for the summarization task.

In [16]:
!pip install sumy



In [17]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [18]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

LANGUAGE = "english"

def sumy_summarize(txt, n_sent=1):
  parser = PlaintextParser.from_string(txt, Tokenizer(LANGUAGE))
  stemmer = Stemmer(LANGUAGE)

  summarizer = Summarizer(stemmer)
  summarizer.stop_words = get_stop_words(LANGUAGE)

  sents = ""
  for sentence in summarizer(parser.document, n_sent):
    sents += str(sentence) + "\n"

  return sents

Return the output of the `sumy_summarize` function inside your main Gradio function, passing it the user input.

Try using the text from the previous summarization example as input and experiment with the `n_sent` parameter.

In [19]:
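# Wrapper passed to Gradio; it receives the text box contents as `text`.
# The name `summarize` keeps Python's built-in `sum` available.
def summarize(text):
    return sumy_summarize(text)

sum_demo = gr.Interface(fn=summarize, inputs="text", outputs="text")
sum_demo.launch()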





The `sumy_summarize` function takes another parameter in addition to the text:

* `n_sent` (int, optional, default 1): The number of sentences of the original text to be chosen for the summary.

Let's add more input elements to the interface:

* A checkbox widget allowing the user to capitalize the output
* A number box to set `n_sent`

In [20]:
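def summarize(text, make_caps, number_of_sentences):
    # The number input may arrive as a float or be left empty,
    # so convert it to an int and fall back to 1 sentence.
    n_sent = int(number_of_sentences or 1)
    summary = sumy_summarize(text, n_sent=n_sent)
    return summary.upper() if make_caps else summary

sum_demo = gr.Interface(
    fn=summarize,
    inputs=["text", "checkbox", "number"],
    outputs="text"
)

sum_demo.launch()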





The `Interface` is highly customizable. For example, you can give each element different labels, have placeholder text, change colors, etc. Check the [Gradio documentation](https://gradio.app/docs/) for details.
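
Because the function given to `fn` is ordinary Python, its logic can be checked without launching a UI. The sketch below mirrors the three-input function above, with a hypothetical `fake_summarize` standing in for `sumy_summarize` so that it runs without sumy installed. It casts the sentence count to `int`, on the assumption that the number input delivers a float.

```python
def fake_summarize(txt, n_sent=1):
    # Hypothetical stand-in for sumy_summarize: returns the first n_sent lines.
    return "\n".join(txt.splitlines()[:n_sent])

def summarize_ui(text, make_caps, number_of_sentences):
    # Arguments arrive in the same order as the entries of `inputs`.
    summary = fake_summarize(text, n_sent=int(number_of_sentences))
    return summary.upper() if make_caps else summary

# Call it the way the interface would, with a float from the number input.
print(summarize_ui("one\ntwo\nthree", True, 2.0))
```

Testing the function directly like this makes it easier to separate bugs in the processing logic from problems with the interface configuration.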

## Hugging Face


🤗 [Transformers](https://huggingface.co/docs/transformers/index) provides state-of-the-art machine learning for PyTorch, TensorFlow, and JAX, with APIs and tools to easily download and train pretrained models.

Begin by installing the Hugging Face `transformers` library:

In [21]:
!pip install transformers



Import the necessary modules for the tasks up ahead:

In [22]:
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer, pipeline, Conversation

Hugging Face pipelines make it simple to use a model for inference on language, computer vision, speech, and multimodal tasks (see [docs](https://huggingface.co/docs/transformers/pipeline_tutorial)). If no model is specified, `pipeline()` automatically loads a default model and a preprocessing class capable of inference for your task.

The following is an example of using a Hugging Face pipeline to do automatic abstractive summarization:

* First, create a pipeline object, here called `summarizer`.
* In this example, `pipeline()` takes two arguments:
  * The name of the task (see [docs for existing pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)). In our case, the task is 'summarization'.
  * The model. We use [facebook/bart-large-xsum](https://huggingface.co/facebook/bart-large-xsum), but this can be swapped for any other model available for this task on Hugging Face.

* Next, define a string variable containing the text to summarize. Here it's called `text_to_summarize`.
* Then call the `summarizer` pipeline, passing the string and optionally setting the maximum and minimum length and the generation method (here `do_sample=False` turns off sampling).
* Finally, print the output.

In [23]:
summarizer = pipeline("summarization", model="facebook/bart-large-xsum")

In [24]:
text_to_summarize = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

In [25]:
summary = summarizer(text_to_summarize, max_length=130, min_length=30, do_sample=False)
print(summary)

[{'summary_text': 'A New York woman has pleaded not guilty to falsely claiming to be married 10 times, including to eight men from different countries, in what prosecutors say was an immigration scam.'}]
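
The pipeline returns a list containing one dictionary rather than a bare string. To get just the summary text, index the list and then look up the `'summary_text'` key. The sketch below uses the printed result above as a literal, so it runs without loading the model; `result` and `summary_text` are names chosen for this illustration.

```python
# The pipeline result printed above: a list holding one dictionary.
result = [{'summary_text': 'A New York woman has pleaded not guilty to falsely claiming to be married 10 times, including to eight men from different countries, in what prosecutors say was an immigration scam.'}]

# Index the list first, then look up the 'summary_text' key.
summary_text = result[0]["summary_text"]

print(summary_text)
```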


Let's take things up a notch and create a non-goal-oriented chatbot using pipelines and Gradio.

We can define our model and tokenizer explicitly to choose the model best suited for the task, instead of relying on the pipeline's default model. A good model for aimless chit-chat is the [Blenderbot](https://huggingface.co/facebook/blenderbot-400M-distill) model. Note that there are many other chat models you could choose for this task.

Start by defining the tokenizer and model:

In [26]:
chat_tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
chat_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/blenderbot-400M-distill")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Next, we create a 'conversational' pipeline called `bot` using the model and tokenizer we defined above:

In [27]:
bot = pipeline(task="conversational", model=chat_model, tokenizer=chat_tokenizer)

In order to build a basic chatbot interface using a pretrained model, we take the following steps:

* Initialize the `Conversation` object outside the function, in order to keep building the dialog history over time, so the model has the dialog context when it generates responses.
* Define a `chat` function that takes text from user input and keeps a history of the dialog across turns. This function:
  * Adds the user's current input to the `Conversation` by calling `add_user_input` on the `convo` object, passing the `input`.
  * Passes the `convo` object to the `bot` to make the model generate a response.
  * The `bot` object outputs the whole conversation by default, so we call `generated_responses` to get a simple list of the generated responses so far, and get the last item in the list (i.e., the most recent response).
  * Then we update the `history`, which is a list of tuples: (user-input, bot-response)
* Lastly, we define a Gradio `Interface` that:
  * Calls the `chat` function
  * Takes text as input
  * Creates widgets from a predefined Gradio layout for a 'chatbot'
  * Retains a 'state' of the dialog by retaining a list of the chat `history`.
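
The pairing of the flat `history` list into (user-input, bot-response) tuples, described in the steps above, can be sketched in isolation. The dialog turns below are made up, and slicing with `zip` is just one possible way to express the pairing.

```python
# A flat history alternates user inputs and bot responses (made-up turns).
toy_history = ["Hi!", "Hello, how are you?", "Fine, thanks.", "Glad to hear it."]

# Pair each even-indexed item (user) with the following odd-indexed item (bot).
toy_pairs = list(zip(toy_history[::2], toy_history[1::2]))

print(toy_pairs)
```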


In [28]:
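convo = Conversation()
def chat(input, history=None):
    # Start with an empty history on the first turn
    # (a mutable default such as [] would be shared between calls).
    history = history or []
    convo.add_user_input(input)
    output = bot([convo]).generated_responses[-1]
    history.append(input)
    history.append(output)
    # Pair up (user-input, bot-response) turns for the chatbot widget.
    response = [(history[i], history[i+1]) for i in range(0, len(history)-1, 2)]
    return response, history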

interface = gr.Interface(
    fn=chat,
    theme="default",
    css=".footer {display:none !important}",
    inputs=["text", "state"],
    outputs=["chatbot", "state"],
)

if __name__ == '__main__':
    interface.launch()



You now have a variety of powerful tools at your disposal to rapidly create interesting and useful NLP applications!

# Assignment
Complete the following questions and hand in your solution in Canvas before 8:30 Monday morning, October 16th. Remember to save your file before uploading it.

## Part 1
Compare the outputs of the three summarization methods covered in this notebook.

Use the same piece of text for each method to answer these questions:

1. Which method performs the best, in your opinion?
2. What are the pros and cons of each method?
3. What kind of summarization is each method doing?

In [29]:
# Your solution here




## Part 2
Create a sentiment classifier using Gradio and Hugging Face.

* Augment the simple version of the Gradio interface.
* Add a Hugging Face 'text-classification' pipeline, using the 'cardiffnlp/twitter-roberta-base-sentiment' model.

This model outputs a list that contains one dictionary object. In this dictionary, the predicted class is the value of the key 'label'. The model outputs one of three sentiment classes:

* `LABEL_0` for negative
* `LABEL_1` for neutral
* `LABEL_2` for positive

Your app should take some text as input and output **one** of these three words:

* Positive
* Neutral
* Negative

In [30]:
# Alter this codeblock and/or create as many blocks as necessary to accomplish the task for this part

def analyze(text):
    return "Echo: " + text

sentiment_analyzer = gr.Interface(
    fn=analyze,
    inputs="text",
    outputs="text"
)

sentiment_analyzer.launch()





## Part 3

Create a chatbot that can do these three things:

*   Summarize text
*   Analyze sentiment
*   Mindless chit-chat

Augment the chatbot code below to accomplish the following:

1. When the user ticks a checkbox that says 'do_summary':
  * The bot should summarize the user's long text input using the pipeline defined earlier and respond with the summary
  * This process may bypass the default chit-chat functionality
2. When the user ticks a checkbox that says 'do_sentiment':
  * The bot should analyze the sentiment of the user's text input using the method created in Part 2 and respond accordingly
  * This process may bypass the default chit-chat functionality
3. Process any other input using the default chat functionality

In [31]:
# Alter this codeblock and/or create as many blocks as necessary to accomplish the task for this part

bot = pipeline(task="conversational", model=chat_model, tokenizer=chat_tokenizer)

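convo = Conversation()
def chat(input, history=None):
    # Start with an empty history on the first turn
    # (a mutable default such as [] would be shared between calls).
    history = history or []
    convo.add_user_input(input)
    output = bot([convo]).generated_responses[-1]
    history.append(input)
    history.append(output)
    # Pair up (user-input, bot-response) turns for the chatbot widget.
    response = [(history[i], history[i+1]) for i in range(0, len(history)-1, 2)]
    return response, history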

interface = gr.Interface(
    fn=chat,
    theme="default",
    css=".footer {display:none !important}",
    inputs=["text", "state"],
    outputs=["chatbot", "state"],
)

if __name__ == '__main__':
    interface.launch()

