# Natural Language Processors with Transformers


## Text Classification- Sentiment Analysis

In [1]:
# We're going to use Hugging Face Transformers"
# The pipeline function from the transformers library is designed to make it easier to use various
#  Natural Language Processing (NLP) tasks. When you create a pipeline, you specify two key
#  arguments: the task you want to perform, and optionally, the model you want to use for that task. 
# Start by importing pipelines from transformer
from transformers import pipeline

In [2]:
# Lets make a positive sentiment
text = """Dear Esteemed Lunar Vanguard Team,

I recently acquired the Multi-Functional Gadgetron from your lunar store and I'm genuinely impressed! The design is remarkably innovative, demonstrating the exceptional creativity and technological prowess of humans. Its diverse functions have been immensely helpful in my intergalactic travels. I did notice a slight hiccup with the interstellar compatibility mode under different gravitational conditions, but it's a minor issue in the grand scheme. Also, a suggestion for future iterations - including more galactic language options in the manual would be a delightful touch for us extraterrestrial users. Overall, it's a fantastic product that embodies the spirit of human ingenuity. Eagerly awaiting future innovations!

Warm regards,
Zarlox from Zeta Reticuli"""

In [3]:
# Put together some text that you want evaluated to see if its positive or negative...
text2 = """Dear Terran Outpost,

I am writing to express my experience with the Multi-Functional Gadgetron I purchased during my last visit to your lunar establishment. Firstly, I must commend the innovative design and versatile functionality; it truly showcases the ingenuity of human engineering. However, I encountered some challenges with the interstellar compatibility mode. The device struggles to adapt to the gravitational variances of my home planet, leading to occasional malfunctions in its holographic display. Additionally, the user manual, while comprehensive, lacks translations in common galactic languages, which made initial setup quite perplexing. I appreciate the effort to accommodate diverse species, but these improvements would greatly enhance usability for us non-Terrans. Looking forward to the updated version.

Sincerely,
Zarlox from Zeta Reticuli"""


In [4]:
# I'm going to use pandas to look at dataframes that I get back from the classifier, so i'll import it heref
import pandas as pd

In [5]:
# Now, we want to instantiate a pipeline by calling the `pipelines()` function
#
# When you pass "text-classification" as an argument, the pipeline function creates a pipeline for
#  classifying text. This includes loading a model that has been trained on a text classification
#  task, along with all necessary preprocessing and postprocessing steps.
#
# Note: it defaults to distilbert for the model, but just use it so you don't get a warning msg
#  This is a model that’s been trained specifically for sentiment analysis (a type of text 
#  classification) on English text.
# 
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

In [6]:
outputs = classifier("text")
outputs

[{'label': 'POSITIVE', 'score': 0.9942922592163086}]

In [7]:
# Lets make the array look nicer with dataframes in pandas
pd.DataFrame(outputs)

Unnamed: 0,label,score
0,POSITIVE,0.994292


**As you can see the label shows up as positive, and the score is almost 1. Very positive!**

## Named Entity Recognition (NER)
**Named Entities are Persons / Places / Products / People. Basically Real world objects. NER helps extract them from the text**


In [8]:
# We can employ NER by instantiating a pipeline and then feeding our customer review to it
# If no model is supplied, it defaults to "dbmdz/bert-large-cased-finetuned-conll03-english"
ner_tagger = pipeline("ner", aggregation_strategy="simple", model="dbmdz/bert-large-cased-finetuned-conll03-english")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
output_ner = ner_tagger(text)
pd.DataFrame(output_ner)

Unnamed: 0,entity_group,score,word,start,end
0,ORG,0.994586,Lunar Vanguard Team,14,33
1,MISC,0.818788,Multi - Functional Gadgetron,60,86
2,PER,0.668797,Zar,776,779
3,ORG,0.505405,##lox,779,782
4,ORG,0.953664,Zeta Reticuli,788,801


**As you can see it was able to pick out the important bits of the text**

## Question Answering
**This is pretty cool. You can pass your text to the model as an argument 'context', and then supply a question that you would like the model to try to derive from the context.**
**This is using what is known as _extractive question answering_**

In [10]:
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

In [11]:
question = "What does the customer want?"

In [12]:
outputs = reader(question=question, context=text)

In [13]:
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.DataFrame([outputs])

Unnamed: 0,score,start,end,answer
0,0.04932,541,571,more galactic language options


## Summarization
**Take all the text and generate a shorter version of it**


In [14]:
# If No model is supplied, defaulted to sshleifer/distilbart-cnn-12-6
# In this example you'll see that we can tweak the output post processing (ie by using max length and cleanup tokens etc.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
outputs = summarizer(text, max_length=56, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

 The design is remarkably innovative, demonstrating the exceptional creativity and technological prowess of humans. I did notice a slight hiccup with the interstellar compatibility mode under different gravitational conditions. Overall, it's a fantastic product that embodies the spirit of human ingenuity. Eagerly awaiting


## Translation
**Lets use a translation pipeline to translate English to Russian**


In [15]:
# create the pipeline with the specific translators you want. i'll choose both german and russian
german_translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
ru_translator = pipeline("translation_en_to_ru", model="Helsinki-NLP/opus-mt-en-ru")

In [16]:
# now create outputs based on our text, and a few of the postprocessor arguments (cleanup and min len)
outputs = german_translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Liebe Esteemed Lunar Vanguard Team, Ich habe vor kurzem die Multi-Functional Gadgetron aus Ihrem lunar Store erworben und ich bin wirklich beeindruckt! Das Design ist bemerkenswert innovativ, zeigt die außergewöhnliche Kreativität und technologische Leistungsfähigkeit der Menschen. Seine vielfältigen Funktionen waren enorm hilfreich bei meinen intergalaktischen Reisen. Ich habe eine leichte Schluckauf mit dem interstellaren Kompatibilitätsmodus unter verschiedenen Gravitationsbedingungen bemerkt, aber es ist ein kleines Problem in der großen Schema. Auch ein Vorschlag für zukünftige Iterationen - einschließlich mehr galaktische Sprachoptionen im Handbuch wäre eine herrliche Berührung für uns außerirdische Benutzer. Insgesamt ist es ein fantastisches Produkt, das den Geist der menschlichen Einfallsreichtum verkörpert. Eifrig auf zukünftige Innovationen warten! Warme Grüße, Zarlox von Zeta Reticuli


In [17]:
# and now lets do it for russian!
outputs = ru_translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Дорогая Эстемед Лунар Вангар, я недавно приобрела многофункциональное Гадгетрон из вашего лунного магазина и я действительно впечатлена! Этот дизайн удивительно новаторский, демонстрирующий исключительный творческий и технологический талант людей. Его разнообразные функции были чрезвычайно полезны в моих межгалактических путешествиях. Я заметила небольшую икотику с межзвездным режимом совместимости при различных гравитационных условиях, но это небольшая проблема в великой схеме. Кроме того, предложение о будущих итерациях - включая более галактические варианты в руководстве было бы восхитительным прикосновением для внеземных пользователей. В целом, это фантастический продукт, который воплощает дух человеческой изобретательности. Эгерли ожидает будущих инноваций.


## Text Generation for a response
**You can use transformers pipeline() to create a response to a message as well!**

In [18]:
# If no model is selected, it defaults to gpt2
generator = pipeline("text-generation", model="gpt2")
response = """Dear Zarlox,

Thank you for your glowing review of the Gadgetron! The Lunar Vanguard Team is excited to incorporate your suggestions for improved interstellar compatibility and broader language support in our next update.

Best regards,
The Lunar Vanguard Team"""


prompt = text + "\nTerra Nova Customer Service Response:\n" + response
outputs = generator(prompt, max_length=200)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [19]:
print(outputs[0]["generated_text"])

Dear Esteemed Lunar Vanguard Team,

I recently acquired the Multi-Functional Gadgetron from your lunar store and I'm genuinely impressed! The design is remarkably innovative, demonstrating the exceptional creativity and technological prowess of humans. Its diverse functions have been immensely helpful in my intergalactic travels. I did notice a slight hiccup with the interstellar compatibility mode under different gravitational conditions, but it's a minor issue in the grand scheme. Also, a suggestion for future iterations - including more galactic language options in the manual would be a delightful touch for us extraterrestrial users. Overall, it's a fantastic product that embodies the spirit of human ingenuity. Eagerly awaiting future innovations!

Warm regards,
Zarlox from Zeta Reticuli
Terra Nova Customer Service Response:
Dear Zarlox,

Thank you for your glowing review of the Gadgetron! The Lunar Vanguard Team is excited to incorporate your suggestions for improved interstellar c

## Interesting Links
* HF Hub:                       https://oreil.ly/zLK11
* HF Tokenizers & Transformers: https://oreil.ly/Z79jF
* HF Datasets:                  https://oreil.ly/959YT
* HF Accelerate:                https://oreil.ly/iRfDe

# Conclusion- The main challenges with transformers
1. **Language**- NLP Is dominated with english language
2. **Data availability**- Transfer learning can reduce the amount of labeled training data, but its still a lot compared to how much a human needs to perform a task
3. **Working with long documents**- self-attention works well on paragraph-long texts, but it gets harder with whole documents
4. **Opacity**- Its hard to unravel 'why' the pipeline makes the predictions it does
5. **Bias**- The models are trained on the internet, and the internet is a bias effing place for real