# NLP Transformers Quick Start

## Project Setup

This project uses Pipenv to manage its dependencies. Install them, activate the environment, and launch JupyterLab:

```bash
pipenv install
pipenv shell
# inside the Pipenv shell:
jupyter lab
```

In [1]:
from transformers import pipeline
import pandas as pd

> 🛑 Warning: These models are quite large. Downloading them takes up disk space, and running this notebook locally requires compute resources. Running the notebook from start to finish is not really recommended. To remove the downloaded models afterwards, run `rm -rf ~/.cache/huggingface/hub/models*`.
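
Before deleting anything, it can help to see how much space the cache actually uses. The following is a minimal sketch, assuming the cache lives in the default `~/.cache/huggingface/hub` directory named in the cleanup command; `dir_size_bytes` is a hypothetical helper, not part of any library.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    if not root.exists():
        return 0
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        # Skip symlinks so a file reachable through a link is counted once.
        if not p.is_symlink() and p.is_file()
    )


# Assumed default cache location, taken from the cleanup command above.
hub_cache = Path.home() / ".cache" / "huggingface" / "hub"
print(f"{hub_cache}: {dir_size_bytes(hub_cache) / 1e6:.1f} MB")
```

If the cache has been moved (for example through an environment variable), point `hub_cache` at that directory instead.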

## Basic Examples

### Text Classification

In [5]:
# By default, this uses a DistilBERT sentiment analysis model
classifier = pipeline('text-classification')
text = 'This product is really terrible. I hate it.'
outputs = classifier(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.999743 |
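
The log message above recommends naming the model and revision explicitly rather than relying on the task default. A minimal sketch of that, reusing the model name and revision printed in the log; `PINNED_CLASSIFIER` and `build_classifier` are hypothetical names chosen for illustration.

```python
# Model name and revision copied from the log message above.
PINNED_CLASSIFIER = {
    "task": "text-classification",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so the settings can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline(**PINNED_CLASSIFIER)
```

Calling `build_classifier()` should then give the same model as the default did in this run, and it keeps doing so if the library's default changes later.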


### Named Entity Recognition

In [6]:
ner_tagger = pipeline('ner', aggregation_strategy='simple')
ner_text = 'Hello. My name is Luke Skywalker. My birthday is August 10, 2000. I live in Paris, France. I work for a company called Google.'
outputs = ner_tagger(ner_text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.993974 | Luke Skywalker | 18 | 32 |
| 1 | LOC | 0.999145 | Paris | 76 | 81 |
| 2 | LOC | 0.999771 | France | 83 | 89 |
| 3 | ORG | 0.998361 | Google | 119 | 125 |
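
The `start` and `end` columns are character offsets into the input string, so each entity can be recovered by slicing. The sketch below hard-codes the rows from the output above (scores omitted) and groups the spans by entity type; `group_entities` is a hypothetical helper.

```python
from collections import defaultdict

ner_text = ('Hello. My name is Luke Skywalker. My birthday is August 10, 2000. '
            'I live in Paris, France. I work for a company called Google.')

# Rows copied from the output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Luke Skywalker", "start": 18, "end": 32},
    {"entity_group": "LOC", "word": "Paris", "start": 76, "end": 81},
    {"entity_group": "LOC", "word": "France", "start": 83, "end": 89},
    {"entity_group": "ORG", "word": "Google", "start": 119, "end": 125},
]


def group_entities(text, entities):
    """Map each entity type to the text spans found at its offsets."""
    grouped = defaultdict(list)
    for ent in entities:
        # Slice the original text rather than trusting the 'word' field.
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


print(group_entities(ner_text, entities))
# {'PER': ['Luke Skywalker'], 'LOC': ['Paris', 'France'], 'ORG': ['Google']}
```

Note that the date in the input is not tagged. That is probably because the CoNLL-03 label set the model was fine-tuned on covers only persons, locations, organizations, and miscellaneous entities.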


### Question Answering

In [3]:
reader = pipeline('question-answering')
question = 'What does the customer want?'
qa_text = 'I have some questions about the 2020 JEEP Grand Cherokee. Does the basic model come with heated seats and a heated steering wheel?'
outputs = reader(question=question, context=qa_text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.309721 | 89 | 101 | heated seats |
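
The answer is an extracted span of the context, and the score here is fairly low (about 0.31). One way to handle that, sketched below, is to accept an answer only above a threshold. The `accept_answer` helper and its `min_score` default are assumptions for illustration, not recommended values.

```python
qa_text = ('I have some questions about the 2020 JEEP Grand Cherokee. '
           'Does the basic model come with heated seats and a heated steering wheel?')

# Result copied from the output above.
result = {"score": 0.309721, "start": 89, "end": 101, "answer": "heated seats"}


def accept_answer(context, result, min_score=0.5):
    """Return the answer span if its score clears min_score, else None."""
    span = context[result["start"]:result["end"]]
    return span if result["score"] >= min_score else None


print(accept_answer(qa_text, result))                 # None
print(accept_answer(qa_text, result, min_score=0.3))  # heated seats
```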


### Summarization

In [7]:
summarizer = pipeline('summarization')
summ_text = "The Nicomachean Ethics is among Aristotle's best-known works on ethics: the science of the good for human life, that which is the goal or end at which all our actions aim.[1]: I.2  It consists of ten sections, referred to as books, and is closely related to Aristotle's Eudemian Ethics."
outputs = summarizer(summ_text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your min_length=56 must be inferior than your max_length=45.


 Nicomachean Ethics is one of Aristotle's best-known works on ethics. It consists of ten sections, referred to as books, and is closely related to Aristotle's Eudemian Ethics. It
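
The warning about `min_length=56` appears because the model's default minimum length is larger than the `max_length=45` passed in the call. A minimal sketch of a wrapper that always passes a consistent pair of bounds follows. `summarize` is a hypothetical helper, the default of 10 for `min_length` is an arbitrary choice, and both lengths are counted in tokens rather than words.

```python
def summarize(summarizer, text, max_length=45, min_length=10):
    """Run a summarization pipeline with consistent length bounds."""
    if min_length >= max_length:
        raise ValueError("min_length must be smaller than max_length")
    outputs = summarizer(
        text,
        max_length=max_length,
        min_length=min_length,
        clean_up_tokenization_spaces=True,
    )
    # Strip the leading space the model tends to emit.
    return outputs[0]["summary_text"].strip()
```

With the objects from the cell above, this would be called as `summarize(summarizer, summ_text)`.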


### Translation

In [3]:
translator = pipeline('translation_en_to_de', model='Helsinki-NLP/opus-mt-en-de')
en_text = 'Hello. Where is the bathroom please?'
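# Note: min_length=100 requires at least 100 output tokens, which is likely why
# the translation below is padded with extra text and a long run of periods.
outputs = translator(en_text, clean_up_tokenization_spaces=True, min_length=100)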
print(outputs[0]['translation_text'])

Hallo, wo ist das Badezimmer bitte? - Ich weiß nicht, wo es ist........................................................................................................................................................................................................................................................
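
The padded output above most likely comes from `min_length=100`, which keeps the model generating long after the sentence is translated. The sketch below leaves that argument out and translates several sentences in one call. `translate_all` is a hypothetical helper, and it assumes the pipeline returns one result dict per input when given a list of strings.

```python
def translate_all(translator, texts):
    """Translate a list of strings without forcing a minimum length."""
    outputs = translator(list(texts), clean_up_tokenization_spaces=True)
    return [out["translation_text"] for out in outputs]
```

With the pipeline from the cell above, this would be called as `translate_all(translator, [en_text])`.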


### Text Generation

In [6]:
generator = pipeline('text-generation')
text = 'Dear Amazon, last week I ordered an Optimus Prime action figure from your online store. Unfortunately when I opened it, I discovered to my horror that it was a Megatron action figure instead. Can I please exchange it for the action figure that I ordered?'
response = 'Dear Customer, I am sorry to hear that your order was mixed up.'
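# Note: `text` (the customer's message) is not added to the prompt here,
# so the model only sees the heading and the opening of the response.
prompt = "\n\nCustomer service response:\n" + response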
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




Customer service response:
Dear Customer, I am sorry to hear that your order was mixed up. We've received the problem and there is nothing we can do about it. Please contact us soon to investigate. Thank you again for your patience and please get back as soon as possible.

Customer service response:

Hello, I found a good-quality pair of scissors and a pair of scissors for an upcoming haircut at an online shop. The items were nicely priced, and when I picked up the scissors they were cut with razor sharpness. A lot of compliments have come to me from customers who have purchased this product. I find the quality of the scissors extremely high. We also want to commend the service by the customer that sent us the scissors. We sent them promptly, and they arrived as promised right on time! Thank you for your patience.

Customer service response:

Thank-you that your order arrived very promptly. I've worked so hard recently with
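
In the cell above, `text` is defined but never added to `prompt`, which may explain why the generated replies drift to unrelated topics such as scissors. The sketch below shows one way to include the customer's message in the prompt. `build_prompt` is a hypothetical helper, and the output shown earlier was not produced with it.

```python
text = ('Dear Amazon, last week I ordered an Optimus Prime action figure from your '
        'online store. Unfortunately when I opened it, I discovered to my horror '
        'that it was a Megatron action figure instead. Can I please exchange it '
        'for the action figure that I ordered?')
response = 'Dear Customer, I am sorry to hear that your order was mixed up.'


def build_prompt(customer_message, response_opener):
    """Place the customer's message before the response heading."""
    return customer_message + "\n\nCustomer service response:\n" + response_opener


prompt = build_prompt(text, response)
print(prompt)
```

Since the prompt is now longer, a limit on the continuation alone (for example `max_new_tokens` instead of `max_length`) may be easier to reason about.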
