### Text Generation (a.k.a. Language Modeling)

In [8]:
# Import libraries
from transformers import pipeline

# Specify the model
model = "gpt2"

# Specify the task
task = "text-generation"

# Instantiate pipeline
generator = pipeline(model = model, task = task, max_new_tokens = 30)

# Specify input text
input_text = "If you are interested in learing more about data science, I can teach you how to"

# Perform text generation and store the results
output = generator(input_text)

# Return the results
output

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "If you are interested in learing more about data science, I can teach you how to get started with Linux without installing any additional dependencies (i.e. a separate tool for windows so you don't have to run into a Linux issue"}]
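
The pipeline returns a list with one dictionary per generated sequence, and `generated_text` repeats the prompt before the newly generated tokens. The sketch below shows one way to keep only the continuation. It reuses the recorded output above as literals so it runs without loading a model; the variable name `continuation` is introduced here for illustration.

```python
# Prompt and pipeline output copied verbatim from the cell above
input_text = "If you are interested in learing more about data science, I can teach you how to"
output = [{'generated_text': "If you are interested in learing more about data science, I can teach you how to get started with Linux without installing any additional dependencies (i.e. a separate tool for windows so you don't have to run into a Linux issue"}]

# The result is a list with one dict per generated sequence
generated = output[0]['generated_text']

# The generated text starts with the prompt, so slice it off to keep only the new text
if generated.startswith(input_text):
    continuation = generated[len(input_text):]
else:
    continuation = generated

print(continuation)
```

The text-generation pipeline also accepts `return_full_text = False`, which should have a similar effect without the manual slicing.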

### Question Answering

#### There are generally two types of question answering tasks:

##### Extractive (i.e. context-dependent):
The user provides a context passage together with the question and asks the model to answer from that information. The model picks the relevant span of the provided context and returns it as the answer.

##### Abstractive (i.e. context-independent):
The user asks the model a question without providing any context, so the model has to generate the answer itself rather than extract it from a given passage.

In [3]:
# Specify model
model = 'distilbert-base-cased-distilled-squad'
# Instantiate pipeline
answerer = pipeline(model = model, task="question-answering")

# Specify question and context
question = "What does NLP stand for?"
context = "Today we are talking about machine learning and specifically the natural language processing, which enables computers to understand, process and generate languages"

# Generate predictions
preds = answerer(
    question = question,
    context = context,
)

# Return results
print(
    f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}"
)

score: 0.3341, start: 65, end: 92, answer: natural language processing
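
The `start` and `end` values are character offsets into `context`, which is what makes this model extractive: the answer is a slice of the input rather than newly generated text. A minimal sketch, using the context and the recorded result above as literals (the score is the rounded value that was printed):

```python
# Context and recorded prediction copied from the cell above
context = "Today we are talking about machine learning and specifically the natural language processing, which enables computers to understand, process and generate languages"
preds = {'score': 0.3341, 'start': 65, 'end': 92, 'answer': 'natural language processing'}

# Slicing the context with the returned offsets recovers the answer text
span = context[preds['start']:preds['end']]

print(span)
print(span == preds['answer'])
```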


In [4]:
# Specify model
model = "deepset/roberta-base-squad2"
# Specify task
task = "question-answering"

# Instantiate pipeline
answerer = pipeline(task = task, model = model, tokenizer = model)

# Specify input
qa_input = {
    'question': 'What does NLP stand for?',
    'context': 'Today we are talking about machine learning and specifically the natural language processing, which enables computers to understand, process and generate languages'
}

# Generate predictions
output = answerer(qa_input)

# Return results
output

{'score': 0.04662749171257019,
 'start': 65,
 'end': 92,
 'answer': 'natural language processing'}
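
Both models select the same span, but with very different scores (about 0.33 versus about 0.05). One common way to use the score is to discard answers below a cutoff. The sketch below illustrates that idea on the two recorded results; the `accept` helper and the `MIN_SCORE` value of 0.1 are hypothetical choices made for this example, and scores from different models are not necessarily comparable on the same scale.

```python
# Recorded results for the same question and context, copied from the two cells above
results = {
    'distilbert-base-cased-distilled-squad': {'score': 0.3341, 'answer': 'natural language processing'},
    'deepset/roberta-base-squad2': {'score': 0.04662749171257019, 'answer': 'natural language processing'},
}

# Hypothetical cutoff, chosen only for illustration
MIN_SCORE = 0.1

def accept(pred, min_score=MIN_SCORE):
    """Return the answer if its score reaches the cutoff, otherwise None."""
    return pred['answer'] if pred['score'] >= min_score else None

for name, pred in results.items():
    print(name, '->', accept(pred))
```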

### Sentiment Analysis

In [14]:
# Specify pre-trained model to use
model = 'distilbert-base-uncased-finetuned-sst-2-english'
# Specify task
task = 'sentiment-analysis'

# Text to be analyzed
input_text = 'Performing NLP tasks using HuggingFace pipeline is super easy!'

# Instantiate pipeline
analyzer = pipeline(task, model = model)

# Store the output of the analysis
output = analyzer(input_text)

# Return output
output

[{'label': 'POSITIVE', 'score': 0.8548845052719116}]

### Text Classification (Zero-Shot)

In [22]:
# Specify model
model = 'facebook/bart-large-mnli'
# Specify Task
task = 'zero-shot-classification'

# Specify input text
input_text = 'This is a tutorial about using pre-trained models through HuggingFace'

# Identify the classes/categories/labels
labels = ['business', 'sports', 'education', 'politics', 'music']
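
# Instantiate pipeline once (device = 0 selects the first GPU; drop the argument if no GPU is available)
classifier = pipeline(task, model = model, device = 0)

# Run the classification five times; the scores are identical on every run
for i in range(5):
    # Store the output of the analysis
    output = classifier(input_text, candidate_labels = labels)

    # Return output
    print(i, output)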

0 {'sequence': 'This is a tutorial about using pre-trained models through HuggingFace', 'labels': ['education', 'business', 'music', 'sports', 'politics'], 'scores': [0.40113699436187744, 0.2170693427324295, 0.14547252655029297, 0.14507530629634857, 0.09124579280614853]}
1 {'sequence': 'This is a tutorial about using pre-trained models through HuggingFace', 'labels': ['education', 'business', 'music', 'sports', 'politics'], 'scores': [0.40113699436187744, 0.2170693427324295, 0.14547252655029297, 0.14507530629634857, 0.09124579280614853]}
2 {'sequence': 'This is a tutorial about using pre-trained models through HuggingFace', 'labels': ['education', 'business', 'music', 'sports', 'politics'], 'scores': [0.40113699436187744, 0.2170693427324295, 0.14547252655029297, 0.14507530629634857, 0.09124579280614853]}
3 {'sequence': 'This is a tutorial about using pre-trained models through HuggingFace', 'labels': ['education', 'business', 'music', 'sports', 'politics'], 'scores': [0.401136994361877
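
The result pairs each candidate label with a score, and in the runs above the five scores add up to roughly 1. A short sketch of how the output might be unpacked, using the first recorded result as a literal (the names `ranked`, `top_label`, and `total` are introduced here):

```python
# First recorded result copied from the output above
output = {
    'sequence': 'This is a tutorial about using pre-trained models through HuggingFace',
    'labels': ['education', 'business', 'music', 'sports', 'politics'],
    'scores': [0.40113699436187744, 0.2170693427324295, 0.14547252655029297,
               0.14507530629634857, 0.09124579280614853],
}

# Pair every label with its score
ranked = list(zip(output['labels'], output['scores']))

# Pick the label with the highest score
top_label, top_score = max(ranked, key=lambda pair: pair[1])

# Sum of all scores for this input
total = sum(output['scores'])

print(top_label, round(top_score, 4))
print(round(total, 4))
```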

### Text Summarization

In [24]:
# Specify model and tokenizer
model = "t5-base"
tokenizer = "t5-base"
# Specify task
task = "summarization"

# Specify input text
input_text = "Text summarization is the task of automatically summarizing textual input, while still conveying the main points and gist of the incoming text. One example of the business intuition behind the need for such summarization models is the situations where humans read incoming text communications (e.g. customer emails) and using a summarization model can save human time. "

# Instantiate pipeline
summarizer = pipeline(task = task, model = model, tokenizer = tokenizer)

# Summarize and store results
output = summarizer(input_text)

# Return output
output

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Your max_length is set to 200, but you input_length is only 82. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)


[{'summary_text': 'text summarization is the task of automatically summarizing textual input . using a model can save human time in situations where humans read incoming text .'}]
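
The warning above reports an input length of 82 tokens against a `max_length` of 200 and suggests `max_length=41`, which is half the input length. The sketch below turns that rule of thumb into a small helper. `suggest_max_length`, its `ratio`, and its `floor` are hypothetical names and defaults chosen for illustration, not part of the library.

```python
def suggest_max_length(input_length, ratio=0.5, floor=10):
    """Hypothetical helper: cap the summary at a fraction of the input token count."""
    return max(floor, int(input_length * ratio))

# Token count reported in the warning above
input_length = 82

max_length = suggest_max_length(input_length)
print(max_length)
```

The resulting value could then be passed along with the input, in the form the warning itself suggests: `summarizer(input_text, max_length = max_length)`.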

### Machine Translation

In [28]:
# Specify prefix
original_language = 'English'
target_language = 'French'
prefix = f"translate {original_language} to {target_language}: "
# Specify input text
input_text = f"{prefix}This is a post on Medium about various NLP tasks using Hugging Face."

# Specify model
model = "t5-base"

# Specify task
task = "translation"

# Instantiate pipeline
translator = pipeline(task = task, model = model)

# Perform translation and store the output
output = translator(input_text)

# Return output
output

[{'translation_text': "Il s'agit d'un poste sur Medium sur diverses tâches de NLP utilisant Hugging Face."}]