# How can I leverage State-of-the-Art Natural Language Models with only one line of code?



Introduced in transformers v2.3.0, `pipelines` provide a high-level, easy-to-use API for running inference on a variety of downstream tasks, including:

* **Sentence Classification (Sentiment Analysis):** Indicates whether the overall sentence is positive or negative, i.e. a binary classification task.
* **Token Classification (Named Entity Recognition, Part-of-Speech tagging):** Assigns a label to each sub-entity (token) in the input, i.e. a per-token classification task.
* **Question-Answering:** Given a tuple (`question`, `context`), the model finds the span of text in `context` that answers the `question`.
* **Mask-Filling:** Suggests possible word(s) to fill the masked position in the input, given the provided `context`.
* **Summarization:** Condenses the `input` article into a shorter text.
* **Feature Extraction:** Maps the input to a multi-dimensional vector space learned from the data.
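
Each task in the list above is selected by passing a string identifier to `pipeline()`. The table below is a sketch using the identifiers commonly accepted by the library; `TASK_IDENTIFIERS` and `build_pipeline` are hypothetical names introduced here for illustration, and the exact set of supported identifiers may vary between library versions.

```python
# Hypothetical lookup table: task name from the list above -> pipeline identifier.
TASK_IDENTIFIERS = {
    "Sentence Classification": "sentiment-analysis",
    "Token Classification": "ner",
    "Question-Answering": "question-answering",
    "Mask-Filling": "fill-mask",
    "Summarization": "summarization",
    "Feature Extraction": "feature-extraction",
}


def build_pipeline(task_name, **kwargs):
    """Create a pipeline for one of the tasks in TASK_IDENTIFIERS."""
    # Imported lazily so the table can be inspected without loading the library.
    from transformers import pipeline

    return pipeline(TASK_IDENTIFIERS[task_name], **kwargs)
```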

Pipelines encapsulate the overall process of every NLP task:

1.  **Tokenization:** Splits the initial input into multiple sub-entities with ... properties (i.e. tokens).
2.  **Inference:** Maps every token into a more meaningful representation.
3.  **Decoding:** Uses the above representation to generate and/or extract the final output for the underlying task.
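
The three stages above can be sketched by hand for a sequence-classification model. This is a minimal sketch, not the library's internal implementation: `decode_logits` and `classify_manually` are hypothetical names, PyTorch is assumed to be installed, and the real pipeline also handles batching, truncation and device placement, which are omitted here.

```python
import math


def decode_logits(logits, id2label):
    """Decoding step: softmax over the raw scores, then pick the best label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def classify_manually(text, model_name):
    """Run tokenization, inference and decoding explicitly for one sentence."""
    # Heavy imports are kept local so decode_logits can be used on its own.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt")        # 1. tokenization
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()      # 2. inference
    return decode_logits(logits, model.config.id2label)  # 3. decoding
```

Calling `classify_manually(text, model_name)` with a checkpoint fine-tuned for classification returns a dictionary with a `label` and a `score`, the same shape a classification pipeline produces for each input.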

The overall API is exposed to the end user through the `pipeline()` function, with the following structure:

"""
from transformers import pipeline

# Using default model and tokenizer for the task
pipeline("<task-name>")

# Using a user-specified model
pipeline("<task-name>", model="<model_name>")

# Using custom model/tokenizer as str
pipeline('<task-name>', model='<model_name>', tokenizer='<tokenizer_name>')
"""


## Pre-trained models are available at https://huggingface.co/models

In [2]:
!pip install -q transformers
from transformers import pipeline

## 1. Sentence Classification - Sentiment Analysis

### `pipeline('sentiment-analysis')`
* This call uses the Hugging Face Transformers library to create a sentiment analysis pipeline. A pipeline is an abstraction that makes it easy to use pretrained models for common tasks.

* `'sentiment-analysis'` is the task name; it tells the library to load a model that classifies input text as having positive or negative sentiment.

* Under the hood, this downloads and loads a pretrained model suitable for the task (by default, `distilbert-base-uncased-finetuned-sst-2-english`).
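
Relying on the default checkpoint means the loaded model can change between library versions. A sketch of pinning it explicitly is shown here, assuming the checkpoint name and revision that this notebook's own load log reports; `build_sentiment_pipeline` and the two constants are hypothetical names.

```python
# Checkpoint and revision as reported in this notebook's load log.
SENTIMENT_MODEL = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
SENTIMENT_REVISION = "714eb0f"


def build_sentiment_pipeline():
    """Create a sentiment-analysis pipeline with an explicit model and revision."""
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=SENTIMENT_MODEL,
        revision=SENTIMENT_REVISION,
    )
```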

In [2]:
# Initialize the sentiment analysis pipeline
nlp_sentence_classif = pipeline('sentiment-analysis')




No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Device set to use cpu


In [3]:
nlp_sentence_classif.model  # inspect the architecture of the loaded model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [4]:

#Feed the text for classification
nlp_sentence_classif('Such a nice weather outside !')


[{'label': 'POSITIVE', 'score': 0.9997655749320984}]

In [5]:

#Feed the text for classification
nlp_sentence_classif('That was a beautiful movie.')

[{'label': 'POSITIVE', 'score': 0.9998801946640015}]

In [6]:
nlp_sentence_classif('i dint like the movie horrible.')

[{'label': 'NEGATIVE', 'score': 0.9997653365135193}]
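
Each call returns a list with one dictionary per input, holding a `label` and a `score`. When several sentences need to be compared, it can help to fold both fields into a single signed number. The helper below is a hypothetical convenience, not part of the library; the sample predictions are copied from the outputs above.

```python
def signed_score(prediction):
    """Map a prediction to [-1, 1]: POSITIVE keeps its score, anything else flips sign."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Predictions copied from the cells above.
predictions = [
    {"label": "POSITIVE", "score": 0.9997655749320984},
    {"label": "POSITIVE", "score": 0.9998801946640015},
    {"label": "NEGATIVE", "score": 0.9997653365135193},
]

for prediction in predictions:
    print(f"{signed_score(prediction):+.4f}")
```

The pipeline also accepts a list of strings, so several sentences can be classified in one call and the results passed through `signed_score` one by one.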

## 2. Named Entity Recognition

In [3]:
#Initialize the pipeline for NER
nlp_token_class = pipeline('ner')



No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [None]:
#nlp_token_class.model

In [4]:


#Feed the text for NER
nlp_token_class('Hugging Face is a French company based in New-York.')

[{'entity': 'I-ORG',
  'score': 0.9970939,
  'index': 1,
  'word': 'Hu',
  'start': 0,
  'end': 2},
 {'entity': 'I-ORG',
  'score': 0.934575,
  'index': 2,
  'word': '##gging',
  'start': 2,
  'end': 7},
 {'entity': 'I-ORG',
  'score': 0.97870606,
  'index': 3,
  'word': 'Face',
  'start': 8,
  'end': 12},
 {'entity': 'I-MISC',
  'score': 0.9981997,
  'index': 6,
  'word': 'French',
  'start': 18,
  'end': 24},
 {'entity': 'I-LOC',
  'score': 0.9983047,
  'index': 10,
  'word': 'New',
  'start': 42,
  'end': 45},
 {'entity': 'I-LOC',
  'score': 0.89134514,
  'index': 11,
  'word': '-',
  'start': 45,
  'end': 46},
 {'entity': 'I-LOC',
  'score': 0.99795234,
  'index': 12,
  'word': 'York',
  'start': 46,
  'end': 50}]
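
The output above has one entry per word piece, so `Hugging Face` is split into `Hu`, `##gging` and `Face`. A minimal sketch of merging consecutive pieces of the same entity type back into spans follows, using the `start` and `end` offsets to slice the original text. `group_entities` is a hypothetical helper; it treats any run of adjacent tokens with the same type as one entity, which is a simplification. Recent versions of the library also offer an `aggregation_strategy` argument on the NER pipeline for this purpose.

```python
def group_entities(text, tokens):
    """Merge consecutive word pieces that share an entity type into one span."""
    groups = []
    for tok in tokens:
        entity_type = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        adjacent = last is not None and tok["index"] == last["last_index"] + 1
        if adjacent and last["entity_group"] == entity_type:
            # Extend the current span to cover this word piece.
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
            last["scores"].append(tok["score"])
        else:
            groups.append({
                "entity_group": entity_type,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
                "scores": [tok["score"]],
            })
    return [
        {
            "entity_group": g["entity_group"],
            "word": text[g["start"]:g["end"]],
            "score": sum(g["scores"]) / len(g["scores"]),  # mean piece score
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]


# Input sentence and token-level output copied from the cells above.
text = "Hugging Face is a French company based in New-York."
tokens = [
    {"entity": "I-ORG", "score": 0.9970939, "index": 1, "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "score": 0.934575, "index": 2, "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "score": 0.97870606, "index": 3, "word": "Face", "start": 8, "end": 12},
    {"entity": "I-MISC", "score": 0.9981997, "index": 6, "word": "French", "start": 18, "end": 24},
    {"entity": "I-LOC", "score": 0.9983047, "index": 10, "word": "New", "start": 42, "end": 45},
    {"entity": "I-LOC", "score": 0.89134514, "index": 11, "word": "-", "start": 45, "end": 46},
    {"entity": "I-LOC", "score": 0.99795234, "index": 12, "word": "York", "start": 46, "end": 50},
]

for entity in group_entities(text, tokens):
    print(entity["entity_group"], entity["word"])
```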

## 3. Question Answering

In [5]:
#Initialize pipeline for Question Answering
nlp_qa = pipeline('question-answering', model='twmkn9/bert-base-uncased-squad2')

Some weights of the model checkpoint at twmkn9/bert-base-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


In [6]:
# Provide a paragraph as context and ask a question about it
paragraph = 'Hugging Face is a French company based in New-York.'
question = 'Where is Hugging Face based?'
nlp_qa(context=paragraph, question=question)

{'score': 0.9884894490242004, 'start': 42, 'end': 50, 'answer': 'New-York'}
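
The `start` and `end` fields in the result are character offsets into the context, so the answer can be recovered by slicing. The sketch below checks this against the output above and shows a little surrounding text; `answer_with_context` and its `window` parameter are hypothetical names introduced for illustration.

```python
# Context and result copied from the cell above.
paragraph = "Hugging Face is a French company based in New-York."
result = {"score": 0.9884894490242004, "start": 42, "end": 50, "answer": "New-York"}


def answer_with_context(context, result, window=15):
    """Return the answer with up to `window` characters of context on each side."""
    left = max(0, result["start"] - window)
    right = min(len(context), result["end"] + window)
    return context[left:right]


# The offsets index directly into the original context string.
assert paragraph[result["start"]:result["end"]] == result["answer"]
print(answer_with_context(paragraph, result))
```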

In [7]:
article = """Google LLC is an American multinational technology company that specializes in Internet-related
services and products, which include online advertising technologies, search engine,
cloud computing, software, and hardware. It is considered one of the Big Four technology companies,
alongside Amazon, Apple, and Facebook. Google was founded in September 1998 by Larry Page and
Sergey Brin while they were Ph.D. students at Stanford University in California. Together they
own about 14 percent of its shares and control 56 percent of the stockholder voting power
through supervoting stock. They incorporated Google as a California privately held company
on September 4, 1998, in California. Google was then reincorporated in Delaware on October
22, 2002. An initial public offering (IPO) took place on August 19, 2004, and Google moved to
its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015,
Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc.
Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's
Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became
the CEO of Alphabet.""".replace('\n', '')

In [8]:

nlp_qa(context=article, question='Who is the CEO of Google?')

{'score': 0.9405813813209534,
 'start': 1131,
 'end': 1144,
 'answer': 'Sundar Pichai'}

In [9]:
nlp_qa(context=article, question='When did google start its operation?')

{'score': 0.8834809064865112,
 'start': 339,
 'end': 353,
 'answer': 'September 1998'}

In [10]:
nlp_qa(context=article, question='Who is the founder of google?')

{'score': 0.9977760910987854,
 'start': 357,
 'end': 382,
 'answer': 'Larry Page andSergey Brin'}
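
The answer above reads `Larry Page andSergey Brin` because `article` was built with `.replace('\n', '')`, which deletes each line break without leaving a space between the words on either side. One possible alternative, sketched here with a hypothetical helper, collapses every run of whitespace into a single space. The `start` and `end` offsets reported in this notebook were computed on the original string and would shift if the context were rebuilt this way.

```python
def normalize_whitespace(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


# Two lines taken from the article, with the line break preserved.
sample = """Google was founded in September 1998 by Larry Page and
Sergey Brin while they were Ph.D. students at Stanford University in California."""

print(normalize_whitespace(sample))
```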

In [11]:
nlp_qa(context=article, question='Where is google office located?')

{'score': 0.9604061245918274,
 'start': 847,
 'end': 872,
 'answer': 'Mountain View, California'}
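
The cells above repeat the same call with a different question each time. A small loop can collect the answers in one place; in this sketch `ask_all` is a hypothetical helper and the `min_score` threshold of 0.5 is an arbitrary assumption, not a value recommended by the library.

```python
def ask_all(qa, context, questions, min_score=0.5):
    """Ask each question against one context; keep answers scoring at least min_score."""
    answers = {}
    for question in questions:
        result = qa(context=context, question=question)
        # Low-confidence answers are recorded as None rather than dropped.
        answers[question] = result["answer"] if result["score"] >= min_score else None
    return answers
```

With the objects defined in this notebook, a call would look like `ask_all(nlp_qa, article, ['Who is the CEO of Google?', 'Where is google office located?'])`.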