# Introduction to the Hugging Face Transformers Library

### Introduction to Hugging Face
Hugging Face maintains Transformers, an open-source library that provides state-of-the-art pre-trained models and tools for a wide range of Natural Language Processing (NLP) tasks. The library works with both PyTorch and TensorFlow and exposes a unified API across model architectures. It is widely used in the NLP community for research and development.



### Installing Transformers

To use the models in this notebook, install the `transformers` library. In a Colab notebook, uncomment and run one of the following commands:

In [3]:
# Full install used in this notebook: datasets, SentencePiece support, and TensorFlow
#%pip install datasets transformers[sentencepiece] tensorflow

# Minimal install
#%pip install transformers


Because no version is pinned, pip installs the latest released version of each package.
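
To confirm that the install worked, it can help to check which versions ended up in the environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is an illustration and not part of Transformers:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the packages this notebook relies on
for name in ("transformers", "datasets", "tensorflow"):
    print(name, installed_version(name))
```

A result of `None` for a package means the corresponding install command still needs to be run.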

### Pipelines
Pipelines are the simplest way to use pre-trained models for NLP tasks such as text classification, question answering, and text generation. A pipeline bundles preprocessing, the model forward pass, and postprocessing behind a single high-level call, so very little code is needed.


### Using the pipelines
To use a pre-trained model for a specific NLP task, create a pipeline object and specify the task you want to perform. Here is an example that uses the text generation pipeline:

In [1]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In the galaxy far far")

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In the galaxy far far, far away, there had a strange entity, a new God.\n\nCaught in an eternal limbo, the God had gone from being a cosmic being to a super-god, an even greater than himself. That'}]

## Specifying a custom model in the pipeline

In [2]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In the galaxy far far",
    max_length=30,
    num_return_sequences=2,
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In the galaxy far far away is not a perfect model of the galaxy, but for some things it will be quite interesting if astronomers could find out that'},
 {'generated_text': "In the galaxy far far enough away, it's in a way that makes it more interesting than anything I've ever seen. It creates a place for"}]

## Available Pipelines


- AudioClassificationPipeline
- AutomaticSpeechRecognitionPipeline
- ConversationalPipeline
- FeatureExtractionPipeline
- FillMaskPipeline
- ImageClassificationPipeline
- ImageSegmentationPipeline
- ObjectDetectionPipeline
- QuestionAnsweringPipeline
- SummarizationPipeline
- TableQuestionAnsweringPipeline
- TextClassificationPipeline
- TextGenerationPipeline
- Text2TextGenerationPipeline
- TokenClassificationPipeline
- TranslationPipeline
- VisualQuestionAnsweringPipeline
- ZeroShotClassificationPipeline
- ZeroShotImageClassificationPipeline
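
Each of these classes is normally selected by passing a task string to `pipeline(...)` rather than by instantiating the class directly. The sketch below maps the task strings used in this notebook to the corresponding classes from the list above. The dictionary and the helper `pipeline_class_for` are illustrative only and cover just a small subset of the tasks the library supports:

```python
# Task strings used in this notebook and the pipeline class each one selects.
# "sentiment-analysis" is an alias of the "text-classification" task.
TASK_TO_PIPELINE = {
    "text-generation": "TextGenerationPipeline",
    "fill-mask": "FillMaskPipeline",
    "text-classification": "TextClassificationPipeline",
    "sentiment-analysis": "TextClassificationPipeline",
}


def pipeline_class_for(task):
    """Look up the pipeline class name for a task string (hypothetical helper)."""
    try:
        return TASK_TO_PIPELINE[task]
    except KeyError:
        known = ", ".join(sorted(TASK_TO_PIPELINE))
        raise KeyError(f"Task {task!r} is not in this sketch; known tasks: {known}")


print(pipeline_class_for("sentiment-analysis"))
```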

## Another example

In [3]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("You are going to <mask> about a wonderful library today.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.5679112672805786,
  'token': 1798,
  'token_str': ' hear',
  'sequence': 'You are going to hear about a wonderful library today.'},
 {'score': 0.2281838059425354,
  'token': 1532,
  'token_str': ' learn',
  'sequence': 'You are going to learn about a wonderful library today.'}]
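
The fill-mask pipeline returns a plain list of dictionaries, so ordinary Python is enough to post-process it. As a small sketch, the snippet below copies the output shown above into a variable (the names `results`, `best`, and `candidates` are chosen here for illustration) and picks the highest-scoring candidate:

```python
# Output copied from the fill-mask cell above
results = [
    {"score": 0.5679112672805786, "token": 1798, "token_str": " hear",
     "sequence": "You are going to hear about a wonderful library today."},
    {"score": 0.2281838059425354, "token": 1532, "token_str": " learn",
     "sequence": "You are going to learn about a wonderful library today."},
]

# Highest-scoring candidate for the masked position
best = max(results, key=lambda r: r["score"])
print(best["token_str"].strip(), round(best["score"], 3))

# (word, score) pairs with the leading space of each token removed
candidates = [(r["token_str"].strip(), r["score"]) for r in results]
print(candidates)
```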

# Looking inside the pipeline with the TensorFlow API

In [4]:
from transformers import pipeline

input_sentences = [
    "I don't like this movie",
    "Upgrad is helping me learn new and wonderful things.",
]
classifier = pipeline("sentiment-analysis")
classifier(input_sentences)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9839025139808655},
 {'label': 'POSITIVE', 'score': 0.9998325109481812}]

## Tokenizing the input sentences

In [5]:
from transformers import AutoTokenizer

model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model)

In [6]:

inputs = tokenizer(input_sentences, padding=True, truncation=True, max_length=12, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101,  1045,  2123,  1005,  1056,  2066,  2023,  3185,   102,
            0,     0,     0],
       [  101,  2039, 16307,  2003,  5094,  2033,  4553,  2047,  1998,
         6919,  2477,   102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}


## Classifying the input sentences into positive and negative sentiments

In [7]:
from transformers import TFAutoModelForSequenceClassification

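checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the classification model from the same checkpoint as the tokenizer
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
# Forward pass on the tokenized batch; the result holds one row of logits per sentence
outputs = model(inputs)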

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [8]:
print(outputs.logits.shape)

(2, 2)


In [9]:
print(outputs.logits)

tf.Tensor(
[[ 2.2426078 -1.8702557]
 [-4.1284223  4.4328403]], shape=(2, 2), dtype=float32)


In [10]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[9.8390251e-01 1.6097488e-02]
 [1.9134098e-04 9.9980873e-01]], shape=(2, 2), dtype=float32)
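
To make the last step concrete, the softmax and the mapping from class index to label can be reproduced without TensorFlow. This is a minimal sketch using the logits printed above. The `ID2LABEL` mapping is an assumption chosen to agree with the pipeline output earlier (NEGATIVE for the first sentence, POSITIVE for the second); with a loaded model, the configuration typically exposes this mapping as `model.config.id2label`.

```python
import math

# Assumed label order, consistent with the pipeline output shown earlier
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above
logits = [[2.2426078, -1.8702557], [-4.1284223, 4.4328403]]

probabilities = [softmax(row) for row in logits]
labels = [ID2LABEL[row.index(max(row))] for row in probabilities]

for label, row in zip(labels, probabilities):
    print(label, round(max(row), 4))
```

Each row of probabilities sums to 1, and the largest entry in each row is the score that the `sentiment-analysis` pipeline reports.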


# Exploring the Tokenizer

In [11]:
tokenized_text = "Learning NLP is so much rewarding".split()
print(tokenized_text)

['Learning', 'NLP', 'is', 'so', 'much', 'rewarding']


In [12]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [13]:
tokenizer("Learning NLP is so much rewarding")

{'input_ids': [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## Breaking the tokenizer functions down

In [15]:
tokens = tokenizer.tokenize("Learning NLP is so much rewarding")
print(tokens)

['Learning', 'NL', '##P', 'is', 'so', 'much', 'reward', '##ing']
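
The `##` prefix marks a piece that continues the previous piece within the same word. BERT's WordPiece tokenizer splits each word greedily, always taking the longest vocabulary entry that matches at the current position. The following is a simplified sketch of that idea with a tiny hypothetical vocabulary (`TOY_VOCAB`) that contains only the pieces seen above; the real tokenizer also handles punctuation, casing options, and a vocabulary of tens of thousands of entries.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, limited to the pieces shown in the output above
TOY_VOCAB = {"Learning", "NL", "##P", "is", "so", "much", "reward", "##ing"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split on whitespace, then apply WordPiece to every word."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


print(toy_tokenize("Learning NLP is so much rewarding"))
```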


In [16]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158]


## Final touch-ups

In [17]:
# tokenize() returns word pieces only; this tokenizer does not recognize
# add_special_tokens here, so the argument is omitted
tokens = tokenizer.tokenize("Learning NLP is so much rewarding")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

# Wrap the ids with the [CLS] (101) and [SEP] (102) special tokens
final_ids = tokenizer.build_inputs_with_special_tokens(ids)
print(final_ids)


['Learning', 'NL', '##P', 'is', 'so', 'much', 'reward', '##ing']
[9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158]
[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102]
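
The two steps above, looking up ids and wrapping them in special tokens, can be sketched in plain Python. The dictionary `TOY_VOCAB_IDS` below is a hypothetical excerpt that reuses the ids printed above, and the helper names are illustrative rather than part of the library:

```python
# Hypothetical excerpt of the vocabulary, using the ids printed above
TOY_VOCAB_IDS = {
    "[CLS]": 101, "[SEP]": 102,
    "Learning": 9681, "NL": 21239, "##P": 2101, "is": 1110,
    "so": 1177, "much": 1277, "reward": 10703, "##ing": 1158,
}


def tokens_to_ids(tokens, vocab=TOY_VOCAB_IDS):
    """Map each token to its vocabulary id."""
    return [vocab[token] for token in tokens]


def add_special_tokens(ids, vocab=TOY_VOCAB_IDS):
    """Wrap a single sequence as [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + list(ids) + [vocab["[SEP]"]]


tokens = ["Learning", "NL", "##P", "is", "so", "much", "reward", "##ing"]
ids = tokens_to_ids(tokens)
final_ids = add_special_tokens(ids)
print(ids)
print(final_ids)
```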


## Decoding the tokens and getting back the string

In [18]:
tokenizer.decode(final_ids)

'[CLS] Learning NLP is so much rewarding [SEP]'

In [19]:
tokenizer.decode(final_ids, skip_special_tokens=True)

'Learning NLP is so much rewarding'
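
Decoding reverses the steps above: each id is mapped back to its token, optional special tokens are dropped, and `##` pieces are glued onto the preceding piece. A rough sketch, assuming a hypothetical reverse vocabulary `ID_TO_TOKEN` built from the ids shown above (the real `decode` also cleans up spacing around punctuation):

```python
# Hypothetical reverse vocabulary, built from the ids shown above
ID_TO_TOKEN = {
    101: "[CLS]", 102: "[SEP]",
    9681: "Learning", 21239: "NL", 2101: "##P", 1110: "is",
    1177: "so", 1277: "much", 10703: "reward", 1158: "##ing",
}
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def toy_decode(ids, skip_special_tokens=False):
    """Turn ids back into text, merging '##' continuation pieces."""
    words = []
    for token_id in ids:
        token = ID_TO_TOKEN[token_id]
        if skip_special_tokens and token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]  # attach to the previous piece
        else:
            words.append(token)
    return " ".join(words)


final_ids = [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102]
print(toy_decode(final_ids))
print(toy_decode(final_ids, skip_special_tokens=True))
```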

## Handling Multiple sequences

In [20]:
tokenized_output = tokenizer(["Learning NLP is so much rewarding","Another test sentence"])

In [21]:
tokenized_output['input_ids']

[[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
 [101, 2543, 2774, 5650, 102]]

### Padding

In [22]:
sequences = ["Learning NLP is so much rewarding", "Another test sentence"]

# Pad every sequence to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
print(model_inputs)
print('\n')

# Pad every sequence to the model's maximum length (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
print(model_inputs)
print('\n')

# Pad every sequence to the specified max_length; longer sequences are left
# unchanged because truncation is not enabled
model_inputs = tokenizer(sequences, padding="max_length", max_length=6)
print(model_inputs)


{'input_ids': [[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102], [101, 2543, 2774, 5650, 102, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]}


{'input_ids': [[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102, 0, 0, 0, ... (output truncated)
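
What padding does to `input_ids` and `attention_mask` can be reproduced in a few lines of plain Python. This is a minimal sketch, not the library's implementation: the helper `pad_batch` is hypothetical, and the pad id of 0 is taken from the outputs above.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Right-pad each id list and build the matching attention mask.

    With max_length=None the target is the longest sequence in the batch.
    Sequences longer than the target are left unchanged (no truncation).
    """
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max(target - len(seq), 0)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids of the two example sentences, copied from the outputs above
batch = [
    [101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102],
    [101, 2543, 2774, 5650, 102],
]

longest = pad_batch(batch)                # like padding="longest"
fixed = pad_batch(batch, max_length=6)    # like padding="max_length", max_length=6
print(longest)
print(fixed)
```

The attention mask tells the model which positions hold real tokens, so the padding zeros do not influence the prediction.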

### Truncation

In [24]:
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
print(model_inputs)
print("\n")
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=6, truncation=True)
print(model_inputs)
print("\n")

{'input_ids': [[101, 9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158, 102], [101, 2543, 2774, 5650, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}


{'input_ids': [[101, 9681, 21239, 2101, 1110, 102], [101, 2543, 2774, 5650, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
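
Note that in the second output the truncated sequence still ends with 102, the `[SEP]` id: the word pieces are cut, and the special tokens are kept. A minimal sketch of that behavior for a single sequence, assuming the `[CLS]`/`[SEP]` ids 101 and 102 shown above and a `max_length` of at least 2 (the helper name is hypothetical):

```python
def truncate_with_special_tokens(content_ids, max_length, cls_id=101, sep_id=102):
    """Cut the content so that [CLS] + content + [SEP] fits in max_length."""
    budget = max(max_length - 2, 0)  # two slots are reserved for [CLS] and [SEP]
    return [cls_id] + list(content_ids[:budget]) + [sep_id]


# Word-piece ids of the two example sentences, without special tokens
first = [9681, 21239, 2101, 1110, 1177, 1277, 10703, 1158]
second = [2543, 2774, 5650]

print(truncate_with_special_tokens(first, max_length=6))
print(truncate_with_special_tokens(second, max_length=6))
```

The second sentence already fits within six ids, so it passes through unchanged.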




### Different Output Types

In [25]:
# sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
# model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
print(model_inputs)
print("\n")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
print(model_inputs)
print("\n")

{'input_ids': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[  101,  9681, 21239,  2101,  1110,  1177,  1277, 10703,  1158,
          102],
       [  101,  2543,  2774,  5650,   102,     0,     0,     0,     0,
            0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]], dtype=int32)>}


{'input_ids': array([[  101,  9681, 21239,  2101,  1110,  1177,  1277, 10703,  1158,
          102],
       [  101,  2543,  2774,  5650,   102,     0,     0,     0,     0,
            0]]), 'token_type_ids': array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])}




In [26]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["Learning NLP is so much rewarding","Another test sentence"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [27]:
tokens

{'input_ids': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[  101,  4083, 17953,  2361,  2003,  2061,  2172, 10377,  2075,
          102],
       [  101,  2178,  3231,  6251,   102,     0,     0,     0,     0,
            0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 10), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]], dtype=int32)>}

In [28]:
tf.math.softmax(output.logits, axis=1)

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[1.2099689e-04, 9.9987900e-01],
       [9.9222517e-01, 7.7747735e-03]], dtype=float32)>