[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1snm3MvWTR9WRlJFRVHgjNiNSmg0LitL7?usp=sharing)

# Natural Language Processing 

## What is NLP? 

The goal of NLP is not only to understand individual words but also to comprehend the context in which they appear. 

Some common NLP tasks are: 

- **Classifying whole sentences** 
  - sentiment analysis
  - spam classification
  - grammatical correctness 
  - cohesion between sentences
- **Classifying each word in a sentence**
  - named entity recognition (NER)
  - part-of-speech (POS) tagging
- **Generating text content**
  - gap fill aka cloze activity 
  - completing a prompt
- **Extracting an answer from a text**
  - Q&A based on a passage
- **Generating a new sentence from a text**
  - translation
  - summarization

## Working with pipelines

In [None]:
!pip install transformers[sentencepiece] #need sentencepiece for translations

In [2]:
from transformers import pipeline

In [3]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9516071081161499}]

In [4]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.",
     "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Current pipelines
- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification

In [5]:
# Zero-shot classification, i.e. classifying text into labels the model was not explicitly trained on

from transformers import pipeline

classifier = pipeline('zero-shot-classification')

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)
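
The cell above only builds the pipeline. As a rough sketch of the idea behind it, an NLI model such as `facebook/bart-large-mnli` can score how strongly a text entails a hypothesis built from each candidate label, and those scores are then normalized across labels. The example text, the candidate labels, the hypothesis template, the `rank_labels` helper, and above all the logit values below are illustrative assumptions, not output from the model.

```python
import math

def rank_labels(entailment_logits):
    """Softmax over per-label entailment logits, highest probability first."""
    labels = list(entailment_logits)
    peak = max(entailment_logits.values())
    exps = [math.exp(entailment_logits[label] - peak) for label in labels]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

text = "This is a course about the Transformers library"
candidate_labels = ["education", "politics", "business"]

# Each candidate label is turned into a hypothesis that an NLI model
# would score against the text (assumed template).
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis (NOT real model output).
entailment_logits = {"education": 2.0, "politics": -1.0, "business": 0.5}

ranking = rank_labels(entailment_logits)
print(hypotheses)
print(ranking)
```

With a real pipeline, the text and the candidate labels would be passed to `classifier` and the model would supply the scores.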



In [10]:
# Text Generation

from transformers import pipeline

generator = pipeline("text-generation")

generator("In this course, we will teach you how to", 
          max_length=15,
          num_return_sequences=2)


No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to find and connect your way'},
 {'generated_text': 'In this course, we will teach you how to make real life events and'}]

## Using any model from the Hub in a pipeline

In [14]:
from transformers import pipeline

generator = pipeline('text-generation', model='distilgpt2')

starter = "In this course, we will teach you how to"

generator(text_inputs=starter, 
          max_length=30, 
          num_return_sequences=2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "In this course, we will teach you how to set up your own own mobile app.\n\n\nOnce we've got the Android apps and its"},
 {'generated_text': 'In this course, we will teach you how to become better.\n\nOur students won’t even know how to make money so we�'}]

## Mask filling

In assessment terms, this is a cloze test or gap-fill exercise.

In [15]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", 
         top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)



[{'score': 0.196198508143425,
  'sequence': 'This course will teach you all about mathematical models.',
  'token': 30412,
  'token_str': ' mathematical'},
 {'score': 0.040527332574129105,
  'sequence': 'This course will teach you all about computational models.',
  'token': 38163,
  'token_str': ' computational'}]

## Named entity recognition 

Identify the named entities in a text, such as people (PER), organizations (ORG), and locations (LOC).

In [18]:
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True

ner("My name is Evan Simpson and I work at Engoo in Rome.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'end': 23,
  'entity_group': 'PER',
  'score': 0.9995202,
  'start': 11,
  'word': 'Evan Simpson'},
 {'end': 43,
  'entity_group': 'ORG',
  'score': 0.9867735,
  'start': 38,
  'word': 'Engoo'},
 {'end': 51,
  'entity_group': 'LOC',
  'score': 0.99703896,
  'start': 47,
  'word': 'Rome'}]
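
The `start` and `end` values in the output above are character offsets into the input string. As a small sketch of how they might be used, the snippet below slices the original text with those offsets and replaces each entity with its label; `tag_entities` is a hypothetical helper written for this illustration, not part of the library.

```python
text = "My name is Evan Simpson and I work at Engoo in Rome."

# Entities copied from the pipeline output above (scores omitted).
entities = [
    {"entity_group": "PER", "start": 11, "end": 23, "word": "Evan Simpson"},
    {"entity_group": "ORG", "start": 38, "end": 43, "word": "Engoo"},
    {"entity_group": "LOC", "start": 47, "end": 51, "word": "Rome"},
]

def tag_entities(text, entities):
    """Replace each entity span with its label, working right to left
    so that earlier offsets stay valid."""
    tagged = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tagged = tagged[:ent["start"]] + f"[{ent['entity_group']}]" + tagged[ent["end"]:]
    return tagged

# Slicing with the offsets recovers exactly the reported words.
spans = [text[e["start"]:e["end"]] for e in entities]
print(spans)
print(tag_entities(text, entities))
```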

In [20]:
# Question answering

qa = pipeline('question-answering')

question = "Where do I work?"

context = "My name is Evan Simpson and I work at Engoo in Rome."

qa(question=question,
   context=context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'answer': 'Engoo', 'end': 43, 'score': 0.5854694843292236, 'start': 38}

In [21]:
# Summarization

summarizer = pipeline("summarization")

text =  """
    America has changed dramatically during recent years. Not only has the number
    of graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""


summarizer(text)


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



[{'summary_text': ' America has changed dramatically during recent years . The number of graduates in traditional engineering disciplines has declined . China and India graduate six and eight times as many traditional engineers as does the United States . Rapidly developing economies such as India and Europe continue to encourage and advance the teaching of engineering .'}]

In [5]:
# Translation 
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")


[{'translation_text': 'This course is produced by Hugging Face.'}]

# [How do Transformers work?](https://huggingface.co/course/chapter1/4?fw=pt) 

Transformers can be grouped into three categories: 
- GPT-like (auto-regressive) aka **Decoder-only models**
- BERT-like (auto-encoding) aka **Encoder-only models**
- BART/T5-like (sequence to sequence) aka **Encoder-decoder models**

Key concepts:  
- Transformers are language models that have been trained on large amounts of raw text in a self-supervised fashion. Two common pretraining objectives are:
  - **causal language modeling**: predict the next word given the words that precede it
  - **masked language modeling**: predict the missing (masked) words in a sentence
- The general pretrained model then undergoes transfer learning, where it is fine-tuned with labeled data on a given task.
- **pretraining**: the act of training a model from scratch
  - requires a mountain of data and can take weeks!
- **fine-tuning**: take a pretrained model that was trained on a corpus as similar to yours as you can find, and train it further on your dataset

Fine-tuning a pretrained model is an example of **transfer learning**: you transfer what the model learned on a previous problem to help solve your own.
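
To make the two self-supervised objectives listed above more concrete, here is a minimal sketch of how training examples could be derived from raw text without any human labels. It splits on whitespace for simplicity, whereas real models work on subword tokens; the function names are hypothetical.

```python
def causal_lm_examples(tokens):
    """Each example pairs the tokens seen so far with the next token to predict."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, position, mask_token="[MASK]"):
    """Hide one token; the model must recover it from context on both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask_token
    return masked, target

tokens = "This course will teach you".split()

# Causal language modeling: only the left context is visible.
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

# Masked language modeling: context on both sides of the gap is visible.
print(masked_lm_example(tokens, 2))
```

In both cases the targets come from the text itself, which is what makes the training self-supervised.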

## Attention layers


# Chapter 2. Using 🤗 Transformers
## What Happens Inside the pipeline Function?

In [None]:
from transformers import AutoTokenizer

In [None]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
raw_inputs = [
              "I've been waiting for a HuggingFace course my whole life.", 
              "I hate this so much!",
              ]

inputs = tokenizer(raw_inputs, 
                   padding=True,
                   truncation=True,
                   return_tensors='pt')

In [None]:
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [None]:
from transformers import AutoModel

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 16, 768])


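The code above outputs the hidden states of the model: a tensor of shape (batch size, sequence length, hidden size), here 2 sentences × 16 tokens × 768 dimensions.

To actually solve our classification problem, we need a model with a **sequence classification** head. 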

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)

In [None]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have two sentences and two labels, the logits have shape 2 × 2: one row per sentence and one column per label. 

In [None]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)

The question, of course, is what those logits actually mean. To answer that, we move on to postprocessing. 

We apply a softmax to the logits to turn them into probabilities.

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
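
As a sanity check, the softmax step can be reproduced by hand. The sketch below is a plain-Python version (not the PyTorch implementation) applied to the logits printed above; each row is exponentiated and divided by its row sum, so the values in a row add up to 1.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above, one row per sentence.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```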


The model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second. 

But what are the labels for those probabilities?

In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [None]:
raw_inputs

["I've been waiting for a HuggingFace course my whole life.",
 'I hate this so much!']

So, for the first sentence, the model predicts with 96% confidence that it is positive while the second is predicted at nearly 100% as being negative.
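
Putting the last two steps together, the sketch below picks the highest-probability class in each row and looks up its name, which yields records shaped like the `pipeline` output at the start of the notebook. `to_predictions` is a hypothetical helper, and the probabilities are the tensor values above, rounded to four decimals.

```python
# Values copied from the outputs above (probabilities rounded).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]

def to_predictions(probabilities, id2label):
    """Pick the highest-probability class per row and attach its label."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
        results.append({"label": id2label[best], "score": row[best]})
    return results

predictions = to_predictions(probabilities, id2label)
print(predictions)
```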

### Deep Dive on [Models](https://huggingface.co/course/chapter2/3?fw=pt)

It's easy to load a model based on a checkpoint.

In [None]:
from transformers import AutoModel

In [None]:
bert_checkpoint = 'bert-base-cased'
gpt2_checkpoint = 'gpt2'
bart_checkpoint = 'facebook/bart-base'

In [None]:
bert_model = AutoModel.from_pretrained(bert_checkpoint)
gpt_model = AutoModel.from_pretrained(gpt2_checkpoint)
bart_model = AutoModel.from_pretrained(bart_checkpoint)

print(type(bert_model))
print(type(gpt_model))
print(type(bart_model))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



We need to load the correct config file or else our model will not run.

In [None]:
from transformers import AutoConfig

bert_config = AutoConfig.from_pretrained(bert_checkpoint)
gpt_config = AutoConfig.from_pretrained(gpt2_checkpoint)
bart_config = AutoConfig.from_pretrained(bart_checkpoint)

In [None]:
print(type(bert_config))
print(type(gpt_config))
print(type(bart_config))

It is also possible to import the config class for a specific architecture directly, as below.

In [None]:
from transformers import BertConfig

bert_config = BertConfig.from_pretrained(bert_checkpoint)
print(type(bert_config))
print(bert_config)

### __Key Point__

The config provides all the information necessary to load the model. Namely, it provides the information needed to create the architecture.


## Tokenizers

Hugging Face provides several types of tokenizers, including word-based, character-based, byte-level BPE (used in GPT-2), WordPiece (used in BERT), and SentencePiece or Unigram (used in several multilingual models).  

### Loading and Saving

Same concepts as loading and saving models and configs: 

In [None]:
albert_checkpoint = 'albert-base-v1'
bart_checkpoint = 'facebook/bart-base'
bert_checkpoint = 'bert-base-cased'
gpt2_checkpoint = 'gpt2'
roberta_checkpoint = 'roberta-base'

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(bert_checkpoint)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)

In [None]:
sequence = "Using a Transformer network is simple" 
tokenizer(sequence)

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Encoding




In [None]:
inputs = tokenizer("Let's try to tokenize!")

print(inputs['input_ids'])

[101, 2421, 112, 188, 2222, 1106, 22559, 3708, 106, 102]


In [None]:
tokenizer.tokenize("Let's try to tokenize!")

['Let', "'", 's', 'try', 'to', 'token', '##ize', '!']

In [None]:
print(tokenizer.decode(inputs['input_ids']))

[CLS] Let's try to tokenize! [SEP]
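
The split of "tokenize" into `token` and `##ize` above comes from WordPiece-style subword matching. As a minimal sketch of the idea (greedy longest-match-first, not the library's actual implementation), the function below splits a single word against a tiny hypothetical vocabulary; the real `bert-base-cased` vocabulary is far larger.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for this illustration.
toy_vocab = {"token", "##ize", "##s", "try", "to"}

print(wordpiece_split("tokenize", toy_vocab))
print(wordpiece_split("tokens", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```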


## Handling Multiple Inputs

In [None]:
tokenizer.pad_token_id

0
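
The pad token ID matters as soon as sequences of different lengths are batched together. The sketch below shows the idea in plain Python, assuming right-padding with ID 0 as reported above: shorter sequences are filled up to the longest length, and an attention mask marks which positions hold real tokens. `pad_batch` is a hypothetical helper; in practice the tokenizer does this when called with `padding=True`.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs copied from the tokenizer outputs earlier in this section.
batch = pad_batch([
    [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102],
    [101, 2421, 112, 188, 2222, 1106, 22559, 3708, 106, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```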