# My Tutorial on HuggingFace

# Chapter 1

In [1]:
from transformers import pipeline

### the sentiment-analysis pipeline performs text classification on the input, determines whether it is positive or negative, and reports its confidence in that label as 'score'

In [2]:


classifier = pipeline("sentiment-analysis")
classifier("I'm watching an hour long Natural Language Processing Video to understand this process")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9550794363021851}]

### pass multiple statements together for the pipeline to process as a batch

In [3]:


classifier(["I am very excited for New Years Eve!", 
"I hate not being able to travel as much due to COVID-19."])

[{'label': 'POSITIVE', 'score': 0.9998338222503662},
 {'label': 'NEGATIVE', 'score': 0.9997290968894958}]

### the zero-shot classification pipeline handles more general text classification with labels you choose; in this case, it scores how well the input text relates to the labels education, business, or programming

In [4]:


classifier = pipeline("zero-shot-classification")
classifier("This course is about learning to explore and visualizing real-world datasets using Python", 
candidate_labels=["education","business","programming"])

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


{'sequence': 'This course is about learning to explore and visualizing real-world datasets using Python',
 'labels': ['programming', 'education', 'business'],
 'scores': [0.66228187084198, 0.32246798276901245, 0.015250138007104397]}
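A small sketch of reading this result, assuming the output above is copied in by hand so no model download is needed. In this output the `labels` and `scores` lists are index-aligned and sorted from most to least likely, and with the default settings the scores sum to roughly 1. The variable names are my own.

```python
# Output of the zero-shot classifier above, copied by hand.
result = {
    "sequence": "This course is about learning to explore and visualizing real-world datasets using Python",
    "labels": ["programming", "education", "business"],
    "scores": [0.66228187084198, 0.32246798276901245, 0.015250138007104397],
}

# Pair each label with its score; the two lists are index-aligned.
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score.
top_label = max(label_scores, key=label_scores.get)

# The scores across the candidate labels should add up to about 1.
total = sum(result["scores"])

print(top_label)
print(round(total, 4))
```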

### generation pipeline auto-completes a given prompt and the output is generated with a bit of randomness so it changes each time
### the text I am using in this example is from the APAN website for Python class description 
### the sentence is: The students in this course will learn to examine raw data with the purpose of deriving insights and drawing conclusions

In [5]:


generator = pipeline("text-generation")
generator("The students in this course will learn to examine raw data with the")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The students in this course will learn to examine raw data with the help of a database of hundreds of years of human data, making the process of applying basic statistical techniques in an interactive manner possible. The students will also use their intuition and insight to create'}]

### when a model is not explicitly provided, the default model for the task is selected. However, you can select any model that has been pretrained for a task on https://huggingface.co/models
### let's do text generation with another model, distilgpt2, and see what happens (this is a lighter version of the gpt2 model)
### we can specify several arguments, like the max length of the generated text and the number of sequences we want returned, because there is some randomness in the generation

In [6]:


generator = pipeline("text-generation", model="distilgpt2")
generator(
    "The students in this course will learn to examine raw data with the <mask> of deriving insights and drawing conclusions", 
    max_length=30, 
    num_return_sequences=2,
)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The students in this course will learn to examine raw data with the <mask> of deriving insights and drawing conclusions based on their knowledge, including,'},
 {'generated_text': 'The students in this course will learn to examine raw data with the <mask> of deriving insights and drawing conclusions which have previously been published in the'}]

### the fill-mask pipeline uses a model whose pretraining objective was to guess the missing words in a sentence
### in this case, we ask for the 2 most likely values for the missing word according to the model (top_k=2)

In [7]:


unmasker = pipeline("fill-mask")
unmasker("The students in this course will learn to examine raw data with the purpose of deriving <mask> and drawing conclusions", top_k=2)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'sequence': 'The students in this course will learn to examine raw data with the purpose of deriving hypotheses and drawing conclusions',
  'score': 0.2825345993041992,
  'token': 44850,
  'token_str': ' hypotheses'},
 {'sequence': 'The students in this course will learn to examine raw data with the purpose of deriving trends and drawing conclusions',
  'score': 0.08611495047807693,
  'token': 3926,
  'token_str': ' trends'}]

### the NER pipeline classifies each word in a sentence instead of the sentence as a whole. For example, named entity recognition identifies entities such as persons, organizations, or locations in a sentence
### the grouped_entities=True argument makes the pipeline group together the different words linked to the same entity, such as New York or Goldman Sachs

In [8]:


ner = pipeline("ner", grouped_entities=True)
ner("My name is Michael and I work at Goldman Sachs in Manhattan.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity_group': 'PER',
  'score': 0.99904186,
  'word': 'Michael',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.99903166,
  'word': 'Goldman Sachs',
  'start': 33,
  'end': 46},
 {'entity_group': 'LOC',
  'score': 0.99840856,
  'word': 'Manhattan',
  'start': 50,
  'end': 59}]
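The `start` and `end` values in this output are character offsets into the input sentence, with `end` exclusive. A minimal sketch to check that, using the entities above copied by hand (scores left out), so it runs without loading the model:

```python
sentence = "My name is Michael and I work at Goldman Sachs in Manhattan."

# Entities copied by hand from the output above (scores omitted).
entities = [
    {"entity_group": "PER", "word": "Michael", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Goldman Sachs", "start": 33, "end": 46},
    {"entity_group": "LOC", "word": "Manhattan", "start": 50, "end": 59},
]

# Slicing the sentence with start/end recovers the text of each entity.
spans = [(e["entity_group"], sentence[e["start"]:e["end"]]) for e in entities]
print(spans)
```

The same slicing works on the question-answering output, which also reports `start` and `end`.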

### another task available with the pipeline API is extractive question answering: given a context and a question, the model identifies the span of text in the context that contains the answer

In [9]:


question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?", 
    context="My name is Michael and I work at Goldman Sachs in Manhattan."
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.7812808752059937,
 'start': 33,
 'end': 46,
 'answer': 'Goldman Sachs'}

### getting summaries of long text is also a task the transformers library can help with, using the summarization pipeline

In [10]:


summarizer = pipeline("summarization")
summarizer("""Spotify is not joking around amid a dispute over royalties for comedy content. The streaming giant has removed the work of hundreds of comedians from its platform -- including Tiffany Haddish, Kevin Hart and the late Robin Williams -- according to rights agency Spoken Giants.

Spoken Giants, which represents some of the affected comedians, describes itself as "the first global rights administration company for the owners and creators of spoken word copyrights," and aims to get streaming platforms to pay comedians for writing jokes in the same way songwriters are paid.
The group told CNN that the take-down happened on November 24, and said that it never requested the content's removal.
"Unfortunately, Spotify removed the work of individual comedians rather than continue to negotiate," CEO Jim King told CNN.
"With this take-down, individual comedians are now being penalized for collectively requesting the same compensation songwriters receive," he added. "After Spotify removed our members' work, we reached out but have not received a response. We have now requested an immediate meeting to resolve this situation."
A Spotify spokesperson told CNN that the streaming platform had already paid "significant amounts of money" to offer the comedy content to listeners, and "would love to continue to do so."
"However, given that Spoken Giants is disputing what rights various licensors have, it's imperative that the labels that distribute this content, Spotify and Spoken Giants come together to resolve this issue to ensure this content remains available to fans around the globe," the spokesperson said.
Although the content is still available on other platforms including Pandora and Sirius, Spoken Giants said comedians with lower profiles and revenues could suffer from losing Spotify as a platform.
On social media, New York-based comedian Joe Zimmerman called the move "corporate bullying." Another New York-based comedian, Liz Miele, tweeted that her albums had also been removed from the platform because comedians "had the audacity to ask for money owed to us," and jokingly compared herself to singer Taylor Swift.
Swift was previously engaged in a dispute with Spotify, arguing artists were not paid enough. The singer pulled her entire catalog from the platform in 2014, but reversed her decision in 2017
""")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' Spotify has removed the work of hundreds of comedians from its platform . Comedians include Robin Williams, Tiffany Haddish, Kevin Hart and Robin Williams . Spoken Giants, which represents some of the affected comedians, says it never requested the content\'s removal . A Spotify spokesperson says the streaming platform had already paid "significant amounts of money" to offer the comedy content to listeners .'}]

### I would translate this more as there is a little truth behind every joke

In [11]:


translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
translator("Entre broma y broma la verdad se asoma.")

[{'translation_text': 'Between jokes and jokes the truth comes out.'}]

### English-to-Russian translator

In [12]:


translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ru")
translator("There is a little truth behind every joke.")

[{'translation_text': 'За каждой шуткой есть немного правды.'}]

### test out the English-to-Chinese (en-zh) translator

In [13]:


translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
translator("There is a little truth behind every joke.")

[{'translation_text': '每一个笑话背后都有一点点真相。'}]

# Decoders, Encoders, and Transformers
#### decoders are trained to guess the next word, which is why they are good at generating text,
#### encoder models are pre-trained by filling in randomly masked words in a sentence, so they are better at understanding the input,
#### transformer models are just massive,
#### let's build a summarization model and assess accuracy using zero-shot classification on both the article and summary, BLEU, and a custom scoring methodology

# Transfer Learning
#### transfer learning is when you take a model pre-trained on a lot of data and fine-tune it on the new task you want to work with - reusing the model instead of training one from scratch means you need less data to train your model
#### GPT-2 was pre-trained using the content of 45 million links posted by users on Reddit - a good model for guessing the next word in a sentence
#### another pre-training objective is predicting the value of randomly masked words - BERT was pre-trained this way using English Wikipedia and 11,000 unpublished books
#### a pre-trained model carries over its knowledge but also the bias it already has, ex: ImageNet mostly contains images from the United States and Western Europe, so models fine-tuned from it usually do better on images from these countries

# Transformer Architecture

#### decoders, encoders, sequence to sequence (encoder-decoder)
#### encoder converts inputs to numerical representations - is bi-directional, uses self-attention
#### decoder "decodes" the representations from the encoder (output probabilities) - is uni-directional, traditionally used in an auto-regressive manner, uses masked self-attention
#### encoder-decoder (sequence to sequence) combines an encoder and a decoder: the encoder accepts inputs and computes a high-level representation of them, and these outputs are then passed to the decoder, which uses them alongside its other inputs to generate a prediction; it then feeds that prediction back in as input for the next step - hence auto-regressive

#### encoders - the attention mechanism is allowed to look at every word in the sentence, so the words before and after; for a model like BERT that needs to guess the value of a masked word, it is useful to look at what came before and after
#### decoders - models like GPT have to predict the next word, so if they were allowed to look at the words after, it would be cheating; they are only allowed to look at what came before
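The difference between the two kinds of attention can be sketched with plain lists, no model needed. This is only an illustration of which positions are visible, assuming a made-up three-word input; real models apply such a mask to attention scores over tokens, not words. `visible_context` is a hypothetical helper of mine.

```python
words = ["Welcome", "to", "NYC"]
n = len(words)

# Encoder-style (bi-directional): every position may attend to every position.
bidirectional_mask = [[1] * n for _ in range(n)]

# Decoder-style (causal): position i may attend only to positions 0..i.
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def visible_context(mask, i):
    """Return the words that position i is allowed to look at."""
    return [word for word, allowed in zip(words, mask[i]) if allowed]


print(visible_context(bidirectional_mask, 1))  # what "to" sees in an encoder
print(visible_context(causal_mask, 1))         # what "to" sees in a decoder
```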

# Encoder Models

#### BERT is a popular encoder model; others include ALBERT, ELECTRA, RoBERTa, DistilBERT
#### retrieves a numerical representation of each word
#### encoder outputs one sequence of numbers per input word, aka a feature vector/tensor
#### each vector is the numerical representation of a word, and the dimension of that vector is defined by the architecture of the model - for the base BERT model it is 768
#### these vectors contain the value of a word but contextualized; for example, the vector representation of the word "to" isn't just "to", it also takes into account the words around it, which are called the context (right and left context)
#### "Welcome to NYC" - for the word "to" the left context is "Welcome" and the right context is "NYC" -- output is based on contexts so it is a contextualized vector thanks to the self-attention mechanism
#### the self-attention mechanism relates different positions (different words) in a single sequence in order to compute a representation of that sequence
#### should use encoders as standalone models for sequence classification, question answering, masked language modeling, etc. - very powerful at extracting vectors that carry meaningful information about a sequence
#### encoders really shine in masked language modeling (MLM): predicting a hidden word in a sequence of words -- BERT was trained on this because bi-directional information is crucial in this task
#### this requires semantic understanding as well as syntactic understanding
#### encoders are good at sequence classification like sentiment analysis - the aim is to identify the sentiment of a sequence, like 1-5 stars for a review or a positive or negative rating - even if the words are the same, their order can make a sequence mean something completely different
#### BERT has a maximum length of 512 tokens, so generally cannot use an input longer than that; some models like Longformer can accept longer context, so should read the documentation for specific encoders and their abilities; can split a long text into several chunks of 512 tokens, pass each chunk into the model, and average what you get at the end to try to train a classifier for longer texts
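The chunk-and-average idea from the last note can be sketched in plain Python. This is a rough sketch under assumptions: `chunk_tokens` and `average_scores` are hypothetical helpers, the token ids and per-chunk scores are made up, and a real setup would also have to leave room in each chunk for special tokens, which count toward the 512 limit.

```python
def chunk_tokens(token_ids, max_len=512):
    """Split a list of token ids into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


def average_scores(chunk_scores):
    """Average per-chunk class scores into one score list for the whole text."""
    n_chunks = len(chunk_scores)
    n_classes = len(chunk_scores[0])
    return [sum(scores[c] for scores in chunk_scores) / n_chunks for c in range(n_classes)]


# Hypothetical document of 1,200 token ids (stand-ins for real tokenizer output).
token_ids = list(range(1200))
chunks = chunk_tokens(token_ids)
print([len(chunk) for chunk in chunks])

# Hypothetical [negative, positive] scores a classifier might return per chunk.
chunk_scores = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
document_scores = average_scores(chunk_scores)
print(document_scores)
```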

# Decoder Models

#### a popular decoder model is GPT-2
#### can generally use for the same tasks as encoders, but with a little loss in performance
#### "Welcome to NYC" -- gets a numerical representation for each word, outputs one sequence of numbers per input word, this is a feature vector/tensor
#### one vector per word passed through the decoder, each vector is the numerical representation of a word, dimension of the vector is defined by the architecture of the model
#### where it differs from an encoder is the self-attention mechanism - it uses masked self-attention
#### the word "to" would be unmodified by the word "NYC" because the right context of the word is all masked; it doesn't benefit from bi-directional context, only a single context depending on what is masked
#### masked self-attention provides an additional mask to hide the context in one direction, so the vector is not affected by words in the hidden context
#### should use as standalone models; they generate numerical representations and can be used in a wide variety of tasks, but having access only to the left context makes them great at text generation: the ability to generate a word or sequence of words given a string of words, aka causal language modeling or natural language generation
#### causal language modeling -- start with the word "my"; the decoder outputs a vector of numbers, which the language modeling head maps to a score over all words known by the model, and it predicts the most probable following word; it adds that to the sequence, so if it picks "name" it then uses "my name" to predict the next word, aka auto-regressive
#### passes "my name" through the decoder, which then outputs "is" -- GPT-2 has a maximum context size of 1024 tokens
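The auto-regressive loop in these notes can be sketched with a toy stand-in for the decoder. Everything here is hypothetical: `NEXT_WORD_PROBS` is a hand-written lookup table instead of a real language modeling head, and the sketch always picks the most probable word (greedy decoding), whereas sampling from the probabilities is what makes real generated text vary between runs.

```python
# Hypothetical "language model": maps the text so far to next-word probabilities.
# A real decoder computes these with a language modeling head over its whole vocabulary.
NEXT_WORD_PROBS = {
    "my": {"name": 0.7, "dog": 0.3},
    "my name": {"is": 0.9, "was": 0.1},
    "my name is": {"<eos>": 1.0},
}


def generate(prompt, max_new_words=10):
    """Greedy auto-regressive decoding: pick the likeliest word, append it, repeat."""
    words = prompt.split()
    for _ in range(max_new_words):
        probs = NEXT_WORD_PROBS.get(" ".join(words))
        if probs is None:  # the toy table has no entry for this context
            break
        next_word = max(probs, key=probs.get)
        if next_word == "<eos>":  # stopping value: end of sequence
            break
        words.append(next_word)  # the output becomes part of the next input
    return " ".join(words)


print(generate("my"))
```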

# Sequence to Sequence Models (combine encoders and decoders) / Encoder-Decoder Architecture

#### popular encoder-decoder models are T5, BART, ProphetNet, mT5, M2M100, Pegasus, MarianMT, mBART, etc.
#### the encoder takes words as inputs and gives a numerical representation for each word passed through, which contains info about the meaning of the sequence
#### the decoder is passed the outputs from the encoder directly and, in addition, is given a sequence; when prompting the decoder with no initial sequence, we can give it the value that indicates the start of a sequence (this is where the magic happens)
#### the encoder accepts a sequence as input, computes a prediction, and outputs a numerical representation, then sends that over to the decoder; it has in a sense "encoded" that sequence, and the decoder uses this input alongside its usual sequence input to try to decode the sequence
#### the decoder decodes the sequence and outputs a word; we don't need to make sense of that word yet, but the decoder is decoding what the encoder has output; the start-of-sequence word indicates that it should start decoding the sequence - after that we don't need the encoder anymore
#### the decoder acts in an auto-regressive manner, so it can take what it just output as an input - this, in combination with the numerical representation provided by the encoder, can be used to generate a second word, and it continues on and on until the decoder outputs a stopping value, like a "." indicating the end of a sequence
#### so we have an initial sequence sent to the encoder, the encoder output sent to the decoder to be decoded, and then we can discard the encoder after a single use - the decoder can then be used several times until we have generated every word that we need
#### transduction (the act of translating a sequence) -- can use a transformer model built for that task -- take "Welcome to NYC" and translate it to French: the encoder creates a representation of the English sentence and passes it to the decoder, which, given the start-of-sequence word, outputs the first word "Bienvenue"; "Bienvenue" is then used as the input sequence for the decoder, which, along with the encoder's numerical representation, predicts the second word "à"; it then uses "Bienvenue à" to predict "NYC"
#### the encoder and decoder often do not share weights - so the entire encoder block can be trained to understand the sequence and extract the relevant information; for the translation scenario this would mean parsing and understanding what was said in English, extracting info from that language, and putting it in a dense vector of information; then we have the decoder, whose sole purpose is to decode the numerical representation output by the encoder; this decoder can be specialized in a completely different language or even a modality like images
#### encoder-decoder models are able to manage sequence to sequence tasks like translation; weights between the encoder and decoder are not necessarily shared
#### translating "Transformers are powerful" to French takes 3 English words and outputs 4 French words, "Les Transformers sont puissants" -- the decoder can do this because it generates auto-regressively, independent of the input length
#### summarization is a strong use case - since the encoder and decoder are separate, can have a very long context for the encoder, which handles the text, and a smaller context for the decoder, which handles the summarized sequence
#### can also load an encoder and a decoder inside an encoder-decoder model, so can choose specific encoders and decoders that are good at specific tasks - customizable
#### cool to paraphrase from formal to informal using an encoder-decoder

# Bias and Limitations

#### powerful, but we don't directly control how inputs map to outputs; that is determined by training, so need precautions to avoid predictions you don't want in deployment
#### BERT has been pretrained on filling masked words - changing the gender in the prompt produces completely different outputs - for a prompt about a man's job it is more likely to output (carpenter, doctor, mechanic); for a woman it is more likely to output (maid, nurse, waitress, teacher, prostitute)
#### GPT-2 was trained on content linked from Reddit, so it can be more sexist, etc.
#### these biases will persist even after fine-tuning - need enough samples of the outputs you want to see, and always analyze results before putting a model in production - if you want to avoid some outputs, try to provide more training data that corrects the bias of the model

# Chapter 2 - Transformers, Model APIs, and Tokenizers used to convert text into format models can process

# What happens in pipeline function

#### How did the sentiment pipeline go from raw text to a positive or negative label? It converts raw text to numbers using a tokenizer, then those numbers go through the model, which outputs logits, then post-processing steps transform those logits into labels and scores
#### Tokenization process - text is split into tokens, then the tokenizer adds special tokens if the model expects them, then the tokenizer matches each token to its unique id in the tokenizer's vocabulary - AutoTokenizer API
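The three tokenization steps can be sketched with a toy tokenizer. This is not how the real tokenizer is implemented: `toy_tokenize` and `toy_encode` are hypothetical helpers, the vocabulary is a tiny hand-made subset whose ids are copied from what the real tokenizer prints for this sentence in Step 1, and the subword splitting the real tokenizer does for rare words is skipped.

```python
import re

# Tiny hand-made vocabulary. The ids match the ones the real tokenizer prints
# for this sentence in this notebook; the real vocabulary has about 30,000 entries.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "this": 2023, "so": 2061, "much": 2172, "!": 999,
}


def toy_tokenize(text):
    """Step 1: lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Steps 2 and 3: add the special tokens, then map each token to its id."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return tokens, [VOCAB[token] for token in tokens]


tokens, ids = toy_encode("I hate this so much!")
print(tokens)
print(ids)
```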

In [14]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

## Step 1 - Tokenizer

In [15]:


raw_inputs = [
    "I have been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]

# Because the two sentences passed to the tokenizer are not the same length, padding=True pads the shorter one so both fit in one rectangular array
# truncation=True makes sure that any sentence longer than the maximum the model can handle is truncated
# return_tensors="pt" tells the tokenizer to return PyTorch tensors

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

# the result is a dictionary with 2 keys: input_ids contains the ids of both sentences, with zero padding applied to the shorter one
# the second, attention_mask, marks where padding has been applied (the 0s) so the model does not pay attention to it
inputs

{'input_ids': tensor([[  101,  1045,  2031,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}
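What `padding=True` does to this batch can be reproduced by hand. A minimal sketch, assuming right-padding with id 0 as seen in the output above; `pad_batch` is a hypothetical helper, not part of the transformers library.

```python
# Token ids for the two sentences before padding, copied from the output above.
first = [101, 1045, 2031, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

PAD_ID = 0  # matches the zeros in the output above


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([first, second])
print(input_ids[1])
print(attention_mask[1])
```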

## Step 2 - Model

In [16]:


from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# the from_pretrained method will download and cache the configuration of the model as well as the pre-trained weights. However, the AutoModel API will only instantiate the body of the model (the part of the model left once the pre-training head is removed)
# it outputs a high-dimensional tensor that is a representation of the sentences passed, but which is not directly useful for a classification problem
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# output torch.Size([2, 15, 768]) [batch size, sequence length, hidden size]
# this output shows the tensor has 2 sentences, each of 15 tokens, and the last dimension is the hidden size of the model, 768


Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 15, 768])


## Step 2 - Model (with a classification head)

In [17]:


# to get an output for a classification problem, need to use AutoModelForSequenceClassification
# works like the AutoModel class but builds a model with a classification head - there is one auto class for each common NLP task in the transformers library

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits)

# output tensor([[-1.3782,  1.4346],
#        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

# tensor of size 2x2, one result for each sentence and each possible label - these outputs are not probabilities yet - can see because they don't sum to 1
# this is because each model in the transformers library returns logits - to make sense of those logits, need to look at post-processing

tensor([[-1.3782,  1.4346],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


## Step 3 - Post-processing (logits to predictions)

In [18]:


import torch

# to convert logits to probabilities, need to apply a softmax layer to them

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

# the output shows that these are positive numbers that sum to 1
# last step is to know which of those correspond to the positive or negative label - given by the id2label field of the model config
# first index [0] probabilities correspond to negative label, the second [1] correspond to positive label


# output tensor([[5.6636e-02, 9.4336e-01],
#        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)




tensor([[5.6636e-02, 9.4336e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
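The post-processing step can be checked by hand without torch. A minimal sketch: `softmax` and `argmax` here are my own pure-Python helpers, not the library functions, applied row by row to the logits printed earlier (rounded to four decimals, so the probabilities match only approximately), followed by the id-to-label lookup from the model config.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Logits printed by the classification model above, one row per sentence.
logits = [[-1.3782, 1.4346], [4.1692, -3.3464]]

# Mapping copied from the model config (id2label).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probabilities = [softmax(row) for row in logits]
labels = [id2label[argmax(row)] for row in probabilities]

print(probabilities)
print(labels)
```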


## this problem only has 2 classes, positive and negative, but what if you have a multi-class classification problem?
# this one predicts two classes

In [19]:


from transformers import pipeline 

classifier = pipeline("sentiment-analysis") 
classifier([ 
    "There is a little truth behind every joke.", 
    "Time flies when you're having fun"
])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.7393240332603455},
 {'label': 'POSITIVE', 'score': 0.8359946608543396}]

# the pipeline first needs to process raw text using a tokenizer; key thing to remember: if doing any fine-tuning, inference, or prediction, it is important that the checkpoint used for the tokenizer is the same as for the model, because when transformers are pretrained on a large corpus, a corresponding tokenizer is also trained to learn the vocabulary of that corpus - if they differ, you mismatch the vocabulary and get suboptimal results

In [20]:
# first we need to instantiate the tokenizer

from transformers import AutoTokenizer 

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


# input ids map each token to a unique integer in the vocabulary - each token has an integer mapped to it
# attention mask is 1 for real tokens and 0 for the padding added at the end of a sequence, which we will discuss later

In [21]:
raw_inputs = [ 
    "There is a little truth behind every joke.", 
    "Time flies when you're having fun"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

print(inputs)

{'input_ids': tensor([[  101,  2045,  2003,  1037,  2210,  3606,  2369,  2296,  8257,  1012,
           102],
        [  101,  2051, 10029,  2043,  2017,  1005,  2128,  2383,  4569,   102,
             0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}


# now load model which processes inputs

In [22]:
# we do this in the model config

from transformers import AutoModel 

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

model.config

#  every single model has a config and this config tells you things like the number of classes

#  "label2id": {
#    "NEGATIVE": 0,
#    "POSITIVE": 1
#  },

# what you can do when you instantiate a model for text classification is define the number of classes you would like


Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.12.5",
  "vocab_size": 30522
}

In [23]:
# for example, say I have a checkpoint - just take the distilbert-base-uncased model for example
# then will import the auto class for sequence classification - good for classification, including multi-class

from transformers import AutoModelForSequenceClassification

checkpoint_for_multiclass = "distilbert-base-uncased"

# then call from_pretrained on the sequence classification class with the checkpoint, and pass a keyword argument (num_labels) that specifies how many labels we are dealing with
# this downloads the base pretrained model, adds a classification head on top of it, and configures it with the right number of classes

model = AutoModelForSequenceClassification.from_pretrained(checkpoint_for_multiclass, num_labels=6)

model.config

#  "id2label": {
#     "0": "LABEL_0",
#     "1": "LABEL_1",
#     "2": "LABEL_2",
#     "3": "LABEL_3",
#     "4": "LABEL_4",
#     "5": "LABEL_5"
#   }

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.12.5",
  "vocab_size": 30522
}

# simplest way to feed inputs to the model is the python unpacking operator (**), which passes every key and value of the tokenizer output as a keyword argument
# this sends all inputs through the forward pass of the model to generate the outputs
# overall: raw text is converted to integer ids, and the model converts those ids into dense vectors - so every token is associated with a vector
# torch.Size([2, 11, 768]) means 2 sentences, 11 token vectors per sentence, and 768 dimensions per vector (the hidden size, "dim": 768 in the DistilBERT config)
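
# a minimal sketch of what the unpacking operator does, in plain python - the forward function here is a made-up stand-in for the model's forward pass, and the ids are invented for illustration

```python
# Hypothetical stand-in for a model's forward pass; it only reports what it received.
def forward(input_ids=None, attention_mask=None):
    return {"n_sequences": len(input_ids), "has_mask": attention_mask is not None}

# Shaped like a tokenizer output: a mapping from argument name to value.
inputs = {
    "input_ids": [[101, 2045, 102], [101, 2748, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# forward(**inputs) is the same call as spelling out every key as a keyword argument.
unpacked = forward(**inputs)
explicit = forward(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(unpacked)
print(unpacked == explicit)
```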

In [24]:
from transformers import AutoModel 

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 11, 768])


# all outputs are wrapped in an object whose fields we can access by attribute name - here a BaseModelOutput, whose last_hidden_state field holds the tensor

In [25]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.2046,  0.3315,  0.0793,  ...,  0.0537,  0.2866,  0.2205],
         [-0.2585,  0.4289, -0.0989,  ..., -0.1086,  0.1441,  0.3349],
         [-0.3724,  0.4050, -0.0340,  ..., -0.1534,  0.3746,  0.6864],
         ...,
         [-0.0311,  0.3458,  0.0469,  ..., -0.0710,  0.3344,  0.5246],
         [ 0.5216,  0.1719,  0.1227,  ...,  0.3565, -0.1977, -0.1930],
         [ 0.3645,  0.4160,  0.4990,  ...,  0.3746, -0.0586, -0.0142]],

        [[ 0.2481,  0.2188,  0.6047,  ..., -0.7964,  0.9062, -0.0586],
         [ 0.3293,  0.5613,  0.3619,  ..., -0.6319,  0.6820, -0.1266],
         [ 0.4563,  0.5079,  0.5108,  ..., -0.7594,  0.3598,  0.1546],
         ...,
         [ 0.5942,  0.0251,  0.6978,  ...,  0.0855,  0.7381, -0.3256],
         [ 0.9546,  0.4101,  0.8726,  ...,  0.4479,  0.5824, -0.6132],
         [ 0.1438, -0.0730,  0.5039,  ..., -0.1315,  0.6125, -0.2272]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

# so to look at the last_hidden_state of the output for the first sentence, take the vector corresponding to the first token
# this long list of numbers, negative and positive, should have size 768

In [26]:
outputs.last_hidden_state[0, 0]

tensor([-2.0456e-01,  3.3149e-01,  7.9314e-02,  2.3897e-01, -2.6303e-01,
         5.2754e-02,  2.4966e-01,  3.7042e-01,  1.4424e-01, -5.9761e-01,
         4.8752e-01,  4.8655e-02,  1.7460e-01,  1.1068e-01, -2.8549e-01,
         5.5047e-01,  4.8756e-01, -5.9484e-02,  1.1203e-01, -3.6063e-01,
        -1.0464e-01, -5.5221e-03, -1.3071e-01,  7.7812e-02,  2.4495e-01,
         2.3912e-01,  8.9579e-03,  2.3336e-01, -1.9855e-01,  3.8994e-01,
        -8.8543e-02, -7.9171e-03, -1.4176e-01, -3.8431e-01,  4.0755e-01,
        -1.7427e-02, -3.0652e-01,  6.5352e-01, -4.8395e-02,  1.9115e-01,
        -1.3361e-01,  6.9770e-01, -5.2213e-02, -1.6725e-01, -5.1267e-01,
        -1.8223e-01, -1.1495e+00,  1.9498e-02,  2.5056e-01, -2.8728e-01,
        -2.1412e-01,  3.2262e-01, -1.3824e-01,  4.6103e-01,  2.8472e-01,
         7.8934e-01,  3.3711e-01, -2.9545e-01, -1.3582e-01,  1.2684e-01,
        -6.2874e-02,  3.7624e-01,  3.3091e-01, -2.8465e-01,  9.3525e-02,
         1.4451e+00, -2.4522e-01,  2.3340e-01, -6.4

In [27]:
outputs.last_hidden_state.size()

torch.Size([2, 11, 768])

# numerical representations by themselves dont let us do things like text classification - they are just one vector per token
# to do classification we need to feed those vectors to a classification head - the entire transformers library is built around the idea of taking a base model and adding a head for a task like summarization, classification, or question answering
# when we instantiate the model with AutoModelForSequenceClassification, we get a model whose number of labels is set by the checkpoint


In [28]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.6482, -0.3943],
        [-0.7400,  0.8887]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

# when we look at the outputs - instead of just the last hidden state we have logits - the head compresses the 768-dimensional vectors into one raw score per class, which can be turned into probabilities and indicates which class is most likely
# in this example there are 2 classes (negative or positive sentiment), so each sentence gets 2 logits

In [29]:
print(outputs.logits.shape)

torch.Size([2, 2])
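
# a minimal sketch of what a classification head does, in plain python: a single linear layer that turns one hidden vector into one logit per class. the 4-dimensional vector, weights, and bias are made up for illustration (the real vectors have 768 dimensions), and the real DistilBERT head has more than one layer (the warning output earlier lists both pre_classifier and classifier weights), so this only illustrates the idea

```python
hidden = [0.5, -1.0, 2.0, 0.0]        # stand-in for one sentence vector
weights = [[0.2, -0.1, 0.4, 0.3],     # one row of weights per class
           [-0.3, 0.5, 0.1, -0.2]]
bias = [0.1, -0.1]

# each logit is a weighted sum of the hidden vector plus a bias
logits = [sum(w * h for w, h in zip(row, hidden)) + b
          for row, b in zip(weights, bias)]
print(logits)   # two raw scores, one per class
```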


# index 0 is NEGATIVE and index 1 is POSITIVE (see model.config.id2label), so the first sentence is more likely to be negative and the second is more likely to be positive

In [30]:
print(outputs.logits)

tensor([[ 0.6482, -0.3943],
        [-0.7400,  0.8887]], grad_fn=<AddmmBackward0>)


# to convert logits to probabilities, take a softmax over them - it exponentiates each logit and normalizes by the sum of the exponentials, so every value lies between 0 and 1 and each row sums to 1 - this gives a probability for both sentiments

In [31]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[0.7393, 0.2607],
        [0.1640, 0.8360]], grad_fn=<SoftmaxBackward0>)
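
# the softmax step can be sketched in plain python to check the numbers by hand - this is not how torch implements it, just the same formula applied to the rounded logits printed earlier (so the last digit can differ slightly from the tensor above)

```python
import math

def softmax(row):
    # subtract the max before exponentiating for numerical stability
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[0.6482, -0.3943], [-0.7400, 0.8887]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 6))
```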


# can see the labeling for each of them in human-readable format

In [32]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
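
# a small plain-python sketch of turning the probabilities into a predicted label: take the index of the largest probability in each row and look it up in id2label. the numbers are copied from the outputs above

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}            # as printed by model.config.id2label
predictions = [[0.7393, 0.2607], [0.1640, 0.8360]]   # softmax output printed earlier

labels = []
for probs in predictions:
    best = max(range(len(probs)), key=probs.__getitem__)   # index of the largest probability
    labels.append(id2label[best])
    print(id2label[best], probs[best])
```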

# using the decode method we can see the raw text behind the token ids
# it includes special tokens like [CLS], which marks the start of the input, and [SEP], which marks the end of a sentence (and separates the two sentences of a pair)

In [33]:
tokenizer.decode(inputs.input_ids[0])

'[CLS] there is a little truth behind every joke. [SEP]'

# How to Instantiate a Transformer Model from the Transformers Library

In [34]:
# AutoModel.from_pretrained picks the right model class for the checkpoint

from transformers import AutoModel

bert_model = AutoModel.from_pretrained("bert-base-cased")
print(type(bert_model))

gpt_model = AutoModel.from_pretrained("gpt2")
print(type(gpt_model))

bart_model = AutoModel.from_pretrained("facebook/bart-base")
print(type(bart_model))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<class 'transformers.models.bert.modeling_bert.BertModel'>
<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>
<class 'transformers.models.bart.modeling_bart.BartModel'>


# the AutoConfig class loads the configuration of a pretrained model from any checkpoint and picks the right configuration class from the library for us

In [35]:
# can use specific class corresponding to a checkpoint but need to change the code each time you want to try a different model architecture

from transformers import BertConfig

bert_config = BertConfig.from_pretrained("bert-base-cased")
print(type(bert_config))

<class 'transformers.models.bert.configuration_bert.BertConfig'>


In [36]:
from transformers import BartConfig 

bart_config = BartConfig.from_pretrained("facebook/bart-base")
print(type(bart_config))

<class 'transformers.models.bart.configuration_bart.BartConfig'>


# the config of a model is a blueprint that contains all the info necessary to create the model architecture
# this model was built with 12 hidden layers, a hidden size of 768, and a vocab size of 28,996

In [37]:
from transformers import BertConfig 

bert_config = BertConfig.from_pretrained("bert-base-cased")
print(bert_config)

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.12.5",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



# once have configuration, can create model that has same architecture as checkpoint but is randomly initialized and then train from scratch like any pytorch model

In [38]:
# uses same architecture as bert-base-cased

from transformers import BertConfig, BertModel 

bert_config = BertConfig.from_pretrained("bert-base-cased")
bert_model = BertModel(bert_config)

# can also change any part of the configuration by using keyword arguments

In [39]:
# using only 10 layers instead of 12

from transformers import BertConfig, BertModel 

bert_config = BertConfig.from_pretrained("bert-base-cased", num_hidden_layers=10)
bert_model = BertModel(bert_config)

# save a model once it's trained or fine-tuned with the save_pretrained method - a relative path like the one below is created inside the current working directory

In [40]:
from transformers import BertConfig, BertModel 

bert_config = BertConfig.from_pretrained("bert-base-cased")
bert_model = BertModel(bert_config)

# training code

bert_model.save_pretrained("my-bert-model")

# reloading a saved model

In [41]:
from transformers import BertModel 

bert_model = BertModel.from_pretrained("my-bert-model")

# once you have saved your model you have a config file and pytorch_model.bin - the latter is a state dictionary that holds all the weights - to use the model, take input text, convert it to input ids, and convert the input ids to tensors to feed the model

# Models (PyTorch)

In [42]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

In [43]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.12.5",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [44]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

# in practice most of the time using from_pretrained as it will initialize model with pretrained weights and correct head if needed

In [45]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# save model so can deploy it somewhere

In [46]:
model.save_pretrained("directory_on_my_computer")

In [47]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [48]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

# convert to tensors

In [49]:
import torch

model_inputs = torch.tensor(encoded_sequences)

In [50]:
model_inputs

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])

# passing the tensors to the model is what constitutes a prediction

In [51]:
output = model(model_inputs)

In [52]:
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9394e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6914e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1961e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1111e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2

#### the power of transformers is that we dont have to pretrain, which is expensive and time consuming
#### the only time we might really be stuck is when dealing with a domain that is different from any existing pretrained model, like source code back in the day - a model pretrained on english wouldnt give good results on python, so we would want to train on a source code corpus
#### we generally also need an alternative when dealing with a language that isnt commonly supported (e.g. many languages in africa have little coverage on wikipedia) - adapt something multilingual like multilingual BERT and use it on that language

# Tokenizers in more detail (3 most popular approaches)
## most common subword tokenizers are wordpiece (bert), byte-level BPE (gpt-2), and sentencepiece (e.g. t5, xlnet)

### Split using whitespace
#### split text into words on whitespace, since in english this is the boundary between words - so a word is a token (this doesnt work for japanese, which has no spaces between words!)
#### need a token for every word in the language - english needs several hundred thousand tokens, which is expensive, and the vocabulary sees no connection between dog and dogs - similar words, but represented by two unrelated tokens

In [53]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
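
#### a minimal plain-python sketch of the drawbacks just described - the two-sentence corpus is made up for illustration. every distinct word gets its own id, so dog and dogs end up as unrelated tokens, and a word that was never seen falls back to an unknown token

```python
corpus = ["the dog barks", "the dogs bark"]

# one id per distinct whitespace-separated word, plus an unknown token
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(text):
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(vocab)
print(encode("the dog barks"))
print(encode("the puppy barks"))   # "puppy" was never seen, so it maps to [UNK]
```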


In [54]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [55]:
# better to use AutoTokenizer since it automatically picks the right tokenizer class for the checkpoint provided

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [56]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

## wordpiece marks a subword that continues the previous token with the ## prefix (double hash) - here the tokenizer has learned it is useful to split Trans from ##former

In [57]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


### convert the tokens to input ids, then decode input ids back into a sentence

In [58]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
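
### a rough plain-python sketch of what these two steps amount to: a dictionary lookup from token to id, and a reverse lookup that glues ## pieces back onto the previous word. the toy vocabulary holds only the seven tokens and ids printed above, and merge_wordpieces is a made-up helper name, not a transformers function

```python
vocab = {"Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
         "network": 2443, "is": 1110, "simple": 3014}
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
ids = [vocab[t] for t in tokens]

def merge_wordpieces(ids):
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]     # continuation piece: glue onto the previous word
        else:
            words.append(token)
    return " ".join(words)

decoded = merge_wordpieces(ids)
print(ids)
print(decoded)
```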


In [59]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


### this approach provides special tokens

In [60]:
inputs = tokenizer(sequence)
inputs

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [61]:
tokenizer.decode(inputs.input_ids)

'[CLS] Using a Transformer network is simple [SEP]'

### unless you skip them

In [62]:
tokenizer.decode(inputs.input_ids, skip_special_tokens=True)

'Using a Transformer network is simple'

# different tokenizer - gpt-2 is quirky: it puts the Ġ symbol at the start of a token to indicate there is whitespace before it

In [63]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)


['Using', 'Ġa', 'ĠTrans', 'former', 'Ġnetwork', 'Ġis', 'Ġsimple']
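
# a tiny sketch of how the Ġ marker maps back to text: join the tokens and turn every Ġ into a space. this only handles the space marker shown above - the real tokenizer decodes through a byte-level mapping, which this sketch does not attempt

```python
tokens = ['Using', 'Ġa', 'ĠTrans', 'former', 'Ġnetwork', 'Ġis', 'Ġsimple']

# tokens without Ġ (like 'former') attach directly to the previous token
text = "".join(tokens).replace("Ġ", " ")
print(text)
```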


### Character based approach
#### split text into characters rather than words - benefits are that the vocabulary is much smaller and there are far fewer out-of-vocabulary tokens, since every word can be built from characters
#### downside is the model has to learn what a word actually means - it only gets characters and must work out that these characters in this order represent a more abstract object like a word - not a good fit for english
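
#### a minimal plain-python sketch of the character-based approach on the sentence used earlier - the vocabulary is tiny, but the same text turns into many more tokens than with word splitting

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)            # every character, spaces included, is a token

# vocabulary: one id per distinct character
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```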

### Subword tokenization
#### instead of splitting on word boundaries or characters, decompose a word into subwords (e.g. "annoyingly" becomes "annoying" + "ly") - collect the frequencies of subwords in a corpus, use them to figure out the most frequent subwords in the language, and build words back up from those pieces
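
#### a rough sketch of the lookup side of subword tokenization, in the style of wordpiece: greedily match the longest piece found in the vocabulary, then continue with the rest of the word. the vocabulary is invented for illustration, and the frequency-based step that learns a real vocabulary from a corpus is not shown

```python
# toy vocabulary; "##" marks a piece that continues a word
vocab = {"annoying", "##ly", "token", "##ization", "##s"}

def wordpiece(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # try the longest remaining substring first, then shrink it
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]        # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces

for word in ["annoyingly", "tokens", "tokenization", "xyz"]:
    print(word, "->", wordpiece(word))
```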

# Handling multiple sequences together
## how to batch inputs together? 

In [64]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
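# with this checkpoint's tokenizer the ids differ from the ones printed below,
# which match the GPT-2 tokenizer loaded in the previous cell
tokenizer = AutoTokenizer.from_pretrained(checkpoint)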

sentences = [ 
    "I love studying at Columbia University.", 
    "I hate studying at Columbia University sometimes.",
]

tokens = [tokenizer.tokenize(sentence) for sentence in sentences]
ids = [tokenizer.convert_tokens_to_ids(token) for token in tokens]
print(ids[0])
print(ids[1])

[40, 1842, 11065, 379, 9309, 2059, 13]
[40, 5465, 11065, 379, 9309, 2059, 3360, 13]


## building a tensor from these lists throws an error bc both sentences need to be the same length

In [65]:
import torch

ids =[[40, 1842, 11065, 379, 9309, 2059, 13],[40, 5465, 11065, 379, 9309, 2059, 3360, 13]]

#this will fail
input_ids = torch.tensor(ids)

ValueError: expected sequence of length 7 at dim 1 (got 8)

## can overcome this by padding the shorter sequence with as many 0s as necessary so both are the same length, or by truncating the longer sequence to the length of the shorter one - but truncation loses information
## in general, we only truncate sentences when they are longer than the maximum length the model can handle

In [66]:
import torch

ids =[[40, 1842, 11065, 379, 9309, 2059, 13, 0],[40, 5465, 11065, 379, 9309, 2059, 3360, 13]]
input_ids = torch.tensor(ids)
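
## the manual padding above can be sketched as a small helper in plain python - pad_batch is a made-up name, and pad_id=0 is passed in by hand here rather than read from a tokenizer

```python
def pad_batch(batch, pad_id):
    # pad every sequence on the right up to the length of the longest one
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

ids = [[40, 1842, 11065, 379, 9309, 2059, 13],
       [40, 5465, 11065, 379, 9309, 2059, 3360, 13]]

padded = pad_batch(ids, pad_id=0)
for row in padded:
    print(row)
```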

## the value used to pad the sentence should not be picked randomly - the model has been pretrained with a specific padding id, which you can find in tokenizer.pad_token_id

In [67]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token_id

0

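## without an attention mask we dont get the same result for a sentence once it is padded: in the unpadded sentence the attention layers attend only to the real tokens, while in the padded sentence they also attend to the padding tokens
## note that ids1 below already contains the trailing pad id 0, so its logits match the first row of the batch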

In [68]:
from transformers import AutoModelForSequenceClassification

ids1 = torch.tensor([[40, 1842, 11065, 379, 9309, 2059, 13, 0]])
ids2 = torch.tensor([[40, 5465, 11065, 379, 9309, 2059, 3360, 13]])
all_ids = torch.tensor([[40, 1842, 11065, 379, 9309, 2059, 13, 0],[40, 5465, 11065, 379, 9309, 2059, 3360, 13]])

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
print(model(ids1).logits)
print(model(ids2).logits)
print(model(all_ids).logits)

tensor([[ 2.3128, -2.0399]], grad_fn=<AddmmBackward0>)
tensor([[ 2.7803, -2.3437]], grad_fn=<AddmmBackward0>)
tensor([[ 2.3128, -2.0399],
        [ 2.7803, -2.3437]], grad_fn=<AddmmBackward0>)


In [69]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
output1  = model(ids1)
output2 = model(ids2)
print(output1.logits)
print(output2.logits)

tensor([[ 2.3128, -2.0399]], grad_fn=<AddmmBackward0>)
tensor([[ 2.7803, -2.3437]], grad_fn=<AddmmBackward0>)


## in the attention mask, 1 marks tokens the model should attend to and 0 marks tokens (the padding) it should ignore

In [72]:
all_ids = torch.tensor([[40, 1842, 11065, 379, 9309, 2059, 13, 0],[40, 5465, 11065, 379, 9309, 2059, 3360, 13]])
attention_mask = torch.tensor(
    [[1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1]]
)

## now passing the attention mask lets the model ignore the padding, so the padded sentence should give the same output as the same sentence without padding

In [73]:
output = model(all_ids, attention_mask=attention_mask)
print(output.logits)

tensor([[ 2.2226, -1.8951],
        [ 2.7803, -2.3437]], grad_fn=<AddmmBackward0>)
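
## a minimal plain-python sketch of why the mask helps, assuming the usual trick of pushing masked attention scores to minus infinity before the softmax (real implementations differ in detail). the scores are made up for illustration

```python
import math

def attention_weights(scores, mask):
    # masked positions get -inf, so exp() turns them into exactly 0 weight
    kept = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(kept)
    exps = [math.exp(s - top) for s in kept]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]          # made-up attention scores for three real tokens
padded_scores = scores + [3.0]    # same tokens plus one padding position

unmasked = attention_weights(padded_scores, [1, 1, 1, 1])   # padding takes weight
masked = attention_weights(padded_scores, [1, 1, 1, 0])     # padding ignored
reference = attention_weights(scores, [1, 1, 1])            # no padding at all

print(unmasked)
print(masked)
print(reference)
```

## with the mask, the weights on the real tokens match the unpadded case and the padding position gets zero weight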


## this is all done behind the scenes by the AutoTokenizer when you pass padding=True

# Handling multiple sequences

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


In [3]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

In [4]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

In [5]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


In [6]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)


### modify padding technique

In [8]:
tokenizer(sequence)

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [12]:
tokenizer(sequence, padding=True)

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
sequences =["my dog is called fido", "my cat is called something really cool like Michael"]

In [13]:
tokenizer(sequences, padding=True)

{'input_ids': [[101, 2026, 3899, 2003, 2170, 10882, 3527, 102, 0, 0, 0], [101, 2026, 4937, 2003, 2170, 2242, 2428, 4658, 2066, 2745, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

### padding="max_length" pads every sequence to 512 tokens, which is the max length of this model (same as bert)

In [14]:
tokenizer(sequences, padding="max_length" )

{'input_ids': [[101, 2026, 3899, 2003, 2170, 10882, 3527, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [15]:
tokenizer(sequences, padding="longest" )

{'input_ids': [[101, 2026, 3899, 2003, 2170, 10882, 3527, 102, 0, 0, 0], [101, 2026, 4937, 2003, 2170, 2242, 2428, 4658, 2066, 2745, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [17]:
tokenizer.padding_side

'right'

In [18]:
#tokenizer.padding_side = "left"

tokenizer(sequences, padding=True)


{'input_ids': [[101, 2026, 3899, 2003, 2170, 10882, 3527, 102, 0, 0, 0], [101, 2026, 4937, 2003, 2170, 2242, 2428, 4658, 2066, 2745, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

# Putting it all together - create own custom pipeline

In [19]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [20]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [28]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

In [22]:
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

## truncation - attention over long sequences is expensive, and most models fix a maximum length in the pretraining phase - so we often truncate when using pipelines

In [23]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
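
## a rough plain-python sketch of what truncation with max_length does to one sequence, assuming tokens are cut from the right and the two special tokens ([CLS]=101, [SEP]=102, as seen in the outputs above) are kept. encode_with_truncation is a made-up helper, and the real tokenizer supports other truncation strategies

```python
CLS_ID, SEP_ID = 101, 102   # special-token ids seen in the tokenizer outputs

def encode_with_truncation(token_ids, max_length):
    # keep room for the two special tokens, then cut the rest from the right
    kept = token_ids[: max_length - 2]
    return [CLS_ID] + kept + [SEP_ID]

ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

short = encode_with_truncation(ids, max_length=8)
full = encode_with_truncation(ids, max_length=512)
print(short)
print(len(full))
```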

In [36]:
sequence = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"] * 1000

In [40]:
inputs = tokenizer(sequence, return_tensors="pt", truncation=True, padding=True)

### inputs are now cut down to at most the maximum size the model allows - needed when the text is too long
### truncation usually works for classification bc the most important content tends to be at the beginning, but can be bad for qa bc the answer may be at the end of the text
### summarization depends on the case - sometimes need to do clever things like split the text into pieces, truncate those pieces, then aggregate the results

In [41]:
inputs.input_ids.size()

torch.Size([2000, 16])
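
### one way to sketch the "split the text into pieces" idea in plain python is overlapping windows over the token ids. split_into_windows, the window size, and the stride are all made up for illustration - how to aggregate the per-piece results depends on the task and is not shown

```python
def split_into_windows(token_ids, window, stride):
    # consecutive windows start `stride` tokens apart, so they overlap by window - stride
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break                     # the last window already reaches the end
    return windows

token_ids = list(range(10))           # stand-in for a long tokenized text
chunks = split_into_windows(token_ids, window=4, stride=3)
print(chunks)
```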

In [24]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

In [25]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [26]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


In [27]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Chapter 3 - Pytorch
#### how to prepare large dataset from hub - github.com/huggingface/datasets
#### train first model using trainer API to fine-tune a model
#### how to use a custom training loop
#### how to leverage the accelerate library to easily run that custom training loop on any distributed setup

In [46]:
from datasets import load_dataset

# the mrpc dataset (part of the glue benchmark) contains pairs of sentences labeled as paraphrases or not
# returns a DatasetDict - each split is indexed by its name and lists its columns (sentence1, sentence2, label, idx) and its number of rows
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

Reusing dataset glue (C:\Users\creeg\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 919.00it/s]


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

### slice of dataset - returns a dict with one list per column, here for the first 5 rows

In [47]:

raw_datasets["train"][:5]

{'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
  "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
  'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
  'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
  'The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
  "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
  "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
  'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at 

### features gives us more information about columns - the correspondence between integers and names for labels, 0 means not equivalent and 1 means equivalent

In [48]:
raw_datasets["train"].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}

### to process elements in our dataset, we need to tokenize them

In [52]:
from transformers import AutoTokenizer 

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


# send both sentences to the tokenizer with a few keyword arguments

def tokenize_function(example):
    return tokenizer(
        example["sentence1"], example["sentence2"], padding="max_length", truncation=True, max_length=128
    )

tokenized_datasets = raw_datasets.map(tokenize_function)
print(tokenized_datasets.column_names)

Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-62a09ac0c99ae1b1.arrow
100%|██████████| 408/408 [00:00<00:00, 3923.03ex/s]
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-d715dbfd8732afb3.arrow


{'train': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'], 'validation': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'], 'test': ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask']}
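To make the effect of `padding="max_length"`, `truncation=True` and `max_length=128` concrete, here is a minimal plain-Python sketch. The helper `pad_or_truncate` and the pad id of 0 are assumptions for illustration only; the real tokenizer also handles special tokens and sentence pairs, which this sketch ignores.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 = real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill up to max_length
    mask += [0] * n_pad                  # 0 = padding position
    return ids, mask

# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up the same length, which is simple but wasteful when most sentences are much shorter than `max_length`.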


### to speed up processing and take advantage of the tokenizer being backed by Rust, process several elements at the same time using the batched=True argument of map
### can also use multiprocessing with map (the num_proc argument)

In [53]:
from transformers import AutoTokenizer 

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(
        examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True, max_length=128
    )

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

100%|██████████| 4/4 [00:00<00:00, 19.28ba/s]
100%|██████████| 1/1 [00:00<00:00, 37.03ba/s]
100%|██████████| 2/2 [00:00<00:00, 19.61ba/s]
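The difference between row-wise and batched `map` is the shape of what the function receives: one dict per example versus one dict of lists per batch. A rough plain-Python sketch of that idea follows. The helper `map_batched`, the toy `rows` data and the batch size of 2 are hypothetical, and this is not how `datasets` is implemented internally.

```python
def map_batched(rows, fn, batch_size=2):
    """Apply fn to column-wise batches and return new rows (input is untouched)."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)  # fn sees several examples at once
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_cols.items()})
            out.append(merged)
    return out

def count_words(batch):
    # Receives lists, returns lists of the same length.
    return {"n_words": [len(s.split()) for s in batch["sentence1"]]}

rows = [{"sentence1": "a b c"}, {"sentence1": "d e"}, {"sentence1": "f"}]
result = map_batched(rows, count_words)
print(result)
```

This is why the batched `tokenize_function` can pass `examples["sentence1"]` straight to the tokenizer: it is already a list of strings.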


### can remove columns we don't need anymore with the remove_columns method

In [54]:
tokenized_datasets = tokenized_datasets.remove_columns(["idx", "sentence1", "sentence2"])

# rename label to labels
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# can set the output format to torch/tensorflow/numpy
tokenized_datasets = tokenized_datasets.with_format("torch")
tokenized_datasets["train"]

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
    num_rows: 3668
})

### can generate a short sample of a dataset using the select method

In [56]:
small_train_dataset = tokenized_datasets["train"].select(range(100))
small_train_dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
    num_rows: 100
})

In [57]:
%%capture
!pip install datasets transformers[sentencepiece]

In [None]:
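import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated; the PyTorch optimizer is the usual replacement
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before - the model with a text classification head added on top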
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
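The last three lines of the cell above are one full optimization step: compute the loss, backpropagate, update the weights. Below is a toy sketch of the same three stages on a one-parameter model. It uses plain gradient descent instead of AdamW and a hand-derived gradient instead of autograd; the data, the learning rate and all names are made up for illustration.

```python
# Toy model: prediction = w * x, loss = (w * x - y) ** 2
w = 0.0
x, y = 2.0, 4.0          # one made-up training example (the ideal w is 2.0)
learning_rate = 0.1

def loss_fn(weight):
    return (weight * x - y) ** 2

loss_before = loss_fn(w)            # forward pass, the role of model(**batch).loss
grad = 2 * (w * x - y) * x          # d(loss)/dw, the role of loss.backward()
w = w - learning_rate * grad        # parameter update, the role of optimizer.step()
loss_after = loss_fn(w)

print(loss_before, loss_after)
```

A single step moves `w` toward the ideal value and lowers the loss; training repeats this over many batches.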


## a DatasetDict is a dictionary where the keys are strings that correspond to the splits, and the values are Dataset objects

In [58]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets


Reusing dataset glue (C:\Users\creeg\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 993.28it/s]


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

### label 1 indicates second sentence is actually a paraphrase

In [62]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

### gives column name and data type of that column

In [76]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}

In [66]:
raw_datasets.keys()

dict_keys(['train', 'validation', 'test'])

In [67]:
raw_datasets.values()

dict_values([Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
}), Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 408
}), Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 1725
})])

### tells me that the second sentence is a paraphrase of the first

In [77]:
raw_train_dataset.features["label"].int2str(1)

'equivalent'

### label 0 means the second sentence is not a paraphrase of the first

In [78]:
raw_train_dataset.features["label"].int2str(0)

'not_equivalent'
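Conceptually, the label feature is a fixed list of names where the integer label is the index into that list. A plain-Python sketch of that idea follows. The class `SimpleClassLabel` is hypothetical and only mimics the `int2str` and `str2int` lookups; it is not the `datasets` implementation.

```python
class SimpleClassLabel:
    """Hypothetical stand-in for the integer <-> name mapping of a label column."""

    def __init__(self, names):
        self.names = list(names)
        self._index = {name: i for i, name in enumerate(self.names)}

    def int2str(self, value):
        # The integer label is simply a position in the names list.
        return self.names[value]

    def str2int(self, name):
        return self._index[name]

label = SimpleClassLabel(["not_equivalent", "equivalent"])
print(label.int2str(1))
print(label.str2int("not_equivalent"))
```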

In [69]:
print(raw_train_dataset.info)

DatasetInfo(description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n', citation='@inproceedings{dolan2005automatically,\n  title={Automatically constructing a corpus of sentential paraphrases},\n  author={Dolan, William B and Brockett, Chris},\n  booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n  year={2005}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n', homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398', license='', features={'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', 

### for private data, if you can't use the Hugging Face Hub

In [79]:
import pandas as pd

#### load data locally - and use datasets functionality

In [82]:
df = pd.DataFrame({"text": ["Hello world!", "g day!"], "label": [1, 1]})
df

           text  label
0  Hello world!      1
1        g day!      1


#### and create your own dataset locally

In [84]:
from datasets import Dataset

dset = Dataset.from_pandas(df)
dset


Dataset({
    features: ['text', 'label'],
    num_rows: 2
})

#### slicing the whole dataset object returns its contents as a dict of columns

In [85]:
dset[:]

{'text': ['Hello world!', 'g day!'], 'label': [1, 1]}

## tokenize the sentence columns into input ids for the first and second sentences - the ids go into the embedding layer, which turns them into vectors for the rest of the model; the model then outputs logits

In [63]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 26.4kB/s]
Downloading: 100%|██████████| 570/570 [00:00<00:00, 792kB/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 3.68MB/s]
Downloading: 100%|██████████| 455k/455k [00:00<00:00, 5.79MB/s]


In [64]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [65]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
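The output above shows how a BERT-style tokenizer lays out a sentence pair: `[CLS] first [SEP] second [SEP]`, with `token_type_ids` of 0 for the first segment and 1 for the second. A plain-Python sketch that rebuilds the same layout from already-split tokens follows. `encode_pair` is a hypothetical helper; splitting into word pieces and looking up ids are skipped.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out two token lists the way the output above does."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    second = tokens_b + ["[SEP]"]
    tokens = first + second
    # Segment 0 covers [CLS], sentence one and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(tokens)  # no padding, so every position is real
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = encode_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
print(tokens)
print(token_type_ids)
```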

In [70]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

## this is how you can tokenize an entire dataset - define a function that operates on a row (or a batch of rows) of the dataset, then apply it with map

In [71]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

## applies the function to every split in the dataset

In [72]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

100%|██████████| 4/4 [00:00<00:00, 23.52ba/s]
100%|██████████| 1/1 [00:00<00:00, 43.46ba/s]
100%|██████████| 2/2 [00:00<00:00, 26.14ba/s]


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})

## simple example of adding column using a function

In [86]:
def add_column(row):
    # has to return a dictionary where the key is the column name and the value is what you want in that row
    return {"new_column": "hello!"}

### when you apply map, the column returned by the function is added to a new dataset
### this operation IS NOT IN PLACE
### if you want to keep that column, you need to assign the result to a new variable

In [88]:

raw_train_dataset.map(add_column)
dataset_with_column = raw_train_dataset.map(add_column)

dataset_with_column[0]

Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-36e4d6bb26ef4768.arrow
Loading cached processed dataset at C:\Users\creeg\.cache\huggingface\datasets\glue\mrpc\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-36e4d6bb26ef4768.arrow


{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
 'new_column': 'hello!'}
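The reason the result has to be assigned is that `map` builds a new dataset instead of modifying the old one. A plain-Python analogy follows, with lists of dicts standing in for datasets. `map_rows` is a hypothetical helper, not the library's code.

```python
def map_rows(rows, fn):
    """Return new rows with fn's output merged in; the input rows are not modified."""
    return [{**row, **fn(row)} for row in rows]

def add_column(row):
    return {"new_column": "hello!"}

original = [{"idx": 0, "label": 1}, {"idx": 1, "label": 0}]

map_rows(original, add_column)                 # result discarded, nothing changes
with_column = map_rows(original, add_column)   # result kept under a new name

print(original[0])
print(with_column[0])
```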

### tokenize raw datasets and now have tokenized datasets object

In [89]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True, padding=True)

In [90]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

100%|██████████| 4/4 [00:00<00:00, 17.79ba/s]
100%|██████████| 1/1 [00:00<00:00, 33.86ba/s]
100%|██████████| 2/2 [00:00<00:00, 16.68ba/s]


DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 408
    })
    test: Dataset({
        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
        num_rows: 1725
    })
})

### attention mask tells the model which positions to attend to - 1s are real tokens, 0s are padding to ignore

In [92]:
tokenized_datasets["train"][0]

{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'idx': 0,
 'input_ids': [101,
  2572,
  3217,
  5831,
  5496,
  2010,
  2567,
  1010,
  3183,
  2002,
  2170,
  1000,
  1996,
  7409,
  1000,
  1010,
  1997,
  9969,
  4487,
  23809,
  3436,
  2010,
  3350,
  1012,
  102,
  7727,
  2000,
  2032,
  2004,
  2069,
  1000,
  1996,
  7409,
  1000,
  1010,
  2572,
  3217,
  5831,
  5496,
  2010,
  2567,
  1997,
  9969,
  4487,
  23809,
  3436,
  2010,
  3350,
  1012,
  102,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0

In [73]:
from transformers import DataCollatorWithPadding 

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [74]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

In [75]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'labels': torch.Size([8])}
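What the collator does here is dynamic padding: each batch is padded only to the length of its longest sample (67 in the batch above), not to a global maximum. A plain-Python sketch of that idea follows. `pad_batch` is a hypothetical helper with an assumed pad id of 0; the real collator also pads `token_type_ids` and returns tensors.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad every sequence to the longest one in this batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Dummy sequences with the same lengths as the eight samples above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
padded = pad_batch([[1] * n for n in lengths])
print({k: (len(v), len(v[0])) for k, v in padded.items()})
```

A batch made only of short sentences would be padded to a much smaller width, which is where the savings over fixed-length padding come from.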

# More coming shortly

## Summarization Model

In [60]:
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
... A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
... Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
... In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
... Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
... 2010 marriage license application, according to court documents.
... Prosecutors said the marriages were part of an immigration scam.
... On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
... After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
... Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
... All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
... Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
... Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
... The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
... Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
... Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
... If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
... """

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [61]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
     inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
 )

print(tokenizer.decode(outputs[0]))

<pad> liana barrientos, 39, pleaded not guilty to two counts of "offering a false instrument for filing in the first degree" the marriages were part of an immigration scam, prosecutors say.</s>
