# `transformers_overview` notebook
### Description
This notebook explores basic language-model concepts from deep learning. Models from `transformers` are used to demonstrate a basic pipeline of preprocessing and applying a model to real data. Model architectures are also shown, together with example inputs and outputs, across a series of experiments.

In [18]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("Незабываемо провели выходные!")  # Russian: "We had an unforgettable weekend!"

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9639915227890015}]

By default, this pipeline selects a pretrained model that has been **fine-tuned for sentiment analysis in English**, so its score for the Russian input above should be read with caution. The model is downloaded and cached when you create the classifier object.

A pipeline chains three stages: preprocessing -> model -> postprocessing.
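
The three stages can be sketched without any library. This is a toy illustration, not how `transformers` implements `pipeline`: the vocabulary, the `toy_model` scoring rule, and the label names are all made up for the example.

```python
import math

# Hypothetical toy vocabulary and labels, for illustration only.
VOCAB = {"[UNK]": 0, "great": 1, "awful": 2, "weekend": 3}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Preprocessing: split on whitespace and map words to integer ids."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(ids):
    """Model: return one raw score (logit) per label. A stand-in for a real network."""
    positive = float(ids.count(VOCAB["great"]))
    negative = float(ids.count(VOCAB["awful"]))
    return [negative, positive]


def postprocess(logits):
    """Postprocessing: softmax the logits and report the best label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


result = postprocess(toy_model(preprocess("great weekend")))
print(result["label"])
```

The output has the same shape as the `label` / `score` dictionary returned by the sentiment pipeline above, which is the point of the postprocessing step.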

#### Pipelines:
* feature-extraction (get the vector representation of a text)
* fill-mask
* ner (named entity recognition)
* question-answering
* sentiment-analysis
* summarization
* text-generation
* translation
* zero-shot-classification

In [26]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to build and deploy applications from scratch using Windows 10 SDK. It will be a great read.\n\n'},
 {'generated_text': 'In this course, we will teach you how to be a better woman. You will learn how to succeed in a workplace, how to be happy for'}]

> Models can also be tried **online**, directly on the model's page on the Hugging Face Hub.

In [31]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

---

## Transformers overview

The Transformer architecture was introduced in June 2017. The focus of the **original research was on translation tasks**. 



In [1]:
%cd ..

/home/pristalovya/Документы/nlp-coursework


In [4]:
from datasets_ import DatasetLoader

from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
%matplotlib notebook

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from tqdm import tqdm
import numpy as np

import sys

from nltk import WhitespaceTokenizer

from IPython.display import clear_output

from transformers import (
    pipeline,                       
    AutoModelForSequenceClassification,                       
    BertForSequenceClassification,                       
    AutoTokenizer,
    AdamW,
)

In [5]:
train, test = DatasetLoader.load_reviews_Review_Label_dataset(train_test_split=True,
                                                              classnames_to_int=True,
                                                              remove_neutral_class=True,
                                                              show_path=True,)
train.label[train['label'] == 2] = 1
test.label[test['label'] == 2] = 1

print(train.shape, test.shape)

/home/pristalovya/Документы/nlp-coursework/data/reviews_Review_Label/reviews_Review_Label.csv
(55346, 2) (23721, 2)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train.label[train['label'] == 2] = 1
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test.label[test['label'] == 2] = 1
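
The `SettingWithCopyWarning` above is raised by chained indexing of the form `train.label[mask] = 1`. A minimal sketch of the single-step `.loc` form follows, on a small made-up frame standing in for `train` (only the column names match the notebook). The explicit `.copy()` is an assumption about how the split frames were produced; it may not be needed.

```python
import pandas as pd

# Made-up stand-in for the `train` frame; only the column names match the notebook.
frame = pd.DataFrame({"review": ["a", "b", "c"], "label": [0, 2, 2]})

# Take an explicit copy so the assignment targets an independent frame.
frame = frame.copy()

# Select rows and column in one `.loc` call instead of chaining two indexers.
frame.loc[frame["label"] == 2, "label"] = 1

print(frame["label"].tolist())
```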


In [5]:
checkpoint = 'blanchefort/rubert-base-cased-sentiment-rusentiment'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [6]:
test

| | review | label |
|---|---|---|
| 25749 | Большое количество фильмов советского кинемато... | 1 |
| 44489 | Тяжело ответить на вопрос, что же такое Догвил... | 1 |
| 53162 | В наше время такие героини, как скажем наприме... | 0 |
| 25843 | В 2001 году нам довелось познакомиться с новой... | 1 |
| 44609 | «Это фильм?», «У них не хватило денег на декор... | 1 |
| ... | ... | ... |
| 14104 | - Через столько лет?\r\n- Всегда\r\n\r\nБезусл... | 1 |
| 22232 | После просмотра трейлера, я был под большим вп... | 1 |
| 73314 | Многие не верят, но я легко подключаюсь к прои... | 1 |
| 47848 | Как часто нам нужна поддержка? Да, пожалуй, оч... | 1 |


In [5]:
text = test.review.iloc[0]
text[:100]

'Большое количество фильмов советского кинематографа посвящено второй мировой войне. В них поднимаютс'

In [6]:
tokenizer.all_special_tokens

['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

In [20]:
tokens = tokenizer.tokenize(text)
tokens

['Большое',
 'количество',
 'фильмов',
 'советского',
 'кинематографа',
 'посвящено',
 'второй',
 'мировой',
 'войне',
 '.',
 'В',
 'них',
 'поднимаются',
 'разные',
 'темы',
 'того',
 'времени',
 ':',
 'и',
 'любовь',
 'к',
 'родине',
 ';',
 'и',
 'гражданский',
 'долг',
 ';',
 'и',
 'ужасы',
 'оккупации',
 ';',
 'и',
 'потеря',
 'близких',
 ',',
 'любимых',
 ';',
 'и',
 'предательство',
 ',',
 'дезерти',
 '##рс',
 '##тво',
 ',',
 'шпионаж',
 'и',
 'т',
 '.',
 'д',
 '.',
 ',',
 'и',
 'т',
 '.',
 'п',
 '.',
 'На',
 'фоне',
 'этого',
 'много',
 '##образия',
 ',',
 'картина',
 '«',
 'Лет',
 '##ят',
 'журав',
 '##ли',
 '»',
 'выделяется',
 'своей',
 'непосредственно',
 '##стью',
 'и',
 'авангард',
 '##ностью',
 'в',
 'плане',
 'отражения',
 'и',
 'передачи',
 'незыблем',
 '##ых',
 'истин',
 'человеческого',
 'существования',
 '.',
 'Фильм',
 '«',
 'Лет',
 '##ят',
 'журав',
 '##ли',
 '»',
 'повествует',
 'о',
 'судьбе',
 'отдельно',
 'взятой',
 'семьи',
 'и',
 'ее',
 'окружения',
 'в',
 'г
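
In the output above, pieces that start with `##` continue the preceding token (WordPiece convention). A small sketch of joining them back into words; `merge_wordpieces` is a helper written for this note, not a `transformers` function.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens: a piece starting with '##' continues the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# Pieces copied from the output above.
pieces = ["дезерти", "##рс", "##тво", ",", "шпионаж"]
print(merge_wordpieces(pieces))
```

For real use, the tokenizer's own `convert_tokens_to_string` is probably the better choice, since it also handles spacing rules.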

In [23]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[24890,
 2965,
 9777,
 14180,
 37811,
 53124,
 2981,
 7212,
 7421,
 132,
 436,
 1340,
 58987,
 2885,
 7341,
 1226,
 1846,
 156,
 322,
 7136,
 344,
 26815,
 158,
 322,
 36194,
 12919,
 158,
 322,
 34836,
 21837,
 158,
 322,
 4874,
 16735,
 128,
 21114,
 158,
 322,
 37518,
 128,
 88210,
 25502,
 5433,
 128,
 84002,
 322,
 332,
 132,
 348,
 132,
 128,
 322,
 332,
 132,
 350,
 132,
 992,
 7061,
 1250,
 1278,
 79124,
 128,
 11605,
 304,
 16475,
 990,
 62578,
 800,
 326,
 33490,
 1981,
 10841,
 1439,
 322,
 35269,
 5341,
 340,
 5407,
 45127,
 322,
 12270,
 67659,
 2558,
 21785,
 19472,
 11565,
 132,
 16237,
 304,
 16475,
 990,
 62578,
 800,
 326,
 40055,
 292,
 30412,
 8014,
 57168,
 6080,
 322,
 1378,
 29658,
 340,
 4314,
 2981,
 7212,
 3235,
 156,
 340,
 6339,
 444,
 3388,
 2639,
 13971,
 1526,
 22415,
 322,
 24388,
 1464,
 128,
 63249,
 936,
 322,
 67693,
 128,
 51707,
 11137,
 26005,
 95050,
 51784,
 322,
 99063,
 83983,
 72403,
 79263,
 35771,
 11113,
 130,
 865,
 1404,
 132,
 27024,
 1
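
Conceptually, `convert_tokens_to_ids` is a dictionary lookup with a fallback to the unknown token. A sketch with a three-entry vocabulary whose ids are copied from the outputs above; the id used for `[UNK]` is an assumption, and the last input token is a made-up string that is not in this toy vocabulary.

```python
# Token ids copied from the outputs above; the id of "[UNK]" is an assumption.
vocab = {"[UNK]": 100, "Большое": 24890, "количество": 2965, "фильмов": 9777}
inverse_vocab = {idx: token for token, idx in vocab.items()}


def tokens_to_ids(tokens):
    """Look each token up in the vocabulary, falling back to the unknown token."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


def ids_to_tokens(ids):
    """Invert the lookup."""
    return [inverse_vocab[idx] for idx in ids]


ids = tokens_to_ids(["Большое", "количество", "фильмов", "нетвсловаре"])
print(ids)
print(ids_to_tokens(ids))
```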

In [26]:
tokenizer.convert_ids_to_tokens(ids)

['Большое',
 'количество',
 'фильмов',
 'советского',
 'кинематографа',
 'посвящено',
 'второй',
 'мировой',
 'войне',
 '.',
 'В',
 'них',
 'поднимаются',
 'разные',
 'темы',
 'того',
 'времени',
 ':',
 'и',
 'любовь',
 'к',
 'родине',
 ';',
 'и',
 'гражданский',
 'долг',
 ';',
 'и',
 'ужасы',
 'оккупации',
 ';',
 'и',
 'потеря',
 'близких',
 ',',
 'любимых',
 ';',
 'и',
 'предательство',
 ',',
 'дезерти',
 '##рс',
 '##тво',
 ',',
 'шпионаж',
 'и',
 'т',
 '.',
 'д',
 '.',
 ',',
 'и',
 'т',
 '.',
 'п',
 '.',
 'На',
 'фоне',
 'этого',
 'много',
 '##образия',
 ',',
 'картина',
 '«',
 'Лет',
 '##ят',
 'журав',
 '##ли',
 '»',
 'выделяется',
 'своей',
 'непосредственно',
 '##стью',
 'и',
 'авангард',
 '##ностью',
 'в',
 'плане',
 'отражения',
 'и',
 'передачи',
 'незыблем',
 '##ых',
 'истин',
 'человеческого',
 'существования',
 '.',
 'Фильм',
 '«',
 'Лет',
 '##ят',
 'журав',
 '##ли',
 '»',
 'повествует',
 'о',
 'судьбе',
 'отдельно',
 'взятой',
 'семьи',
 'и',
 'ее',
 'окружения',
 'в',
 'г

In [43]:
vocab = tokenizer.get_vocab()
print(tokenizer.vocab_size if len(vocab) == tokenizer.vocab_size else None)
vocab

100792


{'господство': 60725,
 '##акашвили': 43953,
 'одноклассника': 88966,
 'asasha': 62307,
 'гипо': 36845,
 'выдаёт': 37709,
 '##линга': 39243,
 'проклятые': 48873,
 'own': 51366,
 'Элемента': 58004,
 'журнале': 16317,
 'Warcraft': 69289,
 'семьи': 6080,
 'Dirty': 38439,
 '##вой': 1143,
 'щит': 21687,
 'контей': 24632,
 'организованного': 53350,
 'Фору': 58759,
 'техническом': 8379,
 'вызыва': 7007,
 '##изво': 9334,
 'переохла': 66670,
 'Экст': 76638,
 'Оправ': 79832,
 '##Jr': 94852,
 'Яцен': 41490,
 'неожиданно': 12274,
 'геометрии': 41423,
 'диаметрально': 71778,
 'видеозахвата': 9042,
 'минусуете': 95250,
 'всеобще': 29543,
 'сплошной': 29832,
 'ведущего': 35574,
 'девчонки': 37693,
 '##Итого': 74654,
 'дошло': 21281,
 '5к': 50728,
 'интернете': 6845,
 'правильных': 36443,
 'чукот': 100653,
 'академию': 20301,
 'De': 10702,
 'дойдут': 64963,
 'бодибил': 92860,
 '##циона': 2140,
 '##любов': 98071,
 'волею': 79698,
 'козлом': 85912,
 '##RT': 47447,
 'выведены': 40526,
 'Кучу': 96923,
 'см

In [33]:
tokenizer.max_model_input_sizes

{'bert-base-uncased': 512,
 'bert-large-uncased': 512,
 'bert-base-cased': 512,
 'bert-large-cased': 512,
 'bert-base-multilingual-uncased': 512,
 'bert-base-multilingual-cased': 512,
 'bert-base-chinese': 512,
 'bert-base-german-cased': 512,
 'bert-large-uncased-whole-word-masking': 512,
 'bert-large-cased-whole-word-masking': 512,
 'bert-large-uncased-whole-word-masking-finetuned-squad': 512,
 'bert-large-cased-whole-word-masking-finetuned-squad': 512,
 'bert-base-cased-finetuned-mrpc': 512,
 'bert-base-german-dbmdz-cased': 512,
 'bert-base-german-dbmdz-uncased': 512,
 'TurkuNLP/bert-base-finnish-cased-v1': 512,
 'TurkuNLP/bert-base-finnish-uncased-v1': 512,
 'wietsedv/bert-base-dutch-cased': 512}

In [34]:
tokenizer.max_len_single_sentence

1000000000000000019884624838654
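
The very large value suggests that no maximum length is configured for this tokenizer. Assuming `transformers` falls back to `int(1e30)` in that case and subtracts the two special tokens (`[CLS]` and `[SEP]`) added to a single sequence, the printed number can be reproduced with plain arithmetic:

```python
# Assumed fallback used when a tokenizer has no configured maximum length.
VERY_LARGE_INTEGER = int(1e30)

# [CLS] and [SEP] are added around a single sequence.
SPECIAL_TOKENS_SINGLE = 2

max_len_single_sentence = VERY_LARGE_INTEGER - SPECIAL_TOKENS_SINGLE
print(max_len_single_sentence)
```

If that reading is right, truncation will not happen on its own for this checkpoint, so an explicit `max_length` (for example 512, the size of the position embedding table) has to be passed to the tokenizer.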

In [36]:
tokenizer.padding_side

'right'
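
With `padding_side == 'right'`, shorter sequences in a batch are filled at the end and the attention mask marks the filler positions with 0. A sketch under two assumptions: the pad id is 0 (consistent with `padding_idx=0` in the embedding layer printed later in the notebook) and `[SEP]` has id 102.

```python
PAD_ID = 0  # assumed id of "[PAD]"


def pad_batch(sequences, side="right"):
    """Pad id lists to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [PAD_ID] * (longest - len(seq))
        mask = [1] * len(seq)
        if side == "right":
            input_ids.append(seq + padding)
            attention_mask.append(mask + [0] * len(padding))
        else:
            input_ids.append(padding + seq)
            attention_mask.append([0] * len(padding) + mask)
    return input_ids, attention_mask


# 101 is "[CLS]"; 102 for "[SEP]" is an assumption.
batch_ids, batch_mask = pad_batch([[101, 24890, 102], [101, 24890, 2965, 9777, 102]])
print(batch_ids)
print(batch_mask)
```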

In [84]:
text_tokenized = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
print(text_tokenized.keys(), end='\n\n')
print(text_tokenized)

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

{'input_ids': tensor([[   101,  24890,   2965,   9777,  14180,  37811,  53124,   2981,   7212,
           7421,    132,    436,   1340,  58987,   2885,   7341,   1226,   1846,
            156,    322,   7136,    344,  26815,    158,    322,  36194,  12919,
            158,    322,  34836,  21837,    158,    322,   4874,  16735,    128,
          21114,    158,    322,  37518,    128,  88210,  25502,   5433,    128,
          84002,    322,    332,    132,    348,    132,    128,    322,    332,
            132,    350,    132,    992,   7061,   1250,   1278,  79124,    128,
          11605,    304,  16475,    990,  62578,    800,    326,  33490,   1981,
          10841,   1439,    322,  35269,   5341,    340,   5407,  45127,    322,
          12270,  67659,   2558,  21785,  19472,  11565,    132,  16237,    304,
          16475,    990,  62578,    800,    326,  40055,    292,  30412,   8014,
          57168,   6080,    322, 
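
The three keys can be understood from a hand-written version of single-sequence encoding: truncate, wrap in special tokens, then build the companion lists. This is a sketch, not the tokenizer's actual code; `encode_single` is a hypothetical helper and the `[SEP]` id is an assumption.

```python
CLS_ID = 101  # "[CLS]", as seen at the start of `input_ids` above
SEP_ID = 102  # assumed id of "[SEP]"; it is not visible in the truncated output


def encode_single(token_ids, max_length):
    """Truncate, wrap in [CLS] ... [SEP], and build the two companion lists."""
    room = max_length - 2  # reserve space for the two special tokens
    body = token_ids[:room]
    input_ids = [CLS_ID] + body + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single segment
        "attention_mask": [1] * len(input_ids),  # no padding yet
    }


encoded = encode_single([24890, 2965, 9777, 14180, 37811], max_length=5)
print(encoded["input_ids"])
```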

In [49]:
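# input_ids has shape [1, seq_len] when return_tensors='pt'; take the first row as a plain list
tokenizer.convert_ids_to_tokens(text_tokenized['input_ids'][0].tolist())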

['[CLS]',
 'Большое',
 'количество',
 'фильмов',
 'советского',
 'кинематографа',
 'посвящено',
 'второй',
 'мировой',
 'войне',
 '.',
 'В',
 'них',
 'поднимаются',
 'разные',
 'темы',
 'того',
 'времени',
 ':',
 'и',
 'любовь',
 'к',
 'родине',
 ';',
 'и',
 'гражданский',
 'долг',
 ';',
 'и',
 'ужасы',
 'оккупации',
 ';',
 'и',
 'потеря',
 'близких',
 ',',
 'любимых',
 ';',
 'и',
 'предательство',
 ',',
 'дезерти',
 '##рс',
 '##тво',
 ',',
 'шпионаж',
 'и',
 'т',
 '.',
 'д',
 '.',
 ',',
 'и',
 'т',
 '.',
 'п',
 '.',
 'На',
 'фоне',
 'этого',
 'много',
 '##образия',
 ',',
 'картина',
 '«',
 'Лет',
 '##ят',
 'журав',
 '##ли',
 '»',
 'выделяется',
 'своей',
 'непосредственно',
 '##стью',
 'и',
 'авангард',
 '##ностью',
 'в',
 'плане',
 'отражения',
 'и',
 'передачи',
 'незыблем',
 '##ых',
 'истин',
 'человеческого',
 'существования',
 '.',
 'Фильм',
 '«',
 'Лет',
 '##ят',
 'журав',
 '##ли',
 '»',
 'повествует',
 'о',
 'судьбе',
 'отдельно',
 'взятой',
 'семьи',
 'и',
 'ее',
 'окружения',

In [109]:
tokenizer.encode(text)

[101,
 24890,
 2965,
 9777,
 14180,
 37811,
 53124,
 2981,
 7212,
 7421,
 132,
 436,
 1340,
 58987,
 2885,
 7341,
 1226,
 1846,
 156,
 322,
 7136,
 344,
 26815,
 158,
 322,
 36194,
 12919,
 158,
 322,
 34836,
 21837,
 158,
 322,
 4874,
 16735,
 128,
 21114,
 158,
 322,
 37518,
 128,
 88210,
 25502,
 5433,
 128,
 84002,
 322,
 332,
 132,
 348,
 132,
 128,
 322,
 332,
 132,
 350,
 132,
 992,
 7061,
 1250,
 1278,
 79124,
 128,
 11605,
 304,
 16475,
 990,
 62578,
 800,
 326,
 33490,
 1981,
 10841,
 1439,
 322,
 35269,
 5341,
 340,
 5407,
 45127,
 322,
 12270,
 67659,
 2558,
 21785,
 19472,
 11565,
 132,
 16237,
 304,
 16475,
 990,
 62578,
 800,
 326,
 40055,
 292,
 30412,
 8014,
 57168,
 6080,
 322,
 1378,
 29658,
 340,
 4314,
 2981,
 7212,
 3235,
 156,
 340,
 6339,
 444,
 3388,
 2639,
 13971,
 1526,
 22415,
 322,
 24388,
 1464,
 128,
 63249,
 936,
 322,
 67693,
 128,
 51707,
 11137,
 26005,
 95050,
 51784,
 322,
 99063,
 83983,
 72403,
 79263,
 35771,
 11113,
 130,
 865,
 1404,
 132,
 270

In [106]:
tokenizer.encode_plus(text) == tokenizer(text)

True

In [55]:
model.bert

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(119547, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
 

In [56]:
model.classifier

Linear(in_features=768, out_features=3, bias=True)
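
The printed shapes are enough to count parameters for the visible layers. The arithmetic below uses only numbers from the printouts above. Note that the word embedding table has 119547 rows, while the tokenizer reported a vocabulary of 100792 entries.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def embedding_params(num_embeddings, dim):
    """One `dim`-sized vector per table row."""
    return num_embeddings * dim


hidden = 768
classifier_params = linear_params(hidden, 3)        # Linear(768 -> 3)
attention_proj_params = linear_params(hidden, hidden)  # each of query / key / value
word_embedding_params = embedding_params(119547, hidden)
position_embedding_params = embedding_params(512, hidden)

print(classifier_params, attention_proj_params,
      word_embedding_params, position_embedding_params)
```

The classification head is tiny compared with the encoder: almost all parameters sit in the embeddings and the 12 `BertLayer` blocks.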

In [71]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [92]:
model(**text_tokenized)

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.0848, -0.1631, -1.7853]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
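
The model returns raw logits, not probabilities. A softmax turns them into a distribution over the three classes; the sketch below applies it to the logits printed above using only the standard library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0848, -0.1631, -1.7853]  # copied from the output above
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

Because softmax preserves ordering, the predicted class is the same whether `argmax` is taken over logits or over probabilities.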

In [113]:
def predict(texts):
    """Return the predicted class id for each input text."""
    model.eval()  # make sure dropout is disabled for inference
    with torch.no_grad():
        inputs = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
        outputs = model(**inputs)
        # softmax preserves ordering, so argmax over the logits gives the same classes
        predicted = torch.argmax(outputs.logits, dim=1).numpy()
    return predicted

pred = predict(test.review.to_list()[:100])

In [116]:
print(classification_report(test.label[:100], pred))

              precision    recall  f1-score   support

           0       0.16      0.73      0.26        11
           1       0.00      0.00      0.00         0
           2       0.40      0.02      0.04        89

    accuracy                           0.10       100
   macro avg       0.19      0.25      0.10       100
weighted avg       0.37      0.10      0.07       100



  _warn_prf(average, modifier, msg_start, len(result))
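
The low accuracy may come from a label-convention mismatch rather than from a weak model: the report lists true labels 0 and 2 (support 11 and 89), while the checkpoint has three output classes with its own ordering. A hedged sketch of remapping before scoring follows. `MODEL_TO_DATASET` is a hypothetical mapping and the data is made up; the actual class order should be read from `model.config.id2label`.

```python
# Hypothetical mapping from model class id to dataset label; verify against
# `model.config.id2label` before using anything like this.
MODEL_TO_DATASET = {0: None, 1: 1, 2: 0}  # None = no counterpart (e.g. a neutral class)


def remap(predictions):
    """Translate model class ids into the dataset's label convention."""
    return [MODEL_TO_DATASET[p] for p in predictions]


def accuracy(y_true, y_pred):
    """Share of exact matches; unmapped predictions (None) count as errors."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


y_true = [1, 1, 0, 1]    # made-up dataset labels
raw_pred = [1, 1, 2, 0]  # made-up model outputs
print(accuracy(y_true, raw_pred), accuracy(y_true, remap(raw_pred)))
```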
