# 🤗 Transformers Episode 2 - Under the hood of pipeline

[twitch.tv/dataista0](http://twitch.tv/dataista0)


* [List of models](https://huggingface.co/transformers/pretrained_models.html)
* [Quicktour](https://huggingface.co/transformers/quicktour.html)

In [2]:
import numpy as np
import pandas as pd

from nltk.corpus import twitter_samples
from sklearn.metrics import accuracy_score
from transformers import pipeline
from transformers import set_seed

In [3]:
m1 = pipeline('sentiment-analysis', framework="pt", model="distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [4]:
m1("I am very very sad")

[{'label': 'LABEL_0', 'score': 0.5455528497695923}]

In [5]:
m1("I am very very happy")

[{'label': 'LABEL_0', 'score': 0.5358377695083618}]

In [6]:
m0 = pipeline('sentiment-analysis', framework="pt")

In [14]:
import transformers

In [15]:
transformers.__file__

'/home/dataista/anaconda3/envs/transformers/lib/python3.8/site-packages/transformers/__init__.py'

In [12]:
pipeline

<function transformers.pipelines.pipeline(task: str, model: Optional = None, config: Union[str, transformers.configuration_utils.PretrainedConfig, NoneType] = None, tokenizer: Union[str, transformers.tokenization_utils.PreTrainedTokenizer, NoneType] = None, feature_extractor: Union[str, ForwardRef('SequenceFeatureExtractor'), NoneType] = None, framework: Union[str, NoneType] = None, revision: Union[str, NoneType] = None, use_fast: bool = True, use_auth_token: Union[str, bool, NoneType] = None, model_kwargs: Dict[str, Any] = {'use_auth_token': None}, **kwargs) -> transformers.pipelines.base.Pipeline>

In [9]:
# m0.<TAB>  (tab completion, used to browse the pipeline's attributes)

In [8]:
m0.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [16]:
m3 = pipeline("sentiment-analysis", framework="pt", model="distilbert-base-uncased-finetuned-sst-2-english")

In [17]:
m3("I am very very sad")

[{'label': 'NEGATIVE', 'score': 0.9995039105415344}]

In [18]:
m3("I am very very happy")

[{'label': 'POSITIVE', 'score': 0.9998809099197388}]

In [19]:
m3("I am partially happy")

[{'label': 'POSITIVE', 'score': 0.9998473525047302}]

The model knows the labels POSITIVE and NEGATIVE because it was fine-tuned on `SST-2`, a sentiment analysis dataset.
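
As a rough illustration of where label names come from: a checkpoint's configuration maps each output index to a label string, and the pipeline reports the label of the highest-scoring index. The sketch below imitates that lookup in plain Python. Both dictionaries and the `label_for` helper are written by hand here as assumptions; nothing is read from a real checkpoint.

```python
# Assumed mapping, mirroring what a checkpoint fine-tuned on SST-2 is expected to store.
sst2_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# A checkpoint without a task-specific mapping falls back to generic names.
generic_id2label = {i: f"LABEL_{i}" for i in range(2)}

def label_for(logits, id2label):
    """Return the label of the highest logit (hypothetical helper)."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best_index]

named = label_for([-4.0, 4.3], sst2_id2label)
generic = label_for([0.1, -0.1], generic_id2label)
print(named, generic)
```

This would explain why `m1`, built from the plain `distilbert-base-uncased` checkpoint, only reports `LABEL_0`, while `m3` reports `NEGATIVE` and `POSITIVE`.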

# AutoModel and AutoTokenizer

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [7]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [22]:
classifier("I am very very very very happy")

[{'label': '5 stars', 'score': 0.8320968151092529}]

In [23]:
classifier("Mediocre")

[{'label': '2 stars', 'score': 0.5637515783309937}]

In [24]:
classifier("Mediocre at most")

[{'label': '2 stars', 'score': 0.586713433265686}]

In [25]:
classifier("El peor hotel que conocí en mi vida")

[{'label': '1 star', 'score': 0.8148958086967468}]

In [26]:
classifier("Estoy muy muy feliz")

[{'label': '5 stars', 'score': 0.7797722220420837}]

In [28]:
classifier("Estoy muy muy feliz de que la película haya terminado. La peor en décadas.")

[{'label': '5 stars', 'score': 0.5999898314476013}]

### Tokenizer

Further reading on the tokenizer:
* [Preprocessing](https://huggingface.co/transformers/preprocessing.html)
* [Summary of the tokenizer](https://huggingface.co/transformers/tokenizer_summary.html)

Download cache folder: `~/.cache/huggingface/transformers`

In [3]:
#import os
#os.environ['TRANSFORMERS_CACHE'] = '/media/dataista/DATA/transformers'

from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

In [4]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [7]:
inputs = tokenizer("The second wave of Covid seems to be finishing")

In [8]:
inputs

{'input_ids': [101, 10103, 10981, 21560, 10108, 10348, 41194, 32681, 10114, 10346, 33300, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
inputs['input_ids']

[101,
 10103,
 10981,
 21560,
 10108,
 10348,
 41194,
 32681,
 10114,
 10346,
 33300,
 102]

In [11]:
inputs['token_type_ids']

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [12]:
inputs['attention_mask']

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [14]:
len("The second wave of Covid seems to be finishing".split())

9

In [15]:
len(inputs['input_ids'])

12
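
Nine whitespace-separated words become 12 ids because the tokenizer wraps the sentence in two special tokens (ids 101 and 102 at both ends) and can split a word missing from its vocabulary into several subword pieces (here apparently "Covid", which occupies two ids). The sketch below shows the greedy longest-match idea behind WordPiece. The toy vocabulary and the `co` + `##vid` split are invented for illustration; the real tokenizer's pieces may differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
toy_vocab = {"the", "second", "wave", "of", "co", "##vid",
             "seems", "to", "be", "finishing"}

def encode(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece and add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

tokens = encode("The second wave of Covid seems to be finishing", toy_vocab)
print(len(tokens), tokens)
```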

In [30]:
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it.", "Text"],
    padding=True,
    truncation=True,
    max_length=512,
    #return_tensors="pt"
)

In [31]:
pt_batch['input_ids']

[[101,
  11312,
  10320,
  12495,
  19308,
  10114,
  11391,
  10855,
  10103,
  100,
  58263,
  13299,
  119,
  102],
 [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102, 0, 0, 0],
 [101, 14059, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

In [35]:
for key, value in pt_batch.items():
    print(f"{key}: {value}")

input_ids: [[101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102, 0, 0, 0], [101, 14059, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


In [38]:
def f(a, b):
    return a+b

In [39]:
f(**{'a': 1, 'b': 2})

3
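
The toy `f(**{...})` call above uses the same mechanism that lets a tokenizer output be passed straight to a model: `**` unpacks a mapping into keyword arguments. A minimal sketch, with a hypothetical `fake_model` function standing in for a real model's forward call:

```python
def fake_model(input_ids, attention_mask, token_type_ids=None):
    """Stand-in for a model call; it only counts the non-padding tokens per row."""
    return [sum(mask) for mask in attention_mask]

batch = {
    "input_ids": [[101, 14059, 102, 0], [101, 11312, 10320, 102]],
    "token_type_ids": [[0, 0, 0, 0], [0, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}

# Equivalent to fake_model(input_ids=..., token_type_ids=..., attention_mask=...)
real_token_counts = fake_model(**batch)
print(real_token_counts)
```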

In [36]:
import pandas as pd
pd.DataFrame(pt_batch["input_ids"])

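| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | 11312 | 10320 | 12495 | 19308 | 10114 | 11391 | 10855 | 10103 | 100 | 58263 | 13299 | 119 | 102 |
| 1 | 101 | 11312 | 18763 | 10855 | 11530 | 112 | 162 | 39487 | 10197 | 119 | 102 | 0 | 0 | 0 |
| 2 | 101 | 14059 | 102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |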


In [37]:
pd.DataFrame(pt_batch["attention_mask"])

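| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |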
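
What `padding=True` does can be approximated in a few lines: every sequence is extended with the padding id up to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. A minimal sketch under those assumptions (the `pad_batch` helper is hypothetical, and `pad_id=0` is taken from the zeros visible in the output above):

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded ids copied from the tokenizer output shown earlier.
unpadded = [
    [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
    [101, 11312, 18763, 10855, 11530, 112, 162, 39487, 10197, 119, 102],
    [101, 14059, 102],
]
padded = pad_batch(unpadded)
print(padded["attention_mask"][2])
```

The mask lets the model ignore the padded positions, so the shortest sentence ("Text") contributes only three positions to attention even though its row has 14 entries.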


# Model

* [Model outputs](https://huggingface.co/transformers/main_classes/output.html)
* [Fine tuning a pretrained model](https://huggingface.co/transformers/training.html)

In [46]:
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it.", 
     "El hotel era horrendo.",
     "O hotel era esplêndido."
    ],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

In [47]:
pt_outputs = model(**pt_batch)

In [48]:
pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6222, -2.7745, -0.8967,  2.0137,  3.3064],
        [ 0.0064, -0.1258, -0.0503, -0.1655,  0.1329],
        [ 2.1701,  1.8844,  0.6988, -1.5086, -2.4908],
        [-1.6145, -1.3498,  0.1185,  1.0354,  1.4689]],
       grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [50]:
pt_outputs.logits.shape

torch.Size([4, 5])

In [56]:
pt_outputs.logits[3].detach().numpy().round(2).tolist()

[-1.6100000143051147,
 -1.350000023841858,
 0.11999999731779099,
 1.0399999618530273,
 1.4700000286102295]

In [57]:
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs.logits, dim=-1)

In [60]:
pt_predictions

tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365],
        [0.4961, 0.3728, 0.1139, 0.0125, 0.0047],
        [0.0228, 0.0297, 0.1287, 0.3220, 0.4968]], grad_fn=<SoftmaxBackward>)

In [62]:
pt_predictions.argmax(dim=1)

tensor([4, 4, 0, 4])
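
For reference, the softmax and argmax steps can be reproduced without PyTorch. The sketch below recomputes the probabilities from the logits printed above. The `STAR_LABELS` list is an assumption about how this checkpoint orders its five classes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed class order: index 0 is the worst rating, index 4 the best.
STAR_LABELS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]

logits = [
    [-2.6222, -2.7745, -0.8967, 2.0137, 3.3064],
    [0.0064, -0.1258, -0.0503, -0.1655, 0.1329],
    [2.1701, 1.8844, 0.6988, -1.5086, -2.4908],
    [-1.6145, -1.3498, 0.1185, 1.0354, 1.4689],
]
probabilities = [softmax(row) for row in logits]
predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
labels = [STAR_LABELS[i] for i in predicted]
print(predicted, labels)
```

Note that the second sentence ("We hope you don't hate it.") is assigned index 4 with a probability of only about 0.24, so the argmax alone hides how uncertain the model is.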

# Saving the fine-tuned model and tokenizer to disk

In [73]:
from transformers import AutoModel

In [65]:
save_directory = "/media/dataista/DATA/my-transformers/episodio-2/"
tokenizer.save_pretrained(save_directory+"tokenizer/")
model.save_pretrained(save_directory+"model/")

In [67]:
# No puedo gra
try:
    tokenizer = AutoTokenizer.from_pretrained(save_directory+"tokenizer/")
    model = AutoModel.from_pretrained(save_directory+"model/")
except Exception as e:
    print(e)

file /media/dataista/DATA/my-transformers/episodio-2/tokenizer/config.json not found


Can't load config for '/media/dataista/DATA/my-transformers/episodio-2/tokenizer/'. Make sure that:

- '/media/dataista/DATA/my-transformers/episodio-2/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/media/dataista/DATA/my-transformers/episodio-2/tokenizer/' is the correct path to a directory containing a config.json file




In [69]:
!echo rm -rf ${save_directory}*

rm -rf $/media/dataista/DATA/my-transformers/episodio-2/*


In [70]:
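# IPython expands {save_directory} on its own; the extra "$" in the echo above stayed as a literal character in the path
!rm -rf {save_directory}*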

In [71]:
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

In [74]:
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory)

Some weights of the model checkpoint at /media/dataista/DATA/my-transformers/episodio-2/ were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [76]:
pt_outputs2 = model(**pt_batch)

In [85]:
# Loading with AutoModel drops the sequence classification head: AutoModel builds the bare BertModel, which is why the warning above lists the 'classifier' weights as unused
pt_outputs2

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3566, -0.4698, -0.2348,  ...,  0.9312,  0.0610, -0.2415],
         [ 0.2775, -0.5162, -0.1987,  ...,  1.1560, -0.0194, -0.3539],
         [ 0.4779, -0.5444, -0.0870,  ...,  1.4136, -0.5137, -0.3287],
         ...,
         [ 0.3512, -0.3564,  0.1628,  ..., -0.1383,  0.2157, -0.2248],
         [ 0.2712, -0.2414, -0.2314,  ...,  0.8188, -0.1898, -0.2099],
         [ 0.4083, -0.4160, -0.3753,  ...,  1.6491,  0.1184, -0.7137]],

        [[-0.2620, -0.0454, -0.3714,  ..., -0.1002,  0.0469, -0.2382],
         [-0.5508, -0.0658, -0.1082,  ...,  0.2486,  0.2475, -0.8336],
         [-0.2339, -0.2428, -0.3974,  ..., -0.3725,  0.4164, -0.6616],
         ...,
         [-0.1728, -0.0819, -0.2361,  ..., -0.0804,  0.0541, -0.0938],
         [-0.2688, -0.1920, -0.2137,  ..., -0.2153,  0.0508, -0.0768],
         [-0.2743, -0.1995, -0.0615,  ..., -0.1269,  0.1729,  0.0613]],

        [[-0.4732,  0.2295, -0.3679,  ..., -0.6695,  

In [79]:
model3 = AutoModelForSequenceClassification.from_pretrained(save_directory)

In [80]:
pt_outputs3 = model3(**pt_batch)

In [84]:
pt_outputs3.logits == pt_outputs.logits

tensor([[True, True, True, True, True],
        [True, True, True, True, True],
        [True, True, True, True, True],
        [True, True, True, True, True]])

In [86]:
pt_outputs = model3(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states  = pt_outputs.hidden_states
all_attentions = pt_outputs.attentions

In [88]:
pt_outputs.keys()

odict_keys(['logits', 'hidden_states', 'attentions'])

In [92]:
pt_outputs.logits ==  pt_outputs3.logits

tensor([[True, True, True, True, True],
        [True, True, True, True, True],
        [True, True, True, True, True],
        [True, True, True, True, True]])

In [94]:
pt_outputs3.keys()

odict_keys(['logits'])

In [97]:
len(pt_outputs.hidden_states)

13

In [99]:
pt_outputs.hidden_states[0].shape

torch.Size([4, 14, 768])

# Beyond AutoModel

In [101]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

In [102]:
batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it.", 
     "El hotel era horrendo.",
     "O hotel era esplêndido."
    ],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

In [104]:
res = model(**batch)

In [105]:
res

SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418],
        [-2.2111,  2.2326],
        [-1.8396,  1.8446]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [106]:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)

In [108]:
batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", 
     "We hope you don't hate it.", 
     "I hate you",
     "I love love."
    ],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

In [112]:
F.softmax(model(**batch).logits, dim=-1)

tensor([[0.4469, 0.5531],
        [0.4633, 0.5367],
        [0.4452, 0.5548],
        [0.4687, 0.5313]], grad_fn=<SoftmaxBackward>)

In [114]:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

In [125]:
F.softmax(model(**batch).logits, dim=-1).detach().numpy().round(3)

array([[0.   , 1.   ],
       [0.531, 0.469],
       [0.999, 0.001],
       [0.   , 1.   ]], dtype=float32)

# [Fine-tuning a pretrained model](https://huggingface.co/transformers/training.html#fine-tuning-a-pretrained-model)

* [datasets repo](https://github.com/huggingface/datasets)
* [Datasets doc](https://huggingface.co/docs/datasets/)

In [129]:
#!pip install datasets

In [130]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a...


Dataset imdb downloaded and prepared to /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a. Subsequent calls will reuse this data.


In [131]:
import datasets
datasets.list_datasets()

['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus',
 'ag_news',
 'ai2_arc',
 'air_dialogue',
 'ajgt_twitter_ar',
 'allegro_reviews',
 'allocine',
 'alt',
 'amazon_polarity',
 'amazon_reviews_multi',
 'amazon_us_reviews',
 'ambig_qa',
 'amttl',
 'anli',
 'app_reviews',
 'aqua_rat',
 'aquamuse',
 'ar_cov19',
 'ar_res_reviews',
 'ar_sarcasm',
 'arabic_billion_words',
 'arabic_pos_dialect',
 'arabic_speech_corpus',
 'arcd',
 'arsentd_lev',
 'art',
 'arxiv_dataset',
 'ascent_kb',
 'aslg_pc12',
 'asnq',
 'asset',
 'assin',
 'assin2',
 'atomic',
 'autshumato',
 'babi_qa',
 'banking77',
 'bbaw_egyptian',
 'bbc_hindi_nli',
 'bc2gm_corpus',
 'best2009',
 'bianet',
 'bible_para',
 'big_patent',
 'billsum',
 'bing_coronavirus_query_set',
 'biomrc',
 'blended_skill_talk',
 'blimp',
 'blog_authorship_corpus',
 'bn_hate_speech',
 'bookcorpus',
 'bookcorpusopen',
 'boolq',
 'bprec',
 'break_data',
 'brwac',
 'bsd_ja_en',
 'bswac',
 'c3',
 'c4',
 'cail2018

In [132]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [134]:
type(raw_datasets['train'])

datasets.arrow_dataset.Dataset

In [137]:
train = raw_datasets['train']

In [140]:
pd.set_option("display.max_colwidth", 500)

In [141]:
pd.DataFrame(train[:50])

Unnamed: 0,label,text
0,1,"Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as ""Teachers"". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is ""Teachers"". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which..."
1,1,"Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if yo..."
2,1,"Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently ""I'm a lawyer"" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic..."
3,1,"This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher K..."
4,1,"This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and actually had a plot that was followable. Leslie Ann Warren made the movie, she is such a fantastic, under-rated actress. There were some moments that could have been fleshed out a bit more, and some scenes that could probably have been cut to make the room to do so, but all in all, this is worth the price to rent and see it. The acting was good overall, Brooks himself did a good job without his ch..."
5,1,"This isn't the comedic Robin Williams, nor is it the quirky/insane Robin Williams of recent thriller fame. This is a hybrid of the classic drama without over-dramatization, mixed with Robin's new love of the thriller. But this isn't a thriller, per se. This is more a mystery/suspense vehicle through which Williams attempts to locate a sick boy and his keeper.<br /><br />Also starring Sandra Oh and Rory Culkin, this Suspense Drama plays pretty much like a news report, until William's characte..."
6,1,"Yes its an art... to successfully make a slow paced thriller.<br /><br />The story unfolds in nice volumes while you don't even notice it happening.<br /><br />Fine performance by Robin Williams. The sexuality angles in the film can seem unnecessary and can probably affect how much you enjoy the film. However, the core plot is very engaging. The movie doesn't rush onto you and still grips you enough to keep you wondering. The direction is good. Use of lights to achieve desired affects of sus..."
7,1,"In this ""critically acclaimed psychological thriller based on true events, Gabriel (Robin Williams), a celebrated writer and late-night talk show host, becomes captivated by the harrowing story of a young listener and his adoptive mother (Toni Collette). When troubling questions arise about this boy's (story), however, Gabriel finds himself drawn into a widening mystery that hides a deadly secret"" according to film's official synopsis.<br /><br />You really should STOP reading these comment..."
8,1,"THE NIGHT LISTENER (2006) **1/2 Robin Williams, Toni Collette, Bobby Cannavale, Rory Culkin, Joe Morton, Sandra Oh, John Cullum, Lisa Emery, Becky Ann Baker. (Dir: Patrick Stettner) <br /><br />Hitchcockian suspenser gives Williams a stand-out low-key performance.<br /><br />What is it about celebrities and fans? What is the near paranoia one associates with the other and why is it almost the norm? <br /><br />In the latest derange fan scenario, based on true events no less, Williams stars a..."
9,1,"You know, Robin Williams, God bless him, is constantly shooting himself in the foot lately with all these dumb comedies he has done this decade (with perhaps the exception of ""Death To Smoochy"", which bombed when it came out but is now a cult classic). The dramas he has made lately have been fantastic, especially ""Insomnia"" and ""One Hour Photo"". ""The Night Listener"", despite mediocre reviews and a quick DVD release, is among his best work, period.<br /><br />This is a very chilling story, ev..."


In [142]:
pd.Series([t['label'] for t in train]).value_counts()

0    12500
1    12500
dtype: int64

In [143]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [150]:
# train behaves like both a list and a dictionary: an integer index returns a row, a column name returns a whole column. Very nice :D

In [149]:
train[30]

{'label': 1,
 'text': "Sure, Titanic was a good movie, the first time you see it, but you really should see it a second time and your opinion of the film will definetly change. The first time you see the movie you see the underlying love-story and think: ooh, how romantic. The second time (and I am not the only one to think this) it is just annoying and you just sit there watching the movie thinking, When is this d**n ship going to sink??? And even this is not as impressive when you see it several times. The acting in this film is not bad, but definetly not great either. Was I glad DiCaprio did not win an oscar for that film, I mean who does he think he is, Anthony Hopkins or Denzel Washington? He does 1 half-good movie and won't do a film for less than $20 million. And then everyone is suprised that there are hardly any films with him in it. But enough about, in my eyes, the worst character of the film. Kate Winslet's performance on the other hand was wonderful. I also tink that the d

In [148]:
train["label"]

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
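
The list-and-dictionary behaviour seen in the two cells above (an integer selects a row, a column name selects a column) can be imitated with a small class. This only illustrates the indexing idea and is not how `datasets.Dataset` is implemented; `TinyDataset` is a hypothetical name.

```python
class TinyDataset:
    """Column-oriented table that can be indexed by row number or by column name."""

    def __init__(self, columns):
        self.columns = columns  # e.g. {"label": [...], "text": [...]}

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.columns[key]  # column name -> the whole column
        # int -> one row as a dict; slice -> a dict of shortened columns
        return {name: values[key] for name, values in self.columns.items()}

toy = TinyDataset({"label": [1, 0, 1], "text": ["great", "awful", "fine"]})
row = toy[1]
column = toy["label"]
head = toy[:2]
print(row, column, head)
```

The slice case returns a dictionary of columns, which is the shape `pd.DataFrame(train[:50])` relied on earlier.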


In [151]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


In [152]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

In [154]:
# It would probably have been better not to tokenize the unsupervised split
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 50000
    })
})

In [155]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]
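
The `shuffle(seed=42).select(range(1000))` idiom draws a reproducible random subset. The same idea in plain Python, as a sketch (the `sample_indices` helper is hypothetical, and its shuffling algorithm is not the one `datasets` uses, so the rows it picks would differ):

```python
import random

def sample_indices(num_rows, subset_size, seed):
    """Shuffle row indices with a fixed seed and keep the first subset_size of them."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    return indices[:subset_size]

first = sample_indices(25000, 1000, seed=42)
second = sample_indices(25000, 1000, seed=42)
print(first[:5], first == second)
```

Shuffling before selecting matters here: the first rows of `train["label"]` printed earlier all carry label 1, so taking the first 1000 rows without shuffling would likely give a subset with a single class.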

In [156]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [160]:
len(list(model.modules()))

219

In [162]:
list(model.modules())[218]

Linear(in_features=768, out_features=2, bias=True)

In [163]:
list(model.modules())[217]

Dropout(p=0.1, inplace=False)

In [164]:
list(model.modules())[216]

Tanh()

In [165]:
list(model.modules())[215]

Linear(in_features=768, out_features=768, bias=True)

In [166]:
list(model.modules())[214]

BertPooler(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (activation): Tanh()
)

In [168]:
from transformers import TrainingArguments
from transformers import Trainer

training_args = TrainingArguments("/media/dataista/DATA/my-transformers/first_trainer")

trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)

In [170]:
trainer.train()

RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 3.82 GiB total capacity; 1.58 GiB already allocated; 68.00 MiB free; 1.62 GiB reserved in total by PyTorch)

In [1]:
# Again, all together

In [2]:
import datasets
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer


raw_datasets = datasets.load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))
#full_train_dataset = tokenized_datasets["train"]
#full_eval_dataset = tokenized_datasets["test"]


training_args = TrainingArguments("/media/dataista/DATA/my-transformers/first_trainer")

trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)

Reusing dataset imdb (/home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Loading cached processed dataset at /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-16e01676e45a188f.arrow


Loading cached processed dataset at /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-8056582c24960265.arrow





Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [3]:
trainer.train()

RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 3.82 GiB total capacity; 1.49 GiB already allocated; 177.62 MiB free; 1.52 GiB reserved in total by PyTorch)

In [5]:
# The problem does not seem to be the dataset, but BERT itself.

In [7]:
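import torch
# Release cached blocks held by PyTorch's allocator; tensors that are still referenced are not freed
torch.cuda.empty_cache()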
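
Before retrying after an out-of-memory error, it can help to see how much GPU memory PyTorch is actually holding. The following is only a minimal sketch: `gpu_memory_report` is a hypothetical helper name (it is not part of the original notebook), and it assumes a single GPU at index 0. It returns `None` when PyTorch or CUDA is not available.

```python
def gpu_memory_report(device_index=0):
    """Return allocated/reserved/total GPU memory in MiB, or None without CUDA."""
    try:
        import torch  # imported lazily so the helper also works where torch is missing
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    mib = 1024 ** 2
    props = torch.cuda.get_device_properties(device_index)
    return {
        # Memory occupied by live tensors
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        # Memory held by PyTorch's caching allocator (live tensors + cache)
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
        # Total capacity of the device
        "total_mib": props.total_memory / mib,
    }


report = gpu_memory_report()
print(report if report is not None else "CUDA not available")
```

Calling it before and after `torch.cuda.empty_cache()` shows whether the cache was the issue: if `allocated_mib` stays high, the memory is held by live tensors (for example the model itself), and emptying the cache will not help.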

In [2]:
# Still fails with DistilBERT and 10 samples.
# Try a smaller batch size; if that does not work, move to CPU.

In [3]:
import datasets
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-cased"


raw_datasets = datasets.load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10))
#full_train_dataset = tokenized_datasets["train"]
#full_eval_dataset = tokenized_datasets["test"]


training_args = TrainingArguments("/media/dataista/DATA/my-transformers/first_trainer",
                                 per_device_train_batch_size=1,
                                 per_device_eval_batch_size=1)

trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)

Reusing dataset imdb (/home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)
Loading cached processed dataset at /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-720c17e77000dab8.arrow
Loading cached processed dataset at /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-9d8983fb05751a6e.arrow
Loading cached processed dataset at /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-c86709e7338e7405.arrow
Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'v

In [4]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=30, training_loss=0.6969985961914062, metrics={'train_runtime': 5.2253, 'train_samples_per_second': 5.741, 'train_steps_per_second': 5.741, 'total_flos': 6062565150720.0, 'train_loss': 0.6969985961914062, 'epoch': 3.0})

In [5]:
#small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10))
#small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10))

In [6]:
training_args = TrainingArguments("/media/dataista/DATA/my-transformers/first_trainer",
                                 per_device_train_batch_size=1,
                                 per_device_eval_batch_size=1,
                                 no_cuda=True)  # force CPU training even if a GPU is visible


trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)

In [7]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=30, training_loss=0.459854793548584, metrics={'train_runtime': 30.7567, 'train_samples_per_second': 0.975, 'train_steps_per_second': 0.975, 'total_flos': 6062565150720.0, 'train_loss': 0.459854793548584, 'epoch': 3.0})

In [8]:
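# Same CPU setup with batch size 8. Note that `model` is not re-created,
# so each run continues training from the weights left by the previous one.
training_args = TrainingArguments("/media/dataista/DATA/my-transformers/first_trainer",
                                 per_device_train_batch_size=8,
                                 per_device_eval_batch_size=8,
                                 no_cuda=True)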


trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)

In [9]:
trainer.train()

Step,Training Loss


TrainOutput(global_step=6, training_loss=0.1051294207572937, metrics={'train_runtime': 23.2904, 'train_samples_per_second': 1.288, 'train_steps_per_second': 0.258, 'total_flos': 6062565150720.0, 'train_loss': 0.1051294207572937, 'epoch': 3.0})
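
The `global_step` values reported above (30 with batch size 1, 6 with batch size 8) follow from the dataset size, the batch size, and the number of epochs. Here is a small sketch of that arithmetic. It assumes a single device, no gradient accumulation, and the `TrainingArguments` default of 3 epochs; `expected_global_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def expected_global_steps(num_samples, batch_size, num_epochs=3):
    """Optimizer steps for a run: batches per epoch (last one may be partial) times epochs."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * num_epochs


# 10 training samples, as in small_train_dataset above
steps_bs1 = expected_global_steps(10, batch_size=1)
steps_bs8 = expected_global_steps(10, batch_size=8)
print(steps_bs1, steps_bs8)
```

With batch size 8 the 10 samples are split into one full batch and one partial batch of 2, so each epoch takes 2 steps instead of 10.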

In [10]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(50))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(50))
trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()

Loading cached shuffled indices for dataset at /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-71f2b545007cd7bd.arrow
Loading cached shuffled indices for dataset at /home/dataista/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-484a3647a3e90554.arrow


Step,Training Loss


TrainOutput(global_step=21, training_loss=0.4871863410586402, metrics={'train_runtime': 113.4901, 'train_samples_per_second': 1.322, 'train_steps_per_second': 0.185, 'total_flos': 30312825753600.0, 'train_loss': 0.4871863410586402, 'epoch': 3.0})
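
The throughput figures in the `TrainOutput` metrics can be reproduced from the other reported numbers, which makes the runs easier to compare. This sketch only redoes that arithmetic with the values printed above; `throughput` is a hypothetical helper, and the run labels are descriptive names added here.

```python
def throughput(num_samples, num_epochs, global_step, runtime_s):
    """Return (samples per second, steps per second), rounded to 3 decimals."""
    samples_per_s = num_samples * num_epochs / runtime_s
    steps_per_s = global_step / runtime_s
    return round(samples_per_s, 3), round(steps_per_s, 3)


# (num_samples, num_epochs, global_step, train_runtime) taken from the outputs above
runs = {
    "GPU, 10 samples, batch 1": (10, 3, 30, 5.2253),
    "CPU, 10 samples, batch 1": (10, 3, 30, 30.7567),
    "CPU, 10 samples, batch 8": (10, 3, 6, 23.2904),
    "CPU, 50 samples, batch 8": (50, 3, 21, 113.4901),
}

results = {name: throughput(*values) for name, values in runs.items()}
for name, (samples_per_s, steps_per_s) in results.items():
    print(f"{name}: {samples_per_s} samples/s, {steps_per_s} steps/s")

# How much slower the CPU run was than the GPU run at batch size 1
cpu_vs_gpu = 30.7567 / 5.2253
print(f"CPU/GPU runtime ratio at batch size 1: {cpu_vs_gpu:.1f}x")
```

On this machine the CPU handles roughly 1 to 1.3 samples per second, so the full 25,000-example IMDB training split would take hours per epoch, which is the motivation for moving to a cloud GPU.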

# Moving to the cloud

* [GPUs on Google Cloud](https://cloud.google.com/compute/docs/gpus)
* [Create a Google Cloud instance with an attached GPU](https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus)