## Importing HuggingFace Transformers

In [1]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.9.1-py3-none-any.whl (2.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled P

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.11.0-py3-none-any.whl (264 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting tqdm>=4.42
  Downloading tqdm-4.62.0-py2.py3-none-any.whl (76 kB)
Collecting fsspec>=2021.05.0
  Downloading fsspec-2021.7.0-py3-none-any.whl (118 kB)
Installing collected packages: tqdm, xxhash, fsspec, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.41.1
    Uninstalling tqdm-4.41.1:
      Successfully uninstalled tqdm-4.41.1
Successfully installed datasets-1.11.0 fsspec-2021.7.0 tqdm-4.62.0 xxhash-2.0.2


In [3]:
import numpy as np
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

# Load data

In [4]:
import pandas as pd

In [5]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


In [6]:
import os
os.chdir("/gdrive/")
!ls

MyDrive


In [7]:
os.chdir("/gdrive/MyDrive/Colab Notebooks")

In [8]:
os.listdir(os.getcwd())

['train.csv',
 'test.csv',
 'Untitled0.ipynb',
 'spooky_author_transformers.ipynb']

In [9]:
train_data = pd.read_csv('train.csv')
print(len(train_data))

19579


In [10]:
train_data.head()

        id                                               text  author
0  id26305  This process, however, afforded me no means of...     EAP
1  id17569  It never once occurred to me that the fumbling...     HPL
2  id11008  In his left hand was a gold snuff box, from wh...     EAP
3  id27763  How lovely is spring As we looked from Windsor...     MWS
4  id12958  Finding nothing else, not even gold, the Super...     HPL


In [11]:
# Load Spooky author dataset as dataset object 
dataset = load_dataset('csv', data_files= r'train.csv')
label2newlabel = {'EAP': 0, 'HPL': 1, 'MWS': 2}
def encode_author(example):
    example['author'] = label2newlabel[example['author']]
    return example
dataset['train'] = dataset['train'].map(encode_author)

Using custom data configuration default-29de13610f079790


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-29de13610f079790/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...


0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-29de13610f079790/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff. Subsequent calls will reuse this data.


  0%|          | 0/19579 [00:00<?, ?ex/s]

# Selecting a model for text classification

To attribute a text to its author, the model needs to capture that author's writing style, which shows up in the vocabulary and in how that vocabulary is used across contexts. Because the texts here are fiction, we want a model that attends to words in context within each author's texts. The occurrence of characteristic words in a text can then help the model identify its author.

For example, the EAP text at index 19576 in the train_data dataframe contains words that are unlikely to appear in the corpus on which the model was pretrained. Because BERT uses a WordPiece tokenizer, which is a subword tokenizer, it can break an unknown word into smaller pieces that do exist in its vocabulary of roughly 30,000 tokens. If such subword tokens appear more often in one author's texts, the fine-tuned model can learn to associate them with that author, here EAP.
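The subword splitting described above can be sketched in a few lines of plain Python. This is only an illustration of the greedy longest-match-first idea behind WordPiece, not the HuggingFace implementation: the function name wordpiece_tokenize and the tiny toy_vocab are hypothetical, and the real tokenizer also lowercases, strips accents, and splits on whitespace and punctuation before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary is far larger
toy_vocab = {"mai", "##s", "fa", "##ut", "ag", "##ir", "faint", "never"}

for word in ["mais", "faut", "agir", "faints", "never", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary the French words are broken into short pieces while a common English word such as never stays whole, which is the kind of signal the paragraph above refers to.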

A transformer-based model that has been pretrained to predict masked words in a text can be fine-tuned on the Spooky Author dataset to learn the vocabulary of each of its three authors. Let us use the HuggingFace library to define the checkpoint, tokenizer, and model.

In [12]:
checkpoint = "bert-base-uncased"

# Use from_pretrained method to directly load the weights of a pretrained BERT model and cache it locally.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [13]:
sample_eap_text = train_data[train_data['author']=='EAP'].text.loc[19576]
print(sample_eap_text)
tokens = tokenizer.tokenize(sample_eap_text)
print(tokens)

Mais il faut agir that is to say, a Frenchman never faints outright.
['mai', '##s', 'il', 'fa', '##ut', 'ag', '##ir', 'that', 'is', 'to', 'say', ',', 'a', 'frenchman', 'never', 'faint', '##s', 'outright', '.']


In [14]:
tokenizer.convert_tokens_to_ids(tokens)

[14736,
 2015,
 6335,
 6904,
 4904,
 12943,
 4313,
 2008,
 2003,
 2000,
 2360,
 1010,
 1037,
 26529,
 2196,
 8143,
 2015,
 13848,
 1012]

# Readying the training data for fine-tuning

In [15]:
# https://github.com/huggingface/datasets/issues/1600
spooky_author_dataset = dataset['train'].train_test_split(test_size=0.2)

In [16]:
print(spooky_author_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'author'],
        num_rows: 15663
    })
    test: Dataset({
        features: ['id', 'text', 'author'],
        num_rows: 3916
    })
})
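The split sizes printed above can be checked by hand. The sketch below assumes the test share is rounded up and the remainder goes to training, which is consistent with the 15663 and 3916 rows shown; split_sizes is a hypothetical helper, not part of the datasets library.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test
    return n_train, n_test


# 19579 rows in train.csv, 20% held out
print(split_sizes(19579, 0.2))
```

Because train_test_split is called without a seed here, the rows that land in each split will differ between runs even though the sizes stay the same.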


In [17]:
print(spooky_author_dataset['train'][0])

{'id': 'id08875', 'text': 'Might it not actually be another U boat, offering possibilities of rescue?', 'author': 1}


In [18]:
# Use the tokenizer to prepare the model inputs
def tokenize_data(dataset):
    encoded = tokenizer(dataset['text'],
                        padding=True,      # pad every text to the longest one in the split
                        truncation=True,   # cut texts longer than BERT's maximum of 512 tokens
                        return_tensors="tf")
    # .data is the plain dict holding input_ids, token_type_ids and attention_mask
    return encoded.data

# Tokenize the train and test splits separately
tokenized_spooky_author_dataset = {
    split: tokenize_data(spooky_author_dataset[split]) for split in spooky_author_dataset.keys()
}
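To make the effect of padding and truncation concrete, here is a minimal pure-Python sketch of how a batch of token id lists becomes a rectangular input_ids array with a matching attention_mask. The helper pad_batch is hypothetical and simplified: the real tokenizer keeps the closing [SEP] token when it truncates, which this slice-based version does not.

```python
def pad_batch(batch_ids, pad_id=0, max_length=512):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy batch: 101 is [CLS] and 102 is [SEP] in bert-base-uncased
batch = pad_batch([[101, 2453, 2009, 102], [101, 1996, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding the whole split to its longest text is simple but memory-hungry; padding per batch would produce smaller tensors, at the cost of a different data pipeline than the one used in this notebook.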

In [19]:
print(tokenized_spooky_author_dataset)

{'train': {'input_ids': <tf.Tensor: shape=(15663, 512), dtype=int32, numpy=
array([[  101,  2453,  2009, ...,     0,     0,     0],
       [  101,  1996, 14902, ...,     0,     0,     0],
       [  101,  2077,  1045, ...,     0,     0,     0],
       ...,
       [  101,  2053,  1010, ...,     0,     0,     0],
       [  101,  2383,  3264, ...,     0,     0,     0],
       [  101,  3342,  1010, ...,     0,     0,     0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(15663, 512), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(15663, 512), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 

In [20]:
from transformers import TFAutoModelForSequenceClassification
bert_classifier_model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Downloading:   0%|          | 0.00/536M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

bert_classifier_model.compile(
    optimizer='adam',
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
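The loss is configured with from_logits=True because the classification head outputs raw scores rather than probabilities. As a rough illustration of what that loss computes for a single example, the sketch below applies a log-softmax to the logits and takes the negative log-probability of the integer label. The function name is hypothetical, and Keras additionally averages this value over the batch.

```python
import math


def sparse_crossentropy_from_logits(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


logits = [2.0, 0.5, -1.0]  # made-up scores for EAP, HPL, MWS
print(sparse_crossentropy_from_logits(logits, label=0))  # small loss: class 0 has the top score
print(sparse_crossentropy_from_logits(logits, label=2))  # large loss: class 2 has the lowest score
```

The labels stay as plain integers (0, 1, 2), which is why the sparse variant of the loss is used instead of one that expects one-hot vectors.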

In [22]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Sun Aug  8 04:24:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    33W / 250W |   1439MiB / 16280MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [23]:
"""bert_classifier_model.fit(
    tokenized_spooky_author_dataset['train'],
    np.array(spooky_author_dataset['train']['author']), 
    validation_data=(
        tokenized_spooky_author_dataset['test'],
        np.array(spooky_author_dataset['test']['author']),
    ),
    batch_size=8
)"""


In [24]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay

batch_size = 8
num_epochs = 3
# The number of training steps is the number of batches per epoch (samples in the
# training set divided by the batch size) multiplied by the number of epochs
num_train_steps = (len(tokenized_spooky_author_dataset['train']['input_ids']) // batch_size) * num_epochs
# Decay the learning rate from 5e-5 down to 0 over the whole training run
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
)
opt = Adam(learning_rate=lr_scheduler)
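For intuition, the schedule above can be reproduced with plain arithmetic. The sketch assumes the default power of 1.0 for PolynomialDecay, which makes the decay linear, and plugs in the 15663 training rows reported earlier; polynomial_decay is a hypothetical stand-in written for illustration, not the Keras class.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=5871, power=1.0):
    """Learning rate at a given optimizer step; constant at end_lr after decay_steps."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


batch_size = 8
num_epochs = 3
num_train_steps = (15663 // batch_size) * num_epochs
print(num_train_steps)

for step in (0, num_train_steps // 2, num_train_steps):
    print(step, polynomial_decay(step, decay_steps=num_train_steps))
```

The schedule only reaches zero if training actually runs for the number of epochs used to compute decay_steps; with fewer epochs the learning rate stops partway down the ramp.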

In [25]:
import tensorflow as tf
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
bert_classifier_model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])

In [26]:
# Train for num_epochs so the run matches the decay_steps of the learning rate schedule
bert_classifier_model.fit(
    tokenized_spooky_author_dataset['train'],
    np.array(spooky_author_dataset['train']['author']),
    validation_data=(
        tokenized_spooky_author_dataset['test'],
        np.array(spooky_author_dataset['test']['author']),
    ),
    batch_size=batch_size,
    epochs=num_epochs
)

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Cause: while/else statement not yet supported
Cause: while/else statement not yet supported
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.


<tensorflow.python.keras.callbacks.History at 0x7f6ca615ea10>

In [30]:
test_dataset = load_dataset('csv', data_files= {'test': r'test.csv'})
print(test_dataset)

Using custom data configuration default-a16b196d77d9f87d


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-a16b196d77d9f87d/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...


0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-a16b196d77d9f87d/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff. Subsequent calls will reuse this data.
DatasetDict({
    test: Dataset({
        features: ['id', 'text'],
        num_rows: 8392
    })
})


In [31]:
tokenized_spooky_author_test_dataset = {
    split: tokenize_data(test_dataset[split]) for split in test_dataset.keys()
}

In [32]:
test_preds = bert_classifier_model.predict(tokenized_spooky_author_test_dataset['test'])['logits']
print(test_preds)

[[-2.36866    -2.5785275   4.334297  ]
 [ 2.6539552  -0.8524319  -2.5882049 ]
 [-1.1230044   3.9262877  -2.4198089 ]
 ...
 [ 2.1461248  -0.4910377  -2.3914742 ]
 [-1.2115648  -1.684983    2.2513847 ]
 [ 0.16606227  2.7879581  -2.632142  ]]


In [34]:
probabilities = tf.nn.softmax(test_preds, axis=-1)

In [35]:
probabilities

<tf.Tensor: shape=(8392, 3), dtype=float32, numpy=
array([[1.2245560e-03, 9.9273736e-04, 9.9778277e-01],
       [9.6590924e-01, 2.8982222e-02, 5.1085213e-03],
       [6.3619120e-03, 9.9189878e-01, 1.7393726e-03],
       ...,
       [9.2398971e-01, 6.6124447e-02, 9.8858252e-03],
       [2.9820632e-02, 1.8574361e-02, 9.5160496e-01],
       [6.7464061e-02, 9.2842609e-01, 4.1098665e-03]], dtype=float32)>
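The conversion from logits to probabilities can be verified by hand on the first row of test_preds printed earlier. The sketch below is a plain-Python softmax written for illustration; it is not how tf.nn.softmax is implemented, but it should land very close to the first row of the tensor above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


first_row = [-2.36866, -2.5785275, 4.334297]  # first row of test_preds
probs = softmax(first_row)
print([round(p, 6) for p in probs])
print(sum(probs))
```

Softmax preserves the ranking of the logits, so the most likely author is the same whether it is read from the logits or from the probabilities.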

In [38]:
ids = test_dataset['test']['id']

In [40]:
prob_df = pd.DataFrame(probabilities.numpy())
prob_df.head()

          0         1         2
0  0.001225  0.000993  0.997783
1  0.965909  0.028982  0.005109
2  0.006362  0.991899  0.001739
3  0.207318  0.769770  0.022912
4  0.287024  0.341030  0.371946


In [None]:
# Name the probability columns after the authors, in label-id order (0: EAP, 1: HPL, 2: MWS)
prob_df.rename(columns={0: 'EAP', 1: 'HPL', 2: 'MWS'}, errors="raise", inplace=True)
# Put the test ids in the first column
prob_df.insert(0, 'id', ids)

In [48]:
print(prob_df.head())

        id       EAP       HPL       MWS
0  id02310  0.001225  0.000993  0.997783
1  id24541  0.965909  0.028982  0.005109
2  id00134  0.006362  0.991899  0.001739
3  id27757  0.207318  0.769770  0.022912
4  id04081  0.287024  0.341030  0.371946
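If a single predicted author per text is wanted instead of three probabilities, the label mapping used for training can be inverted and applied to the highest-probability column. This is a small sketch under that assumption; id2label and predicted_author are hypothetical names, and the sample rows are copied from the output above.

```python
label2newlabel = {'EAP': 0, 'HPL': 1, 'MWS': 2}
# Invert the training-time mapping: 0 -> 'EAP', 1 -> 'HPL', 2 -> 'MWS'
id2label = {idx: name for name, idx in label2newlabel.items()}


def predicted_author(row_probs):
    """Return the author whose probability is highest in one row."""
    best = max(range(len(row_probs)), key=lambda i: row_probs[i])
    return id2label[best]


sample_rows = {
    'id02310': [0.001225, 0.000993, 0.997783],
    'id24541': [0.965909, 0.028982, 0.005109],
    'id04081': [0.287024, 0.341030, 0.371946],
}
for text_id, row in sample_rows.items():
    print(text_id, predicted_author(row))
```

The last sample row shows why keeping the full probabilities can be more informative: the top class wins by a narrow margin, which a hard label would hide.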


In [49]:
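# Write without the pandas row index so the file contains only the id, EAP, HPL and MWS columns
prob_df.to_csv('predictions_file_v1.csv', index=False)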