<a href="https://colab.research.google.com/github/crow-intelligence/gettingStartedWithColab-PyTorch/blob/master/03_Using_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/crow-intelligence/gettingStartedWithColab/blob/main/03_Using_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Milestone 3
## Using PyTorch

### Load data
Mount Google Drive to access the data used in Milestone 1. The texts are loaded and preprocessed in the next section; SimpleTransformers converts them into torch tensors internally.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Preprocess data

In [2]:
import string

import numpy as np
import pandas as pd
import torch

from sklearn.model_selection import train_test_split


# Load raw data: each line is "<label>\t<text>"
texts = []
labels = []
with open("/content/drive/My Drive/Colab Notebooks/01_data/SMSSpamCollection.txt", "r") as infile:
    for line in infile:
        label, text = line.strip().split("\t")
        # Encode the target: ham -> 0, spam -> 1
        labels.append(0 if label == "ham" else 1)
        # Lowercase the message and drop punctuation
        text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
        texts.append(text)

labels = np.asarray(labels)

df = pd.DataFrame(list(zip(texts, labels)), columns=["SMS", "Category"])

# 50% train; the other half is split evenly into test and validation sets
train_df, test_df = train_test_split(df, test_size=0.5)
test_df, valid_df = train_test_split(test_df, test_size=0.5)
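
The cleaning step above can also be written as a small, testable function. The following is a minimal sketch using only the standard library; `parse_line` and `_PUNCT_TABLE` are hypothetical names introduced here for illustration, and the sample message is made up. It uses `str.translate` instead of a character-by-character join and splits on the first tab only, which is an assumption about how to handle messages that themselves contain tabs.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# (_PUNCT_TABLE and parse_line are hypothetical names for this sketch.)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def parse_line(line):
    """Split one 'label<TAB>text' record into (cleaned_text, label_id)."""
    # Split on the first tab only, so a tab inside the message does not
    # break the unpacking.
    label, text = line.strip().split("\t", maxsplit=1)
    label_id = 0 if label == "ham" else 1  # ham -> 0, spam -> 1
    cleaned = text.translate(_PUNCT_TABLE).lower()
    return cleaned, label_id


# Made-up sample record, not taken from the dataset
sample = "spam\tWINNER!! Claim your prize, call 0900-123 now."
print(parse_line(sample))
```

Because the hyphen counts as punctuation, phone numbers such as `0900-123` are collapsed into a single token. Whether that is desirable depends on the downstream model.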



## Use SimpleTransformers to train a PyTorch model
First, we have to install the SimpleTransformers library.

In [77]:
!pip install simpletransformers

Collecting simpletransformers
  Downloading https://files.pythonhosted.org/packages/35/ef/0b70ae95138064d665d9298c4d96afba2edf4b86dc44f762807ceb12668e/simpletransformers-0.61.4-py3-none-any.whl (213kB)

Now we are ready to fine-tune a pre-trained language model (`bert-base-uncased`) on our data.

In [6]:
import torch
from simpletransformers.classification import ClassificationModel

# Use the GPU when the Colab runtime provides one
cuda_available = torch.cuda.is_available()

model = ClassificationModel(
    "bert",
    "bert-base-uncased",
    use_cuda=cuda_available,
    num_labels=2,  # ham (0) vs. spam (1)
    args={
        "output_dir": "/content/drive/My Drive/Colab Notebooks/manning/01",
        "best_model_dir": "/content/drive/My Drive/Colab Notebooks/manning/02/",
        "reprocess_input_data": True,
        "sliding_window": True,  # split inputs longer than max_seq_length into windows
        "overwrite_output_dir": True,
        "max_seq_length": 256,
        "num_train_epochs": 20,
        "train_batch_size": 10,
        "eval_batch_size": 10,
        "tensorboard_dir": "/content/drive/My Drive/Colab Notebooks/manning/tb",
    },
)

# Pass the validation set by keyword: the second positional parameter of
# train_model is multi_label, not eval_df. Note that eval_df is only used
# when evaluate_during_training is enabled in args.
model.train_model(train_df, eval_df=valid_df)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

  0%|          | 0/2787 [00:00<?, ?it/s]

Epoch:   0%|          | 0/20 [00:00<?, ?it/s]

Running Epoch 0 of 20:   0%|          | 0/279 [00:00<?, ?it/s]

(5580, 0.013313298771212215)
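
The progress-bar totals and the first element of the tuple returned by `train_model` follow from the split sizes and the batch size. The sketch below reproduces that arithmetic with the standard library only. It assumes the dataset holds 5574 messages (the size implied by the 2787 training examples shown above) and that `train_test_split` rounds the test share up and the train share down; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumed rounding: test share rounded up, train share rounded down.
    """
    n_test = math.ceil(n_samples * test_size)
    n_train = math.floor(n_samples * (1 - test_size))
    return n_train, n_test


n_total = 5574            # assumption: implied by the 2787 training rows
train_batch_size = 10     # "train_batch_size" in the model args
eval_batch_size = 10      # "eval_batch_size" in the model args
num_train_epochs = 20     # "num_train_epochs" in the model args

# First split: train vs. the rest; second split: test vs. validation
n_train, n_rest = split_sizes(n_total, 0.5)
n_test, n_valid = split_sizes(n_rest, 0.5)

steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
eval_batches = math.ceil(n_test / eval_batch_size)

print("train / test / valid:", n_train, n_test, n_valid)
print("steps per epoch:", steps_per_epoch)
print("total training steps:", total_steps)
print("evaluation batches:", eval_batches)
```

Under these assumptions the 279 batches per epoch and the 5580 total steps match the training output, so the first element of the returned tuple reads as the global step count and the second as the training loss reported by SimpleTransformers.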

### Evaluation

In [8]:
from sklearn.metrics import classification_report

result, model_outputs, wrong_predictions = model.eval_model(test_df)

# Names must follow the label encoding: 0 is ham, 1 is spam
target_names = ["Ham", "Spam"]
# Take the outputs of the first window of each example and pick the larger score
predicted_class = [list(e[0]) for e in model_outputs]
predicted_class = [e.index(max(e)) for e in predicted_class]
print(
    classification_report(
        list(test_df["Category"]), predicted_class, target_names=target_names
    )
)

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."


  0%|          | 0/1393 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/140 [00:00<?, ?it/s]

              precision    recall  f1-score   support

         Ham       0.99      1.00      0.99      1207
        Spam       0.98      0.95      0.96       186

    accuracy                           0.99      1393
   macro avg       0.98      0.97      0.98      1393
weighted avg       0.99      0.99      0.99      1393
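
To make the evaluation step easier to follow, here is a minimal sketch of how raw model outputs turn into predicted classes and per-class scores. It uses only the standard library and made-up numbers: the `model_outputs` and `y_true` values are hypothetical, and the nesting (one row of scores per sliding window) is an assumption that mirrors the `e[0]` indexing in the cell above. Labels follow the preprocessing step: 0 is ham, 1 is spam.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def precision_recall_f1(y_true, y_pred, class_id):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == class_id and p == class_id)
    fp = sum(1 for t, p in pairs if t != class_id and p == class_id)
    fn = sum(1 for t, p in pairs if t == class_id and p != class_id)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical outputs: one list of [ham_score, spam_score] rows per message,
# one row per sliding window.
model_outputs = [
    [[2.1, -1.3]],
    [[-0.4, 1.7], [0.2, 0.1]],
    [[0.9, 0.3]],
    [[-1.0, 2.2]],
]
y_true = [0, 1, 1, 1]  # hypothetical gold labels

# As in the notebook, only the first window of each message is used
y_pred = [argmax(windows[0]) for windows in model_outputs]

for class_id, name in enumerate(["Ham", "Spam"]):
    p, r, f = precision_recall_f1(y_true, y_pred, class_id)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Using only the first window ignores any text beyond it. Averaging the scores over all windows would be one alternative, at the cost of slightly more code.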

