<a href="https://colab.research.google.com/github/ericsdata/colinsbeer/blob/main/BeerSentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Testing out Hugging Face

Using a selection of beer reviews from the Beer Ratings dataset, we set up a binary classifier that predicts whether a user thought a beer was "good" based on their write-up. This should be a fairly simple task, since the label is likely to correlate strongly with the sentiment of the review.

The file `write_txt_train.py` has detailed information on how the training set was compiled. Overall, beers that scored 4.0 or higher were considered "good".
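The labeling rule can be sketched as follows. This is only an illustration: `write_txt_train.py` is not reproduced in this notebook, so the function name `label_good` and the sample scores are hypothetical.

```python
# Hypothetical sketch of the "good" labeling rule; the real logic lives in
# write_txt_train.py, which is not shown in this notebook.
GOOD_THRESHOLD = 4.0  # scores at or above this count as "good"


def label_good(score, threshold=GOOD_THRESHOLD):
    """Return 1 if the review score meets the threshold, else 0."""
    return int(score >= threshold)


# Made-up example scores, for illustration only
sample_scores = [3.5, 4.0, 4.7, 2.0]
sample_labels = [label_good(s) for s in sample_scores]
print(sample_labels)
```

Note that the comparison is inclusive, so a score of exactly 4.0 lands in the "good" class.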



In [2]:
## Environment will require HF transformers package
!pip install transformers

Collecting transformers
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
 

### Data read in

Working in this Colab environment requires a manual upload of the dataset. As Google will remind you, the upload expires at the end of the session, so the file must be re-uploaded each time the notebook is run.

The dataset is a two-column dataframe of 7000 rows. One column is the text review, the second is the binary indicator for whether the beer was considered "good" by the reviewer.

In [3]:
import os
import pandas as pd
## Csv produced by write_txt_train.py file
dat = pd.read_csv(r'txt_train.csv')
dat.head(10)


Unnamed: 0,review_text,good_score
0,Bottle from Willow Park Regina. Pours off red ...,0
1,Not bad for a light amber. Beer was found at s...,0
2,"Bomber from bev mo. 98 percentile already, wo...",0
3,"A decent pilsener, if for no other reason than...",0
4,Bottle. Pours a clear golden with a one finger...,0
5,brewers are certainly allowed to brag a bit on...,0
6,"Bottle at GBBF, August 2005. A bit of a disap...",0
7,Pours a clear golden color with a white foamy ...,0
8,Brown colour with ruby highlight and a thin wh...,0
9,Malted caramel on the nose. Somewhat musty an...,1


In [8]:
target = dat.columns[1]
num_targets = len(dat[target].unique())

target

'good_score'

### Preprocessing 

To train an ML model we need a training set and a held-out set. We split our 7000 rows 70/30 into two datasets: one to train the model, the other to evaluate its performance.

In [4]:
import numpy as np
import random
import math
## Set the seed
random.seed(1144)
## Set size of our population
population = [i for i in dat.index]
## Set proportionate size of training set
train_proportion = .7
train_size = math.floor(len(population) * train_proportion)
## Sample the index at the training size to get the list of index positions for the training set
train_idx = random.sample(population, train_size)
## This dict is not used below, but it is a handy way to organize the data
mldat = {"train" : dat.loc[train_idx].to_dict(orient = "records"),
        "test" : dat.loc[~dat.index.isin(train_idx)].to_dict(orient = "records")}
## Make training and validation sets
train =  dat.loc[train_idx]
val = dat.loc[~dat.index.isin(train_idx)]
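A quick way to sanity-check a split like this is to confirm that the two index lists are disjoint and together cover every row. The sketch below does that on index positions alone, assuming the index is simply the integers 0 to 6999 (a stand-in for `dat.index`, since the CSV is not available here).

```python
import math
import random

# Stand-in for dat.index: 7000 integer row labels (assumed, matching the
# row count stated earlier in the notebook)
population = list(range(7000))

random.seed(1144)
train_size = math.floor(len(population) * 0.7)
train_idx = random.sample(population, train_size)

# Set lookups make the complement cheap to compute
train_set = set(train_idx)
val_idx = [i for i in population if i not in train_set]

# The two index lists should be disjoint and together cover every row
assert train_set.isdisjoint(val_idx)
assert len(train_idx) + len(val_idx) == len(population)
print(len(train_idx), len(val_idx))
```

With 7000 rows this gives 4900 training rows, which matches the `Num examples = 4900` line in the training log further down.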


An important part of training an ML model is setting up your data so it can be fed into the model.

The dataset class below takes two parallel inputs: the encodings (the numeric representations of the tokenized texts) and the corresponding classification labels. Each record is converted to torch tensors when it is fetched.

Its two basic methods fetch a single record from the dataset (`__getitem__`) and return the number of records (`__len__`).

In [15]:
import torch
#https://discuss.huggingface.co/t/errors-when-fine-tuning-t5/3527

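class Beer_Data(torch.utils.data.Dataset):
  '''
  Torch dataset in which each record holds the encodings and, if supplied, a label
  '''
  def __init__(self, encodings, labels=None):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    ## Build one record: the idx-th entry of every encoding field, as tensors
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    ## Labels are optional, so skip them for unlabeled (prediction-only) data
    if self.labels is not None:
      item['labels'] = torch.tensor(self.labels[idx])
    return item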

  def __len__(self):
    return len(self.encodings["input_ids"])
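To see the record-fetching protocol without needing torch installed, here is a torch-free stand-in. It is a sketch only: `ToyBeerData` and the toy encodings are made up for illustration, and plain lists take the place of tensors. It also leaves out the label when none is supplied, which is what prediction on unlabeled data needs.

```python
# Torch-free stand-in for the dataset class, for illustration only:
# plain lists replace tensors, and the toy encodings below are made up.
class ToyBeerData:
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One record: the idx-th entry of every encoding field
        item = {key: values[idx] for key, values in self.encodings.items()}
        if self.labels is not None:
            item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])


toy_encodings = {
    "input_ids": [[101, 5, 102], [101, 7, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy = ToyBeerData(toy_encodings, labels=[0, 1])
print(len(toy))
print(toy[1])
```

Defining `__getitem__` and `__len__` is what lets the object be indexed with `toy[1]` and measured with `len(toy)`, which is the same interface a data loader relies on.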

Hugging Face's value comes from foundation models that can be specialized for your specific purpose.

In this example we use the pre-trained DistilBERT model, which has already been trained on a large text corpus, then specialize it by attaching a classification head and fine-tuning it on our beer review data.




In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
## Checkpoint used for the tokenizer. It is a fine-tune of distilbert-base-uncased,
## so it should share the same vocabulary as the base model loaded below
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
## Use the HF AutoModel class to attach a classification head to the base model
## num_labels is the number of classes: 2 for the binary "good" target
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", 
                                                           num_labels= num_targets)

### Use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device = device)

## Load the tokenizer
#### The tokenizer splits text into tokens and maps them to vocabulary ids
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier


In [10]:
## Encode our training and validation data, truncating and padding as needed
train_encodings = tokenizer(train['review_text'].astype(str).to_list(), 
                            truncation = True, padding = True)
val_encodings = tokenizer(val['review_text'].astype(str).to_list(), 
                          truncation = True, padding = True)
                          
## Use text encodings to create dataset with labels
train_dataset = Beer_Data(train_encodings,train[target].astype(int).to_list())
val_dataset = Beer_Data(val_encodings, val[target].astype(int).to_list())

Looking at one record from our dataset, we see it has an attention mask (1 for real tokens, 0 for padding), `input_ids` holding the encoded tokens, and an associated label.

In [None]:
train_dataset[1]

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0
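The run of 1s followed by 0s in the attention mask comes from padding every review in a batch to a common length. The sketch below shows the idea in plain Python. It is not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is taken from the `pad_token_id` in the model config printed later in this notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only
batch = pad_batch([[11, 12, 13, 14], [21, 22]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the shorter sequence is padded up to the longest one, short reviews end up with long tails of zeros, as in the record shown above.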

Set up the Trainer, which handles batching and data loading internally

In [11]:
from transformers import Trainer, TrainingArguments

## Set up training arguments used to train model
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=4,
    weight_decay=0.05,
)

## input args into Trainer class
trainer = Trainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_dataset,         
    eval_dataset=val_dataset             
)

In [12]:
## train the model

trainer.train()
trainer.save_model("beer_review_good_reviews")

***** Running training *****
  Num examples = 4900
  Num Epochs = 4
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1228


Step,Training Loss
500,0.4574
1000,0.2483


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to beer_review_good_reviews
Configuration saved in beer_review_good_reviews/config.json
Model weights saved in beer_review_good_reviews/pytorch_model.bin
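The `Total optimization steps = 1228` line in the log can be reproduced from the other numbers it reports. The arithmetic below assumes a single device and no gradient accumulation, as the log states.

```python
import math

num_examples = 4900   # "Num examples" in the log
batch_size = 16       # per_device_train_batch_size, single device
num_epochs = 4        # num_train_epochs

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Since the run stops at step 1228, only the checkpoints at steps 500 and 1000 appear under `./results`; the final weights are the ones written by `trainer.save_model`.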


In [19]:
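## Time to evaluate the model on the held-out test file

## Read in data
test = pd.read_csv(r'txt_test.csv')
## Generate encodings
test_encodings = tokenizer(test['review_text'].astype(str).to_list(), truncation = True, padding = True)

## No labels are passed here: this dataset is used for prediction only
test_dataset = Beer_Data(test_encodings)
## Load the checkpoint saved at training step 1000 (the final model was saved to "beer_review_good_reviews")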
model = AutoModelForSequenceClassification.from_pretrained("results/checkpoint-1000", 
                                                           num_labels= num_targets)
## Set Trainer
test_trainer = Trainer(model)
## Make the predictions
predictions, _,_ = test_trainer.predict(test_dataset)

loading configuration file results/checkpoint-1000/config.json
Model config DistilBertConfig {
  "_name_or_path": "results/checkpoint-1000",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.16.2",
  "vocab_size": 30522
}

loading weights file results/checkpoint-1000/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at results/checkpoint-1000.
If your

In [23]:
test_results=test.copy(deep=True)
test_results["label_int_pred_transfer_learning"]= np.argmax(predictions, axis=1)
test_results

Unnamed: 0,review_text,good_score,label_int_pred_transfer_learning
0,22 oz bomber. Almost pitch back with a thin b...,0,0
1,Way too salty a beer for me. Looked great firs...,0,0
2,Hazy golden hue with small head but good lacin...,1,0
3,Bottle 33cl. @ home.Clear light yellow color w...,0,0
4,"Wow, I didn't think that this would still be s...",0,0
...,...,...,...
2995,Bottle. Pours an amber colour with an off-whit...,0,0
2996,"Bottled. Hazy nut brown, lively creamy head. V...",0,0
2997,"Good gracious, what have we here? Slight bubb...",1,1
2998,"On tap, poured into my 32 ounce/1 liter/whatev...",0,0
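The `np.argmax(predictions, axis=1)` step picks, for each review, the class with the largest score. The sketch below shows the same idea in plain Python, assuming `predictions` holds one row of raw logits per review with one column per class. The logits are made up, and `softmax` is included only to show how the scores could be turned into probabilities.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Index of the largest logit, the row-wise equivalent of np.argmax."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up logits for two reviews: columns are class 0 and class 1
example_logits = [[1.2, -0.8], [-0.3, 0.9]]
for row in example_logits:
    print(predict_label(row), [round(p, 3) for p in softmax(row)])
```

Softmax preserves the ordering of the logits, so taking the argmax of either the logits or the probabilities yields the same label.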


In [24]:
pd.crosstab(test_results[target], test_results['label_int_pred_transfer_learning'])

good_score (rows) vs. label_int_pred_transfer_learning (columns),0,1
0,1860,217
1,446,477
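The crosstab can be turned into the usual summary metrics. The sketch below copies the four counts from the table above and treats "good" (1) as the positive class.

```python
# Counts copied from the crosstab above (rows: actual, columns: predicted)
tn, fp = 1860, 217   # actual 0
fn, tp = 446, 477    # actual 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of reviews predicted "good", share that were
recall = tp / (tp + fn)      # of truly "good" reviews, share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy comes out at about 78%, compared with roughly 69% for always predicting "not good" (2077 of the 3000 test reviews). Recall on the "good" class is only about 52%, so the model misses nearly half of the good reviews.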


In [None]:
## Let's download our model so we don't always have to rerun training
!zip -r /content/file.zip /content

from google.colab import files
## Use the same absolute path the zip command wrote to
files.download("/content/file.zip")

FileNotFoundError: ignored