<a href="https://colab.research.google.com/github/ericsdata/colinsbeer/blob/main/src/BeerSentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Testing out Hugging Face

Using a selection of beer reviews from the Beer Ratings dataset, we are going to set up a binary classifier that tells us whether a user thought a beer was "good" based on their write-up. This should be a fairly simple task, since the label is likely to correlate strongly with the sentiment of the review.

The file `write_txt_train.py` has detailed information on how the training set was compiled. In short, beers that scored 4.0 or higher were considered "good".
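As a rough illustration of that labeling rule, here is a minimal sketch. The helper name `to_good_score` and the example ratings are hypothetical; the actual logic lives in `write_txt_train.py` and may differ in detail.

```python
# Hypothetical helper: the real labeling code is in write_txt_train.py.
GOOD_THRESHOLD = 4.0  # cutoff stated in the text above


def to_good_score(overall_score, threshold=GOOD_THRESHOLD):
    """Return 1.0 for a 'good' beer (score >= threshold), else 0.0."""
    return 1.0 if overall_score >= threshold else 0.0


example_scores = [3.5, 4.0, 4.6, 2.0]  # made-up ratings for illustration
labels = [to_good_score(s) for s in example_scores]
print(labels)
```

The float labels mirror the `good_score` column shown later in the notebook.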



In [2]:
## Environment will require HF transformers package
!pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  A

### Data read in

Working in this Colab environment requires a manual upload of the dataset. As Google will remind you, the upload expires at the end of the session, so the file must be re-uploaded each time the notebook is run.

The dataset is a two-column dataframe of 7000 rows. One column is the text review, the second is the binary indicator for whether the beer was considered "good" by the reviewer.

In [3]:
import os
import pandas as pd
## Csv produced by write_txt_train.py file
dat = pd.read_csv(r'txt_train.csv')
dat.head(10)

Unnamed: 0,review_text,good_score
0,"UPDATED: MAR 6, 2004 Amber, with a wonderful o...",0.0
1,"I did not really like this beer. Grainy, bitt...",0.0
2,I was quite pleased with this beer. I tried a ...,0.0
3,Tried UK 3% version in 440ml can. Light in co...,0.0
4,Clean on the palate with a musty biscuity malt...,0.0
5,"Shenandoah Throwdown, growler, thanks wickedpe...",0.0
6,Short stubby bottle from Morrissons clearance ...,0.0
7,Straw color with a tall waxy looking white hea...,0.0
8,Notes of caramel sticky buns on the nose but v...,0.0
9,"A passable lager... meaning, Ill probably pass...",0.0


### Preprocessing 

To train an ML model we need a training set and a testing set. We split our 7000 rows into two datasets: one to train the model, the other to evaluate its performance.

In [5]:
import numpy as np
import random
import math
## Set the seed
random.seed(1144)
## Set size of our population
population = [i for i in dat.index]
## Set proportionate size of training set
train_proportion = .7
train_size = math.floor(len(population) * train_proportion)
## Sample the index at the training size to get the list of index positions for the training set
train_idx = random.sample(population, train_size)
## Record-style version of the same split; reused in the tokenizer experiments further down
mldat = {"train" : dat.loc[train_idx].to_dict(orient = "records"),
        "test" : dat.loc[~dat.index.isin(train_idx)].to_dict(orient = "records")}
## Make training and testing sets
train =  dat.loc[train_idx]
val = dat.loc[~dat.index.isin(train_idx)]
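A quick sanity check on a split like this is to confirm the two index sets are disjoint and together cover every row. The sketch below redoes the sampling on plain integers so it runs without the CSV; `split_indices` is a hypothetical helper and is not part of the notebook.

```python
import math
import random


def split_indices(n_rows, train_proportion=0.7, seed=1144):
    """Return (train_idx, val_idx) as disjoint lists covering range(n_rows)."""
    rng = random.Random(seed)  # local generator, leaves the global seed alone
    population = list(range(n_rows))
    train_size = math.floor(n_rows * train_proportion)
    train_idx = rng.sample(population, train_size)
    train_set = set(train_idx)
    val_idx = [i for i in population if i not in train_set]
    return train_idx, val_idx


tr_idx, va_idx = split_indices(7000)
assert not set(tr_idx) & set(va_idx)       # no row appears in both sets
assert len(tr_idx) + len(va_idx) == 7000   # every row is used exactly once
print(len(tr_idx), len(va_idx))
```

With 7000 rows and a 0.7 proportion this gives 4900 training rows, which matches the `Num examples` figure in the training log.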


An important part of training an ML model is setting up your data so it can be fed into the model.

We won't go into the details of tokenizing and embedding text here, but the class below wraps the encodings and labels so they can be ingested by the model easily.

In [None]:
import torch


class Beer_Data(torch.utils.data.Dataset):
  '''
  Class is a torch data set, each record has encodings and labels
  '''
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self,idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)
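To make the indexing logic concrete, here is a small sketch that mirrors `__getitem__` on plain Python lists, leaving out the tensor conversion. The token ids and labels are made up, and `get_item` is a hypothetical stand-in for the method above.

```python
# Toy stand-in for tokenizer output: one list per field, one entry per review.
toy_encodings = {
    "input_ids": [[101, 2204, 102], [101, 2919, 102]],  # made-up token ids
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_labels = [1.0, 0.0]


def get_item(encodings, labels, idx):
    """Mirror Beer_Data.__getitem__ without converting to tensors."""
    item = {key: values[idx] for key, values in encodings.items()}
    item["labels"] = labels[idx]
    return item


print(get_item(toy_encodings, toy_labels, 1))
```

Each record is a dict holding one row from every encoding field plus its label, which is the shape the `Trainer` expects from a dataset.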

We will use a pretrained DistilBERT model for this task, fine-tuning it for binary classification.

In [6]:
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification
## Checkpoint used for the tokenizer (shares the uncased vocabulary of the base model)
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
## Load the base DistilBERT weights with a newly initialized classification head
## num_labels=1 gives a single output; Transformers treats this as regression (MSE loss) on the 0/1 "good" label
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1)

tokenizer = AutoTokenizer.from_pretrained(checkpoint)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier


In [None]:



train_encodings = tokenizer(train['review_text'].astype(str).to_list(), 
                            truncation = True, padding = True)
val_encodings = tokenizer(val['review_text'].astype(str).to_list(), 
                          truncation = True, padding = True)

train_dataset = Beer_Data(train_encodings, train['good_score'].to_list())
val_dataset = Beer_Data(val_encodings, val['good_score'].to_list())

In [None]:
train_dataset[1]

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0
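The run of 1s followed by 0s in the attention mask comes from padding: shorter reviews are filled out to the length of the longest one, and the mask marks which positions hold real tokens. A minimal pure-Python sketch of the idea follows. The token ids are made up, `pad_batch` is a hypothetical helper, and a pad id of 0 is assumed; the tokenizer does this internally when `padding=True`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # fill with pad id
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])  # made-up token ids
print(batch["attention_mask"])
```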

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

labels = [0,1]

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
)


trainer = Trainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_dataset,         
    eval_dataset=val_dataset
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer.train()
trainer.save_model("beer_review_model")

***** Running training *****
  Num examples = 4900
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1839


Step,Training Loss
500,0.0334
1000,0.0071
1500,0.0013


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to beer_review_model
Configuration saved in beer_review_model/config.json
Model weights saved in beer_review_model/pytorch_model.bin
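The step count and checkpoint names in the log can be reproduced with a little arithmetic. The sketch below assumes a single device and a checkpoint interval of 500 steps (the `TrainingArguments` default for `save_steps`, which is not set explicitly in this notebook).

```python
import math

num_examples = 4900  # "Num examples" in the log
batch_size = 8       # per_device_train_batch_size, single device assumed
num_epochs = 3
save_every = 500     # assumed default checkpoint interval

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs
checkpoints = list(range(save_every, total_steps + 1, save_every))
print(steps_per_epoch, total_steps, checkpoints)
```

This gives 613 steps per epoch and 1839 steps in total, with checkpoints at steps 500, 1000 and 1500, in line with the log above.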


In [None]:
val = pd.read_csv(r'txt_test.csv')

val_encodings = tokenizer(val['review_text'].astype(str).to_list(), truncation = True, padding = True)

val_dataset = Beer_Data(val_encodings, val['good_score'].to_list())

predictions = trainer.predict(val_dataset)

***** Running Prediction *****
  Num examples = 3000
  Batch size = 4


In [None]:
test_results=val.copy(deep=True)
## NOTE: predictions.label_ids holds the true labels passed in with val_dataset, not the model's output;
## the model's raw scores are in predictions.predictions
test_results["label_int_pred_transfer_learning"]=predictions.label_ids
test_results
#test_results['label_pred_transfer_learning']=test_results['label_int_pred_transfer_learning'].apply(lambda x:labels[x])

#test_results[test_results["good_score"]!=test_results["label_pred_transfer_learning"]].head()

Unnamed: 0,review_text,good_score,label_int_pred_transfer_learning
0,"Well, tipically bavaria, cheap and fast, not m...",0.0,0.0
1,Packaging sure made this one look deceptively ...,0.0,0.0
2,"Bubbly white head, golden see through. Malty,...",1.0,1.0
3,Bottle pours a light yellow with a thin white ...,0.0,0.0
4,Bottle. Dark ruby colour with a tan head. Arom...,0.0,0.0
...,...,...,...
2995,750ml bottle. Poured with a huge rocky white f...,0.0,0.0
2996,Growler at Shenadoah Throwdown thanks to wicke...,0.0,0.0
2997,"This beer could be a hit with a ""whip out your...",1.0,1.0
2998,Caramel and bitter woody scent. Deep ruby colo...,0.0,0.0


In [None]:
pd.crosstab(test_results['good_score'], test_results['label_int_pred_transfer_learning'])

label_int_pred_transfer_learning,0.0,1.0
good_score,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,2891,0
1.0,0,109
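One thing to keep in mind when reading this table: `predictions.label_ids` carries the labels that were supplied with the dataset, while the model's own outputs are in `predictions.predictions`. A rough sketch of turning single-output scores into class predictions before cross-tabulating is below. The 0.5 cutoff, the helper names and the example scores are all assumptions; in the notebook the scores would presumably come from something like `predictions.predictions.flatten()`.

```python
def scores_to_labels(raw_scores, cutoff=0.5):
    """Turn single-output model scores into 0.0/1.0 class predictions."""
    return [1.0 if score >= cutoff else 0.0 for score in raw_scores]


def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs, like a 2x2 crosstab."""
    counts = {(a, p): 0 for a in (0.0, 1.0) for p in (0.0, 1.0)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


raw_scores = [0.02, 0.91, 0.40, 0.65]  # made-up values, not model output
actual = [0.0, 1.0, 1.0, 0.0]
predicted = scores_to_labels(raw_scores)
print(confusion_counts(actual, predicted))
```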


In [4]:
dat['good_score'].value_counts()

0.0    6739
1.0     261
Name: good_score, dtype: int64
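These counts show a heavily imbalanced dataset, which matters when judging accuracy. As a quick reference point, the sketch below computes the share of "good" reviews and the accuracy of a trivial model that always predicts "not good", using only the counts printed above.

```python
class_counts = {0.0: 6739, 1.0: 261}  # from value_counts() above
total = sum(class_counts.values())

positive_rate = class_counts[1.0] / total
majority_baseline = max(class_counts.values()) / total  # always predict 0.0
print(round(positive_rate, 4), round(majority_baseline, 4))
```

Any classifier trained on this data should be compared against that majority-class baseline rather than against 50%.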

In [None]:
max_len = 512
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

test_seq = mldat['train'][1]['review_text']
test_seq2 = mldat['train'][2]['review_text']

seqs = [mldat['train'][1]['review_text'],mldat['train'][2]['review_text'] ]
tokens = tokenizer(seqs, padding = True, 
                            truncation = True, return_tensors = 'pt')
print(tokens)


## The tokenizer output already contains the token ids, so no separate conversion is needed
input_ids = tokens['input_ids']

model(**tokens)

{'input_ids': tensor([[  101,   101, 26071,  1012, 10364,  2015,  1037,  5024,  2422,  3756,
          2007,  4317,  2132,  1998,  2152,  6351,  3370,  1012, 14747,  2066,
          2087,  7935, 18007,  1010,  2021,  2200,  8217,  1012,  2023,  5404,
          2018,  1037,  2844,  3988, 20943,  5510,  1010,  2029,  2059, 23946,
          2000,  1037,  2300,  2044, 10230,  2618,  1012,  2023,  2419,  2000,
          1037,  4326,  5688,  2004,  2009,  2001,  2667,  2000,  2022,  2119,
          1037,  5744,  4392,  3085,  5404,  1998,  2242,  2007,  6805,  1012,
          1996,  5404,  2001,  2036, 19638,  2084,  3517,  1010,  1998,  1996,
          2132,  8314,  2000, 19219,  2015,  1012,  3452,  1037,  4392,  3085,
          5404,  1010,  2074,  2025,  2200, 22249,  1012,   102,   102],
        [  101,   101, 16958,  2200, 23353,  2100,  1998,  2025,  2428,  2172,
         28126,  1010,  1045,  2876,  2102,  2175,  2041,  1997,  2026,  2126,
          2000,  2031,  2023,  2153,   102, 

SequenceClassifierOutput([('logits', tensor([[ 3.0460, -2.5669],
                                   [ 3.2350, -2.7345]], grad_fn=<AddmmBackward0>))])
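The logits above are unnormalized scores, one per class. A softmax turns them into probabilities, as in the pure-Python sketch below. For this SST-2 checkpoint, index 0 should correspond to NEGATIVE and index 1 to POSITIVE, but it is worth confirming with `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [[3.0460, -2.5669], [3.2350, -2.7345]]  # copied from the output above
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Both reviews put well over 99% of the probability mass on index 0.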

In [None]:
sents = [d['review_text'] for d in mldat["train"]]
tokenized_reviews = tokenizer(sents, padding = True, 
                            truncation = True, return_tensors = 'pt')



len(tokenized_reviews['input_ids'])

4900

In [None]:
train_InputExamples

4181    InputExample(guid=None, text_a='[CLS]  [SEP]',...
3943    InputExample(guid=None, text_a='[CLS] Bottled....
4025    InputExample(guid=None, text_a='[CLS] Tastes v...
3991    InputExample(guid=None, text_a='[CLS] Bottle: ...
1436    InputExample(guid=None, text_a='[CLS] Pours cl...
                              ...                        
2846    InputExample(guid=None, text_a='[CLS] Bottle p...
6292    InputExample(guid=None, text_a='[CLS] Can. Pou...
1719    InputExample(guid=None, text_a='[CLS] UPDATED:...
1571    InputExample(guid=None, text_a='[CLS] Bottle, ...
1208    InputExample(guid=None, text_a='[CLS] UPDATED:...
Length: 4900, dtype: object

In [None]:
from transformers import InputExample, InputFeatures

train =  dat.loc[train_idx]
test = dat.loc[~dat.index.isin(train_idx)]

def convert_data_to_examples(train, test, review, sentiment): 
    train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review], 
                                                          label = x[sentiment]), axis = 1)

    validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review], 
                                                          label = x[sentiment]), axis = 1,)
  
    return train_InputExamples, validation_InputExamples

train_InputExamples, validation_InputExamples = convert_data_to_examples(train,  test, 'review_text',  'good_score')


In [None]:



## Import DistilBERT model (fine-tuned on SST-2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

## Model inputs hold two arrays
## input_ids: token ids from the tokenizer vocabulary
## attention_mask: 1 for real tokens, 0 for padding

## Assumed definitions (the original cell used these names without defining them):
## review texts and integer class labels taken from the training split
sequence = train['review_text'].astype(str).to_list()
labels = train['good_score'].astype(int).to_list()

n_batch = 10

batch = sequence[:n_batch]
batch_labs = labels[:n_batch]

model_inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
#model_inputs = tokenizer(sequence, padding = True, truncation = True, return_tensors= 'pt')


## List of IDs (already produced by the tokenizer)
ids = model_inputs['input_ids']


#batch = tokenizer(few_reviews, padding="max_length", truncation=True, return_tensors="pt")

## !!! Attaching class to model inputs
model_inputs['labels'] = torch.tensor(batch_labs)

'''
### Choosing tokenizer
####    A) Keep reviews by uid
####    B) Sentence strings associated with a particular style

output = model(**model_inputs)

i = 0

for out in range(0,len(output)):

    print(output[i])
    i+=1
'''

'\n### Choosing tokenizer\n####    A) Keep reviews by uid\n####    B) Sentence strings associated with a particular style\n\noutput = model(**model_inputs)\n\ni = 0\n\nfor out in range(0,len(output)):\n\n    print(output[i])\n    i+=1\n'

In [None]:

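## One manual training step: forward pass, backward pass, parameter update
optimizer = AdamW(model.parameters())
## Passing 'labels' in model_inputs makes the model return a loss
loss = model(**model_inputs).loss
loss.backward()
optimizer.step()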