<a href="https://colab.research.google.com/github/ericsdata/colinsbeer/blob/main/src/BeerSentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generating New Reivews

Using a selection of beer revies from the Beer Ratings dataset, we are going to try out generate new text reviews on beer and / or brewer cues.  We will try out several language models - including GPT 2 for text generation. 

The file `write_txt_train.py` has detailed information in how the training set was compiled. Overall, beers that scored at a 4.0 or higher were considered "good".

Resource : https://colab.research.google.com/drive/13dZVYEOMhXhkXWfvSMVM1TTtUDrT6Aeh?usp=sharing#scrollTo=U_XJVIetKN-h

Resource 2: https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272

In [2]:
## Environment will require HF transformers package
!pip install transformers

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
[K     |████████████████████████████████| 3.4 MB 4.1 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
[K     |████████████████████████████████| 67 kB 5.5 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 37.2 MB/s 
[?25hCollecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 46.7 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 34.6 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  A

### Data read in

Working in this colab environment requires a manual upload of the dataset. As google will remind you, this upload expires at the end of the session, so must be reuploaded each time the user runs this notebook. 

Dataset contains beers with at least 30 reviews 

In [1]:
import os
import pandas as pd
import torch


## Csv produced by write_txt_train.py file
#dat = pd.read_csv(r'..\txt_train.csv')
#dat.head(10)

### Model Loading 

We are relying on HF models to deploy this. 

In [2]:
from transformers import GPT2Tokenizer, GPT2Model


tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

## GPT2 Doesnt have pad token??
#### Errored out later unless this was added
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

Using pad_token, but it is not set yet.


In [3]:
### FAKE DATA

revs = ["On tap at the Springfield, PA location. Poured a deep and cloudy orange (almost a copper) color with a small sized off white head. Aromas or oranges and all around citric. Tastes of oranges, light caramel and a very light grapefruit finish. I too would not believe the 80+ IBUs - I found this one to have a very light bitterness with a medium sweetness to it. Light lacing left on the glass.",
"On tap at the John Harvards in Springfield PA.  Pours a ruby red amber with a medium off whie creamy head that left light lacing.  Aroma of orange and various other citrus.  A little light for what I was expecting from this beers aroma...expecting more from the Simcoe.  Flavor of pine, orange, grapefruit and some malt balance.  Very light bitterness for the 80+ IBUs they said this one had.",
"UPDATED: FEB 19, 2003 Springfield, PA. I've never had the Budvar Cristal but this is exactly what I imagined it to be.  A clean and refreshing, hoppy beer, med bodied with plenty of flavor.  This beer's only downfall is an unpleasant bitterness in the aftertaste.",
"On tap the Springfield PA location billed as the Fancy Lawnmower Light.  Pours a translucent clear yellow with a small bubbly white head.  Aroma was lightly sweet and malty, really no hop presence.  Flavor was light, grainy, grassy and malty.  Just really light in flavor and aroma overall. Watery.",
"On tap at the Springfield, PA location. Poured a lighter golden color with a very small, if any head. Aromas and tastes of grain, very lightly fruity with a light grassy finish. Lively yet thin and watery body. Oh yeah, the person seating me told me this was a new one and was a Pale Ale even though the menu he gave me listed it as a lighter beer brewed in the Kolsh style.",
"Springfield, PA location... Poured an opaque black color with a creamy tan head and nice lacing.  Strong vanilla and roasted malt aroma.  Creamy taste of coffee, chocolate and vanilla. The bartender told me this was an imperial stout at about 8%.  She didn't convince me, there was no alcohol to be found, and it was sweet as hell!  But still good.",
"On tap at the Springfield, PA location. Listed on the beer menu as ""James Brown Ale"". Had the regular and cask version. Poured a deep brown color with an averaged sized off white head (cask had a huge head). Ill stop on the cask version here as I found it to smell and taste like buttered popcorn. The regular had aromas of nuts, light chocolate, and roast. Taste of chocolate, nuts, very light roast and caramel.	 Tasted on 9/7/2006 and moved over as part of the John Harvard clean up."","
"Sampled @ the Springfield, PA location.   Candi Sugar dominates this Belgian Ale.  Beer was on the flat side but had a nice crimson color.   Enjoyable Belgian Ale, I did not expect John Harvards to have it in its line-up.",
"Springfield... Poured a hazy copper color with a medium sized, off white head that left spotty lacing on the glass.  Aroma of yeast, dried fruits, clove, banana, and cherries, with light roastiness.  Aroma was very dubbelish.  Herbal taste of dark fruits, yeast and alcohol was barely noticed.  Slick mouthfeel.  Could have been more flavorful.",
"UPDATED: FEB 19, 2003 Springfield, PA. Darkish copper colored, with no head -	probably poured like that on purpose.	Served inappropriately at about 40 deg	F.  This beer was cold.  It tasted	fine at that temp but I had to let it warm up 	for awhile.  It was worth the wait, as a	very interesting and complex character developed.	Very phenolic and funky - with a strong ester of	bubblegum.  Also a little clove or some kind of 	spice.  Strong but not overwhelming at all.  Surpisingly	easy to drink."","
"UPDATED: FEB 19, 2003 Springfield, PA. Sharp and cloyingly sweet.  The alcohol presence becomes more and more noticeable.",
"UPDATED: FEB 19, 2003 Springfield, PA. Interesting example.  The fruit flavors are very apparent, but the natural mildness of the currants keep the sweetness in check.  These flavors blend well with the white beer base.",
"From Springfield PA:  nice smooth	malty flavor, mildy fruity, but served	via nitro and thru a restrictor disc	(stout tap).  Thus, overly creamy and lacking	some of its original flavor.  I could tell there	was a pretty good beer in there.  Aroma difficult	to detect."","
"On tap at Springfield location.  Pours a translucent golden amber with really no head.  Aroma of caramel, grains and light hops.  Flavor was malty with a light hop presence.  Really kind of non-descript overall.",
"On tap at the Springfield, PA location. Poured a medium and see through orange color with a small sized off white head. Aromas and tastes on the weak side and contained some citrus, caramel, and grains. Body was thin and watery.",
"Handbottled from trade wth Sprinkle. Pours a nice dark copper color with medium size off white head. Aroma of bourbon, malt , hops and oak. Slight smokey flavor with a bourbon taste in the initial sip. Flavors of malt, vanilla and hops still remain although none dominate the brew. Taste is still very enjoyable with a smooth and balanced finish.",
"On tap at the Great Taste of the Midwest (8/9/08): Pours a transparent bright copper orange with an airy white head.  Aroma of sweet toasty pale malt and sweet light fruitiness with a good resiny piny hop character.  Body starts with decent fullness and sweet caramel malts with good balance of hop flavor and bitterness.  Finishes smooth and bittersweet, nicely aged and balanced.",
"UPDATED: JUL 7, 2009 On tap. Interesting experiment, but I liked regular Hopula better. Pours a dark amber with an off white head. Aroma is slightly wood and hops, but mostly bourbon. Flavor is everything I loved about Hopula, but with too much booze. Some nice vanilla notes, but bourbon constantly overpowers the Hopula. Just not my style, but would be very interesting if they reused the barrels.",
"On cask at BI - Aroma of the Hopula Play-Doh hops and malt with lots of oak, vanilla and bourbon.  Pours dark mahogany with a medium lasting head and great lacing.  Flavor is strong bourbon, too strong.  The base beer is hidden under there somewhere but is way overpowered.  I had trouble getting it down to be honest.",
"GTMW 08 on cask - Pours bronze orang with a minimal head.  The aroma has lots of vanilla and some bourbon and toasty malt.  Medium sticky body with light carbonation.  The flavor starts with the aroma traits with more caramel malt and earthy hops.  The finish has vanilla and okay dominating everything.  This kind of beer is just not down my alley."

    
]

In [7]:
from torch.utils.data import random_split, RandomSampler, SequentialSampler

# Split into training and validation sets
train_size = int(0.9 * len(revs))
val_size = len(revs) - train_size

revs_train, revs_val = random_split(revs, [train_size, val_size])




I define the generative Beer Data. The class takes a text list, adds beginning / ending tags to each element, then tokenizes. 
It returns input_id and attention mask layers

In [5]:

class generative_BD(torch.utils.data.Dataset):
  '''Sequence text tokens
      This means it adds tags to start and end of texts

      Reads in text dataset, & tokenizes

      !!! NEED TO ADD PADDING TOKEN
  '''

  def __init__(self,text_list, tokenizer, text_tags, gpt2_type="gpt2", max_length=768):
    #self.text_list = text_list
    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []
    self.text_tags = [text_tags]

    ## In definitoin
    for txt in text_list:

        encodings_dict = tokenizer('<%s>'%(text_tags[0])+ txt + '<%s>'%(text_tags[1]), truncation=True, max_length=max_length, padding="max_length")

        self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
        self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx]


class Beer_Class_Data(torch.utils.data.Dataset):
  '''
  Classification torch data set, each record has encodings and labels
  '''
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self,idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

We will used a pretrained distilbert model for this task, fine tuning it for binary classification

In [8]:
train = revs_train
train_dataset = generative_BD(text_list= train, tokenizer = tokenizer, text_tags = ['CLS', 'SEP'])

val_dataset = generative_BD(text_list= revs_val, tokenizer = tokenizer, text_tags = ['CLS', 'SEP'])


In [14]:
train_dataset.__getitem__(2)

(tensor([   27,  5097,    50,    29,    52, 49316,    25, 18630,    33,   678,
            11,  5816, 27874,    11,  8147,    13,   314,  1053,  1239,   550,
           262, 10370,  7785, 24568,   282,   475,   428,   318,  3446,   644,
           314, 15758,   340,   284,   307,    13,   220,   317,  3424,   290,
         23056,    11,  8169, 14097,  6099,    11,  1117, 16429,   798,   351,
          6088,   286,  9565,    13,   220,   770,  6099,   338,   691, 38041,
           318,   281, 22029, 35987,   287,   262,   706,    83,  4594, 29847,
          5188,    47,    29, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 5

GPT2 model is very large - use on a CPU will likely cause memory errors. 



In [9]:
from transformers import GPT2LMHeadModel,  GPT2Tokenizer, GPT2Config, GPT2LMHeadModel
from transformers import AdamW, get_linear_schedule_with_warmup

# I'm not really doing anything with the config 
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# this step is necessary because I've added some tokens (bos_token, etc) to the embeddings
# otherwise the tokenizer and model tensors won't match up
model.resize_token_embeddings(len(tokenizer))


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



Function written below is used to "accumulate the gradients". The idea being that before optimizing the step of gradient descent - the algorithm will first evaluate the gradient of several potential operations. It divides that sum by the number of accumulated steps, and finally gets an average loss over the training sample. 

In [10]:
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None

In [11]:
import random
import numpy as np

# Tell pytorch to run this model on the GPU.
device = torch.device("cuda")
model.cuda()

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

AssertionError: Torch not compiled with CUDA enabled

In [12]:
## Parameters


epochs = 5
learning_rate = 5e-4
warmup_steps = 1e2
epsilon = 1e-8

# this produces sample output every 100 steps
sample_every = 20

In [13]:
optimizer = AdamW(model.parameters(),
            lr = learning_rate,
            eps = epsilon)



In [14]:
### Figure out how many training steps

total_steps = len(train) * epochs

## A scheduler adjusts the learning rate as the training loop progresses

scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = warmup_steps,
                                            num_training_steps = total_steps)

In [15]:
import datetime
import time

def format_time(time_elapsed):
    return str(datetime.timedelta(seconds = int(round(time_elapsed))))

In [16]:
#start training
total_t0 = time.time()
## store records of each run
training_stats = []

### Manual set for each epoch of training
for epoch_i in range(0, epochs):
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    ## Store process starting time
    startT = time.time()
    ## Init loss across training
    total_train_loss = 0
    #begin model training
    model.train()

    for step, batch in enumerate(train_dataset):
        b_input_ids =batch[0].to(device) ## send input ids to model
        b_labels = batch[0].to(device) ## send labels to model (labs are smae as input in gen text)
        b_masks = batch[1].to(device) ## send attention layer to model

        #init gradient at 0
        model.zero_grad()        
        ## feed labels, inputs, and masks to model
        outputs = model(  b_input_ids,
                          labels=b_labels, 
                          attention_mask = b_masks,
                          token_type_ids=None
                        )
        ## Calc loss
        loss = outputs[0]  
        ### loss on batch
        batch_loss = loss.item()
        ## Add it to total training loss
        total_train_loss += batch_loss
        ### Reporting step
        if step % sample_every == 0 and not step == 0:
          elapsed = format_time(time.time() - startT)
          print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataset), batch_loss, elapsed))

          model.eval()
          ### Output some samples so you know its working
          sample_outputs = model.generate(
                                bos_token_id = random.randint(1,30000)
                                ,do_sample = True
                                ,top_k = 50
                                ,max_lenght = 200
                                ,top_p = 0.095
                                ,num_return_sequences = 1
          )

          for i, sample_output in enumerate(sample_outputs):
                  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

          model.train()

        loss.backward()

        optimizer.step()

        scheduler.step()
    ## Calcu average loss over all epochs
    avg_train_loss = total_train_loss / len(train_dataset)

    ## Ouput how long epoch took
    training_time = format_time(time.time() - startT)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))

     # ========================================
    #               Validation
    # ========================================

    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)
        
        with torch.no_grad():        

            outputs  = model(b_input_ids, 
#                            token_type_ids=None, 
                             attention_mask = b_masks,
                            labels=b_labels)
          
            loss = outputs[0]  
            
        batch_loss = loss.item()
        total_eval_loss += batch_loss        

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    validation_time = format_time(time.time() - t0)    

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))
        


Training...


AssertionError: Torch not compiled with CUDA enabled