# 03 - BERT token creation

## Instructions
Run this notebook on Colab with a GPU.

- Turn on the GPU: `Runtime > Change runtime type > Hardware accelerator > GPU`
- Load the data file `df_total_cleaned.pkl.gz`


Since we are dealing with a lot of data, Colab will probably crash a couple of times.

When BERT is done, remember to download the pickle file!
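
Because Colab may crash partway through, it can help to save intermediate results so that a restart does not begin from scratch. The following is a minimal sketch of that idea and is not part of the original notebook: `process_with_checkpoints`, `embed_fn`, and `checkpoint_path` are hypothetical names, and `embed_fn` stands in for whatever turns a chunk of lyrics into one result per lyric.

```python
import gzip
import pickle
from pathlib import Path


def process_with_checkpoints(items, embed_fn, checkpoint_path, chunk_size=1000):
    """Process `items` chunk by chunk, saving progress after every chunk.

    `embed_fn` (hypothetical) must return exactly one result per input item,
    so the number of stored results tells us where to resume.
    """
    checkpoint_path = Path(checkpoint_path)
    results = []

    # Resume from an earlier (possibly crashed) run if a checkpoint exists.
    if checkpoint_path.exists():
        with gzip.open(checkpoint_path, "rb") as fp:
            results = pickle.load(fp)

    for start in range(len(results), len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        results.extend(embed_fn(chunk))
        # Rewrite the whole checkpoint: simple, at the cost of repeated I/O.
        with gzip.open(checkpoint_path, "wb") as fp:
            pickle.dump(results, fp)

    return results
```

Calling the function again with the same `checkpoint_path` after a crash skips the items that were already processed. Rewriting the full file after each chunk keeps the sketch short; for very large outputs, one file per chunk would likely be cheaper.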

## Data files needed to run this notebook:
- `df_total_cleaned.pkl.gz`

## Settings:
- Set `COLAB = True` if you run this on Colab; the data file can then be placed in the root directory. Otherwise the notebook reads from `./data/` and runs on the CPU.
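
As an alternative to setting the flag by hand, the environment could be detected automatically. This is only a sketch: it assumes that a Colab kernel has already imported the `google.colab` module at startup, which is a common idiom but not something this notebook relies on, and `running_in_colab` is a hypothetical helper. The two paths are the ones used later in this notebook.

```python
import sys


def running_in_colab() -> bool:
    """Guess whether we are on Colab (assumption: Colab preloads `google.colab`)."""
    return "google.colab" in sys.modules


COLAB = running_in_colab()
# Same locations as in the notebook: root directory on Colab, ./data/ locally.
path = "./" if COLAB else "./data/"
print(f"COLAB={COLAB}, data path={path}")
```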

In [1]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re
from numpy import log, mean, matmul


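required = {'spacy', 'scikit-learn', 'numpy',
            'pandas', 'torch', 'matplotlib',
            'transformers', 'allennlp==0.9.0'}
installed = {pkg.key for pkg in pkg_resources.working_set}
# Compare on the bare package name; a pinned entry such as 'allennlp==0.9.0'
# never matches an installed key and would otherwise be reinstalled on every run.
missing = {req for req in required if req.split('==')[0].lower() not in installed}

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)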

import numpy as np
import pandas as pd

# PyTorch
import torch
# import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# File management
import os
from os import listdir
from pathlib import Path
import pickle
import gzip
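
The setup cell above looks up installed packages through `pkg_resources`, which is deprecated in recent versions of setuptools. A possible replacement, sketched here with the standard-library `importlib.metadata` (Python 3.8+), is shown below. `find_missing` is a hypothetical helper, and the sketch only understands plain names and `==` pins; it checks presence, not the installed version.

```python
from importlib import metadata


def find_missing(requirements):
    """Return the requirement strings whose distribution is not installed."""
    missing = set()
    for req in requirements:
        name = req.split("==")[0]  # drop an optional '==version' pin
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.add(req)
    return missing
```

The result can be passed to the same `pip install` call as before, for example `find_missing({'numpy', 'allennlp==0.9.0'})`.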

In [3]:
COLAB = False

In [7]:
if COLAB:
  # Google Colab
  path = "./"
  device = torch.device("cuda:0") # use the first GPU
else:
  # Laptop
  path = "./data/"
  device = torch.device("cpu")
#   !pip install ipywidgets
#   !jupyter nbextension enable --py widgetsnbextension


In [8]:
def save_pickle(filename, data):
    """Pickle `data` to `filename` inside `path`."""
    with open(f"{path}{filename}", "wb") as fp:
        pickle.dump(data, fp)

def load_pickle(filename):
    """Load a pickled object from `filename` inside `path`."""
    with open(f"{path}{filename}", "rb") as f:
        return pickle.load(f)

def save_file(filename, train_tokens, test_tokens, val_tokens):
    """Store the three splits together in one pickle file."""
    data = {"train_tokens": train_tokens, "test_tokens": test_tokens, "val_tokens": val_tokens}
    save_pickle(filename, data)

def load_file(filename):
    """Return the (train, test, val) splits stored by `save_file`."""
    all_text = load_pickle(filename)
    return (all_text["train_tokens"], all_text["test_tokens"], all_text["val_tokens"])


In [9]:
df_total = pd.read_pickle(f'{path}df_total_cleaned.pkl.gz')

In [10]:
df_total.head()

|   | SName | Lyric | Artist | Genre |
|---|---|---|---|---|
| 0 | More Than This | I could feel at the time. There was no way of ... | 10000 Maniacs | Rock |
| 1 | Because The Night | Take me now, baby, here as I am. Hold me close... | 10000 Maniacs | Rock |
| 2 | These Are Days | These are. These are days you'll remember. Nev... | 10000 Maniacs | Rock |
| 3 | A Campfire Song | A lie to say, "O my mountain has coal veins an... | 10000 Maniacs | Rock |
| 4 | Everyday Is Like Sunday | Trudging slowly over wet sand. Back to the ben... | 10000 Maniacs | Rock |


In [11]:
def simplify_data(data):
  # Split into labels (genres) and features (lyrics), each returned as a
  # single-column DataFrame with a fresh 0..n-1 index.
  y = data[["Genre"]].reset_index(drop=True)
  X = data[["Lyric"]].reset_index(drop=True)
  return (X, y)



In [12]:
X, y = simplify_data(df_total)

In [13]:
import transformers
# what we're used to: BERT
from transformers import BertTokenizer, BertModel 

MODEL_NAME = 'bert-base-uncased'
# Load pre-trained model
model = BertModel.from_pretrained(MODEL_NAME)
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

In [14]:
# move the whole model to the selected device (GPU on Colab)
model.to(device)
model.eval()  # inference only: make sure dropout is disabled

# Here we pass small batches through the model on the GPU; we'll load the product of this process later.
# The model itself takes up a LOT of memory, so we're passing very small batches.
# Note: you may run out of RAM if you try to run this along with all of the above.

def generate_BERT_tokens(data, filename):
  # Despite the name, this stores one BERT [CLS] embedding per lyric, not token ids.
  print("Starting BERT tokenization")
  print(f"using: {device}")
  st = 0
  batch_size = 5
  # end index of every batch; the last batch may be smaller than batch_size
  batches = list(range(batch_size, len(data), batch_size))+[len(data)]
  doc_rep_collector = []
  for b in batches:
      print(f"Batch: {b}/{len(data)} ({b/len(data):.1%})")
      tokens = tokenizer.batch_encode_plus(
          data["Lyric"][st:b].tolist(),
          padding="max_length",  # replaces the deprecated pad_to_max_length=True
          return_tensors="pt",
          max_length=512,
          truncation=True)
      st = b
      tokens = tokens.to(device)
      # no gradients are needed for feature extraction; this saves a lot of memory
      with torch.no_grad():
          outputs = model(**tokens)
      # taking the representation of the 'CLS' token (doc-level embedding)
      o = outputs[0][:,0].cpu().numpy()
      doc_rep_collector.append(o)

  # stack into array
  doc_rep_collector = np.concatenate(doc_rep_collector)

  # to minimize size, can store as 16-bit float
  doc_rep_collector = doc_rep_collector.astype('float16')

  # additionally, will store as gzip (pandas can handle this)
  with gzip.open(f'{path}{filename}', 'wb') as fp:
      pickle.dump(doc_rep_collector, fp)
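
The `batches` list in the function above holds the end index of each batch, with the final entry covering whatever remains. The small worked example below shows the resulting slices; `batch_bounds` is a hypothetical helper written only for illustration.

```python
def batch_bounds(n_items, batch_size):
    """Return the (start, end) index pairs used to slice the data into batches."""
    # Same construction as in generate_BERT_tokens: multiples of batch_size,
    # plus the total length as the final end index.
    ends = list(range(batch_size, n_items, batch_size)) + [n_items]
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


print(batch_bounds(12, 5))  # [(0, 5), (5, 10), (10, 12)]
print(batch_bounds(10, 5))  # [(0, 5), (5, 10)]
print(batch_bounds(3, 5))   # [(0, 3)]
```

One edge case worth knowing: with an empty input the construction still yields a single empty batch, `(0, 0)`, so an empty DataFrame should be filtered out before calling the function.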

In [15]:
%%time
if COLAB: # around .. minutes
  generate_BERT_tokens(X, 'lyrics_bert_vectors_total.pkl.gz')
else: # around 50 seconds
  generate_BERT_tokens(X[0:10], 'lyrics_bert_vectors_localsubset.pkl.gz')

Starting BERT tokenization
using: cpu
Batch: 5/10 (50.0%)
Batch: 10/10 (100.0%)
CPU times: user 39.7 s, sys: 8.78 s, total: 48.5 s
Wall time: 20.8 s
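
Once the file has been written, the vectors can be read back for the next step. The sketch below assumes the file was produced as in `generate_BERT_tokens`, that is, a gzip-compressed pickle of a float16 NumPy array with one row per lyric (768 columns for `bert-base-uncased`). `load_vectors` is a hypothetical helper, and casting back to float32 is a choice made here to avoid float16 arithmetic, not something the notebook prescribes.

```python
import gzip
import pickle

import numpy as np


def load_vectors(filepath):
    """Load the stored document vectors and cast them to float32."""
    with gzip.open(filepath, "rb") as fp:
        vectors = pickle.load(fp)
    # The precision lost when storing as float16 is not recovered by the cast;
    # float32 is simply more convenient for downstream computation.
    return np.asarray(vectors).astype("float32")


# Example (file name taken from the local run above):
# vectors = load_vectors("./data/lyrics_bert_vectors_localsubset.pkl.gz")
# vectors.shape  -> (number of lyrics, 768)
```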


In [20]:
X[1:20]

|   | Lyric |
|---|---|
| 1 | Take me now, baby, here as I am. Hold me close... |
| 2 | These are. These are days you'll remember. Nev... |
| 3 | A lie to say, "O my mountain has coal veins an... |
| 4 | Trudging slowly over wet sand. Back to the ben... |
| 5 | Don't talk, I will listen. Don't talk, you kee... |
| 6 | Well they left then in the morning, a hundred ... |
| 7 | . . science. is truth for life. watch religion... |
| 8 | On bended kneeI've looked through every window... |
| 9 | For whom do the bells toll. When sentenced to ... |
| 10 | She walks alone on the brick lane,. the breeze... |
