# Text Classification with a Pre-Trained Language Model using .NET
The purpose of this notebook is to show how to use pre-trained weights from BERT (or another TensorFlow 'language' model) to train a classifier in .NET (specifically F#).

The text classification task is more easily accomplished in Python due to the supportive ecosystem available there. The website [Hugging Face](https://huggingface.co/transformers/) hosts thousands of pre-trained language models that can be easily consumed using tooling supplied by Hugging Face.

Python, however, is not the language of choice when it comes to building high-performance applications. To consume language (or other deep learning) models from an application, one usually resorts to deploying the model as a service, with attendant cost, security and integration concerns. A high-performance application may need to integrate the model more tightly with other application functionality, and therefore an embedded model may be required.

This notebook shows how a language model may be re-trained and used directly from .NET, bypassing the need to deploy the model as a service.

## Load the required packages

In [1]:
//#r "nuget: libtorch-cpu-win-x64" //this notebook is written to work on CPU also
#r "nuget: libtorch-cuda-11.3-win-x64, 1.10.0.1" //large package - takes a long time to load and unpack the first time.
#r "nuget: TorchSharp"
#r "nuget: TfCheckpoint"   
#r "nuget: FsBERTTokenizer"
#r "nuget: FSharp.Data"

In [1]:
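open System      //ValueType, IDisposable and GC are used in later cells
open System.IO   //Path and File are used in later cells
open TfCheckpoint
open TorchSharp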

let device = if torch.cuda.is_available() then torch.CUDA else torch.CPU
printfn $"torch device is %A{device}"

## Load weights from pre-trained BERT 'checkpoint'
Here the pre-trained weights from the 'small' BERT uncased model are used - downloaded from [TensorFlow Hub](https://tfhub.dev/google/small_bert/bert_uncased_L-2_H-128_A-2/2).

Note: The weights can also be downloaded from Hugging Face; however, they are not easily extractable from languages other than Python. Hugging Face creates its own wrapped packages that require Hugging Face tooling to use.

The download includes a folder called 'variables' that contains the pre-trained weights.

In [1]:
let bertCheckpointFolder = @"C:\s\hack\small_bert_bert_uncased_L-2_H-128_A-2_2\variables"
let tensors = CheckpointReader.readCheckpoint bertCheckpointFolder |> Seq.toArray
//show first tensor
printfn "%A" tensors.[0]

In the above output, the first tensor is named *"bert/embeddings/LayerNorm/beta"*. It is a float32 array with 128 elements (the listing below reports its dimensions as [ 128 ]).

Note the TfCheckpoint package keeps tensors as flat arrays. These can be reshaped when loading into other Tensor libraries e.g. TorchSharp as shown later.
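
To make the flat-array point concrete, here is a minimal sketch in plain F# of how a flat array can be viewed as a matrix. It assumes the values are stored in row-major order (TensorFlow's usual layout); `reshape2D` is a hypothetical helper for illustration and is not part of the TfCheckpoint package.

```fsharp
// Hypothetical helper (not part of TfCheckpoint): view a flat, row-major
// array as a rows x cols matrix.
let reshape2D (rows:int) (cols:int) (flat:float32[]) =
    if flat.Length <> rows * cols then
        failwithf "expected %d elements but got %d" (rows * cols) flat.Length
    // element (r, c) lives at flat index r * cols + c
    Array2D.init rows cols (fun r c -> flat.[r * cols + c])

// A 2x3 example: the element at (row 1, col 0) sits at flat index 1*3 + 0 = 3
let m = reshape2D 2 3 [| 0f; 1f; 2f; 3f; 4f; 5f |]
printfn "%A" m.[1,0]   // the value at flat index 3, i.e. 3.0
```

The same index arithmetic is what a tensor library applies when it is given a flat buffer together with a shape.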

### List checkpoint tensors
The cell below lists the names and dimensions of the tensors in the pre-trained BERT checkpoint (the output shown here is limited to the first ten rows).

In [1]:
tensors (* |> Array.skip 20 *) |> Array.map (fun (n,st) -> {|Dims=st.Shape; Name=n|})

index,Dims,Name
0,[ 128 ],bert/embeddings/LayerNorm/beta
1,[ 128 ],bert/embeddings/LayerNorm/gamma
2,"[ 512, 128 ]",bert/embeddings/position_embeddings
3,"[ 2, 128 ]",bert/embeddings/token_type_embeddings
4,"[ 30522, 128 ]",bert/embeddings/word_embeddings
5,[ 128 ],bert/encoder/layer_0/attention/output/LayerNorm/beta
6,[ 128 ],bert/encoder/layer_0/attention/output/LayerNorm/gamma
7,[ 128 ],bert/encoder/layer_0/attention/output/dense/bias
8,"[ 128, 128 ]",bert/encoder/layer_0/attention/output/dense/kernel
9,[ 128 ],bert/encoder/layer_0/attention/self/key/bias


## The Small BERT Model
The small BERT model used here ('small_bert_bert_uncased_L-2_H-128') requires lowercased text and has a hidden layer size of 128. It has two 'transformer' layers. There are many versions of BERT available - from tiny to large. See the [text classification tutorial](https://www.tensorflow.org/text/tutorials/classify_text_with_bert) from TensorFlow for more details.

We will reconstruct the BERT model using TorchSharp (a .NET binding to PyTorch) and then load the pre-trained weights. The weights have to be mapped manually, which requires some knowledge of [Transformers/BERT](https://arxiv.org/abs/1810.04805). However, the task is made easier because TorchSharp (like PyTorch) provides a pre-built [TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html) which encapsulates the basic structure of Transformer-based models.

We are required to reconstruct the exact structure of BERT layer by layer to ensure the weights are applicable. For re-training we may exclude the final layers and only build the model up to the output of the encoder. For our needs, we will use the pre-trained weights whose names start with the 'bert/' prefix (see above) and ignore the rest.

### BERT Layers
The top level layers we will need are:
- *Embedding layer*: With sub-layers for word, position & token-type embeddings; layer normalization and dropout. Embedding maps an index to its corresponding learned feature vector.
- *Transformer Encoder layer*: With two sub-layers that apply the core transformer functionality.
- *Pooling*: Pools (summarizes) the output sequence into a single vector - its encoded representation.

The rest of the layers will be custom built for text classification (later).

### Other Parameters
Additional model parameters are required, e.g. vocabulary size, dropout rate, etc. Some of these may be obtained from Hugging Face from the [BERT model 'card' config file](https://huggingface.co/bert-base-uncased/blob/main/config.json). Here we have defined the required parameters as constants in the code below.

## Constants

In [1]:
//tensor dims - these values should match the relevant dimensions of the corresponding tensors in the checkpoint
let HIDDEN      = 128L
let VOCAB_SIZE  = 30522L    // see vocab.txt file included in the BERT download
let TYPE_SIZE   = 2L         // bert needs 'type' of token
let MAX_POS_EMB = 512L

//other parameters
let EPS_LAYER_NORM      = 1e-12
let HIDDEN_DROPOUT_PROB = 0.1
let N_HEADS             = 2L
let ATTN_DROPOUT_PROB   = 0.1
let ENCODER_LAYERS      = 2L
let ENCODER_ACTIVATION  = torch.nn.Activations.GELU
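
As a quick sanity check on these constants, the sketch below derives two sizes that follow from them. The local names `hidden`, `nHeads`, `headDim` and `feedForward` are introduced here for illustration only, and the feed-forward width of 4 x hidden is the usual BERT convention, assumed rather than read from the checkpoint.

```fsharp
// Values copied from the constants above
let hidden = 128L
let nHeads = 2L

// Each attention head works on an equal slice of the hidden vector,
// so the hidden size must divide evenly by the number of heads.
if hidden % nHeads <> 0L then failwith "HIDDEN must be divisible by N_HEADS"
let headDim = hidden / nHeads          // 64

// Assumption: the usual BERT convention of a feed-forward width of 4 x hidden
let feedForward = 4L * hidden          // 512

printfn $"head dim = {headDim}, feed-forward = {feedForward}"
```

If the assumed feed-forward width is wrong for a given checkpoint, the shape check performed while loading the weights would report a mismatch for the intermediate dense layers.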

## Embedding Layer

In [1]:
//Note: The module and variable names used here match the tensor name 'paths' as delimited by '/' for TF (see above), 
//for easier matching.
type BertEmbedding() as this = 
    inherit torch.nn.Module("embeddings")
    
    let word_embeddings         = torch.nn.Embedding(VOCAB_SIZE,HIDDEN,padding_idx=0L)
    let position_embeddings     = torch.nn.Embedding(MAX_POS_EMB,HIDDEN)
    let token_type_embeddings   = torch.nn.Embedding(TYPE_SIZE,HIDDEN)
    let LayerNorm               = torch.nn.LayerNorm([|HIDDEN|],EPS_LAYER_NORM)
    let dropout                 = torch.nn.Dropout(HIDDEN_DROPOUT_PROB)

    do 
        this.RegisterComponents()

    member this.forward(input_ids:torch.Tensor, token_type_ids:torch.Tensor, position_ids:torch.Tensor) =   
    
        let embeddings =      
            (input_ids       --> word_embeddings)        +
            (token_type_ids  --> token_type_embeddings)  +
            (position_ids    --> position_embeddings)

        embeddings --> LayerNorm --> dropout             // the --> operator works for simple 'forward' invocations
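
The core of the embedding layer is a table lookup followed by a sum. The following is a minimal sketch of that idea using plain arrays and made-up 2-dimensional vectors; the table contents and the names `wordTable`, `typeTable`, `positionTable` and `embed` are hypothetical, and layer normalization and dropout are left out.

```fsharp
// Toy embedding tables with 2-dimensional vectors (hypothetical values)
let wordTable     = [| [|0.0; 0.0|]; [|1.0; 2.0|]; [|3.0; 4.0|] |]   // indexed by token id
let typeTable     = [| [|0.1; 0.1|]; [|0.2; 0.2|] |]                 // indexed by segment (token type) id
let positionTable = [| [|0.01; 0.02|]; [|0.03; 0.04|] |]             // indexed by position

// element-wise sum of three vectors of equal length
let add3 (a:float[]) (b:float[]) (c:float[]) = Array.map3 (fun x y z -> x + y + z) a b c

// Embed one token: look up a row in each table and sum the rows
let embed tokenId typeId position =
    add3 wordTable.[tokenId] typeTable.[typeId] positionTable.[position]

let emb = embed 2 0 1
printfn "%A" emb
```

In the real layer the same lookup happens for every token of every sequence in the batch, which is why all three id tensors must have the same shape.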

## BERT Pooler

In [1]:
type BertPooler() as this = 
    inherit torch.nn.Module("pooler")

    let dense = torch.nn.Linear(HIDDEN,HIDDEN)
    let activation = torch.nn.Tanh()

    let ``:`` = torch.TensorIndex.Colon
    let first = torch.TensorIndex.Single(0L)

    do
        this.RegisterComponents()

    override _.forward (hidden_states) =
        let first_token_tensor = hidden_states.index(``:``, first) //take first token of the sequence as the pooled value
        first_token_tensor --> dense --> activation

## BERT Model 
Combines the embedding, pooler and transformer encoder layers. (The transformer encoders are available out-of-the-box in PyTorch.)

In [1]:
type BertModel() as this =
    inherit torch.nn.Module("bert")

    let embeddings = new BertEmbedding()
    let pooler = new BertPooler()

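    //the 3rd argument is the feed-forward width (dim_feedforward): 4 * HIDDEN = 512, which here happens to equal MAX_POS_EMB
    let encoderLayer = torch.nn.TransformerEncoderLayer(HIDDEN, N_HEADS, MAX_POS_EMB, ATTN_DROPOUT_PROB, activation=ENCODER_ACTIVATION)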
    let encoder = torch.nn.TransformerEncoder(encoderLayer, ENCODER_LAYERS)

    do
        this.RegisterComponents()
    
    member this.forward(input_ids:torch.Tensor, token_type_ids:torch.Tensor, position_ids:torch.Tensor,?mask:torch.Tensor) =
        let src = embeddings.forward(input_ids, token_type_ids, position_ids)
        let srcBatchDim2nd = src.permute(1L,0L,2L) //PyTorch transformer requires input as such. See the Transformer docs
        let encoded = match mask with None -> encoder.forward(srcBatchDim2nd) | Some mask -> encoder.forward(srcBatchDim2nd,mask)
        let encodedBatchFst = encoded.permute(1L,0L,2L)
        encodedBatchFst --> pooler
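
The two `permute(1L,0L,2L)` calls in `forward` swap the batch and sequence axes on the way into the encoder and swap them back on the way out. A small sketch of the effect on shapes only, using a hypothetical `permuteShape` helper and a batch of 128 sequences of length 512 as an example:

```fsharp
// Reorder a shape the way permute(dims) reorders a tensor's axes:
// the axis at position i of the result is axis dims.[i] of the input
let permuteShape (dims:int[]) (shape:int64[]) = dims |> Array.map (fun d -> shape.[d])

let batchFirst = [| 128L; 512L; 128L |]                 // [batch; sequence; hidden]
let seqFirst   = permuteShape [|1; 0; 2|] batchFirst     // [sequence; batch; hidden]
let roundTrip  = permuteShape [|1; 0; 2|] seqFirst       // back to [batch; sequence; hidden]

printfn "%A" seqFirst    // [|512L; 128L; 128L|]
printfn "%A" roundTrip   // [|128L; 512L; 128L|]
```

Because the permutation only swaps two axes, applying it twice restores the original layout, which is what the pooler expects.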

## Create a Test BERT Model Instance and Load Pre-Trained TF Weights
The main task here is to find the right mapping between the parameters of the BertModel and those from the TensorFlow BERT checkpoint.

There are several steps involved - the first is to create an empty model and list all the parameters in the model.

In [1]:
let testBert = new BertModel()
//bert.named_modules() 
testBert.named_parameters() |> Seq.map (fun struct(n,x) -> n,x.shape) |> Seq.iter (printfn "%A")

If we compare the names above to the TensorFlow checkpoint names in the beginning, we can find clues as to how the two may be matched. However, this is not straightforward. We need to build some 'infrastructure' to make this work.

### Tensor data access helpers
First off are some utility functions to get data from and set data into PyTorch tensors.

In [1]:
module Tensor = 
    //Note: ensure 't matches tensor datatype otherwise ToArray might crash the app (i.e. exception cannot be caught)
    let private _getData<'t when 't:>ValueType and 't:struct and 't : (new:unit->'t) > (t:torch.Tensor) =
        let s = t.data<'t>()
        s.ToArray()

    let getData<'t when 't:>ValueType and 't:struct and 't : (new:unit->'t)>  (t:torch.Tensor) =
        if t.device_type <> DeviceType.CPU then 
            //use t1 = t.clone()
            use t2 = t.cpu()
            _getData<'t> t2
        else 
            _getData<'t> t
  
    let setData<'t when 't:>ValueType and 't:struct and 't : (new:unit->'t)> (t:torch.Tensor) (data:'t[]) =
        if t.device_type = DeviceType.CPU |> not then failwith "tensor has to be on cpu for setData"        
        let s = t.data<'t>()
        s.CopyFrom(data,0,0L)

### Name map
The *nameMap* is a list of 3-tuples:
1. BertModel parameter name
2. List of TF tensor names that should be mapped to the parameter
3. Post-processing indicator

In PyTorch, the encoder layer combines the query/key/value weights into a single parameter; these are separate in TensorFlow and therefore a list is required to map correctly.

The post-processing indicator (type **PostProc**) specifies the post-processing required for each map entry.

The *nameMap* list names contain wildcards ('#') which will be replaced by a number representing the encoder layer. BERT model versions can have different numbers of transformer layers. The model here has 2 layers, whereas BERT-base has 12 and BERT-large has 24. The wildcard-based mapping scheme can handle an arbitrary number of layers.
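
As a small illustration of the wildcard scheme, the sketch below expands one pair of patterns for a 2-layer model. The `expand` function is a stand-in written for this example; the two patterns are taken from the name map and the checkpoint listing.

```fsharp
// Replace the '#' wildcard with a concrete layer index
let expand (layer:int) (name:string) = name.Replace("#", string layer)

let torchPattern = "encoder.layers.#.linear1.weight"
let tfPattern    = "bert/encoder/layer_#/intermediate/dense/kernel"

// Expand both patterns for a 2-layer model
for layer in 0 .. 1 do
    printfn "%s <- %s" (expand layer torchPattern) (expand layer tfPattern)
// encoder.layers.0.linear1.weight <- bert/encoder/layer_0/intermediate/dense/kernel
// encoder.layers.1.linear1.weight <- bert/encoder/layer_1/intermediate/dense/kernel
```
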

In [1]:
type PostProc = V | H | T | N

let postProc (ts:torch.Tensor list) = function
    | V -> torch.vstack(ResizeArray ts)
    | H -> torch.hstack(ResizeArray ts)
    | T -> ts.Head.T                  //Linear layer weights need to be transposed. See https://github.com/pytorch/pytorch/issues/2159
    | N -> ts.Head

let nameMap =
    [
        "encoder.layers.#.self_attn.in_proj_weight",["encoder/layer_#/attention/self/query/kernel"; 
                                                     "encoder/layer_#/attention/self/key/kernel";    
                                                     "encoder/layer_#/attention/self/value/kernel"],        V

        "encoder.layers.#.self_attn.in_proj_bias",  ["encoder/layer_#/attention/self/query/bias";
                                                     "encoder/layer_#/attention/self/key/bias"; 
                                                     "encoder/layer_#/attention/self/value/bias"],          H

        "encoder.layers.#.self_attn.out_proj.weight", ["encoder/layer_#/attention/output/dense/kernel"],    N
        "encoder.layers.#.self_attn.out_proj.bias",   ["encoder/layer_#/attention/output/dense/bias"],      N


        "encoder.layers.#.linear1.weight",          ["encoder/layer_#/intermediate/dense/kernel"],          T
        "encoder.layers.#.linear1.bias",            ["encoder/layer_#/intermediate/dense/bias"],            N

        "encoder.layers.#.linear2.weight",          ["encoder/layer_#/output/dense/kernel"],                T
        "encoder.layers.#.linear2.bias",            ["encoder/layer_#/output/dense/bias"],                  N

        "encoder.layers.#.norm1.weight",            ["encoder/layer_#/attention/output/LayerNorm/gamma"],   N
        "encoder.layers.#.norm1.bias",              ["encoder/layer_#/attention/output/LayerNorm/beta"],    N

        "encoder.layers.#.norm2.weight",            ["encoder/layer_#/output/LayerNorm/gamma"],             N
        "encoder.layers.#.norm2.bias",              ["encoder/layer_#/output/LayerNorm/beta"],              N

        "embeddings.word_embeddings.weight"         , ["embeddings/word_embeddings"]           , N
        "embeddings.position_embeddings.weight"     , ["embeddings/position_embeddings"]       , N
        "embeddings.token_type_embeddings.weight"   , ["embeddings/token_type_embeddings"]     , N
        "embeddings.LayerNorm.weight"               , ["embeddings/LayerNorm/gamma"]           , N
        "embeddings.LayerNorm.bias"                 , ["embeddings/LayerNorm/beta"]            , N
        "pooler.dense.weight"                       , ["pooler/dense/kernel"]                  , T
        "pooler.dense.bias"                         , ["pooler/dense/bias"]                    , N
    ]

let PREFIX = "bert"
let addPrefix (s:string) = $"{PREFIX}/{s}"
let sub n (s:string) = s.Replace("#",string n)
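
To see what the `T` and `V` post-processing cases do to shapes, here is a minimal sketch using plain 2D arrays instead of tensors. The `transpose` and `vstack` functions below are simplified stand-ins written for this example, not the TorchSharp operations, and the matrix sizes are toy values.

```fsharp
// Transpose a matrix: a TF dense kernel is stored as [in; out],
// whereas a PyTorch Linear weight is stored as [out; in]
let transpose (m:float32[,]) =
    Array2D.init (Array2D.length2 m) (Array2D.length1 m) (fun r c -> m.[c, r])

// Stack matrices with the same column count on top of each other (like vstack)
let vstack (ms:float32[,] list) =
    let cols = Array2D.length2 ms.Head
    let rows = ms |> List.collect (fun m -> [ for r in 0 .. Array2D.length1 m - 1 -> m.[r, *] ])
    Array2D.init rows.Length cols (fun r c -> rows.[r].[c])

let kernel = Array2D.init 3 4 (fun r c -> float32 (r * 4 + c))   // toy [in=3; out=4] kernel
let weight = transpose kernel                                     // [out=4; in=3]
printfn "%d x %d" (Array2D.length1 weight) (Array2D.length2 weight)

let zeros () = Array2D.zeroCreate<float32> 2 2
let qkv = vstack [zeros (); zeros (); zeros ()]                   // three 2x2 blocks become 6x2
printfn "%d x %d" (Array2D.length1 qkv) (Array2D.length2 qkv)
```

With HIDDEN = 128, stacking three 128x128 matrices in this way yields the 384x128 shape of `in_proj_weight`.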


### Name map helpers
Functions to set the parameter values of a PyTorch module from a TF checkpoint and a 'nameMap'.

In [1]:
//create a PyTorch tensor from TF checkpoint tensor data
let toFloat32Tensor (shpdTnsr:CheckpointReader.ShapedTensor) = 
    match shpdTnsr.Tensor with
    | CheckpointReader.TensorData.TdFloat ds -> torch.tensor(ds, dimensions=shpdTnsr.Shape)
    | _                                      -> failwith "TdFloat expected"

//set the value of a single parameter
let performMap (tfMap:Map<string,_>) (ptMap:Map<string,Modules.Parameter>) (torchName,tfNames,postProcType) = 
    let torchParm = ptMap.[torchName]
    let fromTfWts = tfNames |> List.map (fun n -> tfMap.[n] |> toFloat32Tensor) 
    let parmTensor = postProc fromTfWts postProcType
    if torchParm.shape <> parmTensor.shape then failwithf $"Mismatched weights for parameter {torchName}; parm shape: %A{torchParm.shape} vs tensor shape: %A{parmTensor.shape}"
    Tensor.setData<float32> torchParm (Tensor.getData<float32>(parmTensor))

//set the parameter weights of a PyTorch model given checkpoint and nameMap
let loadWeights (model:torch.nn.Module) checkpoint encoderLayers nameMap =
    let tfMap = checkpoint |> Map.ofSeq
    let ptMap = model.named_parameters() |> Seq.map (fun struct(n,m) -> n,m) |> Map.ofSeq

    //process encoder layers
    for l in 0 .. encoderLayers - 1 do
        nameMap
        |> List.filter (fun (p:string,_,_) -> p.Contains("#")) 
        |> List.map (fun (p,tns,postProc) -> sub l p, tns |> List.map (addPrefix >> (sub l)), postProc)
        |> List.iter (performMap tfMap ptMap)

    nameMap
    |> List.filter (fun (p,_,_) -> p.Contains("#") |> not)
    |> List.map (fun (name,tns,postProcType) -> name, tns |> List.map addPrefix, postProcType)
    |> List.iter (performMap tfMap ptMap)

### Load weights into test model instance

In [1]:
loadWeights testBert tensors (int ENCODER_LAYERS) nameMap

### Quick check
Do a quick check - print the value of one of the model parameters and compare that to the equivalent one from TF to see if the values look right.

In [1]:
testBert.get_parameter("encoder.layers.0.self_attn.in_proj_weight") |> Tensor.getData<float32>

index,value
0,-0.05550384
1,0.0120992055
2,-0.008531141
3,-0.019300643
4,0.043453686
5,0.055428784
6,0.029344434
7,-0.10154877
8,-0.07228437
9,-0.00066868344


In [1]:
tensors |> Seq.find (fun (n,_) -> n = "bert/encoder/layer_0/attention/self/query/kernel")

Item1,Item2
bert/encoder/layer_0/attention/self/query/kernel,"{ Shape = [|128L; 128L|]; Tensor = TdFloat [|-0.05550384149f; 0.01209920552f; -0.008531141095f; -0.01930064335f; 0.04345368594f; 0.05542878434f; 0.02934443392f; -0.1015487686f; -0.07228437066f; -0.0006686834386f; ...|] }"
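
The visual comparison can also be done in code. The sketch below compares the leading values of two arrays within a tolerance; `prefixMatches` is a hypothetical helper, and the two short arrays are the first three values copied from the outputs above.

```fsharp
// true when the first n elements of both arrays agree within a tolerance
let prefixMatches (n:int) (tol:float32) (a:float32[]) (b:float32[]) =
    Seq.zip (Seq.truncate n a) (Seq.truncate n b)
    |> Seq.forall (fun (x, y) -> abs (x - y) <= tol)

// First values from the two outputs above
let fromModel      = [| -0.05550384f; 0.0120992055f; -0.008531141f |]
let fromCheckpoint = [| -0.05550384149f; 0.01209920552f; -0.008531141095f |]

printfn "%b" (prefixMatches 3 1e-6f fromModel fromCheckpoint)   // true
```

In the notebook, the same helper could be applied to the full arrays returned by `Tensor.getData<float32>` and the checkpoint reader.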


## Training Data
The training dataset is the [Yelp review dataset](https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz). Assume this data is saved to a local folder as given below.

In [1]:
let foldr = @"C:\yelp_review_polarity_csv"
let testCsv = Path.Combine(foldr,"test.csv")
let trainCsv = Path.Combine(foldr,"train.csv")
if File.Exists testCsv |> not then failwith $"File not found; path = {testCsv}"
printfn "%A" trainCsv

### Load data

In [1]:
open FSharp.Data
type YelpCsv = FSharp.Data.CsvProvider< Sample="a,b", HasHeaders=false, Schema="Label,Text">
type [<CLIMutable>] YelpReview = {Label:int; Text:string}
//need to make labels 0-based so subtract 1
let testSet = YelpCsv.Load(testCsv).Rows |> Seq.map (fun r-> {Label=int r.Label - 1; Text=r.Text}) |> Seq.toArray 
let trainSet = YelpCsv.Load(trainCsv).Rows |> Seq.map (fun r->{Label=int r.Label - 1; Text=r.Text}) |> Seq.toArray
testSet.Display() 

index,Label,Text
0,1,"Contrary to other reviews, I have zero complaints about the service or the prices. I have been getting tire service here for the past 5 years now, and compared to my experience with places like Pep Boys, these guys are experienced and know what they're doing. \nAlso, this is one place that I do not feel like I am being taken advantage of, just because of my gender. Other auto mechanics have been notorious for capitalizing on my ignorance of cars, and have sucked my bank account dry. But here, my service and road coverage has all been well explained - and let up to me to decide. \nAnd they just renovated the waiting room. It looks a lot better than it did in previous years."
1,0,"Last summer I had an appointment to get new tires and had to wait a super long time. I also went in this week for them to fix a minor problem with a tire they put on. They \""fixed\"" it for free, and the very next morning I had the same issue. I called to complain, and the \""manager\"" didn't even apologize!!! So frustrated. Never going back. They seem overpriced, too."
2,1,"Friendly staff, same starbucks fair you get anywhere else. Sometimes the lines can get long."
3,0,"The food is good. Unfortunately the service is very hit or miss. The main issue seems to be with the kitchen, the waiters and waitresses are often very apologetic for the long waits and it's pretty obvious that some of them avoid the tables after taking the initial order to avoid hearing complaints."
4,1,"Even when we didn't have a car Filene's Basement was worth the bus trip to the Waterfront. I always find something (usually I find 3-4 things and spend about $60) and better still, I am always still wearing the clothes and shoes 3 months later. \n\nI kind of suspect this is the best shopping in Pittsburgh; it's much better than the usual department stores, better than Marshall's and TJ Maxx and better than the Saks downtown, even when it has a sale. Selection, bargains AND quality.\n\nI like this Filene's better than Gabriel Brothers, which are harder to get to. Gabriel Brothers are a real discount shopper's challenge and I'm afraid I didn't live in Pittsburgh long enough to develop the necessary skills . . . Filene's was still up and running in June 2007 when I left town."
5,1,"Picture Billy Joel's \""Piano Man\"" DOUBLED mixed with beer, a rowdy crowd, and comedy - Welcome to Sing Sing! A unique musical experience found in Homestead.\n\nIf you're looking to grab a bite to eat or a beer, come on in! Serving food and brews from Rock Bottom Brewery, Sing Sing keeps your tummy full while you listen to two (or more) amazingly talented pianists take your musical requests. They'll play anything you'd like, for tips of course. Wanting to hear Britney Spears? Toto? Duran Duran? Yep, they play that... new or old.\n\nThe crowd makes the show, so make sure you come ready for a good time. If the crowd is dead, it's harder for the Guys to get a reaction. If you're wanting to have some fun, it can be a GREAT time! It's the perfect place for Birthday parties - especially if you want to embarrass a friend. The guys will bring them up to the pianos and perform a little ditty. For being a good sport, you get the coveted Sing Sing bumper sticker. Now who wouldn't want that?\n\nDueling Pianos and brews... time to Shut Up & Sing Sing!"
6,0,Mediocre service. COLD food! Our food waited so long the lettuce & pickles wilted. Bland food. Crazy overpriced. Long waits in the arcade. 1 beer per hour maximum. Avoid at all costs. Fair manager.
7,0,"Ok! Let me tell you about my bad experience first. I went to D&B last night for a post wedding party - which, side note, is a great idea!\n\nIt was around midnight and the bar wasn't really populated. There were three bartenders and only one was actually making rounds to see if anyone needed anything. The two other bartenders were chatting on the far side of the bar that no one was sitting at. Kind of counter productive if you ask me. \n\nI stood there for about 5 minutes, which for a busy bar is fine but when I am the only one with my card out then, it just seems a little ridiculous. I made eye contact with the one girl twice and gave her a smile and she literally turned away. I finally had to walk to them to get their attention. I was standing right in front of them smiling and they didn't ask if i need anything. I finally said, \""Are you working?\"" and they gave each other a weird look. I felt like i was the crazy one. I asked for a beer/got the beer.\n\nIn between that time, the other bartender brought food over and set it down. She took a fry from the plate (right in front of me) and then served it to someone on the other side of the bar. What the hell! I felt like i was in some grimy bar in out in the sticks - not an established D&B. \n\nI was just really turned off from that experience. \n\nThe good is that D&B provides a different type of entertainment when you want to mix things up. I remember going here with my grandparents when I was a kid and it was the best treat ever! We would eat at the restaurant and then spend hours playing games. This place holds some really good memories for me. \n\nIt's a shame that my experience last night has spoiled the high standards I held for it."
8,0,"I used to love D&B when it first opened in the Waterfront, but it has gone down hill over the years. The games are not as fun and do not give you as many tickets and the prizes have gotten cheaper in quality. It takes a whole heck of a lot of tickets for you to even get a pencil! The atmosphere is okay but it used to be so much better with the funnest games and diverse groups of people! Now, it is run down and many of the games are app related games (Fruit Ninja) and 3D Experience rides. With such \""games\"", you can't even earn tickets and they take a lot of tokens! Last time I went, back in the winter, many of the games were broken, which made for a negative player experience. I would go to D&B to play some games again in the future, but it is no longer one of my favorite places to go due to the decline of fun games where you can earn tickets."
9,1,"Like any Barnes & Noble, it has a nice comfy cafe, and a large selection of books. The staff is very friendly and helpful. They stock a decent selection, and the prices are pretty reasonable. Obviously it's hard for them to compete with Amazon. However since all the small shop bookstores are gone, it's nice to walk into one every once in a while."


### Calculate the number of label classes

In [1]:
let classes = trainSet |> Seq.map (fun x->x.Label) |> set
classes.Display()
let TGT_LEN = classes.Count |> int64

index,value
0,0
1,1


### Batch processing
Helpers for serving minibatches of tensors for training and evaluation

In [1]:
let BATCH_SIZE = 128
let trainBatches = trainSet |> Seq.chunkBySize BATCH_SIZE
let testBatches  = testSet  |> Seq.chunkBySize BATCH_SIZE
open BERTTokenizer
let vocabFile = @"C:\s\hack\small_bert_bert_uncased_L-2_H-128_A-2_2\assets\vocab.txt"
let vocab = Vocabulary.loadFromFile vocabFile

let position_ids = torch.arange(MAX_POS_EMB).expand(int64 BATCH_SIZE,-1L).``to``(device)

//convert a batch to input and output (X, Y) tensors
let toXY (batch:YelpReview[]) = 
    let xs = batch |> Array.map (fun x-> Featurizer.toFeatures vocab true (int MAX_POS_EMB) x.Text "")
    let d_tkns      = xs |> Seq.collect (fun f -> f.InputIds )  |> Seq.toArray
    let d_tkn_typs  = xs |> Seq.collect (fun f -> f.SegmentIds) |> Seq.toArray
    let tokenIds = torch.tensor(d_tkns,     dtype=torch.int).view(-1L,MAX_POS_EMB)        
    let sepIds   = torch.tensor(d_tkn_typs, dtype=torch.int).view(-1L,MAX_POS_EMB)
    let Y = torch.tensor(batch |> Array.map (fun x->x.Label), dtype=torch.int64).view(-1L)
    (tokenIds,sepIds),Y
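
One detail of `Seq.chunkBySize` matters for the code that follows: the last chunk may be smaller than the requested size. A quick illustration with ten items and a chunk size of 4:

```fsharp
let batches = [1 .. 10] |> Seq.chunkBySize 4 |> Seq.toArray
batches |> Array.iter (fun b -> printfn "%d items: %A" b.Length b)
// 4 items: [|1; 2; 3; 4|]
// 4 items: [|5; 6; 7; 8|]
// 2 items: [|9; 10|]
```

This is why `toXY` sizes its tensors with `view(-1L, MAX_POS_EMB)` rather than a fixed batch dimension, and why the position ids need adjusting for the final batch during training.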

### Quick model check
Evaluate the test BERT instance with just the first batch of the training data to ensure it can produce the expected output.
The expected output is a tensor with the shape BATCH_SIZE x HIDDEN.

In [1]:
testBert.Eval()
let (_tkns,_seps),_ = trainBatches |> Seq.head |> toXY
//_tkns.shape
//_tkns |> Tensor.getData<int64>
let _testOut = testBert.forward(_tkns,_seps,position_ids.cpu()) //test is on cpu
_testOut.shape.Display()

index,value
0,128
1,128


### Extend for classification
Here the PyTorch multi-class classification method is used: the model outputs one score (logit) per class and is trained with cross-entropy loss. The number of classes is only two for this data, but the multi-class method is more general and can be easily extended to more than two classes.

In [1]:
type BertClassification() as this = 
    inherit torch.nn.Module("BertClassification")

    let bert = new BertModel()
    let proj = torch.nn.Linear(HIDDEN,TGT_LEN)

    do
        this.RegisterComponents()
        this.LoadBertPretrained()

    member _.LoadBertPretrained() =
        loadWeights bert tensors (int ENCODER_LAYERS) nameMap
    
    member _.forward(tknIds,sepIds,pstnIds) =
        use encoded = bert.forward(tknIds,sepIds,pstnIds)
        encoded --> proj 
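
The output of `proj` is one raw score (logit) per class. As a rough sketch of how such scores turn into a loss and a prediction, the plain F# code below computes the cross-entropy of a single example and picks the highest-scoring class. The functions `crossEntropy` and `predict` and the example logits are illustrative only; the notebook itself relies on the loss function supplied by TorchSharp, which averages over the batch.

```fsharp
// Cross-entropy of one example from raw class scores (logits):
// loss = log(sum_j exp(z_j)) - z_target, with the max subtracted for numerical stability
let crossEntropy (logits:float[]) (target:int) =
    let m = Array.max logits
    let logSumExp = m + log (logits |> Array.sumBy (fun z -> exp (z - m)))
    logSumExp - logits.[target]

// predicted class = index of the largest logit
let predict (logits:float[]) = logits |> Array.indexed |> Array.maxBy snd |> fst

let logits = [| 2.0; -1.0 |]
printfn "loss if label is 0: %f" (crossEntropy logits 0)
printfn "loss if label is 1: %f" (crossEntropy logits 1)
printfn "predicted class: %d" (predict logits)
```

The loss is small when the largest logit belongs to the true class and grows as the model puts more weight on the wrong class.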

## Training and evaluation code

In [1]:
let _model = new BertClassification()
_model.``to``(device)
let _loss = torch.nn.functional.cross_entropy_loss()
let mutable EPOCHS = 1
let mutable verbose = true
let gradCap = 0.1f
let gradMin,gradMax = (-gradCap).ToScalar(),  gradCap.ToScalar()
let opt = torch.optim.Adam(_model.parameters (), 0.001, amsgrad=true)       

let class_accuracy (y:torch.Tensor) (y':torch.Tensor) =
    use i = y'.argmax(1L)
    let i_t = Tensor.getData<int64>(i)
    let m_t = Tensor.getData<int64>(y)
    Seq.zip i_t m_t 
    |> Seq.map (fun (a,b) -> if a = b then 1.0 else 0.0) 
    |> Seq.average

//adjustment for end of data when full batch may not be available
let adjPositions currBatchSize = if int currBatchSize = BATCH_SIZE then position_ids else torch.arange(MAX_POS_EMB).expand(currBatchSize,-1L).``to``(device)

let dispose ls = ls |> List.iter (fun (x:IDisposable) -> x.Dispose())

//run a batch through the model; return true output, predicted output and loss tensors
let processBatch ((tkns:torch.Tensor,typs:torch.Tensor), y:torch.Tensor) =
    use tkns_d = tkns.``to``(device)
    use typs_d = typs.``to``(device)
    let y_d    = y.``to``(device)            
    let pstns  = adjPositions tkns.shape.[0]
    if device <> torch.CPU then //these were copied so ok to dispose old tensors
        dispose [tkns; typs; y]
    let y' = _model.forward(tkns_d,typs_d,pstns)
    let loss = _loss.Invoke(y', y_d)   
    y_d,y',loss

//evaluate on test set; return cross-entropy loss and classification accuracy
let evaluate e =
    _model.Eval()
    let lss =
        testBatches 
        |> Seq.map toXY
        |> Seq.map (fun batch ->
            let y,y',loss = processBatch batch
            let ls = loss.ToDouble()
            let acc = class_accuracy y y'            
            dispose [y;y';loss]
            GC.Collect()
            ls,acc)
        |> Seq.toArray
    let ls  = lss |> Seq.averageBy fst
    let acc = lss |> Seq.averageBy snd
    ls,acc

let mutable e = 0
let train () =
    
    while e < EPOCHS do
        e <- e + 1
        _model.Train()
        let losses = 
            trainBatches 
            |> Seq.map toXY
            |> Seq.mapi (fun i batch ->                 
                opt.zero_grad ()   
                let y,y',loss = processBatch batch
                let ls = loss.ToDouble()  
                loss.backward()
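                //clamp_ modifies the gradient in place; clip returns a new tensor and leaves the gradient unchanged
                _model.parameters() |> Array.iter (fun t -> t.grad().clamp_(gradMin,gradMax) |> ignore)                            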
                use  t_opt = opt.step ()
                if verbose && i % 100 = 0 then
                    let acc = class_accuracy y y'
                    printfn $"Epoch: {e}, minibatch: {i}, ce: {ls}, accuracy: {acc}"                            
                dispose [y;y';loss]
                GC.Collect()
                ls)
            |> Seq.toArray

        let evalCE,evalAcc = evaluate e
        printfn $"Epoch {e} train: {Seq.average losses}; eval acc: {evalAcc}"

    printfn "Done train"

let runner () = async { do train () } 
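
The gradient capping used in the training loop is element-wise value clipping: every gradient element is limited to the range [-gradCap, gradCap]. A minimal sketch of that operation on a plain array, with `clipValues` as a hypothetical helper:

```fsharp
// Limit every element to the range [-cap, cap]
let clipValues (cap:float32) (grads:float32[]) =
    grads |> Array.map (fun g -> max (-cap) (min cap g))

let clipped = clipValues 0.1f [| -0.5f; 0.05f; 0.3f |]
printfn "%A" clipped   // the values -0.1, 0.05 and 0.1
```

Values already inside the range pass through unchanged, so clipping only affects unusually large gradient elements.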

### Run training

In [1]:
runner() |> Async.RunSynchronously
(*

sample output:
...
Epoch: 2, minibatch: 4200, ce: 0.14490307867527008, accuracy: 0.9375
Epoch: 2, minibatch: 4300, ce: 0.04636668041348457, accuracy: 0.984375
Epoch 2 train: 0.15354100534277304; eval acc: 0.9376728595478595
*)