In [1]:
import io
import os
import re
import shutil
import string
import tensorflow as tf
import numpy as np
import pandas as pd
from acquire import prep_and_split_data
from tensorflow.keras import layers
from tensorflow import keras

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

Create a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) using [tf.keras.utils.text_dataset_from_directory](https://www.tensorflow.org/api_docs/python/tf/keras/utils/text_dataset_from_directory). You can read more about using this utility in this [text classification tutorial](https://www.tensorflow.org/tutorials/keras/text_classification).

In [2]:
train, validate, test = prep_and_split_data()

Number of rows in training set: 37931
Number of rows in validation set: 2108
Number of rows in test set: 2107


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [3]:
b = pd.Series(' '.join(train.basic_clean_v2).split('.'))

In [4]:
length_of_sentence = []
for sentence in b:
    length_of_sentence.append(len(sentence.split()))


In [5]:
# Use average length of sentence as embedding dimensions
sum(length_of_sentence) / len(length_of_sentence)

15.71804038700181

# Takeaways
The average sentence length is about 15.7 words. I'm calculating it to use as a guide for the embedding output dimension.
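The computation in cells 3 to 5 can be condensed into one helper. The sketch below runs on two made-up notes rather than the notebook's data, and `sentence_lengths` is a hypothetical name. Unlike cell 4, it skips empty fragments (for example the one after a trailing period), and it also reports the median, since a few very long sentences can pull the mean up.

```python
from statistics import mean, median

def sentence_lengths(notes):
    """Return the word count of every '.'-delimited sentence in `notes`."""
    joined = " ".join(notes)
    # Skip empty fragments so they do not count as zero-length sentences.
    return [len(s.split()) for s in joined.split(".") if s.strip()]

# Made-up notes, for illustration only.
toy_notes = ["pain is burning . no nausea .", "denies fever . works in construction ."]
lengths = sentence_lengths(toy_notes)
print(lengths, mean(lengths), median(lengths))
```

Note that splitting on `.` also breaks at abbreviations such as `mr .`, so the figure is only a rough estimate of sentence length.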

In [6]:
# Each call reads one directory whose subfolders are the classes (cases).
train_dataset = tf.keras.utils.text_dataset_from_directory(
    '/Users/brent/Leashed/student_notes/', batch_size=32)
validate_dataset = tf.keras.utils.text_dataset_from_directory(
    '/Users/brent/Leashed/validate/', batch_size=32)
test_dataset = tf.keras.utils.text_dataset_from_directory(
    '/Users/brent/Leashed/test/', batch_size=32)

Found 37931 files belonging to 10 classes.


2022-03-04 21:52:57.196743: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Found 2118 files belonging to 10 classes.
Found 2107 files belonging to 10 classes.
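`text_dataset_from_directory` infers labels from the directory layout: each subdirectory of the root is one class, and each text file inside it is one example. The sketch below uses only the standard library to build a tiny throwaway layout of that shape. The folder names `case_0` and `case_1` are made up; the real folders under `/Users/brent/Leashed/` are not shown in this notebook.

```python
import os
import tempfile

# Build root/<class_name>/<n>.txt in a temporary directory.
root = tempfile.mkdtemp()
toy = {"case_0": ["note a", "note b"], "case_1": ["note c"]}
for class_name, notes in toy.items():
    os.makedirs(os.path.join(root, class_name))
    for i, note in enumerate(notes):
        with open(os.path.join(root, class_name, f"{i}.txt"), "w") as f:
            f.write(note)

# Integer labels follow the alphanumeric order of the subdirectory names.
class_names = sorted(os.listdir(root))
n_files = sum(len(os.listdir(os.path.join(root, c))) for c in class_names)
print(class_names, n_files)
```

Pointing `text_dataset_from_directory` at `root` should then report 3 files belonging to 2 classes, with `case_0` mapped to label 0 and `case_1` to label 1.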


### Take a look at a few student notes and their labels (the case they belong to)

In [7]:
for text_batch, label_batch in train_dataset.take(1):
  for i in range(3):
    print(label_batch[i].numpy(), text_batch.numpy()[i])

3 b"mr .  hamilton is a 35 yo man who presents with epigastric pain for the last 2 months .  the pain is burning and 'knawing' ,  5 10 ,  sometimes present and non-radiating .  he has tried using tums to relieve the pain ,  which worked initially but no longer does .  nothing makes it worse .  he endorses some nausea ,  but has not vomited .  he also states that his stool has become darker ,  but no changes to bowel habits- goes 1x per day .  he is most concerned about this because the pain has been causing him to lose sleep and is worried he might make a mistake at work .  he denies travel or sick contacts .  no headache ,  cp ,  sob .   pmh :  chronic back pain and spasms x10 years 2 2 job .  allergies- none ,  meds- motrin 2x200mg weekly prn ,  tums prn ,  no surgeries  fhx :  uncle with ulcer ,  otherwise healthy  shx :  smoker x20 years (10-20pack year) ,  2-3 beer on weekend with friends (cutting back) ,  denies illicit substance use .  works in construction ,  divorced ,  not se

### Create TextVectorization Layer (aka Text Encoder)

In [8]:
from tensorflow.keras.layers import TextVectorization
import string
import re

# Define a custom standardization function to be incorporated into the TextVectorization layer.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )

# Set maximum vocabulary size for the model (the 3000 most frequent tokens, including padding and [UNK]).
max_vocab = 3000
# Set the embedding output dimension (length of each word vector).
embedding_dim = 28
# Pad or truncate every note to this many tokens.
sequence_length = 20
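To see what `custom_standardization` does to a note, here is a plain-Python mirror of the same three steps (lowercase, replace `<br />`, strip punctuation). It is only an illustration: `standardize_py` is a hypothetical helper, and it works on Python strings rather than tensors.

```python
import re
import string

def standardize_py(text):
    """Plain-Python imitation of custom_standardization for a single string."""
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", " ")
    # Remove every ASCII punctuation character.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", no_breaks)

cleaned = standardize_py("Pt. denies SOB;<br />no CP!")
print(cleaned)
print(standardize_py("20-year-old"))
```

Because punctuation is deleted rather than replaced by a space, hyphenated tokens such as `20-year-old` collapse into a single token.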

In [9]:
# Create text encoding layer.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_vocab,
    output_mode='int',
    output_sequence_length=sequence_length,
)
# Build the text encoder's vocabulary from the training text only (labels are dropped).
vectorize_layer.adapt(train_dataset.map(lambda text, label: text))

### Check vocabulary of vectorization layer.

In [10]:
vocab = np.array(vectorize_layer.get_vocabulary())


In [11]:
print(vocab[:5], len(vocab))

['' '[UNK]' 'and' 'no' 'with'] 3000
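The printed vocabulary shows the two reserved entries: index 0 is the padding token `''` and index 1 is the out-of-vocabulary token `'[UNK]'`. The remaining entries are ordered by frequency, and both reserved slots count toward `max_tokens`. Below is a toy pure-Python imitation of the `int` output mode under those assumptions. `build_vocab` and `encode` are hypothetical helpers, and tie-breaking between equally frequent words may differ from the real layer.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    counts = Counter(word for t in texts for word in t.split())
    # Two slots are reserved, so only max_tokens - 2 real words fit.
    top = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + top

def encode(text, vocab, sequence_length):
    index = {w: i for i, w in enumerate(vocab)}
    # Unknown words map to 1; the result is truncated, then padded with 0.
    ids = [index.get(w, 1) for w in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_vocab = build_vocab(["no pain no fever", "no pain"], max_tokens=4)
encoded = encode("no fever today", toy_vocab, sequence_length=5)
print(toy_vocab)
print(encoded)
```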


## Using the Embedding layer

Keras makes it easy to use word embeddings. Take a look at the [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.

The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer. 

This layer can only be used on positive integer inputs of a fixed range. The [tf.keras.layers.TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization), [tf.keras.layers.StringLookup](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup), and [tf.keras.layers.IntegerLookup](https://www.tensorflow.org/api_docs/python/tf/keras/layers/IntegerLookup) preprocessing layers can help prepare inputs for an Embedding layer.

This layer accepts [tf.Tensor](https://www.tensorflow.org/api_docs/python/tf/Tensor) and [tf.RaggedTensor](https://www.tensorflow.org/api_docs/python/tf/RaggedTensor) inputs. It cannot be called with [tf.SparseTensor](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor) input.

In [12]:
# Embed a 3000-word vocabulary into 28 dimensions.
embedding_layer = tf.keras.layers.Embedding(max_vocab, embedding_dim)

When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer).

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:

In [13]:
result = embedding_layer(tf.constant([1,2,3,5,2]))
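The lookup can be imitated with plain NumPy indexing. In this sketch a random matrix `table` stands in for the layer's weights (an assumption for illustration; the real weights live inside `embedding_layer`), and the same indices as in the cell above select rows from it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the embedding weights: one 28-dimensional row per vocabulary index.
table = rng.normal(size=(3000, 28))

ids = np.array([1, 2, 3, 5, 2])
vectors = table[ids]  # row lookup: one vector per input integer

print(vectors.shape)
# Index 2 appears twice, so the second and fifth output rows are identical.
print(np.array_equal(vectors[1], vectors[4]))
```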

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape `(samples, sequence_length)`, where each entry is a sequence of integers. It can embed sequences of variable lengths: you could feed the embedding layer above batches with shapes `(32, 10)` (a batch of 32 sequences of length 10) or `(64, 15)` (a batch of 64 sequences of length 15).

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`, where `N` is the embedding dimension.

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape `(samples, sequence_length, embedding_dimensionality)`. To convert from this sequence of variable length to a fixed representation, there are a variety of standard approaches. You could use an RNN, attention, or a pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the simplest. The Text Classification with an RNN tutorial is a good next step.
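As a shape check, the pooling step can be sketched in NumPy. This is a stand-in for what `GlobalAveragePooling1D` computes when no mask is supplied: the mean over the sequence axis. The array values are arbitrary.

```python
import numpy as np

N = 28  # embedding dimension, matching embedding_dim in this notebook
# Pretend embedding output for a (2, 3) batch of token ids.
embedded = np.arange(2 * 3 * N, dtype=float).reshape(2, 3, N)

# Average over the sequence axis to get one fixed-size vector per example.
pooled = embedded.mean(axis=1)
print(embedded.shape, pooled.shape)
```

Without a mask, padded positions are averaged in with the real tokens, which may matter for notes much shorter than `sequence_length`.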

## Create the classification model.

In [14]:
# Initiate Sequential model
model = tf.keras.Sequential([
    # Add text encoding layer:
    vectorize_layer,
    # Add Embedding layer:
    embedding_layer,
    GlobalAveragePooling1D(),
    tf.keras.layers.Dense(14, activation='softmax'),
    # Process vector representation to a single logit as the classification output.
    tf.keras.layers.Dense(1)
])

## Compile the Keras model to configure the training process:

In [15]:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

## Train the model

In [16]:
history = model.fit(train_dataset, epochs=10,
                    validation_data=validate_dataset,
                    validation_steps=2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10

KeyboardInterrupt: 

In [33]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

Test Loss: nan
Test Accuracy: 0.05315614491701126


In [34]:
sample_text = ("hpi :  stephanie madden ,  a 20-year-old female ,  has come to the doctor's office complaining of a headache .    - woke up yesterday due this headache ,  constant ,  progressive  - says that is the worst headache she ever had  - ibuprofen and tylenol did not work  - localized all over the head ,  no radiation ,  no recent trauma  - no allev factors ,  aggrav by walking around  - reports subjective fever ,  nausea ,  3 epidosed of vomiting ,  phonophobia and runny nose  ros :  denies loc ,  denies prodroms ,  wt loss ,  insect bite ,  eye discharge or phonophobia  pmh :  none ,  no previous epidoes like that    psh :  none  ob gyn :  g0o0 ,  lmp 2 weeks ago  meds :  ocp ,  and otc above  all :  none  fh :  mother with headache  sh :  etoh occas ,  no smoking ,  uses marijuana 3 week    ")
predictions = model.predict(np.array([sample_text]))

In [36]:
print(predictions)

[[nan]]
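A `nan` loss together with very low accuracy is consistent with a mismatch between the model head, the loss, and the labels, though this notebook does not isolate the cause. `text_dataset_from_directory` yields integer class labels by default, and the data has 10 classes, whereas the model above ends in a single unit and is compiled with `categorical_crossentropy`, which expects one-hot targets. One common arrangement, sketched below as an assumption rather than a verified fix for this notebook, is a final `Dense` layer with one logit per class and `SparseCategoricalCrossentropy(from_logits=True)`. To stay self-contained, the sketch trains on random token ids instead of the student notes and leaves out the `TextVectorization` layer.

```python
import numpy as np
import tensorflow as tf

num_classes = 10       # "Found ... files belonging to 10 classes"
max_vocab = 3000
embedding_dim = 28
sequence_length = 20

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_vocab, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(14, activation="relu"),
    # One logit per class; the softmax is applied inside the loss.
    tf.keras.layers.Dense(num_classes),
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer="adam",
    metrics=["accuracy"],
)

# Random stand-in data: integer token ids and integer class labels.
rng = np.random.default_rng(0)
token_ids = rng.integers(0, max_vocab, size=(64, sequence_length))
labels = rng.integers(0, num_classes, size=(64,))

history = model.fit(token_ids, labels, epochs=1, verbose=0)
logits = model.predict(token_ids[:2], verbose=0)
print(logits.shape, history.history["loss"][0])
```

With this head, `tf.argmax(logits, axis=-1)` gives the predicted class index for each note.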


In [72]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_5 (TextV  (None, 20)               0         
 ectorization)                                                   
                                                                 
 embedding_2 (Embedding)     (None, 20, 28)            84000     
                                                                 
 global_average_pooling1d (G  (None, 28)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense_6 (Dense)             (None, 14)                406       
                                                                 
 dense_7 (Dense)             (None, 1)                 15        
                                                                 
Total params: 84,421
Trainable params: 84,421
Non-trai

In [76]:
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

In [77]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
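The addition in `call` relies on broadcasting: the position embeddings have shape `(maxlen, embed_dim)` and are added to every example in the `(batch, maxlen, embed_dim)` token embeddings. A NumPy sketch with toy sizes and random tables standing in for the two `Embedding` layers (all values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, maxlen, embed_dim = 50, 6, 4  # toy sizes, not the notebook's

token_table = rng.normal(size=(vocab_size, embed_dim))
pos_table = rng.normal(size=(maxlen, embed_dim))

token_ids = rng.integers(0, vocab_size, size=(2, maxlen))  # (batch, maxlen)
tokens = token_table[token_ids]                             # (2, 6, 4)
positions = pos_table[np.arange(maxlen)]                    # (6, 4)

# The same position vectors are broadcast across the batch axis.
combined = tokens + positions
print(combined.shape)
```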

In [78]:
vocab_size = 5000
maxlen = 130

In [81]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer

inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
# One output unit per class; the dataset has 10 classes.
outputs = layers.Dense(10, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

In [84]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(train_dataset, epochs=10,
                    validation_data=validate_dataset,
                    validation_steps=2)

Epoch 1/10


ValueError: in user code:

    File "/usr/local/anaconda3/lib/python3.8/site-packages/keras/engine/training.py", line 1021, in train_function  *
        return step_function(self, iterator)
    File "/usr/local/anaconda3/lib/python3.8/site-packages/keras/engine/training.py", line 1010, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/anaconda3/lib/python3.8/site-packages/keras/engine/training.py", line 1000, in run_step  **
        outputs = model.train_step(data)
    File "/usr/local/anaconda3/lib/python3.8/site-packages/keras/engine/training.py", line 859, in train_step
        y_pred = self(x, training=True)
    File "/usr/local/anaconda3/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None

    ValueError: Exception encountered when calling layer "transformer_block_1" (type TransformerBlock).
    
    in user code:
    
        File "/var/folders/jl/s3ptdwdx55v01d2g2wrs7vdc0000gn/T/ipykernel_18191/4209189007.py", line 14, in call  *
            attn_output = self.att(inputs, inputs)
        File "/usr/local/anaconda3/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler  **
            raise e.with_traceback(filtered_tb) from None
    
        ValueError: Exception encountered when calling layer "query" (type EinsumDense).
        
        Shape must be rank 3 but is rank 2
        	 for 0th input and equation: abc,cde->abde for '{{node model/transformer_block_1/multi_head_attention_1/query/einsum/Einsum}} = Einsum[N=2, T=DT_FLOAT, equation="abc,cde->abde"](model/token_and_position_embedding_1/add, model/transformer_block_1/multi_head_attention_1/query/einsum/Einsum/ReadVariableOp)' with input shapes: [?,32], [32,2,32].
        
        Call arguments received:
          • inputs=tf.Tensor(shape=(None, 32), dtype=float32)
    
    
    Call arguments received:
      • inputs=tf.Tensor(shape=(None, 32), dtype=float32)
      • training=True
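The traceback shows the attention layer receiving a rank-2 tensor of shape `(None, 32)`. That is what an `Embedding` produces from a rank-1 input, which suggests the model was fed the raw string batches from `train_dataset` (one string per example) rather than integer sequences of shape `(batch, maxlen)`. A possible remedy, sketched here on toy data as an assumption rather than a tested fix for this notebook, is to map the dataset through a `TextVectorization` layer whose `output_sequence_length` equals `maxlen` before calling `fit`. The two notes and their labels below are made up.

```python
import tensorflow as tf

vocab_size = 5000
maxlen = 130

# Made-up stand-in for the (text, label) batches of train_dataset.
texts = tf.constant(["no fever no cough", "burning epigastric pain"])
labels = tf.constant([0, 1])
raw_ds = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=maxlen,
)
vectorizer.adapt(raw_ds.map(lambda text, label: text))

# Convert each string batch into integer ids of shape (batch, maxlen).
int_ds = raw_ds.map(lambda text, label: (vectorizer(text), label))
for token_ids, label_batch in int_ds.take(1):
    batch_shape = tuple(token_ids.shape)
    print(batch_shape)
```

Fed batches of this shape, `TokenAndPositionEmbedding` returns a `(batch, maxlen, embed_dim)` tensor, which is the rank-3 input the attention layer expects.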
