# Week 4 Assignment
- **Assignment Description** The overall goal of this assignment is to use an LSTM to recognize named entities.

## Data File Description

The dataset is about adverse reactions to pharmaceutical psychiatric treatment. It covers patients' descriptions of the effectiveness of, and the adverse drug events associated with, four psychiatric medications. It is the same dataset used in the Week 3 assignment.

For this assignment, you will need the files listed below.
```
/review_data
    REVIEW_LABELSEQ.txt
    REVIEW_TEXT.txt
    TEST_REVIEW_TEXT.txt
```

1. `REVIEW_TEXT.txt`
> The training file contains 4,744 sentences from 711 reviews. Each patient review is split into sentences, and the file has two columns. One column is the ID for each sentence, labeled as `<Medication>.<Post_number>.<Sentence_number>`. The other column is the sentence itself.
  > Here is an example row of this dataset.

  >  `8<tab>Tobacco cravings were rampant .`
  


2. `REVIEW_LABELSEQ.txt`
> The file has two columns, ID and TAGSEQ. ID is the same unique string as in the previous file. TAGSEQ is a sequence of space-separated named-entity tags, each of which is `O`, `B-AE`, `I-AE`, `B-SSI`, or `I-SSI`. These tags are explained below.
  - `AE`: adverse events (entity of interest)
  - `SSI`: signs, symptoms, and indications (entity of interest)
  - `B-`: beginning of a tagged named entity
  - `I-`: inside a tagged named entity
  - `O` : outside of any tagged named entity

  > Here is an example row of this file.

  >  `8    B-AE I-AE I-AE I-AE O `
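
To make the tagging scheme concrete, here is a minimal sketch that groups `B-`/`I-` tags back into entity spans. It assumes the two example rows above (both with ID `8`) belong to the same sentence; the helper name `extract_entities` is hypothetical and is not part of the assignment template.

```python
def extract_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = "Tobacco cravings were rampant .".split(" ")
tags = "B-AE I-AE I-AE I-AE O".split(" ")
print(extract_entities(tokens, tags))  # [('AE', 'Tobacco cravings were rampant')]
```

The whole phrase is one adverse event: the `B-AE` tag opens it, the three `I-AE` tags extend it, and the final `O` on the period closes it.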



## Code Template
Please modify code templates as much as you want. If you find using a function would be more useful, please use it!

## Step 1: Read data

In [1]:
def read_file(f):
    """Read a tab-separated file, skipping its first (header) row.

    Each remaining row is split into two parts: ID and data.
    Data is the sentence or tag sequence, split into a list on spaces.
    """
    # Use a context manager so the file handle is closed after reading.
    with open(f, 'r') as fh:
        data = fh.readlines()[1:]
    row_id = [i.split('\t')[0].strip() for i in data]
    data = [i.split('\t')[1].strip().split(' ') for i in data]
    return row_id,data

In [2]:
row_id_text, texts = read_file('./data/REVIEW_TEXT.txt')
row_id_tags, tags = read_file('./data/REVIEW_LABELSEQ.txt')

#texts = texts[:5000]  
#tags = tags[:5000]

## Step 2: Get GloVe word embeddings

You do not need to have a deep understanding of word embeddings! Roughly speaking, these word embeddings transform each token into a vector of numbers, based on the token's distribution in a (separate) large dataset. So, these embeddings were already trained by someone else! We can think of the values in the vector as features of a word. The ordered set of word feature vectors makes up the input for each sentence. The input to the models we will use is just a different kind of feature compared to what we used for the last assignment. We are simply importing these embeddings because the deep learning methods will perform much better using them than treating each word as an arbitrary categorical value. Using embeddings like these is a standard way of doing natural language processing tasks!
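
As a rough illustration of "one feature vector per token", the sketch below looks up each word of a sentence in a tiny embedding table. The three-dimensional vectors and the names `toy_embeddings` and `embed_sentence` are made up for this example; the GloVe file used in this notebook has 100 dimensions per word.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "tobacco": np.array([0.1, 0.3, -0.2]),
    "cravings": np.array([0.4, -0.1, 0.0]),
    "were": np.array([0.0, 0.2, 0.1]),
}


def embed_sentence(tokens, embeddings, dim):
    """Stack one vector per token; unknown tokens get an all-zero vector."""
    return np.stack([embeddings.get(t.lower(), np.zeros(dim)) for t in tokens])


sentence = ["Tobacco", "cravings", "were", "rampant"]
features = embed_sentence(sentence, toy_embeddings, dim=3)
print(features.shape)  # (4, 3): one row per token, one column per feature
print(features[3])     # "rampant" is not in the toy table, so its row is all zeros
```

The model never sees the words themselves, only this ordered matrix of numbers.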

In [None]:
!pip install tensorflow
import tensorflow

In [3]:
from tensorflow.keras.layers import TextVectorization
import numpy as np
from tensorflow import convert_to_tensor
from functools import reduce
import operator
from tensorflow import keras
import tensorflow as tf
vectorizer = TextVectorization(max_tokens=8000, pad_to_max_tokens=False)
vectorizer.adapt(convert_to_tensor(reduce(operator.concat, texts)))


### Vocabulary

In [None]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
print(len(voc))

### Import GloVe Word Embeddings

In [None]:
path_to_glove_file = "./data/glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

In [None]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representation for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

## Step 3: Get Inputs; Split into Train and Validation Sets

In [None]:
from sklearn.model_selection import train_test_split

input_length =  # Set input_length to 15

unique_words = list(set([j.lower() for i in texts for j in i]))
word2idx = {j:i+1 for i,j in enumerate(unique_words)}
word2idx["PAD"] = 0

unique_tags = list(set([j for i in tags for j in i]))
label2idx = {j:i for i,j in enumerate(unique_tags)}
idx2label = {j:i for i,j in label2idx.items()}

texts_train, texts_validation, tags_train, tags_validation = train_test_split(texts, tags, test_size = 0.2, random_state = 0)


### Create inputs

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

X_train = [tf.reshape(vectorizer(s), [-1]) for s in texts_train]
X_train = pad_sequences(maxlen = input_length, sequences = X_train, padding = "post", value = 0)

X_validation = [tf.reshape(vectorizer(s), [-1]) for s in texts_validation]
X_validation = pad_sequences(maxlen = input_length, sequences = X_validation, padding = "post", value = 0)

y_train = [[label2idx[j] for j in i] for i in tags_train]
y_train = pad_sequences(maxlen = input_length, sequences = y_train, padding = "post", value = 0)
y_train = [to_categorical(i, num_classes = len(unique_tags)) for i in y_train]

y_validation = [[label2idx[j] for j in i] for i in tags_validation]
y_validation = pad_sequences(maxlen = input_length, sequences = y_validation, padding = "post", value = 0)
y_validation = [to_categorical(i, num_classes = len(unique_tags)) for i in y_validation]

## Step 4: Build Model with GloVe embeddings

In [None]:
from tensorflow.keras.layers import LSTM, Dense, Embedding, Bidirectional, TimeDistributed, Add, Input
from tensorflow.keras.models import Sequential, Model

# Here, we build a model layer by layer.
# Roughly, each layer can be thought of as a function or composition of previous layers:
#          sometimes just the immediately preceding layer,
#          sometimes the immediately preceding layer and another earlier layer.

# input - straightforward
input = Input(shape=(input_length,))
# embedding - applied to input layer
embedding = Embedding(input_dim=num_tokens, output_dim=100, input_length=input_length, embeddings_initializer=keras.initializers.Constant(embedding_matrix), trainable=False)(input)
# x - applied to embedding layer
x = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.2), merge_mode = 'concat')(embedding)
# x_rnn - applied to x layer
x_rnn = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.2))(x)
# x_dense
x_dense = Add()([x, x_rnn])
# output
out = TimeDistributed(Dense(len(label2idx.keys()), activation="softmax"))(x_dense)

# create the model
model = Model(input, out)
# customized_model.add(Dense(len(label2idx.keys()), activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=0.01), metrics=['accuracy', 'categorical_accuracy'])
# original: categorical_crossentropy
#model.summary()

## Step 5: Train GloVe-based deep learning model

In [None]:
# fit() is the training method.
# This will output the training metrics for each epoch
epochs =  # set epochs to 1

history = model.fit(X_train, np.array(y_train), batch_size=16, epochs=epochs, validation_split=0.1)

In [None]:
# make sure not to run this cell more than once without resetting
# y_validation by running the create inputs cell above
y_pred = model.predict(X_validation)
y_pred = np.argmax(y_pred, axis=-1)
y_pred = [[idx2label[i] for i in row] for row in y_pred]

y_validation = np.argmax(y_validation, axis=-1)
y_validation = [[idx2label[i] for i in row]
                for row in y_validation]

Report metrics on the validation set!

In [None]:
from sklearn_crfsuite import metrics

def flat_classification_report(y_true, y_pred, digits=3):
    report = metrics.flat_classification_report(y_true, y_pred, digits=digits)
    report += '\n'
    report += '{}{: >11}'.format('sequence acc', str(round(metrics.sequence_accuracy_score(y_true, y_pred), digits)))
    return report

print(flat_classification_report(y_validation, y_pred))

In the cell below, report the accuracy of the model as a number:
- e.g. ***if*** the report says:

    19s 67ms/step - loss: 0.3488 - accuracy: 0.9855 - categorical_accuracy: 0.8509 ...

**then**

accuracy = 0.9855
- is what should be in the cell below.
- (That is just an example)

In [None]:
accuracy =  # report accuracy (copy and paste accuracy number from the report above)

In [None]:
# Check reported accuracy

# Part 2: Now Tweak Parameters
- Now the LSTM model is performing at some particular level.
- Many factors determine that level, but, luckily, there are a couple of things we can do right away to tune this model and improve performance.

## Parameter 1:  Epochs
### One simple way to improve performance, much of the time, is adjusting the number of epochs.
- Each epoch is one more full pass over the training data, so it gives the model another chance to improve.
- When adding epochs, a tradeoff you will face in the real world is time.  See if you can squeeze slightly better performance out of your model by adding more epochs.
- There is a tradeoff between training time and performance, and you will get diminishing returns quickly.  Since each cell is time-limited on this platform, try to see how low you can set the epochs and still meet the accuracy threshold.  In this case, training time grows linearly with the number of epochs, such that changing from 2 to 4 will essentially double training time.
- See if you can get the model to at least .91 accuracy on the validation set!
- If you time out, it might affect your final grade drastically! Only use epochs you need.
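
One way to reason about "only use the epochs you need" is to look at per-epoch accuracy and find the first epoch that reaches the target. The sketch below uses made-up numbers; in Keras, a list of this shape is available as `history.history["val_accuracy"]` after `fit()`, but note that those values come from the `validation_split` slice of the training data, not from the classification report on `X_validation`. The helper name `first_epoch_meeting` is hypothetical.

```python
def first_epoch_meeting(val_accuracies, threshold):
    """Return the 1-based epoch at which accuracy first reaches threshold, or None."""
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc >= threshold:
            return epoch
    return None


# Hypothetical per-epoch values, invented to show diminishing returns.
example_val_accuracy = [0.88, 0.90, 0.912, 0.915, 0.916]
print(first_epoch_meeting(example_val_accuracy, 0.91))  # 3
print(first_epoch_meeting(example_val_accuracy, 0.95))  # None
```

In this invented run, epochs 4 and 5 add training time for a gain of less than half a percentage point.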

In [None]:
#import numpy as np
from tensorflow.keras.layers import LSTM, Dense, Embedding, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.metrics import CategoricalCrossentropy
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split


In [None]:
epochs2 =  # increase epochs here!

history2 = model.fit(X_train,np.array(y_train),batch_size=16,epochs=epochs2,validation_split=0.1)

print("fit done")

y_pred2 = model.predict(X_validation)
y_pred2 = np.argmax(y_pred2, axis=-1)
y_pred2 = [[idx2label[i] for i in row] for row in y_pred2]


Report metrics on the validation set!

In [None]:
def flat_classification_report(y_true, y_pred, digits=3):
    report = metrics.flat_classification_report(y_true, y_pred, digits=digits)
    report += '\n'
    report += '{}{: >11}'.format('sequence acc', str(round(metrics.sequence_accuracy_score(y_true, y_pred), digits)))
    return report

report = flat_classification_report(y_pred=y_pred2, y_true=y_validation, digits=3)
print(report)

In [None]:
# test epochs.  No need to submit accuracy, but do make sure it has increased to the desired level

## Parameter 2: input length
### Another option available to you is changing how many tokens to use for each input
- Every input must have the same length.
- This is why we need to explicitly set it in both the input creation and model-building processes.
- We are potentially losing a lot of information by setting our input length to 15.
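
The effect of a fixed input length can be sketched without Keras. The hypothetical helper `fit_to_length` below post-pads short sequences and trims long ones. One assumption worth checking against your own run: the `pad_sequences` calls in this notebook do not set `truncating`, and its default value is `"pre"`, so the sketch drops items from the front of a long sequence.

```python
def fit_to_length(sequence, maxlen, value=0):
    """Return a list of exactly maxlen items: post-pad short inputs, trim long ones.

    Assumption: trimming drops items from the front, mirroring the default
    truncating="pre" of Keras pad_sequences.
    """
    trimmed = list(sequence)[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))


short = [5, 9, 2]
long = list(range(1, 21))        # 20 hypothetical token ids
print(fit_to_length(short, 5))   # [5, 9, 2, 0, 0]
print(fit_to_length(long, 5))    # [16, 17, 18, 19, 20]
print(len(long) - 5, "tokens dropped")  # 15 tokens dropped
```

Short sentences gain padding that carries no information, while long sentences lose real tokens (and their tags) entirely.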

In [None]:
# Run this cell and...
# See how we are cutting some sentences short
print("length of actual sentence:", len(texts[132]))
print("length of used sentence:  ", len(X_train[132]))
# The following shows how much of a long sentence does not fit in 15 tokens
print(texts[132][:50], '\b\b, ...]\n\nBecomes\n')
print(texts[132][:15])

- Remember, this should also be the input_length parameter of the Embedding layer, hence using the same variable to set both.
- You could run the above and below cell again each time you change the input_length variable to see updated results.

In [None]:
# (Run this cell)
# But there are examples like this as well:
print("actual:      ", len(texts[113]))
print("fed to model:", len(X_train[113]))
print(texts[113])
print(X_train[113])

### Before moving on, try to think about the tradeoffs of increasing or decreasing the input length.
- Intuitively, what are the benefits of increasing input length?
- What are the drawbacks to increasing input length?


- What about the pros and cons of having a lower input length?

- Try changing the input length, which will be the maxlen parameter of pad_sequences() and the input length parameter in the model input, to another number (keep it under 60) and see how it impacts performance.
- In particular, set the input_length2 variable below to a number that results in an accuracy of at least 0.95!

### Set new input length.  Create new inputs with new input length.  Increase it (from the original 15), but definitely keep it below 60!

In [None]:
input_length2 =   # set input length.

X_train2 = [tf.reshape(vectorizer(s), [-1]) for s in texts_train]
X_train2 = pad_sequences(maxlen = input_length2, sequences = X_train2, padding = "post", value = 0)

X_validation2 = [tf.reshape(vectorizer(s), [-1]) for s in texts_validation]
X_validation2 = pad_sequences(maxlen = input_length2, sequences = X_validation2, padding = "post", value = 0)

y_train2 = [[label2idx[j] for j in i] for i in tags_train]
y_train2 = pad_sequences(maxlen = input_length2, sequences = y_train2, padding = "post", value = 0)
y_train2 = [to_categorical(i, num_classes = len(unique_tags)) for i in y_train2]

y_validation2 = [[label2idx[j] for j in i] for i in tags_validation]
y_validation2 = pad_sequences(maxlen = input_length2, sequences = y_validation2, padding = "post", value = 0)
y_validation2 = [to_categorical(i, num_classes = len(unique_tags)) for i in y_validation2]


### Create new model with new input length

In [None]:
from tensorflow.keras.layers import LSTM, Dense, Embedding, Bidirectional, TimeDistributed, Add, Input
from tensorflow.keras.models import Sequential, Model

input2 = Input(shape=(input_length2,))
embedding2 = Embedding(input_dim=num_tokens, output_dim=100, input_length=input_length2, embeddings_initializer=keras.initializers.Constant(embedding_matrix), trainable=False)(input2)
x2 = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.2), merge_mode = 'concat')(embedding2)
x_rnn2 = Bidirectional(LSTM(units=100, return_sequences=True, dropout=0.2))(x2)
x_dense2 = Add()([x2, x_rnn2])
out2 = TimeDistributed(Dense(len(label2idx.keys()), activation="softmax"))(x_dense2)
model2 = Model(input2, out2)
# customized_model.add(Dense(len(label2idx.keys()), activation="softmax"))
model2.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(learning_rate=0.01), metrics=['accuracy', 'categorical_accuracy'])
# original: categorical_crossentropy
#model.summary()

In [None]:
epochs3 =  epochs2 # You can change epochs here

history3 = model2.fit(X_train2,np.array(y_train2),batch_size=16,epochs=epochs3,validation_split=0.1)

print("fit done")

y_pred3 = model2.predict(X_validation2)
y_pred3 = np.argmax(y_pred3, axis=-1)
y_pred3 = [[idx2label[i] for i in row] for row in y_pred3]

y_validation2 = np.argmax(y_validation2, axis=-1)
y_validation2 = [[idx2label[i] for i in row]
                for row in y_validation2]


Report metrics on the validation set!

In [None]:
report = flat_classification_report(y_pred=y_pred3, y_true=y_validation2, digits=3)
print(report)

- Play with the input length and epochs a few times until you are able to find where the accuracy is great, but you haven't drastically increased the run time.

- There will always be one more tweak you can do to make your model better, but, for this lab, just try to get a feel for when it is not improving much anymore.

In [None]:
# Test for setting input_length2

In [None]:
# REPORT YOUR FINAL ACCURACY HERE:
accuracy_final =    # fill this in!

In [None]:
# Test final model

For the last assignment, we focused on macro-averaged F1. Note that, in this particular setup, macro-averaged F1 increased along with accuracy.

## Final Notes:
### Think about some possible reasons for this model's performance to increase with higher input length.
### And think about other limitations of the way we evaluated performance.
- How is the data changing, and how might that make it easier for the model to get more right?
- Accuracy may have been high, but you might notice that not all classes are being identified well.  Think about what could have been a better measure of your model's success!  What was the purpose of the task?  What combination of metrics would better represent successfully completing that task?  Notice that it is not immediately obvious how best to evaluate performance on this task!
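
As one possible starting point for these questions, the sketch below measures token-level accuracy on a single made-up sentence padded to two different lengths. It assumes the model always gets padded positions right and uses a hypothetical `"PAD"` label for them; in this notebook, padded label positions are filled with index `0`, which maps to whichever real tag happens to be first in `label2idx`.

```python
def token_accuracy(y_true, y_pred, ignore=None):
    """Share of positions where prediction equals truth, optionally skipping a label."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore]
    return sum(t == p for t, p in pairs) / len(pairs)


# Hypothetical sentence: 4 real tokens, and the model misses the entity entirely.
true_real = ["B-AE", "I-AE", "O", "O"]
pred_real = ["O", "O", "O", "O"]

results = {}
for length in (6, 12):
    pad = ["PAD"] * (length - len(true_real))
    results[length] = token_accuracy(true_real + pad, pred_real + pad)
    print(length, round(results[length], 3))  # 6 -> 0.667, 12 -> 0.833

# Same predictions, scored on real tokens only.
unpadded = token_accuracy(true_real + pad, pred_real + pad, ignore="PAD")
print("real tokens only:", unpadded)  # 0.5
```

The predictions on real tokens never change, yet the score rises with the padded length, which is one reason to look at per-class metrics for the entity tags rather than overall accuracy alone.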

### Deep Learning Model
- Depending on your exact feature engineering choices from the last assignment, you may see that this approach performed better or worse, but in whatever case, this one required a bit less effort. There was a lot of setup, but it sort of did all of the work. However, you may also feel that, with this approach, there was more mystery surrounding what was going on. The previous (CRF) model used features that you created, while this model seems to learn useful things, but exactly what kind of relationships it is learning are harder to see.
- Think about whether you have problems with that. Should we be able to explain the models we are deploying reasonably well, or is it okay just to choose the best performing model?  Why?