## Case 3: Patient Drug Review

**TEAM 10: Eliecer Diaz, Muskan Kaushik, and Zakaria Hasan**

**Introduction**

Text analysis is an everyday machine learning task that deep learning can solve. Deep learning architectures are built on neural networks.
Recurrent neural networks (RNNs) are at the forefront of the neural network models used for learning from sequential data.
This document investigates sentiment analysis on patient drug reviews. We use Long Short-Term Memory (LSTM) networks to examine the dataset and construct an effective model that predicts the rating category from a given review.


**Objectives:**
* Predict the multiclass drug rating category from the drug reviews
* Report the accuracy, Cohen's kappa, and a classification report
* Convert the numeric drug review ratings into 3 categories and classify the reviews into them

# Importing libraries

In [None]:
import random # Random generators
import numpy as np
import pandas as pd # Pandas dataframe
import matplotlib.pyplot as plt
import re # Text cleaning
import nltk # Text processing
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup # Text cleaning
import tensorflow as tf # Tensorflow
from tensorflow.keras import preprocessing # Text preprocessing
from tensorflow.keras.preprocessing.text import Tokenizer # Text preprocessing
from tensorflow.keras.preprocessing.sequence import pad_sequences # Text preprocessing
from tensorflow.keras.models import Sequential # modeling neural networks
from tensorflow.keras.layers import Input, Dense, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Embedding, SpatialDropout1D, LSTM
from tensorflow.keras.initializers import Constant
from tensorflow.keras import optimizers, metrics # Neural Network
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score
random.seed(10) # Set seed for Python's random generator
np.random.seed(10) # Set seed for NumPy, which shuffles the data below
print(f"Tensorflow version: {tf.__version__}")

# Create dataframes train and test

In [None]:
train = pd.read_csv('../input/drugsComTrain_raw.csv')
test = pd.read_csv('../input/drugsComTest_raw.csv')

In [None]:
train.head()

Check the column names and dataset sizes

In [None]:
list(train)

In [None]:
# Number of rows in each set and their ratio, shown as one tuple
train.shape[0], test.shape[0], train.shape[0] / test.shape[0]

The training set is almost exactly 3 times as large as the test set. This is a typical 75:25 train:test split.

In [None]:
print("Train shape :" ,train.shape)
print("Test shape :",test.shape)

Rating Distribution 

In [None]:
train.rating.hist(color = 'skyblue')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.xticks([i for i in range(1,11)]);

This distribution illustrates that people generally write reviews for drugs they either really like or really dislike. There are fewer middle ratings than extreme ratings.

# Handling text with TensorFlow

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
# For Train Data
samples = train['review']
tokenizer =Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(samples)
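# For Test Data: reuse the tokenizer fitted on the training reviews, so that
# both sets share one word index (a separately fitted tokenizer would map
# the same word to a different integer)
test_samples = test['review']
test_tokenizer = tokenizer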
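
Train and test reviews need to be encoded with one shared word index, otherwise the same integer can stand for different words in the two sets. The following is a minimal pure-Python sketch of that effect. `build_word_index` and `encode` are hypothetical helpers written for illustration; they only loosely mimic the frequency ranking of the Keras `Tokenizer` and are not part of its API.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # Keep only indices below num_words (an assumption made for this sketch)
    return {word: i + 1 for i, word in enumerate(ranked) if i + 1 < num_words}

def encode(text, word_index):
    """Map each known word to its integer; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

train_texts = ["no side effects", "bad side effects", "no pain"]
test_texts = ["pain pain bad pain"]

train_index = build_word_index(train_texts, num_words=5000)
test_index = build_word_index(test_texts, num_words=5000)

# The same word receives different integers under the two indices
print(train_index["pain"], test_index["pain"])

# Encoding the test text with the training index keeps the integers consistent
# with what a model trained on the training set has seen
print(encode(test_texts[0], train_index))
```

A model's Embedding layer learns one vector per integer, so integers produced by a different index would look up the wrong vectors.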



# Convert text to sequences

In [None]:
# Convert text to sequences for Train Data
sequences = tokenizer.texts_to_sequences(samples)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))


# Convert text to sequences for Test Data, using the vocabulary learned
# from the training reviews so that the integer codes match
test_sequences = tokenizer.texts_to_sequences(test_samples)

# Pad the sequences to a fixed length

In [None]:
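from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad or truncate every review to exactly 200 tokens
data = pad_sequences(sequences, maxlen=200)
test_data = pad_sequences(test_sequences, maxlen=200)

# Print both shapes (a bare expression is only shown for the last line of a cell)
print(data.shape, test_data.shape)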
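
As a rough illustration of what padding to a fixed length does, the sketch below reimplements the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`, padding value 0) in plain NumPy. `pad_pre` is a hypothetical helper for this example only.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue                         # an empty review stays all padding
        trimmed = seq[-maxlen:]              # truncate from the front
        out[row, -len(trimmed):] = trimmed   # pad at the front
    return out

padded = pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

With `maxlen=200`, reviews shorter than 200 tokens are filled with zeros on the left, and longer reviews keep only their last 200 tokens.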

# Categorize labels

2 means the patient review is POSITIVE (rating 6 to 10) <br>
1 means the patient review is NEUTRAL (rating 4 or 5) <br>
0 means the patient review is NEGATIVE (rating 1 to 3)

In [None]:
# Categorize labels for Train Data
labels = train['rating'].values
labels = 1.0 * (labels >= 6) + 1.0 * (labels >= 4)

# Categorize labels for Test Data
test_labels = test['rating'].values
test_labels = 1.0 * (test_labels >= 6) + 1.0 * (test_labels >= 4)
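
To make the thresholds concrete, this small sketch applies the same rule to every possible rating from 1 to 10. The toy array `ratings` is an assumption for illustration; the real labels come from the `rating` column.

```python
import numpy as np

ratings = np.arange(1, 11)                          # the possible ratings 1..10
classes = 1 * (ratings >= 6) + 1 * (ratings >= 4)   # same rule as in the cell above

for rating, cls in zip(ratings, classes):
    print(rating, cls)
```

Each comparison contributes 0 or 1, so a rating below 4 gets class 0, a rating of 4 or 5 gets class 1, and a rating of 6 or more gets class 2.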




# One-hot encode the output values

In [None]:
from tensorflow.keras.utils import to_categorical

# For train and validation
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# For Test data
test_labels = to_categorical(np.asarray(test_labels))
print('Shape of data tensor:', test_data.shape)
print('Shape of label tensor:', test_labels.shape)
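
One-hot encoding turns each class id into a row with a single 1. The sketch below shows the idea with plain NumPy by indexing an identity matrix; it illustrates the concept and is not how `to_categorical` is implemented.

```python
import numpy as np

class_ids = np.array([0, 2, 1, 2])                  # toy labels for 3 classes
one_hot = np.eye(3, dtype="float32")[class_ids]     # pick one identity row per label
print(one_hot)

# argmax along the class axis recovers the original class ids
recovered = one_hot.argmax(axis=1)
print(recovered)
```

This format matches the 3-unit softmax output and the `categorical_crossentropy` loss used by the models below.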

# Split into training and validation sets

In [None]:
VALIDATION_SPLIT = 0.25
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
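
The same shuffle-and-slice pattern can be tried on a toy array to check that samples and labels stay aligned. This is a sketch under assumptions: the arrays `toy_x` and `toy_y` are made up, and it uses a seeded `np.random.default_rng` generator instead of the global NumPy state.

```python
import numpy as np

rng = np.random.default_rng(10)          # seeded generator (the seed is arbitrary)
toy_x = np.arange(20).reshape(10, 2)     # 10 toy samples with 2 features each
toy_y = np.arange(10)                    # toy labels, equal to the row number

order = rng.permutation(len(toy_x))      # one permutation applied to both arrays
toy_x, toy_y = toy_x[order], toy_y[order]

n_val = int(0.25 * len(toy_x))           # 25 % of 10 samples gives 2
x_tr, x_va = toy_x[:-n_val], toy_x[-n_val:]
y_tr, y_va = toy_y[:-n_val], toy_y[-n_val:]
print(x_tr.shape, x_va.shape)
```

Indexing both arrays with the same permutation is what keeps each review paired with its label. Note that the slice `[:-n_val]` would return an empty array if `n_val` were 0, so the pattern assumes at least one validation sample.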

## Model 1



In this model, we use nine layers: one Embedding layer, three Conv1D layers, two MaxPooling1D layers, one GlobalMaxPooling1D layer, and two Dense layers.

The Embedding layer handles the text input. It turns the integers that encode the words into dense vectors.

The Conv1D layer creates a convolution kernel that is convolved with the layer input over a single spatial (or temporal) dimension to produce a tensor of outputs.

MaxPooling1D performs a max pooling operation on temporal data.

GlobalMaxPooling1D takes the maximum over the steps dimension of temporal data.

A Dense layer represents a matrix-vector multiplication (assuming a batch size of 1). The values in the matrix are the trainable parameters, which are updated during backpropagation.
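
To see how the convolution and pooling layers shrink the sequence, the sketch below traces the length of a 200-step input through three convolutions with kernel size 5 and two poolings with pool size 5. It assumes the Keras defaults (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size); the helper names are made up for this example.

```python
def conv1d_len(length, kernel_size):
    """Output length of a 'valid' convolution with stride 1."""
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size):
    """Output length of 'valid' max pooling whose stride equals pool_size."""
    return (length - pool_size) // pool_size + 1

length = 200
trace = [length]
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    if layer == "conv":
        length = conv1d_len(length, 5)
    else:
        length = maxpool1d_len(length, 5)
    trace.append(length)

print(trace)
```

After the last convolution, global max pooling collapses the remaining steps into one value per filter, so the first Dense layer receives a vector with as many entries as there are filters.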



In [None]:
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers  # same Keras package as the other imports


embedding_layer = Embedding(5000,
                            100,
                            input_length=200,
                            trainable=True)


sequence_input = Input(shape=(200,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu',kernel_regularizer=regularizers.l2(0.05))(x)
preds = Dense(3, activation='softmax')(x)


model1 = Model(sequence_input, preds)
model1.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])


model1.summary()

# Train the model

In [None]:
history = model1.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          verbose=0,
          validation_data=(x_val, y_val))

# Plot the accuracy and loss

In [None]:

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
e = np.arange(len(acc)) + 1  # epoch numbers starting from 1

plt.plot(e, acc, label = 'train')
plt.plot(e, val_acc, label = 'validation')
plt.title('Training and validation accuracy')
plt.xlabel('Epoch')
plt.grid()
plt.legend()

plt.figure()

plt.plot(e, loss, label = 'train')
plt.plot(e, val_loss, label = 'validation')
plt.title('Training and validation loss')
plt.xlabel('Epoch')
plt.grid()
plt.legend()

plt.show()

Calculate metrics

# Find the predicted values for the validation set

In [None]:
# Class with the highest predicted probability, and the true class from the one-hot labels
y_pred = np.argmax(model1.predict(x_val), axis = 1)
y_true = np.argmax(y_val, axis = 1)
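
The step from softmax outputs to class ids can be checked on a tiny hand-made example. The probabilities below are invented for illustration and do not come from the model.

```python
import numpy as np

# Three made-up softmax rows (each sums to 1) and their one-hot true labels
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
truth_one_hot = np.array([[1, 0, 0],
                          [0, 0, 1],
                          [0, 0, 1]])

pred_ids = np.argmax(probs, axis=1)          # most probable class per row
true_ids = np.argmax(truth_one_hot, axis=1)  # position of the 1 per row
accuracy = float(np.mean(pred_ids == true_ids))
print(pred_ids, true_ids, accuracy)
```

Taking the `argmax` on both sides puts predictions and labels in the integer form that the scikit-learn metric functions used below expect.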

# Calculate the classification report for the validation set

In [None]:
cr = classification_report(y_true, y_pred)
print(cr)

# Calculate the confusion matrix for the validation set

In [None]:
cm = confusion_matrix(y_true, y_pred).T
print(cm)
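
The effect of the transpose can be seen on a toy example. scikit-learn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, so after `.T` the rows are the predictions. The helper `confusion` below is a hypothetical NumPy stand-in that follows the same convention.

```python
import numpy as np

def confusion(true_ids, pred_ids, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (true_ids, pred_ids), 1)   # add 1 at every (true, predicted) pair
    return cm

true_ids = np.array([0, 0, 1, 2, 2, 2])
pred_ids = np.array([0, 1, 1, 2, 2, 0])

cm_demo = confusion(true_ids, pred_ids, 3)
print(cm_demo)       # rows: true class, columns: predicted class
print(cm_demo.T)     # rows: predicted class, columns: true class
```

In the transposed matrix, each column sums to the number of samples that truly belong to that class.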

# Calculate Cohen's kappa, unweighted and with quadratic weights, for the validation set

In [None]:
# cohen_kappa_score without the `weights` argument returns the unweighted kappa
k = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's kappa (unweighted) = {k:.3f}")
k2 = cohen_kappa_score(y_true, y_pred, weights = 'quadratic')
print(f"Cohen's kappa (quadratic)  = {k2:.3f}")
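
For reference, `cohen_kappa_score` called without `weights` computes the unweighted kappa, while `weights='linear'` and `weights='quadratic'` penalise a disagreement by the distance between the two classes. The sketch below is a plain NumPy version of the usual definition, kappa = 1 - sum(w * observed) / sum(w * expected). The function `kappa` is written for illustration under that assumption and is not scikit-learn's code.

```python
import numpy as np

def kappa(true_ids, pred_ids, n_classes, weights=None):
    """Cohen's kappa from integer class ids, optionally weighted."""
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (true_ids, pred_ids), 1)
    # Counts expected by chance, from the row and column totals
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    i, j = np.indices((n_classes, n_classes))
    if weights is None:
        w = (i != j).astype(float)         # every disagreement costs 1
    elif weights == "linear":
        w = np.abs(i - j).astype(float)    # cost grows with the class distance
    elif weights == "quadratic":
        w = ((i - j) ** 2).astype(float)   # cost grows with the squared distance
    else:
        raise ValueError("weights must be None, 'linear' or 'quadratic'")
    # Assumes at least two classes occur, so the denominator is not zero
    return 1.0 - (w * observed).sum() / (w * expected).sum()

true_ids = np.array([0, 0, 1, 2, 2, 2])
pred_ids = np.array([0, 1, 1, 2, 2, 0])
for scheme in (None, "linear", "quadratic"):
    print(scheme, round(kappa(true_ids, pred_ids, 3, scheme), 3))
```

Because the three rating categories are ordered, the weighted variants treat a negative review predicted as positive as a larger error than a negative review predicted as neutral.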

# Find the predicted values for the Test set

In [None]:
y_pred = np.argmax(model1.predict(test_data), axis = 1)
y_true = np.argmax(test_labels, axis = 1)

# Calculate the classification report for the Test set

In [None]:
cr = classification_report(y_true, y_pred)
print(cr)

# Calculate the confusion matrix for the Test set

In [None]:
cm = confusion_matrix(y_true, y_pred).T
print(cm)

# Calculate Cohen's kappa, unweighted and with quadratic weights, for the Test set

In [None]:
# cohen_kappa_score without the `weights` argument returns the unweighted kappa
k = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's kappa (unweighted) = {k:.3f}")
k2 = cohen_kappa_score(y_true, y_pred, weights = 'quadratic')
print(f"Cohen's kappa (quadratic)  = {k2:.3f}")

## Model 2

In this model, we use five layers: one Embedding layer, one SpatialDropout1D layer, two LSTM layers, and one Dense layer.

The Embedding layer handles the text input. It turns the integers that encode the words into dense vectors.

The SpatialDropout1D layer works like dropout, but it drops entire feature channels instead of individual units. We pass a rate of 0.1, so on average that fraction of the channels is dropped during training.
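
The difference from ordinary dropout can be sketched in NumPy: one keep-or-drop decision is made per sample and channel and then shared across all time steps. `spatial_dropout_1d` is a hypothetical helper for illustration. It assumes the usual inverted-dropout rescaling by `1 / (1 - rate)` and uses a rate of 0.5 only to make the effect easy to see.

```python
import numpy as np

def spatial_dropout_1d(x, rate, rng):
    """Zero whole channels of x, which has shape (batch, steps, channels)."""
    batch, _, channels = x.shape
    keep = rng.random((batch, 1, channels)) >= rate   # one decision per channel
    return x * keep / (1.0 - rate)                    # rescale the kept channels

rng = np.random.default_rng(0)
x = np.ones((2, 5, 8))                                # 2 samples, 5 steps, 8 channels
dropped = spatial_dropout_1d(x, rate=0.5, rng=rng)
print(dropped[0])
```

In the printed sample, every column is either all zeros or all the same rescaled value, because a channel is dropped for the whole sequence at once.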

The LSTM layer is a recurrent layer that processes the sequence step by step. The LSTM improves on the SimpleRNN network and helps mitigate the 'vanishing gradient' problem.
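
As a rough sanity check on model size, the sketch below counts the trainable parameters for the layer sizes used in this model (vocabulary of 5000, embedding size 128, two LSTM layers with 128 units, and a 3-unit Dense output). It assumes the standard LSTM formulation with four gates and one bias vector per gate; the helper functions are made up for this example, and the numbers can be compared against `model.summary()`.

```python
def embedding_params(vocab_size, dim):
    """One vector of size dim per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

counts = {
    "embedding": embedding_params(5000, 128),
    "lstm_1": lstm_params(128, 128),
    "lstm_2": lstm_params(128, 128),
    "dense": dense_params(128, 3),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:10s} {n:>8d}")
print(f"{'total':10s} {total:>8d}")
```

Under these assumptions, most of the parameters sit in the Embedding layer rather than in the recurrent layers.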

In [None]:
model = Sequential()
model.add(Embedding(5000, 128, input_length=200))
model.add(SpatialDropout1D(0.1))
model.add(LSTM(128, dropout=0.1, recurrent_dropout=0.1, return_sequences=True))
model.add(LSTM(128, dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(3, activation='softmax'))
# Use the metric name 'acc' so that the history keys match the plotting cell below
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

In [None]:
# use_multiprocessing is omitted: it only applies to generator or Sequence inputs
history2 = model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          verbose=0,
          validation_data=(x_val, y_val))

In [None]:
acc = history2.history['acc']
val_acc = history2.history['val_acc']
loss = history2.history['loss']
val_loss = history2.history['val_loss']
e = np.arange(len(acc)) + 1  # epoch numbers starting from 1

plt.plot(e, acc, label = 'train')
plt.plot(e, val_acc, label = 'validation')
plt.title('Training and validation accuracy')
plt.xlabel('Epoch')
plt.grid()
plt.legend()

plt.figure()

plt.plot(e, loss, label = 'train')
plt.plot(e, val_loss, label = 'validation')
plt.title('Training and validation loss')
plt.xlabel('Epoch')
plt.grid()
plt.legend()

plt.show()

# Find the predicted values for the validation set

In [None]:
y_pred = np.argmax(model.predict(x_val), axis = 1)
y_true = np.argmax(y_val, axis = 1)

# Calculate the classification report for the validation set

In [None]:
cr = classification_report(y_true, y_pred)
print(cr)

# Calculate the confusion matrix for the validation set

In [None]:
cm = confusion_matrix(y_true, y_pred).T
print(cm)

# Calculate Cohen's kappa, unweighted and with quadratic weights, for the validation set

In [None]:
# cohen_kappa_score without the `weights` argument returns the unweighted kappa
k = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's kappa (unweighted) = {k:.3f}")
k2 = cohen_kappa_score(y_true, y_pred, weights = 'quadratic')
print(f"Cohen's kappa (quadratic)  = {k2:.3f}")

# Find the predicted values for the Test set

In [None]:
y_pred = np.argmax(model.predict(test_data), axis = 1)
y_true = np.argmax(test_labels, axis = 1)

# Calculate the classification report for the Test set

In [None]:
cr = classification_report(y_true, y_pred)
print(cr)

# Calculate the confusion matrix for the Test set

In [None]:
cm = confusion_matrix(y_true, y_pred).T
print(cm)

# Calculate Cohen's kappa, unweighted and with quadratic weights, for the Test set

In [None]:
# cohen_kappa_score without the `weights` argument returns the unweighted kappa
k = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's kappa (unweighted) = {k:.3f}")
k2 = cohen_kappa_score(y_true, y_pred, weights = 'quadratic')
print(f"Cohen's kappa (quadratic)  = {k2:.3f}")

Now let us compare our results with those of Grässer et al. (Table 2, p. 124). The accuracies are not too different: the published results reach around 90 percent, while our models reach up to 80 percent. The inter-rater reliability, measured by Cohen's kappa, differs more clearly: the published values reach almost 84, while those of our models range from about 60 to 70.

# Conclusion

In conclusion, in Case 3 we performed text analysis to predict drug review categories. We started by importing the data, then preprocessed the text, built the models, and visualized the results. Text analysis helps us explore text data and make predictions from it. It can be used in various fields, such as sentiment analysis, language recognition, automation of customer service, spam email filtering, and others. Overall, it was a great experience for us.



