# Case 3. Patient Drug Review
**Neural Networks for Machine Learning Applications**<br>
27.02.2023<br>
Erik Holopainen, Alejandro Rosales Rodriguez and Brian van den Berg<br>
[Information Technology, Bachelor's Degree](https://www.metropolia.fi/en/academics/bachelors-degrees/information-technology)<br>
[Metropolia University of Applied Sciences](https://www.metropolia.fi/en)

## 1. Introduction

Instructions: Write here why this notebook was created and what its main objectives were.

## 2. Setup

Instructions: Briefly describe which libraries were used and why.

In [228]:
# Machine Learning and Data Science
import pandas as pd
import numpy as np
import tensorflow as tf

# Modeling neural networks
from keras.models import Model
from keras.models import Sequential
from keras.layers import Dense, Input, Activation, Embedding, Dropout
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.utils import pad_sequences

# Sklearn
from sklearn.model_selection import train_test_split

# General imports
import os

## 3. Dataset

Instructions: Briefly describe the data and its main characteristics. Remember to document the code.

In [229]:
# Define the input variables
inputDir = 'input'
inputPaths = []

# Get the .csv files in the input folder
for file in os.listdir(inputDir):
    if file.endswith('.csv'):
        inputPaths.append(os.path.join(inputDir, file))

# Print the input paths
print(inputPaths)

# Read every input file and combine them into a single dataframe
df = pd.concat([pd.read_csv(path) for path in inputPaths], ignore_index=True)

# Drop the unique id column
df = df.drop(['uniqueID'], axis=1)

# Shuffle the dataframe
df = df.sample(frac=1)
df = df.reset_index(drop=True)

# Display the dataframe
display(df)

# Display the dataframe description
print("Description of the dataframe:")
display(df.describe().T)

['input\\drugsComTest_raw.csv', 'input\\drugsComTrain_raw.csv']


| | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|
| 0 | Cymbalta | Osteoarthritis | "My doctor prescribed Cymbalta for me as an an... | 9 | 19-Jun-11 | 93 |
| 1 | Dilaudid | Pain | "I had a severe ear infection and was given No... | 9 | 8-Oct-09 | 14 |
| 2 | Ceftriaxone | Pneumonia | "I wish I would have received as fast a healin... | 10 | 26-Mar-13 | 25 |
| 3 | Tamsulosin | Overactive Bladde | "My Doctor prescribed Flomax Min Dose, I was o... | 3 | 13-Jan-16 | 11 |
| 4 | Miconazole | Vaginal Yeast Infection | "OMG! Monistat 3 is bad! Upon insertion I&#039... | 2 | 3-Oct-17 | 6 |
| ... | ... | ... | ... | ... | ... | ... |
| 215058 | Loestrin 24 Fe | Birth Control | "I have been on this medicine for over three y... | 8 | 18-Mar-11 | 2 |
| 215059 | Acetaminophen / hydrocodone | Pain | "I started taking this after my knee surgery, ... | 10 | 27-Oct-09 | 256 |
| 215060 | Levonorgestrel | Birth Control | "I&#039;ve had Skyla for almost 2 years now, a... | 6 | 22-Mar-16 | 2 |
| 215061 | Dapagliflozin | Diabetes, Type 2 | "This medication worked immediately lowering m... | 9 | 19-Nov-15 | 15 |


Description of the dataframe:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rating | 215063.0 | 6.990008 | 3.275554 | 1.0 | 5.0 | 8.0 | 10.0 | 10.0 |
| usefulCount | 215063.0 | 28.001004 | 36.346069 | 0.0 | 6.0 | 16.0 | 36.0 | 1291.0 |


## 4. Preprocessing

Instructions: Describe:

- how the missing values are handled
- conversion of textual and categorical data into numerical values (if needed)
- how the data is split into train, validation and test sets
- the features (=input) and labels (=output), and 
- how the features are normalized or scaled

### Tokenize

In [230]:
# Get the reviews
X = list(df['review'])

# Create a tokenizer to convert text to sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)

# Tokenize the reviews
X = tokenizer.texts_to_sequences(X)

# Print out the number of unique tokens
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')

# Get the length of the longest tokenized review
max_sequence = max(len(seq) for seq in X)
print(f'The maximum amount of words that the model can process is {max_sequence}.')

# Pad shorter sequences and truncate longer ones to 200 tokens
# (pad_sequences already returns a NumPy array)
X = pad_sequences(X, maxlen=200)

Found 55245 unique tokens.
The maximum amount of words that the model can process is 2034.
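
Since `pad_sequences` is called with `maxlen=200` while the longest review has 2034 tokens, longer reviews are cut down to 200 tokens. With the Keras defaults (`padding='pre'`, `truncating='pre'`), zeros are added at the front and tokens are dropped from the front, so only the end of a long review is kept. The sketch below mimics that default behaviour in plain Python on toy sequences; `pad_or_truncate` is a hypothetical helper for illustration and not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the default behaviour of Keras pad_sequences for one sequence."""
    # Default truncating='pre': keep only the last `maxlen` tokens
    seq = seq[-maxlen:]
    # Default padding='pre': fill with `value` at the front
    return [value] * (maxlen - len(seq)) + seq

# Toy token sequences (assumed values, not taken from the dataset)
short_review = [4, 8, 15]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_review, maxlen=5))  # [0, 0, 4, 8, 15]
print(pad_or_truncate(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

If the beginning of a review is considered more informative than its end, passing `truncating='post'` to `pad_sequences` would keep the first 200 tokens instead.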


### Simplify and Encode the Labels

In [231]:
# Get the ratings
y = list(df['rating'])

# Simplification function
def simplify(num):
    if num < 5:
        return 0
    elif num > 6:
        return 2
    else:
        return 1

# Simplify the labels
y = np.array(list(map(simplify, y)))

# Count the samples per class (the basis for any class weighting)
total = len(y)
unique, counts = np.unique(y, return_counts=True)

# One-hot encode the labels for multi-class classification
y = to_categorical(y)

# Print the distribution
print(f'The unique labels are: [{unique[0]}, {unique[1]}, {unique[2]}] with a distribution of [{counts[0]}, {counts[1]}, {counts[2]}].')

The unique labels are: [0, 1, 2] with a distribution of [53572, 19185, 142306].
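
The distribution is imbalanced: class 2 covers roughly two thirds of the reviews, while class 1 covers less than a tenth. One common way to compensate is to weight each class inversely to its frequency. The sketch below applies the "balanced" heuristic `total / (n_classes * count)` to the counts printed above; the variable names are assumptions for illustration, and the notebook itself does not apply these weights.

```python
# Class counts taken from the printed distribution above
counts = {0: 53572, 1: 19185, 2: 142306}
total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weight = {label: total / (n_classes * n) for label, n in counts.items()}

for label, weight in class_weight.items():
    print(f'class {label}: weight {weight:.2f}')
```

A dictionary of this shape could be passed to the `class_weight` argument of `model.fit`, so that errors on the rare middle class contribute more to the loss.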


## 5. Modeling

Instructions: Write a short description of the model: 

- selected loss, optimizer and metrics settings, and 
- the summary of the selected model architecture. 

In [234]:
# Defining the model
embedding_dim = 100
model = Sequential([
    Embedding(len(word_index) + 1, embedding_dim),
    Conv1D(128, 7, padding="valid", activation="relu", strides=3),
    GlobalAveragePooling1D(),
    Dropout(.2),
    Dense(128, activation="relu"),
    Dense(3, activation = 'softmax')
])

# Compiling
model.compile(loss = 'categorical_crossentropy',
              optimizer = 'rmsprop',
              metrics = ['acc'])

# Summarizing
model.summary()

Model: "sequential_31"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_31 (Embedding)    (None, None, 100)         5524600   
                                                                 
 conv1d_31 (Conv1D)          (None, None, 128)         89728     
                                                                 
 global_average_pooling1d_31  (None, 128)              0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dropout_48 (Dropout)        (None, 128)               0         
                                                                 
 dense_62 (Dense)            (None, 128)               16512     
                                                                 
 dense_63 (Dense)            (None, 3)                 387       
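
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below recomputes them from the vocabulary size, the embedding dimension and the layer settings used above. It also derives the number of time steps left after the strided convolution, assuming the padded input length of 200.

```python
vocab_size = 55245 + 1   # unique tokens plus the padding index 0
embedding_dim = 100
filters, kernel_size, stride = 128, 7, 3
sequence_length = 200    # maxlen used for padding

# Weights (plus biases where the layer has them) per layer
embedding_params = vocab_size * embedding_dim
conv_params = kernel_size * embedding_dim * filters + filters
dense_params = 128 * 128 + 128
output_params = 128 * 3 + 3

# Time steps left after a "valid" convolution with stride 3
conv_steps = (sequence_length - kernel_size) // stride + 1

print(embedding_params, conv_params, dense_params, output_params)  # 5524600 89728 16512 387
print(conv_steps)  # 65
```

The embedding layer holds nearly all of the trainable parameters, so limiting the vocabulary (for example via the `num_words` argument of `Tokenizer`) would be the most direct way to shrink the model.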
                                                     

## 6. Training

Instructions: Write a short description of the training process, and document the code for training and the total time spent on it.

In [235]:
# Split into train, validation and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)
X_train, X_val,  y_train, y_val =  train_test_split(X_train, y_train, test_size=.2, stratify=y_train)

# Model Fitting
history = model.fit(
    X_train, y_train,
    batch_size=128,
    epochs=10,
    verbose=1,
    validation_data=(X_val, y_val)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## 7. Performance and evaluation

Instructions: 

- Show the training and validation loss and accuracy plots
- Interpret the loss and accuracy plots (e.g. is there under- or over-fitting)
- Describe the final performance of the model on the test set

In [None]:
# Your code
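
Because the classes are imbalanced, accuracy alone can look good even when the rare middle class is never predicted. The sketch below shows one way to report per-class recall and its macro average next to accuracy. It uses toy labels that are purely illustrative and are not results from this notebook; `per_class_recall` is a hypothetical helper. With the one-hot labels used here, class indices could be obtained with `np.argmax(..., axis=1)` on `y_test` and on the model predictions.

```python
from collections import Counter

def per_class_recall(y_true, y_pred, n_classes=3):
    """Return the recall of each class as a list indexed by class label."""
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    totals = Counter(y_true)
    return [hits[c] / totals[c] if totals[c] else 0.0 for c in range(n_classes)]

# Toy labels for illustration only (not results from this notebook)
y_true = [2, 2, 2, 2, 2, 2, 0, 0, 0, 1]
y_pred = [2, 2, 2, 2, 2, 2, 0, 0, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls) / len(recalls)

print(accuracy)                        # 0.8
print([round(r, 2) for r in recalls])  # [0.67, 0.0, 1.0]
print(round(macro_recall, 2))          # 0.56
```

In this toy case the accuracy is 0.8 although class 1 is never recognised, which is exactly the kind of gap the macro average exposes.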

## 8. Discussion and conclusions

Instructions: Write

- What settings and models were tested before the best model was found
    - What were the results of these experiments
- Summary of
    - What was your best model and its settings
    - What was the final achieved performance
- What are your main observations and learning points
- Discussion of how the model could be improved in the future

**Note:** Remember to evaluate the final metrics using the test set. 
