# Case 3. Patient Drug Review
**Neural Networks for Machine Learning Applications**<br>
27.02.2023<br>
Erik Holopainen, Alejandro Rosales Rodriguez and Brian van den Berg<br>
[Information Technology, Bachelor's Degree](https://www.metropolia.fi/en/academics/bachelors-degrees/information-technology)<br>
[Metropolia University of Applied Sciences](https://www.metropolia.fi/en)

## 1. Introduction

Instructions: Write here why this notebook was created and what its main objectives were.

## 2. Setup

Instructions: Briefly describe which libraries were used and why.

In [20]:
# Machine Learning and Data Science
import pandas as pd
import tensorflow as tf

# Modeling neural networks
from keras.models import Model
from keras.models import Sequential
from keras.layers import Dense, Input, Activation, Embedding, Dropout
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing.text import Tokenizer

# Sklearn
from sklearn.model_selection import train_test_split

# General imports
import os

## 3. Dataset

Instructions: Briefly describe the data and its main characteristics. Remember to document the code.

In [21]:
# Define the input variables
inputDir = 'input'
inputPaths = []

# Get the .csv files in the input folder
for file in os.listdir(inputDir):
    if file.endswith('.csv'):
        inputPaths.append(os.path.join(inputDir, file))

# Print the input paths
print(inputPaths)

# Read every input file and concatenate them into a single dataframe
df = pd.concat([pd.read_csv(path) for path in inputPaths], ignore_index=True)

# Drop the unique id column
df = df.drop(['uniqueID'], axis=1)

# Shuffle the dataframe
df = df.sample(frac=1)
df = df.reset_index(drop=True)

# Display the dataframe
display(df)

# Display the dataframe description
print("Description of the dataframe:")
display(df.describe().T)

['input\\drugsComTest_raw.csv', 'input\\drugsComTrain_raw.csv']


| | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|
| 0 | Suprep Bowel Prep Kit | Bowel Preparation | "This bowel prep was a disgusting experience a... | 1 | 22-Aug-14 | 17 |
| 1 | Nora-Be | Birth Control | "I have taken Nora Be 0.35 for 3 months. I ha... | 1 | 17-Feb-16 | 6 |
| 2 | Dulaglutide | Diabetes, Type 2 | "I started Trulicity three days ago. No vomit... | 4 | 29-Aug-15 | 14 |
| 3 | Paroxetine | Anxiety | "I&#039;ve been taking Paxil 20 mg for 6 years... | 9 | 26-Jun-09 | 3 |
| 4 | Amitriptyline | Anxiety and Stress | "Taking this medicine is like taking speed ?\r... | 1 | 15-Feb-17 | 16 |
| ... | ... | ... | ... | ... | ... | ... |
| 215058 | Lyza | Birth Control | "I ran out of options for bc since I was so af... | 10 | 11-Nov-15 | 12 |
| 215059 | Levothyroxine | Hashimoto's disease | "After three months of dedicated use, my goite... | 10 | 24-Jul-12 | 22 |
| 215060 | Benzoyl peroxide / clindamycin | Acne | "I have only been using Duac gel for three day... | 8 | 7-Sep-13 | 11 |
| 215061 | Levonorgestrel | Abnormal Uterine Bleeding | "I&#039;m 49 &amp; have had heavy bleeding, bi... | 9 | 17-Aug-16 | 27 |


Description of the dataframe:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rating | 215063.0 | 6.990008 | 3.275554 | 1.0 | 5.0 | 8.0 | 10.0 | 10.0 |
| usefulCount | 215063.0 | 28.001004 | 36.346069 | 0.0 | 6.0 | 16.0 | 36.0 | 1291.0 |


## 4. Preprocessing

Instructions: Describe:

- how the missing values are handled
- conversion of textual and categorical data into numerical values (if needed)
- how the data is split into train, validation and test sets
- the features (=input) and labels (=output), and 
- how the features are normalized or scaled
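
For the text-to-number conversion listed above, it can help to see what a word-index tokenizer does in principle. The sketch below is a toy stand-in written in plain Python, not the Keras `Tokenizer` itself: the helper names `fit_word_index` and `texts_to_sequences` and the two example reviews are hypothetical, and punctuation filtering is left out. It only illustrates the idea that frequent words receive small integer indices starting at 1 and that each text becomes a list of those indices.

```python
from collections import Counter


def fit_word_index(texts):
    """Build a word -> integer index, most frequent word first (indices start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep their first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


# Hypothetical toy reviews, not taken from the dataset
toy_reviews = ["This drug works", "this drug did not work"]
word_index = fit_word_index(toy_reviews)
sequences = texts_to_sequences(toy_reviews, word_index)

print(word_index)
print(sequences)
```

Index 0 is deliberately left unused so that it can later serve as a padding value.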

In [26]:
# Split into features and labels
X = list(df['review'])
y = list(df['rating'])

# Create a tokenizer to convert text to sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)

# Tokenize the reviews
X = tokenizer.texts_to_sequences(X)

# Print out the number of unique tokens
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')

# Get the length of the longest tokenized review
max_sequence = max(len(seq) for seq in X)
print(f'The maximum amount of words that the model can process is {max_sequence}.')

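# Simplification function: map the 1-10 rating onto three classes
def simplify(num):
    if num < 5:
        return 0  # low ratings (1-4)
    elif num > 6:
        return 2  # high ratings (7-10)
    else:
        return 1  # middle ratings (5-6)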

# Simplify the labels
y = list(map(simplify, y))

# Split into train (64%), validation (16%) and test (20%) sets, stratified by label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=.2, stratify=y_train)

Found 55245 unique tokens.
The maximum amount of words that the model can process is 2034.
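
The tokenized reviews still have different lengths (up to 2034 tokens), while layers such as `Embedding` followed by `Conv1D` expect inputs of one fixed length. A minimal sketch of the usual remedy, padding and truncating, is shown below in plain Python. The helper `pad_to_length`, the toy sequences and the choice of `maxlen=4` are assumptions for illustration only; the behaviour mirrors the default `padding='pre'` and `truncating='pre'` settings of Keras' `pad_sequences`, which would normally be used instead.

```python
def pad_to_length(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) every sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Hypothetical toy sequences standing in for tokenized reviews
toy_sequences = [[5, 2, 9], [7], [3, 1, 4, 1, 5, 9]]
padded = pad_to_length(toy_sequences, maxlen=4)

for row in padded:
    print(row)
```

Padding every review to the full 2034 tokens would work but wastes memory on mostly short reviews, so a shorter `maxlen` chosen from the length distribution is a common trade-off.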


## 5. Modeling

Instructions: Write a short description of the model: 

- selected loss, optimizer and metrics settings, and 
- the summary of the selected model architecture. 
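
The notebook does not yet name a loss. Since the labels are the integers 0, 1 and 2, one option that fits is sparse categorical cross-entropy on a three-way softmax output; this is an assumption, not the authors' stated choice. The sketch below computes that loss by hand for a single example. The functions and the all-zero logits are hypothetical and only meant to show what the number means.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, probs):
    """Negative log-probability assigned to the true integer class."""
    return -math.log(probs[label])


# Hypothetical output of an untrained model: equal scores for all three classes
logits = [0.0, 0.0, 0.0]
probs = softmax(logits)
loss = sparse_categorical_crossentropy(2, probs)

print(probs)
print(loss)
```

A model that spreads its probability evenly over the three classes scores ln 3 (about 1.10), so a training loss near that value suggests the model has not yet learned anything useful.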

## 6. Training

Instructions: Write a short description of the training process, and document the code for training and the total time spent on it.

In [23]:
# Your code

## 7. Performance and evaluation

Instructions: 

- Show the training and validation loss and accuracy plots
- Interpret the loss and accuracy plots (e.g. is there under- or over-fitting)
- Describe the final performance of the model with test set 
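
For the test-set description, accuracy alone can hide poor results on a small class, so a confusion matrix over the three classes is a useful companion. The following is a minimal plain-Python sketch; the helper names and the toy `y_true` and `y_pred` lists are hypothetical and not results from this notebook.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Share of examples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 0]

matrix = confusion_matrix(y_true, y_pred)
for row in matrix:
    print(row)
print(accuracy(matrix))
```

With real predictions, `y_pred` would be the index of the largest predicted probability for each test review.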

In [24]:
# Your code

## 8. Discussion and conclusions

Instructions: Write

- What settings and models were tested before the best model was found
    - What were the results of these experiments
- Summary of
    - What was your best model and its settings
    - What was the final achieved performance
- What are your main observations and learning points
- Discussion of how the model could be improved in the future

**Note:** Remember to evaluate the final metrics using the test set. 
