# Case 3. Patient Drug Review
**Neural Networks for Machine Learning Applications**<br>
27.02.2023<br>
Erik Holopainen, Alejandro Rosales Rodriguez and Brian van den Berg<br>
[Information Technology, Bachelor's Degree](https://www.metropolia.fi/en/academics/bachelors-degrees/information-technology)<br>
[Metropolia University of Applied Sciences](https://www.metropolia.fi/en)

## 1. Introduction

Instructions: Explain why this notebook was created and what its main objectives were.

## 2. Setup

Instructions: Briefly describe which libraries were used and why.

In [12]:
import pandas as pd
import numpy as np
import tensorflow as tf

from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

import os

## 3. Dataset

Instructions: Briefly describe the data and its main characteristics. Remember to document the code.

In [13]:
# Define the input variables
inputDir = 'input'
inputPaths = []

# Get the .csv files in the input folder
for file in os.listdir(inputDir):
    if file.endswith('.csv'):
        inputPaths.append(os.path.join(inputDir, file))

# Print the input paths
print(inputPaths)

# Read every input file and concatenate them into a single dataframe
df = pd.concat([pd.read_csv(path) for path in inputPaths], ignore_index=True)

# Drop the unique id column
df = df.drop(['uniqueID'], axis=1)

# Shuffle the dataframe
df = df.sample(frac=1)
df = df.reset_index(drop=True)

# Display the dataframe
display(df)

# Display the dataframe description
print("Description of the dataframe:")
display(df.describe().T)

['input\\drugsComTest_raw.csv', 'input\\drugsComTrain_raw.csv']


| | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|
| 0 | Venlafaxine | Depression | "Have been on 350mg for the past three years. ... | 6 | 6-Jun-11 | 19 |
| 1 | Rizatriptan | Migraine | "I love how fast I get relief with this and ho... | 9 | 8-Dec-11 | 3 |
| 2 | Gabapentin | Neuropathic Pain | "Because of the opiate crisis, my doctor didn&... | 10 | 2-May-17 | 12 |
| 3 | Escitalopram | Anxiety | "Stay away from this drug! Or at least if you ... | 1 | 29-Jan-16 | 11 |
| 4 | Fluticasone | Rhinitis | "Umm never again. One squirt - 5 mins later my... | 1 | 8-Dec-15 | 32 |
| ... | ... | ... | ... | ... | ... | ... |
| 215058 | Citalopram | Depression | "I have been on Celexa for about 15 years and ... | 10 | 23-Jul-09 | 106 |
| 215059 | Paxil | Anxiety | "About 6-8 weeks ago my doc prescribed 10 mg p... | 8 | 15-Apr-13 | 196 |
| 215060 | Paroxetine | Social Anxiety Disorde | "Paxil has changed my life for the better. I a... | 10 | 6-Mar-15 | 65 |
| 215061 | Phentermine | Weight Loss | "It was really effective for me. My doctor onl... | 9 | 14-Dec-14 | 161 |


Description of the dataframe:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rating | 215063.0 | 6.990008 | 3.275554 | 1.0 | 5.0 | 8.0 | 10.0 | 10.0 |
| usefulCount | 215063.0 | 28.001004 | 36.346069 | 0.0 | 6.0 | 16.0 | 36.0 | 1291.0 |


## 4. Preprocessing

Instructions: Describe:

- how the missing values are handled
- conversion of textual and categorical data into numerical values (if needed)
- how the data is split into train, validation and test sets
- the features (=input) and labels (=output), and 
- how the features are normalized or scaled
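
As an illustration of the label item above: the preprocessing code in this section collapses the 1-10 `rating` into three classes. The sketch below restates that mapping as a standalone function and applies it to every possible rating value. The name `simplify_rating` and the `mapping` dictionary are assumptions made for this illustration and are not part of the notebook.

```python
from collections import Counter


def simplify_rating(rating: int) -> int:
    """Map a 1-10 rating to class 0 (1-4), class 1 (5-6) or class 2 (7-10)."""
    if rating < 5:
        return 0
    if rating > 6:
        return 2
    return 1


# Apply the mapping to every possible rating value
mapping = {r: simplify_rating(r) for r in range(1, 11)}
print(mapping)

# Count how many of the ten rating values fall into each class
class_sizes = Counter(mapping.values())
print(class_sizes)
```

The middle class covers only two rating values, so it may end up much smaller than the other two classes, which is one reason the split in this section is stratified.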

In [19]:
# Split into features and labels
X = list(df['review'])
y = list(df['rating'])

# Create a tokenizer to convert text to sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)

# Tokenize the reviews
X = tokenizer.texts_to_sequences(X)

# Print out the number of unique tokens
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')

# Collapse the 1-10 rating into three classes:
# 0 for ratings 1-4, 1 for ratings 5-6, 2 for ratings 7-10
def simplify(num):
    if num < 5:
        return 0
    elif num > 6:
        return 2
    else:
        return 1

# Simplify the labels
y = list(map(simplify, y))

# Split into train, validation and test sets (stratified by class).
# Two successive 80/20 splits give roughly 64% train, 16% validation and 20% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=.2, stratify=y_train)

Found 55245 unique tokens.
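
The tokenized reviews have different lengths, so they typically need to be brought to a common length before they can be batched into a model. Keras offers `pad_sequences` for this purpose. The plain-Python sketch below only illustrates the idea: left-padding with zeros and keeping the last `maxlen` tokens of longer sequences. The helper name `pad_left`, the `maxlen` value and the toy sequences are assumptions made for illustration and are not taken from the notebook.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each token sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be a positive integer")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Toy token sequences of different lengths (hypothetical values)
toy_sequences = [[5, 8], [3, 1, 4, 1, 5, 9], [7]]
toy_padded = pad_left(toy_sequences, maxlen=4)
print(toy_padded)
```

The Keras `Tokenizer` does not assign index 0 to any word, which makes 0 a natural padding value. A suitable `maxlen` would have to be chosen from the review length distribution, which this notebook does not yet inspect.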


## 5. Modeling

Instructions: Write a short description of the model: 

- selected loss, optimizer and metrics settings, and 
- the summary of the selected model architecture. 

## 6. Training

Instructions: Write a short description of the training process, and document the training code and the total time spent on it.

In [15]:
# Your code

## 7. Performance and evaluation

Instructions: 

- Show the training and validation loss and accuracy plots
- Interpret the loss and accuracy plots (e.g. is there under- or over-fitting)
- Describe the final performance of the model on the test set

In [16]:
# Your code

## 8. Discussion and conclusions

Instructions: Write

- What settings and models were tested before the best model was found
    - What were the results of these experiments
- Summary of  
    - What was your best model and its settings 
    - What was the final achieved performance 
- What are your main observations and learning points
- Discussion of how the model could be improved in the future

**Note:** Remember to evaluate the final metrics using the test set. 
