# Drug Prescription Effectiveness and Ratings
### Dataset 
This dataset contains patient ratings of prescription drugs; it was published on Kaggle by user Rohan Harode.

Link: https://www.kaggle.com/datasets/jessicali9530/kuc-hackathon-winter-2018

#### Attributes
* _Drug_: name of the drug
* _Condition_: name of the condition the drug is intended to treat
* _Date_: date of the review and rating entry
* _Effectiveness_: 5-star rating of how effective the patient found the drug
* _Age_: age range of the patient
* _EaseofUse_: 5-star rating of how easy the drug is to use
* _Satisfaction_: 5-star rating of how much the patient liked the drug
* _Sex_: gender of the patient

* _*Both the 'usefulCount' and 'review' attributes were omitted because they are not used in the predictions_

#### Predictions
We are trying to predict the effectiveness of each drug for an unseen patient. To do this, we first group the reviews by drug name and compute the average effectiveness of each drug.
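
As a rough illustration of the grouping step, the sketch below averages a rating per drug in plain Python. The drug names, the ratings, and the helper name `mean_effectiveness_by_drug` are made up for the example and are not part of the `mysklearn` package.

```python
from collections import defaultdict
from statistics import mean


def mean_effectiveness_by_drug(rows):
    """Return {drug name: mean effectiveness} for (drug, effectiveness) pairs.

    Hypothetical helper for illustration; not part of mysklearn.
    """
    ratings = defaultdict(list)
    for drug, effectiveness in rows:
        ratings[drug].append(effectiveness)
    # round to two decimals, as the notebook does for its averages
    return {drug: round(mean(values), 2) for drug, values in ratings.items()}


# made-up reviews: (drug name, effectiveness rating)
sample_rows = [("drug_a", 4.0), ("drug_a", 5.0), ("drug_b", 2.0)]
averages = mean_effectiveness_by_drug(sample_rows)
```

Here `averages` maps each made-up drug name to the mean of its ratings.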

In [117]:
# some useful mysklearn package import statements and reloads
import importlib
import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

# MyPyTable: table loading, cleaning, and saving
import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

# classifiers from the mysklearn package
import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MyDummyClassifier, MyNaiveBayesClassifier, MyDecisionTreeClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

In [118]:
# Dataset Preprocessing
table, header = myutils.get_tables("input_data/webmd.csv")
drug_table = MyPyTable(header, table)

drug_table.remove_rows_with_missing_values()
drug_table.convert_to_numeric()
to_rem = drug_table.find_duplicates(['Drug','Age','Condition','Date','EaseofUse','Satisfaction','Sex','UsefulCount','Effectiveness'])
drug_table.drop_rows(to_rem)

# remove the unused attribute at index 7 (UsefulCount, going by the column order listed above)
for i in range(len(drug_table.data)):
    del drug_table.data[i][7]

del drug_table.column_names[7]

# convert dates to seasons, and repair age ranges that were
# auto-converted to dates ('6-Mar' -> '3-6', '12-Jul' -> '7-12')
drug_table.column_names[3] = 'Season'
for i in range(len(drug_table.data)):
    val = drug_table.data[i][3]
    newVal = myutils.season_discretize(val)
    drug_table.data[i][3] = newVal

    if drug_table.data[i][1] == '6-Mar':
        drug_table.data[i][1] = '3-6'
    if drug_table.data[i][1] == '12-Jul':
        drug_table.data[i][1] = '7-12'

drug_table.save_to_file('input_data/clean_drug.csv')

In [119]:
import statistics

# stats - group by 'Drug'
values, counts = drug_table.get_frequencies('Drug')

grouped_data = []
for item in values:
    grouped_data.append([])

for item in drug_table.data:
    ind = values.index(item[0])
    grouped_data[ind].append(item)

drug_avg_data = []
for i in range(len(grouped_data)):
    instance = []
    instance.append(grouped_data[i][0][0])

    # one list per non-Drug column
    mean_lists = [[] for _ in range(7)]

    for item in grouped_data[i]:
        # skip column 0 (the drug name) and collect the remaining values
        for j in range(1, len(item)):
            mean_lists[j-1].append(item[j])
    
    for item in mean_lists:
        if isinstance(item[0], float):
            instance.append(round(statistics.mean(item), 2))
        else:
            instance.append(myutils.get_most_frequent(item))

    drug_avg_data.append(instance)

avg_drug_table = MyPyTable()
avg_drug_table.data = drug_avg_data
avg_drug_table.column_names = drug_table.column_names

avg_drug_table.save_to_file('input_data/grouped_data.csv')

print()




### EDA - Statistics

To classify the effectiveness of each drug and report classification results for each instance, we had to convert the class label
into a categorical label. Instead of keeping Effectiveness as a continuous value between 0.0 and 5.0, we mapped each Effectiveness
value to one of the following categories:
- 0.0 to 1.0: Not_Effective ("NE")
- 1.01 to 2.0: Slightly_Effective ("SE")
- 2.01 to 3.0: Moderately_Effective ("ME")
- 3.01 to 4.0: Effective ("E")
- 4.01 to 5.0: Very_Effective ("VE")
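
The binning above can be sketched as a small function. This is only an illustration under stated assumptions: the implementation of `myutils.continuous_to_categorical` is not shown here, and the name `effectiveness_to_category` is hypothetical. Because the averages are rounded to two decimals, comparing against each bin's upper bound covers the ranges listed above without gaps.

```python
def effectiveness_to_category(value):
    """Map a mean Effectiveness value in [0.0, 5.0] to a category label.

    Hypothetical helper; it mirrors the bins listed above and is not
    the mysklearn implementation.
    """
    if value <= 1.0:
        return "NE"  # Not_Effective
    if value <= 2.0:
        return "SE"  # Slightly_Effective
    if value <= 3.0:
        return "ME"  # Moderately_Effective
    if value <= 4.0:
        return "E"   # Effective
    return "VE"      # Very_Effective


# made-up averages, including values on the bin boundaries
example_values = [0.5, 1.0, 1.01, 2.75, 4.0, 4.62]
example_labels = [effectiveness_to_category(v) for v in example_values]
```

With these inputs, `example_labels` is `["NE", "NE", "SE", "ME", "E", "VE"]`.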

In [120]:
# convert the averaged Effectiveness rating into a categorical class label
effectiveness_con = myutils.get_column(avg_drug_table.data, avg_drug_table.column_names, "Effectiveness")
effectiveness_cat = myutils.continuous_to_categorical(effectiveness_con)
for i in range(len(avg_drug_table.data)):
    avg_drug_table.data[i][-1] = effectiveness_cat[i]

# get X and y for stratified k fold cross validation
X = []
y = []
for row in avg_drug_table.data:
    X.append(row[:-1])  # every attribute except the class label
    y.append(row[-1])   # class label

# get folds
X_train_folds, X_test_folds = myevaluation.stratified_kfold_cross_validation(X, y, n_splits=5, random_state=None, shuffle=True)

# create training/testing sets out of the folds
X_train_sets, y_train_sets = myutils.get_sets_from_folds(X, y, X_train_folds)
X_test_sets, y_test_sets = myutils.get_sets_from_folds(X, y, X_test_folds)



labels = ["VE", "E", "ME", "SE", "NE"]
dummy = MyDummyClassifier()
myutils.fit_predict_classification(X_train_sets, y_train_sets, 
    X_test_sets, y_test_sets, dummy, "Dummy", labels)


index: 7
index: 1
Dummy Classification
Accuracy: 0.38 ~ Error Rate: 0.62
Recall Score: 1.0
Precision Score: 0.36
F1 Score: 0.53

Confusion Matrix:
+-----+---+----+----+----+
| VE  | E | ME | SE | NE |
+-----+---+----+----+----+
| 369 | 0 | 0  | 0  | 0  |
| 343 | 0 | 0  | 0  | 0  |
| 191 | 0 | 0  | 0  | 0  |
| 57  | 0 | 0  | 0  | 0  |
| 56  | 0 | 0  | 0  | 0  |
+-----+---+----+----+----+
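
Every instance in the matrix above is predicted as VE, which is the pattern a majority-class baseline produces. The sketch below assumes that `MyDummyClassifier` behaves this way; its implementation is not shown here, and `majority_class_predict` is a hypothetical stand-in rather than the mysklearn code.

```python
from collections import Counter


def majority_class_predict(y_train, n_test):
    """Predict the most frequent training label for every test instance."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_test


# per-class totals read from the rows of the confusion matrix above
class_counts = {"VE": 369, "E": 343, "ME": 191, "SE": 57, "NE": 56}
y_all = [label for label, count in class_counts.items() for _ in range(count)]

# the baseline ignores the attributes and always returns the majority label
predictions = majority_class_predict(y_all, len(y_all))

# fraction of instances that belong to the majority class
majority_share = class_counts["VE"] / len(y_all)
```

Here `majority_share` is 369 / 1016, about 0.36, which is in line with the reported precision.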
