# Who is heading for Diabetes?

This is the predictive part of the 2017 Melbourne Datathon.

The task is to predict the probability that a patient will be dispensed a diabetes-related drug after 2015. Such a model could act as an early warning system for doctors, so that intervention can be made before it is too late.

Build your model on the patients whose complete records have been provided, then see how it performs on the unseen patients.

For patient IDs 279,201 to 558,352 you need to submit a file with two columns: the Patient_ID and the predicted probability in the range [0, 1]. The file will have 279,153 rows including the header row. An example submission file is provided for download.
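The format rules above can be checked mechanically before uploading. The following is a minimal sketch, not the organisers' validator: the function name `check_submission` and the probability column name `Diabetes` are assumptions (the text only names `Patient_ID`), and the ID range is copied from the task description.

```python
import numpy as np
import pandas as pd

FIRST_ID, LAST_ID = 279_201, 558_352  # ID range quoted in the task description


def check_submission(frame, prob_col="Diabetes"):
    """Return a list of problems found in a submission frame (empty if none).

    `prob_col` is a hypothetical column name; use whatever the example
    submission file actually calls its probability column.
    """
    problems = []
    expected_ids = np.arange(FIRST_ID, LAST_ID + 1)
    if list(frame.columns) != ["Patient_ID", prob_col]:
        problems.append("expected exactly two columns: Patient_ID, " + prob_col)
        return problems
    if len(frame) != len(expected_ids):
        problems.append("expected %d data rows, got %d" % (len(expected_ids), len(frame)))
    elif not np.array_equal(np.sort(frame["Patient_ID"].to_numpy()), expected_ids):
        problems.append("Patient_ID values do not cover the expected range")
    probs = frame[prob_col]
    if probs.isna().any() or not probs.between(0.0, 1.0).all():
        problems.append("probabilities must be non-missing and within [0, 1]")
    return problems


# Toy usage: a constant-probability submission satisfies every check.
demo = pd.DataFrame({"Patient_ID": np.arange(FIRST_ID, LAST_ID + 1), "Diabetes": 0.5})
print(check_submission(demo))
```

The data-row count the function expects is one less than the 279,153 rows quoted above, because the header row is not part of the frame.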

In [178]:
import pandas as pd
import numpy as np
import sqlite3

from sklearn.model_selection import train_test_split, cross_val_score 
from sklearn import svm

In [138]:
submission = pd.read_csv('../../submissions/diabetes_submission_example.csv')  

## Load the lookup data

In [139]:
conn = sqlite3.connect("../../sql/datasci.db")

In [237]:
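def patient_data(conn, patient_id):
    """
    Return every pre-2016 transaction for one patient, joined to the
    chronic-illness lookup, the patient details and the target classification.
    """
    # Bind the ID as a query parameter rather than formatting it into the SQL.
    SQL = """
SELECT *
FROM transactions a
LEFT OUTER JOIN ChronicIllness_LookUp b
    ON a.Drug_ID = b.MasterProductID
LEFT OUTER JOIN patients c
    ON a.Patient_ID = c.Patient_ID
LEFT OUTER JOIN classification d
    ON a.Patient_ID = d.Patient_ID
WHERE a.Patient_ID = ?
AND a.prescription_week < '2016-01-01'
ORDER BY prescription_week
    """

    # int() converts NumPy integers, which sqlite3 may not bind directly.
    return pd.read_sql_query(SQL, conn, params=(int(patient_id),))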

In [252]:
df = patient_data(conn, 100)
df.head()

| | Patient_ID | Store_ID | Prescriber_ID | Drug_ID | SourceSystem_Code | Prescription_Week | Dispense_Week | Drug_Code | NHS_Code | IsDeferredScript | ... | StreamlinedApproval_Code | ChronicIllness | MasterProductID | MasterProductFullName | Patient_ID.1 | gender | year_of_birth | postcode | Patient_ID.2 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 2415 | 32470 | 4867 | F | 2010-10-17 | 2011-06-19 | LIPI3 | 8215J | 0 | ... | | Lipids | 4867.0 | LIPITOR TAB 40MG 30 | 100 | F | 1938 | 3500 | 100 | 0 |
| 1 | 100 | 2415 | 32470 | 9430 | F | 2011-01-02 | 2011-02-13 | ZOMI1 | 8266C | 0 | ... | | | | | 100 | F | 1938 | 3500 | 100 | 0 |
| 2 | 100 | 2415 | 32470 | 9430 | F | 2011-04-03 | 2011-04-17 | ZOMI1 | 8266C | 0 | ... | | | | | 100 | F | 1938 | 3500 | 100 | 0 |
| 3 | 100 | 2002 | 0 | 9430 | F | 2011-04-03 | 2011-05-08 | ZOMI1 | 8266C | 0 | ... | | | | | 100 | F | 1938 | 3500 | 100 | 0 |
| 4 | 100 | 2415 | 1637 | 9430 | F | 2011-06-05 | 2011-06-12 | ZOMI1 | 8266C | 0 | ... | | | | | 100 | F | 1938 | 3500 | 100 | 0 |


In [265]:
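# na=False counts transactions with no ChronicIllness label as non-matches,
# so missing values cannot leak into the .any() result.
def had_diabetes(patient_data):
    return float(patient_data.ChronicIllness.str.contains('Diabetes', na=False).any())

def had_lipids(patient_data):
    return float(patient_data.ChronicIllness.str.contains('Lipids', na=False).any())

def had_hypertension(patient_data):
    return float(patient_data.ChronicIllness.str.contains('Hypertension', na=False).any())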
    

In [266]:
had_hypertension(df)

1.0

## Compute some basic features of the data 

In [267]:
gender_map = {'F': 1, 'M': 0, 'U': 0.5}

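def feature_extract(patient_frame):
    """
    Build the feature vector x and the target y for a single patient.

    ** The frame must contain only pre-2016 records, so that no data from
       the prediction period ends up in the features.
    """
    # Patient-level columns repeat on every row, so read them from the first row.
    x = [gender_map[patient_frame.gender.iloc[0]],
         2016 - patient_frame.year_of_birth.iloc[0],  # approximate age in 2016
         had_diabetes(patient_frame),
         had_lipids(patient_frame),
         had_hypertension(patient_frame)]

    y = patient_frame.Target.iloc[0]

    return x, y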


## Look at the descriptive power of the features.


Examine how each feature relates to the target, and plot the relationships.
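One simple way to inspect descriptive power is to compare the target rate with and without each binary flag. The sketch below uses a small hypothetical table (`toy`) in place of the real features, and the helper name `target_rate_by_flag` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy feature table; in the notebook these rows would come from
# feature_extract(), one row per patient.
toy = pd.DataFrame({
    "had_diabetes":     [1, 1, 1, 0, 0, 0, 0, 0],
    "had_lipids":       [1, 0, 1, 1, 0, 0, 1, 0],
    "had_hypertension": [1, 1, 0, 0, 1, 0, 0, 0],
    "age":              [70, 65, 58, 61, 45, 30, 52, 27],
    "Target":           [1, 1, 1, 0, 0, 0, 1, 0],
})


def target_rate_by_flag(frame, flags, target="Target"):
    """Mean of the target where flag == 0 and where flag == 1, one row per flag."""
    rows = {flag: frame.groupby(flag)[target].mean() for flag in flags}
    return pd.DataFrame(rows).T.rename(columns={0: "rate_if_0", 1: "rate_if_1"})


rates = target_rate_by_flag(toy, ["had_diabetes", "had_lipids", "had_hypertension"])
print(rates)

# Pearson correlation of every feature with the target.
print(toy.corr()["Target"].drop("Target"))
```

A large gap between `rate_if_0` and `rate_if_1` suggests the flag carries signal. If matplotlib is installed, `rates.plot.bar()` should draw the same comparison as a chart.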

## Create the matrix of features and vector of targets

In [None]:
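X, Y = [], []
# Draw 1,000 random patient IDs (with replacement) and build one row per patient.
for i in np.random.randint(0, 200000, 1000):
    x, y = feature_extract(patient_data(conn, i))
    X.append(x)
    Y.append(y)
X = np.vstack(X)  # feature matrix, one row per sampled patient
y = np.array(Y)   # target vector, one entry per sampled patient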
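The loop above issues one query per patient, which may become slow for the full dataset. As a possible alternative, the flags can be aggregated inside SQLite in a single query. The sketch below runs against a toy in-memory database that mimics only the columns used here; the toy rows, `demo_conn` and `BULK_SQL` are assumptions, and the hypertension flag is omitted for brevity.

```python
import sqlite3

import pandas as pd

# Toy in-memory database that mimics only the columns this sketch needs.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE patients (Patient_ID INTEGER, gender TEXT, year_of_birth INTEGER);
CREATE TABLE classification (Patient_ID INTEGER, Target INTEGER);
CREATE TABLE ChronicIllness_LookUp (MasterProductID INTEGER, ChronicIllness TEXT);
CREATE TABLE transactions (Patient_ID INTEGER, Drug_ID INTEGER, Prescription_Week TEXT);
INSERT INTO patients VALUES (1, 'F', 1950), (2, 'M', 1980);
INSERT INTO classification VALUES (1, 1), (2, 0);
INSERT INTO ChronicIllness_LookUp VALUES (10, 'Diabetes'), (20, 'Lipids');
INSERT INTO transactions VALUES
    (1, 10, '2014-05-04'), (1, 20, '2015-03-01'),
    (2, 20, '2016-02-07'), (2, 99, '2013-01-06');
""")

# One row per patient; MAX over the CASE flags is 1 if any pre-2016
# transaction maps to that illness, else 0.
BULK_SQL = """
SELECT p.Patient_ID,
       p.gender,
       2016 - p.year_of_birth AS age,
       MAX(CASE WHEN l.ChronicIllness LIKE '%Diabetes%' THEN 1 ELSE 0 END) AS had_diabetes,
       MAX(CASE WHEN l.ChronicIllness LIKE '%Lipids%' THEN 1 ELSE 0 END) AS had_lipids,
       c.Target
FROM patients p
JOIN classification c ON c.Patient_ID = p.Patient_ID
LEFT JOIN transactions t
       ON t.Patient_ID = p.Patient_ID AND t.Prescription_Week < '2016-01-01'
LEFT JOIN ChronicIllness_LookUp l ON l.MasterProductID = t.Drug_ID
GROUP BY p.Patient_ID, p.gender, p.year_of_birth, c.Target
ORDER BY p.Patient_ID
"""

features = pd.read_sql_query(BULK_SQL, demo_conn).set_index("Patient_ID")
print(features)
```

Putting the date filter in the `LEFT JOIN` condition rather than in a `WHERE` clause keeps patients who have no pre-2016 transactions in the result, with all flags set to 0.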

## Fit a support vector machine to the data, using 5-fold cross-validation

In [263]:
clf = svm.SVC(kernel='rbf', C=1)

scores = cross_val_score(clf, X, y, cv=5)
scores

array([ 0.90547264,  0.925     ,  0.91      ,  0.97      ,  0.90954774])

In [264]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.92 (+/- 0.05)
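The task asks for a probability rather than a class label, and accuracy can look high simply because most patients fall into one class. A complementary check is to score out-of-fold probabilities with ROC AUC. The sketch below is one way to do that, not the notebook's method: it runs on synthetic data from `make_classification` as a stand-in for `X` and `y`, and it wraps the SVM in `CalibratedClassifierCV` to obtain `predict_proba`; both choices are assumptions.

```python
import numpy as np
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the (X, y) built above: 5 features, imbalanced classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)

# An SVC has no predict_proba by default; the calibration wrapper fits a
# sigmoid on its decision scores (using 3 internal folds) to produce one.
clf_demo = CalibratedClassifierCV(svm.SVC(kernel="rbf", C=1), cv=3)

# Out-of-fold probabilities: each row is scored by a model that never saw it.
proba = cross_val_predict(clf_demo, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]

print("ROC AUC: %0.3f" % roc_auc_score(y_demo, proba))
```

ROC AUC depends only on how the probabilities rank patients, so it does not reward a model for always predicting the majority class.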


## Classify the remaining dataset and form a submission
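This final step could look roughly like the sketch below, which turns a fitted probabilistic model into a two-column frame ready to be written as CSV. It uses random toy data and a logistic regression purely as placeholders; the helper `make_submission`, the column name `Diabetes`, and the ten example IDs are assumptions rather than part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def make_submission(model, patient_ids, X_unseen, prob_col="Diabetes"):
    """Build a two-column submission frame from a fitted probabilistic model.

    `prob_col` is a hypothetical name; match it to the example submission file.
    """
    # Column 1 of predict_proba is P(class 1) when the targets are coded 0/1.
    proba = model.predict_proba(X_unseen)[:, 1]
    return pd.DataFrame({"Patient_ID": patient_ids,
                         prob_col: np.clip(proba, 0.0, 1.0)})


# Toy stand-ins for the training features/targets and the unseen patients.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
unseen_ids = np.arange(279_201, 279_211)  # first ten unseen IDs only
X_unseen = rng.normal(size=(10, 5))

model = LogisticRegression().fit(X_train, y_train)
sub = make_submission(model, unseen_ids, X_unseen)

# index=False keeps the file to exactly the two required columns.
csv_text = sub.to_csv(index=False)
print(csv_text.splitlines()[0])
```

For the real submission, the same helper would be called with features for every patient ID from 279,201 to 558,352, and `sub.to_csv(path, index=False)` would write the file.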