# About
The full dataset is too large for the Dash app to run responsively, so we reduce it to a subset of patients.
We select patients using criteria that keep the sample representative of the 'comparative cohort' while still showcasing the main features of the Dash app.

# Set Up

In [1]:
import os
os.chdir('../src')

In [3]:
import pandas as pd
import pickle

In [None]:
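# Load the intermediate tables; pd.read_pickle closes the file handle itself
df = pd.read_pickle('../data/intermediate/patient_demographics.pkl')
visits_data = pd.read_pickle('../data/intermediate/visits_data.pkl')
# Visit types ordered from most to least frequent
visit_types_list = list(visits_data.groupby('VisitType')['VisitID'].count().sort_values(ascending=False).index.dropna())
# Distinct disease categories present in the visits table
disease_list = visits_data.DiseaseCat.unique().tolist()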

In [14]:
visits_data

Unnamed: 0,PatientID,VisitID,Days,VisitType,CGI,Medication,Dose,Regimen,Diagnosis,DSMNo,DiseaseCat
0,92,1667,105,Outpatient,2,fluoxetine,20.0,1.0,Diagnosis Deferred on Axis II,799.9,others
1,92,1668,105,Outpatient,1,fluoxetine,20.0,1.0,Bulimia Nervosa,307.51,feeding and eating disorders
2,92,1669,104,Outpatient,1,fluoxetine,20.0,1.0,Opioid Dependence,304.00,substance-related and addictive disorders
3,92,1673,104,Outpatient,1,fluoxetine,20.0,1.0,Opioid Dependence,304.00,substance-related and addictive disorders
4,92,1674,103,Outpatient,1,fluoxetine,20.0,1.0,Diagnosis Deferred on Axis II,799.9,others
5,92,1675,103,Outpatient,1,fluoxetine,20.0,1.0,Diagnosis Deferred on Axis II,799.9,others
6,92,1677,103,Outpatient,2,fluoxetine,20.0,1.0,Depressive Disorder NOS,311,other mood disorder
7,92,1679,102,Outpatient,1,fluoxetine,20.0,1.0,Diagnosis Deferred on Axis II,799.9,others
8,92,1681,102,Outpatient,1,fluoxetine,20.0,1.0,Diagnosis Deferred on Axis II,799.9,others
9,92,1691,99,Outpatient,1,fluoxetine,20.0,1.0,Bulimia Nervosa,307.51,feeding and eating disorders


# Data Overview

Original data size:

In [136]:
df.shape, visits_data.shape

((14633, 4), (1355676, 11))

In [17]:
len(visits_data.PatientID.unique())

14473

In [21]:
visits_data.groupby('DiseaseCat')['VisitID'].count().sort_values(ascending=False)

DiseaseCat
major depressive disorder                            376963
others                                               245439
anxiety disorders                                    182627
substance-related and addictive disorders            126786
other mood disorder                                   85129
trauma and stressor related disorders                 78848
personality disorders                                 78525
neurodevelopmental disorders                          52592
schizophrenia and other psychotic disorders           33587
bipolar disorders                                     33449
adhd                                                  28140
feeding and eating disorders                           7615
neurocognitive disorders                               7514
impulse-control disorders                              7001
conduct disorder                                       3615
sleep-wake disorders                                   2946
mental disorders due to gener

# Picking Patients
Ideal patients: 
- have enough records in the database
- have large changes in CGI
- have tried a reasonable number of medications
- have some comorbidities

In [72]:
ptgrp = visits_data.groupby('PatientID')
pts = ptgrp.agg({'VisitID':'count',
           'CGI' : ['std','min','max'],
           'Medication': 'nunique',
           'DiseaseCat':'nunique',
           'Days' : ['min','max']
          })

# Span of each patient's record in days, counting both the first and last day
pts['duration'] = pts['Days']['max'] - pts['Days']['min'] + 1

In [77]:
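# Filter criteria
# Note: the number of medications is aggregated above but not filtered on here.
pts_subset = pts[(pts['VisitID']['count']>10)&          # enough records...
                (pts['VisitID']['count']<=200)&         # ...but not an excessive number
                (pts['DiseaseCat']['nunique']>2)&       # some comorbidities
                (pts['duration']>30)&(pts['duration']<=720)&  # record spans 31 to 720 days
                (pts['CGI']['std']>0.5)]                # noticeable change in CGI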
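
The same selection can also be written with named aggregation, which yields flat column names instead of a column MultiIndex. The sketch below is illustrative only: it runs on a made-up toy table, the thresholds are scaled down to fit that table, and `min_meds` is a hypothetical medication threshold that the filter above does not apply.

```python
import pandas as pd

# Toy visits table; column names mirror visits_data, all values are made up.
toy_visits = pd.DataFrame({
    "PatientID":  [1, 1, 1, 2, 2, 2],
    "VisitID":    [10, 11, 12, 20, 21, 22],
    "Days":       [0, 30, 60, 0, 1, 2],
    "CGI":        [1, 3, 5, 2, 2, 2],
    "Medication": ["fluoxetine", "sertraline", "sertraline",
                   "fluoxetine", "fluoxetine", "fluoxetine"],
    "DiseaseCat": ["adhd", "others", "anxiety disorders",
                   "adhd", "adhd", "adhd"],
})

# Named aggregation: one flat, descriptive column per statistic.
toy_pts = toy_visits.groupby("PatientID").agg(
    n_visits=("VisitID", "count"),
    cgi_std=("CGI", "std"),
    n_meds=("Medication", "nunique"),
    n_cats=("DiseaseCat", "nunique"),
    first_day=("Days", "min"),
    last_day=("Days", "max"),
)
toy_pts["duration"] = toy_pts["last_day"] - toy_pts["first_day"] + 1

# Thresholds are scaled down for the toy data.
# min_meds is a hypothetical extra criterion, not part of the original filter.
min_meds = 2
keep = (
    (toy_pts["n_visits"] >= 3)
    & (toy_pts["n_cats"] > 2)
    & (toy_pts["duration"] > 30)
    & (toy_pts["cgi_std"] > 0.5)
    & (toy_pts["n_meds"] >= min_meds)
)
selected_ids = toy_pts[keep].index.tolist()
print(selected_ids)
```

Flat column names make each condition readable on its own, at the cost of naming every aggregate explicitly.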

In [126]:
pts_sample = list(pts_subset.index)

In [135]:
df_sample = df[df.PatientID.isin(pts_sample)]
visits_data_sample = visits_data[visits_data.PatientID.isin(pts_sample)]

Unnamed: 0,PatientID,Sex,Race,Age
3,924,F,other,28
6,1651,F,other,51
7,1713,F,other,22
9,2107,F,white,35
12,2191,F,white,58
14,2361,M,white,42
20,4425,M,other,36
21,4714,F,other,36
22,4809,M,other,36
27,6704,F,white,38
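
One way to check how representative the sample is would be to compare the share of visits per disease category in the full table against the sampled one. This is a minimal sketch on made-up toy data; `toy_visits` and `toy_sample_ids` are hypothetical stand-ins for `visits_data` and `pts_sample`.

```python
import pandas as pd

# Toy stand-in for visits_data; values are made up.
toy_visits = pd.DataFrame({
    "PatientID":  [1, 1, 2, 2, 3, 3, 4, 4],
    "DiseaseCat": ["adhd", "others", "adhd", "adhd",
                   "others", "others", "adhd", "others"],
})
toy_sample_ids = [1, 2]  # hypothetical stand-in for pts_sample
toy_sample = toy_visits[toy_visits["PatientID"].isin(toy_sample_ids)]

# Share of visits per disease category in the full and sampled tables.
full_share = toy_visits["DiseaseCat"].value_counts(normalize=True)
sample_share = toy_sample["DiseaseCat"].value_counts(normalize=True)

# Align both on category; a category missing from the sample gets share 0.
comparison = pd.DataFrame({"full": full_share, "sample": sample_share}).fillna(0.0)
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()
print(comparison.sort_values("abs_diff", ascending=False))
```

Large values in `abs_diff` would point to categories that the filter over- or under-represents.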


# Saving Sampled Data
Size of the reduced dataset:

In [132]:
df_sample.shape, visits_data_sample.shape

((2519, 4), (125163, 11))

In [134]:
with open('../data/intermediate/demographics_sample.pkl', 'wb') as f:
    pickle.dump(df_sample, f)
with open('../data/intermediate/visits_data_sample.pkl', 'wb') as f:
    pickle.dump(visits_data_sample, f)
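
After saving, it can be worth confirming that a pickled table reloads unchanged. The sketch below shows such a round-trip check on a small made-up frame written to a temporary directory, so it does not touch the files saved above; `toy_sample` and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for df_sample.
toy_sample = pd.DataFrame({"PatientID": [924, 1651], "Sex": ["F", "F"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_sample.pkl")  # hypothetical file name
    toy_sample.to_pickle(path)       # write the frame to disk
    reloaded = pd.read_pickle(path)  # read it back while the file still exists

# DataFrame.equals compares shape, dtypes, and values.
round_trip_ok = reloaded.equals(toy_sample)
print(round_trip_ok)
```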