# Mean Study Distance

Compute the mean similarity among the studies included in each drug review, and the mean similarity between those included studies and the studies that were explicitly excluded. The hope is that studies in the same review are closer to one another (higher mean similarity) than they are to studies that were explicitly excluded!

### Choose Experiment Group & ID

In [1]:
exp_group = 'big-populations'
exp_id = 9

### Load Model

In [2]:
import keras

model = keras.models.load_model('store/weights/{}/{}/0-loss.h5'.format(exp_group, exp_id))

Using Theano backend.


Load abstracts which were included and excluded at the PICO level...

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('../preprocess/test_df.csv')

df

Unnamed: 0,pmid,drug,label,text
0,10023943,BetaBlockers,inc,"In patients with heart failure, beta-blockade ..."
1,10024259,ProtonPumpInhibitors,inc,To assess intermittent treatment over 12 month...
2,10029645,ACEInhibitors,inc,Population-based studies have found that black...
3,10029786,CalciumChannelBlockers,pop,Preterm labor is the leading cause of perinata...
4,10030097,BetaBlockers,inc,"Prospective, randomized and long term multicen..."
5,10047626,Statins,inc,The HMG CoA reductase inhibitors have quickly ...
6,10052772,CalciumChannelBlockers,inc,The prevalence of gingival overgrowth induced ...
7,10063787,BetaBlockers,inc,To assess the efficacy and safety of bisoprolo...
8,10066996,ADHD,inc,Three experiments were conducted to explore th...
9,10069685,CalciumChannelBlockers,outcome,We compared antihypertensive efficacy and safe...


### List Drugs

In [24]:
for drug in df.groupby('drug').size().index.tolist():
    print(drug)

ACEInhibitors
ADHD
Antihistamines
AtypicalAntipsychotics
BetaBlockers
CalciumChannelBlockers
Estrogens
NSAIDS
Opiods
OralHypoglycemics
ProtonPumpInhibitors
SkeletalMuscleRelaxants
Statins
Triptans
UrinaryIncontinence


### Sample Abstracts

In [4]:
for _ in range(5):
    print(np.random.choice(df.text))
    print('\n' + '*' * 80)

To determine whether a third generation vasodilating beta blocker (celiprolol) has long term clinical advantages over metoprolol in patients with chronic heart failure.

********************************************************************************

********************************************************************************
This review examines the evidence for the development of adverse effects due to prolonged gastric acid suppression with proton pump inhibitors. Potential areas of concern regarding long-term proton pump inhibitor use have included: carcinoid formation; development of gastric adenocarcinoma (especially in patients with Helicobacter pylori infection); bacterial overgrowth; enteric infections; and malabsorption of fat, minerals, and vitamins. Prolonged proton pump inhibitor use may lead to enterochromaffin-like cell hyperplasia, but has not been demonstrated to increase the risk of carcinoid formation. Long-term proton pump inhibitor treatment has not been docum

Check the label counts for each drug...

In [5]:
for name, group in df.groupby('drug'):
    print(name)
    print(group.groupby('label').size())
    print('\n' + '*' * 80 + '\n')

ACEInhibitors
label
inc        168
outcome      2
dtype: int64

********************************************************************************

ADHD
label
inc        83
outcome    21
pop         4
dtype: int64

********************************************************************************

Antihistamines
label
inc    90
dtype: int64

********************************************************************************

AtypicalAntipsychotics
label
inc        333
outcome    120
pop         94
dtype: int64

********************************************************************************

BetaBlockers
label
inc        270
outcome      8
pop         53
dtype: int64

********************************************************************************

CalciumChannelBlockers
label
inc        257
outcome    145
pop         94
dtype: int64

********************************************************************************

Estrogens
label
inc    79
dtype: int64

**************************************

CalciumChannelBlockers has one of the more balanced label distributions (257 inc, 145 outcome, 94 pop). Let's go with that...

In [6]:
drug_df = df.groupby('drug').get_group('CalciumChannelBlockers')

drug_df

Unnamed: 0,pmid,drug,label,text
3,10029786,CalciumChannelBlockers,pop,Preterm labor is the leading cause of perinata...
6,10052772,CalciumChannelBlockers,inc,The prevalence of gingival overgrowth induced ...
9,10069685,CalciumChannelBlockers,outcome,We compared antihypertensive efficacy and safe...
15,10073852,CalciumChannelBlockers,inc,"This multicenter, randomized, double-blind, pa..."
26,10090348,CalciumChannelBlockers,outcome,"A single-center, prospective double-blind rand..."
30,10099034,CalciumChannelBlockers,inc,It has been proposed that worsening of heart f...
32,10099075,CalciumChannelBlockers,outcome,"To assess the efficacy and safety of 2.5, 5, a..."
35,10100063,CalciumChannelBlockers,inc,The Systolic Hypertension in Europe (Syst-Eur)...
50,10172139,CalciumChannelBlockers,inc,This study was a prospective observational stu...
55,10189144,CalciumChannelBlockers,inc,Recent reports suggest a possible link between...


### Load Population Vectorizer

In [25]:
import pickle

# Open in binary mode and close the file handle once the pickle is loaded
with open('../preprocess/abstracts.p', 'rb') as fh:
    vectorizer = pickle.load(fh)

X = vectorizer.texts_to_sequences(drug_df.text)

inc_idxs = np.argwhere(drug_df.label == 'inc').flatten()
pop_idxs = np.argwhere(drug_df.label == 'pop').flatten()

X_pop, X_inc = X[pop_idxs], X[inc_idxs]
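
The split above only works if `X` supports NumPy-style fancy indexing, which presumably means `texts_to_sequences` returns a 2-D array here (an assumption, since the vectorizer class is not shown). As a minimal sketch on toy data, where `labels` and `X_toy` are made-up stand-ins rather than anything from the dataset, the same selection can be written with `np.flatnonzero`:

```python
import numpy as np

# Toy stand-ins (hypothetical): one label and one padded sequence per study.
labels = np.array(['pop', 'inc', 'outcome', 'inc', 'pop'])
X_toy = np.arange(15).reshape(5, 3)

# Positional indices of the rows carrying each label.
inc_idxs_toy = np.flatnonzero(labels == 'inc')
pop_idxs_toy = np.flatnonzero(labels == 'pop')

# Indexing with an index array requires an ndarray;
# a plain list of lists would raise a TypeError here.
X_inc_toy, X_pop_toy = X_toy[inc_idxs_toy], X_toy[pop_idxs_toy]
```

Positional indices matter because `drug_df` keeps the row labels of the full dataframe (3, 6, 9, ...), which do not line up with positions in `X`.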

Run abstracts through the model...

In [26]:
import keras.backend as K

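# Build a backend function that maps raw inputs to the activations of the 'study' layer
inputs = [model.inputs[0], K.learning_phase()]
outputs = model.get_layer('study').output
f = K.function(inputs, outputs)

TEST_MODE = 0  # learning phase 0 = inference (e.g. dropout disabled)
H_pop, H_inc = f([X_pop, TEST_MODE]), f([X_inc, TEST_MODE])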

Compute the mean similarity...

In [27]:
def gen_upper_triangular(N=5):
    """Yield index pairs (i, j) with i < j, i.e. the strict upper triangle of an N x N matrix."""
    for i in range(N):
        for j in range(i + 1, N):
            yield i, j

# Unzip the pairs into row indices I and column indices J for fancy indexing
I, J = zip(*gen_upper_triangular(N=len(H_inc)))
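
NumPy's `np.triu_indices` yields the same index pairs as the generator above without a Python-level loop. The sketch below checks the equivalence on a small `N`; the names `I_gen`, `J_gen`, `I_np`, and `J_np` are introduced only for this illustration.

```python
import numpy as np

def gen_upper_triangular(N=5):
    """Yield (i, j) pairs with i < j, as in the notebook cell."""
    for i in range(N):
        for j in range(i + 1, N):
            yield i, j

N = 4
I_gen, J_gen = zip(*gen_upper_triangular(N))

# k=1 offsets the triangle by one, so the diagonal (i == j) is excluded.
I_np, J_np = np.triu_indices(N, k=1)
```

Excluding the diagonal matters here: the self-similarity of each study would otherwise inflate the within-review mean.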

Mean similarity between studies which were included in the same review...

In [28]:
S_inc = np.dot(H_inc, H_inc.T)

S_inc[I, J].mean()

0.37692174

Mean similarity between studies which were included in the same review and studies which were excluded due to population...

In [29]:
S_pop = np.dot(H_pop, H_inc.T)

S_pop.mean()

0.21857567
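
The two means above are raw dot products, so they depend on activation magnitude as well as direction. As a hedged sketch (not what this notebook ran), the helpers below reproduce the within-group and between-group means and add an optional L2 normalization that turns the dot product into cosine similarity. The function names, the `eps` guard, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np

def mean_within(H):
    """Mean dot product over all distinct pairs of rows of H."""
    S = np.dot(H, H.T)
    i, j = np.triu_indices(len(H), k=1)  # skip self-similarities on the diagonal
    return S[i, j].mean()

def mean_between(A, B):
    """Mean dot product over every (row of A, row of B) pair."""
    return np.dot(A, B.T).mean()

def l2_normalize(H, eps=1e-12):
    """Scale rows to unit length; eps keeps all-zero rows from dividing by zero."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H / np.maximum(norms, eps)

# Toy activations (hypothetical), standing in for H_inc and H_pop.
H_a = np.array([[1., 0.], [0., 1.], [1., 1.]])
H_b = np.array([[2., 0.]])

raw_within = mean_within(H_a)
raw_between = mean_between(H_b, H_a)
cos_within = mean_within(l2_normalize(H_a))
cos_between = mean_between(l2_normalize(H_b), l2_normalize(H_a))
```

The `eps` guard is relevant when a study produces no nonzero activations at all, since such a row has zero norm.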

Almost all activations are zero!

In [30]:
used_idxs = np.argwhere(H_inc != 0)[:, 1]

np.unique(used_idxs).shape[0]

243
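
The count above is the number of hidden units that fire for at least one included study. A minimal sketch on a toy matrix (the `H_toy` values are made up) shows that the `argwhere`/`unique` approach agrees with a more direct column-wise reduction:

```python
import numpy as np

# Toy activations (hypothetical): 3 studies, 4 hidden units.
H_toy = np.array([[0., 2., 0., 0.],
                  [0., 0., 0., 1.],
                  [0., 3., 0., 0.]])

# Approach used in the notebook: collect the column index of every nonzero entry,
# then count the distinct columns.
used_via_argwhere = np.unique(np.argwhere(H_toy != 0)[:, 1]).shape[0]

# Equivalent reduction: a unit is "used" if any study activates it.
used_via_any = int((H_toy != 0).any(axis=0).sum())
```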

In [13]:
np.sum(H_inc==0, axis=1) / float(H_inc.shape[1])

array([ 0.998999  ,  0.98798799,  0.998999  ,  0.998999  ,  0.98398398,
        0.99099099,  0.997998  ,  0.995996  ,  0.98998999,  0.98698699,
        0.99299299,  0.997998  ,  0.98798799,  0.998999  ,  0.997998  ,
        0.995996  ,  0.98198198,  0.996997  ,  0.996997  ,  0.99099099,
        0.98898899,  0.997998  ,  0.99099099,  0.98598599,  0.99499499,
        0.996997  ,  0.98598599,  0.98598599,  0.99199199,  0.99499499,
        0.995996  ,  0.996997  ,  0.99399399,  1.        ,  0.99399399,
        0.99499499,  0.996997  ,  0.998999  ,  0.998999  ,  0.998999  ,
        0.995996  ,  0.98798799,  0.997998  ,  0.998999  ,  0.99499499,
        0.98598599,  0.998999  ,  0.99299299,  0.995996  ,  0.998999  ,
        1.        ,  0.998999  ,  0.96796797,  0.99199199,  0.995996  ,
        0.99499499,  0.98998999,  0.97997998,  0.998999  ,  0.99499499,
        0.998999  ,  0.995996  ,  0.996997  ,  0.995996  ,  0.99199199,
        0.998999  ,  0.997998  ,  0.997998  ,  0.99299299,  0.98