### Remove pkl files to re-generate models

In [None]:
# import os
# for fname in os.listdir("./"):
#     if fname.endswith("pkl"):
#         os.remove('./'+fname)
#         print "Removing "+fname

# Data
## File Paths
Set the data paths (the training path is needed only when training data is available).<br>
Download sample data here: [train](https://drive.google.com/open?id=1HN5L6kkh9mYa7vQ_W-H9InW_kgQbFfrR), [test](https://drive.google.com/open?id=1s_P7IrmGJFN6OTLKOrQSUz8TTcyNF6fS) (Only accessible to PCORI team)

In [1]:
tr_data_file = './mhd.4.25.18_sample_tr.txt'
te_data_file = './mhd.4.25.18_sample_te.txt'

## Data Classes
Training and test data classes are slightly different since labels and vocabulary are determined only at the training step.
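
The split between the two classes can be pictured with a minimal sketch. This is not the `mhddata` implementation; the function names `build_vocabulary` and `encode` and the whitespace tokenization are hypothetical, chosen only to show a vocabulary that is fixed at training time and then reused, unchanged, on test data.

```python
def build_vocabulary(train_utterances):
    """Assign an integer index to every distinct token seen in training."""
    vocab = {}
    for utt in train_utterances:
        for token in utt.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(utterance, vocab):
    """Map an utterance to vocabulary indices, skipping unseen tokens."""
    return [vocab[t] for t in utterance.lower().split() if t in vocab]


# Toy utterances (not from the dataset).
train = ["How is the pain today", "The pain is better"]
vocab = build_vocabulary(train)                  # fixed once, at training time
encoded = encode("Is the cough better", vocab)   # "cough" is unseen and dropped
```

Because the test side never adds entries, feature indices stay aligned with whatever model was trained on the training vocabulary.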

In [5]:
from mhddata import MHDTrainData, MHDTestData

### Training data: The lines below run inside the function `.fit_model`

In [6]:
mhdtrain = MHDTrainData(tr_data_file, nouns_only=False, ignore_case=True,
                 remove_numbers=False, sub_numbers=True, stopwords_dir="./stopwordlists",
                 label_mappings=None, ngram_range=(1,1), max_np_len=2, min_wlen=1,
                 min_dfreq=0, max_dfreq=0.8, min_sfreq=20,
                 token_pattern=r"(?u)[A-Za-z\?\!\-\.']+", verbose=3,  # can control verbosity
                 corpus_pkl='./corpus.pkl', label_pkl='./label.pkl', vocab_pkl='./vocab.pkl')

Loading and preprocessing the corpus with labels
  Cleaning the corpus (removing punctuations..)
            0 utterances
         5000 utterances
        10000 utterances
        15000 utterances
        20000 utterances
        25000 utterances
        30000 utterances
        35000 utterances
        40000 utterances
        45000 utterances
        50000 utterances
        55000 utterances
        60000 utterances
        65000 utterances
        70000 utterances
        75000 utterances
        80000 utterances
        85000 utterances
        90000 utterances
  Extracting noun phrases..
Cleaning labels ..
  16 OtherAddictions --> 37 Other
  18 Death --> 37 Other
  19 Bereavement --> 37 Other
  20 PainSuffering --> 37 Other
  24 ActivityDailyLiving --> 37 Other
  26 Unemployment --> 37 Other
  27 MoneyBenefits --> 37 Other
  28 Caregiver --> 37 Other
  31 Religion --> 37 Other
  32 Age --> 37 Other
  33 LivingWillAdvanceCarePlanning --> 37 Other
  35 MDPT-Relationship --> 37 Other
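
One plausible reading of the "Cleaning labels" log is that labels occurring in too few sessions are folded into a catch-all `Other` label. The sketch below illustrates that idea under the assumption that `min_sfreq` is a minimum session frequency; the function `merge_rare_labels` and the toy labels are hypothetical and not taken from `mhddata`.

```python
from collections import Counter


def merge_rare_labels(session_labels, min_sfreq, other="Other"):
    """Map labels seen in fewer than min_sfreq sessions to a catch-all label.

    session_labels: list of label lists, one list per session.
    Returns (cleaned_session_labels, mapping), where mapping holds only the
    labels that were merged.
    """
    # Count each label once per session (session frequency, not raw count).
    sfreq = Counter()
    for labels in session_labels:
        sfreq.update(set(labels))

    mapping = {lab: other for lab, n in sfreq.items()
               if n < min_sfreq and lab != other}
    cleaned = [[mapping.get(lab, lab) for lab in labels]
               for labels in session_labels]
    return cleaned, mapping


# Toy sessions (label names are illustrative only).
sessions = [
    ["Medication", "Medication", "Religion"],
    ["Medication", "Diet"],
    ["Diet", "Medication"],
]
cleaned, mapping = merge_rare_labels(sessions, min_sfreq=2)
```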

In [7]:
mhdtrain.print_stats()  # you could always print out the stats

Number of sessions: 209 (ones that have text)
Number of sessions: 209 (ones that have labels)
Number of sessions: 209 (ones that have both text and labels)
Number of segments: 6565 (ones that have both text and labels)
Number of utterances: 92739 (ones that have both text and labels)
Number of labels that originally had: 38 (including the ones that appear in the sessions without text)
Number of labels: 25 (after cleaning the labels)
Vocabulary size: 13091
Number of user-defined stopwords: 553
Number of stopwords used in total: 553 (including the words with low dfs and high dfs)


In [None]:
sorted(list(mhdtrain.vocabulary))

### Test data: The lines below run inside the function `.predict` or `.predict_viterbi` (for HMM)

In [10]:
mhdtest = MHDTestData(te_data_file, nouns_only=False, ignore_case=True,
                 remove_numbers=False, sub_numbers=True, stopwords_dir="./stopwordlists",
                 label_mappings=None, ngram_range=(1,1), max_np_len=2, min_wlen=1,
                 min_dfreq=0, max_dfreq=0.8, min_sfreq=20,
                 token_pattern=r"(?u)[A-Za-z\?\!\-\.']+", verbose=3,
                 corpus_pkl='./corpus_te.pkl', tr_label_pkl='./label.pkl', tr_vocab_pkl='./vocab.pkl')

Loading labels file from ./label.pkl
 (Delete the file if you want to re-generate the labels)
Loading cleaned labels file from ./label_cleaned.pkl
Loading the vocabulary file from ./vocab.pkl
 (Delete the file if you want to re-generate the vocabulary)
Loading and preprocessing the corpus with labels
Loading the processed file from ./corpus_te.pkl
 (Delete the file if you want to re-process the corpus)
Cleaning labels ..
  16 OtherAddictions --> 37 Other
  18 Death --> 37 Other
  19 Bereavement --> 37 Other
  20 PainSuffering --> 37 Other
  24 ActivityDailyLiving --> 37 Other
  26 Unemployment --> 37 Other
  27 MoneyBenefits --> 37 Other
  28 Caregiver --> 37 Other
  31 Religion --> 37 Other
  32 Age --> 37 Other
  33 LivingWillAdvanceCarePlanning --> 37 Other
  35 MDPT-Relationship --> 37 Other
  5 Prognosis --> 37 Other


# Models

## 1. Logistic Regression Models
## 1.1 Train & Predict
### Train (This step is not needed if you are loading the pre-trained model instead of training)

In [2]:
from models import LogRegDialogModel

lr = LogRegDialogModel(lr_type='ovr')
lr.fit_model(tr_data_file, penalty_type="l2", reg_const=1.0,
             model_file='./lrdialog_ovr.pkl', verbose=0)  # Saves model to model_file


Saving Logistic regression model to ./lrdialog_ovr.pkl


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

### Predict
Prediction creates `lr.result`

In [3]:
lr.predict(te_data_file, verbose=1)

Loading labels file from ./label.pkl
 (Delete the file if you want to re-generate the labels)
Loading cleaned labels file from ./label_cleaned.pkl
Loading the vocabulary file from ./vocab.pkl
 (Delete the file if you want to re-generate the vocabulary)
Loading and preprocessing the corpus with labels
Loading the processed file from ./corpus_te.pkl
 (Delete the file if you want to re-process the corpus)
Cleaning labels ..
Calculating scores..




<models.DialogResult instance at 0x132536d88>

### Result scores

In [4]:
lr.result.scores

{'accuracy': 34.82142857142857,
 'auc': 0.5435482799766377,
 'auc_w': 0.5739195353237183,
 'f1score': 0.14652481268845458,
 'f1score_w': 0.29283037656735433,
 'precision': 0.38311016246092167,
 'precision_w': 0.4180309387651325,
 'recall': 0.12134763011459579,
 'recall_w': 0.3791459075340366,
 'rprecision': 0.10704220148834033,
 'rprecision_w': 0.20824991390702363}
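
The score dictionary has paired keys such as `precision` and `precision_w`. Assuming the `_w` suffix means a support-weighted average over classes and the plain key an unweighted (macro) average, which this notebook does not state explicitly, the difference can be sketched in plain Python. The helper names are hypothetical.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Precision for each class that appears in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    result = {}
    for c in classes:
        predicted_c = sum(1 for p in y_pred if p == c)
        correct_c = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        # A class that is never predicted gets precision 0 by convention.
        result[c] = correct_c / float(predicted_c) if predicted_c else 0.0
    return result


def macro_and_weighted(y_true, y_pred):
    """Return (macro average, support-weighted average) of per-class precision."""
    prec = per_class_precision(y_true, y_pred)
    support = Counter(y_true)  # number of true instances per class
    macro = sum(prec.values()) / len(prec)
    weighted = sum(prec[c] * support[c] for c in prec) / float(len(y_true))
    return macro, weighted


# Toy labels: class 0 dominates and class 1 is never predicted.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
macro, weighted = macro_and_weighted(y_true, y_pred)  # 0.6 and 0.7
```

Under this reading, a weighted score well above its macro counterpart suggests that the model does better on frequent topics than on rare ones.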

The scores can also be printed and saved to a file.

In [11]:
lr.result.print_scores(filename='./result_in_diff_metrics.csv')

model,accuracy,precision_w,recall_w,auc_w,rprecision_w,f1score_w,precision,recall,auc,rprecision,f1score
LogReg_ovr_l2_0.9 ,34.7354 ,0.4206 ,0.3784 ,0.5727 ,0.2063 ,0.2910 ,0.3826 ,0.1195 ,0.5426 ,0.1052 ,0.1440


## 1.2 Load & Predict
`lr2` loads the model trained above (this is the part we are going to release).

In [12]:
lr2 = LogRegDialogModel(lr_type='ovr')
lr2.load_model(model_file="./lrdialog_ovr.pkl")

Loading Logistic regression model to ./lrdialog_ovr.pkl


In [13]:
lr2.predict(te_data_file, verbose=0)

Calculating scores..


<models.DialogResult instance at 0x131635b00>

In [14]:
lr2.result.scores

{'accuracy': 34.735449735449734,
 'auc': 0.5425712732040822,
 'auc_w': 0.572677043648532,
 'f1score': 0.14403660616037472,
 'f1score_w': 0.2910227733934047,
 'precision': 0.38264151805597885,
 'precision_w': 0.42059459080558814,
 'recall': 0.11948464842998098,
 'recall_w': 0.3783846298101865,
 'rprecision': 0.10519838445328382,
 'rprecision_w': 0.20626242019434446}

In [15]:
# Save the predictions and output probabilities so that we can test whether the HMM runs on top of any base model that is loaded from files.
import cPickle as cp
with open('./sample_prob.pkl', 'wb') as f:
    cp.dump(lr2.result.output_prob, f)
with open('./sample_pred.pkl', 'wb') as f:
    cp.dump(lr2.result.predictions, f)
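
A quick way to confirm that such files reload intact is a round trip through `pickle`. The sketch below uses toy data and temporary paths (both hypothetical) and the standard-library `pickle` module, which exists on both Python 2 and Python 3, whereas `cPickle` is Python 2 only.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-ins for the real results: two sessions, three topics.
toy_probs = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [[0.3, 0.3, 0.4]]]
toy_preds = [[0, 1], [2]]

tmpdir = tempfile.mkdtemp()
prob_path = os.path.join(tmpdir, "toy_prob.pkl")
pred_path = os.path.join(tmpdir, "toy_pred.pkl")
save_pickle(toy_probs, prob_path)
save_pickle(toy_preds, pred_path)

loaded_probs = load_pickle(prob_path)
loaded_preds = load_pickle(pred_path)
```

If a file written under Python 2 is later read under Python 3, `pickle.load` may need an explicit `encoding` argument (for example `encoding='latin1'` for NumPy arrays).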

### (Run GridSearch CV)
You can run cross-validation to find the best regularization constant `C`.

In [5]:
import numpy as np
lr.grid_search_parameter(tr_data_file, C_values=np.arange(0.5, 2, 0.5), n_fold=3, verbose=2)

Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] C=0.5 ...........................................................
[CV] ............................................ C=0.5, total=   8.6s
[CV] C=0.5 ...........................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.7s remaining:    0.0s


[CV] ............................................ C=0.5, total=   8.8s
[CV] C=0.5 ...........................................................
[CV] ............................................ C=0.5, total=   9.9s
[CV] C=1.0 ...........................................................
[CV] ............................................ C=1.0, total=  12.1s
[CV] C=1.0 ...........................................................
[CV] ............................................ C=1.0, total=   9.9s
[CV] C=1.0 ...........................................................
[CV] ............................................ C=1.0, total=  10.6s
[CV] C=1.5 ...........................................................
[CV] ............................................ C=1.5, total=  11.8s
[CV] C=1.5 ...........................................................
[CV] ............................................ C=1.5, total=  11.4s
[CV] C=1.5 ...........................................................
[CV] .

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.6min finished


Best regularization constant: 1.50


{'C': 1.5}
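
The log reports 3 candidate values of `C` times 3 folds, for 9 fits in total. The loop behind such a search can be sketched in plain Python as follows. This is a generic illustration, not the code inside `grid_search_parameter`; `kfold_indices`, `grid_search`, and the dummy scorer are hypothetical names.

```python
def kfold_indices(n_items, n_fold):
    """Split range(n_items) into n_fold contiguous (train, test) index pairs."""
    fold_sizes = [n_items // n_fold + (1 if i < n_items % n_fold else 0)
                  for i in range(n_fold)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        splits.append((train, test))
        start += size
    return splits


def grid_search(c_values, n_items, n_fold, fit_and_score):
    """Return (best_C, mean_scores), where mean_scores maps C to its mean CV score.

    fit_and_score(C, train_idx, test_idx) is supplied by the caller: it trains
    on train_idx with constant C and returns a score measured on test_idx.
    """
    mean_scores = {}
    for c in c_values:
        scores = [fit_and_score(c, tr, te)
                  for tr, te in kfold_indices(n_items, n_fold)]
        mean_scores[c] = sum(scores) / len(scores)
    best_c = max(mean_scores, key=mean_scores.get)
    return best_c, mean_scores


# Dummy scorer that peaks at C = 1.5, used only to exercise the loop.
calls = []


def dummy_score(c, train_idx, test_idx):
    calls.append((c, len(train_idx), len(test_idx)))
    return 1.0 - abs(c - 1.5)


best_c, mean_scores = grid_search([0.5, 1.0, 1.5], n_items=10, n_fold=3,
                                  fit_and_score=dummy_score)
```

In practice the scorer would fit the logistic regression on the training folds and evaluate it on the held-out fold.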

## 2. HMM on top of LR
Running the HMM requires a `base_model`, which must be trained in advance and passed as an argument.

In [6]:
from models import HMMDialogModel

hmmlr = HMMDialogModel(base_model=lr)
hmmlr.fit_model(tr_data_file, model_file='hmmdialog_lrovr.pkl')

Loading and preprocessing the corpus with labels
Loading the processed file from ./corpus.pkl
 (Delete the file if you want to re-process the corpus)
Loading the vocabulary file from ./vocab.pkl
 (Delete the file if you want to re-generate the vocabulary)
Loading labels file from ./label.pkl
 (Delete the file if you want to re-generate the labels)
Cleaning labels ..
Getting lists of valid session/utterance IDs that have both text and labels
Saving model to hmmdialog_lrovr.pkl


In [7]:
hmmlr.predict_viterbi(te_data_file)

Loading labels file from ./label.pkl
 (Delete the file if you want to re-generate the labels)
Loading cleaned labels file from ./label_cleaned.pkl
Loading the vocabulary file from ./vocab.pkl
 (Delete the file if you want to re-generate the vocabulary)
Loading and preprocessing the corpus with labels
Loading the processed file from ./corpus_te.pkl
 (Delete the file if you want to re-process the corpus)
Cleaning labels ..
Calculating scores..


<models.DialogResult instance at 0x1439fb9e0>

In [8]:
hmmlr.result.scores

{'accuracy': 61.07804232804233,
 'auc': 0.7450265747286064,
 'auc_w': 0.7891203038559265,
 'f1score': 0.4746712926281416,
 'f1score_w': 0.6121240696688772,
 'precision': 0.4890888276638108,
 'precision_w': 0.6294373210326876,
 'recall': 0.5070740452021653,
 'recall_w': 0.6164666637512746,
 'rprecision': 0.4191670434480562,
 'rprecision_w': 0.5650249909207642}
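
The idea behind Viterbi decoding on top of a base classifier can be shown with a small self-contained sketch. It is not the implementation in `models`. It assumes that the base model's per-utterance topic probabilities are used directly as emission scores and that a transition matrix and initial distribution over topics are already available; the function name and the toy numbers are hypothetical.

```python
import math


def viterbi(emission_probs, transition, initial):
    """Most likely topic sequence for one session.

    emission_probs: list of N rows, each a list of T per-topic scores for
                    one utterance (here the base model's p(topic|utterance)).
    transition:     T x T list, transition[i][j] = p(topic j | previous topic i).
    initial:        list of T probabilities for the first utterance's topic.
    """
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    n_topics = len(initial)
    # delta[j]: best log-score of any path ending in topic j at the current step.
    delta = [log(initial[j]) + log(emission_probs[0][j]) for j in range(n_topics)]
    backpointers = []
    for row in emission_probs[1:]:
        new_delta, pointers = [], []
        for j in range(n_topics):
            best_i = max(range(n_topics),
                         key=lambda i: delta[i] + log(transition[i][j]))
            pointers.append(best_i)
            new_delta.append(delta[best_i] + log(transition[best_i][j]) + log(row[j]))
        delta = new_delta
        backpointers.append(pointers)

    # Trace back from the best final topic.
    path = [max(range(n_topics), key=lambda j: delta[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Two topics with "sticky" transitions; the middle utterance is ambiguous.
probs = [[0.8, 0.2], [0.45, 0.55], [0.9, 0.1]]
trans = [[0.9, 0.1], [0.1, 0.9]]
init = [0.5, 0.5]
independent = [max(range(2), key=lambda j: row[j]) for row in probs]  # [0, 1, 0]
smoothed = viterbi(probs, trans, init)                                # [0, 0, 0]
```

The per-utterance argmax flips to topic 1 on the ambiguous middle utterance, while the decoded sequence stays on topic 0 because switching topics twice is penalized by the transition matrix. This kind of smoothing is one way sequence information can raise accuracy over an independent classifier.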

## 3. HMM on top of other output probabilities

If we have results from another, independently trained base model (e.g. output from an RNN), <br>
we can load its predictions and output probabilities and plug them into the HMM. <br>
They must be computed on the same data as `mhdtest`.
- `predictions`: A list of sessions, where each session is a list of predicted labels, one per utterance in that session. <br> Type: `list[list[int]]` or `list[np.array[int]]`
- `output_probs`: A list of sessions, where each session is a 2-d array of size `(N,T)`, where `N` is the number of utterances in the session and `T` is the number of topics (labels). Each entry is $p(topic|utterance)$. <br> Type: `list[ 2-d np.array[float] ]`.
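
Before pickling results from an external model, it can help to check that the two structures line up session by session. The sketch below uses nested lists in place of NumPy arrays and a hypothetical helper `check_base_outputs`. Going by the names and by the earlier cell that writes `result.output_prob` to `sample_prob.pkl`, the probability arrays are what is passed as `output_probs` and the label lists as `predictions`.

```python
def argmax(row):
    """Index of the largest value in a sequence."""
    return max(range(len(row)), key=lambda j: row[j])


def check_base_outputs(probs, preds, n_topics):
    """Raise ValueError if probabilities and predictions do not line up."""
    if len(probs) != len(preds):
        raise ValueError("different number of sessions")
    for s, (session_probs, session_preds) in enumerate(zip(probs, preds)):
        if len(session_probs) != len(session_preds):
            raise ValueError("session %d: utterance counts differ" % s)
        for row in session_probs:
            if len(row) != n_topics:
                raise ValueError("session %d: expected %d topics" % (s, n_topics))
    return True


# Two sessions with 2 and 1 utterances, T = 3 topics (toy numbers).
probs = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [[0.25, 0.25, 0.5]]]
# Predicted label per utterance, derived here as the most probable topic.
preds = [[argmax(row) for row in session] for session in probs]
ok = check_base_outputs(probs, preds, n_topics=3)
```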


After loading predictions and probabilities, the base model object should have the following attributes,
and it can then be passed as an argument to `HMMDialogModel`:
- `base_model.result`
- `base_model.result.output_prob`
- `base_model.model_info`

In [19]:
from models import DialogModel, HMMDialogModel

In [20]:
predfile = './sample_pred.pkl'
outprobfile = './sample_prob.pkl'

The results here are not from an RNN, but suppose we have loaded them from an RNN model.

In [21]:
rnn = DialogModel()
rnn.load_results(te_data_file, model_info="RNN", marginals=None, predictions=predfile, output_probs=outprobfile)

Loading labels file from ./label.pkl
 (Delete the file if you want to re-generate the labels)
Loading cleaned labels file from ./label_cleaned.pkl
Loading the vocabulary file from ./vocab.pkl
 (Delete the file if you want to re-generate the vocabulary)
Loading and preprocessing the corpus with labels
Loading the processed file from ./corpus_te.pkl
 (Delete the file if you want to re-process the corpus)
Cleaning labels ..
Calculating scores..


<models.DialogResult instance at 0x12876a710>

In [22]:
hmmrnn = HMMDialogModel(base_model=rnn)
hmmrnn.fit_model(tr_data_file)

Loading and preprocessing the corpus with labels
Loading the processed file from ./corpus.pkl
 (Delete the file if you want to re-process the corpus)
Loading the vocabulary file from ./vocab.pkl
 (Delete the file if you want to re-generate the vocabulary)
Loading labels file from ./label.pkl
 (Delete the file if you want to re-generate the labels)
Cleaning labels ..
Getting lists of valid session/utterance IDs that have both text and labels
Saving model to ./hmmdialog.pkl


In [23]:
hmmrnn.predict_viterbi(te_data_file)

Loading labels file from ./label.pkl
 (Delete the file if you want to re-generate the labels)
Loading cleaned labels file from ./label_cleaned.pkl
Loading the vocabulary file from ./vocab.pkl
 (Delete the file if you want to re-generate the vocabulary)
Loading and preprocessing the corpus with labels
Loading the processed file from ./corpus_te.pkl
 (Delete the file if you want to re-process the corpus)
Cleaning labels ..
Calculating scores..


<models.DialogResult instance at 0x1662f5ea8>

In this case we should get the same result as in Section 2, since we've loaded the same results from LR.

In [24]:
hmmlr.result.scores

{'accuracy': 60.826719576719576,
 'auc': 0.7386513157025536,
 'auc_w': 0.7880021417563148,
 'f1score': 0.4611548266387942,
 'f1score_w': 0.6094283413948923,
 'precision': 0.47253030001967306,
 'precision_w': 0.6258756621745318,
 'recall': 0.4944229819272277,
 'recall_w': 0.614199668112505,
 'rprecision': 0.4151477738181185,
 'rprecision_w': 0.5668641644422592}