# NOTE: This notebook will not be released to the public
This is the internal version. A modified version will be released to the public.

### Remove pkl files to re-generate models

In [None]:
# import os
# for fname in os.listdir("./"):
#     if fname.endswith("pkl"):
#         os.remove('./'+fname)
#         print("Removing " + fname)

# Data
## File Paths
Set the data paths (the training path is needed only when training data is available).<br>
Download sample data here: [train](https://drive.google.com/open?id=1HN5L6kkh9mYa7vQ_W-H9InW_kgQbFfrR), [test](https://drive.google.com/open?id=1s_P7IrmGJFN6OTLKOrQSUz8TTcyNF6fS) (accessible only to the PCORI team)

In [None]:
tr_data_file = './mhd.4.25.18_sample_tr.txt'
te_data_file = './mhd.4.25.18_sample_te.txt'

# Data Classes
Data preprocessing is done at initialization, when the data objects are created.<br>
The training and test data classes differ slightly, since the labels and vocabulary are determined only from the training data. <br>
An object of class `MHDTrainData` should be passed as an argument to the `.fit_model` function, <br>
and an object of class `MHDTestData` should be passed to the `.predict_*` function of each model.
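
To illustrate why the vocabulary is fixed at training time, here is a minimal sketch in plain Python. It does not use `mhddata`; the helper names `fit_vocabulary` and `to_bag_of_words` and the toy documents are made up for illustration, and the real classes may preprocess quite differently.

```python
from collections import Counter


def fit_vocabulary(train_docs, min_count=1):
    """Build a word -> index map from the training documents only."""
    counts = Counter(tok for doc in train_docs for tok in doc)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(kept)}


def to_bag_of_words(doc, vocab):
    """Count in-vocabulary tokens; words unseen at training time are dropped."""
    vec = [0] * len(vocab)
    for tok in doc:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec


# Hypothetical, already-tokenized utterances
train_docs = [["chest", "pain", "today"], ["pain", "medication"]]
test_docs = [["pain", "insurance", "medication"]]

vocab = fit_vocabulary(train_docs)
test_vectors = [to_bag_of_words(d, vocab) for d in test_docs]
```

In this sketch the test word `insurance` is ignored because it never appeared in training, so the test vectors have the same columns the model was trained on.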

In [None]:
from mhddata import MHDTrainData, MHDTestData

### Training data

In [None]:
mhdtrain = MHDTrainData(tr_data_file, nouns_only=False, ignore_case=True,
                 remove_numbers=False, sub_numbers=True, stopwords_dir="./stopwordlists",
                 label_mappings=None, ngram_range=(1,1), max_np_len=2, min_wlen=1,
                 min_dfreq=0, max_dfreq=0.8, min_sfreq=20,
                 token_pattern=r"(?u)[A-Za-z\?\!\-\.']+", verbose=3,  # can control verbosity
                 corpus_pkl='./corpus.pkl', label_pkl='./label.pkl', vocab_pkl='./vocab.pkl')

In [None]:
mhdtrain.print_stats()  # you could always print out the stats

In [None]:
sorted(list(mhdtrain.vocabulary))

### Test data

In [None]:
mhdtest = MHDTestData(te_data_file, nouns_only=False, ignore_case=True,
                 remove_numbers=False, sub_numbers=True, proper_nouns_dir="./stopwordlists",
                 min_wlen=1, token_pattern=r"(?u)[A-Za-z\?\!\-\.']+", verbose=3, reload_corpus=True,
                 corpus_pkl='./corpus_te.pkl', tr_label_pkl='./label.pkl', tr_vocab_pkl='./vocab.pkl')

# Models

## 1. Logistic Regression Models
Create an object of class `LogRegDialogModel`.<br>
`lr_type` can be either `ovr` for a one-vs-rest model or `multinomial` for a multinomial model.
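
The difference between the two options can be sketched as follows, assuming the usual textbook definitions: one-vs-rest applies an independent sigmoid to each class score and then renormalizes, while the multinomial model applies a single softmax. The scores are made up, and this is not taken from `LogRegDialogModel` itself.

```python
import math


def ovr_probabilities(scores):
    """One-vs-rest: an independent sigmoid per class, then renormalize."""
    sig = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sig)
    return [p / total for p in sig]


def multinomial_probabilities(scores):
    """Multinomial: a single softmax over all class scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 0.5, -1.0]  # made-up linear scores for three topics
p_ovr = ovr_probabilities(scores)
p_multi = multinomial_probabilities(scores)
```

Both variants rank the topics identically for a given score vector, but the probabilities differ, which matters here because the HMM in section 2 consumes the output probabilities rather than only the predicted labels.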

In [None]:
from models import LogRegDialogModel
lr = LogRegDialogModel(lr_type='ovr')

## 1.1 Train & Predict
### Train (This step is not needed if you're loading the pre-trained model)
1) Trains an LR model using the training data. `lr.model` is created.<br>
2) Saves the model to a pickle file.

In [None]:
lr.fit_model(mhdtrain, penalty_type="l2", reg_const=1.0,
             model_file='./lrdialog_ovr.pkl', verbose=1) 

### Predict
1) Plug in the test data for prediction. `lr.predict()` uses `lr.model` to predict on the test data. <br>
2) Prediction creates the `lr.result` object. It also writes utterance-level results to the file `output_filename`.

In [None]:
lr.predict(mhdtest, verbose=1, output_filename='./utter_level_results_lrovr.txt')

### Result scores

In [None]:
lr.result.scores

The scores can also be printed and saved to a file.

In [None]:
lr.result.print_scores(filename='./result_in_diff_metrics_lrovr.csv')

## 1.2 Load & Predict
`lr2` loads the model that was trained above (this is the part that we're going to release).

In [None]:
lr2 = LogRegDialogModel(lr_type='ovr')
lr2.load_model(model_file="./lrdialog_ovr.pkl")

In [None]:
lr2.predict(mhdtest, verbose=1, output_filename='./utter_level_results_lrovr2.txt')

The results should be the same as above, since we used the same model and the same data.

In [None]:
lr2.result.scores

### Save the output probability and predictions to pkl files.
The code below only tests whether the HMM runs correctly on top of any base model when the predictions and output probabilities are loaded from files.


In [None]:
try:
    import cPickle as cp  # Python 2
except ImportError:
    import pickle as cp  # Python 3 has no cPickle module
with open('./sample_prob.pkl', 'wb') as f:
    cp.dump(lr2.result.output_prob, f)
with open('./sample_pred.pkl', 'wb') as f:
    cp.dump(lr2.result.predictions, f)

### (Optional) Run GridSearch CV to find the best parameter `C`
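Cross-validation over a grid of candidate values can be used to choose `C`. The cell below tries `C` = 0.5, 1.0 and 1.5 with 3-fold cross-validation (`np.arange` excludes the end value 2).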

In [None]:
import numpy as np
lr.grid_search_parameter(mhdtrain, C_values=np.arange(0.5, 2, 0.5),
                          penalty_type="l2", solver='lbfgs',
                          n_fold=3, verbose=2)

## 2. HMM on top of LR
Running the HMM requires a **`base_model`** object that has already been trained and used for prediction; it is passed in as an argument. <br>
The object must have a `.result` field, since the HMM uses the output probabilities from that model.
<br>Here we use the logistic regression model that was trained and used for prediction above.<br>
**NOTE: The base model and the HMM should be trained on the same data!**
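
As a rough illustration of how output probabilities from a base model can be combined with an HMM, the sketch below runs Viterbi decoding over one session. It assumes the base-model probabilities are used directly as emission scores; `HMMDialogModel` may combine them differently (for example, by dividing by label priors), and the function name and all numbers here are hypothetical.

```python
import math


def viterbi(emission_probs, start, trans, end):
    """Return the most likely label sequence for one session.

    emission_probs: N rows, each holding T base-model probabilities
    start, end: length-T lists; trans[i][j] = p(next label j | label i)
    """
    n_labels = len(start)

    def log(p):
        return math.log(max(p, 1e-12))  # guard against log(0)

    # best[t][j]: best log-score of any path ending in label j at utterance t
    best = [[log(start[j]) + log(emission_probs[0][j]) for j in range(n_labels)]]
    back = []
    for row in emission_probs[1:]:
        prev = best[-1]
        scores, pointers = [], []
        for j in range(n_labels):
            i_best = max(range(n_labels), key=lambda i: prev[i] + log(trans[i][j]))
            scores.append(prev[i_best] + log(trans[i_best][j]) + log(row[j]))
            pointers.append(i_best)
        best.append(scores)
        back.append(pointers)

    # add the ending probability, then follow the back-pointers
    last = max(range(n_labels), key=lambda j: best[-1][j] + log(end[j]))
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up two-topic example in which topics rarely switch
start, end = [0.6, 0.4], [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
probs = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]

independent = [max(range(2), key=lambda j: row[j]) for row in probs]
smoothed = viterbi(probs, start, trans, end)
```

In this toy session the independent per-utterance prediction flips to topic 1 in the middle utterance, while the decoded sequence stays on topic 0 because a switch is unlikely under the transition probabilities.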

In [None]:
from models import HMMDialogModel
hmmlr = HMMDialogModel(base_model=lr2)

The HMM pickle file stores the transition probabilities as well as the start and end probabilities.<br>
You can also load a pre-trained model if one is available (see the commented-out cell below).
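
One simple way to obtain such parameters is to count label occurrences in the training sessions. The sketch below is an assumption about the general approach, not the code in `models`: the function name, the add-one smoothing, and the choice to treat the end probabilities as a distribution over the final label are all illustrative.

```python
def estimate_hmm_parameters(label_sequences, n_labels, smoothing=1.0):
    """Count-based estimates of (start, trans, end) with additive smoothing.

    label_sequences: one list of integer labels per session.
    Every returned distribution sums to 1.
    """
    start = [smoothing] * n_labels
    end = [smoothing] * n_labels
    trans = [[smoothing] * n_labels for _ in range(n_labels)]

    for seq in label_sequences:
        if not seq:
            continue
        start[seq[0]] += 1
        end[seq[-1]] += 1
        for prev, cur in zip(seq, seq[1:]):
            trans[prev][cur] += 1

    def normalize(counts):
        total = float(sum(counts))
        return [c / total for c in counts]

    return normalize(start), [normalize(row) for row in trans], normalize(end)


# Two made-up sessions over three topics
sessions = [[0, 0, 1, 1], [0, 1, 1, 2]]
start, trans, end = estimate_hmm_parameters(sessions, n_labels=3)
```

Smoothing keeps unseen transitions from getting probability zero, which would otherwise rule out those label sequences entirely during decoding.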

In [None]:
hmmlr.fit_model(mhdtrain, model_file='hmmdialog.pkl', verbose=1)

In [None]:
# hmmlr.load_model(model_file='hmmdialog.pkl')

In [None]:
hmmlr.predict_viterbi(mhdtest, output_filename='./utter_level_result_hmmlrovr.txt')

In [None]:
hmmlr.result.scores

## 3. HMM on top of other output probabilities

If we have a set of results from another, independently trained base model (e.g. output from an RNN), <br>
we can load the predictions and output probabilities and plug them into the HMM. <br>
They must be the results for the same data as `mhdtest`.
- `predictions`: A list of sessions, where each session is a list of the predicted labels for the utterances in that session. <br> Type: `list[list[int]]` or `list[np.array[int]]`
- `output_probs`: A list of sessions, where each session is a 2-d array of size `(N,T)`, where `N` is the number of utterances in the session and `T` is the number of topics (labels). Each entry is $p(topic|utterance)$. <br> Type: `list[ 2-d np.array[float] ]`.
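
Before plugging externally produced results into the HMM, it can help to sanity-check the two structures: one holding integer label indices per utterance and one holding `(N, T)` probability rows. The checker below is a hypothetical helper, not part of `models`; it assumes labels are integer indices in `[0, T)` and that each probability row sums to 1.

```python
def check_base_model_outputs(label_seqs, prob_seqs, n_labels, tol=1e-6):
    """Raise ValueError if the two per-session structures disagree.

    label_seqs: one sequence of integer label indices per session
    prob_seqs:  one sequence of N rows per session, each row with T probabilities
    """
    if len(label_seqs) != len(prob_seqs):
        raise ValueError("different number of sessions")
    for s, (labels, probs) in enumerate(zip(label_seqs, prob_seqs)):
        if len(labels) != len(probs):
            raise ValueError("session %d: utterance counts differ" % s)
        for row in probs:
            if len(row) != n_labels:
                raise ValueError("session %d: expected %d columns" % (s, n_labels))
            if abs(sum(row) - 1.0) > tol:
                raise ValueError("session %d: row does not sum to 1" % s)
        for label in labels:
            if not 0 <= label < n_labels:
                raise ValueError("session %d: label %r out of range" % (s, label))
    return True


# Made-up results: two sessions, three topics
label_seqs = [[0, 2], [1]]
prob_seqs = [[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]], [[0.2, 0.5, 0.3]]]
ok = check_base_model_outputs(label_seqs, prob_seqs, n_labels=3)
```

The same function accepts NumPy arrays in place of the nested lists, since it relies only on `len`, iteration and `sum`.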


After loading the predictions and probabilities, the base model object should have the following fields,
and it can then be passed as an argument to `HMMDialogModel`:
- base_model.result
- base_model.result.output_prob
- base_model.model_info

In [None]:
from models import DialogModel, HMMDialogModel

In [None]:
# use the pkl files that we saved above
predfile = './sample_pred.pkl'
outprobfile = './sample_prob.pkl'

These results do not actually come from an RNN, but for this test we treat them as if they had been loaded from an RNN model.

In [None]:
rnn = DialogModel()
rnn.load_results(mhdtest, model_info="RNN", marginals=None, predictions=predfile, output_probs=outprobfile)

In [None]:
hmmrnn = HMMDialogModel(base_model=rnn)
hmmrnn.load_model(model_file='hmmdialog.pkl')

In [None]:
hmmrnn.predict_viterbi(mhdtest, output_filename='./utter_level_result_fake_rnn.txt')

In this case the result should be the same as in section 2, since we loaded the same LR results.

In [None]:
hmmrnn.result.scores