# Load Model and Predict

This tutorial shows how to load the pre-trained models (logistic regression and hidden Markov model) and predict utterance-level topic labels using the packages in this repository. A test data file is required to run prediction.


## Requirements
Currently our package supports `Python 2` with the following packages. `Python 3` will be supported in the near future.
- `numpy`
- `nltk`
- `pandas`
- `sklearn`
- `csv` (standard library)
- `cPickle` (standard library in Python 2)
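Before running the cells, it can help to confirm that the listed modules are importable. The snippet below is a small sketch for that purpose; the helper name `missing_modules` is hypothetical and not part of this repository.

```python
import importlib
import sys

# Modules listed in the requirements above.
REQUIRED = ["numpy", "nltk", "pandas", "sklearn", "csv"]
# cPickle exists only in Python 2; Python 3 ships the same functionality as pickle.
REQUIRED.append("cPickle" if sys.version_info[0] == 2 else "pickle")


def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("Missing modules:", missing_modules(REQUIRED))
```

An empty list means every requirement can be imported in the current environment.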

---

# Data
## File Paths
Set the path to the test data.

In [None]:
te_data_file = './data/sample_test_data.txt'

# Data Classes
Data preprocessing is done at initialization, when the data classes are created.<br>
The training and test data classes differ slightly, because labels and vocabulary are determined only at the training step.

An object of class `MHDTrainData` is passed as an argument to the `.fit_model` function, <br>
and an object of class `MHDTestData` is passed to the `.predict_*` function of each model.

Since we are going to load pre-trained models, we only load the test data, using `MHDTestData`. <br>
When loading is finished, the pre-processed test data is saved to the file given by the `corpus_pkl` argument. <br>
Saving the preprocessed data to the `corpus_pkl` file saves time when the same file is loaded again. <br> 
Loading the test data corpus from the pickle file can be disabled by setting the argument `reload_corpus` to `True`.<br>
The labels and vocabulary from the training data are loaded as well. <br>
Those files are already available in this repository as `label.pkl` and `vocab.pkl`.
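The caching behaviour described above follows a common load-or-build pattern. The sketch below illustrates that pattern only; `load_or_build`, `preprocess`, and the `reload` flag are hypothetical names, and the actual logic inside `MHDTestData` may differ.

```python
import os
import pickle
import tempfile


def load_or_build(pkl_path, build_fn, reload=False):
    """Return cached data from pkl_path, or build it and write the cache.

    reload=True skips the cache and rebuilds, mirroring the role that
    `reload_corpus=True` is described as playing above (an assumption).
    """
    if not reload and os.path.exists(pkl_path):
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
    data = build_fn()
    with open(pkl_path, "wb") as f:
        pickle.dump(data, f)
    return data


def preprocess():
    # Stand-in for the expensive preprocessing step.
    return {"tokens": [["hello", "doctor"], ["how", "are", "you"]]}


cache_path = os.path.join(tempfile.mkdtemp(), "corpus_te.pkl")
first = load_or_build(cache_path, preprocess)               # builds and writes the pickle
second = load_or_build(cache_path, preprocess)              # served from the pickle
fresh = load_or_build(cache_path, preprocess, reload=True)  # ignores the pickle, rebuilds
```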

In [None]:
from mhddata import MHDTestData

In [None]:
mhdtest = MHDTestData(te_data_file, nouns_only=False, ignore_case=True,
                 remove_numbers=False, sub_numbers=True, proper_nouns_dir="./stopwordlists",
                 min_wlen=1, token_pattern=r"(?u)[A-Za-z\?\!\-\.']+", verbose=3, 
                 reload_corpus=True, corpus_pkl='./data/corpus_te.pkl', 
                 tr_label_pkl='./data/label.pkl', tr_vocab_pkl='./data/vocab.pkl')
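To see what the `token_pattern` above matches, the following sketch applies it with `re.findall`. This assumes the pattern is used the way scikit-learn vectorizers use their `token_pattern`, which is not confirmed here; the `tokenize` helper is hypothetical and does not reproduce the other preprocessing options (such as `sub_numbers`).

```python
import re

# Same pattern as in the MHDTestData call above.
TOKEN_PATTERN = r"(?u)[A-Za-z\?\!\-\.']+"


def tokenize(text, ignore_case=True):
    """Split text into tokens made of letters and the characters ? ! - . '"""
    if ignore_case:
        text = text.lower()
    return re.findall(TOKEN_PATTERN, text)


print(tokenize("How are you today? I'm fine, thanks."))
# ['how', 'are', 'you', 'today?', "i'm", 'fine', 'thanks.']
```

Note that `?`, `!`, `-`, `.` and `'` stay attached to the neighbouring letters, while commas and digits are not matched by the pattern itself.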

# Models

Since we are loading the pre-trained models, we only talk about **loading** the model, **not training**.

## 1. Logistic Regression Models
Load the pre-trained model from `./model/lrdialog_ovr.pkl`.

In [None]:
from models import LogRegDialogModel

lr = LogRegDialogModel(lr_type='ovr')
lr.load_model(model_file="./model/lrdialog_ovr.pkl")

Now run prediction using the loaded model with the loaded test data. <br>
Utterance-level results will be saved to an output file.

In [None]:
lr.predict(mhdtest, verbose=1, output_filename="./utter_level_results_lrovr.txt")

Inspect the scores.

In [None]:
lr.result.scores

The scores can also be printed as CSV and saved to a file.

In [None]:
lr.result.print_scores(filename='./result_in_diff_metrics.csv')

### Save the output probabilities and predictions to pkl files (used in part 3)
An HMM can be run on top of any base model by loading that model's predictions and output probabilities. <br>
To test this case, we save the output probabilities and predictions from above (the logistic regression results). <br>
We can pretend that these results come from a recurrent neural network (RNN), for example.<br>
These files will be loaded later, in part 3.

In [None]:
predfile = './fake_rnn_pred.pkl'
outprobfile = './fake_rnn_prob.pkl'

import cPickle as cp
with open(outprobfile, 'wb') as f:
    cp.dump(lr.result.output_prob, f)
with open(predfile, 'wb') as f:
    cp.dump(lr.result.predictions, f)

## 2. HMM on top of LR
Running the HMM requires a **`base_model`** object, which must already have been trained and used for prediction, passed as an argument. <br>
The object must have a `.result` field, since the HMM uses the output probabilities from the base model. 
<br>Here we use the logistic regression model that was loaded and used for prediction above.<br>
**NOTE: The base model and the HMM should share the same train and test data!**

In [None]:
from models import HMMDialogModel
hmmlr = HMMDialogModel(base_model=lr)  # lr: logistic regression model from the previous part.

Load the model. The HMM pickle file contains the transition probabilities as well as the start and end probabilities.

In [None]:
hmmlr.load_model(model_file='./model/hmmdialog.pkl')

Predict the output labels using the HMM with Viterbi decoding. <br>
This also writes the utterance-level results to a file.

In [None]:
hmmlr.predict_viterbi(mhdtest, output_filename="./utter_level_results_hmm_lrovr.txt")

In [None]:
hmmlr.result.scores
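To illustrate what Viterbi decoding does with transition, start, and end probabilities, here is a minimal, generic sketch in plain Python. It is not the repository's implementation: the function name `viterbi`, the argument layout, and the choice to use the base model's output probabilities directly as per-utterance scores are all assumptions made for illustration.

```python
import math


def viterbi(emission, transition, start, end):
    """Return the most likely label sequence for one session.

    emission[t][k]   : score of label k at utterance t (here: p(topic | utterance))
    transition[j][k] : probability of moving from label j to label k
    start[k], end[k] : probability of starting / ending in label k
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_steps, n_labels = len(emission), len(start)
    # best[k]: best log-score of any path that ends in label k at the current step
    best = [log(start[k]) + log(emission[0][k]) for k in range(n_labels)]
    backptr = []
    for t in range(1, n_steps):
        new_best, ptr = [], []
        for k in range(n_labels):
            scores = [best[j] + log(transition[j][k]) for j in range(n_labels)]
            j_star = max(range(n_labels), key=lambda j: scores[j])
            new_best.append(scores[j_star] + log(emission[t][k]))
            ptr.append(j_star)
        best = new_best
        backptr.append(ptr)
    # Add the end probabilities, then trace the best path backwards.
    final = [best[k] + log(end[k]) for k in range(n_labels)]
    path = [max(range(n_labels), key=lambda k: final[k])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy session: 3 utterances, 2 topics, "sticky" transitions.
emission = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
transition = [[0.9, 0.1], [0.1, 0.9]]
decoded = viterbi(emission, transition, start=[0.6, 0.4], end=[0.5, 0.5])
independent = [max(range(2), key=lambda k: row[k]) for row in emission]
print(independent, decoded)  # [0, 1, 0] [0, 0, 0]
```

In the toy session, labelling each utterance independently flips to topic 1 for the middle utterance, while the sequence model keeps topic 0 throughout because the transition probabilities favour staying in the same topic.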

## 3. HMM on top of other output probabilities

If we have results from another base model (an independent model) trained elsewhere (e.g. the output of an RNN), <br>
we can load its predictions and output probabilities and plug them into the HMM. <br>
They must be computed on the same data as `mhdtest`.
- `predictions`: A list of sessions, where each session is a list of the predicted labels for the utterances in that session. <br> Type: `list[list[int]]` or `list[np.array[int]]`
- `output_probs`: A list of sessions, where each session is a 2-d array of size `(N,T)`, where `N` is the number of utterances in the session and `T` is the number of topics (labels). Each entry is $p(topic|utterance)$ for one utterance in the session. <br> Type: `list[ 2-d np.array[float] ]`.
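A quick shape check can catch mismatched inputs before they reach the HMM. The sketch below assumes that the `(N,T)` probability arrays are what `output_probs` carries and that the integer label lists are what `predictions` carries, matching how the two files are saved at the end of part 1. The helper `check_session_outputs` is hypothetical, and it also assumes labels are integer indices in `0..T-1`. Nested lists stand in for NumPy arrays here; the same checks work on arrays.

```python
def check_session_outputs(output_probs, predictions, n_topics):
    """Raise ValueError if the per-session structures are inconsistent."""
    if len(output_probs) != len(predictions):
        raise ValueError("different number of sessions")
    for s, (probs, labels) in enumerate(zip(output_probs, predictions)):
        if len(probs) != len(labels):
            raise ValueError("session %d: utterance counts differ" % s)
        for row, label in zip(probs, labels):
            if len(row) != n_topics:
                raise ValueError("session %d: expected %d topics" % (s, n_topics))
            if any(p < 0.0 or p > 1.0 for p in row):
                raise ValueError("session %d: probability outside [0, 1]" % s)
            if not 0 <= int(label) < n_topics:
                raise ValueError("session %d: label out of range" % s)
    return True


# Toy data: two sessions (2 and 3 utterances), T = 3 topics.
toy_probs = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.3, 0.3, 0.4], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
]
toy_preds = [[0, 1], [2, 0, 1]]
check_session_outputs(toy_probs, toy_preds, n_topics=3)
```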


After loading the predictions and probabilities, the base model object should have the following fields,
so that it can be passed as the `base_model` argument to `HMMDialogModel`:
- `base_model.result`
- `base_model.result.output_prob`
- `base_model.model_info`

In [None]:
from models import DialogModel, HMMDialogModel

We will use the files that we saved at the end of part 1. <br>
Remember these are actually from the Logistic Regression classifier.

In [None]:
predfile = './fake_rnn_pred.pkl'
outprobfile = './fake_rnn_prob.pkl'

The results are not from an RNN, but let's pretend we have loaded them from an RNN model.

In [None]:
rnn = DialogModel()
rnn.load_results(mhdtest, model_info="RNN", marginals=None, 
                 predictions=predfile, output_probs=outprobfile)

Load the HMM pickle again and predict.

In [None]:
hmmrnn = HMMDialogModel(base_model=rnn)
hmmrnn.load_model(model_file='./model/hmmdialog.pkl')

In [None]:
hmmrnn.predict_viterbi(mhdtest, output_filename="./utter_level_results_fake_hmm_rnn.txt")

In this case we should get the same result as in section 2, since the loaded results are the LR results.

In [None]:
hmmrnn.result.scores  # compare with hmmlr.result.scores from section 2