<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Initialization" data-toc-modified-id="Initialization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Initialization</a></span><ul class="toc-item"><li><span><a href="#Define-variables-for-experiment" data-toc-modified-id="Define-variables-for-experiment-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Define variables for experiment</a></span></li><li><span><a href="#Download-data-set" data-toc-modified-id="Download-data-set-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Download data set</a></span></li></ul></li><li><span><a href="#Process-Data" data-toc-modified-id="Process-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Process Data</a></span><ul class="toc-item"><li><span><a href="#Split-data" data-toc-modified-id="Split-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Split data</a></span></li><li><span><a href="#Preprocess-data-for-Machine-Learning" data-toc-modified-id="Preprocess-data-for-Machine-Learning-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Preprocess data for Machine Learning</a></span></li></ul></li><li><span><a href="#Training" data-toc-modified-id="Training-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#See-prediction" data-toc-modified-id="See-prediction-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>See prediction</a></span></li></ul></div>

# 101 GitHub Issue Summarization


In this notebook, we will show how to:

* Define a seq2seq model
* Perform training with [Keras](https://keras.io) and [TensorFlow](https://www.tensorflow.org/api_docs/python/tf)
* Validate the model with [Seldon](https://docs.seldon.io/projects/seldon-core/en/latest/python/api/modules.html)

To perform the training, we have wired a few technologies together:
* **S3 bucket** mounted as a file system [here](../bucket): this is the place for all training artifacts. Keras prefers to work with artifacts as if they were ordinary files.
* **Notebook profile** [here](profile_default/startup): contains meaningful notebook defaults and settings.
* **SuperHub integration**: contains various environment provisioning scripts.
* **Git integration**: your notebook has a Git repository. It is wise to commit to it periodically; the easiest way to do so is from the Jupyter terminal.

## Initialization

In [None]:
%load_ext autoreload
%autoreload 2
from os import environ, makedirs
from nbextensions.utils import download_file

### Define variables for experiment
At the beginning of the notebook we define all the necessary variables. Keeping the whole experiment configuration in a single cell makes it easy to find and change.

In [None]:
TAG = 'latest'

ARTIFACTS_ROOT = f"{environ['HOME']}/data/training-{TAG}"
DATASET_FILE = f"{ARTIFACTS_ROOT}/dataset.csv"
MODEL_FILE = f"{ARTIFACTS_ROOT}/training1.h5"
TITLE_PP_FILE = f"{ARTIFACTS_ROOT}/title_preprocessor.dpkl"
BODY_PP_FILE = f"{ARTIFACTS_ROOT}/body_preprocessor.dpkl"
TRAIN_DF_FILE = f"{ARTIFACTS_ROOT}/traindf.csv"
TEST_DF_FILE =  f"{ARTIFACTS_ROOT}/testdf.csv"
TRAIN_TITLE_VECS = f"{ARTIFACTS_ROOT}/train_title_vecs.npy"
TRAIN_BODY_VECS = f"{ARTIFACTS_ROOT}/train_body_vecs.npy"

TRAINING_DATA_SIZE = 2000
TEST_SIZE = .10

### Download data set 

Before we start training, we need to download a dataset file in CSV format.

There are two dataset files; you can choose either of them:
- *2Mi* `sample-dataset`: good for trying things out and for debugging your preprocessing or test scripts, because the turnaround is fast. However, a model trained on such a small dataset will not be very accurate.
- *3Gi* `full-dataset`: training on this dataset takes significant time, but predictions from the resulting model are quite good.

In [None]:
# github issues small: 2Mi data set (best for dev/test)
SAMPLE_DATASET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-sample.csv'
SAMPLE_DATASET_MD5 = '916af946f2fe1d1779b26205d4d8378f'
# github issues full: 3Gi data set (best for training)
FULL_DATASET = 'https://s3.us-east-2.amazonaws.com/asi-kubeflow-models/gh-issues/data-full.csv'
FULL_DATASET_MD5 = '57dc987c04d41a94d0d9daf4d0ebf8ba'

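# make sure the artifacts directory exists before downloading into it
makedirs(ARTIFACTS_ROOT, exist_ok=True)

# %time only times the statement on the same line, so the call is kept on one line
%time download_file(url=SAMPLE_DATASET, md5sum=SAMPLE_DATASET_MD5, download_to=DATASET_FILE)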
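
The `md5sum` argument suggests that the download is checked against a known checksum. How `download_file` does this internally is not shown here, so the following is only a minimal sketch of such a check using the standard library; the helper names `md5_of_file` and `matches_md5` are hypothetical and not part of `nbextensions.utils`.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Hypothetical helper: True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()
```

Reading in chunks keeps memory use flat, which matters for the 3Gi dataset. In the notebook, a call such as `matches_md5(DATASET_FILE, SAMPLE_DATASET_MD5)` could be used to re-check a file that is already on disk.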

## Process Data


### Split data
Before we process the data for machine learning, we need to split it into training and test sets (variable `TEST_SIZE`). To speed up training, we can also limit the amount of data used (variable `TRAINING_DATA_SIZE`).

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# read the dataset once; optionally sample a subset to speed up training
issues = pd.read_csv(DATASET_FILE)
if TRAINING_DATA_SIZE:
    issues = issues.sample(n=TRAINING_DATA_SIZE)
traindf, testdf = train_test_split(issues, test_size=TEST_SIZE)

print(f'Train: {traindf.shape[0]:,} rows {traindf.shape[1]:,} columns')
print(f'Test: {testdf.shape[0]:,} rows {testdf.shape[1]:,} columns')

# preview data
traindf.head(3)
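
To make the effect of the two variables concrete, here is a small standard-library sketch of a sampled, shuffled split. It is not scikit-learn's implementation; `split_rows` is a hypothetical helper, and rounding the test share up is an assumption of this sketch.

```python
import math
import random


def split_rows(rows, test_size, sample_size=None, seed=0):
    """Hypothetical helper: optionally subsample, then split into (train, test)."""
    rng = random.Random(seed)
    rows = list(rows)
    if sample_size:
        # random.sample returns the chosen rows in random order
        rows = rng.sample(rows, sample_size)
    else:
        rng.shuffle(rows)
    # assumption: the test share is rounded up to a whole number of rows
    n_test = math.ceil(len(rows) * test_size)
    return rows[n_test:], rows[:n_test]


# same settings as TRAINING_DATA_SIZE = 2000 and TEST_SIZE = .10
train_rows, test_rows = split_rows(range(5000), test_size=0.10, sample_size=2000)
print(len(train_rows), len(test_rows))  # 1800 200
```

With the notebook defaults, 2,000 sampled rows end up as 1,800 training rows and 200 test rows.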

### Preprocess data for Machine Learning
Here we will use the `ktext` library ([documentation](https://github.com/hamelsmu/ktext)).

We will convert the text into vectors, and do so for both the title and the body:

* **body**: Clean, tokenize, and apply padding / truncating so that each document has `length = 70`. Also, retain only the top `8,000` words in the vocabulary and set the remaining words to 1, which becomes the common index for rare words.

* **title**: Instantiate a text processor for the titles with some different parameters: `append_indicators=True` appends the tokens `_start_` and `_end_` to each document, and `padding='post'` means that zero padding is appended to the end of the document (as opposed to the default, which is 'pre').
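
The ideas in the list above (vocabulary cap, shared index for rare words, padding side, start/end indicators) can be illustrated with a toy sketch in plain Python. This is not how `ktext` is implemented: the whitespace tokenizer, the choice to truncate by keeping the first tokens, and the helper names `tokenize`, `build_vocab`, and `vectorize` are all assumptions made for illustration.

```python
from collections import Counter

PAD, RARE = 0, 1  # 0 pads short documents, 1 is the shared index for rare words


def tokenize(doc, append_indicators=False):
    """Toy tokenizer: lowercase and split on whitespace."""
    tokens = doc.lower().split()
    if append_indicators:
        tokens = ["_start_"] + tokens + ["_end_"]
    return tokens


def build_vocab(docs, keep_n, append_indicators=False):
    """Map the keep_n most frequent tokens to indices 2, 3, ... (0 and 1 are reserved)."""
    counts = Counter(t for d in docs for t in tokenize(d, append_indicators))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(keep_n))}


def vectorize(doc, vocab, maxlen, padding="pre", append_indicators=False):
    """Turn a document into exactly maxlen integers."""
    ids = [vocab.get(t, RARE) for t in tokenize(doc, append_indicators)]
    ids = ids[:maxlen]  # assumption: truncate by keeping the first maxlen tokens
    pad = [PAD] * (maxlen - len(ids))
    return ids + pad if padding == "post" else pad + ids


vocab = build_vocab(["crash crash crash on on startup"], keep_n=2)
print(vectorize("crash on startup", vocab, maxlen=5))                  # [0, 0, 2, 3, 1]
print(vectorize("crash on startup", vocab, maxlen=5, padding="post"))  # [2, 3, 1, 0, 0]
```

Here `startup` falls outside the two-word vocabulary and is mapped to 1, while the zeros show where padding lands for the `'pre'` and `'post'` settings.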

In [None]:
%%time
from ktext.preprocess import processor
import numpy as np
import pandas as pd
from IPython.display import display, HTML
import dill as dpickle

train_body_raw = traindf.body.tolist()
body_pp = processor(keep_n=8000, padding_maxlen=70)
train_body_vecs = body_pp.fit_transform(train_body_raw)

train_title_raw = traindf.issue_title.tolist()
title_pp = processor(append_indicators=True, keep_n=4500, padding_maxlen=12, padding='post')
train_title_vecs = title_pp.fit_transform(train_title_raw)

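# preview: build the frame column by column so that strings and vectors can be mixed
# (a single np.array of mixed strings and vectors is ragged and fails on recent NumPy)
df = pd.DataFrame({
    '': ['Before', 'After'],
    'Issue Title': [train_title_raw[0], train_title_vecs[0]],
    'Issue body': [train_body_raw[0], train_body_vecs[0]],
})
display(HTML(df.to_html(index=False)))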

# Save the preprocessor
print(f"Saving {BODY_PP_FILE}")
with open(BODY_PP_FILE, 'wb') as f:
    dpickle.dump(body_pp, f)

print(f"Saving {TITLE_PP_FILE}")
with open(TITLE_PP_FILE, 'wb') as f:
    dpickle.dump(title_pp, f)

# Save the processed data
print(f"Saving {TRAIN_TITLE_VECS}")
np.save(TRAIN_TITLE_VECS, train_title_vecs)
print(f"Saving {TRAIN_BODY_VECS}")
np.save(TRAIN_BODY_VECS, train_body_vecs)

## Training

Now we are ready to start training. Training is implemented as a [Python script](components/training/src/train.py). It takes the following variables, defined in the notebook user space (above), as implicit input:
* `TITLE_PP_FILE`
* `BODY_PP_FILE`
* `TRAIN_DF_FILE`
* `TEST_DF_FILE`
* `MODEL_FILE`

In [None]:
%%time
%run 'components/training/src/train.py'

## See prediction
It is useful to see examples of real predictions on a holdout set to get a sense of the model's performance. We will also evaluate the model numerically in a later section.

In [None]:
%%time
from keras.models import load_model
from seq2seq_utils import Seq2Seq_Inference
seq2seq_Model = load_model(MODEL_FILE)
seq2seq_inf = Seq2Seq_Inference(encoder_preprocessor=body_pp,
                                 decoder_preprocessor=title_pp,
                                 seq2seq_model=seq2seq_Model)
seq2seq_inf.demo_model_predictions(n=1, issue_df=testdf)