# Tutorial 2: Usage of IterativeLanTiSEAA

The class lantiseaa.IterativeLanTiSEAA is also a classifier that can be integrated with the sklearn framework. It implements the Language Time Series Enriched Authorship Attribution method with the iterative stacking framework described in the paper. Its fit function therefore implements the whole process described in the paper, from splitting folds to selecting a final combination of feature groups. Just like LanTiSEAA, IterativeLanTiSEAA accepts customized time series transformers, a feature extractor, a baseline classifier, a meta classifier and a buffer object. It additionally accepts a cv (fold splitter) and a metric for evaluating the predictions on each fold. In the stacking process, as in the paper, both Bayesian estimation and the Wilcoxon signed-rank test are used to measure the effect of adding a feature group. After the best combination is selected, IterativeLanTiSEAA is fit on the complete training data set and is ready to predict unseen test data.
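
To make the idea of iteratively selecting feature groups concrete, here is a minimal sketch of a greedy forward selection loop. It is not the library's implementation: the function name, the group names and the toy scores are all made up for illustration, and a plain comparison of mean fold scores stands in for the Bayesian estimation and Wilcoxon tests that the paper's procedure uses.

```python
from statistics import mean


def greedy_group_selection(candidates, score_fn, greater_is_better=False):
    """Greedily add the feature group that improves the mean fold score most.

    candidates: list of feature group names (hypothetical names).
    score_fn: callable mapping a tuple of group names to per-fold scores.
    """
    selected = ("baseline",)
    best = mean(score_fn(selected))
    remaining = list(candidates)
    while remaining:
        # Score every one-group extension of the current combination.
        trials = {g: mean(score_fn(selected + (g,))) for g in remaining}
        chooser = max if greater_is_better else min
        pick = chooser(trials, key=trials.get)
        improved = trials[pick] > best if greater_is_better else trials[pick] < best
        if not improved:
            break  # no remaining group helps, so stop stacking
        selected, best = selected + (pick,), trials[pick]
        remaining.remove(pick)
    return selected, best


# Toy per-fold log loss values, invented for this sketch only.
TOY_SCORES = {
    ("baseline",): [0.90, 0.88, 0.92],
    ("baseline", "tokenfreqseq"): [0.80, 0.79, 0.81],
    ("baseline", "tokenlenseq"): [0.91, 0.93, 0.90],
    ("baseline", "tokenfreqseq", "tokenlenseq"): [0.85, 0.84, 0.86],
}

selected, best = greedy_group_selection(
    ["tokenfreqseq", "tokenlenseq"], lambda combo: TOY_SCORES[combo]
)
print(selected, round(best, 3))
```

With these toy numbers the loop keeps one group and then stops, because adding the second group makes the mean log loss worse.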

Using a sample data set which contains 1% of the samples randomly selected from the Spooky Books Data Set described in the paper (while keeping the class distribution), this notebook shows an example of using IterativeLanTiSEAA to study a language time series enriched authorship attribution classification task.

# Table of Contents
1. [Import Data](#1)
2. [Split train and test data set](#2)
3. [Use IterativeLanTiSEAA to study a language time series enriched authorship attribution classification task](#3) <br>
4. [Another task with a bad baseline method](#4) <br>

# 1. Import Data <a class="anchor" id="1"></a>

In [None]:
import os
import random
import logging
import numpy as np 
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

logging.getLogger().setLevel(logging.INFO)
sns.set()

import lantiseaa
import lantiseaa.ts as ts
from lantiseaa import IterativeLanTiSEAA
from lantiseaa.baseline import BOWMNB
from lantiseaa.extractor import TsfreshTSFeatureExtractor
from lantiseaa.buffer import LocalBuffer, MemoryBuffer

%matplotlib inline
%load_ext autoreload
%autoreload 2

The LocalBuffer can be used as a project IO for reading and saving data. Here we use it to read the sample data set and to save results produced during the execution. MemoryBuffer can also be used to hold results temporarily in memory, and it is the default buffer for the LanTiSEAA class. For research purposes, however, we want to save all results to disk so that we can look for useful information later in the study. Hence, LocalBuffer is used in this tutorial.

By default, LocalBuffer sets the root project directory one level above the module's "LanTiSEAA" folder. For example, if the "LanTiSEAA" folder is placed under a folder named "my_project", then "my_project" becomes the project folder and all data is read from and saved to it. Here we point the buffer at a location of our choosing instead: the "LanTiSEAA" folder itself.

Also, by default, no subfolder is set for organizing the data produced by a specific execution. We specify one to keep the saved files organized, in this case "tutorial2".
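
The relationship between project folder, subfolder and file name can be pictured with a small pathlib sketch. The helper `resolve_path` and the folder layout it builds are assumptions made for illustration only; they are not LocalBuffer's actual code or its documented directory structure.

```python
from pathlib import Path


def resolve_path(project_path, kind, filename, subfolder=""):
    """Join project root, a category folder, an optional subfolder and a file name.

    The layout <project>/<kind>/<subfolder>/<filename> is assumed for this
    sketch; check the LocalBuffer source for the real layout.
    """
    return Path(project_path) / kind / subfolder / filename


# Default-style root: one level above a hypothetical module folder.
module_dir = Path("/home/user/my_project/LanTiSEAA")
default_root = module_dir.parent

path = resolve_path(default_root, "results", "somefile.hdf", subfolder="tutorial2")
print(path)
# Building a path never touches the disk, so the file need not exist.
print(path.exists())
```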

In [None]:
io = LocalBuffer(project_path=os.getcwd(), subfolder='tutorial2')

In [None]:
io.project_path

In [None]:
io.subfolder

We can use LocalBuffer to get absolute paths to files in the project folder (note that a path can be returned even for a file that does not exist):

In [None]:
io.data(filename='sample_dataset.csv', subfolder='')

In [None]:
io.results(filename='somefile.hdf', subfolder='tutorial2')

In [None]:
io.figures(filename='somefig.png', subfolder='tutorial2')

In [None]:
io.classes(filename='someclass.pkl', subfolder='tutorial2')

In [None]:
io.project(filename='Tutorial2_IterativeLanTiSEAA.ipynb', folder='')

Now we use LocalBuffer to read the sample data set.

In [None]:
dataset = pd.read_csv(io.data('sample_dataset.csv'))

texts = dataset.text
y = dataset.author

In [None]:
print(dataset.shape)
dataset.head()

Now let's have a look at the data set.

In [None]:
# look at the authors in the data set
y_values = y.value_counts()

# pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.barplot(x=y_values.index, y=y_values.values, alpha=0.8)
plt.title('Distribution of authors in sample data set')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel("Author", fontsize=12)
plt.show() # distribution is kept the same as the original data set

# 2. Split train and test data set <a class="anchor" id="2"></a>
Leave about 1/3 of the data as the test data set.

In [None]:
texts_train, texts_test, y_train, y_test = train_test_split(texts, y, test_size=0.33, stratify=y)

In [None]:
texts_train = texts_train.reset_index(drop=True)
texts_test = texts_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
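
The two cells above rely on stratification and on resetting the index. The following sketch, which uses made-up toy labels rather than the sample data set, illustrates both effects: the class shares in the test split stay close to those of the full data, and `reset_index(drop=True)` replaces the shuffled index labels with 0..n-1.

```python
from collections import Counter

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels and texts, invented for illustration (not the sample data set).
toy_y = pd.Series(["EAP"] * 40 + ["HPL"] * 30 + ["MWS"] * 30)
toy_texts = pd.Series([f"text {i}" for i in range(len(toy_y))])

tr_texts, te_texts, tr_y, te_y = train_test_split(
    toy_texts, toy_y, test_size=0.33, stratify=toy_y, random_state=0
)

# Class shares in the test split stay close to the full-data shares.
full_share = {k: v / len(toy_y) for k, v in Counter(toy_y).items()}
test_share = {k: v / len(te_y) for k, v in Counter(te_y).items()}
print(full_share)
print(test_share)

# The split keeps the original index labels in shuffled order; resetting
# gives 0..n-1 so positional and label-based lookups agree afterwards.
tr_y_reset = tr_y.reset_index(drop=True)
print(list(tr_y.index[:5]), list(tr_y_reset.index[:5]))
```

Passing `random_state` makes the toy split repeatable; the tutorial cell above omits it, so its split changes from run to run.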

# 3. Use IterativeLanTiSEAA to study a language time series enriched authorship attribution classification task <a class="anchor" id="3"></a>

Similar to LanTiSEAA, IterativeLanTiSEAA accepts customizable objects and parameters. Beyond those in LanTiSEAA, it also accepts a cv for splitting folds, a metric to evaluate the predictions made on the folds, and a random_state that sets the random seed used when StratifiedKFold splits the folds (that is, when the cv parameter is an integer or None). The greater_is_better parameter specifies whether higher or lower scores from the metric are better.
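
The role of a metric's direction can be shown with a short, dependency-free sketch. It computes a multi-class log loss by hand and compares two scores under a direction flag. The helper names `multiclass_log_loss` and `is_better` are hypothetical and the probabilities are toy values; the library's own metric handling may differ.

```python
import math


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


def is_better(new_score, old_score, greater_is_better):
    """Compare two scores under the metric's direction."""
    return new_score > old_score if greater_is_better else new_score < old_score


y_true = ["EAP", "HPL", "MWS"]
confident = [
    {"EAP": 0.8, "HPL": 0.1, "MWS": 0.1},
    {"EAP": 0.1, "HPL": 0.8, "MWS": 0.1},
    {"EAP": 0.1, "HPL": 0.1, "MWS": 0.8},
]
uniform = [{"EAP": 1 / 3, "HPL": 1 / 3, "MWS": 1 / 3}] * 3

loss_confident = multiclass_log_loss(y_true, confident)
loss_uniform = multiclass_log_loss(y_true, uniform)
# Log loss is an error measure, so the lower value wins.
print(is_better(loss_confident, loss_uniform, greater_is_better=False))
```

For an accuracy-style metric the same comparison would be made with `greater_is_better=True`.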

In this notebook we again use the defaults, with a few changes.

Besides the changes already made to LanTiSEAA in tutorial 1, we also raise the fdr_levels for Bayesian estimation and the Wilcoxon signed-rank test.

However, such a small data set carries very little information for classification and a lot of noise, so neither the baseline method nor the time series enriched one can handle it well. Because the Gradient Boosting meta classifier is noise-sensitive, we expect the time series features, which will be mostly noise, to reduce the performance of the baseline method. In the end, the baseline method alone will be chosen as the best classifier.

To show how IterativeLanTiSEAA stacks time series features that do improve the baseline method, in [section 4](#4) we generate a very bad baseline and let the time series features improve on it.

In [None]:
meta_classifier = XGBClassifier(objective='multi:softprob', num_class=3, n_jobs=-1, nthread=-1)
feature_extractor=TsfreshTSFeatureExtractor(fdr_level=15)
clf = IterativeLanTiSEAA(ts_transformers=[ts.TokenLenSeqTransformer(), 
                                          ts.TokenFreqSeqTransformer(), 
                                          ts.TokenRankSeqTransformer(), 
                                          ts.TokenLenDistTransformer()], 
                         feature_extractor=feature_extractor, meta_classifier=meta_classifier, buffer=io)
clf.fit(texts_train, y_train, fdr_level_bayesian=0.5, fdr_level_wilcoxon=0.5)
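
The fit call above passes a level for the Wilcoxon signed-rank test. As a rough illustration of what such a test measures on paired per-fold scores, the sketch below computes an exact one-sided p-value by enumerating sign patterns. The function name and the fold scores are made up, and how IterativeLanTiSEAA applies fdr_level_wilcoxon internally is not shown here; the sketch simply compares the p-value with a threshold. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
from itertools import product


def wilcoxon_one_sided_p(before, after):
    """Exact one-sided p-value that `after` tends to be lower than `before`.

    Assumes paired samples with no zero differences and no tied
    absolute differences, so plain integer ranks are enough.
    """
    diffs = [b - a for b, a in zip(before, after)]  # positive = improvement
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null every sign pattern is equally likely; count those whose
    # positive-rank sum is at least as large as the observed one.
    n = len(diffs)
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) >= w_plus
    )
    return hits / 2 ** n


# Toy per-fold log loss values, invented for this sketch only.
baseline = [0.92, 0.88, 0.95, 0.90, 0.91, 0.89, 0.93, 0.94, 0.87, 0.96]
enriched = [0.85, 0.86, 0.90, 0.80, 0.88, 0.80, 0.94, 0.88, 0.83, 0.85]

p_value = wilcoxon_one_sided_p(baseline, enriched)
print(p_value, p_value < 0.5)
```

Here nine of the ten folds improve and the single worse fold has the smallest absolute difference, so only two of the 1024 sign patterns are at least as extreme as the observed one.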

In [None]:
clf.relevant_features_

In [None]:
clf.feature_groups_

In [None]:
io.read_evaluation_score('baseline')

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
io.read_records()

# 4. Another task with a bad baseline method <a class="anchor" id="4"></a>

To show how IterativeLanTiSEAA stacks time series features to improve a baseline, this task introduces a very bad baseline prediction in which every prediction is wrong, and then examines how the time series features improve it.

In [None]:
y_train.head()

In [None]:
bad_baseline_predictions_train = []
bad_labels = []
for label in y_train:
    new_label = label
    while new_label == label:
        new_label = random.choice(['EAP', 'MWS', 'HPL'])
    bad_labels.append(new_label)
    
lb = LabelBinarizer()
lb.fit(bad_labels)
bad_baseline_predictions_train = pd.DataFrame(lb.transform(bad_labels), columns=lb.classes_)
bad_baseline_predictions_train.head()
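
As a sanity check on the idea behind the cell above, the following dependency-free sketch builds deliberately wrong labels for a handful of made-up true labels and confirms that none of them matches. It picks directly among the other classes instead of using a retry loop, and the helpers `wrong_label` and `one_hot` are hypothetical stand-ins for the loop and the LabelBinarizer step.

```python
import random

CLASSES = ["EAP", "MWS", "HPL"]


def wrong_label(label, rng):
    """Pick uniformly among the classes other than the true one."""
    return rng.choice([c for c in CLASSES if c != label])


def one_hot(label):
    """One row of a hard 0/1 prediction table, columns sorted by class name."""
    return [1 if c == label else 0 for c in sorted(CLASSES)]


rng = random.Random(0)  # seeded so the toy run is repeatable
true_labels = ["EAP", "HPL", "MWS", "EAP", "MWS"]  # made-up labels
bad = [wrong_label(label, rng) for label in true_labels]
rows = [one_hot(label) for label in bad]

# Every prediction is wrong by construction, so accuracy is zero.
accuracy = sum(b == t for b, t in zip(bad, true_labels)) / len(true_labels)
print(accuracy)
```

Because each row puts all of its mass on a wrong class, a probability-based metric such as log loss scores this baseline very poorly, which leaves plenty of room for the time series features to help.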

In [None]:
io2 = LocalBuffer(project_path=os.getcwd(), subfolder='tutorial2_task2')

In [None]:
meta_classifier = XGBClassifier(objective='multi:softprob', num_class=3, n_jobs=-1, nthread=-1)
feature_extractor=TsfreshTSFeatureExtractor(fdr_level=15)
clf = IterativeLanTiSEAA(ts_transformers=[ts.TokenLenSeqTransformer(), 
                                          ts.TokenFreqSeqTransformer(), 
                                          ts.TokenRankSeqTransformer(), 
                                          ts.TokenLenDistTransformer()], 
                         baseline_classifier=None,
                         feature_extractor=feature_extractor, meta_classifier=meta_classifier, buffer=io2)
clf.fit(texts_train, y_train, fdr_level_bayesian=0.5, fdr_level_wilcoxon=0.5, 
        baseline_prediction=bad_baseline_predictions_train)

In [None]:
clf.relevant_features_

In [None]:
clf.feature_groups_

The feature_groups_ attribute after fitting shows the best combination of the feature groups. In this case, Token Frequency Sequence was the single time series feature group that improved the predictions the most.

In [None]:
io2.read_evaluation_score('baseline')

In [None]:
io2.read_evaluation_score('baseline_tokenfreqseq')

In [None]:
trace = io2.read_bayesian_estimation_trace("baseline", "baseline_tokenfreqseq")

In [None]:
pm.plot_posterior(trace, varnames=['difference of means'])
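
A posterior plot like the one above is usually read by asking how much of the posterior mass lies on one side of zero. The sketch below shows that reading on synthetic draws; the numbers are generated here purely for illustration and say nothing about the real trace. It also uses a simple equal-tailed interval, which is not the same as the highest posterior density interval that a posterior plot typically reports.

```python
import random
from statistics import mean

rng = random.Random(42)
# Synthetic stand-in for posterior draws of a "difference of means";
# a real trace would come from the fitted model, not from this generator.
samples = [rng.gauss(0.05, 0.02) for _ in range(5000)]


def central_interval(draws, mass=0.95):
    """Equal-tailed interval covering roughly `mass` of the draws."""
    ordered = sorted(draws)
    lo = ordered[int((1 - mass) / 2 * len(ordered))]
    hi = ordered[int((1 + mass) / 2 * len(ordered)) - 1]
    return lo, hi


# Share of draws above zero: an estimate of P(difference > 0).
prob_positive = mean(1.0 if s > 0 else 0.0 for s in samples)
low, high = central_interval(samples)
print(round(prob_positive, 3), round(low, 3), round(high, 3))
```

If nearly all of the mass lies on one side of zero, the two compared models can be considered credibly different under this reading.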

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
io2.read_records()