# Tutorial 1: Usage of LanTiSEAA

The class lantiseaa.LanTiSEAA is a classifier implementing the Language Time Series Enriched Authorship Attribution method. It can be integrated with the sklearn framework (as an Estimator and a Predictor). LanTiSEAA accepts customized time series transformers, a feature extractor, a baseline classifier, a meta classifier and a buffer object, and performs classification on text data. The iterative framework for stacking time series feature groups described in the paper is not implemented in this class (for details, see IterativeLanTiSEAA). Instead, LanTiSEAA combines all features extracted by the time series methods with the predictions made by the baseline method (if given), selects the relevant features and makes predictions using the meta classifier.

Using a sample data set which contains 1% of the samples randomly selected from the Spooky Books Data Set described in the paper (while keeping the class distribution), this notebook shows an example of using LanTiSEAA to perform an authorship attribution classification task.

# Table of Contents
1. [Import Data](#1)
2. [Split train and test data set](#2)
3. [Use LanTiSEAA to perform classification](#3) <br>

# 1. Import Data <a class="anchor" id="1"></a>

In [None]:
import os
import logging
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

logging.getLogger().setLevel(logging.INFO)
sns.set()

import lantiseaa
import lantiseaa.ts as ts
from lantiseaa import LanTiSEAA
from lantiseaa.baseline import BOWMNB
from lantiseaa.extractor import TsfreshTSFeatureExtractor
from lantiseaa.buffer import LocalBuffer, MemoryBuffer

%matplotlib inline
%load_ext autoreload
%autoreload 2

The LocalBuffer can be used as the project IO for reading and saving data. Here we use it to read the sample data set and to save the results produced during execution. MemoryBuffer can also hold results (temporarily) in memory, and it is the default buffer for the class LanTiSEAA. For research purposes, however, we want to save all results to disk so that we can look for useful information later in the study. Hence, LocalBuffer is used in this tutorial.

By default, LocalBuffer sets the project root directory one level above the module folder "LanTiSEAA". For example, if the folder "LanTiSEAA" is placed under a folder named "my_project", then "my_project" becomes the project folder and all data will be read from and saved to it. We want to point it to a location of our choosing instead: in this case, the "LanTiSEAA" folder itself.

Also by default, no subfolder is set for organizing the data produced by a specific execution. We specify one to keep the saved files organized: in this case, "tutorial1".

In [None]:
io = LocalBuffer(project_path=os.getcwd(), subfolder='tutorial1')

In [None]:
io.project_path

In [None]:
io.subfolder

We can use LocalBuffer to get absolute paths to files in the project folder (note that a path is returned even if the file does not exist yet):

In [None]:
io.data(filename='sample_dataset.csv', subfolder='')

In [None]:
io.results(filename='somefile.hdf', subfolder='tutorial1')

In [None]:
io.figures(filename='somefig.png', subfolder='tutorial1')

In [None]:
io.classes(filename='someclass.pkl', subfolder='tutorial1')

In [None]:
io.project(filename='Tutorial1_LanTiSEAA.ipynb', folder='')
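
To make the directory convention concrete, here is a minimal sketch of how a disk-backed buffer could resolve paths from a project root, a folder per kind of file, and an optional subfolder. The class `SketchBuffer`, the folder names "data" and "results", and the default arguments are all assumptions made for illustration; this is not LocalBuffer's actual implementation.

```python
from pathlib import Path


class SketchBuffer:
    """Hypothetical stand-in for a disk-backed buffer; not LocalBuffer's real code."""

    def __init__(self, project_path, subfolder=""):
        self.project_path = Path(project_path)
        self.subfolder = subfolder

    def _resolve(self, kind, filename, subfolder=None):
        # Fall back to the buffer-wide subfolder when none is given per call.
        sub = self.subfolder if subfolder is None else subfolder
        # Joining with an empty string leaves the path unchanged.
        return self.project_path / kind / sub / filename

    def data(self, filename, subfolder=""):
        # Assumption: input data sits outside any per-run subfolder by default.
        return self._resolve("data", filename, subfolder)

    def results(self, filename, subfolder=None):
        return self._resolve("results", filename, subfolder)


buf = SketchBuffer("/tmp/my_project", subfolder="tutorial1")
print(buf.data("sample_dataset.csv"))
print(buf.results("somefile.hdf"))
```

Building the path never touches the file system, which is why such a helper can return a path for a file that does not exist yet.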

Now we use LocalBuffer to read the sample data set.

In [None]:
dataset = pd.read_csv(io.data('sample_dataset.csv'))

texts = dataset.text
y = dataset.author

In [None]:
print(dataset.shape)
dataset.head()

Now let's have a look at the data set.

In [None]:
# look at the authors in the data set
y_values = y.value_counts()

sns.barplot(x=y_values.index, y=y_values.values, alpha=0.8)  # keyword arguments are required by recent seaborn versions
plt.title('Distribution of authors in sample data set')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel("Author", fontsize=12)
plt.show() # distribution is kept the same as the original data set

# 2. Split train and test data set <a class="anchor" id="2"></a>
Leave about 1/3 of the data as the test data set.

In [None]:
texts_train, texts_test, y_train, y_test = train_test_split(texts, y, test_size=0.33, stratify=y)

In [None]:
texts_train = texts_train.reset_index(drop=True)
texts_test = texts_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
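
The `stratify=y` argument used above asks for a split in which every author keeps roughly the same share in both parts. The following is a simplified, standard-library sketch of that idea on made-up labels; the function `stratified_split` is hypothetical and its per-class rounding will not match scikit-learn's exact allocation.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.33, seed=0):
    """Return (train_idx, test_idx) with a per-class test share close to test_size."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        # Take the same fraction from every class.
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def proportions(labels, indices):
    """Share of each class among the selected indices."""
    counts = Counter(labels[i] for i in indices)
    return {k: counts[k] / len(indices) for k in sorted(counts)}


# Toy labels; the class names are placeholders, not the real sample data.
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
train_idx, test_idx = stratified_split(labels)

print(proportions(labels, train_idx))
print(proportions(labels, test_idx))
```

Both printed dictionaries should stay close to the original 60/30/10 mix, which is the property the bar plot in section 1 lets you check visually.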

# 3. Use LanTiSEAA to perform classification <a class="anchor" id="3"></a>

LanTiSEAA accepts customized ts_transformers, feature_extractor, baseline_classifier, meta_classifier and buffer. To implement a customized ts_transformer, feature_extractor, baseline_classifier or buffer, look at the ts, extractor, baseline and buffer modules and extend the base classes. In addition, the use_predict_proba parameter specifies whether the predict_proba function or the predict function of the baseline_classifier is used to make the baseline predictions.

In this notebook, we use the defaults with a few changes.

The buffer will be the LocalBuffer we used to access files on the local disk. 

While the default meta_classifier is sklearn's GradientBoostingClassifier, we use XGBoost's XGBClassifier here, as in the paper. Because we already know the number of classes the XGBClassifier will deal with, we can instantiate it up front.

We also change the fdr_level of the TsfreshTSFeatureExtractor: for such a small data set, the fdr_level=0.001 used in the paper would filter out almost all features.

Finally, we leave out the TokenRankDistTransformer, as extracting time series features from its output takes comparatively long.

In [None]:
# num_class matches the three classes in the data set; n_jobs=-1 uses all available cores
meta_classifier = XGBClassifier(objective='multi:softprob', num_class=3, n_jobs=-1)
feature_extractor = TsfreshTSFeatureExtractor(fdr_level=15)
clf = LanTiSEAA(ts_transformers=[ts.TokenLenSeqTransformer(), 
                                 ts.TokenFreqSeqTransformer(), 
                                 ts.TokenRankSeqTransformer(), 
                                 ts.TokenLenDistTransformer()], 
                feature_extractor=feature_extractor, meta_classifier=meta_classifier, buffer=io)
clf.fit(texts_train, y_train)
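
For intuition about what the transformers passed above produce, the sketch below maps a sentence to two numeric series. It assumes, going by the transformer names only, that a token length sequence records the character length of each token in reading order and that a token frequency sequence replaces each token with its count in a reference corpus. The regex tokenizer and the helper functions are hypothetical; the library may tokenize and count differently.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def token_length_sequence(text):
    # One value per token, in reading order.
    return [len(tok) for tok in tokenize(text)]


def token_frequency_sequence(text, corpus_counts):
    # Each token is replaced by how often it occurs in the reference corpus.
    return [corpus_counts[tok] for tok in tokenize(text)]


# Made-up mini corpus used only to obtain token counts.
corpus = ["the raven sat", "the night was dark", "a dark and stormy night"]
corpus_counts = Counter(tok for doc in corpus for tok in tokenize(doc))

sentence = "The night was dark"
lengths = token_length_sequence(sentence)
frequencies = token_frequency_sequence(sentence, corpus_counts)
print(lengths)
print(frequencies)
```

Each text becomes a sequence of numbers, so a time series feature extractor can summarize it with statistics that are then fed to the meta classifier.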

In [None]:
clf.relevant_features_
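
The relevant features listed above are the ones that survive feature selection, and fdr_level controls how strict that selection is. As a rough illustration of why a very small level removes almost everything, here is a plain Benjamini-Hochberg sketch on made-up p-values. The function is hypothetical, and the procedure the extractor actually applies may differ (for example, a more conservative variant).

```python
def benjamini_hochberg(p_values, fdr_level):
    """Return the indices of hypotheses rejected at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= k / m * fdr_level.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr_level:
            cutoff = rank
    # Everything ranked at or below the cutoff counts as relevant.
    return sorted(order[:cutoff])


# Made-up p-values for six hypothetical features.
p_values = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.6]

strict = benjamini_hochberg(p_values, fdr_level=0.001)
lenient = benjamini_hochberg(p_values, fdr_level=0.05)
print(strict)
print(lenient)
```

With few samples, p-values tend to be larger, so a strict level leaves few or no features for the meta classifier.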

In [None]:
pred_test = clf.predict_proba(texts_test)

In [None]:
pred_test
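
A probability matrix like the one above can be scored with the multiclass log loss, the mean negative log-probability assigned to the true class. The sketch below computes it with the standard library on placeholder labels and probabilities; the function name and the data are illustrative, and it assumes the column order of the matrix is known.

```python
import math


def multiclass_log_loss(y_true, proba, classes, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    col = {label: j for j, label in enumerate(classes)}
    total = 0.0
    for label, row in zip(y_true, proba):
        # Clip to avoid log(0) when a true class gets probability zero.
        p = min(max(row[col[label]], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Placeholder labels and probabilities, not output from the model above.
classes = ["A", "B", "C"]
y_true = ["A", "C", "B"]
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
loss = multiclass_log_loss(y_true, proba, classes)
print(round(loss, 4))
```

The `log_loss` function imported from sklearn.metrics at the top of this notebook computes the same quantity; when calling it on `y_test` and `pred_test`, check first that the probability columns follow the label order it expects.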

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
io.read_records()