# Tagging Exercise (early draft for internal experimentation)

Instructions and some steps are still pending.

## Setup

In [1]:
from tagger import *
from sklearn.pipeline import Pipeline

Using TensorFlow backend.


In [2]:
#!wget http://citypolarna.se/event_data.csv -O "../data/raw/citypolarna_public_events_out.csv"

In [3]:
from tagger.dataset.cleaning import load_datasets

events_train, tags_train, events_test, tags_test, top_tags = load_datasets(
    "../data/raw/citypolarna_public_events_out.csv")

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/chrka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Preprocessing

* `ExtractText(columns=['description'], add_time_of_day=False)`: (Data frame to HTML) Extracts text fields from event data into a single string vector. Optionally prepends special symbols for time of day (or all-day) as appropriate.
* `HTMLToText()`: (HTML to string) Converts HTML into raw text.
* `CharacterSet(punctuation=True, digits=False)`: (String vector to string vector) Keeps alphabetic characters and collapses runs of whitespace into a single space. Optionally keeps digits and punctuation.
* `Lowercase()`: (String to string) Converts all alphabetic characters into their lowercase equivalents.
* `Tokenize(method='word_punct')`: (String to token list) Splits strings into lists of tokens. With `method='whitespace'`, strings are split on whitespace only; with `method='word_punct'` (default), punctuation marks are also used for splitting.
* `Stopwords()`: (Token list to token list) Removes stop words.
* `Stemming()`: (Token list to token list) Converts tokens into their stems.
* `NGram(n_min, n_max=None)`: (Token list to token list) Creates all $n$-grams from $n_{\mathrm{min}}$-grams to $n_{\mathrm{max}}$-grams. If $n_{\mathrm{max}}$ is not given, only $n_{\mathrm{min}}$-grams are created.
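
The two `Tokenize` methods differ only in how they treat punctuation. The sketch below illustrates that difference with the standard library. It assumes that `word_punct` behaves like NLTK's `WordPunctTokenizer` (runs of word characters and runs of punctuation become separate tokens); the `tokenize` function and the sample sentence are hypothetical stand-ins, not the `tagger` implementation.

```python
import re

# Assumption: `word_punct` splits like NLTK's WordPunctTokenizer, i.e. into
# runs of word characters and runs of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def tokenize(text, method="word_punct"):
    """Hypothetical stand-in for `Tokenize`; not the tagger implementation."""
    if method == "whitespace":
        # Split on whitespace only; punctuation stays attached to words.
        return text.split()
    if method == "word_punct":
        # Punctuation runs become tokens of their own.
        return WORD_PUNCT.findall(text)
    raise ValueError(f"unknown method: {method!r}")


sample = "fika på söndag, kl. 15?"
whitespace_tokens = tokenize(sample, method="whitespace")
word_punct_tokens = tokenize(sample, method="word_punct")
print(whitespace_tokens)
print(word_punct_tokens)
```

If `CharacterSet(punctuation=False)` runs earlier in the pipeline, there is no punctuation left to split on, so the two methods would be expected to produce the same tokens.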


In [4]:
baseline_preprocessing = Pipeline([
    ('fields', ExtractText()),
    ('html', HTMLToText()),
    ('cset', CharacterSet(punctuation=False)),
    ('lower', Lowercase()),
    ('token', Tokenize())
])

In [5]:
baseline_preprocessing.fit_transform(events_train[0:10])

7318    [vi, är, några, som, tänkt, fika, på, söndag, ...
9088    [hej, då, var, det, dags, för, en, bokklubbstr...
4793    [pröva, på, att, dansa, kizomba, prova, på, kl...
4553    [på, fredag, är, det, premiär, för, grand, hot...
6068    [uppdatering, dddd, dd, dd, ändrat, sista, dat...
1732    [intressekoll, inför, kvällen, trädgårn, klubb...
8802    [hej, minsta, rundan, ever, haha, men, då, får...
9183    [missa, inte, denna, intima, och, självutlämna...
6929    [någon, som, vill, med, till, hävringe, fyr, f...
4779    [obs, det, riskerar, att, bli, fullt, eller, n...
Name: description, dtype: object

In [6]:
my_preprocessing = Pipeline([
    ('fields', ExtractText(['title', 'description'], add_time_of_day=False)),
    ('html', HTMLToText()),
    ('cset', CharacterSet(punctuation=False, digits=False)),
    ('lower', Lowercase()),
    ('token', Tokenize()),
    ('stop', Stopwords()),
    ('stem', Stemming()),
    ('ngram', NGram(1, 2))
])

In [7]:
list(my_preprocessing.fit_transform(events_train[0:1]))

[['fik',
  'tänk',
  'fik',
  'söndag',
  'häng',
  'vill',
  'ring',
  'komm',
  'dd',
  'förklar',
  'dddd',
  'fik tänk',
  'tänk fik',
  'fik söndag',
  'söndag häng',
  'häng vill',
  'vill ring',
  'ring komm',
  'komm dd',
  'dd förklar',
  'förklar dddd']]
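
The output above lists all unigrams first, followed by all bigrams joined with a space. A minimal sketch of how such a list could be produced is shown here; `make_ngrams` is a hypothetical helper written for illustration, and the actual `NGram` transformer may be implemented differently.

```python
def make_ngrams(tokens, n_min, n_max=None):
    """Return all n-grams for n in [n_min, n_max], shortest n first.

    Hypothetical illustration of `NGram(n_min, n_max)`; if `n_max` is
    omitted, only n_min-grams are produced.
    """
    if n_max is None:
        n_max = n_min
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the token list.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = ["fik", "tänk", "fik", "söndag"]
result = make_ngrams(tokens, 1, 2)
print(result)
```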

## Feature Extraction

* `BagOfWords(binary=False)`: (List of tokens to sparse vector) Creates bag-of-words vectors. If `binary=True`, counts are ignored and the vector only indicates whether each word is present.
* `Tfidf()`: (List of tokens to sparse vector) Creates TF-IDF-weighted vectors.
* `SumWordBedding(model_path)`, `MeanWordBedding(model_path)`: (List of tokens to sparse vector) Convert tokens to the sum or mean, respectively, of their word embedding vectors.
* (`WordEmbedding()`: (List of tokens to matrix))
* `SparseToDense()`: (Sparse vector to vector)
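
To make the effect of the `binary` flag concrete, here is a small standard-library sketch of a bag-of-words encoding. The `bag_of_words` function and its dictionary-based sparse rows are assumptions made for illustration; the real `BagOfWords` transformer presumably returns a sparse matrix and may order the vocabulary differently.

```python
from collections import Counter


def bag_of_words(documents, binary=False):
    """Hypothetical bag-of-words encoder, not the tagger implementation.

    Returns the sorted vocabulary and one sparse row per document, stored as
    {column index: value}. With binary=True every present word gets value 1.
    """
    vocabulary = sorted({token for doc in documents for token in doc})
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        counts = Counter(doc)
        rows.append({index[t]: (1 if binary else c) for t, c in counts.items()})
    return vocabulary, rows


docs = [["fik", "tänk", "fik"], ["söndag", "fik"]]
vocab, counts = bag_of_words(docs)
_, presence = bag_of_words(docs, binary=True)
print(vocab)
print(counts)
print(presence)
```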

In [8]:
baseline_features = Pipeline([
    ('bow', BagOfWords())
])

In [9]:
my_features = Pipeline([
    ('bow', BagOfWords(binary=False))
])

## Classification Algorithms

* `NaiveBayes()`: ((Sparse) vector to prediction) Naïve Bayes.
* `LogisticRegression()`: ((Sparse) vector to prediction) Logistic regression.
* `MultiLayerPerceptron(layers, epochs=16, batch_size=64)`: (Vector to prediction) Multi-layer perceptron with the specified layers, e.g., `layers=[1024, 256]`.

In [10]:
baseline_classifier = Pipeline([
    ('nb', NaiveBayes())
])

In [11]:
my_classifier = Pipeline([
    ('lr', LogisticRegression())
])

## Evaluation

* `evaluate_per_label(model, top_tags, events, tags, test_size=0.2, sample_size=None, n_splits=3, random_state=42)`: Calculates per-label statistics for the given model, using `n_splits`-fold cross-validation.
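
The notebook sorts the resulting statistics by an `auc` column, so one of the per-label statistics is presumably the area under the ROC curve. The sketch below shows how a per-label AUC could be computed for a single split, using the pairwise-ranking definition of AUC. The function names, tag names, and scores are hypothetical, and the cross-validation loop of `evaluate_per_label` is not reproduced.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN if one class is missing.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        return float("nan")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def per_label_auc(labels, y_true_by_label, scores_by_label):
    """Compute one AUC per label (hypothetical helper for illustration)."""
    return {
        label: roc_auc(y_true_by_label[label], scores_by_label[label])
        for label in labels
    }


# Made-up binary targets and predicted scores for two illustrative tags.
y_true = {"fika": [1, 0, 1, 0], "dans": [0, 0, 1, 1]}
scores = {"fika": [0.9, 0.2, 0.6, 0.7], "dans": [0.1, 0.2, 0.35, 0.8]}
aucs = per_label_auc(["fika", "dans"], y_true, scores)
print(aucs)
```

The pairwise loop is quadratic in the number of events, which is fine for a sketch but not for a full dataset; a rank-based formulation or a library routine would be the usual choice there.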

Model comparisons and visualizations are coming later.

In [12]:
baseline_model = Pipeline([
    ('pre', baseline_preprocessing),
    ('feat', baseline_features),
    ('clf', baseline_classifier)
])

In [13]:
%%time
baseline_model.fit(events_train, tags_train)

CPU times: user 3.06 s, sys: 69.5 ms, total: 3.13 s
Wall time: 3.13 s


Pipeline(memory=None,
     steps=[('pre', Pipeline(memory=None,
     steps=[('fields', ExtractText(add_time_of_day=False, columns=['description'])), ('html', HTMLToText()), ('cset', CharacterSet(digits=False, punctuation=False)), ('lower', Lowercase()), ('token', Tokenize(method='word_punct'))])), ('feat', Pipeline(memory=None, steps=[('bow', BagOfWords(binary=False))])), ('clf', Pipeline(memory=None, steps=[('nb', NaiveBayes())]))])

In [14]:
my_model = Pipeline([
    ('pre', my_preprocessing),
    ('feat', my_features),
    ('clf', my_classifier)
])

In [15]:
%%time
my_model.fit(events_train, tags_train)

CPU times: user 5min 8s, sys: 2.71 s, total: 5min 11s
Wall time: 1min 25s


Pipeline(memory=None,
     steps=[('pre', Pipeline(memory=None,
     steps=[('fields', ExtractText(add_time_of_day=False, columns=['title', 'description'])), ('html', HTMLToText()), ('cset', CharacterSet(digits=False, punctuation=False)), ('lower', Lowercase()), ('token', Tokenize(method='word_punct')), ('stop', Stopwords()),... BagOfWords(binary=False))])), ('clf', Pipeline(memory=None, steps=[('lr', LogisticRegression())]))])

In [None]:
%%time
stats = evaluate_per_label(my_model, top_tags, events_train, tags_train)
stats.sort_values('auc', ascending=False)


## Submission

* `submit_model(model, *, team_name, model_name, local_events=None, local_tags=None)`:  Evaluate and submit model predictions to leaderboard.

For now, only local evaluation is available; it can be used to compare models.

In [None]:
submit_model(baseline_model, 
             team_name="All your base are belong to us",
             model_name="baseline",
             local_events=events_test,
             local_tags=tags_test)

In [None]:
submit_model(my_model, 
             team_name="Little gray cells",
             model_name="1-2-gram",
             local_events=events_test,
             local_tags=tags_test)