# CiP Tagging Exercise

In this exercise you will try your hand at assembling an NLP pipeline that suggests tags for [Citypolarna](https://www.citypolarna.se) events.

You will be working with a dataset of descriptions of past events that have previously been tagged with a selection of tags. You will use these to train a classifier that suggests tags for new, unseen events.

Note that an event may have more than one tag; for example, the first event in the training data is tagged with **mat**, **fika**, and **fest**.

## Setup

The first step is to import the packages and data we need for the exercise.

> **TODO:** The notebook is still set up for local development.  This section will change a lot.

In [1]:
from tagger import *
from sklearn.pipeline import Pipeline

Using TensorFlow backend.


In [2]:
from tagger.dataset.cleaning import load_datasets

events_train, tags_train, events_test, tags_test, top_tags, tags_train_stats = load_datasets(
    "../data/raw/citypolarna_public_events_out.csv")

import nltk
nltk.download('stopwords')

import pandas as pd
pd.set_option('display.max_rows', None)

[nltk_data] Downloading package stopwords to /Users/chrka/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## A brief overview of the data

> Feel free to skip this section and refer back to it only if you have questions about the format of the data.

The data for the events are contained in the data frames `events_train` (to be used when training our tagger) and `events_test` (for final evaluation of the tagger).  The structure of the data frames is as follows:

| Field       | Description                               |
| ----------- | ----------------------------------------- |
| id          | ID of the event                           |
| weekday     | Day of the week (0 = Monday, 6 = Sunday)  |
| time        | Time of day (or _NA_ if an all-day event) |
| title       | Event title                               |
| description | Event description (HTML)                  |

The text fields — `title` and `description` — are probably the most important for figuring out which tags should be suggested.

This is what the data for the first couple of events look like:

In [3]:
events_train.head()

Unnamed: 0,id,weekday,time,title,description
0,1487,5,21:00:00,Cantina och Harrys midsommardagen,Lust att hänga med? Vi är (än så länge) två ki...
3,1508,5,01:00:00,Minigolf - Malmö,Bra banor i lummig tervlig omgivning på Bullto...
7,1527,4,00:00:00,Per Gessle,uppträder kl. DD.DD i Mariebergsskogen. Nån so...
8,1545,0,00:00:00,Akut Flytthjälp!!,Behöver akut hjälp idag att flytta. Alla som h...
9,1547,3,19:00:00,Ölprovning -alkoholfri!,Ölprovarkväll på Malmös (Sveriges) första alko...


We have training data for about 7,000 events and will be testing on about 2,300 events:

In [4]:
events_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7020 entries, 0 to 9329
Data columns (total 5 columns):
id             7020 non-null int64
weekday        7020 non-null int64
time           6839 non-null object
title          7020 non-null object
description    7020 non-null object
dtypes: int64(2), object(3)
memory usage: 329.1+ KB


In [5]:
events_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2310 entries, 1 to 9297
Data columns (total 5 columns):
id             2310 non-null int64
weekday        2310 non-null int64
time           2267 non-null object
title          2310 non-null object
description    2310 non-null object
dtypes: int64(2), object(3)
memory usage: 108.3+ KB


The classifier will be trained to suggest labels from the following list of tags (shown with their respective counts in the training dataset):

In [6]:
tags_train_stats

Unnamed: 0,tag,count
1,mat,3213
2,musik,2379
32,fest,1840
12,fika,1797
28,teater,1741
17,kultur,1360
23,konsert,1095
5,dans,1068
30,promenad,890
9,film,874


The tags for the events in the training data are available in `tags_train` in matrix form with a row for each event, and where a `1` in the $n$-th column means that the $n$-th tag (in the order given by `top_tags`) was applied for that event.

This is what it looks like for the first five events:

In [7]:
tags_train[0:5]

array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        

In [8]:
top_tags

['gratis',
 'mat',
 'musik',
 'kurs',
 'casino',
 'dans',
 'musuem',
 'båt',
 'barn',
 'film',
 'språk',
 'bowling',
 'fika',
 'sport',
 'biljard',
 'bio',
 'opera',
 'kultur',
 'grilla',
 'kubb',
 'festival',
 'cykel',
 'picknick',
 'konsert',
 'pub',
 'frisbeegolf',
 'svamp',
 'bangolf',
 'teater',
 'afterwork',
 'promenad',
 'humor',
 'fest',
 'shopping',
 'resa',
 'sällskapsspel',
 'träna',
 'pubquiz',
 'poker',
 'bok',
 'foto',
 'hund',
 'skridskor',
 'dart',
 'bada',
 'diskussion',
 'badminton',
 'pyssel',
 'golf',
 'loppis',
 'boule',
 'yoga',
 'innebandy',
 'högtid',
 'fiske',
 'beachvolleyboll',
 'friluftsliv',
 'volleyboll',
 'geocaching',
 'vindsurfing',
 'SUP',
 'standup']

> **TODO**: Add matrix_to_tags to make this simpler

In [9]:
import numpy as np
np.array(top_tags)[(tags_train[0:1] > 0).squeeze()]

array(['mat', 'fika', 'fest'], dtype='<U15')
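
The TODO above asks for a `matrix_to_tags` helper. A minimal sketch of what it might look like follows; the name comes from the TODO, but the signature and behaviour are assumptions, and the function is not part of the `tagger` package.

```python
def matrix_to_tags(matrix, tags):
    """Convert rows of 0/1 indicators into lists of tag names.

    Hypothetical helper (not part of `tagger`): column n of `matrix`
    is assumed to correspond to `tags[n]`, as with `top_tags`.
    """
    return [
        [tag for tag, flag in zip(tags, row) if flag > 0]
        for row in matrix
    ]


# Toy data standing in for `top_tags` and `tags_train`.
example_tags = ['gratis', 'mat', 'musik', 'fika']
example_matrix = [
    [0, 1, 0, 1],  # tagged with mat and fika
    [0, 0, 0, 0],  # no tags at all
    [1, 0, 1, 0],  # tagged with gratis and musik
]

print(matrix_to_tags(example_matrix, example_tags))
```

With the notebook's data, `matrix_to_tags(tags_train[0:5], top_tags)` should work the same way, since a NumPy matrix can be iterated row by row.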

## Pipeline

Now we get to the part where we actually put together our classifier.

The classifier consists of three different parts:

* First, the **preprocessor** which extracts text data from the event data and turns it into tokens, to be used in the next step:
* the **feature extractor** which in turn converts the tokens into numerical data, suitable for use in
* a **Machine Learning algorithm** which learns what tags might be suitable for which events.

The pipeline is built from a sequence of **transformers**, each of which does something to the data before passing it on to the next.  At the end of the pipeline, we put our **ML algorithm**.

We define the pipeline as a list of tuples, each consisting of a **name** and the **transformer** or **algorithm** for that step.

To have something to compare your classifier against, we have also provided a baseline classifier.

> **NB.** Take care that the output of each step matches the input of the next step!

### Preprocessing

The first task of the pipeline is to extract the text from the event data, and turn it into a series of tokens.

The following transformers are suitable for this:

`ExtractText(columns=['description'])`: (_Data frame to HTML_) Extracts text fields from the event data and joins them together. By default it only takes the descriptions, but by specifying the `columns` argument you can add other fields as well, e.g., `columns=['title', 'description']` to join the titles and descriptions together.

`HTMLToText()`: (_HTML to string_) The descriptions are HTML formatted, so we need a way to convert them into raw text without any formatting data.

`CharacterSet(punctuation=True, digits=False)`: (_String to string_) Keeps alphabetic characters (Swedish) and collapses runs of whitespace into a single space.  Optionally keeps digits and punctuation.  (Digits have already been removed from this particular dataset, however.)

`Lowercase()`: (_String to string_) Converts all alphabetic characters into their lowercase equivalents.

`Tokenize(method='word_punct')`: (_String to token list_) Splits strings into lists of tokens.  If `method` is `whitespace`, the strings are split on whitespace only; if it is `word_punct` (the default), they are split on punctuation marks as well.
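
How `Tokenize` is implemented is not shown here. The following sketch only illustrates the difference between the two methods using the standard library; the function name `sketch_tokenize` and the regular expression are assumptions, not the `tagger` implementation.

```python
import re


def sketch_tokenize(text, method='word_punct'):
    """Illustrative tokenizer, not the one from `tagger`."""
    if method == 'whitespace':
        # Split on runs of whitespace only.
        return text.split()
    if method == 'word_punct':
        # Runs of word characters and runs of punctuation become
        # separate tokens.
        return re.findall(r'\w+|[^\w\s]+', text)
    raise ValueError(f'unknown method: {method}')


title = 'Ölprovning -alkoholfri!'
print(sketch_tokenize(title, 'whitespace'))  # ['Ölprovning', '-alkoholfri!']
print(sketch_tokenize(title, 'word_punct'))  # ['Ölprovning', '-', 'alkoholfri', '!']
```

Note how `whitespace` leaves the punctuation attached to the word, which matters if no earlier step has removed punctuation.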



`Stopwords()`: (_Token list to token list_) Removes stop words (the most common words in the Swedish language).

`Stemming()`: (_Token list to token list_) Converts tokens into their stems.

`NGram(n_min, n_max=None)`: (_Token list to token list_) Creates all $n$-grams from $n_{\mathrm{min}}$-grams up to $n_{\mathrm{max}}$-grams. (If $n_{\mathrm{max}}$ is not given, only $n_{\mathrm{min}}$-grams are created.)
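
To make the $n$-gram step concrete, here is a small sketch. The function name `sketch_ngrams` is hypothetical, and joining the tokens of an $n$-gram with a space is an assumption; the `NGram` transformer may represent them differently.

```python
def sketch_ngrams(tokens, n_min, n_max=None):
    """Return all n-grams for n_min <= n <= n_max as joined strings."""
    if n_max is None:
        n_max = n_min
    result = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n over the token list.
        for start in range(len(tokens) - n + 1):
            result.append(' '.join(tokens[start:start + n]))
    return result


tokens = ['bra', 'banor', 'i']
print(sketch_ngrams(tokens, 1, 2))
# ['bra', 'banor', 'i', 'bra banor', 'banor i']
```

With `n_min=1` the original tokens are kept alongside the longer $n$-grams, so the vocabulary grows quickly as $n_{\mathrm{max}}$ increases.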

> Of these steps, `ExtractText()`, `HTMLToText()`, and `Tokenize()` are most likely necessary to include in the pipeline, but do try to experiment a little with the other ones as well.

The baseline model extracts the text, converts it from HTML into raw text, removes any non-alphabetic characters — it even removes punctuation — and breaks the text into tokens after converting everything into lowercase.

We'll assemble each step of the classifier into a separate [scikit-learn](https://scikit-learn.org/) pipeline so that we can try them out separately if we want to.

> The details of the pipelines are not terribly important right now, but it is useful to know that we can `fit` them to data.  Once fitted, transformers can `transform` data and classifiers can `predict`.  As a shortcut, `fit_transform` fits and transforms the same data in a single step.
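
To illustrate how data flows through a list of `(name, step)` tuples, here is a toy stand-in written in plain Python. All class names are hypothetical, and scikit-learn's real `Pipeline` does considerably more (parameter handling, a final estimator with `predict`, and so on).

```python
class ToyLowercase:
    """Toy transformer: lowercases every string."""

    def fit(self, data, targets=None):
        return self  # nothing to learn

    def transform(self, data):
        return [text.lower() for text in data]


class ToyTokenize:
    """Toy transformer: splits every string on whitespace."""

    def fit(self, data, targets=None):
        return self  # nothing to learn

    def transform(self, data):
        return [text.split() for text in data]


class ToyPipeline:
    """Simplified pipeline consisting of transformers only."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, transformer) tuples

    def fit_transform(self, data, targets=None):
        for _name, step in self.steps:
            # Each step is fitted on the output of the previous one,
            # so the output and input formats must line up.
            data = step.fit(data, targets).transform(data)
        return data


toy = ToyPipeline([
    ('lower', ToyLowercase()),
    ('token', ToyTokenize()),
])
print(toy.fit_transform(['Fika i Malmö', 'Per Gessle']))
# [['fika', 'i', 'malmö'], ['per', 'gessle']]
```

Swapping the two steps would fail, because `ToyLowercase` expects strings and would receive token lists instead.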

In [10]:
baseline_preprocessing = Pipeline([
    ('fields', ExtractText()),
    ('html', HTMLToText()),
    ('cset', CharacterSet(punctuation=False)),
    ('lower', Lowercase()),
    ('token', Tokenize())
])

We can take a quick peek at what the baseline's preprocessing steps do (and we see that it does what we'd expect):

In [11]:
baseline_preprocessed = baseline_preprocessing.fit_transform(events_train.head())
baseline_preprocessed

0    [lust, att, hänga, med, vi, är, än, så, länge,...
3    [bra, banor, i, lummig, tervlig, omgivning, på...
7    [uppträder, kl, dd, dd, i, mariebergsskogen, n...
8    [behöver, akut, hjälp, idag, att, flytta, alla...
9    [ölprovarkväll, på, malmös, sveriges, första, ...
Name: description, dtype: object

In [12]:
my_preprocessing = Pipeline([
    ('fields', ExtractText()),
    ('html', HTMLToText()),
    # YOUR STEPS HERE
    ('cset', CharacterSet(punctuation=True)), # REMOVE
    ('lower', Lowercase()), # REMOVE
    ('token', Tokenize()),
    # YOUR STEPS HERE
    ('ngram', NGram(1, 2)) # REMOVE
])

### Feature Extraction

In order to convert the tokens into numerical data suitable for a machine learning algorithm, you could try one of the following common methods.

`BagOfWords(binary=False)`: (_List of tokens to sparse vector_) Counts the occurrences of each word in the list of tokens, and creates a vector out of them.  If the argument `binary` is set to `True`, then it only cares whether a token occurs or not (i.e., it gives it a count of either 1 or 0).

> That the result is a _sparse vector_ means that only the non-zero entries are stored.  When zeros are stored as well, it is called a _dense vector_.  Since some implementations of ML algorithms do not work well with sparse vectors, we have also provided a `SparseToDense()` transformer below.
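
The following sketch shows the idea behind a bag of words and the difference between sparse and dense vectors. It uses dictionaries as a stand-in for sparse vectors; the function names are hypothetical and this is not how `BagOfWords` or `SparseToDense` are implemented.

```python
from collections import Counter


def sketch_bag_of_words(token_lists, binary=False):
    """Return (vocabulary, sparse_rows) for a list of token lists."""
    # "Fitting": give every distinct token its own column index.
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))

    # "Transforming": store only the non-zero counts of each row.
    sparse_rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        sparse_rows.append({
            vocabulary[token]: (1 if binary else count)
            for token, count in counts.items()
        })
    return vocabulary, sparse_rows


def sketch_to_dense(sparse_row, size):
    """Fill in the zeros that the sparse representation leaves out."""
    return [sparse_row.get(column, 0) for column in range(size)]


docs = [['fika', 'och', 'fika'], ['bio', 'och', 'mat']]
vocabulary, rows = sketch_bag_of_words(docs)
print(vocabulary)  # {'fika': 0, 'och': 1, 'bio': 2, 'mat': 3}
print(rows[0])     # {0: 2, 1: 1}
print(sketch_to_dense(rows[0], len(vocabulary)))  # [2, 1, 0, 0]
```

With thousands of distinct tokens and short descriptions, almost every entry of a dense row is zero, which is why the sparse form is the default.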

`Tfidf()`: (_List of tokens to sparse vector_) Similarly to `BagOfWords`, it counts the occurrences of each token, but it creates a vector of each token's _term frequency_ (how often the token occurs in the event description) multiplied by its _inverse document frequency_ (one divided by the number of descriptions the token occurs in).  The intuition is that the more often a token occurs in a description, the more likely it is to be representative of that event, whereas a token that occurs in many, many events is probably not specific to the event.
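
As a worked example, the sketch below follows the simplified definition given above literally (count in the description divided by the number of descriptions containing the token). The function name is hypothetical. Practical implementations, such as scikit-learn's `TfidfVectorizer`, typically use a logarithmically scaled inverse document frequency and normalise the vectors; which variant `Tfidf()` uses is not stated here.

```python
from collections import Counter


def sketch_tfidf(token_lists):
    """Weight each token by count / number of documents containing it."""
    # Document frequency: in how many token lists does each token occur?
    document_frequency = Counter()
    for tokens in token_lists:
        document_frequency.update(set(tokens))

    return [
        {
            token: count / document_frequency[token]
            for token, count in Counter(tokens).items()
        }
        for tokens in token_lists
    ]


docs = [
    ['fika', 'och', 'fika'],
    ['bio', 'och', 'mat'],
    ['mat', 'och', 'dans'],
]
weights = sketch_tfidf(docs)
print(weights[0])  # 'fika' gets weight 2.0, 'och' only about 0.33
```

The token `och` occurs in every description and is weighted down, while `fika` occurs twice in a single description and is weighted up.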

`SumWordBedding()`, `MeanWordBedding()`: (_List of tokens to sparse vector_) A different way of converting words to vectors is to use what is known as a _word embedding_. Each word is converted to a vector in such a way that words occurring in similar contexts get vectors that are near each other.  The simplest way of using these is to add together the vectors for all words (or to take their mean).

> **NB.** We have precomputed word vectors from the event dataset based on regular words only (no punctuation, lowercase only), so they will probably not work very well with n-grams etc.
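
The sketch below shows what summing and averaging word vectors amounts to. The two-dimensional vectors are made up for illustration, the function names are hypothetical, and skipping tokens without a vector is an assumption about how unknown words are handled.

```python
def sketch_sum_embedding(tokens, vectors, dim):
    """Add up the vectors of all tokens that have one."""
    total = [0.0] * dim
    for token in tokens:
        vector = vectors.get(token)
        if vector is None:
            continue  # assumption: unknown tokens are ignored
        total = [a + b for a, b in zip(total, vector)]
    return total


def sketch_mean_embedding(tokens, vectors, dim):
    """Average the vectors of all tokens that have one."""
    known = [token for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    total = sketch_sum_embedding(known, vectors, dim)
    return [value / len(known) for value in total]


# Made-up toy vectors; real embeddings have many more dimensions.
toy_vectors = {
    'fika': [1.0, 0.0],
    'kaffe': [0.5, 0.5],
    'bio': [0.0, 2.0],
}
tokens = ['fika', 'kaffe', 'okänt']
print(sketch_sum_embedding(tokens, toy_vectors, 2))   # [1.5, 0.5]
print(sketch_mean_embedding(tokens, toy_vectors, 2))  # [0.75, 0.25]
```

The sum grows with the length of the description, whereas the mean does not, which is one reason to try both.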

Some other transformers that can be useful for dealing with feature vectors:

`SparseToDense()`: (_Sparse vector to dense vector_) Converts a sparse vector to a dense vector with the same contents.  Necessary, for example, for classifiers using the `MultiLayerPerceptron()`.

`MaxAbsScaler()`: (_Sparse/dense vector to sparse/dense vector_) Scales the elements of a vector to be in the range $[-1, 1]$ such that the absolute maximum value of each column (over all training samples) is 1.  This can increase the performance of certain ML algorithms, such as `LogisticRegression()`.
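
A small sketch of the scaling described above, using plain lists. The function name is hypothetical, and leaving all-zero columns unchanged is an assumption made here to avoid dividing by zero.

```python
def sketch_max_abs_scale(train_rows):
    """Scale each column so that its largest absolute value becomes 1."""
    n_columns = len(train_rows[0])
    # One scale factor per column, computed over all training rows.
    scales = [
        max(abs(row[column]) for row in train_rows) or 1.0
        for column in range(n_columns)
    ]
    scaled = [
        [value / scale for value, scale in zip(row, scales)]
        for row in train_rows
    ]
    return scaled, scales


rows = [[2, 0, -4], [1, 0, 8]]
scaled, scales = sketch_max_abs_scale(rows)
print(scaled)  # [[1.0, 0.0, -0.5], [0.5, 0.0, 1.0]]
print(scales)  # [2, 1.0, 8]
```

New data would be divided by the same `scales` that were computed on the training data, so its values can fall slightly outside $[-1, 1]$.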

The baseline model uses bag of words, plain and simple.

In [13]:
baseline_feature_extraction = Pipeline([
    ('bow', BagOfWords())
])

We can once again take a quick look at what the baseline does:

In [14]:
baseline_feature_extraction.fit_transform(baseline_preprocessed)

<5x135 sparse matrix of type '<class 'numpy.int64'>'
	with 170 stored elements in Compressed Sparse Row format>

And we see that it results in a sparse matrix with 135 columns, one per distinct token (since we only look at the first five events in this example, the number of distinct tokens is not that large).

In [15]:
my_feature_extraction = Pipeline([
    # YOUR STEPS HERE
    ('bow', BagOfWords()) # REMOVE
])

### Algorithm

Plenty of algorithms could be used, but here are a couple of suggestions:

`NaiveBayes()`: (_(Sparse) vector to prediction_) Naïve Bayes.

`LogisticRegression()`: (_(Sparse) vector to prediction_) Logistic regression.

`MultiLayerPerceptron(layers, epochs=16, batch_size=64)`: (_Vector to prediction_) Multi-layer perceptron with the specified layers, e.g., `layers=[256, 256]`.  It needs dense input, so put `SparseToDense()` before it.

The baseline uses Naïve Bayes.

In [16]:
baseline_algorithm = Pipeline([
    ('nb', NaiveBayes())
])

In [17]:
my_algorithm = Pipeline([
    # YOUR STEPS HERE
    ('lr', LogisticRegression()) # REMOVE
])

## Assembling the pipeline

We'll now put together the preprocessing, feature extraction, and algorithm steps into a single pipeline.

In [18]:
baseline_classifier = Pipeline([
    ('pre', baseline_preprocessing), 
    ('feat', baseline_feature_extraction), 
    ('algo', baseline_algorithm)
])

In [19]:
my_classifier =  Pipeline([
    ('pre', my_preprocessing), 
    ('feat', my_feature_extraction), 
    ('algo', my_algorithm)
])

## Evaluation

You will now try out your model by training it on a subset of the training data only and evaluating its performance on the remainder.  This should give you some idea of how well it will perform on unseen data.
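
How `evaluate_classifier` splits the data is not shown here. A hold-out split of this kind can be sketched as follows; the function name, the 20% validation fraction, and the fixed seed are all assumptions for illustration.

```python
import random


def sketch_holdout_split(items, labels, validation_fraction=0.2, seed=0):
    """Shuffle the indices and set a fraction aside for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    n_validation = int(len(indices) * validation_fraction)
    validation_idx = indices[:n_validation]
    train_idx = indices[n_validation:]

    return (
        [items[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [items[i] for i in validation_idx],
        [labels[i] for i in validation_idx],
    )


# Toy data: ten "events" with one label each.
toy_events = [f'event {i}' for i in range(10)]
toy_labels = list(range(10))
train_x, train_y, val_x, val_y = sketch_holdout_split(toy_events, toy_labels)
print(len(train_x), len(val_x))  # 8 2
```

The model is fitted on the training part only, so its scores on the validation part come from events it has not seen.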

The two main metrics the model will be evaluated on are:

* **Hamming loss**: The fraction of tags that either are suggested when they shouldn't be, or aren't suggested when they should. (It ranges from 0–1, lower is better.)
* **Exact match ratio**: The fraction of events that have been classified completely correctly, that is, precisely those labels that should be suggested for the event have been suggested, and no others. (Also ranges from 0–1, but here higher is better.)
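
The two metrics above can be computed directly from the 0/1 tag matrices. The sketch below uses plain lists and hypothetical function names; it illustrates the definitions and is not the code used by `evaluate_classifier`.

```python
def sketch_hamming_loss(true_rows, predicted_rows):
    """Fraction of individual tag decisions that are wrong."""
    wrong = 0
    total = 0
    for true, predicted in zip(true_rows, predicted_rows):
        for t, p in zip(true, predicted):
            wrong += int(t != p)
            total += 1
    return wrong / total


def sketch_exact_match_ratio(true_rows, predicted_rows):
    """Fraction of events where every single tag decision is right."""
    matches = sum(
        1
        for true, predicted in zip(true_rows, predicted_rows)
        if list(true) == list(predicted)
    )
    return matches / len(true_rows)


# Four events, three possible tags.
y_true = [[1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 0]]

print(sketch_hamming_loss(y_true, y_pred))       # 2 wrong out of 12
print(sketch_exact_match_ratio(y_true, y_pred))  # 0.5
```

Since most events have only a few tags, most entries in the matrix are zero, so the Hamming loss can be low even when few events are matched exactly.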

It can also be interesting to take a look at how well the classifier works with individual labels, so **accuracy**, **precision**, **recall**, and **$F_1$-scores** (harmonic mean of precision and recall) are reported for the individual tags as well.
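
The per-tag scores can be sketched for a single tag column as follows. The function name is hypothetical, and returning 0.0 when a score is undefined (for example, precision when the tag is never predicted) is an assumed convention.

```python
def sketch_tag_scores(true_column, predicted_column):
    """Return (accuracy, precision, recall, f1) for one tag."""
    pairs = list(zip(true_column, predicted_column))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# One tag over eight events: the tag applies to the first four,
# but is only suggested for the first two.
true_column = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_column = [1, 1, 0, 0, 0, 0, 0, 0]
print(sketch_tag_scores(true_column, predicted_column))
# accuracy 0.75, precision 1.0, recall 0.5, f1 about 0.667
```

Accuracy alone can be misleading for rare tags: a classifier that never suggests a tag used by 1% of the events is 99% accurate for that tag but has zero recall.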

In [20]:
%%time
evaluate_classifier(baseline_classifier, top_tags, events_train, tags_train)

Hamming loss for model: 0.04214915908464296
Exact match ratio for model: 0.0754985754985755
CPU times: user 3.85 s, sys: 89.8 ms, total: 3.94 s
Wall time: 3.97 s




Unnamed: 0,tag,accuracy,precision,recall,f1
42,skridskor,0.998575,1.0,0.888889,0.941176
52,innebandy,0.997863,0.818182,0.9,0.857143
16,opera,0.977208,0.761194,0.761194,0.761194
1,mat,0.717236,0.687211,0.696875,0.692009
2,musik,0.81339,0.8,0.59322,0.681265
9,film,0.924501,0.849558,0.518919,0.644295
28,teater,0.838319,0.767932,0.514124,0.615905
15,bio,0.94943,0.888889,0.466667,0.612022
30,promenad,0.923789,0.861702,0.462857,0.60223
36,träna,0.991453,0.818182,0.473684,0.6


In [21]:
%%time
evaluate_classifier(my_classifier, top_tags, events_train, tags_train)



Hamming loss for model: 0.03676132708390773
Exact match ratio for model: 0.1794871794871795
CPU times: user 8min 55s, sys: 2min, total: 10min 56s
Wall time: 1min 51s




Unnamed: 0,tag,accuracy,precision,recall,f1
57,volleyboll,1.0,1.0,1.0,1.0
42,skridskor,0.999288,1.0,0.944444,0.971429
52,innebandy,0.997863,0.818182,0.9,0.857143
33,shopping,0.997863,1.0,0.727273,0.842105
46,badminton,0.989316,1.0,0.634146,0.776119
16,opera,0.980769,0.954545,0.626866,0.756757
1,mat,0.742877,0.75226,0.65,0.697402
2,musik,0.82906,0.884106,0.565678,0.689922
15,bio,0.959402,1.0,0.525,0.688525
51,yoga,0.992877,0.833333,0.555556,0.666667


## Submission

Once you have found a sequence of steps you think performs well, it's time to train it on the full training set and submit its predictions on the (secret!) test set.

In [22]:
%%time
my_classifier.fit(events_train, tags_train)

submit_model(my_classifier, 
             team_name="<INSERT TEAM NAME HERE>",
             model_name="<INSERT MODEL NAME HERE>",
             local_events=events_test,
             local_tags=tags_test)



Team '<INSERT TEAM NAME HERE>' submitting model '<INSERT MODEL NAME HERE>':
Pipeline(memory=None,
     steps=[('pre', Pipeline(memory=None,
     steps=[('fields', ExtractText(add_time_of_day=False, columns=['description'])), ('html', HTMLToText()), ('cset', CharacterSet(digits=False, punctuation=True)), ('lower', Lowercase()), ('token', Tokenize(method='word_punct')), ('ngram', NGram(n_max=2, n_min=1))])), ('feat', Pipeline(memory=None, steps=[('bow', BagOfWords(binary=False))])), ('algo', Pipeline(memory=None, steps=[('lr', LogisticRegression())]))])
------------------------------------------------------------------------
Hamming loss for submission: 0.03599357631615696
Exact match ratio for submission: 0.1484848484848485
CPU times: user 11min 37s, sys: 2min 56s, total: 14min 33s
Wall time: 2min 29s
