# Tagging Stack-Overflow Questions

**The data**
* Python questions from Stack Overflow: [https://www.kaggle.com/stackoverflow/pythonquestions](https://www.kaggle.com/stackoverflow/pythonquestions)
* ~ 600000 questions
* each question with 0-5 tags

**The problem**

Can we predict tags from question / title texts? If so, how well?

**Approach**

Create several models and compare performances:
* Bag-of-words model
* sequential LSTM model for question bodies
* *composite LSTM model for question bodies + titles*   <== this notebook

In [None]:
%load_ext autoreload
%autoreload 2

import pandas as pd
from nltk.tokenize import word_tokenize
import itertools

from tensorflow.keras.preprocessing import sequence
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

import time

import numpy as np
import nltk
nltk.download('punkt')

### Configuration

In [None]:
data_path = "../data/pythonquestions/"
ft_path = "alldata.ft"  # set this to None if you want to train your own fasttext embeddings
n_top_labels = 100  # number of top labels to reduce dataset to
max_question_words = 100
sample_size = 1000  # set to -1 to use entire dataset
normalize_embeddings = True  # whether to normalize fasttext embeddings between -1, +1

use_titles = False  # assumed default: display question bodies (not titles) during evaluation
tokenized_field = "q_title_tokenized" if use_titles else "q_body_tokenized"
content_field = "Title" if use_titles else "Body_q"

### Load data

In [None]:
from toolbox.data_prep_helpers import load_data

df = load_data("presentation_sample.pkl", ignore_cache=False)

## Preprocessing

### Slim down number of tags

We remove all tags that are not among the `n_top_labels` most frequent tags of the dataset. Afterwards, we remove any row that has no tags left.

In [None]:
from toolbox.data_prep_helpers import reduce_number_of_tags

df = reduce_number_of_tags(df, n_top_labels)
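
The tag reduction described above can be sketched as follows. This is a minimal illustration, not the code behind `reduce_number_of_tags`: the helper name `keep_top_tags`, the toy data, and the assumption that each row stores its tags as a list in a `tags` column are all hypothetical.

```python
from collections import Counter

import pandas as pd


def keep_top_tags(frame: pd.DataFrame, n_top: int, tag_col: str = "tags") -> pd.DataFrame:
    """Keep only the n_top most frequent tags and drop rows left without tags."""
    # count how often each tag occurs across all questions
    counts = Counter(tag for tags in frame[tag_col] for tag in tags)
    top = {tag for tag, _ in counts.most_common(n_top)}
    out = frame.copy()
    # strip every tag that is not among the most frequent ones
    out[tag_col] = out[tag_col].apply(lambda tags: [t for t in tags if t in top])
    # drop questions that lost all of their tags
    return out[out[tag_col].apply(len) > 0]


# hypothetical toy data
toy = pd.DataFrame({"tags": [["python", "pandas"], ["python", "django"], ["regex"], ["pandas"]]})
reduced = keep_top_tags(toy, n_top=2)
```

In the toy frame only `python` and `pandas` survive, and the `regex` question is dropped because it has no tags left.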

### Remove HTML Formatting

In [None]:
df["Body_q"].iloc[100000]

In [None]:
from toolbox.data_prep_helpers import remove_html_tags

# question bodies are stored as html code, we need to extract the content only
remove_html_tags(df, ["Body_q"])

In [None]:
df["Body_q"].iloc[100000]
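
To illustrate what extracting the content from HTML means, here is a small sketch based on the standard library's `html.parser`. It is an assumption about the general technique, not the implementation of `remove_html_tags`; the names `_TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # called for text between tags; character references are already decoded
        self.parts.append(data)


def strip_html(markup: str) -> str:
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # join the text nodes and collapse runs of whitespace
    return " ".join("".join(parser.parts).split())


plain = strip_html("<p>How do I <code>print</code> a list?</p>")
```

Note that this sketch keeps the text inside `<code>` elements; whether code snippets should stay in the question text is a design choice.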

### Tokenization
We need to tokenize questions in order to be able to apply/train embeddings on them.

To do this, we use the word_tokenize function from the nltk library ([https://www.nltk.org/api/nltk.tokenize.html](https://www.nltk.org/api/nltk.tokenize.html)) to transform multiple sentences of a question to a 1-dimensional list of tokens. 

In [None]:
from toolbox.data_prep_helpers import generate_question_level_tokens

# tokenization example
generate_question_level_tokens("Please help! How do I format in markdown?")

In [None]:
df["q_body_tokenized"] = df["Body_q"].apply(generate_question_level_tokens)
df["q_title_tokenized"] = df["Title"].apply(generate_question_level_tokens)

### Remove samples with too many tokens

In [None]:
# remove questions that are empty or contain more than max_question_words tokens to meet memory limitations
df = df[df["q_body_tokenized"].apply(len).between(1, max_question_words)]

In [None]:
df.shape

### FastText word embeddings

We trained our own embeddings because words in code-related text often carry domain-specific meanings. "Pandas", for instance, refers to the Python library, so its meaning in this corpus is entirely different from its meaning in embeddings pretrained on general text.


Why FastText?

FastText includes reasonable mechanisms to deal with words for which no embedding exists.
It represents a word as a bag of character n-grams. For out-of-vocabulary words, it computes the embedding by combining the vectors of the word's n-grams. For training we tried both skip-gram and CBOW, where xy turned out to have a better performance.
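
The n-gram idea can be sketched in a few lines. This is a simplified illustration: real FastText hashes n-grams into a fixed number of buckets and learns the vectors during training, whereas the function names, the tiny lookup table, and the averaging below are assumptions made for the example.

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of all n-grams of `word` found in the lookup table."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


# hypothetical n-gram table with 2-dimensional vectors
table = {"<nu": np.array([1.0, 0.0]), "py>": np.array([0.0, 1.0])}
vec = oov_vector("numpy", table, dim=2)
```

Even if "numpy" never appeared as a whole word, it still receives a vector because it shares n-grams such as `<nu` and `py>` with words seen during training.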


In [None]:
# train word embeddings ONLY with training data
# wv = create_Word2Vec_embeddings(train_data, "Body_q")
# Use FastText to include solution for out-of-vocab words
if ft_path is not None:
    wv = load_fasttext_embeddings(ft_path)
else:
    wv = create_FastText_embedding(train_data, content_field)
wv.init_sims()

### Apply embeddings

In [None]:
X_t = df["q_title_tokenized"].apply(lambda x: np.array([wv.word_vec(w, use_norm=normalize_embeddings) for w in x]))
X_b = df["q_body_tokenized"].apply(lambda x: np.array([wv.word_vec(w, use_norm=normalize_embeddings) for w in x]))

### Transform data to model-compatible format

Pad question embeddings to equal length to unify tensor shapes

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# zero vector with the same dimensionality as one word embedding
padding_element = np.array([0.0] * X_t.iloc[0].shape[-1])

X_t = pad_sequences(X_t, padding="post", dtype='float32', value=padding_element)
X_b = pad_sequences(X_b, padding="post", dtype='float32', value=padding_element)
print(X_t.shape)
print(X_b.shape)
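
What post-padding does to a batch of embedded questions can be sketched with plain NumPy. The helper `pad_post` and the toy sequences are hypothetical and only mirror the effect of the `pad_sequences` call above with `padding="post"` and a zero padding vector.

```python
import numpy as np


def pad_post(seqs, dim):
    """Stack variable-length (length, dim) arrays into one (n, max_len, dim) tensor."""
    max_len = max(len(s) for s in seqs)
    out = np.zeros((len(seqs), max_len, dim), dtype="float32")
    for i, s in enumerate(seqs):
        if len(s):
            # real embeddings go first, zero vectors fill the remaining steps
            out[i, :len(s)] = s
    return out


# two toy "questions" with 2 and 5 tokens and 4-dimensional embeddings
seqs = [np.ones((2, 4)), np.ones((5, 4))]
batch = pad_post(seqs, dim=4)
```

The all-zero rows at the end of the shorter sequence are the steps that a masking layer can later skip.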

#### Target data
With the MultiLabelBinarizer we create a (sample x label) matrix where for each record a 1 represents the presence of a certain label and a 0 its absence. (Similar to one-hot-encoding for single class problems) 

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

label_encoder = MultiLabelBinarizer()
label_encoder.fit(df["tags"])
y = label_encoder.transform(df["tags"])
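
A toy example may make the resulting matrix easier to picture. The tag lists below are made up; the calls are the regular scikit-learn `MultiLabelBinarizer` API.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_tags = [["python", "pandas"], ["django"], ["python"]]

mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(toy_tags)

# columns follow the sorted label order stored in mlb.classes_
columns = list(mlb.classes_)

# inverse_transform maps rows of 0/1 values back to tuples of labels
decoded = mlb.inverse_transform(y_toy)
```

Each row of `y_toy` can contain several ones, which is the difference from one-hot encoding.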

## Training

Our title/body model takes title and body token sequences as separate inputs. Each input is passed through a masking layer, which allows the following layers (i.e. the LSTM layers) to skip padding elements in the sequence. The masked inputs are processed by two separate LSTM layers, whose last output vectors are concatenated to form one large context vector. This context vector is then passed through a fully connected layer with a sigmoid activation function, which assigns an "independent" probability to each output class.

The model is visualized in the diagram below. For this visual example, we go with the following properties:
* batch size: 32
* sequence length: 50
* embedding size: 100
* lstm size (each): 64

![model architecture](graphics/title_body_model.svg)
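
The tensor shapes after the two LSTM layers can be traced with a small NumPy sketch. It only illustrates the concatenation and the sigmoid output layer: random arrays stand in for the LSTM outputs and for trained weights, and the label count of 100 is an assumption taken from `n_top_labels`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, lstm_size, n_labels = 32, 64, 100

# stand-ins for the last output vector of the title LSTM and the body LSTM
title_ctx = rng.normal(size=(batch, lstm_size))
body_ctx = rng.normal(size=(batch, lstm_size))

# concatenate both context vectors along the feature axis: (32, 128)
context = np.concatenate([title_ctx, body_ctx], axis=-1)

# fully connected layer with sigmoid activation: one probability per label
W = rng.normal(scale=0.1, size=(2 * lstm_size, n_labels))
b = np.zeros(n_labels)
probs = 1.0 / (1.0 + np.exp(-(context @ W + b)))  # (32, 100)
```

Unlike a softmax output, the values in one row of `probs` do not have to sum to 1, which is what allows several tags per question.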

### Conduct GridSearch to find "optimal" hyperparameters

In [None]:
from toolbox.training import grid_search_es
from models.title_body_lstm import create_model

search_params = {
    # "lstm_layer_size": [512, 256, 128],
    # "lstm_dropout": [0.0, 0.2, 0.4],
    "lstm_layer_size": [16],
    "lstm_dropout": [0.0],
    # don't change these:
    "output_dim": [y.shape[-1]]
}

all_hists = grid_search_es(create_model, search_params)

best_params, best_hist, best_loss = min(all_hists, key=lambda x: x[2])

epoch_lengths = [len(h["val_loss"]) for h in best_hist]
print(f"best combination: {best_params}")
print(f"avg min val_loss: {best_loss} -- epoch counts: {epoch_lengths}")
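
How a parameter dictionary like `search_params` expands into individual configurations can be sketched as follows. This is an assumption about the general grid-search technique; how `grid_search_es` enumerates and evaluates combinations internally is not shown here, and `expand_grid` is a hypothetical helper.

```python
import itertools


def expand_grid(params):
    """Return one dict per combination of the listed parameter values."""
    keys = list(params)
    value_lists = (params[k] for k in keys)
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


grid = expand_grid({
    "lstm_layer_size": [256, 128],
    "lstm_dropout": [0.0, 0.2],
    "output_dim": [100],
})
```

Two layer sizes times two dropout rates times one output dimension give four candidate configurations, each of which could be passed to a model factory as keyword arguments.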

### Create the model

In [None]:
from models.title_body_lstm import create_model

model = create_model(**best_params)
model.summary()

### Train it

In [None]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint
import datetime

model_name = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir="logs/fit/" + model_name

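callbacks = [
    # stop once val_loss has not improved for 10 epochs and roll back to the best weights
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True, verbose=0),
    TensorBoard(log_dir=log_dir, histogram_freq=1),
    # restore_best_weights is an EarlyStopping option; ModelCheckpoint uses save_best_only
    ModelCheckpoint(filepath=f"checkpoints/{model_name}", monitor="val_loss", save_best_only=True, verbose=0)
]

# split titles, bodies, labels and the source rows with the same shuffle so they stay aligned
(X_train_t_padded, X_test_t_padded,
 X_train_b_padded, X_test_b_padded,
 y_train, y_test,
 train_data, test_data) = train_test_split(X_t, X_b, y, df, test_size=0.2)

model.fit(x=[X_train_t_padded, X_train_b_padded], y=y_train, batch_size=128, epochs=100,
          validation_data=([X_test_t_padded, X_test_b_padded], y_test), callbacks=callbacks)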

## Model Evaluation

Let's have a look at some predictions.

In [None]:
n_predictions = 100

predictions = model.predict([X_test_t_padded, X_test_b_padded], batch_size=64)

l_pred = label_encoder.inverse_transform(binarize_model_output(predictions, threshold=0.10))
l_pred_out = l_pred[:n_predictions]
l_true = label_encoder.inverse_transform(y_test[:n_predictions])
texts = test_data[tokenized_field][:n_predictions]
raw_texts = test_data[content_field][:n_predictions]
titles = test_data["Title"][:n_predictions]

# use a separate loop variable for the title so the `titles` series is not overwritten
for pred, act, txt, raw_txt, title in zip(l_pred_out, l_true, texts, raw_texts, titles):
    print(f"TRUE: {act}\nPREDICTION: {pred}\n")
    print(f"{title}\n")
    print(f"{raw_txt}\n-------------------------")

### F1_Micro Score Optimization

F1_Score = 2 * (precision * recall) / (precision + recall)

For multi-label classification we use the micro-averaged F1 score, which counts true positives, false positives, and false negatives globally across all labels.
Because the output layer of our model uses a sigmoid activation, we get a value between 0 and 1 for every label. Hence we need a threshold to decide whether a label is predicted (=1). The threshold that maximizes the F1 micro score is calculated within the output_evaluation function.
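
The threshold search can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the code of `output_evaluation`: the function names, the candidate grid from 0.05 to 0.95, and the toy data are hypothetical.

```python
import numpy as np


def f1_micro(y_true, y_pred):
    """Micro-averaged F1: TP, FP and FN are counted over all samples and labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    # 2*TP / (2*TP + FP + FN) is equivalent to 2 * P * R / (P + R)
    return float(2 * tp / denom) if denom else 0.0


def best_threshold(y_true, probs, candidates=None):
    """Return the candidate threshold with the highest micro F1 and its score."""
    if candidates is None:
        candidates = np.arange(0.05, 1.0, 0.05)
    scores = [f1_micro(y_true, (probs >= t).astype(int)) for t in candidates]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return float(candidates[best]), scores[best]


# toy labels and sigmoid outputs for two samples and three labels
y_toy = np.array([[1, 0, 1], [0, 1, 0]])
p_toy = np.array([[0.92, 0.21, 0.43], [0.12, 0.81, 0.33]])
threshold, score = best_threshold(y_toy, p_toy)
```

In practice the threshold should be tuned on validation data rather than on the test set, otherwise the reported score is optimistic.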

In [None]:
output_evaluation(model, sample_size, max_question_words, n_top_labels, y_test, predictions, normalize_embeddings, None, None, n_epochs)