# Tagging Stack-Overflow Questions

**The data**
* Python questions from Stack Overflow: [https://www.kaggle.com/stackoverflow/pythonquestions](https://www.kaggle.com/stackoverflow/pythonquestions)
* ~600,000 questions
* 0-5 tags per question

**The problem**

Can we predict tags from question / title texts? If so, how well?

**Approach**

Create several models and compare performances:
* Bag-of-words model
* sequential LSTM model for question bodies
* *composite LSTM model for question bodies + titles*   <== this notebook

In [41]:
%load_ext autoreload
%autoreload 2

import pandas as pd
from nltk.tokenize import word_tokenize
import itertools

from models.lstm_classifier import create_model

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from tensorflow.keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint
import datetime
import time

import numpy as np
import nltk
nltk.download('punkt')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to /home/lukas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Configuration

In [44]:
data_path = "../data/pythonquestions/"
ft_path = "alldata.ft"  # set this to None if you want to train your own fasttext embeddings
n_top_labels = 100  # number of top labels to reduce dataset to
max_question_words = 100
sample_size = 1000  # set to -1 to use entire dataset
normalize_embeddings = True  # whether to normalize fasttext embeddings between -1, +1

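# NOTE: use_titles is not defined anywhere in the original cell; False is an assumption
# that selects the question-body fields used for the length filter below
use_titles = False

tokenized_field = "q_title_tokenized" if use_titles else "q_all_body_tokenized"
content_field = "Title" if use_titles else "Body_q"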

### Load data

In [None]:
from toolbox.data_prep_helpers import load_data

df = load_data("presentation_sample.pkl", ignore_cache=False)

## Preprocessing

### Slim down number of tags

We remove all tags that are not among the top `n_top_labels` tags of the dataset. Afterwards, we remove every row that has no tags left.

In [61]:
from toolbox.data_prep_helpers import reduce_number_of_tags

df = reduce_number_of_tags(df, n_top_labels)

deleting element python from top_tags


  dataframe["tags"] = dataframe["tags"].apply(lambda x: [tag for tag in x if tag in top_tags])


(426041, 8)
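
The tag reduction described above can be sketched in plain Python. This is only an illustration of the idea, not the code of `reduce_number_of_tags` from the toolbox: the helper name `keep_top_tags`, its `exclude` parameter and the toy rows are hypothetical. The `exclude` parameter mimics what the log line above suggests, namely that the `python` tag itself is dropped from the top tags.

```python
from collections import Counter

def keep_top_tags(tag_lists, n_top, exclude=()):
    """Keep only the n_top most frequent tags and drop rows left without tags."""
    counts = Counter(
        tag for tags in tag_lists for tag in tags if tag not in exclude
    )
    top_tags = {tag for tag, _ in counts.most_common(n_top)}
    reduced = [[tag for tag in tags if tag in top_tags] for tags in tag_lists]
    # rows whose tags were all removed carry no label information
    return [tags for tags in reduced if tags]

rows = [
    ["python", "pandas"],
    ["python", "django"],
    ["python", "pandas", "numpy"],
    ["python", "flask"],
]
print(keep_top_tags(rows, n_top=1, exclude=("python",)))
# [['pandas'], ['pandas']]
```

With `n_top=1`, only `pandas` survives, so the two rows that never carried it are removed entirely.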

### Remove HTML Formatting

In [55]:
df["Body_q"].iloc[100000]

'My question would be if there was any other way besides below to iterate through a file one character at a time?\nwith open(filename) as f:\n  while True:\n    c = f.read(1)\n    if not c:\n      print "End of file"\n      break\n    print "Read a character:", c\n\nSince there is not a function to check whether there is something to read like in Java, what other methods are there. Also, in the example, what would be in the variable c when it did reach the end of the file? Thanks for anyones help.\n'

In [None]:
from toolbox.data_prep_helpers import remove_html_tags

# question bodies are stored as html code, we need to extract the content only
remove_html_tags(df, ["Body_q"])

In [57]:
df["Body_q"].iloc[100000]

'My question would be if there was any other way besides below to iterate through a file one character at a time?\nwith open(filename) as f:\n  while True:\n    c = f.read(1)\n    if not c:\n      print "End of file"\n      break\n    print "Read a character:", c\n\nSince there is not a function to check whether there is something to read like in Java, what other methods are there. Also, in the example, what would be in the variable c when it did reach the end of the file? Thanks for anyones help.\n'

### Tokenization
We need to tokenize the questions before we can apply or train embeddings on them.

To do this, we use the `word_tokenize` function from the nltk library ([https://www.nltk.org/api/nltk.tokenize.html](https://www.nltk.org/api/nltk.tokenize.html)) to turn the sentences of a question into a single flat list of tokens.

In [60]:
# tokenization example
from toolbox.data_prep_helpers import generate_question_level_tokens

generate_question_level_tokens("Please help! How do I format in markdown?")

['please', 'help', '!', 'how', 'do', 'i', 'format', 'in', 'markdown', '?']

In [53]:
from toolbox.data_prep_helpers import generate_question_level_tokens

df["q_body_tokenized"] = df["Body_q"].apply(generate_question_level_tokens)
df["q_title_tokenized"] = df["Title"].apply(generate_question_level_tokens)


### Remove samples with too many tokens

In [None]:
# keep only questions with 1 to max_question_words (= 100) tokens
df = df[df["q_body_tokenized"].apply(len).between(1, max_question_words)]

In [62]:
df.shape

(426041, 8)

### FastText word embeddings

We trained our own embeddings because many key words mean something different in code-related text than in everyday language. "Pandas", for instance, refers to the Python library here, so its meaning in Python questions is entirely different from its meaning in pretrained embeddings.


Why FastText?

FastText has a reasonable mechanism for dealing with words for which no embedding exists.
It represents a word as a bag of character n-grams. For out-of-vocabulary words, it computes an embedding by combining the vectors of the word's n-grams. For training we tried both skip-gram and CBOW, where xy turned out to perform better.
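
The n-gram idea can be illustrated with a few lines of plain Python. This is a simplified sketch, not the FastText implementation: the function names and the toy `ngram_vectors` dictionary are hypothetical, the n-gram range of 3 to 6 is an assumed setting, and real FastText hashes n-grams into a fixed number of buckets instead of looking them up in a dictionary.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with < and > marking its boundaries."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of all known n-grams of a word (zeros if none are known)."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

print(char_ngrams("pandas", n_min=3, n_max=3))
# ['<pa', 'pan', 'and', 'nda', 'das', 'as>']

# toy n-gram vectors (hypothetical values, embedding size 2)
ngram_vectors = {"<pa": [1.0, 0.0], "as>": [0.0, 1.0]}
print(oov_vector("pandas", ngram_vectors, dim=2))
# [0.5, 0.5]
```

Because an unseen word shares n-grams with known words, it still receives a meaningful vector instead of a generic "unknown" token.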


In [14]:
# train word embeddings ONLY with training data
# wv = create_Word2Vec_embeddings(train_data, "Body_q")
# Use FastText to include solution for out-of-vocab words
if ft_path is not None:
    wv = load_fasttext_embeddings(ft_path)
else:
    wv = create_FastText_embedding(train_data, content_field)
wv.init_sims()

### Prepare Training and Test data

Remove questions that contain more than 100 (= `max_question_words`) words to stay within memory limits. Then split the data into a training set and a test set.

In [13]:
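# Keep only questions with at most max_question_words tokens, then split 80/20 into train and test
# (`sample` is not defined in this notebook export; it is assumed to be the prepared DataFrame,
#  optionally subsampled to sample_size rows)
data = sample[sample[tokenized_field].apply(len) <= max_question_words]
train_data, test_data = train_test_split(data, test_size=0.2)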
print(train_data.shape)
print(test_data.shape)

(1300, 5)
(326, 5)


#### Train Data

Generate train data with word embeddings for title and question body. Equalize their shape with a padding element.

In [42]:
X_train_t = train_data["q_title_tokenized"].apply(lambda x: np.array([wv.word_vec(w, use_norm=normalize_embeddings) for w in x]))
X_train_b = train_data["q_all_body_tokenized"].apply(lambda x: np.array([wv.word_vec(w, use_norm=normalize_embeddings) for w in x]))


padding_element = np.array([0.0] * X_train_t.iloc[0].shape[-1])

X_train_t_padded = pad_sequences(X_train_t, padding="post", dtype='float32', value=padding_element)
X_train_b_padded = pad_sequences(X_train_b, padding="post", dtype='float32', value=padding_element)
print(X_train_t_padded.shape)
print(X_train_b_padded.shape)

(78101, 39, 100)
(78101, 100, 100)
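
What the padding step does can be shown on a toy example. The sketch below uses nested Python lists instead of the NumPy arrays and `pad_sequences` call from the cell above; the helper `pad_post` and the tiny embedding size of 2 are hypothetical and only serve the illustration.

```python
def pad_post(sequences, dim):
    """Append zero vectors to each sequence until all have the same length."""
    max_len = max(len(seq) for seq in sequences)
    return [
        [list(vec) for vec in seq] + [[0.0] * dim for _ in range(max_len - len(seq))]
        for seq in sequences
    ]

short = [[0.1, 0.2], [0.3, 0.4]]              # question with 2 words
long = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]   # question with 3 words

padded = pad_post([short, long], dim=2)
print(padded[0])
# [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```

After padding, every question in a batch has the same number of time steps, which is what allows the data to be stored in a single 3-dimensional array of shape (samples, words, embedding size).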


#### Test Data

With the `MultiLabelBinarizer` we create a (samples x labels) matrix in which, for each record, a 1 marks the presence of a label and a 0 its absence. This is similar to one-hot encoding for single-label problems, except that a row can contain several 1s.

In [45]:
label_encoder = MultiLabelBinarizer()
label_encoder.fit(train_data["tags"])
y_train = label_encoder.transform(train_data["tags"])
y_test = label_encoder.transform(test_data["tags"])
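
The resulting label matrix can be reproduced by hand on a toy example. This is a minimal pure-Python sketch of the encoding, not the scikit-learn implementation; the helper names `fit_classes` and `multi_hot` are hypothetical. As in the cell above, the classes are learned from the training tags only.

```python
def fit_classes(tag_lists):
    """Collect all distinct tags in sorted order (one matrix column per tag)."""
    return sorted({tag for tags in tag_lists for tag in tags})

def multi_hot(tag_lists, classes):
    """Encode each tag list as a row of 0s and 1s over the known classes."""
    index = {tag: i for i, tag in enumerate(classes)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            if tag in index:  # tags not seen during fitting are ignored
                row[index[tag]] = 1
        rows.append(row)
    return rows

train_tags = [["pandas", "numpy"], ["django"]]
classes = fit_classes(train_tags)
print(classes)
# ['django', 'numpy', 'pandas']
print(multi_hot([["numpy"], ["pandas", "flask"]], classes))
# [[0, 1, 0], [0, 0, 1]]
```

The second test row shows why fitting on the training data matters: `flask` never occurred during fitting, so it has no column and is silently lost.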

## Train with Title and Body

## Model Evaluation

Let's have a look at some predictions.

In [1]:
n_predictions = 100

predictions = model.predict([X_test_t_padded, X_test_b_padded], batch_size=64)

l_pred = label_encoder.inverse_transform(binarize_model_output(predictions, threshold=0.10))
l_pred_out = l_pred[:n_predictions]
l_true = label_encoder.inverse_transform(y_test[:n_predictions])
texts = test_data[tokenized_field][:n_predictions]
raw_texts = test_data[content_field][:n_predictions]
titles = test_data["Title"][:n_predictions]

# use a separate loop variable `title` so the `titles` series is not overwritten
for pred, act, txt, raw_txt, title in zip(l_pred_out, l_true, texts, raw_texts, titles):
    print(f"TRUE: {act}\nPREDICTION: {pred}\n")
    print(f"{title}\n")
    print(f"{raw_txt}\n-------------------------")


### F1_Micro Score Optimization

F1 score = 2 * (precision * recall) / (precision + recall)

For multi-label classification we use the micro-averaged F1 score, which counts true positives, false positives and false negatives globally across all labels and samples.
Because the model's output layer uses a sigmoid activation, we get a value between 0 and 1 for every label. We therefore need a threshold that decides whether a label counts as predicted (=1). The threshold that maximizes the micro F1 score is determined inside the `output_evaluation` function.
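
The threshold search can be sketched as follows. This is a simplified illustration and not the code of `output_evaluation`: the helper names, the `>=` comparison in `binarize` and the list of candidate thresholds are assumptions, and the toy probabilities are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro F1: count TP, FP and FN over all samples and labels at once."""
    pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def binarize(probabilities, threshold):
    """Turn sigmoid outputs into 0/1 predictions."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]

def best_threshold(y_true, probabilities, candidates):
    """Return the candidate threshold with the highest micro F1."""
    return max(candidates, key=lambda th: micro_f1(y_true, binarize(probabilities, th)))

y_true = [[1, 0, 0], [0, 1, 1]]
probabilities = [[0.9, 0.2, 0.1], [0.3, 0.6, 0.4]]
candidates = [0.1, 0.35, 0.5, 0.95]

print(best_threshold(y_true, probabilities, candidates))
# 0.35
```

A low threshold predicts too many labels and hurts precision, a high one predicts too few and hurts recall. In this toy example, 0.35 is the only candidate that separates all true labels from the rest.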

In [119]:
output_evaluation(model, sample_size, max_question_words, n_top_labels, y_test, predictions, normalize_embeddings, None, None, n_epochs)

Model Evaluation

normalize_embeddings = True, learning_rate = 1, vocab_size = None, epochs=20
Parameter Settings:
 Sample size = -1, Max. number of words per question = 100, Number of Top Labels used = 20

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking_4 (Masking)          (None, None, 100)         0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 256)               365568    
_________________________________________________________________
dense_17 (Dense)             (None, 100)               25700     
Total params: 391,268
Trainable params: 391,268
Non-trainable params: 0
_________________________________________________________________
None





Metrics with optimized threshold of 0.37
 Macro Evaluation: f1_Score= 0.4795355196850226 , Recall = 0.4341559331734409 , Precision = 0.582528658477357
 Micro Evaluation: f1_Score= 0.5950810603804093 , Recall = 0.5219257324127635 , Precision = 0.6920869005790072
