# Tagging Stack Overflow Questions

**The data**
* Python questions from Stack Overflow: [https://www.kaggle.com/stackoverflow/pythonquestions](https://www.kaggle.com/stackoverflow/pythonquestions)
* ~600,000 questions
* each question has 0-5 tags

**The problem**

Can we predict tags from question / title texts? If so, how well?
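
Because a question can carry several tags at once, this is a multi-label problem. A minimal sketch of how tag lists might be turned into a target matrix, assuming scikit-learn's `MultiLabelBinarizer` (which this notebook imports); the toy tag lists are invented for illustration:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy tag lists (illustrative only); a question may have several tags or none.
tag_lists = [["django", "pandas"], ["numpy"], []]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tag_lists)  # shape: (n_questions, n_distinct_tags)

print(mlb.classes_)  # column order of the indicator matrix
print(y)             # one row per question, one 0/1 column per tag
```

Each row can then serve as the target vector for a model with one sigmoid output per tag.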

**Approach**

Build several models and compare their performance:
* Bag-of-words model
* Sequential LSTM model for question bodies
* *Composite LSTM model for question bodies + titles*   <== this notebook

In [41]:
%load_ext autoreload
%autoreload 2

import pandas as pd
from nltk.tokenize import word_tokenize
import itertools

from models.lstm_classifier import create_model

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from tensorflow.keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint
import datetime
import time

import numpy as np
import nltk
nltk.download('punkt')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to /home/lukas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Configuration

In [44]:
data_path = "../data/pythonquestions/"
ft_path = "alldata.ft"  # set this to None if you want to train your own fasttext embeddings
n_top_labels = 100  # number of top labels to reduce dataset to
max_question_words = 100
sample_size = 1000  # set to -1 to use entire dataset
normalize_embeddings = True  # whether to normalize fasttext embeddings between -1, +1

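# NOTE: use_titles is referenced below but was never defined in this cell.
# The value here is an assumption; set it to True to work on titles instead of bodies.
use_titles = False

tokenized_field = "q_title_tokenized" if use_titles else "q_all_body_tokenized"
content_field = "Title" if use_titles else "Body_q"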

### Load data

In [None]:
from toolbox.data_prep_helpers import load_data

df = load_data("presentation_sample.pkl", ignore_cache=False)

## Preprocessing

### Slim down number of tags

We remove all tags that are not among the top *n_top_labels* tags of the dataset. Afterwards, we drop every row that has no tags left.

In [61]:
from toolbox.data_prep_helpers import reduce_number_of_tags

df = reduce_number_of_tags(df, n_top_labels)

deleting element python from top_tags


  dataframe["tags"] = dataframe["tags"].apply(lambda x: [tag for tag in x if tag in top_tags])


(426041, 8)
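
The implementation of `reduce_number_of_tags` is not shown in this notebook. The following is only a rough sketch of the filtering described above, using a hypothetical helper `keep_top_tags`; excluding the `python` tag is inferred from the log line above and is an assumption:

```python
from collections import Counter

import pandas as pd


def keep_top_tags(dataframe, n_top, excluded=("python",)):
    """Hypothetical stand-in for reduce_number_of_tags (not the toolbox code)."""
    # Count how often each tag occurs across all questions.
    counts = Counter(tag for tags in dataframe["tags"] for tag in tags)
    # Assumption: tags shared by (almost) every question carry no signal.
    for tag in excluded:
        counts.pop(tag, None)
    top_tags = {tag for tag, _ in counts.most_common(n_top)}

    # Work on a copy so the caller's dataframe is left untouched.
    reduced = dataframe.copy()
    reduced["tags"] = reduced["tags"].apply(
        lambda tags: [tag for tag in tags if tag in top_tags]
    )
    # Drop rows that have no tags left after filtering.
    return reduced[reduced["tags"].apply(len) > 0]


toy = pd.DataFrame(
    {"tags": [["python", "django"], ["python"], ["numpy", "django"], ["flask"]]}
)
reduced = keep_top_tags(toy, n_top=1)
print(reduced)
```

Working on a copy instead of a slice also avoids pandas' chained-assignment warning.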

### Remove HTML Formatting

In [55]:
df["Body_q"].iloc[100000]

'My question would be if there was any other way besides below to iterate through a file one character at a time?\nwith open(filename) as f:\n  while True:\n    c = f.read(1)\n    if not c:\n      print "End of file"\n      break\n    print "Read a character:", c\n\nSince there is not a function to check whether there is something to read like in Java, what other methods are there. Also, in the example, what would be in the variable c when it did reach the end of the file? Thanks for anyones help.\n'

In [None]:
from toolbox.data_prep_helpers import remove_html_tags

# Question bodies are stored as HTML; keep only the text content
remove_html_tags(df, ["Body_q"])

In [57]:
df["Body_q"].iloc[100000]

'My question would be if there was any other way besides below to iterate through a file one character at a time?\nwith open(filename) as f:\n  while True:\n    c = f.read(1)\n    if not c:\n      print "End of file"\n      break\n    print "Read a character:", c\n\nSince there is not a function to check whether there is something to read like in Java, what other methods are there. Also, in the example, what would be in the variable c when it did reach the end of the file? Thanks for anyones help.\n'
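
How `remove_html_tags` works internally is not shown here. As an illustration of the idea only, markup can be stripped with the standard library's `html.parser`; the helper `strip_html` below is hypothetical and not part of the toolbox:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Hypothetical helper: return the text content of an HTML snippet."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


cleaned = strip_html("<p>How do I <code>open()</code> a file?</p>")
print(cleaned)
```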

### Tokenization
We need to tokenize the questions so that we can apply or train embeddings on them.

To do this, we use the `word_tokenize` function from the NLTK library ([https://www.nltk.org/api/nltk.tokenize.html](https://www.nltk.org/api/nltk.tokenize.html)) to turn the sentences of a question into a single flat list of tokens.

In [60]:
# tokenization example
from toolbox.data_prep_helpers import generate_question_level_tokens

generate_question_level_tokens("Please help! How do I format in markdown?")

['please', 'help', '!', 'how', 'do', 'i', 'format', 'in', 'markdown', '?']
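
The body of `generate_question_level_tokens` is not part of this notebook. To make the flattening step concrete, here is a dependency-free sketch that swaps NLTK's `word_tokenize` for a simple regular expression; the function name and the regex are assumptions, and the real tokenizer handles contractions and other edge cases differently:

```python
import re

# Assumption: a token is either a run of word characters or one punctuation mark.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_question_tokens(text):
    """Hypothetical regex stand-in: lowercase, then return one flat token list."""
    return _TOKEN_PATTERN.findall(text.lower())


tokens = simple_question_tokens("Please help! How do I format in markdown?")
print(tokens)
```

Sentence boundaries are not preserved; the whole question ends up in one list.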

In [53]:
from toolbox.data_prep_helpers import generate_question_level_tokens

df["q_body_tokenized"] = df["Body_q"].apply(generate_question_level_tokens)
df["q_title_tokenized"] = df["Title"].apply(generate_question_level_tokens)


### Remove samples with too many tokens

In [None]:
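# keep questions with 1 to max_question_words tokens (this also drops empty bodies)
df = df[df["q_body_tokenized"].apply(len).between(1, max_question_words)]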

In [62]:
df.shape

(426041, 8)
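
A small, self-contained check of the length filter on toy data (the dataframe and the `max_tokens` value are invented for illustration). Note that `Series.between` is inclusive on both ends by default:

```python
import pandas as pd

max_tokens = 3  # illustrative limit; the notebook uses 100

toy = pd.DataFrame(
    {"q_body_tokenized": [[], ["a"], ["a", "b", "c"], ["a", "b", "c", "d"]]}
)

# Number of tokens per question.
token_counts = toy["q_body_tokenized"].apply(len)

# Keep rows with at least 1 and at most max_tokens tokens.
kept = toy[token_counts.between(1, max_tokens)]
print(kept)
```

The empty question and the four-token question are dropped; the rows at the boundaries (1 and 3 tokens) are kept.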