# Multiclass Classification on Stack Overflow Questions

Train a multiclass classifier to predict the tag of a programming question on Stack Overflow.

The dataset contains the body of several thousand programming questions (for example, "How can sort a dictionary by value in Python?") posted to Stack Overflow. Each of these is labeled with exactly one tag (either Python, CSharp, JavaScript, or Java). The task is to take a question as input, and predict the appropriate tag.

In [1]:
!pip install tf-nightly

Collecting tf-nightly
[?25l  Downloading https://files.pythonhosted.org/packages/a7/4e/5d6d144e2733acaa3ef451345d74bfb1aa28884e7086dff8b4e37b4ec7f0/tf_nightly-2.4.0.dev20200922-cp36-cp36m-manylinux2010_x86_64.whl (390.7MB)
[K     |████████████████████████████████| 390.7MB 41kB/s 
[?25hCollecting flatbuffers>=1.12
  Downloading https://files.pythonhosted.org/packages/eb/26/712e578c5f14e26ae3314c39a1bdc4eb2ec2f4ddc89b708cf8e0a0d20423/flatbuffers-1.12-py2.py3-none-any.whl
Collecting tf-estimator-nightly
[?25l  Downloading https://files.pythonhosted.org/packages/d2/54/e6255de0770a055ed3b9bfc90b254deb6cece5621fcffa0d0300199927c5/tf_estimator_nightly-2.4.0.dev2020091801-py2.py3-none-any.whl (460kB)
[K     |████████████████████████████████| 460kB 50.6MB/s 
Collecting tb-nightly<3.0.0a0,>=2.4.0a0
[?25l  Downloading https://files.pythonhosted.org/packages/0e/a3/32f95366a8390d73cad81401f1e0b5e6408464c45e0581ed4f9d5ebd73e3/tb_nightly-2.4.0a20200921-py3-none-any.whl (10.2MB)
[K     |███████

In [2]:
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [3]:
print(tf.__version__)

2.4.0-dev20200922


## Download the BigQuery dataset

In [4]:
url = "http://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"

In [5]:
tf.keras.utils.get_file("stack_overflow_16k.tar.gz",
                        url,
                        untar=True,
                        cache_dir='.',
                        cache_subdir='')

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz


'./stack_overflow_16k.tar.gz'

In [7]:
batch_size = 32

In [8]:
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='training', 
    seed=42
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [9]:
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'train', 
    batch_size=batch_size, 
    validation_split=0.2, 
    subset='validation', 
    seed=42
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.


In [10]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'train', 
    batch_size=batch_size
)

Found 8000 files belonging to 4 classes.


## Explore the data

In [12]:
print(raw_train_ds.class_names)

['csharp', 'java', 'javascript', 'python']


In [15]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(5):
    print(text_batch.numpy()[i])
    print(label_batch.numpy()[i])
    print()

b'"unit testing of setters and getters teacher wanted us to do a comprehensive unit test. for me, this will be the first time that i use junit. i am confused about testing set and get methods. do you think should i test them? if the answer is yes; is this code enough for testing?..  public void testsetandget(){.    int a = 10;.    class firstclass = new class();.    firstclass.setvalue(10);.    int value = firstclass.getvalue();.    assert.asserttrue(""error"", value==a);.  }...in my code, i think if there is an error, we can\'t know that the error is deriving because of setter or getter."\n'
1

b'"static class, static constructors and static properties i have a static class that has only static properties and a static constructor. when i try to access or set the value of property (with a backing field) the static constructor is not called. however, if i define a static method and try to call it the constructor is executed...i believe properties are just syntactical sugar and are inter

## Prepare data for training

In [16]:
max_features = 5000
embedding_dim = 128
sequence_length = 500

vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length
)

Make a text-only dataset (no labels) and call adapt:

In [17]:
text_ds = raw_train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

## Vectorize the data

In [20]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [21]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

## Configure the dataset for performance

In [22]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

## Build the model 

The last layer of the model has to be Dense(4), as there are now four output classes:  Python, CSharp, JavaScript, and Java.

In [23]:
model = tf.keras.Sequential([
  layers.Embedding(max_features + 1, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(4)
])

In [24]:
model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)

## Train the model

In [25]:
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Evaluate the model

In [26]:
loss, accuracy = model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.9314426183700562
Accuracy:  0.7438750267028809
