Exercise from the end of: https://www.tensorflow.org/tutorials/keras/text_classification

### Exercise: multi-class classification on Stack Overflow questions

This tutorial showed how to train a binary classifier from scratch on the IMDB dataset. As an exercise, you can modify this notebook to train a multi-class classifier to predict the tag of a programming question on <a href="http://stackoverflow.com/">Stack Overflow</a>.

A <a href="https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz">dataset</a> has been prepared for you. It contains the body of several thousand programming questions (for example, "How can I sort a dictionary by value in Python?") posted to Stack Overflow. Each question is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to take a question as input and predict the appropriate tag (in this example, Python).

The dataset you will work with contains several thousand questions extracted from the much larger public Stack Overflow dataset on <a href="https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow">BigQuery</a>, which contains more than 17 million posts.

After downloading the dataset, you will find it has a similar directory structure to the IMDB dataset you worked with previously:

train/         <br/>
...python/     <br/>
......0.txt    <br/>
......1.txt    <br/>
...javascript/ <br/>
......0.txt    <br/>
......1.txt    <br/>
...csharp/     <br/>
......0.txt    <br/>
......1.txt    <br/>
...java/       <br/>
......0.txt    <br/>
......1.txt    <br/>
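
The class subfolders double as labels. As a minimal sketch, assuming labels are assigned to folder names in alphanumerical order (the documented default of `tf.keras.utils.text_dataset_from_directory` when `class_names` is not passed), the following standard-library code builds a tiny mock tree in a temporary directory and derives the label mapping. The helper names `infer_label_map` and `build_mock_tree` and the placeholder file contents are made up for illustration; they are not part of the tutorial.

```python
import os
import tempfile


def infer_label_map(split_dir):
    """Map each class subfolder of split_dir to an integer label.

    Hypothetical helper: subfolders are sorted by name, mirroring the
    alphanumerical ordering assumed in the text above.
    """
    class_names = sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    )
    return {name: index for index, name in enumerate(class_names)}


def build_mock_tree(root):
    """Create train/<tag>/<n>.txt files containing placeholder text."""
    train_dir = os.path.join(root, "train")
    for tag in ("python", "javascript", "csharp", "java"):
        tag_dir = os.path.join(train_dir, tag)
        os.makedirs(tag_dir)
        for n in range(2):
            with open(os.path.join(tag_dir, f"{n}.txt"), "w") as handle:
                handle.write(f"placeholder question {n} about {tag}\n")
    return train_dir


with tempfile.TemporaryDirectory() as tmp:
    label_map = infer_label_map(build_mock_tree(tmp))

print(label_map)
```

Under that assumption, the integer labels 0 to 3 follow the sorted folder names rather than the order in which the folders are listed above.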

To complete this exercise, you should modify this notebook to work with the Stack Overflow dataset by making the following modifications:

<ol>
    <li>At the top of your notebook, replace the code that downloads the IMDB dataset with code to download the <a href="https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz">Stack Overflow</a> dataset that has already been prepared. As the Stack Overflow dataset has a similar directory structure, you will not need to make many modifications.</li>
    <li>Modify the last layer of your model to Dense(4), as there are now four output classes.</li>
    <li>When compiling the model, change the loss to <a href="https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy">tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)</a>. This is the correct loss function to use for a multi-class classification problem, when the labels for each class are integers (in this case, they can be 0, 1, 2, or 3). In addition, change the metrics to metrics=['accuracy'], since this is a multi-class classification problem (tf.metrics.BinaryAccuracy is only used for binary classifiers).</li>
    <li>When plotting accuracy over time, change binary_accuracy and val_binary_accuracy to accuracy and val_accuracy, respectively.</li>
    <li>Once these changes are complete, you will be able to train a multi-class classifier.</li>
</ol>
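
To make steps 2 and 3 more concrete, here is a plain-Python sketch of what a sparse categorical cross-entropy loss and an accuracy metric compute when the model emits four raw scores (logits) per question and the labels are integers. It is an illustration of the arithmetic only, not the TensorFlow implementation; the function names and the logit values are invented for the example.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sparse_categorical_crossentropy(label, logits):
    """Loss for one example: -log(probability assigned to the true class)."""
    return -math.log(softmax(logits)[label])


def accuracy(labels, batch_logits):
    """Fraction of examples whose largest logit sits at the true label."""
    hits = sum(
        1 for label, logits in zip(labels, batch_logits)
        if logits.index(max(logits)) == label
    )
    return hits / len(labels)


# Made-up logits for two questions, one score per class (ids 0 to 3).
batch_logits = [[0.1, 0.2, 0.3, 2.0], [1.5, 0.4, 0.2, 0.1]]
labels = [3, 1]  # integer class ids, not one-hot vectors

per_example_loss = [
    sparse_categorical_crossentropy(label, logits)
    for label, logits in zip(labels, batch_logits)
]
print(per_example_loss, accuracy(labels, batch_logits))
```

The first question is classified correctly and gets a small loss; the second is misclassified and gets a larger one. Because the labels are plain integers, no one-hot encoding is needed, which is the situation the "sparse" variant of the loss is meant for.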

### Learning more
This tutorial introduced text classification from scratch. To learn more about the text classification workflow in general, check out the Text classification guide from Google Developers.

In [6]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

In [7]:
url = "https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"

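# Download the archive and extract it under ./downloads/stack_overflow_16k.
dataset = tf.keras.utils.get_file("stack_overflow_16k", url,
                                  untar=True, cache_dir='./downloads/stack_overflow_16k',
                                  cache_subdir='')

# Assumes the extracted files sit in a stack_overflow_16k folder next to the
# returned path. The actual layout can vary with the TensorFlow/Keras version,
# so confirm that train/ exists here, e.g. with os.listdir(dataset_dir).
dataset_dir = os.path.join(os.path.dirname(dataset), 'stack_overflow_16k')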
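
If you are unsure where the archive was extracted, one option is to search the cache directory for the train folder instead of hard-coding its path. The sketch below uses only the standard library and demonstrates the idea on a mock layout; `find_split_dir` and the `some_wrapper` folder are hypothetical names used for illustration.

```python
import os
import tempfile


def find_split_dir(root, split="train"):
    """Return the first directory named `split` found under root, or None."""
    for current, dirnames, _ in os.walk(root):
        if split in dirnames:
            return os.path.join(current, split)
    return None


# Demonstration on a mock layout. With the real download, pass the cache
# directory (for example './downloads/stack_overflow_16k') as root instead.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "some_wrapper", "train", "python"))
    found = find_split_dir(tmp)
    relative = os.path.relpath(found, tmp)

print(relative)
```

The returned path can then be used wherever the notebook expects the training directory.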