<a href="https://colab.research.google.com/github/brianMutea/TensorFlow-Predict-a-tag-for-a-Stack-Overflow-question/blob/main/TensorFlow_Predict_the_tag_for_a_Stack_Overflow_question.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# %pip install tensorflow-text

In [2]:
import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

## Downloading the dataset

We use the Keras utility `tf.keras.utils.get_file`, which downloads a file from a URL if it is not already in the cache.



In [3]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset_dir = tf.keras.utils.get_file(
    origin=data_url,
    extract=True,
    cache_dir = 'stack_overflow',
    cache_subdir=''
)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz
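
The "download only if it is not already cached" behaviour can be sketched in plain Python. This is a simplified illustration, not how `get_file` is implemented: `cached_download`, `fake_fetch`, and the example URL are hypothetical names, and the stand-in fetcher is there only so the sketch runs without network access.

```python
import pathlib
import tempfile
import urllib.request


def cached_download(url, cache_dir, fetch=urllib.request.urlretrieve):
    """Return the local path for `url`, downloading only on a cache miss."""
    cache_dir = pathlib.Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]  # file name taken from the URL
    if not target.exists():
        fetch(url, str(target))  # cache miss: fetch the file
    return target


# Stand-in fetcher (hypothetical) that records calls instead of using the network.
calls = []


def fake_fetch(url, filename):
    calls.append(url)
    pathlib.Path(filename).write_text("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.com/archive.tar.gz"  # hypothetical URL
    first = cached_download(url, tmp, fetch=fake_fetch)
    second = cached_download(url, tmp, fetch=fake_fetch)  # served from the cache

download_count = len(calls)
print(download_count)
```

The second call finds the file on disk and skips the fetch, which is why re-running the download cell is cheap.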


In [4]:
dataset_dir = pathlib.Path(dataset_dir).parent

In [5]:
list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz')]

In [6]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/python'),
 PosixPath('/tmp/.keras/train/javascript'),
 PosixPath('/tmp/.keras/train/java')]

The `train/csharp`, `train/java`, `train/python` and `train/javascript` directories contain many text files, each of which is a Stack Overflow question.

In [7]:
sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x



## Loading the dataset

Next, we load the data from disk and prepare it in a format suitable for training.

We will use the `tf.keras.utils.text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`. A `Dataset` object is a Python iterable, so its elements can be consumed with a `for` loop.

To read:

* [`tf.data`](https://www.tensorflow.org/guide/data) - to create input pipelines

The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.

Create a validation set using an 80:20 split of the training data by using `tf.keras.utils.text_dataset_from_directory` with `validation_split` set to 0.2 (i.e. 20%):

In [8]:
batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split = 0.2,
    subset = 'training',
    seed = seed
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


We can train a model on a `tf.data.Dataset` by passing it directly to `Model.fit`. First, let's iterate over the dataset and print a few examples to get a feel for the data:

In [9]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print('Question: ', text_batch.numpy()[i])
    print('Label', label_batch.numpy()[i])

Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default con

The labels are 0, 1, 2 or 3. To check which of these corresponds to which string label, inspect the `class_names` property on the dataset:

In [10]:
for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python
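
The mapping above is consistent with class names being taken from the subdirectory names in sorted order. A minimal sketch of that idea, using a temporary directory in place of the real dataset (`infer_class_names` is a hypothetical helper, not part of Keras):

```python
import pathlib
import tempfile


def infer_class_names(root):
    """Use the sorted subdirectory names of `root` as class names."""
    return sorted(p.name for p in pathlib.Path(root).iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the layout of the train/ directory with empty class folders.
    for name in ["csharp", "python", "javascript", "java"]:
        (pathlib.Path(tmp) / name).mkdir()
    class_names = infer_class_names(tmp)

# Each class name gets the integer label equal to its position in the list.
label_of = {name: i for i, name in enumerate(class_names)}
print(label_of)
```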


We create a validation and a test set using `tf.keras.utils.text_dataset_from_directory`.

We will use the remaining 1,600 questions from the training set for validation.

In [11]:
# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
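
Both calls use the same `seed` and `validation_split`, which is what keeps the training and validation subsets from overlapping. The idea can be sketched as a seeded shuffle followed by a cut. This is an illustration under assumptions: the file names are made up, `split_files` is a hypothetical helper, and the utility's actual shuffling order may differ.

```python
import random


def split_files(files, validation_split, seed):
    """Shuffle deterministically, then keep the last fraction for validation."""
    files = sorted(files)               # start from a reproducible order
    random.Random(seed).shuffle(files)  # same seed -> same permutation
    cut = len(files) - int(len(files) * validation_split)
    return files[:cut], files[cut:]     # (training, validation)


files = [f"q{i}.txt" for i in range(8000)]  # hypothetical file names
train_files, val_files = split_files(files, validation_split=0.2, seed=42)
print(len(train_files), len(val_files))

# Calling again with the same seed reproduces the same partition.
train_again, val_again = split_files(files, validation_split=0.2, seed=42)
```

With a different seed in the second call, the two subsets would come from different permutations and could share files.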


In [12]:
test_dir = dataset_dir/'test'

# Create the test set

raw_test_ds = utils.text_dataset_from_directory(
    test_dir, 
    batch_size = batch_size
)

Found 8000 files belonging to 4 classes.


Next, we standardize, tokenize, and vectorize the data using the [`tf.keras.layers.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer.

*Standardization* refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.

*Tokenization* refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).

*Vectorization* refers to converting tokens into numbers so they can be fed into a neural network.

📓

The default standardization converts text to lowercase and removes punctuation (`standardize='lower_and_strip_punctuation'`).

The default tokenizer splits on whitespace (`split='whitespace'`).

The default vectorization mode is `'int'` (**output_mode='int'**). This outputs integer indices (one per token). This mode **can be used to build models that take word order into account**. You can also use other modes—like `'binary'`—to build bag-of-words models.
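
To make the three steps concrete, here is a plain-Python sketch of standardize, tokenize, and vectorize. It only mimics the defaults described above. The tiny vocabulary is made up, and reserving index 0 for padding and index 1 for unknown tokens is an assumption chosen to mirror `'int'` mode.

```python
import string


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Split on whitespace."""
    return text.split()


# Hypothetical vocabulary: index 0 = padding, index 1 = unknown token.
vocab = ["", "[UNK]", "how", "do", "i", "sort", "a", "list"]
index = {token: i for i, token in enumerate(vocab)}


def vectorize(tokens):
    """Map each token to its integer index, or to 1 if it is unknown."""
    return [index.get(token, 1) for token in tokens]


question = "How do I sort a list, in Python?"
clean = standardize(question)
tokens = tokenize(clean)
ids = vectorize(tokens)
print(clean)
print(ids)
```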

**Binary mode**

In [13]:
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens = VOCAB_SIZE,
    output_mode = 'binary'
)

**int mode**

For the `'int'` mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length (`MAX_SEQUENCE_LENGTH`), which will cause the layer to pad or truncate sequences to exactly `output_sequence_length` values:

In [14]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens = VOCAB_SIZE,
    output_mode = 'int',
    output_sequence_length = MAX_SEQUENCE_LENGTH
)
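
What "pad or truncate to exactly `output_sequence_length` values" means can be sketched in a few lines of plain Python (`pad_or_truncate` is a hypothetical helper, with 0 assumed as the padding value):

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Return exactly `length` values: cut long inputs, right-pad short ones."""
    return (ids + [pad_id] * length)[:length]


short = pad_or_truncate([5, 3, 9], length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], length=4)
print(short)
print(long)
```

Every example then has the same length, so a batch can be stacked into one rectangular tensor.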

Next, call `TextVectorization.adapt` to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index mapping strings to integers.

Call `TextVectorization.adapt` on the training set only: adapting on the test set would leak information about the test data into the model.

In [15]:
# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
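
Conceptually, adapting amounts to counting tokens in the training text and keeping the most frequent ones. A rough plain-Python sketch follows. `build_vocabulary` is a hypothetical helper, and reserving two slots for the padding and unknown tokens is an assumption that mirrors `'int'` mode.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, after two reserved entries."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common  # index 0 = padding, index 1 = unknown


train_texts = ["for loop in python", "python list", "java for loop"]
vocab = build_vocabulary(train_texts, max_tokens=5)
index = {token: i for i, token in enumerate(vocab)}

# A test-set question is encoded with the training vocabulary only;
# words never seen during training map to the unknown index 1.
test_ids = [index.get(token, 1) for token in "kotlin for loop".split()]
print(vocab)
print(test_ids)
```

Because the vocabulary is built from `train_texts` alone, nothing about the test questions influences the preprocessing.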

In [16]:
def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)  # Add a trailing axis of length 1.
  return binary_vectorize_layer(text), label

In [17]:
def int_vectorize_text(text, label):
  print(text)
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [18]:
# Retrieve a batch (of 32 questions and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

Question tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string)
Label tf.Tensor(2, shape=(), dtype=int32)


**binary mode output**

In [19]:
print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 0. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)


**int mode output**

In [20]:
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string)
'int' vectorized question: tf.Tensor(
[[ 55   6   2 410 211 229 121 895   4 124  32 245  43   5   1   1   5   1
    1   6   2 410 211 191 318  14   2  98  71 188   8   2 199  71 178   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   

As shown above, TextVectorization's `'binary'` mode returns an array denoting which tokens exist at least once in the input, while the `'int'` mode replaces each token by an integer, thus preserving their order.
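
The difference matters when two questions use the same words in a different order. In the plain-Python sketch below (made-up vocabulary, one index shared by both modes for simplicity), the two sentences get identical binary vectors but different integer sequences:

```python
# Hypothetical vocabulary; index 0 = padding, index 1 = unknown token.
vocab = ["", "[UNK]", "convert", "list", "to", "string"]
index = {token: i for i, token in enumerate(vocab)}


def int_sequence(tokens):
    """One integer per token, in the order the tokens appear."""
    return [index.get(token, 1) for token in tokens]


def multi_hot(tokens):
    """1.0 at every vocabulary position whose token occurs at least once."""
    present = {index.get(token, 1) for token in tokens}
    return [1.0 if i in present else 0.0 for i in range(len(vocab))]


a = "convert list to string".split()
b = "convert string to list".split()

print(multi_hot(a) == multi_hot(b))         # word order is lost
print(int_sequence(a) == int_sequence(b))   # word order is kept
```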

Let's look up the token (string) that each integer corresponds to by calling `TextVectorization.get_vocabulary` on the layer:

In [21]:
print("1300 ---> ", int_vectorize_layer.get_vocabulary()[1300])
print("205 ---> ", int_vectorize_layer.get_vocabulary()[205])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1300 --->  equivalent
205 --->  form
Vocabulary size: 10000
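
Since the vocabulary is a list indexed by token id, a vectorized question can be mapped back to tokens by plain list indexing. A small sketch with a made-up vocabulary, assuming index 0 is padding and index 1 is the unknown-token placeholder:

```python
# Hypothetical vocabulary; index 0 = padding, index 1 = unknown token.
vocab = ["", "[UNK]", "what", "is", "the", "difference"]


def decode(ids, vocab):
    """Map ids back to tokens, skipping the padding index 0."""
    return [vocab[i] for i in ids if i != 0]


ids = [2, 3, 4, 5, 1, 0, 0]
tokens = decode(ids, vocab)
print(tokens)
```

Tokens that fell outside the vocabulary cannot be recovered; they all come back as the same placeholder.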


As a final preprocessing step, we will apply the TextVectorization layers we created earlier to the training, validation, and test sets:

In [22]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

Tensor("args_0:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=string)


[See why the code below is needed...](https://www.tensorflow.org/tutorials/load_data/text#configure_the_dataset_for_performance)

In [23]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [24]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

# Training the model

**Training the binary mode model...**

In [25]:
binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
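
The single `Dense(4)` layer outputs four raw scores (logits), one per tag, and `from_logits=True` tells the loss to apply the softmax itself. For one example with an integer label, the loss can be worked out by hand as below. This is a plain-Python sketch of the formula, not the Keras implementation, and the logit values are made up.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


uniform_loss = sparse_ce_from_logits([0.0, 0.0, 0.0, 0.0], label=2)

logits = [2.0, 0.5, -1.0, 0.1]  # made-up scores for the four tags
loss_if_label_0 = sparse_ce_from_logits(logits, label=0)
loss_if_label_2 = sparse_ce_from_logits(logits, label=2)
print(uniform_loss, loss_if_label_0, loss_if_label_2)
```

An untrained model that scores all four tags equally has a loss of ln 4 (about 1.386), which is a useful baseline when reading the first epoch's loss.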


**With the int mode model, we will train a 1D ConvNet**

In [26]:
def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model
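
As a sanity check on this architecture, the intermediate sequence length and the number of trainable parameters can be worked out by hand. The sketch below assumes the values used elsewhere in this notebook (sequence length 250, `VOCAB_SIZE + 1` embedding rows, 4 labels) and the standard output-length formula for a convolution with `padding="valid"`.

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Output length of a 1D convolution with padding="valid"."""
    return (input_length - kernel_size) // stride + 1


vocab_size, embed_dim = 10000 + 1, 64    # VOCAB_SIZE + 1 rows, 64 columns
filters, kernel_size, stride = 64, 5, 2  # Conv1D(64, 5, strides=2)
num_labels = 4

# 250 token positions shrink to 123 convolution outputs per question.
steps = conv1d_output_length(250, kernel_size, stride)

embedding_params = vocab_size * embed_dim                   # 640064
conv_params = kernel_size * embed_dim * filters + filters   # 20544
dense_params = filters * num_labels + num_labels            # 260
total_params = embedding_params + conv_params + dense_params
print(steps, total_params)
```

`GlobalMaxPooling1D` then keeps the maximum over those 123 steps for each of the 64 filters, so the final `Dense` layer sees a vector of length 64.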

In [27]:
# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Model Evaluations

`model.evaluate()` takes the neural network as it is after the last epoch, computes predictions on the given dataset, and then calculates the loss and the compiled metrics.

**binary and int mode models**

In [28]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)



In [29]:
print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

Binary model accuracy: 81.50%
Int model accuracy: 80.54%


# Exporting the model for deployment purposes

Since we applied `TextVectorization` to the dataset before feeding it to the model, the trained model only accepts vectorized input. To make it capable of processing raw strings, we apply the `TextVectorization` layer inside the model.

To do that, we create a new model that reuses the weights we have just trained.

Let's export the binary model, which has the higher test accuracy:

In [32]:
export_binary_model = tf.keras.Sequential([
    binary_vectorize_layer, binary_model,
    layers.Activation('sigmoid')
])

export_binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

Now we can test the model on `raw_test_ds`, which yields raw strings:

In [31]:
loss, accuracy = export_binary_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Accuracy: 81.50%


## Make predictions on new data

In [33]:

#Find labels with the highest scores

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.math.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels
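
The same lookup can be written without TensorFlow to see what the function does: take the position of the highest score in each row and use it as an index into the class names. The scores below are made up for illustration.

```python
class_names = ["csharp", "java", "javascript", "python"]


def argmax(scores):
    """Position of the largest value in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up scores for two questions, one row per question.
scores_batch = [
    [0.1, 0.2, 0.1, 0.9],
    [0.2, 0.3, 0.8, 0.1],
]
labels = [class_names[argmax(row)] for row in scores_batch]
print(labels)
```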

In [40]:
inputs = [
    "How do I iterate over a Pandas column?",
    "How do I use the arrow functions"
]

predicted_scores = export_binary_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)


# Use `question` as the loop variable to avoid shadowing the built-in `input`.
for question, label in zip(inputs, predicted_labels):
  print("Question: ", question)
  print('Predicted tag/label: ', label.numpy())

Question:  How do I iterate over a Pandas column?
Predicted tag/label:  b'python'
Question:  How do I use the arrow functions
Predicted tag/label:  b'javascript'


[Source](https://www.tensorflow.org/tutorials/load_data/text#configure_the_dataset_for_performance)

Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment, and reduces the potential for train/test skew.

There is a performance difference to keep in mind when choosing where to apply `tf.keras.layers.TextVectorization`. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the `TextVectorization` layer inside your model when you're ready to prepare for deployment.

Assignment to read:

* [Saving and loading models](https://www.tensorflow.org/tutorials/keras/save_and_load)

