<a href="https://colab.research.google.com/github/alawrence30/Deep-Learning/blob/main/MSDS458_Assignment_03_part01%20-%20v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/djp840/MSDS_458_Public/blob/master/images/NorthwesternHeader.png?raw=1">

## MSDS458 Research Assignment 3 - Part 01

## Analyze AG_NEWS_SUBSET Data

AG is a collection of more than 1 million news articles. The articles have been gathered from more than 2,000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.<br>

For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html<br> 


The AG's news topic classification dataset is constructed by choosing the 4 largest classes (**World**, **Sports**, **Business**, and **Sci/Tech**) from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples.<br>

Homepage: https://arxiv.org/abs/1509.01626<br>

Source code: tfds.text.AGNewsSubset

Versions:

1.0.0 (default): No release notes.
Download size: 11.24 MiB

Dataset size: 35.79 MiB

## References
1. Deep Learning with Python, Francois Chollet (https://learning.oreilly.com/library/view/deep-learning-with/9781617296864/)
 * Chapter 10: Deep learning for time series
 * Chapter 11: Deep learning for text
2. Deep Learning A Visual Approach, Andrew Glassner (https://learning.oreilly.com/library/view/deep-learning/9781098129019/)
 * Chapter 19: Recurrent Neural Networks
 * Chapter 20: Attention and Transformers

# Deep learning for text

## Natural-language processing: The bird's eye view

## Preparing text data

<img src="https://github.com/djp840/MSDS_458_Public/blob/master/images/11-01.png?raw=1">

## Import Packages

In [1]:
from packaging import version

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization

## Verify TensorFlow version and Keras version

In [2]:
print("This notebook requires TensorFlow 2.0 or above")
print("TensorFlow version: ", tf.__version__)
assert version.parse(tf.__version__).release[0] >=2

This notebook requires TensorFlow 2.0 or above
TensorFlow version:  2.9.2


In [3]:
print("Keras version: ", keras.__version__)

Keras version:  2.9.0


## Mount Google Drive to Colab environment

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Load AG_NEWS_SUBSET News Articles Dataset

In [5]:
# register  ag_news_subset so that tfds.load doesn't generate a checksum (mismatch) error
!python -m tensorflow_datasets.scripts.download_and_prepare --register_checksums --datasets=ag_news_subset

dataset, info = tfds.load('ag_news_subset', with_info=True,  split=['train[:114000]','train[114000:]', 'test'],
                          batch_size = 32, as_supervised=True)
train_ds, val_ds, test_ds = dataset

2022-10-29 21:54:04.216012: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
W1029 21:54:04.216403 140474941478784 download_and_prepare.py:43] ***`tfds build` should be used instead of `download_and_prepare`.***
INFO[build.py]: Loading dataset ag_news_subset from imports: tensorflow_datasets.text.ag_news_subset
2022-10-29 21:54:04.348648: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "NOT_FOUND: Error executing an HTTP request: HTTP response code 404".
INFO[dataset_info.py]: Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: ag_news_subset/1.0.0
INFO[dataset_info.py]: Load dataset info from /tmp/tmpr03q029ftfds
INFO[dataset_info.py]: Field info.spli

## Display The Number of Batches

In [6]:
len(train_ds), len(val_ds), len(test_ds)

(3563, 188, 238)
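
These batch counts follow from the split sizes and `batch_size=32`. The short check below is only an arithmetic sketch; it assumes the final partial batch of each split is kept rather than dropped, which is consistent with the counts shown above.

```python
import math

BATCH_SIZE = 32

# Split sizes implied by the tfds.load call: train[:114000], train[114000:], test.
split_sizes = {"train": 114_000, "val": 120_000 - 114_000, "test": 7_600}

# Each split needs ceil(n / batch_size) batches when the last partial batch is kept.
batches = {name: math.ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}
print(batches)  # {'train': 3563, 'val': 188, 'test': 238}
```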

## Displaying The Shapes and Dtypes of the First Batch

In [7]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print()
    print("inputs.dtype:", inputs.dtype)
    print()
    print("targets.shape:", targets.shape)
    print()
    print("targets.dtype:", targets.dtype)
    print()
    print("inputs[0]:", inputs[0])
    print()
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)

inputs.dtype: <dtype: 'string'>

targets.shape: (32,)

targets.dtype: <dtype: 'int64'>

inputs[0]: tf.Tensor(b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.', shape=(), dtype=string)

targets[0]: tf.Tensor(3, shape=(), dtype=int64)


## Processing words as a set: The bag-of-words approach

The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set (a “bag”) of tokens.

## Single words (unigrams) with binary encoding

The main advantage of this encoding is that you can represent an entire text as a single vector, where each entry is a presence indicator for a given word.
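
To make the idea of a presence-indicator vector concrete, here is a minimal pure-Python sketch. The function `multi_hot` and the five-word vocabulary are hypothetical illustrations, not part of the notebook; the real `TextVectorization` layer also strips punctuation and learns its vocabulary from the data.

```python
def multi_hot(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `text`."""
    tokens = set(text.lower().split())  # word order and repeats are discarded
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical toy vocabulary; the notebook uses the 1,000 most frequent tokens.
vocabulary = ["chip", "game", "market", "web", "war"]

vector = multi_hot("AMD chip targets Web services and databases", vocabulary)
print(vector)  # [1, 0, 0, 1, 0]
```

Every text maps to a vector of the same length as the vocabulary, no matter how long the text is, which is what lets a plain `Dense` network consume it.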

## Preprocessing Datasets with the TextVectorization Layer

<div class="alert alert-block alert-success">
    <b>tf.keras.layers.TextVectorization</b><br>
    https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization
    </div>

In [8]:
text_vectorization = TextVectorization(
    max_tokens=1000,
    output_mode="multi_hot")

In [9]:
text_only_train_ds = train_ds.map(lambda x, y: x)

In [10]:
for text in text_only_train_ds:
    print(f"Get first batch of {text.shape[0]} news articles.\n")
    print(f"Here is the first news article:\n\n{text[0]}.")
    break

Get first batch of 32 news articles.

Here is the first news article:

b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'.


## Adapt Method - Standardize Text

In [11]:
text_vectorization.adapt(text_only_train_ds)
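
Roughly speaking, `adapt` scans the training text, standardizes it, and keeps the most frequent tokens as the vocabulary. The sketch below imitates that in plain Python under stated assumptions: lowercasing plus punctuation stripping as the standardization, whitespace splitting, and index 0 reserved for an unknown-word token. The helper names `standardize` and `build_vocabulary` are hypothetical, and the layer's actual implementation may differ in details.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase the text and strip ASCII punctuation (assumed default behavior)."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, reserving index 0 for unknown words."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 1)]
    return ["[UNK]"] + most_common


corpus = ["The cat sat.", "The cat ran!", "A dog ran."]
vocab = build_vocabulary(corpus, max_tokens=4)
print(vocab)  # "[UNK]" followed by the three most frequent tokens
```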

In [12]:
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

## Inspecting Output Binary Unigram Dataset

In [13]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print()
    print("inputs.dtype:", inputs.dtype)
    print()
    print("targets.shape:", targets.shape)
    print()
    print("targets.dtype:", targets.dtype)
    print()
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 1000)

inputs.dtype: <dtype: 'float32'>

targets.shape: (32,)

targets.dtype: <dtype: 'int64'>

targets[0]: tf.Tensor(3, shape=(), dtype=int64)


## Model Function 

In [14]:
def get_model(max_tokens=1000, hidden_dim=16):
    inputs = tf.keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(4, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss='SparseCategoricalCrossentropy',
                  metrics=["accuracy"])
    return model

## Build Binary Unigram Model

In [15]:
model_Unigram = get_model()
model_Unigram.summary()
callbacks = [
    tf.keras.callbacks.ModelCheckpoint("binary_1gram.keras",save_best_only=True)
    ,tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5)
]

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1000)]            0         
                                                                 
 dense (Dense)               (None, 16)                16016     
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 4)                 68        
                                                                 
Total params: 16,084
Trainable params: 16,084
Non-trainable params: 0
_________________________________________________________________
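
The parameter counts in the summary can be reproduced by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden = dense_params(1000, 16)  # multi-hot input -> 16 hidden units
output = dense_params(16, 4)     # 16 hidden units -> 4 class probabilities
total = hidden + output
print(hidden, output, total)  # 16016 68 16084
```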


In [16]:
model_Unigram.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=200,
          callbacks=callbacks)
model_Unigram = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model_Unigram.evaluate(binary_1gram_test_ds)[1]:.3f}")

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Test acc: 0.837


We call `cache()` on the datasets to cache them in memory: this way, we will only do the preprocessing once, during the first epoch, and we’ll reuse the preprocessed texts for the following epochs. This can only be done if the data is small enough to fit in memory.

## Bigrams With Binary Encoding

Of course, discarding word order is very reductive, because even atomic concepts can be expressed via multiple words: the term “United States” conveys a concept that is quite distinct from the meaning of the words “states” and “united” taken separately. 

With bigrams, the sentence “`the cat sat on the mat.`” becomes

`{"the", "the cat", "cat", "cat sat", "sat",
 "sat on", "on", "on the", "the mat", "mat"}`

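The bigram set above can be reproduced with a few lines of plain Python. This is an illustrative sketch only; `ngram_set` is a hypothetical helper, and the `TextVectorization` layer does the equivalent work internally when `ngrams=2` is passed.

```python
import string


def ngram_set(text, max_n=2):
    """Return the set of all 1-grams up to `max_n`-grams in `text`."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


bigram_bag = ngram_set("the cat sat on the mat.")
print(sorted(bigram_bag))
```

Because the result is a set, the repeated word "the" appears only once, and the bag holds ten entries in total.
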
<div class="alert alert-block alert-success">
    <b>tf.keras.layers.TextVectorization</b><br>
    https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization
    </div>

## Configuring the `TextVectorization` layer to return Bigrams

The TextVectorization layer can be configured to return arbitrary N-grams: bigrams, trigrams, etc. Just pass an `ngrams=N` argument as in the following listing.

In [17]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=1000,
    output_mode="multi_hot",
)

## Build Binary Bigram Model 

In [18]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model_Bigram = get_model()
model_Bigram.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1000)]            0         
                                                                 
 dense_2 (Dense)             (None, 16)                16016     
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 4)                 68        
                                                                 
Total params: 16,084
Trainable params: 16,084
Non-trainable params: 0
_________________________________________________________________


In [19]:
callbacks = [
     tf.keras.callbacks.ModelCheckpoint("binary_2gram.keras",save_best_only=True)
    ,tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5)
]

model_Bigram.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=200,
          callbacks=callbacks)
model_Bigram = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model_Bigram.evaluate(binary_2gram_test_ds)[1]:.3f}")

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Test acc: 0.832


## Bigrams with TF-IDF Encoding

You can also add a bit more information to this representation by counting how many times each word or N-gram occurs, that is to say, by taking the histogram of the words over the text:

```
{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
 "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}
```

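The same histogram can be built with `collections.Counter`. This is a small sketch for the example sentence only, not the layer's implementation.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

# Unigrams are the tokens themselves; bigrams pair each token with its successor.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

histogram = Counter(tokens + bigrams)
print(histogram["the"], histogram["the mat"])  # 2 1
```

This is the kind of vector that `output_mode="count"` produces, restricted to the N-grams that made it into the vocabulary.
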
## Understanding TF-IDF normalization
The more a given term appears in a document, the more important that term is for understanding what the document is about. At the same time, the frequency at which the term appears across all documents in your dataset matters too: terms that appear in almost every document (like “the” or “a”) aren’t particularly informative, while terms that appear in only a small subset of documents are distinctive and therefore important.

`TF-IDF` is a metric that fuses these two ideas. It weights a given term by taking “term frequency,” how many times the term appears in the current document, and dividing it by a measure of “document frequency,” which estimates how often the term comes up across the dataset. 

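```python
import math

def tfidf(term, document, dataset):
    # Term frequency: how often the term occurs in this document.
    term_freq = document.count(term)
    # Document frequency: log-scaled count of the term across the whole dataset.
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq
```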
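
As a worked example, the function can be applied to a tiny hand-made corpus. The three token lists below are hypothetical, and documents are represented as lists of words so that `count` matches whole tokens rather than substrings.

```python
import math


def tfidf(term, document, dataset):
    term_freq = document.count(term)
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]

score_the = tfidf("the", docs[0], docs)  # 2 / log(5 + 1)
score_mat = tfidf("mat", docs[0], docs)  # 1 / log(1 + 1)
print(f"the: {score_the:.3f}, mat: {score_mat:.3f}")  # the: 1.116, mat: 1.443
```

Although "the" occurs twice in the first document and "mat" only once, "mat" receives the higher weight because it is rare across the corpus. Note that this simplified formula divides by zero for a term that never occurs in the dataset, since `log(0 + 1)` is 0.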

## Configure `TextVectorization` Layer To Return Token Counts

In [20]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=1000,
    output_mode="count"
)

## Configuring `TextVectorization` To Return TF-IDF-weighted Outputs

In [21]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=1000,
    output_mode="tf_idf",
)

## Build TF-IDF Bigram Model

In [22]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model_tfidf = get_model()
model_tfidf.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 1000)]            0         
                                                                 
 dense_4 (Dense)             (None, 16)                16016     
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 4)                 68        
                                                                 
Total params: 16,084
Trainable params: 16,084
Non-trainable params: 0
_________________________________________________________________


In [23]:
callbacks = [
    tf.keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",save_best_only=True)
   ,tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5)
]

model_tfidf.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=200,
          callbacks=callbacks)
model_tfidf = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model_tfidf.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Test acc: 0.813


In [24]:
inputs = tf.keras.Input(shape=(1,), dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model_tfidf(processed_inputs)
inference_model = tf.keras.Model(inputs, outputs)

In [25]:
inference_model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_3 (TextV  (None, 1000)             1         
 ectorization)                                                   
                                                                 
 model_2 (Functional)        (None, 4)                 16084     
                                                                 
Total params: 16,085
Trainable params: 16,084
Non-trainable params: 1
_________________________________________________________________


In [26]:
raw_text_data = tf.convert_to_tensor(
    [["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data)
print(f"{predictions.numpy()[0][0] * 100:.2f} percent World")
predictions.numpy()[0]

24.58 percent World


array([0.2458025 , 0.23040332, 0.19021425, 0.33357987], dtype=float32)

In [27]:
raw_text_data = tf.convert_to_tensor([['''
ATLANTA -- Atlanta Braves shortstop Rafael Furcal has had his first court appearance
after being arrested on charges of driving under the influence.'
''']])
predictions = inference_model(raw_text_data)
print(f"{predictions.numpy()[0][0] * 100:.2f} percent World")

40.94 percent World


In [28]:
predictions.numpy()

array([[0.40942684, 0.32446882, 0.10614656, 0.15995784]], dtype=float32)
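
Each row of the output holds one probability per class, so the predicted topic is the index of the largest value. The sketch below decodes the two prediction vectors shown above. The label order in `CLASS_NAMES` is an assumption (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech), consistent with the first training example, where a chip article carries label 3. The `decode` helper is hypothetical.

```python
# Assumed label order for ag_news_subset.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def decode(probabilities):
    """Return the most likely class name and its probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[best], probabilities[best]


probs = [0.40942684, 0.32446882, 0.10614656, 0.15995784]
label, confidence = decode(probs)
print(f"{label}: {confidence:.2%}")  # World: 40.94%
```

Under that assumed ordering, the model assigns the baseball story to World rather than Sports, with Sports as the runner-up.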