# Using pre-trained word embeddings with CNN

This code is taken from the Keras documentation (keras.io).

The original file:
https://keras.io/examples/nlp/pretrained_word_embeddings/

## Setup

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

## Introduction

In this example, we show how to train a text classification model that uses pre-trained
word embeddings.

We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages
belonging to 20 different topic categories.

For the pre-trained word embeddings, we'll use
[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).

## Download the Newsgroup20 data

In [2]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

Downloading data from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz


## Let's take a look at the data

In [3]:
import os
import pathlib

data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of directories: 20
Directory names: ['comp.graphics', 'rec.autos', 'talk.politics.mideast', 'talk.politics.guns', 'alt.atheism', 'soc.religion.christian', 'rec.motorcycles', 'talk.politics.misc', 'rec.sport.baseball', 'rec.sport.hockey', 'misc.forsale', 'sci.crypt', 'sci.electronics', 'sci.space', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.os.ms-windows.misc', 'talk.religion.misc', 'comp.windows.x', 'sci.med']
Number of files in comp.graphics: 1000
Some example filenames: ['38530', '38649', '38616', '38626', '38571']


Here's an example of what one file contains:

In [4]:
# read_text() opens and closes the file for us
print((data_dir / "comp.graphics" / "38987").read_text(encoding="latin-1"))

Newsgroups: comp.graphics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!usc!rpi!nason110.its.rpi.edu!mabusj
From: mabusj@nason110.its.rpi.edu (Jasen M. Mabus)
Subject: Looking for Brain in CAD
Message-ID: <c285m+p@rpi.edu>
Nntp-Posting-Host: nason110.its.rpi.edu
Reply-To: mabusj@rpi.edu
Organization: Rensselaer Polytechnic Institute, Troy, NY.
Date: Thu, 29 Apr 1993 23:27:20 GMT
Lines: 7

Jasen Mabus
RPI student

	I am looking for a hman brain in any CAD (.dxf,.cad,.iges,.cgm,etc.) or picture (.gif,.jpg,.ras,etc.) format for an animation demonstration. If any has or knows of a location please reply by e-mail to mabusj@rpi.edu.

Thank you in advance,
Jasen Mabus  



As you can see, there are header lines that leak the file's category, either
explicitly (the first line is literally the category name) or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers:

In [5]:
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
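        # latin-1 decodes any byte sequence, so reading never fails
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        lines = content.split("\n")
        # Drop the first 10 lines, which hold the message headers
        lines = lines[10:]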
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.ha

There's actually one category that doesn't have the expected number of files, but the
difference is small enough that the problem remains a balanced classification problem.
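
One way to check the balance claim is to count the samples per label. The sketch below uses a hypothetical helper, `class_counts`, and toy labels; with the real data you would pass the `labels` and `class_names` lists built above.

```python
from collections import Counter


def class_counts(labels, class_names):
    """Return {class_name: number_of_samples} for integer labels."""
    counts = Counter(labels)
    return {name: counts.get(i, 0) for i, name in enumerate(class_names)}


# Toy data for illustration only (not the real dataset).
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]
toy_labels = [0, 0, 1, 2, 2, 2]
toy_counts = class_counts(toy_labels, toy_names)
print(toy_counts)

# Ratio of the smallest class to the largest; values near 1.0 mean balanced.
balance_ratio = min(toy_counts.values()) / max(toy_counts.values())
print(round(balance_ratio, 3))
```

With the file counts printed above (997 in the smallest class versus 1000 in the others), the same ratio would be 0.997.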

In [6]:
len(samples), len(labels)

(19997, 19997)
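
Dropping a fixed 10 lines is a rough heuristic: a file with more than 10 header lines keeps some of them (a leftover `Approved:` header is visible in one of the training samples printed in this notebook). An alternative, which is not what the original example does, is to cut at the first blank line, since the sample file shown earlier has a blank line between the headers and the body. A minimal sketch, with `strip_headers` as a hypothetical helper name:

```python
def strip_headers(content):
    """Return the text after the first blank line, or the whole text if none."""
    head, sep, body = content.partition("\n\n")
    return body if sep else content


# Shortened version of the sample file shown earlier.
raw = (
    "Newsgroups: comp.graphics\n"
    "Subject: Looking for Brain in CAD\n"
    "Lines: 7\n"
    "\n"
    "Jasen Mabus\n"
    "RPI student\n"
)
print(strip_headers(raw))
```

This sketch assumes `\n` line endings and that every file really has a header block; whether it improves accuracy on this dataset has not been measured here.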

## Shuffle and split the data into training & validation sets

In [7]:
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
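
Re-seeding the generator before each `shuffle` call applies the same permutation to both lists, which keeps every sample paired with its label. An equivalent approach is to shuffle one list of indices and apply it to both lists. The sketch below uses only the standard library; `shuffle_together` and `split_tail` are hypothetical helper names, not part of the original example.

```python
import random


def shuffle_together(samples, labels, seed=1337):
    """Shuffle two equal-length lists with one shared permutation."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    return [samples[i] for i in order], [labels[i] for i in order]


def split_tail(items, validation_split=0.2):
    """Hold out the last `validation_split` fraction of items for validation."""
    num_validation = int(validation_split * len(items))
    cut = len(items) - num_validation
    return items[:cut], items[cut:]


toy_samples = [f"doc{i}" for i in range(10)]
toy_labels = list(range(10))
shuffled_samples, shuffled_labels = shuffle_together(toy_samples, toy_labels)
train_part, val_part = split_tail(shuffled_samples)
print(len(train_part), len(val_part))
```

For the 19,997 samples in this notebook, a 0.2 split holds out `int(0.2 * 19997) = 3999` samples and leaves 15,998 for training.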

In [None]:
train_samples[:3]

["\nIn article <1993Apr21.040956.12823@wuecl.wustl.edu>\nmvs1@cec2.wustl.edu (Michael Virata Sy) writes:\n \n>\n>        Don't forget Paul Ysebaert, ex-Devil.  He's a good team player.\n \n      And Dino Ciccarelli and Ray Sheppard and so on and so on.....\n \n    : )\n \nLaurie Marshall\nWayne State University\nDetroit, Michigan\nGo Wings!!!!\n",
 'Approved: christian@aramis.rutgers.edu\n\nIn article <Apr.21.03.25.41.1993.1322@geneva.rutgers.edu>  \nJBUDDENBERG@vax.cns.muskingum.edu (Jimmy Buddenberg) writes:\n> \n> Hello all.  We are doing a bible study (at my college) on Revelations.  We\n> have been doing pretty good as far as getting some sort of reasonable\n> interpretation.  We are now on chapters 17 and 18 which talk about the\n> woman on the beast and the fall of Babylon.  I believe the beast is the\n> Antichrist (some may differ but it seems obvious) and the woman represents\n> Babylon which stands for Rome or the Roman Catholic Church.  What are some\n> views on this interpr

## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
be exactly 200 tokens long.

In [8]:
from tensorflow.keras.layers import TextVectorization
vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

You can retrieve the computed vocabulary via `vectorizer.get_vocabulary()`. Let's
print the first 16 entries:

In [9]:
vectorizer.get_vocabulary()[0:16]

['',
 '[UNK]',
 'the',
 'to',
 'of',
 'a',
 'and',
 'in',
 'is',
 'i',
 'that',
 'it',
 'for',
 'you',
 'this',
 'on']

In [10]:
# Size of the vocabulary
len(vectorizer.get_vocabulary())

20000

Let's vectorize a test sentence:

In [11]:
output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

array([   2, 3596, 1735,   15,    2, 6994])

In [12]:
output.numpy()[0, :16]

array([   2, 3596, 1735,   15,    2, 6994,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0])

As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens.
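
To make the two reserved indices concrete, here is a plain-Python sketch that mimics the lookup, padding, and truncation steps. It is only an approximation of `TextVectorization` (it lowercases and splits on whitespace, and does not strip punctuation), and `encode` is a hypothetical helper name. The toy index reuses the word indices printed above.

```python
PAD_INDEX = 0  # reserved for padding
OOV_INDEX = 1  # reserved for out-of-vocabulary tokens


def encode(text, word_index, sequence_length=200):
    """Map words to indices, then truncate or right-pad to a fixed length."""
    tokens = text.lower().split()
    ids = [word_index.get(token, OOV_INDEX) for token in tokens]
    ids = ids[:sequence_length]
    return ids + [PAD_INDEX] * (sequence_length - len(ids))


toy_index = {
    "": 0, "[UNK]": 1, "the": 2, "on": 15,
    "sat": 1735, "cat": 3596, "mat": 6994,
}
print(encode("the cat sat on the mat", toy_index, sequence_length=8))
# [2, 3596, 1735, 15, 2, 6994, 0, 0]
print(encode("the dog", toy_index, sequence_length=4))
# [2, 1, 0, 0]
```

The word "dog" is missing from the toy index, so it maps to 1, and the unused positions at the end are filled with 0.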

Here's a dict mapping words to their indices:

In [13]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

As you can see, we obtain the same encoding as above for our test sentence:

In [14]:
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

[2, 3596, 1735, 15, 2, 6994]

## Load pre-trained word embeddings

Let's download pre-trained GloVe embeddings (an 822 MB zip file).

You'll need to run the following commands:

```
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
```

The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:

In [15]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2023-03-23 01:04:41--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-03-23 01:04:41--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-03-23 01:04:42--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

In [16]:
os.listdir()

['.config',
 'glove.6B.200d.txt',
 'glove.6B.300d.txt',
 'glove.6B.100d.txt',
 'glove.6B.zip',
 'glove.6B.50d.txt',
 'sample_data']

In [17]:
#!head -10 glove.6B.50d.txt
!wc -l glove.6B.100d.txt

400000 glove.6B.100d.txt


In [19]:
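path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
# Set the encoding explicitly so the result does not depend on the platform default
with open(path_to_glove_file, encoding="utf-8") as f: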
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.
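
Each line of the GloVe text file holds a token followed by its space-separated coefficients, which is why the loop splits once and parses the remainder as numbers. The sketch below shows the same parsing on an in-memory toy file using plain Python floats instead of NumPy. The values are made up for illustration, and `load_vectors` is a hypothetical helper name.

```python
import io


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dict."""
    index = {}
    for line in lines:
        word, coefs = line.rstrip("\n").split(maxsplit=1)
        index[word] = [float(value) for value in coefs.split()]
    return index


# Toy 3-dimensional vectors; the real file has 100 values per line.
toy_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
vectors = load_vectors(toy_file)
print(len(vectors), vectors["the"])
```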


In [20]:
embeddings_index["the"]

array([-0.038194, -0.24487 ,  0.72812 , -0.39961 ,  0.083172,  0.043953,
       -0.39141 ,  0.3344  , -0.57545 ,  0.087459,  0.28787 , -0.06731 ,
        0.30906 , -0.26384 , -0.13231 , -0.20757 ,  0.33395 , -0.33848 ,
       -0.31743 , -0.48336 ,  0.1464  , -0.37304 ,  0.34577 ,  0.052041,
        0.44946 , -0.46971 ,  0.02628 , -0.54155 , -0.15518 , -0.14107 ,
       -0.039722,  0.28277 ,  0.14393 ,  0.23464 , -0.31021 ,  0.086173,
        0.20397 ,  0.52624 ,  0.17164 , -0.082378, -0.71787 , -0.41531 ,
        0.20335 , -0.12763 ,  0.41367 ,  0.55187 ,  0.57908 , -0.33477 ,
       -0.36559 , -0.54857 , -0.062892,  0.26584 ,  0.30205 ,  0.99775 ,
       -0.80481 , -3.0243  ,  0.01254 , -0.36942 ,  2.2167  ,  0.72201 ,
       -0.24978 ,  0.92136 ,  0.034514,  0.46745 ,  1.1079  , -0.19358 ,
       -0.074575,  0.23353 , -0.052062, -0.22044 ,  0.057162, -0.15806 ,
       -0.30798 , -0.41625 ,  0.37972 ,  0.15006 , -0.53212 , -0.2055  ,
       -1.2526  ,  0.071624,  0.70565 ,  0.49744 , 

Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where entry at index `i` is the pre-trained
vector for the word of index `i` in our `vectorizer`'s vocabulary.

In [22]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index keep their all-zeros row.
        # This includes the representation for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Converted 17956 words (2044 misses)
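
The hit and miss counts can be read as a coverage figure: the share of vocabulary entries that receive a pre-trained vector. Below is a small sketch with a hypothetical helper, `embedding_coverage`, run on toy data; the last lines redo the arithmetic for the counts reported above.

```python
def embedding_coverage(vocabulary, embeddings_index):
    """Return (fraction of words with a pre-trained vector, list of missing words)."""
    missing = [word for word in vocabulary if word not in embeddings_index]
    covered = len(vocabulary) - len(missing)
    return covered / len(vocabulary), missing


# Toy vocabulary and embeddings for illustration only.
toy_vocab = ["", "[UNK]", "the", "cat", "xyzzy"]
toy_embeddings = {"the": [0.1], "cat": [0.2]}
fraction, missing = embedding_coverage(toy_vocab, toy_embeddings)
print(fraction, missing)

# Counts reported by the notebook run above: 17956 hits, 2044 misses.
reported_coverage = 17956 / (17956 + 2044)
print(f"{reported_coverage:.2%}")
```

Roughly 90% of the 20,000 vocabulary entries have a GloVe vector. The remaining rows stay at zero and, because the layer is frozen, they are not updated during training.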


Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
update them during training).

In [23]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

## Build the model

A simple 1D convnet with global max pooling and a classifier at the end.

In [24]:
from tensorflow.keras import layers

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         2000200   
                                                                 
 conv1d (Conv1D)             (None, None, 128)         64128     
                                                                 
 max_pooling1d (MaxPooling1D  (None, None, 128)        0         
 )                                                               
                                                                 
 conv1d_1 (Conv1D)           (None, None, 128)         82048     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, None, 128)        0         
 1D)                                                         
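
The summary reports `None` for the sequence dimension because the input is declared with `shape=(None,)`. Since the vectorizer pads or truncates every sample to 200 tokens, the actual lengths and the parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for these layers (`padding="valid"`, convolution stride 1, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_length(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1


def maxpool1d_length(length, pool_size):
    """Output length of MaxPooling1D with stride == pool_size, 'valid' padding."""
    return (length - pool_size) // pool_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


length = 200  # sequence length produced by the vectorizer
trace = [length]
for kind in ["conv", "pool", "conv", "pool", "conv"]:
    if kind == "conv":
        length = conv1d_length(length, 5)
    else:
        length = maxpool1d_length(length, 5)
    trace.append(length)

print(trace)                        # [200, 196, 39, 35, 7, 3]
print(20002 * 100)                  # embedding: 2000200
print(conv1d_params(5, 100, 128))   # conv1d: 64128
print(conv1d_params(5, 128, 128))   # conv1d_1: 82048
```

The three parameter counts match the `Param #` column of the summary. After the last convolution only 3 time steps remain, and `GlobalMaxPooling1D` reduces them to a single 128-dimensional vector.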

## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.

In [25]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

In [26]:
x_train.shape, y_train.shape, x_val.shape, y_val.shape

((15998, 200), (15998,), (3999, 200), (3999,))

We use categorical crossentropy as our loss since we're doing softmax classification.
Moreover, we use `sparse_categorical_crossentropy` since our labels are integers.
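
The two losses compute the same quantity and differ only in how the label is encoded. The plain-Python sketch below illustrates this for a single sample. It ignores details of the Keras implementation such as numerical clipping and batch averaging, and the function names simply mirror the loss names for readability.

```python
import math


def sparse_categorical_crossentropy(label, probs):
    """Label is an integer class index."""
    return -math.log(probs[label])


def categorical_crossentropy(one_hot, probs):
    """Label is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = [0.7, 0.2, 0.1]  # toy softmax output for 3 classes
sparse_loss = sparse_categorical_crossentropy(0, probs)
dense_loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)
print(sparse_loss, dense_loss)

# Loss of a uniform guess over 20 classes, a useful baseline.
chance_loss = math.log(20)
print(round(chance_loss, 3))
```

A model that guesses uniformly over the 20 newsgroups would score about 3.0, so a training loss below that value indicates the model has learned something.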

In [27]:
y_train[:5]

array([10, 15, 18,  5, 15])

In [28]:
x_train[100]

array([ 1349,  3581,   418,  4569,    75,   255,     7,     1,     1,
          40,   710,  3809,     9,    96,   332,    12,     2,   271,
           4,  5191,     7,     1,    54,    46,  1939,   705,     9,
          96,  1372,     3,  1496,     5,   187,    12,   370,     7,
           1,    41,  2724,     1,     9,    96,   395,     7,  6476,
        2724,  1258,     6,  2437,  4363,  1317,   193,    25,     9,
          57,    19,    49,   159,    41,     1,  5191,    31,   575,
        2079,     5,    90,   171,     7,     1,   499,    49,   856,
          37,    16,  9665,  7894,     1,  6041,    88,     9,  4417,
          57,    65,    23,    26,    18,    90,   306,    22,     1,
          25,    36,     8,    30,  1939,  1235,   124,    29,   834,
           1,   834,     1,   171,     1, 17449,     1,  1918,   434,
        7577,     1,     1,  5991,     8,  5334,  4891,   580,   624,
          57,    19,    30,   150,   426,  4456,   381,     0,     0,
           0,     0,

In [29]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)

hist = model.fit(
    x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [31]:
hist.history["loss"]

[2.6638693809509277,
 1.9270107746124268,
 1.496738076210022,
 1.2749024629592896,
 1.1124058961868286,
 0.988091766834259,
 0.8736910820007324,
 0.7689957618713379,
 0.6734247207641602,
 0.5944406986236572,
 0.5224837064743042,
 0.4658552408218384,
 0.3954528868198395,
 0.3521294891834259,
 0.31205713748931885,
 0.27565211057662964,
 0.252868115901947,
 0.22124083340168,
 0.20364879071712494,
 0.2045324295759201]

# TASK 1

A) **Inference**

* So far, we have learned how to train a model, but not how to apply it to a new piece of text. Applying a trained machine learning model to new data to generate an output or prediction is called inference.

Write a function **inference(model, textual_input)** that performs inference with the CNN-based model trained above on pre-trained word embeddings.

 

In [32]:
def inference(model, textual_input):
    # Vectorize the input with the same TextVectorization layer that was used during training
    vectorized_input = vectorizer(np.array([textual_input])).numpy()
    prediction = model.predict(vectorized_input)
    predicted_class = class_names[np.argmax(prediction[0])]
    return predicted_class


In [33]:
# Inference function Example 1

textual_input = "It doesn\'t take much looking to see that the U.S. is in a  \nstate of moral decay."
predicted_class = inference(model, textual_input)
print("Predicted class:", predicted_class)


Predicted class: soc.religion.christian


In [39]:
# Inference function Example 3

textual_input = "I am interested in learning more about Natural Language Processing."
predicted_class = inference(model, textual_input)
print("Predicted class:", predicted_class)


Predicted class: comp.graphics


In [40]:
# Example usage of inference function

# new textual instance-4

textual_input = "This is a test document about sports"
predicted_class = inference(model, textual_input)
print("Predicted class:", predicted_class)

Predicted class: rec.sport.hockey


In [37]:
# Another example for Inference function 

textual_inputs = [
    "I am interested in learning more about Natural Language Processing",
    "You don't get a new trial because you screwed up and\nforgot to call all of your witnesses.",
    "I love playing sports and being active",
]

for input_text in textual_inputs:
    predicted_class = inference(model, input_text)
    print("Input text:", input_text)
    print("Predicted class:", predicted_class)
    print("")

Input text: I am interested in learning more about Natural Language Processing
Predicted class: comp.graphics

Input text: You don't get a new trial because you screwed up and
forgot to call all of your witnesses.
Predicted class: talk.politics.guns

Input text: I love playing sports and being active
Predicted class: rec.sport.baseball
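
The `inference` function returns only the single most likely class, and the model must always pick one of the 20 newsgroups even for text that fits none of them (the NLP sentence lands in `comp.graphics`). Looking at the top few probabilities shows how confident a prediction is. The sketch below ranks a probability vector; `top_k` is a hypothetical helper and the probabilities are made up for illustration.

```python
def top_k(probabilities, class_names, k=3):
    """Return the k (class_name, probability) pairs with the highest probability."""
    ranked = sorted(
        zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:k]


# Toy softmax output over four classes.
toy_names = ["alt.atheism", "comp.graphics", "sci.space", "rec.autos"]
toy_probs = [0.05, 0.60, 0.25, 0.10]
for name, prob in top_k(toy_probs, toy_names, k=2):
    print(f"{name}: {prob:.2f}")
```

With the trained model, `probabilities` would be the first row of `model.predict(...)` and `class_names` the list built while loading the data.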



B) **GRADIO** 

Take a look at gradio and build a demo for your model with a user-friendly web interface so that we can use it. 

https://gradio.app/

In [38]:
!pip install gradio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gradio
  Downloading gradio-3.23.0-py3-none-any.whl (15.8 MB)
Collecting mdit-py-plugins<=0.3.3
  Downloading mdit_py_plugins-0.3.3-py3-none-any.whl (50 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting orjson
  Downloading orjson-3.8.8-cp39-cp39-manylinux_2_28_x86_64.whl (143 kB)
Collecting httpx
  Downloading 

In [41]:
# your gradio code goes here

import gradio as gr


def predict_category(text):
    prediction = inference(model, text)
    return prediction


iface = gr.Interface(
    fn=predict_category,
    # gr.inputs / gr.outputs are deprecated; use the top-level component instead
    inputs=gr.Textbox(label="Enter text here"),
    outputs=gr.Textbox(label="Predicted category"),
    title="Text Classification Demo",
    description="Enter some text and see which category it belongs to.",
)

# Launch the interface
iface.launch()




Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.





In [None]:
# create a public link


import gradio as gr

def predict_category(text):
    prediction = inference(model, text)
    return prediction

iface = gr.Interface(
    fn=predict_category,
    # gr.inputs / gr.outputs are deprecated; use the top-level component instead
    inputs=gr.Textbox(label="Enter text here"),
    outputs=gr.Textbox(label="Predicted category"),
    title="Text Classification Demo",
    description="Enter some text and see which category it belongs to.",
)

iface.launch(share=True)




Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://8addf643791e4b8dea.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces




In [42]:
TASK 3