# <font size=8pt> Vector Embeddings</font>

## <font color='blue'> Probabilistic Language Modeling</font>

**IMPORTANT** The conditional probability rule:

$$\large P(A\cap B) = P(A) \cdot P(B|A)$$

**Goal:** Assign the probability that a sequence of words such as $(w_1,w_2,w_3,...w_n)$ occurs:

$$\text{P}(\text{Sentence})=\text{P}(w_1,w_2,w_3,...w_n)=\text{P}(w_1)\cdot\text{P}(w_2,w_3,w_4...w_n|w_1)$$

Applying the rule repeatedly factors the joint probability into next-word probabilities. Smartphones use this information to predict the next word you will type. For example, for four words:

$$\text{P}(w_1,w_2,w_3,w_4)=\text{P}(w_1)\cdot\text{P}(w_2,w_3,w_4|w_1)=\text{P}(w_1)\cdot\text{P}(w_2|w_1)\cdot\text{P}(w_3,w_4|w_1,w_2)=\text{P}(w_1)\cdot\text{P}(w_2|w_1)\cdot\text{P}(w_3|w_1,w_2)\cdot\text{P}(w_4|w_1,w_2,w_3)$$

where the last factor, $\text{P}(w_4|w_1,w_2,w_3)$, is the probability of word $w_4$ given that the words $w_1, w_2$ and $w_3$ occurred.

<span style="font-family:Calibri; color:blue; font-size:14pt;">Critical thinking:</span> How do we compute these probability values?

<span style="font-family:Calibri; color:red; font-size:14pt;">Reasoning:</span> We compute the frequency of occurrence for different sequences of words.


<span style="font-family:Calibri; color:darkgreen; font-size:12pt;"> For example, P(today | It, is, sunny) = 50%.
The model used to make such predictions is called the “language model”. </span>
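As a minimal sketch of this frequency-based estimate, the snippet below counts how often a context is followed by each word in a tiny made-up corpus. The corpus and the helper name `conditional_probability` are illustrative assumptions, not part of the original material.

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenized.
corpus = [
    ["it", "is", "sunny", "today"],
    ["it", "is", "sunny", "outside"],
    ["it", "is", "cold", "today"],
]

def conditional_probability(word, context, sentences):
    """Estimate P(word | context) as count(context, word) / count(context, any word)."""
    context = tuple(context)
    n = len(context)
    followers = Counter()
    for sentence in sentences:
        # Visit every position where the context is followed by one more word.
        for i in range(len(sentence) - n):
            if tuple(sentence[i:i + n]) == context:
                followers[sentence[i + n]] += 1
    total = sum(followers.values())
    return followers[word] / total if total else 0.0

p = conditional_probability("today", ["it", "is", "sunny"], corpus)
print(p)
```

In this toy corpus the context "it is sunny" appears twice and is followed by "today" once, so the estimate is one half. Longer contexts appear less often in real text, which is why the counts behind such estimates quickly become sparse.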


<span style="font-family:Calibri; color:purple; font-size:14pt;"> Important Concept:</span> The Conditional Probability Rule states that the probability of a sequence of events is the product of all the (conditional) probabilities leading up to the final event.

P(Today, it, was, sunny) = P(Today)  P(it | Today) P(was | Today, it)  P(sunny | Today, it, was)

P(Today, is, the, fifteenth) = P(Today) P(is | Today) P(the | Today, is) P(fifteenth | Today, is, the)

1. Unigram Models:
    a. P(rainy | Today, it, was) ~ P(rainy)
2. Bigram Models:
    a. P(rainy | Today, it, was) ~ P(rainy | was)
3. N-gram models:
    a. Same as the above, but conditioning on the previous N-1 words.
    b. For example a tri-gram: P(rainy | Today, it, was) ~ P(rainy | it, was)

Often used in nested ways (e.g., a 3-gram model combined with a unigram model).
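The bigram approximation can be sketched in a few lines: the probability of a sentence becomes the probability of its first word times the product of P(word | previous word), each estimated from counts. The toy corpus and the function name `bigram_probability` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus; a real model would be estimated from far more text.
corpus = [
    ["today", "it", "was", "rainy"],
    ["today", "it", "was", "sunny"],
    ["yesterday", "it", "was", "rainy"],
]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))
# How often each word appears as the first element of a bigram.
context_counts = Counter(a for s in corpus for a in s[:-1])
total_words = sum(unigrams.values())

def bigram_probability(sentence):
    """Approximate P(sentence) as P(w_1) times the product of P(w_i | w_{i-1})."""
    prob = unigrams[sentence[0]] / total_words
    for prev, word in zip(sentence, sentence[1:]):
        if context_counts[prev] == 0:
            return 0.0  # the previous word was never seen as a context
        prob *= bigrams[(prev, word)] / context_counts[prev]
    return prob

print(bigram_probability(["today", "it", "was", "rainy"]))
```

Any sentence containing a word pair that never occurs in the corpus gets probability zero here, which is one reason n-gram models of different orders are often combined in practice.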


### Evaluating NLP

• The goal of any NLP activity is important in deciding how to evaluate it.

• In a Bag of Words model, evaluation can come from classification accuracy (i.e., you have a training and test dataset).

• But what if you’re writing an algorithm that predicts the next word for a texting app?

### Perplexity: an evaluative measure for NLP

One might expect a model to be good at predicting the word “cold” in this sentence:

“It is cold.”

And not as good at predicting it in:

“It is very cool outside when the winter is cold”

There are a variety of reasons for this; the biggest is the complexity/length of the sentence.

• Perplexity is a measurement of how well a probability model predicts test data. In the context of Natural Language Processing, perplexity is one way to evaluate language models.

• Perplexity is an exponentiation of the entropy.

• Low perplexity is good and high perplexity is bad, since the perplexity is the exponentiation of the *entropy*: the less uncertain the model is about each next word, the lower the perplexity.

• The goal is to minimize Perplexity(W).

Calculation of perplexity for a full sequence of words $W=(w_1,w_2,...,w_n)$:

$$\text{Perplexity}(W)=\sqrt[n]{\prod_{i=1}^{n}\frac{1}{P(w_i|w_{1}w_{2}...w_{i-1})}}$$
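The formula above can be evaluated directly once the per-word conditional probabilities are known. The sketch below assumes those probabilities are already available as a list (the values are invented for illustration) and works in log space, which is equivalent to the n-th root of the product but avoids numerical underflow on long sequences.

```python
import math

def perplexity(conditional_probs):
    """Perplexity from the list of P(w_i | w_1 ... w_{i-1}), one entry per word."""
    n = len(conditional_probs)
    # exp(-(1/n) * sum(log p_i)) equals the n-th root of prod(1 / p_i).
    log_sum = sum(math.log(p) for p in conditional_probs)
    return math.exp(-log_sum / n)

# Hypothetical per-word probabilities from two language models.
confident_model = [0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1]

print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

A model that assigns probability 1/k to every word has perplexity k, so perplexity can be read as the effective number of equally likely choices the model faces at each step.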

Important applications for Natural Language Processing:

• Sentiment Analysis

• Speech Recognition

• Information Retrieval

• Question Answering

<span style="font-family:Calibri; color:blue; font-size:14pt;">Big Idea:</span> Represent words as vectors, using algorithms such as GloVe and Word2Vec (Word2Vec trains a shallow neural network, while GloVe fits a regression model to word co-occurrence counts).

## Global Vectors for Word Representations (GloVe)

Reference: https://nlp.stanford.edu/projects/glove/

Example for using the word vectors:  <font color='red'>king - man + woman ≈ queen.</font>

The main idea is that we can do more than just count occurrences: we can represent the words in the vocabulary of a language as vectors whose entries are real numbers. To this end, the GloVe algorithm analyses word *co-occurrences* within a text corpus; the steps are as follows:

1.   A *co-occurrence* matrix $X$ is created whose entries $X_{ij}$ represent how often word $i$ is present in the context of word $j$. The corpus is parsed to build the matrix $X$, and the model is then constructed from this matrix.
2.   For the words $i$ and $j$ we create vectors $\vec{w}_i$ and $\vec{w}_j$ such that $$\vec{w}_i^T\cdot\vec{w}_j+b_i+b_j=\log (X_{ij})$$ where $b_i$ and $b_j$ are bias terms (i.e. intercept terms for a regression model). We want to build word vectors that retain useful information of how words $i$ and $j$ co-occur.
3.   In order to determine the entries for the $\vec{w}_i$, we minimize the following objective function $$J:=\sum_{i=1}^{V}\sum_{j=1}^{V}f(X_{ij}) \left(\vec{w}_i^T\cdot\vec{w}_j+b_i+b_j-\log (X_{ij})\right)^2$$
4.   The function $f$ is chosen in order to prevent the skewing of the objective function by the words that co-occur too often. In this sense a choice for the function $f$ could be $$f(X_{ij}):=\begin{cases}
\left(\frac{X_{ij}}{x_{max}}\right)^{\alpha} & \text{if} \;\; X_{ij}<x_{max} \\
1 & \text{otherwise}
\end{cases}
$$ where $\alpha$ and $x_{max}$ can be adjusted by the user.
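The weighting function and the objective $J$ from the steps above can be sketched with NumPy. This only evaluates the objective for given vectors; it does not perform the minimization. The function names, the default values `x_max=100` and `alpha=0.75`, and the random toy inputs are assumptions chosen for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: (x / x_max)**alpha below x_max, 1 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_objective(W, b, X, x_max=100.0, alpha=0.75):
    """Objective J for word vectors W (V x d), biases b (V,), counts X (V x V)."""
    mask = X > 0                              # log(X_ij) is undefined for zero counts
    log_X = np.log(np.where(mask, X, 1.0))    # placeholder 1.0 where the count is zero
    residual = W @ W.T + b[:, None] + b[None, :] - log_X
    # f(0) = 0, so zero-count pairs contribute nothing; the mask makes that explicit.
    return float(np.sum(glove_weight(X, x_max, alpha) * mask * residual ** 2))

# Hypothetical toy problem: 4 words, 2-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.integers(0, 200, size=(4, 4)).astype(float)
W = rng.normal(size=(4, 2))
b = np.zeros(4)
print(glove_objective(W, b, X))
```

Because $f$ saturates at 1 for counts at or above $x_{max}$, very frequent pairs cannot dominate the sum, and pairs that never co-occur drop out entirely.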






# Using pre-trained word embeddings

Reference:
**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2020/05/05<br>
**Last modified:** 2020/05/05<br>
**Description:** Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.

## Setup

In [39]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

## Introduction

In this example, we show how to train a text classification model that uses pre-trained
word embeddings.

We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages
belonging to 20 different topic categories.

For the pre-trained word embeddings, we'll use
[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).

## Download the Newsgroup20 data

In [40]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

## Let's take a look at the data

In [41]:
import os
import pathlib

data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "rec.motorcycles")
print("Number of files in rec.motorcycles:", len(fnames))
print("Some example filenames:", fnames[:15])

Number of directories: 20
Directory names: ['comp.os.ms-windows.misc', 'alt.atheism', 'comp.sys.ibm.pc.hardware', 'rec.sport.baseball', 'sci.electronics', 'comp.sys.mac.hardware', 'talk.politics.misc', 'talk.politics.mideast', 'talk.politics.guns', 'talk.religion.misc', 'rec.sport.hockey', 'sci.space', 'sci.crypt', 'soc.religion.christian', 'misc.forsale', 'rec.autos', 'sci.med', 'comp.windows.x', 'rec.motorcycles', 'comp.graphics']
Number of files in rec.motorcycles: 1000
Some example filenames: ['104343', '104615', '104653', '103170', '104487', '104755', '104767', '104802', '104519', '104359', '105214', '104765', '104435', '104870', '105064']


Here's an example of what one file contains:

In [42]:
print(open(data_dir / "rec.motorcycles" / "104521").read())

Newsgroups: rec.motorcycles
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!howland.reston.ans.net!agate!doc.ic.ac.uk!syma!tafi3
From: tafi3@syma.sussex.ac.uk (Ian Deeley)
Subject: Re: CB750 C with flames out the exhaust!!!!---->>>
Message-ID: <1993Apr20.133712.10909@syma.sussex.ac.uk>
Organization: University of Sussex
References: <C5quw0.Btq@ux1.cso.uiuc.edu>
Date: Tue, 20 Apr 1993 13:37:12 GMT
Lines: 25

From article <C5quw0.Btq@ux1.cso.uiuc.edu>, by mikeh@ux1.cso.uiuc.edu (Mike Hollyman):
> Hi, I have an 82 CB750 Custom that I just replaced the cylinder head gasket
> on.  Now when I put it back together again, it wouldn't idle at all.  It was
> only running on 2-3 cylinders and it would backfire and spit flames out the
> exhaust on the right side.  The exhaust is 4-2 MAC.  I bought new plugs
> today and it runs very rough and still won't idle.  I am quite sure the fine
> tune knobs on the carbs are messed up.  

As you can see, there are header lines that are leaking the file's category, either
explicitly (the first line is literally the category name), or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers:

In [43]:
# automated pre-processing of Newsgroups Dataset
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
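    for fname in fnames:
        fpath = dirpath / fname
        # Read inside a context manager so the file handle is closed
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        lines = content.split("\n")
        # Drop the first 10 lines, which hold the headers
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)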
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.ha

In [15]:
len(samples)

19997

There's actually one category that doesn't have the expected number of files, but the
difference is small enough that the problem remains a balanced classification problem.

## Shuffle and split the data into training & validation sets

In [44]:
# Shuffle the data
seed = 1234
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.25
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
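The cell above keeps samples and labels aligned by reseeding the generator before each shuffle, so both lists receive the same permutation. An alternative sketch, shown here on hypothetical toy lists rather than the real data, draws one permutation and applies it to both lists, which makes the alignment explicit.

```python
import numpy as np

# Hypothetical stand-ins for the `samples` and `labels` lists.
toy_samples = ["doc a", "doc b", "doc c", "doc d", "doc e"]
toy_labels = [0, 1, 2, 3, 4]

rng = np.random.RandomState(1234)
order = rng.permutation(len(toy_samples))   # one shared ordering

shuffled_samples = [toy_samples[i] for i in order]
shuffled_labels = [toy_labels[i] for i in order]

# Each sample still sits next to its own label.
pairs = list(zip(shuffled_samples, shuffled_labels))
print(pairs)
```

Either approach works; the single-permutation version simply avoids relying on two generators staying in lockstep.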

In [45]:
train_samples[0]

"NNTP-Posting-Host: piaget.phys.ksu.edu\n\nIn <C5qGF5.K2I@alta-oh.com> chris@zeus.alta-oh.com (Chris Murphy) writes:\n\n>In article <FULL_GL.93Apr18005752@dolphin.pts.mot.com>, full_gl@pts.mot.com (Glen Fullmer) writes:\n>|> Looking for a graphics/CAD/or-whatever package on a X-Unix box that will\n>|> take a file with records like:\n\n>Hi,\n>  See Roger Grywalski's response to :\n\n>Re: Help on network visualization\n\n>in comp.graphics.visualization.\n\nCould someone please post Roger Grywalski's response?  Or point me to where\nI could find it?\n\nThanks a lot,\n\n\nS. Raj Chaudhury\t\t\t|\nDept. of Physics    \t\t\t|  raj@phys.ksu.edu\nKansas State University\t\t\t|\nManhattan, KS 66506\t\t\t|\n--\nS. Raj Chaudhury\t\t\t|\nDept. of Physics    \t\t\t|  raj@phys.ksu.edu\nKansas State University\t\t\t|\nManhattan, KS 66506\t\t\t|\n"

## Create a vocabulary index

Let's use the `TextVectorization` layer to index the vocabulary found in the dataset.
Later, we'll use the same layer instance to vectorize the samples.

Our layer will only consider the top 30,000 words, and will truncate or pad sequences to
be exactly 500 tokens long.

In [46]:
from tensorflow.keras.layers import TextVectorization

vectorizer = TextVectorization(max_tokens=30000, output_sequence_length=500)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

You can retrieve the computed vocabulary used via `vectorizer.get_vocabulary()`. Let's
print the top 100 words:

In [47]:
vectorizer.get_vocabulary()[:100]

['',
 '[UNK]',
 'the',
 'to',
 'of',
 'a',
 'and',
 'in',
 'is',
 'i',
 'that',
 'it',
 'for',
 'you',
 'this',
 'on',
 'be',
 'not',
 'are',
 'have',
 'with',
 'as',
 'or',
 'if',
 'but',
 'was',
 'they',
 'from',
 'by',
 'at',
 'an',
 'can',
 'what',
 'my',
 'would',
 'all',
 'there',
 'will',
 'one',
 'writes',
 'do',
 'about',
 'we',
 'so',
 'your',
 'has',
 'he',
 'article',
 'no',
 'any',
 'me',
 'some',
 'who',
 'which',
 'its',
 'were',
 'dont',
 'out',
 'people',
 'when',
 'like',
 'just',
 'more',
 'their',
 '1',
 'know',
 'other',
 'only',
 'them',
 'up',
 'get',
 'how',
 'than',
 'had',
 'lines',
 'been',
 'think',
 '2',
 'his',
 'also',
 'does',
 'then',
 'use',
 'time',
 'these',
 'im',
 'should',
 'could',
 'well',
 'may',
 'good',
 'because',
 'us',
 'even',
 'am',
 'now',
 'new',
 'see',
 'very',
 'into']

Let's vectorize a test sentence:

In [48]:
output = vectorizer([["This is Data 301."]])
output.numpy()[0, :]

array([  14,    8,  198, 4692,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [57]:
len(np.array(output)[0])

500

In the vocabulary listing above, "the" sits at index 2. Why not 0, given that "the" was the first
word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
reserved for "out of vocabulary" tokens. The padding index 0 is what fills the rest of the 500-token output above.

Here's a dict mapping words to their indices:

In [58]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [59]:
len(voc)

30000

Looking words up in this dict gives the same indices the vectorizer assigns; for example, "the" maps to 2:

In [60]:
test = ["the", "cat", "jumped", "on", "the", "hat"]
[word_index[w] for w in test]

[2, 3718, 6557, 15, 2, 4214]
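To make the roles of the two reserved indices concrete, here is a rough pure-Python imitation of what the vectorizer does with a sentence. It is only a sketch: the `encode` helper, its punctuation rule, and the miniature vocabulary are assumptions for illustration, and the real layer's standardization may differ in details.

```python
import re

PAD_INDEX = 0   # reserved for padding
OOV_INDEX = 1   # reserved for out-of-vocabulary tokens

def encode(text, word_index, sequence_length):
    """Lowercase, strip punctuation, split on whitespace, map to indices, pad or truncate."""
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    ids = [word_index.get(tok, OOV_INDEX) for tok in tokens][:sequence_length]
    return ids + [PAD_INDEX] * (sequence_length - len(ids))

# Hypothetical miniature vocabulary laid out like `voc`.
toy_voc = ["", "[UNK]", "the", "to", "of", "is", "this"]
toy_index = dict(zip(toy_voc, range(len(toy_voc))))

encoded = encode("This is the END.", toy_index, sequence_length=6)
print(encoded)
```

The word "end" is missing from the toy vocabulary, so it maps to index 1, and the two unused positions at the end are filled with index 0.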

## Load pre-trained word embeddings

Let's download pre-trained GloVe embeddings (an 822M zip file).

You'll need to run the following commands:

In [9]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2023-12-07 12:25:55--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-12-07 12:25:56--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-12-07 12:25:56--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

The archive contains text-encoded vectors of various sizes: 50-dimensional,
100-dimensional, 200-dimensional, 300-dimensional. We'll use the 300D ones.

Let's make a dict mapping words (strings) to their NumPy vector representation:

In [61]:
path_to_glove_file = "glove.6B.300d.txt"

embeddings_index = {}
# Each line of the GloVe text file holds one word followed by its coefficients
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.


In [66]:
vu = embeddings_index['love']
vw = embeddings_index['friendship']

In [63]:
vu

array([ 4.6443e-01,  3.7730e-01, -2.1459e-01, -5.0768e-01, -2.4576e-01,
        8.1340e-02,  1.0145e-01,  2.5155e-01, -3.6152e-01, -1.6030e+00,
        2.8219e-01,  3.6653e-01,  4.4611e-01,  2.7950e-01,  4.7722e-02,
        3.0087e-01, -1.6226e-01, -2.6055e-02, -2.6815e-01, -4.6282e-01,
        2.5012e-01,  6.0389e-01,  1.5111e-01, -6.2823e-02, -9.6755e-02,
       -3.0548e-01, -1.1376e-01,  5.3914e-01,  1.0966e-01, -7.0618e-01,
       -6.6316e-01,  4.3559e-01, -4.8631e-02,  2.7755e-01, -4.8685e-01,
        1.1938e-01, -5.4538e-01, -2.9563e-01,  3.4470e-02,  5.3187e-01,
       -1.5880e-03,  4.1692e-01, -2.0742e-01, -3.7833e-02,  4.3333e-01,
        4.7521e-02,  8.3507e-01, -6.5088e-02, -2.9974e-01,  4.7139e-03,
        1.2339e-01, -5.0660e-01,  2.5870e-01,  2.1264e-01,  1.9132e-01,
        5.4204e-01, -1.1385e-01, -4.2384e-01, -2.7808e-01, -1.5105e-01,
       -6.2104e-01,  2.7678e-01, -5.4974e-02,  1.8479e-02, -1.1744e-01,
        3.3029e-01, -3.5251e-01, -2.1953e-01,  5.5140e-02,  1.69

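To compare two embeddings we need inner products and magnitudes. The inner product of a vector $\vec{u}$ with itself is

$$\large \langle\vec{u},\vec{u}\rangle = (u_1)^2 + (u_2)^2 + ... +(u_n)^2$$

So the magnitude of $\vec{u}$ is

$$\large \|\vec{u}\| = \sqrt{\langle\vec{u},\vec{u}\rangle} $$

The next cell divides the inner product of the two embeddings by the product of their magnitudes, which gives the cosine of the angle between them: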

In [67]:
vu.dot(vw)/(np.sqrt(vu.dot(vu))*np.sqrt(vw.dot(vw)))

0.49185005

$$\large \langle\vec{u},\vec{w}\rangle = \cos(\theta)\cdot \|\vec{u}\|\cdot \|\vec{w}\|$$

In [None]:
# the actual angle (in radians) between the two vector embeddings is
np.arccos(vu.dot(vw)/(np.sqrt(vu.dot(vu))*np.sqrt(vw.dot(vw))))

1.0321491
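The two computations above can be wrapped into small reusable helpers. This is a sketch using toy vectors; the function names are assumptions, and with the real data the arguments would be embeddings such as `vu` and `vw`.

```python
import numpy as np

def cosine_similarity(u, w):
    """Cosine of the angle between u and w: <u, w> / (||u|| * ||w||)."""
    return float(np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w)))

def angle_between(u, w):
    """Angle in radians; clipping guards against round-off outside [-1, 1]."""
    return float(np.arccos(np.clip(cosine_similarity(u, w), -1.0, 1.0)))

# Toy vectors (hypothetical) standing in for two word embeddings.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(cosine_similarity(a, b))
print(angle_between(a, b))
```

Cosine similarity ignores vector length and depends only on direction, which is why it is a common choice for comparing word embeddings.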

Now, let's prepare a corresponding embedding matrix that we can use in a Keras
`Embedding` layer. It's a simple NumPy matrix where the entry at index `i` is the pre-trained
vector for the word of index `i` in our `vectorizer`'s vocabulary.

In [68]:
num_tokens = len(voc) + 2
embedding_dim = 300
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))


Converted 25083 words (4917 misses)


In [30]:
embedding_matrix.shape

(30002, 300)

In [None]:
embedding_matrix[2]

array([ 4.65600006e-02,  2.13180006e-01, -7.43639981e-03, -4.58539993e-01,
       -3.56389992e-02,  2.36430004e-01, -2.88360000e-01,  2.15210006e-01,
       -1.34859994e-01, -1.64129996e+00, -2.60910004e-01,  3.24340016e-02,
        5.66210002e-02, -4.32960019e-02, -2.16719992e-02,  2.24759996e-01,
       -7.51290023e-02, -6.70180023e-02, -1.42470002e-01,  3.88250016e-02,
       -1.89510003e-01,  2.99769998e-01,  3.93049985e-01,  1.78870007e-01,
       -1.73429996e-01, -2.11779997e-01,  2.36169994e-01, -6.36809990e-02,
       -4.23180014e-01, -1.16609998e-01,  9.37540010e-02,  1.72959998e-01,
       -3.30729991e-01,  4.91120011e-01, -6.89949989e-01, -9.24620032e-02,
        2.47419998e-01, -1.79910004e-01,  9.79079977e-02,  8.31179991e-02,
        1.52989998e-01, -2.72760004e-01, -3.89339998e-02,  5.44529974e-01,
        5.37370026e-01,  2.91049987e-01, -7.35139987e-03,  4.78800014e-02,
       -4.07599986e-01, -2.67590005e-02,  1.79189995e-01,  1.09770000e-02,
       -1.09630004e-01, -

Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
update them during training).

In [69]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
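Conceptually, a frozen embedding layer is a table lookup: each integer token index selects the matching row of the embedding matrix. The NumPy sketch below illustrates that idea on a hypothetical 4-token, 3-dimensional matrix; it is not the Keras implementation.

```python
import numpy as np

# Hypothetical 4-token vocabulary with 3-dimensional vectors.
toy_matrix = np.array([
    [0.0, 0.0, 0.0],   # index 0: padding
    [0.0, 0.0, 0.0],   # index 1: out-of-vocabulary
    [0.1, 0.2, 0.3],   # index 2
    [0.4, 0.5, 0.6],   # index 3
])

token_ids = np.array([[2, 3, 1, 0]])   # one sequence of four token indices
embedded = toy_matrix[token_ids]       # shape: (batch, sequence_length, embedding_dim)
print(embedded.shape)
```

With `trainable=False` the rows stay at their pre-trained values, so only the layers stacked on top of the lookup are adjusted during training.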

## Build the model

A simple 1D convnet with global max pooling and a classifier at the end.

In [76]:
from tensorflow.keras import layers

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(256, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(200, activation="relu")(x)
x = layers.Dropout(0.25)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "model_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_9 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 300)         9000600   
                                                                 
 conv1d_6 (Conv1D)           (None, None, 256)         384256    
                                                                 
 max_pooling1d_4 (MaxPoolin  (None, None, 256)         0         
 g1D)                                                            
                                                                 
 conv1d_7 (Conv1D)           (None, None, 256)         327936    
                                                                 
 max_pooling1d_5 (MaxPoolin  (None, None, 256)         0         
 g1D)                                                      
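The summary reports the sequence dimension as `None` because the input length is left unspecified, but the arithmetic is easy to trace for the 500-token sequences produced by the vectorizer. The sketch below assumes the Keras defaults used in the model (stride-1 convolutions with "valid" padding, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_length(n, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return n - kernel_size + 1

def maxpool1d_length(n, pool_size):
    """Output length of max pooling with stride equal to the pool size."""
    return (n - pool_size) // pool_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

length = 500                     # output_sequence_length of the vectorizer
trace = [length]
for op, size in [(conv1d_length, 5), (maxpool1d_length, 5),
                 (conv1d_length, 5), (maxpool1d_length, 5),
                 (conv1d_length, 5)]:
    length = op(length, size)
    trace.append(length)

print(trace)
print(conv1d_params(300, 256, 5), conv1d_params(256, 256, 5))
```

The two parameter counts agree with the `Param #` column shown for the first two convolution layers in the summary.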

## Train the model

First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
are right-padded.

In [71]:
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

In [None]:
x_val.shape

(4999, 500)

In [None]:
x_val[0]

We use categorical crossentropy as our loss since we're doing softmax classification.
Moreover, we use `sparse_categorical_crossentropy` since our labels are integers.

In [77]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=80, epochs=50, validation_data=(x_val, y_val))


<keras.src.callbacks.History at 0x7cc8df709db0>
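To make the loss used above concrete, the sketch below computes sparse categorical crossentropy by hand: for each sample it takes the negative log of the probability assigned to the true class, then averages. The function is an illustrative re-implementation on invented numbers, not the Keras code (which, among other things, clips probabilities for numerical stability).

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log(p[true class]) over the batch; labels are integer class ids."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(picked)))

# Hypothetical softmax outputs for 2 samples and 3 classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
labels = [0, 2]
loss = sparse_categorical_crossentropy(probs, labels)

# The same value from one-hot labels, i.e. ordinary categorical crossentropy.
one_hot = np.eye(3)[labels]
loss_one_hot = float(np.mean(-np.sum(one_hot * np.log(probs), axis=1)))
print(loss, loss_one_hot)
```

The "sparse" variant gives the same number as the one-hot version; it just skips building the one-hot matrix, which is convenient when labels are stored as integers as they are here.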

## Export an end-to-end model

Now, we may want to export a `Model` object that takes as input a string of arbitrary
length, rather than a sequence of indices. It would make the model much more portable,
since you wouldn't have to worry about the input preprocessing pipeline.

Our `vectorizer` is actually a Keras layer, so it's simple:

In [79]:
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["Yesterday I had a flat tire."]]
)

class_names[np.argmax(probabilities[0])]



'rec.autos'

In [None]:
class_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']