In [56]:
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, SimpleRNN, GRU, LSTM, Bidirectional, Dense, Embedding, Attention

# Neural Networks for Language Processing

NLP (Natural Language Processing)
+
NLU (Natural Language Understanding /semantics/)
=>
NLG (Natural Language Generation)

## __Time-Dependent Model Examples__

Usage: speech recognition (audio → transcript; encoder for audio and decoder for text), machine translation (text in one language → text in another language), activity recognition (video → activity type, e.g. walking), sentiment analysis (text → sentiment), generation (music generation, voice cloning, and text summarization).

"Standard" models -> 𝑦 = 𝑓(𝑥)

Recurrent models -> 𝑦 = 𝑓(𝑥, 𝑠), where 𝑠 is the current state

In [2]:
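def f(x, current_world_state):
    # Placeholder logic (illustrative only): fold the input into the carried state
    new_world_state = current_world_state + x
    result = new_world_state
    return result, new_world_state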

"State" is a parameter of the function. Here "state" is context.
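As a minimal sketch of how such a stateful function is used on a whole sequence, the loop below passes the state returned by one call into the next call. The names `step` and `run_sequence` and the running-sum logic are assumptions made for illustration; they are not part of any library.

```python
def step(x, state):
    """Toy recurrent function: the state is a running sum of the inputs."""
    new_state = state + x
    result = 2 * new_state  # the output depends on the input AND the carried state
    return result, new_state


def run_sequence(xs, initial_state=0):
    """Apply `step` to each input in order, passing the state forward."""
    state = initial_state
    outputs = []
    for x in xs:
        y, state = step(x, state)
        outputs.append(y)
    return outputs, state


outputs, final_state = run_sequence([1, 2, 3])
print(outputs, final_state)  # [2, 6, 12] 6
```

The same input value can give different outputs depending on what came before it, which is the sense in which "state" acts as context.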

Autoregression: $ y^{<t>} = f(t, y^{<t-1>}) $

In general: $ y^{<t>} = f(t, y^{<t-1>}, y^{<t-2>}, \ldots) $

Unrolled: $ y^{<0>}, \; y^{<1>}(y^{<0>}), \; y^{<2>}(y^{<1>}), \ldots $
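The autoregressive idea, where each output is computed from the previous one, can be sketched in a few lines. The helper name `autoregress` and the update rule `0.5 * y_prev + 1.0` are arbitrary choices for illustration only.

```python
def autoregress(f, y0, steps):
    """Generate y^<0>, y^<1>, ... where each value is computed from the previous one."""
    ys = [y0]
    for t in range(1, steps + 1):
        ys.append(f(t, ys[-1]))  # y^<t> = f(t, y^<t-1>)
    return ys


# Hypothetical update rule, chosen only so the numbers are easy to follow
ys = autoregress(lambda t, y_prev: 0.5 * y_prev + 1.0, y0=0.0, steps=4)
print(ys)  # [0.0, 1.0, 1.5, 1.75, 1.875]
```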

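1-layer NN -> recurrent NN (2 inputs, 2 outputs: the token and the previous state go in; the output and the new state come out)

Number of tokens -> input size

One RNN cell processes one token; a layer applies the cell to every token in the sequence.

Backpropagation runs through the time steps (backpropagation through time).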
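To make the "one cell per token" picture concrete, here is a rough pure-Python sketch of the forward pass of a simple recurrent layer, assuming the usual update $ a^{<t>} = \tanh(W_x x^{<t>} + W_a a^{<t-1>} + b) $. The function name `rnn_forward` and the tiny weight values are made up for the example; Keras stores and initializes its weights differently.

```python
import math


def rnn_forward(xs, W_x, W_a, b):
    """Run one recurrent cell over a sequence of input vectors.

    W_x: units x features, W_a: units x units, b: units.
    Returns the hidden state after every time step.
    """
    units = len(b)
    a = [0.0] * units  # initial state a^<0> = 0
    states = []
    for x in xs:  # the same weights are reused at every time step
        a = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_a[i][k] * a[k] for k in range(units))
                + b[i]
            )
            for i in range(units)
        ]
        states.append(a)
    return states


# Hypothetical sequence: 3 tokens with 2 features each, processed by 2 units
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_x = [[0.5, -0.5], [0.1, 0.2]]
W_a = [[0.0, 0.3], [0.2, 0.0]]
b = [0.0, 0.1]

states = rnn_forward(xs, W_x, W_a, b)
all_steps = states       # one state per token, like return_sequences=True
last_step = states[-1]   # only the final state, like return_sequences=False
```

Because every state depends on the previous one, the gradient of the last state with respect to the weights has to be propagated back through all earlier steps.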

In [12]:
model = Sequential([
    Input((6, 20)), # (batch), steps, features
    SimpleRNN(128, activation = "relu") # activation = None => regression
])

In [9]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_1 (SimpleRNN)    (None, 128)               19072     
                                                                 
Total params: 19072 (74.50 KB)
Trainable params: 19072 (74.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
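The parameter count in the summary can be checked by hand. A simple recurrent layer has an input kernel (features × units), a recurrent kernel (units × units) and a bias (units). The helper `simple_rnn_params` below is a hypothetical name written for this check, not a Keras function.

```python
def simple_rnn_params(units, features, use_bias=True):
    """Trainable parameters of one simple recurrent layer."""
    kernel = features * units          # weights applied to the input token
    recurrent_kernel = units * units   # weights applied to the previous state
    bias = units if use_bias else 0
    return kernel + recurrent_kernel + bias


print(simple_rnn_params(128, 20))  # 19072
```

The count does not depend on the number of time steps (6 here), because the same weights are shared across all steps.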


In [10]:
model_return = Sequential([
    Input((6, 20)), 
    SimpleRNN(128, return_sequences = True) 
])

In [11]:
model_return.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_2 (SimpleRNN)    (None, 6, 128)            19072     
                                                                 
Total params: 19072 (74.50 KB)
Trainable params: 19072 (74.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [13]:
model_return_2 = Sequential([
    Input((6, 20)), 
    SimpleRNN(128, return_sequences = True),
    SimpleRNN(64, return_sequences = True),
])

In [14]:
model_return_2.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_4 (SimpleRNN)    (None, 6, 128)            19072     
                                                                 
 simple_rnn_5 (SimpleRNN)    (None, 6, 64)             12352     
                                                                 
Total params: 31424 (122.75 KB)
Trainable params: 31424 (122.75 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## __Architecture__

"one to one" (standard model)

"one to many" (generative / completion model). Two inputs are mandatory => if there is no second input, it is given as a zero vector.

"many to one" (e.g. sentiment analysis)

"many to many" (e.g. named entity recognition (NER); completion models, LLM)

There can be unused outputs => they are calculated, but we ignore them.

In [19]:
# many to one => one output

model_to_one = Sequential([
    Input((6, 20)), 
    SimpleRNN(128, return_sequences = True),
    SimpleRNN(64, return_sequences = True),
    SimpleRNN(64, return_sequences = True),
    SimpleRNN(64, return_sequences = False),
])

In [20]:
model_to_one.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_14 (SimpleRNN)   (None, 6, 128)            19072     
                                                                 
 simple_rnn_15 (SimpleRNN)   (None, 6, 64)             12352     
                                                                 
 simple_rnn_16 (SimpleRNN)   (None, 6, 64)             8256      
                                                                 
 simple_rnn_17 (SimpleRNN)   (None, 64)                8256      
                                                                 
Total params: 47936 (187.25 KB)
Trainable params: 47936 (187.25 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## __Language Models__

Training: tokenization, simple RNN, output: softmax

First token: $ \tilde{y} = 𝑃(𝑤_1) $ 

Second token: $ \tilde{y} = 𝑃(𝑤_2|𝑤_1) $

General: $ \tilde{y} = 𝑃(𝑤_𝑘|𝑤_1, 𝑤_2, … , 𝑤_{𝑘−1}) $

In [25]:
# encoder that performs classification
model_class = Sequential([
    Input((6, 20)), 
    SimpleRNN(128, return_sequences = True),
    SimpleRNN(64, return_sequences = True),
    SimpleRNN(64, return_sequences = True),
    SimpleRNN(20, return_sequences = True, activation = "softmax"),
])

In [26]:
model_class.summary()

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_26 (SimpleRNN)   (None, 6, 128)            19072     
                                                                 
 simple_rnn_27 (SimpleRNN)   (None, 6, 64)             12352     
                                                                 
 simple_rnn_28 (SimpleRNN)   (None, 6, 64)             8256      
                                                                 
 simple_rnn_29 (SimpleRNN)   (None, 6, 20)             1700      
                                                                 
Total params: 41380 (161.64 KB)
Trainable params: 41380 (161.64 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The model calculates a probability for every possible next token (i.e. which next token is most probable).

Generation: random sampling according to the computed 𝑃.

1. Input $ 𝑥^{<0>} = 0, 𝑎^{<0>} = 0 $; compute $ 𝑎^{<1>} $, $ \tilde{y}^{<1>} $; choose a word $ 𝑤_1 $.
2. Input $ 𝑥^{<1>} ≡ 𝑤_1, 𝑎^{<1>} $; compute $ 𝑎^{<2>}, \tilde{y}^{<2>} $; choose a word $ 𝑤_2 $.
3. … continue until the end-of-sentence token [.] is generated.
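The generation loop can be sketched with a toy stand-in for the trained model. Everything in the `NEXT` table is invented for the example, and for brevity it conditions only on the previous token, whereas a real RNN conditions on all previous tokens through its state.

```python
import random

# Hypothetical next-token probabilities; "<s>" marks the start and "." the end
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, ".": 0.3},
    "dog": {"sleeps": 0.7, ".": 0.3},
    "sleeps": {".": 1.0},
}


def generate(rng, max_len=10):
    """Sample one token at a time and feed it back in until "." is produced."""
    token, sentence = "<s>", []
    while len(sentence) < max_len:
        probs = NEXT[token]
        # draw the next token at random, according to the computed probabilities
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        sentence.append(token)
        if token == ".":
            break
    return sentence


sentence = generate(random.Random(0))
print(" ".join(sentence))
```

Because the next token is sampled rather than taken as the argmax, repeated runs with different seeds can produce different sentences.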

__Vanishing gradients in backpropagation__

Gated cells mitigate vanishing gradients over long sequences:

Gated recurrent unit (GRU)

Long Short-Term Memory (LSTM)

Bi-Directional Networks

In [36]:
model_class = Sequential([
    Input((6, 20)), 
    SimpleRNN(15),
])

In [37]:
model_class.summary()

Model: "sequential_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_34 (SimpleRNN)   (None, 15)                540       
                                                                 
Total params: 540 (2.11 KB)
Trainable params: 540 (2.11 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [38]:
model_bi = Sequential([
    Input((6, 20)), 
    Bidirectional(SimpleRNN(15))
])

In [39]:
model_bi.summary()

Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional (Bidirection  (None, 30)                1080      
 al)                                                             
                                                                 
Total params: 1080 (4.22 KB)
Trainable params: 1080 (4.22 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


There are twice as many outputs and parameters: the forward and the backward RNN each have 15 units, and their outputs are concatenated.

In [41]:
model_bi_sum = Sequential([
    Input((6, 20)), 
    Bidirectional(SimpleRNN(15), merge_mode = "sum")
])

In [42]:
model_bi_sum.summary()

Model: "sequential_17"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_1 (Bidirecti  (None, 15)                1080      
 onal)                                                           
                                                                 
Total params: 1080 (4.22 KB)
Trainable params: 1080 (4.22 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


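The number of parameters is still doubled, although the output has only 15 units: there are still two separate RNNs, and only their outputs are summed.

*Not applicable to forecasting, because the backward RNN needs inputs from the future.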
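The two summaries above can be reproduced with a small calculation. This is only a sketch: `bidirectional_shape_and_params` is a made-up helper, and it assumes the wrapped layer is a simple recurrent layer with a bias and that `merge_mode` is either `"concat"` or one of the element-wise modes (`"sum"`, `"mul"`, `"ave"`).

```python
def simple_rnn_params(units, features):
    """Input kernel + recurrent kernel + bias of one simple recurrent layer."""
    return units * features + units * units + units


def bidirectional_shape_and_params(units, features, merge_mode="concat"):
    """Output width and parameter count of a bidirectional simple RNN."""
    params = 2 * simple_rnn_params(units, features)  # forward + backward copy
    # "concat" stacks both outputs; element-wise modes keep the width unchanged
    out_dim = 2 * units if merge_mode == "concat" else units
    return out_dim, params


print(bidirectional_shape_and_params(15, 20))                    # (30, 1080)
print(bidirectional_shape_and_params(15, 20, merge_mode="sum"))  # (15, 1080)
```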

## __Representing tokens__

In [43]:
model_tokens = Sequential([
    Input((500, 20_000)),  # 500 tokens, vocabulary of 20000 words
    SimpleRNN(15),
])

In [44]:
model_tokens.summary()

Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_37 (SimpleRNN)   (None, 15)                300240    
                                                                 
Total params: 300240 (1.15 MB)
Trainable params: 300240 (1.15 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Only one layer, but about 300,000 parameters.

In [47]:
# vectorizing the input by Dense

model_dense = Sequential([
    Input((500, 20_000)),  # 500 tokens, vocabulary of 20000 words
    Dense(128, use_bias = False),
    SimpleRNN(15),
])

In [48]:
model_dense.summary()

Model: "sequential_19"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 500, 128)          2560000   
                                                                 
 simple_rnn_38 (SimpleRNN)   (None, 15)                2160      
                                                                 
Total params: 2562160 (9.77 MB)
Trainable params: 2562160 (9.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The __Dense__ layer has many more parameters, but in exchange the SimpleRNN layer receives only 128 features per token.

The same projection can be done with __Embedding__, which takes integer token ids instead of one-hot vectors.
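The reason a bias-free Dense layer and an embedding are interchangeable here is that multiplying a one-hot vector by a weight matrix just selects one row of that matrix. A small pure-Python check, with a made-up 3-word vocabulary and 2-dimensional vectors:

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dense_no_bias(vec, W):
    """Multiply a row vector (length V) by a V x d weight matrix."""
    return [sum(vec[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]


# Hypothetical weights: 3 words in the vocabulary, 2 embedding dimensions
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

token_id = 2
dense_out = dense_no_bias(one_hot(token_id, 3), W)  # V multiplications per output
lookup_out = W[token_id]                            # a single row lookup

print(dense_out == lookup_out)  # True
```

Both layers hold the same number of weights (vocabulary × dimensions); the lookup simply avoids building and multiplying the one-hot vectors.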

In [53]:
model_embedding = Sequential([
    Input((20_000,)), # integer token ids, one per position (note: 20_000 here is the sequence length)
    Embedding(20000, 128),
    SimpleRNN(15),
])

In [54]:
model_embedding.summary()

Model: "sequential_21"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20000, 128)        2560000   
                                                                 
 simple_rnn_40 (SimpleRNN)   (None, 15)                2160      
                                                                 
Total params: 2562160 (9.77 MB)
Trainable params: 2562160 (9.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


Embedding performs dimensionality reduction: it maps from a space with one dimension per word to a lower-dimensional space.

Pre-trained embeddings: Word2Vec and GloVe

## __Refinement Algorithms__

_Beam search_

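Similar to generation: $ \tilde{y} = 𝑓(𝑥) $, maximize $ 𝑃(\tilde{y}|𝑥) $. Instead of committing to one token per step, beam search keeps the k most probable partial sequences.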
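A compact sketch of beam search over a toy probability table follows. The table `TABLE`, the helper names and the beam width are all assumptions for illustration. The numbers are chosen so that picking the single best token at each step (beam width 1) misses the most probable full sequence.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {".": 1.0}, "y": {".": 1.0}, "z": {".": 1.0}, "w": {".": 1.0},
}


def next_probs(tokens):
    """Toy model: the distribution depends only on the last token."""
    return TABLE[tokens[-1]]


def beam_search(start="<s>", end=".", beam_width=2, max_len=5):
    beams = [([start], 0.0)]  # (tokens so far, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:  # finished sequences are carried over as-is
                candidates.append((tokens, score))
                continue
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # keep only the `beam_width` most probable partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams


best_tokens, best_score = beam_search(beam_width=2)[0]
greedy_tokens, greedy_score = beam_search(beam_width=1)[0]
print(best_tokens, round(math.exp(best_score), 2))      # ['<s>', 'b', 'z', '.'] 0.36
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['<s>', 'a', 'x', '.'] 0.3
```

Log-probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on long sequences.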

_Multinomial sampling_

Multinomial sampling is often used in tasks like classification or sequence generation, where the model predicts a probability distribution over k classes or outcomes. A sample is drawn from this distribution (e.g., using a softmax output) to select an outcome, commonly applied in reinforcement learning, generative models, or stochastic training strategies.
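As a small illustration of drawing from a softmax output, the sketch below turns three made-up logits into probabilities and checks that the sampled frequencies approach them. The logits, the seed and the number of draws are arbitrary assumptions.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_class(probs, rng):
    """Draw one class index according to `probs` (multinomial sampling)."""
    return rng.choices(range(len(probs)), weights=probs)[0]


rng = random.Random(42)
probs = softmax([2.0, 1.0, 0.0])  # hypothetical model output for k = 3 classes
counts = Counter(sample_class(probs, rng) for _ in range(10_000))
freqs = [counts[k] / 10_000 for k in range(3)]

print([round(p, 3) for p in probs])
print(freqs)
```

Unlike taking the argmax, sampling occasionally picks the less likely classes, in proportion to their probability.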

__Attention__

Attention deals with complicated input. Longer sentences have inherently lower probabilities, so models tend to favor short sentences.
There is no need to know the entire sequence in order to translate it.

It is loosely like a convolution filter, except that the weights are computed from the input instead of being fixed.

Attention is used in translation, image captioning, speech recognition, text summarization, etc.

Scaled dot-product attention:

$ A(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $

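where $ d_k $ is the dimension of the query and key vectors.

Dividing by $ \sqrt{d_k} $ keeps the variance of the scores at about 1 (assuming the components of q and k have unit variance), so the softmax does not saturate.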
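The formula can be followed step by step in plain Python. This is a sketch for small lists of vectors only; the function name `scaled_dot_product_attention` and the example values are chosen for illustration and are not the Keras `Attention` layer.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        # weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical example: one query that matches the first key better
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
print(weights[0])  # the first key gets the larger weight
print(outputs[0])  # the output is pulled towards the first value vector
```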

In [60]:
attention = Attention()

# q = Embedding()(...)
# v = Embedding()(...)

# q_v = attention([q, v])  # the layer takes a list: [query, value]

_Transformers_

Universal transformer: BERT. It is bidirectional, i.e. it works with whole sentences.

Encoder BERT => Decoder GPT (Generative Pre-trained Transformer)