<a href="https://colab.research.google.com/github/ashikshafi08/Learning_Tensorflow/blob/main/Experiments/De_shuffling_text_using_tfa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# De-scrambling text with a sequence-to-sequence model and attention mechanism



In [24]:
# Downloading tensorflow addons 
!pip install tensorflow-addons



In [9]:
!pip install aicrowd-cli
API_KEY = '' 
!aicrowd login --api-key $API_KEY

# Downloading the Dataset
!rm -rf data
!mkdir data
!aicrowd dataset download --challenge de-shuffling-text -j 3 -o data



API Key valid
Saved API Key successfully!
val.csv:   0% 0.00/714k [00:00<?, ?B/s]
val.csv: 100% 714k/714k [00:00<00:00, 1.91MB/s]

train.csv: 100% 7.00M/7.00M [00:00<00:00, 10.9MB/s]
test.csv: 100% 1.83M/1.83M [00:00<00:00, 3.17MB/s]


In [25]:
# Importing all the packages we need 
import tensorflow as tf 
import tensorflow_addons as tfa
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [11]:
# Importing the data 

train_data = pd.read_csv('data/train.csv')
val_data = pd.read_csv('data/val.csv')
test_data = pd.read_csv('data/test.csv')

# Printing out all shapes of our data 
print(f'Shape of the train data: {train_data.shape}')
print(f'Shape of the validation data: {val_data.shape}')
print(f'Shape of the test data: {test_data.shape}')

Shape of the train data: (40001, 2)
Shape of the validation data: (4001, 2)
Shape of the test data: (10000, 3)


In [17]:
# What does our train data look like?
train_data.head()

Unnamed: 0,text,label
0,"presented here Furthermore, naive improved. im...","Furthermore, the naive implementation presente..."
1,vector a in a form vector multidimensional spa...,Those coefficients form a vector in a multidim...
2,compatible of The model with recent is model s...,The model is compatible with a recent model of...
3,but relevance outlined. hemodynamics its based...,"The model is based on electrophysiology, but i..."
4,of transitions lever-like involve reorientatio...,Conformational transitions in macromolecular c...


In [18]:
# Shuffling our train data 
train_data_shuffled = train_data.sample(frac = 1 , random_state = 42)
train_data_shuffled.head() , train_data_shuffled.shape

(                                                    text                                              label
 32824  on work, supervised label image the segmentati...  In our work, we focus on the weakly supervised...
 16298  we small of a for set work, In this features i...  In this work, we propose a small set of featur...
 30180  ($G_h^{Der}$ to factors the contributes $\tau_...  The increment of both factors ($G_h^{Der}$ and...
 6689   new precise particular, for entailment. bounds...  In particular, we provide new precise analytic...
 26893  a these causation Incorporating features, defi...  Incorporating these three features, a definiti...,
 (40001, 2))

In [19]:
# Splitting sentences and labels
train_sentences = train_data_shuffled['text'].to_numpy()
train_labels = train_data_shuffled['label'].to_numpy()

val_sentences = val_data['text'].to_numpy()
val_labels = val_data['label'].to_numpy()

test_sentences = test_data['text'].to_numpy()
test_labels = test_data['label'].to_numpy()


# Checking the shapes 
print(f'Shape of the train sentences: {train_sentences.shape}')
print(f'Shape of the validation sentences: {val_sentences.shape}')
print(f'Shape of the train labels: {train_labels.shape}')
print(f'Shape of the validation labels: {val_labels.shape}')

Shape of the train sentences: (40001,)
Shape of the validation sentences: (4001,)
Shape of the train labels: (40001,)
Shape of the validation labels: (4001,)


In [20]:
# Creating a tf.data.Dataset from our sentences and labels

train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences , train_labels)).shuffle(1000)
val_dataset = tf.data.Dataset.from_tensor_slices((val_sentences , val_labels))

# Batching the training set (the validation set is only prefetched, not batched)
train_dataset = train_dataset.batch(64).prefetch(tf.data.AUTOTUNE)
val_dataset = val_dataset.prefetch(tf.data.AUTOTUNE)

train_dataset , val_dataset

(<PrefetchDataset shapes: ((None,), (None,)), types: (tf.string, tf.string)>,
 <PrefetchDataset shapes: ((), ()), types: (tf.string, tf.string)>)

In [21]:
# Looking at a single batch of train_dataset (only the first 5 texts in the batch)
for scrambled_text , unscrambled_text in train_dataset.take(1):
  print(f'Below is the Scrambled version:\n {scrambled_text[:5]}')
  print('\n----------\n')
  print(f'Below is the Un-Scrambled version:\n {unscrambled_text[:5]}')

Below is the Scrambled version:
 [b'are pignistic through the two subsystems transformation. connected The'
 b'contribute of layer. a all previous to equally the way, layers output This'
 b'combine and we subset algorithm popular this In (Au the article, Probab. simulation Beck,'
 b'valuable this in are situation. particularly Diagnostics'
 b'Duality (for observations and between observables pixels) established. example, is']

----------

Below is the Un-Scrambled version:
 [b'The two subsystems are connected through the pignistic transformation.'
 b'This way, all previous layers contribute equally to the output of a layer.'
 b'In this article, we combine the popular subset simulation algorithm (Au and Beck, Probab.'
 b'Diagnostics are particularly valuable in this situation.'
 b'Duality between observables (for example, pixels) and observations is established.']
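
Each scrambled text above contains the same words as its label, only in a different order. The sketch below reproduces that relationship on one of the printed labels. It assumes the scrambling is a plain random permutation of whitespace-separated words, which the notebook does not confirm, and `scramble_words` is a hypothetical helper, not part of the dataset tooling.

```python
import random


def scramble_words(sentence, seed=0):
    """Return the words of `sentence` in a random order (hypothetical helper)."""
    words = sentence.split()
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    rng.shuffle(words)
    return " ".join(words)


label = "Diagnostics are particularly valuable in this situation."
text = scramble_words(label)

# The scrambled text uses exactly the same words as the label, only reordered.
print(text)
print(sorted(text.split()) == sorted(label.split()))
```

The model's job is the inverse mapping: given `text`, recover `label`.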


In [28]:
train_dataset

<PrefetchDataset shapes: ((None,), (None,)), types: (tf.string, tf.string)>

In [27]:
# Getting the example input batch and target batch 

example_input_batch , example_target_batch = next(iter(train_dataset))

example_input_batch.shape , example_target_batch.shape

(TensorShape([64]), TensorShape([64]))
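
The batch pulled above has 64 elements, yet the dataset reports its shape as `(None,)`. That is because `.batch(64)` keeps a smaller final batch by default (`drop_remainder=False`), so the batch dimension is not fixed. A quick check of the arithmetic, using the 40,001 training rows printed earlier:

```python
import math

num_examples = 40001  # rows in train.csv, from the shape printed earlier
batch_size = 64       # value passed to .batch()

# Number of batches per epoch, counting the final partial batch.
num_batches = math.ceil(num_examples / batch_size)

# Size of that final batch.
last_batch_size = num_examples - (num_batches - 1) * batch_size

print(num_batches)      # 626
print(last_batch_size)  # 1
```

The encoder defined later builds its initial state with a fixed batch size of 64, so a one-element final batch would not match it. Passing `drop_remainder=True` to `.batch()` is one way to avoid this.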

In [30]:
# Creating a text vectorization layer for the scrambled words 
max_vocab_length = 10000

input_text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize = 'lower_and_strip_punctuation' , 
    ngrams = 2 , 
    max_tokens = max_vocab_length 
)

# Fitting on our train sentences (scrambled words)
input_text_vectorizer.adapt(train_sentences)

In [31]:
# First 10 words from the vocabulary 
input_text_vectorizer.get_vocabulary()[:10]

['', '[UNK]', 'the', 'of', 'a', 'in', 'to', 'is', 'and', 'we']
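
The vocabulary printed above starts with `''` (padding) and `'[UNK]'` (out-of-vocabulary), followed by tokens in order of decreasing frequency. A minimal pure-Python sketch of that layout is shown below. `build_vocabulary` is a hypothetical helper that only approximates what `adapt` does: it skips punctuation stripping and n-grams.

```python
from collections import Counter


def build_vocabulary(sentences, max_tokens):
    """Frequency-ordered vocabulary with two reserved slots (hypothetical helper)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Two slots are reserved for padding ('') and unknown words ('[UNK]').
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


toy_sentences = ["the model of the data", "the data is the model", "the model"]
vocab = build_vocabulary(toy_sentences, max_tokens=5)
print(vocab)  # ['', '[UNK]', 'the', 'model', 'data']
```

With `max_tokens=5` only the three most frequent words survive, and the rarer words (`of`, `is`) would map to `[UNK]`.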

In [32]:
# Creating a text vectorization layer for the unscrambled words 
output_text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize = 'lower_and_strip_punctuation' , 
    ngrams = 2, 
    max_tokens = max_vocab_length
)

# Fitting on our train labels (unscrambled words)
output_text_vectorizer.adapt(train_labels)

In [33]:
# First 10 words from the vocab 
output_text_vectorizer.get_vocabulary()[:10]

['', '[UNK]', 'the', 'of', 'a', 'in', 'to', 'is', 'and', 'we']

In [34]:
# Passing a scrambled text (strings) into our layer 
scrambled_tokens = input_text_vectorizer(scrambled_text)
scrambled_tokens[:3]

<tf.Tensor: shape=(3, 29), dtype=int64, numpy=
array([[  12,    1,  181,    2,   42,    1,  911, 1226,    2,    1,    1,
        3174, 1114,    1,    1,    1,    1,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0],
       [2086,    3,  789,    4,  111,  212,    6, 3853,    2,  309,  623,
         798,   11,    1,    1,    1,    1,    1, 7779,    1,    1, 8485,
           1,    1,    1,    0,    0,    0,    0],
       [1632,    8,    9, 2042,   39,  419,   11,    5, 4709,    2,  792,
           1,  345,    1,    1,  444,    1,    1,    1,    1,  296,    1,
           1, 8553,    1,    1,    1,    0,    0]])>
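
In the tensor above, `1` marks tokens that are not in the vocabulary and the trailing `0`s pad every row to the length of the longest sequence in the batch. The sketch below mimics that lookup and padding on a toy vocabulary. `encode_batch` and `toy_vocab` are illustrative names, not part of Keras.

```python
def encode_batch(sentences, vocab):
    """Map words to vocabulary indices and right-pad with 0 (hypothetical helper)."""
    index_of = {token: i for i, token in enumerate(vocab)}
    unk = index_of["[UNK]"]
    rows = [[index_of.get(word, unk) for word in s.lower().split()] for s in sentences]
    width = max(len(row) for row in rows)  # longest sequence in this batch
    return [row + [0] * (width - len(row)) for row in rows]


toy_vocab = ["", "[UNK]", "the", "of", "a"]
batch = encode_batch(["the cat", "a picture of the cat"], toy_vocab)
for row in batch:
    print(row)
# [2, 1, 0, 0, 0]
# [4, 1, 3, 2, 1]
```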

In [38]:
# Creating a numpy array of the vocabulary
input_vocab = np.array(input_text_vectorizer.get_vocabulary())
output_vocab = np.array(output_text_vectorizer.get_vocabulary())

In [36]:
# Indexing our scrambled tokens into the array of vocabulary
tokens = input_vocab[scrambled_tokens.numpy()]
print(f'Actual sequence:\n\n {scrambled_text[:3]}\n')
print(f'\nThe sequence in tokens:\n\n {tokens[:3]}')

Actual sequence:

 [b'are pignistic through the two subsystems transformation. connected The'
 b'contribute of layer. a all previous to equally the way, layers output This'
 b'combine and we subset algorithm popular this In (Au the article, Probab. simulation Beck,']


The sequence in tokens:

 [['are' '[UNK]' 'through' 'the' 'two' '[UNK]' 'transformation'
  'connected' 'the' '[UNK]' '[UNK]' 'through the' 'the two' '[UNK]'
  '[UNK]' '[UNK]' '[UNK]' '' '' '' '' '' '' '' '' '' '' '' '']
 ['contribute' 'of' 'layer' 'a' 'all' 'previous' 'to' 'equally' 'the'
  'way' 'layers' 'output' 'this' '[UNK]' '[UNK]' '[UNK]' '[UNK]' '[UNK]'
  'previous to' '[UNK]' '[UNK]' 'the way' '[UNK]' '[UNK]' '[UNK]' '' ''
  '' '']
 ['combine' 'and' 'we' 'subset' 'algorithm' 'popular' 'this' 'in' 'au'
  'the' 'article' '[UNK]' 'simulation' '[UNK]' '[UNK]' 'and we' '[UNK]'
  '[UNK]' '[UNK]' '[UNK]' 'this in' '[UNK]' '[UNK]' 'the article' '[UNK]'
  '[UNK]' '[UNK]' '' '']]
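
Each row above lists the single words first and then the two-word combinations, which is the effect of `ngrams = 2`. A rough pure-Python sketch of that tokenization follows. `unigrams_and_bigrams` is a hypothetical helper and its punctuation handling only approximates `'lower_and_strip_punctuation'`.

```python
import string


def unigrams_and_bigrams(sentence):
    """Lowercase, strip punctuation, then return words followed by bigrams."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


tokens = unigrams_and_bigrams(
    "are pignistic through the two subsystems transformation. connected The"
)
print(len(tokens))  # 17
print(tokens[11])   # through the
```

The 9 words plus 8 bigrams give the 17 non-padding tokens of the first row. In the output above, most of those bigrams fall outside the 10,000-token vocabulary and show up as `[UNK]`.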


In [44]:
# Defining some important parameters 
inp_vocab_size = len(input_vocab)
lab_vocab_size = len(output_vocab) # this will be our label 
embedding_dim = 256
units = 1024
max_length = 15
BATCH_SIZE = 64

#### Encoder 

In [63]:
class Encoder(tf.keras.Model):
  def __init__(self , vocab_size , embedding_dim , enc_units , batch_size):
    super(Encoder, self).__init__()
    self.batch_size = batch_size # batch size
    self.enc_units = enc_units # Encoder units / units
    self.embedding = tf.keras.layers.Embedding(vocab_size , embedding_dim) # the embedding layer

    ## LSTM layer in our Encoder 
    self.lstm_layer = tf.keras.layers.LSTM(self.enc_units , 
                                           return_sequences = True , 
                                           return_state = True , 
                                           recurrent_initializer = 'glorot_uniform')
    
  def call(self , x , hidden):
    x = self.embedding(x) # (batch_size , seq_len) --> (batch_size , seq_len , embedding_dim)
    output , h , c = self.lstm_layer(x , initial_state = hidden)
    return output , h , c

  def initialize_hidden_state(self): 
    return [tf.zeros((self.batch_size , self.enc_units)) , tf.zeros((self.batch_size , self.enc_units))]




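- `h` (the hidden state) is the output of the LSTM layer at the last time step, and `c` is the cell state at that step. With `return_sequences = True`, `output` also holds the hidden state at every time step.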

[Difference between return state and return sequence](https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/#:~:text=The%20output%20of%20an%20LSTM,the%20cell%20state%2C%20or%20c.&text=The%20LSTM%20hidden%20state%20output,last%20time%20step%20(again).)
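
To make the three return values concrete, here is a toy LSTM written in plain Python. It is only a sketch of the standard LSTM equations with random weights, not the Keras implementation, and `lstm_forward` is a hypothetical name. It returns the per-step outputs (what `return_sequences = True` gives) together with the final `h` and `c` (what `return_state = True` gives).

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gate(weights, z, activation):
    # One value per unit: activation(w . z) for each weight row.
    return [activation(sum(w * v for w, v in zip(row, z))) for row in weights]


def lstm_forward(inputs, units, seed=0):
    """Run a toy LSTM over `inputs` (a list of time steps, each a list of floats)."""
    rng = random.Random(seed)
    width = len(inputs[0]) + units + 1  # input features + previous h + bias

    def init():
        return [[rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(units)]

    w_in, w_forget, w_out, w_cand = init(), init(), init(), init()
    h = [0.0] * units  # initial hidden state (all zeros)
    c = [0.0] * units  # initial cell state (all zeros)
    outputs = []
    for x in inputs:
        z = x + h + [1.0]  # concatenate input, previous h and a bias term
        i = gate(w_in, z, sigmoid)
        f = gate(w_forget, z, sigmoid)
        o = gate(w_out, z, sigmoid)
        g = gate(w_cand, z, math.tanh)
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)  # one hidden state per time step
    return outputs, h, c


sequence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
outputs, h, c = lstm_forward(sequence, units=4)
print(len(outputs), len(outputs[0]))  # time steps, units
print(outputs[-1] == h)               # the last output is the final hidden state
```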

In [50]:
dum = tf.zeros((64 , 1024))


In [65]:
# Test the Encoder layer we built 
encoder = Encoder(vocab_size = inp_vocab_size ,
                  embedding_dim = embedding_dim , 
                  enc_units = units , 
                  batch_size = BATCH_SIZE)

In [59]:
# This will initialize the hidden state of our lstm layer 
encoder.initialize_hidden_state()

[<tf.Tensor: shape=(64, 1024), dtype=float32, numpy=
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>,
 <tf.Tensor: shape=(64, 1024), dtype=float32, numpy=
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>]

In [67]:
sample_hidden = encoder.initialize_hidden_state()  # Initial (all-zero) hidden and cell states

# Apply the text vectorizer and turn the sequence into tokens before passing into Encoder
sample_output , sample_h , sample_c = encoder(input_text_vectorizer(example_input_batch) , sample_hidden)

print(f'Encoder output shape: (batch size , sequence length , units) --> {sample_output.shape}')
print(f'Encoder h vector shape: (batch_size , units) --> {sample_h.shape}')
print(f'Encoder c vector shape: (batch_size , units) --> {sample_c.shape}')

Encoder output shape: (batch size , sequence length , units) --> (64, 29, 1024)
Encoder h vector shape: (batch_size , units) --> (64, 1024)
Encoder c vector shape: (batch_size , units) --> (64, 1024)


#### Decoder 

Some threads I referred to: 
- https://stackoverflow.com/questions/48187283/whats-the-difference-between-lstm-and-lstmcell



In [None]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size , embedding_dim , dec_units , batch_size , attention_type = 'luong'):
    super(Decoder , self).__init__()
    self.batch_size = batch_size 
    self.dec_units = dec_units 
    self.attention_type = attention_type 

    # Embedding layer 
    self.embedding = tf.keras.layers.Embedding(vocab_size , embedding_dim)

    # Final Dense layer (fully connected) producing one logit per vocabulary token, to which softmax will be applied
    self.fc = tf.keras.layers.Dense(vocab_size)

    # Define the fundamental cell for decoder recurrent structure
    self.decoder_rnn_cell = tf.keras.layers.LSTMCell(self.dec_units)

    # Sampler 
    self.sampler = tfa.seq2seq.sampler.TrainingSampler() 
 
  def build_attention_mechanism(self , dec_units , memory , memory_sequence_length , attention_type = 'luong'):
    '''
    1. dec_units: Final dimension of the attention outputs (decoder units)
    2. memory: Encoder hidden states of shape (batch_size , max_length_input , enc_units)
    3. memory_sequence_length: 1D array of shape (batch_size) with every element set to max_length_input (for masking purposes)
    4. attention_type: Which type of attention to use (Bahdanau or Luong)
    '''