<a href="https://colab.research.google.com/github/ashikshafi08/Learning_Tensorflow/blob/main/Experiments/De_Scrambling_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Converting a scrambled sequence into an unscrambled sequence using attention

Reference: https://www.tensorflow.org/text/tutorials/nmt_with_attention

In [1]:
!pip install aicrowd-cli
API_KEY = 'b37b516b0cf698701bd83f05f784aab5' 
!aicrowd login --api-key $API_KEY

# Downloading the Dataset
!rm -rf data
!mkdir data
!aicrowd dataset download --challenge de-shuffling-text -j 3 -o data

Collecting aicrowd-cli
  Downloading https://files.pythonhosted.org/packages/1f/57/59b5a00c6e90c9cc028b3da9dff90e242ad2847e735b1a0e81a21c616e27/aicrowd_cli-0.1.7-py3-none-any.whl (49kB)
Collecting gitpython<4,>=3.1.12
  Downloading https://files.pythonhosted.org/packages/27/da/6f6224fdfc47dab57881fe20c0d1bc3122be290198ba0bf26a953a045d92/GitPython-3.1.17-py3-none-any.whl (166kB)
Collecting requests<3,>=2.25.1
  Downloading https://files.pythonhosted.org/packages/29/c1/24814557f1d22c56d50280771a17307e6bf87b70727d975fd6b2ce6b014a/requests-2.25.1-py2.py3-none-any.whl (61kB)
Collecting requests-toolbelt<1,>=0.9.1
  Downloading https://files.pythonhosted.org/packages/60/ef/7681134338fc097acef8d9b2f8abe0458e4d87559c689a8c306d0957ece5/requests_toolbelt-0.9.1-py2.py3-none-any.whl (54kB)

In [2]:
# Importing all the packages we need 
import tensorflow as tf 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# Importing the data 

train_data = pd.read_csv('data/train.csv')
val_data = pd.read_csv('data/val.csv')
test_data = pd.read_csv('data/test.csv')

# Printing out all shapes of our data 
print(f'Shape of the train data: {train_data.shape}')
print(f'Shape of the validation data: {val_data.shape}')
print(f'Shape of the test data: {test_data.shape}')


Shape of the train data: (40001, 2)
Shape of the validation data: (4001, 2)
Shape of the test data: (10000, 3)


In [4]:
# What does our train data look like?
train_data.head()

Unnamed: 0,text,label
0,"presented here Furthermore, naive improved. im...","Furthermore, the naive implementation presente..."
1,vector a in a form vector multidimensional spa...,Those coefficients form a vector in a multidim...
2,compatible of The model with recent is model s...,The model is compatible with a recent model of...
3,but relevance outlined. hemodynamics its based...,"The model is based on electrophysiology, but i..."
4,of transitions lever-like involve reorientatio...,Conformational transitions in macromolecular c...


In [5]:
# Shuffling our train data 
train_data_shuffled = train_data.sample(frac = 1 , random_state = 42)
train_data_shuffled.head() , train_data_shuffled.shape

(                                                    text                                              label
 32824  on work, supervised label image the segmentati...  In our work, we focus on the weakly supervised...
 16298  we small of a for set work, In this features i...  In this work, we propose a small set of featur...
 30180  ($G_h^{Der}$ to factors the contributes $\tau_...  The increment of both factors ($G_h^{Der}$ and...
 6689   new precise particular, for entailment. bounds...  In particular, we provide new precise analytic...
 26893  a these causation Incorporating features, defi...  Incorporating these three features, a definiti...,
 (40001, 2))

In [6]:
# Splitting sentences and labels
train_sentences = train_data_shuffled['text'].to_numpy()
train_labels = train_data_shuffled['label'].to_numpy()

val_sentences = val_data['text'].to_numpy()
val_labels = val_data['label'].to_numpy()

test_sentences = test_data['text'].to_numpy()
test_labels = test_data['label'].to_numpy()


# Checking the shapes 
print(f'Shape of the train sentences: {train_sentences.shape}')
print(f'Shape of the validation sentences: {val_sentences.shape}')
print(f'Shape of the train labels: {train_labels.shape}')
print(f'Shape of the validation labels: {val_labels.shape}')

Shape of the train sentences: (40001,)
Shape of the validation sentences: (4001,)
Shape of the train labels: (40001,)
Shape of the validation labels: (4001,)


In [43]:
# Creating a tf.data.Dataset from our sentences and labels

train_dataset = tf.data.Dataset.from_tensor_slices((train_sentences , train_labels)).shuffle(1000)
val_dataset = tf.data.Dataset.from_tensor_slices((val_sentences , val_labels))

# Batching the training data (64 examples per batch)
train_dataset = train_dataset.batch(64)

train_dataset , val_dataset

(<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.string)>,
 <TensorSliceDataset shapes: ((), ()), types: (tf.string, tf.string)>)
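
The `(None,)` batch dimension reported above is not fixed because, assuming `.batch(64)` keeps its default `drop_remainder=False`, the final batch can be smaller than 64. A quick arithmetic sketch using the row count printed earlier (the variable names are illustrative only):

```python
import math

num_examples = 40001   # rows in train.csv, per the shape printed earlier
batch_size = 64        # value passed to .batch() above

# With drop_remainder=False (assumed default), the last partial batch is kept.
num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size

print(f'Batches per epoch: {num_batches}')
print(f'Size of the last batch: {last_batch_size}')
```

Under these assumptions an epoch has 626 batches, and the last one holds a single example.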

In [46]:
# Looking at a single batch of train_dataset (only the first 5 texts in the batch)
for scrambled_text , unscrambled_text in train_dataset.take(1):
  print(f'Below is the Scrambled version:\n {scrambled_text[:5]}')
  print('\n----------\n')
  print(f'Below is the Un-Scrambled version:\n {unscrambled_text[:5]}')

Below is the Scrambled version:
 [b"representation's mechanism physiological this unknown. However, details of remain many"
 b'In Plug-and-Play convergence. we a fixed algorithm with paper, point this ADMM provable propose'
 b'those I information interpret show to how with theory. to statistical expressions respect'
 b'The code and publicly are datasets available.'
 b'on results R-CNN demonstrated impressive has recently The Faster detection benchmarks. various object']

----------

Below is the Un-Scrambled version:
 [b"However, many details of this representation's physiological mechanism remain unknown."
 b'In this paper, we propose a Plug-and-Play ADMM algorithm with provable fixed point convergence.'
 b'I show how to interpret those statistical expressions with respect to information theory.'
 b'The datasets and code are publicly available.'
 b'The Faster R-CNN has recently demonstrated impressive results on various object detection benchmarks.']


In [47]:
# Creating text vectorization layer for the scrambled words 
max_vocab_length = 10000

input_text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize = 'lower_and_strip_punctuation' , 
    ngrams = 2 , 
    max_tokens = max_vocab_length 
)

# Fitting on our train sentences (scrambled words )
input_text_vectorizer.adapt(train_sentences)

In [48]:
# First 10 words from the vocabulary 
input_text_vectorizer.get_vocabulary()[:10]

['', '[UNK]', 'the', 'of', 'a', 'in', 'to', 'is', 'and', 'we']

In [49]:
# Creating a text vectorization layer for the unscrambled words 
output_text_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize = 'lower_and_strip_punctuation' , 
    ngrams = 2, 
    max_tokens = max_vocab_length
)

# Fitting on our train labels (unscrambled words)
output_text_vectorizer.adapt(train_labels)

In [50]:
# First 10 words from the vocab 
output_text_vectorizer.get_vocabulary()[:10]

['', '[UNK]', 'the', 'of', 'a', 'in', 'to', 'is', 'and', 'we']

In [51]:
# Passing a scrambled text (strings) into our layer 
scrambled_tokens = input_text_vectorizer(scrambled_text)
scrambled_tokens[:3]

<tf.Tensor: shape=(3, 29), dtype=int64, numpy=
array([[ 332,  494, 3090,   11,  887,   40, 1170,    3, 1392,   74,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0],
       [   5,    1,  666,    9,    4, 1311,   39,   16,   21,  373,   11,
        6184,    1,   43,    1,    1,    1,  218,    1,    1, 6737, 5780,
           1,    1,    1,    1,    1,    0,    0],
       [ 520,  830,   66, 3831,   53,    6,  132,   16,  166,    6,  356,
        1287, 1697,    1,    1,    1,    1, 5479, 4767,    1,    1, 7575,
        9543,    1,    1,    0,    0,    0,    0]])>

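In the example above, we passed raw strings to the text vectorization layer, which returned a tensor of token IDs for each sequence. Because the layer was created with `ngrams = 2`, each row holds the single-word tokens first, followed by the two-word tokens, and is padded with zeros to the length of the longest row in the batch.

We can also go the other way and convert token IDs back to text by indexing into the vocabulary returned by the `get_vocabulary()` method.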
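
To make the round trip concrete, here is a minimal pure-Python sketch of the same idea. It is not the `TextVectorization` implementation: `toy_vocab`, `standardize`, `to_tokens`, `encode` and `decode` are hypothetical names, the vocabulary is made up, and the "unigrams first, then bigrams" ordering is inferred from the output shown in this notebook.

```python
import string

# Hypothetical toy vocabulary: index 0 is padding and index 1 is the
# out-of-vocabulary token, mirroring the first two entries of get_vocabulary().
toy_vocab = ['', '[UNK]', 'the', 'code', 'and', 'are', 'the code', 'code and']
token_to_id = {token: i for i, token in enumerate(toy_vocab)}

def standardize(text):
    # Rough stand-in for 'lower_and_strip_punctuation'
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def to_tokens(text):
    # Single words first, then adjacent word pairs (assumed ordering)
    words = standardize(text).split()
    bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def encode(text):
    # Anything missing from the vocabulary maps to ID 1 ('[UNK]')
    return [token_to_id.get(token, 1) for token in to_tokens(text)]

def decode(ids):
    # Converting IDs back to text is plain indexing into the vocabulary
    return [toy_vocab[i] for i in ids]

ids = encode('The code and datasets are available.')
print(ids)
print(decode(ids))
```

Most word pairs are rare, so they fall outside a capped vocabulary and map to `[UNK]`, which would explain the long runs of `[UNK]` at the end of each decoded row.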

In [52]:
# Creating a numpy array of the vocabulary
input_vocab = np.array(input_text_vectorizer.get_vocabulary())
input_vocab


array(['', '[UNK]', 'the', ..., 'paper deep', 'packing', 'package the'],
      dtype='<U26')

In [59]:
# Indexing the vocabulary array with our scrambled tokens
tokens = input_vocab[scrambled_tokens.numpy()]
print(f'Actual sequence:\n\n {scrambled_text[:3]}\n')
print(f'\nThe sequence in tokens:\n\n {tokens[:3]}') 

Actual sequence:

 [b"representation's mechanism physiological this unknown. However, details of remain many"
 b'In Plug-and-Play convergence. we a fixed algorithm with paper, point this ADMM provable propose'
 b'those I information interpret show to how with theory. to statistical expressions respect']


The sequence in tokens:

 [['representations' 'mechanism' 'physiological' 'this' 'unknown'
  'however' 'details' 'of' 'remain' 'many' '[UNK]' '[UNK]' '[UNK]'
  '[UNK]' '[UNK]' '[UNK]' '[UNK]' '[UNK]' '[UNK]' '' '' '' '' '' '' '' ''
  '' '']
 ['in' '[UNK]' 'convergence' 'we' 'a' 'fixed' 'algorithm' 'with' 'paper'
  'point' 'this' 'admm' '[UNK]' 'propose' '[UNK]' '[UNK]' '[UNK]' 'we a'
  '[UNK]' '[UNK]' 'algorithm with' 'with paper' '[UNK]' '[UNK]' '[UNK]'
  '[UNK]' '[UNK]' '' '']
 ['those' 'i' 'information' 'interpret' 'show' 'to' 'how' 'with' 'theory'
  'to' 'statistical' 'expressions' 'respect' '[UNK]' '[UNK]' '[UNK]'
  '[UNK]' 'show to' 'to how' '[UNK]' '[UNK]' 'theory to' 'to stati

## Modelling Part 

Here we are going to build a seq2seq architecture from scratch, starting with the following components:
- Encoder 
- Decoder 
- Attention Head 

Since we are going to use a lot of low-level APIs, where it's easy to get the shapes wrong, this `ShapeChecker` is used to check shapes throughout the tutorial. 

In [60]:
class ShapeChecker():
  def __init__(self):
    # Keep a cache of every axis-name seen
    self.shapes = {}

  def __call__(self, tensor, names, broadcast=False):
    if not tf.executing_eagerly():
      return

    if isinstance(names, str):
      names = (names,)

    shape = tf.shape(tensor)
    rank = tf.rank(tensor)

    if rank != len(names):
      raise ValueError(f'Rank mismatch:\n'
                       f'    found {rank}: {shape.numpy()}\n'
                       f'    expected {len(names)}: {names}\n')

    for i, name in enumerate(names):
      if isinstance(name, int):
        old_dim = name
      else:
        old_dim = self.shapes.get(name, None)
      new_dim = shape[i]

      if (broadcast and new_dim == 1):
        continue

      if old_dim is None:
        # If the axis name is new, add its length to the cache.
        self.shapes[name] = new_dim
        continue

      if new_dim != old_dim:
        raise ValueError(f"Shape mismatch for dimension: '{name}'\n"
                         f"    found: {new_dim}\n"
                         f"    expected: {old_dim}\n")
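
The core idea of `ShapeChecker` is to remember the length of each named axis the first time it is seen and to complain if a later tensor disagrees. Below is a simplified, TensorFlow-free sketch of that caching behaviour working on plain shape tuples. `AxisNameChecker` is a hypothetical name used only for illustration, and it leaves out the `broadcast` option and the integer axis names of the class above.

```python
class AxisNameChecker:
    """Hypothetical, simplified analogue of ShapeChecker (illustration only)."""

    def __init__(self):
        # Maps each axis name to the first length seen for it
        self.shapes = {}

    def __call__(self, shape, names):
        if isinstance(names, str):
            names = (names,)
        if len(shape) != len(names):
            raise ValueError(
                f'Rank mismatch: found {len(shape)}, expected {len(names)}')
        for dim, name in zip(shape, names):
            # Record the length for a new name, otherwise compare with the cache
            expected = self.shapes.setdefault(name, dim)
            if expected != dim:
                raise ValueError(
                    f"Shape mismatch for dimension '{name}': "
                    f'found {dim}, expected {expected}')


checker = AxisNameChecker()
checker((64, 29), ('batch', 's'))                    # records batch=64, s=29
checker((64, 29, 256), ('batch', 's', 'embed_dim'))  # consistent with the cache

try:
    checker((32, 1024), ('batch', 'enc_units'))      # batch was 64 before
    mismatch_caught = False
except ValueError as error:
    mismatch_caught = True
    print(error)
```

The third call fails because `batch` was first recorded as 64, which is the kind of silent shape mistake the checker is meant to surface early.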

In [61]:
# Defining needed constants for our model 
embedding_dim = 256 
units = 1024 

#### Building our encoder layer 
The encoder: 
- Takes a list of token IDs (from `input_text_vectorizer`)
- Looks up an embedding vector for each token (we will create that using `layers.Embedding`) 
- Processes the embeddings into a new sequence (using a `layers.GRU`)
- **Returns**
  - The processed sequence. This will be passed to the attention head.
  - The internal state. This will be used to initialize the decoder. 
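
The steps above can be traced shape by shape with a small NumPy sketch. This is not the Keras implementation: a plain tanh recurrent cell stands in for the GRU (no gates), the weights are random, and the sizes are deliberately tiny rather than the notebook's `embedding_dim = 256` and `units = 1024`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small illustrative sizes (assumptions, not the notebook's values)
batch, s, vocab_size, embed_dim, enc_units = 4, 6, 50, 8, 16

# 1. A batch of token IDs, shape (batch, s)
tokens = rng.integers(0, vocab_size, size=(batch, s))

# 2. Embedding lookup is row indexing into a (vocab_size, embed_dim) table
embedding_table = rng.normal(size=(vocab_size, embed_dim))
vectors = embedding_table[tokens]                  # (batch, s, embed_dim)

# 3. Plain tanh RNN cell as a stand-in for the GRU
W_x = rng.normal(scale=0.1, size=(embed_dim, enc_units))
W_h = rng.normal(scale=0.1, size=(enc_units, enc_units))

state = np.zeros((batch, enc_units))
outputs = []
for t in range(s):
    # Each step mixes the current embedding with the previous state
    state = np.tanh(vectors[:, t, :] @ W_x + state @ W_h)
    outputs.append(state)

# 4. The processed sequence and the final internal state
output = np.stack(outputs, axis=1)                 # (batch, s, enc_units)

print(f'vectors: {vectors.shape}')
print(f'output:  {output.shape}')
print(f'state:   {state.shape}')
```

In this sketch the returned state is simply the last time step of the output sequence, which is why it has no `s` axis.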


In [62]:
# Building an Encoder layer 

class Encoder(tf.keras.layers.Layer): 
  def __init__(self ,input_vocab_size , embedding_dim , enc_units):
    super(Encoder , self).__init__()
    self.enc_units = enc_units 
    self.input_vocab_size = input_vocab_size 

    # This embedding layer converts tokens to vectors 
    self.embedding = tf.keras.layers.Embedding(self.input_vocab_size , 
                                             embedding_dim)
  
    # Using a GRU layer to process those vectors sequentially 
    self.gru = tf.keras.layers.GRU(self.enc_units , 
                                 return_sequences = True , 
                                 return_state = True , 
                                 recurrent_initializer = 'glorot_uniform')
  
  def call(self , tokens , state = None):
    shape_checker = ShapeChecker() 
    shape_checker(tokens, ('batch', 's'))

    # 2. The embedding layer looks up the embedding for each token
    vectors = self.embedding(tokens) # gives us the vectors for each token
    shape_checker(vectors , ('batch' , 's' , 'embed_dim'))

    # 3. The GRU processes the embedding sequence 
    #       output shape: (batch , s , enc_units)
    #       state_shape: (batch , enc_units)
    output , state = self.gru(vectors , initial_state = state)
    shape_checker(output , ('batch' ,'s' , 'enc_units'))
    shape_checker(state , ('batch' , 'enc_units'))

    # 4. Return the new sequence and its state 
    return output , state
  


Alright, that looks complicated, so let's see how it works. 

In [63]:
# First, convert the input text to tokens using the TextVectorization layer 
example_tokens = input_text_vectorizer(scrambled_text)
example_tokens

<tf.Tensor: shape=(64, 29), dtype=int64, numpy=
array([[ 332,  494, 3090, ...,    0,    0,    0],
       [   5,    1,  666, ...,    1,    0,    0],
       [ 520,  830,   66, ...,    0,    0,    0],
       ...,
       [  61,  229,   30, ...,    0,    0,    0],
       [6190,   15,  514, ..., 7691, 8055,    1],
       [   2,   17,   18, ...,    1,    1,    1]])>

In [64]:
# Instantiate the encoder (the class we wrote above)
encoder = Encoder(input_vocab_size= input_text_vectorizer.vocabulary_size() , 
                  embedding_dim = embedding_dim, enc_units = units)


In [65]:
# Applying the encoder to our example tokens and unpacking its two outputs

example_encoder_output , example_encoder_state = encoder(example_tokens)

In [67]:
# Good, let's print them one by one 
print(f'Input batch, shape (batch): {scrambled_text.shape}\n')
print(f'Input batch tokens , shape (batch ,s): {example_tokens.shape}\n')

print(f'Encoder output , shape (batch, s , units): {example_encoder_output.shape}\n')
print(f'Encoder state , shape (batch, units): {example_encoder_state.shape}\n')

Input batch, shape (batch): (64,)

Input batch tokens , shape (batch ,s): (64, 29)

Encoder output , shape (batch, s , units): (64, 29, 1024)

Encoder state , shape (batch, units): (64, 1024)

