**In this notebook we demonstrate training wav2vec2 with TensorFlow on TPU**

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xls_r.png)

* For training with Hugging Face and PyTorch, follow this [notebook](https://www.kaggle.com/code/nazmuddhohaansary/wave2vec2-starter-for-dl-sprint-commonvoice)
* Training with this notebook is considerably faster because it uses tfrecords and a TPU

![TPU](https://diamond-thumbnails.s3.us-west-2.amazonaws.com/thinkbigcms/Product/logo/6e276569-90f0-4491-a0ed-45b07b8b05eb.png?hash=f54001f5e45cdaa749504dafc8d86bc4) 


A TPU is an accelerator available on **colab** and **kaggle** that can train **tensorflow** models much faster than GPUs. To train on a TPU, the dataset first has to be converted into the **tfrecords** format.


### Useful links to understand tfrecords and TPUs

**TPU for (~20x) faster training**
* [what is TPU and why do we need them](https://www.quora.com/What-is-TPU-and-GPU-Why-and-when-do-we-need-them)
* [Kaggle TPU a-z](https://www.kaggle.com/docs/tpu)

**TFRecords**
* [Official Tensorflow Doc](https://www.tensorflow.org/tutorials/load_data/tfrecord)
* [basics](https://www.kaggle.com/code/ryanholbrook/tfrecords-basics/notebook)

### **install dependencies**

In [1]:
!pip install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@main



# Data Access

* We locate the tfrecords by file patterns, using `*` as a wildcard.
* While training with tfrecords on TPU **we must not load the data from local disk**. We have to use **GCS buckets** to load the data.
    * The kaggle_datasets API provides a way to access both public and private GCS (Google Cloud Storage) buckets. Here we are using public data, but private datasets can also be used.
* **PER_REPLICA_BATCH_SIZE** is the batch size per TPU replica. The global batch size while training will be **8 times the PER_REPLICA_BATCH_SIZE** we provide, because the TPU exposes 8 replicas.

* **REC_SIZE=256** simply means that while creating the tfrecords, we stored 256 audio files with their labels in one tfrecord.

* Example params:
```python
PER_REPLICA_BATCH_SIZE  = 32      # this is a safe batch size 
EPOCHS                  = 50      # change this as needed .. keep the kaggle allowed TPU limit of 9 hours in mind    
```
* To use the full dataset:

```python
TRAIN_GCS_PATTERNS      = [os.path.join(GCS_PATH,"voted","*/*.tfrecord"),
                           os.path.join(GCS_PATH,"unverified","*/*.tfrecord"),]

```
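
To make the wildcard patterns above concrete, here is a minimal pure-Python sketch of how a pattern such as `voted/*/*.tfrecord` selects files. It assumes that each `*` matches any characters within a single path segment (shell-style globbing). The bucket name and file names are hypothetical, and `glob_to_regex` and `match_paths` are illustrative helpers, not part of TensorFlow.

```python
import re

def glob_to_regex(pattern):
    """Translate a glob with `*` wildcards into a compiled regex.

    Assumption: `*` matches any run of characters except "/".
    """
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + "[^/]*".join(literal_parts) + "$")

def match_paths(pattern, paths):
    """Return the paths that match the glob pattern, in input order."""
    regex = glob_to_regex(pattern)
    return [p for p in paths if regex.match(p)]

# hypothetical bucket layout, for illustration only
paths = [
    "gs://example-bucket/voted/0/0.tfrecord",
    "gs://example-bucket/voted/1/7.tfrecord",
    "gs://example-bucket/voted/readme.txt",
    "gs://example-bucket/eval/0/0.tfrecord",
]
selected = match_paths("gs://example-bucket/voted/*/*.tfrecord", paths)
print(selected)
```

Only the two files that sit exactly one folder below `voted/` and end in `.tfrecord` are selected.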

In [2]:
from kaggle_datasets import KaggleDatasets
# GCS path of the tfrecords dataset; GCS_PATH is required by the patterns below
GCS_PATH=KaggleDatasets().get_gcs_path("dl-sprint-tfrecords")


import os 
#------------------------------
# changeable params
#------------------------------
TRAIN_GCS_PATTERNS      = [os.path.join(GCS_PATH,"voted","*/*.tfrecord"),
                           os.path.join(GCS_PATH,"unverified","*/*.tfrecord")]
                          
EVAL_GCS_PATTERNS       = [os.path.join(GCS_PATH,"eval","*/*.tfrecord")]

PER_REPLICA_BATCH_SIZE  = 32      # this is a safe batch size 
EPOCHS                  = 25      # change this as needed .. keep the kaggle allowed TPU limit of 9 hours in mind    

#------------------------------
# fixed params while creating the tfrecords
#------------------------------
REC_SIZE=256  
VOCAB   =[ 'pad','start','end','\u200d',
        ' ','!',"'",',','-','.',':',';','=','?','।',
        'ঁ','ং','ঃ',
        'অ','আ','ই','ঈ','উ','ঊ','ঋ','এ','ঐ','ও','ঔ',
        'ক','খ','গ','ঘ','ঙ',
        'চ','ছ','জ','ঝ','ঞ',
        'ট','ঠ','ড','ঢ','ণ',
        'ত','থ','দ','ধ','ন',
        'প','ফ','ব','ভ','ম',
        'য','র','ল',
        'শ','ষ','স','হ',
        'া','ি','ী','ু','ূ','ৃ','ে','ৈ','ো','ৌ','্',
        'ৎ','ড়','ঢ়','য়',
        '০','১','২','৩','৪','৫','৬','৭','৮','৯']


We import the needed libraries here and collect the tfrecord paths that can be fed into the [tf.data API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset), which is the official way to use tfrecords.

### Imports and data 

In [3]:
#-------------------------------
# imports
#-------------------------------
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import random
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np 
from tqdm.auto import tqdm
from IPython.display import display,Audio
from wav2vec2 import Wav2Vec2Config,CTCLoss
tqdm.pandas()

#--------------------------
# GCS Paths and tfrecords
#-------------------------
train_recs=[]
eval_recs =[]
def get_tfrecs(gcs_pattern):
    file_paths = tf.io.gfile.glob(gcs_pattern)
    random.shuffle(file_paths)
    print("found ",len(file_paths), "tfrecords")
    return file_paths

for gcs in TRAIN_GCS_PATTERNS:
    print("Looking into gcs path:",gcs)
    train_recs+=get_tfrecs(gcs)
for gcs in EVAL_GCS_PATTERNS:
    print(gcs)
    eval_recs+=get_tfrecs(gcs)

print("Total Eval-recs:",len(eval_recs))
print("Total Train-recs:",len(train_recs))
#------------------------------------------------
# change config
#------------------------------------------------
config = Wav2Vec2Config()
config.vocab_size=len(VOCAB)+1
config

Looking into gcs path: gs://kds-90328aa8d26e17c5bffb9a7f73013580f05a9bfecda822e30cc04946/voted/*/*.tfrecord
found  144 tfrecords
Looking into gcs path: gs://kds-90328aa8d26e17c5bffb9a7f73013580f05a9bfecda822e30cc04946/unverified/*/*.tfrecord
found  660 tfrecords
gs://kds-90328aa8d26e17c5bffb9a7f73013580f05a9bfecda822e30cc04946/eval/*/*.tfrecord
found  31 tfrecords
Total Eval-recs: 31
Total Train-recs: 804


Wav2Vec2Config(vocab_size=87, dropout=0.1, hidden_size=768, num_heads=12, num_layers=12, intermediate_size=3072, is_gelu_approx=False, layer_norm_eps=1e-05, survival_prob=1.0, pad_id=0, num_conv_pos_embeddings=128, num_conv_pos_embedding_groups=16, filter_sizes=[512, 512, 512, 512, 512, 512, 512], kernal_sizes=[10, 3, 3, 3, 3, 2, 2], strides=[5, 2, 2, 2, 2, 2, 2], conv_bias=False, apply_spec_augment=True, mask_time_prob=0.05, mask_time_length=10, attention_norm_type='postnorm', feature_extractor_norm_type='group', is_robust=False)

# Initialize TPU
* We initialize the TPU cluster before using it.
* Based on the number of **replicas** (devices) we fix:
    * BATCH_SIZE
    * STEPS_PER_EPOCH
    * and evaluation steps within an epoch (EVAL_STEPS)
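
The bookkeeping listed above comes down to a few integer divisions. A minimal sketch, assuming 8 replicas and the tfrecord counts found earlier (804 train, 31 eval); `plan_steps` is an illustrative helper, not part of the notebook's code:

```python
def plan_steps(n_train_recs, n_eval_recs, rec_size, per_replica_batch_size, num_replicas):
    """Return (global batch size, steps per epoch, eval steps)."""
    batch_size = per_replica_batch_size * num_replicas
    # each tfrecord holds up to `rec_size` examples
    steps_per_epoch = (n_train_recs * rec_size) // batch_size
    # evaluation covers roughly half of the eval examples per epoch
    eval_steps = (n_eval_recs * rec_size) // (2 * batch_size)
    return batch_size, steps_per_epoch, eval_steps

batch_size, steps, eval_steps = plan_steps(804, 31, 256, 32, 8)
print(batch_size, steps, eval_steps)  # 256 804 15
```

Because `REC_SIZE` equals the global batch size here, one training step consumes about one tfrecord's worth of examples.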

In [4]:
#----------------------------------------------------------
# Detect hardware, return appropriate distribution strategy
#----------------------------------------------------------
# TPU detection. No parameters necessary if TPU_NAME environment variable is set. On Kaggle this is always the case.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    tf.config.optimizer.set_jit(True)
else:
    strategy = tf.distribute.get_strategy() 
    # default distribution strategy in Tensorflow. Works on CPU and single GPU.

print("REPLICAS: ", strategy.num_replicas_in_sync)

#-------------------------------------
# batching , strategy and steps
#-------------------------------------
if strategy.num_replicas_in_sync==1:
    BATCH_SIZE = PER_REPLICA_BATCH_SIZE
else:
    BATCH_SIZE = PER_REPLICA_BATCH_SIZE*strategy.num_replicas_in_sync

# set    
STEPS_PER_EPOCH = (len(train_recs)*REC_SIZE)//(BATCH_SIZE)
EVAL_STEPS      = (len(eval_recs)*REC_SIZE)//(2*BATCH_SIZE)
print("Batch Size:",BATCH_SIZE)
print("Steps:",STEPS_PER_EPOCH)
print("Eval Steps:",EVAL_STEPS)

Running on TPU  grpc://10.0.0.2:8470
REPLICAS:  8
Batch Size: 256
Steps: 804
Eval Steps: 15


# Data Loader 
* cfg = our data config, which also stores some constants
* config = the actual wav2vec2 model config

In [5]:
class cfg:
    audio_shape      =  (246000,)                   # fixed for the pretrained weights we are using -- max audio length is about 15 secs (246000 samples at 16 kHz)
    label_shape      =  (250,)                      # this is actually fixed for the pretrained weights we are using 
    sample_rate      =  16000
    shuffle_buffer   =  1024
    batch_size       =  BATCH_SIZE
    vocab_len        =  len(VOCAB)+1                # the additional vocab can account for <UNK>
    

In [6]:
#------------------------------
# parsing tfrecords 
#------------------------------
def normalize(x):
    # -> (1, seqlen)
    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    var = tf.math.reduce_variance(x, axis=-1, keepdims=True)
    return tf.squeeze((x - mean) / tf.sqrt(var + 1e-5))

def read_raw_audio(audio):
    wave,rate = tf.audio.decode_wav(audio, desired_channels=1, desired_samples=-1)
    return tf.reshape(wave, shape=[-1]) 
    
def preprocess_example(audio,label):
    with tf.device("/CPU:0"):
        signal = normalize(read_raw_audio(audio))
        label = tf.strings.to_number(tf.strings.split(label), out_type=tf.int32)
        return signal,label

def data_input_fn(recs): 
    '''
      This function builds the tf.data pipeline from tfrecords stored on GCS
      * The parser function should look similar to the one used in the dataset EDA
    '''
    def _parser(example):   
        feature ={  'audio' : tf.io.FixedLenFeature([],tf.string) ,
                    'label' : tf.io.FixedLenFeature([],tf.string) 
        }    
        example=tf.io.parse_single_example(example,feature)
        audio,label=preprocess_example(**example)
        return audio,label
    # fixed code (for almost all tfrec training)
    dataset = tf.data.TFRecordDataset(recs)
    dataset = dataset.map(_parser)
    dataset = dataset.shuffle(cfg.shuffle_buffer,reshuffle_each_iteration=True)
    dataset = dataset.repeat()
    dataset = dataset.padded_batch(cfg.batch_size, padded_shapes=(cfg.audio_shape[0],cfg.label_shape[0]), padding_values=(0.0, 0))
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    dataset = dataset.apply(tf.data.experimental.ignore_errors())
    return dataset
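
As a sanity check on the `normalize` step above, the same zero-mean, unit-variance scaling can be written in plain Python. This is an illustrative sketch on a toy signal; `normalize_list` is a hypothetical name, and it uses the population variance, which is what `tf.math.reduce_variance` computes.

```python
import math
import statistics

def normalize_list(samples, eps=1e-5):
    """Scale a list of floats to zero mean and (almost) unit variance."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu=mean)  # population variance
    scale = math.sqrt(var + eps)                  # eps guards against silent audio
    return [(s - mean) / scale for s in samples]

toy_signal = [0.1, -0.2, 0.4, 0.0, -0.3]
normalized = normalize_list(toy_signal)
print(statistics.fmean(normalized), statistics.pvariance(normalized))
```

The resulting variance is slightly below 1 because of the `eps` term in the denominator.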

In [7]:
train_ds=data_input_fn(train_recs)
eval_ds =data_input_fn(eval_recs)

### Visualize

In [8]:
#------------------------------
# view data
#------------------------------
for x,y in eval_ds.take(1):
    signal=x[0].numpy()
    display(Audio(data=signal, rate=cfg.sample_rate))
    label=y[0].numpy()
    sen="".join([VOCAB[int(i)] for i in label if i > VOCAB.index("end")])
    print("label:",sen)
    print("input shape:",x.shape)
    print("output shape:",y.shape)

label: ঘরোয়া প্রথম-শ্রেণীর ক্রিকেটে মধ্যপ্রদেশ দলের হয়ে খেলছেন।
input shape: (256, 246000)
output shape: (256, 250)
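
The decoding step in the cell above can be isolated as a small helper. The sketch below assumes, as the parser suggests, that each label is stored as a space-separated string of vocabulary indices and is padded with `0` (`pad`). The shortened vocabulary and the helper names are hypothetical.

```python
# shortened, hypothetical vocabulary with the same special tokens first
TOY_VOCAB = ["pad", "start", "end", " ", "ক", "খ", "গ"]

def parse_label(label_string, max_len):
    """Parse a space-separated index string and right-pad it with 0 (pad)."""
    ids = [int(tok) for tok in label_string.split()]
    if len(ids) > max_len:
        raise ValueError("label longer than max_len")
    return ids + [0] * (max_len - len(ids))

def decode_label(ids, vocab):
    """Join the characters of all ids that come after pad/start/end."""
    last_special_id = vocab.index("end")
    return "".join(vocab[i] for i in ids if i > last_special_id)

padded = parse_label("1 4 3 5 6 2", max_len=10)
print(padded)
print(decode_label(padded, TOY_VOCAB))
```

Filtering on `i > vocab.index("end")` drops `pad`, `start` and `end` in one comparison, which works because the special tokens occupy the first indices of the vocabulary.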


# Modeling

In [9]:
def create_model(cfg):
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    pretrained_layer = hub.KerasLayer("https://tfhub.dev/vasudevgupta7/wav2vec2/1",load_options=load_locally,trainable=True)
    inputs = tf.keras.Input(shape=cfg.audio_shape)
    states = pretrained_layer(inputs)
    logits= tf.keras.layers.Dense(cfg.vocab_len)(states)
    model = tf.keras.Model(inputs=inputs, outputs=logits)
    return model

**Model weights can be loaded from a previously saved file to continue training**
```python
model.load_weights("path to previously trained weights")
```

In [10]:
with strategy.scope():
    model=create_model(cfg)
    # model.load_weights("model.h5")
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 246000)]          0         
_________________________________________________________________
keras_layer (KerasLayer)     (None, 768, 768)          94371712  
_________________________________________________________________
dense (Dense)                (None, 768, 87)           66903     
Total params: 94,438,615
Trainable params: 94,438,615
Non-trainable params: 0
_________________________________________________________________
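
The `768` time steps in the summary above follow from the convolutional feature extractor settings printed in the config (`kernal_sizes` and `strides`). A minimal sketch of the arithmetic, assuming unpadded 1-D convolutions; `feature_frames` is an illustrative helper name:

```python
KERNEL_SIZES = [10, 3, 3, 3, 3, 2, 2]   # `kernal_sizes` in the printed config
STRIDES      = [5, 2, 2, 2, 2, 2, 2]

def feature_frames(num_samples, kernel_sizes=KERNEL_SIZES, strides=STRIDES):
    """Number of output frames after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernel_sizes, strides):
        length = (length - kernel) // stride + 1
    return length

frames = feature_frames(246000)
dense_params = 768 * 87 + 87          # weights + biases of the final Dense layer
print(frames, dense_params)           # 768 66903
print(246000 / 16000)                 # 15.375 seconds of audio per example
```

The strides multiply to 320 samples, so each frame advances by 20 ms of audio at 16 kHz, and the 250-token label limit stays below the 768 frames available to the CTC loss.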


# Training
* Some ideas to extend:
    * use different schedulers
    * use callbacks to track some metrics
    * the reduce-learning-rate-on-plateau and early-stopping setup might need some inspection
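
To reason about a `ReduceLROnPlateau` callback with `patience=3`, here is a simplified pure-Python simulation of its rule: multiply the learning rate by `factor` once the monitored loss has failed to improve for `patience` consecutive epochs. It assumes the Keras defaults `factor=0.1` and `min_delta=1e-4` and ignores cooldown and `min_lr`. The loss values are made up, and `simulate_reduce_on_plateau` is a hypothetical helper.

```python
def simulate_reduce_on_plateau(val_losses, lr, factor=0.1, patience=3, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    lrs = []
    for loss in val_losses:
        if loss < best - min_delta:
            # improvement: remember it and reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        lrs.append(lr)
    return lrs

made_up_losses = [200.0, 150.0, 140.0, 141.0, 140.5, 142.0, 139.0]
for epoch, lr in enumerate(simulate_reduce_on_plateau(made_up_losses, lr=5e-5), start=1):
    print(epoch, lr)
```

In this made-up run the learning rate drops after epoch 6, the third epoch in a row without a new best loss.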

In [11]:
    
# early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(patience=10, 
                                                  verbose=1, 
                                                  mode = 'auto') 
lr_reducer=tf.keras.callbacks.ReduceLROnPlateau( patience=3)
model_save=tf.keras.callbacks.ModelCheckpoint("model.h5",
                                                save_best_only=True,
                                                save_weights_only=True,
                                                verbose=1)
# note: early_stopping is defined above but not passed to fit; add it to this list to enable it
callbacks = [lr_reducer,model_save]

with strategy.scope():
    loss_fn = CTCLoss(config, (PER_REPLICA_BATCH_SIZE,cfg.audio_shape[0]), division_factor=PER_REPLICA_BATCH_SIZE)
    model.compile(optimizer=tf.keras.optimizers.Adam(5e-5),
                  loss=loss_fn)

In [12]:
history=model.fit(train_ds,
                  epochs=EPOCHS,
                  steps_per_epoch=STEPS_PER_EPOCH,
                  verbose=1,
                  validation_data=eval_ds,
                  validation_steps=EVAL_STEPS, 
                  callbacks=callbacks)

Epoch 1/25

Epoch 00001: val_loss improved from inf to 1746.72913, saving model to model.h5
Epoch 2/25

Epoch 00002: val_loss improved from 1746.72913 to 413.10388, saving model to model.h5
Epoch 3/25

Epoch 00003: val_loss improved from 413.10388 to 271.03256, saving model to model.h5
Epoch 4/25

Epoch 00004: val_loss improved from 271.03256 to 203.81522, saving model to model.h5
Epoch 5/25

Epoch 00005: val_loss improved from 203.81522 to 176.95090, saving model to model.h5
Epoch 6/25

Epoch 00006: val_loss improved from 176.95090 to 157.85423, saving model to model.h5
Epoch 7/25

Epoch 00007: val_loss improved from 157.85423 to 147.83566, saving model to model.h5
Epoch 8/25

Epoch 00008: val_loss improved from 147.83566 to 137.75560, saving model to model.h5
Epoch 9/25

Epoch 00009: val_loss improved from 137.75560 to 134.38936, saving model to model.h5
Epoch 10/25

Epoch 00010: val_loss improved from 134.38936 to 133.71933, saving model to model.h5
Epoch 11/25

Epoch 00011: val_los

In [13]:
import pandas as pd

# collect the per-epoch metrics from the training history and save them as csv
curves={}
for key in history.history.keys():
    curves[key]=history.history[key]
curves=pd.DataFrame(curves)
curves.to_csv("history.csv",index=False)

In [None]:
curves