# Efficiently reading multiple files in Tensorflow 2
Author: [Biswajit Sahoo](https://biswajitsahoo1111.github.io/)

**Note**: Whether this method is efficient is debatable. The efficiency of a pipeline depends on many factors: how efficiently the data are loaded, the architecture of the machine doing the computation, whether a GPU is available, and so on. Readers may therefore see different performance when they run this method on their own systems. The system on which we ran this notebook has 44 CPU cores. The `Tensorflow` version is 2.2.0 and it is `XLA` enabled. We did not use any GPU. We achieved a 20% improvement over the naive method. For one personal application involving moderately sized data (3-4 GB), I achieved a 10x performance improvement, so I hope that this method can be applied to other applications as well. Please note that, for reasons I have not been able to determine, the speedup technique doesn't work in `Google Colab`. It does work on the GPU-enabled personal systems that I have checked.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://github.com/biswajitsahoo1111/cbm_codes_open/blob/master/notebooks/Reading_multiple_files_in_Tensorflow_2.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://www.dropbox.com/s/o4aevvuqr39kq20/Reading_multiple_files_in_Tensorflow_2.ipynb?dl=1"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This post is a sequel to [an older post](https://biswajitsahoo1111.github.io/post/reading-multiple-files-in-tensorflow-2/). In the previous post, we discussed ways in which we can read multiple files in `Tensorflow 2`. If our aim is only to read files without doing any transformation on the data, that method might work well for most applications. But if we need to apply complex transformations to the data before training our deep learning algorithm, the old method might turn out to be slow. In this post, we will describe a way to speed up that process. The transformations that we will consider are computing a `spectrogram` and normalizing (converting each value to a standard normal value). We have chosen these transformations just to illustrate the point. Readers can use any transformation (or no transformation) of their choice. More details on improving data performance can be found in this [tensorflow guide](https://www.tensorflow.org/guide/data_performance).

As this post is a sequel, we expect readers to be familiar with the old post. We will not elaborate on points that have already been discussed. Rather, we will focus on [section 4](#speedup) which is the main topic of this post.

## Outline:
1. [Create 500 `".csv"` files and save them in the folder "random_data" in the current directory.](#create_files)
2. [Write a generator that reads data from the folder in chunks and transforms it.](#generator)
3. [Build data pipeline and train a CNN model.](#model)
4. [How to make the code run faster?](#speedup)
5. [How to make predictions?](#predictions)

<a id = "create_files"></a>

## 1. Create 500 `.csv` files of random data

As we intend to train a CNN model for classification using our data, we will generate data for 5 different classes. Following is the process that we will follow.
* Each `.csv` file will have one column of data with 1024 entries.
* Each file will be saved using one of the following names (Fault_1, Fault_2, Fault_3, Fault_4, Fault_5). The dataset is balanced, meaning that each category has approximately the same number of observations. Data files in the "Fault_1" category will have names such as "Fault_1_001.csv", "Fault_1_002.csv", "Fault_1_003.csv", ..., "Fault_1_100.csv". Similarly for the other classes.

In [1]:
import numpy as np
import os
import glob
np.random.seed(1111)

First create a function that will generate random files. 

In [2]:
def create_random_csv_files(fault_classes, number_of_files_in_each_class):
    os.mkdir("./random_data/")  # Make a directory to save created files.
    for fault_class in fault_classes:
        for i in range(number_of_files_in_each_class):
            data = np.random.rand(1024,)
            # Build a name such as "./random_data/Fault_1_001.csv"
            file_name = "./random_data/" + fault_class + "_" + "{0:03}".format(i+1) + ".csv"
            np.savetxt(file_name, data, delimiter = ",", header = "V1", comments = "")
        print(str(number_of_files_in_each_class) + " " + fault_class + " files created.")

Now use the function to create 100 files each for five fault types. 

In [3]:
create_random_csv_files(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"], number_of_files_in_each_class = 100)

100 Fault_1 files created.
100 Fault_2 files created.
100 Fault_3 files created.
100 Fault_4 files created.
100 Fault_5 files created.


In [4]:
files = np.sort(glob.glob("./random_data/*"))
print("Total number of files: ", len(files))
print("Showing first 10 files...")
files[:10]

Total number of files:  500
Showing first 10 files...


array(['./random_data/Fault_1_001.csv', './random_data/Fault_1_002.csv',
       './random_data/Fault_1_003.csv', './random_data/Fault_1_004.csv',
       './random_data/Fault_1_005.csv', './random_data/Fault_1_006.csv',
       './random_data/Fault_1_007.csv', './random_data/Fault_1_008.csv',
       './random_data/Fault_1_009.csv', './random_data/Fault_1_010.csv'],
      dtype='<U29')

To extract labels from file name, extract the part of the file name that corresponds to fault type. 

In [5]:
print(files[0])

./random_data/Fault_1_001.csv


In [6]:
print(files[0][14:21])

Fault_1
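
The slice `files[0][14:21]` depends on the exact length of the directory prefix, so the indices have to change whenever the files move (later in this post they become `[22:29]`). As a hedged alternative, the sketch below derives the class from the file name itself. The helper name `fault_class_from_path` is hypothetical and not part of the original notebook, and it assumes file names of the form `<class>_<number>.csv`.

```python
import os

def fault_class_from_path(path):
    """Return the fault class encoded in a file name like 'Fault_1_001.csv'."""
    stem = os.path.splitext(os.path.basename(path))[0]   # e.g. 'Fault_1_001'
    return stem.rsplit("_", 1)[0]                        # drop the trailing counter

flat_label = fault_class_from_path("./random_data/Fault_1_001.csv")
nested_label = fault_class_from_path("./random_data/Fault_3/Fault_3_045.csv")
print(flat_label, nested_label)
```

Note that file names passed to a generator through `tf.data.Dataset.from_generator(..., args = ...)` arrive as bytes, so they would need to be decoded before a string helper like this one could be used there.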


Now that the data have been created, we move to the next step: defining a generator that preprocesses each time series into a matrix-like shape that a 2-D CNN can ingest.

<a id = "generator"></a>

## 2. Write a generator that reads data in chunks and preprocesses it

These are the few things that we want our generator to have.

 1. It should run indefinitely, i.e., it is an infinite loop.
 2. Inside generator loop, read individual files using `pandas`.
 3. Do transformations on data if required.
 4. Yield the data.
 
As we will be solving a classification problem, we have to assign a label to each raw data file. We will use the following labels for convenience.

|Class| Label|
|-----|------|
|Fault_1| 0|
|Fault_2| 1|
|Fault_3| 2|
|Fault_4| 3|
|Fault_5| 4|

The generator will `yield` both data and labels. The generator takes a list of file names as first argument. The second argument is `batch_size`.
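
The class-to-label table above amounts to a lookup. A minimal sketch of that mapping follows; the name `class_to_label` is illustrative and does not appear in the notebook, whose generator performs the same lookup with a regular-expression loop instead.

```python
label_classes = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]

# The position of each class name in the list is its integer label.
class_to_label = {name: index for index, name in enumerate(label_classes)}

labels = [class_to_label[name] for name in ["Fault_1", "Fault_3", "Fault_5"]]
print(labels)
```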

In [7]:
import tensorflow as tf
import pandas as pd
import re

In [8]:
def tf_data_generator(file_list, batch_size = 20):
    i = 0
    while True:    # This loop makes the generator an infinite loop
        if i*batch_size >= len(file_list):  
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i*batch_size:(i+1)*batch_size] 
            data = []
            labels = []
            label_classes = tf.constant(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]) 
            for file in file_chunk:
                temp = pd.read_csv(open(file,'r')).astype(np.float32)    # Read data
                #########################################################################################################
                # Apply transformations. Comment this portion if you don't have to do any.
                # Try to use Tensorflow transformations as much as possible. First compute a spectrogram.
                temp = tf.math.abs(tf.signal.stft(tf.reshape(temp.values, shape = (1024,)),frame_length = 64, frame_step = 32, fft_length = 64))
                # After STFT transformation with given parameters, shape = (31,33)
                temp = tf.image.per_image_standardization(tf.reshape(temp, shape = (-1,31,33,1))) # Image Normalization
                ##########################################################################################################
                # temp = tf.reshape(temp, (32,32,1)) # Uncomment this line if you have not done any transformation.
                data.append(temp)
                pattern = tf.constant(file[14:21])  # Fault class taken from the file name
                for j in range(len(label_classes)):
                    if re.match(pattern.numpy(), label_classes[j].numpy()): 
                        labels.append(j)
            data = np.asarray(data).reshape(-1,31,33,1) 
            labels = np.asarray(labels)
            yield data, labels
            i = i + 1
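
The `(31, 33)` shape quoted in the generator's comments follows from the STFT parameters. The sketch below works through that arithmetic, assuming no padding at the end of the signal (the default `pad_end = False` of `tf.signal.stft`). The function name `stft_output_shape` is hypothetical.

```python
def stft_output_shape(signal_length, frame_length, frame_step, fft_length):
    """Number of frames and frequency bins of an unpadded STFT of a real signal."""
    num_frames = 1 + (signal_length - frame_length) // frame_step
    num_bins = fft_length // 2 + 1      # only non-negative frequencies are kept
    return num_frames, num_bins

shape = stft_output_shape(signal_length=1024, frame_length=64, frame_step=32, fft_length=64)
print(shape)  # (31, 33)
```

With 1024 samples, 64-sample frames, and a step of 32, there are 1 + 960 // 32 = 31 frames, and a 64-point FFT of a real signal gives 33 unique frequency bins.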

In [9]:
batch_size = 15
dataset = tf.data.Dataset.from_generator(tf_data_generator,args= [files, batch_size],output_types = (tf.float32, tf.float32),
                                                output_shapes = ((None,31,33,1),(None,)))

In [10]:
for data, labels in dataset.take(7):
  print(data.shape)
  print(labels)

(15, 31, 33, 1)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)
(15, 31, 33, 1)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)
(15, 31, 33, 1)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)
(15, 31, 33, 1)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)
(15, 31, 33, 1)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)
(15, 31, 33, 1)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(15,), dtype=float32)
(15, 31, 33, 1)
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1.], shape=(15,), dtype=float32)


The generator works fine. Now, we will train a full CNN model using the generator. As is done in every model, we will first shuffle data files. Split the files into train, validation, and test set. Using the `tf_data_generator` create three tensorflow datasets corresponding to train, validation, and test data respectively. Finally, we will create a simple CNN model. Train it using train dataset, see its performance on validation dataset, and obtain prediction using test dataset. Keep in mind that our aim is not to improve performance of the model. As the data are random, don't expect to see good performance. The aim is only to create a pipeline. 

<a id = "model"></a>

## 3. Building data pipeline and training a CNN model

Before building the data pipeline, we will first move files corresponding to each fault class into different folders. This will make it convenient to split data into training, validation, and test set, keeping the balanced nature of the dataset intact.

In [11]:
import shutil

Create five different folders.

In [12]:
fault_folders = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]
for folder_name in fault_folders:
    os.mkdir(os.path.join("./random_data", folder_name))

Move files into those folders.

In [13]:
for file in files:
    pattern = "^" + file[14:21]  # Fault class taken from the file name, e.g. "^Fault_1"
    for j in range(len(fault_folders)):
        if re.match(pattern, fault_folders[j]):
            dest = os.path.join("./random_data/", fault_folders[j])
            shutil.move(file, dest)

In [14]:
glob.glob("./random_data/*")

['./random_data/Fault_1',
 './random_data/Fault_2',
 './random_data/Fault_3',
 './random_data/Fault_4',
 './random_data/Fault_5']

In [15]:
np.sort(glob.glob("./random_data/Fault_1/*"))[:10] # Showing first 10 files of Fault_1 folder

array(['./random_data/Fault_1/Fault_1_001.csv',
       './random_data/Fault_1/Fault_1_002.csv',
       './random_data/Fault_1/Fault_1_003.csv',
       './random_data/Fault_1/Fault_1_004.csv',
       './random_data/Fault_1/Fault_1_005.csv',
       './random_data/Fault_1/Fault_1_006.csv',
       './random_data/Fault_1/Fault_1_007.csv',
       './random_data/Fault_1/Fault_1_008.csv',
       './random_data/Fault_1/Fault_1_009.csv',
       './random_data/Fault_1/Fault_1_010.csv'], dtype='<U37')

In [16]:
np.sort(glob.glob("./random_data/Fault_3/*"))[:10] # Showing first 10 files of Fault_3 folder

array(['./random_data/Fault_3/Fault_3_001.csv',
       './random_data/Fault_3/Fault_3_002.csv',
       './random_data/Fault_3/Fault_3_003.csv',
       './random_data/Fault_3/Fault_3_004.csv',
       './random_data/Fault_3/Fault_3_005.csv',
       './random_data/Fault_3/Fault_3_006.csv',
       './random_data/Fault_3/Fault_3_007.csv',
       './random_data/Fault_3/Fault_3_008.csv',
       './random_data/Fault_3/Fault_3_009.csv',
       './random_data/Fault_3/Fault_3_010.csv'], dtype='<U37')

Prepare the data for the training, validation, and test sets. For each fault type, we will keep 70 files for training, 10 files for validation, and 20 files for testing.

In [17]:
fault_1_files = glob.glob("./random_data/Fault_1/*")
fault_2_files = glob.glob("./random_data/Fault_2/*")
fault_3_files = glob.glob("./random_data/Fault_3/*")
fault_4_files = glob.glob("./random_data/Fault_4/*")
fault_5_files = glob.glob("./random_data/Fault_5/*")

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
fault_1_train, fault_1_test = train_test_split(fault_1_files, test_size = 20, random_state = 5)
fault_2_train, fault_2_test = train_test_split(fault_2_files, test_size = 20, random_state = 54)
fault_3_train, fault_3_test = train_test_split(fault_3_files, test_size = 20, random_state = 543)
fault_4_train, fault_4_test = train_test_split(fault_4_files, test_size = 20, random_state = 5432)
fault_5_train, fault_5_test = train_test_split(fault_5_files, test_size = 20, random_state = 54321)

In [20]:
fault_1_train, fault_1_val = train_test_split(fault_1_train, test_size = 10, random_state = 1)
fault_2_train, fault_2_val = train_test_split(fault_2_train, test_size = 10, random_state = 12)
fault_3_train, fault_3_val = train_test_split(fault_3_train, test_size = 10, random_state = 123)
fault_4_train, fault_4_val = train_test_split(fault_4_train, test_size = 10, random_state = 1234)
fault_5_train, fault_5_val = train_test_split(fault_5_train, test_size = 10, random_state = 12345)

In [21]:
train_file_names = fault_1_train + fault_2_train + fault_3_train + fault_4_train + fault_5_train
validation_file_names = fault_1_val + fault_2_val + fault_3_val + fault_4_val + fault_5_val
test_file_names = fault_1_test + fault_2_test + fault_3_test + fault_4_test + fault_5_test

# Shuffle files
np.random.shuffle(train_file_names)
np.random.shuffle(validation_file_names)
np.random.shuffle(test_file_names)

In [22]:
print("Number of train_files:" ,len(train_file_names))
print("Number of validation_files:" ,len(validation_file_names))
print("Number of test_files:" ,len(test_file_names))

Number of train_files: 350
Number of validation_files: 50
Number of test_files: 100
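
The per-class splitting above repeats the same call for each of the five classes. As a hedged sketch, it could be condensed into a loop. The version below uses Python's `random` module rather than `train_test_split`, so the files it selects will differ from those in the notebook. The helper `split_class` and the stand-in file names are assumptions made for illustration.

```python
import random

def split_class(files, n_test, n_val, seed):
    """Shuffle one class's files and cut them into train/validation/test lists."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train_names, val_names, test_names = [], [], []
for k in range(1, 6):
    # Stand-in file names; in the notebook these come from glob.glob(...)
    class_files = ["./random_data/Fault_{0}/Fault_{0}_{1:03}.csv".format(k, i + 1)
                   for i in range(100)]
    train, val, test = split_class(class_files, n_test=20, n_val=10, seed=k)
    train_names += train
    val_names += val
    test_names += test

print(len(train_names), len(val_names), len(test_names))
```

Splitting each class separately, as the notebook does, is what keeps the three sets balanced.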


In [23]:
batch_size = 32
train_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [train_file_names, batch_size], 
                                              output_shapes = ((None,31,33,1),(None,)),
                                              output_types = (tf.float32, tf.float32))

validation_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [validation_file_names, batch_size],
                                                   output_shapes = ((None,31,33,1),(None,)),
                                                   output_types = (tf.float32, tf.float32))

test_dataset = tf.data.Dataset.from_generator(tf_data_generator, args = [test_file_names, batch_size],
                                             output_shapes = ((None,31,33,1),(None,)),
                                             output_types = (tf.float32, tf.float32))

Now create the model.

In [24]:
from tensorflow.keras import layers

In [25]:
model = tf.keras.Sequential([
    layers.Conv2D(16, 3, activation = "relu", input_shape = (31,33,1)),
    layers.MaxPool2D(2),
    layers.Conv2D(32, 3, activation = "relu"),
    layers.MaxPool2D(2),
    layers.Flatten(),
    layers.Dense(16, activation = "relu"),
    layers.Dense(5, activation = "softmax")
])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 29, 31, 16)        160       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 15, 16)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 12, 13, 32)        4640      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 32)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1152)              0         
_________________________________________________________________
dense (Dense)                (None, 16)                18448     
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 8

Compile the model.

In [26]:
model.compile(loss = "sparse_categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

Before we fit the model, we have to do one important calculation. Remember that our generators are infinite loops, so if no stopping criterion is given, they will run indefinitely. But we want our model to run for a fixed number of epochs, say 10. So our generator should loop over the data files just 10 times and no more. This is achieved by setting the arguments `steps_per_epoch` and `validation_steps` to the desired numbers in `model.fit()`. Similarly, while evaluating the model, we need to set the argument `steps` to the desired number in `model.evaluate()`.

There are 350 files in the training set and `batch_size` is 32. So if the generator runs ceil(350/32) = 11 times, it will have covered one epoch. Therefore, we should set `steps_per_epoch` to 11. Similarly, `validation_steps = 2` (50 validation files) and, in `model.evaluate()`, `steps = 4` (100 test files).

In [27]:
# The built-in int is used because np.int is deprecated in newer NumPy releases.
steps_per_epoch = int(np.ceil(len(train_file_names)/batch_size))
validation_steps = int(np.ceil(len(validation_file_names)/batch_size))
steps = int(np.ceil(len(test_file_names)/batch_size))
print("steps_per_epoch = ", steps_per_epoch)
print("validation_steps = ", validation_steps)
print("steps = ", steps)

steps_per_epoch =  11
validation_steps =  2
steps =  4


In [28]:
model.fit(train_dataset, validation_data = validation_dataset, steps_per_epoch = steps_per_epoch,
         validation_steps = validation_steps, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc3b818dc50>

In [29]:
test_loss, test_accuracy = model.evaluate(test_dataset, steps = steps)



In [30]:
print("Test loss: ", test_loss)
print("Test accuracy:", test_accuracy)

Test loss:  1.6098381280899048
Test accuracy: 0.20000000298023224


As expected, the model performs terribly: with five balanced classes of random data, an accuracy of 20% is what chance alone would give.

<a id = "speedup"></a>

## 4. How to make the code run faster?

If no transformations are used, just using `prefetch` might improve performance. In deep learning, GPUs are usually used for training, but all the data processing is done on the CPU. In the naive approach, we first process a batch of data on the CPU, then send the processed data to the GPU, and only after that training step finishes do we prepare the next batch. This approach is not efficient because the GPU has to wait for data to be prepared. With `prefetch`, we prepare batches of data and keep them ready while training continues. In this way, the waiting time of the GPU is minimized.
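
To make the idea concrete, here is a conceptual sketch of prefetching in plain Python: a background thread prepares items ahead of the consumer and parks them in a bounded queue. This is not how `tf.data` implements `prefetch`; the function names `prefetch` and `slow_batches` and the buffer size are assumptions made for illustration only.

```python
import queue
import threading
import time

def prefetch(generator, buffer_size=2):
    """Yield items from `generator`, preparing up to `buffer_size` items ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()                     # marks the end of the stream

    def producer():
        for item in generator:
            buffer.put(item)                # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()                 # blocks only if nothing is ready yet
        if item is sentinel:
            return
        yield item

def slow_batches(n, delay=0.01):
    for i in range(n):
        time.sleep(delay)                   # stands in for reading and preprocessing
        yield i

result = list(prefetch(slow_batches(5)))
print(result)
```

While the consumer is busy with one item, the producer thread is already preparing the next ones, which is the overlap that `prefetch` aims for.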

When data transformations are used, our aim should always be to use the parallel processing capabilities of `tensorflow`. We can achieve this using the `map` function. All transformations are defined inside the function passed to `map`. Then we can `prefetch` batches to further improve performance. The whole pipeline is as follows.

```
1. def transformation_function(...):
    # Define all transformations (STFT, Normalization, etc.)
    
2. def generator(...):
    
       # Read data
    
       # Call transformation_function using tf.data.Dataset.map so that it can parallelize operations.
    
       # Finally yield the processed data

3. Create tf.data.Dataset objects.

4. Prefetch datasets.

5. Create model and train it.
```
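
Step 2 of the outline relies on `map` applying the transformation to several elements at once. The plain-Python analogy below shows the idea of an order-preserving parallel map using a thread pool. It is only a conceptual sketch: `transform` is a hypothetical stand-in for the real transformation, and TensorFlow runs its own `map` inside its runtime, whereas Python threads give little speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(signal):
    """Stand-in for an expensive per-example transformation: remove the mean."""
    mean = sum(signal) / len(signal)
    return [value - mean for value in signal]

signals = [[float(i + j) for j in range(4)] for i in range(6)]

# Sequential map: one element at a time.
sequential = [transform(s) for s in signals]

# Parallel map: several elements in flight, results returned in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(transform, signals))

print(parallel == sequential)
```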
We will use one extra library, `tensorflow_datasets`, that will allow us to convert a `tf.data.Dataset` to `numpy` arrays. If `tensorflow_datasets` is not installed on your system, use `pip install tensorflow-datasets` to install it and then run the following code.

In [31]:
import tensorflow_datasets as tfds

In [32]:
def data_transformation_func(data):
  transformed_data = tf.math.abs(tf.signal.stft(data,frame_length = 64, frame_step = 32, fft_length = 64))
  transformed_data = tf.image.per_image_standardization(tf.reshape(transformed_data, shape = (-1,31,33,1))) # Normalization
  return transformed_data
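
The normalization step in `data_transformation_func` shifts each spectrogram to zero mean and scales it to unit variance. The sketch below reproduces that arithmetic in plain Python on a flat list of values, following the formula given in the documentation of `tf.image.per_image_standardization` as I understand it: `(x - mean) / max(stddev, 1/sqrt(N))`. The function name `standardize` is hypothetical.

```python
import math

def standardize(values):
    """Shift values to zero mean and scale them to unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    # The lower bound on the standard deviation avoids dividing by zero
    # when all values are identical.
    adjusted_std = max(math.sqrt(variance), 1.0 / math.sqrt(n))
    return [(v - mean) / adjusted_std for v in values]

pixels = [1.0, 2.0, 3.0, 4.0]
scaled = standardize(pixels)
print(scaled)
```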

In [33]:
def tf_data_generator_new(file_list, batch_size = 4):
    i = 0
    while True:
        if i*batch_size >= len(file_list):  
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i*batch_size:(i+1)*batch_size]
            data = []
            labels = []
            label_classes = tf.constant(["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]) 
            for file in file_chunk:
                temp = pd.read_csv(open(file,'r')).astype(np.float32)    # Read data
                data.append(tf.reshape(temp.values, shape = (1,1024)))
                pattern = tf.constant(file[22:29])  # Fault class taken from the file name
                for j in range(len(label_classes)):
                    if re.match(pattern.numpy(), label_classes[j].numpy()): 
                        labels.append(j)
                    
            data = np.asarray(data)
            labels = np.asarray(labels)
            first_dim = data.shape[0]
            # Create tensorflow dataset so that we can use `map` function that can do parallel computation.
            data_ds = tf.data.Dataset.from_tensor_slices(data)
            data_ds = data_ds.batch(batch_size = first_dim).map(data_transformation_func,
                                                                num_parallel_calls = tf.data.experimental.AUTOTUNE)
            # Convert the dataset to a generator and subsequently to numpy array
            data_ds = tfds.as_numpy(data_ds)   # This is where tensorflow-datasets library is used.
            data = np.asarray([data for data in data_ds]).reshape(first_dim,31,33,1)
            
            yield data, labels
            i = i + 1

In [34]:
train_file_names[:10]

['./random_data/Fault_3/Fault_3_045.csv',
 './random_data/Fault_1/Fault_1_032.csv',
 './random_data/Fault_1/Fault_1_025.csv',
 './random_data/Fault_2/Fault_2_013.csv',
 './random_data/Fault_3/Fault_3_053.csv',
 './random_data/Fault_1/Fault_1_087.csv',
 './random_data/Fault_5/Fault_5_053.csv',
 './random_data/Fault_4/Fault_4_019.csv',
 './random_data/Fault_3/Fault_3_034.csv',
 './random_data/Fault_2/Fault_2_044.csv']

In [35]:
train_file_names[0][22:29]

'Fault_3'

In [36]:
batch_size = 20
dataset_check = tf.data.Dataset.from_generator(tf_data_generator_new,args= [train_file_names, batch_size],output_types = (tf.float32, tf.float32),
                                                output_shapes = ((None,31,33,1),(None,)))

In [37]:
for data, labels in dataset_check.take(7):
  print(data.shape)
  print(labels)

(20, 31, 33, 1)
tf.Tensor([2. 0. 0. 1. 2. 0. 4. 3. 2. 1. 1. 0. 3. 3. 2. 3. 1. 4. 2. 4.], shape=(20,), dtype=float32)
(20, 31, 33, 1)
tf.Tensor([3. 1. 1. 3. 4. 4. 2. 3. 4. 3. 3. 0. 1. 2. 0. 3. 2. 2. 2. 4.], shape=(20,), dtype=float32)
(20, 31, 33, 1)
tf.Tensor([2. 3. 0. 2. 2. 4. 3. 0. 4. 1. 0. 0. 2. 0. 0. 1. 0. 3. 2. 1.], shape=(20,), dtype=float32)
(20, 31, 33, 1)
tf.Tensor([4. 2. 2. 2. 0. 3. 4. 2. 0. 1. 2. 2. 3. 4. 0. 4. 2. 0. 4. 4.], shape=(20,), dtype=float32)
(20, 31, 33, 1)
tf.Tensor([1. 0. 4. 4. 0. 1. 0. 4. 0. 2. 1. 4. 3. 2. 1. 4. 4. 2. 4. 3.], shape=(20,), dtype=float32)
(20, 31, 33, 1)
tf.Tensor([2. 2. 0. 1. 3. 2. 2. 2. 1. 3. 3. 4. 0. 1. 4. 1. 3. 2. 1. 3.], shape=(20,), dtype=float32)
(20, 31, 33, 1)
tf.Tensor([2. 1. 2. 2. 4. 4. 1. 0. 2. 2. 1. 2. 3. 0. 0. 2. 2. 0. 3. 3.], shape=(20,), dtype=float32)


In [38]:
batch_size = 32
train_dataset_new = tf.data.Dataset.from_generator(tf_data_generator_new, args = [train_file_names, batch_size], 
                                                  output_shapes = ((None,31,33,1),(None,)),
                                                  output_types = (tf.float32, tf.float32))

validation_dataset_new = tf.data.Dataset.from_generator(tf_data_generator_new, args = [validation_file_names, batch_size],
                                                       output_shapes = ((None,31,33,1),(None,)),
                                                       output_types = (tf.float32, tf.float32))

test_dataset_new = tf.data.Dataset.from_generator(tf_data_generator_new, args = [test_file_names, batch_size],
                                                 output_shapes = ((None,31,33,1),(None,)),
                                                 output_types = (tf.float32, tf.float32))

Prefetch datasets.

In [39]:
train_dataset_new = train_dataset_new.prefetch(buffer_size = tf.data.experimental.AUTOTUNE)
validation_dataset_new = validation_dataset_new.prefetch(buffer_size = tf.data.experimental.AUTOTUNE)

In [40]:
model.compile(loss = "sparse_categorical_crossentropy", optimizer = "adam", metrics = ["accuracy"])

In [41]:
model.fit(train_dataset_new, validation_data = validation_dataset_new, steps_per_epoch = steps_per_epoch,
         validation_steps = validation_steps, epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc398126090>

In [42]:
test_loss_new, test_acc_new = model.evaluate(test_dataset_new, steps = steps)



<a id = "predictions"></a>

## 5. How to make predictions?

In the generator used for prediction, we can also use the `map` function to parallelize data preprocessing. In practice, however, inference is much faster than training, so the naive method is usually fast enough for predictions. We show the naive implementation below.

In [43]:
def create_prediction_set(num_files = 20):
    os.mkdir("./random_data/prediction_set")
    for i in range(num_files):
        data = np.random.randn(1024,)
        file_name = "./random_data/prediction_set/"  + "file_" + "{0:03}".format(i+1) + ".csv" # This creates file_name
        np.savetxt(file_name, data, delimiter = ",", header = "V1", comments = "")
    print(str(num_files) + " "+ " files created in prediction set.")

Create some files for prediction set.

In [44]:
create_prediction_set(num_files = 55)

55  files created in prediction set.


In [45]:
prediction_files = glob.glob("./random_data/prediction_set/*")
print("Total number of files: ", len(prediction_files))
print("Showing first 10 files...")
prediction_files[:10]

Total number of files:  55
Showing first 10 files...


['./random_data/prediction_set/file_001.csv',
 './random_data/prediction_set/file_002.csv',
 './random_data/prediction_set/file_003.csv',
 './random_data/prediction_set/file_004.csv',
 './random_data/prediction_set/file_005.csv',
 './random_data/prediction_set/file_006.csv',
 './random_data/prediction_set/file_007.csv',
 './random_data/prediction_set/file_008.csv',
 './random_data/prediction_set/file_009.csv',
 './random_data/prediction_set/file_010.csv']

Now, we will create a generator to read these files in chunks. This generator will be slightly different from our previous generator. Firstly, we don't want the generator to run indefinitely. Secondly, we don't have any labels. So this generator should only `yield` data. This is how we achieve that.

In [46]:
def generator_for_prediction(file_list, batch_size = 20):
    i = 0
    while i <= (len(file_list)/batch_size):
        if i == np.floor(len(file_list)/batch_size):
            file_chunk = file_list[i*batch_size:len(file_list)]
            if len(file_chunk)==0:
                break
        else:
            file_chunk = file_list[i*batch_size:(i+1)*batch_size] 
        data = []
        for file in file_chunk:
            temp = pd.read_csv(open(file,'r')).astype(np.float32)
            temp = tf.math.abs(tf.signal.stft(tf.reshape(temp.values, shape = (1024,)),frame_length = 64, frame_step = 32, fft_length = 64))
            # After STFT transformation with given parameters, shape = (31,33)
            temp = tf.image.per_image_standardization(tf.reshape(temp, shape = (-1,31,33,1))) # Image Normalization
            data.append(temp) 
        data = np.asarray(data).reshape(-1,31,33,1)
        yield data
        i = i + 1
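
The chunking logic of `generator_for_prediction` (full batches followed by one smaller final batch, then stop) can also be written with a stepped `range`. The sketch below shows only the chunking, without reading or transforming files; the function name `chunks` and the stand-in file names are assumptions.

```python
def chunks(file_list, batch_size):
    """Yield consecutive slices of at most `batch_size` items, then stop."""
    for start in range(0, len(file_list), batch_size):
        yield file_list[start:start + batch_size]

names = ["file_{0:03}.csv".format(i + 1) for i in range(55)]
sizes = [len(chunk) for chunk in chunks(names, batch_size=10)]
print(sizes)  # [10, 10, 10, 10, 10, 5]
```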

Check whether the generator works or not.

In [47]:
pred_gen = generator_for_prediction(prediction_files,  batch_size = 10)
for data in pred_gen:
    print(data.shape)

(10, 31, 33, 1)
(10, 31, 33, 1)
(10, 31, 33, 1)
(10, 31, 33, 1)
(10, 31, 33, 1)
(5, 31, 33, 1)


Create a `tensorflow dataset`.

In [48]:
batch_size = 10
prediction_dataset = tf.data.Dataset.from_generator(generator_for_prediction,args=[prediction_files, batch_size],
                                                 output_shapes=(None,31,33,1), output_types=(tf.float32))

In [49]:
steps = int(np.ceil(len(prediction_files)/batch_size))  # np.int is deprecated in newer NumPy releases
predictions = model.predict(prediction_dataset,steps = steps)

In [50]:
print("Shape of prediction array: ", predictions.shape)
predictions

Shape of prediction array:  (55, 5)


array([[0.13783312, 0.06810743, 0.18828638, 0.4399181 , 0.16585506],
       [0.2011155 , 0.0909321 , 0.12722781, 0.34147328, 0.23925126],
       [0.184051  , 0.11195082, 0.15630874, 0.41264012, 0.13504937],
       [0.17021744, 0.15275575, 0.17176864, 0.36582083, 0.13943738],
       [0.22107455, 0.13893652, 0.16182247, 0.22719847, 0.25096804],
       [0.16544239, 0.09297101, 0.19448881, 0.37793893, 0.16915883],
       [0.20981115, 0.09095117, 0.1454936 , 0.37553373, 0.17821036],
       [0.18948458, 0.08287238, 0.16043249, 0.31469837, 0.25251222],
       [0.14806318, 0.08988151, 0.18063019, 0.43348154, 0.14794365],
       [0.19300967, 0.17423573, 0.1853214 , 0.29504803, 0.15238515],
       [0.14796554, 0.10064519, 0.17332935, 0.46094754, 0.11711246],
       [0.1620164 , 0.10878453, 0.19735815, 0.28250632, 0.2493346 ],
       [0.17244144, 0.13593125, 0.18931074, 0.3498449 , 0.1524716 ],
       [0.16827711, 0.08276799, 0.16664039, 0.38747287, 0.19484173],
       [0.16345006, 0.1138956 , 0.

Each prediction is a 5-dimensional vector, because we used 5 neurons in the output layer with a softmax activation. The 5 entries of the output vector for a given input add up to 1, so they can be interpreted as probabilities. We therefore assign each input to the class with the highest predicted probability. To get the index of that class, we can use `np.argmax()`.

In [51]:
np.argmax(predictions, axis = 1)

array([3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3,
       4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3])
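
The indices returned by `np.argmax()` can be mapped back to class names using the label table from section 2. The sketch below does this in plain Python for three of the prediction rows printed above (the first, second, and fifth); the helper `argmax` and the variable names are illustrative.

```python
label_classes = ["Fault_1", "Fault_2", "Fault_3", "Fault_4", "Fault_5"]

# Three rows copied from the prediction output shown earlier.
sample_predictions = [
    [0.13783312, 0.06810743, 0.18828638, 0.4399181, 0.16585506],
    [0.2011155, 0.0909321, 0.12722781, 0.34147328, 0.23925126],
    [0.22107455, 0.13893652, 0.16182247, 0.22719847, 0.25096804],
]

def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)

predicted_classes = [label_classes[argmax(row)] for row in sample_predictions]
print(predicted_classes)  # ['Fault_4', 'Fault_4', 'Fault_5']
```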

As a final comment, read the **note** at the beginning of this post.