A data pipeline is a series of operations (filter, shuffle, map, and so on) applied to data one after another. With the TensorFlow `tf.data` API, these operations are chained on a `Dataset` object, and steps such as `map` can be run in parallel, which can make data processing faster.

In [1]:
import tensorflow as tf

import numpy as np

#### Random dataset

Suppose the dataset is random data generated as shown below.

In [2]:
random_arr= np.random.randint(-100,500, 10)
random_arr

array([404, 229, 137, -79, 288, -93, 498,  31, -35, 134])

The first thing we need to do to use a TensorFlow data pipeline is to turn the data into a `tf.data.Dataset` object, whose elements are tensors.

In [3]:
# Build a Dataset whose elements are the slices (here, the individual values) of a list or array
tf_dataset= tf.data.Dataset.from_tensor_slices(random_arr)

tf_dataset

<TensorSliceDataset shapes: (), types: tf.int32>
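`from_tensor_slices` slices its input along the first axis, which is why a 1-D array produces scalar elements (`shapes: ()`). A minimal sketch with a hypothetical 2-D array (`matrix` and `row_ds` are illustrative names, not part of the notebook) shows that each element then becomes one row:

```python
import numpy as np
import tensorflow as tf

# Hypothetical 2-D array: 3 rows, 2 columns
matrix = np.array([[1, 2], [3, 4], [5, 6]])

# Slicing happens along the first axis, so each dataset element is one row
row_ds = tf.data.Dataset.from_tensor_slices(matrix)

rows = [row.tolist() for row in row_ds.as_numpy_iterator()]
print(rows)

# element_spec describes the shape and dtype of a single element
print(row_ds.element_spec)
```

The integer dtype reported by `element_spec` depends on the platform's NumPy default, so it may differ from the `tf.int32` shown above.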

Displaying the data, whose elements are now tensors

In [4]:
# Iterate over tf_dataset
# To iterate over only the first n elements, use tf_dataset.take(n)
for i in tf_dataset:
    # Printing i directly shows the tf.Tensor wrapper, not just the value
    print(i)
    # Call numpy() to get the underlying value
    print(i.numpy())

tf.Tensor(404, shape=(), dtype=int32)
404
tf.Tensor(229, shape=(), dtype=int32)
229
tf.Tensor(137, shape=(), dtype=int32)
137
tf.Tensor(-79, shape=(), dtype=int32)
-79
tf.Tensor(288, shape=(), dtype=int32)
288
tf.Tensor(-93, shape=(), dtype=int32)
-93
tf.Tensor(498, shape=(), dtype=int32)
498
tf.Tensor(31, shape=(), dtype=int32)
31
tf.Tensor(-35, shape=(), dtype=int32)
-35
tf.Tensor(134, shape=(), dtype=int32)
134
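The `take(n)` method mentioned in the comment above can be sketched as follows. The values are hard-coded here (a hypothetical `demo_ds`) so that the result is reproducible:

```python
import tensorflow as tf

# Hypothetical fixed values so the result is reproducible
demo_ds = tf.data.Dataset.from_tensor_slices([404, 229, 137, -79, 288])

# take(3) returns a new dataset containing at most the first 3 elements
first_three = [int(x.numpy()) for x in demo_ds.take(3)]
print(first_three)  # [404, 229, 137]
```

`take` does not modify `demo_ds`; it returns a new dataset, so the original can still be iterated in full.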


We will now apply a sequence of steps to tf_dataset: filter to keep the numbers greater than 0, normalize by dividing by 10, then shuffle.

Filter

In [5]:
filter_ds= lambda x: x>0

tf_dataset= tf_dataset.filter(filter_ds)

In [6]:
for i in tf_dataset:
    print(i.numpy())

404
229
137
288
498
31
134


Normalization

In [7]:
tf_dataset= tf_dataset.map(lambda y: y/10)

for i in tf_dataset:
    print(i)

tf.Tensor(40.4, shape=(), dtype=float64)
tf.Tensor(22.9, shape=(), dtype=float64)
tf.Tensor(13.7, shape=(), dtype=float64)
tf.Tensor(28.8, shape=(), dtype=float64)
tf.Tensor(49.8, shape=(), dtype=float64)
tf.Tensor(3.1, shape=(), dtype=float64)
tf.Tensor(13.4, shape=(), dtype=float64)
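Note that the division turned the `int32` elements into `float64`. If `float32` is preferred (an assumption about downstream use, not something the notebook requires), one option is to cast explicitly before dividing. A minimal sketch with hypothetical names `demo_ds` and `scaled_ds`:

```python
import tensorflow as tf

# A Python list of ints becomes int32 elements
demo_ds = tf.data.Dataset.from_tensor_slices([404, 229, 137])

# Cast first, then divide, so the result is float32 rather than float64
scaled_ds = demo_ds.map(lambda y: tf.cast(y, tf.float32) / 10.0)

print(scaled_ds.element_spec.dtype)
values = [float(v) for v in scaled_ds.as_numpy_iterator()]
print(values)
```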


Shuffle

In [8]:
tf_dataset= tf_dataset.shuffle(2)

for i in tf_dataset:
    print(i.numpy())

22.9
40.4
13.7
49.8
28.8
13.4
3.1
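The argument to `shuffle` is the buffer size. With a buffer of 2, each output is drawn from only two candidates at a time, so the first element printed can only be one of the first two inputs and the shuffle stays fairly local. A sketch of a full shuffle, assuming the dataset is small enough to buffer entirely (the names `values` and `full_shuffle` and the seed are illustrative):

```python
import tensorflow as tf

values = [40.4, 22.9, 13.7, 28.8, 49.8, 3.1, 13.4]
demo_ds = tf.data.Dataset.from_tensor_slices(values)

# A buffer at least as large as the dataset lets any element land in any position
full_shuffle = demo_ds.shuffle(buffer_size=len(values), seed=0)

shuffled = [round(float(v), 1) for v in full_shuffle.as_numpy_iterator()]
print(shuffled)
```

The order printed depends on the seed and the TensorFlow version, but the same seven values are always present.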


All of the steps above can, and preferably should, be chained together like this:

In [9]:
tf_dataset_2= tf.data.Dataset.from_tensor_slices(random_arr)

tf_dataset_2= tf_dataset_2.filter(filter_ds).map(lambda x: x/10).shuffle(2)

for i in tf_dataset_2:
    print(i.numpy())

40.4
22.9
13.7
28.8
49.8
3.1
13.4
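Parallel processing is opt-in: `map` runs sequentially unless `num_parallel_calls` is passed. The sketch below is one possible variant of the chained pipeline, assuming a TensorFlow version that provides `tf.data.AUTOTUNE` (older releases expose it as `tf.data.experimental.AUTOTUNE`). The shuffle step is left out here only to keep the output reproducible, and `data`, `pipeline`, and `result` are illustrative names:

```python
import numpy as np
import tensorflow as tf

# Same values as the random array shown earlier, hard-coded for reproducibility
data = np.array([404, 229, 137, -79, 288, -93, 498, 31, -35, 134])

pipeline = (
    tf.data.Dataset.from_tensor_slices(data)
    .filter(lambda x: x > 0)
    # num_parallel_calls lets tf.data apply the map function to several elements at once
    .map(lambda x: x / 10, num_parallel_calls=tf.data.AUTOTUNE)
    # prefetch overlaps producing elements with consuming them
    .prefetch(tf.data.AUTOTUNE)
)

result = [round(float(v), 1) for v in pipeline.as_numpy_iterator()]
print(result)  # [40.4, 22.9, 13.7, 28.8, 49.8, 3.1, 13.4]
```

For a map function as cheap as a division, the parallel version is unlikely to be noticeably faster; the benefit shows up with heavier work such as reading or decoding files.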


#### Another case

As another example, consider a dataset of text reviews.

Reading the data

In [10]:
review_ds= tf.data.Dataset.list_files('Dataset/reviews/*/*', shuffle=False)

type(review_ds)

tensorflow.python.data.ops.dataset_ops.TensorSliceDataset

Displaying the data, and defining a helper function to do so

In [11]:
def show():
    # Reads the module-level review_ds at call time; no global declaration is needed for reading
    for i in review_ds:
        print(i.numpy())

show()

b'Dataset\\reviews\\negative\\neg_1.txt'
b'Dataset\\reviews\\negative\\neg_2.txt'
b'Dataset\\reviews\\negative\\neg_3.txt'
b'Dataset\\reviews\\positive\\pos_1.txt'
b'Dataset\\reviews\\positive\\pos_2.txt'
b'Dataset\\reviews\\positive\\pos_3.txt'


As we can see, what gets printed is the path of each file.

To see the contents of a file, we can use the function shown below.

In [12]:
tf.io.read_file(b'Dataset\\reviews\\negative\\neg_2.txt')

<tf.Tensor: shape=(), dtype=string, numpy=b"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air.\n">

Next we will shuffle the files, extract the text and class of each one, and filter out the empty files.

Shuffle

In [13]:
review_ds= review_ds.shuffle(5)

show()

b'Dataset\\reviews\\positive\\pos_2.txt'
b'Dataset\\reviews\\negative\\neg_3.txt'
b'Dataset\\reviews\\positive\\pos_3.txt'
b'Dataset\\reviews\\negative\\neg_1.txt'
b'Dataset\\reviews\\positive\\pos_1.txt'
b'Dataset\\reviews\\negative\\neg_2.txt'


Extracting the text and, at the same time, deriving its label/class from the directory name

In [14]:
def get_text_and_class(file_path):
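    # Ordinary Python string methods would raise an error here, because file_path is a tensor
    # The '\\' separator matches the Windows-style paths printed above
    label= tf.strings.split(file_path, '\\')[-2]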
    text= tf.io.read_file(file_path)
    
    return text, label
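The hard-coded `'\\'` separator only works for Windows-style paths. A minimal sketch of a platform-independent variant, assuming the paths use the running platform's separator (the helper `get_label` and the example path are hypothetical):

```python
import os
import tensorflow as tf

def get_label(file_path):
    # os.path.sep is '\\' on Windows and '/' on Linux and macOS
    parts = tf.strings.split(file_path, os.path.sep)
    # The second-to-last component is the parent directory name
    return parts[-2]

# Hypothetical path built with the current platform's separator
example_path = os.path.join('Dataset', 'reviews', 'negative', 'neg_1.txt')
label = get_label(tf.constant(example_path))
print(label.numpy())  # b'negative'
```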

Mapping review_ds with the function above

In [15]:
review_ds= review_ds.map(get_text_and_class)

In [16]:
for i in review_ds.take(1):
    print(i)

(<tf.Tensor: shape=(), dtype=string, numpy=b"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air.\n">, <tf.Tensor: sha

Filtering out files with no content. There are two conditions we can use: the text has length 0, or the text equals the empty string.

In [17]:
for i in review_ds:
    print(len(i[0].numpy()))
    print((i[0]=='').numpy())

749
False
1762
False
999
False
0
True
0
True
935
False
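The length-based condition can also be written as a dataset filter with `tf.strings.length`, which runs inside the pipeline without calling `numpy()`. A minimal sketch on hypothetical in-memory data (`texts`, `labels`, and `non_empty` are illustrative names):

```python
import tensorflow as tf

# Hypothetical (text, label) pairs; two of the texts are empty
texts = ['good movie', '', 'bad movie', '']
labels = ['positive', 'positive', 'negative', 'negative']
demo_ds = tf.data.Dataset.from_tensor_slices((texts, labels))

# Keep only the pairs whose text contains at least one byte
non_empty = demo_ds.filter(lambda text, label: tf.strings.length(text) > 0)

kept = [(t.decode(), l.decode()) for t, l in non_empty.as_numpy_iterator()]
print(kept)  # [('good movie', 'positive'), ('bad movie', 'negative')]
```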


Applying the filter. The condition we choose is the empty-string comparison: the filter keeps only the elements whose text is not an empty string.

In [18]:
review_filter= lambda x, y: x!=''
review_ds= review_ds.filter(review_filter)

for i in review_ds.take(2):
    print(i[0])

tf.Tensor(b"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air.\n", shape=(), dtype=string)
tf.Tensor(b"One of the ot

And now we apply all of the steps above in a single line of code.

In [19]:
review_ds_2= tf.data.Dataset.list_files('Dataset/reviews/*/*', shuffle=False)

review_ds_2= review_ds_2.shuffle(5).map(get_text_and_class).filter(review_filter)

for i in review_ds_2.take(2):
    print(i)

(<tf.Tensor: shape=(), dtype=string, numpy=b"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.\n">, <tf.Tensor: shape=(), dtype=string, numpy=b'negative'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The