## File used to process the Cityscapes files

Make sure that the following files are in the working directory:

- DataProcessing.py
- config.yml
- process.py (optional)
- utils.py

NOTE: in DataProcessing.py, change `from data_processing.utils import *` to `from utils import *`, because the files sit in one flat directory here rather than inside the `data_processing` package.
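
The import change can also be applied programmatically rather than by hand. The following is a minimal sketch, assuming DataProcessing.py contains the import line exactly as written in the note; `patch_import` is a hypothetical helper name, not part of the project:

```python
from pathlib import Path

OLD_IMPORT = "from data_processing.utils import *"
NEW_IMPORT = "from utils import *"


def patch_import(path="DataProcessing.py"):
    """Rewrite the package import so the module runs from a flat directory.

    Returns True if the file was changed, False if the old import was not found.
    """
    file = Path(path)
    source = file.read_text()
    if OLD_IMPORT not in source:
        return False
    file.write_text(source.replace(OLD_IMPORT, NEW_IMPORT))
    return True
```

Calling `patch_import()` a second time is harmless: the old import is no longer present, so the file is left untouched.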

config.yml contents:


```
img_height: 300
img_width: 300
bucket_name: "DataSet"
```
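
Since later cells index into this config, it can help to check the keys right after loading. The following is a minimal sketch, assuming the three keys above are the only required ones and that the image dimensions should be positive integers; `validate_config` is a hypothetical helper, not part of DataProcessing.py:

```python
# Required keys and the Python type each value is assumed to have.
REQUIRED_KEYS = {"img_height": int, "img_width": int, "bucket_name": str}


def validate_config(config):
    """Raise ValueError if a required key is missing or has the wrong type."""
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            raise ValueError(f"config.yml is missing required key: {key}")
        if not isinstance(config[key], expected_type):
            raise ValueError(f"{key} should be of type {expected_type.__name__}")
    if config["img_height"] <= 0 or config["img_width"] <= 0:
        raise ValueError("img_height and img_width must be positive")
    return config
```

The function takes the already-parsed dictionary, so it could be called on the result of the `yaml.load` cell further down.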



In [1]:
! pip install -q tensorflow-io


In [20]:
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import tensorflow_io as tfio
import tensorflow as tf
import importlib
import yaml

from google.colab import auth
from pathlib import Path
from tqdm import tqdm 
from PIL import Image


In [3]:
auth.authenticate_user()

In [4]:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse

OK
45 packages can be upgraded. Run 'apt list --upgradable' to see them.
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  gcsfuse
0 upgraded, 1 newly installed, 0 to remove and 45 not upgraded.
Need to get 10.8 MB of archives.
After this operation, 23.1 MB of additional disk space will be used.
Selecting previously unselected package gcsfuse.
(Reading database ... 160690 files and directories currently installed.)
Preparing to unpack .../gcsfuse_0.35.0_amd64.deb ...
Unpacking gcsfuse (0.35.0) ...
Setting up gcsfuse (0.35.0) ...


In [5]:
!mkdir DataSet
!gcsfuse --implicit-dirs spade_dataset DataSet

2021/04/28 12:01:38.881113 Using mount point: /content/DataSet
2021/04/28 12:01:38.889555 Opening GCS connection...
2021/04/28 12:01:39.084889 Mounting file system "spade_dataset"...
2021/04/28 12:01:39.099679 File system has been successfully mounted.
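
Before reading from or writing to the bucket, it can be worth confirming that the mount is actually in place, since a failed mount leaves an ordinary empty directory behind. The following is a minimal sketch using only the standard library; `is_mounted` is a hypothetical helper, and the default path assumes the mount point shown in the log above:

```python
import os


def is_mounted(mount_point="/content/DataSet"):
    """Return True if mount_point exists and is the root of a mounted file system."""
    return os.path.isdir(mount_point) and os.path.ismount(mount_point)
```

If this returns `False`, rerunning the `gcsfuse` cell is probably the first thing to try.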


In [13]:
import DataProcessing as dp
importlib.reload(dp)

<module 'DataProcessing' from '/content/DataProcessing.py'>

In [14]:
config = yaml.load(Path("config.yml").read_text(), Loader=yaml.SafeLoader)

# Processing and writing Cityscapes images

In [15]:
bucket_name = config["bucket_name"] 
set_type = "val"
writer = dp.DataWriter(bucket_name, config, set_type)

In [16]:
writer.process_files()
writer.write_files()

Start writing files
Number of samples in dataset: 500
Finished writing files in:  143.57651257514954s


# Reading files

A single example from the processed Cityscapes dataset looks as follows:



```
{
  'label': <tf.Tensor: shape=(), dtype=string, numpy=b'frankfurt_000000_000294'>, 
  'subset': <tf.Tensor: shape=(), dtype=string, numpy=b'val'>, 
  'img_masked': <tf.Tensor: shape=(300, 300, 3), dtype=uint8, numpy=
        array([[[0, 0, 0],
              [0, 0, 0],
              [0, 0, 0],
                ...,
              [0, 0, 0],
              [0, 0, 0],
              [0, 0, 0]]], dtype=uint8)>, 
  'img_original': <tf.Tensor: shape=(300, 300, 3), dtype=uint8, numpy= 
        array([[[165, 171, 137],
              [135, 143, 113],
              [115, 123,  97],
                ...,
              [ 34,  38,  31],
              [ 27,  31,  29],
              [ 27,  32,  29]]], dtype=uint8)>
}

```



In [17]:
set_type = "val"
base_path = "DataSet/cityscape/processed_data"
reader = dp.DataReader(base_path, set_type)

In [18]:
reader.read_data_set()
data_set = reader.get_dataset()

In [29]:
for example in data_set.take(1):
  print(example.keys())
  print(example["label"])
  print(example["subset"])

  img_masked = example["img_masked"]
  img_original = example["img_original"]

  
  # Check images.
  img_masked = Image.fromarray(img_masked.numpy(), 'RGB')
  img_masked.save("img_masked.jpg")

  img_original = Image.fromarray(img_original.numpy(), 'RGB')
  img_original.save("img_original.jpg")


dict_keys(['label', 'subset', 'img_masked', 'img_original'])
tf.Tensor(b'frankfurt_000000_000294', shape=(), dtype=string)
tf.Tensor(b'val', shape=(), dtype=string)
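
The `label` tensor holds the Cityscapes file stem as a byte string. Cityscapes file names follow the pattern `<city>_<sequence>_<frame>`, so the stem can be split into its parts once decoded. The following is a minimal sketch, assuming every label has this three-part form; `parse_label` is a hypothetical helper, and for a tensor the bytes would come from `example["label"].numpy()`:

```python
def parse_label(label):
    """Split a Cityscapes file stem into (city, sequence number, frame number)."""
    if isinstance(label, bytes):
        label = label.decode("utf-8")
    # Split from the right so a city name containing underscores stays intact.
    city, sequence, frame = label.rsplit("_", 2)
    return city, int(sequence), int(frame)


print(parse_label(b"frankfurt_000000_000294"))  # ('frankfurt', 0, 294)
```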
