## Data processing and checking the dataset
Assume you are going to build an animal classification CNN.  
To prepare the dataset, arrange your final data folder like this:
```
.
└── Dataset/
    ├── Dog/
    │   ├── dogImg1.jpg
    │   ├── dogImg2.jpg
    │   └── ...
    ├── Cat/
    │   ├── catImg1.jpg
    │   ├── catImg2.jpg
    │   └── ...
    ├── Bird/
    │   ├── birdImg1.jpg
    │   ├── birdImg2.jpg
    │   └── ...
    ├── Hamster/
    │   ├── hamsterImg1.jpg
    │   ├── hamsterImg2.jpg
    │   └── ...
    └── ...
```

Each folder should contain images of a single class. Try to keep the number of images in every class the same, or at least close.
```
** Good example **
Dog (120 images)  
Cat (131 images)
Bird (125 images)
Hamster (129 images)
...

** Bad example **
Dog (120 images)  
Cat (40 images)
Bird (300 images)
Hamster (200 images)
...
```
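
One rough way to tell which case a dataset falls into is to compare the largest class with the smallest one. The sketch below uses the counts from the two examples above. The function name `imbalance_ratio` and the cut-off `MAX_RATIO` are hypothetical, chosen only for illustration, and the cut-off should be tuned for your own project.

```python
def imbalance_ratio(class_counts):
    """Return the largest class count divided by the smallest class count."""
    counts = list(class_counts.values())
    if not counts or min(counts) <= 0:
        raise ValueError("every class needs at least one image")
    return max(counts) / min(counts)


# Counts taken from the good and bad examples above.
good = {"Dog": 120, "Cat": 131, "Bird": 125, "Hamster": 129}
bad = {"Dog": 120, "Cat": 40, "Bird": 300, "Hamster": 200}

# Hypothetical cut-off, for illustration only.
MAX_RATIO = 1.5

for name, counts in (("good", good), ("bad", bad)):
    ratio = imbalance_ratio(counts)
    verdict = "balanced enough" if ratio <= MAX_RATIO else "imbalanced"
    print(f"{name}: ratio {ratio:.2f} -> {verdict}")
```

A ratio close to 1 means the classes are about the same size. The larger the ratio, the more the biggest class dominates the smallest.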

If you find that your data is highly imbalanced between classes (the bad example), you can either reduce the number of images in the larger classes, or scale every class up to the same number of images, as in the Python code below.

In [None]:
# Handle imbalanced classes. If your classes are already balanced, skip this section,
# or read on to see how such a situation can be handled.
import os
import shutil
import numpy as np

def genReport(path):
    """
    Print and return summary statistics of the item count in each class folder
    """
    totalcount = 0
    tinydict = {}

    for folder in os.listdir(path):

        insidePath = os.path.join(path, folder)
        # Only sub-folders are classes; skip stray files in the dataset root
        if not os.path.isdir(insidePath):
            continue
        tinydict[folder] = len( os.listdir(insidePath) )
        totalcount += tinydict[folder]

    numOnly = np.array( list(tinydict.values()) )

    avgVal = int(np.average(numOnly))
    medianVal = int(np.median(numOnly))
    # (folder name, count) pairs of the largest and the smallest class
    maxVal = max(tinydict.items(), key = lambda k : k[1])
    minVal = min(tinydict.items(), key = lambda k : k[1])

    print("-------------------------------")
    print("Generated report for" , path)
    print("Total folder:" , len(numOnly))
    print("Total items:" , totalcount)
    print("----------------")
    print("Avg:" , avgVal )
    print("Median:" , medianVal )
    print("Max:" , maxVal )
    print("Min:" , minVal )
    print("-------------------------------")

    return { "minVal":minVal[1], "maxVal":maxVal[1], "avgVal" : avgVal, "medianVal": medianVal }

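def scaleByMax(path, result, offset = 0):

    """
    Oversample each class folder by copying randomly chosen images
    until it holds result["maxVal"] + offset items
    """

    for folder in os.listdir(path):
        insidePath = os.path.join(path, folder)
        # Only sub-folders are classes; skip stray files in the dataset root
        if not os.path.isdir(insidePath):
            continue
        insidePathArr = os.listdir(insidePath)
        # An empty folder has nothing to copy from
        if not insidePathArr:
            continue

        copyVal = result["maxVal"] - len(insidePathArr) + offset

        print("---------------")
        print("current:", len(insidePathArr))
        print("copy count:" , copyVal)
        print("---------------")

        for k in range(copyVal):
            # Pick a random existing image and duplicate it under a new name
            srcName = np.random.choice( insidePathArr )
            shutil.copy( os.path.join(insidePath, srcName) , os.path.join(insidePath, "newImg" + str(k) + srcName ))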

def scaleDataSmart(dataPath, offset):
    """
    Scale every class folder up to the size of the largest class plus offset,
    printing a report before and after
    """

    scaleByMax( dataPath, genReport(dataPath) , offset )
    print("After Report:")
    genReport(dataPath)

# The folder structure is the same as the one shown at the top
if __name__ == "__main__":
    scaleDataSmart("yourDataFolderPath", 50)
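
The other option mentioned earlier, reducing the larger classes, could be sketched as follows. This is only a minimal sketch, not part of the code above: the names `listFiles`, `downsampleClass` and `downsampleDataset` are hypothetical, and so is the overflow folder. Surplus images are moved there rather than deleted, so the step can be undone. The overflow folder is assumed to live outside the dataset folder, otherwise it would be counted as a class.

```python
import os
import random
import shutil


def listFiles(folder):
    """Return the sorted names of the regular files directly inside folder."""
    return sorted(
        f for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f))
    )


def downsampleClass(classDir, targetCount, overflowDir, seed=0):
    """Move randomly chosen surplus files out of classDir until targetCount remain.

    Returns the number of files moved.
    """
    files = listFiles(classDir)
    surplus = len(files) - targetCount
    if surplus <= 0:
        return 0
    os.makedirs(overflowDir, exist_ok=True)
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    for name in rng.sample(files, surplus):
        shutil.move(os.path.join(classDir, name), os.path.join(overflowDir, name))
    return surplus


def downsampleDataset(path, overflowRoot, seed=0):
    """Reduce every class folder under path to the size of the smallest class.

    Returns a dict mapping each class folder name to the number of files moved.
    Note: an empty class folder makes the target 0, which empties every class.
    """
    folders = sorted(
        f for f in os.listdir(path) if os.path.isdir(os.path.join(path, f))
    )
    if not folders:
        return {}
    target = min(len(listFiles(os.path.join(path, f))) for f in folders)
    return {
        f: downsampleClass(
            os.path.join(path, f), target, os.path.join(overflowRoot, f), seed
        )
        for f in folders
    }
```

A call such as `downsampleDataset("yourDataFolderPath", "yourOverflowFolderPath")` would leave every class with as many images as the smallest one. The trade-off against copying images is that no duplicates are introduced, but part of the data is no longer used for training.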

## Q&A

Q.   
Does the image resolution matter for training?  
A.     
Usually the input images are resized from their original size to a relatively small size (e.g. 224 x 224, 300 x 300, 480 x 480 ...).   
The exact size depends on the requirements of the model, so the original image resolution does not matter much to the final result unless the images are really small (like 50 x 50). In general, anything bigger than the model's input size is fine.
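
To act on the answer above, one could flag the images that are smaller than the model's input size before training. The sketch below is only an illustration: the function name `findSmallImages`, the default target of 224 and the sample sizes are all hypothetical.

```python
def findSmallImages(sizes, target=224):
    """Return the names of images whose width or height is below target pixels.

    sizes maps a file name to its (width, height) in pixels.
    """
    return sorted(
        name for name, (width, height) in sizes.items()
        if width < target or height < target
    )


# Hypothetical image sizes, for illustration only.
sizes = {
    "dogImg1.jpg": (1920, 1080),
    "dogImg2.jpg": (50, 50),
    "catImg1.jpg": (640, 200),
}

print(findSmallImages(sizes, target=224))
```

With Pillow installed, the real sizes could be collected with `Image.open(path).size`, which returns a `(width, height)` tuple.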


#### TensorFlow Keras provides a handy utility for creating a `tf.data.Dataset`
https://www.tensorflow.org/api_docs/python/tf/keras/utils/image_dataset_from_directory  

There is no need to run the code below. It is just an example of how to use this function to import data from our devices.


In [None]:
import tensorflow as tf
train_data_dir = "your training dataset"
batch_size = 258
imgSize = 300
dataSeed = 1337

# Class labels are inferred from the sub-folder names of train_data_dir
train_ds = tf.keras.utils.image_dataset_from_directory(
  train_data_dir, seed=dataSeed, batch_size=batch_size,
  image_size=(imgSize, imgSize), color_mode='rgb'
)