# **Federated EMNIST dataset TensorFlow Data loader with generator function**

Code corresponding to FEMNIST Data loader taken from: https://gist.github.com/negedng/c9573c55a4f4e50e5e93303917dd28ef#file-femnist_loader-py

---

This notebook covers loading the Federated EMNIST (FEMNIST) dataset through a TensorFlow Data loader built on a generator function, preprocessing it, and training different Keras models on the resulting data.

In [None]:
import os
import numpy as np
import tensorflow as tf

In [None]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense


# Images, plots, display, and visualization
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import cv2
import imutils
import IPython
from six.moves import urllib

## Download FEMNIST dataset
First of all, the FEMNIST dataset needs to be downloaded from [this link](https://github.com/TalwalkarLab/leaf.git), by means of the *git clone* command.

More details regarding the setup instructions can be found on that page.

---

Then, the json Python package is imported to be able to decode all the data from the FEMNIST dataset.

In [None]:
#Download FEMNIST dataset (not in Tensorflow)
!git clone https://github.com/TalwalkarLab/leaf.git
%cd /content/leaf/data/femnist/
!./preprocess.sh -s niid --sf 0.05 -k 0 -t sample

Streaming output truncated to the last 5000 lines.
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00065.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00130.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00151.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00111.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00057.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00089.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00008.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00105.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00066.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00073.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00002.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00125.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02_00024.png  
  inflating: by_write/hsf_1/f0654_02/c0654_02/c0654_02

In [None]:
FEMNIST_JSON_FOLDER = "data/all_data/"
import json

## Switch function implementation
Below is an implementation of a switch-like class, used as a context manager with *case* and *default* methods. It is used later on to choose among the available Keras models to be generated.

In [None]:
class switch:

	def __init__(self, variable, comparator=None, strict=False):
		self.variable = variable
		self.matched = False
		self.matching = False
		if comparator:
			self.comparator = comparator
		else:
			self.comparator = lambda x, y: x == y
		self.strict = strict

	def __enter__(self):
		return self

	def __exit__(self, exc_type, exc_val, exc_tb):
		pass

	def case(self, expr, break_=False):
		if self.strict:
			if self.matched:
				return False
		if self.matching or self.comparator(self.variable, expr):
			if not break_:
				self.matching = True
			else:
				self.matched = True
				self.matching = False
			return True
		else:
			return False

	def default(self):
		return not self.matched and not self.matching
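
The behaviour of *case* with and without the *break_* flag is easier to see on a small example. The sketch below repeats a condensed copy of the class only so that it runs on its own; the *describe* helper and the case values are hypothetical and are not part of the original notebook.

```python
class switch:
    """Condensed copy of the notebook's switch class, repeated so this cell is standalone."""

    def __init__(self, variable, comparator=None, strict=False):
        self.variable = variable
        self.matched = False
        self.matching = False
        self.comparator = comparator or (lambda x, y: x == y)
        self.strict = strict

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        pass

    def case(self, expr, break_=False):
        if self.strict and self.matched:
            return False
        if self.matching or self.comparator(self.variable, expr):
            if not break_:
                self.matching = True   # keep falling through to the next case
            else:
                self.matched = True    # stop falling through
                self.matching = False
            return True
        return False

    def default(self):
        return not self.matched and not self.matching


def describe(name):
    """Hypothetical helper: list which branches run for a given name."""
    hits = []
    with switch(name) as s:
        if s.case('a', True):
            hits.append('a')
        if s.case('b'):            # no break: execution falls through into 'c'
            hits.append('b')
        if s.case('c', True):
            hits.append('c')
        if s.default():
            hits.append('default')
    return hits


print(describe('a'), describe('b'), describe('z'))
```

Passing *True* as the second argument of *case* plays the role of a `break`, which is why *get_model* later uses it on every branch.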

## FEMNIST data loader
The following functions correspond to the FEMNIST dataset TensorFlow Data loader that has been taken from: https://gist.github.com/negedng/c9573c55a4f4e50e5e93303917dd28ef#file-femnist_loader-py

---
Basically, what they do is the following:
* The *femnist_generator* function collects all the json files from the FEMNIST dataset folder and, one by one, it stores all the samples' names for every user. Then, it shuffles all these names and calls the *decode_id* function.

* The *decode_id* function receives the FEMNIST data contained in a json file and the id of the user sample to decode in the current step. From them, it extracts the image and the label, reshapes the image to 28-by-28-by-1 and returns two constant tensors: the reshaped image and the label.


In [None]:
def decode_id(femnist_data, idx):
    sidx = idx.split('/')
    uid = sidx[0]
    sid = int(sidx[1])
    img = femnist_data['user_data'][uid]['x'][sid]
    img = np.resize(img, (28,28,1))
    label = femnist_data['user_data'][uid]['y'][sid]
    return tf.constant(img, shape=(28,28,1)), tf.constant(label, shape=(1))


def femnist_generator():
    for json_file in os.listdir(FEMNIST_JSON_FOLDER):
        femnist_data = []
        with open(FEMNIST_JSON_FOLDER+json_file) as f:
            femnist_data = json.load(f)
        all_sample_ids = []
        for uid in femnist_data['users']:
            n_samples = len(femnist_data['user_data'][uid]['y'])
            user_sample_ids = [uid+'/'+str(i) for i in list(range(n_samples))]
            all_sample_ids += user_sample_ids
        np.random.shuffle(all_sample_ids)
        
        for idx in all_sample_ids:
            yield decode_id(femnist_data, idx)
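
The id scheme used by the loader can be tried without the real dataset. The following is a minimal sketch that assumes each json file has the *users* / *user_data* / *x* / *y* layout read by the functions above and that every image is stored as a flat list of 784 values. The *toy_data* dictionary and the *decode_toy* helper are hypothetical stand-ins, and NumPy arrays are returned instead of tensors.

```python
import numpy as np

# Hypothetical miniature of one json file: two users, flattened 28x28 images.
toy_data = {
    'users': ['user_a', 'user_b'],
    'user_data': {
        'user_a': {'x': [[0.0] * 784, [1.0] * 784], 'y': [3, 10]},
        'user_b': {'x': [[0.5] * 784], 'y': [61]},
    },
}

# Build "uid/index" ids the same way femnist_generator does.
sample_ids = [
    uid + '/' + str(i)
    for uid in toy_data['users']
    for i in range(len(toy_data['user_data'][uid]['y']))
]


def decode_toy(data, idx):
    """Split the id, then look up the image and the label (NumPy only)."""
    uid, sid = idx.split('/')
    img = np.reshape(data['user_data'][uid]['x'][int(sid)], (28, 28, 1))
    label = data['user_data'][uid]['y'][int(sid)]
    return img, label


img, label = decode_toy(toy_data, 'user_a/1')
print(sample_ids, img.shape, label)
```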

## Number of samples per class
The following function *samples_per_class* aims to count the number of samples per class that each user provides to the loaded FEMNIST dataset.

To do that, it again collects all the json files from the FEMNIST dataset folder and, one by one, it checks the class assigned to each of the samples provided by every user.

With that, it increases the count of the corresponding class assigned to the current sample.

The final result is a count of the samples of each class provided by each user that has contributed to the FEMNIST dataset.

In [None]:
def samples_per_class():
  """Count, for every user, how many samples belong to each of the 62 classes."""
  #Initialize once so that the counts from all json files are kept
  counts = {}
  #Open json files one by one
  for json_file in os.listdir(FEMNIST_JSON_FOLDER):
    with open(FEMNIST_JSON_FOLDER+json_file) as f:
      femnist_data = json.load(f)

    #Analyze samples/class for each user
    for uid in femnist_data['users']:
      counts[uid] = {'class': [str(i) for i in range(62)],
                     'n_samples': [0 for _ in range(62)]}

      #Check class of each user's sample
      for lbl in femnist_data['user_data'][uid]['y']:
        counts[uid]['n_samples'][lbl] += 1
  return counts
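
The counting step itself can be checked on toy data. The sketch below uses *collections.Counter* as an alternative way to get the same per-user, per-class counts; the *toy_data* dictionary and the *count_labels* helper are hypothetical and only mimic the *users* / *user_data* / *y* layout read above.

```python
from collections import Counter

# Hypothetical stand-in for one json file (labels only).
toy_data = {
    'users': ['user_a', 'user_b'],
    'user_data': {
        'user_a': {'y': [3, 3, 10]},
        'user_b': {'y': [61]},
    },
}


def count_labels(data, num_classes=62):
    """Return {uid: [count of label 0, ..., count of label num_classes - 1]}."""
    counts = {}
    for uid in data['users']:
        per_label = Counter(data['user_data'][uid]['y'])
        counts[uid] = [per_label.get(lbl, 0) for lbl in range(num_classes)]
    return counts


counts = count_labels(toy_data)
print(counts['user_a'][3], counts['user_a'][10], counts['user_b'][61])
```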

## FEMNIST classes dictionary
The following function *classes()* has been implemented to create a dictionary of the classes that correspond to each of the labels in the FEMNIST dataset.
The detailed explanation of the labelling of each class can be found in [this link](https://github.com/TalwalkarLab/leaf/blob/master/data/femnist/preprocess/data_to_json.py).

In brief, FEMNIST is an image dataset with 62 different classes (10 digits, 26 lowercase, 26 uppercase).
The assigned labels to each class are mapped as follows:
* Decimal numbers from 0 to 9 are assigned to classes representing respective numbers
* Decimal numbers from 10 to 35 are assigned to classes representing respective uppercase letters
* Decimal numbers from 36 to 61 are assigned to classes representing respective lowercase letters.

Then, by means of the *classes* function, these decimal numbers are decoded back to their respective classes and put into a dictionary to store the correspondences between them.

In [None]:
def classes ():
  """This function creates a dictionary of the class
  that correspond to each label in the femnist dataset"""
  # keys = assigned labels
  # values = femnist classes

  classes = {}
  for i in range(62):
    classes[str(i)] = '' 

  for k in classes:
    if int(k)<10 : #digits
      classes[k] = chr(int(k)+48)
    elif int(k)>9 and int(k)<36 : #uppercase
      classes[k] = chr(int(k)+55)
    else: #lowercase
      classes[k] = chr(int(k)+61)
  return(classes)


femnist_classes = classes()
NUM_CLASSES = len(femnist_classes)
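
As a cross-check of the offsets used in *classes*, the same mapping can be written as a single lookup string. This is a sketch that assumes the label order described above (digits, then uppercase, then lowercase); *LABEL_CHARS*, *label_to_char* and *char_to_label* are hypothetical names.

```python
import string

# Assumed label order: 0-9 digits, 10-35 uppercase, 36-61 lowercase.
LABEL_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label):
    """Map an integer label to its character."""
    return LABEL_CHARS[label]


def char_to_label(char):
    """Map a character back to its integer label."""
    return LABEL_CHARS.index(char)


def offset_char(label):
    """Same mapping, written with the chr() offsets used in classes()."""
    if label < 10:
        return chr(label + 48)
    if label < 36:
        return chr(label + 55)
    return chr(label + 61)


print(label_to_char(10), label_to_char(36), char_to_label('7'))
```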

## Images preprocessing
The *resize_imgs* function uses a lambda transformation, applied with *Dataset.map*, to resize the images of the FEMNIST dataset so that their shape matches the input shape of the chosen model.

In [None]:
def resize_imgs(IMG_SIZE, ds):
  size = (IMG_SIZE, IMG_SIZE)
  #Resize images to input shape of used model
  ds = ds.map(lambda img, lbl:
                          (tf.image.resize(img, size, method='nearest'), lbl))
  return ds

## Proposed pipeline
The aim of the following code block is to check the correct operation of the functions implemented above.

In it, first of all, the FEMNIST dataset is loaded by means of the previously implemented dataset generator.

Then, the data is prepared for training with the different models.
First, it is shuffled with a buffer of 10000 elements from which random entries are drawn.
Then, with a batch size of 64, *Dataset.batch* stacks every 64 consecutive elements of the dataset into a single element.

Finally, it applies prefetching to overlap the preprocessing and model execution of training steps. In this way, while the model is executing training step n, the input pipeline will be reading the data for step n+1, thus reducing the step time of the training and the time to extract the data.

In [None]:
channels = 3

ds = tf.data.Dataset.from_generator(
    femnist_generator,
    output_types=(tf.float32, tf.int32),
    output_shapes=((28,28,1), (1))
)

In [None]:
ds = ds.shuffle(buffer_size=10000)
ds = ds.batch(batch_size=64)
ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
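
What the shuffle buffer and the batching do can be illustrated in plain Python. The sketch below is only a conceptual model of *Dataset.shuffle* and *Dataset.batch*, not TensorFlow's implementation; *buffered_shuffle* and *batch* are hypothetical helper names, and the small sizes are chosen for readability.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill phase: nothing is emitted yet
            continue
        pos = rng.randrange(buffer_size)
        yield buffer[pos]                # emit a random buffered element...
        buffer[pos] = item               # ...and replace it with the new one
    rng.shuffle(buffer)                  # input exhausted: drain the buffer
    yield from buffer


def batch(items, batch_size):
    """Group consecutive items into lists of batch_size (last one may be shorter)."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


shuffled = list(buffered_shuffle(range(10), buffer_size=4))
batches = list(batch(shuffled, batch_size=3))
print(batches)
```

In this model an element can only be emitted once it has entered the buffer, so a buffer that is small compared with the dataset gives only a local shuffle. This matters here because the generator reads one json file at a time.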

In [None]:
if channels == 3:
    ds = ds.map(lambda img, lbl:(tf.image.grayscale_to_rgb(img), lbl))

The next step is to resize the images of the dataset so that their shape matches the input shape of the models used. In this case, the input shape for all models is 224 by 224.

In [None]:
#Resize images to (224, 224)
IMG_SIZE = 224
ds_resized= resize_imgs(IMG_SIZE, ds)
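
To get a feeling for what these two steps do to a single image, here is a NumPy sketch. It assumes that converting grayscale to RGB copies the single channel three times, and that nearest-neighbour resizing by an integer factor (28 to 224 is a factor of 8) repeats each pixel along both axes. It is an illustration of the effect, not a replacement for the TensorFlow calls.

```python
import numpy as np

# One fake 28x28 grayscale image with distinct pixel values.
img = np.arange(28 * 28, dtype=np.float32).reshape(28, 28, 1)

# Grayscale -> RGB: three identical channels.
rgb = np.repeat(img, 3, axis=-1)

# Nearest-neighbour upsampling by an integer factor: repeat rows and columns.
factor = 224 // 28
big = np.repeat(np.repeat(rgb, factor, axis=0), factor, axis=1)

# How many times more values each image holds after both steps.
growth = big.size // img.size
print(big.shape, growth)
```

Each image therefore grows from 784 to 150528 values, which is one reason why the training steps on the resized dataset are slow.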

## Generate available models
The *get_model* function below builds the chosen model among the imported Keras models.

The choice is made with the previously defined switch class, which selects the model to build depending on the model name it receives.

The function returns the generated model. Because every model is built with *weights=None* and a custom number of classes, it is randomly initialized, not pre-trained, and has to be trained from scratch.

More information about the Keras Applications architectures (https://keras.io/api/applications/) used in this notebook can be found in the following links:
*   Resnet50: https://keras.io/api/applications/resnet/#resnet50-function
*   MobileNet: https://keras.io/api/applications/mobilenet/
*   DenseNet121: https://keras.io/api/applications/densenet/#densenet121-function
*   EfficientNetB0: https://keras.io/api/applications/efficientnet/#efficientnetb0-function

In [None]:
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input as resnet_preprocess_input, decode_predictions

from tensorflow.keras.applications.mobilenet import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input as mobilenet_preprocess_input, decode_predictions

from tensorflow.keras.applications.densenet import DenseNet121
from tensorflow.keras.applications.densenet import preprocess_input as densenet_preprocess_input, decode_predictions

from tensorflow.keras.applications.efficientnet import EfficientNetB0
from tensorflow.keras.applications.efficientnet import preprocess_input as efficientnet_preprocess_input, decode_predictions

#Function to get the desired model
def get_model(model_name, classes):
  '''This function builds the desired model (randomly initialized, weights=None)'''
  with switch(model_name) as m:
    if m.case('resnet', True): model=ResNet50(weights=None, classes=classes)
    if m.case('mobilenet', True): model=MobileNet(weights=None, classes=classes)
    if m.case('densenet', True): model=DenseNet121(weights=None, classes=classes)
    if m.case('efficientnet', True): model=EfficientNetB0(weights=None, classes=classes)
    # Fail early instead of returning an undefined variable
    if m.default(): raise ValueError('Unknown model name: ' + str(model_name))
  return model

## Compile and train the model
Once the dataset has been preprocessed, the model is built with the *get_model* function explained above.
It is then compiled and trained on the preprocessed FEMNIST dataset, which checks that the proposed pipeline and the implemented functions work end to end.

In [None]:
#Compile desired model
model = get_model('resnet', NUM_CLASSES)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

In [None]:
#Train model
model.fit(ds_resized)

     21/Unknown - 934s 44s/step - loss: 2.9557 - accuracy: 0.2537

KeyboardInterrupt: ignored
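
To put the logged loss and accuracy into context, they can be compared with what a model that guesses uniformly among the 62 classes would obtain. The sketch below computes that baseline with the standard library only. The *sparse_categorical_crossentropy* function here is a hypothetical single-sample re-implementation for illustration, not the Keras one.

```python
import math

NUM_CLASSES = 62

# Uniform guessing: every class gets probability 1/62.
chance_loss = math.log(NUM_CLASSES)
chance_accuracy = 1 / NUM_CLASSES


def sparse_categorical_crossentropy(probs, label):
    """Cross-entropy of one sample: minus the log-probability of the true class."""
    return -math.log(probs[label])


uniform = [1 / NUM_CLASSES] * NUM_CLASSES
print(round(chance_loss, 4), round(chance_accuracy, 4))
print(round(sparse_categorical_crossentropy(uniform, 5), 4))
```

Under this baseline the loss is about 4.13 and the accuracy about 0.016, so the values logged after 21 steps (loss 2.9557, accuracy 0.2537) suggest that the model had started to learn before the run was interrupted.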