# DEBIAI Python module tutorial

## 0. Introduction

#### This tutorial shows how to use the DEBIAI Python module to import data and model results into the app.
#### It goes from dataset creation to displaying results in DEBIAI. You can follow along with the same datasets as ours or use your own instead.

### Import

In [1]:
# System modules
import importlib
import os
import pathlib
import sys

# Tensorflow modules
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import layers

# Math modules
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import random
import pandas as pd
import scipy

# Image modules
import PIL
import PIL.Image

# DEBIAI module
# Until the package is available on pip, import it from the parent directory
sys.path.insert(1, os.path.join(sys.path[0], '..'))
from debiai import debiai

## 1. Dataset creation

#### First, we are going to load and reformat datasets to be used in this tutorial. You can skip this section if you already have some data to play with.
##### _See Tutorial_data.ipynb_ 

## 2. Models and results functions

#### Here are some functions that will be helpful during the example.

### Models creation functions

In [2]:
def create_model_from_dir(path, batch_size=32, nb_layers=3):
    """ 
    Create a CNN model from directories of images grouped by labels.
    Return the train and val dataset and the model.
    """
    data_dir = pathlib.Path(path)
    
    # Create a dataset
    img_height = 32
    img_width = 32

    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
        data_dir,
        validation_split=0.2,
        subset='training',
        seed=123,
        image_size=(img_height, img_width),
        batch_size=batch_size)

    val_ds = tf.keras.preprocessing.image_dataset_from_directory(
        data_dir,
        validation_split=0.2,
        subset='validation',
        seed=123,
        image_size=(img_height, img_width),
        batch_size=batch_size)
    
    model = create_cnn_model(batch_size=batch_size, nb_layers=nb_layers)
    
    return (train_ds, val_ds, model)
 
def create_cnn_model(batch_size=32, nb_layers=3):
    """ Return a CNN model for 32x32x3 input images.
        nb_layers sets the number of Conv2D + MaxPooling2D layer pairs.
    """
    # Create model
    num_classes = 10
    
    model = tf.keras.Sequential()
    
    
    model.add(layers.experimental.preprocessing.Rescaling(1./255))
    
    for i in range(nb_layers):
        model.add(layers.Conv2D(32, 3, activation='relu'))
        model.add(layers.MaxPooling2D())
        
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dense(num_classes))

    # Compile model functions
    model.compile(
      optimizer='adam',
      loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
      metrics=['accuracy'])

    model.build((batch_size,32,32,3))
    model.summary()
    
    return model

def visualize_dataset(dataset):
    """ Display a set of 9 images from the dataset """
    # Visualize data
    plt.figure(figsize=(10,10))
    class_name = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    for images, labels in dataset.take(1):
        for i in range(9):
            ax = plt.subplot(3, 3, i + 1)
            plt.imshow(images[i].numpy().astype("uint8"))
            plt.title(class_name[labels[i]])
            plt.axis('off')

### Predictions functions

In [3]:
def get_samples_from_dataset(iterator, nb_batch):
    """ Get (input, label) samples from the first nb_batch batches of the dataset """
    l = []
    
    for i in iterator:
        if nb_batch == 0:
            break
        nb_batch -= 1
        
        for j in range(len(i[1])):  # the last batch may hold fewer than 32 samples
            row = []
        
            row.append(i[0][j])
            row.append(i[1][j])
        
            l.append(row)
        
    return np.asarray(l)

from scipy.special import softmax

def predict_input(sample, model):
    """ Predict one input - used in predict_from_pd()"""
    reshape_sample = sample.reshape(1,32,32,3)
                
    # Add predictions to result
    pred = model.predict(reshape_sample, batch_size = 1)
    
    sft = softmax(pred)
    percent = (str(round(np.max(sft) * 100, 2)))
    return (str(np.argmax(pred)), percent)

def predict_from_pd(df, model):
    """ Predict result from a dataframe of inputs """
    new_df = pd.DataFrame()
    new_df["hash"] = df["hash"]
    new_df["results"]= df.apply(lambda x: predict_input(x['inputs'], model), axis=1)
    new_df[['results', 'percents']] = pd.DataFrame(new_df['results'].tolist(), index=df.index)
    return new_df
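
To make the confidence computation in `predict_input` more concrete, here is a minimal sketch written in plain Python instead of `scipy.special.softmax`. The helper names `softmax` and `top_prediction` and the sample logits are assumptions made for this illustration; they are not part of the DEBIAI module.

```python
import math


def softmax(logits):
    """Turn raw model outputs (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits):
    """Return (predicted class as str, confidence in percent as str)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return str(best), str(round(probs[best] * 100, 2))


# Hypothetical logits for a 3-class problem, chosen only for illustration.
label, percent = top_prediction([0.5, 3.0, 1.0])
print(label, percent)
```

The class with the largest logit is also the class with the largest probability, which is why `predict_input` can take the `argmax` of the raw predictions and only needs the softmax for the percentage.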

## 3. Debiai modular project

#### We assume that the datasets have already been created before starting this section

#### To start with, let's introduce the context of our example: 
* We need to create a basic AI capable of recognizing digits. To do so, we start by training one model on the MNIST dataset

In [4]:
# We get the mnist dataset, validation set and model from this function
(mnist_ds, mnist_val, mnist_model_1) = create_model_from_dir("data/MNIST_reformat/", nb_layers=1)

# We get the iterator for later use
mnist_iter = mnist_val.as_numpy_iterator()

Found 70000 files belonging to 10 classes.
Using 56000 files for training.
Found 70000 files belonging to 10 classes.
Using 14000 files for validation.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
rescaling (Rescaling)        (32, 32, 32, 3)           0         
_________________________________________________________________
conv2d (Conv2D)              (32, 30, 30, 32)          896       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (32, 15, 15, 32)          0         
_________________________________________________________________
flatten (Flatten)            (32, 7200)                0         
_________________________________________________________________
dense (Dense)                (32, 128)                 921728    
_________________________________________________________________
dense_1 (Dense)              (32, 10
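
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes the Keras defaults used in `create_cnn_model` (3x3 kernels, "valid" padding, stride 1, 2x2 pooling); the helper names are ours and exist only for this illustration.

```python
def conv2d_output(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def maxpool_output(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


def conv2d_params(in_channels, filters, kernel=3):
    """Kernel weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """One weight per input per unit, plus one bias per unit."""
    return inputs * units + units


def summarize(nb_layers, size=32, channels=3, filters=32):
    """List (layer, size, parameter count) up to the first Dense layer."""
    rows = []
    for _ in range(nb_layers):
        size = conv2d_output(size)
        rows.append(("conv2d", size, conv2d_params(channels, filters)))
        channels = filters
        size = maxpool_output(size)
        rows.append(("max_pooling2d", size, 0))
    flat = size * size * channels
    rows.append(("dense", flat, dense_params(flat, 128)))
    return rows


for row in summarize(nb_layers=1):
    print(row)
```

Calling `summarize(nb_layers=3)` gives the 30, 15, 13, 6, 4, 2 progression of spatial sizes that appears in the summaries of the deeper models later in this tutorial.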

In [5]:
# We are going to take 100 batches of 32 samples for our first visualization.
mnist_val_data = get_samples_from_dataset(mnist_iter, 100)
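
As a quick sanity check on the numbers involved, the sketch below recomputes the split sizes reported when the dataset was loaded and the number of samples kept here. The helper names are hypothetical, and the truncation with `int()` is an assumption about how the split is rounded.

```python
import math


def split_counts(total_files, validation_split):
    """Return (training, validation) file counts for a fractional split."""
    n_val = int(total_files * validation_split)
    return total_files - n_val, n_val


def batches_available(n_samples, batch_size):
    """Number of batches, counting a final partial batch."""
    return math.ceil(n_samples / batch_size)


train, val = split_counts(70000, 0.2)
print(train, val)                   # file counts for the MNIST split
print(batches_available(val, 32))   # validation batches that can be drawn
print(100 * 32)                     # samples kept by taking 100 full batches
```

The validation set size is not a multiple of 32, so the last batch is smaller than the others. Taking only the first 100 batches avoids that partial batch.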

In [6]:
# Let's create a dataframe for the samples
columns = ["inputs","GT"]
data_mnist = pd.DataFrame(mnist_val_data, columns=columns)

#### Now that we have data, we can add it to debiai by creating a new project

In [7]:
# Let's create a debiai project
importlib.reload(debiai)

# We need to specify where our debiai backend is running (this is the default address)
my_debiai = debiai.Debiai("http://localhost:3000/")

# To be safe, we delete any existing project with the same name before creating a new one.
project = my_debiai.delete_project_byId("Digit-Recognition")
project = my_debiai.create_project("Digit-Recognition")

#### In debiai, data is stored according to the blockstructure of the project, which must be specified before any data is imported. The blockstructure works like a tree: the first blocks are the roots (contexts) and the last block holds the leaves (samples). 
#### Here we only have plain samples without context, so a single block representing the leaves is enough.

In [8]:
# Next, we create a block structure to design the architecture of the DEBIAI project
# We only have the GroundTruth label for now but let's put it for the example

first_block_struct = [{
        
        # Block Samples
        "name":"image",
        "groundTruth": [
            {
                "name":"GT",
                "type":"number"
            }
        ],
        "contexts": [],
        "others": [],
        "inputs" : [],
    }
]

project.set_blockstructure(first_block_struct)


True

#### To add samples to a project, the dataframe must contain every block name and every block attribute. In addition, so that each sample can be seen separately, the last block (the sample block) needs a unique name per row, which serves as the sample ID.

#### In our example, this means that each row should have at least a unique "image" name and a "GT" number. Here, we use map_id to map the index of the dataframe to the required "image" block name, although a file name would make a better ID.
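
Before pushing samples, it can help to check that nothing required by the blockstructure is missing. The sketch below is a hypothetical pre-flight check, not a function of the DEBIAI module: it assumes that each block contributes its own name plus the attributes listed under "contexts", "groundTruth", "inputs" and "others", and that the block named by `map_id` is filled from the dataframe index.

```python
def required_columns(block_structure, map_id=None):
    """List the column names a block structure asks for."""
    columns = []
    for block in block_structure:
        # The block mapped to the index does not need its own column.
        if block["name"] != map_id:
            columns.append(block["name"])
        for category in ("contexts", "groundTruth", "inputs", "others"):
            for attribute in block.get(category, []):
                columns.append(attribute["name"])
    return columns


def missing_columns(available, block_structure, map_id=None):
    """Return the required columns that are absent from `available`."""
    return [c for c in required_columns(block_structure, map_id)
            if c not in available]


block_structure = [{
    "name": "image",
    "groundTruth": [{"name": "GT", "type": "number"}],
    "contexts": [],
    "others": [],
    "inputs": [],
}]

# Columns of the tutorial dataframe, with and without mapping the index.
print(missing_columns(["inputs", "GT"], block_structure, map_id="image"))
print(missing_columns(["inputs", "GT"], block_structure))
```

With a pandas dataframe, the first argument would simply be `list(df.columns)`.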

In [9]:
# We want to load our data samples into debiai.
# The map_id parameter allows data.index to be used as the sample ID instead of an "image" column.
# If you don't want to use map_id, your dataframe needs an "image" column with unique values (such as file names)

project.add_samples_pd(data_mnist, map_id="image")

True

#### Another mandatory element of every debiai project is the Expected_Result structure. It defines the format of the results of every model. There is only one per project, so for now all models share the same result types in debiai.

In [10]:
# Let's add our first model !
debiai_model_1 = project.create_model("Model 1")

# We are going to push results, so we need an expected_results structure.
result_struct = [
    {
        "name":"results",
        "type":"number"
    },
    {
        "name":"percents",
        "type":"number"
    }
]

project.set_expected_results(result_struct)

[{'name': 'results', 'type': 'number'},
 {'name': 'percents', 'type': 'number'}]
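
To illustrate what the expected results structure implies for each row of results, here is a small hypothetical checker. It is not part of the DEBIAI module, and it assumes that numeric strings (as returned by `predict_input`) count as numbers.

```python
def check_results_row(row, expected_results):
    """Return a list of problems found in one results row (empty if none)."""
    problems = []
    for expected in expected_results:
        name = expected["name"]
        if name not in row:
            problems.append(f"missing result '{name}'")
        elif expected["type"] == "number":
            try:
                float(row[name])
            except (TypeError, ValueError):
                problems.append(f"result '{name}' is not a number")
    return problems


expected_results = [
    {"name": "results", "type": "number"},
    {"name": "percents", "type": "number"},
]

# A well-formed row, then a row with a bad value and a missing result.
print(check_results_row({"results": "5", "percents": "99.62"}, expected_results))
print(check_results_row({"results": "five"}, expected_results))
```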

#### Next, we train our model and add results to a newly created debiai model

In [11]:
# Let's train our first model
mnist_model_1.fit(mnist_ds, validation_data=mnist_val, epochs=1)



<tensorflow.python.keras.callbacks.History at 0x7f2fe1adbc10>

In [12]:
# We are now going to predict results to put into debiai
df_results_1 = predict_from_pd(data_mnist, mnist_model_1)
df_results_1.head()

| | hash | results | percents |
|---|---|---|---|
| 0 | 1ff70ec879acb37790026a019419e2151dc7821d2a62f5... | 5 | 99.62 |
| 1 | 102d4b5f5e701a1519ea10d0036ad97fc1a7c6c2be51e2... | 1 | 99.92 |
| 2 | 0f0f302e98b22e49eeb144646a5927b142bf3af1d9949a... | 4 | 89.53 |
| 3 | cae0f1a5bb0384316b769377b7b1fc8fa97f5315853114... | 0 | 94.78 |
| 4 | 9db1a02677fc56e855484dde31d4dc7a2ca3a34752b0a8... | 2 | 100.0 |


In [13]:
# We can now add results to our first model in debiai. Results are stored per model.
# Again, map_id maps data.index to the given block name
debiai_model_1.add_results_df(df_results_1, map_id="image")

#### Now that we have a model and its results in our project, we can create another model and compare the two in debiai.

In [14]:
# We want to try another model with more Conv2D layers
mnist_model_2 = create_cnn_model(nb_layers=3)

# Add a second model to the block
debiai_model_2 = project.create_model("Model 2")

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
rescaling_1 (Rescaling)      (32, 32, 32, 3)           0         
_________________________________________________________________
conv2d_1 (Conv2D)            (32, 30, 30, 32)          896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (32, 15, 15, 32)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (32, 13, 13, 32)          9248      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (32, 6, 6, 32)            0         
_________________________________________________________________
conv2d_3 (Conv2D)            (32, 4, 4, 32)            9248      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (32, 2, 2, 32)           

In [15]:
# Train this model
mnist_model_2.fit(mnist_ds, validation_data=mnist_val, epochs=1)



<tensorflow.python.keras.callbacks.History at 0x7f2fd0637070>

In [16]:
# We can now predict and add new results to this model in debiai. We use the same samples as for the first model so the two can be compared
df_results_2 = predict_from_pd(data_mnist, mnist_model_2)
df_results_2.head()

# Add new results
debiai_model_2.add_results_df(df_results_2, map_id="image")

##### You can now compare the results of our two models on the same samples in Debiai

##### Let's add color to our data so that we can recognize digits with different noise and tones.

#### For now, changing the blockstructure of a project to add a new "dataset" block requires creating a new project

In [17]:
my_debiai.delete_project_byId("Digit-Recognition2")
full_project = my_debiai.create_project("Digit-Recognition2")

In [18]:
# We need to set a new more useful blockstructure for our new datasets

second_block_struct = [
    {
        # Dataset Block
        "name":"dataset",
        "contexts": [
            {
                "name":"colored",
                "type":"boolean"
            },
            {
                "name":"noised",
                "type":"boolean"
            }
        ]
    },
    {
        # Block Samples
        "name":"image",
        "groundTruth": [
            {
                "name":"GT",
                "type":"number"
            }
        ],
        "contexts": [],
        "others": [],
        "inputs" : [],
    }
]

full_project.set_blockstructure(second_block_struct)

# The results stay the same so let's add them too
full_project.set_expected_results(result_struct)

[{'name': 'results', 'type': 'number'},
 {'name': 'percents', 'type': 'number'}]

#### Because we added new contexts, we need to add them to the dataframe too.

In [19]:
data_mnist['dataset'] = 'mnist'
data_mnist['colored'] = False
data_mnist['noised'] = False

#### Now we can add the MNIST_M dataset

In [20]:
# We create the dataset and model for MNIST_M
(mnistm_ds, mnistm_val, mnistm_model_1) = create_model_from_dir("data/MNIST_M/train", nb_layers=3)

# We get the iterator for later use
mnistm_iter = mnistm_val.as_numpy_iterator()

Found 67085 files belonging to 10 classes.
Using 53668 files for training.
Found 67085 files belonging to 10 classes.
Using 13417 files for validation.
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
rescaling_2 (Rescaling)      (32, 32, 32, 3)           0         
_________________________________________________________________
conv2d_4 (Conv2D)            (32, 30, 30, 32)          896       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (32, 15, 15, 32)          0         
_________________________________________________________________
conv2d_5 (Conv2D)            (32, 13, 13, 32)          9248      
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (32, 6, 6, 32)            0         
_________________________________________________________________
conv2d_6 (Conv2D)            (32, 

In [21]:
# Again, we are going to take 100 batches of 32 samples
mnistm_val_data = get_samples_from_dataset(mnistm_iter, 100)

# Let's create a dataframe for the samples
data_mnistm = pd.DataFrame(mnistm_val_data, columns=columns)

# Add specific context to this dataset
data_mnistm['dataset'] = 'mnistm'
data_mnistm['colored'] = True
data_mnistm['noised'] = False

#### Our two datasets are now ready to be used. Notice that every attribute of the blockstructure is present except "image", which is mapped to the dataframe index.

In [22]:
data_mnistm.head()

| | inputs | GT | dataset | colored | noised |
|---|---|---|---|---|---|
| 0 | [[[55.0, 61.0, 27.0], [57.0, 63.0, 29.0], [58.... | 3 | mnistm | True | False |
| 1 | [[[204.0, 180.0, 108.0], [203.0, 179.0, 107.0]... | 1 | mnistm | True | False |
| 2 | [[[195.0, 115.0, 80.0], [191.0, 111.0, 76.0], ... | 2 | mnistm | True | False |
| 3 | [[[127.0, 127.0, 93.0], [102.0, 102.0, 64.0], ... | 7 | mnistm | True | False |
| 4 | [[[74.0, 82.0, 41.0], [35.0, 47.0, 1.0], [23.0... | 6 | mnistm | True | False |


##### The hash you can see here comes from the path in the previous project, so it will not work here.

In [23]:
data_mnist.head()

| | inputs | GT | hash | dataset | colored | noised |
|---|---|---|---|---|---|---|
| 0 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 5 | 1ff70ec879acb37790026a019419e2151dc7821d2a62f5... | mnist | False | False |
| 1 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 1 | 102d4b5f5e701a1519ea10d0036ad97fc1a7c6c2be51e2... | mnist | False | False |
| 2 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 4 | 0f0f302e98b22e49eeb144646a5927b142bf3af1d9949a... | mnist | False | False |
| 3 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 0 | cae0f1a5bb0384316b769377b7b1fc8fa97f5315853114... | mnist | False | False |
| 4 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 2 | 9db1a02677fc56e855484dde31d4dc7a2ca3a34752b0a8... | mnist | False | False |


In [24]:
# We can now push both dataframes to the debiai project
full_project.add_samples_pd(data_mnist, map_id="image")
full_project.add_samples_pd(data_mnistm, map_id="image")

True

#### Now that add_samples_pd has been called, a new hash can be seen; it is linked to the new path in the new blockstructure

In [33]:
data_mnist.head()

| | inputs | GT | hash | dataset | colored | noised |
|---|---|---|---|---|---|---|
| 0 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 5 | 4d5556a51281631e752fb2ee1dbb54a0814e327bd41c6a... | mnist | False | False |
| 1 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 1 | c9117565e04307d797ce53292c3f64fc544a4700285be6... | mnist | False | False |
| 2 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 4 | 6651aa5c08d18026d4cc1c810d0d290de40508b007dc6e... | mnist | False | False |
| 3 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 0 | 70421938da0287194cdb0a9160a6a2677f92d8a87837b8... | mnist | False | False |
| 4 | [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0,... | 2 | 85ead609f2bf88ece418c632f16c58937bb9edcba8126c... | mnist | False | False |
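
To see why the hash changes when the blockstructure changes, consider the toy sketch below. DEBIAI's real hashing scheme is not described in this tutorial, so the SHA-256 digest over a slash-joined path is purely an assumption, and `path_hash` is a hypothetical helper.

```python
import hashlib


def path_hash(path_parts):
    """Hash the sequence of block names leading to a sample."""
    path = "/".join(str(part) for part in path_parts)
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


# Same sample index, two different block structures.
old_hash = path_hash([0])            # only the "image" block
new_hash = path_hash(["mnist", 0])   # "dataset" block, then "image" block

print(old_hash != new_hash)
print(len(new_hash))
```

Whatever the actual scheme is, the idea is the same: a hash that depends on the path through the blocks identifies a sample only within one blockstructure.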


#### Now that the samples are imported into debiai, we just have to create a debiai model for each of our models and import their prediction results.
#### This phase is very repetitive because each new results dataframe needs all the attributes of the blockstructure so that it can be linked to the correct samples. For that reason, we could wrap it all in a function instead of writing it multiple times for different models. 
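
One possible shape for such a function is sketched here. The helper name `with_context` and the stand-in dataframe are assumptions made for this illustration; only `DataFrame.assign` comes from pandas itself.

```python
import pandas as pd


def with_context(df_results, **context):
    """Return a copy of `df_results` with one constant column per context."""
    return df_results.assign(**context)


# Stand-in for the output of predict_from_pd(); the values are made up.
df_results = pd.DataFrame({"results": ["3", "1"], "percents": ["97.1", "88.4"]})

df_tagged = with_context(df_results, dataset="mnistm", colored=True, noised=False)
print(list(df_tagged.columns))
```

Because `assign` returns a new dataframe, the original results stay untouched, and the tagged copy can be passed straight to `add_results_df(..., map_id="image")`.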

In [26]:
# Now we can add models using the two dataframes to see results.
# Let's start by adding the previous model with new predictions for the new dataset
df_res_mnistm1 = predict_from_pd(data_mnistm, mnist_model_1)

# We need to add the other columns to specify which samples we are referring to
df_res_mnistm1['dataset'] = 'mnistm'
df_res_mnistm1['colored'] = True
df_res_mnistm1['noised'] = False
df_results_1['dataset'] = 'mnist'
df_results_1['colored'] = False
df_results_1['noised'] = False

debiai_model_1 = full_project.create_model("Model 1")

# Push the results into debiai
debiai_model_1.add_results_df(df_results_1, map_id="image")
debiai_model_1.add_results_df(df_res_mnistm1, map_id="image")

In [27]:
# We can also add the second model already created
df_res_mnistm2 = predict_from_pd(data_mnistm, mnist_model_2)

# We need to add the other columns to specify which samples we are referring to
df_res_mnistm2['dataset'] = 'mnistm'
df_res_mnistm2['colored'] = True
df_res_mnistm2['noised'] = False
df_results_2['dataset'] = 'mnist'
df_results_2['colored'] = False
df_results_2['noised'] = False

debiai_model_2 = full_project.create_model("Model 2")

# Push the results into debiai
debiai_model_2.add_results_df(df_results_2, map_id="image")
debiai_model_2.add_results_df(df_res_mnistm2, map_id="image")

In [28]:
# We can now train the new model on MNIST_M
mnistm_model_1.fit(mnistm_ds, validation_data=mnistm_val, epochs=1)



<tensorflow.python.keras.callbacks.History at 0x7f2fb871ccd0>

#### Another way to push results is to link them with the hash, so that we do not need to provide all the columns such as dataset, colored, etc.
#### Here, the predict_from_pd() function also copies the hash of each sample when predicting results; we can use this value to push results into our models.

In [29]:
# Now we can add it to debiai like the others
df_res_mnistm3 = predict_from_pd(data_mnistm, mnistm_model_1)
df_res_mnist3 = predict_from_pd(data_mnist, mnistm_model_1)

debiai_model_3 = full_project.create_model("Model 3")

# Push the results into debiai
debiai_model_3.add_results_df(df_res_mnist3)
debiai_model_3.add_results_df(df_res_mnistm3)

{}

In [30]:
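# Last, we can add a model trained on both datasets
# Merge both datasets
full_dataset = mnist_ds.concatenate(mnistm_ds)
# shuffle() returns a new dataset, so the result must be reassigned.
# A buffer of 1 would leave the order unchanged; this buffer holds every batch
# of both datasets (about 3,400 here). Reduce it if memory is limited.
full_dataset = full_dataset.shuffle(buffer_size=4096)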
full_val = mnist_val.concatenate(mnistm_val)

full_model = create_cnn_model()

full_model.fit(full_dataset,validation_data=full_val,epochs=1)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
rescaling_3 (Rescaling)      (32, 32, 32, 3)           0         
_________________________________________________________________
conv2d_7 (Conv2D)            (32, 30, 30, 32)          896       
_________________________________________________________________
max_pooling2d_7 (MaxPooling2 (32, 15, 15, 32)          0         
_________________________________________________________________
conv2d_8 (Conv2D)            (32, 13, 13, 32)          9248      
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (32, 6, 6, 32)            0         
_________________________________________________________________
conv2d_9 (Conv2D)            (32, 4, 4, 32)            9248      
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (32, 2, 2, 32)           

<tensorflow.python.keras.callbacks.History at 0x7f2fe02060d0>
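
Because `concatenate` places every batch of the second dataset after the first, the size of the shuffle buffer decides how well the two get mixed. The pure-Python sketch below is a simplified model of a fixed-size shuffle buffer, written only to illustrate that point; it is not the TensorFlow implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == buffer_size:
            # Emit one randomly chosen element once the buffer is full.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain whatever is left once the input is exhausted.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


# "a" batches stand for the first dataset, "b" batches for the second.
batches = ["a0", "a1", "a2", "b0", "b1", "b2"]

print(list(buffered_shuffle(batches, buffer_size=1)))
print(list(buffered_shuffle(batches, buffer_size=6)))
```

With a buffer of 1, each element is emitted as soon as it arrives, so the order never changes. A buffer as large as the whole input gives a full shuffle, at the cost of holding every element in memory.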

In [31]:
# Now we can add it to debiai like the others
df_res_mnistm4= predict_from_pd(data_mnistm, full_model)
df_res_mnist4 = predict_from_pd(data_mnist, full_model)

debiai_model_4 = full_project.create_model("Model 4")

# Push the results into debiai
debiai_model_4.add_results_df(df_res_mnist4)
debiai_model_4.add_results_df(df_res_mnistm4)

{}

#### If you don't want to use a dataframe to push results, you can also use a dictionary.  
#### This can be a dictionary representing the tree of data using _add_results_dict()_ : 
```{mnist: {sample_id: [results], ...}, mnistm: {sample_id: [results], ...}, ...}```
#### or a dictionary mapping hashes to results directly using _add_results_hash()_:
```{hash: [results], ...}```
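
The two dictionary shapes can be built from flat rows as sketched below. The helper names, the rows and the placeholder hashes are made up for this illustration; the exact arguments expected by _add_results_dict()_ and _add_results_hash()_ are not covered in this tutorial, so the calls themselves are left out.

```python
def results_tree(rows, results_names):
    """Group rows as {dataset: {sample_id: [results]}}."""
    tree = {}
    for row in rows:
        values = [row[name] for name in results_names]
        tree.setdefault(row["dataset"], {})[row["image"]] = values
    return tree


def results_by_hash(rows, results_names):
    """Map each sample hash directly to its list of results."""
    return {row["hash"]: [row[name] for name in results_names] for row in rows}


# Made-up rows; the hashes are placeholders, not real DEBIAI hashes.
rows = [
    {"dataset": "mnist", "image": 0, "hash": "h0", "results": 5, "percents": 99.62},
    {"dataset": "mnistm", "image": 0, "hash": "h1", "results": 3, "percents": 71.4},
]

print(results_tree(rows, ["results", "percents"]))
print(results_by_hash(rows, ["results", "percents"]))
```

The tree form repeats the path through the blocks for every sample, while the hash form relies on the hash alone to find the sample.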

#### Last, the remove_expected_result function deletes an expected result and erases every result linked to that attribute. Use it carefully!

In [32]:
# You can also remove an expected result; this deletes all data linked to this result for every model.
full_project.remove_expected_result("percents")

[{'name': 'results', 'type': 'number'}]

## 4. Conclusion

#### You have created a debiai project with 4 models to compare and 2 datasets with different contexts.

#### As you may have noticed, we have not used the "noised" context yet; it is included here only as an example.
#### That's all folks! We hope you have fun using Debiai!

## 5. TODO

* Create a hash-based architecture so that users no longer have to worry about sample IDs: **DONE**
* Create a better way to load results without having to specify all the columns (probably using blocks as Python objects)
* Create a function to add an expected_result with a default value for already inserted data.
* Add a new dataset with noised samples