# Getting started with DEBIAI

Part 1
- Data importing from a CSV file
- Creation of a DEBIAI project
- Insertion of the data into the project
- Statistical analysis

Part 2
- Simple model training
- Insertion of the results of two models into DEBIAI
- Statistical model comparison
- Creation of a new data selection

Part 3
- Training of two new models
- Results comparison
- Conclusion

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

from debiai import debiai



### Download the CSV file containing a simple wine quality dataset

Modeling wine preferences by data mining from physicochemical properties.

https://archive.ics.uci.edu/ml/datasets/Wine+Quality

In [2]:
# The URL points to the white-wine file, so the cached copy is named accordingly
csv_file = tf.keras.utils.get_file(
    'winequality-white.csv',
    'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv')

Read the CSV file using pandas. The file is semicolon-separated, hence the `delimiter=';'` argument.

In [3]:
df = pd.read_csv(csv_file, delimiter=';')
df

index,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7


## Insert the data into DEBIAI for a first statistical analysis

In [4]:
# Creation of the DEBIAI wine quality project block structure
DEBIAI_block_structure = [
    {
        "name": "sampleId",
        "inputs": [
            { "name": "fixed acidity",           "type": "number"},
            { "name": "volatile acidity",        "type": "number"},
            { "name": "citric acid",             "type": "number"},
            { "name": "residual sugar",          "type": "number"},
            { "name": "chlorides",               "type": "number"},
            { "name": "free sulfur dioxide",     "type": "number"},
            { "name": "total sulfur dioxide",    "type": "number"},
            { "name": "density",                 "type": "number"},
            { "name": "pH",                      "type": "number"},
            { "name": "sulphates",               "type": "number"},
            { "name": "alcohol",                 "type": "number"},
        ],
        "groundTruth": [
            { "name": "quality",                 "type": "number"},
        ]
    }
]
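Before the structure is sent to DEBIAI, it can be worth checking that every column it declares actually exists in the dataframe. The sketch below is only an illustration: `declared_columns` and `missing_columns` are hypothetical helpers, not part of the `debiai` package, and the example structure is a shortened copy of the one above.

```python
def declared_columns(block_structure):
    """List every column name declared in a block structure."""
    names = []
    for block in block_structure:
        names.append(block["name"])  # the block identifier, e.g. "sampleId"
        for category, columns in block.items():
            if category == "name":
                continue
            # Every other key ("inputs", "groundTruth", ...) holds a list of columns
            names.extend(column["name"] for column in columns)
    return names


def missing_columns(block_structure, dataframe_columns):
    """Return the declared columns that are absent from the dataframe."""
    available = set(dataframe_columns)
    return [name for name in declared_columns(block_structure) if name not in available]


# Shortened structure with the same shape as DEBIAI_block_structure
example_structure = [
    {
        "name": "sampleId",
        "inputs": [
            {"name": "pH", "type": "number"},
            {"name": "alcohol", "type": "number"},
        ],
        "groundTruth": [
            {"name": "quality", "type": "number"},
        ],
    }
]

print(missing_columns(example_structure, ["sampleId", "pH", "alcohol", "quality"]))
print(missing_columns(example_structure, ["pH", "alcohol", "quality"]))
```

In the notebook itself, the same check would be `missing_columns(DEBIAI_block_structure, df.columns)`, run once the `sampleId` column has been added.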

In [5]:
# Add a unique-value column to the dataframe, used as the sample identifier
df.insert(0, "sampleId", range(len(df.index)), True)
df.dtypes

sampleId                  int64
fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object
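The identifier column is meant to hold one distinct value per row. A minimal sketch of how that could be verified with pandas, using a tiny illustrative frame rather than the wine data:

```python
import pandas as pd

# Small stand-in for the wine dataframe (values are illustrative)
frame = pd.DataFrame({"pH": [3.00, 3.30, 3.26], "quality": [6, 6, 6]})
frame.insert(0, "sampleId", range(len(frame.index)))

# True when no identifier is repeated
ids_unique = frame["sampleId"].is_unique

# Stacking two frames that were numbered independently produces repeated ids
stacked = pd.concat([frame, frame], ignore_index=True)
stacked_ids_unique = stacked["sampleId"].is_unique

print(ids_unique, stacked_ids_unique)
```

Numbering the rows with `range(len(df.index))` gives unique ids by construction; the check becomes useful once several sources are combined.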

### Insert the dataframe into DEBIAI

In [6]:
DEBIAI_URL = 'http://localhost:3000/'
DEBIAI_PROJECT_NAME = 'winequality demo'
my_debiai = debiai.Debiai(DEBIAI_URL)

In [7]:
# Create or recreate the project
debiai_project = my_debiai.get_project(DEBIAI_PROJECT_NAME)

if debiai_project:
    # Delete the project if it already exists
    my_debiai.delete_project_byId(DEBIAI_PROJECT_NAME)

debiai_project = my_debiai.create_project(DEBIAI_PROJECT_NAME)
debiai_project.set_blockstructure(DEBIAI_block_structure)

# Add the dataframe
print("Adding the dataframe ~ sec")
debiai_project.add_samples_pd(df, get_hash=False)

Adding the dataframe ~ sec


True

The input data and the project are now ready to be analyzed in the dashboard.

## Statistical analysis
<img src="statAns.png" height="500">

# Model training
## Load data using `tf.data.Dataset`

In [8]:
trainingDf = df.copy()
trainingDf.pop('sampleId')
target = trainingDf.pop('quality')

In [9]:
dataset = tf.data.Dataset.from_tensor_slices((trainingDf.to_numpy(), target.values))


Shuffle and batch the dataset.

In [10]:
train_dataset = dataset.shuffle(len(trainingDf)).batch(1)
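With `batch(1)`, each training step processes a single sample, so an epoch takes as many steps as there are rows. The following sketch shows how the step count changes with the batch size; `steps_per_epoch` is an illustrative helper and 2700 is just an example sample count. It assumes the last partial batch is kept, which is the default behaviour of `tf.data.Dataset.batch`.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)


print(steps_per_epoch(2700, 1))   # one step per sample
print(steps_per_epoch(2700, 32))  # far fewer, larger steps
```

A larger batch size usually shortens each epoch considerably, at the cost of fewer weight updates per epoch.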

## Create and train two models

In [11]:
def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])
  return model
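The loss is `SparseCategoricalCrossentropy(from_logits=True)`, so each integer quality score is used directly as a class index, and the final `Dense(10)` layer provides one logit per index from 0 to 9. As a rough NumPy sketch of what this loss computes for each sample (the function name is illustrative, this is not the Keras implementation, and it assumes every label lies between 0 and 9):

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Per-sample negative log-probability of the true class."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to each sample's true class
    return -log_probs[np.arange(len(labels)), labels]


# Untrained-like case: all 10 logits equal, so every class has probability 1/10
uniform_logits = np.zeros((2, 10))
quality_labels = np.array([6, 9])
losses = sparse_categorical_crossentropy_from_logits(uniform_logits, quality_labels)
print(losses)  # each value is close to ln(10)
```

A label of 10 or more would fall outside the 10 logits, which is why the output layer size has to exceed the largest quality score present in the data.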

In [12]:
model1 = get_compiled_model()
model1.fit(train_dataset, epochs=2)

Epoch 1/2


2021-09-01 17:19:42.531830: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-09-01 17:19:42.559350: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2699905000 Hz


Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f829217a940>

In [13]:
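# Note: this model is trained for a single epoch, although it is registered
# further down under the name "Model 4e"
model2 = get_compiled_model()
model2.fit(train_dataset, epochs=1)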



<tensorflow.python.keras.callbacks.History at 0x7f8285ca6eb0>

In [14]:
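from scipy.special import softmax


def predict_from_pd(trainingDf, model):
    """Return one row per sample: sampleId, predicted class and confidence (%)."""
    logits = model.predict(trainingDf.to_numpy())
    # Turn each row of logits into class probabilities
    probabilities = softmax(logits, axis=1)

    # sampleId is the row position, which matches the ids inserted into DEBIAI
    # only while trainingDf keeps its original row order.
    # All values are stored as strings.
    return pd.DataFrame({
        "sampleId": [str(i) for i in range(len(logits))],
        "prediction": [str(c) for c in np.argmax(logits, axis=1)],
        "percent": [str(round(p * 100, 2)) for p in probabilities.max(axis=1)],
    })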
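The `prediction` column is the index of the largest logit, and `percent` is the softmax probability of that class expressed as a percentage. A minimal NumPy-only sketch of that computation for a single sample (`class_and_confidence` is an illustrative name, not a function used elsewhere in this notebook):

```python
import numpy as np


def class_and_confidence(logits):
    """Return (predicted class index, softmax confidence in percent)."""
    logits = np.asarray(logits, dtype=np.float64)
    # Softmax with the usual max-shift for numerical stability
    shifted = logits - logits.max()
    probabilities = np.exp(shifted) / np.exp(shifted).sum()
    best = int(np.argmax(probabilities))
    return best, round(float(probabilities[best]) * 100, 2)


# Three classes whose unnormalized weights are 1, 1 and 2
predicted_class, confidence = class_and_confidence([0.0, 0.0, np.log(2.0)])
print(predicted_class, confidence)
```

A confidence close to `100 / number_of_classes` means the model barely prefers its predicted class over the others.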


In [15]:
results1 = predict_from_pd(trainingDf, model1)
results1

index,sampleId,prediction,percent
0,0,5,42.71
1,1,5,42.07
2,2,6,42.16
3,3,6,47.41
4,4,6,47.41
...,...,...,...
4893,4893,6,43.54
4894,4894,6,40.26
4895,4895,6,46.38
4896,4896,6,47.19


In [16]:
results2 = predict_from_pd(trainingDf, model2)
results2

index,sampleId,prediction,percent
0,0,5,68.85
1,1,5,48.29
2,2,5,31.46
3,3,5,66.77
4,4,5,66.77
...,...,...,...
4893,4893,6,30.23
4894,4894,5,61.78
4895,4895,5,35.39
4896,4896,5,34.63
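The result frames can also be compared with the ground truth directly in pandas. The sketch below uses small illustrative frames; because the result columns hold strings, they are cast to numeric types before being joined on `sampleId`.

```python
import pandas as pd

# Illustrative ground truth and model results (not the real notebook values)
truth = pd.DataFrame({"sampleId": [0, 1, 2, 3], "quality": [6, 6, 5, 7]})
results = pd.DataFrame({
    "sampleId": ["0", "1", "2", "3"],
    "prediction": ["5", "6", "5", "6"],
    "percent": ["42.71", "42.07", "42.16", "47.41"],
})

# Cast the string columns so that ids and classes compare as numbers
typed = results.astype({"sampleId": "int64", "prediction": "int64", "percent": "float64"})

# Join on the sample identifier; validate guards against duplicated ids
merged = truth.merge(typed, on="sampleId", how="inner", validate="one_to_one")
accuracy = (merged["prediction"] == merged["quality"]).mean()
print(accuracy)
```

In the notebook, `truth` would be `df[["sampleId", "quality"]]` and `results` one of the frames returned by `predict_from_pd`.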


## Insert the model results into DEBIAI for a statistical analysis of the results

In [17]:
# debiai_project.delete_model("Model 2e")
# debiai_project.delete_model("Model 4e")

In [18]:
# Creating the two DEBIAI models
DEBIAI_model_name1 = "Model 2e"
DEBIAI_model_name2 = "Model 4e"
debiai_model1 = debiai_project.create_model(DEBIAI_model_name1)
debiai_model2 = debiai_project.create_model(DEBIAI_model_name2)

# Set the DEBIAI expected_results structure.
DEBIAI_result_struct = [
    { "name": "prediction", "type": "number" },
    { "name": "percent",    "type": "number" }
]

debiai_project.set_expected_results(DEBIAI_result_struct)

# Add the model results
debiai_model1.add_results_df(results1)
debiai_model2.add_results_df(results2)



{}

The model results should now appear on the dashboard.

## Model performance analysis

<img src="resAns.png" height="500">

  
# DEBIAI dataset generation

Use the dashboard to generate a smaller, less biased dataset based on the errors of the previous models.

<img src="newDataset.png" height="200">

In [19]:
debiai_project = my_debiai.get_project(DEBIAI_PROJECT_NAME)
debiai_project.get_selections()

[]

In [22]:
selection = debiai_project.get_selection('less biased')
selection

DEBIAI selection : 'less biased'
creation date : '2021-09-01 17:23:13'
number of samples  : '2700'

In [23]:
# Loading the selection as a dataframe
selection_df = selection.get_dataframe()
print(selection_df)
print(selection_df.dtypes)

selection_df.pop('sampleId')
target = selection_df.pop('quality')
dataset2 = tf.data.Dataset.from_tensor_slices((selection_df.to_numpy(), target.values))
train_dataset2 = dataset2.shuffle(len(selection_df)).batch(1)

      sampleId  fixed acidity  volatile acidity  citric acid  residual sugar  \
0          762            6.8              0.24         0.49           19.30   
1          607            7.3              0.25         0.29            7.50   
2         4645            5.0              0.24         0.34            1.10   
3         1179            7.2              0.20         0.25            4.50   
4         3528            6.9              0.75         0.13            6.30   
...        ...            ...               ...          ...             ...   
2695      4710            5.4              0.33         0.31            4.00   
2696       283            6.7              0.34         0.30           15.60   
2697      2178            7.6              0.32         0.58           16.75   
2698      4096            8.0              0.25         0.35            1.10   
2699      2813            5.8              0.32         0.31            2.70   

      chlorides  free sulfur dioxide  t
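One way to judge whether a selection is less biased is to compare its class distribution with that of the full dataset. The sketch below uses short illustrative series in place of `df['quality']` and the selection's `quality` column; the values are made up for the example.

```python
import pandas as pd

# Illustrative quality scores: the full data is dominated by class 6
full_quality = pd.Series([5, 6, 6, 6, 6, 6, 7, 7], name="quality")
# Hypothetical selection with a flatter class distribution
selected_quality = pd.Series([5, 6, 6, 7], name="quality")

# Share of each quality score in both sets, aligned on the score
comparison = pd.DataFrame({
    "full": full_quality.value_counts(normalize=True),
    "selection": selected_quality.value_counts(normalize=True),
}).fillna(0.0).sort_index()

print(comparison)
```

A selection whose shares are closer to each other than in the full dataset gives the model fewer incentives to always predict the majority class.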

In [24]:
model3 = get_compiled_model()
model3.fit(train_dataset2, epochs=2)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f8285c982e0>

In [25]:
model4 = get_compiled_model()
model4.fit(train_dataset2, epochs=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7f8278bc1e80>

In [26]:
results3 = predict_from_pd(trainingDf, model3)
results3

index,sampleId,prediction,percent
0,0,5,70.25
1,1,5,56.78
2,2,7,60.73
3,3,5,67.82
4,4,5,67.82
...,...,...,...
4893,4893,7,66.23
4894,4894,7,53.88
4895,4895,7,55.82
4896,4896,7,57.96


In [27]:
results4 = predict_from_pd(trainingDf, model4)
results4

index,sampleId,prediction,percent
0,0,5,67.34
1,1,5,50.2
2,2,7,41.7
3,3,5,63.06
4,4,5,63.06
...,...,...,...
4893,4893,7,53.32
4894,4894,5,43.79
4895,4895,7,39.49
4896,4896,7,53.52
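These predictions cover the whole dataset, including the samples of the selection used for training. A hedged sketch of how accuracy could be reported separately for samples seen and not seen during training, using illustrative values and a hypothetical list of selection ids:

```python
import pandas as pd

# Illustrative evaluation frame: ground truth and predictions per sample
evaluated = pd.DataFrame({
    "sampleId": [0, 1, 2, 3, 4, 5],
    "quality": [6, 6, 5, 7, 6, 5],
    "prediction": [6, 6, 4, 7, 7, 5],
})
# Hypothetical ids of the samples that belong to the training selection
selection_ids = [1, 3, 5]

in_selection = evaluated["sampleId"].isin(selection_ids)
correct = evaluated["prediction"] == evaluated["quality"]

seen_accuracy = correct[in_selection].mean()      # samples used for training
unseen_accuracy = correct[~in_selection].mean()   # samples left out of training
print(seen_accuracy, unseen_accuracy)
```

A large gap between the two figures would suggest that the model memorizes the selection rather than generalizing from it.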


In [28]:
# Creating the two DEBIAI models
DEBIAI_model_name3 = "Model LB 2e"
DEBIAI_model_name4 = "Model LB 4e"
debiai_model3 = debiai_project.create_model(DEBIAI_model_name3)
debiai_model4 = debiai_project.create_model(DEBIAI_model_name4)

# Add the model results
debiai_model3.add_results_df(results3)
debiai_model4.add_results_df(results4)



{}

The new model results should now appear on the dashboard.

## Second model performance analysis

<img src="resAns2.png" height="500">


# Training on a dataset directly from the DEBIAI selection

In [29]:
train_dataset_imported = selection.get_tf_dataset()

In [30]:
train_dataset_imported = train_dataset_imported.shuffle(selection.nbSamples).batch(1)

model5 = get_compiled_model()
model5.fit(train_dataset_imported, epochs=15)

Epoch 1/15
0/2700
1000/2700
      1/Unknown - 8s 8s/step - loss: 3.9518 - accuracy: 0.0000e+00
2000/2700
Epoch 2/15
0/2700
1000/2700
   1/2700 [..............................] - ETA: 4:22:29 - loss: 1.1415 - accuracy: 0.0000e+00
2000/2700
Epoch 3/15
0/2700
1000/2700
   1/2700 [..............................] - ETA: 4:28:44 - loss: 0.5108 - accuracy: 1.0000
2000/2700
Epoch 4/15
0/2700
1000/2700
   1/2700 [..............................] - ETA: 4:02:55 - loss: 0.2750 - accuracy: 1.0000
2000/2700
Epoch 5/15
0/2700
1000/2700
   1/2700 [..............................] - ETA: 3:54:50 - loss: 0.4753 - accuracy: 1.0000
2000/2700
Epoch 6/15
0/2700
1000/2700
   1/2700 [..............................] - ETA: 3:58:35 - loss: 0.3969 - accuracy: 1.0000
2000/2700
Epoch 7/15
0/2700
1000/2700
   1/2700 [..............................] - ETA: 4:10:41 - loss: 0.2046 - accuracy: 1.0000
2000/2700
Epoch 8/15
0/2700
1000/2700
   1/2700 [..............................] - ETA: 3:56:42 - loss: 2.5688 - accuracy: 0.000

<tensorflow.python.keras.callbacks.History at 0x7f8278e1e910>

In [31]:
results5 = predict_from_pd(trainingDf, model5)
results5

index,sampleId,prediction,percent
0,0,5,81.54
1,1,5,70.24
2,2,7,49.05
3,3,5,81.36
4,4,5,81.36
...,...,...,...
4893,4893,7,70.45
4894,4894,5,73.3
4895,4895,7,44.38
4896,4896,7,76.55


In [32]:
# Creating the last DEBIAI model
DEBIAI_model_name5 = "Model LB 15e"
debiai_model5 = debiai_project.create_model(DEBIAI_model_name5)

# Add the model results
debiai_model5.add_results_df(results5)



{}