# DIVALGO demonstration

This notebook contains a demonstration of the DIVALGO tool. First, we train a basic logistic regression model with no hyperparameter tuning, since the focus is on demonstrating the divalgo tool and the methods of its Evaluate class.

In [1]:
import divalgo_class as div
import os
import sklearn.linear_model as lm
import numpy as np
from PIL import Image

#### Train model for example case

In [2]:
# Load data
dogs = sorted(os.listdir(os.path.join("..", "data", "dogs")))
wolves =  sorted(os.listdir(os.path.join("..", "data", "wolves")))


Load the images one at a time, convert them to numerical arrays, and append them to a list.

In [3]:
# Preprocessing
img_size = 50
dogs_images = []
wolves_images = [] 

for i in dogs:
    path = os.path.join("..", "data", "dogs", i)
    if os.path.isfile(path):
        img = Image.open(path).convert('L')  # convert to grayscale
        # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
        img = img.resize((img_size, img_size), Image.LANCZOS)
        img = np.asarray(img) / 255.0  # scale pixel values to [0, 1]
        dogs_images.append(img)

for i in wolves:
    path = os.path.join("..", "data", "wolves", i)
    if os.path.isfile(path):
        img = Image.open(path).convert('L')  # convert to grayscale
        img = img.resize((img_size, img_size), Image.LANCZOS)
        img = np.asarray(img) / 255.0  # scale pixel values to [0, 1]
        wolves_images.append(img)
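
The per-image steps above can be wrapped in a small helper. The sketch below is an illustration only: `to_gray_array` is a hypothetical name, not part of divalgo, and it is run on a synthetic in-memory image so that it works without the dataset. It uses `Image.LANCZOS`, the filter that `Image.ANTIALIAS` was an alias for until the alias was removed in Pillow 10.

```python
import numpy as np
from PIL import Image

def to_gray_array(img, size=50):
    """Convert a PIL image to a (size, size) grayscale array scaled to [0, 1]."""
    gray = img.convert("L")                          # 8-bit grayscale
    gray = gray.resize((size, size), Image.LANCZOS)  # downsample with the Lanczos filter
    return np.asarray(gray) / 255.0                  # map 0..255 to 0..1

# Synthetic stand-in for a photo (hypothetical data, not from the dataset)
demo = Image.new("RGB", (120, 80), color=(200, 120, 40))
arr = to_gray_array(demo)
print(arr.shape, arr.min(), arr.max())
```

For a file on disk, the same helper could be called as `to_gray_array(Image.open(path))`.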

Manually split the train-test set for showcase purposes. In an actual ML pipeline, this step would need careful consideration to ensure balanced classes. 

In [4]:
# Manual train-test split (to track filenames)
X_train = np.asarray(dogs_images[0:800] + wolves_images[0:800])
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1]*X_train.shape[2])
X_test = np.asarray(dogs_images[800:1000] + wolves_images[800:1000])
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1]*X_test.shape[2])
y_train = np.asarray(["dog" for y in range(800)] + ["wolf" for y in range(800)])
y_train = y_train.reshape(y_train.shape[0],1)
y_test_ar = np.asarray(["dog" for y in range(200)] + ["wolf" for y in range(200)])
y_test = y_test_ar.reshape(y_test_ar.shape[0],1)

y_train, y_test = [k.T for k in [y_train, y_test]]
filenames_test = [os.path.join("..", "data", "dogs", d) for d in dogs[800:1000]] + [os.path.join("..", "data", "wolves", w) for w in wolves[800:1000]]
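
As an alternative to slicing by hand, scikit-learn's `train_test_split` can keep classes balanced and keep filenames aligned with the data. The sketch below is not what this notebook does; it uses small hypothetical stand-in arrays (`X`, `y`, `names`) to show the idea.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 "dog" and 10 "wolf" feature rows
rng = np.random.default_rng(0)
X = rng.random((20, 4))
y = np.array(["dog"] * 10 + ["wolf"] * 10)
names = [f"img_{k}.jpg" for k in range(20)]  # placeholder filenames

# Passing the filenames as a third array keeps them aligned with the rows of X;
# stratify=y preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te, names_tr, names_te = train_test_split(
    X, y, names, test_size=0.2, stratify=y, random_state=0
)
print(len(names_te), (y_te == "dog").sum(), (y_te == "wolf").sum())
```

Because the split is shuffled, the test filenames are no longer a contiguous slice, which is why they are passed through the same call.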

In [5]:
# Train model
# penalty=None requires scikit-learn >= 1.2; older versions use penalty='none'
model = lm.LogisticRegression(penalty=None, tol=0.1, max_iter=500).fit(X_train, y_train[0])

#### Showing tool
Now we demonstrate the divalgo tool by creating an instance of the Evaluate class and exploring its methods.
The class takes the required tuple of (test set data, true labels of the test data, list of test image filenames) and a fitted sklearn model.

In [6]:
# Instantiating class
                        #data   #true labels   #filenames   #trained model
dog_wolf = div.Evaluate((X_test, y_test[0], filenames_test), model)

The chunk below sets up the Jupyter notebook for rendering Plotly figures. It only needs to be run once per kernel session, as it sets the default Plotly renderer for that session.

In [2]:
import plotly.io as pio
pio.renderers.default = "notebook_connected"

The chunk below runs the two accuracy charts shown in the dashboard. The first is overall accuracy, and the second is split by type. In the dashboard, users can easily shift between these with a checkbox

In [9]:
# Accuracy charts - overall and by type
dog_wolf.accuracy()
dog_wolf.accuracy_type()
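
To make the two quantities concrete, here is a minimal sketch of overall and per-class accuracy. It assumes that "by type" means accuracy within each true class; how divalgo computes its charts is not shown in this notebook, and the labels below are hypothetical.

```python
import numpy as np

# Hypothetical labels and predictions for illustration
y_true = np.array(["dog", "dog", "dog", "wolf", "wolf", "wolf"])
y_pred = np.array(["dog", "dog", "wolf", "wolf", "wolf", "dog"])

# Fraction of all images classified correctly
overall = float(np.mean(y_true == y_pred))

# Accuracy within each true class (equivalent to per-class recall)
per_class = {
    str(label): float(np.mean(y_pred[y_true == label] == label))
    for label in np.unique(y_true)
}
print(overall, per_class)
```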

The chunk below runs the two metric tables shown in the dashboard. The first includes a column with equations for the metrics; the second excludes it. In the dashboard, users can easily shift between these with a checkbox.

In [10]:
# Get table with performance metrics - with and without the column showing formulas
dog_wolf.get_metrics(equations=True)
dog_wolf.get_metrics(equations=False) # This is also the default
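
For reference, the standard formulas behind common binary classification metrics can be written out directly. Which metrics `get_metrics` actually reports is not shown here, so the selection below (accuracy, precision, recall, F1) is an assumption, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (positive class chosen by the caller)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts with "wolf" as the positive class
metrics = binary_metrics(tp=150, fp=40, fn=50, tn=160)
print(metrics)
```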

The chunk below runs the confusion matrix shown in the dashboard. Here, it is positioned next to the accuracy chart

In [11]:
dog_wolf.confusion()
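
The counts behind a confusion matrix can also be obtained directly from scikit-learn. This is a sketch on hypothetical labels, not the code divalgo uses for its plot.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for illustration
y_true = np.array(["dog", "dog", "dog", "wolf", "wolf", "wolf"])
y_pred = np.array(["dog", "dog", "wolf", "wolf", "wolf", "dog"])

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=["dog", "wolf"])
print(cm)
```

With the notebook's objects, the equivalent call would be `confusion_matrix(y_test[0], model.predict(X_test), labels=["dog", "wolf"])`.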

The chunk below runs the AUC-ROC curve shown in the dashboard. This is shown on the same page as accuracy charts and metric table

In [12]:
# Show AUC-ROC curve
dog_wolf.plot_roc_curve()
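
The ROC curve and its AUC can be reproduced with scikit-learn from predicted probabilities. The sketch below uses hypothetical scores and treats "wolf" as the positive class; whether divalgo makes the same choice is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and predicted probabilities of the positive class ("wolf")
y_true = np.array(["dog", "dog", "wolf", "wolf"])
p_wolf = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold and returns false/true positive rates
fpr, tpr, thresholds = roc_curve(y_true, p_wolf, pos_label="wolf")
roc_auc = auc(fpr, tpr)  # area under the curve via the trapezoidal rule
print(round(roc_auc, 2))
```

For the model trained above, the scores would come from `model.predict_proba(X_test)[:, 1]`, since scikit-learn sorts `model.classes_` and "wolf" is therefore the second column.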


The chunk below runs the heatmaps shown in the dashboard. They can be plotted with either the raw or the absolute coefficient values. In the dashboard, both are shown at the same time, as they complement each other.

In [13]:
# Plot coefficient heatmaps - as absolutes or not
dog_wolf.plot_coefs(absolute=False) # Default
dog_wolf.plot_coefs(absolute=True)
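
Since each image was flattened to one row of `img_size * img_size` pixels, a coefficient heatmap amounts to reshaping the model's coefficient vector back into image shape. The sketch below uses random stand-in coefficients; how `plot_coefs` is implemented is not shown here, so this is an assumption about the general approach.

```python
import numpy as np

img_size = 50
# Stand-in for model.coef_ of a binary LogisticRegression: shape (1, img_size * img_size)
rng = np.random.default_rng(0)
coef = rng.normal(size=(1, img_size * img_size))

coef_map = coef.reshape(img_size, img_size)  # undo the row-major flattening of each image
abs_map = np.abs(coef_map)                   # magnitude only, ignoring sign
print(coef_map.shape, bool((abs_map >= 0).all()))
```

In a binary scikit-learn logistic regression, positive coefficients push the prediction towards the second class in `model.classes_` (here "wolf"), so the signed map shows direction while the absolute map shows only how strongly a pixel matters.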

NB: the following chunk takes a long time to run. This is because UMAP embeddings are being created for all images. We therefore recommend you view the embedding plot in the streamlit dashboard. Here, it is found in the model workings page. This page also takes a long time to run when it is first opened, but after the embeddings are created, it runs faster. 

In [None]:
# dog_wolf.explore_embeddings()

In [7]:
dog_wolf.open_visualization()

*IN CASE OF ERROR*

If the chunk above finishes successfully but does not open the dashboard, it may be because you have not used Streamlit recently and it is prompting you for an email address. This can be solved by running the following commands in a terminal:

cd divalgo

streamlit run frontpage.py

When the interface opens in your browser, close it again. It can now be opened with the proper settings by running dog_wolf.open_visualization()
