# SOM Visualization


## Description

This notebook visualizes the contents of the hierarchical data (`*.h5`) files produced by [`Sompy_experimentation.ipynb`](Sompy_experimentation.ipynb). It currently displays:

 - A set of linear-gradient heatmaps
 - A set of log-gradient heatmaps
 - An interactive interface for browsing through what items are in each cluster
 - A "U-matrix" that highlights regions where the Euclidean distance between neighboring map nodes changes sharply, which helps judge how well the cluster boundaries match the structure of the map
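
The U-matrix idea can be illustrated with a small NumPy sketch. This is not the code behind `UMatrixTFP` or sompy's own U-matrix, which may use a different neighborhood and scaling. The function `u_matrix`, the toy codebook, and the 4-connected neighborhood are assumptions made for illustration only.

```python
import numpy as np

def u_matrix(codebook: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Mean Euclidean distance from each node to its 4-connected neighbors."""
    # Assumption: nodes are stored row by row, one weight vector per node
    grid = codebook.reshape(rows, cols, -1)
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(np.linalg.norm(grid[r, c] - grid[nr, nc]))
            umat[r, c] = np.mean(dists)
    return umat

# Toy 2x3 map with two features: the two left columns hold similar vectors,
# the right column holds very different ones.
toy_codebook = np.array([
    [0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
    [0.0, 0.1], [0.1, 0.1], [5.0, 5.1],
])
toy_umat = u_matrix(toy_codebook, rows=2, cols=3)
print(np.round(toy_umat, 2))
```

Nodes next to the jump between the middle and right columns get larger values than nodes in the homogeneous left region. A cluster boundary that follows such a ridge is well supported by the map.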
 
 
 ## Workflow
 
 
 1. Enter the codebook file you want to use in [the codebook selection cell](#codebook_selection_cell), set `KM_CLUSTERS` to the number of clusters you consider appropriate, and then rerun the notebook.
 1. Inspect the heatmaps in [the heatmap view cells](#heatmap_view_cell).
 1. Search the clusters for materials of interest in [the cluster inspector cell](#cluster_inspector_cell).
 1. Evaluate how well the clusters fit in [the U-matrix view cell](#umatrix_view_cell).
 
 
 ## Notes
 
  - The trained SOM holds one array for each parameter it was trained on. Each heatmap in the heatmap views corresponds to one of these arrays.
  - The cluster inspector has the following features:
      - Each cluster has its own tab in the inspector. The tabs are ordered by color to match the colorbar beside the cluster map. The colored square next to the cluster's name inside the tab's frame confirms the color of that cluster.
      - When a cluster is selected, its statistical information and the items it contains are shown in the tables in the tab's frame.
      - You can restrict the items shown in the tab by entering a **regular expression** ([1](https://en.wikipedia.org/wiki/Regular_expression), [2](https://www.regular-expressions.info/quickstart.html)) that matches the items you wish to see.
      - Once the items are filtered to your liking, click the "render points" button to display every item visible in the table on the cluster map.
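
As a rough illustration of regular-expression filtering, the sketch below uses Python's standard `re` module on a plain list. The helper `filter_items` and the item names are hypothetical. The inspector's actual matching rules (for example, whether it searches anywhere in the name or is case-sensitive) are not confirmed here.

```python
import re

def filter_items(names, pattern):
    """Return the names that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

# Hypothetical item names; the real ones come from the loaded data file
items = ["Al2O3", "SiO2", "TiO2", "AlN", "Si3N4"]

# Names ending in "O" followed by optional digits
oxides = filter_items(items, r"O\d*$")

# Names starting with "Al"
aluminium = filter_items(items, r"^Al")

print(oxides)
print(aluminium)
```

Anchors such as `^` and `$` are usually the quickest way to narrow a long item table, since an unanchored pattern matches anywhere in the name.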

In [1]:
import numpy as np
import pandas as pd
import logging
import ipywidgets as widgets
import matplotlib.pyplot as plt
import tables
import sompy
from sompy.sompy import SOMFactory

backend module://ipykernel.pylab.backend_inline version unknown


In [2]:
from tfprop_sompy.jupyter_integration.cluster_inspector import sort_materials_by_cluster, cluster_tabs

In [3]:
from tfprop_sompy.tfprop_vis import render_posmap_to_axes, kmeans_clust, show_posmap, ViewTFP, dataframe_to_coords , clusteringmap_category

In [4]:
# Uncomment the next line to keep all loggers quiet unless the message is a warning or worse
#logging.getLogger().setLevel(logging.WARNING)

In [5]:
# Paste name of file generated by training in Sompy_experimentation
CODEBOOK_FILE = 'som_codemat_4props_20-02-05.h5'

KM_CLUSTERS = 5

In [6]:
# Creates necessary pd dataframes for visualization
stored_cb_matrix = pd.read_hdf(CODEBOOK_FILE, 'sm_codebook_matrix')
stored_mapsize = pd.read_hdf(CODEBOOK_FILE, 'sm_codebook_mapsize').values
mats_df = pd.read_hdf(CODEBOOK_FILE, 'sm_data')

# FIXME:
# We use the "pytables" library as a workaround to extract this information,
# because pandas has trouble reading object series out of h5 files
with tables.open_file(CODEBOOK_FILE, "r") as store:
    # Property names come back as byte strings;
    # decode them into unicode strings ready for presentation
    stored_columns = [name.decode('utf-8')
                      for name in store.root.sm_codebook_columns.property_names.read()]
    # Material families are used exactly as read, without decoding
    stored_matfamilies = list(store.root.sm_codebook_matfamilies.material_families.read())


FileNotFoundError: File som_codemat_4props_20-02-05.h5 does not exist

In [None]:
mats_df["Row"] = stored_matfamilies


In [None]:
sm = SOMFactory.build(mats_df[stored_columns].values, 
                mapsize=(*stored_mapsize,),
                normalization="var", 
                initialization="pca", 
                component_names=stored_columns)

In [None]:
sm.codebook.matrix = stored_cb_matrix.values
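
The Notes describe the trained SOM as holding one array per parameter. A minimal sketch of that layout follows, assuming the codebook matrix has shape `(n_nodes, n_features)` with nodes stored row by row. That layout is an assumption about sompy, and the sizes and values below are made up.

```python
import numpy as np

# Hypothetical map size and feature count, for illustration only
rows, cols, n_features = 3, 4, 2

# Stand-in for a codebook matrix: one row per map node, one column per feature
codebook = np.arange(rows * cols * n_features, dtype=float).reshape(rows * cols, n_features)

# One 2-D "component plane" per feature: each is what a single heatmap would show
planes = [codebook[:, i].reshape(rows, cols) for i in range(n_features)]

for i, plane in enumerate(planes):
    print(f"feature {i}: plane shape {plane.shape}")
```

Under this assumption, the value at `planes[i][r, c]` is feature `i` of the node at grid position `(r, c)`.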

In [None]:
def create_posmap(mysom: sompy.sompy.SOM, num_clusters: int = KM_CLUSTERS):
    # Cluster the map nodes with k-means
    cl_labels = kmeans_clust(mysom, n_clusters=num_clusters)

    # Plot the positioning map with the clustered groups
    show_posmap(mysom, mats_df, mats_df,
                num_clusters, cl_labels,
                show_data=False, labels=False)

In [None]:
cl_labels = kmeans_clust(sm, KM_CLUSTERS)

In [None]:
heatmap_size = (20, 20)
heatmap_col_sz = 4
gauss_alpha = None

cmap = plt.get_cmap('RdYlBu_r')  # set color map
viewTFP = ViewTFP(*heatmap_size, '', stdev_colorscale_coeff=1., text_size=14)

In [None]:
my_out = widgets.Output()

# No scaling
viewTFP.knee_value = 0.0
with my_out:
    print("Linear scaling")
    viewTFP.show(sm, cl_labels, col_sz=heatmap_col_sz,
                         which_dim='all', desnormalize=True, col_norm='mean',
                         cmap=cmap, isOutHtmap=False)
my_out

In [None]:
my_out = widgets.Output()
cmap = plt.get_cmap('RdYlBu_r')  # set color map

# No scaling
viewTFP.knee_value = 0.0
with my_out:
    print("Log scaling")
    viewTFP.show(sm, cl_labels, col_sz=heatmap_col_sz,
                         which_dim='all', desnormalize=True, col_norm='mean',
                         cmap=cmap, normalizer="log")
my_out

In [None]:
# viewTFP2 = ViewTFP(*(7, 7), '', stdev_colorscale_coeff=1,text_size=14)
# for i, p in enumerate(stored_columns):
#     viewTFP2.show(sm, cl_labels, col_sz=1,
#                      which_dim=i, desnormalize=True, col_norm='mean',
#                      cmap=cmap, normalizer="log", isOutHtmap=False)

In [None]:
# from sompy.visualization.mapview import View2D

# my_out = widgets.Output()
# cmap = plt.get_cmap('RdYlBu_r')  # set color map

# view2d = View2D(*heatmap_size, '', stdev_colorscale_coeff=1., text_size=14)
# # No scaling
# viewTFP.knee_value = 0.0
# with my_out:
#     print("Log scaling")
#     viewTFP.show(sm, cl_labels, col_sz=heatmap_col_sz,
#                          which_dim='all', desnormalize=True, col_norm='mean',
#                          cmap=cmap, normalizer="log")
# my_out

In [None]:
## %matplotlib inline
my_dataframe = mats_df
clusters_list = sort_materials_by_cluster(sm, my_dataframe, cl_labels)

# This makes it so it will display the full lists
pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)

# This should be the last statement of the cell, to make it display
# That, or assign the return value to a variable, and have that variable be the final expression in a cell
cluster_tabs(sm, my_dataframe, clusters_list, cl_labels)

In [None]:
from tfprop_sompy.jupyter_integration.cluster_inspector import make_cluster_graph
from tfprop_sompy.tfprop_vis import dataframe_to_coords, render_points_to_axes

fig, ax = make_cluster_graph(sm, cl_labels)
clusteringmap_category(ax, sm, KM_CLUSTERS, my_dataframe, "Top_Film_(Coded)", my_dataframe["Row"], 'plot.png')


In [None]:
from tfprop_sompy.tfprop_vis import UMatrixTFP

umat_size = (50, 50)

umat = UMatrixTFP(*umat_size, 'U-matrix')

umat.show(sm, my_dataframe, my_dataframe, cmap=cmap)
None

In [None]:
sm.calculate_quantization_error()

In [None]:
sm.calculate_topographic_error()

In [None]:
# Run cells below this one manually
# Used for testing code
assert False