**This notebook shows an example of how to use this module to cluster the field data.**

This is completely analogous to the procedure used for the actual publication. It therefore lets you reproduce the results and experiment with different clustering techniques and parameters.

# Imports

In [1]:
%matplotlib inline
import os
import sys
from copy import deepcopy

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from mpl_toolkits.axes_grid1 import ImageGrid
sns.set_context('notebook')

Import the modules defined in the `src` package of this project.

In [2]:
# Determine the project directory and add it to the Python path,
# so that the `src` package can be imported
if 'workbookDir' not in globals():
    workbookDir = os.getcwd()
project_dir = os.path.normpath(os.path.join(workbookDir, os.pardir))
sys.path.insert(0, project_dir)

from src import tools, in_out, visualize as vis

# Clustering

To perform the clustering, we first load the general results from the `parameters_and_results.h5` file using the `tools.get_results` function. This dataset links the input parameters (such as polarization, incident angle, ...) to simulation indices, which are then used to find the correct rows in the field database.

**Note** that the full datasets are large! Loading them completely and performing the clustering requires a lot of memory, depending on the clustering technique and parameters. The `tools.get_results` function therefore allows using only every nth angle and wavelength, which is convenient for testing and for finding the limits of your system.

In [7]:
every = 40

# `str.format` keeps the output identical under Python 2 and Python 3
print('Loading simulation data, using every {} angles and wavelength.'.format(every))
sim_data_init = tools.get_results(every, every)
print('Total number of rows is {}'.format(len(sim_data_init)))

Loading simulation data, using every 40 angles and wavelength.
Total number of rows is 264


We have now loaded only every 40th wavelength and angle, which is a drastic reduction and will not produce meaningful results. However, the runtime stays short, so we can check that everything works.
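To illustrate how taking every nth value along both parameter axes shrinks the data set, here is a minimal sketch on a synthetic parameter grid. The helper `subsample_grid` and the column names `angle` and `wavelength` are assumptions made for this example; the sketch does not reproduce how `tools.get_results` works internally.

```python
import numpy as np
import pandas as pd


def subsample_grid(frame, every_angle, every_wavelength):
    """Keep only rows on every nth unique angle and wavelength.

    Hypothetical helper for illustration; the column names are assumptions.
    """
    angles = np.sort(frame['angle'].unique())[::every_angle]
    wavelengths = np.sort(frame['wavelength'].unique())[::every_wavelength]
    mask = frame['angle'].isin(angles) & frame['wavelength'].isin(wavelengths)
    return frame[mask]


# Synthetic parameter grid: 100 angles x 200 wavelengths = 20000 rows
angles = np.linspace(0., 60., 100)
wavelengths = np.linspace(500., 600., 200)
grid = pd.DataFrame([(a, w) for a in angles for w in wavelengths],
                    columns=['angle', 'wavelength'])

reduced = subsample_grid(grid, 40, 40)
print('{} of {} rows kept'.format(len(reduced), len(grid)))
```

With a step of 40, only 3 of the 100 angles and 5 of the 200 wavelengths survive, so 15 of the 20000 rows remain. The reduction scales with the product of both steps, which is why a large `every` value runs quickly but samples the parameter space very coarsely.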

In the next step, we define the parameters for the clustering procedure. We use the `DEFAULT_SIM_DDICT` and `DEFAULT_SIM_PDICT` dictionaries defined in the `tools` module for convenient access to the possible parameter names. Note that some of the clustering parameters are defined in this module (see the docstring of `tools.get_clustering_input_data`), while others are specific to the clustering algorithm (e.g. `n_clusters` for `MiniBatchKMeans`).

In [16]:
# Define init parameters
ddict = tools.DEFAULT_SIM_DDICT
pdict = tools.DEFAULT_SIM_PDICT
field_type = 'electric'
cluster_type = 'MiniBatchKMeans'

# Type `tools.get_clustering_input_data?` for info on these parameters
common_clkws = dict(treat_complex=None,
                    preprocess='normalize',
                    random_state=0)  # <- for reproducibility

# Set these defaults for all combinations of direction and polarization
clkw_dicts = tools.defaultdict(dict)
for direc_ in ddict:
    for pol_ in pdict:
        print('Setting for direction {}  and polarization {}'.format(direc_, pol_))
        clkw_dicts[direc_][pol_] = deepcopy(common_clkws)

# Set individual parameters, in this case for the number of clusters
clkw_dicts['Gamma-K']['TE']['n_clusters'] = 8
clkw_dicts['Gamma-K']['TM']['n_clusters'] = 8
clkw_dicts['Gamma-M']['TE']['n_clusters'] = 7
clkw_dicts['Gamma-M']['TM']['n_clusters'] = 7
clkw_dicts = dict(clkw_dicts)

Setting for direction Gamma-K  and polarization TM
Setting for direction Gamma-K  and polarization TE
Setting for direction Gamma-M  and polarization TM
Setting for direction Gamma-M  and polarization TE
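The cell above copies the common settings for each combination of direction and polarization before setting individual values. The following minimal sketch, which uses `collections.defaultdict` directly and a simplified settings dictionary, shows why the copy matters: without it, all entries would refer to the same dictionary object.

```python
from collections import defaultdict
from copy import deepcopy

common = dict(preprocess='normalize', random_state=0)

shared = defaultdict(dict)
copied = defaultdict(dict)
for direction in ('Gamma-K', 'Gamma-M'):
    for pol in ('TE', 'TM'):
        shared[direction][pol] = common            # same object everywhere
        copied[direction][pol] = deepcopy(common)  # independent copy

# Set an individual parameter for a single combination
shared['Gamma-K']['TE']['n_clusters'] = 8
copied['Gamma-K']['TE']['n_clusters'] = 8

print('n_clusters' in shared['Gamma-M']['TM'])  # True: the change leaked
print('n_clusters' in copied['Gamma-M']['TM'])  # False: entries are independent
```

For a flat dictionary like this one a shallow copy would be enough; `deepcopy` additionally protects nested values if they are added later.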


We now set up the logging to see the desired output of the clustering procedure.

In [None]:
import logging
log_fmt = '%(asctime)s - %(levelname)s - %(message)s'
logging.basicConfig(level=logging.DEBUG, format=log_fmt)

We're all set and ready to cluster! All we have to do is to pass the initial simulation data, the clustering type, the clustering parameters and the field type to the `cluster_all_modes` function of the `tools` module. It will return enriched simulation data and the trained models.

In [12]:
sim_data, model_data = tools.cluster_all_modes(sim_data_init,
                                               cluster_type=cluster_type,
                                               cluster_kwargs_dicts=clkw_dicts,
                                               field_type=field_type)

2018-04-13 12:46:22,094 - INFO - Clustering for Gamma-K TM
2018-04-13 12:46:22,096 - INFO - Reducing data set for pol=TM and direction Gamma-K
2018-04-13 12:46:22,098 - INFO - Loading sample data
2018-04-13 12:46:22,100 - INFO - Collecting field data
2018-04-13 12:46:22,101 - DEBUG - Length of `sim_numbers`: 132
2018-04-13 12:46:22,102 - DEBUG - Opening HDF5store: /hmi/kme/workspace/code_publications/phc_mode_clustering/data/raw/field_data_E_TM.h5
2018-04-13 12:48:23,808 - INFO - Loading time: 0:02:01.708251.
2018-04-13 12:48:23,809 - INFO - Running MiniBatchKMeans ...
2018-04-13 12:48:24,095 - INFO - Time for clustering: 0:00:00.284489.
2018-04-13 12:48:24,108 - INFO - Silhouette score fit: 0.315340055811
2018-04-13 12:48:24,111 - INFO - Calculating Euclidian distances
2018-04-13 12:48:24,150 - INFO - Updating simulation data set.
2018-04-13 12:48:24,158 - INFO - Finished
2018-04-13 12:48:24,160 - INFO - Clustering for Gamma-K TE
2018-04-13 12:48:24,168 - INFO - Reducing data set for 
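The log above lists the main steps for each mode: running `MiniBatchKMeans`, computing a silhouette score and calculating Euclidean distances. As a rough sketch of these steps, assuming they correspond to the scikit-learn classes of the same names, the following example runs them on synthetic data. It is not the implementation of `tools.cluster_all_modes`, and the random samples merely stand in for the field rows.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)

# Synthetic stand-in for the field rows: 3 groups of 50 samples, 20 features
centers = rng.normal(size=(3, 20))
X = np.vstack([c + 0.1 * rng.normal(size=(50, 20)) for c in centers])

# Scale each sample (row) to unit L2 norm
X_norm = normalize(X)

# Fit the model and assign a cluster label to each sample
model = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3)
labels = model.fit_predict(X_norm)

# Quality measure in [-1, 1]; higher values mean better separated clusters
score = silhouette_score(X_norm, labels)

# Euclidean distance of each sample to each cluster center
distances = model.transform(X_norm)

print('Silhouette score: {:.3f}'.format(score))
print('Distance matrix shape: {}'.format(distances.shape))
```

Fixing `random_state` makes the result reproducible, which is also the purpose of the `random_state=0` entry in `common_clkws`.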

# Basic Plotting and Storage

For many of the plotting routines we will need the field metadata, as the field data is stored as flat rows which need to be reshaped properly and assigned to their respective locations in 3D space. The metadata can easily be loaded using: 

In [18]:
lengths, pointlist, domain_ids = tools.get_metadata()

2018-04-13 13:13:35,615 - INFO - Reading metadata
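To make the reshaping step more concrete, here is a minimal sketch of how a flat row can be mapped back to field vectors at points in 3D space. The layout is an assumption made for illustration (three field components per point, stored consecutively); the arrays `points` and `flat_row` are hypothetical and do not reflect the actual structure of `lengths`, `pointlist` or `domain_ids`.

```python
import numpy as np

n_points = 4

# Hypothetical point list: one (x, y, z) coordinate per field point
points = np.array([[0., 0., 0.],
                   [1., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.]])

# Hypothetical flat row, assumed to be ordered as (Ex, Ey, Ez) per point
flat_row = np.arange(n_points * 3, dtype=float)

# One field vector per point, in the same order as `points`
field = flat_row.reshape(n_points, 3)

# Field strength at each point
magnitude = np.linalg.norm(field, axis=1)

for xyz, mag in zip(points, magnitude):
    print('{} -> {:.3f}'.format(xyz, mag))
```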


In [None]:
model_dir = os.path.join(tools.PROJECT_DIR, 'models')
metadata = dict(lengths=lengths, pointlist=pointlist, domain_ids=domain_ids)
in_out.save_plots_models(model_dir, sim_data,
                         model_data, cluster_type, metadata)