TODO: Overview of what we are doing

---

First we read the data from the CSV file and use the `subject`, `sessionIndex`,
and `rep` columns as the index, which makes it easier to select the rows we
want (see the documentation for
[pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)).
By default `pd.read_csv` reads the `subject` column with dtype `object`, so we
explicitly declare it as `string`.


In [2]:
import gudhi as gd
import pandas as pd

strong_password_data_frame = pd.read_csv('data/DSL-StrongPasswordData.csv',
                                   # declare type of 'subject' column
                                   dtype = {'subject' : 'string'},
                                   index_col = ['subject', 'sessionIndex', 'rep'])
strong_password_data_frame

subject,sessionIndex,rep,H.period,DD.period.t,UD.period.t,H.t,DD.t.i,UD.t.i,H.i,DD.i.e,UD.i.e,H.e,...,H.a,DD.a.n,UD.a.n,H.n,DD.n.l,UD.n.l,H.l,DD.l.Return,UD.l.Return,H.Return
s002,1,1,0.1491,0.3979,0.2488,0.1069,0.1674,0.0605,0.1169,0.2212,0.1043,0.1417,...,0.1349,0.1484,0.0135,0.0932,0.3515,0.2583,0.1338,0.3509,0.2171,0.0742
s002,1,2,0.1111,0.3451,0.2340,0.0694,0.1283,0.0589,0.0908,0.1357,0.0449,0.0829,...,0.1412,0.2558,0.1146,0.1146,0.2642,0.1496,0.0839,0.2756,0.1917,0.0747
s002,1,3,0.1328,0.2072,0.0744,0.0731,0.1291,0.0560,0.0821,0.1542,0.0721,0.0808,...,0.1621,0.2332,0.0711,0.1172,0.2705,0.1533,0.1085,0.2847,0.1762,0.0945
s002,1,4,0.1291,0.2515,0.1224,0.1059,0.2495,0.1436,0.1040,0.2038,0.0998,0.0900,...,0.1457,0.1629,0.0172,0.0866,0.2341,0.1475,0.0845,0.3232,0.2387,0.0813
s002,1,5,0.1249,0.2317,0.1068,0.0895,0.1676,0.0781,0.0903,0.1589,0.0686,0.0805,...,0.1312,0.1582,0.0270,0.0884,0.2517,0.1633,0.0903,0.2517,0.1614,0.0818
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
s057,8,46,0.0884,0.0685,-0.0199,0.1095,0.1290,0.0195,0.0945,0.0757,-0.0188,0.1328,...,0.1219,0.1383,0.0164,0.0820,0.1329,0.0509,0.1005,0.2054,0.1049,0.1047
s057,8,47,0.0655,0.0630,-0.0025,0.0910,0.1148,0.0238,0.0916,0.0636,-0.0280,0.1256,...,0.1008,0.0512,-0.0496,0.1037,0.0868,-0.0169,0.1445,0.2206,0.0761,0.1198
s057,8,48,0.0939,0.1189,0.0250,0.1008,0.1122,0.0114,0.0721,0.0462,-0.0259,0.0903,...,0.0913,0.1169,0.0256,0.0689,0.1311,0.0622,0.1034,0.2017,0.0983,0.0905
s057,8,49,0.0923,0.1294,0.0371,0.0913,0.0990,0.0077,0.0992,0.0897,-0.0095,0.1016,...,0.0882,0.0821,-0.0061,0.0576,0.0697,0.0121,0.0979,0.1917,0.0938,0.0931
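
To make the effect of the three-level index concrete, here is a minimal sketch on a toy excerpt. The three rows and the names `csv_text`, `toy`, `rows_s002`, and `one_row` are hypothetical and used only for illustration; the `read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same layout as the real CSV.
csv_text = """subject,sessionIndex,rep,H.period,DD.period.t
s002,1,1,0.1491,0.3979
s002,1,2,0.1111,0.3451
s003,1,1,0.1328,0.2072
"""

toy = pd.read_csv(io.StringIO(csv_text),
                  dtype={'subject': 'string'},
                  index_col=['subject', 'sessionIndex', 'rep'])

# A single label selects on the first index level (subject); the result
# is still indexed by the remaining levels (sessionIndex, rep).
rows_s002 = toy.loc['s002']

# A full (subject, sessionIndex, rep) tuple selects one row as a Series.
one_row = toy.loc[('s002', 1, 2)]

print(rows_s002)
print(one_row)
```

Selecting by subject label alone is the pattern used later to build one `DataFrame` per person.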


We put a sample of the subjects and their password-typing data in a `people`
list (each element is a `DataFrame`). The subjects `s006`, `s009`, `s014`,
`s023`, and `s045` are missing from the data set, so we define the helper
function `subjects_in_range` to generate the labels.

In [3]:
def subjects_in_range(start, stop):
    """Returns a list of labels for subjects in the subject column.

    :param start: integer between 2 and 57, inclusive
    :param stop: integer between 2 and 57, inclusive. Should be greater than or
                 equal to start.
    :returns: list of zero-padded subject labels from s{start} to s{stop},
              skipping the subjects that are missing from the data set
    """
    return [f's{i:03}' for i in range(start, 1 + stop) if i not in [6, 9, 14, 23, 45]]

# s006 is missing, so this selects s002 through s005.
people = [strong_password_data_frame.loc[subject] for subject in subjects_in_range(2, 6)]
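
As a quick check of how the label generation behaves, here is a small standalone sketch. The names `MISSING` and `subject_labels` are hypothetical; the function is a variant of `subjects_in_range` that keeps the missing subject numbers in a set.

```python
# Subject numbers absent from the data set, as listed in the text above.
MISSING = {6, 9, 14, 23, 45}

def subject_labels(start, stop):
    """Hypothetical variant of subjects_in_range using a set for the lookup."""
    return [f's{i:03}' for i in range(start, stop + 1) if i not in MISSING]

print(subject_labels(2, 6))        # ['s002', 's003', 's004', 's005']
print(len(subject_labels(2, 57)))  # 51
```

Because `s006` is skipped, the range 2 to 6 yields four labels rather than five.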

The index levels are not part of the point cloud we want to build our
simplicial complexes on, so we call `to_numpy` on each `DataFrame`, which
returns only the timing values.

In [4]:
people[0].to_numpy()

array([[0.1491, 0.3979, 0.2488, ..., 0.3509, 0.2171, 0.0742],
       [0.1111, 0.3451, 0.234 , ..., 0.2756, 0.1917, 0.0747],
       [0.1328, 0.2072, 0.0744, ..., 0.2847, 0.1762, 0.0945],
       ...,
       [0.1642, 0.175 , 0.0108, ..., 0.3048, 0.1997, 0.1259],
       [0.1623, 0.2126, 0.0503, ..., 0.314 , 0.1601, 0.1154],
       [0.1792, 0.1889, 0.0097, ..., 0.5261, 0.3862, 0.108 ]])

In [None]:
simplex_trees = []
persistence_diagrams = []

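for person in people:
    # Build a Rips complex on this subject's rows, treated as points.
    simplicial_complex = gd.RipsComplex(points = person.to_numpy(),
                                        max_edge_length = 0.2)
    # Expand the complex to simplices of dimension at most 4.
    simplex_tree = simplicial_complex.create_simplex_tree(max_dimension = 4)
    simplex_trees.append(simplex_tree)
    # persistence() returns a list of (dimension, (birth, death)) pairs.
    persistence_diagrams.append(simplex_tree.persistence())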
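
To illustrate what the `max_edge_length` threshold controls, here is a minimal pure-Python sketch of the Rips rule on four hypothetical 2-D points. It does not use GUDHI and is not how the library builds the complex internally; it only shows which edges and triangles the rule admits at a threshold of 0.2.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points; the real rows have one coordinate per timing column.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.15), (0.5, 0.5)]
max_edge_length = 0.2

# A Rips complex contains an edge for every pair of points within the threshold...
edges = [(i, j) for i, j in combinations(range(len(points)), 2)
         if dist(points[i], points[j]) <= max_edge_length]

# ...and a triangle whenever all three of its edges are present.
edge_set = set(edges)
triangles = [t for t in combinations(range(len(points)), 3)
             if all(pair in edge_set for pair in combinations(t, 2))]

print(edges)      # [(0, 1), (0, 2), (1, 2)]
print(triangles)  # [(0, 1, 2)]
```

The fourth point is farther than 0.2 from every other point, so it stays an isolated vertex at this threshold.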

Now we plot the persistence diagrams:

In [None]:
for diagram in persistence_diagrams:
    gd.plot_persistence_diagram(diagram)

... and the barcode diagrams:

In [None]:
for diagram in persistence_diagrams:
    gd.plot_persistence_barcode(diagram)
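
Both kinds of plot draw the same (birth, death) intervals. As a rough sketch of how to inspect them numerically, the snippet below ranks the finite intervals in one dimension by lifetime. It assumes the list of `(dimension, (birth, death))` pairs that `simplex_tree.persistence()` returns; the values in `diagram` and the helper `finite_lifetimes` are hypothetical.

```python
import math

# Hypothetical data in the shape returned by simplex_tree.persistence().
diagram = [(1, (0.12, 0.19)),
           (0, (0.0, math.inf)),
           (0, (0.0, 0.08)),
           (1, (0.15, 0.16))]

def finite_lifetimes(diagram, dimension):
    """Return death - birth for the finite intervals in one dimension, longest first."""
    return sorted((death - birth
                   for dim, (birth, death) in diagram
                   if dim == dimension and math.isfinite(death)),
                  reverse=True)

lifetimes_1d = finite_lifetimes(diagram, 1)
print(lifetimes_1d)
```

Long lifetimes correspond to points far from the diagonal in a persistence diagram and to long bars in a barcode.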

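Now we compute the pairwise bottleneck distances between the subjects in each
homology dimension. First we collect the persistence intervals by dimension: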

In [None]:
pers_intervals_array_in_dim = []

for i in range(4):
    pers_intervals_array_in_dim.append(
        [simplex_tree.persistence_intervals_in_dimension(i)
         for simplex_tree in simplex_trees])

In [None]:
for dimension, intervals in enumerate(pers_intervals_array_in_dim):
    print(f'i j\t bottleneck distance ({dimension}D)')
    print('---\t -------------------')
    # Iterate over pairs of subjects (the inner lists), not over dimensions.
    for i in range(len(intervals)):
        for j in range(i, len(intervals)):
            a = intervals[i]
            b = intervals[j]
            print(f'{i} {j}\t {gd.bottleneck_distance(a, b)}')
    print('\n')
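
To make the quantity being printed concrete, here is a brute-force sketch of the bottleneck distance for very small diagrams. It illustrates the definition only and is not the algorithm GUDHI uses. It assumes every interval is finite, and the name `brute_force_bottleneck` is hypothetical. Each point may be matched either to a point of the other diagram (at L-infinity cost) or to the diagonal (at half its lifetime), and the distance is the smallest achievable worst-case cost.

```python
from itertools import permutations

def brute_force_bottleneck(a, b):
    """Bottleneck distance between two small diagrams of finite (birth, death) pairs."""
    n, m = len(a), len(b)
    size = n + m  # each diagram is padded with diagonal slots for the other's points

    def cost(row, col):
        if row < n and col < m:   # match point a[row] with point b[col]
            return max(abs(a[row][0] - b[col][0]), abs(a[row][1] - b[col][1]))
        if row < n:               # send a[row] to the diagonal
            return (a[row][1] - a[row][0]) / 2
        if col < m:               # send b[col] to the diagonal
            return (b[col][1] - b[col][0]) / 2
        return 0.0                # diagonal matched with diagonal

    # Try every perfect matching and keep the smallest worst-case cost.
    return min(max((cost(r, c) for r, c in enumerate(perm)), default=0.0)
               for perm in permutations(range(size)))

print(brute_force_bottleneck([(0.0, 1.0)], [(0.0, 1.5)]))  # 0.5
print(brute_force_bottleneck([(0.0, 1.0)], []))            # 0.5
```

The search is factorial in the number of points, so this is only usable on toy inputs; `gd.bottleneck_distance` remains the right tool for the real diagrams.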