# Plotly and Pandas for homemade experiments

This notebook shows one way to use pandas DataFrames to store the results of a computation, and then use plotly to explore those results conveniently.

What we will actually do in the notebook:
- Generate some tensor data
- Decompose it with two algorithms, varying one hyperparameter (the number of components), and store two different metrics (time, reconstruction error)
- Repeat the previous step 10 times, each time with a different initialization

In [14]:
import numpy as np
import tensorly as tl
from tensorly.random import random_cp
from tensorly.decomposition import non_negative_parafac, non_negative_parafac_hals, parafac
import plotly.express as px
#import chart_studio.plotly as py
import pandas as pd
import copy # for fair initialization
import time # for checking speed

First good practice: set a (non-trivial) seed so that randomized experiments are reproducible.

In [15]:
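import zlib # crc32 returns the same integer in every session; hash() on a str is salted per Python process
rs = tl.check_random_state(zlib.crc32(b"this meeting is dope"))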

## Part 1: an experiment with logs in pandas DataFrame
### Before loops
- We set some fixed hyperparameters
- We generate random tensor data to feed a tensor decomposition library (any other computation would have worked)

In [16]:
# A bunch of fixed hyperparameters
dims = [10,11,12] # sizes of the tensor
rank = 5 # true number of components
N_init = 10 # number of random inits
n_iter = 100 # number of iterations in the algorithms
# we will check the impact of the number of estimated components
list_of_ranks = [4,5,6]


# This is where we initialize the DataFrame that will help us store results
df = pd.DataFrame()

data_cp = random_cp(dims, rank, random_state=rs)
# transforming into nonnegative cp
data_cp[0] = tl.tensor([1 for i in range(rank)])
data_cp[1] = [tl.abs(data_cp[1][i]) for i in range(len(dims))]
# make the data into a full tensor
data_tensor = data_cp.to_tensor()#+0.1*np.random.randn(*dims)

# We can add the hyperparameters to the DataFrame in a first row
# We can also store the data, but with random seeds it is not mandatory
#df = pd.concat([pd.DataFrame(
#    {
#        "rank":rank,
#        "dims":[dims], # careful: put lists into brackets, pandas removes one instance
#        "N_init":N_init,
#        "n_iter":n_iter#,
#        #"data":[data_cp]
#    }), df])

#print(df,df.dtypes)
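
The commented-out block above warns that list-valued hyperparameters need an extra pair of brackets. Here is a minimal sketch of why, using only pandas and the same hyperparameter values; the names `unwrapped` and `wrapped` are illustrative and not part of the notebook.

```python
import pandas as pd

# Without the extra brackets, pandas reads the list as a column of 3 values
# and broadcasts the scalar "rank" over 3 rows.
unwrapped = pd.DataFrame({"rank": 5, "dims": [10, 11, 12]})

# With the extra brackets, the whole list is stored in a single cell,
# which gives one row holding all the hyperparameters.
wrapped = pd.DataFrame({
    "rank": 5,
    "dims": [[10, 11, 12]],
    "N_init": 10,
    "n_iter": 100,
})

print(unwrapped.shape, wrapped.shape)
```

The second form is the one you want if the goal is a single "context" row for the whole experiment.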

### The loops
I tend to have a lot of loops when running experiments, e.g. when running a grid search over a hyperparameter. This is where the DataFrame is useful: it stores the result of each iteration together with the context of that result.

In [17]:
for i in range(N_init):
    for r in list_of_ranks:
    
        # Drawing random initial values
        init_cp = random_cp(dims,r,random_state=rs)
        init_cp[0] = tl.tensor([1 for i in range(r)])
        init_cp[1] = [tl.abs(init_cp[1][i]) for i in range(len(dims))]

        # using a first algorithm to compute the decomposition
        tic = time.time()
        #out = non_negative_parafac(data_tensor, r, return_errors=True,
        #                           init=copy.deepcopy(init_cp), n_iter_max=n_iter)
        out = parafac(data_tensor, r, return_errors=True,
                      init=copy.deepcopy(init_cp), n_iter_max=n_iter)
        # out is [estimated cp tensor, list of errors]
        toc = time.time() - tic
        
        tic2 = time.time()
        out2 = non_negative_parafac_hals(data_tensor,r, return_errors=True,
                                    init=copy.deepcopy(init_cp),n_iter_max=n_iter)
        toc2 = time.time() - tic2
        
        # Now is the time to store results
        df = pd.concat([df,pd.DataFrame(
        {
            "rec_error":[out[1],out2[1]], # a little heavy but doable
            "final_rec_error":[out[1][-1],out2[1][-1]],
            "time":[toc,toc2],
            "algorithm":['default', 'hals'],
            "init_nb": i,
            "rank_est": r
        })], ignore_index=True)
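
Calling `pd.concat` inside the loop copies the whole DataFrame at every iteration. That is fine for 60 rows, but a possible alternative for larger experiments is to collect plain dictionaries in a list and build the DataFrame once at the end. The sketch below assumes a placeholder error curve instead of a real decomposition, and the names `rows` and `results` are illustrative.

```python
import time
import pandas as pd

rows = []  # one dict per (run, rank) result
for init_nb in range(3):
    for rank_est in [4, 5]:
        tic = time.perf_counter()
        # placeholder standing in for the list of errors returned by an algorithm
        errors = [1.0 / (k + 1) for k in range(10)]
        elapsed = time.perf_counter() - tic

        rows.append({
            "rec_error": errors,
            "final_rec_error": errors[-1],
            "time": elapsed,
            "algorithm": "placeholder",
            "init_nb": init_nb,
            "rank_est": rank_est,
        })

# a single construction at the end, instead of one concat per iteration
results = pd.DataFrame(rows)
print(results.shape)
```

Each dictionary becomes one row, so list-valued entries such as `rec_error` are stored as a single cell without any extra brackets.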

### Storing the results for later

In [18]:
print(df)
df.to_pickle('./my_xp_results.pkl')

                                            rec_error  final_rec_error  \
0   [0.16135338106959818, 0.11910553687611242, 0.1...         0.057854   
1   [0.17067386434147208, 0.10696447543733074, 0.0...         0.057846   
2   [0.17606326144954407, 0.08374195332669614, 0.0...         0.001265   
3   [0.1682406029688855, 0.09115272739385925, 0.07...         0.000406   
4   [0.136643663656339, 0.08910767795340095, 0.068...         0.002883   
5   [0.16320293832892074, 0.08918308911737952, 0.0...         0.001625   
6   [0.1387443992530884, 0.09613659738207948, 0.08...         0.057836   
7   [0.15332801622433792, 0.11043837002145528, 0.0...         0.057847   
8   [0.13145918977348373, 0.0908001968085309, 0.08...         0.001361   
9   [0.1464294323500054, 0.07215509488616327, 0.05...         0.000495   
10  [0.1024241390109571, 0.0758443799193823, 0.069...         0.001735   
11  [0.15203190421283794, 0.08653033624589918, 0.0...         0.001457   
12  [0.12090363861923044, 0.0878346760

## Part 2: looking at the results with little effort

We do that with plotly!

In [19]:
# if needed, import the data
df = pd.read_pickle("./my_xp_results.pkl")

In [20]:
# which algorithm is faster?
fig = px.box(df, x='algorithm', y='time', title="Which algorithm is faster?")
# you can use a template for the layout if you have personal preferences that you reuse all the time
fig.update_layout(
    font = dict(
        size = 20
    )
)
fig.show() # rendering in a notebook requires the nbformat package

In [27]:
# which algorithm is more accurate? What is the impact of rank value?
fig = px.box(df, x='algorithm', y='final_rec_error', 
            title="Which algorithm is more accurate? Does rank have an impact?",
            template="plotly_white", points="all", facet_col="rank_est",
            color="algorithm", log_y=True)
fig.update_layout(font = dict(size = 20))
fig.show()
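
If you want numbers to go with the box plots, a `groupby` on the same columns gives a quick summary table. This is only a sketch on a small hand-made DataFrame with the notebook's column names; the values are made up for illustration and are not results from the experiment.

```python
import pandas as pd

# toy stand-in for the results DataFrame (made-up values)
toy = pd.DataFrame({
    "algorithm": ["default", "default", "hals", "hals"],
    "rank_est": [4, 4, 4, 4],
    "time": [0.25, 0.75, 1.0, 3.0],
    "final_rec_error": [0.5, 0.25, 0.125, 0.375],
})

# median time and final error per (algorithm, estimated rank)
summary = toy.groupby(["algorithm", "rank_est"])[["time", "final_rec_error"]].median()
print(summary)
```

On the real data, the same call on `df` would give one row per algorithm and per value of `rank_est`.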

In [22]:
# time vs error ?
fig = px.scatter(df, x="time", y="final_rec_error", 
                 color="algorithm", symbol="rank_est")
fig.update_traces(marker=dict(size=10))
fig.show()

In [63]:
# distributions of the timings
fig = px.histogram(df[df["rank_est"]==4], x="time", facet_col="algorithm", 
                   barmode="overlay", facet_col_spacing=0.1,
                   template="plotly_white",
                   color= 'algorithm')
fig.update_layout(font = dict(size = 20))

fig.show()

### A more complicated example: convergence plots
Sadly, I have not found a convenient way for plotly express to use the lists of errors (i.e. Python objects) stored in the DataFrame.
Two solutions:
- Use plotly graph_objects, or matplotlib, to do it by hand (meh)
- Transform df into another DataFrame suited for line plots (one row per error value, i.e. per iteration)

Let's do the second one (some pandas manipulation!)

In [24]:
# we want to plot error vs iteration vs algorithm vs run
# we can discard anything else
df2 = pd.DataFrame()
for idx_pd,i in enumerate(df["rec_error"]):
    for it,j in enumerate(i):
        df2=pd.concat([df2, pd.DataFrame({
            "it":[it],
            "rec_err":j,
            "algorithm":df.iloc[idx_pd]["algorithm"],
            "run":df.iloc[idx_pd]["init_nb"],
            "rank_est":df.iloc[idx_pd]["rank_est"],
        })], ignore_index=True)
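
The double loop above does the job, but it gets slow when there are many runs and iterations. One possible shortcut is `DataFrame.explode`, which turns each element of a list-valued column into its own row. The sketch below runs on a small hand-made DataFrame with the same column names (made-up values); `long_df` is an illustrative name.

```python
import pandas as pd

# toy stand-in for df: one list of errors per row (made-up values)
toy = pd.DataFrame({
    "rec_error": [[0.5, 0.2, 0.1], [0.4, 0.3]],
    "algorithm": ["default", "hals"],
    "init_nb": [0, 0],
    "rank_est": [4, 4],
})

# one row per error value; the original row index is repeated for each element
long_df = toy.explode("rec_error")
# position of each value inside its original list, i.e. the iteration number
long_df["it"] = long_df.groupby(level=0).cumcount().to_numpy()
# match the column names used for plotting and restore a clean index
long_df = long_df.rename(columns={"rec_error": "rec_err", "init_nb": "run"})
long_df = long_df.reset_index(drop=True)
# explode leaves an object column, so convert the errors back to floats
long_df["rec_err"] = long_df["rec_err"].astype(float)
print(long_df)
```

The result has the same columns as `df2`, so it could be passed to `px.line` in the same way.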

In [25]:
# let's plot only some runs
fig = px.line(df2[df2["run"]<3], x="it", y= "rec_err",color='algorithm',
              facet_col="run", log_y=True,
              facet_row="rank_est",
              height=1000)
fig.update_layout(font = dict(size = 20))
fig.show()