# About experiment log

Loading the parameters of experiments has always been a headache for me. I can remember the parameters when the amount of data is small, typically at the beginning of a project. However, as the data grow, I increasingly need to go back to the log files, where I often make mistakes and end up solving the problem in an ugly way.

Below are some of my attempts to wrap my experiment logs in a function, so that as long as I follow the protocol for calling the function, I do not have to remember the details of the logs.
```python
def data_log_mapping(kw='aug'):
    """
    Returns the data log mapping. 
    My experiments are recorded in date/number fashion, without detailed parameters.
    All the parameters are logged in separate log files.
    This function maps the parameters to date/number.
    
    Args:
    kw -- keyword of the data set. I did a new set of experiments in August, so 'aug' is currently the only valid value. The old data may still be useful in the future; when needed, I will implement the mappings for them.
    
    Returns:
    dirs -- the data-log mapping.
    
    IMPORTANT: Whenever new experiments are added, this function needs to be updated.
    """
    if kw == 'aug':
        dirs = {}
        dirs['120'] = ['08062020-3', '08062020-4', '08062020-5']
        dirs['100'] = ['08062020-0', '08062020-1', '08062020-2']
        dirs['85'] = ['08052020-3', '08052020-4', '08052020-5']
        dirs['80'] = ['08032020-0', '08032020-1', '08032020-2']
        dirs['70'] = ['08042020-0', '08042020-1', '08042020-2']
        dirs['60'] = ['08032020-3', '08032020-4', '08032020-5']
        dirs['50'] = ['08042020-3', '08042020-4', '08042020-5']
        dirs['40'] = ['08032020-6', '08032020-7', '08032020-8']
        dirs['30'] = ['08042020-6', '08042020-7', '08042020-8']
        dirs['20'] = ['08032020-9', '08032020-10', '08032020-11']
        dirs['10'] = ['08042020-9', '08042020-10', '08042020-11']
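        dirs['00'] = ['08032020-12', '08032020-13', '08032020-14']
    else:
        # Only the August data set is mapped so far; fail loudly instead of
        # hitting an undefined `dirs` below.
        raise ValueError("Unknown keyword: {!r}".format(kw))
    return dirs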

def tentative_log():
    """
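    Another log function, for the density fluctuation data.
    Returns three parallel lists: concentrations, date folders, and the run numbers of each concentration.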
    """
    conc = [120, 100, 85, 80, 70, 60, 50, 40, 30, 20, 10]
    folders = ['08062020', '08062020', '08052020', '08032020', '08042020', '08032020', '08042020', '08032020', '08042020', '08032020', '08042020']
    sample_num = [range(3, 6), range(0, 3), range(3, 6), range(0, 3), range(0, 3), range(3, 6), range(3, 6), range(6, 9), range(6, 9), range(9, 12), range(9, 12)]
    return conc, folders, sample_num

def data_log():
    """
    Return the data log: log[date][num, fps]    
    """
    log = {}
    log['08032020'] = {}
    log['08032020']['num'] = list(range(0, 15))
    log['08032020']['fps'] = [30, 30, 30, 30, 30, 30, 30, 10, 10, 10, 10, 10, 10, 10, 10]
    log['08042020'] = {}
    log['08042020']['num'] = list(range(0, 12))
    log['08042020']['fps'] = [30, 30, 30, 30, 30, 30, 30, 30, 30, 10, 10, 10]
    log['08052020'] = {}
    log['08052020']['num'] = list(range(0, 12))
    log['08052020']['fps'] = [30, 30, 30, 30, 30, 30, 30, 30, 30, 10, 10, 10]
    log['08062020'] = {}
    log['08062020']['num'] = list(range(0, 13))
    log['08062020']['fps'] = [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 10]
    return log
```

Although all of these functions helped me a lot at the time I wrote them, they share the same drawback: once the data change, or once I need to include more parameters, each function has to be almost completely rewritten before it works again. They serve well only when I am sure that the data and parameters will not change.

Therefore, I started to think of a different way to manage the log that solves the problem of adding data and parameters. For now, I still like to use a function that I can call to get the log, because of its simplicity and robustness (the data are stored in a file separate from the notebook I work in, so I am less likely to change the numbers in the log accidentally). For the return value, I think a DataFrame is much better than the _dict_ or _list_ I am using right now, simply because the structure of a DataFrame is more linear, with each experiment run as a row.
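
To illustrate why a row-per-run table is convenient, here is a minimal sketch showing that the concentration-to-folder mapping hard-coded in `data_log_mapping()` can be obtained by filtering instead. The table is a small stand-in whose values are copied from the August logs above (the first six runs of 08032020); the column names `run_number`, `conc`, `FPS` and `date` are my choice for this sketch.

```python
import pandas as pd

# Stand-in table: one row per experiment run, values taken from the logs above.
log_df = pd.DataFrame({
    'run_number': range(0, 6),
    'conc': [80, 80, 80, 60, 60, 60],
    'FPS': [30] * 6,
}).assign(date='08032020')

# Select the runs I need with conditions on the parameter columns ...
subset = log_df[(log_df['conc'] == 80) & (log_df['FPS'] == 30)]

# ... and rebuild the "date-number" folder names from the selected rows.
folders = (subset['date'] + '-' + subset['run_number'].astype(str)).tolist()
print(folders)  # ['08032020-0', '08032020-1', '08032020-2']
```

The result is the same list stored under `dirs['80']` in `data_log_mapping()`, but nothing has to be rewritten when more runs or parameters are added.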

Using a DataFrame makes adding data easier. Each day of experiments is transformed into a small DataFrame, with the parameters as columns and the run number as the identifying variable (it is kept as an ordinary column rather than set as the index, because run numbers repeat across dates and would be ambiguous once the days are combined). The idea is illustrated below:

In [4]:
import pandas as pd
import numpy as np
pd.DataFrame(data=np.random.rand(2, 2), columns=['param1', 'param2'])

|   | param1 | param2 |
|---|---|---|
| 0 | 0.743867 | 0.595178 |
| 1 | 0.294152 | 0.39125 |


Note that the experiment run number is also stored as a column, like any other parameter.
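
As a rough sketch of what one day's log could look like in this layout, the block below builds a small DataFrame with the run number as an ordinary column and tags every row with its date. The parameter values and the date string `'01012020'` are made up for illustration only.

```python
import pandas as pd

# Hypothetical log of a single day; the values are illustrative, not real data.
day_log = pd.DataFrame({
    'run_number': range(0, 3),   # run number is a column, not the index
    'conc': [80, 80, 80],
    'FPS': [30, 30, 10],
})

# Tag every row with the (hypothetical) date, so that rows from different
# days stay distinguishable after the daily tables are stacked together.
day_log = day_log.assign(date='01012020')
print(day_log)
```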

## Guidelines

To maintain an `experiment_log()` function, the following protocol needs to be followed:

- The log of each day should be transformed into a DataFrame, with all the relevant parameters as columns.
- When new parameters are added in the middle of a project, ideally all the old DataFrames should be updated to include them. However, it is possible to leave them unchanged and still use the log function, since DataFrame handles NaN values well.
- When using the log, call the function to get the DataFrame, and use conditions to narrow it down to the data I need.
- The function is saved in a separate file, `log.py`, to avoid accidental modification. The function may also grow long over time, which is another reason to keep it out of the notebook.
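
The second and third guidelines can be sketched as follows. This is only an illustration under assumptions: the two daily logs, the date labels `'day1'` and `'day2'`, and the extra `temperature` parameter are all hypothetical. It shows that stacking days with different columns leaves NaN where a parameter was not recorded, and that conditions can then narrow the log down.

```python
import pandas as pd

# Hypothetical daily logs: the second day records an extra parameter.
day1 = pd.DataFrame({'run_number': [0, 1], 'conc': [80, 60]}).assign(date='day1')
day2 = pd.DataFrame({'run_number': [0, 1], 'conc': [70, 50],
                     'temperature': [22.0, 23.5]}).assign(date='day2')

# pd.concat aligns the tables on column names; entries that a day did not
# record become NaN instead of raising an error.
log_df = pd.concat([day1, day2], ignore_index=True)

# Keep only the runs for which the new parameter is available.
has_temperature = log_df.dropna(subset=['temperature'])

# Narrow the log down with ordinary conditions on the columns.
selected = log_df[(log_df['conc'] >= 60) & (log_df['date'] == 'day1')]
print(log_df)
```

Old days therefore do not have to be edited immediately when a parameter is introduced, although filling in the missing values later keeps the log easier to query.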

In [37]:
def experiment_log(verbose=False):
    """
    Save the experiment logs of a research project in a DataFrame. See "About-experiment-log" notebook for more details.
    
    Args:
    verbose -- print the length of every parameter list during the integrity check, to help find mistakes in the log.
                Defaults to False, in which case only "The log looks OK!" is printed once the check has passed.
    
    Returns:
    log_df -- DataFrame of experiment logs
    """
    
    log_dict = {} # dict log, with date (or folder name, str) as keys
    log_dict['08032020'] = { # transform daily log to a dict, each parameter forms a list
        'run_number': range(0, 15),
        'conc': [80, 80, 80, 60, 60, 60, 40, 40, 40, 20, 20, 20, 0, 0, 0],        
        'FPS': [30, 30, 30, 30, 30, 30, 30, 10, 10, 10, 10, 10, 10, 10, 10],
        'MPP': np.ones(15) * 0.33,
        'length': [3600, 3600, 3600, 3600, 3600, 3600, 3600, 1800, 1800, 1800, 1800, 1800, 100, 100, 100],
        'exposure_time': [1, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        'thickness': np.ones(15) * 140
    }
    log_dict['08042020'] = { # transform daily log to a dict, each parameter forms a list
        'run_number': range(0, 12),
        'conc': [70, 70, 70, 50, 50, 50, 30, 30, 30, 10, 10, 10],        
        'FPS': [30, 30, 30, 30, 30, 30, 30, 30, 30, 10, 10, 10],
        'MPP': np.ones(12) * 0.33,
        'length': [3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 1800, 1800, 1800],
        'exposure_time': [2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3],
        'thickness': np.ones(12) * 140
    }
    log_dict['08052020'] = { # transform daily log to a dict, each parameter forms a list
        'run_number': range(0, 12),
        'conc': np.ones(12) * 85,        
        'FPS': [30, 30, 30, 30, 30, 30, 30, 30, 30, 10, 10, 10],
        'MPP': np.ones(12) * 0.33,
        'length': [3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 1800, 1800, 1800],
        'exposure_time': [4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        'thickness': [200, 200, 200, 140, 140, 140, 100, 100, 100, 20, 20, 20]
    }
    log_dict['08062020'] = { # transform daily log to a dict, each parameter forms a list
        'run_number': range(0, 13),
        'conc': [100, 100, 100, 120, 120, 120, 120, 120, 120, 120, 120, 120, 120],        
        'FPS': [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 10],
        'MPP': np.ones(13) * 0.33,
        'length': [3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600, 3600],
        'exposure_time': [4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4],
        'thickness': [140, 140, 140, 140, 140, 140, 100, 100, 100, 200, 200, 200, 20]
    }
    
    # Check integrity: all columns from the same day must have the same length
    for kw, day_log in log_dict.items():
        lengths = {param: len(values) for param, values in day_log.items()}
        if verbose:
            print('---------{}----------'.format(kw))
            for param, l in lengths.items():
                print('length of {0:15s}: {1:d}'.format(param, l))
        # Fail with a message that names the offending day and its lengths
        assert len(set(lengths.values())) <= 1, \
            'Columns of {} have unequal lengths: {}'.format(kw, lengths)
        
    print("-------The log looks OK!--------")
    
    
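    # Build one DataFrame per day, tag it with its date, then stack them.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    log_df = pd.concat(
        [pd.DataFrame(day_log).assign(date=kw) for kw, day_log in log_dict.items()]
    )
        
    return log_df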

In [36]:
experiment_log(verbose=True)

---------08032020----------
length of run_number     : 15
length of conc           : 15
length of FPS            : 15
length of MPP            : 15
length of length         : 15
length of exposure_time  : 15
length of thickness      : 15
---------08042020----------
length of run_number     : 12
length of conc           : 12
length of FPS            : 12
length of MPP            : 12
length of length         : 12
length of exposure_time  : 12
length of thickness      : 12
---------08052020----------
length of run_number     : 12
length of conc           : 12
length of FPS            : 12
length of MPP            : 12
length of length         : 12
length of exposure_time  : 12
length of thickness      : 12
---------08062020----------
length of run_number     : 13
length of conc           : 13
length of FPS            : 13
length of MPP            : 13
length of length         : 13
length of exposure_time  : 13
length of thickness      : 13
-------The log looks OK!--------


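|   | run_number | conc | FPS | MPP | length | exposure_time | thickness | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 80.0 | 30 | 0.33 | 3600 | 1 | 140.0 | 08032020 |
| 1 | 1 | 80.0 | 30 | 0.33 | 3600 | 4 | 140.0 | 08032020 |
| 2 | 2 | 80.0 | 30 | 0.33 | 3600 | 4 | 140.0 | 08032020 |
| 3 | 3 | 60.0 | 30 | 0.33 | 3600 | 4 | 140.0 | 08032020 |
| 4 | 4 | 60.0 | 30 | 0.33 | 3600 | 4 | 140.0 | 08032020 |
| 5 | 5 | 60.0 | 30 | 0.33 | 3600 | 4 | 140.0 | 08032020 |
| 6 | 6 | 40.0 | 30 | 0.33 | 3600 | 3 | 140.0 | 08032020 |
| 7 | 7 | 40.0 | 10 | 0.33 | 1800 | 3 | 140.0 | 08032020 |
| 8 | 8 | 40.0 | 10 | 0.33 | 1800 | 3 | 140.0 | 08032020 |
| 9 | 9 | 20.0 | 10 | 0.33 | 1800 | 3 | 140.0 | 08032020 |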
