# Formatting data into a DataFrame
The desired format: each column is an observable quantity, each row is a different sample point. 

This notebook gives an example using fictitious data about particles of gas in a box. Observables are positions and speeds, sample points are different particles. 

Modify the notebook to import your own data instead. 

Available functions: 
- df_from_ndarray: to transform an array with more than 2 dimensions into a 2d, MultiIndexed DataFrame
- df_from_blocks: to concatenate 2d arrays containing sample points from different conditions
- regroup_levels: to add a level to the MultiIndex of a DataFrame, in order to regroup the values in another level (for instance, regroup positions together under the label "Position", and velocity components together under the label "Velocity")
- load_object, save_object: load or save pickle files
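
The source of `format_tools` is not shown in this notebook, so the following is only a guess at what `save_object` and `load_object` might do, assuming they are thin wrappers around the standard `pickle` module. The names `save_object_sketch` and `load_object_sketch` are hypothetical stand-ins, not the real functions.

```python
import os
import pickle
import tempfile

import numpy as np


def save_object_sketch(obj, path):
    """Hypothetical stand-in for save_object: pickle `obj` to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object_sketch(path):
    """Hypothetical stand-in for load_object: unpickle the object stored at `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip on a small array, inside a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    original = np.arange(6).reshape(2, 3)
    save_object_sketch(original, path)
    restored = load_object_sketch(path)

print(np.array_equal(original, restored))
```

Only unpickle files that you created yourself or that come from a trusted source, since loading a pickle can execute arbitrary code.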

In [1]:
# Initial module imports
import numpy as np
import scipy as sp
import pandas as pd
from format_tools import load_object, save_object, df_from_ndarray, df_from_blocks, regroup_levels
import os

## Importing your data
For the sake of the example, we import some fictitious data about gas particles. 

Replace this with an import of your own data, in its original format. 

### For the df_from_blocks method
Choose this subsection **OR** the one below (from an ndarray). 

In [2]:
# Creating and saving the fictitious data: remove this part
folder = "data/blocks/"
# Fictitious blocks used to test the df_from_blocks function
blocks = [np.random.rand(5, 6) for i in range(3)]  # 3 blocks

for i in range(len(blocks)):
    save_object(blocks[i], folder + "gas_example_block_{}.pkl".format(i))


In [3]:
# Import your files here

# Example: each block corresponds to a different (temperature, pressure) tuple. 
folder = "data/blocks/"
files = os.listdir(folder)
files = sorted([folder + fi for fi in files if fi.startswith("gas_example")])
list_of_blocks = []
print("Loaded blocks:")
for f in files:
    list_of_blocks.append(load_object(f))
    print(list_of_blocks[-1])

Loaded blocks:
[[0.71316279 0.81414003 0.77321754 0.54052901 0.89496055 0.45597601]
 [0.44306042 0.81198928 0.94780037 0.48387119 0.55149527 0.33600366]
 [0.78094743 0.13559562 0.90051875 0.44166881 0.44237303 0.56283908]
 [0.04940954 0.52147117 0.27074943 0.60764999 0.31374207 0.62179219]
 [0.97107481 0.75066481 0.48028947 0.80489248 0.11517844 0.42233976]]
[[0.02233586 0.77536212 0.50644677 0.52973631 0.75171172 0.10941163]
 [0.8774583  0.65599972 0.26391383 0.96055786 0.15751302 0.46005792]
 [0.73511335 0.17131204 0.69563038 0.31297605 0.11632371 0.71790573]
 [0.70201956 0.95538669 0.05637364 0.77619421 0.10768882 0.18512795]
 [0.49291985 0.46393642 0.27202198 0.8436711  0.64284533 0.71313817]]
[[0.7580745  0.85981574 0.03488712 0.22022355 0.7619963  0.46614814]
 [0.69255638 0.51276624 0.49519083 0.00930358 0.77645245 0.33297056]
 [0.1023417  0.01506129 0.46167481 0.53610569 0.89790049 0.51603358]
 [0.35836954 0.93827666 0.43511498 0.23378003 0.62597862 0.94811184]
 [0.69259641 0.23

### For the df_from_ndarray method

In [4]:
# Creating and saving the fictitious data: remove this part
folder = "data/ndarrays/"
arr = np.arange(4*5*6).reshape(4, 5, 6)
save_object(arr, folder + "gas_example_ndarray.pkl")
print(arr)

[[[  0   1   2   3   4   5]
  [  6   7   8   9  10  11]
  [ 12  13  14  15  16  17]
  [ 18  19  20  21  22  23]
  [ 24  25  26  27  28  29]]

 [[ 30  31  32  33  34  35]
  [ 36  37  38  39  40  41]
  [ 42  43  44  45  46  47]
  [ 48  49  50  51  52  53]
  [ 54  55  56  57  58  59]]

 [[ 60  61  62  63  64  65]
  [ 66  67  68  69  70  71]
  [ 72  73  74  75  76  77]
  [ 78  79  80  81  82  83]
  [ 84  85  86  87  88  89]]

 [[ 90  91  92  93  94  95]
  [ 96  97  98  99 100 101]
  [102 103 104 105 106 107]
  [108 109 110 111 112 113]
  [114 115 116 117 118 119]]]


In [5]:
# Import your files here

# Example: each position along the first two axes of the ndarray corresponds to a different (temperature, pressure) tuple. 
folder = "data/ndarrays/"
files = [fi for fi in os.listdir(folder) if fi.startswith("gas_example")]
ndarr = load_object(folder + files[0])
print("Loaded ndarray: ")
print(ndarr)

Loaded ndarray: 
[[[  0   1   2   3   4   5]
  [  6   7   8   9  10  11]
  [ 12  13  14  15  16  17]
  [ 18  19  20  21  22  23]
  [ 24  25  26  27  28  29]]

 [[ 30  31  32  33  34  35]
  [ 36  37  38  39  40  41]
  [ 42  43  44  45  46  47]
  [ 48  49  50  51  52  53]
  [ 54  55  56  57  58  59]]

 [[ 60  61  62  63  64  65]
  [ 66  67  68  69  70  71]
  [ 72  73  74  75  76  77]
  [ 78  79  80  81  82  83]
  [ 84  85  86  87  88  89]]

 [[ 90  91  92  93  94  95]
  [ 96  97  98  99 100 101]
  [102 103 104 105 106 107]
  [108 109 110 111 112 113]
  [114 115 116 117 118 119]]]


## Building the DataFrame
Now that the blocks or ndarrays have been imported, it is time to combine or reshape them into a DataFrame. 

### With the df_from_blocks method

In [6]:
# Prepare your block labels here

# Labels that we know from somewhere else (could be saved and imported; depends on what you have)
temperatures = ["{} C".format(t) for t in range(10, 50, 10)]
pressures = ["1 atm", "2 atm", "3 atm"]
# Tuples identifying each block
labels = tuple([(temperatures[i], pressures[i]) for i in range(3)])
axes_names = ["Temperature", "Pressure"]
print(labels)
# Observables
obs_labels = ['vx', 'vy', 'vz'] + ['x', 'y', 'z']

df = df_from_blocks(list_of_blocks, labels, obs_labels, axes_names)
print("DataFrame created with the df_from_blocks function: ")
print(df)

(('10 C', '1 atm'), ('20 C', '2 atm'), ('30 C', '3 atm'))
DataFrame created with the df_from_blocks function: 
                                   vx        vy        vz         x         y  \
Temperature Pressure Sample                                                     
10 C        1 atm    0       0.713163  0.814140  0.773218  0.540529  0.894961   
                     1       0.443060  0.811989  0.947800  0.483871  0.551495   
                     2       0.780947  0.135596  0.900519  0.441669  0.442373   
                     3       0.049410  0.521471  0.270749  0.607650  0.313742   
                     4       0.971075  0.750665  0.480289  0.804892  0.115178   
20 C        2 atm    0       0.022336  0.775362  0.506447  0.529736  0.751712   
                     1       0.877458  0.656000  0.263914  0.960558  0.157513   
                     2       0.735113  0.171312  0.695630  0.312976  0.116324   
                     3       0.702020  0.955387  0.056374  0.776194  0.107689 
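
For readers without `format_tools` at hand, a comparable layout can be built with plain pandas. This sketch assumes that `df_from_blocks` amounts to stacking the blocks along the rows and labelling each one with its tuple. It is not the actual implementation, and the name `df_sketch` is made up for the illustration.

```python
import numpy as np
import pandas as pd

# Fictitious blocks: 3 conditions, 5 samples each, 6 observables
rng = np.random.default_rng(0)
blocks = [rng.random((5, 6)) for _ in range(3)]
labels = [("10 C", "1 atm"), ("20 C", "2 atm"), ("30 C", "3 atm")]
obs_labels = ["vx", "vy", "vz", "x", "y", "z"]

# One small DataFrame per block; its default integer index numbers the samples
frames = [pd.DataFrame(block, columns=obs_labels) for block in blocks]

# keys adds one index level per tuple element, outside the sample index
df_sketch = pd.concat(
    frames, keys=labels, names=["Temperature", "Pressure", "Sample"]
)

print(df_sketch.index.names)
print(df_sketch.loc[("20 C", "2 atm")])  # the 5 samples of the second block
```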

### With the df_from_ndarray method

In [7]:
# Prepare your labels for each axis here

# Dictionaries keyed by axis number: the values are the axis names (axes_names_dict)
# or the list of parameter values at each index along that axis (param_labels)
axes_names_dict = {0:"Temperature", 1:"Pressure"}
param_labels = {
    0: ["10 C", "20 C", "30 C", "40 C"],
    1: ["{} atm".format(i) for i in range(5)]
}
# Observables
obs_labels = ['vx', 'vy', 'vz'] + ['x', 'y', 'z']
obs_axis = 2  # axis along which the observables are
# Here, compared to the example with blocks, there is no 
# "Sample" axis created: only one sample per condition
# If there are many samples for the same condition, 
# then one of the axes of the ndarray indexes the different samples

df2 = df_from_ndarray(ndarr, param_labels, obs_axis, obs_labels, axes_names_dict)
print("DataFrame created with the df_from_ndarray function: ")
print(df2)

DataFrame created with the df_from_ndarray function: 
Observables            vx   vy   vz    x    y    z
Temperature Pressure                              
10 C        0 atm       0    1    2    3    4    5
            1 atm       6    7    8    9   10   11
            2 atm      12   13   14   15   16   17
            3 atm      18   19   20   21   22   23
            4 atm      24   25   26   27   28   29
20 C        0 atm      30   31   32   33   34   35
            1 atm      36   37   38   39   40   41
            2 atm      42   43   44   45   46   47
            3 atm      48   49   50   51   52   53
            4 atm      54   55   56   57   58   59
30 C        0 atm      60   61   62   63   64   65
            1 atm      66   67   68   69   70   71
            2 atm      72   73   74   75   76   77
            3 atm      78   79   80   81   82   83
            4 atm      84   85   86   87   88   89
40 C        0 atm      90   91   92   93   94   95
            1 atm      96   
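
The same reshaping can be sketched with plain pandas, under the assumption that the observables lie along the last axis and every other axis is a condition. This is an illustration of the idea rather than the code of `df_from_ndarray`, and `df2_sketch` is a made-up name.

```python
import numpy as np
import pandas as pd

arr = np.arange(4 * 5 * 6).reshape(4, 5, 6)
temperatures = ["10 C", "20 C", "30 C", "40 C"]
pressures = ["{} atm".format(i) for i in range(5)]
obs_labels = ["vx", "vy", "vz", "x", "y", "z"]

# One row per (temperature, pressure) pair, in the same order as a C-order reshape
row_index = pd.MultiIndex.from_product(
    [temperatures, pressures], names=["Temperature", "Pressure"]
)
columns = pd.Index(obs_labels, name="Observables")

# Flatten every axis except the last one (the observables) into the rows
df2_sketch = pd.DataFrame(
    arr.reshape(-1, arr.shape[-1]), index=row_index, columns=columns
)

print(df2_sketch.loc[("20 C", "1 atm")])
```

If the observables are on another axis, `np.moveaxis(arr, obs_axis, -1)` can be applied first so that the reshape above still holds.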

## Adding extra indexing to your DataFrame
You might want to regroup some observables or some conditions under the same label. The function regroup_levels serves that purpose. 

In [8]:
# Regroup columns in the sample DataFrame made from blocks
# Regrouping each position coordinate with the matching velocity component, by spatial dimension
groups = {
    "X":['x', 'vx'], 
    "Y":['y', 'vy'], 
    "Z":['z', 'vz']
}
regrouped_df = regroup_levels(df, groups, level_group="Observables", axis=1, name="Dimension")
print(regrouped_df)
print("Columns are now a MultiIndex:")
print(regrouped_df.columns, "\n")


# Regroup rows in the sample DataFrame made from an ndarray. 
# If an inner level is regrouped, it is moved to be second to outermost, 
# and rows are sorted with respect to it. 
# Regrouping pressures by effect on a human (not accurate)
groups = {
    "burst":['0 atm'],
    "fine":["1 atm"], 
    "faint":["2 atm"], 
    "crush":['3 atm', '4 atm']
}
regrouped_df2 = regroup_levels(df2, groups, level_group="Pressure", axis=0, name="Effect")
print(regrouped_df2)

Dimension                           X                   Y                   Z  \
Observables                        vx         x        vy         y        vz   
Temperature Pressure Sample                                                     
10 C        1 atm    0       0.713163  0.540529  0.814140  0.894961  0.773218   
                     1       0.443060  0.483871  0.811989  0.551495  0.947800   
                     2       0.780947  0.441669  0.135596  0.442373  0.900519   
                     3       0.049410  0.607650  0.521471  0.313742  0.270749   
                     4       0.971075  0.804892  0.750665  0.115178  0.480289   
20 C        2 atm    0       0.022336  0.529736  0.775362  0.751712  0.506447   
                     1       0.877458  0.960558  0.656000  0.157513  0.263914   
                     2       0.735113  0.312976  0.171312  0.116324  0.695630   
                     3       0.702020  0.776194  0.955387  0.107689  0.056374   
                     4      
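
To show what adding a grouping level means for the columns, here is a pandas-only sketch that produces the same kind of column MultiIndex. It assumes that regrouping is equivalent to prepending a level built from the `groups` dictionary and then sorting. It does not reproduce `regroup_levels` itself, and `group_of` and `regrouped` are names invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(2, 6),
    columns=pd.Index(["vx", "vy", "vz", "x", "y", "z"], name="Observables"),
)
groups = {"X": ["x", "vx"], "Y": ["y", "vy"], "Z": ["z", "vz"]}

# Invert the mapping: observable -> name of the group it belongs to
group_of = {obs: grp for grp, members in groups.items() for obs in members}

# Prepend the group as a new outer level, then sort so groups are contiguous
regrouped = df.copy()
regrouped.columns = pd.MultiIndex.from_tuples(
    [(group_of[col], col) for col in df.columns],
    names=["Dimension", "Observables"],
)
regrouped = regrouped.sort_index(axis=1)

print(regrouped.columns.tolist())
print(regrouped["X"])  # selects the 'vx' and 'x' columns together
```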

## Save the final dataframe
Once you are done, save it! 

In [9]:
folder = "data/"
save_object(regrouped_df, folder + "gas_example_blocks_formatted.pkl")
save_object(regrouped_df2, folder + "gas_example_ndarray_formatted.pkl")

## Appendix: Useful pandas methods and functions
You will probably need to do some manipulations by yourself. Here are some useful pandas functions. 

(*Under construction*)

### For DataFrames
#### Creation, manipulation
- DataFrame.
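- DataFrame.to_sparse(): creates a sparse version, saves memory if the same value is repeated often. This method was removed in pandas 1.0; in recent versions, use DataFrame.astype with a pd.SparseDtype instead
- DataFrame.to_dense(): creates the dense version. Also removed in pandas 1.0; in recent versions, use DataFrame.sparse.to_dense()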
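
In pandas 1.0 and later, sparse storage goes through `pd.SparseDtype` and the `.sparse` accessor. The sketch below uses a made-up table that is mostly zeros to show the conversion in both directions.

```python
import numpy as np
import pandas as pd

# A mostly-zero table: 10 non-zero entries out of 3000
dense = pd.DataFrame(np.zeros((1000, 3)), columns=["a", "b", "c"])
dense.iloc[::100, 0] = 1.0

# Store only the values that differ from fill_value
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))
print(sparse.sparse.density)  # fraction of entries actually stored

# Convert back to ordinary dense columns
back = sparse.sparse.to_dense()
print(back.equals(dense))
```

The saving depends on how often the fill value occurs; for data with few repeated values, the sparse version can take more memory than the dense one.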

#### Indexing
- DataFrame.reindex : to rearrange the order of the values (keeps values associated with their original label, but in a new order). Can be used on both axes. 
- DataFrame.set_index: to use one or more existing columns (or arrays of matching length) as the new index, without changing the order of the rows
    - See this explanation of the difference between reindex and set_index: https://stackoverflow.com/questions/50741330/difference-between-df-reindex-and-df-set-index-methods-in-pandas
- DataFrame.reset_index: to move one or more indexing level(s) back into regular columns, or to remove them entirely with drop=True
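
A short sketch of how the three indexing methods listed above behave, on a made-up table of particle speeds:

```python
import pandas as pd

df = pd.DataFrame({"particle": ["a", "b", "c"], "speed": [1.0, 2.0, 3.0]})

# set_index: the 'particle' column becomes the index; row order is unchanged
indexed = df.set_index("particle")

# reindex: rows follow the requested labels; the unknown label 'd' gives NaN
reordered = indexed.reindex(["c", "a", "d"])

# reset_index: the index goes back to being an ordinary column
flat = indexed.reset_index()

print(indexed)
print(reordered)
print(flat)
```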


### For MultiIndexes