# Data Condensing

As an experimentalist, it is important to keep data portable. If all the data needed for a manuscript can travel with us (for example via a cloud storage service such as Google Drive), we can work anywhere, at any time, as long as an internet connection is available. More importantly, once a paper is published, portable data are much easier to share with other researchers, which makes the data more valuable.

The nature of my current research, which relies heavily on large videos, makes it hard to keep all the data portable. However, the raw videos are never necessary. By analyzing the videos (for example with PIV and PTV), we can extract most of the essential data, which take far less storage space and fit easily on a cloud drive.

Currently, I save my data on a local hard drive. All the data are organized in folders with the following structure:

- Date
    - raw images (one subfolder per video, numbered from 0 to the total number of videos in the day)
        - 01
        - 02
        - ...
    - data from analysis
        - piv_imseq (PIV)
        - df2_kinetics (density fluctuations analysis - kinetics)
        - ...

This cleanly separates raw data from analysis data. Ideally, the analysis data should be the portable part. However, this is not very feasible, because the total size of the analysis data is still large (for example, the folder 08032020 holds 14.3 GB of analysis data at the time of writing). I realize that visualization does not use all of these data. For example, I don't need the flow field at every frame, only a few frames for illustration. Other PIV flow fields may only be used to compute the evolution of energy and flow order, which condenses the detailed flow field of each frame into a single number.
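To see which analysis folders dominate the total, the size of each folder can be measured directly. The following is a minimal sketch using only the standard library; the function name `folder_size_bytes` and the demo file names are placeholders, not part of the existing code base.

```python
import os
import tempfile

def folder_size_bytes(folder):
    """Return the total size in bytes of all files under folder (recursive)."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Demo on a throwaway folder that mimics the layout above (names are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    piv_dir = os.path.join(tmp, "piv_imseq", "00")
    os.makedirs(piv_dir)
    with open(os.path.join(piv_dir, "0000-0001.csv"), "wb") as f:
        f.write(b"x" * 1000)
    with open(os.path.join(tmp, "summary.csv"), "wb") as f:
        f.write(b"y" * 24)
    demo_size = folder_size_bytes(tmp)

print("{:.3f} kB".format(demo_size / 1e3))
```

Calling the same function on each analysis subfolder of a date folder would show which analyses are worth condensing first.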

For important data, I always write code that summarizes each day of experiments in a "summary.csv" file. Here, I write a function that copies all the "summary.csv" files to the research project folder on my Google Drive.

## 0 Packages and presets

In [2]:
import shutil
import os
import numpy as np   # used below as np
import pandas as pd  # used below as pd
from myImageLib import dirrec
from corrLib import *

## 1 Copy essential data files to a master folder

In [2]:
def copy_summary(src_folder, dest_folder, sub_folders, file_list=None):
    """
    Copy summary files to another folder (mainly for cloud drive storage),
    keeping the folder structure relative to src_folder.
    
    Args:
    src_folder -- source folder
    dest_folder -- destination folder
    sub_folders -- subfolders under the source folder from which data are copied
    file_list -- names of the files to copy, default ['summary.csv']
    
    Returns:
    None
    """
    if file_list is None: # avoid a mutable default argument
        file_list = ['summary.csv']
    for sf in sub_folders:
        src = os.path.join(src_folder, sf)
        for file in file_list:
            # dirrec returns the paths of all matching files found under src
            for src_file in dirrec(src, file):
                # rebuild the same relative path under dest_folder
                rel_path = os.path.relpath(src_file, src_folder)
                dest_file = os.path.join(dest_folder, rel_path)
                os.makedirs(os.path.dirname(dest_file), exist_ok=True)
                shutil.copyfile(src_file, dest_file)
                print('Copy file ' + os.sep + rel_path)

In [5]:
# test copy_summary
src_folder = r'E:\moreData'
dest_folder = r'E:\Google Drive\Research projects\DF\data\level-2-data'
sub_folders = ['08032020', '08042020', '08052020', '08062020']
file_list = ['cv-summary.csv'] # 'summary.csv', 'kinetics_data.csv', 'intensity.csv', 'energy_order.csv'
copy_summary(src_folder, dest_folder, sub_folders, file_list=file_list)

Copy file \08032020\cav_imseq\cv-summary.csv
Copy file \08042020\cav_imseq\cv-summary.csv
Copy file \08052020\cav_imseq\cv-summary.csv
Copy file \08062020\cav_imseq\cv-summary.csv


## 2 Slimming PIV data

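Currently, PIV data are saved as text files (.csv) with four columns (x, y, u, v). This format takes at least twice as much space as necessary, because the same x and y coordinates are repeated in every file and all numbers are stored as text. For example, a PIV data file on a 42x50 grid takes up 105 kB. As a result, a 3600-frame video, which yields about 1800 PIV files, generates roughly 180 MB of data.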
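The overhead of the text format can be checked on synthetic data. The sketch below writes one 42x50 frame both as "x,y,u,v" text and as a binary array that holds only u and v, then compares the sizes. The velocities are random numbers and the 25-pixel grid spacing is an assumption made for illustration, so the exact byte counts will differ from real PIV files.

```python
import io
import numpy as np

rows, cols = 42, 50                      # grid size quoted in the text
rng = np.random.default_rng(0)
u = rng.random((rows, cols))             # synthetic velocities, not real PIV data
v = rng.random((rows, cols))
y, x = np.mgrid[0:rows, 0:cols] * 25     # hypothetical 25-pixel grid spacing

# Text version: one "x,y,u,v" line per grid point, as in the .csv files
lines = ["x,y,u,v"]
for xi, yi, ui, vi in zip(x.ravel(), y.ravel(), u.ravel(), v.ravel()):
    lines.append("{},{},{},{}".format(int(xi), int(yi), float(ui), float(vi)))
csv_bytes = len("\n".join(lines).encode("ascii"))

# Binary version: only u and v, stacked into one (2, rows, cols) float64 array
buf = io.BytesIO()
np.save(buf, np.stack([u, v], axis=0))
npy_bytes = len(buf.getvalue())

print(csv_bytes, npy_bytes, csv_bytes / npy_bytes)
```

The binary payload is fixed at 2 x 42 x 50 x 8 = 33,600 bytes plus a small header, while the text size depends on how many digits are written per number.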

In [26]:
np.save(r'E:\moreData\test.npy', np.random.rand(7560, 1000))

An array of 1,000,000 64-bit double-precision floats produces an .npy file of 7.62 MB, which confirms that each float takes 64 bits, i.e. 8 bytes. A PIV dataset with 1800x42x50x2 = 7,560,000 floats is therefore around 60 MB, 3 times smaller than the original text files (.csv).

Although the size of a PIV dataset can be reduced significantly in this way, making all the PIV data portable is still challenging. With 100 videos, the PIV data would take 6 GB, which is impractical at current download speeds.
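The size estimates above follow from simple arithmetic at 8 bytes per float. A short sketch of the calculation, where the helper name `piv_npy_bytes` is hypothetical and the small .npy header is ignored:

```python
BYTES_PER_FLOAT64 = 8

def piv_npy_bytes(n_frames, rows, cols, components=2):
    """Raw payload of a float64 (n_frames, components, rows, cols) array, in bytes."""
    return n_frames * components * rows * cols * BYTES_PER_FLOAT64

test_bytes = 1_000_000 * BYTES_PER_FLOAT64     # the 1,000,000-float check
one_video = piv_npy_bytes(1800, 42, 50)        # 1800 PIV frames on a 42x50 grid
hundred_videos = 100 * one_video

print(test_bytes / 2**20)      # size in MiB (1 MiB = 1,048,576 bytes)
print(one_video / 1e6)         # size in MB (1 MB = 1,000,000 bytes)
print(hundred_videos / 1e9)    # size in GB
```

The three values correspond to the roughly 7.6 MB test file, the roughly 60 MB dataset for one video, and the roughly 6 GB total for 100 videos.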

To convert text files to binary files, the first task is to give the data a strict structure. For PIV data, besides the velocity information, I need to store the coordinate information (x, y). Other parameters, such as window size and fps, can be put in log files. The plan is to store x and y as a (2, m, n) array, where m and n are the numbers of rows and columns in the PIV data, and to store the velocities (u, v) of each frame in the same (2, m, n) layout.
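As an illustration of the (2, m, n) coordinate layout, the sketch below builds synthetic flat x and y columns, reshapes them into grids, stacks them, and round-trips the result through the .npy format. The 25-pixel spacing and the in-memory buffer (standing in for a file on disk) are assumptions made for the example.

```python
import io
import numpy as np

m, n = 42, 50                                  # rows and columns of the PIV grid
# Synthetic flat columns, ordered row by row as in a PIV table (hypothetical spacing of 25)
yy, xx = np.mgrid[0:m, 0:n] * 25
x_col, y_col = xx.ravel(), yy.ravel()

# Recover the grid shape from the number of distinct coordinates
rows = len(np.unique(y_col))
cols = len(np.unique(x_col))

# Stack x and y into a single (2, m, n) array: index 0 is x, index 1 is y
xy = np.stack([x_col.reshape(rows, cols), y_col.reshape(rows, cols)], axis=0)

# Round trip through the .npy format (an in-memory buffer stands in for a file)
buf = io.BytesIO()
np.save(buf, xy)
buf.seek(0)
xy_loaded = np.load(buf)
print(xy_loaded.shape)
```

Because the grid is the same for every frame of a video, the coordinate array only needs to be saved once per video.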

I will first convert the 08032020\00 data to npy files and save them in the folder "piv_slim".

In [38]:
save_folder = r'E:\moreData\08032020\piv_slim\00'
os.makedirs(save_folder, exist_ok=True)
l = readdata(r'E:\moreData\08032020\piv_imseq\00', 'csv')
save_frame = 100 # write one .npy file each time this many frames have passed
# read the grid shape and coordinates from the first PIV file
pivData = pd.read_csv(l.iloc[0].Dir)
frame0 = int(l.iloc[0].Name.split('-')[0])
row = len(pivData.y.drop_duplicates())
col = len(pivData.x.drop_duplicates())
X = np.array(pivData['x']).reshape((row, col)) # coordinate grids, not written to disk in this cell
Y = np.array(pivData['y']).reshape((row, col))
v_list = []
for num, (_, i) in enumerate(l.iterrows()):
    pivData = pd.read_csv(i.Dir)
    frame = int(i.Name.split('-')[0])
    U = np.array(pivData['u']).reshape((row, col))
    V = np.array(pivData['v']).reshape((row, col))
    v_list.append(np.stack([U, V], axis=0)) # one frame, shape (2, row, col)
    if frame - save_frame >= frame0 or num == len(l)-1: # every save_frame frames, or at the last file
        print("Save frame {0:04d}-{1:04d}".format(frame0, frame))
        v_stack = np.stack(v_list, axis=0) # shape (number of frames, 2, row, col)
        np.save(os.path.join(save_folder, "{0:04d}-{1:04d}".format(frame0, frame)), v_stack)
        frame0 = frame
        v_list = []

Save frame 0000-0100
Save frame 0100-0200
Save frame 0200-0300
Save frame 0300-0400
Save frame 0400-0500
Save frame 0500-0600
Save frame 0600-0700
Save frame 0700-0800
Save frame 0800-0900
Save frame 0900-1000
Save frame 1000-1100
Save frame 1100-1200
Save frame 1200-1300
Save frame 1300-1400
Save frame 1400-1500
Save frame 1500-1600
Save frame 1600-1700
Save frame 1700-1800
Save frame 1800-1900
Save frame 1900-2000
Save frame 2000-2100
Save frame 2100-2200
Save frame 2200-2300
Save frame 2300-2400
Save frame 2400-2500
Save frame 2500-2600
Save frame 2600-2700
Save frame 2700-2800
Save frame 2800-2900
Save frame 2900-3000
Save frame 3000-3100
Save frame 3100-3200
Save frame 3200-3300
Save frame 3300-3400
Save frame 3400-3500
Save frame 3500-3598



In [36]:
frame0

3500
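The chunk files written by the conversion cell can be read back and joined into one array when a full time series is needed. A minimal sketch, assuming each chunk has shape (frames, 2, row, col) and that the zero-padded file names sort in time order; `load_piv_chunks` is a hypothetical helper and the demo arrays are small placeholders, not real data.

```python
import glob
import os
import tempfile
import numpy as np

def load_piv_chunks(folder):
    """Load every .npy chunk in folder (sorted by name) and join along the frame axis."""
    files = sorted(glob.glob(os.path.join(folder, "*.npy")))
    return np.concatenate([np.load(f) for f in files], axis=0)

# Demo with synthetic chunks; np.save appends the .npy extension automatically
with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, "0000-0100"), np.zeros((51, 2, 4, 5)))
    np.save(os.path.join(tmp, "0100-0200"), np.ones((50, 2, 4, 5)))
    velocity = load_piv_chunks(tmp)

print(velocity.shape)
```

Loading only the chunk that contains a frame of interest keeps memory use low when a single flow field is enough for illustration.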