# Read in Microstructure Data

We want to prepare the microstructure data so that it can be read in parallel by an ML pipeline. There are 121 files containing a total of 9000 microstructures; some files hold 100 microstructures and others 50. Each microstructure is a 51x51x51 grid of binary values. The input files are in `.mat` format, and we write the output to `.zarr`, a chunked format that supports parallel reads.

In [1]:
import scipy.io
import os
import numpy as np
import pandas 
import h5py
import dask.dataframe as dd
import glob
from dask import delayed
import dask

The following reads and writes the x data. We first read each data file with `h5py` in the `read_xdata` function (`read_ydata`, which uses `scipy.io.loadmat`, is defined alongside it but is not used here). Each file's data is stored in a Pandas dataframe. The dataframes are then combined into a single Dask dataframe, `dask_df`, using Dask delayed.

In [5]:
def get_files(path_):
    return sorted(glob.glob(path_))

@delayed
def read_ydata(file_):
    data = scipy.io.loadmat(file_)
    return pandas.DataFrame(data['Ceff'])

@delayed
def read_xdata(file_):
    # Open read-only and close the file once the data has been copied into memory.
    with h5py.File(file_, 'r') as f:
        return pandas.DataFrame(np.array(f['M']))


def get_dfs(path_):
    return list(map(read_xdata, get_files(path_)))

# yfile_path = '/home/berkay/Desktop/Graspi/50_contrast_elastic_51_51_51_homogenization/effective_C11/*.mat'
path_ = '/home/berkay/Desktop/Graspi/50_contrast_elastic_51_51_51_homogenization/microstructures/*.mat'

dfs = get_dfs(path_)
dask_df = dd.from_delayed(dfs, meta=dfs[0].compute())

(Delayed('int-6a2793af-695e-4c5f-a201-49ad384d39b7'), 132651)

After creating a single Dask dataframe, we convert it to a Dask array, since an array is easier to rechunk and save. In short, it is easier to read the data in as a dataframe and write it out as an array so that the chunks are all the same size. Zarr, Parquet, etc. prefer equally sized chunks. The `to_dask_array` and `rechunk` steps are quite slow, but they don't use much memory. Although `dask_arr` is 9.44 GB, it is never loaded into memory as a whole.
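The sizes quoted here can be checked by hand. The sketch below assumes the shape `(8900, 132651)` and the `float64` dtype reported in the output of the next cell, and a chunk of 100 rows as passed to `rechunk`.

```python
# Each microstructure is a 51 x 51 x 51 grid; flattened, it becomes one row.
side = 51
voxels_per_sample = side ** 3

n_samples = 8900      # rows reported for dask_arr
bytes_per_value = 8   # float64

total_bytes = n_samples * voxels_per_sample * bytes_per_value
chunk_bytes = 100 * voxels_per_sample * bytes_per_value  # one 100-row chunk

print(voxels_per_sample)              # 132651
print(f"{total_bytes / 1e9:.2f} GB")  # 9.44 GB
print(f"{chunk_bytes / 1e6:.2f} MB")  # 106.12 MB
```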

In [7]:
dask_arr = dask_df.to_dask_array(lengths=True).rechunk((100, -1))
dask_arr

| | Array | Chunk |
|---|---|---|
| Bytes | 9.44 GB | 106.12 MB |
| Shape | (8900, 132651) | (100, 132651) |
| Count | 506 Tasks | 89 Chunks |
| Type | float64 | numpy.ndarray |


Finally, we can save the data.

In [8]:
dask_arr.to_zarr('output.zarr', overwrite=True)