<a href="https://colab.research.google.com/github/amnahhebrahim/Data-Formats-/blob/main/Data_formats_and_Libraries_YAML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Popular Data Formats and How to Process Them! - Pt. 2

This tutorial is part of a mini video series that explores common data formats and gives a brief overview of the Python libraries that can be used to process and handle them. This part looks at the structured data formats listed below:

## ✅ YAML
YAML is a human-readable data serialization language. It is commonly used for configuration files and in applications where data is stored or transmitted.

YAML files use the ".yml" or ".yaml" extension. YAML is designed as a superset of JSON, so JSON files are generally valid YAML.
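
Because JSON text is generally valid YAML, a YAML parser can usually read it directly. The snippet below is a minimal sketch of that idea, assuming PyYAML is installed; the JSON string is a made-up example, and a few JSON edge cases may still be interpreted differently by a given YAML parser.

```python
import json

import yaml

# Hypothetical JSON document, used only for illustration.
json_text = '{"name": "Student Grades", "features": ["Name", "Subject"], "rows": 120}'

from_json = json.loads(json_text)      # parsed with the JSON parser
from_yaml = yaml.safe_load(json_text)  # the same text parsed as YAML

# Both parsers should produce the same Python dictionary.
print(from_json == from_yaml)
```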


#### Application


A good use of a YAML file is as a description file for a dataset, recording metadata such as the date of collection, author, source, and features.


In [1]:
!pip install pyYAML



In [4]:
import yaml

# Python dictionary holding the metadata of a hypothetical dataset:
metadata={"Datasheet":{"Name":"Student Grades","Source":"X Database", "Author": "John Doe", "Features":["Name", "Subject","Grades"]}}

# default_flow_style=False ensures that nested dictionaries and lists
# are written in block style, which is more readable than flow style.
with open('metadata.yaml', 'w') as file:
    yaml.dump(metadata, file, default_flow_style=False)
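
To see what block style and flow style look like, the two can be printed side by side. This is a small sketch, assuming PyYAML is installed; it uses a shortened version of the metadata dictionary and relies on `yaml.dump` returning a string when no file object is passed.

```python
import yaml

# Shortened version of the metadata dictionary, for illustration only.
metadata = {"Datasheet": {"Name": "Student Grades", "Features": ["Name", "Subject", "Grades"]}}

# With no stream argument, yaml.dump returns the YAML text as a string.
block_text = yaml.dump(metadata, default_flow_style=False)
flow_text = yaml.dump(metadata, default_flow_style=True)

print("Block style:")
print(block_text)
print("Flow style:")
print(flow_text)
```

Both strings describe the same data; only the layout differs, so loading either one gives back the original dictionary.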




In [5]:
#To read YAML files you can do the following:
with open("metadata.yaml", "r") as file:
  loaded_meta=yaml.safe_load(file)
print(loaded_meta)

{'Datasheet': {'Author': 'John Doe', 'Features': ['Name', 'Subject', 'Grades'], 'Name': 'Student Grades', 'Source': 'X Database'}}
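
The keys in the printed dictionary appear in alphabetical order rather than in the order in which they were defined, because `yaml.dump` sorts mapping keys by default. The sketch below shows one way to keep the original order, assuming PyYAML 5.1 or later, where the `sort_keys` argument is available.

```python
import yaml

# Shortened version of the metadata dictionary, for illustration only.
metadata = {"Datasheet": {"Name": "Student Grades", "Source": "X Database", "Author": "John Doe"}}

# Default behaviour: keys are written in alphabetical order.
sorted_text = yaml.dump(metadata, default_flow_style=False)

# sort_keys=False keeps the insertion order of the dictionary.
ordered_text = yaml.dump(metadata, default_flow_style=False, sort_keys=False)

print(sorted_text)
print(ordered_text)
```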


In [7]:
loaded_meta["Datasheet"]["Author"]="Smith John"

In [10]:
loaded_meta["Datasheet"]["Features"].append("semester")

In [11]:
loaded_meta

{'Datasheet': {'Author': 'Smith John',
  'Features': ['Name', 'Subject', 'Grades', 'semester'],
  'Name': 'Student Grades',
  'Source': 'X Database'}}

In [12]:
with open("updated_metadata.yaml", "w") as file:
  yaml.dump(loaded_meta, file, default_flow_style=False)



## ✅ Hierarchical Data Format (HDF) 5
The Hierarchical Data Format version 5 (HDF5) was designed to store and organize large amounts of data. It is well suited to storing large numerical arrays of homogeneous type, and to data models that can be organized hierarchically and benefit from tagging datasets with arbitrary metadata. Common file extensions include ".h5" and ".hdf5".




### Creating an HDF5 dataset:

In [42]:
import h5py
# create a file to store the data:
f = h5py.File("dataset.hdf5","w")



#### File modes for opening an HDF5 file in Python:

 f = h5py.File("name.hdf5", "w")   # New file overwriting any existing file

 f = h5py.File("name.hdf5", "r")   # Open read-only (must exist)

 f = h5py.File("name.hdf5", "r+")  # Open read-write (must exist)

 f = h5py.File("name.hdf5", "a")   # Open read-write (create if doesn't exist)
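
The modes listed above can be tried out safely in a temporary directory. The following is a minimal sketch rather than a recommended workflow: it opens the file in a `with` block so that it is closed automatically, and it also uses the `"x"` mode (create, fail if the file exists), which is not in the list above. The file and group names are hypothetical.

```python
import os
import tempfile

import h5py

# Work in a temporary directory so no existing files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "name.hdf5")  # hypothetical file name

    # "w" creates the file; the with-block closes it even if an error occurs.
    with h5py.File(path, "w") as f:
        f.create_group("example")

    # "x" refuses to overwrite a file that already exists.
    try:
        with h5py.File(path, "x"):
            pass
        overwrite_blocked = False
    except OSError:
        overwrite_blocked = True

    # "r" opens read-only; the group written above is still there.
    with h5py.File(path, "r") as f:
        group_names = list(f.keys())

print(overwrite_blocked, group_names)
```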

In [43]:
# The File object we created is itself a group, in this case the root group, named /:
f.name

'/'

In [44]:
# Create groups within the file:

dataset=f.create_group("Dataset")
grp_st1 =dataset.create_group("Station1")
grp_st2 =dataset.create_group("Station2")


In [46]:
grp_st1.name

'/Dataset/Station1'

In [47]:
dset1_st1=grp_st1.create_dataset("wind", 10, dtype=int)
dset2_st1=grp_st1.create_dataset("humidity", 10, dtype=int)
dset3_st1=grp_st1.create_dataset("temperature", 10, dtype=int)


In [48]:
dset1_st2=grp_st2.create_dataset("wind", 10, dtype=int)
dset2_st2=grp_st2.create_dataset("humidity", 10, dtype=int)
dset3_st2=grp_st2.create_dataset("temperature", 10, dtype=int)

In [50]:
# Print the path of the temperature dataset (dset3_st1) for station 1:
dset3_st1.name

'/Dataset/Station1/temperature'
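
Since every group and dataset has a path, objects can also be looked up by path, much like files in a directory tree. The sketch below illustrates this on a small temporary file; the file name is hypothetical and the layout only mimics the station structure used in this tutorial.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.hdf5")  # hypothetical file name
    with h5py.File(path, "w") as f:
        # Intermediate groups in a path are created automatically.
        f.create_dataset("Dataset/Station1/temperature", shape=(10,), dtype="i8")

        station = f["Dataset/Station1"]          # look up a group by path
        child_names = list(station.keys())       # members of that group
        has_wind = "wind" in station             # membership test by name
        full_path = station["temperature"].name  # absolute path of the dataset

print(child_names, has_wind, full_path)
```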

In [51]:
# Assign the attributes for each group:
grp_st1.attrs['start_time'] = 80920251300
grp_st1.attrs['interval_min'] = 30
grp_st2.attrs['start_time'] = 80920251300
grp_st2.attrs['interval_min'] = 30


In [56]:
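# Assign the values for each array for station 1.
# Slice assignment writes into the existing HDF5 datasets; a plain
# "dset1_st1 = ..." would only rebind the Python name to a NumPy array.
import numpy as np
dset1_st1[:] = np.arange(2, 12, 1)
dset2_st1[:] = np.arange(10)
dset3_st1[:] = np.arange(3, 13, 1)
# Assign the values for station 2:
dset1_st2[:] = np.arange(2, 12, 1)
dset2_st2[:] = np.arange(10)
dset3_st2[:] = np.arange(3, 13, 1)


In [57]:
# Read the values back from the HDF5 dataset:
dset1_st2[:]

array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11])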
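
When filling an HDF5 dataset, it matters whether the values are written into the dataset or the variable is simply reassigned. The following is a minimal sketch of the difference, using a temporary file and hypothetical dataset names: only the dataset written with slice assignment ends up with the new values on disk.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stations.hdf5")  # hypothetical file name

    with h5py.File(path, "w") as f:
        wind = f.create_dataset("wind", shape=(10,), dtype="i8")
        humidity = f.create_dataset("humidity", shape=(10,), dtype="i8")

        wind[...] = np.arange(2, 12)  # writes the values into the file
        humidity = np.arange(10)      # only rebinds the name; the file is untouched

    # Reopen the file and read back what was actually stored.
    with h5py.File(path, "r") as f:
        wind_on_disk = f["wind"][:]
        humidity_on_disk = f["humidity"][:]

print(wind_on_disk)
print(humidity_on_disk)
```

The second dataset still holds its default fill value, which is why checking the contents after reopening the file is a useful habit.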

In [58]:
# Let HDF5 flush its buffers and actually write to disk:
f.flush()
#close the file we just created and wrote to
f.close()

In [59]:
!ls -lh

total 12K
-rw-r--r-- 1 root root 6.1K Sep 10 13:55 dataset.hdf5
drwxr-xr-x 1 root root 4.0K Sep  8 13:41 sample_data


### Reading an HDF5 dataset:

Dataset: openPMD

Description: This dataset contains information about different particles, such as electrons, including their charge, position, momentum, etc.


In [60]:
!git clone https://github.com/openPMD/openPMD-example-datasets.git


Cloning into 'openPMD-example-datasets'...
remote: Enumerating objects: 132, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 132 (delta 14), reused 9 (delta 6), pack-reused 108 (from 1)
Receiving objects: 100% (132/132), 323.61 MiB | 39.73 MiB/s, done.
Resolving deltas: 100% (48/48), done.


In [62]:
files = h5py.File("/content/openPMD-example-datasets/example-femm-3d.h5", "r")

In [63]:
# Print the hierarchy of the HDF5 file:
files.visit(print)


data
data/1
data/1/meshes
data/1/meshes/B
data/1/meshes/B/x
data/1/meshes/B/y
data/1/meshes/B/z
data/1/meshes/E
data/1/meshes/E/x
data/1/meshes/E/y
data/1/meshes/E/z
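
`visit` only passes the name of each object to the callback. When the shape and type of each dataset are also of interest, `visititems` passes both the name and the object. The sketch below tries this on a tiny temporary file whose layout imitates part of the hierarchy above; the file name, array contents, and helper function names are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

dataset_paths = []

def describe(name, obj):
    """Print one line per object; returning None keeps the traversal going."""
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: dataset, shape={obj.shape}, dtype={obj.dtype}")
        dataset_paths.append(name)
    else:
        print(f"{name}: group")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny.h5")  # hypothetical file name

    # Build a small file with two datasets nested inside groups.
    with h5py.File(path, "w") as f:
        f.create_dataset("data/1/meshes/B/x", data=np.zeros((4, 4, 4)))
        f.create_dataset("data/1/meshes/B/y", data=np.zeros((4, 4, 4)))

    # Walk the whole hierarchy, calling describe() on every group and dataset.
    with h5py.File(path, "r") as f:
        f.visititems(describe)
```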


In [64]:
dataset_1=files["data/1/meshes/B/x"]
dataset_1

<HDF5 dataset "x": shape (47, 47, 47), type "<f8">

In [65]:
# Read the full dataset into a NumPy array:
x_array = dataset_1[:]
print(x_array)

[[[ 0.00173245  0.00214297  0.00278616 ... -0.00284619 -0.00213305
   -0.00169444]
  [ 0.0018205   0.0022785   0.00303151 ... -0.00306161 -0.00227971
   -0.00179205]
  [ 0.00190167  0.00240754  0.00326678 ... -0.00328131 -0.00239403
   -0.00189035]
  ...
  [ 0.00190167  0.00240754  0.00326678 ... -0.00328131 -0.00239403
   -0.00189035]
  [ 0.0018205   0.0022785   0.00303151 ... -0.00306161 -0.00227971
   -0.00179205]
  [ 0.00173245  0.00214297  0.00278616 ... -0.00284619 -0.00213305
   -0.00169444]]

 [[ 0.00174135  0.00217944  0.00289971 ... -0.0029285  -0.0021806
   -0.00171413]
  [ 0.00182272  0.00230881  0.00313563 ... -0.00314876 -0.0022952
   -0.00181201]
  [ 0.00190563  0.00242746  0.00330579 ... -0.00336784 -0.00241244
   -0.00189772]
  ...
  [ 0.00190563  0.00242746  0.00330579 ... -0.00336784 -0.00241244
   -0.00189772]
  [ 0.00182272  0.00230881  0.00313563 ... -0.00314876 -0.0022952
   -0.00181201]
  [ 0.00174135  0.00217944  0.00289971 ... -0.0029285  -0.0021806
   -0.0017... (output truncated)
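
Reading a whole dataset with `[:]` loads every value into memory. For large files it is often enough to read only a slice, since h5py datasets support NumPy-style indexing. The following is a sketch on a synthetic array with the same shape as the dataset above; the file name and values are invented for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "slices.h5")  # hypothetical file name

    # Synthetic data: consecutive numbers arranged in a 47 x 47 x 47 cube.
    values = np.arange(47 * 47 * 47, dtype="f8").reshape(47, 47, 47)
    with h5py.File(path, "w") as f:
        f.create_dataset("B/x", data=values)

    with h5py.File(path, "r") as f:
        dset = f["B/x"]
        plane = dset[0, :, :]      # one 2-D plane of the cube
        corner = dset[:2, :3, :4]  # a small sub-block
        single = dset[1, 2, 3]     # a single value

print(plane.shape, corner.shape, single)
```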

In [66]:
# Obtain the attributes of one dataset:
attributes_dset1 = dataset_1.attrs
for key, value in attributes_dset1.items():
    print(key, value)


position [0. 0. 0.]
unitSI 1.0


In [68]:
# Obtain the attributes of the group that holds the B field:
B = files["data/1/meshes/B"]
b_attrs = B.attrs

for key, value in b_attrs.items():
    print(key, value)

axisLabels [b'x' b'y' b'z']
dataOrder b'C'
geometry b'cartesian'
gridGlobalOffset [-1.15  -1.15  -0.375]
gridSpacing [0.05  0.05  0.125]
gridUnitSI 1.0
timeOffset 0.0
unitDimension [ 0.  1. -2. -1.  0.  0.  0.]
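
Some of the attribute values above are byte strings (for example `b'cartesian'`), which is how h5py can return fixed-length string attributes. If ordinary Python strings are preferred, they can be decoded. The helper below is a hypothetical convenience function, not part of h5py, and the sample values are copied from the output above.

```python
import numpy as np

def to_text(value):
    """Return byte strings (scalars or arrays) as str; pass other values through."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, np.ndarray) and value.dtype.kind == "S":
        return [item.decode("utf-8") for item in value.tolist()]
    return value

# A few attribute values in the form shown in the output above.
raw_attrs = {
    "axisLabels": np.array([b"x", b"y", b"z"]),
    "geometry": np.bytes_(b"cartesian"),
    "gridUnitSI": 1.0,
}

clean_attrs = {key: to_text(value) for key, value in raw_attrs.items()}
print(clean_attrs)
```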
