# Chapter 11: Reading and Writing Data
by [Arief Rahman Hakim](https://github.com/ahman24)

In [1]:
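# Path to an 'output' folder inside the current working directory
import os

output_dir = os.path.join(os.getcwd(), 'output')
os.makedirs(output_dir, exist_ok=True)  # create the folder if it does not exist yet
print(output_dir)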

d:\Python\Numerical Methods\numerical_methods\output


## 1. TXT files
To work with text files, we use the built-in `open` function, which returns a file object. It is commonly called with two arguments:

> f = open(filename, mode)  

The modes are:
* `r`: the default mode; opens an existing file for reading.
* `w`: opens a file for writing. If the file does not exist, it is created; if it does exist, its contents are discarded.
* `a`: opens a file in append mode, so new data is written at the end of the file. If the file does not exist, it is created.
* `b`: binary mode; it is combined with one of the other modes, for example `rb` or `wb`.
* `r+`: opens an existing file (does not create one) for reading and writing.
* `w+`: opens or creates a file for writing and reading, discarding any existing contents.
* `a+`: opens or creates a file for reading and appending; writes go to the end of the file.
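
To make the difference between the modes concrete, here is a minimal sketch that writes to a scratch file with `w`, `a`, and `r+`, then truncates it with `w` again. The file name `modes_demo.txt` and the temporary directory are assumptions made for this illustration only; they are not part of the chapter's `output` folder.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (illustration only)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'modes_demo.txt')

    # 'w' creates the file (or truncates an existing one)
    with open(demo_path, 'w') as f:
        f.write('first\n')

    # 'a' keeps the existing content and writes at the end
    with open(demo_path, 'a') as f:
        f.write('second\n')

    # 'r+' reads and writes an existing file without truncating it
    with open(demo_path, 'r+') as f:
        lines_before = f.readlines()  # reading to the end leaves the position at EOF
        f.write('third\n')            # so this write lands after the existing lines

    with open(demo_path, 'r') as f:
        after_append = f.read()

    # Opening with 'w' again discards everything written so far
    with open(demo_path, 'w') as f:
        f.write('fresh\n')
    with open(demo_path, 'r') as f:
        after_truncate = f.read()

print(lines_before)
print(repr(after_append))
print(repr(after_truncate))
```

The last two prints show why `w` should be used with care on files that already hold data: the earlier three lines are gone after the file is reopened in that mode.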

There are two ways to write a text file:
* the 'typical' approach, with explicit `open` and `close` calls
* a context manager

Let's see how they compare,

In [2]:
# Define the intended path to the file
name_file = 'Ch11_txt_typical_approach.txt'
path_file = os.path.join(output_dir, name_file)

# Typical approach
f = open(path_file, 'w')
for i in range(5):
    f.write(f'This is line {i}\n')
f.close()

In [3]:
# Define the intended path to the file
name_file = 'Ch11_txt_context_manager.txt'
path_file = os.path.join(output_dir, name_file)

with open(path_file, 'w') as f:
    for i in range(5):
        f.write(f"This is line {i}\n")

Notice that the `with` context manager does not require us to close the file manually. The file is closed automatically when the block ends, even if an error is raised inside it. The name `f` still exists afterwards, but it refers to a closed file. Whenever possible, use a context manager to open files for reading or writing.
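
As a small check of this behaviour, the sketch below inspects the `closed` attribute of the file object after a normal `with` block and after one that is interrupted by an error. The scratch file name and the simulated `RuntimeError` are assumptions introduced for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical scratch file (illustration only)
    demo_path = os.path.join(tmp_dir, 'closed_demo.txt')

    # Normal exit from the with block
    with open(demo_path, 'w') as f:
        f.write('hello\n')
    closed_after_with = f.closed

    # Exit caused by an exception raised inside the block
    try:
        with open(demo_path, 'a') as g:
            g.write('partial\n')
            raise RuntimeError('simulated failure')
    except RuntimeError:
        pass
    closed_after_error = g.closed

print(closed_after_with, closed_after_error)
```

In both cases the file object reports that it is closed, which is what the manual `f.close()` call would otherwise have to guarantee.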

Sometimes all we need is to append a line to an existing file. We can do that as follows,

In [4]:
with open(path_file, 'a') as f:
    f.write("This is another line\n")

We could read the file as follows,

In [5]:
with open(path_file, 'r') as f:
    content = f.read()

print(content)

This is line 0
This is line 1
This is line 2
This is line 3
This is line 4
This is another line



Since we will most likely be working with `numpy` arrays, we can also save an array directly with NumPy's built-in `np.savetxt` function, as follows,

In [6]:
import numpy as np
arr = np.array([[1.20, 2.20, 3.00], [4.14, 5.65, 6.42]])
arr

array([[1.2 , 2.2 , 3.  ],
       [4.14, 5.65, 6.42]])

In [7]:
np.savetxt(path_file, arr, fmt='%.2f', header = 'Col1 Col2 Col3')

In [8]:
with open(path_file, 'r') as f:
    content = f.read()

print(content)

# Col1 Col2 Col3
1.20 2.20 3.00
4.14 5.65 6.42



Above, `fmt='%.2f'` saves every value with **two decimal places**, so any further digits are rounded away in the file.  
Let's load the file again with `numpy`,

In [9]:
my_arr = np.loadtxt(path_file)
my_arr

array([[1.2 , 2.2 , 3.  ],
       [4.14, 5.65, 6.42]])
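
The format string decides how much of each number survives the round trip through a text file. The following sketch, using made-up values and scratch file names, compares `fmt='%.2f'` with the default format of `np.savetxt` (`'%.18e'`).

```python
import os
import tempfile

import numpy as np

# Made-up values with more than two decimal places (illustration only)
values = np.array([[1.23456789, 2.5], [3.14159265, 4.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    two_dec_path = os.path.join(tmp_dir, 'two_decimals.txt')
    full_path = os.path.join(tmp_dir, 'full_precision.txt')

    np.savetxt(two_dec_path, values, fmt='%.2f')  # rounds to two decimals
    np.savetxt(full_path, values)                 # default fmt is '%.18e'

    rounded = np.loadtxt(two_dec_path)
    restored = np.loadtxt(full_path)

print(np.array_equal(rounded, values))   # are the rounded values identical to the originals?
print(np.array_equal(restored, values))  # are the full-precision values identical?
print(rounded)
```

A short format keeps files small and readable, while the default format is the safer choice when the data will be used in further computation.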

## 2. CSV files
Much scientific data is stored in the **comma-separated values (CSV)** file format, a delimited text file that uses commas to separate values. It is a very useful format that can store large tables of data (numbers and text) in plain text.

Let's see how it works,

In [10]:
# Define the intended path to the file
name_file = 'Ch11_csv.csv'
path_file = os.path.join(output_dir, name_file)

data = np.random.random((100,5))
np.savetxt(path_file, data, fmt = '%.2f', delimiter=',', header = 'c1, c2, c3, c4, c5')

Let's load the array with `numpy` again,

In [11]:
my_csv = np.loadtxt(path_file, delimiter=',')
my_csv[:5, :]

array([[0.55, 0.22, 0.12, 0.35, 0.44],
       [0.53, 0.79, 0.89, 0.63, 0.87],
       [0.27, 0.14, 0.02, 0.36, 0.66],
       [0.51, 0.35, 0.9 , 0.34, 0.85],
       [0.53, 0.05, 0.76, 0.9 , 0.47]])
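
`np.savetxt` and `np.loadtxt` suit purely numeric tables. For tables that mix numbers and text, one option is Python's standard `csv` module. The sketch below is only an illustration: the station table, the column names, and the file name are invented for the example.

```python
import csv
import os
import tempfile

# Hypothetical table mixing text and numbers (illustration only)
rows = [
    {'station': 'A', 'location': 'Berkeley', 'dt': 0.04},
    {'station': 'B', 'location': 'Oakland', 'dt': 0.01},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'stations.csv')

    # newline='' is what the csv module documentation recommends for file objects
    with open(csv_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['station', 'location', 'dt'])
        writer.writeheader()
        writer.writerows(rows)

    with open(csv_path, 'r', newline='') as f:
        loaded = list(csv.DictReader(f))

# Every field is read back as a string, so numeric columns need converting
dts = [float(row['dt']) for row in loaded]

print(loaded[0])
print(dts)
```

Because CSV is plain text, the reader cannot know which columns are numeric; the explicit `float(...)` conversion is the price paid for that simplicity.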

## 3. Pickle files
We might want to store `dictionaries`, `tuples`, `lists`, or any other data type to the disk and use them later or send them to some colleagues. **Pickle** can serialize objects so that they can be saved into a file and loaded again later.

Pickle serializes Python object structures, that is, it **converts an object in memory to a byte stream that can be stored as a binary file on disk**. When the file is loaded back into a Python program, the byte stream is de-serialized into a Python object again.

In [12]:
# Define the intended path to the file
name_file = 'Ch11_pickle.pkl'
path_file = os.path.join(output_dir, name_file)

import pickle
dict_a = {'A':0, 'B':1, 'C':2}
# Use a context manager so the file is flushed and closed after writing
with open(path_file, 'wb') as f:
    pickle.dump(dict_a, f)

We could load the pickle file again as follows,

In [13]:
with open(path_file, 'rb') as f:
    my_dict = pickle.load(f)
my_dict

{'A': 0, 'B': 1, 'C': 2}
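
The byte stream mentioned earlier can also be inspected without touching the disk: `pickle.dumps` returns the serialized bytes and `pickle.loads` rebuilds the object from them. The `record` dictionary below is a made-up example that mixes several built-in types.

```python
import pickle

# Made-up object mixing several built-in types (illustration only)
record = {
    'name': 'run-1',
    'shape': (2, 3),          # tuple
    'values': [1.5, 2.5],     # list
    'tags': {'a', 'b'},       # set
}

blob = pickle.dumps(record)    # serialize to a bytes object
restored = pickle.loads(blob)  # de-serialize into a new, equal object

print(type(blob))
print(restored == record)
print(type(restored['shape']), type(restored['tags']))
```

The restored object is a separate copy with the original Python types preserved. Since unpickling can execute arbitrary code, pickle files should only be loaded when they come from a trusted source.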

## 4. JSON files
JSON stands for **JavaScript Object Notation**. Unlike pickle, which is specific to Python, JSON is a **language-independent data format** stored as human-readable text. For simple data it often takes **little space on disk** and is **fast to process**, although how it compares with pickle depends on the data. Only basic types (dictionaries, lists, strings, numbers, booleans, and `None`) can be stored directly.

Let's take a look at the JSON format,

In [14]:
school = {
  "school": "UC Berkeley",
  "address": {
    "city": "Berkeley", 
    "state": "California",
    "postal": "94720"
  }, 
    
  "list":[
      "student 1",
      "student 2",
      "student 3"
      ]
}
school

{'school': 'UC Berkeley',
 'address': {'city': 'Berkeley', 'state': 'California', 'postal': '94720'},
 'list': ['student 1', 'student 2', 'student 3']}

We could write a JSON file with the `json` library,

In [15]:
import json
# Define the intended path to the file
name_file = 'Ch11_json.json'
path_file = os.path.join(output_dir, name_file)

with open(path_file, 'w') as f:
    json.dump(school, f)

We could load the json file as follows,

In [16]:
with open(path_file, 'r') as f:
    my_school = json.load(f)
my_school

{'school': 'UC Berkeley',
 'address': {'city': 'Berkeley', 'state': 'California', 'postal': '94720'},
 'list': ['student 1', 'student 2', 'student 3']}
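
The `school` dictionary survives the round trip unchanged because it contains only strings, lists, and nested dictionaries. Being language-independent, JSON has no notion of Python-specific types, so some objects come back in a different form. The sketch below uses a made-up dictionary to show two common cases.

```python
import json

# Made-up dictionary with a tuple value and an integer key (illustration only)
original = {'point': (1, 2), 1: 'integer key', 'flag': None}

text = json.dumps(original)  # serialize to a JSON string
restored = json.loads(text)  # parse the string back into Python objects

print(text)
print(restored)
```

The tuple is written as a JSON array and comes back as a list, the integer key becomes the string `'1'`, and `None` is stored as `null`. When exact Python types matter, pickle may be the better fit; when the file must be read by other languages or by people, JSON usually is.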

## 5. HDF5 files
HDF5 stands for **Hierarchical Data Format**, version 5. HDF5 helps to store **large amounts of data with quick access**. It is a powerful binary data format with no limit on the file size. It **supports parallel I/O (input/output)** and carries out a **number of low-level optimizations** under the hood to make **queries faster and storage requirements smaller**.

An HDF5 file saves two types of objects: 
* datasets: `array-like` collections of data (like NumPy arrays), 
* groups: `folder-like` containers that hold datasets and other groups. 

Attributes can also be attached to datasets and groups to describe their properties. The word "hierarchical" in HDF5 refers to the fact that **the data is organized like a file system, with a folder and subfolder structure (in HDF5, these are called groups and subgroups)**. Groups behave like dictionaries: the **keys are the names of their members**, and the **values are the subgroups or datasets themselves**.

Let's demonstrate the HDF5 below,

**Example**  
Suppose we deployed some instruments to monitor acceleration and GPS location in the Bay Area, CA: two accelerometers, at Berkeley and Oakland, and one GPS station in San Francisco. They record data at different sampling rates: the accelerometer at Berkeley samples every 0.04 s, the one at Oakland every 0.01 s, and the GPS station in San Francisco every 60 s. We want to store both types of data in one HDF5 file, together with attributes that indicate where the data was recorded, the start time of the recording, the station name, and the sampling interval.

In [17]:
# Generate random data for recording
acc_1 = np.random.random(1000)
station_number_1 = '1'
# unix timestamp
start_time_1 = 1542000276
# time interval for recording
dt_1 = 0.04
location_1 = 'Berkeley'

acc_2 = np.random.random(500)
station_number_2 = '2'
start_time_2 = 1542000576
dt_2 = 0.01
location_2 = 'Oakland'

In [18]:
# Define the intended path to the file
name_file = 'Ch11_hdf5.hdf5'
path_file = os.path.join(output_dir, name_file)

import h5py

# Let's use the context manager
with h5py.File(path_file, 'w') as hf:
    hf['/acc/1/data'] = acc_1
    hf['/acc/1/data'].attrs['dt'] = dt_1
    hf['/acc/1/data'].attrs['start_time'] = start_time_1
    hf['/acc/1/data'].attrs['location'] = location_1

    hf['/acc/2/data'] = acc_2
    hf['/acc/2/data'].attrs['dt'] = dt_2
    hf['/acc/2/data'].attrs['start_time'] = start_time_2
    hf['/acc/2/data'].attrs['location'] = location_2

    hf['/gps/1/data'] = np.random.random(100)
    hf['/gps/1/data'].attrs['dt'] = 60
    hf['/gps/1/data'].attrs['start_time'] = 1542000000
    hf['/gps/1/data'].attrs['location'] = 'San Francisco'

The file now has two top-level groups:
* acc
* gps

Both contain **subgroups named 1 or 2**, which identify the stations. **Each station group holds a dataset**, `data`, that stores the array we created. Attributes can be **attached to groups or to datasets**; here we added only `dt`, `start_time`, and `location`, all on the datasets. The layout closely resembles a folder structure, with the array `acc_1` saved at `/acc/1/data`.
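
One way to see this folder-like layout is to walk the whole file with `visititems`, which calls a function once for every group and dataset. The sketch below builds a small stand-in file with the same path layout; the file name `tree_demo.hdf5`, the zero-filled arrays, and the helper `record` are assumptions made for the illustration.

```python
import os
import tempfile

import h5py
import numpy as np

found = []

def record(name, obj):
    """Hypothetical helper: note whether each visited object is a group or a dataset."""
    kind = 'dataset' if isinstance(obj, h5py.Dataset) else 'group'
    found.append((name, kind))

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'tree_demo.hdf5')

    # Intermediate groups ('acc', 'acc/1', ...) are created automatically
    with h5py.File(demo_path, 'w') as hf:
        hf['/acc/1/data'] = np.zeros(4)
        hf['/gps/1/data'] = np.zeros(2)

    # Walk every object in the file; names are paths relative to the root
    with h5py.File(demo_path, 'r') as hf:
        hf.visititems(record)

for name, kind in found:
    print(kind, name)
```

Only the objects at the ends of the paths are datasets; everything above them is a group, which is the distinction the `keys()` calls on an open file expose one level at a time.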

We could load the HDF5 file again as follows,

In [19]:
# Open without a context manager so that `hf` stays usable in the cells below;
# call hf.close() once you are done reading
hf = h5py.File(path_file, 'r')

In [20]:
list(hf.keys())

['acc', 'gps']

In [21]:
acc = hf['acc']
list(acc.keys())

['1', '2']

In [22]:
data_1 = hf['acc/1/data']
data_1[:10]

array([0.04102403, 0.69825357, 0.42450819, 0.34359535, 0.43676345,
       0.22934737, 0.28688871, 0.22603361, 0.60105346, 0.16275696])

In [23]:
print(list(data_1.attrs))
print(data_1.attrs['dt'])
print(data_1.attrs['location'])
print(data_1.attrs['start_time'])

['dt', 'location', 'start_time']
0.04
Berkeley
1542000276
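
The attributes are what make the raw samples interpretable: with `start_time` and `dt`, the time of every sample can be reconstructed. The sketch below assumes the values printed above for station 1 and the 1000 samples generated earlier; it hard-codes them instead of reading the file so that it can run on its own.

```python
from datetime import datetime, timezone

import numpy as np

# Values assumed from the station 1 attributes shown above
start_time = 1542000276  # Unix timestamp in seconds
dt = 0.04                # sampling interval in seconds
n_samples = 1000         # length of the acc_1 array

# Time stamp of sample k is start_time + k * dt
times = start_time + dt * np.arange(n_samples)

# Convert the first time stamp to a readable UTC date
first_utc = datetime.fromtimestamp(start_time, tz=timezone.utc)

print(times[:3])
print(times[-1] - times[0])  # duration covered by the recording, in seconds
print(first_utc.isoformat())
```

With the file open, the same computation would use `data_1.attrs['start_time']`, `data_1.attrs['dt']`, and `len(data_1)` in place of the hard-coded numbers.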
