# Reading Sensor Measures

This dataset is composed of multiple files, each containing the readings of a single sensor. For instance, the humidity and temperature readings from the DHT-11 sensor are stored in two files: `H-DHT11-measures.json` and `T-DHT11-measures.csv`.

The dataset files also come in two different formats (`json` and `csv`), which is not what we want.

In [2]:
import json
from pathlib import Path

import pandas as pd

## Notebook technologies

To read the JSON and CSV files, we are going to use two packages: the standard Python `json` package and `pandas`.

- `json` is a standard Python package that exposes an API similar to that of `pickle` for reading JavaScript Object Notation (JSON) files.
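
As a small illustration of that `pickle`-like API, the sketch below round-trips a single record through `dumps`/`loads` (strings) and `dump`/`load` (file-like objects). The record is copied from the sample data shown later in this notebook, and `io.StringIO` is only a stand-in for a real file.

```python
import io
import json

record = {"sensor": "T-DM280", "value": 26.43, "time": "2017-12-22T10:51:31Z"}

# dumps/loads work on strings, much like pickle.dumps/pickle.loads work on bytes.
text = json.dumps(record)
restored = json.loads(text)

# dump/load work on file-like objects; StringIO stands in for a real file here.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)
restored_from_file = json.load(buffer)

print(restored == record, restored_from_file == record)
```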

## Reading JSON files

The files in this dataset are not *pure* JSON; they use the JSON Lines text format \[1\], also known as newline-delimited JSON. JSON Lines is a convenient format for storing structured data that may be processed one record at a time (which seems pretty handy for sensor data).

Unfortunately, a JSON Lines file as a whole is not a valid JSON document, so `json.load` cannot parse it in a single call. (`pandas.read_json` can read this format when called with `lines=True`, but here we do it by hand with the `json` package.) Therefore, we first read the file's lines and then parse each one using the `json` package.
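
To make the problem concrete, here is a minimal sketch that uses two records copied from the sample data: parsing the whole text at once fails, while parsing line by line works. The variable names are illustrative only.

```python
import json

# Two JSON Lines records, one JSON object per line.
raw = (
    '{"sensor": "T-DM280", "value": 26.43, "time": "2017-12-22T10:51:31Z"}\n'
    '{"sensor": "T-DM280", "value": 26.45, "time": "2017-12-22T10:51:38Z"}\n'
)

# The text as a whole is not one JSON document, so json.loads rejects it.
try:
    json.loads(raw)
    whole_text_parsed = True
except json.JSONDecodeError:
    whole_text_parsed = False

# Each non-empty line, however, is a valid JSON object on its own.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(whole_text_parsed, len(records))
```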

In [3]:
DATA_PATH = Path('../../data/raw')

For example, we can read the `T-DM280-measures.json` data.

In [4]:
fname = DATA_PATH / 'measures' / 'T-DM280-measures.json'
json_lines = fname.read_text().split('\n')
print(json_lines[:5])

['{"sensor": "T-DM280", "value": 26.43, "time": "2017-12-22T10:51:31Z"}', '{"sensor": "T-DM280", "value": 26.43, "time": "2017-12-22T10:51:35Z"}', '{"sensor": "T-DM280", "value": 26.45, "time": "2017-12-22T10:51:38Z"}', '{"sensor": "T-DM280", "value": 26.45, "time": "2017-12-22T10:51:41Z"}', '']


Using the `json.loads` function, we can convert each string into a Python dictionary.

In [9]:
# json.loads no longer accepts an `encoding` argument (removed in Python 3.9).
json_data = [json.loads(o) for o in json_lines if o]
len(json_data), json_data[0]

(4, {'sensor': 'T-DM280', 'value': 26.43, 'time': '2017-12-22T10:51:31Z'})

Once we have a list of parsed JSON records (a list of Python dictionaries), it is pretty convenient to load them as a `pandas.DataFrame`.

In [10]:
df = pd.DataFrame(json_data)
df.head()

| | sensor | value | time |
|---|---|---|---|
| 0 | T-DM280 | 26.43 | 2017-12-22T10:51:31Z |
| 1 | T-DM280 | 26.43 | 2017-12-22T10:51:35Z |
| 2 | T-DM280 | 26.45 | 2017-12-22T10:51:38Z |
| 3 | T-DM280 | 26.45 | 2017-12-22T10:51:41Z |


Now all of the `pandas` functionality is available for data that was initially formatted as JSON.

We pack the three steps above into a single function so we can reuse it later.
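
As a small taste of what that enables, the sketch below rebuilds the four sample rows by hand (so it runs without the data files), parses the `time` strings into timestamps, and computes two simple summaries. The choice of summaries is ours and is not part of the original analysis.

```python
import pandas as pd

# The four sample records shown above, rebuilt inline for a self-contained example.
df = pd.DataFrame([
    {"sensor": "T-DM280", "value": 26.43, "time": "2017-12-22T10:51:31Z"},
    {"sensor": "T-DM280", "value": 26.43, "time": "2017-12-22T10:51:35Z"},
    {"sensor": "T-DM280", "value": 26.45, "time": "2017-12-22T10:51:38Z"},
    {"sensor": "T-DM280", "value": 26.45, "time": "2017-12-22T10:51:41Z"},
])

# Convert the ISO 8601 strings into timestamps (the trailing 'Z' marks UTC).
df["time"] = pd.to_datetime(df["time"])

mean_value = df["value"].mean()
span = df["time"].max() - df["time"].min()
print(mean_value, span)
```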

In [13]:
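def read_json_lines(f: 'str | Path') -> pd.DataFrame:
    """Read a JSON Lines file into a DataFrame, one row per line."""
    lines = Path(f).read_text().split('\n')
    # Skip empty strings, such as the one left after the trailing newline.
    json_data = [json.loads(o) for o in lines if o]
    return pd.DataFrame(json_data)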

read_json_lines(DATA_PATH / 'measures' / 'H-DHT11-measures.json').head()

| | sensor | value | time |
|---|---|---|---|
| 0 | H-DHT11 | 31 | 2017-12-22T11:22:11Z |
| 1 | H-DHT11 | 31 | 2017-12-22T11:22:16Z |
| 2 | H-DHT11 | 31 | 2017-12-22T11:22:20Z |
| 3 | H-DHT11 | 31 | 2017-12-22T11:22:24Z |
| 4 | H-DHT11 | 31 | 2017-12-22T11:22:28Z |
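
As an alternative to parsing by hand, `pandas.read_json` accepts a `lines=True` argument for newline-delimited JSON. The sketch below assumes a reasonably recent pandas version and uses an in-memory buffer with two sample records instead of the real file; with a file on disk, the path would be passed in place of the buffer.

```python
import io

import pandas as pd

# Two sample records in JSON Lines format; StringIO stands in for the file.
raw = (
    '{"sensor": "T-DM280", "value": 26.43, "time": "2017-12-22T10:51:31Z"}\n'
    '{"sensor": "T-DM280", "value": 26.45, "time": "2017-12-22T10:51:38Z"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON object.
df_lines = pd.read_json(io.StringIO(raw), lines=True)
print(df_lines.columns.tolist(), len(df_lines))
```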


## Reading CSV files

As we did [here](notebooks/iris/1.\ Read\ the\ data.ipynb), to load a CSV file we just call `pd.read_csv`.

For instance, let's read the `H-DHT22-measures.csv` file.

In [15]:
df = pd.read_csv(DATA_PATH / 'measures' / 'H-DHT22-measures.csv')
df.head(5)

| | sensor | value | time |
|---|---|---|---|
| 0 | H-DHT22 | 15.7 | 2017-12-19T14:07:18Z |
| 1 | H-DHT22 | 15.7 | 2017-12-19T14:07:25Z |
| 2 | H-DHT22 | 15.7 | 2017-12-19T14:07:32Z |
| 3 | H-DHT22 | 15.7 | 2017-12-19T14:07:38Z |
| 4 | H-DHT22 | 15.7 | 2017-12-19T14:07:45Z |


In this case, unlike the Iris data, the column names are in the first row of the CSV file, so there is no need to specify them when reading the tabular data.
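
The difference between the two situations can be sketched with in-memory CSV text. The headerless variant is hypothetical and only mimics a file without a header row; the rows are copied from the output above.

```python
import io

import pandas as pd

# CSV text with a header row, like the sensor files in this dataset.
with_header = (
    "sensor,value,time\n"
    "H-DHT22,15.7,2017-12-19T14:07:18Z\n"
    "H-DHT22,15.7,2017-12-19T14:07:25Z\n"
)

# Hypothetical headerless variant of the same data.
without_header = (
    "H-DHT22,15.7,2017-12-19T14:07:18Z\n"
    "H-DHT22,15.7,2017-12-19T14:07:25Z\n"
)

# By default, read_csv takes the column names from the first row.
df_a = pd.read_csv(io.StringIO(with_header))

# Without a header row, the names have to be passed explicitly.
df_b = pd.read_csv(
    io.StringIO(without_header),
    header=None,
    names=["sensor", "value", "time"],
)

print(df_a.columns.tolist() == df_b.columns.tolist())
```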

## References

\[1\] [JSON Lines text format](http://jsonlines.org/)