# 1. Introduction

This notebook illustrates how data manually exported as .csv from the HOBO and placed in a folder is parsed by a *database insert script*.

## Process 

There are 3 folders involved in the process:
 * To_Insert
 * Inside_Database
 * Logs 

The CRON job launches the script every hour to check the **To_Insert** folder for files that were manually placed there. If it finds at least one new file, it processes the files one at a time and inserts each into the database.

As the script applies each of the **formatting assumptions** listed in section 2, it creates a file in the **Logs** folder for that execution, named in the following format: `<success or error>_<filename>_<timestamp of attempt>`. If the insert succeeds, the script also moves the file from **To_Insert** to **Inside_Database**. If it fails, the file remains in the **To_Insert** folder so that it is attempted again the next time the script runs (either the next hour or when launched manually).
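The folder and log behaviour described above could be sketched roughly as follows. This is only an illustration, not the actual insert script: `process_folder`, `log_name`, the `insert_file` callback, the `success` prefix spelling, and the timestamp layout are all assumptions made for the example.

```python
import shutil
from datetime import datetime
from pathlib import Path


def log_name(status, filename, when):
    """Build '<success or error>_<filename>_<timestamp of attempt>'.

    The timestamp layout (YYYYmmdd-HHMMSS) is an assumption.
    """
    return f"{status}_{filename}_{when:%Y%m%d-%H%M%S}"


def process_folder(root, insert_file):
    """Attempt every .csv in To_Insert once and return the log names written.

    insert_file is a hypothetical callable that parses one file and inserts
    it into the database, raising an exception on failure.
    """
    root = Path(root)
    to_insert = root / "To_Insert"
    inside_db = root / "Inside_Database"
    logs = root / "Logs"
    for folder in (to_insert, inside_db, logs):
        folder.mkdir(parents=True, exist_ok=True)

    written = []
    for path in sorted(to_insert.glob("*.csv")):
        now = datetime.now()
        try:
            insert_file(path)
        except Exception as exc:
            # Failure: log the reason and leave the file in To_Insert,
            # so the next run attempts it again.
            name = log_name("error", path.name, now)
            (logs / name).write_text(str(exc))
        else:
            # Success: log it and move the raw file to Inside_Database.
            name = log_name("success", path.name, now)
            (logs / name).write_text("inserted")
            shutil.move(str(path), str(inside_db / path.name))
        written.append(name)
    return written
```

Passing the insert step in as a callable keeps the folder bookkeeping separate from the database logic, which makes the sketch easy to try with a stand-in function.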

The best way to diagnose whether any data was not inserted into the database, and *why*, is therefore to:

 1. Check whether any file is still sitting in the **To_Insert** folder. 
 1. Check whether there are any **error**-prefixed files in the **Logs** folder.
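Both checks can also be done programmatically. The helper below is a minimal sketch: the name `diagnose` is hypothetical, and it assumes that failed attempts produce log files whose names start with `error_`.

```python
from pathlib import Path


def diagnose(root):
    """Return (pending files, error logs) for a folder layout rooted at root."""
    root = Path(root)
    # Files still waiting in To_Insert were never inserted successfully.
    pending = sorted(p.name for p in (root / "To_Insert").glob("*") if p.is_file())
    # Logs with the 'error' prefix record why an attempt failed.
    errors = sorted(p.name for p in (root / "Logs").glob("error_*"))
    return pending, errors
```

Opening the most recent error log for a pending file should then show why that file was rejected.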

Keeping the files even after their contents are stored in the database is intentional: it preserves the data in its *raw format* in case it needs to be revisited in the future. 

## File Parsing

For every file inside the **To_Insert** folder, the script: 
 1. **Parses** the readings sampled by the HOBO hardware.
 1. Formats the data under certain **formatting assumptions**.
 1. Checks a separate .csv file (in the actual script, it checks the database **Purpose** table) using the **HOBO Serial Number** to retrieve the **Purpose ID** field.
 1. Uses the **Purpose ID** field to save the formatted text file (in the actual script, it stores the rows in the **Readings** table) for every sensor: Temperature (Temp), Relative Humidity (RH), and Luminous intensity (lum/ft²). 
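As an illustration of the lookup step, the sketch below searches a purpose .csv file for a serial number. The function name `lookup_purpose_id` and the column names `hobo_serial_number` and `purpose_id` are assumptions for this example; the real file or **Purpose** table may name them differently.

```python
import csv


def lookup_purpose_id(purpose_csv, serial_number):
    """Return the purpose_id recorded for a HOBO serial number, or None.

    Assumes a header row with 'hobo_serial_number' and 'purpose_id'
    columns (hypothetical names).
    """
    with open(purpose_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["hobo_serial_number"] == str(serial_number):
                return row["purpose_id"]
    # No matching serial number: the caller decides how to report it.
    return None
```

Returning `None` instead of raising lets the caller treat an unknown serial number as a logged failure for that file.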
 



# 1. Parse the readings information to understand the *formatting assumptions*

The HOBO software exports a .csv file whose first row holds the serial number of its sensors; the remaining lines follow the standard .csv format with a header row. The columns, whose header names are enclosed in double quotes, are as follows: 

* Row ID 
* Timestamp 
* Temperature 
* Relative Humidity 
* Luminous Intensity

```python
import pandas

df = pandas.read_csv('To_Insert/9790163-sample.csv')
df
```


| Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Plot Title: 9790163 |
|---|---|---|---|---|
| # | Date Time, GMT-10:00 | Temp, °F (LGR S/N: 9790163, SEN S/N: 9790163) | RH, % (LGR S/N: 9790163, SEN S/N: 9790163) | Intensity, lum/ft² (LGR S/N: 9790163, SEN S/N:... |
| 1 | 02/03/17 04:00:00 PM | 76.375 | 69.420 | 1.8 |
| 2 | 02/03/17 04:01:00 PM | 76.332 | 69.296 | 1.8 |
| 3 | 02/03/17 04:02:00 PM | 76.203 | 67.938 | 1.8 |
| 4 | 02/03/17 04:03:00 PM | 75.942 | 68.361 | 1.8 |
| 5 | 02/03/17 04:04:00 PM | 75.724 | 68.698 | 1.8 |
| 6 | 02/03/17 04:05:00 PM | 75.508 | 68.945 | 1.8 |
| 7 | 02/03/17 04:06:00 PM | 75.290 | 69.280 | 1.8 |
| 8 | 02/03/17 04:07:00 PM | 75.160 | 69.565 | 1.8 |
| 9 | 02/03/17 04:08:00 PM | 75.029 | 69.670 | 1.8 |
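The output above shows that a default `read_csv` call treats the serial-number row as the header, which pushes the real column names into the first data row. One possible way to handle this, shown only as a sketch, is to read the first line separately and skip it when loading the table. The function name `read_hobo_csv`, the `Plot Title:` pattern, and the `utf-8-sig` encoding are assumptions based on the sample shown here.

```python
import re

import pandas


def read_hobo_csv(path):
    """Return (serial_number, readings) from a HOBO .csv export."""
    # The first line is assumed to carry 'Plot Title: <serial number>'.
    with open(path, encoding="utf-8-sig") as handle:
        first_line = handle.readline()
    match = re.search(r'Plot Title:\s*([^",\s]+)', first_line)
    serial_number = match.group(1) if match else None

    # Skip the title row so the quoted second row becomes the header.
    readings = pandas.read_csv(path, skiprows=1, encoding="utf-8-sig")
    return serial_number, readings
```

With the title row skipped, the column names keep the timezone and unit text from the header, which the formatting steps rely on.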


# 2. Format the data under certain *formatting assumptions*

For our purposes, the following assumptions must be made in order to conform to the **Reading** table [schema](https://github.com/erdl/database):

1. The timezone must be moved **from** the header **to** the existing **Date Time** column values so that they can be inserted as the **reading_timestamp** database column.
1. An external table must be checked, using the sensor serial number, in order to **retrieve the purpose_id**, which will be stored as a new column. 
1. The table must be **split into 3 tables**, one for every sensor.
   1. For each of the 3 tables, the **units** must be moved **from** the header **to** a new **unit** column.
1. The 3 tables are combined (set union) and stored as a single file to be submitted to the database. 
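To make these steps concrete, here is a rough standard-library sketch that works on a header row and data rows already split into lists. It is not the notebook's implementation: the helper names, the `sensor` and `reading` keys, and the month/day/year reading of the timestamp are assumptions; only `reading_timestamp`, `purpose_id`, and `unit` come from the text above.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_utc_offset(date_header):
    """Turn 'Date Time, GMT-10:00' into a fixed-offset tzinfo."""
    match = re.search(r"GMT([+-])(\d{2}):(\d{2})", date_header)
    if match is None:
        raise ValueError(f"no GMT offset in header: {date_header!r}")
    sign = -1 if match.group(1) == "-" else 1
    offset = timedelta(hours=int(match.group(2)), minutes=int(match.group(3)))
    return timezone(sign * offset)


def parse_sensor_unit(column_header):
    """Turn 'Temp, °F (LGR S/N: ...)' into ('Temp', '°F')."""
    match = re.match(r"\s*([^,]+),\s*(\S+)", column_header)
    if match is None:
        raise ValueError(f"unexpected column header: {column_header!r}")
    return match.group(1), match.group(2)


def to_long_rows(header, rows, purpose_id):
    """Split the three sensor columns and stack them as one list of records."""
    tz = parse_utc_offset(header[1])
    records = []
    for col in (2, 3, 4):  # Temp, RH, Intensity
        sensor, unit = parse_sensor_unit(header[col])
        for row in rows:
            # Assumes MM/DD/YY with a 12-hour clock, as in the sample above.
            stamp = datetime.strptime(row[1], "%m/%d/%y %I:%M:%S %p")
            records.append({
                "reading_timestamp": stamp.replace(tzinfo=tz).isoformat(),
                "purpose_id": purpose_id,
                "sensor": sensor,
                "unit": unit,
                "reading": float(row[col]),
            })
    return records
```

Because each record carries its own unit and timestamp offset, the three per-sensor tables can be stacked into a single file without losing information from the header.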

## 2.1 Timezone column formatting 