# Aq_input Tips and Samples

This notebook goes over aq_input's options and their sample usages.
Based on AQ Tools version: 2.0.1-1.


To observe the input data stream directly, the examples here pass input specs to the `aq_pp` command without applying any data transformation.


### Prerequisites
Readers are assumed to have a working knowledge of
- bash commands
- the `aq_pp` command
- input, column and output specs for aq_tools


We'll go over each input option of the `aq_pp` command and its use cases. Have the [aq_input documentation](http://auriq.com/documentation/source/reference/manpages/aq-input.html) ready on the side, so you can refer to it whenever needed.
We'll start with the basic usage of each option, then dive into advanced usage.


## Motivation
Aq_input, or the input spec, tells aq_tools how to read and interpret the data you provide.
Being able to feed data sources into aq_tools in whatever way you need is an essential skill for using aq_tools.


## Major Components
Below is the general structure of the input-spec. 

`aq_command ... -f[,attributes] fileName [more filenames...] -d columnSpec`

There are 2 major components in the input specification of an aq_tool.

* `-f`: file input: specifies the data source's type and format, the input behaviors, and the name of the data input (the file name).
* `-d`: column specs: specifies the number, names and data types of the columns.

Before getting into examples, let us go through variations of accepted attributes for both options.
If you've already read the documentation and are familiar with these options, go ahead and skip this section and jump right into the examples!

Let's take a look at some commonly used options here.


## `-f` Basic file reading options

With this option we can specify three main things (input source type, input format and error handling), plus a few extras.

### input source type
- file
- stream from standard input (stdin), for example piped from another command's standard output
- named pipe
- stream from connection to listener
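
As a rough illustration of the named pipe source type, the sketch below pushes two CSV rows through a FIFO using only standard shell tools. `cat` stands in for whichever command would consume the data, and the pipe name `input.fifo` is a hypothetical placeholder; how an aq_tool itself is pointed at a pipe is not shown here.

```bash
# Sketch: feed CSV rows through a named pipe (FIFO).
# `cat` is a stand-in reader; the pipe lives in a throwaway temp directory.
workdir="$(mktemp -d)"
fifo="$workdir/input.fifo"      # hypothetical pipe name
mkfifo "$fifo"

# Writer: runs in the background and blocks until a reader opens the pipe.
printf 'Alabama,339.06\nAlaska,288.75\n' > "$fifo" &

# Reader: a program that accepts a file name can read the pipe like a file.
received="$(cat "$fifo")"
wait                            # reap the background writer
rm -rf "$workdir"

echo "$received"
```

The writer and reader run concurrently, so no intermediate file is ever stored on disk.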

### input format (column separator) selection
- `csv`: the default option.
- `sep`: this option lets users specify their own separator.
- `div`: used with the `sep` option in the column spec, this lets us specify a different separator for each column.
- `tab`: html table format
- `jsn`: allows us to input json formatted files. 
- `xml`: xml formatted file input.
- `bin`: aq_pp's original input format
- `aq`: input from another aq_tool that outputs aq. No column spec needed.


### Error Handling
- `eok`: makes errors non-critical, so that rows with errors are skipped.

### Others
- `esc`: treat the '\' character in the data as an escape character.
- `+Num`: number of lines / bytes to skip (for example, `+1` skips the first line)
- `bz`: buffer size


## `-d` Column specs

The column spec defines each column's name and the data type its values are interpreted as.
It therefore has the format
``` bash
-d dtype[,attributes]:colName dtype2[,attributes]:colName2 ... 
```
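
To make the format concrete, here is a minimal sketch that assembles a `-d` argument from a list of types and a list of names. The variable names are illustrative only; the types, the `up` attribute and the column names are the ones used later in this notebook.

```bash
# Sketch: assemble a column spec string from parallel lists.
types="S,up F I S"                   # dtype[,attributes] for each column, in file order
names="state highQ highQN date"      # matching column names

col_spec=""
set -- $names                        # load the names into $1, $2, ...
for dtype in $types; do
  col_spec="${col_spec:+$col_spec }${dtype}:$1"   # append "dtype:name"
  shift
done

echo "-d $col_spec"
# prints: -d S,up:state F:highQ I:highQN S:date
```

Each `dtype:name` pair is separated by a space, and attributes attach to the type with a comma.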


### Generic Col Spec

Supported data types are 
- `S`: string
- `F`: Double precision floating point
- `L`: 64 bit unsigned integer
- `LS`: 64 bit signed integer
- `I`: 32 bit unsigned integer
- `IS`: 32 bit signed integer
- `IP`: IPv4/IPv6 address
- `X`: placeholder for unwanted input column.

And some of the attributes that might come in handy are...
- `trm`: trim leading/trailing spaces from field value
- `lo`, `up`: convert a string field value to lower or upper case. 
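
For intuition about what `trm` and `up` do to a field, the sketch below applies the equivalent transformations with awk. This is only a stand-in to show the effect on a single value, not how aq_tools implement the attributes; the sample row is made up.

```bash
# Sketch: mimic `trm` (trim spaces) and `up` (upper case) on the first field.
row='  Alabama  ,339.06'             # made-up row with padded first field

cleaned="$(printf '%s\n' "$row" | awk 'BEGIN { FS = OFS = "," } {
  gsub(/^[ \t]+|[ \t]+$/, "", $1)    # trim leading/trailing blanks, like trm
  $1 = toupper($1)                   # convert to upper case, like up
  print
}')"

echo "$cleaned"
# prints: ALABAMA,339.06
```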

There are more attributes available; take a look at the documentation for further details.

Now that we are equipped with basic knowledge of input specifications, let's get started on the examples.

# TODO list from here --------------------------


# Basic 

Concrete versions of the examples from the documentation manual page.

- csv file
- tsv file
- specifying the `sep` option in `-f`

# Advanced

## Input source type
* named pipe
* frozen file with pipe
* stream from connection to listener

## Extract keys
- jsn
- xml


# TODO LIST until here ------------------------------

## Data 

We will start with the simplest case: reading a csv file that consists of numeric and string columns.
This is data on the street price of cannabis across different states. (The data has been modified for simplicity.)

The data looks like the following.

State|HighQ|HighQN|date
---|---|---|---|
Alabama|339.06|1042|2014-01-01
Alaska|288.75|252|2014-01-01
Arizona|303.31|1941|2014-01-01
Arkansas|361.85000000000002|576|2014-01-01
California|248.78|12096|2014-01-01
Colorado|236.31|2161|2014-01-01
Connecticut|347.89999999999998|1294|2014-01-01
Delaware|373.18000000000001|347|2014-01-01
District of Columbia|352.25999999999999|433|2014-01-01
Florida|306.43000000000001|6506|2014-01-01


As you can see, this data contains
- 2 string columns, `State` and `date`
- 1 float column, `HighQ`, and
- 1 int column, `HighQN`
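
If you want a small file to experiment with, the sketch below writes the first few rows of the table to a temporary csv file. The temp path is a throwaway and is not the notebook's actual data file, which may contain more rows and columns than the simplified table.

```bash
# Sketch: write a few rows of the simplified table to a scratch csv file.
sample="$(mktemp)"                   # throwaway path, not the real data file
cat > "$sample" <<'EOF'
State,HighQ,HighQN,date
Alabama,339.06,1042,2014-01-01
Alaska,288.75,252,2014-01-01
Arizona,303.31,1941,2014-01-01
EOF

rows="$(wc -l < "$sample" | tr -d ' ')"
echo "$rows lines written to $sample"
head -n 2 "$sample"                  # header plus the first data row
```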

## Samples
To read in this data, we need to specify
- column spec
- file name
- attribute

**Column Spec**<br>
This needs to be in the same order as the actual columns in the data. Therefore, considering the data types we've looked at earlier, we can specify column specs with data types and names.

`-d S:state F:highQ I:highQN S:date`

Note that the capitalization of the data type does not matter, and there is no need to put commas between column specs.
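
Because the column spec has to line up with the file's columns, it can be worth counting both before running the command. The following is a minimal sketch using awk; it assumes a plain comma-separated header with no quoted commas, and the variable names are illustrative.

```bash
# Sketch: compare the number of columns in the header with the column spec.
header='State,HighQ,HighQN,date'              # header line from the table above
col_spec='S:state F:highQ I:highQN S:date'    # column spec from this section

file_cols="$(printf '%s\n' "$header" | awk -F',' '{ print NF }')"
spec_cols="$(printf '%s\n' "$col_spec" | awk '{ print NF }')"

if [ "$file_cols" -eq "$spec_cols" ]; then
  echo "OK: $file_cols columns in both"
else
  echo "Mismatch: file has $file_cols columns, spec has $spec_cols"
fi
```

In practice the header would come from `head -n 1` on the real file rather than a hard-coded string.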

**Input File**<br>
Here we'll be using a csv file. Two things need to be considered.
1. the path to the file, which is `data/aq_input/partial_cannabis_price.csv`
2. an attribute for skipping the header row

`... -f,+1 data/aq_input/partial_cannabis_price.csv ...`

After the `-f` option, the `+1` attribute indicates that we are skipping the first row of the file, which is the header.
After the attribute, we simply provide the file name.
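
The effect of skipping one line can be previewed with standard tools. This sketch uses `tail` on a two-row made-up sample; note that `tail -n +2` means "start at line 2", which corresponds to skipping 1 line.

```bash
# Sketch: preview what remains after skipping the header line.
data="$(printf 'State,HighQ\nAlabama,339.06\nAlaska,288.75\n')"   # made-up sample

body="$(printf '%s\n' "$data" | tail -n +2)"   # start output at line 2
echo "$body"
# prints:
# Alabama,339.06
# Alaska,288.75
```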

Now let's combine these options together and see it in action with `aq_pp` command.

In [4]:
# reading in the file, and displaying it by aq_pp
aq_pp -f,+1 data/aq_input/partial_cannabis_price.csv -d S:state F:highQ I:highQN S:date

"state","highQ","highQN","date"
"Alabama",339.06,1042,"198.63999999999999"
"Alaska",288.75,252,"260.60000000000002"
"Arizona",303.31,1941,"209.34999999999999"
"Arkansas",361.85000000000002,576,"185.62"
"California",248.78,12096,"193.56"
"Colorado",236.31,2161,"195.28999999999999"
"Connecticut",347.89999999999998,1294,"273.97000000000003"
"Delaware",373.18000000000001,347,"226.25"
"District of Columbia",352.25999999999999,433,"295.67000000000002"
"Florida",306.43000000000001,6506,"220.03"
"Georgia",332.20999999999998,3099,"213.52000000000001"
"Hawaii",310.95999999999998,328,"270.38"
"Idaho",276.05000000000001,315,"254.96000000000001"
"Illinois",359.74000000000001,4008,"287.23000000000002"
"Indiana",336.80000000000001,1665,"206.24000000000001"
"Iowa",371.69999999999999,697,"292.33999999999997"
"Kansas",353.50999999999999,838,"260.97000000000003"
"Kentucky",337.32999999999998,1013,"183.16999999999999"
"Louisiana",377.70999999999998,1071,"243.25999999999999"
"Maine",321.10000000000002,450,

Let's apply one of the attributes to a column spec.
We will convert the state names to upper case as we input the data. We can do this by adding the `up` attribute to the column spec of the `state` column, like below.


In [5]:
aq_pp -f,+1 data/aq_input/partial_cannabis_price.csv -d S,up:state F:highQ I:highQN S:date

"state","highQ","highQN","date"
"ALABAMA",339.06,1042,"198.63999999999999"
"ALASKA",288.75,252,"260.60000000000002"
"ARIZONA",303.31,1941,"209.34999999999999"
"ARKANSAS",361.85000000000002,576,"185.62"
"CALIFORNIA",248.78,12096,"193.56"
"COLORADO",236.31,2161,"195.28999999999999"
"CONNECTICUT",347.89999999999998,1294,"273.97000000000003"
"DELAWARE",373.18000000000001,347,"226.25"
"DISTRICT OF COLUMBIA",352.25999999999999,433,"295.67000000000002"
"FLORIDA",306.43000000000001,6506,"220.03"
"GEORGIA",332.20999999999998,3099,"213.52000000000001"
"HAWAII",310.95999999999998,328,"270.38"
"IDAHO",276.05000000000001,315,"254.96000000000001"
"ILLINOIS",359.74000000000001,4008,"287.23000000000002"
"INDIANA",336.80000000000001,1665,"206.24000000000001"
"IOWA",371.69999999999999,697,"292.33999999999997"
"KANSAS",353.50999999999999,838,"260.97000000000003"
"KENTUCKY",337.32999999999998,1013,"183.16999999999999"
"LOUISIANA",377.70999999999998,1071,"243.25999999999999"
"MAINE",321.10000000000002,450,

### Arbitrary Column Separator

We can set a different separator character for each column. Let's try it with the same dataset, but this time we'll import the date as 3 distinct string columns: year, month and day.

First, we need to let the command know that we'll be using distinct separators. We do this in the input spec, like

`..-f,+1,div fileName ...`

where `div` tells the command that we'll use our own separators. Remember that this option is not compatible with other format attributes such as csv or tsv. (Those options assume that the file uses a single separator throughout.)

Now we specify the separator for each column. Simply place a `sep:'sep_character'` block between the columns, wherever the separator character appears in the file. Note that when using the `div` option, separators for all of the columns need to be specified.

In our case, one of the rows will look like this.

`Alabama,339.06,1042,2014-01-01`

Therefore, our new column spec will look like this.

`-d S:state sep:',' F:highQ sep:',' I:highQN sep:',' S:year sep:'-' S:month sep:'-' S:day`

As you can see, we are using the separator `-` for the 3 new columns, with the `sep` option.
For the rest, we are simply using `,` as our separator.
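
One way to picture this is as a left-to-right scan in which each field is read up to the next listed separator. The sketch below models that in plain shell. It is an assumption about how `div` behaves, made for illustration only, and not a description of the tool's internals.

```bash
# Sketch: split a row field by field, using a different separator at each step.
rest='Alabama,339.06,1042,2014-01-01'
out=""

# One separator per boundary, in the same order as the sep:'...' blocks.
for sep in ',' ',' ',' '-' '-'; do
  field="${rest%%${sep}*}"     # text up to the first occurrence of sep
  rest="${rest#*${sep}}"       # drop that field and its separator
  out="${out}${field}|"
done
out="${out}${rest}"            # whatever is left is the last field

echo "$out"
# prints: Alabama|339.06|1042|2014|01|01
```

Five separators yield six fields, matching the six columns in the column spec.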

In [6]:
aq_pp -f,+1,div data/aq_input/partial_cannabis_price.csv -d  S:state sep:',' F:highQ sep:',' I:highQN sep:',' S:year sep:'-' S:month sep:'-' S:day

"state","highQ","highQN","year","month","day"
"Alabama",339.06,1042,"198.63999999999999,933,149.49000000000001,123,2014","01","01"
"Alaska",288.75,252,"260.60000000000002,297,388.57999999999998,26,2014","01","01"
"Arizona",303.31,1941,"209.34999999999999,1625,189.44999999999999,222,2014","01","01"
"Arkansas",361.85000000000002,576,"185.62,544,125.87,112,2014","01","01"
"California",248.78,12096,"193.56,12812,192.91999999999999,778,2014","01","01"
"Colorado",236.31,2161,"195.28999999999999,1728,213.5,128,2014","01","01"
"Connecticut",347.89999999999998,1294,"273.97000000000003,1316,257.36000000000001,91,2014","01","01"
"Delaware",373.18000000000001,347,"226.25,273,199.88,34,2014","01","01"
"District of Columbia",352.25999999999999,433,"295.67000000000002,349,213.72,39,2014","01","01"
"Florida",306.43000000000001,6506,"220.03,5237,158.25999999999999,514,2014","01","01"
"Georgia",332.20999999999998,3099,"213.52000000000001,2269,153.44999999999999,229,2014","01","01"
"Hawaii",310.959999999

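You can see that the month and day have been split into their own columns. Note, however, that the `year` column also holds several extra comma-separated values. The output shows that the file on disk has more columns between `highQN` and the date than the simplified table above, so the fourth field is apparently read all the way up to the first `-`. The same extra columns are why the `date` column in the earlier outputs shows a number rather than a date.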

Let's try something more challenging.

# TODO
ADD WEBLOG ANALYSIS EXAMPLE HERE WITH `DIV` OPTION
