# Essentia Playground

A collection of bash-kernel Jupyter notebooks for learning AQ Tools and Essentia interactively.

**Note**<br>
* Based on AQ Tools version: 2.0.1-2
* Only execute aq commands, as well as `cat`, `head`, `tail` or `zcat`. Do not execute any malicious commands on the notebook.


## Table of Contents
### aq_tools
Options common to all aq_tools.
- [aq-input](aq_input.ipynb) - input specification for all of aq_tools
- [aq-output](aq_output.ipynb) - output specification for all of aq_tools
- [aq-emod](aq-emod.ipynb) - reference for the various built-in functions.

#### aq_pp
- [aq_pp -eval](aq_pp%20-eval.ipynb) - option for evaluating expressions, such as arithmetic and more.
- [aq_pp -filt](aq_pp%20-filt.ipynb) - filtering / selection of records based on a given condition.
- [aq_pp -map](aq_pp%20-map.ipynb) - extraction, mapping and manipulation of strings.
	* also includes examples of `-mapf` and `-mapc`
- [aq_pp -cmb](aq_pp%20-cmb.ipynb) - data lookup option, also can be used for joining datasets
- [aq_pp -sub](aq_pp%20-sub.ipynb) - substitute values of an arbitrary column with matched values from a given lookup table file.
- [aq_pp -if -elif -else](aq_pp%20conditional%20processing%20groups.ipynb) - conditional flow control in `aq_pp` command.
- more on the way...

#### [aq_cnt](aq_cnt.ipynb)
This command helps users find out more about the data stored in any given column, such as its unique values, their distribution, and column statistics.
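
As a rough plain-shell analogy of what a per-column value count looks like, the sketch below lists each unique value of a single column with its number of occurrences. It uses standard `sort`, `uniq` and `awk`, not `aq_cnt` syntax, and the sample values and the `count_demo` name are made up for illustration.

```shell
# Hypothetical single-column input; count_demo is an illustrative helper name.
count_demo() {
  printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' |
    sort |                        # group identical values together
    uniq -c |                     # prefix each unique value with its count
    sort -k1,1nr -k2,2 |          # most frequent first, ties broken by value
    awk '{ print $2 "," $1 }'     # emit value,count
}

count_demo
```

This prints `apple,3`, `banana,2` and `cherry,1`. Because it relies on commands outside the notebook's allowed list, try it in your own shell rather than on the shared notebook.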

#### [aq_rst](aq_rst.ipynb)
This command converts a keyed, flattened dataset into a pivot table, similar to pandas' `df.pivot()` method.
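
To make "keyed, flattened dataset" and "pivot table" concrete, here is a minimal plain-shell sketch of the reshaping itself. It uses standard `awk`, not `aq_rst` syntax; the sample records, the column names and the `pivot_demo` helper are assumptions made up for illustration.

```shell
# Hypothetical flattened input: one key,column,value triple per line.
pivot_demo() {
  printf 'u1,clicks,3\nu1,views,10\nu2,clicks,1\nu2,views,4\n' |
  awk -F, '
    {
      # Remember each key and column name in first-seen order.
      if (!($1 in rows)) { rows[$1]; row_order[++nr] = $1 }
      if (!($2 in cols)) { cols[$2]; col_order[++nc] = $2 }
      val[$1, $2] = $3
    }
    END {
      # Header row: the key column followed by one column per distinct name.
      line = "key"
      for (j = 1; j <= nc; j++) line = line "," col_order[j]
      print line
      # One output row per key, with values placed under their column.
      for (i = 1; i <= nr; i++) {
        line = row_order[i]
        for (j = 1; j <= nc; j++) line = line "," val[row_order[i], col_order[j]]
        print line
      }
    }'
}

pivot_demo
```

The four flattened records become a header line `key,clicks,views` followed by one row per key (`u1,3,10` and `u2,1,4`). Like any command outside the notebook's allowed list, this is meant to be run in your own shell.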

#### [aq_cat](aq_cat.ipynb)
Works as an input multiplexer. Use this command to concatenate several data streams into one, then pipe the result to another command or write it to a file.

#### [aq_udb](aq_udb.ipynb)
Command to interact with UDB (User Database), well suited to complex data preprocessing tasks on large amounts of data. This notebook only goes through the very basics of the command.
* also check out [a simple UDB use case with the Amazon Product Review Dataset](UDB-Amazon.ipynb)

### Mini Data Science Projects with Essentia
- [Financial Service Complaints Dataset EDA](projects/data_gov_analysis/Financial%20Service%20Consumer%20Complaint%20Database.ipynb)

- [Amazon Product Reviews EDA without UDB](projects/amazon_review/Amazon%20Product%20Review%20Dataset%20Analysis.ipynb)
    * a simpler version that analyzes the dataset without using UDB (User Database)

- [Apache weblog analysis with Essentia](projects/weblog/Weblog%20Data%20Analysis.ipynb)
   * under construction...


## Recommended order for notebooks
Beginners are encouraged to follow this learning path to become familiar with the basic syntax and with simple usage examples of each command.

The following two notebooks cover the input, column and output specs for aq_tools in general.
1. [aq_input](aq_input.ipynb)
2. [aq_output](aq_output.ipynb)

The others cover what can be done with the `aq_pp` command's options, from conditional control to string manipulation. Feel free to go through them in any order you like.
- [aq_pp -eval](aq_pp%20-eval.ipynb)
- [aq_pp -filt](aq_pp%20-filt.ipynb)
- [aq_pp -map](aq_pp%20-map.ipynb)



## To-dos

### UDB Table, Vector and attribute super basic examples
#### [aq_udb](aq_udb.ipynb) new items being added
This command lets users interact with UDB (User Database)
1. add `-pp` examples
2. add the `-goto` option: basics, use cases with `-pp`, and the `post` attribute.
3. add `-del_key` and `-del_row` option samples
4. add a simple example of each attribute, along with its default behavior


#### [aq-output](aq-output.ipynb)
* add a tip for exporting data to different file names based on an if/else statement (see the image below)
<img src="img/output_if_trick.png">

#### aq-input
* key extraction from `xml` and `json` format


### Possible topics to be covered in the future
#### Project-wise: weblog anomaly detection using Essentia and SageMaker Random Cut Forest
**Potential Data Sources**<br>
* [U.S. Securities and Exchange Commission](https://www.sec.gov/dera/data/edgar-log-file-data-set.html)
* [SotM 34 web log raw dataset](http://old.honeynet.org/scans/scan34/) - in `essentia-playground/weblog_dataset/` bucket.

1. preprocess the data with Essentia and store it in an S3 bucket, in a form that SageMaker can digest
2. use RCF (Random Cut Forest) to perform anomaly detection
* scalable key counting technique using UDB and aq_pp, instead of `aq_cnt`, whose capacity is limited by the machine's physical memory
* `aq_pp -var` option.
    * outputting multiple variables with the `-ovar` option. Refer to [the last example in the official doc](http://auriq.com/documentation/source/reference/manpages/aq_pp.html#ovar)
* multiple table join, stream join example
* SQL conditional joins of multiple tables, based on non-primary keys as well
* [data analysis with Essentia cluster]()
* [SQL with ess and category]()
* [SQL with UDB using `udbsql`]()
* [weblog analysis with aq_tool]() - Project