Sql2Awk

This little binary will translate your SQL, only query statement, into AWK script. You can directly run the generated AWK script or combine the AWK with other traditional linux command line tools to form your own anlytical tools.

Features

Query
- Join
  - Currently, only natural join is supported
- Aggregation Function
  - Min
  - Max
  - Avg
  - Sum
  - Count
    - Count(*) is allowed
  - Percentile
    - Require GAWK function
  - Histogram
- Group by
- Order by
  - Asc/Desc order both supports
  - Notes, to support this feature, the generated code will have to use GAWK function asort/asorti
- Distinct
- Limit
- Star/Wildcard matching
- Like/Regex/Pattern
  - like/not like operator are supported
    - By translating them into regexp
  - match/not match operator are supported
    - Powered by AWK/GAWK's regex
Just AWK/GAWK code
- No other runtime tools/library/binary is needed for execution
- Order by/Percentile requires GAWK (asort/asorti function)
Advanced Features
- Special Aggregation Functions
  - Percentile
    - Calculate the percentile of column, for example getting median number
  - Histogram
    - Calculate the histogram of certain column
- Wildcard matching
  - Match a specific set of columns by specifying a regex expression
  - Match a specific line by specifying a regex expression
- Rewrite
  - For each table been selected, one can use rewrite keyword to rewrite column of table
- Format
  - Allow fine grained format of the output in terminal. Like color the output for better visibility
File Format
- Tab/Tabular
  - Using tab/tabular type inside of from clause indicates that the file is parsed as delimter separated file, by default it just uses space delimited file
    - select * from tab("sample1.txt")
    - select * from tab("sample1.txt", ":", 1, 200)
      - selecting a every fields separated by ":", starting from line 1 until line 200
- CSV
  - Using csv/xsv type inside of from clause indicates that the file is parsed as csv file, this will correctly handle quoted string, but it is not performant
    - select * from csv("sample1.txt")
    - select * from csv("sample1.txt", ",", 1, 200)
      - selecting a every fields via CSV syntax separated by ",", starting from line 1 until line 200
Scheme
- No scheme is needed, use $N to reference the N'th field
- $1 represents first field, $2 second, ...
- $0 represents the full line
- $FN represents the field count after parsing
- rownum/$rownum represents row's number/index, starting from 1
Builtin Functions
- Type
  - is_decimal
  - is_integer
  - is_string
  - is_empty
  - type
  - cast
- String
  - string_length
  - string_to_lower
  - string_to_upper
  - string_substr
  - string_index
  - string_include
  - string_format
    - sprintf
- Math
  - math_cos
  - math_sin
  - math_sqrt
  - math_exp
  - math_int
  - math_log
  - math_atan2
- Bit
  - bit_and
  - bit_or
  - bit_not
  - bit_lshift
  - bit_rshift
- Base64
  - base64_decode
  - base64_encode

Caveats

Type is limited
- AWK/GAWK can only support numerical type and string type
- NULL is missing
CSV is not performant
- The support of CSV is not performant. The CSV parser is written in AWK and it will have to scan each character inside of the line to parse the quoted string etc ...

Example

----------------------------------------------------------------
-- Basic SQL features
----------------------------------------------------------------

-- select all the fields from the file, delimited by space
select * from tab("sample.txt")

-- distinct
select distinct * from tab("sample.txt")

-- select 1st first field
select $1 from tab("sample.txt")

-- aggregation

select count($1) from tab("sample.txt")
select avg($1) from tab("sample.txt")
select sum($1) from tab("sample.txt")
select max($1) from tab("sample.txt")
select min($1) from tab("sample.txt")

-- aggregation with group by
select count($1), $2+100
from tab("sample.txt")
group by $2

-- filter
select *
from tab("sample.txt")
where $1 > 10 && $2 != 20

-- having
select min($1)
from tab("sample.txt")
group by $2
having max($3) > 10

-- order by, require gawk to run
select *
from tab("sample.txt")
order by $2

-- special aggregation
select percentile($1, 10) # 10% high value
from tab("sample.txt")

select histogram($1, 1, 20, 5) # histgoram distribution with min/max/# of bins
from tab("sample.txt")

-- join
select t1.$1, t2.$2
from tab("sample1.txt") as t1,
     tab("sample2.txt") as t2
where t1.$2 == t2.$1

----------------------------------------------------------------
-- Advanced SQL features
----------------------------------------------------------------

-- row/column wildcard filtering

-- filtering all lines in tabluar file with specified regex pattern
select ROWS("[0-9][0-9]")
from tab("sample1.txt")

-- filtering all columns that is composed by specified regex pattern
select COLUMNS("[a-z][a-z]+")
from tab("sample1.txt")

-- filtering certain table's ROW or COLUMNS
select t1.COLUMNS("abc*"), t2.ROWS("[0-9]+")
from tab("sample1.txt") as t1,
     tab("sample2.txt") as t2

-- calling builtin functions
select *
from tab("sample1.txt")
where is_string($2)

-- Rewrite
-- rewrite the table value before filtering, joining etc ... rewrite phase
-- happened at very early stage
select *
from
  tab("sample1.txt") as t1
  rewrite
    when $1 > 0 then set $1 = "Positive";
    when $2 < 0 then set $2 = "Negative";
  end,

  tab("sample2.txt") as t2
  rewrite
    when $1 % 2 == 0 then set $1 = "Even";
    when $2 % 2 != 0 then set $2 = "Odd";
  end

-- Format
-- print the title bar
select *
from tab("sample1.txt")
format title=true, border="|";

-- print the second column in red
select *
from tab("sample1.txt")
format title=true, border="|", column(1)="red";

Status

Currently, it is still under active development and testing. Bugs are expected and you are more than welcomed to submit an issue.

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
_script		_script
cg		cg
plan		plan
sql		sql
.gitignore		.gitignore
README.md		README.md
build.sh		build.sh
go.mod		go.mod
go.sum		go.sum
main.go		main.go
test.sh		test.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

_script

_script

cg

cg

plan

plan

sql

sql

.gitignore

.gitignore

README.md

README.md

build.sh

build.sh

go.mod

go.mod

go.sum

go.sum

main.go

main.go

test.sh

test.sh

Repository files navigation

Sql2Awk

Features

Caveats

Example

Status

About

Releases

Packages

Languages

dianpeng/sql2awk

Folders and files

Latest commit

History

Repository files navigation

Sql2Awk

Features

Caveats

Example

Status

About

Resources

Stars

Watchers

Forks

Languages