Skip to content

Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).

License

Notifications You must be signed in to change notification settings

casualcomputer/sql.mechanic

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Summarizing large tables with SQL

Henry Luan

Overview

The core helper function get_summary_codes in the sql.mechanic package takes in a string that specifies a table’s database name, schema name, and table name. It outputs an SQL query that summarizes the table’s column statistics.

If you don’t have much time to read or just want a quick solution, jump to Example 2.

Intended User

  • If you work with tables inside databases hosted on powerful servers (usually on-premise) but have limited compute resources (e.g. RAM, GPU, CPU) for advanced BI (e.g. Python, R, SAS, etc.)

  • If it’s much cheaper for you to use database servers than analytics servers (e.g. those for Python, R, etc.) either on-premise or on the cloud.

Limitations

  • You probably want to understand how Example 2affect the CPU and disk usage of your database server, to avoid bad surprises on your server’s resource usage.

  • If you use the setup of Example 2on a cloud database, you MUST do some testing, to understand how the example affects the CPU and disk usage of your cloud resource. Please avoid potentially expensive mistakes.

  • Currently, the function only works with Microsoft SQL Server and Netezza databases. Feel free to contribute to the codes, if interested.

Credits

Specail credits: the R codes in this package are built on top of Gordon S. Linoff’s book Data Analysis Using SQL and Excel, 2nd Edition. His work has been a tremendous inspiration for the creation of this package.

Example 1: Generate SQL queries and execute them in DMBS

Step 1: Install packages

You can install the library from my GitHub. If you have concerns regarding the package’s security, you can download, check, and use the “get_summary_codes.R” file directly.

# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)

# Alternative: use the 'get_summary_codes.R' file only
  # source("get_summary_codes.R")

Step 2: Generate SQL quires

The following codes 1) generate the SQL queries you need to summarize a table, and 2) copy (Ctrl+C) the codes to your clipboard. All you have to do is paste it to your SQL editor and execute the queries.

library(sql.mechanic)

#SQL codes for basic summary, Netezza database 
    sql_basic_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")   
    
#SQL codes for advanced summary, Netezza database 
    sql_advanced_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="Netezza")   
    
#SQL codes for basic summary, Microsoft SQL Server 
    sql_basic_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="MSSQL")  

#SQL codes for advanced summary, Microsoft SQL Server
    sql_advanced_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="MSSQL")  
    
#copy some of the sql queries to the clipboard    
    writeClipboard(sql_basic_netezza) 

Step 3: Paste the codes in your clipboard and run it in SQL

In case you are curious, the SQL copied to your clipboard looks like this.

SELECT REPLACE(REPLACE(REPLACE('<start> SELECT ''<col>'' as colname,
                               COUNT(*) as numvalues,
                               MAX(freqnull) as freqnull,
                               CAST(MIN(minval) as CHAR(100)) as minval,
                               SUM(CASE WHEN <col> = minval THEN freq ELSE 0 END) as numminvals,
                               CAST(MAX(maxval) as CHAR(100)) as maxval,
                               SUM(CASE WHEN <col> = maxval THEN freq ELSE 0 END) as nummaxvals,
                               SUM(CASE WHEN freq =1 THEN 1 ELSE 0 END) as numuniques

                               FROM (SELECT <col>, COUNT(*) as freq
                               FROM SCHEMA_NAME.<tab> GROUP BY <col>) osum
                                                   CROSS JOIN (SELECT MIN(<col>) as minval, MAX(<col>) as maxval, SUM(CASE WHEN <col> IS NULL THEN 1 ELSE 0 END) as freqnull
                                                   FROM (SELECT <col> FROM SCHEMA_NAME.<tab>) osum
                                                   ) summary',
                               '<col>', column_name),
                               '<tab>', 'TABLE_NAME'),
                               '<start>',
                               (CASE WHEN ordinal_position = 1 THEN ''
                               ELSE 'UNION ALL' END)) as codes_data_summary
                               FROM (SELECT table_name, case when regexp_like(column_name,'[a-z.]|GROUP')  then '"'||column_name||'"'
                                                             else column_name end as column_name  , ordinal_position
                               FROM information_schema.columns
                               WHERE table_name ='TABLE_NAME') a;

Step 4: Copy, paste and execute the query results from Step 3.

Example 2: Automatically summarize tables in your databases

This example shows you how you can summarize tables as you did in Example 1, with only a few lines of R codes.

# Install package
library(devtools)
install_github("casualcomputer/sql.mechanic",quiet=TRUE)

# Alternative: use the 'get_summary_codes.R' file only
  # source("get_summary_codes.R")

# Load packages  
library(sql.mechanic)
library(odbc)
library(DBI)

# Connect to database(s)
## Method 1: prompting user (you need to make some changes here)
  con <- dbConnect(odbc(),
                   Driver = "SQL Server",
                   Server = "mysqlhost",
                   Database = "mydbname",
                   UID = "myuser",
                   PWD = "Database password",
                   Port = 1433, encoding = 'windows-1252') #'windows-1252' allows French to display properly

## Alternative Method: Using a DSN
  #con <- dbConnect(odbc::odbc(), "DNS_NAMES", encoding = 'windows-1252')

sql_query = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza") 

res = dbSendQuery(con, sql_query) # part of Step 3 in "Example 1"
sql_query_mod = dbFetch(res) # part of Step 3 in "Example 1"

res = dbSendQuery(con, sql_query_mod) # part of Step 4 in "Example 1"
output_table = dbFetch(res) # part of Step 4 in "Example 1"
print(output) #desired summary table

dbDisconnect(con) #close database connection

About

Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages