# BlazingSQL

BlazingSQL (https://blazingsql.com/) is an independent, open-source element of RAPIDS that offers high-performance distributed SQL processing on GPU.

Most of RAPIDS can run on GPUs with CUDA compute capability 6.0, but BlazingSQL requires 6.1. That's not a huge issue, but in a cutting-edge technical plan it's useful to be precise.

In [None]:
! nvidia-smi
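
The default `nvidia-smi` output does not list the compute capability, so the 6.1 requirement mentioned above is easy to miss. Here is a minimal sketch of the comparison itself. The helper names `parse_compute_capability` and `supports_blazingsql` are made up for this illustration and are not part of RAPIDS or BlazingSQL; the 6.1 minimum is taken from the note above.

```python
# Hypothetical helpers (not part of RAPIDS or BlazingSQL) that compare a GPU's
# compute capability against the 6.1 minimum stated above.
MIN_BLAZINGSQL_CC = (6, 1)


def parse_compute_capability(text):
    """Turn a string such as '7.5' into a (major, minor) tuple of ints."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def supports_blazingsql(cc, minimum=MIN_BLAZINGSQL_CC):
    """Return True if the (major, minor) pair is at least the minimum."""
    # Tuples compare element by element, so (6, 0) < (6, 1) < (7, 0).
    return tuple(cc) >= tuple(minimum)


print(supports_blazingsql(parse_compute_capability("6.0")))  # False
print(supports_blazingsql(parse_compute_capability("7.5")))  # True
```

If Numba is installed alongside RAPIDS, `numba.cuda.get_current_device().compute_capability` should return a `(major, minor)` tuple that can be passed straight to `supports_blazingsql`.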

Let's do a quick test to make sure BlazingSQL -- which involves a number of background services -- is online.

We'll also demo how to define a BlazingSQL table over any existing cuDF DataFrame.

In [None]:
import cudf

gdf = cudf.DataFrame({'test':[1,2,3]})
print(gdf)
print(gdf.describe())

In [None]:
from blazingsql import BlazingContext

In [None]:
bc = BlazingContext()

In [None]:
bc.create_table('foo', gdf)

In [None]:
# Query

bc.sql('SELECT * FROM foo ORDER BY test DESC')

### What about processing my data lake?

In some cases, we may have data in a cuDF DataFrame, but -- maybe more often -- we're using SQL early in the pipeline to perform ETL, joins, or coarse-grained feature extraction over our data lake.

So we want to consume Parquet, CSV, and other formats straight from S3, HDFS, etc. (docs are at https://docs.blazingdb.com/docs)
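
As a rough sketch of what reading from S3 might look like: the helper `register_s3_table` below is hypothetical, and it assumes that `BlazingContext` exposes an `s3(prefix, bucket_name=..., ...)` registration method and that `create_table` accepts an `s3://prefix/path` URI. Check both against the docs linked above before relying on them. The bucket and object names in the usage comment are placeholders.

```python
def register_s3_table(bc, table_name, bucket, key, prefix=None, **credentials):
    """Register an S3 bucket on a BlazingContext and define a table over one object.

    Hypothetical helper. `bc` is expected to be a blazingsql.BlazingContext;
    the `bc.s3(...)` signature used here is an assumption, not verified.
    `credentials` may carry keyword arguments such as access keys, if needed.
    """
    prefix = prefix or bucket
    # Assumed API: register the bucket under a filesystem prefix.
    bc.s3(prefix, bucket_name=bucket, **credentials)
    # Build the URI that refers to the object through the registered prefix.
    uri = f"s3://{prefix}/{key.lstrip('/')}"
    bc.create_table(table_name, uri)
    return uri


# Usage (placeholders, requires a working BlazingSQL install):
# register_s3_table(bc, 'beer', 'my-bucket', 'data/beer_small.parquet')
```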

For simplicity, we'll use a local file here for a quick demo.

In [None]:
import os

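# The trailing '' keeps the final path separator, so file names can be appended directly
data_path = os.path.join(os.getcwd(), 'data', '')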

In [None]:
bc.create_table('beer', data_path + 'beer_small.csv', header=True)

In [None]:
result_gdf = bc.sql("SELECT * FROM beer WHERE brewery_name='Sunday River Brewing Co.'")
result_gdf

In [None]:
bc.sql("SELECT * FROM beer WHERE beer_style='Belgian IPA' AND review_overall > 4.5 ORDER BY brewery_name")

In [None]:
result = bc.sql("SELECT brewery_name, count(*) AS number FROM beer WHERE beer_style='Belgian IPA' AND review_overall > 4.5 GROUP BY brewery_name ORDER BY number DESC")
result
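
To see what the aggregation query computes without a GPU, here is a small CPU-side sanity check using Python's built-in `sqlite3`. The rows are invented for illustration and are not from `beer_small.csv`. SQL dialects differ between engines, so this only checks the filter, `GROUP BY`, and `ORDER BY` logic, not BlazingSQL itself.

```python
import sqlite3

# In-memory table with the three columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE beer (brewery_name TEXT, beer_style TEXT, review_overall REAL)"
)

# Made-up rows, chosen so that some are filtered out.
rows = [
    ("Brewery A", "Belgian IPA", 5.0),
    ("Brewery A", "Belgian IPA", 5.0),
    ("Brewery A", "Belgian IPA", 4.0),  # dropped: review_overall is not > 4.5
    ("Brewery B", "Belgian IPA", 5.0),
    ("Brewery B", "Stout", 5.0),        # dropped: wrong beer_style
]
conn.executemany("INSERT INTO beer VALUES (?, ?, ?)", rows)

# Same query text as the BlazingSQL cell above.
query = (
    "SELECT brewery_name, count(*) AS number FROM beer "
    "WHERE beer_style='Belgian IPA' AND review_overall > 4.5 "
    "GROUP BY brewery_name ORDER BY number DESC"
)
counts = conn.execute(query).fetchall()
print(counts)  # [('Brewery A', 2), ('Brewery B', 1)]
```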

In [None]:
import pandas as pd

pdf = result.to_pandas()

In [None]:
%matplotlib inline

pdf.plot.bar('brewery_name', 'number', figsize=(16,10))