diff --git a/README.md b/README.md
index 43799c994..ea12d65a4 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@
 [![Read the Docs](https://img.shields.io/readthedocs/dask-sql)](https://dask-sql.readthedocs.io/en/latest/)
 [![Codecov](https://img.shields.io/codecov/c/github/nils-braun/dask-sql?logo=codecov)](https://codecov.io/gh/nils-braun/dask-sql)
 [![GitHub](https://img.shields.io/github/license/nils-braun/dask-sql)](https://github.com/nils-braun/dask-sql/blob/main/LICENSE.txt)
+[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nils-braun/dask-sql-binder/main?urlpath=lab)
 
 `dask-sql` adds a SQL query layer on top of `dask`.
 This allows you to query and transform your dask dataframes using
diff --git a/notebooks/Custom Functions.ipynb b/notebooks/Custom Functions.ipynb
new file mode 100644
index 000000000..3d5cc7f16
--- /dev/null
+++ b/notebooks/Custom Functions.ipynb
@@ -0,0 +1,160 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Custom Functions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Apart from the SQL functions that are already implemented in `dask-sql`, it is possible to add custom functions and aggregations.\n",
+    "Have a look at [the documentation](https://dask-sql.readthedocs.io/en/latest/pages/custom.html) for more information."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import dask.dataframe as dd\n",
+    "import dask.datasets\n",
+    "from dask_sql.context import Context"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We use some generated test data in this notebook:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c = Context()\n",
+    "\n",
+    "df = dask.datasets.timeseries().reset_index().persist()\n",
+    "c.create_table(\"timeseries\", df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a first step, we will create a scalar function to calculate the absolute value of a column.\n",
+    "(Please note that this can also be done via the `ABS` function in SQL):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The input to the function will be a dask series\n",
+    "def my_abs(x):\n",
+    "    return x.abs()\n",
+    "\n",
+    "# As SQL is a typed language, we need to specify the parameter and return types\n",
+    "c.register_function(my_abs, \"MY_ABS\", parameters=[(\"x\", np.float64)], return_type=np.float64)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We are now able to use our new function in any query:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c.sql(\"\"\"\n",
+    "    SELECT\n",
+    "        x, y, MY_ABS(x) AS \"abs_x\", MY_ABS(y) AS \"abs_y\"\n",
+    "    FROM\n",
+    "        \"timeseries\"\n",
+    "    WHERE\n",
+    "        MY_ABS(x * y) > 0.5\n",
+    "\"\"\").compute()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we will register an aggregation, which takes a column as input and returns a single value.\n",
+    "An aggregation needs to be an instance of `dask.dataframe.Aggregation` (see the [dask documentation](https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate))."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The first callable aggregates within each partition, the second combines the per-partition results\n",
+    "my_sum = dd.Aggregation(\"MY_SUM\", lambda x: x.sum(), lambda x: x.sum())\n",
+    "\n",
+    "c.register_aggregation(my_sum, \"MY_SUM\", [(\"x\", np.float64)], np.float64)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c.sql(\"\"\"\n",
+    "    SELECT\n",
+    "        name, MY_SUM(x) AS \"my_sum\"\n",
+    "    FROM\n",
+    "        \"timeseries\"\n",
+    "    GROUP BY\n",
+    "        name\n",
+    "    LIMIT 10\n",
+    "\"\"\").compute()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/notebooks/Introduction.ipynb b/notebooks/Introduction.ipynb
new file mode 100644
index 000000000..7886675c2
--- /dev/null
+++ b/notebooks/Introduction.ipynb
@@ -0,0 +1,175 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# dask-sql Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`dask-sql` lets you query your (dask) data using standard SQL.\n",
+    "You can find more information on the usage in the [documentation](https://dask-sql.readthedocs.io/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask_sql import Context\n",
+    "from dask.datasets import timeseries\n",
+    "from dask.distributed import Client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a first step, we will create a dask client to connect to a local dask cluster (which is started implicitly).\n",
+    "You can open the dashboard by clicking on the shown link (in binder, this is already open on the left)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = Client()\n",
+    "client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we create a context to hold the registered tables.\n",
+    "You typically only do this once in your application."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "c = Context()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the data and register it in the context. This will give the table a name.\n",
+    "In this example, we generate random data.\n",
+    "It is also possible to load data from files, S3, HDFS, etc.\n",
+    "Have a look at [Data Loading](https://dask-sql.readthedocs.io/en/latest/pages/data_input.html) for more information."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = timeseries()\n",
+    "c.create_table(\"timeseries\", df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now execute a SQL query.\n",
+    "The result is a dask dataframe.\n",
+    "\n",
+    "The query looks for the id with the highest x for each name (this is just random test data, but you could think of it as looking for outliers in sensor data)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result = c.sql(\"\"\"\n",
+    "    SELECT\n",
+    "        lhs.name,\n",
+    "        lhs.id,\n",
+    "        lhs.x\n",
+    "    FROM\n",
+    "        timeseries AS lhs\n",
+    "    JOIN\n",
+    "    (\n",
+    "        SELECT\n",
+    "            name AS max_name,\n",
+    "            MAX(x) AS max_x\n",
+    "        FROM timeseries\n",
+    "        GROUP BY name\n",
+    "    ) AS rhs\n",
+    "    ON\n",
+    "        lhs.name = rhs.max_name AND\n",
+    "        lhs.x = rhs.max_x\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can show the result:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result.compute()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "... or use it in any other dask computation:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result.x.mean().compute()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}