diff --git a/docs/domain/document/index.md b/docs/domain/document/index.md
index a56a1861..72a5d912 100644
--- a/docs/domain/document/index.md
+++ b/docs/domain/document/index.md
@@ -9,8 +9,15 @@ Storing documents in CrateDB provides the same development convenience like the
 document-oriented storage layer of Lotus Notes / Domino, CouchDB, MongoDB, and
 PostgreSQL's `JSON(B)` types.

-- [](inv:cloud#object)
+- [](#objects-basics)
 - [Unleashing the Power of Nested Data: Ingesting and Querying JSON Documents with SQL]

 [Unleashing the Power of Nested Data: Ingesting and Querying JSON Documents with SQL]: https://youtu.be/S_RHmdz2IQM?feature=shared
+
+```{toctree}
+:maxdepth: 1
+:hidden:
+
+objects-hands-on
+```
diff --git a/docs/domain/document/objects-hands-on.md b/docs/domain/document/objects-hands-on.md
new file mode 100644
index 00000000..73e90adc
--- /dev/null
+++ b/docs/domain/document/objects-hands-on.md
@@ -0,0 +1,128 @@
(objects-basics)=

# Objects: Analyzing Marketing Data

Marketers often need to handle multi-structured data from different platforms.
CrateDB's dynamic `OBJECT` data type allows us to store and analyze this complex,
nested data efficiently. In this tutorial, we'll explore how to leverage this
feature in marketing data analysis, along with the use of generated columns to
parse and manage URLs.

Consider marketing data that captures details of various campaigns.

:::{code} json
{
  "campaign_id": "c123",
  "source": "Google Ads",
  "metrics": {
    "clicks": 500,
    "impressions": 10000,
    "conversion_rate": 0.05
  },
  "landing_page_url": "https://example.com/products?utm_source=google"
}
:::

To begin, let's create the schema for this dataset.

## Creating the Table

CrateDB uses SQL, the most popular query language for database management. To
store the marketing data, create a table with columns tailored to the
dataset using the `CREATE TABLE` command:

:::{code} sql
CREATE TABLE marketing_data (
    campaign_id TEXT PRIMARY KEY,
    source TEXT,
    metrics OBJECT(DYNAMIC) AS (
        clicks INTEGER,
        impressions INTEGER,
        conversion_rate DOUBLE PRECISION
    ),
    landing_page_url TEXT,
    url_parts GENERATED ALWAYS AS parse_url(landing_page_url)
);
:::

Let's highlight two features in this table definition:

:metrics: An `OBJECT` column featuring a dynamic structure for
  performing flexible queries on its nested attributes like
  clicks, impressions, and conversion rate.
:url_parts: A generated column that
  parses the URL in the `landing_page_url` column. This makes it convenient
  to query specific components of the URL later on.

The table is designed to accommodate both fixed and dynamic attributes,
providing a robust and flexible structure for storing your marketing data.


## Inserting Data

Now, insert the data using the `COPY FROM` SQL statement.

:::{code} sql
COPY marketing_data
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_marketing.json.gz'
WITH (format = 'json', compression='gzip');
:::

## Analyzing Data

Start with a basic `SELECT` statement on the `metrics` column, and limit the
output to display only 10 records, in order to quickly explore a few samples'
worth of data.

:::{code} sql
SELECT metrics
FROM marketing_data
LIMIT 10;
:::

You can see that the `metrics` column returns an object in JSON form.
If you just want to return a single property of this object, you can adjust the
query slightly by adding the property to the selection using bracket notation.
+ +:::{code} sql +SELECT metrics['clicks'] +FROM marketing_data +LIMIT 10; +::: + +It's helpful to select individual properties from a nested object, but what if +you also want to filter results based on these properties? For instance, to find +`campaign_id` and `source` where `conversion_rate` exceeds `0.09`, employ +the same bracket notation for filtering as well. + +:::{code} sql +SELECT campaign_id, source +FROM marketing_data +WHERE metrics['conversion_rate'] > 0.09 +LIMIT 50; +::: + +This allows you to narrow down the query results while still leveraging CrateDB's +ability to query nested objects effectively. + +Finally, let's explore data aggregation based on UTM source parameters. The +`url_parts` generated column, which is populated using the `parse_url()` +function, automatically splits the URL into its constituent parts upon data +insertion. + +To analyze the UTM source, you can directly query these parsed parameters. The +goal is to count the occurrences of each UTM source and sort them in descending +order. This lets you easily gauge marketing effectiveness for different sources, +all while taking advantage of CrateDB's powerful generated columns feature. + +:::{code} sql +SELECT + url_parts['parameters']['utm_source'] AS utm_source, + COUNT(*) +FROM marketing_data +GROUP BY 1 +ORDER BY 2 DESC; +::: + +In this tutorial, we explored the versatility and power of CrateDB's dynamic +`OBJECT` data type for handling complex, nested marketing data. diff --git a/docs/domain/search/index.md b/docs/domain/search/index.md index 680b965e..9884ff67 100644 --- a/docs/domain/search/index.md +++ b/docs/domain/search/index.md @@ -6,7 +6,7 @@ Learn how to set up your database for full-text search, how to create the relevant indices, and how to query your text data efficiently. A must-read for anyone looking to make sense of large volumes of unstructured text data. -- [](inv:cloud#full-text) +- [](#search-basics) :::{note} @@ -15,3 +15,10 @@ data sets. One of its standout features are its full-text search capabilities, built on top of the powerful Lucene library. This makes it a great fit for organizing, searching, and analyzing extensive datasets. ::: + +```{toctree} +:maxdepth: 1 +:hidden: + +search-hands-on +``` diff --git a/docs/domain/search/search-hands-on.md b/docs/domain/search/search-hands-on.md new file mode 100644 index 00000000..8aefd669 --- /dev/null +++ b/docs/domain/search/search-hands-on.md @@ -0,0 +1,111 @@ +(search-basics)= + +# Full-Text: Exploring the Netflix Catalog + +In this tutorial, we will explore how to manage a dataset of Netflix titles, +making use of CrateDB Cloud's full-text search capabilities. +Each entry in our imaginary dataset will have the following attributes: + +:show_id: A unique identifier for each show or movie. +:type: Specifies whether the title is a movie, TV show, or another format. +:title: The title of the movie or show. +:director: The name of the director. +:cast: An array listing the cast members. +:country: The country where the title was produced. +:date_added: A timestamp indicating when the title was added to the catalog. +:release_year: The year the title was released. +:rating: The content rating (e.g., PG, R, etc.). +:duration: The duration of the title in minutes or seasons. +:listed_in: An array containing genres that the title falls under. +:description: A textual description of the title, indexed using full-text search. + +To begin, let's create the schema for this dataset. 
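As a point of reference, a single catalog entry might look like the following.
This is a made-up record, shown only to illustrate the shape of the attributes
listed above; the actual dataset may differ in detail.

:::{code} json
{
  "show_id": "s1",
  "type": "Movie",
  "title": "Example Title",
  "director": "Jane Doe",
  "cast": ["Actor One", "Actor Two"],
  "country": "United States",
  "date_added": "2021-09-25T00:00:00Z",
  "release_year": "2020",
  "rating": "PG-13",
  "duration": "90 min",
  "listed_in": ["Documentaries"],
  "description": "A short synopsis that will be indexed for full-text search."
}
:::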
## Creating the Table

CrateDB uses SQL, the most popular query language for database management. To
store the data, create a table with columns tailored to the
dataset using the `CREATE TABLE` command.

Importantly, you will also take advantage
of CrateDB's full-text search capabilities by setting up a full-text index on
the `description` column. This will enable you to perform complex textual queries
later on.

:::{code} sql
CREATE TABLE "netflix_catalog" (
    "show_id" TEXT PRIMARY KEY,
    "type" TEXT,
    "title" TEXT,
    "director" TEXT,
    "cast" ARRAY(TEXT),
    "country" TEXT,
    "date_added" TIMESTAMP,
    "release_year" TEXT,
    "rating" TEXT,
    "duration" TEXT,
    "listed_in" ARRAY(TEXT),
    "description" TEXT INDEX USING FULLTEXT
);
:::

Run the above SQL command in CrateDB to set up your table. With the table ready,
you're now set to insert the dataset.

## Inserting Data

Now, insert data into the table you just created by using the `COPY FROM`
SQL statement.

:::{code} sql
COPY netflix_catalog
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_netflix.json.gz'
WITH (format = 'json', compression='gzip');
:::

Run the above SQL command in CrateDB to import the dataset. After this command
finishes, you are ready to start querying the dataset.

## Using Full-text Search

Start with a basic `SELECT` statement on all columns, and limit the output to
display only 10 records, in order to quickly explore a few samples' worth of data.

:::{code} sql
SELECT *
FROM netflix_catalog
LIMIT 10;
:::

CrateDB Cloud's full-text search can be leveraged to find specific entries based
on text matching. In this query, you are using the `MATCH` function on the
`description` field to find all movies or TV shows that contain the word "love".
The results can be sorted by relevance score by using the synthetic `_score` column.

:::{code} sql
SELECT title, description
FROM netflix_catalog
WHERE MATCH(description, 'love')
ORDER BY _score DESC
LIMIT 10;
:::

While full-text search is incredibly powerful, you can still perform more
traditional types of queries. For example, to find all titles directed by
"Kirsten Johnson", and sort them by release year, you can use:

:::{code} sql
SELECT title, release_year
FROM netflix_catalog
WHERE director = 'Kirsten Johnson'
ORDER BY release_year DESC;
:::

This query uses the conventional `WHERE` clause to find movies directed by
Kirsten Johnson, and the `ORDER BY` clause to sort them by their release year
in descending order.

Through these examples, you can see that CrateDB Cloud offers you a wide array
of querying possibilities, from basic SQL queries to advanced full-text
searches, making it a versatile choice for managing and querying your datasets.
diff --git a/docs/domain/timeseries/index.md b/docs/domain/timeseries/index.md
index f464ecd8..6ed5af3a 100644
--- a/docs/domain/timeseries/index.md
+++ b/docs/domain/timeseries/index.md
@@ -6,8 +6,8 @@ Learn how to optimally use CrateDB for time series use-cases.
 - [](#timeseries-basics)
 - [](#timeseries-normalize)
 - [Financial data collection and processing using pandas]
-- [](inv:cloud#time-series)
-- [](inv:cloud#time-series-advanced)
+- [](#timeseries-analysis)
+- [](#timeseries-objects)
 - [Time-series data: From raw data to fast analysis in only three steps]

 :::{toctree}
@@ -15,6 +15,8 @@ Learn how to optimally use CrateDB for time series use-cases.
 generate/index
 normalize-intervals
+timeseries-querying
+timeseries-and-metadata
 :::

 [Financial data collection and processing using pandas]: https://community.cratedb.com/t/automating-financial-data-collection-and-storage-in-cratedb-with-python-and-pandas-2-0-0/916
diff --git a/docs/domain/timeseries/timeseries-and-metadata.md b/docs/domain/timeseries/timeseries-and-metadata.md
new file mode 100644
index 00000000..af609e19
--- /dev/null
+++ b/docs/domain/timeseries/timeseries-and-metadata.md
@@ -0,0 +1,292 @@
(timeseries-objects)=

# Analyzing Device Readings with Metadata Integration

CrateDB is highly regarded as an optimal database solution for managing
time series data thanks to its unique blend of features. It is particularly
effective when you need to combine time series data with metadata, for
instance, in scenarios where data like sensor readings or log entries needs
to be augmented with additional context for more insightful analysis.


:::::{grid}
:padding: 0

::::{grid-item}
:class: rubric-slimmer
:columns: auto 6 6 6

:::{rubric} About
:::

CrateDB supports effective time series analysis with enhanced features
for fast aggregations.

- Rich data types for storing structured nested data (OBJECT) alongside
  time series data.
- A rich set of built-in functions for aggregations.
- Relational JOIN operations.
- Common table expressions (CTEs).

::::

::::{grid-item}
:class: rubric-slimmer
:columns: auto 6 6 6

:::{rubric} Data
:::
This tutorial illustrates how to effectively query time series data with
metadata, in order to conduct comprehensive data analysis.

It uses a time series dataset that includes telemetry readings from appliances,
such as battery, CPU, and memory information, as well as metadata information
like manufacturer, model, and firmware version.
::::

:::::


## Creating the Tables

CrateDB uses SQL, the most popular query language for database management. To
store the device readings and the device info data, define two tables with
columns tailored to the datasets.

To get started, let's use a time series dataset that captures various device
readings, such as battery, CPU, and memory information. Each record includes:

:ts: Timestamp when each reading was taken.
:device_id: Identifier of the device.
:battery: Object containing battery level, status, and temperature.
:cpu: Object containing average CPU loads over the last 1, 5, and 15 minutes.
:memory: Object containing information about the device's free and used memory.

The second dataset in this tutorial contains metadata information about various
devices. Each record includes:

:device_id: Identifier of the device.
:api_version: Version of the API that the device supports.
:manufacturer: Name of the manufacturer of the device.
:model: Model name of the device.
:os_name: Name of the operating system running on the device.
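To make this concrete, a single reading and a matching metadata record could
look roughly like the following. These are hypothetical values, shown only to
illustrate the shape of the two datasets.

:::{code} json
{
  "ts": "2023-07-01T12:00:00.000Z",
  "device_id": "device-42",
  "battery": {"level": 83, "status": "discharging", "temperature": 28.5},
  "cpu": {"avg_1min": 0.51, "avg_5min": 0.47, "avg_15min": 0.44},
  "memory": {"free": 1024000, "used": 3072000}
}
:::

:::{code} json
{
  "device_id": "device-42",
  "api_version": "23",
  "manufacturer": "Example Corp",
  "model": "mustang",
  "os_name": "linux"
}
:::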
Create the tables using the `CREATE TABLE` command:

:::{code} sql
CREATE TABLE IF NOT EXISTS doc.devices_readings (
    "ts" TIMESTAMP WITH TIME ZONE,
    "device_id" TEXT,
    "battery" OBJECT(DYNAMIC) AS (
        "level" BIGINT,
        "status" TEXT,
        "temperature" DOUBLE PRECISION
    ),
    "cpu" OBJECT(DYNAMIC) AS (
        "avg_1min" DOUBLE PRECISION,
        "avg_5min" DOUBLE PRECISION,
        "avg_15min" DOUBLE PRECISION
    ),
    "memory" OBJECT(DYNAMIC) AS (
        "free" BIGINT,
        "used" BIGINT
    )
);
:::

:::{code} sql
CREATE TABLE IF NOT EXISTS doc.devices_info (
    "device_id" TEXT,
    "api_version" TEXT,
    "manufacturer" TEXT,
    "model" TEXT,
    "os_name" TEXT
);
:::

Using objects in the `devices_readings` dataset allows for the structured and efficient organization of complex, nested data, enhancing both data integrity and flexibility.

## Inserting Data

Now, insert the data using the `COPY FROM` SQL statement.

:::{code} sql
COPY doc.devices_info
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/devices_info.json.gz'
WITH (compression='gzip', empty_string_as_null=true)
RETURN SUMMARY;
:::

:::{code} sql
COPY doc.devices_readings
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/devices_readings.json.gz'
WITH (compression='gzip', empty_string_as_null=true)
RETURN SUMMARY;
:::

## Time Series Analysis with Metadata


:::{rubric} JOIN Operations
:::
To illustrate `JOIN` operations, the first query retrieves 30 rows of combined data from the two tables, `doc.devices_readings` and `doc.devices_info`, based on a matching `device_id` in both. It effectively merges the detailed readings and corresponding device information, providing a comprehensive view of each device's status and metrics.

:::{code} sql
SELECT *
FROM doc.devices_readings r
JOIN doc.devices_info i ON r.device_id = i.device_id
LIMIT 30;
:::


:::{rubric} Aggregate Values
:::
The next query illustrates the calculation of summaries for aggregate values. In particular, it finds average battery levels (`avg_battery_level`) for each day and shows the result in ascending order.

:::{code} sql
SELECT date_trunc('day', ts) AS "day", AVG(battery['level']) AS avg_battery_level
FROM doc.devices_readings
GROUP BY "day"
ORDER BY "day";
:::


:::{rubric} Rolling Averages and Window Functions
:::
Rolling averages are crucial in time series analysis because they help smooth out short-term fluctuations and reveal underlying trends by averaging data points over a specified period. This approach is particularly effective in mitigating the impact of outliers and noise in the data, allowing for a clearer understanding of the true patterns in the time series.

The following example illustrates the average (`AVG`), minimum (`MIN`), and maximum (`MAX`) battery temperature over a window of the last 100 temperature readings (`ROWS BETWEEN 100 PRECEDING AND CURRENT ROW`). The window is defined in descending order by timestamp (`ts`) and can be adapted to support different use cases.
:::{code} sql
SELECT r.device_id,
       AVG(battery['temperature']) OVER w AS "last 100 temperatures",
       MIN(battery['temperature']) OVER w AS "min temperature",
       MAX(battery['temperature']) OVER w AS "max temperature"
FROM doc.devices_readings r
JOIN doc.devices_info i ON r.device_id = i.device_id
WINDOW w AS (ORDER BY "ts" DESC ROWS BETWEEN 100 PRECEDING AND CURRENT ROW);
:::


:::{rubric} Most Recent Observation
:::
The next query shows how to extract the most recent reading for each device of
the _mustang_ model. The query selects the latest timestamp (`MAX(r.ts)`),
which represents the most recent reading time, and the corresponding latest
readings for battery, CPU, and memory. It uses `MAX_BY` for each respective
component, using the timestamp as the determining factor.

These results are grouped by `device_id`, `manufacturer`, and `model` to ensure
that the latest readings for each unique device are included. This query is
particularly useful for monitoring the most current status of specific devices
in a fleet.

:::{code} sql
SELECT
    MAX(r.ts) as time,
    r.device_id,
    MAX_BY(r.battery, r.ts) as battery,
    MAX_BY(r.cpu, r.ts) as cpu,
    MAX_BY(r.memory, r.ts) as memory,
    i.manufacturer,
    i.model
FROM
    devices_readings r
JOIN
    devices_info i ON r.device_id = i.device_id
WHERE
    i.model = 'mustang'
GROUP BY
    r.device_id, i.manufacturer, i.model;
:::


:::{rubric} Common Table Expressions (CTEs)
:::
Finally, we illustrate the use of Common Table Expressions (CTEs) in a more
complex query that aggregates and analyzes device readings and metadata information.
The query relies on three CTEs to temporarily capture data.

:max_timestamp:
    Find the most recent timestamp (`MAX(ts)`) in the
    `doc.devices_readings` table. This CTE is used to focus the analysis
    on recent data.

:device_readings_agg:
    Calculate the average battery level and temperature for each
    device, but only for readings taken within the last week, as defined by
    `r.ts >= m.max_ts - INTERVAL '1 week'`.

:device_model_info:
    Select details from the `doc.devices_info` table, specifically
    the `device_id`, `manufacturer`, `model`, and `api_version`, but only for
    devices with an API version between 21 and 25.

The main `SELECT` statement joins the `device_readings_agg` and `device_model_info`
CTEs, and aggregates data to provide the average battery level and temperature
for each combination of manufacturer, model, and API version.
It also provides the number of readings (`COUNT(*)`) for each grouping.

The query aims to provide a detailed analysis of the battery performance (both level and temperature) for devices with specific API versions, while focusing only on recent data. It allows for a better understanding of how different models and manufacturers are performing in terms of battery efficiency within a specified API range and time frame.
+ +:::{code} sql +WITH +max_timestamp AS ( + SELECT MAX(ts) AS max_ts + FROM doc.devices_readings +), +device_readings_agg AS ( + SELECT + r.device_id, + AVG(r.battery['level']) AS avg_battery_level, + AVG(r.battery['temperature']) AS avg_battery_temperature + FROM + devices_readings r, max_timestamp m + WHERE + r.ts >= m.max_ts - INTERVAL '1 week' + GROUP BY + r.device_id +), +device_model_info AS ( + SELECT + device_id, + manufacturer, + model, + api_version + FROM + devices_info + WHERE + api_version BETWEEN 21 AND 25 +) +SELECT + info.manufacturer, + info.model, + info.api_version, + AVG(read.avg_battery_level) AS model_avg_battery_level, + AVG(read.avg_battery_temperature) AS model_avg_battery_temperature, + COUNT(*) AS readings_count +FROM + device_readings_agg read +JOIN + device_model_info info +ON + read.device_id = info.device_id +GROUP BY + info.manufacturer, + info.model, + info.api_version +ORDER BY + model_avg_battery_level DESC; +::: + + +:::{rubric} Conclusion +::: + +This tutorial has guided you through the process of querying and +analyzing time series data with CrateDB, demonstrating how to effectively merge +device metrics with relevant metadata. + +These techniques and queries are important for unlocking deeper insights into +device performance, equipping you with the skills needed to harness the full +potential of time series data in real-world applications. diff --git a/docs/domain/timeseries/timeseries-querying.md b/docs/domain/timeseries/timeseries-querying.md new file mode 100644 index 00000000..b989bd77 --- /dev/null +++ b/docs/domain/timeseries/timeseries-querying.md @@ -0,0 +1,158 @@ +(timeseries-analysis)= + +# Time Series: Analyzing Weather Data + +CrateDB is a powerful database designed to handle various use cases, one of +which is managing time series data. Time series data refers to collections of +data points recorded at specific intervals over time, like the hourly +temperature of a city or the daily sales of a store. + +:::::{grid} +:padding: 0 + +::::{grid-item} +:class: rubric-slimmer +:columns: auto 6 6 6 + +:::{rubric} About +::: + +Effectively query observations using enhanced features for time series data. + +Run aggregations with gap filling / interpolation, using common +table expressions (CTEs) and LAG / LEAD window functions. + +Find maximum values using the MAX_BY aggregate function, returning +the value from one column based on the maximum or minimum value of another +column within a group. +:::: + +::::{grid-item} +:class: rubric-slimmer +:columns: auto 6 6 6 + +:::{rubric} Data +::: +For this tutorial, imagine a dataset that captures weather +readings from CrateDB offices across the globe. Each record includes: + +:timestamp: The exact time of the recording. +:location: The location of the weather station. +:temperature: The temperature in degrees Celsius. +:humidity: The humidity in percentage. +:wind_speed: The wind speed in km/h. +:::: + +::::: + + +## Creating the Table + +CrateDB uses SQL, the most popular query language for database management. To +store the weather data, create a table with columns tailored to the +dataset using the `CREATE TABLE` command: + +:::{code} sql +CREATE TABLE "weather_data" ( + "timestamp" TIMESTAMP, + "location" VARCHAR, + "temperature" DOUBLE, + "humidity" DOUBLE, + "wind_speed" DOUBLE +); +::: + +Run the above SQL command in CrateDB to set up your table. With the table ready, +you are now set to insert the dataset. + + +## Inserting Data + +Insert the data using the `COPY FROM` SQL statement. 
:::{code} sql
COPY weather_data
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_weather.csv.gz'
WITH (format='csv', compression='gzip', empty_string_as_null=true);
:::


## Analyzing Data

Start with a basic `SELECT` statement on all columns, and limit the output to
display only 10 records, in order to quickly explore a few samples' worth of data.

:::{code} sql
SELECT *
FROM weather_data
LIMIT 10;
:::

CrateDB is built for fast aggregations, using columnar storage to speed up
queries. For example, calculate the average temperature for each location by using the
`AVG` aggregation function:

:::{code} sql
SELECT location, AVG(temperature) AS avg_temp
FROM weather_data
GROUP BY location;
:::

:::{rubric} MAX_BY Aggregate Functions
:::
Computing basic averages is nothing special, but what if you need to answer more detailed
questions? For example, you might want to know the highest temperature for each
location and when it occurred.

Simple groupings might not be enough, but
thankfully, CrateDB has enhanced tools for time series data. You can use the
`max_by(returned_value, maximized_value)` function, which gives you a value (like
the time) when another value (like the temperature) is at its highest.

Let's put this to use with the following query:

:::{code} sql
SELECT location,
       max(temperature) AS highest_temp,
       max_by(timestamp, temperature) AS time_of_highest_temp
FROM weather_data
GROUP BY location;
:::

:::{rubric} Gap Filling
:::
You have probably observed by now that there are gaps in the dataset for certain
metrics. Such occurrences are common, perhaps due to a sensor malfunction or
disconnection. To address this, the missing values need to be filled in.

Window functions paired with the `IGNORE NULLS` feature address this need.
Within a Common Table Expression (CTE), we utilize window functions to
spot the next and prior non-null temperature recordings, and then compute the
arithmetic mean to fill the gap.

:::{code} sql
WITH OrderedData AS (
    SELECT timestamp,
           location,
           temperature,
           LAG(temperature, 1) IGNORE NULLS OVER w AS prev_temp,
           LEAD(temperature, 1) IGNORE NULLS OVER w AS next_temp
    FROM weather_data
    WINDOW w AS (PARTITION BY location ORDER BY timestamp)
)
SELECT timestamp,
       location,
       temperature,
       COALESCE(temperature, (prev_temp + next_temp) / 2) AS interpolated_temperature
FROM OrderedData
ORDER BY location, timestamp;
:::

The `WINDOW` clause defines a window that partitions the data by location and
orders it by timestamp. This ensures that the `LAG` and `LEAD` window functions
operate within each location group chronologically. If the temperature value is
`NULL`, the query returns the interpolated value calculated as the average of the
previous and next available temperature readings. Otherwise, it uses the
original value.
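To spot-check only the rows that were filled in, you can reuse the same CTE and
restrict the outer query to readings where the original temperature is missing.
This is a small variation of the query above, shown here as a sketch:

:::{code} sql
WITH OrderedData AS (
    SELECT timestamp,
           location,
           temperature,
           LAG(temperature, 1) IGNORE NULLS OVER w AS prev_temp,
           LEAD(temperature, 1) IGNORE NULLS OVER w AS next_temp
    FROM weather_data
    WINDOW w AS (PARTITION BY location ORDER BY timestamp)
)
SELECT timestamp,
       location,
       (prev_temp + next_temp) / 2 AS interpolated_temperature
FROM OrderedData
WHERE temperature IS NULL
ORDER BY location, timestamp;
:::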