Move Cloud tutorials to Guide (#70)
* Move Cloud tutorials to Guide
matkuliak committed May 2, 2024
1 parent c2d5383 commit 8686293
Showing 7 changed files with 709 additions and 4 deletions.
9 changes: 8 additions & 1 deletion docs/domain/document/index.md
@@ -9,8 +9,15 @@ Storing documents in CrateDB provides the same development convenience as the
document-oriented storage layer of Lotus Notes / Domino, CouchDB, MongoDB, and
PostgreSQL's `JSON(B)` types.

- [](inv:cloud#object)
- [](#objects-basics)
- [Unleashing the Power of Nested Data: Ingesting and Querying JSON Documents with SQL]


[Unleashing the Power of Nested Data: Ingesting and Querying JSON Documents with SQL]: https://youtu.be/S_RHmdz2IQM?feature=shared

```{toctree}
:maxdepth: 1
:hidden:
objects-hands-on
```
128 changes: 128 additions & 0 deletions docs/domain/document/objects-hands-on.md
@@ -0,0 +1,128 @@
(objects-basics)=

# Objects: Analyzing Marketing Data

Marketers often need to handle multi-structured data from different platforms.
CrateDB's dynamic `OBJECT` data type allows us to store and analyze this complex,
nested data efficiently. In this tutorial, we'll explore how to leverage this
feature in marketing data analysis, along with the use of generated columns to
parse and manage URLs.

Consider marketing data that captures details of various campaigns.

:::{code} json
{
  "campaign_id": "c123",
  "source": "Google Ads",
  "metrics": {
    "clicks": 500,
    "impressions": 10000,
    "conversion_rate": 0.05
  },
  "landing_page_url": "https://example.com/products?utm_source=google"
}
:::

To begin, let's create the schema for this dataset.

## Creating the Table

CrateDB uses SQL, the most popular query language for database management. To
store the marketing data, create a table with columns tailored to the
dataset using the `CREATE TABLE` command:

:::{code} sql
CREATE TABLE marketing_data (
    campaign_id TEXT PRIMARY KEY,
    source TEXT,
    metrics OBJECT(DYNAMIC) AS (
        clicks INTEGER,
        impressions INTEGER,
        conversion_rate DOUBLE PRECISION
    ),
    landing_page_url TEXT,
    url_parts GENERATED ALWAYS AS parse_url(landing_page_url)
);
:::

Let's highlight two features in this table definition:

:metrics: An `OBJECT` column featuring a dynamic structure for
performing flexible queries on its nested attributes like
clicks, impressions, and conversion rate.
:url_parts: A generated column that
  decodes the URL from the `landing_page_url` column. This makes it convenient
  to query specific components of the URL later on.

The table is designed to accommodate both fixed and dynamic attributes,
providing a robust and flexible structure for storing your marketing data.
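
To check how CrateDB registered the nested attributes, one option is to inspect
`information_schema.columns`. This is a minimal sketch, assuming the table was
created under the name `marketing_data` as above. Sub-columns of the object are
expected to appear with subscript-style names such as `metrics['clicks']`.

```sql
-- List all columns of the table, including nested object attributes.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'marketing_data'
ORDER BY ordinal_position;
```

Because `metrics` is declared as `OBJECT(DYNAMIC)`, attributes that are not
declared here but show up in later inserts should be added to this list
automatically.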


## Inserting Data

Now, insert the data using the `COPY FROM` SQL statement.

:::{code} sql
COPY marketing_data
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_marketing.json.gz'
WITH (format = 'json', compression='gzip');
:::
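
`COPY FROM` does not return the imported rows, so a quick sanity check can be
useful. In this minimal sketch, `REFRESH TABLE` makes the freshly written rows
visible to the next query. The expected row count depends on the dataset file
and is not stated in this tutorial.

```sql
-- Make the imported rows visible immediately, then count them.
REFRESH TABLE marketing_data;

SELECT COUNT(*) AS imported_rows
FROM marketing_data;
```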

## Analyzing Data

Start with a basic `SELECT` statement on the `metrics` column, and limit the
output to 10 records to quickly explore a few sample rows.

:::{code} sql
SELECT metrics
FROM marketing_data
LIMIT 10;
:::

You can see that the `metrics` column returns an object, rendered as JSON.
To return just a single property of this object, add the property to the
selection using bracket notation.

:::{code} sql
SELECT metrics['clicks']
FROM marketing_data
LIMIT 10;
:::

It's helpful to select individual properties from a nested object, but what if
you also want to filter results based on these properties? For instance, to find
`campaign_id` and `source` where `conversion_rate` exceeds `0.09`, employ
the same bracket notation for filtering as well.

:::{code} sql
SELECT campaign_id, source
FROM marketing_data
WHERE metrics['conversion_rate'] > 0.09
LIMIT 50;
:::

This allows you to narrow down the query results while still leveraging CrateDB's
ability to query nested objects effectively.
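
Bracket notation also works inside aggregate functions. The following sketch
summarizes the nested metrics per `source`. The column aliases (`campaigns`,
`total_clicks`, `avg_conversion_rate`) are illustrative names, not part of the
dataset.

```sql
-- Aggregate nested object attributes per traffic source.
SELECT
    source,
    COUNT(*) AS campaigns,
    SUM(metrics['clicks']) AS total_clicks,
    AVG(metrics['conversion_rate']) AS avg_conversion_rate
FROM marketing_data
GROUP BY source
ORDER BY avg_conversion_rate DESC;
```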

Finally, let's explore data aggregation based on UTM source parameters. The
`url_parts` generated column, which is populated using the `parse_url()`
function, automatically splits the URL into its constituent parts upon data
insertion.
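
To see what the generated column stores, `parse_url()` can also be called
directly on a literal. This sketch uses the sample URL from the JSON record at
the top of this tutorial.

```sql
-- Inspect the object that parse_url() produces for a single URL.
SELECT parse_url('https://example.com/products?utm_source=google') AS url_parts;
```

The result is an object whose `parameters` key holds the query-string
parameters. In current CrateDB releases each parameter value is, to the best of
our knowledge, returned as an array, since a query string may repeat a key.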

To analyze the UTM source, you can directly query these parsed parameters. The
goal is to count the occurrences of each UTM source and sort them in descending
order. This lets you easily gauge marketing effectiveness for different sources,
all while taking advantage of CrateDB's powerful generated columns feature.

:::{code} sql
SELECT
    url_parts['parameters']['utm_source'] AS utm_source,
    COUNT(*)
FROM marketing_data
GROUP BY 1
ORDER BY 2 DESC;
:::

In this tutorial, we explored the versatility and power of CrateDB's dynamic
`OBJECT` data type for handling complex, nested marketing data.
9 changes: 8 additions & 1 deletion docs/domain/search/index.md
@@ -6,7 +6,7 @@ Learn how to set up your database for full-text search, how to create the
relevant indices, and how to query your text data efficiently. A must-read
for anyone looking to make sense of large volumes of unstructured text data.

- [](inv:cloud#full-text)
- [](#search-basics)


:::{note}
@@ -15,3 +15,10 @@ data sets. One of its standout features is its full-text search capabilities,
built on top of the powerful Lucene library. This makes it a great fit for
organizing, searching, and analyzing extensive datasets.
:::

```{toctree}
:maxdepth: 1
:hidden:
search-hands-on
```
111 changes: 111 additions & 0 deletions docs/domain/search/search-hands-on.md
@@ -0,0 +1,111 @@
(search-basics)=

# Full-Text: Exploring the Netflix Catalog

In this tutorial, we will explore how to manage a dataset of Netflix titles,
making use of CrateDB Cloud's full-text search capabilities.
Each entry in our imaginary dataset will have the following attributes:

:show_id: A unique identifier for each show or movie.
:type: Specifies whether the title is a movie, TV show, or another format.
:title: The title of the movie or show.
:director: The name of the director.
:cast: An array listing the cast members.
:country: The country where the title was produced.
:date_added: A timestamp indicating when the title was added to the catalog.
:release_year: The year the title was released.
:rating: The content rating (e.g., PG, R, etc.).
:duration: The duration of the title in minutes or seasons.
:listed_in: An array containing genres that the title falls under.
:description: A textual description of the title, indexed for full-text search.

To begin, let's create the schema for this dataset.


## Creating the Table

CrateDB uses SQL, the most popular query language for database management. To
store the data, create a table with columns tailored to the
dataset using the `CREATE TABLE` command.

Importantly, you will also take advantage
of CrateDB's full-text search capabilities by setting up a full-text index on
the description column. This will enable you to perform complex textual queries
later on.

:::{code} sql
CREATE TABLE "netflix_catalog" (
    "show_id" TEXT PRIMARY KEY,
    "type" TEXT,
    "title" TEXT,
    "director" TEXT,
    "cast" ARRAY(TEXT),
    "country" TEXT,
    "date_added" TIMESTAMP,
    "release_year" TEXT,
    "rating" TEXT,
    "duration" TEXT,
    "listed_in" ARRAY(TEXT),
    "description" TEXT INDEX USING FULLTEXT
);
:::

Run the above SQL command in CrateDB to set up your table. With the table ready,
you’re now set to insert the dataset.
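
Without further options, a full-text index uses the `standard` analyzer. If the
descriptions are all in English, a language-specific analyzer may improve
matching through stemming. The following is a sketch on a hypothetical table
`netflix_catalog_demo`, which is not part of this tutorial, using CrateDB's
built-in `english` analyzer.

```sql
-- Hypothetical variant: full-text index with a language-specific analyzer.
CREATE TABLE netflix_catalog_demo (
    "show_id" TEXT PRIMARY KEY,
    "title" TEXT,
    "description" TEXT INDEX USING FULLTEXT WITH (analyzer = 'english')
);
```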

## Inserting Data

Now, insert data into the table you just created, by using the `COPY FROM`
SQL statement.

:::{code} sql
COPY netflix_catalog
FROM 'https://github.com/crate/cratedb-datasets/raw/main/cloud-tutorials/data_netflix.json.gz'
WITH (format = 'json', compression='gzip');
:::

Run the above SQL command in CrateDB to import the dataset. After this command
finishes, you are ready to start querying the dataset.

## Using Full-text Search

Start with a basic `SELECT` statement on all columns, and limit the output to
10 records to quickly explore a few sample rows.

:::{code} sql
SELECT *
FROM netflix_catalog
LIMIT 10;
:::

CrateDB Cloud’s full-text search finds entries based on text matching. This
query uses the `MATCH` predicate on the `description` column to find all movies
or TV shows whose description contains the word "love". The results are sorted
by relevance using the `_score` system column.

:::{code} sql
SELECT title, description
FROM netflix_catalog
WHERE MATCH(description, 'love')
ORDER BY _score DESC
LIMIT 10;
:::
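
The `MATCH` predicate also accepts a match type and options. As a sketch, the
variant below returns the relevance score alongside the title and tolerates
single-character typos via the `fuzziness` option. The choice of `best_fields`
and `fuzziness = 1` is illustrative and not taken from the original tutorial.

```sql
-- Show the relevance score and allow fuzzy matches on the search term.
SELECT title, _score
FROM netflix_catalog
WHERE MATCH(description, 'love') USING best_fields WITH (fuzziness = 1)
ORDER BY _score DESC
LIMIT 10;
```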

While full-text search is incredibly powerful, you can still perform more
traditional types of queries. For example, to find all titles directed by
"Kirsten Johnson", and sort them by release year, you can use:

:::{code} sql
SELECT title, release_year
FROM netflix_catalog
WHERE director = 'Kirsten Johnson'
ORDER BY release_year DESC;
:::

This query uses the conventional `WHERE` clause to find movies directed by
Kirsten Johnson, and the `ORDER BY` clause to sort them by their release year
in descending order.
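
Both styles can be combined in one statement. The sketch below restricts a
full-text match to a given type and genre. The literal values `'Movie'` and
`'Comedies'` are assumptions about the dataset contents, so adjust them to the
values you actually see in the `type` and `listed_in` columns.

```sql
-- Combine full-text matching with regular and array-based filters.
SELECT title, release_year, _score
FROM netflix_catalog
WHERE MATCH(description, 'love')
  AND "type" = 'Movie'                -- assumed value
  AND 'Comedies' = ANY(listed_in)     -- assumed genre label
ORDER BY _score DESC
LIMIT 10;
```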

Through these examples, you can see that CrateDB Cloud offers you a wide array
of querying possibilities, from basic SQL queries to advanced full-text
searches, making it a versatile choice for managing and querying your datasets.
6 changes: 4 additions & 2 deletions docs/domain/timeseries/index.md
@@ -6,15 +6,17 @@ Learn how to optimally use CrateDB for time series use-cases.
- [](#timeseries-basics)
- [](#timeseries-normalize)
- [Financial data collection and processing using pandas]
- [](inv:cloud#time-series)
- [](inv:cloud#time-series-advanced)
- [](#timeseries-analysis)
- [](#timeseries-objects)
- [Time-series data: From raw data to fast analysis in only three steps]

:::{toctree}
:hidden:

generate/index
normalize-intervals
timeseries-querying
timeseries-and-metadata
:::

[Financial data collection and processing using pandas]: https://community.cratedb.com/t/automating-financial-data-collection-and-storage-in-cratedb-with-python-and-pandas-2-0-0/916