# CrateDB Full-Text and Vector Search Workshop

TODO overview.

## Install Dependencies

First, install the required dependencies by executing the `pip install` command below.

In [None]:
! pip install ipyleaflet sqlalchemy-cratedb pandas

## Connect to CrateDB

Before going any further, you'll need to update the code below to include a connection string for your CrateDB cluster.  If you prefer, you can set the environment variable `CRATEDB_CONNECTION_STRING` instead.

The code below connects to CrateDB running locally (for example with [Docker](https://hub.docker.com/_/crate)).  If you're using a managed [CrateDB Cloud](https://console.cratedb.cloud/) cluster, comment out the "localhost" code block and uncomment the CrateDB Cloud code block instead.

In [1]:
import os
import sqlalchemy as sa

# Define database address when using CrateDB Cloud.
# Please find these settings on your cluster overview page.
#CONNECTION_STRING = os.environ.get(
#   "CRATEDB_CONNECTION_STRING",
#   "crate://<USERNAME>:<PASSWORD>@<HOST>/?ssl=true",
#)

# Define database address when using CrateDB on localhost.
CONNECTION_STRING = os.environ.get(
  "CRATEDB_CONNECTION_STRING",
  "crate://crate@localhost/",
)

# Connect to CrateDB using SQLAlchemy.
engine = sa.create_engine(CONNECTION_STRING, echo=os.environ.get("DEBUG"))
connection = engine.connect()
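
If you set `CRATEDB_CONNECTION_STRING` in your environment, it can be useful to confirm which cluster the notebook will talk to without printing the password. The following is a minimal sketch using only the Python standard library; `describe_connection` is a hypothetical helper introduced here for illustration and is not part of SQLAlchemy or CrateDB.

```python
import os
from urllib.parse import urlsplit


def describe_connection(url):
    """Return a password-free summary of a connection string (hypothetical helper)."""
    parts = urlsplit(url)
    user = parts.username or "<none>"
    host = parts.hostname or "<none>"
    # Report whether the URL asks for an encrypted connection via ?ssl=true.
    uses_ssl = "ssl=true" in parts.query.lower()
    return f"user={user} host={host} ssl={uses_ssl}"


# Same lookup as the cell above: the environment variable wins, else localhost.
url = os.environ.get("CRATEDB_CONNECTION_STRING", "crate://crate@localhost/")
print(describe_connection(url))
```

Only the user, host, and SSL flag are printed, so the output is safe to leave in a shared notebook.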

## Create a Community Areas Table

First, you'll need to create a table to store the community areas data in.  You may already have a `community_areas` table that was created by following other CrateDB workshops.  The code below drops any existing table with that name and replaces it with a new version.  This new version has an additional column, `description_vec`, in the `details` object.  You'll learn what this is for later in this workshop!

In [5]:
_ = connection.execute(sa.text(
"""
DROP TABLE IF EXISTS community_areas
"""
))

_ = connection.execute(sa.text(
"""
CREATE TABLE IF NOT EXISTS community_areas (
   areanumber INTEGER PRIMARY KEY,
   name TEXT,
   details OBJECT(DYNAMIC) AS (
       description TEXT INDEX USING fulltext,
       description_vec FLOAT_VECTOR(2048),
       population BIGINT
   ),
   boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025)
);
"""))

## Load the Data

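Next, load the community areas data, which is stored as a JSON file on GitHub.  The `COPY ... FROM` statement below imports the file directly from its URL, and `RETURN SUMMARY` reports how many records were loaded and how many failed.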

In [6]:
def display_results(table_name, info):
    print(f"{table_name}: loaded {info['success_count']}, errors: {info['error_count']}")

    if info["error_count"] > 0:
        print(f"Errors: {info['errors']}")

# Load the community areas data file.
result = connection.execute(sa.text("""
    COPY community_areas 
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/chicago_community_areas_with_vectors.json' 
    RETURN SUMMARY;                                  
    """))

display_results("community_areas", result.mappings().first())

community_areas: loaded 77, errors: 0
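
If you'd rather have the notebook stop on a bad load than rely on reading the output, the check can be automated. This is a minimal sketch, assuming the summary row exposes `success_count`, `error_count`, and `errors` as used by `display_results` above; `require_clean_load` is a hypothetical helper, and the sample dictionary only mimics the shape of the summary row.

```python
def require_clean_load(table_name, info, expected_rows=None):
    """Raise if a COPY summary reports errors or an unexpected row count (hypothetical helper)."""
    if info["error_count"] > 0:
        raise RuntimeError(
            f"{table_name}: {info['error_count']} errors: {info['errors']}"
        )
    if expected_rows is not None and info["success_count"] != expected_rows:
        raise RuntimeError(
            f"{table_name}: expected {expected_rows} rows, loaded {info['success_count']}"
        )
    return info["success_count"]


# Sample summary shaped like the output shown above (illustrative values only).
sample_summary = {"success_count": 77, "error_count": 0, "errors": {}}
loaded = require_clean_load("community_areas", sample_summary, expected_rows=77)
print(f"community_areas: {loaded} rows loaded cleanly")
```

In the notebook, you could pass the row returned by `result.mappings().first()` in place of `sample_summary`.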


Once the data's loaded, verify that the output shows 0 errors.  Next, we'll run a `REFRESH` command to make sure that the data's up to date before querying it.  We'll also run `ANALYZE`, which collects statistics used by the query optimizer.

In [7]:
# Only community_areas is created in this notebook, so refresh just that table.
_ = connection.execute(sa.text("REFRESH TABLE community_areas"))
_ = connection.execute(sa.text("ANALYZE"))

Run a simple `SELECT` to verify that the data loaded as expected.

In [12]:
import pandas as pd

query = """
SELECT 
    name, details['description'] as desc_text, details['description_vec'] as desc_vec 
FROM community_areas WHERE areanumber <= 5 ORDER BY areanumber ASC
"""
df = pd.read_sql(query, CONNECTION_STRING)

df

| | name | desc_text | desc_vec |
|---|---|---|---|
| 0 | ROGERS PARK | Rogers Park is the first of Chicago's 77 commu... | [-0.0022681202, 0.009500365, -0.019281529, 1.3... |
| 1 | WEST RIDGE | West Ridge is one of 77 Chicago community area... | [0.010529446, -0.002814602, -0.013438543, 0.01... |
| 2 | UPTOWN | Uptown is one of Chicago's 77 community areas.... | [0.012312315, 0.035940763, -0.018374063, -0.01... |
| 3 | LINCOLN SQUARE | Lincoln Square on the North Side of Chicago, I... | [0.035294976, 0.029014729, -0.022433033, 0.005... |
| 4 | NORTH CENTER | North Center is one of the 77 community areas ... | [0.021928966, 0.022093171, -0.024451768, 0.011... |
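
Each `desc_vec` value is a list of floats stored alongside the description text. As a rough illustration of how two such vectors can be compared, the sketch below ranks a few short, made-up vectors by Euclidean distance in plain Python. The vectors and the names `area_a` and `area_b` are invented for this example (the real vectors have 2048 dimensions, per the table definition), and Euclidean distance is used as one common choice. This shows the general idea only and does not describe how CrateDB executes a search.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.dist(a, b)


# Made-up 3-dimensional vectors, for illustration only.
query_vec = [0.0, 0.0, 0.0]
candidates = {
    "area_b": [1.0, 1.0, 1.0],
    "area_a": [0.1, 0.2, 0.2],
}

# A smaller distance means the candidate is closer to the query vector.
ranked = sorted(
    candidates,
    key=lambda name: euclidean_distance(query_vec, candidates[name]),
)
print(ranked)
```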


TODO commentary on the above.

## Full-text Search

TODO full text search parts

## Vector Similarity Search

TODO vector similarity parts

## Towards Hybrid Search

TODO combining the two searches in one query

## Combining Search and Other Query Criteria

TODO using geo, the third search... and querying with supporting data...

## Additional Resources

TODO call to action for other tutorials

## Continue your Learning Journey

To learn more about CrateDB, sign up for our courses at the CrateDB Academy.  We recommend the [CrateDB Fundamentals](https://learn.cratedb.com/cratedb-fundamentals) course for a comprehensive overview, and our [Advanced Time Series](https://learn.cratedb.com/time-series) course for a deep dive into time series data concepts.