# CrateDB Multi-Model Data Workshop

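This workshop shows how CrateDB handles several kinds of data in one database, using open data about the city of Chicago.  You'll create tables that combine document data (`OBJECT` columns), full-text indexed descriptions, geospatial shapes and points, arrays, and time-partitioned records.  You'll then load the data with `COPY FROM`, query it with SQL, and plot the results on interactive maps with `ipyleaflet`.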

## Install Dependencies

First, install the required dependencies by executing the `pip install` command below.  Make sure to restart the notebook runtime environment once this command has completed.

In [None]:
! pip install ipyleaflet sqlalchemy-cratedb pandas

## Connect to CrateDB

Before going any further, you'll need to update the code below to include a connection string for your CrateDB cluster.  If you prefer, you can set the environment variable `CRATEDB_CONNECTION_STRING` instead.

The code below defaults to a CrateDB instance running on localhost (for example with [Docker](https://hub.docker.com/_/crate)).  If you're using a managed [CrateDB Cloud](https://console.cratedb.cloud/) cluster, comment out the "localhost" code block and uncomment the "CrateDB Cloud" code block instead.

In [81]:
import os
import sqlalchemy as sa

# # Define database address when using CrateDB Cloud.
# # Please find these settings on your cluster overview page.
# CONNECTION_STRING = os.environ.get(
#     "CRATEDB_CONNECTION_STRING",
#     "crate://<USERNAME>:<PASSWORD>@<HOST>/?ssl=true",
# )

# Define database address when using CrateDB on localhost.
CONNECTION_STRING = os.environ.get(
   "CRATEDB_CONNECTION_STRING",
   "crate://crate@localhost/",
)

# Connect to CrateDB using SQLAlchemy.
# Setting the DEBUG environment variable to any non-empty value logs the SQL statements.
engine = sa.create_engine(CONNECTION_STRING, echo=bool(os.environ.get('DEBUG')))
connection = engine.connect()
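
If you'd like to confirm which connection string the notebook will pick up without printing a password, the following is a minimal sketch.  The helper names `resolve_connection_string` and `mask_password` are hypothetical and not part of the workshop code; the default value mirrors the "localhost" block above.

```python
import os
from urllib.parse import urlsplit, urlunsplit

# Same fallback as the "localhost" block above.
DEFAULT_CONNECTION_STRING = "crate://crate@localhost/"


def resolve_connection_string(environ=os.environ):
    """Prefer CRATEDB_CONNECTION_STRING, else fall back to localhost."""
    return environ.get("CRATEDB_CONNECTION_STRING", DEFAULT_CONNECTION_STRING)


def mask_password(connection_string):
    """Return the connection string with any password replaced by ***."""
    parts = urlsplit(connection_string)
    if parts.password is None:
        return connection_string
    netloc = parts.netloc.replace(f":{parts.password}@", ":***@", 1)
    return urlunsplit(parts._replace(netloc=netloc))


print(mask_password(resolve_connection_string()))
```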

## Create Tables

Next, we'll create three tables as follows:

* `community_areas` - details about the 77 community areas that make up the city of Chicago, including their boundaries.

* `three_eleven_calls` - details about service requests placed with the Chicago 311 non-emergency issue reporting service.

* `libraries` - data about Chicago's public libraries, including their locations and opening times.

Run the code below to create them, taking a moment to understand the table schemas.

In [12]:
_ = connection.execute(sa.text(
"""
CREATE TABLE IF NOT EXISTS community_areas (
   areanumber INTEGER PRIMARY KEY,
   name TEXT,
   details OBJECT(DYNAMIC) AS (
       description TEXT INDEX USING fulltext,
       population BIGINT
   ),
   boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025)
);
"""))

_  = connection.execute(sa.text(
"""
CREATE TABLE IF NOT EXISTS three_eleven_calls (
   srnumber TEXT,
   srtype TEXT,
   srshortcode TEXT,
   createddept TEXT,
   ownerdept TEXT,
   status TEXT,
   origin TEXT,
   createddate TIMESTAMP,
   lastmodifieddate TIMESTAMP,
   closeddate TIMESTAMP,
   week GENERATED ALWAYS AS date_trunc('week', createddate),
   isduplicate BOOLEAN,
   createdhour SMALLINT,
   createddayofweek SMALLINT,
   createdmonth SMALLINT,
   locationdetails OBJECT(DYNAMIC) AS (
       streetaddress TEXT,
       city TEXT,
       state TEXT,
       zipcode TEXT,
       streetnumber TEXT,
       streetdirection TEXT,
       streetname TEXT,
       streettype TEXT,
       communityarea SMALLINT,
       ward SMALLINT,
       policesector SMALLINT,
       policedistrict SMALLINT,
       policebeat SMALLINT,
       precinct SMALLINT,
       latitude DOUBLE PRECISION,
       longitude DOUBLE PRECISION,
       location GEO_POINT
   )
) PARTITIONED BY (week);
"""))

_ = connection.execute(sa.text(
"""
CREATE TABLE IF NOT EXISTS libraries (
   name TEXT,
   location OBJECT(DYNAMIC) AS (
       address TEXT,
       zipcode TEXT,
       communityarea INTEGER,
       position GEO_POINT
   ),
   hours ARRAY(TEXT),
   phone TEXT,
   website TEXT
);
"""))
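
The `three_eleven_calls` table is partitioned by the generated `week` column, so every row lands in the partition for the week in which the call was created.  The plain Python sketch below mirrors that calculation, assuming `date_trunc('week', ...)` truncates to Monday at 00:00 as it does in PostgreSQL.  The function name `week_partition` is hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta


def week_partition(created: datetime) -> datetime:
    """Return midnight on the Monday of the week containing `created`."""
    monday = created - timedelta(days=created.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)


# Illustrative timestamps: a Monday, the following Sunday, and the next Monday.
calls = [
    datetime(2024, 4, 1, 9, 30),
    datetime(2024, 4, 7, 23, 59),
    datetime(2024, 4, 8, 0, 0),
]
for created in calls:
    print(created.isoformat(), "->", week_partition(created).date())
```

Under that assumption, the first two timestamps share a partition and the third starts a new one.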

## Loading the Data

We'll load the data from files contained in the `cratedb-datasets` public GitHub repository.  There's one file for each table:

* Data for the `community_areas` table is contained in a JSON file named `chicago_community_areas.json`.

* Data for the `three_eleven_calls` table is contained in a compressed JSON file named `311_records_apr_2024.json.gz`.

* Data for the `libraries` table is contained in a JSON file named `chicago_libraries.json`.

The following code populates each table in turn, using `COPY FROM` statements.

In [5]:
def display_results(table_name, info):
    print(f"{table_name}: loaded {info['success_count']}, errors: {info['error_count']}")

    if info['error_count'] > 0:
        print(f"Errors: {info['errors']}")

# Load the community areas data file.
result = connection.execute(sa.text("""
    COPY community_areas 
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/chicago_community_areas.json' 
    RETURN SUMMARY;                                  
    """))

display_results("community_areas", result.mappings().first())

community_areas: loaded 77, errors: 0


In [7]:
# Load the 311 calls data file.
result = connection.execute(sa.text("""
    COPY three_eleven_calls 
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/311_records_apr_2024.json.gz' 
    WITH (compression='gzip') RETURN SUMMARY                                   
    """))

display_results("three_eleven_calls", result.mappings().first())

three_eleven_calls: loaded 174092, errors: 0


In [8]:
# Load the libraries data file.
result = connection.execute(sa.text("""
    COPY libraries 
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/chicago_libraries.json' 
    RETURN SUMMARY;                       
    """))

display_results("libraries", result.mappings().first())

libraries: loaded 81, errors: 0
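
The cells above read only the first row of each `COPY FROM ... RETURN SUMMARY` result.  The summary may contain more than one row (for example one per node or per URI), so the sketch below totals the counts across all rows.  The helper name `summarize_copy` and the sample rows are made up for illustration, and treating a missing count as zero is an assumption.

```python
def summarize_copy(rows):
    """Total the success and error counts across all summary rows."""
    rows = list(rows)
    return {
        # A count may be missing if a file could not be read at all (assumption).
        "success_count": sum(row["success_count"] or 0 for row in rows),
        "error_count": sum(row["error_count"] or 0 for row in rows),
        "errors": [row["errors"] for row in rows if row["errors"]],
    }


# Made-up summary rows, shaped like the columns used by display_results.
sample_rows = [
    {"success_count": 40, "error_count": 0, "errors": {}},
    {"success_count": 37, "error_count": 1, "errors": {"parse error": {"count": 1}}},
]
print(summarize_copy(sample_rows))
```

With SQLAlchemy, the rows could be obtained with `result.mappings().all()` instead of `result.mappings().first()`, and the returned dictionary passed to `display_results`.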


Once the data's loaded, verify that the output shows 0 errors for each table.  Next, we'll run a `REFRESH` command to make sure that the data's up to date before querying it.  We'll also run `ANALYZE`, which collects statistics used by the query optimizer.

In [10]:
_ = connection.execute(sa.text("REFRESH TABLE community_areas, three_eleven_calls, libraries"))
_ = connection.execute(sa.text("ANALYZE"))

## Displaying Community Areas on a Map

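The code below reads each community area's name, boundaries and population from CrateDB, then draws the areas on a map of Chicago.  Each area is shaded by population: green for fewer than 20,000 residents, yellow for fewer than 40,000, orange for fewer than 60,000, and red for 60,000 or more.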

In [82]:
import pandas as pd
from ipyleaflet import Map, GeoJSON

center = (41.83068856472101, -87.74024963378908)
# Named areas_map rather than map, to avoid shadowing the Python built-in.
areas_map = Map(center=center, zoom=10)

query = """
SELECT name, boundaries, details['population'] AS population FROM community_areas
"""
df = pd.read_sql(query, CONNECTION_STRING)

def get_color_for_population(population):
    if population < 20000:
        return 'green'
    elif population < 40000:
        return 'yellow'
    elif population < 60000:
        return 'orange'

    return 'red'

# Add one GeoJSON layer per community area, shaded by population.
for _, row in df.iterrows():
    community_area = GeoJSON(
        data=row['boundaries'],
        style={
            'stroke': False,
            'fillColor': get_color_for_population(row['population']),
            'fillOpacity': 0.5
        },
    )

    areas_map.add(community_area)

display(areas_map)

Map(center=[41.83068856472101, -87.74024963378908], controls=(ZoomControl(options=['position', 'zoom_in_text',…
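
The population thresholds used for shading can also be kept in a lookup table, which makes them easier to adjust.  The sketch below is one possible variant using `bisect` from the standard library.  The gray fallback for a missing population is an assumption added for illustration and is not part of the code above.

```python
import math
from bisect import bisect_right

# Upper bounds (exclusive) for each color except the last.
THRESHOLDS = [20_000, 40_000, 60_000]
COLORS = ["green", "yellow", "orange", "red"]


def color_for_population(population, missing_color="gray"):
    """Map a population to a fill color; gray if the value is missing."""
    if population is None or (isinstance(population, float) and math.isnan(population)):
        return missing_color
    # bisect_right counts how many thresholds are <= population.
    return COLORS[bisect_right(THRESHOLDS, population)]


for value in (5_000, 20_000, 59_999, 80_000, None):
    print(value, color_for_population(value))
```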

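## Finding the Nearest Library

Drag the red marker to any position on the map.  Each time it moves, the code below asks CrateDB for the library closest to the marker, using the `distance` function, then places a second marker at that library.  Click the library's marker to see its name and opening hours.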

In [107]:
from ipyleaflet import Icon, Marker
from ipywidgets import HTML

libraries_map = Map(center=center, zoom=10)
location_icon = Icon(icon_url='https://raw.githubusercontent.com/pointhi/leaflet-color-markers/master/img/marker-icon-2x-red.png', icon_size=[25, 41], icon_anchor=[12, 41])
library_marker = None

def on_my_position_changed(pos):
    global library_marker

    my_lat = pos['new'][0]
    my_lon = pos['new'][1]
    query = f"""
    SELECT 
        name, 
        hours, 
        location['position'] as location,
        distance('POINT({my_lon} {my_lat})', location['position']) AS distance 
        FROM libraries ORDER BY distance ASC LIMIT 1;
    """
    
    df = pd.read_sql(query, CONNECTION_STRING)
    closest_library = df.values[0]
    library_lat = closest_library[2][1]
    library_lon = closest_library[2][0]

    if library_marker:
        libraries_map.remove(library_marker)

    library_marker = Marker(location = (library_lat, library_lon), draggable=False)
    library_details = HTML()

    # Pair each day label with its opening hours (assumes one entry per day, Monday first).
    day_labels = ["M", "T", "W", "T", "F", "S", "S"]
    library_opening_hours = "<br/>".join(
        f"<b>{day}:</b> {hours}" for day, hours in zip(day_labels, closest_library[1])
    )
    library_details.value = f"<span style=\"color: #000000;\"><b>{closest_library[0]}</b><hr/>{library_opening_hours}</span>"
    library_marker.popup = library_details
    libraries_map.add(library_marker)


my_position = Marker(location=center, icon=location_icon, draggable=True)
my_position.observe(on_my_position_changed, "location")

libraries_map.add(my_position)
display(libraries_map)

Map(center=[41.83068856472101, -87.74024963378908], controls=(ZoomControl(options=['position', 'zoom_in_text',…
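
To build some intuition for the nearest-library query, the sketch below does the same job in plain Python: it computes a great-circle distance in meters with the haversine formula and picks the closest place.  This is an illustration only and its results may differ slightly from CrateDB's `distance` function.  The function names, the Earth radius constant and the sample places are all assumptions made for this example.  Note the coordinate order: WKT points are written as longitude then latitude, while `ipyleaflet` uses (latitude, longitude).

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # Mean Earth radius in meters (assumption).


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def to_wkt_point(lat, lon):
    """Format a (lat, lon) pair as a WKT point, which is longitude first."""
    return f"POINT({lon} {lat})"


def nearest(lat, lon, places):
    """Return the place closest to (lat, lon)."""
    return min(places, key=lambda p: haversine_m(lat, lon, p["lat"], p["lon"]))


# Made-up places for illustration, not rows from the libraries table.
places = [
    {"name": "Library A", "lat": 41.88, "lon": -87.63},
    {"name": "Library B", "lat": 41.79, "lon": -87.60},
]
print(to_wkt_point(41.80, -87.60))
print(nearest(41.80, -87.60, places)["name"])
```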