# CrateDB Full-Text and Vector Search Workshop

TODO overview.

## Install Dependencies

First, install the required dependencies by executing the `pip install` command below.

In [None]:
! pip install ipyleaflet sqlalchemy-cratedb pandas

## Connect to CrateDB

Before going any further, you'll need to update the code below to include a connection string for your CrateDB cluster.  If you prefer, you can set the environment variable `CRATEDB_CONNECTION_STRING` instead.

The code below assumes that you're using a managed [CrateDB Cloud](https://console.cratedb.cloud/) cluster.  If you're running CrateDB locally (for example with [Docker](https://hub.docker.com/_/crate)), use the "localhost" code block instead.

In [29]:
import os
import sqlalchemy as sa

# # Define database address when using CrateDB Cloud.
# # Please find these settings on your cluster overview page.
#CONNECTION_STRING = os.environ.get(
#   "CRATEDB_CONNECTION_STRING",
#   "crate://<USERNAME>:<PASSWORD>@<HOST>/?ssl=true",
#)

# # Define database address when using CrateDB on localhost.
CONNECTION_STRING = os.environ.get(
  "CRATEDB_CONNECTION_STRING",
  "crate://crate@localhost/",
)

# # Connect to CrateDB using SQLAlchemy.
# SQLAlchemy's echo flag expects a boolean (or "debug"), so coerce the environment value.
engine = sa.create_engine(CONNECTION_STRING, echo=bool(os.environ.get("DEBUG")))
connection = engine.connect()
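
A CrateDB Cloud connection string embeds a password, so it is worth masking it before printing or logging the value in a shared notebook. The following is a minimal sketch using only the Python standard library; `redact_connection_string` is a hypothetical helper and the example host and credentials are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_connection_string(url: str) -> str:
    """Return the URL with any password replaced by '***' (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.password is None:
        # Nothing to hide, e.g. "crate://crate@localhost/".
        return url
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Illustrative values only; these are not real credentials or hosts.
print(redact_connection_string("crate://admin:s3cret@example.invalid:4200/?ssl=true"))
print(redact_connection_string("crate://crate@localhost/"))
```

In the notebook you could call `redact_connection_string(CONNECTION_STRING)` whenever you want to confirm which cluster you are pointing at.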

## Create a Community Areas Table

First, you'll need to create a table to store the community areas data.  You may already have a `community_areas` table from following other CrateDB workshops.  The code below drops any existing table of that name and replaces it with a new version.  This new version has an additional column, `description_vec`, in the `details` object.  You'll learn what this is for later in this workshop!

In [78]:
_ = connection.execute(sa.text(
"""
DROP TABLE IF EXISTS community_areas
"""
))

_ = connection.execute(sa.text(
"""
CREATE TABLE IF NOT EXISTS community_areas (
   areanumber INTEGER PRIMARY KEY,
   name TEXT,
   details OBJECT(DYNAMIC) AS (
       description TEXT INDEX USING fulltext WITH (analyzer='english'),
       description_vec FLOAT_VECTOR(2048),
       population BIGINT
   ),
   boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025)
);
"""))

## Load the Data

Next, load the community areas data, which is stored as a JSON file on GitHub.

In [79]:
def display_results(table_name, info):
    print(f"{table_name}: loaded {info['success_count']}, errors: {info['error_count']}")

    if info["error_count"] > 0:
        print(f"Errors: {info['errors']}")

# Load the community areas data file.
result = connection.execute(sa.text("""
    COPY community_areas 
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/academy/chicago-data/chicago_community_areas_with_vectors.json' 
    RETURN SUMMARY;                                  
    """))

display_results("community_areas", result.mappings().first())

community_areas: loaded 77, errors: 0


Once the data's loaded, verify that the output shows 0 errors.  Next, we'll run a `REFRESH` command to make sure that the data's up to date before querying it.  We'll also run `ANALYZE`, which collects statistics used by the query optimizer.

In [80]:
_ = connection.execute(sa.text("REFRESH TABLE community_areas"))
_ = connection.execute(sa.text("ANALYZE"))
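
The manual check for 0 errors can also be automated, which is handy if the notebook is run unattended. This is a sketch under assumptions: `assert_clean_load` is a hypothetical helper, and it expects a mapping shaped like the summary row passed to `display_results` (keys `success_count`, `error_count`, and `errors`). The sample dictionary stands in for a real summary row.

```python
def assert_clean_load(info, expected_rows=None):
    """Raise if a COPY ... RETURN SUMMARY row reports problems (hypothetical helper)."""
    if info["error_count"] > 0:
        raise RuntimeError(
            f"{info['error_count']} rows failed to load: {info['errors']}"
        )
    if expected_rows is not None and info["success_count"] != expected_rows:
        raise RuntimeError(
            f"expected {expected_rows} rows, loaded {info['success_count']}"
        )
    return info["success_count"]


# Stand-in for the summary row; in the notebook it comes from result.mappings().first().
sample_summary = {"success_count": 77, "error_count": 0, "errors": {}}
print(assert_clean_load(sample_summary, expected_rows=77))
```

Note that `result.mappings().first()` consumes the row, so keep it in a variable if you want to pass it to both `display_results` and a check like this one.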

## Familiarization with the Data

Before we try out some different ways to search the textual data in the `community_areas` table, let's first run a simple `SELECT` query to take a look at some of it.

In [81]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

query = """
SELECT 
    name, details['description'] as desc_text, details['description_vec'] as desc_vec 
FROM community_areas WHERE areanumber = 51
"""
df = pd.read_sql(query, CONNECTION_STRING)
vals = df.to_dict(orient="records")

display(df)

Unnamed: 0,name,desc_text,desc_vec
0,SOUTH DEERING,"South Deering, located on Chicago's far South Side, is the largest of the 77 official community areas of that city. Primarily an industrial area, a small residential neighborhood exists in the northeast corner and Lake Calumet takes up a large portion of the area. 80% of the community area is zoned as industrial, natural wetlands, or parks. The remaining 20% is zoned for residential and small-scale commercial uses. It is part of the 10th Ward, once under the control of former Richard J. Daley ally Alderman Edward Vrdolyak. The neighborhood is named for Charles Deering, an executive in the Deering Harvester Company that would later form a major part of International Harvester. International Harvester owned Wisconsin Steel, which was originally established in 1875 and was located along Torrence Avenue south of 106th Street to 109th Street. It is the location of Calumet Fisheries, a historic seafood restaurant that opened in 1928 and has been featured on Anthony Bourdain: No Reservations. The original Calumet Bakery store, a South Side favorite since 1935, is located at 2510 E 106th St, Chicago, IL 60617. It was also the location of the Wisconsin Steel Works, originally the Joseph H. Brown Iron and Steel Company, which opened in 1875 and closed in 1980. 
Since the closing of the steel mill, the neighborhood has remained economically depressed.","[0.03875456, -0.008511306, -0.017262578, -0.03257543, -0.02393664, -0.0598416, 0.0027521136, -0.033445306, -0.009021234, 0.013333129, -0.026981214, 0.02165696, -0.0078289015, -0.0024690286, 0.024236599, -0.02699621, -0.030265752, 0.014502965, 0.008308833, -0.028675975, 0.016572675, -0.016977618, -0.025946358, 0.0040006884, 0.038934536, 0.026831234, -0.0020340895, -0.029455867, -0.04355389, -0.029440869, -0.00797888, 0.01945227, 0.038604584, -0.019857213, 0.009253701, 0.0011913953, -0.033205338, 0.019077323, 0.015732791, 0.026201323, 0.02080208, 0.04586356, -0.0030070778, 0.014870413, 0.0013901173, 0.010633508, -0.018942341, 0.010363545, 0.004874316, 0.0038469601, -0.0039856904, 0.011345908, -0.0065803262, 0.0021390747, 0.0126507245, -0.024431571, 0.052192673, 0.11812342, -0.05600214, 0.026561271, 0.0003538566, -0.0377647, -0.040944252, -0.0062578716, 0.052672606, 0.041874122, 0.0064378465, 0.05066289, 0.015987756, 0.00797888, -0.010940964, 0.009058729, -0.07240984, 0.03578498, 0.027011208, -0.026681256, -0.011533381, -0.004094425, -0.004746834, -0.023456708, 0.010813482, -0.030895663, -0.010611011, -0.009433676, 0.033205338, 0.03170555, -0.03233546, -0.05222267, -0.009096223, -0.00235092, 0.015522822, -0.041574165, 0.014113019, -0.03854459, 0.025046485, -0.02906592, -0.028106056, 0.0019150437, -0.027821096, 0.04211409, ...]"


Take a look at the values for `desc_text` and `desc_vec`.

* `desc_text` is a free-text description of the characteristics of the community area, sourced from Wikipedia.  We'll use this to explore CrateDB's full-text search capabilities.
* `desc_vec` is a `FLOAT_VECTOR` column, containing vector embeddings created from the text in `desc_text` by passing it through OpenAI's `text-embedding-3-large` model.  These embeddings have been created for you, so you don't need to use the OpenAI API to work with data in this workbook.  We chose to use 2048 dimensions.
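
If you want to confirm that an embedding has the expected shape, a small check like the one below can help. It is a sketch, not part of the original workshop: `check_embedding` is a hypothetical helper, and it assumes the vector arrives in Python as a plain sequence of numbers (for example `vals[0]["desc_vec"]` from the cell above). The tiny four-element vector is a stand-in so the example runs without a database.

```python
import math

EXPECTED_DIMS = 2048  # matches FLOAT_VECTOR(2048) in the table definition


def check_embedding(vec, expected_dims=EXPECTED_DIMS):
    """Validate the vector length and return its Euclidean (L2) norm."""
    if len(vec) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(vec)}")
    return math.sqrt(sum(x * x for x in vec))


# Stand-in vector with 4 dimensions, used only to demonstrate the helper.
print(check_embedding([0.5, 0.5, 0.5, 0.5], expected_dims=4))
```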

## Full-text Search

The first type of search we'll look at is full-text search.  Use full-text search to find documents containing particular words or phrases while allowing for typos or synonyms in the search query.  It also supports searching by prefix and fuzzy matching.

CrateDB uses Apache Lucene for full-text search.  Search indexes can be built over any number of `TEXT` columns in a table, including those deeply nested inside `OBJECT` columns.  Composite indexes containing data from more than one `TEXT` column can also be created.
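
As an illustration of a composite index, the sketch below defines a named full-text index over two columns and a query that searches it. The table name `community_areas_ft_demo` and the index name `name_description_ft` are hypothetical and chosen so that the workshop's own table is left untouched; treat the statements as a starting point to check against the CrateDB documentation rather than as the workshop's schema.

```python
# Hypothetical demo table with one full-text index spanning two TEXT columns.
COMPOSITE_INDEX_DDL = """
CREATE TABLE IF NOT EXISTS community_areas_ft_demo (
   areanumber INTEGER PRIMARY KEY,
   name TEXT,
   details OBJECT(DYNAMIC) AS (
       description TEXT
   ),
   INDEX name_description_ft USING fulltext (name, details['description'])
       WITH (analyzer = 'english')
)
"""

# A composite index is searched by its index name instead of a column name.
COMPOSITE_MATCH_QUERY = """
SELECT name, _score
FROM community_areas_ft_demo
WHERE MATCH(name_description_ft, 'railway')
ORDER BY _score DESC
"""

print(COMPOSITE_INDEX_DDL)
print(COMPOSITE_MATCH_QUERY)
```

Both strings can be run in the same way as the other statements in this notebook, for example with `connection.execute(sa.text(COMPOSITE_INDEX_DDL))`.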

Consider our `community_areas` table schema:

```sql
CREATE TABLE IF NOT EXISTS community_areas (
   ...
   details OBJECT(DYNAMIC) AS (
       description TEXT INDEX USING fulltext WITH (analyzer='english'),
       ...
```

Here, `description` is declared as `TEXT` with the additional `INDEX USING fulltext` clause.  This tells CrateDB to create a full-text index for this field, and `analyzer='english'` tells it that we expect the content to be in English.

### Introducing `MATCH`

The `MATCH` predicate is used to perform full-text searches.  Let's search for the term "railway" in our community area data:

In [82]:
query = """
SELECT name, _score, details['description'] as description
FROM community_areas 
WHERE match(details['description'], 'railway')
ORDER BY _score DESC;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,description
0,AUSTIN,1.749486,"Austin is one of 77 community areas in Chicago. Located on the city's West Side, it is the third largest community area by population (behind the Near North Side and Lake View) and the second-largest geographically (behind South Deering). Austin's eastern boundary is the Belt Railway located just east of Cicero Avenue. Its northernmost border is the Milwaukee District / West Line. Its southernmost border is at Roosevelt Road from the Belt Railway west to Austin Boulevard. The northernmost portion, north of North Avenue, extends west to Harlem Avenue, abutting Elmwood Park. In addition to Elmwood Park, Austin also borders the suburbs of Cicero and Oak Park"
1,BURNSIDE,1.150856,"Burnside is one of the 77 community areas in Chicago. The 47th numbered area, it is located on the city's far south side. This area is also called 'The Triangle' by locals, as it is bordered by railroad tracks on every side; the Canadian National Railway on the west, the Union Pacific Railroad on the south and the Norfolk Southern Railway on the east. With a population of 2,254 in 2016, it is the least populous of the community areas, as well as the second smallest by area after Oakland."
2,ASHBURN,1.140486,"Ashburn, one of Chicago's 77 community areas, is located on the south side of the city. Greater Ashburn covers nearly five square miles. The approximate boundaries of Ashburn are 72nd Street (north), Western Avenue (east), 87th Street (south) and Cicero Avenue (west). Ashburn, which got its name as the dumping site for the city's ashes, was slow to experience growth at the beginning of the 20th century. In 1893, the 'Clarkdale' subdivision was planned near 83rd and Central Park Avenue along the new Chicago and Grand Trunk Railway, with only 19 homes built in the first 50 years. The early residents were Dutch, Swedish and Irish. Ashburn opened Ashburn Flying Field, the first airfield in Chicago, in 1916"
3,WEST ENGLEWOOD,0.899924,"West Englewood, one of the 77 community areas, is on the southwest side of Chicago, Illinois. At one time it was known as South Lynne. The boundaries of West Englewood are Garfield Blvd to the north, Racine Ave to the east, the CSX and Norfolk Southern RR tracks to the west, and the Belt Railway of Chicago to the south. Though it is a separate community area, much of the history and culture of the neighborhood is linked directly to the Englewood neighborhood."
4,GREATER GRAND CROSSING,0.759754,"Greater Grand Crossing is one of the 77 community areas of Chicago, Illinois. It is located on the city's South Side. The name 'Grand Crossing' comes from an 1853 right-of-way feud between the Lake Shore and Michigan Southern Railway and the Illinois Central Railroad that led to a frog war and a crash that killed 18 people. The crash was the result of Roswell B. Mason (later to serve as mayor of Chicago) illegally constructing railroad tracks, on behalf of the Illinois Central, across another railroad company's tracks. Due to the lack of safety at the crossing, trains made complete stops here and therefore industry developed around the area to cater to the railroad workers."
5,LOOP,0.439959,"The Loop, one of Chicago's 77 designated community areas, is the central business district of the city and is the main section of Downtown Chicago. Home to Chicago's commercial core, it is the second largest commercial business district in North America after Midtown Manhattan in New York City, and contains the headquarters and regional offices of several global and national businesses, retail establishments, restaurants, hotels, and theaters, as well as many of Chicago's most famous attractions. It is home to Chicago's City Hall, the seat of Cook County, and numerous offices of other levels of government and consulates of foreign nations. The intersection of State Street and Madison Street is the origin point for the address system on Chicago's street grid. Most of Grant Park's 319 acres (129 hectares) are in the eastern section of the community area. The Loop community area is bounded on the north and west by the Chicago River, on the east by Lake Michigan, and on the south by Roosevelt Road. In 1803, the United States Army built Fort Dearborn in what is now the Loop, the first settlement in the area sponsored by the United States' federal government. When Chicago and Cook County were incorporated in the 1830s the area was selected as the site of their respective seats. Originally mixed use, the character of the area became commercial starting in the 1870s, especially after it was mostly destroyed in the Great Chicago Fire of 1871. At that time some of the world's earliest skyscrapers were constructed in the area, starting a legacy of architecture that continues to this day. In the late 19th century, cable car turnarounds and a prominent elevated railway loop encircled the area, giving the Loop its name. Starting in the 1920s many highways were constructed in the Loop, most prominently U.S. Route 66, which opened in 1926 with its eastern terminus in the area. 
While dominated by offices and public buildings, its residential population boomed during the latter 20th century and first decades of the 21st; its population has increased the most of Chicago's community areas since 1950."


`MATCH` also makes a special column, `_score`, available.  It indicates the relative quality of each match, and sorting by it in descending order puts the best matches first.
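
Because `_score` is a relative measure, one way to read it is as a fraction of the best hit in the same result set. The sketch below does this for the scores returned by the "railway" query above; `relative_scores` is a hypothetical helper, and the ratios are only a reading aid, not something CrateDB computes.

```python
# Scores copied from the "railway" query output above.
railway_scores = {
    "AUSTIN": 1.749486,
    "BURNSIDE": 1.150856,
    "ASHBURN": 1.140486,
    "WEST ENGLEWOOD": 0.899924,
    "GREATER GRAND CROSSING": 0.759754,
    "LOOP": 0.439959,
}


def relative_scores(scores):
    """Express each score as a fraction of the highest score in the set."""
    top = max(scores.values())
    return {name: score / top for name, score in scores.items()}


for name, ratio in relative_scores(railway_scores).items():
    print(f"{name}: {ratio:.2f}")
```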

### Experimenting with Full-text Search

The following query searches for the terms "railroad" OR "tracks":

In [83]:
query = """
SELECT name, _score, details['description'] AS description 
FROM community_areas 
WHERE MATCH(details['description'], 'railroad tracks') 
ORDER BY _score DESC
LIMIT 5;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,description
0,GREATER GRAND CROSSING,2.148552,"Greater Grand Crossing is one of the 77 community areas of Chicago, Illinois. It is located on the city's South Side. The name 'Grand Crossing' comes from an 1853 right-of-way feud between the Lake Shore and Michigan Southern Railway and the Illinois Central Railroad that led to a frog war and a crash that killed 18 people. The crash was the result of Roswell B. Mason (later to serve as mayor of Chicago) illegally constructing railroad tracks, on behalf of the Illinois Central, across another railroad company's tracks. Due to the lack of safety at the crossing, trains made complete stops here and therefore industry developed around the area to cater to the railroad workers."
1,BURNSIDE,1.898286,"Burnside is one of the 77 community areas in Chicago. The 47th numbered area, it is located on the city's far south side. This area is also called 'The Triangle' by locals, as it is bordered by railroad tracks on every side; the Canadian National Railway on the west, the Union Pacific Railroad on the south and the Norfolk Southern Railway on the east. With a population of 2,254 in 2016, it is the least populous of the community areas, as well as the second smallest by area after Oakland."
2,GRAND BOULEVARD,1.835274,"Grand Boulevard on the South Side of Chicago, Illinois, is one of the city's Community Areas. The boulevard from which it takes its name is now Martin Luther King Jr. Drive. The area is bounded by 39th to the north, 51st Street to the south, Cottage Grove Avenue to the east, and the Chicago, Rock Island & Pacific Railroad tracks to the west."
3,WEST TOWN,1.770313,"West Town, northwest of the Loop on Chicago's West Side, is one of the city's officially designated community areas. Much of this area was historically part of Polish Downtown, along Western Avenue, which was then the city's western boundary. West Town was a collection of several distinct neighborhoods and the most populous community area until it was surpassed by Near West Side in the 1960s. The boundaries of the community area are the Chicago River to the east, the Union Pacific railroad tracks to the south, the former railroad tracks on Bloomingdale Avenue to the North, and an irregular western border to the west that includes the city park called Humboldt Park. Humboldt Park is also the name of the community area to West Town's west, Logan Square is to the north, Near North Side to the east, and Near West Side to the south. The collection of neighborhoods in West Town along with the neighborhoods of Bucktown and the eastern portion of Logan Square have been referred to by some media as the 'Near Northwest Side'."
4,IRVING PARK,1.453797,"Irving Park is one of 77 officially designated Chicago community areas, and is located on the Northwest Side. It is bounded by the Chicago River on the east, the Milwaukee Road railroad tracks on the west, Addison Street on the south and Montrose Avenue on the north, west of Pulaski Road stretching to encompass the region between Belmont Avenue on the south and, roughly, Leland Avenue on the north. It is named after the American author Washington Irving. Old Irving Park, bounded by Montrose Avenue, Pulaski Road, Addison Street, and Cicero Avenue, has a variety of housing stock with Queen Anne, Victorian, and Italianate homes, a few farmhouses, and numerous bungalows. The CTA Blue Line runs through this neighborhood with stops at Addison, Irving Park, and Montrose."


Take a moment to study where the terms "railroad" or "tracks" are contained in the above matches.  What if we wanted to search for the specific phrase "railroad tracks"?  For that, we add `USING phrase`:

In [84]:
query = """
SELECT name, _score, details['description'] AS description 
FROM community_areas 
WHERE MATCH(details['description'], 'railroad tracks') USING phrase
ORDER BY _score DESC
LIMIT 5;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,description
0,GRAND BOULEVARD,1.835274,"Grand Boulevard on the South Side of Chicago, Illinois, is one of the city's Community Areas. The boulevard from which it takes its name is now Martin Luther King Jr. Drive. The area is bounded by 39th to the north, 51st Street to the south, Cottage Grove Avenue to the east, and the Chicago, Rock Island & Pacific Railroad tracks to the west."
1,WEST TOWN,1.770313,"West Town, northwest of the Loop on Chicago's West Side, is one of the city's officially designated community areas. Much of this area was historically part of Polish Downtown, along Western Avenue, which was then the city's western boundary. West Town was a collection of several distinct neighborhoods and the most populous community area until it was surpassed by Near West Side in the 1960s. The boundaries of the community area are the Chicago River to the east, the Union Pacific railroad tracks to the south, the former railroad tracks on Bloomingdale Avenue to the North, and an irregular western border to the west that includes the city park called Humboldt Park. Humboldt Park is also the name of the community area to West Town's west, Logan Square is to the north, Near North Side to the east, and Near West Side to the south. The collection of neighborhoods in West Town along with the neighborhoods of Bucktown and the eastern portion of Logan Square have been referred to by some media as the 'Near Northwest Side'."
2,BURNSIDE,1.668631,"Burnside is one of the 77 community areas in Chicago. The 47th numbered area, it is located on the city's far south side. This area is also called 'The Triangle' by locals, as it is bordered by railroad tracks on every side; the Canadian National Railway on the west, the Union Pacific Railroad on the south and the Norfolk Southern Railway on the east. With a population of 2,254 in 2016, it is the least populous of the community areas, as well as the second smallest by area after Oakland."
3,IRVING PARK,1.453797,"Irving Park is one of 77 officially designated Chicago community areas, and is located on the Northwest Side. It is bounded by the Chicago River on the east, the Milwaukee Road railroad tracks on the west, Addison Street on the south and Montrose Avenue on the north, west of Pulaski Road stretching to encompass the region between Belmont Avenue on the south and, roughly, Leland Avenue on the north. It is named after the American author Washington Irving. Old Irving Park, bounded by Montrose Avenue, Pulaski Road, Addison Street, and Cicero Avenue, has a variety of housing stock with Queen Anne, Victorian, and Italianate homes, a few farmhouses, and numerous bungalows. The CTA Blue Line runs through this neighborhood with stops at Addison, Irving Park, and Montrose."
4,GREATER GRAND CROSSING,1.426055,"Greater Grand Crossing is one of the 77 community areas of Chicago, Illinois. It is located on the city's South Side. The name 'Grand Crossing' comes from an 1853 right-of-way feud between the Lake Shore and Michigan Southern Railway and the Illinois Central Railroad that led to a frog war and a crash that killed 18 people. The crash was the result of Roswell B. Mason (later to serve as mayor of Chicago) illegally constructing railroad tracks, on behalf of the Illinois Central, across another railroad company's tracks. Due to the lack of safety at the crossing, trains made complete stops here and therefore industry developed around the area to cater to the railroad workers."


Take a moment to look at the results here and see how they differ from those of the previous query that searched for "railroad" or "tracks".

Let's search for communities whose description matches both "railroad" and "historic":

In [86]:
query = """
SELECT name, _score, details['description'] AS description 
FROM community_areas 
WHERE MATCH(details['description'], 'railroad historic') USING best_fields WITH (operator='and')
ORDER BY _score DESC
LIMIT 5;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,description
0,PULLMAN,1.834025,"Pullman, one of Chicago's 77 defined community areas, is a neighborhood located on the city's South Side. Twelve miles from the Chicago Loop, Pullman is situated adjacent to Lake Calumet. The area known as Pullman encompasses a much wider area than its two historic areas (the older historic area is often referred to as 'Pullman' and is a Chicago Landmark district and a national historical park. The northern annex historic area is usually referred to as 'North Pullman'). The development built by the Pullman Company is bounded by 103rd Street on the North, 115th Street on the South, the railroad tracks on the East and Cottage Grove on the West. Since the late 20th century, the Pullman neighborhood has been gentrifying. Many residents are involved in the restoration of their own homes, and projects throughout the district as a whole. Walking tours of Pullman are available/ Pullman has many historic and architecturally significant buildings; among these are the Hotel Florence; the Arcade Building, which was destroyed in the 1920s; the Clock Tower and Factory, the complex surrounding Market Square, and Greenstone Church. In the adjacent Kensington neighborhood of the nearby Roseland district is the home of one of the many beautiful churches in Chicago built in Polish Cathedral style, the former church of St. Salomea. It is now used by Salem Baptist Church of Chicago. In a contest sponsored by the Illinois Department of Commerce and Economic Opportunity, Pullman was one of seven sites nominated for the Illinois Seven Wonders."
1,LOGAN SQUARE,1.563071,"Logan Square is an official community area, historical neighborhood, and public square on the northwest side of the City of Chicago. The Logan Square community area is one of the 77 city-designated community areas established for planning purposes. The Logan Square neighborhood, located within the Logan Square community area, is centered on the public square that serves as its namesake, located at the three-way intersection of Milwaukee Avenue, Logan Boulevard and Kedzie Boulevard. The community area of Logan Square is, in general, bounded by the Metra/Milwaukee District North Line railroad on the west, the North Branch of the Chicago River on the east, Diversey Parkway on the north, and the 606 (also known as the Bloomingdale Trail) on the south. The area is characterized by the prominent historical boulevards, stately greystones and large bungalow-style homes."
2,WEST TOWN,1.408882,"West Town, northwest of the Loop on Chicago's West Side, is one of the city's officially designated community areas. Much of this area was historically part of Polish Downtown, along Western Avenue, which was then the city's western boundary. West Town was a collection of several distinct neighborhoods and the most populous community area until it was surpassed by Near West Side in the 1960s. The boundaries of the community area are the Chicago River to the east, the Union Pacific railroad tracks to the south, the former railroad tracks on Bloomingdale Avenue to the North, and an irregular western border to the west that includes the city park called Humboldt Park. Humboldt Park is also the name of the community area to West Town's west, Logan Square is to the north, Near North Side to the east, and Near West Side to the south. The collection of neighborhoods in West Town along with the neighborhoods of Bucktown and the eastern portion of Logan Square have been referred to by some media as the 'Near Northwest Side'."
3,HYDE PARK,1.372627,"Hyde Park is a neighborhood on the South Side of Chicago, Illinois, located on and near the shore of Lake Michigan 7 miles (11 km) south of the Loop. It is one of the city’s 77 municipally recognized community areas. Hyde Park’s boundaries and subdivisions have several local definitions. The community area’s formal boundaries are 51st Street (signed locally as Hyde Park Boulevard) on the north, Midway Plaisance on the south, Washington Park on the west, and Lake Michigan on the east. Another local definition considers a section to the north between 47th Street[3] and Hyde Park Boulevard to be in Hyde Park, although this area is, according to municipal boundaries, the southern half of the Kenwood community area. As such, it is often called “South Kenwood.” Hyde Park and South Kenwood are also sometimes collectively termed “Hyde Park-Kenwood” (as in the name of the epoynmous Historic District, for example). Meanwhile, the portion of Hyde Park that lies between the Illinois Central Railroad tracks and the lake is usually referred to as “East Hyde Park” and is usually also taken to include “Indian Village,” the small southeastern corner of Kenwood. Hyde Park is home to a number of institutions of higher education: the University of Chicago, Catholic Theological Union, Lutheran School of Theology at Chicago, McCormick Theological Seminary, and Chicago Theological Seminary. The community area is also home to the Museum of Science and Industry, and two of Chicago's four historic sites listed in the original 1966 National Register of Historic Places (Chicago Pile-1, the world's first artificial nuclear reactor, and Robie House). In the early 21st century, Hyde Park received national attention for its association with U.S. President Barack Obama, who, before running for president, was a Senior Lecturer for twelve years at the University of Chicago Law School, an Illinois state senator representing the area, and U.S senator from Illinois. 
The Barack Obama Presidential Center which is currently under construction in Jackson Park is located nearby."
4,WASHINGTON HEIGHTS,1.225143,"Washington Heights is the 73rd of Chicago's 77 community areas. Located 12 miles (19 km) from the Loop, it is on the city's far south side. Washington Heights is considered part of the Blue Island Ridge, along with the nearby community areas of Beverly, Morgan Park and Mount Greenwood, and the village of Blue Island. It contains a neighborhood also known as Washington Heights, as well as the neighborhoods of Brainerd and Fernwood. As of 2017, Washington Heights had 27,453 inhabitants. Named for the heights which are now part of the adjacent Beverly, the area was settled in the late 19th century at the intersection of two railroad lines. It was incorporated as a village in 1874, and was annexed by Chicago in 1890. During most of the 20th century, Washington Heights was primarily inhabited by Irish, Germans and Swedes; after late-20th-century white flight, it has been mainly inhabited by African-Americans. The area largely retained its middle-class character during its racial transition, declining somewhat in recent years. Historically influenced by transit, Washington Heights includes the original site of the former Chicago Bridge & Iron Company. The Brainerd Bungalow Historic District and the Carter G. Woodson Regional Library, home of the largest collection of African-American history in the midwestern United States, are in the area."


Again, take a moment to study the text in each matching result.
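
The three queries above differ only in the part that follows `MATCH(...)`. The sketch below builds that part from a match type and a dictionary of options, and leaves the search term as a `:term` bind parameter. `build_match_query` is a hypothetical helper; the allow-lists cover the match types CrateDB documents and only the options used in this workshop, and binding the search term as a parameter is an assumption you should verify against your CrateDB version.

```python
ALLOWED_MATCH_TYPES = {"best_fields", "most_fields", "cross_fields", "phrase", "phrase_prefix"}
ALLOWED_OPTIONS = {"operator", "fuzziness"}  # only the options used in this workshop


def _render_option_value(value):
    """Render numbers bare and simple words in single quotes; reject anything else."""
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        raise ValueError(f"unsupported option value: {value!r}")
    if isinstance(value, str):
        if not value.isalnum():
            raise ValueError(f"unsupported option value: {value!r}")
        return f"'{value}'"
    return str(value)


def build_match_query(match_type=None, options=None, limit=5):
    """Build a full-text query on details['description'] with a :term placeholder."""
    options = dict(options or {})
    if options and match_type is None:
        match_type = "best_fields"  # WITH (...) is only valid after USING <type>
    predicate = "MATCH(details['description'], :term)"
    if match_type is not None:
        if match_type not in ALLOWED_MATCH_TYPES:
            raise ValueError(f"unknown match type: {match_type!r}")
        predicate += f" USING {match_type}"
    if options:
        unknown = set(options) - ALLOWED_OPTIONS
        if unknown:
            raise ValueError(f"unknown options: {sorted(unknown)}")
        rendered = ", ".join(f"{k} = {_render_option_value(v)}" for k, v in options.items())
        predicate += f" WITH ({rendered})"
    return (
        "SELECT name, _score, details['description'] AS description\n"
        "FROM community_areas\n"
        f"WHERE {predicate}\n"
        "ORDER BY _score DESC\n"
        f"LIMIT {int(limit)}"
    )


print(build_match_query("phrase"))
print(build_match_query("best_fields", {"operator": "and"}))
```

With the connection from earlier, a query built this way could be run as `connection.execute(sa.text(build_match_query("phrase")), {"term": "railroad tracks"})`.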

### Combining Full-text Search with Other Criteria

As full-text search in CrateDB uses SQL, you can combine it with other criteria.  For example, let's search for community areas whose description matches the term "Univresity".

In [87]:
query = """
SELECT name, _score, details['population'] AS population, details['description'] AS description 
FROM community_areas 
WHERE MATCH(details['description'], 'Univresity')
ORDER BY _score DESC
LIMIT 5;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,population,description


How many results do we get?  None... because there's a small typo in the search term.  Specifying a `fuzziness` factor helps compensate for this sort of error in user input.  Let's try again:

In [88]:
query = """
SELECT name, _score, details['population'] AS population, details['description'] AS description 
FROM community_areas 
WHERE MATCH(details['description'], 'Univresity') USING best_fields WITH (fuzziness = 2)
ORDER BY _score DESC
LIMIT 5;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,population,description
0,ROGERS PARK,0.8227,55628,"Rogers Park is the first of Chicago's 77 community areas. Located 9 miles (14 km) from the Loop, it is on the city's far north side on the shore of Lake Michigan. The neighborhood is culturally diverse and features green spaces, early 20th century architecture, live theater, bars, restaurants, and beaches. It is bounded by the city of Evanston along Juneway Terrace and Howard Street to the north, Ridge Boulevard to the west, Devon Avenue and the Edgewater neighborhood to the south, and Lake Michigan to the east. The neighborhood just to the west, West Ridge, was part of Rogers Park until the 1890s and is still sometimes referred to as West Rogers Park. In the early 1900s, what is now Loyola University Chicago became established at the south eastern end of the community area along the lake. In 2022, Rogers Park was ranked as a top 5 neighborhood to live in the United States."
1,RIVERDALE,0.748409,7262,"Riverdale is one of the 77 official community areas of Chicago, Illinois and is located on the city's far south side. As originally designated by the Social Science Research Committee at the University of Chicago and officially adopted by the City of Chicago, the Riverdale community area extends from 115th Street south to the city boundary at 138th Street and from the Illinois Central Railroad tracks east to the Bishop Ford Freeway."
2,WOODLAWN,0.610295,24425,"Woodlawn, on the South Side of Chicago, Illinois, is one of Chicago's 77 community areas. It is bounded by Lake Michigan to the east, 60th Street to the north, Martin Luther King Drive to the west, and 67th Street to the south. Both Hyde Park Career Academy and the all-boys Catholic Mount Carmel High School are in this neighborhood; much of its eastern portion is occupied by Jackson Park. The Woodlawn section of the park includes the site of the planned Obama Presidential Center, an estimated $500 million investment. The northern edge of Woodlawn contains a portion of the campus of the University of Chicago."
3,NEAR WEST SIDE,0.609237,67881,"The Near West Side, one of the 77 community areas of Chicago, is on the West Side, west of the Chicago River and adjacent to the Loop. The Great Chicago Fire of 1871 started on the Near West Side. Waves of immigration shaped the history of the Near West Side of Chicago, including the founding of Hull House, a prominent settlement house. In the 19th century railroads became prominent features. In the mid-20th century, the area saw the development of freeways centered in the Jane Byrne Interchange. The area is home to the University of Illinois at Chicago (UIC), Chicago-Kent College of Law, and City Colleges' Malcolm X College. The United Center arena, the Illinois Medical District, Union Station, Ogilvie Station, and the Jane Byrne Interchange are also located in the community area."
4,SOUTH SHORE,0.596766,53971,"South Shore is one of 77 defined community areas of Chicago, Illinois, United States. Located on the city's South Side, the area is named for its location along the city's southern lakefront. Although South Shore has seen a greater than 40% decrease in residents since Chicago's population peaked in the 1950s, the area remains one of the most densely populated neighborhoods on the South Side. The community benefits from its location along the waterfront, its accessibility to Lake Shore Drive, and its proximity to major institutions and attractions such as the University of Chicago, the Museum of Science and Industry, and Jackson Park."


By adding a second clause, we can limit the results to those areas with a population of at least 30,000 people:

In [89]:
query = """
SELECT name, _score, details['population'] AS population, details['description'] AS description 
FROM community_areas 
WHERE MATCH(details['description'], 'Univresity') USING best_fields WITH (fuzziness = 2)
AND details['population'] >= 30000
ORDER BY _score DESC;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,population,description
0,ROGERS PARK,1.8227,55628,"Rogers Park is the first of Chicago's 77 community areas. Located 9 miles (14 km) from the Loop, it is on the city's far north side on the shore of Lake Michigan. The neighborhood is culturally diverse and features green spaces, early 20th century architecture, live theater, bars, restaurants, and beaches. It is bounded by the city of Evanston along Juneway Terrace and Howard Street to the north, Ridge Boulevard to the west, Devon Avenue and the Edgewater neighborhood to the south, and Lake Michigan to the east. The neighborhood just to the west, West Ridge, was part of Rogers Park until the 1890s and is still sometimes referred to as West Rogers Park. In the early 1900s, what is now Loyola University Chicago became established at the south eastern end of the community area along the lake. In 2022, Rogers Park was ranked as a top 5 neighborhood to live in the United States."
1,NEAR WEST SIDE,1.609237,67881,"The Near West Side, one of the 77 community areas of Chicago, is on the West Side, west of the Chicago River and adjacent to the Loop. The Great Chicago Fire of 1871 started on the Near West Side. Waves of immigration shaped the history of the Near West Side of Chicago, including the founding of Hull House, a prominent settlement house. In the 19th century railroads became prominent features. In the mid-20th century, the area saw the development of freeways centered in the Jane Byrne Interchange. The area is home to the University of Illinois at Chicago (UIC), Chicago-Kent College of Law, and City Colleges' Malcolm X College. The United Center arena, the Illinois Medical District, Union Station, Ogilvie Station, and the Jane Byrne Interchange are also located in the community area."
2,SOUTH SHORE,1.596766,53971,"South Shore is one of 77 defined community areas of Chicago, Illinois, United States. Located on the city's South Side, the area is named for its location along the city's southern lakefront. Although South Shore has seen a greater than 40% decrease in residents since Chicago's population peaked in the 1950s, the area remains one of the most densely populated neighborhoods on the South Side. The community benefits from its location along the waterfront, its accessibility to Lake Shore Drive, and its proximity to major institutions and attractions such as the University of Chicago, the Museum of Science and Industry, and Jackson Park."
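
If you want to reuse this query with other search terms or population thresholds, one option is to pass those values as bind parameters rather than editing the SQL string. The following is a minimal sketch, not part of the original workshop code: `build_search_query` is a hypothetical helper name, and it assumes that CrateDB accepts a bind parameter for the `MATCH` query term and that the SQLAlchemy dialect translates the named parameters for you. The `fuzziness` value is interpolated into the SQL text after being checked, because this sketch does not assume that options inside `WITH (...)` can be parameterized.

```python
def build_search_query(term, min_population, fuzziness=2):
    """Return (sql, params) for a fuzzy description search with a population floor.

    Hypothetical helper for illustration, not part of the workshop code.
    """
    # fuzziness is interpolated into the SQL text, so only accept a plain
    # non-negative integer here.
    if isinstance(fuzziness, bool) or not isinstance(fuzziness, int) or fuzziness < 0:
        raise ValueError("fuzziness must be a non-negative integer")

    sql = f"""
SELECT name, _score, details['population'] AS population, details['description'] AS description
FROM community_areas
WHERE MATCH(details['description'], :term) USING best_fields WITH (fuzziness = {fuzziness})
AND details['population'] >= :min_population
ORDER BY _score DESC
"""
    return sql, {"term": term, "min_population": min_population}


sql, params = build_search_query("Univresity", 30000)
print(params)

# With the `connection` object created earlier in this notebook, the query
# could then be run like this (wrapping the string in sa.text() so that
# SQLAlchemy handles the named parameters):
#
#   df = pd.read_sql(sa.text(sql), connection, params=params)
```

Keeping the search term and the threshold out of the SQL text means the same statement can be reused for different inputs without string concatenation.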


Here's an example of a negative search. We'll look for smaller communities, those with a population of 10,000 or fewer, whose descriptions don't mention railroads:

In [90]:
query = """
SELECT name, _score, details['population'], details['description'] AS description 
FROM community_areas 
WHERE NOT MATCH(details['description'], 'railroad')
AND details['population'] <= 10000
ORDER BY _score DESC;
"""

df = pd.read_sql(query, CONNECTION_STRING)
df

Unnamed: 0,name,_score,details['population'],description
0,OAKLAND,2.0,6799,"Oakland, located on the South Side of Chicago, Illinois, USA, is one of 77 officially designated Chicago community areas. Bordered by 35th and 43rd Streets, Cottage Grove Avenue and Lake Shore Drive, The Oakland area was constructed between 1872 and 1905. Some of Chicago's great old homes may be seen on Drexel Boulevard. The late 19th-century Monument Baptist Church on Oakwood Blvd. is modeled after Boston's Trinity Church. Oakwood/41st Street Beach in Burnham Park is at 4100 S. Lake Shore Drive. With an area of only 0.6 sq mi Oakland is the smallest community area by area in Chicago."
1,FULLER PARK,2.0,2567,"Fuller Park is the 37th of Chicago's 77 community areas. Located on the city's South Side, it is 5 miles (8.0 km) from the Loop. It is named for a small park also known as Fuller Park within the neighborhood, which is in turn named for Melville Weston Fuller, a Chicago attorney who was the Chief Justice of the United States between 1888 and 1910."
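
To see how the two conditions combine, here is a small sketch in plain Python that applies the same logic to five records, using short excerpts from descriptions shown in the outputs above. It is only an approximation: the `mentions` function is a hypothetical stand-in for `MATCH` that lowercases the text, splits it into words and strips a trailing plural "s", whereas CrateDB's `english` analyzer does proper stemming.

```python
import re

# Excerpts of descriptions that appear in this notebook's outputs (shortened).
areas = [
    {"name": "ROGERS PARK", "population": 55628,
     "description": "it is on the city's far north side on the shore of Lake Michigan."},
    {"name": "RIVERDALE", "population": 7262,
     "description": "from the Illinois Central Railroad tracks east to the Bishop Ford Freeway."},
    {"name": "NEAR WEST SIDE", "population": 67881,
     "description": "In the 19th century railroads became prominent features."},
    {"name": "OAKLAND", "population": 6799,
     "description": "Some of Chicago's great old homes may be seen on Drexel Boulevard."},
    {"name": "FULLER PARK", "population": 2567,
     "description": "Located on the city's South Side, it is 5 miles (8.0 km) from the Loop."},
]


def mentions(description, term):
    """Crude stand-in for MATCH: lowercase, split into words, strip plural 's'."""
    words = {w.rstrip("s") for w in re.findall(r"[a-z]+", description.lower())}
    return term.lower().rstrip("s") in words


# Same shape as the WHERE clause: NOT MATCH(...) AND population <= 10000.
result = [
    area["name"]
    for area in areas
    if not mentions(area["description"], "railroad") and area["population"] <= 10000
]
print(result)  # ['OAKLAND', 'FULLER PARK']
```

Riverdale is small enough to pass the population filter, but its description mentions the Illinois Central Railroad, so the negative clause excludes it. This is consistent with its absence from the query result above.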


## Vector Similarity Search

TODO vector similarity parts

## Towards Hybrid Search

TODO combining the two searches in one query

## Additional Resources

The following are additional resources and workbooks that expand on the topics covered here:

* [Blog: Hybrid Search in CrateDB](https://cratedb.com/blog/hybrid-search-explained)
* [Blog: Dissecting a Hybrid Search Query in SQL](https://cratedb.com/blog/dissecting-a-hybrid-search-query-in-sql)
* [CrateDB documentation: Full-text Search](https://cratedb.com/docs/guide/feature/search/fts/index.html)
* [CrateDB documentation: Hybrid Search](https://cratedb.com/docs/guide/feature/search/hybrid/index.html)
* [Jupyter notebook: Applying RAG using CrateDB and LangChain](https://github.com/crate/cratedb-examples/blob/main/topic/machine-learning/llm-langchain/cratedb_rag_customer_support_langchain.ipynb)


## Continue Your Learning Journey

To learn more about CrateDB, sign up for our courses at the CrateDB Academy.  We recommend the [CrateDB Fundamentals](https://learn.cratedb.com/cratedb-fundamentals) course for a comprehensive overview, and our [Advanced Time Series](https://learn.cratedb.com/time-series) course for a deep dive into time series data concepts.