<a href="https://colab.research.google.com/github/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/Search.ipynb" target="_newt">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<div style="display:flex;width:100%;">
<img src="https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120" alt="Redis" width="90"/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</div>

# Redis Learning Session - Search

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/banner.png?raw=true" alt="Redis Data Types"/>

[Try an online search demo application](https://ecommerce.redisventures.com//)

In this notebook, we will explore the different types of Search provided by Redis.

## Installing the Pre-Reqs

In [None]:
!pip install -q folium
!pip install -q pandas
!pip install -q redis
!pip install -q unzip

## Installing Redis Stack Locally
If you are not using Redis Cloud as a database, uncomment and run the code below to install Redis locally. Then set your connection to 127.0.0.1

In [None]:
# %%sh
# curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg 
# echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list 
# sudo apt-get update  > /dev/null 2>&1
# sudo apt-get install redis-stack-server  > /dev/null 2>&1
# redis-stack-server --daemonize yes

## Copying and Unzipping Lab Files

In [None]:
import os

In [None]:
if not os.path.exists("lab_assets.zip"):
  !wget https://denisd-bucket-p.s3.us-east-1.amazonaws.com/lab_assets/search/lab_assets.zip
  !unzip lab_assets.zip

## Connecting to Redis

In [None]:
import redis
from google.colab import userdata

#### Setup the Connection String

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_secrets.png?raw=true" alt="Callout - Use Google Colab secrets instead"/>

In [None]:
try:
  REDIS_HOST = userdata.get('REDIS_HOST')
except:
  REDIS_HOST="127.0.0.1"

try:
  REDIS_PORT = userdata.get('REDIS_PORT')
except:
  REDIS_PORT=6379

try:
  REDIS_PASSWORD = userdata.get('REDIS_PASSWORD')
except:
  REDIS_PASSWORD=""

REDIS_URL = f"redis://default:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"

#### Testing the Connection to Redis

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_connection.png?raw=true" alt="Callout - Make sure connection works"/>

In [None]:
r = redis.from_url(REDIS_URL, decode_responses=True)

if r.ping():
    print("Connection successful!")
else:
    print("Connection issue!")

## Geospatial

Geospatial data is supported in Redis as a native data type, and as part of our native JSON support. In this lab, we will explore both options.

Redis uses coordinate points to represent geospatial locations. You can store individual points but you can also use a set of points to define a polygon shape (the shape of a town, for example). You can query several types of interactions between points and shapes, such as whether a point lies within a shape or whether two shapes overlap.
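
To make the "point lies within a shape" idea concrete, here is a minimal pure-Python sketch of the classic ray-casting test. It treats coordinates as flat (x, y) values, which is only an approximation over small areas, and it is not how Redis evaluates shapes internally. The `point_in_polygon` helper and the `square` ring are hypothetical names and values chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray-casting test: cast a ray east from the point and count edge crossings."""
    lon, lat = point
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the point's latitude can be crossed by the ray
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
        j = i
    return inside


# Hypothetical closed ring (first vertex repeated at the end), as (lon, lat) pairs
square = [(-74.00, 40.74), (-73.97, 40.74), (-73.97, 40.76), (-74.00, 40.76), (-74.00, 40.74)]

print(point_in_polygon((-73.9857, 40.7485), square))  # point inside the ring
print(point_in_polygon((-73.9500, 40.7485), square))  # point east of the ring
```

An odd number of crossings means the point is inside; an even number means it is outside.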

### Part 1 - Native Geo data type

In [None]:
import pandas as pd
import json
import folium

#### Load Data

For this lab, we will use a dataset containing Airbnb listings and metrics in New York City for January 2024. Each listing contains the coordinates of its location, which is what we are interested in for this lab.

In [None]:
dataset = pd.read_csv('./new_york_listings_2024.csv')
print(len(dataset))
dataset.head()

#### Understand the distribution of the Room_Type attribute

In [None]:
dataset['room_type'].hist()

#### Create a new Dataframe only with Hotel Room records

In [None]:
hotel_rooms = dataset[dataset['room_type'] == 'Hotel room']
print(len(hotel_rooms))
hotel_rooms.head()

#### Save data to Redis

We will use a pipeline to write data to Redis. The pipeline will gather all commands, and then send the list of commands to the server, where they are executed in order. The pipeline then returns a list containing the response for each command.

In [None]:
pipe = r.pipeline(transaction=False)
keyname = "geo:nyc:hotel_rooms"

for index, row in hotel_rooms.iterrows():
      lat = row['latitude']
      lon = row['longitude']
      id = row['id']
      pipe.geoadd(keyname, [lon, lat, id])
results = pipe.execute()
print(len(results))

&nbsp;

&nbsp;

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_insight.png?raw=true" alt="Callout - Check Redis Insight"/>

Open Redis Insight and confirm that the geo key was generated. There should be only one key, called `geo:nyc:hotel_rooms`, which contains all the different locations from the DataFrame.

&nbsp;

&nbsp;

#### Search for Hotel Rooms near the Empire State Building 

Next, we will search for Hotel Rooms within 850 meters of the Empire State Building, which is located at approximately 40.7485, -73.9857 (latitude, longitude). You can change the search radius in the next cell.

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_georadius.png?raw=true" alt="Callout - Change Search Radius"/>

In [None]:
esb_lat = 40.748534150023396
esb_lon = -73.98568519949094
radius = 850

#### Run the search

In [None]:
rooms = r.georadius(keyname, esb_lon, esb_lat, radius, 'm', withdist=True, withcoord=True)
len(rooms)
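
As a rough illustration of what a radius search computes, the sketch below filters points by great-circle (haversine) distance in plain Python. The Earth radius constant is an assumption (Redis may use a slightly different value), and `haversine_m`, `within_radius`, and the `sample` points are hypothetical names and data, not part of the lab dataset.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(points, center_lon, center_lat, radius_m):
    """Return (name, distance) pairs within radius_m of the center, nearest first."""
    hits = []
    for name, lon, lat in points:
        d = haversine_m(center_lon, center_lat, lon, lat)
        if d <= radius_m:
            hits.append((name, d))
    return sorted(hits, key=lambda hit: hit[1])


center_lon, center_lat = -73.98568519949094, 40.748534150023396
sample = [("near", -73.9850, 40.7490), ("far", -73.9500, 40.7800)]  # made-up points
print(within_radius(sample, center_lon, center_lat, 850))
```

Redis answers the same question from an index instead of scanning every point, which is why the server-side search stays fast as the key grows.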

#### Render a map to see results and add the search results to the map

In [None]:
map1 = folium.Map(location=[esb_lat, esb_lon], zoom_start=15, tiles="Cartodb Positron")
folium.Marker([esb_lat, esb_lon], popup="Empire State Building", icon=folium.Icon(color='blue')).add_to(map1)
folium.Circle(
    location=[esb_lat, esb_lon], radius=radius, color="cornflowerblue", weight=1, fill_opacity=0.3, opacity=1, stroke=False,
    fill=True, popup="{} meters".format(radius), tooltip="Search Radius"
).add_to(map1)

for room in rooms:
    folium.Marker([room[2][1], room[2][0]], popup=f"{room[0]} {round(room[1],2)} meters", icon=folium.Icon(color='red')).add_to(map1)

In [None]:
map1

### Part 2 - Geospatial data with JSON

In [None]:
from redis.commands.search.field import GeoShapeField
from redis.commands.search.field import TextField
from redis.commands.search.field import TagField
from redis.commands.search.field import NumericField
from redis.commands.search.query import Query
from redis.commands.search.index_definition import IndexDefinition, IndexType

#### Save JSON data to Redis

In this step, we will save the same data as before, but in JSON format, which allows us to capture more attributes than just the coordinates. Notice how the `location` attribute is saved as a `POINT()` record, with longitude first and latitude second. We need this to run our polygon search.

In [None]:
for index, row in hotel_rooms.iterrows():
      id = row['id']
      keyname = f"geo:nyc:hotel_rooms:{id}"
      value = {
            'id': id,
            'location': f"POINT ({row['longitude']} {row['latitude']})",
            'latitude': row['latitude'],
            'longitude': row['longitude'],
            'host_name' : row['host_name'],
            'neighbourhood': row['neighbourhood'],
            'price': row['price'],
            'reviews': row['number_of_reviews'],
            'rating': row['rating'],
            'bedrooms': row['bedrooms']
      }
      pipe.json().set(keyname, "$", value)
result = pipe.execute()
len(result)

#### Create Search Index

JSON data needs to be indexed for search. This command will create the search index for the location attribute (which includes the latitude and longitude information).

In [None]:
geo_schema = (GeoShapeField("$.location", as_name="location"))

try:
    r.ft("idx:rooms").dropindex()
except:
    print("Index does not exist")

try:
  geo_index_create_result = r.ft("idx:rooms").create_index(
    geo_schema,
    definition=IndexDefinition(
        prefix=["geo:nyc:hotel_rooms:"], index_type=IndexType.JSON
    )
  )
  print(geo_index_create_result)

except Exception as e:
  print(e)


#### Check the index

Make sure indexing is 100% complete and the total number of documents is greater than 0.

In [None]:
info = r.ft('idx:rooms').info()
print(f" Percent Indexed: {float(info['percent_indexed']) * 100:.0f}")
print(f" Total Documents: {info['num_docs']}")

#### Run Search

In [None]:
shape = "POLYGON ((-73.9792654 40.7545612, -73.9928266 40.7549838, -73.9988777 40.7489044, -73.9946291 40.7390201, -73.9805099 40.7385324, -73.9746305 40.7464335, -73.9792654 40.7545612))"
params_dict = {"esb": shape}

q = Query("@location:[WITHIN $esb]").dialect(3)
res = r.ft("idx:rooms").search(q, query_params=params_dict).docs
print(len(res))

#### Equivalent Redis Insight command

```
    FT.SEARCH idx:rooms "(@location:[WITHIN $qshape])" 
        PARAMS 2 qshape "POLYGON ((-73.9792654 40.7545612, -73.9928266 40.7549838, -73.9988777 40.7489044, -73.9946291 40.7390201, -73.9805099 40.7385324, -73.9746305 40.7464335, -73.9792654 40.7545612))" 
        DIALECT 3
```

&nbsp;

#### Render a new map with the polygon and the results

In [None]:
map2 = folium.Map(location=[esb_lat, esb_lon], zoom_start=15, tiles="Cartodb Positron")
folium.Marker([esb_lat, esb_lon], popup="Empire State Building", icon=folium.Icon(color='blue')).add_to(map2)

locations = [[40.7545612, -73.9792654], [40.7549838, -73.9928266], [40.7489044, -73.9988777], [40.7390201, -73.9946291], [40.7385324, -73.9805099], [40.7464335, -73.9746305], [40.7545612, -73.9792654]]

folium.Polygon(locations=locations, color="cornflowerblue", weight=1, fill_opacity=0.3, opacity=1, stroke=False, fill_color="maroon", fill=True, popup="Polygon", tooltip="Click me!",).add_to(map2)

In [None]:
for room in res:
    room_json = json.loads(room.json)
    lat = room_json[0]['latitude']
    lon = room_json[0]['longitude']
    host_name = room_json[0]['host_name']
    price = room_json[0]['price']
    folium.Marker([lat, lon], popup=f"{host_name} ${price}", icon=folium.Icon(color='red')).add_to(map2)

In [None]:
map2

## Streams

A Redis stream is a data structure that acts like an append-only log but also implements several operations to overcome some of the limits of a typical append-only log. These include random access in O(1) time and complex consumption strategies, such as consumer groups. You can use streams to record and simultaneously syndicate events in real time. Examples of Redis stream use cases include:

- Event sourcing (e.g., tracking user actions, clicks, etc.)
- Sensor monitoring (e.g., readings from devices in the field)
- Notifications (e.g., storing a record of each user's notifications in a separate stream)
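
To picture the append-only log with range queries described above, here is a toy in-memory model. It is a sketch only and not how Redis implements streams; `MiniStream` and its methods are hypothetical names. It mirrors two behaviors: explicit IDs must keep increasing, and a range query is inclusive at both ends.

```python
import bisect


class MiniStream:
    """Toy append-only log keyed by increasing integer IDs."""

    def __init__(self):
        self._ids = []
        self._entries = []

    def add(self, entry_id, fields):
        # As with an explicit ID in XADD, each new ID must be greater than the last one
        if self._ids and entry_id <= self._ids[-1]:
            raise ValueError("ID must be greater than the last ID in the stream")
        self._ids.append(entry_id)
        self._entries.append(fields)

    def range(self, start, end):
        # Binary search locates both ends of the inclusive [start, end] window
        lo = bisect.bisect_left(self._ids, start)
        hi = bisect.bisect_right(self._ids, end)
        return list(zip(self._ids[lo:hi], self._entries[lo:hi]))


stream = MiniStream()
for ts, temp in [(100, 21.5), (160, 21.7), (220, 22.0), (280, 22.4)]:
    stream.add(ts, {"temp": temp})

print(stream.range(150, 250))
```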


In this example, we will load IoT data from temperature monitoring devices as streams, and look for data within a certain time range. 

In [None]:
from datetime import datetime, timedelta
import pytz

#### Load data from the CSV file

In [None]:
iot_ds = pd.read_csv('iot.csv')
print(len(iot_ds))
iot_ds.head()

### Using a pipeline to save data to Redis

All messages will be stored under a single stream (key), named `streams:iot`. The pipeline will gather all commands and send them to Redis in a single batch.

In [None]:
keyname = "streams:iot"
for index, row in iot_ds.iterrows():
      value = {
            'id': row['id'],
            'room': row['room'],
            'date': row['date'],
            'temp' : row['temp'],
            'location': row['location'],
            'timestamp': row['timestamp']
      }
      pipe.xadd(keyname, id=row['timestamp'], fields=value)
result = pipe.execute()
len(result)

#### Getting the first and last timestamps - format DAY/MONTH/YEAR

In [None]:
print(iot_ds.iloc[0]['date'])
print(iot_ds.iloc[len(iot_ds)-1]['date'])

#### Define the time range you want to search on

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_streams.png?raw=true" alt="Callout - Change Time Range"/>

In [None]:
start_date = "01-04-2024 12:00"
end_date = "01-04-2024 12:30"

In [None]:
datetime_format = "%d-%m-%Y %H:%M"
local_tz = pytz.timezone('America/Chicago')

# Parse the naive date strings, then attach the local time zone.
# The results are timezone-aware local times (not UTC); timestamp()
# converts them to Unix seconds.
local_start = local_tz.localize(datetime.strptime(start_date, datetime_format))
first_ts = local_start.timestamp()

local_end = local_tz.localize(datetime.strptime(end_date, datetime_format))
last_ts = local_end.timestamp()
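
The time zone attached to the date string changes the resulting timestamp, and therefore which stream entries fall in the range. The sketch below shows this with the standard library only. `to_epoch_seconds` is a hypothetical helper, and the fixed UTC-5 offset is an assumption standing in for America/Chicago on that date (daylight time).

```python
from datetime import datetime, timedelta, timezone


def to_epoch_seconds(text, tz, fmt="%d-%m-%Y %H:%M"):
    """Interpret a naive date string in the given time zone and return Unix seconds."""
    naive = datetime.strptime(text, fmt)
    return int(naive.replace(tzinfo=tz).timestamp())


utc = timezone.utc
cdt = timezone(timedelta(hours=-5))  # assumed fixed offset for Chicago in April

print(to_epoch_seconds("01-04-2024 12:00", utc))  # 1711972800
print(to_epoch_seconds("01-04-2024 12:00", cdt))  # 1711990800
```

The two values differ by five hours (18,000 seconds), so it is worth printing the computed timestamps before comparing results across tools.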

#### Running the Search

In [None]:
messages = r.xrange("streams:iot", int(first_ts), int(last_ts))
for message in messages:
  print(message)

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_insight.png?raw=true" alt="Callout - Check Redis Insight"/>

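You can use Redis Insight to check on the stream data. There should be a key called `streams:iot` with all 97,546 messages in it. The UI will show some of the messages, and you can run a similar search in the Workbench using the command below. Its timestamps correspond to 12:00 and 12:30 UTC on April 1, 2024; substitute the values of `first_ts` and `last_ts` if your range is localized to another time zone.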

&nbsp;

```
    XRANGE streams:iot 1711972800 1711974600
```

&nbsp;

## Time Series

The Redis time series data type lets you store real-valued data points along with the time they were collected. You can combine the values from a selection of time series and query them by time or value range. You can also compute aggregate functions of the data over periods of time and create new time series from the results. When you create a time series, you can specify a maximum retention period for the data, relative to the last reported timestamp, to prevent the time series from growing indefinitely.

In this lab, we will load stock data from a few different companies and search for data within a specific time range. The stock data contains the Open and Close values with one data point per day.

### Importing Required Packages

In [None]:
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

#### Load data from CSV file

In [None]:
ts_df = pd.read_csv('timeseries.csv')
ts_df = ts_df.sort_values(by='Ticker')
print(len(ts_df))
ts_df.head()

#### Time Series data is accessed through the `r.ts()` object in the Python client. It has its own pipeline.

In [None]:
redis_ts = r.ts()

datetime_format = "%Y-%m-%d %H:%M:%S%z"
local_tz = pytz.timezone('America/New_York')

keyprefix = "timeseries:stock"
pipets = redis_ts.pipeline(transaction=False)

Get a list of all stock tickers in the file:

In [None]:
ticker_list = ts_df['Ticker'].unique()
ticker_list

Each time series (key) will hold the values of one metric over time. The CSV file contains 2 metrics (Open and Close values) for 5 different stock tickers, which means that we will need to create 10 keys, 2 for each ticker.

The loop will first create 2 keys for each ticker in a try statement (because the keys can only be created once), then it will add the Open and Close values to those keys. We will use labels to filter the keys during the search:

- Ticker will allow us to filter the search by company ("give me all Apple stock data", etc.)
- Type will allow us to select the type of metric ("give me all Open values for Dec 1st, 2015")
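
Conceptually, a label filter selects keys whose labels match the requested values. The sketch below models that with a plain dictionary; it illustrates the idea only and is not the server's implementation. `filter_series` is a hypothetical helper and the `MSFT` entries are made-up examples.

```python
# Labels attached to each time series key (MSFT is a hypothetical ticker)
series_labels = {
    "timeseries:stock:AAPL:open": {"TICKER": "AAPL", "TYPE": "open"},
    "timeseries:stock:AAPL:close": {"TICKER": "AAPL", "TYPE": "close"},
    "timeseries:stock:MSFT:open": {"TICKER": "MSFT", "TYPE": "open"},
    "timeseries:stock:MSFT:close": {"TICKER": "MSFT", "TYPE": "close"},
}


def filter_series(labels_by_key, **wanted):
    """Return the keys whose labels match every requested label=value pair."""
    return sorted(
        key
        for key, labels in labels_by_key.items()
        if all(labels.get(name) == value for name, value in wanted.items())
    )


print(filter_series(series_labels, TICKER="AAPL"))
print(filter_series(series_labels, TICKER="AAPL", TYPE="open"))
```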

In [None]:
for ticker in ticker_list:
    df_loop = ts_df[ts_df['Ticker'] == ticker]
    print(f"Creating Keys for Ticker {ticker} : {len(df_loop)} total rows")

    label_open = {'TICKER' : ticker, 'TYPE': 'open'}
    label_close = {'TICKER' : ticker, 'TYPE': 'close'}
    keyopen = f"{keyprefix}:{ticker}:open"
    keyclose = f"{keyprefix}:{ticker}:close"

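    # Create each key separately so an existing "open" key does not skip the "close" key
    for key, labels in ((keyopen, label_open), (keyclose, label_close)):
      try:
        redis_ts.create(key, labels=labels, duplicate_policy='LAST')
      except redis.ResponseError:
        pass  # the key already exists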

    values = df_loop.values.tolist()
    counter = 0
    for value in values:
      m_date = value[0]
      m_open = value[1]
      m_close = value[2]
      local_start = datetime.strptime(m_date, datetime_format)
      timestamp = int(local_start.timestamp())
      
      _ = pipets.add(keyopen, timestamp, m_open) 
      _ = pipets.add(keyclose, timestamp, m_close)
      counter += 1
      # print(f"Counter: {counter}")
result = pipets.execute()
print(len(result))

#### List the first and last timestamp - date format YEAR-MONTH-DAY

In [None]:
# The DataFrame is sorted by ticker, so take the earliest and latest dates explicitly
print(ts_df['Date'].min())
print(ts_df['Date'].max())

#### Set the time range and stock ticker for our search

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_timeseries.png?raw=true" alt="Callout - Change Time Range and Stock Ticker"/>

In [None]:
start_time = "2016-05-01 12:00:00-04:00"
end_time = "2016-05-11 12:00:00-04:00"
ticker = "AAPL"

#### Retrieve open and close data from the selected stock ticker

In [None]:
filters = [f"TICKER=({ticker})"]
start_ts = int(datetime.strptime(start_time, datetime_format).timestamp())
end_ts = int(datetime.strptime(end_time, datetime_format).timestamp())
results = redis_ts.mrange(from_time=start_ts, to_time=end_ts, filters=filters)
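# Each result is a dict of {key: [labels, samples]}; count the samples per key
{key: len(data[1]) for result in results for key, data in result.items()}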

#### Adding the search results to a new DataFrame

In [None]:
open_data, close_data = [], []
for result in results:
  keyname = list(result.keys())[0]
  keydata = result[keyname]
  # Route the samples to the matching list based on the key name
  target = open_data if 'open' in keyname else close_data
  for item in keydata[1]:
    # item is (timestamp in seconds, value); keep only the date part
    target.append((datetime.fromtimestamp(item[0]).strftime('%Y-%m-%d'), item[1]))

df_open = pd.DataFrame(open_data, columns=['Date', 'Open'])
df_close = pd.DataFrame(close_data, columns=['Date', 'Close'])
df_open.head()

#### Visualizing the search results in a timeline chart

In [None]:
plt.figure(figsize=(15, 5)) 
plt.plot(df_open['Date'], df_open['Open'], label='Open',)
plt.plot(df_close['Date'], df_close['Close'], label='Close')
plt.title(f"{ticker} - Open & Close")
plt.xlabel("Date")
plt.ylabel("Value")
plt.legend()
plt.show()

&nbsp;

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_insight.png?raw=true" alt="Callout - Check Redis Insight"/>

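Redis Insight will show the new time series keys, but it will not display their values (this data type is not yet supported in that UI). However, you can run a similar search and see the values using the **Workbench**. The command below covers May 11, 2015 through May 1, 2016 (12:00 UTC); replace the two timestamps with the values of `start_ts` and `end_ts` to reproduce the notebook query exactly: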

&nbsp;

```
    TS.MRANGE 1431345600 1462104000 WITHLABELS FILTER TICKER='AAPL'
```

&nbsp;

## Hash

Redis hashes are record types structured as collections of field-value pairs. You can use hashes to represent basic objects and to store groupings of counters, among other things.

In this lab, we will search through a product catalog from an e-commerce dataset.

In [None]:
from redis.commands.search.query import NumericFilter, Query
import redis.commands.search.aggregation as aggregations
import redis.commands.search.reducers as reducers


#### Load data from JSON file

In [None]:
with open('products.json', 'r') as file:
    ds = json.load(file)
len(ds)

#### Save data to Redis

In [None]:
for product in ds:
  keyname = f"product:{product['id']}"
  pipe.hset(keyname, mapping=product)
results = pipe.execute()
len(results)

### Intermission: how to find documents without Search

Redis is a key-value store: you provide the key you want to access, and Redis reads or writes its value. This works well for most use cases. When you need to find every key that meets some criteria, however, Redis offers the `SCAN` command, which iterates over the keys that match a pattern.

Depending on what you are looking for, however, this can be a very time-consuming operation. For instance, let's try to use `SCAN` to count the number of Shirts in our catalog (products that have `articleType=Shirts`):

In [None]:
cursor = 0
all_keys = []
while True:
    cursor, keys = r.scan(cursor=cursor, match="product:*", count=100)
    for key in keys:
      if r.hget(key, "articleType") == "Shirts":
        all_keys.append(key)
    if cursor == 0:
        break
print(f"Found shirt keys: {len(all_keys)}")

...that took way longer than it should.

Back to our normal programming.

#### Create the Search Index

In [None]:
schema = (
    TextField("productDisplayName", as_name="productDisplayName"),
    TextField("articleType", as_name="articleType"),
    TextField("articleNumber", as_name="articleNumber"),
    TagField("brandName", as_name="brandName"),
    TextField("variantName", as_name="variantName"),
    TagField("ageGroup", as_name="ageGroup"),
    TagField("gender", as_name="gender"),
    TagField("fashionType", as_name="fashionType"),
    TagField("season", as_name="season"),
    TagField("year", as_name="year"),
    TagField("masterCategory", as_name="masterCategory"),
    TagField("subCategory", as_name="subCategory"),
    TagField("displayCategories", as_name="displayCategories"),
    TagField("baseColour", as_name="baseColour"),
    NumericField("id", as_name="id"),
    NumericField("price", as_name="price"),
    NumericField("discountedPrice", as_name="discountedPrice"),
    NumericField("catalogAddDate", as_name="catalogAddDate"),
    NumericField("rating", as_name="rating"),
    NumericField("discount_pct", as_name="discount_pct"),
    NumericField("inventoryCount", as_name="inventoryCount")
)
try:
    r.ft("idx:product").dropindex()
except:
    print("--> Product index doesn't exist; creating it")
try:
    definition = IndexDefinition(prefix=["product:"], index_type=IndexType.HASH)
    result = r.ft("idx:product").create_index(fields=schema, definition=definition)
except Exception as ex:
    result = f"FAILED to create index: {ex}"

#### Check the index

In [None]:
info = r.ft('idx:product').info()
print(f" Percent Indexed: {float(info['percent_indexed']) * 100:.0f}")
print(f" Total Documents: {info['num_docs']}")

#### Before we Start - Count the number of Shirts using Redis Search

In [None]:
req = aggregations.AggregateRequest("@articleType:(Shirts)"
                    ).group_by([], reducers.count().alias("total_count"))

r.ft("idx:product").aggregate(req).rows

#### Start Simple - Search for Nike Women Shoes

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_hsearch.png?raw=true" alt="Callout - Change Hash Search Terms"/>

In [None]:
search_query = "Nike Women Shoe"

#### Run Search with Pagination

It's possible to limit the number of results with `paging(offset, num)`, where `offset` is the position of the first result to return and `num` is the page size. Fetching one page at a time keeps each response small.
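
The offset arithmetic behind pagination can be sketched in plain Python. `page_window` and `fetch_page` are hypothetical helpers, and the list of IDs is a stand-in for a sorted result set.

```python
def page_window(page_number, page_size=10):
    """Translate a zero-based page number into an (offset, num) pair."""
    return page_number * page_size, page_size


def fetch_page(items, page_number, page_size=10):
    """Slice one page out of an already sorted list of results."""
    offset, num = page_window(page_number, page_size)
    return items[offset:offset + num]


ids = list(range(1, 36))  # stand-in for 35 sorted result IDs

print(page_window(0))      # first page
print(page_window(1))      # second page
print(fetch_page(ids, 3))  # last, partially filled page
```

The pair returned by `page_window` lines up with the two arguments passed to `paging()` in the queries that follow.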

In [None]:
query = Query(f'@productDisplayName:({search_query})'
                ).sort_by('id', asc=True
                ).return_fields('productDisplayName', 'price', 'id'
                ).paging(0,10)

r.ft("idx:product").search(query).docs

In [None]:
query = Query(f'@productDisplayName:({search_query})'
                ).sort_by('id', asc=True
                ).return_fields('productDisplayName', 'price', 'id'
                ).paging(10,10)

r.ft("idx:product").search(query).docs

#### Search using words out of order

We will test if Redis is able to find results, even if words are out of order. Products in the catalog are usually named 'Men [...] blue jeans'; we will test if a query like "jeans blue men" yields any results.

In [None]:
search_query = "jeans blue men"

In [None]:
query = Query(f'@productDisplayName:({search_query})'
                ).sort_by('id', asc=True
                ).return_fields('productDisplayName', 'price', 'id'
                ).paging(0,10)

r.ft("idx:product").search(query).docs

#### Search for White or Black shoes between \$100 and \$200

In [None]:
query = Query('@articleType:(Shoes) @baseColour:{White|Black} @discountedPrice:[100 200]'
                ).return_fields('productDisplayName', 'discountedPrice', 'baseColour', 'id')

r.ft("idx:product").search(query).docs

#### Search for Shirts that are not Blue or Black

In [None]:
query = Query('@articleType:(Shirts) -@baseColour:{Blue|Black}'
                ).return_fields('productDisplayName', 'discountedPrice', 'baseColour', 'id')

r.ft("idx:product").search(query).docs

#### List the Top 5 discounted products

In [None]:
query = Query('*'
                ).sort_by('discount_pct', asc=False
                ).return_fields('productDisplayName', 'discountedPrice', 'discount_pct', 'articleType', 'id'
                ).paging(0,5)

r.ft("idx:product").search(query).docs

#### List the Top 10 most expensive shoes

In [None]:
query = Query('@articleType:(Shoes)'
                ).sort_by('discountedPrice', asc=False
                ).return_fields('productDisplayName', 'discountedPrice', 'discount_pct', 'articleType', 'id'
                ).paging(0,10)

r.ft("idx:product").search(query).docs

#### Fuzzy Search

A fuzzy search finds documents with words that approximately match your search term. To perform a fuzzy search, wrap the search term in pairs of % characters. One pair allows a [**Levenshtein**](https://en.wikipedia.org/wiki/Levenshtein_distance) distance of one, two pairs allow a distance of two, and three pairs, the maximum, allow a distance of three.
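
To see why the number of % pairs matters, the sketch below computes the Levenshtein distance between a misspelled term and the intended word. It is a textbook dynamic-programming version written for illustration, not the code Redis runs; `levenshtein` and `fuzzy_term` are hypothetical helpers.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_term(term: str, distance: int) -> str:
    """Wrap a term in the number of % pairs that allows the given distance."""
    return "%" * distance + term + "%" * distance


print(levenshtein("addidas", "adidas"))  # 1
print(levenshtein("addidaz", "adidas"))  # 2
print(fuzzy_term("addidaz", 2))          # %%addidaz%%
```

The first misspelling is one edit away from "adidas" and the second is two edits away, which is why they need one and two pairs of % characters respectively.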

For instance, in this case, we will search for ***addidas*** shoes. First, let's try the standard search, which should return an empty list.

In [None]:
query = Query('@productDisplayName:(addidas)'
                ).sort_by('id', asc=True
                ).return_fields('productDisplayName', 'articleType', 'id', 'brandName'
                ).paging(0,10)

r.ft("idx:product").search(query).docs

Now let's try a search using the % character.

In [None]:
query = Query('@productDisplayName:(%addidas%)'
                ).sort_by('id', asc=True
                ).return_fields('productDisplayName', 'articleType', 'id', 'brandName'
                ).paging(0,10)

r.ft("idx:product").search(query).docs

Success! However, if we make our search term worse (***addidaz***), a Levenshtein distance of one will not be enough to find it:

In [None]:
query = Query('@productDisplayName:(%addidaz%)'
                ).sort_by('id', asc=True
                ).return_fields('productDisplayName', 'articleType', 'id'
                ).paging(0,10)

r.ft("idx:product").search(query).docs

In this case, using two pairs of % characters should do the trick:

In [None]:
query = Query('@productDisplayName:(%%addidaz%%)'
                ).sort_by('id', asc=True
                ).return_fields('productDisplayName', 'articleType', 'id'
                ).paging(0,10)

r.ft("idx:product").search(query).docs

#### Spellcheck

Finally, we can use the Spellcheck function to suggest the right term.

In [None]:
r.ft('idx:product').spellcheck('addidas')

&nbsp;

## JSON

The JSON capability of Redis Open Source provides JavaScript Object Notation (JSON) support for Redis. It lets you store, update, and retrieve JSON values in a Redis database, similar to any other Redis data type. Redis JSON also works seamlessly with the Redis Query Engine to let you index and query JSON documents.

In this lab, we will use the same product data that we used for the Hash searches.

In [None]:
import os
import json
import pandas as pd
import numpy as np

#### Load dataframe from file

In [None]:
prod_df = pd.read_pickle('prodjson.pkl')
print(len(prod_df))
prod_df.head()

#### Save JSON data to Redis

In [None]:
for index, product in prod_df.iterrows():
  keyname = f"jsonprod:{product['id']}"
  pipe.json().set(keyname, "$", product.to_dict())
results = pipe.execute()
len(results)

Create Search index. We will configure some fields to be searchable even if they are missing or have empty values.

In [None]:
schema = (
    NumericField("$.id", as_name="id", sortable=True),
    NumericField("$.price", as_name="price"),
    NumericField("$.discountedPrice", as_name="discountedPrice"),
    TextField("$.articleNumber", as_name="articleNumber"),
    TextField("$.productDisplayName", as_name="productDisplayName"),
    TextField("$.productDescription", as_name="productDescription", index_missing=True, index_empty=True),
    TextField("$.variantName", as_name="variantName"),
    NumericField("$.catalogAddDate", as_name="catalogAddDate"),
    TagField("$.brandName", as_name="brandName"),
    TagField("$.ageGroup", as_name="ageGroup"),
    TagField("$.gender", as_name="gender"),
    TagField("$.baseColour", as_name="baseColour"),
    TagField("$.fashionType", as_name="fashionType"),
    TagField("$.season", as_name="season"),
    TagField("$.year", as_name="year"),
    NumericField("$.rating", as_name="rating"),    
    TagField("$.displayCategories", as_name="displayCategories"),
    TagField("$.masterCategory", as_name="masterCategory"),
    TagField("$.subCategory", as_name="subCategory"),
    TextField("$.articleType", as_name="articleType"),
    NumericField("$.discount_pct", as_name="discount_pct"),
    NumericField("$.inventoryCount", as_name="inventoryCount", index_missing=True)
)
try:
    r.ft("idx:jsonprod").dropindex()
except:
    print("--> JSONProd index doesn't exist; creating it")
try:
  definition = IndexDefinition(prefix=["jsonprod:"], index_type=IndexType.JSON)
  result = r.ft("idx:jsonprod").create_index(fields=schema, definition=definition)
except Exception as ex:
    result = f"FAILED to create index: {ex}"

#### Check status of the index

In [None]:
info = r.ft('idx:jsonprod').info()
print(f" Percent Indexed: {float(info['percent_indexed']) * 100:.0f}")
print(f" Total Documents: {info['num_docs']}")

### Aggregation Queries

An aggregation query allows you to perform the following actions:

- Apply simple mapping functions.
- Group data based on field values.
- Apply aggregation functions on the grouped data.
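
As a mental model for these steps, the sketch below groups a handful of records and applies a count and an average in plain Python. The `products` records are hypothetical, and `count_by` and `average_by` are illustrative helpers rather than anything the Redis client provides.

```python
from collections import Counter, defaultdict

# Hypothetical records standing in for the indexed product documents
products = [
    {"articleType": "Shirts", "discount_pct": 10},
    {"articleType": "Shirts", "discount_pct": 30},
    {"articleType": "Shirts", "discount_pct": 20},
    {"articleType": "Jeans", "discount_pct": 10},
    {"articleType": "Jeans", "discount_pct": 20},
    {"articleType": "Dress", "discount_pct": 50},
]


def count_by(records, field, limit=20):
    """Group by a field, count each group, sort by count descending, keep `limit` rows."""
    return Counter(record[field] for record in records).most_common(limit)


def average_by(records, group_field, value_field):
    """Group by a field and average another field within each group."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[group_field]].append(record[value_field])
    return {group: sum(values) / len(values) for group, values in buckets.items()}


print(count_by(products, "articleType"))
print(average_by(products, "articleType", "discount_pct")["Jeans"])
```

The queries in this section perform the same kind of grouping on the server, so only the aggregated rows travel back to the client.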

#### Count products by category

In [None]:
req = aggregations.AggregateRequest("*"
            ).group_by(["@articleType"], reducers.count().alias("total_units")
            ).sort_by(reducers.Desc('@total_units')
            ).limit(0,20)

r.ft("idx:jsonprod").aggregate(req).rows

<img src="https://github.com/denisabrantesredis/denisd-redis-learning-sessions/blob/main/Search/_assets/images/callout_insight.png?raw=true" alt="Callout - Check Redis Insight"/>

Check Redis Insight and take a look at the JSON data. You can also run the query above in the ***Workbench*** using this command:

&nbsp;
```
    FT.AGGREGATE idx:jsonprod "*"
        GROUPBY 1 @articleType
        REDUCE COUNT 0 AS "total_units"
        SORTBY 2 @total_units DESC
        LIMIT 0 20
```
&nbsp;

#### Count how many blue shirts

In [None]:
req = aggregations.AggregateRequest("@baseColour:{Blue} @articleType:(Shirts)"
            ).group_by([], reducers.count().alias("count"))

r.ft("idx:jsonprod").aggregate(req).rows

#### Find the average discount for Jeans

In [None]:
req = aggregations.AggregateRequest("@articleType:(Jeans)"
                ).group_by([], reducers.avg("@discount_pct").alias("avg_discount"))

r.ft("idx:jsonprod").aggregate(req).rows

#### Count the number of dresses that are cheaper than the average cost

In [None]:
# Step 1 - Get average dress price
req = aggregations.AggregateRequest("@articleType:Dress"
            ).group_by([], reducers.avg("@price").alias("avgPrice"))

result = r.ft("idx:jsonprod").aggregate(req).rows
avg = float(result[0][1])

# Step 2 - Count Dresses under this value
req = aggregations.AggregateRequest(f"@articleType:Dress @price:[0 {avg}]"
            ).group_by([], reducers.count().alias("count"))

r.ft("idx:jsonprod").aggregate(req).rows

#### Find which products are missing the 'Description' attribute

In [None]:
query = Query('(ismissing(@productDescription))'
            ).return_fields('productDisplayName', 'productDescription', 'articleType', 'id'
            ).paging(0,10)

r.ft("idx:jsonprod").search(query).docs

List the description of 2 products

In [None]:
print(f"--> Product ID 11653: {r.json().get('jsonprod:11653', '$.productDescription')}")
print(f"--> Product ID 12032: {r.json().get('jsonprod:12032', '$.productDescription')}")

Delete the description from both products

In [None]:
r.json().delete('jsonprod:11653', '$.productDescription')
r.json().delete('jsonprod:12032', '$.productDescription')

Run the query again

In [None]:
query = Query('(ismissing(@productDescription))'
            ).return_fields('productDisplayName', 'productDescription', 'articleType', 'id'
            ).paging(0,10)

r.ft("idx:jsonprod").search(query).docs
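
As a rough mental model, "missing" means the attribute is absent from the document, which is different from being present but empty. The sketch below uses hypothetical documents and plain dictionaries rather than Redis. On the Redis side, `ismissing()` only works for fields declared with the `INDEXMISSING` option, so this assumes the index schema created earlier in the notebook includes it.

```python
# Hypothetical JSON documents keyed like the lab data.
docs = {
    "jsonprod:1": {"productDisplayName": "Blue Shirt", "productDescription": "Soft cotton shirt"},
    "jsonprod:2": {"productDisplayName": "Red Dress"},  # attribute absent
    "jsonprod:3": {"productDisplayName": "Black Jeans", "productDescription": ""},  # present but empty
}

# A document is "missing" the attribute only when the key does not exist at all.
missing = [key for key, doc in docs.items() if "productDescription" not in doc]
print(missing)
```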

#### Synonyms

Redis supports synonyms: a search for one term also matches the other terms in its synonym group. The synonym data structure is a set of groups, each containing terms that are treated as equivalent.

Let's run a quick test and search for pants.

In [None]:
query = Query(f'@productDisplayName:(pants)'
            ).sort_by('id', asc=True
            ).return_fields('productDisplayName', 'price', 'id'
            ).paging(0,10)

r.ft("idx:jsonprod").search(query).docs

#### Create a synonym for pants

In [None]:
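# Arguments: group ID, skipinitial, then the terms in the group.
# skipinitial=False rescans the documents already indexed so they pick up the new group.
r.ft("idx:jsonprod").synupdate("synonym", False, "pants", "jeans")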

Try the search again

In [None]:
query = Query(f'@productDisplayName:(pants)'
            ).sort_by('id', asc=True
            ).return_fields('productDisplayName', 'price', 'id'
            ).paging(0,10)

r.ft("idx:jsonprod").search(query).docs
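
To make the synonym-group idea concrete, here is a small pure-Python sketch. The titles are hypothetical, and `expand_term` and `search` are illustrative helpers, not part of redis-py. Redis also applies stemming and does this work at index time, which the sketch ignores.

```python
# One synonym group, mirroring the synupdate call: group ID -> terms.
synonym_groups = {"synonym": {"pants", "jeans"}}

# Hypothetical product titles.
titles = ["Levis Men Blue Jeans", "Nike Track Pants", "Puma Running Shoes"]


def expand_term(term, groups):
    """Return the term plus every term that shares a synonym group with it."""
    expanded = {term}
    for terms in groups.values():
        if term in terms:
            expanded |= terms
    return expanded


def search(term, groups):
    """Return titles containing the term or any of its synonyms."""
    wanted = expand_term(term, groups)
    return [t for t in titles if wanted & set(t.lower().split())]


print(search("pants", {}))              # no synonym groups defined
print(search("pants", synonym_groups))  # with the pants/jeans group
```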

### Summarization

Summarization fragments the text into smaller snippets, each containing the found term(s) and some surrounding context.
To be clear, this is not an AI-generated summary; it is just the original attribute value truncated around the search term.

Let's search for products that contain the word 'sleek' in the description.

In [None]:
query = Query(f'@productDescription:(sleek)'
            ).sort_by('id', asc=True
            ).return_fields('productDisplayName', 'productDescription', 'id'
            ).paging(0,2)

r.ft("idx:jsonprod").search(query).docs

Long product descriptions make it harder to find the word we're looking for. Let's create a summarized version of the description.

In [None]:
query = Query(f'@productDescription:(sleek)'
            ).sort_by('id', asc=True
            ).return_fields('productDisplayName', 'productDescription', 'id'
            ).summarize(["productDescription"], context_len=30
            ).paging(0,2)

r.ft("idx:jsonprod").search(query).docs
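
The truncation idea can be approximated in a few lines of Python. This is only a sketch: the `summarize` function and the sample text are hypothetical, it returns a single fragment, and Redis' own fragment selection is more involved.

```python
def summarize(text, term, context_len=6, sep="..."):
    """Return a window of words around the first occurrence of term."""
    words = text.split()
    for i, word in enumerate(words):
        if word.strip(".,!?").lower() == term.lower():
            half = context_len // 2
            start = max(0, i - half)
            end = min(len(words), i + half + 1)
            return " ".join(words[start:end]) + sep
    return ""  # term not found


description = ("This running shoe combines a sleek profile with a "
               "cushioned sole for daily training")
print(summarize(description, "sleek", context_len=4))
```

A larger `context_len` keeps more of the surrounding words, which is the trade-off the `context_len=30` argument controls in the query above.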

### Highlighting

Highlighting will surround the found term (and its variants) with a user-defined pair of tags. This may be used to display the matched text in a different typeface using a markup language, or to otherwise make the text appear differently.

Let's highlight our 'sleek' search term in the results. We will also use the summary function to make it easier to find in the results.

In [None]:
query = Query(f'@productDescription:(sleek)'
            ).sort_by('id', asc=True
            ).return_fields('productDisplayName', 'productDescription', 'id'
            ).summarize(["productDescription"], context_len=30
            ).highlight(["productDescription"], tags=["<<<", ">>>"]
            ).paging(0,2)
r.ft("idx:jsonprod").search(query).docs
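
Conceptually, highlighting is a tagged substitution. The sketch below shows the idea with a regular expression; `highlight` here is a hypothetical helper, and unlike Redis it only matches the exact word, not its stemmed variants.

```python
import re


def highlight(text, term, tags=("<<<", ">>>")):
    """Wrap every whole-word, case-insensitive match of term in the given tags."""
    open_tag, close_tag = tags
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


print(highlight("A sleek profile. Sleek design.", "sleek"))
print(highlight("A sleek profile.", "sleek", tags=("<b>", "</b>")))
```

Swapping the tag pair for `<b>` and `</b>` is how the same mechanism would feed an HTML front end.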

### Scoring

When searching, documents are scored based on their relevance to the query. The score is a floating point number between 0.0 and 1.0, where 1.0 is the highest score. The score is returned as part of the search results and can be used to sort the results.

Redis comes with a few scoring functions to evaluate document relevance. They are based on document scores and term frequency, and they work independently of sorting by sortable fields.

For instance, let's compare how different scoring functions handle a search for Men's shoes that have the word 'grip' in the description:

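#### BM25

The default scorer depends on the Redis version: TFIDF in Redis Stack 7.x and earlier, BM25STD from Redis 8 onward. The cells below set the scorer explicitly, so the results do not depend on that default.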

In [None]:
query = Query('@productDescription:(grip) @articleType:(Shoes) @gender:{Men}'
                    ).scorer("BM25"
                    ).with_scores(
                    ).return_fields('id', 'productDisplayName'
                    ).paging(0,10)

r.ft("idx:jsonprod").search(query).docs

#### TFIDF

In [None]:
query = Query('@productDescription:(grip) @articleType:(Shoes) @gender:{Men}'
                    ).scorer("TFIDF"
                    ).with_scores(
                    ).return_fields('id', 'productDisplayName'
                    ).paging(0,10)

r.ft("idx:jsonprod").search(query).docs

results = r.ft("idx:jsonprod").search(query).docs
for result in results:
  print(result)

#### TFIDF.DOCNORM

In [None]:
query = Query('@productDescription:(grip) @articleType:(Shoes) @gender:{Men}'
                    ).scorer("TFIDF.DOCNORM"
                    ).with_scores(
                    ).return_fields('id', 'productDisplayName'
                    ).paging(0,10)

r.ft("idx:jsonprod").search(query).docs

results = r.ft("idx:jsonprod").search(query).docs
for result in results:
  print(result)

#### BM25STD

In [None]:
query = Query('@productDescription:(grip) @articleType:(Shoes) @gender:{Men}'
                    ).scorer("BM25STD"
                    ).with_scores(
                    ).return_fields('id', 'productDisplayName'
                    ).paging(0,10)
r.ft("idx:jsonprod").search(query).docs

results = r.ft("idx:jsonprod").search(query).docs
for result in results:
  print(result)
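
To build some intuition for why the scorers can rank the same documents differently, here is a toy comparison using textbook TF-IDF and BM25 formulas. This is an illustration only: the documents are hypothetical, the functions are not Redis' exact implementations, and the `k1` and `b` values are the commonly quoted defaults, assumed here.

```python
import math

# Hypothetical descriptions: one short, one long with more matches, one without the term.
docs = {
    "short": "grip shoe",
    "long": "grip shoe with laces and sole and heel and grip",
    "none": "leather shoe",
}
tokens = {name: text.split() for name, text in docs.items()}
avg_len = sum(len(t) for t in tokens.values()) / len(tokens)


def idf(term):
    """Inverse document frequency: rarer terms weigh more."""
    n = sum(1 for t in tokens.values() if term in t)
    return math.log(1 + (len(tokens) - n + 0.5) / (n + 0.5))


def tf_idf(term, name):
    """Raw term frequency times IDF, with no length normalization."""
    return tokens[name].count(term) * idf(term)


def bm25(term, name, k1=1.2, b=0.75):
    """BM25: term frequency saturates and long documents are penalized."""
    tf = tokens[name].count(term)
    length_norm = 1 - b + b * len(tokens[name]) / avg_len
    return idf(term) * tf * (k1 + 1) / (tf + k1 * length_norm)


def rank(scorer, term):
    """Document names ordered from highest to lowest score."""
    return sorted(tokens, key=lambda name: scorer(term, name), reverse=True)


print(rank(tf_idf, "grip"))
print(rank(bm25, "grip"))
```

In this toy corpus, raw TF-IDF favors the long document because it mentions the term twice, while BM25 favors the short one because the match makes up a larger share of the text.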

### Influencing the importance of certain values

It's possible to give some matches more weight than others. For instance, let's search for Nike or Adidas shoes:

In [None]:
query = Query('@productDescription:(shoe) @brandName:{Nike|Adidas}'
            ).scorer("TFIDF"
            ).with_scores( 
            ).return_fields('id', 'productDisplayName'
            ).paging(0,10)
            
results = r.ft("idx:jsonprod").search(query).docs
for result in results:
  print(result)

As we can see, all results are from Nike shoes. Now let's boost the importance of documents where the brand is Nike:

In [None]:
req = aggregations.AggregateRequest('@productDescription:(shoe) @brandName:{Nike|Adidas}'
            ).scorer("TFIDF"
            ).add_scores(                                    # expose the relevance score as @__score
            ).load('hybrid_score', 'final_score', 'productDisplayName', 'brandName'
            ).apply(hybrid_score="(@brandName == 'Nike')*5"  # multiplier: 5 for Nike, 0 for any other brand
            ).apply(final_score="@__score*@hybrid_score"     # boosted relevance score
            ).sort_by(reducers.Desc('@final_score'))
            
results = r.ft("idx:jsonprod").aggregate(req).rows
for result in results:
  print(result)
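
The arithmetic behind the two `apply` steps can be checked with a short pure-Python sketch. The rows and scores are hypothetical. Note that a comparison evaluates to 1 or 0, so multiplying by `(@brandName == 'Nike')*5` sets the final score of every non-Nike document to 0; the second variant, an assumption of this sketch rather than something from the lab, keeps the original score for other brands.

```python
# Hypothetical rows, as if returned with their relevance score in __score.
rows = [
    {"productDisplayName": "Nike Runner", "brandName": "Nike", "__score": 0.4},
    {"productDisplayName": "Adidas Sprint", "brandName": "Adidas", "__score": 0.9},
    {"productDisplayName": "Nike Court", "brandName": "Nike", "__score": 0.2},
]

for row in rows:
    is_nike = int(row["brandName"] == "Nike")  # 1 or 0, like the APPLY comparison
    # Same arithmetic as the aggregation: 5x for Nike, 0x for everything else.
    row["hybrid_score"] = is_nike * 5
    row["final_score"] = row["__score"] * row["hybrid_score"]
    # Alternative multiplier: 5x for Nike, 1x (unchanged) for everything else.
    row["soft_score"] = row["__score"] * (1 + 4 * is_nike)

ranked = sorted(rows, key=lambda r: r["final_score"], reverse=True)
for row in ranked:
    print(row["productDisplayName"], row["final_score"], row["soft_score"])
```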

&nbsp;

&nbsp;

&nbsp;


&nbsp;



# Congrats, this is the end of the lab!!