## Deploying AI-Driven Personalization Engines

Welcome to this in-depth guide on deploying AI-driven personalization engines. This notebook walks you through setting up a model for the MovieLens dataset using Shaped.ai with data in SingleStore, and then fetching ranked movies for a specific user.

### Setup

Create your free tier workspace and connect to the database.

Replace `<YOUR_API_KEY>` with your Shaped.ai API key below.

In [13]:
import os

SHAPED_API_KEY = os.getenv('TEST_SHAPED_API_KEY', '<YOUR_API_KEY>')

1. Install `shaped` to leverage the Shaped CLI to create, view, and use your model.
2. Install `pandas` to view and edit the sample dataset.
3. Install `pyyaml` to create Shaped Dataset and Model schema files.

In [14]:
! pip install shaped
! pip install pandas
! pip install pyyaml

Collecting shaped
  Downloading shaped-0.14.1-py3-none-any.whl.metadata (5.4 kB)
Collecting typer>=0.7.0 (from shaped)
  Downloading typer-0.12.5-py3-none-any.whl.metadata (15 kB)
Collecting pyarrow==11.0.0 (from shaped)
  Downloading pyarrow-11.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pandas==1.5.3 (from shaped)
  Downloading pandas-1.5.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting tqdm==4.65.0 (from shaped)
  Downloading tqdm-4.65.0-py3-none-any.whl.metadata (56 kB)
Collecting s3fs==0.4.2 (from shaped)
  Downloading s3fs-0.4.2-py3-none-any.whl.metadata (1.3 kB)
Collecting fsspec==2023.5.0 (from shaped)
  Downloading fsspec-2023.5.0-py3-none-any.whl.metadata (6.7 kB)
Collecting pytest>=6.2.5 (from shaped)
  Downloading pytest-8.3.3-py3-none-any.whl.metadata (7.5 kB)
Co

Initialize the CLI with your API key.

In [190]:
! shaped init --api-key $SHAPED_API_KEY

Initializing with config: {'api_key': '0YOaykvq7I8LXI7K4R3zaZrF46VtjHy4R6tkOvZa', 'env': 'prod'}


### Download Public Dataset

Fetch the publicly hosted MovieLens dataset.

In [16]:
! echo "Downloading movielens data..."

DIR_NAME = "notebook_assets"
! mkdir -p $DIR_NAME
! wget http://files.grouplens.org/datasets/movielens/ml-100k.zip --no-check-certificate -P $DIR_NAME
! unzip $DIR_NAME/ml-100k.zip -d $DIR_NAME

Downloading movielens data...
--2024-10-16 23:29:04--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘notebook_assets/ml-100k.zip’


2024-10-16 23:29:05 (13.1 MB/s) - ‘notebook_assets/ml-100k.zip’ saved [4924029/4924029]

Archive:  notebook_assets/ml-100k.zip
   creating: notebook_assets/ml-100k/
  inflating: notebook_assets/ml-100k/allbut.pl  
  inflating: notebook_assets/ml-100k/mku.sh  
  inflating: notebook_assets/ml-100k/README  
  inflating: notebook_assets/ml-100k/u.data  
  inflating: notebook_assets/ml-100k/u.genre  
  inflating: notebook_assets/ml-100k/u.info  
  inflating: notebook_assets/ml-100k/u.item  
  inflating: notebook_assets/ml-100k/u.occupation  
  inflating: notebook_assets/ml-100k/u.user  
  inflat

Let's take a look at the downloaded dataset. There are three tables of interest:
- `ratings` which are stored in `ml-100k/u.data`
- `users` which are stored in `ml-100k/u.user`
- `movies` which are stored in `ml-100k/u.item`

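Unfortunately, none of these delimited files has a header row, which Shaped requires. Note that `u.data` is tab-separated, while `u.user` and `u.item` are pipe-separated. To address this, we read each file with explicit column names and write it back out as a tab-separated file with a header, as shown below: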

In [191]:
import pandas as pd

data_dir = "notebook_assets/ml-100k"

events_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
events_df = pd.read_csv(f'{data_dir}/u.data', sep='\t', names=events_cols, encoding='latin-1')
display(events_df.head())
events_df.to_csv(f'{data_dir}/events.csv', sep='\t', index=False)

users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users_df = pd.read_csv(f'{data_dir}/u.user', sep='|', names=users_cols, encoding='latin-1')
display(users_df.head())
users_df.to_csv(f'{data_dir}/users.csv', sep='\t', index=False)

genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film_Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci_Fi", "Thriller", "War", "Western"
]
movies_cols = ['movie_id', 'title', 'release_date', "video_release_date", "imdb_url"] + genre_cols
movies_df = pd.read_csv(f'{data_dir}/u.item', sep='|', names=movies_cols, encoding='latin-1')
# Drop null column.
movies_df = movies_df.drop(columns=["video_release_date"])
display(movies_df.head())
movies_df.to_csv(f'{data_dir}/items.csv', sep='\t', index=False)

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


| | movie_id | title | release_date | imdb_url | genre_unknown | Action | Adventure | Animation | Children | Comedy | ... | Fantasy | Film_Noir | Horror | Musical | Mystery | Romance | Sci_Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
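
The `timestamp` values in `u.data` are Unix epoch seconds (UTC), whereas the events schema used later in this notebook declares that column as `DateTime`. If your SingleStore table needs a real datetime value, a conversion along these lines may help. This is a minimal standard-library sketch; the helper name `epoch_to_utc` is hypothetical and is not part of the notebook or of Shaped.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Timestamp of the first rating row shown above (user 196, movie 242).
first_rating_time = epoch_to_utc(881250949)
print(first_rating_time.isoformat())
```

With pandas, the equivalent for a whole column would be `pd.to_datetime(events_df['timestamp'], unit='s', utc=True)`.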


### Upload Data to SingleStore

Shaped supports many data connectors. In this tutorial we create a Shaped Dataset that reads from a SingleStore table. To do that we need to:
1. Create a .yaml file to connect Shaped to SingleStore.
2. Use Shaped CLI to create the dataset.

In [278]:
import re
import yaml

dir_path = "notebook_assets"

# Define the parsing function
def parse_singlestore_connection_string(connection_string):
    # Regular expression to parse the connection string
    pattern = re.compile(
        r'singlestoredb://(?P<name>[^:]+):(?P<password>[^@]+)@(?P<host>[^:]+):(?P<port>\d+)/(?P<database>[^/]+)'
    )

    # Match the pattern against the connection string
    match = pattern.search(connection_string)
    if not match:
        raise ValueError("Connection string format is incorrect.")
    
    # Extract components as a dictionary
    connection_info = match.groupdict()
    
    # Convert port to an integer
    connection_info['port'] = int(connection_info['port'])
    
    # Assume 'name' is the username
    connection_info['user'] = connection_info['name']
    
    return connection_info

# Parse the connection string
connection_info = parse_singlestore_connection_string(os.environ['DATABASE_URL'])

# Define the dataset schemas
events_dataset_schema = {
    "dataset_name": "movielens_events",
    "schema": {
        "rating": "Int32",
        "user_id": "String",
        "movie_id": "String",
        "timestamp": "DateTime"
    }
}

# Function to create the dataset configuration
def create_dataset_config(dataset_schema, connection_info):
    dataset_config = {
        "name": dataset_schema['dataset_name'],
        "host": connection_info['host'],
        "port": connection_info['port'],
        "user": connection_info['user'],
        "password": connection_info['password'],
        "table": dataset_schema['dataset_name'],
        "database": connection_info['database'],
        "replication_key": "timestamp", 
        "schema_type": "MYSQL"
    }
    return dataset_config

# Create and write the events dataset configuration
events_dataset_config = create_dataset_config(events_dataset_schema, connection_info)
with open(f'{dir_path}/events_dataset_schema.yaml', 'w') as file:
    yaml.dump(events_dataset_config, file)
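
The regular expression in `parse_singlestore_connection_string` keeps the password exactly as it appears in the URL. If your credentials contain characters such as `@` or `:` that are percent-encoded in `DATABASE_URL` (an assumption about how your connection string is formatted), a parser built on the standard library's `urllib.parse` may be more forgiving. The sketch below is one possible alternative, not the notebook's method; the function name `parse_connection_url` and the example host are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def parse_connection_url(connection_string: str) -> dict:
    """Split a singlestoredb:// URL into the fields the dataset config needs."""
    parts = urlsplit(connection_string)
    database = parts.path.lstrip("/")
    required = [parts.username, parts.password, parts.hostname, parts.port, database]
    if parts.scheme != "singlestoredb" or any(not value for value in required):
        raise ValueError("Connection string format is incorrect.")
    return {
        # unquote() decodes percent-encoded characters such as %40 -> @
        "user": unquote(parts.username),
        "password": unquote(parts.password),
        "host": parts.hostname,
        "port": parts.port,  # already an int
        "database": database,
    }


# Hypothetical connection string, used for illustration only.
example_info = parse_connection_url(
    "singlestoredb://admin:p%40ss@db.example.com:3306/recommender_db"
)
print(example_info["host"], example_info["port"], example_info["database"])
```

The returned dictionary has the same `user`, `password`, `host`, `port`, and `database` keys that `create_dataset_config` reads.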

In [274]:
"""
Create a Shaped Dataset using the .yaml schema files.
"""
! shaped create-dataset --file $DIR_NAME/events_dataset_schema.yaml

{
  "database": "recommender_db",
  "host": "svc-21789980-1028-4e52-86a9-7d97eb7234ae-dml.aws-virginia-5.svc.singlestore.com",
  "name": "movielens_events",
  "password": "eyJhbGciOiJFUzUxMiIsImtpZCI6IjhhNmVjNWFmLThlNWEtNDQxOS04NmM4LWRkMDkxN2U1YWNlMSIsInR5cCI6IkpXVCJ9.eyJhdHRyIjp7ImNvbXBhbnlOYW1lIjpbIlNpbmdsZVN0b3JlIl0sImNvdW50cnkiOlsiVVMiXSwiZ29hbCI6WyJleHBsb3JlIHRoZSBwcm9kdWN0Il0sImlkcEdvb2dsZSI6WyJ0cnVlIl0sImpvYlJvbGUiOlsiRGF0YSBFbmdpbmVlciJdLCJtb3N0X3JlY2VudF9yZWZlcnJlciI6WyJodHRwczovL3d3dy5nb29nbGUuY29tLyJdLCJvcmlnaW5hbF9yZWZlcnJlZF9wYWdlIjpbImh0dHBzOi8vd3d3LnNpbmdsZXN0b3JlLmNvbS9ibG9nL2Fubm91bmNpbmctdGhlLXNpbmdsZXN0b3JlZGItZHJpdmVyLWZvci10aGUtc3FsdG9vbHMtdnNjb2RlLWV4dGVuc2lvbi8iXSwib3JpZ2luYWxfcmVmZXJyZXIiOlsiaHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8iXSwicHJvZHVjdCI6WyJjbG91ZCJdLCJ0ZXJtc19hbmRfY29uZGl0aW9ucyI6WyIxNjc5MzM2OTcyIl19LCJhdWQiOlsiMzYxYzc1YWUtZmM1My00OGU5LWIxYWMtODM4ZTJlOGNlMTBmIl0sImVtYWlsIjoiYXBlbmdAc2luZ2xlc3RvcmUuY29tIiwiZW1haWxWZXJpZmllZCI6dHJ1ZSwiZXhwIjoxNzI5MTI4ODM4LCJncm9

It takes a moment to provision the infrastructure required for the datasets. You can monitor them using the CLI command:

In [269]:
! shaped list-datasets

datasets: []



In [268]:
! shaped delete-dataset --dataset-name movielens_events 

message: Dataset with name 'movielens_events' was successfully scheduled for deletion.



### Model Creation

We're now ready to create your Shaped model! To keep things simple, we'll use the rating records to build a collaborative filtering model. Shaped uses these ratings to infer which users like which movies, on the assumption that a higher rating means the user is more likely to enjoy the rated movie.
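
To make the idea of collaborative filtering concrete, here is a toy user-based sketch in plain Python. It is not Shaped's algorithm, which this notebook does not describe; it only illustrates the assumption that users with similar rating histories tend to like the same movies. The ratings, user names, and function names are all hypothetical.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie_id: rating on a 1-5 scale}.
ratings = {
    "u1": {"m1": 5, "m2": 4},
    "u2": {"m1": 4, "m2": 5, "m3": 5},
    "u3": {"m1": 1, "m3": 2},
}


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict(user: str, movie: str) -> float:
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted, total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        similarity = cosine(ratings[user], other_ratings)
        weighted += similarity * other_ratings[movie]
        total += similarity
    return weighted / total if total else 0.0


# u1 has not rated m3; the estimate leans toward u2, whose tastes are closer.
print(round(predict("u1", "m3"), 2))
```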


1. Create a .yaml file containing the model schema definition.
2. Use Shaped CLI to create the model!

For further details about creating models please refer to the [Create Model](https://docs.shaped.ai/docs/api#tag/Model/operation/post_create_models_post) API reference.

In [267]:
"""
Create a Shaped Model schema and store in a .yaml file.
"""

import yaml

movielens_ratings_model_schema = {
    "model": {
        "name": "movielens_movie_recommendations"
    },
    "connectors": [
        {
            "type": "Dataset",
            "id": "movielens_events",
            "name": "movielens_events"
        },
        {
            "type": "Dataset",
            "id": "movielens_users",
            "name": "movielens_users"
        },
        {
            "type": "Dataset",
            "id": "movielens_items",
            "name": "movielens_items"
        },
    ],
    "fetch": {
        "events": "SELECT user_id, movie_id AS item_id, timestamp AS created_at, rating AS label FROM movielens_events",
        "users": "SELECT user_id, age, sex, occupation, zip_code FROM movielens_users",
        "items": "SELECT movie_id AS item_id, title, release_date, imdb_url, genre_unknown, Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film_Noir, Horror, Musical, Mystery, Romance, Sci_Fi, Thriller, War, Western FROM movielens_items"
    }
}

with open(f'{dir_path}/movielens_ratings_model_schema.yaml', 'w') as file:
    yaml.dump(movielens_ratings_model_schema, file)

In [None]:
"""
Create a Shaped Model using the .yaml schema file.
"""

! shaped create-model --file $DIR_NAME/movielens_ratings_model_schema.yaml

{
  "connectors": [
    {
      "id": "movielens_events",
      "name": "movielens_events",
      "type": "Dataset"
    },
    {
      "id": "movielens_users",
      "name": "movielens_users",
      "type": "Dataset"
    },
    {
      "id": "movielens_items",
      "name": "movielens_items",
      "type": "Dataset"
    }
  ],
  "fetch": {
    "events": "SELECT user_id, movie_id AS item_id, timestamp AS created_at, rating AS label FROM movielens_events",
    "items": "SELECT movie_id AS item_id, title, release_date, imdb_url, genre_unknown, Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film_Noir, Horror, Musical, Mystery, Romance, Sci_Fi, Thriller, War, Western FROM movielens_items",
    "users": "SELECT user_id, age, sex, occupation, zip_code FROM movielens_users"
  },
  "model": {
    "name": "movielens_movie_recommendations"
  }
}
model_url: https://api.prod.shaped.ai/v1/models/movielens_movie_recommendations



Your recommendation model can take up to a few hours to provision its infrastructure and train on your historic events. The time depends mostly on the size of your dataset, that is, the number of users, items, and interactions, and the number of attributes you provide.

While the model is being set up, you can view its status with either the [List Models](https://docs.shaped.ai/docs/api#tag/Model/operation/get_models_models_get) or the [View Model](https://docs.shaped.ai/docs/api) endpoint. For example, with the CLI:

In [199]:
! shaped list-models

models: []



The initial model creation goes through the following stages in order:

1. `SCHEDULING`
2. `FETCHING`
3. `TRAINING`
4. `DEPLOYING`
5. `ACTIVE`

You can periodically poll Shaped to inspect these status changes. Once the model is in the `ACTIVE` state, you can move on to the next step and use it to make rank requests.
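
The polling described above could be sketched as a small loop. How you obtain the status is left open here: `get_status` is a hypothetical callable that you would implement yourself, for example by calling the View Model endpoint, and the interval and poll limit are arbitrary defaults. Treat this as an illustration under those assumptions rather than an official client.

```python
import time
from typing import Callable

# Stages listed above, in the order the model passes through them.
STAGES = ["SCHEDULING", "FETCHING", "TRAINING", "DEPLOYING", "ACTIVE"]


def wait_until_active(
    get_status: Callable[[], str],
    interval_s: float = 60.0,
    max_polls: int = 240,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_status` until it reports ACTIVE or the poll budget runs out."""
    for _ in range(max_polls):
        status = get_status()
        if status not in STAGES:
            raise RuntimeError(f"Unexpected model status: {status}")
        if status == "ACTIVE":
            return status
        sleep(interval_s)
    raise TimeoutError("Model did not become ACTIVE in time.")


# Demo with a fake status source that steps through every stage once.
fake_statuses = iter(STAGES)
result = wait_until_active(
    lambda: next(fake_statuses), interval_s=0, sleep=lambda _: None
)
print(result)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting.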

### Rank!

You're now ready to fetch your movie recommendations! You can do this with the [Rank endpoint](https://docs.shaped.ai/docs/api#tag/Rank/operation/post_rank_models__model_id__rank_post). Just provide the `user_id` you want recommendations for and the number of recommendations to return.

Shaped's CLI provides a convenience rank command to quickly retrieve results from the command line. You can use it as follows:

In [None]:
! shaped rank --model-name movielens_movie_recommendations --user-id 1 --limit 5

ids:
- '483'
- '318'
- '603'
- '427'
- '313'
scores:
- 0.8177296
- 0.81196507
- 0.7755279
- 0.77541394
- 0.75258675



The response contains two parallel arrays: the ids of the movies that Shaped estimates are most interesting to the given user, and their ranking scores.
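
Because the two arrays are aligned by position, they can be paired up with `zip`. The sketch below uses the values from the output above and assumes the response has already been parsed into a Python dictionary; the variable names are illustrative.

```python
# Rank response from the CLI output above, as a parsed dictionary.
rank_response = {
    "ids": ["483", "318", "603", "427", "313"],
    "scores": [0.8177296, 0.81196507, 0.7755279, 0.77541394, 0.75258675],
}

# Guard against a malformed response before pairing the arrays.
if len(rank_response["ids"]) != len(rank_response["scores"]):
    raise ValueError("ids and scores must have the same length")

# Pair each movie id with its score; position i in one array matches i in the other.
ranked = list(zip(rank_response["ids"], rank_response["scores"]))

for position, (movie_id, score) in enumerate(ranked, start=1):
    print(f"{position}. movie {movie_id}: {score:.3f}")
```

To show titles instead of ids, you could look each id up in the `movie_id` column of `movies_df`, remembering that the ids in the response are strings.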

If you want to integrate this endpoint into your website or application you can use the Rank POST REST endpoint directly with the following request:

In [None]:
! curl https://api.prod.shaped.ai/v1/models/movielens_movie_recommendations/rank \
  -H "x-api-key: <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{ "user_id": "1", "limit": 5 }'
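
From Python, the same request can be assembled with the standard library. The sketch below mirrors the URL, headers, and JSON body of the `curl` command above and only builds the request object; the helper name `build_rank_request` is hypothetical, and the response shape is assumed to match the CLI output shown earlier.

```python
import json
import urllib.request

RANK_URL = "https://api.prod.shaped.ai/v1/models/movielens_movie_recommendations/rank"


def build_rank_request(api_key: str, user_id: str, limit: int = 5) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl command."""
    body = json.dumps({"user_id": user_id, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        RANK_URL,
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


rank_request = build_rank_request("<API_KEY>", user_id="1", limit=5)
print(rank_request.get_method(), rank_request.full_url)

# To actually send it, replace <API_KEY> with a real key and run:
# with urllib.request.urlopen(rank_request) as response:
#     result = json.load(response)
```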

Wow! That's all it takes to get the top 5 recommended movies for the given `user_id` 🍾. Now let's add ranking to your product :)

### Clean Up

Don't forget to delete your model (and its assets) and the datasets once you're finished with them. You can do so with the following CLI commands:

In [None]:
! shaped delete-model --model-name movielens_movie_recommendations

! shaped delete-dataset --dataset-name movielens_events
! shaped delete-dataset --dataset-name movielens_users
! shaped delete-dataset --dataset-name movielens_items

! rm -r notebook_assets

message: Model with name 'movielens_movie_recommendations' is deleting...

message: Dataset with name 'movielens_events' was successfully deleted

message: Dataset with name 'movielens_users' was successfully deleted

message: Dataset with name 'movielens_items' was successfully deleted

