# Generate Simple Dataset

This notebook generates a simple dataset to demonstrate how to use [`deepscatter`][deepscatter] (from [NOMIC.ai][NOMIC.ai]) with [`svelte`][svelte] and [`sveltekit`][sveltekit].

Credit goes to Benjamin Schmidt ([@bmschmidt][Benjamin Schmidt]) of [NOMIC.ai][NOMIC.ai] for his assistance.


**DISCLAIMER** from [deepscatter][deepscatter]'s GitHub page under [API][api section]:

> This is still subject to change and is not fully documented. The encoding portion of the API mimics Vega-Lite with some minor distinctions to avoid deeply-nested queries and to add animation and jitter parameters.



[api section]: https://github.com/nomic-ai/deepscatter#api
[svelte]: https://svelte.dev
[sveltekit]: https://kit.svelte.dev
[NOMIC.ai]: https://home.nomic.ai
[deepscatter]: https://github.com/nomic-ai/deepscatter
[Benjamin Schmidt]: https://gist.github.com/bmschmidt
[add_sidecars.py]: https://gist.github.com/bmschmidt/03947d36664ec07c63d7b72a5c8adbf8

## Slack Message

OK so if this is slow we can make it a gazillion times faster by doing this the right way, not joining on string keys, which is to use this [program][add_sidecars.py].

The workflow is:

1. Run quadfeather to create tiles the way you currently are into a folder at ~/data/whatever/my_tiles

2. Create a single file that contains all the data you want to add, but none of the data that’s already there except for your unique id field (`barcode`). That file needs to be somewhat strictly formatted right now (this is an unreleased feature): `barcode` must have the same name and data type as in your primary file.

3. The file must be a feather file, not parquet. (`from pyarrow import feather, parquet; feather.write_feather(parquet.read_table("fin.parquet"), "fout.feather")`)

4. All columns that you want to show up in the data should ideally be float32() type, although doubles might not be the end of the world.

Save the program above to [add_sidecars.py][add_sidecars.py], and run 

```shell
python3 add_sidecars.py --tileset ~/data/whatever/my_tiles --sidecar fout.feather --key barcode
```
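For scripted use, the same call can be assembled in Python instead of being typed by hand. The sketch below only builds and prints the command. `build_sidecar_command` is a hypothetical helper name, and it uses only the three flags shown in the shell command above.

```python
import os
import shlex


def build_sidecar_command(tileset: str, sidecar: str, key: str) -> list[str]:
    """Build the argument list for add_sidecars.py (hypothetical helper).

    Only the --tileset, --sidecar, and --key flags from the message above are used.
    """
    return [
        "python3", "add_sidecars.py",
        "--tileset", os.path.expanduser(tileset),  # expand "~" ourselves, no shell involved
        "--sidecar", sidecar,
        "--key", key,
    ]


cmd = build_sidecar_command("~/data/whatever/my_tiles", "fout.feather", "barcode")
# shlex.join quotes each argument so the printed line is safe to paste into a shell
print(shlex.join(cmd))
```

Once `add_sidecars.py` has been saved next to the notebook, the list could be passed to `subprocess.run(cmd, check=True)`.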


## Imports

In [1]:
# standard lib
import os, pwd, sys, json, yaml, atexit, tempfile, inspect

# for data-science
import pandas as pd, numpy as np, quadfeather
from pyarrow import feather

# for plotting
import matplotlib as mpl, matplotlib.pyplot as plt, seaborn as sns

## Setup

In [2]:
SEED = 3
np.random.seed(SEED)

# NOTE: this is much smaller than what deepscatter can actually handle
N_POINTS = 20000

# NOTE: this is much smaller than the default tile size of 50,000
TILE_SIZE = 1000

# full path to the directory this notebook runs in (not to the notebook file itself)
FILE = os.path.abspath('')

# the sveltekit project you might be working on / want to deploy
SVELTEKIT_DIR = os.path.join(FILE, '..')

# the static assets directory of the sveltekit project where files are hosted
STATIC_ASSETS_DIR = os.path.join(SVELTEKIT_DIR, 'static')

# we are assuming that you might have multiple datasets you want to host / switch between
DATASETS_DIR = os.path.join(STATIC_ASSETS_DIR, 'datasets')

# this is where we are going to store our dataset
DATASET_NAME = 'demo'
DEMO_DATASET_DIR = os.path.join(DATASETS_DIR, DATASET_NAME)

# NOTE: this is the unique ID that will be used to map additional columns to the dataset
LABEL_NAME = 'label'

In [3]:
# you can switch TARGET_DIR with whatever dataset you want to work with
TARGET_DIR = DEMO_DATASET_DIR

if not os.path.isdir(TARGET_DIR):
    os.makedirs(TARGET_DIR)    

# NOTE: you can use a temp directory, but this is so you can view the files and confirm they are deleted
TMP_DIR = os.path.expanduser('~/Downloads')

## Utils

In [4]:
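home = os.path.expanduser('~')

def collapse_user(path: str) -> str:
    """Replace a leading home directory in `path` with '~'."""
    # only touch paths that really start with the home directory
    if path == home or path.startswith(home + os.sep):
        return '~' + path[len(home):]
    return path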

In [5]:
def make_temp_file(**kwargs) -> tempfile.NamedTemporaryFile:
    temp = tempfile.NamedTemporaryFile(**kwargs)
    @atexit.register
    def delete_temp() -> None:
        temp.close()
    return temp

In [25]:
# where we will store points (a parquet file, despite the `csv_` prefix)
csv_points = make_temp_file(suffix='.parquet', dir=TMP_DIR)

# where we will store additional information (also parquet)
csv_sidecar = make_temp_file(suffix='.parquet', dir=TMP_DIR)

# where we will store additional information as feather file
feather_sidecar = make_temp_file(suffix='.feather', dir=TMP_DIR)

In [26]:
csv_labels = os.path.join(TARGET_DIR, 'labels.csv')

## Fake Data

In [27]:
labels = pd.Series(np.arange(N_POINTS), name=LABEL_NAME).map(lambda x: f'Label {x}')

In [28]:
df_points = pd.DataFrame(
    np.random.randn(N_POINTS, 3),
    index=labels, columns=['x', 'y', 'z']
)
df_points.head()

| label | x | y | z |
| --- | --- | --- | --- |
| Label 0 | 0.664128 | -0.006745 | 1.824689 |
| Label 1 | 0.678044 | 0.025185 | -1.036497 |
| Label 2 | -0.567204 | -0.90924 | 0.378961 |
| Label 3 | -0.044125 | 0.191463 | 0.072447 |
| Label 4 | -1.03112 | -0.586381 | -1.524462 |


In [52]:
import random

n = 100  # number of extra `feature_{i}` columns
df_sidecar = pd.DataFrame(
    # the number of data columns must match the number of column names
    np.random.randn(N_POINTS, 4 + n), index=labels,
    columns=['category', 'continuous', 'feature', 'target', *[f"feature_{i}" for i in range(n)]]
)
categories = ['red', 'green', 'blue']
# overwrite the random floats in `category` with a random label per row
df_sidecar.category = df_sidecar.category.map(lambda x: random.choice(categories))
# Bools are not supported.
# df_sidecar.boolean = df_sidecar.boolean.map(lambda x: x >= 0)
df_sidecar = df_sidecar.astype({'category': 'category'})
df_sidecar.head()

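ValueError: Shape of passed values is (20000, 4), indices imply (20000, 104)

NOTE: this error reports a data array with 4 columns being given 104 column names. The cell as written passes `4 + n` data columns, so the shapes match and the error appears to be stale output from an earlier run.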

In [37]:
# labels.to_csv(csv_labels, index=False)
df_points.to_parquet(csv_points.name)
df_sidecar.to_parquet(csv_sidecar.name)

In [38]:
csv_points

<tempfile._TemporaryFileWrapper at 0x17f97d1c0>

## Workflow

### 1) create tiles

In [39]:
!quadfeather --files {csv_points.name} --tile_size {TILE_SIZE} --destination {os.path.join(TARGET_DIR, 'tiles')}

In [40]:
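import shutil

# NOTE: _deepscatter_tmp doesn't automatically get cleaned up
leftover = os.path.join(TMP_DIR, '_deepscatter_tmp')
if os.path.isdir(leftover):
    shutil.rmtree(leftover)  # os.remove cannot delete a directory
elif os.path.exists(leftover):
    os.remove(leftover)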
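After tiling, it can be useful to confirm what was written. The sketch below walks a tiles folder and collects every file laid out as `a/b/c.feather`, the layout visible in the `add_sidecars.py` output later in this notebook. `list_tiles` is a hypothetical helper. Reading the first path component as the zoom level is an assumption, not something this notebook confirms. The demo runs on empty placeholder files rather than real tiles.

```python
import os
import tempfile


def list_tiles(tiles_dir: str) -> list[tuple[int, int, int]]:
    """Return the sorted (a, b, c) index triples of every a/b/c.feather file."""
    found = []
    for root, _dirs, files in os.walk(tiles_dir):
        # folders between tiles_dir and the file, e.g. ["1", "0"]
        rel = os.path.relpath(root, tiles_dir).split(os.sep)
        for name in files:
            stem, ext = os.path.splitext(name)
            # keep only .feather files that sit exactly two numeric folders deep
            if ext == ".feather" and len(rel) == 2 and all(p.isdigit() for p in [*rel, stem]):
                found.append((int(rel[0]), int(rel[1]), int(stem)))
    return sorted(found)


# Demo on a throwaway folder with empty placeholder files (not real tiles)
with tempfile.TemporaryDirectory() as demo_dir:
    for a, b, c in [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0)]:
        os.makedirs(os.path.join(demo_dir, str(a), str(b)), exist_ok=True)
        open(os.path.join(demo_dir, str(a), str(b), f"{c}.feather"), "w").close()
    tiles = list_tiles(demo_dir)

print(len(tiles), "tiles; first path components:", sorted({t[0] for t in tiles}))
```

In this notebook the real folder would be `os.path.join(TARGET_DIR, 'tiles')`.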

### 2) make single file

Create a single file that contains all the data you want to add, but none of the data that’s already there except for your unique id field (`label` in this case). 

NOTE: `label` must be the same name and data type as in your primary file.
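These requirements can be checked before anything is written to disk. The following is a minimal sketch, assuming both tables are pandas DataFrames. `check_sidecar` is a hypothetical helper and not part of deepscatter, quadfeather, or `add_sidecars.py`. It only tests the conditions stated above: the key is present in both tables with the same dtype, no other columns are shared, and the key values are unique.

```python
import pandas as pd


def check_sidecar(primary: pd.DataFrame, sidecar: pd.DataFrame, key: str) -> list[str]:
    """Return a list of problems; an empty list means the sidecar looks consistent."""
    # this notebook keeps the key in the index, so move it into a column if needed
    primary = primary.reset_index() if primary.index.name == key else primary
    sidecar = sidecar.reset_index() if sidecar.index.name == key else sidecar

    if key not in primary.columns or key not in sidecar.columns:
        return [f"key column {key!r} is missing"]

    problems = []
    if primary[key].dtype != sidecar[key].dtype:
        problems.append(f"key dtype differs: {primary[key].dtype} vs {sidecar[key].dtype}")
    overlap = (set(primary.columns) & set(sidecar.columns)) - {key}
    if overlap:
        problems.append(f"columns already in the primary file: {sorted(overlap)}")
    if not sidecar[key].is_unique:
        problems.append("key values are not unique")
    return problems


primary = pd.DataFrame({"label": ["Label 0", "Label 1"], "x": [0.1, 0.2], "y": [0.3, 0.4]})
good = pd.DataFrame({"label": ["Label 0", "Label 1"], "target": [1.0, 2.0]})
bad = pd.DataFrame({"label": [0, 1], "x": [9.0, 9.0]})  # wrong key dtype, repeats "x"

print(check_sidecar(primary, good, "label"))
print(check_sidecar(primary, bad, "label"))
```

With the frames from this notebook the call would be `check_sidecar(df_points, df_sidecar, LABEL_NAME)`.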

In [41]:
df_all = pd.concat([df_points, df_sidecar], axis=1)

In [42]:
df_all.head()

| label | x | y | z | category | continuous | feature | target |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Label 0 | 0.664128 | -0.006745 | 1.824689 | green | -1.982126 | 0.353178 | -0.557645 |
| Label 1 | 0.678044 | 0.025185 | -1.036497 | blue | 0.67325 | 0.3108 | -0.112366 |
| Label 2 | -0.567204 | -0.90924 | 0.378961 | blue | -0.069939 | -1.015194 | -0.736243 |
| Label 3 | -0.044125 | 0.191463 | 0.072447 | red | -0.421524 | 0.17593 | 0.922671 |
| Label 4 | -1.03112 | -0.586381 | -1.524462 | red | -0.652183 | -1.149531 | 0.112916 |


In [43]:
# NOTE: this is the same as df_sidecar
df_all = df_all.drop(columns=df_points.columns)
df_all.head()

| label | category | continuous | feature | target |
| --- | --- | --- | --- | --- |
| Label 0 | green | -1.982126 | 0.353178 | -0.557645 |
| Label 1 | blue | 0.67325 | 0.3108 | -0.112366 |
| Label 2 | blue | -0.069939 | -1.015194 | -0.736243 |
| Label 3 | red | -0.421524 | 0.17593 | 0.922671 |
| Label 4 | red | -0.652183 | -1.149531 | 0.112916 |


NOTE: **All** columns that you want to show up in the data should ideally be `float32()` type, although doubles might not be the end of the world.

In [46]:
for col in df_all:
    if df_all[col].dtype == 'float64':
        df_all[col] = df_all[col].astype('float32')
df_all.head()

| label | category | continuous | feature | target |
| --- | --- | --- | --- | --- |
| Label 0 | green | -1.982126 | 0.353178 | -0.557645 |
| Label 1 | blue | 0.67325 | 0.3108 | -0.112366 |
| Label 2 | blue | -0.069939 | -1.015194 | -0.736243 |
| Label 3 | red | -0.421524 | 0.17593 | 0.922671 |
| Label 4 | red | -0.652183 | -1.149531 | 0.112916 |
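The note above does not say why `float32` is preferred, so the following is only a general illustration of what the cast changes, using NumPy on synthetic values like the ones in this notebook. It compares the storage size of the two dtypes and measures the rounding error the cast introduces.

```python
import numpy as np

rng = np.random.default_rng(3)
doubles = rng.standard_normal(20_000)   # float64 samples from a standard normal
singles = doubles.astype(np.float32)    # the same values stored as float32

# float32 uses 4 bytes per value instead of 8
print(doubles.nbytes, singles.nbytes)

# largest absolute difference between the original and the downcast values
max_error = np.max(np.abs(doubles - singles.astype(np.float64)))
print(max_error)
```

For values of this magnitude the rounding error is far below anything visible in a scatter plot, which is consistent with the remark that doubles are acceptable but not required.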


The file must be a [feather file][feather file], not parquet. 

```python
from pyarrow import feather, parquet

# if converting from parquet
feather.write_feather(parquet.read_table('fin.parquet'), 'fout.feather')

# if converting pandas
feather.write_feather(df, 'fout.feather')
```

[feather file]: https://arrow.apache.org/docs/python/feather.html

In [47]:
feather.write_feather(df_all, feather_sidecar.name)

### 3) run `add_sidecars.py`

In [48]:
!python3 add_sidecars.py --tileset {os.path.join(TARGET_DIR, 'tiles')}\
                         --sidecar {feather_sidecar.name} --key {LABEL_NAME};
!clear                         

/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/0/0/0.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/1/0/1.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/1/0/0.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/1/1/1.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/1/1/0.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/9/6.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/9/7.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/9/8.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/9/9.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/8/6.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/8/7.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/8/8.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/4/8/9.feather
/Users/ben/featherplot/nbs/../static/datasets/demo/tiles/3/4/5.feather
/Users

## Metadata

In [49]:
meta = dict(
    seed=SEED, n_points=N_POINTS, tile_size=TILE_SIZE, 
    dataset_name=DATASET_NAME, label_name=LABEL_NAME,
    
    # NOTE: since all these directories are relative to the static assets directory
    #       we can use the relative path to the static assets directory instead of the wrangling
    #       we did above.
    target_dir=TARGET_DIR.replace(STATIC_ASSETS_DIR, ''), 
    tiles_dir=os.path.join(TARGET_DIR, 'tiles').replace(STATIC_ASSETS_DIR, ''),

    embedding_columns=df_points.columns.values.tolist(),
    sidecar_columns=df_sidecar.columns.values.tolist(),
)

In [50]:
with open(os.path.join(TARGET_DIR, 'meta.yml'), 'w') as f:
    f.write(yaml.dump(meta))
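Stripping the static assets prefix with `str.replace` works here because both paths were built from the same string. An alternative sketch, using hypothetical POSIX paths, is `os.path.relpath`, which normalises both paths before comparing them and so still works when one of them has been cleaned up (for example, when the `..` segment has been resolved).

```python
import os

# hypothetical paths that mirror the layout used in this notebook
static_dir = os.path.join("/srv/app/nbs", "..", "static")
target_dir = os.path.join(static_dir, "datasets", "demo")

# str.replace strips the prefix only when both strings spell it identically
via_replace = target_dir.replace(static_dir, "")

# relpath normalises both paths first, so the ".." segment does not matter
via_relpath = "/" + os.path.relpath(target_dir, static_dir)
print(via_replace, via_relpath)

# once the prefix has been normalised, replace silently leaves the path unchanged
normalized_static = os.path.normpath(static_dir)
unchanged = target_dir.replace(normalized_static, "")
still_relative = "/" + os.path.relpath(target_dir, normalized_static)
print(unchanged, still_relative)
```

Either form yields the site-relative path that the metadata stores; `relpath` is simply less sensitive to how the directories were spelled.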

## Cleanup

NOTE: these files will automatically be deleted when the kernel stops, but we delete them here for good practice

In [51]:
csv_points.close()
csv_sidecar.close()
feather_sidecar.close()
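The behaviour this cleanup relies on can be checked in isolation with the standard library alone. A file created by `tempfile.NamedTemporaryFile` with its default settings is removed from disk when it is closed, and closing it again is harmless, which is why the `atexit` hook registered in `make_temp_file` can still call `close()` afterwards.

```python
import os
import tempfile

# delete=True is the default: the file is removed as soon as it is closed
temp = tempfile.NamedTemporaryFile(suffix=".feather")
path = temp.name
exists_while_open = os.path.exists(path)

temp.close()
exists_after_close = os.path.exists(path)

# a second close() does nothing, so a later atexit hook will not fail
temp.close()

print(exists_while_open, exists_after_close)
```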