# Generate Simple Dataset

A small generated dataset for highlighting how to use [`deepscatter`][deepscatter] (from [NOMIC.ai][NOMIC.ai]) with [`svelte`][svelte] and [`sveltekit`][sveltekit].

Credit goes to Benjamin Schmidt ([@bmschmidt][Benjamin Schmidt]) of [NOMIC.ai][NOMIC.ai] for his assistance.


**DISCLAIMER** from [deepscatter][deepscatter]'s GitHub page under [API][api section]:

> This is still subject to change and is not fully documented. The encoding portion of the API mimics Vega-Lite with some minor distinctions to avoid deeply-nested queries and to add animation and jitter parameters.



[api section]: https://github.com/nomic-ai/deepscatter#api
[svelte]: https://svelte.dev
[sveltekit]: https://kit.svelte.dev
[NOMIC.ai]: https://home.nomic.ai
[deepscatter]: https://github.com/nomic-ai/deepscatter
[Benjamin Schmidt]: https://gist.github.com/bmschmidt
[add_sidecars.py]: https://gist.github.com/bmschmidt/03947d36664ec07c63d7b72a5c8adbf8

## Slack Message

OK so if this is slow we can make it a gazillion times faster by doing this the right way, not joining on string keys, which is to use this [program][add_sidecars.py].

The workflow is:

1. Run quadfeather to create tiles the way you currently are into a folder at ~/data/whatever/my_tiles

2. Create a single file that contains all the data you want to add, but none of the data that’s already there except for your unique id field. (barcode). That file needs to be somewhat strictly formatted, right now. (This is an unreleased feature). barcode must be the same name and data type as in your primary file.

3. The file must be a feather file, not parquet. (`from pyarrow import feather, parquet; feather.write_feather(parquet.read_table("fin.parquet"), "fout.feather")`)

4. All columns that you want to show up in the data should ideally be float32() type, although doubles might not be the end of the world.

Save the program above to [add_sidecars.py][add_sidecars.py], and run 

```shell
python3 add_sidecars.py --tileset ~/data/whatever/my_tiles --sidecar fout.feather --key barcode
```
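
As a rough illustration of step 2, here is a minimal pandas sketch of building a sidecar table that keeps only the key plus the columns not already in the tiles. The helper `build_sidecar` and the example column names (`x`, `y`, `score`) are hypothetical and not part of `add_sidecars.py`; only `barcode` comes from the message above.

```python
import pandas as pd

def build_sidecar(df_full: pd.DataFrame, key: str, existing: list) -> pd.DataFrame:
    """Keep the key plus every column that is not already in the tiles."""
    # never drop the key itself, even if it is listed among the existing columns
    drop = [c for c in existing if c != key and c in df_full.columns]
    return df_full.drop(columns=drop)

# hypothetical full table: x and y are already tiled, score is new
df_full = pd.DataFrame({
    'barcode': ['a', 'b', 'c'],
    'x': [0.1, 0.2, 0.3],
    'y': [1.0, 2.0, 3.0],
    'score': [10.0, 20.0, 30.0],
})

sidecar = build_sidecar(df_full, key='barcode', existing=['barcode', 'x', 'y'])
print(list(sidecar.columns))
```

The resulting frame would then still need the float32 cast and the feather conversion described in steps 3 and 4.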

[add_sidecars.py]: https://gist.github.com/bmschmidt/03947d36664ec07c63d7b72a5c8adbf8

## Imports

In [None]:
# standard lib
import os, pwd, sys, json, yaml, atexit, tempfile, inspect

# for data-science
import pandas as pd, numpy as np, quadfeather
from pyarrow import feather

# for plotting
import matplotlib as mpl, matplotlib.pyplot as plt, seaborn as sns

## Setup

In [None]:
SEED = 3
np.random.seed(SEED)

# NOTE: this is much smaller than what deepscatter can actually handle
N_POINTS = 1000

# NOTE: this is much smaller than the default tile size of 50,000
TILE_SIZE = 100

# full path to the directory containing this notebook (the kernel's working directory)
FILE = os.path.abspath('')

# the sveltekit project you might be working on / want to deploy
SVELTEKIT_DIR = os.path.join(FILE, '..')

# the static assets directory of the sveltekit project where files are hosted
STATIC_ASSETS_DIR = os.path.join(SVELTEKIT_DIR, 'static')

# we are assuming that you might have multiple datasets you want to host / switch between
DATASETS_DIR = os.path.join(STATIC_ASSETS_DIR, 'datasets')

# this is where we are going to store our dataset
DATASET_NAME = 'mini'
DEMO_DATASET_DIR = os.path.join(DATASETS_DIR, DATASET_NAME)

# NOTE: this is the unique ID that will be used to map additional columns to the dataset
LABEL_NAME = 'label'

In [None]:
# you can switch TARGET_DIR with whatever dataset you want to work with
TARGET_DIR = DEMO_DATASET_DIR

if not os.path.isdir(TARGET_DIR):
    os.makedirs(TARGET_DIR)    

# NOTE: you can use a temp directory, but this is so you can view the files and confirm they are deleted
TMP_DIR = os.path.expanduser('~/Downloads')

## Utils

In [None]:
usr = pwd.getpwuid(os.getuid())[0]

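def collapse_user(path: str) -> str:
    # replace the home-directory prefix with '~'; leave other paths unchanged
    home = os.path.expanduser('~')
    return '~' + path[len(home):] if path.startswith(home) else path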

In [None]:
def make_temp_file(**kwargs) -> tempfile.NamedTemporaryFile:
    temp = tempfile.NamedTemporaryFile(**kwargs)
    @atexit.register
    def delete_temp() -> None:
        temp.close()
    return temp

In [None]:
# where we will store points
csv_points = make_temp_file(suffix='.csv', dir=os.path.expanduser(TMP_DIR))

# where we will store additional information
csv_sidecar = make_temp_file(suffix='.csv', dir=os.path.expanduser(TMP_DIR))

# where we will store additional information as feather file
feather_sidecar = make_temp_file(suffix='.feather', dir=os.path.expanduser(TMP_DIR))

# where we store raw labels
csv_labels = os.path.join(TARGET_DIR, 'labels.csv')

## Fake Data

In [None]:
labels = pd.Series(np.arange(N_POINTS), name=LABEL_NAME).map(lambda x: f'Label {x}')

In [None]:
df_points = pd.DataFrame(
    np.random.randn(N_POINTS, 3),
    index=labels, columns=['x', 'y', 'z']
)
df_points.head()

label,x,y,z
Label 0,1.788628,0.43651,0.096497
Label 1,-1.863493,-0.277388,-0.354759
Label 2,-0.082741,-0.627001,-0.043818
Label 3,-0.477218,-1.313865,0.884622
Label 4,0.881318,1.709573,0.050034


In [None]:
df_sidecar = pd.DataFrame(
    np.random.randn(N_POINTS, 5), index=labels, 
    columns=['category', 'boolean', 'continuous', 'feature', 'target'],
    # dtype=['category', 'bool', 'float64', 'float64', 'float64']
)
df_sidecar.category = df_sidecar.category.map(lambda x: np.abs(int(x * 3)))
df_sidecar.boolean = df_sidecar.boolean.map(lambda x: x >= 0)
df_sidecar = df_sidecar.astype({'category': 'category', 'boolean': 'bool'})
df_sidecar.head()

label,category,boolean,continuous,feature,target
Label 0,3,False,-1.599727,-1.085874,-1.157137
Label 1,5,False,0.633804,-0.754475,0.043049
Label 2,3,False,-0.059582,0.906537,0.295364
Label 3,0,False,-0.98225,0.618337,-0.309664
Label 4,2,True,0.341083,0.601993,0.159713


In [None]:
labels.to_csv(csv_labels, index=False)
df_points.to_csv(csv_points)
df_sidecar.to_csv(csv_sidecar)

## Workflow

### 1) create tiles

In [None]:
!quadfeather --files {csv_points.name} --tile_size {TILE_SIZE} --destination {os.path.join(TARGET_DIR, 'tiles')}

In [None]:
!rm -rf {os.path.join(TMP_DIR, '_deepscatter_tmp')}

### 2) make single file

Create a single file that contains all the data you want to add, but none of the data that’s already there except for your unique id field (`label` in this case). 

NOTE: `label` must be the same name and data type as in your primary file.
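
A quick sanity check along these lines can catch a mismatched key before running the sidecar script. This is only a sketch: `check_key` is a hypothetical helper, it assumes the key is a regular column (call `reset_index()` first if it is the index, as in this notebook), and the uniqueness and coverage checks are assumptions about what a join on the key needs rather than documented requirements of `add_sidecars.py`.

```python
import pandas as pd

def check_key(primary: pd.DataFrame, sidecar: pd.DataFrame, key: str) -> list:
    """Return a list of problems; an empty list means the key looks compatible."""
    problems = []
    for name, df in (('primary', primary), ('sidecar', sidecar)):
        if key not in df.columns:
            problems.append(f'{name} has no column named {key!r}')
    if problems:
        return problems
    # same name is established above; now compare the data types
    if primary[key].dtype != sidecar[key].dtype:
        problems.append(f'dtype mismatch: {primary[key].dtype} vs {sidecar[key].dtype}')
    if not sidecar[key].is_unique:
        problems.append('sidecar key is not unique')
    missing = set(primary[key]) - set(sidecar[key])
    if missing:
        problems.append(f'{len(missing)} primary keys missing from sidecar')
    return problems

primary = pd.DataFrame({'label': ['Label 0', 'Label 1'], 'x': [0.0, 1.0]})
good = pd.DataFrame({'label': ['Label 0', 'Label 1'], 'target': [0.5, 0.7]})
bad = pd.DataFrame({'label': [0, 1], 'target': [0.5, 0.7]})  # integer key by mistake

print(check_key(primary, good, 'label'))
print(check_key(primary, bad, 'label'))
```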

**UPDATE:**

> as we combine data we are going to extract important metadata

In [None]:
def extract_column_metadata(
    df:pd.DataFrame,
    is_sidecar:bool=False,
    do_rename:bool=True, copy:bool=False,
) -> (pd.DataFrame, dict):
    df_cur = df.copy() if copy else df

    meta = {}

    # NOTE: strictly required
    _required_columns = 'x y'.split()
    # NOTE: assumed to be present
    _assumed_columns = _required_columns + ['z']
    
    # NOTE: first we check if the required columns are present
    # (a list comprehension keeps the x-before-y order; a set difference would not)
    _missing_cols = [c for c in _required_columns if c not in df_cur.columns]
    _to_rename = dict()

    # NOTE: if they are not present, we then rename the first column
    # to the next missing required column. This may not be the desired effect.
    if do_rename and not is_sidecar:
        for i, cname in enumerate(df_cur.columns):
            if cname not in _assumed_columns and len(_missing_cols) > 0:
                new_col_name = _missing_cols.pop(0)
                _to_rename[cname] = dict(name=new_col_name, text=cname, index=i)


    for i, cname in enumerate(df_cur.columns):        
        col = df_cur[cname]
        dtype = col.dtype.name
        if dtype == 'category':
            col = col.cat.as_ordered()
            _min, _max = int(col.cat.codes.min()), int(col.cat.codes.max())
        elif dtype == 'bool':
            _min, _max = 0, 1
        else:
            _min, _max = float(col.min()), float(col.max())
        
        text = str(cname)
        if do_rename and not is_sidecar:
            if cname in _to_rename:
                text = _to_rename[cname]['text']
                new_col_name = _to_rename[cname]['name']
                df_cur = df_cur.rename(columns={cname: new_col_name})
                cname = new_col_name

        # NOTE: `text` keeps the original column name when the column was renamed
        cmeta = dict(
            name=str(cname), text=text, type=str(dtype),
            min=_min, max=_max, domain=[_min, _max],
            is_sidecar=is_sidecar,
        )

        meta[cname] = cmeta
    return df_cur, meta

In [None]:
df_p, meta_p = extract_column_metadata(df_points,  do_rename=True, is_sidecar=False)
df_s, meta_s = extract_column_metadata(df_sidecar, do_rename=False, is_sidecar=True)

column_meta = {**meta_p, **meta_s}

In [None]:
df_all = pd.concat([df_p, df_s], axis=1)

In [None]:
df_all.head()

label,x,y,z,category,boolean,continuous,feature,target
Label 0,1.788628,0.43651,0.096497,3,False,-1.599727,-1.085874,-1.157137
Label 1,-1.863493,-0.277388,-0.354759,5,False,0.633804,-0.754475,0.043049
Label 2,-0.082741,-0.627001,-0.043818,3,False,-0.059582,0.906537,0.295364
Label 3,-0.477218,-1.313865,0.884622,0,False,-0.98225,0.618337,-0.309664
Label 4,0.881318,1.709573,0.050034,2,True,0.341083,0.601993,0.159713


In [None]:
# NOTE: this is the same as df_sidecar
df_all = df_all.drop(columns=df_p.columns)
df_all.head()

label,category,boolean,continuous,feature,target
Label 0,3,False,-1.599727,-1.085874,-1.157137
Label 1,5,False,0.633804,-0.754475,0.043049
Label 2,3,False,-0.059582,0.906537,0.295364
Label 3,0,False,-0.98225,0.618337,-0.309664
Label 4,2,True,0.341083,0.601993,0.159713


NOTE: **All** columns that you want to show up in the data should ideally be `float32()` type, although doubles might not be the end of the world.
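
To make the trade-off concrete, here is a small sketch of a float32 downcast and the rounding error it can introduce. The helper `to_float32` is hypothetical. Unlike a plain `astype('float32')`, it encodes categoricals by their integer codes, which is an assumption made so that non-numeric categories can be cast as well.

```python
import pandas as pd

def to_float32(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every column to float32, using integer codes for categoricals."""
    out = {}
    for name in df.columns:
        col = df[name]
        if col.dtype.name == 'category':
            col = col.cat.codes  # the codes, not the category values themselves
        out[name] = col.astype('float32')
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({
    'category': pd.Categorical(['b', 'a', 'b']),
    'boolean': [True, False, True],
    # 16777217 is 2**24 + 1, which float32 cannot represent exactly
    'continuous': [0.1, 16777217.0, -2.5],
})

df32 = to_float32(df)
print(df32.dtypes)

# largest absolute rounding error introduced by the downcast
err = (df32['continuous'].astype('float64') - df['continuous']).abs().max()
print(err)
```

For values in the range of typical embeddings the error is negligible, but large integers stored as floats (IDs, timestamps) can lose precision, which is one reason to keep doubles for those columns.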

In [None]:
df_all = df_all.astype('float32')
df_all.head()

label,category,boolean,continuous,feature,target
Label 0,3.0,0.0,-1.599727,-1.085874,-1.157137
Label 1,5.0,0.0,0.633804,-0.754475,0.043049
Label 2,3.0,0.0,-0.059582,0.906537,0.295364
Label 3,0.0,0.0,-0.98225,0.618337,-0.309664
Label 4,2.0,1.0,0.341083,0.601993,0.159713


The file must be a [feather file][feather file], not parquet. 

```python
from pyarrow import feather, parquet

# if converting from parquet
feather.write_feather(parquet.read_table('fin.parquet'), 'fout.feather')

# if converting pandas
feather.write_feather(df, 'fout.feather')
```

[feather file]: https://arrow.apache.org/docs/python/feather.html

In [None]:
feather.write_feather(df_all, feather_sidecar.name)

### 3) run `add_sidecars.py`

In [None]:
!python3 add_sidecars.py --tileset {os.path.join(TARGET_DIR, 'tiles')}\
                         --sidecar {feather_sidecar.name} --key {LABEL_NAME};
!clear                         

/Users/solst/Projects/featherplot/nbs/../static/datasets/mini/tiles/0/0/0.feather

**(NEW)**
> check if feather has all data

In [None]:
feather.read_feather(os.path.join(TARGET_DIR, 'tiles', '0/0/0.feather')).shape

(1000, 5)

## Meta Data

In [None]:
meta = dict(
    seed=SEED, n_points=N_POINTS, tile_size=TILE_SIZE, 
    dataset_name=DATASET_NAME, label_name=LABEL_NAME,
    
    # NOTE: since all these directories are relative to the static assets directory,
    #       we can use the path relative to the static assets directory instead of
    #       the path wrangling we did above.
    target_dir=TARGET_DIR.replace(STATIC_ASSETS_DIR, ''), 
    tiles_dir=os.path.join(TARGET_DIR, 'tiles').replace(STATIC_ASSETS_DIR, ''),

    embedding_columns=df_p.columns.values.tolist(),
    sidecar_columns=df_s.columns.values.tolist(),
    column_metadata=column_meta,
)
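
The relative paths above are derived with `str.replace`, which works as long as both paths are spelled identically. A possible alternative, sketched here with a hypothetical directory layout, is `os.path.relpath`, which normalises both paths before comparing them.

```python
import os

# hypothetical layout mirroring the notebook: <project>/nbs/../static
static_dir = os.path.join('/srv/app/nbs', '..', 'static')
target_dir = os.path.join(static_dir, 'datasets', 'mini')

# str.replace only works while both paths share the exact same spelling
via_replace = target_dir.replace(static_dir, '')

# os.path.relpath normalises both paths first, so '..' segments are harmless
via_relpath = os.path.relpath(target_dir, start=static_dir)

print(via_replace)
print(via_relpath)
```

Note that `relpath` returns the path without a leading separator, so one would need to prepend `/` to get the same value the `replace` approach stores in the metadata.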

In [None]:
with open(os.path.join(TARGET_DIR, 'meta.yml'), 'w') as f:
    f.write(yaml.dump(meta))

## Cleanup

NOTE: these files are deleted automatically when the kernel stops, but we close them here explicitly as good practice.

In [None]:
csv_points.close()
csv_sidecar.close()
feather_sidecar.close()
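
For reference, the sketch below shows why closing is enough: with the default `delete=True`, a `tempfile.NamedTemporaryFile` is removed from disk as soon as it is closed, and closing it again (as the `atexit` hook registered in `make_temp_file` will do later) is harmless. The file contents here are placeholders.

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(suffix='.csv')
path = tmp.name
tmp.write(b'label,x,y\n')
tmp.flush()

exists_before = os.path.exists(path)

tmp.close()
tmp.close()  # a second close is a no-op, so the atexit hook stays safe

exists_after = os.path.exists(path)
print(exists_before, exists_after)
```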