# Draco 1 vs. Draco 2

In this notebook, we compare and contrast the capabilities of _Draco 1_ and _Draco 2_ through hands-on examples.

> ⚠️ This notebook requires a Node.js runtime so that the `draco1` bindings work as expected. Draco 2 **does not** require any non-Python dependencies.

In [1]:
# Display utilities
from IPython.display import display, Markdown
from typing import Callable

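def md(markdown: str):
    """Render a Markdown string as rich cell output."""
    display(Markdown(markdown))

def run_guarded(func: Callable):
    """Call `func` and render any raised exception as red text instead of a traceback."""
    try:
        func()
    except Exception as e:
        md(f'**Error:** <i style="color: red;">{e}</i>')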

We install a fork of Draco 1 that is published under the package name `draco1`. This avoids a name clash with the currently installed `draco` package, which refers to Draco 2 (i.e., this repository). The `draco1` fork does not modify the original functionality of Draco 1; it only renames the package, so we can import both versions in the same notebook and clearly tell them apart in the comparison.

In [2]:
# Installing `clyngor` prior to `draco1`, as it is a build requirement
!pip -qq install --upgrade pip && pip -qq install clyngor
!pip install -qq 'git+https://github.com/peter-gy/draco.git@named-to-draco1#draco1'

In [3]:
import draco1 as drc1
import draco as drc2

md(f'Comparing _Draco 1: v{drc1.__version__}_ with _Draco 2: v{drc2.__version__}_')

Comparing _Draco 1: v0.0.9_ with _Draco 2: v2.0.0b5_

## Loading data and generating its schema

We start by loading the [Seattle Weather](https://github.com/vega/vega/blob/main/docs/data/seattle-weather.csv) dataset from the [Vega Datasets](https://pypi.org/project/vega-datasets/) package. We then use Draco 1 (`drc1`) and Draco 2 (`drc2`) to generate the schema of the dataset.

In [4]:
import pandas as pd
from vega_datasets import data as vega_data

df: pd.DataFrame = vega_data.seattle_weather()
df.head()

| | date | precipitation | temp_max | temp_min | wind | weather |
|---|---|---|---|---|---|---|
| 0 | 2012-01-01 | 0.0 | 12.8 | 5.0 | 4.7 | drizzle |
| 1 | 2012-01-02 | 10.9 | 10.6 | 2.8 | 4.5 | rain |
| 2 | 2012-01-03 | 0.8 | 11.7 | 7.2 | 2.3 | rain |
| 3 | 2012-01-04 | 20.3 | 12.2 | 5.6 | 4.7 | rain |
| 4 | 2012-01-05 | 1.3 | 8.9 | 2.8 | 6.1 | rain |


### Draco 1

As the cells below show, Draco 1 exposes the `data2schema` function to generate the schema of a dataset, but the function does not accept a Pandas `DataFrame` directly. Even after converting the dataframe to a list of dictionaries, on the assumption that the records would be JSON serializable, the function still fails because the `date` column holds `Timestamp` objects, which are not JSON serializable.

Schema generation succeeds only after converting the `date` column to strings of the format `YYYY-MM-DD`.
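The error messages in the cells below have the form that Python's standard `json` module uses when it meets a value it cannot encode. The following is a minimal sketch of that failure and of the string-conversion workaround, using only the standard library. `to_json_safe` is a hypothetical helper name (it is not part of either Draco version), and `datetime.datetime` stands in for pandas' `Timestamp`, which subclasses it.

```python
import json
from datetime import date, datetime


def to_json_safe(records):
    """Return a copy of `records` with date-like values formatted as YYYY-MM-DD strings."""
    return [
        {
            key: value.strftime("%Y-%m-%d") if isinstance(value, (date, datetime)) else value
            for key, value in row.items()
        }
        for row in records
    ]


# Hypothetical one-row sample; `datetime` stands in for a pandas `Timestamp`
records = [{"date": datetime(2012, 1, 1), "weather": "drizzle"}]

error_message = None
try:
    json.dumps(records)  # fails: datetime objects have no default JSON encoding
except TypeError as e:
    error_message = str(e)

# After conversion, every value is a plain string and serialization succeeds
serialized = json.dumps(to_json_safe(records))
```

The helper leaves non-date values untouched, so it can be applied to records of mixed type without knowing in advance which columns hold dates.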

In [5]:
run_guarded(lambda: drc1.data2schema(df))

**Error:** <i style="color: red;">Object of type DataFrame is not JSON serializable</i>

In [6]:
data_records = df.to_dict('records')
run_guarded(lambda: drc1.data2schema(data_records))

**Error:** <i style="color: red;">Object of type Timestamp is not JSON serializable</i>

In [7]:
# Convert the `date` column to strings of the format `YYYY-MM-DD`
df_serializable = df.copy()
df_serializable['date'] = df_serializable['date'].dt.strftime('%Y-%m-%d')
data_records = df_serializable.to_dict('records')
drc1.data2schema(data_records)

{'stats': {'date': {'type': 'string',
   'unique': {'2012-01-01': 1,
    '2012-01-02': 1,
    '2012-01-03': 1,
    '2012-01-04': 1,
    '2012-01-05': 1,
    '2012-01-06': 1,
    '2012-01-07': 1,
    '2012-01-08': 1,
    '2012-01-09': 1,
    '2012-01-10': 1,
    '2012-01-11': 1,
    '2012-01-12': 1,
    '2012-01-13': 1,
    '2012-01-14': 1,
    '2012-01-15': 1,
    '2012-01-16': 1,
    '2012-01-17': 1,
    '2012-01-18': 1,
    '2012-01-19': 1,
    '2012-01-20': 1,
    '2012-01-21': 1,
    '2012-01-22': 1,
    '2012-01-23': 1,
    '2012-01-24': 1,
    '2012-01-25': 1,
    '2012-01-26': 1,
    '2012-01-27': 1,
    '2012-01-28': 1,
    '2012-01-29': 1,
    '2012-01-30': 1,
    '2012-01-31': 1,
    '2012-02-01': 1,
    '2012-02-02': 1,
    '2012-02-03': 1,
    '2012-02-04': 1,
    '2012-02-05': 1,
    '2012-02-06': 1,
    '2012-02-07': 1,
    '2012-02-08': 1,
    '2012-02-09': 1,
    '2012-02-10': 1,
    '2012-02-11': 1,
    '2012-02-12': 1,
    '2012-02-13': 1,
    '2012-02-14': 1,
    '20

### Draco 2

Because Draco 2 is written entirely in Python, it accepts a Pandas `DataFrame` directly as input for schema generation, with none of the issues we encountered with Draco 1.

In [8]:
drc2.schema_from_dataframe(df)

{'number_rows': 1461,
 'field': [{'name': 'date',
   'type': 'datetime',
   'unique': 1461,
   'entropy': 7287},
  {'name': 'precipitation',
   'type': 'number',
   'unique': 111,
   'entropy': 2422,
   'min': 0,
   'max': 55,
   'std': 6},
  {'name': 'temp_max',
   'type': 'number',
   'unique': 67,
   'entropy': 3934,
   'min': -1,
   'max': 35,
   'std': 7},
  {'name': 'temp_min',
   'type': 'number',
   'unique': 55,
   'entropy': 3596,
   'min': -7,
   'max': 18,
   'std': 5},
  {'name': 'wind',
   'type': 'number',
   'unique': 79,
   'entropy': 3950,
   'min': 0,
   'max': 9,
   'std': 1},
  {'name': 'weather',
   'type': 'string',
   'unique': 5,
   'entropy': 1201,
   'freq': 714}]}
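
The Draco 2 schema is a plain dictionary whose `field` entry is a list of per-column summaries, so it can be inspected with ordinary Python. As a small illustration, the sketch below groups column names by their inferred type. `fields_by_type` is a hypothetical helper (not part of Draco), and the `schema` literal is an abridged copy of the output above that keeps only `name` and `type`.

```python
from collections import defaultdict

# Abridged copy of the schema printed above (only `name` and `type` are kept)
schema = {
    "number_rows": 1461,
    "field": [
        {"name": "date", "type": "datetime"},
        {"name": "precipitation", "type": "number"},
        {"name": "temp_max", "type": "number"},
        {"name": "temp_min", "type": "number"},
        {"name": "wind", "type": "number"},
        {"name": "weather", "type": "string"},
    ],
}


def fields_by_type(schema):
    """Group field names by the type recorded for them in the schema."""
    grouped = defaultdict(list)
    for field in schema["field"]:
        grouped[field["type"]].append(field["name"])
    return dict(grouped)


groups = fields_by_type(schema)
```

Passing the full dictionary returned by `drc2.schema_from_dataframe(df)` instead of the abridged literal should yield the same grouping, since the helper reads only the `name` and `type` keys.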