# Getting started with `kedro`

There is an **example Jupyter notebook** that demonstrates some basic `kedro`, `plotly` and `qgrid` features. Once you start the Docker container,
open the Jupyter link, navigate to the `notebooks/` directory, and check out `kedro_example.ipynb`.

In particular, the examples are meant to showcase the benefits of **interactive** plotly charts and qgrid tables.

Below, you'll find examples for how to...
1. Add an `ExcelDataSet` to the kedro catalog
2. Load data from the kedro catalog
3. Use `qgrid` to render pandas DataFrames as interactive Tables
4. Use `plotly` to create interactive charts, based on your data

---

## Add an `ExcelDataSet` to the kedro catalog
The data used here is a publicly available retail data set from the UCI Machine Learning Repository.

For demo purposes, we're not using `context.catalog` but create our own local `catalog` object. However, the following steps (loading, visualizing) would work in exactly the same way if `retail_data` were defined in `catalog.yml`.

Note that by explicitly defining the `openpyxl` engine in `load_args`, we can **read the data directly from the URL**. However, we will not be able to manipulate the data and **write it back** to the data set (i.e. using `catalog.save("retail_data", retail_data)`).

In [1]:
# create a DataCatalog and add an ExcelDataSet
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import ExcelDataSet

catalog = DataCatalog({'retail_data':ExcelDataSet(
    filepath='https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx',
    load_args={'engine':'openpyxl'}
)})
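
As a rough sketch of the `catalog.yml` route mentioned above, the same entry can be written as a plain dictionary and passed to `DataCatalog.from_config`. This assumes a kedro 0.17.x environment, where the type string `pandas.ExcelDataSet` resolves to the class imported in the cell above; `catalog_config` and `build_catalog` are hypothetical names used only for illustration.

```python
# Dictionary mirroring what a catalog.yml entry for this data set might look like.
# Assumption: kedro 0.17.x, where "pandas.ExcelDataSet" resolves to
# kedro.extras.datasets.pandas.ExcelDataSet.
catalog_config = {
    "retail_data": {
        "type": "pandas.ExcelDataSet",
        "filepath": "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx",
        "load_args": {"engine": "openpyxl"},
    }
}


def build_catalog(config):
    """Build a DataCatalog from a config dictionary (hypothetical helper)."""
    # Imported lazily so the dictionary above can be inspected without kedro installed.
    from kedro.io import DataCatalog

    return DataCatalog.from_config(config)
```

Calling `build_catalog(catalog_config)` should give a catalog that can be used in the same way as the hand-built one, e.g. `catalog.load("retail_data")`.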



## Load data from the kedro catalog

In [2]:
retail_data = catalog.load('retail_data')

2021-05-12 21:36:56,315 - kedro.io.data_catalog - INFO - Loading data from `retail_data` (ExcelDataSet)...




In [3]:
retail_data.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


## Use `qgrid` to render pandas DataFrames as interactive Tables

`qgrid` is a lightweight, yet powerful widget to render pandas data frames with interactive features. Here are some aspects worth considering when using `qgrid`:
* Don't render the **DataFrame as a whole**; limit the number of rows first (e.g. with `head()`, `tail()`, `sample()`, etc.)
* Use `qgrid.set_grid_option()` to set display options for qgrid globally

In [7]:
import qgrid
qgrid.set_grid_option('maxVisibleRows',12)
qgrid.set_grid_option('forceFitColumns',False)
qgrid.show_grid(retail_data.head(500))



QgridWidget(grid_options={'fullWidthRows': True, 'syncColumnCellResize': True, 'forceFitColumns': False, 'defa…
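
The advice about limiting rows before rendering could be wrapped in a small helper. The following is only a sketch: `preview`, its parameters and the `demo` frame are hypothetical and not part of `qgrid` or the notebook.

```python
import pandas as pd


def preview(df, max_rows=500, random=False, seed=0):
    """Return at most `max_rows` rows of `df` (hypothetical helper)."""
    if len(df) <= max_rows:
        # Small enough already: hand the frame back unchanged.
        return df
    if random:
        # Reproducible random subset instead of the first rows.
        return df.sample(n=max_rows, random_state=seed)
    return df.head(max_rows)


# Made-up demo frame, only to show the row limit in action.
demo = pd.DataFrame({"Quantity": range(2000)})
subset = preview(demo, max_rows=500)
```

In the notebook, this would be used as `qgrid.show_grid(preview(retail_data))`, so the widget never receives more rows than the chosen limit.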

## Use `plotly` to create interactive charts, based on your data

In [22]:
from plotly import express as px

# revenue per invoice line, summed by country and invoice day
daily_revenue = (
    (retail_data['Quantity'] * retail_data['UnitPrice'])
    .groupby([retail_data['Country'], retail_data['InvoiceDate'].dt.date])
    .sum()
)
fig = px.line(
    data_frame=daily_revenue.reset_index().rename(columns={0:"DailyRevenue"}),
    x='InvoiceDate',
    y='DailyRevenue',
    color='Country'
)
fig.show()
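
To make the aggregation step easier to follow, here is a minimal sketch on a tiny, made-up frame that uses the same column names as the retail data. It is a variant of the cell above, not the notebook's own code: it names the revenue column up front, so the unnamed `0` column never has to be renamed. The `toy` values and the `InvoiceDay` column are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows that mimic the structure of the retail data.
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "France", "France"],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-01 09:00",
        "2010-12-01 10:00", "2010-12-02 11:00",
    ]),
    "Quantity": [6, 2, 10, 1],
    "UnitPrice": [2.5, 1.0, 0.5, 4.0],
})

daily_revenue = (
    toy.assign(
        Revenue=toy["Quantity"] * toy["UnitPrice"],  # revenue per invoice line
        InvoiceDay=toy["InvoiceDate"].dt.date,       # drop the time of day
    )
    .groupby(["Country", "InvoiceDay"], as_index=False)["Revenue"]
    .sum()
    .rename(columns={"Revenue": "DailyRevenue"})
)
```

The result has one row per country and day, with the columns `Country`, `InvoiceDay` and `DailyRevenue`, and could be passed to `px.line` with `x='InvoiceDay'`.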

