**This notebook should be executed after `tutorial_01` and in the directory `kedro-tutorial-generated-folder`**

In [4]:
# The usual way: call the pandas reader directly with a hardcoded path

import pandas as pd

pd.read_csv("data/01_raw/iris.csv")

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


Normally, we need to know which method to call (`read_csv`, `read_excel`, `read_json`, etc.), the exact name of the file, its exact location, its file extension (if any), and any additional arguments. This becomes particularly burdensome when we have to read multiple similar files in one script/notebook, or across multiple scripts/notebooks.

In [13]:
# Imagine these statements scattered within one notebook and across several notebooks
pd.read_csv("data/01_raw/iris.csv");
pd.read_csv("data/01_raw/iris.csv");
pd.read_csv("data/01_raw/iris.csv");
pd.read_csv("data/01_raw/iris.csv");

With Kedro's Data Catalog, we instead load the dataset by name:

In [14]:
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# Initialise a ConfigLoader
conf_loader = ConfigLoader("conf/base")

# Load the data catalog configuration from catalog.yml
conf_catalog = conf_loader.get("*catalog*.yml")
parameters = conf_loader.get("*parameters*.yml")

# Create the DataCatalog instance from the configuration
catalog = DataCatalog.from_config(conf_catalog)

# Load the dataset and print the output
df = catalog.load("example_iris_data")
df


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [63]:
! tree

.
├── conf
│   ├── base
│   │   ├── catalog.yml
│   │   ├── logging.yml
│   │   └── parameters.yml
│   ├── local
│   │   └── credentials.yml
│   └── README.md
├── data
│   └── 01_raw
│       └── iris.csv
├── Example Notebook.ipynb
├── README.md
├── requirements.txt
└── tutorial_02.ipynb

5 directories, 10 files


In [12]:
! cat conf/base/catalog.yml

# Here you can define all your data sets by using simple YAML syntax.
#
# Documentation for this file format can be found in "The Data Catalog"
# Link: https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv


In [15]:
catalog.datasets.example_iris_data._describe()

{'filepath': PurePosixPath('data/01_raw/iris.csv'),
 'protocol': 'file',
 'load_args': {},
 'save_args': {'index': False},
 'version': None}

In [68]:
#get full path
from pathlib import Path

relative_path = catalog.datasets.example_iris_data._describe()["filepath"]
print(f"relative path: {relative_path.as_posix()}") #.as_posix() converts to string

full_path = Path(relative_path).resolve().as_posix()
print(f"full path: {full_path}")

relative path: data/01_raw/iris.csv
full path: /home/jovyan/kedro-tutorial-generated-folder/data/01_raw/iris.csv
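
`Path.resolve()` turns a relative path into an absolute one by starting from the current working directory, which is why this notebook has to run from inside the project folder. As a minimal sketch, assuming you would rather pass the project root explicitly, the hypothetical helper `to_full_path` below (not part of Kedro) joins the two pieces without depending on where the notebook was started:

```python
from pathlib import PurePosixPath


def to_full_path(relative_path, project_root):
    """Join a catalog-style relative path onto an explicit, absolute project root."""
    root = PurePosixPath(project_root)
    if not root.is_absolute():
        raise ValueError(f"project_root must be absolute, got {project_root!r}")
    # as_posix() returns the joined path as a plain string
    return (root / PurePosixPath(relative_path)).as_posix()


example = to_full_path(
    PurePosixPath("data/01_raw/iris.csv"),
    "/home/jovyan/kedro-tutorial-generated-folder",
)
print(example)
```

The result is a plain string, so it can be handed to bash or R in the same way as `full_path`.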


### Still benefiting from the catalog without using Python

**If you use bash or other languages, you can still use the full path**

#### Bash

In [29]:
! echo {full_path}

/home/jovyan/kedro-tutorial-generated-folder/data/01_raw/iris.csv


In [30]:
! head {full_path}

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa


#### R


In [41]:
!conda install rpy2 -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /srv/conda/envs/notebook

  added / updated specs:
    - rpy2


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _r-mutex-1.0.1             |      anacondar_1           3 KB  conda-forge
    binutils_impl_linux-64-2.31.1|       h6176602_1         3.9 MB  defaults
    binutils_linux-64-2.31.1   |       h6176602_9          26 KB  defaults
    bwidget-1.9.14             |       ha770c72_0         119 KB  conda-forge
    bzip2-1.0.8                |       h7f98852_4         484 KB  conda-forge
    ca-certificates-2020.12.5  |       ha878542_0         137 KB  conda-forge
    cairo-1.14.12              |       h8948797_3         906 KB  defaults
    certifi-2020.12.5          |   py37h89c1867_1         143 KB  conda-forge
    cffi-1.14.4                |   py37hc58025

**Passing only the path**

In [43]:
%load_ext rpy2.ipython

In [65]:
%%R -i full_path
print(full_path)
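# read.table() splits on whitespace by default and does not treat the first line as a header,
# so each CSV row ends up as a single value in column V1; read.csv(full_path) would split on commas
head(read.table(full_path))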

[1] "/home/jovyan/kedro-tutorial-generated-folder/data/01_raw/iris.csv"
                                                         V1
1 sepal_length,sepal_width,petal_length,petal_width,species
2                                    5.1,3.5,1.4,0.2,setosa
3                                    4.9,3.0,1.4,0.2,setosa
4                                    4.7,3.2,1.3,0.2,setosa
5                                    4.6,3.1,1.5,0.2,setosa
6                                    5.0,3.6,1.4,0.2,setosa


**Passing the dataframe**

In [52]:
%%R -i df
head(df)

  sepal_length sepal_width petal_length petal_width species
0          5.1         3.5          1.4         0.2  setosa
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
5          5.4         3.9          1.7         0.4  setosa


### Changing the path of dataset

In [72]:
!mkdir data/01_raw/csv
!mv data/01_raw/iris.csv data/01_raw/csv  # move the file into a subdirectory (reorganizing)
# The errors below appear when this cell is re-run after the file has already been moved



mkdir: cannot create directory ‘data/01_raw/csv’: File exists
mv: cannot stat 'data/01_raw/iris.csv': No such file or directory


In [None]:
catalog.load("example_iris_data")  # this now fails with a file-not-found error (Kedro may wrap it in a DataSetError): the catalog still points at the old path

In [74]:
! cat conf/base/catalog.yml

# Here you can define all your data sets by using simple YAML syntax.
#
# Documentation for this file format can be found in "The Data Catalog"
# Link: https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv


**Edit the catalog with the new path**

In [75]:
%%writefile conf/base/catalog.yml
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/csv/iris.csv

Overwriting conf/base/catalog.yml


In [76]:
! cat conf/base/catalog.yml

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/csv/iris.csv


Because we're in the same IPython session, we have to recreate the `catalog` variable to pick up the change. After that, we can load the data again.

In [82]:
conf_catalog = conf_loader.get("*catalog*.yml")  # re-read the configuration so the edited catalog.yml is picked up
catalog = DataCatalog.from_config(conf_catalog)  # rebuild the catalog from the refreshed configuration
catalog.load("example_iris_data")

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


We only had to change one line in one file. In contrast, had we hardcoded the path, we would have had to change `n * l` lines, where `n` is the number of scripts/notebooks that load this file and `l` is the number of times each one loads it. Remembering which notebooks use the file and then opening each one to edit it would be especially painful.
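
To make the `n * l` count concrete, here is a small sketch using a made-up project: the notebook names and the choice of `n = 3` and `l = 4` are assumptions for illustration only.

```python
HARDCODED_CALL = 'pd.read_csv("data/01_raw/iris.csv")'

# Hypothetical project: n = 3 notebooks, each loading the file l = 4 times.
notebooks = {
    "explore.ipynb": [HARDCODED_CALL] * 4,
    "features.ipynb": [HARDCODED_CALL] * 4,
    "report.ipynb": [HARDCODED_CALL] * 4,
}


def lines_to_edit(sources, old_path):
    """Count the source lines that mention old_path and so need a manual edit."""
    return sum(old_path in line for lines in sources.values() for line in lines)


hardcoded_edits = lines_to_edit(notebooks, "data/01_raw/iris.csv")
catalog_edits = 1  # only the `filepath:` line in conf/base/catalog.yml changes

print(hardcoded_edits, catalog_edits)  # 12 1
```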

### Adding another dataset

In [106]:
# Keep only the first five rows as a small sample to save in other formats
iris = catalog.load("example_iris_data").head()

In [94]:
catalog.save("example_iris_data_json", iris)  # raises DataSetNotFoundError because this dataset has not been added to the catalog yet

DataSetNotFoundError: DataSet 'example_iris_data_json' not found in the catalog - did you mean one of these instead: example_iris_data

In [103]:
%%writefile conf/base/catalog.yml

example_iris_data:
  type: pandas.CSVDataSet  # loaded with pd.read_csv
  filepath: data/01_raw/csv/iris.csv

example_iris_data_json:
  type: pandas.JSONDataSet
  filepath: data/01_raw/json/iris.json

example_iris_data_tsv:
  type: pandas.CSVDataSet  # same dataset type; the tab separator is set in the args below
  filepath: data/01_raw/tsv/iris.tsv
  save_args:
    sep: "\t"
  load_args:
    sep: "\t"

Overwriting conf/base/catalog.yml


In [107]:
conf_catalog = conf_loader.get("*catalog*.yml")  # re-read the configuration so the new entries are picked up
catalog = DataCatalog.from_config(conf_catalog)

catalog.save("example_iris_data_json", iris)
catalog.save("example_iris_data_tsv", iris)

In [113]:
!head data/01_raw/json/iris.json

{"sepal_length":{"0":5.1,"1":4.9,"2":4.7,"3":4.6,"4":5.0},"sepal_width":{"0":3.5,"1":3.0,"2":3.2,"3":3.1,"4":3.6},"petal_length":{"0":1.4,"1":1.4,"2":1.3,"3":1.5,"4":1.4},"petal_width":{"0":0.2,"1":0.2,"2":0.2,"3":0.2,"4":0.2},"species":{"0":"setosa","1":"setosa","2":"setosa","3":"setosa","4":"setosa"}}

In [110]:
!head data/01_raw/tsv/iris.tsv

sepal_length	sepal_width	petal_length	petal_width	species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa


In [111]:
!head data/01_raw/csv/iris.csv

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa


Once a dataset is in the catalog, the user does not need to know its format or the arguments needed to read and write it. View all the supported datasets [here](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.html) or learn how to [define your custom dataset](https://kedro.readthedocs.io/en/stable/07_extend_kedro/03_custom_datasets.html).
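
To illustrate the idea of hiding format and arguments behind a name, here is a toy catalog written with the standard library only. It is a sketch of the concept, not Kedro's implementation: the `toy.CSV` and `toy.JSON` type names, the `HANDLERS` table, and the `save`/`load` functions are all made up for this example.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_csv(rows, filepath, sep=","):
    with open(filepath, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter=sep)
        writer.writeheader()
        writer.writerows(rows)


def load_csv(filepath, sep=","):
    with open(filepath, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=sep))


def save_json(rows, filepath):
    Path(filepath).write_text(json.dumps(rows))


def load_json(filepath):
    return json.loads(Path(filepath).read_text())


# Each (made-up) type name maps to its loader and saver.
HANDLERS = {"toy.CSV": (load_csv, save_csv), "toy.JSON": (load_json, save_json)}


def save(catalog, name, rows):
    """Save rows under a dataset name; format, path and arguments come from the entry."""
    entry = catalog[name]
    _, saver = HANDLERS[entry["type"]]
    Path(entry["filepath"]).parent.mkdir(parents=True, exist_ok=True)
    saver(rows, entry["filepath"], **entry.get("save_args", {}))


def load(catalog, name):
    """Load a dataset by name; the caller never mentions format, path or arguments."""
    entry = catalog[name]
    loader, _ = HANDLERS[entry["type"]]
    return loader(entry["filepath"], **entry.get("load_args", {}))


rows = [
    {"sepal_length": "5.1", "species": "setosa"},
    {"sepal_length": "4.9", "species": "setosa"},
]

with tempfile.TemporaryDirectory() as tmp:
    toy_catalog = {
        "iris_json": {"type": "toy.JSON", "filepath": Path(tmp) / "json" / "iris.json"},
        "iris_tsv": {
            "type": "toy.CSV",
            "filepath": Path(tmp) / "tsv" / "iris.tsv",
            "save_args": {"sep": "\t"},
            "load_args": {"sep": "\t"},
        },
    }
    for name in toy_catalog:
        save(toy_catalog, name, rows)
    loaded = {name: load(toy_catalog, name) for name in toy_catalog}
    tsv_header = (Path(tmp) / "tsv" / "iris.tsv").read_text().splitlines()[0]
```

The calling code is identical for both formats; only the catalog entries differ, which mirrors how `catalog.save` and `catalog.load` were used above.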