# Introduction to dvb.datascience

In this tutorial, you will learn the basic usage of dvb.datascience and how you can use it in your data science activities.

If you have any suggestions for features or if you encounter any bugs, please let us know at [tc@devolksbank.nl](mailto:tc@devolksbank.nl).

In [None]:
# %matplotlib inline
import dvb.datascience as ds

## Defining a pipeline
Defining a pipeline is a matter of adding Pipes (actions) and connecting them.

Every Pipe can have 0, 1 or more inputs from other pipes.
Every Pipe can have 0, 1 or more outputs to other pipes.
Every Pipe has a name. Every input and output of the pipe has a key by which the input/output is identified. The name of the Pipe and the key of the input/output are used to connect pipes.
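
The naming scheme described above can be pictured with a small toy model. This is only a sketch in plain Python, not the dvb.datascience implementation: `ToyPipeline`, `add_pipe` and `run` are hypothetical names, and the connection tuple is assumed to mean (source pipe name, source output key, input key of the receiving pipe).

```python
# Toy model of named pipes and keyed connections.
# Hypothetical illustration only; this is NOT how dvb.datascience is implemented.
from typing import Callable, Dict, List, Tuple

# Assumed meaning: (source pipe name, source output key, input key on this pipe).
Connection = Tuple[str, str, str]
PipeFunc = Callable[[Dict[str, object]], Dict[str, object]]


class ToyPipeline:
    def __init__(self) -> None:
        self.pipes: Dict[str, PipeFunc] = {}
        self.connections: Dict[str, List[Connection]] = {}
        self.outputs: Dict[str, Dict[str, object]] = {}

    def add_pipe(self, name: str, func: PipeFunc, inputs: List[Connection] = ()) -> None:
        self.pipes[name] = func
        self.connections[name] = list(inputs)

    def run(self) -> Dict[str, Dict[str, object]]:
        # Pipes run in insertion order, so a pipe must be added after its sources.
        for name, func in self.pipes.items():
            inputs = {
                input_key: self.outputs[source][output_key]
                for source, output_key, input_key in self.connections[name]
            }
            self.outputs[name] = func(inputs)
        return self.outputs


toy = ToyPipeline()
# A pipe with zero inputs and one output stored under the key "df".
toy.add_pipe("read", lambda inputs: {"df": [1, 2, 3]})
# A pipe that receives the "df" output of "read" as its own "df" input.
toy.add_pipe("double", lambda inputs: {"df": [2 * x for x in inputs["df"]]}, [("read", "df", "df")])
result = toy.run()
```

In this toy, `result["double"]["df"]` holds the doubled values, looked up by pipe name and then by output key.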

In [None]:
p = ds.Pipeline(dataframe_engine='pandas')
p.addPipe("read", ds.data.SampleData(dataset_name="iris"))
p.addPipe("metadata", ds.data.DataPipe("df_metadata", {"y_true_label": "label"}))
p.addPipe(
    "write",
    ds.data.CSVDataExportPipe("dump_input_to_output.csv", sep=",", index_label="caseId"),
    [("read", "df", "df")],
)

To create a new pipeline, call `ds.Pipeline()`. When no parameters are given, all pipes use pandas DataFrames. You can also specify `dataframe_engine='dask'` to distribute the computations over multiple cores and/or nodes with Dask.

A pipeline has two main methods: `fit_transform()` and `transform()`. `fit_transform()` trains the pipeline. Depending on the Pipe, training can mean computing a mean, building a decision tree, etc. During `transform()`, what was learned is used to transform the input to output, for example by replacing outliers with means or by predicting with the trained model.
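
To make the difference between the two methods concrete, here is a minimal sketch of a single pipe that learns a mean during fitting and reuses it during transform. `ToyOutlierPipe` and its `n_std` parameter are hypothetical and written in plain Python; they are not part of dvb.datascience.

```python
from statistics import mean, pstdev


class ToyOutlierPipe:
    """Hypothetical pipe: learns mean and spread in fit, replaces outliers in transform."""

    def __init__(self, n_std: float = 2.0) -> None:
        self.n_std = n_std
        self.mean_ = None
        self.std_ = None

    def fit(self, values):
        # Training step: remember statistics of the training data.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)

    def transform(self, values):
        # Apply what was learned: values too far from the learned mean are replaced by it.
        limit = self.n_std * self.std_
        return [v if abs(v - self.mean_) <= limit else self.mean_ for v in values]

    def fit_transform(self, values):
        self.fit(values)
        return self.transform(values)


pipe = ToyOutlierPipe(n_std=2.0)
train = [10.0, 11.0, 9.0, 10.0, 10.0, 50.0]
cleaned_train = pipe.fit_transform(train)   # learns from train, then transforms it
cleaned_new = pipe.transform([10.5, 100.0])  # reuses the learned mean, no refitting
```

Note that the second call does not recompute the mean: new data is judged against what was learned from the training data.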

In [None]:
p.fit_transform()

After the transform, the output of each pipe can be retrieved by the name of the pipe.

In [None]:
p.get_pipe_output('read')

## Multiple inputs
Some Pipes have multiple inputs. For example, to merge two datasets we can do the following.

In [None]:
p = ds.Pipeline()
p.addPipe('read1', ds.data.CSVDataImportPipe())
p.addPipe('read2', ds.data.CSVDataImportPipe())
p.addPipe('merge', ds.transform.Union(2, axis=0, join='outer'), [("read1", "df", "df0"), ("read2", "df", "df1")])

In [None]:
p.fit_transform(transform_params={'read1': {'file_path': '../test/data/train.csv'}, 'read2': {'file_path': '../test/data/test.csv'}})

In [None]:
p.get_pipe_output('merge')
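
The parameter names `axis=0` and `join='outer'` resemble those of `pandas.concat`. Assuming the merge stacks the rows of both inputs and keeps the union of their columns (an assumption based on the parameter names, not confirmed by this tutorial), the effect can be sketched in plain Python. `outer_union` and the sample records are hypothetical.

```python
def outer_union(*tables):
    """Stack rows of several tables, keeping the union of all columns."""
    # Collect every column name, preserving first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Stack the rows; a column missing from a row becomes None.
    return [{col: row.get(col) for col in columns} for table in tables for row in table]


# Hypothetical records standing in for the two CSV files.
train = [{"id": 1, "age": 30, "label": "a"}]
test = [{"id": 2, "age": 41}]
merged = outer_union(train, test)
```

Here the row coming from `test` gets `None` for the `label` column it does not have.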

## Plots
It's easy to get some plots of the data:

In [None]:
p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [("read", "df", "df")])
p.addPipe('boxplot', ds.eda.BoxPlot(), [("split", "df", "df")])
p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})

In [None]:
p.get_pipe_output('read')

In [None]:
p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)

## Some plots can combine transforms into one plot
You can add a name to the transform in order to add it to the legend.
By default, the transform won't close the plots. So when you leave out `close_plt=True` in the call of `(fit_)transform`, plots of the next transform will be integrated into the plots of the previous transform.
Do not forget to pass `close_plt=True` to the last transform; otherwise all plots will remain open and Jupyter will plot them again.

In [None]:
p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [("read", "df", "df")])
p.addPipe('ecdf', ds.eda.ECDFPlots(), [("split", "df", "df")])
p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})
p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)
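
For readers unfamiliar with the empirical cumulative distribution function (ECDF), the quantity behind such a plot can be computed in a few lines. This is a plain-Python sketch of the standard definition; the helper name `ecdf` is hypothetical and it does not show how `ECDFPlots` works internally.

```python
def ecdf(values):
    """Return sorted values and the fraction of observations at or below each one."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


# A few made-up measurements.
xs, ys = ecdf([5.1, 4.9, 6.3, 5.8])
```

Plotting `ys` against `xs` for the train and the test split in the same figure shows at a glance whether both splits follow a similar distribution.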

In [None]:
p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [("read", "df", "df")])
p.addPipe('scatter', ds.eda.ScatterPlots(), [("split", "df", "df")])
p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})
p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)

In [None]:
p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [("read", "df", "df")])
p.addPipe('hist', ds.eda.Hist(), [("split", "df", "df")])
p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})
p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)

## Drawing a pipeline
Once defined, a pipeline can be drawn.

In [None]:
p = ds.Pipeline()
p.addPipe('read', ds.data.CSVDataImportPipe())
p.addPipe('read2', ds.data.CSVDataImportPipe())
p.addPipe('numeric', ds.transform.FilterTypeFeatures(), [("read", "df", "df")])
p.addPipe('numeric2', ds.transform.FilterTypeFeatures(), [("read2", "df", "df")])
p.addPipe('boxplot', ds.eda.BoxPlot(), [("numeric", "df", "df"), ("numeric2", "df", "df")])
p.draw_design()

## Predicting

In [None]:
from sklearn.neighbors import KNeighborsClassifier
p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [("read", "df", "df"), ("read", "df_metadata", "df_metadata")])
p.fit_transform()
p.get_pipe_output('clf')

## Scoring

In [None]:
from sklearn.neighbors import KNeighborsClassifier
p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))
p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [("read", "df", "df"), ("read", "df_metadata", "df_metadata")])
p.addPipe('score', ds.score.ClassificationScore(), [("clf", "predict", "predict"), ("clf", "predict_metadata", "predict_metadata")])
p.fit_transform()

## Fetching the output
You can fetch the output of a pipe using the following:

In [None]:
p.get_pipe_output('clf')

## Confusion matrix
You can print the confusion matrix of a score pipe using the following:

In [None]:
p.get_pipe('score').plot_confusion_matrix()
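
As a reminder of what a confusion matrix contains, the sketch below counts (true label, predicted label) pairs in plain Python. The helper `confusion_matrix` and the sample labels are hypothetical; the row and column convention (rows are true labels, columns are predicted labels) is an assumption and may differ from the plot produced by the score pipe.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels (assumed convention)."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Made-up labels for illustration.
labels = ["setosa", "versicolor", "virginica"]
y_true = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "virginica", "virginica", "versicolor"]
matrix = confusion_matrix(y_true, y_pred, labels)
```

The diagonal of `matrix` holds the correctly classified cases; everything off the diagonal is a misclassification.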

## Precision-recall curve
The same holds for the precision-recall curve:

In [None]:
p.get_pipe('score').precision_recall_curve()
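
The points on a precision-recall curve come from sweeping a decision threshold over the predicted scores. The following plain-Python sketch shows this for a binary problem; the function name and the data are hypothetical, and how the score pipe handles the three iris classes is not covered here.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for every distinct score used as threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t and p)
        fp = sum(1 for t, p in zip(y_true, predicted) if not t and p)
        fn = sum(1 for t, p in zip(y_true, predicted) if t and not p)
        points.append((threshold, tp / (tp + fp), tp / (tp + fn)))
    return points


# Made-up binary labels (1 = positive) and predicted scores.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
points = precision_recall_points(y_true, scores)
```

Lowering the threshold can only keep or raise recall, while precision may move in either direction.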

## AUC plot
The AUC plot is available in the same way:

In [None]:
p.get_pipe('score').plot_auc()
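
The area under the ROC curve (AUC) has a simple interpretation: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. A minimal sketch for a binary problem, in plain Python with a hypothetical helper `roc_auc` and made-up data, could look like this. It illustrates the metric itself, not the implementation behind `plot_auc()`.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t]
    negatives = [s for t, s in zip(y_true, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up binary labels (1 = positive) and predicted scores.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = roc_auc(y_true, scores)
```

With these numbers, 4 of the 6 positive-negative pairs are ranked correctly, so `auc` is about 0.67. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.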
