# Describe

This function analyzes the data and outputs the following artifacts per column of the data frame (depending on the data types):

* histogram matrix chart
* histogram per feature chart
* violin chart
* correlation-matrix chart
* correlation-matrix csv
* imbalance pie chart
* imbalance-weights-vec csv
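
To make the correlation and imbalance artifacts listed above more concrete, here is a minimal pandas sketch of the kind of numbers such artifacts typically summarize. It is an illustration only: the helper name `summarize` and the choice of Pearson correlation and per-class row share are assumptions, not the actual implementation of the describe function.

```python
import pandas as pd


def summarize(df, label_column):
    """Hypothetical helper: correlation of numeric features and class shares."""
    # Pearson correlation between the numeric feature columns (label excluded).
    features = df.drop(columns=[label_column]).select_dtypes(include="number")
    corr = features.corr()
    # Share of rows per class, one plausible way to express class imbalance.
    class_share = df[label_column].value_counts(normalize=True).sort_index()
    return corr, class_share


df = pd.DataFrame(
    {
        "feature_0": [1.0, 2.0, 3.0, 4.0],
        "feature_1": [2.0, 4.0, 6.0, 8.0],
        "label": [0, 0, 0, 1],
    }
)
corr, class_share = summarize(df, "label")
print(corr)
print(class_share)
```

Both results are ordinary pandas objects, so they can be written out with `to_csv` if a file is needed.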

<a id="handler1"></a>

## analyze

### Docs

#### Parameters:
* **`context`**: `mlrun.MLClientCtx` - The MLRun function execution context
* **`name`**: `str` - Key under which the dataset is stored in the database (defaults to "dataset").
* **`table`**: `DataItem = None` - MLRun input pointing to pandas dataframe (csv/parquet file path)
* **`label_column`**: `str = None` - Ground truth column label
* **`plots_dest`**: `str = "plots"` - Destination folder of summary plots (relative to artifact_path)
* **`frac`**: `float = 0.1` - When the table has more than 5000 samples, the function runs on a random fraction (`frac`) of the data.
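
The `frac` behavior described above can be sketched with pandas as follows. This is a hedged illustration, not the function's source: the helper `maybe_sample`, the constant `SAMPLE_THRESHOLD`, and the use of `DataFrame.sample` are assumptions based only on the parameter description.

```python
import pandas as pd

# Row count above which sampling applies (taken from the parameter description).
SAMPLE_THRESHOLD = 5000


def maybe_sample(df, frac=0.1, random_state=None):
    """Hypothetical helper: return a random fraction of large tables only."""
    if len(df) > SAMPLE_THRESHOLD:
        # Draw a random subset containing roughly frac * len(df) rows.
        return df.sample(frac=frac, random_state=random_state)
    # Small tables are analyzed in full.
    return df


small = pd.DataFrame({"x": range(100)})
large = pd.DataFrame({"x": range(10_000)})

print(len(maybe_sample(small)))
print(len(maybe_sample(large, frac=0.1, random_state=0)))
```

Passing a fixed `random_state` makes the sampled subset reproducible between runs.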


### DEMO
#### Set-up

In [12]:
import pandas as pd
import mlrun
import os
from sklearn.datasets import make_classification


In [3]:
# Set our project's name:
project_name = "new-describe-project"

# Create the project:
project = mlrun.get_or_create_project(name=project_name, context="./", user_project=True)

> 2022-03-07 15:57:33,482 [info] loaded project new-describe-project from MLRun DB


#### Loading dataset
We use `make_classification` to generate a random dataset.

In [4]:
n_features = 5
X, y = make_classification(
    n_samples=100,
    n_features=n_features,
    n_classes=3,
    random_state=18,
    class_sep=2,
    n_informative=3,
)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
df["label"] = y
# Create the output folder if it does not already exist:
os.makedirs("artifacts", exist_ok=True)
df.to_parquet("artifacts/random_dataset.parquet")

#### Run the function on a new dataset
Import the `describe` MLRun function and run it with the `analyze` handler.

After the run completes, you can view the created artifacts by clicking the run uid and opening the artifacts tab.

In [8]:
describe_func = mlrun.import_function("hub://describe")
describe_func.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f3645cb4d90>

In [9]:
describe_run = describe_func.run(
    name="task-describe",
    handler="analyze",
    inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
    params={"label_column": "label"},
    local=True,
)

> 2022-03-07 15:58:36,721 [info] starting run task-describe uid=2933ae26f64c4c1a91f43dd0f220de44 DB=http://mlrun-api:8080


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| new-describe-project-davids | ...f220de44 | 0 | Mar 07 15:58:36 | completed | task-describe | v3io_user=davids, kind=, owner=davids, host=jupyter-davids-5d6fdc4597-l79r4 | table | update_dataset=True, label_column=label | | histograms matrix, histogram_feature_0, histogram_feature_1, histogram_feature_2, histogram_feature_3, histogram_feature_4, violin, imbalance, imbalance-weights-vec, correlation-matrix csv, correlation-matrix, dataset |

> 2022-03-07 15:58:42,751 [info] run executed, status=completed
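
Besides browsing the UI, the artifact keys listed in the run summary can be filtered programmatically. The sketch below assumes the run's outputs are available as a plain dict keyed by artifact name (for example via `describe_run.outputs`, which is an assumption here). The helper `per_feature_histograms` and the placeholder paths are hypothetical.

```python
def per_feature_histograms(outputs):
    """Hypothetical helper: return sorted keys of the per-feature histograms."""
    # Per-feature histogram keys in the run summary start with "histogram_".
    return sorted(key for key in outputs if key.startswith("histogram_"))


# Hypothetical stand-in for the run outputs; the paths are placeholders.
example_outputs = {
    "histograms matrix": "<path>",
    "histogram_feature_0": "<path>",
    "histogram_feature_1": "<path>",
    "violin": "<path>",
    "dataset": "<path>",
}

print(per_feature_histograms(example_outputs))
```

The `histograms matrix` key is deliberately left out, since it names the combined chart rather than a single feature.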


#### Run the function on an already loaded dataset

Log a new dataset to the project:


In [32]:
context = mlrun.get_or_create_ctx(project_name)
df = pd.read_parquet(os.path.abspath("artifacts/random_dataset.parquet"))
context.log_dataset(key="dataset", db_key="dataset1", stats=True, df=df)

<mlrun.artifacts.dataset.DatasetArtifact at 0x7f362c84ec50>

Import the `describe` MLRun function and run it with the `analyze` handler.

After the run completes, you can view the created artifacts by clicking the run uid and opening the artifacts tab.

In [33]:
describe_func = mlrun.import_function("hub://describe")
describe_func.apply(mlrun.platforms.auto_mount())

<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f362ef32550>

In [34]:
describe_run = describe_func.run(
    name="task-describe",
    handler="analyze",
    inputs={"table": os.path.abspath("artifacts/random_dataset.parquet")},
    params={"name": "dataset1", "label_column": "label"},
    local=True,
)

> 2022-03-07 16:31:05,060 [info] starting run task-describe uid=512a02a6152f4b4bb92f3b7301cc439c DB=http://mlrun-api:8080


| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| new-describe-project-davids | ...01cc439c | 0 | Mar 07 16:31:05 | completed | task-describe | v3io_user=davids, kind=, owner=davids, host=jupyter-davids-5d6fdc4597-l79r4 | table | name=dataset1, update_dataset=True, label_column=label | | histograms matrix, histogram_feature_0, histogram_feature_1, histogram_feature_2, histogram_feature_3, histogram_feature_4, violin, imbalance, imbalance-weights-vec, correlation-matrix csv, correlation-matrix, dataset |

> 2022-03-07 16:31:07,131 [info] run executed, status=completed
