
Update the Dataset Object Documentation (#500)
Update to user guide
Small updates to readme and index
benisraeldan authored and ItayGabbay committed Jan 7, 2022
1 parent 2051d65 commit dfa08fa
Showing 3 changed files with 160 additions and 109 deletions.
11 changes: 11 additions & 0 deletions README.rst
@@ -307,6 +307,16 @@ of checks and optional conditions.
What Do You Need in Order to Start Validating?
----------------------------------------------

Environment
~~~~~~~~~~~~

- The deepchecks package installed

- JupyterLab or Jupyter Notebook

Data / Model
~~~~~~~~~~~~

Depending on your phase and what you wish to validate, you'll need a
subset of the following:

@@ -328,6 +338,7 @@ phase requires different assets for the validation.
See more about typical usage scenarios and the built-in suites in the
`docs <https://docs.deepchecks.com/?utm_source=github.com&utm_medium=referral&utm_campaign=readme&utme_content=what_do_you_need_in_order_to_start_validating>`__.


Documentation
--------------

7 changes: 4 additions & 3 deletions docs/source/index.rst
@@ -22,14 +22,15 @@ Head over to our :doc:`/examples/guides/quickstart_in_5_minutes` tutorial,
and click on |binder badge| or on |colab badge| to launch it and see it in action,
or see our :doc:`/getting-started/index` to install it locally and continue from there.

.. note:: The package's output is suited for running in a Jupyter environment.
   HTML and PDF reports for the graphs may be added in the near future.

When Should You Use Deepchecks?
================================


While you're in the research phase, and want to validate your data, find potential methodological
problems, and/or validate your model and evaluate it.
See the :doc:`When Should You Use Section in the User Guide </user-guide/when_should_you_use>` for an elaborate explanation of the typical scenarios.


Example - Validating a Model that Classifies Malicious URLs
@@ -97,4 +98,4 @@ as this package is in active development!
:target: /examples/guides/quickstart_in_5_minutes.html

.. |colab badge| image:: /_static/colab-badge.svg
:target: /examples/guides/quickstart_in_5_minutes.html
251 changes: 145 additions & 106 deletions docs/source/user-guide/dataset_object.rst
@@ -3,126 +3,165 @@
====================
The Dataset Object
====================
The dataset is one of the basic building blocks of deepchecks. It is a container for the data and its relevant metadata, like special column names (label, date, index, etc.).
Some of the checks allow using a dataframe directly, but others require the metadata in order to run, so they are limited to working only with Datasets.
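
For instance, a check that needs no metadata can presumably run on a plain dataframe, while metadata-dependent checks need a ``Dataset``. The check name below is illustrative only; exact check names and import paths may differ between versions:

.. code-block:: python

    # Illustrative sketch: a metadata-free integrity check can accept a plain
    # dataframe, while label/index-dependent checks would require a Dataset.
    from deepchecks.checks import DataDuplicates

    DataDuplicates().run(df)   # plain dataframe
    DataDuplicates().run(ds)   # a Dataset works as well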

Class Parameters
===================
All of the parameters are optional.

.. list-table::
:widths: 20 20 50 10
:header-rows: 1

* - Name
- Type
- Description
- Default
* - label
- pandas.Series
- Data of labels as separate series from the data
- None
* - features
- List[Hashable]
- Names of the features in the data
- None
* - cat_features
- List[Hashable]
- Names of the categorical features in the data. Must be subset of `features`
- None
* - label_name
- Hashable
- Name of label column in the data
- None
* - use_index
- bool
- If the data is a dataframe, whether to use the dataframe's index as the index column for index-related checks
- False
* - index_name
- Hashable
- Name of index column in the data
- None
* - date_name
- Hashable
- Name of date column in the data
- None
* - date_unit_type
- str
- Unit used to convert the date column if it is numeric; the conversion uses `pandas.Timestamp <https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html>`__
- None
* - max_categorical_ratio
- float
- Used to infer which columns are categorical (if `cat_features` isn't explicitly passed).
Set maximum ratio of unique values in a column in order for it to be categorical.
The higher the value, the more likely a column is to be inferred as categorical
- 0.01
* - max_categories
- int
- Used to infer which columns are categorical (if `cat_features` isn't explicitly passed).
Set maximum number of unique values in a column in order for it to be categorical.
The higher the value, the more likely a column is to be inferred as categorical
- 30
* - max_float_categories
- int
- Same as `max_categories` but for columns of type float
- 5
* - convert_date
- bool
- Whether to convert the date column to a date type if it is numeric
- True
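
For example, a constructor call that sets several of these parameters could look like the following sketch (``df``, the column names, and the chosen values here are hypothetical, for illustration only):

.. code-block:: python

    # Hypothetical dataframe and column names, for illustration only.
    ds = Dataset(df,
                 label_name='target',
                 index_name='sample_id',
                 date_name='timestamp',
                 cat_features=['color'],
                 max_categories=50)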

Inferring Features And Categorical Features
================================================
Dataset defines which columns of the data are features, and which of them are categorical features.
For features: if the `features` parameter is not passed explicitly, all columns will be considered features apart from the label, index and date.
For categorical features: if the `cat_features` parameter is not passed explicitly, the following logic runs on every column to determine
whether the column is categorical or not:

* if the column is of float type:

  * number of unique < `max_float_categories`

* else:

  * number of unique < `max_categories` AND (number of unique / number of samples) < `max_categorical_ratio`
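
As a rough illustration, this heuristic could be sketched in plain pandas as follows (an illustrative re-implementation, not the library's internal code):

.. code-block:: python

    import pandas as pd

    def infer_is_categorical(column: pd.Series,
                             max_categories: int = 30,
                             max_float_categories: int = 5,
                             max_categorical_ratio: float = 0.01) -> bool:
        # Sketch of the inference heuristic described above,
        # using the documented default parameter values.
        n_unique = column.nunique(dropna=True)
        n_samples = len(column.dropna())
        if n_samples == 0:
            return False
        if column.dtype == float:
            return n_unique < max_float_categories
        return (n_unique < max_categories and
                (n_unique / n_samples) < max_categorical_ratio)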
The ``Dataset`` is a container for the data and the relevant ML metadata, such as special column roles (e.g. label, index, categorical columns).
It enables the relevant context to be taken into account during validation and saved in a convenient manner, and it is a basic building block in deepchecks.


Class Properties
==================

The common properties are:

- **label** - The target values that the model is trying to predict.
- **cat_features** - List of features that should be treated as categorical. If not specified explicitly, they will be :ref:`inferred automatically <dataset_object__inferring_categorical_features>`.
- **index_name** - If the dataset has a meaningful unique index, defining it as such will enable more validations to run.
- **date_name** - The name of a date column representing the sample.
- **features** - Specifies the columns used by the model for training.
Used for defining only a subset of the columns in the data as features. If not supplied then
all of the columns that aren't explicitly specified as ``label``, ``date``, or ``index`` are considered to be features.

The Dataset's metadata properties are all optional. Check out the API Reference for more details.

Dataset API Reference
--------------------------

.. currentmodule:: deepchecks.base.dataset

.. autosummary::

Dataset


Creating a Dataset
==================

From a Pandas DataFrame
-----------------------

The default ``Dataset`` constructor expects to get a ``pd.DataFrame``.
The rest of the properties are optional, but if your data has ``date``/``index``/``label``
columns you would want to define them to enable more validation checks.

>>> import pandas as pd
>>> from deepchecks import Dataset
>>> d = {"id": [1, 2, 3, 4],
...      "feature1": [0.1, 0.3, 0.2, 0.6],
...      "feature2": [4, 5, 6, 7],
...      "categorical_feature": [0, 0, 0, 1],
...      "class": [1, 2, 1, 2]}
>>> df = pd.DataFrame(d)
>>> ds = Dataset(df, label="class", index_name="id", cat_features=["categorical_feature"])


From Numpy Arrays
---------------------

A Dataset can be created using a 2D numpy array for features and a 1D numpy array for the labels.
The features array is mandatory, and the labels array is optional.

>>> import numpy as np
>>> features = np.array([[0.25, 0.3, 0.3], [0.14, 0.75, 0.3], [0.23, 0.39, 0.1]])
>>> labels = np.array([0.1, 0.1, 0.7])
>>> ds_with_labels = Dataset.from_numpy(features, labels)
>>> ds_without_labels = Dataset.from_numpy(features)

Also, it's possible to assign names to the features and label:

>>> Dataset.from_numpy(
... features, labels,
... columns=['feat1', 'feat2', 'feat3'],
... label_name='target'
... )

All the rest of the Dataset's properties can also be passed as regular keyword arguments:

>>> Dataset.from_numpy(
... features, labels,
... columns=['feat1', 'feat2', 'feat3'],
... label_name='target',
... max_float_categories=10
... )

Useful Functions
===================

Train Test Split
--------------------

Internally uses `sklearn.model_selection.train_test_split <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html>`_
(so the same arguments can be passed), copies the metadata to each part of the split, and returns two ``Dataset`` instances.

>>> train_ds, test_ds = ds.train_test_split(stratify=True)
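
Since the sklearn arguments are forwarded, parameters such as ``test_size`` and ``random_state`` can presumably be passed through as well (the values here are illustrative):

>>> train_ds, test_ds = ds.train_test_split(test_size=0.25, random_state=42)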


Copy
------

Copy enables copying the metadata from an existing ``Dataset`` instance, for creating a new ``Dataset`` from a new ``pd.DataFrame``'s data.
This can be useful for implementing data splits independently, or for comparing datasets when receiving new data (of the same known format as the existing data).

>>> new_ds = ds.copy(new_df)
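
Since the metadata is copied over, the following check should presumably hold (assuming ``new_df`` shares the format of the original ``df``):

>>> new_ds.label_name == ds.label_name
True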


Working with Class Parameters
---------------------------------

We can work directly with the ``Dataset`` object to inspect its defined features and label:

>>> ds.features
['feature1', 'feature2', 'categorical_feature']
>>> ds.label_name
'class'

Get its internal ``pd.DataFrame``:

>>> ds.data
feature1 feature2 categorical_feature class
0 0.1 4 0 1
1 0.3 5 0 2
2 0.2 6 0 1
3 0.6 7 1 2


Or extract directly only the feature columns or only the label column from within it:

>>> ds.features_columns
feature1 feature2 categorical_feature
0 0.1 4 0
1 0.3 5 0
2 0.2 6 0
3 0.6 7 1

>>> ds.label_col
class
0 1
1 2
2 1
3 2


.. _dataset_object__inferring_categorical_features:

Inferring Categorical Features
==================================

.. warning::
   It is highly recommended to explicitly state the categorical features, or to define their column type as ``category``.
   Otherwise, the automatic inference has inherent limitations, and may cause inconsistencies (misdetection, different detection between
   train and test, etc.) that require tuning and adaptations.

If the parameter ``cat_features`` was not passed explicitly, the following inference logic
will run on the columns to determine which are classified as categorical:

#. If the ``pd.dtypes`` of any of the existing columns is ``category``, then all of the columns that are of type ``category``
   will be considered categorical (and only them).

#. Otherwise, a heuristic is used for deducing the type. Each column for which at least one of the following conditions is met is considered categorical:

   - If (`number of unique values in column` <= `max_float_categories`)
     **AND** (`column type` is `float`)

   - If (`number of unique values in column` <= `max_categories`)
     **AND** ((the ratio between the `number of unique values` and the `number of samples`) < `max_categorical_ratio`)

Check the API Reference for :doc:`infer_categorical_features </api/utils/generated/deepchecks.utils.features.infer_categorical_features>`
for more details.
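
For example, explicitly setting a column's dtype to ``category`` should make the inference unambiguous (``'color'`` here is a hypothetical column, for illustration):

>>> df['color'] = pd.Series(['red', 'blue', 'red', 'blue']).astype('category')
>>> ds = Dataset(df, label='class')
>>> ds.cat_features
['color']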
