Binary file modified docs/images/increasingwidthdisc.png
Binary file added docs/images/increasingwidthintervalsize.png
93 changes: 27 additions & 66 deletions docs/user_guide/discretisation/ArbitraryDiscretiser.rst
@@ -5,22 +5,23 @@
ArbitraryDiscretiser
====================

The :class:`ArbitraryDiscretiser()` sorts the variable values into contiguous intervals
which limits are arbitrarily defined by the user. Thus, you must provide a dictionary
with the variable names as keys and a list with the limits of the intervals as values,
when setting up the discretiser.
:class:`ArbitraryDiscretiser()` sorts the variable values into contiguous intervals
whose limits are arbitrarily defined by the user.

The :class:`ArbitraryDiscretiser()` works only with numerical variables. The discretiser
will check that the variables entered by the user are present in the train set and cast
as numerical.
.. note::
You must provide a dictionary
with the variable names as keys and a list with the limits of the intervals as values,
when setting up the discretiser.
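
To make the shape of that dictionary concrete, here is a minimal pandas-only sketch of sorting values into user-defined intervals. It is an illustration of the idea, not the library's implementation: the toy values are made up, and the use of `pd.cut` with right-closed intervals and `include_lowest=True` is an assumption about how the binning behaves.

```python
import numpy as np
import pandas as pd

# Toy values standing in for a continuous variable (illustrative only).
df = pd.DataFrame({"MedInc": [0.5, 1.9, 2.0, 3.7, 5.2, 8.1]})

# Variable names as keys, lists of interval limits as values.
user_dict = {"MedInc": [0, 2, 4, 6, np.inf]}

# Assumption: intervals are right-closed and the lowest limit is included.
# labels=False returns the integer position of the interval for each value.
binned = pd.DataFrame(
    {
        var: pd.cut(df[var], bins=limits, labels=False, include_lowest=True)
        for var, limits in user_dict.items()
    }
)

print(binned["MedInc"].tolist())
```

Under these assumptions, a value that sits exactly on a limit (2.0 here) falls into the interval that ends at that limit.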

Example
-------

Let's take a look at how this transformer works. First, let's load a dataset and plot a
histogram of a continuous variable. We use the california housing dataset that comes
Python implementation
---------------------

Let's take a look at how this transformer works. We'll use the California housing dataset that comes
with Scikit-learn.

Let's load the dataset:

.. code:: python

import numpy as np
@@ -29,7 +30,11 @@ with Scikit-learn.
from sklearn.datasets import fetch_california_housing
from feature_engine.discretisation import ArbitraryDiscretiser

X, y = fetch_california_housing( return_X_y=True, as_frame=True)
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

Let's plot a histogram of a continuous variable.

.. code:: python

X['MedInc'].hist(bins=20)
plt.xlabel('MedInc')
@@ -75,7 +80,7 @@ setting `return_boundaries` to `True`.

.. code:: python

X, y = fetch_california_housing( return_X_y=True, as_frame=True)
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

user_dict = {'MedInc': [0, 2, 4, 6, np.inf]}

@@ -99,65 +104,21 @@ If we return the interval values as integers, the discretiser has the option to
the transformed variable as integer or as object. Why would we want the transformed
variables as object?

Categorical encoders in Feature-engine are designed to work with variables of type
Categorical encoders in feature-engine are designed to work with variables of type
object by default. Thus, if you wish to encode the returned bins further, say to try and
obtain monotonic relationships between the variable and the target, you can do so
seamlessly by setting `return_object` to True. You can find an example of how to use
this functionality `here <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/ArbitraryDiscretiser_plus_MeanEncoder.ipynb>`_.
seamlessly by setting `return_object` to True. You can find an example of discretisation followed
by encoding to obtain monotonic relationships `here <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/ArbitraryDiscretiser_plus_MeanEncoder.ipynb>`_.
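
The difference between integer and object output, and why object output is convenient for a follow-up encoding step, can be sketched with pandas alone. This is a hedged illustration with made-up data: the `target` column, the `astype("object")` cast, and the hand-rolled mean encoding via `groupby` are stand-ins chosen for the example, not the feature-engine encoders themselves.

```python
import numpy as np
import pandas as pd

# Made-up data: one continuous variable and a hypothetical target.
df = pd.DataFrame(
    {
        "MedInc": [0.5, 1.9, 3.7, 5.2, 8.1, 2.5],
        "target": [1.0, 1.4, 2.0, 3.0, 4.5, 1.8],
    }
)
limits = [0, 2, 4, 6, np.inf]

# Interval positions returned as integers.
bins_int = pd.cut(df["MedInc"], bins=limits, labels=False, include_lowest=True)

# The same bins cast to object, so they are treated as categories.
bins_obj = bins_int.astype("object")

# Hand-rolled mean encoding: replace each bin with the mean target in that bin.
bin_means = df["target"].groupby(bins_obj).mean()
encoded = bins_obj.map(bin_means)

print(bins_int.dtype, bins_obj.dtype)
print(bin_means)
```

In this toy data the mean target increases from one bin to the next, which is the kind of monotonic relationship the encoding step aims for.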

Additional resources
--------------------

Check also:

- `Jupyter notebook <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/ArbitraryDiscretiser.ipynb>`_
- `Jupyter notebook - Discretiser plus Mean Encoding <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/ArbitraryDiscretiser_plus_MeanEncoder.ipynb>`_

For more details about this and other feature engineering methods check out these resources:

- `Feature Engineering for Machine Learning <https://www.trainindata.com/p/feature-engineering-for-machine-learning>`_, online course.
- `Feature Engineering for Time Series Forecasting <https://www.trainindata.com/p/feature-engineering-for-forecasting>`_, online course.
- `Python Feature Engineering Cookbook <https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587>`_, book.

.. figure:: ../../images/feml.png
:width: 300
:figclass: align-center
:align: left
:target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure:: ../../images/cookbook.png
:width: 200
:figclass: align-center
:align: left
:target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587

Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
Both our book and courses are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting `Sole <https://linkedin.com/in/soledad-galli>`_,
the main developer of feature-engine.