Add the documentation for Sklearn integration in EVADB. #1425

Open
wants to merge 4 commits into
base: staging
46 changes: 40 additions & 6 deletions docs/source/reference/ai/model-train-sklearn.rst
@@ -6,21 +6,55 @@ Model Training with Sklearn
1. Installation
---------------

To use the `Sklearn framework <https://scikit-learn.org/stable/>`_, install the extra sklearn dependency in your EvaDB virtual environment.
To use the `FLAML Sklearn AutoML framework <https://microsoft.github.io/FLAML/docs/Examples/Integrate%20-%20Scikit-learn%20Pipeline/>`_, install the extra FLAML dependency in your EvaDB virtual environment.

.. code-block:: bash

   pip install evadb[sklearn]

   pip install "flaml[automl]"

2. Example Query
----------------

.. code-block:: sql

   CREATE OR REPLACE FUNCTION PredictHouseRent FROM
   CREATE FUNCTION IF NOT EXISTS PredictRent FROM
   ( SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price FROM HomeRentals )
   TYPE Sklearn
   PREDICT 'rental_price';

The above query creates a new custom function by training a model on the ``HomeRentals`` table using the ``Sklearn`` framework.
The ``rental_price`` column is the target column for prediction, while the remaining columns from the ``SELECT`` query are the inputs.
The above query creates a new custom function by training a model on the ``HomeRentals`` table using the ``Flaml Sklearn`` framework.
The ``rental_price`` column is the target column for prediction, while the remaining columns from the ``SELECT`` query are the inputs.
This runs the ``Random Forest`` model by default.

3. Model Training Parameters
----------------------------

.. list-table:: Available Parameters
   :widths: 25 75

   * - PREDICT (**required**)
     - The name of the column to predict.
   * - MODEL
     - The Sklearn models currently supported are ``Random Forest``, ``Extra Trees Regressor`` and ``KNN``.
       Use ``rf`` for Random Forest, ``extra_tree`` for Extra Trees Regressor, and ``kneighbor`` for KNN.
   * - TIME_LIMIT
     - Time limit for training the model, in seconds. Default: 120.
   * - TASK
     - Whether to perform a ``regression`` task or a ``classification`` task.
Review comment from a collaborator:

Is there any correlation between TASK and MODEL here? For every model (i.e., random forest, extra trees, KNN), can we choose either regression or classification?

   * - METRIC
     - The metric used to train the model. For example, ``regression`` tasks can use the ``r2`` or ``RMSE`` metrics,
       and ``classification`` tasks can use the ``accuracy`` or ``f1_score`` metrics.
       More information about the model metrics can be found `here <https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric>`_.
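
As a rough illustration of what these parameters mean, the sketch below maps them onto the keyword arguments of FLAML's ``AutoML.fit`` (``task``, ``metric``, ``estimator_list``, ``time_budget``). It assumes the integration simply forwards the values to FLAML, which the text above does not confirm. The helper ``to_flaml_settings`` is hypothetical and not part of EvaDB. The ``rf`` default for ``MODEL`` and the 120-second default for ``TIME_LIMIT`` come from the text above; no default is assumed for ``TASK``.

```python
from typing import Any, Dict, Optional

# Values taken from the parameter table above.
SUPPORTED_MODELS = {"rf", "extra_tree", "kneighbor"}
SUPPORTED_TASKS = {"regression", "classification"}


def to_flaml_settings(
    task: str,
    model: str = "rf",
    metric: Optional[str] = None,
    time_limit: int = 120,
) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for flaml.AutoML.fit.

    This is an assumption about how the parameters could be forwarded,
    not EvaDB's actual implementation.
    """
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported MODEL: {model!r}")
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"unsupported TASK: {task!r}")
    if time_limit <= 0:
        raise ValueError("TIME_LIMIT must be a positive number of seconds")

    settings: Dict[str, Any] = {
        "task": task,
        "estimator_list": [model],  # restrict the search to one learner
        "time_budget": time_limit,  # seconds
    }
    # Only pass a metric when one was given, so FLAML can pick its own default.
    if metric is not None:
        settings["metric"] = metric
    return settings


settings = to_flaml_settings(
    task="regression", model="extra_tree", metric="r2", time_limit=180
)
print(settings)
```

With FLAML installed, such a dictionary could be passed as ``AutoML().fit(X_train=..., y_train=..., **settings)``.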

Below is an example query that specifies the above parameters:

.. code-block:: sql

   CREATE OR REPLACE FUNCTION PredictHouseRentSklearn FROM
   ( SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price FROM HomeRentals )
   TYPE Sklearn
   PREDICT 'rental_price'
   MODEL 'extra_tree'
   METRIC 'r2'
   TASK 'regression'
   TIME_LIMIT 180;
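
When such queries are generated from application code, the clause order shown above can be assembled programmatically. The following is a minimal sketch: ``build_train_query`` is a hypothetical helper, not an EvaDB API, and it only concatenates strings in the order used in this document (``PREDICT``, ``MODEL``, ``METRIC``, ``TASK``, ``TIME_LIMIT``). It performs no escaping, so it should only be given trusted values.

```python
from typing import Optional


def build_train_query(
    function_name: str,
    select_query: str,
    predict: str,
    model: Optional[str] = None,
    metric: Optional[str] = None,
    task: Optional[str] = None,
    time_limit: Optional[int] = None,
) -> str:
    """Hypothetical helper: assemble a CREATE FUNCTION ... TYPE Sklearn query."""
    lines = [
        f"CREATE OR REPLACE FUNCTION {function_name} FROM",
        f"( {select_query} )",
        "TYPE Sklearn",
        f"PREDICT '{predict}'",
    ]
    # Optional clauses are emitted only when a value is supplied.
    if model is not None:
        lines.append(f"MODEL '{model}'")
    if metric is not None:
        lines.append(f"METRIC '{metric}'")
    if task is not None:
        lines.append(f"TASK '{task}'")
    if time_limit is not None:
        lines.append(f"TIME_LIMIT {time_limit}")
    return "\n".join(lines) + ";"


query = build_train_query(
    "PredictHouseRentSklearn",
    "SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price "
    "FROM HomeRentals",
    predict="rental_price",
    model="extra_tree",
    metric="r2",
    task="regression",
    time_limit=180,
)
print(query)
```

The resulting string reproduces the example query above and could then be submitted through whichever EvaDB client interface is in use.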