Improve Documentation For Model Training #1201

Merged
merged 7 commits into from
Sep 26, 2023
8 changes: 6 additions & 2 deletions docs/_toc.yml
@@ -36,6 +36,8 @@ parts:
title: Emotion Analysis
- file: source/usecases/homesale-forecast.rst
title: Home Sale Forecasting
- file: source/usecases/homerental-predict.rst
title: Home Rental Prediction
# - file: source/usecases/privategpt.rst
# title: PrivateGPT

@@ -69,8 +71,10 @@ parts:
- file: source/reference/ai/index
title: AI Engines
sections:
- file: source/reference/ai/model-train
title: Model Training
- file: source/reference/ai/model-train-ludwig
title: Model Training with Ludwig
- file: source/reference/ai/model-train-sklearn
title: Model Training with Sklearn
- file: source/reference/ai/model-forecasting
title: Time Series Forecasting
- file: source/reference/ai/hf
2 changes: 1 addition & 1 deletion docs/source/overview/model-inference.rst
@@ -43,7 +43,7 @@ In EvaDB, we can also use models in joins.
The most powerful use case is a lateral join combined with ``UNNEST``, which is very helpful for flattening the output of `one-to-many` models.
The key idea is that a model can produce multiple outputs (e.g., bounding boxes) stored in an array. This syntax unrolls the elements of the array into multiple rows.
Typical examples are `face detectors <https://github.com/georgia-tech-db/evadb/blob/staging/evadb/functions/face_detector.py>`_ and `object detectors <https://github.com/georgia-tech-db/evadb/blob/staging/evadb/functions/fastrcnn_object_detector.py>`_.
In the below example, we use `emotion detector <https://github.com/georgia-tech-db/evadb/blob/staging/evadb/functions/emotion_detector.py>_` to detect emotions from faces in the movie, where a single scene can contain multiple faces.
In the below example, we use `emotion detector <https://github.com/georgia-tech-db/evadb/blob/staging/evadb/functions/emotion_detector.py>`_ to detect emotions from faces in the movie, where a single scene can contain multiple faces.

.. code-block:: sql

2 changes: 1 addition & 1 deletion docs/source/reference/ai/model-forecasting.rst
@@ -47,7 +47,7 @@ EvaDB's default forecast framework is `statsforecast <https://nixtla.github.io/s
.. list-table:: Available Parameters
:widths: 25 75

* - PREDICT (required)
* - PREDICT (**required**)
- The name of the column we wish to forecast.
* - TIME
- The name of the column that contains the datestamp, which should be in a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp. Please visit the `pandas documentation <https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html>`_ for details. If not provided, an auto-incrementing ID column will be used.
65 changes: 65 additions & 0 deletions docs/source/reference/ai/model-train-ludwig.rst
@@ -0,0 +1,65 @@
.. _ludwig:

Model Training with Ludwig
==========================

1. Installation
---------------

To use the `Ludwig framework <https://ludwig.ai/latest/>`_, install the extra Ludwig dependency in your EvaDB virtual environment.

.. code-block:: bash

pip install evadb[ludwig]

2. Example Query
----------------

.. code-block:: sql

CREATE OR REPLACE FUNCTION PredictHouseRent FROM
( SELECT sqft, location, rental_price FROM HomeRentals )
TYPE Ludwig
PREDICT 'rental_price'
TIME_LIMIT 120;

In the above query, you create a new custom function by automatically training a model on the ``HomeRentals`` table.
The ``rental_price`` column is the prediction target, while ``sqft`` and ``location`` are the inputs.

You can also pass all the other columns in ``HomeRentals`` as inputs and let the underlying AutoML framework figure out how to use them. Below is an example query:

.. code-block:: sql

CREATE FUNCTION IF NOT EXISTS PredictHouseRent FROM
( SELECT * FROM HomeRentals )
TYPE Ludwig
PREDICT 'rental_price'
TIME_LIMIT 120;

.. note::

Check out the :ref:`homerental-predict` use case for a working example.

3. Model Training Parameters
----------------------------

.. list-table:: Available Parameters
:widths: 25 75

* - PREDICT (**required**)
- The name of the column we wish to predict.
* - TIME_LIMIT
- Time limit to train the model in seconds. Default: 120.
* - TUNE_FOR_MEMORY
- Whether to refine the hyperopt search space for available host / GPU memory. Default: False.
Member
Cannot understand this parameter.

Member
How about we add a note explaining this? Any issues of using it if there is no GPU?

Collaborator Author
I am also not sure. The Ludwig documentation also doesn't have a good explanation. Shall we hide this parameter from the documentation?


Below is an example query specifying the above parameters:

.. code-block:: sql

CREATE FUNCTION IF NOT EXISTS PredictHouseRent FROM
( SELECT * FROM HomeRentals )
TYPE Ludwig
PREDICT 'rental_price'
TIME_LIMIT 3600
TUNE_FOR_MEMORY True;
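
The parameters above are plain ``KEY value`` pairs appended to the query, so a training query can also be assembled programmatically. The following is a minimal Python sketch, not part of EvaDB: ``build_train_query`` is a hypothetical helper that only formats a string, and it assumes the parameter names listed in the table above.

```python
def build_train_query(name, source_query, engine, predict, **params):
    """Compose an EvaDB CREATE FUNCTION training query as a string.

    Hypothetical helper for illustration; it does not talk to EvaDB.
    """
    lines = [
        f"CREATE OR REPLACE FUNCTION {name} FROM",
        f"  ( {source_query} )",
        f"TYPE {engine}",
        f"PREDICT '{predict}'",
    ]
    # Optional parameters are emitted as KEY value pairs, one per line,
    # in the order they were passed.
    for key, value in params.items():
        lines.append(f"{key.upper()} {value}")
    return "\n".join(lines) + ";"


query = build_train_query(
    "PredictHouseRent",
    "SELECT * FROM HomeRentals",
    "Ludwig",
    "rental_price",
    time_limit=3600,
    tune_for_memory=True,
)
print(query)
```

The result corresponds to the example query above, with ``CREATE OR REPLACE`` in place of ``IF NOT EXISTS``. Because the helper interpolates values directly into the query text, it should only be used with trusted input.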
26 changes: 26 additions & 0 deletions docs/source/reference/ai/model-train-sklearn.rst
@@ -0,0 +1,26 @@
.. _sklearn:

Model Training with Sklearn
============================

1. Installation
---------------

To use the `Sklearn framework <https://scikit-learn.org/stable/>`_, install the extra sklearn dependency in your EvaDB virtual environment.

.. code-block:: bash

pip install evadb[sklearn]

2. Example Query
----------------

.. code-block:: sql

CREATE OR REPLACE FUNCTION PredictHouseRent FROM
( SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price FROM HomeRentals )
TYPE Sklearn
PREDICT 'rental_price';

In the above query, you create a new custom function by training a model on the ``HomeRentals`` table using the ``Sklearn`` framework.
The ``rental_price`` column is the prediction target, while the remaining columns from the ``SELECT`` query are the inputs.
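
To run this query from Python rather than from a SQL client, something like the following sketch could be used. It assumes the EvaDB Python API exposes ``evadb.connect().cursor()`` and ``cursor.query(...).df()``, and that a ``HomeRentals`` table already exists; ``train_rent_model`` and ``main`` are hypothetical names introduced here for illustration.

```python
SKLEARN_TRAIN_QUERY = """
CREATE OR REPLACE FUNCTION PredictHouseRent FROM
  ( SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price
    FROM HomeRentals )
TYPE Sklearn
PREDICT 'rental_price';
"""


def train_rent_model(cursor):
    """Run the training query on the given cursor and return the result."""
    # Assumption: cursor.query(...) returns an object with a .df() method.
    return cursor.query(SKLEARN_TRAIN_QUERY).df()


def main():
    # Imported lazily so the module loads even where EvaDB is not installed.
    # Assumption: installed via `pip install evadb[sklearn]`.
    import evadb

    cursor = evadb.connect().cursor()
    print(train_rent_model(cursor))
```

Calling ``main()`` in an environment with EvaDB installed would submit the training query. Keeping the cursor as an argument of ``train_rent_model`` makes it easy to substitute a stub when testing.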
46 changes: 0 additions & 46 deletions docs/source/reference/ai/model-train.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/reference/evaql/create.rst
@@ -117,7 +117,7 @@ Where the `parameter` is ``key value`` pair.

.. note::

Go over :ref:`hf`, :ref:`predict`, and :ref:`forecast` to check examples for creating function via type.
Go over :ref:`hf`, :ref:`ludwig`, and :ref:`forecast` to check examples for creating function via type.

CREATE MATERIALIZED VIEW
------------------------
124 changes: 124 additions & 0 deletions docs/source/usecases/homerental-predict.rst
@@ -0,0 +1,124 @@
.. _homerental-predict:

Home Rental Prediction
=======================

.. raw:: html

<embed>
<table align="left">
<td>
<a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
</td>
<td>
<a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
</td>
<td>
<a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
</td>
</table><br><br>
</embed>

Introduction
------------

In this tutorial, we show how to use :ref:`Prediction AI Engines<ludwig>` in EvaDB to predict home rental prices. EvaDB makes it easy to run predictions over your existing databases using its built-in AutoML engines.

.. include:: ../shared/evadb.rst

.. include:: ../shared/postgresql.rst

We will assume that the input data is loaded into a ``PostgreSQL`` database.
To load the home rental data into your database, see the complete `home rental prediction notebook on Colab <https://colab.research.google.com/github/georgia-tech-db/eva/blob/staging/tutorials/17-home-rental-prediction.ipynb>`_.

Preview the Home Rental Data
-------------------------------------------

We use the `home rental data <https://www.dropbox.com/scl/fi/gy2682i66a8l2tqsowm5x/home_rentals.csv?rlkey=e080k02rv5205h4ullfjdr8lw&raw=1>`_ in this use case. The data contains eight columns: ``number_of_rooms``, ``number_of_bathrooms``, ``sqft``, ``location``, ``days_on_market``, ``initial_price``, ``neighborhood``, and ``rental_price``.

.. code-block:: sql

SELECT * FROM postgres_data.home_rentals LIMIT 3;

This query previews the data in the ``home_rentals`` table:

.. code-block::

+------------------------------+----------------------------------+-------------------+-----------------------+-----------------------------+----------------------------+---------------------------+---------------------------+
| home_rentals.number_of_rooms | home_rentals.number_of_bathrooms | home_rentals.sqft | home_rentals.location | home_rentals.days_on_market | home_rentals.initial_price | home_rentals.neighborhood | home_rentals.rental_price |
|------------------------------|----------------------------------|-------------------|-----------------------|-----------------------------|----------------------------|---------------------------|---------------------------|
|                            1 |                                1 |               674 | good                  |                           1 |                       2167 | downtown                  |                      2167 |
|                            1 |                                1 |               554 | poor                  |                          19 |                       1883 | westbrae                  |                      1883 |
|                            0 |                                1 |               529 | great                 |                           3 |                       2431 | south_side                |                      2431 |
+------------------------------+----------------------------------+-------------------+-----------------------+-----------------------------+----------------------------+---------------------------+---------------------------+

Train a Home Rental Prediction Model
-------------------------------------

Next, let's train a prediction model from the ``home_rentals`` table using EvaDB's ``CREATE FUNCTION`` query.
We will use the built-in :ref:`Ludwig<ludwig>` engine for this task.

.. code-block:: sql

CREATE OR REPLACE FUNCTION PredictHouseRent FROM
( SELECT * FROM postgres_data.home_rentals )
TYPE Ludwig
PREDICT 'rental_price'
TIME_LIMIT 3600;

In the above query, we use all the columns (except ``rental_price``) from the ``home_rentals`` table to predict the ``rental_price`` column.
We set the training time limit to 3600 seconds.

.. note::

See the :ref:`ludwig` page to explore all configurable parameters for the model training framework.

.. code-block::

+----------------------------------------------+
| Function PredictHouseRent successfully added |
+----------------------------------------------+

Predict the Home Rental Price using the Trained Model
-----------------------------------------------------

Next, we use the trained ``PredictHouseRent`` function to predict the home rental price.

.. code-block:: sql

SELECT PredictHouseRent(*) FROM postgres_data.home_rentals LIMIT 3;

We use ``*`` to simply pass all columns into the ``PredictHouseRent`` function.

.. code-block::

+-------------------------------------------+
| predicthouserent.rental_price_predictions |
+-------------------------------------------+
|                               2087.763672 |
|                               1793.570190 |
|                               2346.319824 |
+-------------------------------------------+

We have the option to utilize a ``LATERAL JOIN`` to compare the actual rental prices in the ``home_rentals`` dataset with the predicted rental prices generated by the trained model, ``PredictHouseRent``.

.. code-block:: sql

SELECT rental_price, predicted_rental_price
FROM postgres_data.home_rentals
JOIN LATERAL PredictHouseRent(*) AS Predicted(predicted_rental_price)
LIMIT 3;

Here is the query's output:

.. code-block::

+---------------------------+----------------------------------+
| home_rentals.rental_price | Predicted.predicted_rental_price |
+---------------------------+----------------------------------+
|                      2167 |                      2087.763672 |
|                      1883 |                      1793.570190 |
|                      2431 |                      2346.319824 |
+---------------------------+----------------------------------+
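
As a quick sanity check on the three rows above, the gap between actual and predicted prices can be summarized with the mean absolute error (MAE) and the mean absolute percentage error (MAPE). This is an illustrative sketch in plain Python that uses only the values shown in the output table; three rows are far too few to judge model quality.

```python
# Values copied from the query output above.
actual = [2167, 1883, 2431]
predicted = [2087.763672, 1793.570190, 2346.319824]

# Absolute error per row.
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# Mean absolute error, in the same unit as rental_price.
mae = sum(abs_errors) / len(abs_errors)

# Mean absolute percentage error, relative to the actual price.
mape = 100 * sum(e / a for e, a in zip(abs_errors, actual)) / len(actual)

print(f"MAE:  {mae:.2f}")
print(f"MAPE: {mape:.2f}%")
```

For these rows the MAE comes out to about 84.45, and the MAPE to roughly 4%.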

.. include:: ../shared/footer.rst
6 changes: 3 additions & 3 deletions docs/source/usecases/homesale-forecast.rst
@@ -22,7 +22,7 @@ Home Sale Forecasting
Introduction
------------

In this tutorial, we present how to use :ref:`forecasting models<forecast>` in EvaDB to predict home sale price. EvaDB makes it easy to do time series predictions using its built-in Auto Forecast function.
In this tutorial, we present how to use :ref:`Forecasting AI Engines<forecast>` in EvaDB to predict home sale price. EvaDB makes it easy to do time series predictions using its built-in Auto Forecast function.

.. include:: ../shared/evadb.rst

@@ -34,7 +34,7 @@ To load the home sales data into your database, see the complete `home sale fore
Preview the Home Sales Data
-------------------------------------------

We use the `raw_sales.csv of the House Property Sales Time Series <https://www.kaggle.com/datasets/htagholdings/property-sales?resource=download>`_ in this usecase. The data contains five columns: postcode, price, bedrooms, datesold, and propertytype.
We use the `raw_sales.csv of the House Property Sales Time Series <https://www.kaggle.com/datasets/htagholdings/property-sales?resource=download>`_ in this usecase. The data contains five columns: ``postcode``, ``price``, ``bedrooms``, ``datesold``, and ``propertytype``.

.. code-block:: sql

@@ -74,7 +74,7 @@ Particularly, we are interested in the price of the properties that have three b

In the ``home_sales`` dataset, we have two different property types, houses and units, and the price gap between them is large.
We'd like to ask EvaDB to analyze the price of houses and units independently.
To do so, we specify the ``propertytype`` column as the ``ID `` of the time series data, which represents an identifier for the series.
To do so, we specify the ``propertytype`` column as the ``ID`` of the time series data, which represents an identifier for the series.
Here is the query's output ``DataFrame``:

.. note::
2 changes: 1 addition & 1 deletion script/test/test.sh
@@ -88,7 +88,7 @@ long_integration_test() {
}

notebook_test() {
PYTHONPATH=./ python -m pytest --durations=5 --nbmake --overwrite "./tutorials" --capture=sys --tb=short -v --log-level=WARNING --nbmake-timeout=3000 --ignore="tutorials/08-chatgpt.ipynb" --ignore="tutorials/14-food-review-tone-analysis-and-response.ipynb" --ignore="tutorials/15-AI-powered-join.ipynb" --ignore="tutorials/16-homesale-forecasting.ipynb"
PYTHONPATH=./ python -m pytest --durations=5 --nbmake --overwrite "./tutorials" --capture=sys --tb=short -v --log-level=WARNING --nbmake-timeout=3000 --ignore="tutorials/08-chatgpt.ipynb" --ignore="tutorials/14-food-review-tone-analysis-and-response.ipynb" --ignore="tutorials/15-AI-powered-join.ipynb" --ignore="tutorials/16-homesale-forecasting.ipynb" --ignore="tutorials/17-home-rental-prediction.ipynb"
code=$?
print_error_code $code "NOTEBOOK TEST"
}