# Training Prediction Models Directly Within PostgreSQL Using XGBoost and EvaDB
In this tutorial, we use EvaDB's model training capabilities to predict home rental prices, showing how EvaDB brings AI to data that already lives in your PostgreSQL database.
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png"/> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png"/> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table><br><br>

## Setup

### Install and Launch the PostgreSQL Server

We start by setting up the PostgreSQL database backend. If you already have a PostgreSQL server up and running, you can skip this step and proceed directly to [installing EvaDB](#install_evadb).

In [36]:
!apt -qq install postgresql
!service postgresql start

postgresql is already the newest version (14+238).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.
 * Starting PostgreSQL 14 database server
   ...done.

### Create User and Database

In [37]:
!sudo -u postgres psql -c "CREATE USER eva WITH SUPERUSER PASSWORD 'password'"
!sudo -u postgres psql -c "CREATE DATABASE evadb"

ERROR:  role "eva" already exists
ERROR:  database "evadb" already exists


### Prettify Output

In [38]:
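import warnings
warnings.filterwarnings("ignore")  # keep the notebook output free of library warnings

from IPython.display import display, HTML

def pretty_print(df):
    # Render the DataFrame as HTML, turning literal "\n" sequences into line breaks.
    return display(HTML(df.to_html().replace("\\n", "<br>")))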

## Installing EvaDB and XGBoost Dependencies
<a id='install_evadb'></a>
We install EvaDB along with the necessary PostgreSQL and XGBoost dependencies.

In [39]:
%pip install --quiet "evadb[postgres,xgboost]"

import evadb
cursor = evadb.connect().cursor()

## Load Data into PostgreSQL

### Setting up a Data Source in EvaDB
To let EvaDB talk directly to an underlying database system such as PostgreSQL, we create a data source. This means giving EvaDB the connection credentials of the running PostgreSQL server.

In [40]:
params = {
    "user": "eva",
    "password": "password",
    "host": "localhost",
    "port": "5432",
    "database": "evadb",
}
query = f"CREATE DATABASE postgres_data WITH ENGINE = 'postgres', PARAMETERS = {params};"
cursor.query(query).df()

10-25-2023 03:09:30 ERROR [plan_executor:plan_executor.py:execute_plan:0179] postgres_data already exists.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/evadb/executor/plan_executor.py", line 175, in execute_plan
    yield from output
  File "/usr/local/lib/python3.10/dist-packages/evadb/executor/create_database_executor.py", line 42, in exec
    raise ExecutorError(f"{self.node.database_name} already exists.")
evadb.executor.executor_utils.ExecutorError: postgres_data already exists.

ExecutorError: ignored
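
To make it easier to see what the f-string in the cell above actually sends to EvaDB, here is a minimal sketch that renders the same statement with a small helper. The helper name `build_create_database_query` is hypothetical and not part of EvaDB; the statement text mirrors the one used above. The sketch only builds the string, so it runs without a database.

```python
def build_create_database_query(name: str, engine: str, params: dict) -> str:
    """Render an EvaDB CREATE DATABASE statement (hypothetical helper)."""
    # The Python dict is interpolated using its repr, e.g. {'user': 'eva', ...}.
    return f"CREATE DATABASE {name} WITH ENGINE = '{engine}', PARAMETERS = {params};"


params = {
    "user": "eva",
    "password": "password",
    "host": "localhost",
    "port": "5432",
    "database": "evadb",
}

query = build_create_database_query("postgres_data", "postgres", params)
print(query)
# With a live cursor, the statement would be sent as in the cell above:
# cursor.query(query).df()
```

As the traceback above shows, running the statement a second time raises an `ExecutorError` because `postgres_data` is already registered; in that case the existing data source can be used as-is.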

### Loading Home Property Sales Data from CSV into PostgreSQL

In this step, we will import the [House Property Sales](https://www.kaggle.com/datasets/htagholdings/property-sales?resource=download) dataset into our PostgreSQL database. If you already have the data stored in PostgreSQL and are ready to proceed with the prediction model training, feel free to skip this section and head directly to the [model training process](#train-the-prediction-model).

In [41]:
!mkdir -p /content
!wget -nc -O /content/home_rentals.csv "https://www.dropbox.com/scl/fi/gy2682i66a8l2tqsowm5x/home_rentals.csv?rlkey=e080k02rv5205h4ullfjdr8lw&raw=1"

File ‘/content/home_rentals.csv’ already there; not retrieving.


In [42]:
cursor.query("""
  USE postgres_data {
    CREATE TABLE IF NOT EXISTS home_rentals (
      number_of_rooms INT,
      number_of_bathrooms INT,
      sqft INT,
      location VARCHAR(128),
      days_on_market INT,
      initial_price INT,
      neighborhood VARCHAR(128),
      rental_price FLOAT
    )
  }
""").df()

Unnamed: 0,status
0,success


In [43]:
cursor.query("""
  USE postgres_data {
    COPY home_rentals(number_of_rooms, number_of_bathrooms, sqft, location, days_on_market, initial_price, neighborhood, rental_price)
    FROM '/content/home_rentals.csv'
    DELIMITER ',' CSV HEADER
  }
""").df()

Unnamed: 0,status
0,success
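
On input, PostgreSQL's `COPY ... CSV HEADER` skips the first line of the file and assigns values by position, so the CSV columns must appear in the same order as the column list in the `COPY` statement. The sketch below checks a header line against that list before loading. It assumes the CSV header uses the same names as the table columns, which the tutorial does not confirm; the sample text is illustrative, with its single data row taken from the preview shown later in this tutorial.

```python
import csv
import io

# Column order used in the COPY statement above.
COPY_COLUMNS = [
    "number_of_rooms", "number_of_bathrooms", "sqft", "location",
    "days_on_market", "initial_price", "neighborhood", "rental_price",
]


def header_mismatches(csv_text, expected):
    """Return (position, expected_name, found_name) for every differing column."""
    header = [name.strip() for name in next(csv.reader(io.StringIO(csv_text)), [])]
    mismatches = []
    for i in range(max(len(header), len(expected))):
        found = header[i] if i < len(header) else None
        want = expected[i] if i < len(expected) else None
        if found != want:
            mismatches.append((i, want, found))
    return mismatches


# Illustrative sample; the header names are an assumption about the real file.
sample = (
    "number_of_rooms,number_of_bathrooms,sqft,location,"
    "days_on_market,initial_price,neighborhood,rental_price\n"
    "1,1,674,good,1,2167,downtown,2167.0\n"
)

print(header_mismatches(sample, COPY_COLUMNS))
```

To check the real file, read `/content/home_rentals.csv` and pass its text to `header_mismatches`. Note also that `COPY` appends rows, so re-running the load cell inserts the data a second time.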


### Preview the Data

The `home_rentals` table has 8 columns. Our objective is to use the other 7 columns to predict `rental_price`.

In [44]:
cursor.query("SELECT * FROM postgres_data.home_rentals LIMIT 3;").df()

Unnamed: 0,rental_price,number_of_bathrooms,sqft,initial_price,number_of_rooms,days_on_market,location,neighborhood
0,2167.0,1,674,2167,1,1,good,downtown
1,1883.0,1,554,1883,1,19,poor,westbrae
2,2431.0,1,529,2431,0,3,great,south_side


## Training the Model

Next, we use EvaDB to train an ML model that predicts home rental prices.

### Train the Prediction Model
We use the [xgboost](https://xgboost.readthedocs.io/en/stable/) engine to train the prediction model, and `Flaml` searches for good hyperparameters automatically. `TIME_LIMIT` sets the time budget for the training process, in seconds. `METRIC` names the error or score to optimize during training, such as `rmse` or `r2`. `TASK` states whether to perform classification or regression. In this example we use regression to predict the home rental price.

In [45]:
cursor.query("""
  CREATE OR REPLACE FUNCTION PredictHouseRent FROM
  ( SELECT * FROM postgres_data.home_rentals )
  TYPE Xgboost
  PREDICT 'rental_price'
  METRIC 'rmse'
  TASK 'regression'
  TIME_LIMIT 180;
""").df()

[flaml.automl.logger: 10-25 03:09:57] {1679} INFO - task = regression
[flaml.automl.logger: 10-25 03:09:57] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 10-25 03:09:57] {1788} INFO - Minimizing error metric: rmse
[flaml.automl.logger: 10-25 03:09:57] {1900} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl.logger: 10-25 03:09:57] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 10-25 03:09:57] {2344} INFO - Estimated sufficient time budget=1949s. Estimated necessary time budget=2s.
[flaml.automl.logger: 10-25 03:09:57] {2391} INFO -  at 0.3s,	estimator xgboost's best error=873.9536,	best estimator xgboost's best error=873.9536
[flaml.automl.logger: 10-25 03:09:57] {2218} INFO - iteration 1, current learner xgboost
[flaml.automl.logger: 10-25 03:09:57] {2391} INFO -  at 0.5s,	estimator xgboost's best error=873.9536,	best estimator xgboost's best error=873.9536
[flaml.automl.logger: 10-25 03:09:57] {2218} INFO - iteration 2, current 

Unnamed: 0,0
0,Function PredictHouseRent overwritten.
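
The training statement has a fixed shape with a handful of knobs, so it can be convenient to generate it from Python. The following is a minimal sketch, not an EvaDB API: `build_training_query` and `ALLOWED_TASKS` are hypothetical names, and the only values checked are the two tasks named in the text above.

```python
ALLOWED_TASKS = {"regression", "classification"}


def build_training_query(function_name, source_query, target, metric, task, time_limit):
    """Render the CREATE OR REPLACE FUNCTION statement used in this tutorial."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"task must be one of {sorted(ALLOWED_TASKS)}, got {task!r}")
    if time_limit <= 0:
        raise ValueError("time_limit must be a positive number of seconds")
    return (
        f"CREATE OR REPLACE FUNCTION {function_name} FROM\n"
        f"  ( {source_query} )\n"
        f"  TYPE Xgboost\n"
        f"  PREDICT '{target}'\n"
        f"  METRIC '{metric}'\n"
        f"  TASK '{task}'\n"
        f"  TIME_LIMIT {time_limit};"
    )


training_query = build_training_query(
    function_name="PredictHouseRent",
    source_query="SELECT * FROM postgres_data.home_rentals",
    target="rental_price",
    metric="rmse",
    task="regression",
    time_limit=180,
)
print(training_query)
# With a live cursor: cursor.query(training_query).df()
```

Validating the task and the time budget up front turns a typo into an immediate `ValueError` instead of a failed training run.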


The same training query can optimize the `r2` metric instead:

In [46]:
cursor.query("""
  CREATE OR REPLACE FUNCTION PredictHouseRent FROM
  ( SELECT * FROM postgres_data.home_rentals )
  TYPE Xgboost
  PREDICT 'rental_price'
  METRIC 'r2'
  TASK 'regression'
  TIME_LIMIT 120;
""").df()

[flaml.automl.logger: 10-25 03:13:21] {1679} INFO - task = regression
[flaml.automl.logger: 10-25 03:13:21] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 10-25 03:13:21] {1788} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 10-25 03:13:21] {1900} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl.logger: 10-25 03:13:21] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 10-25 03:13:21] {2344} INFO - Estimated sufficient time budget=1994s. Estimated necessary time budget=2s.
[flaml.automl.logger: 10-25 03:13:21] {2391} INFO -  at 0.3s,	estimator xgboost's best error=0.4579,	best estimator xgboost's best error=0.4579
[flaml.automl.logger: 10-25 03:13:21] {2218} INFO - iteration 1, current learner xgboost
[flaml.automl.logger: 10-25 03:13:22] {2391} INFO -  at 0.5s,	estimator xgboost's best error=0.4579,	best estimator xgboost's best error=0.4579
[flaml.automl.logger: 10-25 03:13:22] {2218} INFO - iteration 2, current learner 

Unnamed: 0,0
0,Function PredictHouseRent overwritten.
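
The log above reports `Minimizing error metric: 1-r2`: because the search minimizes an error, a higher-is-better score such as R² is tracked as `1 - r2`. A small sketch of converting a logged `best error` back into an R² value follows; the function name `r2_from_logged_error` is hypothetical, and the input is the value logged for the first iteration above.

```python
def r2_from_logged_error(logged_error: float) -> float:
    """Convert a minimized '1-r2' error back into an R^2 score."""
    return 1.0 - logged_error


# 'best error=0.4579' is the value logged at iteration 0 above.
first_iteration_r2 = r2_from_logged_error(0.4579)
print(f"R^2 after the first iteration: {first_iteration_r2:.4f}")
```

A logged error of 0 would correspond to a perfect R² of 1, so lower values in this log mean a better fit.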


### Utilizing the Prediction Model
After training, we use the `PredictHouseRent` model to predict home rental prices.

In [47]:
cursor.query("SELECT PredictHouseRent(*) FROM postgres_data.home_rentals LIMIT 10;").df()

Unnamed: 0,rental_price
0,2152.748779
1,1940.956787
2,2438.577881
3,5532.644043
4,2270.967529
5,4169.572754
6,2207.939941
7,2101.080566
8,3873.550049
9,2027.248047


We can use a `LATERAL JOIN` to compare the actual rental prices in the `home_rentals` table with the prices predicted by the trained model, `PredictHouseRent`.

In [48]:
cursor.query("""
  SELECT rental_price, predicted_rental_price FROM postgres_data.home_rentals
  JOIN LATERAL PredictHouseRent(*) AS Predicted(predicted_rental_price) LIMIT 10;
""").df()

Unnamed: 0,rental_price,predicted_rental_price
0,2167.0,2166.888184
1,1883.0,1882.995605
2,2431.0,2430.979492
3,5510.0,5510.027832
4,2272.0,2272.018066
5,4123.812,4123.84082
6,2224.0,2223.957275
7,2104.0,2103.984131
8,3861.0,3860.960449
9,2041.0,2041.064087
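
To put a number on how close these predictions are, the sketch below computes RMSE and R² by hand for the ten rows in the table above. The function names are illustrative, and the figures describe only these ten rows. Because the model was trained on every row of `home_rentals`, these rows were seen during training, so the result is an optimistic estimate rather than a measure of accuracy on unseen data.

```python
import math

# (actual, predicted) pairs copied from the comparison table above.
rows = [
    (2167.0, 2166.888184),
    (1883.0, 1882.995605),
    (2431.0, 2430.979492),
    (5510.0, 5510.027832),
    (2272.0, 2272.018066),
    (4123.812, 4123.84082),
    (2224.0, 2223.957275),
    (2104.0, 2103.984131),
    (3861.0, 3860.960449),
    (2041.0, 2041.064087),
]


def rmse(pairs):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


def r2(pairs):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(a for a, _ in pairs) / len(pairs)
    ss_res = sum((a - p) ** 2 for a, p in pairs)
    ss_tot = sum((a - mean_actual) ** 2 for a, _ in pairs)
    return 1.0 - ss_res / ss_tot


print(f"RMSE over 10 rows: {rmse(rows):.4f}")
print(f"R^2 over 10 rows:  {r2(rows):.6f}")
```

For an honest estimate, hold out part of the table, train on the remainder, and run the same calculation on the held-out rows.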


## Training Classification Models Using EvaDB

Next, we use EvaDB to train a classification model that predicts `leave_or_not`, a variable indicating whether an employee will leave their current company, based on several attributes.

### Loading Employee Data from CSV into PostgreSQL

In this step, we will import the [Employee Data](https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset) dataset into our PostgreSQL database. If you already have the data stored in PostgreSQL and are ready to proceed with the prediction model training, feel free to skip this section and head directly to the [model training process](#train-the-prediction-model).

In [49]:
!mkdir -p /content
!wget -nc -O /content/Employee.csv "https://drive.google.com/file/d/1R4ij5Ww6bOGwLJrbBStzcaPRhAJ-fn72/view?usp=share_link"

File ‘/content/Employee.csv’ already there; not retrieving.


In [55]:
cursor.query("""
  USE postgres_data {
    CREATE TABLE IF NOT EXISTS Employee (
      education VARCHAR(128),
      joining_year INTEGER,
      city VARCHAR(128),
      payment_tier INTEGER,
      age INTEGER,
      gender VARCHAR(128),
      ever_benched VARCHAR(128),
      experience_in_current_domain INTEGER,
      leave_or_not INTEGER
    )
  }
""").df()

Unnamed: 0,status
0,success


In [60]:
cursor.query("""
  USE postgres_data {
    COPY Employee(education, joining_year, city, payment_tier, age, gender, ever_benched, experience_in_current_domain, leave_or_not)
    FROM '/content/Employee.csv'
    DELIMITER ',' CSV HEADER
  }
""").df()

Unnamed: 0,status
0,success


In [62]:
cursor.query("SELECT * FROM postgres_data.employee LIMIT 3;").df()

Unnamed: 0,leave_or_not,joining_year,payment_tier,age,experience_in_current_domain,gender,city,ever_benched,education
0,0,2017,3,34,0,Male,Bangalore,No,Bachelors
1,1,2013,1,28,3,Female,Pune,No,Bachelors
2,0,2014,3,38,2,Female,New Delhi,No,Bachelors


### Train the Prediction Model
Train the XGBoost AutoML model for classification using the `f1` metric.

In [None]:
cursor.query("""
  CREATE FUNCTION IF NOT EXISTS PredictEmployee FROM
    ( SELECT payment_tier, age, gender, experience_in_current_domain, leave_or_not FROM postgres_data.employee )
    TYPE XGBoost
    PREDICT 'leave_or_not'
    TIME_LIMIT 180
    METRIC 'f1'
    TASK 'classification';
""").df()
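
For readers unfamiliar with the `f1` metric used in the query above, here is a minimal sketch of how it is computed and how it differs from plain accuracy. The labels below are made up for illustration and are not taken from the Employee dataset; the function names are likewise illustrative.

```python
def f1(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels: 1 = employee leaves, 0 = employee stays.
y_true = [1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"f1       = {f1(y_true, y_pred):.3f}")
```

In this toy example the classifier misses one of the two leavers, which accuracy barely registers but F1 penalizes. That sensitivity is a common reason to prefer F1 when one class is much rarer than the other.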

### Utilizing the Prediction Model
After training, we use the `PredictEmployee` model to predict whether each employee will leave.

In [None]:
cursor.query("SELECT PredictEmployee(payment_tier, age, gender, experience_in_current_domain, leave_or_not) FROM postgres_data.employee LIMIT 10;").df()