# Training Prediction Models Directly Within PostgreSQL Using XGBoost and EvaDB
In this tutorial, we'll use EvaDB's model training capabilities to predict whether an employee will leave their company, showing how EvaDB integrates AI into your PostgreSQL database.
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png"/> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png"/> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/staging/tutorials/17-home-rental-prediction.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table><br><br>

## Setup

### Install and Launch the PostgreSQL Server

To kick things off, we'll set up the PostgreSQL database backend. If you already have a PostgreSQL server up and running, you can skip this step and proceed directly to [installing EvaDB](#install_evadb).

In [1]:
!apt -qq install postgresql
!service postgresql start

The following additional packages will be installed:
  libcommon-sense-perl libjson-perl libjson-xs-perl libtypes-serialiser-perl logrotate netbase
  postgresql-14 postgresql-client-14 postgresql-client-common postgresql-common ssl-cert sysstat
Suggested packages:
  bsd-mailx | mailx postgresql-doc postgresql-doc-14 isag
The following NEW packages will be installed:
  libcommon-sense-perl libjson-perl libjson-xs-perl libtypes-serialiser-perl logrotate netbase
  postgresql postgresql-14 postgresql-client-14 postgresql-client-common postgresql-common ssl-cert
  sysstat
0 upgraded, 13 newly installed, 0 to remove and 19 not upgraded.
Need to get 18.3 MB of archives.
After this operation, 51.5 MB of additional disk space will be used.
Preconfiguring packages ...
Selecting previously unselected package logrotate.
(Reading database ... 120874 files and directories currently installed.)
Preparing to unpack .../00-logrotate_3.19.0-1ubuntu1.1_amd64.deb ...
Unpacking logrotate (3.19.0-1ubuntu1.1

### Create User and Database

In [2]:
!sudo -u postgres psql -c "CREATE USER eva WITH SUPERUSER PASSWORD 'password'"
!sudo -u postgres psql -c "CREATE DATABASE evadb"

CREATE ROLE
CREATE DATABASE


### Prettify Output

In [3]:
import warnings
warnings.filterwarnings("ignore")

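# IPython.core.display is deprecated as an import path; IPython.display is the public one.
from IPython.display import display, HTML

def pretty_print(df):
    # Render the DataFrame as HTML, turning escaped newlines into line breaks.
    return display(HTML(df.to_html().replace("\\n", "<br>")))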

## Installing EvaDB and XGBoost dependencies
<a id='install_evadb'></a>
We install EvaDB along with the necessary PostgreSQL and XGBoost dependencies.

In [4]:
%pip install --quiet "evadb[postgres,xgboost]"

import evadb
cursor = evadb.connect().cursor()

Downloading: "http://ml.cs.tsinghua.edu.cn/~chenxi/pytorch-models/mnist-b07bb66b.pth" to /root/.cache/torch/hub/checkpoints/mnist-b07bb66b.pth
100%|██████████| 1.03M/1.03M [00:01<00:00, 740kB/s]
Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /root/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth


## Load Data into PostgreSQL

### Setting up a Data Source in EvaDB
To establish a direct connection between EvaDB and underlying database systems such as PostgreSQL, we will create a data source. This process entails supplying EvaDB with the connection credentials for the active PostgreSQL server.

In [5]:
params = {
    "user": "eva",
    "password": "password",
    "host": "localhost",
    "port": "5432",
    "database": "evadb",
}
query = f"CREATE DATABASE postgres_data WITH ENGINE = 'postgres', PARAMETERS = {params};"
cursor.query(query).df()

Unnamed: 0,0
0,The database postgres_data has been successful...
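
The query above relies on Python's f-string formatting to render the `params` dictionary inline. As a minimal sketch of what EvaDB actually receives, the snippet below builds the same statement without touching a database. `build_data_source_query` is a hypothetical helper introduced here for illustration, not part of EvaDB.

```python
def build_data_source_query(name, engine, params):
    # Hypothetical helper: formats the statement the same way the cell above does.
    # The dict is rendered with Python's repr, so keys and values appear single-quoted.
    return f"CREATE DATABASE {name} WITH ENGINE = '{engine}', PARAMETERS = {params};"


params = {
    "user": "eva",
    "password": "password",
    "host": "localhost",
    "port": "5432",
    "database": "evadb",
}

query = build_data_source_query("postgres_data", "postgres", params)
print(query)
```

Printing the query before running it is a cheap way to catch credential typos. Note that a value containing a single quote would make Python switch that value to double quotes in the rendered dictionary, so it is worth inspecting the string if your password contains one.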


## Training Classification Models Using EvaDB

Next, we use EvaDB to train a classification model that predicts `leave_or_not`, a variable indicating whether an employee will leave their current company, from several employee attributes.

### Loading Employee Data from CSV into PostgreSQL

In this step, we will import the [Employee Data](https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset) dataset into our PostgreSQL database. If you already have the data stored in PostgreSQL and are ready to proceed with the prediction model training, feel free to skip this section and head directly to the [model training process](#train-the-prediction-model).

In [6]:
!mkdir -p /content
# The "view" share link returns an HTML page rather than the CSV, so we request the
# direct-download form of the same Drive file ID (assumed to remain publicly shared).
!wget -nc -O /content/Employee.csv "https://drive.google.com/uc?export=download&id=1R4ij5Ww6bOGwLJrbBStzcaPRhAJ-fn72"
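
Share links on Google Drive can return an HTML page instead of the file itself, so it may be worth checking the download before loading it into PostgreSQL. The sketch below is one simple way to do that. `looks_like_html` is a hypothetical helper, and the two temporary files only stand in for a good and a bad download.

```python
import tempfile
from pathlib import Path


def looks_like_html(path, probe_bytes=512):
    # Hypothetical helper: inspect the first bytes of the file for an HTML preamble.
    head = Path(path).read_bytes()[:probe_bytes].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Demonstration on two throwaway files (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    csv_path.write_text("a,b\n1,2\n")

    html_path = Path(tmp) / "page.csv"
    html_path.write_text("<!DOCTYPE html><html><body>Sign in</body></html>")

    csv_is_html = looks_like_html(csv_path)
    html_is_html = looks_like_html(html_path)

print(csv_is_html, html_is_html)
```

In the notebook, calling `looks_like_html("/content/Employee.csv")` right after the download would flag the problem before the `COPY` step fails on malformed rows.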



In [19]:
cursor.query("""
  USE postgres_data {
    CREATE TABLE IF NOT EXISTS employee_data (
      education VARCHAR(128),
      joining_year INTEGER,
      city VARCHAR(128),
      payment_tier INTEGER,
      age INTEGER,
      gender VARCHAR(128),
      ever_benched VARCHAR(128),
      experience_in_current_domain INTEGER,
      leave_or_not INTEGER
    )
  }
""").df()

Unnamed: 0,status
0,success


In [None]:
cursor.query("""
  USE postgres_data {
    COPY employee_data(education, joining_year, city, payment_tier, age, gender, ever_benched, experience_in_current_domain, leave_or_not)
    FROM '/content/Employee.csv'
    DELIMITER ',' CSV HEADER
  }
""").df()

In [None]:
cursor.query("SELECT * FROM postgres_data.employee_data LIMIT 3;").df()

### Train the Prediction Model
Train the XGBoost AutoML model for classification, optimizing for the `f1` metric.

In [None]:
cursor.query("""
  CREATE FUNCTION IF NOT EXISTS PredictEmployee FROM
    ( SELECT payment_tier, age, gender, experience_in_current_domain, leave_or_not FROM postgres_data.employee_data )
    TYPE XGBoost
    PREDICT 'leave_or_not'
    TIME_LIMIT 180
    METRIC 'f1'
    TASK 'classification';
""").df()
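
For intuition about the `METRIC 'f1'` setting in the query above: F1 is the harmonic mean of precision and recall, and it can differ sharply from accuracy when one class is rare. The pure-Python sketch below uses made-up labels (not taken from the employee dataset) to show two sets of predictions with identical accuracy but very different F1 scores.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    # Count true positives, false positives, and false negatives for the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # No correct positive predictions, so F1 is zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 means the employee leaves, 0 means they stay.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_stay = [0] * 10                        # never predicts a departure
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]        # finds one of the two departures

print(accuracy(y_true, always_stay), f1_score(y_true, always_stay))  # 0.8 0.0
print(accuracy(y_true, mixed), f1_score(y_true, mixed))              # 0.8 0.5
```

Both prediction sets are right 8 times out of 10, but only the second one ever identifies a departing employee, which is what F1 rewards.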

### Utilizing the Prediction Model
Once training completes, we use the `PredictEmployee` function to predict whether each employee will leave.

In [None]:
cursor.query("SELECT PredictEmployee(payment_tier, age, gender, experience_in_current_domain, leave_or_not) FROM postgres_data.employee_data LIMIT 10;").df()
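
The prediction call passes the same column list that the training `SELECT` used. One way to keep the two in sync is to generate both statements from a single list, as in the sketch below. `build_queries` is a hypothetical helper, and the defaults simply mirror the values used in this tutorial.

```python
def build_queries(function_name, table, features, target, metric="f1", time_limit=180):
    # Hypothetical helper: derive the training and prediction statements
    # from one shared column list so they cannot drift apart.
    columns = ", ".join([*features, target])
    train = (
        f"CREATE FUNCTION IF NOT EXISTS {function_name} FROM\n"
        f"  ( SELECT {columns} FROM {table} )\n"
        f"  TYPE XGBoost\n"
        f"  PREDICT '{target}'\n"
        f"  TIME_LIMIT {time_limit}\n"
        f"  METRIC '{metric}'\n"
        f"  TASK 'classification';"
    )
    predict = f"SELECT {function_name}({columns}) FROM {table} LIMIT 10;"
    return train, predict


train_query, predict_query = build_queries(
    function_name="PredictEmployee",
    table="postgres_data.employee_data",
    features=["payment_tier", "age", "gender", "experience_in_current_domain"],
    target="leave_or_not",
)

print(train_query)
print(predict_query)
```

Each string can then be passed to `cursor.query(...).df()` exactly as in the cells above.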