# Training Prediction Models Directly Within PostgreSQL Using XGBoost and EvaDB
In this tutorial, we use EvaDB's model training capabilities to predict whether an employee will leave their company, showing how EvaDB brings AI into your PostgreSQL database.
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/staging/tutorials/19-employee-classification-prediction.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png"/> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/staging/tutorials/19-employee-classification-prediction.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png"/> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/staging/tutorials/19-employee-classification-prediction.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table><br><br>

## Setup

### Install and Launch the PostgreSQL Server

We start by setting up the PostgreSQL database backend. If you already have a PostgreSQL server up and running, you can skip this step and proceed directly to [installing EvaDB](#install_evadb).

In [1]:
!apt -qq install postgresql
!service postgresql start

postgresql is already the newest version (14+238).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
 * Starting PostgreSQL 14 database server
   ...done.
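
If you are unsure whether a server is already running, a quick TCP probe can help decide whether to skip the installation step. The following is a minimal sketch: the helper name `postgres_port_open` is hypothetical, and the default port 5432 is an assumption taken from the connection parameters used in this tutorial. An open port only shows that something is listening, not that it is PostgreSQL.

```python
import socket


def postgres_port_open(host="localhost", port=5432, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it does not speak the
    PostgreSQL protocol, it only checks that the port is reachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


print("Port 5432 reachable:", postgres_port_open())
```

If the probe reports `False`, run the installation cell; if it reports `True`, confirm with `psql` that the listener really is your PostgreSQL server.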

### Create User and Database

In [2]:
!sudo -u postgres psql -c "CREATE USER eva WITH SUPERUSER PASSWORD 'password'"
!sudo -u postgres psql -c "CREATE DATABASE evadb"

ERROR:  role "eva" already exists
ERROR:  database "evadb" already exists


### Prettify Output

In [3]:
import warnings
warnings.filterwarnings("ignore")

from IPython.display import display, HTML

def pretty_print(df):
    # Render the DataFrame as HTML, turning escaped "\n" sequences into line breaks.
    return display(HTML(df.to_html().replace("\\n", "<br>")))

## Installing EvaDB and XGBoost Dependencies
<a id='install_evadb'></a>
We install EvaDB together with its PostgreSQL and XGBoost extras, pinned to a specific commit of the EvaDB repository.

In [4]:
%pip install --quiet "evadb[postgres,xgboost] @ git+https://github.com/georgia-tech-db/evadb.git@703dc9460e499a693ee83bfefe9fe49918499159"

import evadb
cursor = evadb.connect().cursor()

## Load Data into PostgreSQL

### Setting up a Data Source in EvaDB
To connect EvaDB directly to an underlying database system such as PostgreSQL, we create a data source by supplying EvaDB with the connection credentials of the running PostgreSQL server.

In [5]:
params = {
    "user": "eva",
    "password": "password",
    "host": "localhost",
    "port": "5432",
    "database": "evadb",
}
query = f"CREATE DATABASE postgres_data WITH ENGINE = 'postgres', PARAMETERS = {params};"
cursor.query(query).df()

10-27-2023 05:16:54 ERROR [plan_executor:plan_executor.py:execute_plan:0179] postgres_data already exists.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/evadb/executor/plan_executor.py", line 175, in execute_plan
    yield from output
  File "/usr/local/lib/python3.10/dist-packages/evadb/executor/create_database_executor.py", line 42, in exec
    raise ExecutorError(f"{self.node.database_name} already exists.")
evadb.executor.executor_utils.ExecutorError: postgres_data already exists.

ExecutorError: ignored
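
The error above appears when the cell is re-run after `postgres_data` has already been registered. One way to make such setup cells re-runnable is to tolerate that specific failure. The sketch below is an illustration only: `run_tolerating_existing` is a hypothetical helper, not part of EvaDB, and it assumes the error message contains the phrase "already exists" as in the log above.

```python
def run_tolerating_existing(run_query, sql):
    """Run `sql` through `run_query`, returning None if the object already exists.

    `run_query` is any callable that executes a query string, for example
    `lambda q: cursor.query(q).df()` in this notebook (assumption).
    """
    try:
        return run_query(sql)
    except Exception as exc:
        # Matches the message seen in the log above; anything else is re-raised.
        if "already exists" in str(exc):
            print(f"Skipped: {exc}")
            return None
        raise


# Stand-in runner so the sketch can be executed without a database.
def fake_runner(sql):
    raise RuntimeError("postgres_data already exists.")


result = run_tolerating_existing(fake_runner, "CREATE DATABASE postgres_data ...;")
print(result)
```

Matching on the message text is fragile; it is used here only because the log shows that wording.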

## Training Classification Models Using EvaDB

Next, we use EvaDB to train a classification model that predicts `leave_or_not`, a variable indicating whether an employee will leave their current company, from several employee attributes.

### Loading Employee Data from CSV into PostgreSQL

In this step, we import the [Employee dataset](https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset) into our PostgreSQL database. If the data is already stored in PostgreSQL, skip this section and go directly to [model training](#train-the-prediction-model).

In [6]:
!mkdir -p /content
!wget -nc -O /content/Employee.csv "https://drive.google.com/file/d/1R4ij5Ww6bOGwLJrbBStzcaPRhAJ-fn72/view?usp=share_link"

--2023-10-27 04:16:45--  https://drive.google.com/file/d/1R4ij5Ww6bOGwLJrbBStzcaPRhAJ-fn72/view?usp=share_link
Resolving drive.google.com (drive.google.com)... 172.253.123.139, 172.253.123.101, 172.253.123.138, ...
Connecting to drive.google.com (drive.google.com)|172.253.123.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/content/Employee.csv’

/content/Employee.c     [ <=>                ]  81.89K  --.-KB/s    in 0.001s  

2023-10-27 04:16:45 (60.3 MB/s) - ‘/content/Employee.csv’ saved [83856]
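
Note that the log above reports the response as `text/html`, which suggests the share link returned a web page rather than the raw CSV. Before loading the file, it is worth checking that it looks like the expected data. The following is a rough sketch: `looks_like_employee_csv` is a hypothetical helper, and the expected column count of 9 is an assumption taken from the `employee_data` table defined in this tutorial.

```python
import csv

EXPECTED_COLUMNS = 9  # columns in the employee_data table used in this tutorial


def looks_like_employee_csv(path, expected_columns=EXPECTED_COLUMNS):
    """Heuristic check that `path` holds CSV data rather than an HTML page."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        head = f.read(2048)
    text = head.lstrip()
    if not text:
        return False  # empty file
    if text.lower().startswith(("<!doctype", "<html")):
        return False  # an HTML page was saved instead of the data
    # Parse only the first line and compare its field count to the schema.
    header = next(csv.reader([text.splitlines()[0]]))
    return len(header) == expected_columns
```

In the notebook this could be called as `looks_like_employee_csv("/content/Employee.csv")`; a `False` result means the file should be downloaded again from a direct link before running `COPY`.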



In [6]:
cursor.query("""
  USE postgres_data {
    CREATE TABLE IF NOT EXISTS employee_data (
      education VARCHAR(128),
      joining_year INTEGER,
      city VARCHAR(128),
      payment_tier INTEGER,
      age INTEGER,
      gender VARCHAR(128),
      ever_benched VARCHAR(128),
      experience_in_current_domain INTEGER,
      leave_or_not INTEGER
    )
  }
""").df()

|   | status  |
|---|---------|
| 0 | success |


In [7]:
cursor.query("""
  USE postgres_data {
    COPY employee_data(education, joining_year, city, payment_tier, age, gender, ever_benched, experience_in_current_domain, leave_or_not)
    FROM '/content/Employee.csv'
    DELIMITER ',' CSV HEADER
  }
""").df()

|   | status  |
|---|---------|
| 0 | success |


In [8]:
cursor.query("SELECT * FROM postgres_data.employee_data LIMIT 3;").df()

|   | leave_or_not | joining_year | payment_tier | age | experience_in_current_domain | gender | city | ever_benched | education |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2017 | 3 | 34 | 0 | Male | Bangalore | No | Bachelors |
| 1 | 1 | 2013 | 1 | 28 | 3 | Female | Pune | No | Bachelors |
| 2 | 0 | 2014 | 3 | 38 | 2 | Female | New Delhi | No | Bachelors |


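### Train the Prediction Model
Train an XGBoost AutoML model for classification on `payment_tier`, `age`, `gender`, and `experience_in_current_domain`, optimizing the `accuracy` metric with a `TIME_LIMIT` of 180 seconds.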

In [9]:
cursor.query("""
  CREATE FUNCTION IF NOT EXISTS PredictEmployee FROM
    ( SELECT payment_tier, age, gender, experience_in_current_domain, leave_or_not FROM postgres_data.employee_data )
    TYPE XGBoost
    PREDICT 'leave_or_not'
    TIME_LIMIT 180
    METRIC 'accuracy'
    TASK 'classification';
""").df()

[flaml.automl.logger: 10-27 05:17:19] {1679} INFO - task = classification
[flaml.automl.logger: 10-27 05:17:19] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 10-27 05:17:19] {1788} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 10-27 05:17:19] {1900} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl.logger: 10-27 05:17:19] {2218} INFO - iteration 0, current learner xgboost
[flaml.automl.logger: 10-27 05:17:19] {2344} INFO - Estimated sufficient time budget=1155s. Estimated necessary time budget=1s.
[flaml.automl.logger: 10-27 05:17:19] {2391} INFO -  at 0.2s,	estimator xgboost's best error=0.3439,	best estimator xgboost's best error=0.3439
[flaml.automl.logger: 10-27 05:17:19] {2218} INFO - iteration 1, current learner xgboost
[flaml.automl.logger: 10-27 05:17:19] {2391} INFO -  at 0.3s,	estimator xgboost's best error=0.3439,	best estimator xgboost's best error=0.3439
[flaml.automl.logger: 10-27 05:17:19] {2218} INFO - iteration 2, curren

|   | 0 |
|---|---|
| 0 | Function PredictEmployee added to the database. |
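
When experimenting with different feature sets or time limits, it can help to assemble the training statement programmatically. The sketch below only builds the query string using the clauses shown in the cell above; `build_train_query` is a hypothetical helper, not an EvaDB API, and it performs plain string interpolation, so it should only be used with trusted identifiers.

```python
def build_train_query(name, columns, source_table, target,
                      time_limit=180, metric="accuracy", task="classification"):
    """Assemble a CREATE FUNCTION ... TYPE XGBoost statement as a string."""
    if target not in columns:
        # The label column must be part of the SELECT list used for training.
        raise ValueError(f"target {target!r} must be one of the selected columns")
    column_list = ", ".join(columns)
    return (
        f"CREATE FUNCTION IF NOT EXISTS {name} FROM\n"
        f"  ( SELECT {column_list} FROM {source_table} )\n"
        f"  TYPE XGBoost\n"
        f"  PREDICT '{target}'\n"
        f"  TIME_LIMIT {time_limit}\n"
        f"  METRIC '{metric}'\n"
        f"  TASK '{task}';"
    )


query = build_train_query(
    name="PredictEmployee",
    columns=["payment_tier", "age", "gender",
             "experience_in_current_domain", "leave_or_not"],
    source_table="postgres_data.employee_data",
    target="leave_or_not",
)
print(query)
```

The resulting string can be passed to `cursor.query(query).df()` in the same way as the hand-written statement.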


### Utilizing the Prediction Model
After training, we use the `PredictEmployee` function to predict whether each employee will leave.

In [11]:
cursor.query("SELECT PredictEmployee(payment_tier, age, gender, experience_in_current_domain, leave_or_not) FROM postgres_data.employee_data LIMIT 10;").df()

|   | leave_or_not |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |


Use a `LATERAL JOIN` to place each prediction next to the actual `leave_or_not` value and compare them.

In [12]:
cursor.query("""
  SELECT leave_or_not, predicted_leave_or_not FROM postgres_data.employee_data
  JOIN LATERAL PredictEmployee(*) AS Predicted(predicted_leave_or_not) LIMIT 10;
""").df()

|   | leave_or_not | predicted_leave_or_not |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
| 5 | 0 | 0 |
| 6 | 0 | 0 |
| 7 | 1 | 0 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |
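
The side-by-side output can be summarized numerically. As a small sketch, the code below computes accuracy and confusion counts for the ten rows shown above; the helper names are hypothetical, and the two lists are copied from that output (in the notebook they could come from the DataFrame columns via `.tolist()`). Because these rows belong to the table the model was trained on, the figure is not a held-out estimate.

```python
# Values copied from the ten rows of the comparison output.
actual = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0]
predicted = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]


def accuracy(actual, predicted):
    """Fraction of rows where the prediction equals the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, treating 1 as 'will leave'."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


print(accuracy(actual, predicted))          # 0.7
print(confusion_counts(actual, predicted))  # {'tp': 1, 'tn': 6, 'fp': 0, 'fn': 3}
```

On this sample, every error is a missed departure (actual 1, predicted 0), which accuracy alone does not reveal.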
