# Oracle Machine Learning for Python - Automatic Machine Learning
Oracle Machine Learning for Python (OML4Py), a component of the Oracle Advanced Analytics option to Oracle Database Enterprise Edition, makes the open source Python scripting language and environment ready for the enterprise and big data. Designed for problems involving both large and small volumes of data, Oracle Machine Learning for Python integrates Python with Oracle Database, allowing users to execute Python commands and scripts for statistical, machine learning, and graphical analyses on database tables and views using Python syntax. Many familiar Python functions are overloaded so that they translate into SQL for in-database execution. OML4Py also adds new automated machine learning capabilities.
![title](img/OML4P_icon.jpg)
In this notebook, we highlight the OML4Py Automated Machine Learning capability, AutoML, which consists of three key features:
* Auto Feature Selection
  * Reduce the number of features by identifying the most predictive ones
  * Improve performance and/or accuracy of the resulting model
* Auto Model Selection for classification and regression
  * Identify the algorithm that achieves the best value of the chosen score metric
  * Find the best model many times faster than with exhaustive search techniques
* Auto Tuning of Hyperparameters
  * Significantly improve model accuracy
  * Tune the model many times faster than with manual or exhaustive search techniques

# Connect to Oracle Database
To use OML4Py, first import the package ***oml***. OML4Py supports a variety of connection specification options, including Oracle Wallet. Once connected to an Oracle Database that has OML4Py installed, invoking ***oml.isconnected*** returns True. 

The ***automl*** parameter of the ***connect*** function, when set to True, initializes the settings required for the oml.automl modules. 

In [1]:
import warnings
warnings.filterwarnings('ignore')

import oml
from oml import automl
from oml import algo
from oml.automl import FeatureSelection

oml.connect("pyquser2","pyquser2",
            '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(service_name=OAA1)))',
           automl=True)
oml.isconnected()

True
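
To avoid hard-coding credentials and the connect descriptor, one option is to assemble them from environment variables. The sketch below is only an illustration: the helper `make_dsn` and the variable names `OML_USER`, `OML_PASSWORD`, `OML_HOST`, `OML_PORT`, and `OML_SERVICE` are hypothetical, and the defaults simply mirror the values used above.

```python
import os

def make_dsn(host, port, service_name):
    """Build a TCP connect descriptor in the same form as the one used above."""
    return ("(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)"
            f"(HOST={host})(PORT={port}))"
            f"(CONNECT_DATA=(service_name={service_name})))")

# Hypothetical environment variable names; the defaults are this notebook's values.
user = os.environ.get("OML_USER", "pyquser2")
password = os.environ.get("OML_PASSWORD", "pyquser2")
dsn = make_dsn(os.environ.get("OML_HOST", "localhost"),
               int(os.environ.get("OML_PORT", "1521")),
               os.environ.get("OML_SERVICE", "OAA1"))
print(dsn)
```

The resulting values could then be passed to ***connect*** in place of the literals, for example `oml.connect(user, password, dsn, automl=True)`.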

# Create a Pandas DataFrame and load into Oracle Database
In this example, we load the iris data and combine target and predictors into a single DataFrame, which matches the form the data would have as a database table. This DataFrame is then loaded into Oracle Database using the ***create*** function, which creates a persistent table. 

For AutoML functionality, note that the target must be numeric, even though the target column is categorical. In previous examples, we mapped the numeric species values to their string equivalents. Here, we leave it as numeric. 

In [2]:
from sklearn import datasets
from sklearn import linear_model
import pandas as pd

iris = datasets.load_iris()
x = pd.DataFrame(iris.data, 
                 columns = ["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"])
y = pd.DataFrame(iris.target,
                 columns = ['Species']) # AutoML needs a numeric target, even for a categorical label
#y = pd.DataFrame(list(map(lambda x: {0:'setosa', 1: 'versicolor', 2:'virginica'}[x], iris.target)), 
#                 columns = ['Species'])
iris_df = pd.concat([x,y], axis=1)

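# Drop any earlier copy of the table; ignore the error if it does not exist
try:
    oml.drop(table="IRIS")
except Exception:
    pass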
IRIS = oml.create(iris_df, table="IRIS")
print("Shape:",IRIS.shape)
IRIS.head(4)

Shape: (150, 5)


   SEPAL_LENGTH  SEPAL_WIDTH  PETAL_LENGTH  PETAL_WIDTH  Species
0           5.1          3.5           1.4          0.2        0
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
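
If a target arrives as strings rather than numbers, it can be mapped to integer codes before the table is created. The following is a minimal sketch in plain Python, assuming a small hypothetical list of labels; the names `label_to_code` and `code_to_label` are illustrative and not part of OML4Py.

```python
# Hypothetical string labels, standing in for a categorical target column
species = ["setosa", "versicolor", "virginica", "setosa"]

# Assign each distinct label an integer code in sorted order
label_to_code = {label: code for code, label in enumerate(sorted(set(species)))}
codes = [label_to_code[s] for s in species]

# Keep the inverse mapping so predictions can be translated back to labels
code_to_label = {code: label for label, code in label_to_code.items()}

print(codes)                              # numeric target values
print([code_to_label[c] for c in codes])  # round trip back to the labels
```

Keeping the inverse mapping alongside the codes makes it straightforward to report predictions using the original label names.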

We can also access the proxy object for the database table by invoking ***sync*** and supplying the table name. Next, we split the data into train and test and prepare the train_x and train_y data. 

In [3]:
train_dat, test_dat = oml.sync(table = "IRIS").split()
train_x = train_dat.drop('Species')
train_y = train_dat['Species']

# Automatic feature selection using the 'accuracy' metric
Automatic feature selection uses a technique called _meta-learning_ to quickly identify the most relevant features, or _predictors_, given a training data set and a machine learning technique, also referred to as a _mining function_, such as _classification_ or _regression_.

In this example, we use the scoring metric 'accuracy' to find the best features for classification. After creating a FeatureSelection object, the function ***reduce*** returns the selected features for the given data set and algorithm (here 'dt', a decision tree). As the output shows, a single column, PETAL_WIDTH, was selected. 

In [5]:
fs = automl.FeatureSelection(mining_function = 'classification', score_metric = 'accuracy')
selected_features = fs.reduce('dt', train_x, train_y)
train_x_reduced = train_x[:,selected_features]
print("Selected columns:",train_x_reduced.columns)

Selected columns: ['PETAL_WIDTH']
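
For intuition about what "most predictive" can mean, the sketch below ranks features by their absolute Pearson correlation with the target. This is a simple filter method, not the meta-learning technique OML4Py uses, and the column values are made-up toy numbers rather than the actual iris data; `pearson` and `top_k_features` are hypothetical helper names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def top_k_features(columns, target, k):
    """Return the k column names with the largest absolute correlation."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy values for illustration only
columns = {
    "PETAL_WIDTH": [0.2, 0.2, 1.3, 1.5, 2.1, 2.3],
    "SEPAL_WIDTH": [3.5, 3.0, 2.8, 3.2, 3.0, 3.1],
}
target = [0, 0, 1, 1, 2, 2]

selected = top_k_features(columns, target, k=1)
print(selected)
```

A filter like this scores each column independently of any algorithm, whereas ***reduce*** takes the algorithm into account.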


# Feature Selection on digits data set
Let's use a more interesting data set, _digits_, which we load with ***load_digits*** from scikit-learn. First, we import some needed items, including the openml package used later in this notebook.

After loading the digits data set, we provide column names, create a _case_id_ column, and then create the database table DIGITS.

In [6]:
import openml
import time
import logging
import numpy as np
import sklearn as skl
from sklearn.datasets import load_digits

In [7]:
bc = load_digits()
bc_data = bc.data.astype(float)
print("Dataset {} -- shape {}".format('digits', bc_data.shape))
X = pd.DataFrame(bc_data, columns = [ 'COL{}'.format(i) for i in range(bc_data.shape[1]) ])
y = pd.DataFrame(bc.target, columns = ['TARGET'])

row_id_col = np.arange(bc_data.shape[0])
row_id = pd.DataFrame(row_id_col, columns = ['CASE_ID'])

# Drop any earlier copy of the table; ignore the error if it does not exist
try:
    oml.drop(table='DIGITS')
except Exception:
    pass
DIGITS = oml.create(pd.concat([row_id, X, y], axis=1), table = 'DIGITS')
DIGITS.head()

Dataset digits -- shape (1797, 64)


   CASE_ID  COL0  COL1  COL2  COL3  COL4  COL5  COL6  COL7  COL8   ...    \
0       42     0     0     0     0    12     5     0     0     0   ...     
1       43     0     0     0     9    15    12     0     0     0   ...     
2       44     0     0     9    16    16    16     5     0     0   ...     
3       45     0     0     9    16    13     6     0     0     0   ...     
4       46     0     1    15     4     0     0     0     0     0   ...     

   COL55  COL56  COL57  COL58  COL59  COL60  COL61  COL62  COL63  TARGET  
0      0      0      0      0      3     16      8      0      0       1  
1      0      0      0      0     11      7      0      0      0       7  
2      0      0      0     13     10      0      0      0      0       7  
3      0      0      0     11     14     12      8      0      0       3  
4      0      0      0     14     14      4      0      0      0       5  

[5 rows x 66 columns]

We split the DIGITS data (in-database) using an 80/20 split, with a seed for reproducibility, producing the training and test data. Then, we specify the FeatureSelection object, and invoke ***reduce*** to select the features using the linear Support Vector Machine algorithm. Note that we can specify the degree of parallelism using the 'parallel' argument when creating the FeatureSelection object. 

Feature selection reduced the 64 predictor columns to 45. The output below reports 65 columns because train_x still contains the CASE_ID column. 

In [8]:
train_dat, test_dat = DIGITS.split(ratio=(0.8, 0.2), seed = 1234, 
                                   hash_cols = 'CASE_ID', 
                                   strata_cols='TARGET')

train_x, train_y = train_dat.drop('TARGET'), train_dat['TARGET']
test_x, test_y = test_dat.drop('TARGET'), test_dat['TARGET']

fs = FeatureSelection(mining_function='classification', score_metric='accuracy', parallel=4)

selected_features = fs.reduce('svm_linear', train_x, train_y, case_id='CASE_ID')

print("# features selected: {} from {}".format(len(selected_features), train_x.shape[1]))

# features selected: 45 from 65
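
One way to picture a reproducible split keyed on a case identifier is to hash the identifier together with the seed and compare the result against the train ratio. The sketch below is only a conceptual illustration in plain Python; it is not how OML4Py implements ***split***, it ignores stratification, and `assign_to_train` is a hypothetical helper.

```python
import hashlib

def assign_to_train(case_id, seed=1234, train_ratio=0.8):
    """Deterministically map a case id to the train (True) or test (False) side."""
    key = "{}:{}".format(seed, case_id).encode("utf-8")
    # Use the first 32 bits of the digest as a number in [0, 1)
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) / 2**32
    return bucket < train_ratio

case_ids = range(1797)  # same number of rows as the digits data
train_ids = [c for c in case_ids if assign_to_train(c)]
test_ids = [c for c in case_ids if not assign_to_train(c)]

print(len(train_ids), len(test_ids))
```

Because the assignment depends only on the seed and the identifier, rerunning the split yields the same partition regardless of row order.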


# Compare improvement with Feature Selection results
How does a model built using the selected features compare with one built on all the features? We'll build a model on the training data and score the test data - first using all columns from the original data set, and then on the reduced feature data set. We then compare the results for both speed and accuracy. 

In [9]:
mod = algo.svm(mining_function='classification')

start_time = time.time()
mod.fit(train_x, train_y)
score = mod.score(test_x, test_y)
no_fs_time = time.time() - start_time

train_x_reduced  = train_x[:,selected_features]
test_x_n = test_x[:,selected_features]

In [10]:
mod_reduced = algo.svm(mining_function='classification')

start_time = time.time()
mod_reduced.fit(train_x_reduced, train_y)
score_reduced = mod_reduced.score(test_x_n, test_y)
after_fs_time = time.time() - start_time

As the output below shows, accuracy is about the same, but elapsed time is reduced. Note that the timed interval covers both fitting and scoring. Next, let's try a larger data set to see how the benefits scale.

In [11]:
print("Feature reduction {:0.2f}x\n\n".format(float(X.shape[1]-1)/len(selected_features)) +
      "Accuracy Score: \n\tWithout FS: {:0.2}\n\t".format(score) +
      "With FS: {:0.2}\n".format(score_reduced))

print("Fit-time \n\tWithout FS: {:0.2f}s, \n\tWith FS: {:0.2f}s\n\tImprovement: {:0.1f}%".format(no_fs_time, 
                                                                                             after_fs_time,
                                                                                            (no_fs_time-after_fs_time)/no_fs_time*100))

Feature reduction 1.40x

Accuracy Score: 
	Without FS: 0.95
	With FS: 0.95

Fit-time 
	Without FS: 3.73s, 
	With FS: 2.03s
	Improvement: 45.6%
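
The improvement figure is the relative reduction in elapsed time, (before - after) / before * 100. The sketch below reproduces that arithmetic from the rounded times printed above and shows one way to wrap the timing pattern in a reusable helper; `timed` and `improvement_pct` are hypothetical names introduced here.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def improvement_pct(before, after):
    """Relative reduction in elapsed time, as a percentage of 'before'."""
    return (before - after) / before * 100

# Any callable can be timed; sum() is just a stand-in for fit plus score
total, seconds = timed(sum, range(1000))

# Rounded times taken from the output above
pct = improvement_pct(3.73, 2.03)
print("Improvement: {:0.1f}%".format(pct))
```

A negative value from `improvement_pct` means the second run was slower than the first.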


# Feature Selection on 554 data set
Using the OpenML data set '554', this example highlights the performance and accuracy impact of feature selection on a larger table involving 70K rows and 786 columns. This corresponds to the MNIST_784 data set of hand-written digits. 

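As before, we prepare the data and create the database table. In the cell below, the table-creation code is commented out because the table already exists in the database, so we only obtain a proxy object with ***sync***. The 'Dataset shape' and timing lines in the output appear to be retained from the run that created the table.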

In [12]:
# ds = openml.datasets.get_dataset('554')
# n_arr, col_names = ds.get_data(return_attribute_names=True)
# feats, y = n_arr[:,:-1], n_arr[:,-1]
# print("Dataset shape {}".format(feats.shape))
# 
# row_id_col = np.arange(feats.shape[0]).reshape(feats.shape[0], -1)
# X = np.hstack((feats, row_id_col))
# col_names[-1] = 'CASE_ID'
# 
# X = pd.DataFrame(X, columns = col_names)
# y = pd.DataFrame(y, columns = ['TARGET'])
# 
# try:
#     oml.drop(table='OPENML_554')
# except:
#     pass
# 
# %time OPENML_554 = oml.create(pd.concat([X, y], axis=1), table = 'OPENML_554')

OPENML_554 = oml.sync(table='OPENML_554')
print("Data Table Shape:",OPENML_554.shape)

Dataset shape (70000, 784)
CPU times: user 1min 30s, sys: 59.8 s, total: 2min 30s
Wall time: 2min 48s
Data Table Shape: (70000, 786)


In [14]:
fs = FeatureSelection(mining_function='classification', score_metric='accuracy', parallel=4)

%time selected_features = fs.reduce('svm_gaussian', train_x, train_y, case_id='CASE_ID')

print("# features selected: {} from {}".format(len(selected_features), train_x.shape[1]))

CPU times: user 151 ms, sys: 21 ms, total: 172 ms
Wall time: 19.8 s
# features selected: 45 from 65


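As we did for the DIGITS data set, we compare the results for accuracy and performance, using the SVM Gaussian algorithm. Note that the cell above still uses the train_x and train_y proxies from the DIGITS split (its output reports 45 of 65 columns), so the numbers in this section reflect DIGITS rather than OPENML_554. To measure the larger table, split OPENML_554 into train and test sets first. On this small data set, the run with feature selection was slower (5.20s versus 3.71s) at the same accuracy. 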

In [15]:
def get_svm_gaussian():
    mod = algo.svm(mining_function='classification')
    mod.set_params(SVMS_KERNEL_FUNCTION='SVMS_GAUSSIAN')
    return mod

mod = get_svm_gaussian()

start_time = time.time()
%time mod.fit(train_x, train_y)
%time score = mod.score(test_x, test_y)
no_fs_time = time.time() - start_time

CPU times: user 55 ms, sys: 0 ns, total: 55 ms
Wall time: 3.48 s
CPU times: user 13 ms, sys: 2 ms, total: 15 ms
Wall time: 225 ms


In [17]:
train_x_reduced  = train_x[:,selected_features]
test_x_n = test_x[:,selected_features]

mod_reduced = get_svm_gaussian()

start_time = time.time()
%time mod_reduced.fit(train_x_reduced, train_y)
%time score_reduced = mod_reduced.score(test_x_n, test_y)
after_fs_time = time.time() - start_time

CPU times: user 129 ms, sys: 7 ms, total: 136 ms
Wall time: 4.76 s
CPU times: user 52 ms, sys: 7 ms, total: 59 ms
Wall time: 431 ms


In [18]:
print("Feature reduction {:0.2f}x\n\n".format(float(train_x.shape[1]-1)/len(selected_features)) +
      "Accuracy Score\n\tWithout FS: {:0.2f}\n".format(score) +
      "\tWith FS: {:0.2f}\n".format(score_reduced))

print("Fit-time\n\tWithout FS: {:0.2f}s, \n\tWith FS: {:0.2f}s\n\tImprovement: {:0.1f}%".format(no_fs_time, 
                                                                                             after_fs_time,
                                                                                            (no_fs_time-after_fs_time)/no_fs_time*100))

Feature reduction 1.42x

Accuracy Score
	Without FS: 0.98
	With FS: 0.98

Fit-time
	Without FS: 3.71s, 
	With FS: 5.20s
	Improvement: -40.4%


# Feature and Model Selection using 'BreastCancer' data set
In this example, we use another data set to illustrate both feature and model selection. For feature selection, we show the difference in selected columns based on the chosen metric. 

As before, we create the database table, then split the data into train and test.

In [19]:
bc = datasets.load_breast_cancer()
bc_data = bc.data.astype(float)
X = pd.DataFrame(bc_data, columns = bc.feature_names)
y = pd.DataFrame(bc.target, columns = ['TARGET'])

# Drop any earlier copy of the table; ignore the error if it does not exist
try:
    oml.drop(table='BREASTCANCER')
except Exception:
    pass
BREASTCANCER = oml.create(pd.concat([X, y], axis=1), table = 'BREASTCANCER')
print("Shape:",BREASTCANCER.shape)

Shape: (569, 31)


In [20]:
train_dat, test_dat = oml.sync(table = "BREASTCANCER").split()
test_x, test_y = test_dat.drop('TARGET'), test_dat['TARGET']
train_x = train_dat.drop('TARGET')
train_y = train_dat['TARGET']

Next, we define FeatureSelection objects - one with metric 'accuracy', and one with metric 'f1_macro'.

Notice that the 'accuracy' metric selected fewer features (10) than 'f1_macro' (16). 

In [21]:
fs = automl.FeatureSelection(mining_function = 'classification', score_metric = 'accuracy')
selected_features = fs.reduce('dt', train_x, train_y)
train_x_reduced = train_x[:,selected_features]
print("Selected columns:",train_x_reduced.columns)
print("Number of columns:", train_x_reduced.shape[1])

Selected columns: ['mean compactness', 'mean concave points', 'mean fractal dimension', 'radius error', 'smoothness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst fractal dimension']
Number of columns: 10


In [22]:
fs = automl.FeatureSelection(mining_function = 'classification', score_metric = 'f1_macro')
selected_features = fs.reduce('dt', train_x, train_y)
train_x_reduced = train_x[:,selected_features]
print("Selected columns:",train_x_reduced.columns)
print("Number of columns:", train_x_reduced.shape[1])

Selected columns: ['mean radius', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'smoothness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst smoothness', 'worst concavity', 'worst concave points', 'worst fractal dimension']
Number of columns: 16
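
The two metrics can disagree because 'accuracy' counts correct predictions overall, while 'f1_macro' averages the per-class F1 score and so gives each class equal weight. The following plain-Python sketch, using made-up labels, shows a case where a classifier that always predicts the majority class looks acceptable on accuracy but poor on macro F1. The function names are illustrative and not part of OML4Py.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Made-up imbalanced example: 8 positives, 2 negatives, always predict positive
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10

print("accuracy: {:.2f}".format(accuracy(y_true, y_pred)))
print("f1_macro: {:.2f}".format(f1_macro(y_true, y_pred)))
```

Because the minority class contributes an F1 of zero here, the macro average is pulled well below the accuracy.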


# Select the best model
For AutoML model selection, we create a ModelSelection object, here with the 'f1_macro' score metric. Then, we invoke ***select*** with k=1 to get the top-ranked algorithm. By default, the selected model is also tuned, and the tuned model is then displayed. 

In [23]:
ms = automl.ModelSelection(mining_function='classification', 
                           score_metric='f1_macro', parallel=4)

In [24]:
%time selected_model = ms.select(train_x, train_y, k=1)
selected_model

CPU times: user 1.51 s, sys: 1.1 s, total: 2.61 s
Wall time: 1min 20s



Algorithm Name: Support Vector Machine

Mining Function: CLASSIFICATION

Target: TARGET

Settings: 
                    setting name                 setting value
0                      ALGO_NAME  ALGO_SUPPORT_VECTOR_MACHINES
1          CLAS_WEIGHTS_BALANCED                           OFF
2                   ODMS_DETAILS                  ODMS_DISABLE
3   ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO
4                  ODMS_SAMPLING         ODMS_SAMPLING_DISABLE
5                      PREP_AUTO                            ON
6         SVMS_COMPLEXITY_FACTOR                            10
7            SVMS_CONV_TOLERANCE                         .0001
8           SVMS_KERNEL_FUNCTION                 SVMS_GAUSSIAN
9                SVMS_NUM_PIVOTS                           200
10                  SVMS_STD_DEV            5.3999999999999995

Attributes: 
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmet

Let's re-run model selection, but turn off tuning so we can see the 'f1_macro' metrics produced by each model. Here we see explicitly that SVM Gaussian produced the best model. 

In [25]:
%time ranked_models = ms.select(train_x, train_y, tune=False)

print("Ranked models:\n",ranked_models)

CPU times: user 295 ms, sys: 134 ms, total: 429 ms
Wall time: 25.4 s
Ranked models:
 [('svm_gaussian', 0.9683665987729823), ('nn', 0.9587999707195006), ('svm_linear', 0.9529611650687159)]
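
With tuning turned off, the result shown above is a list of (algorithm, score) pairs, which is easy to post-process. As a small sketch, the snippet below copies the list from the output above, picks the best entry, and computes the gap to the runner-up; the variable names are illustrative.

```python
# Copied from the output above
ranked_models = [('svm_gaussian', 0.9683665987729823),
                 ('nn', 0.9587999707195006),
                 ('svm_linear', 0.9529611650687159)]

# Sort by score, highest first, without relying on the incoming order
by_score = sorted(ranked_models, key=lambda pair: pair[1], reverse=True)

best_name, best_score = by_score[0]
runner_up_name, runner_up_score = by_score[1]
margin = best_score - runner_up_score

print("best: {} ({:.3f}), margin over {}: {:.3f}".format(
    best_name, best_score, runner_up_name, margin))
```

A small margin such as this one suggests that more than one algorithm may be worth tuning before settling on a final model.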


# Hyperparameter tuning
Using the training data from the BREASTCANCER table created above, we define a ModelTuning object for classification and set the degree of parallelism to 4. We then invoke ***run*** to produce the tuned model using the decision tree algorithm.

The result of model tuning is a dictionary with the best model and the tuned parameters. The model can be used for scoring using the ***predict*** function. 

In [26]:
from oml.automl import ModelTuning

at = ModelTuning(mining_function = 'classification', parallel=4)

In [27]:
%time results = at.run('dt', train_x, train_y, score_metric='accuracy')

tuned_model = results['best_model']
tuned_model

CPU times: user 421 ms, sys: 92 ms, total: 513 ms
Wall time: 37.2 s



Algorithm Name: Decision Tree

Mining Function: CLASSIFICATION

Target: TARGET

Settings: 
                    setting name            setting value
0                      ALGO_NAME       ALGO_DECISION_TREE
1              CLAS_MAX_SUP_BINS                       32
2          CLAS_WEIGHTS_BALANCED                      OFF
3                   ODMS_DETAILS             ODMS_DISABLE
4   ODMS_MISSING_VALUE_TREATMENT  ODMS_MISSING_VALUE_AUTO
5                  ODMS_SAMPLING    ODMS_SAMPLING_DISABLE
6                      PREP_AUTO                       ON
7           TREE_IMPURITY_METRIC    TREE_IMPURITY_ENTROPY
8            TREE_TERM_MAX_DEPTH                        7
9          TREE_TERM_MINPCT_NODE                     0.05
10        TREE_TERM_MINPCT_SPLIT                      0.1
11         TREE_TERM_MINREC_NODE                       10
12        TREE_TERM_MINREC_SPLIT                       20

Attributes: 
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness

In [28]:
pred_y = tuned_model.predict(test_x)
pred_y.head()

   PREDICTION
0           1
1           0
2           1
3           0
4           1

Next, we show the tuned model's score metric and hyperparameters, taken from the first entry of 'all_evals'. We then use the tuned model to compute the score on the test set.

In [29]:
score, params = results['all_evals'][0]
"{:.2}".format(score), ["{}:{}".format(k, params[k]) for k in sorted(params)]

('0.91',
 ['CLAS_MAX_SUP_BINS:32',
  'TREE_IMPURITY_METRIC:TREE_IMPURITY_ENTROPY',
  'TREE_TERM_MAX_DEPTH:7',
  'TREE_TERM_MINPCT_NODE:0.05',
  'TREE_TERM_MINPCT_SPLIT:0.1'])

In [30]:
"{:.2}".format(tuned_model.score(test_x, test_y))

'0.91'
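
The cells above treat 'all_evals' as a list of (score, parameters) pairs. As a sketch of how such a list can be inspected without relying on its order, the snippet below uses a stand-in dictionary: the first entry echoes the rounded values printed above, and the second entry is entirely hypothetical.

```python
# Stand-in for the tuning result; not produced by OML4Py
results = {
    "all_evals": [
        (0.91, {"CLAS_MAX_SUP_BINS": 32,
                "TREE_IMPURITY_METRIC": "TREE_IMPURITY_ENTROPY",
                "TREE_TERM_MAX_DEPTH": 7,
                "TREE_TERM_MINPCT_NODE": 0.05,
                "TREE_TERM_MINPCT_SPLIT": 0.1}),
        # Hypothetical second evaluation, for illustration only
        (0.88, {"CLAS_MAX_SUP_BINS": 32,
                "TREE_IMPURITY_METRIC": "TREE_IMPURITY_GINI",
                "TREE_TERM_MAX_DEPTH": 4,
                "TREE_TERM_MINPCT_NODE": 0.05,
                "TREE_TERM_MINPCT_SPLIT": 0.1}),
    ]
}

# Pick the evaluation with the highest score
best_score, best_params = max(results["all_evals"], key=lambda ev: ev[0])
summary = ["{}:{}".format(k, best_params[k]) for k in sorted(best_params)]

print("{:.2}".format(best_score), summary)
```

Selecting with `max` keeps the code correct even if the evaluations are not stored best-first.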

Users can also invoke model tuning with user-defined search ranges for selected hyperparameters and a chosen tuning metric, which in this example is 'f1_macro'.

Here, we use the Random Forest algorithm, specified by 'rf'.

In [31]:
search_space={'RFOR_SAMPLING_RATIO': {'type': 'continuous', 
                                      'range': [0.05, 0.5]}, 
              'RFOR_NUM_TREES': {'type': 'discrete', 'range': [50, 55]}, 
              'TREE_IMPURITY_METRIC': {'type': 'categorical', 
                                       'range': ['TREE_IMPURITY_ENTROPY', 
                                                 'TREE_IMPURITY_GINI']},}

results = at.run('rf', train_x, train_y, score_metric='f1_macro', param_space=search_space)

score, params = results['all_evals'][0]
("{:.2}".format(score), ["{}:{}".format(k, round(params[k], 3) if isinstance(params[k], float) else params[k]) for k in sorted(params)])

('0.94',
 ['RFOR_NUM_TREES:53',
  'RFOR_SAMPLING_RATIO:0.35',
  'TREE_IMPURITY_METRIC:TREE_IMPURITY_GINI'])
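
To make the three range types concrete, the sketch below draws random candidates from the same search space using plain random search. This is not the search strategy OML4Py uses internally. It assumes that 'discrete' ranges are inclusive integer bounds, and `sample_params` is a hypothetical helper.

```python
import random

search_space = {
    'RFOR_SAMPLING_RATIO': {'type': 'continuous', 'range': [0.05, 0.5]},
    'RFOR_NUM_TREES': {'type': 'discrete', 'range': [50, 55]},
    'TREE_IMPURITY_METRIC': {'type': 'categorical',
                             'range': ['TREE_IMPURITY_ENTROPY',
                                       'TREE_IMPURITY_GINI']},
}

def sample_params(space, rng):
    """Draw one candidate setting from each hyperparameter's range."""
    params = {}
    for name, spec in space.items():
        if spec['type'] == 'continuous':
            low, high = spec['range']
            params[name] = rng.uniform(low, high)
        elif spec['type'] == 'discrete':
            low, high = spec['range']
            params[name] = rng.randint(low, high)  # assumed inclusive bounds
        elif spec['type'] == 'categorical':
            params[name] = rng.choice(spec['range'])
        else:
            raise ValueError("unknown range type: {}".format(spec['type']))
    return params

rng = random.Random(0)  # fixed seed so the draws are repeatable
candidates = [sample_params(search_space, rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

Each candidate would then be evaluated with the chosen score metric, and the best-scoring one kept.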

Some hyperparameter search ranges, such as the one for Random Forest's RFOR_MTRY setting, need to be defined relative to the size of the training data set, e.g., the number of samples or features. RFOR_MTRY sets the size of the random subset of columns considered when choosing a split at a node. Data set-specific placeholders such as ***nr_features*** or ***nr_samples*** can be used for this purpose, written with a leading $ as in '$nr_features/2' below.

In [16]:
search_space={'RFOR_MTRY': {'type': 'discrete', 
                            'range': [1, '$nr_features/2']}}

results = at.run('rf', train_x, train_y, score_metric='f1_macro', param_space=search_space)

score, params = results['all_evals'][0]
("{:.2}".format(score), ["{}:{}".format(k, params[k]) for k in sorted(params)])

('0.94', ['RFOR_MTRY:10'])
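
To illustrate how such a placeholder could translate into a concrete bound, the sketch below resolves '$nr_features/2' for a data set with 30 features, the number of predictor columns in the BREASTCANCER table. This is a hypothetical resolver, not OML4Py's implementation: it handles only an optional integer divisor and assumes integer division.

```python
import re

def resolve_bound(bound, nr_features, nr_samples):
    """Turn a numeric bound or a '$nr_features/2' style placeholder into a number."""
    if not isinstance(bound, str):
        return bound  # already a concrete number
    match = re.fullmatch(r"\$(nr_features|nr_samples)(?:/(\d+))?", bound.strip())
    if match is None:
        raise ValueError("unsupported placeholder: {!r}".format(bound))
    base = {"nr_features": nr_features, "nr_samples": nr_samples}[match.group(1)]
    divisor = int(match.group(2)) if match.group(2) else 1
    return base // divisor  # integer division is an assumption of this sketch

search_space = {'RFOR_MTRY': {'type': 'discrete',
                              'range': [1, '$nr_features/2']}}

low, high = (resolve_bound(b, nr_features=30, nr_samples=569)
             for b in search_space['RFOR_MTRY']['range'])
print(low, high)
```

Under these assumptions the range becomes 1 to 15, which is consistent with the tuned value RFOR_MTRY:10 shown above.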

<img src="img/Oracle-sm.jpg">