# Comparison with HANA APL Library
In this notebook we'll find out which AutoML library is more accurate: ours or the Automated Predictive Library (APL).
First, let's import all the packages needed for the experiment.

In [38]:
import pandas as pd
from hana_automl.automl import AutoML
from hana_ml.algorithms.apl.classification import AutoClassifier
from hana_ml.algorithms.apl.regression import AutoRegressor
from hana_automl.utils.connection import connection_context # connection info for database
from hana_ml.dataframe import create_dataframe_from_pandas
from sklearn.model_selection import train_test_split

## Classification
For this task, we'll use the famous Titanic dataset (https://www.kaggle.com/c/titanic).
We have cleaned it a bit beforehand.
Our task is to predict whether a passenger survived or not.

In [7]:
cls_df = pd.read_csv('data/cleaned_train.csv')
cls_df.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male
0,0,1,0,3,22.0,1,0,7.25,0,1
1,1,2,1,1,38.0,1,0,71.2833,1,0
2,2,3,1,3,26.0,0,0,7.925,1,0
3,3,4,1,1,35.0,1,0,53.1,1,0
4,4,5,0,3,35.0,0,0,8.05,0,1
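The `Unnamed: 0.1` and `Unnamed: 0` columns in the preview look like leftover row indices, which pandas typically produces when a DataFrame is saved with its index and read back. The notebook keeps them as they are. If you prefer to drop them before training, a minimal sketch on a toy frame (the `raw` and `cleaned` names are ours, not part of the notebook) could look like this:

```python
import pandas as pd

# Toy frame that mimics the leftover index columns seen in the preview above.
raw = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "PassengerId": [1, 2],
    "Survived": [0, 1],
})

# Keep only the columns whose name does not start with "Unnamed".
cleaned = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))
```

Saving with `to_csv(..., index=False)` avoids creating such columns in the first place.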


Now let's split it into train and test dataframes.

In [8]:
cls_train, cls_test = train_test_split(cls_df, test_size=0.3)
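The split above is random, so the scores later in the notebook will change from run to run. A sketch of a reproducible variant, assuming you want a fixed seed and the same class balance in both parts (the toy data and the seed value `42` are illustrative choices of ours):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Titanic frame: 10 passengers, balanced target.
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# random_state fixes the shuffle; stratify keeps the Survived ratio similar
# in the train and test parts.
train_a, test_a = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["Survived"]
)
train_b, test_b = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["Survived"]
)

# Both calls return exactly the same rows.
print(test_a["PassengerId"].tolist() == test_b["PassengerId"].tolist())
```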

To load the dataset into the database, we need a connection context. For security reasons, we import it from our library.

In [9]:
hana_cls_df_train = create_dataframe_from_pandas(connection_context=connection_context,
                                            pandas_df=cls_train,
                                            table_name='cls_compare_train',
                                            force=True,
                                            drop_exist_tab=True)

100%|██████████| 1/1 [00:00<00:00,  5.75it/s]


In [10]:
hana_cls_df_test = create_dataframe_from_pandas(connection_context=connection_context,
                                            pandas_df=cls_test,
                                            table_name='cls_compare_test',
                                            force=True,
                                            drop_exist_tab=True)

100%|██████████| 1/1 [00:00<00:00,  6.16it/s]


### APL Library

Create the APL model and fit it.

In [None]:
cls_apl_model = AutoClassifier(connection_context, variable_auto_selection=True)

cls_apl_model.fit(data=hana_cls_df_train, label='Survived', key='PassengerId')

Score it. Remember this number!

In [13]:
cls_apl_model.score(hana_cls_df_test)

0.7761194029850746
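To make this number concrete: assuming `score` reports plain classification accuracy (our reading, not something this notebook verifies), it is the share of test rows whose predicted label matches the true one. A small pure-Python sketch with made-up labels:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction equals truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only: 4 of 5 predictions are correct.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
acc = accuracy(y_true, y_pred)
print(acc)
```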

### HANA AutoML library

Create the HANA AutoML model and fit it. Note that it detects the task automatically (see `Task: cls` in the output).

In [31]:
cls_automl = AutoML(connection_context)

cls_automl.fit(
        file_path="data/cleaned_train.csv",
        target="Survived",
        id_column="PassengerId",
        steps=30,
        categorical_features=["Survived"]
)

Creating table with name: AUTOML960a91a7-eed3-421d-b0e2-2844ece32e6f


100%|██████████| 1/1 [00:00<00:00,  5.54it/s]


Done
Task: cls
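One thing to keep in mind: the `fit` call above reads `data/cleaned_train.csv`, the full file, so it appears to see the rows of `cls_test` as well, while the APL model was fitted on the train split only. One way to keep the test rows held out is to write the train split to its own CSV and pass that path instead. The sketch below uses toy data and a temporary file; the file name `cls_train_split.csv` is hypothetical, and the split uses `DataFrame.sample` instead of `train_test_split` to stay pandas-only.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the full dataset.
full = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1] * 5,
})

# 70/30 split: sample the train rows, the rest becomes the test part.
train = full.sample(frac=0.7, random_state=0)
test = full.drop(train.index)

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "cls_train_split.csv")  # hypothetical name
    # index=False also avoids creating an extra "Unnamed: 0" column.
    train.to_csv(train_path, index=False)
    reloaded = pd.read_csv(train_path)
    # Here one would call, as in the cell above but with the new path:
    # cls_automl.fit(file_path=train_path, target="Survived", ...)

# No test passenger appears in the file used for training.
leaked = set(reloaded["PassengerId"]) & set(test["PassengerId"])
print(len(reloaded), len(test), leaked)
```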


Score it.

In [32]:
cls_automl.get_model().score(hana_cls_df_test, label='Survived', key='PassengerId')

0.824626

As you can see, our library is **more accurate** than the HANA APL library on this test split: 0.825 against 0.776.



## Regression

We'll use another famous dataset, Boston housing (https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). Let's load it.
Here we need to predict the 'medv' column.

In [39]:
reg_df = pd.read_csv('data/boston_data.csv')
reg_df.head()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0,0.15876,0.0,10.81,0.0,0.413,5.961,17.5,5.2873,4.0,305.0,19.2,376.94,9.88,21.7
1,1,0.10328,25.0,5.13,0.0,0.453,5.927,47.2,6.932,8.0,284.0,19.7,396.9,9.22,19.6
2,2,0.3494,0.0,9.9,0.0,0.544,5.972,76.7,3.1025,4.0,304.0,18.4,396.24,9.97,20.3
3,3,2.73397,0.0,19.58,0.0,0.871,5.597,94.9,1.5257,5.0,403.0,14.7,351.85,21.45,15.4
4,4,0.04337,21.0,5.64,0.0,0.439,6.115,63.0,6.8147,4.0,243.0,16.8,393.97,9.43,20.5


Again, split the dataset into train and test parts.

In [40]:
reg_train, reg_test = train_test_split(reg_df, test_size=0.3)

Load both DataFrames into the database.

In [41]:
hana_reg_df_train = create_dataframe_from_pandas(connection_context=connection_context,
                                            pandas_df=reg_train,
                                            table_name='reg_compare_train',
                                            force=True,
                                            drop_exist_tab=True)

100%|██████████| 1/1 [00:00<00:00,  5.70it/s]


In [42]:
hana_reg_df_test = create_dataframe_from_pandas(connection_context=connection_context,
                                            pandas_df=reg_test,
                                            table_name='reg_compare_test',
                                            force=True,
                                            drop_exist_tab=True)

100%|██████████| 1/1 [00:00<00:00,  6.20it/s]


### APL Library

Create the APL model and fit it.

In [43]:
reg_apl_model = AutoRegressor(connection_context, variable_auto_selection=True)

reg_apl_model.fit(data=hana_reg_df_train, label='medv', key='ID')

Score it. Take a look at the result.

In [44]:
reg_apl_model.score(hana_reg_df_test)

0.8590635581598624
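For regression, a score in this range is commonly the coefficient of determination, R². We assume that is what is reported here; the notebook itself does not say. Under that assumption, the value is one minus the ratio of the squared prediction errors to the squared deviations from the mean. A pure-Python sketch with made-up numbers (the helper name `r_squared` is ours):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration: only the last prediction is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
r2 = r_squared(y_true, y_pred)  # SS_res = 1, SS_tot = 5
print(r2)
```

A perfect model gives 1, and a model that always predicts the mean gives 0.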

### HANA AutoML library

Create and fit the model.

In [45]:
reg_automl = AutoML(connection_context)

reg_automl.fit(
        file_path="data/boston_data.csv",
        target="medv",
        id_column="ID",
        steps=30
)

Creating table with name: AUTOML6d26b9f0-e457-4772-8e91-3fc2dee8194d


100%|██████████| 1/1 [00:00<00:00,  5.57it/s]


Done
Task: reg


Score it.

In [47]:
reg_automl.get_model().score(hana_reg_df_test, label='medv', key='ID')

0.9355457739354234

And this time our library won again: 0.936 against 0.859 for APL on this test split.