# Tutorial 1: Data backends

In this tutorial, we review the data backends supported by `elastic-surv`.

Supported scenarios include:
 - Load data from a Pandas DataFrame.
 - Load data from an Elasticsearch (ES) node.

In [5]:
import warnings
warnings.filterwarnings('ignore')

## Load dataset

For this demo, we will use the [User Churn dataset](https://square.github.io/pysurvival/tutorials/churn.html).

In [18]:
!pip install pysurvival &> /dev/null
import numpy as np
from pysurvival.datasets import Dataset

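# Load the churn dataset as a pandas DataFrame
raw_dataset = Dataset('churn').load()

# Columns holding the time-to-event and the event indicator
time_column = 'months_active'
event_column = 'churned'

# Every remaining column is used as a feature
features = np.setdiff1d(raw_dataset.columns, [time_column, event_column]).tolist()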

raw_dataset

Unnamed: 0,product_data_storage,product_travel_expense,product_payroll,product_accounting,csat_score,articles_viewed,smartphone_notifications_viewed,marketing_emails_clicked,social_media_ads_viewed,minutes_customer_support,company_size,us_region,months_active,churned
0,2048,Free-Trial,Active,No,9,4,0,14,1,8.3,10-50,West North Central,3.0,1.0
1,2048,Free-Trial,Free-Trial,Active,9,4,2,12,1,0.0,100-250,South Atlantic,2.0,1.0
2,2048,Active,Active,Active,9,3,2,17,1,0.0,100-250,East South Central,7.0,0.0
3,500,Active,Free-Trial,No,10,0,0,14,0,0.0,50-100,East South Central,8.0,1.0
4,5120,Free-Trial,Active,Free-Trial,8,5,0,17,0,0.0,50-100,East North Central,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1024,Free-Trial,Active,Free-Trial,9,3,0,19,1,0.4,50-100,Mountain,8.0,0.0
1996,1024,Free-Trial,Active,Active,7,3,0,15,0,0.0,50-100,Middle Atlantic,2.0,1.0
1997,500,Free-Trial,Active,Active,8,5,0,15,0,5.9,10-50,West North Central,1.0,1.0
1998,500,Free-Trial,No,Active,10,6,1,15,1,3.3,100-250,Pacific,3.0,1.0
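
A note on the feature list built above: `np.setdiff1d` returns the sorted, unique values of its first argument that are absent from the second, so `features` ends up in alphabetical order rather than in the DataFrame's column order. A minimal sketch, using a short hypothetical column list in place of `raw_dataset.columns`:

```python
import numpy as np

# Hypothetical column list standing in for raw_dataset.columns
columns = ["us_region", "csat_score", "months_active", "churned", "company_size"]
time_column = "months_active"
event_column = "churned"

# setdiff1d drops the time/event columns and returns the rest sorted
features = np.setdiff1d(columns, [time_column, event_column]).tolist()
print(features)
```

If the original column order matters, a list comprehension over the columns would preserve it instead.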


## Pandas backend for elastic-surv

This variant is suitable for loading local data.

The loading process also handles the encoding of categorical features. With `verbose=True`, each one-hot encoded column is logged, as shown below.

In [14]:
from elastic_surv.dataset import PandasDataset

dataset = PandasDataset(
    raw_dataset,
    time_column=time_column,
    event_column=event_column,
    features=features,
    verbose=True,
)

preprocess: dataset one-hot encoding for company_size {'10-50', '50-100', '100-250', 'self-employed', '1-10'}
preprocess: dataset one-hot encoding for product_accounting {'Active', 'No', 'Free-Trial'}
preprocess: dataset one-hot encoding for product_payroll {'Active', 'No', 'Free-Trial'}
preprocess: dataset one-hot encoding for product_travel_expense {'Active', 'No', 'Free-Trial'}
preprocess: dataset one-hot encoding for us_region {'New England', 'West South Central', 'South Atlantic', 'West North Central', 'East North Central', 'Pacific', 'Middle Atlantic', 'Mountain', 'East South Central'}
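
To illustrate what the one-hot encoding reported in the log above does, here is a minimal sketch using plain pandas on a hypothetical toy frame. It is not necessarily how `elastic-surv` implements the step internally; `pd.get_dummies` is only a stand-in that shows the shape of the result.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "csat_score": [9, 10, 8],
    "product_payroll": ["Active", "Free-Trial", "No"],
})

# The categorical column is expanded into one indicator column per category;
# the numeric column is passed through unchanged.
encoded = pd.get_dummies(toy, columns=["product_payroll"])
print(encoded.columns.tolist())
```

Each category of `product_payroll` becomes its own column, so a feature with three categories contributes three columns to the encoded dataset.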


In [15]:
len(dataset.train())

1800

In [16]:
len(dataset.test())

200
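
The sizes above correspond to a 90/10 split of the 2000 rows. The tutorial does not show how `elastic-surv` selects the test rows, so the following is only a sketch of one common approach (a seeded random split in pandas); the helper `split_train_test` and its parameters are hypothetical.

```python
import pandas as pd

def split_train_test(df, test_fraction=0.1, seed=0):
    """Randomly hold out `test_fraction` of the rows as a test set."""
    test = df.sample(frac=test_fraction, random_state=seed)
    # Everything not sampled into the test set stays in the training set
    train = df.drop(test.index)
    return train, test

# Hypothetical frame with the same number of rows as the churn dataset
toy = pd.DataFrame({"months_active": range(2000)})
train, test = split_train_test(toy)
print(len(train), len(test))
```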

## Elasticsearch backend

For this example, we will use a local ES server. Installation steps are described in the [Elasticsearch installation guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/install-elasticsearch.html).


### Prepare the index for the demo

We will create the index locally, for demo purposes.

In [9]:
import eland as ed

# Upload dataframe to ES
ed.pandas_to_eland(
    raw_dataset,
    es_client='localhost',
    es_dest_index='churn-prediction',
    es_if_exists='replace',
    es_dropna=True,
    es_refresh=True,
)

Unnamed: 0,articles_viewed,churned,company_size,csat_score,marketing_emails_clicked,minutes_customer_support,months_active,product_accounting,product_data_storage,product_payroll,product_travel_expense,smartphone_notifications_viewed,social_media_ads_viewed,us_region
0,4,1.0,10-50,9,14,8.3,3.0,No,2048,Active,Free-Trial,0,1,West North Central
1,4,1.0,100-250,9,12,0.0,2.0,Active,2048,Free-Trial,Free-Trial,2,1,South Atlantic
2,3,0.0,100-250,9,17,0.0,7.0,Active,2048,Active,Active,2,1,East South Central
3,0,1.0,50-100,10,14,0.0,8.0,No,500,Free-Trial,Active,0,0,East South Central
4,5,0.0,50-100,8,17,0.0,7.0,Free-Trial,5120,Active,Free-Trial,0,0,East North Central
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,3,0.0,50-100,9,19,0.4,8.0,Free-Trial,1024,Active,Free-Trial,0,1,Mountain
1996,3,1.0,50-100,7,15,0.0,2.0,Active,1024,Active,Free-Trial,0,0,Middle Atlantic
1997,5,1.0,10-50,8,15,5.9,1.0,Active,500,Active,Free-Trial,0,0,West North Central
1998,6,1.0,100-250,10,15,3.3,3.0,Active,500,No,Free-Trial,1,1,Pacific


The `churn-prediction` index is now available in our local ES deployment.

## ES backend for elastic-surv

In [11]:
from elastic_surv.dataset import ESDataset

dataset = ESDataset(
    es_index_pattern='churn-prediction',
    time_column='months_active',
    event_column='churned',
    verbose=True,
)

preprocess: dataset one-hot encoding for company_size {'10-50', '50-100', '100-250', 'self-employed', '1-10'}
preprocess: dataset one-hot encoding for product_accounting {'Active', 'No', 'Free-Trial'}
preprocess: dataset one-hot encoding for product_payroll {'Active', 'No', 'Free-Trial'}
preprocess: dataset one-hot encoding for product_travel_expense {'Active', 'No', 'Free-Trial'}
preprocess: dataset one-hot encoding for us_region {'New England', 'West South Central', 'South Atlantic', 'West North Central', 'East North Central', 'Pacific', 'Middle Atlantic', 'Mountain', 'East South Central'}


In [12]:
len(dataset.train())

1800

In [13]:
len(dataset.test())

200

## Conclusion

In this tutorial, we loaded data into `elastic-surv` from both a Pandas DataFrame and an ES index.
Next, we will train models on the loaded data.