This example demonstrates how to load data from a data collection in an SAP DI workspace using the sapdi package. We then use this data to train a simple logistic regression model that predicts the `Survived` column. After the training step, the model is saved into the model folder for later use.

Requirements for usage:
- create a Workspace using the *Data Manager*
- create a Data Collection inside the workspace
- adjust the parameters in the code below if you use other names

### Load the required libraries and the data

In [18]:
import pandas as pd
import sapdi

ws = sapdi.get_workspace(name='Demo-WS')
dc = ws.get_datacollection(name='Titanic')
with dc.open('train.csv').get_reader() as reader:
    train = pd.read_csv(reader)
with dc.open('test.csv').get_reader() as reader:
    test = pd.read_csv(reader)


### Apply some basic data transformation and cleansing

In [20]:
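def impute_age(row):
    # Access the values by column label; positional access such as row[0]
    # on a labelled Series is deprecated in recent pandas versions.
    age = row['Age']
    pclass = row['Pclass']

    # Fill a missing age with a fixed default that depends on the passenger class.
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    else:
        return age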

In [21]:
train['Age']=train[['Age','Pclass']].apply(impute_age,axis=1)
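
The row-wise `apply` with `impute_age` can also be written as a vectorized operation. The following is a minimal sketch, assuming the same class-specific default ages; the small `demo` frame and the `default_age` dictionary are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample rows standing in for the Titanic training data.
demo = pd.DataFrame({
    'Age': [22.0, np.nan, np.nan, np.nan, 54.0],
    'Pclass': [3, 1, 2, 3, 1],
})

# Default age per passenger class, taken from impute_age.
default_age = {1: 37, 2: 29, 3: 24}

# Map each class to its default and use it only where Age is missing.
demo['Age'] = demo['Age'].fillna(demo['Pclass'].map(default_age))
print(demo['Age'].tolist())
```

One difference to keep in mind: `impute_age` falls back to 24 for any class other than 1 or 2, whereas this sketch leaves the age missing for a class that is not listed in `default_age`.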

In [22]:
train=train.drop(['Cabin','Name','Ticket'],axis=1)
train.dropna(inplace=True)

In [23]:
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)

In [24]:
train=train.drop(['Sex','Embarked'],axis=1)

In [25]:
train=pd.concat([train,sex,embark],axis=1)

In [26]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
       'male', 'Q', 'S'],
      dtype='object')

### Split the dataset into train and test

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),train['Survived'],random_state=101)

### Import the required libraries and train the model

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

log_model=LogisticRegression()
log_model.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
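
The warning above suggests two remedies: scaling the features or raising `max_iter`. The sketch below combines both in a scikit-learn pipeline. It runs on synthetic stand-in data rather than the Titanic columns, and the name `scaled_model` and the value `max_iter=1000` are assumptions chosen for illustration, not settings from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(30, 10, 200), rng.normal(0, 500, 200)])
y = (X[:, 0] / 10 + X[:, 1] / 500 + rng.normal(0, 0.5, 200) > 3).astype(int)

# Scale the features first, then fit with a higher iteration limit.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
print(scaled_model.named_steps['logisticregression'].n_iter_)
```

Because the scaler is part of the pipeline, the same scaling is applied automatically when calling `predict` on new data. Swapping this in for `log_model` may change the metric reported below.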

### Calculate the RMSE metric to evaluate the model

In [30]:
import numpy as np
y_pred = log_model.predict(X_test)
mse = np.mean((y_pred - y_test)**2)
rmse = np.sqrt(mse)
rmse = round(rmse, 2)
print("RMSE: " , str(rmse))
print("n: ", str(len(X_test)))

RMSE:  0.44
n:  223
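
Since `Survived` and the predictions are both 0/1 labels, the RMSE here has a direct reading: every wrong prediction contributes exactly 1 to the squared error, so the MSE equals the misclassification rate and the accuracy is `1 - mse`. The sketch below checks this relation on made-up labels (`y_true` and `y_hat` are illustrative, not the notebook's data).

```python
import math

# Made-up 0/1 labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

n = len(y_true)
mse = sum((p - t) ** 2 for p, t in zip(y_hat, y_true)) / n
accuracy = sum(p == t for p, t in zip(y_hat, y_true)) / n

# For 0/1 labels the MSE is the share of wrong predictions.
print("RMSE:", round(math.sqrt(mse), 2))
print("accuracy:", accuracy)
print("1 - mse:", 1 - mse)
```

By this relation, the RMSE of 0.44 reported above corresponds to an accuracy of roughly 0.8 on the 223 test rows. The already imported `accuracy_score(y_test, y_pred)` gives the exact value directly.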


### Save the model

In [31]:
import pickle
# The model folder must already exist; the with block closes the file after writing.
with open("model/titanic_lm.pickle.dat", "wb") as f:
    pickle.dump(log_model, f)
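
To reuse the saved model later, it can be read back with `pickle.load`. The following round trip is a minimal sketch: it uses a plain dictionary as a stand-in for the trained model and a hypothetical path in a temporary folder instead of `model/`, so it runs without the notebook's data.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; any picklable object behaves the same way.
stand_in_model = {"coef": [0.1, -0.2], "intercept": 0.5}

# Hypothetical path in a temporary folder, used here instead of model/.
path = os.path.join(tempfile.mkdtemp(), "titanic_lm.pickle.dat")

# Write the object to disk.
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back; for the real model, call restored.predict(...) afterwards.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

When loading a pickled scikit-learn model, use the same scikit-learn version that was used for training, and only unpickle files from a trusted source.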