# Build a customer churn model and monitor the process in Databand

*In this notebook we build a customer churn model and monitor some of the steps in Databand. We also log metadata for the training dataset and the model training metrics.*  

We structured this notebook so that it can be used both interactively and in batch mode, because data scientists often need to understand and review data before they build the model. Notice that cells up to **Step 2: build model** produce output that can be reviewed. In **Step 2** we switch to modular programming (functions) because it allows us to track the execution of each function in Databand (we add the *@task* decorator from the Databand SDK). We track two steps: building the model and saving it to the project.  

Review all cells in the notebook, make the required changes, and then either run the cells one at a time or run the entire notebook. View the results of the run in Databand.

You will need to make the following changes in the notebook (see cells for specific instructions):
- Add project token
- Add Cloud API key
- Add your Cloud URL

In [None]:
# IMPORTANT: Insert the project token before running the notebook. The generated code is needed by the WML API to save the model in the project

In [None]:
# DATABAND
# Run once during notebook execution to install the Databand SDK
!pip install databand

In [None]:
# DATABAND
# Import Databand libraries
from dbnd import dbnd_tracking, task, dataset_op_logger, log_metric

## Step 1: Explore and prepare Data

In [None]:
# Libraries for data understanding and model building
!pip install pandas_profiling
!pip install sklearn-pandas
# Update WML library
!pip install -U ibm-watson-machine-learning

In [None]:
import pandas as pd
import numpy as np
import pandas_profiling
import sklearn.pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, StandardScaler, LabelBinarizer, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_curve, roc_auc_score
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
import json
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

### Load and review data 

In [None]:
url='https://raw.githubusercontent.com/elenalowery/data-samples/main/churn.csv'
    
customer_churn = pd.read_csv(url)
customer_churn.head()

In [None]:
url='https://raw.githubusercontent.com/elenalowery/data-samples/main/customer-profile.csv'

customer = pd.read_csv(url)
customer.head()

### Merge Files

In [None]:
trainingData = pd.merge(customer, customer_churn, on='ID')
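
The merge above is an inner join on `ID`, so rows whose `ID` appears in only one file are silently dropped. A minimal sketch of how that could be checked, using small stand-in frames rather than the real CSV files (the column values here are invented for illustration):

```python
import pandas as pd

# Tiny stand-in frames; the notebook itself loads these from CSV files
customer = pd.DataFrame({"ID": [1, 2, 3], "Age": [34, 51, 28]})
customer_churn = pd.DataFrame({"ID": [1, 2, 4], "CHURN": ["T", "F", "F"]})

# validate="one_to_one" raises MergeError if an ID repeats in either frame
merged = pd.merge(customer, customer_churn, on="ID", how="inner", validate="one_to_one")

# IDs present in only one of the two frames are dropped by the inner join
dropped_ids = set(customer["ID"]) ^ set(customer_churn["ID"])
print(merged.shape, sorted(dropped_ids))
```

Comparing `merged.shape` with the shapes of the inputs is a quick way to see whether the join lost any customers.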

### Rename some columns
This step removes spaces from column names. It is an example of data preparation that you may want to do before creating a model. 

In [None]:
trainingData.columns

In [None]:
trainingData.rename(columns={'Est Income':'EstIncome', 'Car Owner':'CarOwner' }, inplace=True)
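
The rename above lists each column explicitly. If a dataset has many such columns, one possible alternative is to strip spaces from every column name in a single step. A small sketch on a stand-in frame (the column list is illustrative, not the full schema of the churn data):

```python
import pandas as pd

# Stand-in frame with the same kind of column names as the merged data
df = pd.DataFrame(columns=["ID", "Est Income", "Car Owner", "Age"])

# Remove every space from every column name (plain string replace, not a regex)
df.columns = df.columns.str.replace(" ", "", regex=False)
print(list(df.columns))
```

The explicit `rename` used in the notebook has the advantage of documenting exactly which columns change.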

In [None]:
trainingData.head()

In [None]:
trainingData.shape

### Data understanding

In [None]:
trainingData.describe()

In [None]:
pandas_profiling.ProfileReport(trainingData)

In [None]:
# TODO: figure out a more elegant way to do this
# Repeating this step here because it's used by more than one function


# Define input data to the model
X = trainingData.drop(['ID','CHURN'], axis=1)
    
# Define the target variable and encode with value between 0 and n_classes-1, that is from T/F to 1/0
le = LabelEncoder()
y = le.fit_transform(trainingData['CHURN'])
    
label_mapping=le.inverse_transform([0,1])
    
# split the data to training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
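
To see which original label corresponds to which encoded value, the fitted encoder can be inspected. A short sketch, assuming the target holds the values `'T'` and `'F'` as the comment above suggests (the sample list is made up):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["T", "F", "F", "T"])

# classes_ is sorted, so each label's position is its encoded value
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping, encoded.tolist())
```

Because `classes_` is sorted alphabetically, `'F'` becomes 0 and `'T'` becomes 1 in this sketch.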


## Step 2: Build the sklearn pipeline and the Random Forest model


Notice that each function below is defined with a *@task* decorator above it. *@task* is provided by the Databand SDK. It lets Databand know that a pipeline step is starting to execute.  

In notebooks, functions do not execute until they are invoked, so each function is defined in a cell above the call to it. When you run the function cells, the notebook shows that the cell completed, but the function body does not actually run. All functions defined below are invoked by the last code cell, which calls *buildCustomerChurnModel()*.
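
The difference between defining a decorated function and calling it can be sketched with a plain Python decorator. `traced` below is a hypothetical stand-in written for this illustration; it is not the Databand *@task* implementation and only records which functions ran:

```python
import functools

calls = []  # names of decorated functions that have actually executed

def traced(func):
    """Hypothetical tracking decorator used only for this illustration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        calls.append(func.__name__)  # recorded at call time, not at definition time
        return func(*args, **kwargs)
    return wrapper

@traced
def train_step(x):
    return x * 2

# Running the definition above executed nothing yet
calls_after_definition = list(calls)

# The body (and the tracking) only runs when the function is invoked
result = train_step(21)
print(calls_after_definition, calls, result)
```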

In [None]:
@task
def train_model(trainingData):
    
    # Define input data to the model
    X = trainingData.drop(['ID','CHURN'], axis=1)
    
    # Define the target variable and encode with value between 0 and n_classes-1, that is from T/F to 1/0
    le = LabelEncoder()
    y = le.fit_transform(trainingData['CHURN'])
    
    label_mapping=le.inverse_transform([0,1])
    
    # split the data to training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
    
    # Log model training data in Databand
    with dataset_op_logger("CPDaaS://MLOps_Deployment/churnTrainingData", "read", with_schema=True, with_preview=True) as logger:
        logger.set(data=trainingData)
    
    mapper_good = DataFrameMapper([
    (['Gender'], LabelBinarizer()),
    (['Status'], LabelBinarizer()),
    (['CarOwner'], LabelBinarizer()),
    (['Paymethod'], LabelBinarizer()),
    (['MembershipPlan'], LabelBinarizer()),
    (['Children'],  StandardScaler()),
    (['EstIncome'],  StandardScaler()),
    (['Age'],  StandardScaler()),
    (['AvgMonthlySpend'],  StandardScaler()),
    (['CustomerSupportCalls'],  StandardScaler())], default=False)
    
    # Instantiate the Classifier
    random_forest = RandomForestClassifier(random_state=5)

    # Define the steps in the pipeline to sequentially apply a list of transforms and the estimator, i.e. RandomForestClassifier
    steps = [('mapper', mapper_good),('RandonForestClassifier', random_forest)]
    pipeline = sklearn.pipeline.Pipeline(steps)

    # train the model
    model=pipeline.fit( X_train, y_train )
    
    # Display Label Mapping to assist with interpretation of the model
    label_mapping=le.inverse_transform([0,1])

    # Call pipeline.predict() on the X_test data to make a set of test predictions
    y_prediction = pipeline.predict( X_test )

    # Evaluate the predictions using sklearn.metrics.classification_report()
    report = sklearn.metrics.classification_report( y_test, y_prediction )
    
    parameters = { 'RandonForestClassifier__max_depth': [5,8,10],
               'RandonForestClassifier__n_estimators': [150,180,200]}
    
    grid_obj = GridSearchCV(estimator=model, param_grid=parameters,  cv=3)
    
    # Fit the grid search object to the training data and find the optimal parameters using fit()
    grid_fit = grid_obj.fit(X_train,y_train)
    
    # Get the estimator
    best_clf = grid_fit.best_estimator_
    
    # Predict on the test set with the best estimator found by the grid search
    best_predictions = best_clf.predict(X_test)
    
    best_predictions_report = sklearn.metrics.classification_report( y_test, best_predictions )
    
    print('Results of best fitted model: \n\n',best_predictions_report)
    
    # Get accuracy and roc_auc values to save as metrics in Databand
    accuracy = accuracy_score(y_test, best_predictions)
    roc_score = roc_auc_score(y_test, best_predictions)
    
    # DATABAND
    log_metric('customer_churn_build_accuracy', accuracy)
    log_metric('customer_churn_build_roc', roc_score)
    # END DATABAND
    
    # Names of the features after the mapper's transformations
    m_step=pipeline.named_steps['mapper']
    features = m_step.transformed_names_
    
    # Get the feature importances from the fitted random forest (the whole forest, not a single tree)
    importances = pipeline.named_steps['RandonForestClassifier'].feature_importances_
    indices = np.argsort(importances)
    
    # DATABAND
    # Log feature importance in Databand
    # Convert the importances to a pandas DataFrame so that they can be logged
    importances_pd = pd.DataFrame.from_dict({'feature': np.array(features)[indices], 'importances_score': importances[indices]}).sort_values(by=['importances_score'], ascending=False)
    with dataset_op_logger("CPDaaS://MLOps_Deployment/FeatureImportance", "read", with_schema=True, with_preview=True) as logger:
        logger.set(data=importances_pd)
    # END DATABAND
    
    plt.figure(1)
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], color='b',align='center')
    plt.yticks(range(len(indices)), (np.array(features))[indices])
    plt.xlabel('Relative Importance')
    
    return pipeline
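
The function above computes the ROC AUC from hard 0/1 predictions. ROC AUC is usually computed from scores or probabilities instead, which can give a different value. A sketch of the two variants on synthetic data (the dataset and the model settings are assumptions made for this example, not the churn data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the churn dataset
X, y = make_classification(n_samples=400, n_features=8, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

clf = RandomForestClassifier(n_estimators=50, random_state=5).fit(X_train, y_train)

# AUC from hard class labels, as in the function above
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))

# AUC from the predicted probability of the positive class
auc_from_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(auc_from_labels, auc_from_scores)
```

Whichever variant is logged, using the same one across runs keeps the metric comparable in Databand.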
    

In [None]:
@task
def save_model_in_project(pipeline):
    
    from ibm_watson_machine_learning import APIClient

    # IMPORTANT
    # Replace with your Cloud API key and location. The Cloud API key is available in your IBM Cloud dashboard under Manage - IAM (top menu bar)
    api_key = 'insert_api_key'
    location = 'insert_location_url'  # For example, Dallas location is 'https://us-south.ml.cloud.ibm.com'


    wml_credentials = {
        "apikey": api_key,
        "url": location
    }

    client = APIClient(wml_credentials)
    
    client.set.default_project(pc.projectID)
    
    # Provide metadata and save the model into the repository. After running this cell, the model will be displayed in the Assets view

    # Model Metadata

    model_name = 'customer_churn_model'
    software_spec_uid = client.software_specifications.get_uid_by_name('runtime-22.1-py3.9')

    metadata = {
        client.repository.ModelMetaNames.NAME: model_name,
        client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
        client.repository.ModelMetaNames.TYPE: "scikit-learn_1.0"
    }

    stored_model_details = client.repository.store_model(pipeline,
                                                   meta_props=metadata,
                                                   training_data=X_train,
                                                   training_target=y_train)
      

In [None]:
def buildCustomerChurnModel():

    # DATABAND
    # Start Databand tracking
    # TODO: Update the Databand URL and token
    with dbnd_tracking(
            conf={
                "core": {
                    "databand_url": "insert_url",
                    "databand_access_token": "insert_token",

                }
            },
            job_name = "buildCustomerChurnModel",
            run_name = "weekly",
            project_name = "Customer Analytics",
    ):

        # Call the step job - train model
        pipeline = train_model(trainingData)
        
        # Save the model
        save_model_in_project(pipeline)


        print("Finished running the model building notebook")
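
Rather than pasting the Databand URL and token into the notebook, the `conf` dictionary could be built from environment variables. A minimal sketch; `DATABAND_URL` and `DATABAND_TOKEN` are hypothetical variable names chosen for this example, and `databand_conf` is a helper invented here, not part of the Databand SDK:

```python
import os

def databand_conf(env=os.environ):
    """Build the conf dict in the shape passed to dbnd_tracking above.

    DATABAND_URL and DATABAND_TOKEN are hypothetical names used for this sketch.
    """
    required = ("DATABAND_URL", "DATABAND_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "core": {
            "databand_url": env["DATABAND_URL"],
            "databand_access_token": env["DATABAND_TOKEN"],
        }
    }

# Example with an explicit mapping instead of the real environment
conf = databand_conf({"DATABAND_URL": "https://example.invalid", "DATABAND_TOKEN": "dummy"})
print(conf["core"]["databand_url"])
```

The returned dictionary has the same structure as the `conf` argument in the cell above, so it could be passed in its place.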


In [None]:
# Invoke the model training and saving functions
buildCustomerChurnModel()

**In this version of the notebook we will perform deployment steps in the UI.**

**Author:**  Elena Lowery and Catherine Cao <br/>
**Date:**  August 31, 2022