# Churn Prediction with Text

Customer churn is a problem faced by a wide range of companies, from
telecommunications to banking, where customers are typically lost to
competitors. It's in a company's best interest to retain existing
customers instead of acquiring new ones because it usually costs
significantly more to attract new customers. When trying to retain
customers, companies often focus their efforts on customers who are more
likely to leave. User behaviour and customer support chat logs can
contain valuable indicators of the likelihood of a customer ending the
service. In this solution, we train and deploy a churn prediction model
that uses a state-of-the-art natural language processing model to find
useful signals in text. In addition to textual inputs, this model uses
traditional structured data inputs such as numerical and categorical
fields.

In this notebook, we'll train, deploy and use a churn prediction model
that processes numerical, categorical and textual features to make its
prediction.

**Note**: When running this notebook on SageMaker Studio, you should make
sure the 'SageMaker JumpStart PyTorch 1.0' image/kernel is used. When
running this notebook on SageMaker Notebook Instance, you should make
sure the 'sagemaker-soln' kernel is used.

We start by importing a variety of packages that will be used throughout
the notebook. One of the most important packages is the Amazon SageMaker
Python SDK (i.e. `import sagemaker`). We also import modules from our own
custom (and editable) package that can be found at `../package`.

In [None]:
import boto3
from pathlib import Path
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.predictor import json_serializer, json_deserializer
import sys

sys.path.insert(0, '../package')
from package import config, utils

Up next, we define the current folder and create a SageMaker client (from
`boto3`). We can use the SageMaker client to call SageMaker APIs
directly, as an alternative to using the Amazon SageMaker SDK. We'll use
it at the end of the notebook to delete certain resources that are
created in this notebook.

In [None]:
current_folder = utils.get_current_folder(globals())
sagemaker_client = boto3.client('sagemaker')

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
!aws s3 cp --recursive --quiet $config.SOURCE_S3_PATH/data ../data

In [None]:
!aws s3 cp --recursive --quiet ../data s3://$config.S3_BUCKET/$config.DATASETS_S3_PREFIX

In [None]:
hyperparameters = {
    "n-estimators": 100,
    "numerical-feature-names": "CustServ Calls,Account Length",
    "categorical-feature-names": "plan,limit",
    "textual-feature-names": "text",
    "label-name": "y"
}

current_folder = utils.get_current_folder(globals())
estimator = PyTorch(
    framework_version='1.5.1',
    entry_point='entry_point.py',
    source_dir=str(Path(current_folder, '../containers/model').resolve()),
    hyperparameters=hyperparameters,
    role=config.IAM_ROLE,
    train_instance_count=1,
    train_instance_type=config.TRAINING_INSTANCE_TYPE,
    output_path='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX)),
    code_location='s3://' + str(Path(config.S3_BUCKET, config.OUTPUTS_S3_PREFIX)),
    base_job_name=config.SOLUTION_PREFIX,
    tags=[{'Key': config.TAG_KEY, 'Value': config.SOLUTION_PREFIX}],
    sagemaker_session=sagemaker_session,
    train_volume_size=30
)
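
In SageMaker script mode, hyperparameters like the ones above are typically passed to the entry point as command-line arguments. The contents of `entry_point.py` are not shown in this notebook, so the following is only a minimal sketch of how such a script might read them. The function name `parse_hyperparameters` and the `comma_separated` helper are assumptions made for illustration.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse command-line hyperparameters (hypothetical sketch, not the real entry point)."""

    def comma_separated(value):
        # Split "a,b" into ["a", "b"], dropping empty entries.
        return [name.strip() for name in value.split(",") if name.strip()]

    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--numerical-feature-names", type=comma_separated, default=[])
    parser.add_argument("--categorical-feature-names", type=comma_separated, default=[])
    parser.add_argument("--textual-feature-names", type=comma_separated, default=[])
    parser.add_argument("--label-name", type=str, default="y")
    # parse_known_args ignores any arguments that are not declared here.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments that correspond to the hyperparameters dictionary.
args = parse_hyperparameters([
    "--n-estimators", "100",
    "--numerical-feature-names", "CustServ Calls,Account Length",
    "--categorical-feature-names", "plan,limit",
    "--textual-feature-names", "text",
    "--label-name", "y",
])
print(args.numerical_feature_names)
```

Passing feature names as comma-separated strings keeps every hyperparameter a flat scalar, which is why a small splitting helper is needed on the receiving side.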

In [None]:
estimator.fit({
    'train': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'train')),
    'test': 's3://' + str(Path(config.S3_BUCKET, config.DATASETS_S3_PREFIX, 'test'))
})

We'll use the unique solution prefix to name the model and endpoint.

In [None]:
model_name = "{}-churn-prediction".format(config.SOLUTION_PREFIX)

In [None]:
predictor = estimator.deploy(
    endpoint_name=model_name,
    instance_type=config.HOSTING_INSTANCE_TYPE,
    initial_instance_count=1
)

When calling our new endpoint from the notebook, we use an Amazon
SageMaker SDK
[`Predictor`](https://sagemaker.readthedocs.io/en/stable/predictors.html).
A `Predictor` is used to send data to an endpoint (as part of a request),
and interpret the response. Our `estimator.deploy` command returned a
`Predictor` but, by default, it will send and receive numpy arrays. Our
endpoint expects to receive (and also sends) JSON formatted objects, so
we modify the `Predictor` to use JSON instead of the PyTorch endpoint
default of numpy arrays. JSON is used here because it is a standard
endpoint format and the endpoint response can contain nested data
structures.
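
To make the JSON round trip concrete, here is a small sketch of what a JSON serializer and deserializer do conceptually, using only the standard library. The payload and the response body below are invented for illustration; the only field assumed from this notebook is `probability`.

```python
import json

# Illustrative request payload (values are made up for this sketch).
payload = {
    "CustServ Calls": 3.0,
    "plan": "D",
    "text": "I would like to cancel my contract.",
}

# Serializing: the Python dictionary becomes a JSON string for the request body.
request_body = json.dumps(payload)

# Deserializing: a JSON response body becomes a Python dictionary again.
# This response body is hypothetical, not real endpoint output.
response_body = '{"probability": 0.25}'
response = json.loads(response_body)

print("{:.2%} probability of churn".format(response["probability"]))
```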

In [None]:
predictor.content_type = 'application/json'
predictor.accept = 'application/json'
predictor.serializer = json_serializer
predictor.deserializer = json_deserializer

With our model successfully deployed and our predictor configured, we can
try out the churn prediction model on example inputs.

In [None]:
data = {
    "CustServ Calls": -20.0,
    "Account Length": 133.12,
    "plan": "D",
    "limit": "unlimited",
    'text': "Well, I've been dealing with TelCom for three months now, and I feel like they're very helpful and responsive to my issues, but for a month now, I've only had one technical support call and that was very long and involved. My phone number was wrong on both contracts, and they gave me a chance to work with TelCom customer service and it was extremely helpful, so I've decided to stick with it. But I would like to have more help in terms of technical support, I haven't had the kind of help with my phone line and I don't have the type of tech support I want. So I would like to negotiate a phone contract, maybe an upgrade from a Sprint plan, or maybe from a Verizon plan.\nTelCom Agent: Very good."
}
response = predictor.predict(data=data)

We have the response and we can print out the probability of churn.

In [None]:
print("{:.2%} probability of churn".format(response['probability']))

**Caution**: the probability returned by this model has not been
calibrated. When the model gives a probability of churn of 20%,
for example, this does not necessarily mean that 20% of the customers
who were assigned that probability actually churned. Calibration is a useful
property in certain circumstances, but is not required in cases where
discrimination between cases of churn and non-churn is sufficient.
[CalibratedClassifierCV](https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html)
from
[Scikit-learn](https://scikit-learn.org/stable/modules/calibration.html)
can be used to calibrate a model.
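
One way to check whether predictions are calibrated is to group them into probability bins and compare the mean predicted probability in each bin with the observed churn rate. The sketch below does this in plain Python. The function name `reliability_table` and the sample probabilities and labels are made up for illustration and are not outputs of this model.

```python
def reliability_table(probabilities, labels, n_bins=5):
    """Compare mean predicted probability with observed churn rate per bin.

    Returns one (bin_index, count, mean_predicted, observed_rate) tuple
    for every non-empty equal-width probability bin.
    """
    bins = [[] for _ in range(n_bins)]
    for probability, label in zip(probabilities, labels):
        # Clamp so that a probability of exactly 1.0 lands in the last bin.
        index = min(int(probability * n_bins), n_bins - 1)
        bins[index].append((probability, label))

    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        rows.append((index, len(members), mean_predicted, observed_rate))
    return rows


# Invented predictions and outcomes (1 = churned, 0 = stayed).
probabilities = [0.05, 0.15, 0.25, 0.35, 0.65, 0.75, 0.85, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

rows = reliability_table(probabilities, labels)
for index, count, mean_predicted, observed_rate in rows:
    print(index, count, round(mean_predicted, 2), round(observed_rate, 2))
```

For a well-calibrated model, the two rates are close in every bin. Large gaps suggest that a calibration step would help if the probabilities themselves, rather than just the ranking of customers, are used downstream.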

## Clean Up

When you've finished with the churn prediction endpoint (and associated
endpoint-config), make sure that you delete them to avoid accidental
charges.

In [None]:
predictor.delete_endpoint()
predictor.delete_model()
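
As an alternative to the `predictor` calls above, the same resources can be removed with the `boto3` SageMaker client created at the start of the notebook. The following is a minimal sketch: the helper name `delete_endpoint_resources` is an assumption made for illustration, and it relies on the `describe_endpoint`, `describe_endpoint_config`, `delete_endpoint`, `delete_endpoint_config` and `delete_model` client operations.

```python
def delete_endpoint_resources(client, endpoint_name):
    """Delete an endpoint, its endpoint config and the models it references.

    `client` is expected to be a boto3 SageMaker client.
    """
    # Look up the endpoint config and model names before deleting anything.
    endpoint = client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    endpoint_config = client.describe_endpoint_config(
        EndpointConfigName=config_name
    )
    model_names = [
        variant["ModelName"] for variant in endpoint_config["ProductionVariants"]
    ]

    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        client.delete_model(ModelName=name)
    return config_name, model_names


# Example call (run it instead of, not after, the predictor calls above):
# delete_endpoint_resources(sagemaker_client, model_name)
```

Collecting the names first matters because the endpoint config can no longer be looked up through the endpoint once the endpoint has been deleted.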