### 1 Business problem statement > 2 Prepare dataset > 3 Model training and evaluation > 4 Automatic model tuning > 5 Deployment > 6 AWS Auto Scaling > 7 Relative cost of errors

#### Losing customers is costly for any business. Identifying unhappy customers early gives you a chance to offer them incentives to stay. If a telecommunications company knows that a particular customer is considering leaving, it can offer timely incentives, perhaps a phone upgrade or a discount on the monthly fee, to encourage the customer to continue service. Incentives are often more cost-effective than losing a customer and acquiring a new one.

!head './churn.txt'

In [None]:
import pandas as pd
churn = pd.read_csv('./churn.txt')
churn

In [None]:
churn.describe(include='all')

In [1]:
import numpy as np
import matplotlib.pyplot as plt

#### Create histograms to see how the values of individual attributes are distributed, and compute summary statistics for the numeric attributes, such as the mean, min, max, and standard deviation.

In [None]:
# show frequency tables for each categorical feature and counts of unique values
for column in churn.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=churn[column],
                        columns='% observations',
                        normalize='columns'))
    print("# of unique values {}".format(churn[column].nunique()))
# show summary statistics
display(churn.describe())
# build histograms for each numeric feature
%matplotlib inline
hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))

In [None]:
# plot histograms, the correlation matrix, and pairwise scatter plots
# to check how the numeric features relate to one another
churn.hist()
display(churn.corr(numeric_only=True))
pd.plotting.scatter_matrix(churn, figsize=(20, 20))
plt.show()

In [None]:
import seaborn as sn
# heatmap of the correlations between numeric features
sn.heatmap(churn.corr(numeric_only=True))

#### Several features have essentially 100 percent correlation with one another. Including such feature pairs can cause catastrophic problems in some machine learning algorithms, while in others it only introduces minor redundancy and bias. We also need to remove the columns that are useless for our purpose: the Phone and Area code attributes.

In [None]:
churn = churn.drop(['Phone','Area code'], axis=1)

#### Next, remove one feature from each of the highly correlated pairs: Day Charge: Day Mins, Eve Charge: Eve Mins, Night Charge: Night Mins, Intl Charge: Intl Mins

In [None]:
churn = churn.drop(['Day Charge','Eve Charge','Night Charge', 'Intl Charge'], axis=1)
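The highly correlated pairs above were read off the correlation matrix and heatmap by eye. As a minimal sketch of how such pairs could be listed programmatically, the helper below (a hypothetical function, not part of the original notebook) scans the upper triangle of the absolute correlation matrix. The toy data and the 0.17 rate are made up for illustration and are not taken from `churn.txt`.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.99):
    """Return (feature_a, feature_b, correlation) for numeric column pairs
    whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Toy data (assumption): the charge is a fixed rate times the minutes,
# so the two columns are perfectly correlated
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Day Mins": rng.uniform(0, 300, size=200)})
toy["Day Charge"] = toy["Day Mins"] * 0.17
toy["Day Calls"] = rng.integers(50, 150, size=200)

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data, calling `highly_correlated_pairs(churn)` before the drop should surface the same charge/minutes pairs, assuming those columns are numeric.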

In [None]:
churn.head()
# Convert the categorical features into numeric (one-hot encoded) features
model_data = pd.get_dummies(churn, dtype='int')
# Move the target column to the front and drop its redundant complement
model_data = pd.concat([model_data['Churn?_True.'], model_data.drop(['Churn?_False.', 'Churn?_True.'], axis=1)], axis=1)
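To make the encoding step concrete, here is a small sketch on a toy frame (the values are hypothetical; only the column naming follows the notebook). It shows how `pd.get_dummies` turns the two-valued target into two complementary columns, and how one of them is kept and moved to the first position. The label goes first because SageMaker's built-in XGBoost expects CSV training data with the target in the first column and no header row.

```python
import pandas as pd

# Toy frame (assumption): one numeric feature, one categorical feature, the target
toy = pd.DataFrame({
    "Day Mins": [120.5, 80.0, 200.1],
    "VMail Plan": ["yes", "no", "no"],
    "Churn?": ["False.", "True.", "False."],
})

# One-hot encode every non-numeric column as 0/1 integers
encoded = pd.get_dummies(toy, dtype="int")

# Keep a single target column and place it first
target = "Churn?_True."
encoded = pd.concat(
    [encoded[target], encoded.drop(["Churn?_False.", target], axis=1)],
    axis=1,
)
print(list(encoded.columns))
```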

In [None]:
# Shuffle, then split the data into training (70%), validation (20%) and test (10%) sets.
# Holding out validation and test data helps detect overfitting.
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
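The split points `0.7` and `0.9` give roughly 70% training, 20% validation, and 10% test rows. The sketch below checks this on a toy frame. Note that it uses `iloc` slicing instead of `np.split`, which is a substitution made here for illustration, not the notebook's method; the resulting partitions follow the same boundaries.

```python
import pandas as pd

# Toy data (assumption): 100 rows standing in for model_data
toy = pd.DataFrame({"x": range(100)})

# Shuffle with a fixed seed so the split is reproducible
shuffled = toy.sample(frac=1, random_state=1729)
n = len(shuffled)
first_cut, second_cut = int(0.7 * n), int(0.9 * n)

train = shuffled.iloc[:first_cut]
validation = shuffled.iloc[first_cut:second_cut]
test = shuffled.iloc[second_cut:]

print(len(train), len(validation), len(test))
```

Every row lands in exactly one of the three sets, so the test set stays unseen during training and tuning.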

In [None]:
# Install or upgrade the SageMaker Python SDK (used below to upload data to Amazon S3)
import sys
! {sys.executable} -m pip install sagemaker -U

In [None]:
import os
import boto3
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'bootcamp-xgboost-churn'

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
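After the upload, the files are addressed by `s3://` URIs built from the bucket, the prefix, and the object key. As a small sketch, the helper below (a hypothetical function, with a made-up bucket name) builds such URIs with `posixpath.join`, since S3 keys always use forward slashes, whereas `os.path.join` would insert backslashes on Windows.

```python
import posixpath

def s3_uri(bucket, prefix, *parts):
    """Build an s3:// URI from a bucket, a key prefix, and further key parts."""
    return "s3://{}/{}".format(bucket, posixpath.join(prefix, *parts))

bucket = "example-bucket"          # hypothetical; the notebook uses sess.default_bucket()
prefix = "bootcamp-xgboost-churn"  # same prefix as in the notebook

train_uri = s3_uri(bucket, prefix, "train", "train.csv")
validation_uri = s3_uri(bucket, prefix, "validation", "validation.csv")
print(train_uri)
print(validation_uri)
```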

### Model Training
#### Amazon SageMaker algorithms are packaged as Docker images. This gives you the flexibility to use almost any algorithm code with Amazon SageMaker, regardless of implementation language, dependent libraries, frameworks, and so on.

In [None]:
# Set IAM Role
from sagemaker import get_execution_role
role = get_execution_role()
# Get the XGBoost docker image
from sagemaker import image_uris
container = image_uris.retrieve('xgboost',boto3.Session().region_name, '1.0-1')
display(container)

#### The SageMaker Python SDK provides high-level abstractions for working with Amazon SageMaker:
  * Estimators: Encapsulate training on SageMaker
  * Models: Encapsulate built ML models
  * Predictors: Provide real-time inference and transformation using Python data types against a SageMaker endpoint
  * Session: Provides a collection of methods for working with SageMaker resources

#### Start by creating the XGBoost Estimator. The mandatory parameters are image_uri, role, session, instance_type, and instance_count. This training job uses the parameters below.

In [None]:
# create the SageMaker Estimator object
import sagemaker
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)