# Breast Cancer Wisconsin Data Set

Create a predictive model that classifies benign vs. malignant tumors. 
See https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/data for data understanding.

## Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report

## Load and inspect data set 

In [None]:
# Fetch the file
my_file = project.get_file("data.csv")

# Read the CSV data file from the object storage into a pandas DataFrame
my_file.seek(0)
original_data = pd.read_csv(my_file)

original_data.head()

In [None]:
original_data = original_data.drop(['Unnamed: 32', 'id'], axis=1) # drop the all-NaN column 'Unnamed: 32' and the 'id' column, which is only a record identifier

In [None]:
original_data.describe(include='all') # descriptive statistics for all columns

In [None]:
original_data.isnull().sum() # check for null values

In [None]:
original_data[original_data.duplicated(keep=False)] # check for duplicate rows

There are no missing values and no duplicates, so no further cleaning is needed here.

## Inspect features

In [None]:
original_data[['radius_mean', 'diagnosis']].groupby(['diagnosis'], as_index=False).mean().sort_values(by='diagnosis', ascending=False)

Inspect more features, e.g. texture, perimeter, and so on.

In [None]:
# your code

An important step during feature selection is removing features that correlate strongly with each other. Keep only one feature as the representative of the shared information and remove the redundant ones. More advanced methods exist, but for now just look at the correlation map and decide which features to keep.

In [None]:
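f, ax = plt.subplots(figsize=(18, 18))
# correlate numeric columns only; the string column 'diagnosis' cannot be correlated directly
sns.heatmap(original_data.select_dtypes(include='number').corr(), annot=True, linewidths=0.5, fmt=".1f", ax=ax)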
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
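
Reading an 18x18 heatmap by eye is error-prone, so a list of strongly correlated pairs can help. The following is a minimal sketch on synthetic data, not on the real data set. The column names, the 0.9 threshold, and the rule "drop the second feature of each pair" are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tumor measurements (not the real data set)
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, size=200)
demo = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, size=200),  # nearly a function of radius
    "texture": rng.normal(19, 4, size=200),                          # independent of the other two
})

# Absolute correlations; keep only the upper triangle so each pair appears once
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose correlation exceeds a hypothetical threshold
threshold = 0.9
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)

# Keep the first feature of each pair as representative, drop the second
to_drop = sorted({second for _, second in high_pairs.index})
reduced = demo.drop(columns=to_drop)
print(reduced.columns.tolist())
```

Which feature of a pair to keep is a judgment call; the sketch simply keeps whichever column comes first.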

## Select predictors

In [None]:
data_reduced_features = original_data[['<your feature 1>', '<your feature 2>','...']]

In [None]:
data_reduced_features.head()

Once again, have a look at the correlation map and remove more features if necessary. 

In [None]:
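f, ax = plt.subplots(figsize=(18, 18))
# correlate numeric columns only, in case the string target column is part of the selection
sns.heatmap(data_reduced_features.select_dtypes(include='number').corr(), annot=True, linewidths=0.5, fmt=".1f", ax=ax)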
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()

## Prepare for modeling

Set X and y (predictors and target) according to your dataframe:

In [None]:
target = data_reduced_features['<your target column>']
predictors = # your code

In [None]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=123) # 80-20 split into training and test data
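
When you modify the split, the `stratify` argument of `train_test_split` is one option worth knowing: it keeps the class proportions the same in both subsets. The sketch below uses synthetic labels (60 benign, 40 malignant) that are assumptions for illustration, not the proportions of the real data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 benign ('B') and 40 malignant ('M') cases
y_demo = np.array(["B"] * 60 + ["M"] * 40)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo requests the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

print("share of M in training set:", np.mean(y_tr == "M"))
print("share of M in test set:", np.mean(y_te == "M"))
```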

Use StandardScaler to scale your predictors (fit on training set and transform training and test set):

In [None]:
scaler = # your code 
# your code

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
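
The reason for fitting on the training set only is that the test set should stay unseen: its mean and standard deviation must not leak into preprocessing. A minimal sketch of the pattern on synthetic arrays follows; the names `demo_scaler`, `train_demo`, and `test_demo` are hypothetical and unrelated to the notebook variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic training and test arrays (not the notebook's data)
rng = np.random.default_rng(1)
train_demo = rng.normal(loc=50, scale=10, size=(80, 2))
test_demo = rng.normal(loc=50, scale=10, size=(20, 2))

demo_scaler = StandardScaler()
demo_scaler.fit(train_demo)  # learn mean and standard deviation from training rows only

# Apply the same training statistics to both subsets
train_scaled = demo_scaler.transform(train_demo)
test_scaled = demo_scaler.transform(test_demo)

print(train_scaled.mean(axis=0))  # essentially zero by construction
print(test_scaled.mean(axis=0))   # near zero, but generally not exactly zero
```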

## Classification models and evaluation metrics

Create a decision tree classifier: 

In [None]:
# your code
# your code

print('train performance')
print(classification_report(y_train, tree.predict(X_train))) 
print('test performance')
print(classification_report(y_test, tree.predict(X_test)))

How do you evaluate this result (hints: overfitting vs. underfitting, which metric might be important for the use case and why)? 
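
One common argument for this use case is that a missed malignant tumor (a false negative) tends to be more costly than a false alarm, which would make recall for class `M` a metric to watch. The sketch below shows how to compute it with `pos_label`. The label lists are hypothetical and are not results from this notebook.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predictions (not results from this notebook)
y_true_demo = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred_demo = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]

# pos_label="M" treats malignant as the positive class
recall_m = recall_score(y_true_demo, y_pred_demo, pos_label="M")
precision_m = precision_score(y_true_demo, y_pred_demo, pos_label="M")

print(f"recall (M): {recall_m:.2f}")        # 3 of 4 malignant cases found
print(f"precision (M): {precision_m:.2f}")  # 3 of 4 'M' predictions correct
```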

In [None]:
conf_mat = confusion_matrix(y_test, tree.predict(X_test))
df_cm = pd.DataFrame(conf_mat, index=['B', 'M'], columns=['B', 'M'])
fig = plt.figure(figsize=[10,7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=14)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=14)
plt.ylabel('True label')
plt.xlabel('Predicted label')
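
The hard-coded `['B', 'M']` index relies on `confusion_matrix` sorting the labels alphabetically. As a hedged sketch, passing `labels=` explicitly makes that order visible and lets you unpack the four cells by name. The label lists are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (not results from this notebook)
y_true_demo = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred_demo = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]

# Rows are true labels, columns are predicted labels, both in the order given here
labels = ["B", "M"]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# With 'M' as the positive class: tn, fp in the first row and fn, tp in the second
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"missed malignant cases (false negatives): {fn}")
```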

Create a logistic regression model: 

In [None]:
# your code
# your code
print('train performance')
print(classification_report(y_train, logreg.predict(X_train)))
print('test performance')
print(classification_report(y_test, logreg.predict(X_test)))

How do you evaluate this result? 

In [None]:
conf_mat = confusion_matrix(y_test, logreg.predict(X_test))
df_cm = pd.DataFrame(conf_mat, index=['B', 'M'], columns=['B', 'M'])
fig = plt.figure(figsize=[10,7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=14)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=14)
plt.ylabel('True label')
plt.xlabel('Predicted label')

Feel free to try out more classifiers (don't forget to import the required packages!), change classifier parameters, modify the train and test split, or select other predictors.
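
One way to compare several classifiers without relying on a single split is cross-validation. The sketch below is an illustration on synthetic data from `make_classification`. The choice of candidates, their parameters, and recall as the score are assumptions, not recommendations for the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the tumor data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=123
)

# Hypothetical candidates; the pipeline refits the scaler inside every fold
candidates = {
    "decision tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=123),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=123),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated recall for the positive class (label 1)
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="recall")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean recall = {scores.mean():.3f}")
```

Wrapping the scaler and the model in one pipeline keeps the "fit on training data only" rule intact within each fold.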

# Deployment

Deploy the logistic regression and/or decision tree model via the _Watson Machine Learning_ (WML) service on IBM Cloud. 

In [None]:
# import Python client library (documentation available at http://ibm-wml-api-pyclient.mybluemix.net/)
from ibm_watson_machine_learning import APIClient

In [None]:
# set your IBM Cloud API key 
api_key = "..."

# set the URL of your WML instance 
# depending on the region you chose during instance creation it will take one of the below values:
# - Frankfurt: https://eu-de.ml.cloud.ibm.com
# - Dallas: https://us-south.ml.cloud.ibm.com
# - London: https://eu-gb.ml.cloud.ibm.com
# - Tokyo: https://jp-tok.ml.cloud.ibm.com
wml_url = "https://us-south.ml.cloud.ibm.com"

In [None]:
# setup the API client
wml_client = APIClient({
   "url": wml_url,
   "apikey": api_key
})


In [None]:
# list all existing deployment spaces
wml_client.spaces.list()

In [None]:
# set the id of the deployment space you want to use as default
wml_client.set.default_space("...")

In [None]:
# setup required properties to store the model
software_spec_uid = wml_client.software_specifications.get_id_by_name("default_py3.7")
metadata = {
    wml_client.repository.ModelMetaNames.NAME: 'Cancer Model',
    wml_client.repository.ModelMetaNames.TYPE: 'scikit-learn_0.23',
    wml_client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}
metadata

In [None]:
# assign your favorite model to the deployment_classifier variable
deployment_classifier = #your code
deployment_classifier

In [None]:
# store the scikit-learn model in WML
model = wml_client.repository.store_model(deployment_classifier, meta_props=metadata)

In [None]:
# review available models in your WML instance
wml_client.repository.list()

In [None]:
# retrieve the id of the model you just stored
published_model_uid = wml_client.repository.get_model_uid(model)
published_model_uid

In [None]:
# setup required properties to deploy the model
metadata = {
    wml_client.deployments.ConfigurationMetaNames.NAME: "Deployment of Cancer model",
    wml_client.deployments.ConfigurationMetaNames.ONLINE: {}
}

In [None]:
# deploy the model as a web service
# (an API endpoint is generated for your deployment so your tools and apps
# can use a REST API to send data to your deployed model for analysis)
created_deployment = wml_client.deployments.create(published_model_uid, name="Cancer Deployment", meta_props=metadata)

In [None]:
# keep the REST API endpoint for evaluation
scoring_endpoint = wml_client.deployments.get_scoring_href(created_deployment)
scoring_endpoint

## Deployment validation

Use the stored deployment to make a prediction.

In [None]:
# review original data
original_data.head(2)

In [None]:
# review predictors
predictors.head(2)

In [None]:
# setup the request payload as per the API documentation
# the model was trained on scaled predictors, so scale the rows with the fitted scaler first
scoring_values = scaler.transform(predictors.iloc[0:2]).tolist()
payload_scoring = {"input_data": [{"values": scoring_values}]}

In [None]:
# create a token to make an authenticated request
import requests

token_response = requests.post('https://iam.eu-de.bluemix.net/identity/token', data={"apikey": api_key, "grant_type": 'urn:ibm:params:oauth:grant-type:apikey'})
mltoken = token_response.json()["access_token"]

In [None]:
# send the scoring request
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}
response_scoring = requests.post(f'{scoring_endpoint}?version=2020-10-10', json=payload_scoring, headers=header)
response_scoring.json()

Do the results match your expectation? Were they classified correctly?

In [None]:
# use the local model to make the same prediction in your notebook and compare the results
# (apply the same scaling as for the request payload)
deployment_classifier.predict(scaler.transform(predictors.iloc[0:2]))
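
To compare the two results programmatically, the request payload and the response can be handled with two small helpers. This is an offline sketch: the helper names are hypothetical, and the response shape (`predictions` with `fields` and `values`) and the optional `fields` entry in the payload are assumptions about the scoring API that you should check against the actual response. The numbers are illustrative only.

```python
import pandas as pd


def build_payload(frame):
    """Wrap DataFrame rows in the 'input_data' structure used for scoring."""
    return {
        "input_data": [
            {"fields": frame.columns.tolist(), "values": frame.to_numpy().tolist()}
        ]
    }


def extract_predictions(response_json):
    """Pull the predicted labels out of a response in the assumed shape."""
    first = response_json["predictions"][0]
    label_pos = first["fields"].index("prediction")
    return [row[label_pos] for row in first["values"]]


# Illustrative rows (assumed values, not taken from the notebook)
demo_frame = pd.DataFrame({"radius_mean": [17.99, 20.57], "texture_mean": [10.38, 17.77]})
payload = build_payload(demo_frame)
print(payload)

# Hypothetical response in the assumed shape
demo_response = {
    "predictions": [
        {
            "fields": ["prediction", "probability"],
            "values": [["M", [0.1, 0.9]], ["M", [0.2, 0.8]]],
        }
    ]
}
print(extract_predictions(demo_response))
```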

Delete the newly created artifacts.

In [None]:
# your code