Linear classifier (logistic regression) to predict the probability of breast cancer recurrence. Data:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Prognostic)

@author feBueno, June 2020
fernando.bueno.gutie@gmail.com

In [0]:
%tensorflow_version 2.x 
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import clear_output
from six.moves import urllib

import tensorflow.compat.v2.feature_column as fc

import tensorflow as tf

Load the data and define the train/test sets

In [0]:
data_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.data', header=None)
data_df.columns = data_df.columns.map(str)#set column names as strings

data_df=data_df.rename(columns = {'0':'Id','1':'Recurrence','2':'Time'})#rename the first 3 columns

#define dependent variable as numeric
data_df['Recurrence']=data_df['Recurrence'].str.replace('N', '0')#R = 1 = recurrent, N = 0 = nonrecurrent
data_df['Recurrence']=data_df['Recurrence'].str.replace('R', '1')
data_df['Recurrence']=data_df['Recurrence'].astype('int64')

#random sample 70% of observations as training set
train_df=data_df.sample(frac=0.7)
test_df=data_df[~data_df.index.isin(train_df.index)]

#separate dependent variable
y_train_series = train_df.pop('Recurrence')
y_test_series = test_df.pop('Recurrence')

print(train_df[5:9])
print(y_train_series)
print(train_df.shape)
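
The split above draws a different random 70% on every run, so the reported accuracy changes between runs. A minimal sketch of a reproducible variant follows. It assumes a fixed `random_state` is acceptable; the toy table and the seed value 42 are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real table (10 rows).
toy_df = pd.DataFrame({
    "Id": range(10),
    "Recurrence": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "Time": [5, 12, 30, 7, 61, 18, 44, 9, 27, 50],
})

# Fixing random_state makes the 70/30 split identical on every run.
toy_train = toy_df.sample(frac=0.7, random_state=42)
# drop() returns a new DataFrame, so pop() below does not touch toy_df.
toy_test = toy_df.drop(toy_train.index)

# Separate the dependent variable, as in the cell above.
y_toy_train = toy_train.pop("Recurrence")
y_toy_test = toy_test.pop("Recurrence")

print(toy_train.shape, toy_test.shape)
```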


Some quick data exploration on the train set. Boxplots of all variables for the complete dataset can be found at https://www.openml.org/d/191

In [0]:
print(data_df.median(numeric_only=True))#median of all numeric variables, as in https://www.openml.org/d/191
y_train_series.value_counts().plot(kind='barh')#Dependent variable bar chart (class counts); 'Recurrence' was popped from train_df above

In [0]:
train_df.iloc[:, 1].hist(bins=20)#histogram of the 1st predictor: 'Time'

In [0]:
train_df.iloc[:, 2].hist(bins=20)#histogram of the 2nd predictor: column '3'

Feature columns that will be passed to the classifier

In [0]:
NUMERIC_COLUMNS = list(train_df.columns[2:33])#columns '3' to '33': skips 'Id' and 'Time', and the last predictor, which has missing data, so we avoid it for now

feature_columns = []

for feature_name in NUMERIC_COLUMNS:
  feature_columns.append(tf.feature_column.numeric_column(str(feature_name), dtype=tf.float32))
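
The predictor with missing data is left out above. One way it could be brought back in is sketched below. The sketch assumes missing entries are encoded as the string "?" (check this against the raw file before relying on it) and uses median imputation, which is only one of several reasonable choices. The series `raw` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical values imitating a column where "?" marks a missing entry.
raw = pd.Series(["5", "?", "0", "2", "?", "1"], name="last_predictor")

# errors="coerce" turns anything non-numeric into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

# Median imputation: replace each NaN with the median of the observed values.
filled = numeric.fillna(numeric.median())

print(int(numeric.isna().sum()), filled.tolist())
```

To avoid leaking information from the test set, the median would normally be computed on the training rows only and then applied to both sets.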

Define a function that returns an input function for the estimator. The input function must return the data as a tf.data.Dataset of (features, label) batches

In [0]:
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=20):
  def input_function():  #inner function, this will be returned
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))  #create tf.data.Dataset object from the features and their labels
    if shuffle:
      ds = ds.shuffle(1000)  #randomize order of data
    ds = ds.batch(batch_size).repeat(num_epochs)  #split dataset into batches of batch_size and repeat for num_epochs epochs
    return ds  #return the batched dataset
  return input_function  #return a function object for use

train_input_fn = make_input_fn(train_df, y_train_series)  #build the input function; the estimator calls it later to obtain the dataset
eval_input_fn = make_input_fn(test_df, y_test_series, num_epochs=1, shuffle=False)
print(train_input_fn)
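
To see what the shuffle, batch, repeat order produces, the sketch below imitates it in plain Python. It is not tf.data and is not how TensorFlow implements the pipeline; `toy_batches` and its `seed` parameter are hypothetical names used only for illustration.

```python
import random

def toy_batches(rows, num_epochs=10, shuffle=True, batch_size=20, seed=0):
    """Yield lists of rows in the order shuffle -> batch -> repeat."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(rows)
        if shuffle:
            rng.shuffle(order)  # a fresh order for every epoch
        # Batching happens inside each epoch, so the last batch may be short.
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

sizes = [len(b) for b in toy_batches(range(10), num_epochs=2, batch_size=4)]
print(sizes)  # [4, 4, 2, 4, 4, 2]
```

Because batching comes before repeating, batches never mix rows from two epochs, and each epoch ends with a smaller batch whenever the number of rows is not a multiple of the batch size.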

Create classifier

In [0]:
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
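
For a two-class problem, a linear classifier of this kind computes a weighted sum of the features and maps it to a probability with the logistic (sigmoid) function. The sketch below shows that computation in plain Python. The function `predict_proba`, the weights and the inputs are hypothetical and are not values learned by `linear_est`.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of class 1 for one example under a logistic linear model."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights and inputs, chosen so that the score is exactly 0.
p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(p)  # 0.5
```

A score of 0 gives a probability of 0.5; large positive scores approach 1 and large negative scores approach 0.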

Train the model and print accuracy

In [0]:
linear_est.train(train_input_fn)  #train
result = linear_est.evaluate(eval_input_fn)  #get model metrics/stats by evaluating on the test data

clear_output()  #clears console output
print(result['accuracy'])  #print accuracy from dict of model stats
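
Accuracy is easier to interpret next to a baseline. When one class is more frequent than the other, always predicting the frequent class already scores well. The sketch below computes that majority-class baseline; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical test labels: 6 non-recurrent (0) and 2 recurrent (1).
toy_labels = [0, 0, 1, 0, 0, 1, 0, 0]
print(majority_baseline_accuracy(toy_labels))  # 0.75
```

With the notebook's variables, the same function could be called as `majority_baseline_accuracy(list(y_test_series))` and compared with `result['accuracy']`.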

Predict recurrence probabilities on the test set

In [0]:
pred_dicts = list(linear_est.predict(eval_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])

probs.plot(kind='hist', bins=20, title='predicted probabilities')
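
The histogram shows probabilities, not class labels. Turning them into labels requires a cutoff. The sketch below uses 0.5 as the conventional default and shows how a lower cutoff flags more cases as recurrent. The function `to_labels` and the toy probabilities are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (recurrent) or 0 (non-recurrent)."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities of recurrence.
toy_probs = [0.10, 0.55, 0.49, 0.90]
print(to_labels(toy_probs))       # [0, 1, 0, 1]
print(to_labels(toy_probs, 0.3))  # [0, 1, 1, 1]
```

With the notebook's variables, `to_labels(probs)` would give one predicted label per test observation.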