# Sklearn: sparse regression

## Predicting Breast Cancer

Sklearn includes the Wisconsin breast cancer dataset. It pairs the diagnosis of each observed tumor with several measured characteristics of that tumor. Can a machine learn to predict whether a tumor is benign or malignant?

__Import the Breast Cancer Dataset from sklearn. Describe it.__

In [1]:
import sklearn
import sklearn.datasets
# the as_frame option makes the function return a dataframe
dataset = sklearn.datasets.load_breast_cancer(as_frame=True)

In [2]:
data = dataset['data']
target = dataset['target']
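The cells above load the dataset but do not yet describe it. A minimal sketch of one way to do so, assuming only the standard fields of the object returned by `load_breast_cancer` (`DESCR`, `target_names`); the 500-character cutoff is an arbitrary choice for readability:

```python
import sklearn.datasets

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data = dataset["data"]
target = dataset["target"]

# the DESCR field holds a plain-text description of the dataset
print(dataset["DESCR"][:500])

# number of observations and number of features
print(data.shape)

# names of the two classes: the target encodes them as 0 and 1, in this order
print(list(dataset["target_names"]))

# number of observations in each class
class_counts = target.value_counts()
print(class_counts)
```

In sklearn's encoding, 0 stands for malignant and 1 for benign, so the classes are moderately unbalanced (212 malignant against 357 benign).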

__Properly train a linear logistic regression to predict whether a tumor is malignant.__

In [3]:
# separate the training set and the testset
import sklearn.model_selection
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(data, target)

In [4]:
# quickly check that the sizes of the samples correspond to what we want:
[e.shape for e in [data_train, data_test, target_train, target_test]]

[(426, 30), (143, 30), (426,), (143,)]
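The sizes above follow from the default `test_size=0.25`: a quarter of the 569 observations, rounded up, gives 143 test observations and leaves 426 for training. The sketch below makes that default explicit. The `random_state` and `stratify` arguments are additions for illustration (reproducible split, same class proportions in both sets); they are not used in the original cell.

```python
import sklearn.datasets
import sklearn.model_selection

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data, target = dataset["data"], dataset["target"]

# random_state and stratify are assumptions of this sketch, not part of the original cell
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    data, target, test_size=0.25, random_state=0, stratify=target
)

shapes = [e.shape for e in [data_train, data_test, target_train, target_test]]
print(shapes)
```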

In [5]:
import sklearn.linear_model
model = sklearn.linear_model.LogisticRegression()

In [6]:
model.fit(data_train, target_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
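The warning offers two remedies: raising `max_iter` or scaling the data. A minimal sketch of the first one follows. The value `max_iter=10000` is an arbitrary, generous budget chosen for this example (the default is 100), and `random_state=0` is only there to make the split reproducible.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    dataset["data"], dataset["target"], random_state=0
)

# max_iter=10000 is an assumption of this sketch, not a recommended value
model = sklearn.linear_model.LogisticRegression(max_iter=10000)
model.fit(data_train, target_train)

# n_iter_ reports how many iterations the solver actually used
print(model.n_iter_)

accuracy = model.score(data_test, target_test)
print(accuracy)
```

Scaling the features usually lets the solver converge in far fewer iterations, which is why the rest of the notebook takes that route.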


In [7]:
# We can check the performance out of sample:

In [8]:
model.score(data_test, target_test)

0.8951048951048951

In [9]:
# to know what the scores represent, we can read the doc
# it shows that score is measured by mean accuracy
# i.e. number of correct predictions divided by total number of predictions
model.score?

Signature: model.score(X, y, sample_weight=None)
Docstring:
Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.

Parameters
----------
X : array-like of shape (n_samples, n_features)
    Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True labels for `X`.

sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.

Returns
-------
score : float
    Mean accuracy of ``self.predict(X)`` wrt. `y`.
File:      /opt/conda/envs/escpython/lib/python3.10/site-packages/sklearn/base.py
Type:      method

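To make the definition concrete, the mean accuracy can be recomputed by hand from the predictions. This is a small sketch, not part of the original notebook; `max_iter=10000` and `random_state=0` are assumptions made so that the example runs reproducibly on unscaled data.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    dataset["data"], dataset["target"], random_state=0
)

model = sklearn.linear_model.LogisticRegression(max_iter=10000)
model.fit(data_train, target_train)

# accuracy = number of correct predictions / total number of predictions
predictions = model.predict(data_test)
manual_accuracy = numpy.mean(predictions == target_test)

print(manual_accuracy)
print(model.score(data_test, target_test))  # same number
```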
__Bonus__: the warning message suggests scaling the data. Let's redo the last few steps accordingly.


In [10]:
import sklearn.preprocessing
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

In [11]:
# let's repackage in a dataframe
import pandas
scaled_data = pandas.DataFrame(scaled_data, columns=data.columns)
# and check the result has zero mean and unit standard deviation
scaled_data.describe()


statistic,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,-1.373633e-16,6.868164e-17,-1.248757e-16,-2.185325e-16,-8.366672e-16,1.873136e-16,4.995028e-17,-4.995028e-17,1.74826e-16,4.745277e-16,...,-8.241796e-16,1.248757e-17,-3.746271e-16,0.0,-2.372638e-16,-3.371644e-16,7.492542e-17,2.247763e-16,2.62239e-16,-5.744282e-16
std,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,...,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088,1.00088
min,-2.029648,-2.229249,-1.984504,-1.454443,-3.112085,-1.610136,-1.114873,-1.26182,-2.744117,-1.819865,...,-1.726901,-2.223994,-1.693361,-1.222423,-2.682695,-1.443878,-1.305831,-1.745063,-2.16096,-1.601839
25%,-0.6893853,-0.7259631,-0.6919555,-0.6671955,-0.7109628,-0.747086,-0.7437479,-0.7379438,-0.7032397,-0.7226392,...,-0.6749213,-0.7486293,-0.6895783,-0.642136,-0.6912304,-0.6810833,-0.7565142,-0.7563999,-0.6418637,-0.6919118
50%,-0.2150816,-0.1046362,-0.23598,-0.2951869,-0.03489108,-0.2219405,-0.3422399,-0.3977212,-0.0716265,-0.1782793,...,-0.2690395,-0.04351564,-0.2859802,-0.341181,-0.04684277,-0.2695009,-0.2182321,-0.2234689,-0.1274095,-0.2164441
75%,0.4693926,0.5841756,0.4996769,0.3635073,0.636199,0.4938569,0.5260619,0.6469351,0.5307792,0.4709834,...,0.5220158,0.6583411,0.540279,0.357589,0.5975448,0.5396688,0.5311411,0.71251,0.4501382,0.4507624
max,3.971288,4.651889,3.97613,5.250529,4.770911,4.568425,4.243589,3.92793,4.484751,4.910919,...,4.094189,3.885905,4.287337,5.930172,3.955374,5.112877,4.700669,2.685877,6.046041,6.846856
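The reported standard deviation is 1.00088 rather than exactly 1 because the two libraries use different conventions: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). The ratio between the two is $\sqrt{n/(n-1)}$. The sketch below checks this on a synthetic feature (made up for illustration) with the same number of observations as the dataset.

```python
import numpy

n = 569  # number of observations in the dataset
rng = numpy.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=n)  # synthetic feature, for illustration only

# standardize the way StandardScaler does: population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

population_std = z.std(ddof=0)  # equals 1 up to rounding error
sample_std = z.std(ddof=1)      # the convention used by pandas' describe()

print(population_std)
print(sample_std)
print(numpy.sqrt(n / (n - 1)))  # theoretical ratio between the two conventions
```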


In [12]:
# for compatibility with the previous cells, we save the scaled dataframe as data
data = scaled_data

In [13]:
# and redo the same training

In [14]:
# separate the training set and the testset
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(data, target)

In [15]:
import sklearn.linear_model
model = sklearn.linear_model.LogisticRegression()

In [16]:
model.fit(data_train, target_train)  # this time, we don't get any convergence warning

In [17]:
# and the prediction actually improves (which might just be chance, since the split is random)

In [18]:
model.score(data_test, target_test)

0.972027972027972
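In the cells above the scaler is fitted on the whole dataset before the split, so the test observations influence the means and standard deviations used for preprocessing. One common alternative, sketched here under the assumption that a fixed `random_state=0` is acceptable, is to chain the scaler and the classifier in a pipeline, so that the scaler is fitted on the training set only.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    dataset["data"], dataset["target"], random_state=0
)

# the pipeline fits the scaler on the training data only,
# then applies the same transformation to any data it predicts on
model = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.linear_model.LogisticRegression(),
)
model.fit(data_train, target_train)

accuracy = model.score(data_test, target_test)
print(accuracy)
```

On a dataset of this size the effect on the score is typically small, but the pipeline keeps the out-of-sample evaluation clean.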

__Use k-fold validation to validate the model__

In [19]:
# because the dataset is relatively small we didn't set aside a validation set
# instead we rely on cross-validation

# we split the dataset in 5 folds
# each fold (20% of the observations) serves once as a testset for a model trained on the remaining 80%

In [20]:
kf = sklearn.model_selection.KFold(n_splits=5)

In [21]:
scores = []

for i_train, i_test in kf.split(data):
    
    # i_train and i_test are indices of observations belonging to one of the two datasets
    kf_data_train = data.iloc[i_train,:]
    kf_target_train = target.iloc[i_train]
    
    kf_data_test = data.iloc[i_test,:]
    kf_target_test = target.iloc[i_test]
    
    model_kf = sklearn.linear_model.LogisticRegression()
    
    # we train the model
    model_kf.fit(kf_data_train, kf_target_train)
    
    # and test it
    sc = model_kf.score(kf_data_test, kf_target_test)
    
    scores.append(sc)
    
    print(f"Score: {sc}")

Score: 0.9736842105263158
Score: 0.956140350877193
Score: 0.9824561403508771
Score: 0.9824561403508771
Score: 0.9911504424778761


There is some volatility in the scores, but accuracy stays reliably above 95%.

In [22]:
# to get an estimate of accuracy we can compute the mean:
print(f"KFold validation: mean accuracy {sum(scores)/len(scores)}")

KFold validation: mean accuracy 0.9771774569166279
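The manual loop above can also be written with `sklearn.model_selection.cross_val_score`, which handles the splitting, fitting and scoring. The sketch below is one possible variant, not the notebook's method: it wraps the scaler and the classifier in a pipeline, so the scores may differ slightly from those printed above, where the data was scaled once beforehand.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data, target = dataset["data"], dataset["target"]

kf = sklearn.model_selection.KFold(n_splits=5)

# scaling happens inside each fold, on the training part only
model = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.linear_model.LogisticRegression(),
)

# returns one accuracy score per fold
cv_scores = sklearn.model_selection.cross_val_score(model, data, target, cv=kf)

print(cv_scores)
print(f"KFold validation: mean accuracy {cv_scores.mean()}")
```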


__Try with other classifiers. Which one is best?__

The dataset being relatively small, we can try Support Vector Machines, which are known to generalize well (see discussion [here](https://towardsdatascience.com/text-classification-with-extremely-small-datasets-333d322caee2)).

We perform k-fold validation exactly as above.

In [23]:
kf = sklearn.model_selection.KFold(n_splits=5)

In [24]:
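# explicit import, so that the cell does not rely on sklearn.svm being loaded as a side effect
import sklearn.svm

scores_svc = []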

for i_train, i_test in kf.split(data):
    
    # i_train and i_test are indices of observations belonging to one of the two datasets
    kf_data_train = data.iloc[i_train,:]
    kf_target_train = target.iloc[i_train]
    
    kf_data_test = data.iloc[i_test,:]
    kf_target_test = target.iloc[i_test]
    
    # we just change the following line
    model_kf = sklearn.svm.SVC()
    
    # we train the model
    model_kf.fit(kf_data_train, kf_target_train)
    
    # and test it
    sc = model_kf.score(kf_data_test, kf_target_test)
    
    scores_svc.append(sc)
    
    print(f"Score: {sc}")

Score: 0.9473684210526315
Score: 0.9649122807017544
Score: 0.9736842105263158
Score: 0.9912280701754386
Score: 0.9734513274336283


In [25]:
# to get an estimate of accuracy we can compute the mean:
print(f"KFold validation: mean accuracy {sum(scores_svc)/len(scores_svc)}")

KFold validation: mean accuracy 0.9701288619779538
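The same procedure extends to any number of classifiers. The sketch below loops over a few candidates with `cross_val_score`; the choice of candidates (k-nearest neighbors and a random forest in addition to the two models above) and the `random_state=0` of the forest are assumptions made for illustration, and all models use default hyperparameters.

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.linear_model
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data, target = dataset["data"], dataset["target"]

kf = sklearn.model_selection.KFold(n_splits=5)

# candidate classifiers, all with default hyperparameters
candidates = {
    "logistic regression": sklearn.linear_model.LogisticRegression(),
    "support vector machine": sklearn.svm.SVC(),
    "k-nearest neighbors": sklearn.neighbors.KNeighborsClassifier(),
    "random forest": sklearn.ensemble.RandomForestClassifier(random_state=0),
}

results = {}
for name, classifier in candidates.items():
    # scale inside the pipeline so each fold is preprocessed on its training part only
    pipeline = sklearn.pipeline.make_pipeline(
        sklearn.preprocessing.StandardScaler(), classifier
    )
    fold_scores = sklearn.model_selection.cross_val_score(pipeline, data, target, cv=kf)
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean accuracy {fold_scores.mean():.4f} (std {fold_scores.std():.4f})")
```

Reporting the standard deviation next to each mean helps judge whether the gap between two models is larger than the fold-to-fold noise.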


__Comment__: the performance of the support vector machine is similar to that of the logistic regression. To assess the gain, we can compare the difference between the two mean accuracies (0.007) with the standard deviation of the scores of either model. Both standard deviations are greater than 0.01, meaning that the difference between the two models is probably not significant.

In [26]:
# we can compute the standard deviation as follows (search for "standard deviation python")

import numpy
print( numpy.std(scores) )
print( numpy.std(scores_svc) )

0.01188053806820839
0.0142415326274357
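The comparison described in the comment can be reproduced directly from the fold scores printed above. The sketch below copies those scores by hand; the fold-by-fold differences are an extra check added here, not part of the original analysis.

```python
import numpy

# scores copied from the two k-fold outputs above
scores = [0.9736842105263158, 0.956140350877193, 0.9824561403508771,
          0.9824561403508771, 0.9911504424778761]
scores_svc = [0.9473684210526315, 0.9649122807017544, 0.9736842105263158,
              0.9912280701754386, 0.9734513274336283]

# difference between the two mean accuracies
mean_gap = numpy.mean(scores) - numpy.mean(scores_svc)
print(f"difference of means: {mean_gap:.4f}")

# fold-to-fold dispersion of each model
print(f"std logistic regression: {numpy.std(scores):.4f}")
print(f"std support vector machine: {numpy.std(scores_svc):.4f}")

# fold-by-fold differences: the sign changes from one fold to the next
fold_gaps = numpy.array(scores) - numpy.array(scores_svc)
print(fold_gaps)
```

The difference of means is smaller than either standard deviation, and the support vector machine wins on two of the five folds, which is consistent with the conclusion that neither model is clearly better.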
