In [13]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [15]:
print(data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [16]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, random_state=11, shuffle=True)

In [23]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

In [27]:
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [28]:
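# Note: this cross-validates on the held-out test split only (143 of 569 samples).
# cross_val_score clones and refits knn on each fold, so the fit above is not reused.
scores = cross_val_score(knn, X_test, y_test, cv=kfold)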
scores

array([1.        , 1.        , 0.93333333, 0.92857143, 0.92857143,
       1.        , 1.        , 0.92857143, 0.92857143, 0.92857143])

In [29]:
print(f'Mean accuracy: {scores.mean():.2%}')

Mean accuracy: 95.76%


In [30]:
print(f'Accuracy standard deviation: {scores.std():.2%}')

Accuracy standard deviation: 3.46%
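
As a quick check on the two summary figures above, the sketch below recomputes them from the ten fold scores using only the standard library. The list `fold_scores` is copied by hand from the rounded array printed earlier, so it is an approximation of the exact values. NumPy's `std()` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev`; the sample standard deviation (`statistics.stdev`) is shown only for comparison.

```python
from statistics import fmean, pstdev, stdev

# Fold accuracies copied from the printed cross_val_score output (rounded values).
fold_scores = [1.0, 1.0, 0.93333333, 0.92857143, 0.92857143,
               1.0, 1.0, 0.92857143, 0.92857143, 0.92857143]

mean_accuracy = fmean(fold_scores)
population_std = pstdev(fold_scores)  # divides by n, like numpy's default std()
sample_std = stdev(fold_scores)       # divides by n - 1, always a bit larger

print(f'Mean accuracy: {mean_accuracy:.2%}')
print(f'Population standard deviation: {population_std:.2%}')
print(f'Sample standard deviation: {sample_std:.2%}')
```

With only ten folds the two standard deviations differ noticeably, so it helps to say which one is being reported when comparing models.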


In [37]:
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

In [38]:
estimators = {'KNeighborsClassifier': knn, 'SVC': SVC(gamma='scale'), 'GaussianNB': GaussianNB(), 
              'LogisticRegression': LogisticRegression(solver='lbfgs',random_state=0)}

In [39]:
for estimator_name, estimator_object in estimators.items():
    kfold = KFold(n_splits=10, random_state=11, shuffle=True)
    scores = cross_val_score(estimator_object, X_test, y_test, cv=kfold)
    print(f'{estimator_name:>20}: '
          f'mean accuracy={scores.mean():.2%}; '
          f'standard deviation={scores.std():.2%}')

KNeighborsClassifier: mean accuracy=95.76%; standard deviation=3.46%
                 SVC: mean accuracy=93.67%; standard deviation=6.63%
          GaussianNB: mean accuracy=95.14%; standard deviation=6.86%
  LogisticRegression: mean accuracy=93.67%; standard deviation=5.81%




KNeighborsClassifier had the highest mean accuracy (95.76%) and the lowest standard deviation (3.46%), which makes it the best choice of the four on these scores. GaussianNB came next: its mean accuracy (95.14%) was only slightly lower, but its standard deviation (6.86%) was about twice as large, so its results were less consistent across folds. SVC and LogisticRegression tied at 93.67% mean accuracy. LogisticRegression had the lower standard deviation (5.81% versus 6.63%), which makes it the slightly better of the two and leaves SVC ranked last. The gaps in mean accuracy are all smaller than the fold-to-fold standard deviations, so this ranking should be read as tentative.
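
The ranking rule used in the paragraph above (higher mean accuracy first, lower standard deviation as the tie-breaker) can be sketched as a sort. This is only an illustration: the `results` dictionary and the `rank_key` helper are hypothetical names, and the numbers are typed in from the printed output rather than recomputed.

```python
# (mean accuracy, standard deviation) per estimator, copied from the printed results.
results = {
    'KNeighborsClassifier': (0.9576, 0.0346),
    'SVC': (0.9367, 0.0663),
    'GaussianNB': (0.9514, 0.0686),
    'LogisticRegression': (0.9367, 0.0581),
}


def rank_key(item):
    """Sort by mean accuracy (descending), then standard deviation (ascending)."""
    _, (mean_accuracy, std) = item
    return (-mean_accuracy, std)


ranking = [name for name, _ in sorted(results.items(), key=rank_key)]

for position, name in enumerate(ranking, start=1):
    mean_accuracy, std = results[name]
    print(f'{position}. {name:>20}: mean={mean_accuracy:.2%}, std={std:.2%}')
```

Sorting on a tuple keeps the tie-break explicit: SVC and LogisticRegression share the same mean accuracy, so only the second element of the key separates them.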