Exercise Objectives:

- PCA (Principal Component Analysis)
    

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

    ftp ftp.cs.wisc.edu
    cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
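
The field numbering described above can be reproduced with a short sketch. The column-name style (`radius_mean`, `radius_se`, `radius_worst`) is an assumption based on the names used by the code later in this notebook, such as `concave points_mean` and `area_worst`; the helper `field_name` is hypothetical.

```python
# Base features in the order a) to j) listed above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three statistics are stored per base feature, in this order.
STATISTICS = ["mean", "se", "worst"]


def field_name(field):
    """Map a 1-based field number (3 to 32) to a column name (hypothetical helper)."""
    if not 3 <= field <= 32:
        raise ValueError("feature fields are numbered 3 through 32")
    # Fields 3-12 hold the means, 13-22 the standard errors, 23-32 the worst values.
    stat, base = divmod(field - 3, len(BASE_FEATURES))
    return f"{BASE_FEATURES[base]}_{STATISTICS[stat]}"


for field in (3, 13, 23):
    print(field, field_name(field))
```

Fields 1 and 2 (ID number and diagnosis) are not features, which is why the helper rejects them.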

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
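
Because the classes are imbalanced, it helps to know what accuracy a classifier would reach by always predicting the majority class. The sketch below computes that baseline from the counts above; the variable names are illustrative only.

```python
# Class counts taken from the dataset description above.
benign, malignant = 357, 212

total = benign + malignant
# Accuracy of a classifier that always predicts the larger class.
baseline_accuracy = max(benign, malignant) / total

print(total)                        # 569
print(round(baseline_accuracy, 3))  # 0.627
```

Any model score reported later in the notebook is only meaningful if it clearly exceeds this baseline.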

Kaggle kernels (results from other people): https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/kernels


In [None]:
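def readData():
    import pandas as pd
    df = pd.read_csv('cancer.txt')

    # drop columns 0 and 32 (assumed to be the ID and an empty trailing column)
    cancer = df.drop(df.columns[[0, 32]], axis=1)
    # encode the target: benign (B) -> 0, malignant (M) -> 1
    # use == rather than `in`, which is a substring test on strings
    y_cancer = cancer['diagnosis'].apply(lambda x: 0 if x == 'B' else 1)
    # everything except the diagnosis column is a feature
    X_cancer = cancer.drop(columns=['diagnosis'])
    return X_cancer, y_cancer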

In [None]:
def returnColumnNames():
    (X,y) = readData()
    print('X: ', X.columns.values, '\ny: ',y.name)

In [None]:
def decisionTree():
    
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    
    scaler = MinMaxScaler()
    (X,y) = readData()
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                       random_state = 0)
    # fit the scaler on the training set only, then apply it to both sets
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    tree = DecisionTreeClassifier(max_depth=2,random_state=0).fit(X_train_scaled, y_train)
    print(tree.feature_importances_)
    # returnColumnNames() prints the names itself and returns None,
    # so it must not be wrapped in print()
    returnColumnNames()
    # (training accuracy, test accuracy)
    return (tree.score(X_train_scaled, y_train),tree.score(X_test_scaled, y_test))

decisionTree()
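
The scaling step in decisionTree() learns the minimum and maximum from the training set and reuses them on the test set. A minimal pure-Python sketch of that idea for a single feature follows. The helper names `fit_min_max` and `apply_min_max` and the sample values are hypothetical, and this is not scikit-learn's implementation.

```python
def fit_min_max(train_values):
    """Learn the scaling parameters from the training data only."""
    return min(train_values), max(train_values)


def apply_min_max(values, lo, hi):
    """Rescale values with (x - lo) / (hi - lo), using the training lo and hi."""
    span = hi - lo
    if span == 0:
        # Simplification: a constant feature cannot be rescaled this way.
        raise ValueError("feature is constant in the training set")
    return [(v - lo) / span for v in values]


train = [10.0, 20.0, 30.0]
test = [15.0, 40.0]

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.5]
```

The test value 40.0 maps to 1.5 because it lies outside the training range. That is expected: the test set must not influence the scaling parameters.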

In [None]:
def plotTopTwo():
    import matplotlib.pyplot as plt
    
    (X,y) = readData()
    # copy the two columns so that adding 'diagnosis' does not write to a view of X
    df = X[['concave points_mean','area_worst']].copy()
    df['diagnosis'] = y
    benign = df[df['diagnosis'] == 0]
    malignant = df[df['diagnosis'] == 1]
    
    # green = benign, blue = malignant
    plt.plot(benign['concave points_mean'],benign['area_worst'],'go',label='benign')
    plt.plot(malignant['concave points_mean'],malignant['area_worst'],'bo',label='malignant')
    plt.xlabel('concave points_mean')
    plt.ylabel('area_worst')
    plt.legend()
    plt.show()

plotTopTwo()

In [None]:
# PCA (Principal Component Analysis) reduces the dimension of the dataset
# so the data can be plotted and better understood.
# It uses Singular Value Decomposition to project the data onto a lower-dimensional space.
# Notice that the plot is similar to the plot of the top two features derived from the DecisionTree.
# PCA does not use the labels, so it is also used in unsupervised learning.
def PCAAnalysis():
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    
    (X,y) = readData()
    pca = PCA(n_components=2)
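    X_pca = pca.fit_transform(X)
    # plot each class with one call instead of one call per point
    benign = (y == 0).to_numpy()
    plt.plot(X_pca[benign,0],X_pca[benign,1],'go',label='benign')
    plt.plot(X_pca[~benign,0],X_pca[~benign,1],'bo',label='malignant')
    plt.legend()
    plt.show()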
    
PCAAnalysis()
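
To make the projection step less of a black box, here is a small dependency-free sketch that finds the first principal component of a toy dataset. It is only an illustration under stated assumptions: it uses power iteration on the covariance matrix rather than the Singular Value Decomposition that scikit-learn uses, and the function name `first_principal_component` and the sample points are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # The uneven start vector lowers the chance of starting orthogonal
    # to the dominant direction.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("data has no variance along the start direction")
        v = [x / norm for x in w]
    # Project the centered data onto the direction found.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy data lying exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
component, scores = first_principal_component(points)
print([round(x, 4) for x in component])
print([round(s, 4) for s in scores])
```

For points on the line y = x, the direction found is the diagonal, and the one-dimensional scores keep the ordering and spacing of the original points. PCAAnalysis() does the same kind of projection onto two directions for the 30 features.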