# Principal Component Analysis of Breast Cancer Dataset

## Introduction

This notebook is my version of the code in the tutorial at https://www.datacamp.com/tutorial/principal-component-analysis-in-python, which was written by Aditya Sharma in December 2019.

I have renamed some variables.

## Resource

- [scikit-learn principal component analysis](https://scikit-learn.org/stable/modules/decomposition.html#decompositions)

## Analysis

### Prepare the Data

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.decomposition
import sklearn.preprocessing

# Load the data and labels.
breast = sklearn.datasets.load_breast_cancer()
breast_data = breast.data
print(breast_data.shape)
breast_labels = breast.target
print(breast_labels)
print(breast_labels.shape)

# Reshape breast_labels to concatenate it with breast_data before creating
# a DataFrame that will have both the data and the labels.
labels = np.reshape(breast_labels, (569, 1))
# Concatenate the data and labels along the second axis to create a numpy
# ndarray with shape (569, 31).
final_breast_data = np.concatenate([breast_data, labels], axis=1)
print(final_breast_data.shape)

breast_df = pd.DataFrame(final_breast_data)

# Create DataFrame column labels from breast.feature_names.
# Since there are 30 features, we need a column label for the
# labels column in final_breast_data.
features = breast.feature_names
print(features)
features_labels = np.append(features, "label")
breast_df.columns = features_labels

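# The original labels are in 0, 1 format. In scikit-learn's encoding (see
# breast.target_names), 0 means "malignant" and 1 means "benign". Replace the
# numeric values in the label column with readable names. Assigning the result
# back avoids calling replace(..., inplace=True) on a selected column.
breast_df["label"] = breast_df["label"].replace({0: "Malignant", 1: "Benign"})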
breast_df.tail()
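
It is worth confirming which integer corresponds to which class, since a swapped mapping would mislabel everything downstream, including the final plot. A minimal sketch of that check, using only the dataset's own `target_names` and `target` attributes (the variable names here are mine):

```python
import numpy as np
import sklearn.datasets

dataset = sklearn.datasets.load_breast_cancer()

# target_names[i] is the class name for integer code i.
class_counts = {}
for code, name in enumerate(dataset.target_names):
    class_counts[str(name)] = int(np.sum(dataset.target == code))
    print(code, name, class_counts[str(name)])
```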

Skip loading the CIFAR-10 images from keras for now.

### Standardize the Data
Standardization rescales each feature to have a mean of 0 and a standard deviation of 1.
Normalization (min-max scaling) rescales each feature so that its values fall between 0 and 1.
See https://www.statology.org/standardization-vs-normalization/.
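
To make the difference concrete, here is a small NumPy-only sketch of both rescalings. The toy array is made up for illustration and is not part of the tutorial. It divides by the population standard deviation (NumPy's default, `ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy data (made up): two features on very different scales.
data = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
])

# Standardization: subtract each column's mean, divide by its standard deviation.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Min-max normalization: map each column onto the interval [0, 1].
col_min = data.min(axis=0)
col_max = data.max(axis=0)
normalized = (data - col_min) / (col_max - col_min)

print(standardized.mean(axis=0), standardized.std(axis=0))
print(normalized.min(axis=0), normalized.max(axis=0))
```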

After standardizing the data, it would be informative to do some EDA to look
at the distribution of each feature before and after scaling.

In [None]:
# Here x contains the original data, and y contains the standardized data.
x = breast_df.loc[:, features].values
print(x.shape)
print(x)
print()
y = sklearn.preprocessing.StandardScaler().fit_transform(x)
print(y.shape)
print(y)
print()
# Print the mean and standard deviation over all values, before and after
# scaling. A per-column check (axis=0) would be more informative, because
# StandardScaler rescales each feature separately.
print(np.mean(x), np.std(x))
print(np.mean(y), np.std(y))
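
Since `StandardScaler` works feature by feature, a column-by-column check is the more direct test. The sketch below reloads the dataset so that it runs on its own; `features_raw` and `features_std` are names I chose and play the roles of `x` and `y` above.

```python
import numpy as np
import sklearn.datasets
import sklearn.preprocessing

features_raw = sklearn.datasets.load_breast_cancer().data
features_std = sklearn.preprocessing.StandardScaler().fit_transform(features_raw)

# Reduce along axis 0 to get one mean and one standard deviation per feature.
col_means = features_std.mean(axis=0)
col_stds = features_std.std(axis=0)

print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```

Every column mean should be 0 and every column standard deviation 1, up to floating-point rounding.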

In [None]:
# The tutorial keeps referring to the "normalized" data when it's actually
# standardized data.
# Create new column labels for the features.
feat_cols = ["feature" + str(i) for i in range(y.shape[1])]
standardized_breast_df = pd.DataFrame(y, columns=feat_cols)
standardized_breast_df.tail()

### Perform Principal Component Analysis

In [None]:
# It's time to use Principal Component Analysis.
pca_breast = sklearn.decomposition.PCA(n_components=2)
breast_pca = pca_breast.fit_transform(y)

# Create a DataFrame that will have the principal component values for all
# 569 samples.
breast_pca_df = \
    pd.DataFrame(
        data = breast_pca,
        columns = [
            "principal component 1",
            "principal component 2",
        ],
    )
breast_pca_df.tail()

Next, look at `explained_variance_ratio_`. It gives the fraction of the total variance (the "information") that each principal component retains after the data is projected onto a lower-dimensional subspace.

When I used three components, the third component held 9.39% of the variance.
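
As a rough illustration of where these ratios come from, the sketch below computes them by hand with NumPy: each ratio is one eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The data is synthetic and made up for illustration. With `n_components=2`, scikit-learn reports only the first two of these ratios, so they sum to less than 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): 200 samples, 3 features with different spreads.
samples = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5])

# Center the data and compute the feature covariance matrix.
centered = samples - samples.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigvalsh returns eigenvalues in ascending order; reverse to get descending.
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Each component's share of the total variance.
ratio = eigenvalues / eigenvalues.sum()
print(ratio)
```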

In [None]:
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))

Plot the 569 samples along the principal component 1 and principal component 2 axes. The plot should give good insight into how the samples are distributed between the two classes.

From the plot, you can observe that the two classes, when projected onto a two-dimensional space, are linearly separable to some extent. One class is also visibly more spread out than the other. Which name belongs to which cluster depends on the label encoding: in scikit-learn's dataset, `breast.target_names` is `['malignant', 'benign']`, so 0 is malignant and 1 is benign, and the more widely spread class is the malignant one.

### Plot the Results

In [None]:
# Create a single figure. Calling plt.figure() twice would leave an extra
# empty figure behind.
plt.figure(figsize=(10, 10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset", fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = breast_df['label'] == target
    plt.scatter(
        breast_pca_df.loc[indicesToKeep, 'principal component 1'],
        breast_pca_df.loc[indicesToKeep, 'principal component 2'],
        c = color,
        s = 50)
plt.legend(targets, prop={'size': 15})
plt.show()