https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python

Learn about PCA and how it can be leveraged to extract information from data without any supervision, using 2 popular datasets: Breast Cancer and CIFAR-10.

Where can you apply PCA?
- Data visualization: to find out how variables are correlated or to understand the distribution of a few variables. PCA helps because it projects the data into a lower-dimensional space, making it possible to visualize the data in 2D or 3D.
- Speeding up ML algorithms: since PCA's main idea is dimensionality reduction, it can cut an ML algorithm's training and testing time when the data has many features and learning is too slow.

At an abstract level, you take a dataset with many features and simplify it by deriving a few principal components from the original features. Each component is a linear combination of the original features rather than a subset of them.
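To make the idea concrete, here is a minimal sketch of PCA computed with NumPy's SVD. The data is synthetic and the variable names (`X`, `n_components`, `X_reduced`) are illustrative assumptions, not part of the tutorial; it only shows that each component is a weighted combination of all original features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features (not one of the tutorial datasets)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]  # make feature 1 strongly correlated with feature 0

# PCA operates on mean-centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 2
components = Vt[:n_components]           # shape (2, 5): weights over the 5 original features
X_reduced = X_centered @ components.T    # shape (200, 2): data projected onto the components

# Share of total variance captured by each component
explained_variance = S ** 2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(X_reduced.shape)
print(explained_variance_ratio[:n_components])
```

The two columns of `X_reduced` could then be plotted for visualization or passed to a model in place of the original five features.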

- Breast Cancer (numerical): 
    + a real-valued multivariate dataset with 2 classes: malignant and benign
    + Malignant class: 212 samples; benign class: 357 samples
    + 30 features shared across all classes: radius, texture, perimeter, area, smoothness, fractal dimension, etc.
- CIFAR-10:
    + 60,000 color images of size 32x32x3 in 10 classes, with 6,000 images per class.
    + 50,000 training images and 10,000 test images
    + Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
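The numbers in the two descriptions above imply the dimensionality PCA has to work with. The short calculation below only restates those figures; the variable names are illustrative.

```python
# Per-sample feature counts implied by the dataset descriptions
breast_features = 30
cifar_height, cifar_width, cifar_channels = 32, 32, 3
cifar_features = cifar_height * cifar_width * cifar_channels  # one value per pixel per channel

# Sample counts
breast_samples = 212 + 357          # malignant + benign
cifar_samples = 50_000 + 10_000     # training + test images
images_per_class = cifar_samples // 10

print(breast_samples, breast_features)                   # 569 30
print(cifar_samples, cifar_features, images_per_class)   # 60000 3072 6000
```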

Data Exploration

Load and analyze both datasets. The descriptions above already give an idea of the dimensionality of each.

In [9]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
breast_data = breast.data

breast_data.shape

(569, 30)

Result: a 2-D array whose outer dimension has 569 rows (samples), each with 30 elements (features).

In [2]:
breast_labels = breast.target

breast_labels.shape

(569,)

Reshape breast_labels so it can be concatenated with breast_data, then create a DataFrame that holds both the data and the labels.

In [4]:
# 2-D with 569 rows, each with 1 element
new_shape = (len(breast_labels), 1)

labels = np.reshape(breast_labels, new_shape)
labels.shape

(569, 1)
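The same column-vector reshape can be written in a few equivalent ways. The sketch below uses a small hypothetical array (`labels_1d`) in place of breast_labels so it runs on its own.

```python
import numpy as np

labels_1d = np.array([0, 1, 1, 0, 1])  # hypothetical stand-in for breast_labels

# Explicit target shape, as in the cell above
col_explicit = np.reshape(labels_1d, (len(labels_1d), 1))

# -1 lets NumPy infer the number of rows
col_inferred = labels_1d.reshape(-1, 1)

# Indexing with np.newaxis adds a trailing axis of length 1
col_newaxis = labels_1d[:, np.newaxis]

print(col_explicit.shape, col_inferred.shape, col_newaxis.shape)
```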

Concatenate the data and labels along the second axis (axis=1), so the final array joins (569 x 30) with (569 x 1) to give (569 x 31).

In [5]:
axis_col = 1
final_breast_data = np.concatenate([breast_data, labels], axis=axis_col)

final_breast_data.shape

(569, 31)
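As an alternative sketch, `np.column_stack` treats a 1-D array as a column, so the separate reshape step is not needed. The arrays below are small hypothetical stand-ins for breast_data and breast_labels.

```python
import numpy as np

data = np.arange(12, dtype=float).reshape(4, 3)  # hypothetical stand-in for breast_data
labels_1d = np.array([0, 1, 1, 0])               # hypothetical stand-in for breast_labels

# Approach used above: reshape to a column, then concatenate along axis 1
combined_concat = np.concatenate([data, labels_1d.reshape(-1, 1)], axis=1)

# Alternative: column_stack converts the 1-D labels into a column automatically
combined_stack = np.column_stack([data, labels_1d])

print(combined_concat.shape, combined_stack.shape)
```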

Create the DataFrame of the final data to represent the data in a tabular fashion.

In [10]:
breast_dataset = pd.DataFrame(final_breast_data)

Print the feature names in the breast cancer dataset.

In [11]:
features = breast.feature_names
features

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

Note that the label field is missing from the features array.

We have to add it manually, since this array will be used as the column names of the breast_dataset DataFrame.

In [14]:
features_labels = np.append(features,'label')

Assign the column names to the breast_dataset DataFrame.

In [15]:
breast_dataset.columns = features_labels

breast_dataset.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0.0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0.0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0.0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0.0
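The DataFrame creation and column naming can also be done in a single step by passing `columns=` to the constructor. This is a sketch on tiny hypothetical arrays (`values`, `feature_names`), not the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: two feature columns plus a label column
values = np.array([[1.0, 2.0, 0.0],
                   [3.0, 4.0, 1.0]])
feature_names = np.array(['mean radius', 'mean texture'])

# Append the missing 'label' name, then name the columns at construction time
columns = np.append(feature_names, 'label')
df = pd.DataFrame(values, columns=columns)

print(df.head())
```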


Since the original labels are in 0 and 1 format, we will change them to readable names. In scikit-learn's encoding (see breast.target_names), 0 is malignant and 1 is benign, which matches the class counts above (212 malignant, 357 benign).

Assign the result back to the column instead of calling replace with inplace=True on a selected column; in recent pandas versions the chained inplace call is not guaranteed to modify the DataFrame.

In [16]:
breast_dataset['label'] = breast_dataset['label'].replace({0: 'Malignant', 1: 'Benign'})

Print the last few rows of the breast_dataset

In [17]:
breast_dataset.tail()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,Malignant
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,Malignant
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,Malignant
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,Malignant
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,Benign
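One way to avoid guessing which integer corresponds to which class is to index the class-name array with the integer labels. The sketch below uses hypothetical stand-ins: `target_names` mirrors the order scikit-learn reports for this dataset in breast.target_names, and `target` is a made-up label vector.

```python
import numpy as np

# Same order as load_breast_cancer().target_names: index 0 -> malignant, index 1 -> benign
target_names = np.array(['malignant', 'benign'])

# Hypothetical integer labels standing in for breast.target
target = np.array([0, 0, 1, 0, 1, 1])

# Fancy indexing turns each integer label into its class name
label_names = target_names[target]

# Count samples per class to sanity-check against the documented class sizes
names, counts = np.unique(label_names, return_counts=True)
class_counts = dict(zip(names.tolist(), counts.tolist()))
print(class_counts)
```

On the real dataset, the same count should come out as 212 malignant and 357 benign if the mapping is correct.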


CIFAR-10 Image Data Exploration

Load CIFAR-10 using Keras, a deep learning library.

In [20]:
from keras.datasets import cifar10

ModuleNotFoundError: No module named 'keras'
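The ModuleNotFoundError above indicates that the keras package is not installed in the environment that ran this cell. As a small sketch, the standard library can report whether a package is importable before the import is attempted; the helper name `is_installed` is an illustrative assumption.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Keras is distributed both as a standalone package and as part of TensorFlow
for name in ("keras", "tensorflow"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If both are reported missing, installing one of them (for example with `pip install keras` or `pip install tensorflow`) should let the import cell above run.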