# Principal Component Analysis (PCA)

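PCA is a dimensionality-reduction algorithm that is well suited to datasets with correlated columns. Dimensionality reduction aims to reduce the number of features in the data while keeping overall performance about the same, or even improving it.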
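To make the idea concrete, here is a minimal NumPy sketch of PCA computed through a centered SVD. It is an illustration only, not the implementation used by either library in this notebook; the helper name `pca_svd` and the toy data are assumptions made for the example.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components using a centered SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # rows are principal directions
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return Xc @ components.T, components, ratio[:n_components]

# Toy data: three strongly correlated columns driven by one latent variable.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([
    t,
    2.0 * t + 0.01 * rng.normal(size=500),
    -t + 0.01 * rng.normal(size=500),
])

projected, components, ratio = pca_svd(X, n_components=1)
print(projected.shape)   # (500, 1)
print(ratio[0] > 0.99)   # True: one component carries almost all the variance
```

Because the three columns are nearly linear functions of the same variable, a single component retains almost all of the variance, which is the situation in which PCA is most useful.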

## 1. Import the required packages

In [ ]:
import cudf
import numpy as np
from ncue.datasets import make_blobs
from ncue.decomposition import PCA as cuPCA
from sklearn.decomposition import PCA as skPCA

## 2. Define parameters

In [24]:
n_samples = 2**15
n_features = 400

n_components = 2
whiten = False
svd_solver = "full"
random_state = 23

## 3. Generate test data

In [25]:
%%time
device_data, _ = make_blobs(n_samples=n_samples, 
                            n_features=n_features, 
                            centers=5, 
                            random_state=random_state)

device_data = cudf.DataFrame.from_gpu_matrix(device_data)

CPU times: user 244 ms, sys: 13.2 ms, total: 257 ms
Wall time: 253 ms


In [26]:
# Copy the data from GPU memory to host RAM so that scikit-learn can use it and the results can be compared at the end

host_data = device_data.to_pandas()

## 4. Scikit-learn model (CPU)

### Fit the model on the data

In [27]:
%%time
pca_sk = skPCA(n_components=n_components,
               svd_solver=svd_solver,
               whiten=whiten,
               random_state=random_state)

result_sk = pca_sk.fit_transform(host_data)

CPU times: user 52 s, sys: 12.1 s, total: 1min 4s
Wall time: 4.26 s


## 5. NCUE model (GPU)

### Fit the model on the data

In [28]:
%%time
pca_ncue = cuPCA(n_components=n_components,
                 svd_solver=svd_solver,
                 whiten=whiten,
                 random_state=random_state)

result_ncue = pca_ncue.fit_transform(device_data)

CPU times: user 366 ms, sys: 443 ms, total: 809 ms
Wall time: 364 ms
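
The two `%%time` outputs above can be turned into a rough speedup figure. The numbers below are copied from those cell outputs; they come from a single run each, so treat the result as an indication rather than a benchmark.

```python
# Wall-clock times reported by %%time in the two fit_transform cells.
sk_wall_s = 4.26      # scikit-learn (CPU)
gpu_wall_s = 0.364    # NCUE (GPU), 364 ms

speedup = sk_wall_s / gpu_wall_s
print(f"{speedup:.1f}x")  # 11.7x

# CPU time (user + sys) divided by wall time for the scikit-learn run.
# A value well above 1 is consistent with the CPU run using several threads.
sk_cpu_total_s = 52.0 + 12.1
cpu_to_wall = sk_cpu_total_s / sk_wall_s
print(f"{cpu_to_wall:.1f}")  # 15.0
```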


## 6. Evaluate and compare the results

### Singular Values

In [29]:
passed = np.allclose(pca_sk.singular_values_, 
                     pca_ncue.singular_values_.to_array(), 
                     atol=0.01)
print('compare pca: ncue vs sklearn singular_values_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: ncue vs sklearn singular_values_ equal


### Explained Variance

In [30]:
passed = np.allclose(pca_sk.explained_variance_, 
                     pca_ncue.explained_variance_.to_array(), 
                     atol=1e-6)
print('compare pca: ncue vs sklearn explained_variance_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: ncue vs sklearn explained_variance_ equal
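
The singular values and the explained variance checked above are two views of the same quantity: for centered data with `n_samples` rows, each explained variance equals the squared singular value divided by `n_samples - 1`. The sketch below checks that relation with plain NumPy on random data; the helper name `explained_variance_from_singular_values` is hypothetical and is not part of either library.

```python
import numpy as np

def explained_variance_from_singular_values(singular_values, n_samples):
    """Convert singular values of the centered data into per-component variances."""
    s = np.asarray(singular_values, dtype=float)
    return s ** 2 / (n_samples - 1)

rng = np.random.default_rng(23)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

s = np.linalg.svd(Xc, compute_uv=False)
var_from_s = explained_variance_from_singular_values(s, n_samples=X.shape[0])

# Reference: eigenvalues of the sample covariance matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(var_from_s, eigvals))  # True
```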


### Explained Variance Ratio

In [31]:
passed = np.allclose(pca_sk.explained_variance_ratio_, 
                     pca_ncue.explained_variance_ratio_.to_array(), 
                     atol=1e-6)
print('compare pca: ncue vs sklearn explained_variance_ratio_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: ncue vs sklearn explained_variance_ratio_ equal


### Components

In [32]:
passed = np.allclose(pca_sk.components_, 
                     np.asarray(pca_ncue.components_.as_gpu_matrix()), 
                     atol=1e-6)
print('compare pca: ncue vs sklearn components_ {}'.format('equal' if passed else 'NOT equal'))

compare pca: ncue vs sklearn components_ equal


### Transform

In [33]:
passed = np.allclose(result_sk,
                     np.asarray(result_ncue.as_gpu_matrix()),
                     atol=1e-1)
print('compare pca: ncue vs sklearn transformed results {}'.format('equal' if passed else 'NOT equal'))

compare pca: ncue vs sklearn transformed results equal
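
One caveat when comparing transformed results across implementations: a principal component is only defined up to sign, so two correct implementations can return columns that differ by a factor of -1. The comparison above passed as is, so it was not an issue in this run. If it ever is, a sign-insensitive check along the following lines could be used. This is a sketch; `allclose_up_to_sign` is a hypothetical helper, not a function from NumPy, scikit-learn, or NCUE.

```python
import numpy as np

def allclose_up_to_sign(a, b, atol=1e-6):
    """Compare two (n_samples, n_components) arrays, allowing per-column sign flips."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # The sign of each column-wise dot product says whether b's column is flipped.
    signs = np.sign(np.sum(a * b, axis=0))
    signs[signs == 0] = 1.0
    return bool(np.allclose(a, b * signs, atol=atol))

a = np.array([[1.0, 2.0], [3.0, -4.0]])
b = a * np.array([1.0, -1.0])      # second column flipped

print(np.allclose(a, b))           # False
print(allclose_up_to_sign(a, b))   # True
```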
