# PCA_1


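# Index
[Exploring the housing dataset](#Exploring-the-housing-dataset)

[Standardizing the data](#Standardizing-the-data)

[Computing the eigenvalues](#Computing-the-eigenvalues)

[Listing and sorting all eigenvalues](#Listing-and-sorting-all-eigenvalues)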

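## Exploring the housing dataset

Load the housing dataset into a DataFrame.

The housing dataset contains 506 samples with 14 attributes (13 features plus the target, MEDV):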
     
<pre>
1. CRIM      per capita crime rate by town
2. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
3. INDUS     proportion of non-retail business acres per town
4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
5. NOX       nitric oxides concentration (parts per 10 million)
6. RM        average number of rooms per dwelling
7. AGE       proportion of owner-occupied units built prior to 1940
8. DIS       weighted distances to five Boston employment centres
9. RAD       index of accessibility to radial highways
10. TAX      full-value property-tax rate per $10,000
11. PTRATIO  pupil-teacher ratio by town
12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
13. LSTAT    % lower status of the population
14. MEDV     Median value of owner-occupied homes in $1000s
</pre>

In [None]:
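import pandas as pd

# The file is whitespace-delimited and has no header row.
# A raw string avoids the invalid-escape warning for '\s'.
df = pd.read_csv('housing.data.txt',
                 header=None,
                 sep=r'\s+')

# Assign the 14 column names listed above
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 
              'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df.head()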

## Splitting the dataset into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.iloc[:, :-1].values  # features: the first 13 columns (indices 0 to 12)
y = df['MEDV'].values  # label: the last column (MEDV)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
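
To make the 70/30 split concrete, here is a minimal NumPy-only sketch of a shuffled split. It is not how `train_test_split` is implemented. The arrays are synthetic stand-ins with the same shape as the housing data, every name ending in `_demo` is hypothetical, and rounding the test size up is a choice made for this sketch.

```python
import numpy as np

# Synthetic stand-in with the same shape as the housing features (not the real data)
rng = np.random.default_rng(1)
n_samples = 506
X_demo = rng.normal(size=(n_samples, 13))
y_demo = rng.normal(size=n_samples)

# Reserve 30% of the rows for testing, rounding up
n_test = int(np.ceil(0.30 * n_samples))

# Shuffle the row indices once, then slice them into the two subsets
perm = rng.permutation(n_samples)
test_idx, train_idx = perm[:n_test], perm[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape)  # (354, 13) (152, 13)
```

Features and labels are indexed with the same shuffled indices, so each row of `X` stays paired with its label.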

## Standardizing the data

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
print('Standardized training set \n%s' % X_train_std)
print('Standardized test set \n%s' % X_test_std)
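
The cell above fits the scaler on the training set and reuses it for the test set. As a rough illustration of why that matters, the sketch below standardizes by hand with NumPy: the mean and standard deviation come from the training rows only and are then applied to both subsets. The data is synthetic and all `_demo` and `_manual` names are hypothetical.

```python
import numpy as np

# Synthetic stand-in for the train/test features (not the housing data)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

# Per-column statistics are estimated from the training set only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same shift and scale are applied to both subsets
X_train_manual = (X_train_demo - mu) / sigma
X_test_manual = (X_test_demo - mu) / sigma

# Training columns end up with mean 0 and std 1 (up to rounding);
# test columns are only approximately centered, because they reuse training statistics
print(X_train_manual.mean(axis=0), X_train_manual.std(axis=0))
print(X_test_manual.mean(axis=0), X_test_manual.std(axis=0))
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.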

## Computing the eigenvalues

In [None]:
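import numpy as np

# np.cov expects variables in rows, so pass the transposed feature matrix
cov_mat = np.cov(X_train_std.T)
# Eigendecomposition of the 13x13 covariance matrix
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

print('\nEigenvalues \n%s' % eigen_vals)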
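
A covariance matrix is symmetric, so `np.linalg.eigh` can be used in place of `np.linalg.eig`; it returns real eigenvalues in ascending order. The sketch below, on synthetic data with hypothetical `_demo` names, checks that both routines give the same spectrum, that each pair satisfies the eigenvalue equation, and that the eigenvalues sum to the trace of the covariance matrix.

```python
import numpy as np

# Synthetic standardized data (not the housing data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# 4x4 covariance matrix of the features
cov_demo = np.cov(X_demo_std.T)

# General-purpose solver versus the solver for symmetric matrices
vals_eig, vecs_eig = np.linalg.eig(cov_demo)
vals_eigh, vecs_eigh = np.linalg.eigh(cov_demo)  # ascending order

# Both solvers find the same eigenvalues once sorted
same_spectrum = np.allclose(np.sort(vals_eig.real), vals_eigh)

# Each column v of vecs_eigh satisfies cov @ v = lambda * v
satisfies_definition = np.allclose(cov_demo @ vecs_eigh, vecs_eigh * vals_eigh)

# The eigenvalues add up to the total variance (trace of the covariance matrix)
trace_matches = np.isclose(vals_eigh.sum(), np.trace(cov_demo))

print(same_spectrum, satisfies_definition, trace_matches)
```

Because `eigh` returns eigenvalues in ascending order, they would need to be reversed before ranking components from largest to smallest.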

## Listing and sorting all eigenvalues

In [None]:
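# Total variance is the sum of all eigenvalues
tot = sum(eigen_vals)
# Fraction of the total variance explained by each component, largest first
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
# Running total of the explained variance ratios
cum_var_exp = np.cumsum(var_exp)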
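
As a small worked example of the same arithmetic, the sketch below uses four made-up eigenvalues (chosen so the ratios are exact in floating point) rather than the ones computed from the housing data. The `_demo` names are hypothetical.

```python
import numpy as np

# Made-up eigenvalues for illustration only; their sum is 8
eigen_vals_demo = np.array([1.0, 4.0, 2.0, 1.0])

# Sort from largest to smallest, then divide by the total
sorted_vals = np.sort(eigen_vals_demo)[::-1]
var_exp_demo = sorted_vals / sorted_vals.sum()

# Running total of the ratios
cum_var_exp_demo = np.cumsum(var_exp_demo)

print(var_exp_demo.tolist())      # [0.5, 0.25, 0.125, 0.125]
print(cum_var_exp_demo.tolist())  # [0.5, 0.75, 0.875, 1.0]
```

Here the first component alone accounts for half of the total variance, and the first two together account for three quarters.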

In [None]:
import matplotlib.pyplot as plt

# One bar per principal component, indexed from 1
component_idx = range(1, len(var_exp) + 1)

plt.bar(component_idx, var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(component_idx, cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('demo7_2.png', dpi=300)
plt.show()
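
One common way to read the cumulative curve is to pick the smallest number of components that reaches a target share of the variance. The helper below is a hypothetical sketch of that idea, not part of the original notebook. Its name, the threshold values, and the example eigenvalues are all assumptions made for illustration.

```python
import numpy as np

def components_for_threshold(eigen_vals, threshold):
    """Smallest k such that the top-k eigenvalues explain at least `threshold` of the variance."""
    vals = np.sort(np.asarray(eigen_vals, dtype=float))[::-1]
    cum = np.cumsum(vals) / vals.sum()
    # searchsorted returns the first index whose cumulative ratio is >= threshold
    k = int(np.searchsorted(cum, threshold)) + 1
    # Guard against rounding pushing k past the number of components
    return min(k, len(vals))

# Made-up eigenvalues: cumulative ratios are 0.5, 0.75, 0.875, 1.0
demo_vals = [1.0, 4.0, 2.0, 1.0]
k_50 = components_for_threshold(demo_vals, 0.5)
k_80 = components_for_threshold(demo_vals, 0.8)
print(k_50, k_80)  # 1 3
```

With the notebook's own variables, the call would be `components_for_threshold(eigen_vals, 0.9)` for a 90% target; the appropriate threshold depends on the application.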