# PCA - Principal Components Analysis
***

## 1 Dataset: Energy efficiency


### 1.1 Description
We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ in glazing area, glazing area distribution, and orientation, among other parameters. We simulate various settings as functions of these characteristics to obtain 768 building shapes. The dataset therefore comprises 768 samples and 8 features, and the goal is to predict two real-valued responses. It can also be treated as a multi-class classification problem if each response is rounded to the nearest integer.
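
As a minimal sketch of the rounding idea mentioned above, the snippet below turns a few real-valued loads into integer class labels. The values in `heating_load` are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical heating-load values (illustrative only, not from the dataset)
heating_load = np.array([15.55, 20.84, 21.46, 28.28, 32.31])

# Round each value to the nearest integer to obtain a discrete class label
heating_class = np.rint(heating_load).astype(int)

# Count how many distinct classes the rounding produced
n_classes = np.unique(heating_class).size
print(heating_class, n_classes)
```

Rounding in this way can produce many sparsely populated classes, which is one reason this notebook instead derives a single binary target in section 3.1.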


### 1.2 Attribute Information:

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses. 

Specifically: 
1. X1	Relative Compactness 
2. X2	Surface Area 
3. X3	Wall Area 
4. X4	Roof Area 
5. X5	Overall Height 
6. X6	Orientation 
7. X7	Glazing Area 
8. X8	Glazing Area Distribution 
9. y1	Heating Load 
10. y2	Cooling Load
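
The short column codes can be mapped to the descriptive names listed above. The sketch below assumes the spreadsheet uses the codes `X1`...`X8`, `Y1`, and `Y2` as column headers; `demo` is a hypothetical one-row stand-in frame with made-up values.

```python
import pandas as pd

# Mapping from short column codes to the descriptive names in the list above
column_names = {
    'X1': 'Relative Compactness',
    'X2': 'Surface Area',
    'X3': 'Wall Area',
    'X4': 'Roof Area',
    'X5': 'Overall Height',
    'X6': 'Orientation',
    'X7': 'Glazing Area',
    'X8': 'Glazing Area Distribution',
    'Y1': 'Heating Load',
    'Y2': 'Cooling Load',
}

# Tiny stand-in frame with the same column codes; the values are placeholders
demo = pd.DataFrame([[0.0] * 10], columns=list(column_names))

# rename returns a new frame and leaves the original untouched
demo_named = demo.rename(columns=column_names)
print(demo_named.columns.tolist())
```

Later cells index the data by `Y1` and `Y2`, so a renamed copy like this is best kept for display only.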

### 1.3 Link 
https://archive.ics.uci.edu/ml/datasets/Energy+efficiency

***
## 2 Load libraries and dataset

### 2.1 Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score

%matplotlib inline

### 2.2 Load .xlsx file

In [None]:
dataset = pd.read_excel('ENB2012_data.xlsx')

# Show the first 5 rows
dataset.head()

***
## 3 Prepare data

### 3.1 Generate new class attribute

In [None]:
# Generate the binary class attribute: is Y1 + Y2 at or above the average? 1 = yes / 0 = no
threshold = dataset['Y1'].mean() + dataset['Y2'].mean()
dataset['aboveAVG'] = np.where((dataset['Y1'] + dataset['Y2']) >= threshold, 1, 0)
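
The threshold logic can be checked on a small example. The sketch below uses a hypothetical frame `demo` with made-up `Y1` and `Y2` values; it also confirms that the sum of the two column means equals the mean of the row-wise sums, so either form gives the same cut-off.

```python
import numpy as np
import pandas as pd

# Made-up load values standing in for the Y1 and Y2 columns
demo = pd.DataFrame({'Y1': [10.0, 20.0, 30.0, 40.0],
                     'Y2': [12.0, 18.0, 33.0, 41.0]})

total = demo['Y1'] + demo['Y2']
threshold = demo['Y1'].mean() + demo['Y2'].mean()

# Sum of the means and mean of the sums are the same number
same_cutoff = np.isclose(threshold, total.mean())

# 1 where the combined load is at or above the average, else 0
demo['aboveAVG'] = (total >= threshold).astype(int)
print(threshold, demo['aboveAVG'].tolist())
```

Inspecting `value_counts()` on the new column is a quick way to see how balanced the two classes are before training.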

### 3.2 Get X and y values

In [None]:
# Values of target
y = dataset['aboveAVG'].values

# Values of attributes
dataset = dataset.drop(['Y1', 'Y2', 'aboveAVG'], axis=1)
X = dataset.values

### 3.3 Get number of features

In [None]:
number_features = len(dataset.columns)

***
## 4 PCA

### 4.1 Initialize and fit

In [None]:
pca = PCA(n_components=number_features)
pca.fit(X)
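
PCA is sensitive to feature scale, and the attributes here are measured in different units. The notebook fits PCA on the raw values; as a hedged alternative, the sketch below standardizes the features first with `StandardScaler`. The array `X_demo` is synthetic and only illustrates the effect.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two independent features on very different scales
X_demo = np.column_stack([rng.normal(0, 100.0, 200),
                          rng.normal(0, 0.1, 200)])

# Without scaling, the large-scale feature dominates the first component
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# With scaling, both features contribute on an equal footing
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)
scaled_ratio = scaled.named_steps['pca'].explained_variance_ratio_

print(raw_ratio, scaled_ratio)
```

Whether to standardize depends on whether the raw units are meaningful for the analysis; this is a design choice rather than a correction to the cells above.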

### 4.2 Evaluate Components

In [None]:
# Principal axes in feature space, one row per component
pca.components_

### 4.3 Explained Variance Ratio

In [None]:
pca.explained_variance_ratio_

### 4.3 Explained Variance

In [None]:
pca.explained_variance_
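
To make the two attributes concrete: `explained_variance_` holds the eigenvalues of the sample covariance matrix, and `explained_variance_ratio_` divides each by their total. The sketch below checks this on synthetic data (`X_demo` is made up), assuming all components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data with three features of decreasing spread
X_demo = rng.normal(size=(100, 3)) * np.array([3.0, 2.0, 1.0])

pca_demo = PCA().fit(X_demo)

# Eigenvalues of the sample covariance matrix, sorted largest first
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X_demo, rowvar=False)))[::-1]

# Each ratio is an eigenvalue divided by the total variance
ratio = eigenvalues / eigenvalues.sum()

print(np.allclose(pca_demo.explained_variance_, eigenvalues))
print(np.allclose(pca_demo.explained_variance_ratio_, ratio))
```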

### 4.4 Cumulative sum of variance explained with [n] features

In [None]:
# Accumulate first, then round, so rounding errors do not add up across components
variance = np.round(np.cumsum(pca.explained_variance_ratio_) * 100, decimals=1)
variance
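
A cumulative curve like this is typically used to pick how many components to keep. The sketch below finds the smallest number of components whose cumulative ratio reaches a target; the `ratios` array and the 90% `target` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical explained-variance ratios (made up, summing to 1)
ratios = np.array([0.60, 0.25, 0.10, 0.04, 0.01])
cumulative = np.cumsum(ratios)

# Assumed target share of variance to retain
target = 0.90

# argmax returns the first index where the condition holds; add 1 for a count
n_needed = int(np.argmax(cumulative >= target) + 1)
print(n_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, in which case it selects the number of components needed to explain that share of the variance.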

### 4.5 Plot in graph

In [None]:
# Number the components from 1 so the x-axis matches the component index
plt.plot(range(1, len(variance) + 1), variance)
plt.ylabel('% Cumulative Variance')
plt.xlabel('Principal Component')
plt.title('PCA Analysis')
plt.ylim(30, 100.5)
plt.show()

### 4.6 Logistic Regression based on first and second principal components

In [None]:
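# Keep the first two principal components so the projection can be plotted in 2D
pca = PCA(n_components=2)
X_r = pca.fit_transform(X)

# Chain PCA and logistic regression; the pipeline refits PCA on the data it is trained on
logistic = linear_model.LogisticRegression()
pipeline = Pipeline(steps=[('pca', pca), ('logistic', logistic)])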

### 4.7 Plot projection of first and second principal components

In [None]:
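plt.figure()
colors = ['navy', 'darkorange']
target_names = ['below average (0)', 'at or above average (1)']

# One scatter series per class of the binary target
for color, i, target_name in zip(colors, [0, 1], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8, lw=2,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('PCA')

plt.show()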

***
## 5 Machine Learning

### 5.1 Split dataset in train and test subsets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
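
The split above is purely random. If the two classes turn out to be imbalanced, a stratified split keeps their proportions the same in both subsets. The sketch below shows the `stratify` argument on synthetic labels; `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 70 of class 0 and 30 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

print(y_tr.mean(), y_te.mean())
```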

### 5.2 Fit model

In [None]:
# On scikit-learn < 0.22, pass solver='lbfgs' to LogisticRegression (section 4.6) to silence the solver FutureWarning
pipeline.fit(X_train, y_train)

### 5.3 Predict

In [None]:
y_pred = pipeline.predict(X_test)

### 5.4 Evaluate results

In [None]:
# Score: mean accuracy on the test set
pipeline.score(X_test, y_test)

In [None]:
# F-measure (micro-averaged; for single-label predictions this equals the accuracy above)
f1_score(y_test, y_pred, average='micro')
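
For a binary target, the micro-averaged F1 score coincides with accuracy, whereas `average='binary'` reports the F1 of the positive class only. The sketch below illustrates the difference on made-up labels (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

acc = accuracy_score(y_true, y_hat)

# Micro-averaging pools all decisions, which reproduces accuracy here
f1_micro = f1_score(y_true, y_hat, average='micro')

# Binary averaging scores only the positive class (label 1)
f1_binary = f1_score(y_true, y_hat, average='binary')

print(acc, f1_micro, f1_binary)
```

If the positive class is the one of interest, the binary variant may be the more informative of the two.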

### 5.5 Confusion Matrix

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# Row-normalized: fraction of each true class assigned to each predicted class
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
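
Recent scikit-learn versions (0.22 and later) can produce the same row-normalized matrix directly through the `normalize` argument of `confusion_matrix`. The sketch below compares both routes on made-up labels (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# Manual normalization: divide each row by the number of true samples in it
cm_demo = confusion_matrix(y_true, y_hat)
manual = cm_demo / cm_demo.sum(axis=1, keepdims=True)

# Built-in normalization over the true labels (rows)
builtin = confusion_matrix(y_true, y_hat, normalize='true')

print(manual)
print(builtin)
```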

In [None]:
# Plot confusion matrix
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Greens)
plt.title('Confusion matrix')
plt.colorbar()
# Show only the two class labels on each axis
plt.xticks([0, 1])
plt.yticks([0, 1])
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()