# Project objective
This project reviews principal component analysis (PCA) as a dimensionality reduction approach, combined with logistic regression as a supervised machine learning method, and their implementation in Python using the Olivetti faces dataset from AT&T.

Information about the dataset, technical details of the machine learning methods used, and mathematical details of the evaluation metrics are provided alongside the code.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)


In [0]:
import numpy as np
import sklearn as sk

# Introduction to the dataset

**Name**: Olivetti faces dataset from AT&T

**Summary**: This dataset consists of 10 pictures each of 40 individuals.

**Number of features**: 4096 (real, positive)

**Number of data points (instances)**: 400

**Dataset accessibility**: The dataset is available as part of the sklearn package.

**Link to the dataset**: http://lijiancheng0614.github.io/scikit-learn/datasets/olivetti_faces.html




## Loading the dataset and separating features and labels
The dataset is available as part of the sklearn package and is loaded with `fetch_olivetti_faces`.

In [11]:
from sklearn.datasets import fetch_olivetti_faces

# Loading the Olivetti faces data
target_dataset = fetch_olivetti_faces()

# separating feature arrays of pixel values (X) and labels (y) 
input_features = target_dataset.data
output_var = target_dataset.target
# printing number of features (pixels) and data points 
n_samples, n_features = input_features.shape
print("number of samples (data points):", n_samples)
print("number of features:", n_features)

number of samples (data points): 400
number of features: 4096


## Splitting data to training and testing sets

When no separate dataset is available for validation and/or testing, we need to split the data into training and test sets so that we can check how well the trained model generalizes.

**test_size**: Traditionally, 30%-40% of the dataset can be used for the test set. If you split the data into training, validation and test sets, you can use 60%, 20% and 20% of the dataset, respectively.

**Note.**: We need the validation and test sets to be big enough to check the generalizability of our model. At the same time, we would like to have as much data as possible in the training set to train a better model.

**random_state**: As the name suggests, this seeds the internal random number generator that decides which samples go to the training and test sets, so the same value reproduces the same split.


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_features, output_var, test_size=0.30, random_state=5)
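
The 60%/20%/20% split mentioned above can be obtained by calling `train_test_split` twice. The following is a minimal sketch under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins with the same number of samples and classes as the face data (they are not part of this notebook), and the `stratify` argument is an optional extra that keeps the same number of samples per class in every subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 400 samples, 40 classes with 10 samples each
rng = np.random.default_rng(0)
X_demo = rng.random((400, 20))
y_demo = np.repeat(np.arange(40), 10)

# Step 1: hold out 20% of all samples as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=5
)

# Step 2: 25% of the remaining 80% equals 20% of the full dataset (validation set)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=5
)

print("train / validation / test sizes:", len(X_tr), len(X_val), len(X_te))
```

The validation set can then be used for choices such as the number of principal components, while the test set is kept for the final performance estimate.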

## Feature extraction (unsupervised)
We want to implement principal component analysis (PCA) to combine features. In this dataset there are far more features (4096) than data points (400). Hence, combining features to reduce the dimensionality of the dataset could help improve the performance of the supervised learning model we want to develop later in this code.

### Principal component analysis (PCA)
Principal component analysis creates new orthogonal variables (principal components) that are linear combinations of the original variables. The focus of PCA is to retain as much as possible of the total variance of the original higher-dimensional space in the lower-dimensional space.
Among linear mappings to a lower-dimensional space, PCA is optimal in the sense that it minimizes the squared error when reconstructing the original space afterward.

1) The first principal component (PC) corresponds to a line that passes through the mean of the data. It is the best-fit line in the sense that it minimizes the sum of squares of the perpendicular distances of the points from the line; equivalently, it is the direction of maximum variance.

2) The second PC corresponds to the same concept after all correlation with the first principal component has been subtracted from the points.
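
To make the two steps above concrete, here is a minimal NumPy sketch of PCA through the singular value decomposition of the centered data. It is an illustration on synthetic data (`X_demo` is an assumption, not the face dataset) and is not necessarily how scikit-learn computes the components internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): 200 points in 3-D with a different spread per axis
X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Center the data so that every principal direction passes through the mean
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt
explained_variance = S ** 2 / (X_demo.shape[0] - 1)

# Coordinates of each point in the new (principal component) axes
scores = X_centered @ components.T

print("components are orthonormal:", np.allclose(components @ components.T, np.eye(3)))
print("variance along each component:", explained_variance)
```

The covariance matrix of `scores` is diagonal, which is the sense in which each component is uncorrelated with the ones before it.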



In [0]:
from sklearn import decomposition

# we want to reduce dimensionality of the dataset to 150 (150 is arbitrary in this code)

# Create PCA object
pca = decomposition.PCA(n_components=150, whiten=True, random_state=42)

# fit the PCA model on the training data and generate its principal components in one step
X_train_pca = pca.fit_transform(X_train)

# now project the test set using the mapping learned from the training set
X_test_pca = pca.transform(X_test)
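
Since the number of components is chosen arbitrarily in this notebook, one common alternative is to keep the smallest number of components that explains a given share of the variance. The sketch below assumes a 95% threshold and uses synthetic low-rank data (`X_demo`); both are illustrative choices and not part of the original analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (assumption): 10 latent factors mixed into 50 features, plus small noise
latent = rng.normal(size=(300, 10))
mixing = rng.normal(size=(10, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(300, 50))

# Fit PCA with all components and accumulate the explained variance ratio
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("components needed:", n_keep)
print("variance explained by them:", cumulative[n_keep - 1])
```

As with the PCA mapping itself, this choice should be made on the training data only.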

## Building the supervised learning model
We want to build a multiclass classification model, as the output variable is categorical with 40 classes (one per individual). Here we build a simple logistic regression model.

### Logistic regression
If we have a set of features X1 to Xn, y can be obtained as:
\begin{equation*} y=b_0+b_1X_1+b_2X_2+...+b_nX_n\end{equation*}

where y is the predicted value obtained as a weighted sum of the feature values.

In the two-class case, the probability of the positive class can then be obtained using the logistic function

\begin{equation*} p(class=1)=\frac{1}{1+\exp(-y)} \end{equation*}

With more than two classes, as in this dataset (40 individuals), each class k has its own set of coefficients and its own score y_k, and the softmax function generalizes the logistic function:

\begin{equation*} p(class=k)=\frac{\exp(y_k)}{\sum_{j}\exp(y_j)} \end{equation*}

Based on the class labels and the features given in the training data, the coefficients b0 to bn are obtained during the optimization process.

The coefficients are fixed for all samples, while X1 to Xn are feature values specific to each sample. Hence, the functions above give us the probability of each class for each sample. Finally, the model chooses the class with the highest probability for each sample.


**Note.** The logistic regression model is parametric and the parameters are the regression coefficients b0 to bn.
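
The prediction step described above can be sketched in a few lines of NumPy. This is an illustration only: the coefficients `B` and `b0` are random placeholders rather than fitted values, and the softmax function is used here as the usual generalization of the logistic function to more than two classes.

```python
import numpy as np

def softmax(scores):
    """Turn one row of class scores per sample into class probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 4))   # 5 samples, 4 features (placeholder data)
B = rng.normal(size=(3, 4))        # one row of coefficients per class (placeholder)
b0 = rng.normal(size=3)            # one intercept per class (placeholder)

# Weighted sum of the feature values for every class, then probabilities
scores = X_demo @ B.T + b0
probs = softmax(scores)

# The predicted class is the one with the highest probability
predicted = probs.argmax(axis=1)
print(probs.sum(axis=1))
print(predicted)
```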

In [14]:
from sklearn.linear_model import LogisticRegression 

# Create logistic regression object
logreg = LogisticRegression(random_state = 42)

# Train the model using the training sets
logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
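
The message above says that the solver stopped at `max_iter=100` before converging. The two remedies it suggests, scaling the data and raising `max_iter`, can be combined in a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption), and `max_iter=1000` is an illustrative value, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=10, n_classes=3, random_state=0
)

# Standardize the features, then fit logistic regression with a larger iteration budget
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
clf.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
n_iter = clf.named_steps["logisticregression"].n_iter_
print("iterations used:", n_iter, "out of", 1000)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged at prediction time.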

## Prediction of test (or validation) set
We now have to use the trained model to predict y_test.

In [0]:
# Make predictions using the testing set
y_pred = logreg.predict(X_test)

## Implementing the supervised learning model on the reduced dimensions
We now want to repeat the same process using the new features (principal components).


In [0]:
# Create logistic regression object
logreg_pca = LogisticRegression(random_state = 42)

# Train the model using the training sets
logreg_pca.fit(X_train_pca, y_train)

# Make predictions using the testing set
y_pred_pca = logreg_pca.predict(X_test_pca)

## Evaluating performance of the model
We need to assess the performance of the model using the predictions on the test set. We use accuracy and balanced accuracy. Here are their definitions, together with the recall and specificity they build on:

* **recall**, also referred to as the true positive rate or sensitivity: how many of the relevant items are selected.

$${\displaystyle {\text{recall}}={\frac {tp}{tp+fn}}\,} $$

* **specificity**, also referred to as the true negative rate:

$${\displaystyle {\text{specificity}}={\frac {tn}{tn+fp}}\,}$$

* **accuracy**: This measure gives you a sense of performance for all the classes together as follows:

$$ {\displaystyle {\text{accuracy}}={\frac {tp+tn}{tp+tn+fp+fn}}\,}$$

or, equivalently and for any number of classes,

\begin{equation*} accuracy=\frac{number\:of\:correct\:predictions}{total\:number\:of\:data\:points\:(samples)} \end{equation*}

* **balanced accuracy**: This measure averages the performance on each class, so every class counts equally regardless of its size. For two classes:

$${\displaystyle {\text{balanced accuracy}}={\frac {recall+specificity}{2}}\,}$$

For more than two classes, as in this dataset, the `balanced_accuracy_score` function of scikit-learn returns the average of the recall obtained on each class.
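
As a small worked example of these definitions in the multiclass setting, the sketch below computes accuracy and balanced accuracy by hand from a confusion matrix and compares them with the scikit-learn functions. The labels `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for three classes (illustration only)
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 0])

# Rows of the confusion matrix are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_hat)

# Recall of each class: correct predictions divided by the class size
per_class_recall = np.diag(cm) / cm.sum(axis=1)

accuracy_manual = np.trace(cm) / cm.sum()
balanced_manual = per_class_recall.mean()

print("accuracy:", accuracy_manual, metrics.accuracy_score(y_true, y_hat))
print("balanced accuracy:", balanced_manual, metrics.balanced_accuracy_score(y_true, y_hat))
```

Here the small class 1 is predicted perfectly while class 2 is only half right, which is why the balanced accuracy differs from the plain accuracy.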


In [17]:
from sklearn import metrics

print("accuracy of the predictions using original features:", metrics.accuracy_score(y_test, y_pred))
print("accuracy of the predictions using new features (principal components):", metrics.accuracy_score(y_test, y_pred_pca))
print("balanced accuracy of the predictions using original features:", metrics.balanced_accuracy_score(y_test, y_pred))
print("balanced accuracy of the predictions using new features (principal components):", metrics.balanced_accuracy_score(y_test, y_pred_pca))

accuracy of the predictions using original features: 0.95
accuracy of the predictions using new features (principal components): 0.95
balanced accuracy of the predictions using original features: 0.9566666666666667
balanced accuracy of the predictions using new features (principal components): 0.96


## Take-home message
As we can see, the performance of the logistic regression model using principal components is the same as or slightly better than (not significantly) that of the model with the original features. However, the important point is that we could achieve this performance with only 150 features (principal components) instead of the 4096 original features. Reducing the number of dimensions (features) can help us reduce memory usage and running time, while at the same time it can help us get rid of redundant and noisy features.


## Extracting the coefficient of the model
The trained logistic regression model predicts the class of a data point as a function of a linear combination of feature values. Hence, each feature has a coefficient in this linear combination for predicting the output variable. With 40 classes, there is one row of coefficients per class.

In [18]:
#print('Coefficients using original features: {}'.format(logreg.coef_))
print('Coefficients using new features (principal components): {}'.format(logreg_pca.coef_))

Coefficients using new features (principal components): [[-0.26044098  0.10272042  0.10610443 ... -0.02761116 -0.01568002
   0.02889717]
 [ 0.00231089 -0.15434158 -0.25431996 ... -0.01134162 -0.02918408
  -0.06746296]
 [-0.01215432  0.00632089  0.2470995  ... -0.08977587 -0.03819557
  -0.02374   ]
 ...
 [-0.02580804 -0.05658147  0.30839334 ...  0.06881294  0.0957967
  -0.03936866]
 [ 0.55723037  0.09982581 -0.08493365 ... -0.01938    -0.03544625
  -0.00536038]
 [-0.11796729  0.12682501  0.1481887  ... -0.05140645 -0.12856949
   0.17752318]]
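
The coefficients above refer to principal components, not to pixels. Because both PCA and the linear part of logistic regression are linear maps, the coefficients can be mapped back to one weight per original feature. The sketch below shows one way to do this for a whitened PCA; it uses synthetic data from `make_classification` and hypothetical names (`pca_demo`, `clf_demo`, `W`) rather than the models fitted in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic three-class stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=10, n_classes=3, random_state=0
)

# Whitened PCA followed by logistic regression, as in the notebook but smaller
pca_demo = PCA(n_components=5, whiten=True, random_state=42)
Z = pca_demo.fit_transform(X_demo)
clf_demo = LogisticRegression(max_iter=1000, random_state=42).fit(Z, y_demo)

# coef_ has one row per class and one column per principal component
print("coef_ shape:", clf_demo.coef_.shape)

# Undo the whitening (division by the component standard deviation), then the rotation
W = (clf_demo.coef_ / np.sqrt(pca_demo.explained_variance_)) @ pca_demo.components_

# The same class scores, computed directly from the original features
scores_from_original = (X_demo - pca_demo.mean_) @ W.T + clf_demo.intercept_
print("matches decision_function:",
      np.allclose(scores_from_original, clf_demo.decision_function(Z), atol=1e-6))
```

For image data, each row of `W` could be reshaped to the image dimensions to see which pixels push a prediction toward a given class.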
