## Assignment 7 
### Dimensionality reduction

This assignment uses the HTRU2 dataset, which can be found on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/HTRU2).

In this workbook, dimensionality reduction techniques are implemented.

### Introduction
An introduction to the HTRU2 dataset problem can be found in asgmt7_feat_importance_afoudouli.ipynb.

Dimensionality reduction refers to techniques that transform a high-dimensional space into a lower-dimensional one without losing much information.  
In Machine Learning, dimensionality reduction is usually treated as a data preparation technique. It can improve an algorithm's results, as fewer input features lead to faster training times, less noise in the data and possibly better generalization.  

There are various dimensionality reduction techniques. They can be classified into two categories:
* Linear algebra methods  
Examples include PCA, SVD, LDA 
* Manifold learning methods   
Examples include t-SNE, UMAP, autoencoders

In [1]:
# If needed, install packages by uncommenting the following commands
# !pip install xgboost

In [2]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import os 

os.chdir('C:/Users/anast/OneDrive/Desktop/MSc/MachineLearning/Assignments/Asgmt7_FeatureSelection/')

In [3]:
data_file = 'HTRU2/HTRU_2.csv'

data = pd.read_csv(data_file, header=None)


feature_names = ["Mean of the integrated profile",
	"Standard deviation of the integrated profile",
	"Excess kurtosis of the integrated profile",
	"Skewness of the integrated profile",
	"Mean of the DM-SNR curve",
	"Standard deviation of the DM-SNR curve",
	"Excess kurtosis of the DM-SNR curve",
	"Skewness of the DM-SNR curve"]

data.columns = feature_names + ["target_class"]

### Principal Component Analysis
Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data.  
Strictly speaking, PCA is a change of basis rather than a dimensionality reduction algorithm, but the properties of principal components make it a data scientist's favourite when trying to reduce the number of features.  
PCA transforms the original N-dimensional dataset into a set of N new features called principal components (PCs). The information contained in a column is measured by the amount of variance it contains. The primary objective of PCA is to represent the information in the dataset with as few columns as possible. PCs are formed in such a way that the first principal component (PC1) explains more of the variance in the original data than PC2. Likewise, PC2 explains more than PC3, and so on. Each PC is defined by a vector of weights (called loadings), which are the eigenvectors of the covariance matrix of the original data X.
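
The two properties described above (components ordered by explained variance, loadings equal to covariance eigenvectors) can be checked numerically. The following is a minimal sketch on synthetic data, not on HTRU2; the names `X_demo` and `pca_demo` are introduced here only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (not HTRU2): 500 samples, 4 correlated features
# built by mixing independent columns with different spreads.
rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
mixing = rng.normal(size=(4, 4))
X_demo = independent @ mixing

# Keep all components so the explained variance ratios sum to 1.
pca_demo = PCA(n_components=4).fit(X_demo)
ratios = pca_demo.explained_variance_ratio_
print("explained variance ratio per PC:", ratios)

# Eigen-decomposition of the sample covariance matrix.
cov = np.cov(X_demo, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rows of components_ are the loadings; they match the eigenvectors
# up to a sign flip per component, hence the comparison of absolute values.
same_directions = np.allclose(np.abs(pca_demo.components_),
                              np.abs(eigvecs.T), atol=1e-6)
same_variances = np.allclose(pca_demo.explained_variance_, eigvals)
print(same_directions, same_variances)
```

The sign of an eigenvector is arbitrary, which is why the comparison is done on absolute values.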

The problem is easier to understand in a 2D space.  
Given two columns plotted as points on a plane, we want to find a projection that best "represents" the information in the data. This new column can be thought of as a line that passes through these points. Such a line can be represented as a linear combination of the two columns and explains the maximum variation present in them. Equivalently, it lies in the direction that minimizes the perpendicular distance of the points from the line.
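
The 2D picture can be sketched in code. The example below uses synthetic points (an assumption, not HTRU2 data) and a hypothetical helper `mean_sq_perp_distance`. It scans candidate line directions through the centroid and compares the one with the smallest mean squared perpendicular distance to the first principal component found by scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2D points scattered around a line (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(size=300)
points = np.column_stack([x, 0.5 * x + 0.3 * rng.normal(size=300)])
centered = points - points.mean(axis=0)

def mean_sq_perp_distance(angle):
    """Mean squared perpendicular distance from the centered points
    to the line through the origin at `angle` (radians)."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    along = centered @ direction                      # projection lengths
    residual = centered - np.outer(along, direction)  # perpendicular parts
    return np.mean(np.sum(residual ** 2, axis=1))

# Brute-force scan over line directions in [0, pi], in steps of 0.1 degrees.
angles = np.linspace(0.0, np.pi, 1801)
distances = np.array([mean_sq_perp_distance(a) for a in angles])
best_angle = angles[np.argmin(distances)]

# Direction of PC1, folded into [0, pi) because a line has no orientation.
pc1 = PCA(n_components=1).fit(points).components_[0]
pc1_angle = np.arctan2(pc1[1], pc1[0]) % np.pi

print("best scanned angle:", best_angle, "PC1 angle:", pc1_angle)
```

The two angles should agree up to the resolution of the scan, since minimizing the perpendicular distance to the line and maximizing the variance along it are the same problem (by the Pythagorean theorem).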

<iframe src="https://gifer.com/embed/H7zW" width=480 height=192.000 frameBorder="0" allowFullScreen></iframe><p><a href="https://gifer.com">via GIFER</a></p>

Some great tutorials can be found on [umetrics](https://blog.umetrics.com/what-is-principal-component-analysis-pca-and-how-it-is-used), [built in](https://builtin.com/data-science/step-step-explanation-principal-component-analysis) and [Machine Learning Plus](https://www.machinelearningplus.com/machine-learning/principal-components-analysis-pca-better-explained/).

In [4]:
X = data.drop(columns = 'target_class')
y = data['target_class'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=556, stratify=y)

In [5]:
scaler = StandardScaler()
clf = XGBClassifier(n_jobs=-1)

In [6]:
full_dim_pipe = Pipeline([('scaling', scaler),
                          ('classifier', clf)])

full_dim_pipe.fit(X_train, y_train);  

In [7]:
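# Use fresh scaler and classifier instances here: Pipeline fits its steps in place,
# so reusing `scaler` and `clf` would overwrite the steps already fitted in full_dim_pipe.
pca = PCA(n_components=4)

pca_pipe = Pipeline([('scaling', StandardScaler()),
                     ('pca', pca),
                     ('classifier', XGBClassifier(n_jobs=-1))])

pca_pipe.fit(X_train, y_train);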
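
One possible way to compare a full-dimensional pipeline with a PCA pipeline is sketched below. It is an illustration under stated assumptions, not the evaluation used in this assignment: the data is synthetic (generated with `make_classification`, not HTRU2), `LogisticRegression` stands in for `XGBClassifier` so that only scikit-learn is required, and all `*_demo` and `*_syn` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for HTRU2: 8 features, imbalanced binary target.
X_syn, y_syn = make_classification(n_samples=2000, n_features=8,
                                   n_informative=4, n_redundant=2,
                                   weights=[0.9, 0.1], random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.20, random_state=556, stratify=y_syn)

# LogisticRegression is a stand-in classifier (assumption, see lead-in).
# Each pipeline gets its own scaler and classifier instances.
full_pipe_demo = Pipeline([('scaling', StandardScaler()),
                           ('classifier', LogisticRegression(max_iter=1000))])
pca_pipe_demo = Pipeline([('scaling', StandardScaler()),
                          ('pca', PCA(n_components=4)),
                          ('classifier', LogisticRegression(max_iter=1000))])

full_pipe_demo.fit(Xs_train, ys_train)
pca_pipe_demo.fit(Xs_train, ys_train)

# Test-set accuracy of each pipeline and the share of variance
# retained by the 4 principal components.
full_acc = full_pipe_demo.score(Xs_test, ys_test)
pca_acc = pca_pipe_demo.score(Xs_test, ys_test)
kept_variance = pca_pipe_demo.named_steps['pca'].explained_variance_ratio_.sum()

print(f"accuracy, all 8 features: {full_acc:.3f}")
print(f"accuracy, 4 PCs:          {pca_acc:.3f}")
print(f"variance kept by 4 PCs:   {kept_variance:.3f}")
```

Because the classes are imbalanced, accuracy alone can be misleading; a metric such as recall or F1 on the minority class may be a better basis for comparison.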