# DIMENSIONALITY REDUCTION

### Contents:
  > * Feature selection
  > * Feature extraction
  > * Advantages of dimensionality reduction
  > * PCA
  > * Practical implementation: Wholesale customers dataset

### Requirements:
> * Pandas
> * NumPy
> * Matplotlib
> * scikit-learn
> * IPython
> * visuals.py file used for plotting

## Introduction to dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.

COMPONENTS OF DIMENSIONALITY REDUCTION
- Feature selection
- Feature extraction

Example: Suppose we are training a model to predict people's heights from data with the features weight, color, moles, marital status, and gender. Features such as color, moles, and marital status have no evident link to height, so they are irrelevant to the problem. We therefore need a way to find the features that are most useful for the task.

**Feature selection**: tries to find a subset of the input variables.

The three strategies are:
1. the filter strategy (e.g. information gain),
2. the wrapper strategy (e.g. search guided by accuracy),
3. the embedded strategy (features are added or removed while building the model, based on prediction errors).
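As a minimal sketch of the filter strategy, the example below scores each feature on its own and ranks the features by that score. It uses absolute Pearson correlation with the target as the score instead of information gain, purely to keep the code short. The synthetic height data and the helper `rank_features_by_correlation` are hypothetical and only illustrate the idea.

```python
import numpy as np


def rank_features_by_correlation(X, y):
    """Rank the columns of X by absolute Pearson correlation with y.

    Returns (order, scores): feature indices from best to worst,
    and the score of every feature in its original column position.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    order = np.argsort(scores)[::-1]
    return order, scores


# Hypothetical data: height depends on weight, not on moles or marital status.
rng = np.random.default_rng(0)
n = 200
weight = rng.normal(70, 10, n)
moles = rng.integers(0, 20, n).astype(float)
marital_status = rng.integers(0, 2, n).astype(float)
height = 100 + 1.0 * weight + rng.normal(0, 5, n)

X = np.column_stack([weight, moles, marital_status])
order, scores = rank_features_by_correlation(X, height)
print("Feature ranking (column indices):", order)
print("Scores:", np.round(scores, 3))
```

A filter like this looks at each feature independently of any model, which makes it cheap but blind to interactions between features. The wrapper and embedded strategies address that at a higher computational cost.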

**Feature projection (feature extraction)**: transforms the data from the high-dimensional space to a space of fewer dimensions.

The data transformation may be
- linear, as in principal component analysis (PCA), or
- nonlinear, as in the various nonlinear dimensionality reduction techniques.

**Advantages of dimensionality reduction**
1. It reduces the time and storage space required.
2. Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model.
3. It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D.
4. It helps mitigate the curse of dimensionality.

## PRINCIPAL COMPONENT ANALYSIS (PCA)
**UNSUPERVISED LINEAR FEATURE PROJECTION**

PCA is mostly used as a tool in **exploratory data analysis** and for making predictive models. It is often used to visualize genetic distance and relatedness between populations. 

**STEPS FOR DIMENSION REDUCTION**
1. Organize the data as an m×n matrix, where m is the number of measurement types and n is the number of samples.
2. Subtract the mean of each measurement type.
3. Compute the SVD of the centered data, or the eigenvectors and eigenvalues of its covariance matrix.
4. Sort the eigenvectors by eigenvalue. Each eigenvalue is the variance of the data along its eigenvector, so larger eigenvalues mark directions of higher variance.
5. Keep only the dimensions that carry the most information (i.e., those with the largest eigenvalues) and project the data onto them.
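The steps above can be sketched with NumPy alone. This is an illustrative implementation under the m×n convention of step 1 (rows are measurement types, columns are samples). The function name `pca_eig` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def pca_eig(X, k):
    """PCA via eigendecomposition on an m x n matrix (m features, n samples)."""
    X = np.asarray(X, dtype=float)

    # Step 2: subtract the mean of each measurement type (each row).
    X_centered = X - X.mean(axis=1, keepdims=True)

    # Step 3: covariance matrix (m x m) and its eigendecomposition.
    n = X.shape[1]
    cov = X_centered @ X_centered.T / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

    # Step 4: sort eigenvalues (and matching eigenvectors) in descending order.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 5: keep the top-k directions and project the centered data onto them.
    components = eigvecs[:, :k]             # m x k
    projected = components.T @ X_centered   # k x n
    return eigvals, components, projected


# Hypothetical data: the second row is almost a multiple of the first,
# and the third row is low-variance noise.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.vstack([t, 2 * t + 0.1 * rng.normal(size=500), 0.1 * rng.normal(size=500)])

eigvals, components, projected = pca_eig(X, k=1)
explained = eigvals / eigvals.sum()
print("Explained variance ratio:", np.round(explained, 4))
```

Note that scikit-learn uses the transposed layout (samples in rows, features in columns), so data stored this way would need to be transposed before being passed to `sklearn.decomposition.PCA`.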

## Practical Implementation
### DATA

We will be using a [Wholesale customers dataset](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers#) from the [UCI Machine learning repository](https://archive.ics.uci.edu/ml/index.php) which has a very good collection of datasets.

Features used in this project:
- FRESH
- MILK
- GROCERY
- FROZEN
- DETERGENTS_PAPER
- DELICATESSEN

### IMPORT THE LIBRARIES

In [None]:
"""Import necessary libraries for this project"""

# Import libraries necessary for this project

# Allows the use of display() for DataFrames

# Import supplementary visualizations code visuals.py


# Pretty display for notebooks
%matplotlib inline

### LOAD DATA

In [None]:
"""read the csv file using pandas library"""
# Load the wholesale customers dataset


### DATA EXPLORATION

In [None]:
"""Display a description of the dataset"""


In [None]:
"""Understand the features of the data by observing a few samples"""
data.head()

In [None]:
# Select three indices of your choice you wish to sample from the dataset


# Create a DataFrame of the chosen samples


#### VISUALISE FEATURE DISTRIBUTION

In [None]:
"""Produce a scatter matrix for each pair of features in the data"""



### DATA PREPROCESSING

To understand the customers better, we need to scale the data and detect outliers.
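The notebook does not say which outlier rule to use. One common choice, assumed here, is Tukey's rule: flag any value that lies more than 1.5 times the interquartile range below the first quartile or above the third. The helper `tukey_outlier_mask` and the toy values are hypothetical.

```python
import numpy as np


def tukey_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside Tukey's fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    step = k * (q3 - q1)  # k times the interquartile range
    return (values < q1 - step) | (values > q3 + step)


# Hypothetical values for a single feature, with one obvious outlier.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])
mask = tukey_outlier_mask(values)
print("Outlier positions:", np.flatnonzero(mask))
```

Applied to each feature in turn, a mask like this can be used to drop or inspect the flagged rows before fitting PCA.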

#### FEATURE SCALING

In [None]:
"""Scale the data and sample using the natural logarithm and plot the scatter matrix"""
# Scale the data using the natural logarithm


# Scale the sample data using the natural logarithm


# Produce a scatter matrix for each pair of newly-transformed features
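To illustrate why the natural logarithm is a reasonable scaling for skewed spending data, the sketch below applies `np.log` to synthetic, strictly positive values and compares the skewness before and after. The log-normal data and the `skewness` helper are assumptions for the example, not values from the Wholesale customers dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed spending for two features (strictly positive).
spending = rng.lognormal(mean=8.0, sigma=1.0, size=(500, 2))

# Natural-log scaling; np.log also works element-wise on a pandas DataFrame.
log_spending = np.log(spending)


def skewness(a):
    """Sample skewness of each column (third standardized moment)."""
    centered = a - a.mean(axis=0)
    return (centered ** 3).mean(axis=0) / (centered ** 2).mean(axis=0) ** 1.5


skew_before = skewness(spending)
skew_after = skewness(log_spending)
print("Skewness before:", np.round(skew_before, 2))
print("Skewness after: ", np.round(skew_after, 2))
```

The logarithm is only defined for positive values, so zero or negative entries would need separate handling before this step.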


### FEATURE TRANSFORMATION

Techniques like PCA can help us understand which compound combinations of features best describe the customers, because PCA finds the directions that maximise the variance in the data.
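A minimal sketch of this idea with scikit-learn, fitting as many components as there are features and inspecting how much variance each one explains. The array `log_data` is a hypothetical stand-in for the log-scaled customer data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for log-scaled data: 300 samples, 6 features,
# with the second feature made to depend on the first.
rng = np.random.default_rng(0)
log_data = rng.normal(size=(300, 6))
log_data[:, 1] += 2 * log_data[:, 0]

# Fit PCA with the same number of dimensions as features.
pca = PCA(n_components=log_data.shape[1])
pca.fit(log_data)

# Fraction of total variance explained by each component, largest first.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print("Explained variance ratio:", np.round(ratios, 3))
print("Cumulative:              ", np.round(cumulative, 3))
```

Each row of `pca.components_` holds the feature weights of one component, which is what makes it possible to read a component as a "compound combination" of the original features.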

#### PCA

In [None]:
"""Apply PCA by fitting the data with the same number of dimensions as features"""

In [None]:
log_samples.shape

In [None]:
#  Apply PCA by fitting the good data with the same number of dimensions as features


#  Transform log_samples using the PCA fit above


# Generate PCA results plot


### Observations

**Mention your observation**


In [None]:
# Display sample log-data after having a PCA transformation applied


**DIMENSIONALITY REDUCTION**

In [None]:
"""Reduce the dimension from the number of features to 2 and visualise a biplot"""

# Apply PCA by fitting the good data with only two dimensions

# Transform the good data using the PCA fit above


# Transform log_samples using the PCA fit above


# Create a DataFrame for the reduced data
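The reduction to two dimensions might look like the sketch below. It assumes `good_data` is a DataFrame of log-scaled data with outliers removed and `log_samples` is a small selection of its rows. Both are filled with synthetic values here, and the column labels `Dimension 1` and `Dimension 2` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

features = ["FRESH", "MILK", "GROCERY", "FROZEN", "DETERGENTS_PAPER", "DELICATESSEN"]

# Hypothetical stand-ins for the cleaned log-scaled data and the chosen samples.
rng = np.random.default_rng(1)
good_data = pd.DataFrame(rng.normal(size=(100, 6)), columns=features)
log_samples = good_data.iloc[[0, 1, 2]]

# Fit PCA with only two dimensions.
pca = PCA(n_components=2).fit(good_data)

# Transform the full data and the samples with the same fitted PCA.
reduced_data = pd.DataFrame(
    pca.transform(good_data), columns=["Dimension 1", "Dimension 2"]
)
pca_samples = pca.transform(log_samples)

print(reduced_data.head())
print(np.round(pca_samples, 4))
```

Fitting once and reusing the same `pca` object for both transforms keeps the samples in the same coordinate system as the rest of the data.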


In [None]:
# Display sample log-data after applying PCA transformation in two dimensions


In [None]:
# Create a biplot


### FURTHER READING :
- https://en.wikipedia.org/wiki/Principal_component_analysis
- https://en.wikipedia.org/wiki/Dimensionality_reduction