# Lab 3 - Reducing the data set width


Reducing the data set width comes in two flavors:
  1. Feature Selection - Selecting from existing features
  1. Dimensionality Reduction - Using numerical methods to alter the feature space from known variables to computed variables


## Feature selection
----
Multivariate data usually contains many variables, and not all of them are equally significant. Ideally you should be able to make good predictions using a small number of features. When the data is very large, computation time matters, and building models with fewer features reduces the computational effort.

Feature selection acts like a filter, eliminating features that add little beyond what the other features already provide. It helps in building predictive models free from correlated variables, biases and unwanted noise.

**Filter Methods:** These methods apply a statistical measure to assign a score to each feature. Features are then kept or removed from the dataset based on that score. The methods are often univariate, considering each feature independently or only with regard to the dependent variable. Example methods include the chi-squared test, information gain and correlation coefficient scores.

**Wrapper Methods:** These methods treat the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. Each combination is assigned a score based on model accuracy. An example is the recursive feature elimination algorithm.

**Embedded Methods:** These methods learn which features contribute most to the accuracy of the model while the model is being created. The most common embedded feature selection methods are regularization methods. Examples of regularization algorithms are the LASSO, Elastic Net and Ridge Regression (ridge shrinks coefficients but, unlike the LASSO, does not set them exactly to zero).
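As a rough illustration of the first two families, the sketch below scores features with a filter and then runs a wrapper-style search. It uses the built-in `mtcars` data as a stand-in (an assumption; it is not part of this lab). The wrapper part uses `step()` with AIC as the score, which is a simple stepwise search rather than the recursive feature elimination algorithm named above.

```r
# Built-in mtcars data stands in for a real dataset (assumption for illustration).
target <- "mpg"
predictors <- setdiff(names(mtcars), target)

# Filter method: score each feature on its own by its absolute correlation with the target.
filter_scores <- sapply(predictors, function(p) abs(cor(mtcars[[p]], mtcars[[target]])))
filter_scores <- sort(filter_scores, decreasing = TRUE)
top_filter <- names(filter_scores)[1:3]
print(round(filter_scores, 3))
print(top_filter)

# Wrapper-style method: backward elimination. The model is refitted after each
# candidate removal and the subset with the lowest AIC is kept.
full_model <- lm(mpg ~ ., data = mtcars)
reduced_model <- step(full_model, direction = "backward", trace = 0)
wrapper_features <- attr(terms(reduced_model), "term.labels")
print(wrapper_features)
```

The filter ranks every feature independently, so it can keep several features that carry the same information. The wrapper evaluates whole subsets, which is slower but takes interactions between features into account.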

We have already seen how the chi-squared test is used to assess the relationship between variables. For another approach, see
[Feature selection using Caret package](https://www.r-bloggers.com/feature-selection-with-carets-genetic-algorithm-option/)

##### Covariance and Correlation

Let's do a quick recap of correlation and covariance, which we covered in module 2. If you want to quickly flag correlations above a certain threshold, you can do it as shown below.

Additional reading for [correlation](http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient)

Additional reading for [covariance](http://www.r-tutor.com/elementary-statistics/numerical-measures/covariance)

In [None]:
cor(housing_prices[!names(housing_prices) %in% c('id','date')])>0.7

In [None]:
cov(housing_prices[!names(housing_prices) %in% c('id','date')])
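The threshold comparison on the correlation matrix returns a full logical matrix, which is hard to read for many variables. A minimal sketch of turning it into a list of variable pairs follows. It uses the built-in `mtcars` data in place of `housing_prices` (an assumption for illustration) and compares absolute values so that strong negative correlations are listed too.

```r
# mtcars stands in for housing_prices here (assumption for illustration).
cor_matrix <- cor(mtcars)
threshold <- 0.7

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
high <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  correlation = cor_matrix[high]
)

# Strongest relationships first.
high_pairs <- high_pairs[order(-abs(high_pairs$correlation)), ]
print(high_pairs)
```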

We have seen how covariance, correlation and the chi-squared test help us identify relationships between variables. Information gain selects variables based on the amount of information each one provides about the target. The R package FSelector has a function to calculate information gain for us.

In [None]:
# install.packages("FSelector",repo="http://cran.uk.r-project.org")
library(FSelector)
weights <- information.gain(price~., housing_prices)

print(weights)

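# Keep the two features with the highest information gain
subset <- cutoff.k(weights, 2)

# Build a formula from the selected features, using the same response column as above
f <- as.simple.formula(subset, "price")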

print(f)
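To make the idea of information gain concrete, here is a small base R sketch for discrete variables. The helper names `entropy()` and `info_gain()` are hypothetical and written for this illustration; they are not FSelector functions, and the base-2 logarithm is an assumption, so the numbers need not match the scale of `information.gain()`.

```r
# Shannon entropy (in bits) of a discrete vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain = entropy of the target minus the weighted entropy
# of the target within each group defined by the feature.
info_gain <- function(feature, target) {
  groups <- split(target, feature)
  conditional <- sum(sapply(groups, function(g) length(g) / length(target) * entropy(g)))
  entropy(target) - conditional
}

target  <- c("yes", "yes", "no", "no")
perfect <- c("a", "a", "b", "b")   # separates the target completely
useless <- c("a", "b", "a", "b")   # tells us nothing about the target

print(info_gain(perfect, target))
print(info_gain(useless, target))
```

A feature that splits the target cleanly removes all uncertainty and gets the maximum gain, while a feature that leaves each group as mixed as the whole gets a gain of zero.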

For this data, entropy or information gain is probably not the best choice for feature selection. Zip code and latitude are really not the most decisive features for the price of a house. Of course the location of a house matters for its price, but these are not the most important features. Let's continue our discussion on regression.

[Additional reading on Summarizing data is suggested](http://www.cookbook-r.com/Manipulating_data/Summarizing_data/)

We will work with the Communities and Crime dataset. It is socio-economic data with a total of 1994 instances and 128 features. Of the 128 variables, 122 are predictive, 5 are non-predictive and one is the dependent variable. The first five variables are the non-predictive ones, so we don't have to consider them when building the linear regression model.

The dataset documentation warns that it has missing values. The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. There was apparently some controversy in some states concerning the counting of rapes. This resulted in missing values for rape, which in turn resulted in incorrect values for per capita violent crime. These cities are not included in the dataset.

We did not come across missing values in previous modules. We will have to deal with them here to make our life easier when building the models. All numeric data is normalized into the decimal range 0.00-1.00 using an unsupervised, equal-interval binning method. Read the description of the dataset before starting to work on it. [Click here](../../../datasets/crime/readme.txt) for the readme file.

The actual data doesn't have any column headers. You need to grab the header information from the readme file. We have to do a little bit of data carpentry before we can start applying linear regression to the data. Let's start by reading the column header information.

In [None]:
# The header information is in the readme file. We have put the relevant part in a file called names.txt
# so that we can read only the lines we are interested in. Each line still carries a lot of extra text.
# A sample record is shown below.

#'-- state: US state (by number) - not counted as predictive above, but if considered, should be consided nominal (nominal)'

# The only thing we are interested in is the actual column name, the word that follows '--' on every line.
# So let's read the data, separating every word on whitespace using the parameter sep="".

# header is FALSE because the names file has no header row.
column_names= read.csv('../../../datasets/crime/names.txt',header = FALSE,sep="")
head(column_names)

We got the variable names separated from rest of the junk. But they are not perfect. They have ':' appended at the end. 

Now we need to get rid of the character ':' from every word. The gsub() function lets you replace characters in a string.

In [None]:
# First let's extract the 2nd column, as it is the only one holding the names.
column_names=column_names[,2]

# gsub() replaces every match of its first argument (':') with its second argument ('', nothing) in each string.
column_names=gsub(':','',column_names)
head(column_names)

Looks like we are all set to assign these names to our crime dataset and start the actual work of predicting crime.

In [None]:
# Read the data first. Then uncomment the names() line below and run it to see the error discussed next.
crime_data <- read.csv('../../../datasets/crime/communities_and_crime.txt',header=FALSE)
# names(crime_data)=column_names

### Error
-----
It throws the following error:

`Error in names(crime_data) = column_names: 'names' attribute [132] must be the same length as the vector [128]`

Let's see why this error is thrown. It says that the length of the `names` attribute we are assigning does not match the number of columns. Check both the length of the column_names vector and the number of columns in crime_data.

In [None]:
ncol(crime_data)
length(column_names)

There are 132 names in the column_names vector, which we are trying to assign to the 128 columns/variables in crime_data. Somehow we ended up extracting 132 names instead of 128. If we look at the names vector closely, we can see what those extra names are.

In [None]:
column_names

"-", "and", "(numeric", "Part" are the four names that are created. Once we eliminate these we should be good to go.

In [None]:
# Select the strings in column_names that are not in the specified list, using %in% together with
# the negation operator '!'.
column_names = column_names [! column_names %in% c('-', 'and', '(numeric', 'Part')]
length(column_names)

Now that we have the right number of names for the columns in crime_data, let's assign them.

In [None]:
names(crime_data)=column_names
head(crime_data)

#### How will you check the accuracy, or how good the fit of your model is?

You cannot build and test the model on the same data; the result would be meaningless. You have to test the accuracy of the model on unseen test data. R has libraries to split the data into train and test datasets.

Let's go ahead and split our dataset into training and testing datasets. We can do this using the caTools package as shown below.

In [None]:
# set.seed(x) makes the random split reproducible: with the same seed, the data is split into the same
# partitions no matter how many times you run it.
set.seed(144)
# install.packages("caTools",repo="http://cran.uk.r-project.org")

library(caTools)

split = sample.split(crime_data$ViolentCrimesPerPop, SplitRatio = 0.7)

crime_train_data = subset(crime_data, split == TRUE)

crime_test_data = subset(crime_data, split == FALSE)

nrow(crime_train_data)
nrow(crime_test_data)
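For comparison, a similar 70/30 split can be sketched in base R without caTools, followed by a check of the model on the held-out rows. The built-in `mtcars` data and the `mpg ~ wt + hp` model are assumptions chosen only to keep the example self-contained, and `rmse()` is a hypothetical helper.

```r
set.seed(144)

# mtcars stands in for crime_data (assumption for illustration).
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training rows only.
model <- lm(mpg ~ wt + hp, data = train)

# Root mean squared error, a common measure of fit for regression.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$mpg, predict(model, newdata = train))
test_rmse  <- rmse(test$mpg,  predict(model, newdata = test))
print(c(train = train_rmse, test = test_rmse))
```

The error on the test rows is the one to report, because the model has never seen them. A test error much larger than the training error is a sign of overfitting.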

In [None]:
head(crime_test_data)

## Dimensionality Reduction
-----
Dimensionality reduction is not feature selection, even though both try to reduce the number of attributes in the dataset. Dimensionality reduction methods do so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without altering them. Principal Component Analysis, Singular Value Decomposition, Factor Analysis and Sammon’s Mapping are examples of dimensionality reduction.

The following are some of the simplest techniques for dimensionality reduction/variable exclusion.


**Missing Values Ratio:** Columns with many missing values carry less useful information. Thus, if the share of missing values in a column is greater than a threshold value, the column can be removed.

**Low Variance Filter:** Columns with little variance in data carry little information. Thus, if the variance of a column is below a threshold value, the column can be removed. Variance is range dependent, so data should be normalized before applying this technique.

**High Correlation Filter:** Columns with high correlation provide almost the same information, so only one of them needs to be fed to the model. The Pearson correlation coefficient does not change when a column is linearly rescaled, so, unlike the low variance filter, this filter does not require normalizing the columns first.

**Random Forests / Ensemble Trees:** Decision tree ensembles, or random forests, are useful for feature selection in addition to the classification they perform. Trees are constructed with attributes as nodes. If an attribute is often selected as the best split, it is likely to be one of the most informative features of the dataset.

**Principal Component Analysis (PCA):** PCA is a statistical technique that transforms the n features of the dataset into a new set of n coordinates called principal components. The transformation is such that the first principal component explains the largest possible variance. Each following component has the next highest possible variance while being uncorrelated with the preceding components.
[Additional Reading](https://www.r-bloggers.com/principal-component-analysis-using-r/)
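The low variance and high correlation filters described above can be sketched in a few lines of base R. The toy data frame, its column names and both thresholds are assumptions made up for this illustration; the columns are taken to be on a common 0-1 scale already, as in the crime data.

```r
set.seed(1)
n <- 100

# Toy data (hypothetical columns, already on a 0-1 scale).
toy <- data.frame(
  a = runif(n),
  nearly_constant = c(rep(0.5, n - 1), 0.51),
  c = runif(n)
)
toy$a_copy <- toy$a + rnorm(n, sd = 0.01)   # almost a duplicate of column a

# Low variance filter: drop columns whose variance is below the threshold.
variances <- sapply(toy, var)
keep_var <- names(variances)[variances > 0.01]
filtered <- toy[keep_var]

# High correlation filter: look at each pair once (upper triangle) and,
# for every pair above the threshold, drop the later of the two columns.
cor_matrix <- abs(cor(filtered))
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- 0
drop_cor <- names(which(apply(cor_matrix, 2, max) > 0.9))
reduced <- filtered[setdiff(names(filtered), drop_cor)]

print(names(reduced))
```

Which column of a correlated pair to drop is a design choice. Here the later column is dropped for simplicity; in practice you might keep the one that is easier to interpret or has fewer missing values.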


In [None]:
summary(crime_train_data)

In [None]:
table(crime_train_data$LemasSwFTFieldPerPop=='?')
table(is.na(crime_train_data$LemasSwFTFieldPerPop))

Many variables have missing values recorded as `?`.
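A minimal sketch of handling such placeholders and applying the missing values ratio filter is shown here. The toy data frame, its column names and the 0.5 threshold are assumptions for illustration only.

```r
# Toy data frame with '?' marking missing values (hypothetical columns).
toy <- data.frame(
  population     = c("0.19", "0.00", "0.04", "0.01"),
  police_per_pop = c("?", "?", "0.12", "?"),
  median_income  = c("0.37", "?", "0.31", "0.58"),
  stringsAsFactors = FALSE
)

# Replace the '?' placeholder with NA, then convert every column to numeric.
toy[toy == "?"] <- NA
toy[] <- lapply(toy, as.numeric)

# Fraction of missing values per column.
missing_ratio <- colMeans(is.na(toy))
print(missing_ratio)

# Missing values ratio filter: keep columns whose missing share is at most the threshold.
threshold <- 0.5
toy_reduced <- toy[, missing_ratio <= threshold, drop = FALSE]
print(names(toy_reduced))
```

When reading a file, the same conversion can be done up front by passing `na.strings = "?"` to `read.csv()`, so that the affected columns are read as numeric with `NA` values.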

In [None]:
head(crime_train_data)

### PCA

##### Centering and Standardizing Variables

Standardizing the variables is very important when performing principal component analysis. If the variables are not standardized, variables with large variances dominate the others.

When the variables are standardized, they will all have variance 1 and mean 0. This would allow us to find the principal components that provide the best low-dimensional representation of the variation in the original data, without being overly biased by those variables that show the most variance in the original data.

We will use the `scale()` function in R to standardize the variables.
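Standardizing a variable means subtracting its mean and dividing by its standard deviation. The short sketch below, on a made-up vector, checks that doing this by hand gives the same result as `scale()` with its default arguments.

```r
# Made-up values for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standardize by hand: subtract the mean, divide by the standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes.
z_scale <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))
print(c(mean = mean(z_manual), sd = sd(z_manual)))
```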

In [None]:
str(crime_train_data[50:127])

In [None]:
# Keep only the numeric columns; columns containing '?' are read as factor or character and are skipped
standard_vars <- as.data.frame(scale(crime_train_data[sapply(crime_train_data, is.numeric)]))
dim(standard_vars)
head(standard_vars)

You can verify the means and standard deviations of the variables. The means will be nearly zero and the standard deviations will be 1.

In [None]:
# sapply(standard_vars,mean)

In [None]:
# sapply(standard_vars,sd)

In [None]:
crime_train_data_pca <- prcomp(standard_vars)

In [None]:
summary(crime_train_data_pca)
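To see what `prcomp()` computes on standardized data, the sketch below compares its output with an eigen decomposition of the correlation matrix. The built-in `USArrests` data is used as a small stand-in (an assumption; it is not the crime data of this lab).

```r
# USArrests stands in for the crime data (assumption for illustration).
X <- scale(USArrests)
pca <- prcomp(X)

# Eigen decomposition of the correlation matrix of the original variables.
eig <- eigen(cor(USArrests))

# The component variances (sdev squared) equal the eigenvalues.
print(all.equal(pca$sdev^2, eig$values))

# The loadings equal the eigenvectors, up to the sign of each column.
print(all.equal(abs(unname(pca$rotation)), abs(eig$vectors)))
```

The sign of a principal component is arbitrary, which is why the loadings are compared in absolute value.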

#### Number of Principal Components to Retain

A scree plot helps us decide on the number of principal components to retain. The plot summarizes the PCA results by showing the variance of each component. The `screeplot()` function in R will help us do this.

In [None]:
screeplot(crime_train_data_pca, type="lines")

The most obvious change in slope in the scree plot occurs at component 7; therefore the first six components should be retained.

Another approach to deciding on the number of components is Kaiser’s criterion. It suggests that we should only retain principal components for which the variance is above 1 (on standardized variables). We can check this by finding the variance of each of the principal components. The standard deviations of the principal components are stored in an element called `sdev` of the object returned by `prcomp()`; here you can access it as `crime_train_data_pca$sdev`.

In [None]:
(crime_train_data_pca$sdev)^2

Components 1 through 14 have variance above 1. Using Kaiser’s criterion, we can retain the first fourteen principal components.

One more method to decide on the number of components to retain is to keep the fewest components required to explain at least some minimum amount of the total variance. For example, to explain at least 70% of the variance, we would retain the first eight principal components, because the output of `summary(crime_train_data_pca)` shows that the first eight components explain 70% of the variance (while the first four components explain 56%).
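Both Kaiser’s criterion and the minimum-variance rule can be computed directly from `sdev` instead of being read off the printed summary. The sketch below uses the built-in `USArrests` data as a stand-in (an assumption for illustration), so the counts it produces are for that data and not for the crime data.

```r
# USArrests stands in for the crime data (assumption for illustration).
pca <- prcomp(USArrests, scale. = TRUE)

# Variance of each component and the share of total variance it explains.
variances <- pca$sdev^2
prop_var <- variances / sum(variances)
cum_var <- cumsum(prop_var)
print(round(cum_var, 3))

# Kaiser's criterion: number of components with variance above 1.
n_kaiser <- sum(variances > 1)

# Minimum-variance rule: smallest number of components explaining at least 70%.
n_70 <- which(cum_var >= 0.70)[1]

print(c(kaiser = n_kaiser, at_least_70_percent = n_70))
```

The two rules can disagree, as they do for the crime data above, so the final choice usually also depends on how the components will be used.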

### Scatter plots of Principal components


The values of the principal components are stored in a named element `x` of the variable returned by `prcomp()`. `x` contains a matrix where the first column contains the first principal component, the second column the second component, and so on.

Thus, `crime_train_data_pca$x[,1]` contains the first principal component, and `crime_train_data_pca$x[,2]` contains the second principal component.
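The scores in `x` are the standardized data projected onto the loadings stored in `rotation`. A short sketch that checks this, again using the built-in `USArrests` data as a stand-in (an assumption for illustration):

```r
# USArrests stands in for the crime data (assumption for illustration).
X <- scale(USArrests)
pca <- prcomp(X)

# Project the standardized data onto the loadings by hand.
scores_manual <- X %*% pca$rotation

# The result matches the scores stored by prcomp().
print(all.equal(unname(scores_manual), unname(pca$x)))

# The principal components are uncorrelated with each other.
print(round(cor(pca$x), 10))
```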

We will make a scatterplot of the first two principal components.

In [None]:
library(ggplot2)
# Put the component scores in a data frame so ggplot can refer to them by name
pca_scores <- as.data.frame(crime_train_data_pca$x)
pca_comp1_comp2 <- ggplot(pca_scores, aes(x = PC1, y = PC2))

pca_comp1_comp2 + geom_point(alpha = 0.8)

In [None]:
# Number of rows (observations) in the training data; biplot() needs one label per observation
len = nrow(crime_train_data)

# Draw every observation as a '.' so the biplot stays readable
biplot(crime_train_data_pca, xlabs = rep('.', len))