# Variable/Feature Reduction Using Correlation in Python

## Why do we need to reduce features?

When building a statistical model, we often have many variables or features, potentially from different sources. Each could capture a different kind of information about a particular observation.

With a large number of features, some are likely to be identical, and some may be scalar multiples of one another (for example, height in inches in column x and height in cm in column y). Others could be linear or polynomial combinations of a few variables. As a result, the many features represent only a few underlying "dimensions", i.e. most of the variance of the input features is captured by only a few of them.
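As a small illustration of the redundancy described above, the sketch below builds a toy dataframe (the column names and values are made up for this example) and shows that a scalar multiple and an identical copy both have a Pearson correlation of 1 with the original column, while an independently generated column does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is reproducible
height_cm = rng.normal(170, 10, size=200)

# Hypothetical toy data: three columns carry the same information
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,              # scalar multiple of height_cm
    "height_cm_copy": height_cm.copy(),         # identical column
    "weight_kg": rng.normal(70, 12, size=200),  # generated independently of height
})

# n X n Pearson correlation matrix of the features
corr = df.corr(method="pearson")
print(corr.round(3))
```

The three height columns form a block of correlations equal to 1, so only one of them adds information.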

To build a parsimonious model, we need to limit the number of "dimensions" or types of information. This can be done in simple ways, such as removing duplicates or scalar multiples of variables, or in more complex ways, such as Principal Component Analysis (PCA) or variable clustering (K-means, hierarchical, etc.). The latter methods are arguably better, but they are harder for a lay person to interpret and computationally expensive (especially when you have thousands of features).


## Where does correlation come in?

As noted earlier, feature reduction techniques fall into two groups: those that are less effective but efficient and easy to interpret, and those that are statistically better but computationally expensive and difficult to interpret. How can we be smarter about this?

1. Apply the efficient techniques first, using conservative "thresholds". This removes only features that are identical or nearly identical to each other
2. Then apply the statistically better techniques, which can remove "dimensions" in more complex ways and not just individual features. Since fewer variables are left after step 1, this should be much quicker

Now step 1 is more critical the more redundancies there are in your data. If your data is clean and you know most features represent very different types of information, you could directly use step 2. Step 1 is where correlation based techniques come in.
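The two steps can be sketched roughly as follows. This is only an illustration under stated assumptions: `drop_near_duplicates` and `pca_reduce` are hypothetical helper names (not the functions defined later in this notebook), the 0.99 and 0.95 values are arbitrary, and PCA is done with a plain NumPy SVD on standardised columns rather than with a dedicated library.

```python
import numpy as np
import pandas as pd

def drop_near_duplicates(df, threshold=0.99):
    """Step 1 (hypothetical helper): drop the later column of any pair with |r| >= threshold."""
    corr = df.corr(method="pearson").abs()
    cols = list(corr.columns)
    dropped = []
    for i, a in enumerate(cols):
        if a in dropped:
            continue
        for b in cols[i + 1:]:
            if b not in dropped and corr.loc[a, b] >= threshold:
                dropped.append(b)
    return df.drop(columns=dropped), dropped

def pca_reduce(df, var_explained=0.95):
    """Step 2 (hypothetical helper): keep the fewest principal components reaching var_explained."""
    X = ((df - df.mean()) / df.std(ddof=0)).to_numpy()  # standardise each column
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    ratio = S ** 2 / np.sum(S ** 2)                     # variance share per component
    k = min(int(np.searchsorted(np.cumsum(ratio), var_explained)) + 1, len(S))
    return X @ Vt[:k].T, ratio[:k]

# Toy data: x2 duplicates x1 (up to scale), x4 is roughly x1 + x3
rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1,
    "x3": x3,
    "x4": x1 + x3 + rng.normal(scale=0.05, size=n),
})

reduced, dropped = drop_near_duplicates(df, threshold=0.99)
components, ratio = pca_reduce(reduced, var_explained=0.95)
print("dropped in step 1:", dropped)
print("components kept in step 2:", components.shape[1])
```

Step 1 only catches the exact duplicate `x2`; the linear combination `x4` survives it and is absorbed by step 2, which here needs two components for the three remaining columns.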


## Different methods to put correlation to work

In this notebook, I've highlighted two main ways to put correlation to work. Please note that this **only works with numeric data types**. 
1. Correlation with the dependent variable (or target). This is the correlation of each independent variable with the dependent variable
2. Cross-correlation. This is the correlation between pairs of features (independent variables)

### Correlation with dependent variable

This is the correlation of each of the input features with the target. There are different ways to calculate correlation and they are best explained <a href="https://www.kaggle.com/kiyoung1027/correlation-pearson-spearman-and-kendall" target = "_blank">here</a>. Now, think of the following examples:
1. Consider the equation of a line y = x, where y is the target and x is the input feature. If that represents the data perfectly, then the correlation between x and y is 1
2. Consider a parabola (like the U-shaped relationship shown in the link). The correlation is 0, but that need not mean the variable is meaningless
3. Now, consider a line y = 5, i.e. a variable containing the value 5 in all observations. A constant has zero variance, so its Pearson correlation is undefined (pandas reports it as NaN rather than a number), and the variable is meaningless to use in building a model
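The three cases above can be checked numerically. The snippet below is a minimal sketch on made-up data (a symmetric grid of x values); the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

x = np.linspace(-5, 5, 101)  # symmetric around 0
df = pd.DataFrame({
    "x": x,
    "line": x,                         # case 1: y = x
    "parabola": x ** 2,                # case 2: U-shaped relationship
    "constant": np.full_like(x, 5.0),  # case 3: the same value everywhere
})

# A zero-variance column makes NumPy divide by zero internally; silence that warning
with np.errstate(divide="ignore", invalid="ignore"):
    r_line = df["x"].corr(df["line"])
    r_parabola = df["x"].corr(df["parabola"])
    r_constant = df["x"].corr(df["constant"])

print(r_line, r_parabola, r_constant)
```

The parabola carries a perfectly deterministic relationship yet scores (numerically) zero, while the constant column yields NaN, which is why a low or missing correlation should trigger a closer look rather than an automatic drop.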

From these examples, it appears that lower correlations need to be examined and if "certain conditions" (will explain in just a bit) are met, they can be excluded from the input features and dataset. How do we calculate the correlation with target? By using the following function:

In [1]:
def corr_target(df_input, df_target, corr_coeff=0.01, savefile=0):
	""" The function returns a list of features to be dropped from the input features.
	
	INPUTS:
	1. df_input: n numeric input features (pandas dataframe)
	2. df_target: Target values (ensure same order as input features)
	3. corr_coeff: Coefficient threshold (absolute value, no negatives) for feature and target below which feature will be dropped
	4. savefile: If set to 1, all relevant files will be saved
	
	PLEASE NOTE:
	- The dataframe df_input should contain only the n numeric input features (i.e. no ID and targets) 
	- The pandas series df_target should only be 1 column (if multiclass it should include all classes) and should be in the same order as the input dataset df_input

	SUMMARY OF LOGIC:
	1. The n numeric input variables are taken and a n-dimensional vector of correlation is created (these are absolute values i.e. a correlation of -0.8 is treated as 0.8)
	2. Variables with correlation lower than the corr_coeff threshold are dropped 

	SAVED FILES:
	If savefile is set to 1. Saved under current directory under corr_target under folder with a UTC timestamp.
	1. A CSV of the vector containing the correlation of each input feature with the target
	2. A CSV of the list of variables to be dropped
	"""
	
	# Pre-processing
	import pandas as pd
	if savefile == 1:
		from datetime import datetime
		time = str(datetime.utcnow())
		import os
		path = str(os.getcwd()) + "/corr_target/" + str(time[0:19].replace(':',"-")) + "/"
		os.makedirs(path)

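	# Combining the input and target data (copying so that the caller's dataframe is not modified)
	df = pd.DataFrame(df_input).copy()
	df["target"] = pd.Series(df_target).values # Assigned by position, so the order must match df_input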

	# Generating correlation matrix of input features
	corr_matrix = df.corr(method = 'pearson') # For more info on the methods please refer to https://www.kaggle.com/kiyoung1027/correlation-pearson-spearman-and-kendall
	corr_matrix = corr_matrix.iloc[0:corr_matrix.shape[0]-1,corr_matrix.shape[0]-1]

	# Saving files
	if savefile ==1:
		corr_matrix.to_csv(path + "cross_corr.csv")


	# Selecting features to be dropped (absolute correlation with the target below the threshold; NaN correlations, e.g. from constant columns, are not selected)
	corr_matrix = abs(corr_matrix)
	features_drop_list = list(corr_matrix[corr_matrix<corr_coeff].index)

	# Saving final list
	if savefile ==1:
		pd.Series(features_drop_list).to_csv(path + "features_drop_list.csv")

	return features_drop_list

I typically use thresholds below 1% (i.e. 0.01), and at most 2%. Now DO NOT directly drop all these features. Once the list is generated, try plotting univariates (variable vs. target, for each variable to be dropped) and bivariates (same as univariates but with population or frequency added) to check whether any of these need to be retained. Drop the remaining variables and then proceed to check cross-correlation.
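One way to do this check without plotting is a binned table of the target against each candidate feature. The sketch below is an assumption-laden illustration: `binned_target_summary` is a hypothetical helper, the quantile binning and the number of bins are arbitrary choices, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def binned_target_summary(feature, target, bins=5):
    """Mean target (univariate) and observation count (bivariate) per quantile bin."""
    frame = pd.DataFrame({"feature": np.asarray(feature), "target": np.asarray(target)})
    frame["bin"] = pd.qcut(frame["feature"], q=bins, duplicates="drop")
    return frame.groupby("bin", observed=True)["target"].agg(["mean", "count"])

# Synthetic feature with a U-shaped effect on the target
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 500)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)

r = pd.Series(x).corr(pd.Series(y))  # close to 0, so a correlation filter would flag it
summary = binned_target_summary(x, y, bins=5)
print("Pearson r:", round(r, 3))
print(summary)
```

Here the correlation is near zero, but the mean target is clearly higher in the outer bins than in the middle one, so this feature would be worth retaining despite appearing on the drop list.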

### Cross-correlation

The above function is used to eliminate features based on how they independently correlate with the dependent variable. But what happens if there are features that are scalar multiples of each other (like height in cm and inches) and are highly correlated with the dependent variable? We can remove one of each such pair using the correlation between pairs of variables. If there are n features, we need to calculate an n X n matrix where each cell is the correlation between the pair of variables corresponding to its row and column.
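As a rough sketch of that n X n matrix and of how candidate pairs can be read from it, the snippet below lists every pair whose absolute correlation meets a threshold. `high_corr_pairs` is a hypothetical helper written for this illustration, and the temperature/humidity columns are made-up data.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.95):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

rng = np.random.default_rng(7)
temp_c = rng.normal(20, 5, size=250)
df = pd.DataFrame({
    "temp_c": temp_c,
    "temp_f": temp_c * 9 / 5 + 32,                # same information in other units
    "humidity": rng.uniform(30, 90, size=250),    # generated independently
})

pairs = high_corr_pairs(df, threshold=0.95)
print(pairs)
```

Only the `temp_c`/`temp_f` pair is returned; the question of which member of such a pair to drop is a separate decision.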

Suppose we see that 2 variables are highly correlated. Which one do we eliminate? There are two ways of doing this:
1. Eliminate the variable which has lower correlation with target
2. Eliminate the variable which has a higher mean correlation with all the input variables in the dataset

How do these methods differ? Option 1 uses the correlation technique we discussed previously to eliminate one of the two. But as we saw, a lower correlation with the target does not necessarily mean we should drop the variable (think of the parabola shape we discussed earlier). Option 2 compares each variable not with the target but with the other input variables. It relies on the same kind of correlation and so could have a similar drawback.

I personally prefer option 2 since, in my experience, checking against multiple variables usually results in fewer cases of parabolas and more cases of straightforward correlations, but you need to explore your data and figure it out. Or you could reduce the threshold, try both, and use the union of what you get. But since this is quick and dirty variable reduction, we need not aim for the absolute best method. That can be handled by variable clustering, PCA, etc.
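To make the difference between the two options concrete, here is a compact sketch that applies both rules to the same toy data. `drop_from_pairs` is a hypothetical helper that condenses the pairwise logic into one place; the data is constructed by hand from orthogonal ±1 patterns so that the two rules disagree.

```python
import numpy as np
import pandas as pd

def drop_from_pairs(corr_abs, score, threshold, drop_higher_score):
    """For each pair with |r| >= threshold, drop one member based on its score."""
    cols = list(corr_abs.columns)
    dropped = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped or not (corr_abs.loc[a, b] >= threshold):
                continue
            a_higher = score[a] >= score[b]
            dropped.append(a if a_higher == drop_higher_score else b)
    return dropped

# Mutually orthogonal, zero-mean building blocks (chosen by hand for this example)
u1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
u2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
u3 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
u4 = u2 * u3

features = pd.DataFrame({"a": u1, "b": u1 + 0.2 * u2, "c": u2 + u3})
target = pd.Series(u1 + 0.5 * u2 + u4)

corr_abs = features.corr(method="pearson").abs()
target_score = features.corrwith(target).abs()  # option 1: correlation with the target
mean_score = corr_abs.mean()                    # option 2: mean absolute correlation

drop_by_target = drop_from_pairs(corr_abs, target_score, 0.95, drop_higher_score=False)
drop_by_mean = drop_from_pairs(corr_abs, mean_score, 0.95, drop_higher_score=True)
print("option 1 drops:", drop_by_target)
print("option 2 drops:", drop_by_mean)
```

In this toy case the two rules pick different members of the same pair, so taking the union of both lists would remove `a` and `b` together; it is worth checking for that before dropping the combined list.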

### Cross-correlation (dropping by correlation with target)

In [2]:
def cross_corr_target(df_input, df_target, corr_coeff=0.95, plot=0, savefile=0):
	""" The function returns a list of features to be dropped from the input features.
	
	INPUTS:
	1. df_input: n numeric input features (pandas dataframe)
	2. df_target: Target values (ensure same order as input features)
	3. corr_coeff: Coefficient threshold (absolute value, no negatives) for a pair of variables above which one of the two will be dropped
	4. plot: If set to 1 a plot will be displayed showing a heatmap of the cross-correlation between variables 
	5. savefile: If set to 1, all relevant files will be saved
	
	PLEASE NOTE:
	- The dataframe df_input should contain only the n numeric input features (i.e. no ID and targets) 
	- The pandas series df_target should only be 1 column (if multiclass it should include all classes) and should be in the same order as the input dataset df_input

	SUMMARY OF LOGIC:
	1. The n numeric input variables are taken and a n X n matrix of correlation is created (these are absolute values i.e. a correlation of -0.8 is treated as 0.8)
	2. Variable pairs with correlation higher than the corr_coeff threshold are picked and one of the two variables will be dropped
	3. Which of the two will be dropped is based on the one having lower correlation with target variable

	SAVED FILES:
	If savefile is set to 1. Saved under current directory under cross_corr_target under folder with a UTC timestamp.
	1. A PDF heatmap representing the cross correlation between all the input features
	2. A CSV of the matrix containing the values used for the heatmap
	3. A CSV of correlation of input features with the target
	4. A CSV of the list of variables to be dropped
	"""
	
	# Pre-processing
	import pandas as pd
	if savefile == 1:
		from datetime import datetime
		time = str(datetime.utcnow())
		import os
		path = str(os.getcwd()) + "/cross_corr_target/" + str(time[0:19].replace(':',"-")) + "/"
		os.makedirs(path)

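	# Combining the input and target data (copying so that the caller's dataframe is not modified)
	df = pd.DataFrame(df_input).copy()
	df["target"] = pd.Series(df_target).values # Assigned by position, so the order must match df_input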

	# Generating correlation matrix of input features
	corr_matrix = df.corr(method = 'pearson') # For more info on the methods please refer to https://www.kaggle.com/kiyoung1027/correlation-pearson-spearman-and-kendall

	# Plotting cross correlation matrix
	if plot == 1:
		import matplotlib.pyplot as plt
		import seaborn as sns
		plt.figure(figsize=(10,8))
		sns.heatmap(corr_matrix.iloc[0:corr_matrix.shape[0]-1,0:corr_matrix.shape[0]-1].round(2), cmap=plt.cm.Blues)
		fig = plt.gcf()
		plt.show()
		if savefile == 1:
			fig.savefig(path + "cross_corr_heatmap.pdf")

	# Generating correlation with the target
	corr_target = (corr_matrix["target"])

	# Removing the target so that only input features are compared below (this must happen whether or not files are saved)
	corr_matrix = corr_matrix.iloc[0:corr_matrix.shape[0]-1,0:corr_matrix.shape[0]-1] # Removing the last row and column which contain the target
	corr_target = corr_target.iloc[0:corr_target.shape[0]-1] # Removing the value which contains the target

	# Saving files
	if savefile ==1:
		corr_matrix.to_csv(path + "cross_corr.csv")
		corr_target.to_csv(path + "target_corr.csv")

	# Preparing data
	features_drop_list = [] # This will contain the list of features to be dropped
	features_index_drop_list = [] # This will contain the index of features to be dropped as per df_input
	corr_matrix = abs(corr_matrix)
	corr_target = abs(corr_target)

	# Selecting features to be dropped (Using two for loops that runs on one triangle of the corr_matrix to avoid checking the correlation of a variable with itself)
	for i in range(corr_matrix.shape[0]):
		for j in range(i+1,corr_matrix.shape[0]):

			# The following if statement checks if each correlation value is higher than threshold (or equal) and also ensures the two columns have NOT been dropped already.  
			if corr_matrix.iloc[i,j]>=corr_coeff and i not in features_index_drop_list and j not in features_index_drop_list:
			
				# The following if statement checks which of the 2 variables with high correlation has a lower correlation with target and then drops it. If equal we can drop either and it drops the second one (This is arbitrary)
				if corr_target[corr_matrix.columns[i]] >= corr_target[corr_matrix.columns[j]]:
					features_drop_list.append(corr_matrix.columns[j])	# Name of variable that needs to be dropped appended to list
					features_index_drop_list.append(j)	# Index of variable that needs to be dropped appended to list. This is used to not check for the same variables repeatedly
				else:
					features_drop_list.append(corr_matrix.columns[i])
					features_index_drop_list.append(i)
	
	# Saving final list
	if savefile ==1:
		pd.Series(features_drop_list).to_csv(path + "features_drop_list.csv")

	return features_drop_list

### Cross-correlation (dropping by mean correlation with all features)

In [3]:
def cross_corr_mean(df_input, corr_coeff=0.95, plot=0, savefile=0):
	""" The function returns a list of features to be dropped from the input features.
	
	INPUTS:
	1. df_input: n input features (pandas dataframe)
	2. corr_coeff: Coefficient threshold (absolute value, no negatives) for a pair of variables above which one of the two will be dropped
	3. plot: If set to 1 a plot will be displayed showing a heatmap of the cross-correlation between variables 
	4. savefile: If set to 1, all relevant files will be saved
	
	PLEASE NOTE:
	- The dataframe df_input should contain only the n input features (i.e. no ID and targets) 
	
	SUMMARY OF LOGIC:
	1. The n input variables are taken and a n X n matrix of correlation is created (these are absolute values i.e. a correlation of -0.8 is treated as 0.8)
	2. Variable pairs with correlation higher than the corr_coeff threshold are picked and one of the two variables will be dropped
	3. Which of the two will be dropped is based on the one having higher mean absolute correlation with all the input variables 

	SAVED FILES:
	If savefile is set to 1. Saved under current directory under cross_corr_mean under folder with a UTC timestamp.
	1. A PDF heatmap representing the cross correlation between all the input features
	2. A CSV of the matrix containing the values used for the heatmap
	3. A CSV of the list of variables to be dropped
	"""
	
	# Pre-processing
	import pandas as pd
	if savefile == 1:
		from datetime import datetime
		time = str(datetime.utcnow())
		import os
		path = str(os.getcwd()) + "/cross_corr_mean/" + str(time[0:19].replace(':',"-")) + "/"
		os.makedirs(path)

	# Generating correlation matrix of input features
	corr_matrix = df_input.corr(method = 'pearson') # For more info on the methods please refer to https://www.kaggle.com/kiyoung1027/correlation-pearson-spearman-and-kendall

	# Plotting cross correlation matrix
	if plot == 1:
		import matplotlib.pyplot as plt
		import seaborn as sns
		plt.figure(figsize=(10,8))
		sns.heatmap(corr_matrix.round(2), cmap=plt.cm.Blues)
		fig = plt.gcf()
		plt.show()
		if savefile == 1:
			fig.savefig(path + "cross_corr_heatmap.pdf")

	# Generating the mean absolute correlation of each feature with all features (including itself)
	corr_mean = abs(corr_matrix).mean()

	# Saving files
	if savefile ==1:
		corr_matrix.to_csv(path + "cross_corr.csv")
		corr_mean.to_csv(path + "corr_abs_mean.csv")

	# Preparing data
	features_drop_list = [] # This will contain the list of features to be dropped
	features_index_drop_list = [] # This will contain the index of features to be dropped as per df_input
	corr_matrix = abs(corr_matrix)

	# Selecting features to be dropped (Using two for loops that runs on one triangle of the corr_matrix to avoid checking the correlation of a variable with itself)
	for i in range(corr_matrix.shape[0]):
		for j in range(i+1,corr_matrix.shape[0]):

			# The following if statement checks if each correlation value is higher than threshold (or equal) and also ensures the two columns have NOT been dropped already.  
			if corr_matrix.iloc[i,j]>=corr_coeff and i not in features_index_drop_list and j not in features_index_drop_list:
			
				# The following if statement checks which of the 2 variables with high correlation has a higher mean absolute correlation with all features and then drops it. If equal we can drop either and it drops the first one (This is arbitrary)
				if corr_mean[corr_matrix.columns[i]] >= corr_mean[corr_matrix.columns[j]]:
					features_drop_list.append(corr_matrix.columns[i])	# Name of variable that needs to be dropped appended to list
					features_index_drop_list.append(i)	# Index of variable that needs to be dropped appended to list. This is used to not check for the same variables repeatedly
				else:
					features_drop_list.append(corr_matrix.columns[j])
					features_index_drop_list.append(j)
	
	# Saving final list
	if savefile ==1:
		pd.Series(features_drop_list).to_csv(path + "features_drop_list.csv")

	return features_drop_list

### Please note

1. The input features accepted here are only numeric in nature. Characters, strings, booleans, etc. do not work. Convert these to numeric types before use, or use other techniques such as Cramér's V
2. The target data to be input needs to be one column (even if multiclass) and needs to be in the same order as the observations in the input data
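For categorical columns, Cramér's V mentioned above can be sketched as follows. This is a minimal, uncorrected version (no bias correction) computed directly from the chi-squared statistic of a contingency table; `cramers_v` is a hypothetical helper and the example columns are made up.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical variables (0 = no association, 1 = perfect)."""
    table = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    if k == 0:
        return float("nan")  # undefined when either variable has a single category
    return float(np.sqrt(chi2 / (n * k)))

colour = ["red", "red", "blue", "blue"]
colour_code = ["R", "R", "B", "B"]   # same information under different labels
size = ["S", "L", "S", "L"]          # unrelated to colour in this toy data

v_same = cramers_v(colour, colour_code)
v_unrelated = cramers_v(colour, size)
print(v_same, v_unrelated)
```

A value near 1 plays the same role for categorical pairs that a high absolute correlation plays for numeric pairs in the functions above.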