# Investigation of correlation in datasets with auto interaction

This notebook investigates the use of optimisation in feature engineering.

The algorithm maximises the correlation between a variable and the target by changing the power to which the variable is raised.
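
Written as a formula, the per-variable search looks roughly like the expression below. The symbols are introduced here for illustration and are not the notebook's own notation: $x_j$ is a feature column, $y$ the target, $\rho$ the Pearson correlation (what `np.corrcoef` computes) and $\hat{p}_j$ the power chosen for that feature.

```latex
\hat{p}_j \;=\; \arg\max_{p \in \mathbb{R}} \; \left| \rho\!\left( x_j^{\,p},\; y \right) \right|
```

The absolute value means a strong negative correlation counts as much as a strong positive one.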

The algorithm is compared to sklearn's `PolynomialFeatures`.

Advantages:
- Allows powers that are not positive integers, i.e. fractional and negative powers
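- Potentially cheaper for high powers: one exponent is fitted per variable, whereas a polynomial expansion adds a column for every term up to the chosen degree.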
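
To put the cost argument in numbers: a full polynomial expansion of n variables up to degree d (without the bias column) has C(n + d, d) - 1 columns, while the power search fits one exponent per variable. The sketch below counts those columns; `n_poly_features` is a hypothetical helper name, not part of sklearn.

```python
from math import comb


def n_poly_features(n_vars: int, degree: int) -> int:
    """Columns in a full polynomial expansion without the bias term."""
    # All monomials of total degree <= degree in n_vars variables,
    # minus 1 for the constant (bias) term.
    return comb(n_vars + degree, degree) - 1


for degree in (2, 5, 10):
    print('degree', degree, '->', n_poly_features(4, degree), 'columns')
```

For the four variables used here, degree 5 already gives 125 columns against four fitted exponents.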

Disadvantages:
- Does not consider interactions between variables
- The recovered powers are approximate; the method is only meant to give a rough idea of dimensionality.
- Poor accuracy if a feature is unimportant

Consider:
- Scaling
- Choice of correlation method
- Choice of optimisation method
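
As one illustration of the optimisation-method choice, the search over a single power can also be run with a bounded scalar optimiser. This is a minimal sketch on synthetic, noise-free data, not the method used in the cells below; the search range (-6, 6), the target `20 * x ** 0.5` and the name `neg_abs_corr` are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic, noise-free data (an assumption for illustration): y depends on x ** 0.5.
rng = np.random.default_rng(0)
x = rng.normal(10, 1, 1000)  # far from zero, so fractional and negative powers are defined
y = 20 * x ** 0.5


def neg_abs_corr(p):
    """Negative absolute Pearson correlation between x ** p and y."""
    return -abs(np.corrcoef(x ** p, y)[0, 1])


# Bounded scalar search over the exponent; the range is an arbitrary choice.
res = minimize_scalar(neg_abs_corr, bounds=(-6, 6), method='bounded')
print('estimated power:', round(res.x, 2), 'correlation:', round(-res.fun, 4))
```

A bounded search keeps the exponent inside a plausible range, which may matter when the correlation surface is very flat.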

In [1]:
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
import time

## Define the objective function
Variable coefficients are scaled to give similar importance to each variable.

In [2]:
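def f(x):
    # Each variable is raised to a different power; the coefficients
    # keep the four terms on a similar scale.
    return (1e-1 * x[0] ** 2
            + 2e1 * x[1] ** 0.5
            + 1e-5 * x[2] ** 5.7
            + 1e4 * x[3] ** -3)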

## Define number of variables, their mean and stdev

In [3]:
num_vars = 4
num_rows = 1000  # samples per variable; the data-generation cell draws 1000 samples
mu = 10
sigma = 1

## Generate data
Data has been generated using the normal distribution. Consider adding noise.

In [4]:
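# One array of 1000 normally distributed samples per variable
xT = [np.random.normal(mu, sigma, 1000) for i in range(num_vars)]
x = np.transpose(xT)

# Evaluate the objective function row by row
y = [f(x[i]) for i in range(len(x))]

df = pd.DataFrame()
for i in range(num_vars):
    df['x_%d' % i] = xT[i]

feats = df.columns  # feature columns only; captured before y is added
df['y'] = y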
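
The target generated above is noise-free. One way to act on the "consider adding noise" note is to perturb y with zero-mean Gaussian noise. This is a minimal sketch: `add_noise` and `rel_sigma` are hypothetical names, and the noise level is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_noise(y, rel_sigma=0.01, rng=rng):
    """Return y plus zero-mean Gaussian noise with std = rel_sigma * std(y)."""
    y = np.asarray(y, dtype=float)
    return y + rng.normal(0.0, rel_sigma * y.std(), size=y.shape)


# Stand-in target (an assumption for illustration); in the notebook this would be y.
y_clean = np.linspace(0.0, 100.0, 1000)
y_noisy = add_noise(y_clean, rel_sigma=0.05)
```

In the notebook this could be applied as `df['y'] = add_noise(y)` before computing the correlations.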

## Correlation of variables with y

In [5]:
correlations = [np.corrcoef([df[feature]**1,df.y])[0][1] for feature in feats]
corrdf = pd.DataFrame({'Correlation':correlations}, index=feats)
corrdf

| Feature | Correlation |
|---|---|
| x_0 | 0.358407 |
| x_1 | 0.531528 |
| x_2 | 0.509809 |
| x_3 | -0.504638 |


## Change the power of each variable to optimise correlation

In [6]:
t1 = time.time()
indices = []
optCorrelations=[]

for feature in feats:
    
    def objective(p):
        # minimize passes a length-1 array; p[0] is the candidate power
        return -abs(np.corrcoef([df[feature] ** p[0], df.y])[0][1])
    sol = minimize(objective, 1, method='Nelder-Mead')
    indices.append(round(sol.x[0], 1))
    optCorrelations.append(-sol.fun)
    
df2 = pd.DataFrame({'Correlation':correlations,'Optimised Index':indices, 
                    'Optimised correlation':optCorrelations},index=feats)
df2['%Change']=100*round((abs(df2['Optimised correlation'])-abs(df2['Correlation']))/abs(df2['Correlation']),4)

t2=time.time()
print('time taken:',t2-t1, 's')
df2

time taken: 0.11757206916809082 s


| Feature | Correlation | Optimised Index | Optimised correlation | %Change |
|---|---|---|---|---|
| x_0 | 0.358407 | 2.2 | 0.359799 | 0.39 |
| x_1 | 0.531528 | 0.2 | 0.532597 | 0.2 |
| x_2 | 0.509809 | 5.2 | 0.538864 | 5.7 |
| x_3 | -0.504638 | -3.3 | 0.530602 | 5.15 |


## Generate auto interaction variables

In [7]:
t1 = time.time()
poly = PolynomialFeatures(degree=5, include_bias=False, interaction_only=False)
feats2 = poly.fit_transform(df[feats])
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
df3 = pd.DataFrame(feats2, index=df.index, columns=poly.get_feature_names_out(feats))

## Split data into training and testing datasets

In [8]:
yList = np.asarray(df['y'])
xList=np.asarray(df3)

xListTrain = [xList[i] for i in range(len(xList)) if i%5!=0]
xListTest = [xList[i] for i in range(len(xList)) if i%5==0]
yListTrain = [yList[i] for i in range(len(yList)) if i%5!=0]
yListTest = [yList[i] for i in range(len(yList)) if i%5==0]
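
The same every-fifth-row split can be expressed with a boolean mask, which avoids rebuilding Python lists row by row. This is a minimal sketch on stand-in arrays; `X`, `y` and `test_mask` are illustrative names, and in the notebook `xList` and `yList` would take their place.

```python
import numpy as np

# Stand-in arrays (assumptions for illustration).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Rows 0, 5, 10, ... go to the test set; everything else is training data.
test_mask = np.arange(len(X)) % 5 == 0
X_train, X_test = X[~test_mask], X[test_mask]
y_train, y_test = y[~test_mask], y[test_mask]
```

The result stays as NumPy arrays, which `LinearRegression.fit` accepts directly.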

## Model using Linear Regression

In [9]:
model = linear_model.LinearRegression()
model.fit(xListTrain,yListTrain)
print('Train Score:\t', model.score(xListTrain,yListTrain))
print('Test Score:\t', model.score(xListTest,yListTest))
t2=time.time()
print('time taken:',t2-t1, 's')

Train Score:	 0.9999999150744543
Test Score:	 0.9999852087430756
time taken: 0.08871603012084961 s


## Interactions

In [10]:
df4 = pd.DataFrame({'Coefficient':[round(i,2) for i in model.coef_]},
                   index=poly.get_feature_names_out(feats))
answer=df4.query('abs(Coefficient)>1e-4')

## Compare correlations between features and interaction features

In [11]:
newfeats=list(answer.index)
newcorrelations = [np.corrcoef([df3[feature],df.y])[0][1] for feature in newfeats]
newcorrdf = pd.DataFrame({'Coefficient':answer.Coefficient,'Correlation':newcorrelations,},index=newfeats)
newcorrdf[['Coefficient','Correlation']].sort_values(by='Coefficient',ascending=False)

| Feature | Coefficient | Correlation |
|---|---|---|
| x_3^2 | 36.94 | -0.493907 |
| x_2 | 8.85 | 0.509809 |
| x_1 | 8.35 | 0.531528 |
| x_0 x_3 | 0.84 | -0.093323 |
| x_0^2 | 0.67 | 0.359756 |
| x_1 x_3 | 0.65 | 0.042726 |
| x_2 x_3^2 | 0.44 | -0.181325 |
| x_3^4 | 0.15 | -0.469947 |
| x_0 x_2 | 0.07 | 0.621657 |
| x_0 x_1^2 | 0.07 | 0.626437 |
