# Mapping Cities As Flows - Correlation Values

The most commonly adopted approach to evaluating the relative importance of each predictor in a model is to calculate what are known as ‘standardised’ regression coefficients. Instead of measuring the impact on the outcome variable of a one unit change in the predictor variable, standardised regression coefficients measure the impact of a one standard deviation change in the value of the predictor variable. The standard deviation is a measure of the spread (dispersion) of a variable, so a one standard deviation change represents the same relative amount of change for every predictor, whatever units it is measured in.
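To make the idea concrete, here is a minimal sketch on synthetic data. It is not part of the analysis in this notebook: the two predictors, their ranges and the coefficients used to generate the outcome are all made up for illustration. It fits an ordinary least-squares model with NumPy, then obtains standardised coefficients in two equivalent ways: by refitting on z-scored variables, and by rescaling the raw slopes by sd(x) / sd(y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical predictors on very different scales (values are synthetic)
unemployment = rng.uniform(1, 8, n)    # ranges over roughly 1 to 8
flats = rng.uniform(4, 98, n)          # ranges over roughly 4 to 98
outcome = 2.0 * unemployment + 0.2 * flats + rng.normal(0, 1, n)

X = np.column_stack([unemployment, flats])


def ols_slopes(X, y):
    """Least-squares slopes for y ~ intercept + X (the intercept is dropped)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]


# Raw slopes: effect of a one unit change in each predictor
raw = ols_slopes(X, outcome)

# Standardised slopes: refit after z-scoring predictors and outcome
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yz = (outcome - outcome.mean()) / outcome.std(ddof=1)
standardised = ols_slopes(Xz, yz)

# Equivalent shortcut: rescale each raw slope by sd(x) / sd(y)
rescaled = raw * X.std(axis=0, ddof=1) / outcome.std(ddof=1)

print("raw slopes:         ", raw)
print("standardised slopes:", standardised)
```

In this synthetic setup the raw slope is larger for the first predictor, but the standardised slope is larger for the second, because the second predictor varies over a much wider range.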

In [10]:
#load packages
import pandas as pd

In [11]:
import sklearn

In [98]:
#setting to view more rows in the output
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [65]:
#load data
data = pd.read_csv('trips.csv')  # reads 'csv' into df

In [74]:
#specify columns to obtain correlation coefficients for
columns = ['total_trip_time','overall_full_dist','velo_mean','std_velo','max_velo','dist_trns_line','in_range','closest_motorway_dst','dist_trns_busline','train_conf','bus_conf','motorway_conf', 'label'] #, 'geometry'] 

In [75]:
#filter data on columns selected
data = data[columns]

In [68]:
#scale each variable by its column maximum
#('motorway_conf' and 'label' are not rescaled)
scale_cols = ['total_trip_time', 'overall_full_dist', 'velo_mean', 'std_velo', 'max_velo', 'dist_trns_line', 'in_range', 'closest_motorway_dst', 'dist_trns_busline', 'train_conf', 'bus_conf']
data[scale_cols] = data[scale_cols] / data[scale_cols].max()

In [76]:
#Pearson's correlation of every column with 'label'
pearsons = data.corr()['label']

In [77]:
pearsons

total_trip_time         0.085233
overall_full_dist       0.313164
velo_mean               0.539729
std_velo                0.529379
max_velo                0.515872
dist_trns_line         -0.072524
in_range                0.475997
closest_motorway_dst   -0.135708
dist_trns_busline       0.075231
train_conf              0.223637
bus_conf               -0.368634
motorway_conf           0.277704
label                   1.000000
Name: label, dtype: float64

In [51]:
#reset index on the df
data.reset_index(inplace=True, drop=True)

In [6]:
import numpy as np  # needed for the np.zeros calls below
from scipy import stats

In [2]:
#extract correlation coefficient and the level of significance

In [111]:
coeffmat = np.zeros((data.shape[1], 1))

In [112]:
pvalmat = np.zeros((data.shape[1],1))

In [118]:
#compute Pearson's r and its p-value for each column against 'label'
for i in range(data.shape[1]):
    corrtest = stats.pearsonr(data[data.columns[i]], data['label'])

    coeffmat[i] = corrtest[0]  # correlation coefficient
    pvalmat[i] = corrtest[1]   # p-value

#wrap the results in dataframes indexed by column name
dfcoeff = pd.DataFrame(coeffmat, index=data.columns)
dfpval = pd.DataFrame(pvalmat, index=data.columns)

In [120]:
pd.concat([dfcoeff,dfpval], axis=1)

                             0              0
total_trip_time       0.085233   4.108662e-06
overall_full_dist     0.313164   2.804953e-67
velo_mean             0.539729  7.138320e-220
std_velo              0.529379  4.487821e-210
max_velo              0.515872  8.679360e-198
dist_trns_line       -0.072524   8.962916e-05
in_range              0.475997  1.386604e-164
closest_motorway_dst -0.135708   1.922175e-13
dist_trns_busline     0.075231   4.829664e-05
train_conf            0.223637   2.491165e-34


In [16]:
#example extracting one correlation coefficient and its p-value
stats.pearsonr(x=data['total_trip_time'], y=data['label'])

(0.08523256069006066, 4.10866226894241e-06)
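The same per-column test can be collected into a single table with named columns, which is easier to read than two unnamed `0` columns. The sketch below is one possible way to do this and runs on a small synthetic dataframe, not on `trips.csv`: the `demo` frame, its values and the output column names `coefficient` and `p_value` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for the trips data (values are made up)
label = rng.integers(0, 2, n)
demo = pd.DataFrame({
    'velo_mean': label * 1.0 + rng.normal(0, 1, n),  # related to the label
    'total_trip_time': rng.normal(0, 1, n),          # unrelated to the label
    'label': label,
})

# One row per predictor: Pearson's r against 'label' and its p-value
rows = {}
for col in demo.columns.drop('label'):
    r, p = stats.pearsonr(demo[col], demo['label'])
    rows[col] = {'coefficient': r, 'p_value': p}

summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)
```

The `coefficient` column should agree with `demo.corr()['label']`, since both compute Pearson's r; `stats.pearsonr` additionally reports the p-value.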

### Notes on regression values and standardising 

The standard deviation is a measure of the spread (dispersion) of a variable. For an approximately normally distributed variable, about 68% of observations fall within one standard deviation either side of the mean. Hence a one standard deviation change in the value of a predictor variable represents the same relative shift within its distribution, regardless of whether its values range from 1 to 8 (% unemployment) or 4 to 98 (% flats) etc.
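The 68% figure can be checked empirically. The following sketch uses simulated samples (the means, scales and sample sizes are arbitrary choices, not values from the trips data) and compares a normal sample with a strongly skewed one, to show that the figure depends on the shape of the distribution.

```python
import numpy as np

rng = np.random.default_rng(2)


def share_within_one_sd(values):
    """Fraction of observations within one standard deviation of the mean."""
    mean, sd = values.mean(), values.std(ddof=1)
    return np.mean(np.abs(values - mean) <= sd)


normal_sample = rng.normal(50, 10, 100_000)     # symmetric, bell-shaped
skewed_sample = rng.exponential(10, 100_000)    # strongly right-skewed

normal_share = share_within_one_sd(normal_sample)
skewed_share = share_within_one_sd(skewed_sample)

print("normal sample:", round(normal_share, 3))   # close to 0.68
print("skewed sample:", round(skewed_share, 3))   # noticeably higher, roughly 0.86
```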

The main downsides of standardisation are that: (i) the variables being standardised need to be approximately normally distributed to start with, as this assumption underpins the concept of ‘standard deviation’; and (ii) dichotomous (yes/no) predictor variables are clearly not normally distributed, making standardisation inappropriate for evaluating the relative importance of any categorical variable.
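Point (ii) can be illustrated with a small calculation. This is only a sketch under a simple assumption: the predictor is coded 0/1, and the helper name `z_values` is hypothetical. A z-scored yes/no variable takes just two values, and the distance between them, measured in standard deviations, depends on how common the "yes" category is.

```python
import numpy as np


def z_values(share_yes):
    """The two z-scores a 0/1 variable takes when `share_yes` of cases are 1."""
    mean = share_yes
    sd = np.sqrt(share_yes * (1 - share_yes))  # population SD of a 0/1 variable
    return (0 - mean) / sd, (1 - mean) / sd


for share in (0.5, 0.1):
    z_no, z_yes = z_values(share)
    print(f"share of 'yes' = {share}: z(no) = {z_no:.2f}, "
          f"z(yes) = {z_yes:.2f}, gap = {z_yes - z_no:.2f} SD")
```

Switching from "no" to "yes" is the only change such a variable can make, yet it corresponds to a different number of standard deviations in each case, so a "one standard deviation change" has no fixed interpretation for a dichotomous predictor.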