# <font color='maroon'>Correlation matrix</font>

A correlation matrix for $n$ variables has $n$ rows and $n$ columns. Each variable in the dataset is represented by one row and one column. Each entry is the correlation between the variable along the row and the variable along the column. The diagonal entries equal one because every variable is perfectly positively correlated with itself. The matrix therefore summarises the pairwise correlations between all the variables in the dataset.
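
As a minimal sketch of these properties, the cell below builds a correlation matrix for a small made-up table. The column names `x`, `y` and `z` and their values are hypothetical and are not taken from the credit union dataset. It checks that the diagonal entries are one and that the matrix is symmetric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three made-up variables, not the credit union dataset.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],   # roughly 2 * x, so it rises with x
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to fall as x rises
})

toy_corr = toy.corr()                 # Pearson correlation is the default
print(toy_corr.round(3))

# Diagonal entries are 1: each variable is perfectly correlated with itself.
diagonal_is_one = np.allclose(np.diag(toy_corr.values), 1.0)

# The matrix is symmetric: corr(x, y) equals corr(y, x).
is_symmetric = np.allclose(toy_corr.values, toy_corr.values.T)

print(diagonal_is_one, is_symmetric)
```

Because the matrix is symmetric, the entries above the diagonal repeat those below it, so only one triangle needs to be read.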


### The dataset

The dataset we'll use comes from the [Australian government website](https://data.gov.au/dataset/credit-unions-selected-assets-and-liabilities). It contains selected assets and liabilities figures submitted by credit unions, which are referred to as authorised deposit-taking institutions (ADIs). The variables in the dataset are described as follows:

* ‘Cash and liquid assets’ is composed of ‘Cash’, ‘Balances with ADIs’ and ‘Other’. None of these items include bills of exchange, bills receivable, remittances in transit or certificates of deposit.

* ‘Cash’ includes Australian and foreign currency notes and coins, gold coin, gold bullion, and gold certificates held as investments. It excludes loans repayable in gold bullion.

* ‘Balances with ADIs’ includes deposits at call with Australian resident banks and other ADIs and settlement account balances due from banks and other ADIs, incorporating receivables for unsettled sales of securities.

* ‘Other’ includes deposits at call with Registered Financial Corporations (RFCs) and other financial institutions, net claims on recognised clearing houses in Australia, securities purchased under agreements to resell, funds held with the Reserve Bank and other central banks, and settlement account balances due from the Reserve Bank, other central banks, RFCs and other financial institutions, incorporating receivables for unsettled sales of securities.

* ‘Government securities’, ‘ADI securities’, ‘Corporate paper’ and ‘Other securities’ include both trading and investment securities. Trading securities are recorded at net fair value. Investment securities are recorded at cost and adjusted for the amortisation of any premiums and discounts on purchase over the period of maturity.

* ‘Government securities’ include securities issued by the Australian, State, Territory and local governments and State and Territory central borrowing authority (CBA) securities.

* ‘ADI securities’ includes securities issued by banks and other ADIs, but not equity investments in parent, controlled or associated entities.

* ‘Other securities’ includes asset-backed securities, other debt securities and equity securities, other than those issued by ADIs, but not equity investments in parent, controlled or associated entities.

* ‘Residential’ includes both owner-occupied and investment housing loans to Australian households, net of specific provisions for doubtful debts.

* ‘Personal’ includes revolving credit for a purpose other than housing, credit card liabilities, lease financing net of unearned revenue, and other personal term loans to Australian households net of specific provisions for doubtful debts.

* ‘Commercial’ includes loans to public non-financial corporations, private trading corporations, private unincorporated businesses, community service organisations, Australian, State, Territory and local governments, ADIs and other financial institutions, net of specific provisions for doubtful debts. Loans to ADIs and other financial institutions includes loans to the Reserve Bank and other central banks, banks, other ADIs, RFCs, central borrowing authorities, fund managers, stockbrokers, insurance brokers, securitisers, mortgage, fixed interest and equity unit trusts and other financial intermediaries.

Selected Liabilities:

* ‘Borrowings from ADIs’ includes settlement account balances due to ADIs and both variable and fixed interest rate short-term loans from ADIs. A loan is reported as short-term if its residual term to maturity is one year or less.

* ‘Deposits’ includes retail transaction call deposit accounts held by households, all other transaction call deposit accounts held by entities other than households, deposits from resident banks, resident non-bank financial institutions and intermediaries such as merchant banks, vostro balances from banks and non-bank financial institutions (NBFIs), the Australian-dollar equivalent of foreign currency deposits, deposits from controlled and associated entities, retail non-transaction call deposit accounts held by households, all other non-transaction deposit call accounts held by entities other than households, term deposits, certificates of deposit and other forms of deposits.

* ‘Other’ liabilities includes settlement account balances due to RFCs and other financial institutions, securities sold under agreements to repurchase, promissory notes or commercial paper with a residual term to maturity of one year or less, other debt securities with a residual term of one year or less, variable interest rate short-term loans from counterparties other than ADIs, fixed interest rate short-term loans from counterparties other than ADIs, debt securities with a residual term to maturity of more than one year, variable and fixed interest rate loans and borrowings from Australian residents with a residual term to maturity of more than one year, interest accrued but not yet paid, interest received but not yet earned, unrealised losses on trading derivatives, items in suspense and other liabilities not separately identified above. A loan is reported as short-term if its residual term to maturity is one year or less. ‘Other’ liabilities do not include amounts due to clearing houses.

Let's study the correlation between these variables.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot
import seaborn as sns
import scipy.stats as stats

In [None]:
%matplotlib inline

### The data
The data consists of information on selected assets and liabilities for credit unions in Australia. We suspect that purchasing a house is highly correlated with securing a loan from a credit institution. But how strong is the relationship between these two variables? And between these variables and other variables connected to a credit union?

In [None]:
data = pd.read_csv('assets.csv', sep=',')

In [None]:
data.columns

In [None]:
data.head()

## Visualizing the correlation matrix

We can visualize a correlation matrix in several ways. One is to call the pandas method `corr()`, which returns the matrix as a table. Another is to use the Seaborn package, which draws a heatmap to indicate the strength of the relationship between variables. We use it for its lovely visuals.
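
The following is a minimal sketch of how `corr()` behaves, run on a made-up dataframe rather than on `assets.csv`. The column names `period`, `loans` and `deposits` are hypothetical. It assumes pandas 1.5 or later, where the `numeric_only` argument is available, and shows the `method` argument for switching from the default Pearson coefficient to Spearman's rank correlation.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column (made up for illustration).
toy = pd.DataFrame({
    "period": ["2015-Q1", "2015-Q2", "2015-Q3", "2015-Q4"],
    "loans": [10.0, 12.0, 15.0, 19.0],
    "deposits": [20.0, 23.0, 29.0, 40.0],
})

# numeric_only=True (pandas 1.5 and later) leaves the text column out.
pearson = toy.corr(numeric_only=True)

# method="spearman" correlates the ranks, so it measures monotonic association.
spearman = toy.corr(method="spearman", numeric_only=True)

print(pearson.round(3))
print(spearman.round(3))
```

Either table can be passed to `sns.heatmap()` in the same way as the matrix used in the rest of this notebook.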



Simply calling `corr()` on the dataframe produces the correlation matrix as the table below.

In [None]:
data.corr()

The basic Seaborn function for plotting a heatmap of the correlation matrix is `sns.heatmap()`. Similar to other functions, additional arguments can be passed. We see the lovely visual heatmap generated below for the same table previously generated.

In [None]:
corr = data.corr()
pyplot.figure(figsize=(16, 10))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            linewidths=.08,            # set line width between entries in the matrix
            cbar_kws={"shrink": .7})   # set length of the legend on the right


## Interpreting the results

Each column corresponds to one variable. Going down a column row by row, we see how every other variable is correlated with it. Correlation is symmetric, so neither variable in a pair is the dependent or the independent one. The legend on the right maps the colours to correlation coefficients. For example, how is `Residential loans` correlated with `Other liabilities`? These variables are positively correlated. Let's plot a scatter diagram and calculate the Pearson correlation coefficient for these variables.

In [None]:
sns.regplot(x=data['Other liabilities'], y=data['Residential loans'])  # recent Seaborn versions require x= and y= keywords
print(stats.pearsonr(data['Other liabilities'], data['Residential loans']))

Scatter plots are useful for spotting linear relationships between variables. The scatter plot indicates a positive linear relationship between `Residential loans` and `Other liabilities`. We observe a Pearson correlation coefficient $r=0.915$, a strong positive correlation. In machine learning, reducing the number of variables in a dataset (a technique referred to as dimensionality reduction) helps avoid fitting a model with too many variables. Removing one variable from each highly correlated pair is one way to reduce the dimension of a dataset.

See [Overfitting and Underfitting With Machine Learning Algorithms](https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/) for a discussion on overfitting in machine learning.
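
One possible way to find and remove highly correlated variables is sketched below. The helper `highly_correlated_pairs`, the 0.9 threshold and the toy columns `a`, `b` and `c` are all assumptions made for illustration. They are not part of the dataset or of any library, and the threshold should be chosen to suit the problem at hand.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (var_a, var_b, r) for each pair with |r| at or above threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1) is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in strong.items()]

# Hypothetical toy data: "b" is an exact linear function of "a".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [2.0, 9.0, 1.0, 7.0, 4.0],
})

pairs = highly_correlated_pairs(toy, threshold=0.9)
print(pairs)

# Drop the second variable of each strongly correlated pair.
to_drop = sorted({b for _, b, _ in pairs})
reduced = toy.drop(columns=to_drop)
print(list(reduced.columns))
```

Which variable of a pair to keep is a judgment call. This sketch simply keeps the one that appears first in the dataframe.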

### Exercise

Find other highly correlated variables in the dataset. Plot their scatter plots to see what kind of relationship the variables share.

In [None]:
# your answer

### References

[Pairing the Unknown – Liability Correlations and Asset Allocation](https://www.neamgroup.com/insights/pairing-the-unknown-liability-correlations-and-asset-allocation).