## Visualizing Multicollinearity in Python

Network Graphs for the win.

### Introduction

#### What is multicollinearity?

Multicollinearity occurs when two or more independent features are correlated with each other. Correlation between the independent features and the dependent feature is desirable, but correlation among the independent features themselves is less desirable in some settings. A feature that is highly correlated with another adds little information beyond what that other feature already provides, so it can often be omitted. Identifying such features is therefore a form of feature selection. As a data scientist, it is **key to identify and understand multicollinearity in a dataset** prior to training predictive models. Even after a model has been trained, it remains important to limit highly collinear features, because they can lead to misleading conclusions when explaining the model.

#### Why visualize multicollinearity?

Checking the correlation between the independent and dependent features is typically done during exploratory data analysis. It can provide early insight into feature importance and thus a good understanding of how informative the features will be for prediction. For feature selection, you do not necessarily have to inspect correlations between features visually. You can use metrics such as the VIF (**Variance Inflation Factor**) to detect multicollinearity. However, it can still be worthwhile to visualize the correlation between features as a means of extracting insights about the dataset.
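As a rough illustration of the VIF, the sketch below computes it with NumPy and pandas only, using the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The toy features `x1`, `x2`, `x3` and the helper name `vif_from_corr` are made up for this example; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.

```python
import numpy as np
import pandas as pd

def vif_from_corr(df: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns, name="VIF")

# synthetic data: x3 is almost a copy of x1, x2 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vif = vif_from_corr(toy)
print(vif.round(1))  # x1 and x3 get large values, x2 stays close to 1
```

The inversion fails or becomes unstable when two features are perfectly collinear, which is itself a signal that one of them is redundant.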

Correlation between features is typically summarized in a **correlation matrix**, which in turn is visualized as a heatmap showing the correlation coefficient of each pair of features in the dataset. Unfortunately, if the dataset has a large number of features, then all a heatmap may do at that point is draw a nice piece of 8-bit artwork. It can be incredibly difficult to extract any information because of the sheer size of the resulting heatmap. **With 50 features, that is a matrix with a shape of 50 x 50.** Colors and intensity may help to distinguish the strongest correlations, but that will be about it. Surely, there must be a better way.
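To get a feel for the scale problem, note that a 50 x 50 correlation matrix contains 50 * 49 / 2 = 1225 distinct feature pairs. A minimal sketch, on synthetic data with hypothetical feature names `f0` to `f49`, that flattens the upper triangle of the matrix into a ranked list of pairs (the helper `ranked_pairs` is an illustration, not a pandas function):

```python
import numpy as np
import pandas as pd

def ranked_pairs(corr: pd.DataFrame) -> pd.Series:
    """Flatten a correlation matrix into one value per feature pair, strongest first."""
    # indices of the strict upper triangle: each pair once, diagonal excluded
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows, cols],
        index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
    )
    # sort by absolute strength while keeping the sign of each correlation
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# 50 independent random features, with one strongly collinear pair planted
rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(size=(300, 50)), columns=[f"f{i}" for i in range(50)])
data["f1"] = data["f0"] + 0.05 * rng.normal(size=300)

pairs = ranked_pairs(data.corr())
print(len(pairs))  # 1225
print(pairs.head(3))
```

A ranked list like this finds the strongest pairs quickly, but it says nothing about groups of mutually correlated features, which is where a visualization earns its keep.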

In this article, we present **three ways to visualize multicollinearity**: the de facto heatmap, the clustermap, and the interactive network graph. We will highlight the pros and cons of each visualization.

#### Visualizing strongly correlated stocks of the S&P500

We will use S&P500 stock data (between 01/01/2020 and 31/12/2021) to visualize collinear stocks. With the **yfinance** package, you can retrieve stock market data by ticker symbol. Before retrieving the stock data, we scrape the S&P500 table from its Wikipedia page to obtain information on all current S&P500 stocks. This includes the names of the stocks, the tickers, the corresponding sectors, and more.

Although we specifically work with time series data in this article, the proposed visualizations are data-agnostic. All we need is the correlation matrix of a DataFrame, computed with the pandas method .corr().

In [3]:
# import all the required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import os
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import networkx as nx
from ipywidgets import Layout, widgets
#from google.colab import output
#output.enable_custom_widget_manager()
import math
import matplotlib.dates as md
import yfinance as yf

In [4]:
payload = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
# S&P500 metadata
sp500_table = payload[0]

In [5]:
sp500_table.head()

,Symbol,Security,SEC filings,GICS Sector,GICS Sub-Industry,Headquarters Location,Date first added,CIK,Founded
0,MMM,3M,reports,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1976-08-09,66740,1902
1,AOS,A. O. Smith,reports,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,ABT,Abbott,reports,Health Care,Health Care Equipment,"North Chicago, Illinois",1964-03-31,1800,1888
3,ABBV,AbbVie,reports,Health Care,Pharmaceuticals,"North Chicago, Illinois",2012-12-31,1551152,2013 (1888)
4,ABMD,Abiomed,reports,Health Care,Health Care Equipment,"Danvers, Massachusetts",2018-05-31,815094,1981


In [7]:
# mappings: ticker -> name, name -> sector / sub-industry, and sector -> color
sp500_tickers = sp500_table.Symbol.str.upper().values
sp500_names = sp500_table.Security.values
sp500_sectors = sp500_table["GICS Sector"].values
sp500_sub_sectors = sp500_table["GICS Sub-Industry"].values
sp500_names_mapping = dict(zip(sp500_tickers, sp500_names))
sp500_sector_mapping = dict(zip(sp500_names, sp500_sectors))
sp500_sub_sector_mapping = dict(zip(sp500_names, sp500_sub_sectors))
sector_color_mapping = dict(zip(sp500_sectors, sns.color_palette("pastel", len(sp500_sectors)).as_hex()))
subsector_color_mapping = dict(zip(sp500_sub_sectors, sns.color_palette("pastel", len(sp500_sub_sectors)).as_hex()))

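# download S&P500 financial data
tickers = list(sp500_tickers)
# auto_adjust=False keeps the "Adj Close" column, which recent yfinance releases omit by default
prices = yf.download(tickers, start="2020-01-01", end="2021-12-31", interval="1d", auto_adjust=False)
# keep the adjusted closing prices and label the columns with company names instead of tickers
prices = prices["Adj Close"]
prices = prices.rename(columns=sp500_names_mapping)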

[*********************100%***********************]  503 of 503 completed

3 Failed downloads:
- BF.B: No data found for this date range, symbol may be delisted
- BRK.B: No data found, symbol may be delisted
- CEG: Data doesn't exist for startDate = 1577817000, endDate = 1640889000
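The two `.B` failures are most likely a symbol-format issue: Yahoo Finance writes share classes with a dash (`BRK-B`, `BF-B`), whereas the Wikipedia table uses a dot. A minimal sketch of a normalization step that could be applied before the download; the helper name `to_yahoo_symbol` is hypothetical and not part of yfinance:

```python
def to_yahoo_symbol(symbol: str) -> str:
    """Convert a dotted share-class ticker (e.g. 'BRK.B') to the dashed form ('BRK-B')."""
    return symbol.strip().upper().replace(".", "-")

wiki_symbols = ["MMM", "BRK.B", "BF.B"]
yahoo_symbols = [to_yahoo_symbol(s) for s in wiki_symbols]
print(yahoo_symbols)  # ['MMM', 'BRK-B', 'BF-B']
```

If this step is used, the keys of `sp500_names_mapping` would need the same normalization so that the later column rename still matches.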


Date,Agilent Technologies,American Airlines Group,Advance Auto Parts,Apple Inc.,AbbVie,AmerisourceBergen,Abiomed,Abbott,Accenture,Adobe Inc.,...,Wynn Resorts,Xcel Energy,ExxonMobil,Dentsply Sirona,Xylem Inc.,Yum! Brands,Zimmer Biomet,Zebra Technologies,Zions Bancorporation,Zoetis
2019-12-31,83.887718,28.574404,154.004761,72.039871,77.837685,81.503708,170.589996,83.211227,202.855652,329.809998,...,137.714386,59.327812,59.024548,55.320499,76.491547,95.803047,142.644989,255.440002,47.863811,130.197800
2020-01-02,84.517044,28.982893,153.283569,73.683563,78.725624,81.561241,168.809998,83.297447,202.451019,334.429993,...,142.405029,58.290581,59.971928,55.652874,77.520622,97.172623,142.187531,259.140015,48.343185,131.958664
2020-01-03,83.160049,27.548195,153.293198,72.967216,77.978340,80.535484,166.820007,82.281982,202.113846,331.809998,...,140.292755,58.570908,59.489784,55.037003,77.976913,96.868271,141.815887,256.049988,47.660999,131.978363
2020-01-06,83.405884,27.219410,150.773895,73.548630,78.593735,81.714607,179.039993,82.713074,200.794037,333.709991,...,140.015091,58.486805,59.946548,55.340057,77.472084,96.811211,140.996292,258.010010,47.080219,130.965103
2020-01-07,83.661545,27.119778,148.985367,73.202728,78.145386,81.129837,180.350006,82.253220,196.458908,333.390015,...,140.679504,58.365322,59.455940,55.633320,77.180847,96.982407,140.872421,256.470001,46.794434,131.407822
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-12-23,157.067200,18.260000,228.445740,175.553543,129.402618,128.317825,352.209991,137.509171,399.651306,569.619995,...,88.879997,65.676071,59.164177,54.988499,116.684036,133.396210,122.545464,582.409973,61.183460,241.255219
2021-12-27,158.002838,18.170000,232.746384,179.586868,130.686050,130.665649,357.829987,139.781876,411.562225,577.679993,...,87.580002,65.883125,60.007717,55.504360,117.468552,136.027878,123.570305,606.330017,61.821198,245.234528
2021-12-28,158.440781,18.540001,234.350510,178.551132,130.666626,131.121338,357.440002,138.803619,411.502747,569.359985,...,86.459999,66.671913,59.813797,55.583717,118.690010,135.998291,123.957039,597.320007,61.919312,242.986237
2021-12-29,159.903961,18.049999,237.204483,178.640778,131.609741,132.092163,361.839996,139.515076,411.651428,569.289978,...,84.980003,67.007141,59.290222,56.198780,118.531128,136.668549,123.976379,601.119995,62.252895,245.751846


In [10]:
prices.head()

Date,Agilent Technologies,American Airlines Group,Advance Auto Parts,Apple Inc.,AbbVie,AmerisourceBergen,Abiomed,Abbott,Accenture,Adobe Inc.,...,Wynn Resorts,Xcel Energy,ExxonMobil,Dentsply Sirona,Xylem Inc.,Yum! Brands,Zimmer Biomet,Zebra Technologies,Zions Bancorporation,Zoetis
2019-12-31 00:00:00,83.887718,28.574404,154.004761,72.039871,77.837685,81.503708,170.589996,83.211227,202.855652,329.809998,...,137.714386,59.327812,59.024548,55.320499,76.491547,95.803047,142.644989,255.440002,47.863811,130.1978
2020-01-02 00:00:00,84.517044,28.982893,153.283569,73.683563,78.725624,81.561241,168.809998,83.297447,202.451019,334.429993,...,142.405029,58.290581,59.971928,55.652874,77.520622,97.172623,142.187531,259.140015,48.343185,131.958664
2020-01-03 00:00:00,83.160049,27.548195,153.293198,72.967216,77.97834,80.535484,166.820007,82.281982,202.113846,331.809998,...,140.292755,58.570908,59.489784,55.037003,77.976913,96.868271,141.815887,256.049988,47.660999,131.978363
2020-01-06 00:00:00,83.405884,27.21941,150.773895,73.54863,78.593735,81.714607,179.039993,82.713074,200.794037,333.709991,...,140.015091,58.486805,59.946548,55.340057,77.472084,96.811211,140.996292,258.01001,47.080219,130.965103
2020-01-07 00:00:00,83.661545,27.119778,148.985367,73.202728,78.145386,81.129837,180.350006,82.25322,196.458908,333.390015,...,140.679504,58.365322,59.45594,55.63332,77.180847,96.982407,140.872421,256.470001,46.794434,131.407822


In [11]:
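# clean and impute
# drop dates on which more than 90% of the stocks have no price
prices = prices.loc[prices.isnull().mean(axis=1) <= 0.9]
# keep only stocks with less than 30% missing prices
prices = prices.loc[:, prices.isnull().mean() < 0.3]
# fill the remaining gaps with the next available price
prices = prices.bfill()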
print(prices.shape)

(505, 499)
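The cleaning steps above can be sketched on a tiny frame to make the two thresholds concrete. The function name `clean_prices`, its parameter names, and the toy columns `A`, `B`, `C` are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

def clean_prices(prices, max_row_nan=0.9, max_col_nan=0.3):
    # drop dates on which more than `max_row_nan` of the columns are missing
    prices = prices.loc[prices.isnull().mean(axis=1) <= max_row_nan]
    # keep columns whose share of missing values is below `max_col_nan`
    prices = prices.loc[:, prices.isnull().mean() < max_col_nan]
    # fill the remaining gaps with the next available observation
    return prices.bfill()

toy = pd.DataFrame(
    {
        "A": [np.nan, 10.0, 11.0, 12.0, 13.0],
        "B": [np.nan, np.nan, np.nan, 5.0, 6.0],  # 50% missing once the empty date is gone
        "C": [np.nan, 7.0, np.nan, 8.0, 9.0],     # 25% missing once the empty date is gone
    },
    index=pd.date_range("2020-01-01", periods=5),
)

cleaned = clean_prices(toy)
print(cleaned)
```

Here the first date is removed because every column is missing, `B` is removed for exceeding the 30% threshold, and the single gap in `C` is back-filled. Back-filling uses a later price to fill an earlier gap, which is harmless for a correlation overview but would leak future information in a forecasting setting.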


In [21]:
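# calculate rolling correlation
# pairwise correlations over a 60-day window: one correlation matrix per date
corr = prices.rolling(60).corr()
# stack the per-date matrices, skipping dates whose matrix is entirely NaN (e.g. before the first full window)
corr_ = np.array([corr.loc[i].to_numpy() for i in prices.index if not np.isnan(corr.loc[i].to_numpy()).all()])
# average over time, counting any remaining NaN entries as zero
corr_ = np.nansum(corr_, axis=0)/len(corr_)
# back to a labeled DataFrame (stocks x stocks)
corr_ = pd.DataFrame(columns = prices.columns.tolist(), index=prices.columns.tolist(), data = corr_)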
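The cell above builds one correlation matrix per date and then averages them, which can be slow and memory-hungry for roughly 500 stocks. A minimal sketch of the same idea on a toy frame, using a `groupby` over the column level instead of a Python loop. The helper name `mean_rolling_corr` is hypothetical, and unlike the cell above it ignores NaN entries cell by cell rather than counting them as zero; the two agree when no NaN values remain after the warm-up window.

```python
import numpy as np
import pandas as pd

def mean_rolling_corr(prices: pd.DataFrame, window: int) -> pd.DataFrame:
    """Average the per-date rolling correlation matrices into a single matrix."""
    # rows carry a (date, column) MultiIndex: one correlation matrix per date
    rolling = prices.rolling(window).corr()
    # dates before the first full window yield all-NaN rows; drop them
    valid = rolling.dropna(how="all")
    # average each (row label, column label) cell over all remaining dates
    mean = valid.groupby(level=1).mean()
    # groupby sorts the labels, so restore the original column order
    return mean.loc[prices.columns, prices.columns]

# three synthetic random-walk "price" series
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    rng.normal(size=(30, 3)).cumsum(axis=0) + 100,
    columns=["A", "B", "C"],
)

avg = mean_rolling_corr(toy, window=10)
print(avg.round(2))
```

The result has the same stocks-by-stocks layout as `corr_`, so it could feed the same downstream visualizations.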



