# Example usage

This example demonstrates how to use `collinearity_tool` to identify multicollinearity using correlation, the variance inflation factor (VIF), and visualizations.

## Imports

In [7]:
import pandas as pd
import collinearity_tool.collinearity_tool as cl

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/mpg.csv')
data

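| | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
| 230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
| 231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
| 232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |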


## Create correlation matrix

The `corr_matrix` function computes the pairwise correlations between all numerical variables in a data frame. It creates both a generic (wide) correlation matrix and a long-form version with one row per variable pair.

In [8]:
cl.corr_matrix(data)

(   variable1 variable2  correlation  rounded_corr
 0      displ     displ     1.000000          1.00
 1      displ      year     0.147843          0.15
 2      displ       cyl     0.930227          0.93
 3      displ       cty    -0.798524         -0.80
 4      displ       hwy    -0.766020         -0.77
 5       year     displ     0.147843          0.15
 6       year      year     1.000000          1.00
 7       year       cyl     0.122245          0.12
 8       year       cty    -0.037232         -0.04
 9       year       hwy     0.002158          0.00
 10       cyl     displ     0.930227          0.93
 11       cyl      year     0.122245          0.12
 12       cyl       cyl     1.000000          1.00
 13       cyl       cty    -0.805771         -0.81
 14       cyl       hwy    -0.761912         -0.76
 15       cty     displ    -0.798524         -0.80
 16       cty      year    -0.037232         -0.04
 17       cty       cyl    -0.805771         -0.81
 18       cty       cty     1.0
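To make the long-form layout concrete, here is a minimal pure-Python sketch that builds the same kind of table: one row per ordered variable pair, with the raw and rounded Pearson correlation. It is not the package's implementation. The helper names `pearson` and `long_form_corr` are hypothetical, and the sample uses only the first five rows of `mpg` shown earlier, so the values will not match the full-data output above.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def long_form_corr(columns):
    """Return (variable1, variable2, correlation, rounded_corr) for every ordered pair."""
    rows = []
    for (name1, col1), (name2, col2) in product(columns.items(), repeat=2):
        r = pearson(col1, col2)
        rows.append((name1, name2, r, round(r, 2)))
    return rows


# Hypothetical sample: three numeric columns from the first five rows of mpg.
sample = {
    "displ": [1.8, 1.8, 2.0, 2.0, 2.8],
    "cyl": [4, 4, 4, 4, 6],
    "hwy": [29, 29, 31, 30, 26],
}

pairs = long_form_corr(sample)
for row in pairs:
    print(row)
```

Three variables produce nine rows, including each variable paired with itself (correlation 1) and both orderings of every other pair, which mirrors the structure of the output above.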

## Create correlation heatmap

We can plot a correlation heatmap for a data frame using the `corr_heatmap` function.

In [9]:
cl.corr_heatmap(data)

## Create VIF matrix and bar chart of VIFs

We can create a list containing a data frame of Variance Inflation Factors (VIFs) and a bar chart of the VIFs for each explanatory variable in a multiple linear regression model using the `vif_bar_plot` function.

In [16]:
x = ['cyl', 'cty']  # explanatory variables
y = 'hwy'           # response variable
vif = cl.vif_bar_plot(x, y, data, 6)
vif[0]

| | vif_score | explanatory_var |
|---|---|---|
| 0 | 150.964402 | Intercept |
| 1 | 2.851176 | cyl |
| 2 | 2.851176 | cty |


In [17]:
vif[1]
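The identical VIF scores for `cyl` and `cty` follow from the standard definition of the VIF, $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing explanatory variable $j$ on the other explanatory variables. With only two explanatory variables, $R_j^2$ is the squared Pearson correlation between them, so both share one value. The sketch below checks this against the numbers above. It assumes the textbook formula, not the package's internals, and the helper name `vif_two_predictors` is hypothetical.

```python
def vif_two_predictors(r):
    """VIF shared by both predictors in a two-predictor model.

    r is the Pearson correlation between the two predictors, so r ** 2
    equals the R^2 from regressing one predictor on the other.
    """
    return 1.0 / (1.0 - r ** 2)


# Correlation between cyl and cty, copied from the corr_matrix output.
r_cyl_cty = -0.805771

vif = vif_two_predictors(r_cyl_cty)
print(round(vif, 2))  # 2.85, in line with the vif_score reported for cyl and cty
```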

## Multicollinearity identification

The `col_identify` function identifies multicollinearity by finding highly correlated pairs of variables (by Pearson correlation coefficient) whose VIF values exceed a threshold.

In [18]:
cl.col_identify(data, x, y)

| | variable | pair | correlation | rounded_corr | vif_score | eliminate |
|---|---|---|---|---|---|---|
| 0 | cyl | cty \| cyl | -0.805771 | -0.81 | 2.851176 | No |
| 1 | cty | cty \| cyl | -0.805771 | -0.81 | 2.851176 | No |
| 2 | cty | cty \| hwy | 0.955916 | 0.96 | 2.851176 | No |
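The combined rule described in this section (a pair must be highly correlated and the variable's VIF must exceed a threshold) can be sketched as a simple filter. This is only an illustration of the idea, not the package's code. The function name `flag_for_elimination` and the limits `corr_limit=0.8` and `vif_limit=4.0` are assumptions chosen for the example; check the package documentation for the actual defaults.

```python
def flag_for_elimination(rows, corr_limit=0.8, vif_limit=4.0):
    """Mark a row 'Yes' only if |correlation| and VIF both reach their limits.

    corr_limit and vif_limit are illustrative assumptions, not package defaults.
    """
    flagged = []
    for row in rows:
        high_corr = abs(row["correlation"]) >= corr_limit
        high_vif = row["vif_score"] >= vif_limit
        decision = "Yes" if (high_corr and high_vif) else "No"
        flagged.append({**row, "eliminate": decision})
    return flagged


# Values copied from the col_identify output above.
rows = [
    {"variable": "cyl", "pair": "cty | cyl", "correlation": -0.805771, "vif_score": 2.851176},
    {"variable": "cty", "pair": "cty | cyl", "correlation": -0.805771, "vif_score": 2.851176},
    {"variable": "cty", "pair": "cty | hwy", "correlation": 0.955916, "vif_score": 2.851176},
]

result = flag_for_elimination(rows)
for row in result:
    print(row["variable"], row["pair"], row["eliminate"])
```

Under these assumed limits, every pair passes the correlation check, but the VIF of about 2.85 stays below the VIF limit, so no variable is flagged. That is consistent with the `No` values in the `eliminate` column above.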
