# Example usage

This example demonstrates how to use `collinearity_tool` to identify multicollinearity through correlation matrices, variance inflation factors (VIF), and visualizations.

## Imports

In [1]:
import pandas as pd
import collinearity_tool.collinearity_tool as cl

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
data

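| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |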


## Create correlation matrix

The `corr_matrix` function computes the correlations between all numerical variables in a data frame and returns them in two forms. Index `[0]` gives the long-form table, with one row per pair of variables, and index `[1]` gives the standard square correlation matrix.

In [3]:
cl.corr_matrix(data)[0]

| | variable1 | variable2 | correlation | rounded_corr |
|---|---|---|---|---|
| 0 | sepal_length | sepal_length | 1.0 | 1.0 |
| 1 | sepal_length | sepal_width | -0.11757 | -0.12 |
| 2 | sepal_length | petal_length | 0.871754 | 0.87 |
| 3 | sepal_length | petal_width | 0.817941 | 0.82 |
| 4 | sepal_width | sepal_length | -0.11757 | -0.12 |
| 5 | sepal_width | sepal_width | 1.0 | 1.0 |
| 6 | sepal_width | petal_length | -0.42844 | -0.43 |
| 7 | sepal_width | petal_width | -0.366126 | -0.37 |
| 8 | petal_length | sepal_length | 0.871754 | 0.87 |
| 9 | petal_length | sepal_width | -0.42844 | -0.43 |


In [4]:
cl.corr_matrix(data)[1]

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| sepal_length | 1.0 | -0.11757 | 0.871754 | 0.817941 |
| sepal_width | -0.11757 | 1.0 | -0.42844 | -0.366126 |
| petal_length | 0.871754 | -0.42844 | 1.0 | 0.962865 |
| petal_width | 0.817941 | -0.366126 | 0.962865 | 1.0 |
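
For comparison, both output shapes can be approximated with plain pandas. This is only a minimal sketch on a small made-up data frame, not the implementation of `corr_matrix`; the column names `a`, `b`, `c`, and `label` are placeholders chosen for illustration.

```python
import pandas as pd

# Small made-up frame standing in for a data set with numeric and
# non-numeric columns (all names here are hypothetical).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "label": ["x", "y", "x", "y"],
})

# Wide form: square Pearson correlation matrix of the numeric columns only.
wide = df.select_dtypes("number").corr()

# Long form: one row per (variable1, variable2) pair.
long = (
    wide.rename_axis(index="variable1")
    .reset_index()
    .melt(id_vars="variable1", var_name="variable2", value_name="correlation")
)
long["rounded_corr"] = long["correlation"].round(2)

print(wide)
print(long)
```

The long form is convenient for filtering and plotting, while the wide form is easier to read at a glance.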


## Create correlation heatmap

We can plot a correlation heatmap for a data frame using the `corr_heatmap` function.

In [5]:
cl.corr_heatmap(data)

## Create VIF matrix and bar chart of VIFs

The `vif_bar_plot` function returns a list containing a data frame of Variance Inflation Factors (VIF) and a bar chart of the VIFs for each explanatory variable in a multiple linear regression model.

In [6]:
x = ['petal_length', 'sepal_width']
y = 'sepal_length'
vif = cl.vif_bar_plot(x, y, data, 6)
vif[0]

| | vif_score | explanatory_var |
|---|---|---|
| 0 | 83.033291 | Intercept |
| 1 | 1.224831 | sepal_width |
| 2 | 1.224831 | petal_length |
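
As a sanity check on the two identical scores above: with exactly two explanatory variables, the R² from regressing one on the other equals their squared Pearson correlation, so both share the same VIF of 1 / (1 − r²). The sketch below applies that textbook formula to the `sepal_width` and `petal_length` correlation reported in the correlation matrix. It is not how `vif_bar_plot` computes its scores internally, and the variable names are illustrative.

```python
# Pearson correlation between the two explanatory variables,
# copied from the correlation matrix (sepal_width vs petal_length).
r = -0.42844

# With two explanatory variables, regressing one on the other gives
# R^2 = r^2, so both variables have VIF = 1 / (1 - R^2).
r_squared = r ** 2
vif = 1.0 / (1.0 - r_squared)

print(vif)  # close to the 1.224831 reported for both variables
```

A VIF this close to 1 indicates that the two explanatory variables carry little redundant information about each other.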


In [7]:
vif[1]

## Multicollinearity Identification

The `col_identify` function identifies multicollinearity by finding highly correlated pairs (using the Pearson coefficient) whose VIF values exceed the threshold. Here it reuses `x` and `y` from the previous section, so the response variable is `sepal_length` and the explanatory variables are `petal_length` and `sepal_width`.

In [8]:
cl.col_identify(data, x, y)

| | variable | pair | correlation | rounded_corr | vif_score | eliminate |
|---|---|---|---|---|---|---|
| 0 | petal_length | petal_length \| sepal_length | 0.871754 | 0.87 | 1.224831 | No |
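
The result above shows a strongly correlated pair that is still not marked for elimination because its VIF is low. One way to picture that kind of two-condition rule is the sketch below. The function name `flag_pair` and both threshold values are hypothetical and chosen only for illustration; they are not taken from `col_identify`, whose actual defaults and logic may differ.

```python
# Hypothetical thresholds, chosen only for illustration.
CORR_LIMIT = 0.8
VIF_LIMIT = 4.0


def flag_pair(correlation, vif_score, corr_limit=CORR_LIMIT, vif_limit=VIF_LIMIT):
    """Return "Yes" only when both the correlation and the VIF reach their limits."""
    high_corr = abs(correlation) >= corr_limit
    high_vif = vif_score >= vif_limit
    return "Yes" if (high_corr and high_vif) else "No"


# Values copied from the output row above: high correlation, low VIF.
print(flag_pair(0.871754, 1.224831))  # No
```

Requiring both conditions avoids dropping a variable on the strength of a single pairwise correlation when the regression as a whole shows little variance inflation.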
