VIF (variance inflation factor) measures how much the variance of an estimated regression coefficient is inflated by collinearity among the predictors. It detects multicollinearity that you cannot catch just by eyeballing a pairwise correlation plot, including strong linear relationships among three or more variables.

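It is calculated as the ratio of **the variance of a predictor's coefficient in the full model** to **the variance of that coefficient when the predictor is the only variable in the model**. Equivalently, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all of the other predictors.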
- VIF = 1 : No correlation between that predictor and the other variables
- VIF = 4 : Suspicious, needs to be looked into
- VIF = 5-10 : Look into it or drop the variable.
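
The calculation described above can be sketched with NumPy alone. This is a minimal illustration rather than the implementation used later in this notebook: the helper name `vif_from_r2` and the synthetic columns `x1`, `x2`, `x3` are assumptions made for the example. Each column is regressed on the others (plus an intercept) and the resulting R² is converted into a VIF.

```python
import numpy as np

def vif_from_r2(X):
    """Return the VIF of each column of X using VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression: column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = (y - y.mean()) @ (y - y.mean())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (hypothetical): x3 is nearly a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

vifs = vif_from_r2(X)
print("pairwise corr(x1, x3):", np.corrcoef(x1, x3)[0, 1])
print("VIFs:", vifs)
```

In this setup no single pairwise correlation looks alarming, yet all three VIFs are large, because the dependence only shows up when three variables are considered together.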

Why is multicollinearity an issue with regression? The regression equation is the best-fit line representing the effects of your predictors on the dependent variable, and it does not account for the effects of one predictor on another.
High collinearity between predictors (in the extreme case, a correlation of 1.00) makes the coefficient estimates unstable, because the model cannot separate the predictors' individual effects. It inflates the variance of the coefficients and can hurt accuracy on new data, even when the SSE (sum of squared errors, the quantity you minimise with your regression) looks fine on the data you fitted.
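
To make the effect on the coefficients concrete, here is a small sketch under stated assumptions: the data are synthetic, and the helper `coef_variance_factor` is a name invented for this example. For ordinary least squares with an intercept, the variance of each slope is the noise variance times the corresponding diagonal entry of the inverse of X'X computed on centred predictors, so comparing those entries shows how much collinearity inflates the variance.

```python
import numpy as np

def coef_variance_factor(X):
    """Diagonal of (Xc'Xc)^-1 for centred X: slope variance per unit of noise variance."""
    Xc = X - X.mean(axis=0)
    return np.diag(np.linalg.inv(Xc.T @ Xc))

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)

# Same x1 in both designs; only the second predictor changes.
independent = np.column_stack([x1, rng.normal(size=n)])
collinear = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])

ratio = coef_variance_factor(collinear)[0] / coef_variance_factor(independent)[0]
print(f"variance of the x1 coefficient grows by a factor of about {ratio:.0f}")
```

The response variable never enters this calculation: the inflation is a property of the predictors alone, which is why VIF can be computed before fitting the model of interest.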

While variables in a dataset are usually correlated to a small degree, highly collinear variables can be redundant in the sense that we only need to retain one of the features to give our model the necessary information.
Removing collinear features is a method to reduce model complexity by decreasing the number of features and can help to increase model generalization. 
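
As a rough illustration of keeping only one feature from a redundant group, the sketch below filters on pairwise correlation. The function `drop_redundant`, the `cutoff` of 0.95 and the column names are all assumptions for the example. Note that a pairwise filter like this cannot see dependencies that involve three or more variables, which is the case VIF is meant to cover.

```python
import numpy as np

def drop_redundant(X, names, cutoff=0.95):
    """Keep a column only if its |correlation| with every already-kept column is <= cutoff."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= cutoff for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Synthetic data (hypothetical): a_copy is a near-duplicate of a.
rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
a_copy = a + 0.01 * rng.normal(size=300)
X = np.column_stack([a, b, a_copy])

X_reduced, kept_names = drop_redundant(X, ["a", "b", "a_copy"])
print(kept_names)
```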



In [1]:
import sys
import numpy as np
import pandas as pd
import statsmodels.api as smapi
import os
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
class VIF_tester:
    
    """
    USAGE:
    This class is called on large .dataforML files to quickly evaluate a 
    'quick and dirty' estimate of the variance inflation factor between predictors.
    Its output is a cleaned up and reconstituted .dataforML file.
    
    ATTRIBUTES:
    1):filepaths: --> List of filepaths corresponding to your data. 
    Every VIF_tester object and method should know which files are undergoing analysis.
    2):
    
    CONSTRUCTOR VARIABLES:
    1):X: --> .dataforML file (an error is raised if other file types are passed).
    2):threshold: --> VIF threshold (defaults to 5.0).
    
    METHODS:
    1): dt_chk(filepaths) --> Responsible for evaluating whether each passed file is appropriate for VIF analysis, 
    and separates it into testable pieces. Currently does not separate the file via bootleg-style procedure, 
    but this may change depending on kernel memory availability. More detail below.
    2): vif_check(chunked_list, threshold) --> Conducts VIF comparison on list of DataFrames returned by dt_chk(). More detail below.
    """

    def __init__(self, X, threshold=5.0):
        self.X = X
        self.threshold = threshold
    
    @staticmethod
    def dt_chk(filepaths):
        
        """
        FUNCTION:
        This method reads in each file, determines whether the file is of the .dataforML type, drops any non-numeric columns and 
        ensures that remaining data is numerical, and then parses through the resulting DataFrame in order to split it into
        so-called "chunks." It returns a list of "chunked" DataFrame objects.
        
        Currently, this method arbitrarily picks a number of SNP covariates to chunk by, but could be extended to chunk
        based on the number of available cores for a job sent to an HPC center.

        INPUTS :
        :filepaths: --> A list of paths to .dataforML files large enough to warrant evaluation in chunks.

        RETURNS
        :df_snp_list: --> A list including a series of DataFrame objects.
        """

        for X in filepaths:
            # Split the extension from the path and normalise it to lowercase.
            ext = os.path.splitext(X)[-1].lower()
            # Now we can simply use == to check for equality, no need for wildcards.
            if ext == ".dataforml":
                
                print("""
                      This is a dataForML file! 
                      Checking data for proper types ...
                      """)
                df = pd.read_csv(X, sep="\t")
                df.columns = df.columns.str.strip() # Strip erroneous white space.
                df = df.dropna() # dropna() returns a new frame, so reassign it.
                df = df.select_dtypes(include="number") # Drop non-numeric cols.
                
                # Ensure that data is of an appropriate type:
                numeric_cols = df.select_dtypes(include="number").shape[1]
                total_cols = df.shape[1]
                try:
                    if numeric_cols != total_cols:
                        raise ValueError("All columns in the input need to be numerical for a multicollinearity test.")
                    else:
                        print("""
                                This dataframe is acceptable for multicollinearity VIF analysis. 
                                Proceeding to dataframe conversion & fragmentation phase ...
                                """)
                        
                except ValueError as error:
                    print("Error: ", error)
                    
                df_snp_list = []
                cov_uni = df.iloc[:,1:5]
                pheno_df = df.iloc[:,0:1]
                cov_snps = df.iloc[:,5:]
                snp_count = cov_snps.shape[1]  # Number of SNP columns (len() would count rows).
                chunk_size = 50
                    
                # Split the SNP columns into chunks of at most chunk_size columns.
                # chunk_size determines the size of the DataFrames passed to vif_check().
                for num_chunks, start in enumerate(range(0, snp_count, chunk_size), start=1):
                    stop = min(start + chunk_size, snp_count)
                    df_snp_list.append(cov_snps.iloc[:,start:stop])
                    print('\n'
                            'SNP elements added to chunk #', num_chunks,
                            'range from', start, 
                            'to', stop
                         )
                    print('\n'
                        'Chunked DataFrame #', num_chunks,
                        'has been added to the list and is ready to be passed to VIF calculator!'
                         )

                # NOTE on the covariate columns that would not append: DataFrame.append() adds
                # rows and returns a new object (it was removed in pandas 2.0), so calling it
                # in a loop never modified the chunks. To attach the columns, rebuild the list:
                    #df_snp_list = [pd.concat([chunk, cov_uni], axis=1) for chunk in df_snp_list]

                # Returns after the first valid file, as the original loop did.
                return df_snp_list
               
            else:
                print("this is an invalid file format. Aborting.")

In [None]:
    @staticmethod
    def vif_check(chunked_list, threshold=5.0):
        
        """
        FUNCTION
        Checks for multicollinearity between features in a given chunk of the original DataFrame and 
        removes features with a VIF greater than a specified threshold. Recombines surviving features,
        calls .dt_chk to rechunk them, then recursively calls itself on .dt_chk's output until
        no multicollinearity in any tested chunk is found.

        INPUTS 
        :chunked_list: --> A list of chunked DataFrame objects built by .dt_chk
        :threshold: --> The collinearity threshold at which elements are removed (DEFAULT = 5.0)

        RETURNS
        --> A list of DataFrames with as many non-multicollinear elements from the original infile as possible.
        """
        pruned_dataframes = []
        
        for chunk in chunked_list:
            
            # Create a list of indices corresponding to each column in a given chunk.
            variables = list(range(chunk.shape[1]))
            print(variables)
            

In [4]:
    test_obj = self.vif_check(self.dt_chk)

NameError: name 'self' is not defined

In [9]:
            
            dropped = True
                
            print("""
            \n
            The VIF calculation will now iterate through the features of each chunk and calculate 
            their respective values. A feature's VIF is the ratio of the variance of its coefficient in the full model 
            to the variance of that coefficient when the feature is the only variable in the model.
            This function shall drop all features exceeding the specified threshold, and combine the remaining
            features into a new DataFrame object.
            \n
                """)
                
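            while dropped:
                dropped = False
                # VIF of every remaining column in the current chunk.
                # Note: variance_inflation_factor does not add an intercept column itself.
                vif = [variance_inflation_factor(chunk.iloc[:, variables].values, var) for var in variables]
                #print("\n\n The VIF is: ", vif)
                max_loc = vif.index(max(vif))
                
                if max(vif) > threshold:
                    print("Dropping \"" + chunk.columns[max_loc] + "\" at index: " + str(max_loc))
                    # Reassign instead of dropping in place: the chunk is a slice of the parent frame.
                    chunk = chunk.drop(columns=chunk.columns[max_loc])
                    variables = list(range(chunk.shape[1]))
                    dropped = True
                    
                else:
                    pruned_dataframes.append(chunk)
                
        # Return only after every chunk has been processed.
        return pruned_dataframes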
        
# dt_chk iterates over its argument, so pass a list of paths rather than a bare string.
X = VIF_tester.dt_chk(['discrete.DATAFORML'])
vif_df_list = VIF_tester.vif_check(X)
for element in vif_df_list:
    print(element)