# AUTO-DROP HIGHLY CORRELATED COLUMNS - GANESH RAM GURURAJAN

**Explanation**:

Steps:
1. Pass the data frame into the function.
2. Compute the corr() data frame using the **Pearson method**.
3. Keep only the entries that satisfy **corr >= 0.85**; every other entry becomes np.nan.
4. **Set the diagonal to np.nan, because the diagonal of corr() is always 1.0.**
5. **Remove all rows and columns that are entirely np.nan.**
6. If corr() now has shape (0,0), there are no highly correlated columns.
7. Otherwise, while the shape of corr() is not (0,0), find the column holding the highest correlation value in the whole matrix, remove both its row and its column, and again remove all rows and columns that are entirely np.nan. Each pass shrinks the correlation matrix.
8. Finally, remove the recorded columns from the original data frame and return the result.
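
Steps 2 to 5 can be tried on a tiny frame before reading the full function. The sketch below is only an illustration: the frame `toy` and its columns `a`, `b`, `c` are made-up data, and the diagonal is masked with `DataFrame.mask` rather than with the loop used in the function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is an exact multiple of "a", "c" is unrelated to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr(method="pearson")                  # step 2: Pearson correlation matrix
corr = corr[corr >= 0.85]                          # step 3: values below 0.85 become NaN
corr = corr.mask(np.eye(len(corr), dtype=bool))    # step 4: set the diagonal to NaN
corr = corr.dropna(axis=1, how="all")              # step 5: drop all-NaN columns ...
corr = corr.dropna(axis=0, how="all")              # ... and all-NaN rows

print(corr)   # only the rows/columns of "a" and "b" survive
```

Because `c` has no correlation of 0.85 or more with any other column, its row and column disappear in step 5, and only the `a`/`b` pair is left for the removal loop.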

### Module Imports - Problem - 3

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns                                                           # FOR VISUALIZATION OF HEATMAP                               


## **THE FUNCTION** - PROBLEM 3

In [None]:
def dropHighlyCorrelatedColumns(df):

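  '''
  This method greedily removes columns whose pairwise Pearson correlation is >= 0.85.
  Only positive correlations are checked, and the greedy order does not guarantee
  that the smallest possible set of columns is removed.
  '''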


  # INITIAL FEW STEPS
  corr = df.corr(method='pearson')                                              # Pearson correlation matrix of the columns
  corr = corr[(corr >= 0.85)]                                                   # Keep values >= 0.85; everything else becomes NaN


  for column in corr.columns:                                                   # Set the diagonal to NaN, as it is always 1.0
    corr.loc[column, column] = np.nan                                           # Single .loc call; chained indexing may write to a copy


  corr.dropna(axis=1,how='all',inplace=True)                                    # Drop all columns that are entirely NaN
  corr.dropna(axis=0,how='all',inplace=True)                                    # Drop all rows that are entirely NaN




  ###################### THIS IS THE IMPORTANT PART ######################
  if corr.shape!=(0,0):                                                         # At least one highly correlated pair exists

    removed_cols = []                                                           # Names of the columns to remove from the original dataframe

    while corr.shape != (0,0):                                                  # Repeat until the correlation DF is empty
      corr_dict = {}                                                            # Maps each column's highest correlation to the column name;
      for column in corr.columns:                                               # on ties, the later column overwrites the earlier one
        corr_dict[corr[column].max()] = column
      try:
        val = max(corr_dict)                                                    # Highest correlation left in the matrix
        corr.drop(corr_dict[val],inplace=True)                                  # Drop that column's row ...
        corr.drop(corr_dict[val],axis=1,inplace=True)                           # ... and the column itself
        corr.dropna(axis=1,how='all',inplace=True)                              # Drop columns that are now entirely NaN
        corr.dropna(axis=0,how='all',inplace=True)                              # Drop rows that are now entirely NaN
        removed_cols.append(corr_dict[val])
      except ValueError:                                                        # max() of an empty corr_dict: nothing left to remove
        break

    df = df.drop(removed_cols,axis=1)                                           # Drop from a copy so the caller's original DF stays intact

    print("\nRemoved Columns are {}".format(removed_cols))                      # Print the removed columns

  else:
    print('There are no highly correlated columns')                             # Nothing to remove when every correlation is below 0.85
    

  return df                                                                     # In any case return DF

## - - - - - TEST YOUR DATA HERE - - - - -  PROBLEM 3

In [None]:
############### CHANGE THE NEXT LINE TO LOAD YOUR DATA ###############
# df = pd.read_csv('health.csv')


################ DO NOT CHANGE THIS ################
# This is the resultant DATA FRAME
new_df = dropHighlyCorrelatedColumns(df)


Removed Columns are ['per_capita_exp_PPP_2016', 'Specialist_surgical_per_1000_2008-18']
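
If `health.csv` is not at hand, a small synthetic frame can stand in for it. The sketch below is an assumption-laden illustration: the name `demo_df` and the columns `x`, `x_scaled` and `noise` are hypothetical, chosen so that exactly one pair is strongly correlated by construction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                     # fixed seed so the frame is reproducible
base = rng.normal(size=200)

# Hypothetical stand-in data: "x_scaled" is "x" times 3 plus a little noise,
# while "noise" is drawn independently of both.
demo_df = pd.DataFrame({
    "x": base,
    "x_scaled": 3.0 * base + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

demo_corr = demo_df.corr(method="pearson")
print(demo_corr.round(2))                          # inspect which pairs are close to 1.0
```

Passing `demo_df` to `dropHighlyCorrelatedColumns` should then report one of `x` / `x_scaled` as removed and leave `noise` alone.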


## VISUALIZE HERE
#### FIRST CELL IS THE **HEAT MAP OF ORIGINAL DATA**
#### SECOND CELL IS THE **HEAT MAP OF NEW DATA**
#### THIRD CELL IS TO VIEW **NEW_DATA.corr() >= 0.85**

In [None]:
######################### THIS IS HEAT MAP OF CORR() OF ORIGINAL DATAFRAME ##############################
# VISUALIZE HERE
# UNCOMMENT THE BELOW LINE

# sns.heatmap(df.corr())

In [None]:
######################### THIS IS HEAT MAP OF CORR() OF NEW DATAFRAME ##############################
# VISUALIZE HERE
# UNCOMMENT THE BELOW LINE

# sns.heatmap(new_df.corr())

In [None]:
###################### VIEW NEW_DF.CORR() >= 0.85 HERE ######################
# UNCOMMENT THE BELOW LINE

# new_df.corr() >= 0.85
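
The boolean matrix from the cell above always shows True on the diagonal, so it has to be read by eye. A numeric check can be easier to trust. The helper below is a hypothetical addition, not part of the original notebook: `count_high_pairs` and the `demo` frame are illustrative names, and the helper counts each column pair once by looking only above the diagonal.

```python
import numpy as np
import pandas as pd

def count_high_pairs(frame, threshold=0.85):
    """Count distinct column pairs whose Pearson correlation is >= threshold."""
    corr = frame.corr(method="pearson").to_numpy()
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # strictly above the diagonal
    return int((corr[upper] >= threshold).sum())            # NaN entries compare as False

# Hypothetical toy data: "b" is an exact multiple of "a", "c" is unrelated to both.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

print(count_high_pairs(demo))                        # pairs at or above the threshold before dropping
print(count_high_pairs(demo.drop(columns=["b"])))    # pairs left after dropping "b"
```

Calling `count_high_pairs(new_df)` on the returned frame should give 0 if the function removed every pair at or above the threshold.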