## <span style="color:green"> Categorical Feature Selection using Chi-Square Test - Part 1</span>
    
 - YouTube Video Explanation : https://youtu.be/JJO0WB8bGJM

### Chi-Square Test using <span style="color:red">scipy</span> <span style="color:green">Library</span> <span style="color:red">chi2_contingency</span> <span style="color:green">function</span>

- It is best suited to categorical features and a categorical target (a binary target in this notebook); the values it operates on must be non-negative,
  typically booleans, frequencies, or counts.
- It compares the observed distribution of each feature against the target variable with the distribution we would expect if the two were unrelated.

**Categorical Feature Selection via Chi-Square Test of Independence**

- In everyday data science work, we often encounter categorical features. They can be confusing to handle, especially when we want to build a prediction model, because most models are essentially equations that accept numbers, not categories.
- One way is to encode every categorical variable using the OneHotEncoding method (each class becomes a column of 0s and 1s, where 0 means absent and 1 means present).
- Many prefer this method because the information is preserved and the concept is easy to understand. The downside is that when we have many categorical features with high cardinality, the number of features after OneHotEncoding becomes massive.
- Why do we not want a lot of features in our training dataset? Because of the **curse of dimensionality.**
- Adding features can decrease the error of our prediction model, but only up to a certain number of features; after that, the error increases again. This is the concept of the curse of dimensionality.
- There are many ways to alleviate this problem; here we will use **feature selection via the Chi-Square test of independence.**
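
To make the cardinality problem concrete, here is a minimal sketch on a small made-up DataFrame (the column names `city` and `plan` and their values are hypothetical, not taken from the loan data). It only illustrates that one-hot encoding produces one column per distinct class, so the width of the dataset grows with the total cardinality of the encoded features.

```python
import pandas as pd

# Hypothetical toy data: 'city' has 4 classes, 'plan' has 2 classes.
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D", "A", "B"],
    "plan": ["free", "paid", "free", "paid", "free", "free"],
})

encoded = pd.get_dummies(toy)

# One output column per distinct class across the encoded features.
n_classes = int(toy.nunique().sum())
print(toy.shape[1], "columns before encoding")
print(encoded.shape[1], "columns after encoding")
```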

- The Chi-Square test of independence is used to determine if there is a significant relationship between two categorical (nominal) variables. 
- The Chi-Square Test of Independence is therefore a hypothesis test with two hypotheses: the Null Hypothesis and the Alternative Hypothesis.
- The hypotheses are written below.
    - Null Hypothesis (H0): There is no relationship between the variables
    - Alternative Hypothesis (H1): There is a relationship between variables
    
- Just like any statistical test:
    - Choose a significance level before testing, e.g. 0.05 (which corresponds to a 95% confidence level).
    - If the p-value from the test is greater than 0.05, the result lies in the acceptance region and we fail to reject the null hypothesis (the code in this notebook labels this outcome "Accept the null Hypothesis").
    - If the p-value from the test is less than 0.05, the result lies in the rejection (critical) region, so we reject the null hypothesis in favour of the alternative hypothesis.
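
The decision rule can be sketched in a few lines. The helper name `decide` and the 2x2 counts are hypothetical and chosen only for illustration; `chi2_contingency` is the same SciPy function used throughout this notebook.

```python
from scipy.stats import chi2_contingency

def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Hypothetical 2x2 table of counts (rows: feature classes, columns: target classes).
observed = [[30, 10],
            [10, 30]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print("p-value:", p_value)
print("decision:", decide(p_value))
```

Note that the significance level is fixed before looking at the data; the p-value is what the test returns.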
    

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math

In [2]:
#Load the dataset #https://www.kaggle.com/burak3ergun/loan-data-set
df_loan = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Loan_Dataset/loan_data_set.csv")

In [3]:
df_loan.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [4]:
df_loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [5]:
# Remove every row that contains a null value
df_loan.dropna(inplace=True)
# Drop the uninformative identifier column ("Loan_ID")
df_loan.drop(labels=["Loan_ID"],axis=1,inplace=True)
df_loan.reset_index(drop=True,inplace=True)

In [6]:
df_loan["Credit_History"]=df_loan["Credit_History"].apply(lambda x: "N" if x == 0 else "Y")

In [7]:
df_loan.head()

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | Y | Rural | N |
| 1 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | Y | Urban | Y |
| 2 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | Y | Urban | Y |
| 3 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | Y | Urban | Y |
| 4 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | Y | Urban | Y |


In [8]:
df_loan1 = df_loan.copy()

In [9]:
from scipy import stats
from scipy.stats import chi2_contingency

In the Chi-Square test, we display the data in a cross-tabulation (contingency) format with each row representing a level (group) for one variable and each column representing a level (group) for another variable.

In [10]:
# create a cross-tabulation table(contingency table) between Gender and Loan_Status columns.
# contingency table: It is basically a tally of counts between two or more categorical variables. 
pd.crosstab(df_loan["Gender"],df_loan["Loan_Status"])

| Gender | Loan_Status = N | Loan_Status = Y |
|---|---|---|
| Female | 32 | 54 |
| Male | 116 | 278 |


In [11]:
# Let's use the chi-square test of independence to test the relationship between the 2 features
chi_rel = chi2_contingency(pd.crosstab(df_loan["Gender"],df_loan["Loan_Status"]))

In [13]:
chi_rel

(1.6495637942018448,
 0.1990183114281211,
 1,
 array([[ 26.51666667,  59.48333333],
        [121.48333333, 272.51666667]]))

In [12]:
print("Chi2 Statistics: ",chi_rel[0])
print("P_value: ",chi_rel[1])

Chi2 Statistics:  1.6495637942018448
P_value:  0.1990183114281211


As we can see, the p-value of the test (0.199) is greater than the chosen significance level (0.05), so the result lies inside the acceptance region and we fail to reject the null hypothesis.
**This means the Chi-Square test of independence found no significant relationship between the Gender and Loan_Status features.**
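
The numbers above can be reproduced by hand from the Gender crosstab counts. The sketch below assumes Yates' continuity correction, which `chi2_contingency` applies by default (`correction=True`) when the table has one degree of freedom, as a 2×2 table does. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts copied from the Gender vs. Loan_Status crosstab above.
observed = np.array([[32, 54],
                     [116, 278]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if Gender and Loan_Status were independent.
expected = row_totals * col_totals / total

# Yates' continuity correction: shrink each |observed - expected| by 0.5
# (valid here because every difference is larger than 0.5).
diff = np.abs(observed - expected) - 0.5
chi2_stat = (diff ** 2 / expected).sum()

dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi2_stat, dof)

print("Expected counts:\n", expected)
print("Chi2 statistic:", chi2_stat)
print("P-value:", p_value)
```

Each expected count is simply (row total × column total) / grand total, for example 86 × 148 / 480 ≈ 26.52 for the Female/N cell.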

### Let's apply this test to all the categorical features present

In [13]:
df_loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             480 non-null    object 
 1   Married            480 non-null    object 
 2   Dependents         480 non-null    object 
 3   Education          480 non-null    object 
 4   Self_Employed      480 non-null    object 
 5   ApplicantIncome    480 non-null    int64  
 6   CoapplicantIncome  480 non-null    float64
 7   LoanAmount         480 non-null    float64
 8   Loan_Amount_Term   480 non-null    float64
 9   Credit_History     480 non-null    object 
 10  Property_Area      480 non-null    object 
 11  Loan_Status        480 non-null    object 
dtypes: float64(3), int64(1), object(8)
memory usage: 45.1+ KB


In [14]:
cat_cols = df_loan.select_dtypes(include= "object").columns
cat_cols

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [15]:
cat_col = df_loan.select_dtypes(include= "object").drop('Loan_Status', axis = 1).columns
cat_col

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Credit_History', 'Property_Area'],
      dtype='object')

In [16]:
chi_result = []
for i in cat_col:
    if chi2_contingency(pd.crosstab(df_loan[i],df_loan["Loan_Status"]))[1] < 0.05:
        chi_result.append("Reject the null Hypothesis")
    else :
        chi_result.append("Accept the null Hypothesis")

In [17]:
chi_result

['Accept the null Hypothesis',
 'Reject the null Hypothesis',
 'Accept the null Hypothesis',
 'Accept the null Hypothesis',
 'Accept the null Hypothesis',
 'Reject the null Hypothesis',
 'Reject the null Hypothesis']

In [18]:
result_chi = pd.DataFrame(data = [cat_col, chi_result]).T
result_chi.columns = ['Cat_Features', 'Hypothesis']
print(result_chi)

     Cat_Features                  Hypothesis
0          Gender  Accept the null Hypothesis
1         Married  Reject the null Hypothesis
2      Dependents  Accept the null Hypothesis
3       Education  Accept the null Hypothesis
4   Self_Employed  Accept the null Hypothesis
5  Credit_History  Reject the null Hypothesis
6   Property_Area  Reject the null Hypothesis
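
The table above records only the decision. A small variation, sketched here with a hypothetical helper named `chi2_screen` and made-up toy data, also keeps the statistic and the p-value so that features can be ranked by strength of evidence. It is one possible arrangement, not the only way to do it.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_screen(df, features, target, alpha=0.05):
    """Run a chi-square test of independence for each feature against the target."""
    rows = []
    for col in features:
        table = pd.crosstab(df[col], df[target])
        stat, p_value, dof, _ = chi2_contingency(table)
        rows.append({"feature": col, "chi2": stat, "dof": dof,
                     "p_value": p_value, "reject_H0": p_value < alpha})
    # Smallest p-value (strongest evidence against independence) first.
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

# Hypothetical toy data, used only to show the call.
toy = pd.DataFrame({
    "related":   ["a"] * 20 + ["b"] * 20,
    "unrelated": ["x", "y"] * 20,
    "target":    ["Y"] * 18 + ["N"] * 2 + ["Y"] * 2 + ["N"] * 18,
})
report = chi2_screen(toy, ["related", "unrelated"], "target")
print(report)
```

On this notebook's data the equivalent call would be `chi2_screen(df_loan, cat_col, "Loan_Status")`.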


### If a feature has more than two classes

In [19]:
df_loan.Property_Area.unique()

array(['Rural', 'Urban', 'Semiurban'], dtype=object)

In [20]:
pd.crosstab(df_loan["Property_Area"],df_loan["Loan_Status"])

| Property_Area | Loan_Status = N | Loan_Status = Y |
|---|---|---|
| Rural | 54 | 85 |
| Semiurban | 42 | 149 |
| Urban | 52 | 98 |


In [21]:
chi_rel1 = chi2_contingency(pd.crosstab(df_loan["Property_Area"],df_loan["Loan_Status"]))
chi_rel1

(12.2259455519901,
 0.0022139594148752133,
 2,
 array([[ 42.85833333,  96.14166667],
        [ 58.89166667, 132.10833333],
        [ 46.25      , 103.75      ]]))

In [22]:
print("Chi2 Statistics: ",chi_rel1[0])
print("P_value: ",chi_rel1[1])

Chi2 Statistics:  12.2259455519901
P_value:  0.0022139594148752133
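
The third value in the result above, 2, is the degrees of freedom. For a table with r rows and c columns it is (r - 1) × (c - 1), so a 3×2 table gives 2. The sketch below rebuilds the test from the crosstab counts and checks both the degrees of freedom and the p-value by hand; the variable names are illustrative. Yates' correction does not apply here because the table has more than one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the Property_Area vs. Loan_Status crosstab above.
observed = np.array([[54, 85],     # Rural
                     [42, 149],    # Semiurban
                     [52, 98]])    # Urban

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
n_rows, n_cols = observed.shape
dof_by_hand = (n_rows - 1) * (n_cols - 1)

# For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_by_hand = np.exp(-chi2_stat / 2)

print("dof:", dof, "by hand:", dof_by_hand)
print("p-value:", p_value, "by hand:", p_by_hand)
```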


### If a feature has more than two classes, so that the Chi-square table is larger than 2×2, the test alone does not tell us which class of the feature is responsible for the relationship.

In [23]:
# Let's apply one-hot encoding to Property_Area
Property_Area_dummy = pd.get_dummies(data=df_loan[["Property_Area","Loan_Status"]],columns=["Property_Area"])

In [24]:
Property_Area_dummy.head()

| | Loan_Status | Property_Area_Rural | Property_Area_Semiurban | Property_Area_Urban |
|---|---|---|---|---|
| 0 | N | 1 | 0 | 0 |
| 1 | Y | 0 | 0 | 1 |
| 2 | Y | 0 | 0 | 1 |
| 3 | Y | 0 | 0 | 1 |
| 4 | Y | 0 | 0 | 1 |


In [25]:
# Now we will create the cross-tab table for each of the Property_Area class against the target Loan_Status.
chi_rel2 = pd.crosstab(Property_Area_dummy["Property_Area_Rural"],Property_Area_dummy["Loan_Status"])

In [26]:
chi_rel2= chi2_contingency(pd.crosstab(Property_Area_dummy["Property_Area_Rural"],Property_Area_dummy["Loan_Status"]))
print("Chi2 Statistics: ",chi_rel2[0])
print("P_value: ",chi_rel2[1])

Chi2 Statistics:  5.377420967206426
P_value:  0.02039901197579578


Note :
- Testing each class of a feature separately means that the chance of a false positive (Type I error) compounds with every additional test.
- For example, a single test at a significance level of 0.05 carries a 5% chance of a false positive. With several tests, the chance that at least one of them is a false positive grows by up to 5 percentage points per test. In our case above, Property_Area gives 3 comparisons, so the overall error rate can reach about 14% (1 - 0.95^3 ≈ 0.143 if the tests are independent) and is at most 15%. In effect we would be testing at a significance level of roughly 0.15, which is quite high.
- In this case, we can use the Bonferroni correction to adjust the significance level by the number of comparisons we want to make. The formula is α/N, where α = the significance level of the original test and N = the number of planned comparisons. For example, Property_Area has 3 classes, which means 3 comparisons if we test every class against the Loan_Status feature. Our adjusted significance level would be 0.05/3 ≈ 0.0167.
- Using the adjusted significance level, we can re-test all the previously significant results to see which classes are responsible for the significant relationship.
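
The arithmetic in this note can be checked directly. The sketch below assumes the three tests are independent when computing the exact family-wise error rate; the variable names are illustrative.

```python
alpha = 0.05          # per-test significance level
n_tests = 3           # one test per Property_Area class

# Family-wise error rate if the tests were independent: 1 - (1 - alpha)^m.
fwer = 1 - (1 - alpha) ** n_tests

# Simple upper bound (union bound): m * alpha.
upper_bound = n_tests * alpha

# Bonferroni: test each comparison at alpha / m instead.
bonferroni_alpha = alpha / n_tests
fwer_corrected = 1 - (1 - bonferroni_alpha) ** n_tests

print(f"uncorrected FWER: {fwer:.4f} (bound {upper_bound:.2f})")
print(f"Bonferroni level: {bonferroni_alpha:.4f}")
print(f"corrected FWER:   {fwer_corrected:.4f}")
```

With the Bonferroni level, the overall chance of at least one false positive stays just under the original 0.05.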

In [27]:
result_chi

| | Cat_Features | Hypothesis |
|---|---|---|
| 0 | Gender | Accept the null Hypothesis |
| 1 | Married | Reject the null Hypothesis |
| 2 | Dependents | Accept the null Hypothesis |
| 3 | Education | Accept the null Hypothesis |
| 4 | Self_Employed | Accept the null Hypothesis |
| 5 | Credit_History | Reject the null Hypothesis |
| 6 | Property_Area | Reject the null Hypothesis |


In [28]:
df_loan["Gender"].nunique()

2

In [29]:
# Test every class of each categorical feature against Loan_Status,
# using a Bonferroni-adjusted threshold (0.05 / number of classes in the feature)
chi_result1 = {}
for i in result_chi["Cat_Features"]:
    dummy = pd.get_dummies(df_loan[i])
    ad_p_value = 0.05/df_loan[i].nunique()
    for j in dummy:
        if chi2_contingency(pd.crosstab(df_loan["Loan_Status"],dummy[j]))[1]<ad_p_value:
            chi_result1["{}-{}".format(i,j)] = "Reject the null Hypothesis"
        else :
            chi_result1["{}-{}".format(i,j)] = "Accept the null Hypothesis"

In [30]:
chi_result1

{'Gender-Female': 'Accept the null Hypothesis',
 'Gender-Male': 'Accept the null Hypothesis',
 'Married-No': 'Reject the null Hypothesis',
 'Married-Yes': 'Reject the null Hypothesis',
 'Dependents-0': 'Accept the null Hypothesis',
 'Dependents-1': 'Accept the null Hypothesis',
 'Dependents-2': 'Accept the null Hypothesis',
 'Dependents-3+': 'Accept the null Hypothesis',
 'Education-Graduate': 'Accept the null Hypothesis',
 'Education-Not Graduate': 'Accept the null Hypothesis',
 'Self_Employed-No': 'Accept the null Hypothesis',
 'Self_Employed-Yes': 'Accept the null Hypothesis',
 'Credit_History-N': 'Reject the null Hypothesis',
 'Credit_History-Y': 'Reject the null Hypothesis',
 'Property_Area-Rural': 'Accept the null Hypothesis',
 'Property_Area-Semiurban': 'Reject the null Hypothesis',
 'Property_Area-Urban': 'Accept the null Hypothesis'}
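
The Property_Area rows of this result can be reproduced without the DataFrame, using only the counts from the Property_Area crosstab shown earlier. The sketch below builds a one-vs-rest 2×2 table for each class and compares its p-value with the Bonferroni-adjusted level of 0.05/3; the dictionary and variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the Property_Area vs. Loan_Status crosstab (columns: N, Y).
counts = {"Rural": [54, 85], "Semiurban": [42, 149], "Urban": [52, 98]}
totals = np.sum(list(counts.values()), axis=0)

adjusted_alpha = 0.05 / len(counts)   # Bonferroni: 0.05 / 3

decisions = {}
for area, row in counts.items():
    rest = totals - np.array(row)     # all other classes pooled together
    table = np.array([row, rest])     # 2x2 one-vs-rest table
    p_value = chi2_contingency(table)[1]
    decisions[area] = (p_value, p_value < adjusted_alpha)
    print(f"{area}: p = {p_value:.4f}, reject H0: {p_value < adjusted_alpha}")
```

Rural is the instructive case: its p-value of about 0.0204 is below 0.05 but above 0.0167, so it is significant on its own yet not after the correction. Only Semiurban remains significant.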