# Pearson’s Chi-Squared test
## Used to test for a relationship between two categorical variables

H0: No statistically significant association between the two variables (they are independent)

H1: An association exists (the variables are dependent)

    
    If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent.
    If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent.
    
    If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
    If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.
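
The two rules above are two ways of stating the same decision. As a minimal sketch, assume 2 degrees of freedom, where the chi-squared distribution has a simple closed form and only the standard library is needed. The statistic 7.880 is the Churn & Gender value reported later in this notebook; the helper names `chi2_sf_dof2` and `chi2_critical_dof2` are illustrative and not part of the notebook.

```python
import math


def chi2_sf_dof2(statistic):
    """P(X >= statistic) for a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


def chi2_critical_dof2(alpha):
    """Value c with P(X >= c) = alpha, for 2 degrees of freedom."""
    return -2.0 * math.log(alpha)


alpha = 0.05
statistic = 7.880  # Churn & Gender statistic reported in this notebook

critical = chi2_critical_dof2(alpha)
p_value = chi2_sf_dof2(statistic)

# Rule 1: compare the statistic with the critical value
reject_by_statistic = statistic >= critical
# Rule 2: compare the p-value with alpha
reject_by_p_value = p_value <= alpha

print("critical value: %.3f" % critical)
print("p-value: %.3f" % p_value)
print("rules agree:", reject_by_statistic == reject_by_p_value)
```

For other degrees of freedom there is no such shortcut; `stats.chi2.ppf` and `stats.chi2.sf` from scipy are the general equivalents of the two helpers.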


## Load dataset

In [3]:
#churn_clean_altered.csv
docId= "1-WjyGAwXhgkEMSGk1PHKMjIASkVgn-YO"
googleDriveFile = "https://docs.google.com/uc?id="+docId+"&export=download"

# import into data frame
import pandas as pd
df = pd.read_csv(googleDriveFile, index_col=0)

# Create subset with categorical data only
## categorical_columns & categorical_df

In [40]:
categorical_columns = [
                  "Churn",
                 "State",
                 "Area",
                 "Marital",
                 "Gender",
                 "Techie",
                 "InternetService",
                 "Multiple",
                 "OnlineBackup",
                 "DeviceProtection",
                 "StreamingTV",
                 "StreamingMovies",
                 "Port_modem",
                 "Tablet",
                 "OnlineSecurity",
                 "TechSupport",
                 "Contract",
                 "PaperlessBilling",
                 "PaymentMethod",
                 "Item1",
                 "Item2",
                 "Item3"
                 ]
#Select columns
categorical_df = df.loc[:, categorical_columns]

# Cross tabulation and Chi-square Tests

In [33]:
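# Degrees of freedom of a contingency table: (row categories - 1) * (column categories - 1)
# (len(categorical_df) would count observations, not categories)
table = pd.crosstab(index=categorical_df['Churn'], columns=categorical_df['Techie'])
rows, columns = table.shape
degrees_of_freedom_math = (rows - 1) * (columns - 1)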

In [32]:
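from scipy import stats  # needed for chi2_contingency below

# Check whether feature order matters (transposing the table only transposes the expected counts)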
crossTab = pd.crosstab(index=categorical_df['Churn'], columns=categorical_df['Techie'])
crossTab1 = pd.crosstab(index=categorical_df['Techie'], columns=categorical_df['Churn'])
print(stats.chi2_contingency(crossTab))
print(stats.chi2_contingency(crossTab1))

(44.11479393861451, 3.096716355509661e-11, 1, array([[6115.935, 1234.065],
       [2205.065,  444.935]]))
(44.11479393861451, 3.096716355509661e-11, 1, array([[6115.935, 2205.065],
       [1234.065,  444.935]]))
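
Both orientations give the same statistic and p-value; only the expected-count array is transposed. The sketch below reproduces the calculation by hand from the Churn by Techie counts shown later in this notebook, using only the standard library. It assumes Yates' continuity correction, which `chi2_contingency` applies by default when there is 1 degree of freedom. The function name `chi2_yates_2x2` is illustrative.

```python
import math


def chi2_yates_2x2(observed):
    """Return (statistic, p_value, expected) for a 2x2 table of counts.

    Applies Yates' continuity correction; this sketch assumes every
    |observed - expected| is at least 0.5, as it is for the data below.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (abs(o - e) - 0.5) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom, P(X >= x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value, expected


churn_by_techie = [[6226, 1124], [2095, 555]]  # rows: Churn No/Yes, columns: Techie No/Yes
techie_by_churn = [list(col) for col in zip(*churn_by_techie)]  # transposed table

stat_a, p_a, expected_a = chi2_yates_2x2(churn_by_techie)
stat_b, p_b, expected_b = chi2_yates_2x2(techie_by_churn)

print("statistic: %.3f vs %.3f" % (stat_a, stat_b))
print("expected counts:", expected_a)
```

Each table has (2 - 1) * (2 - 1) = 1 degree of freedom, which matches the third value in the scipy output above.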


In [44]:
from scipy import stats

myDictionaty = dict()
# Compare every feature with each other
for index, column_name in enumerate(categorical_columns):
  iterator = index + 1
  while iterator < len(categorical_columns):
    print("\n")
    sub_column_name = categorical_columns[iterator]
    features = column_name + " & " + sub_column_name
          
    # Cross tabulation
    crossTab = pd.crosstab(index=categorical_df[column_name], columns=categorical_df[sub_column_name])

    #Performing Chi-square test
    # H0: the two features are independent (no association)
    # H1: the two features are associated

    statistic, pValue, dof, expected = stats.chi2_contingency(crossTab)
    prob = 0.95

    #print('interpret test-statistic')
    critical = stats.chi2.ppf(prob, dof)
    # The statistic measures the distance between observed and expected frequencies
    if abs(statistic) >= critical:
      h1_t_statistics = True
    else:
      h1_t_statistics = False

    #print('interpret p-value')
    alpha = 1.0 - prob
    # The p-value is the probability of a statistic at least this large if H0 were true
    if pValue <= alpha: # Reject H0
      h1_p_value = True
    else:
      h1_p_value = False

    if h1_t_statistics and h1_p_value:
      print(features)
      print('Chi-square: Dependent! Reject H0.')
      print('    Degrees of Freedom:%d, probability:%.2f' % (dof, prob))
      print('  interpret test-statistic')
      print('    statistic:%.3f >= critical_value:%.3f' % (abs(statistic), critical) )
      print('  interpret p-value')
      print('    p-value:%.3f <= significance:%.3f' % (pValue, alpha))
      key = features
      myDictionaty[key] = statistic
      print('Cross tab:')
      print(crossTab)

    iterator=iterator+1

print("\nSUMMARY: \nAn association exists between these features")
myDictionaty_sorted = dict(sorted(myDictionaty.items(), key=lambda item: item[1]))
# Print dictionary as json format
import json
print(json.dumps(myDictionaty_sorted, indent=4))









Churn & Gender
Chi-square: Dependent! Reject H0.
    Degrees of Freedom:2, probability:0.95
  interpret test-statistic
    statistic:7.880 >= critical_value:5.991
  interpret p-value
    p-value:0.019 <= significance:0.050
Cross tab:
Gender  Female  Male  Nonbinary
Churn                          
No        3753  3425        172
Yes       1272  1319         59


Churn & Techie
Chi-square: Dependent! Reject H0.
    Degrees of Freedom:1, probability:0.95
  interpret test-statistic
    statistic:44.115 >= critical_value:3.841
  interpret p-value
    p-value:0.000 <= significance:0.050
Cross tab:
Techie    No   Yes
Churn             
No      6226  1124
Yes     2095   555


Churn & InternetService
Chi-square: Dependent! Reject H0.
    Degrees of Freedom:2, probability:0.95
  interpret test-statistic
    statistic:87.462 >= critical_value:5.991
  interpret p-value
    p-value:0.000 <= significance:0.050
Cross tab:
InternetService   DSL  Fiber Optic  None
Churn                         
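
The index and `while` loop in the cell above visits every unordered pair of columns exactly once. A minimal sketch of the same enumeration with `itertools.combinations`, using a shortened column list chosen only for illustration:

```python
from itertools import combinations

# Illustrative subset of categorical_columns, kept short for readability
columns = ["Churn", "Gender", "Techie", "InternetService"]

# Each unordered pair appears once, in the same order as the nested loop
pairs = [first + " & " + second for first, second in combinations(columns, 2)]
for label in pairs:
    print(label)

# n columns give n * (n - 1) / 2 pairs; the 22 columns in this notebook give 231 tests
full_pair_count = 22 * 21 // 2
print("pairs for 22 columns:", full_pair_count)
```

This is an alternative way to write the loop, not a change to its results: the labels it builds have the same form as the `features` strings used as dictionary keys.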

# Insights
Relevant associations
    
    Multiple & Tablet
    StreamingMovies & Tablet
    Techie & DeviceProtection
    InternetService & TechSupport
    Gender & OnlineBackup

    Churn & Gender
    Churn & PaymentMethod
    Churn & OnlineBackup
    Churn & DeviceProtection
    Churn & Techie
    Churn & InternetService
    Churn & Multiple
    Churn & StreamingTV
    Churn & Contract
    Churn & StreamingMovies

    TechSupport & Item1: Timely response
    Techie & Item8: Evidence of active listening
    State & Item3: Timely replacements

    Item2 & Item3 (Timely fixes & Timely replacements)
    Item1 & Item3 (Timely response & Timely replacements)
    Item1 & Item2 (Timely response & Timely fixes)