In this notebook we check whether any pairs of the following categorical predictor (independent) variables are associated with each other:
* <code>Neighbourhood</code>
* <code>Scholarship</code>
* <code>Hipertension</code>
* <code>Diabetes</code>
* <code>SMS_received</code>
* <code>Gap.d</code>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [26]:
raw_data = pd.read_csv('clean_data.csv')

In [27]:
cols_of_interest = ['Neighbourhood', 'Scholarship', 'Hipertension', 'Diabetes', 'SMS_received', 'Gap.d', 'No-show']
data = raw_data[cols_of_interest].drop_duplicates()

In [28]:
data.shape

(19750, 7)

In [3]:
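def run_chi_sq(att1, att2, significance=0.05, print_results=True):
    """Run a chi-square test of independence for attributes att1 and att2.

    Returns the p-value of the test.
    """
    ct = pd.crosstab(data[att1], data[att2])
    chsq_results = stats.chi2_contingency(ct)
    pval = chsq_results[1]  # results are (statistic, p-value, dof, expected counts)

    if print_results:
        print(f'Crosstab between {att1} and {att2}:')
        # Row-normalized crosstab: distribution of att2 within each level of att1
        print(ct.apply(lambda r: r/r.sum(), axis=1))
        print(chsq_results)
        print(f'p-value = {pval}')

    # The verdict is printed regardless of print_results.
    # Compare against the significance argument rather than a hardcoded 0.05.
    if pval < significance:
        print('Reject the null hypothesis that the two groups are same.')
    else:
        print('Fail to reject the null hypothesis that the two groups are same.')

    return pval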
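
The cells that follow test one pair at a time. As a sketch only, the same pairwise sweep can be written as a single loop over `itertools.combinations`. The toy frame `toy`, its column names, and the helper `pairwise_chi_sq` are hypothetical and exist only to keep the example self-contained; they are not part of the original notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data standing in for the notebook's `data` frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'a': rng.integers(0, 2, size=500),
    'b': rng.integers(0, 3, size=500),
    'c': rng.integers(0, 2, size=500),
})

def pairwise_chi_sq(df, columns):
    """Return {(col1, col2): p-value} for every unordered pair of columns."""
    pvals = {}
    for col1, col2 in combinations(columns, 2):
        ct = pd.crosstab(df[col1], df[col2])
        # Index 1 of the chi2_contingency result is the p-value.
        pvals[(col1, col2)] = stats.chi2_contingency(ct)[1]
    return pvals

pvals = pairwise_chi_sq(toy, ['a', 'b', 'c'])
for pair, p in pvals.items():
    print(pair, p)
```

With the six variables listed at the top of this notebook, such a loop would cover the same 15 pairs that are tested individually.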

In [4]:
data.columns

Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'Gap', 'Gap.d',
       'No-show'],
      dtype='object')

In [29]:
run_chi_sq('Neighbourhood', 'Scholarship', print_results=False)

Reject the null hypothesis that the two groups are same.


5.364600221371453e-174

In [30]:
run_chi_sq('Neighbourhood', 'Hipertension', print_results=False)

Reject the null hypothesis that the two groups are same.


1.0542408909677907e-105

In [31]:
run_chi_sq('Neighbourhood', 'Diabetes', print_results=False)

Reject the null hypothesis that the two groups are same.


1.2183507039657931e-46

In [32]:
run_chi_sq('Neighbourhood', 'SMS_received', print_results=False)

Reject the null hypothesis that the two groups are same.


0.018673582171676622

In [33]:
run_chi_sq('Neighbourhood', 'Gap.d', print_results=False)

Reject the null hypothesis that the two groups are same.


1.7052457517919569e-124

In [34]:
run_chi_sq('Scholarship', 'Hipertension', print_results=False)

Reject the null hypothesis that the two groups are same.


3.6839396971180946e-50

In [35]:
run_chi_sq('Scholarship', 'Diabetes', print_results=False)

Reject the null hypothesis that the two groups are same.


6.190184106434597e-44

In [36]:
run_chi_sq('Scholarship', 'SMS_received', print_results=False)

Fail to reject the null hypothesis that the two groups are same.


0.13119220651534103

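<code>Scholarship</code> and <code>SMS_received</code> are the only pair tested in this notebook for which we fail to reject the null hypothesis at the 0.05 level. We will retain only one of the two.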

In [37]:
run_chi_sq('Scholarship', 'Gap.d', print_results=False)

Reject the null hypothesis that the two groups are same.


8.098061092457597e-33

In [38]:
run_chi_sq('Hipertension', 'Diabetes', print_results=False)

Reject the null hypothesis that the two groups are same.


0.0

In [39]:
run_chi_sq('Hipertension', 'SMS_received', print_results=False)

Reject the null hypothesis that the two groups are same.


0.02320173837673293

In [40]:
run_chi_sq('Hipertension', 'Gap.d', print_results=False)

Reject the null hypothesis that the two groups are same.


7.304947874975879e-54

In [41]:
run_chi_sq('Diabetes', 'SMS_received', print_results=False)

Reject the null hypothesis that the two groups are same.


2.5098054043182834e-07

In [42]:
run_chi_sq('Diabetes', 'Gap.d', print_results=False)

Reject the null hypothesis that the two groups are same.


6.416221878802838e-89

In [43]:
run_chi_sq('SMS_received', 'Gap.d', print_results=False)

Reject the null hypothesis that the two groups are same.


0.0

To decide which of <code>Scholarship</code> and <code>SMS_received</code> is the stronger candidate, we look at the cross tabs and the $\chi^2$-test.

In [47]:
pd.crosstab(data['Scholarship'], data['No-show'])

No-show         0     1
Scholarship
0            9494  6152
1            2476  1628
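
The $\chi^2$-test for this pair is not run in a cell of its own. As a minimal sketch, it can be applied directly to the counts in the table above; the names `counts`, `chi2`, `pval`, `dof`, and `expected` are illustrative and not taken from the notebook.

```python
import numpy as np
from scipy import stats

# Counts copied from the Scholarship x No-show crosstab
# (rows: Scholarship = 0, 1; columns: No-show = 0, 1).
counts = np.array([[9494, 6152],
                   [2476, 1628]])

# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
chi2, pval, dof, expected = stats.chi2_contingency(counts)
print(f'chi2 = {chi2:.4f}, dof = {dof}, p-value = {pval:.4f}')
```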


We can find the probability of <code>No-show</code> conditioned on the <code>Scholarship</code> using:

In [49]:
pd.crosstab(data['Scholarship'], data['No-show']).apply(lambda r: r/r.sum(), axis=1)

No-show             0         1
Scholarship
0            0.606800  0.393200
1            0.603314  0.396686


On the other hand, the probability of <code>Scholarship</code> conditioned on <code>No-show</code> is:

In [51]:
pd.crosstab(data['Scholarship'], data['No-show']).apply(lambda r: r/r.sum(), axis=0)

No-show             0         1
Scholarship
0            0.793150  0.790746
1            0.206850  0.209254
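
As an aside, `pd.crosstab` accepts a `normalize` argument that yields the same row-wise and column-wise proportions without the `apply` call. The sketch below rebuilds a small frame from the <code>Scholarship</code> by <code>No-show</code> counts shown earlier; the frame `toy` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Rebuild one row per appointment from the Scholarship x No-show counts.
counts = {(0, 0): 9494, (0, 1): 6152, (1, 0): 2476, (1, 1): 1628}
toy = pd.DataFrame(
    [(s, n) for (s, n), k in counts.items() for _ in range(k)],
    columns=['Scholarship', 'No-show'],
)

# P(No-show | Scholarship): each row sums to 1.
by_row = pd.crosstab(toy['Scholarship'], toy['No-show'], normalize='index')
# P(Scholarship | No-show): each column sums to 1.
by_col = pd.crosstab(toy['Scholarship'], toy['No-show'], normalize='columns')

print(by_row)
print(by_col)
```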


The corresponding probabilities of <code>No-show</code> conditioned on <code>SMS_received</code> are:

In [54]:
pd.crosstab(data['SMS_received'], data['No-show']).apply(lambda r: r/r.sum(), axis=1)

No-show              0         1
SMS_received
0             0.593683  0.406317
1             0.618319  0.381681



<code>SMS_received</code> seems to be a better predictor of <code>No-show</code>: the difference between $P(\text{No-show} = 0 \mid \text{SMS_received} = 0)$ and $P(\text{No-show} = 0 \mid \text{SMS_received} = 1)$ (0.594 versus 0.618, a gap of about 0.025) is larger than the difference between $P(\text{No-show} = 0 \mid \text{Scholarship} = 0)$ and $P(\text{No-show} = 0 \mid \text{Scholarship} = 1)$ (0.607 versus 0.603, a gap of about 0.003).
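
The comparison can be reproduced from the row-normalized proportions reported in the outputs above. This is only a sketch; the helper name `show_up_gap` is hypothetical.

```python
def show_up_gap(p_show_given_0, p_show_given_1):
    """Absolute difference in P(No-show = 0) between the two groups."""
    return abs(p_show_given_0 - p_show_given_1)

# Row-normalized proportions copied from the crosstab outputs above.
scholarship_gap = show_up_gap(0.606800, 0.603314)
sms_gap = show_up_gap(0.593683, 0.618319)

print(f'Scholarship gap:  {scholarship_gap:.4f}')
print(f'SMS_received gap: {sms_gap:.4f}')
```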

Among the variables we started with in this notebook, we will choose the following for our model:
* <code>Neighbourhood</code>
* <code>Hipertension</code>
* <code>Diabetes</code>
* <code>SMS_received</code>
* <code>Gap.d</code>