## Data Exploration

**Check whether a relationship exists between two categorical variables**

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

The manager of a restaurant wants to know whether customer satisfaction is related to the salaries of the people waiting tables.

- She takes a random sample of 100 customers and asks each whether the service was excellent, good, or poor.
- She then categorizes the salaries of the wait staff as low, medium, or high.

Her findings are shown in the table below:


In [5]:
data = pd.read_csv('ChiSquare.csv')
data

Unnamed: 0,Service,Low,Medium,High,Row_Totals
0,Excellent,9,10,7,26
1,Good,11,9,31,51
2,Poor,12,8,3,23
3,Column_Totals,32,27,41,100
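
For readers who do not have `ChiSquare.csv`, the same observed counts can be rebuilt directly from the table above. This is a minimal sketch: the variable names (`observed`, `row_totals`, `column_totals`, `grand_total`) are illustrative and not part of the original notebook, and the margins are recomputed rather than stored next to the counts.

```python
import pandas as pd

# Observed counts copied from the table above
# (rows: service rating, columns: salary band).
observed = pd.DataFrame(
    [[9, 10, 7], [11, 9, 31], [12, 8, 3]],
    index=["Excellent", "Good", "Poor"],
    columns=["Low", "Medium", "High"],
)

# Recompute the margins from the counts instead of reading them from the file.
row_totals = observed.sum(axis=1)
column_totals = observed.sum(axis=0)
grand_total = int(observed.to_numpy().sum())

print(row_totals.tolist())
print(column_totals.tolist())
print(grand_total)
```

The recomputed margins should agree with the `Row_Totals` column and the `Column_Totals` row shown in the table.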


**Ho: Salaries and the Service are independent**

**Ha: Salaries and the Service are not independent**

In [6]:
# Drop the text column; the service labels become the row index below
del data['Service']

data.columns = ['low', 'medium', 'high', 'Row_Totals']

data.index = ['excellent', 'good', 'poor', 'Column_Totals']

display(data)

Unnamed: 0,low,medium,high,Row_Totals
excellent,9,10,7,26
good,11,9,31,51
poor,12,8,3,23
Column_Totals,32,27,41,100


In [7]:
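# Expected count for each cell = (row total * column total) / grand total
grand_total = data.loc['Column_Totals', 'Row_Totals']

expected = np.outer(data['Row_Totals'].iloc[0:3], data.loc['Column_Totals'].iloc[0:3]) / grand_total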

expected = pd.DataFrame(expected)

expected.columns = ['low', 'medium', 'high']

expected.index = ['excellent', 'good', 'poor']

display (expected)

Unnamed: 0,low,medium,high
excellent,8.32,7.02,10.66
good,16.32,13.77,20.91
poor,7.36,6.21,9.43
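
The expected counts above can be cross-checked with SciPy's `scipy.stats.contingency.expected_freq`, which computes them from the observed table alone. The sketch below assumes the 3x3 counts from the restaurant table; the final check against a minimum expected count of 5 is a commonly cited rule of thumb for the chi-square approximation, not something the original notebook states.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts without the totals row and column.
observed = np.array([[9, 10, 7], [11, 9, 31], [12, 8, 3]])

# Expected counts under independence, computed by SciPy.
expected = expected_freq(observed)

# The same quantity by hand: outer product of the margins over the grand total.
manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

print(np.round(expected, 2))
print(np.allclose(expected, manual))

# Rule-of-thumb check (an assumption added here): every expected count >= 5.
print(bool((expected >= 5).all()))
```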


In [8]:
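# Keep only the 3x3 observed counts; the totals row and column are not part of the test
observed = data.iloc[0:3, 0:3]

chi_squared_stat = (((observed - expected)**2) / expected).sum().sum() # Test statistic

print(chi_squared_stat)

# A single .sum() returns the column-wise sums; the second .sum() adds them together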

18.658230409973125


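The test statistic is used to decide whether to reject the null hypothesis. It measures how far the observed counts are from the counts expected under the null hypothesis.

The p-value is the area under the chi-square distribution to the right of the test statistic, that is, the probability of a value at least as large as the observed one if the null hypothesis is true.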
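
For reference, the quantities used in this section can be written compactly as follows. The notation is introduced here and is not taken from the notebook: $O_{ij}$ and $E_{ij}$ are the observed and expected counts in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

```latex
\begin{align}
E_{ij} &= \frac{R_i \, C_j}{N} \\
\chi^2_{\text{obs}} &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
p &= P\left(\chi^2_{(r-1)(c-1)} \ge \chi^2_{\text{obs}}\right)
\end{align}
```

For the restaurant table, $r = c = 3$, so the reference distribution has $(3-1)(3-1) = 4$ degrees of freedom.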

In [9]:
# p_value = P(ChiSq > 18.658230409973125) = 1 - P(ChiSq < 18.658230409973125)
# Degrees of freedom = (rows - 1) * (columns - 1) = (3 - 1) * (3 - 1) = 4

p_value = 1 - stats.chi2.cdf(chi_squared_stat, (3-1)*(3-1))
print(p_value)

0.0009172334128317861
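
The same upper-tail area can be obtained with the survival function `stats.chi2.sf`, and the decision can equivalently be made by comparing the statistic with a critical value from `stats.chi2.ppf`. The sketch below hard-codes the statistic printed earlier; the names `p_from_sf` and `critical_value` are illustrative.

```python
from scipy import stats

chi_squared_stat = 18.658230409973125  # test statistic printed earlier
dof = (3 - 1) * (3 - 1)                # (rows - 1) * (columns - 1)
alpha = 0.05

# Upper-tail area, computed two equivalent ways.
p_from_cdf = 1 - stats.chi2.cdf(chi_squared_stat, dof)
p_from_sf = stats.chi2.sf(chi_squared_stat, dof)

# Critical value: the statistic must exceed this to reject Ho at level alpha.
critical_value = stats.chi2.ppf(1 - alpha, dof)

print(p_from_cdf, p_from_sf)
print(critical_value)
print(chi_squared_stat > critical_value)
```

`sf` is generally preferable to `1 - cdf` for very small p-values because it avoids subtracting two nearly equal numbers.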


In [10]:
# p_value was computed in the previous cell; compare it with the significance level
alpha = 0.05

p_value < alpha

True

If the p-value is less than alpha, we reject the null hypothesis.

**Reject Ho: service quality and the salaries of the wait staff are not independent.**
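
The manual calculation can be verified with `stats.chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected counts in one call. This is a sketch that assumes the 3x3 observed counts from the restaurant table; SciPy applies the continuity correction only when there is one degree of freedom, so it does not affect this table.

```python
import numpy as np
from scipy import stats

# Observed counts without the totals row and column.
observed = np.array([[9, 10, 7], [11, 9, 31], [12, 8, 3]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print(chi2_stat)
print(p_value)
print(dof)
print(p_value < 0.05)
```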

In [8]:
data = pd.read_csv('trainl.csv')
data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,128,360,1,0,0,1,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128,360,1,1,0,0,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66,360,1,0,0,1,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120,360,1,0,0,1,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141,360,1,0,0,1,Y

**Ho: Gender and Loan Status are independent**

**Ha: Gender and Loan Status are not independent**


In [9]:
tbl = pd.crosstab(data.Gender, data.Loan_Status)
display (tbl)

Loan_Status,N,Y
Gender,,
Female,37,75
Male,155,347


In [10]:
stats.chi2_contingency(tbl)

(0.11087854691241235, 0.7391461310869638, 1, array([[ 35.0228013,  76.9771987],
        [156.9771987, 345.0228013]]))

In [11]:
chi_square, p_value, degrees_of_freedom, expected_frequencies = stats.chi2_contingency(tbl)

print(chi_square)  # Test Statistic
print ()
print(p_value)
print ()
print(degrees_of_freedom)
print ()
print(expected_frequencies)

0.11087854691241235

0.7391461310869638

1

[[ 35.0228013  76.9771987]
 [156.9771987 345.0228013]]
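
Because this is a 2x2 table (one degree of freedom), `stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic above is smaller than the plain sum of `(observed - expected)**2 / expected`. The sketch below compares both settings using the Gender by Loan_Status counts shown above; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Gender x Loan_Status counts from the crosstab (rows: Female, Male; columns: N, Y).
observed = np.array([[37, 75], [155, 347]])

# Default behaviour: Yates' continuity correction for a 2x2 table.
chi2_corr, p_corr, dof_corr, _ = stats.chi2_contingency(observed)

# Uncorrected Pearson chi-square statistic.
chi2_raw, p_raw, dof_raw, _ = stats.chi2_contingency(observed, correction=False)

print(chi2_corr, p_corr)
print(chi2_raw, p_raw)
```

Both versions lead to the same decision here, since each p-value is well above 0.05.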


In [12]:
# p_value comes from the chi2_contingency call in the previous cell
alpha = 0.05

p_value < alpha

False

If the p-value is less than alpha, we reject the null hypothesis. Here it is not.

**Fail to reject Ho. There is not enough evidence of a relationship between the gender of the customer and Loan Status.**

**Ho: Education and Loan Status are independent**

**Ha: Education and Loan Status are not independent**

In [13]:
tbl = pd.crosstab(data.Education, data.Loan_Status)
display (tbl)
print ()

chi_square , p_value, degrees_of_freedom, expected_frequencies=stats.chi2_contingency(tbl)

print(chi_square)  # Test Statistic
print ()
print(p_value)
print ()
print(degrees_of_freedom)
print ()
print(expected_frequencies)

Loan_Status,N,Y
Education,,
Graduate,140,340
Not Graduate,52,82



4.091490413303621

0.04309962129357355

1

[[150.09771987 329.90228013]
 [ 41.90228013  92.09771987]]


In [14]:
# p_value comes from the chi2_contingency call in the previous cell
alpha = 0.05

p_value < alpha

True

**Reject Ho. Education and Loan Status are not independent: there is a relationship between education level and Loan Status.**
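
Since the Gender and Education tests repeat the same steps, they could be wrapped in a small function. The helper below, `chi_square_independence`, is hypothetical and not part of the original notebook. It is demonstrated on a toy DataFrame rebuilt from the Education by Loan_Status counts shown above, so it runs without the loan CSV file.

```python
import pandas as pd
from scipy import stats


def chi_square_independence(df, row_col, col_col, alpha=0.05):
    """Hypothetical helper: cross-tabulate two categorical columns and test independence."""
    table = pd.crosstab(df[row_col], df[col_col])
    chi2_stat, p_value, dof, _ = stats.chi2_contingency(table)
    return {
        "chi2": chi2_stat,
        "p_value": p_value,
        "dof": dof,
        "reject_h0": bool(p_value < alpha),
    }


# Toy data: one row per applicant, rebuilt from the crosstab counts above.
counts = {
    ("Graduate", "N"): 140,
    ("Graduate", "Y"): 340,
    ("Not Graduate", "N"): 52,
    ("Not Graduate", "Y"): 82,
}
rows = [(edu, status) for (edu, status), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["Education", "Loan_Status"])

result = chi_square_independence(toy, "Education", "Loan_Status")
print(result)
```

With the real data, the call would be `chi_square_independence(data, "Education", "Loan_Status")`, assuming `data` is the loan DataFrame loaded earlier.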