
# Data Science Unit 1 Sprint Challenge 4

## Exploring Data, Testing Hypotheses

In this sprint challenge you will look at a dataset of people being approved or rejected for credit.

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

Data Set Information: This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

Attribute Information:
- A1: b, a.
- A2: continuous.
- A3: continuous.
- A4: u, y, l, t.
- A5: g, p, gg.
- A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
- A7: v, h, bb, j, n, z, dd, ff, o.
- A8: continuous.
- A9: t, f.
- A10: t, f.
- A11: continuous.
- A12: t, f.
- A13: g, p, s.
- A14: continuous.
- A15: continuous.
- A16: +,- (class attribute)

Yes, most of that doesn't mean anything. A16 (the class attribute) is the most interesting, as it separates the 307 approved cases from the 383 rejected cases. The remaining variables have been obfuscated for privacy - a challenge you may have to deal with in your data science career.

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- UCI says there should be missing data - check, and if necessary change the data so pandas recognizes it as na
- Make sure that the loaded features are of the types described above (continuous values should be treated as float), and correct as necessary

This is review, but skills that you'll use at the start of any data exploration. Further, you may have to do some investigation to figure out which file to load from - that is part of the puzzle.
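The checklist above can be sketched as a small loader. This is a minimal sketch, not the required solution: the helper name `load_credit`, the `expected_rows` parameter, and the three-row `SAMPLE` text are assumptions made up for illustration (the rows mimic values shown in this notebook, with `?` standing in for a missing value).

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same layout as the raw file:
# no header row, 16 comma-separated fields, '?' marking a missing value.
SAMPLE = """\
b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256.0,17,-
"""

COLUMNS = [f"A{i}" for i in range(1, 17)]                # A1 ... A16
CONTINUOUS = ["A2", "A3", "A8", "A11", "A14", "A15"]     # per the attribute list


def load_credit(source, expected_rows=None):
    """Load the data, mark '?' as NaN, and cast continuous columns to float."""
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    if expected_rows is not None:
        # Validate the number of observations against the dataset description.
        assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"
    # Integer-looking columns such as A11 and A15 are read as int64 by default.
    return df.astype({col: float for col in CONTINUOUS})


sample_df = load_credit(io.StringIO(SAMPLE), expected_rows=3)
print(sample_df.isnull().sum().sum())   # total number of missing cells
print(sample_df[CONTINUOUS].dtypes)
```

For the real file, the same function could be called with the data URL and `expected_rows=690` (307 approved plus 383 rejected cases).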

In [62]:
# Load dataframe, check for na values, change columns names to match UCI for readability 

import pandas as pd
from scipy import stats
import numpy as np



columns = [f'A{i}' for i in range(1, 17)]  # A1 ... A16, matching the UCI attribute names
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data',
    header=None,      # the raw file has no header row
    na_values='?',    # UCI marks missing values with '?'
    names=columns,
)
df.tail(20)


Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465.0,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0.0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240.0,117,-
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256.0,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260.0,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240.0,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129.0,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100.0,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0.0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0.0,0,-


In [63]:
df.isnull().sum()

A1     12
A2     12
A3      0
A4      6
A5      6
A6      9
A7      9
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64

In [0]:
#choosing to drop rows with na values
df.dropna(inplace = True)

In [65]:
df.isnull().sum()

A1     0
A2     0
A3     0
A4     0
A5     0
A6     0
A7     0
A8     0
A9     0
A10    0
A11    0
A12    0
A13    0
A14    0
A15    0
A16    0
dtype: int64

In [66]:
df.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+


In [67]:
#still a large dataset with na values dropped
df.describe()

Unnamed: 0,A2,A3,A8,A11,A14,A15
count,653.0,653.0,653.0,653.0,653.0,653.0
mean,31.503813,4.829533,2.244296,2.502297,180.359877,1013.761103
std,11.838267,5.027077,3.37112,4.968497,168.296811,5253.278504
min,13.75,0.0,0.0,0.0,0.0,0.0
25%,22.58,1.04,0.165,0.0,73.0,0.0
50%,28.42,2.835,1.0,0.0,160.0,5.0
75%,38.25,7.5,2.625,3.0,272.0,400.0
max,76.75,28.0,28.5,67.0,2000.0,100000.0


In [68]:
df.describe(exclude=[np.number])

Unnamed: 0,A1,A4,A5,A6,A7,A9,A10,A12,A13,A16
count,653,653,653,653,653,653,653,653,653,653
unique,2,3,3,14,9,2,2,2,3,2
top,b,u,g,c,v,t,f,f,g,-
freq,450,499,499,133,381,349,366,351,598,357


In [69]:
df.shape

(653, 16)

## Part 2 - Exploring data, Testing hypotheses

The only thing we really know about this data is that A16 is the class label. Besides that, we have 6 continuous (float) features and 9 categorical features.

Explore the data: you can use whatever approach (tables, utility functions, visualizations) to get an impression of the distributions and relationships of the variables. In general, your goal is to understand how the features are different when grouped by the two class labels (`+` and `-`).

For the 6 continuous features, how are they different when split between the two class labels? Choose two features to run t-tests (again split by class label) - specifically, select one feature that is *extremely* different between the classes, and another feature that is notably less different (though perhaps still "statistically significantly" different). You may have to explore more than two features to do this.
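One way to explore more than two features without repeating a cell per feature is to loop over them. The sketch below is an illustration under assumptions: `ttests_by_class` is a hypothetical helper, and the `toy` data frame is synthetic stand-in data (not the credit data) with one large and one small gap between classes.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def ttests_by_class(df, features, label="A16"):
    """Run an independent two-sample t-test per feature, split by class label."""
    pos = df[df[label] == "+"]
    neg = df[df[label] == "-"]
    rows = []
    for col in features:
        stat, p = ttest_ind(pos[col].dropna(), neg[col].dropna())
        rows.append({"feature": col, "t": stat, "p_value": p})
    # Smallest p-value (most extreme difference) first.
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "A16": ["+"] * n + ["-"] * n,
    "big_gap": np.concatenate([rng.normal(5, 1, n), rng.normal(0, 1, n)]),
    "small_gap": np.concatenate([rng.normal(0.1, 1, n), rng.normal(0, 1, n)]),
})

summary = ttests_by_class(toy, ["small_gap", "big_gap"])
print(summary)
```

Sorting by p-value makes it easy to pick one "extremely different" feature from the top of the table and a "less different" one from the bottom.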

For the categorical features, explore by creating "cross tabs" (aka [contingency tables](https://en.wikipedia.org/wiki/Contingency_table)) between them and the class label, and apply the Chi-squared test to them. [pandas.crosstab](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) can create contingency tables, and [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) can calculate the Chi-squared statistic for them.

There are 9 categorical features - as with the t-test, try to find one where the Chi-squared test returns an extreme result (rejecting the null that the data are independent), and one where it is less extreme.

**NOTE** - "less extreme" just means smaller test statistic/larger p-value. Even the least extreme differences may be strongly statistically significant.
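The same looping idea applies to the categorical features. This is a sketch under assumptions: `chi2_by_class` is a hypothetical helper and `toy` is synthetic data in which the `linked` column agrees with the label about 90% of the time while `noise` ignores it. Results are ranked by p-value rather than by the raw statistic, because tables with different numbers of categories have different degrees of freedom.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_by_class(df, features, label="A16"):
    """Cross-tabulate each feature against the label and run a chi-squared test."""
    rows = []
    for col in features:
        table = pd.crosstab(df[label], df[col])
        chi2, p, dof, _expected = chi2_contingency(table)
        rows.append({"feature": col, "chi2": chi2, "p_value": p, "dof": dof})
    # Smallest p-value (strongest evidence against independence) first.
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(1)
n = 400
label = rng.choice(["+", "-"], size=n)
agrees = rng.random(n) < 0.9                      # 90% of rows follow the label
follows = np.where(label == "+", "t", "f")
opposes = np.where(label == "+", "f", "t")
toy = pd.DataFrame({
    "A16": label,
    "linked": np.where(agrees, follows, opposes),
    "noise": rng.choice(["a", "b"], size=n),
})

ranking = chi2_by_class(toy, ["noise", "linked"])
print(ranking)
```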

Your *main* goal is the hypothesis tests, so don't spend too much time on the exploration/visualization piece. That is just a means to an end - use simple visualizations, such as boxplots or a scatter matrix (both built in to pandas), to get a feel for the overall distribution of the variables.

This is challenging, so manage your time and aim for a baseline of at least running two t-tests and two Chi-squared tests before polishing. And don't forget to answer the questions in part 3, even if your results in this part aren't what you want them to be.

In [70]:
approved = df.loc[df['A16']=='+']
rejected = df.loc[df['A16']=='-']
approved.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+


In [58]:
rejected.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
70,b,32.33,7.5,u,g,e,bb,1.585,t,f,0,t,s,420.0,0,-
72,a,38.58,5.0,u,g,cc,v,13.5,t,f,0,t,g,980.0,0,-
73,b,44.25,0.5,u,g,m,v,10.75,t,f,0,f,s,400.0,0,-
74,b,44.83,7.0,y,p,c,v,1.625,f,f,0,f,g,160.0,2,-
75,b,20.67,5.29,u,g,q,v,0.375,t,t,1,f,g,160.0,0,-


In [71]:
print(approved.shape)
approved.describe()


(296, 16)


Unnamed: 0,A2,A3,A8,A11,A14,A15
count,296.0,296.0,296.0,296.0,296.0,296.0
mean,33.845473,5.971943,3.475186,4.716216,164.621622,2009.726351
std,12.689357,5.492651,4.167399,6.398136,162.54355,7660.949172
min,13.75,0.0,0.0,0.0,0.0,0.0
25%,23.25,1.5,0.75,0.0,0.0,0.0
50%,31.04,4.48,2.0,3.0,120.0,210.5
75%,41.44,9.56125,5.0,7.0,280.0,1216.5
max,76.75,28.0,28.5,67.0,840.0,100000.0


In [72]:
print(rejected.shape)
rejected.describe()

(357, 16)


Unnamed: 0,A2,A3,A8,A11,A14,A15
count,357.0,357.0,357.0,357.0,357.0,357.0
mean,29.562269,3.882325,1.223725,0.666667,193.408964,187.97479
std,10.719168,4.393079,2.029272,1.958368,172.057953,632.781715
min,15.17,0.0,0.0,0.0,0.0,0.0
25%,21.92,0.835,0.125,0.0,100.0,0.0
50%,26.92,2.21,0.455,0.0,160.0,1.0
75%,34.83,5.0,1.5,0.0,260.0,67.0
max,74.83,26.335,13.875,20.0,2000.0,5552.0


In [73]:
# For the continuous features I will run t-tests.
# I am looking at feature A15 first because the class means are about 1,800 apart, which suggests it matters a lot for approval.
# I will also look at feature A2, where the means are only about 4.3 apart and both standard deviations are over 10.
from scipy.stats import ttest_ind, ttest_ind_from_stats, ttest_rel

#test for A15

approvedA15 = approved['A15']
rejectedA15 = rejected['A15']

stat, pvalue = ttest_ind(approvedA15, rejectedA15)
print(pvalue <=.05)
print(stat, pvalue)

True
4.475369764700449 9.003915641872878e-06


In [78]:
for sample in [approvedA15, rejectedA15]:
    print(f"Mean: {sample.mean()}")
    print(f"Standard Deviation: {sample.std()}")
    print(f"Variance: {sample.var()}")
    print("---"*10)

Mean: 2009.7263513513512
Standard Deviation: 7660.949172248684
Variance: 58690142.21977779
------------------------------
Mean: 187.9747899159664
Standard Deviation: 632.7817149703922
Variance: 400412.6988008706
------------------------------
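The two A15 variances printed above differ by more than a factor of 100, while `ttest_ind` assumes equal variances by default. A hedged cross-check, using only the summary statistics printed above, is to compare the pooled test with Welch's unequal-variance test via `scipy.stats.ttest_ind_from_stats`. The variable names are illustrative; Welch's test is an alternative offered here, not the test the notebook itself ran.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics for A15 copied from the output above
# (pandas .std() is the sample standard deviation, which this function expects).
approved_stats = dict(mean1=2009.7263513513512, std1=7660.949172248684, nobs1=296)
rejected_stats = dict(mean2=187.9747899159664, std2=632.7817149703922, nobs2=357)

# Pooled-variance (Student) test: same assumption as the default ttest_ind call.
pooled = ttest_ind_from_stats(**approved_stats, **rejected_stats, equal_var=True)

# Welch's test: does not assume the two groups share a variance.
welch = ttest_ind_from_stats(**approved_stats, **rejected_stats, equal_var=False)

print(f"pooled: t={pooled.statistic:.4f}, p={pooled.pvalue:.3g}")
print(f"welch:  t={welch.statistic:.4f}, p={welch.pvalue:.3g}")
```

The pooled result should reproduce the statistic from the `ttest_ind` cell; passing `equal_var=False` to `ttest_ind` on the raw columns would give the Welch version directly.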


In [74]:
#testing A2

approvedA2 = approved['A2']
rejectedA2 = rejected['A2']

stat, pvalue = ttest_ind(approvedA2, rejectedA2)
print(pvalue <=.05)
print(stat, pvalue)

True
4.675662433767456 3.5636710933835225e-06


In [79]:
# for both A2 and A15 we are rejecting the null hypothesis 
# now trying for A3

approvedA3 = approved['A3']
rejectedA3 = rejected['A3']

stat, pvalue = ttest_ind(approvedA3, rejectedA3)
print(pvalue <=.05)
print(stat, pvalue)

True
5.400813416576192 9.310154396147606e-08


In [80]:
# also rejecting null hypothesis for feature A3
# trying A8

approvedA8 = approved['A8']
rejectedA8 = rejected['A8']

stat, pvalue = ttest_ind(approvedA8, rejectedA8)
print(pvalue <=.05)
print(stat, pvalue)

True
9.002392498622463 2.4079238505859142e-18


In [81]:
# feature A8 is extremely significant for approval
# testing A11 and A14

approvedA11 = approved['A11']
rejectedA11 = rejected['A11']

stat, pvalue = ttest_ind(approvedA11, rejectedA11)
print(pvalue <=.05)
print(stat, pvalue)

True
11.336964562239809 2.5864649620278843e-27


In [82]:
approvedA14 = approved['A14']
rejectedA14 = rejected['A14']

stat, pvalue = ttest_ind(approvedA14, rejectedA14)
print(pvalue <=.05)
print(stat, pvalue)

True
-2.18221936438708 0.029450100409286473


In [83]:
approvedA14 = approved['A14']
rejectedA14 = rejected['A14']

stat, pvalue = ttest_ind(approvedA14, rejectedA14)
print(pvalue <=.01)
print(stat, pvalue)

False
-2.18221936438708 0.029450100409286473


In [0]:
# I am rejecting the null hypothesis (no difference in means between approved and rejected) for every continuous feature at the 95% level
# that being said, feature A14 is the least statistically significant; it would not be significant at the 99% level
# features A8 and A11 have extremely high significance

Everything above is t-test comparison of continuous features.

Everything below is testing categorical features with crosstabs and chi-squared tests.

In [134]:
approved.describe(exclude=[np.number])

Unnamed: 0,A1,A4,A5,A6,A7,A9,A10,A12,A13,A16
count,296,296,296,296,296,296,296,296,296,296
unique,2,3,3,14,9,2,2,2,3,1
top,b,u,g,c,v,t,t,f,g,+
freq,201,249,249,60,163,278,203,151,280,296


In [135]:
rejected.describe(exclude=[np.number])

Unnamed: 0,A1,A4,A5,A6,A7,A9,A10,A12,A13,A16
count,357,357,357,357,357,357,357,357,357,357
unique,2,2,2,14,9,2,2,2,3,1
top,b,u,g,c,v,f,f,f,g,-
freq,249,250,250,73,218,286,273,200,318,357


In [0]:
# a quick check of the top values and their frequencies shows that A9 and A10 each have a different dominant value for approved vs. rejected
# both A9 and A10 take two values, t and f; I am going to assume these stand for true and false
# for both features, most of the approved applications received a true and most of the rejected applications received a false


In [0]:
from scipy.stats import chisquare

In [121]:
contingencyA6 = pd.crosstab(df['A16'], df['A6'])
contingencyA6

A16 / A6,aa,c,cc,d,e,ff,i,j,k,m,q,r,w,x
+,19,60,29,7,14,7,14,3,13,16,49,2,33,30
-,33,73,11,19,10,43,41,7,35,22,26,1,30,6


In [122]:
chi_squared, p_value, dof, expected = stats.chi2_contingency(contingencyA6)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 89.76481160702343
P-value: 1.5500154549498966e-13
Degrees of Freedom: 13
Expected: 
 [[23.5712098  60.28790199 18.13169985 11.7856049  10.87901991 22.66462481
  24.93108729  4.53292496 21.75803982 17.22511485 33.99693721  1.35987749
  28.55742726 16.31852986]
 [28.4287902  72.71209801 21.86830015 14.2143951  13.12098009 27.33537519
  30.06891271  5.46707504 26.24196018 20.77488515 41.00306279  1.64012251
  34.44257274 19.68147014]]


In [123]:
contingencyA1 = pd.crosstab(df['A16'], df['A1'])
contingencyA1

A16 / A1,a,b
+,95,201
-,108,249


In [124]:
chi_squared, p_value, dof, expected = stats.chi2_contingency(contingencyA1)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 0.17764082160253514
P-value: 0.6734085695133722
Degrees of Freedom: 1
Expected: 
 [[ 92.01837672 203.98162328]
 [110.98162328 246.01837672]]


In [0]:
# A1: we fail to reject the null hypothesis; there is no evidence that this feature is associated with approval
# A6: we reject the null hypothesis, meaning this categorical feature is significantly associated with approval
# so A1 has little significance for approval while A6 has strong significance
# now I am testing A9 and A10, which I expect to be extremely significant for approval or rejection because of their value counts


In [136]:
contingencyA10 = pd.crosstab(df['A16'], df['A10'])
print(contingencyA10)

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingencyA10)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

A10    f    t
A16          
+     93  203
-    273   84
Chi-Squared: 131.50867232095965
P-value: 1.9163536191857147e-30
Degrees of Freedom: 1
Expected: 
 [[165.9050536 130.0949464]
 [200.0949464 156.9050536]]


In [128]:
contingencyA9 = pd.crosstab(df['A16'], df['A9'])
print(contingencyA9)

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingencyA9)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

A9     f    t
A16          
+     18  278
-    286   71
Chi-Squared: 353.4827159410316
P-value: 7.391616628555818e-79
Degrees of Freedom: 1
Expected: 
 [[137.80091884 158.19908116]
 [166.19908116 190.80091884]]
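For a 2x2 table like this one (one degree of freedom), `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, so the statistic printed above is the corrected one. The sketch below rebuilds the A9 table from the counts printed above and compares the corrected and uncorrected statistics; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# A9 counts copied from the crosstab above: rows are A16 (+, -), columns are A9 (f, t).
a9_table = np.array([[18, 278],
                     [286, 71]])

# Default behaviour: Yates' continuity correction is applied when dof == 1.
chi2_corrected, p_corrected, dof, expected = chi2_contingency(a9_table)

# Plain Pearson chi-squared statistic, without the correction.
chi2_plain, p_plain, _, _ = chi2_contingency(a9_table, correction=False)

print(f"with correction:    chi2={chi2_corrected:.4f}, p={p_corrected:.3g}, dof={dof}")
print(f"without correction: chi2={chi2_plain:.4f}, p={p_plain:.3g}")
```

With counts this large the correction barely changes the conclusion; it matters more when expected cell counts are small.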


In [130]:
# feature A9 shows the strongest association with approval of any feature tested (p is about 7e-79)

rejected['A9'].value_counts()

f    286
t     71
Name: A9, dtype: int64

In [132]:
approved['A9'].value_counts()

t    278
f     18
Name: A9, dtype: int64

## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?
- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?
- What was the most challenging part of this sprint challenge?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.


*   t-test talk


I ran a t-test on every continuous feature, and every one was significant at the 95% level: the mean of each feature differs between accepted and rejected applications. Features A8 and A11 stood out, with extremely small p-values (about 2.4e-18 and 2.6e-27), so approval appears to be strongly associated with these two.

Feature A14 was significant at the 95% level but not at the 99% level (p = 0.029). It was by far the least significant continuous feature for approval or rejection.

* chi-square talk

I ran multiple chi-squared tests on the categorical features. I found a strong association between feature A6 and approval (chi-squared = 89.8, p = 1.6e-13).

I failed to reject the null hypothesis for feature A1: its p-value is 0.673.
So there is no evidence that this feature is associated with whether an application is accepted or rejected.

Of the categorical features I tested, the two most significant for acceptance or rejection are A9 and A10. This could be seen from the value counts alone, and the chi-squared test on each confirmed it.

Both A9 and A10 take two values, t and f. I am going to assume that these stand for true and false.
For both features, most of the accepted applications received a true and most of the rejected applications received a false. So the data shows that, if you want your application accepted, having a true in both of these places is extremely helpful.
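To put a size on that A9 relationship, the crosstab counts from Part 2 can be turned into approval rates within each A9 value. This is a small sketch using the counts printed earlier; the variable names are illustrative.

```python
import pandas as pd

# A9 counts from the Part 2 crosstab: rows are A16 (+, -), columns are A9 (f, t).
a9_counts = pd.DataFrame(
    {"f": [18, 286], "t": [278, 71]},
    index=pd.Index(["+", "-"], name="A16"),
)
a9_counts.columns.name = "A9"

# Divide each column by its total to get the share of + and - within each A9 value.
shares = a9_counts / a9_counts.sum(axis=0)

# Approval rate among applications with A9 == 'f' and with A9 == 't'.
approval_rate = shares.loc["+"]
print(approval_rate)
```

In this sample, about 80% of applications with A9 = t were approved (278 of 349), against about 6% of those with A9 = f (18 of 304). That is an association in this data, not proof of what the lender's rule actually is.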

* challenges talk

Honestly, I was just having some computer issues. It may be time to upgrade my system.