# Data Science Unit 1 Sprint Challenge 3

## Exploring Data, Testing Hypotheses

In this sprint challenge you will look at a dataset of people being approved or rejected for credit.

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

Data Set Information: This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

Attribute Information:
- A1: b, a.
- A2: continuous.
- A3: continuous.
- A4: u, y, l, t.
- A5: g, p, gg.
- A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
- A7: v, h, bb, j, n, z, dd, ff, o.
- A8: continuous.
- A9: t, f.
- A10: t, f.
- A11: continuous.
- A12: t, f.
- A13: g, p, s.
- A14: continuous.
- A15: continuous.
- A16: +,- (class attribute)

Yes, most of that doesn't mean anything. A16 (the class attribute) is the most interesting, as it separates the 307 approved cases from the 383 rejected cases. The remaining variables have been obfuscated for privacy - a challenge you may have to deal with in your data science career.

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- UCI says there should be missing data - check, and if necessary change the data so pandas recognizes it as na
- Make sure that the loaded features are of the types described above (continuous values should be treated as float), and correct as necessary

This is review, but skills that you'll use at the start of any data exploration. Further, you may have to do some investigation to figure out which file to load from - that is part of the puzzle.

In [1]:
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/credit.lisp
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/credit.names

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 32218  100 32218    0     0  75807      0 --:--:-- --:--:-- --:--:-- 75807
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1486  100  1486    0     0   3555      0 --:--:-- --:--:-- --:--:--  3555
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12314  100 12314    0     0  27425      0 --:--:-- --:--:-- --:--:-- 27486
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   522  100   522    0     0   1631      0 --:--:-- --:--:-- --:--:--  1626


In [2]:
!ls -l c*

-rw-r--r--  1 EricJC  staff  12314 May 31 21:08 credit.lisp
-rw-r--r--  1 EricJC  staff    522 May 31 21:08 credit.names
-rw-r--r--  1 EricJC  staff  32218 May 31 21:08 crx.data
-rw-r--r--  1 EricJC  staff   1486 May 31 21:08 crx.names


In [3]:
!cat credit.lisp

;; positive examples represent people that were granted credit
(def-pred credit_screening :type (:person) 
  :pos
  ((s1) (s2) (s4) (s5) (s6) (s7) (s8) (s9) (s14) (s15) (s17) (s18) (s19)
   (s21) (s22) (s24) (s28) (s29) (s31) (s32) (s35) (s38) (s40) (s41)
   (s42) (s43) (s45) (s46) (s47) (s49) (s50) (s51) (s53) (s54) (s55)
   (s56) (s57) (s59) (s61) (s62) (s63) (s64) (s65) (s66) (s69) (s70)
   (s71) (s72) (s73) (s74) (s75) (s76) (s77) (s78) (s79) (s80) (s81)
   (s83) (s84) (s85) (s86) (s87) (s89) (s90) (s91) (s92) (s93) (s94)
   (s96) (s97) (s98) (s100) (s103) (s104) (s106) (s108) (s110) (s116)
   (s117) (s118) (s119) (s121) (s122) (s123) (s124))
  :neg
  ((s3) (s10) (s11) (s12) (s13) (s16) (s20) (s23) (s25) (s26) (s27) 
   (s30) (s33) (s34) (s36) (s37) (s39) (s44) (s48) (s52) (s58) (s60)
   (s67) (s68) (s82) (s88) (s95) (s99) (s101) (s102) (s105) (s107)
   (s109) (s111) (s112) (s113) (s114) (s115) (s120) (s125)))

(def-pred jobless :type (:person) :pos
  ((s3) (s10) 

In [7]:
#Let's read in the csv file
import pandas as pd
df = pd.read_csv('crx.data', header=None, na_values='?')
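
The load above can be extended to cover the other validation steps listed in Part 1. The following is a minimal sketch, assuming the A1 to A16 names from the attribute list and the 690 rows implied by the 307 + 383 class counts. The `load_credit` helper and the two-row inline sample are hypothetical; the sample stands in for `crx.data` so the snippet runs on its own.

```python
import io
import pandas as pd

# Column names taken from the attribute list; the file itself has no header row.
COLUMNS = [f"A{i}" for i in range(1, 17)]
CONTINUOUS = ["A2", "A3", "A8", "A11", "A14", "A15"]

def load_credit(source):
    """Read a crx.data-style CSV, treat '?' as missing, cast continuous columns to float."""
    frame = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    return frame.astype({col: float for col in CONTINUOUS})

# Hypothetical two-row sample in the same layout (the second row has missing values).
sample = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,?,4.46,u,g,q,h,3.04,t,t,6,f,g,?,560,-\n"
)
credit = load_credit(sample)
print(credit.dtypes)
print(credit.isna().sum())
# With the real file one would also check: assert credit.shape == (690, 16)
```

Casting explicitly matters here because columns such as A11 and A15 contain only whole numbers, so pandas would otherwise infer them as integers.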

In [8]:
#There are some missing values as we can see
#Missing values are in columns 0,1,3,4,5,6 and 13

df.isna().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [9]:
df.head()

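| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | + |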


In [30]:
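#column [0]: fill each missing value with the next valid value below it
#(bfill() replaces fillna(method='backfill'), which is deprecated in pandas 2.x)
df[0] = df[0].bfill()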

In [31]:
df[0].isna().sum()

0

In [32]:
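#fill missing values in column 1 with the column mean; assign back rather than
#relying on inplace=True on a selected column
df[1] = df[1].fillna(df[1].mean())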

In [33]:
df[1].isna().sum()

0

In [59]:
#forward-fill the remaining columns that have missing values
for c in [3, 4, 5, 6, 13]:
    df[c] = df[c].ffill()

In [60]:
#Confirm that no missing values remain in any column
df.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64
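
Forward and backward fills depend on row order. Assuming row order carries no meaning in this dataset, one alternative is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below shows that idea on a toy frame; the `impute_simple` helper and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_simple(frame):
    """Return a copy with numeric NaNs set to the column median and other NaNs to the column mode."""
    filled = frame.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() returns a Series; take the first (most frequent) value.
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled

toy = pd.DataFrame({
    "cat": ["b", "a", None, "b"],
    "num": [1.0, np.nan, 3.0, 5.0],
})
clean = impute_simple(toy)
print(clean)
```

Whichever strategy is used, only a handful of the rows are affected, so the choice is unlikely to change the test results much.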

## Part 2 - Exploring data, Testing hypotheses

The only thing we really know about this data is that A16 is the class label. Besides that, we have 6 continuous (float) features and 9 categorical features.

Explore the data: you can use whatever approach (tables, utility functions, visualizations) to get an impression of the distributions and relationships of the variables. In general, your goal is to understand how the features are different when grouped by the two class labels (`+` and `-`).

For the 6 continuous features, how are they different when split between the two class labels? Choose two features to run t-tests (again split by class label) - specifically, select one feature that is *extremely* different between the classes, and another feature that is notably less different (though perhaps still "statistically significantly" different). You may have to explore more than two features to do this.

For the categorical features, explore by creating "cross tabs" (aka [contingency tables](https://en.wikipedia.org/wiki/Contingency_table)) between them and the class label, and apply the Chi-squared test to them. [pandas.crosstab](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) can create contingency tables, and [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) can calculate the Chi-squared statistic for them.

There are 9 categorical features - as with the t-test, try to find one where the Chi-squared test returns an extreme result (rejecting the null that the data are independent), and one where it is less extreme.

**NOTE** - "less extreme" just means smaller test statistic/larger p-value. Even the least extreme differences may be strongly statistically significant.

Your *main* goal is the hypothesis tests, so don't spend too much time on the exploration/visualization piece. That is just a means to an end - use simple visualizations, such as boxplots or a scatter matrix (both built in to pandas), to get a feel for the overall distribution of the variables.

This is challenging, so manage your time and aim for a baseline of at least running two t-tests and two Chi-squared tests before polishing. And don't forget to answer the questions in part 3, even if your results in this part aren't what you want them to be.

In [61]:
#Let's make this data more manageable
#The last column, according to the attributes, contains the class information that we 
#need to separate the data by
df[15].unique()

array(['+', '-'], dtype=object)

In [62]:
#app == approved
#dis == disapproved

app = df.loc[df[15] == '+']
dis = df.loc[df[15]=='-']

#run this to ensure that once we have separated the data into the different classes
#we have captured all of the data.
#The sum of the lengths of both classes should equal the total length of the data frame
assert len(df)== len(app) + len(dis)

In [63]:
#Separated the data, checked to make sure.
print(app.head(20))
print('==' * 35)
print(dis.head(20))

   0      1       2  3  4   5  6      7  8  9   10 11 12     13     14 15
0   b  30.83   0.000  u  g   w  v  1.250  t  t   1  f  g  202.0      0  +
1   a  58.67   4.460  u  g   q  h  3.040  t  t   6  f  g   43.0    560  +
2   a  24.50   0.500  u  g   q  h  1.500  t  f   0  f  g  280.0    824  +
3   b  27.83   1.540  u  g   w  v  3.750  t  t   5  t  g  100.0      3  +
4   b  20.17   5.625  u  g   w  v  1.710  t  f   0  f  s  120.0      0  +
5   b  32.08   4.000  u  g   m  v  2.500  t  f   0  t  g  360.0      0  +
6   b  33.17   1.040  u  g   r  h  6.500  t  f   0  t  g  164.0  31285  +
7   a  22.92  11.585  u  g  cc  v  0.040  t  f   0  f  g   80.0   1349  +
8   b  54.42   0.500  y  p   k  h  3.960  t  f   0  f  g  180.0    314  +
9   b  42.50   4.915  y  p   w  v  3.165  t  f   0  t  g   52.0   1442  +
10  b  22.08   0.830  u  g   c  h  2.165  f  f   0  t  g  128.0      0  +
11  b  29.92   1.835  u  g   c  h  4.335  t  f   0  f  g  260.0    200  +
12  a  38.25   6.000  u  g   k  v  1.0

In [64]:
#Let's explore this data a little
app.describe()

| | 1 | 2 | 7 | 10 | 13 | 14 |
|---|---|---|---|---|---|---|
| count | 307.0 | 307.0 | 307.0 | 307.0 | 307.0 | 307.0 |
| mean | 33.641173 | 5.904951 | 3.427899 | 4.605863 | 163.104235 | 2038.859935 |
| std | 12.805674 | 5.471485 | 4.120792 | 6.320242 | 161.3207 | 7659.763941 |
| min | 13.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 23.17 | 1.5 | 0.75 | 0.0 | 0.0 | 0.0 |
| 50% | 30.17 | 4.46 | 2.0 | 3.0 | 120.0 | 221.0 |
| 75% | 41.33 | 9.52 | 5.0 | 7.0 | 280.0 | 1209.0 |
| max | 76.75 | 28.0 | 28.5 | 67.0 | 840.0 | 100000.0 |


In [65]:
dis.describe()

| | 1 | 2 | 7 | 10 | 13 | 14 |
|---|---|---|---|---|---|---|
| count | 383.0 | 383.0 | 383.0 | 383.0 | 383.0 | 383.0 |
| mean | 29.753185 | 3.839948 | 1.257924 | 0.631854 | 200.375979 | 198.605744 |
| std | 10.856638 | 4.337662 | 2.120481 | 1.900049 | 181.301778 | 671.608839 |
| min | 15.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 22.0 | 0.835 | 0.125 | 0.0 | 100.0 | 0.0 |
| 50% | 27.33 | 2.21 | 0.415 | 0.0 | 168.0 | 1.0 |
| 75% | 34.83 | 5.0 | 1.5 | 0.0 | 278.0 | 67.0 |
| max | 80.25 | 26.335 | 13.875 | 20.0 | 2000.0 | 5552.0 |


In [66]:
from scipy import stats
import numpy as np



In [67]:
#perform ttest on the columns with the continuous data
cols = [1, 2, 7, 10, 13, 14]
for e in cols:
    print(e, '=', stats.ttest_ind(app[e], dis[e]))


1 = Ttest_indResult(statistic=4.314537665589307, pvalue=1.833817735205033e-05)
2 = Ttest_indResult(statistic=5.52998337614816, pvalue=4.551680702308068e-08)
7 = Ttest_indResult(statistic=8.935819983773698, pvalue=3.6710537401601785e-18)
10 = Ttest_indResult(statistic=11.667004222431277, pvalue=7.957718568079967e-29)
13 = Ttest_indResult(statistic=-2.8172805065288804, pvalue=0.004982145446164413)
14 = Ttest_indResult(statistic=4.680216020964486, pvalue=3.4520256956287944e-06)
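
`stats.ttest_ind` assumes equal variances by default. The `describe()` output above shows quite different standard deviations between the two classes for some columns (for column 10, about 6.32 versus 1.90), so Welch's variant (`equal_var=False`) may be the safer choice. The sketch below compares the two on synthetic data; the seed and the distribution parameters are illustrative assumptions, not values fitted to the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two classes: different means and different spreads.
group_a = rng.normal(loc=4.6, scale=6.3, size=307)
group_b = rng.normal(loc=0.6, scale=1.9, size=383)

pooled = stats.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

On the real data the same call would be `stats.ttest_ind(app[e], dis[e], equal_var=False)` inside the loop above.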


In [72]:
#look at the columns with non-continuous data
print(app.describe(exclude=np.number))
print('=='*28)
print(dis.describe(exclude=np.number))

         0    3    4    5    6    8    9    11   12   15
count   307  307  307  307  307  307  307  307  307  307
unique    2    3    3   14    9    2    2    2    3    1
top       b    u    g    c    v    t    t    f    g    +
freq    208  260  260   64  169  284  209  161  287  307
         0    3    4    5    6    8    9    11   12   15
count   383  383  383  383  383  383  383  383  383  383
unique    2    2    2   14    9    2    2    2    3    1
top       b    u    g    c    v    f    f    f    g    -
freq    267  264  264   75  234  306  297  213  338  383


In [73]:
#Let's look a little deeper: the 'top' values differ between the two classes
#in columns 8 and 9


In [75]:
col = [0, 3, 4, 5, 6, 8, 9, 11, 12]
for x in col:
    ct = pd.crosstab(df[15], df[x])
    #only `stats` was imported above (from scipy import stats), so call it through that name
    chi_sqr, p_value, dof, expected = stats.chi2_contingency(ct)
    print(x, ':  X^2 =', chi_sqr,'  P Value =', p_value)

0 :  X^2 = 0.22074557899756417   P Value = 0.6384724108987662
3 :  X^2 = 27.381959088907593   P Value = 1.1326171463492892e-06
4 :  X^2 = 27.381959088907593   P Value = 1.1326171463492892e-06
5 :  X^2 = 101.15724449221705   P Value = 9.895986843584472e-16
6 :  X^2 = 47.1346477758394   P Value = 1.445357273081106e-07
8 :  X^2 = 355.2038167412799   P Value = 3.1185900878457007e-79
9 :  X^2 = 143.06956205083145   P Value = 5.675727374527571e-33
11 :  X^2 = 0.568273300792113   P Value = 0.45094587758631943
12 :  X^2 = 9.191570451545383   P Value = 0.010094291370456362
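
To see what each pass of the loop is doing, the sketch below builds a single contingency table from a small hypothetical frame and unpacks the result. The `label` and `feature` column names and the counts are made up for illustration. Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default (`correction=True`).

```python
import pandas as pd
from scipy import stats

# Hypothetical data: 100 rows where the feature is strongly related to the label.
toy = pd.DataFrame({
    "label":   ["+"] * 50 + ["-"] * 50,
    "feature": ["t"] * 40 + ["f"] * 10 + ["t"] * 10 + ["f"] * 40,
})

table = pd.crosstab(toy["label"], toy["feature"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print("chi2 =", chi2, "p =", p_value, "dof =", dof)
print("expected counts under independence:")
print(expected)
```

The `expected` array is what the table would look like if the feature and the label were independent; the statistic grows as the observed counts move away from it.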


## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?
- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?
- What was the most challenging part of this sprint challenge?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.

## My answers to the questions

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?

A t-test is useful when you are examining data and looking for evidence of a significant difference between two population means. The t-value gives the size of the difference between the sample means relative to the variation in the sample data. So the greater the magnitude of T (that is, the greater abs(T)), the stronger the evidence against the null hypothesis that the means are equal. Conversely, the closer T is to 0, the weaker the evidence of a significant difference.
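
The relationship described above can be made concrete by computing the pooled two-sample t statistic by hand and comparing it with `stats.ttest_ind`. This is a sketch on made-up numbers; the `pooled_t` helper is hypothetical and uses the standard equal-variance formula.

```python
import numpy as np
from scipy import stats

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (the default in stats.ttest_ind)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: weighted average of the two unbiased variances.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two sample means.
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / std_err

x = [5.1, 4.8, 6.0, 5.5, 4.9]
y = [3.9, 4.2, 4.0, 4.4, 3.7, 4.1]
print(pooled_t(x, y), stats.ttest_ind(x, y).statistic)
```

The numerator is the difference in means and the denominator is the variation, so a large abs(T) means the gap between groups is large compared with the noise within them.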

To explain this, we will look at two different t-tests--one for df[10] and one for df[13].

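df[10] (shown below) has a very large T of 11.67. This indicates a significant difference between the class means: approved applicants average about 4.61 on this feature versus about 0.63 for rejected applicants (from the describe() output above). The p-value of roughly 8e-29 is far below 0.05, so we reject the null hypothesis that the means are equal. This shows a strong association with the approval decision, although the test alone does not show that the feature causes it.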
10 = Ttest_indResult(statistic=11.667004222431277, pvalue=7.957718568079967e-29)

df[13] (shown below) has a relatively small T of -2.817 and a much larger p-value (about 0.005) than df[10]. The negative sign means the approved group has the lower mean (about 163.1 versus 200.4). Even though the p-value is larger, it is still below 0.05, so we again reject the null hypothesis.
13 = Ttest_indResult(statistic=-2.8172805065288804, pvalue=0.004982145446164413)

- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?

df[8]
8 :  X^2 = 355.2038167412799   P Value = 3.1185900878457007e-79
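Column 8 shows a Chi-squared value of 355.20, which is very strong evidence that approval is not independent of this feature. The p-value of about 3.1e-79 is the probability of seeing a statistic at least this large if the feature and the class label were independent, so we reject the null hypothesis of independence.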

df[0]
0 :  X^2 = 0.22074557899756417   P Value = 0.6384724108987662
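Column 0 shows a Chi-squared value of 0.22, which is not strong enough to reject the null hypothesis. If the feature and the class label were independent, a statistic at least this large would occur about 64% of the time, so the data give no evidence of an association between this feature and approval.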

- What was the most challenging part of this sprint challenge?
I am still having some hangups with quickly and accurately determining the meaning of the statistics. I know the basic concepts...it's just the application that I am struggling with.
