# Data Science Unit 1 Sprint Challenge 2

## Exploring Data, Testing Hypotheses

In this sprint challenge you will look at a dataset of people being approved or rejected for credit.

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

Data Set Information: This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

Attribute Information:
- A1: b, a.
- A2: continuous.
- A3: continuous.
- A4: u, y, l, t.
- A5: g, p, gg.
- A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
- A7: v, h, bb, j, n, z, dd, ff, o.
- A8: continuous.
- A9: t, f.
- A10: t, f.
- A11: continuous.
- A12: t, f.
- A13: g, p, s.
- A14: continuous.
- A15: continuous.
- A16: +,- (class attribute)

Yes, most of that doesn't mean anything. A16 is a variable that indicates whether or not a person's request for credit has been approved or denied. This is a good candidate for a y variable since we might want to use the other features to predict this one. The remaining variables have been obfuscated for privacy - a challenge you may have to deal with in your data science career.

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- UCI says there should be missing data - check, and if necessary change the data so pandas recognizes it as na
- Make sure that the loaded features are of the types described above (continuous values should be treated as float), and correct as necessary

This is review, but skills that you'll use at the start of any data exploration. Further, you may have to do some investigation to figure out which file to load from - that is part of the puzzle.

Hint: If a column has the datatype of "object" even though it's made up of float or integer values, you can coerce it to act as a numeric column by using the `pd.to_numeric()` function.
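
A minimal sketch of the hint above, using a small made-up Series rather than the real file; the values and the names `raw` and `numeric` are illustrative assumptions. Passing `errors='coerce'` turns anything that cannot be parsed (such as a `?` placeholder) into NaN instead of raising.

```python
import pandas as pd

# Hypothetical object-typed column, with '?' standing in for a missing value.
raw = pd.Series(['30.83', '58.67', '?', '24.50'], dtype=object, name='A2')

# errors='coerce' converts unparseable entries to NaN instead of raising.
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)           # a float dtype
print(numeric.isnull().sum())  # number of entries that became NaN
```

If the placeholder is known up front, `pd.read_csv(..., na_values='?')` is another option that handles the conversion at load time.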

In [227]:
import pandas as pd
import numpy as np

names=['A1'
      ,'A2'
      ,'A3'
      ,'A4'
      ,'A5'
      ,'A6'
      ,'A7'
      ,'A8'
      ,'A9'
      ,'A10'
      ,'A11'
      ,'A12'
      ,'A13'
      ,'A14'
      ,'A15'
      ,'A16'
      ]

df        = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data'
                        ,header=None
                        ,names=names)
# np.NaN was removed in NumPy 2.0; np.nan is the supported spelling.
df        = df.replace({'?': np.nan})
# .map() returns real booleans without relying on replace() downcasting.
df['A16'] = df['A16'].map({'+': True, '-': False})
df['A2']  = pd.to_numeric(df['A2'])
df['A14'] = pd.to_numeric(df['A14'])
df['A9']  = df['A9'].map({'t': True, 'f': False})
df['A10'] = df['A10'].map({'t': True, 'f': False})
df['A12'] = df['A12'].map({'t': True, 'f': False})

print(df.shape)
df.head()
# UCI's count of 15 attributes presumably excludes the class label;
# with A16 included the file has 16 columns.

(690, 16)


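| | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | True | True | 1 | False | g | 202.0 | 0 | True |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | True | True | 6 | False | g | 43.0 | 560 | True |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | True | False | 0 | False | g | 280.0 | 824 | True |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | True | True | 5 | True | g | 100.0 | 3 | True |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | True | False | 0 | False | s | 120.0 | 0 | True |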


In [228]:
df.dtypes
# Types match the description, except that A11 and A15 load as int64 rather than float

A1      object
A2     float64
A3     float64
A4      object
A5      object
A6      object
A7      object
A8     float64
A9        bool
A10       bool
A11      int64
A12       bool
A13     object
A14    float64
A15      int64
A16       bool
dtype: object

In [229]:
df.isnull().sum()
# Correct number of nulls

A1     12
A2     12
A3      0
A4      6
A5      6
A6      9
A7      9
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64

## Part 2 - Exploring data, Testing hypotheses

The only thing we really know about this data is that A16 is the class label. Besides that, we have 6 continuous (float) features and 9 categorical features.

Explore the data: you can use whatever approach (tables, utility functions, visualizations) to get an impression of the distributions and relationships of the variables. In general, your goal is to understand how the features are different when grouped by the two class labels (`+` and `-`).

For the 6 continuous features, how are they different when split between the two class labels? Choose two features to run t-tests (again split by class label) - specifically, select one feature that is *extremely* different between the classes, and another feature that is notably less different (though perhaps still "statistically significantly" different). You may have to explore more than two features to do this.
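
One way to compare several continuous features at once is to loop over them and collect the t-test results. The sketch below uses synthetic stand-in data, not the credit file; the frame `toy` and the column names `big_gap` and `small_gap` are hypothetical and chosen only to show one strongly separated feature next to a weakly separated one.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in data (an assumption, not the credit file).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'label': [True] * 50 + [False] * 50,
    # Group means 5 vs 0: a very large separation.
    'big_gap': np.concatenate([rng.normal(5, 1, 50), rng.normal(0, 1, 50)]),
    # Group means 0.3 vs 0: a much smaller separation.
    'small_gap': np.concatenate([rng.normal(0.3, 1, 50), rng.normal(0, 1, 50)]),
})

approved = toy[toy['label']]
denied = toy[~toy['label']]

# Run the same two-sample t-test on every feature and keep the results.
results = {}
for col in ['big_gap', 'small_gap']:
    stat, p = ttest_ind(approved[col], denied[col], nan_policy='omit')
    results[col] = (stat, p)
    print(col, stat, p)
```

Sorting `results` by p-value gives a quick ranking from most to least different between the classes.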

For the categorical features, explore by creating "cross tabs" (aka [contingency tables](https://en.wikipedia.org/wiki/Contingency_table)) between them and the class label, and apply the Chi-squared test to them. [pandas.crosstab](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) can create contingency tables, and [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) can calculate the Chi-squared statistic for them.
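
As a rough illustration of how the two functions fit together, here is a sketch on a tiny made-up table; the frame `toy` and its columns `feature` and `label` are assumptions for the example only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: ten rows, one categorical feature and one class label.
toy = pd.DataFrame({
    'feature': ['a', 'a', 'a', 'b', 'b', 'b', 'a', 'b', 'a', 'b'],
    'label':   ['+', '+', '+', '-', '-', '-', '+', '-', '-', '+'],
})

# Rows are class labels, columns are feature values, cells are counts.
table = pd.crosstab(toy['label'], toy['feature'])
print(table)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the counts expected if label and feature were independent.
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)
print(expected)
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default; pass `correction=False` to get the uncorrected statistic.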

There are 9 categorical features - as with the t-test, try to find one where the Chi-squared test returns an extreme result (rejecting the null that the data are independent), and one where it is less extreme.

**NOTE** - "less extreme" just means smaller test statistic/larger p-value. Even the least extreme differences may be strongly statistically significant.

Your *main* goal is the hypothesis tests, so don't spend too much time on the exploration/visualization piece. That is just a means to an end - use simple visualizations, such as boxplots or a scatter matrix (both built in to pandas), to get a feel for the overall distribution of the variables.

This is challenging, so manage your time and aim for a baseline of at least running two t-tests and two Chi-squared tests before polishing. And don't forget to answer the questions in part 3, even if your results in this part aren't what you want them to be.

### Setup

In [230]:
# In general, your goal is to understand how the features are different when grouped by the two class labels (`+` and `-`).

# For the 6 continuous features, how are they different when split between the two class labels?
# Choose two features to run t-tests split by class label
# Select one feature that is *extremely* different between the classes
# Select one feature that is notably less different (but still "statistically significantly" different).
# You may have to explore more than two features to do this.

pos = df[df['A16'] == True]
neg = df[df['A16'] == False]

pos.head()

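| | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | True | True | 1 | False | g | 202.0 | 0 | True |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | True | True | 6 | False | g | 43.0 | 560 | True |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | True | False | 0 | False | g | 280.0 | 824 | True |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | True | True | 5 | True | g | 100.0 | 3 | True |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | True | False | 0 | False | s | 120.0 | 0 | True |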


In [231]:
neg.head()

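| | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 70 | b | 32.33 | 7.5 | u | g | e | bb | 1.585 | True | False | 0 | True | s | 420.0 | 0 | False |
| 71 | b | 34.83 | 4.0 | u | g | d | bb | 12.5 | True | False | 0 | True | g | NaN | 0 | False |
| 72 | a | 38.58 | 5.0 | u | g | cc | v | 13.5 | True | False | 0 | True | g | 980.0 | 0 | False |
| 73 | b | 44.25 | 0.5 | u | g | m | v | 10.75 | True | False | 0 | False | s | 400.0 | 0 | False |
| 74 | b | 44.83 | 7.0 | y | p | c | v | 1.625 | False | False | 0 | False | g | 160.0 | 2 | False |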


In [232]:
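# numeric_only=True skips the object columns, which newer pandas refuses to average.
neg.dropna().mean(numeric_only=True) > pos.dropna().mean(numeric_only=True)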
# I'd expect certain numbers to be worse for the negative side.
# Maybe any Trues tell me something?

A2     False
A3     False
A8     False
A9     False
A10    False
A11    False
A12    False
A14     True
A15    False
A16    False
dtype: bool
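
A boolean comparison only says which group has the larger mean, not by how much. A possible alternative, sketched here on a made-up frame (`toy`, `x`, and `y` are hypothetical names), is to group by the class label and look at the means side by side.

```python
import pandas as pd

# Made-up data: two rows per class label.
toy = pd.DataFrame({
    'label': ['+', '+', '-', '-'],
    'x':     [1.0, 3.0, 10.0, 20.0],
    'y':     [5.0, 7.0, 2.0, 4.0],
})

# One row per class label, one column per numeric feature.
means = toy.groupby('label').mean(numeric_only=True)
print(means)

# Positive-class mean minus negative-class mean for each feature.
diff = means.loc['+'] - means.loc['-']
print(diff)
```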

### For the 6 continuous features, how are they different when split between the two class labels?

In [233]:
# Continuous
# A2
# A3
# A8
# A11
# A14
# A15


column = 'A14'

print('Positive Credit'                                                   ,'\n'
     ,'Mean: '               ,pos[column].dropna().mean()                 ,'\n'
     ,'Median:'              ,pos[column].dropna().median()               ,'\n'
     ,'Sum: '                ,pos[column].dropna().sum()                  ,'\n'
     ,'Value Counts:'
     ,'\n'                   ,pos[column].dropna().value_counts().head()  ,'\n'
     )

print('Negative Credit'                                                   ,'\n'
     ,'Mean: '               ,neg[column].dropna().mean()                 ,'\n'
     ,'Median:'              ,neg[column].dropna().median()               ,'\n'
     ,'Sum: '                ,neg[column].dropna().sum()                  ,'\n'
     ,'Value Counts:'
     ,'\n'                   ,neg[column].dropna().value_counts().head()  ,'\n'
     )

# My guess for 14 is that it's somehow related to outstanding debt?
# I don't think this is helping

Positive Credit 
 Mean:  164.421926910299 
 Median: 120.0 
 Sum:  49491.0 
 Value Counts: 
 0.0      81
80.0     16
100.0    14
120.0    12
160.0     8
Name: A14, dtype: int64 

Negative Credit 
 Mean:  199.6994680851064 
 Median: 167.5 
 Sum:  75087.0 
 Value Counts: 
 0.0      51
200.0    27
160.0    26
120.0    23
280.0    17
Name: A14, dtype: int64 



## Choose two features to run t-tests split by class label

In [0]:
from scipy.stats import ttest_ind

### Select one feature that is *extremely* different between the classes

In [235]:
column = 'A11'
stat, p = ttest_ind(pos[column]
                   ,neg[column]
                   ,nan_policy='omit'
                   )
print(stat, p)

# They are indeed not the same.
# The p-value is the smallest of all six continuous features.
# The difference in means is 11+ standard errors away from zero.

11.667004222431277 7.957718568079967e-29


### Select one feature that is notably less different

In [236]:
column = 'A14'
stat, p = ttest_ind(pos[column]
                   ,neg[column]
                   ,nan_policy='omit'
                   )
print(stat, p)

# They are indeed not the same.
# The p-value is the largest (closest to one) of all six continuous features.
# The difference in means is only about 2-3 standard errors away from zero.

-2.6358251986645476 0.008586135473979569
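
The t statistic printed by `ttest_ind` is the difference in group means divided by the standard error of that difference, so it counts standard errors rather than standard deviations. A minimal sketch on made-up numbers (the arrays `a` and `b` are assumptions), checking a hand computation of the pooled-variance statistic against scipy's default:

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up samples for two groups.
a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 2.0, 3.0, 4.0])

n_a, n_b = len(a), len(b)

# Pooled sample variance (ttest_ind assumes equal variances by default).
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# Standard error of the difference in means.
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_manual = (a.mean() - b.mean()) / std_err
t_scipy, p = ttest_ind(a, b)

print(t_manual, t_scipy, p)
```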


## There are 9 categorical features. Create "cross tabs" between them
Apply the Chi-squared test to them.

In [0]:
from scipy.stats import chi2_contingency as chi2c


# Categorical
# A1:	b, a
# A4:	u, y, l, t                                           (t does not appear)
# A5:	g, p, gg
# A6:	c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff
# A7:	v, h, bb, j, n, z, dd, ff, o
# A9:	t, f
# A10:	t, f
# A12:	t, f
# A13:	g, p, s

### Chi-squared test returns an extreme result

In [239]:
chi2, p, dof, expected = chi2c(pd.crosstab(df['A16']
                                          ,df['A9']
                                          )
                              )

print('Chi^2: '                 ,chi2, '\n'
     ,'P Value: '               ,p, '\n'
     ,'Degrees of Freedom: '    ,dof, '\n'
     ,'Expected Values: ', '\n' ,expected, '\n'
     )

Chi^2:  355.2038167412799 
 P Value:  3.1185900878457007e-79 
 Degrees of Freedom:  1 
 Expected Values:  
 [[182.61884058 200.38115942]
 [146.38115942 160.61884058]] 
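
The expected values come from the table margins alone: each cell is its row total times its column total, divided by the grand total. The sketch below uses margins recovered by summing the rows and columns of the expected table printed above (383 and 307 for the two A16 classes, 329 and 361 for the two A9 values); the variable names are illustrative.

```python
import numpy as np

# Margins recovered from the expected table above (rows: A16, columns: A9).
row_totals = np.array([383, 307])
col_totals = np.array([329, 361])
n = row_totals.sum()

# Expected count under independence: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / n
print(expected)
```

The further the observed counts sit from these values, the larger the Chi-squared statistic.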



### Less extreme

In [240]:
chi2, p, dof, expected = chi2c(pd.crosstab(df['A16']
                                          ,df['A1']
                                          )
                              )

print('Chi^2: '                 ,chi2      ,'\n'
     ,'P Value: '               ,p         ,'\n'
     ,'Degrees of Freedom: '    ,dof       ,'\n'
     ,'Expected Values: '
     ,'\n'                      ,expected  ,'\n'
     )

# Not sure how much less extreme you want, so I'll note:
# A13's p value is 0.010094291370456372

Chi^2:  0.3112832649161994 
 P Value:  0.5768937883001118 
 Degrees of Freedom:  1 
 Expected Values:  
 [[115.84070796 258.15929204]
 [ 94.15929204 209.84070796]] 
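
To find the most and least extreme features without rerunning the cell by hand, the same test can be looped over every categorical column and ranked by p-value. This is a sketch on synthetic data, not the credit file; `toy`, `tied`, and `noise` are hypothetical names, with `tied` built to follow the label exactly and `noise` drawn independently of it.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in data (an assumption, not the credit file).
rng = np.random.default_rng(1)
label = rng.integers(0, 2, 200).astype(bool)
toy = pd.DataFrame({
    'label': label,
    'tied': np.where(label, 'x', 'y'),          # follows the label exactly
    'noise': rng.choice(['g', 'p', 's'], 200),  # drawn independently of the label
})

# Chi-squared test of each categorical column against the class label.
p_values = {}
for col in ['tied', 'noise']:
    table = pd.crosstab(toy['label'], toy[col])
    chi2, p, dof, expected = chi2_contingency(table)
    p_values[col] = p

# Smallest p-value first: the most extreme result is at the top.
ranked = pd.Series(p_values).sort_values()
print(ranked)
```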



## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?
- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?
- What was the most challenging part of this sprint challenge?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.

### T-Testing:

My best guess from the extreme t-test on A11 is that this variable is a strong deciding factor in whether your credit application is approved or denied. The t-test on its own only says that the group means differ, but the summary code from earlier shows that most of the negatives had small values, with a median around 0 and a sum of only 242, compared with a median of about 3 and a sum of 1414 for the positives. There's simply more of whatever A11 measures among the positives. Since these are all numbers, my guess is savings or a similar account. For all I know this is stocks and bonds.

For the less extreme t-test on A14 we still have some difference, but with a t statistic of about -2.6 it sits much closer to the usual significance threshold. The sums of both groups are actually relatively close, and the other summary statistics show this too. If I had to guess, I'd say this is spending habits; from what I'm told, people who have little debt have built habits that allow them to spend less in general. Sure, part of all this is still luck of the draw, but being able to control your own habits is a good first step. Then again, this could be total debt, and the positives could simply be earning more money and paying off a bigger chunk, while leaving this ~4k debt lingering in their account purely to bump their credit score. This is the problem with guessing at data you don't understand.

### Chi^2 Testing:

I'm not certain I even got all the data you need to test me on this, but at this point I think I've done all I can to get the answers outside of additional stretch goals. Well, aside from visualizations anyway.

Most of the p-values were low, though a few of them were extreme and a couple were marginal. None of them were actually next to one though, which makes sense given this data is meant to determine something about your credit approval; if we saw any approaching one, why would it be in the data in the first place? I suppose A1 is the closest example of that, as its p-value was nearly 0.6, meaning we cannot reject the null hypothesis that A1 and the class label are independent. Its use in determining credit approval has, or at least appears to have, a lower significance than the other variables. I think this might be one of those 'niceties' that credit companies would like to have from you. It may not be that important, and if it has an effect, it is too slight for this test to detect. Existence of emergency account funds maybe?

On the other hand, A9 got by far the lowest p-value of all the tests. I only just now realized that the reason its values are t or f is to say True or False. I'll fix that if I have time. Regardless, it seems having this thing or not is of extreme importance.

### Most challenging part:

Probably still understanding how to get the t-test and Chi^2 to work for me, but getting the data to make any kind of sense without meaningful headers was also a really big issue. I'm still not sure I know what's going on with it, but at the same time I think it could be an example of how one needs to handle data regardless.

For example, what if we had a look at what these columns were after the fact? Would I change my mind on how all this is working? What if instead of credit you'd just replaced the text document personally and made it say positive HIV tests? Kitten adoption rates? It may be that relying on what the headers and the data say makes too large an impact, when I should be able to just look at the data and know with relative certainty that *something* is having an effect and what effect that is.