
# Data Science Unit 1 Sprint Challenge 4

## Exploring Data, Testing Hypotheses

In this sprint challenge you will look at a dataset of people being approved or rejected for credit.

https://archive.ics.uci.edu/ml/datasets/Credit+Approval

Data Set Information: This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data. This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.

Attribute Information:
- A1: b, a.
- A2: continuous.
- A3: continuous.
- A4: u, y, l, t.
- A5: g, p, gg.
- A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
- A7: v, h, bb, j, n, z, dd, ff, o.
- A8: continuous.
- A9: t, f.
- A10: t, f.
- A11: continuous.
- A12: t, f.
- A13: g, p, s.
- A14: continuous.
- A15: continuous.
- A16: +,- (class attribute)

Yes, most of that doesn't mean anything. A16 (the class attribute) is the most interesting, as it separates the 307 approved cases from the 383 rejected cases. The remaining variables have been obfuscated for privacy - a challenge you may have to deal with in your data science career.

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- UCI says there should be missing data - check, and if necessary change the data so pandas recognizes it as na
- Make sure that the loaded features are of the types described above (continuous values should be treated as float), and correct as necessary

This is review, but skills that you'll use at the start of any data exploration. Further, you may have to do some investigation to figure out which file to load from - that is part of the puzzle.

In [0]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
from scipy import stats
import seaborn as sns 

In [109]:
#raw file name was changed from .data to .csv to make it easier to import our data using the function "pd.read_csv"
from google.colab import files
uploaded = files.upload()

Saving crxdata.csv to crxdata (2).csv


In [189]:
#loading data, renaming with correct numbering and labeling attribute 16 as "Class Label" since it is the only attribute we know 
df = pd.read_csv("crxdata.csv", header=None, names=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, "Class Label"])
df.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,Class Label
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [190]:
#inspecting our df to see if we have the appropriate number of observations
df.describe(exclude=np.number).head()

Unnamed: 0,1,2,4,5,6,7,9,10,12,13,14,Class Label
count,690,690,690,690,690,690,690,690,690,690,690,690
unique,3,350,4,4,15,10,2,2,2,3,171,2
top,b,?,u,g,c,v,t,f,f,g,0,-
freq,468,12,519,519,137,399,361,395,374,625,132,383


In [191]:
#inspecting our df to see if we have the appropriate number of observations
df.describe(include=np.number).head()

Unnamed: 0,3,8,11,15
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0


In [192]:
#checking for NaNs, cross-checking against the UCI documentation
#missing values appear to be encoded as "?"; we will check some more attributes to be certain
df[1].sort_values().value_counts().head()

b    468
a    210
?     12
Name: 1, dtype: int64

In [193]:
#confirming our initial intuition  
df[2].sort_values().value_counts().head()

?        12
22.67     9
20.42     7
23.58     6
24.50     6
Name: 2, dtype: int64

In [115]:
#attribute 4 also contains "?", confirming that missing values are encoded as "?", consistent with the UCI documentation
df[4].sort_values().value_counts().head()

u    519
y    163
?      6
l      2
Name: 4, dtype: int64

In [0]:
#converting "?" to NaNs
df = df.replace({"?": np.nan})
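
An alternative to replacing "?" after loading is to let `pd.read_csv` treat it as missing through its `na_values` parameter. The following is a minimal sketch, assuming a tiny in-memory sample with made-up rows rather than the real `crxdata.csv`; the names `raw`, `sample`, and `missing_per_column` are illustrative only.

```python
import io

import pandas as pd

# Three made-up rows in the same comma-separated layout as the raw file
# (illustrative values only, not copied from the dataset).
raw = io.StringIO(
    "b,30.83,0.0,+\n"
    "a,?,4.46,+\n"
    "?,24.5,0.5,-\n"
)

# na_values="?" tells pandas to parse "?" as NaN while reading.
sample = pd.read_csv(raw, header=None, names=[1, 2, 3, "Class Label"], na_values="?")

# Count missing values per column, as done with df.isna().sum() in the notebook.
missing_per_column = sample.isna().sum()
print(missing_per_column)
print(sample.dtypes)
```

With this approach a numeric column that contains "?" is parsed as float at load time, so a later `astype` call may not be needed for that column.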

In [195]:
#let's confirm if we still get value_counts for "?"; seems correct
df[4].sort_values().value_counts().head()

u    519
y    163
l      2
Name: 4, dtype: int64

In [196]:
# inspecting NaNs; note: these counts match the UCI dataset documentation; running this sum before the replacement would yield 0, since missing values were encoded as "?"
df.isna().sum()


1              12
2              12
3               0
4               6
5               6
6               9
7               9
8               0
9               0
10              0
11              0
12              0
13              0
14             13
15              0
Class Label     0
dtype: int64

In [0]:
#dropping NaNs
df = df.dropna()

In [0]:
#attributes [2, 3, 8, 11, 14, 15] are continuous and should be floats, so we convert them
df = df.astype({2:"float64", 3:"float64", 8: "float64", 11: "float64", 14: "float64", 15:"float64"})

In [202]:
#confirming we converted our numeric variables to floats
df.dtypes

1               object
2              float64
3              float64
4               object
5               object
6               object
7               object
8              float64
9               object
10              object
11             float64
12              object
13              object
14             float64
15             float64
Class Label     object
dtype: object

## Part 2 - Exploring data, Testing hypotheses

The only thing we really know about this data is that A16 is the class label. Besides that, we have 6 continuous (float) features and 9 categorical features.

Explore the data: you can use whatever approach (tables, utility functions, visualizations) to get an impression of the distributions and relationships of the variables. In general, your goal is to understand how the features are different when grouped by the two class labels (`+` and `-`).

For the 6 continuous features, how are they different when split between the two class labels? Choose two features to run t-tests (again split by class label) - specifically, select one feature that is *extremely* different between the classes, and another feature that is notably less different (though perhaps still "statistically significantly" different). You may have to explore more than two features to do this.

For the categorical features, explore by creating "cross tabs" (aka [contingency tables](https://en.wikipedia.org/wiki/Contingency_table)) between them and the class label, and apply the Chi-squared test to them. [pandas.crosstab](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) can create contingency tables, and [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) can calculate the Chi-squared statistic for them.

There are 9 categorical features - as with the t-test, try to find one where the Chi-squared test returns an extreme result (rejecting the null that the data are independent), and one where it is less extreme.

**NOTE** - "less extreme" just means smaller test statistic/larger p-value. Even the least extreme differences may be strongly statistically significant.

Your *main* goal is the hypothesis tests, so don't spend too much time on the exploration/visualization piece. That is just a means to an end - use simple visualizations, such as boxplots or a scatter matrix (both built in to pandas), to get a feel for the overall distribution of the variables.

This is challenging, so manage your time and aim for a baseline of at least running two t-tests and two Chi-squared tests before polishing. And don't forget to answer the questions in part 3, even if your results in this part aren't what you want them to be.

In [0]:
#splitting the df by Class Label into two frames, "minus" (rejected) and "plus" (approved)
minus = df[df["Class Label"] == "-"]
plus = df[df["Class Label"] == "+"]

In [151]:
#class means of our 6 continuous variables, useful as a sanity check on the sign and size of our t-statistics
#expressed as the percentage difference relative to the "minus" mean
#numeric_only=True skips the categorical columns (newer pandas versions raise an error without it)
minus_mean = minus.mean(numeric_only=True)
plus_mean = plus.mean(numeric_only=True)
mean_change = (minus_mean - plus_mean) / minus_mean
mean_percent_change = mean_change * 100
mean_percent_change


2     -14.488753
3     -53.823873
8    -183.984099
11   -607.432432
14     14.884182
15   -969.146747
dtype: float64
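
The same per-class comparison can also be written with `groupby`, which avoids building the two frames by hand. This is a sketch on made-up numbers, not the notebook's data; `toy`, `class_means`, and `percent_change` are hypothetical names.

```python
import pandas as pd

# Made-up values standing in for two continuous attributes.
toy = pd.DataFrame({
    "Class Label": ["+", "+", "-", "-"],
    8: [3.0, 5.0, 1.0, 1.0],
    11: [4.0, 6.0, 0.0, 2.0],
})

# One row per class, one column per numeric attribute.
class_means = toy.groupby("Class Label").mean(numeric_only=True)

# Same formula as the notebook cell: (minus - plus) / minus * 100.
percent_change = (class_means.loc["-"] - class_means.loc["+"]) / class_means.loc["-"] * 100
print(class_means)
print(percent_change)
```

A negative percentage means the "+" class has the larger mean, which should agree with a negative t-statistic when "minus" is passed as the first sample.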

In [131]:
#we run six two-sample t-tests, one per continuous attribute, comparing the "minus" and "plus" classes
#note: the results are interpreted in Part 3 of this notebook
stats.ttest_ind(minus[2], plus[2], nan_policy='omit')

Ttest_indResult(statistic=-4.675662433767456, pvalue=3.5636710933835225e-06)

In [132]:
stats.ttest_ind(minus[3], plus[3], nan_policy='omit')

Ttest_indResult(statistic=-5.400813416576192, pvalue=9.310154396147606e-08)

In [133]:
stats.ttest_ind(minus[8], plus[8], nan_policy='omit')

Ttest_indResult(statistic=-9.002392498622463, pvalue=2.4079238505859142e-18)

In [134]:
stats.ttest_ind(minus[11], plus[11], nan_policy='omit')

Ttest_indResult(statistic=-11.336964562239809, pvalue=2.5864649620278843e-27)

In [136]:
stats.ttest_ind(minus[14], plus[14], nan_policy='omit')

Ttest_indResult(statistic=2.18221936438708, pvalue=0.029450100409286473)

In [137]:
stats.ttest_ind(minus[15], plus[15], nan_policy='omit')

Ttest_indResult(statistic=-4.475369764700449, pvalue=9.003915641872878e-06)
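
The six calls above follow one pattern, so they could be collected in a loop and ranked by p-value. The sketch below assumes a small made-up frame with the same column layout; `toy`, `plus_rows`, `minus_rows`, and `ttest_table` are illustrative names, and the numbers are not from the dataset.

```python
import pandas as pd
from scipy import stats

# Made-up stand-in for the cleaned frame; only the column layout mirrors the notebook.
toy = pd.DataFrame({
    "Class Label": ["+"] * 5 + ["-"] * 5,
    8: [3.1, 2.8, 4.0, 3.5, 2.9, 0.4, 0.9, 0.2, 1.1, 0.6],
    14: [100.0, 120.0, 80.0, 150.0, 110.0, 105.0, 125.0, 90.0, 140.0, 115.0],
})
plus_rows = toy[toy["Class Label"] == "+"]
minus_rows = toy[toy["Class Label"] == "-"]

rows = []
for col in [8, 14]:
    # Same argument order as the notebook: "minus" first, "plus" second.
    t_stat, p_val = stats.ttest_ind(minus_rows[col], plus_rows[col], nan_policy="omit")
    rows.append({"attribute": col, "t": t_stat, "p": p_val})

# Sort so the most extreme differences come first.
ttest_table = pd.DataFrame(rows).sort_values("p").reset_index(drop=True)
print(ttest_table)
```

Ranking the results this way makes it easier to pick one extreme and one less extreme attribute for the write-up.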

In [0]:
#getting a feel for our data in order to better understand which categories to choose for our Chi-squared test
#g = sns.pairplot(df2, kind='reg', plot_kws={'line_kws':{'color':'orange'}, 'scatter_kws': {'alpha': 0.1}})

In [158]:
#we create a contingency table of Class Label against a categorical attribute
#we compute the chi-squared statistic for each of the 9 categorical variables
#to see whether there is a dependency between our Class Labels and the other categorical attributes
contingency1 = pd.crosstab(df["Class Label"], df[1])
contingency1

1,a,b
Class Label,Unnamed: 1_level_1,Unnamed: 2_level_1
+,95,201
-,108,249


In [163]:
#for the following 8 computations we will not display the contingency table, for the sake of brevity
#note: the results are interpreted in Part 3 of this notebook
chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency1)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 0.17764082160253514
P-value: 0.6734085695133722
Degrees of Freedom: 1
Expected: 
 [[ 92.01837672 203.98162328]
 [110.98162328 246.01837672]]
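
To see where the "Expected" array and the statistic come from, the calculation can be reproduced by hand from the observed counts in the attribute 1 contingency table. This sketch assumes the default behaviour of `chi2_contingency`, which applies Yates' continuity correction to 2x2 tables; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the attribute 1 contingency table
# (rows: "+", "-"; columns: a, b).
observed = np.array([[95, 201],
                     [108, 249]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence: row total * column total / n.
expected_manual = row_totals * col_totals / n

# Yates' continuity correction: shrink each |observed - expected| by 0.5.
corrected = np.abs(observed - expected_manual) - 0.5
chi2_manual = (corrected ** 2 / expected_manual).sum()

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
print(expected_manual)
print(chi2_manual, chi2_scipy)
```

The degrees of freedom are (rows - 1) * (columns - 1), which is 1 for a 2x2 table.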


In [174]:
contingency4 = pd.crosstab(df["Class Label"], df[4])
contingency4

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency4)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 21.78325079317282
P-value: 1.8613463470618034e-05
Degrees of Freedom: 2
Expected: 
 [[  0.90658499 226.19295559  68.90045942]
 [  1.09341501 272.80704441  83.09954058]]


In [167]:
contingency5 = pd.crosstab(df["Class Label"], df[5])
contingency5

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency5)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 21.78325079317282
P-value: 1.8613463470618034e-05
Degrees of Freedom: 2
Expected: 
 [[226.19295559   0.90658499  68.90045942]
 [272.80704441   1.09341501  83.09954058]]


In [168]:
contingency6 = pd.crosstab(df["Class Label"], df[6])
contingency6

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency6)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 89.76481160702343
P-value: 1.5500154549498966e-13
Degrees of Freedom: 13
Expected: 
 [[23.5712098  60.28790199 18.13169985 11.7856049  10.87901991 22.66462481
  24.93108729  4.53292496 21.75803982 17.22511485 33.99693721  1.35987749
  28.55742726 16.31852986]
 [28.4287902  72.71209801 21.86830015 14.2143951  13.12098009 27.33537519
  30.06891271  5.46707504 26.24196018 20.77488515 41.00306279  1.64012251
  34.44257274 19.68147014]]


In [169]:
contingency7 = pd.crosstab(df["Class Label"], df[7])
contingency7

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency7)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 42.988254470828515
P-value: 8.829142688919391e-07
Degrees of Freedom: 8
Expected: 
 [[ 24.0245023    2.71975498  24.47779479  62.10107198   3.62633997
    1.81316998   0.90658499 172.70444104   3.62633997]
 [ 28.9754977    3.28024502  29.52220521  74.89892802   4.37366003
    2.18683002   1.09341501 208.29555896   4.37366003]]


In [170]:
contingency9 = pd.crosstab(df["Class Label"], df[9])
contingency9

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency9)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 353.4827159410316
P-value: 7.391616628555818e-79
Degrees of Freedom: 1
Expected: 
 [[137.80091884 158.19908116]
 [166.19908116 190.80091884]]


In [171]:
contingency10 = pd.crosstab(df["Class Label"], df[10])
contingency10

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency10)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 131.50867232095965
P-value: 1.9163536191857147e-30
Degrees of Freedom: 1
Expected: 
 [[165.9050536 130.0949464]
 [200.0949464 156.9050536]]


In [173]:
contingency12 = pd.crosstab(df["Class Label"], df[12])
contingency12

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency12)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 1.4379377134356208
P-value: 0.23047335495661603
Degrees of Freedom: 1
Expected: 
 [[159.10566616 136.89433384]
 [191.89433384 165.10566616]]


In [172]:
contingency13 = pd.crosstab(df["Class Label"], df[13])
contingency13

chi_squared, p_value, dof, expected = stats.chi2_contingency(contingency13)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 
print("Expected: \n", np.array(expected))

Chi-Squared: 6.756491933104269
P-value: 0.03410722751542202
Degrees of Freedom: 2
Expected: 
 [[271.06891271   0.90658499  24.0245023 ]
 [326.93108729   1.09341501  28.9754977 ]]
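
The nine chi-squared cells repeat the same steps, so they could be folded into a loop that also records the smallest expected count per table. Several of the expected arrays above contain cells below 5 (for example 0.90658499), and the chi-squared approximation is commonly considered less reliable in that situation. The sketch uses made-up categorical data; `toy` and `chi2_table` are hypothetical names.

```python
import pandas as pd
from scipy import stats

# Made-up stand-in with two categorical attributes and the class label.
toy = pd.DataFrame({
    "Class Label": ["+"] * 6 + ["-"] * 6,
    9: ["t", "t", "t", "t", "t", "f", "f", "f", "f", "f", "f", "t"],
    12: ["t", "f", "t", "f", "t", "f", "t", "f", "t", "f", "t", "f"],
})

rows = []
for col in [9, 12]:
    table = pd.crosstab(toy["Class Label"], toy[col])
    chi2, p_val, dof, expected = stats.chi2_contingency(table)
    rows.append({
        "attribute": col,
        "chi2": chi2,
        "p": p_val,
        "dof": dof,
        # Small expected counts are a warning sign for the approximation.
        "min_expected": expected.min(),
    })

# Largest statistic first, so the most extreme result is at the top.
chi2_table = pd.DataFrame(rows).sort_values("chi2", ascending=False).reset_index(drop=True)
print(chi2_table)
```

Sorting by the statistic gives a quick way to pick one extreme and one less extreme categorical attribute.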


## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?
- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?
- What was the most challenging part of this sprint challenge?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.

**1)** We use a two-sample t-test because we are comparing the means of two independent groups, the "+" and "-" classes. All six continuous attributes gave p < 0.05, so in every case we reject the null hypothesis that the two classes have equal means.

The two attributes we selected are 11 and 14, which sit at opposite ends of the range of results. Attribute 11 has a t-statistic of -11.34 and a p-value of 2.59e-27, while attribute 14 has a t-statistic of 2.18 and a p-value of 0.029.

Although we reject the null hypothesis in both cases, the evidence is far stronger for attribute 11, whose p-value is many orders of magnitude smaller. The negative t-statistic there tells us that the mean of the "minus" class is lower than that of the "plus" class, because "minus" was passed as the first sample. For attribute 14 the opposite holds: the mean of the "minus" class is greater than that of the "plus" class.
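
The sign of the t-statistic depends only on the order in which the two samples are passed to `stats.ttest_ind`. A small sketch with made-up arrays (`lower_group` and `higher_group` are hypothetical names) illustrates this:

```python
import numpy as np
from scipy import stats

# Two made-up samples; the first has the smaller mean.
lower_group = np.array([1.0, 2.0, 3.0, 2.0])
higher_group = np.array([5.0, 6.0, 7.0, 6.0])

# First sample has the lower mean, so the statistic is negative.
t_first_lower, p_first = stats.ttest_ind(lower_group, higher_group)

# Swapping the arguments flips the sign but leaves the p-value unchanged.
t_swapped, p_swapped = stats.ttest_ind(higher_group, lower_group)

print(t_first_lower, t_swapped)
print(p_first, p_swapped)
```

The two-sided p-value is the same either way, so the sign carries the direction of the difference and the p-value carries the strength of the evidence.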

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**2)** Among the chi-squared tests, attribute 9 has the highest statistic (353.48) and the lowest p-value (7.39e-79), so we reject the null hypothesis that the Class Labels and categorical attribute 9 are independent of one another. Rejecting independence indicates a statistical association between the two; it does not by itself show that attribute 9 causes approval or rejection.

At the other end, attribute 1 has a chi-squared statistic of 0.178 with a p-value of 0.67, and attribute 12 has a chi-squared statistic of 1.44 with a p-value of 0.23. With p > 0.05 in both cases, we fail to reject the null hypothesis that the Class Labels are independent of attributes 1 and 12. This does not prove that the variables are independent; it only means the data do not provide enough evidence of an association.


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**3)** The most challenging part of the sprint challenge was working with data that had been completely anonymized: without knowing what the attributes mean, it is hard to build a narrative around the results.