# Data Science Unit 1 Sprint Challenge 2

## Exploring Data, Testing Hypotheses

In this sprint challenge you will look at a dataset of echocardiograms:

<https://archive.ics.uci.edu/ml/datasets/Echocardiogram>

Attribute Information:

1. survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
2. still-alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
3. age-at-heart-attack -- age in years when heart attack occurred
4. pericardial-effusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
5. fractional-shortening -- a measure of contractility around the heart. Lower numbers are increasingly abnormal.
6. epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
7. lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
8. wall-motion-score -- a measure of how the segments of the left ventricle are moving
9. wall-motion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
10. mult -- a derived variable which can be ignored
11. name -- the name of the patient (I have replaced them with "name")
12. group -- meaningless, ignore it
13. alive-at-1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- UCI says there should be missing data - check, and if necessary change the data so pandas recognizes it as missing (`NaN`)
- Make sure that the loaded features are of the types described above (continuous values should be treated as float), and correct as necessary

This is review, but skills that you'll use at the start of any data exploration. Further, you may have to do some investigation to figure out which file to load from - that is part of the puzzle.

In [11]:
#import pandas as pd
#url  = 'https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data'
#col_names = ['survival', 'still-alive', 'age-at-heart-attack', 'pericardial-effusion', 'fractional-shortening', 'epss', 'lvdd', 'wall-motion-score', 'wall-motion-index', 'mult', 'name', 'group', 'alive-at-1']
#cardio = pd.read_csv(url, names = col_names, na_values = ['?'], on_bad_lines = 'warn', header= None)
#cardio.head(5)


In [12]:
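import pandas as pd

# Raw UCI data file: no header row, '?' marks missing values.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data"
# on_bad_lines="warn" (pandas >= 1.3) replaces the removed error_bad_lines=False:
# malformed rows are skipped and reported.
cardio = pd.read_csv(url, header=None, na_values="?", on_bad_lines="warn")
print(cardio.head())
# The last column (12) is alive-at-1, the class label.
cardio.iloc[:, -1].unique()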

     0   1     2   3      4       5      6     7     8      9     10   11   12
0  11.0   0  71.0   0  0.260   9.000  4.600  14.0  1.00  1.000  name  1.0  0.0
1  19.0   0  72.0   0  0.380   6.000  4.100  14.0  1.70  0.588  name  1.0  0.0
2  16.0   0  55.0   0  0.260   4.000  3.420  14.0  1.00  1.000  name  1.0  0.0
3  57.0   0  60.0   0  0.253  12.062  4.603  16.0  1.45  0.788  name  1.0  0.0
4  19.0   1  57.0   0  0.160  22.000  5.750  18.0  2.25  0.571  name  1.0  0.0


b'Skipping line 50: expected 13 fields, saw 14\n'


array([ 0.,  1., nan])

In [13]:
cardio.isnull().sum()

0      1
1      0
2      5
3      0
4      7
5     14
6     10
7      3
8      1
9      3
10     0
11    22
12    57
dtype: int64

In [15]:
cardio[12].unique()

array([ 0.,  1., nan])
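
The frame above still uses the numeric column labels 0-12. A minimal sketch of attaching the attribute names and checking the types follows. It assumes the names from the attribute list at the top of this notebook and, to stay runnable offline, reads a small in-memory sample (two rows from the output above plus one made-up row containing `?`) instead of the UCI URL. `COL_NAMES`, `sample`, and `named` are illustrative names, not part of the original work.

```python
import io

import pandas as pd

# Attribute names copied from the attribute list at the top of the notebook.
COL_NAMES = [
    "survival", "still-alive", "age-at-heart-attack", "pericardial-effusion",
    "fractional-shortening", "epss", "lvdd", "wall-motion-score",
    "wall-motion-index", "mult", "name", "group", "alive-at-1",
]

# Illustrative stand-in for the UCI file: two rows shown above, plus one
# made-up row with '?' entries to exercise the missing-value handling.
sample = io.StringIO(
    "11,0,71,0,0.260,9,4.600,14,1,1,name,1,0\n"
    "19,0,72,0,0.380,6,4.100,14,1.70,0.588,name,1,0\n"
    "?,1,?,0,0.160,22,5.750,18,2.25,0.571,name,?,?\n"
)

named = pd.read_csv(sample, header=None, names=COL_NAMES, na_values="?")

# Continuous features should be float; coerce anything that is not.
continuous = [
    "survival", "age-at-heart-attack", "fractional-shortening", "epss",
    "lvdd", "wall-motion-score", "wall-motion-index", "mult",
]
named[continuous] = (
    named[continuous].apply(pd.to_numeric, errors="coerce").astype(float)
)

print(named.dtypes)
print(named.isnull().sum())
```

With the real file, the same `names=COL_NAMES` argument could be passed to the `pd.read_csv` call used above.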

## Part 2 - Exploring data, Testing hypotheses

The only thing we really know about this data is that Alive-at-1 is the class label. Besides that, we have continuous features and categorical features.

Explore the data: you can use whatever approach (tables, utility functions, visualizations) to get an impression of the distributions and relationships of the variables. In general, your goal is to understand how the features are different when grouped by the two class labels (`1` and `0`).

For the continuous features, how are they different when split between the two class labels? Choose two features to run t-tests (again split by class label) - specifically, select one feature that is *extremely* different between the classes, and another feature that is notably less different (though perhaps still "statistically significantly" different). You may have to explore more than two features to do this.

For the categorical features, explore by creating "cross tabs" (aka [contingency tables](https://en.wikipedia.org/wiki/Contingency_table)) between them and the class label, and apply the Chi-squared test to them. [pandas.crosstab](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) can create contingency tables, and [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) can calculate the Chi-squared statistic for them.

There are also categorical features - as with the t-test, try to find one where the Chi-squared test returns an extreme result (rejecting the null that the data are independent), and one where it is less extreme.

**NOTE** - "less extreme" just means smaller test statistic/larger p-value. Even the least extreme differences may be strongly statistically significant.

Your *main* goal is the hypothesis tests, so don't spend too much time on the exploration/visualization piece. That is just a means to an end - use simple visualizations, such as boxplots or a scatter matrix (both built in to pandas), to get a feel for the overall distribution of the variables.

This is challenging, so manage your time and aim for a baseline of at least running two t-tests and two Chi-squared tests before polishing. And don't forget to answer the questions in part 3, even if your results in this part aren't what you want them to be.

In [143]:
import matplotlib.pyplot as plt
from scipy.stats import t, ttest_1samp, ttest_ind, chi2_contingency, chi2
import seaborn as sns
import numpy as np

In [25]:
# Split the rows by the class label (column 12, alive-at-1).
alive_group = cardio.groupby(12)
alive_group_one = alive_group.get_group(1.0)
not_alive_group = alive_group.get_group(0.0)
print(alive_group_one.head())
print(not_alive_group.head())

      0   1       2   3     4     5     6      7     8      9     10   11   12
10  10.0   1  77.000   0  0.13  16.0  4.23  18.00  1.80  0.714  name  1.0  1.0
14   0.5   1  62.000   0  0.12  23.0  5.80  11.67  2.33  0.358  name  1.0  1.0
16   0.5   1  69.000   1  0.26  11.0  4.65  18.00  1.64  0.784  name  1.0  1.0
17   0.5   1  62.529   1  0.07  20.0  5.20  24.00  2.00  0.857  name  1.0  1.0
19   1.0   1  66.000   1  0.22  15.0  5.40  27.00  2.25  0.857  name  1.0  1.0
     0   1     2   3      4       5      6     7     8      9     10   11   12
0  11.0   0  71.0   0  0.260   9.000  4.600  14.0  1.00  1.000  name  1.0  0.0
1  19.0   0  72.0   0  0.380   6.000  4.100  14.0  1.70  0.588  name  1.0  0.0
2  16.0   0  55.0   0  0.260   4.000  3.420  14.0  1.00  1.000  name  1.0  0.0
3  57.0   0  60.0   0  0.253  12.062  4.603  16.0  1.45  0.788  name  1.0  0.0
4  19.0   1  57.0   0  0.160  22.000  5.750  18.0  2.25  0.571  name  1.0  0.0


In [32]:
print(alive_group_one[2].mean())
print(not_alive_group[2].mean())

67.45778260869565
62.92
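
The same comparison can be made for several columns at once with `groupby(...).agg(...)`. The sketch below is only an illustration: it rebuilds a tiny frame from the ten rows printed by `head()` above (columns 2, 4, and 12 only), so its numbers will not match the full-data means. The names `demo` and `summary` are hypothetical.

```python
import pandas as pd

# Columns: 2 = age-at-heart-attack, 4 = fractional-shortening, 12 = alive-at-1.
# Values are the ten rows printed by head() above, not the full dataset.
demo = pd.DataFrame(
    [
        (77.000, 0.13, 1.0),
        (62.000, 0.12, 1.0),
        (69.000, 0.26, 1.0),
        (62.529, 0.07, 1.0),
        (66.000, 0.22, 1.0),
        (71.0, 0.260, 0.0),
        (72.0, 0.380, 0.0),
        (55.0, 0.260, 0.0),
        (60.0, 0.253, 0.0),
        (57.0, 0.160, 0.0),
    ],
    columns=[2, 4, 12],
)

# Mean, standard deviation, and row count of each feature per class label.
summary = demo.groupby(12)[[2, 4]].agg(["mean", "std", "count"])
print(summary)
```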


In [63]:
# Column 2 is age-at-heart-attack; nan_policy='omit' drops missing values.
ttest_ind(alive_group_one[2], not_alive_group[2], nan_policy='omit')

Ttest_indResult(statistic=2.2165294422402875, pvalue=0.02986013147831115)

In [49]:
# Column 4 is fractional-shortening; nan_policy='omit' drops missing values.
ttest_ind(alive_group_one[4], not_alive_group[4], nan_policy='omit')

Ttest_indResult(statistic=-2.5335404371604846, pvalue=0.013601905952411147)
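
With its default `equal_var=True`, `ttest_ind` computes the pooled-variance (Student's) two-sample t statistic. A minimal sketch of that calculation, checked against SciPy, is below. The helper `pooled_t_statistic` is a hypothetical name, and the two short lists are just the fractional-shortening values visible in the `head()` output above, so the result will differ from the full-data statistic.

```python
import numpy as np
from scipy.stats import ttest_ind


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, ignoring NaN values."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: weighted average of the two sample variances.
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Fractional-shortening values from the first five rows of each group above.
group_one = [0.13, 0.12, 0.26, 0.07, 0.22]
group_zero = [0.260, 0.380, 0.260, 0.253, 0.160]

manual_t = pooled_t_statistic(group_one, group_zero)
scipy_t = ttest_ind(group_one, group_zero).statistic
print(manual_t, scipy_t)
```

If the two groups have clearly different variances, passing `equal_var=False` to `ttest_ind` runs Welch's t-test instead.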

In [153]:
contingency_table1 = pd.crosstab(cardio[3], cardio[12], margins= True)
contingency_table1.head()

12   0.0  1.0  All
3
0     43   16   59
1      7    8   15
All   50   24   74


In [154]:
peri_alive_zero = contingency_table1.iloc[0:2,0:1]
peri_alive_zero.head()

12  0.0
3
0    43
1     7


In [155]:
peri_alive_one = contingency_table1.iloc[0:2, 1:2]
peri_alive_one.head()

12  1.0
3
0    16
1     8


In [157]:
row_sums = contingency_table1.iloc[0:2, 2].values
col_sums = contingency_table1.iloc[2, 0:2].values
print(row_sums)
print(col_sums)

[59 15]
[50 24]


In [158]:
total = contingency_table1.loc['All', 'All']
print(total)

74


In [165]:
# expected = []
# for i in range(len(row_sums)):
#     expected_row = []
#     for column in col_sums:
#         expected_val = column*row_sums[i]/total
#         expected_row.append(expected_val)
#     expected.append(expected_row)
    
# expected = np.array(expected)
# print(expected.shape)  
# print(expected)

In [160]:
observed = pd.crosstab(cardio[3], cardio[12])
print(observed.shape)

(2, 2)


In [162]:
dof = (len(row_sums)-1)*(len(col_sums)-1)

In [163]:
chi_squared, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-Squared: {chi_squared}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}") 


Chi-Squared: 2.649576271186441
P-value: 0.10357750184719003
Degrees of Freedom: 1


In [166]:
chi2_contingency(observed)

(2.649576271186441, 0.10357750184719003, 1, array([[39.86486486, 19.13513514],
        [10.13513514,  4.86486486]]))
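
The statistic above can be reproduced by hand. For a 2x2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default, which is why the plain sum of (O - E)² / E would not give 2.6496. The sketch below uses the observed counts from the crosstab above; the variable names are illustrative, and the simple `- 0.5` form assumes every |O - E| is at least 0.5, which holds for this table.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts: pericardial-effusion (rows) by alive-at-1 (columns).
observed_counts = np.array([[43, 16],
                            [7, 8]])

row_totals = observed_counts.sum(axis=1)
col_totals = observed_counts.sum(axis=0)
grand_total = observed_counts.sum()

# Expected count for each cell under independence: row total * column total / N.
expected_counts = np.outer(row_totals, col_totals) / grand_total

# Yates' correction: shrink each |O - E| by 0.5 before squaring.
corrected = np.abs(observed_counts - expected_counts) - 0.5
chi_sq_manual = (corrected ** 2 / expected_counts).sum()

dof_manual = (observed_counts.shape[0] - 1) * (observed_counts.shape[1] - 1)
p_manual = chi2.sf(chi_sq_manual, dof_manual)

chi_sq_scipy, p_scipy, _, _ = chi2_contingency(observed_counts)
print(chi_sq_manual, chi_sq_scipy)
print(p_manual, p_scipy)
```

Passing `correction=False` to `chi2_contingency` gives the uncorrected statistic.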

In [145]:
contingency_table2 = pd.crosstab(cardio[1], cardio[12])
contingency_table2

12  0.0  1.0
1
0    45    0
1     5   24


In [152]:
contingency_table2.shape

(2, 2)

In [167]:
chi2_contingency(contingency_table2)

(51.40538058748403,
 7.513233713240164e-13,
 1,
 array([[30.40540541, 14.59459459],
        [19.59459459,  9.40540541]]))

## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?
- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?
- What was the most challenging part of this sprint challenge?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.

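Both t-tests compared the `alive-at-1 = 1` group with the `alive-at-1 = 0` group. For age-at-heart-attack (column 2) the group means were about 67.5 and 62.9 years (t ≈ 2.22, p ≈ 0.030). For fractional-shortening (column 4) the result was t ≈ -2.53, p ≈ 0.014, so the `alive-at-1 = 1` group had the lower mean. Both differences are statistically significant at the 0.05 level, with fractional-shortening showing the stronger evidence of a difference.
The two chi-squared tests gave very different results. For pericardial-effusion (column 3) versus alive-at-1, χ² ≈ 2.65 with p ≈ 0.104, so I fail to reject the null hypothesis of independence at the 0.05 level. For still-alive (column 1) versus alive-at-1, χ² ≈ 51.4 with p ≈ 7.5e-13, so I reject independence; this strong dependence is expected because alive-at-1 is derived from still-alive.
The most challenging part was determining which variables to run these tests on in order to get results that I could interpret.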

## Part 4 - Bayesian vs Frequentist Statistics

Using a minimum of 2-3 sentences, give an explanation of Bayesian and Frequentist statistics, and then compare and contrast these two approaches to statistical inference.



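Frequentist statistics treats a parameter as a fixed but unknown value and defines probability as the long-run frequency of an outcome over repeated sampling. It does not assign a probability to a hypothesis, so it has no prior or posterior; instead it relies on p-values and confidence intervals, which describe how the data would behave under repeated sampling if a given hypothesis were true.

Bayesian statistics treats the parameter itself as uncertain and describes it with a probability distribution. Existing data, knowledge, or experience is encoded as the prior; observed data updates the prior through Bayes' theorem to produce the posterior, which can then serve as the prior when more data arrives. The main contrast is that a Bayesian analysis makes direct probability statements about a hypothesis but depends on the choice of prior, while a frequentist analysis needs no prior but only makes statements about the data given a hypothesis.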
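
As a small illustration of the prior-to-posterior update, the sketch below estimates the proportion of patients with `alive-at-1 = 1`, using the counts from the crosstab margins earlier (24 of 74). The uniform Beta(1, 1) prior is an assumption chosen for simplicity, not something the dataset dictates, and the variable names are illustrative.

```python
from scipy.stats import beta

# Counts taken from the crosstab margins above.
alive_count = 24
total_count = 74

# Frequentist point estimate: the observed proportion.
freq_estimate = alive_count / total_count

# Bayesian update with a conjugate Beta prior (assumed uniform: Beta(1, 1)).
prior_a, prior_b = 1, 1
post_a = prior_a + alive_count
post_b = prior_b + (total_count - alive_count)
posterior = beta(post_a, post_b)

post_mean = posterior.mean()
cred_low, cred_high = posterior.interval(0.95)

print(f"Observed proportion: {freq_estimate:.3f}")
print(f"Posterior mean:      {post_mean:.3f}")
print(f"95% credible interval: ({cred_low:.3f}, {cred_high:.3f})")
```

The posterior here could be reused as the prior if another batch of patients were observed, which is the updating step described above.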

# Stretch Goals: 
Do these to get a 3. These are not required in order to pass the Sprint Challenge.

## Part 1: 

Make sure that all of your dataframe columns have the appropriate data types. *Hint:* If a column has the datatype of "object" even though it's made up of float or integer values, you can coerce it to act as a numeric column by using the `pd.to_numeric()` function. In order to get a 3 on this section make sure that your data exploration is particularly well commented, easy to follow, and thorough.

## Part 2:

Write functions that can calculate t-tests and chi^2 tests on all of the appropriate column combinations from the dataset. (Remember that certain tests require certain variable types.)
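
One possible shape for such functions is sketched below. The function names, the demo frame, and its column names (`label`, `shifted`, `noise`, `flag`) are all hypothetical; the demo data is randomly generated so the block runs without downloading the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind


def t_tests_by_label(df, continuous_cols, label_col):
    """Two-sample t-test of each continuous column, split by a 0/1 label."""
    rows = []
    for col in continuous_cols:
        group_1 = df.loc[df[label_col] == 1, col].dropna()
        group_0 = df.loc[df[label_col] == 0, col].dropna()
        result = ttest_ind(group_1, group_0)
        rows.append({"feature": col,
                     "statistic": float(result.statistic),
                     "pvalue": float(result.pvalue)})
    return pd.DataFrame(rows).sort_values("pvalue").reset_index(drop=True)


def chi2_tests_by_label(df, categorical_cols, label_col):
    """Chi-squared test of independence of each categorical column vs. the label."""
    rows = []
    for col in categorical_cols:
        table = pd.crosstab(df[col], df[label_col])
        stat, pvalue, dof, _ = chi2_contingency(table)
        rows.append({"feature": col, "statistic": stat,
                     "pvalue": pvalue, "dof": dof})
    return pd.DataFrame(rows).sort_values("pvalue").reset_index(drop=True)


# Synthetic demo data (not the echocardiogram dataset).
rng = np.random.default_rng(0)
label = np.repeat([0, 1], 30)
demo = pd.DataFrame({
    "label": label,
    "shifted": rng.normal(loc=label * 2.0, scale=1.0),  # differs by class
    "noise": rng.normal(size=60),                       # unrelated to class
    "flag": rng.integers(0, 2, size=60),                # random binary feature
})

t_results = t_tests_by_label(demo, ["shifted", "noise"], "label")
chi_results = chi2_tests_by_label(demo, ["flag"], "label")
print(t_results)
print(chi_results)
```

On the notebook's frame, a call such as `t_tests_by_label(cardio, [2, 4], 12)` should repeat the two t-tests run earlier, since dropping missing values first matches `nan_policy='omit'`.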

## Part 3: 

Calculate and report confidence intervals on your most important mean estimates (choose at least two). Make some kind of a graphic or visualization to help us see visually how precise these estimates are.
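
A minimal sketch of a t-based confidence interval for a mean is below. `mean_confidence_interval` is a hypothetical helper, and the five ages are only the values visible in the `head()` output for the `alive-at-1 = 1` group, so the interval is illustrative rather than a result for the full dataset.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    data = np.asarray(values, dtype=float)
    data = data[~np.isnan(data)]  # ignore missing values
    n = len(data)
    mean = data.mean()
    sem = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    # Two-sided critical value from the t distribution with n - 1 dof.
    margin = stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem
    return mean, mean - margin, mean + margin


# Ages from the first five alive-at-1 = 1 rows printed earlier.
ages = [77.0, 62.0, 69.0, 62.529, 66.0]
mean, low, high = mean_confidence_interval(ages)
print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

To visualize the precision, the mean and margin could be passed to `plt.errorbar` as the point and its `yerr`.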

## Part 4:

Give an extra awesome explanation of Bayesian vs Frequentist Statistics. Maybe use code or visualizations, or any other means necessary to show an above average grasp of these high level concepts.

In [None]:
# You can work the stretch goals down here or back up in their regular sections
# just make sure that they are labeled so that we can easily differentiate
# your main work from the stretch goals.