# Data Science Unit 1 Sprint Challenge 2

## Exploring Data, Testing Hypotheses

In this sprint challenge you will look at a dataset of echocardiograms:

<https://archive.ics.uci.edu/ml/datasets/Echocardiogram>

Attribute Information:

1. survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
2. still-alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
3. age-at-heart-attack -- age in years when heart attack occurred
4. pericardial-effusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
5. fractional-shortening -- a measure of contractility around the heart; lower numbers are increasingly abnormal
6. epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
7. lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
8. wall-motion-score -- a measure of how the segments of the left ventricle are moving
9. wall-motion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
10. mult -- a derived variable which can be ignored
11. name -- the name of the patient (I have replaced them with "name")
12. group -- meaningless, ignore it
13. alive-at-1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.

Sprint challenges are evaluated based on satisfactory completion of each part. It is suggested you work through it in order, getting each aspect reasonably working, before trying to deeply explore, iterate, or refine any given step. Once you get to the end, if you want to go back and improve things, go for it!

## Part 1 - Load and validate the data

- Load the data as a `pandas` data frame.
- Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
- UCI says there should be missing data - check, and if necessary change the data so pandas recognizes it as NA.
- Make sure that the loaded features are of the types described above (continuous values should be treated as float), and correct as necessary

This is review, but skills that you'll use at the start of any data exploration. Further, you may have to do some investigation to figure out which file to load from - that is part of the puzzle.
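
The checks listed above can be sketched roughly as follows. This is a minimal illustration, not the required solution: the three rows in `raw` are made up to mimic the layout of `echocardiogram.data`, and the choice of which columns count as continuous is an assumption based on the attribute list.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same comma-separated layout as
# echocardiogram.data; '?' marks a missing value. These rows are invented.
raw = io.StringIO(
    "11,0,71,0,0.260,9,4.600,14,1,1,name,1,0\n"
    "19,0,72,0,0.380,6,4.100,14,1.700,0.588,name,1,0\n"
    "?,?,77,?,?,?,?,?,2,?,name,?,?\n"
)
cols = ['survival', 'still-alive', 'age-at-heart-attack',
        'pericardial-effusion', 'fractional-shortening', 'epss', 'lvdd',
        'wall-motion-score', 'wall-motion-index', 'mult', 'name', 'group',
        'alive-at-1']

# na_values='?' makes pandas read the placeholder as NaN.
sample = pd.read_csv(raw, header=None, names=cols, na_values='?')

# Assumed set of continuous features, cast explicitly to float.
continuous = ['survival', 'age-at-heart-attack', 'fractional-shortening',
              'epss', 'lvdd', 'wall-motion-score', 'wall-motion-index']
sample = sample.astype({c: float for c in continuous})

print(sample.shape)         # number of observations and columns
missing_per_column = sample.isna().sum()
print(missing_per_column)   # count of missing values in each column
print(sample.dtypes)        # confirm the loaded types
```

On the real file, the same three checks (`shape`, `isna().sum()`, `dtypes`) can be compared against the dataset description from UCI.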

In [2]:

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data

--2020-08-01 16:16:45--  https://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6107 (6.0K) [application/x-httpd-php]
Saving to: ‘echocardiogram.data’


2020-08-01 16:16:45 (115 MB/s) - ‘echocardiogram.data’ saved [6107/6107]



In [49]:
# Load the data into a DataFrame
import pandas as pd

column_headers = ['survival', 'still-alive', 'age-at-heart-attack',
                  'pericardial-effusion', 'fractional-shortening',
                  'epss', 'lvdd', 'wall-motion-score', 'wall-motion-index',
                  'mult', 'name', 'group', 'alive-at-1']
df = pd.read_csv('echocardiogram.data', header=None,
                 names=column_headers,
                 # on_bad_lines='skip' replaces error_bad_lines=False,
                 # which was removed in pandas 2.0 (requires pandas >= 1.3)
                 on_bad_lines='skip',
                 na_values='?')  # '?' marks missing values in the raw file
df.head()

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,mult,name,group,alive-at-1
0,11.0,0.0,71.0,0,0.26,9.0,4.6,14.0,1.0,1.0,name,1,0.0
1,19.0,0.0,72.0,0,0.38,6.0,4.1,14.0,1.7,0.588,name,1,0.0
2,16.0,0.0,55.0,0,0.26,4.0,3.42,14.0,1.0,1.0,name,1,0.0
3,57.0,0.0,60.0,0,0.253,12.062,4.603,16.0,1.45,0.788,name,1,0.0
4,19.0,1.0,57.0,0,0.16,22.0,5.75,18.0,2.25,0.571,name,1,0.0


## Part 2 - Exploring data, Testing hypotheses

The only thing we really know about this data is that `alive-at-1` is the class label. Besides that, we have continuous features and categorical features.

Explore the data: you can use whatever approach (tables, utility functions, visualizations) to get an impression of the distributions and relationships of the variables. In general, your goal is to understand how the features are different when grouped by the two class labels (`1` and `0`).

For the continuous features, how are they different when split between the two class labels? Choose two features to run t-tests (again split by class label) - specifically, select one feature that is *extremely* different between the classes, and another feature that is notably less different (though perhaps still "statistically significantly" different). You may have to explore more than two features to do this.
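
A t-test split by class label can be sketched as below. The data here is synthetic (the column name `feature` and its values are assumptions for illustration only), and Welch's variant (`equal_var=False`) is one reasonable choice rather than a requirement of the challenge.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for one continuous feature and the class label.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'alive-at-1': np.repeat([0, 1], 40),
    'feature': np.concatenate([rng.normal(1.0, 0.3, 40),
                               rng.normal(1.6, 0.3, 40)]),
})
toy.loc[3, 'feature'] = np.nan  # simulate a missing measurement

# Split the feature by class label, then compare the two group means.
group0 = toy.loc[toy['alive-at-1'] == 0, 'feature']
group1 = toy.loc[toy['alive-at-1'] == 1, 'feature']
result = stats.ttest_ind(group0, group1, nan_policy='omit', equal_var=False)
print(result.statistic, result.pvalue)
```

The key point is that both arguments to `ttest_ind` come from the same column, filtered by the two values of the class label.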

For the categorical features, explore by creating "cross tabs" (aka [contingency tables](https://en.wikipedia.org/wiki/Contingency_table)) between them and the class label, and apply the Chi-squared test to them. [pandas.crosstab](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) can create contingency tables, and [scipy.stats.chi2_contingency](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) can calculate the Chi-squared statistic for them.

There are also categorical features - as with the t-test, try to find one where the Chi-squared test returns an extreme result (rejecting the null that the data are independent), and one where it is less extreme.

**NOTE** - "less extreme" just means smaller test statistic/larger p-value. Even the least extreme differences may be strongly statistically significant.
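
As a rough sketch of the cross tab plus Chi-squared workflow described above, using invented counts (the values assigned to `pericardial-effusion` and `alive-at-1` below are not taken from the dataset):

```python
import pandas as pd
from scipy import stats

# Invented binary feature and class label, 80 rows in total.
toy = pd.DataFrame({
    'pericardial-effusion': [0] * 30 + [1] * 10 + [0] * 20 + [1] * 20,
    'alive-at-1':           [0] * 40 + [1] * 40,
})

# Contingency table: rows are feature values, columns are class labels.
table = pd.crosstab(toy['pericardial-effusion'], toy['alive-at-1'])
print(table)

# chi2_contingency tests the null hypothesis that rows and columns
# are independent; it also returns the expected counts under that null.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2, p_value, dof)
```

Note that `scipy.stats.chi2_contingency` is the test of independence for a contingency table, whereas `scipy.stats.chisquare` is a goodness-of-fit test against expected frequencies.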

Your *main* goal is the hypothesis tests, so don't spend too much time on the exploration/visualization piece. That is just a means to an end - use simple visualizations, such as boxplots or a scatter matrix (both built in to pandas), to get a feel for the overall distribution of the variables.

This is challenging, so manage your time and aim for a baseline of at least running two t-tests and two Chi-squared tests before polishing. And don't forget to answer the questions in part 3, even if your results in this part aren't what you want them to be.

In [4]:
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns


In [5]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [52]:
ran=df.sample(n=30, random_state=42)
ran.head()

Unnamed: 0,survival,still-alive,age-at-heart-attack,pericardial-effusion,fractional-shortening,epss,lvdd,wall-motion-score,wall-motion-index,mult,name,group,alive-at-1
56,45.0,0.0,63.0,0,0.15,13.0,4.57,13.0,1.08,0.857,name,2,0.0
83,0.75,1.0,78.0,0,0.05,10.0,4.44,15.0,1.36,0.786,name,2,1.0
19,1.0,1.0,66.0,1,0.22,15.0,5.4,27.0,2.25,0.857,name,1,1.0
31,1.0,1.0,52.0,1,0.17,17.2,5.32,14.0,1.17,0.857,name,2,
76,0.25,1.0,68.0,0,0.22,21.7,4.85,15.0,1.15,0.928,name,2,


In [53]:
from scipy.stats import ttest_ind

# Compare the mean age in the 30-row random sample with the mean in the full data
print(ran['age-at-heart-attack'].mean())
print(df['age-at-heart-attack'].mean())

63.666666666666664
62.813722222222225


In [54]:
ttest_ind(ran['age-at-heart-attack'], df['age-at-heart-attack'], nan_policy='omit')

Ttest_indResult(statistic=0.488817287180926, pvalue=0.6256664616202791)

In [None]:
# The p-value is above 0.05, so we fail to reject the null hypothesis.
# The mean age-at-heart-attack in the sample and in the full data are close.

In [7]:
df.describe(exclude='number')

Unnamed: 0,name,group
count,131,110
unique,1,3
top,name,2
freq,131,85


In [8]:
#checking for dependence or independence
df['survival'].value_counts()


1.00     6
0.75     6
0.50     6
33.00    5
26.00    5
29.00    4
0.25     4
36.00    4
27.00    4
22.00    4
25.00    4
12.00    4
19.00    4
41.00    3
24.00    3
34.00    3
10.00    3
16.00    3
31.00    3
32.00    3
21.00    2
17.00    2
5.00     2
28.00    2
3.00     2
35.00    2
40.00    2
13.00    2
2.00     2
37.00    2
52.00    2
20.00    2
15.00    2
53.00    2
38.00    2
49.00    2
57.00    1
50.00    1
44.00    1
0.03     1
48.00    1
7.50     1
46.00    1
19.50    1
7.00     1
45.00    1
9.00     1
47.00    1
23.00    1
55.00    1
4.00     1
1.25     1
11.00    1
Name: survival, dtype: int64

In [6]:

df['still-alive'].value_counts()

0.0    88
1.0    43
Name: still-alive, dtype: int64

In [None]:
surv_still =pd.crosstab(df['survival'], df['still-alive'])

In [13]:
#to test
from scipy.stats import chisquare
chisquare(surv_still, axis = None)

Power_divergenceResult(statistic=199.41538461538457, pvalue=7.784164563032896e-08)

In [None]:
# The p-value is close to zero, so we reject the null hypothesis.
# Caveat: chisquare(..., axis=None) tests whether all cell counts are equal,
# not whether survival and still-alive are independent. The test of
# independence on this table is stats.chi2_contingency(surv_still).

In [23]:
# Comparing still-alive and age-at-heart-attack
age_still =pd.crosstab(df['age-at-heart-attack'], df['still-alive'])
chisquare(age_still, axis = None)

Power_divergenceResult(statistic=134.5714285714286, pvalue=2.9456287523394776e-05)

In [None]:
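# The p-value is close to zero, so we reject the null hypothesis.
# Because chisquare(..., axis=None) assumes equal expected counts in every cell,
# this rejects uniform cell counts; stats.chi2_contingency(age_still) would test
# independence between still-alive and age-at-heart-attack.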

In [30]:
#Comparing epss and wall-motion-score
wall_epss =pd.crosstab(df['wall-motion-score'], df['epss'])
stats.chi2_contingency(wall_epss)[1]

0.04088451596823525

In [None]:
# The p-value is less than 0.05, so we reject the null hypothesis of independence.

In [31]:
# Comparing age-at-heart-attack and alive-at-1
age_alive =pd.crosstab(df['age-at-heart-attack'], df['alive-at-1'])
stats.chi2_contingency(age_alive)[1]


0.1444224780862056

In [None]:
# The p-value is above 0.05, so we fail to reject the null hypothesis of independence.
# We therefore have no evidence of a dependence between age-at-heart-attack and alive-at-1.
# In this data, the age at which a patient has a heart attack does not appear to predict survival beyond one year.

## Part 3 - Analysis and Interpretation

Now that you've looked at the data, answer the following questions:

- Interpret and explain the two t-tests you ran - what do they tell you about the relationships between the continuous features you selected and the class labels?
- Interpret and explain the two Chi-squared tests you ran - what do they tell you about the relationships between the categorical features you selected and the class labels?
- What was the most challenging part of this sprint challenge?

Answer with text, but feel free to intersperse example code/results or refer to it from earlier.

*Your words here!*

## Part 4 - Bayesian vs Frequentist Statistics

Using a minimum of 2-3 sentences, give an explanation of Bayesian and Frequentist statistics, and then compare and contrast these two approaches to statistical inference.



Bayes' theorem is what allows us to go from our sampling and prior distributions to our posterior distribution. Frequentist statistics, by contrast, draws conclusions from sample data by emphasizing the frequency or proportion of the data; it focuses on the likelihood alone, which is comparable to using a non-informative prior.
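
To make the contrast concrete, here is a small sketch that estimates the proportion of patients marked still alive, using the 43 out of 131 count from the `still-alive` value counts earlier. The uniform Beta(1, 1) prior and the normal-approximation interval are assumptions chosen for simplicity, not the only reasonable options.

```python
import numpy as np
from scipy import stats

alive, total = 43, 131  # still-alive == 1 count and total rows, from above

# Frequentist: point estimate plus a normal-approximation 95% confidence interval.
p_hat = alive / total
se = np.sqrt(p_hat * (1 - p_hat) / total)
z = stats.norm.ppf(0.975)
freq_ci = (p_hat - z * se, p_hat + z * se)

# Bayesian: a Beta(1, 1) prior (assumed, non-informative) updated with the
# observed counts gives a Beta posterior over the proportion.
prior_a, prior_b = 1, 1
posterior = stats.beta(prior_a + alive, prior_b + (total - alive))
bayes_ci = posterior.interval(0.95)  # 95% credible interval

print(p_hat, freq_ci)
print(posterior.mean(), bayes_ci)
```

With a flat prior and this much data the two intervals should come out very similar; a more informative prior would pull the posterior toward the prior belief.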

# Stretch Goals: 
Do these to get a 3. These are not required in order to pass the Sprint Challenge.

## Part 1: 

Make sure that all of your dataframe columns have the appropriate data types. *Hint:* If a column has the datatype of "object" even though it's made up of float or integer values, you can coerce it to act as a numeric column by using the `pd.to_numeric()` function. In order to get a 3 on this section make sure that your data exploration is particularly well commented, easy to follow, and thorough.
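
A minimal sketch of the `pd.to_numeric()` hint, on a made-up object column (the values below are invented for illustration):

```python
import pandas as pd

# An object-typed column: numbers stored as strings, '?' for missing.
raw = pd.DataFrame({'epss': ['9.0', '?', '12.062']})
print(raw['epss'].dtype)  # object

# errors='coerce' turns anything unparseable into NaN instead of raising.
raw['epss'] = pd.to_numeric(raw['epss'], errors='coerce')
print(raw['epss'].dtype)  # float64
n_missing = raw['epss'].isna().sum()
```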

## Part 2:

Write functions that can calculate t-tests and chi^2 tests on all of the appropriate column combinations from the dataset. (Remember that certain tests require certain variable types.)
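
One possible shape for such functions is sketched below. The function names `ttests_by_label` and `chi2_by_label` are hypothetical, the label is assumed to be coded `0`/`1`, and the demonstration data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttests_by_label(data, label, features):
    """Welch t-test of each continuous feature between label 0 and label 1."""
    rows = []
    for feature in features:
        group0 = data.loc[data[label] == 0, feature].dropna()
        group1 = data.loc[data[label] == 1, feature].dropna()
        result = stats.ttest_ind(group0, group1, equal_var=False)
        rows.append({'feature': feature,
                     'statistic': result.statistic,
                     'pvalue': result.pvalue})
    # Most extreme result (smallest p-value) first.
    return pd.DataFrame(rows).sort_values('pvalue').reset_index(drop=True)


def chi2_by_label(data, label, features):
    """Chi-squared test of independence of each categorical feature vs. the label."""
    rows = []
    for feature in features:
        table = pd.crosstab(data[feature], data[label])
        chi2, pvalue, dof, _ = stats.chi2_contingency(table)
        rows.append({'feature': feature, 'chi2': chi2,
                     'pvalue': pvalue, 'dof': dof})
    return pd.DataFrame(rows).sort_values('pvalue').reset_index(drop=True)


# Synthetic demonstration data: 'shifted' differs by class, 'noise' does not,
# and 'flag' is a binary feature balanced across both classes.
rng = np.random.default_rng(1)
label = np.repeat([0, 1], 50)
toy = pd.DataFrame({
    'label': label,
    'shifted': rng.normal(0, 1, 100) + 2 * label,
    'noise': rng.normal(0, 1, 100),
    'flag': np.tile([0, 1], 50),
})
t_results = ttests_by_label(toy, 'label', ['shifted', 'noise'])
chi2_results = chi2_by_label(toy, 'label', ['flag'])
print(t_results)
print(chi2_results)
```

Sorting by p-value makes it easy to pick one extreme and one less extreme feature from each table.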

## Part 3: 

Calculate and report confidence intervals on your most important mean estimates (choose at least two). Make some kind of a graphic or visualization to help us see visually how precise these estimates are.
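
A confidence interval for a mean can be sketched with the t-distribution as follows. The helper name `mean_confidence_interval` is hypothetical, and the example uses only the five `age-at-heart-attack` values visible in the `df.head()` output plus one added missing value, so the resulting interval says nothing about the full dataset.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]   # drop missing observations
    n = len(values)
    mean = values.mean()
    sem = stats.sem(values)              # standard error of the mean (ddof=1)
    margin = sem * stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, mean - margin, mean + margin


# Ages from the first five rows shown earlier, with one NaN appended
# to illustrate that missing values are ignored.
ages = [71.0, 72.0, 55.0, 60.0, 57.0, np.nan]
mean, lower, upper = mean_confidence_interval(ages)
print(mean, lower, upper)
```

The returned triple can be passed to an error-bar plot so that the width of the bar shows how precise the estimate is.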

## Part 4:

Give an extra awesome explanation of Bayesian vs Frequentist Statistics. Maybe use code or visualizations, or any other means necessary to show an above average grasp of these high level concepts.

In [None]:
# You can work the stretch goals down here or back up in their regular sections
# just make sure that they are labeled so that we can easily differentiate
# your main work from the stretch goals.