## Introduction to Stats in Python Studio

We will work with this [dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) from Kaggle. There is no need to download it, as it is included in the git repository you just cloned.
<br>

Heart disease is the number one cause of death globally, taking an estimated 17.9 million lives each year, or 31% of all deaths worldwide.
<br>

Heart failure is a common event caused by heart disease, and this dataset contains 12 features that can be used to predict mortality from heart failure. Your task is to examine two of these variables and record your observations about how useful each is for predicting death from heart failure.
<br>

In Section 1, you will run some simple exploratory data analysis (EDA) and use statistical terminology to describe each variable in more detail. Section 2 explores how your variables are distributed. Finally, in Section 3 you will make some inferences about your variables and decide whether they are good indicators for predicting heart failure.
<br>

Answer the questions and record your observations in the space provided. Feel free to add more code blocks if you'd like.
<br>



In [1]:
# Import the libraries we need, using their usual aliases
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')


# Set style and font size
sns.set_style('darkgrid')
sns.set(font_scale=1.5)

In [3]:
# Read in data to a dataframe
df = pd.read_csv("heart3.csv")
df

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


## Section 1: First look at the data:

Run some simple EDA and look at the data and your variables. Answer the following questions.

In [5]:
df.columns

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')

In [6]:
df.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

Which of our columns are categorical data?
<BR><BR><BR>
    # anaemia, diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT
Which of our columns are continuous?
<BR><BR><BR>
    # age, creatinine_phosphokinase, ejection_fraction, platelets, serum_creatinine, serum_sodium, time
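
One rough way to check your answers is to count the distinct values in each column: a column with only two distinct values is a binary category coded as 0/1. The sketch below applies that rule to the nine rows shown in the preview table above. Using only those rows is an assumption made to keep the example self-contained; in the studio you would run the same lines on the full `df`. The names `sample`, `binary_cols`, and `numeric_cols` are illustrative.

```python
import io

import pandas as pd

# Nine rows copied from the preview table (first five and last four rows).
# Illustration only: in the studio, use the full `df` instead of `sample`.
csv_text = """age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Heuristic: at most two distinct values -> treat as a binary category.
binary_cols = [col for col in sample.columns if sample[col].nunique() <= 2]
numeric_cols = [col for col in sample.columns if col not in binary_cols]

print("Categorical (binary):", binary_cols)
print("Numeric:", numeric_cols)
```

The two-value threshold is a heuristic, not a rule. A numeric column that happens to take very few distinct values would be misclassified, so compare the result against the dataset description on Kaggle.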

### Statistical interpretation of our data?
#### First Variable:
What are the mean, min, max, and standard deviation? Describe what they tell you about this variable.

<br><br><br>


#### Second Variable:
What are the mean, min, max, and standard deviation? Describe what they tell you about this variable.

<br><br><br>
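
As a possible starting point for the two questions above, the sketch below computes the four statistics for two columns at once. The choice of `age` and `serum_creatinine` is only an example, and the values are the nine rows visible in the preview table rather than the full dataset, so the numbers will differ from what you get on `df`.

```python
import pandas as pd

# Two columns from the nine preview rows (illustration only;
# replace `sample` with the full `df` in the studio).
sample = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0, 62.0, 55.0, 45.0, 45.0],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9, 2.7, 1.1, 1.2, 0.8, 1.4],
})

# One row per statistic, one column per variable.
# pandas computes the sample standard deviation (ddof=1) by default.
summary = sample.agg(["mean", "min", "max", "std"])
print(summary.round(2))
```

`df.describe()` gives the same statistics plus the quartiles, which helps when you want to compare the mean with the median.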

What could the numbers in our categorical data tell us?

<br><br><br>
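
For a categorical column, the useful numbers are counts and proportions rather than a mean in the usual sense. The sketch below, again using only the nine preview rows as an assumed stand-in for `df`, counts smokers and cross-tabulates smoking against the outcome.

```python
import pandas as pd

# Two binary columns from the nine preview rows (illustration only).
sample = pd.DataFrame({
    "smoking":     [0, 0, 1, 0, 0, 1, 0, 0, 1],
    "DEATH_EVENT": [1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# How many patients fall into each category?
counts = sample["smoking"].value_counts()

# With 0/1 coding, the mean is the share of 1s (here: the share of smokers).
smoker_share = sample["smoking"].mean()

# Rows: smoking status, columns: outcome.
table = pd.crosstab(sample["smoking"], sample["DEATH_EVENT"])

print(counts)
print(f"Share of smokers: {smoker_share:.2f}")
print(table)
```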

Why might we want to keep our categorical data as 1's and 0's? Why may we want to use something like the code below to change it?



In [7]:
df['sex'] = df.sex.replace({1: "Male", 0: "Female"})
df['anaemia'] = df.anaemia.replace({1: "Yes", 0: "No"})
df['diabetes'] = df.diabetes.replace({1: "Yes", 0: "No"})
df['high_blood_pressure'] = df.high_blood_pressure.replace({1: "Yes", 0: "No"})
df['smoking'] = df.smoking.replace({1: "Yes", 0: "No"})

df['DEATH_EVENT'] = df.DEATH_EVENT.replace({1: "Died", 0: "Alive"})
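
One way to get the benefits of both codings is to keep the 0/1 columns and add labeled copies next to them. The sketch below is one possible approach, not the only one; the column names `sex_label` and `outcome` are made up for this example, and the data is the nine preview rows.

```python
import pandas as pd

# Two binary columns from the nine preview rows (illustration only).
sample = pd.DataFrame({
    "sex":         [1, 1, 1, 1, 0, 1, 0, 0, 1],
    "DEATH_EVENT": [1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Work on a copy and add label columns instead of overwriting the codes.
labeled = sample.copy()
labeled["sex_label"] = labeled["sex"].map({1: "Male", 0: "Female"})
labeled["outcome"] = labeled["DEATH_EVENT"].map({1: "Died", 0: "Alive"})

# The numeric codes still support arithmetic: the mean is the death rate.
death_rate = labeled["DEATH_EVENT"].mean()

# The labels make grouped output and plot legends readable.
by_sex = labeled.groupby("sex_label")["DEATH_EVENT"].mean()

print(f"Overall death rate: {death_rate:.2f}")
print(by_sex)
```

Note that once a column holds strings such as `"Died"`, comparisons like `== 1` no longer match any row, which matters for any later cell that filters on the original codes.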

## Section 2: Distribution of our data:

In [None]:
# Plot the distribution of your variable (sns.distplot is deprecated in recent seaborn; sns.histplot with kde=True is the replacement)


In [None]:
# Create boxplot to show distribution of variable


In [None]:
# Feel free to add any additional graphs that help you answer the questions below.

In [None]:
# Another way to check the skewness of our variable
# (replace 'variable' with your column name; 0 means symmetric, > 0 means a longer right tail)
df['variable'].skew()

In [None]:
# Another way to check the kurtosis of our variable
# (pandas reports excess kurtosis, so a normal distribution scores about 0)
df['variable'].kurtosis()
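
To get a feel for what these two numbers mean, it can help to compare a skewed column with a deliberately symmetric one. The sketch below uses the `creatinine_phosphokinase` values from the nine preview rows (one very large value pulls the right tail out) and a made-up symmetric series; both are illustrations, not results for the full dataset.

```python
import pandas as pd

# creatinine_phosphokinase from the nine preview rows (illustration only).
cpk = pd.Series([582, 7861, 146, 111, 160, 61, 1820, 2060, 2413],
                name="creatinine_phosphokinase")

# A made-up, perfectly symmetric series for comparison.
symmetric = pd.Series([1, 2, 3, 4, 5])

cpk_skew = cpk.skew()             # positive: long right tail
sym_skew = symmetric.skew()       # zero: symmetric around the mean
sym_kurt = symmetric.kurtosis()   # negative: flatter than a normal distribution

print(f"cpk skew:       {cpk_skew:.2f}")
print(f"symmetric skew: {sym_skew:.2f}")
print(f"symmetric kurt: {sym_kurt:.2f}")
```

Both methods return bias-corrected sample estimates, so they can differ slightly from values computed with the plain population formulas, especially on small samples.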

### Interpretation of how our data is distributed by variable?
Looking at the above graphs, what can you tell about the distribution of your variables?
<br><br><br><br><br>
What are the skewness and kurtosis of your variables? What does this mean?<br>
<br><br><br><br><br>
What are some of the differences you note looking at a categorical variable vs a continuous variable?
<br><br><br><br><br>

## Section 3: Finding Correlations

Let's start by splitting our data into two groups based on the outcome.

In [None]:
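# Split the dataframe into two parts based on the 'DEATH_EVENT' column.
# isin() matches both the original 0/1 codes and the "Died"/"Alive" labels
# set by the replace() cell above, so the split is not empty after relabeling.
df_died = df[df['DEATH_EVENT'].isin([1, "Died"])]
df_lived = df[df['DEATH_EVENT'].isin([0, "Alive"])]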

In [None]:
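# Plot your variable by outcome (replace 'variable' with your column name)
# sns.histplot replaces sns.distplot, which is deprecated in recent seaborn

sns.histplot(df_died['variable'], kde=True, stat='density', color='tab:red', label='Died')
sns.histplot(df_lived['variable'], kde=True, stat='density', color='tab:blue', label='Lived')
plt.title("Distribution of variable by outcome")
plt.legend()
plt.show()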


In [None]:
# Feel free to add any additional graphs that help you answer the questions below.
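
Alongside the plots, a numeric comparison of the two groups can support your answers below. This sketch compares `ejection_fraction` between outcomes and computes a Pearson correlation with the 0/1 outcome, which for a binary variable is the point-biserial correlation. It assumes `DEATH_EVENT` is still coded 0/1 and uses only the nine preview rows, so treat the numbers as an illustration of the method rather than a finding.

```python
import pandas as pd

# Two columns from the nine preview rows (illustration only; far too few
# rows to draw conclusions, so use the full `df` in the studio).
sample = pd.DataFrame({
    "ejection_fraction": [20, 38, 20, 20, 20, 38, 38, 60, 38],
    "DEATH_EVENT":       [1, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Summary of the variable within each outcome group (0 = lived, 1 = died).
group_stats = sample.groupby("DEATH_EVENT")["ejection_fraction"].agg(
    ["mean", "median", "std"]
)

# Pearson correlation between a numeric variable and a 0/1 outcome.
r = sample["ejection_fraction"].corr(sample["DEATH_EVENT"])

print(group_stats.round(2))
print(f"Correlation with DEATH_EVENT: {r:.2f}")
```

A clear gap between the group means, together with a correlation well away from zero, suggests the variable carries some signal about the outcome. It does not by itself show that the variable causes the outcome.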

#### If we treat our data as a sample of the population, what can you infer from each of your variables?  
<br><br><br><br><br>
#### Do you think either of your variables is a good indicator for predicting heart failure? Why or why not?  
<br><br><br><br><br>