<a href="https://colab.research.google.com/github/digitalshawn/STC551/blob/main/Module%201/Descriptive%20Stats%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Descriptive Stats Example - Analysis of Heart Attacks

In this notebook, we will analyze data from the [UCI Machine Learning Repository Heart Attack Analysis & Prediction Dataset](http://archive.ics.uci.edu/ml/datasets/Heart+Disease). Follow the link to review the data set description. We will examine measures of central tendency (mean, median, mode) as well as the standard deviation, variance, and spread. You will also find examples of how to graph many of the measures in the code below.

We will use the following Python modules in the code below:

*   [pandas](https://pandas.pydata.org/docs/index.html) - a common data analysis toolkit
*   [plotly](https://plotly.com/python/) - for plotting graphs
*   [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html) - to analyze the skew of the data




**There are many paths...**

*NOTE: While this code provides an example pattern for you to use (via the pandas and plotly modules), you may use whatever modules you are comfortable with. An important part of this work is to recognize that there is more than one approach to exploring or analyzing data.*

This example is based upon the [Descriptive Statistics using pandas](https://github.com/elakapoor/Descriptive_analysis_python/blob/0348d2fbd4c2b9ff7835a564008ebfff559d4d64/descriptive-statistics-using-pandas.ipynb) notebook by ELAKAPOOR.

# Loading the Python modules

In [None]:
# Comments start with a '#' and will not be executed

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px # accessible module for plotting graphs
from scipy.stats import skew, kurtosis # to analyze the skew of our dataset
import plotly.figure_factory as ff

# Loading the data

The data for this example is in CSV format, which we can load directly into a pandas dataframe (think Excel table).  I've put the data in Google Drive, which pandas can load directly via an open URL.

Below we load the data into a pandas dataframe, discard the columns we do not plan to use for this example, and start to examine the data before we dive into the analysis.

*NOTE: Something to try... Click the little wand icon next to the table output below to load an interactive version of the dataframe.*
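
As a small aside, the column selection can also happen while the file is being read. The sketch below is only an illustration: it assumes a tiny made-up CSV held in a string (not the heart dataset) and uses the `usecols` parameter of `pd.read_csv` to keep just the named columns.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file or URL (illustrative values only)
csv_text = """age,sex,cp,chol
63,1,3,233
37,1,2,250
41,0,1,204
"""

# usecols keeps only the listed columns while reading
sample_df = pd.read_csv(io.StringIO(csv_text), usecols=["age", "sex", "chol"])

print(sample_df.head())
print("Rows:", len(sample_df))
```

Either approach, selecting columns after loading or at load time, leaves you with the same kind of dataframe.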

In [None]:
df = pd.read_csv("https://drive.google.com/u/0/uc?id=1P6_2pBH_qvtJ1mW0X773LlhUWqfnB2mm&export=download")

# We are only keeping a subset of the data (i.e. dropping columns we will not use in this example)
df = df[["age", "sex", 'exng', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'output']]

# head() displays the first n rows of a dataframe.  This is an easy way to take a peek at our data
df.head()

# What does each column mean?
*It is important that we understand the data and its context. Information about each column is provided below so we can start exploring the data.*

1. **age** : age of the patient
2. **sex** : sex of the patient (1 = male; 0 = female)
3. **exng** : exercise-induced angina (1 = yes; 0 = no)
4. **ca** : number of major vessels (0-3); not among the columns kept above
5. **cp** : chest pain type
    *   Value 1: typical angina
    *   Value 2: atypical angina
    *   Value 3: non-anginal pain
    *   Value 4: asymptomatic
6. **trtbps** : resting blood pressure (in mm Hg)
7. **chol** : cholesterol in mg/dl fetched via BMI sensor
8. **fbs** : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
9. **restecg** : resting electrocardiographic results
    *   Value 0: normal
    *   Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    *   Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
10. **thalach** : maximum heart rate achieved; not among the columns kept above
11. **output** : 0 = less chance of heart attack; 1 = more chance of heart attack



In [None]:
# len() returns the length of an object. For a dataframe, it returns the number of rows.
len(df)

In [None]:
df.describe()

# Central Tendency
Central tendency is measured using the three M's: mean, median, and mode.
1. Mean: the average of the values present.
2. Median: the centrally located value of the dataset when arranged in ascending order.
3. Mode: the most frequent value in the dataset.<br>
How are these values useful in data analysis or data science? To answer that, let us look at the following graph.


In [None]:
x = df["age"]
hist_data = [x]
group_labels = ['age'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels, show_rug=False)
fig.update_layout(title = "Distribution of age")
fig.show()

print("Mean age:", df["age"].mean())
print("Median age:", df["age"].median())
print("Mode of age:", df["age"].mode())
print("Skewness: ", skew(df["age"]))

From the graph above we have the following observations:
1. The average (mean) age is around 54 years
2. The centrally located value is the median, which is 55
3. The most frequent value (mode) is 58<br>
The curve is not a perfect Gaussian (bell-shaped) curve. Its tail extends to the left, so it is a left-skewed curve, which matches the negative skewness value printed above. How do the three measures reflect the shape of the curve? As a rule of thumb, if we order the mean, median, and mode we see that:<br>
Mode > Median > Mean (shows left skewness)<br>
Mode < Median < Mean (shows right skewness)<br>
Mode = Median = Mean (perfect bell-shaped curve)<br>
Hence these values tell us about the shape of the data distribution, which makes the data easier to work with.
Also, the mean is nearly equal to the median, which suggests that there may be no outliers present. We confirm this with a box plot after the comparison by sex.
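
To make the ordering rule above concrete, here is a minimal sketch on a small made-up sample (not the heart data) whose tail extends to the right. For such data we would expect Mode < Median < Mean and a positive skewness value. Keep in mind the ordering is a rule of thumb rather than a guarantee.

```python
import pandas as pd
from scipy.stats import skew

# Made-up values with a long tail to the right (not taken from the heart dataset)
values = pd.Series([1, 2, 2, 2, 3, 3, 4, 5, 9, 15])

mean_value = values.mean()
median_value = values.median()
mode_value = values.mode().iloc[0]  # mode() returns a Series; take the first mode
skewness = skew(values)

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
print("Skewness:", skewness)
```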

# Central Tendency by Sex
Let's calculate the same central tendency measures but split the data by sex. From the data definitions, we know that sex is coded as male and female (1 = male; 0 = female).

To do so we will slice our data into two dataframes, one where sex == 1 (male) and the other where sex == 0 (female).

**After running this code, what differences do you see between males and females?**


We can use pandas' [.value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html) function to get the count of each value in a column (it works on both numerical and categorical data!). This will show us the count of females and males in the dataset. Is the dataset evenly distributed by sex?
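
In the spirit of "there are many paths", slicing into two dataframes is not the only option. The sketch below uses a made-up dataframe (the rows are illustrative, not from the heart dataset) to show `value_counts(normalize=True)` for proportions and `groupby` for per-group statistics.

```python
import pandas as pd

# Made-up rows for illustration; sex is coded 1 = male, 0 = female as in the notebook
toy = pd.DataFrame({
    "sex": [1, 1, 1, 0, 0],
    "age": [50, 60, 70, 40, 60],
})

# Share of each sex instead of raw counts
shares = toy["sex"].value_counts(normalize=True)

# Mean and median age per sex in a single call
age_by_sex = toy.groupby("sex")["age"].agg(["mean", "median"])

print(shares)
print(age_by_sex)
```

The separate `dfm` and `dff` dataframes used below remain handy when you want one plot per group.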

In [None]:
df["sex"].value_counts()

In [None]:
# Create a new dataframe that only includes 'males' (sex == 1)
dfm = df[df.sex == 1]

# Select only the age column (a pandas Series)
xm = dfm["age"]
hist_data = [xm]
group_labels = ['age'] # name of the dataset

# Create the distribution plot using the data we generated above
fig = ff.create_distplot(hist_data, group_labels, show_rug=False)
# Add a title
fig.update_layout(title = "Distribution of age -- Male")
# Show the plot, without this there will not be any output!
fig.show()

# Display central tendency measures
print("Mean age:", dfm["age"].mean())
print("Median age:", dfm["age"].median())
print("Mode of age:", dfm["age"].mode())
print("Skewness: ", skew(dfm["age"]))
print("Rows in dataset: ", len(dfm))

# Now we do the same thing, but for females in the dataset

# Create a new dataframe that only includes 'females' (sex == 0)
dff = df[df.sex == 0]
xf = dff["age"]
hist_data = [xf]
group_labels = ['age'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels, show_rug=False)
fig.update_layout(title = "Distribution of age -- Female")
fig.show()

print("Mean age:", dff["age"].mean())
print("Median age:", dff["age"].median())
print("Mode of age:", dff["age"].mode())
print("Skewness: ", skew(dff["age"]))
print("Rows in dataset: ", len(dff))

In [None]:
# A box plot shows whether outliers are present or absent
fig = px.box(df, x = "age", title = "distribution of age")
fig.show()

In [None]:
# Now let us create the same boxplot, but split it by sex
# We can use the dfm and dff dataframes we already created
fig = px.box(dfm, x = "age", title = "distribution of age - male")
fig.show()

fig = px.box(dff, x = "age", title = "distribution of age - female")
fig.show()

# Spread
Spread is the variability of the data within a distribution, that is, how the data are distributed around the central tendency. It can be measured using the following metrics:<br>
1. Range
2. Quartiles
3. Variance
4. Standard Deviation<br>


**Range:** The difference between the largest and smallest values.

Pandas provides functions for us to get the max and min of a column

*   [DataFrame.min()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html#pandas.DataFrame.min) provides us with the minimum value in a set
*   [DataFrame.max()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.max.html#pandas.Series.max) provides us with the maximum value in a set



In [None]:
range_chol = df["chol"].max() - df["chol"].min()
print("The range of the cholesterol level is:", range_chol)

We can also use pandas' [.describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.describe.html) function to get a number of descriptive stats about an entire dataframe or a single column. You can see an example for cholesterol levels below.

In [None]:
df.chol.describe()

In [None]:
x = df["chol"]
hist_data = [x]
group_labels = ['cholesterol'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels, show_rug=False)
fig.update_layout(title = "Distribution of cholesterol levels")
fig.show()

Here the values are concentrated between 200 and 300, but the range is 438. The large range is caused by outliers in the data.<br>
The curve is right-skewed. As practice, you can calculate the measures of central tendency and check whether the ordering for right skewness given earlier (Mode < Median < Mean) holds.

**Quartiles:** As the name suggests, quartiles divide the data, arranged in ascending order, into 4 equal parts. The cut points are 25% (Q1), 50% (Q2), and 75% (Q3). Because Q2 is the midpoint of the ordered data, it is equal to the median. <br>
**IQR** (interquartile range) is the range between Q1 and Q3, that is, Q3 - Q1. IQR is preferred over the range because it is not influenced by outliers. It measures the variability of the middle 50% of the data.<br>
Outlier bounds = [(Q1 - 1.5 * IQR), (Q3 + 1.5 * IQR)]<br>
Any number outside these bounds is considered an outlier.<br>
The quartiles and outliers can be explained with the help of a box plot.
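
The outlier rule above can be sketched on a small made-up sample. This is a minimal illustration, assuming `Series.quantile` with its default interpolation; the values are invented and one of them (50) is deliberately far from the rest.

```python
import pandas as pd

# Made-up values; 50 is deliberately far from the rest
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 50])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```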

In [None]:
fig = px.box(df, x = "chol", title = "Distribution of cholesterol levels")
fig.show()

In [None]:
df.chol.describe()

In [None]:
# Look at the output of describe() above. The quartiles are labeled "25%" (Q1), "50%" (Q2), and "75%" (Q3).
# We look these values up by label (positional lookups such as [4] are deprecated in recent pandas versions)
# and use them to find the outliers.

# Outlier calculation
# IQR = Q3 - Q1
chol_stats = df.chol.describe()
Q1 = chol_stats["25%"]
# Q1 = 211.0
Q2 = chol_stats["50%"]
# Q2 = 240.0
Q3 = chol_stats["75%"]
# Q3 = 274.5
IQR = Q3 - Q1
# IQR = 274.5 - 211
outlier1 = (Q1 - 1.5 * IQR)
outlier2 = (Q3 + 1.5 * IQR)
print(f"The numbers outside the range of {outlier1} and {outlier2} will be considered as outliers")
# The f prefix is required so that {outlier2} is replaced by its value
print(f"The box plot verifies our calculation. All the values greater than {outlier2} are shown as outliers.")

**Variance:** It is a statistical parameter used to quantify spread. It measures how far each number in the set is from the mean, and thus from every other number in the set.
It is the average of the squared distances from the mean, so observations near the mean contribute little and observations far from the mean contribute more.
1. A high variance indicates that the numbers are far from the mean and far from each other. 
2. A low variance indicates that the numbers are close to the mean and to each other. 
3. If the variance is 0, all the numbers in the dataset are identical. 
4. A valid variance is never negative (0 or more).
<br>
**Standard Deviation:** It is the square root of the variance. It is more commonly used because it is expressed in the same units as the data, which makes the spread easier to interpret.<br>
For example, if the data are in kg, the variance is in kg<sup>2</sup> whereas the standard deviation is in kg.
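
As a worked illustration of these definitions, the sketch below computes the variance by hand on made-up weights and compares it with pandas. One detail worth knowing: `Series.var()` and `Series.std()` default to the sample versions (`ddof=1`, dividing by n - 1), so they differ slightly from the population formula.

```python
import pandas as pd

# Made-up weights in kg
weights = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Population variance: the mean of the squared deviations from the mean
deviations = weights - weights.mean()
population_variance = (deviations ** 2).mean()

# pandas defaults to the sample variance (ddof=1), which divides by n - 1
sample_variance = weights.var()
sample_std = weights.std()

print("Population variance (kg^2):", population_variance)
print("Sample variance (kg^2):", sample_variance)
print("Sample standard deviation (kg):", sample_std)
```

Passing `ddof=0` to `var()` or `std()` gives the population versions instead.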

In [None]:
# case of very high variance as per the plot above
print("Variance: ",df["chol"].var())
print("Standard Deviation: ", df["chol"].std())

# Normalization
Normalization shifts and rescales the values so that they end up ranging between 0 and 1. It is also known as min-max scaling. It is a common choice when the data do not follow a Gaussian distribution.
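
The formula is (value - min) / (max - min). Here is a minimal sketch on made-up numbers, wrapped in a hypothetical helper named `min_max_scale` (the name and the handling of constant columns are assumptions for illustration, not part of pandas).

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a numeric Series to the range [0, 1] (hypothetical helper)."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # All values identical: return zeros rather than dividing by zero
        return series * 0.0
    return (series - series.min()) / value_range

# Made-up values for illustration
levels = pd.Series([100, 150, 200, 300])
scaled = min_max_scale(levels)
print(scaled.tolist())
```

The smallest value maps to 0, the largest to 1, and everything else falls in between in proportion.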

In [None]:
df["normalized_chol"]=(df["chol"]-df["chol"].min())/(df["chol"].max()-df["chol"].min())

In [None]:
x = df["normalized_chol"]
hist_data = [x]
group_labels = ['normalized cholesterol'] # name of the dataset

fig = ff.create_distplot(hist_data, group_labels, show_rug=False)
fig.update_layout(title = "Distribution of normalized cholesterol levels")
fig.show()

Compared to the earlier distribution curve of the cholesterol levels, we can conclude the following:
1. The spread is smaller now
2. The height of the curve has increased
3. All values are in the range of 0 to 1
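
One way to read these observations: min-max scaling changes the scale of the axis, not the shape of the distribution. The sketch below, on made-up values, suggests that the standard deviation shrinks by the factor (max - min) while the skewness stays the same.

```python
import pandas as pd
from scipy.stats import skew

# Made-up values with a long right tail
raw = pd.Series([120, 180, 200, 210, 240, 260, 300, 560])
scaled = (raw - raw.min()) / (raw.max() - raw.min())

print("Std before:", raw.std(), "after:", scaled.std())
print("Skew before:", skew(raw), "after:", skew(scaled))
```

The curve looks taller after scaling because the same total area now sits over a much narrower range of values.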