# Understanding Data with Descriptive Statistics

<b>Descriptive statistics</b> is about describing and summarizing data. It uses two main approaches:

1. The quantitative approach describes and summarizes data numerically.
2. The visual approach illustrates data with charts, plots, histograms, and other graphs.

You can apply descriptive statistics to one or many datasets or variables:
- <b>Univariate analysis</b>: When you describe and summarize a single variable, you’re performing univariate analysis.
- <b>Bivariate analysis</b>: When you search for statistical relationships between a pair of variables, you’re performing bivariate analysis.
- <b>Multivariate analysis</b>: When you examine more than two variables at once, you’re performing multivariate analysis.

Types of measures in descriptive statistics:

- <b>Central tendency</b> tells you about the centers of the data. Useful measures include the mean, median, and mode.
- <b>Variability</b> tells you about the spread of the data. Useful measures include variance and standard deviation.
- <b>Correlation or joint variability</b> tells you about the relation between a pair of variables in a dataset. Useful measures include covariance and the correlation coefficient.
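
As a minimal sketch of the measures listed above, the following computes each of them with pandas. The two columns `x` and `y` are made-up toy values chosen for illustration; they are not taken from any dataset used in this document.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "x": [2, 4, 4, 5, 7, 9, 11],
    "y": [1, 3, 4, 4, 6, 8, 9],
})

# Central tendency
mean_x = df["x"].mean()
median_x = df["x"].median()
mode_x = df["x"].mode().tolist()  # mode() returns a Series because ties are possible

# Variability (pandas uses the sample formulas, ddof=1, by default)
var_x = df["x"].var()
std_x = df["x"].std()

# Joint variability of the pair (x, y)
cov_xy = df["x"].cov(df["y"])
corr_xy = df["x"].corr(df["y"])  # Pearson correlation by default

print("mean:", mean_x, "median:", median_x, "mode:", mode_x)
print("variance:", var_x, "std:", round(std_x, 3))
print("covariance:", round(cov_xy, 3), "correlation:", round(corr_xy, 3))
```

Note that pandas divides by `n - 1` when computing variance, standard deviation, and covariance, so results can differ slightly from formulas that divide by `n`.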

<p align="center">
<img src="pic/descriptive_statistics.jpg">
</p>

## Looking at Raw Data
Looking at the raw data first is important: the insight it gives helps us choose better pre-processing and data-handling steps for an ML project.

In [1]:
from pandas import read_csv
path = 'diabetes.csv'
data = read_csv(path)
print(data.head(10))

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   
5            5      116             74              0        0  25.6   
6            3       78             50             32       88  31.0   
7           10      115              0              0        0  35.3   
8            2      197             70             45      543  30.5   
9            8      125             96              0        0   0.0   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   2

## Checking Dimensions of Data
It is always good practice to know how much data we have, in terms of rows and columns, for our ML project. The reasons are:

- If we have too many rows and columns, running the algorithm and training the model can take a long time.

- If we have too few rows and columns, we may not have enough data to train the model well.


In [2]:
print(data.shape)

(768, 9)


## Getting Each Attribute's Data Type
It is also good practice to know the data type of each attribute, because we sometimes need to convert one data type to another. For example, we may need to convert strings to floating point or integer values to represent `categorical` or `ordinal` values. We can get an idea of each attribute’s data type by looking at the raw data, but a more reliable way is to use the `dtypes` property of a Pandas DataFrame.

In [3]:
print(data.dtypes)

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
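
The conversion mentioned above can be sketched as follows. All columns in `diabetes.csv` are already numeric, so this example uses a small made-up DataFrame; the column names `weight` and `risk` and the ordering of the risk levels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical columns, not part of diabetes.csv.
df = pd.DataFrame({
    "weight": ["70.5", "82.0", "64.3"],   # numbers stored as strings
    "risk": ["low", "high", "medium"],    # ordinal labels stored as strings
})

# String -> floating point
df["weight"] = df["weight"].astype(float)

# String -> ordered categorical, then to integer codes that respect the order
risk_levels = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
df["risk"] = df["risk"].astype(risk_levels)
df["risk_code"] = df["risk"].cat.codes

print(df.dtypes)
print(df)
```

Declaring the category order explicitly keeps the integer codes meaningful (`low` < `medium` < `high`) instead of relying on alphabetical order.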


## Statistical Summary of Data
The __`describe()`__ function of a `Pandas DataFrame` reports the following eight statistical properties for each numeric attribute:

- Count
- Mean
- Standard Deviation
- Minimum Value
- Maximum value
- 25% (first quartile)
- Median, i.e. 50%
- 75% (third quartile)

In [4]:
import pandas as pd

pd.set_option('display.precision', 2)
print(data.shape)
print(data.describe())

(768, 9)
       Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin     BMI  \
count       768.00   768.00         768.00         768.00   768.00  768.00   
mean          3.85   120.89          69.11          20.54    79.80   31.99   
std           3.37    31.97          19.36          15.95   115.24    7.88   
min           0.00     0.00           0.00           0.00     0.00    0.00   
25%           1.00    99.00          62.00           0.00     0.00   27.30   
50%           3.00   117.00          72.00          23.00    30.50   32.00   
75%           6.00   140.25          80.00          32.00   127.25   36.60   
max          17.00   199.00         122.00          99.00   846.00   67.10   

       DiabetesPedigreeFunction     Age  Outcome  
count                    768.00  768.00   768.00  
mean                       0.47   33.24     0.35  
std                        0.33   11.76     0.48  
min                        0.08   21.00     0.00  
25%                        0.24  

## Reviewing Class Distribution
Class distribution statistics are useful in classification problems, where we need to know how balanced the class values are. If the class distribution is highly imbalanced, i.e. one class has many more observations than the other, it may need special handling at the data preparation stage of our ML project. We can easily get the class distribution in Python with the help of a Pandas DataFrame.

In [5]:
count_Outcome = data.groupby('Outcome').size()
print(count_Outcome)

Outcome
0    500
1    268
dtype: int64
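
To judge the balance at a glance, it can help to express the counts as proportions. The sketch below rebuilds a Series from the counts printed above (500 and 268) so that it runs without the CSV file; with the real data, `data['Outcome']` could be used in its place.

```python
import pandas as pd

# Rebuilt from the counts shown above: 500 rows of class 0, 268 rows of class 1.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# Share of each class instead of raw counts
proportions = outcome.value_counts(normalize=True)
print(proportions.round(3))

# Ratio of the largest class to the smallest class
counts = outcome.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

Here roughly 65% of the observations belong to class 0, so the classes are moderately imbalanced.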


## Reviewing Correlation between Attributes
The relationship between two variables is called __correlation__. In statistics, the most common method for calculating correlation is `Pearson’s Correlation Coefficient`, which measures the strength of the linear relationship. It can take any value from -1 to 1, with three reference points:

- __Coefficient value = 1:__ It represents full positive correlation between variables.
- __Coefficient value = -1:__ It represents full negative correlation between variables.
- __Coefficient value = 0:__ It represents no linear correlation between variables.
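
The three reference values above can be reproduced with a small hand-written version of the coefficient. This is a sketch for illustration; the helper name `pearson_r` and the toy inputs are assumptions, and in practice `DataFrame.corr()` does this work.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # perfectly increasing line
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # perfectly decreasing line
r_zero = pearson_r(x, [4, 1, 0, 1, 4])       # symmetric curve, no linear trend

print(r_positive, r_negative, r_zero)
```

The last case is worth noting: `y` depends entirely on `x`, yet the coefficient is zero because the relationship is not linear.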

It is good practice to review the pairwise correlations of the attributes in our dataset before using it in an ML project, because some machine learning algorithms, such as linear regression and logistic regression, can perform poorly when attributes are highly correlated. In Python, we can easily calculate a correlation matrix of the dataset attributes with the `corr()` function of a Pandas DataFrame.

In [6]:
correlations = data.corr(method='pearson')
print(correlations)

                          Pregnancies  Glucose  BloodPressure  SkinThickness  \
Pregnancies                      1.00     0.13           0.14          -0.08   
Glucose                          0.13     1.00           0.15           0.06   
BloodPressure                    0.14     0.15           1.00           0.21   
SkinThickness                   -0.08     0.06           0.21           1.00   
Insulin                         -0.07     0.33           0.09           0.44   
BMI                              0.02     0.22           0.28           0.39   
DiabetesPedigreeFunction        -0.03     0.14           0.04           0.18   
Age                              0.54     0.26           0.24          -0.11   
Outcome                          0.22     0.47           0.07           0.07   

                          Insulin   BMI  DiabetesPedigreeFunction   Age  \
Pregnancies                 -0.07  0.02                     -0.03  0.54   
Glucose                      0.33  0.22          

The matrix in the output above gives the correlation between every pair of attributes in the dataset. The diagonal is always 1 because each attribute is perfectly correlated with itself.
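
Reading a full matrix by eye gets tedious as the number of attributes grows. One possible way to list only the strongly correlated pairs is sketched below; the helper name `strongly_correlated_pairs`, the threshold of 0.5, and the toy columns are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.5):
    """Return attribute pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr(method="pearson")
    # Keep only the part above the diagonal so each pair appears once
    # and self-correlations are excluded; everything else becomes NaN.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN never passes the comparison, so masked cells are dropped here too.
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=abs, ascending=False)

# Hypothetical toy data: b rises with a, c is unrelated noise.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 8, 10, 13],
    "c": [3, 1, 4, 1, 5, 2],
})
result = strongly_correlated_pairs(toy, threshold=0.5)
print(result)
```

With the real data, the same helper could be called as `strongly_correlated_pairs(data)`; the appropriate threshold depends on the algorithm and the problem.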

## Reviewing Skew of Attribute Distribution
Skewness measures the asymmetry of a distribution: a distribution that is expected to be __Gaussian__ but appears distorted or shifted to the left or right is said to be skewed. Reviewing the skewness of attributes is an important task for the following reasons:

- Skewed data may need correction at the data preparation stage so that we can get better accuracy from our model.
- Many ML algorithms assume that the data has a Gaussian distribution, i.e. a normal, bell-shaped curve.

In Python, we can easily calculate the skew of each attribute by using the __`skew()`__ function of a Pandas DataFrame.

In [7]:
print(data.skew())

Pregnancies                 0.90
Glucose                     0.17
BloodPressure              -1.84
SkinThickness               0.11
Insulin                     2.27
BMI                        -0.43
DiabetesPedigreeFunction    1.92
Age                         1.13
Outcome                     0.64
dtype: float64


The output above shows both positive skew (a longer right tail) and negative skew (a longer left tail). The closer a value is to zero, the less skewed the attribute is.
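
One common way to reduce positive skew at the data preparation stage is a log transform. The sketch below is only an illustration on made-up values, not a recommendation for any particular column; `np.log1p` is used because it also handles zeros, and it is not suitable for negatively skewed attributes.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: most are small, a few are very large.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 50], dtype=float)

skew_before = values.skew()

# log1p computes log(1 + x), which compresses the long right tail
transformed = np.log1p(values)
skew_after = transformed.skew()

print("skew before:", round(skew_before, 2))
print("skew after: ", round(skew_after, 2))
```

Whether such a transform actually helps should be checked per attribute, for example by comparing `skew()` before and after as done here.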

## References

- [Python Statistics Fundamentals: How to Describe Your Data](https://realpython.com/python-statistics/)