# Explore your data with descriptive statistics

***Time spent understanding your data (BEFORE APPLYING ANY ML TECHNIQUES) is time well spent***.

Suggestions and guidelines may differ, but here is what we will use as a checklist:

1. ***Take a peek*** at your ***raw*** dataset
1. Review the ***dimensions*** of your dataset
1. Review the ***data types*** of the attributes in your dataset
1. Summarize your data using ***descriptive statistics***
1. Summarize the ***distribution of instances across classes*** in your dataset
1. Understand the relationships in your data using ***correlations***
1. Review the ***skew*** of the distributions of each attribute

In [1]:
import csv
import numpy as np

## 1. Peek at the raw data

There is no substitute for this phase: looking at the raw data can reveal insights that you may not be able to get in any other way. This phase also helps to plant seeds that may later grow into ideas on how to better pre-process and handle your data for ML tasks. 

Just review the first few rows of your data, either quickly like this...

In [2]:
!head -20 pima-indians-diabetes.data.csv

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1


... or, in Python, by calling the `head()` method of the Pandas DataFrame.

In [3]:
from pandas import read_csv

In [4]:
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
peek = data.head(20)
peek

   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1
5     5   116    74     0     0  25.6  0.201   30      0
6     3    78    50    32    88  31.0  0.248   26      1
7    10   115     0     0     0  35.3  0.134   29      0
8     2   197    70    45   543  30.5  0.158   53      1
9     8   125    96     0     0   0.0  0.232   54      1


The second approach is much better: it shows the attribute names and displays the data in a more readable format. You get this for free by loading your data as a Pandas DataFrame. The first column lists the row number (the DataFrame index), which is very handy for referencing a specific observation.

But WHAT is your data? Do you understand the columns?

Of course, if this is your own data, collected for the very problem you are applying ML to, you already know what every column means. If you take a dataset from the web, or you are given one, you need to familiarize yourself with it first. For this example, some details can be found e.g. at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data, and are summarized below:

    Pregnancies = Number of times pregnant
    Glucose = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    BloodPressure = Diastolic blood pressure (mm Hg)
    SkinThickness = Triceps skin fold thickness (mm)
    Insulin = 2-Hour serum insulin (mu U/ml)
    BMI = Body mass index (weight in kg/(height in m)^2)
    DiabetesPedigreeFunction = Diabetes pedigree function
    Age = Age (years)
    Outcome = Class variable (0 or 1)
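The descriptive names above do not match the short column names used in the code (`preg`, `plas`, ...). A minimal sketch of how the two could be tied together, assuming the two lists are in the same column order (this mapping is an assumption based on the order of the `names` list and of the list above, and `short_to_long` is just an illustrative helper name):

```python
import pandas as pd

# Short names used in this notebook, and the descriptive names listed above.
# ASSUMPTION: both lists describe the columns in the same order.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
long_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
              'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
short_to_long = dict(zip(names, long_names))

# The first two rows of the raw file, just to have something to rename.
sample = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]],
    columns=names,
)

# rename() returns a new DataFrame; the original keeps the short names.
renamed = sample.rename(columns=short_to_long)
print(renamed.columns.tolist())
```

Keeping the short names for the code and the mapping for reference lets you look up what a column means without retyping long identifiers everywhere.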

## 2. Dimensions of your data

In the previous part you had a glance at "some" of the data... out of how much? How much data do you have? How many rows? How many columns (i.e. features)?

You need to know because this has implications:

* Too many rows? Your ML algorithm may take too long to train.
* Too few rows? Perhaps you do not have enough data to train the algorithms at all.
* Too many features? Some algorithms can be distracted or suffer poor performance due to the curse of dimensionality.
* Too few features? There may not be enough information to learn anything useful.

It is best to review the size and shape of your dataset right away, by inspecting the `shape` property of the Pandas DataFrame. Having loaded the data as a DataFrame makes this a one-liner!

In [5]:
shape = data.shape
shape

(768, 9)

The results are listed as (rows, columns). You can see that the dataset has 768 rows and 9 columns. This will drive some of your future choices.

## 3. Data type for each attribute

The type of each attribute is important to know, as data types can easily turn data preparation into a nightmare. E.g. strings may need to be converted to floating point values or integers, to represent categorical or ordinal values. Of course, you can already get an idea of the attribute types by peeking at the raw data, as done above. But it is better to list the data type of each attribute explicitly, using the `dtypes` property of the Pandas DataFrame.

In [6]:
types = data.dtypes
print(types)

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object


Most of the attributes are integers. "mass" and "pedi" are floating point types. All of them are numbers, anyway.
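This dataset is entirely numeric, so no conversion is needed here. As a minimal sketch of what the string-to-integer conversion mentioned above could look like, here is a toy DataFrame with a hypothetical string column `smoker` (an assumption for illustration only; it is not part of the Pima dataset):

```python
import pandas as pd

# Toy data: 'smoker' is a HYPOTHETICAL column, not present in the Pima dataset.
df = pd.DataFrame({"age": [50, 31, 32], "smoker": ["yes", "no", "yes"]})

# Find the columns that are not numeric and would need some conversion.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Encode the strings as integer codes; categories are sorted, so no -> 0, yes -> 1.
df["smoker_code"] = df["smoker"].astype("category").cat.codes
print(df.dtypes)
```

Integer codes are fine for a binary or ordinal attribute; for unordered categories with more than two values, a different encoding may be more appropriate.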

## 4. Descriptive statistics

Descriptive statistics can give you good insight into the shape of each attribute. You need nothing too complicated at this stage; in fact, a single command can often produce more statistical summaries than you have time to review. Use the `describe()` method of the Pandas DataFrame to list 8 major statistical properties of each attribute:
* Count
* Mean
* Standard Deviation
* Minimum Value
* 25th Percentile
* 50th Percentile (Median)
* 75th Percentile
* Maximum Value

In [7]:
from pandas import set_option

In [10]:
set_option('display.width', 100)        # set the preferred width of the output
set_option('display.precision', 3)      # set the numbers precision (full option name; the bare 'precision' is ambiguous in recent pandas)
description = data.describe()
print(description)

          preg     plas     pres     skin     test     mass     pedi      age    class
count  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000
mean     3.845  120.895   69.105   20.536   79.799   31.993    0.472   33.241    0.349
std      3.370   31.973   19.356   15.952  115.244    7.884    0.331   11.760    0.477
min      0.000    0.000    0.000    0.000    0.000    0.000    0.078   21.000    0.000
25%      1.000   99.000   62.000    0.000    0.000   27.300    0.244   24.000    0.000
50%      3.000  117.000   72.000   23.000   30.500   32.000    0.372   29.000    0.000
75%      6.000  140.250   80.000   32.000  127.250   36.600    0.626   41.000    1.000
max     17.000  199.000  122.000   99.000  846.000   67.100    2.420   81.000    1.000


When describing the dataset in this way, it is worth taking some time to review what you see and to draw observations from the output of this piece of code. These might include the presence of NaN values for missing data, or surprising distributions for some attributes, so be careful in this step!

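Some obvious checks (without thinking much):
* Is `count` the same for all attributes, or not?
* Do you see an unexpectedly large standard deviation for any attribute?
* ...

Some less obvious checks (thinking a bit):
* Given what you think a feature means, does its average make sense? Do its minimum and maximum? For example, a minimum of 0 for `mass` (body mass index) or `pres` (blood pressure) is not physically plausible, so zeros in those columns may well stand for missing values.
* ...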
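One way to follow up on these checks is to count, per column, how many values are NaN and how many are exactly zero. The sketch below does this on the first ten rows of the file only (copied from the peek above), so the counts are illustrative; on the full dataset you would run the same two lines on `data`:

```python
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# First ten rows of the raw file, as displayed in the peek section.
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
    [5, 116, 74, 0, 0, 25.6, 0.201, 30, 0],
    [3, 78, 50, 32, 88, 31.0, 0.248, 26, 1],
    [10, 115, 0, 0, 0, 35.3, 0.134, 29, 0],
    [2, 197, 70, 45, 543, 30.5, 0.158, 53, 1],
    [8, 125, 96, 0, 0, 0.0, 0.232, 54, 1],
]
sample = pd.DataFrame(rows, columns=names)

missing = sample.isna().sum()   # explicit NaN values per column
zeros = (sample == 0).sum()     # exact zeros per column

print(missing)
print(zeros)
```

A zero is perfectly legitimate for `preg` or `class`, so whether a zero is suspicious depends on what the column means; the counts only tell you where to look.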

## 5. Class distribution 
(only for classification problems)

On classification problems, you need to know how balanced the class (0/1) values are. Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project. 

You can quickly get an idea of the distribution of the class attribute in Pandas, by checking how many 0s and how many 1s you have.

In [11]:
class_counts = data.groupby('class').size()
print(class_counts)

class
0    500
1    268
dtype: int64


There are nearly twice as many observations with class 0 (no onset of diabetes) as there are with class 1 (onset of diabetes): 500 versus 268.
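Relative frequencies are often easier to read than raw counts. As a minimal sketch, the series below is rebuilt from the counts printed above (500 zeros and 268 ones) rather than from the file; on the real DataFrame the equivalent call would be `data['class'].value_counts(normalize=True)`:

```python
import pandas as pd

# Rebuild the class column from the counts shown above (not read from the file).
classes = pd.Series([0] * 500 + [1] * 268, name="class")

counts = classes.value_counts()                    # absolute counts per class
fractions = classes.value_counts(normalize=True)   # fractions that sum to 1

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = fractions.max()

print(counts)
print(fractions.round(3))
print(round(majority_baseline, 3))
```

The last number is a useful reference point: a classifier on this dataset should do better than always predicting the majority class.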

## 6. Correlations between attributes

Correlation here means the relationship between two features, i.e. how they may or may not change together.

The most common method for calculating correlation is ***Pearson’s Correlation Coefficient***, which measures the strength of the *linear* relationship between two attributes (it is most meaningful when the attributes are roughly normally distributed): a correlation of -1, 1, or 0 indicates a perfect negative, a perfect positive, or no linear correlation, respectively.
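To make the coefficient less of a black box, here is a small sketch that computes it by hand (covariance of the two attributes divided by the product of their spreads) and compares the result with Pandas. The values are the first five `plas` and `age` entries from the peek above, used only as toy input:

```python
import numpy as np
import pandas as pd

# First five 'plas' and 'age' values from the peek above (toy input).
x = pd.Series([148, 85, 183, 89, 137], name="plas")
y = pd.Series([50, 31, 32, 21, 33], name="age")

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by Pandas.
r_pandas = x.corr(y, method="pearson")

print(r_manual, r_pandas)
```

Both numbers should agree up to floating point rounding, and the result always lies between -1 and 1.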

Some ML algorithms - like linear and logistic regression - can suffer poor performance if there are highly-correlated features in the dataset. So, it is a good idea to review all of the pairwise correlations of the features in your dataset at an early stage.

You can use the `corr()` function on the Pandas DataFrame to calculate a correlation matrix.

In [13]:
set_option('display.width', 100)
set_option('display.precision', 3)
correlations = data.corr(method='pearson')
correlations

         preg   plas   pres    skin    test   mass    pedi     age  class
preg    1.000  0.129  0.141  -0.082  -0.074  0.018  -0.034   0.544  0.222
plas    0.129  1.000  0.153   0.057   0.331  0.221   0.137   0.264  0.467
pres    0.141  0.153  1.000   0.207   0.089  0.282   0.041   0.240  0.065
skin   -0.082  0.057  0.207   1.000   0.437  0.393   0.184  -0.114  0.075
test   -0.074  0.331  0.089   0.437   1.000  0.198   0.185  -0.042  0.131
mass    0.018  0.221  0.282   0.393   0.198  1.000   0.141   0.036  0.293
pedi   -0.034  0.137  0.041   0.184   0.185  0.141   1.000   0.034  0.174
age     0.544  0.264  0.240  -0.114  -0.042  0.036   0.034   1.000  0.238
class   0.222  0.467  0.065   0.075   0.131  0.293   0.174   0.238  1.000


Note that the outcome is a matrix that lists all attributes across the top and down the side, and gives the correlation between all pairs of attributes (twice, in fact: the matrix is symmetrical, so cut it along the diagonal and look at only one half). Needless to say, the diagonal from the top left to the bottom right corner shows 1.000, i.e. the perfect correlation of each attribute with itself.
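Since only one half of the matrix carries information, it can help to keep just the upper triangle and sort the pairs by absolute correlation. The sketch below shows one possible way to do this, on four columns of the first ten rows only (so the values differ from the full matrix above); the variable names `mask` and `pairs` are illustrative:

```python
import numpy as np
import pandas as pd

# Four columns of the first ten rows from the peek above (toy input).
sample = pd.DataFrame({
    "preg":  [6, 1, 8, 1, 0, 5, 3, 10, 2, 8],
    "plas":  [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "age":   [50, 31, 32, 21, 33, 30, 26, 29, 53, 54],
    "class": [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

corr = sample.corr(method="pearson")

# Boolean mask that is True strictly above the diagonal (k=1 skips the diagonal).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep the upper triangle, flatten to (row, column) pairs, drop the masked cells,
# and sort by absolute value so the strongest relationships come first.
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=abs, ascending=False)
)

print(pairs)
```

With 4 columns this yields 6 pairs; with the 9 columns of the full dataset it would yield 36, which is far quicker to scan than the 81 cells of the full matrix.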

## 7. Skew of Univariate Distributions

Skew measures the asymmetry of a distribution: a distribution that you might assume to be Gaussian (normal, or bell curve) but that is actually stretched or squashed in one direction or the other. Many ML algorithms assume a Gaussian distribution. Knowing that an attribute is skewed may allow you to perform proper data preparation to correct the skew and later improve the accuracy of your models.

You can calculate the skew of each attribute using the `skew()` function on the Pandas DataFrame.

In [14]:
skew = data.skew()
print(skew)

preg     0.902
plas     0.174
pres    -1.844
skin     0.109
test     2.272
mass    -0.429
pedi     1.920
age      1.130
class    0.635
dtype: float64


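The skew results are positive when a distribution has a longer right tail (most values are concentrated on the left) and negative when it has a longer left tail (most values are concentrated on the right), compared with a symmetric Gaussian bell shape. For example, `test` (2.272) has a long right tail, while `pres` (-1.844) has a long left tail. Values closer to zero show less skew.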
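As a minimal sketch of what "correcting the skew" could look like, the example below applies a log transform to the first ten `pedi` values from the peek above. The log transform is just one common choice for strictly positive, right-skewed data, not the only option, and ten values are far too few for reliable estimates; the point is only to show the direction of the effect:

```python
import numpy as np
import pandas as pd

# First ten 'pedi' values from the peek above (toy input, all strictly positive).
pedi = pd.Series([0.627, 0.351, 0.672, 0.167, 2.288,
                  0.201, 0.248, 0.134, 0.158, 0.232], name="pedi")

skew_raw = pedi.skew()

# A log transform compresses large values more than small ones,
# which pulls in a long right tail.
pedi_log = np.log(pedi)
skew_log = pedi_log.skew()

print(skew_raw, skew_log)

# With a long right tail, the mean is dragged above the median.
print(pedi.mean(), pedi.median())
```

On this sample the skew stays positive after the transform but becomes clearly smaller, i.e. the distribution is closer to symmetric.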

## Summary

What we did:

* we familiarized ourselves with 7 quick ways (there might be more!) to explore a dataset and to describe and summarize it statistically, before starting to work on an ML project.
* we implemented these concepts by exploiting Pandas' functionalities: easy! Basically one command each!
* remember to look carefully at the numbers, ask yourself whether things really are as they appear, and note down everything that comes to mind, as it might well be useful at a later stage!

## What's next

So far we have worked with numbers only. ***Isn't it frustrating?*** Time to do the same with visualizations!