# Explore your data with descriptive statistics

***Time spent understanding your data (BEFORE APPLYING ANY ML TECHNIQUES) is time well spent***.

Suggestions and guidelines may differ, but here is what we will use as a checklist:

1. Look at the raw data
1. Dimensions
1. Data types
1. Descriptive statistics
1. Distribution of instances across classes
1. Correlations
1. Skews

## 1. Raw data

In [0]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML_basic_AA1920/master/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data
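
Displaying the whole frame becomes unwieldy on larger tables. A minimal sketch of peeking at only the first and last rows follows; the `toy` frame and its values are made up for illustration and merely stand in for the downloaded data.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    'preg': [1, 0, 3, 2, 5, 4],
    'plas': [90, 120, 150, 110, 99, 135],
    'class': [0, 0, 1, 0, 0, 1],
})

first_rows = toy.head(3)  # first 3 rows
last_rows = toy.tail(2)   # last 2 rows
print(first_rows)
print(last_rows)
```

The same two calls should work unchanged on `data`.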

More details can be found e.g. at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data, and are summarized below:

    Pregnancies = Number of times pregnant
    Glucose = Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    BloodPressure = Diastolic blood pressure (mm Hg)
    SkinThickness = Triceps skin fold thickness (mm)
    Insulin = 2-Hour serum insulin (mu U/ml)
    BMI = Body mass index (weight in kg/(height in m)^2)
    DiabetesPedigreeFunction = Diabetes pedigree function
    Age = Age (years)
    Outcome = Class variable (0 or 1)
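
The short column names used when loading the file differ from the descriptive names listed above. The sketch below maps one onto the other, assuming the two lists follow the same column order (an assumption, not something stated by the data file itself). The `toy` frame is made up and carries only three of the nine columns.

```python
import pandas as pd

# Assumed positional correspondence between the short names and the
# descriptive names; verify against the dataset documentation before use.
short_to_long = {
    'preg': 'Pregnancies',
    'plas': 'Glucose',
    'pres': 'BloodPressure',
    'skin': 'SkinThickness',
    'test': 'Insulin',
    'mass': 'BMI',
    'pedi': 'DiabetesPedigreeFunction',
    'age': 'Age',
    'class': 'Outcome',
}

# Made-up frame with a subset of the columns, for illustration only.
toy = pd.DataFrame({'preg': [1, 0], 'plas': [90, 120], 'class': [0, 1]})

# rename() returns a new frame; keys absent from the frame are ignored.
renamed = toy.rename(columns=short_to_long)
print(renamed.columns.tolist())
```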

## 2. Dimensions

In [0]:
shape = data.shape
shape
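
`shape` is a plain `(rows, columns)` tuple, so it can be unpacked directly. A small sketch on a made-up frame (`toy` is illustrative, not the real data):

```python
import pandas as pd

# Made-up frame for illustration only.
toy = pd.DataFrame({
    'preg': [1, 0, 3],
    'plas': [90, 120, 150],
    'class': [0, 0, 1],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```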

## 3. Data types

In [0]:
types = data.dtypes
print(types)
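
One practical use of the data types is spotting columns that were not parsed as numbers, which often hints at stray text or unexpected formatting. A minimal sketch on a made-up frame follows; the `note` column is invented to show the effect.

```python
import pandas as pd

# Made-up frame: two numeric columns and one text column, for illustration only.
toy = pd.DataFrame({
    'preg': [1, 0, 3],
    'mass': [33.6, 26.6, 23.3],
    'note': ['a', 'b', 'c'],
})

print(toy.dtypes)

# Keep only the columns whose dtype is not numeric.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```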

## 4. Descriptive statistics

Help on pandas set_option [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html).


In [0]:
from pandas import set_option

set_option('display.width', 200)
set_option('display.max_rows', 500)
set_option('display.max_columns', 500)
set_option('display.precision', 3)  # the bare 'precision' key is ambiguous in recent pandas versions

In [0]:
description = data.describe()
print(description)
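
The table returned by `describe()` is itself a DataFrame, so individual statistics can be pulled out by row label. As an illustration, the sketch below compares the mean with the median (the `50%` row) on a made-up frame; a large gap between the two is a quick hint of outliers or asymmetry.

```python
import pandas as pd

# Made-up values for illustration only; each column has one large value.
toy = pd.DataFrame({
    'plas': [90, 100, 110, 120, 180],
    'age': [21, 22, 23, 24, 60],
})

summary = toy.describe()

# Rows of the summary are indexed by statistic name.
gap = summary.loc['mean'] - summary.loc['50%']
print(gap)
```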

## 5. Class distribution 
(only for classification problems)

In [0]:
class_counts = data.groupby('class').size()
print(class_counts)
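
Raw counts are often easier to judge as fractions. A minimal sketch using `value_counts(normalize=True)` on a made-up `class` column (the values are illustrative, not the real distribution):

```python
import pandas as pd

# Made-up labels for illustration only: three of class 0, one of class 1.
toy = pd.DataFrame({'class': [0, 0, 0, 1]})

# normalize=True returns fractions that sum to 1 instead of raw counts.
proportions = toy['class'].value_counts(normalize=True)
print(proportions)
```

A strongly unbalanced split is worth noting down, since it may need special handling later in the project.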

## 6. Correlations between attributes

In [0]:
set_option('display.width', 100)
set_option('display.precision', 3)

correlations = data.corr(method='pearson')
correlations
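
The full matrix can be hard to scan. One possible shortcut, sketched here on a made-up frame, is to extract only the column of correlations with the target and sort it. The values in `toy` are invented for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    'plas': [90, 100, 110, 120, 130, 140],
    'age': [50, 30, 45, 25, 40, 35],
    'class': [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every attribute with 'class', strongest positive first.
corr_with_class = (
    toy.corr(method='pearson')['class']
    .drop('class')              # drop the trivial self-correlation of 1.0
    .sort_values(ascending=False)
)
print(corr_with_class)
```

Pearson's coefficient only captures linear relationships, so a value near zero does not by itself mean two attributes are unrelated.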

## 7. Skew of Univariate Distributions

In [0]:
skew = data.skew()
print(skew)
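
To help read the output: a positive skew indicates a longer tail towards large values, a negative skew a longer tail towards small values, and a value near zero a roughly symmetric distribution. A small sketch with made-up columns (names and values are illustrative):

```python
import pandas as pd

# Made-up columns for illustration only.
toy = pd.DataFrame({
    'right_tail': [1, 1, 1, 2, 10],    # one large value stretches the right tail
    'symmetric': [1, 2, 3, 4, 5],      # evenly spread around the mean
    'left_tail': [1, 9, 10, 10, 10],   # one small value stretches the left tail
})

skews = toy.skew()
print(skews)
```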

## Summary

What we did:

* we became familiar with 7 quick ways (there may well be more!) to explore a dataset and summarize it statistically before starting an ML project.
* we implemented them with Pandas: easy, basically one command each.
* remember to look carefully at the numbers, ask yourself whether things really are as they appear, and note down everything that comes to mind, as it may well be useful at a later stage!

## What's next

We did all of this with numbers only. ***Isn't it frustrating?*** Time to do it with visualizations!