You must understand your data in order to get the best results. In this lesson you will discover
seven recipes that you can use in Python to better understand your machine learning data. After
reading this lesson you will know how to:

1. Take a peek at your raw data.
2. Review the dimensions of your dataset.
3. Review the data types of attributes in your data.
4. Summarize your data using descriptive statistics.
5. Summarize the distribution of instances across classes in your dataset.
6. Understand the relationships in your data using correlations.
7. Review the skew of the distributions of each attribute.

Each recipe is demonstrated on the Pima Indians Diabetes classification dataset from the
UCI Machine Learning Repository, which is loaded once at the start. Open your Python
interactive environment and try each recipe out in turn. Let’s get started.

## Load Data

In [3]:
# Load CSV using Pandas
from pandas import read_csv
filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
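
The `names` argument supplies the column names by hand instead of reading them from the file. The following is a minimal sketch of that behavior, assuming the CSV has no header row; the first three records from the peek output are reused as an in-memory stand-in for the file, and `csv_text` and `sample` are illustrative names only.

```python
from io import StringIO
from pandas import read_csv

# In-memory stand-in for 'pima-indians-diabetes.csv' (assumption: the file
# has no header row, so the column names must be passed explicitly).
csv_text = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# With names given and no header argument, the first line is treated as data.
sample = read_csv(csv_text, names=names)
print(sample.shape)
```

If the first record were silently consumed as a header, the row count would come out one short, so comparing `shape` against the expected number of records is a quick sanity check.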

### 1. Peek at Your Data

In [5]:
data.head(10)

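|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |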


### 2. Dimensions of Your Data

In [7]:
data.shape

(768, 9)

### 3. Data Type For Each Attribute

In [9]:
data.dtypes

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object

### 4. Descriptive Statistics

In [12]:
from pandas import set_option
set_option('display.width', 100)
set_option('display.precision', 3)
data.describe()

|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845 | 120.895 | 69.105 | 20.536 | 79.799 | 31.993 | 0.472 | 33.241 | 0.349 |
| std | 3.37 | 31.973 | 19.356 | 15.952 | 115.244 | 7.884 | 0.331 | 11.76 | 0.477 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.244 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.372 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.626 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


### 5. Class Distribution (Classification Only)

In [16]:
data.groupby('class').size()

class
0    500
1    268
dtype: int64
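
Raw counts are often easier to compare as proportions. The following is a minimal sketch, assuming a stand-in `class` column rebuilt from the counts shown above (500 zeros and 268 ones) rather than the loaded file; with the real dataset the same call could be made on `data['class']`. The name `labels` is illustrative only.

```python
import pandas as pd

# Stand-in for data['class'], rebuilt from the counts printed above
# (assumption: the real column would come from the loaded CSV).
labels = pd.Series([0] * 500 + [1] * 268, name='class')

# value_counts(normalize=True) returns the fraction of rows in each class.
proportions = labels.value_counts(normalize=True).sort_index()
print(proportions.round(3))
```

About 65% of the observations fall in class 0, which agrees with the mean of 0.349 reported for `class` in the descriptive statistics.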

### 6. Correlations Between Attributes

In [18]:
data.corr(method='pearson')

|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| preg | 1.0 | 0.129 | 0.141 | -0.082 | -0.074 | 0.018 | -0.034 | 0.544 | 0.222 |
| plas | 0.129 | 1.0 | 0.153 | 0.057 | 0.331 | 0.221 | 0.137 | 0.264 | 0.467 |
| pres | 0.141 | 0.153 | 1.0 | 0.207 | 0.089 | 0.282 | 0.041 | 0.24 | 0.065 |
| skin | -0.082 | 0.057 | 0.207 | 1.0 | 0.437 | 0.393 | 0.184 | -0.114 | 0.075 |
| test | -0.074 | 0.331 | 0.089 | 0.437 | 1.0 | 0.198 | 0.185 | -0.042 | 0.131 |
| mass | 0.018 | 0.221 | 0.282 | 0.393 | 0.198 | 1.0 | 0.141 | 0.036 | 0.293 |
| pedi | -0.034 | 0.137 | 0.041 | 0.184 | 0.185 | 0.141 | 1.0 | 0.034 | 0.174 |
| age | 0.544 | 0.264 | 0.24 | -0.114 | -0.042 | 0.036 | 0.034 | 1.0 | 0.238 |
| class | 0.222 | 0.467 | 0.065 | 0.075 | 0.131 | 0.293 | 0.174 | 0.238 | 1.0 |
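
A full correlation matrix can be hard to scan, so it may help to list each attribute pair once, ranked by strength. The following is a minimal sketch, assuming a four-attribute subset of the matrix above typed in by hand; with the real data, `corr` could instead be `data.corr(method='pearson')`. The names `upper`, `pairs`, and `ranked` are illustrative only.

```python
import numpy as np
import pandas as pd

# Subset of the correlation matrix printed above (four attributes only),
# copied by hand for illustration.
cols = ['preg', 'plas', 'age', 'class']
corr = pd.DataFrame(
    [[1.000, 0.129, 0.544, 0.222],
     [0.129, 1.000, 0.264, 0.467],
     [0.544, 0.264, 1.000, 0.238],
     [0.222, 0.467, 0.238, 1.000]],
    index=cols, columns=cols)

# Keep only the cells above the diagonal so each pair appears once and
# the trivial self-correlations are dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this subset the strongest pair is `preg` and `age`, followed by `plas` and `class`.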


### 7. Skew of Univariate Distributions

In [20]:
data.skew()

preg     0.902
plas     0.174
pres    -1.844
skin     0.109
test     2.272
mass    -0.429
pedi     1.920
age      1.130
class    0.635
dtype: float64
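
A positive value indicates a distribution with a longer right tail, a negative value a longer left tail, and values near zero suggest little skew. The following is a minimal sketch for flagging the most skewed attributes, assuming the values are copied from the output above and that an absolute skew greater than 1 counts as strong; that cutoff is a rough rule of thumb chosen for illustration, not one stated in this lesson.

```python
import pandas as pd

# Skew values copied from the output above; with the real data this
# Series could be obtained from data.skew().
skew = pd.Series({
    'preg': 0.902, 'plas': 0.174, 'pres': -1.844, 'skin': 0.109,
    'test': 2.272, 'mass': -0.429, 'pedi': 1.920, 'age': 1.130,
    'class': 0.635,
})

# Assumption: |skew| > 1 is treated as strongly skewed (illustrative cutoff).
threshold = 1.0
strongly_skewed = skew[skew.abs() > threshold]
print(strongly_skewed)
```

With this cutoff the flagged attributes are `pres`, `test`, `pedi`, and `age`.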

### Tips To Remember

This section gives you some tips to remember when reviewing your data using summary statistics.

- **Review the numbers**. Generating the summary statistics is not enough. Take a moment to pause, read and really think about the numbers you are seeing.

- **Ask why**. Review your numbers and ask a lot of questions. How and why are you seeing specific numbers? Think about how the numbers relate to the problem domain in general and to the specific entities that the observations describe.

- **Write down ideas**. Write down your observations and ideas. Keep a small text file or notepad and jot down all of your ideas about how variables may relate, what the numbers mean, and which techniques to try later. The things you write down now, while the data is fresh, will be very valuable later when you are trying to think up new things to try.