# Understand Your Data With Descriptive Statistics

You must understand your data in order to get the best results.

#### Things to do with your data:

**1. Take a peek at your raw data.** <br>
**2. Review the dimensions of your dataset.** <br>
**3. Review the data types of attributes in your data.** <br>
**4. Summarize your data using descriptive statistics.** <br>
**5. Summarize the distribution of instances across classes in your dataset.** <br>
**6. Understand the relationships in your data using correlations.** <br>
**7. Review the skew of the distributions of each attribute.** <br>

## 1. Peek at Your Data: 

There is no substitute for looking at the raw data. <br>
Looking at the raw data can reveal insights that you cannot get any other way. <br>
You can review the first 20 rows of your data using the **head()** function on the Pandas DataFrame.

#### CODE:

# View first 20 rows

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names
data = pd.read_csv(filename, names=names)

peek = data.head(20)

print(peek)

     preg    plas    pres    skin    test    mass    pedi    age    class 
0        6     148      72      35       0    33.6   0.627     50        1
1        1      85      66      29       0    26.6   0.351     31        0
2        8     183      64       0       0    23.3   0.672     32        1
3        1      89      66      23      94    28.1   0.167     21        0
4        0     137      40      35     168    43.1   2.288     33        1
5        5     116      74       0       0    25.6   0.201     30        0
6        3      78      50      32      88    31.0   0.248     26        1
7       10     115       0       0       0    35.3   0.134     29        0
8        2     197      70      45     543    30.5   0.158     53        1
9        8     125      96       0       0     0.0   0.232     54        1
10       4     110      92       0       0    37.6   0.191     30        0
11      10     168      74       0       0    38.0   0.537     34        1
12      10     139      8

You can see that the first column lists the row number, which is handy for referencing a
specific observation.
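
As a minimal sketch of the same idea without the CSV file on disk, the snippet below reads three rows retyped from the output above out of an in-memory string. The names `csv_text`, `sample`, and `row_one` are illustrative assumptions, not part of the original code.

```python
import io
import pandas as pd

# Assumption: three rows retyped from the printed output above, so the
# example runs without the CSV file on disk.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
sample = pd.read_csv(io.StringIO(csv_text), names=names)

peek = sample.head(2)      # first two rows only
row_one = sample.loc[1]    # look up one observation by its row number

print(peek)
print(row_one)
```

Here `loc[1]` uses the row number from the first column to pull out a single observation.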

## 2. Dimensions of Your Data:

You must have a very good handle on how much data you have, both in terms of rows and
columns.<br>
1. Too many rows and algorithms may take too long to train. Too few and perhaps you do
not have enough data to train the algorithms.<br>
2. Too many features and some algorithms can be distracted or suffer poor performance due
to the curse of dimensionality.

#### CODE:

# Dimensions of your data

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names
data = pd.read_csv(filename, names=names)

shape = data.shape

print(shape)

(768, 9)


The results are listed in rows then columns. You can see that the dataset has 768 rows and
9 columns.
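
Because the shape is a plain (rows, columns) tuple, it can be unpacked into two variables. The sketch below uses a hypothetical toy DataFrame (`toy`) rather than the diabetes dataset.

```python
import pandas as pd

# Hypothetical toy frame with 3 rows and 2 columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")
```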

## 3. Data Type For Each Attribute:

The type of each attribute is important. <br>
Strings may need to be converted to floating point values or integers to represent categorical or ordinal values. <br>
You can get an idea of the types of attributes by peeking at the raw data. You can also list the data types used by the DataFrame to characterize each attribute using the **dtypes** property.

#### CODE:

# Data Types for Each Attribute

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names
data = pd.read_csv(filename, names=names)

types = data.dtypes

print(types)

preg        int64
plas        int64
pres        int64
skin        int64
test        int64
mass      float64
pedi      float64
age         int64
class       int64
dtype: object


You can see that most of the attributes are integers and that mass and pedi are floating
point types.
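
The diabetes dataset is already numeric, so no conversion is needed here. As a rough sketch of what converting strings might look like, the snippet below uses a hypothetical toy DataFrame (`toy`, with made-up `size` and `price` columns) and two Pandas tools: `pd.to_numeric` for numbers stored as text and `pd.Categorical` for ordinal labels.

```python
import pandas as pd

# Hypothetical toy data: both columns are read as strings (dtype object).
toy = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'price': ['1.5', '2.0', '3.25', '1.0'],
})

# Numbers stored as text become floating point values.
toy['price'] = pd.to_numeric(toy['price'])

# Ordinal labels become integer codes that follow the given order.
size_order = ['small', 'medium', 'large']
toy['size_code'] = pd.Categorical(
    toy['size'], categories=size_order, ordered=True
).codes

print(toy.dtypes)
print(toy)
```

Listing the categories explicitly keeps the integer codes in a meaningful order instead of the default alphabetical one.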

## 4. Descriptive Statistics:

Descriptive statistics can give you great insight into the shape of each attribute. <br>
The **describe()** function on the Pandas DataFrame lists 8 statistical properties of each attribute. 
<br>They are:
1. Count.
2. Mean.
3. Standard Deviation.
4. Minimum Value.
5. 25th Percentile.
6. 50th Percentile (Median).
7. 75th Percentile.
8. Maximum Value.
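
To make the eight properties listed above concrete, the sketch below computes each one separately and compares the result with **describe()**. The Series `values` holds the first eight `preg` values from the peek in section 1; the names `values` and `summary` are illustrative assumptions.

```python
import pandas as pd

# First eight 'preg' values, retyped from the peek in section 1.
values = pd.Series([6, 1, 8, 1, 0, 5, 3, 10], name='preg')

summary = pd.Series({
    'count': values.count(),
    'mean': values.mean(),
    'std': values.std(),            # sample standard deviation (ddof=1)
    'min': values.min(),
    '25%': values.quantile(0.25),
    '50%': values.quantile(0.50),   # the median
    '75%': values.quantile(0.75),
    'max': values.max(),
}).astype(float)

print(summary)
print(values.describe())  # should list the same eight numbers
```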

#### CODE:

# Statistical Summary

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names
data = pd.read_csv(filename, names=names)

description = data.describe()

print(description)

            preg        plas        pres        skin        test        mass   \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000   
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578   
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000   
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000   
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000   
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000   

            pedi         age       class   
count  768.000000  768.000000  768.000000  
mean     0.471876   33.240885    0.348958  
std      0.331329   11.760232    0.476951  
min      0.078000   21.000000    0.000000  
25%      0.243750   24.000000    0.000000  
50%   

## 5. Class Distribution (Classification Only):

On classification problems you need to know how balanced the class values are.<br> Highly imbalanced
problems (a lot more observations for one class than another) are common and may need special
handling in the data preparation stage of your project.<br> You can quickly get an idea of the
distribution of the class attribute in Pandas by calling **groupby('class').size()** on the DataFrame.

#### CODE:

# Class Distribution

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names
data = pd.read_csv(filename, names=names)

class_counts = data.groupby('class').size()

print(class_counts)

class
0    500
1    268
dtype: int64


You can see that there are nearly double the number of observations with class 0 (no onset
of diabetes) than there are with class 1 (onset of diabetes).
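
Proportions are often easier to compare than raw counts. As a small sketch, the snippet below uses `value_counts(normalize=True)` on the first ten class values from the peek in section 1. The names `labels`, `counts`, and `proportions` are illustrative assumptions.

```python
import pandas as pd

# First ten 'class' values, retyped from the peek in section 1.
labels = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 1], name='class')

counts = labels.value_counts().sort_index()                      # rows per class
proportions = labels.value_counts(normalize=True).sort_index()   # share per class

print(counts)
print(proportions)
```

On the full dataset, the counts of 500 and 268 out of 768 rows correspond to shares of roughly 0.65 and 0.35.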

## 6. Correlations Between Attributes:

Correlation refers to the relationship between two variables and how they may or may not
change together. <br> The most common method for calculating correlation is **Pearson’s Correlation
Coefficient**, which assumes a normal distribution of the attributes involved.<br> A correlation of -1
or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no
correlation at all. <br>Some machine learning algorithms like linear and logistic regression can suffer
poor performance if there are highly correlated attributes in your dataset. As such, it is a good
idea to review all of the pairwise correlations of the attributes in your dataset. <br> You can use the
**corr()** function on the Pandas DataFrame to calculate a correlation matrix.

#### CODE:

# Pairwise Pearson correlations

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names
data = pd.read_csv(filename, names=names)

correlations = data.corr(method='pearson')

print(correlations)

            preg      plas      pres      skin      test      mass      pedi   \
 preg    1.000000  0.129459  0.141282 -0.081672 -0.073535  0.017683 -0.033523   
 plas    0.129459  1.000000  0.152590  0.057328  0.331357  0.221071  0.137337   
 pres    0.141282  0.152590  1.000000  0.207371  0.088933  0.281805  0.041265   
 skin   -0.081672  0.057328  0.207371  1.000000  0.436783  0.392573  0.183928   
 test   -0.073535  0.331357  0.088933  0.436783  1.000000  0.197859  0.185071   
 mass    0.017683  0.221071  0.281805  0.392573  0.197859  1.000000  0.140647   
 pedi   -0.033523  0.137337  0.041265  0.183928  0.185071  0.140647  1.000000   
 age     0.544341  0.263514  0.239528 -0.113970 -0.042163  0.036242  0.033561   
 class   0.221898  0.466581  0.065068  0.074752  0.130548  0.292695  0.173844   

             age     class   
 preg    0.544341  0.221898  
 plas    0.263514  0.466581  
 pres    0.239528  0.065068  
 skin   -0.113970  0.074752  
 test   -0.042163  0.130548  
 mass    

The matrix lists all attributes across the top and down the side, to give correlation between
all pairs of attributes (twice, because the matrix is symmetrical).<br> The diagonal running from
the top left to the bottom right corner of the matrix shows the perfect correlation (1.0) of each
attribute with itself.
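
To show what a single entry of the matrix represents, the sketch below computes Pearson's coefficient for one pair of attributes directly from its definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with Pandas. It uses the first eight `preg` and `age` values from the peek in section 1; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# First eight 'preg' and 'age' values, retyped from the peek in section 1.
x = pd.Series([6, 1, 8, 1, 0, 5, 3, 10], name='preg')
y = pd.Series([50, 31, 32, 21, 33, 30, 26, 29], name='age')

# Deviations of each value from its column mean.
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: shared variation divided by the product of the spreads.
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = x.corr(y, method='pearson')

print(r_manual, r_pandas)  # the two values should agree
```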

## 7. Skew of Univariate Distributions:

Skew describes how a distribution that is assumed to be Gaussian (normal, or bell curve) is shifted or
squashed in one direction or another.<br> Many machine learning algorithms assume a Gaussian
distribution. Knowing that an attribute has a skew may allow you to perform data preparation
to correct the skew and later improve the accuracy of your models.<br> You can calculate the skew
of each attribute using the **skew()** function on the Pandas DataFrame.

#### CODE:

# Skew for each attribute

import pandas as pd

filename = '/home/ubuntu/Desktop/ML/Machine Learning With Python/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']  # Column names
data = pd.read_csv(filename, names=names)

skew = data.skew()

print(skew)

preg      0.901674
plas      0.173754
pres     -1.843608
skin      0.109372
test      2.272251
mass     -0.428982
pedi      1.919911
age       1.129597
class     0.635017
dtype: float64


The skew results show a positive (right) or negative (left) skew. Values closer to zero show
less skew.
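
The text does not say which data preparation step to use for correcting skew. As one possible sketch, the snippet below applies a log transform (`np.log1p`, an assumption chosen here for illustration) to a small, made-up right-skewed Series and compares the skew before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data: mostly small values plus two large ones.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50, 200])

skew_before = values.skew()

# Assumption: log1p (log of 1 + x) is one common way to compress a long
# right tail; it is not necessarily the best choice for every attribute.
transformed = np.log1p(values)
skew_after = transformed.skew()

print(skew_before, skew_after)  # skew_after should be closer to zero
```

A log transform only helps with positive (right) skew and requires non-negative values, so it would not suit an attribute such as `pres`, which has a negative skew.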