# Load Your Data

* The Pima Indians dataset is used to demonstrate data loading in this exercise. It will also be used in many of the lessons to come. The dataset contains medical records for Pima Indian patients and records whether or not each patient had an onset of diabetes within five years.

### Fun Time: The Pima Indians dataset is (1) a classification problem (2) a regression problem (3) None of the above

* The Pima Indians dataset is a good dataset for demonstration because all of the input attributes are numeric and the output variable to be predicted is binary (0 or 1). The data is freely available from the UCI Machine Learning Repository.

## Load CSV Files with NumPy
* You can load your CSV data using NumPy and the **numpy.loadtxt()** function. This function assumes that there is no header row and that all data has the same format. The example below assumes that the file **pima-indians-diabetes.data.csv** is in your current working directory.

In [None]:
# Load CSV using NumPy
from numpy import loadtxt
filename = 'pima-indians-diabetes.data.csv'
# Open the file in a with-block so it is closed once loading finishes
with open(filename, 'rt') as raw_data:
    data = loadtxt(raw_data, delimiter=",")
print(type(data))
print(data.shape)
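
* The following is a minimal sketch of how **numpy.loadtxt()** treats header rows, using an in-memory file instead of the real dataset. The three rows are made-up values in the same nine-column layout (an assumption for illustration, not records from the Pima Indians file). The `skiprows` argument is one way to handle a file that does start with a header row.

```python
# Sketch: numpy.loadtxt on an in-memory CSV (made-up rows, not real records)
from io import StringIO
from numpy import loadtxt

csv_text = (
    "1,90,70,20,80,25.0,0.3,30,0\n"
    "2,150,80,30,0,33.5,0.6,45,1\n"
    "0,110,60,25,100,28.1,0.2,22,0\n"
)

# No header row: every line is parsed as a row of floats
data = loadtxt(StringIO(csv_text), delimiter=",")
print(type(data))
print(data.shape)

# With a header row, skip it explicitly so the text labels are not parsed as numbers
with_header = "preg,plas,pres,skin,test,mass,pedi,age,class\n" + csv_text
data_h = loadtxt(StringIO(with_header), delimiter=",", skiprows=1)
print(data_h.shape)
```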

## Load CSV Files with Pandas
* You can load your CSV data using Pandas and the **pandas.read_csv()** function. This function is very flexible and is the approach I recommend for loading your machine learning data. The function returns a **pandas.DataFrame** that you can immediately start summarizing and plotting. The example below assumes that the **pima-indians-diabetes.data.csv** file is in the current working directory.

In [None]:
# Load CSV using Pandas
from pandas import read_csv
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(type(data))
print(data.shape)
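
* The sketch below illustrates why the `names` argument matters for a file with no header row. It reads made-up rows (an assumption for illustration, not real records) from an in-memory string. Without `names`, **read_csv()** treats the first record as column labels and one row of data is lost.

```python
# Sketch: read_csv with and without explicit column names (made-up rows)
from io import StringIO
from pandas import read_csv

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
rows = (
    "1,90,70,20,80,25.0,0.3,30,0\n"
    "2,150,80,30,0,33.5,0.6,45,1\n"
    "0,110,60,25,100,28.1,0.2,22,0\n"
)

# Passing names tells pandas the file has no header row
df = read_csv(StringIO(rows), names=names)
print(df.shape)

# Without names, the first record is consumed as the header
df_wrong = read_csv(StringIO(rows))
print(df_wrong.shape)
```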

# Understand Your Data With Descriptive Statistics

## Peek at Your Data
* There is no substitute for looking at the raw data. Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better pre-process and handle the data for machine learning tasks. You can review the first 20 rows of your data using the head() function on the Pandas **DataFrame**.

* preg: Number of times pregnant
* plas: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* pres: Diastolic blood pressure (mm Hg)
* skin: Triceps skin fold thickness (mm)
* test: 2-Hour serum insulin (mu U/ml)
* mass: Body mass index (weight in kg/(height in m)^2)
* pedi: Diabetes pedigree function
* age: Age (years)
* class: Class variable (0 or 1), whether or not the patient had an onset of diabetes within five years

In [None]:
# View first 20 rows
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
peek = data.head(20)
print(peek)

## Dimensions of Your Data
* You can review the shape and size of your dataset by printing the shape property on the Pandas **DataFrame**.

In [None]:
# Dimensions of your data
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
shape = data.shape
print(shape)

### Fun Time: What does the number of columns (in this case 9) stand for? (1) number of features (2) number of features and target (3) sample size (4) None of the above

## Data Type For Each Attribute
* The type of each attribute is important. Strings may need to be converted to floating point values or integers to represent categorical or ordinal values. 
* You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the **DataFrame** to characterize each attribute using the **dtypes** property.

In [None]:
# Data Types for Each Attribute
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
types = data.dtypes
print(types)
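
* As a rough illustration of converting strings to integers for ordinal values, the sketch below uses a hypothetical `risk` column that is not part of the Pima Indians dataset. The label-to-integer mapping is an assumption chosen by hand so that the order low < medium < high is preserved.

```python
# Sketch: turning a string column into ordinal integers (hypothetical data)
from pandas import DataFrame

df = DataFrame({"age": [30, 45, 22], "risk": ["low", "high", "medium"]})
print(df.dtypes)

# Map each label to an integer that keeps the intended order
order = {"low": 0, "medium": 1, "high": 2}
df["risk_code"] = df["risk"].map(order)
print(df.dtypes)
```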

## Descriptive Statistics
* Descriptive statistics can give you great insight into the shape of each attribute. Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute. 
* For numeric data, the result’s index will include count, mean, std, min and max, as well as the lower, 50th and upper percentiles. By default the lower percentile is the 25th and the upper percentile is the 75th. The 50th percentile is the same as the median.

In [None]:
# Statistical Summary
from pandas import read_csv
from pandas import set_option
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('display.precision', 3)
description = data.describe()
print(description)
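
* A small sketch of the points above, using a single made-up column rather than the real dataset: the default summary has 8 rows, the 50% row matches the median, and the `percentiles` argument changes which lower and upper percentiles are reported.

```python
# Sketch: describe() on a tiny made-up column
from pandas import DataFrame

df = DataFrame({"x": [1, 2, 3, 4, 100]})

# Default summary: count, mean, std, min, 25%, 50%, 75%, max
description = df.describe()
print(description)

# The 50% row is the median of the column
median = df["x"].median()
print(median)

# Ask for the 10th and 90th percentiles instead of the 25th and 75th
custom = df.describe(percentiles=[0.1, 0.5, 0.9])
print(list(custom.index))
```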

## Class Distribution (Classification Only)
* On classification problems you need to know how balanced the class values are. 
* Highly imbalanced problems (a lot more observations for one class than another) are **very common** in practice and may need special handling in the data preparation stage of your project. 
* You can quickly get an idea of the distribution of the class attribute in Pandas.

In [None]:
# Class Distribution
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
class_counts = data.groupby('class').size()
print(class_counts)
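
* Raw counts are easier to compare as proportions. The sketch below uses a made-up class column (an assumption for illustration, not the real class values) and shows both the counts and the share of each class via `value_counts(normalize=True)`.

```python
# Sketch: class counts and class proportions on made-up labels
from pandas import DataFrame

df = DataFrame({"class": [0, 0, 0, 1, 0, 1, 0, 0]})

# Number of observations per class
class_counts = df.groupby("class").size()
print(class_counts)

# Fraction of observations per class
class_share = df["class"].value_counts(normalize=True)
print(class_share)
```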

## Correlations Between Attributes
* Correlation refers to the relationship between two variables and how they may or may not change together. 
* The most common method for calculating correlation is **Pearson’s Correlation Coefficient**, which assumes a normal distribution of the attributes involved. 
* A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all. 
* Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pairwise correlations of the attributes in your dataset. You can use the **corr()** function on the Pandas **DataFrame** to calculate a correlation matrix.

In [None]:
# Pairwise Pearson correlations
from pandas import read_csv
from pandas import set_option
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
set_option('display.precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
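
* To make the coefficient less of a black box, the sketch below computes Pearson's correlation by hand (the covariance of the two columns divided by the product of their spreads) and compares it with **corr()**. The two columns are made-up values, not attributes from the dataset.

```python
# Sketch: Pearson correlation by hand versus DataFrame.corr (made-up data)
import numpy as np
from pandas import DataFrame

df = DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                "y": [2.1, 3.9, 6.2, 8.0, 9.8]})

x = df["x"].to_numpy()
y = df["y"].to_numpy()

# Deviations of each value from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Sum of co-deviations, normalized by the size of both deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = df.corr(method="pearson").loc["x", "y"]
print(r_manual, r_pandas)
```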

## Skew of Univariate Distributions
* Skew describes how asymmetric a distribution is: a distribution that would otherwise look Gaussian (normal or bell curve) but is shifted or stretched in one direction or the other. 
* Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models. 
* You can calculate the skew of each attribute using the **skew()** function on the Pandas **DataFrame**.

In [None]:
# Skew for each attribute
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
skew = data.skew()
print(skew)
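
* One possible way to correct a positive skew is a log transform. This is an assumption on my part, since no specific method is named above, and it only applies to strictly positive values. The sketch uses made-up values that double at each step, so the logged values are evenly spaced and the skew drops to about zero.

```python
# Sketch: reducing positive skew with a log transform (made-up values)
import numpy as np
from pandas import Series

values = Series([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)

# Long right tail, so the skew is positive
skew_before = values.skew()

# The log of doubling values is evenly spaced, hence symmetric
logged = np.log(values)
skew_after = logged.skew()

print(skew_before, skew_after)
```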

# Understand Your Data With Visualization

## Histograms
* A fast way to get an idea of the distribution of each attribute is to look at histograms. 
* Histograms group data into bins and provide you a count of the number of observations in each bin. 

In [None]:
# Univariate Histograms
%matplotlib inline
from matplotlib import pyplot
from pandas import read_csv
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
data.hist()
pyplot.show()
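
* The binning that a histogram performs can be sketched without any plotting. The example below uses `numpy.histogram` on a handful of made-up values (not from the dataset) to show the bin edges and the count of observations that fall in each bin.

```python
# Sketch: what a histogram computes, using numpy.histogram on made-up values
import numpy as np

values = np.array([1, 2, 2, 3, 5, 8, 9, 10])

# Split the range from min to max into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin boundaries
print(counts)  # number of observations per bin
```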

### Fun Time: From the shape of the bins you can quickly (1) get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution (2) see possible outliers (3) all of the above (4) none of the above.

## Density Plots
* Density plots are another way of getting a quick idea of the distribution of each attribute. The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much like your eye tried to do with the histograms.

In [None]:
# Univariate Density Plots
%matplotlib inline
from matplotlib import pyplot
from pandas import read_csv
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()
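
* The smooth curve in a density plot can be thought of as one small Gaussian bump per observation, averaged together. The sketch below builds such a curve by hand with NumPy on made-up values. The `bandwidth` (the width of each bump) is a hypothetical value picked by hand here, whereas Pandas relies on SciPy to choose it automatically, so the curves will not match exactly.

```python
# Sketch: a hand-rolled Gaussian kernel density estimate (made-up values)
import numpy as np

values = np.array([1.0, 2.0, 2.5, 3.0, 7.0])
bandwidth = 1.0  # hypothetical smoothing width, chosen by hand

# Points at which the density curve is evaluated
grid = np.linspace(-10.0, 20.0, 3001)

# Scaled distance from every grid point to every observation
diffs = (grid[:, None] - values[None, :]) / bandwidth

# Average of one Gaussian bump per observation
density = np.exp(-0.5 * diffs ** 2).sum(axis=1) / (
    len(values) * bandwidth * np.sqrt(2.0 * np.pi)
)

# A density curve should enclose an area of about 1
area = density.sum() * (grid[1] - grid[0])
print(area)
```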

## Box and Whisker Plots
* Another useful way to review the distribution of each attribute is to use Box and Whisker Plots or boxplots for short. 
* Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). 
* The whiskers give an idea of the spread of the data, and dots outside of the whiskers show candidate outlier values (values that lie more than 1.5 times the interquartile range, the spread of the middle 50% of the data, beyond the edges of the box).

In [None]:
# Box and Whisker Plots
%matplotlib inline
from matplotlib import pyplot
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()
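
* The outlier rule behind the dots can be sketched numerically. The example below applies the 1.5 times interquartile range rule to made-up values (not from the dataset), using `numpy.percentile` for the 25th and 75th percentiles. It is an illustration of the rule, not a reproduction of the plotting code.

```python
# Sketch: flagging candidate outliers with the 1.5 * IQR rule (made-up values)
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Edges of the box: 25th and 75th percentiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything beyond these fences would be drawn as a dot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr)
print(outliers)
```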