# Explore your data with visualization

Data visualization is one of the most useful tools in data science. Of the many plot types available, we will use:

* ***Univariate*** plots
   * Histograms
   * Density Plots
   * Box and Whisker Plots
* ***Multivariate*** plots
   * Correlation matrix plots
   * Scatter plot matrix

## Univariate plots

### Histograms

Histograms are a fast way to get an idea of the distribution of each attribute. They group the data into bins and show a count of the observations falling into each bin. From the heights of the bars you can quickly get a feeling for whether an attribute is Gaussian, skewed or exponentially distributed. Histograms can also help you spot possible outliers.
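To make the binning concrete, here is a minimal sketch of the counting a histogram performs, written with the standard library only. The helper `histogram_counts`, the synthetic Gaussian sample and the choice of 10 equal-width bins are assumptions made for illustration; they are not part of the dataset used in this notebook.

```python
import random

def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    if width == 0:
        # all values are identical: put everything in the first bin
        counts[0] = len(values)
        return counts
    for v in values:
        # the maximum value belongs to the last bin, not to a bin past the end
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

random.seed(0)
sample = [random.gauss(50, 10) for _ in range(1000)]  # synthetic data
counts = histogram_counts(sample, 10)
for i, c in enumerate(counts):
    # one '#' per 10 observations gives a rough text histogram
    print(f"bin {i}: {'#' * (c // 10)} {c}")
```

For a roughly Gaussian sample like this one, the middle bins should hold the largest counts and the outer bins the smallest.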

In [None]:
from matplotlib import pyplot
from pandas import read_csv

In [None]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# set the figure size before plotting, otherwise it does not apply to this figure
pyplot.rcParams["figure.figsize"] = [16,16]
data.hist()
pyplot.show()

For example, it is quickly visible that:
* the attributes "age", "pedi" and "test" may have an exponential distribution
* the attributes "mass", "pres" and "plas" may have a Gaussian (or nearly Gaussian) distribution

This is interesting because many ML techniques assume that each input variable has a Gaussian univariate distribution.
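A visual impression of skew can be backed up with a number. The sketch below computes the sample skewness (third central moment divided by the variance to the power 1.5) on two small made-up lists; the helper name `sample_skewness` and the data are assumptions for illustration. A value near 0 suggests a symmetric distribution, while a clearly positive value indicates a long tail towards larger values.

```python
def sample_skewness(values):
    """Sample skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]         # made-up, symmetric around 3
right_skewed = [1, 1, 2, 2, 3, 20]  # made-up, long tail to the right

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

On a real DataFrame, pandas offers `DataFrame.skew()` for the same purpose; it uses a bias-corrected variant, so its values differ slightly from this simple formula.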

### Density plots

Density plots are another way of getting a quick idea of the distribution of each attribute. They look like an "abstracted histogram" with a smooth curve drawn through the top of each bin.

In [None]:
from matplotlib import pyplot
from pandas import read_csv

In [None]:
# one density plot per attribute, each with its own x axis
# (kind='density' requires SciPy to be installed)
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()

Thanks to the smoothing, the distribution of each attribute is clearer than in the histograms. The curve does what your eye tried to do with the histograms, so it is more than a cosmetic change: with many histograms, or with a complicated one, it can save you some time.
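The smooth curve in a pandas density plot is a kernel density estimate built from Gaussian kernels. As a rough sketch of the idea, the code below places one Gaussian bump on every observation and averages them. The helper name `gaussian_kde_sketch`, the sample values and the fixed bandwidth of 5.0 are assumptions chosen for illustration; pandas selects the bandwidth automatically.

```python
import math

def gaussian_kde_sketch(sample, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # each observation contributes a Gaussian centred on itself
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

sample = [21, 22, 22, 25, 29, 31, 33, 41, 50, 62]  # made-up values
density = gaussian_kde_sketch(sample, bandwidth=5.0)

# evaluate on a grid from 0 to 100 and check that the curve integrates to about 1
step = 0.5
grid = [i * step for i in range(201)]
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A smaller bandwidth follows the data more closely and gives a bumpier curve; a larger one smooths more and can hide structure.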

### Box and Whisker plots

Another useful type of plot is the box and whisker plot (boxplot for short). Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box from the 25th to the 75th percentile (the middle 50% of the data). The whiskers give an idea of the spread of the data, and dots outside the whiskers show candidate outlier values (values that lie more than 1.5 times the height of the box, i.e. the spread of the middle 50% of the data, beyond the edge of the box).

These plots are very useful for quickly spotting skewed attributes.
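The numbers behind a boxplot can be computed directly. The sketch below derives the quartiles, the interquartile range (the height of the box) and the outlier fences at 1.5 times that range beyond the box edges, which is Matplotlib's default whisker setting. The helper name `box_summary` and the sample values are assumptions for illustration.

```python
import statistics

def box_summary(values, whis=1.5):
    """Quartiles, fences and candidate outliers as drawn in a boxplot."""
    # method="inclusive" interpolates linearly between the observed values
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                   # height of the box
    lower_fence = q1 - whis * iqr   # points below this are candidate outliers
    upper_fence = q3 + whis * iqr   # points above this are candidate outliers
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outliers": outliers,
    }

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value
summary = box_summary(values)
print(summary)
```

Here the box spans 3.25 to 7.75, the fences sit at -3.5 and 14.5, and only the value 100 is flagged. The whiskers themselves end at the most extreme data points that still lie inside the fences.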

In [None]:
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()

The spread of the attributes differs considerably from one to another. Some attributes, like "age", "test" and "skin", appear quite skewed towards smaller values.

## Multivariate plots
They show the interactions between multiple variables in your dataset.

### Correlation matrix plots

The correlation gives an indication of how related the changes in two variables are. If two variables change in the same direction, they are positively correlated. If they change in opposite directions (one goes up, the other goes down), they are negatively correlated. You can calculate the correlation between each pair of attributes; the resulting table is called a correlation matrix. You can then plot the correlation matrix and get an idea of which variables have a high correlation with each other. This is extremely useful to know, because some ML algorithms, like linear and logistic regression, can perform poorly if there are highly correlated input variables in your data.
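As a small worked illustration, the sketch below computes the Pearson correlation coefficient by hand and assembles a correlation matrix for three made-up columns. Pearson is the default method of pandas `DataFrame.corr()`; the helper name `pearson` and the toy data are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# made-up columns: y rises with x, z falls exactly linearly with x
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
    "z": [10, 8, 6, 4, 2],
}
cols = list(columns)
matrix = [[pearson(columns[a], columns[b]) for b in cols] for a in cols]

for name, row in zip(cols, matrix):
    print(name, [round(r, 2) for r in row])
```

The diagonal is 1 (each column against itself), the matrix is symmetric, and the pair `x`/`z` reaches -1 because `z` decreases exactly linearly as `x` increases.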

In [None]:
from matplotlib import pyplot
from pandas import read_csv
import numpy

In [None]:
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
correlations = data.corr()
correlations

In [None]:
# plot correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()

Needless to say, the matrix is symmetrical. As expected, each variable is perfectly positively correlated with itself (see the diagonal line from top left to bottom right). For the rest, the color heat map allows us to visually spot correlations and anticorrelations.

One would conclude here that there are mild positive correlations among several attributes.

The example is not generic in that it specifies the names for the attributes along the axes as well as the number of ticks. This recipe can be made more generic by removing these aspects as follows:

In [None]:
correlations = data.corr()
# plot correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
pyplot.show()

Now you have a generic correlation matrix plot. It gives the same information, although it is a little harder to see by name which attributes are correlated.

Use this generic plot as a first try to understand the correlations in your dataset and customize it like the first example in order to read off more specific data if needed.
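A possible middle ground keeps the plot generic and still labels the axes, by deriving the tick positions and labels from the correlation matrix itself instead of hard-coding them. This is a sketch under assumptions: the DataFrame `frame` and its column names are synthetic stand-ins, and the `matplotlib.use("Agg")` line is only there so the example runs without a display (leave it out inside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit this line in a notebook
import numpy
import pandas
from matplotlib import pyplot

# synthetic stand-in for a dataset; the column names are made up
rng = numpy.random.default_rng(0)
frame = pandas.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
correlations = frame.corr()

fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# one tick per column, labelled with the column name taken from the matrix
ticks = numpy.arange(len(correlations.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(correlations.columns)
ax.set_yticklabels(correlations.columns)
pyplot.show()
```

Because the labels come from `correlations.columns`, the same cell works for any numeric DataFrame without editing the tick count or the name list.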

### Scatter plot matrix

A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. You can create a scatter plot for each pair of attributes in your data. Drawing all these scatter plots together is called a "scatter plot matrix". 

Scatter plots are useful for spotting structured relationships between variables, like whether you could summarize the relationship between two variables with a line. Attributes with structured relationships may also be correlated and good candidates for removal from your dataset.
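Whether a line summarizes a relationship well can also be checked numerically. The sketch below fits a least-squares line to two made-up variables and reports R², the fraction of the variance in `y` explained by the line. The helper names `fit_line` and `r_squared` and the data are assumptions for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5, 6]                   # made-up attribute
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]     # made-up attribute, roughly 2 * x
slope, intercept = fit_line(x, y)
print(round(slope, 3), round(intercept, 3), round(r_squared(x, y, slope, intercept), 4))
```

An R² close to 1 means the dots in the scatter plot lie almost on a straight line, which is the situation in which one of the two attributes carries little extra information.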

In [None]:
from matplotlib import pyplot
from pandas import read_csv
import numpy
# scatter_matrix lives in pandas.plotting; the top-level pandas.scatter_matrix was removed
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(data)
pyplot.show()

Like the correlation matrix, the scatter plot matrix is symmetrical. This makes it possible to look at each pairwise relationship from both perspectives. Because there is little point in drawing a scatter plot of each variable with itself, the diagonal shows histograms of each attribute.

## Summary

What we did:

* we got familiar with 5 quick ways to visually explore a dataset
* in a basic plotting system, each of these plots would take considerable time to build by hand; with Pandas and Matplotlib, each takes only a line or two of code

## What's next

Let's start to manipulate the data: we need to prepare it so that it best exposes the structure of the problem to the modeling algorithms.