# Explore your data with visualization

We will make:

* ***Univariate*** plots
   * Histograms
   * Density Plots
   * Box and Whisker Plots
* ***Multivariate*** plots
   * Correlation matrix plots
   * Scatter plot matrix

## Import the data

In [0]:
import pandas as pd

url = 'https://raw.githubusercontent.com/dbonacorsi/AML_basic_AA1920/master/datasets/pima-indians-diabetes.data.csv'

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
data

## Univariate plots

### Histograms

In [0]:
from matplotlib import pyplot

In [0]:
### if you want to read the data from a local copy of the CSV file instead, here is the code:

#from pandas import read_csv
#filename = 'pima-indians-diabetes.data.csv'
#names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#data = read_csv(filename, names=names)
#pyplot.rcParams["figure.figsize"] = [16,16]
#data.hist()
#pyplot.show()

In [0]:
# set the figure size before plotting, otherwise it only affects later figures
pyplot.rcParams["figure.figsize"] = [16,16]
data.hist()
pyplot.show()
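
To make explicit what each histogram panel shows, here is a minimal sketch of the binning step, done by hand with `numpy.histogram`. The `sample` values are hypothetical and only stand in for one column of `data`; the choice of 4 bins is also just for illustration (`data.hist()` uses 10 bins by default).

```python
import numpy
import pandas as pd

# Hypothetical sample standing in for one column of `data` (not real values)
sample = pd.Series([21, 22, 22, 23, 25, 29, 31, 33, 41, 60], name='age')

# Split the range [min, max] into 4 equal-width bins and count the values in each
counts, edges = numpy.histogram(sample, bins=4)

# Each bin is half-open [left, right), except the last one, which includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:6.2f} to {right:6.2f}: {count} values')
```

The bar heights in a histogram are exactly these counts, so a long tail of nearly empty bins on one side is a quick hint that the attribute is skewed.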

## <font color=red>Exercise</font>

Compare what you see here with what you saw in the non-visual data exploration of the previous notebook. Which approach are you more comfortable with: the first, the second, or a combination of both?

Using your preferred way of looking at the data, what observations can you make from these plots?

### Density plots

In [0]:
from matplotlib import pyplot
#from pandas import read_csv

In [0]:
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()
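
A density plot is a smoothed version of a histogram. As a rough sketch of what happens behind the scenes, the cell below estimates a density with `scipy.stats.gaussian_kde` (the Gaussian kernel density estimator that Pandas relies on for `kind='density'`, which is why SciPy must be installed). The `sample` values, the padding of the grid and the number of grid points are illustrative assumptions.

```python
import numpy
from scipy.stats import gaussian_kde

# Hypothetical sample standing in for one column of `data` (not real values)
sample = numpy.array([21, 22, 22, 23, 25, 29, 31, 33, 41, 60], dtype=float)

# Fit a Gaussian kernel density estimate (bandwidth chosen automatically)
kde = gaussian_kde(sample)

# Evaluate the density on a grid that extends beyond the observed range
pad = 3 * sample.std()
grid = numpy.linspace(sample.min() - pad, sample.max() + pad, 500)
density = kde(grid)

# A density curve encloses a total area of (approximately) 1
step = grid[1] - grid[0]
area = (density * step).sum()
print('area under the curve:', round(area, 3))
```

Because the area under the curve is 1, the y-axis of a density plot shows a density and not a count, so only the shape of the curve should be compared with the histogram.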

### Box and Whisker plots

In [0]:
data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()

In [0]:
### on Colab: uncomment the lines below to upload BoxAndWhiskers.png before displaying it
#from google.colab import files
#uploaded = files.upload()

In [0]:
from IPython.display import Image
Image(filename='/content/BoxAndWhiskers.png')
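
The quantities drawn in a box and whisker plot can also be computed directly. The sketch below uses a hypothetical `sample` in place of one column of `data`, and assumes the Matplotlib default of whiskers reaching at most 1.5 times the interquartile range beyond the box.

```python
import pandas as pd

# Hypothetical sample standing in for one column of `data` (not real values)
sample = pd.Series([21, 22, 22, 23, 25, 29, 31, 33, 41, 60], name='age')

# The box spans the first to the third quartile, with a line at the median
q1 = sample.quantile(0.25)
median = sample.quantile(0.50)
q3 = sample.quantile(0.75)
iqr = q3 - q1

# Whiskers reach the most extreme points still inside these limits (default factor 1.5)
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are drawn individually as candidate outliers
outliers = sample[(sample < lower_limit) | (sample > upper_limit)]

print('Q1, median, Q3:', q1, median, q3)
print('whisker limits:', lower_limit, upper_limit)
print('candidate outliers:', outliers.tolist())
```

Running the same computation on a column of `data` tells you which values appear as isolated dots above or below the whiskers.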

## Multivariate plots
Multivariate plots show the interactions between multiple variables in your dataset.

### Correlation matrix plots

In [0]:
from matplotlib import pyplot
#from pandas import read_csv
import numpy

In [0]:
#filename = 'pima-indians-diabetes.data.csv'
#names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#data = read_csv(filename, names=names)
correlations = data.corr()
correlations

In [0]:
# plot correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()
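
Reading the strongest correlations off a colored matrix can be hard. As a complement, here is a minimal sketch that ranks the pairs of columns by the absolute value of their correlation. The `toy` DataFrame and its column names are hypothetical; the same steps should work when `toy` is replaced with `data`.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy frame standing in for `data` (not real values)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
})
corr = toy.corr()

# Collect each pair of distinct columns once (the matrix is symmetric)
pairs = pd.Series({(x, y): corr.loc[x, y]
                   for x, y in combinations(corr.columns, 2)})

# Sort by strength of the correlation, regardless of its sign
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Strongly correlated pairs (positive or negative) are worth a second look in the scatter plot matrix.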

### Scatter plot matrix

In [0]:
#from matplotlib import pyplot
#from pandas import read_csv
#import numpy
#from pandas import scatter_matrix

from pandas.plotting import scatter_matrix


In [0]:
scatter_matrix(data)
pyplot.show()

## <font color=red>Exercise</font>

Do you spot something interesting in the data by visualizing it this way?

## <font color=red>Exercise</font>

Review the previous notebook and this one, then write a new notebook of your own that performs all the exploration-related tasks on a generic CSV file given as input. You will re-use it often!

## Summary

What we did:

* we familiarized ourselves with 5 quick ways to visually explore a dataset
* with a lower-level plotting system, each of these plots would have taken PLENTY of time to produce. With Pandas and Matplotlib, each one takes only a line or two of code.

## What's next

Let's start manipulating the data: we need to prepare it so that it best exposes the structure of the problem to the modeling algorithms.