# Visualisation

A major challenge with big data is reporting the results, and visualisation is an important part of that. We have already seen some simple plotting examples in Python. This tutorial focuses on how to produce better plots.

Libraries used:
- pandas
- matplotlib
- seaborn

We will be using the following datasets:
- iris
- boston

A good resource for visualisation ideas in python can be found here: https://python-graph-gallery.com/

## Plotting libraries

We have already seen `matplotlib`, which is the most basic library. Many other Python libraries exist for visualisation. Here we will focus on `seaborn`, which is commonly used.

A good cheat sheet for this library can be found here: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
%matplotlib  inline
from sklearn import preprocessing

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
data=datasets.load_boston()
boston=pd.DataFrame(data['data'],columns=data.feature_names)
boston["price"]=data.target   # save price as another column in the dataset
boston.head()

In [None]:
data=datasets.load_iris()
iris=pd.DataFrame(data['data'],columns=['sepal.length','sepal.width','petal.length','petal.width'])
iris["species"]=data.target_names[data.target]   # save species as another column in the dataset
iris.head()

## Scatterplot

Scatterplots represent two variables, with each dot representing a sample.

With matplotlib:

In [None]:
plt.scatter('sepal.length','sepal.width',data=iris)



With `seaborn`, the `lmplot` function also fits a linear regression and draws the line through the points:

In [None]:
sns.lmplot(x='sepal.length',y='sepal.width',data=iris)   # x and y are passed by keyword
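
To see what such a regression line amounts to, it can be computed by hand. The following is only a sketch, assuming an ordinary least-squares fit done with `np.polyfit`; the names `slope`, `intercept` and `xs` are introduced here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
iris = pd.DataFrame(data["data"], columns=["sepal.length", "sepal.width", "petal.length", "petal.width"])

# Least-squares fit of a straight line: sepal.width ~ slope * sepal.length + intercept
slope, intercept = np.polyfit(iris["sepal.length"], iris["sepal.width"], deg=1)

fig, ax = plt.subplots()
ax.scatter(iris["sepal.length"], iris["sepal.width"])

# Draw the fitted line over the observed range of sepal length
xs = np.linspace(iris["sepal.length"].min(), iris["sepal.length"].max(), 50)
ax.plot(xs, slope * xs + intercept, color="red", label="least-squares fit")
ax.set_xlabel("Sepal length (cm)")
ax.set_ylabel("Sepal width (cm)")
ax.legend()
```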

Let's make more information appear on these plots. For each plant, we know the species (stored in `data.target`).

In matplotlib, you can use the `c` argument to pass values that set the colour of each point.

In [None]:
plt.scatter('sepal.length','sepal.width',data=iris,c=data.target)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Iris sepal")
#plt.legend(labels=data.target_names,c=[0,1,2])
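
A scatter coloured with `c` does not get a legend automatically. One possible way to build one, sketched here under the assumption that the colour values are the integer codes in `data.target`, is the `legend_elements` method of the object returned by `scatter`:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
iris = pd.DataFrame(data["data"], columns=["sepal.length", "sepal.width", "petal.length", "petal.width"])

fig, ax = plt.subplots()
points = ax.scatter("sepal.length", "sepal.width", data=iris, c=data.target)

# One handle per distinct colour value (0, 1, 2), in increasing order
handles, _ = points.legend_elements()

# target_names is in the same order as the integer codes
ax.legend(handles, data.target_names, title="species")
ax.set_xlabel("Sepal length")
ax.set_ylabel("Sepal width")
```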

In [None]:
g=sns.lmplot(x='sepal.length',y='sepal.width',data=iris,hue="species")
g.set_axis_labels("Sepal length (cm)","Sepal width (cm)")
plt.title("Iris sepal")
plt.show()

Describe the plot above

*Write your answer here*

Reproduce similar plots with petal length and petal width:

## Saving a plot

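To save a plot, use the `savefig` function. Call it before `plt.show()`, because the figure may be cleared once it has been shown, which would leave an empty image:

In [None]:
g=sns.lmplot(x='sepal.length',y='sepal.width',data=iris,hue="species")
g.set_axis_labels("Sepal length (cm)","Sepal width (cm)")
plt.title("Iris sepal")
plt.savefig("sepal.png")   # save first, then display
plt.show()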
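
`savefig` also accepts options that control the output. A minimal sketch, assuming a throwaway line plot and a file written to the temporary directory (the name `example_plot.png` is made up for this example):

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# dpi sets the resolution; bbox_inches="tight" trims the empty margins
out_path = os.path.join(tempfile.gettempdir(), "example_plot.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
```

The file extension decides the format, so `"example_plot.pdf"` would produce a PDF instead.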

## Distribution plots

### Boxplot

Boxplots represent the median (middle line) and the first and third quartiles of the data (25th and 75th percentiles) as the box, with whiskers extending to the extremes. Outliers are represented by dots.
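
The numbers behind a boxplot can be computed directly. This sketch assumes the default convention in matplotlib and seaborn, where points further than 1.5 times the interquartile range from the box are drawn as outliers; the names `lower_fence` and `upper_fence` are introduced here for illustration.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
sepal_length = pd.Series(data["data"][:, 0], name="sepal.length")

# The box: first quartile, median, third quartile
q1, median, q3 = sepal_length.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

# Points outside these limits are drawn as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sepal_length[(sepal_length < lower_fence) | (sepal_length > upper_fence)]

print("Q1:", q1, "median:", median, "Q3:", q3)
print("number of outliers:", len(outliers))
```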


The distribution of sepal length:

In [None]:
sns.boxplot( y=iris["sepal.length"] )

The distribution of sepal length by species

In [None]:
sns.boxplot( x=iris["species"], y=iris["sepal.length"] )

Reproduce this boxplot for petal length:

The distribution of all four variables

In [None]:
sns.boxplot(data=iris.iloc[:,0:4])

Reproduce this boxplot for only petal length and width:

**Boxplots are great for comparing distributions.**

### Histogram

Histograms represent the distribution of the data in two dimensions (values on one axis, counts on the other), which gives more detail about the shape of the distribution (normality, skewness).

In [None]:
iris["sepal.length"].hist() # this function is from the pandas library

In [None]:
iris["sepal.length"].hist(bins=20) # more bins = more "bars"
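
To see what the `bins` argument changes, the counts can be computed without plotting. A small sketch, assuming `np.histogram` is used for the binning:

```python
import numpy as np
from sklearn import datasets

sepal_length = datasets.load_iris()["data"][:, 0]

# 10 bins (the pandas default) versus 20 bins over the same range
counts10, edges10 = np.histogram(sepal_length, bins=10)
counts20, edges20 = np.histogram(sepal_length, bins=20)

print("10 bins:", counts10)
print("20 bins:", counts20)
```

Every sample falls into exactly one bin, so both sets of counts sum to the number of samples; only the width of the bars changes.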

Seaborn can draw an estimate of the distribution curve on top of the histogram:

In [None]:
sns.histplot(iris["sepal.length"], kde=True)   # kde=True adds the density curve; replaces the deprecated sns.distplot

You can also plot two distributions on the same graph:

In [None]:
sns.histplot( iris["sepal.width"] , color="red", label="Sepal Width", kde=True)
sns.histplot( iris["sepal.length"] , color="skyblue", label="Sepal Length", kde=True)
plt.legend()

Reproduce this histogram for petal length and width:

**Histograms are used to check the shape of distributions.**

### 2D histogram

You can plot the joint occurrence of two variables with a 2D histogram:

In [None]:
plt.hist2d(iris["sepal.length"],iris["petal.length"] , bins=(10, 10),cmap=plt.cm.Greys)
plt.xlabel("sepal length")
plt.ylabel("petal length")
plt.colorbar()


What do you observe?

*write your answer*

Do a 2D histogram of petal width and sepal width

### Violin plot

A mix of boxplot and histogram, a violin plot shows the shape of the distribution in the manner of a boxplot:

In [None]:
sns.violinplot( y=iris["sepal.length"] )

In [None]:
sns.violinplot( y=iris["sepal.length"] ,x=iris.species)

Reproduce these two graphs for petal length:

## Correlation

Heatmaps are a coloured representation of tables. The colour of each cell matches the value in that cell.

In [None]:
sns.heatmap(boston)

The heatmap above represents the entire boston dataset, but in this case it is not very useful, because only a couple of variables are visible, due to the difference in scale.

In [None]:
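# scale each column to mean 0 and standard deviation 1, keeping the column names
boston_scaled=pd.DataFrame(preprocessing.scale(boston),columns=boston.columns)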

In [None]:
sns.heatmap(boston_scaled)

Now we see all the variables
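
The effect of `preprocessing.scale` can be checked on a tiny example. This is a sketch with a made-up DataFrame (`df`, with hypothetical columns `small` and `large`), not one of the tutorial datasets:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Two columns with the same shape but very different scales
df = pd.DataFrame({"small": [0.1, 0.2, 0.3, 0.4],
                   "large": [1000.0, 2000.0, 3000.0, 4000.0]})

# scale() returns a NumPy array, so wrap it to keep the column names
scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

print(scaled.mean())        # close to 0 for every column
print(scaled.std(ddof=0))   # 1 for every column
```

After scaling, both columns hold the same values, which is why all variables become visible on a common colour scale.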

Heatmaps are especially useful for representing correlation tables. The correlation between all the boston variables is:

In [None]:
boston.corr()

This is a big table, and it is hard to see the information in it. Instead, let's draw a heatmap of it:

In [None]:
sns.heatmap(boston.corr())

Now we see more clearly the variables that are negatively correlated (in dark) and positively correlated (in light)

Produce a heatmap of the correlations for the iris dataset:

## A bit more control on your plots

### The legend

Each entry that appears in the legend comes from a separate plotting call (`plt.someplot`) with a `label=` argument given in that call.

Example:

In [None]:
plt.scatter(iris['sepal.length'][iris.species=="setosa"],iris['sepal.width'][iris.species=="setosa"],label="setosa")
plt.scatter(iris['sepal.length'][iris.species=="virginica"],iris['sepal.width'][iris.species=="virginica"],label="virginica")
plt.scatter(iris['sepal.length'][iris.species=="versicolor"],iris['sepal.width'][iris.species=="versicolor"],label="versicolor")
plt.legend()
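
The three nearly identical calls can also be written as a loop. A possible sketch, assuming `groupby` is used to split the data by species:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
iris = pd.DataFrame(data["data"], columns=["sepal.length", "sepal.width", "petal.length", "petal.width"])
iris["species"] = data.target_names[data.target]

fig, ax = plt.subplots()

# One scatter call per species, so each one gets its own legend entry
for name, group in iris.groupby("species"):
    ax.scatter(group["sepal.length"], group["sepal.width"], label=name)

ax.legend()
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```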

To add a title and axis labels, use `plt.title`, `plt.xlabel` and `plt.ylabel`.
**You should always label your axes.**

In [None]:
plt.scatter(iris['sepal.length'][iris.species=="setosa"],iris['sepal.width'][iris.species=="setosa"],label="setosa")
plt.scatter(iris['sepal.length'][iris.species=="virginica"],iris['sepal.width'][iris.species=="virginica"],label="virginica")
plt.scatter(iris['sepal.length'][iris.species=="versicolor"],iris['sepal.width'][iris.species=="versicolor"],label="versicolor")
plt.legend()
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.title("Iris dataset")

### Colours

To change the colours of your graph, you need to change the *palette*; for this you can use the argument `cmap`. Unfortunately, for a scatterplot this forces a change in the way the command is written: `c` needs numeric values, so the species are first converted to category codes.

In [None]:
iris['species']=pd.Categorical(iris['species'])
iris['species'].cat.codes
plt.scatter(iris['sepal.length'],iris['sepal.width'],c=iris['species'].cat.codes,cmap="summer")


In [None]:
iris['species']=pd.Categorical(iris['species'])
iris['species'].cat.codes
plt.scatter(iris['sepal.length'],iris['sepal.width'],c=iris['species'].cat.codes,cmap="ocean")
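
To see what `cat.codes` contains, here is a small sketch on a made-up list of species names (the variable `species` is introduced only for this example):

```python
import pandas as pd

species = pd.Series(pd.Categorical(["setosa", "virginica", "versicolor", "setosa"]))

# Categories are sorted alphabetically; each value is replaced by the index of its category
print(list(species.cat.categories))
print(list(species.cat.codes))
```

These integer codes are what `cmap` maps to colours.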


See the effect on the heatmap:

In [None]:
sns.heatmap(boston.corr(),cmap="summer")

In [None]:
sns.heatmap(boston.corr(),cmap="Blues")
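
For a correlation table, a diverging palette centred on zero can make the sign of each value easier to read. A minimal sketch, assuming a small made-up DataFrame (`toy`, hypothetical, not one of the tutorial datasets) and the `coolwarm` colormap:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: one positively related, one negatively related, one unrelated column
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "pos": x + rng.normal(scale=0.5, size=200),
    "neg": -x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})
corr = toy.corr()

# vmin/vmax fix the colour scale to the full correlation range, center=0 puts the neutral colour at zero
ax = sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, center=0, annot=True, fmt=".2f")
```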

### Subplots

To plot several graphs in the same figure, you can use `plt.subplots`.
Here is an example with 4 plots:

In [None]:
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
f.suptitle('the great plot')
ax1.hist(iris['petal.length' ])
ax2.boxplot(iris['petal.length'])
ax3.scatter(iris['petal.length'],iris['petal.width'])
ax4.scatter(iris['sepal.length'],iris['sepal.width'])

Each plot can be manipulated by its own handle, e.g. `ax1`:

In [None]:
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
f.suptitle('the great plot')

ax1.hist(iris['petal.length' ])
ax1.set_title("petal length")

ax2.boxplot(iris['petal.length'])
ax2.set(ylabel="petal length")

ax3.scatter(iris['petal.length'],iris['petal.width'])
ax3.set(xlabel="petal length",ylabel="petal width")

ax4.scatter(iris['sepal.length'],iris['sepal.width'])
ax4.set(xlabel="sepal length",ylabel="sepal width")


If you want to insert a seaborn plot as one of the subplots:

In [None]:
f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
f.suptitle('the great plot')
ax1.hist(iris['petal.length' ])
ax1.set_title("petal length")

sns.boxplot(x=iris.species,y=iris['petal.length'],ax=ax2) ## the seaborn plot

ax3.scatter(iris['petal.length'],iris['petal.width'])
ax3.set(xlabel="petal length",ylabel="petal width")

ax4.scatter(iris['sepal.length'],iris['sepal.width'])
ax4.set(xlabel="sepal length",ylabel="sepal width")
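
When every subplot holds the same kind of plot, the axes can be filled in a loop instead of being named one by one. A sketch, assuming one histogram per iris variable:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
columns = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
iris = pd.DataFrame(data["data"], columns=columns)

# axes is a 2 by 2 array of handles; axes.flat walks through it row by row
f, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, column in zip(axes.flat, columns):
    ax.hist(iris[column])
    ax.set_title(column)

f.tight_layout()  # avoid overlapping titles and tick labels
```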


Create a 2 by 2 subplot with boxplots of the four variables in iris, coloured by species.

To tweak your matplotlib plots even further, see the following documentation: https://matplotlib.org/users/dflt_style_changes.html