<img src="img/LinkedIN Header.jpg">

# Data visualisation with pandas and seaborn

In this notebook you will visualise an online retail dataset.

For this module, you will need to import: 

* [pandas](https://pandas.pydata.org/pandas-docs/stable/) `pandas as pd`
* [seaborn](https://seaborn.pydata.org/) `seaborn as sns`
* [matplotlib.pyplot](https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.html) `matplotlib.pyplot as plt`

In [None]:
# add your code here to load the libraries listed above.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



### Loading the dataset


Load the dataset `online_retail_customer_data.csv` with `pandas`.

In [None]:
# add your code here.

customers = pd.read_csv(
    'data/online_retail_customer_data.csv',
    index_col='CustomerID'
)

customers.head()



### Plotting with Pandas

The [Pandas `.plot` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) provides a high-level, user-friendly means by which to create common plot types from DataFrames.

The parameters are mostly self-explanatory, and include:

* kind (e.g. 'line', 'bar', 'pie', 'scatter')
* xlim, ylim, xticks, yticks (axis limits and increments)
* title (a list can be provided for subplots)
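
As a minimal sketch of these parameters in use, the example below plots a small made-up DataFrame. The name `toy` and its values are invented for illustration and are not part of the retail dataset.

```python
import pandas as pd

# Hypothetical toy data, invented purely to illustrate the parameters
toy = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "sales": [10, 14, 9, 20, 18, 25],
})

# kind selects the chart type, ylim fixes the y-axis range,
# xticks sets the tick positions and title labels the plot
ax = toy.plot(
    x="month",
    y="sales",
    kind="line",
    ylim=(0, 30),
    xticks=[1, 2, 3, 4, 5, 6],
    title="Toy monthly sales",
)
```

The call returns a matplotlib `Axes` object (here stored in `ax`), which can be adjusted further if needed.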

Use the `.plot` method to create a scatter plot with 'n_orders' on the x-axis and 'time_between_orders' on the y-axis.       

Add a title and make the plot bigger. You'll see some outliers; change the limits of both axes so that these aren't visible. Consider lowering the opacity of the data points so that overlapping points give a better indication of how many data points lie in the denser areas. The opacity is controlled by the additional parameter `alpha`, which expects a value in the range `0.0` to `1.0` (e.g. `alpha=0.9`).

Adding a semi-colon `;` to the last line of our code will suppress the display of the text representation of the plot by the notebook.

In [None]:
# add your code here.

customers.plot(
    x='n_orders', 
    y='time_between_orders', 
    kind='scatter', 
    xlim=(0, 60), 
    ylim=(0, 300),
    alpha=0.1,
    title="Number of orders vs Time between orders",
    figsize=(10, 8)
);



### Plotting with Seaborn

The `seaborn` library is a wrapper around `matplotlib` that offers a greater variety of chart types than `pandas`, while making it more straightforward than plain `matplotlib` to produce clear, attractive plots for exploratory analysis.
Have a look [here](https://stanford.edu/~mwaskom/software/seaborn/examples/index.html) for a gallery of plots possible with `seaborn`.

### lmplot

Create a scatter plot of `n_orders` vs `balance` using the `lmplot` function of `seaborn`. Once you have a grasp of this, try looking at the scatter plots for other pairs of columns.

Note how `lmplot` helps us to determine the relationship between variables by adding a linear regression line to the plot; this can be particularly informative if the density of the points is very high in certain parts of the plot.

In [None]:
# add your code here.

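# pass the columns and data by keyword; recent seaborn releases
# no longer accept these as positional arguments
sns.lmplot(
    x='n_orders', 
    y='balance', 
    data=customers
);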



### pairplot

A scatterplot matrix shows a grid of all scatterplots where each attribute is plotted against all other attributes.

You can find further information on how to create a scatterplot matrix with seaborn using the `pairplot()` function [here](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.pairplot.html).

Create a dataframe called `spend_orders`, containing only the columns `['max_spent', 'mean_spent', 'n_orders', 'total_items']` of the `customers` dataframe, and then use `sns.pairplot` to examine it:


In [None]:
# add your code here.

spend_orders = customers[['max_spent', 'mean_spent', 'n_orders', 'total_items']]
sns.pairplot(spend_orders);



Be mindful of using `.pairplot` if you have a large number of variables; the output may not be of practical use and you could be waiting a long time for it to be produced!

### Correlation matrix and heatmap of correlations between the input features

It is often of great interest to investigate whether any of the variables in a multivariate dataset are significantly correlated. 
As previously shown, the different features (variables) in `customers` are not independent from each other. 
To quickly identify which features are related and to what degree, it is useful to compute a correlation matrix that shows the correlation coefficient for each pair of variables. 
You can do this by using the DataFrame `corr()` method from the `pandas` library:

In [None]:
# add your code here.

customers.corr()
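
To get a feel for how to read the resulting matrix, here is a small sketch on made-up data (the `toy` DataFrame is hypothetical and unrelated to `customers`). Each cell holds a coefficient between -1 and 1, and the diagonal is always 1 because every column is perfectly correlated with itself.

```python
import pandas as pd

# Hypothetical toy data: y rises with x, z falls as x rises
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})

# Pearson correlation (the pandas default) for every pair of columns
toy_corr = toy.corr()
print(toy_corr.round(2))
```

Here `x` and `y` have a coefficient of 1 (perfect positive correlation), while `x` and `z` have a coefficient of -1 (perfect negative correlation).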



To visualise the degree of correlation between variables, you can use a heatmap. From `seaborn`, use the `heatmap` function and pass it the correlation matrix calculated in the previous exercise, setting the `cmap` parameter to `RdBu`.

Note that we can use pyplot (`plt`) functions to affect the Seaborn plot; try calling `plt.figure` before the heatmap, specifying the `figsize` as an argument (a tuple of two numbers, such as `(12, 9)`) and the `facecolor` as any hex color code, such as `#ECEFF4`.

In [None]:
# If you see truncated top and bottom rows on your heatmap, revert matplotlib to 3.1.0. This is a bug in 3.1.1.
# add your code here.

plt.figure(figsize=(12,9), facecolor="#ECEFF4")
sns.heatmap(customers.corr(), cmap="RdBu");



The above exercise shows how we can use a package such as Seaborn to do a lot of the hard work for us in creating the heatmap, while still being able to access more detailed elements using matplotlib when necessary.

It's also worth noting that the colormap options available for the Seaborn `cmap` parameter are those provided by [matplotlib](https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html).

### relplot

Here's an example of an alternative view on a relationship we saw in the matplotlib example earlier.  

When there are multiple y-values for a given x-value, a Seaborn `relplot` with `kind="line"` will automatically plot the mean y-value and add the 95% [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) around it.

We can also use the `.savefig` method to save our plot to an image file.

In [None]:
new_plot = sns.relplot(
    x="n_orders", 
    y="total_spent", 
    kind="line", 
    data=customers[customers['n_orders']<=20]
)

new_plot.savefig('seaborn_relplot.png');
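
The averaging step can be sketched with `pandas` alone. The `toy` data below is made up for illustration; it only shows the mean that the line follows, and does not reproduce the confidence band, which seaborn estimates separately.

```python
import pandas as pd

# Hypothetical toy data with repeated x-values
toy = pd.DataFrame({
    "n_orders": [1, 1, 1, 2, 2, 3],
    "total_spent": [10.0, 20.0, 30.0, 40.0, 60.0, 90.0],
})

# One mean y-value per distinct x-value: the quantity the line traces
mean_spent = toy.groupby("n_orders")["total_spent"].mean()
print(mean_spent)
```

With this data the line would pass through 20 at one order, 50 at two orders and 90 at three orders.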

### distplot

We can use `.distplot` to give us a histogram-like representation of the distribution of the data.

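The data is automatically split into 'bins' on the x-axis. The bars show how much of the data falls in each bin, and the line shows the estimated 'probability density' (the probability per unit on the x-axis). When the density line is drawn, the bar heights are rescaled to the same density scale rather than showing raw counts.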
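
The idea of a probability density can be illustrated with NumPy alone. In this rough sketch the sample is randomly generated and purely hypothetical; the point is that on a density scale the areas of the bars (height times bin width) add up to 1.

```python
import numpy as np

# Hypothetical sample, generated only for illustration
rng = np.random.default_rng(0)
values = rng.normal(loc=25, scale=5, size=200)

# density=True rescales the bar heights so the total bar area equals 1
heights, edges = np.histogram(values, bins=20, density=True)
widths = np.diff(edges)
total_area = (heights * widths).sum()
print(round(total_area, 6))
```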

In [None]:
fuel = pd.read_csv('data/mpg.csv')

Create a `distplot` of the values in the `mpg` column from the `fuel` dataset loaded above, with a number of `bins` such that the average number of cars within each is about 10.

In [None]:
# add your code here.

sns.distplot(
    fuel['mpg'], 
    bins=int(len(fuel)/10)
);



### boxplot

The `.boxplot` method will produce a 'box and whisker' plot, which can be useful for understanding the distribution of data and, in particular, for comparing categories.

Using the `fuel` dataset, create a `boxplot` which shows the distribution of `mpg` for cars from each different `origin`. The 'whiskers' show the range of values, with points determined to be 'outliers' (remember that you may not consider these points to be so) marked outside them.

In [None]:
# add your code here.

sns.boxplot(
    x='origin', 
    y='mpg', 
    data=fuel
);
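
As a rough sketch of how points come to be marked as outliers, the snippet below applies the common 1.5 × IQR rule by hand, assuming the default whisker setting is in use. The `values` series is made up for illustration and is not taken from the `fuel` dataset.

```python
import pandas as pd

# Hypothetical values, invented for illustration; 40 is far from the rest
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])

# The box spans the first to third quartile (the interquartile range, IQR)
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

Whether such points are genuinely anomalous is still a judgment call; the rule only flags values that are far from the bulk of the data.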



### violinplot

Re-use your code from the previous example but with the `.violinplot` method instead: 

In [None]:
# add your code here.

sns.violinplot(
    x='origin', 
    y='mpg', 
    data=fuel
);



We can see that this combines elements of the `boxplot` with the probability density distribution seen in the `distplot`. Which plot do you think best conveys the differences in `mpg` distribution for each `origin`?

### lmplot with multiple categories

Create an `lmplot` to show the relationship between `mpg` and `horsepower`, with a different `hue` for each `origin`. Make the plot larger than the default size.

In [None]:
# add your code here.

sns.lmplot(
    x='mpg', 
    y='horsepower', 
    data=fuel, 
    hue='origin', 
    height=8
);



Finally, make another plot which is the same as that created above, but which excludes any cars with a horsepower above the maximum of any car from outside the USA.

In [None]:
# add your code here.

non_us_max_hp = fuel[fuel['origin']!='usa']['horsepower'].max()
low_hp = fuel[fuel['horsepower']<=non_us_max_hp]

sns.lmplot(
    x='mpg', 
    y='horsepower', 
    data=low_hp, 
    hue='origin', 
    height=8
);

