# Drawing Plots in Python

**Matplotlib** is a widely used plotting package for Python. In this notebook we will look at making basic data visualisation building blocks, mostly through the plotting methods that pandas builds on top of Matplotlib.

Import Matplotlib, NumPy and pandas. The inline command that makes visualisations work in a Jupyter (IPython) notebook follows in the next cell.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

To get plots to appear in our Jupyter notebook we need to use the following Jupyter "magic command":

In [None]:
%matplotlib inline

We can also explicitly pass a list of index values to the Series constructor so as to use a more interesting index. For example:

In [None]:
populations = pd.Series([1357000000, 1200000000, 321068000, 249900000, 249900000, 200400000, 191854000], 
                        ["China", "India", "United States", "Indonesia", "Qumar", "Brazil", "Pakistan"])
print(populations)
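As a small sketch of why a labelled index is useful, the Series can be queried and sorted by country name rather than by position. The shortened list of countries and the names `by_size` and `largest` are chosen only for illustration.

```python
import pandas as pd

# Passing index= by keyword makes it clear which list holds the labels
populations = pd.Series([1357000000, 1200000000, 321068000],
                        index=["China", "India", "United States"])

# Look up a single value by its label
print(populations["India"])

# Sort by value and read off the label of the largest entry
by_size = populations.sort_values(ascending=False)
largest = by_size.index[0]
print(largest)
```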

### Basic Statistical Visualisations

#### Bar chart

In [None]:
populations.plot(kind = "bar", \
                 title="National Populations")

#### Horizontal bar chart

In [None]:
populations.plot(kind = "barh", \
                 title="National Populations")

In [None]:
populations.plot(kind = "barh", \
                 title="National Populations", \
                 xlim = (0, 1500000000))

In [None]:
populations.plot(kind = "barh", \
                 title="National Populations", \
                 xlim = (0, 1500000000), \
                 grid=True)

In [None]:
populations.plot(kind = "barh", title="National Populations", \
                 xlim = (0, 1500000000), \
                 grid=True, 
                 xticks=(0, 500000000, 1000000000, 1500000000))

We can easily change the colours of the bars in a bar plot using the *color* argument.

In [None]:
populations.plot(kind = "barh", \
                 title="National Populations", \
                 xlim = (0, 1500000000), \
                 grid=True, \
                 xticks=(0, 500000000, 1000000000, 1500000000), \
                 color = ['lemonchiffon', '#00ff00', 'blue'])
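When a list of colours is passed, supplying one colour per bar gives the most predictable result. The following is a minimal sketch, assuming we want to highlight the largest bar. It calls Matplotlib's `barh` directly, and the names `colours` and `ax` as well as the shortened data are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

populations = pd.Series([1357000000, 1200000000, 321068000, 249900000],
                        index=["China", "India", "United States", "Indonesia"])

# One colour per bar: highlight the largest value, grey out the rest
colours = ["crimson" if value == populations.max() else "lightgrey"
           for value in populations]

fig, ax = plt.subplots()
ax.barh(populations.index, populations.values, color=colours)
ax.set_title("National Populations")
```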

A full list of named matplotlib colours is available [here](http://matplotlib.org/examples/color/named_colors.html).

In [None]:
populations.plot(kind = "barh", \
                 title="National Populations", \
                 xlim = (0, 1500000000), \
                 grid=True, xticks=(0, 500000000, 1000000000, 1500000000), \
                 color = 'slateblue')

In [None]:
populations.plot(kind = "barh", \
                 title="National Populations", \
                 xlim = (0, 1500000000), \
                 color = 'blue')

Or we can give colours as RGB values in hex code:

In [None]:
populations.plot(kind = "barh", title="National Populations",
                 xlim = (0, 1500000000), grid=True,
                 xticks=(0, 500000000, 1000000000, 1500000000), color = '#007700')

#### Bar Charts For Categorical Distributions

If we have a categorical series we first need to generate a frequency table before plotting a bar chart.

In [None]:
parties = ['FG', 'FF', 'IO', 'SF', 'Lab', 'AAA-PBP', 'SD', 'GP', 'RI']
votes = pd.Series(np.random.choice(parties, 100000, p=[0.25, 0.24, 0.18, 0.14, 0.066, 0.040, 0.035, 0.027, 0.022]))
print(votes)

In [None]:
# Plotting the raw categorical series directly fails with "TypeError: no numeric data to plot"
votes.plot(kind="bar")

In [None]:
vc = votes.value_counts()
vc

In [None]:
vc.plot(kind = 'bar', color = 'green')
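The frequency table can also be expressed as proportions, which are often easier to compare on a bar chart. This is a sketch that uses a seeded random generator so the result is repeatable. The three-party list, the probabilities and the name `shares` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
parties = ["FG", "FF", "SF"]
votes = pd.Series(rng.choice(parties, size=10000, p=[0.5, 0.3, 0.2]))

# normalize=True returns the share of each category instead of raw counts
shares = votes.value_counts(normalize=True)
print(shares)

shares.plot(kind="bar", title="Vote share")
```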

#### Pie Chart

In [None]:
vc.plot(kind = "pie", \
        title="Vote result")

In [None]:
vc.plot(kind = "pie", \
                 title="Votes", \
                figsize=(6,6))

In [None]:
vc.plot(kind = "pie", \
                 title="Votes", \
                 figsize=(6,6), \
                 colormap = "Accent")

#### Histograms

Generate some data for histograms.

In [None]:
mu = 165
sigma = 20
heights = pd.Series(np.random.normal(mu, sigma, 1000))
heights

In [None]:
heights.plot(kind="hist", \
             title = "Student Heights", \
             xlim = (0, 250), bins = 25)

In [None]:
heights.hist(bins = 20)
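To see the numbers behind a histogram, the same binning can be computed without drawing anything. The sketch below uses NumPy's `histogram` function on seeded sample data; the names `counts` and `edges` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the example is repeatable
heights = rng.normal(165, 20, 1000)

# 20 bins means 21 bin edges; counts[i] is the number of values
# falling between edges[i] and edges[i + 1]
counts, edges = np.histogram(heights, bins=20)
print(counts)
print(edges)
```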

#### Box plots

In [None]:
heights.plot(kind="box", title = "Student Heights")

#### Multiple Box Plots

In [None]:
mu = 165
sigma = 20
student_heights = pd.Series(np.random.normal(mu, sigma, 1000))
student_heights

mu = 220
sigma = 20
bball_heights = pd.Series(np.random.normal(mu, sigma, 1000))
bball_heights

# Series.append was removed in pandas 2.0, so combine the two series with pd.concat
height = pd.concat([student_heights, bball_heights], ignore_index = True)

category = ["student"]*1000 +["bball_player"]*1000
category = pd.Series(category)

heights = pd.DataFrame({"height":height, "category":category})
heights.tail()

In [None]:
heights.boxplot(column = "height", by = "category", figsize = (8,6), grid = False)
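The values that a grouped box plot summarises can be inspected directly. This is a minimal sketch, assuming the same two groups as above but generated with a seeded random generator; the name `quartiles` is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
heights = pd.DataFrame({
    "height": np.concatenate([rng.normal(165, 20, 1000),
                              rng.normal(220, 20, 1000)]),
    "category": ["student"] * 1000 + ["bball_player"] * 1000,
})

# Quartiles per category: the box spans the 25% to 75% values
# and the line inside the box marks the 50% value (the median)
quartiles = heights.groupby("category")["height"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)
```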

#### Density plots

In [None]:
heights.plot(kind="kde", title = "Student Heights", ylim = (0, 0.03))

#### Line plots

Generate some time series data for line and area plots.

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2015', periods=1000))
ts = ts.cumsum()
ts

In [None]:
ts.plot(kind = "line", title = "Price")
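The pandas `plot` method returns the Matplotlib Axes it drew on, which can be used to customise the figure further. The sketch below adds axis labels to a line plot of seeded random-walk data; the label text is an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
ts = pd.Series(rng.standard_normal(1000),
               index=pd.date_range("2015-01-01", periods=1000)).cumsum()

# plot() returns the Matplotlib Axes object it drew on
ax = ts.plot(kind="line", title="Price")
ax.set_xlabel("Date")
ax.set_ylabel("Price")
```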

#### Area Plot

In [None]:
ts.plot(kind = "area", title = "Price", stacked= False)

### Scatter Plots

Scatter plots are great for displaying relationships between variables.

Load a dataset so we can look at some scatter plots.

In [None]:
PAYE_Fraud = pd.read_csv("PAYE_Fraud.csv")
display(PAYE_Fraud.head())
PAYE_Fraud.shape

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(PAYE_Fraud["TaxPaid"], \
            PAYE_Fraud["TotalIncome"], alpha= 0.4)

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(PAYE_Fraud["TaxPaid"], \
            PAYE_Fraud["YearsInCurrentEmployment"], \
            s = PAYE_Fraud["TotalIncome"]/1000)

In [None]:
plt.figure(figsize=(10, 10))
colMap = {"Employed":"Red", 'Self-Employed':"Green", "Casual":"Blue"}
plt.scatter(PAYE_Fraud["TaxPaid"], \
            PAYE_Fraud["YearsInCurrentEmployment"], \
            c = PAYE_Fraud["OccupationType"].map(colMap), \
            alpha=0.2)
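Mapping a categorical column to colours leaves a missing value for any category that is absent from the dictionary, which can cause the scatter call to fail. The sketch below shows one way to guard against that with a fallback colour. The small DataFrame is stand-in data (the real PAYE_Fraud columns may differ), and the "Retired" category is a hypothetical example of an unmapped value.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data; the real PAYE_Fraud columns may differ
df = pd.DataFrame({
    "TaxPaid": [1200, 5400, 300, 8000],
    "YearsInCurrentEmployment": [2, 10, 1, 7],
    "OccupationType": ["Employed", "Self-Employed", "Casual", "Retired"],
})

colMap = {"Employed": "red", "Self-Employed": "green", "Casual": "blue"}

# Categories missing from colMap map to NaN, so give them a fallback colour
colours = df["OccupationType"].map(colMap).fillna("grey")

fig, ax = plt.subplots()
ax.scatter(df["TaxPaid"], df["YearsInCurrentEmployment"], c=colours, alpha=0.6)
```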

A **scatter plot matrix** or **SPLOM** is a nice way to create a set of scatter plots for a collection of variables. **pandas** has a function, **scatter_matrix**, to draw scatter plot matrices.

In [None]:
from pandas.plotting import scatter_matrix
_ = scatter_matrix(PAYE_Fraud[["YearsInCurrentEmployment", "NumPreviousEmployments", "TotalIncome", "TaxableIncome"]], \
                   alpha=0.2, figsize=(10, 10), diagonal='hist')

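A **hexbin** plot is an alternative to a simple scatter plot that addresses the overplotting that arises when datasets are very large: rather than drawing every point, it colours hexagonal bins by the number of points they contain. We use the **seaborn** package to draw hexbin plots.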

In [None]:
import seaborn as sns
from scipy.stats import kendalltau # Used to measure the rank correlation between the variables in each hexbin plot
sns.set(style="ticks") # Makes for nicer looking histograms

In [None]:
# Recent seaborn releases no longer accept stat_func, so compute Kendall's tau separately
print(kendalltau(PAYE_Fraud["TaxPaid"], PAYE_Fraud["YearsInCurrentEmployment"]))
sns.jointplot(data=PAYE_Fraud, x="TaxPaid", y="YearsInCurrentEmployment", kind="hex", color="#4CB391")

In [None]:
print(kendalltau(PAYE_Fraud["EffectiveTaxRate"], PAYE_Fraud["TaxableIncome"]))
sns.jointplot(data=PAYE_Fraud, x="EffectiveTaxRate", y="TaxableIncome", kind="hex", color="red")
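Matplotlib can also draw a hexbin plot on its own through the `hexbin` method of an Axes. The sketch below uses seeded synthetic data in place of the PAYE_Fraud dataset, and the `gridsize` and `cmap` values are arbitrary illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seeded so the example is repeatable
x = rng.normal(0, 1, 50000)
y = x + rng.normal(0, 1, 50000)

fig, ax = plt.subplots(figsize=(6, 6))
# Each hexagon is coloured by how many points fall inside it
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis")
fig.colorbar(hb, ax=ax, label="points per hexagon")
```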

### matplotlib Styles

We can change styles in matplotlib easily.

In [None]:
populations.plot(kind = "barh", title="National Populations")

In [None]:
matplotlib.style.use('ggplot')
populations.plot(kind = "barh", title="National Populations")

In [None]:
matplotlib.style.use('fivethirtyeight')
populations.plot(kind = "barh", title="National Populations")

In [None]:
matplotlib.style.use('seaborn-v0_8-colorblind') # Called 'seaborn-colorblind' in Matplotlib versions before 3.6
populations.plot(kind = "barh", title="National Populations")
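Style names vary between Matplotlib versions, so it can help to list the ones that are installed. Calling `style.use` changes every later plot, whereas a style context applies a style to a single plot only. The sketch below shows both; the bar labels and values are placeholder data.

```python
import matplotlib.pyplot as plt

# Names that can be passed to style.use() in the installed Matplotlib version
print(plt.style.available)

# A context manager applies a style to one plot only,
# then restores the previous settings
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.barh(["A", "B", "C"], [3, 2, 1])
    ax.set_title("Styled temporarily")
```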

In [None]:
plt.xkcd()
populations.plot(kind = "barh", title="National Populations")

### Saving Plots

We can easily save plots we generate in our notebooks.

In [None]:
populations.plot(kind = "barh", title="National Populations")
plt.savefig('barplot.pdf')
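The file extension passed to `savefig` selects the output format. As a sketch, the example below saves a PNG instead of a PDF, with a couple of commonly used options. The file name `barplot.png`, the `dpi` value and the shortened data are illustrative choices.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

populations = pd.Series([1357000000, 1200000000, 321068000],
                        index=["China", "India", "United States"])

ax = populations.plot(kind="barh", title="National Populations")

# dpi sets the resolution of raster formats such as PNG, and
# bbox_inches="tight" trims surplus whitespace around the figure
out_path = Path("barplot.png")
ax.figure.savefig(out_path, dpi=150, bbox_inches="tight")
```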