# Visualising relationships

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly as py
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
import seaborn as sns

Load `customer_churn.csv` into a Pandas data frame (`df`).

In [None]:
df = pd.read_csv("customer_churn.csv")
df.drop(df.columns[0], axis=1, inplace=True)

View the contents of the `df` data frame.

In [None]:
df

There are many columns, so they've been truncated. List the names of all the columns in `df`.

In [None]:
list(df.columns)
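
As an alternative to listing the column names, pandas can be told not to truncate wide data frames when displaying them. The sketch below uses a hypothetical 30-column frame (`wide_df`, not part of the churn data) and the `display.max_columns` option, applied temporarily with `pd.option_context`.

```python
import pandas as pd

# Hypothetical wide data frame with 30 columns (stand-in for df).
wide_df = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# Temporarily lift the column limit so that no columns are elided.
with pd.option_context("display.max_columns", None):
    full_repr = repr(wide_df)

print(full_repr)
```

Outside the `with` block the default display settings apply again, so other cells are unaffected.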

## Visualising distributions

A **histogram** can be used to visualise the **distribution** of a numeric variable, such as the number of voicemail messages each customer receives.

Create a histogram of the number of voicemail messages customers receive (`number_vmail_messages`) from `df`. Use 30 bins.

In [None]:
sns.histplot(data=df, x="number_vmail_messages", bins=30)

Most customers receive none.
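
A reading like this can be checked numerically. The following is a minimal sketch on a made-up stand-in for `number_vmail_messages` (the values in `toy_df` are invented for illustration): it computes the share of zeros and the bin counts that a histogram would draw.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df["number_vmail_messages"]; the values are made up.
toy_df = pd.DataFrame({"number_vmail_messages": [0, 0, 0, 0, 0, 0, 0, 25, 31, 40]})

# Fraction of customers with no voicemail messages at all.
zero_share = (toy_df["number_vmail_messages"] == 0).mean()

# The counts a histogram would draw, here with 4 equal-width bins.
counts, bin_edges = np.histogram(toy_df["number_vmail_messages"], bins=4)

print(f"Share with zero messages: {zero_share:.0%}")
print(counts.tolist())
```

The same two expressions should work on the real `df`, with `bins=30` to match the plot.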

## Visualising correlations

A **scatterplot** can be used to visualise the **correlation** between number of calls and time spent on calls.

Create a scatterplot of `total_day_minutes` (y-axis) against `total_day_calls` (x-axis) from `df`.

In [None]:
sns.scatterplot(data=df, x="total_day_calls", y="total_day_minutes")

There's no evidence of a relationship between number and length of calls.
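
A visual impression of "no relationship" can be backed up with a correlation coefficient. The sketch below uses made-up values in `toy_df` (not the churn data) and pandas' `Series.corr`, which computes the Pearson correlation by default. A value near 0 indicates no linear relationship, while a value near 1 or -1 indicates a strong one.

```python
import pandas as pd

# Hypothetical values chosen so that the two columns are uncorrelated.
toy_df = pd.DataFrame({
    "total_day_calls": [80, 90, 100, 110, 120],
    "total_day_minutes": [200, 100, 300, 100, 200],
})

# Pearson correlation between the two columns.
r = toy_df["total_day_calls"].corr(toy_df["total_day_minutes"])

# For comparison: a perfectly linear relationship has a correlation of 1.
r_linear = toy_df["total_day_calls"].corr(toy_df["total_day_calls"] * 2 + 1)

print(round(r, 3), round(r_linear, 3))
```

Note that the Pearson coefficient only measures linear association, so the scatterplot remains worth inspecting for curved or clustered patterns.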

## Visualising changes over time

The customer churn data doesn't have any date information, so we're going to visualise how China's GDP has changed over the past few decades instead.

Load `gdp_by_country_by_year.csv` into a Pandas data frame (`gdp_df`).

In [None]:
gdp_df = pd.read_csv("gdp_by_country_by_year.csv")

View the contents of the `gdp_df` data frame.

In [None]:
gdp_df

Transform `Year` to a date-time index and set it to be the index of `gdp_df`.

In [None]:
gdp_df.index = pd.to_datetime(gdp_df["Year"], format='%Y')

Extract China's GDP data into a `china_gdp_df` data frame.

In [None]:
china_gdp_df = gdp_df[gdp_df["Country Code"] == "CHN"]

View the contents of the `china_gdp_df` data frame.

In [None]:
china_gdp_df

Create a line plot to visualise China's GDP (`Value`) per year as a time series.

In [None]:
sns.lineplot(data=china_gdp_df, x=china_gdp_df.index, y="Value")
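
With a date-time index in place, it is also straightforward to slice by year and to compute year-on-year change. The sketch below assumes a frame with the same column names as `gdp_df`; the rows in `toy_gdp_df` are made up for illustration and are not real GDP figures.

```python
import pandas as pd

# Hypothetical rows in the same shape as gdp_df; the values are made up.
toy_gdp_df = pd.DataFrame({
    "Country Code": ["CHN", "USA", "CHN", "USA", "CHN", "USA"],
    "Year": [2000, 2000, 2001, 2001, 2002, 2002],
    "Value": [100.0, 500.0, 110.0, 505.0, 121.0, 510.0],
})

# Same idea as above: parse the year and use it as the index.
toy_gdp_df.index = pd.to_datetime(toy_gdp_df["Year"].astype(str), format="%Y")

# Keep one country's rows only.
toy_china_df = toy_gdp_df[toy_gdp_df["Country Code"] == "CHN"]

# Year-on-year growth as a fraction (the first year has no predecessor).
growth = toy_china_df["Value"].pct_change()

# A date-time index allows slicing by year strings.
recent = toy_china_df.loc["2001":"2002"]

print(growth.round(3).tolist())
print(len(recent))
```

Plotting `growth` instead of `Value` would emphasise the rate of change rather than the level.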

## Visualising rankings

Rankings can be visualised using a standard bar plot. In ranking, the focus is on the ordering. A simple table can also be used to visualise a ranking (e.g. football league tables).

We will rank states based on the average number of calls each of their customers makes during the day.

Calculate the mean of each numeric column (including `total_day_calls`) for each state and store the resulting data frame in `state_df`.

In [None]:
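# numeric_only=True skips text columns, which recent pandas versions refuse to average.
state_df = df.groupby("state", as_index=False).mean(numeric_only=True)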

View the contents of `state_df`.

In [None]:
state_df

Sort the data frame in descending order of `total_day_calls`.

In [None]:
state_df = state_df.sort_values("total_day_calls", ascending = False)

View the contents of `state_df`.

In [None]:
state_df

Create a bar plot of `total_day_calls` for each `state` (using `state_df`). It's often a good idea to flip bar charts with categorical labels so the bars are horizontal, as this makes the labels easier to read.

Restrict the chart to the top 10 states.

In [None]:
sns.barplot(data=state_df.iloc[:10,:], x="total_day_calls", y="state", orient="h", color="gray")

Show a similar chart for the states with the _least_ daytime usage.

In [None]:
sns.barplot(data=state_df.iloc[-10:,:], x="total_day_calls", y="state", orient="h", color="gray")
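
Sorting and then slicing with `iloc` is one way to pick the top and bottom of a ranking. As a sketch of an alternative, pandas also provides `nlargest` and `nsmallest`, which select the extremes without sorting the whole frame first. The state codes and averages in `toy_state_df` are made up for illustration.

```python
import pandas as pd

# Hypothetical per-state averages; codes and values are made up.
toy_state_df = pd.DataFrame({
    "state": ["AA", "BB", "CC", "DD", "EE"],
    "total_day_calls": [98.5, 102.1, 100.0, 99.2, 101.4],
})

# Two states with the highest average, largest first.
top_2 = toy_state_df.nlargest(2, "total_day_calls")

# Two states with the lowest average, smallest first.
bottom_2 = toy_state_df.nsmallest(2, "total_day_calls")

print(top_2["state"].tolist())
print(bottom_2["state"].tolist())
```

Note that `nsmallest` returns rows in ascending order, so the lowest-ranked state comes first, whereas slicing the tail of a descending sort puts it last.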

## Visualising parts-to-wholes

Sometimes we wish to emphasise how sub-parts compare to the whole, rather than to each other. The question is "How much of the pie am I getting?" rather than "Did I get a bigger slice than everyone else?"

Pie charts have a bad reputation within the data visualisation community. They are easy to abuse. However, if you have 2-5 reasonably-sized categories, and want to emphasise the whole, they can be effective.

We'll use a pie chart to visualise the proportion of customers who have a voicemail plan.

Create a `voice_mail_plan` frequency table from `df` and store it in `voicemail_frequency_df`.

In [None]:
voicemail_frequency_df = df["voice_mail_plan"].value_counts()

View the contents of `voicemail_frequency_df`.

In [None]:
voicemail_frequency_df

Create a pie chart from the frequencies to show how they contribute to the overall customer base.

In [None]:
voicemail_frequency_df.plot.pie()
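
Because a pie chart is about shares of a whole, it can help to compute those shares explicitly. The sketch below uses a made-up `voice_mail_plan` column (`toy_plan`) and `value_counts(normalize=True)`, which returns proportions instead of counts. The formatted labels are one possible way to annotate the slices.

```python
import pandas as pd

# Hypothetical voice_mail_plan values; the split is made up.
toy_plan = pd.Series(["no", "no", "no", "yes"], name="voice_mail_plan")

# Proportion of customers in each category (sums to 1).
shares = toy_plan.value_counts(normalize=True)

# Human-readable labels that could be used for the pie slices.
labels = [f"{category} ({share:.0%})" for category, share in shares.items()]

print(labels)
```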

## Visualising deviations

It's occasionally useful to visualise deviations from a fixed point (e.g. currency fluctuations relative to the US dollar).

In this example, we are going to see how the average number of daytime calls in each state deviates from that of a reference state, DC.

Prepare the data as follows.

1. Copy `state_df`, index it by `state` and store the result in `dc_df`.
2. Calculate `total_day_calls_diff_dc` by subtracting DC's average `total_day_calls` from each state's average.
3. Optionally remove the reference state (DC), whose deviation is zero by definition. This step is left commented out in the code.
4. Sort `dc_df` by `total_day_calls_diff_dc`.

In [None]:
dc_df = state_df.copy().set_index("state")
dc_df["total_day_calls_diff_dc"] = dc_df["total_day_calls"] - dc_df.loc["DC", "total_day_calls"]
# dc_df = dc_df[dc_df.index != "DC"]  # uncomment to drop the reference state
dc_df.sort_values("total_day_calls_diff_dc", inplace=True)

View the contents of `dc_df`.

In [None]:
dc_df

Create a horizontal bar plot from `dc_df` that shows `total_day_calls_diff_dc` for each `state`.

In [None]:
plt.figure(figsize=(12,12))
sns.barplot(data=dc_df, x="total_day_calls_diff_dc", y=dc_df.index, orient="h", color="gray")
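
Deviation charts are often easier to read when bars below and above the reference are coloured differently. The following sketch computes such a colour list on made-up data (`toy_dc_df`); the colour names are matplotlib's standard `tab:` colours, and the choice of red for negative and blue for non-negative is an assumption, not part of the original chart.

```python
import numpy as np
import pandas as pd

# Hypothetical per-state averages, indexed by state; the values are made up.
toy_dc_df = pd.DataFrame(
    {"total_day_calls": [98.0, 100.0, 103.0]},
    index=pd.Index(["AA", "DC", "BB"], name="state"),
)

# Deviation of each state from the reference state (DC).
reference = toy_dc_df.loc["DC", "total_day_calls"]
toy_dc_df["total_day_calls_diff_dc"] = toy_dc_df["total_day_calls"] - reference
toy_dc_df = toy_dc_df.sort_values("total_day_calls_diff_dc")

# One colour per bar: red below the reference, blue at or above it.
bar_colours = np.where(
    toy_dc_df["total_day_calls_diff_dc"] < 0, "tab:red", "tab:blue"
).tolist()

print(toy_dc_df["total_day_calls_diff_dc"].tolist())
print(bar_colours)
```

A list like `bar_colours` can be passed as the `color` argument of matplotlib's `barh`, which accepts one colour per bar.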

## Visualising magnitudes

Visualising magnitudes is one of the most common uses for charts. We've seen how ranking, part-to-whole and deviation visualisations all make use of magnitudes, but each places the emphasis on a different aspect of the data. It can be a subtle distinction, but it's important to know which characteristic of your data you are trying to showcase with your visualisation. Good data visualisation requires mastery of nuances.

We'll use a bar chart to compare the average time spent on calls, during the day, in each state.

Create a horizontal bar plot from `state_df` that shows the magnitude of `total_day_minutes` for each `state`.

Restrict the chart to the first 10 states, ordered alphabetically by state code.

In [None]:
sns.barplot(data=state_df.sort_values("state").iloc[:10,:], x="total_day_minutes", y="state", orient="h", color="gray")

## Visualising spatial data

Visualising geographic data is a specialist subfield of visualisation.

As there's no geographic data in our customer churn data, we'll create a choropleth (colour polygon) map that shows how the global happiness survey index varies across countries.

Load `world_happiness_2019.csv` into a Pandas data frame (`happiness_df`).

In [None]:
happiness_df = pd.read_csv("world_happiness_2019.csv")

View the contents of `happiness_df`.

In [None]:
happiness_df

Note that the countries are listed in descending order of happiness. Finns are happiest, while the South Sudanese appear to be the least happy.
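
Observations like these can be confirmed in code rather than by eye. The sketch below uses a made-up frame (`toy_happiness_df`, with placeholder country names and scores) that has the same two columns as `happiness_df`. It checks the ordering and looks up the extremes.

```python
import pandas as pd

# Hypothetical rows with the same columns as happiness_df; values are made up.
toy_happiness_df = pd.DataFrame({
    "Country or region": ["Country A", "Country B", "Country C"],
    "Score": [7.5, 5.0, 3.0],
})

# True if the scores never increase from one row to the next.
is_sorted = toy_happiness_df["Score"].is_monotonic_decreasing

# Look up the rows holding the highest and lowest scores.
happiest = toy_happiness_df.loc[toy_happiness_df["Score"].idxmax(), "Country or region"]
least_happy = toy_happiness_df.loc[toy_happiness_df["Score"].idxmin(), "Country or region"]

print(is_sorted, happiest, least_happy)
```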

Use Plotly to create a world choropleth from `happiness_df`, using `Country or region` as the location and `Score` as the data value (`z`).

In [None]:
data = {
    "type": 'choropleth',
    "locations": happiness_df['Country or region'],
    "locationmode": 'country names',
    "colorscale": 'viridis',
    "z": happiness_df['Score']
}

init_notebook_mode(connected=True)

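# Named happiness_map rather than map to avoid shadowing Python's built-in map().
happiness_map = go.Figure(data=[data])

happiness_map.update_layout(
    autosize=False,
    width=800,
)

happiness_map.show()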

Hover over a country to see its happiness index. Test your geographic knowledge by locating Finland and South Sudan.

## Visualising flow

Flow considers the volume or strength of movement between stages (e.g. locations, tasks, periods). It's not a common type of visualisation, but can be powerful.

We'll move away from our customer churn data, as it doesn't have any simple flow examples we can visualise. 

In its place, we'll look at data on Scottish independence. Specifically, we'll visualise how Brexit has impacted Scots' views on independence.

Load `scottish_votes.csv` into a Pandas data frame (`scottish_votes_df`).

In [None]:
scottish_votes_df = pd.read_csv("scottish_votes.csv")

View the contents of `scottish_votes_df`.

In [None]:
scottish_votes_df

Don't spend a lot of time pondering the file format. However, you will see that it contains data on how people voted in IndyRef1 and Brexit (the initial stage) and voting intentions for IndyRef2 (the final stage).

This data has been formatted for use in a **Sankey** diagram.
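
For context, a Plotly Sankey link refers to nodes by their integer position in the list of node labels, not by name. The following sketch shows how a table of named flows could be converted to that format. The labels, column names (`from`, `to`) and values in `toy_flows_df` are all made up for illustration and do not come from `scottish_votes.csv`.

```python
import pandas as pd

# Hypothetical node labels: two initial stages and two final stages.
node_labels = ["Voted No", "Voted Yes", "Would vote No", "Would vote Yes"]
label_to_index = {label: i for i, label in enumerate(node_labels)}

# Hypothetical flows between named nodes; the values are made up.
toy_flows_df = pd.DataFrame({
    "from": ["Voted No", "Voted No", "Voted Yes"],
    "to": ["Would vote No", "Would vote Yes", "Would vote Yes"],
    "Value": [40, 10, 50],
})

# Translate node names into the integer indices a Sankey link expects.
toy_link = {
    "source": toy_flows_df["from"].map(label_to_index).tolist(),
    "target": toy_flows_df["to"].map(label_to_index).tolist(),
    "value": toy_flows_df["Value"].tolist(),
}

print(toy_link)
```

A file that already stores the indices, as this one appears to, skips the mapping step entirely.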

Create the `source`, `target`, `value` and `color` data for the Sankey links, along with the styling and labels for the nodes.

In [None]:
link = {
    "source": scottish_votes_df['Source'].dropna(axis=0, how='any'),
    "target": scottish_votes_df['Target'].dropna(axis=0, how='any'),
    "value": scottish_votes_df['Value'].dropna(axis=0, how='any'),
    "color": scottish_votes_df['Link Color'].dropna(axis=0, how='any'),
}
   
node = {
    "pad": 10, 
    "line": {
        "color": "black", 
        "width": 0
    }, 
    "color": ["#F27420", "#4994CE", "#FABC13", "#7FC241", "#D3D3D3", "#8A5988", "#449E9E", "#D3D3D3"], 
    "label": ["Remain+No – 28", "Leave+No – 16", "Remain+Yes – 21", "Leave+Yes – 14", "Didn’t vote in at least one referendum – 21", "46 – No", "39 – Yes", "14 – Don’t know / would not vote"], 
    "thickness": 30
}

Use Plotly to display the Sankey diagram.

In [None]:
init_notebook_mode(connected=True)

data = go.Sankey(link = link, node=node)
    
fig = go.Figure(data)

fig.update_layout(
    autosize=False,
    width=600,
    height=400,
)

fig.show()

Note how the Europhile Unionists who have shifted to favouring independence are balanced by some Leave Nationalists deciding to remain in the UK.

There is also an additional swing to "No" in IndyRef2 from earlier Europhile Nationalists---possibly as a result of the challenges that would be presented by a hard border between Scotland and England.

The diagram is from an [article](https://www.theguardian.com/commentisfree/2017/jan/27/shift-scottish-independence-yougov-nicola-sturgeon-balancing-act) in the Guardian that has a more in-depth analysis than is warranted here.

## Takeaway

There are many different chart types available---and more are created every year. But it's important to focus on the relationship you wish to study or explain.

Your visualisations should emerge from your requirements. Think carefully about your data before you throw it into a charting application.