![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Module 4 Unit 3 - Data Visualization Design



![Data Visualization Design](../_images/Module4-Unit3-image.jpeg)


We don't have to be graphic designers to recognize that the way data is presented visually is important. When a data visualization is well-crafted, it's appealing to the viewer and easy for them to access, understand, and make use of.

This makes data visualizations powerful and practical tools for communicating ideas and presenting arguments. They can also help make information and processing resources available to colleagues and students, empowering them to find answers to questions on their own.

Here are four goals to keep in mind when creating a data visualization:

1. Define a purpose

2. Know the audience

3. Keep it tidy

4. Make it accessible

Let's explore each one in more detail.

#### Design with purpose

What question does this visualization answer? The audience should know exactly why they are being shown this information and what to do with it.

This is a good time for us to make sure we have the necessary data to answer this question and that our source is reliable.

Defining a purpose also helps us select the right type of data visualization. Is this particular chart, map, or plot the best tool to represent our idea? If so, why?

Do we need to show multiple categories or variables? If so, how are they differentiated?

It can help to try different kinds of visualizations until we find one that communicates the idea as clearly as possible.

#### Know the audience

Who is this visualization for? Whenever we represent data, we should think about what our audience needs and understands.

For example, the title, legend, and any labels should use terms that are familiar to them.

Similarly, a visualization intended for a class of 12-year-olds might use simpler language than one intended for post-graduate academics.

#### Keep it tidy

Visualizations that look cluttered or disorganized work against our goal of clear communication.

Questions to ask yourself: 

* Does your visualization look cluttered?
* Are you mixing data categories?
* Are you attempting to answer more than one question using a single visualization? 

To be effective, visualizations must be clean and coherent. While it can be exciting to add more elements and variables, data is easier to communicate and navigate when it is presented simply.

In many cases, it's better to create one visualization for each key idea than to try to make a more complex visualization that presents multiple ideas at once.

#### 📚 Read
>[Data Visualization – Best Practices and Foundations](https://www.toptal.com/designers/data-visualization/data-visualization-best-practices)


#### Make it accessible

Not everyone takes in information the same way. Some people experience more challenges than others when accessing and working with data.

For example, colour is a common tool for differentiating categories in visualizations. However, many people experience full or partial colour-blindness that can make this information hard to interpret.

#### 📚 Read
>[How to Use Color Blind Friendly Palettes to Make Your Charts Accessible](https://venngage.com/blog/color-blind-friendly-palette/)

When designing visualizations, we should strive to remove as many barriers as we can for the people who use our work. This can include a wide variety of design practices, such as using large font sizes and high-contrast colour palettes.
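One way to put a number on "high contrast" is the contrast ratio defined in the Web Content Accessibility Guidelines (WCAG), which compares the relative luminance of two colours. The sketch below is an illustration and not part of the course code: the function names are our own, and it assumes colours are written as six-digit hex strings such as *#ffffff*. WCAG suggests a ratio of at least 4.5:1 for normal-sized text.

```python
def relative_luminance(hex_colour):
    """Relative luminance (WCAG 2.x): 0 for black, 1 for white."""
    # Read the red, green and blue pairs after the leading "#"
    channels = [int(hex_colour[i:i + 2], 16) / 255 for i in (1, 3, 5)]

    def linearize(c):
        # Undo the sRGB gamma curve, as specified by WCAG
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in channels)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(colour_a, colour_b):
    """Contrast ratio between two colours, from 1 (identical) to 21."""
    lum_a = relative_luminance(colour_a)
    lum_b = relative_luminance(colour_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Black text on a white background is the highest possible contrast
print(round(contrast_ratio("#000000", "#ffffff"), 2))  # 21.0

# Light grey text on white is much harder to read
print(round(contrast_ratio("#aaaaaa", "#ffffff"), 2))
```

A check like this can help when choosing label or annotation colours, but it does not replace testing a chart with real viewers.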

#### 📚 Read
>[5 easy ways to make your data visualization more accessible](http://www.storytellingwithdata.com/blog/2018/6/26/accessible-data-viz-is-better-data-viz)


### 🏁 Activity

*Get the most out of this activity by opening a Jupyter notebook in another window and following along. Code snippets provided in the course can be pasted directly into your Jupyter notebook. Review Module 2, Unit 5 for a refresher on creating and opening Jupyter notebooks in Callysto.*

Let's try applying an inclusive colour palette to a data visualization.

**Step 1**

Start by running the code below to recreate the pets DataFrame. 

    # load the "pandas" library under the alias "pd"
    import pandas as pd
    # load "plotly.express" under the alias "px" to draw charts
    import plotly.express as px

    # identify the location of our online data
    url = "https://tinyurl.com/y917axtz-pets"

    # read csv file from url and create a dataframe
    pets = pd.read_csv(url)

    # display the head of the data
    pets.head()

**Step 2**

Next, use the groupby method to count the pets of each species in our data set.

    species = pets.groupby("Species").size().reset_index(name="Count")
    species
    
![head](../_images/Module4-Unit3-image1.png)

*This DataFrame shows that there are five different species in the data set and how many pets of each species it contains. From the first row (index 0), we see there are 11 cats; from the second row (index 1), 15 dogs; from the third row (index 2), 2 lizards; from the fourth row (index 3), 2 rabbits; and from the fifth row (index 4), only 1 tarantula.*
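To see what each part of the Step 2 expression does, here is a small sketch that runs the same chain one piece at a time. It uses a made-up stand-in for the pets data (the name *pets_demo* and the lowercase species labels are assumptions), built so the counts match the table above.

```python
import pandas as pd

# Hypothetical stand-in for the pets data: one row per pet
pets_demo = pd.DataFrame({
    "Species": ["cat"] * 11 + ["dog"] * 15 + ["lizard"] * 2
               + ["rabbit"] * 2 + ["tarantula"]
})

# groupby gathers the rows that share a Species value;
# size() counts the rows in each group and returns a Series
counts = pets_demo.groupby("Species").size()

# reset_index turns that Series into a DataFrame with two columns,
# "Species" and "Count", which is the shape a plotting call expects
species_demo = counts.reset_index(name="Count")

print(species_demo)
```

The groups come back sorted alphabetically by species, which is why cats appear in the first row.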


**Step 3**

Generate a pie chart of the different species in our data set.

    colour_list1 = ['#ffffcc','#a1dab4','#41b6c4','#2c7fb8','#253494']

    fig = px.pie(species,values="Count",names="Species",title="Species of pets",color_discrete_sequence=colour_list1)

    fig.show()
    
The values in *colour_list1* represent colours as six hexadecimal digits (the sixteen hexadecimal digits are 0, 1, 2, . . . , 9, a, b, c, d, e, f). The value *#ff0000* represents the colour red, *#00ff00* is green, and *#0000ff* is blue. Combining different amounts of these three basic colours gives all the colours of the rainbow. Usually we look up the codes for the colours we want in a colour chart.

The output should be a pie chart titled "Species of pets" with one slice for each of the five species.
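The hex notation used in *colour_list1* can be unpacked with a few lines of Python. In this sketch, the helper name *hex_to_rgb* is our own. It splits a code into its red, green, and blue pairs and converts each pair from base 16 to an ordinary number between 0 and 255.

```python
def hex_to_rgb(hex_colour):
    """Convert a code such as '#ff0000' to (red, green, blue) integers."""
    digits = hex_colour.lstrip("#")
    # Each pair of hex digits is one colour channel, from 00 (0) to ff (255)
    red, green, blue = (int(digits[i:i + 2], 16) for i in (0, 2, 4))
    return red, green, blue


colour_list1 = ['#ffffcc', '#a1dab4', '#41b6c4', '#2c7fb8', '#253494']

for colour in colour_list1:
    print(colour, hex_to_rgb(colour))
```

For example, *#ff0000* becomes (255, 0, 0): full red with no green or blue.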

**Step 4**

Next, visit [**ColorBrewer 2.0**](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3), a free online diagnostic tool for evaluating color schemes used in maps and other data visualizations.

In ColorBrewer 2, generate a new colour scheme for the pie chart that is both suitable for our data and is friendly to people who experience colour blindness.

Update the code for the pie chart to reflect the new colours.
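Before pasting a new list of colours into the pie chart code, it can help to check it for simple mistakes. Plotly Express cycles through *color_discrete_sequence*, so a list with fewer colours than categories would give two slices the same colour. The helper below is a sketch of such a check. The name *check_palette* is hypothetical, and the palette shown is the original one from Step 3, used here only as a placeholder for the colours you choose.

```python
import re

# A valid code is "#" followed by exactly six hexadecimal digits
HEX_COLOUR = re.compile(r"#[0-9a-fA-F]{6}")


def check_palette(colours, n_categories):
    """Return a list of problems; an empty list means none were found."""
    problems = [f"not a six-digit hex colour: {c!r}"
                for c in colours if not HEX_COLOUR.fullmatch(c)]
    if len({c.lower() for c in colours}) < len(colours):
        problems.append("palette contains duplicate colours")
    if len(colours) < n_categories:
        problems.append(
            f"only {len(colours)} colours for {n_categories} categories")
    return problems


# Placeholder palette: replace with the colours chosen in ColorBrewer
new_colours = ['#ffffcc', '#a1dab4', '#41b6c4', '#2c7fb8', '#253494']

# Our pets data has five species, so we need at least five colours
print(check_palette(new_colours, 5))
```

This only catches formatting slips. Whether a palette is friendly to people with colour blindness still has to be judged with a tool such as ColorBrewer.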

#### 📚 Read
>This Jupyter notebook provides a walkthrough of this activity: Accessible Colour Schemes in Data Visualizations


### Additional considerations

While data visualizations can be a powerful tool for analyzing data and communicating insight, there are limitations we should consider when using them.


#### Speed and power

![Speed and power](../_images/Module4-Unit3-image2.png)

If a visualization is generated from live data, we should consider how long it takes to load when using older systems or less reliable internet connections.

When code is pulling from large or complex data sets to produce a visualization, it can take time to compute all that data. This means the notebook will be slow to respond.

In most cases Jupyter notebooks use remote server resources from the Hub to run code. We can help to optimize visualizations by processing the data using Python before passing that data to the visualization library.
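As a minimal sketch of what processing the data first can look like, the code below counts records per category in plain Python, so only a handful of totals would need to be handed to the visualization library. The records are made up for illustration, and their nested shape is an assumption.

```python
from collections import Counter

# Hypothetical records: one dictionary per art piece
records_demo = [
    {"fields": {"neighbourhood": "Downtown"}},
    {"fields": {"neighbourhood": "West End"}},
    {"fields": {"neighbourhood": "Downtown"}},
    {"fields": {"neighbourhood": "Kitsilano"}},
    {"fields": {"neighbourhood": "Downtown"}},
]

# Count how many records fall in each neighbourhood
counts = Counter(rec["fields"]["neighbourhood"] for rec in records_demo)

# Two short lists are all a bar chart needs
neighbourhoods = list(counts.keys())
totals = list(counts.values())

print(neighbourhoods)
print(totals)

# These lists could then be passed to a plotting call,
# for example px.bar(x=neighbourhoods, y=totals)
```

With five records the saving is negligible, but the same approach means a chart of a very large data set receives a few dozen totals instead of every row.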


### 🏁 Activity

Let's compare how long it takes to create a data visualization using two different methods.

1. Generating the data visualization at the same time as we download the data set.

2. Downloading the data set into a DataFrame, then creating the data visualization from the DataFrame.

We will use the [**Python time Module**](https://www.programiz.com/python-programming/time) to time how long each method takes.

In your Jupyter notebook, run the code below to set up the demonstration.

    # Get data from URL
    import requests as r

    # Parse data
    import pandas as pd

    # Visualizations
    import plotly.express as px
    import plotly.io as pio

    # Timing code
    import time

    # Get data
    print("Downloading data")
    link = "https://tinyurl.com/ycjwdfhk"
    
Now let's see how long it takes to generate a histogram directly from our Public Art data as we download it.

    # Downloading and visualizing at the same time
    t0 = time.time()

    fig = px.histogram(pd.json_normalize(r.get(link).json()["records"]),
                   x="fields.neighbourhood",
                   title="Histogram, art per neighborhood")
    fig.show()
    t1 = time.time()

    total = t1-t0

    print("Total time taken:", total)
    
![time](../_images/Module4-Unit3-image3.png)


*Bar chart showing how public art is distributed across neighborhoods. This chart shows 20 art pieces are in Grandview-Woodland, 26 in West End, 43 in Downtown Eastside, 154 in Downtown, 4 in Killarney, 15 in Kensington-Cedar Cottage, 22 in Strathcona, 24 in Stanley Park, 5 in Renfrew-Collingwood, 9 in Marpole, 15 in Kitsilano, 5 in Hastings-Sunrise, 9 in Riley Park, 4 in Oakridge, 11 in Fairview, 25 in Shaughnessy, 1 in Victoria-Fraserview, 2 in West Point Grey, 6 in Sunset, 3 in South Cambie, 2 in Dunbar-Southlands and 2 in Arbutus Ridge.*


The number of seconds it took for Jupyter notebooks to generate the histogram should be displayed directly below your visualization.

Time will vary, but for most people it will take around 2.5 seconds.

Now let's explore how long that same histogram takes to be generated if we've already downloaded our data into a DataFrame.

Run the code below to store our data in a pandas DataFrame, which we'll call *records*.


    # Download the data once and keep the response
    API_response_art = r.get(link)
    data = API_response_art.json()
    # Flatten the nested JSON records into a DataFrame
    records = pd.json_normalize(data=data['records'])
    
Now let's try generating our histogram just as we did before.
 
 
    t0 = time.time()

    fig = px.histogram(records,x="fields.neighbourhood",title="Histogram, art per neighborhood")
    fig.show()
    t1 = time.time()

    total = t1-t0

    print("Total time taken:", total)
    
    
![time](../_images/Module4-Unit3-image4.png)

*Bar chart showing how public art is distributed across neighborhoods. This chart shows 20 art pieces are in Grandview-Woodland, 26 in West End, 43 in Downtown Eastside, 154 in Downtown, 4 in Killarney, 15 in Kensington-Cedar Cottage, 22 in Strathcona, 24 in Stanley Park, 5 in Renfrew-Collingwood, 9 in Marpole, 15 in Kitsilano, 5 in Hastings-Sunrise, 9 in Riley Park, 4 in Oakridge, 11 in Fairview, 25 in Shaughnessy, 1 in Victoria-Fraserview, 2 in West Point Grey, 6 in Sunset, 3 in South Cambie, 2 in Dunbar-Southlands and 2 in Arbutus Ridge.*


Take a look at how long it took to generate the histogram this time. It should be significantly less!

Although the difference in this case is a matter of seconds, time can add up as our projects become more complex. Large data sets, multiple data sets, multiple visualizations, and more complex visualizations can make our presentations too slow to be practical in a classroom or when giving a presentation at work.
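The start-and-stop timing pattern used in this activity can be wrapped in a small reusable helper. The sketch below is one possible approach, not the course's own code. The name *timed* is our own, and it uses *time.perf_counter*, a clock in the same time module that is intended for measuring short durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the code inside the 'with' block took to run."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} seconds")


# Any code can go inside the block; here we time a simple calculation
with timed("Summing a million numbers"):
    total = sum(range(1_000_000))
```

The same helper could wrap each of the two histogram cells, which keeps the timing code out of the way of the code being measured.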


#### Explore
>This Jupyter notebook provides a walkthrough of this activity: Speed and power


#### Display


![Display](../_images/Module4-Unit3-image5.jpeg)


It's likely that not everyone who accesses our work will do so on equivalent equipment. Fine details and colours may not look the same on older or lower-resolution displays, or displays which have not been calibrated.

This is another great reason to keep visualizations simple and tidy and to be cautious about the colour palettes we choose.

### Conclusion

In this unit, we learned more ways to use data visualizations, their limitations, and ways to bypass some of those limitations. In the next unit, we will learn more about Python libraries which can be used to generate visualizations.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)