![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Module 4 Unit 4 - Python Libraries for Data Visualizations

*Get the most out of this section by opening a Jupyter notebook in another window and following along. Code snippets provided in the course can be pasted directly into your Jupyter notebook. Review Module 2, Unit 5 for a refresher on creating and opening Jupyter notebooks in Callysto.*

Libraries are collections of code for performing different kinds of tasks. Programmers can install libraries developed by others and reuse that code instead of writing everything from scratch.

The [Python developer community](https://www.python.org/community/) is full of experienced programmers who have made their code available to be used.

Let's explore some key libraries that are useful when creating data visualizations.

* Matplotlib
* Plotly
* Cufflinks
* Folium

The examples in this unit use a real data set sourced from [Vancouver Open Data](https://opendata.vancouver.ca/pages/home/).

Run the code below in a Jupyter notebook to pull the Public Art data set and follow along.

    # Get data from URL
    import requests as r
    import pandas as pd


    # Get data
    link = "https://tinyurl.com/ycjwdfhk"
    API_response_trees = r.get(link)
    data = API_response_trees.json() 
    # Parse data
    records = pd.json_normalize(data=data['records'])

    # Append coordinates
    lon = []
    lat = []
    for item in records['fields.geom.coordinates'].to_list():
        # Records without a geometry come through as NaN, which is a float
        if not isinstance(item, float):
            lon.append(item[0])
            lat.append(item[1])
        else:
            lon.append(0)
            lat.append(0)

    records['longitude'] = lon
    records['latitude'] = lat

    display(records.head())
    
The output should look like this:

![Public Art](../_images/Module4-Unit4-image.png)

*This is a dataset that contains complete information on art in Vancouver. We see the ID of art, the artist project statement, number of artists that worked on it, the neighborhood where it was installed, its coordinates and the type of datapoint.*
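Column names such as `fields.geom.coordinates` come from the way `pd.json_normalize` flattens nested JSON. The sketch below uses two made-up records (hypothetical values, shaped like the API response) to show the dotted column names and why the loop above treats float entries as missing coordinates.

```python
import pandas as pd

# Hypothetical records shaped like the API response (values are made up)
sample = {
    "records": [
        {"recordid": "a1",
         "fields": {"status": "In place",
                    "geom": {"type": "Point", "coordinates": [-123.12, 49.28]}}},
        {"recordid": "b2",
         "fields": {"status": "Removed"}},  # this record has no geometry
    ]
}

# Nested keys are flattened into dotted column names
flat = pd.json_normalize(data=sample["records"])
print(flat.columns.tolist())

# The record without a geometry gets NaN (a float) instead of a list
coords = flat["fields.geom.coordinates"].tolist()
print(coords)
```

The first entry is a `[longitude, latitude]` list, while the second is `NaN`, so checking for a float is one simple way to spot records with no location.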

### Matplotlib

The [**Matplotlib**](https://matplotlib.org/index.html) library provides methods for creating a variety of different kinds of data visualizations, including static, animated, and even interactive visualizations.

[**Pyplot**](https://matplotlib.org/api/pyplot_api.html) is a submodule of Matplotlib. Using the Pyplot module allows us to plot data stored in lists and arrays.

### 🏁 Activity: Creating simple charts with Pyplot

**Step 1**

We can access a submodule within a library with an import command that specifies the library name, followed by a dot (.) and the submodule name.

Run the code below to import Pyplot.

    import matplotlib.pyplot as plt

With this command, we've also given Pyplot an alias, plt, so we don't have to type out matplotlib.pyplot every single time we want to use a function from it.

Instead, all functions in this module can be accessed with the syntax **plt.FunctionName**.
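As a quick illustration of the alias in use, here is a minimal sketch that plots two plain Python lists. The numbers are made up for the example and are not taken from the Public Art data set.

```python
import matplotlib.pyplot as plt

# Made-up values stored in plain Python lists
years = [2018, 2019, 2020, 2021]
installations = [5, 8, 3, 6]

# Each Pyplot function is reached through the plt alias
plt.plot(years, installations, marker="o")
plt.title("Example line chart")
plt.xlabel("Year")
plt.ylabel("Installations")

ax = plt.gca()  # handle to the current axes, handy for inspecting the chart
plt.show()
```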

**Step 2**

Let’s take a closer look at the different statuses of the art in our data set, and see how many pieces of art are in each status.

    grouped_by_status = records.groupby("fields.status").size().reset_index(name="Count")
    grouped_by_status
    
The output should look like this:

![Status](../_images/Module4-Unit4-image1.png)

*This is a dataframe containing information on the status of art, where art can either be "In place", "Removed" or "Under review". On the first row (index 0), we see there are 372 pieces of art that are in place. On the second row (index 1), we see that 124 pieces of art have been removed. On the third row (index 2), we see that 4 pieces of art are under review.*
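To see what each part of that chain does, we can try it on a tiny stand-in DataFrame. The four rows below are made up for illustration; only the column name matches the real data set.

```python
import pandas as pd

# Stand-in for the records DataFrame (made-up rows)
toy = pd.DataFrame({"fields.status": ["In place", "Removed", "In place", "In place"]})

# groupby(...).size() counts the rows in each group and returns a Series
sizes = toy.groupby("fields.status").size()

# reset_index(name="Count") turns that Series back into a two-column DataFrame
table = sizes.reset_index(name="Count")
print(table)
```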


**Step 3**

Now that this data is broken out, we can create charts from it.

Run the code to create a bar chart.
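One way to draw this bar chart with Pyplot is sketched below. So that it runs on its own, it rebuilds `grouped_by_status` from the counts in the table above; if you already ran Step 2, you can skip that part. The title and axis labels are our own choices, not part of the data set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for grouped_by_status, using the counts from the table above
grouped_by_status = pd.DataFrame({
    "fields.status": ["In place", "Removed", "Under review"],
    "Count": [372, 124, 4],
})

# One bar per status, with the bar height given by the count
bars = plt.bar(grouped_by_status["fields.status"], grouped_by_status["Count"])
plt.title("Public art by status")
plt.xlabel("Status")
plt.ylabel("Number of pieces")
plt.show()
```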

**Step 4**



#### Plotly



#### Cufflinks



#### 📚 Read



#### Folium



#### 📚 Read


#### 📚 Read



### 🏁 Activity



**Step 1**


**Step 2**




**Step 3**



**Step 4**



#### 📚 Read
>This Jupyter notebook provides a walkthrough of this activity: Accessible Colour Schemes in Data Visualizations


### Additional considerations

While data visualizations can be a powerful tool for analyzing data and communicating insight, there are limitations we should consider when using them.


#### Speed and power

![Speed and power](../_images/Module4-Unit3-image2.png)

If a visualization is generated from live data, we should consider how long it takes to load when using older systems or less reliable internet connections.

When code is pulling from large or complex data sets to produce a visualization, it can take time to compute all that data. This means the notebook will be slow to respond.

In most cases Jupyter notebooks use remote server resources from the Hub to run code. We can help to optimize visualizations by processing the data using Python before passing that data to the visualization library.
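As a rough sketch of what processing the data first can look like, the example below counts rows with pandas and then passes only the totals to the plotting call. The 10,000 rows are made up, and Matplotlib is used here only as a convenient stand-in for whichever visualization library we prefer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: 10,000 rows spread over three neighbourhood names
raw = pd.DataFrame(
    {"neighbourhood": ["Downtown", "West End", "Kitsilano", "Downtown"] * 2500}
)

# Do the counting in pandas first ...
counts = raw["neighbourhood"].value_counts()

# ... so the plotting library only has to draw three bars
bars = plt.bar(counts.index, counts.values)
plt.title("Rows per neighbourhood")
plt.show()

print(len(raw), "rows reduced to", len(counts), "bars")
```

The chart looks the same either way, but the visualization library receives three numbers instead of ten thousand rows.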


### 🏁 Activity

Let's compare how long it takes to create a data visualization using two different methods.

1. Generating the data visualization at the same time as we download the data set.

1. Downloading the data set into a DataFrame, then creating the data visualization from the DataFrame.

We will use the [**Python time module**](https://www.programiz.com/python-programming/time) to time how long each method takes.

In your Jupyter notebook, run the code below to set up the demonstration.

    # Get data from URL
    import requests as r

    # Parse data
    import pandas as pd

    # Visualizations
    import plotly.express as px
    import plotly.io as pio

    # Timing code
    import time

    # Get data
    print("Downloading data")
    link = "https://tinyurl.com/ycjwdfhk"
    
Now let's see how long it takes to generate a histogram directly from our Public Art data as we download it.

    # Downloading and visualizing at the same time
    t0 = time.time()

    fig = px.histogram(pd.json_normalize(r.get(link).json()["records"]),
                   x="fields.neighbourhood",
                   title="Histogram, art per neighborhood")
    fig.show()
    t1 = time.time()

    total = t1-t0

    print("Total time taken:", total)
    
![time](../_images/Module4-Unit3-image3.png)


*Bar chart showing how art is distributed across neighborhoods. This chart shows 20 art pieces are in Grandview-Woodland, 26 in West End, 43 in Downtown Eastside, 154 in Downtown, 4 in Killarney, 15 in Kensington-Cedar Cottage, 22 in Strathcona, 24 in Stanley Park, 5 in Renfrew-Collingwood, 9 in Marpole, 15 in Kitsilano, 5 in Hastings-Sunrise, 9 in Riley Park, 4 in Oakridge, 11 in Fairview, 25 in Shaughnessy, 1 in Victoria-Fraserview, 2 in West Point Grey, 6 in Sunset, 3 in South Cambie, 2 in Dunbar-Southlands and 2 in Arbutus Ridge.*


The number of seconds it took for Jupyter notebooks to generate the histogram should be displayed directly below your visualization.

Time will vary, but for most people it will take around 2.5 seconds.

Now let's explore how long that same histogram takes to be generated if we've already downloaded our data into a DataFrame.

Run the code below to store our data in a pandas DataFrame, which we'll call *records*.


    API_response_trees = r.get(link)
    data = API_response_trees.json() 
    # Parse data
    records = pd.json_normalize(data=data['records'])
    
Now let's try generating our histogram just as we did before.
 
 
    t0 = time.time()

    fig = px.histogram(records,x="fields.neighbourhood",title="Histogram, art per neighborhood")
    fig.show()
    t1 = time.time()

    total = t1-t0

    print("Total time taken:", total)
    
    
![time](../_images/Module4-Unit3-image4.png)

*Bar chart showing how art is distributed across neighborhoods. This chart shows 20 art pieces are in Grandview-Woodland, 26 in West End, 43 in Downtown Eastside, 154 in Downtown, 4 in Killarney, 15 in Kensington-Cedar Cottage, 22 in Strathcona, 24 in Stanley Park, 5 in Renfrew-Collingwood, 9 in Marpole, 15 in Kitsilano, 5 in Hastings-Sunrise, 9 in Riley Park, 4 in Oakridge, 11 in Fairview, 25 in Shaughnessy, 1 in Victoria-Fraserview, 2 in West Point Grey, 6 in Sunset, 3 in South Cambie, 2 in Dunbar-Southlands and 2 in Arbutus Ridge.*


Take a look at how long the histogram took to generate this time. It should be significantly less, because the timed code no longer includes downloading and parsing the data!

Although the difference in this case is a matter of seconds, time can add up as our projects become more complex. Large data sets, multiple data sets, multiple visualizations, and more complex visualizations can make our presentations too slow to be practical in a classroom or at work.
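If we find ourselves timing many cells, the start/stop pattern used above can be wrapped in a small helper. The sketch below is one possible approach: `timed` is a hypothetical name of our own, it uses `time.perf_counter()` rather than `time.time()` because that clock is intended for measuring short durations, and `time.sleep` stands in for a download or a chart.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Time the code inside a with-block and print the result."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Runs when the with-block ends, even if the block raised an error
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f} seconds")

# Usage: time.sleep stands in for a download or a chart
with timed("Stand-in task") as timing:
    time.sleep(0.1)
```

The elapsed time is also stored in `timing["seconds"]`, so two methods can be compared in code as well as by eye.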


#### Explore
>This Jupyter notebook provides a walkthrough of this activity: Speed and power


#### Display


![Display](../_images/Module4-Unit3-image5.jpeg)


It's likely that not everyone who accesses our work will do so on equivalent equipment. Fine details and colours may not look the same on older or lower-resolution displays, or displays which have not been calibrated.

This is another great reason to keep visualizations simple and tidy and to be cautious about the colour palettes we choose.
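As one example of a cautious palette choice, Matplotlib ships with a style sheet named `tableau-colorblind10` whose colours are intended to be easier to tell apart for viewers with colour vision deficiencies. The sketch below applies it to the status counts from earlier in this unit; choosing this particular style is our assumption, not a requirement.

```python
import matplotlib.pyplot as plt

# Status counts from earlier in this unit
statuses = ["In place", "Removed", "Under review"]
counts = [372, 124, 4]

# Apply the style only inside this block, leaving other charts unchanged
with plt.style.context("tableau-colorblind10"):
    fig, ax = plt.subplots()
    for status, count in zip(statuses, counts):
        # Each call to ax.bar takes the next colour from the style's cycle
        ax.bar(status, count, label=status)
    ax.set_title("Public art by status")
    ax.set_ylabel("Number of pieces")
    ax.legend()

plt.show()
```

Because each bar is also labelled with text, the chart stays readable even on a display where the colours look alike.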

### Conclusion

In this unit, we learned more ways to use data visualizations, their limitations, and ways to bypass some of those limitations. In the next unit, we will learn more about Python libraries which can be used to generate visualizations.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)