![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fhackathon&branch=master&subPath=SustainabilityOnMars/CuriosityTrack/challenge-7-bonus.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# <p style="text-align: center;"> Bonus Challenge </p>
# <p style="text-align: center;"> 📌 Pets on Mars: Visualizing the Data </p>

A visualization is only useful if it helps us to understand the data set better or communicate information about it more accurately and powerfully.

When choosing to visually represent data, remember what question you want to answer and ask yourself if what the visualization is presenting is relevant to your question.

Run the following code (press `shift + enter`) in your Jupyter notebook to import the pandas library and recreate the pets DataFrame.

In [None]:
#❗️Run this cell

#load "pandas" library under the alias "pd"
import pandas as pd

#identify the location of our online data
url = "https://tinyurl.com/y917axtz-pets"

#read csv file from url and create a dataframe
pets = pd.read_csv(url)

#display the head of the data
pets.head()

The head of the DataFrame shows us that for each animal eight different variables have been recorded.

## 📕 Debrief: Grouping Variables

Let's start by picking a few columns we are interested in and get counts of the different values within those columns.

- Gender
- Species
- Age (in years)

To do this, we'll use the pandas method `groupby`, which lets us split data into groups and give those groups names so they can be easily referenced.

Run the code below to group data in the gender, species, and age columns of the pets DataFrame by count.

In [None]:
#❗️Run this cell

# Group by different Categories: Gender, Species, Age (years)

gender = pets.groupby("Gender").size().reset_index(name="Count")
species = pets.groupby("Species").size().reset_index(name="Count")
age = pets.groupby("Age (years)").size().reset_index(name="Count")

The variable `pets` refers to our DataFrame of information. Writing the code `pets.groupby("Gender")` splits the rows into one group per gender (Male and Female). Extending the line of code to `pets.groupby("Gender").size()` counts the number of rows in each group, and appending `.reset_index(name="Count")` turns the result back into a DataFrame whose count column has the informative name "Count".
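To see each step on its own, here is a minimal sketch on a tiny made-up table. The names `toy_pets`, `grouped`, `sizes`, and `toy_gender`, and the rows themselves, are assumptions for illustration only; they are not part of the pets data.

```python
import pandas as pd

# Hypothetical mini data set, invented for illustration (not the real pets data)
toy_pets = pd.DataFrame({
    "Name": ["Ava", "Bo", "Cy", "Di"],
    "Gender": ["Female", "Male", "Female", "Female"],
})

# Step 1: split the rows into groups; nothing is counted yet
grouped = toy_pets.groupby("Gender")

# Step 2: count the rows in each group (one number per gender)
sizes = grouped.size()

# Step 3: turn the counts into a DataFrame with a column named "Count"
toy_gender = sizes.reset_index(name="Count")

print(toy_gender)
```

Chaining the three steps on one line, as in the cell above, gives the same table.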

Now that we've created our groups, let's call each group name and see what it looks like as a table.

In [None]:
#❗️Run this cell

gender

### 🌟 What do you notice?

This is a dataframe that indicates that 15 pets are female, while 16 pets are male. 

The 0 indicates the first row - in Python we start counting from zero.<br>
This is the index of the row, and corresponds to the row female, with count value 15.<br>
The 1 indicates the second row, and corresponds to the row male, with count value 16.
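As a small sketch of how these index numbers can be used, the code below rebuilds the counts table by hand from the numbers quoted above and reads rows by their index with `.loc`. The name `gender_counts` and the exact spelling of the labels are assumptions made for this example.

```python
import pandas as pd

# Counts copied from the text above; the label spelling is an assumption
gender_counts = pd.DataFrame({"Gender": ["Female", "Male"], "Count": [15, 16]})

# Row with index 0 is the first row
first_row = gender_counts.loc[0]

# "Count" value in the row with index 1 (the second row)
second_count = gender_counts.loc[1, "Count"]

print(first_row["Gender"], first_row["Count"])
print(second_count)
```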

In [None]:
#❗️Run this cell

species

This is a dataframe that indicates there are five different species in the dataset, and tells us how many pets of each species it contains.

In [None]:
#❗️Run this cell

age

In [None]:
#❗️Run this cell
i.challenge7a()

### 📕 Activity
We now have three different subsets of data we could make visualizations from. But what type of visualization should we choose for each one?

Each one of these data subsets could easily be represented by a pie chart, scatter plot, or bar chart.<br>
Try matching the data sets with different visualizations to see what the resulting chart or graph would look like.<br>
Run the cell below.

In [None]:
%%html

<!--❗️Run this cell-->

<iframe width="100%" height="900" src="https://callysto.github.io/online-courses/CallystoAndDataScience/modules/interactives/datographer/plot-styles" frameborder="0" ></iframe>

## ✏️ Reflective Questions
1. Do you have a preferred data visualization for each data subset? If so, why?

2. Do any of the data visualizations seem like they might misrepresent the data, or cause confusion?

## 📕 Debrief: Creating Clear and Useful Visualizations

Sometimes a particular visualization represents data better than others.

Let's look at some examples created with <a href = "https://medium.com/plotly/introducing-plotly-express-808df010143d#:~:text=Plotly%20Express%20is%20a%20new,simple%20syntax%20for%20complex%20charts">Plotly Express</a>, a library with lots of great tools for creating nice-looking data visualizations.

Suppose we wanted to see the relationship between age and time to adoption for the pets in our data set. The code below allows us to generate a bar graph to compare these variables.

The code line that starts `bar_pet = px.bar(pets,...` calls the function `bar` from the Plotly Express library to create a bar chart from the data in the pets DataFrame.<br>
The line `x="Time to Adoption (weeks)"` says the horizontal axis in the bar chart will take data from the "Time to Adoption (weeks)" column in the DataFrame.<br>
A similar line `y="Age (years)"` tells us what column provides data for the vertical axis.<br>
The title line adds a nice title to our bar chart.<br>
And finally, the command `bar_pet.show()` tells the computer to show us the chart by displaying it on the screen.

Look at and interact with the visualization. When you mouse over the bar graph, you’ll notice more information appears.

In [None]:
#❗️Run this cell

#load "plotly express" library under the alias "px"
import plotly.express as px

# Create bar graph
bar_pet = px.bar(pets,
           x="Time to Adoption (weeks)", 
           y="Age (years)",
           title="Age (in years) and Time to Adoption (weeks) for each Pet")

# Display within our Jupyter notebook 
bar_pet.show()

### 🌟 What do you notice?
Looking at this bar graph, there seems to be a relationship between age and time to adoption, but some aspects of the visualization are not clear. For example, most of the bars are segmented, but there is no explanation of what the segments represent.
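One likely reason for the segments: Plotly Express draws one bar mark per row of the DataFrame, and marks that share the same x value are stacked on top of each other by default. The sketch below uses a made-up three-row table (the name `toy` and its values are assumptions, not the real pets data) to count how many rows share each time to adoption and how tall the stacked bar would be.

```python
import pandas as pd

# Hypothetical rows, invented for illustration (not the real pets data)
toy = pd.DataFrame({
    "Name": ["Ava", "Bo", "Cy"],
    "Time to Adoption (weeks)": [2, 2, 5],
    "Age (years)": [1, 3, 4],
})

# For each x value: "count" is the number of segments in the bar,
# "sum" is the total height of the stacked bar
stacks = toy.groupby("Time to Adoption (weeks)")["Age (years)"].agg(["count", "sum"])

print(stacks)
```

In this toy table, two pets share a time to adoption of 2 weeks, so that bar would have two segments and its height would be the sum of their ages rather than the age of any single pet.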

Let’s update our data visualization to include more information. We'll use colour to represent pet species and add labels with the names of each pet.

In [None]:
#❗️Run this cell

# Create coloured bar chart
bar_pet = px.bar(pets,
           x="Time to Adoption (weeks)", 
           y="Age (years)",
           title="Age (in years) and Time to Adoption (weeks) for each Pet",
           color="Species",text="Name")

bar_pet.show()

### 🌟 What do you notice?
These updates have made the bar chart easier to interpret. However, there's a lot of information represented here and the chart doesn't clearly display the relationship between age and time to adoption.

Some elements are also confusing. 

In [None]:
#❗️Run this cell
i.challenge7b()

Let’s try a different type of visualization.

Run the code below to create a scatter plot. In this visualization, each dot will represent a single pet and the dot's colour will represent the species.

In [None]:
#❗️Run this cell

# Create scatter plot
scatter_pet = px.scatter(pets,
           x="Time to Adoption (weeks)", 
           y="Age (years)",
           title="Age (in years) and Time to Adoption (weeks) for each Pet",
           color="Species",hover_name="Name")

scatter_pet.show()

### 🌟 What do you notice?
This scatter plot communicates our information more clearly than our bar graph. It allows each pet to be represented equally, shows the strength of the relationship between our two variables, and allows us to easily identify outliers and see the different species of pets in our data set.
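If you would like a number to go with what the scatter plot shows, one common option is the correlation coefficient, which pandas can compute with `Series.corr`. This is only a sketch on made-up values (the name `toy` and its numbers are assumptions, not the real pets data); a result close to 1 or -1 suggests a strong straight-line relationship, while a result near 0 suggests a weak one.

```python
import pandas as pd

# Hypothetical values, invented for illustration (not the real pets data)
toy = pd.DataFrame({
    "Age (years)": [1, 2, 3, 4, 5],
    "Time to Adoption (weeks)": [2, 3, 5, 4, 8],
})

# Pearson correlation between the two columns (a value between -1 and 1)
r = toy["Age (years)"].corr(toy["Time to Adoption (weeks)"])

print(round(r, 2))
```

The same line should work on the `pets` DataFrame, assuming both columns contain only numbers.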

# 📚 Challenge Complete!
Great job! You completed the bonus challenge. 

# <p style="text-align: center;"> 🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟🌟 </p>

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)