# Data Visualization

## Assignment 1: Why Visualize Data?

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are the links to 2 videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

<div class="alert alert-info" style="color:black">
    
### Assignment Learning Goals:

By the end of the module, students are expected to:

- Explain the importance of data visualizations.
- Describe the role of the grammar of graphics in data visualization.
- Create point and line visualizations in `altair`.
- Transform data directly in `altair` instead of `pandas`.
- Combine graphical marks via layering.


This assignment covers [Module 1](https://viz-learn.mds.ubc.ca/en/module1) of the online course. You should complete this module before attempting this assignment.
 
</div>

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers, then run the cell.

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 1 more point than indicated.

In [None]:
# Import libraries needed for this assignment
from hashlib import sha1
import altair as alt
import pandas as pd

import test_assignment1 as t

##  Gapminder

![](https://www.gapminder.org/wp-content/themes/gapminder2/images/logo.png)

The Gapminder foundation strives to educate people about the public health status
in countries all around the world
and fight devastating misconceptions that hinder world development.
This information is important both for our capacity to make considerate choices as individuals
and from an industry perspective in understanding where markets are emerging.
In their research,
Gapminder has discovered that most people don't really know what the world looks like today.

**Do you?** 

Feel free to take [this 7-8 min quiz to find out](https://forms.gapminder.org/s3/test-2018).
It is not mandatory or for marks but it may spark your curiosity to learn more about this lab's data set!

Also, don't worry if you get a low score on this quiz. When some of the instructors of this course took it for the first time, we also didn't shine too brightly. 

# 1. Getting Motivated

We are going to ease into this lab by first watching <a href="https://www.youtube.com/watch?v=usdJgEwMinM" target="_blank">this 20 min video of Hans Rosling</a>, a public health professor at Karolinska Institute
who founded Gapminder together with his son and his son's wife.
Although the video is almost 15 years old, it is a formidable demonstration of how to present data in a way that engages your audience while conveying a strong, important message. (The original clip has over 3 million views,
but we have linked a version with better video quality.)

Once you have finished watching this, answer the multiple-choice questions below. 

**Question 1.1** <br> {points: 1}  

Which question did Hans Rosling ask his students on his "pretest" before the course began? 
 
A) Which continent has the highest mortality rate?

B) Which country has the highest mortality rate?

C) Which continent has the highest child mortality rate?

D) Which country has the highest child mortality rate?

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_1`.*

 For example: 

`answer1_1 = "B"`


In [None]:
answer1_1 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_1

In [None]:
t.test_1_1(answer1_1)

**Question 1.2** <br> {points: 1}  

Which animal did Hans say knew more about the world statistically than Swedish students?

*Save your answers as a string in an object named `answer1_2`*. 

For example: 

`answer1_2 = "elephants"`


In [None]:
answer1_2 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_2

In [None]:
t.test_1_2(answer1_2)

**Question 1.3** <br> {points: 1}  

What was the problem that Hans had to overcome with his students?

A) Ignorance

B) Intelligence

C) Preconceived ideas

D) Work ethic

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_3`.*

In [None]:
answer1_3 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_3

In [None]:
t.test_1_3(answer1_3)

**Question 1.4** <br> {points: 1}  

For the 1962 data that Hans showed for fertility rate vs life expectancy, what were the trends among the industrialized countries and the developing countries? 

A) In the industrialized countries families tend to be smaller, with longer life expectancies and in the developing countries families tend to be larger with shorter life expectancies. 

B) In the industrialized countries families tend to be larger, with longer life expectancies and in the developing countries families tend to be smaller with shorter life expectancies. 

C) In the industrialized countries families tend to be larger, with shorter life expectancies and in the developing countries families tend to be smaller with longer life expectancies. 

D) In the industrialized countries families tend to be smaller, with shorter life expectancies and in the developing countries families tend to be larger with longer life expectancies. 

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_4`.*

In [None]:
answer1_4 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_4

In [None]:
t.test_1_4(answer1_4)

**Question 1.5** <br> {points: 1}  


What changed in the 1962 fertility rate vs life expectancy data as time progressed into the 2000s?

Select all that apply:

A) On average, the life expectancy among industrialized countries seemed to have decreased.

B) On average, the life expectancy among developing countries seemed to have increased.

C) On average, the size of families among industrialized countries seemed to have increased.

D) On average, the size of families among developing countries seemed to have decreased.

*To answer the question, select all that apply and add the letter(s) associated with the correct answer(s) to a list and assign it to a variable named `answer1_5`. For example, if you believe that A and B are True, then your answer would look like this:*

`answer1_5 = ["A", "B"]`


In [None]:
answer1_5 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_5

In [None]:
t.test_1_5(answer1_5)

**Question 1.6** <br> {points: 1}  

In the second world income distribution visualization that Hans presents, for the 1970 data, which continent contributed most to those living in absolute poverty?

*Save your answers as a string in an object named `answer1_6`*. 


In [None]:
answer1_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_6

In [None]:
t.test_1_6(answer1_6)

**Question 1.7** <br> {points: 1}  

What type of relationship did Hans notice between child survival rate and GDP per capita (on a log scale) among continents?


A) There was no relationship evident

B) Linear

C) Quadratic

D) Logarithmic

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_7`.*

In [None]:
answer1_7 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_7

In [None]:
t.test_1_7(answer1_7)

**Question 1.8** <br> {points: 1}  

Which of the following statements is true? 

A) Hans noticed that most countries within a continent progressed through time at similar survival vs GDP rates. 

B) Hans noticed that there was a huge amount of variation among the countries for the survival vs GDP rates. 

C) Hans pointed out that from 1970 to 2000, South Korea showed little signs of advancement. 

D) Hans commented that the survival vs GDP rate within a country was relatively the same among different income levels. 

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_8`.*

In [None]:
answer1_8 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_8

In [None]:
t.test_1_8(answer1_8)

**Question 1.9** <br> {points: 2}  

Where did Hans get the inspiration to name the dataset "gapminder"?

A) The gap between the mean score of his Swedish students and the chimpanzees. 

B) The difference between industrialized and developing countries from the first diagram he presented. 

C) The London Metro station. 

D) The store "The Gap".

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer1_9`.*

In [None]:
answer1_9 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_9

In [None]:
# check that the variable exists
assert 'answer1_9' in globals(
), "Please make sure that your solution is named 'answer1_9'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 1.10** <br> {points: 1}  

The following has no correct answer (in the sense that we will accept anything you answer here), but do you think Hans's visualization helped you grasp his insights faster and more effectively than if he had simply read them out or shown you numerical values?


*Write your comment as a string below and assign it to an object called `answer1_10`. Note that this is more of a thought-provoking question so the auto-grader will only check that your comment is not too short*.

In [None]:
answer1_10 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer1_10

In [None]:
t.test_1_10(answer1_10)

# 2.  Gapminder bubble chart

**Example:**   

![](img/life_fertility.jpg)
     
[Source](https://bit.ly/2YNoFMV)



"Bubble chart" plots have become quite famous from their appearance in the Gapminder talks and are widely used in other areas as well.

Let's start by recreating a simple version of this chart ourselves!

There will be some data wrangling involved in this assignment. Since this course is primarily about visualization and this is the first assignment, we will be providing some hints and starter questions to help you wrangle your data.

Note: since you do have wrangling experience from [Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/), we will not always do this in the future. 

The data that we have provided to you contains values up until 2018 for most of the features.
You will not use all the columns in the data set, but here is a description of what they contain
that you can refer back to throughout the assignment.

| Column                | Description                                                                                  |
|-----------------------|----------------------------------------------------------------------------------------------|
| country               | Country name                                                                                 |
| year                  | Year of observation                                                                          |
| population            | Population in the country at each year                                                       |
| region                | Continent the country belongs to                                                             |
| sub_region            | Sub-region the country belongs to                                                            |
| income_group          | Income group [as specified by the world bank in 2018]                                                |
| life_expectancy       | The mean number of years a newborn would <br>live if mortality patterns remained constant    |
| income                | GDP per capita (in USD) <em>adjusted <br>for differences in purchasing power</em>            |
| children_per_woman    | Average number of children born per woman                                                    |
| child_mortality       | Deaths of children under 5 years <br>of age per 1000 live births                             |
| pop_density           | Average number of people per km<sup>2</sup>                                                  |
| co2_per_capita        | CO2 emissions from fossil fuels (tonnes per capita)                                          |
| years_in_school_men   | Mean number of years in primary, secondary,<br>and tertiary school for 25-36 years old men   |
| years_in_school_women | Mean number of years in primary, secondary,<br>and tertiary school for 25-36 years old women |

[as specified by the world bank in 2018]: https://datahelpdesk.worldbank.org/knowledgebase/articles/378833-how-are-the-income-group-thresholds-determined

**Question 2.1** 
    <br> {points: 3}

Before you do anything, you must read in the data. We have provided you with a preprocessed version of the 2018 Gapminder data in a csv file named `world-data-gapminder.csv`.  

Use `read_csv` from `pandas` to load the data from the `data` folder and  assign it to a variable named `gapminder_df`. Set the `parse_dates` argument to `['year']` to ensure that Altair recognizes this column as time data.

It would also be useful to take a look at the first 5 rows of your data to see how it's presented. 

*Hint: Observe in particular how the values in the `year` column are presented. You'll need this in the upcoming questions.*
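As a rough illustration of what `parse_dates` does, here is a minimal sketch on a made-up in-memory CSV. The column names (`city`, `observed`, `temp_c`) and values are hypothetical and unrelated to the Gapminder file; the point is only to show how a parsed column differs from the others.

```python
import io

import pandas as pd

# Hypothetical toy CSV held in memory, standing in for a file on disk
toy_csv = io.StringIO(
    "city,observed,temp_c\n"
    "Oslo,2001-01-01,3.5\n"
    "Lima,2002-01-01,19.0\n"
)

# Columns listed in parse_dates are converted to datetime values on load
toy_df = pd.read_csv(toy_csv, parse_dates=["observed"])

print(toy_df.dtypes)  # compare the dtype of `observed` with the other columns
print(toy_df.head())  # look at how the parsed values are displayed
```

Checking `dtypes` right after loading is a quick way to confirm that the date column was parsed as intended.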


In [None]:
gapminder_df = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'gapminder_df' in globals(
), "Please make sure that your solution is named 'gapminder_df'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [None]:
t.test_2_1b(gapminder_df)

**Question 2.2**
<br> {points: 1}

What are the dimensions of `gapminder_df`? 

*Save this as a tuple in an object named `gapminder_df_shape`.*

In [None]:
gapminder_df_shape = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
gapminder_df_shape

In [None]:
t.test_2_2(gapminder_df_shape)

**Question 2.3** 
<br> {points: 1}

For the visualization that we are going to make, we want to concentrate on observations from the year 1962.

The question we are hoping to answer is ***"Does there appear to be a relationship between the number of children per woman and life expectancy in the year 1962?"***

Let's filter the `gapminder_df` object so that it only contains the 1962 observations. 

*Save the filtered data in an object named `gapminder_df_1962`.*
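One possible way to filter on a datetime column is through the `.dt` accessor. The sketch below uses a hypothetical toy dataframe (the names `toy_df`, `observed`, and `toy_2002` are invented for illustration) and is only one of several approaches that would work.

```python
import pandas as pd

# Hypothetical toy data with a datetime column
toy_df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Lima"],
        "observed": pd.to_datetime(["2001-01-01", "2002-01-01", "2002-01-01"]),
    }
)

# .dt.year extracts the year as an integer, which can be compared directly
toy_2002 = toy_df[toy_df["observed"].dt.year == 2002]
print(toy_2002)
```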

In [None]:
gapminder_df_1962 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
gapminder_df_1962.head()

In [None]:
t.test_2_3(gapminder_df_1962)

**Question 2.4** 
<br> {points: 3}

Now let’s create a bubble chart similar to the one you saw in the video using the `gapminder_df_1962` data:

- Use a circle mark (`mark_circle()`) to recreate the bubble geom of the plot in the video.
- Map the children per woman variable to the x-axis and the life expectancy variable to the y-axis.
- Map the region variable to the color argument, and the population variable to the size argument.

Don't worry about getting axis labels and sizes to be exactly like in the video,
we will return to this code later in the lab to customize it.

*Assign your plot to a variable named `bubble_plot`.*


In [None]:
bubble_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
bubble_plot

In [None]:
t.test_2_4a(bubble_plot)

In [None]:
t.test_2_4b(bubble_plot)

In [None]:
t.test_2_4c(bubble_plot)

**Question 2.5** <br> {points: 1}

Let's return to the question we initially had: ***"Does there appear to be a relationship between the number of children per woman and life expectancy in the year 1962?"***

Which of the following would you say is *most* accurate? 

A) There appears to be a cluster of countries that have higher children per woman rates with lower life expectancy and a cluster of countries with higher children per woman rates and higher life expectancy.

B) There seems to be a positive linear relationship between the number of children per woman and life expectancy. 

C) There seems to be a negative linear relationship between the number of children per woman and life expectancy. 

D)  There appears to be a cluster of countries that have higher children per woman rates with lower life expectancy and a cluster of countries with lower children per woman rates and higher life expectancy.
  

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer2_5`.*

In [None]:
answer2_5 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer2_5

In [None]:
t.test_2_5(answer2_5)

# 3. Education balance

For this next section, we are interested in seeing if women attend school for the same amount of time as men. Does family income have a relationship with this ratio?

Let’s find out what the data says about this.

*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*


**Question 3.1** 
<br> {points: 1}

Using the `gapminder_df` dataframe, create a new column named `women_men_school_ratio` that represents the ratio between the number of years in school for women and men (calculate it so that the value 1 means as many years for both, and 0.5 means half as many for women compared to men).

Name the dataframe object containing this column `gapminder_ratio_df`.
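A ratio column of this kind can be built with `assign`, which returns a new dataframe and leaves the original untouched. The sketch below uses hypothetical toy columns (`apples`, `oranges`) purely to show the mechanics.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({"apples": [2, 3], "oranges": [4, 3]})

# assign() returns a copy with the extra column; toy_df itself is unchanged
toy_ratio_df = toy_df.assign(
    apple_orange_ratio=toy_df["apples"] / toy_df["oranges"]
)
print(toy_ratio_df)
```

A value of 1 in the new column means both counts are equal, and 0.5 means the numerator is half the denominator.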

In [None]:
gapminder_ratio_df = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
gapminder_ratio_df.tail()

In [None]:
t.test_3_1(gapminder_ratio_df)

**Question 3.2** 
<br> {points: 1}

Filter the `gapminder_ratio_df` dataframe to only contain values from 1971 to 2014 inclusive, since those are the years where the education data has been recorded.

*Save the filtered data in an object named `gapminder_ratio_filtered`.*
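For an inclusive range filter on a datetime column, one option is `Series.between`, which includes both endpoints by default. This is a sketch on hypothetical toy data, not the only valid approach.

```python
import pandas as pd

# Hypothetical toy data with one row per year
toy_df = pd.DataFrame(
    {
        "observed": pd.to_datetime(
            ["1999-01-01", "2000-01-01", "2001-01-01", "2002-01-01"]
        ),
        "score": [1, 2, 3, 4],
    }
)

# between() keeps rows whose year lies in [2000, 2001], endpoints included
in_range = toy_df["observed"].dt.year.between(2000, 2001)
toy_filtered = toy_df[in_range]
print(toy_filtered)
```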

In [None]:
gapminder_ratio_filtered = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
gapminder_ratio_filtered.head()

In [None]:
t.test_3_2(gapminder_ratio_filtered)

**Question 3.3** 
<br> {points: 3}

Next, create a line plot from the `gapminder_ratio_filtered` data, showing how the ratio of women to men years in school has changed over time.

- Use a line mark (`mark_line`) 
- Map time (`year`) to the x-axis
- Map the **mean** of the women to men school ratio (the column we recently created in `gapminder_ratio_df`) to the y-axis.
- Colour the data by income group.


*Save the plot in an object named `line_plot`.*


In [None]:
line_plot = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
line_plot

In [None]:
t.test_3_3a(line_plot)

In [None]:
t.test_3_3b(line_plot)

In [None]:
t.test_3_3c(line_plot)

**Question 3.4** 
<br> {points: 1}

In the legend of the plot in **Question 3.3**, you'll notice that it's not quite ordered correctly. 

We go from "High" to "Low" and then to the "middle" income groups. We can specify the correct order by using the `sort` parameter in `alt.Color()` instead of simply setting `color` equal to the desired column. [This comment on GitHub](https://github.com/altair-viz/altair/issues/1059#issuecomment-409944083) may give you a bit of direction.

Recreate the plot above, but this time sorting the legend of the `income_group` from high to low. 

Make sure that you order the income groups correctly in a list so that when you assign it to the `sort` argument within `alt.Color()`, the legend will read in descending order.

*Save the plot in an object named `line_plot_sorted`.*

In [None]:
line_plot_sorted = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
line_plot_sorted

In [None]:
t.test_3_4(line_plot_sorted)

**Question 3.5** 
<br> {points: 2}

Use layering to add a [square mark](https://altair-viz.github.io/user_guide/marks.html) for every data point in the plot `line_plot_sorted` from **Question 3.4** (so one per yearly mean in each group). 

*Hint: This question is much simpler than you likely think. Refer to the slides if this is causing any confusion.*


*Save the layered plot in an object named `plots_combined`.*

In [None]:
plots_combined = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
plots_combined

In [None]:
# check that the variable exists
assert 'plots_combined' in globals(
), "Please make sure that your solution is named 'plots_combined'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.6** 
<br> {points: 1}

Ok let's now return to our original question - Do women around the world go to school less than men?  

Which of the following statements is true? Select all that apply: 

A) Over time, the ratio of years in school for women to men has increased for all income groups.

B) As of 2014, all income groups have ratios of years in school for women to men of at least 1. 

C) As of 2014, the number of years in school for women exceeds those of men for the high income group.

D) In 1985, men completed approximately twice as many years in school as women in the low income group. 

*To answer the question, select all that apply and add the letter(s) associated with the correct answer(s) to a list and assign it to a variable named `answer3_6`. For example, if you believe that A and B are True, then your answer would look like this:*

`answer3_6 = ["A", "B"]`

In [None]:
answer3_6 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
answer3_6

In [None]:
t.test_3_6(answer3_6)

# 4. Chart Beautification

Let's make our charts from question 2 look more like the Gapminder bubble chart! 
Beautifying charts can take a long time, but it is also satisfying when you end up with a really nice looking chart in the end. We will learn more about how to create charts for communication later, but these parameters are usually enough to create basic communication charts and to help you in your data exploration.

In order to start the beautification process, we will have to remind ourselves of `alt.X()` and `alt.Y()` that we learned in the second module of [Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/). We learned how to add titles and axis labels, as well as change the plotting size. Let's add these to one of the plots we made.

**Question 4.1** 
<br> {points: 1}

First, let's begin with the dimensions of a plot. 

Copy in your code for `bubble_plot` from **Question 2.4** and confirm that your scatter plot is generated properly, so that you know nothing was missed when copying.
Add the arguments `width=700` and `height=400` to the plot. If you can't remember how to do this, take a look at exercise 31: *Quick Viz with Altair* in [Module 1 of Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/en/module1).

*Save the plot with new dimensions in an object named `plot_dim`.*

In [None]:
plot_dim = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
plot_dim

In [None]:
t.test_4_1(plot_dim)

**Question 4.2** 
<br> {points: 2}

Next, let's add a title to the chart! Make sure it's meaningful to what you are trying to convey.  

This can be done by using the plot `plot_dim` from **Question 4.1**  and adding `.properties()` with the argument `title` (which is how we learned it in exercise 31: *Quick Viz with Altair* in [Module 1 of Programming in Python for Data Science](https://prog-learn.mds.ubc.ca/en/module1)). 

You shouldn't have to copy and paste the plot code above and you should be able to simply take the plotting object and include `.properties()` .

Assign the new plot to a variable named `title_plot`.



In [None]:
title_plot = None 

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
title_plot

In [None]:
# check that the variable exists
assert 'title_plot' in globals(
), "Please make sure that your solution is named 'title_plot'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.3** 
<br> {points: 3}

Using `title_plot` from **Question 4.2**, let's now assign proper titles for the axis and the legends. The titles should include spaces instead of underscores and have appropriate capitalization.

To do this, we need to assign our desired axis title as a string to the encoding portion of the plot.

Let's change the x axis label as an example. We want to change the title for this axis to `Children per woman`. 


In [None]:
title_plot.encoding.x.title = 'Children per woman'
title_plot

We can see now that `title_plot` has a new x-axis label reflecting our change. 

Do the same thing for the `y` axis, the `color` legend, and the `size` legend titles. If a chart label has units (e.g., life expectancy), they should be specified in the corresponding labels. Be sure to include the unit in your label for `life_expectancy`!

In [None]:

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer
title_plot

In [None]:
t.test_4_3a(title_plot)

In [None]:
t.test_4_3b(title_plot)

In [None]:
t.test_4_3c(title_plot)

In the next assignment, we will talk more about changing the appearance of a plot by changing the axis scales, or the label and title font size. Hang tight, you're almost there!

## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel, clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions
- Gapminder dataset processed and uploaded by Joel Ostblom - [UofTCoders/workshops-dc-py](https://github.com/UofTCoders/workshops-dc-py)

- Original Gapminder data - [The Gapminder Foundation](https://www.gapminder.org/)


- MDS DSCI 531: Data Visualization I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_531_viz-1) 
