# Worksheet 2 - Scientific Visualization MVE080/MMG640
## Directory of visualizations, color, and aesthetics

Name: _Your Name_

This is the second worksheet in the course *Scientific Visualization*. The purpose is to study various types of visualizations and how they are produced in Matplotlib. 

Once you're finished with all the tasks, export this document as an HTML file and upload it to Canvas.
You are encouraged to discuss problems and solutions with your fellow students (in the classroom but also on CampusWire), but each student must solve all tasks by themselves and hand in their own report.
Notice that Jupyter notebooks use [Markdown](https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax#links) for writing text cells. Make sure you understand the basics. Later on you can also include $\LaTeX$ in your Markdown cells.

## Setup
Before we begin it is necessary to load a few Python modules that are needed. We can do that with the following commands (which only have to be run once, unless you restart the Jupyter kernel in which case you have to re-run them).

In some of the tasks below we shall use an add-on module to Matplotlib called [Seaborn](https://seaborn.pydata.org/index.html). It is entirely possible to solve all the tasks without Seaborn (if you prefer), but Seaborn simplifies many things. Furthermore, to handle tabular data (such as the data in an Excel sheet), we shall use the module [Pandas](https://pandas.pydata.org/). Pandas is an open source data analysis toolbox that connects well with Matplotlib and in particular with Seaborn. Both Pandas and Seaborn are part of the Anaconda Python distribution and can be considered "standard" modules (meaning a lot of code uses these modules).

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Task 1: colors and directory of visualizations

Read Chapters 4, 19, and 5 of [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/), then answer the questions below.

### Question 1.1
When selecting a color scale (called _colormap_ in Matplotlib and _palette_ in Seaborn), how does the type of data influence the choice of color scale? Give examples.

### Answer 1.1
_Your answer here_

### Question 1.2
The Seaborn module offers functionality for creating your own color scales.
Read [this tutorial](https://seaborn.pydata.org/tutorial/color_palettes.html) on how to create palettes in Seaborn.
What is a "cubehelix" palette? When is it useful?

### Answer 1.2
_Your answer here_

### Task 1.1
Use Seaborn to design your own cubehelix sequential color scale. 
Then, in the tasks below, use this color scale whenever appropriate.

**Hint:** Check out the Seaborn command [`choose_cubehelix_palette`](https://seaborn.pydata.org/generated/seaborn.choose_cubehelix_palette.html#seaborn.choose_cubehelix_palette).
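
`choose_cubehelix_palette` launches an interactive widget. Once you have settled on parameters, the same palette can be built non-interactively with `cubehelix_palette`. The following is a minimal sketch; the parameter values and the names `my_palette` and `my_cmap` are arbitrary placeholders, not recommended settings.

```python
import seaborn as sns

# Parameter values below are arbitrary placeholders; tune them to taste.
params = dict(start=0.5, rot=-0.75, dark=0.2, light=0.9)

# Discrete version: a list of 8 RGB tuples, usable as `palette=` in Seaborn.
my_palette = sns.cubehelix_palette(n_colors=8, **params)

# Continuous version: a Matplotlib colormap, usable as `cmap=` (e.g. in heat maps).
my_cmap = sns.cubehelix_palette(as_cmap=True, **params)
```

The discrete form suits categorical plot functions, while the colormap form suits plots that map a continuous value to color.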

In [27]:
# Your code-cells here for creating the cubehelix palette with Seaborn

### Question 1.3
The author describes 6 different types of 2D plots and discusses briefly their use. In some cases the data you have naturally limits which type of plots you can use (for example geospatial plots require geospatial data). But in many cases it is not at all obvious which type of plot to use. I want you to think carefully about a strategy for selecting which plot to use. This is an "open question", so there's no right or wrong. To help you get going, think about the data visualizations that you've been doing in other courses.

### Answer 1.3
_Explain your strategy for selecting plot type here. Try to be as concrete as possible: give examples of data (anything, either real or made up) and explain how your strategy applies to that data._

## Task 2: Pandas for tabular data

![Pandas](https://images.squarespace-cdn.com/content/v1/5268c662e4b0269256614e9a/1562312824569-6LDHKN3X0QPA2ON2CPGZ/pandas-logo-300.png)

First [watch this video introduction to Pandas](https://youtu.be/_T8LGqJtuGc).
Then go to the [Pandas getting started tutorial page](https://pandas.pydata.org/docs/getting_started/intro_tutorials/) and go through the tutorials (you don't have to include the code-cells for this).

As part of this worksheet you are given 3 files containing temperature data from cities in Sweden (`smhi-gothenburg.csv`, `smhi-stockholm.csv`, and `smhi-malmoe.csv`). The data is saved in the [comma-separated values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) format.
This is a tabular format, similar in structure to an Excel sheet.

Write code below to load the 3 CSV files using the Pandas `DataFrame` structure. 
Then combine the 3 data structures in a new `DataFrame` with a new column labelled "City" taking the values "Stockholm", "Gothenburg", or "Malmoe". Save your new Pandas data structure as a new CSV file.
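
The load, label, combine, and save steps described above could look roughly like the sketch below. It runs on small in-memory CSV strings instead of the real files, and the column names `Year`, `Month`, and `Temperature` as well as all values are made up for illustration; inspect the actual files to see which columns they contain.

```python
import io
import pandas as pd

# Made-up stand-ins for the three SMHI files; real column names and values may differ.
csv_texts = {
    "Gothenburg": "Year,Month,Temperature\n1961,1,-0.5\n1961,7,16.8\n",
    "Stockholm": "Year,Month,Temperature\n1961,1,-2.1\n1961,7,17.2\n",
    "Malmoe": "Year,Month,Temperature\n1961,1,0.3\n1961,7,17.0\n",
}

frames = []
for city, text in csv_texts.items():
    # With the real data, pass the file path (e.g. "smhi-gothenburg.csv") instead.
    df = pd.read_csv(io.StringIO(text))
    df["City"] = city  # new column identifying the source city
    frames.append(df)

# Stack the three tables on top of each other with a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)

# Write CSV without the index column; pass a file path instead of a buffer to save to disk.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
```

Keeping all cities in one "long" table with a `City` column is the layout Seaborn's `data=`, `x=`, `hue=` interface expects.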

In [28]:
# Your code-cells here

## Task 3: Seaborn for plotting

![Seaborn](https://cmdlinetips.com/wp-content/uploads/2020/09/Seaborn_logo.png)

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative graphics. It works especially well together with the Pandas `DataFrame` structure.

### Task 3.1

First work through the [Seaborn introduction](https://seaborn.pydata.org/introduction.html). Notice that the test data used in the tutorial is based on Pandas. You don't have to include the code-cells for the Seaborn tutorials.

### Task 3.2

Read Chapter 6 of [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/).
Using the temperature `DataFrame` from Task 2, create two figures displaying the average temperature during January, April, July, and October in the three cities.
- The first figure should be a **grouped bar plot**, where the group is the city. On the x-axis you should have the months. To generate grouped bar plots with Seaborn, see examples in the [categorical data plotting tutorial](https://seaborn.pydata.org/tutorial/categorical.html).
- The second figure should also be a grouped bar plot, but now the group is the months. On the x-axis you should have the cities.

Try to make your plots as similar as possible to [Figures 6.7 and 6.8 of the book](https://clauswilke.com/dataviz/visualizing-amounts.html) (but of course with different data and labels). To do this, it is helpful to first read the [Seaborn Plot aesthetics tutorial](https://seaborn.pydata.org/tutorial.html).
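
As a starting point, the two groupings differ only in which column is passed as `x` and which as `hue`. The sketch below uses randomly generated stand-in data; the column names `City`, `Month`, and `Temperature` are assumptions and need to be adapted to your own `DataFrame`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up stand-in data; the column names are assumptions, not the real file layout.
rng = np.random.default_rng(0)
months = ["Jan", "Apr", "Jul", "Oct"]
typical = {"Jan": -1.0, "Apr": 6.0, "Jul": 17.0, "Oct": 8.0}
rows = [
    {"City": city, "Month": month, "Temperature": typical[month] + rng.normal(0, 1.5)}
    for city in ["Stockholm", "Gothenburg", "Malmoe"]
    for month in months
    for _ in range(10)  # ten fake years per city and month
]
df = pd.DataFrame(rows)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Months on the x-axis, one bar per city within each month.
# barplot aggregates with the mean by default.
sns.barplot(data=df, x="Month", y="Temperature", hue="City", order=months, ax=ax_left)

# Cities on the x-axis, one bar per month within each city.
sns.barplot(data=df, x="City", y="Temperature", hue="Month", hue_order=months, ax=ax_right)

ax_left.set_ylabel("Mean temperature")
fig.tight_layout()
```

Passing `order` and `hue_order` explicitly keeps the months in calendar order rather than in the order they happen to appear in the data.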

In [9]:
# Your code-cell for the first figure here

In [8]:
# Your code-cell for the second figure here

### Question 3.1

Of the two plots you made in Task 3.2, which one do you think works best? Explain your reasoning.

### Answer 3.1

_Provide your answer here_

## Task 4: Visualizing distributions

Read Chapter 7 and Chapter 9 of [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/).
Using the temperature data structure that you created in Task 2, create one or several figures displaying the temperature distribution throughout the whole year for each of the three cities. I'm not telling you which type of distribution visualization to use: study the figures in Chapters 7 and 9 carefully and judge their pros and cons. You will probably need to try many different types of distribution plots (each with one or several figures included) before you settle on the best one.

**Hint:** Useful information about using Seaborn for visualizing distributions can be found in [this tutorial](https://seaborn.pydata.org/tutorial/distributions.html).

In [10]:
# Your code-cells for the temperature distribution plots here.

### Question 4.1
Explain, with clear motivations, what made you choose the type of visualization you used. Also, explain all the visualization types you tested and why in the end you did not use them.

### Answer 4.1
_Provide your answer here_

## Task 5: Visualizing time-series and trends

![Image](https://seaborn.pydata.org/_images/regression_17_0.png)

Read Chapter 13 and Chapter 14 of [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/).

### Task 5.1

Use the temperature data from Task 2 to display, for each city, how temperature varies with the month of the year. Do it in three different ways:

1. Using an x-y-scatter plot (with month on the x-axis, and make sure the ticks are labelled by month names, not numbers).
2. Using a polar scatter plot, with angle corresponding to the month scale and radius corresponding to the temperature scale. Take into account that the average temperature during some months is negative.
3. Using a _heat map_, similar to Figure 2.4 in [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/). Remember to use your own cubehelix palette.
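
Two of the ingredients in the list above are easy to get wrong: handling negative values on a polar radius, and reshaping the data for a heat map. The sketch below shows one possible approach to each, using a made-up seasonal cycle and assumed column names (`City`, `Month`, `Temperature`). The radial limits are illustrative, and a default cubehelix colormap stands in for your own palette.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_numbers = np.arange(1, 13)

# Made-up seasonal cycle per city (coldest in January); not real measurements.
offsets = {"Stockholm": 6.5, "Gothenburg": 8.0, "Malmoe": 8.5}
rows = [
    {"City": city, "Month": m,
     "Temperature": offset - 10.0 * np.cos(2 * np.pi * (m - 1) / 12)}
    for city, offset in offsets.items()
    for m in month_numbers
]
df = pd.DataFrame(rows)

# Polar scatter: month -> angle, temperature -> radius.
fig_polar, ax_polar = plt.subplots(subplot_kw={"projection": "polar"})
for city, group in df.groupby("City"):
    theta = 2 * np.pi * (group["Month"] - 1) / 12
    ax_polar.scatter(theta, group["Temperature"], label=city)
ax_polar.set_theta_zero_location("N")  # January at the top
ax_polar.set_theta_direction(-1)       # months run clockwise
ax_polar.set_xticks(2 * np.pi * (month_numbers - 1) / 12)
ax_polar.set_xticklabels(month_names)
ax_polar.set_rlim(-10, 25)             # radial axis spans negative temperatures
ax_polar.set_rorigin(-15)              # center of the plot lies below the lowest radius
ax_polar.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))

# Heat map: pivot to a City x Month table of mean temperatures.
table = df.pivot_table(index="City", columns="Month", values="Temperature")
fig_heat, ax_heat = plt.subplots(figsize=(8, 2.5))
sns.heatmap(table, cmap=sns.cubehelix_palette(as_cmap=True),
            xticklabels=month_names, ax=ax_heat)
```

Setting the radial origin below the coldest value means a negative temperature is drawn as a small positive distance from the center instead of being reflected through it.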

In [26]:
# Your code-cells here.

### Question 5.1

Chapter 14 discusses the LOESS smoother. In Seaborn the [LOWESS smoother](https://en.wikipedia.org/wiki/Local_regression) is implemented. When is it appropriate to use a smoother? What is the difference between LOESS and LOWESS?

### Answer 5.1
_Your answer here_

### Task 5.2

Based on what you've learned in Chapters 14 and 15, make a visualization that, for each city, shows the temperature trends for January, April, July, and October from 1961 to today. Try various types of regression models and motivate your choice. Also motivate why you chose to include or not to include scatter plots and/or confidence intervals in your final plot.

**Hint:** Read the [Seaborn tutorial on regression models](https://seaborn.pydata.org/tutorial/regression.html).
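
As one possible starting point for the trend plots, the sketch below fits a straight line per city with `regplot` and draws a bootstrapped confidence band. The data is synthetic (a hypothetical linear trend plus noise for a single month), and the column names `Year`, `City`, and `Temperature` are assumptions. Other model choices, such as `order=2` or `lowess=True` (which requires the statsmodels package), are worth comparing.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for one month's yearly values; trend and noise are made up.
rng = np.random.default_rng(2)
years = np.arange(1961, 2021)
cities = ["Stockholm", "Gothenburg", "Malmoe"]
frames = [
    pd.DataFrame(
        {
            "Year": years,
            "City": city,
            "Temperature": 16.0 + 0.02 * (years - 1961) + rng.normal(0, 1.2, years.size),
        }
    )
    for city in cities
]
df = pd.concat(frames, ignore_index=True)

fig, ax = plt.subplots(figsize=(7, 4))
for city, group in df.groupby("City"):
    # order=1 is a linear fit; ci=95 draws a 95% bootstrap confidence band.
    # Set scatter=False or ci=None to drop the points or the band.
    sns.regplot(data=group, x="Year", y="Temperature", order=1, ci=95,
                scatter_kws={"s": 10, "alpha": 0.5}, label=city, ax=ax)
ax.legend()
```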

In [29]:
# Your code-cells here

_Your markdown cells here explaining and motivating your choices_

## Task 6: Reflection

### Question 6.1
In general, when you create a visualization of some data it is because you want to illustrate something about the data. In Tasks 3-5 above you have used the same data (average monthly temperatures in three cities) to produce many different plots. What particular aspect of the data are you conveying in each plot? This question is a bit like Jeopardy: I give you the answers (your plots), you give me the questions (what aspects of the data are the plots showing).

### Answer 6.1

_Your answer here_