# Applied Data Visualization – Homework 3
*https://www.dataviscourse.net/2024-applied/*


In this homework, we will create custom charts of tabular data in Matplotlib and Seaborn.



## Your Info and Submission Instructions

* *First name:*
* *Last name:*
* *Email:*
* *UID:*



For your submission, please do the following things: 
* **rename the file to `hw3_lastname.ipynb`**
* **include all files that you need to run the homework, including the data file provided** 
* **don't use absolute paths, but use a relative path to the same directory for referencing data**

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Useful for this homework
from matplotlib.patches import Rectangle
from matplotlib.lines import Line2D

plt.style.use('default')
# This next line tells jupyter to render the images inline
%matplotlib inline
import matplotlib_inline
# This renders your figures as vector graphics AND gives you an option to download a PDF too
matplotlib_inline.backend_inline.set_matplotlib_formats('svg', 'pdf')

# Part 1: Bubble Grid Chart

For this assignment, we will use a historical data set of medals awarded in the Winter Olympics. Recreate the chart below using Matplotlib with the following requirements:

- Each `Discipline` bubble and label should be colored according to the `Sport` variable. You can pick your own colors, as long as they are discernible.
- Each bubble's size should depend on the number of gold medals awarded. (This can be calculated as the number of unique `Event`-`Gender` pairs in the data set.)
- There should be a label noting that the 1940 and 1944 Olympic Games were not held (due to World War II).
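
To illustrate the counting rule in the second requirement, here is a minimal sketch on made-up rows. The `Year` and `Medal` columns and all values are assumptions for illustration only; check the actual column names in `winter.csv`. Grouping by year and discipline is also an assumption about how the grid is laid out.

```python
import pandas as pd

# Toy rows standing in for winter.csv; every value here is invented.
toy = pd.DataFrame({
    "Year":       [1924, 1924, 1924, 1924, 1928, 1928],
    "Sport":      ["Skiing", "Skiing", "Skating", "Skating", "Skating", "Skating"],
    "Discipline": ["Ski Jumping", "Ski Jumping", "Speed skating",
                   "Speed skating", "Speed skating", "Speed skating"],
    "Event":      ["K90 Individual", "K90 Individual", "500M", "1500M", "500M", "500M"],
    "Gender":     ["Men", "Men", "Men", "Men", "Men", "Women"],
    "Medal":      ["Gold", "Silver", "Gold", "Gold", "Gold", "Gold"],
})

# One gold medal per unique Event-Gender pair: drop repeated pairs within
# each Year/Discipline, then count what remains in each group.
gold_counts = (
    toy.drop_duplicates(["Year", "Discipline", "Event", "Gender"])
       .groupby(["Year", "Sport", "Discipline"])
       .size()
       .reset_index(name="n_gold")
)
print(gold_counts)
```

Keeping `Sport` in the grouping keys makes it available later for choosing each bubble's color.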

![A bubble grid chart of medals for winter olympics](medals.svg)

Hints you may find useful for this assignment:
- matplotlib's `patches.Rectangle` and `.add_patch()` for the label box
- matplotlib's `.get_yticklabels()`, `.get_text()`, and `.set_color()` to color the y-axis labels
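
The calls named in these hints can be tried out on a toy figure first. The sketch below is not the homework chart: the discipline names, colors, coordinates, and the `label_colors` lookup are all hypothetical.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Hypothetical label -> color lookup (in the homework this would follow `Sport`)
label_colors = {"Biathlon": "tab:blue", "Curling": "tab:orange", "Luge": "tab:green"}
names = list(label_colors)

fig, ax = plt.subplots(figsize=(5, 2.5))
ax.scatter([1936, 1948, 1952], [0, 1, 2], s=[40, 80, 120],
           c=[label_colors[n] for n in names])
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)

# Color each y tick label by looking up its own text
for tick_label in ax.get_yticklabels():
    tick_label.set_color(label_colors[tick_label.get_text()])

# Label box: a rectangle in data coordinates (lower-left corner, width, height)
# with a text annotation centered inside it
box = Rectangle((1938, -0.5), 8, 3, facecolor="lightgray", edgecolor="none", zorder=0)
ax.add_patch(box)
ax.text(1942, 1, "No games", ha="center", va="center", rotation=90)
```

`Rectangle` takes the lower-left corner first, so the box above spans 1938 to 1946 on the x-axis.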

In [2]:
# Keep this cell
medals_df = pd.read_csv('./winter.csv')


# your code here

# Part 2: Parallel Coordinates

We're back to the familiar Utah Avalanche Center data set for this assignment. Recreate the parallel coordinates chart below using Matplotlib, given the following requirements:

- Subset the data to avalanches caused by skiers in Salt Lake between 2015 and today.
- Highlight (e.g., with another color) avalanches with casualties (i.e. anyone injured, killed, or buried).
- Add a custom legend explaining the highlighting.
- Each axis should range from the minimum to the maximum value of the corresponding variable in the data.

![A parallel coordinates plot of avalanche data with fatalities highlighted.](pc.svg)

Hints you may find useful:
- Drop rows that have NaN values in the columns you want to plot.
- Matplotlib's `.twinx()` method is useful for creating additional y-axes that share the same x-axis.
- Note that the ranges of variables are very different. For the lines to fit onto the same chart, all variables should be *normalized* to the range of one variable (for example, if you choose the leftmost variable, year, to be the reference, all other variables should be normalized to range between 2015 and 2023).
- Draw the axes first, then normalize the variables to the reference, and only then draw the lines.
- You may find it useful to loop over all observations (rows) and `.plot()` each line individually.
- Note that since we are plotting each line individually, matplotlib will not generate a legend. Refer to [the documentation](https://matplotlib.org/stable/gallery/text_labels_and_annotations/custom_legends.html) for guidance on how to create a custom legend.
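
The order of operations in these hints (axes first, then normalization, then lines, then a hand-built legend) can be sketched on a tiny made-up table. This is one possible arrangement, not the reference solution: the values, the `casualty` flag column, the colors, and the `to_reference` helper are all hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Made-up rows; `casualty` is a hypothetical boolean flag column.
toy = pd.DataFrame({
    "Year": [2016, 2018, 2020, 2023],
    "Elevation_Feet": [8000, 9500, 10200, 9000],
    "Depth_Inches": [12, 36, 24, 60],
    "casualty": [False, True, False, True],
})
cols = ["Year", "Elevation_Feet", "Depth_Inches"]
ref_min, ref_max = toy[cols[0]].min(), toy[cols[0]].max()

fig, ax = plt.subplots(figsize=(6, 3))
ax.set_xlim(0, len(cols) - 1)
ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols)
ax.set_ylim(ref_min, ref_max)

# 1) Axes first: one twin y-axis per remaining variable, each showing that
#    variable's own min-max range, with its spine moved to the variable's slot.
for i, col in enumerate(cols[1:], start=1):
    twin = ax.twinx()
    twin.set_ylim(toy[col].min(), toy[col].max())
    twin.spines["right"].set_position(("axes", i / (len(cols) - 1)))

# 2) Then rescale every variable linearly onto the reference variable's range.
def to_reference(values):
    span = values.max() - values.min()
    return ref_min + (values - values.min()) / span * (ref_max - ref_min)

scaled = toy[cols].apply(to_reference)

# 3) Then draw one line per row on the reference axes.
for idx, row in scaled.iterrows():
    color = "crimson" if toy.loc[idx, "casualty"] else "lightgray"
    ax.plot(range(len(cols)), row.values, color=color)

# 4) Custom legend built from proxy Line2D handles.
handles = [
    Line2D([0], [0], color="crimson", label="Casualties"),
    Line2D([0], [0], color="lightgray", label="No casualties"),
]
ax.legend(handles=handles, loc="upper left")
```

Because the twin axes keep their original limits while the lines are drawn in the reference variable's units, each tick still reads in the units of its own variable.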

In [3]:
# Keep this cell

avy_df = pd.read_csv('./avalanches.csv')

# Clean dates and elevation
avy_df['Date'] = pd.to_datetime(avy_df['Date'])
avy_df['Year'] = avy_df['Date'].dt.year.astype('Int64')
avy_df['Month'] = avy_df['Date'].dt.month.astype('Int64')
avy_df['Elevation_Feet'] = pd.to_numeric(avy_df['Elevation'].str.replace('\'', '').str.replace(',', ''))
avy_df['Width_Feet'] = pd.to_numeric(avy_df['Width'].str.replace('\'', '').str.replace(',', ''))
avy_df['Vertical_Feet'] = pd.to_numeric(avy_df['Vertical'].str.replace('\'', '').str.replace(',', ''))

def CleanInchesFeet(x):

    if x!=x: return x

    number = pd.to_numeric(x[:-1].replace(',', ''))
    unit = x[-1]

    if unit == '\"':
        return number
    elif unit == '\'':
        return number*12
    else:
        return float('NaN')
    
avy_df['Depth_Inches'] = avy_df['Depth'].map(lambda x: CleanInchesFeet(x))

# Filter out null dates and incomplete years
avy_df = avy_df[avy_df['Date']==avy_df['Date']]
avy_df = avy_df[avy_df['Year'] > 2015]

In [4]:
# your code here

# Part 3: Scatterplot Matrix

Use seaborn to show a scatterplot matrix of the data you used in Part 2. Make the dots transparent to see which areas are heavily overplotted.

Hints: 
 * This can be done in one line of code.
 * Showing a histogram on the diagonal (instead of a KDE plot) avoids a bug that occurs when using a KDE plot with this dataset.

In [5]:
import seaborn as sns

# your code here

# Part 4: Analysis and Comparison

* Analyze the data; under which conditions do casualties occur?
* Compare the scatterplot matrix with the parallel coordinates plot. Do you see different patterns in the two visualizations of this dataset? What are the strengths and weaknesses of each plot?
  

# Grading Scheme

Part 1: 40%  
Part 2: 40%  
Part 3: 10%  
Part 4: 10%