# Homework 6 – Visualization

## History of Data Science, Winter 2022

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px
import folium

def write_file(path, html_string):
    # The with-block closes the file automatically, so no explicit close() is needed
    with open(path, 'w') as f:
        f.write(html_string)

**Note:** Homework 6 works a little differently from earlier homework assignments. There are three questions, and each one has you create a single interactive visualization that is historically relevant in some way.

Instead of submitting the assignment as a PDF, you will upload your work to a website you create on your own using GitHub Pages! This will be good practice for setting up project websites and portfolios of your own. 
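
Since the final deliverable is a web page, each visualization eventually has to become an HTML string that `write_file` (defined in the first cell) can save. The sketch below only demonstrates the round trip with a placeholder string and a temporary directory; `placeholder_html` and the file name `example.html` are made up for illustration. In practice the string would come from the plotting library, for instance `fig.to_html()` on a plotly figure or `m.get_root().render()` on a `folium.Map`.

```python
import os
import tempfile

def write_file(path, html_string):
    # Same helper as in the setup cell: save an HTML string to disk.
    with open(path, 'w') as f:
        f.write(html_string)

# Placeholder standing in for the HTML of a real visualization (assumption).
placeholder_html = '<html><body><h1>My visualization</h1></body></html>'

# Write into a temporary directory so that nothing in the homework folder changes.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'example.html')
    write_file(example_path, placeholder_html)
    with open(example_path) as f:
        saved_html = f.read()

print(saved_html == placeholder_html)
```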

## Question 1 – John Snow and Cholera Deaths

In this question, we'll create John Snow's iconic map of cholera deaths in SoHo, London:

<img src='data/snow.jpg' width=400>

The library we will use to create this visualization is called `folium`. It has already been imported for you. Let's look at a quick example of how to create maps in `folium`. For this example, we'll look at data that describes the location of every public university in California.

In [None]:
universities = pd.read_csv('data/universities/enrollment_and_location.csv')
universities.head(5) # head(5) means "show the first 5 rows of the DataFrame"; it does not change the DataFrame

The relevant object from the `folium` library is the `Map` object. Let's instantiate a `Map` object with no arguments:

In [None]:
folium.Map()

This map seems to be pretty zoomed out, and it also happens to be centered at the point on Earth's surface with latitude 0 and longitude 0.

Since we're looking at schools in California, we'd like our map to be centered on California. We can tell `folium` where to center our map, but to do this we need to find the rough geographic center of the points that we'll be plotting. We can do this by finding the average latitude and longitude of all locations in our dataset:

In [None]:
california_center = (universities['Latitude'].mean(), universities['Longitude'].mean())
california_center
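
To make the averaging step concrete, here is a minimal standalone sketch that uses three made-up coordinate pairs instead of the real `universities` DataFrame. The names `points` and `center` are illustrative only. Averaging latitudes and longitudes like this is a rough approximation: it works well for a region the size of California, but not for points spread across the globe.

```python
from statistics import mean

# Hypothetical (latitude, longitude) pairs, not taken from the dataset.
points = [(32.0, -117.0), (34.0, -118.0), (38.0, -122.0)]

# Average the latitudes and the longitudes separately.
center = (mean(lat for lat, _ in points), mean(lon for _, lon in points))
print(center)
```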

Let's create a `folium.Map` object whose `location` is set to `california_center` and whose `zoom_start` is set to 6.

Take a look at what happens if `zoom_start` is set to 7 or 5!

In [None]:
cal_map = folium.Map(location=california_center, zoom_start=6, width=600, height=400) # Also setting width and height to make a smaller map
cal_map

We now need to add markers at each university. To add a marker to a `folium.Map` object, we first need to create a `folium.CircleMarker` object with all of the characteristics that we want (e.g. size, position, label). We then call the `.add_to` method on that `folium.CircleMarker` object, and as an argument pass in our `folium.Map` object.

Here's an example of what that would look like.

In [None]:
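# folium has no `alpha` option; `fill_opacity` controls how transparent the fill is
folium.CircleMarker(location=(32.880060, -117.234014),
                    popup='University of California, San Diego',
                    color='blue',
                    radius=4,
                    fill=True,
                    fill_opacity=0.1).add_to(cal_map);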

In [None]:
cal_map

If you click on the blue circle, you should see "University of California, San Diego".

In order to add circles for every school, we'll have to loop through every row of our DataFrame. Here's how:

In [None]:
# Redefining cal_map, just so that we can get rid of the first CircleMarker that we added above
cal_map = folium.Map(location=california_center, zoom_start=6, width=600, height=400)

for _, row in universities.iterrows():
    folium.CircleMarker(location=(row['Latitude'], row['Longitude']),
                        popup=row['Name'],
                        # UC campuses in blue, all other schools in red
                        color='blue' if 'University of California' in row['Name'] else 'red',
                        radius=row['Enrollment'] // 5000,
                        fill=True,
                        # folium has no `alpha` option; fill_opacity sets the fill transparency
                        fill_opacity=0.4).add_to(cal_map)
    
cal_map

Note that the radius of each university's circle is proportional to the number of students enrolled at that university. Also note that if we want to draw "pins" instead of circles, we can use `folium.Marker` instead of `folium.CircleMarker`. Change `CircleMarker` to `Marker` above and observe what happens.
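
One detail worth knowing about the loop above: `//` is floor division, so the radius moves in whole-number steps, and any school with fewer than 5,000 students would get a radius of 0. The sketch below compares floor division with true division on made-up enrollment figures. The `scale` value of 5000 mirrors the cell above, while the function names are hypothetical.

```python
def floored_radius(enrollment, scale=5000):
    # What the loop above computes: whole-number steps only.
    return enrollment // scale

def proportional_radius(enrollment, scale=5000):
    # True division keeps the radius exactly proportional to enrollment.
    return enrollment / scale

# Made-up enrollment figures for illustration.
for enrollment in [3000, 12000, 40000]:
    print(enrollment, floored_radius(enrollment), proportional_radius(enrollment))
```

If small values matter, as they may when plotting counts that are mostly 1 or 2, true division with a suitable scale avoids circles that vanish.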

### Back to the John Snow example!

<img src='data/snow.jpg' width=400>

Run the two cells below to load in two DataFrames – one containing the latitude, longitude, and number of deaths at each address, and another containing the location of each water pump.

In [None]:
snow = pd.read_csv('data/snow/deaths.csv')
snow.head(5)

In [None]:
pumps = pd.read_csv('data/snow/pumps.csv')
pumps

**Your Job:** Create a map using `folium` that has:
- A circle at each location with a death. The radius of the circle should be **proportional** to the number of deaths at that location; make the circles large enough to be visible, but small enough that they don't overlap too much. You can choose the colors, fills, etc.
- A marker at each location with a pump. When we click on a marker, it should show the name of the pump. You can choose the color of the marker (to set the color of the marker to red, for example, set the keyword argument `icon` in `folium.Marker` to `folium.Icon(color='red')`).

You will have to figure out what reasonable `location`, `zoom_start`, `width` and `height` arguments are when calling `folium.Map` for the first time. We should be able to see most of the circles and markers by default (i.e. without zooming in even further), and the circles should take up most of the space on the map (i.e. don't start too far out).

For example, your map may look like this:

<img src='data/snow-example.png' width=500>

Save your plot to a variable name (e.g. at some point perhaps you'll have `snow_map = folium.Map(...)`).

In [None]:
# YOUR CODE HERE
...

## Question 2 – Galton's Heights

In Lecture 4 (and in DSC 10), we briefly revisited Sir Francis Galton's heights dataset. Run the cell below to load this dataset in as a DataFrame.

In [None]:
galton = pd.read_csv('data/galton/GaltonFamilies.csv')
galton

Previously, when we were only able to look at two variables at a time, we looked at `'childHeight'` vs `'midparentHeight'` (which, recall, is a weighted average of a child's mother's height and father's height).

In [None]:
px.scatter(galton, x='midparentHeight', y='childHeight')
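
As a reminder of what "weighted" means here: the documentation of the `GaltonFamilies` dataset (from the R package HistData) describes `midparentHeight` as `(father + 1.08 * mother) / 2`, so the mother's height is scaled up by 8% before averaging. The sketch below applies that formula to made-up heights. Treat the formula as an assumption if your copy of the CSV was prepared differently.

```python
def midparent_height(father, mother):
    # Formula as documented for GaltonFamilies; the mother's height is scaled by 1.08.
    return (father + 1.08 * mother) / 2

# Made-up heights in inches.
print(midparent_height(70.0, 65.0))
```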

**Your Job:** Use `px.scatter_3d` to create a **3D scatterplot** that visualizes `'father'`'s height on the $x$-axis, `'mother'`'s height on the $y$-axis, and `'childHeight'` on the $z$-axis. Color each point according to the child's `'gender'`. Rename the axes to "Father's Height", "Mother's Height", and "Child's Height", and give the plot an appropriate title. You're free to change other aspects of the plot (colors, fonts, etc) but you are not required to.

Save your plot to a variable name (e.g. at some point perhaps you'll have `galton_fig = px.scatter_3d(...)`).

In [None]:
# YOUR CODE HERE
...

## Question 3 – French Departments

Recall from lecture that the first ever **choropleth** depicted the levels of literacy in different regions of France. These regions are actually called "departments" (in France, "regions" are like states and "departments" are like counties).

<img src='https://upload.wikimedia.org/wikipedia/commons/3/38/Carte_figurative_de_l%27instruction_populaire_de_la_France.jpg' width=300>

In this question, you will draw a choropleth depicting the population of each department in France. This time, however, you will have to access the necessary pieces of data on your own.

### Collecting and Cleaning Data

First, we need to find a **geojson** file that contains the boundaries of each department of France.

**Task 1:** Follow these instructions:
1. Go to [this site](https://france-geojson.gregoiredavid.fr).
2. Click "Départements", then click "Télécharger."
3. Now, you should see a file containing several lists of numbers. Copy the entire file you see, paste it in a text editor on your computer, and save it in the `.json` format.
4. Upload that `.json` file to DataHub, and store it in the `data/france/` subfolder of the `dsc90-2022-wi/homework/hw06` folder.
5. Call the function `read_json` (defined below) on the path to your `.json` file, and store the result in the variable `geojson`. For example, if your file is called `french-departments.json`, you would write `geojson = read_json('data/france/french-departments.json')`.

In [None]:
import json
def read_json(path):
    # Open the file in a with-block so that it is closed after loading
    with open(path, 'r') as f:
        return json.load(f)

# YOUR CODE HERE
geojson = ...

We now have a **dictionary** containing the boundaries of all French departments. For instance, the following cell shows the boundaries of the French department "Aisne."

In [None]:
geojson['features'][0]

We now need to find a dataset containing the population of each French department.

**Task 2:** 
1. Google "french departments by population" or something similar, and open the first Wikipedia article you see.
2. Paste the link to the Wikipedia article at [this site](https://wikitable2csv.ggor.de) and click "Convert".
3. Click "Download". Ensure it downloads as a `.csv`; change its extension to `.csv` if it does not.
4. Upload the `.csv` file to DataHub, also in the `data/france/` folder.
5. Load the dataset as a DataFrame into the variable name `population_raw`, using `pd.read_csv`.

In [None]:
# YOUR CODE HERE
population_raw = ...
population_raw.head(5)

The dataset you collected from Wikipedia unfortunately stores most numerical columns as strings, due to characters like commas and brackets:

In [None]:
population_raw['Legal population in 2013']

Run the cell below. It converts all elements of the `'Legal population in 2013'` column to integers, selects just the relevant columns from `population_raw`, and stores the result as `population`.

In [None]:
# Remove thousands separators, then bracketed footnote markers such as [6] or [], before converting to int
population_raw['2013 Population'] = population_raw['Legal population in 2013'].str.replace(',', '', regex=False).str.replace(r'\[\d*\]', '', regex=True).astype(int)
population = population_raw[['Department', '2013 Population']]
population
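
To see what this kind of cleaning does to individual values, here is a small sketch using only Python's `re` module. The sample strings are made up to resemble the raw column (thousands separators plus a bracketed footnote marker), and `clean_population` is a hypothetical helper, not part of the assignment.

```python
import re

def clean_population(raw):
    # Drop thousands separators, then any bracketed footnote marker such as [6] or [].
    no_commas = raw.replace(',', '')
    no_footnotes = re.sub(r'\[\d*\]', '', no_commas)
    return int(no_footnotes)

# Made-up examples in the style of the raw Wikipedia column.
samples = ['1,234,567', '76,543[6]', '210,000[]']
cleaned = [clean_population(s) for s in samples]
print(cleaned)
```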

**Task 3:** Use `px.choropleth` to draw a choropleth showing the population of each department of France. You will have to set several arguments:
- The first argument should be a DataFrame containing the numerical variable we want to visualize
- The `geojson` argument should be a dictionary containing the boundaries of each department
- The `locations` argument should be the name of the column in the DataFrame that contains the name of each department
- The `featureidkey` should be `'properties.nom'` (scroll to the very bottom of the output for `geojson['features'][0]` to see why)
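
The `locations` and `featureidkey` arguments work as a pair: each value in the `locations` column is matched against the value found at the `featureidkey` path inside each GeoJSON feature. The sketch below mimics that lookup on a tiny made-up FeatureCollection (geometries omitted) to show how unmatched names can be spotted before plotting. All names and values here are illustrative.

```python
# A tiny stand-in for the real GeoJSON dictionary (geometry left out on purpose).
toy_geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'code': '01', 'nom': 'Ain'}},
        {'type': 'Feature', 'properties': {'code': '02', 'nom': 'Aisne'}},
    ],
}

# 'properties.nom' means: look inside 'properties', then read 'nom'.
geojson_names = {feature['properties']['nom'] for feature in toy_geojson['features']}

# Hypothetical department names as they might appear in a population table.
table_names = ['Ain', 'Aisne', 'Paris']

# Names without a matching feature would not be drawn on the choropleth.
unmatched = [name for name in table_names if name not in geojson_names]
print(unmatched)
```

Running the same comparison on the real `geojson` dictionary and department column is a quick way to catch spelling or accent mismatches between the two sources.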

You should also set the `color_continuous_scale` argument to one of the Built-In Sequential Color scales [mentioned here](https://plotly.com/python/builtin-colorscales/) (for example, you could set `color_continuous_scale=px.colors.sequential.tempo`). Try out a few and see which one you like the most!

You should also give your plot a `title`.

Save your plot to a variable name (e.g. at some point perhaps you'll have `france_fig = px.choropleth(...)`). Then, write `france_fig.update_geos(fitbounds="locations", visible=False)` to make sure your choropleth is zoomed in on the relevant region.

In [None]:
# YOUR CODE HERE
...

Try to find the "Paris" department. Is it the most populous department? Is it the most **densely** populated department?

## Submission Instructions

Follow the instructions [here](https://historyofdsc.com/resources/weeks/week06/) to submit your work.