The GitHub repository containing this demonstration notebook and the dashboard notebook can be found at: [https://github.com/austinxjliu/siads521hw3/tree/main#](https://github.com/austinxjliu/siads521hw3/tree/main#)

# Visualization Technique

The dataset I am working with contains cost-of-living data from around the world. The dashboard will contain visualizations that allow comparisons between countries on various data points. To support this, I will use several types of visualization.

Bar chart:
- Ideal for comparisons
- Can compare between countries on data points within a column

Scatter plot:
- Ideal for finding trends
- Each country is plotted as a point, and the user selects which cost-of-living measures go on the x- and y-axes

Geographic scatter map:
- Geographical way of viewing data
- Can be used to view geographical trends, such as finding if countries that share a border have similar costs of living

Pie chart:
- Good for finding compositions of a whole
- Used to determine what portion of the total cost of living is contributed by each category of spending in a given country
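
The composition shown by a pie chart comes down to dividing each category by the row total. The sketch below illustrates that calculation on a small, made-up table; the category columns (`Food`, `Rent`, `Transport`), the country labels, and the helper `spending_shares` are hypothetical, and how the real item columns would be grouped into categories is an assumption left open here.

```python
import pandas as pd

# Hypothetical per-country averages; category names and values are invented.
costs = pd.DataFrame(
    {
        "country": ["A", "B"],
        "Food": [300.0, 150.0],
        "Rent": [600.0, 250.0],
        "Transport": [100.0, 100.0],
    }
).set_index("country")


def spending_shares(row: pd.Series) -> pd.Series:
    """Return each category's fraction of the row total, ignoring missing values."""
    present = row.dropna()
    return present / present.sum()


# Shares for one country; these fractions are what a pie chart would display.
shares_a = spending_shares(costs.loc["A"])
print(shares_a.round(2))
```

Dropping missing categories before dividing keeps the remaining slices summing to 1, which is one reasonable choice; another would be to skip countries with incomplete data.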

These visualizations complement each other by supporting different kinds of comparison: between countries, between categories of spending, and across geography.

The visualization library needs to support all of these chart types and also let the user dynamically choose which fields to display; otherwise each visualization would be far too crowded.

# Visualization Library

I settled on Dash and Plotly. Both are open-source libraries from Plotly (the company, which also sells an enterprise offering), and both are easy to install via pip.

I chose these libraries because Plotly offers easy yet powerful plotting, and Dash adds a fairly simple web-based interface for interactivity. Under the hood, Dash runs on a Flask web server and renders its components in the browser as HTML/CSS.

Dash and Plotly are declarative libraries. As a result, the styles of web elements like buttons and drop-down menus are already set; however, a fair amount of customization is possible for the plots themselves and the page layout.

Dash/Plotly integrates with Jupyter in multiple ways. I will use the tab integration to open the page in its own locally hosted tab. This Jupyter integration, together with the declarative nature of the libraries, is why I chose this framework over the other options available.

# Demonstration

The dataset I am using was obtained from [Kaggle](https://www.kaggle.com/datasets/mvieira101/global-cost-of-living/data?select=cost-of-living_v2.csv). It contains cost-of-living data for cities across the world. There is not much cleanup to do, since the dataset itself is quite clean. I dropped every city where `data_quality` was 0, keeping the majority of cities, where the data quality was considered good. Some fields still contain `nan` values, so care must be taken when plotting them, but for the most part the dataset is good to go.
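
The two cleanup steps described above (filtering on `data_quality` and avoiding `nan` values when plotting) can be sketched on a toy table. The city labels and values below are invented for illustration, and dropping missing values per plotted column rather than globally is an assumption about one reasonable way to handle them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cost-of-living_v2.csv; all values are invented.
toy = pd.DataFrame(
    {
        "city": ["P", "Q", "R", "S"],
        "country": ["A", "A", "B", "B"],
        "x1": [10.0, 12.0, np.nan, 8.0],
        "data_quality": [1, 0, 1, 1],
    }
)

# Keep rows flagged as good quality, then drop the flag column.
good = toy[toy["data_quality"] != 0].drop(columns="data_quality")

# Exclude missing values only for the column about to be plotted,
# so other columns keep as many rows as possible.
plottable = good.dropna(subset=["x1"])

print(len(toy), len(good), len(plottable))
```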

One quirk of this dataset is that the useful columns are named `x1` through `x55`, with the human-readable names listed in a table on Kaggle. I copied that table into a CSV file, `headers.csv`, which I read in as a dataframe and then converted to a NumPy array. I then replaced the column names with the descriptive versions. The list of column names is used later to select the data columns for aggregation.

In [1]:
import pandas as pd

# reading in data and column descriptions
dataset = pd.read_csv("cost-of-living_v2.csv")
# keep only cities with good data quality, then drop the flag column
dataset = dataset[dataset["data_quality"] != 0].drop(columns="data_quality")

desc_df = pd.read_csv("headers.csv", index_col=0, header=None).T
desc_df.drop(["city", "country", "data_quality"], axis=1, inplace=True)

col_list = desc_df.iloc[0].to_numpy()
print(col_list)

# renaming columns x1..x55 to their descriptive names
x_list = [f"x{i}" for i in range(1, 56)]
col_mapping = dict(zip(x_list, col_list))
dataset = dataset.rename(columns=col_mapping)
dataset.head()

# desc = desc_df.to_dict("records")[0]
# print(desc)


['Meal, Inexpensive Restaurant (USD)'
 'Meal for 2 People, Mid-range Restaurant, Three-course (USD)'
 'McMeal at McDonalds (or Equivalent Combo Meal) (USD)'
 'Domestic Beer (0.5 liter draught, in restaurants) (USD)'
 'Imported Beer (0.33 liter bottle, in restaurants) (USD)'
 'Cappuccino (regular, in restaurants) (USD)'
 'Coke/Pepsi (0.33 liter bottle, in restaurants) (USD)'
 'Water (0.33 liter bottle, in restaurants) (USD)'
 'Milk (regular), (1 liter) (USD)'
 'Loaf of Fresh White Bread (500g) (USD)' 'Rice (white), (1kg) (USD)'
 'Eggs (regular) (12) (USD)' 'Local Cheese (1kg) (USD)'
 'Chicken Fillets (1kg) (USD)'
 'Beef Round (1kg) (or Equivalent Back Leg Red Meat) (USD)'
 'Apples (1kg) (USD)' 'Banana (1kg) (USD)' 'Oranges (1kg) (USD)'
 'Tomato (1kg) (USD)' 'Potato (1kg) (USD)' 'Onion (1kg) (USD)'
 'Lettuce (1 head) (USD)' 'Water (1.5 liter bottle, at the market) (USD)'
 'Bottle of Wine (Mid-Range, at the market) (USD)'
 'Domestic Beer (0.5 liter bottle, at the market) (USD)'
 'Imported

Unnamed: 0,city,country,"Meal, Inexpensive Restaurant (USD)","Meal for 2 People, Mid-range Restaurant, Three-course (USD)",McMeal at McDonalds (or Equivalent Combo Meal) (USD),"Domestic Beer (0.5 liter draught, in restaurants) (USD)","Imported Beer (0.33 liter bottle, in restaurants) (USD)","Cappuccino (regular, in restaurants) (USD)","Coke/Pepsi (0.33 liter bottle, in restaurants) (USD)","Water (0.33 liter bottle, in restaurants) (USD)",...,1 Pair of Nike Running Shoes (Mid-Range) (USD),1 Pair of Men Leather Business Shoes (USD),Apartment (1 bedroom) in City Centre (USD),Apartment (1 bedroom) Outside of Centre (USD),Apartment (3 bedrooms) in City Centre (USD),Apartment (3 bedrooms) Outside of Centre (USD),Price per Square Meter to Buy Apartment in City Centre (USD),Price per Square Meter to Buy Apartment Outside of Centre (USD),Average Monthly Net Salary (After Tax) (USD),"Mortgage Interest Rate in Percentages (%), Yearly, for 20 Years Fixed-Rate"
0,Seoul,South Korea,7.68,53.78,6.15,3.07,4.99,3.93,1.48,0.79,...,70.81,110.36,742.54,557.52,2669.12,1731.08,22067.7,10971.9,2689.62,3.47
1,Shanghai,China,5.69,39.86,5.69,1.14,4.27,3.98,0.53,0.33,...,88.21,123.51,1091.93,569.88,2952.7,1561.59,17746.11,9416.35,1419.87,5.03
2,Guangzhou,China,4.13,28.47,4.98,0.85,1.71,3.54,0.44,0.33,...,66.73,43.89,533.28,317.45,1242.24,688.05,12892.82,5427.45,1211.68,5.19
3,Mumbai,India,3.68,18.42,3.68,2.46,4.3,2.48,0.48,0.19,...,49.87,41.17,522.4,294.05,1411.12,699.8,6092.45,2777.51,640.81,7.96
4,Delhi,India,4.91,22.11,4.3,1.84,3.68,1.77,0.49,0.19,...,49.99,36.5,229.84,135.31,601.02,329.15,2506.73,1036.74,586.46,8.06


The first demonstration is a geographical scatter plot. With its default location mode, `px.scatter_geo` expects three-letter ISO 3166-1 alpha-3 country codes in the `locations` field. I did a left merge of `geo` (the gapminder sample dataset bundled with Plotly) onto `dataset` to append the country codes and continents for graphing.

The advantage of Plotly's declarative nature is apparent: the resulting figure is interactive out of the box.

In [2]:
import plotly.express as px

# reading in a gapminder dataset that contains information for countries necessary for plotting on map
geo = px.data.gapminder().query("year==2007")
merged = pd.merge(dataset, geo, on=["country"], how="left")


mean_df = merged.groupby(["country", "iso_alpha", "continent"])[col_list].mean().reset_index()
mean_not_nan = mean_df[mean_df[col_list[0]].notnull()].reset_index(drop=True)
mean_not_nan["hover"] = mean_not_nan["country"].astype(str)

fig = px.scatter_geo(mean_not_nan, locations="iso_alpha", color="continent",
                     hover_name="hover", size=col_list[0],
                     projection="natural earth")
fig.show()
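
One caveat of a left merge on country names is that any country spelled differently in the two tables gets no ISO code, and `groupby` then drops those rows because it ignores `NaN` keys by default. The sketch below shows one way to list such countries using the `indicator` option of `merge`; the lookup table is a toy stand-in for the gapminder data, and which names actually fail to match in the real data is not claimed here.

```python
import pandas as pd

# Small illustrative cost table (first column of three cities).
costs = pd.DataFrame(
    {"country": ["South Korea", "China", "India"], "x1": [7.68, 5.69, 3.68]}
)

# Toy lookup standing in for gapminder; it deliberately lacks one country.
lookup = pd.DataFrame({"country": ["China", "India"], "iso_alpha": ["CHN", "IND"]})

# indicator=True adds a "_merge" column marking rows found only on the left.
checked = costs.merge(lookup, on="country", how="left", indicator=True)
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "country"].unique())

print(unmatched)
```

Any names reported this way could be renamed in one of the tables before merging so that those countries appear on the map.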

Next is a very simple demonstration of Dash. We will use the same data to make the same graph, but this time inside a basic inline Dash app, adding a title and a dropdown menu for selecting the column to examine. The app layout is built from Dash components that mimic (or, more accurately, abstract) HTML, which makes it simple to style and adjust the layout of the app as a whole. In this case, I have made the background color of the top-level `html.Div` white so that the black text contrasts better, which keeps the app readable for dark-mode Jupyter users.

Dash uses callbacks to define how components respond to user input. In this example I created a callback whose output is the map figure (the `dcc.Graph` with id `choropleth`, although the plot itself is a geographic scatter map) and whose input is the dropdown value. This means that any time the dropdown value changes, the figure is updated to reflect the change. Inside the function, the selected column name is used to filter the dataframe and plot the corresponding data. This provides simple but powerful functionality for user interaction.

In [None]:
from dash import Dash, html, dcc, Output, Input

app = Dash()

app.layout = html.Div(
    [
        html.H2(children='Cost of Living Dash Demo', style={'textAlign': 'center'}),
        html.Div(
            [
                html.Label("Select Column:"),
                # clearable=False so the callback never receives None as the column name
                dcc.Dropdown(col_list, col_list[0], id='col-dropdown',
                             clearable=False, style={'width': '100%'}),
                dcc.Graph(id='choropleth', style={'width': '100%', 'display': 'inline-block'}),
            ],
            style={"padding": "10px"},
        ),
    ],
    style={"backgroundColor": "white"},
)

@app.callback(
    Output('choropleth', 'figure'),
    Input('col-dropdown', 'value'))
def update_map(col_name):
    geo = px.data.gapminder().query("year==2007")
    merged = pd.merge(dataset, geo, on=["country"], how="left")

    mean_df = merged.groupby(["country", "iso_alpha", "continent"])[col_list].mean().reset_index()
    mean_not_nan = mean_df[mean_df[col_name].notnull()].reset_index(drop=True)
    mean_not_nan["hover"] = mean_not_nan["country"].astype(str)

    fig = px.scatter_geo(mean_not_nan, locations="iso_alpha", color="continent",
                        hover_name="hover", size=col_name,
                        projection="natural earth")
    return fig

# using non-default port, otherwise it will interfere with the dashboard
app.run(debug=False, port=8080, jupyter_height=620, jupyter_width="40%")