In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from IPython.display import display, Latex, Markdown
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
%matplotlib inline

import geopandas
import pycountry
import geopy

import re

# Lab 5: Geospatial Visualizations

In this lab, you will generate a 3D map visualizing data from [this paper](https://gabriel-zucman.eu/who-owns-offshore-real-estate/). The paper looks at the ownership of offshore real estate in Dubai (where Rohan grew up).  We would like to thank Professor Zucman for making his data freely available and accessible. Professor Zucman is one of the foremost experts in economic inequality; take [Econ 133](https://gabriel-zucman.eu/econ133/) to learn about it from him!

In order to generate the map, we will first import a cleaned version of the dataset from the paper. Then, we will do some essential data cleaning steps so the data can be interpreted by plotting packages. Then, we will generate a sample plot. Finally, we will use widgets to easily toggle between multiple plots.

### Learning Objectives:
- Revisit some data cleaning techniques
- Generate geospatial visualizations in 2D and 3D

First, let's load in the dataset.

In [None]:
data = pd.read_csv('APZO2022Data-cleaned.csv')
data.head()

---
## Part 1: Cleaning Data

In this part, you will follow the steps for cleaning the data as described in each individual subpart. If you do not follow the steps exactly, the plots will not be generated in subsequent parts.

**Question 1.1:** Set the `Country` column as the table index and delete the first row (the one for the entire World) from the data.

In [None]:
data = ... # set index
data = ... # delete the first row
data.head()

In [None]:
grader.check("q1_1")

Now, let's take a look at the following columns:

In [None]:
data[['Total Property Value / GDP', 'Female share', 'Share of total values owned by top 10% owners', 
            'Share of total values owned by top 10% persons','Share of total values owned by top 10% firms',
           'Share of total values owned by top 1% owners', 'Share of total values owned by top 1% persons',
           'Share of total values owned by top 1% firms']]

As you can see, the data is written in percents. Since the software can only plot numbers, the percentages will need to be converted to a number out of 100. A similar problem can also be seen below:

In [None]:
data[['Total Property Values', 'Mean Property value', 'Median Property Value']]

In this case, the letters 'USD', commas, and spaces will need to be removed from the above columns so the data can be read as numbers.

**Question 1.2:** Fix the issues described above by converting the given columns to numbers. Once you have converted the columns to numbers, change the datatype of all the columns to be `float64`.

For example, we want to convert "7.81%" to "7.81", "USD 1,410,029,024" to "1410029024". 

*Hint 1:* Consider using string methods like we did in project 1.   
*Hint 2:* You can get part of a string by slicing a string like we did in Lab 5. We can do this on a column in a dataframe using the string method. This [tutorial](https://note.nkmk.me/en/python-pandas-str-slice/) may be helpful. 
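
As a rough illustration of both hints, the sketch below cleans a small made-up Series rather than the lab's data. The names `toy_money` and `toy_percent` are hypothetical, and this is only one possible approach, not necessarily the intended solution.

```python
import pandas as pd

# Hypothetical toy values that mimic the formats seen in the dataset.
toy_money = pd.Series(["USD 1,410,029,024", "USD 2,500"])
toy_percent = pd.Series(["7.81%", "12.5%"])

# Drop the 'USD ' prefix by slicing, then remove commas with a regex.
money_clean = toy_money.str[4:].str.replace(r",", "", regex=True)

# Drop the trailing '%' by slicing off the last character.
percent_clean = toy_percent.str[:-1]

# Cast the cleaned strings to float64.
money_float = money_clean.astype("float64")
percent_float = percent_clean.astype("float64")

print(money_float.tolist())
print(percent_float.tolist())
```

Slicing assumes every value has exactly the same prefix or suffix; if the formats vary, a regex that strips all non-numeric characters may be more robust.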

In [None]:
for col in ['Total Property Values', 'Mean Property value', 'Median Property Value']:
    data[col] = ... # get rid of 'USD' 
    data[col] = ... # get rid of commas using regex

for col in ['Total Property Value / GDP', 'Female share', 'Share of total values owned by top 10% owners', 
            'Share of total values owned by top 10% persons','Share of total values owned by top 10% firms',
           'Share of total values owned by top 1% owners', 'Share of total values owned by top 1% persons',
           'Share of total values owned by top 1% firms']:
    data[col] = ... # get rid of '%'

# convert all the columns to float64
for col in data.columns:
    data[col] = ...
data

In [None]:
grader.check("q1_2")

Now that all of our data is stored as floats, we must deal with ambiguity in country names. For example, United States, United States of America and USA all refer to the same country. It's hard for a package to keep track of all the different names for a country, so instead packages like to refer to the standardized, 3-letter [country codes](https://www.iban.com/country-codes). The following function takes in a country name and attempts to find the 3-letter country code associated with the country.

In [None]:
import pycountry
pycountry.countries.get(name='Albania').alpha_3
                                    # .alpha_3 refers to the 3-letter country code 
                                    # .alpha_2 refers to the 2-letter country code

**Question 1.3:** Use the provided function to try to find the associated country code for all the countries in your data. Write a function `get_alpha3code` that gets the 3-letter country code given the country name, and then apply this function to the index of our dataframe. There will be cases where the function fails because it cannot find the associated country code; consider using a try-except block to deal with these cases.
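
The try-except pattern can be sketched on a toy lookup table. Everything here is hypothetical: `TOY_CODES` and `toy_alpha3` stand in for the real lookup and are not part of `pycountry`, and the exception a real lookup raises may differ from the `KeyError` used below.

```python
import pandas as pd

# Hypothetical stand-in for a real lookup table; not pycountry's data.
TOY_CODES = {"Albania": "ALB", "France": "FRA"}

def toy_alpha3(country_name):
    """Return a 3-letter code, or the string 'None' if the lookup fails."""
    try:
        code = TOY_CODES[country_name]  # raises KeyError for unknown names
    except KeyError:
        code = "None"
    return code

# Apply the function to every label of an index, as one would for a dataframe index.
toy_index = pd.Index(["Albania", "USA", "France"], name="Country")
toy_codes = toy_index.map(toy_alpha3)
print(list(toy_codes))
```

Returning the string `'None'` rather than Python's `None` keeps the failures easy to filter with a simple equality comparison.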

In [None]:
def get_alpha3code(country_name):
    try:
        code = ...
    except Exception: # if it cannot find the associated country code
        code = 'None'
    return code
data['Code'] = ...
data.head()

In [None]:
grader.check("q1_3")

Let us quickly see the cases where the function fails.

In [None]:
data[data["Code"] == "None"]

**Question 1.4:** We can see that the function fails in a small portion of cases. We have provided a list of all the cases where the function fails; you have to correct these cases manually by referencing the [website](https://www.iban.com/country-codes). This might seem tedious, but that is the point - data cleaning must be done with careful attention to detail.

In [None]:
data.loc['Antigua & Barbuda','Code'] = ...
data.loc['Brunei','Code'] = ...
data.loc['Congo, Republic Of','Code'] = ...
data.loc['Czech Republic','Code'] = ...
data.loc['Iran','Code'] = ...
data.loc['Ivory Coast','Code'] = ...
data.loc['Kyrgistan','Code'] = ...
data.loc['Macedonia','Code'] = ...
data.loc['Moldova','Code'] = ...
data.loc['North Korea','Code'] = ...
data.loc['Palestine','Code'] = ...
data.loc['Russia','Code'] = ...
data.loc['Saint Vincent & The Grenadines','Code'] = ...
data.loc['South Korea','Code'] = ...
data.loc['Southern Sudan','Code'] = ...
data.loc['Syria','Code'] = ...
data.loc['Taiwan','Code'] = ...
data.loc['Tanzania','Code'] = ...
data.loc['Trinidad & Tobago','Code'] = ...
data.loc['USA','Code'] = ...
data.loc['Venezuela','Code'] = ...
data.loc['Vietnam','Code'] = ...
data.loc['Bolivia','Code'] = ...
data.loc['Democratic Rep, Of Congo','Code'] = ...
data.loc['Turkey','Code'] = ...
data.loc['Comoros Islands','Code'] = ...
data.loc['British Virgin Islands','Code'] = ...

In [None]:
grader.check("q1_4")

Let us now look at the cases where the `Code` column is still 'None'.

In [None]:
data[data["Code"] == "None"]

We can see none of these countries/organizations have a 3-letter country code associated with them, so we can drop these rows.

In [None]:
data = data[data["Code"] != "None"]
data.head()

## Part 2: Generating a Sample Map

In this part, you will use your data to generate a sample plot visualizing the `Total Property Value / GDP` column.

When we hover over a country in the generated plot, we would like to be able to see its name, the total property value owned, and how it compares to the amount of property owned by other countries. In order to do this, we must first rank all the countries by total property value owned.

**Question 2.1:** First, sort all the values in the table by `Total Property Value / GDP` in ascending order (this sorting is important for when we generate the colors in the plot later). Then, rank all the countries by `Total Property Value / GDP`, in descending order. Store all the ranks in a column in the data named 'Rank'. 

*Hint:* [`pandas.Series.rank`](https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html) may be useful.
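
To show how sorting and ranking can run in opposite directions, here is a minimal sketch on a made-up three-row table. The column name `Ratio` and the labels are hypothetical, not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Ratio": [0.5, 3.0, 1.2]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)

# Sort rows from smallest to largest value.
toy = toy.sort_values("Ratio", ascending=True)

# Rank so that the largest value gets rank 1.
toy["Rank"] = toy["Ratio"].rank(ascending=False)
print(toy)
```

The rows end up ordered A, C, B, while B (the largest value) holds rank 1.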

In [None]:
...
data

In [None]:
grader.check("q2_1")

Now, we must think about what we want the colors in the plot to look like. For the sake of simplicity, let's say we want to bin the colors. So, we will need to group the countries into bins depending on their value of `Total Property Value / GDP`, and then assign a color to each bin. Let's take a look at the values in the column.


In [None]:
data['Total Property Value / GDP'].describe()

As we can see, there are some clear outliers in the data. If we were to only consider the minimum and the maximum of the data, we could assign bins like ${[0,30),  [30,60),  [60,90),  [90,120),  [120,150)}$. This would leave most countries in the bottommost bin and would not provide an accurate color representation of the data. Ultimately, the bins you choose are a personal choice, but it is important to consider how those bins affect the final plot. We have provided sample bins for this part, but please feel free to mess around with these bins if you like.
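
The effect of the bin choice can be seen on a small made-up Series. This is a sketch with hypothetical values, not the lab's data; note that `pd.cut` builds right-closed intervals such as `(0, 30]` by default, which differs slightly from the left-closed notation above.

```python
import pandas as pd

# Hypothetical skewed values: most are tiny, one is a large outlier.
toy_ratio = pd.Series([0.01, 0.03, 0.08, 0.4, 140.0])

# Equal-width bins spanning the full range.
equal_width = pd.cut(toy_ratio, bins=[0, 30, 60, 90, 120, 150])

# Hand-picked bins that are narrow where the data is dense.
custom = pd.cut(toy_ratio, bins=[0, 0.015, 0.05, 0.1, 0.2, 1, 140])

print(equal_width.value_counts().sort_index())
print(custom.value_counts().sort_index())
```

With equal-width bins, four of the five values share a single bin; with the hand-picked bins, each value lands in its own bin and would receive its own color.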

**Question 2.2:** Fill in the provided code below to generate your sample plot for Total Property Value / GDP!

In [None]:
fig = px.choropleth(data, # This is the name of your dataset
                    locations= ..., # Which column are the country codes stored in?
                    color=pd.cut(data['Total Property Value / GDP'], 
                                bins=[0, 0.015,0.05,0.1,0.2,1,140]).astype(str).fillna('No Data'),
                                #These are our sample bins, feel free to mess around with them
                    hover_name = ..., # Which column are the country names stored in?
                    hover_data={"Total Property Value / GDP":":.1f", "Rank":":"},
                    # Change the above line so we can see the ratio of property value to GDP to 2 decimal places
                    color_discrete_sequence=px.colors.sequential.BuPu,
                    #Feel free to mess around with colors if you're interested
                    title = ..., # Write an appropriate title
                    height = 900
                   )
fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='mercator'
    ),
    margin=dict(l=50, r=50, t=50, b=50),
)
fig.show()

In [None]:
grader.check("q2_2")

**Question 2.3:** Using the code above, generate a similar plot for `Total Property Values`.  Make sure the color bins and title are appropriate. However, make this plot 3D.

*Hint:* which line of code above references a 2D projection of the Earth? Here's a [list of supported projections](https://plotly.com/python/map-configuration/#map-projections). 

In [None]:
data = ... # rank the data
fig = px.choropleth(data, #This is the name of your dataset
                    locations= ..., # Which column are the country codes stored in?
                    color=pd.cut(data['Total Property Values'], 
                               bins=[0, 0.001*1000000000,0.01*1000000000,0.1*1000000000,0.5*1000000000,3*1000000000,300*1000000000]).astype(str).fillna('No Data'),
                    #These are our sample bins, feel free to mess around with them
                    hover_name = ..., # Which column are the country names stored in?
                    hover_data={"Total Property Values":":.1f", "Rank":":"},
                    # Change the above line so we can see the total property value to 2 decimal places
                    color_discrete_sequence=px.colors.sequential.BuPu,
                    #Feel free to mess around with colors if you're interested
                    title = ..., # Write an appropriate title
                    height = 900
                   )
fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type=...
    ),
    margin=dict(l=50, r=50, t=50, b=50),
)
fig.show()

In [None]:
grader.check("q2_3")

---
## Part 3: Using Widgets

Congratulations on making the first map! In this part, we will generate a map that can easily toggle between different columns to visualize different data. In order to do this, we must first introduce [widgets](https://ipywidgets.readthedocs.io). Widgets are interactive browser controls that allow you to choose between different values. An example is included below.

In [None]:
from ipywidgets import Dropdown
Dropdown(
    options=['1', '2', '3'],
    value='2',
    description='Number:',
    disabled=False,
)

The `interact` function in the widgets module takes in a function and a list of values for its parameters, and determines the appropriate widget to let you visualize the function. Two examples are included below.

In [None]:
def say_my_name(name):
    """
    Print the current widget value in a short sentence
    """
    print(f'My name is {name}')
     
interact(say_my_name, name=["James", "Bond", "James Bond"]);

In [None]:
def f(x):
    return x + 1
lst = [1,2,3]
interact(f, x=lst);

We will use the `interact` function to generate a widget that lets us choose between the different columns and visualize them easily. In order to do this, we must first write a function that generates a 3D plot for any column name. Thankfully, this isn't too hard. If you remember, other than changing the projection, making the plot for Question 2.3 wasn't too bad once your code for Question 2.2 was working.

The main thing we must consider for different columns is how to automatically determine the different color bins. After all, we won't be able to make judgement calls for every single bin. One way to automate this process would be to look at the data quintiles - assign the bottom 20% of the data to one bin, the next 20% to another bin, and so on. 
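
A minimal sketch of quintile-based binning on made-up values follows. The names `toy_values` and `edges` are hypothetical. One detail worth keeping in mind: `pd.cut` intervals are open on the left by default, so the minimum value falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

toy_values = pd.Series(range(1, 11), dtype="float64")

# Edges at the 0th, 20th, 40th, 60th, 80th and 100th percentiles.
edges = toy_values.quantile([0, 0.2, 0.4, 0.6, 0.8, 1]).tolist()

# Default intervals are open on the left, so the minimum is left unbinned (NaN).
without_lowest = pd.cut(toy_values, bins=edges)

# include_lowest=True closes the first interval on the left.
with_lowest = pd.cut(toy_values, bins=edges, include_lowest=True)

print(edges)
print(with_lowest.value_counts().sort_index())
```

With `include_lowest=True`, each of the five bins holds two of the ten values, which is the even split the quintile approach aims for.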

**Question 3.1:** Computationally determine the *quintiles* of `data['Total Property Values']` and return the information as a list. The list must start at the minimum value of `data['Total Property Values']` and end at the maximum value.

Hint: The solution does not need to be longer than one line. This [method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html) may be helpful. 

In [None]:
quintiles = ...
quintiles

In [None]:
grader.check("q3_1")

<!-- BEGIN QUESTION -->

**Question 3.2:** Write a function that takes in a column name and generates a 3D plot visualizing that column data. Name the function `plot_generator`. Feel free to assign the column name as the plot title.

In [None]:
def plot_generator(col):
    data_new = ... # generate ranking
    fig = px.choropleth(data_new,
                        locations = ...,
                        color=pd.cut(data_new[col], 
                                    bins=data_new[col]...(insert quintiles here)...astype(str).fillna('No Data'),
                        hover_name = ...,
                        hover_data={col:":.2f", "Rank":":"},
                        color_discrete_sequence=px.colors.sequential.BuPu,
                        title = ...,
                        height = 900
                       )
    fig.update_layout(
        title_text=col,
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type= ...
        ),
        margin=dict(l=50, r=50, t=50, b=50),
    )
    fig.show()

<!-- END QUESTION -->

Let's make sure the function works:

In [None]:
plot_generator("Total Property Values")

Now, view the beautiful visualizations with `interact`!

In [None]:
interact(plot_generator, col=list(data.columns));  # dropdown of column names

Let's throw in another toggle as a bonus!

In [None]:
display(widgets.interactive(plot_generator, col=widgets.ToggleButtons(options=[
    "Total Property Values", "Total Property Value / GDP", "Mean Property value", "Median Property Value"])));


**Question 3.3:** Using the widget that you generated above, name one country that has both a high total property value invested in Dubai and a high total property value / GDP. What does that potentially imply about income inequality in that country? This is an open-ended question.

We are testing autograding of written responses for this question only, so please make sure to enter your answer in the **string variable provided**.

In [None]:
q3_3 = ...
q3_3

**Congratulations!!** You are done with the lab. Hopefully you enjoyed producing these geospatial visualizations!

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)