# Visualizing world health data

## Loading data 

In [10]:
import pandas as pd
import altair as alt

url = "https://raw.githubusercontent.com/UofTCoders/workshops-dc-py/master/data/processed/world-data-gapminder.csv"
gm = pd.read_csv(url, parse_dates=["year"])  # Dataframe for Gapminder data


# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')

DataTransformerRegistry.enable('data_server')

In [3]:
# Filtering data to a specific year, 1962
gm_1962 = gm.query("year == 1962")

# Explore the dataset
gm_1962.head()

Unnamed: 0,country,year,population,region,sub_region,income_group,life_expectancy,income,children_per_woman,child_mortality,pop_density,co2_per_capita,years_in_school_men,years_in_school_women
162,Afghanistan,1962-01-01,9350000,Asia,Southern Asia,Low,40.1,1200,7.45,352.0,14.3,0.0738,,
381,Albania,1962-01-01,1740000,Europe,Southern Europe,Upper middle,64.6,2910,6.28,173.0,63.4,1.42,,
600,Algeria,1962-01-01,11700000,Africa,Northern Africa,Upper middle,53.2,4520,7.61,245.0,4.91,0.485,,
819,Angola,1962-01-01,5870000,Africa,Sub-Saharan Africa,Lower middle,43.6,4130,7.56,299.0,4.71,0.201,,
1038,Antigua and Barbuda,1962-01-01,57100,Americas,Latin America and the Caribbean,High,63.8,4640,4.34,89.7,130.0,1.8,,


## 1. Create a bubble chart of life expectancy vs. children per woman, colored by region and sized by population

In [7]:
scatter_familysize_lifeexp = (
    alt.Chart(gm_1962, title="Life expectancy vs. Children per woman by region and population")
    .mark_circle()
    .encode(
        x=alt.X("children_per_woman", title="Children per woman (avg)"),
        y=alt.Y("life_expectancy", title="Life expectancy (years)"),
        color="region",
        size="population",
    )
)

# Show the plot at the end
scatter_familysize_lifeexp

## 2. Create a line plot showing how the ratio of women’s to men’s years in school has changed over time. Group the data by income group and plot the mean for each group.

Filtering steps:  
- Compute a new column in the dataframe (named `women_men_school_ratio`) that represents the ratio between the number of years in school for women and men (calculate it so that the value 1 means as many years for both, and 0.5 means half as many for women compared to men).
- Filter the dataframe to contain only values from 1970 to 2015, since those are the years for which the education data has been recorded. Again, you can either create a new variable or perform the filtering as you pass the data to the plotting function.
- Create a line plot showing how the ratio of women’s to men’s years in school has changed over time.
- Group the data by income group and plot the mean for each group.
- Use layering to add a square mark for every data point in your line plot (so one per yearly mean in each group).
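
The first two steps can be checked on a small hand-made table before running them on the full data. The sketch below is illustrative only: the `toy` frame and all of its values are invented, and the `groupby` mean is a pandas stand-in for what the `mean(...)` aggregate in the chart encoding is expected to compute per year and income group.

```python
import pandas as pd

# Invented values, for illustration only (not Gapminder data)
toy = pd.DataFrame(
    {
        "year": pd.to_datetime(["1969", "1970", "1970", "2015", "2016"], format="%Y"),
        "income_group": ["Low", "Low", "Low", "High", "High"],
        "years_in_school_women": [1.0, 2.0, 3.0, 12.0, 13.0],
        "years_in_school_men": [2.0, 4.0, 4.0, 12.0, 13.0],
    }
)

# 1 means equal schooling; 0.5 means women have half as many years as men
toy["women_men_school_ratio"] = toy["years_in_school_women"] / toy["years_in_school_men"]

# Keep 1970 through 2015 inclusive
toy_subset = toy[toy["year"].dt.year.between(1970, 2015)]

# One mean per year and income group: each point on the line plot is one of these
yearly_means = (
    toy_subset.groupby(["year", "income_group"])["women_men_school_ratio"]
    .mean()
    .reset_index()
)
print(yearly_means)
```

The 1969 and 2016 rows fall outside the range and are dropped, and the two 1970 rows for the `Low` group collapse into a single mean.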

In [13]:
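# Ratio of women's to men's years in school (1 = equal, 0.5 = half as many for women)
gm["women_men_school_ratio"] = gm["years_in_school_women"] / gm["years_in_school_men"]
# Keep 1970 through 2015, the years with recorded education data
gm_subset = gm[(gm["year"] >= "1970") & (gm["year"] <= "2015")]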


line_education_ratio = (
    alt.Chart(gm_subset, title="Change in the ratio of women's to men's years in school by income group")
    .mark_line()
    .encode(
        x=alt.X("year", title="Year"),
        y=alt.Y("mean(women_men_school_ratio)", title="Mean women to men school ratio"),
        color="income_group",
    )
)
line_education_ratio  # This will be the first layer (line plot)

line_points_education_ratio = (
    line_education_ratio + line_education_ratio.mark_square()
)  # This will include the second layer (line plot + square marks)

# Show the plot at the end
line_points_education_ratio

#### Adding a confidence interval to the above plot

In [14]:
band = (
    alt.Chart(gm_subset, title="Change in the ratio of women's to men's years in school by income group")
    .mark_errorband(extent="ci")
    .encode(
        x=alt.X("year", title="Year"),
        # Pass the raw field: the error band computes the mean and interval itself
        y=alt.Y("women_men_school_ratio", title="Mean women to men school ratio"),
        color="income_group",
    )
)

ci_bands_education_ratio = line_points_education_ratio + band

# Show the plot at the end
ci_bands_education_ratio
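
To get a feel for what the band represents: to my understanding, the `ci` extent corresponds to a bootstrapped 95% confidence interval of the mean. The sketch below is a rough NumPy approximation of that idea, not the code Vega-Lite runs. The `ratios` array is invented, and the number of resamples and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented ratios for one income group in one year (not Gapminder data)
ratios = np.array([0.55, 0.62, 0.70, 0.74, 0.81, 0.88, 0.93, 0.97])

# Resample with replacement many times and record the mean of each resample
n_resamples = 1000
resampled_means = np.array(
    [rng.choice(ratios, size=ratios.size, replace=True).mean() for _ in range(n_resamples)]
)

# The middle 95% of the resampled means gives the lower and upper band edges
ci_lower, ci_upper = np.percentile(resampled_means, [2.5, 97.5])
print(f"mean={ratios.mean():.3f}, 95% CI=({ci_lower:.3f}, {ci_upper:.3f})")
```

Groups with few countries or widely spread ratios produce a wider band, which is why the band width can differ between income groups and years.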

## 3. Exploring the relationship between child mortality and family size

Filtering steps:
- Filter the data to include only the years 1918, 1938, 1958, 1978, 1998, and 2018. 
- Use filled circles to make a scatter plot with children per woman on the x-axis, child mortality on the y-axis, and the circles colored by income group.
- Facet data into six subplots, one for each year laid out in 3 columns and 2 rows.

In [16]:
gm["year"] = pd.to_datetime(gm["year"], format="%Y")

# Keep six snapshot years, 20 years apart
gm_subset_years = gm[gm["year"].dt.year.isin([1918, 1938, 1958, 1978, 1998, 2018])]

# Don't change the variable name you assign the plot to
scatter_familysize_mortality = (
    alt.Chart(gm_subset_years)
    .mark_circle(opacity=1)
    .encode(
        x=alt.X("children_per_woman", title="Children per woman"),
        y=alt.Y("child_mortality", title="Child mortality"),
        color="income_group",
    )
    .properties(width=180, height=180)
    .facet(facet="year", columns=3)
)

# Show the plot at the end
scatter_familysize_mortality

## 4. Explore which countries emit the most CO2 per capita and which regions have emitted the most in total over time

Filtering steps:
- Filter the data to include only the most recent year when `co2_per_capita` was measured.
- Use the DataFrame `nlargest` method to select the top 40 countries in CO2 production per capita for that year.
- Since we have only one value per country per year, let’s create a bar chart to visualize it. Encode the CO2 per capita on the x-axis, the country on the y-axis, and the region as the color.
- Sort your bar chart so that the highest CO2 per capita is the closest to the x-axis (the bottom of the chart). Here is an example of how to sort in Altair.

In [20]:
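# Find the most recent year with a recorded co2_per_capita value (2014 in this dataset)
latest_co2_year = gm.loc[gm["co2_per_capita"].notnull(), "year"].max()
# Keep the 40 countries with the highest CO2 per capita in that year
nlargest = gm[gm["year"] == latest_co2_year].nlargest(40, "co2_per_capita")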
# Don't change the variable name you assign the plot to
bars_co2 = (
    alt.Chart(nlargest)
    .mark_bar()
    .encode(
        x=alt.X("co2_per_capita", title="CO2 per capita"),
        y=alt.Y("country", sort="x", title="Country"),
        color="region",
    )
)

# Show the plot at the end
bars_co2

## Total CO2 emissions by region

In [None]:
Filtering steps: