Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Varshini Rana"
COLLABORATORS = ""

# Presenting Uncertainty
## School of Information, University of Michigan

## Week 2: Assignment Overview
Version 1.1
### The objectives for this week are for you to:

- review the concept of standard error
- construct a confidence distribution
- use a confidence distribution to construct a density plot, interval plot, CCDF barplot, and quantile dotplot
- apply one or more of the above techniques to the World Happiness dataset

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt
import random
from scipy import stats
from scipy.stats import norm

# Part 1. Standard error and confidence distributions (14 points)

## 1.1 Bootstrap sampling distribution of phone pickups example

Recall in assignment 1 we used a sample from a hypothetical dataset of counts of how many times students at a school picked up their phones. Let's generate such a hypothetical dataset again. We'll make a sample of 20 students and the number of times each student picked up their phone:

In [3]:
np.random.seed(1234)   # for reproducibility

pickups_df = pd.DataFrame({"pickups": np.random.poisson(lam=40, size=20)})

pickups_df.describe()

| | pickups |
|---|---|
| count | 20.0 |
| mean | 39.35 |
| std | 6.072154 |
| min | 27.0 |
| 25% | 36.75 |
| 50% | 38.5 |
| 75% | 43.25 |
| max | 52.0 |


### Question 1.1.1 (2 points)

Using what you learned in assignment 1, generate a bootstrap sampling distribution of the mean number of pickups and assign it to the variable `bootstrap_means`:

In [4]:
B = 5000
np.random.seed(1234) # for reproducibility

# YOUR CODE HERE
# raise NotImplementedError()

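# extract the pickup counts as a 1-D array
pickups = pickups_df['pickups'].to_numpy()

# resample with replacement B times, recording the mean of each resample
bootstrap_means = []
for i in range(B):
    sample = np.random.choice(pickups, size=len(pickups), replace=True)
    bootstrap_means.append(np.mean(sample))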

In [5]:
# the bootstrap sampling mean should be close to the sample mean
sample_mean = pickups_df['pickups'].mean()
sample_mean_se = np.std(pickups_df['pickups'], ddof=1) / np.sqrt(len(pickups_df))
assert len(bootstrap_means) == 5000, "Bootstrap sampling distribution: length of bootstrap_means should be 5000"
assert np.abs(np.std(bootstrap_means) - sample_mean_se) / sample_mean_se < 0.05, "Bootstrap sampling distribution: SD of bootstrap_means does not match SE of sample_pickups"
assert np.abs(np.mean(bootstrap_means) - sample_mean)/sample_mean < 5*1e-3, "Bootstrap sampling distribution: mean of bootstrap_means does not match sample_mean"
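
The resampling loop above can also be written without an explicit loop. The following is a minimal sketch, not the assignment's reference solution: it assumes a stand-in sample drawn from the same Poisson model (the names `sample`, `resamples`, and `bootstrap_means_vec` are illustrative) and uses NumPy's `Generator` API, so its random stream differs from the `np.random.seed` cells above.

```python
import numpy as np

# Assumption: a stand-in sample generated like pickups_df, not the same values.
rng = np.random.default_rng(1234)
sample = rng.poisson(lam=40, size=20)

B = 5000
# Draw all B resamples at once: each row is one resample with replacement.
resamples = rng.choice(sample, size=(B, len(sample)), replace=True)
bootstrap_means_vec = resamples.mean(axis=1)

# Standard error from the formula s / sqrt(n), for comparison.
se = sample.std(ddof=1) / np.sqrt(len(sample))
print("SD of bootstrap means:", bootstrap_means_vec.std())
print("formula SE:           ", se)
```

The two printed numbers should be close but not identical, which is the same comparison the assertions above make.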

### Question 1.1.2 (2 points)

Visualize the bootstrap sampling distribution of the mean as a density plot.

Hint: see assignment 1.

In [6]:
# YOUR CODE HERE
# raise NotImplementedError()

df = pd.DataFrame(data = {"bootstrap_means" : bootstrap_means})

bootstrap_density_chart = alt.Chart(df).transform_density(
    density='bootstrap_means',
    as_=['bootstrap_means', 'density'],
    
).mark_area(color="lightgray").encode(
    x=alt.X('bootstrap_means', axis=alt.Axis(title="Bootstrap Sampling Distribution")),
    y=alt.Y('density:Q', axis=alt.Axis(title="Density"))
)

bootstrap_mean_chart = alt.Chart(df).mark_rule(color="red").encode(
    x="mean(bootstrap_means)"
)

bootstrap_chart = bootstrap_density_chart + bootstrap_mean_chart

bootstrap_chart

## 1.2 Standard error and Normal sampling distribution for mean pickups

The sampling distribution of a mean approaches the shape of a Normal distribution as the sample size increases. Instead of using bootstrapping to derive the distribution, we could instead use the *standard error*, which is an estimate of the standard deviation of the sampling distribution, assuming the sampling distribution is Normal.

The standard error of the mean ($s_\bar{x}$) is calculated using the sample standard deviation ($s$) and the sample size ($n$):

$s_\bar{x} = \frac{s}{\sqrt{n}}$

This calculation can be done as follows:

In [7]:
# manual calculation of standard error
pickups_se = pickups_df['pickups'].std(ddof=1) / np.sqrt(len(pickups_df))
print("Manual calculation of SE:           ", pickups_se)

# or using scipy.stats.sem() (should be the same)
pickups_se = stats.sem(pickups_df['pickups'])
print("scipy.stats.sem() calculation of SE:", pickups_se)


Manual calculation of SE:            1.357774882511437
scipy.stats.sem() calculation of SE: 1.357774882511437
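
To see what the standard error is estimating, one option is to simulate the sampling distribution directly. The sketch below assumes the same data-generating model as the hypothetical dataset (Poisson with rate 40, samples of 20 students); the variable names and the number of repetitions are illustrative choices. For this model the standard deviation of the sample mean is $\sqrt{40}/\sqrt{20} \approx 1.41$, and the SE computed from our single sample is an estimate of that quantity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, reps = 20, 40, 20000

# Each row is one hypothetical sample of n students; take the mean of each row.
sample_means = rng.poisson(lam=lam, size=(reps, n)).mean(axis=1)

# A Poisson(lam) population has SD sqrt(lam), so the sample mean has SD sqrt(lam) / sqrt(n).
theoretical_sd = np.sqrt(lam) / np.sqrt(n)
empirical_sd = sample_means.std(ddof=1)
print("empirical SD of sample means:", empirical_sd)
print("theoretical SD:              ", theoretical_sd)
```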


Given the sample mean pickups ($\bar{x}$ = `pickups_mean`) and the sample mean standard error ($s_\bar{x}$ = `pickups_se`), we can construct a sampling distribution using the standard error. As discussed in lecture, there are three functions we'll want to define in order to construct different uncertainty visualizations: the probability density function (PDF), the cumulative distribution function (CDF), and the quantile function (also called the percentage points function, PPF). For a mean and standard error, these are:

$\begin{align}
\mathrm{density~function~(PDF):~} & f_\mathrm{Normal}\left(x \middle| \bar{x}, s_\bar{x}\right) & \mathtt{stats.norm.pdf()}\\
\mathrm{cumulative~distribution~function~(CDF):~} & F_\mathrm{Normal}\left(x \middle| \bar{x}, s_\bar{x}\right) & \mathtt{stats.norm.cdf()}\\
\mathrm{quantile~function~(PPF):~} & F^{-1}_\mathrm{Normal}\left(p \middle| \bar{x}, s_\bar{x}\right) & \mathtt{stats.norm.ppf()}
\end{align}$
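
As a small illustration of how the three functions relate, the sketch below evaluates each of them at one point, using the mean and standard error reported above. The point `x = 41.0` is an arbitrary choice for illustration.

```python
from scipy import stats

# Values reported above for the pickups sample.
pickups_mean = 39.35
pickups_se = 1.357774882511437

x = 41.0  # arbitrary point, chosen only for illustration
density_at_x = stats.norm.pdf(x, pickups_mean, pickups_se)   # height of the curve at x
prob_below_x = stats.norm.cdf(x, pickups_mean, pickups_se)   # cumulative probability up to x
x_recovered = stats.norm.ppf(prob_below_x, pickups_mean, pickups_se)  # inverse of the CDF

print(density_at_x, prob_below_x, x_recovered)
```

Because the quantile function is the inverse of the CDF, `x_recovered` should equal `x` up to floating-point error.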

In [8]:
pickups_mean = pickups_df['pickups'].mean()

# generate evenly-spaced numbers covering the range of values of pickups
# that we will calculate the density of the sampling distribution at
x = np.linspace(
    start = pickups_df['pickups'].min(), 
    stop = pickups_df['pickups'].max(),
    num = 1001
)

# this is the density of the sampling distribution: f_Normal(x | x_bar, se)
density = stats.norm.pdf(x, pickups_mean, pickups_se)

df = pd.DataFrame({"Normal sampling distribution of mean pickups": x, "density": density})

pickups_se_chart = alt.Chart(df).mark_line().encode(
    x="Normal sampling distribution of mean pickups",
    y="density"
)

pickups_se_chart

## 1.3 t-based confidence distribution for mean pickups

A slight improvement to the sampling distribution above is to use a Student's t confidence distribution. At small sample sizes, the Normal approximation is less accurate; a scaled-and-shifted Student's t distribution will have slightly fatter tails than the Normal distribution. 

To determine how fat the tails are, the t distribution uses a *degrees of freedom* parameter, which for the estimate of a single independent mean of sample size $n$ is equal to $n - 1$. Thus, an improved confidence distribution over the Normal approximation above is:

$\begin{align}
\mathrm{density~function~(PDF):~} & f_\mathrm{t}\left(x \middle| n - 1, \bar{x}, s_\bar{x}\right) & \mathtt{stats.t.pdf()}\\
\mathrm{cumulative~distribution~function~(CDF):~} & F_\mathrm{t}\left(x \middle| n - 1, \bar{x}, s_\bar{x}\right) & \mathtt{stats.t.cdf()}\\
\mathrm{quantile~function~(PPF):~} & F^{-1}_\mathrm{t}\left(p \middle| n - 1, \bar{x}, s_\bar{x}\right) & \mathtt{stats.t.ppf()}
\end{align}$

This confidence distribution is related to the common Student's t test: confidence intervals from this distribution are the same as the confidence intervals you would generate from a t test.
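
The link to the t test interval can be checked numerically. This sketch assumes the sample size, mean, and standard error reported earlier in the notebook. It compares the interval obtained from the scaled-and-shifted quantile function with the textbook form $\bar{x} \pm t \cdot s_\bar{x}$, and also shows the fatter tails: at 19 degrees of freedom the 95% multiplier is about 2.09, against about 1.96 for the Normal.

```python
import numpy as np
from scipy import stats

n = 20
pickups_mean = 39.35
pickups_se = 1.357774882511437
dof = n - 1

# Critical multipliers for a 95% interval.
z_mult = stats.norm.ppf(0.975)
t_mult = stats.t.ppf(0.975, dof)

# Interval from the scaled-and-shifted t quantile function ...
ci_ppf = stats.t.ppf([0.025, 0.975], dof, pickups_mean, pickups_se)
# ... and the textbook form: mean +/- t * SE.
ci_textbook = np.array([pickups_mean - t_mult * pickups_se,
                        pickups_mean + t_mult * pickups_se])

print("Normal multiplier:", z_mult, " t multiplier:", t_mult)
print(ci_ppf, ci_textbook)
```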

### Question 1.3.1 Visualize the Student's t confidence distribution for the mean (5 points)

Visualize the density of the t-based confidence distribution described above in a similar manner to how the Normal sampling distribution was visualized, but use a different color for the line (this will help you with the next question):


In [9]:
# YOUR CODE HERE
# raise NotImplementedError()

# generate evenly-spaced numbers covering the range of values of pickups
# that we will calculate the density of the sampling distribution at
x = np.linspace(
    start = pickups_df['pickups'].min(), 
    stop = pickups_df['pickups'].max(),
    num = 1001
)

# this is the density of the t confidence distribution: f_t(x | n-1, x_bar, se)
density = stats.t.pdf(x, len(pickups_df)-1, pickups_mean, pickups_se)

df = pd.DataFrame({"t-based confidence distribution of mean pickups": x, "density": density})

pickups_t_chart = alt.Chart(df).mark_line(color="red", opacity=0.5).encode(
    x="t-based confidence distribution of mean pickups",
    y="density"
)

pickups_t_chart

### Question 1.3.2 Visualize and compare uncertainty distributions (5 points)

Visualize all three distributions you have constructed so far in a single plot. 

Hint: if you have saved your previous plots in three separate variables and made them with different enough encodings that they can be distinguished by eye, this answer can be as simple as `pickups_bootstrap_chart + pickups_se_chart + pickups_t_chart`

In [10]:
# YOUR CODE HERE
# raise NotImplementedError()

# Bootstrap Distribution
df = pd.DataFrame(data = {"bootstrap_means" : bootstrap_means})

bootstrap_density_chart = alt.Chart(df).transform_density(
    density='bootstrap_means',
    as_=['bootstrap_means', 'density'],
    
).mark_area(color="lightgray").encode(
    x=alt.X('bootstrap_means', axis=alt.Axis(title="Uncertainty Distribution")),
    y=alt.Y('density:Q', axis=alt.Axis(title="Density"))
)

bootstrap_mean_chart = alt.Chart(df).mark_rule(color="red").encode(
    x="mean(bootstrap_means)"
)

bootstrap_chart = bootstrap_density_chart + bootstrap_mean_chart

# SE Distribution
pickups_mean = pickups_df['pickups'].mean()

# generate evenly-spaced numbers covering the range of values of pickups
# that we will calculate the density of the sampling distribution at
x = np.linspace(
    start = pickups_df['pickups'].min(), 
    stop = pickups_df['pickups'].max(),
    num = 1001
)

# this is the density of the sampling distribution: f_Normal(x | x_bar, se)
density = stats.norm.pdf(x, pickups_mean, pickups_se)

df = pd.DataFrame({"Normal sampling distribution of mean pickups": x, "density": density})

pickups_se_chart = alt.Chart(df).mark_line().encode(
    x=alt.X("Normal sampling distribution of mean pickups", axis=alt.Axis(title="Uncertainty Distribution")),
    y=alt.Y("density", axis=alt.Axis(title="Density"))
)

# T Distribution
# generate evenly-spaced numbers covering the range of values of pickups
# that we will calculate the density of the sampling distribution at
x = np.linspace(
    start = pickups_df['pickups'].min(), 
    stop = pickups_df['pickups'].max(),
    num = 1001
)

# this is the density of the t confidence distribution: f_t(x | n-1, x_bar, se)
density = stats.t.pdf(x, len(pickups_df)-1, pickups_mean, pickups_se)

df = pd.DataFrame({"t-based confidence distribution of mean pickups": x, "density": density})

pickups_t_chart = alt.Chart(df).mark_line(color="red", opacity=0.5).encode(
    x=alt.X("t-based confidence distribution of mean pickups", axis=alt.Axis(title="Uncertainty Distribution")),
    y=alt.Y("density", axis=alt.Axis(title="Density"))
)

bootstrap_chart + pickups_se_chart + pickups_t_chart

Compare these distributions: what do you notice? Write your answer below.

Answer: 

The t-based distribution (in red) has a slightly lower peak and fatter tails than the Normal sampling distribution (in blue); it would move closer to the Normal distribution as the degrees of freedom increase. The bootstrap sampling distribution (grey area) has a slight bump on the left near the peak (more visible on the individual plot). The Normal and t-based curves are drawn from ~27 to 52, but that is simply the range of `x` over which they were evaluated (the sample minimum and maximum), whereas the bootstrap sampling distribution's tails extend from ~35 to ~43.5 (again, more visible on the individual plot). Overall, the three distributions resemble each other closely in shape and width, with only the minor differences described above.

# Part 2. Intervals, CDFs, and quantile dotplots (17 points)

Now that we are able to define the density, CDF, and quantile functions to describe our uncertainty in a mean, we can use these to construct various uncertainty visualizations as described in the lectures. For example, we can use the density function to create density plots or gradients; the CDF to create complementary CDF bar plots; and the quantile function to create intervals and quantile dotplots.

## 2.1 Intervals

### Question 2.1.1 Calculate 95% interval (2 points)

Using the quantile function (aka percentage points function) of the Student's t confidence distribution you derived in Part 1, calculate the lower and upper bounds of a 95% confidence interval for mean pickups and assign these bounds to the `pickups_lower` and `pickups_upper` variables. 

(Hint: use `stats.t.ppf()` as the quantile function and see the lecture on intervals.)

In [11]:
# YOUR CODE HERE
# raise NotImplementedError()
pickups_lower=stats.t.ppf(0.025, len(pickups_df)-1, pickups_mean, pickups_se)
pickups_upper=stats.t.ppf(0.975, len(pickups_df)-1, pickups_mean, pickups_se)

In [12]:
dof = len(pickups_df) - 1
m = pickups_df['pickups'].mean()
se = stats.sem(pickups_df['pickups'])
# hidden tests below

### Question 2.1.2 Visualize 95% + 66.666% interval (5 points)

Visualize a combined 95% + 66.666% (i.e. 2/3rds) interval for the mean of pickups, overlaid on top of the data (as ticks). 

Hints: 
- create two different layers, one for the 95% interval and one for the 66.666% interval, then combine them using `alt.layer()` or `+`.
- use `mark_rule(size=XXX).encode(x=YYY, x2=ZZZ)` to create each interval, where you fill in the values of `size` (to set the thickness of the interval), `x` (to set the lower bound), and `x2` (to set the upper bound).

Your output should look like this:

![Plot of 66% and 95% confidence intervals overlaid on data](asset/assignment2_intervals.png)

In [13]:
# YOUR CODE HERE
# raise NotImplementedError()

pickups_lower_66=stats.t.ppf(0.16667, len(pickups_df)-1, pickups_mean, pickups_se)
pickups_upper_66=stats.t.ppf(0.83333, len(pickups_df)-1, pickups_mean, pickups_se)

pickups_data_chart=alt.Chart(pickups_df).mark_rule(color="red", opacity=0.5).encode(x="pickups")

df95=pd.DataFrame(columns=["pickups_lower", "pickups_upper"])
df95.loc[0]=[pickups_lower, pickups_upper]
df66=pd.DataFrame(columns=["pickups_lower_66", "pickups_upper_66"])
df66.loc[0]=[pickups_lower_66, pickups_upper_66]

interval95=alt.Chart(df95).mark_rule(size=5).encode(x=alt.X("pickups_lower:Q", axis=alt.Axis(title="lower")), 
                                                    x2="pickups_upper:Q")
interval66=alt.Chart(df66).mark_rule(size=8).encode(x="pickups_lower_66:Q", x2="pickups_upper_66:Q")

pickups_data_chart+interval95+interval66
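
The tail probabilities used above (0.025/0.975 and 0.16667/0.83333) follow directly from the coverage level: a central interval with coverage `level` leaves `(1 - level) / 2` in each tail. The sketch below wraps this in a helper. `central_interval` is a hypothetical name introduced here for illustration, not something from the course materials.

```python
from scipy import stats

def central_interval(level, dof, loc, scale):
    """Return (lower, upper) of the central `level` interval of a shifted, scaled t."""
    tail = (1 - level) / 2          # probability left out on each side
    lower = stats.t.ppf(tail, dof, loc, scale)
    upper = stats.t.ppf(1 - tail, dof, loc, scale)
    return lower, upper

# Values reported earlier in the notebook for the pickups sample.
pickups_mean, pickups_se, dof = 39.35, 1.357774882511437, 19

lo95, hi95 = central_interval(0.95, dof, pickups_mean, pickups_se)
lo66, hi66 = central_interval(2 / 3, dof, pickups_mean, pickups_se)
print("95% interval:    ", lo95, hi95)
print("66.666% interval:", lo66, hi66)
```

The 66.666% interval should always sit inside the 95% interval, which is why the two can be layered as a thick rule on top of a thin one.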

## 2.2 Cumulative distribution function

### Question 2.2.1 Visualize a Complementary CDF barplot (5 points)

Using the CDF function (or the CCDF function), plot a CCDF barplot for the Student's t-based confidence distribution. *The barplot should start at 0*.

Hint: start from the code for generating a density plot, above, and (1) adjust the code for generating `x` to ensure it includes 0 and (2) change the function from a PDF to 1 - the CDF.

Your output should look something like this:

![CCDF barplot of mean pickups](asset/assignment2_ccdf.png)

In [14]:
# YOUR CODE HERE
# raise NotImplementedError()

x = np.linspace(
    start = 0, 
    stop = pickups_df['pickups'].max(),
    num = 1001
)

density = 1-(stats.t.cdf(x, len(pickups_df)-1, pickups_mean, pickups_se))

df = pd.DataFrame(data = {"Mean pickups" : x, "density": density})

chart=alt.Chart(df).mark_area().encode(
    x=alt.X('Mean pickups', axis=alt.Axis(title="Mean pickups (CCDF)"), scale=alt.Scale(domain=(0, 55))),
    y=alt.Y('density:Q', axis=alt.Axis(title="ccdf"))
)

chart.properties(height=50)
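
SciPy also exposes the complementary CDF directly as the survival function, `sf()`. The sketch below uses the mean, standard error, and sample maximum reported earlier, and checks that `stats.t.sf` agrees with `1 - stats.t.cdf` on the same grid; either form could be used for the bar.

```python
import numpy as np
from scipy import stats

# Values reported earlier in the notebook for the pickups sample.
pickups_mean, pickups_se, dof = 39.35, 1.357774882511437, 19

# Grid starting at 0 so the bar starts at 0; 52 is the sample maximum.
x = np.linspace(0, 52, 1001)

ccdf_manual = 1 - stats.t.cdf(x, dof, pickups_mean, pickups_se)
ccdf_sf = stats.t.sf(x, dof, pickups_mean, pickups_se)   # survival function = 1 - CDF

print(np.max(np.abs(ccdf_manual - ccdf_sf)))
```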

## 2.3 Quantile dotplots

Constructing quantile dotplots involves first generating a small-to-medium number of quantiles from the quantile function (aka percentage points function) of the uncertainty distribution. We do this by projecting evenly-spaced values in probability space back into the data units through the quantile function, then stacking up those values into a dotplot:

<img src="asset/cdf.png" alt="Two graphs stacked on top of each other. The top graph has a y-axis of p less than x, from 0.00 to 1.00 in intervals of 0.25, and the x-axis is x, from 0 to 30. The cumulative distribution function line starts out around (0, 0.00) and stays that way until about (8, 0.00) where it rises up the y axis until it reaches about (18, 1.00) and continues on to 30. The bottom graph has a y-axis of count, from 0.00 to 1.00 in intervals of 0.25, and the x-axis is x, from 0 to 30. Circles are stacked between 8 and 18 on the x-axis, with a peak around (11, 0.50) and is slightly skewed to the right." style="width: 500px;"/>

The first step is to generate the evenly-spaced probabilities. We will create a `ppoints(n)` function that generates `n` evenly-spaced probabilities (not including 0 and 1):

In [15]:
def ppoints(n):
    return (np.arange(n) + 0.5)/n

ppoints(20)

array([0.025, 0.075, 0.125, 0.175, 0.225, 0.275, 0.325, 0.375, 0.425,
       0.475, 0.525, 0.575, 0.625, 0.675, 0.725, 0.775, 0.825, 0.875,
       0.925, 0.975])

We can then use `ppoints` with the quantile function of the t confidence distribution (`stats.t.ppf`) to construct a quantile dotplot.

Dotplots in Altair are complex to construct, requiring three transformations:

1. `transform_bin` to bin the values
2. `transform_window` to rank the values within each bin (this stacks up all the dots in one bin)
3. `transform_joinaggregate` to calculate the midpoint of each bin using the median of all values in that bin
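
To make the three steps concrete, here is a rough pandas analogue of what they compute. This is a sketch for intuition only, not what Altair runs internally: it assumes bins are aligned to multiples of the step size, and the column names simply mirror those used in the chart code.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Values reported earlier in the notebook for the pickups sample.
pickups_mean, pickups_se, dof = 39.35, 1.357774882511437, 19

probs = (np.arange(50) + 0.5) / 50            # same values as ppoints(50)
df = pd.DataFrame({"x": stats.t.ppf(probs, dof, pickups_mean, pickups_se)})

step = 0.75
# 1. bin: label each value with the lower edge of its bin
df["bin"] = np.floor(df["x"] / step) * step
# 2. window: number the values within each bin 1, 2, 3, ... (the stacking height)
df["rank_in_bin"] = df.groupby("bin").cumcount() + 1
# 3. joinaggregate: attach the median of each bin to every row in that bin
df["bin_midpoint"] = df.groupby("bin")["x"].transform("median")

print(df.head())
```

Plotting `bin_midpoint` against `rank_in_bin` then gives one column of stacked dots per bin.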

Putting together `ppoints`, the quantile function of the confidence distribution, and the above steps, we can construct a quantile dotplot.

In [16]:
df = pd.DataFrame({
    "x": stats.t.ppf(ppoints(50), len(pickups_df) - 1, pickups_mean, pickups_se)
})

alt.Chart(df).transform_bin(
    field="x",
    as_="bin",
    bin=alt.BinParams(step=0.75)
).transform_window(
    rank_in_bin="rank()",
    groupby=["bin"]
).transform_joinaggregate(
    bin_midpoint="median(x)",
    groupby=["bin"]
).mark_circle(size=90).encode(
    x="bin_midpoint:Q",
    y="rank_in_bin:Q"
).properties(width=600, height=100)
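
One way to read a quantile dotplot is by counting dots: each of the 50 dots stands for 1/50 of the probability, so the share of dots to the left of a value approximates the CDF there. The sketch below checks this against the exact CDF. The `threshold` of 38 pickups is a hypothetical value of interest chosen for illustration, and `ppoints` is repeated so the block is self-contained.

```python
import numpy as np
from scipy import stats

def ppoints(n):
    return (np.arange(n) + 0.5) / n

# Values reported earlier in the notebook for the pickups sample.
pickups_mean, pickups_se, dof = 39.35, 1.357774882511437, 19
threshold = 38.0   # hypothetical value of interest

quantiles = stats.t.ppf(ppoints(50), dof, pickups_mean, pickups_se)

# Share of dots below the threshold versus the exact cumulative probability.
dots_below = np.sum(quantiles < threshold)
approx_prob = dots_below / len(quantiles)
exact_prob = stats.t.cdf(threshold, dof, pickups_mean, pickups_se)

print(dots_below, approx_prob, exact_prob)
```

With evenly-spaced probabilities the dot-counting estimate can be off by at most half a dot, i.e. 1/100 here.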

### Question 2.3.1 Construct a 20-dot quantile dotplot for the mean of pickups (5 points)

Construct a 20-dot quantile dotplot for the mean of pickups. To achieve a closely-packed dotplot like the 50-dot example above, you will typically need to adjust the step size in the binning (`alt.BinParams(step=XXX)`), the size of the circles (`mark_circle(size=XXX)`), and/or the width and height of the plot (`properties(width=XXX, height=YYY)`).

In [17]:
# YOUR CODE HERE
# raise NotImplementedError()

df = pd.DataFrame({
    "x": stats.t.ppf(ppoints(20), len(pickups_df) - 1, pickups_mean, pickups_se)
})

alt.Chart(df).transform_bin(
    field="x",
    as_="bin",
    bin=alt.BinParams(step=1.75)
).transform_window(
    rank_in_bin="rank()",
    groupby=["bin"]
).transform_joinaggregate(
    bin_midpoint="median(x)",
    groupby=["bin"]
).mark_circle(size=420).encode(
    x="bin_midpoint:Q",
    y=alt.Y("rank_in_bin:Q", scale=alt.Scale(domain=(0, 12)))
).properties(width=650, height=250)

# Part 3. Visualize world happiness (10 points)

We will visualize the World Happiness Report dataset from 2015. It is a survey of the state of global happiness, which ranks 158 countries by their happiness levels. Let's explore the happiness score.

First, we'll read in the data and gather the happiness score for all countries:

In [18]:
#Read in 2015 World Happiness Report statistics as a dataframe
happ = pd.read_csv("asset/2015.csv")

#show the top 15 countries:
happ.head(15)

| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.4363 | 2.70201 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| 3 | Norway | Western Europe | 4 | 7.522 | 0.0388 | 1.459 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 |
| 4 | Canada | North America | 5 | 7.427 | 0.03553 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 |
| 5 | Finland | Western Europe | 6 | 7.406 | 0.0314 | 1.29025 | 1.31826 | 0.88911 | 0.64169 | 0.41372 | 0.23351 | 2.61955 |
| 6 | Netherlands | Western Europe | 7 | 7.378 | 0.02799 | 1.32944 | 1.28017 | 0.89284 | 0.61576 | 0.31814 | 0.4761 | 2.4657 |
| 7 | Sweden | Western Europe | 8 | 7.364 | 0.03157 | 1.33171 | 1.28907 | 0.91087 | 0.6598 | 0.43844 | 0.36262 | 2.37119 |
| 8 | New Zealand | Australia and New Zealand | 9 | 7.286 | 0.03371 | 1.25018 | 1.31967 | 0.90837 | 0.63938 | 0.42922 | 0.47501 | 2.26425 |
| 9 | Australia | Australia and New Zealand | 10 | 7.284 | 0.04083 | 1.33358 | 1.30923 | 0.93156 | 0.65124 | 0.35637 | 0.43562 | 2.26646 |


In [19]:
#some descriptive statistics of the World Happiness Report:
happ.describe()

| | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| mean | 79.493671 | 5.375734 | 0.047885 | 0.846137 | 0.991046 | 0.630259 | 0.428615 | 0.143422 | 0.237296 | 2.098977 |
| std | 45.754363 | 1.14501 | 0.017146 | 0.403121 | 0.272369 | 0.247078 | 0.150693 | 0.120034 | 0.126685 | 0.55355 |
| min | 1.0 | 2.839 | 0.01848 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32858 |
| 25% | 40.25 | 4.526 | 0.037268 | 0.545808 | 0.856823 | 0.439185 | 0.32833 | 0.061675 | 0.150553 | 1.75941 |
| 50% | 79.5 | 5.2325 | 0.04394 | 0.910245 | 1.02951 | 0.696705 | 0.435515 | 0.10722 | 0.21613 | 2.095415 |
| 75% | 118.75 | 6.24375 | 0.0523 | 1.158448 | 1.214405 | 0.811013 | 0.549092 | 0.180255 | 0.309883 | 2.462415 |
| max | 158.0 | 7.587 | 0.13693 | 1.69042 | 1.40223 | 1.02525 | 0.66973 | 0.55191 | 0.79588 | 3.60214 |


## 3.1 Happiness Scores

We'll focus on two columns in particular: the `Happiness Score` (an aggregate happiness score for each country) and the `Standard Error`, which is the standard error of the `Happiness Score`: because the data are based on surveys, there is uncertainty in the measurement of the happiness score, and this is quantified using its standard error.

### Question 3.1.1 Visualize the happiness scores and uncertainty of the top 15 countries (5 points)

Use one of the uncertainty visualizations above to visualize the happiness score of each country. You will need to use the Normal approximation to the sampling distribution along with the standard error of the happiness score to do this.
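
As a sketch of what the Normal approximation looks like in practice, the block below computes 95% and 66.666% interval bounds for several countries at once. It uses the first three rows of the table above as a small stand-in for the full dataset; the `lower_*`/`upper_*` column names are illustrative choices.

```python
import pandas as pd
from scipy import stats

# First three rows of the table above, as a small stand-in for the full data.
top = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Standard Error": [0.03411, 0.04884, 0.03328],
})

# stats.norm.ppf broadcasts over the columns, so no per-country loop is needed.
for level, name in [(0.95, "95"), (2 / 3, "66")]:
    tail = (1 - level) / 2   # probability left out on each side
    top[f"lower_{name}"] = stats.norm.ppf(tail, top["Happiness Score"], top["Standard Error"])
    top[f"upper_{name}"] = stats.norm.ppf(1 - tail, top["Happiness Score"], top["Standard Error"])

print(top)
```

Keeping the bounds as columns of one data frame means a single chart specification can encode them for every country.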

You **MUST NOT** use just a combination of a point and a single interval to visualize the scores. For example, the following chart **IS NOT AN ACCEPTABLE SOLUTION** (but could be a good place to start):

![point estimates and 95% confidence intervals for the happiness score in the top 10 countries according to the World Happiness Report](asset/assignment2_happiness_intervals.png)

But you could construct a chart with a similar layout, but which uses one of the uncertainty encodings you learned above. For example, you might start by constructing a chart with 95% intervals, then extend it to be a chart with 95%+66.666% intervals, which would be an acceptable solution.

Hints:

1. You may want to sort the y axis by happiness score using something like `alt.Y("Country", sort=["Happiness Score"])`
2. You may want to restrict the x axis domain to make it easier to see the uncertainty using something like `alt.X("Happiness Score", scale={"domain":[6,8]})`
3. If you decide to make density plots, quantile dotplots, or CCDF barplots, you need to use the `facet()` function to display multiple subcharts where each one is a single country. If you do that, you may want to set the `"labelAngle"` property on the facet headers so they are horizontal.
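
For the faceted options in the last hint, the data usually needs to be in long form: one row per country per grid point. The following is a minimal sketch of building such a frame for density plots. It uses three rows from the table above as a stand-in, and the grid range of 7.3 to 7.8 and the column names `x` and `density` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Three rows from the table above, as a small stand-in for the full data.
top = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Standard Error": [0.03411, 0.04884, 0.03328],
})

# Grid of happiness scores at which to evaluate each density (range is an assumption).
x = np.linspace(7.3, 7.8, 201)

# One block of rows per country: Normal density centered on the score, with SD = standard error.
long_df = pd.concat(
    [
        pd.DataFrame({
            "Country": country,
            "x": x,
            "density": stats.norm.pdf(x, score, se),
        })
        for country, score, se in zip(
            top["Country"], top["Happiness Score"], top["Standard Error"]
        )
    ],
    ignore_index=True,
)

print(long_df.groupby("Country")["density"].max())
```

A frame shaped like this could then be passed to something like `alt.Chart(long_df).mark_area().encode(x="x:Q", y="density:Q").facet(row="Country:N")`.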

In [20]:
# YOUR CODE HERE
# raise NotImplementedError()

# keep the 15 countries with the highest happiness score
happ = happ.sort_values("Happiness Score", ascending=False).head(15)

chart = alt.vconcat(data=happ)
for _, row in happ.iterrows():
    country = row["Country"]
    score, se = row["Happiness Score"], row["Standard Error"]

    # Normal approximation: quantiles of Normal(score, se)
    happ_lower_95 = stats.norm.ppf(0.025, score, se)
    happ_upper_95 = stats.norm.ppf(0.975, score, se)
    happ_lower_66 = stats.norm.ppf(0.16667, score, se)
    happ_upper_66 = stats.norm.ppf(0.83333, score, se)

    # one-row data frames holding the bounds of each interval
    df95 = pd.DataFrame({"happ_lower_95": [happ_lower_95], "happ_upper_95": [happ_upper_95], "Country": [country]})
    df66 = pd.DataFrame({"happ_lower_66": [happ_lower_66], "happ_upper_66": [happ_upper_66], "Country": [country]})

    # thin rule for the 95% interval, thicker rule for the 66.666% interval
    interval95 = alt.Chart(df95).mark_rule(size=5).encode(
        x=alt.X("happ_lower_95:Q", scale=alt.Scale(domain=(6, 8)),
                axis=alt.Axis(title="Happiness Score, 95% and 66.666% Uncertainty Intervals")),
        x2="happ_upper_95:Q",
        y=alt.Y("Country:N", axis=alt.Axis(title=None)),
    )
    interval66 = alt.Chart(df66).mark_rule(size=8).encode(x="happ_lower_66:Q", x2="happ_upper_66:Q")

    # stack this country's layered intervals below the previous ones
    chart &= interval95 + interval66

chart

### Question 3.1.2 Reflect on the chart you created (5 points)

What are the pros and cons of the uncertainty encoding you used? How might it compare to other options you could have chosen?

Answer:

The following are the pros and cons of the uncertainty encoding I have used above.

Pros:

1. Showing two uncertainty intervals (95% and 66.666%) layered together emphasizes the fact that neither of those is a canonical interval that's guaranteed to contain the data/prediction/outcome, thus avoiding misinterpretations that either of them is a 100% interval that is definitely going to contain the data/prediction/outcome.
2. Showing two uncertainty intervals (95% and 66.666%) layered together gives a more intuitive sense of how plausible different values of the score are. The wider 95% interval covers more of the confidence distribution, so we can be more confident that it contains the true score, while the narrower 66.666% interval marks the region where the true score is most likely to lie.

Cons:

1. With just the 95% and 66.666% intervals, it is not possible to see all of the distributional information, such as the skewness or the kurtosis of the distribution.
2. With just the 95% and 66.666% intervals, it is hard to read off the central tendency unless the intervals are accompanied by, say, a dot marking the point estimate (here, the happiness score).

I briefly considered using density plots for each country, since I would've gotten a better sense of the distribution of the data. However, I decided against it for this particular data, since examining and comparing many density plots might be cumbersome. In contrast, the "lines" depicting the uncertainty levels might be more intuitive to compare across countries when looking at the chart at a glance.

Please remember to submit both the HTML and .ipynb formats of your completed notebook. When generating your HTML, be sure to run your complete code first before downloading as HTML. Please remember to work on your explanations and interpretations!