<img src="../../shared/img/banner.svg"></img>

# Lab 02 - Bootstrap Estimation for Personality Data

**REMEMBER**: if you downloaded this lab to datahub with an interact link,
it will be called `lab02_blank.ipynb`.

Before doing any work, go to the dropdown menu, `File > Make a Copy`,
and then rename that copy to `lab02.ipynb`.
Work in the copy, rather than the original.

In [None]:
%matplotlib inline


In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import json
import itertools

from client.api.notebook import Notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import utils.plot

In [None]:
ok = Notebook("ok/config")

In [None]:
sns.set_context("notebook", font_scale=1.7)

## Learning Goals

1. Practice using bootstrapping to calculate uncertainty.
1. Learn to draw inferences from bootstrap samples.
1. Visualize uncertainty using seaborn.

## Personality Types

The
[Myers-Briggs Type Indicator](https://amazon.com/Manual-Guide-Development-Myers-Briggs-Indicator/dp/0891060278)
(MBTI) is a personality test common in
[management psychology](https://journals.sagepub.com/doi/abs/10.1177/014920639602200103).
Though evidence for its utility and cross-cultural validity is
[limited](https://www.researchgate.net/publication/232494957_Cautionary_comments_regarding_the_Myers-Briggs_Type_Indicator),
and mainstream psychology prefers the
["Big Five"](https://en.wikipedia.org/wiki/Big_Five_personality_traits),
the MBTI has
[cultural staying power](https://www.newyorker.com/culture/culture-desk/the-enduring-allure-of-the-personality-quiz)
and a
[devoted online following](https://www.reddit.com/r/mbti/).

The test is based on
[Jungian personality theory](https://en.wikipedia.org/wiki/Personality_type#Carl_Jung),
which breaks personality down by means of **preferences** in each of four categories:

- Direction of Energy: `E`xtroverted, outward-oriented, or `I`ntroverted, inward-oriented
- Perception: `S`ensing, preferring raw sensations, or I`N`tuitive, preferring internal interpretations
- Decision-Making: `F`eeling, using emotional responses, or `T`hinking, using abstract rules
- Outward Orientation: Passing `J`udgement or simply `P`erceiving things

An individual is assigned a letter from each category,
resulting in a four-letter string called a **type**.

For example,
an `ENFP` is an `E`xternally-oriented person who uses their i`N`tuition/internal abstractions,
takes into account `F`eelings when making decisions and prefers `P`erceiving the world to passing judgment.

There are thus $2 \times 2 \times 2 \times 2 = 16$ possible types.
The cell below defines a function,
`make_mbti_type_list`,
that will generate a Python list containing each of the 16 types,
which you might find useful later.

In [None]:
def make_mbti_type_list():
    indexers = list(itertools.product([0, 1], repeat=4))
    types_array = np.array([["E", "I"], ["N", "S"], ["F", "T"], ["J", "P"]])
    
    types = ["".join(types_array[[0, 1, 2, 3], indexer]) for indexer in indexers]
    
    return types

type_list = make_mbti_type_list()

type_list
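
As a cross-check on the count of 16, the sketch below builds the same list by taking the product of the four letter pairs directly.
The name `sketch_types` is introduced here for illustration only and is not part of the lab code.

```python
import itertools

# One string per category; each character is one preference letter.
letter_pairs = ["EI", "NS", "FT", "JP"]

# itertools.product picks one letter from each pair, in order.
sketch_types = ["".join(letters) for letters in itertools.product(*letter_pairs)]

print(len(sketch_types))
print(sketch_types[:4])
```

Because `itertools.product` varies the last category fastest, this should yield the types in the same order as `make_mbti_type_list`.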

As the [Myers & Briggs Foundation notes](https://www.myersbriggs.org/my-mbti-personality-type/my-mbti-results/how-frequent-is-my-type.htm),
one of the first questions you might want to ask is
"how common is each of these types?"
That is, how **prevalent** are the types and preferences
in some population of interest?
They report some data on the prevalence of each type,
and of each preference in each category,
in the United States' population.

The cell below loads this data into a dictionary called `mbti_stats`.
The types and categories serve as keys and the values are the prevalences,
so `mbti_stats["ISTJ"]` is the prevalence of the ISTJ type,
while `mbti_stats["I"]` is the prevalence of the preference I,
reported as fractions.
Notice that the prevalences of a pair of preferences from the same category,
e.g. E and I from the first category, add up to 1.
The prevalences for the sixteen types add up to slightly more than 1, due to rounding.

In [None]:
with open("data/mbti_stats.json") as f:
    mbti_stats = json.load(f)

print(mbti_stats)

The cell below loads in some data about the MBTI types of a small subset of users of
[Personality Cafe](https://www.personalitycafe.com/),
a forum devoted to discussing aspects of personality
and connecting individuals on the basis of their results on personality tests.

In [None]:
mbti = pd.read_csv("data/downsampled_mbti.csv", index_col=0)

mbti.head()

In this week's lab, you are tasked with using bootstrapping to determine whether
the prevalences of the eight preferences and sixteen types among users of Personality Cafe
match the prevalences reported by the Myers & Briggs Foundation for the general population.

This is an example of an inferential problem, as discussed in class,
so we will use bootstrapping to estimate our uncertainty
and then use histograms to visually evaluate the results.
We will leave quantifying our uncertainty for this week's homework.

See the **Python Tips** section at the bottom of the lab for coding hints.

## Preference Prevalences

For the Extroverted preference,
compute the prevalence in the Personality Cafe user data
and save the value to a variable called `observed_E_prevalence`.

In [None]:
ok.grade("q1")

Now, for each of the eight preferences, E, N, F, J, I, S, T, and P,
apply bootstrapping to the prevalence to get an estimate of your uncertainty.
Between 100 and 1000 bootstraps should be sufficient.
Visualize your uncertainty by plotting the distribution of bootstrapped values with `sns.distplot`.

## Type Prevalences

Compute the prevalence of the INTJ type in the Personality Cafe user data
and save the value to a variable called `observed_INTJ_prevalence`.

In [None]:
ok.grade("q2")

For each of the sixteen types,
use bootstrapping to estimate your uncertainty in the prevalence.
Between 100 and 1000 bootstraps will suffice.
Again, visualize your uncertainty with `sns.distplot`.

## Questions

Answer the questions below before submitting your lab.

#### Q For which of the preferences `E`, `N`, `F`, and `J` does it appear that the users of Personality Cafe have a different prevalence than the values reported for the broader population by the Myers & Briggs Foundation? Explain how you came to your conclusion.

#### Q For which of the 16 types does it appear that the users of Personality Cafe have about the same prevalence as reported for the broader population? Explain how you came to your conclusion.

In [None]:
ok.score()

## Python Tips

### Calculating Prevalences

You might calculate the prevalences of each preference
by using `.apply` or `==` to get a Boolean series indicating whether a given row contains a given preference
and then using `.mean` to calculate the fraction that are `True`.
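
A minimal sketch of this pattern on a toy dataframe appears below.
The column name `"energy"` and the five rows are made up for illustration; check `mbti.head()` for the real column names.

```python
import pandas as pd

# Toy data: the column name "energy" is hypothetical and may not match `mbti`.
toy_df = pd.DataFrame({"energy": ["E", "I", "I", "E", "I"]})

# Comparing a column to a value gives a Boolean series, one entry per row.
is_extroverted = toy_df["energy"] == "E"

# .mean treats True as 1 and False as 0, so it returns the fraction of True rows.
toy_E_prevalence = is_extroverted.mean()

print(toy_E_prevalence)
```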

To calculate the prevalences of the types,
you'll need to do a bit more work,
since there is no `"type"` column.
To get a column for personality type,
you might use `.apply` on each row of the dataframe by passing the keyword argument `axis=1`:

```python
df.apply(?, axis=1)
```

where you'll need to replace the `?` with a function that takes in a row from `mbti`
and returns the type of the individual represented by that row.
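
The sketch below shows the mechanics of a row-wise `.apply` on a toy dataframe.
All four column names and the helper `row_to_type` are assumptions made for illustration, so the real `mbti` dataframe may need a different function.

```python
import pandas as pd

# Toy data: these column names are hypothetical and may not match `mbti`.
toy_df = pd.DataFrame({
    "energy": ["E", "I"],
    "perception": ["N", "S"],
    "decision": ["F", "T"],
    "orientation": ["P", "J"],
})

def row_to_type(row):
    """Join the four preference letters in one row into a type string."""
    return "".join(row[["energy", "perception", "decision", "orientation"]])

# With axis=1, the function receives each row as a Series.
toy_df["type"] = toy_df.apply(row_to_type, axis=1)

print(toy_df["type"].tolist())
```

Once a `"type"` column exists, the `==` and `.mean` pattern from above applies to it unchanged.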

### Bootstrapping

From a dataframe `df`, you can compute a bootstrap sample with

```python
bootstrap_df = df.sample(frac=1, replace=True)
```
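
A single bootstrap sample gives only one value of the statistic, so the resampling is usually repeated many times.
The sketch below is one possible way to do that, using a randomly generated toy dataframe, a hypothetical `"energy"` column, and a hypothetical helper `prevalence_of_E`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: 200 randomly generated rows in a hypothetical "energy" column.
toy_df = pd.DataFrame({"energy": rng.choice(["E", "I"], size=200)})

def prevalence_of_E(df):
    """Fraction of rows whose (hypothetical) "energy" entry is "E"."""
    return (df["energy"] == "E").mean()

n_bootstraps = 500

# Resample the rows with replacement and recompute the statistic each time.
bootstrap_values = [
    prevalence_of_E(toy_df.sample(frac=1, replace=True, random_state=rng))
    for _ in range(n_bootstraps)
]

print(len(bootstrap_values))
```

The resulting list plays the role of `bootstrap_values` in the plotting tips below.
Passing `random_state` is optional; it is used here only to make the sketch reproducible.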

### Plotting

#### Plotting Distributions

To plot the results of your bootstrap,
you could use `distplot` from seaborn.
If `bootstrap_values` were a list of the values of a statistic computed on a bunch of bootstrap samples,
you would plot their distribution with

```python
sns.distplot(bootstrap_values)
```

If you call this function multiple times in the same cell,
the distributions will be plotted on top of each other.
Use the `label` keyword argument,
plus the command `plt.legend()`,
to label them.

An example, labeling two bootstrap samples, `bootstrap1_values` and `bootstrap2_values`,
appears below.

```python
sns.distplot(bootstrap1_values, label="name for bootstrap1")
sns.distplot(bootstrap2_values, label="name for bootstrap2")
plt.legend();
```

#### Plotting a Comparison Value

In order to answer the questions,
you'll want to compare the bootstrap samples to the values in
`mbti_stats`.
The function `utils.plot.add_vertical_line` makes this easier
by plotting a vertical line at the point on the x-axis given by `value`.

This cell shows how to use `utils.plot.add_vertical_line` in concert with `sns.distplot`
to plot the bootstrapping results for the prevalence of a given type, ENFJ,
along with the prevalence reported by the Myers & Briggs Foundation,
presuming that the former is stored in a list or series called `enfj_bootstrap_values`.

```python
sns.distplot(enfj_bootstrap_values, label="Bootstrapped ENFJ Fraction")
utils.plot.add_vertical_line(mbti_stats["ENFJ"], label="Reported ENFJ Fraction")
```

Execute a cell containing the code
```python
utils.plot.add_vertical_line??
```
to see the documentation for this function.
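
If you are curious how such a helper might work, the sketch below draws a comparison line with matplotlib's `axvline`.
The function `sketch_vertical_line` is a hypothetical stand-in, not the actual implementation of `utils.plot.add_vertical_line`, and the plotted values are made up.
It uses `ax.hist` in place of `sns.distplot` only so that the sketch depends on matplotlib alone.

```python
import matplotlib.pyplot as plt
import numpy as np

def sketch_vertical_line(value, label=None, ax=None):
    """Hypothetical stand-in: draw a dashed vertical line at x = value."""
    if ax is None:
        ax = plt.gca()  # fall back to the current axes
    line = ax.axvline(value, color="k", linestyle="--", label=label)
    if label is not None:
        ax.legend()  # refresh the legend so the new label appears
    return line

rng = np.random.default_rng(0)

# Made-up numbers standing in for bootstrapped prevalences.
toy_bootstrap_values = rng.normal(loc=0.05, scale=0.01, size=500)

fig, ax = plt.subplots()
ax.hist(toy_bootstrap_values, bins=30, label="Toy bootstrap values")
line = sketch_vertical_line(0.03, label="Toy comparison value", ax=ax)
```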

#### Advanced: Subplots

If you want to learn to make fancier figures that have subplots,
consider adapting the code below.

It produces a figure with 16 subplots, in a grid of 4 rows and 4 columns,
and then calls `distplot` on three bootstrap samples,
plotting them in the axis in the top-left corner,
the axis next to it, and the axis beneath it.

Creating a plot with subplots is not necessary for getting full credit on this lab.

```python
f, axs = plt.subplots(figsize=(16, 16), nrows=4, ncols=4)
axs = list(axs)  # a list of rows, each holding 4 axes
sns.distplot(enfj_bootstrap_values, ax=axs[0][0])  # top-left corner
sns.distplot(enfp_bootstrap_values, ax=axs[0][1])  # next to the top-left
sns.distplot(esfj_bootstrap_values, ax=axs[1][0])  # beneath the top-left
# and so on
```