<img src="../../shared/img/banner.svg"></img>

# Final Project Report - TITLE GOES HERE

#### REMOVE IN SUBMITTED VERSION

We suggest you use this template for making your final report.
Simply make a copy of this file and name it `report.ipynb`.
Remove any cells that say **REMOVE IN SUBMITTED VERSION** at the top,
like this one,
and add in `ok.auth` and `ok.submit` cells.

Make sure to keep the code cells below!

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import random

from client.api.notebook import Notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
import scipy
import seaborn as sns

In [None]:
import shared.src.utils.util as shared_util

#### REMOVE IN SUBMITTED VERSION

If you wish to define functions or classes outside of the notebook,
do so inside the file `utils/util.py`.
The cell below will import them, and they will be available inside `util`.

This is not strictly necessary,
so ignore this if you are not familiar with
defining your own Python modules.
It is provided only for convenience.

In [None]:
import utils.util as util

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
ok = Notebook("ok/config")

#### REMOVE IN SUBMITTED VERSION

The rubric below breaks the report down into five weighted components.

## Report Rubric

1. **Dataset background described** (5%)
  - Provide link or citation for dataset source
  - Explain how it was originally collected<br><br>
2. **Dataset cleaned and visualized** (15%)
  - Perform any necessary "munging" tasks
  - Visualize important variables<br><br>
3. **Question articulated** (20%)
  - Describe a question you intend to use the data to answer
  - Make clear the connection between the data and the question<br><br>
4. **Model developed** (25%)
  - Define a pyMC `Model` for the data in order to support using
  Bayesian inference to answer the question you articulated
  - Explain your modeling choices: Where did the likelihood come from?
  What does your prior imply about your beliefs, and why did you choose that prior?
  Which variables are hidden, which are observed,
  and how do those choices relate to your question?
  - If you based your model on one from the class or elsewhere,
  make sure to mention that here.<br><br>
5. **Findings reported** (35%)
  - Use `sample`, `sample_posterior_predictive`, and/or `plot_posterior`,
  plus any visualizations you design yourself,
  as needed to examine the posterior of your model and draw conclusions.
  - Make sure you report your findings in terms of uncertainty:
  don't just report _whether_ you believe the data supports a conclusion,
  but instead report _how plausible_ a conclusion is, given the output of your model.

The remainder of this report is broken down into these sections,
with additional information on formatting and expectations.

There is no obvious way to measure the "length" of a notebook
that is analogous to the page or word count of an essay.
As a guide, aim for about as much content as a lab,
and no more than twice that.
Write up to a few short paragraphs of text in each section,
along with code and plots as needed.

If anything is unclear, please post on Piazza.

# Dataset Background

#### REMOVE IN SUBMITTED VERSION

This section can be more-or-less copied from your proposal.
See the relevant section of the `proposal_template` for details
on what is expected here.
Now that you know exactly what questions you'll be asking of your data,
you might wish to slightly change the emphasis.

#### MAKE SURE THE DATASET YOU USE IS LESS THAN 1 MEGABYTE IN SIZE.

To reduce the strain on the okpy system,
we need to keep our project files small.
Do _not_ submit the project with a dataset of size greater than 1 megabyte.
The `ok.grade` cell will check this for you.

In [None]:
df = pd.read_csv("data/data.csv", index_col=0)

In [None]:
ok.grade("dataset_size")
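
If you'd like to check the file size yourself before running the grader, the sketch below is one way to do it. It assumes that "1 megabyte" means 10**6 bytes and that your data lives at `data/data.csv`; the helper name `dataset_is_small_enough` is made up for this example, and the `ok.grade` check remains the authoritative one.

```python
import os

# Assumption: "1 megabyte" is taken to mean 10**6 bytes.
# The grader may use a slightly different cutoff.
SIZE_LIMIT_BYTES = 1_000_000


def dataset_is_small_enough(path, limit_bytes=SIZE_LIMIT_BYTES):
    """Return True if the file at `path` is strictly smaller than `limit_bytes`."""
    return os.path.getsize(path) < limit_bytes


data_path = "data/data.csv"  # same path as in the read_csv cell
if os.path.exists(data_path):
    size = os.path.getsize(data_path)
    print(f"{data_path}: {size} bytes, ok = {dataset_is_small_enough(data_path)}")
```

If the file is too large, consider keeping only the rows and columns you actually need for your question.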

# Dataset Cleaning and Visualization

#### REMOVE IN SUBMITTED VERSION

In this section, remove missing values, apply necessary transformations,
and visualize the key variables in the dataset.
Lab 01 will be helpful here.

For your visualizations, you will likely want some combination of
`boxplot`s, `distplot`s, `pairplot`s, and `jointplot`s.
Make sure to visualize _the relationships between variables_
whenever they matter, and not just each variable on its own.
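
As a rough illustration of the cleaning step, the sketch below drops rows with missing values and log-transforms selected columns. The function `clean`, its `log_columns` parameter, and the toy columns `dose` and `score` are all hypothetical; your own dataset will likely need different transformations.

```python
import numpy as np
import pandas as pd


def clean(df, log_columns=()):
    """Drop rows with missing values and log-transform the named columns.

    `log_columns` is a hypothetical parameter for this sketch; choose the
    transformations that suit your own variables.
    """
    cleaned = df.dropna().copy()  # copy so the original DataFrame is untouched
    for column in log_columns:
        cleaned[column] = np.log(cleaned[column])
    return cleaned


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {"dose": [1.0, 10.0, None, 100.0], "score": [0.2, 0.5, 0.7, None]}
)
toy_clean = clean(toy, log_columns=["dose"])
print(toy_clean)
```

With a cleaned frame in hand, a call such as `sns.pairplot(toy_clean)` is one way to look at the relationships between variables rather than each variable alone.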

# Question

#### REMOVE IN SUBMITTED VERSION

In this section, describe the question you'd like to ask of your data.


For example,
it might be "when did sentiment about Donald Trump in hip-hop turn negative", as in Lecture 06,
or "how does the score of participants on a task differ as both task difficulty and their attentional state change", as in the `attention` dataset used at several points in the class.

Make sure the connection between the dataset you chose and the question you are posing is clear.

# Model

#### REMOVE IN SUBMITTED VERSION

In this section, describe and define, in Python, the model you will use to answer your question with your dataset.

Make sure each of the choices you make in defining your model is explained. Which variables are observed? Which variables/parameters are unknown? What distributions did you choose for your prior, and what do they imply about what you believe or assume about the unknown variables? What distribution did you choose for your likelihood and why (continuous vs. discrete, robust to outliers vs. not)? How do the question you are asking and the structure of your data drive your modeling choices?

If you're struggling to come up with the right distribution for your prior or likelihood
based on what we've covered in class, check out
["The Distribution Zoo"](https://ben18785.shinyapps.io/distribution-zoo/),
an interactive tool for visualizing common distributions for Bayesian modeling.
It also includes some useful facts about each distribution.
You'll need to find the equivalent distribution in pyMC by checking
[the documentation](https://docs.pymc.io/api/distributions.html).

If you have time, you might wish to define two or more models and compare and contrast their structure.
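
One way to see what a prior implies is to simulate data from it before looking at any observations. The sketch below does this in plain NumPy rather than pyMC, for a made-up model with a Normal likelihood; the priors `Normal(0, 10)` on `mu` and `HalfNormal(5)` on `sigma` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 5000

# Hypothetical priors for a Normal likelihood, chosen only for illustration.
mu = rng.normal(loc=0.0, scale=10.0, size=n_draws)  # mu ~ Normal(0, 10)
# The absolute value of a Normal(0, 5) draw is a HalfNormal(5) draw.
sigma = np.abs(rng.normal(loc=0.0, scale=5.0, size=n_draws))

# One simulated observation per prior draw: y ~ Normal(mu, sigma).
y_sim = rng.normal(loc=mu, scale=sigma)

# Central 95% range of the data these priors consider plausible.
low, high = np.percentile(y_sim, [2.5, 97.5])
print(f"95% of simulated observations fall between {low:.1f} and {high:.1f}")
```

If that range is far wider or narrower than the values you could plausibly observe, that is a hint to revisit the prior. Once your model is defined in pyMC, `pm.sample_prior_predictive` can play a similar role.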

# Findings

#### REMOVE IN SUBMITTED VERSION

In this section, draw samples from your model and then use them, along with our tools of posterior inference,
to answer your question.

For example, you might visualize the posteriors relative to a reference value with `plot_posterior`,
check what future data you might observe with `sample_posterior_predictive`,
or compute boolean functions on the posterior samples to determine
the probability that a claim is true under your posterior.

If you defined two different models above, you should compare and contrast their findings.
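
The idea of computing boolean functions on posterior samples can be sketched as follows. The samples here are random stand-ins for a hypothetical parameter called `effect`, not the output of any real model; in your report they would come from your trace.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior samples of a hypothetical parameter `effect`.
# In a real report these would come from your trace, not a random generator.
effect_samples = rng.normal(loc=0.3, scale=0.2, size=4000)

# A boolean function of the samples: is the effect positive in each draw?
claim_is_true = effect_samples > 0

# The mean of the booleans is the fraction of draws in which the claim holds,
# which estimates the posterior probability of the claim.
prob_positive = claim_is_true.mean()
print(f"P(effect > 0 | data) is approximately {prob_positive:.2f}")
```

Assuming your model has a variable named `effect`, `trace["effect"]` would take the place of `effect_samples`. Reporting the resulting probability, rather than a yes-or-no verdict, is what the rubric means by reporting findings in terms of uncertainty.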

# Presentation Link

#### REMOVE IN SUBMITTED VERSION

In this section, include a link to the Google Slides presentation that you gave in class,
as below:

> See slides at [this link](https://docs.google.com/presentation/d/1NLLl-hKY2bq0RFUITeF-4nb-RErKzcq4C6XJ_4YZBIE/edit?usp=sharing).

Check out [these instructions](https://support.google.com/docs/answer/2494822)
for how to make a Google Drive document shareable by link.
You'll want to look for a section called "Share a link to the file" or something similar.

The presentation should cover the same material as the report,
but at a higher level.
Introduce the dataset and the question you'd like to answer with it,
visualize the relevant variables,
describe your model, and then report your findings.
The presentation should be no longer than five minutes.