# Section 3: Homework Exercises

This section provides hands-on practice with the methods covered in the third day's material.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as st
import pymc as pm
import pytensor.tensor as at
import arviz as az
import os

## Exercise: Effects of coaching on SAT scores

This example was taken from Gelman *et al.* (2013):

> A study was performed for the Educational Testing Service to analyze the effects of special coaching programs on test scores. Separate randomized experiments were performed to estimate the effects of coaching programs for the SAT-V (Scholastic Aptitude Test-Verbal) in each of eight high schools. The outcome variable in each study was the score on a special administration of the SAT-V, a standardized multiple choice test administered by the Educational Testing Service and used to help colleges make admissions decisions; the scores can vary between 200 and 800, with mean about 500 and standard deviation about 100. The SAT examinations are designed to be resistant to short-term efforts directed specifically toward improving performance on the test; instead they are designed to reflect knowledge acquired and abilities developed over many years of education. Nevertheless, each of the eight schools in this study considered its short-term coaching program to be successful at increasing SAT scores. Also, there was no prior reason to believe that any of the eight programs was more effective than any other or that some were more similar in effect to each other than to any other.

You are given the estimated coaching effects (`d`) and their standard errors (`s`). The estimates were obtained from independent experiments with relatively large sample sizes (over thirty students in each school), so you can assume that they have approximately normal sampling distributions with known variances.

Here are the data:

In [None]:
J = 8
d = np.array([28.,  8., -3.,  7., -1.,  1., 18., 12.])
s = np.array([15., 10., 16., 11.,  9., 11., 10., 18.])
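
As a quick sanity check before building a hierarchical model, the "approximately normal with known variance" assumption can be turned into simple arithmetic. The sketch below assumes the values in `s` are standard errors (standard deviations of the estimates) on the SAT-score scale, as in Gelman *et al.*; it computes a z-score for each school and the complete-pooling (precision-weighted) estimate. The names `z`, `w`, `pooled_mean`, and `pooled_se` are introduced here for illustration only.

```python
import numpy as np

# Data copied from the cell above; `s` is treated as standard errors (assumption).
d = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
s = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

# Per-school z-scores: how many standard errors each estimate lies from zero.
z = d / s

# Complete pooling: precision-weighted mean of the eight estimates,
# i.e. the estimate obtained if all schools shared one common effect.
w = 1.0 / s**2
pooled_mean = np.sum(w * d) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

print("z-scores:", np.round(z, 2))
print(f"pooled estimate: {pooled_mean:.2f} +/- {pooled_se:.2f}")
```

Taken one at a time, none of the schools has an estimate more than two standard errors from zero, while the pooled estimate is roughly 7.7 with a standard error of roughly 4.1. The hierarchical model in this exercise sits between these two extremes (no pooling and complete pooling).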

Construct an appropriate model for estimating whether coaching effects are positive, using a **centered parameterization**, and then compare the diagnostics for this model to those from a **non-centered parameterization**.

Finally, perform goodness-of-fit diagnostics on the better model.
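
Before writing the PyMC models, it may help to see what the two parameterizations mean by simulating from both with plain NumPy. This is only a sketch: the priors on `mu` and `tau` below are placeholder assumptions, not the intended answer, and the variable names are hypothetical. Both constructions describe the same distribution for the school effects; they differ only in which quantities a sampler would explore.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, J = 100_000, 8

# Placeholder hyperpriors (assumptions for illustration only).
mu = rng.normal(0.0, 10.0, size=n_draws)            # population mean effect
tau = np.abs(rng.normal(0.0, 10.0, size=n_draws))   # half-normal population scale

# Centered: school effects are drawn directly around mu with scale tau.
theta_centered = rng.normal(mu[:, None], tau[:, None], size=(n_draws, J))

# Non-centered: draw standard-normal offsets, then shift by mu and scale by tau.
offsets = rng.normal(0.0, 1.0, size=(n_draws, J))
theta_noncentered = mu[:, None] + tau[:, None] * offsets

# The two sets of draws should agree in distribution.
print(theta_centered.mean(), theta_noncentered.mean())
print(theta_centered.std(), theta_noncentered.std())
```

In the non-centered form the sampled quantities (`offsets`) are independent of `tau` a priori, which is why it often samples with fewer divergences when the data are weakly informative about the group-level scale.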

In [None]:
# Write your answer here

## Exercise: Car Price Prediction

We will use a small subset of [this Kaggle dataset](https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho). This data is amenable to both multiple linear regression and hierarchical regression, so you can start with the former and move on to the latter. The data contains the selling prices of used cars, along with other features such as year of production, kilometers driven, fuel type, and car brand and model.

The full dataset has a lot of information, but we chose to crop it down to a simpler form that is easier for you to work with. Our reduced dataset has the following features:

- Selling price in thousands of dollars
- Natural log of the kilometers driven
- Year of production
- Number of seats
- Brand name
- Car name
- Full name of the car (with special edition modifiers)
- Kilometers per liter of fuel
- Engine CC

The goal of this exercise is to **predict the selling price based on the rest of the features using linear regressions**. We suggest the following steps:

1. Do an exploratory data analysis. Visualize the distribution of selling prices and how it relates to the other features. Look at how the distribution changes across brands. Try to reason about whether it's appropriate to transform the observations in some way to elicit a linear relationship with some features in the dataset.
2. Assuming the observations are normally distributed, write a series of linear regression models to fit the dataset:
    a. Try a model with a single intercept parameter and do prior predictive analysis to determine appropriate priors.
    b. Add linear terms that relate the model with the numerical features in the data. Try to reason about adequate priors for the slopes, and whether it helps to standardize the features or not.
    c. Run inference on both models and explain what happens to the observational noise parameter.
3. Add a hierarchical structure that depends on the car brand. Do prior predictive simulations to choose appropriate priors, and run inference. The hierarchical structure could be on the intercept term, the slopes or both.
4. (HARD) Add a nested hierarchy term to the model's intercept. The first level of the hierarchy is the brand, and the second level is the car name. Explore what happens if you use the centered or the non-centered parameterization in the second level of the hierarchy.
5. Compare all of the above models using `arviz.compare`. Which one is the highest ranking predictive model?
6. (VERY HARD, optional and without a coded answer) This is for people who want to explore beyond what we did with a simple linear regression. All of the models we used assumed normally distributed observations. Even the best-fitting model is unable to explain the observed distribution of prices, because prices are not normally distributed around a mean. Prices are chosen as multiples of a thousand, and they are usually grouped into tiers. It's not unusual to have multiple cars sold at exactly the same price. This leads to heteroskedastic observations across brands and car names, and the assumption of a single observational noise parameter breaks down. We don't have a concrete answer for which observational distribution should really underlie the observations. We invite you to think about the following:
    a. How could you have different observational noises while still assuming that the observations are normally distributed?
    b. What kind of observational distribution could you use to account for the groupings of multiple selling prices at the same value, and then some other selling prices that have more dispersion? As a hint, try to look into mixture models.
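
The prior predictive analysis in step 2a can be prototyped with NumPy alone before moving to PyMC. The sketch below is a minimal illustration for an intercept-only model with normal observations; the prior locations and scales are placeholder assumptions, not recommended values, and `n_obs` is an arbitrary number of simulated cars rather than the size of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, n_obs = 1_000, 50

# Placeholder priors for an intercept-only model (assumptions, to be tuned).
intercept = rng.normal(loc=10.0, scale=5.0, size=n_draws)
sigma = np.abs(rng.normal(loc=0.0, scale=5.0, size=n_draws))  # half-normal noise scale

# One simulated dataset of n_obs prices for each draw from the prior.
prior_pred = rng.normal(intercept[:, None], sigma[:, None], size=(n_draws, n_obs))

# Summaries to compare against what is plausible for used-car prices.
lo, hi = np.percentile(prior_pred, [2.5, 97.5])
frac_negative = np.mean(prior_pred < 0)
print(f"95% prior predictive interval: [{lo:.1f}, {hi:.1f}]")
print(f"fraction of negative simulated prices: {frac_negative:.2%}")
```

If the simulated interval is far wider or narrower than the observed range of prices, the priors probably need adjusting. A non-negligible fraction of negative simulated prices is also one hint that transforming the outcome (for example, taking its logarithm) may be worth considering in step 1.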

In [None]:
df = pd.read_csv(os.path.join("..", "data", "reduced_car_data.csv"), header=0)
df.head()

In [None]:
sns.pairplot(df, hue="brand");
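
For the regression and hierarchical steps, the numerical features are often standardized and the categorical columns are converted to integer indices. The sketch below shows one way to do this with NumPy on a few made-up rows; the arrays `brand`, `car`, and `year` are hypothetical stand-ins for columns of `df`, whose actual column names may differ.

```python
import numpy as np

# Made-up toy rows standing in for columns of `df` (hypothetical values).
brand = np.array(["Maruti", "Maruti", "Hyundai", "Hyundai", "Honda"])
car = np.array(["Maruti Swift", "Maruti Alto", "Hyundai i20", "Hyundai i20", "Honda City"])
year = np.array([2014.0, 2010.0, 2017.0, 2015.0, 2012.0])

# Standardize a numerical feature to zero mean and unit standard deviation.
year_std = (year - year.mean()) / year.std()

# Integer code per observation for each grouping level (labels sorted alphabetically).
brands, brand_idx = np.unique(brand, return_inverse=True)
cars, first_row, car_idx = np.unique(car, return_index=True, return_inverse=True)

# For a nested hierarchy: the brand code that each unique car belongs to.
car_to_brand = brand_idx[first_row]

print(brands, brand_idx)
print(cars, car_idx)
print(car_to_brand)
```

In a nested model, `car_to_brand` would let each car-level intercept be centered on its brand-level intercept, and `car_idx` would then map those car-level intercepts back to the individual observations.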

In [None]:
# Write your answer here