In [83]:
import warnings

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt

from scipy import stats
from scipy.special import expit as logistic
from math import exp
from scipy.special import softmax
from quap import quap


RANDOM_SEED = 1448
np.random.seed(RANDOM_SEED)

### Part 1. Theory

A company asks you to analyse some of their advertising and sales data to help them make marketing decisions. The company advertises its product in ads on YouTube videos. They would like to know whether longer-running ads are more effective than shorter ads at getting people to click through to their website.

The dataset they send you has the following variables:

    click (yes/no): a variable indicating whether or not an individual has clicked on the ad.

    length (5 seconds – 1:30): a variable indicating the length of the advertisement video

    genre (kids, gaming, tech, building): a variable indicating which of 4 genres of YouTube video the advertisement was placed on

    user_ID: Many users viewed several different ads. This variable is a unique identifier for each user.

    budget (10_000 - 2_000_000): the budget per minute of the ad

    dark_theme: This variable indicates whether the user has YouTube set to Dark theme or Light theme.



###### What should your outcome variable be for this analysis?


###### What family of model should you choose for this analysis, or in other words: what likelihood distribution should you assume for your outcome data, conditional on the independent variables?

###### Does this model need a link function? Which? Why?

###### What is the main predictor variable of interest? Is it a continuous or categorical variable? 

###### Is there any reason to believe the relationship between this variable and the main (link-transformed) parameter will not be linear? If so, how might you model this non-linear relationship?

###### For each of the other variables in the dataset, please state: 

a)	Are they categorical or continuous?

b)	Could omitting them bias your estimate, and why?

c)	If they are categorical, should they be modeled using fixed or random effects? Explain why.

###### Name a possible interaction effect between the main predictor variable and one of the other independent variables that could be interesting to model.

##### Write the model definition for the model you have just proposed, leaving out the priors for the parameters (i.e. write just the likelihood line, the linear component, and any hyper-priors for random effects).

### Part 2: Simulation


Simulate a dataset of 192 observations of 3 variables: 2 independent variables X1 and X2, and 1 dependent variable Y.

X1 is normally distributed, with mean 0 and standard deviation 1.


X2 is log-normally distributed, with mu = 0 and sigma = 1.

Y is normally distributed around a mean mu = -1 - 0.002*X1 + 0.57*X2. The residual standard deviation (sigma) of Y, conditional on X1 and X2, is 2.
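The specification above can be sketched directly in NumPy. This is one possible way to draw the data, not a required one: the variable names `x1`, `x2`, `mu` and `y` are illustrative, and the use of `np.random.default_rng` (seeded with the notebook's `RANDOM_SEED` value) rather than the global `np.random.seed` is an assumption.

```python
import numpy as np

# Local generator seeded with the notebook's RANDOM_SEED value (assumption:
# any fixed seed would do, this one is reused for consistency).
rng = np.random.default_rng(1448)
n = 192

# X1 ~ Normal(0, 1)
x1 = rng.normal(loc=0.0, scale=1.0, size=n)

# X2 ~ LogNormal(mu=0, sigma=1); mu and sigma are on the log scale
x2 = rng.lognormal(mean=0.0, sigma=1.0, size=n)

# Linear predictor and outcome with residual standard deviation 2
mu = -1.0 - 0.002 * x1 + 0.57 * x2
y = rng.normal(loc=mu, scale=2.0)
```

Note that the coefficient on X1 (-0.002) is tiny relative to the residual noise, so with 192 observations a fit should not be expected to pin it down precisely.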

Run a linear regression using quap and attempt to recover the simulated parameters.

Next, plot the predicted effect of X2 on Y, holding X1 at 0. Plot a 95% highest density interval around the predicted mean and a 95% prediction interval for the data.
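The difference between the two intervals can be sketched without the course's `quap` module, whose interface is not shown here. As a stand-in (an assumption, not the `quap` output), the block below uses a Gaussian approximation centred on the least-squares estimate with covariance sigma^2 (X'X)^-1, which is roughly what a quadratic approximation gives under flat priors. Sigma is held fixed at its point estimate for simplicity, and `hdi` is a hypothetical helper written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1448)

# Simulate the data as specified in this section.
n = 192
x1 = rng.normal(0.0, 1.0, size=n)
x2 = rng.lognormal(0.0, 1.0, size=n)
y = rng.normal(-1.0 - 0.002 * x1 + 0.57 * x2, 2.0)

# Stand-in for the quap posterior (assumption): Gaussian centred on the
# least-squares estimate with covariance sigma^2 (X'X)^-1.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / n)  # maximum-likelihood estimate
cov = sigma_hat**2 * np.linalg.inv(X.T @ X)
post = rng.multivariate_normal(beta_hat, cov, size=4000)  # columns: a, b1, b2


def hdi(samples, prob=0.95):
    """Narrowest interval holding `prob` of the samples, per column."""
    s = np.sort(samples, axis=0)
    k = int(np.floor(prob * s.shape[0]))
    widths = s[k:] - s[: s.shape[0] - k]  # width of every candidate interval
    start = widths.argmin(axis=0)
    cols = np.arange(s.shape[1])
    return s[start, cols], s[start + k, cols]


x2_grid = np.linspace(x2.min(), x2.max(), 50)

# X1 is held at 0, so the b1 term drops out of the predicted mean.
mu_samples = post[:, [0]] + post[:, [2]] * x2_grid  # shape (4000, 50)
mu_mean = mu_samples.mean(axis=0)
mu_lo, mu_hi = hdi(mu_samples)  # 95% HDI for the mean

# Prediction interval: add observation noise around each sampled mean.
y_samples = rng.normal(mu_samples, sigma_hat)
y_lo, y_hi = np.percentile(y_samples, [2.5, 97.5], axis=0)
```

The arrays can then be drawn with `plt.plot(x2_grid, mu_mean)` and two `plt.fill_between` calls, one for `(mu_lo, mu_hi)` and one for `(y_lo, y_hi)`. The prediction band is the wider of the two because it includes the residual noise as well as the uncertainty in the mean.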

### Part 3. Data


The chimpanzees.csv dataset is drawn from experiments designed to test whether chimpanzees choose to help out random other chimps at no cost to themselves. (See chapter 11/12 and the lectures for weeks 10 and 11 for a description of the experiment. In the main exam you will have a more detailed description of the dataset.)

In [2]:
d = pd.read_csv("Data/chimpanzees.csv")
d.head()


   actor  recipient  condition  block  trial  prosoc_left  chose_prosoc  pulled_left
0      1        NaN          0      1      2            0             1            0
1      1        NaN          0      1      4            0             0            1
2      1        NaN          0      1      6            1             0            0
3      1        NaN          0      1      8            0             1            0
4      1        NaN          0      1     10            1             1            1


###### Fit the following model in pymc3 using pm.sample:

    PulledLeft ~ Binomial(1, p)

    logit(p) = a[actor] + b1*ProsocialLeft + b2[ProsocialLeft]*OtherPresent

    a[actor] ~ Normal(0, 1.5)

    b1 ~ Normal(0, 1)

    b2[ProsocialLeft] ~ Normal(0, 0.5)
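The index notation in the linear predictor can be illustrated in plain NumPy before building the pymc3 model. The sketch below uses the five rows shown in the `d.head()` output. It assumes that the `condition` column encodes OtherPresent and that there are 7 actors, and the parameter values are arbitrary numbers chosen for illustration, not estimates.

```python
import numpy as np

# First five rows from the d.head() output above.
actor = np.array([1, 1, 1, 1, 1])        # actor IDs are 1-based in the data
prosoc_left = np.array([0, 0, 1, 0, 1])
condition = np.array([0, 0, 0, 0, 0])    # assumed to encode OtherPresent

# Index variables must be zero-based to index a parameter vector.
actor_idx = actor - 1

# Arbitrary illustrative parameter values, NOT estimates.
a = np.zeros(7)                # one intercept per actor (7 actors assumed)
b1 = 0.5                       # effect of the prosocial option being on the left
b2 = np.array([-0.2, 0.3])     # one OtherPresent slope per ProsocialLeft level

# logit(p) = a[actor] + b1*ProsocialLeft + b2[ProsocialLeft]*OtherPresent
logit_p = a[actor_idx] + b1 * prosoc_left + b2[prosoc_left] * condition

# Inverse logit maps the linear predictor back to a probability.
p = 1.0 / (1.0 + np.exp(-logit_p))
```

The same indexing expression should carry over to a pymc3 model, where `a` and `b2` would be declared as vector-valued random variables (for example with `shape=7` and `shape=2`) instead of fixed arrays.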
    
    


###### Present the b1 and b2 parameters in a forest plot and interpret their meaning.

###### Next, rerun the same model but with actor as a random effect rather than a fixed effect. 

###### Compare the fixed and random effects models using WAIC or leave-one-out cross-validation (LOO). Which model performs better?