# Analyzing Experimental Data

Import all necessary packages

In [65]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from scipy.optimize import curve_fit

Read in data

In [51]:
df = pd.read_csv("data.csv", sep="\t")

## 0. First look at the data

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13842 entries, 0 to 13841
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   prolific_id            13842 non-null  object 
 1   round                  13842 non-null  int64  
 2   cognitive_uncertainty  13842 non-null  int64  
 3   time_late              13842 non-null  float64
 4   present_value          13842 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 540.8+ KB


In [53]:
df.describe()

| | round | cognitive_uncertainty | time_late | present_value |
|---|---|---|---|---|
| count | 13842.0 | 13842.0 | 13842.0 | 13842.0 |
| mean | 9.5 | 21.659803 | 17.016219 | 58.681633 |
| std | 5.188315 | 22.282012 | 23.064179 | 31.301548 |
| min | 1.0 | 0.0 | 0.25 | 1.785714 |
| 25% | 5.0 | 5.0 | 4.0 | 37.5 |
| 50% | 9.5 | 15.0 | 6.0 | 60.41667 |
| 75% | 14.0 | 30.0 | 24.0 | 88.09524 |
| max | 18.0 | 100.0 | 84.0 | 100.0 |


The dataset comprises five columns: four numeric ones and one containing the Prolific IDs.

Since every column has a full non-null count, there appear to be no NaN values that need handling.
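The non-null counts can also be confirmed explicitly, and it is worth counting zeros separately because a zero is not a NaN. The snippet below is a minimal sketch on a small hypothetical stand-in frame (`toy` and its values are invented for illustration and are not taken from `data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the experiment data; all values are made up.
toy = pd.DataFrame({
    "prolific_id": ["a", "a", "b", "b"],
    "round": [1, 2, 1, 2],
    "cognitive_uncertainty": [0, 25, 0, 10],
    "time_late": [0.25, 12.0, 6.0, np.nan],
    "present_value": [95.0, 60.0, 80.0, 40.0],
})

# Explicit count of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Zeros are not NaN, so they need their own count if they might encode "no answer".
zero_counts = (toy.select_dtypes("number") == 0).sum()
print(zero_counts)
```

On the real data, the same two calls on `df` would show whether any column has gaps and how many zeros each numeric column holds.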

For further exploration, we load a small inline report.

In [54]:
profile = ProfileReport(df, title='Data Report EDA', minimal=True)
profile.to_notebook_iframe()



My thoughts:

* Looking at this report and the data itself, it becomes clear that every participant, identified by a unique (Prolific) ID, played 18 rounds of the game.
* Cognitive uncertainty contains a large number of 0s. One explanation could be that many participants didn't answer these questions and that 0 was used in place of a NaN value to encode this case. Still, because the overall distribution of cognitive uncertainty is also right-skewed (most of its mass sits at low values), this doesn't seem likely to me. While my first reading of this variable's description in the attached pdf was that high values of cognitive_uncertainty meant that people were sure of their answers...
> “How certain are you (in %) that your decision actually reflects how much the later payment is worth to you today?”
> 
> A participant’s response to this question (on a scale 0-100%) is captured in the variable “cognitive_uncertainty”.

* ... the distribution of answers leads me to believe that the encoding must actually be reversed (i.e., cognitive_uncertainty = 100% - participant's answer). This is also more in line with my later discussions. Finally, if such a large number of participants had indicated that they were not at all certain of their previous answer (cognitive_uncertainty = 0), we would have to doubt the whole experiment.
* Time_late encodes the time horizon. The data is as expected.
* Present_value represents the main variable of this experiment. At first sight, everything seems to be alright here as well.
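The observations in this list (18 rounds per participant, many zeros, mass at low uncertainty values) can be checked numerically rather than read off the report. The sketch below runs on synthetic data; the frame `toy`, the number of participants, and the distribution used to generate answers are all assumptions made for illustration. A positive skewness means the mass sits at low values with a long tail to the right.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in (NOT the experiment data): 50 hypothetical participants,
# 18 rounds each, answers in steps of 5 with most mass at low values.
n_participants, n_rounds = 50, 18
raw = rng.exponential(scale=20, size=n_participants * n_rounds)
toy = pd.DataFrame({
    "prolific_id": np.repeat([f"p{k}" for k in range(n_participants)], n_rounds),
    "round": np.tile(np.arange(1, n_rounds + 1), n_participants),
    "cognitive_uncertainty": np.clip(5 * np.round(raw / 5), 0, 100).astype(int),
})

# Did every participant play all rounds exactly once?
rounds_per_participant = toy.groupby("prolific_id")["round"].nunique()
all_played_18 = bool((rounds_per_participant == n_rounds).all())

# Share of zeros and direction of the skew.
zero_share = (toy["cognitive_uncertainty"] == 0).mean()
skewness = toy["cognitive_uncertainty"].skew()

print(f"all participants played {n_rounds} rounds: {all_played_18}")
print(f"share of zeros: {zero_share:.1%}, skewness: {skewness:.2f}")
```

Replacing `toy` with `df` would apply the same checks to the real data.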

## 1. Aggregate Discount Function

In [55]:
disc_func_df = pd.pivot_table(df, index="time_late", values=["present_value"], aggfunc="mean").reset_index()

fig = go.Figure(go.Scatter(x=disc_func_df["time_late"], y=disc_func_df["present_value"], mode="markers"))
fig.update_layout(title="Aggregate Discount Function", xaxis_title="Time Horizon (months)", yaxis_title="Present Value ($)")
fig.show()

This plot looks much like what standard theories of temporal preferences predict. As every econ student learns at university, participants value money less the further in the future they stand to receive it. The convex shape of the curve indicates a non-linear devaluation: money received in 2 months is perceived as worth less than money received in 1 month, but the decline is steepest for short horizons and flattens as the delay grows. Visual inspection suggests that the averages scatter more for larger time horizons and less for small ones (though a more rigorous evaluation might indicate something else). This could be because participants are generally worse at estimating their own preferences over longer horizons.

This fits well with a very simple financial model of discounting, the Net Present Value (NPV) calculation, so I fit the aggregate data with this formula as one of the simplest possible models of temporal discounting.

In [56]:
def std_discount_function(t: float, i: float, x: float) -> float: 
    """ A minimal NPV formula; t represents the time delay in months, i the discount rate and x the initial amount; inputs are ordered to enable easy fitting"""
    return x/(1 + i)**t

pars, cov = curve_fit(std_discount_function, disc_func_df["time_late"], disc_func_df["present_value"], p0=[0.1, 100])
print(f"parameters: {pars}")
print(f"covariance of parameters: {cov}")

parameters: [1.09841897e-02 6.79787775e+01]
covariance of parameters: [[2.66453153e-06 2.44926686e-03]
 [2.44926686e-03 6.11688346e+00]]


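The fit looks reasonable: the standard errors (square roots of the covariance diagonal) are about 0.0016 for the monthly discount rate i (estimate 0.0110) and about 2.47 for the initial amount x (estimate 67.98).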
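How well the parameters are pinned down can be read from the covariance matrix: the square roots of its diagonal are the standard errors of the estimates. A short sketch using the numbers printed above (parameter order `i`, `x`, as in `std_discount_function`):

```python
import numpy as np

# Values copied from the curve_fit output above (order: i, x).
pars = np.array([1.09841897e-02, 6.79787775e+01])
cov = np.array([[2.66453153e-06, 2.44926686e-03],
                [2.44926686e-03, 6.11688346e+00]])

std_err = np.sqrt(np.diag(cov))                 # one standard error per parameter
rel_err = std_err / pars                        # error relative to the estimate
corr = cov[0, 1] / (std_err[0] * std_err[1])    # correlation between the estimates

for name, p, se, rel in zip(["i", "x"], pars, std_err, rel_err):
    print(f"{name} = {p:.4f} +/- {se:.4f} ({rel:.1%} relative)")
print(f"correlation between i and x: {corr:.2f}")
```

Note that these errors describe the fit to the 18 or so aggregated points only; they say nothing about the spread of individual answers around each mean.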

In [57]:
fig = go.Figure(go.Scatter(x=disc_func_df["time_late"], y=disc_func_df["present_value"], mode="markers", name="Experimental data"))

x = np.linspace(1, 81)
y = std_discount_function(x, pars[0], pars[1])  # the function is vectorized over NumPy arrays
fig.add_trace(go.Scatter(x=x, y=y, name="Standard financial discounting fit"))

fig.update_layout(title="Aggregate Discount Function", xaxis_title="Time Horizon (months)", yaxis_title="Present Value ($)")
fig.show()

Although simple, the NPV model seems to be a reasonable fit for the observed data, offering quick insights into how participants might respond to other time horizons between 1 week and 84 months. Overall, the observed preferences are plausible and can be used for further analysis and evaluation of new models.
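To illustrate what the fitted model implies for horizons in that range, the sketch below evaluates the fitted curve at a few delays and converts the monthly rate into two derived quantities. The annualized rate and the half-life are my own arithmetic on the fitted parameters, not figures reported elsewhere in this notebook, and the chosen horizons are arbitrary examples.

```python
import math

# Fitted parameters copied from the curve_fit output (monthly rate i, amount x).
i_month, x0 = 1.09841897e-02, 6.79787775e+01

def std_discount_function(t, i, x):
    """NPV formula: t is the delay in months, i the monthly rate, x the amount at t = 0."""
    return x / (1 + i) ** t

# Compounding the monthly rate over 12 months gives the implied annual rate.
annual_rate = (1 + i_month) ** 12 - 1

# Delay after which the fitted present value has halved.
half_life_months = math.log(2) / math.log(1 + i_month)

for months in (0.25, 12, 36, 84):
    value = std_discount_function(months, i_month, x0)
    print(f"{months:>5} months -> fitted present value {value:.2f}")
print(f"implied annual rate: {annual_rate:.1%}")
print(f"half-life: {half_life_months:.1f} months")
```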

## 2. Intertemporal choice and cognitive uncertainty

For a first idea of how present value might vary as a function of cognitive uncertainty, I look at a simple scatter plot of the two. Because present values are close to continuous while cognitive uncertainty seems to be discrete in 5% intervals, the figure is quite hard to read and gives little sense of the overall structure.

In [58]:
fig = px.scatter(df, x="cognitive_uncertainty", y="present_value")
fig.update_layout(height=1500*0.5, width=1750*0.5)

fig.update_layout(showlegend=False, xaxis_title="Cognitive Uncertainty (%)", yaxis_title="Present Value ($)", title="Basic Scatter Chart")
fig.show()

In [59]:
fig = px.scatter(df, x="cognitive_uncertainty", y="present_value", color="round")
# fig.update_layout(height=1500*0.5, width=1750*0.5)

fig.update_layout(showlegend=False, xaxis_title="Cognitive Uncertainty (%)", yaxis_title="Present Value ($)", title="Distribution of Rounds")
fig.show()

From the chart above and the data, it seems that the rounds in which each time horizon was tested were randomized. To get an idea of whether the round might have an effect on the relation between cognitive uncertainty and present values, I look at the rounds in isolation below. This also means that there is less data per chart, which makes it easier to spot relations. Lastly, I let the graphing library add plain OLS regression lines for each round. These trendlines only take cognitive uncertainty and present values into account, without any controls.

In [60]:
fig = px.scatter(df, x="cognitive_uncertainty", y="present_value", facet_col="round", trendline="ols", facet_col_wrap=4, facet_row_spacing=0.02, facet_col_spacing=0.02, category_orders={"round": np.sort(df["round"].unique()).tolist()})
fig.update_layout(height=1500, width=1750, title="Fitted Model (by round number)")
fig.show()

Although slightly different, the trends seem to be similar for every round.

I explained while looking at the data why I believe that a cognitive uncertainty of 0 indicates that participants are relatively sure about their indicated present_values. This seems like a good assumption here as well. These first charts indicate that low cognitive uncertainty corresponds with high present values and high cognitive uncertainty with low present values.

To see whether time horizons have an impact on the relationship of cognitive uncertainty and the present value, I plot a similar figure as above, this time looking at each time horizon individually.

In [61]:
fig = px.scatter(df, x="cognitive_uncertainty", y="present_value", facet_col="time_late", trendline="ols", facet_col_wrap=4, facet_row_spacing=0.02, facet_col_spacing=0.02, category_orders={"time_late": np.sort(df["time_late"].unique()).tolist()})
fig.update_layout(height=1500, width=1750, title="Fitted Model (by time horizon)")
fig.show()

As before, this way of looking at the scatter charts for every individual time horizon allows us to get a better look at the individual data points. For small time horizons, the relationship between cognitive uncertainty and present value seems similar to before. High cognitive uncertainty seems connected to low present values. This time however, we can observe a gradual reversal of this relationship for ever longer time horizons. 
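The gradual change in slope can be tabulated instead of being read off the facets: fit a separate straight line per time horizon and compare the slopes. The sketch below does this on synthetic data in which the slope is constructed to rise with the horizon; the frame `toy`, the horizons, and the coefficients are assumptions for illustration, not properties of the experiment data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in (NOT the experiment data): the slope of present value on
# cognitive uncertainty is built to increase with the time horizon.
rows = []
for t in (1.0, 6.0, 24.0, 60.0):
    cu = 5 * rng.integers(0, 21, size=300)            # answers in 5% steps, 0..100
    true_slope = -0.3 + 0.01 * t                      # negative early, positive late
    pv = 70 - 0.4 * t + true_slope * cu + rng.normal(0, 5, size=300)
    rows.append(pd.DataFrame({"time_late": t, "cognitive_uncertainty": cu, "present_value": pv}))
toy = pd.concat(rows, ignore_index=True)

def ols_slope(group: pd.DataFrame) -> float:
    """Slope of a degree-1 least-squares fit of present value on cognitive uncertainty."""
    slope, _intercept = np.polyfit(
        group["cognitive_uncertainty"].to_numpy(), group["present_value"].to_numpy(), deg=1
    )
    return slope

slopes = toy.groupby("time_late")[["cognitive_uncertainty", "present_value"]].apply(ols_slope)
print(slopes)
```

Applied to `df`, the resulting series would show at which horizon the sign of the slope flips, if it does.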

Starting with 23 months, it seems like high cognitive uncertainty is connected to high present values. Intuitively, this reversal makes sense. The explanatory pdf describes that participants were asked how certain they felt about their judgements after they already made each judgement. This means that the participants were prompted to rethink their decision carefully without being able to change it. 

So, it is likely that those who indicated low present values for small time horizons began to doubt their decision when asked to reflect on it and tried to adjust for their initial low answers by answering they were relatively uncertain about their previous answer. Conversely, those who gave high present values for long time horizons likely tried to compensate for this relatively high number by citing high cognitive uncertainty. This theory would explain the observed pattern nicely.

It would be interesting to see whether the same participants understated their present value for small time horizons and overstated their present value for large time horizons or if these were distinct groups.

Now that we have seen that the time horizon, and possibly the round, affects the relationship between cognitive uncertainty and present values, we can build a simple regression model. I chose to run OLS with present value as the endogenous (dependent) variable and cognitive uncertainty as the main exogenous variable. I added a constant, plus time_late (the time horizon) and round as controls, given our previous observations.

In [62]:
y = df["present_value"]
X = df[["cognitive_uncertainty", "time_late", "round"]]
X = sm.add_constant(X)
res = sm.OLS(y, X).fit()
res.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | present_value | R-squared: | 0.146 |
| Model: | OLS | Adj. R-squared: | 0.146 |
| Method: | Least Squares | F-statistic: | 791.7 |
| Date: | Tue, 04 May 2021 | Prob (F-statistic): | 0.0 |
| Time: | 12:04:24 | Log-Likelihood: | -66211.0 |
| No. Observations: | 13842 | AIC: | 132400.0 |
| Df Residuals: | 13838 | BIC: | 132500.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 66.3058 | 0.599 | 110.719 | 0.000 | 65.132 | 67.480 |
| cognitive_uncertainty | -0.2043 | 0.011 | -18.415 | 0.000 | -0.226 | -0.183 |
| time_late | -0.4475 | 0.011 | -41.802 | 0.000 | -0.468 | -0.427 |
| round | 0.4649 | 0.047 | 9.798 | 0.000 | 0.372 | 0.558 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 580.47 | Durbin-Watson: | 0.867 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 420.456 |
| Skew: | -0.326 | Prob(JB): | 5e-92 |
| Kurtosis: | 2.45 | Cond. No. | 90.3 |


This run indicates that all of the chosen variables are statistically significant (p-values below 0.001). Meanwhile, the R-squared of 0.146 is relatively small, so the model explains only about 15% of the variance in present values. **Discuss Participant Dummies here**.

About the individual variables:

* The constant is relatively big and, with a value of 66.31, close to the fitted initial amount x (67.98) of the NPV model in Section 1.
* time_late, the time horizon, has a sizeable negative coefficient of -0.4475, as expected. Clearly, the longer participants have to wait, the smaller their indicated present value.
* round has a significant positive coefficient of 0.4649. This effect did not show up in our previous charts and is unexpected to me. After all, the order of time horizons was randomized. This might indicate that over the many questions (18 rounds) participants became used to thinking about risky (delayed) payments and discounted less as the rounds went on. It could also be that they simply became bored and deviated less and less from the initial $100. Maybe we could learn something about this by looking at their cognitive uncertainty by round: if they became bored or lazy, it is likely that 0% cognitive uncertainty (assumed to be the default) would become more frequent.
* cognitive_uncertainty has a coefficient of -0.2043. The fact that it is negative and small is in line with our previous observations. People seem to be more comfortable with assigning high present values than with assigning small present values. Assuming that all the effects concerning the time horizon and the order of rounds discussed before have been appropriately captured in the controls, this could be a sign of loss aversion. A low present value could be read as a loss compared to the initial amount. Participants feel bad encountering this loss and subconsciously lower their certainty (i.e., report higher cognitive uncertainty) about these low values.
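To put the coefficient sizes from the list above in perspective, it helps to multiply each one by a typical change in its variable. The sketch below uses the coefficients from the OLS summary; the chosen changes (10 percentage points, 12 months, first to last round) are arbitrary illustrations.

```python
# Coefficients copied from the OLS summary above.
coef = {
    "const": 66.3058,
    "cognitive_uncertainty": -0.2043,
    "time_late": -0.4475,
    "round": 0.4649,
}

# Change in predicted present value for an illustrative change in each regressor,
# holding the other regressors fixed.
effect_10_points_cu = 10 * coef["cognitive_uncertainty"]
effect_12_months = 12 * coef["time_late"]
effect_round_1_to_18 = (18 - 1) * coef["round"]

print(f"+10 points cognitive uncertainty: {effect_10_points_cu:+.2f}")
print(f"+12 months delay:                 {effect_12_months:+.2f}")
print(f"round 1 -> round 18:              {effect_round_1_to_18:+.2f}")
```

Under this linear model, going from the first to the last round shifts the predicted present value by more than a full year of additional delay does, in the opposite direction.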

In [64]:
fig = px.scatter(df, x="cognitive_uncertainty", y="present_value", marginal_x="violin", marginal_y="violin")
fig.update_layout(height=1500*0.5, width=1750*0.5)

fitted_x = np.linspace(0,100, 101)
fitted_y = res.params["const"] + res.params["cognitive_uncertainty"] * fitted_x  # label-based access; time_late and round are held at 0
fig.add_trace(go.Scatter(x=fitted_x, y=fitted_y, mode="lines", line=dict(color="red", width=2)))

fig.update_layout(showlegend=False, xaxis_title="Cognitive Uncertainty (%)", yaxis_title="Present Value ($)", title="Fitted Model")
fig.show()

The above figure shows a last scatter chart of the relationship between cognitive uncertainty and present value. Again it is very crowded, which is why I added violin plots at the margins to make the distribution of points clearer (read violin plots as "thick" where there are many points and "thin" where there are few). Additionally, I plot the regression line derived with the simple OLS model above, with cognitive uncertainty as the independent variable and the controls left out (which amounts to setting time_late and round to zero). Naturally, this line is not a completely accurate representation of the 3-dimensional (plus constant) model, but assuming there are no strong interactions between the variables, this should not lead to big errors.
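A line built from the constant and the cognitive-uncertainty coefficient alone corresponds to time_late = 0 and round = 0. One alternative convention, shown here only as a sketch and not as the method used for the figure, is to hold the controls at their sample means, which shifts the intercept but leaves the slope unchanged. The coefficients and means below are copied from the OLS summary and the `df.describe()` output.

```python
import numpy as np

# Coefficients from the OLS summary and sample means from df.describe().
coef = {"const": 66.3058, "cognitive_uncertainty": -0.2043, "time_late": -0.4475, "round": 0.4649}
means = {"cognitive_uncertainty": 21.659803, "time_late": 17.016219, "round": 9.5,
         "present_value": 58.681633}

# Intercept of the partial line when the controls sit at their means.
intercept_at_mean_controls = (
    coef["const"]
    + coef["time_late"] * means["time_late"]
    + coef["round"] * means["round"]
)

fitted_x = np.linspace(0, 100, 101)
fitted_y = intercept_at_mean_controls + coef["cognitive_uncertainty"] * fitted_x

# Sanity check: an OLS fit with a constant passes through the point of means.
pv_at_mean_cu = intercept_at_mean_controls + coef["cognitive_uncertainty"] * means["cognitive_uncertainty"]
print(f"intercept at mean controls: {intercept_at_mean_controls:.2f}")
print(f"prediction at mean uncertainty: {pv_at_mean_cu:.2f} (sample mean: {means['present_value']:.2f})")
```

The adjusted line sits a few dollars below the one starting at the raw constant, because the average delay lowers the prediction by more than the average round raises it.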

This graph seems to confirm our analysis throughout this notebook.

If I had more time I would like to do at least two more things:

* Test the data for interactions and improve the model this way. The figure showing the relationship between cognitive uncertainty and present value for every single time horizon indicated a reversal of the coefficient for large time horizons. It seems to me that we lose some explanatory power by simply ignoring this dynamic, and the best explanation I could think of for it is an interaction.
* Find a way to include the ProlificID as an explanatory variable in the model. I believe that people have rather different preferences, expressed both in their present value and their cognitive uncertainty. Given that the data don't offer any demographic information (age, nationality, sex, ...), which is often used to test for such individual preferences, including the ProlificIDs could potentially improve our fit substantially. Unfortunately, all quick attempts at dummifying the column and including all dummy variables led (as expected) to collinearity issues, and I could not come up with a principled way of dropping some of those columns to prevent these issues.
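Both ideas can be sketched together. A full set of participant dummies plus a constant is perfectly collinear because the dummies sum to the constant column. Common remedies are to drop one reference dummy (for example `pd.get_dummies(..., drop_first=True)`) or, equivalently for the slope estimates, to subtract each participant's mean from every variable (the within transformation). The sketch below uses the second route together with an interaction term on synthetic data; the frame `toy`, the column `cu_x_time`, and all coefficients in `true` are assumptions for illustration, not results from the experiment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic panel (NOT the experiment data): every hypothetical participant has
# an individual baseline, and the uncertainty slope depends on the horizon.
n_participants, n_rounds = 200, 18
ids = np.repeat(np.arange(n_participants), n_rounds)
baseline = rng.normal(60, 15, n_participants)[ids]
time_late = rng.choice([0.25, 1, 3, 6, 12, 24, 48, 84], size=ids.size)
cu = 5 * rng.integers(0, 21, size=ids.size)
true = {"cognitive_uncertainty": -0.30, "time_late": -0.40, "cu_x_time": 0.01}
pv = (baseline
      + true["cognitive_uncertainty"] * cu
      + true["time_late"] * time_late
      + true["cu_x_time"] * cu * time_late
      + rng.normal(0, 5, ids.size))

toy = pd.DataFrame({"prolific_id": ids, "cognitive_uncertainty": cu,
                    "time_late": time_late, "present_value": pv})
toy["cu_x_time"] = toy["cognitive_uncertainty"] * toy["time_late"]  # interaction term

# Within transformation: remove each participant's mean from every column.
# This absorbs the participant-specific intercepts without any dummy columns.
cols = ["present_value", "cognitive_uncertainty", "time_late", "cu_x_time"]
demeaned = toy[cols] - toy.groupby("prolific_id")[cols].transform("mean")

regressors = ["cognitive_uncertainty", "time_late", "cu_x_time"]
X = demeaned[regressors].to_numpy()
y = demeaned["present_value"].to_numpy()
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
estimates = dict(zip(regressors, beta))
print(estimates)
```

This sketch only produces point estimates; standard errors would need a degrees-of-freedom correction for the absorbed participant means, which a dedicated panel or formula-based estimator would handle.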