This notebook describes the steps for the exile and online opposition example (second application). Data is from Esberg and Siegel 2023. Prepared by Annamaria Prati and Yehu Chen.

# Setup

Let's set ourselves up. Load the required libraries. Like before, note that `torch` should be version 2.6.0 and `gpytorch` should be version 1.8.1. Set the seed and the default data type to be `float64`. 

Given the size of the dataset, in this example we will implement a sparse GP model as an approximation for the full GP model (since GPs scale poorly). We will do this with inducing points and mini batches. Inducing points act as a compressed representation of the data, allowing for faster training and inference. More inducing points mean it will be closer to the full GP and therefore a better approximation, but it will be slower -- practitioners might want a lower number while they refine the model and increase the number of inducing points as they reach finalized production stages. Mini batches means that instead of using hte whole dataset at once, we divide it into smaller batches. Specifying the batch size controls how many training samples are processed at once during each optimization step. Using 256 is faster than using all the data at once, but is less noisy than very small batches might be. The number of empochs is the number of times the optimizer might go through the entire dataset during training (if the training loss stabilizes earlier, it might stop sooner). Note that one ``epoch'' is one full pass through the dataset via the mini batches. 100 is a standard starting point. 

In this case, we have set the number of inducing points to 3000, the batch size to 250, and the number of epochs to 100.

Finally, load the dataset (we again provide it in a subfolder on the Github repo). 

In [1]:
# load gpytoch and other libraries
import torch
import numpy as np
import pandas as pd
import gpytorch
from scipy.stats import norm
from matplotlib import pyplot as plt
from gpytorch.means import ZeroMean, LinearMean
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.kernels import ScaleKernel, RBFKernel
from datetime import datetime
from gpytorch.means import Mean
from gpytorch.models import ApproximateGP
from gpytorch.variational import CholeskyVariationalDistribution
from gpytorch.variational import VariationalStrategy
from typing import Optional, Tuple
from torch.utils.data import TensorDataset, DataLoader

print(torch.__version__)
print(gpytorch.__version__)

# set a random seed so results are consistent each time you run the code
torch.manual_seed(12345)

# set default data type to float64 for accuracy
torch.set_default_dtype(torch.float64)

# extra hyperparameters for the sparse GP model
num_inducing = 3000
batch_size = 256
num_epochs = 100

# load the dataset with the monthly tweet volumes
# this file should be in a folder called 'data' with the name 'exile.csv'
data = pd.read_csv("./data/exile.csv")


2.6.0
1.8.1


We will also define for ourselves two helper functions for our month-year time series:

- `diff_month`: computes the number of months between two dates.
- `to_month`: converts a date into a monthly index.

In [2]:
def diff_month(d1, d2):
    d1 = datetime.strptime(d1,"%Y-%m-%d")
    d2 = datetime.strptime(d2,"%Y-%m-%d")
    return (d1.year - d2.year) * 12 + d1.month - d2.month

def to_month(d1):
    return datetime(2013 + int(d1 / 12), ((1 +d1) % 12) + 1, 1)

# Data Preparation

After only selecting the columns we needed, we construct our covariates: `month` is converted to a numerical index using our `diff_month` helper function to be the number of months since January 1 2013; `unit.id` is a categorical encoding based on `actor.id`; `log_num_tweets` is the logged number of tweets+1 for stability; and `tweeted_exile` is an indicator for whether the tweet was from exile. We also construct our tensor for the dependent variable.

Since we are building a sparse GP, we will also define our inducing points at this stage. Because of the sparse GP, we also need to specify a training dataset (something we did not have to do in the first example on white nationalist rhetoric, where the covariance matrix was computed over all data points).

In [3]:
Y_name = "perc_harsh_criticism"

data = data[[Y_name, "tweeted_exile", "month","num_tweets", "actor.id"]] # only keeping columns we need

# creating input data
# xs: month, unit id, log_num_tweets, tweeted_exile
xs = data.month.apply(lambda x: diff_month(x,"2013-01-01"))
xs = torch.tensor(np.array([data["actor.id"].astype('category').cat.codes.values.reshape((-1,)),\
            xs.values.reshape((-1,)),
            np.log(data.num_tweets.values+1).reshape((-1,)), \
            data['tweeted_exile'].values.reshape((-1,))]).T)

# creating the dependent variable
Y_name = "perc_harsh_criticism" # percentage of tweets that are harsh criticism; facilitates changing dependent variable
ys = torch.tensor(data[Y_name].values).double()

# lets you convert numeric unit IDs back into original actor.id values later for interpretation
to_unit = dict(enumerate(data["actor.id"].astype('category').cat.categories))
del data # to save memory

# defining the training dataset
train_dataset = TensorDataset(xs, ys)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)


# Model Specification

As explained in the main text, this model addresses three temporal modeling challenges:

1. Shared temporal shocks affecting all activists; 
2. Different activists have different baseline propensities for criticism; and
3. Activists have individual-specific trends in criticism that could confound the effect of exile



We construct a GP model that addresses these challenges while allowing for flexible temporal dynamics. The model has three components. First, a basic GP prior captures the effect of exile and tweet volume:
\begin{eqnarray}
    f \sim \mathcal{GP}\big( \mu_f(\mathbf x_{it}), \mathbf K_f \big), \text{ where}\\
    \mu_f(\mathbf{x}_{it}) =  \beta_0 + \beta_1 \text{exile}_{it} + \beta_2 \text{\#tweets}_{it}. 
\end{eqnarray}
We use a squared exponential auto relevance determination (SE-ARD) kernel for $\mathbf{K_f}$, allowing different weightings for the exile and tweet count predictors. Note that the SE-ARD kernel is a variation of the RBF kernel used earlier: Normally, the RBF kernel has a single lengthscale for all input dimensions. ARD is a special case where each dimension gets its own lengthscale. This makes ARD a kind of automatic feature selection built into the GP. This is usefule when we have multi-dimensional inputs, we don't know which inputs might matter most, and/or we want the GP to learn which dimensions are most relevant. 

Second, we model smooth common temporal shocks with:
\begin{eqnarray}
    g(t) \sim \mathcal{GP}\big(\mathbf 0, \textbf K_g \big)
\end{eqnarray}
where $\mathbf{K_g}$ is a SE kernel. This replaces any discrete monthly fixed effects with a continuous temporal process. Note that the SE kernel is widely used because it is relatively simple and encodes assumptions of smoothness and continuousness (it is infinitely differentiable), which is a reasonable starting point for many social sciences applications.

Third, we capture unit-specific patterns through independent GP priors:
\begin{eqnarray}
    h_i(t) \sim \mathcal{GP}\big(\mathbf \mu_{h[i]}(t), \textbf K_h\big), \text{ where}\\
    \mu_{h[i]}(t) = \alpha_{1[i]} + \alpha_{2[i]} t.
\end{eqnarray}
This allows each unit to follow its own temporal trajectory while maintaining a shared degree of smoothness across units.

We assume shared hyperparameters across units, which encodes the assumption that unit trends should exhibit similar levels of nonlinear deviations.

The complete model combines these components with Gaussian error:
\begin{eqnarray}
\mathbf y \sim \mathcal{MVN}(\boldsymbol \mu_y, \mathbf{K}_y), \text{ where}\\
\boldsymbol \mu_y = \boldsymbol \mu_f + \boldsymbol \mu_{h[i]} \text{ and}\\
\mathbf{K}_y = \mathbf{K}_f + \mathbf{K}_g + \mathbf{K}_h + \mathbf{I}\sigma^2_{\text{noise}}.
\end{eqnarray}



# Constructing the Model

We construct in `gpytorch` the components of the model to match our specifications and our temporal modeling challenges. Again, this will be a sparse variational GP that has a shared linear mean, unit specific means, and a covariance structure tat allows for continuous features, group effects, and interactions.

First, we define a `ConstantVectorMean` class, which allows for each unit to ave its own constant mean. It defines a custom mean function where the mean is not a single scalar but instead a vector of constants indexed by a categorical input. We initialize a learnable parameter called `constantvector` to be all zeros. It will take the categorical indices as the input, convert to integers, and use those as an index into the vector (in the `forward`).


In [4]:
class ConstantVectorMean(gpytorch.means.mean.Mean):
    def __init__(self, d=1, prior=None, batch_shape=torch.Size(), **kwargs):
        super().__init__()
        self.batch_shape = batch_shape
        self.register_parameter(name="constantvector",\
                 parameter=torch.nn.Parameter(torch.zeros(*batch_shape, d)))
        if prior is not None: 
            self.register_prior("mean_prior", prior, "constantvector")

    def forward(self, input):
        return self.constantvector[input.int().reshape((-1,)).tolist()]

Next, we define the `MaskMean` class. It is a wrapper mean function that allows our mean to only depend on one input instead of all possible inputs, as defined by the `active_dims`. It will extract those dimensions and pass them to the base mean (in the `forward`). 

In [5]:
class MaskMean(gpytorch.means.mean.Mean):
    def __init__(
        self,
        base_mean: gpytorch.means.mean.Mean,
        active_dims: Optional[Tuple[int, ...]] = None,
        **kwargs,
    ):
        super().__init__()
        if active_dims is not None and not torch.is_tensor(active_dims):
            active_dims = torch.tensor(active_dims, dtype=torch.long)
        self.active_dims = active_dims
        self.base_mean = base_mean
    
    def forward(self, x, **params):
        return self.base_mean.forward(x.index_select(-1, self.active_dims), **params)


Now we can put it together in the `GPModel` class, which will be an approximate GP using variational inference.

This variational GP set-up is defined by the `CholeskyVariationalDistribution` and `VariationalStrategy`. We have set the inducing point locations to be fixed (`learn_inducing_locations=False`). 

For the mean function, we have a shared linear mean using the first two dimensions of the data (month and unit ID) (`self.mean_module`). There is no intercept (`bias=False`), it is a pure weighted sum of our two features. We also create a list of LinearMeans, one per unit (`self.unit_mean`, like group fixed effects). 

There are two pieces to the covariance function. First is `self.covar_module`. For this piece, we are using a scaled SE kernel applied to the unit ID and the tweet volumes, where the kernel will learn a spearate lengthscale for each of the dimensions. `ScaleKernel` wraps the base kernel and learns an extra scalar variance parameter. This will model smooth similarity between observations based on the activist and their tweet volumes. Second is `self.t_covar_module`. It defines a product for whether the tweet is from exile. 

In the `forward`, we built the mean vector (the shared mean with the unit specific means) and the total covariance (sum of all kernels: continuous features, whether the tweet was in exile, and group-level correlation). It returns a `MultivariateNormal` distribution.

In [6]:
class GPModel(ApproximateGP):
    def __init__(self, inducing_points, unit_num):
        # variational GP setup
        self.unit_num = unit_num
        variational_distribution = CholeskyVariationalDistribution(inducing_points.size(0))
        variational_strategy = VariationalStrategy(self, inducing_points, variational_distribution, learn_inducing_locations=False)
        super(GPModel, self).__init__(variational_strategy)

        # linear mean
        self.mean_module = LinearMean(input_size=(2), bias=False)
        self.unit_mean = torch.nn.ModuleList([LinearMean(input_size=(1),bias=True) for _ in range(unit_num)])
        self.covar_module = ScaleKernel(RBFKernel(ard_num_dims=(2), active_dims=[2,3]))
        self.t_covar_module = ScaleKernel(RBFKernel(active_dims=[0])*RBFKernel(active_dims=[1]))
        self.g_covar_module = ScaleKernel(RBFKernel(active_dims=[1]))

    def forward(self, x):
        mean_x = self.mean_module(x[:,2:]) 
        for i in range(self.unit_num):
            mean_x[x[:,0]==i] += self.unit_mean[i](x[i,1].reshape((-1,1)))
        covar_x =  self.covar_module(x) + self.t_covar_module(x)  + self.g_covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

Since we are building a sparse GP, we will also define our inducing points at this stage. Because of the sparse GP, we also need to specify a training dataset (something we did not have to do in the first example on white nationalist rhetoric, where the covariance matrix was computed over all data points).

In [7]:
# defining the inducing points
inducing_points = xs[np.random.choice(xs.size(0),num_inducing,replace=False),:]

model = GPModel(inducing_points=inducing_points, unit_num=xs[:,0].unique().size()[0]).double()
likelihood = GaussianLikelihood().double()
del inducing_points

We set the initial values for the model's hyperparameters in a dictionary and initialize, as well as initialize model parameters.

In [8]:
# set initial values for model parameters

hypers = {
'mean_module.weights': torch.tensor([0, 5]),
'covar_module.outputscale': 9,
'covar_module.base_kernel.lengthscale': torch.std(xs[:,2:4],axis=0),
't_covar_module.base_kernel.kernels.1.lengthscale': torch.tensor([12]),
't_covar_module.outputscale': 4,
'g_covar_module.base_kernel.lengthscale': torch.tensor([24]),
'g_covar_module.outputscale': 9
}    

model = model.initialize(**hypers)


# initialize model parameters
model.t_covar_module.base_kernel.kernels[0].raw_lengthscale.requires_grad_(False)
model.t_covar_module.base_kernel.kernels[0].lengthscale = 0.01

likelihood.noise = 9.

# Training the Model

Like before, we switch the model and likelihood into "training" mode. We again use the Adam optimizer, though now we have a learning rate of 0.1 We are optimizing all model parameters except the raw lengthscale of the first kernel inside `t_covar_module` (like before, it is frozen) as well as the likelihoood parameters. Since we are now modeling a sparse GP, we will use Variational inference using Evidence Lower Bound (ELBO) and mini batching. Finally, using `train_loader`, we run a loop to train our model. In each loop, we zero the gradients, run the GP model's forward pass (`output=model(x_batch)`), compute the negative ELBO loss, backpropagate the gradients, and then update the parameters with Adam.

In [9]:
# train model
model.train()
likelihood.train()

optimizer = torch.optim.Adam([
    {'params': list(set(model.parameters()) \
                - {model.t_covar_module.base_kernel.kernels[0].raw_lengthscale,\
                })},
    {'params': likelihood.parameters()},
], lr=0.1)

# "Loss" for GPs
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=ys.size(0))

for i in range(num_epochs):
    for j, (x_batch, y_batch) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(x_batch)
        loss = -mll(output, y_batch)
        loss.backward()
        optimizer.step()
        if j % 50 == 0:
            print('Epoch %d Iter %d - Loss: %.3f' % (i + 1, j+1, loss.item()))

torch.linalg.solve_triangular has its arguments reversed and does not return a copy of one of the inputs.
X = torch.triangular_solve(B, A).solution
should be replaced with
X = torch.linalg.solve_triangular(A, B). (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\BatchLinearAlgebra.cpp:2259.)
  res = torch.triangular_solve(right_tensor, self.evaluate(), upper=self.upper).solution


Epoch 1 Iter 1 - Loss: 206.852
Epoch 1 Iter 51 - Loss: 13.866


# Evaluating the Model

We freeze our learned hyperparameters by switching the model and likelihood into "evaluation" mode and obtain our posterior predictions for both sets of inputs.

In [10]:
# set model and likelihood to evaluation mode
model.eval()
likelihood.eval()

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    out = model(xs)
    mll.combine_terms = True
    loss = -mll(out, ys)
    mu_f = out.mean.numpy()
    lower, upper = out.confidence_region()


# copy training tensor to test tensors and set exile to 1 and 0
test_x1 = xs.clone().detach().requires_grad_(False)
test_x1[:,3] = 1
test_x0 = xs.clone().detach().requires_grad_(False)
test_x0[:,3] = 0

# in eval mode the forward() function returns posterioir
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    out = model(xs)
    mll.combine_terms = False
    loss, _ , _ = mll(out, ys)
    loss = -loss*out.event_shape[0]
    out1 = model(test_x1)
    out0 = model(test_x0)


We can now compute our ATE and its uncertainty.

In [11]:
# compute ATE and its uncertainty
effect = out1.mean.numpy()[xs[:,3]==1].mean() - out0.mean.numpy()[xs[:,3]==1].mean()
effect_std = np.sqrt((out1.variance.detach().numpy()[xs[:,3]==1].mean()\
                    +out0.variance.detach().numpy()[xs[:,3]==1].mean()))
BIC = (3+2+1)*\
    torch.log(torch.tensor(xs.size()[0])) + 2*loss # *xs.size(0)/batch_size
print("ATE: {:0.3f} +- {:0.3f}\n".format(effect, effect_std))
print("model evidence: {:0.3f} \n".format(-loss))
print("BIC: {:0.3f} \n".format(BIC))

ATE: 6.852 +- 0.979

model evidence: -181330.798 

BIC: 362721.768 



# Saving the Results

We can save our results to a dataframe and then export.

In [13]:
# store results
results = pd.DataFrame({"gpr_mean":mu_f})
results['true_y'] = ys
results['gpr_lwr'] = lower
results['gpr_upr'] = upper
results['month'] = np.array([to_month(x) for x in xs[:,1].numpy().astype(int)])
results['unit'] = np.array([to_unit[x] for x in xs[:,0].numpy().astype(int)])
results['exile'] = xs[:,3].numpy().astype(int)

test_x0 = xs.clone().detach().requires_grad_(False)
test_x0[:,3] = 0

# in eval mode the forward() function returns posterioir
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    out0 = model(test_x0)
    lower, upper = out0.confidence_region()

results['cf'] = out0.mean.numpy()
results['cf_lower'] = lower
results['cf_upper'] = upper

if Y_name == "perc_harsh_criticism":
    abbr = "crit"
else:
    abbr = "repr"
results.to_csv("./results/exile_{}_fitted_gpr.csv".format(abbr),index=False) #save to file