# Integrated Values Survey: merging WVS and EVS

I am trying to reproduce this paper:

> Tao, Y, O Viberg, RS Baker, RF Kizilcec (2024). Cultural bias and cultural alignment of large language models. _PNAS Nexus_ **3** (9), September 2024, https://doi.org/10.1093/pnasnexus/pgae346.

The paper compares the question responses of several OpenAI LLMs to the results of the Integrated Values Survey, or IVS. This dataset must be constructed from the World and European Values Surveys. Additionally, the analysis in Tao et al only uses the more recent 'waves' (vintages) of the survey.

There are hundreds of questions in the survey, but we only use responses from 10 of them. The questions have different numbers in different surveys; the so-called [**IVS merge syntax**](https://www.worldvaluessurvey.org/WVSEVStrend.jsp) provides the mapping from one survey to the other, along with a unified reference. The 10 questions we want are the following:

    F063
    Y003 (must be constructed in the EVS data)
    F120
    G006
    E018
    Y002
    A008
    F118
    E025
    A165

In this notebook, we construct and filter the IVS data, compute the principal components from the survey question responses, transform the LLMs' responses using the computed coefficients ('loadings'), and plot everything together.

## Preliminaries

In [1]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%load_ext rpy2.ipython

In [3]:
%%R
install.packages("psych")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘mnormt’, ‘GPArotation’

trying URL 'https://cran.rstudio.com/src/contrib/mnormt_2.1.1.tar.gz'
trying URL 'https://cran.rstudio.com/src/contrib/GPArotation_2025.3-1.tar.gz'
trying URL 'https://cran.rstudio.com/src/contrib/psych_2.5.6.tar.gz'

The downloaded source packages are in
	‘/tmp/Rtmp2yV3Yf/downloaded_packages’


---

## Get the questions from Tao et al

We can download the question numbers (variables) from Tao et al's supplementary data.

In [4]:
import pandas as pd

dq = pd.read_csv("https://osf.io/download/mj57y")
dq

Unnamed: 0,scale,prompt
0,f063,Question: How important is God in your life? P...
1,y003,Question: In the following list of qualities t...
2,f120,Question: How justifiable do you think abortio...
3,g006,Question: How proud are you to be your nationa...
4,e018,Question: If greater respect for authority tak...
5,y002,Question: People sometimes talk about what the...
6,a008,"Question: Taking all things together, rate how..."
7,f118,Question: How justifiable do you think homosex...
8,e025,Question: Please tell me whether you have sign...
9,a165,"Question: Generally speaking, would you say th..."


We only need the `scale` (aka variable name):

In [5]:
features = [i.upper() for i in dq['scale']]
features

['F063',
 'Y003',
 'F120',
 'G006',
 'E018',
 'Y002',
 'A008',
 'F118',
 'E025',
 'A165']

We also need some metadata:

- `S003` is the country/territory code.
- `S017` is the weighting used in the PCA step. The weights compensate for demographic bias in the data.
- `S018` is another weighting that I am not currently using. According to the WVS note quoted below, it rescales the weights to a sample size of 1000.
- `versn_w` is the wave and versioning info.

More about weights from the WVS site:

> S017 and S017a are N preserving weightings, as originally provided by participants. S018 and S018a are corrected weights to give an N=1000 (while preserving the internal proportions of S017/S017a) and S019 and S019a are equivalent to S018/S018a but give an N=1500. You can see how to use the N=1000 weight to build you own population weighted samples by reading the following paper: http://www.jdsurvey.net/jds/jdsurveyActualidad.jsp?Idioma=I&SeccionTexto=0405.

Even more about weights: http://www.jdsurvey.net/jds/jdsurveyActualidad.jsp?Idioma=I&SeccionTexto=0405&NOID=107
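
As a rough illustration of what the quote describes, the sketch below rescales N-preserving weights so that each group sums to 1000 while keeping the ratios between respondents. The toy values and the `w1000` column name are made up, and I have not checked that the `S018` column in the data files is computed exactly this way. In the real data the groups would presumably be country-wave samples rather than countries.

```python
import pandas as pd

# Toy data: two hypothetical country samples with N-preserving weights (like S017).
toy = pd.DataFrame({
    "S003": [8, 8, 8, 8, 716, 716],
    "S017": [0.5, 1.5, 1.0, 1.0, 1.0, 1.0],
})

# Rescale within each group so the weights sum to 1000,
# keeping the ratios between respondents unchanged.
group_sum = toy.groupby("S003")["S017"].transform("sum")
toy["w1000"] = toy["S017"] * 1000 / group_sum

totals = toy.groupby("S003")["w1000"].sum()
print(totals)
```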

In [6]:
meta = ['S003', 'S017', 'S018', 'versn_w']

## EVS (European Values Survey) data

> The EVS Trend File 1981-2017 is constructed from the five EVS waves and covers almost 40 years. In altogether 160 surveys, more than 224.000 respondents from 48 countries/regions were interviewed. It is based on the updated data of the EVS Longitudinal Data File 1981-2008 (v.3.1.0) and the current EVS 2017 Integrated Dataset (v.5.0.0).

Info and download: https://search.gesis.org/research_data/ZA7503?doi=10.4232/1.14021


#### Reference

EVS (2022). EVS Trend File 1981-2017 (ZA7503; Version 3.0.0) [Data set]. GESIS, Cologne. https://doi.org/10.4232/1.14021


In [7]:
df = pd.read_stata('/content/drive/MyDrive/world-values-survey/ZA7503_v3-0-0.dta', convert_categoricals=False)
df.head()

Unnamed: 0,studyno,version,doi,stdyno_w,versn_w,S001,S002EVS,s002vs,S003,COW_NUM,...,X048H_N1,X048I_N2,X049,x049a,X049CS,X050,X051,X052,Y001,Y002
0,7503,3.0.0 (2022-12-14),doi:10.4232/1.14021,4800,5.0.0 (2022-06-08),1,4,5,8,339,...,-4,-4,-5,5,-4,-4,-4,-4,-4,2
1,7503,3.0.0 (2022-12-14),doi:10.4232/1.14021,4800,5.0.0 (2022-06-08),1,4,5,8,339,...,-4,-4,-5,5,-4,-4,-4,-4,-4,2
2,7503,3.0.0 (2022-12-14),doi:10.4232/1.14021,4800,5.0.0 (2022-06-08),1,4,5,8,339,...,-4,-4,-5,5,-4,-4,-4,-4,-4,2
3,7503,3.0.0 (2022-12-14),doi:10.4232/1.14021,4800,5.0.0 (2022-06-08),1,4,5,8,339,...,-4,-4,-5,5,-4,-4,-4,-4,-4,3
4,7503,3.0.0 (2022-12-14),doi:10.4232/1.14021,4800,5.0.0 (2022-06-08),1,4,5,8,339,...,-4,-4,-5,5,-4,-4,-4,-4,-4,2


The EVS data does not have `Y003` precomputed, so we compute it from the relevant survey questions (about priorities for children). For more on this, see the WVS site: Data & documentation > Frequently asked questions > Tradrat/Selfsurv scores.

In [8]:
import numpy as np

df['Y003'] = np.sum(df[['A029', 'A039', 'A040', 'A042']] * [1, 1, -1, -1], axis=1)
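
One caveat with a plain sum: if the EVS file marks missing answers in these four columns with negative codes (an assumption, suggested by the `>= 0` filter used later but not verified here), those codes get added up like ordinary values. A minimal sketch of a more defensive version, using the same signs as the cell above and a hypothetical helper name `autonomy_index`:

```python
import numpy as np
import pandas as pd

def autonomy_index(frame):
    """Y003-style index with the same signs as above; NaN if any input is missing."""
    parts = frame[["A029", "A039", "A040", "A042"]]
    index = parts["A029"] + parts["A039"] - parts["A040"] - parts["A042"]
    # Assumption: negative codes mean "no answer" or "not asked".
    valid = (parts >= 0).all(axis=1)
    return index.where(valid)

# Toy rows: two complete respondents and one with all four items missing.
toy = pd.DataFrame({
    "A029": [1, 0, -4],
    "A039": [1, 0, -4],
    "A040": [0, 1, -4],
    "A042": [0, 1, -4],
})
result = autonomy_index(toy)
print(result.tolist())
```

With this variant, rows where the index is NaN would need to be dropped before the factor analysis, for example with `dropna(subset=['Y003'])`.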

We will use Waves 4 (2008) and 5 (2017) only.

In [9]:
df = df.loc[df['versn_w'].isin(['4.0.0 (2015-10-30)', '5.0.0 (2022-06-08)'])]
# df = df.loc[df['versn_w'].isin(['5.0.0 (2022-06-08)'])]

df['versn_w'].unique()

array(['5.0.0 (2022-06-08)', '4.0.0 (2015-10-30)'], dtype=object)

## WVS (World Values Survey) data

This data is not open: it is provided for "non-profit use only". That condition is not very clear, but I will take a narrow interpretation that at least allows me to compute the principal components. The data may not be redistributed.


#### Reference

Haerpfer, C., Inglehart, R., Moreno, A., Welzel, C., Kizilova, K., Diez-Medrano J., M. Lagos, P. Norris, E. Ponarin & B. Puranen et al. (eds.). 2022. World Values Survey Trend File (1981-2022) Cross-National Data-Set. Madrid, Spain & Vienna, Austria: JD Systems Institute & WVSA Secretariat. Data File Version 4.0.0, doi:10.14281/18241.27.

In [10]:
dg = pd.read_stata('/content/drive/MyDrive/world-values-survey/Trends_VS_1981_2022_stata_v4_0.dta', convert_categoricals=False)
dg.head()

Unnamed: 0,studyno,version,doi,stdyno_w,versn_w,S001,s002,S002VS,S003,COUNTRY_ALPHA,...,Y022B,Y022C,Y023,Y023A,Y023B,Y023C,Y024,Y024A,Y024B,Y024C
0,4001,4-0-0 (2024-06-30),doi.org/10.14281/18241.27,341,WVS3 v.20180912,2,3,3,8,ALB,...,0.66,1.0,0.296296,0.0,0.444444,0.444444,0.165,0.33,0.0,0.165
1,4001,4-0-0 (2024-06-30),doi.org/10.14281/18241.27,341,WVS3 v.20180912,2,3,3,8,ALB,...,0.0,1.0,0.333333,0.111111,0.444444,0.444444,0.165,0.33,0.0,0.165
2,4001,4-0-0 (2024-06-30),doi.org/10.14281/18241.27,341,WVS3 v.20180912,2,3,3,8,ALB,...,1.0,1.0,0.296296,0.0,0.444444,0.444444,0.415,0.33,0.5,0.415
3,4001,4-0-0 (2024-06-30),doi.org/10.14281/18241.27,341,WVS3 v.20180912,2,3,3,8,ALB,...,0.0,0.66,0.222222,0.0,0.333333,0.333333,0.165,0.33,0.0,0.165
4,4001,4-0-0 (2024-06-30),doi.org/10.14281/18241.27,341,WVS3 v.20180912,2,3,3,8,ALB,...,0.0,0.66,0.222222,0.0,0.333333,0.333333,0.25,0.0,0.5,0.25


There are 7 'waves' (survey vintages) in the data:

- Wave 7 (2017-2022)
- Wave 6 (2010-2014)
- Wave 5 (2005-2009)
- Wave 4 (1999-2004)
- Wave 3 (1995-1998)
- Wave 2 (1990-1994)
- Wave 1 (1981-1984)

In [11]:
dg['versn_w'].unique()

array(['WVS3 v.20180912', 'WVS4 v.20201117', 'WVS5 v.20180912',
       'WVS7 v.5.0', 'WVS1 v.20200208', 'WVS2 v.20180912',
       'WVS6 v.20201117'], dtype=object)

We will only use waves 5, 6, and 7:

In [12]:
dg = dg.loc[dg['versn_w'].isin(['WVS5 v.20180912', 'WVS6 v.20201117', 'WVS7 v.5.0'])]
dg['versn_w'].unique()

array(['WVS5 v.20180912', 'WVS7 v.5.0', 'WVS6 v.20201117'], dtype=object)
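
Matching on full version strings works, but it breaks if a version label changes in a later release of the file. As an alternative sketch, the wave number can be parsed out of the label. The helper name `wvs_wave` is hypothetical, and the pattern assumes every label starts with `WVS` followed by the wave number, as in the output above.

```python
import re

def wvs_wave(version_label):
    """Return the wave number from labels such as 'WVS5 v.20180912', or None."""
    match = re.match(r"WVS(\d+)\b", version_label)
    return int(match.group(1)) if match else None

labels = ['WVS3 v.20180912', 'WVS4 v.20201117', 'WVS5 v.20180912',
          'WVS7 v.5.0', 'WVS1 v.20200208', 'WVS2 v.20180912',
          'WVS6 v.20201117']

# Keep only labels from wave 5 onwards.
recent = sorted(label for label in labels if (wvs_wave(label) or 0) >= 5)
print(recent)
```

On the real frame this could be applied as `dg[dg['versn_w'].map(wvs_wave) >= 5]`, assuming all labels parse.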

## Factor analysis, all human data

In [13]:
dx = pd.concat([df.loc[:, meta+features], dg.loc[:, meta+features]])
dx = dx.reset_index(drop=True)
columns_to_check = [col for col in dx.loc[:, 'F063':'A165'].columns if col != 'Y003']
mask = (dx[columns_to_check] >= 0).all(axis=1)
dx = dx[mask]
dx

Unnamed: 0,S003,S017,S018,versn_w,F063,Y003,F120,G006,E018,Y002,A008,F118,E025,A165
2,8,0.761706,0.651890,5.0.0 (2022-06-08),5,1,3,4,2,2,3,4,2,2
4,8,0.985901,0.651890,5.0.0 (2022-06-08),1,1,5,2,2,2,2,2,1,2
5,8,0.761706,0.651890,5.0.0 (2022-06-08),1,0,4,3,2,1,3,4,1,2
8,8,0.600835,0.651890,5.0.0 (2022-06-08),1,1,3,1,1,2,3,5,3,2
9,8,1.064281,0.651890,5.0.0 (2022-06-08),2,2,2,2,1,2,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
435751,716,1.000000,0.823045,WVS7 v.5.0,6,-1,1,1,1,2,2,1,3,2
435752,716,1.000000,0.823045,WVS7 v.5.0,10,-1,1,4,1,3,2,1,1,2
435753,716,1.000000,0.823045,WVS7 v.5.0,10,0,1,1,1,2,3,1,2,2
435754,716,1.000000,0.823045,WVS7 v.5.0,10,2,1,2,1,1,4,1,3,2
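
The filter in this cell drops any respondent with a negative value in any question column except `Y003`, which is skipped because negative values are legitimate for that index. A toy version of the same logic, with made-up values, shows the effect:

```python
import pandas as pd

toy = pd.DataFrame({
    "F063": [5, -1, 10],
    "Y003": [-1, 2, 0],   # negative values are legitimate here
    "F120": [3, 4, -2],
})

# Check every column except Y003 for negative (assumed missing) codes.
columns_to_check = [col for col in toy.columns if col != "Y003"]
mask = (toy[columns_to_check] >= 0).all(axis=1)
kept = toy[mask]
print(kept)
```

Only the first row survives: the second has a negative `F063` and the third a negative `F120`, while the negative `Y003` in the first row is ignored by design.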


In [14]:
X = dx.loc[:, 'F063':'A165']
weight = dx['S017'].values[:, np.newaxis] * np.ones_like(X)

In [20]:
%%R -i X -i weight -o principal_results -o column_means -o column_sds
library(psych)

X_R <- as.matrix(X)
w_R <- as.matrix(weight)

column_means <- colMeans(X, na.rm=TRUE)
column_sds <- apply(X, 2, sd, na.rm=TRUE)

principal_results <- principal(X_R, nfactors=2, rotate="varimax", use="pairwise", weight=w_R)
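
To give a feel for what a weighted principal components step does, here is a small NumPy sketch. It is not `psych::principal`'s algorithm: it takes one weight per respondent, applies no varimax rotation, and does no pairwise handling of missing values. The function name `weighted_pca_loadings` and the toy data are made up for illustration.

```python
import numpy as np

def weighted_pca_loadings(X, w, n_components=2):
    """Unrotated loadings from a weighted correlation matrix."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                               # normalize weights to sum to 1
    mean = w @ X                                  # weighted column means
    centered = X - mean
    cov = (centered * w[:, None]).T @ centered    # weighted covariance
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)                 # weighted correlation
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    return corr, loadings

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 4))
X_toy[:, 1] += X_toy[:, 0]                        # make two columns correlated
w_toy = rng.uniform(0.5, 1.5, size=500)

corr, loadings = weighted_pca_loadings(X_toy, w_toy)
print(loadings.shape)
```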

In [16]:
dx[['PC0', 'PC1']] = principal_results['scores']
dx['surv-self'] = 1.81 * dx['PC0'] + 0.38
dx['trad-sec']  = 1.61 * dx['PC1'] - 0.01
dx.head()

Unnamed: 0,S003,S017,S018,versn_w,F063,Y003,F120,G006,E018,Y002,A008,F118,E025,A165,PC0,PC1,surv-self,trad-sec
2,8,0.761706,0.65189,5.0.0 (2022-06-08),5,1,3,4,2,2,3,4,2,2,-1.011856,2.595088,-1.45146,4.168092
4,8,0.985901,0.65189,5.0.0 (2022-06-08),1,1,5,2,2,2,2,2,1,2,0.32787,1.127879,0.973445,1.805886
5,8,0.761706,0.65189,5.0.0 (2022-06-08),1,0,4,3,2,1,3,4,1,2,-0.71723,2.352473,-0.918187,3.777481
8,8,0.600835,0.65189,5.0.0 (2022-06-08),1,1,3,1,1,2,3,5,3,2,-0.52057,0.893319,-0.562232,1.428243
9,8,1.064281,0.65189,5.0.0 (2022-06-08),2,2,2,2,1,2,2,2,2,2,-0.185325,0.754508,0.044561,1.204758


## Save what we need

We need the component score 'weights', plus the column means and standard deviations so that we can scale future data the same way.

In [27]:
import numpy as np

np.savetxt("/content/drive/MyDrive/world-values-survey/weights.txt", principal_results['weights'])

And save the scaling parameters:

In [22]:
np.savetxt("/content/drive/MyDrive/world-values-survey/column_means.txt", column_means)
np.savetxt("/content/drive/MyDrive/world-values-survey/column_sds.txt", column_sds)
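
A sketch of how the saved parameters might be applied to new responses (for example, an LLM's answers to the 10 questions). This assumes the component scores are the standardized responses multiplied by the saved `weights` matrix, which I believe is how `psych` computes them but have not verified against `principal_results['scores']`. The helper name `project` and all the numbers below are made up. The rescaling constants are the ones used earlier in this notebook.

```python
import numpy as np

def project(responses, column_means, column_sds, weights):
    """Standardize responses with the survey's means/SDs, then apply score weights."""
    z = (np.asarray(responses, dtype=float) - column_means) / column_sds
    pcs = z @ weights                        # shape: (n_rows, 2)
    surv_self = 1.81 * pcs[:, 0] + 0.38      # same rescaling as earlier
    trad_sec = 1.61 * pcs[:, 1] - 0.01
    return np.column_stack([surv_self, trad_sec])

# Made-up numbers purely to show the shapes involved (10 questions, 2 components).
rng = np.random.default_rng(0)
toy_means = rng.uniform(1, 5, size=10)
toy_sds = rng.uniform(0.5, 2, size=10)
toy_weights = rng.normal(size=(10, 2))
toy_responses = np.vstack([toy_means, toy_means + toy_sds])

coords = project(toy_responses, toy_means, toy_sds, toy_weights)
print(coords)
```

In practice the three arrays would come from `np.loadtxt` on the files saved above. A quick sanity check is to project a few rows of `X` and compare against the corresponding `surv-self` and `trad-sec` values in `dx`.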