# Day 2

In this analysis, inspired by [Wulff & Mata, 2025](https://doi.org/10.1038/s41562-024-02089-y), we will use a language model to extract features from personality items. We will then use these features to compute the similarity between items, evaluate how well these predicted similarities match the observed similarities, and visualize the items in two dimensions. Finally, we will re-assign each item to a personality construct based on its predicted similarity to the constructs.

By the end of this analysis, you will have learned how to:
- Extract features from text using a pre-trained language model
- Compute the similarity between items using cosine similarity
- Use these similarities to predict the construct to which an item belongs, and thus potentially improve construct validity

## Environment Setup 

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Installing requisite packages
    !pip install pacmap sentence-transformers &> /dev/null

    # Change working directory
    %cd /content/drive/MyDrive/LLM4BeSci_Zurich2025/day_1

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from pacmap import PaCMAP
import seaborn as sns
import matplotlib.pyplot as plt



## Extracting Features from Personality Items

We begin by loading the personality items into a `pandas.DataFrame` with three columns:

1. `factor`: The (high-level) personality factor to which the item belongs.
2. `construct`: The (mid-level) personality construct to which the item belongs.
3. `item`: The text of the personality item used to measure the construct.

Run the cell below.

In [2]:
# Loading personality data
personality = pd.read_csv('items.csv') 
personality

Unnamed: 0,factor,construct,item
0,Conscientiousness,Achievement-Striving,Go straight for the goal.
1,Conscientiousness,Achievement-Striving,Plunge into tasks with all my heart.
2,Conscientiousness,Achievement-Striving,Demand quality.
3,Conscientiousness,Achievement-Striving,Set high standards for myself and others.
4,Conscientiousness,Achievement-Striving,Turn plans into actions.
...,...,...,...
295,Neuroticism,Vulnerability,Remain calm under pressure.
296,Neuroticism,Vulnerability,Am calm even in tense situations.
297,Neuroticism,Vulnerability,Can handle complex problems.
298,Neuroticism,Vulnerability,Readily overcome setbacks.


The code below uses the `all-MiniLM-L6-v2` model to extract features from the personality items. It loads the model using the `sentence_transformers` library and extracts a vector of features for each item with the `encode` method. It then converts the features to a `pandas.DataFrame` for further analysis and easy viewing.

Run the cell below.

In [3]:
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  

# Extract features from personality items
item_features = model.encode(personality['item'])

# Convert features to DataFrame
item_features = pd.DataFrame(item_features, index=personality['item'])
item_features

item,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
Go straight for the goal.,0.010508,0.100210,-0.076360,0.001715,0.062427,0.049277,0.051734,0.018802,0.077200,0.012241,...,0.024534,0.014695,-0.061552,-0.013595,-0.088389,0.105380,-0.017474,-0.020673,0.008643,-0.026707
Plunge into tasks with all my heart.,-0.005888,0.016261,0.023178,-0.011430,0.017892,-0.049775,0.028243,-0.036557,0.030705,-0.007846,...,0.083111,0.034337,-0.034837,0.014203,-0.124650,0.089929,0.158494,-0.010659,-0.094051,-0.008922
Demand quality.,-0.042463,-0.015040,0.018326,-0.005335,-0.058177,-0.025078,0.007641,0.001493,0.012986,-0.005856,...,0.008393,-0.100025,-0.009640,-0.019040,-0.045627,0.018915,0.108192,-0.045537,0.057628,0.080098
Set high standards for myself and others.,0.002599,0.051376,-0.005017,-0.031174,-0.080613,-0.050955,-0.049017,-0.025073,-0.062257,-0.047764,...,0.033066,0.008424,0.037696,0.078157,-0.031138,0.042678,0.110578,-0.028196,-0.050860,-0.036707
Turn plans into actions.,-0.002227,0.071067,0.033380,-0.022899,-0.018842,0.015482,-0.013048,-0.028902,-0.018652,0.056530,...,0.104706,0.054388,0.026543,-0.054830,-0.026918,0.031745,-0.003095,-0.062595,-0.034699,0.001533
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Remain calm under pressure.,0.008002,0.041686,0.028673,0.048361,0.012777,-0.045190,0.048694,-0.055915,-0.006774,-0.092916,...,0.049380,-0.018107,-0.068623,0.063546,-0.033172,0.014080,0.000326,0.022905,-0.087142,0.037874
Am calm even in tense situations.,0.077793,0.017205,0.027962,0.013444,0.018121,-0.048891,0.070229,-0.011779,0.024065,-0.040962,...,0.051167,0.019745,-0.020991,0.016898,-0.026860,-0.008182,0.070316,0.077792,-0.070503,-0.023687
Can handle complex problems.,-0.064037,0.103417,0.005949,-0.016714,-0.080190,-0.048481,-0.003207,0.018399,-0.049289,0.005383,...,0.083289,0.038508,0.047977,-0.033457,-0.057766,0.104399,0.089886,0.009211,0.052749,-0.028001
Readily overcome setbacks.,-0.020695,0.061760,0.005606,0.072334,-0.035303,0.080265,-0.032409,-0.013967,-0.015124,-0.021790,...,0.088610,-0.003809,0.014618,0.015226,-0.004702,0.085417,-0.025451,-0.025940,0.035729,0.015886


## Computing Similarity between Personality Items
Now that we have extracted features for each personality item, we can compute the similarity between items. We use `sklearn`'s `cosine_similarity` function, which measures the cosine of the angle between two vectors. The closer the cosine similarity is to 1, the more similar the two items are. We compute the similarity between all pairs of items and store the results in a similarity matrix.

Run the cell below.

In [None]:
# Compute cosine similarity between features
predicted_sims = cosine_similarity(item_features)
predicted_sims = pd.DataFrame(predicted_sims, index=personality['item'], columns=personality['item'])
predicted_sims

As you can see, the similarity matrix is symmetric, with the diagonal containing 1s (since the similarity of an item with itself is 1). Furthermore, items that you would expect to be more related (e.g. "Turn plans into actions." and "Plunge into tasks with all my heart.") are indeed more similar. Conversely, less related items (e.g. "Am calm even in tense situations." and "Demand quality.") show lower cosine similarities.
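To make these properties concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy. The three 2D vectors are made-up values, not model output, and `cosine_similarity_matrix` is a hypothetical helper written for illustration rather than the `sklearn` function used above.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Cosine similarity between all pairs of rows in a 2D array."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms   # scale every row to unit length
    return unit @ unit.T      # dot products of unit vectors are cosines

# Three made-up 2D "embeddings" (hypothetical values, not model output)
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])

sims = cosine_similarity_matrix(toy)
print(np.round(sims, 3))
```

The result is symmetric with 1s on the diagonal, and the two orthogonal vectors (first and third rows) have a similarity of 0. Note that vector length does not matter: only the angle between the vectors does.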


In [None]:
# Plotting the distribution of item similarities
predicted_sims['Go straight for the goal.'].hist(bins=10)

**Task 1**: The code above plots the distribution of cosine similarities for the first item. Try replacing `'Go straight for the goal.'` with other items to get a feel for the overall similarity distribution (hint: you can play around with the `bins` parameter to change the resolution of the histogram). What do you notice about the distributions?

## Comparing to Observed Correlations between Items
This section compares how well the predicted similarities align with the *observed* similarities between items: that is, the correlations between the participant responses to the items. It first loads the observed correlations into a `pandas.DataFrame`:

In [None]:
# Load observed correlations
observed_sims = pd.read_csv('item_corrs.csv')
observed_sims

Next, the code pivots `observed_sims` to create a correlation matrix with the same structure as `predicted_sims` so that they can be easily compared.

In [None]:
# Pivoting to a correlation matrix for easy comparison with predicted correlations
observed_sims = observed_sims.pivot(index='text_i', columns='text_j', values='cor')
observed_sims
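As a small illustration of what the pivot does, the sketch below reshapes a made-up long-format table into a square matrix. The column names `text_i`, `text_j`, and `cor` follow the cell above; the items `'a'` and `'b'` and their correlation are hypothetical.

```python
import pandas as pd

# Long format: one row per (item i, item j) pair
toy_long = pd.DataFrame({
    'text_i': ['a', 'a', 'b', 'b'],
    'text_j': ['a', 'b', 'a', 'b'],
    'cor':    [1.0, 0.3, 0.3, 1.0],
})

# Wide format: items as both rows and columns, correlations as cell values
toy_wide = toy_long.pivot(index='text_i', columns='text_j', values='cor')
print(toy_wide)
```

Any item pair that is absent from the long table would show up as `NaN` in the wide matrix, so it is worth checking the pivoted result for missing values.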

The predicted and observed similarities are then aligned to ensure that the items are in the same order. Because both matrices are symmetric, each item pair appears twice, so the code flattens only one off-diagonal triangle of each matrix into a vector before computing the correlation between the predicted and observed similarities.

In [None]:
# Aligning observed and predicted similarities
predicted_sims, observed_sims = predicted_sims.align(observed_sims)

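def lower_triangle_flat(df):
    """Flattens one off-diagonal triangle of a square dataframe into a vector.

    np.triu_indices selects the entries above the diagonal; for a symmetric
    matrix these hold the same values as the lower triangle.
    """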
    rows, cols = np.triu_indices(len(df), k=1)  # k=1 to exclude the diagonal (self-similarities)
    return pd.Series(df.values[rows, cols])

# Flatten the lower triangle of the observed and predicted similarities into vectors
predicted_sims_flat, observed_sims_flat = lower_triangle_flat(predicted_sims), lower_triangle_flat(observed_sims)

# Correlation between predicted and observed
print(f'r: {predicted_sims_flat.corr(observed_sims_flat).round(2)}')
print(f'r of absolute values: {predicted_sims_flat.abs().corr(observed_sims_flat.abs()).round(2)}')

The correlation between the predicted and observed similarities is 0.18. If we take the absolute values of the similarities, the correlation increases to 0.33. Since we are not interested in the direction in which each item is keyed (its polarity), we focus on the absolute values. The result suggests that the extracted features capture some, but far from all, of the variance in the observed similarities between items. Whilst this could mean that the extracted features miss important information about the items, alternative explanations exist. Can you think of any?
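The toy example below sketches why absolute values can help. All numbers are made up: the predicted similarities are all positive, whereas the observed correlations between regular-keyed items (0 and 1) and reverse-keyed items (2 and 3) are negative. The helper `off_diagonal_flat` is a hypothetical NumPy counterpart to the flattening function above.

```python
import numpy as np

def off_diagonal_flat(matrix):
    """Return each unique item pair once (entries above the diagonal)."""
    rows, cols = np.triu_indices(len(matrix), k=1)
    return matrix[rows, cols]

# Hypothetical 4-item example: items 0-1 regular-keyed, items 2-3 reverse-keyed
predicted = np.array([[1.0, 0.8, 0.7, 0.6],
                      [0.8, 1.0, 0.6, 0.7],
                      [0.7, 0.6, 1.0, 0.8],
                      [0.6, 0.7, 0.8, 1.0]])
observed = np.array([[ 1.0,  0.5, -0.4, -0.3],
                     [ 0.5,  1.0, -0.3, -0.4],
                     [-0.4, -0.3,  1.0,  0.5],
                     [-0.3, -0.4,  0.5,  1.0]])

p, o = off_diagonal_flat(predicted), off_diagonal_flat(observed)

# Pearson correlation on signed values vs. on absolute values
r_signed = np.corrcoef(p, o)[0, 1]
r_abs = np.corrcoef(np.abs(p), np.abs(o))[0, 1]
print(f'r: {r_signed:.2f}, r of absolute values: {r_abs:.2f}')
```

In this constructed case the absolute values line up perfectly, while the signed correlation is lower because the predicted similarities cannot reflect the negative sign of the observed correlations.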

## Visualizing the Item Similarities
We can also visualize the items in two dimensions using PaCMAP. PaCMAP is a dimensionality reduction technique that aims to preserve both the local and the global structure of the data, so that items with similar features end up close together. The code fits the PaCMAP model to the extracted features and transforms them into two dimensions, saving the results in a `pandas.DataFrame`.

In [None]:
# Initialize PaCMAP model
pac = PaCMAP(n_components=2, random_state=42)

# Fit and transform the features
pac_features = pac.fit_transform(item_features)

# Convert features to DataFrame
pac_features = pd.DataFrame(pac_features, columns=['x', 'y'])
pac_features

Next, the code adds the personality factors and items as columns to `pac_features` to see how items cluster based on their similarity. 

In [None]:
# Adding personality factors and items to PaCMAP features
pac_features['factor'] = personality['factor']
pac_features['item'] = personality['item']
pac_features

The code next plots the PaCMAP features, with each point representing a personality item. The points are colored by factor, allowing us to see how items cluster based on their similarity.

In [None]:
# Plot pac features
sns.scatterplot(data=pac_features, x='x', y='y', hue='factor', s=100)
sns.despine(offset=10)

As illustrated, the items somewhat cluster according to their factor, again suggesting that the extracted features have captured some meaningful information about the items.

## Reassigning Items to Constructs
Finally, we can ask how well the extracted features predict the constructs to which the items belong. We first extract features for each construct by encoding its name with the same model.

In [None]:
# Extracting construct features
constructs = personality['construct'].unique()

# Extracting features for constructs
construct_features = model.encode(constructs)

# Convert features to DataFrame
construct_features = pd.DataFrame(construct_features, index=constructs)
construct_features

The code next computes the cosine similarity between the construct features and the item features. 

In [None]:
# Computing cosine similarity between constructs and items
construct_item_sims = cosine_similarity(construct_features, item_features)
construct_sims = pd.DataFrame(construct_item_sims, index=construct_features.index, columns=item_features.index)
construct_sims

We then find the closest construct to each item by finding the construct with the highest similarity. We add this as a new column, `predicted_construct`, to the `personality` dataframe.

In [None]:
# Finding the closest construct to each item (the row label with the highest similarity in each column)
closest_construct = construct_sims.idxmax()
closest_construct
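As a minimal sketch of what `idxmax` does here, consider a made-up similarity table with constructs as rows and items as columns. The construct names, item labels, and similarity values are hypothetical. The `margin` calculation is an addition for illustration, not part of the original analysis: it shows how close the runner-up construct was.

```python
import numpy as np
import pandas as pd

# Rows are constructs, columns are items (all values are made up)
toy_sims = pd.DataFrame(
    [[0.62, 0.20, 0.35],
     [0.15, 0.71, 0.40]],
    index=['Orderliness', 'Anxiety'],
    columns=['item A', 'item B', 'item C'],
)

# For each column (item), return the row label (construct) with the largest value
closest = toy_sims.idxmax()
print(closest)

# Difference between the best and second-best construct for each item
sorted_sims = np.sort(toy_sims.to_numpy(), axis=0)
margin = pd.Series(sorted_sims[-1] - sorted_sims[-2], index=toy_sims.columns)
print(margin.round(2))
```

A small margin, as for `item C` in this example, indicates that the assignment is a close call and could easily flip with a different model.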

In [None]:
# Adding the closest constructs to original personality dataframe
personality['predicted_construct'] = personality['item'].map(closest_construct)
personality

In [None]:
# Evaluating how well the predicted constructs align with the actual constructs
accuracy = (personality['construct'] == personality['predicted_construct']).mean()
print(f'Accuracy: {accuracy:.2f}')

Predicting the constructs based on the similarity between the item and construct features results in an accuracy of 23%. Whilst this is an improvement on the roughly 3% accuracy (a proportion of about .03) that would be expected at random, it is still relatively low. This could suggest that the extracted features do not fully capture the differences between the constructs, or (perhaps more interestingly) that the constructs are not as distinct as we might expect.
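The chance level can be checked with a short simulation. The sketch below assumes a balanced design of 30 constructs with 10 items each; the construct count is an assumption made for illustration, so substitute `personality['construct'].nunique()` for your data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed design: 30 constructs with 10 items each (adjust to your data)
n_constructs, items_per_construct = 30, 10
true_labels = np.repeat(np.arange(n_constructs), items_per_construct)

# Analytic chance level when guessing uniformly at random
analytic_chance = 1 / n_constructs

# Empirical check: guess a construct at random for every item, many times over
n_sims = 2000
guesses = rng.integers(0, n_constructs, size=(n_sims, true_labels.size))
simulated_chance = (guesses == true_labels).mean()

print(f'Analytic chance:  {analytic_chance:.3f}')
print(f'Simulated chance: {simulated_chance:.3f}')
```

Both values should land close to 1/30, which is the baseline against which the observed accuracy is compared.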

You can also visualize the confusion matrix to see how well the items were assigned to the constructs. We first compute the confusion matrix using `pd.crosstab` and then sort it by personality factor to make it easier to interpret.

In [None]:
# Confusion matrix
confusion_matrix = pd.crosstab(personality['construct'], personality['predicted_construct'])

# Adding missing predicted constructs
missing_constructs = set(personality['construct']) - set(personality['predicted_construct'])
confusion_matrix[list(missing_constructs)] = 0

# Sorting confusion matrix by personality factor
ordered_constructs = personality.sort_values('factor')['construct'].unique()
confusion_matrix = confusion_matrix.loc[ordered_constructs, ordered_constructs]
confusion_matrix

When interpreting the confusion matrix, it is important to remember that the rows represent the actual constructs, while the columns represent the predicted constructs. The values in the cells represent the number of items assigned to each construct. The diagonal reflects the number of items correctly assigned to their construct, while off-diagonal values reflect items that were misclassified. Finally, the maximum number of items that could be correctly assigned to a construct is 10, which is why the heatmap is capped at this value. 

In [None]:
# Plotting confusion matrix without numbers in cells
fig, ax = plt.subplots(figsize=(16, 12))
n_items_per_construct = 10 # Maximum possible number of correctly assigned items per construct
sns.heatmap(confusion_matrix, cmap='Blues', vmin=0, vmax=n_items_per_construct, ax=ax)

# Increasing x-tick label and y-tick label font size
ax.xaxis.set_tick_params(labelsize=12)
ax.yaxis.set_tick_params(labelsize=12)

As illustrated, while some constructs are well predicted (e.g., "Emotionality" and "Imagination"), most are less well predicted. 
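To put numbers on "well predicted", one option is to read per-construct accuracy off the diagonal of the confusion matrix. The sketch below uses made-up labels (`A`, `B`, `C`) rather than the real constructs, and uses `reindex` with `fill_value=0` as one possible way to add constructs that were never predicted.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted constructs for nine items
toy = pd.DataFrame({
    'construct':           ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'predicted_construct': ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'A'],
})

labels = sorted(toy['construct'].unique())
cm = pd.crosstab(toy['construct'], toy['predicted_construct'])

# 'C' is never predicted, so add it as an all-zero column to keep the matrix square
cm = cm.reindex(index=labels, columns=labels, fill_value=0)

# Diagonal = correctly assigned items; row sums = items per actual construct
correct = pd.Series(np.diag(cm.to_numpy()), index=labels)
per_construct_acc = correct / cm.sum(axis=1)
overall_acc = correct.sum() / cm.to_numpy().sum()

print(per_construct_acc.round(2))
print(f'Overall accuracy: {overall_acc:.2f}')
```

Sorting `per_construct_acc` in the real analysis would give a ranked list of the best and worst recovered constructs.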

**Task 2**: Now rerun the entire notebook but with `model = SentenceTransformer('dwulff/mpnet-personality')` (you can find the right line via a `cmd + f` or `ctrl + f` search). This is a model that has been fine-tuned on pairs of personality items to accurately predict the observed correlations between items. Although performance should be considerably better, it is important to be aware that this model has been fine-tuned on the same data that we are using to evaluate it, which gives it an unfair advantage.