# Workshop 2

In this exercise, inspired by [Wulff & Mata, 2023](https://osf.io/preprints/psyarxiv/9h7aw), we will use an LLM to extract features (embeddings) from personality items. We will then use these features to compute the similarity between items, evaluate how well these similarities predict the observed similarities, and visualize the items in two dimensions. Finally, we will assign each item to a personality construct based on its similarity to the constructs.

By the end of this analysis, you will have learned how to:
- Extract features from text using an LLM
- Compute the similarity between items using cosine similarity
- Use these similarities to predict the construct to which an item belongs, and thus potentially improve construct validity

## Environment Setup 

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Installing requisite packages
    !pip install pacmap sentence-transformers &> /dev/null

    # Change working directory 
    %cd /content/drive/MyDrive/LLM4BeSci_EADM2024/workshop_2

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from pacmap import PaCMAP
import seaborn as sns
import matplotlib.pyplot as plt



## Extracting Features from Personality Items

We begin by loading the personality items into a `pandas.DataFrame` with three columns:

1. `factor`: The (high-level) personality factor to which the item belongs.
2. `construct`: The (mid-level) personality construct to which the item belongs.
3. `item`: The text of the personality item used to measure the construct.

Run the cell below.

In [2]:
# Loading personality data
personality = pd.read_csv('items.csv') 
personality

Unnamed: 0,factor,construct,item
0,Conscientiousness,Achievement-Striving,Go straight for the goal.
1,Conscientiousness,Achievement-Striving,Plunge into tasks with all my heart.
2,Conscientiousness,Achievement-Striving,Demand quality.
3,Conscientiousness,Achievement-Striving,Set high standards for myself and others.
4,Conscientiousness,Achievement-Striving,Turn plans into actions.
...,...,...,...
295,Neuroticism,Vulnerability,Remain calm under pressure.
296,Neuroticism,Vulnerability,Am calm even in tense situations.
297,Neuroticism,Vulnerability,Can handle complex problems.
298,Neuroticism,Vulnerability,Readily overcome setbacks.


The code below makes use of the `all-MiniLM-L6-v2` model to extract features from the personality items. It loads the model using the `sentence_transformers` library and extracts a vector of features for each item with the `encode` method. It then converts the features to a `pandas.DataFrame` for further analysis and for easy viewing.

Run the cell below.

In [3]:
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  

# Extract features from personality items
item_features = model.encode(personality['item'])

# Convert features to DataFrame
item_features = pd.DataFrame(item_features, index=personality['item'])
item_features

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Go straight for the goal.,0.017550,0.003073,0.008361,-0.008687,0.021958,-0.038576,-0.098399,-0.042849,-0.051750,0.031179,...,-0.044286,0.008961,-0.070512,0.002789,0.005999,0.021540,0.030810,-0.026591,0.019784,-0.008526
Plunge into tasks with all my heart.,0.005842,0.022052,-0.000972,-0.030645,0.011986,0.046025,-0.150106,-0.000032,0.001286,0.035226,...,-0.001800,0.006713,-0.008470,0.007871,0.027999,-0.044741,0.038500,-0.016868,0.004011,-0.017156
Demand quality.,0.049623,0.048514,-0.008554,0.026249,0.011295,-0.030933,-0.017587,0.022948,-0.055769,0.042887,...,-0.011960,0.101395,0.017835,0.036326,-0.004668,-0.002991,0.000222,0.006870,0.044851,-0.009326
Set high standards for myself and others.,0.058789,0.080498,0.006956,-0.013496,0.024805,0.002990,-0.092100,-0.021383,-0.033367,0.027041,...,-0.052835,0.019359,-0.020186,0.046977,0.030446,-0.023285,-0.011024,-0.047759,0.072670,0.027677
Turn plans into actions.,0.009256,0.059699,-0.028263,-0.017310,-0.028868,0.000182,-0.064827,-0.005294,-0.047457,0.058930,...,0.008992,-0.008091,-0.027094,0.008591,0.003334,0.076508,0.023184,0.015250,0.000454,-0.048992
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Remain calm under pressure.,0.031624,-0.046148,-0.015354,-0.012368,0.032294,-0.001356,-0.040859,0.032436,-0.034918,0.066175,...,-0.051637,-0.039044,0.000241,0.043001,-0.018225,-0.008311,0.011037,-0.042147,0.072749,-0.035666
Am calm even in tense situations.,-0.000624,-0.039618,-0.009537,0.009270,0.043538,0.038576,-0.070591,-0.013267,-0.009006,0.018363,...,0.018693,-0.013235,-0.007934,0.073622,-0.065939,-0.008112,0.008220,-0.027113,0.039138,-0.008247
Can handle complex problems.,-0.015638,0.039720,-0.040296,-0.003597,-0.016624,0.030943,0.011199,-0.010853,-0.021701,0.026895,...,0.009983,0.016662,0.010252,0.036172,-0.025050,-0.063912,-0.020847,0.015164,0.020838,-0.042706
Readily overcome setbacks.,0.016926,0.024676,-0.016937,-0.040923,0.040099,0.028838,-0.094644,-0.005305,-0.035266,0.038571,...,-0.009220,-0.048346,0.008349,0.058375,-0.039073,0.054224,0.007074,-0.004110,0.084463,-0.050250


## Computing Similarity between Personality Items
Now that we have extracted features for each personality item, we can compute the similarity between items. We use `sklearn`'s `cosine_similarity` function, which measures the cosine of the angle between two vectors. The closer the cosine similarity is to 1, the more similar the two items are. We compute the similarity between all pairs of items and store the results in a similarity matrix.

Run the cell below.

In [4]:
# Compute cosine similarity between features
predicted_sims = cosine_similarity(item_features)
predicted_sims = pd.DataFrame(predicted_sims, index=personality['item'], columns=personality['item'])
predicted_sims

item,Go straight for the goal.,Plunge into tasks with all my heart.,Demand quality.,Set high standards for myself and others.,Turn plans into actions.,Do more than what's expected of me.,Work hard.,Do just enough work to get by.,Am not highly motivated to succeed.,Put little time and effort into my work.,...,Panic easily.,Get overwhelmed by emotions.,Feel that I'm unable to deal with things.,Can't make up my mind.,Become overwhelmed by events.,Remain calm under pressure.,Am calm even in tense situations.,Can handle complex problems.,Readily overcome setbacks.,Know how to cope.
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Go straight for the goal.,1.000000,0.486718,0.204326,0.295062,0.306179,0.357110,0.405048,0.251681,0.191622,0.186124,...,0.246802,0.241281,0.049459,0.192209,0.247376,0.420539,0.187645,0.112799,0.273881,0.277367
Plunge into tasks with all my heart.,0.486718,1.000000,0.186344,0.496806,0.464552,0.565849,0.387097,0.367822,0.258395,0.484898,...,0.453511,0.465788,0.288941,0.114042,0.490174,0.563155,0.428089,0.238054,0.491712,0.451839
Demand quality.,0.204326,0.186344,1.000000,0.373563,0.210493,0.288356,0.350638,0.206257,0.044242,0.262073,...,0.164152,0.128323,0.040879,0.094995,0.133875,0.241268,0.081442,0.209819,0.134752,0.144493
Set high standards for myself and others.,0.295062,0.496806,0.373563,1.000000,0.259509,0.679769,0.432153,0.422330,0.182889,0.446444,...,0.192830,0.228157,0.194089,0.033378,0.209289,0.425195,0.314169,0.142915,0.359163,0.407831
Turn plans into actions.,0.306179,0.464552,0.210493,0.259509,1.000000,0.399446,0.312278,0.228493,0.188553,0.291710,...,0.226236,0.293121,0.118662,0.092486,0.378768,0.389385,0.154396,0.374070,0.382767,0.301394
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Remain calm under pressure.,0.420539,0.563155,0.241268,0.425195,0.389385,0.455050,0.414142,0.320109,0.248578,0.416657,...,0.538192,0.593201,0.321912,0.042354,0.557740,1.000000,0.627128,0.330671,0.555039,0.572924
Am calm even in tense situations.,0.187645,0.428089,0.081442,0.314169,0.154396,0.338220,0.226396,0.085100,0.204152,0.339117,...,0.507896,0.453955,0.419293,0.023587,0.407580,0.627128,1.000000,0.259900,0.362371,0.368101
Can handle complex problems.,0.112799,0.238054,0.209819,0.142915,0.374070,0.175501,0.234857,0.139662,0.164762,0.169091,...,0.257189,0.220638,0.192297,0.121490,0.211317,0.330671,0.259900,1.000000,0.269055,0.234051
Readily overcome setbacks.,0.273881,0.491712,0.134752,0.359163,0.382767,0.365669,0.373074,0.280662,0.360368,0.399182,...,0.466661,0.437270,0.368052,0.078987,0.498941,0.555039,0.362371,0.269055,1.000000,0.561523


As you can see, the similarity matrix is symmetric, with the diagonal containing 1s (since the similarity of an item with itself is 1). Furthermore, items that you would expect to be more related (e.g. `"Turn plans into actions."` and `"Plunge into tasks with all my heart."`, with a similarity of 0.46) are indeed more similar. Conversely, less related items (e.g. `"Am calm even in tense situations."` and `"Demand quality."`, with a similarity of 0.08) show lower cosine similarities.
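
To make the computation concrete, here is a minimal sketch of cosine similarity written out with NumPy on three made-up vectors. The vectors and the helper name `cosine_sim_matrix` are illustrative assumptions rather than part of the workshop data; the result should agree with `sklearn`'s `cosine_similarity` up to floating-point error.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Scale each row to unit length; the dot product of unit vectors is the cosine
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Toy "embeddings": row 1 lies between rows 0 and 2, which are orthogonal to each other
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
sims = cosine_sim_matrix(toy)
print(np.round(sims, 3))
```

Note that the length of a vector does not matter: row 2 is twice as long as row 0, yet only the angle between them enters the result.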

## Comparing to Observed Correlations between Items
This section compares how well the predicted similarities align with the *observed* similarities between items: that is, the correlations between the participant responses to the items. It first loads the observed correlations into a `pandas.DataFrame`:

In [5]:
# Load observed correlations
observed_sims = pd.read_csv('item_corrs.csv')
observed_sims

Unnamed: 0,text_i,text_j,cor
0,Worry about things.,Worry about things.,1.000000
1,Make friends easily.,Worry about things.,-0.092088
2,Have a vivid imagination.,Worry about things.,0.011413
3,Trust others.,Worry about things.,-0.122167
4,Complete tasks successfully.,Worry about things.,-0.052228
...,...,...,...
89995,Am calm even in tense situations.,Often make last-minute plans.,0.031644
89996,Seldom joke around.,Often make last-minute plans.,-0.143314
89997,Like to stand during the national anthem.,Often make last-minute plans.,-0.023413
89998,Can't stand weak people.,Often make last-minute plans.,0.038725


Next, the code pivots `observed_sims` to create a correlation matrix with the same structure as `predicted_sims` so that they can be easily compared.

In [6]:
# Pivoting to a correlation matrix for easy comparison with predicted correlations
observed_sims = observed_sims.pivot(index='text_i', columns='text_j', values='cor')
observed_sims

text_j,Act comfortably with others.,Act wild and crazy.,Act without thinking.,Adapt easily to new situations.,Am a creature of habit.,Am able to control my cravings.,Am able to stand up for myself.,Am afraid of many things.,Am afraid that I will do the wrong thing.,Am afraid to draw attention to myself.,...,Want everything to be just right.,Want to be left alone.,Warm up quickly to others.,Waste my time.,Willing to try anything once.,Work hard.,Worry about things.,Would never cheat on my taxes.,Would never go hang gliding or bungee jumping.,Yell at people.
text_i,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Act comfortably with others.,1.000000,0.217360,0.012991,-0.430405,0.101136,-0.104918,-0.303101,-0.245115,-0.229300,-0.393090,...,-0.021545,0.407432,0.519459,0.193105,0.158005,0.162693,-0.162281,0.029572,0.124640,0.090932
Act wild and crazy.,0.217360,1.000000,-0.421215,-0.177011,0.134400,0.101634,-0.126245,-0.028102,-0.040210,-0.276224,...,-0.070427,0.175862,0.213489,-0.113697,0.317553,-0.097946,-0.076365,-0.115061,0.294004,-0.230751
Act without thinking.,0.012991,-0.421215,1.000000,-0.023407,-0.024389,-0.240195,-0.072953,-0.154781,-0.133205,0.055036,...,0.047438,0.012868,-0.069470,0.308881,-0.188441,0.217708,-0.050914,0.132766,-0.137169,0.327725
Adapt easily to new situations.,-0.430405,-0.177011,-0.023407,1.000000,-0.222098,0.153603,0.343191,0.359561,0.278076,0.328830,...,0.105141,-0.249252,-0.314628,-0.169759,-0.257563,-0.143309,0.260600,0.019874,-0.192907,-0.120178
Am a creature of habit.,0.101136,0.134400,-0.024389,-0.222098,1.000000,-0.065602,-0.100340,-0.169118,-0.155461,-0.156853,...,-0.222918,0.140241,0.077795,0.057557,0.132251,-0.029368,-0.163692,-0.042522,0.154302,0.049869
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Work hard.,0.162693,-0.097946,0.217708,-0.143309,-0.029368,-0.168711,-0.162225,-0.115883,-0.075400,-0.040886,...,0.160276,0.106788,0.094098,0.422429,0.002044,1.000000,0.027527,0.159320,-0.016483,0.113103
Worry about things.,-0.162281,-0.076365,-0.050914,0.260600,-0.163692,0.144669,0.204268,0.431686,0.408099,0.198018,...,0.239192,-0.107536,-0.078245,-0.074586,-0.107065,0.027527,1.000000,0.051653,-0.143831,-0.136540
Would never cheat on my taxes.,0.029572,-0.115061,0.132766,0.019874,-0.042522,-0.103727,0.021737,0.030539,0.060323,0.083540,...,0.072984,0.056801,0.049876,0.137484,-0.088864,0.159320,0.051653,1.000000,-0.067971,0.115916
Would never go hang gliding or bungee jumping.,0.124640,0.294004,-0.137169,-0.192907,0.154302,-0.043246,-0.126138,-0.192227,-0.083722,-0.163592,...,-0.083874,0.119223,0.097164,-0.009435,0.352464,-0.016483,-0.143831,-0.067971,1.000000,-0.011130
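
The pivot step can be illustrated on a toy table. This is a sketch with made-up values; only the column names `text_i`, `text_j`, and `cor` are taken from the data above.

```python
import pandas as pd

# Toy long-format table with the same column names as item_corrs.csv (values are made up)
long_corrs = pd.DataFrame({
    'text_i': ['B', 'A', 'B', 'A'],
    'text_j': ['A', 'A', 'B', 'B'],
    'cor':    [0.3, 1.0, 1.0, 0.3],
})

# One row per text_i, one column per text_j, cells filled with the correlation
wide_corrs = long_corrs.pivot(index='text_i', columns='text_j', values='cor')
print(wide_corrs)
```

Two details are worth keeping in mind. `pivot` raises an error if any (`text_i`, `text_j`) pair occurs more than once, and it returns rows and columns in sorted label order, which is why the pivoted matrix is ordered alphabetically rather than in the order of `items.csv`.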


The predicted and observed similarities are then aligned to ensure that the items are in the same order. The code then flattens the lower triangle of the matrices into vectors to compute the correlation between the predicted and observed similarities.

In [7]:
# Aligning observed and predicted similarities
predicted_sims, observed_sims = predicted_sims.align(observed_sims)

def lower_triangle_flat(df):
    """Takes the lower triangle of a dataframe and flattens it into a vector"""
    rows, cols = np.tril_indices(len(df), k=-1)  # k=-1 to exclude the diagonal (self-similarities)
    return pd.Series(df.values[rows, cols])

# Flatten the lower triangle of the observed and predicted similarities into vectors
predicted_sims_flat, observed_sims_flat = lower_triangle_flat(predicted_sims), lower_triangle_flat(observed_sims)

# Correlation between predicted and observed
print(f'r: {predicted_sims_flat.corr(observed_sims_flat).round(2)}')
print(f'r of absolute values: {predicted_sims_flat.abs().corr(observed_sims_flat.abs()).round(2)}')

r: 0.17
r of absolute values: 0.32


The correlation between the predicted and observed similarities is 0.17. If we take the absolute values of the similarities, the correlation increases to 0.32. Since we are not interested in the polarity of the items (that is, whether an item is keyed positively or negatively on its scale), we focus on the absolute values. This suggests that the extracted features capture some, but far from all, of the variance in the observed similarities between items. Whilst this may mean that the extracted features do not capture everything we want to know about the items, alternative explanations exist. Can you think of any?
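
The effect of taking absolute values can be seen in a small sketch with made-up numbers. It assumes three hypothetical items, where item 2 is a reverse-keyed paraphrase of item 0: the embedding treats the two as similar, while participants' responses correlate negatively. The helper `lower_triangle` is a NumPy-only stand-in for the flattening step, not the notebook's own function.

```python
import numpy as np

def lower_triangle(mat):
    """Values strictly below the diagonal of a square matrix, as a 1-D array."""
    rows, cols = np.tril_indices(len(mat), k=-1)
    return mat[rows, cols]

# Made-up similarities for three items; item 2 is a reverse-keyed paraphrase of item 0
predicted = np.array([[1.0, 0.2, 0.8],
                      [0.2, 1.0, 0.3],
                      [0.8, 0.3, 1.0]])
observed = np.array([[ 1.0,  0.1, -0.7],
                     [ 0.1,  1.0, -0.3],
                     [-0.7, -0.3,  1.0]])

pred_flat, obs_flat = lower_triangle(predicted), lower_triangle(observed)

# Pearson correlation on the signed values and on the absolute values
r_signed = np.corrcoef(pred_flat, obs_flat)[0, 1]
r_abs = np.corrcoef(np.abs(pred_flat), np.abs(obs_flat))[0, 1]
print(f'signed r: {r_signed:.2f}, absolute r: {r_abs:.2f}')
```

In this toy case the signed correlation is negative while the correlation of absolute values is strongly positive, because the only disagreement between the two matrices is one of sign.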

## Visualizing the Item Similarities
We can also visualize the similarities between items in two dimensions using PaCMAP. PaCMAP is a dimensionality reduction technique that aims to preserve both the local and the global structure of the data. The code fits the PaCMAP model to the extracted features and transforms them into two dimensions, saving the results in a `pandas.DataFrame`.

In [None]:
# Initialize PaCMAP model
pac = PaCMAP(n_components=2, random_state=42)

# Fit and transform the features
pac_features = pac.fit_transform(item_features)

# Convert features to DataFrame
pac_features = pd.DataFrame(pac_features, columns=['x', 'y'])
pac_features

Next, the code adds the personality factors and items as columns to `pac_features` to see how items cluster based on their similarity. 

In [None]:
# Adding personality factors and items to PaCMAP features
pac_features['factor'] = personality['factor']
pac_features['item'] = personality['item']
pac_features

The code next plots the PaCMAP features, with each point representing a personality item. The points are colored by factor, allowing us to see how items cluster based on their similarity.

In [None]:
# Plot pac features
sns.scatterplot(data=pac_features, x='x', y='y', hue='factor', s=100)
sns.despine(offset=10)

As illustrated, the items somewhat cluster according to their factor, again suggesting that the extracted features have captured some meaningful information about the items.
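
A visual impression of clustering can be complemented with a simple number: the mean similarity of item pairs from the same factor versus pairs from different factors. The sketch below uses a made-up 4 x 4 similarity matrix and two hypothetical factor labels; with the workshop data, one would presumably pass `predicted_sims.values` and the factor labels in the matching item order.

```python
import numpy as np

# Made-up similarity matrix for four items from two hypothetical factors
factors = np.array(['F1', 'F1', 'F2', 'F2'])
sims = np.array([[1.0, 0.6, 0.2, 0.1],
                 [0.6, 1.0, 0.3, 0.2],
                 [0.2, 0.3, 1.0, 0.7],
                 [0.1, 0.2, 0.7, 1.0]])

# same_factor[i, j] is True when items i and j share a factor
same_factor = factors[:, None] == factors[None, :]

# Use each unordered pair once and skip self-similarities
rows, cols = np.tril_indices(len(sims), k=-1)
pair_sims, pair_same = sims[rows, cols], same_factor[rows, cols]

within = pair_sims[pair_same].mean()
between = pair_sims[~pair_same].mean()
print(f'within-factor: {within:.2f}, between-factor: {between:.2f}')
```

A within-factor mean that clearly exceeds the between-factor mean is consistent with the clustering seen in the plot, and unlike the 2D projection it does not depend on the dimensionality reduction.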

## Bonus section: Reassigning Items to Constructs
Finally, we can ask how well the extracted features predict the constructs to which the items belong. We first extract the features for each construct.

In [None]:
# Extracting construct features
constructs = personality['construct'].unique()

# Extracting features for constructs
construct_features = model.encode(constructs)

# Convert features to DataFrame
construct_features = pd.DataFrame(construct_features, index=constructs)
construct_features

The code next computes the cosine similarity between the construct features and the item features. 

In [None]:
# Computing cosine similarity between constructs and items
construct_item_sims = cosine_similarity(construct_features, item_features)
construct_sims = pd.DataFrame(construct_item_sims, index=construct_features.index, columns=item_features.index)
construct_sims

We then find the closest construct to each item by finding the construct with the highest similarity. We store the result in `closest_construct` and then add it as a new column, `predicted_construct`, to the `personality` dataframe.

In [None]:
# Finding the closest construct to each item (the row label with the highest similarity in each column)
closest_construct = construct_sims.idxmax()
closest_construct
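
As a small illustration of what `idxmax` does here, consider a made-up similarity table with two hypothetical constructs and three hypothetical items, laid out like `construct_sims` (constructs as rows, items as columns).

```python
import pandas as pd

# Made-up construct-by-item similarities (rows: constructs, columns: items)
toy_sims = pd.DataFrame(
    [[0.9, 0.2, 0.4],
     [0.1, 0.7, 0.5]],
    index=['Construct A', 'Construct B'],
    columns=['item 1', 'item 2', 'item 3'],
)

# idxmax() works column by column by default (axis=0), so it returns,
# for each item, the row label (construct) with the highest similarity
toy_closest = toy_sims.idxmax()
print(toy_closest)
```

The result is a `Series` indexed by item, which is what makes the later `map` onto the item column possible. If two constructs tie for an item, `idxmax` returns the first one in row order.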

In [None]:
# Adding the closest constructs to original personality dataframe
personality['predicted_construct'] = personality['item'].map(closest_construct)
personality

In [None]:
# Evaluating how well the predicted constructs align with the actual constructs
accuracy = (personality['construct'] == personality['predicted_construct']).mean()
print(f'Accuracy: {accuracy:.2f}')

Predicting the constructs based on the similarity between the item and construct features results in an accuracy of 23%. Whilst this is an improvement on the roughly 3% accuracy that would be expected by chance, it is still relatively low. This could suggest that the extracted features do not fully capture the differences between the constructs, or (perhaps more interestingly) that the constructs are not as distinct as we might expect.
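
When judging an accuracy figure, it helps to compute the baselines explicitly. The sketch below is one way to do this under stated assumptions: `chance_accuracy` assumes guessing uniformly at random among the distinct constructs, and `majority_accuracy` assumes always predicting the most frequent construct. Both function names and the toy labels are hypothetical.

```python
from collections import Counter

def chance_accuracy(labels):
    """Expected accuracy of guessing uniformly at random among the distinct labels."""
    return 1 / len(set(labels))

def majority_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Made-up labels: 4 constructs with 5 items each
toy_labels = [c for c in 'ABCD' for _ in range(5)]
print(chance_accuracy(toy_labels), majority_accuracy(toy_labels))
```

With equally sized constructs the two baselines coincide. In the notebook, the same functions could be applied to `personality['construct']`.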

You can also visualize the confusion matrix to see how well the items were assigned to the constructs. We first compute the confusion matrix using `pd.crosstab` and then order its rows and columns by personality factor to make it easier to interpret.

In [None]:
# Confusion matrix
confusion_matrix = pd.crosstab(personality['construct'], personality['predicted_construct'])

# Sorting confusion matrix by personality factor 
ordered_constructs = personality.sort_values('factor')['construct'].unique()
ordered_predicted_constructs = [construct for construct in ordered_constructs if construct in personality['predicted_construct'].unique()]
confusion_matrix = confusion_matrix.loc[ordered_constructs, ordered_predicted_constructs]
confusion_matrix

When interpreting the confusion matrix, it is important to remember that the rows represent the actual constructs, while the columns represent the predicted constructs. The values in the cells represent the number of items assigned to each construct. The diagonal reflects the number of items "correctly" (i.e., according to experts) assigned to their construct, while off-diagonal values reflect items that were misclassified. Because columns exist only for constructs that were predicted at least once, the matrix need not be square; a "diagonal" cell is one whose row and column labels match. Finally, the maximum number of items that could be correctly assigned to a construct is 10, which is why the heatmap is capped at this value.
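
Reading the matrix can be practiced on a toy example. The sketch below uses six made-up items and two hypothetical constructs, reindexes the crosstab so that rows and columns share the same labels (filling never-predicted constructs with 0, an addition not made in the notebook), and then computes the share of each construct's items that were assigned back to it.

```python
import numpy as np
import pandas as pd

# Made-up true and predicted labels for six items
toy = pd.DataFrame({
    'construct':           ['A', 'A', 'A', 'B', 'B', 'B'],
    'predicted_construct': ['A', 'A', 'B', 'B', 'A', 'B'],
})

# Rows: actual constructs, columns: predicted constructs
toy_cm = pd.crosstab(toy['construct'], toy['predicted_construct'])

# Give rows and columns the same label order; never-predicted constructs become 0
labels = sorted(toy['construct'].unique())
toy_cm = toy_cm.reindex(index=labels, columns=labels, fill_value=0)

# Diagonal / row total = share of each construct's items assigned back to it
recall = pd.Series(np.diag(toy_cm.values), index=labels) / toy_cm.sum(axis=1)
print(toy_cm)
print(recall)
```

A per-construct share like this makes it easier to name the well-predicted and poorly predicted constructs than reading colors off the heatmap alone.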

In [None]:
# Plotting confusion matrix without numbers in cells
fig, ax = plt.subplots(figsize=(16, 12))
n_items_per_construct = 10 # Maximum possible number of correctly assigned items per construct
sns.heatmap(confusion_matrix, cmap='Blues', vmin=0, vmax=n_items_per_construct, ax=ax)

# Increasing x-tick label and y-tick label font size
ax.xaxis.set_tick_params(labelsize=12)
ax.yaxis.set_tick_params(labelsize=12)

As illustrated, while some constructs are well predicted (e.g., "Emotionality" and "Imagination"), most are less well predicted. 

**TASK**: Now rerun the entire notebook but with `model = SentenceTransformer('dwulff/mpnet-personality')` (you can find the right line via a `cmd + f` search). This is a model that has been fine-tuned on pairs of personality items to accurately predict the observed correlations between items. Although performance should be considerably better, it is important to be aware that this model has been fine-tuned on the same data that we are using to evaluate it, which gives it an unfair advantage.