# Day 2

In this analysis, inspired by [Wulff & Mata, 2023](https://osf.io/preprints/psyarxiv/9h7aw), we will use an LLM to extract features from personality items. We will then use these features to compute the similarity between items, evaluate how well these predicted similarities match the observed similarities, and visualize the items in two dimensions. Finally, we will assign each item to a personality construct based on its similarity to the constructs.

By the end of this analysis, you will have learned how to:
- Extract features from text using a pre-trained LLM
- Compute the similarity between items using cosine similarity
- Use these similarities to predict the construct to which an item belongs, and thus potentially improve construct validity

## Environment Setup 

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Installing requisite packages
    !pip install pacmap sentence-transformers &> /dev/null

    # Change working directory to day_2
    %cd /content/drive/MyDrive/LLM4BeSci_GSERM2024/day_2

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from pacmap import PaCMAP
import seaborn as sns
import matplotlib.pyplot as plt

## Extracting Features from Personality Items

We begin by loading the personality items into a `pandas.DataFrame` with three columns:

1. `factor`: The (high-level) personality factor to which the item belongs.
2. `construct`: The (mid-level) personality construct to which the item belongs.
3. `item`: The text of the personality item used to measure the construct.

Run the cell below.

In [None]:
# Loading personality data
personality = pd.read_csv('items.csv') 
personality

The code below uses the `all-MiniLM-L6-v2` model to extract features from the personality items. It loads the model using the `sentence_transformers` library and extracts a vector of features for each item with the `encode` method. It then converts the features to a `pandas.DataFrame` for further analysis and easy viewing.

Run the cell below.

In [None]:
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  

# Extract features from personality items
item_features = model.encode(personality['item'])

# Convert features to DataFrame
item_features = pd.DataFrame(item_features, index=personality['item'])
item_features

## Computing Similarity between Personality Items
Now that we have extracted features for each personality item, we can compute the similarity between items. We use `sklearn`'s `cosine_similarity` function, which measures the cosine of the angle between two vectors. The closer the cosine similarity is to 1, the more similar the two items are. We compute the similarity between all pairs of items and store the results in a similarity matrix.

Run the cell below.

In [None]:
# Compute cosine similarity between features
predicted_sims = cosine_similarity(item_features)
predicted_sims = pd.DataFrame(predicted_sims, index=personality['item'], columns=personality['item'])
predicted_sims

As you can see, the similarity matrix is symmetric, with the diagonal containing 1s (since the similarity of an item with itself is 1). Furthermore, items that you would expect to be more related (e.g. "Turn plans into actions" and "Plunge into tasks with all my heart.") are indeed more similar. Conversely, less related items (e.g. "Am calm in tense situations" and "Demand quality") show lower cosine similarities.
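
To make these two properties concrete, the sketch below computes cosine similarity by hand with NumPy, as the dot product of length-normalized vectors. The three vectors are made-up toy values, not real embeddings, and the variable names are only illustrative.

```python
import numpy as np

# Toy "feature vectors" (made-up values, not real embeddings)
toy_features = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as the first row, twice as long
    [0.0, 0.0, 3.0],   # orthogonal to the first two rows
])

# Scale each row to unit length; the pairwise dot products are then the cosines
unit = toy_features / np.linalg.norm(toy_features, axis=1, keepdims=True)
toy_sims = unit @ unit.T

print(np.round(toy_sims, 2))
```

Rows that point in the same direction get a similarity of 1 regardless of their length, and orthogonal rows get a similarity of 0.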

## Comparing to Observed Correlations between Items
This section compares how well the predicted similarities align with the observed similarities between items--that is, the correlations between the participant responses to the items. It first loads the observed correlations into a `pandas.DataFrame`:

In [None]:
# Load observed correlations
observed_sims = pd.read_csv('item_corrs.csv')
observed_sims

Next, the code pivots `observed_sims` to create a correlation matrix with the same structure as `predicted_sims` so that they can be easily compared.
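
As a rough illustration of what the pivot does, the sketch below reshapes a small long-format table into a matrix. The column names `text_i`, `text_j`, and `cor` follow the cell below; the item labels and correlation values are made up.

```python
import pandas as pd

# Toy long-format correlations (made-up values); each pair appears in both directions
toy_long = pd.DataFrame({
    'text_i': ['A', 'A', 'B', 'B', 'C', 'C'],
    'text_j': ['B', 'C', 'A', 'C', 'A', 'B'],
    'cor':    [0.5, 0.1, 0.5, -0.2, 0.1, -0.2],
})

# One row per text_i, one column per text_j, cells filled with the correlation
toy_matrix = toy_long.pivot(index='text_i', columns='text_j', values='cor')
print(toy_matrix)
```

Pairs that are absent from the long table (here, each item paired with itself) show up as `NaN` in the resulting matrix.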

In [None]:
# Pivoting to a correlation matrix for easy comparison with predicted correlations
observed_sims = observed_sims.pivot(index='text_i', columns='text_j', values='cor')
observed_sims

The predicted and observed similarities are then aligned to ensure that the items are in the same order. Because both matrices are symmetric, the code flattens only one triangle of each matrix (excluding the diagonal) into a vector, so that every pair of items is counted once, and then computes the correlation between the predicted and observed vectors.
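
To see what flattening one triangle means, here is a minimal sketch on a made-up 3×3 similarity matrix. For a symmetric matrix, either triangle contains the same set of values.

```python
import numpy as np
import pandas as pd

# Toy symmetric similarity matrix (made-up values)
toy = pd.DataFrame(
    [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.5],
     [0.3, 0.5, 1.0]],
    index=list('ABC'), columns=list('ABC'),
)

# Positions strictly above the diagonal: (A, B), (A, C), (B, C)
rows, cols = np.triu_indices(len(toy), k=1)
toy_flat = pd.Series(toy.values[rows, cols])

print(toy_flat.tolist())
```

Three items give three unique pairs; in general, `n` items give `n * (n - 1) / 2` values.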

In [None]:
# Aligning observed and predicted similarities
predicted_sims, observed_sims = predicted_sims.align(observed_sims)

def upper_triangle_flat(df):
    """Flattens the upper triangle of a square dataframe (excluding the diagonal) into a vector"""
    rows, cols = np.triu_indices(len(df), k=1)  # k=1 excludes the diagonal (self-similarities)
    return pd.Series(df.values[rows, cols])

# Flatten the upper triangle of the predicted and observed similarities into vectors
predicted_sims_flat = upper_triangle_flat(predicted_sims)
observed_sims_flat = upper_triangle_flat(observed_sims)

# Correlation between predicted and observed
print(f'r: {predicted_sims_flat.corr(observed_sims_flat).round(2)}')
print(f'r of absolute values: {predicted_sims_flat.abs().corr(observed_sims_flat.abs()).round(2)}')

## Visualizing the Item Similarities
We can also visualize the items in two dimensions using PaCMAP. PaCMAP is a dimensionality reduction technique designed to preserve both the local and the global structure of the data. The code fits the PaCMAP model to the extracted features and transforms them into two dimensions, saving the results in a `pandas.DataFrame`.

In [None]:
# Initialize PaCMAP model
pac = PaCMAP(n_components=2, random_state=42)

# Fit and transform the features
pac_features = pac.fit_transform(item_features)

# Convert features to DataFrame
pac_features = pd.DataFrame(pac_features, columns=['x', 'y'])
pac_features

Next, the code adds the personality factors and items as columns to `pac_features` to see how items cluster based on their similarity. 

In [None]:
# Adding personality factors and items to PaCMAP features
pac_features['factor'] = personality['factor']
pac_features['item'] = personality['item']
pac_features

The code next plots the PaCMAP features, with each point representing a personality item. The points are colored by factor, allowing us to see how items cluster based on their similarity.

In [None]:
# Plot pac features
sns.scatterplot(data=pac_features, x='x', y='y', hue='factor', s=100)
sns.despine(offset=10)

## Reassigning Items to Constructs
Finally, we can ask how well the extracted features predict the constructs to which the items belong. We first extract the features for each construct. We then compute the cosine similarity between the construct features and the item features. We assign each item to the construct with which it has the highest similarity.
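
The three steps above can be sketched on toy data as follows. The 2-D vectors and the labels (`construct_a`, `item_1`, and so on) are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 2-D "embeddings" for two constructs and three items
toy_constructs = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]],
                              index=['construct_a', 'construct_b'])
toy_items = pd.DataFrame([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
                         index=['item_1', 'item_2', 'item_3'])

# Rows are constructs, columns are items
toy_sims = pd.DataFrame(
    cosine_similarity(toy_constructs, toy_items),
    index=toy_constructs.index, columns=toy_items.index,
)

# For each item (column), the label of the most similar construct (row)
toy_assignment = toy_sims.idxmax()
print(toy_assignment)
```

Here `item_3` sits between the two constructs but leans slightly towards `construct_a`, so that is the label it receives; the assignment always picks a single winner, however close the runner-up.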

In [None]:
# Extracting construct features
constructs = personality['construct'].unique()

# Extracting features for constructs
construct_features = model.encode(constructs)

# Convert features to DataFrame
construct_features = pd.DataFrame(construct_features, index=constructs)
construct_features

The code next computes the cosine similarity between the construct features and the item features. 

In [None]:
# Computing cosine similarity between constructs and items
construct_item_sims = cosine_similarity(construct_features, item_features)
construct_sims = pd.DataFrame(construct_item_sims, index=construct_features.index, columns=item_features.index)
construct_sims

We then find the closest construct to each item, that is, the construct with the highest similarity. We add this as a new column, `predicted_construct`, to the `personality` dataframe.

In [None]:
# Finding the closest construct to each item (the row label of each column's maximum)
closest_construct = construct_sims.idxmax()
closest_construct

In [None]:
# Adding the closest constructs to original personality dataframe
personality['predicted_construct'] = personality['item'].map(closest_construct)
personality

In [None]:
# Evaluating how well the predicted constructs align with the actual constructs
accuracy = (personality['construct'] == personality['predicted_construct']).mean()
print(f'Accuracy: {accuracy:.2f}')

You can also visualize the confusion matrix to see how well the items were assigned to the constructs.
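
As a small, hedged illustration of how `pd.crosstab` builds such a table, the sketch below uses made-up labels in which one construct is never predicted. The column names mirror those in the `personality` dataframe; everything else is a toy assumption.

```python
import pandas as pd

# Made-up true and predicted labels; 'c' is never predicted
toy = pd.DataFrame({
    'construct':           ['a', 'a', 'b', 'b', 'c'],
    'predicted_construct': ['a', 'b', 'b', 'b', 'a'],
})

# Rows are true constructs, columns are predicted constructs, cells are counts
toy_cm = pd.crosstab(toy['construct'], toy['predicted_construct'])

# crosstab only creates columns for labels that occur, so add missing ones as zeros
labels = sorted(toy['construct'].unique())
toy_cm = toy_cm.reindex(index=labels, columns=labels, fill_value=0)
print(toy_cm)
```

The diagonal counts the correct assignments, so dividing its sum by the number of items recovers the accuracy (3 of 5 in this toy case).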

In [None]:
# Confusion matrix
confusion_matrix = pd.crosstab(personality['construct'], personality['predicted_construct'])

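# Sorting confusion matrix by personality factor
# (reindex fills constructs that were never predicted with zeros instead of raising a KeyError)
ordered_constructs = personality.sort_values('factor')['construct'].unique()
confusion_matrix = confusion_matrix.reindex(index=ordered_constructs, columns=ordered_constructs, fill_value=0)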
confusion_matrix

In [None]:
# Plotting confusion matrix without numbers in cells
fig, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(confusion_matrix, cmap='Blues', ax=ax)

# Increasing x-tick label and y-tick label font size
ax.xaxis.set_tick_params(labelsize=12)
ax.yaxis.set_tick_params(labelsize=12)

**TASK**: Now rerun the entire notebook but with `model = SentenceTransformer('dwulff/mpnet-personality')` to see how the results change with a different model that has been fine-tuned on pairs of personality items to accurately predict the observed correlations between items. Although performance is considerably better, it is important to be aware that this model has been fine-tuned on the same data that we are using to evaluate it, which gives it an unfair advantage.