# Intro
In this exercise, we will use the same feature extraction method as in the last exercise to encode the text of the personality items in the NEO Big 5 questionnaire. We will then use the cosine similarity between the encoded items to predict the correlations between the personality constructs to which they belong. Finally, we will compare these predicted correlations to the observed correlations between constructs, which are based on a large data set of participant responses to the items.

In [None]:
import sys
if 'google.colab' in sys.modules:
    # Installing packages in Google Colab environment
    !pip install datasets transformers

    # Mounting google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Changing working directory to ex2
    %cd /content/drive/MyDrive/LLM4JDM/ex2

You may notice that the "Preparing data" and "Feature Extraction" sections have exactly the same structure as in the last notebook. This illustrates the generalizability of the approach: we can use the same code to extract features from any text data, regardless of the specific task we are interested in. The only things that change are the data we load and the model we use.

# Preparing data
We again begin by loading the requisite packages, which are the same as before:
1. ```pandas```: A very popular package for reading and manipulating data in python.
2. ```datasets```: A HuggingFace (HF) package for loading and manipulating datasets in a format ready for use with HF models.
3. ```transformers```: A HF package for loading and manipulating transformer-based models.

In [None]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

Our item data has two columns:
1. ```construct```: The personality construct to which the item belongs.
2. ```text```: The item description.

In [None]:
# Loading data with pandas
neo_items = pd.read_csv('NEO_items.csv', usecols=['construct', 'text'])
neo_items

In [None]:
# Converting into a HuggingFace dataset
dat = Dataset.from_pandas(neo_items)
dat

In [None]:
# Loading the tokenizer
model_ckpt = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
print(f'Vocabulary size: {tokenizer.vocab_size}, max context length: {tokenizer.model_max_length}')

In [None]:
# Tokenizing the text
tokenize = lambda x: tokenizer(x['text'], padding=True, truncation=True)
dat = dat.map(tokenize, batched=True, batch_size=None)
print(tokenizer.decode(dat[0]['input_ids']))
dat

# Feature Extraction

In [None]:
import torch
from transformers import AutoModel

In [None]:
# Setting the format of the dataset to torch tensors for passing to the model
dat.set_format('torch', columns=['input_ids', 'attention_mask'])
dat

In [None]:
# Loading the model and moving it to the GPU if available
if torch.cuda.is_available():  # for nvidia GPUs
    device = torch.device('cuda')
elif torch.backends.mps.is_available():  # for Apple Metal Performance Shaders (mps) GPUs
    device = torch.device('mps')
else:
    device = torch.device('cpu')

device

In [None]:
# Loading the model
model = AutoModel.from_pretrained(model_ckpt).to(device)
f'Model inputs: {tokenizer.model_input_names}'

In [None]:
def extract_features(batch):
    """Extract features from a batch of items"""
    inputs = {k:v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
        return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}


dat = dat.map(extract_features, batched=True, batch_size=8)
dat
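
The `last_hidden_state[:, 0]` indexing in `extract_features` keeps only the hidden state at the first token position of each item, which for this tokenizer is the `[CLS]` token. The following is a minimal sketch of what that slice does, using a small random NumPy array as a stand-in for the model output. The shapes (2 items, 4 tokens, 3 hidden units) and the names `toy_hidden` and `first_token` are illustrative assumptions, not the real model dimensions.

```python
import numpy as np

# Hypothetical stand-in for model(**inputs).last_hidden_state,
# with shape (batch size, number of tokens, hidden units)
rng = np.random.default_rng(0)
toy_hidden = rng.normal(size=(2, 4, 3))

# Keep the first token position of every item, as extract_features does
first_token = toy_hidden[:, 0]

print(toy_hidden.shape)   # (2, 4, 3)
print(first_token.shape)  # (2, 3)
```

Each item is therefore represented by a single vector, no matter how many tokens it contains.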

# Comparing Predicted and Observed Construct Similarities

NumPy is a popular package for scientific computing in Python. We will only use it here for its ```triu_indices``` function, which returns the indices of the upper triangle of a matrix. We also import ```cosine_similarity``` from scikit-learn to compare the construct embeddings.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
# Converting the hidden state into a data frame for easy manipulation
embeds = pd.DataFrame(dat['hidden_state'])
embeds

In [None]:
# Adding the construct that each embedding represents
embeds['construct'] = neo_items['construct']

# Calculating the mean embedding for each construct
construct_embeds = embeds.groupby('construct').mean(numeric_only=True)
construct_embeds
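
To see what the group-wise mean does, here is a small sketch on made-up data. The two embedding dimensions, the values, and the construct labels `A` and `B` are hypothetical and chosen only so that the means are easy to check by hand.

```python
import pandas as pd

# Toy item embeddings with two dimensions (values are made up)
toy = pd.DataFrame({
    0: [1.0, 3.0, 0.0, 2.0],
    1: [0.0, 2.0, 4.0, 6.0],
    'construct': ['A', 'A', 'B', 'B'],
})

# One row per construct: the mean of its item embeddings
toy_means = toy.groupby('construct').mean(numeric_only=True)
print(toy_means)
```

Construct `A` ends up with the mean vector (2.0, 1.0) and construct `B` with (1.0, 5.0), so four item rows collapse into two construct rows.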

In [None]:
# Calculating the cosine similarity between construct embeddings
predicted = pd.DataFrame(
    cosine_similarity(construct_embeds), # cosine similarity between each pair of rows
    index=construct_embeds.index, # row names
    columns=construct_embeds.index # column names
)
predicted
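
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them and not on their magnitude. As a rough sketch of the calculation, the same quantity can be computed by hand with NumPy. The helper `cosine_sim_matrix` and the toy vectors are illustrative assumptions, and the sketch assumes no row is all zeros.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Cosine similarity between every pair of rows of X (no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale each row to length 1
    return unit @ unit.T  # dot products of unit vectors

# Three made-up 2-dimensional vectors
toy = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [3.0, 3.0]])
sims = cosine_sim_matrix(toy)
print(sims.round(3))
```

The first two vectors are orthogonal, so their similarity is 0, while the third sits at 45 degrees to both and has a similarity of about 0.707 with each. The diagonal is always 1 because every vector points in the same direction as itself.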

'NEO_correlations.csv' has three columns:
1. ```construct_1```: The first construct in the pair.
2. ```construct_2```: The second construct in the pair.
3. ```correlation```: The empirical correlation between the two constructs.

In [None]:
# Loading observed correlations (long format: one row per construct pair)
observed = pd.read_csv('NEO_correlations.csv')
observed

Pivoting the data transforms it from long to wide format. 

In [None]:
# Pivoting to a correlation matrix for easy comparison with predicted correlations
observed = observed.pivot(index='construct_1', columns='construct_2', values='correlation')
observed
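
As a small illustration of the long-to-wide transformation, the sketch below pivots a made-up table with the same three column names. The construct labels `A` and `B` and the correlation values are hypothetical and are not taken from the data file.

```python
import pandas as pd

# Long format: one row per (construct_1, construct_2) pair (values are made up)
long = pd.DataFrame({
    'construct_1': ['A', 'A', 'B', 'B'],
    'construct_2': ['A', 'B', 'A', 'B'],
    'correlation': [1.0, 0.3, 0.3, 1.0],
})

# Wide format: rows are construct_1, columns are construct_2
wide = long.pivot(index='construct_1', columns='construct_2', values='correlation')
print(wide)
```

Any pair that is missing from the long table would appear as `NaN` in the wide one.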

In [None]:
# Aligning the rows and columns of the predicted and observed correlations
predicted, observed = predicted.align(observed)

# Printing the column names to show that the orders are now the same
print(predicted.columns.tolist())
print(observed.columns.tolist())
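
To make the effect of `align` concrete, here is a minimal sketch with two made-up matrices whose labels are stored in different orders. The frames `a` and `b` and their values are hypothetical.

```python
import pandas as pd

# Same labels, but stored in a different order in each frame (values are made up)
a = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]], index=['B', 'A'], columns=['B', 'A'])
b = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]], index=['A', 'B'], columns=['A', 'B'])

# With the default arguments, align reindexes both rows and columns
a2, b2 = a.align(b)

print(a2.index.equals(b2.index), a2.columns.equals(b2.columns))
print(a2.loc['A', 'B'], b2.loc['A', 'B'])
```

After alignment the two frames share the same row and column order, so a given position refers to the same construct pair in both. The values themselves still belong to their labels: `a2.loc['A', 'B']` is 0.2 before and after.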

We next take the upper triangle (excluding the diagonal) of the predicted and observed correlation matrices and flatten each into a vector. We then calculate the correlation between the two vectors. Taking only one triangle ensures that we don't double count pairs (the correlation between A and B is the same as the correlation between B and A), and excluding the diagonal leaves out the correlation between a construct and itself (which is always 1). For a symmetric matrix, the upper and lower triangles hold the same values.

In [None]:
def upper_triangle_flat(df):
    """Takes the upper triangle of a dataframe (excluding the diagonal) and flattens it into a vector"""
    rows, cols = np.triu_indices(len(df), k=1)  # k=1 to exclude the diagonal (self-similarities)
    return pd.Series(df.values[rows, cols])

predicted, observed = upper_triangle_flat(predicted), upper_triangle_flat(observed)

# Correlation between predicted and observed
print(f'r: {predicted.corr(observed).round(2)}')
print(f'r of absolute values: {predicted.abs().corr(observed.abs()).round(2)}')
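
The flattening step can be checked on a small made-up matrix. In the sketch below, the 3×3 matrix and its labels are hypothetical. With three constructs there are 3 × 2 / 2 = 3 unique pairs, so the flattened vector has three entries.

```python
import numpy as np
import pandas as pd

# Made-up symmetric similarity matrix for three constructs
toy = pd.DataFrame(
    [[1.0, 0.2, 0.5],
     [0.2, 1.0, 0.7],
     [0.5, 0.7, 1.0]],
    index=list('ABC'), columns=list('ABC'),
)

# Row and column positions above the diagonal (k=1 skips the diagonal itself)
rows, cols = np.triu_indices(len(toy), k=1)
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 1), (0, 2), (1, 2)]

# One value per unique pair: A-B, A-C, B-C
flat = pd.Series(toy.values[rows, cols])
print(flat.tolist())  # [0.2, 0.5, 0.7]
```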

# Conclusion
It seems we can explain a substantial portion of the inter-construct relationships based purely on the semantic information in the items (correlation of absolute values, $r=.32$). Why do you think that is? ;)