<a href="https://colab.research.google.com/github/andcut/text2trait/blob/main/pretrained_personality_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code was originally written by Andrew Cutler (andrew.d.cutler@gmail.com) to transform text into a 105-dimensional personality embedding. The pipeline is text -> RoBERTa embedding -> personality embedding. The result can then be used for downstream tasks such as clustering or supervised learning. For more information, see [Language Modeling for Personality Prediction](https://open.bu.edu/handle/2144/41942).





In [None]:
#install the sentence-transformers package https://github.com/UKPLab/sentence-transformers
!pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer
import pandas as pd

In [4]:
#download personality model
!git clone https://github.com/andcut/text2trait.git
!unzip text2trait/RoBERTa_to_IPIP100.zip

Cloning into 'text2trait'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.


In [16]:
#define roberta and the IPIP100 embedding head.
from tensorflow import keras
#will take a minute to download roberta-large
roberta = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')
IPIP100 = keras.models.load_model("RoBERTa_to_IPIP100.h5")

In [12]:
##IMPORTANT##
#docs should be a list of your text documents
docs = ["text number one", "country roads, take me home", "I like to party, I like to gamble", ":((((", "In this essay I will..."]

#transforms text into vectors that are 1024 long. Might take a couple mins
X_roberta = roberta.encode(docs, show_progress_bar = True, convert_to_numpy = True)

#Transforms these 1024 general purpose language vectors to 105 personality-loaded dimensions
X = IPIP100.predict(X_roberta)

#y should be a list of labels, one per document. These are dummy values so the code below will run.
y = [0]*len(docs)
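
The introduction mentions clustering as one downstream use of the embedding. The block below is a minimal sketch of that idea, not part of the original notebook. It assumes scikit-learn's `KMeans`, and it uses a random stand-in array (`X_demo`, a hypothetical name) in place of the real output of `IPIP100.predict`, so it runs without downloading any models. The choice of three clusters is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real embedding: 20 fake documents x 105 personality dimensions.
# In the notebook, this array would come from IPIP100.predict(X_roberta) instead.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 105))

# n_clusters=3 is an arbitrary value for illustration, not a recommendation.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_demo)

print(cluster_ids.shape)               # one cluster id per document
print(kmeans.cluster_centers_.shape)   # one 105-dimensional centroid per cluster
```

With real data, replacing `X_demo` with `X` should be the only change needed, although the number of clusters would have to be chosen for the corpus at hand.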





Now that the data is loaded as X (inputs) and y (outputs/labels), let's build a model and test how well y can be predicted from X. This step runs much faster than the embedding step above.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeCV
from sklearn import metrics
import numpy as np

R2_matrix = []
#repeat the evaluation over 100 different random 80/20 train/test splits
for i in range(100):
  (x_train, x_test, y_train, y_test) = train_test_split(X, y, train_size = 0.8, random_state = i)
  #cv=None uses efficient leave-one-out cross-validation to choose alpha
  model = RidgeCV(alphas = np.logspace(-2,3,1000), cv = None)
  model.fit(x_train,y_train)
  y_pred = model.predict(x_test)
  R2_matrix.append([metrics.r2_score(y_test,y_pred)])
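
The loop above lets `RidgeCV` pick the ridge penalty from a grid spanning 1e-2 to 1e3. As a hedged sanity check, not something the original notebook does, the selected value can be read from the fitted model's `alpha_` attribute. The sketch below uses synthetic data (`X_demo`, `y_demo`, and `model_demo` are hypothetical names) and a coarser 50-point grid so that it runs quickly.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem: 60 samples, 105 features,
# with labels driven by the first three features plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 105))
y_demo = X_demo[:, :3].sum(axis=1) + 0.1 * rng.normal(size=60)

# Same range as the notebook's grid, but coarser to keep the demo fast.
alphas = np.logspace(-2, 3, 50)
model_demo = RidgeCV(alphas=alphas, cv=None).fit(X_demo, y_demo)

# The penalty selected by leave-one-out cross-validation.
print(model_demo.alpha_)

# True if the selected value is the smallest or largest value in the grid.
at_edge = bool(np.isclose(model_demo.alpha_, [alphas[0], alphas[-1]]).any())
print(at_edge)
```

If `at_edge` comes out true on real data, the best penalty may lie outside the grid, and widening the `np.logspace` range is a reasonable next step.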




In [None]:
R2_matrix = np.array(R2_matrix)
R2_mean = np.mean(R2_matrix)
R2_std = np.std(R2_matrix)
R2_median = np.median(R2_matrix)
# Print the mean, standard deviation, and median of R2 across the splits
print(R2_mean, R2_std, R2_median)
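
To help read these numbers, the sketch below computes R2 by hand with NumPy as 1 - SS_res / SS_tot. The helper `r2` is a hypothetical name written for illustration; the notebook itself uses `metrics.r2_score`. A score of 1 means perfect predictions, 0 means no better than always predicting the mean of the test labels, and negative values mean worse than that.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, y_true))                  # perfect predictions
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))    # always predicting the mean
print(r2(y_true, [4.0, 3.0, 2.0, 1.0]))    # worse than predicting the mean
```

Note that SS_tot is zero whenever the test labels are all identical or the test split holds a single example, so R2 is not well defined in those cases. That is the situation with the five placeholder documents and dummy labels above, so the summary only becomes meaningful once real documents and labels are supplied.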