## Personality Prediction from text data
1. Reference: Mehta et al. (2020), "Bottom-Up and Top-Down: Predicting Personality with Psycholinguistic and Language Model Features"  
(https://sentic.net/predicting-personality-with-psycholinguistic-and-language-model-features.pdf)

2. Data: essays.csv (https://github.com/yashsmehta/personality-prediction/tree/master/data/essays)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from transformers import BertModel, BertTokenizer
from transformers import get_linear_schedule_with_warmup

import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

### Exploring Data

In [2]:
df = pd.read_csv('essays.csv')

In [3]:
df.head(10)

Unnamed: 0,#AUTHID,text,cEXT,cNEU,cAGR,cCON,cOPN,Unnamed: 7
0,1997_504851.txt,"Well, right now I just woke up from a mid-day ...",n,y,y,n,y,
1,1997_605191.txt,"Well, here we go with the stream of consciousn...",n,n,y,n,n,
2,1997_687252.txt,An open keyboard and buttons to push. The thin...,n,y,n,y,y,
3,1997_568848.txt,I can't believe it! It's really happening! M...,y,n,y,y,n,
4,1997_688160.txt,"Well, here I go with the good old stream of co...",y,n,y,n,y,
5,1997_722902.txt,Today. Had to turn the music down. Today I wen...,y,n,y,n,y,
6,1997_724708.txt,Stream of consciousness. What should I write a...,n,n,y,n,n,
7,1997_724794.txt,The RTF305 Usenet site is a piece of garbage! ...,n,n,n,y,y,
8,1997_628043.txt,I'm really unsure about this assignment becaus...,y,y,n,y,y,
9,1997_708036.txt,Today was a tough day for me. I can't believed...,y,y,y,y,n,


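### About the Input Data
Essays written by college students. As an assignment for a psychology class, each student wrote in a stream-of-consciousness style for 20 minutes.
### About the Target Data (Big Five Personality Traits)
The Big Five personality test results of the student who wrote each essay.  
OPN, Openness: the degree to which a person is open to new ideas, creative, and experience-seeking.  
CON, Conscientiousness: the degree to which a person is organized, responsible, and works hard to achieve goals.  
EXT, Extraversion: the tendency to be sociable, active, and outgoing.  
AGR, Agreeableness: the degree to which a person is cooperative and values relationships with others.  
NEU, Neuroticism: the degree to which a person is emotionally anxious, less stable, and quick to react to stress.  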
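
Each trait is stored as `y` or `n` in the CSV, so the five columns have to become a 0/1 vector before training. A minimal sketch of that mapping, assuming the column names shown in the preview above; the helper name `encode_traits` is hypothetical and not part of the notebook:

```python
# Hypothetical helper: map the 'y'/'n' trait columns of one row to a 0/1 vector.
TRAIT_COLUMNS = ["cEXT", "cNEU", "cAGR", "cCON", "cOPN"]

def encode_traits(row):
    """Return a list of 0/1 labels in TRAIT_COLUMNS order; 'y' -> 1, 'n' -> 0."""
    labels = []
    for column in TRAIT_COLUMNS:
        value = row[column]
        if value not in ("y", "n"):
            # Fail loudly instead of silently treating unknown values as positive
            raise ValueError(f"unexpected label {value!r} in column {column}")
        labels.append(1 if value == "y" else 0)
    return labels

# Labels of the first row in the preview above
first_row = {"cEXT": "n", "cNEU": "y", "cAGR": "y", "cCON": "n", "cOPN": "y"}
print(encode_traits(first_row))  # [0, 1, 1, 0, 1]
```

Rejecting anything other than `y` or `n` is a design choice of this sketch; it guards against stray values being counted as positives.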

### Making Dataset

In [4]:
inputs = df['text'].values
targets = df[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']].values

In [5]:
# train, validation, test data split (80-10-10 ratio)
train_inputs, test_inputs, train_targets, test_targets = train_test_split(inputs, targets, test_size=0.1, random_state=42)
train_inputs, val_inputs, train_targets, val_targets = train_test_split(train_inputs, train_targets, test_size=0.111, random_state=42)
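
The second split uses 0.111 because it is applied to the remaining 90% of the data, and 0.111 × 0.9 ≈ 0.1. A minimal sketch of that arithmetic, assuming a made-up dataset of 1,000 essays and assuming held-out counts are rounded up (an assumption about the splitter's rounding, not something the notebook states):

```python
import math

def split_sizes(n_samples, test_share=0.1, val_share_of_rest=0.111):
    """Approximate train/val/test sizes for two successive hold-out splits."""
    n_test = math.ceil(test_share * n_samples)      # first split: hold out the test set
    n_rest = n_samples - n_test
    n_val = math.ceil(val_share_of_rest * n_rest)   # second split: hold out validation
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)  # 1000 is a made-up size
print(n_train, n_val, n_test)  # 800 100 100
```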

In [6]:
# BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [35]:
class EssayPersonalityDataset(Dataset):
    def __init__(self, inputs, targets, tokenizer, max_len):
        self.inputs = inputs
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        input_text = str(self.inputs[idx])
        target = self.targets[idx]

        # Encode targets for each trait
        encoded_targets = [0 if trait == 'n' else 1 for trait in target]

        encoding = self.tokenizer.encode_plus(
            input_text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            return_attention_mask=True,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        target_tensor = torch.tensor(encoded_targets, dtype=torch.float)
        target_tensor = torch.squeeze(target_tensor)

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': target_tensor
        }


### Determine MAX_LEN for essay tokens
The 95th-percentile token counts computed below range from about 1,311 to 1,348 across the splits (about 1,340 for the full dataset), so a MAX_LEN of 1350 would cover most essays.   
However, the bert-base-uncased model supports at most 512 tokens per input, so MAX_LEN is set to 512 and longer essays are truncated.
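
To gauge how much text a 512-token cap discards, one can count how many essays exceed the limit. A minimal sketch, assuming `token_counts` is a list of per-essay token counts (the values here are made up for illustration) and reserving two positions for the `[CLS]` and `[SEP]` tokens that BERT adds; the helper name `truncation_stats` is hypothetical:

```python
def truncation_stats(token_counts, max_len=512, num_special_tokens=2):
    """Return (share of texts truncated, share of tokens kept) under a length cap."""
    budget = max_len - num_special_tokens  # room left for word-piece tokens
    truncated = sum(1 for n in token_counts if n > budget)
    kept = sum(min(n, budget) for n in token_counts)
    return truncated / len(token_counts), kept / sum(token_counts)

token_counts = [300, 510, 800, 1020]  # hypothetical counts, not from essays.csv
share_truncated, share_kept = truncation_stats(token_counts)
print(share_truncated)  # 0.5
```

The same function can be fed the per-essay counts collected by a token analysis to see what fraction of the corpus is affected.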

In [36]:
def analyze_tokens(inputs, tokenizer):
    max_token = 0
    sum_token = 0
    tokens_list = []

    for sample in inputs:
        # Tokenize the sample text
        tokens = tokenizer.tokenize(sample)
        token_count = len(tokens)

        if token_count > max_token:
            max_token = token_count
        sum_token += token_count
        tokens_list.append(token_count)

    mean_token = sum_token / len(inputs)
    percentile_90 = np.percentile(tokens_list, 90)
    percentile_95 = np.percentile(tokens_list, 95)

    return max_token, mean_token, percentile_90, percentile_95

In [37]:
train_info = analyze_tokens(train_inputs, tokenizer)
val_info = analyze_tokens(val_inputs, tokenizer)
test_info = analyze_tokens(test_inputs, tokenizer)
info = analyze_tokens(inputs, tokenizer)

print(train_info)
print(val_info)
print(test_info)
print(info)

(3179, 788.5382665990877, 1193.8, 1347.5999999999995)
(1905, 763.004048582996, 1155.8, 1323.3999999999999)
(2401, 803.7327935222672, 1187.0, 1310.8999999999994)
(3179, 787.5030401297122, 1190.4, 1340.3999999999996)


### Generate Dataloader from dataset

In [38]:
# Create dataset instances
MAX_LEN = 512
train_dataset = EssayPersonalityDataset(train_inputs, train_targets, tokenizer, MAX_LEN)
val_dataset = EssayPersonalityDataset(val_inputs, val_targets, tokenizer, MAX_LEN)
test_dataset = EssayPersonalityDataset(test_inputs, test_targets, tokenizer, MAX_LEN)

# Create dataloaders
BATCH_SIZE = 16
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

In [39]:
# Check the target shape
for i, batch in enumerate(train_dataloader):
    if i == 0:
        print(batch['targets'].shape)
        break

torch.Size([16, 5])


### Define MLP Classifier

In [40]:
# Define a feedforward neural network (MLP) for classification
class MLPClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(MLPClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


In [44]:

# BERT model
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.train()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize the classifier
classifier = MLPClassifier(input_dim=bert_model.config.hidden_size, hidden_dim=512, output_dim=5).to(device)


### Training BERT_MLP model

In [45]:
def train_model(model, classifier, train_dataloader, validation_dataloader, device, epochs=5):
    # Define optimizers
    model_optimizer = AdamW(model.parameters(), lr=2e-5)
    classifier_optimizer = AdamW(classifier.parameters(), lr=1e-3)

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Define schedulers
    model_scheduler = get_linear_schedule_with_warmup(model_optimizer,
                                                      num_warmup_steps=0,
                                                      num_training_steps=total_steps)
    classifier_scheduler = get_linear_schedule_with_warmup(classifier_optimizer,
                                                           num_warmup_steps=0,
                                                           num_training_steps=total_steps)

    # Move model and classifier to the correct device
    model.to(device)
    classifier.to(device)

    # Multi-label loss: binary cross-entropy applied to the raw logits of each trait
    loss_fn = torch.nn.BCEWithLogitsLoss()

    for epoch in range(epochs):
        model.train()
        classifier.train()
        total_train_loss = 0

        for step, batch in enumerate(train_dataloader):
            batch = {k: v.to(device) for k, v in batch.items()}  # Move batch to device

            # Zero gradients for both optimizers
            model_optimizer.zero_grad()
            classifier_optimizer.zero_grad()

            # Forward pass
            outputs = model(batch['input_ids'], batch['attention_mask'])
            pooled_output = outputs.pooler_output
            logits = classifier(pooled_output)

            # Calculate loss
            loss = loss_fn(logits, batch['targets'])
            total_train_loss += loss.item()

            # Backward pass and optimize
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            torch.nn.utils.clip_grad_norm_(classifier.parameters(), 1.0)
            model_optimizer.step()
            classifier_optimizer.step()
            model_scheduler.step()
            classifier_scheduler.step()

        # Print average training loss
        avg_train_loss = total_train_loss / len(train_dataloader)
        print('Epoch:', epoch + 1, "Average training loss:", avg_train_loss)

        # Validation phase
        model.eval()
        classifier.eval()
        total_eval_accuracy = 0

        for batch in validation_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}

            with torch.no_grad():
                outputs = model(batch['input_ids'], batch['attention_mask'])
                logits = classifier(outputs.pooler_output)
                preds = torch.round(torch.sigmoid(logits))
                total_eval_accuracy += (preds == batch['targets']).sum().item()

        # Calculate and print average validation accuracy
        total_predictions = len(validation_dataloader.dataset) * 5  # 5 trait labels per essay
        avg_val_accuracy = total_eval_accuracy / total_predictions
        print('Validation accuracy:', avg_val_accuracy)
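
Both schedulers in the function above decay the learning rate linearly from its initial value to zero over `total_steps`, with no warmup because `num_warmup_steps=0`. A rough pure-Python sketch of that multiplier, written from the documented behavior of a linear warmup and decay schedule rather than copied from the library; the function name `linear_schedule_factor` is hypothetical:

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 1000  # made-up value; the notebook uses len(train_dataloader) * epochs
for step in (0, 500, 1000):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With zero warmup the learning rate starts at its full value on the first step and reaches zero on the last one.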


In [47]:
# Train the model
train_model(bert_model, classifier, train_dataloader, val_dataloader, device, epochs=10)




Epoch: 1 Average training loss: 0.6763714691323619
Validation accuracy: 0.5481781376518219
Epoch: 2 Average training loss: 0.6595407503266488
Validation accuracy: 0.5352226720647774
Epoch: 3 Average training loss: 0.6291447171280461
Validation accuracy: 0.5319838056680162
Epoch: 4 Average training loss: 0.6000802824574132
Validation accuracy: 0.5376518218623482
Epoch: 5 Average training loss: 0.5694314587500787
Validation accuracy: 0.5417004048582996
Epoch: 6 Average training loss: 0.5419450279685759
Validation accuracy: 0.5368421052631579
Epoch: 7 Average training loss: 0.5042689863231874
Validation accuracy: 0.5506072874493927
Epoch: 8 Average training loss: 0.47298867279483425
Validation accuracy: 0.5530364372469636
Epoch: 9 Average training loss: 0.44433406668324626
Validation accuracy: 0.5441295546558704
Epoch: 10 Average training loss: 0.42657207192913177
Validation accuracy: 0.5433198380566802
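
The training loss in the log above comes from `BCEWithLogitsLoss`. With its default mean reduction, it averages the binary cross-entropy over all $B$ essays in a batch and all $K = 5$ traits. Written out, with $z_{ik}$ the logit, $y_{ik}$ the 0/1 target, and $\sigma$ the sigmoid:

$$
\mathcal{L} = -\frac{1}{BK} \sum_{i=1}^{B} \sum_{k=1}^{K} \Big[ y_{ik} \log \sigma(z_{ik}) + (1 - y_{ik}) \log\big(1 - \sigma(z_{ik})\big) \Big]
$$

As a reference point, a model that outputs $z_{ik} = 0$ (probability 0.5) for every label has a loss of $\ln 2 \approx 0.693$, which is close to the epoch 1 training loss in the log.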


### Test

In [50]:
def evaluate_model(model, classifier, dataloader, device):
    # Put both the encoder and the classification head in inference mode
    model.eval()
    classifier.eval()
    total_correct_predictions = 0
    total_predictions = 0  # To keep track of the total number of predictions

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            targets = batch['targets'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = classifier(outputs.pooler_output)

            # Threshold the sigmoid probabilities at 0.5 to get 0/1 predictions
            preds = torch.round(torch.sigmoid(logits))
            total_correct_predictions += (preds == targets).sum().item()
            total_predictions += targets.numel()  # Total number of label predictions

    # Calculate the accuracy as the total correct predictions over the total number of predictions
    average_accuracy = total_correct_predictions / total_predictions
    return average_accuracy


In [51]:
# Evaluate on the test data
test_accuracy = evaluate_model(bert_model, classifier, test_dataloader, device)
print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.5530364372469636
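
The reported number pools all five traits into one label-wise accuracy. A per-trait breakdown can show whether some traits are predicted better than others. A minimal sketch on plain Python lists, assuming `preds` and `targets` are lists of 0/1 rows in the order EXT, NEU, AGR, CON, OPN (the sample values are made up, and `per_trait_accuracy` is a hypothetical helper):

```python
TRAITS = ["EXT", "NEU", "AGR", "CON", "OPN"]

def per_trait_accuracy(preds, targets):
    """Return ({trait: accuracy}, pooled label-wise accuracy)."""
    n = len(targets)
    correct = [0] * len(TRAITS)
    for pred_row, target_row in zip(preds, targets):
        for k, (p, t) in enumerate(zip(pred_row, target_row)):
            correct[k] += int(p == t)  # count matches separately for each trait
    by_trait = {trait: correct[k] / n for k, trait in enumerate(TRAITS)}
    pooled = sum(correct) / (n * len(TRAITS))
    return by_trait, pooled

preds = [[1, 0, 1, 0, 1], [0, 0, 1, 1, 1]]    # hypothetical predictions
targets = [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]  # hypothetical labels
by_trait, pooled = per_trait_accuracy(preds, targets)
print(by_trait, pooled)
```

The pooled value corresponds to the metric used in the evaluation above: correct labels divided by the total number of labels.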


In [55]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [56]:
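model_save_path = '/content/drive/My Drive/model.pth'
# Save the fine-tuned BERT encoder together with the MLP head; both are needed for inference
torch.save({'bert': bert_model.state_dict(), 'classifier': classifier.state_dict()}, model_save_path)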