# Zero-shot Classify Scenario Responses in Terms of Big 5 Personality

## Instructions

In this notebook I provide example Python code for zero-shot text classification. For in-depth details, please see this [blog post](https://dmracek.github.io/explorations_code_nbs/exploration_zs_big5/).

I have provided comments throughout; the code should run as is.

**If you're running this on a CPU rather than a GPU, I recommend skipping the classification cells and instead reading in the results from a previously saved CSV.**

## Data

*These data are from the Society for Industrial and Organizational Psychology (SIOP) 2019 Machine Learning Competition*:

Scenario responses were designed to promote variability in terms of a specific Big 5 personality trait. Specifically,

`open_ended_1` corresponds with string responses designed to elicit the **Agreeableness** personality trait.

`open_ended_2` corresponds with string responses designed to elicit the **Conscientiousness** personality trait.

`open_ended_3` corresponds with string responses designed to elicit the **Extraversion** personality trait.

`open_ended_4` corresponds with string responses designed to elicit the **Neuroticism** personality trait.

`open_ended_5` corresponds with string responses designed to elicit the **Openness** personality trait.

* * *

`a_scale_score` corresponds with a self-report questionnaire score for **Agreeableness**

`c_scale_score` corresponds with a self-report questionnaire score for **Conscientiousness**

`e_scale_score` corresponds with a self-report questionnaire score for **Extraversion**

`n_scale_score` corresponds with a self-report questionnaire score for **Neuroticism**

`o_scale_score` corresponds with a self-report questionnaire score for **Openness**
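
The column pairings above can be collected in one place. The following is a minimal sketch, not part of the original analysis: `TRAIT_COLUMNS` is a hypothetical lookup table, and the derived column name follows the `<trait>_<prompt column>` pattern that the helper function later in this notebook produces.

```python
# Hypothetical lookup table (an assumption for illustration): one entry per
# Big 5 trait, pairing its prompt column with its self-report score column.
TRAIT_COLUMNS = {
    "agreeableness": ("open_ended_1", "a_scale_score"),
    "conscientiousness": ("open_ended_2", "c_scale_score"),
    "extraversion": ("open_ended_3", "e_scale_score"),
    "neuroticism": ("open_ended_4", "n_scale_score"),
    "openness": ("open_ended_5", "o_scale_score"),
}

for trait, (prompt_col, score_col) in TRAIT_COLUMNS.items():
    # Name of the zero-shot score column for the prompt designed for this trait.
    zero_shot_col = f"{trait}_{prompt_col}"
    print(f"{trait:<18} {prompt_col:<13} {score_col:<14} {zero_shot_col}")
```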

## Install Hugging Face Transformers

In [None]:
# Install transformers library.
!pip install -q git+https://github.com/huggingface/transformers.git

## Import Packages

Import all dependencies and set the device to the GPU if one is available.

In [25]:
import pandas as pd
import pingouin as pg
import torch

from functools import reduce
from pandas import json_normalize
from transformers import pipeline
from typing import Tuple

# Set device on GPU if available else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))

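# device=0 selects the first GPU; -1 falls back to the CPU when CUDA is unavailable
classifier = pipeline('zero-shot-classification', device=0 if torch.cuda.is_available() else -1)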

Using device: cuda

GeForce RTX 2070 with Max-Q Design


Some weights of the model checkpoint at facebook/bart-large-mnli were not used when initializing BartForSequenceClassification: ['model.encoder.version', 'model.decoder.version']
- This IS expected if you are initializing BartForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BartForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Load Data

Subset `df_raw` into one dataframe per Big 5 personality trait.

In [26]:
df_raw = pd.read_csv('./data/siop_2019/siop_2019.csv')

df_a = df_raw.loc[:, ['Respondent_ID', 'open_ended_1', 'Dataset']]
df_c = df_raw.loc[:, ['Respondent_ID', 'open_ended_2', 'Dataset']]
df_e = df_raw.loc[:, ['Respondent_ID', 'open_ended_3', 'Dataset']]
df_n = df_raw.loc[:, ['Respondent_ID', 'open_ended_4', 'Dataset']]
df_o = df_raw.loc[:, ['Respondent_ID', 'open_ended_5', 'Dataset']]

# These will be our candidate labels
big5_labels = ['agreeableness', 'conscientiousness', 'extraversion', 'neuroticism', 'openness']

## Helper Function

Function that loops over the responses to one of the prompts `open_ended_1` through `open_ended_5` and classifies each one.

In [34]:
def zshot(df=df_a,
          ids='Respondent_ID',
          responses='open_ended_1',
          split='Dataset',
          labels=big5_labels,
          hypothesis_template='"This response is characterized by {}."') -> pd.DataFrame:

    '''
    Classify text responses using the Transformers zero-shot classification pipeline.

    Arguments
    ---------
    df:                  Input pandas dataframe.
    ids:                 Name of the column holding the primary key.
    responses:           Name of the column holding the text to classify (e.g., responses to a scenario prompt).
    split:               Name of the column holding the training split.
    labels:              List of candidate labels to score each response against.
    hypothesis_template: Template string; the pipeline fills {} with each label to build an NLI hypothesis.

    Returns
    -------
    Wide pandas dataframe with one row per response_id and one score column per
    Big 5 label, each suffixed with the name of the `responses` column.
    '''
    
    torch.cuda.empty_cache()

    ids = df[ids]
    sequences = df[responses].values
    training_split = df[split]

    # CREATE EMPTY LIST TO APPEND RESULTS 
    list_data = []

    # LOOP THROUGH RESPONSES USING ZERO SHOT CLASSIFIER PIPELINE
    for sequence in sequences:
        # multi_label=True (NAMED multi_class IN OLDER TRANSFORMERS RELEASES): EACH LABEL IS SCORED INDEPENDENTLY BETWEEN 0 AND 1
        json_a = classifier(sequence, labels, multi_label=True, hypothesis_template=hypothesis_template)
        list_data.append(json_a)

    cl = json_normalize(list_data)

    srs_ids, srs_split, srs_index  = ids, training_split, cl.sequence

    # EXPLODE JSON LABELS INTO PANDAS COLUMNS
    cl_labels = cl.explode('labels')
    srs_labels = cl_labels['labels']
    
    # EXPLODE JSON SCORES INTO PANDAS COLUMNS
    cl_scores = cl.explode('scores')
    srs_scores = cl_scores['scores']
 
    frame = { 'response_id': srs_ids, 'text': srs_index, 'split': srs_split, 'candidate_label_id': srs_labels, 'candidate_label_raw': srs_scores }

    df_tall = pd.DataFrame(frame)

    # CLEAN UP COLUMN STRING
    df_tall['candidate_label_id'] = df_tall['candidate_label_id'].str.replace(" ","_")
    df_tall['candidate_label_id'] = df_tall['candidate_label_id'].str.replace("-","_")
    df_tall['split'] = df_tall['split'].str.lower()
    # CONVERT TALL TO WIDE DATAFRAME
    df_wide = df_tall.pivot(index='response_id', columns='candidate_label_id', values='candidate_label_raw').reset_index()

    # MERGE in RESPONSE_ID, RESPONSES, SPLIT
    df_split = df_tall.drop_duplicates(subset=['response_id']).loc[:, ['response_id', 'text', 'split']].reset_index(drop=True)
    df_wide_ = pd.merge(df_wide, df_split, on='response_id', how='left')

    traits = ['agreeableness', 'conscientiousness', 'extraversion', 'neuroticism', 'openness']
    df_wide_ = df_wide_.loc[:, ['response_id'] + traits]

    # SUFFIX EACH TRAIT COLUMN WITH THE PROMPT NAME (E.G., agreeableness_open_ended_1)
    df_wide_ = df_wide_.rename(columns={trait: f'{trait}_{responses}' for trait in traits})

    return df_wide_
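
The reshaping inside `zshot` can be hard to follow, so here is a minimal sketch of the same idea on a single made-up result. It assumes the pipeline returns a dict with `sequence`, `labels`, and `scores` keys, with labels ordered by score. `result_to_row` and `fake_result` are hypothetical names used only for illustration, and this is an alternative to the explode-and-pivot approach above, not the code used to produce the results.

```python
import pandas as pd


def result_to_row(result: dict, response_id, prompt: str) -> dict:
    """Flatten one zero-shot result into a single wide row (hypothetical helper)."""
    row = {"response_id": response_id}
    # Pair each label with its score and suffix the column with the prompt name.
    for label, score in zip(result["labels"], result["scores"]):
        row[f"{label}_{prompt}"] = score
    return row


# Made-up result in the shape the pipeline returns (values are not real output).
fake_result = {
    "sequence": "I would help my colleague finish the report.",
    "labels": ["agreeableness", "conscientiousness", "openness"],
    "scores": [0.9, 0.7, 0.2],
}

df_demo = pd.DataFrame([result_to_row(fake_result, 101, "open_ended_1")])
print(df_demo.columns.tolist())
```

Building one dict per response and passing the list to `pd.DataFrame` yields the wide layout directly, without a tall intermediate frame.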

## **Warning**: this will take a long time on a CPU; I recommend testing on a single row first

Transformer Pipeline Classification

Note: I included candidate labels for all Big 5 traits even though each prompt is only designed to promote meaningful variability in terms of a single trait.
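
To make the role of the candidate labels concrete, the sketch below shows how each label becomes one hypothesis sentence. It assumes the pipeline fills the template with `str.format`, which is my understanding of its behavior; the actual NLI scoring is not reproduced here. Note that the default template used in `zshot` contains literal double quotes, so they appear in every hypothesis.

```python
# Candidate labels and the default template from the helper function above.
big5_labels = ['agreeableness', 'conscientiousness', 'extraversion', 'neuroticism', 'openness']
hypothesis_template = '"This response is characterized by {}."'

# One hypothesis per label; the NLI model then scores each
# (response, hypothesis) pair.
hypotheses = [hypothesis_template.format(label) for label in big5_labels]

for hypothesis in hypotheses:
    print(hypothesis)
```

Because every response is scored against all five hypotheses, each call to `zshot` yields five score columns, even though only one of them matches the trait the prompt was designed for.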

In [22]:
# Agreeableness
df_oe_a = zshot(df=df_a, ids='Respondent_ID', responses='open_ended_1', split='Dataset', labels=big5_labels)

# Conscientiousness
df_oe_c = zshot(df=df_c, ids='Respondent_ID', responses='open_ended_2', split='Dataset', labels=big5_labels)

# Extraversion
df_oe_e = zshot(df=df_e, ids='Respondent_ID', responses='open_ended_3', split='Dataset', labels=big5_labels)

# Neuroticism
df_oe_n = zshot(df=df_n, ids='Respondent_ID', responses='open_ended_4', split='Dataset', labels=big5_labels)

# Openness
df_oe_o = zshot(df=df_o, ids='Respondent_ID', responses='open_ended_5', split='Dataset', labels=big5_labels)
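
The next section reads results back from CSV files, so the frames returned by `zshot` need to be written out at some point. The following is a minimal sketch of one way to do that. `save_frames` is a hypothetical helper, the toy frame stands in for `df_oe_a`, and writing with `index=False` is an assumption about how the saved files were produced.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frames(frames: dict, out_dir) -> list:
    """Write each dataframe to <out_dir>/<name>.csv and return the paths (hypothetical helper)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in frames.items():
        target = out_path / f"{name}.csv"
        # index=False avoids an extra unnamed index column on re-read.
        frame.to_csv(target, index=False)
        written.append(target)
    return written


# Toy frame standing in for df_oe_a (values are made up).
toy = pd.DataFrame({"response_id": [1, 2], "agreeableness_open_ended_1": [0.9, 0.4]})

with tempfile.TemporaryDirectory() as tmp:
    paths = save_frames({"df_a": toy}, tmp)
    reloaded = pd.read_csv(paths[0])

print(reloaded)
```

In the notebook itself, the call would presumably look like `save_frames({'df_a': df_oe_a, 'df_c': df_oe_c, 'df_e': df_oe_e, 'df_n': df_oe_n, 'df_o': df_oe_o}, './output/zshot_desc')`.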

## Read in and merge previous pipeline results

In [30]:
# Read
df_a = pd.read_csv('./output/zshot_desc/df_a.csv')
df_c = pd.read_csv('./output/zshot_desc/df_c.csv')
df_e = pd.read_csv('./output/zshot_desc/df_e.csv')
df_n = pd.read_csv('./output/zshot_desc/df_n.csv')
df_o = pd.read_csv('./output/zshot_desc/df_o.csv')

# Merge
frames = [df_a, df_c, df_e, df_n, df_o]
df_big5 = reduce(lambda left, right: pd.merge(left, right, on='response_id', how='outer'), frames)
df_raw.columns = map(str.lower, df_raw.columns)
df_merged = df_big5.merge(df_raw, left_on='response_id', right_on='respondent_id')
df_merged.head()

| | response_id | agreeableness_open_ended_1 | conscientiousness_open_ended_1 | extraversion_open_ended_1 | neuroticism_open_ended_1 | openness_open_ended_1 | agreeableness_open_ended_2 | conscientiousness_open_ended_2 | extraversion_open_ended_2 | neuroticism_open_ended_2 | ... | open_ended_2 | open_ended_3 | open_ended_4 | open_ended_5 | e_scale_score | a_scale_score | o_scale_score | c_scale_score | n_scale_score | dataset |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10430310916 | 0.961271 | 0.98735 | 0.517366 | 0.090582 | 0.851015 | 0.605696 | 0.993918 | 0.598209 | 0.278813 | ... | I would complete as much as possible as early ... | I would not go to the networking meeting. If I... | I would feel awful. I would discuss the negati... | The experience would be largely enjoyable. I m... | 1.25 | 4.5 | 3.5 | 4.583333 | 2.0 | Train |
| 1 | 10430357581 | 0.949983 | 0.97545 | 0.812577 | 0.128994 | 0.967695 | 0.236811 | 0.913796 | 0.217789 | 0.030903 | ... | I would try to finish the project as early as ... | I would go regardless because this is an oppor... | You have to swallow your pride and move on. I ... | I would find this experience very enjoyable. L... | 3.25 | 4.583333 | 3.666667 | 4.75 | 1.5 | Test |
| 2 | 10430389322 | 0.835928 | 0.991514 | 0.394194 | 0.397509 | 0.789948 | 0.805279 | 0.956355 | 0.517213 | 0.007445 | ... | I would immediately make a priority list and p... | I would probably not go because I am not a big... | I would feel quite upset by the situation. I w... | I would find the experience enjoyable because ... | 2.916667 | 3.25 | 4.083333 | 3.583333 | 2.833333 | Test |
| 3 | 10432456337 | 0.862493 | 0.976453 | 0.429553 | 0.809877 | 0.320497 | 0.797729 | 0.979905 | 0.65925 | 0.378568 | ... | I would try to get ad much done as soon as pos... | I would try to talk my colleague into going. I... | I would think long and hard about whether to r... | I would find it interesting. I love learning n... | 2.416667 | 4.166667 | 3.166667 | 3.833333 | 4.25 | Test |
| 4 | 10432470791 | 0.871392 | 0.923793 | 0.565204 | 0.254818 | 0.737471 | 0.520573 | 0.971274 | 0.913008 | 0.376868 | ... | I would immediately start working. I like to g... | I would for sure try and get them to come alon... | I would reconnect with the boss and asked them... | I would find this experience enjoyable. Anytim... | 3.916667 | 4.25 | 4.833333 | 4.833333 | 1.5 | Train |
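
The `reduce` call in the merge step folds a list of frames into one by merging them pairwise on `response_id`. Here is a small sketch of that pattern on made-up frames; the values and the three-frame list are assumptions chosen to show what `how='outer'` does when an id is missing from one frame.

```python
from functools import reduce

import pandas as pd

# Toy per-prompt frames (values are made up); id 3 is missing from the first
# two frames and id 1 is missing from the third.
left = pd.DataFrame({"response_id": [1, 2], "agreeableness_open_ended_1": [0.9, 0.4]})
middle = pd.DataFrame({"response_id": [1, 2], "conscientiousness_open_ended_2": [0.8, 0.3]})
right = pd.DataFrame({"response_id": [2, 3], "extraversion_open_ended_3": [0.6, 0.1]})

# Merge the frames one after another on the shared key.
merged = reduce(
    lambda left_df, right_df: pd.merge(left_df, right_df, on="response_id", how="outer"),
    [left, middle, right],
)
print(merged)
```

An outer merge keeps every `response_id` seen in any frame and fills the gaps with `NaN`, so a respondent who skipped one prompt is not dropped from the other four.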


Correlations between the zero-shot classification scores of each scenario response and the self-report questionnaire scores for the corresponding Big 5 trait:

In [33]:
# Compute correlations using Pingouin
corr_a = pg.pairwise_corr(df_merged, columns=['a_scale_score', 'agreeableness_open_ended_1'], method='pearson')
corr_c = pg.pairwise_corr(df_merged, columns=['c_scale_score', 'conscientiousness_open_ended_2'], method='pearson')
corr_e = pg.pairwise_corr(df_merged, columns=['e_scale_score', 'extraversion_open_ended_3'], method='pearson')
corr_n = pg.pairwise_corr(df_merged, columns=['n_scale_score', 'neuroticism_open_ended_4'], method='pearson')
corr_o = pg.pairwise_corr(df_merged, columns=['o_scale_score', 'openness_open_ended_5'], method='pearson')

# Combine results: r is the Pearson correlation, p-unc is the uncorrected p-value (0.0 after rounding means p < 0.005)
corrs = [corr_a, corr_c, corr_e, corr_n, corr_o]
df_corrs = pd.concat(corrs).loc[:, ['X', 'Y', 'r', 'p-unc']].round(2).reset_index(drop = True)
df_corrs = df_corrs.rename({'X': 'self_report_big_5', 'Y': 'zero_shot_predictions_of_sjt'}, axis=1)
df_corrs

| | self_report_big_5 | zero_shot_predictions_of_sjt | r | p-unc |
|---|---|---|---|---|
| 0 | a_scale_score | agreeableness_open_ended_1 | 0.14 | 0.0 |
| 1 | c_scale_score | conscientiousness_open_ended_2 | 0.05 | 0.03 |
| 2 | e_scale_score | extraversion_open_ended_3 | 0.24 | 0.0 |
| 3 | n_scale_score | neuroticism_open_ended_4 | 0.01 | 0.55 |
| 4 | o_scale_score | openness_open_ended_5 | 0.22 | 0.0 |
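
As a quick sanity check on a single pair of columns, the Pearson correlation can also be computed with plain pandas. This is a swapped-in technique for illustration, not the Pingouin call used above, and the toy values are made up; it returns only `r`, not a p-value.

```python
import pandas as pd

# Toy data (made up): five respondents with a self-report score and a
# zero-shot score for the matching trait.
toy = pd.DataFrame({
    "a_scale_score": [1.0, 2.0, 3.0, 4.0, 5.0],
    "agreeableness_open_ended_1": [0.2, 0.1, 0.4, 0.3, 0.5],
})

# Pearson r between the two columns; for these toy values r is about 0.8.
r = toy["a_scale_score"].corr(toy["agreeableness_open_ended_1"], method="pearson")
print(round(r, 2))
```

On the real data, the same call on `df_merged` should agree with the `r` column in the table above, up to rounding.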


See the [blog post](https://dmracek.github.io/explorations_code_nbs/exploration_zs_big5/) for the full write-up.