# Extroversion Analysis with LLM Generated Dataset

## Read the dataset

In [1]:
import pandas as pd

file_path = "analysis/llm-dataset-generation/traits-definitions.xlsx"
high_ext_sheet_name = "High-EXT-GPT3.5"
low_ext_sheet_name = "Low-EXT-GPT3.5"

df = pd.read_excel(file_path, sheet_name=high_ext_sheet_name)
high_ext_texts_gpt = df.iloc[:, 0].tolist()

df = pd.read_excel(file_path, sheet_name=low_ext_sheet_name)
low_ext_texts_gpt = df.iloc[:, 0].tolist()

## Filter unique texts
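Embed the dataset and use pairwise cosine similarity to filter out texts that are too similar: a text is kept only if its similarity to every other text is below `SIMILARITY_THRESHOLD`.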

In [2]:
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

SIMILARITY_THRESHOLD = 0.9
MODEL = "intfloat/e5-large-v2"


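def get_unique_paragraphs(texts: list[str], label: str):
    """Return (embedding, label) pairs and texts that have no near-duplicate in `texts`."""
    model = SentenceTransformer(MODEL)
    embeddings = model.encode(texts, convert_to_tensor=True)
    # Pairwise cosine similarity matrix, shape (len(texts), len(texts)).
    similarities = util.pytorch_cos_sim(embeddings, embeddings)
    unique_paragraphs = []
    unique_embeddings = []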
    for i in tqdm(range(len(texts))):
        is_dissimilar = all(
            similarity < SIMILARITY_THRESHOLD
            for j, similarity in enumerate(similarities[i])
            if j != i
        )
        if is_dissimilar:
            unique_paragraphs.append(texts[i])
            unique_embeddings.append((embeddings[i], label))

    print(f"{len(unique_paragraphs)}/{len(texts)} Unique Paragraphs:")
    if not unique_paragraphs:
        print("No unique paragraphs found.")
    else:
        for i, paragraph in enumerate(unique_paragraphs):
            print(f"{i + 1}. {paragraph}")
    return unique_embeddings, unique_paragraphs
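
The filter above drops every member of a near-duplicate group, because each text is compared against all others. The sketch below is a small pure-Python illustration of that rule next to a greedy variant that keeps the first text of each group. The function names and the toy vectors are hypothetical and are not part of the notebook; non-zero vectors are assumed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def drop_all_near_duplicates(vectors, threshold):
    """Same rule as the notebook: keep index i only if no other vector is too similar."""
    return [
        i
        for i, u in enumerate(vectors)
        if all(cosine(u, v) < threshold for j, v in enumerate(vectors) if j != i)
    ]


def keep_first_of_each_group(vectors, threshold):
    """Greedy variant: keep index i if it is dissimilar to everything kept so far."""
    kept = []
    for i, u in enumerate(vectors):
        if all(cosine(u, vectors[k]) < threshold for k in kept):
            kept.append(i)
    return kept


# Toy 2-D "embeddings" (made up): the first two are near-duplicates of each other.
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
strict = drop_all_near_duplicates(toy_vectors, threshold=0.9)
greedy = keep_first_of_each_group(toy_vectors, threshold=0.9)
print(strict, greedy)
```

On the toy data the strict rule keeps only the third vector, while the greedy variant also keeps the first. Which behavior is preferable depends on whether near-duplicate groups should be represented in the training set at all.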



### High Extroversion
99/176 texts left

In [3]:
(
    unique_high_ext_vectors_with_labels,
    unique_high_ext_paragraphs_gpt,
) = get_unique_paragraphs(high_ext_texts_gpt, label="HIGH_EXT")

100%|██████████| 176/176 [00:00<00:00, 2768.00it/s]

99/176 Unique Paragraphs:
1. You know what gets me going? Meeting new people! There's something so exhilarating about striking up a conversation with someone I've never met before. It's like every new connection opens up a world of possibilities and adventures. Plus, you never know who you might meet – a potential friend, a business partner, or even a soulmate!
2. Socializing is my superpower! Whether it's a casual chat with a stranger or a deep conversation with a close friend, I'm all in. I love bouncing ideas off others, sharing my thoughts, and hearing different perspectives. It's what keeps life interesting and vibrant!
3. I've never met a stranger – just potential friends I haven't gotten to know yet! Seriously though, I believe in the power of human connection. Every interaction, no matter how brief, has the potential to leave a lasting impression. That's why I always approach every conversation with an open heart and a genuine smile.
4. Networking? Oh, I'm all about it! Buildin




### Low Extroversion
102/359 texts left

In [4]:
(
    unique_low_ext_vectors_with_labels,
    unique_low_ext_paragraphs_gpt,
) = get_unique_paragraphs(low_ext_texts_gpt, label="LOW_EXT")

100%|██████████| 359/359 [00:00<00:00, 1681.55it/s]

102/359 Unique Paragraphs:
1. Social gatherings exhaust me beyond belief. It's not that I despise the company of others; it's just that after spending an hour or two in a crowded room, I feel like all my energy has been siphoned away, leaving me empty and drained. The constant noise, the never-ending chatter, it's like sensory overload for my introverted mind. Don't get me wrong, I appreciate the occasional social interaction, but too much of it can feel suffocating. I find solace in solitude, in the quiet moments where I can finally recharge my batteries and be alone with my thoughts.
2. Small talk feels like a never-ending cycle of meaningless banter. I mean, who really cares about the weather or what you had for lunch? It's like we're all just going through the motions, pretending to be interested in each other's lives when in reality, we're just biding our time until we can escape. I crave deeper connections, meaningful conversations that stimulate my mind and feed my soul. Give me




## Dataset Statistics

### High Extroversion

In [5]:
df_unique_high_ext = pd.DataFrame(unique_high_ext_paragraphs_gpt, columns=['Paragraph'])
# 'Token Count' here is a whitespace-delimited word count, not a model tokenizer count.
df_unique_high_ext['Token Count'] = df_unique_high_ext['Paragraph'].apply(lambda x: len(x.split()))
df_unique_high_ext['Token Count'].describe()

count     99.000000
mean      68.434343
std       29.538685
min       38.000000
25%       48.500000
50%       54.000000
75%       85.500000
max      170.000000
Name: Token Count, dtype: float64

### Low Extroversion

In [6]:
df_unique_low_ext = pd.DataFrame(unique_low_ext_paragraphs_gpt, columns=['Paragraph'])
# 'Token Count' here is a whitespace-delimited word count, not a model tokenizer count.
df_unique_low_ext['Token Count'] = df_unique_low_ext['Paragraph'].apply(lambda x: len(x.split()))
df_unique_low_ext['Token Count'].describe()

count    102.000000
mean      73.931373
std       24.202225
min       34.000000
25%       55.250000
50%       72.500000
75%       85.750000
max      142.000000
Name: Token Count, dtype: float64

## Logistic Regression

### Train

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

vectors_with_labels = unique_high_ext_vectors_with_labels + unique_low_ext_vectors_with_labels

# Random 80/20 split; no seed is set here, so the score below varies between runs.
train_data, test_data = train_test_split(vectors_with_labels, test_size=0.2)
train_vectors = [t[0] for t in train_data]
train_labels = [t[1] for t in train_data]
test_vectors = [t[0] for t in test_data]
test_labels = [t[1] for t in test_data]

# Fit on the GPT-generated embeddings and report accuracy on the held-out 20%.
gpt_only_model = LogisticRegression(random_state=0).fit(train_vectors, train_labels)
print(gpt_only_model.score(test_vectors, test_labels))

0.9512195121951219
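
Because the split above is unseeded, the held-out score can change from run to run. The following is a minimal pure-Python sketch of a seeded, stratified split, meant only to illustrate the idea; `stratified_split` and the toy data are hypothetical. In the notebook itself, passing `random_state` and `stratify` to `train_test_split` would be the more natural route.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_fraction, seed):
    """Split (item, label) pairs so each label contributes ~test_fraction to the test set."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)  # local RNG, so the split is repeatable for a given seed
    train, test = [], []
    for label, group in sorted(by_label.items()):
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test += [(x, label) for x in shuffled[:n_test]]
        train += [(x, label) for x in shuffled[n_test:]]
    return train, test


# Toy data (made up): ten items, five per class.
toy_items = list(range(10))
toy_labels = ["HIGH_EXT"] * 5 + ["LOW_EXT"] * 5
train, test = stratified_split(toy_items, toy_labels, test_fraction=0.2, seed=0)
print(len(train), len(test))
```

With a fixed seed, repeated calls return the same partition, which makes a reported accuracy easier to reproduce and compare.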


### Test

#### MyPersonality Concatenated

In [13]:
# `pd`, `SentenceTransformer`, and `MODEL` are already defined in earlier cells.
model = SentenceTransformer(MODEL)
myPersonality_df = pd.read_csv(
    "./data/myPersonality-concatenated.csv",
    usecols=["STATUS", "cEXT"],
    encoding="ISO-8859-1",
)
# Map the dataset's y/n extroversion flag onto the labels used for training.
value_map = {"y": "HIGH_EXT", "n": "LOW_EXT"}
myPersonality_df["cEXT"] = myPersonality_df["cEXT"].map(value_map)
myPersonality_embeddings = model.encode(
    myPersonality_df["STATUS"].tolist(), convert_to_tensor=True
)

print(
    "GPT-Only EXT Score:",
    gpt_only_model.score(myPersonality_embeddings, myPersonality_df["cEXT"]),
)


GPT-Only EXT Score: 0.42


#### MyPersonality

In [14]:
# `pd`, `SentenceTransformer`, and `MODEL` are already defined in earlier cells.
model = SentenceTransformer(MODEL)
myPersonality_df = pd.read_csv(
    "./data/myPersonality.csv",
    usecols=["STATUS", "cEXT"],
    encoding="ISO-8859-1",
)
# Map the dataset's y/n extroversion flag onto the labels used for training.
value_map = {"y": "HIGH_EXT", "n": "LOW_EXT"}
myPersonality_df["cEXT"] = myPersonality_df["cEXT"].map(value_map)
myPersonality_embeddings = model.encode(
    myPersonality_df["STATUS"].tolist(), convert_to_tensor=True
)

print(
    "GPT-Only EXT Score:",
    gpt_only_model.score(myPersonality_embeddings, myPersonality_df["cEXT"]),
)


GPT-Only EXT Score: 0.45053947766461633
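
Both accuracies above are below 0.5 on a two-class task, so a single number says little about what the classifier is doing. The sketch below shows two simple diagnostics in pure Python: the majority-class baseline and per-class recall. The helper names and the toy labels are hypothetical; nothing here reflects the actual class balance of the MyPersonality data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels (made up) to show the shape of the output.
toy_true = ["HIGH_EXT", "HIGH_EXT", "LOW_EXT", "LOW_EXT", "LOW_EXT"]
toy_pred = ["HIGH_EXT", "LOW_EXT", "HIGH_EXT", "HIGH_EXT", "LOW_EXT"]
baseline = majority_baseline_accuracy(toy_true)
recall = per_class_recall(toy_true, toy_pred)
print(baseline, recall)
```

In the notebook, the true labels would be `myPersonality_df["cEXT"].tolist()` and the predictions would come from `gpt_only_model.predict(myPersonality_embeddings)`. Comparing the score against the baseline and checking whether one class dominates the predictions would help show where the transfer from GPT-generated text to real status updates breaks down.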
