# Predicting Personality: Introvert vs Extrovert Classification

## Introduction

This notebook analyzes the **2025 Kaggle Playground Series - Personality Prediction Challenge**. The goal is to predict whether a person is an **Introvert** or **Extrovert** based on their social behavior and personality traits.

## Approach

This analysis covers the complete machine learning pipeline:

1. **Exploratory Data Analysis** - Understanding data patterns and distributions
2. **Data Preprocessing** - Missing-value imputation and encoding of categorical features and the target
3. **Model Selection** - Testing multiple classification algorithms
4. **Model Evaluation** - Performance analysis using confusion matrices and cross-validation
5. **Final Predictions** - Generating competition submissions

The focus is on building a robust binary classifier through systematic evaluation of different algorithms and proper validation techniques.

---

In [522]:
import numpy as np
import pandas as pd
import random

np.random.seed(42)
random.seed(42)

## Load Data

In [525]:
# load training data
train = pd.read_csv('train.csv')
train.head()

Unnamed: 0,id,Time_spent_Alone,Stage_fear,Social_event_attendance,Going_outside,Drained_after_socializing,Friends_circle_size,Post_frequency,Personality
0,0,0.0,No,6.0,4.0,No,15.0,5.0,Extrovert
1,1,1.0,No,7.0,3.0,No,10.0,8.0,Extrovert
2,2,6.0,Yes,1.0,0.0,,3.0,0.0,Introvert
3,3,3.0,No,7.0,3.0,No,11.0,5.0,Extrovert
4,4,1.0,No,4.0,4.0,No,13.0,,Extrovert


In [527]:
# load testing data
test = pd.read_csv('test.csv')
test.head()

Unnamed: 0,id,Time_spent_Alone,Stage_fear,Social_event_attendance,Going_outside,Drained_after_socializing,Friends_circle_size,Post_frequency
0,18524,3.0,No,7.0,4.0,No,6.0,
1,18525,,Yes,0.0,0.0,Yes,5.0,1.0
2,18526,3.0,No,5.0,6.0,No,15.0,9.0
3,18527,3.0,No,4.0,4.0,No,5.0,6.0
4,18528,9.0,Yes,1.0,2.0,Yes,1.0,1.0


## Data Preparation

To prepare the dataset for modeling, I will perform the following preprocessing steps:

**Missing Value Treatment:**
- Fill NaN values: 'Unknown' for the categorical columns, the column median for the numeric columns

**Target Variable Encoding:**
- Convert Personality column to binary format:
  - Extrovert = 1
  - Introvert = 0

These preprocessing steps will ensure a clean dataset with no missing values and properly encoded target labels for binary classification.

---

In [530]:
# check for missing values
print(train.info())
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18524 entries, 0 to 18523
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         18524 non-null  int64  
 1   Time_spent_Alone           17334 non-null  float64
 2   Stage_fear                 16631 non-null  object 
 3   Social_event_attendance    17344 non-null  float64
 4   Going_outside              17058 non-null  float64
 5   Drained_after_socializing  17375 non-null  object 
 6   Friends_circle_size        17470 non-null  float64
 7   Post_frequency             17260 non-null  float64
 8   Personality                18524 non-null  object 
dtypes: float64(5), int64(1), object(3)
memory usage: 1.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6175 entries, 0 to 6174
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     -------------- 

In [532]:
# handle categorical vars first

# replace NaNs in Drained_after_socializing with Unknown for train and test.
train['Drained_after_socializing'] = train['Drained_after_socializing'].fillna('Unknown')
test['Drained_after_socializing'] = test['Drained_after_socializing'].fillna('Unknown')

# replace NaNs in Stage_fear with Unknown for train and test.
train['Stage_fear'] = train['Stage_fear'].fillna('Unknown')
test['Stage_fear'] = test['Stage_fear'].fillna('Unknown')

In [534]:
# handle quantitative vars by using median to replace null values
numeric_cols = ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside',
                'Friends_circle_size', 'Post_frequency']

# replace NaNs in each numeric col with the median of that col, computed separately for train and test
for col in numeric_cols:
    train[col] = train[col].fillna(train[col].median())
    test[col] = test[col].fillna(test[col].median())
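
The cell above fills the test set with the test set's own medians. A common alternative is to compute the medians on the training data only and reuse them for the test data, so that both sets are transformed identically. The sketch below illustrates this on toy data; the frames `toy_train` and `toy_test` and the helper `fill_with_train_medians` are hypothetical names, not part of this notebook.

```python
import pandas as pd

def fill_with_train_medians(train_df, test_df, numeric_cols):
    """Fill NaNs in both frames using medians computed on train_df only."""
    medians = train_df[numeric_cols].median()
    train_out = train_df.copy()
    test_out = test_df.copy()
    # fillna with a Series fills each column using the value at its label
    train_out[numeric_cols] = train_out[numeric_cols].fillna(medians)
    test_out[numeric_cols] = test_out[numeric_cols].fillna(medians)
    return train_out, test_out, medians

# Toy stand-ins for the real frames (hypothetical values)
toy_train = pd.DataFrame({"Time_spent_Alone": [0.0, 1.0, 6.0, None],
                          "Post_frequency": [5.0, 8.0, None, 3.0]})
toy_test = pd.DataFrame({"Time_spent_Alone": [None, 9.0],
                         "Post_frequency": [1.0, None]})

filled_train, filled_test, medians = fill_with_train_medians(
    toy_train, toy_test, ["Time_spent_Alone", "Post_frequency"])
print(medians)
print(filled_test)
```

Whether this changes the results here was not checked; on a dataset of this size the train and test medians are likely to be close.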

In [536]:
# double check that there are no more null values
print(train.info())
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18524 entries, 0 to 18523
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         18524 non-null  int64  
 1   Time_spent_Alone           18524 non-null  float64
 2   Stage_fear                 18524 non-null  object 
 3   Social_event_attendance    18524 non-null  float64
 4   Going_outside              18524 non-null  float64
 5   Drained_after_socializing  18524 non-null  object 
 6   Friends_circle_size        18524 non-null  float64
 7   Post_frequency             18524 non-null  float64
 8   Personality                18524 non-null  object 
dtypes: float64(5), int64(1), object(3)
memory usage: 1.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6175 entries, 0 to 6174
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     -------------- 

In [538]:
# map Extrovert to 1 and Introvert to 0
train['Personality'] = train['Personality'].map({'Extrovert': 1, 'Introvert': 0})
train.head()


Unnamed: 0,id,Time_spent_Alone,Stage_fear,Social_event_attendance,Going_outside,Drained_after_socializing,Friends_circle_size,Post_frequency,Personality
0,0,0.0,No,6.0,4.0,No,15.0,5.0,1
1,1,1.0,No,7.0,3.0,No,10.0,8.0,1
2,2,6.0,Yes,1.0,0.0,Unknown,3.0,0.0,0
3,3,3.0,No,7.0,3.0,No,11.0,5.0,1
4,4,1.0,No,4.0,4.0,No,13.0,5.0,1


## Create Classifier Function

I will create a reusable function to streamline the model training and evaluation process:

**Function Purpose:**
- Train any classifier on the training data
- Perform cross-validation to assess model performance
- Print the per-fold accuracy scores and their mean
- Enable efficient comparison across multiple algorithms

This function will allow for consistent evaluation methodology across all tested models.
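
If the spread across folds is of interest as well as the mean, a helper could return both values instead of printing them. The following is a minimal sketch under that assumption; `cv_summary` is a hypothetical name, and the synthetic data from `make_classification` only stands in for `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_summary(model, X, y, cv=5):
    """Return (mean, std) of the per-fold accuracy for a classifier."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

# Synthetic stand-in data; the notebook would pass X_train and y_train instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
mean_acc, std_acc = cv_summary(LogisticRegression(max_iter=2000), X_demo, y_demo)
print(f"Accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Returning the values rather than printing them makes it easy to collect the results for several models in one table.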

---

In [541]:
from sklearn.model_selection import cross_val_score

# set X to all cols except Personality (note: this keeps the id col as a feature)
X_train = train.iloc[:,:-1]
X_test = test

# set y to Personality
y_train = train.iloc[:, -1]

In [543]:
# turn predictive cols into numeric cols for X_train
X_train = pd.get_dummies(X_train)
X_train.head()

Unnamed: 0,id,Time_spent_Alone,Social_event_attendance,Going_outside,Friends_circle_size,Post_frequency,Stage_fear_No,Stage_fear_Unknown,Stage_fear_Yes,Drained_after_socializing_No,Drained_after_socializing_Unknown,Drained_after_socializing_Yes
0,0,0.0,6.0,4.0,15.0,5.0,True,False,False,True,False,False
1,1,1.0,7.0,3.0,10.0,8.0,True,False,False,True,False,False
2,2,6.0,1.0,0.0,3.0,0.0,False,False,True,False,True,False
3,3,3.0,7.0,3.0,11.0,5.0,True,False,False,True,False,False
4,4,1.0,4.0,4.0,13.0,5.0,True,False,False,True,False,False


In [545]:
# turn predictive cols into numeric cols for X_test
X_test = pd.get_dummies(X_test)
X_test.head()

Unnamed: 0,id,Time_spent_Alone,Social_event_attendance,Going_outside,Friends_circle_size,Post_frequency,Stage_fear_No,Stage_fear_Unknown,Stage_fear_Yes,Drained_after_socializing_No,Drained_after_socializing_Unknown,Drained_after_socializing_Yes
0,18524,3.0,7.0,4.0,6.0,5.0,True,False,False,True,False,False
1,18525,2.0,0.0,0.0,5.0,1.0,False,False,True,False,False,True
2,18526,3.0,5.0,6.0,15.0,9.0,True,False,False,True,False,False
3,18527,3.0,4.0,4.0,5.0,6.0,True,False,False,True,False,False
4,18528,9.0,1.0,2.0,1.0,1.0,False,False,True,False,False,True
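
Here the train and test frames happen to produce the same dummy columns, but `pd.get_dummies` only creates columns for categories it actually sees, so this is not guaranteed in general. One way to guard against a mismatch is to reindex the test frame to the training columns. The sketch below uses small hypothetical frames (`toy_train`, `toy_test`) rather than the competition data.

```python
import pandas as pd

toy_train = pd.DataFrame({"Stage_fear": ["No", "Yes", "Unknown"],
                          "Going_outside": [4.0, 0.0, 3.0]})
toy_test = pd.DataFrame({"Stage_fear": ["No", "Yes"],   # no "Unknown" here
                         "Going_outside": [6.0, 2.0]})

train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

# Reindex the test frame to the training columns; categories absent
# from the test data are added as columns filled with False.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=False)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```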


In [547]:
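# cross-validate a classifier and print the per-fold and mean scores
def clf_model(model):
    # cross_val_score defaults to 5-fold CV and, for classifiers, accuracy
    scores = cross_val_score(model, X_train, y_train)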
    print('Scores:', scores)
    print('Mean Score:', scores.mean())

## Create Confusion Matrix Function

I will create a function to generate detailed performance analysis for each model:

**Function Purpose:**
- Fit the model on the 80% training split and predict on the 20% validation split
- Generate confusion matrix to visualize classification performance
- Provide classification report with precision, recall, and F1-scores
- Return the trained model for further use

This function will provide comprehensive insights into how well each model distinguishes between introverts and extroverts.

---

In [550]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# since we aren't provided Personality for test data, we will split the training data
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# create function that generates a confusion matrix for the model
def confusion(model):
    clf = model
    clf.fit(X_train_split, y_train_split)
    y_pred = clf.predict(X_val)
    print('Confusion Matrix:' , confusion_matrix(y_val, y_pred))
    print('Classification Report:', classification_report(y_val, y_pred))
    return clf

## Model Selection

I will test multiple classification algorithms to find the best performer for personality prediction:

**Algorithms to Test:**
- Logistic Regression
- Random Forest  
- AdaBoost
- K-Nearest Neighbors (KNN)
- Gaussian Naive Bayes
- Decision Tree

**Evaluation Method:**
Each model will be evaluated in two ways: 5-fold cross-validation on the full training set (the `cross_val_score` default), and a confusion matrix with classification report on an 80/20 train-validation split, to assess accuracy, precision, and recall across both personality classes.

### Logistic Regression

In [554]:
from sklearn.linear_model import LogisticRegression
clf_model(LogisticRegression(max_iter = 2000))

Scores: [0.96923077 0.97112011 0.96410256 0.96707152 0.97354212]
Mean Score: 0.9690134165784956


In [555]:
confusion(LogisticRegression(max_iter = 2000))

Confusion Matrix: [[ 885   67]
 [  50 2703]]
Classification Report:               precision    recall  f1-score   support

           0       0.95      0.93      0.94       952
           1       0.98      0.98      0.98      2753

    accuracy                           0.97      3705
   macro avg       0.96      0.96      0.96      3705
weighted avg       0.97      0.97      0.97      3705
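
The figures in the classification report follow directly from the confusion matrix. As a worked example, the snippet below recomputes them from the four counts printed above, and adds the accuracy of always predicting the majority class as a point of comparison (that baseline is an addition for context, not something the notebook computes).

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
# (scikit-learn convention), with class 0 = Introvert and class 1 = Extrovert.
cm = [[885, 67],
      [50, 2703]]

tn, fp = cm[0]   # true introverts: predicted 0, predicted 1
fn, tp = cm[1]   # true extroverts: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_0 = tn / (tn + fn)   # of all predicted introverts, share that are correct
recall_0 = tn / (tn + fp)      # of all true introverts, share that are found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
# Accuracy of always predicting the majority class (Extrovert)
majority_baseline = (fn + tp) / total

print(f"accuracy          = {accuracy:.2f}")
print(f"class 0 precision = {precision_0:.2f}, recall = {recall_0:.2f}")
print(f"class 1 precision = {precision_1:.2f}, recall = {recall_1:.2f}")
print(f"majority baseline = {majority_baseline:.2f}")
```

Because roughly three quarters of the validation rows are extroverts, the per-class recall for introverts (0.93) is a more informative number than overall accuracy alone.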



### Random Forest

In [497]:
from sklearn.ensemble import RandomForestClassifier
clf_model(RandomForestClassifier())

Scores: [0.96869096 0.97031039 0.96437247 0.96653171 0.97300216]
Mean Score: 0.9685815385781282


In [498]:
confusion(RandomForestClassifier(random_state=42))

Confusion Matrix: [[ 884   68]
 [  50 2703]]
Classification Report:               precision    recall  f1-score   support

           0       0.95      0.93      0.94       952
           1       0.98      0.98      0.98      2753

    accuracy                           0.97      3705
   macro avg       0.96      0.96      0.96      3705
weighted avg       0.97      0.97      0.97      3705



### AdaBoost

In [502]:
from sklearn.ensemble import AdaBoostClassifier
clf_model(AdaBoostClassifier(algorithm='SAMME', random_state=42))

Scores: [0.96869096 0.9705803  0.96410256 0.96626181 0.9724622 ]
Mean Score: 0.9684195661108245


In [503]:
confusion(AdaBoostClassifier(algorithm='SAMME'))

Confusion Matrix: [[ 885   67]
 [  51 2702]]
Classification Report:               precision    recall  f1-score   support

           0       0.95      0.93      0.94       952
           1       0.98      0.98      0.98      2753

    accuracy                           0.97      3705
   macro avg       0.96      0.96      0.96      3705
weighted avg       0.97      0.97      0.97      3705



### KNeighbors

In [507]:
from sklearn.neighbors import KNeighborsClassifier
clf_model(KNeighborsClassifier())

Scores: [0.73954116 0.48205128 0.73981107 0.51093117 0.26349892]
Mean Score: 0.5471667205894784


In [410]:
confusion(KNeighborsClassifier())

Confusion Matrix: [[ 643  309]
 [  38 2715]]
Classification Report:               precision    recall  f1-score   support

           0       0.94      0.68      0.79       952
           1       0.90      0.99      0.94      2753

    accuracy                           0.91      3705
   macro avg       0.92      0.83      0.86      3705
weighted avg       0.91      0.91      0.90      3705
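
K-Nearest Neighbors works on raw distances, so it is sensitive to feature scale and to columns such as `id` that carry no behavioural information. Note also that `cross_val_score` does not shuffle rows by default, whereas `train_test_split` does. Whether these factors explain the gap between the cross-validation mean (0.55) and the validation-split accuracy (0.91) was not tested in this notebook. The sketch below shows one way to check, using synthetic stand-in data; all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 600
y_demo = rng.integers(0, 2, size=n)
# One informative feature plus an uninformative, monotonically increasing id column
feature = y_demo * 4.0 + rng.normal(0.0, 1.0, size=n)
row_id = np.arange(n, dtype=float)
X_with_id = np.column_stack([row_id, feature])
X_no_id = feature.reshape(-1, 1)

# Shuffled, stratified folds so each fold mixes rows from the whole id range
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
knn_raw = KNeighborsClassifier()
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())

score_raw = cross_val_score(knn_raw, X_with_id, y_demo, cv=cv).mean()
score_scaled = cross_val_score(knn_scaled, X_no_id, y_demo, cv=cv).mean()
print(f"raw features incl. id: {score_raw:.3f}")
print(f"scaled, id dropped:    {score_scaled:.3f}")
```

Putting the scaler inside the pipeline means it is refit on the training part of every fold, so no statistics leak from the held-out fold.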



### GaussianNB

In [375]:
from sklearn.naive_bayes import GaussianNB
clf_model(GaussianNB())

Scores: [0.96815115 0.97112011 0.96410256 0.965722   0.97327214]
Mean Score: 0.9684735909386358


In [412]:
confusion(GaussianNB())

Confusion Matrix: [[ 885   67]
 [  50 2703]]
Classification Report:               precision    recall  f1-score   support

           0       0.95      0.93      0.94       952
           1       0.98      0.98      0.98      2753

    accuracy                           0.97      3705
   macro avg       0.96      0.96      0.96      3705
weighted avg       0.97      0.97      0.97      3705



### Decision Tree Classifier

In [379]:
from sklearn.tree import DecisionTreeClassifier
clf_model(DecisionTreeClassifier(random_state=42))

Scores: [0.90580297 0.93063428 0.93792173 0.92145749 0.96166307]
Mean Score: 0.931495906238432


In [414]:
confusion(DecisionTreeClassifier())

Confusion Matrix: [[ 821  131]
 [ 121 2632]]
Classification Report:               precision    recall  f1-score   support

           0       0.87      0.86      0.87       952
           1       0.95      0.96      0.95      2753

    accuracy                           0.93      3705
   macro avg       0.91      0.91      0.91      3705
weighted avg       0.93      0.93      0.93      3705



## Model Selection: Why Logistic Regression?

After evaluating multiple classification algorithms, **Logistic Regression** had the highest mean cross-validation accuracy (0.9690), narrowly ahead of Random Forest (0.9686), Gaussian Naive Bayes (0.9685), and AdaBoost (0.9684):

- **Accuracy**: 97% on the validation split
- **Balanced performance**: Precision 0.95 and recall 0.93 for introverts; precision and recall 0.98 for extroverts
- **Simple and interpretable**: The coefficients show which features drive predictions
- **Stable across folds**: Cross-validation accuracy ranged from 0.964 to 0.974

Because the top four models perform almost identically, model simplicity was the deciding factor in choosing logistic regression for this personality prediction task.
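
To put the comparison in one place, the snippet below summarises the per-fold scores printed in the sections above. It uses only the standard library; note that `statistics.stdev` is the sample standard deviation, which differs slightly from NumPy's default population formula.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the cross-validation outputs above
cv_scores = {
    "Logistic Regression": [0.96923077, 0.97112011, 0.96410256, 0.96707152, 0.97354212],
    "Random Forest": [0.96869096, 0.97031039, 0.96437247, 0.96653171, 0.97300216],
    "AdaBoost": [0.96869096, 0.9705803, 0.96410256, 0.96626181, 0.9724622],
    "KNN": [0.73954116, 0.48205128, 0.73981107, 0.51093117, 0.26349892],
    "Gaussian NB": [0.96815115, 0.97112011, 0.96410256, 0.965722, 0.97327214],
    "Decision Tree": [0.90580297, 0.93063428, 0.93792173, 0.92145749, 0.96166307],
}

# (mean, sample std) per model, ranked by mean accuracy
summary = {name: (mean(s), stdev(s)) for name, s in cv_scores.items()}
ranked = sorted(summary.items(), key=lambda item: item[1][0], reverse=True)
for name, (m, sd) in ranked:
    print(f"{name:<20} mean={m:.4f}  std={sd:.4f}")
```

The gap between the first and fourth model is smaller than the fold-to-fold standard deviation of any of them, so the ranking among the top four should be read as a near tie.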

---

### Submission

In [566]:
# sets the final submission model to logistic regression
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train, y_train)

# Make predictions on test data
test_predictions = final_model.predict(X_test)

# convert personality col back to extrovert and introvert
prediction_labels = ['Introvert' if pred == 0 else 'Extrovert' for pred in test_predictions]

# Create submission dataframe
submission = pd.DataFrame({'id' : test['id'], 'Personality' : prediction_labels})

# Save to CSV
submission.to_csv('submission.csv', index=False)
print("Submission file created!")
print(submission.head())

Submission file created!
      id Personality
0  18524   Extrovert
1  18525   Introvert
2  18526   Extrovert
3  18527   Extrovert
4  18528   Introvert


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
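
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling route is shown below, assuming a `StandardScaler` placed in front of the classifier; it runs on synthetic stand-in data, not on the competition data, and the variable names are hypothetical. The fitted `n_iter_` attribute reports how many iterations the solver actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would use X_train and y_train.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
# Inflate one column so the features have very different scales.
X_demo[:, 0] *= 10_000

max_iter = 1000
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))
scaled_model.fit(X_demo, y_demo)

# n_iter_ holds the iterations the solver actually used; a value below
# max_iter means it stopped on its own convergence criterion.
iterations_used = int(scaled_model.named_steps["logisticregression"].n_iter_[0])
print("iterations used:", iterations_used)
```

With a pipeline like this, the same scaling is applied automatically when calling `predict` on the test features.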


In [562]:
# check that each row is included
len(submission)

6175
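
Beyond the row count, a few more properties of the submission frame can be checked before uploading. The helper below is a sketch; `check_submission` is a hypothetical name, and the expected column names are taken from the submission cell above.

```python
import pandas as pd

def check_submission(submission_df, expected_ids):
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(submission_df.columns) == ["id", "Personality"]
    assert len(submission_df) == len(expected_ids)
    assert submission_df["id"].is_unique
    assert set(submission_df["id"]) == set(expected_ids)
    # Only the two original labels should appear
    assert set(submission_df["Personality"]) <= {"Introvert", "Extrovert"}
    return True

# Toy frame standing in for the real submission
toy_submission = pd.DataFrame({"id": [18524, 18525, 18526],
                               "Personality": ["Extrovert", "Introvert", "Extrovert"]})
ok = check_submission(toy_submission, expected_ids=[18524, 18525, 18526])
print(ok)
```

In the notebook this could be called as `check_submission(submission, test['id'])`.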

## Conclusion

### Final Results

**Score: 0.974089**

This personality prediction project achieved strong performance through systematic model evaluation and data preprocessing.

### Key Findings

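- **Logistic Regression** had the highest mean cross-validation accuracy (0.969) and reached 97% validation accuracy, with Random Forest, Gaussian Naive Bayes, and AdaBoost within 0.001 of it
- Missing values were filled ('Unknown' for categorical columns, medians for numeric ones) and the target was encoded as 0/1 before modeling
- A simple linear model matched the ensemble methods (Random Forest, AdaBoost) on this dataset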

### Methodology Success

Testing 6 different algorithms with 5-fold cross-validation, an 80/20 validation split, and confusion matrices gave a consistent basis for choosing the final model.

**Result: 0.974089** - A strong performance in the 2025 Kaggle Playground Series!