MBTI Predictor

What is a MBTI

The MBTI (Myers-Briggs Personality Type Indicator) divides everyone into 16 distinct personality types across 4 axis:

Introversion (I) – Extroversion (E)
Intuition (N) – Sensing (S)
Thinking (T) – Feeling (F)
Judging (J) – Perceiving (P)

This system is used in: businesses, online, for fun, for research and lots more.

Usually, for knowing which of the 16 personalities is closer to our one we have to answer to a questionnaire; why not automate this task?

Learning task

Design and implement a model that given a collection of "posts" as input returns the most suited Myers-Briggs personality type indicator (MBTI) associated to the author.

Machine learning algorithms used for this project:

Naive Bayes
Linear SVC
Logistic Regression
Neural Network

Google Colab or Databricks?

MBTI.ipynb notebook is designed to run on Google Colab, BUT setting USING_COLAB=False it should also run on Databricks (not full tested, so Colab is reccomended)

Datasets

Two datasets has been used for this project:

MBTI Dataset (Kaggle link)
Reddit dataset (datasets/reddit_mbti.parquet)

MBTI Dataset

This dataset contains over 8600 rows of data, and on each row contains:

"posts": last 50 things a user have posted, each entry is separated by "|||"
"type": MBTI type

Notice that after splitting each post we will have about 430k rows.

Acknowledgements: This data was collected through the PersonalityCafe forum; it provides a large selection of people and their MBTI personality type.

Credits to @datasnaek

Reddit dataset

For obtaining a model with much more "generalization power" and for increasing the number of data to use during training phase I decided to collect new data and I came up with this new dataset.

It's a PARQUET dataset containing on each row:

redditor_id: posts' author id
post: Last things the author posted; each entry is separated by "|||" (w.r.t. Mbti dataset, posts number on each entry ranges from 50 to 100)
text_type: identify if it's a comment, title or a post
type: as the MBTI dataset
num_post: number of post in the row

This dataset contains 5754 records

How data has been collected?

Data has been collected on Reddit using a scraper created by me (code available here).

First, I collected a list of users (and their personality) in 17 subreddits about MBTI ("r/mbti", "r/infp", ...); the personality information is given by a badge that is assigned to the user (Reddit API call it "author flair text").

Then, I've scraped the most recent posts of each user (max. 100 for each user) on the ENTIRE Reddit platform, this means that I scraped also posts not related to MBTI universe.

At the end for each post I assigned the badge (so the personality) related to the author.

Final models

Final models that can be used are:

final_pipeline_202107080921, composed by
- NaiveBayes model for classifying between I/E type indicators, f1-score: 0.7414680295332422
- Neural Network model for classifying between N/S type indicators, f1-score: 0.8314306227551886
- NaiveBayes model for classifying between T/F type indicators, f1-score: 0.7338614509076911
- Neural Network model for classifying between P/J type indicators, f1-score: 0.7301667584790412
final_pipeline_202107050925, composed by
- Neural Network model for classifying between I/E type indicators, f1-score: 0.744559472384674
- Neural Network model for classifying between N/S type indicators, f1-score: 0.835502168084075
- Linear SVC model for classifying between T/F type indicators, f1-score: 0.726088608434644
- Neural Network model for classifying between P/J type indicators, f1-score: 0.72016943168465

They are both located in pretrainedModels folder of this repository.

Name	Name	Last commit message	Last commit date
Latest commit edu-rinaldi Update README.md Jul 18, 2021 974fe85 · Jul 18, 2021 History 34 Commits
datasets	datasets	Added reddit and mbti union dataset already splitted	Jul 7, 2021
pretrainedModels	pretrainedModels	final models	Jul 8, 2021
scraper	scraper	reddit dataset now is csv	Jun 28, 2021
.gitignore	.gitignore	Added some pretrained models	Jul 6, 2021
LICENSE	LICENSE	Added MBTI dataset	May 10, 2021
MBTI.ipynb	MBTI.ipynb	Updated notebook	Jul 9, 2021
Personality prediction from user’s posts.pdf	Personality prediction from user’s posts.pdf	Added presentation slides	Jul 9, 2021
README.md	README.md	Update README.md	Jul 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MBTI Predictor

What is a MBTI

Learning task

Google Colab or Databricks?

Datasets

MBTI Dataset

Reddit dataset

How data has been collected?

Final models

About

Releases

Packages

Languages

License

edu-rinaldi/MBTI-Predictor

Folders and files

Latest commit

History

Repository files navigation

MBTI Predictor

What is a MBTI

Learning task

Google Colab or Databricks?

Datasets

MBTI Dataset

Reddit dataset

How data has been collected?

Final models

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages