MBTI Personality Type Predictor
Personality type predictor using PyTorch. It attempts to predict the 16-way personality type as well as the binary character code along each of the four axes.
View the report here.
Download and extract the csv file from Kaggle.
Place the extracted file in a directory called
You can run all the steps at once with
Or run each step individually
python preprocess.py
python word2vec.py
To see all of the config variables, run
python main.py --help
usage: main.py [-h] [--data_dir DATA_DIR] [--raw_csv_file RAW_CSV_FILE]
               [--pre_save_file PRE_SAVE_FILE]
               [--force_preprocessing FORCE_PREPROCESSING]
               [--embeddings_model EMBEDDINGS_MODEL]
               [--embeddings_file EMBEDDINGS_FILE] [--num_threads NUM_THREADS]
               [--feature_size FEATURE_SIZE] [--min_words MIN_WORDS]
               [--distance_between_words DISTANCE_BETWEEN_WORDS]
               [--epochs EPOCHS] [--force_word2vec FORCE_WORD2VEC]
               [--num_samples NUM_SAMPLES]

optional arguments:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   Directory to save/read all data files from

Preprocessing:
  --raw_csv_file RAW_CSV_FILE
                        Filename of csv file downloaded from Kaggle
  --pre_save_file PRE_SAVE_FILE
                        Filename to save preprocessed csv file as
  --force_preprocessing FORCE_PREPROCESSING
                        Whether or not to do preprocessing even if output csv
                        file is found

Word2Vec:
  --embeddings_model EMBEDDINGS_MODEL
                        Filename to save word2vec model to
  --embeddings_file EMBEDDINGS_FILE
                        Filename to save mbti data with word vectors to
  --num_threads NUM_THREADS
                        Number of threads to use for training word2vec
  --feature_size FEATURE_SIZE
                        Number of features to use for word2vec
  --min_words MIN_WORDS
                        Minimum number of words for word2vec
  --distance_between_words DISTANCE_BETWEEN_WORDS
                        Distance between words for word2vec
  --epochs EPOCHS       Number of epochs to train word2vec for
  --force_word2vec FORCE_WORD2VEC
                        Whether or not to create word embeddings even if
                        output word2vec file is found
  --num_samples NUM_SAMPLES
                        Number of samples to return from word2vec. -1 for all
                        samples
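For orientation, here is a rough sketch of how a grouped help message like the one above can be produced with `argparse` argument groups. It covers only a handful of the flags, and `build_parser` and every default value except the 300 used for `--feature_size` are placeholders chosen for illustration, not values taken from this repository.

```python
import argparse


def build_parser():
    """Build a small parser with the same group layout as the help text above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--data_dir", default="path/to/data",  # placeholder default
                        help="Directory to save/read all data files from")

    # Flags that only matter for the preprocessing step
    pre = parser.add_argument_group("Preprocessing")
    pre.add_argument("--raw_csv_file", default="raw.csv",  # placeholder default
                     help="Filename of csv file downloaded from Kaggle")

    # Flags that only matter for the word2vec step
    w2v = parser.add_argument_group("Word2Vec")
    w2v.add_argument("--feature_size", type=int, default=300,
                     help="Number of features to use for word2vec")
    w2v.add_argument("--num_samples", type=int, default=-1,
                     help="Number of samples to return from word2vec. -1 for all samples")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try out
config = build_parser().parse_args(["--feature_size", "100"])
print(config.feature_size, config.num_samples)
```

Argument groups only change how the help text is laid out; the parsed values all end up on the same `config` namespace.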
The format of the data can get a little confusing. Hopefully this clears things up.
For the following,
N = number of rows (samples) we have.
Note: All filepaths are prefixed with the
Raw CSV file coming from Kaggle. The location of the input file is given by
preprocess(config) # nothing returned, new csv file saved
The file is preprocessed by splitting each row into a new row for each individual post. Stopwords, numbers, links, and punctuation are removed and the text is set to all lowercase. The file is saved to
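The cleaning steps described above could look roughly like the sketch below. It is not the code in preprocess.py: the `|||` post separator, the tiny stopword list, and the helper names `clean_post` and `split_row` are all assumptions made for illustration.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); a real run would use a fuller set
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "at", "it", "this", "my"}

POST_SEPARATOR = "|||"  # assumed delimiter between posts in the raw csv

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
# Map every punctuation character to a space so words stay separated
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_post(text):
    """Lowercase a post and strip links, numbers, punctuation and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def split_row(personality_type, posts):
    """Turn one raw row into one (type, cleaned_post) row per non-empty post."""
    rows = []
    for post in posts.split(POST_SEPARATOR):
        cleaned = clean_post(post)
        if cleaned:  # drop posts that are empty after cleaning
            rows.append((personality_type, cleaned))
    return rows


raw = "Check this out http://example.com/video|||I scored 42 on the test!|||..."
for row in split_row("INTJ", raw):
    print(row)
```

Links are removed before punctuation so that a URL disappears as a whole instead of leaving fragments such as `http` or `com` behind.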
This is the data that will mainly be used for training/testing. Multiple output types can be specified, depending on whether you are training to classify all 16 classes or doing a binary classification for each of the 4 character codes.
Preprocessed CSV file. The location of the input file is given by
As input you also need to give the personality "character code". The options are imported from
from utils import ALL, FIRST, FOURTH, SECOND, THIRD
embedding_data = word2vec(config, code=ALL)  # code defaults to ALL
The output is entirely numeric; no strings are present.
For each row, the first element will be the sentence data and the second element will be the label vector.
row = [sentence, label]
The sentence data is a list of word vectors, so its length may differ from row to row.
Each word vector has length config.feature_size, which defaults to 300.
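Because sentences differ in length, rows usually need padding before they can be stacked into a batch. The dependency-free sketch below shows one way to do that for rows shaped like `[sentence, label]`. The helper `pad_batch` is hypothetical and not part of this repository, and a feature size of 4 stands in for `config.feature_size` to keep the example small.

```python
FEATURE_SIZE = 4  # stands in for config.feature_size (300 by default)


def pad_batch(rows, feature_size=FEATURE_SIZE):
    """Pad every sentence in a batch to the longest one using zero vectors.

    Returns (padded_sentences, original_lengths, labels).
    """
    lengths = [len(sentence) for sentence, _ in rows]
    max_len = max(lengths)
    padded = []
    for sentence, _ in rows:
        missing = max_len - len(sentence)
        # Copy the word vectors, then append one zero vector per missing word
        padded.append([list(vec) for vec in sentence]
                      + [[0.0] * feature_size for _ in range(missing)])
    labels = [label for _, label in rows]
    return padded, lengths, labels


# Two toy rows: a 2-word sentence and a 3-word sentence, each with a length-1 label
rows = [
    [[[0.1] * 4, [0.2] * 4], [1.0]],
    [[[0.3] * 4, [0.4] * 4, [0.5] * 4], [0.0]],
]
padded, lengths, labels = pad_batch(rows)
print(lengths)                  # original sentence lengths
print([len(s) for s in padded])  # every sentence now has the same length
```

Keeping the original lengths around lets a model ignore the padded positions. When working with tensors, PyTorch's `torch.nn.utils.rnn.pad_sequence` does the same kind of padding.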
The label depends on the code option specified.
When code is ALL, the label will be a one-hot encoded vector of length 16. You can use the utils.one_hot_to_type function to convert from a one-hot encoding to a personality type.
# Get one hot encoding
Y = one_hot_encode_type('INTJ')
print(Y)  # => [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
# Get personality type
t = one_hot_to_type(Y)
print(t)  # => INTJ
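To make the encoding concrete, here is a minimal stand-alone sketch of a one-hot round trip over the 16 types. It is not the code in utils: the names `encode_type` and `decode_type` are hypothetical, and the alphabetical ordering of `TYPES` is an assumption, so the position of the 1 will not necessarily match what the utils functions produce.

```python
from itertools import product

# All 16 types in an assumed (alphabetical) order; the real utils ordering may differ
TYPES = ["".join(chars) for chars in product("EI", "NS", "FT", "JP")]


def encode_type(personality_type):
    """Return a length-16 list with a single 1.0 at the index of the given type."""
    vector = [0.0] * len(TYPES)
    vector[TYPES.index(personality_type)] = 1.0
    return vector


def decode_type(vector):
    """Return the type whose position holds the largest value."""
    best_index = max(range(len(vector)), key=lambda i: vector[i])
    return TYPES[best_index]


y = encode_type("INTJ")
print(len(y), sum(y))
print(decode_type(y))
```

Decoding by the largest value rather than an exact 1.0 means the same function also works on a model's predicted scores.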
FIRST, SECOND, THIRD, FOURTH
The label will be a length-1 vector containing either 0 or 1, so training is a plain binary classification. To recover the character that a binary class stands for, you can use the
# Consider the third character (T or F)
code = THIRD
# Get binary class
b = get_binary_for_code(code, 'ESTP')
print(b)  # => 0
# Get character for class
c = get_char_for_binary(code, b)
print(c)  # => T
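The mapping between a character and its binary class can be sketched as a lookup into one two-character string per axis. This is an illustrative re-implementation, not the code in utils: the integer values of the constants, the names `binary_for_code` and `char_for_binary`, and the character order on every axis other than the third (where the example above implies T maps to 0) are assumptions.

```python
# Assumed values for the code constants: the position of the character in the type
FIRST, SECOND, THIRD, FOURTH = 0, 1, 2, 3

# Class 0 is the first character of each pair, class 1 the second.
# Only the "TF" order is implied by the example above; the rest is assumed.
AXIS_CHARS = ("IE", "NS", "TF", "JP")


def binary_for_code(code, personality_type):
    """Return 0 or 1 for the character of the type on the given axis."""
    return AXIS_CHARS[code].index(personality_type[code])


def char_for_binary(code, binary_class):
    """Return the character that a binary class stands for on the given axis."""
    return AXIS_CHARS[code][binary_class]


b = binary_for_code(THIRD, "ESTP")
c = char_for_binary(THIRD, b)
print(b, c)
```

Running the four binary classifiers and joining their decoded characters in axis order gives back a full four-letter type.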