# Data Pre-Processing for Predicting Myers Briggs Types
### By Arnav Bhakta$^{1}$ and William Yue$^{1}$
#### $^{1}$ Phillips Academy Andover

##### The authors would like to thank Patrick Chen, Michael Huang, and Ali Cy for their helpful input and advice in crafting this notebook.

In this notebook, we go through the data pre-processing steps needed to predict Myers Briggs Type Indicators (MBTI). As an overview, the MBTI is an "introspective self-report questionnaire indicating differing psychological preferences in how people perceive the world and make decisions" [[1]](https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator). It "divides everyone into $2^4=16$ distinct personality types across $4$ axes:

* Introversion (I) – Extroversion (E)
* Intuition (N) – Sensing (S)
* Thinking (T) – Feeling (F)
* Judging (J) – Perceiving (P)"

and assigns everyone a label based on where they fall on each axis [[2]](https://www.kaggle.com/datasnaek/mbti-type). "For example, someone who prefers introversion, intuition, thinking and perceiving would be labelled an INTP in the MBTI system" [[2]](https://www.kaggle.com/datasnaek/mbti-type). This label gives an overview of someone's personality, preferences, behaviors, and psychological perspective. Hence, in the current study, we look to leverage machine learning (ML) to classify people's MBTI types based on how they "speak" and interact with others. The data is taken from the Kaggle [(MBTI) Myers-Briggs Personality Type Dataset](https://www.kaggle.com/datasnaek/mbti-type), which, as far as we can tell from reading over samples, consists of text that people posted in a forum, labeled with each person's personality type [[2]](https://www.kaggle.com/datasnaek/mbti-type).
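
To make the $2^4=16$ count concrete, the following minimal sketch enumerates every type code as the Cartesian product of the four axes. The names `AXES` and `mbti_types` are illustrative and do not appear elsewhere in this notebook.

```python
from itertools import product

# Each axis contributes exactly one letter to the four-letter type code.
AXES = [
    ("I", "E"),  # Introversion / Extroversion
    ("N", "S"),  # Intuition / Sensing
    ("T", "F"),  # Thinking / Feeling
    ("J", "P"),  # Judging / Perceiving
]

# The Cartesian product of the four axes yields all 2**4 = 16 type codes.
mbti_types = ["".join(letters) for letters in product(*AXES)]

print(len(mbti_types))  # 16
print(mbti_types[:4])   # ['INTJ', 'INTP', 'INFJ', 'INFP']
```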

In the current notebook, we have separated the dataset into a smaller set consisting of two types of labeled data, Introversion (I) and Extroversion (E). Using the provided texts for each of these labels (Introversion labeled 1 and Extroversion labeled 0), we hope to accurately predict each of these personality types.
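
The csv used below already contains these binary labels. As a rough illustration of how such a label could be derived from a four-letter type code, here is a small sketch; the helper `ie_label` is hypothetical and not part of the original pipeline.

```python
def ie_label(mbti_type: str) -> int:
    """Return 1 for an introverted type (code starts with 'I'), 0 for an extroverted one ('E')."""
    first = mbti_type.strip().upper()[:1]
    if first == "I":
        return 1
    if first == "E":
        return 0
    raise ValueError(f"not an MBTI type code: {mbti_type!r}")

print([ie_label(t) for t in ["INTP", "ENFJ", "isfp"]])  # [1, 0, 1]
```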

In [1]:
print(__doc__)

Automatically created module for IPython interactive environment


### Importing Libraries

We import the libraries below to help with data pre-processing. Pandas (`pandas`) is primarily used for loading the data from a csv and creating DataFrames. NumPy (`numpy`) is primarily used for creating arrays and simple arithmetic. Regular expression operations (`re`) are primarily used for removing unwanted or label-revealing text from the samples. `Tokenizer` from Keras is used to split sentences into smaller units, or words, called tokens, so that the text can be analyzed as a sequence of words, and `pad_sequences` brings those sequences to a common length. Note that Python's built-in `tokenize` module, although imported, tokenizes Python source code rather than natural language and is not needed for the steps shown here. The remaining libraries are used for importing and exporting data.

In [8]:
# Standard Imports
import os
import pandas as pd
import numpy as np
import re
import tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from bs4 import BeautifulSoup
import requests
import json

### Loading in the Data

Below, we load the data by reading the csv from the `datasets` directory with Pandas, and pull out the two columns of the dataset: the text and the labels. We then convert them from Pandas Series to NumPy arrays, for ease of use later on.

In [9]:
read_path_0 = os.path.join('..','datasets','0.csv')
df = pd.read_csv(read_path_0)
df_text = np.array(df['text'])
df_label = np.array(df['label'])

We then create two lists: `df_user` and `personality_types`. `df_user` will hold all of the text in the dataset after sensitive or hard-to-understand strings are removed. `personality_types` contains all 16 possible personality types, which are mentioned many times in the sample text. We remove these mentions so that the model bases its predictions on more natural language rather than on explicit type names, which are not necessarily elements of everyday speech.

In [6]:
df_user = []
personality_types = ['intj', 'intp', 'entp', 'entj', 'infj', 'infp', 'enfj', 'enfp', 'istj', 'isfj', 'estj', 'esfj', 'istp', 'isfp', 'estp', 'esfp']

In addition to removing all mentions of the different personality types, we also remove all links from the text using the `re.sub()` method, since links do not arise in everyday speech either, along with @-mentions of other users. After removing this text, we collapse any repeated spaces that the removals left behind, and then split each sample at every instance of `'|||'`, which marks the boundary between two posts. In doing so, we are able to have an individual person's posts matched with their specific label or personality type.

In [None]:
for sample in df_text:
    # Lowercase so that the patterns below match regardless of case
    text = sample.lower()
    # Remove links; 'http\S+' also covers 'https' URLs
    text = re.sub(r'http\S+', '', text)
    # Remove @-mentions of other users
    text = re.sub(r'@[A-Za-z0-9]+', '', text)
    # Remove explicit mentions of the personality types
    for ptype in personality_types:
        text = re.sub(ptype, '', text)
    # Collapse runs of spaces left behind by the removals
    text = re.sub(r' {2,}', ' ', text)
    df_user.append(np.array(text.split('|||'))) # Divide a user's posts
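
The same cleaning steps can be tried on a single string. The sketch below is a variation rather than the notebook's exact method: it splits on `'|||'` before cleaning, because `\S+` also matches the `|` character, so a link immediately followed by `'|||'` would otherwise be removed together with the separator. The helper `clean_posts`, the shortened type list, and the sample string are all made up for illustration.

```python
import re
from typing import List

# Assumption: a shortened type list for illustration; the notebook uses all 16.
TYPES = ["intj", "intp", "enfp"]

URL_RE = re.compile(r"http\S+")         # also matches https:// links
MENTION_RE = re.compile(r"@[a-z0-9]+")  # text is lowercased before matching
TYPE_RE = re.compile("|".join(TYPES))   # any one of the type codes
SPACES_RE = re.compile(r" {2,}")        # runs of two or more spaces

def clean_posts(raw: str) -> List[str]:
    """Split one user's text into posts, then clean each post separately."""
    cleaned = []
    for post in raw.lower().split("|||"):
        post = URL_RE.sub("", post)
        post = MENTION_RE.sub("", post)
        post = TYPE_RE.sub("", post)
        post = SPACES_RE.sub(" ", post).strip()
        cleaned.append(post)
    return cleaned

sample = "As an INTP I agree  https://example.com/x|||@Bob see http://example.com now"
print(clean_posts(sample))  # ['as an i agree', 'see now']
```

Note that this sketch also strips leading and trailing spaces from each post, which the loop above does not do.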

Displaying the split text:

In [None]:
# dtype=object keeps this working when users have different numbers of posts
df_user = np.array(df_user, dtype=object)
display(df_user)

### Tokenization

As mentioned above, we tokenize the text by splitting it into smaller units, so that our model can better analyze the text and discover patterns within it. Before doing so, we flatten the data into a 1-dimensional array, since the text currently sits in a series of nested arrays. We do this by defining a new list, `df_user_flattened`, looping through each nested array in `df_user`, which currently holds all the texts, and appending each string of text within these nested arrays to the new 1-dimensional `df_user_flattened` list.

In [None]:
df_user_flattened = []

In [None]:
for i in df_user:
    for j in i:
        df_user_flattened.append(j)

We then convert `df_user_flattened` to a NumPy array for ease of use, and display its values. As we can see, whereas `df_user` consisted of several arrays nested within each other, `df_user_flattened` is a single array of strings, meaning that we successfully reduced the dimensionality of our data.

In [None]:
df_user_flattened = np.array(df_user_flattened)
display(df_user_flattened)
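
The flattening step can also be written with `itertools.chain`. The sketch below uses toy data and additionally repeats each user's label once per post, so that posts and labels stay aligned after flattening. The label handling is an assumption for illustration, since this notebook does not show that step, and the names `users`, `labels`, `flat_posts`, and `flat_labels` are hypothetical.

```python
from itertools import chain

# Toy stand-in for df_user: one list of posts per user (hypothetical data).
users = [["post a1", "post a2"], ["post b1"], ["post c1", "post c2", "post c3"]]
# Hypothetical per-user labels (1 = introvert, 0 = extrovert).
labels = [1, 0, 1]

# Flatten the nested posts into a single 1-D list.
flat_posts = list(chain.from_iterable(users))

# Repeat each user's label once per post so posts and labels stay aligned.
flat_labels = [label for label, posts in zip(labels, users) for _ in posts]

print(flat_posts)
print(flat_labels)  # [1, 1, 0, 1, 1, 1]
```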

Next, we tokenize our data using `Tokenizer` from Keras. We pass it two parameters: `num_words` limits the vocabulary to the most frequently used words in the dataset, and is set here through `vocab_size = 4000`; `oov_token` is the placeholder used to replace out-of-vocabulary words.

In [None]:
vocab_size = 4000
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(df_user_flattened)
word_index = tokenizer.word_index
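
To show what fitting the tokenizer produces, here is a simplified pure-Python stand-in that ranks words by frequency and reserves index 1 for the out-of-vocabulary token. It is a sketch of the idea only, not the Keras implementation: the real `Tokenizer` also filters punctuation, for example. The function `build_word_index` is hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    # most_common() orders by descending count; ties keep first-seen order.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

texts = ["the cat sat", "the cat ran", "the dog"]
print(build_word_index(texts))
# {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'ran': 5, 'dog': 6}
```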

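Post-tokenization, we can read `word_index` from `tokenizer` to get a dictionary that maps each word in our dataset to an integer index, with more frequently used words receiving lower indices. The dictionary contains every word that was seen; the `num_words` limit is only applied later, when texts are converted to sequences.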

In [None]:
word_index

In [None]:
# Count the words in each sample to find the length of the longest one
lens = []
for i in df_user_flattened:
    sample_lens = i.split(" ")
    lens.append(len(sample_lens))

max(lens)
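
Beyond the maximum, it can help to check how many samples fit within a candidate sequence length before choosing one. The following sketch, on toy data, computes that fraction; the helper `coverage` and the example texts are hypothetical.

```python
def coverage(lengths, maxlen):
    """Fraction of samples whose word count is at most maxlen."""
    return sum(1 for n in lengths if n <= maxlen) / len(lengths)

toy_texts = ["a b c", "a b", "a b c d e"]
lengths = [len(text.split(" ")) for text in toy_texts]

print(max(lengths))          # 5
print(coverage(lengths, 3))  # two of the three samples fit within 3 words
```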

### Texts to Sequences

In order to train our model on the available texts, we then use the `texts_to_sequences` method to convert each text to a sequence of integers. Each word that the tokenizer knows is replaced with its integer index from the vocabulary found above, so that the model we build is able to interpret the text.

In [None]:
tokenized = tokenizer.texts_to_sequences(df_user_flattened)

In [None]:
tokenized
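
The conversion can be illustrated without Keras. In this simplified sketch, words that are unknown, or whose index is at or beyond `num_words`, are replaced by the out-of-vocabulary index, which is how the `num_words` limit takes effect. The function `to_sequences` and the small `word_index` are hypothetical stand-ins.

```python
def to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            # Unknown words and words ranked at or beyond num_words become OOV.
            if idx is None or idx >= num_words:
                idx = oov_id
            ids.append(idx)
        sequences.append(ids)
    return sequences

word_index = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4, "ran": 5, "dog": 6}
print(to_sequences(["the cat sat", "the bird ran"], word_index, num_words=5))
# [[2, 3, 4], [2, 1, 1]]
```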

### Padding

The final step is padding each of the samples to the same length. As seen above, the longest sample contains 156 words, so we pad all of the sequences to a length of 150. This is done using the `pad_sequences` method, which takes the tokenized sequences that we just defined and either appends 0s to the end of a sequence until it has a length of 150, or removes integers from the end of a longer sequence until it has a length of 150.

In [None]:
padded = pad_sequences(tokenized, maxlen=150, padding='post', truncating='post')
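
The effect of `padding='post'` and `truncating='post'` can be sketched in plain Python. This is an illustration of the behavior on lists, not the Keras implementation (which returns a NumPy array); the helper `pad_post` is hypothetical.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence at maxlen, then append `value` until it has maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding='post'
    return padded

print(pad_post([[5, 3], [7, 8, 9, 4, 2, 6]], maxlen=4))
# [[5, 3, 0, 0], [7, 8, 9, 4]]
```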

As seen below, our padding was successful and leaves us with an array of 405263 sequences of length 150 to pass into our model, so that we can predict MBTI types.

In [None]:
padded.shape