# Machine Learning for Text Classification of MBTI Personality Types

### Importing data
#### Dataset from Kaggle: https://www.kaggle.com/datasnaek/mbti-type
##### This data was collected from an internet forum focused on personality types. As a result, the text contains far more explicit mentions of these personality types than I would expect to find "in the wild." Because of this, I expect that any model trained on this data will not generalize well to other, broader sources.

In [1]:
import pandas as pd
df = pd.read_csv("./data/mbti-type.zip", compression="zip")
print(df.shape)
df.head()

(8675, 2)


Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


### Preprocessing
##### We need to add the types themselves to the stopword list because otherwise they are the most salient features. Because this corpus was gathered from a forum that discusses MBTI types, letting the model rely on those mentions seems like cheating.

In [8]:
from nltk.corpus import stopwords as nltk_stopwords

type_stopwords = [t.lower() for t in df.type.unique()]
plural_type_stopwords = [x + "s" for x in type_stopwords]

print(type_stopwords)
print(plural_type_stopwords)
all_stopwords = type_stopwords + plural_type_stopwords + list(nltk_stopwords.words('english'))
print(len(all_stopwords))

['infj', 'entp', 'intp', 'intj', 'entj', 'enfj', 'infp', 'enfp', 'isfp', 'istp', 'isfj', 'istj', 'estp', 'esfp', 'estj', 'esfj']
['infjs', 'entps', 'intps', 'intjs', 'entjs', 'enfjs', 'infps', 'enfps', 'isfps', 'istps', 'isfjs', 'istjs', 'estps', 'esfps', 'estjs', 'esfjs']
211
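To see what the extra stopwords do, here is a minimal sketch on three made-up posts. The names `toy_posts` and `toy_stopwords` are hypothetical and the stopword list is far shorter than `all_stopwords`; the point is only that `TfidfVectorizer` lowercases text before filtering, so lowercase entries such as `intp` also remove `INTP` and `INTPs`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up posts standing in for the forum data (hypothetical examples)
toy_posts = [
    "As an INTP I love long debates about logic",
    "Most INTPs and ENFJs I know love hiking",
    "I love hiking and quiet evenings",
]

# A tiny stopword list in the same spirit as all_stopwords
toy_stopwords = ["intp", "intps", "enfj", "enfjs", "and", "as", "an", "about", "most"]

# Fit one vectorizer without stopwords and one with them
with_types = TfidfVectorizer().fit(toy_posts)
without_types = TfidfVectorizer(stop_words=toy_stopwords).fit(toy_posts)

# The type mentions are features in the first vocabulary but not the second
print("intp" in with_types.vocabulary_)
print("intp" in without_types.vocabulary_)
print(sorted(without_types.vocabulary_))
```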


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range = (1,3),
                            stop_words = all_stopwords)

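# fit_transform returns a single SciPy sparse matrix for the whole corpus.
# NOTE: assigning it to a DataFrame column does not give one vector per row;
# judging by the output below, every cell ends up holding the same matrix.
df["preprocessed"] = vectorizer.fit_transform(df.posts)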
print(vectorizer)
df.head()

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['infj', 'entp', 'intp', 'intj', 'entj', 'enfj', 'infp', 'enfp', 'isfp', 'istp', 'isfj', 'istj', 'estp', 'esfp', 'estj', 'esfj', 'infjs', 'entps', 'intps', 'intjs', 'entjs', 'enfjs', 'infps', 'enfps', 'isfps', 'istps', 'isfjs', 'istjs', 'estps', 'esfps', 'estjs', 'esfjs', 'i', 'me', 'my',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)


Unnamed: 0,type,posts,preprocessed
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,"(0, 3049423)\t0.11930510055808789\n (0, 753..."
1,ENTP,'I'm finding the lack of me in these posts ver...,"(0, 3049423)\t0.11930510055808789\n (0, 753..."
2,INTP,'Good one _____ https://www.youtube.com/wat...,"(0, 3049423)\t0.11930510055808789\n (0, 753..."
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...","(0, 3049423)\t0.11930510055808789\n (0, 753..."
4,ENTJ,'You're fired.|||That's another silly misconce...,"(0, 3049423)\t0.11930510055808789\n (0, 753..."
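The identical cell contents in the preprocessed column above suggest that the whole sparse matrix was stored in every row, rather than one vector per post. A minimal sketch of keeping the features in their own variable instead, using a made-up three-row DataFrame (`toy_df`, `toy_vectorizer`, and `X_toy` are hypothetical names, not part of this notebook):

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-in for the real DataFrame
toy_df = pd.DataFrame({
    "type": ["INFJ", "ENTP", "INTP"],
    "posts": ["music and long walks", "debate club was fun", "long walks and debate"],
})

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Keep the result as a sparse matrix: one row per post, one column per n-gram
X_toy = toy_vectorizer.fit_transform(toy_df.posts)

print(sparse.issparse(X_toy))
print(X_toy.shape)

# Row i of X_toy lines up with row i of toy_df, so labels can stay in the DataFrame
second_post_vector = X_toy[1:2]
print(second_post_vector.shape)
```

Row order is preserved, so `X_toy` and `toy_df.type` can be passed together to `train_test_split`.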


### Train / Test Split
##### I chose to use stratified sampling as the overall distribution of personality types is not at all uniform. I suspect (but have no data to support the suspicion) that the distribution of personality types found on this forum is significantly different from the population distribution, so another sampling method may in fact be better.

In [12]:
from sklearn.model_selection import train_test_split

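# Split the sparse TF-IDF matrix itself rather than the DataFrame column,
# which does not hold one feature vector per row
X_train, X_test, y_train, y_test = train_test_split(vectorizer.transform(df.posts),
                                                   df.type,
                                                   test_size = 0.2,
                                                   random_state = 501, # I just re-watched Band of Brothers
                                                   shuffle = True,
                                                   stratify = df.type)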

print("Training:")
print(X_train.shape)
print(y_train.shape)
print("Testing:")
print(X_test.shape)
print(y_test.shape)

Training:
(6940, 2744808)
(6940,)
Testing:
(1735, 2744808)
(1735,)
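As a small illustration of what `stratify` does, the sketch below splits a made-up, imbalanced label column and compares class counts before and after. The label counts (60/30/10) and the names `labels` and `features` are invented for the example and are not the real distribution in this dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels (hypothetical counts, not the forum's real distribution)
labels = pd.Series(["INFP"] * 60 + ["INTJ"] * 30 + ["ESTJ"] * 10)
features = pd.DataFrame({"row_id": range(len(labels))})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=501, stratify=labels
)

# With stratify, each split keeps the 60/30/10 proportions of the full data
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Without `stratify`, a rare class such as the 10-row one here could end up with very few or no rows in the test set.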


### SVM Classifier

In [14]:
from sklearn.svm import SVC

clf = SVC()
clf.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### Accuracy of Default SVM Classifier

In [19]:
from sklearn.metrics import accuracy_score
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)

0.21152737752161382
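An accuracy of 0.2115 on 1,735 test rows corresponds to 367 correct predictions. One way to judge whether that beats a trivial model is to compare it with a baseline that always predicts the most frequent class. The sketch below does this on made-up labels; the class counts and the `*_toy` names are hypothetical. If the same check on the real `y_train` and `y_test` gave a similar number, that would suggest the SVM is predicting a single type for every post.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up label counts for illustration only
y_train_toy = np.array(["INFP"] * 8 + ["INFJ"] * 6 + ["INTP"] * 5 + ["INTJ"] * 4 + ["ENTP"] * 3)
y_test_toy = np.array(["INFP"] * 4 + ["INFJ"] * 3 + ["INTP"] * 3 + ["INTJ"] * 2 + ["ENTP"] * 2)

# DummyClassifier ignores the features, so placeholder columns are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_predictions = baseline.predict(X_test_toy)
baseline_acc = accuracy_score(y_test_toy, baseline_predictions)

# The baseline accuracy equals the share of the most common class in the test set
print(baseline_acc)
print(np.unique(baseline_predictions))
```

On the real data, `np.unique(predictions)` applied to the SVM output would show directly how many distinct types the classifier ever predicts.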

### Dimensionality Reduction

In [47]:
from sklearn.feature_selection import SelectKBest, chi2

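# Re-create the split from the sparse TF-IDF matrix so the selector starts from all features
X_train, X_test, y_train, y_test = train_test_split(vectorizer.transform(df.posts),
                                                   df.type,
                                                   test_size = 0.2,
                                                   random_state = 501, # I just re-watched Band of Brothers
                                                   shuffle = True,
                                                   stratify = df.type)

# Keep the 5000 features with the highest chi-squared score against the labels;
# the selector is fitted on the training split only, then applied to the test split
selector = SelectKBest(chi2, k=5000)
X_train = selector.fit_transform(X_train, y_train)
X_test = selector.transform(X_test)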

print("Training:")
print(X_train.shape)
print(y_train.shape)
print("Testing:")
print(X_test.shape)
print(y_test.shape)
df.head()

Training:
(6940, 5000)
(6940,)
Testing:
(1735, 5000)
(1735,)


Unnamed: 0,type,posts,preprocessed,reduced
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,"(0, 3049423)\t0.11930510055808789\n (0, 753...","(0, 1613871)\t0.030620126796134687\n (0, 22..."
1,ENTP,'I'm finding the lack of me in these posts ver...,"(0, 3049423)\t0.11930510055808789\n (0, 753...","(0, 1613871)\t0.030620126796134687\n (0, 22..."
2,INTP,'Good one _____ https://www.youtube.com/wat...,"(0, 3049423)\t0.11930510055808789\n (0, 753...","(0, 1613871)\t0.030620126796134687\n (0, 22..."
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...","(0, 3049423)\t0.11930510055808789\n (0, 753...","(0, 1613871)\t0.030620126796134687\n (0, 22..."
4,ENTJ,'You're fired.|||That's another silly misconce...,"(0, 3049423)\t0.11930510055808789\n (0, 753...","(0, 1613871)\t0.030620126796134687\n (0, 22..."


In [48]:
clf = SVC()
clf.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

### Accuracy of Reduced-Dimensionality SVM Classifier

In [49]:
predictions = clf.predict(X_test)
accuracy_score(y_test, predictions)

0.21152737752161382
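This accuracy matches the earlier run to every digit, which suggests that feature selection did not change a single prediction. One possible explanation, which is an assumption and not something verified here, is that the default RBF kernel with `gamma='auto'` (1 / n_features) is nearly flat on TF-IDF vectors, so the classifier falls back to one class. A linear SVM is a common alternative for sparse text features. The sketch below wires one into a `Pipeline` on made-up data, so the vectorizer and selector are fitted on training text only; `LinearSVC`, `k=4`, and all the toy names are my choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training and test text with made-up two-class labels
train_docs = [
    "deep feelings and poetry",
    "poetry about feelings",
    "feelings in music",
    "logic puzzles and systems",
    "systems built on logic",
    "logic and proofs",
]
train_labels = ["F", "F", "F", "T", "T", "T"]
test_docs = ["poetry and feelings", "logic systems"]

# Each step is fitted on the training documents only when fit() is called
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=4)),
    ("svm", LinearSVC()),
])
text_clf.fit(train_docs, train_labels)

preds = text_clf.predict(test_docs)
print(list(preds))
```

For the real data, the pipeline would take raw `df.posts` split with `train_test_split`, with `stop_words`, `ngram_range`, and `k` set as in the cells above. Whether it improves on 0.2115 would need to be measured.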