# Unraveling MBTI Types from Online Posts


## Data Preprocessing

In [71]:
pip install pandas nltk

Note: you may need to restart the kernel to use updated packages.


In [72]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [120]:
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/daneshvar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/daneshvar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/daneshvar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [126]:
df = pd.read_csv('mbti.csv')

Now that we have imported the dataset, the first step is to clean the data, which includes the following steps:
1. Removing the URLs in the posts, as they are not a relevant feature for our classification task.
2. Removing special characters and numbers for further processing.
3. Removing stop words (a set of commonly used words in the English language such as 'a', 'the', 'and', etc., which carry little information in our context).
4. Converting the words to lowercase letters, since the semantic meaning of a word such as MoVInG and moving is the same.
5. **Lemmatization**: Reducing words to their base form helps reduce the complexity of textual data while retaining the semantic meaning of words. For example, 'running', 'runs', and 'ran' can all be lemmatized to 'run' when each word is treated as a verb. Note that ```WordNetLemmatizer.lemmatize``` treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as 'moving' or 'posted' are left unchanged by the code below.

In [127]:
# Build these once: calling stopwords.words('english') for every token is very slow
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_posts(post):
    # Remove URLs
    post = re.sub(r'http\S+', '', post)
    # Remove special characters and numbers
    post = re.sub(r'[^A-Za-z\s]', '', post)
    # Tokenize and lowercase
    tokens = [token.lower() for token in word_tokenize(post)]
    # Drop stop words, then lemmatize (as nouns, since no POS tag is passed)
    lemmatized_tokens = [LEMMATIZER.lemmatize(token) for token in tokens if token not in STOP_WORDS]
    # Rejoin lemmatized tokens into a single string
    return ' '.join(lemmatized_tokens)

The next step is to apply the cleaning function described above to our dataset. Given that we have 8000 datapoints, this is a time-consuming process, so here we clean only the first 500 rows. We also save the result locally and load that file in the future instead of cleaning the data every time we want to fit a model.

In [128]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df = df.head(500).copy()
df['cleaned_posts'] = df['posts'].apply(lambda x: '|||'.join([clean_posts(post) for post in x.split('|||')]))

In [129]:
df.to_csv('cleaned_dataset_500.csv', index=False)

In [130]:
import pandas as pd

df = pd.read_csv('cleaned_dataset_500.csv')
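
The save-then-reload pattern above can be wrapped in a small helper so that the expensive cleaning step runs only when no cached file exists. This is a minimal sketch rather than part of the original notebook: ```load_or_build```, ```build_demo_frame```, and the file name ```cleaned_demo.csv``` are hypothetical names chosen for illustration, and the demo frame stands in for the real cleaned dataset.

```python
import os
import tempfile

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the cached CSV if it exists, otherwise build it and cache it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = build_fn()
    df.to_csv(cache_path, index=False)
    return df


calls = []  # records how many times the expensive step actually runs


def build_demo_frame():
    # Hypothetical stand-in for the slow cleaning step.
    calls.append(1)
    return pd.DataFrame({"type": ["INFJ"], "cleaned_posts": ["move denver area"]})


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_demo.csv")
    first = load_or_build(path, build_demo_frame)   # builds and writes the file
    second = load_or_build(path, build_demo_frame)  # reads the cached file

print(len(calls))
```

The second call reads the CSV instead of rebuilding, so the build function runs only once.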

Here is an example to illustrate the data-cleaning process:

Considering the last post, we can see that

*Move to the Denver area and start a new life for myself.*

is converted to

*move denver area start new life*

The stop words 'to', 'the', 'and', 'a', 'for', 'myself' are removed, the trailing period is stripped, and every word is converted to lowercase.
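
The transformation of this sentence can be reproduced with the standard library alone. The sketch below is only an approximation of ```clean_posts```: it assumes a hand-picked stop-word set (```DEMO_STOP_WORDS```, a hypothetical name) instead of NLTK's full English list, splits on whitespace instead of using ```word_tokenize```, and skips lemmatization. The appended URL is a made-up example to show the URL-removal step.

```python
import re

# Hypothetical, hand-picked subset of stop words for this one sentence;
# the notebook uses NLTK's full English stop-word list instead.
DEMO_STOP_WORDS = {"to", "the", "and", "a", "for", "myself"}


def demo_clean(post):
    # Remove URLs
    post = re.sub(r"http\S+", "", post)
    # Remove special characters and numbers
    post = re.sub(r"[^A-Za-z\s]", "", post)
    # Lowercase and split on whitespace (a simplification of word_tokenize)
    tokens = post.lower().split()
    # Drop stop words and rejoin
    return " ".join(token for token in tokens if token not in DEMO_STOP_WORDS)


result = demo_clean(
    "Move to the Denver area and start a new life for myself. http://example.com/x"
)
print(result)  # move denver area start new life
```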

In [131]:
df.iloc[0]['posts']

"'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg|||enfp and intj moments  https://www.youtube.com/watch?v=iz7lE1g4XM4  sportscenter not top ten plays  https://www.youtube.com/watch?v=uCdfze1etec  pranks|||What has been the most life-changing experience in your life?|||http://www.youtube.com/watch?v=vXZeYwwRDw8   http://www.youtube.com/watch?v=u8ejam5DP3E  On repeat for most of today.|||May the PerC Experience immerse you.|||The last thing my INFJ friend posted on his facebook before committing suicide the next day. Rest in peace~   http://vimeo.com/22842206|||Hello ENFJ7. Sorry to hear of your distress. It's only natural for a relationship to not be perfection all the time in every moment of existence. Try to figure the hard times as times of growth, as...|||84389  84390  http://wallpaperpassion.com/upload/23700/friendship-boy-and-girl-wallpaper.jpg  http://assets.dornob.com/wp-content/uploads/2010/04/round-home-design.jpg ...

In [132]:
df.iloc[0]['cleaned_posts']

'||||||enfp intj moment sportscenter top ten play prank|||lifechanging experience life|||repeat today|||may perc experience immerse|||last thing infj friend posted facebook committing suicide next day rest peace|||hello enfj sorry hear distress natural relationship perfection time every moment existence try figure hard time time growth||||||welcome stuff|||game set match|||prozac wellbrutin least thirty minute moving leg dont mean moving sitting desk chair weed moderation maybe try edible healthier alternative|||basically come three item youve determined type whichever type want would likely use given type cognitive function whatnot left|||thing moderation sims indeed video game good one note good one somewhat subjective completely promoting death given sim|||dear enfp favorite video game growing current favorite video game cool||||||appears late sad|||there someone everyone|||wait thought confidence good thing|||cherish time solitude bc revel within inner world whereas time id workin 

## Feature Extraction

Potential feature extraction methods for converting text into numerical values are:
1. **TF-IDF (Term Frequency-Inverse Document Frequency)**
2. **Word2Vec**
3. **GloVe**

First, we briefly describe each of the methods, and then we choose among them based on the model.

### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic that aims to reflect how important a word is to a document in a collection or corpus. It is a widely used feature extraction and text vectorization method in the field of Natural Language Processing (NLP) for tasks such as document classification, search engine ranking, and topic modeling.

#### Components of TF-IDF:

##### 1. Term Frequency (TF):
This measures how frequently a term occurs in a document:
$$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$

##### 2. Inverse Document Frequency (IDF):
This measures how rare, and therefore how informative, a term is across the corpus:
$$
IDF(t, D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t}\right)
$$

The TF-IDF score is the product of TF and IDF:
$$
TFIDF(t,d,D) = TF(t,d) \times IDF(t,D)
$$

This score is high for a term that has high frequency in a given document but low document frequency in the whole corpus. Thus, TF-IDF tends to filter out common terms and retain terms that uniquely characterize a document.
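
The formulas above can be checked by hand on a toy corpus. The following is a minimal sketch in plain Python; the three tokenized documents and the function names ```tf```, ```idf```, and ```tfidf``` are made up for illustration. It implements the textbook definitions given here, whereas scikit-learn's ```TfidfVectorizer``` by default uses raw counts for TF, a smoothed IDF, and L2 normalization, so its numbers will differ.

```python
import math

# Toy corpus: each document is already tokenized (hypothetical data).
corpus = [
    ["enfp", "intj", "moment"],
    ["video", "game", "enfp"],
    ["video", "game", "cool", "game"],
]


def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)


def idf(term, docs):
    # Assumes `term` occurs in at least one document (otherwise division by zero)
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


score_game = tfidf("game", corpus[2], corpus)  # (2/4) * log(3/2)
score_cool = tfidf("cool", corpus[2], corpus)  # (1/4) * log(3/1)
print(score_game, score_cool)
```

Although 'game' appears twice in the last document and 'cool' only once, 'cool' receives the higher score because it occurs in no other document.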

### Word2Vec

ToDo

### GloVe

ToDo

## Naive Bayes

The first model we try for our classification task is the Naive Bayes model. The first question is:


*Which feature extraction method suits Naive Bayes best?*




### TF-IDF with Naive Bayes

TF-IDF is particularly well-suited for use with Naive Bayes classifiers for several reasons:

1. **Sparse Representation**: Naive Bayes models work well with sparse data representations, which is exactly what TF-IDF provides. Each document is represented as a vector in a high-dimensional space where each dimension corresponds to a specific term in the corpus, weighted by its TF-IDF score.

2. **Feature Interpretability**: Naive Bayes is a probabilistic model that assumes independence between features. TF-IDF scores can be directly interpreted as the importance of words within documents relative to the corpus, fitting nicely with the probabilistic nature of Naive Bayes.

3. **Effectiveness in Text Classification**: The combination of TF-IDF and Naive Bayes has been historically effective for various text classification tasks, such as spam detection, sentiment analysis, and topic classification.



### Word Embedding Methods (Word2Vec, GloVe) with Naive Bayes

Word Embeddings like Word2Vec or GloVe represent words in a continuous vector space, capturing semantic relationships between them. However, **using these embeddings directly with Naive Bayes classifiers is less common** for a few reasons:

1. **Dense Representation**: Unlike TF-IDF, word embeddings result in dense vectors, where each dimension represents a latent feature learned from the data. This dense representation does not align as naturally with the assumptions of Naive Bayes classifiers, which expect independent features.

2. **Aggregation Challenge**: To use word embeddings for document-level classification with Naive Bayes, you need to aggregate word vectors into a single vector representing the entire document. Common aggregation methods include averaging the word vectors. However, this process can dilute the semantic relationships captured by the embeddings and does not directly cater to the probabilistic nature of Naive Bayes.

3. **Compatibility Issues**: Naive Bayes classifiers inherently model the probability distribution of features given the class labels. The continuous, real-valued nature of word embeddings does not fit directly into this framework without additional steps to adapt the model or data representation. For example, multinomial Naive Bayes expects non-negative, count-like features, whereas embedding dimensions can be negative.

Given the explanation above, we proceed with TF-IDF for the Naive Bayes model.
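
The compatibility issue can be seen directly in scikit-learn. In the sketch below, random Gaussian vectors stand in for averaged word embeddings (an assumption made purely for illustration, not real embeddings), and the labels are arbitrary. ```MultinomialNB``` rejects input with negative values, while ```GaussianNB```, a variant that models each feature as a continuous variable, accepts it.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
# Stand-in for document vectors built by averaging word embeddings:
# dense, real-valued, and containing negative entries.
X_dense = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)  # arbitrary labels for the demo

try:
    MultinomialNB().fit(X_dense, y)
    multinomial_ok = True
except ValueError:
    # MultinomialNB requires non-negative features
    multinomial_ok = False

# GaussianNB handles continuous features, negative values included
gaussian_preds = GaussianNB().fit(X_dense, y).predict(X_dense)

print(multinomial_ok, gaussian_preds.shape)
```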

### Applying TF-IDF

When applying TF-IDF, there are several parameters to set:

1. ```max_features```: Keep only the top ```max_features``` terms, ordered by term frequency across the corpus.
2. ```ngram_range```: The lower and upper bounds of the n-gram sizes to extract. ```(1,2)``` means unigrams and bigrams.
3. ```norm```: ```l1```, ```l2```, or no normalization.
4. ...


We will vary ```max_features``` and see how it changes the outcome.

In [133]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000) 
tfidf_features = tfidf_vectorizer.fit_transform(df['cleaned_posts'])
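
The effect of ```max_features``` and ```ngram_range``` on the vocabulary size can be seen on a toy corpus. This is an illustrative sketch: the three short documents in ```toy_corpus``` are made up, and the counts apply only to them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of three already-cleaned posts
toy_corpus = [
    "video game cool",
    "video game good",
    "enfp intj moment",
]

# Default: unigrams only, no cap on vocabulary size
unigram_vocab = TfidfVectorizer().fit(toy_corpus).vocabulary_

# Unigrams and bigrams
bigram_vocab = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_corpus).vocabulary_

# Keep only the 3 most frequent terms
capped_vocab = TfidfVectorizer(max_features=3).fit(toy_corpus).vocabulary_

print(len(unigram_vocab), len(bigram_vocab), len(capped_vocab))
```

Adding bigrams enlarges the feature space, while ```max_features``` caps it, so the two parameters pull in opposite directions.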

In [134]:
from sklearn.model_selection import train_test_split

X = tfidf_features
y = df['type'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [135]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

model = MultinomialNB()
model.fit(X_train, y_train)

In [139]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.26


An accuracy of 26% shows that there is a lot of room for improvement.
Potential causes and next steps:
1. Imbalanced dataset (some types are much rarer than others): use bar plots to compare class frequencies. If a type is a minority, oversample it to help fit a better model.
2. Not enough data loaded (so far we have used only 500 rows).
3. Look into better metrics than plain accuracy.
4. Adjust ```ngram_range``` to consider bigrams too.
5. ...
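
Several of these ideas can be combined in a few lines. The sketch below is one possible setup under stated assumptions, not a tuned solution: the eight short texts and their labels are made up, class counts are inspected with ```Counter```, the split is stratified, bigrams are enabled, and macro-averaged F1 is reported instead of accuracy. Placing the vectorizer inside a pipeline also means the IDF weights are learned from the training split only.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['cleaned_posts'] and df['type']
texts = [
    "plan system theory logic", "theory strategy plan future",
    "logic system design plan", "strategy logic theory book",
    "party friend fun music", "fun idea friend travel",
    "music party dance friend", "travel fun idea party",
]
labels = ["INTJ"] * 4 + ["ENFP"] * 4

# 1. Inspect class balance before modelling
class_counts = Counter(labels)

# Stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=1234
)

# 4. Unigrams and bigrams; the vectorizer is fitted on the training split only
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# 3. Macro F1 weights every class equally, unlike plain accuracy
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(class_counts, macro_f1)
```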
