# 1. Text Cleaning for the Dataset

The training dataset used is from the [MBTI Personality Types](https://www.kaggle.com/datasets/datasnaek/mbti-type), which includes social media posts and their corresponding MBTI personality type labels. The text cleaning process, inspired by a [StackOverflow discussion](https://stackoverflow.com/questions/55187374/cleaning-text-with-python-and-re), involves converting text to lowercase, removing punctuation, URLs, and HTML tags, and stripping leading and trailing whitespace.



In [1]:
import pandas as pd
import re

#define a text cleaner class
class TextCleaner:
    def clean_text(text):
        #convert text to lowercase and remove specific punctuation and symbols
        text = re.sub(r"[-()\"#/@;:<>{}+=~|.,?]", "", text.lower())
        #replace '|||' with a space
        text = re.sub(r'\|\|\|', ' ', text)
        #remove URLs and HTML tags
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        text = re.sub(r'<.*?>', '', text)
        #strip leading and trailing spaces，ref:https://www.w3schools.com/python/ref_string_strip.asp
        return text.strip()

#read data from the CSV file into a DataFrame
data = pd.read_csv('data/mbti_1.csv')
#apply the clean_text method from TextCleaner class to the 'posts' column, then store in a new column 'cleaned_posts'
data['cleaned_posts'] = data['posts'].apply(TextCleaner.clean_text)

# 2. Dataset Splitting and Preprocessing (split Function)

I use the `split` function to divide the cleaned dataset into training and testing sets, and then perform TF-IDF vectorization on the text data. Additionally, the target labels are encoded using `LabelEncoder`. Once these steps are completed, the data is ready to be fed into machine learning models for training and prediction.

This method is inspired by a [Kaggle notebook](https://www.kaggle.com/code/anandu08/psycho-analysis-nlp-fr-enhanced-social-media-con/notebook#Performance-Visualisation) and has been optimized and simplified to improve code readability and execution efficiency.


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

#function to split dataset and preprocess
def data_split(df, size):
    #split the dataset into training and testing sets with stratification on 'type'
    train_data, test_data = train_test_split(df, test_size=size, random_state=0, stratify=df['type'])

    #initialize the TF-IDF vectorizer with a max of 5000 features and English stopwords 
    vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
    #fit and transform the training data into a TF-IDF matrix
    train_post = vectorizer.fit_transform(train_data['cleaned_posts']).toarray()
    #transform the test data using the same vectorizer
    test_post = vectorizer.transform(test_data['cleaned_posts']).toarray()

    #initialize the LabelEncoder
    target_encoder = LabelEncoder()
    #encode the target labels for the training data
    train_target = target_encoder.fit_transform(train_data['type'])
    #encode the target labels for the test data
    test_target = target_encoder.fit_transform(test_data['type'])

    #return the processed data and the encoding/vectorization tools
    return train_post, test_post, train_target, test_target, target_encoder, vectorizer


# 3. Function for Running All the Models

I utilized the `model` function to train and evaluate several machine learning models, including KNN, Logistic Regression, Linear SVC, Multinomial Naive Bayes, Decision Tree, and Random Forest. Each of these models was trained on the training dataset and then evaluated on the test dataset to compare their performance.

The function returns the accuracy and F1 scores for each model, which allows us to assess which model is most effective for predicting MBTI types. The results indicate that the **Linear Support Vector Classifier** achieved the highest accuracy and F1 scores, suggesting it might be more suitable for this task compared to the other models.

This process was inspired by and optimized from a [Kaggle notebook](https://www.kaggle.com/code/anandu08/psycho-analysis-nlp-fr-enhanced-social-media-con/notebook#Performance-Visualisation), which provides a comprehensive analysis of various models' performance on a similar task.



In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, f1_score

def model(X_train, X_test, y_train, y_test, count, target_encoder):
    #dictionary to store accuracy for each model
    models_accuracy = {}
    #dictionary to store classification report for each model
    report = {}
    #dictionary to store F1 scores for each model
    f1_scores = {}  
    
    #kNN
    print("Running KNN")
    neigh = KNeighborsClassifier()
    neigh.fit(X_train, y_train)
    #showing accuracy and f1_scores
    models_accuracy['KNN'] = accuracy_score(y_test, neigh.predict(X_test))
    f1_scores['KNN'] = f1_score(y_test, neigh.predict(X_test), average='macro')
    report['KNN'] = classification_report(y_test, neigh.predict(X_test), zero_division=0)
    
    #logistic Regression
    print("Running Logistic Regression")
    model_log = LogisticRegression(max_iter=3000, C=0.5, n_jobs=-1)
    model_log.fit(X_train, y_train)
    models_accuracy['Logistic Regression'] = accuracy_score(y_test, model_log.predict(X_test))
    f1_scores['Logistic Regression'] = f1_score(y_test, model_log.predict(X_test), average='macro')
    report['Logistic Regression'] = classification_report(y_test, model_log.predict(X_test), zero_division=0)
    
    #linear SVC
    print("Running Linear SVC")
    model_linear_SVC = LinearSVC(C=0.1)
    model_linear_SVC.fit(X_train, y_train)
    models_accuracy['Linear Support Vector Classifier'] = accuracy_score(y_test, model_linear_SVC.predict(X_test))
    f1_scores['Linear Support Vector Classifier'] = f1_score(y_test, model_linear_SVC.predict(X_test), average='macro')
    report['Linear Support Vector Classifier'] = classification_report(y_test, model_linear_SVC.predict(X_test), zero_division=0)

    #multinomial Naive Bayes
    print("Running Multinomial Naive Bayes")
    model_multinomial_nb = MultinomialNB()
    model_multinomial_nb.fit(X_train, y_train)
    models_accuracy['Multinomial Naive Bayes'] = accuracy_score(y_test, model_multinomial_nb.predict(X_test))
    f1_scores['Multinomial Naive Bayes'] = f1_score(y_test, model_multinomial_nb.predict(X_test), average='macro')
    report['Multinomial Naive Bayes'] = classification_report(y_test, model_multinomial_nb.predict(X_test), zero_division=0)
    
    # Decision Tree Classifier
    print("Running Decision Tree Classifier")
    model_tree = DecisionTreeClassifier(max_depth=14)
    model_tree.fit(X_train, y_train)
    models_accuracy['Decision Tree Classifier'] = accuracy_score(y_test, model_tree.predict(X_test))
    f1_scores['Decision Tree Classifier'] = f1_score(y_test, model_tree.predict(X_test), average='macro')
    report['Decision Tree Classifier'] = classification_report(y_test, model_tree.predict(X_test), zero_division=0)

    #random Forest
    print("Running Random Forest")
    model_forest = RandomForestClassifier(max_depth=10)
    model_forest.fit(X_train, y_train)
    models_accuracy['Random Forest Classifier'] = accuracy_score(y_test, model_forest.predict(X_test))
    f1_scores['Random Forest Classifier'] = f1_score(y_test, model_forest.predict(X_test), average='macro')
    report['Random Forest Classifier'] = classification_report(y_test, model_forest.predict(X_test), zero_division=0)
    
    #convert accuracy and F1 score dictionaries to DataFrames
    accuracy_under = pd.DataFrame(models_accuracy.items(), columns=['Models', 'Test accuracy'])
    f1_under = pd.DataFrame(f1_scores.items(), columns=['Models', 'Test F1 Score'])
    
    #return the accuracy DataFrame, classification reports, and F1 score DataFrame
    return accuracy_under, report, f1_under

#split the dataset and get training and testing data
X_train, X_test, y_train, y_test, target_encoder, vectorizer = data_split(data, 0.2)
#only receive the required two return values, ignoring the 'report'
accuracy_under, _, f1_under = model(X_train, X_test, y_train, y_test, len(target_encoder.classes_), target_encoder)

#sorting model accuracy and F1 score tables
accuracy_sorted = accuracy_under.sort_values(by='Test accuracy', ascending=False, ignore_index=True)
f1_sorted = f1_under.sort_values(by='Test F1 Score', ascending=False, ignore_index=True)

#merging model accuracy and F1 score tables
combined_table = pd.merge(accuracy_sorted, f1_sorted, on='Models', suffixes=('_accuracy', '_f1'))

#sorting combined_table by Test accuracy and Test F1 Score
combined_table_sorted = combined_table.sort_values(by=['Test accuracy', 'Test F1 Score'], ascending=False)

#applying background gradient to the sorted combined table and displaying it with a red background
styled_table = combined_table_sorted.style.background_gradient(cmap='Reds')
styled_table

Running KNN
Running Logistic Regression
Running Linear SVC
Running Multinomial Naive Bayes
Running Decision Tree Classifier
Running Random Forest


Unnamed: 0,Models,Test accuracy,Test F1 Score
0,Linear Support Vector Classifier,0.662248,0.465726
1,Logistic Regression,0.621902,0.340786
2,Decision Tree Classifier,0.511816,0.322902
3,Random Forest Classifier,0.451297,0.161617
4,Multinomial Naive Bayes,0.378674,0.112725
5,KNN,0.3683,0.275026


# 4. Testing the Linear SVC Model

In this step, the Linear SVC model is tested using a list of example sentences. After training the model on the dataset, the `predict_mbti` function is used to clean, vectorize, and predict the MBTI type for each sentence. This process helps evaluate the model's performance on new, unseen data.


In [10]:
#split the dataset and get training and testing data
X_train, X_test, y_train, y_test, target_encoder, vectorizer = data_split(data, 0.2)
#train the Linear SVC model
model_linear_SVC = LinearSVC(C=0.1)
model_linear_SVC.fit(X_train, y_train)

#list of test sentences for prediction
test_sentences = [
    "I love spending time alone and reflecting on my thoughts.",
    "I'm always looking for new ways to connect with others.",
    "Logic and reasoning are the most important aspects of any decision.",
    "I enjoy exploring new places and meeting new people.",
    "Planning ahead is key to staying organized and efficient.",
    "I often find myself daydreaming about future possibilities."]

def predict_mbti(sentence, vectorizer, model, target_encoder):
    #clean the input sentence
    cleaned_sentence = TextCleaner.clean_text(sentence)
    #vectorize the cleaned sentence
    vectorized_sentence = vectorizer.transform([cleaned_sentence]).toarray()
    #predict the MBTI type
    prediction = model.predict(vectorized_sentence)
    #convert the prediction to the original MBTI type
    return target_encoder.inverse_transform(prediction)[0]

#predict MBTI type for each sentence in the list
for sentence in test_sentences:
    predicted_mbti = predict_mbti(sentence, vectorizer, model_linear_SVC, target_encoder)
    print(f"Input Sentence: {sentence}")
    print(f"Predicted MBTI Type: {predicted_mbti}")

Input Sentence: I love spending time alone and reflecting on my thoughts.
Predicted MBTI Type: INFP
Input Sentence: I'm always looking for new ways to connect with others.
Predicted MBTI Type: INFJ
Input Sentence: Logic and reasoning are the most important aspects of any decision.
Predicted MBTI Type: INTJ
Input Sentence: I enjoy exploring new places and meeting new people.
Predicted MBTI Type: ENFP
Input Sentence: Planning ahead is key to staying organized and efficient.
Predicted MBTI Type: INTJ
Input Sentence: I often find myself daydreaming about future possibilities.
Predicted MBTI Type: INFP


# 5. Saving the Trained Model and Tools

The trained Linear SVC model, along with the TF-IDF vectorizer and label encoder, are saved to files using `joblib`. The code for saving these objects is based on the usage described in the [Joblib Documentation: `joblib.dump`](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html).


In [12]:
import joblib

#save the trained Linear SVC model to a file
model_filename = 'mbti_linear_svc_model.pkl'
joblib.dump(model_linear_SVC, model_filename)

#save the TF-IDF vectorizer used for text transformation
vectorizer_filename = 'tfidf_vectorizer.pkl'
joblib.dump(vectorizer, vectorizer_filename)

#save the label encoder for converting predictions back to MBTI types
encoder_filename = 'label_encoder.pkl'
joblib.dump(target_encoder, encoder_filename)

['label_encoder.pkl']