# Model selection

In this notebook, we benchmark several configurations (each pairing a vectorizer with a classifier) to find the best one for our use case.

## Libraries

In [34]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import pandas as pd
import plotly.express as px
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

## Data import

In [35]:
df = pd.read_csv('../mbti_data_preproc.csv')
df.head()

| Unnamed: 0 | type | posts_final |
|---|---|---|
| 0 | INFJ | http youtub watch v qsxhcwe3krw http 41 media ... |
| 1 | ENTP | find lack post alarm sex bore posit often exam... |
| 2 | INTP | http youtub watch v fhigbolffgw cours bless cu... |
| 3 | INTJ | dear intp enjoy convers day esoter gab natur u... |
| 4 | ENTJ | fire anoth silli misconcept approach logic go ... |


## Vectorization

We benchmark 2 methods:
- CountVectorizer
- TfidfVectorizer
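
To make the difference between the two methods concrete, here is a minimal sketch on a two-document toy corpus (the sentences are invented for illustration and do not come from the dataset). `CountVectorizer` returns raw term counts, while `TfidfVectorizer` down-weights terms that appear in many documents and, with its default settings, scales each row to unit L2 norm.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only, not taken from the dataset)
toy_corpus = ["enjoy logic logic", "enjoy music"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(toy_corpus).toarray()
tfidf = tfidf_vec.fit_transform(toy_corpus).toarray()

# Column order follows the indices stored in the fitted vocabulary
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts)          # raw term counts per document
print(tfidf.round(2))  # IDF-weighted counts, each row L2-normalised
```

In the second document, "music" receives a higher weight than "enjoy" because "enjoy" occurs in both documents.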

In [56]:
corpus = df['posts_final']
y = df['type']

c_vectorizer = CountVectorizer()
ti_vectorizer = TfidfVectorizer()

In [41]:
c_X = c_vectorizer.fit_transform(corpus)
ti_X = ti_vectorizer.fit_transform(corpus)

In [43]:
c_X_train, c_X_test, c_y_train, c_y_test = train_test_split(c_X, y, test_size=0.2, random_state=14)
ti_X_train, ti_X_test, ti_y_train, ti_y_test = train_test_split(ti_X, y, test_size=0.2, random_state=14)
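
One caveat on the cells above: both vectorizers are fitted on the full corpus before `train_test_split`, so the vocabulary and IDF weights are learned from documents that later land in the test set. A possible alternative, sketched below on hypothetical toy data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so the vectorizer only ever sees the training documents. The `stratify` argument is an extra assumption, not used in this notebook; it keeps class proportions similar in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for df['posts_final'] and df['type']
toy_corpus = [
    "enjoy logic debat idea", "logic system theori idea",
    "feel music art dream", "dream feel poem art",
    "debat theori logic argu", "art music feel heart",
    "system idea argu logic", "poem dream heart feel",
]
toy_y = ["INTP", "INTP", "INFP", "INFP", "INTP", "INFP", "INTP", "INFP"]

# Split the raw text first, so the test documents stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    toy_corpus, toy_y, test_size=0.25, random_state=14, stratify=toy_y
)

# The pipeline fits the vectorizer on the training documents only
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=14))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(list(zip(y_test, y_pred)))
```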

## Prediction

**Configuration 1:** CountVectorizer + LinearSVC

In [55]:
c_clf = LinearSVC(max_iter=1000, random_state=14)
c_clf.fit(c_X_train, c_y_train)
c_y_pred = c_clf.predict(c_X_test)
print(classification_report(c_y_test, c_y_pred))

              precision    recall  f1-score   support

        ENFJ       0.47      0.20      0.29        44
        ENFP       0.56      0.57      0.56       122
        ENTJ       0.67      0.48      0.56        50
        ENTP       0.53      0.62      0.57       128
        ESFJ       0.33      0.20      0.25         5
        ESFP       0.50      0.38      0.43         8
        ESTJ       0.50      0.29      0.36         7
        ESTP       0.62      0.53      0.57        15
        INFJ       0.61      0.65      0.63       274
        INFP       0.69      0.71      0.70       391
        INTJ       0.56      0.57      0.57       221
        INTP       0.63      0.63      0.63       286
        ISFJ       0.45      0.48      0.47        29
        ISFP       0.35      0.30      0.32        47
        ISTJ       0.49      0.42      0.45        40
        ISTP       0.53      0.53      0.53        68

    accuracy                           0.60      1735
   macro avg       0.53   

**Configuration 2:** TfidfVectorizer + LinearSVC

In [57]:
ti_clf = LinearSVC(random_state=14)
ti_clf.fit(ti_X_train, ti_y_train)
ti_y_pred = ti_clf.predict(ti_X_test)
print(classification_report(ti_y_test, ti_y_pred))

              precision    recall  f1-score   support

        ENFJ       0.64      0.20      0.31        44
        ENFP       0.66      0.63      0.65       122
        ENTJ       0.76      0.52      0.62        50
        ENTP       0.73      0.68      0.70       128
        ESFJ       0.25      0.20      0.22         5
        ESFP       1.00      0.12      0.22         8
        ESTJ       1.00      0.43      0.60         7
        ESTP       0.62      0.53      0.57        15
        INFJ       0.65      0.72      0.68       274
        INFP       0.73      0.86      0.79       391
        INTJ       0.65      0.65      0.65       221
        INTP       0.72      0.74      0.73       286
        ISFJ       0.56      0.52      0.54        29
        ISFP       0.68      0.36      0.47        47
        ISTJ       0.79      0.47      0.59        40
        ISTP       0.65      0.68      0.66        68

    accuracy                           0.69      1735
   macro avg       0.69   

**Configuration 3:** CountVectorizer + MultinomialNB

In [59]:
c_clf = MultinomialNB()
c_clf.fit(c_X_train, c_y_train)
c_y_pred = c_clf.predict(c_X_test)
print(classification_report(c_y_test, c_y_pred, zero_division=0))

              precision    recall  f1-score   support

        ENFJ       0.00      0.00      0.00        44
        ENFP       1.00      0.01      0.02       122
        ENTJ       0.00      0.00      0.00        50
        ENTP       0.43      0.02      0.04       128
        ESFJ       0.00      0.00      0.00         5
        ESFP       0.00      0.00      0.00         8
        ESTJ       0.00      0.00      0.00         7
        ESTP       0.00      0.00      0.00        15
        INFJ       0.35      0.64      0.45       274
        INFP       0.40      0.87      0.55       391
        INTJ       0.68      0.19      0.30       221
        INTP       0.51      0.55      0.53       286
        ISFJ       0.00      0.00      0.00        29
        ISFP       0.00      0.00      0.00        47
        ISTJ       0.00      0.00      0.00        40
        ISTP       0.00      0.00      0.00        68

    accuracy                           0.41      1735
   macro avg       0.21   

**Configuration 4:** TfidfVectorizer + MultinomialNB

In [61]:
ti_clf = MultinomialNB()
ti_clf.fit(ti_X_train, ti_y_train)
ti_y_pred = ti_clf.predict(ti_X_test)
print(classification_report(ti_y_test, ti_y_pred, zero_division=0))

              precision    recall  f1-score   support

        ENFJ       0.00      0.00      0.00        44
        ENFP       0.00      0.00      0.00       122
        ENTJ       0.00      0.00      0.00        50
        ENTP       0.00      0.00      0.00       128
        ESFJ       0.00      0.00      0.00         5
        ESFP       0.00      0.00      0.00         8
        ESTJ       0.00      0.00      0.00         7
        ESTP       0.00      0.00      0.00        15
        INFJ       0.00      0.00      0.00       274
        INFP       0.23      1.00      0.37       391
        INTJ       0.00      0.00      0.00       221
        INTP       0.00      0.00      0.00       286
        ISFJ       0.00      0.00      0.00        29
        ISFP       0.00      0.00      0.00        47
        ISTJ       0.00      0.00      0.00        40
        ISTP       0.00      0.00      0.00        68

    accuracy                           0.23      1735
   macro avg       0.01   

**Configuration 5:** CountVectorizer + LogisticRegression

In [62]:
c_clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=14)
c_clf.fit(c_X_train, c_y_train)
c_y_pred = c_clf.predict(c_X_test)
print(classification_report(c_y_test, c_y_pred))

              precision    recall  f1-score   support

        ENFJ       0.50      0.25      0.33        44
        ENFP       0.57      0.58      0.58       122
        ENTJ       0.69      0.54      0.61        50
        ENTP       0.59      0.66      0.63       128
        ESFJ       0.14      0.20      0.17         5
        ESFP       0.43      0.38      0.40         8
        ESTJ       0.43      0.43      0.43         7
        ESTP       0.50      0.60      0.55        15
        INFJ       0.66      0.66      0.66       274
        INFP       0.71      0.74      0.73       391
        INTJ       0.61      0.59      0.60       221
        INTP       0.67      0.66      0.66       286
        ISFJ       0.52      0.55      0.53        29
        ISFP       0.41      0.38      0.40        47
        ISTJ       0.58      0.62      0.60        40
        ISTP       0.59      0.62      0.60        68

    accuracy                           0.63      1735
   macro avg       0.54   


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



**Configuration 6:** TfidfVectorizer + LogisticRegression

In [63]:
ti_clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=14)
ti_clf.fit(ti_X_train, ti_y_train)
ti_y_pred = ti_clf.predict(ti_X_test)
print(classification_report(ti_y_test, ti_y_pred))

              precision    recall  f1-score   support

        ENFJ       0.54      0.50      0.52        44
        ENFP       0.65      0.61      0.63       122
        ENTJ       0.59      0.66      0.62        50
        ENTP       0.70      0.70      0.70       128
        ESFJ       0.14      0.20      0.17         5
        ESFP       0.20      0.25      0.22         8
        ESTJ       0.50      0.57      0.53         7
        ESTP       0.42      0.67      0.51        15
        INFJ       0.81      0.67      0.74       274
        INFP       0.79      0.77      0.78       391
        INTJ       0.71      0.68      0.70       221
        INTP       0.77      0.80      0.78       286
        ISFJ       0.44      0.66      0.53        29
        ISFP       0.57      0.62      0.59        47
        ISTJ       0.59      0.72      0.65        40
        ISTP       0.63      0.76      0.69        68

    accuracy                           0.71      1735
   macro avg       0.57   

The same configuration, this time with a one-vs-rest scheme (`multi_class='ovr'`):

In [64]:
ti_clf = LogisticRegression(class_weight='balanced', multi_class='ovr', max_iter=1000, random_state=14)
ti_clf.fit(ti_X_train, ti_y_train)
ti_y_pred = ti_clf.predict(ti_X_test)
print(classification_report(ti_y_test, ti_y_pred))

              precision    recall  f1-score   support

        ENFJ       0.69      0.41      0.51        44
        ENFP       0.65      0.61      0.63       122
        ENTJ       0.71      0.58      0.64        50
        ENTP       0.69      0.70      0.69       128
        ESFJ       0.17      0.20      0.18         5
        ESFP       0.67      0.25      0.36         8
        ESTJ       0.60      0.43      0.50         7
        ESTP       0.56      0.67      0.61        15
        INFJ       0.73      0.71      0.72       274
        INFP       0.76      0.82      0.79       391
        INTJ       0.70      0.68      0.69       221
        INTP       0.74      0.81      0.77       286
        ISFJ       0.45      0.52      0.48        29
        ISFP       0.65      0.51      0.57        47
        ISTJ       0.78      0.70      0.74        40
        ISTP       0.68      0.71      0.69        68

    accuracy                           0.72      1735
   macro avg       0.64   
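
The six configurations above repeat the same fit, predict and report steps, so they could also be benchmarked in a single loop. The sketch below is one way to do that, under stated assumptions: the toy train/test data and the `results` structure are hypothetical, and macro F1 is added next to accuracy because it weights all classes equally, which matters when classes are imbalanced.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; replace with the real train/test text and labels
train_text = ["logic theori idea", "dream feel art", "idea system logic",
              "art poem feel", "theori argu logic", "feel heart dream"]
train_y = ["INTP", "INFP", "INTP", "INFP", "INTP", "INFP"]
test_text = ["logic idea argu", "dream art heart"]
test_y = ["INTP", "INFP"]

# Factories, so every configuration gets fresh, unfitted estimators
vectorizers = {"CountVectorizer": CountVectorizer,
               "TfidfVectorizer": TfidfVectorizer}
classifiers = {
    "LinearSVC": lambda: LinearSVC(random_state=14),
    "MultinomialNB": lambda: MultinomialNB(),
    "LogisticRegression": lambda: LogisticRegression(
        class_weight="balanced", max_iter=1000, random_state=14),
}

results = []
for (v_name, make_vec), (c_name, make_clf) in product(
        vectorizers.items(), classifiers.items()):
    model = make_pipeline(make_vec(), make_clf())
    model.fit(train_text, train_y)
    pred = model.predict(test_text)
    results.append({
        "vectorizer": v_name,
        "classifier": c_name,
        "accuracy": accuracy_score(test_y, pred),
        "macro_f1": f1_score(test_y, pred, average="macro", zero_division=0),
    })

# Best configuration first
for row in sorted(results, key=lambda r: r["macro_f1"], reverse=True):
    print(row)
```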

## Conclusion
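The best results are achieved with the **TfidfVectorizer + LogisticRegression** configuration (accuracy of 0.71, or 0.72 with the one-vs-rest variant, against 0.69 for TfidfVectorizer + LinearSVC). For well-represented classes the results are quite good: INFP, with 391 test samples, reaches an F1-score of 0.78 to 0.79. For under-represented classes the results are worse and less reliable: ESFJ, with only 5 test samples, reaches an F1-score of 0.17 to 0.18. I therefore propose a new strategy: instead of predicting the 16 types directly, train one model per MBTI dimension (I/E, N/S, T/F, J/P), i.e. 4 binary classification models.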
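
As a rough illustration of the proposed strategy, the sketch below derives one binary target per letter position of `type` and trains one classifier per target. The toy data frame, the `axes` labels and the `predict_type` helper are hypothetical names introduced for this example; the estimator settings mirror the LogisticRegression cells above, but this is only a sketch, not a tuned implementation.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy frame with the same columns as the notebook's df
toy_df = pd.DataFrame({
    "type": ["INFP", "ENTJ", "ISTP", "ENFJ", "INTP", "ESFP", "ISFJ", "ENTP"],
    "posts_final": [
        "dream feel poem art", "plan lead goal team",
        "fix tool engin quiet", "help peopl team care",
        "logic theori idea system", "parti fun friend danc",
        "care famili quiet help", "debat idea fun argu",
    ],
})

# One binary target per MBTI axis: the letter at each position of the type
axes = ["I/E", "N/S", "T/F", "J/P"]
models = {}
for position, axis in enumerate(axes):
    target = toy_df["type"].str[position]
    clf = make_pipeline(
        TfidfVectorizer(),
        LogisticRegression(class_weight="balanced", max_iter=1000,
                           random_state=14),
    )
    models[axis] = clf.fit(toy_df["posts_final"], target)


def predict_type(text):
    """Concatenate the four per-axis predictions into a 4-letter type."""
    return "".join(models[axis].predict([text])[0] for axis in axes)


print(predict_type("logic idea quiet"))
```

Each binary problem pools samples from several of the 16 types, so rare types such as ESFJ contribute to four larger training sets rather than forming one tiny class.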