# Model selection

In this notebook, we benchmark different configurations (vectorization and classification) to find the best one for our use case.

## Libraries

In [35]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

## Data import

In [2]:
df = pd.read_csv('../mbti_data.csv')
df.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


## Data preprocessing

We apply these steps, in this order:
- lowercasing
- noise removal (punctuation and extra whitespace)
- stopword removal
- stemming (optional)
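
As a rough illustration of the cleaning steps listed above, the sketch below runs them on a single sentence using only the standard library. The stopword set `TOY_STOP_WORDS` and the function `toy_preprocess` are hypothetical stand-ins: the notebook itself relies on NLTK's English stopword list, and stemming is left out here.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook
# uses NLTK's English stopword list instead.
TOY_STOP_WORDS = {"i", "m", "the", "of", "me", "in", "these", "very"}


def toy_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, replace non-alphanumeric characters, drop stopwords."""
    text = text.lower()
    # Anything that is not a letter, digit or whitespace becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no argument also discards repeated or leading whitespace.
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)


sample = "I'm finding the lack of me in these posts very alarming."
cleaned = toy_preprocess(sample)
print(cleaned)
```

Note that the apostrophe in "I'm" is replaced by a space, so the contraction splits into `i` and `m`, which are then filtered out as stopwords.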

In [3]:
stop_words = set(stopwords.words('english'))
porter_stemmer = PorterStemmer()

def preprocessing(text, stemming_on=False, stop_words=stop_words, porter_stemmer=porter_stemmer):
    text = text.lower()                                          # Lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)                  # Noise removal (replace punctuation and symbols with spaces)
    text = text.strip()                                          # Noise removal (leading and trailing spaces)
    words = text.split()                                         # Split the text into a list of words
    words = [w for w in words if w not in stop_words]            # Stopword removal
    if stemming_on:
        words = [porter_stemmer.stem(word) for word in words]    # Replace each word by its stem
    text = " ".join(words)                                       # Join the words back into a single string
    return text

In [4]:
df['posts_preproc_no_stemming'] = df['posts'].apply(preprocessing)

In [5]:
df['posts_preproc_full'] = df['posts'].apply(preprocessing, stemming_on=True)

In [6]:
df.head()

Unnamed: 0,type,posts,posts_preproc_no_stemming,posts_preproc_full
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,http www youtube com watch v qsxhcwe3krw http ...,http www youtub com watch v qsxhcwe3krw http 4...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...,find lack post alarm sex bore posit often exam...
2,INTP,'Good one _____ https://www.youtube.com/wat...,good one https www youtube com watch v fhigbol...,good one http www youtub com watch v fhigbolff...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...,dear intp enjoy convers day esoter gab natur u...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired another silly misconception approaching ...,fire anoth silli misconcept approach logic go ...


Below is an example of one user's posts after preprocessing, first without stemming and then with it (each truncated to the first 1,000 characters).

In [7]:
df.loc[1, 'posts_preproc_no_stemming'][:1000]

'finding lack posts alarming sex boring position often example girlfriend currently environment creatively use cowgirl missionary enough giving new meaning game theory hello entp grin takes converse flirting acknowledge presence return words smooth wordplay cheeky grins lack balance hand eye coordination real iq test score 127 internet iq tests funny score 140s higher like former responses thread mention believe iq test banish know entp vanish site year half return find people still commenting posts liking ideas thoughts know entp http img188 imageshack us img188 6422 6020d1f9da6944a6b71bbe6 jpg http img adultdvdtalk com 813a0c6243814cab84c51 think things sometimes go old sherlock holmes quote perhaps man special knowledge special powers like rather encourages seek complex cheshirewolf tumblr com 400 000 post really never thought e j p real functions judge use use ne ti dominates fe emotions rarely si also use ni due strength know though ingenious saying really want try see happens pla

In [8]:
df.loc[1, 'posts_preproc_full'][:1000]

'find lack post alarm sex bore posit often exampl girlfriend current environ creativ use cowgirl missionari enough give new mean game theori hello entp grin take convers flirt acknowledg presenc return word smooth wordplay cheeki grin lack balanc hand eye coordin real iq test score 127 internet iq test funni score 140 higher like former respons thread mention believ iq test banish know entp vanish site year half return find peopl still comment post like idea thought know entp http img188 imageshack us img188 6422 6020d1f9da6944a6b71bbe6 jpg http img adultdvdtalk com 813a0c6243814cab84c51 think thing sometim go old sherlock holm quot perhap man special knowledg special power like rather encourag seek complex cheshirewolf tumblr com 400 000 post realli never thought e j p real function judg use use ne ti domin fe emot rare si also use ni due strength know though ingeni say realli want tri see happen play first person shooter back drive around want see look rock paper one best make lol gu
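
The two outputs above differ only by stemming. To make the effect easier to see, the short sketch below applies NLTK's `PorterStemmer` to a few words taken from the first output; the word list is an illustrative selection, not part of the notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words picked from the unstemmed example above.
words = ["finding", "posts", "alarming", "boring", "position", "example"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word:>10} -> {stem}")
```

Stems such as `posit` or `exampl` are not dictionary words. That is expected: the stemmer only needs to map related word forms to the same token.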

## Vectorization

We benchmark two vectorizers:
- CountVectorizer (raw term counts)
- TfidfVectorizer (term counts reweighted by inverse document frequency)
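
To show how the two vectorizers differ, here is a minimal sketch on a hypothetical three-document corpus (`toy_corpus` is made up for illustration). With default settings both build the same vocabulary, so the matrices have the same shape; only the stored values change.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, already preprocessed.
toy_corpus = [
    "find lack post alarm post",
    "good one http www youtub com",
    "dear intp enjoy convers day post",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(toy_corpus)  # integer term counts
tfidf = tfidf_vec.fit_transform(toy_corpus)   # float TF-IDF weights

print(counts.shape, counts.dtype)
print(tfidf.shape, tfidf.dtype)

# By default TfidfVectorizer L2-normalizes each document vector.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

Because of that normalization, long and short documents end up on a comparable scale, which may matter for a linear classifier.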

In [11]:
# CountVectorizer
corpus = df['posts_preproc_full']

In [12]:
c_vectorizer = CountVectorizer()

In [14]:
c_X = c_vectorizer.fit_transform(corpus)

In [22]:
c_X

<8675x118698 sparse matrix of type '<class 'numpy.int64'>'
	with 3418264 stored elements in Compressed Sparse Row format>

In [37]:
y = df['type']
labels = y.unique()

In [38]:
labels

array(['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP',
       'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ'],
      dtype=object)

In [20]:
# TfidfVectorizer
ti_vectorizer = TfidfVectorizer()
ti_X = ti_vectorizer.fit_transform(corpus)

In [21]:
ti_X

<8675x118698 sparse matrix of type '<class 'numpy.float64'>'
	with 3418264 stored elements in Compressed Sparse Row format>

In [26]:
c_X_train, c_X_test, c_y_train, c_y_test = train_test_split(c_X, y, test_size=0.33, random_state=42)
ti_X_train, ti_X_test, ti_y_train, ti_y_test = train_test_split(ti_X, y, test_size=0.33, random_state=42)
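
Both vectorizers above are fitted on the full corpus before the split, so the vocabulary and IDF weights also reflect the test documents. One possible alternative, sketched here on hypothetical toy data rather than taken from the notebook, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The `stratify` argument is an optional extra that keeps class proportions similar in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the preprocessed posts and types.
toy_texts = [
    "love quiet book read home", "read book alone quiet night",
    "quiet home alone think read", "think book night alone love",
    "home read think quiet book", "night alone love read home",
    "party friend loud dance crowd", "dance crowd friend fun party",
    "loud party fun crowd meet", "meet friend dance loud fun",
    "crowd fun meet party friend", "fun dance loud meet crowd",
]
toy_labels = ["I"] * 6 + ["E"] * 6

# Split the raw text, not the vectorized matrix.
X_train, X_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training texts only.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(len(X_train), len(X_test))
print(list(y_pred))
```

With this layout, words that appear only in test documents are simply ignored at prediction time, which is closer to how the model would behave on unseen posts.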

## Prediction

In [28]:
c_clf = LinearSVC(random_state=0)

In [29]:
c_clf.fit(c_X_train, c_y_train)



LinearSVC(random_state=0)

In [31]:
c_y_pred = c_clf.predict(c_X_test)

In [40]:
# classification_report orders its rows by sorted label, so target_names must be sorted too
print(classification_report(c_y_test, c_y_pred, target_names=sorted(labels)))

              precision    recall  f1-score   support

        ENFJ       0.45      0.32      0.38        62
        ENFP       0.53      0.54      0.53       212
        ENTJ       0.55      0.45      0.50        73
        ENTP       0.56      0.53      0.54       220
        ESFJ       0.67      0.20      0.31        10
        ESFP       0.00      0.00      0.00        12
        ESTJ       0.75      0.15      0.25        20
        ESTP       0.64      0.33      0.44        27
        INFJ       0.54      0.56      0.55       475
        INFP       0.63      0.68      0.65       619
        INTJ       0.51      0.54      0.53       339
        INTP       0.61      0.66      0.63       451
        ISFJ       0.62      0.46      0.53        65
        ISFP       0.46      0.42      0.44        93
        ISTJ       0.46      0.29      0.35        77
        ISTP       0.58      0.56      0.57       108

    accuracy                           0.57      2863
   macro avg       0.53   
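
When reading such a report, it helps to know that `classification_report` orders its rows by sorted label and applies `target_names` positionally. The sketch below uses hypothetical toy labels to show how names given in order of first appearance end up attached to the wrong rows.

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels: ENTP occurs 3 times, INTP twice, INFJ once.
y_true = ["INFJ", "ENTP", "INTP", "ENTP", "ENTP", "INTP"]
y_pred = ["INFJ", "ENTP", "INTP", "ENTP", "INTP", "INTP"]

# Order of first appearance, similar to what Series.unique() returns.
appearance_order = list(dict.fromkeys(y_true))
# Order in which classification_report lays out its rows.
sorted_order = sorted(appearance_order)

correct = classification_report(
    y_true, y_pred, target_names=sorted_order, output_dict=True
)
mislabeled = classification_report(
    y_true, y_pred, target_names=appearance_order, output_dict=True
)

print("sorted names:     ENTP support =", correct["ENTP"]["support"])
print("appearance names: ENTP support =", mislabeled["ENTP"]["support"])
```

A quick sanity check is to compare the `support` column with the known class frequencies: the most frequent class should carry the largest support.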

In [41]:
ti_clf = LinearSVC(random_state=0)
ti_clf.fit(ti_X_train, ti_y_train)
ti_y_pred = ti_clf.predict(ti_X_test)
# classification_report orders its rows by sorted label, so target_names must be sorted too
print(classification_report(ti_y_test, ti_y_pred, target_names=sorted(labels)))

              precision    recall  f1-score   support

        ENFJ       0.59      0.32      0.42        62
        ENFP       0.65      0.59      0.62       212
        ENTJ       0.74      0.47      0.57        73
        ENTP       0.66      0.60      0.63       220
        ESFJ       0.67      0.20      0.31        10
        ESFP       0.00      0.00      0.00        12
        ESTJ       1.00      0.10      0.18        20
        ESTP       0.82      0.33      0.47        27
        INFJ       0.65      0.68      0.67       475
        INFP       0.66      0.83      0.74       619
        INTJ       0.61      0.66      0.63       339
        INTP       0.71      0.76      0.74       451
        ISFJ       0.85      0.51      0.63        65
        ISFP       0.64      0.46      0.54        93
        ISTJ       0.68      0.36      0.47        77
        ISTP       0.73      0.65      0.69       108

    accuracy                           0.67      2863
   macro avg       0.67   
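
On this single split, TF-IDF reaches 0.67 accuracy against 0.57 for raw counts. A single split can be noisy, especially for the rare classes, so one possible way to make the comparison more robust is cross-validation. The sketch below is an assumption about how that could look, run on hypothetical toy data; the `vectorizers` dictionary, the macro-F1 metric and `cv=3` are illustrative choices, not taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the preprocessed posts and types.
toy_texts = [
    "love quiet book read home", "read book alone quiet night",
    "quiet home alone think read", "think book night alone love",
    "home read think quiet book", "night alone love read home",
    "party friend loud dance crowd", "dance crowd friend fun party",
    "loud party fun crowd meet", "meet friend dance loud fun",
    "crowd fun meet party friend", "fun dance loud meet crowd",
]
toy_labels = ["I"] * 6 + ["E"] * 6

vectorizers = {"count": CountVectorizer(), "tfidf": TfidfVectorizer()}

results = {}
for name, vectorizer in vectorizers.items():
    # Each fold refits the vectorizer on its own training part only.
    pipeline = make_pipeline(vectorizer, LinearSVC(random_state=0))
    scores = cross_val_score(
        pipeline, toy_texts, toy_labels, cv=3, scoring="f1_macro"
    )
    results[name] = scores.mean()

for name, score in results.items():
    print(f"{name}: mean macro-F1 = {score:.2f}")
```

Macro-F1 weights every class equally, so it is more sensitive than accuracy to the small classes that the classifier struggles with.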

