#Personality Type Prediction

#About Dataset

##Context

The Myers Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across 4 axes:

Introversion (I) – Extroversion (E) <br>
Intuition (N) – Sensing (S) <br>
Thinking (T) – Feeling (F) <br>
Judging (J) – Perceiving (P) <br>

For example, someone who prefers introversion, intuition, thinking and perceiving would be labelled an INTP in the MBTI system, and many personality-based components model or describe this person's preferences or behaviour based on that label.
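
The four axes map one-to-one onto the four letters of a type code, which is why there are 2^4 = 16 types. Here is a minimal sketch of that mapping; the helper name `mbti_to_axes` and the `AXES` layout are illustrative and not part of the dataset.

```python
# One dictionary per axis, in the order the letters appear in a type code.
AXES = (
    {"I": "Introversion", "E": "Extroversion"},
    {"N": "Intuition", "S": "Sensing"},
    {"T": "Thinking", "F": "Feeling"},
    {"J": "Judging", "P": "Perceiving"},
)

def mbti_to_axes(code):
    """Return the preferred pole on each axis for a 4-letter MBTI code."""
    code = code.upper()
    if len(code) != len(AXES):
        raise ValueError("an MBTI code has exactly 4 letters")
    poles = []
    for letter, axis in zip(code, AXES):
        if letter not in axis:
            raise ValueError(f"unexpected letter {letter!r} in {code!r}")
        poles.append(axis[letter])
    return poles

print(mbti_to_axes("INTP"))  # ['Introversion', 'Intuition', 'Thinking', 'Perceiving']
print(2 ** len(AXES))        # 16 possible types
```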

##Content

This dataset contains over 8600 rows of data. Each row holds a person's:

*   Type (this person's 4-letter MBTI code/type)
*   A section of each of the last 50 things they have posted (entries separated by "|||", i.e. 3 pipe characters)
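
To make the "|||" layout concrete, here is a small sketch that splits one `posts` field into individual posts. The helper name `split_posts` is hypothetical, and stripping a wrapping pair of single quotes is an assumption based on how the rows appear further down in this notebook.

```python
def split_posts(raw):
    """Split one row's `posts` field into a list of individual posts."""
    # Assumption: many rows are wrapped in a stray pair of single quotes.
    raw = raw.strip().strip("'")
    # Drop empty fragments that appear when separators are adjacent.
    return [post.strip() for post in raw.split("|||") if post.strip()]

example = "'You're fired.|||That's another silly misconception.|||Welcome and stuff.'"
posts = split_posts(example)
print(len(posts))  # 3
print(posts[0])    # You're fired.
```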

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/drive/My Drive/Kaggle"
# /content/drive/My Drive/Kaggle is the path where kaggle.json is present in the Google Drive

In [None]:
#changing the working directory
%cd /content/drive/My Drive/Kaggle

/content/drive/My Drive/Kaggle


In [None]:
!kaggle datasets download datasnaek/mbti-type

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.8/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.8/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /content/drive/My Drive/Kaggle. Or use the environment method.


In [None]:
!unzip mbti-type.zip

Archive:  mbti-type.zip
  inflating: mbti_1.csv              


In [None]:
import numpy as np
import pandas as pd 

#Data analysis

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Kaggle/mbti_1.csv')

In [None]:
df.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [None]:
df.iloc[1,1]

"'I'm finding the lack of me in these posts very alarming.|||Sex can be boring if it's in the same position often. For example me and my girlfriend are currently in an environment where we have to creatively use cowgirl and missionary. There isn't enough...|||Giving new meaning to 'Game' theory.|||Hello *ENTP Grin*  That's all it takes. Than we converse and they do most of the flirting while I acknowledge their presence and return their words with smooth wordplay and more cheeky grins.|||This + Lack of Balance and Hand Eye Coordination.|||Real IQ test I score 127. Internet IQ tests are funny. I score 140s or higher.  Now, like the former responses of this thread I will mention that I don't believe in the IQ test. Before you banish...|||You know you're an ENTP when you vanish from a site for a year and a half, return, and find people are still commenting on your posts and liking your ideas/thoughts. You know you're an ENTP when you...|||http://img188.imageshack.us/img188/6422/6020d1f9da

In [None]:
len(df.iloc[1,1])

7053

#Data preprocessing

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = set(STOP_WORDS)  # a set keeps membership tests fast

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Preprocessing, round 1:


*   Removal of Stopwords
*   Removal of Punctuation
*   Removal of URLs
*   Lower casing



In [None]:
def preprocessing(text):
    # Join the "|||"-separated posts into one string
    text1 = ' '.join(text.split('|||'))
    doc = nlp(text1)
    # Keep lower-cased tokens that are not whitespace, punctuation, stopwords or URLs
    words = [token.text.lower() for token in doc
             if token.pos_ not in ["SPACE", "PUNCT"]
             and token.text.lower() not in stopwords
             and not token.like_url]
    return ' '.join(words)

In [None]:
testing = preprocessing("'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg|||enfp and intj moments  https://www.youtube.com/watch?v=iz7lE1g4XM4  sportscenter not top ten plays  https://www.youtube.com/watch?v=uCdfze1etec  pranks|||What has been the most life-changing experience in your life?|||http://www.youtube.com/watch?v=vXZeYwwRDw8   http://www.youtube.com/watch?v=u8ejam5DP3E  On repeat for most of today.|||May the PerC Experience immerse you.|||The last thing my INFJ friend posted on his facebook before committing suicide the next day. Rest in peace~   http://vimeo.com/22842206|||Hello ENFJ7. Sorry to hear of your distress. It's only natural for a relationship to not be perfection all the time in every moment of existence. Try to figure the hard times as times of growth, as...|||84389  84390  http://wallpaperpassion.com/upload/23700/friendship-boy-and-girl-wallpaper.jpg  http://assets.dornob.com/wp-content/uploads/2010/04/round-home-design.jpg ...|||Welcome and stuff.|||http://playeressence.com/wp-content/uploads/2013/08/RED-red-the-pokemon-master-32560474-450-338.jpg  Game. Set. Match.|||Prozac, wellbrutin, at least thirty minutes of moving your legs (and I don't mean moving them while sitting in your same desk chair), weed in moderation (maybe try edibles as a healthier alternative...|||Basically come up with three items you've determined that each type (or whichever types you want to do) would more than likely use, given each types' cognitive functions and whatnot, when left by...|||All things in moderation.  Sims is indeed a video game, and a good one at that. Note: a good one at that is somewhat subjective in that I am not completely promoting the death of any given Sim...|||Dear ENFP:  What were your favorite video games growing up and what are your now, current favorite video games? :cool:|||https://www.youtube.com/watch?v=QyPqT8umzmY|||It appears to be too late. 
:sad:|||There's someone out there for everyone.|||Wait... I thought confidence was a good thing.|||I just cherish the time of solitude b/c i revel within my inner world more whereas most other time i'd be workin... just enjoy the me time while you can. Don't worry, people will always be around to...|||Yo entp ladies... if you're into a complimentary personality,well, hey.|||... when your main social outlet is xbox live conversations and even then you verbally fatigue quickly.|||http://www.youtube.com/watch?v=gDhy7rdfm14  I really dig the part from 1:46 to 2:50|||http://www.youtube.com/watch?v=msqXffgh7b8|||Banned because this thread requires it of me.|||Get high in backyard, roast and eat marshmellows in backyard while conversing over something intellectual, followed by massages and kisses.|||http://www.youtube.com/watch?v=Mw7eoU3BMbE|||http://www.youtube.com/watch?v=4V2uYORhQOk|||http://www.youtube.com/watch?v=SlVmgFQQ0TI|||Banned for too many b's in that sentence. How could you! Think of the B!|||Banned for watching movies in the corner with the dunces.|||Banned because Health class clearly taught you nothing about peer pressure.|||Banned for a whole host of reasons!|||http://www.youtube.com/watch?v=IRcrv41hgz4|||1) Two baby deer on left and right munching on a beetle in the middle.  2) Using their own blood, two cavemen diary today's latest happenings on their designated cave diary wall.  3) I see it as...|||a pokemon world  an infj society  everyone becomes an optimist|||49142|||http://www.youtube.com/watch?v=ZRCEq_JFeFM|||http://discovermagazine.com/2012/jul-aug/20-things-you-didnt-know-about-deserts/desert.jpg|||http://oyster.ignimgs.com/mediawiki/apis.ign.com/pokemon-silver-version/d/dd/Ditto.gif|||http://www.serebii.net/potw-dp/Scizor.jpg|||Not all artists are artists because they draw. It's the idea that counts in forming something of your own... 
like a signature.|||Welcome to the robot ranks, person who downed my self-esteem cuz I'm not an avid signature artist like herself. :proud:|||Banned for taking all the room under my bed. Ya gotta learn to share with the roaches.|||http://www.youtube.com/watch?v=w8IgImn57aQ|||Banned for being too much of a thundering, grumbling kind of storm... yep.|||Ahh... old high school music I haven't heard in ages.   http://www.youtube.com/watch?v=dcCRUPCdB1w|||I failed a public speaking class a few years ago and I've sort of learned what I could do better were I to be in that position again. A big part of my failure was just overloading myself with too...|||I like this person's mentality. He's a confirmed INTJ by the way. http://www.youtube.com/watch?v=hGKLI-GEc6M|||Move to the Denver area and start a new life for myself.'")

In [None]:
testing

"enfp intj moments sportscenter plays pranks life changing experience life repeat today perc experience immerse thing infj friend posted facebook committing suicide day rest peace~ hello enfj7 sorry hear distress natural relationship perfection time moment existence try figure hard times times growth 84389 84390 welcome stuff game set match prozac wellbrutin thirty minutes moving legs mean moving sitting desk chair weed moderation maybe try edibles healthier alternative basically come items determined type whichever types want likely use given types ' cognitive functions whatnot left things moderation sims video game good note good somewhat subjective completely promoting death given sim dear enfp favorite video games growing current favorite video games cool appears late sad wait thought confidence good thing cherish time solitude b / c revel inner world time workin enjoy time worry people yo entp ladies complimentary personality hey main social outlet xbox live conversations verbally

In [None]:
df['processed_posts'] = df['posts'].apply(preprocessing)

Next, I intend to apply the following preprocessing steps:

*   Removal of Frequent words
*   Removal of Rare words
*   Conversion of emoticons to words
*   Conversion of emojis to words
*   Chat words conversion



---

For this step I have taken help from this [Kaggle notebook](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing), since it contains the standard preprocessing steps.




In [None]:
#since preprocessing such a large dataset requires a lot of time and resources,
#I saved the preprocessed files so that I do not have to go through the preprocessing step again and again

df.to_csv('/content/drive/MyDrive/Kaggle/cleaned_data.csv')

### 1. Removal of Frequent words

In [None]:
from collections import Counter
cnt = Counter()
for text in df["processed_posts"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(25)

[('like', 69629),
 ('think', 49816),
 ('people', 47763),
 ('know', 36836),
 ('time', 27552),
 ('feel', 23326),
 ('love', 20993),
 ('/', 20775),
 ('good', 20723),
 ('things', 20451),
 ('way', 19617),
 ('want', 19373),
 ('type', 17023),
 ('lot', 16422),
 ('life', 15354),
 ('find', 14144),
 ('thing', 14130),
 ('infp', 13356),
 ('actually', 13222),
 ('person', 12770),
 ('going', 12711),
 ('right', 12690),
 ('sure', 12636),
 ('pretty', 12347),
 ('yes', 12323)]

In [None]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(20)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["processed_posts"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,type,posts,processed_posts,text_wo_stopfreq
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks li...,enfp intj moments sportscenter plays pranks ch...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...,finding lack posts alarming sex boring positio...
2,INTP,'Good one _____ https://www.youtube.com/wat...,good course know blessing curse absolutely pos...,course blessing curse absolutely positive best...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...,dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...,fired silly misconception approaching logicall...


### 2. Removal of Rare words

In [None]:
df.drop(["processed_posts"], axis=1, inplace=True)

n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_stopfreqrare"] = df["text_wo_stopfreq"].apply(lambda text: remove_rarewords(text))
df.head()

Unnamed: 0,type,posts,text_wo_stopfreq,text_wo_stopfreqrare
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks ch...,enfp intj moments sportscenter plays pranks ch...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...,finding lack posts alarming sex boring positio...
2,INTP,'Good one _____ https://www.youtube.com/wat...,course blessing curse absolutely positive best...,course blessing curse absolutely positive best...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...,dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...,fired silly misconception approaching logicall...
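
The slice `cnt.most_common()[:-n_rare_words-1:-1]` used above is easy to misread, so here is a toy example of what it selects. The tiny corpus and the count-threshold alternative are illustrative assumptions, not part of the original pipeline.

```python
from collections import Counter

cnt_demo = Counter("a a a a b b b c c d e".split())
n_rare_words = 2

# most_common() sorts by descending count; the slice [:-n-1:-1] walks
# backwards from the end of that list and stops after n items.
rare_by_slice = [w for w, _ in cnt_demo.most_common()[:-n_rare_words - 1:-1]]
print(sorted(rare_by_slice))  # ['d', 'e']

# Alternative (an assumption, not what the notebook does): drop every word
# whose count is at or below a threshold instead of a fixed number of words.
rare_by_threshold = {w for w, c in cnt_demo.items() if c <= 1}
print(sorted(rare_by_threshold))  # ['d', 'e']
```

With a vocabulary as large as this dataset's, removing only the last 10 words touches a tiny fraction of the words that occur once, so a threshold may be closer to the intent of this step.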


### 3. Conversion of emoticons to words

In [None]:
# Thanks : https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py

EMOTICONS = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, angry or pouting",
    u":-\(":"Frown, sad, angry or pouting",
    u":\(":"Frown, sad, angry or pouting",
    u":‑c":"Frown, sad, angry or pouting",
    u":c":"Frown, sad, angry or pouting",
    u":‑<":"Frown, sad, angry or pouting",
    u":<":"Frown, sad, angry or pouting",
    u":‑\[":"Frown, sad, angry or pouting",
    u":\[":"Frown, sad, angry or pouting",
    u":-\|\|":"Frown, sad, angry or pouting",
    u">:\[":"Frown, sad, angry or pouting",
    u":\{":"Frown, sad, angry or pouting",
    u":@":"Frown, sad, angry or pouting",
    u">:\(":"Frown, sad, angry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###..":"Being sick",
    u":###..":"Being sick",
    u"<:‑\|":"Dump",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad or Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=	":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(\・\・?":"Confusion",
    u"\(?_?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surprised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}

In [None]:
import re
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text
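
To show what the conversion produces, here is a small self-contained sketch using a three-entry subset of the table. The names `EMOTICONS_DEMO`, `COMPILED` and `convert_emoticons_demo` are hypothetical. Writing the patterns as raw strings and compiling them once are my own choices, not something the original function does.

```python
import re

# Tiny illustrative subset of the emoticon table, written as raw strings
# so that the regex escapes are not treated as Python string escapes.
EMOTICONS_DEMO = {
    r":\)": "Happy face or smiley",
    r":\(": "Frown, sad, angry or pouting",
    r";\)": "Wink or smirk",
}

# Compile each pattern once and precompute its underscore-joined replacement.
COMPILED = [
    (re.compile(pattern), "_".join(description.replace(",", "").split()))
    for pattern, description in EMOTICONS_DEMO.items()
]

def convert_emoticons_demo(text):
    for pattern, token in COMPILED:
        text = pattern.sub(token, text)
    return text

print(convert_emoticons_demo("see you soon :) or not :("))
# see you soon Happy_face_or_smiley or not Frown_sad_angry_or_pouting
```

Note that the replacement tokens keep their capital letters, so they will not be lower-cased unless this step runs before the lower-casing step.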

In [None]:
df["text_convert_emoticons"] = df["text_wo_stopfreqrare"].apply(lambda text: convert_emoticons(text))
df.head()

Unnamed: 0,type,posts,text_wo_stopfreq,text_wo_stopfreqrare,text_convert_emoticons
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks ch...,enfp intj moments sportscenter plays pranks ch...,enfp intj moments sportscenter plays pranks ch...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...,finding lack posts alarming sex boring positio...,finding lack posts alarming sex boring positio...
2,INTP,'Good one _____ https://www.youtube.com/wat...,course blessing curse absolutely positive best...,course blessing curse absolutely positive best...,course blessing curse absolutely positive best...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...,dear intp enjoyed conversation day esoteric ga...,dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...,fired silly misconception approaching logicall...,fired silly misconception approaching logicall...


### 4. Conversion of emojis to words

In [None]:
#skipped this step for now
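
This step was skipped, but for reference one possible approach needs only the standard library. The sketch below is an assumption about how it could be done, not what this notebook ran: it replaces symbol characters with their Unicode names. The helper name `convert_emojis` is hypothetical, and the "So" category check is a rough heuristic that does not handle multi-codepoint emoji.

```python
import unicodedata

def convert_emojis(text):
    """Replace emoji-like symbols with their lower-cased Unicode names."""
    pieces = []
    for ch in text:
        # Heuristic: category "So" (Symbol, other) covers most
        # single-codepoint emoji, but also other symbols.
        is_symbol = unicodedata.category(ch) == "So"
        name = unicodedata.name(ch, "") if is_symbol else ""
        pieces.append(f" {name.lower().replace(' ', '_')} " if name else ch)
    # Collapse the padding spaces added around converted symbols.
    return " ".join("".join(pieces).split())

print(convert_emojis("good morning \U0001F600"))  # good morning grinning_face
```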

### 5. Chat words conversion

In [None]:
chat_words_str = """
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
"""

In [None]:
chat_words_map_dict = {}
chat_words_list = []
for line in chat_words_str.split("\n"):
    if line != "":
        cw = line.split("=")[0]
        cw_expanded = line.split("=")[1]
        chat_words_list.append(cw)
        chat_words_map_dict[cw] = cw_expanded
chat_words_list = set(chat_words_list)

def chat_words_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_list:
            new_text.append(chat_words_map_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)
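
A quick usage sketch of the same idea on a three-entry table, to show what the conversion does to lower-cased text. The names `chat_words_demo`, `chat_map` and `expand_chat_words` are illustrative. Using `partition("=")` is my own variation so that an expansion containing "=" would stay intact.

```python
chat_words_demo = """
BRB=Be Right Back
IMO=In My Opinion
BTW=By The Way
"""

# partition("=") splits on the first "=" only.
chat_map = {}
for line in chat_words_demo.strip().splitlines():
    abbreviation, _, expansion = line.partition("=")
    chat_map[abbreviation] = expansion

def expand_chat_words(text):
    # Look each word up in upper case, falling back to the word itself.
    return " ".join(chat_map.get(word.upper(), word) for word in text.split())

print(expand_chat_words("imo this thread is great btw"))
# In My Opinion this thread is great By The Way
```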

In [None]:
df["text_chat_words_converted"] = df["text_convert_emoticons"].apply(lambda text: chat_words_conversion(text))
df.head()

Unnamed: 0,type,posts,text_wo_stopfreq,text_wo_stopfreqrare,text_convert_emoticons,text_chat_words_converted
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks ch...,enfp intj moments sportscenter plays pranks ch...,enfp intj moments sportscenter plays pranks ch...,enfp intj moments sportscenter plays pranks ch...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...,finding lack posts alarming sex boring positio...,finding lack posts alarming sex boring positio...,finding lack posts alarming sex boring positio...
2,INTP,'Good one _____ https://www.youtube.com/wat...,course blessing curse absolutely positive best...,course blessing curse absolutely positive best...,course blessing curse absolutely positive best...,course blessing curse absolutely positive best...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...,dear intp enjoyed conversation day esoteric ga...,dear intp enjoyed conversation day esoteric ga...,dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...,fired silly misconception approaching logicall...,fired silly misconception approaching logicall...,fired silly misconception approaching logicall...


In [None]:
df.drop(["text_wo_stopfreq", "text_wo_stopfreqrare", "text_convert_emoticons"], axis=1, inplace=True)

In [None]:
df.head()

Unnamed: 0,type,posts,text_chat_words_converted
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks ch...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...
2,INTP,'Good one _____ https://www.youtube.com/wat...,course blessing curse absolutely positive best...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...


In [None]:
df.rename(columns = {'text_chat_words_converted':'final_preprocessed'}, inplace = True)

In [None]:
df.head(10)

Unnamed: 0,type,posts,final_preprocessed
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks ch...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...
2,INTP,'Good one _____ https://www.youtube.com/wat...,course blessing curse absolutely positive best...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...
5,INTJ,'18/37 @.@|||Science is not perfect. No scien...,' 18/37 @.@ science perfect scientist claims s...
6,INFJ,"'No, I can't draw on my own nails (haha). Thos...",draw nails haha professionals nails yes gel me...
7,INTJ,'I tend to build up a collection of things on ...,tend build collection desktop use frequently f...
8,INFJ,"I'm not sure, that's a good question. The dist...",sure question distinction dependant perception...
9,INTP,'https://www.youtube.com/watch?v=w8-egj0y8Qs||...,position let reasons unfortunately having trou...


In [None]:
#saving preprocessed file to drive
df.to_csv('/content/drive/MyDrive/Kaggle/final_preprocessed_data.csv')

In [None]:
#loading preprocessed file from drive
import numpy as np
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Kaggle/final_preprocessed_data.csv')
df = df.iloc[:,1:] #the file was saved together with its index, so this drops that extra first column

In [None]:
df.head()

Unnamed: 0,type,posts,final_preprocessed
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks ch...
1,ENTP,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...
2,INTP,'Good one _____ https://www.youtube.com/wat...,course blessing curse absolutely positive best...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...
4,ENTJ,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...


#End of Preprocessing

---






In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df.iloc[:,0])

LabelEncoder()

In [None]:
le.classes_

array(['ENFJ', 'ENFP', 'ENTJ', 'ENTP', 'ESFJ', 'ESFP', 'ESTJ', 'ESTP',
       'INFJ', 'INFP', 'INTJ', 'INTP', 'ISFJ', 'ISFP', 'ISTJ', 'ISTP'],
      dtype=object)
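
Since the integer codes are used everywhere from here on, it helps to know how to map them back. A small sketch, assuming the same 16 class names as in the output above (`mbti_types` and `le_demo` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

mbti_types = ['ENFJ', 'ENFP', 'ENTJ', 'ENTP', 'ESFJ', 'ESFP', 'ESTJ', 'ESTP',
              'INFJ', 'INFP', 'INTJ', 'INTP', 'ISFJ', 'ISFP', 'ISTJ', 'ISTP']

le_demo = LabelEncoder().fit(mbti_types)

# Codes are positions in the alphabetically sorted classes_ array.
print(le_demo.transform(['INFJ', 'ENTP']).tolist())        # [8, 3]
# inverse_transform turns codes back into type labels.
print(le_demo.inverse_transform([9, 8, 11, 10]).tolist())  # ['INFP', 'INFJ', 'INTP', 'INTJ']
```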

In [None]:
df.iloc[:,0] = le.transform(df.iloc[:,0])

In [None]:
df.iloc[:,0]

0        8
1        3
2       11
3       10
4        2
        ..
8670    13
8671     1
8672    11
8673     9
8674     9
Name: type, Length: 8675, dtype: int64

In [None]:
df['type'].value_counts()

9     1832
8     1470
11    1304
10    1091
3      685
1      675
15     337
13     271
2      231
14     205
0      190
12     166
7       89
5       48
4       42
6       39
Name: type, dtype: int64

##This is a hugely imbalanced dataset
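
To put a number on the imbalance, the sketch below reuses the counts printed by `value_counts()` above. The labels are decoded with the `le.classes_` order, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output, with codes decoded.
counts = {
    "INFP": 1832, "INFJ": 1470, "INTP": 1304, "INTJ": 1091,
    "ENTP": 685, "ENFP": 675, "ISTP": 337, "ISFP": 271,
    "ENTJ": 231, "ISTJ": 205, "ENFJ": 190, "ISFJ": 166,
    "ESTP": 89, "ESFP": 48, "ESFJ": 42, "ESTJ": 39,
}

total = sum(counts.values())
imbalance_ratio = max(counts.values()) / min(counts.values())
top4_share = sum(sorted(counts.values(), reverse=True)[:4]) / total

print(total)                      # 8675
print(round(imbalance_ratio, 1))  # 47.0
print(round(100 * top4_share, 1)) # 65.7
```

The largest class (INFP) has about 47 times as many rows as the smallest (ESTJ), and the four largest classes cover roughly two thirds of the data.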

###Ways to deal with imbalanced text dataset

1. Gathering more meaningful and diverse data is always better than sampling the original data or generating artificial data from existing data points.
2. Removing data redundancy, such as duplicate data or data with similar semantic meaning.
3. Merging minority classes.
4. Undersampling the majority class, though this risks information loss, which might lead to poor model training.
5. Randomly oversampling the minority class instances, though this approach can overfit and lead to inaccurate predictions on test data.
6. Finally, Synthetic Minority Oversampling (SMOTE) can be used; this approach effectively forces the decision region of the minority class to become more general. But SMOTE seems problematic here: it relies on k-nearest neighbours (KNN), while feature spaces for NLP problems are extremely high-dimensional, and KNN degrades easily in such spaces. **Nonetheless, it may become useful when text is represented as (neural) embeddings.**
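
As an illustration of the random oversampling option in the list above, here is a minimal standard-library sketch. The function name `random_oversample` and the toy data are hypothetical. In practice it should be applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import random
from collections import Counter, defaultdict

def random_oversample(texts, labels, seed=42):
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels

texts = ["a", "b", "c", "d", "e"]
labels = ["INFP", "INFP", "INFP", "ESTJ", "INFP"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```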





###But First

I will apply all the other text vectorization techniques first so that we can compare their performance.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df['final_preprocessed'],df['type'],stratify= df['type'], test_size=0.2,random_state=42)

In [None]:
print("x_train shape: ",x_train.shape)
print("x_test shape: ",x_test.shape)
print("y_train shape: ",y_train.shape)
print("y_test shape: ",y_test.shape)

x_train shape:  (6940,)
x_test shape:  (1735,)
y_train shape:  (6940,)
y_test shape:  (1735,)


In [None]:
from sklearn.metrics import classification_report
import imblearn
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
cv = CountVectorizer(max_features=5000)
x_train_cv = cv.fit_transform(x_train,y_train)

In [None]:
x_test_cv= cv.transform(x_test)

In [None]:
x_train_cv.shape

(6940, 5000)

In [None]:
model = RandomForestClassifier()
model.fit(x_train_cv,y_train)

RandomForestClassifier()

In [None]:
y_pred = model.predict(x_test_cv)

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        38
           1       0.73      0.33      0.46       135
           2       0.00      0.00      0.00        46
           3       0.64      0.35      0.45       137
           4       0.00      0.00      0.00         9
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00         8
           7       0.00      0.00      0.00        18
           8       0.55      0.69      0.61       294
           9       0.40      0.80      0.53       366
          10       0.63      0.52      0.57       218
          11       0.63      0.69      0.66       261
          12       1.00      0.03      0.06        33
          13       0.83      0.09      0.17        54
          14       1.00      0.02      0.05        41
          15       0.94      0.22      0.36        67

    accuracy                           0.52      1735
   macro avg       0.46   



In [None]:
smote = SMOTE()

In [None]:
x_train_cv_smote,y_train_cv_smote = smote.fit_resample(x_train_cv,y_train)

In [None]:
model = RandomForestClassifier()
model.fit(x_train_cv_smote,y_train_cv_smote)

RandomForestClassifier()

In [None]:
x_train_cv_smote.shape

(23456, 5000)

In [None]:
y_train_cv_smote.value_counts()

9     1466
15    1466
0     1466
1     1466
8     1466
5     1466
10    1466
13    1466
14    1466
3     1466
2     1466
11    1466
12    1466
7     1466
4     1466
6     1466
Name: type, dtype: int64

In [None]:
y_pred = model.predict(x_test_cv)

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.28      0.18      0.22        38
           1       0.50      0.42      0.46       135
           2       0.55      0.35      0.43        46
           3       0.44      0.41      0.43       137
           4       0.00      0.00      0.00         9
           5       0.00      0.00      0.00        10
           6       1.00      0.12      0.22         8
           7       0.50      0.22      0.31        18
           8       0.49      0.63      0.55       294
           9       0.43      0.64      0.51       366
          10       0.55      0.44      0.49       218
          11       0.57      0.54      0.56       261
          12       0.53      0.30      0.38        33
          13       0.32      0.13      0.18        54
          14       0.50      0.12      0.20        41
          15       0.66      0.34      0.45        67

    accuracy                           0.48      1735
   macro avg       0.46   





---



---







In [None]:
cv = CountVectorizer(max_features=5000, ngram_range=(1,2))
x_train_cv = cv.fit_transform(x_train,y_train)

In [None]:
x_test_cv= cv.transform(x_test)

In [None]:
x_train_cv.shape

(6940, 5000)

In [None]:
model = RandomForestClassifier()
model.fit(x_train_cv,y_train)

RandomForestClassifier()

In [None]:
y_pred = model.predict(x_test_cv)

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        38
           1       0.73      0.47      0.57       135
           2       1.00      0.02      0.04        46
           3       0.63      0.36      0.46       137
           4       0.00      0.00      0.00         9
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00         8
           7       0.00      0.00      0.00        18
           8       0.56      0.70      0.62       294
           9       0.40      0.79      0.53       366
          10       0.64      0.52      0.57       218
          11       0.64      0.68      0.66       261
          12       0.00      0.00      0.00        33
          13       0.75      0.06      0.10        54
          14       0.00      0.00      0.00        41
          15       0.92      0.18      0.30        67

    accuracy                           0.53      1735
   macro avg       0.39   



In [None]:
smote = SMOTE()

In [None]:
x_train_cv_smote,y_train_cv_smote = smote.fit_resample(x_train_cv,y_train)

In [None]:
model = RandomForestClassifier()
model.fit(x_train_cv_smote,y_train_cv_smote)

RandomForestClassifier()

In [None]:
x_train_cv_smote.shape

(23456, 5000)

In [None]:
y_train_cv_smote.value_counts()

9     1466
15    1466
0     1466
1     1466
8     1466
5     1466
10    1466
13    1466
14    1466
3     1466
2     1466
11    1466
12    1466
7     1466
4     1466
6     1466
Name: type, dtype: int64

In [None]:
y_pred = model.predict(x_test_cv)

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.38      0.29      0.33        38
           1       0.48      0.42      0.45       135
           2       0.73      0.35      0.47        46
           3       0.45      0.40      0.43       137
           4       0.50      0.11      0.18         9
           5       0.00      0.00      0.00        10
           6       1.00      0.12      0.22         8
           7       0.43      0.17      0.24        18
           8       0.52      0.65      0.58       294
           9       0.45      0.65      0.53       366
          10       0.59      0.50      0.54       218
          11       0.60      0.62      0.61       261
          12       0.53      0.30      0.38        33
          13       0.40      0.19      0.25        54
          14       0.56      0.12      0.20        41
          15       0.72      0.39      0.50        67

    accuracy                           0.51      1735
   macro avg       0.52   



---



---



In [None]:
cv1 = CountVectorizer()
x_train_cv = cv1.fit_transform(x_train,y_train)

In [None]:
x_test_cv= cv1.transform(x_test)

In [None]:
x_train_cv.shape

(6940, 96128)

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
svd = TruncatedSVD(n_components = 500)
x_train_svd = svd.fit_transform(x_train_cv,y_train)

In [None]:
x_test_svd = svd.transform(x_test_cv)

In [None]:
x_train_svd.shape

(6940, 500)
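The dimensionality reduction step above can be sketched with plain NumPy: keep the top `k` right singular vectors of the training matrix and project both training and unseen rows onto them. This is only an illustration on a small random matrix; scikit-learn's `TruncatedSVD` works on sparse input and by default uses a randomized solver, so its output can differ in sign and numerical detail.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 10))          # hypothetical 6 documents x 10 terms
k = 3                            # number of components to keep

# Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                   # top-k right singular vectors, shape (10, 3)

X_reduced = X @ V_k              # plays the role of svd.fit_transform(X)
X_new = rng.random((2, 10))      # unseen rows reuse the same projection
X_new_reduced = X_new @ V_k      # plays the role of svd.transform(X_new)

print(X_reduced.shape, X_new_reduced.shape)
```

The projection is learned from the training matrix only, which is why the test set above is passed through `transform` rather than `fit_transform`.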

In [None]:
model = RandomForestClassifier()
model.fit(x_train_svd,y_train)

RandomForestClassifier()

In [None]:
y_pred = model.predict(x_test_svd)

In [None]:
from sklearn.metrics import classification_report , confusion_matrix

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      0.03      0.05        38
           1       0.65      0.35      0.45       135
           2       1.00      0.02      0.04        46
           3       0.66      0.39      0.49       137
           4       0.00      0.00      0.00         9
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00         8
           7       0.00      0.00      0.00        18
           8       0.58      0.67      0.62       294
           9       0.35      0.74      0.48       366
          10       0.60      0.51      0.55       218
          11       0.66      0.70      0.68       261
          12       0.00      0.00      0.00        33
          13       1.00      0.02      0.04        54
          14       0.00      0.00      0.00        41
          15       0.60      0.13      0.22        67

    accuracy                           0.50      1735
   macro avg       0.44   



In [None]:
smote = SMOTE()

In [None]:
x_train_svd_smote,y_train_svd_smote = smote.fit_resample(x_train_svd,y_train)

In [None]:
model = RandomForestClassifier()
model.fit(x_train_svd_smote,y_train_svd_smote)

RandomForestClassifier()

In [None]:
x_train_svd_smote.shape

(23456, 500)

In [None]:
y_train_svd_smote.value_counts()

9     1466
15    1466
0     1466
1     1466
8     1466
5     1466
10    1466
13    1466
14    1466
3     1466
2     1466
11    1466
12    1466
7     1466
4     1466
6     1466
Name: type, dtype: int64

In [None]:
y_pred2 = model.predict(x_test_svd)

In [None]:
print(classification_report(y_test,y_pred2))

              precision    recall  f1-score   support

           0       0.36      0.32      0.34        38
           1       0.38      0.49      0.42       135
           2       0.40      0.48      0.44        46
           3       0.43      0.53      0.48       137
           4       0.50      0.33      0.40         9
           5       0.00      0.00      0.00        10
           6       0.75      0.38      0.50         8
           7       0.60      0.33      0.43        18
           8       0.54      0.61      0.58       294
           9       0.39      0.32      0.35       366
          10       0.52      0.52      0.52       218
          11       0.55      0.56      0.55       261
          12       0.53      0.48      0.51        33
          13       0.50      0.35      0.41        54
          14       0.52      0.39      0.44        41
          15       0.56      0.51      0.53        67

    accuracy                           0.48      1735
   macro avg       0.47   



---



---



In [None]:
tf= TfidfVectorizer(max_features=5000)
x_train_tf = tf.fit_transform(x_train,y_train)

In [None]:
x_test_tf = tf.transform(x_test)

In [None]:
x_train_tf.shape

(6940, 5000)

In [None]:
smote = SMOTE()

In [None]:
x_train_smote , y_train_smote = smote.fit_resample(x_train_tf,y_train)

In [None]:
x_train_smote.shape

(23456, 5000)

In [None]:
y_train_smote.value_counts()

9     1466
15    1466
0     1466
1     1466
8     1466
5     1466
10    1466
13    1466
14    1466
3     1466
2     1466
11    1466
12    1466
7     1466
4     1466
6     1466
Name: type, dtype: int64

In [None]:
model = RandomForestClassifier()
model.fit(x_train_smote,y_train_smote)

RandomForestClassifier()

In [None]:
y_pred3 = model.predict(x_test_tf)

In [None]:
print(classification_report(y_test,y_pred3))

              precision    recall  f1-score   support

           0       0.59      0.42      0.49        38
           1       0.62      0.59      0.61       135
           2       0.71      0.33      0.45        46
           3       0.57      0.57      0.57       137
           4       1.00      0.11      0.20         9
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00         8
           7       0.33      0.11      0.17        18
           8       0.66      0.67      0.66       294
           9       0.47      0.70      0.56       366
          10       0.61      0.55      0.58       218
          11       0.64      0.68      0.66       261
          12       0.83      0.30      0.44        33
          13       0.59      0.41      0.48        54
          14       0.82      0.34      0.48        41
          15       0.81      0.45      0.58        67

    accuracy                           0.59      1735
   macro avg       0.58   



In [None]:
tf2= TfidfVectorizer(max_features=5000, ngram_range=(1,2))
x_train_tf2 = tf2.fit_transform(x_train,y_train)

In [None]:
x_test_tf2 = tf2.transform(x_test)

In [None]:
x_train_tf2.shape

(6940, 5000)

In [None]:
smote = SMOTE()

In [None]:
x_train_smote2 , y_train_smote2 = smote.fit_resample(x_train_tf2,y_train)

In [None]:
x_train_smote2.shape

(23456, 5000)

In [None]:
y_train_smote2.value_counts()

9     1466
15    1466
0     1466
1     1466
8     1466
5     1466
10    1466
13    1466
14    1466
3     1466
2     1466
11    1466
12    1466
7     1466
4     1466
6     1466
Name: type, dtype: int64

In [None]:
model = RandomForestClassifier()
model.fit(x_train_smote2,y_train_smote2)
y_pred5 = model.predict(x_test_tf2)

In [None]:
print(classification_report(y_test,y_pred5))

              precision    recall  f1-score   support

           0       0.65      0.29      0.40        38
           1       0.65      0.56      0.60       135
           2       0.70      0.35      0.46        46
           3       0.55      0.50      0.53       137
           4       0.00      0.00      0.00         9
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00         8
           7       0.40      0.11      0.17        18
           8       0.63      0.67      0.65       294
           9       0.48      0.69      0.57       366
          10       0.56      0.50      0.52       218
          11       0.58      0.69      0.63       261
          12       0.92      0.36      0.52        33
          13       0.62      0.37      0.47        54
          14       0.58      0.27      0.37        41
          15       0.77      0.49      0.60        67

    accuracy                           0.57      1735
   macro avg       0.51   



I could have run these comparisons with GridSearchCV, but I wanted the results of every step in front of me so that I could analyze each vectorization technique and compare them all with Word2Vec.
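For reference, a GridSearchCV version of the comparison might look roughly like the sketch below. It runs on a tiny made-up corpus and leaves out SMOTE; to include a resampling step, my understanding is that imblearn's own `Pipeline` (in `imblearn.pipeline`) would be needed so that resampling is applied only while fitting. The step names `vec` and `clf` are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only
texts = ["love quiet books", "quiet reading night", "books and tea alone",
         "party loud friends", "friends dancing party", "loud music crowd"] * 2
labels = [0, 0, 0, 1, 1, 1] * 2

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=42)),
])

# Swap the whole vectorizer step and vary its n-gram range
param_grid = {
    "vec": [CountVectorizer(), TfidfVectorizer()],
    "vec__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # number of vectorizer settings tried
```

Scoring with `f1_macro` is a choice made for this sketch; it weights all classes equally, which matters on a dataset as imbalanced as this one.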

#Now implementing WORD2VEC word vectorization technique

In [None]:
!python -m spacy download en_core_web_lg

2023-02-03 10:28:42.589994: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
     587.7/587.7 MB 1.9 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [None]:
import spacy
nlp2 = spacy.load('en_core_web_lg')
#importing the large model because it has 514k unique pretrained word vectors of 300 dimensions



In [None]:
df.head()

Unnamed: 0,type,posts,final_preprocessed
0,8,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,enfp intj moments sportscenter plays pranks ch...
1,3,'I'm finding the lack of me in these posts ver...,finding lack posts alarming sex boring positio...
2,11,'Good one _____ https://www.youtube.com/wat...,course blessing curse absolutely positive best...
3,10,"'Dear INTP, I enjoyed our conversation the o...",dear intp enjoyed conversation day esoteric ga...
4,2,'You're fired.|||That's another silly misconce...,fired silly misconception approaching logicall...


In [None]:
df.drop(['posts'],axis=1,inplace=True)
df.head()

Unnamed: 0,type,final_preprocessed
0,8,enfp intj moments sportscenter plays pranks ch...
1,3,finding lack posts alarming sex boring positio...
2,11,course blessing curse absolutely positive best...
3,10,dear intp enjoyed conversation day esoteric ga...
4,2,fired silly misconception approaching logicall...


In [None]:
# Convert each document to a dense vector.
# spaCy averages the word embeddings of the individual tokens to produce a single document embedding.
df['vector'] = df['final_preprocessed'].apply(lambda text: nlp2(text).vector)
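The averaging step can be sketched without spaCy. The toy example below uses made-up 3-dimensional vectors (the real model uses 300 dimensions) and assumes that tokens without a known vector contribute a zero vector to the average; `document_vector` is an illustrative helper, not a spaCy function.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
word_vectors = {
    "dear": np.array([0.2, 0.1, -0.3]),
    "intp": np.array([0.0, 0.4, 0.1]),
    "enjoyed": np.array([0.4, -0.2, 0.5]),
}

def document_vector(text, vectors, dim=3):
    """Average the vectors of the whitespace-separated tokens in `text`.
    Unknown tokens are treated as zero vectors (an assumption of this sketch)."""
    rows = [vectors.get(token, np.zeros(dim)) for token in text.split()]
    if not rows:
        return np.zeros(dim)
    return np.mean(rows, axis=0)

doc = document_vector("dear intp enjoyed", word_vectors)
print(doc)
```

Note that word order plays no role in the result: any permutation of the same tokens yields the same document vector.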

In [None]:
df.drop(['final_preprocessed'],axis=1,inplace=True)
df.head()

Unnamed: 0,type,vector
0,8,"[-0.023376254, 0.39159748, -1.1219157, -0.2751..."
1,3,"[0.018455202, 0.60934913, -1.0090749, -0.29238..."
2,11,"[-0.026650576, -0.026257815, -1.3059742, -0.70..."
3,10,"[0.08814709, 0.010489976, -1.0402668, 0.242274..."
4,2,"[-0.38535833, 0.35889146, -1.4942231, 0.105725..."


In [None]:
#saving vectorized file to drive
df.to_csv('/content/drive/MyDrive/Kaggle/vectorized_data.csv')

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['vector'].values,df['type'],test_size = 0.2)

In [None]:
# X_train is a 1-D array whose elements are arrays; it must be converted to a 2-D array before being fed to the classifier
X_train.shape

(6940,)

In [None]:
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [None]:
X_train_2d.shape

(6940, 300)
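A small illustration of what `np.stack` does here, using a hand-built object array in place of `df['vector'].values` (the numbers are arbitrary):

```python
import numpy as np

# A 1-D object array whose elements are themselves arrays
column = np.empty(3, dtype=object)
column[0] = np.array([0.1, 0.2])
column[1] = np.array([0.3, 0.4])
column[2] = np.array([0.5, 0.6])
print(column.shape)   # one entry per document

# Stack the per-document vectors into a single 2-D float matrix
matrix = np.stack(column)
print(matrix.shape)   # documents x dimensions
```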

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() # MultinomialNB requires non-negative feature values
X_train_2d_scaled = scaler.fit_transform(X_train_2d)
X_test_2d_scaled = scaler.transform(X_test_2d)

mnb = MultinomialNB()
mnb.fit(X_train_2d_scaled,y_train)

MultinomialNB()
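The scaling step can be written out by hand to see what happens to the test set. In this sketch with made-up numbers, the minimum and maximum come from the training data only, so a test value outside the training range lands outside [0, 1] and can even become negative. Clipping is one possible remedy (as far as I know, `MinMaxScaler` also accepts `clip=True` in recent scikit-learn versions).

```python
import numpy as np

train = np.array([[-1.0, 2.0], [0.0, 4.0], [1.0, 6.0]])
test = np.array([[0.5, 3.0], [2.0, 1.0]])   # second row lies outside the training range

# Column-wise minimum and maximum learned from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)
test_clipped = np.clip(test_scaled, 0.0, 1.0)

print(train_scaled)
print(test_scaled)
print(test_clipped)
```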

In [None]:
y_pred = mnb.predict(X_test_2d_scaled)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        36
           1       0.00      0.00      0.00       129
           2       0.00      0.00      0.00        43
           3       0.00      0.00      0.00       128
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00        19
           8       0.00      0.00      0.00       307
           9       0.23      0.94      0.36       374
          10       0.00      0.00      0.00       223
          11       0.22      0.15      0.17       254
          12       0.00      0.00      0.00        24
          13       0.00      0.00      0.00        50
          14       0.00      0.00      0.00        49
          15       0.00      0.00      0.00        68

    accuracy                           0.22      1735
   macro avg       0.03   



In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_2d,y_train)

RandomForestClassifier()

In [None]:
y_pred2 = rf.predict(X_test_2d)

In [None]:
print(classification_report(y_test,y_pred2))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        36
           1       0.21      0.09      0.12       129
           2       0.00      0.00      0.00        43
           3       0.28      0.09      0.14       128
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00        19
           8       0.28      0.34      0.31       307
           9       0.31      0.62      0.41       374
          10       0.33      0.26      0.29       223
          11       0.27      0.36      0.31       254
          12       0.00      0.00      0.00        24
          13       0.00      0.00      0.00        50
          14       0.00      0.00      0.00        49
          15       1.00      0.03      0.06        68

    accuracy                           0.29      1735
   macro avg       0.17   



In [None]:
import imblearn
from imblearn.over_sampling import SMOTE

In [None]:
smote = SMOTE()

In [None]:
X_train_2d_smote,y_train_smote = smote.fit_resample(X_train_2d,y_train)

In [None]:
X_train_2d.shape

(6940, 300)

In [None]:
X_train_2d_smote.shape

(23328, 300)

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train_2d_smote,y_train_smote)

RandomForestClassifier()

In [None]:
y_pred3 = rf.predict(X_test_2d)

In [None]:
print(classification_report(y_test,y_pred3))

              precision    recall  f1-score   support

           0       0.03      0.03      0.03        36
           1       0.16      0.24      0.19       129
           2       0.17      0.26      0.20        43
           3       0.23      0.26      0.24       128
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00        10
           7       0.06      0.05      0.06        19
           8       0.29      0.27      0.28       307
           9       0.35      0.34      0.35       374
          10       0.26      0.26      0.26       223
          11       0.24      0.20      0.22       254
          12       0.04      0.04      0.04        24
          13       0.16      0.16      0.16        50
          14       0.14      0.10      0.12        49
          15       0.32      0.31      0.31        68

    accuracy                           0.25      1735
   macro avg       0.15   

In [None]:
from  sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')

clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.03      0.06      0.04        36
           1       0.13      0.27      0.17       129
           2       0.06      0.07      0.06        43
           3       0.13      0.18      0.15       128
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00        19
           8       0.23      0.29      0.26       307
           9       0.35      0.34      0.34       374
          10       0.26      0.19      0.22       223
          11       0.26      0.19      0.22       254
          12       0.00      0.00      0.00        24
          13       0.00      0.00      0.00        50
          14       0.00      0.00      0.00        49
          15       0.33      0.07      0.12        68

    accuracy                           0.22      1735
   macro avg       0.11   



In [None]:
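clf.fit(X_train_2d_smote,y_train_smote)
# Predict with the KNN classifier that was just fitted (clf).
# The report recorded below came from rf.predict, so it repeats the RandomForest + SMOTE scores.
y_pred = clf.predict(X_test_2d)
print(classification_report(y_test, y_pred))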

              precision    recall  f1-score   support

           0       0.03      0.03      0.03        36
           1       0.16      0.24      0.19       129
           2       0.17      0.26      0.20        43
           3       0.23      0.26      0.24       128
           4       0.00      0.00      0.00        11
           5       0.00      0.00      0.00        10
           6       0.00      0.00      0.00        10
           7       0.06      0.05      0.06        19
           8       0.29      0.27      0.28       307
           9       0.35      0.34      0.35       374
          10       0.26      0.26      0.26       223
          11       0.24      0.20      0.22       254
          12       0.04      0.04      0.04        24
          13       0.16      0.16      0.16        50
          14       0.14      0.10      0.12        49
          15       0.32      0.31      0.31        68

    accuracy                           0.25      1735
   macro avg       0.15   

In [None]:
df['type'].value_counts()

9     1832
8     1470
11    1304
10    1091
3      685
1      675
15     337
13     271
2      231
14     205
0      190
12     166
7       89
5       48
4       42
6       39
Name: type, dtype: int64

###Since minority classes are causing a lot of trouble, I will try dropping them to see if my solution improves.

In [None]:
df2 = df.copy()
df2.head()

Unnamed: 0,type,vector
0,8,"[-0.023376254, 0.39159748, -1.1219157, -0.2751..."
1,3,"[0.018455202, 0.60934913, -1.0090749, -0.29238..."
2,11,"[-0.026650576, -0.026257815, -1.3059742, -0.70..."
3,10,"[0.08814709, 0.010489976, -1.0402668, 0.242274..."
4,2,"[-0.38535833, 0.35889146, -1.4942231, 0.105725..."


In [None]:
# Keep only the four most frequent classes (codes 9, 8, 11 and 10)
df_filtered = df2[df2['type'].isin([8, 9, 10, 11])]
 
# Print the new dataframe
print(df_filtered.head(15))
 
# Print the shape of the dataframe
print(df_filtered.shape)

    type                                             vector
0      8  [-0.023376254, 0.39159748, -1.1219157, -0.2751...
2     11  [-0.026650576, -0.026257815, -1.3059742, -0.70...
3     10  [0.08814709, 0.010489976, -1.0402668, 0.242274...
5     10  [-0.23208946, 0.4428398, -1.2337557, -0.083506...
6      8  [0.17454427, 0.24166353, -1.4150017, -0.393744...
7     10  [-0.18420663, 0.26189047, -1.7130889, -0.39182...
8      8  [0.35770774, 0.41009647, -1.8065919, -0.485075...
9     11  [0.14343719, 0.60642457, -1.3456203, -0.454786...
10     8  [-0.15060696, 0.4052226, -1.8420142, -0.303988...
12     8  [0.06525651, 0.20087461, -1.2773793, -0.503192...
13    10  [-0.17367014, 0.30048132, -1.6083243, -0.43745...
14    11  [-0.18833974, 0.19256878, -1.5374863, -0.10051...
15    11  [0.0713019, 0.32101712, -1.1437676, -0.2917595...
16     8  [-0.12939939, 0.5052611, -1.4106419, -0.157813...
17     9  [-0.09916992, 0.25703233, -1.565704, -0.683815...
(5697, 2)
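Instead of hard-coding the class codes, the most frequent classes could also be selected from the counts. A minimal sketch on a made-up label column; `df_demo` and `top_k` are illustrative names:

```python
import pandas as pd

# Hypothetical encoded labels, for illustration only
df_demo = pd.DataFrame({"type": [9, 9, 9, 8, 8, 11, 11, 10, 10, 3, 6]})

top_k = 4
# Class codes of the top_k most frequent labels
top_classes = df_demo["type"].value_counts().nlargest(top_k).index
df_top = df_demo[df_demo["type"].isin(top_classes)]

print(sorted(top_classes))
print(df_top.shape)
```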


In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_filtered['vector'].values,df_filtered['type'],test_size = 0.2)

In [None]:
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [None]:
X_train.shape

(4557,)

In [None]:
scaler = MinMaxScaler() # MultinomialNB requires non-negative feature values
X_train_2d_scaled = scaler.fit_transform(X_train_2d)
X_test_2d_scaled = scaler.transform(X_test_2d)

mnb = MultinomialNB()
mnb.fit(X_train_2d_scaled,y_train)

MultinomialNB()

In [None]:
y_pred = mnb.predict(X_test_2d_scaled)

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           8       0.00      0.00      0.00       309
           9       0.33      0.94      0.48       342
          10       0.40      0.01      0.02       234
          11       0.35      0.19      0.25       255

    accuracy                           0.33      1140
   macro avg       0.27      0.29      0.19      1140
weighted avg       0.26      0.33      0.20      1140



In [None]:
rf = RandomForestClassifier()
rf.fit(X_train_2d,y_train)
y_pred = rf.predict(X_test_2d)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           8       0.37      0.28      0.32       309
           9       0.42      0.70      0.53       342
          10       0.39      0.20      0.26       234
          11       0.43      0.36      0.39       255

    accuracy                           0.41      1140
   macro avg       0.40      0.38      0.37      1140
weighted avg       0.40      0.41      0.39      1140



In [None]:
smote = SMOTE()
X_train_2d_smote,y_train_smote = smote.fit_resample(X_train_2d,y_train)

In [None]:
X_train_2d.shape

(4557, 300)

In [None]:
X_train_2d_smote.shape

(5960, 300)

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train_2d_smote,y_train_smote)
y_pred3 = rf.predict(X_test_2d)
print(classification_report(y_test,y_pred3))

              precision    recall  f1-score   support

           8       0.40      0.34      0.37       309
           9       0.46      0.58      0.51       342
          10       0.36      0.32      0.34       234
          11       0.41      0.40      0.40       255

    accuracy                           0.42      1140
   macro avg       0.41      0.41      0.41      1140
weighted avg       0.42      0.42      0.42      1140



In [None]:
clf = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')

clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           8       0.30      0.36      0.33       309
           9       0.40      0.50      0.45       342
          10       0.31      0.25      0.27       234
          11       0.37      0.22      0.27       255

    accuracy                           0.35      1140
   macro avg       0.34      0.33      0.33      1140
weighted avg       0.35      0.35      0.34      1140



In [None]:
clf = KNeighborsClassifier(n_neighbors = 5, metric = 'euclidean')

clf.fit(X_train_2d_smote, y_train_smote)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           8       0.34      0.44      0.39       309
           9       0.46      0.19      0.27       342
          10       0.27      0.45      0.33       234
          11       0.31      0.25      0.28       255

    accuracy                           0.33      1140
   macro avg       0.35      0.33      0.32      1140
weighted avg       0.36      0.33      0.32      1140



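My solution does improve, but only to an extent: restricting the problem to the four largest classes raises accuracy from 0.22–0.29 to 0.33–0.42. That is still worse than the 0.48–0.59 obtained with basic vectorization methods such as tf-idf and n_grams, even though those were evaluated on the harder 16-class problem.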

---



---



#Conclusion

After comparing the results of all the vectorization schemes I used, i.e., Bag of Words, Tf-idf, n_grams, and finally Word2Vec, my conclusion is that, contrary to my expectations, the sparse vector representations perform better than the dense vector representation.

Keep in mind that only vanilla Machine Learning algorithms were used, without tuning them through GridSearchCV, so that the word vectorization methods could be compared in their raw form.

So, why did the sparse vector representations perform better?

* Word embeddings are constructed from the co-occurrence of words within a fixed window, roughly following the principle “A word is known by the company it keeps”. Words that appear in similar contexts therefore end up with similar vectors. Opposite words often appear in the same contexts, so the vector for the word “good” is actually close to the vector for the word “bad”. This can be disastrous for applications like mine, where I am trying to classify personality differences and such words may point to different classes.
* The pretrained vectors may be unable to capture the semantic meaning in my documents, which are informal posts on social media, whereas pretrained embeddings are typically trained on more formal, general-purpose text (the original word2vec vectors, for example, were trained on Google News articles). A little fine-tuning of the word embeddings on my documents could prove beneficial.
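The first point can be illustrated with cosine similarity. The vectors below are invented for the example and do not come from any trained model; they only show the mechanics: if “good” and “bad” have similar vectors, two documents that differ only in that word end up with almost identical averaged vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional word vectors, for illustration only
vectors = {
    "movie": np.array([0.9, 0.1, 0.0]),
    "was":   np.array([0.1, 0.9, 0.1]),
    "good":  np.array([0.2, 0.3, 0.9]),
    "bad":   np.array([0.3, 0.2, 0.9]),
}

def doc_vector(text):
    """Average the word vectors of a whitespace-tokenized document."""
    return np.mean([vectors[token] for token in text.split()], axis=0)

sim_words = cosine_similarity(vectors["good"], vectors["bad"])
sim_docs = cosine_similarity(doc_vector("movie was good"),
                             doc_vector("movie was bad"))

print(round(sim_words, 3), round(sim_docs, 3))
```

In this toy setup the two documents are even more similar than the two words, because the shared words dominate the average. A sparse representation, by contrast, keeps “good” and “bad” in separate columns.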









