# Personality Profiles

Exploratory analysis and feature creation for forum posts labelled with MBTI personality types.

---
#### Table of Contents

[EDA](#EDA)

[Feature Creation](#feature_creation)

[NLP](#nlp)

---

In [4]:
import pandas as pd

In [5]:
data = pd.read_csv("data/train.csv")

<a id="EDA"> </a>

---
## EDA

In [6]:
data.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [7]:
data['posts'].loc[42]

"Painting the world with the colors of my soul. Interpret it as you like - helping others by volunteering, teaching, giving everything you have.  Personal growth - learning as much about the universe...|||6 months ago I met this ENTP guy at my university. We are in the same group, so we basically have to be together everyday, and most likely have to study together for the next 6 years. So cutting off...|||1984 all the way|||I don't like people in groups because it's harder to enslave them.|||ESxJ or ISxJ, because I've noticed that ISxJs don't have much trouble with small talk either.  I'd say ISTJ.|||secret :)|||I really like the movie, and Alice is my alter ego in internet, since people find my real name a bit weird.|||Sense of smell easily, because my sense of smell has always been kind of weak. I think it's better that way, because I don't feel and suffer from disgusting smells people seem to feel everyday. I...|||I don't think there's a certain type who would tend to do this. I thi

In [8]:
data.describe()

Unnamed: 0,type,posts
count,6506,6506
unique,16,6506
top,INFP,I think this was my original point. When peopl...
freq,1386,1


In [9]:
data.groupby('type').count()

Unnamed: 0_level_0,posts
type,Unnamed: 1_level_1
ENFJ,143
ENFP,496
ENTJ,167
ENTP,530
ESFJ,35
ESFP,36
ESTJ,30
ESTP,71
INFJ,1100
INFP,1386


In [10]:
# the types are heavily imbalanced (e.g. INFP: 1386 rows, ESTJ: 30 rows)
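
To put a number on the imbalance, here is a minimal sketch that uses only the counts visible in the `groupby` output above (the output is cut off after INFP, so the remaining types are not included). The helper name `imbalance_ratio` is hypothetical and not part of the notebook.

```python
from collections import Counter

# Counts copied from the groupby output above (only the types shown there).
type_counts = Counter({
    "ENFJ": 143, "ENFP": 496, "ENTJ": 167, "ENTP": 530, "ESFJ": 35,
    "ESFP": 36, "ESTJ": 30, "ESTP": 71, "INFJ": 1100, "INFP": 1386,
})

def imbalance_ratio(counts):
    """Return (largest class, smallest class, ratio of their sizes)."""
    largest, n_max = max(counts.items(), key=lambda kv: kv[1])
    smallest, n_min = min(counts.items(), key=lambda kv: kv[1])
    return largest, smallest, n_max / n_min

largest, smallest, ratio = imbalance_ratio(type_counts)
print(f"{largest} is {ratio:.1f}x more frequent than {smallest}")
```

A ratio this large suggests that per-type averages for the rare types will rest on very few rows.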

<a id= "feature_creation"> </a>

---
## Feature Creation


 feature | regex
 ---|---
 `?`   | `r"(\?)"`
 `!`   | `r"(!)"`
 `...` | `r"(\.\.\.)"`
 youtube | possibly part of social media (undecided)
 www | `r"(www)"`
 jpg/jpeg/gif | <code>r"(jpe?g&#124;gif)"</code>
 emoji (`;)`, `:tongue:`, `:smile:`; hashtags?) | `r":[a-z]*:"`
 word count   | `r"(\w+)"` matches / 50 (average per post, 50 posts per sample)
 word_length >= 5   | `r"(\w{5,})"`
 social_media (instagram, snapchat, etc.)  | [social_media]
 ALL_CAPS  | `r"(\b[A-Z]{2,}\b)"`
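
As a rough illustration of how the table could be applied in one pass, the sketch below keeps a few of the patterns in a dictionary and counts matches for each. The names `FEATURE_PATTERNS` and `count_features`, the dictionary keys, and the sample string are all made up for this example; the notebook itself defines one function per feature.

```python
import re

# Hypothetical lookup table mirroring some rows of the table above.
FEATURE_PATTERNS = {
    "questions": re.compile(r"\?"),
    "exclaimed": re.compile(r"!"),
    "ellipses": re.compile(r"\.\.\."),
    "www": re.compile(r"www"),
    "images": re.compile(r"jpe?g|gif"),
    "all_caps": re.compile(r"\b[A-Z]{2,}\b"),
}

def count_features(text):
    """Count non-overlapping matches of every pattern in `text`."""
    return {name: len(pattern.findall(text))
            for name, pattern in FEATURE_PATTERNS.items()}

# Made-up sample; posts within one sample are separated by '|||'.
sample = "WOW... really?! see www.example.com/cat.jpg|||ok..."
print(count_features(sample))
```

Keeping the patterns in one place makes it easier to add or drop a feature without touching the counting logic.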


In [11]:
# trial functions
# data_trial1, data_trial2, data_trial3 


In [12]:
data_trial2 = data['posts'].loc[1]

In [13]:
data_trial3 = data['posts'].loc[369]

In [14]:
data_trial1 = data['posts'].loc[42]

In [15]:
import re

In [16]:
# count question marks
def qm_count(string):
    q_mark = re.compile(r'(\?)')
    return len(re.findall(q_mark, string))
    

In [17]:
qm_count(data_trial3)

16

In [18]:
def exclaim_count(string):
    ex_mark = re.compile(r'(\!)')
    return len(re.findall(ex_mark, string))

In [20]:
exclaim_count(data_trial3)

6

In [21]:
def elipse_count(string):
    elipse = re.compile(r"(\.\.\.)")
    return len(re.findall(elipse, string))

In [22]:
elipse_count(data_trial1)

20

In [23]:
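def emoji_count(string):
    # forum-style codes such as :tongue: or :smile:, plus two-character
    # emoticons such as ;) :p :D (matching is case-insensitive)
    emojis = re.compile(r"(:[a-z]*:)|([:;][()pdo03])",re.I)
    return len(re.findall(emojis,string))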

In [24]:
emoji_count(data_trial3)

4

In [25]:
def word_count(string):
    # total words divided by 50 (posts per sample) = average words per post
    words = re.compile(r"(\w+)")
    count = len(re.findall(words, string))
    return count/50

In [26]:
word_count(data_trial3)

31.0

In [27]:
def word_len(string):
    # words of 5 or more characters, averaged over the 50 posts per sample
    len5 = re.compile(r"\w{5,}")
    return len(re.findall(len5,string))/50

In [28]:
word_len(data_trial3)

9.84

In [29]:
def all_caps(string):
    mbti_type = set(data.type) # set of all mbti types
    capsloc = re.compile(r"\b[A-Z]{2,}\b")
    caps_words = [x for x in re.findall(capsloc,string) if x not in mbti_type]
    return len(caps_words)

In [30]:
all_caps(data_trial1)

6

In [31]:
def count_pix(string):
    pix = re.compile(r"\b(jpe?g|gif|png|img)\b",re.I)
    return len(re.findall(pix, string))

In [32]:
count_pix(data_trial2)

8

In [33]:
# TODO: def soc_media_count(string)
soc_media = ("Twitter LinkedIn Google+ YouTube Pinterest Instagram Tumblr Flickr Reddit "
             "Snapchat WhatsApp Quora Vine Periscope BizSugar StumbleUpon Delicious Digg "
             "Facebook").lower().split()

---
## Testing Feature Creation

In [34]:
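# .loc slicing is label-based and inclusive, so this selects rows 0..1000 (1001 rows);
# .copy() makes an independent frame, so adding columns does not raise SettingWithCopyWarning
test_data = data.loc[:1000].copy()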

In [35]:
test_data.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [36]:
def create_features(df):
    """Add one column per text feature to `df` (modified in place)."""
    df['questions'] = df['posts'].apply(qm_count)
    df['exclaimed'] = df['posts'].apply(exclaim_count)
    df['elipses'] = df['posts'].apply(elipse_count)
    df['emojis'] = df['posts'].apply(emoji_count)
    df['word_count'] = df['posts'].apply(word_count)
    df['big_words'] = df['posts'].apply(word_len)
    df['images'] = df['posts'].apply(count_pix)
    df['words_all_caps'] = df['posts'].apply(all_caps)

In [37]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore") # ignore warning messages cluttering up view
    create_features(test_data)

In [38]:
test_data.head(10)

Unnamed: 0,type,posts,questions,exclaimed,elipses,emojis,word_count,big_words,images,words_all_caps
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,18,3,15,3,16.06,6.3,7,1
1,ENTP,'I'm finding the lack of me in these posts ver...,5,0,19,18,26.82,8.22,8,8
2,INTP,'Good one _____ https://www.youtube.com/wat...,12,4,13,10,18.66,7.34,0,3
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",11,3,26,0,23.52,7.44,0,12
4,ENTJ,'You're fired.|||That's another silly misconce...,10,1,21,3,21.88,7.78,2,13
5,INTJ,'18/37 @.@|||Science is not perfect. No scien...,10,0,39,0,31.7,10.18,0,3
6,INFJ,"'No, I can't draw on my own nails (haha). Thos...",13,3,37,9,29.02,9.96,0,15
7,INTJ,'I tend to build up a collection of things on ...,35,0,28,3,25.62,8.68,0,2
8,INFJ,"I'm not sure, that's a good question. The dist...",22,1,17,5,19.32,7.08,1,6
9,INTP,'https://www.youtube.com/watch?v=w8-egj0y8Qs||...,13,3,24,2,27.36,9.76,0,10


In [39]:
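# numeric_only=True skips the text column 'posts', which newer pandas versions no longer drop silently
test_data.groupby('type').mean(numeric_only=True)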

Unnamed: 0_level_0,questions,exclaimed,elipses,emojis,word_count,big_words,images,words_all_caps
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
ENFJ,9.958333,13.333333,33.708333,7.916667,26.835833,8.654167,0.708333,11.041667
ENFP,13.056338,11.901408,33.042254,6.535211,26.86169,8.481972,1.112676,8.774648
ENTJ,10.482759,8.827586,33.586207,3.827586,27.18069,8.964138,1.172414,8.482759
ENTP,10.972222,6.652778,30.486111,4.305556,25.934444,8.393611,1.208333,8.527778
ESFJ,12.75,9.25,21.25,3.75,26.1,6.86,1.25,6.75
ESFP,10.571429,17.428571,22.0,4.714286,20.222857,6.474286,1.0,10.142857
ESTJ,8.666667,11.333333,21.666667,2.666667,21.633333,6.573333,0.0,8.333333
ESTP,15.181818,8.545455,24.363636,2.636364,24.489091,7.456364,0.545455,12.181818
INFJ,11.571429,9.223602,35.099379,6.236025,27.492671,8.977019,0.720497,7.478261
INFP,10.658228,8.696203,33.14346,5.594937,26.913755,8.715105,0.898734,6.924051


Note: add the social media count after tokenisation, using the `soc_media` list defined above.
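
One way this could look is sketched below. It is not the notebook's implementation: a simple regex stands in for a real tokeniser (an assumption, since the NLP section has not been written yet), and the function name follows the commented-out `soc_media_count` stub. The platform list is the same as `soc_media` above, lower-cased.

```python
import re

# Same platforms as the `soc_media` list above, lower-cased.
SOC_MEDIA = set(
    "twitter linkedin google+ youtube pinterest instagram tumblr flickr "
    "reddit snapchat whatsapp quora vine periscope bizsugar stumbleupon "
    "delicious digg facebook".split()
)

def soc_media_count(string):
    """Count tokens that name a social media platform (case-insensitive)."""
    # Stand-in tokeniser (an assumption): runs of letters, digits and '+',
    # so that 'google+' survives as a single token.
    tokens = re.findall(r"[a-z0-9+]+", string.lower())
    return sum(token in SOC_MEDIA for token in tokens)

# Made-up sample with three '|||'-separated posts.
example = "Saw it on YouTube|||https://www.instagram.com/x|||I quit Facebook, Google+ too"
print(soc_media_count(example))
```

Matching whole tokens avoids counting substrings inside longer words, but ordinary words such as "vine" or "delicious" would still be counted as platforms.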

In [40]:
test_data.describe()

Unnamed: 0,questions,exclaimed,elipses,emojis,word_count,big_words,images,words_all_caps
count,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0,1001.0
mean,11.112887,8.056943,32.435564,4.989011,26.838581,8.774226,1.083916,7.546454
std,8.313498,10.000088,14.204439,6.122,6.121899,1.996737,2.531986,6.84705
min,0.0,0.0,0.0,0.0,0.78,0.2,0.0,0.0
25%,6.0,2.0,23.0,1.0,23.5,7.62,0.0,3.0
50%,10.0,5.0,32.0,3.0,27.76,9.0,0.0,6.0
75%,15.0,11.0,41.0,7.0,31.26,10.2,1.0,10.0
max,121.0,97.0,133.0,37.0,39.04,13.06,28.0,53.0


<a id="nlp">  </a>

---
## NLP

In [None]:
import nltk