# Personality Profiles

Exploratory analysis of posts labelled with MBTI personality types, followed by regex-based feature creation.

#### Table of Contents

[EDA](#EDA)

[Feature Creation](#feature_creation)

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("data/train.csv")

<a id="EDA"> </a>

---
## EDA

In [3]:
data.head()

,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [17]:
data['posts'].loc[42]

"Painting the world with the colors of my soul. Interpret it as you like - helping others by volunteering, teaching, giving everything you have.  Personal growth - learning as much about the universe...|||6 months ago I met this ENTP guy at my university. We are in the same group, so we basically have to be together everyday, and most likely have to study together for the next 6 years. So cutting off...|||1984 all the way|||I don't like people in groups because it's harder to enslave them.|||ESxJ or ISxJ, because I've noticed that ISxJs don't have much trouble with small talk either.  I'd say ISTJ.|||secret :)|||I really like the movie, and Alice is my alter ego in internet, since people find my real name a bit weird.|||Sense of smell easily, because my sense of smell has always been kind of weak. I think it's better that way, because I don't feel and suffer from disgusting smells people seem to feel everyday. I...|||I don't think there's a certain type who would tend to do this. I thi

In [5]:
data.describe()

,type,posts
count,6506,6506
unique,16,6506
top,INFP,'Say some insightful stuff to someone who has ...
freq,1386,1


In [6]:
data.groupby('type').count()

type,posts
ENFJ,143
ENFP,496
ENTJ,167
ENTP,530
ESFJ,35
ESFP,36
ESTJ,30
ESTP,71
INFJ,1100
INFP,1386


In [12]:
# class imbalance: INFP (1386) and INFJ (1100) are far more common than ESFJ (35), ESFP (36) or ESTJ (30)
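
The skew visible in the counts above can be quantified as shares and as a max/min ratio. The block below is only a minimal sketch: `types` is a small hypothetical stand-in for `data['type']`, and the names `type_share` and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for data['type']; the notebook would use the loaded column instead.
types = pd.Series(["INFP"] * 6 + ["INFJ"] * 3 + ["ESTJ"] * 1, name="type")

# Share of each type, largest first.
type_share = types.value_counts(normalize=True)

# Ratio between the most and the least frequent class.
imbalance_ratio = type_share.max() / type_share.min()

print(type_share)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

On the real column the same two lines would show how far apart the largest and smallest classes are, which matters when comparing per-type feature means later on.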

<a id="feature_creation"> </a>

---
## Feature Creation


 feature | regex
 ---|---
 `?`   | `r"(\?)"`
 `!`   | `r"(!)"`
 `...` | `r"(\.\.\.)"`
 youtube | open question: count as part of social media?
 www | `r"(www)"`
 jpg/jpeg/gif | <code>r"(jpe?g&#124;gif)"</code>
 emoji (`;)`, `:tongue:`, `:smile:`; hashtags?) | `r":[a-z]*:"`
 word count   | `r"(\w+)"` matches / 50 (average length per post in the sample)
 word_length >= 5   | `r"(\w{5,})"`
 social_media (instagram, snapchat, etc.)  | built from the `soc_media` list
 ALL_CAPS  | `r"(\b[A-Z]{2,}\b)"`
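
The patterns in the table can also be kept in one place and applied in a single pass. This is a rough sketch, not the notebook's approach: the `FEATURE_PATTERNS` dict, the `count_features` helper and the sample string are all illustrative assumptions.

```python
import re

# Patterns copied from the table above; the dict layout and key names are illustrative.
FEATURE_PATTERNS = {
    "questions": re.compile(r"\?"),
    "exclamations": re.compile(r"!"),
    "ellipses": re.compile(r"\.\.\."),
    "www": re.compile(r"www"),
    "images": re.compile(r"jpe?g|gif"),
    "all_caps": re.compile(r"\b[A-Z]{2,}\b"),
}

def count_features(text):
    """Return the number of matches of each pattern in text."""
    return {name: len(pattern.findall(text)) for name, pattern in FEATURE_PATTERNS.items()}

# Made-up sample string, not taken from the dataset.
sample = "WOW... really? See www.example.com/cat.jpg!"
counts = count_features(sample)
print(counts)
```

Keeping the patterns in a dict makes it easy to add or drop a feature without writing another one-off function, at the cost of losing per-feature tweaks such as the division by 50.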


In [None]:
# trial functions
# data_trial1, data_trial2, data_trial3 


In [94]:
data_trial2 = data['posts'].loc[1]

In [18]:
data_trial3 = data['posts'].loc[369]

In [19]:
data_trial1 = data['posts'].loc[42]

In [20]:
import re

In [21]:
# count question marks
def qm_count(string):
    q_mark = re.compile(r'(\?)')
    return len(re.findall(q_mark, string))
    

In [27]:
qm_count(data_trial3)

16

In [142]:
def exclaim_count(string):
    ex_mark = re.compile(r'(\!)')
    return len(re.findall(ex_mark, string))

In [143]:
exclaim_count(data_trial3)

6

In [40]:
def elipse_count(string):
    elipse = re.compile(r"(\.\.\.)")
    return len(re.findall(elipse, string))

In [41]:
elipse_count(data_trial1)

20

In [111]:
def emoji_count(string):
    emojis = re.compile(r"(:[a-z]*:)|([:;][()pdo03])",re.I)
    return len(re.findall(emojis,string))

In [112]:
emoji_count(data_trial3)

4

In [52]:
def word_count(string):
    words = re.compile(r"(\w+)")
    count = len(re.findall(words, string))
    return count/50

In [55]:
word_count(data_trial3)

31.0

In [60]:
def word_len(string):
    len5 = re.compile(r"\w{5,}")
    return len(re.findall(len5,string))/50

In [63]:
word_len(data_trial3)

9.84

In [86]:
def all_caps(string):
    mbti_type = set(data.type) # set of all mbti types
    capsloc = re.compile(r"\b[A-Z]{2,}\b")
    caps_words = [x for x in re.findall(capsloc,string) if x not in mbti_type]
    return len(caps_words)

In [87]:
all_caps(data_trial1)

6

In [92]:
def count_pix(string):
    pix = re.compile(r"\b(jpe?g|gif|png|img)\b",re.I)
    return len(re.findall(pix, string))

In [96]:
count_pix(data_trial2)

8

In [106]:
#def soc_media_count(string):
soc_media = "Twitter LinkedIn Google+ YouTube Pinterest Instagram Tumblr Flickr Reddit Snapchat WhatsApp Quora Vine Periscope\
                BizSugar StumbleUpon Delicious Digg Facebook".lower().split()
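
The commented-out `soc_media_count` could be completed along the following lines. This is only a sketch under assumptions: the list is a shortened, hypothetical subset of the platforms above, and matching is plain case-insensitive substring matching, so URLs such as `youtube.com` are counted as mentions too.

```python
import re

# Shortened, hypothetical subset of the platform list above.
soc_media = "twitter youtube instagram tumblr reddit snapchat facebook".split()

# re.escape guards names that contain regex metacharacters (e.g. "google+").
soc_pattern = re.compile("|".join(re.escape(name) for name in soc_media), re.I)

def soc_media_count(string):
    """Count case-insensitive mentions of any listed platform."""
    return len(soc_pattern.findall(string))

# Made-up sample string, not taken from the dataset.
mentions = soc_media_count("Saw it on YouTube, then Reddit and youtube.com again")
print(mentions)
```

Word boundaries are deliberately left out because a name like `google+` ends in a non-word character, where `\b` would not match as expected.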

---
## Testing

In [186]:
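# .loc slicing is label-inclusive (rows 0 through 6000); copy() gives an independent frame to add columns to
test_data = data.loc[:6000].copy()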

In [187]:
test_data.head()

,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [188]:
def create_features(df):
    df['questions'] = df['posts'].apply(qm_count)
    df['exclaimed'] = df['posts'].apply(exclaim_count)
    df['elipses'] = df['posts'].apply(elipse_count)
    df['emojis'] = df['posts'].apply(emoji_count)
    df['word_count'] = df['posts'].apply(word_count)
    df['big_words'] = df['posts'].apply(word_len)
    df['images'] = df['posts'].apply(count_pix)    
    df['words_all_caps'] = df['posts'].apply(all_caps) 

In [189]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    create_features(test_data)

Wall time: 0 ns
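
The warning filter is presumably there because assigning columns to a slice of `data` can trigger pandas' `SettingWithCopyWarning`. One alternative is a function that works on an explicit copy and returns it. The sketch below assumes a hypothetical `add_features` helper and a two-row stand-in frame, and wires in only one of the feature functions.

```python
import re
import pandas as pd

def qm_count(string):
    """Count question marks, as in the notebook's qm_count."""
    return len(re.findall(r"\?", string))

def add_features(df):
    """Return a new frame with feature columns; the input frame is left untouched."""
    out = df.copy()
    out["questions"] = out["posts"].apply(qm_count)
    return out

# Hypothetical two-row stand-in for the real data.
sample = pd.DataFrame({"type": ["INFJ", "ENTP"], "posts": ["Why? How?", "No."]})
subset = sample.loc[:0]          # label-inclusive slice, here just row 0
featured = add_features(subset)  # no warning: columns are added to the copy
print(featured)
```

Returning a new frame also keeps `data` free of feature columns, so the function can be rerun after a regex is changed.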


In [190]:
test_data.head(10)

,type,posts,questions,exclaimed,elipses,emojis,word_count,big_words,images,words_all_caps
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...,18,3,15,3,16.06,6.3,7,1
1,ENTP,'I'm finding the lack of me in these posts ver...,5,0,19,18,26.82,8.22,8,8
2,INTP,'Good one _____ https://www.youtube.com/wat...,12,4,13,10,18.66,7.34,0,3
3,INTJ,"'Dear INTP, I enjoyed our conversation the o...",11,3,26,0,23.52,7.44,0,12
4,ENTJ,'You're fired.|||That's another silly misconce...,10,1,21,3,21.88,7.78,2,13
5,INTJ,'18/37 @.@|||Science is not perfect. No scien...,10,0,39,0,31.7,10.18,0,3
6,INFJ,"'No, I can't draw on my own nails (haha). Thos...",13,3,37,9,29.02,9.96,0,15
7,INTJ,'I tend to build up a collection of things on ...,35,0,28,3,25.62,8.68,0,2
8,INFJ,"I'm not sure, that's a good question. The dist...",22,1,17,5,19.32,7.08,1,6
9,INTP,'https://www.youtube.com/watch?v=w8-egj0y8Qs||...,13,3,24,2,27.36,9.76,0,10


In [191]:
# numeric_only=True skips the text column 'posts', which cannot be averaged
test_data.groupby('type').mean(numeric_only=True)

type,questions,exclaimed,elipses,emojis,word_count,big_words,images,words_all_caps
ENFJ,10.213235,13.558824,36.566176,7.330882,27.471765,8.744853,0.904412,9.279412
ENFP,11.262582,16.356674,36.148796,8.463895,27.418906,8.616324,0.888403,10.914661
ENTJ,12.218543,9.198675,32.92053,4.006623,26.852185,8.749272,1.046358,9.086093
ENTP,10.807302,7.115619,31.079108,4.766734,26.304909,8.604544,1.109533,8.866126
ESFJ,9.5,12.433333,32.033333,5.233333,28.384667,8.892,0.9,7.733333
ESFP,12.606061,11.666667,27.30303,6.181818,23.118182,7.229697,0.848485,8.393939
ESTJ,10.344828,7.62069,31.482759,3.965517,27.152414,8.695862,0.482759,7.275862
ESTP,12.602941,8.426471,27.779412,4.25,25.557647,8.066765,1.235294,9.485294
INFJ,10.46325,9.113153,35.893617,5.940039,27.806557,9.036867,0.927466,7.589942
INFP,10.104769,9.310399,33.534793,5.916341,27.213745,8.881392,1.077404,7.200938
