## A shallow dive into word2vec

I've been tinkering with word2vec for a bunch of projects now, and I thought I'd give a quick demo of how it can be used on a tiny amount of Twitter data.

In order to get the data you want, you'll want to know your consumer keys and access token keys. There are a few ways to proceed, but I found it simplest to use the "twitter" package, a very basic API.

In [71]:
import twitter

# set these with your own values

api = twitter.Api(consumer_key=con_key,
consumer_secret=con_secret,
access_token_key=token_key,
access_token_secret=token_secret)

Once we log in, it's as simple as grabbing all the posts in your timeline and putting them in a list.

In [223]:
users = api.GetFriends()
users = [u.name for u in users]

tweets = []
for u in users:
    posts = api.GetUserTimeline(screen_name=user)
    tweets.extend(posts)
tweets = [t.text for t in tweets]

The basic idea of word2vec is that it takes a word and transforms it into a vector. If this is done right, you can use these vectors to see how similar words are (with various distance metrics), cluster words, and even figure out their meanings. 

A famous example of this is that for a well-trained data set, the vector for 'king' minus the vector for 'man', plus the vector for 'woman', should be very close to the vector for 'queen'.

Here, I thought it would be nice to try to figure out the meanings of emoji, as they are all the rage on Twitter.

With that in mind, I used some regular expressions extract all emoji from our Twitter data, using the fact that they all have a particular unicode representation.

In [242]:
import re  
import pandas as pd

try:  
    # UCS-4
    e = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:  
    # UCS-2
    e = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

emoji = []  
for x in tweets:  
    match  = e.search(x)
    if match:
        emoji.append(match.group())

# counting up how much of each emoji we have
df =  pd.Series('emoji': emoji) 
df = df.groupby(df['emoji'])
df.head(10)

🔪    17
😈    17
🐥    17
🙌    17
😃    17
😡    17
🐧    17
🎧    17
📅    17
🌻    17
🖖    17
dtype: int64

So we don't have too many examples, but let's see what happens anyway. We train the word2vec model on the tweets that contain emoji, trying to learn what syntactic relationships exist between the emoji and surrounding words.

Our parameters include window (how many words forward and back to include as meaningful), size (how big the vector representation is), and min_count (how many times a word must appear for it to be deemed relevant).

In [337]:
from gensim.models import word2vec

words = [i.split() for i in tweets]
model = word2vec.Word2Vec(words, size=100, min_count = 5,
            window = 10, sample = 1e-2)

model.init_sims(replace=True)  

Did it work? Well, let's take a look at an example. Here we see that for the calendar emoji, our model identifies "calendar" as the most similar word, followed by much, confusion, July, and last. Not bad!

In [8]:
print u'\U0001f4c5'

model.most_similar(u'\U0001f4c5',topn=5)  

📅


[(u'calendar', 0.993647038936615),
 (u'Much', 0.9935946464538574),
 (u'confusion', 0.9879370927810669),
 (u'July', 0.9805886745452881),
 (u'last', 0.9793851971626282)]

Finally, even though we don't have that much data, let's take a look at some other emoji correlations. Amid some non-sensical connections, we can see that the headphones emoji is paired up with "Happy", the "raising both hands" emoji is paired with "Greatest", and the "knife" emoji is paired with tsunami.

I encourage you to try this with your own Twitter feed and see what happens!

In [336]:
s = list(set(emojis))
for i in s:
    try:
        candidates = model.most_similar(i,topn=3)
        for j in candidates:
            try:
                if model.similarity(i, j[0]) > 0.8:
                    print i,j[0]
            except:
                print "?"
                continue
    except KeyError:
        print "need more data"

🐥 🐘
🐥 🐫
🐥 🐦
🖖 🐞
🖖 🐜
🖖 🗻
need more data
📅 calendar
📅 Much
📅 confusion
🎧 http://t.co/e1FQqu4sb1
🎧 Happy
🎧 🐢
🐧 a
😡 THE
😡 emoji
😡 only
need more data
🙌 day…ever?
🙌 Greatest
🙌 https://t.co/1mrqYuDMcv
need more data
🔪 1
🔪 tsunami;
🔪 tsunamis
