# Demo 6.2 - Word2Vec
In this brief demo, we'll examine how we might develop word embeddings from some custom corpus. Specifically, we'll use data from StockTwits to build our embeddings. 

We'll do this in three parts: 
1. Load & clean the data
2. Train the Word2Vec model
3. Examine language similarity

## Step 1: Load & Clean the Data
The data I've provided is a set of tweets about GameStop (GME). Ideally you'd use a large sample of tweets that covers more than a single stock, but the StockTwits data is massive, so it was easiest to analyze a subset. GME investors also use "colorful" language, which provides a nice opportunity to learn some nuances.

Let's load the data:

In [1]:
import pandas as pd
folder = "/Users/cooperdenning/PycharmProjects/pythonProject5/.venv/D6.2" # update with location to your files
filepath = f"{folder}/GMEstockTwitsV2_sample.csv.gz"
df = pd.read_csv(filepath)
df

Unnamed: 0,twitID,author,author_followers,author_ideas,author_join,bear_bull,text,tickers,timestamp,date
0,278552853,Investeren,54,3790,2012-10-22,{'basic': 'Bullish'},$GME ridiculous ðµð,GME,2021-01-27 17:45:09+00:00,2021-01-27
1,277914988,flappinAnolips,9,1554,2020-04-22,{'basic': 'Bullish'},$GME wooooooooooooo,GME,2021-01-26 20:14:29+00:00,2021-01-26
2,286576565,jag2pr,10,287,2018-05-16,{'basic': 'Bullish'},$GME make your MONEY BACK with $GAXY; she&#39;...,GME|GAXY,2021-02-09 16:41:06+00:00,2021-02-09
3,299825525,Tavrabbit,1,328,2020-12-08,{'basic': 'Bullish'},$GME the fact that this is barred from trendyi...,GME,2021-03-05 17:14:06+00:00,2021-03-05
4,276427744,Pupperzzz,3,100,2015-12-28,{'basic': 'Bullish'},$BB $GME Most discussed tickers on WSB; lets t...,BB|GME,2021-01-22 19:08:05+00:00,2021-01-22
...,...,...,...,...,...,...,...,...,...,...
99995,291526781,PLelek,519,6407,2016-05-25,,$SPY LOL SUCH INTELLECTUALS ð¤£ð¤£ð¤£ CAN...,GME|SPY|AMC,2021-02-18 18:10:52+00:00,2021-02-18
99996,272968454,kluski,4,80,2016-03-11,{'basic': 'Bullish'},$GME,GME,2021-01-13 16:59:23+00:00,2021-01-13
99997,294751108,Cheze,13,703,2020-08-01,,$FRX I would like to blame the dip on $GME... ...,FRX|GME,2021-02-24 20:45:46+00:00,2021-02-24
99998,277303198,PhatPuffDaddy,198,14351,2019-04-29,,$GME to those who bought at $160,GME,2021-01-25 19:12:29+00:00,2021-01-25


The data is somewhat formatted, but there is an encoding issue we need to fix (see the odd characters). We'll use `ftfy` to do that, but we'll do this as part of a larger cleaning and data-prep process.
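
To see where characters like `ðµð` come from, the sketch below reproduces the problem with the standard library alone. It assumes the emoji's UTF-8 bytes were decoded as Latin-1 somewhere upstream, which is one common cause of this kind of garbling. `ftfy` detects and reverses mix-ups like this automatically, so the manual round trip is only an illustration.

```python
# Assumption: the garbling came from reading UTF-8 bytes as Latin-1.
original = "💵😂"

# Simulate the bug: encode correctly, then decode with the wrong codec.
garbled = original.encode("utf-8").decode("latin-1")

# Each emoji becomes four characters; most are invisible control codes,
# which is why only a few odd letters show up on screen.
visible = "".join(ch for ch in garbled if ch.isprintable())
print(visible)  # ðµð

# Reverse the mistake by undoing the two steps in the opposite order.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```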

Specifically, we will use `gensim`'s implementation of Word2Vec. Unlike `sklearn`, `gensim` expects the text to be tokenized already, as a list of lists: each outer element is the unit we wish to analyze (e.g., sentence, paragraph, document), and each inner element is a token.
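
As a rough illustration of that shape, here is a toy corpus tokenized with `str.split`. The two tweets are made up, and `str.split` is only a stand-in for the tweet tokenizer used in this demo.

```python
# Two made-up tweets standing in for the real corpus.
raw_tweets = ["$GME to the moon", "revenue growth looks strong"]

# Outer list: one entry per tweet. Inner list: that tweet's tokens.
corpus = [tweet.lower().split() for tweet in raw_tweets]

for tokens in corpus:
    print(tokens)
```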

We're examining tweets here, so we are going to use a special tokenizer in `nltk` to help handle these tweets. We'll incorporate this into a function that:
1. Addresses the formatting issue in the tweets
2. Accounts for cash tags (e.g., $MSFT)
3. Removes stop-words 
4. Retains only emojis, cash tags, or letter-based tokens

In [4]:
from nltk.tokenize.casual import TweetTokenizer
import ftfy, emoji, re

# Load the stop-word list, one word per line (update with location to your files)
with open('/Users/cooperdenning/PycharmProjects/pythonProject5/.venv/D6.2/english', 'r') as file:
    stops = [line.strip() for line in file]

# stops now holds each line of the file as an element
print(stops)

tweettoken = TweetTokenizer()

# Look-ahead: match a "$" only when a letter follows, so cash tags such as $gme match but prices such as $160 do not
cashtag = re.compile(r'\$(?=[a-z]+)')

def myTweetTokenizer(tweet):
    tweet = ftfy.ftfy(str(tweet)).lower() # 1. fix encoding issue and convert to lower-case (str accounts for the occasional missing value)
    tweet = cashtag.sub('CT',tweet) # 2. replaces cash tag with "CT"
    tokens = [t for t in tweettoken.tokenize(tweet) if (emoji.is_emoji(t) or (len(t)>=2 and t.isalpha() and t not in stops))] # steps 3-4
    return tokens

formatted_tweets = df['text'].apply(myTweetTokenizer)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
formatted_tweets

0                                [CTgme, ridiculous, 💵, 😂]
1                                  [CTgme, wooooooooooooo]
2        [CTgme, make, money, back, CTgaxy, increase, n...
3        [CTgme, fact, barred, trendying, mass, media, ...
4        [CTbb, CTgme, discussed, tickers, wsb, lets, t...
                               ...                        
99995    [CTspy, lol, intellectuals, 🤣, 🤣, 🤣, cant, eve...
99996                                              [CTgme]
99997    [CTfrx, would, like, blame, dip, CTgme, starte...
99998                                      [CTgme, bought]
99999                                              [CTgme]
Name: text, Length: 100000, dtype: object
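
The cash-tag handling is the least obvious part of `myTweetTokenizer`, so here is a small standalone sketch of it. The look-ahead matches a `$` only when a letter follows, which leaves dollar amounts such as `$160` alone, and swapping the `$` for `CT` produces a token that passes the `isalpha()` filter. The tiny stop-word set is made up, `str.split` stands in for `TweetTokenizer`, and `(?=[a-z])` is used because the trailing `+` in the original pattern does not change what a look-ahead matches.

```python
import re

# Match "$" only when it is directly followed by a lower-case letter.
cashtag = re.compile(r"\$(?=[a-z])")

# Hypothetical stop words; the demo loads the full list from a file.
toy_stops = {"to", "those", "who", "at"}

def toy_tokenize(tweet):
    tweet = cashtag.sub("CT", tweet.lower())
    # Keep alphabetic tokens of length >= 2 that are not stop words.
    return [t for t in tweet.split()
            if len(t) >= 2 and t.isalpha() and t not in toy_stops]

print(toy_tokenize("$GME to those who bought at $160"))
# ['CTgme', 'bought']
```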

## Step 2: Train Word2Vec Model
Next, we'll train the __[Word2Vec](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html)__ model. 

In [6]:
from gensim.models import Word2Vec
model = Word2Vec(sentences=formatted_tweets,
                 workers=4, # parallelize across 4 cores
                 sg=0, # CBOW (Skip-gram would be sg=1)
                 min_count=10, # require a word to appear at least 10 times
                 vector_size=100 # the dimension of the vector space we project to
                ) 


And that's it! The model is trained. 
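
To make `min_count` and `sg=0` a little more concrete, the sketch below prunes rare words and then lists the (context, target) pairs that a CBOW model would learn from. It is a simplification rather than gensim's implementation: it ignores subsampling of frequent words, random window shrinking, and negative sampling. The toy sentences, `window = 2`, and `min_count = 2` are made-up values (gensim's default window is 5).

```python
from collections import Counter

toy_sentences = [
    ["CTgme", "to", "moon", "🚀"],
    ["CTgme", "moon", "soon"],
    ["buy", "CTgme", "🚀"],
]
min_count, window = 2, 2  # illustrative values only

# 1. Count tokens across the corpus and keep those seen often enough.
counts = Counter(tok for sent in toy_sentences for tok in sent)
vocab = {tok for tok, n in counts.items() if n >= min_count}

# 2. Drop out-of-vocabulary tokens before forming windows.
pruned = [[tok for tok in sent if tok in vocab] for sent in toy_sentences]

# 3. CBOW predicts each target word from its surrounding context words.
#    (Skip-gram reverses the direction: target -> each context word.)
cbow_pairs = []
for sent in pruned:
    for i, target in enumerate(sent):
        context = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
        if context:
            cbow_pairs.append((context, target))

print(sorted(vocab))
print(cbow_pairs[:3])
```

Words that fall below `min_count` never receive a vector, which is why any later lookup has to check that a token is in the vocabulary first.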

## Step 3: Examine Language Similarity
Now, let's examine some word similarities. `gensim` has several very intuitive methods built in. For instance, suppose we wanted to know the 10 most similar words to the word "revenue":

In [7]:
model.wv.most_similar('revenue',topn=10)

[('sales', 0.9355709552764893),
 ('net', 0.8825573325157166),
 ('growth', 0.8759857416152954),
 ('dividend', 0.8747082352638245),
 ('recent', 0.8744640946388245),
 ('valued', 0.8717338442802429),
 ('eps', 0.8643818497657776),
 ('global', 0.8636304140090942),
 ('sector', 0.8629682660102844),
 ('announced', 0.859205961227417)]

In [8]:
model.wv.most_similar('wsb',topn=10)

[('reddit', 0.8106210827827454),
 ('pumping', 0.6959242820739746),
 ('main', 0.6904810667037964),
 ('calling', 0.677744448184967),
 ('crowd', 0.6752024292945862),
 ('wallstreet', 0.6580294966697693),
 ('kids', 0.6521521210670471),
 ('focus', 0.6454545855522156),
 ('pumped', 0.6412028074264526),
 ('army', 0.6331405639648438)]

You could also look at some of the language typical in financial social media:

In [9]:
for w in ['💎','yolo','🚀','🙌']:
    print(f"{w}:")
    print(model.wv.most_similar(w,topn=10))

💎:
[('🙌', 0.8508074283599854), ('🧤', 0.840610146522522), ('🖐', 0.8269316554069519), ('✋', 0.8266353011131287), ('🤚', 0.8264672160148621), ('🤲', 0.823300302028656), ('🙌🏼', 0.8201403021812439), ('👐', 0.8188973069190979), ('✊🏻', 0.798251748085022), ('🥜', 0.7785016298294067)]
yolo:
[('lotto', 0.8439286947250366), ('outs', 0.8281283974647522), ('printing', 0.8260573744773865), ('stockorbit', 0.8084323406219482), ('leap', 0.7973029613494873), ('commons', 0.7946054935455322), ('shoulda', 0.7911477088928223), ('delayed', 0.788819432258606), ('assigned', 0.7838578820228577), ('itm', 0.7756495475769043)]
🚀:
[('🌝', 0.800750195980072), ('🌙', 0.8003227710723877), ('🪐', 0.7703695297241211), ('🌚', 0.7700921297073364), ('🌕', 0.7454323768615723), ('💪🏻', 0.7054610848426819), ('🤟🏼', 0.6954057216644287), ('CTentx', 0.6929192543029785), ('👨\u200d🚀', 0.6911655068397522), ('moon', 0.6909932494163513)]
🙌:
[('👐', 0.8907890319824219), ('✋', 0.8850992918014526), ('🖐', 0.882362425327301), ('🧤', 0.873512864112854)

We can also look at the similarity between any two words in the corpus:

In [None]:
model.wv.similarity('income','eps')
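
`similarity` reports the cosine similarity between the two word vectors, the same measure `most_similar` ranks by. A minimal sketch of that calculation on made-up three-dimensional vectors (the embeddings trained above have 100 dimensions):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, not taken from the trained model.
toy_vectors = {
    "income": [1.0, 2.0, 0.0],
    "eps": [2.0, 4.0, 0.0],      # same direction as "income"
    "rocket": [0.0, 0.0, 3.0],   # orthogonal to "income"
}

print(cosine_similarity(toy_vectors["income"], toy_vectors["eps"]))
print(cosine_similarity(toy_vectors["income"], toy_vectors["rocket"]))
```

Because cosine similarity depends only on direction, the first pair scores approximately 1 even though one vector is twice as long as the other, while the orthogonal pair scores 0.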

The word vectors are accessible in the `model.wv.vectors` attribute. These are indexed by the vocabulary, and we can link back to that index with `model.wv.index_to_key`. I think it's easiest to access this information with a dataframe:

In [None]:
w2v_vectors = pd.DataFrame(model.wv.vectors,index=model.wv.index_to_key)
w2v_vectors

Finally, suppose we wanted to represent each tweet in our corpus as an aggregate of its individual word vectors. We could do this in two different ways. If we want to allow scale (the number of words in a tweet) to influence distances, we could sum the vectors. Alternatively, we can use the mean to remove the effect of scale. Here's a function that can handle both (just change `aggfunc` to `np.sum` if you would like to sum instead of compute the mean):

In [None]:
import numpy as np
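def vectorize_tweet(tweet, aggfunc=np.mean):
    # Keep only tokens that have a vector (words below min_count were dropped in training)
    words = [word for word in tweet if word in w2v_vectors.index]
    if not words:  # nothing to aggregate: return a NaN vector of the right length
        return pd.Series(np.nan, index=w2v_vectors.columns)
    return pd.Series(aggfunc(w2v_vectors.loc[words].to_numpy(), axis=0))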

# Test it out:
vectorize_tweet(formatted_tweets.iloc[0])
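
The difference between the two aggregations is easiest to see on a toy case. In the sketch below (plain Python lists and made-up two-dimensional vectors, so it runs without the trained model), repeating the same words doubles the summed vector but leaves the mean unchanged:

```python
# Hypothetical two-dimensional embeddings.
toy_vectors = {"moon": [1.0, 3.0], "rocket": [3.0, 1.0]}

def aggregate(tokens, how="mean"):
    # Skip tokens without a vector, then add up each dimension.
    rows = [toy_vectors[t] for t in tokens if t in toy_vectors]
    totals = [sum(col) for col in zip(*rows)]
    if how == "sum":
        return totals
    return [x / len(rows) for x in totals]

short = ["moon", "rocket"]
longer = short * 2  # same words, twice as many tokens

print(aggregate(short, "sum"), aggregate(longer, "sum"))
print(aggregate(short, "mean"), aggregate(longer, "mean"))
```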

Now run on full corpus:

In [None]:
tw_vectors = formatted_tweets.apply(vectorize_tweet)

In [None]:
tw_vectors