# Word Vector of Tweets
**Source:**<br>
https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/42_word2vec_gensim/42_word2vec_gensim.ipynb | 
https://www.youtube.com/watch?v=Q2NtCcqmIww&ab_channel=codebasics<br>
**Visualization:**<br>
https://towardsdatascience.com/visualizing-word-embedding-with-pca-and-t-sne-961a692509f5

## Text Preprocessing

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('roevwade_top10cities_900tweets_cleaned.csv')
df.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,date,time,name,username,state,tweet
0,0,0,2022-05-10,01:56:33,Nicholas Lamar Soutter,nicksoutter,Boston,"@PaleBlueMomma Well, murdering 21 year old rea..."
1,1,1,2022-05-10,01:44:44,mys. sylverfyre,sylverfyrevalk,Boston,@hbt_ft @jskilesskinner anti abortion ppl hara...
2,2,2,2022-05-10,01:39:21,✿ artsy black girl ✿,fairyflossfro,Boston,This analysis lazy best. Why wasting time disc...


In [2]:
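# Drop a leftover index column (pandas names these "Unnamed: N" when a CSV saved with its index is read back)
df = df.drop(columns=['Unnamed: 0'])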
df.head()

Unnamed: 0,Unnamed: 0.1,date,time,name,username,state,tweet
0,0,2022-05-10,01:56:33,Nicholas Lamar Soutter,nicksoutter,Boston,"@PaleBlueMomma Well, murdering 21 year old rea..."
1,1,2022-05-10,01:44:44,mys. sylverfyre,sylverfyrevalk,Boston,@hbt_ft @jskilesskinner anti abortion ppl hara...
2,2,2022-05-10,01:39:21,✿ artsy black girl ✿,fairyflossfro,Boston,This analysis lazy best. Why wasting time disc...
3,3,2022-05-10,01:10:12,Sue,sue21222633,Boston,"@TomFitton Yes pass, I read it. It failed Febr..."
4,4,2022-05-10,01:10:06,"Lady Jasmine, HRH",hrh_ladyj,Boston,"@PressSec @POTUS Noted, addendum, They NOT phy..."


In [3]:
df.shape

(8882, 7)

In [4]:
# Merge the separate date and time strings into a single "datetime" string column
df["datetime"] = df["date"] + " " + df["time"]
df = df.drop(columns=["date", "time"])
df.head()

Unnamed: 0,Unnamed: 0.1,name,username,state,tweet,datetime
0,0,Nicholas Lamar Soutter,nicksoutter,Boston,"@PaleBlueMomma Well, murdering 21 year old rea...",2022-05-10 01:56:33
1,1,mys. sylverfyre,sylverfyrevalk,Boston,@hbt_ft @jskilesskinner anti abortion ppl hara...,2022-05-10 01:44:44
2,2,✿ artsy black girl ✿,fairyflossfro,Boston,This analysis lazy best. Why wasting time disc...,2022-05-10 01:39:21
3,3,Sue,sue21222633,Boston,"@TomFitton Yes pass, I read it. It failed Febr...",2022-05-10 01:10:12
4,4,"Lady Jasmine, HRH",hrh_ladyj,Boston,"@PressSec @POTUS Noted, addendum, They NOT phy...",2022-05-10 01:10:06
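
The `datetime` column built above is still a plain string. If the timestamps are later needed for sorting or time arithmetic, one option is to parse them with `pd.to_datetime`. The sketch below uses a two-row stand-in frame (`sample` is hypothetical, with values copied from the rows shown above) rather than the real CSV.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real DataFrame, using values shown above.
sample = pd.DataFrame({
    "date": ["2022-05-10", "2022-05-10"],
    "time": ["01:56:33", "01:44:44"],
})

# Concatenate as in the notebook, then parse the result into a datetime64 column.
sample["datetime"] = pd.to_datetime(
    sample["date"] + " " + sample["time"], format="%Y-%m-%d %H:%M:%S"
)
sample = sample.drop(columns=["date", "time"])

print(sample.dtypes)
# Time arithmetic now works: gap between the newest and oldest tweet.
print(sample["datetime"].max() - sample["datetime"].min())
```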


In [5]:
# Note: despite its name, the 'state' column holds city names
df['state'].unique()

array(['Boston', 'Chicago', 'Dallas', 'Houston', 'Los Angeles', 'Miami',
       'New York', 'Philadelphia', 'San Francisco', 'Washington D.C.'],
      dtype=object)

## Gensim

In [6]:
import gensim

In [7]:
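# Lowercase and tokenize each tweet; with its defaults, simple_preprocess also
# drops digits, punctuation, and tokens shorter than 2 or longer than 15 characters
tweets = df.tweet.apply(gensim.utils.simple_preprocess)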
tweets

0       [palebluemomma, well, murdering, year, old, re...
1       [hbt_ft, jskilesskinner, anti, abortion, ppl, ...
2       [this, analysis, lazy, best, why, wasting, tim...
3       [tomfitton, yes, pass, read, it, it, failed, f...
4       [presssec, potus, noted, addendum, they, not, ...
                              ...                        
8877    [fyllo, aodespair, the, states, draconian, ant...
8878    [clear_louden, ratmofficial, uterus, determine...
8879    [senblumenthal, what, definition, woman, you, ...
8880    [stowevt, billmaher, really, that, lot, eu, co...
8881    [fanatical, abortion, protesters, go, full, le...
Name: tweet, Length: 8882, dtype: object

In [8]:
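# Initialize the Word2Vec model (no corpus is passed yet, so nothing is trained here)
model = gensim.models.Word2Vec(
    window=10,    # maximum distance between the target word and its context words
    min_count=5,  # ignore words that occur fewer than 5 times in the corpus
    workers=4,    # number of worker threads used for training
)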

In [9]:
# Build the vocabulary from the tokenized tweets
model.build_vocab(tweets, progress_per=1000)
# Train Word2Vec on the same tweets, using the model's default number of epochs
model.train(tweets, total_examples=model.corpus_count, epochs=model.epochs)

(644881, 875770)
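
`train` returns a pair of counts. Per the gensim documentation these are the effective word count (after ignoring out-of-vocabulary words and trimming) and the raw word count, both summed over all epochs. The arithmetic below is a small sketch of how to read the tuple above, assuming the model ran with gensim's default of 5 epochs (the notebook never sets `epochs` explicitly).

```python
# Values copied from the training output above.
effective_words, raw_words = 644881, 875770

# Assumption: epochs was left at gensim's default of 5.
epochs = 5

raw_words_per_epoch = raw_words // epochs
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)      # raw tokens in one pass over the corpus
print(round(kept_fraction, 3))  # share of tokens actually used for training
```

Under that assumption, roughly a quarter of the tokens are skipped during training, which would be consistent with `min_count=5` discarding rare words in a corpus of short tweets.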

In [10]:
# Save only the word vectors, in the plain-text word2vec format
model.wv.save_word2vec_format('word2vec-roevwade-10-cities.txt', binary=False)
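
The file written above is plain text: a header line with the vocabulary size and vector dimension, then one line per word followed by its vector components. gensim can read it back with `KeyedVectors.load_word2vec_format`. The sketch below instead parses a tiny made-up sample by hand, purely to illustrate the layout; the words and numbers are invented and `read_word2vec_text` is a hypothetical helper, not part of the notebook.

```python
import io

# Made-up miniature file in the word2vec text layout (not real model output).
sample_file = io.StringIO(
    "2 3\n"
    "abortion 0.1 -0.2 0.3\n"
    "healthcare 0.0 0.5 -0.4\n"
)

def read_word2vec_text(handle):
    """Parse the word2vec text format into a {word: [floats]} dict."""
    # Header: "<vocabulary size> <vector dimension>"
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {word!r}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the body")
    return vectors

vectors = read_word2vec_text(sample_file)
print(vectors["healthcare"])
```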

**Experimenting with the model**

In [11]:
model.wv.most_similar("abortion")

[('in', 0.9903263449668884),
 ('attack', 0.9901604056358337),
 ('protesters', 0.9901257753372192),
 ('created', 0.9897302985191345),
 ('push', 0.9897077679634094),
 ('supporters', 0.9895628690719604),
 ('house', 0.9892563819885254),
 ('th', 0.9892551302909851),
 ('tuesday', 0.9891059398651123),
 ('democratic', 0.9889997839927673)]

In [12]:
model.wv.most_similar("healthcare")

[('universal', 0.9955496788024902),
 ('innocent', 0.9951094388961792),
 ('amp', 0.9936622381210327),
 ('quality', 0.9935098886489868),
 ('termination', 0.9931392669677734),
 ('plain', 0.9931087493896484),
 ('full', 0.992415189743042),
 ('happens', 0.9923977851867676),
 ('prenatal', 0.9923285841941833),
 ('risk', 0.9922488927841187)]
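
Tokens such as `amp` in the list above most likely come from the HTML entity `&amp;` in the raw tweets, and tokens like `palebluemomma` in the earlier token lists come from @-mentions. One possible cleanup before tokenizing is sketched below using only the standard library. `clean_tweet` is a hypothetical helper and the example string is invented; neither is part of the original notebook.

```python
import html
import re

MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Decode HTML entities and strip @-mentions from a raw tweet."""
    text = html.unescape(text)        # "&amp;" -> "&"
    text = MENTION_RE.sub(" ", text)  # drop @handles
    return " ".join(text.split())     # collapse leftover whitespace

# Invented example text, loosely modeled on the tweets shown earlier.
example = "@PressSec @POTUS Noted &amp; filed, They NOT agree"
print(clean_tweet(example))
```

If this cleanup were wanted, it could be applied ahead of tokenization, for example `df.tweet.apply(clean_tweet).apply(gensim.utils.simple_preprocess)`.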

In [13]:
model.wv.similarity(w1="abortion", w2="healthcare")

0.9088408
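
`similarity` reports the cosine similarity between two word vectors, which is also the score `most_similar` ranks by. The sketch below computes it by hand on toy vectors (the vectors are hypothetical, chosen only so the results are easy to check) to show what the number measures.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a, different length
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because the score depends only on direction, very high values for almost every pair of words may simply mean the vectors have not separated much yet, which can happen when the training corpus is small.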

In [14]:
model.wv.similarity(w1="abortion", w2="murder")

0.74523634

In [15]:
model.wv.similarity(w1="abortionrights", w2="amazon")

0.9977023