## Building the Model
### [In progress]

This will be a relatively short post in which we use `gensim` to create two `word2vec` models: one for right-aligned sources and another for left-aligned sources.

We'll align the left and right models using an orthogonal Procrustes matrix, following a [method](http://nlp.stanford.edu/projects/histwords/) developed to detect changes in language using word embeddings. This suggestion comes from [Ben Schmidt](http://benschmidt.org/), a professor at Northeastern, who developed the R package `wordVectors` and used it in a phenomenal [omnibus analysis](http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html) of gendered word use in RateMyProfessor reviews.
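To make the alignment step concrete, here is a minimal sketch of an orthogonal Procrustes solve in plain `numpy`. It is not the code used for this post: the function name `procrustes_align` and the toy matrices are hypothetical, and it assumes both embedding matrices have already been restricted to a shared vocabulary with rows in the same word order.

```python
import numpy as np

def procrustes_align(base, other):
    """
    Find the orthogonal matrix R minimizing ||other.dot(R) - base||
    (Frobenius norm) and return the rotated copy of `other` along with R.
    Rows of `base` and `other` must correspond to the same words.
    """
    # SVD of the cross-covariance matrix yields the optimal rotation
    u, _, vt = np.linalg.svd(other.T.dot(base))
    rotation = u.dot(vt)
    return other.dot(rotation), rotation

# toy data (hypothetical): `other` is `base` seen through a known rotation
rng = np.random.RandomState(0)
base = rng.randn(6, 4)
true_rot, _ = np.linalg.qr(rng.randn(4, 4))  # a random orthogonal matrix
other = base.dot(true_rot.T)                 # so other.dot(true_rot) == base

aligned, rotation = procrustes_align(base, other)
print(np.allclose(aligned, base))
```

Because the map is orthogonal, it preserves cosine similarities within the rotated model, so only the relationship between the two spaces changes. SciPy's `scipy.linalg.orthogonal_procrustes` solves the same problem if you would rather not write the SVD by hand.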

In [49]:
import cPickle as pickle

from stream import SentenceStream #from the prior post

sql_url = 'postgres://postgres:**PASSWORD**@localhost/articles'

# query yields same text as last time (now includes alignment data)
query = """
        SELECT side, article_text
        FROM articles a
        LEFT JOIN alignment s
        ON split_part(a.post_id, '_', 1) = s.fb_id
        WHERE num_words > 100 and not
              (lower(article_text) like '%%daily caller news foundation%%' and
               base_url != 'dailycaller.com') and not
               lower(article_text) like '%%copyright 20__ the associated press%%'
        ORDER BY side
        """

# pull in the noun n-grams
with open('../intermediate/noun_ngrams_all.pkl', 'rb') as infile:
    noun_ngrams = pickle.load(infile)

# limit to those that appear with some frequency
noun_ngrams = [n for n in noun_ngrams if noun_ngrams[n] > 100]
    
# the n-grams we've located will now be identified in the stream of text
# using the MWETokenizer from nltk
# since I'm using a Google Compute Engine instance, I have RAM to spare
# and will gluttonously store all of the sentences in memory
sentences = list(SentenceStream(sqldb=sql_url, query=query,
                                ngrams=noun_ngrams, idcol='side'))
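
As a quick illustration of what the n-gram step does to the text, here is a small sketch using `nltk`'s `MWETokenizer` directly. How `SentenceStream` calls it internally is not shown in this post, so the n-gram list and the sentence below are made up for the example.

```python
from nltk.tokenize import MWETokenizer

# hypothetical n-grams, each given as a tuple of its component tokens
example_ngrams = [('clinton', 'foundation'), ('voter', 'id', 'laws')]
tokenizer = MWETokenizer(example_ngrams, separator='_')

# the tokenizer takes an already-tokenized sentence and merges known n-grams
tokens = ['the', 'clinton', 'foundation', 'opposed', 'voter', 'id', 'laws']
merged = tokenizer.tokenize(tokens)
print(merged)
# ['the', 'clinton_foundation', 'opposed', 'voter_id_laws']
```

This is why multi-word terms such as `clinton_foundation` show up as single vocabulary entries in the models below.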

In [None]:
from gensim.models import Word2Vec

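# skip-gram (sg=1) with hierarchical softmax (hs=1), 500-dimensional vectors,
# dropping words that appear fewer than 50 times
# (gensim 4.0+ renames `size` to `vector_size` and `iter` to `epochs`)
model_left = Word2Vec([x[1] for x in sentences if x[0] == 'left'],
                      min_count=50, iter=5, sg=1, hs=1, workers=10, size=500)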
model_left.save('../intermediate/word2vec_left.pkl')

model_right = Word2Vec([x[1] for x in sentences if x[0] == 'right'],
                       min_count=50, iter=5, sg=1, hs=1, workers=10, size=500)
model_right.save('../intermediate/word2vec_right.pkl')

## Some initial explorations

Before we get into anything serious, we can begin by doing a little playing around with our models. One of the most basic things we can do with word embeddings is to find word vectors that are "close" to others, as defined by their cosine similarity. 
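
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal `numpy` sketch; the function name and the toy vectors are illustrative only and are not taken from the models.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [-3.0, 0.0, 1.0]  # orthogonal to a

sim_ab = cosine_similarity(a, b)  # 1: identical direction
sim_ac = cosine_similarity(a, c)  # 0: no shared direction
print(sim_ab, sim_ac)
```

Since only direction matters, a vector's length has no effect on which terms count as "close".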

I thought it would be interesting to see which terms are among the most similar to a given term in one model but not in the other (within some arbitrary number of most similar words). This method is really unsophisticated, but it's a fairly intuitive way of looking at the partisan components of a given term.

In [4]:
import pandas as pd

def partition_similar_terms(term, model_a, model_b, pool_n=50, n=10,
                            a_lab='left', b_lab='right'):
    """
    For the `pool_n` terms most similar to `term`
    in `model_a` and `model_b`, return
    A - B
    B - A
    A & B
    """
    labs = {'a':a_lab, 'b':b_lab}
    
    terms_a = [t for t, s in model_a.most_similar(term, topn=pool_n)]
    terms_b = [t for t, s in model_b.most_similar(term, topn=pool_n)]
    
    a_not_b = [t for t in terms_a if t not in terms_b][:n]
    b_not_a = [t for t in terms_b if t not in terms_a][:n]
    a_and_b = [t for t in terms_a if t in terms_b][:n]
    
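    # wrap each list in a Series so that columns of unequal length are
    # padded with NaN instead of raising a ValueError
    return pd.DataFrame({'{a} not {b}'.format(**labs): pd.Series(a_not_b),
                         '{b} not {a}'.format(**labs): pd.Series(b_not_a),
                         '{a} and {b}'.format(**labs): pd.Series(a_and_b) })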

In [5]:
# what does it mean to be "shady" on the right, that is different from what it means on the left, and vice versa?
partition_similar_terms('shady', model_left, model_right)

| | left and right | left not right | right not left |
|---|---|---|---|
| 0 | questionable | fraudulent | influence-peddling |
| 1 | dealings | business_practices | scandals |
| 2 | sleazy | deceptive | backroom |
| 3 | unsavory | dirty_tricks | clinton_foundation |
| 4 | financial_dealings | clever | seedy |
| 5 | sketchy | scam | connections |
| 6 | unscrupulous | frauds | sordid |
| 7 | business_dealings | astroturf | cozy |
| 8 | unethical | disreputable | foreign_entities |
| 9 | dodgy | slimy | clinton_foundations |


Fairly interesting stuff! The right focuses on the Clinton Foundation and its influence, as expected, and the left appears to make more of the unseemliness of (presumably) Trump's business practices. Let's try something else:

In [10]:
partition_similar_terms('radical', model_left, model_right)

| | left and right | left not right | right not left |
|---|---|---|---|
| 0 | extremist | far-right | jihadism |
| 1 | fringe | right-wing | islamic_ideology |
| 2 | progressive | anti-imperialist | hardline |
| 3 | militant | liberal | anti-american |
| 4 | revolutionary | leftwing | islamic |
| 5 | fundamentalist | populist | muslim_brotherhood |
| 6 | extremists | socialist | salafist |
| 7 | leftist | conservatism | extremist_groups |
| 8 | radicals | hard-line | deobandi |
| 9 | reactionary | doctrinaire | islamists |


The right uses "radical" to refer to Muslim groups almost exclusively. I've collected a few more interesting ones below.

In [29]:
partition_similar_terms('alt-right', model_left, model_right)

| | left and right | left not right | right not left |
|---|---|---|---|
| 0 | white_nationalists | breitbart | nevertrump |
| 1 | nationalist | white_nationalism | bigot |
| 2 | supremacist | daily_stormer | nevertrump_movement |
| 3 | neo-nazis | stormfront | pepe |
| 4 | neo-nazi | vdarecom | movement |
| 5 | white_supremacists | breitbart_news | donald_trumps_supporters |
| 6 | anti-semitic | white-supremacist | feminism |
| 7 | alt | bannon | conservatives |
| 8 | anti-semites | steve_bannon | leftist |
| 9 | far-right | jared_taylor | black_lives_matter_movement |


In [6]:
partition_similar_terms(['feminism', 'feminist', 'feminists'], model_left, model_right)

| | left and right | left not right | right not left |
|---|---|---|---|
| 0 | feminist_movement | womanhood | social_justice_warriors |
| 1 | gloria_steinem | queer | radical_feminist |
| 2 | intersectional | sex-positive | leftist |
| 3 | womens_rights | motherhood | radical_feminists |
| 4 | liberals | women | progressivism |
| 5 | liberal | intersectionality | left-wing |
| 6 | patriarchy | reproductive_justice | lefty |
| 7 | progressives | traister | dunham |
| 8 | progressive | womens_issues | lena |
| 9 | leftists | freethenipple | far-left |


In [27]:
partition_similar_terms('voter_fraud', model_left, model_right)

| | left and right | left not right | right not left |
|---|---|---|---|
| 0 | voter_id_laws | voter_impersonation | schulkin |
| 1 | election_fraud | in-person | corruption |
| 2 | voter_suppression | non-existent | welfare_fraud |
| 3 | fraud | rigged_election | criminal_activity |
| 4 | voter_id | voter-id_laws | dirty_tricks |
| 5 | voter_intimidation | noncitizens | illegal_activity |
| 6 | rigging | voter_id_law | bribery |
| 7 | disenfranchisement | turnout | elections |
| 8 | polling_places | gerrymandering | absentee_ballot |
| 9 | non-citizens | voter | upcoming_election |
