## Partisan Word Use
### [In progress]

In the prior post, we built a way to read tokenized data from our database of articles, along with a list of n-grams (like *supreme_court*) that we'll apply to the data so that each one is counted as a single token.

Now we'll run a preliminary, fairly rudimentary analysis of how the text relates to partisanship. To do so, we'll look at which n-grams are most characteristic of the left and the right (and later at individual sources). We expect results roughly in line with this [538 piece](http://fivethirtyeight.com/features/these-are-the-phrases-each-gop-candidate-repeats-most/) on the text of the GOP debates.

This analysis will also set us up for topic modeling (out of fashion as it may be) since we'll just be building a large document-term matrix. I also hope to use it on weekly cuts of the data to see how emphasis in coverage changed over the election cycle.

### Reading the Data
First, we want to collect the data at the level of each source, which will require us to subclass the `SentenceStream` generator we built in the last post. `CountVectorizer` from `scikit-learn` treats each element in an iterator as a document, so we'll restructure the generator such that all words from each source are combined into one list of strings.

In [1]:
from itertools import groupby, imap
from nltk.corpus import stopwords

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

from stream import SentenceStream #from the prior post

sql_url = 'postgres://postgres:**password**@localhost/articles'

# ordering is a necessity, else groupby won't work
# we also have a few strange base_url's with '{xyz.com}'
query = """
        SELECT base_url, article_text
        FROM articles
        WHERE num_words > 100
        ORDER BY base_url
        """

class SourceNGramStream(SentenceStream):
    """
    Get a stream of pre-identified n-grams from each source
    """
    def __iter__(self):
        rows = super(SourceNGramStream, self).__iter__()
        source_sentences = groupby(rows, lambda x: x[0])
        
        for source, sentences in source_sentences:
            # keep only the underscore-joined n-grams; `_` avoids shadowing the builtin `id`
            source_ngrams = [word for _, sentence in sentences for word in sentence if '_' in word]
            if source_ngrams:
                yield source_ngrams
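
To see why the `ORDER BY` matters, here is a minimal sketch of the same grouping logic on made-up rows. The `rows` data and the `ngrams_by_source` helper are hypothetical stand-ins for the database stream, not part of `SentenceStream`.

```python
from itertools import groupby

# Hypothetical (source, tokenized sentence) rows, already sorted by source
rows = [
    ("a.com", ["the", "supreme_court", "ruled"]),
    ("a.com", ["election_day", "is", "near"]),
    ("b.com", ["no", "phrases", "here"]),
    ("c.com", ["the", "supreme_court", "again"]),
]

def ngrams_by_source(rows):
    """Yield (source, n-grams) pairs, skipping sources with no n-grams."""
    for source, group in groupby(rows, key=lambda row: row[0]):
        ngrams = [w for _, sentence in group for w in sentence if "_" in w]
        if ngrams:
            yield source, ngrams

docs = list(ngrams_by_source(rows))
print(docs)

# groupby only merges adjacent rows, so unsorted input splits a source in two
unsorted_rows = [rows[0], rows[3], rows[1]]
split_docs = list(ngrams_by_source(unsorted_rows))
print(len(split_docs))
```

In this toy data `b.com` drops out because it has no n-grams. Row positions in the resulting matrix therefore only line up with a separate list of sources if every source yields at least one n-gram.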

We'll load in the identified n-grams from last time.

In [80]:
import cPickle as pickle
with open('../intermediate/phrasegrams_all.pkl', 'rb') as infile:
    ngrams = pickle.load(infile)

# the n-grams we've located will now be identified in the stream of text
# using the MWETokenizer from nltk
src_stream = SourceNGramStream(sqldb=sql_url, query=query,
                                ngrams=ngrams, idcol='base_url')

### Building a document-term matrix
Now we can create our document-term matrix, where each source represents a document and our columns will be a mix of single tokens and the n-grams from earlier.

Importantly, we're going to limit ourselves to tokens and n-grams that appear in two or more sources. I believe this choice lets us treat the sources as a network, where there might be patterns of mutual influence on rhetoric and thinking. We don't really care about one-off uses of a particular word or phrase.
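
As a rough illustration of the "two or more sources" rule, the sketch below computes document frequencies by hand on made-up documents. It is a plain-Python stand-in for what `min_df=2` does, not the vectorizer itself.

```python
from collections import Counter

# Hypothetical pre-tokenized documents, one list of tokens per source
docs = [
    ["supreme_court", "election_day", "supreme_court"],
    ["supreme_court", "told_wnd", "told_wnd", "told_wnd"],
    ["election_day", "open_borders"],
]

# Document frequency: in how many documents does each term occur at least once?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Keep terms found in two or more documents, as min_df=2 would
vocabulary = sorted(term for term, df in doc_freq.items() if df >= 2)
print(vocabulary)
```

Here `told_wnd` is dropped even though it occurs three times, because all three uses come from a single source.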

In [2]:
# dummy tokenizer since nltk is doing the tokenizing in the stream
# we can't do lambda x: x because it is unpicklable
def no_tokenizer(x):
    return x

# in fact, all processing is done, and we just need to place it
# in the appropriate data structure 
vectorizer = CountVectorizer(analyzer='word', preprocessor=None,
                             lowercase=False, tokenizer=no_tokenizer,
                             min_df=2)
dtm_source = vectorizer.fit_transform(src_stream)


In [12]:
from scipy import io

with open('../intermediate/vec_source_phrasegram.pkl', 'wb') as vecf:
    pickle.dump(vectorizer, vecf)
io.mmwrite('../intermediate/dtm_source_phrasegram.mtx',  dtm_source)

In [117]:
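from scipy import io
import cPickle as pickle

# the pickled vectorizer refers to no_tokenizer by name,
# so it must be defined again before unpickling
def no_tokenizer(x):
    return x

with open('../intermediate/vec_source_phrasegram.pkl', 'rb') as vecf:
    vectorizer = pickle.load(vecf)

# mmread returns a COO matrix; convert to CSR so it can be sliced
dtm_source = io.mmread('../intermediate/dtm_source_phrasegram.mtx').tocsr()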

Despite having limited ourselves to n-grams that appear in more than one source, some phrases still muddy the waters unnecessarily (like 'washington examiner news desk'). As a result, we'll remove any term for which a single source accounts for more than 95% of its occurrences.
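
The 95% rule can be sketched on a tiny dense array as follows. The counts and term names are made up, and a dense NumPy array stands in for the sparse matrix.

```python
import numpy as np

# Hypothetical counts: rows are sources, columns are terms
terms = ["supreme_court", "news_desk", "election_day"]
counts = np.array([
    [10, 99, 4],
    [12,  1, 6],
    [ 8,  0, 5],
])

# Share of each term's total count contributed by each source
shares = counts / counts.sum(axis=0, keepdims=True)

# Keep a term only if no single source supplies more than 95% of its uses
keep = (shares <= 0.95).all(axis=0)
kept_terms = [t for t, k in zip(terms, keep) if k]
print(kept_terms)
```

In this example `news_desk` is removed because one source supplies 99 of its 100 occurrences.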

In [118]:
import numpy as np
from scipy.sparse import csr_matrix

idx_norm_terms = np.all(np.true_divide(dtm_source.toarray(), dtm_source.sum(axis=0)) <= 0.95, axis=0).A[0]
features = vectorizer.get_feature_names()
features = [f for i, f in enumerate(features) if idx_norm_terms[i]]
dtm_source = csr_matrix(dtm_source[:,idx_norm_terms])

### First Look at the Left and Right
We'll return to this source-based matrix later. For now, we'll collapse it to only two rows, representing left- and right-aligned sources, for a more straightforward analysis of which ideas are of greatest concern to each side of the aisle.

We need to pull the old alignment data from ["Blue Feed, Red Feed"](https://github.com/jonkeegan/blue-feed-red-feed-sources) to correctly identify who is on the left and the right.
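
The collapse itself is just a row sum per side. A minimal sketch on made-up data, assuming one side label per row of the source matrix (with `None` standing in for a source that has no alignment label):

```python
import numpy as np

# Hypothetical source-by-term counts and one side label per source (row)
counts = np.array([
    [3, 0, 1],
    [2, 5, 0],
    [0, 4, 4],
    [1, 1, 1],
])
side = np.array(["left", "right", "left", None], dtype=object)

# Row positions of the sources on each side
left_rows = np.where(side == "left")[0]
right_rows = np.where(side == "right")[0]

# Two-row matrix: summed counts for the left, then for the right
dtm_two = np.vstack([counts[left_rows].sum(axis=0),
                     counts[right_rows].sum(axis=0)])
print(dtm_two)
```

The unlabeled fourth source contributes to neither row, so any source missing from the alignment data silently drops out of the comparison.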

In [119]:
import pandas as pd
from sqlalchemy import create_engine
# same order as earlier query
source_query = """
               SELECT base_url,
                      split_part(post_id, '_', 1) as fb_id
               FROM articles 
               WHERE num_words > 100 
               GROUP BY base_url,
                        split_part(post_id, '_', 1) 
               ORDER BY base_url
               """

# collect source alignment data
sources = pd.read_sql(source_query, create_engine(sql_url))  
alignment_data = pd.read_csv('./input/included_sources.csv', dtype={'fb_id':object})
sources = sources.merge(alignment_data, how='left')

# we supplemented the data with infowars
sources.loc[sources.base_url == 'infowars.com', 'side'] = 'right'

# get indexes of left and right sources
sources_left = np.where(sources.side == 'left')[0]
sources_right = np.where(sources.side == 'right')[0]

# create a new document-term matrix of 2 rows
dtm_side = csr_matrix(np.append(dtm_source[sources_left,:].sum(axis=0),
                                dtm_source[sources_right,:].sum(axis=0),
                                axis=0))

In [121]:
dtm_side.shape

(2, 31085)

And behold, our matrix! We'll now transform this count matrix (where A<sub>ij</sub> is the number of times term j appears on side i) into a [tf-idf matrix](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) with normalized rows. This down-weights phrases that both sides use relative to phrases that only one side uses.
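
To make the weighting concrete, here is a hand-rolled sketch of the computation on made-up counts. It assumes the `TfidfTransformer` defaults (smoothed idf, then L2 normalization of each row) and uses plain Python rather than scikit-learn.

```python
import math

# Hypothetical two-row count matrix: row 0 is the left, row 1 is the right
terms = ["feel_like", "police_brutality", "open_borders"]
counts = [
    [50, 20, 0],
    [50, 0, 30],
]

n_docs = len(counts)

# Number of rows (sides) in which each term appears
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
print(idf)
print(tfidf)
```

With only two documents the smoothed idf takes just two values: 1 for terms both sides use and about 1.41 for terms only one side uses. Raw counts therefore still drive most of the ranking, which may be why generic phrases can land near the top.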

In [122]:
dtm_side_tfidf = TfidfTransformer().fit_transform(dtm_side)

#get the column indices from largest to smallest
idx_sorted_tfidf_left = np.argsort(dtm_side_tfidf[0, ].toarray()[0])[::-1]
idx_sorted_tfidf_right = np.argsort(dtm_side_tfidf[1, ].toarray()[0])[::-1]

#nonzero terms
terms_sorted_tfidf_left = [features[i] for i in idx_sorted_tfidf_left if dtm_side[0,i]] 
terms_sorted_tfidf_right = [features[i] for i in idx_sorted_tfidf_right if dtm_side[1,i]] 

In [123]:
dtm_source[:,idx_sorted_tfidf_left[:1000]].T

<1000x84 sparse matrix of type '<type 'numpy.int32'>'
	with 67201 stored elements in Compressed Sparse Column format>

In [142]:
# let's put this information together and save it for plotting
pd.DataFrame(dtm_source[:,idx_sorted_tfidf_left[:1000]].T.toarray(),
             columns=sources.base_url)\
  .assign(terms=terms_sorted_tfidf_left[:1000],
          tfidf=dtm_side_tfidf[0, idx_sorted_tfidf_left[:1000]].A[0])\
  .to_csv('./output/ngrams_tfidf_top1000_left.csv', index_label='rank')

pd.DataFrame(dtm_source[:,idx_sorted_tfidf_right[:1000]].T.toarray(),
             columns=sources.base_url)\
  .assign(terms=terms_sorted_tfidf_right[:1000],
          tfidf=dtm_side_tfidf[1, idx_sorted_tfidf_right[:1000]].A[0])\
  .to_csv('./output/ngrams_tfidf_top1000_right.csv', index_label='rank')

sources[['base_url', 'fb_id', 'side', 'avg_align']].to_csv('./output/source_info.csv')

In [143]:
pd.DataFrame(dtm_source[:,idx_sorted_tfidf_left[:1000]].T.toarray(),
             columns=sources.base_url)\
  .assign(terms=terms_sorted_tfidf_left[:1000],
          tfidf=dtm_side_tfidf[0, idx_sorted_tfidf_left[:1000]].A[0])

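| base_url | aclj.org | advocate.com | alternet.org | americannews.com | americanthinker.com | billmoyers.com | bizpacreview.com | blackamericaweb.com | bluenationreview.com | breitbart.com | ... | twitchy.com | washingtonexaminer.com | {weeklystandard.com} | weeklystandard.com | westernjournalism.com | {wnd.com} | wnd.com | youngcons.com | terms | tfidf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 256 | 416 | 31 | 91 | 93 | 142 | 91 | 11 | 308 | ... | 36 | 83 | 0 | 36 | 143 | 0 | 94 | 165 | feel_like | 0.184894 |
| 1 | 4 | 127 | 346 | 30 | 170 | 116 | 127 | 87 | 9 | 214 | ... | 30 | 142 | 0 | 65 | 86 | 0 | 169 | 39 | 20_years | 0.129267 |
| 2 | 0 | 152 | 402 | 9 | 40 | 97 | 60 | 161 | 0 | 153 | ... | 19 | 75 | 0 | 32 | 52 | 0 | 100 | 27 | new_orleans | 0.126067 |
| 3 | 6 | 68 | 296 | 20 | 211 | 59 | 83 | 17 | 17 | 167 | ... | 62 | 174 | 0 | 91 | 91 | 0 | 88 | 136 | first_place | 0.118590 |
| 4 | 34 | 76 | 263 | 78 | 295 | 57 | 126 | 14 | 28 | 180 | ... | 70 | 122 | 0 | 58 | 124 | 0 | 141 | 151 | yet_another | 0.104881 |
| 5 | 18 | 54 | 262 | 11 | 133 | 67 | 43 | 13 | 3 | 168 | ... | 23 | 106 | 0 | 37 | 51 | 0 | 131 | 28 | every_year | 0.102020 |
| 6 | 4 | 65 | 276 | 6 | 228 | 79 | 62 | 4 | 13 | 169 | ... | 13 | 117 | 0 | 65 | 60 | 0 | 130 | 63 | vast_majority | 0.093665 |
| 7 | 0 | 3 | 21 | 0 | 0 | 10 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | rights_reserved | 0.091229 |
| 8 | 6 | 55 | 132 | 30 | 253 | 55 | 59 | 32 | 30 | 124 | ... | 38 | 56 | 0 | 39 | 83 | 0 | 120 | 80 | let_us | 0.090521 |
| 9 | 10 | 84 | 192 | 11 | 213 | 41 | 176 | 19 | 2 | 294 | ... | 55 | 195 | 0 | 48 | 191 | 0 | 132 | 91 | least_one | 0.089020 |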


In [111]:
terms_sorted_tfidf_left[:100]

[u'feel_like',
 u'20_years',
 u'new_orleans',
 u'first_place',
 u'yet_another',
 u'every_year',
 u'vast_majority',
 u'rights_reserved',
 u'let_us',
 u'least_one',
 u'police_brutality',
 u'material_may',
 u'rewritten_or_redistributed',
 u'good_news',
 u'lgbt_community',
 u'90_percent',
 u'months_ago',
 u'three_times',
 u'weeks_ago',
 u'ever_since',
 u'campaign_trail',
 u'north_dakota',
 u'federal_judge',
 u'juan_gonzlez',
 u'many_ways',
 u'black_lives_matter_movement',
 u'take_place',
 u'white_men',
 u'weve_seen',
 u'much_less',
 u'take_care',
 u'get_rid',
 u'republican_national_convention',
 u'60_percent',
 u'election_day',
 u'past_year',
 u'los_angeles_times',
 u'feels_like',
 u'next_day',
 u'much_better',
 u'black_community',
 u'great_deal',
 u'american_politics',
 u'looked_like',
 u'lgbt_rights',
 u'days_later',
 u'white_supremacy',
 u'take_action',
 u'making_sure',
 u'whole_thing',
 u'move_forward',
 u'makes_sense',
 u'final_form',
 u'rush_transcript',
 u'make_sense',
 u'fossil_fue

In [112]:
[t for t in terms_sorted_tfidf_right[:100] if t not in terms_sorted_tfidf_left[:500]]

[u'eligible_news_publisher',
 u'available_without_charge',
 u'daily_caller_news_foundation',
 u'original_content_please_contact_email_protected',
 u'large_audience',
 u'licensing_opportunities',
 u'aborted_babies',
 u'daily_signal',
 u'democrat_party',
 u'islamic_state_group',
 u'privacy_policy',
 u'share_your_thoughts',
 u'latest_video_at_videofoxnewscom',
 u'comments_section',
 u'interesting_story',
 u'open_borders',
 u'bear_arms',
 u'gop_establishment',
 u'terrorist_groups',
 u'sanctuary_cities',
 u'told_wnd',
 u'texas_sen_ted_cruz',
 u'refugee_resettlement',
 u'reprinted_with_permission',
 u'taxpayer_dollars',
 u'mr_obama',
 u'gop_front-runner',
 u'contested_convention',
 u'pro-life_movement']