## Partisan Word Use
### [In progress]

In the prior post, we built a way to stream tokenized text from our database of articles, along with a list of n-grams (like *supreme_court*) that we'll impose on the data so that each of those phrases is counted as a single token.

Now we're going to engage in a preliminary and fairly rudimentary analysis of the text data as it relates to partisanship. To do so, we'll look at which n-grams are most characteristic of the left and the right (and later at individual sources). We expect results roughly in line with this [538 piece](http://fivethirtyeight.com/features/these-are-the-phrases-each-gop-candidate-repeats-most/) on the text of the GOP debates.

This analysis will also set us up for topic modeling (out of fashion as it may be) since we'll just be building a large document-term matrix. I also hope to use it on weekly cuts of the data to see how emphasis in coverage changed over the election cycle.

### Building a Document-Term Matrix

#### Reading the Data
First, we want to collect the data at the level of each source, which will require us to subclass the `SentenceStream` generator we built in the last post. `CountVectorizer` from `scikit-learn` treats each element in an iterator as a document, so we'll restructure the generator such that all words from each source are combined into one list of strings.

In [4]:
from itertools import groupby, imap

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

from stream import SentenceStream #from the prior post

sql_url = 'postgres://postgres:Pluto2015!@localhost/articles'

# ordering is required: groupby only merges consecutive rows that share a key
query = """
        SELECT base_url, article_text
        FROM articles
        WHERE num_words > 100
        ORDER BY base_url
        """

class SourceStream(SentenceStream):
    """
    Stream tokens from each source
    """
    def __iter__(self):
        rows = super(SourceStream, self).__iter__()
        source_sentences = groupby(rows, lambda x: x[0])
        
        for _, sentences in source_sentences:
            # flatten every sentence from this source into one list of tokens
            yield [word for _, sentence in sentences for word in sentence]
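
To see why the `ORDER BY` matters, here is a small standalone sketch of the same grouping idea. The rows and source names are made up for illustration, and unlike `SourceStream` this version also yields the source name so the result is easier to read. It is a toy, not the class above.

```python
from itertools import groupby

# Hypothetical (source, sentence) rows, deliberately NOT sorted by source
rows = [
    ('a.com', ['supreme_court', 'ruling']),
    ('b.com', ['tax', 'plan']),
    ('a.com', ['new', 'poll']),
]

def tokens_by_source(rows):
    """Yield (source, flat token list) for each run of consecutive
    rows that share the same source."""
    for source, group in groupby(rows, key=lambda row: row[0]):
        yield source, [word for _, sentence in group for word in sentence]

# Unsorted input: 'a.com' shows up as two separate groups
unsorted_groups = list(tokens_by_source(rows))

# Sorted input: each source is collapsed into exactly one group
sorted_rows = sorted(rows, key=lambda row: row[0])
sorted_groups = list(tokens_by_source(sorted_rows))

print(len(unsorted_groups))  # 3
print(sorted_groups)
```

With unsorted rows, `CountVectorizer` would see the same source as several different documents, which is exactly what the `ORDER BY base_url` in the query prevents.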

We'll load in the identified n-grams from last time.

In [6]:
import cPickle as pickle
with open('../intermediate/phrasegrams_all.pkl', 'rb') as infile:
    ngrams = pickle.load(infile)
    
# the n-grams we've located will now be identified in the stream of text
# using the MWETokenizer from nltk
src_stream = SourceStream(sqldb=sql_url, query=query,
                          ngrams=ngrams, idcol='base_url')
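
As a quick reminder of what that tokenizer step does, here is a minimal sketch that uses nltk's `MWETokenizer` directly. The phrase list is a hypothetical stand-in: the real `ngrams` object comes from the pickle above, and how `SentenceStream` hands it to the tokenizer is covered in the prior post, not here.

```python
from nltk.tokenize import MWETokenizer

# Hypothetical multi-word expressions, given as tuples of tokens
phrases = [('supreme', 'court'), ('health', 'care')]

# The default separator joins the matched tokens with an underscore
mwe_tokenizer = MWETokenizer(phrases)

# MWETokenizer expects text that has already been split into tokens
tokens = ['the', 'supreme', 'court', 'ruled', 'on', 'health', 'care']
merged = mwe_tokenizer.tokenize(tokens)

print(merged)
# ['the', 'supreme_court', 'ruled', 'on', 'health_care']
```

Tokens that are not part of a listed phrase pass through unchanged, so a lone *court* would still be counted on its own.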

#### Building the Matrix
Now we can create our document-term matrix, where each source represents a document and our columns will be a mix of single tokens and the n-grams from earlier.

Importantly, we're going to limit ourselves to tokens and n-grams that appear in two or more sources. I believe this choice enables us to consider these sources as a network, where there might be patterns of mutual influence on rhetoric and thinking. We don't really care about one-off uses of a particular word or phrase.

In [1]:
# dummy tokenizer since nltk is doing the tokenizing in the stream
# we can't do lambda x: x because it is unpicklable
def no_tokenizer(x):
    return x

# in fact, all processing is done, and we just need to place it
# in the appropriate data structure 

# we'll insist that we only get words that appear across
# at least two sources
vectorizer = CountVectorizer(analyzer='word', preprocessor=None,
                             lowercase=False, tokenizer=no_tokenizer,
                             min_df=2)
dtm_source = vectorizer.fit_transform(src_stream)
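
To make the effect of `min_df=2` concrete, here is a toy version with three made-up "sources" that are already tokenized. The token lists are invented for illustration; the vectorizer settings mirror the ones above.

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity_tokenizer(tokens):
    # documents arrive as lists of tokens, so there is nothing left to split
    return tokens

# Hypothetical pre-tokenized documents, one per source
toy_sources = [
    ['supreme_court', 'taxes', 'ruling', 'taxes'],
    ['supreme_court', 'taxes', 'immigration'],
    ['taxes', 'healthcare'],
]

toy_vectorizer = CountVectorizer(analyzer='word', preprocessor=None,
                                 lowercase=False,
                                 tokenizer=identity_tokenizer,
                                 min_df=2)
toy_dtm = toy_vectorizer.fit_transform(toy_sources)

# terms that survived the min_df filter, in column order
kept_terms = sorted(toy_vectorizer.vocabulary_,
                    key=toy_vectorizer.vocabulary_.get)
print(kept_terms)
print(toy_dtm.toarray())
```

Only terms that occur in at least two of the toy documents get a column; *ruling*, *immigration*, and *healthcare* each appear in a single document and are dropped. Note that `min_df` counts documents, not total occurrences, so repeating a word many times within one source is not enough to keep it.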

We'll return to this source-based matrix later; for now, we'll collapse it to just two rows, representing left- and right-aligned sources.