### A first look at the articles using n-grams
In my previous post, we collected roughly 500,000 articles from 80 left- and right-aligned online news sources, going back to July 2015. Here, we'll start to dive into the data.

The first step is to identify the most common n-grams for each source. An n-gram is an ordered sequence of n consecutive words taken from a longer sequence. For example, let's take a look at an iconic phrase from the campaign:

In [1]:
sentence = "Nobody has more respect for women than me"
def gen_n_gram(sentence, n):
    words = sentence.split()
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]

print(gen_n_gram(sentence, 1))
print(gen_n_gram(sentence, 2))

[('Nobody',), ('has',), ('more',), ('respect',), ('for',), ('women',), ('than',), ('me',)]
[('Nobody', 'has'), ('has', 'more'), ('more', 'respect'), ('respect', 'for'), ('for', 'women'), ('women', 'than'), ('than', 'me')]
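
To get from "all n-grams" to "the most common n-grams", we only need to count them. Here is a minimal sketch using `collections.Counter`; the three-sentence corpus and the `most_common_ngrams` helper are made up for illustration and are not part of the real pipeline.

```python
from collections import Counter

def gen_n_gram(sentence, n):
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def most_common_ngrams(sentences, n, top=3):
    """Count every n-gram across the sentences and return the `top` most frequent."""
    counts = Counter()
    for sentence in sentences:
        counts.update(gen_n_gram(sentence.lower(), n))
    return counts.most_common(top)

# toy corpus, invented for this example
toy_corpus = [
    "nobody has more respect for women than me",
    "nobody has more respect for the law",
    "nobody knows the system better than me",
]

top_bigrams = most_common_ngrams(toy_corpus, 2, top=5)
print(top_bigrams)
```

On real articles we would also want to lowercase and strip punctuation first, which is what the tokenization step further down takes care of.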


We'll rely on a combination of nltk, gensim, and [scikit-learn]() to [tokenize] our documents, generate n-grams, process the text, and place the data in a [term-document matrix](). To do this, we can write a generator that returns all articles from a query. We'll need similar functionality later, so I've made it a class that we can inherit from.

In [3]:
#stream.py
from sqlalchemy import create_engine

class QueryStream(object):
    """ 
    Stream documents from the articles database
    Can be subclassed to stream words or sentences from each document
    """
    def __init__(self, sqldb, query=None, textcol='article_text', chunksize=1000, **kwargs):

        self.sql_engine = create_engine(sqldb)
        self.query = query
        self.chunksize = chunksize
        self.textcol = textcol

    def __iter__(self):
        """ Iterate through each row in the query """
        query_results = self.sql_engine.execute(self.query)
        result_set = query_results.fetchmany(self.chunksize)
        while result_set:
            for row in result_set:
                yield getattr(row, self.textcol)
            result_set = query_results.fetchmany(self.chunksize)
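
The point of `fetchmany` is that only `chunksize` rows sit in memory at a time, while the caller still sees one flat stream of documents. The same pattern can be tried without a Postgres server; the sketch below uses the standard library's `sqlite3` with an in-memory table. The `stream_column` function and the toy table are assumptions for illustration, not the class above.

```python
import sqlite3

def stream_column(conn, query, textcol, chunksize=2):
    """Yield one column from a query, fetching `chunksize` rows at a time."""
    cursor = conn.execute(query)
    # find the position of the requested column in the result set
    columns = [desc[0] for desc in cursor.description]
    idx = columns.index(textcol)
    rows = cursor.fetchmany(chunksize)
    while rows:
        for row in rows:
            yield row[idx]
        rows = cursor.fetchmany(chunksize)

# toy in-memory table standing in for the articles database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (post_id INTEGER, article_text TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(i, "article number %d" % i) for i in range(5)])

texts = list(stream_column(
    conn,
    "SELECT post_id, article_text FROM articles ORDER BY post_id",
    "article_text"))
print(texts)
```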

We want to create a pipeline for correctly parsing documents. For this we'll rely on nltk for tokenization and gensim to learn novel collocations in the data.

In [None]:
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import MWETokenizer

import regex as re

class SentenceStream(QueryStream):
    def __init__(self, *args, **kwargs):
        super(SentenceStream, self).__init__(*args, **kwargs)
            
    def __iter__(self):
        rows = super(SentenceStream, self).__iter__()
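        # match any Unicode punctuation character except the hyphen
        # (raw string so that \P and \- reach the regex engine unchanged)
        punct = re.compile(r"[^\P{P}\-]")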
        
        for doc in rows:
            for sentence in sent_tokenize(doc):
                split_sentence = [punct.sub('', word).lower()
                                  for word in word_tokenize(sentence)]
                yield [word for word in split_sentence if word]
                
    def set_mwe_lexicon(self, phrases):
        pass
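
The double negative in `[^\P{P}\-]` is easy to misread: it matches characters that are punctuation and are not a hyphen. A rough standard-library equivalent, assuming `\p{P}` corresponds to the Unicode general categories that start with `P`, looks like this. `strip_punct` is a hypothetical helper for illustration, not something `SentenceStream` uses.

```python
import unicodedata

def strip_punct(token):
    """Drop every Unicode punctuation character except the ASCII hyphen."""
    return "".join(
        ch for ch in token
        if ch == "-" or not unicodedata.category(ch).startswith("P")
    )

tokens = ["Trans-Pacific", "don't", "end.", "$", "|", "(", "45"]
cleaned = [strip_punct(t).lower() for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were pure punctuation
print(cleaned)
```

Note that `$` and `|` survive: Unicode classifies them as symbols (`Sc` and `Sm`), not punctuation, so a punctuation-only pattern will let them through.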
                


In [21]:
from gensim.models.phrases import Phrases

sql_url = 'postgres://postgres:Pluto2015!@localhost/articles'
query = """
        SELECT setseed(0.5); 
        SELECT post_id, article_text
        FROM (SELECT post_id, article_text,
                    ROW_NUMBER() OVER (partition BY url ORDER BY date) AS rnum
              FROM articles
              WHERE num_words > 200) t
        WHERE t.rnum = 1
        ORDER BY random()
        LIMIT 50000
        """
#### REMOVE THE NEW REPUBLIC!! 
stream = list(SentenceStream(sqldb=sql_url, query=query))

In [39]:
bigram = Phrases(stream, max_vocab_size = 100000000, threshold=10, min_count=50)
trigram = Phrases(bigram[stream], max_vocab_size = 100000000, threshold=10, min_count=50)
quadgram = Phrases(trigram[bigram[stream]], max_vocab_size = 100000000, threshold=10, min_count=50)
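
It helps to know roughly what `threshold` and `min_count` do. As I understand gensim's default scorer, it follows the collocation formula from the word2vec papers: a pair of adjacent tokens is merged when `(count_ab - min_count) * vocab_size / (count_a * count_b)` exceeds `threshold`. The sketch below reimplements that formula by hand as an illustration; the function names and all of the counts are hypothetical, and this is not gensim's code.

```python
def collocation_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram: high when a and b mostly occur together."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

def is_phrase(count_a, count_b, count_ab, vocab_size, min_count=50, threshold=10):
    """Accept the bigram if it is frequent enough and scores above threshold."""
    if count_ab < min_count:
        return False
    score = collocation_score(count_a, count_b, count_ab, vocab_size, min_count)
    return score > threshold

# hypothetical counts: two rare words that almost always appear together
rare_pair = is_phrase(count_a=700, count_b=650, count_ab=620, vocab_size=100000)

# hypothetical counts: two very common words that co-occur often by chance
common_pair = is_phrase(count_a=500000, count_b=800000, count_ab=90000,
                        vocab_size=100000)

print(rare_pair, common_pair)
```

Stacking the models, as in `trigram[bigram[stream]]`, reapplies the same test to tokens that have already been merged, which is how three- and four-word phrases emerge.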

In [53]:
from gensim.models.phrases import Phraser
phraser = Phraser(quadgram)

In [46]:
phraser.save('../intermediate/phraser_50k.pkl')

Some phrases are near-duplicates of each other (for example, the different ways of splitting "keystone xl pipeline" below), but we'll address them later.

In [41]:
[(k, v) for k, v in phraser.phrasegrams.items() 
 if 'xl' in k[0] or 'xl' in k[1] or
    'keystone' in k[0] or 'keystone' in k[1]]

[(('keystone_xl', 'pipeline'), (124, 5677.905921841285)),
 (('the_keystone_xl', 'pipeline'), (98, 12318.886287089013)),
 (('the_keystone', 'xl_pipeline'), (98, 1159369.9018867926)),
 (('keystone', 'xl_pipeline'), (124, 657848.4888888889)),
 (('keystone', 'pipeline'), (82, 1653.9245478036175)),
 (('keystone', 'xl'), (97, 90831.0193236715)),
 (('the_keystone', 'pipeline'), (71, 2948.978674857394))]
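
One possible way to spot these near-duplicates is to flatten each key back into its individual words and group keys that spell the same phrase. This is only a sketch of the idea, run on the seven entries printed above; `flatten` is a hypothetical helper.

```python
from collections import defaultdict

# the phrasegram entries printed above
phrasegrams = {
    ('keystone_xl', 'pipeline'): (124, 5677.905921841285),
    ('the_keystone_xl', 'pipeline'): (98, 12318.886287089013),
    ('the_keystone', 'xl_pipeline'): (98, 1159369.9018867926),
    ('keystone', 'xl_pipeline'): (124, 657848.4888888889),
    ('keystone', 'pipeline'): (82, 1653.9245478036175),
    ('keystone', 'xl'): (97, 90831.0193236715),
    ('the_keystone', 'pipeline'): (71, 2948.978674857394),
}

def flatten(key):
    """('keystone', 'xl_pipeline') -> ('keystone', 'xl', 'pipeline')"""
    words = []
    for part in key:
        words.extend(part.split('_'))
    return tuple(words)

# group keys that describe the same sequence of words
groups = defaultdict(list)
for key, (count, score) in phrasegrams.items():
    groups[flatten(key)].append((key, count))

for words, members in groups.items():
    print(' '.join(words), '<-', len(members), 'key(s)')
```

Here the seven keys collapse into five distinct phrases, and the two keys in each duplicated group carry the same count.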

In [63]:
from nltk.corpus import stopwords

def clean_phrases(phraser):
    stop = stopwords.words('english')
    ngrams = []
    for bigram, score in phraser.phrasegrams.items():
        ngram = bigram[0].split('_') + bigram[1].split('_')
        while ngram and ngram[0] in stop:
            ngram.pop(0)
        # strip trailing stopwords (pop from the end, not the front)
        while ngram and ngram[-1] in stop:
            ngram.pop()
        if len(ngram) > 1:
            ngrams.append(tuple([ngram, score]))
            
    ### TODO: remove overly common
    ### see why some of the punctuation cleaning doesn't work (split on apostrophe, some | getting thru)
    ### remove low tdf-idf by source -- the 'subscribe to our newsletter' thing
            
    # each ngram is a list, which is unhashable, so return the list rather than a set
    return ngrams
        
ngrams = clean_phrases(phraser)
ngrams
    



[(['unarmed', 'black'], (127, 69.83462478550975)),
 (['right', 'direction'], (196, 35.64362041671701)),
 (['chapter', '11'], (61, 29.4090535061932)),
 (['abc', 'news', 'abc'], (59, 93.57518185096505)),
 (['iowa', 'caucus'], (133, 1836.8742747685137)),
 (['local', 'hospital'], (116, 19.872421290690745)),
 (['trans-pacific', 'partnership'], (271, 160107.7586870402)),
 (['next', 'week'], (721, 147.48908985139352)),
 (['school', 'board'], (116, 59.8830263193333)),
 (['human', 'rights'], (1774, 511.8287428184885)),
 (['blue', 'collar'], (62, 508.1524293426485)),
 (['medical', 'progress'], (550, 17452.931231935432)),
 (['reality', 'show'], (58, 16.372663149480417)),
 (['justice', 'anthony', 'kennedy'], (99, 19217.751960784313)),
 (['elena', 'kagan'], (57, 3665.9152348224507)),
 (['weather', 'patterns'], (69, 280.49494655946273)),
 (['$', '45'], (65, 10.037626599025911)),
 (['earnings', 'report'], (77, 11.299512584362096)),
 (['las', 'vegas'], (620, 259801.47831659904)),
 (['judge', 'curiel']

In [127]:
docs = SentenceStream(sql_url, "").sql_engine.execute("SELECT * FROM articles WHERE lower(article_text) like '%%mexica%%rapist%%'").fetchall()

1893

In [113]:
docs = SentenceStream(sql_url, "").sql_engine.execute("SELECT * FROM articles WHERE article_text like '%%selling%%aborted%%baby%%parts%%'").fetchall()

In [115]:
for d in docs:
    print(d.base_url, d.date, d.title, len(d.article_text), d.url)

teaparty.org 2016-10-18T22:24:03+0000 Michelle Obama Recruited to Run for Senate, Chicago Mayor 2423 http://www.teaparty.org/michelle-obama-recruited-run-senate-chicago-mayor-193024/?utm_source=rss&utm_medium=rss&utm_campaign=michelle-obama-recruited-run-senate-chicago-mayor
teaparty.org 2016-10-18T19:14:01+0000 Americans In Philippines Jittery As Duterte Rails Against United States 4787 http://www.teaparty.org/americans-philippines-jittery-duterte-rails-united-states-192960/?utm_source=rss&utm_medium=rss&utm_campaign=americans-philippines-jittery-duterte-rails-united-states
teaparty.org 2016-10-18T11:35:01+0000 1% Russians Approve of Obama, Global Disapproval at 8 Year High 1721 http://www.teaparty.org/1-russians-approve-obama-global-disapproval-8-year-high-192880/?utm_source=rss&utm_medium=rss&utm_campaign=1-russians-approve-obama-global-disapproval-8-year-high
teaparty.org 2016-10-18T11:34:01+0000 Obama Adopts a Grand Design to Shape His Legacy 8265 http://www.teaparty.org/obama-ado

In [None]:
# TODO:
# - make america great again, mexican rapists: has come to mean something very different
# - remove repeated articles and certain 'bad' articles, like teaparty's treatise at the bottom of their videos

In [None]:
# get set of all page urls
sql_url = 'postgres://postgres:postgres@localhost/articles'
url_query = """
            SELECT base_url, count(*) as num_posts
            FROM articles
            GROUP BY base_url
            """
results = create_engine(sql_url).execute(url_query).fetchall()
base_urls = [row.base_url for row in results]

In [None]:
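from sklearn.feature_extraction.text import CountVectorizer

def generate_source_tdm(base_url, sqldb):
    """Create a document-term matrix for a given source"""
    # base_url comes from our own database (see base_urls above), so plain
    # string formatting is tolerable here; don't do this with untrusted input
    query = """
            SELECT article_text FROM articles
            WHERE base_url = '{}' AND num_words > 200
            """.format(base_url)
    stream = QueryStream(sqldb, query)
    
    # only keep n-grams that appear in at least 10% of articles
    vectorizer = CountVectorizer(ngram_range=(1, 6), min_df=0.10)
    tdm = vectorizer.fit_transform(stream)
    
    # TODO: save the matrix and vocabulary to disk
    
    return vectorizer, tdm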
    

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# word count:
#array_length(regexp_split_to_array(article_text, E'\\W+'), 1) > 200
query = """
        SELECT article_text FROM articles
        WHERE base_url like '%%breitbart%%' and
              array_length(regexp_split_to_array(trim(article_text), E'\\\\W+'), 1) > 200
        LIMIT 5
        """
stream = QueryStream(sqldb = 'postgres://postgres:postgres@localhost/articles',
                     query=query)
vectorizer = CountVectorizer(ngram_range=(1, 6), min_df=0.10)
tdm = vectorizer.fit_transform(stream)

In [8]:
tdm.shape

(5, 12046)

In [18]:
vectorizer = CountVectorizer(min_df=0.10)
x = vectorizer.fit_transform(['my friend is a joe', 'my joe is a friend', 'i am not a crook'])
vectorizer.vocabulary_

{'am': 0, 'crook': 1, 'friend': 2, 'is': 3, 'joe': 4, 'my': 5, 'not': 6}
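
Every word survives here because a fractional `min_df` is a proportion of documents: with three documents, `min_df=0.10` means "appears in at least 0.3 documents", which any term satisfies. The same goes for the five-article Breitbart sample above, where the cutoff is half a document. The sketch below reproduces that document-frequency filter in plain Python. It assumes the vectorizer's default tokenization (lowercased tokens of two or more word characters), and `filter_by_min_df` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

def document_frequency(docs):
    """Count how many documents each token appears in."""
    df = Counter()
    for doc in docs:
        # tokens of two or more word characters, lowercased
        df.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return df

def filter_by_min_df(docs, min_df):
    """Keep tokens whose document frequency is at least min_df * len(docs)."""
    df = document_frequency(docs)
    min_docs = min_df * len(docs)
    return sorted(term for term, n in df.items() if n >= min_docs)

docs = ['my friend is a joe', 'my joe is a friend', 'i am not a crook']
kept_10 = filter_by_min_df(docs, 0.10)
kept_50 = filter_by_min_df(docs, 0.50)
print(kept_10)
print(kept_50)
```

The filter only starts to bite once the corpus is large enough that 10% of it is more than one document.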

In [20]:
y = vectorizer.fit_transform(['justin is a boy', 'justin is a girl'])

ValueError: inconsistent shapes

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
transformer.fit_transform(X)

#TODO:
    #DO SIMPLE SUMMARY STATS ON RETRIEVED DOCS (word lengths, etc)
    #TREAT EACH SOURCE AS A DOCUMENT
    #RUN TF-IDF TO GET EACH SOURCE'S MOST UNIQUE N-GRAMS
    #PLOT N-GRAMS on alignment axis, perhaps indicating which n-grams come from where
    

<100x363874 sparse matrix of type '<class 'numpy.float64'>'
	with 413662 stored elements in Compressed Sparse Row format>
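
To make the TODO above concrete, here is a small sketch of the "treat each source as a document" idea in plain Python. It uses the smoothed IDF that scikit-learn documents as the `TfidfTransformer` default, `ln((1 + n) / (1 + df)) + 1`, and skips the row normalization. The source names, the n-grams, and the `tfidf_by_source` helper are all invented for illustration.

```python
import math
from collections import Counter

def tfidf_by_source(source_tokens):
    """Score each n-gram per source, treating every source as one document."""
    n_sources = len(source_tokens)
    # document frequency: in how many sources does each n-gram appear?
    df = Counter()
    for tokens in source_tokens.values():
        df.update(set(tokens))
    scores = {}
    for source, tokens in source_tokens.items():
        tf = Counter(tokens)
        scores[source] = {
            term: count * (math.log((1 + n_sources) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return scores

# toy data: each source is a flat list of its n-grams (invented)
toy_sources = {
    'source_a': ['human_rights', 'next_week', 'keystone_xl', 'keystone_xl'],
    'source_b': ['human_rights', 'next_week', 'school_board'],
}

scores = tfidf_by_source(toy_sources)
top_a = max(scores['source_a'], key=scores['source_a'].get)
top_b = max(scores['source_b'], key=scores['source_b'].get)
print(top_a, top_b)
```

N-grams shared by every source get the minimum weight, so the ones that float to the top are those a source uses heavily and the others do not.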