### A first look at the articles using n-grams
In my previous post, we collected roughly 500,000 articles from 80 left- and right-aligned online news sources, going back to July 2015. Here, we'll start to dive into the data.

The first step is to identify the most common n-grams for each source. An n-gram is an ordered sequence of n consecutive words taken from a longer sequence. For example, let's take a look at an iconic phrase from the campaign:

In [1]:
sentence = "Nobody has more respect for women than me"
def gen_n_gram(sentence, n):
    words = sentence.split()
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]

print(gen_n_gram(sentence, 1))
print(gen_n_gram(sentence, 2))

[('Nobody',), ('has',), ('more',), ('respect',), ('for',), ('women',), ('than',), ('me',)]
[('Nobody', 'has'), ('has', 'more'), ('more', 'respect'), ('respect', 'for'), ('for', 'women'), ('women', 'than'), ('than', 'me')]
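
To get from generating n-grams to finding the most common ones, we can tally them across documents. The following is a minimal sketch using only the standard library. The `most_common_n_grams` helper and the `toy_docs` list are made up for illustration and are not part of the article database.

```python
from collections import Counter

def gen_n_gram(sentence, n):
    """Return all n-grams (as tuples) from a whitespace-tokenized sentence."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def most_common_n_grams(documents, n, top=3):
    """Count n-grams over all documents and return the `top` most frequent.

    Hypothetical helper: lowercases each document and splits on whitespace,
    which is much cruder than scikit-learn's tokenizer.
    """
    counts = Counter()
    for doc in documents:
        counts.update(gen_n_gram(doc.lower(), n))
    return counts.most_common(top)

# Toy documents invented for this example
toy_docs = [
    "nobody has more respect for women",
    "nobody has more respect for the law",
    "more respect for everyone",
]

top_bigrams = most_common_n_grams(toy_docs, 2)
print(top_bigrams)
```

Keeping every count in a single in-memory `Counter` is fine for a toy example, but probably not for 500,000 articles with n up to 6, which is one reason to lean on scikit-learn's sparse matrices instead.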


We'll rely on [scikit-learn's]() built-in functionality to generate n-grams, process the text, and place the data in a [term-document matrix](). To do this, we can write a generator that returns all articles from a query. We'll need similar functionality later, so I've made it a class that we can subclass.

In [2]:
#stream.py
from sqlalchemy import create_engine
from sklearn.feature_extraction.text import CountVectorizer

class QueryStream(object):
    """ 
    Stream documents from the articles database
    Can be subclassed to stream words or sentences from each document
    """
    def __init__(self, sqldb, query=None, textcol='article_text', chunksize=1000, **kwargs):

        self.sql_engine = create_engine(sqldb)
        self.query = query
        self.chunksize = chunksize
        self.analyze = CountVectorizer(**kwargs).build_analyzer()
        self.textcol = textcol

    def __iter__(self):
        """ Iterate through each row in the query """
        query_results = self.sql_engine.execute(self.query)
        result_set = query_results.fetchmany(self.chunksize)
        while result_set:
            for row in result_set:
                yield getattr(row, self.textcol)
            result_set = query_results.fetchmany(self.chunksize)
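
To see the chunked `fetchmany` pattern in isolation, here is a small sketch that swaps Postgres and SQLAlchemy for an in-memory `sqlite3` database. The `SqliteStream` class, the table contents, and the tiny `chunksize` are assumptions made for the example. Only the loop structure mirrors `QueryStream`.

```python
import sqlite3

class SqliteStream:
    """Hypothetical stand-in for QueryStream, backed by sqlite3."""

    def __init__(self, connection, query, textcol="article_text", chunksize=2):
        self.connection = connection
        self.query = query
        self.textcol = textcol
        self.chunksize = chunksize

    def __iter__(self):
        # Re-run the query on every iteration, pulling rows in small chunks
        cursor = self.connection.execute(self.query)
        result_set = cursor.fetchmany(self.chunksize)
        while result_set:
            for row in result_set:
                yield row[self.textcol]
            result_set = cursor.fetchmany(self.chunksize)

# Build a toy table (invented data) in memory
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us look up columns by name
conn.execute("CREATE TABLE articles (post_id INTEGER, article_text TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(i, "toy article %d" % i) for i in range(5)],
)

stream = SqliteStream(conn, "SELECT post_id, article_text FROM articles ORDER BY post_id")
texts = list(stream)
texts_again = list(stream)  # a second pass works because __iter__ re-queries
print(texts)
```

Because `__iter__` issues the query each time, the object can be iterated more than once, unlike a plain generator, which is exhausted after a single pass.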

In [None]:
sql_url = 'postgres://postgres:postgres@localhost/articles'
query = """
        SELECT post_id, article_text FROM articles
        WHERE num_words > 200
        ORDER BY post_id
        """
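# This overrides the query above. The older table uses different column
# names, so alias them to the names QueryStream expects.
query = """
        SELECT id AS post_id, text AS article_text FROM articles_old
        WHERE length(text) > 1000
        ORDER BY id
        """

id_stream = QueryStream(sql_url, query, textcol='post_id')
post_ids = list(id_stream)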

text_stream = QueryStream(sql_url, query)
vectorizer = CountVectorizer(ngram_range=(1, 6), min_df=0.10)
tdm = vectorizer.fit_transform(text_stream)

In [None]:
# get set of all page urls
sql_url = 'postgres://postgres:postgres@localhost/articles'
url_query = """
            SELECT base_url, count(*) as num_posts
            FROM articles
            GROUP BY base_url
            """
results = create_engine(sql_url).execute(url_query).fetchall()
base_urls = [row.base_url for row in results]

In [None]:
def generate_source_tdm(base_url, sqldb):
    """Create a document-term matrix for a given source (saving is still TODO)"""
    query = """
            SELECT article_text FROM articles
            WHERE base_url = '{}' AND word_count > 200
            """.format(base_url)
    stream = QueryStream(sqldb, query)

    # only keep n-grams that appear in at least 10% of articles
    vectorizer = CountVectorizer(ngram_range=(1, 6), min_df=0.10)
    tdm = vectorizer.fit_transform(stream)

    # TODO: save the matrix and vocabulary to disk
    return vectorizer, tdm

In [7]:
# word count:
#array_length(regexp_split_to_array(article_text, E'\\W+'), 1) > 200
query = """
        SELECT article_text FROM articles
        WHERE base_url like '%%breitbart%%' and
              array_length(regexp_split_to_array(trim(article_text), E'\\\\W+'), 1) > 200
        LIMIT 5
        """
stream = QueryStream(sqldb = 'postgres://postgres:postgres@localhost/articles',
                     query=query)
vectorizer = CountVectorizer(ngram_range=(1, 6), min_df=0.10)
tdm = vectorizer.fit_transform(stream)

In [8]:
tdm.shape

(5, 12046)

In [18]:
vectorizer = CountVectorizer(min_df=0.10)
x = vectorizer.fit_transform(['my friend is a joe', 'my joe is a friend', 'i am not a crook'])
vectorizer.vocabulary_

{'am': 0, 'crook': 1, 'friend': 2, 'is': 3, 'joe': 4, 'my': 5, 'not': 6}
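
It may help to spell out what `min_df=0.10` does here. When `min_df` is a float, scikit-learn treats it as a proportion of documents, and a term is kept if its document frequency is at least `min_df * n_docs`. The sketch below imitates that rule in plain Python. The function names are hypothetical, and the tokenization is only an approximation that reuses `CountVectorizer`'s default token pattern (which drops single-character tokens such as "a" and "i").

```python
import re
from collections import Counter

# Same regex as CountVectorizer's default token_pattern
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def document_frequencies(documents):
    """Count, for each token, the number of documents it appears in."""
    df = Counter()
    for doc in documents:
        # A set ensures each token is counted once per document
        df.update(set(re.findall(TOKEN_PATTERN, doc.lower())))
    return df

def filter_by_min_df(documents, min_df):
    """Keep tokens whose document frequency is at least min_df * n_docs."""
    threshold = min_df * len(documents)
    df = document_frequencies(documents)
    return sorted(term for term, count in df.items() if count >= threshold)

docs = ['my friend is a joe', 'my joe is a friend', 'i am not a crook']
kept_10 = filter_by_min_df(docs, 0.10)  # threshold 0.3 documents: nothing is dropped
kept_50 = filter_by_min_df(docs, 0.50)  # threshold 1.5 documents: needs 2+ documents
print(kept_10)
print(kept_50)
```

With only three documents, a 10% threshold filters nothing, which is why every token shows up in the vocabulary above. The cutoff only starts to matter on a corpus with many documents.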

In [20]:
y = vectorizer.fit_transform(['justin is a boy', 'justin is a girl'])

ValueError: inconsistent shapes

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
transformer.fit_transform(X)

#TODO:
    #DO SIMPLE SUMMARY STATS ON RETRIEVED DOCS (word lengths, etc)
    #TREAT EACH SOURCE AS A DOCUMENT
    #RUN TF-IDF TO GET EACH SOURCE'S MOST UNIQUE N-GRAMS
    #PLOT N-GRAMS on alignment axis, perhaps indicating which n-grams come from where
    

<100x363874 sparse matrix of type '<class 'numpy.float64'>'
	with 413662 stored elements in Compressed Sparse Row format>
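
As a rough preview of the "treat each source as a document" idea in the TODO list, here is a pure-Python sketch of TF-IDF weighting. It follows the smoothed formula that `TfidfTransformer` uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The `tfidf` function, the source names, and the texts are all invented for illustration, and the whitespace tokenization is a simplification.

```python
import math
from collections import Counter

def tfidf(source_texts):
    """Return {source: {term: weight}} using smoothed idf and L2 normalization."""
    counts = {src: Counter(text.lower().split()) for src, text in source_texts.items()}
    n_docs = len(counts)

    # Document frequency: number of sources in which each term appears
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    scores = {}
    for src, term_counts in counts.items():
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, tf in term_counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        scores[src] = {term: w / norm for term, w in weights.items()}
    return scores

# Toy "sources", each collapsed into a single document (invented data)
toy_sources = {
    "source_a": "tax cuts tax cuts jobs",
    "source_b": "health care jobs health care",
}

scores = tfidf(toy_sources)
for src, weights in scores.items():
    print(src, sorted(weights, key=weights.get, reverse=True))
```

Terms shared by both sources, like "jobs", get a lower weight than terms unique to one source, which is the property we would use to surface each source's most distinctive n-grams.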