## Text Processing
### [In progress]
In a previous post, we collected roughly 500,000 articles from 80 left- and right-aligned online news sources, going back to July 2015. Here, we'll start to clean and process the text data to enable future analyses.

This is an iterative problem: I'll likely revise this process to account for noise that makes it through.

We'll rely on a combination of [nltk](http://www.nltk.org/), [gensim](https://radimrehurek.com/gensim/index.html), and [scikit-learn](http://scikit-learn.org/stable/index.html) to tokenize our documents, generate n-grams, do some POS tagging, and place the data in a term-document matrix.

### Streaming tokenized text

To start, we'll create a generator that returns all articles from a query. We'll need similar functionality later, so I've made it a class that we can subclass to stream, for example, sentences for use in Word2Vec models.

In [7]:
#stream.py
from sqlalchemy import create_engine

class QueryStream(object):
    """ 
    Stream documents from the articles database
    Can be subclassed to stream words or sentences from each document
    """
    def __init__(self, sqldb, query=None, idcol='post_id', 
                 textcol='article_text', chunksize=1000):

        self.sql_engine = create_engine(sqldb)
        self.query = query
        self.chunksize = chunksize
        self.textcol = textcol
        self.idcol = idcol

    def __iter__(self):
        """ Iterate through each row in the query """
        query_results = self.sql_engine.execute(self.query)
        result_set = query_results.fetchmany(self.chunksize)
        while result_set:
            for row in result_set:
                yield row
            result_set = query_results.fetchmany(self.chunksize)
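
The chunked-fetch pattern above can be tried without a Postgres instance. Here is a minimal sketch that swaps in the standard library's `sqlite3` for SQLAlchemy; the in-memory table, the sample rows, and the `stream_rows` helper are assumptions made for illustration, and the code is written for Python 3.

```python
import sqlite3

def stream_rows(conn, query, chunksize=2):
    """Yield rows one at a time while fetching them in chunks."""
    cursor = conn.execute(query)
    chunk = cursor.fetchmany(chunksize)
    while chunk:
        for row in chunk:
            yield row
        chunk = cursor.fetchmany(chunksize)

# hypothetical stand-in for the articles table
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE articles (post_id TEXT, article_text TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(str(i), 'text %d' % i) for i in range(5)])

rows = list(stream_rows(
    conn, "SELECT post_id, article_text FROM articles ORDER BY post_id"))
print(rows)
```

Only `chunksize` rows are held in memory per fetch, which is the point of streaming rather than calling `fetchall()`.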

We want to create a pipeline for tokenizing documents that will split a document into sentences, then those sentences into words. Here, we'll use nltk for tokenization. 

In [8]:
from nltk import sent_tokenize, word_tokenize
from nltk.tokenize import MWETokenizer

import regex as re

class SentenceStream(QueryStream):
    """
    Stream tokenized sentences from a query
    """
    def __init__(self, ngrams=None, *args, **kwargs):
        super(SentenceStream, self).__init__(*args, **kwargs)
        # this will allow us to treat pre-defined n-grams as
        # a single word, e.g., 'hillary_clinton', once
        # we've identified them
        self.mwe = MWETokenizer(ngrams)
        
    def __iter__(self):
        rows = super(SentenceStream, self).__iter__()
        # remove all punctuation, except hyphens
        punct = re.compile(r"[^A-Za-z0-9\-]")
        
        for doc in rows:
            id = getattr(doc, self.idcol)
            text = getattr(doc, self.textcol)
            
            for sentence in sent_tokenize(text):
                split_sentence = [punct.sub('', word).lower()
                                  for word in word_tokenize(sentence)]
                yield id, self.mwe.tokenize([word for word in split_sentence 
                                             if word.replace('-', '')])
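
To see what the cleaning and n-gram merging steps do without nltk or a database, here is a rough pure-Python sketch for Python 3. `clean_tokens` mirrors the regex cleaning above; `merge_ngrams` is a simplified stand-in for `MWETokenizer` (greedy longest match), not a reimplementation of it, and the sample inputs are made up.

```python
import re

PUNCT = re.compile(r"[^A-Za-z0-9\-]")

def clean_tokens(words):
    """Lowercase, strip everything but letters, digits and hyphens, drop empties."""
    cleaned = (PUNCT.sub('', w).lower() for w in words)
    return [w for w in cleaned if w.replace('-', '')]

def merge_ngrams(tokens, ngrams, sep='_'):
    """Join known n-grams into single tokens, preferring the longest match."""
    ngrams = set(tuple(n) for n in ngrams)
    max_len = max((len(n) for n in ngrams), default=0)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in ngrams:
                out.append(sep.join(tokens[i:i + n]))
                i += n
                break
        else:
            # no n-gram starts here, keep the single token
            out.append(tokens[i])
            i += 1
    return out

cleaned = clean_tokens("Hillary Clinton's e-mail server , again !".split())
merged = merge_ngrams(['hillary', 'clinton', 'spoke'], [('hillary', 'clinton')])
print(cleaned)
print(merged)
```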

### Identifying Collocations

We'll use the article text we've scraped to identify n-grams like 'donald trump'. Should we deem it necessary, we can later identify synonyms like 'senator sanders' and 'bernie sanders' using Word2Vec.

In [10]:
from gensim.models.phrases import Phrases
from gensim.models.phrases import Phraser
from itertools import imap 

sql_url = 'postgres://postgres:**PASSWORD**@localhost/articles'

full_query = """
             SELECT post_id, article_text
             FROM articles
             WHERE num_words > 100
             """

# Since I'm running this on a Google Compute instance, I can afford
# to load everything in memory as a list. While this isn't strictly necessary,
# I can now avoid pulling from the database multiple times
stream = list(SentenceStream(sqldb=sql_url, query=full_query))

To illustrate, here's the sentence at index 9 of the stream (the tenth sentence), split into words.

In [130]:
stream[9]

[(u'144317282271701_1045074145529339',
  [u'a', u'mixture', u'of', u'motives', u'is', u'on', u'display'])]

We generate collocations from the sentence stream iteratively: each pass can join tokens that the previous pass already merged. The bigram model contains things like 'marco rubio'; the trigram model might then include 'senator marco rubio'.

In [77]:
phrase_kwargs = {'max_vocab_size': 100000000,
                  'threshold': 10,
                  'min_count': 50}

bigram = Phrases(imap(lambda x: x[1], stream), **phrase_kwargs)
trigram = Phrases(bigram[imap(lambda x: x[1], stream)], **phrase_kwargs)
quadgram = Phrases(trigram[imap(lambda x: x[1], stream)], **phrase_kwargs)

phraser = Phraser(quadgram)
phraser.save('../intermediate/phraser_all.pkl')
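
For intuition about what `threshold` and `min_count` control, here is a small sketch of the scoring rule that, as far as I understand it, gensim's default scorer applies: `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. This is a simplified illustration rather than gensim's code; in particular, `vocab_size` here counts only distinct unigrams, and the toy sentences are invented. Written for Python 3.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs; keep those scoring above the threshold."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: distinct unigrams only

    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / float(unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

sentences = [['marco', 'rubio', 'spoke'],
             ['marco', 'rubio', 'won'],
             ['he', 'spoke']]
scores = score_bigrams(sentences)
print(scores)
```

Pairs that co-occur more often than their individual frequencies would suggest get high scores; raising `min_count` or `threshold` prunes rare or weakly associated pairs.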

### Trimming n-grams
Many of the n-grams we found have stopwords at the beginning or end, for instance 'the supreme court'. We'd like to trim these so that each n-gram is meaningful on its own.

In [80]:
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from collections import defaultdict
import cPickle as pickle

def trim_phrases(phraser):
    """
    Remove stopwords at the start and end of an ngram,
    generate list of unique ngrams in corpus
    """
    stop = stopwords.words('english')
    ngrams = defaultdict(tuple)
    for bigram, score in phraser.phrasegrams.items():
        ngram = bigram[0].split('_') + bigram[1].split('_')
        
        idx = [i for i, v in enumerate(ngram) if v not in stop]  
        ngram = ngram[idx[0]:idx[-1] + 1] if idx else []
        
        
        if len(ngram) > 1:
            ngrams[tuple(ngram)] = score
                        
    return ngrams
  
ngrams = trim_phrases(phraser)

with open('../intermediate/phrasegrams_all.pkl', 'wb') as o:
    pickle.dump(ngrams, o)
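
The trimming step can be checked in isolation. The sketch below applies the same index-based trim to single n-grams, using a tiny hypothetical stoplist in place of nltk's English stopwords so that it runs without any corpus downloads.

```python
# hypothetical miniature stoplist; the post uses nltk's English stopwords
STOP = {'the', 'of', 'a', 'to', 'on'}

def trim_ngram(ngram, stop=STOP):
    """Drop stopwords from both ends of an n-gram, keeping interior ones."""
    idx = [i for i, w in enumerate(ngram) if w not in stop]
    return tuple(ngram[idx[0]:idx[-1] + 1]) if idx else ()

print(trim_ngram(('the', 'supreme', 'court')))
print(trim_ngram(('secretary', 'of', 'state')))
print(trim_ngram(('of', 'the')))
```

Interior stopwords survive, so 'secretary of state' stays whole, while an n-gram made up entirely of stopwords collapses to an empty tuple and is discarded by the length check.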

### POS Tagging
Of the n-grams collected above, we're largely interested in noun phrases like 'fetal_organs' (appearing 56 times in the sample, uh, ok) rather than adjective or verb phrases like 'radical_islamic'. To subset to only noun phrases, we'll use [Part-of-Speech tagging](http://www.nltk.org/book/ch05.html) to see how these n-grams are employed in the data.

In [None]:
from nltk import pos_tag

class POSSentenceStream(SentenceStream):
    """
    Assign parts of speech to n-grams
    """
    def __iter__(self):
        sentences = super(POSSentenceStream, self).__iter__()
        for id, sentence in sentences:
            for word, pos in pos_tag(sentence):
                if '_' in word:
                    yield word, pos
            
pos_words = POSSentenceStream(sqldb=sql_url, query=full_query, ngrams=ngrams)

To take a look at what we're referring to, see some examples below. 

In [4]:
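from itertools import islice
# pos_words is a stream rather than a list, so take the first ten items with islice
list(islice(pos_words, 10))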

[(u'hasnt_stopped', 'VBD'),
 (u'republican_politicians', 'NNS'),
 (u'radio_show', 'NN'),
 (u'rafael_cruz', 'VBD'),
 (u'ted_cruz', 'NN'),
 (u'barack_obama', 'NN'),
 (u'conversion_therapy', 'NN'),
 (u'reminds_us', 'NN'),
 (u'conversion_therapy', 'NN'),
 (u'sharp_contrast', 'NN')]

We might expect that certain n-grams, like 'presidential_candidate', could be used as either a noun phrase or an adjective phrase. As a result, we'll determine the POS distribution for each phrase, selecting only those that are tagged as nouns over 75% of the time.

In [133]:
from collections import defaultdict
from collections import Counter
import numpy as np

def count_ngram_occurances(pos_stream):
    """
    Tally up the different POS associated with each n-gram
    """
    # a single pass avoids streaming the query results twice
    pos_counts = defaultdict(Counter)

    for word, pos in pos_stream:
        pos_counts[word][pos] += 1

    return pos_counts

def identify_np_ngrams(pos_counts):
    """
    Determine the n-grams that are most often employed as noun phrases
    """     
    np_ngrams = defaultdict(str)
    
    for word, counts in pos_counts.items():
        split_word = tuple(word.split('_'))
                
        word_count = sum(counts.values())        
        noun_count = sum([v for k, v in counts.items() if 'NN' in k])
        
        # is it usually used as a noun?
        if np.true_divide(noun_count, word_count) > 0.75:
            np_ngrams[split_word] = word_count
            
    return np_ngrams
            
pos_counts = count_ngram_occurances(pos_words)  
noun_ngrams = identify_np_ngrams(pos_counts)
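
As a worked example of the 75% rule, here is a small self-contained sketch on hand-made tags. The tagged pairs are invented for illustration, and `noun_share` uses `startswith('NN')`, which for Penn Treebank tags selects the same tags (NN, NNS, NNP, NNPS) as the `'NN' in k` test above.

```python
from collections import Counter, defaultdict

def noun_share(tag_counts):
    """Fraction of occurrences tagged with a noun tag (NN, NNS, NNP, NNPS)."""
    total = sum(tag_counts.values())
    nouns = sum(n for tag, n in tag_counts.items() if tag.startswith('NN'))
    return nouns / float(total) if total else 0.0

# invented (n-gram, tag) pairs standing in for the POS stream
tagged = ([('presidential_candidate', 'NN')] * 4 +
          [('presidential_candidate', 'JJ')] * 1 +
          [('radical_islamic', 'JJ')] * 2)

counts = defaultdict(Counter)
for word, pos in tagged:
    counts[word][pos] += 1

shares = {word: noun_share(c) for word, c in counts.items()}
kept = sorted(word for word, share in shares.items() if share > 0.75)
print(shares)
print(kept)
```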

Already, we're seeing some moderately interesting findings in the data. For instance, *rigged election by hillary clinton* appears 113 times in 57 sources, and *congress must investigate planned parenthood* appears 58 times over 16 sources. These n-grams and their counts are saved to GitHub.

In [160]:
import pandas as pd

# save n-grams appearing at least 25 times in the data
ngram_data = pd.DataFrame.from_records(pos_counts)\
                         .transpose()
    
ngram_data['total'] = ngram_data.sum(axis=1)
ngram_data.loc[ngram_data.total >= 25]\
          .to_csv('../output/ngram_pos_counts.csv', index_label='n-gram')

In [60]:
from itertools import combinations
def find_top_collocations(ngrams):
    """
    For overlapping n-grams (e.g. 'sen hillary clinton' and 'hillary clinton'),
    return only the most common
    """
    
    ngrams_string = {'_'.join(k): v[0] for k, v in ngrams.items()}
    combin_ngrams = combinations(ngrams_string, r=2)
    remove = []
    
    for n1, n2 in combin_ngrams:
        if n1 in n2 or n2 in n1:
            remove.append(min([n1, n2], key=lambda x: ngrams_string[x]))
                
    return {k: v for k, v in ngrams.items() if '_'.join(k) not in remove}

top_ngrams = find_top_collocations(ngrams)
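
One caveat with comparing joined strings is that a substring test can match across word boundaries ('art_show' is contained in 'part_show'). The sketch below is an alternative that compares token tuples instead; it is not the method used above, and `is_sub_ngram`, `keep_most_common`, and the sample counts are hypothetical.

```python
def is_sub_ngram(short, long_):
    """True if `short` appears as a contiguous run of tokens inside `long_`."""
    n = len(short)
    return any(tuple(long_[i:i + n]) == tuple(short)
               for i in range(len(long_) - n + 1))

def keep_most_common(counts):
    """counts: {n-gram tuple: count}. Drop the rarer of each nested pair."""
    drop = set()
    keys = list(counts)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            short, long_ = (a, b) if len(a) <= len(b) else (b, a)
            if is_sub_ngram(short, long_):
                drop.add(min((a, b), key=lambda k: counts[k]))
    return {k: v for k, v in counts.items() if k not in drop}

counts = {('hillary', 'clinton'): 263,
          ('sen', 'hillary', 'clinton'): 40,
          ('art', 'show'): 10,
          ('part', 'show'): 5}
top = keep_most_common(counts)
print(top)
```

Like the original, this compares every pair, so it is quadratic in the number of n-grams and may be slow on the full phrase list.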



In [87]:
ngrams_string = {'_'.join(k): v for k, v in ngrams.items()}
[(n,v) for n,v in ngrams_string.items() if 'hillary_clinton' in n]

[('hillary_clinton_would', (128, 191.59602688589797)),
 ('candidate_hillary_clinton', (237, 14464.714377379476)),
 ('support_hillary_clinton', (67, 87.66493562048167)),
 ('hillary_clinton', (263, 455.9353804630901)),
 ('hillary_clinton_said', (103, 35.061411384055475)),
 ('hillary_clintons_health', (69, 133.97272709547855)),
 ('presidential_candidate_hillary_clinton', (98, 10.88110080057768)),
 ('hillary_clinton_hillaryclinton', (191, 60.34598829351883)),
 ('beat_hillary_clinton', (65, 72.51695042135432)),
 ('hillary_clintons_private_email', (60, 29.628529997296432)),
 ('state_hillary_clinton', (434, 4950.490482097788)),
 ('elect_hillary_clinton', (52, 77.35141378277794)),
 ('democrat_hillary_clinton', (150, 10.104236165825691)),
 ('hillary_clintons_campaign', (181, 408.9326524693404)),
 ('said_hillary_clinton', (52, 13.260242362761932)),
 ('hillary_clinton_campaign', (120, 33.510077611320725)),
 ('hillary_clintons', (99, 218.72281453577784)),
 ('investigation_into_hillary_clinton', (6

A similar problem is the preponderance of phrases like 'follow johnsmith on twitter' or 'source shutterstock'. We'll remove these as well.

### n-grams by Source

Some phrases are meaningless and are disproportionately attached to certain publications. We're going to attempt to identify these by looking at n-grams that are uniquely common for each publication, then strip them from the data.

What will likely be more fruitful is a side-effect of this analysis: we'll attempt to identify n-grams that are nearly-exclusive to the left and the right.

In [6]:
from itertools import groupby, imap
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

query = """
        SELECT base_url, article_text
        FROM articles
        WHERE num_words > 100
        ORDER BY base_url
        """

class SourceNGramStream(SentenceStream):
    """
    Get a stream of pre-identified n-grams from each source
    """
    def __iter__(self):
        rows = super(SourceNGramStream, self).__iter__()
        source_sentences = groupby(rows, lambda x: x[0])
        
        for source, sentences in source_sentences:
            source_ngrams = [word for id, sentence in sentences for word in sentence if '_' in word]
            if source_ngrams:
                yield source, source_ngrams

# dummy tokenizer -- we can't do lambda x: x because it is unpicklable
def no_tokenizer(x):
    return x

# the n-grams we've located will now be identified in the stream of text
src_stream = SourceNGramStream(sqldb=sql_url, query=query,
                               ngrams=ngrams, idcol='base_url')
    
# we're using max_df=1 so that we only keep n-grams that appear in a single source
vectorizer = CountVectorizer(analyzer='word', preprocessor=None,
                             lowercase=False, tokenizer=no_tokenizer, max_df=1)
dtm = vectorizer.fit_transform(imap(lambda x: x[1], src_stream))

By limiting ourselves to phrases that appear in one source and no others, we can handily identify junk phrases like "subscribe to our newsletter". Next, we'll re-run the above query but limit ourselves to phrases that appear on only one side. Alternatively, we could compute tf-idf for each side, perhaps finding that phrases like 'united states' appear more on the right than on the left. In all likelihood, both options are appropriate.
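
The effect of `max_df=1` (an integer, so an absolute document count rather than a proportion) can be illustrated without scikit-learn. In this sketch each source is one "document", and we keep only the n-grams whose document frequency is exactly one; the source names and phrases are made up.

```python
from collections import Counter

def exclusive_ngrams(source_ngrams):
    """Map each source to the n-grams that appear in that source and no other."""
    doc_freq = Counter()
    for grams in source_ngrams.values():
        doc_freq.update(set(grams))  # count each n-gram once per source
    return {src: sorted(set(g for g in grams if doc_freq[g] == 1))
            for src, grams in source_ngrams.items()}

# hypothetical per-source n-gram lists
source_ngrams = {
    'a.example': ['hillary_clinton', 'subscribe_today', 'subscribe_today'],
    'b.example': ['hillary_clinton', 'weekly_podcast'],
}
exclusive = exclusive_ngrams(source_ngrams)
print(exclusive)
```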

In [7]:
with open('../intermediate/vec_source_5gram.pkl', 'wb') as vecf,\
     open('../intermediate/dtm_source_5gram.pkl', 'wb') as dtmf:
    pickle.dump(vectorizer, vecf)
    pickle.dump(dtm, dtmf, pickle.HIGHEST_PROTOCOL)

In [8]:
# we want a parallel list of sources
source_query = "SELECT base_url FROM articles GROUP BY base_url ORDER BY base_url"
sources = [r.base_url for r in create_engine(sql_url).execute(source_query).fetchall()]

In [29]:
features = vectorizer.get_feature_names()
np.random.permutation(features)[:15]

array([u'2015_thehill', u'image_screenshot_via', u'buy_point',
       u'us_based_cleric', u'contact_robert_trigaux',
       u'shares_shares_facebooktwitter_googlepinterestdigglinkedin_reddit_stumbleupon_printdeliciouspockettumblr',
       u'receive_your_first_alert', u'credit_ap_photosusan_walsh',
       u'weekly_standard_podcast', u'newsabortion_catholic_church',
       u'taxes_depreciation', u'bioethics_attorney',
       u'uwgreen_bay_because_students_lean', u'special_lets_slam_congress',
       u'richard_danielson'], 
      dtype='<U103')

Note that we have phrases like "weekly standard podcast" or "follow X on twitter". These should make for a decent stopword list.

We can now remove junk phrases from each publication knowing that we'll keep the real meat of the article. What's more, in the next post, we'll explore the most characteristic phrases from each source.

NOTE: You can also compute tf-idf after summing the counts across left-aligned and right-aligned sources.

In [None]:
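wfidf_dtm = TfidfTransformer(sublinear_tf=True).fit_transform(dtm)
# tf-idf can also be computed without sublinear scaling
# NOTE: bbidx (the row index of the source being inspected) is not defined in this post
bb_dict_sorted_wf = np.argsort(wfidf_dtm[bbidx,].toarray()[0])[::-1]
bb_top_features = [features[i] for i in bb_dict_sorted_wf]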
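
To make the sublinear weighting concrete, here is a pure-Python sketch of what I understand `TfidfTransformer(sublinear_tf=True)` to compute with its other defaults: tf becomes `1 + ln(count)`, idf is the smoothed `ln((1 + n) / (1 + df)) + 1`, and each row is L2-normalized. The counts are invented and the function is an illustration, not scikit-learn's implementation.

```python
import math

def sublinear_tfidf(rows):
    """rows: list of {term: raw count} dicts, one per source."""
    n = len(rows)
    df = {}
    for row in rows:
        for term in row:
            df[term] = df.get(term, 0) + 1

    weighted = []
    for row in rows:
        # sublinear tf times smoothed idf
        w = {t: (1.0 + math.log(c)) * (math.log((1.0 + n) / (1.0 + df[t])) + 1.0)
             for t, c in row.items() if c > 0}
        norm = math.sqrt(sum(v * v for v in w.values()))
        weighted.append({t: v / norm for t, v in w.items()} if norm else w)
    return weighted

rows = [{'united_states': 10, 'weekly_standard_podcast': 9},
        {'united_states': 8}]
weights = sublinear_tfidf(rows)
top = sorted(weights[0], key=weights[0].get, reverse=True)
print(top)
```

The phrase that is exclusive to the first source outranks the more frequent but widely shared one, which is the behavior we want when looking for each source's characteristic phrases.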

In [None]:

query = """
        SELECT base_url, article_text
        FROM articles
        WHERE num_words > 100
        ORDER BY base_url
        """


### Part-of-Speech Tagging
Of the n-grams collected above, we're largely interested in noun-phrases like 'fetal_organs' (appears 56 times in the sample) rather than verbal or adjectival phrases like 'radical-islamic'.

In [141]:
ngrams[('crooked','hillary')]

(212, 726.420180499342)

In [33]:
from nltk import pos_tag

class POSSentenceStream(SentenceStream):
    def __iter__(self):
        sentences = super(POSSentenceStream, self).__iter__()
        for id, sentence in sentences:
            for word, pos in pos_tag(sentence):
                if '_' in word:
                    yield word, pos
            
pos_words = list(POSSentenceStream(sqldb=sql_url, query=query, ngrams=ngrams))

In [34]:
with open('../intermediate/pos_stream_all.pkl', 'wb') as o:
    pickle.dump(pos_words, o)

In [36]:
from collections import defaultdict
from collections import Counter

def count_ngram_pos(pos_stream, ngrams):
    """
    Yield n-grams whose most frequent POS tag is a noun tag
    and whose collocation score exceeds 1000
    """
    pos_counts = defaultdict(Counter)
    for word, pos in pos_stream:
        pos_counts[word][pos] += 1

    for word, counts in pos_counts.items():
        split_word = tuple(word.split('_'))

        has_high_score = ngrams[split_word][1] > 1000

        # most_common(1) returns [(tag, count)] for the most frequent tag;
        # this defines the previously missing `top_pos`
        top_pos = counts.most_common(1)[0]
        is_noun = 'NN' in top_pos[0]

        if is_noun and has_high_score:
            yield split_word

noun_ngrams = list(count_ngram_pos(pos_words, ngrams))

## TESTING GROUNDS

In [148]:
[(k, v) for k, v in ngrams.items() if 'subscribe' in k]

[(('subscribe', 'today'), (227, 9305.383664156087))]

In [None]:
[(k, v) for k, v in phraser.phrasegrams.items() 
 if 'xl' in k[0] or 'xl' in k[1] or
    'keystone' in k[0] or 'keystone' in k[1]]

In [38]:
eng = create_engine(sql_url)
query = """
        SELECT article_text, url, base_url
        FROM articles
        WHERE lower(article_text) like '%%please_try_again_later%%'
        LIMIT 2
        """
eng.execute(query).fetchall()

[(u'Vice presidential candidate Sen. Tim Kaine, D-Va., suggested Wednesday that Donald Trump faked a health condition in the 1960s so as to avoid being drafted and sent off to Vietnam.\n\nThe Virginia senator\'s insinuation came as he continued his week-long campaign to cast doubt on Trump\'s claim he is in good physical shape.\n\n"Look, the one good thing we can say about Trump\'s health is apparently the medical issue that kept him out of the services magically cleared up, because now he\'s the healthiest individual who has ever going to be elected," Kaine, whose son is a United States Marine, said at a campaign stop in Bethlehem, Pa.\n\n"So I guess we got a medical miracle working here," he said to laughs.\n\nStay abreast of the latest developments from nation\'s capital and beyond with curated News Alerts from the Washington Examiner news desk and delivered to your inbox. Sorry, there was a problem processing your email signup. Please try again later. Processing... Thank you for si

In [41]:
import nltk
import pandas as pd
df = pd.DataFrame.from_records([{'ngram': nltk.pos_tag(k), 'count': v[0], 'score': v[1]}
                                for k, v in ngrams.items()])

| | count | ngram | score |
|---|---|---|---|
| 43178 | 2057 | [(supreme, NN), (courts, NNS)] | 2.060704e+03 |
| 42519 | 2049 | [(gop, NN), (candidate, NN)] | 1.471702e+01 |
| 47289 | 2017 | [(federal, JJ), (agencies, NNS)] | 1.025098e+02 |
| 11881 | 1996 | [(donald, NN), (trump, NN)] | 8.610223e+02 |
| 33101 | 1996 | [(2, CD), (percent, NN)] | 3.851224e+01 |
| 38577 | 1988 | [(interesting, JJ), (story, NN)] | 9.989389e+02 |
| 44755 | 1984 | [(affirmative, JJ), (action, NN)] | 1.713739e+03 |
| 27144 | 1978 | [(get, VB), (intelligence, NN)] | 1.034730e+01 |
| 28302 | 1977 | [(jacqueline, NN), (klimas, NN)] | 2.395293e+06 |
| 42337 | 1975 | [(issue, NN), (soon, RB)] | 1.240909e+01 |


In [None]:
df.sort_values(['count', 'score'], ascending = False)[200:299]

In [44]:
nltk.pos_tag(['the','republican','presidential', 'candidate'])

[('the', 'DT'),
 ('republican', 'JJ'),
 ('presidential', 'JJ'),
 ('candidate', 'NN')]

In [51]:
### MAKE SURE `grammar` IS CORRECT
### ONLY COLLECT NOUN PHRASES

import nltk
#Taken from Su Nam Kim Paper...
# http://stackoverflow.com/questions/38194579/extracting-noun-phrases-from-nltk-using-python
posttoks = nltk.tag.pos_tag(['justin','clinton', 'is','a','big','boy', 'who',
                            'studies','campaign','finance','reform', 'on','september','11th'])
grammar = r"""
           NBAR:
               {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
           NP:
               {<NBAR>}
               {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
           """
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(posttoks)

In [13]:
nltk.tag.pos_tag(['selling', 'aborted', 'baby', 'parts'])

[('selling', 'VBG'), ('aborted', 'VBN'), ('baby', 'NN'), ('parts', 'NNS')]

In [3]:
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(subtree)

(NP (NBAR justin/NN clinton/NN))
(NP (NBAR big/JJ boy/NN))
(NP (NBAR studies/NNS campaign/NN finance/NN reform/NN))
(NP (NBAR september/NN))
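
The `NBAR` rule can also be approximated without `RegexpParser`. The sketch below scans pre-tagged tokens for maximal runs of adjectives and nouns and trims each run so that it ends on a noun. It covers only the `NBAR` pattern (not the `<NBAR><IN><NBAR>` rule), and `noun_chunks` is a hypothetical helper rather than part of nltk.

```python
def noun_chunks(tagged):
    """Return runs of JJ/NN* tokens that end in a noun, as tuples of words."""
    chunks, run = [], []
    # a sentinel tag flushes the final run
    for word, tag in list(tagged) + [(None, 'END')]:
        if tag.startswith('NN') or tag == 'JJ':
            run.append((word, tag))
            continue
        # trim trailing adjectives so the chunk is terminated by a noun
        while run and not run[-1][1].startswith('NN'):
            run.pop()
        if run:
            chunks.append(tuple(w for w, _ in run))
        run = []
    return chunks

tagged = [('the', 'DT'), ('republican', 'JJ'), ('presidential', 'JJ'),
          ('candidate', 'NN'), ('is', 'VBZ'), ('big', 'JJ')]
chunks = noun_chunks(tagged)
print(chunks)
```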


In [82]:
[(k, v) for k, v in ngrams.items() if v[1] > 100]

[(('right', 'direction'), (183, 132.2074442511208)),
 (('new', 'c-pop', 'podcast', 'where', 'taylor'), (98, 2854.845853234911)),
 (('best', 'friend'), (67, 260.4415876744121)),
 (('next', 'week'), (642, 131.6409975015382)),
 (('human', 'rights'), (156, 201.4450398890045)),
 (('sen', 'marco', 'rubio', 'r-fla'), (56, 3278.5030417871744)),
 (('taken', 'into', 'custody'), (173, 1476.139290830475)),
 (('wide', 'range'), (172, 2434.9890714363773)),
 (('las', 'vegas'), (605, 615422.6624165149)),
 (('tampabaycom', 'or', '813'), (60, 70242.62820512822)),
 (('criminal', 'justice', 'system'), (197, 12140.518164003617)),
 (('media', 'outlets'), (529, 445.54178533497395)),
 (('police', 'brutality'), (85, 521.2237162551507)),
 (('oval', 'office'), (446, 27032.822078245703)),
 (('car', 'accident'), (57, 120.04505702945998)),
 (('analysts', 'polled'), (59, 3592.737704918033)),
 (('lifenews', 'previously', 'reported'), (72, 2522.050300253175)),
 (('abraham', 'lincoln'), (149, 8576.25106726117)),
 (('co

In [113]:
docs = SentenceStream(sql_url, "").sql_engine.execute("SELECT * FROM articles WHERE article_text like '%%selling%%aborted%%baby%%parts%%'").fetchall()

In [115]:
for d in docs:
    print d.base_url, d.date, d.title, len(d.article_text), d.url

teaparty.org 2016-10-18T22:24:03+0000 Michelle Obama Recruited to Run for Senate, Chicago Mayor 2423 http://www.teaparty.org/michelle-obama-recruited-run-senate-chicago-mayor-193024/?utm_source=rss&utm_medium=rss&utm_campaign=michelle-obama-recruited-run-senate-chicago-mayor
teaparty.org 2016-10-18T19:14:01+0000 Americans In Philippines Jittery As Duterte Rails Against United States 4787 http://www.teaparty.org/americans-philippines-jittery-duterte-rails-united-states-192960/?utm_source=rss&utm_medium=rss&utm_campaign=americans-philippines-jittery-duterte-rails-united-states
teaparty.org 2016-10-18T11:35:01+0000 1% Russians Approve of Obama, Global Disapproval at 8 Year High 1721 http://www.teaparty.org/1-russians-approve-obama-global-disapproval-8-year-high-192880/?utm_source=rss&utm_medium=rss&utm_campaign=1-russians-approve-obama-global-disapproval-8-year-high
teaparty.org 2016-10-18T11:34:01+0000 Obama Adopts a Grand Design to Shape His Legacy 8265 http://www.teaparty.org/obama-ado

In [None]:
# TODO:
# - phrases like 'make america great again' and 'mexican rapists' have come to mean something very different
# - remove repeated articles and certain 'bad' articles, like teaparty's treatise at the bottom of their videos

In [None]:
# get set of all page urls
sql_url = 'postgres://postgres:postgres@localhost/articles'
url_query = """
            SELECT base_url, count(*) as num_posts
            FROM articles
            GROUP BY base_url
            """
results = create_engine(sql_url).execute(url_query).fetchall()
base_urls = [row.base_url for row in results]

In [None]:
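def generate_source_tdm(base_url, sqldb):
    """Create a document-term matrix for a given source (saving is not yet implemented)"""
    # NOTE: base_url is interpolated directly into the query, so only pass trusted values;
    # num_words is the word-count column used by the other queries in this post
    query = """
            SELECT article_text FROM articles
            WHERE num_words > 200 AND base_url = '{}'
            """.format(base_url)
    stream = QueryStream(sqldb, query)

    # only keep n-grams that appear in at least 10% of articles
    vectorizer = CountVectorizer(ngram_range=(1, 6), min_df=0.10)
    # QueryStream yields rows, so pull out the text column for the vectorizer
    tdm = vectorizer.fit_transform(row.article_text for row in stream)

    # TODO: save the vectorizer and matrix

    return vectorizer, tdm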
    

In [None]:
import pandas as pd
qry = """SELECT article_text, base_url FROM articles
         WHERE lower(article_text) like
         '%%congress%%must%%investigate%%planned%%parenthood%%'"""
test = pd.read_sql(qry, create_engine(sql_url))

In [7]:
# word count:
#array_length(regexp_split_to_array(article_text, E'\\W+'), 1) > 200
query = """
        SELECT article_text FROM articles
        WHERE base_url like '%%breitbart%%' and
              array_length(regexp_split_to_array(trim(article_text), E'\\\W+'), 1) > 200
        LIMIT 5
        """
stream = QueryStream(sqldb = 'postgres://postgres:postgres@localhost/articles',
                     query=query)
vectorizer = CountVectorizer(ngram_range=(1, 6), min_df=0.10)
tdm = vectorizer.fit_transform(stream)

In [8]:
tdm.shape

(5, 12046)

In [22]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(analyzer='word', preprocessor=None, lowercase=False, tokenizer=lambda x: x, ngram_range=(1,2))
vec.fit_transform([['my','friend','is','bob'],['eat','my','lunch','puppy'],['my','friend','is','joe']])
vec.vocabulary_

{u'bob': 0,
 u'eat': 1,
 u'eat my': 2,
 u'friend': 3,
 u'friend is': 4,
 u'is': 5,
 u'is bob': 6,
 u'is joe': 7,
 u'joe': 8,
 u'lunch': 9,
 u'lunch puppy': 10,
 u'my': 11,
 u'my friend': 12,
 u'my lunch': 13,
 u'puppy': 14}

In [18]:
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2))
x = vectorizer.fit_transform(['my friend is a joe', 'my joe is a friend',
                              'i am not a crook','my friend is a friend'])
vectorizer.vocabulary_

{u'am': 0,
 u'am not': 1,
 u'crook': 2,
 u'friend': 3,
 u'is': 4,
 u'is friend': 5,
 u'is joe': 6,
 u'joe': 7,
 u'joe is': 8,
 u'my': 9,
 u'my joe': 10,
 u'my_friend': 11,
 u'my_friend is': 12,
 u'not': 13,
 u'not crook': 14}

In [20]:
y = vectorizer.fit_transform(['justin is a boy', 'justin is a girl'])

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
transformer.fit_transform(X)

#TODO:
    #DO SIMPLE SUMMARY STATS ON RETRIEVED DOCS (word lengths, etc)
    #TREAT EACH SOURCE AS A DOCUMENT
    #RUN TF-IDF TO GET EACH SOURCE'S MOST UNIQUE N-GRAMS
    #PLOT N-GRAMS on alignment axis, perhaps indicating which n-grams come from where
    

<100x363874 sparse matrix of type '<class 'numpy.float64'>'
	with 413662 stored elements in Compressed Sparse Row format>

In [None]:
sentence = "Nobody has more respect for women than me"
def gen_n_gram(sentence, n):
    words = sentence.split()
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]

print(gen_n_gram(sentence, 1))
print(gen_n_gram(sentence, 2))

sample_query = """
             SELECT setseed(0.5); 
             SELECT post_id, article_text
             FROM articles
             WHERE num_words > 100
             ORDER BY random()
             LIMIT 50000
             """

In [3]:
from itertools import groupby, imap
import cPickle as pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

from stream import SentenceStream #from the prior post

sql_url = 'postgres://postgres:**PASSWORD**@localhost/articles'

# ordering is a necessity, else groupby won't work
query = """
        SELECT base_url, article_text
        FROM articles
        WHERE num_words > 100
        ORDER BY base_url
        """

class SourceStream(SentenceStream):
    """
    Stream tokens from each source
    """
    def __iter__(self):
        rows = super(SourceStream, self).__iter__()
        source_sentences = groupby(rows, lambda x: x[0])
        
        for source, sentences in source_sentences:
            yield [word for id, sentence in sentences for word in sentence]

with open('../intermediate/phrasegrams_all.pkl', 'rb') as infile:
    ngrams = pickle.load(infile)
    
# the n-grams we've located will now be identified in the stream of text
# using the MWETokenizer from nltk
src_stream = SourceStream(sqldb=sql_url, query=query,
                          ngrams=ngrams.keys(), idcol='base_url')
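
The `ORDER BY base_url` clause matters because `itertools.groupby` only groups consecutive rows that share a key. A quick sketch with made-up rows shows what happens when the input is not sorted:

```python
from itertools import groupby

# hypothetical (source, tokens) rows, deliberately out of order
rows = [('b.example', ['x']), ('a.example', ['y']), ('b.example', ['z'])]

def source_of(row):
    return row[0]

# unsorted input: 'b.example' shows up as two separate groups
unsorted_keys = [key for key, _ in groupby(rows, key=source_of)]

# sorted input: one group per source
sorted_groups = {key: [tok for _, toks in group for tok in toks]
                 for key, group in groupby(sorted(rows, key=source_of), key=source_of)}

print(unsorted_keys)
print(sorted_groups)
```

Sorting in the database keeps the stream lazy, whereas sorting in Python would require loading every row first.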

In [None]:
# dummy tokenizer since nltk is doing the tokenizing in the stream
# we can't do lambda x: x because it is unpicklable
def no_tokenizer(x):
    return x

# in fact, all processing is done, and we just need to place it
# in the appropriate data structure and count out n-grams 
# we'll insist that we only get words that appear across
# at least two sources
vectorizer = CountVectorizer(analyzer='word', preprocessor=None,
                             lowercase=False, tokenizer=no_tokenizer,
                             min_df=2)
dtm_source = vectorizer.fit_transform(src_stream)

In [8]:
with open('../intermediate/vec_source_phrasegram.pkl', 'wb') as vecf,\
     open('../intermediate/dtm_source_phrasegram.pkl', 'wb') as dtmf:
    pickle.dump(vectorizer, vecf)
    pickle.dump(dtm_source, dtmf, pickle.HIGHEST_PROTOCOL)

<84x410248 sparse matrix of type '<type 'numpy.int64'>'
	with 5563489 stored elements in Compressed Sparse Row format>