### Domain 'signatures'

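Attempting to find per-publisher 'signatures': phrases that appear in nearly all of a particular publisher's articles. Because the articles are labeled by publisher (see the `labeled-by` column below), a phrase that identifies the publisher also gives away the label, so models are likely overfitting heavily on these terms.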

In [2]:
import pandas as pd

DATA_PATH = '../data/'
DATA_INTERIM_PATH = DATA_PATH + 'interim/'
train = pd.read_csv(DATA_INTERIM_PATH + 'train_c.csv')
val = pd.read_csv(DATA_INTERIM_PATH + 'val_c.csv')

In [3]:
train.head()

Unnamed: 0,id,published-at,title,article_text,external_links,internal_links,hyperpartisan,bias,url,labeled-by,num_external_links,num_internal_links,domain
0,1,2017-09-10,After DeVos Announced Plans To Reexamine Title...,After DeVos Announced Plans To Reexamine Title...,{'https://www.nbcnews.com/news/us-news/betsy-d...,{},True,right,https://dailywire.com/news/20858/after-devos-a...,publisher,7,0,dailywire
1,2,2017-05-05,University To Award Trayvon Martin With Posthu...,University To Award Trayvon Martin With Posthu...,{'https://www.facebook.com/FLMemorialUniv/': '...,{},True,right,https://dailywire.com/news/16103/university-aw...,publisher,1,0,dailywire
2,8,2017-11-15,Texas State University suspends Greek life aft...,Texas State University suspends Greek life aft...,{'https://www.nbcnews.com/storyline/hazing-in-...,{},False,right-center,https://newsline.com/texas-state-university-su...,publisher,2,0,newsline
3,9,2015-06-10,Red Sox waste Rodriguez outing in 1-0 loss to ...,Red Sox waste Rodriguez outing in 1-0 loss to ...,{},{},False,least,https://apnews.com/795c45f6db904f8eb85d72cee61...,publisher,0,0,apnews
4,10,,Eve and the New Jerusalem,Eve and the New Jerusalem Imaginings of a utop...,{},{},True,left,https://dissentmagazine.org/article/eve-and-th...,publisher,0,0,dissentmagazine


In [4]:
train['domain'].value_counts()

foxbusiness                  116106
abqjournal                    93853
apnews                        75725
pri                           51781
newsline                      47593
counterpunch                  42410
motherjones                   38487
truthdig                      33780
dailywire                     21538
thedailybeast                 19560
poynter                       17276
baptistnews                   13728
nbcnews                       12829
reuters                       10914
consortiumnews                 9593
washingtonblade                8265
calwatchdog                    7611
thegoldwater                   6947
natmonitor                     6556
dissentmagazine                6292
politicalillusionsexposed      5714
chicagoreporter                5316
ivn                            4967
eppc                           4652
mintpressnews                  4494
factcheck                      4485
billmoyers                     4328
circa                       

In [62]:
train[train['domain'] == 'foxbusiness'].reset_index()['article_text'][0]

"2017's 3 Best Index Funds  \n2017 has been a tumultuous year, with plenty going on in the political and financial realms. Yet the stock market has done quite well, and index funds that track popular stock benchmarks have given their investors solid returns. In particular, among the most popular exchange-traded funds, iShares MSCI Emerging Markets (NYSEMKT: EEM), PowerShares QQQ (NASDAQ: QQQ), and iShares Core MSCI EAFE (NYSEMKT: IEFA) have delivered some of the best returns. \nContinue Reading Below \nData source: Fund providers, ETFdb.com. \nAmong U.S.-oriented index funds, the PowerShares QQQ leads the field with strong returns in the upper teens. The ETF tracks the Nasdaq 100 Index Opens a New Window., which is composed of the 100 top nonfinancial companies that trade on the Nasdaq exchange. Although intermingling of stocks across the Nasdaq and New York Stock Exchanges over the past several years has made both indexes more diverse, the Nasdaq still has a preponderance of technolog

In [6]:
# Share of Fox Business articles that contain the phrase
fox = train[train['domain'] == 'foxbusiness']
fox['article_text'].str.contains('Continue Reading Below', regex=False).mean()

1.0

#### Every single Fox Business article in the dataset contains the phrase 'Continue Reading Below'
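
The same check can be run for every publisher at once. Below is a minimal sketch, assuming a frame with the `domain` and `article_text` columns shown earlier. The helper name `phrase_coverage` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd

def phrase_coverage(df, phrase):
    """Fraction of each domain's articles containing `phrase` (literal, case-sensitive)."""
    # na=False treats missing article text as "phrase not present"
    has_phrase = df['article_text'].str.contains(phrase, regex=False, na=False)
    return has_phrase.groupby(df['domain']).mean().sort_values(ascending=False)

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'domain': ['a', 'a', 'b', 'b'],
    'article_text': [
        'x Continue Reading Below y',
        'Continue Reading Below',
        'nothing here',
        'Continue Reading Below',
    ],
})
coverage = phrase_coverage(toy, 'Continue Reading Below')
print(coverage)
```

A domain whose coverage is at or near 1.0 for a phrase is a candidate for having that phrase as a signature.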

In [33]:
df = train[train['domain'] == 'foxbusiness'].reset_index()[0:500]

In [34]:
df.shape

(500, 14)

In [35]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [36]:
vectorizer = CountVectorizer(stop_words='english',binary=True)
X = vectorizer.fit_transform(df['article_text'])

In [37]:
X = X.toarray()

In [38]:
sums = X.sum(axis=0)

In [39]:
sums.max()

500
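
With `binary=True`, each column sum is the number of articles that contain the word rather than the number of times it occurs, so a maximum of 500 means at least one word appears in all 500 sampled articles. A small sketch on made-up documents (the strings below are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'continue reading below stocks rose rose',
    'continue reading below bonds fell',
    'markets rallied',
]
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)

# Summing the sparse matrix directly gives one document frequency per token
doc_freq = np.asarray(X.sum(axis=0)).ravel()
vocab = vec.vocabulary_  # maps token -> column index

# 'rose' occurs twice in one article but is counted once
print(doc_freq[vocab['rose']])
# 'continue' occurs in two of the three articles
print(doc_freq[vocab['continue']])
```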

In [63]:
thresh = 0.8
max_indices = np.argwhere(sums >= thresh*df.shape[0])
print(max_indices)

[[ 1089]
 [ 3816]
 [12185]]


In [55]:
# Look up the vocabulary once; get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()
for index in max_indices:
    print(feature_names[index[0]])

advertisement
company
continue
new
reading
said
year


In [65]:
df[df['article_text'].str.contains('below')].reset_index()['article_text'][2]

'Wall Street Worries How Patient Fed will Remain The Dow and the S&amp;P 500 fell on Tuesday as materials shares declined and nervousness increased ahead of a Federal Reserve statement, while the Nasdaq ended higher. \nStocks cut losses in afternoon trading, led by a rise in tech shares. The S&amp;P 500 technology index ended up 0.1 percent, helped by gains in Apple, up 1.7 percent at $127.04, and Facebook, up 1.7 percent at $79.36. \nContinue Reading Below \nAfter the bell, shares of Oracle rose 3.4 percent to $44.33 following results. \nInvestors were anxious as the Federal Open Market Committee kicked off a two-day policy meeting, to be followed by a statement from Fed Chair Janet Yellen Wednesday afternoon. \nMost economists expect the Fed to remove a pledge to be "patient" about raising interest rates from its statement. Market strategists said with or without a change in the language, the Fed may still be on track to raise rates as early as June. \n"People are waiting for the Fed

In [72]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Extract and inspect words that appear in some (large) percentage of a publisher's articles
def find_domain_signatures(df, domain, thresh, stop_words='english', should_print=True):

    # Get all articles by the provided publisher
    df = df[df['domain'] == domain]

    # Binary counts: each entry records whether a word occurs in an article at all
    vectorizer = CountVectorizer(stop_words=stop_words, binary=True)
    X = vectorizer.fit_transform(df['article_text'])

    # Number of articles containing each word; summing the sparse matrix
    # directly avoids building a dense (articles x vocabulary) array
    sums = np.asarray(X.sum(axis=0)).ravel()

    # Indices of words that occur in at least `thresh` of the articles
    max_indices = np.flatnonzero(sums >= thresh * df.shape[0])

    # Look up the vocabulary once (get_feature_names_out needs scikit-learn >= 1.0)
    feature_names = vectorizer.get_feature_names_out()
    signature_tokens = feature_names[max_indices].tolist()

    # Print and return
    if should_print:
        print(signature_tokens)
    return signature_tokens

In [73]:
import sys
sys.path.append('../src')

%load_ext autoreload
%autoreload 1

from models.EDA import find_domain_signatures
%aimport models.EDA

In [118]:
sig = find_domain_signatures(train, 'foxbusiness', thresh=0.5)

{'advertisement': 0.828, 'company': 0.505, 'continue': 1.0, 'new': 0.684, 'reading': 1.0, 'said': 0.606, 'year': 0.658}
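
The output above maps each token to the share of articles containing it, unlike the list returned by the notebook version of the function. The `models.EDA` source is not shown here, so the following is only a guess at how such a variant might look. The name `find_domain_signature_shares`, the rounding to three decimals, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def find_domain_signature_shares(df, domain, thresh, stop_words='english'):
    """Map each token to the fraction of the domain's articles containing it,
    keeping only tokens whose fraction is at least `thresh`."""
    texts = df.loc[df['domain'] == domain, 'article_text']
    vectorizer = CountVectorizer(stop_words=stop_words, binary=True)
    X = vectorizer.fit_transform(texts)

    # Document frequency divided by the number of articles
    shares = np.asarray(X.sum(axis=0)).ravel() / X.shape[0]

    # vocabulary_ maps token -> column index; sort for a stable, alphabetical dict
    return {
        token: round(float(shares[idx]), 3)
        for token, idx in sorted(vectorizer.vocabulary_.items())
        if shares[idx] >= thresh
    }

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'domain': ['a', 'a', 'b'],
    'article_text': ['continue reading stocks', 'continue reading bonds', 'unrelated text'],
})
result = find_domain_signature_shares(toy, 'a', thresh=0.8)
print(result)
```

Returning the shares alongside the tokens makes it easy to separate near-universal boilerplate (share close to 1.0) from words that are merely common for the publisher.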


In [116]:
train[(train['domain']=='thedailybeast') & (train['article_text'].str.contains('beast'))].reset_index()['url'][0]

'https://thedailybeast.com/the-week-in-wingnuts-mandatory-guns-and-state-religions'