#NEWSWATCH SUPERFILTERS

Motivation: We want to run a text analysis of breaking-news headlines and their associated reports, pulled from newswatch and entered into a spreadsheet by our team. Every article in our database produced a tangible price movement in the associated stock and was available through newswatch immediately upon release. We will look for keywords, phrases, and article tags that recur across these headlines, and use the results to tailor newswatch filters so that market-moving news surfaces as soon as it is released.

Specifically, we will start by investigating headlines in several key areas:
    - Biotech
    - M/A
    - Corporate Activity
    - Enforcement Agency Activity
    - Patent Law
    
We will then further categorize headlines within each of these groups.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from collections import Counter
import codecs

import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords

#nltk.download('stopwords')
#nltk.download("genesis")
#nltk.download('punkt')

Read in the headline spreadsheet, exported from Google Sheets

In [62]:
news_df = pd.read_csv('HeadlineSpreadsheet.csv')

# Substrings to normalize in the lower-cased text, paired by position.
# NOTE: replacements run in list order. 'press release: ' fires before
# 'dj: press release: ', so the longer pattern never matches and the bare 'dj'
# rule then leaves a stray ': ' behind (likely the ': ' prefix on some
# headlines below). 'dj' is also removed wherever it occurs inside a word.
remove_list = ['press release: ', 'dj: press release: ', 'top-line', 'phase 1', 'phase 2', 'phase 3', 'dj', 'u.s.', '*dj']
replace_list = ['', '', 'topline', 'phase1', 'phase2', 'phase3', '', 'us', '']

def normalize(value):
    """Lower-case a cell and apply the replacements in list order."""
    value = str(value).lower()
    for rem, rep in zip(remove_list, replace_list):
        value = value.replace(rem, rep)
    return value

# Tags: comma-separated groups of space-separated codes -> list of lists.
news_df['Tags'] = news_df['Tags'].apply(lambda s: [group.split() for group in s.split(', ')])
# Vendors: comma-separated string -> list.
news_df['Vendors'] = news_df['Vendors'].apply(lambda s: s.split(','))

for col in ['Headline', 'Text', 'Clean Text']:
    news_df[col] = news_df[col].apply(normalize)

news_df.head()

Unnamed: 0,Ticker(s),Date,Headline Type,Headline Sub Type,Vendors,Tags,Headline,Text,Clean Text,Text Timing,Stock Reaction
0,BCRX,2/8/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, BCRX, .NASDAQ, US09058V1035, I/BTC, .B...",biocryst announces results from opus-2,biocryst announces results from opus-2\n rese...,biocryst announces results from opus-2 resea...,,-60%
1,NVLS,11/28/16,Biotech,Clinical,[Dow Jones GI],"[[S.GI, NVLS, .NASDAQ, US65481J1097, I/DRG, .P...",: nivalis therapeutics announces results from ...,nivalis therapeutics announces results from ph...,nivalis therapeutics announces results from ph...,,-50%
2,NOVN,11/29/16,Biotech,Clinical,[Dow Jones GI],"[[S.GI, NOVN, .NASDAQ, US66988N1063, I/BTC, .B...",: novan announces statistically significant ph...,novan announces statistically significant phas...,novan announces statistically significant phas...,,10%
3,LXRX,12/5/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, LXRX, .NASDAQ, US5288723027, I/BTC, .B...",lexicon reports topline results from phase2 cl...,lexicon reports topline results from phase2 cl...,lexicon reports topline results from phase2 cl...,,-20%
4,ATRA,12/14/15,Biotech,Clinical,[Dow Jones],"[[S.DJ, ATRA, .NASDAQ, US0465131078, I/BTC, .B...",atara bio announces results from the phase2 pr...,atara bio announces results from the phase2 pr...,atara bio announces results from the phase2 pr...,,-40%
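
Because `str.replace` is applied in list order, a short pattern such as 'press release: ' can fire before the longer 'dj: press release: ' and leave a fragment behind. The following is a minimal sketch of one way around this, not the cleaning actually used for the tables in this notebook: apply the patterns longest-first and strip a bare 'dj' only when it stands alone as a word. The helper name `normalize_headline` and the sample headline are hypothetical.

```python
import re

# Same patterns as in the cleaning cell, paired with their replacements.
REPLACEMENTS = {
    'press release: ': '',
    'dj: press release: ': '',
    '*dj': '',
    'top-line': 'topline',
    'phase 1': 'phase1',
    'phase 2': 'phase2',
    'phase 3': 'phase3',
    'u.s.': 'us',
}

# Longest pattern first, so 'dj: press release: ' wins over 'press release: '.
ORDERED = sorted(REPLACEMENTS.items(), key=lambda kv: len(kv[0]), reverse=True)

# A bare 'dj' is removed only as a whole word, so 'adjusted' is left intact.
DJ_TOKEN = re.compile(r'\bdj\b:?\s*')

def normalize_headline(headline):
    """Lower-case a headline and apply the replacements longest-first."""
    text = str(headline).lower()
    for pattern, replacement in ORDERED:
        text = text.replace(pattern, replacement)
    text = DJ_TOKEN.sub('', text)
    return text.strip()

# Made-up headline, not taken from the spreadsheet.
print(normalize_headline('DJ: Press Release: Acme Reports Top-Line Phase 2 Data'))
# acme reports topline phase2 data
```

Sorting by length is a blunt rule; an explicit, hand-ordered list would work just as well for a pattern set this small.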


Subset the dataframe and analyze tags, headlines, text

###Most common tags by group

In [63]:
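sub_df = news_df[news_df['Headline Type'] == 'Biotech']
subsub = sub_df[sub_df['Headline Sub Type'] == 'Clinical']

# Collect the index labels of rows carried by a Dow Jones feed.
# vendlist.index('Dow Jones') raises ValueError for rows that list only
# 'Dow Jones GI', so plain membership tests are used instead.
vendset = subsub.Vendors
numlist = []
for num in vendset.index:
    vendlist = vendset.loc[num]
    if 'Dow Jones' in vendlist or 'Dow Jones GI' in vendlist:
        numlist.append(num)

# numlist holds index labels, not positions, so select with .loc rather than .iloc.
subsub.loc[numlist, :]

# Each Tags cell is a list of tag groups and each group is a list of codes;
# flatten both levels so Counter receives hashable strings.
tag_list = []
for groups in subsub.Tags:
    for group in groups:
        tag_list.extend(group)

top_tags = Counter(tag_list)
top_tags.most_common(35)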

ValueError: 'Dow Jones' is not in list

In [356]:
subsub

Unnamed: 0,Ticker(s),Date,Headline Type,Headline Sub Type,Vendor,Headline,Tags,Text,Clean Text,Text Timing,Stock Reaction
0,BCRX,2/8/16,Biotech,Clinical,Dow Jones,biocryst announces results from opus-2,"[S.DJ, BCRX, .NASDAQ, US09058V1035, I/BTC, .BI...",biocryst announces results from opus-2\n rese...,biocryst announces results from opus-2 resea...,,-60%
1,NVLS,11/28/16,Biotech,Clinical,Dow Jones GI,: nivalis therapeutics announces results from ...,"[S.GI, NVLS, .NASDAQ, US65481J1097, I/DRG, .PH...",nivalis therapeutics announces results from ph...,nivalis therapeutics announces results from ph...,,-50%
2,NOVN,11/29/16,Biotech,Clinical,Dow Jones GI,: novan announces statistically significant ph...,"[S.GI, NOVN, .NASDAQ, US66988N1063, I/BTC, .BI...",novan announces statistically significant phas...,novan announces statistically significant phas...,,10%
3,LXRX,12/5/16,Biotech,Clinical,Dow Jones,lexicon reports topline results from phase2 cl...,"[S.DJ, LXRX, .NASDAQ, US5288723027, I/BTC, .BI...",lexicon reports topline results from phase2 cl...,lexicon reports topline results from phase2 cl...,,-20%
4,ATRA,12/14/15,Biotech,Clinical,Dow Jones,atara bio announces results from the phase2 pr...,"[S.DJ, ATRA, .NASDAQ, US0465131078, I/BTC, .BI...",atara bio announces results from the phase2 pr...,atara bio announces results from the phase2 pr...,,-40%
5,ALKS,1/21/16,Biotech,Clinical,Dow Jones,alkermes announces topline results of forward-...,"[S.DJ, ALKS, .NASDAQ, IE00B56GVS15, I/BTC, .BI...",alkermes announces topline results of forward-...,alkermes announces topline results of forward-...,,-40%
6,OCUL,2/16/16,Biotech,Clinical,BusinessWire,ocular therapeutix™ announces phase3 clinical ...,"[S.BW, OCUL, BIOTC.BW, .BIOTECH, CLINT.BW, FDA...","\nfebruary 16, 2016 21:05:00 utc\npivotal phas...","february 16, 2016 21:05:00 utc pivotal phase3...",,33%
7,GWPH,3/14/16,Biotech,Clinical,Dow Jones,gw pharmaceuticals announces positive phase3 p...,"[S.DJ, GWP.LN, GWP-L, GWPH, .NASDAQ, GB0030544...",gw pharmaceuticals announces positive phase3 p...,gw pharmaceuticals announces positive phase3 p...,,110%
8,APRI,3/28/16,Biotech,Clinical,Dow Jones,apricus reports topline phase2b data for fispe...,"[S.DJ, APRI, .NASDAQ, US03832V1098, I/DRG, .PH...",apricus reports topline phase2b data for fispe...,apricus reports topline phase2b data for fispe...,,-50%
9,ALDR,3/28/16,Biotech,Clinical,Dow Jones,alder reports phase2b trial of ald403 meets pr...,"[S.DJ, ALDR, .NASDAQ, US0143391052, I/BTC, .BI...",alder reports phase2b trial of ald403 meets pr...,alder reports phase2b trial of ald403 meets pr...,,45%


##Bigram & Trigram Collocations, Keywords by Group

In [361]:
txt_list = subsub['Clean Text']
heads_list = subsub.Headline

# Write and read the same files, so the finders never pick up a stale copy
# from a different working directory.
head_soup = '/Users/titans_bball30/Documents/Trlm/headsoup.txt'
txt_soup = '/Users/titans_bball30/Documents/Trlm/txtsoup.txt'

with open(txt_soup, 'w') as txts:
    for line in txt_list:
        txts.write("%s\n" % line)

with open(head_soup, 'w') as hds:
    for line in heads_list:
        hds.write("%s\n" % line)

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Find collocations. The genesis corpus reader is used here only as a
# plain-text tokenizer for our own files (needs nltk.download('genesis')).
head_bi_finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(head_soup))
head_tri_finder = TrigramCollocationFinder.from_words(nltk.corpus.genesis.words(head_soup))

text_bi_finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(txt_soup))
text_tri_finder = TrigramCollocationFinder.from_words(nltk.corpus.genesis.words(txt_soup))

# Keep only n-grams that appear at least n times; the threshold is lower
# for headlines (3) than for full article text (5).
head_bi_finder.apply_freq_filter(3)
head_tri_finder.apply_freq_filter(3)

text_bi_finder.apply_freq_filter(5)
text_tri_finder.apply_freq_filter(5)

# Drop n-grams containing a stopword or a token shorter than three characters.
ignored_words = nltk.corpus.stopwords.words('english')
head_bi_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
head_tri_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

text_bi_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
text_tri_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)


In [362]:
# Tokenize the article text; the with-block closes the file when done.
with open(txt_soup, 'r') as fp:
    words = nltk.tokenize.word_tokenize(fp.read())

# Keep lower-cased words longer than three characters that are not stopwords.
words = [word.lower() for word in words if len(word) > 3]
words = [w for w in words if w not in ignored_words]
fdist = FreqDist(words)

In [363]:
fdist

FreqDist({'patients': 754, 'trial': 394, 'treatment': 380, 'clinical': 371, 'study': 359, 'results': 301, 'company': 240, 'data': 231, 'placebo': 211, 'dose': 198, ...})

In [364]:
head_bi_finder.nbest(bigram_measures.pmi, 50)

[('depressive', 'disorder'),
 ('major', 'depressive'),
 ('clinical', 'trial'),
 ('pivotal', 'phase3'),
 ('positive', 'topline'),
 ('reports', 'positive'),
 ('phase2', 'study'),
 ('phase2', 'clinical'),
 ('announces', 'positive'),
 ('topline', 'results'),
 ('therapeutics', 'announces'),
 ('reports', 'topline'),
 ('pharmaceuticals', 'announces'),
 ('phase3', 'study'),
 ('announces', 'topline'),
 ('phase3', 'clinical'),
 ('positive', 'results'),
 ('announces', 'results')]
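
For reference, the PMI score used to rank these bigrams compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below computes it by hand on a made-up token list. `bigram_pmi` and the toy tokens are illustrative assumptions, and NLTK's own bookkeeping of counts may differ in detail. It also shows why the frequency filter matters: a pair seen once between two rare words gets the highest score.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Return {bigram: PMI} for adjacent token pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_first = unigrams[first] / n_words
        p_second = unigrams[second] / n_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores

# Toy tokens, not drawn from the headline data.
tokens = ('announces topline results announces positive results '
          'reports topline results cystic fibrosis').split()

all_scores = bigram_pmi(tokens)
frequent = bigram_pmi(tokens, min_count=2)

# Without a frequency filter, the one-off rare pair ranks first.
best = max(all_scores, key=all_scores.get)
print(best)              # ('cystic', 'fibrosis')
# With min_count=2 only the repeated pair survives.
print(sorted(frequent))  # [('topline', 'results')]
```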

In [365]:
# text_bi_finder is a bigram finder, so score it with the bigram measures.
text_bi_finder.nbest(bigram_measures.pmi, 25)

[('cystic', 'fibrosis'),
 ('homologous', 'recombination'),
 ('macular', 'degeneration'),
 ('software', 'download'),
 ('intellectual', 'property'),
 ('north', 'america'),
 ('visual', 'acuity'),
 ('central', 'nervous'),
 ('intracanalicular', 'depot'),
 ('solar', 'capital'),
 ('pinta', '745'),
 ('segmental', 'glomerulosclerosis'),
 ('hot', 'flashes'),
 ('nitric', 'oxide'),
 ('hazard', 'ratio'),
 ('limiting', 'toxicities'),
 ('myers', 'squibb'),
 ('wire', ')--'),
 ('metastatic', 'melanoma'),
 ('litigation', 'reform'),
 ('focal', 'segmental'),
 ('set', 'forth'),
 ('---', '------'),
 ('wet', 'amd'),
 ('dna', 'repair')]

['S.DJ BCRX .NASDAQ US09058V1035 I/BTC .BIOTECH I/XDJGI I/XRUS N/DJG N/DJGP N/DJGS N/DJGV N/DJI N/DJIV N/DJN N/DJPT N/DN N/WED N/WER N/CNW N/DJPN N/DJWI N/PRL N/TPCT M/HCR .HEALTH M/NND M/TPX P/ABO P/AEQI P/SGN P/TAP P/WMAI P/WMMI R/AL .AL R/NME .NAMERICA R/US .US R/USS .SOUTHUS DJ/TAB']

In [17]:
# Assumes Tags still holds the raw comma-separated string for this row.
tagset = news_df.Tags[183].split(', ')
taglistlist = []
for taglist in tagset:
    taglist = taglist.split()
    taglistlist.append(taglist)

In [26]:
'Dow Jones' in news_df.Vendors[0] or 'Dow Jones GI' in news_df.Vendors[0]

True

In [55]:
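vendset = subsub.Vendors
numlist = []
# Work with positions throughout (.iloc on both the Series and the frame),
# so the filter still works when subsub's index labels are not 0..n-1.
for num in range(len(vendset)):
    vendlist = vendset.iloc[num]
    if 'Dow Jones' in vendlist or 'Dow Jones GI' in vendlist:
        numlist.append(num)

subsub.iloc[numlist, :]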

Unnamed: 0,Ticker(s),Date,Headline Type,Headline Sub Type,Vendors,Tags,Headline,Text,Clean Text,Text Timing,Stock Reaction
0,BCRX,2/8/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, BCRX, .NASDAQ, US09058V1035, I/BTC, .B...",biocryst announces results from opus-2,biocryst announces results from opus-2\n rese...,biocryst announces results from opus-2 resea...,,-60%
1,NVLS,11/28/16,Biotech,Clinical,[Dow Jones GI],"[[S.GI, NVLS, .NASDAQ, US65481J1097, I/DRG, .P...",: nivalis therapeutics announces results from ...,nivalis therapeutics announces results from ph...,nivalis therapeutics announces results from ph...,,-50%
2,NOVN,11/29/16,Biotech,Clinical,[Dow Jones GI],"[[S.GI, NOVN, .NASDAQ, US66988N1063, I/BTC, .B...",: novan announces statistically significant ph...,novan announces statistically significant phas...,novan announces statistically significant phas...,,10%
3,LXRX,12/5/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, LXRX, .NASDAQ, US5288723027, I/BTC, .B...",lexicon reports topline results from phase2 cl...,lexicon reports topline results from phase2 cl...,lexicon reports topline results from phase2 cl...,,-20%
4,ATRA,12/14/15,Biotech,Clinical,[Dow Jones],"[[S.DJ, ATRA, .NASDAQ, US0465131078, I/BTC, .B...",atara bio announces results from the phase2 pr...,atara bio announces results from the phase2 pr...,atara bio announces results from the phase2 pr...,,-40%
5,ALKS,1/21/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, ALKS, .NASDAQ, IE00B56GVS15, I/BTC, .B...",alkermes announces topline results of forward-...,alkermes announces topline results of forward-...,alkermes announces topline results of forward-...,,-40%
7,GWPH,3/14/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, GWP.LN, GWP-L, GWPH, .NASDAQ, GB003054...",gw pharmaceuticals announces positive phase3 p...,gw pharmaceuticals announces positive phase3 p...,gw pharmaceuticals announces positive phase3 p...,,110%
8,APRI,3/28/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, APRI, .NASDAQ, US03832V1098, I/DRG, .P...",apricus reports topline phase2b data for fispe...,apricus reports topline phase2b data for fispe...,apricus reports topline phase2b data for fispe...,,-50%
9,ALDR,3/28/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, ALDR, .NASDAQ, US0143391052, I/BTC, .B...",alder reports phase2b trial of ald403 meets pr...,alder reports phase2b trial of ald403 meets pr...,alder reports phase2b trial of ald403 meets pr...,,45%
10,GNCA,3/31/16,Biotech,Clinical,[Dow Jones],"[[S.DJ, GNCA, .NASDAQ, US3724271040, I/BTC, .B...",genital herpes immunotherapy gen-003 shows sus...,genital herpes immunotherapy gen-003 shows sus...,genital herpes immunotherapy gen-003 shows sus...,,70%


In [67]:
subsub.Tags[62][0]

['S.DJ',
 'OCUL',
 '.NASDAQ',
 'US67576A1007',
 'I/DRG',
 '.PHARMA',
 'I/XRUS',
 'N/DJG',
 'N/DJGP',
 'N/DJGS',
 'N/DJGV',
 'N/DJI',
 'N/DJIV',
 'N/DJN',
 'N/DJPT',
 'N/DN',
 'N/WED',
 'N/WER',
 'N/CNW',
 'N/DJPN',
 'N/DJWI',
 'N/GEN',
 'N/HLT',
 '.HEALTH',
 'N/PRL',
 'N/TPCT',
 'M/HCR',
 'M/MMR',
 'M/TPX',
 'P/ABO',
 'P/AEQI',
 'P/MC1',
 'P/SGN',
 'P/TAP',
 'P/WMAI',
 'P/WMMI',
 'R/MA',
 '.MASS',
 'R/NME',
 '.NAMERICA',
 'R/US',
 '.US',
 'R/USE',
 '.EASTUS',
 'DJ/TAB']
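
The tag codes above fall into visible families: a prefix before a slash (`N/`, `P/`, `R/`, `I/`, `M/`), a leading `S.`, or a leading dot (`.NASDAQ`, `.HEALTH`). One way to compare families when choosing filter tags is to bucket the codes by that prefix. This is only a sketch: the helpers `tag_family` and `group_by_family` and the bucket labels are assumptions for illustration, and what each prefix means in newswatch is not documented here.

```python
from collections import defaultdict

def tag_family(tag):
    """Return a rough family label for a newswatch tag code (hypothetical scheme)."""
    if '/' in tag:
        return tag.split('/', 1)[0] + '/'   # e.g. 'N/DJG' -> 'N/'
    if tag.startswith('.'):
        return '.'                          # e.g. '.NASDAQ' -> '.'
    if tag.startswith('S.'):
        return 'S.'                         # e.g. 'S.DJ' -> 'S.'
    return 'other'                          # tickers, ISINs, and anything else

def group_by_family(tags):
    """Bucket tag codes by family, preserving their original order."""
    families = defaultdict(list)
    for tag in tags:
        families[tag_family(tag)].append(tag)
    return dict(families)

# A few codes copied from the output above.
sample = ['S.DJ', 'OCUL', '.NASDAQ', 'US67576A1007', 'I/DRG', '.PHARMA',
          'N/DJG', 'N/PRL', 'M/HCR', 'P/ABO', 'R/US', '.US', 'DJ/TAB']

for family, members in group_by_family(sample).items():
    print(family, members)
```

Feeding each family's codes into a separate `Counter` would then show which families recur across articles rather than identifying a single company.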