#NEWSWATCH SUPERFILTERS

Motivation: We aim to conduct a text analysis of breaking news headlines and their associated reports, which our team pulled from newswatch and entered into a spreadsheet. Every article in our database led to a tangible price movement in the associated stock and was available through newswatch immediately upon release. We will look for keywords, phrases, and article tags that recur across these headlines, and use the results to tailor newswatch filters so that market-moving news is surfaced as soon as it is released.

Specifically, we will start by investigating headlines in three key areas:
    - Dow Jones General Headlines
    - Biotech
    - M/A
We will then seek to further categorize headlines within each of these groups.

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from collections import Counter
import codecs

import nltk
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords

#nltk.download('stopwords')
#nltk.download("genesis")
#nltk.download('punkt')

Read in the headline spreadsheet, exported from Google Sheets

In [114]:
news_df = pd.read_csv('HeadlineSpreadsheet.csv')

# Each entry in remove_list is replaced by the entry at the same position in replace_list.
# Replacements are applied in list order.
remove_list = ['press release: ', 'dj: press release: ', 'top-line', 'phase 1', 'phase 2', 'phase 3', 'dj', 'u.s.']
replace_list = ['', '', 'topline', 'phase1', 'phase2', 'phase3', '', 'us']

# Split the whitespace-separated tag string into a list of tags
news_df['Tags'] = news_df['Tags'].str.split()

# Lowercase each text column, then apply the replacements one after another
for col in ['Headline', 'Text', 'Clean Text']:
    cleaned = news_df[col].str.lower()
    for rem, rep in zip(remove_list, replace_list):
        cleaned = cleaned.map(lambda s: s.replace(rem, rep))
    news_df[col] = cleaned

news_df.head()

Unnamed: 0,Ticker(s),Date,Headline Type,Headline Sub Type,Vendor,Headline,Tags,Text,Clean Text,Text Timing,Stock Reaction
0,BCRX,2/8/16,Biotech,Clincal,Dow Jones,biocryst announces results from opus-2,"[S.DJ, BCRX, .NASDAQ, US09058V1035, I/BTC, .BI...",biocryst announces results from opus-2\r rese...,biocryst announces results from opus-2 resea...,,-60%
1,NVLS,11/28/16,Biotech,Clincal,Dow Jones GI,: nivalis therapeutics announces results from ...,"[S.GI, NVLS, .NASDAQ, US65481J1097, I/DRG, .PH...",nivalis therapeutics announces results from ph...,nivalis therapeutics announces results from ph...,,-50%
2,NOVN,11/29/16,Biotech,Clincal,Dow Jones GI,: novan announces statistically significant ph...,"[S.GI, NOVN, .NASDAQ, US66988N1063, I/BTC, .BI...",novan announces statistically significant phas...,novan announces statistically significant phas...,,10%
3,ATRA,12/14/15,Biotech,Clinical,Dow Jones,atara bio announces results from the phase2 pr...,"[S.DJ, ATRA, .NASDAQ, US0465131078, I/BTC, .BI...",atara bio announces results from the phase2 pr...,atara bio announces results from the phase2 pr...,,-40%
4,ALKS,1/21/16,Biotech,Clinical,Dow Jones,alkermes announces topline results of forward-...,"[S.DJ, ALKS, .NASDAQ, IE00B56GVS15, I/BTC, .BI...",alkermes announces topline results of forward-...,alkermes announces topline results of forward-...,,-40%
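
The leading ': ' on the headlines in rows 1 and 2 above looks like a side effect of replacement order: 'press release: ' is stripped before 'dj: press release: ' gets a chance to match, and the later 'dj' rule then removes the leftover prefix but not its colon. The sketch below reproduces this on a made-up headline (not taken from the spreadsheet) and shows one possible remedy, applying the longest patterns first. `apply_replacements` is a hypothetical helper introduced only for illustration.

```python
def apply_replacements(text, pairs):
    """Lowercase text, then apply (old, new) replacements in the given order."""
    text = text.lower()
    for old, new in pairs:
        text = text.replace(old, new)
    return text

# Same pattern/replacement pairs as in the cleaning cell
pairs = list(zip(
    ['press release: ', 'dj: press release: ', 'top-line', 'phase 1',
     'phase 2', 'phase 3', 'dj', 'u.s.'],
    ['', '', 'topline', 'phase1', 'phase2', 'phase3', '', 'us'],
))

# Hypothetical raw headline, used only to illustrate the behavior
raw = 'DJ: Press Release: Acme Bio Announces Top-Line Phase 2 Results'

# Order used in the notebook: the short prefix is removed first
original_order = apply_replacements(raw, pairs)

# Longest patterns first, so 'dj: press release: ' matches before its substrings
longest_first = sorted(pairs, key=lambda p: len(p[0]), reverse=True)
reordered = apply_replacements(raw, longest_first)

print(repr(original_order))
print(repr(reordered))
```

Note that the bare 'dj' rule still removes those two letters wherever they occur, so anchoring it to the start of the headline may be worth considering if that turns out to matter.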


Subset the dataframe and analyze tags, headlines, and text

###Most common tags by group

In [115]:
sub_df = news_df[news_df['Headline Type']=='Biotech']
subsub = sub_df[sub_df['Headline Sub Type'] == 'Regulatory']
        
tags = subsub[(subsub.Vendor == 'Dow Jones') | (subsub.Vendor == 'Dow Jones GI')].Tags
tag_list = []
for group in tags:
    for tag in group:
        tag_list.append(tag)

top_tags = Counter(tag_list)
top_tags.most_common(25)

[('M/HCR', 7),
 ('N/PRL', 7),
 ('N/CNW', 7),
 ('.HEALTH', 7),
 ('N/WED', 7),
 ('N/WER', 7),
 ('N/DJGV', 7),
 ('N/DJGP', 7),
 ('N/DJGS', 7),
 ('P/ABO', 7),
 ('P/TAP', 7),
 ('M/TPX', 7),
 ('N/TPCT', 7),
 ('N/DN', 7),
 ('N/DJIV', 7),
 ('P/WMMI', 7),
 ('P/SGN', 7),
 ('P/AEQI', 7),
 ('I/XRUS', 7),
 ('N/DJG', 7),
 ('N/DJN', 7),
 ('N/DJI', 7),
 ('N/DJWI', 7),
 ('N/DJPT', 7),
 ('N/DJPN', 7)]
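
Raw counts like these are hard to compare across groups of different sizes. One possible alternative, sketched below under the assumption that each article's tags are already a list, is to report the share of articles in a group that carry each tag. `tag_document_share` and the example tag lists are hypothetical and not taken from the spreadsheet.

```python
from collections import Counter
from itertools import chain

def tag_document_share(tag_lists):
    """Return {tag: fraction of articles that carry the tag}."""
    n_articles = len(tag_lists)
    # set() ensures a tag repeated within one article is counted once
    counts = Counter(chain.from_iterable(set(tags) for tags in tag_lists))
    return {tag: count / float(n_articles) for tag, count in counts.items()}

# Hypothetical tag lists, one per article
example_tags = [
    ['S.DJ', 'N/PRL', 'I/BTC'],
    ['S.DJ', 'N/PRL', 'I/DRG'],
    ['S.DJ', 'I/BTC', 'I/BTC'],
    ['S.BW', 'N/PRL'],
]

share = tag_document_share(example_tags)
for tag, fraction in sorted(share.items(), key=lambda kv: -kv[1]):
    print(tag, fraction)
```

Tags with a share close to 1.0 in one group and a low share elsewhere would be the natural candidates for a filter.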

In [116]:
subsub

Unnamed: 0,Ticker(s),Date,Headline Type,Headline Sub Type,Vendor,Headline,Tags,Text,Clean Text,Text Timing,Stock Reaction
45,CLVS,11/16/15,Biotech,Regulatory,BusinessWire,clovis oncology announces regulatory update fo...,"[S.BW, CLVS, CO.BW, .CO, CONFC.BW, .CC, FDA.BW...","november 16, 2015 13:00:00 utc\rmid-cycle comm...","november 16, 2015 13:00:00 utc mid-cycle commu...",,-70%
46,PTCT,2/23/16,Biotech,Regulatory,PR Newswire,ptc receives refuse to file letter from fda fo...,"[S.PN, PTCT, NJ.PN, .NJ, .NASDAQ, HEA.PN, .HEA...","south plainfield, n.j., feb. 23, 2016 /prnewsw...","south plainfield, n.j., feb. 23, 2016 /prnewsw...",,-60%
47,CLVS,4/8/16,Biotech,Regulatory,BusinessWire,fda posts briefing documents for advisory comm...,"[S.BW, CLVS, BIOTC.BW, .BIOTECH, CLINT.BW, CO....","april 8, 2016 12:43:00 utc\rclovis oncology, i...","april 8, 2016 12:43:00 utc clovis oncology, in...",,-20%
48,FLXN,5/26/16,Biotech,Regulatory,Dow Jones,flexion therapeutics receives positive guidanc...,"[S.DJ, FLXN, .NASDAQ, US33938J1060, I/DRG, .PH...",flexion therapeutics receives positive guidan...,flexion therapeutics receives positive guidanc...,,40%
49,LPCN,6/29/16,Biotech,Regulatory,Dow Jones,lipocine receives complete response letter (cr...,"[S.DJ, LPCN, .NASDAQ, US53630X1046, I/DRG, .PH...",lipocine receives complete response letter (cr...,lipocine receives complete response letter (cr...,,-50%
50,MRK,8/5/16,Biotech,Regulatory,Dow Jones,merck announces us fda filing acceptance of ne...,"[S.DJ, MRK, .NYSE, MRK-L, MCC-R, MRKC-T, MRK-F...",merck announces us fda filing acceptance of ne...,merck announces us fda filing acceptance of ne...,,10%
51,BIIB,9/1/16,Biotech,Regulatory,Dow Jones,biogen's investigational alzheimer's disease t...,"[S.DJ, BIIB, .NASDAQ, US09062X1037, I/BTC, .BI...",biogen's investigational alzheimer's disease t...,biogen's investigational alzheimer's disease t...,,2%
52,CLVS,9/8/16,Biotech,Regulatory,Fly On The Wall,clovis says fda not currently planning advisor...,"[S.FO, HOTS.FLY, CLVS]",clovis says fda not currently planning advisor...,clovis says fda not currently planning advisor...,,15%
53,PTCT,10/17/16,Biotech,Regulatory,Dow Jones,ptc therapeutics provides regulatory update on...,"[S.DJ, PTCT, .NASDAQ, US69366J2006, I/BTC, .BI...",ptc therapeutics provides regulatory update on...,ptc therapeutics provides regulatory update on...,,-30%
54,KERX,11/9/16,Biotech,Regulatory,Dow Jones,keryx biopharmaceuticals announces us fda appr...,"[S.DJ, KERX, .NASDAQ, US4925151015, I/BTC, .BI...",keryx biopharmaceuticals announces us fda appr...,keryx biopharmaceuticals announces us fda appr...,,10%


In [136]:
txt_list = subsub.Text
heads_list = subsub.Headline

# Scratch files holding one headline / one article body per line
head_soup = '/Users/titans_bball30/Documents/Trlm/headsoup.txt'
txt_soup = '/Users/titans_bball30/Documents/Trlm/txtsoup.txt'

# Write to the same paths that are read back below
with open(txt_soup, 'w') as txts:
    for line in txt_list:
        txts.write("%s\n" % line)

with open(head_soup, 'w') as hds:
    for line in heads_list:
        hds.write("%s\n" % line)

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Find collocations; the genesis corpus reader is used here only to tokenize our own files
head_bi_finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(head_soup))
head_tri_finder = TrigramCollocationFinder.from_words(nltk.corpus.genesis.words(head_soup))

text_bi_finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(txt_soup))
text_tri_finder = TrigramCollocationFinder.from_words(nltk.corpus.genesis.words(txt_soup))

# Keep only n-grams that appear at least n times; adjust the threshold for full text vs. headlines
head_bi_finder.apply_freq_filter(2)
head_tri_finder.apply_freq_filter(2)

text_bi_finder.apply_freq_filter(15)
text_tri_finder.apply_freq_filter(15)

# Filter out stopwords and words shorter than three characters
ignored_words = set(nltk.corpus.stopwords.words('english'))
head_bi_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
head_tri_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

text_bi_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
text_tri_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)


In [137]:
head_bi_finder.nbest(bigram_measures.pmi, 50)

[('advisory', 'committee'),
 ('committee', 'meeting'),
 ('complete', 'response'),
 ('regulatory', 'update'),
 ('response', 'letter'),
 ('drug', 'administration'),
 ('new', 'drug'),
 ('receives', 'complete'),
 ('drug', 'application')]
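
For readers unfamiliar with the ranking used above: pointwise mutual information (PMI) scores a word pair by how much more often the two words occur together than they would if they were independent, PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ). The sketch below is a plain-Python approximation of that idea from raw counts, not NLTK's implementation, and `bigram_pmi` and the toy token list are hypothetical. It also illustrates why a frequency filter helps: a pair seen only once can outscore a pair seen many times.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = float(len(tokens))
    scores = {}
    for (w1, w2), n_pair in bigram_counts.items():
        # log2( P(w1, w2) / (P(w1) * P(w2)) ), with n as the common denominator
        ratio = n_pair * n / (unigram_counts[w1] * unigram_counts[w2])
        scores[(w1, w2)] = math.log(ratio, 2)
    return scores

# Toy token stream: 'new drug' repeats four times, 'rare pair' occurs once
tokens = ['new', 'drug'] * 4 + ['rare', 'pair']
scores = bigram_pmi(tokens)

print(scores[('new', 'drug')])
print(scores[('rare', 'pair')])
```

Here the one-off pair scores higher than the repeated one, which is the kind of noise that `apply_freq_filter` is meant to remove before ranking.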

In [138]:
head_tri_finder.nbest(trigram_measures.pmi, 25)

[('advisory', 'committee', 'meeting'),
 ('complete', 'response', 'letter'),
 ('receives', 'complete', 'response'),
 ('new', 'drug', 'application')]

In [141]:

# Tokenize the headlines, drop single-character tokens, and lowercase the rest
with open(head_soup, 'r') as fp:
    words = nltk.tokenize.word_tokenize(fp.read())

words = [word.lower() for word in words if len(word) > 1]
fdist = FreqDist(words)

In [142]:
fdist

FreqDist({'for': 15, 'fda': 7, 'us': 5, 'announces': 4, 'of': 4, 'drug': 4, 'receives': 4, 'from': 4, 'therapeutics': 3, 'regulatory': 3, ...})

In [22]:
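import sys

# Python 2 only: make UTF-8 the default string encoding so that non-ASCII
# characters in the headlines do not raise UnicodeDecodeError.
# Python 3 strings are Unicode by default, so nothing is needed there.
if sys.version_info[0] < 3:
    reload(sys)
    sys.setdefaultencoding('utf8')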

In [127]:


# Tokenize the headlines and drop English stopwords
hd_tokens = []
stopset = set(stopwords.words('english'))
with open(head_soup, 'r') as head:
    tokens = nltk.word_tokenize(head.read())
    tokens = [w for w in tokens if w not in stopset]
    # Store this file's tokens as one entry in hd_tokens
    hd_tokens.append(tokens)
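
A possible next step, sketched here rather than taken from the notebook, is to rank the stopword-filtered tokens by frequency so that candidate filter keywords stand out. To keep the example self-contained it uses a small illustrative stopword set and simple whitespace splitting instead of nltk; `top_keywords`, `STOPWORDS`, and the example headlines are all hypothetical.

```python
from collections import Counter

# Small illustrative stopword set; the notebook itself uses nltk's English list
STOPWORDS = {'for', 'of', 'from', 'the', 'a', 'to'}

def top_keywords(headlines, stopwords=STOPWORDS, n=5):
    """Return the n most common non-stopword tokens across the headlines."""
    tokens = []
    for line in headlines:
        for word in line.lower().split():
            word = word.strip('.,:;()\'"')  # trim surrounding punctuation
            if len(word) > 1 and word not in stopwords:
                tokens.append(word)
    return Counter(tokens).most_common(n)

# Hypothetical headlines, not taken from the spreadsheet
example_headlines = [
    'Acme receives complete response letter from FDA',
    'FDA accepts new drug application for Acme',
    'Beta Bio receives FDA approval',
]

top = top_keywords(example_headlines, n=3)
print(top)
```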