# Web Crawling and Text Analysis

# Topic: Selecting Meaningful Words From Reuters News Articles (In Preparation for News Sentiment Analysis)

#### The results of this analysis can be applied to our final project, where we will analyze how news sentiment can affect the performance of ETFs

- We will start with news articles from Reuters, eventually expanding to multiple news sources
- Due to time constraints, the scope of the assignment will be limited to the US equities market
- The S&P 500 will serve as our market index for the time being


In [1]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
import numpy as np
import requests
import time
from bs4 import BeautifulSoup
import pandas_datareader.data as web
import re

### Collecting price data for S&P500

In [2]:
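# grabs open, high, low, close price data for the S&P 500
def SP500(startDate, endDate):
    sp = web.DataReader('^GSPC', 'yahoo', startDate, endDate)
    # resample to calendar days and forward-fill, so weekends and holidays
    # carry the last available trading day's values
    sp = sp.resample('D').ffill()
    return sp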

startDate = dt.date(2018,10,1)
endDate = dt.date.today()
SP = SP500(startDate, endDate)
SP.head()

Unnamed: 0_level_0,High,Low,Open,Close,Volume,Adj Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2018-10-01,2937.060059,2917.909912,2926.290039,2924.590088,3364190000,2924.590088
2018-10-02,2931.419922,2919.370117,2923.800049,2923.429932,3401880000,2923.429932
2018-10-03,2939.860107,2921.360107,2931.689941,2925.51001,3598710000,2925.51001
2018-10-04,2919.780029,2883.919922,2919.350098,2901.610107,3496860000,2901.610107
2018-10-05,2909.639893,2869.290039,2902.540039,2885.570068,3328980000,2885.570068


### Collecting news articles from Reuters

In [3]:
# grab links to news articles from the Reuters archive page
# ten or more articles are displayed on each page
url_links = []
for i in range(1,100):
    url = 'https://www.reuters.com/news/archive/marketsNews?view=page&page=' + str(i) + '&pageSize=10'
    html = requests.get(url)
    content = html.content
    soup = BeautifulSoup(content, "html.parser")
    # href=True skips anchor tags that have no href attribute (avoids a KeyError)
    for tags in soup.find_all('a', href=True):
        if re.search('article', tags['href']):
            url_links.append(tags['href'])
            
# some links may be duplicated, so we keep only the first occurrence of each
final_urls = []
for url in url_links:
    if url not in final_urls:
        final_urls.append(url)
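
The loop above scans `final_urls` once for every link, which is fine for a few hundred URLs. As a minimal sketch of an alternative, assuming Python 3.7+ (where dictionaries keep insertion order), the same order-preserving de-duplication can be written in one line. The links below are hypothetical placeholders, not real Reuters paths.

```python
# Hypothetical relative links, with one duplicate.
url_links = ["/article/a", "/article/b", "/article/a", "/article/c"]

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while preserving the first-seen order.
final_urls = list(dict.fromkeys(url_links))
print(final_urls)
```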
    

In [20]:
# retrieve the title, publish time and content for each article

title_all = []
time_all = []
content_all = []
url_all = []

for url in final_urls:
    link = 'https://www.reuters.com' + url
    page = requests.get(link).content
    soup = BeautifulSoup(page, "html.parser")
    newsTitle = soup.title.text
    print(newsTitle.lstrip())
    print(link, '\n')
    newsTime = soup.find_all("div", {"class": 'ArticleHeader_date'})[0].text
    newsContent = ''
    for tag in soup.find_all('p'):
        newsContent += tag.text
        
    title_all.append(newsTitle)
    time_all.append(newsTime)
    content_all.append(newsContent)
    url_all.append(link)

# remove leading whitespace from titles
title_all = [x.lstrip() for x in title_all]

In [22]:
# save all articles to one csv file
file = pd.DataFrame({'Title' : title_all, 'Time':time_all, 'Content':content_all, 'Link':url_all})
file['Date'] = [x.split('/')[0] for x in file['Time'].tolist()]
file['Date'] = pd.to_datetime(file['Date'])

file['Len'] = [len(x) for x in file['Content']]
file = file[file['Len'] >= 600]

file.to_csv('articles.csv')
file.head(10)

Unnamed: 0,Title,Time,Content,Link,Date,Len
1,Fed's Clarida says there's no good case for ra...,"May 7, 2019 / 2:34 PM / in 3 minutes",2 Min ReadWASHINGTON (Reuters) - The U.S. Fede...,https://www.reuters.com/article/usa-fed-clarid...,2019-05-07,1311
2,Investors most neutral on U.S. Treasuries in f...,"May 7, 2019 / 2:37 PM / a minute ago","1 Min ReadNEW YORK, May 7 (Reuters) - Bond inv...",https://www.reuters.com/article/treasuries-jpm...,2019-05-07,747
4,Fed's Clarida says there's no good case for ra...,"May 7, 2019 / 2:28 PM / Updated 16 minutes ago","1 Min ReadWASHINGTON, May 7 (Reuters) - The U....",https://www.reuters.com/article/usa-fed-clarid...,2019-05-07,677
5,UPDATE 1-Sterling slides to day's low on Brexi...,"May 7, 2019 / 2:27 PM / in a minute",3 Min Read* Graphic: World FX rates in 2019 tm...,https://www.reuters.com/article/britain-sterli...,2019-05-07,2876
6,CANADA STOCKS-TSX falls for second day on U.S....,"May 7, 2019 / 2:28 PM / Updated 16 minutes ago",2 Min ReadMay 7 (Reuters) - Canada’s main stoc...,https://www.reuters.com/article/canada-stocks/...,2019-05-07,2056
7,Wall Street declines on U.S.-China trade tensi...,"May 7, 2019 / 11:08 AM / Updated 13 minutes ago",4 Min Read(Reuters) - U.S. stocks posted broad...,https://www.reuters.com/article/usa-stocks/us-...,2019-05-07,3312
8,Scout24 bidders reach 9.7 percent stake ahead ...,"May 7, 2019 / 2:33 PM / in a minute",1 Min ReadBERLIN (Reuters) - The private equit...,https://www.reuters.com/article/scout24-ag-ma/...,2019-05-07,677
9,Britain will take part in European Parliament ...,"May 7, 2019 / 2:33 PM / a few seconds ago",1 Min ReadLONDON (Reuters) - Britain will have...,https://www.reuters.com/article/britain-eu-ele...,2019-05-07,1173
10,Senate's McConnell to declare 'case closed' on...,"May 7, 2019 / 1:25 PM / in an hour",4 Min ReadWASHINGTON (Reuters) - The divided U...,https://www.reuters.com/article/us-usa-trump-c...,2019-05-07,3435
11,U.S. House panel readies contempt vote against...,"May 6, 2019 / 5:02 AM / Updated 11 hours ago",4 Min ReadWASHINGTON (Reuters) - Congressional...,https://www.reuters.com/article/us-usa-trump-b...,2019-05-06,4048
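
The Date column above keeps only the calendar date. If the time of day is needed later (for example, to tell whether an article appeared before or after the market close), the first two fields of the Time string could be parsed together. The following is a minimal sketch using one value from the table. It assumes an English locale for the month name, and the result carries no time zone because the scraped string does not state one.

```python
from datetime import datetime

# Example value copied from the Time column shown above.
raw_time = "May 7, 2019 / 2:34 PM / in 3 minutes"

# The first two fields hold the date and the clock time; the third is a
# relative "updated" note that is ignored here.
date_part, time_part = [field.strip() for field in raw_time.split("/")[:2]]

# %B = full month name, %I and %p = 12-hour clock with AM/PM.
published = datetime.strptime(date_part + " " + time_part, "%B %d, %Y %I:%M %p")
print(published)
```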


In [2]:
file = pd.read_csv('articles.csv')

### TF-IDF

In [3]:
#  Getting the word frequency matrix with sklearn
corpus = file['Content'].values.tolist()
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  
X = vectorizer.fit_transform(corpus)  
# note: on scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
word = vectorizer.get_feature_names()  
pd.DataFrame(X.toarray(), columns=word).head()

Unnamed: 0,00,000,0000,001,002,005,006,008,009,01,...,zone,zones,zoom,zooming,zuckerberg,zug,zuma,zurich,zwaan,ﬂat
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf = False)
tfidf = transformer.fit_transform(X)

df_tfidf = pd.DataFrame(tfidf.toarray(), columns=word)
df_tfidf.head()

Unnamed: 0,00,000,0000,001,002,005,006,008,009,01,...,zone,zones,zoom,zooming,zuckerberg,zug,zuma,zurich,zwaan,ﬂat
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
df_tfidf.sum().reset_index().sort_values([0], ascending = False).head(10)

Unnamed: 0,index,0
15672,the,207.783551
15834,to,107.751337
10990,of,107.130755
1810,and,90.901376
8206,in,89.73571
11058,on,57.047368
6807,for,48.633942
3206,by,48.426135
13751,said,40.649421
11602,percent,39.50566


In [6]:
df_tfidf.sum().reset_index().sort_values([0], ascending = False).tail(10)

Unnamed: 0,index,0
14209,sheffield,0.014541
14330,sibley,0.014541
11322,overy,0.014541
92,0920,0.014541
1762,amelia,0.014541
93,0930,0.014541
11394,panetta,0.014541
7616,haskel,0.014541
259,1630,0.014541
2974,brazier,0.014541


Normally we would expect words such as "the" and "to" to receive zero or near-zero weight from the inverse document frequency, since they appear in almost all articles. However, with the imported "TfidfTransformer" and smooth_idf = False, IDF is computed as idf(t) = log [ n / df(t) ] + 1, where n is the number of documents and df(t) is the number of documents containing term t. This means that even words that occur in every document still receive a weight greater than zero. This conflicts with our aim of removing words that carry no meaning, so in the following we calculate our own TF-IDF.
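
To make the difference concrete, here is a minimal sketch comparing the two IDF definitions on a hypothetical three-document toy corpus. The documents and the helper names `doc_freq`, `idf_sklearn_unsmoothed` and `idf_plain` are made up for illustration and are not part of the analysis. The natural logarithm is assumed, which is what `math.log` computes.

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens.
docs = [
    ["the", "fed", "holds", "rates"],
    ["the", "market", "falls"],
    ["the", "fed", "signals", "patience"],
]
n_docs = len(docs)

def doc_freq(term):
    # Number of documents that contain the term at least once.
    return sum(1 for doc in docs if term in doc)

def idf_sklearn_unsmoothed(term):
    # The formula quoted above: log(n / df(t)) + 1.
    return math.log(n_docs / doc_freq(term)) + 1

def idf_plain(term):
    # Plain IDF: a term found in every document gets weight 0.
    return math.log(n_docs / doc_freq(term))

for term in ["the", "fed", "market"]:
    print(term, round(idf_sklearn_unsmoothed(term), 3), round(idf_plain(term), 3))
```

With the plain definition, "the" (present in all three documents) drops to exactly zero, while the "+ 1" variant still gives it a weight of one.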

In [7]:
# combine the content of all articles to one list
text_all = []
for content in file['Content']:
    text = content.split(' ')
    text = [x.lower() for x in text]
    text_all.append(text)


# calculate term frequency in each article
def computeReviewTFDict(reviews):
    # counts the number of times the word appears in review
    all_TFDict = []
    for review in reviews:
        reviewTFDict = {}
        for word in review:
            if word in reviewTFDict:
                reviewTFDict[word] += 1
            else:
                reviewTFDict[word] = 1
        all_TFDict.append(reviewTFDict)
    
    return all_TFDict

TF = computeReviewTFDict(text_all)
TF_list = [pd.DataFrame(list(doc.values()), index=doc.keys()) for doc in TF]
wfm = pd.concat(TF_list, axis= 1)
wfm = np.transpose(wfm).fillna(0)
wfm.head()





Unnamed: 0,Unnamed: 1,"""avengers:","""case","""disastrous""","""has","""hush","""letting","""one","""other","""raise",...,“you’d,“you’re,“you’ve,“yuan,“zainab,“‘black,“‘sell,“”restoring,…,ﬂat
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Inverse Document Frequency requires the number of documents each word has appeared in 

import math
total_words = wfm.astype(bool).sum(axis=0).sum()

#Total number of documents / Number of documents with term in it
idf = len(wfm)/wfm.astype(bool).sum(axis=0)

# Taking log
idf = idf.apply(math.log)

idf = idf.sort_values().reset_index()
idf.head(10)

Unnamed: 0,index,0
0,15,0.0
1,complete,0.0
2,for,0.0
3,2019,0.0
4,exchanges,0.0
5,quotes,0.0
6,a,0.0
7,list,0.0
8,minutes.,0.0
9,of,0.0


In [9]:
idf.tail(10)

Unnamed: 0,index,0
31685,opec+,6.908755
31686,"opec+,",6.908755
31687,opec+.,6.908755
31688,"opec,",6.908755
31689,"collapsed,",6.908755
31690,collaborations.”,6.908755
31691,collaboration,6.908755
31692,collaborating,6.908755
31693,ontario-based,6.908755
31694,benkoe)all,6.908755


The problem with words appearing in all articles seems to be solved, as words such as "the" and "a" now have zero weight. However, we still need to deal with punctuation and symbols attached to words: in the output above, "opec+," and "opec+." are counted as two different terms.
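
As a minimal sketch of one way to strip such characters, the snippet below uses `str.translate` with a deletion table. The sample sentence, the helper name `tokenize`, and the exact set of extra typographic symbols are assumptions chosen for this example.

```python
import string

# Characters to drop: ASCII punctuation plus a few typographic symbols
# (this particular set is an assumption chosen for the example).
exclude = string.punctuation + "“”‘’…"
strip_table = str.maketrans("", "", exclude)

def tokenize(text):
    # Remove unwanted characters, lowercase, and split on any whitespace.
    return text.translate(strip_table).lower().split()

sample = "OPEC+, said “output” fell. OPEC+."
tokens = tokenize(sample)
print(tokens)
```

Note that deleting every punctuation character is a blunt tool: it also merges "opec+" into "opec" and would turn "ontario-based" into "ontariobased", which may or may not be desirable.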

In [16]:
import string
import math

exclude = set(string.punctuation + '©…”“‘—')

# combine the content of all articles into one list while removing punctuation
text_all = []
for content in file['Content']:
    text = ''.join(ch for ch in content if ch not in exclude)
    text = text.split(' ')
    text = [x.lower() for x in text]
    text_all.append(text)


# calculate term frequency in each article
def computeReviewTFDict(reviews):
    # counts the number of times the word appears in review
    all_TFDict = []
    for review in reviews:
        reviewTFDict = {}
        for word in review:
            if word in reviewTFDict:
                reviewTFDict[word] += 1
            else:
                reviewTFDict[word] = 1
        all_TFDict.append(reviewTFDict)
    
    return all_TFDict

TF = computeReviewTFDict(text_all)
TF_list = [pd.DataFrame(list(doc.values()), index=doc.keys()) for doc in TF]
wfm = pd.concat(TF_list, axis= 1)
wfm = np.transpose(wfm).fillna(0)





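# note: this counts the non-zero (article, word) cells, i.e. distinct words
# summed over articles, not the total number of word occurrences
total_words = wfm.astype(bool).sum(axis=0).sum()
# corpus-wide term frequency, normalised by the count above
tf = wfm.sum()/total_words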


#Total number of documents / Number of documents with term in it
idf = len(wfm)/wfm.astype(bool).sum(axis=0)
# Taking log
idf = idf.apply(math.log)

tfidf = tf*idf
tfidf = tfidf.sort_values().reset_index()
tfidf.head(30)





Unnamed: 0,index,0
0,minutes,0.0
1,complete,0.0
2,2019,0.0
3,exchanges,0.0
4,quotes,0.0
5,reuters,0.0
6,list,0.0
7,min,0.0
8,of,0.0
9,delayed,0.0


In [17]:
tfidf.tail(30)

Unnamed: 0,index,0
20731,governor,0.003335
20732,president,0.003346
20733,data,0.003405
20734,campaign,0.003426
20735,its,0.003505
20736,china,0.00354
20737,mueller,0.003542
20738,sp,0.003596
20739,holds,0.003695
20740,fed,0.003695


## Summary and Follow-ups

1. Using TF-IDF, we identify words such as "a" and "for" as carrying very little meaningful information about the actual content, because they appear in all articles. These words will be excluded from future analysis. 


2. Words such as "policy" and "reserve" are the terms we are most interested in. We will be focusing on these words in our sentiment analysis. 


3. Before we continue with this set of words, we need to solve the problem of stemming. For example, the words "reserve" and "reserves" carry essentially the same meaning, yet they are counted as separate terms. Our text therefore needs further processing before we construct a new TF-IDF. 
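
To illustrate the stemming issue in point 3, the following sketch merges simple plural forms before counting. The function `naive_stem` is a hypothetical, deliberately crude rule written for this example. It is not a real stemming algorithm, and an established stemmer (for instance the Porter stemmer available in NLTK) would likely be a better choice for the actual analysis.

```python
from collections import Counter

def naive_stem(word):
    # Very rough plural stripping; NOT a real stemming algorithm.
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # policies -> policy
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # reserves -> reserve
    return word

# Hypothetical tokens, as they might come out of the cleaning step.
tokens = ["reserve", "reserves", "policy", "policies", "loss"]
counts = Counter(naive_stem(t) for t in tokens)
print(counts)
```

After this step "reserve" and "reserves" share a single count, so their term frequency and document frequency are no longer split across two entries. The rule still mangles words such as "this", which is one reason a proper stemmer is preferable.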
