# Web Crawling and Text Analysis

# Topic: Selecting Meaningful Words From Reuters News Articles (In Preparation for News Sentiment Analysis)

#### The results of this analysis can be applied to our final project, where we will analyze how news sentiment can affect the performance of ETFs

- We will start with news articles from Reuters, eventually expanding to multiple news sources
- Due to time constraints, the scope of the assignment will be limited to the US equities market.
- The S&P 500 will serve as our market index for the time being


In [1]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
import numpy as np
import requests
import time
from bs4 import BeautifulSoup
import pandas_datareader.data as web
import re

### Collecting price data for S&P500

In [2]:
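# Download daily open, high, low, close, and volume data for the S&P 500 (^GSPC)
def SP500(startDate, endDate):
    sp = web.DataReader('^GSPC', 'yahoo', startDate, endDate)
    # Resample to calendar days; non-trading days repeat the last trading day's values
    sp = sp.resample('D').ffill()
    return sp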

startDate = dt.date(2018,10,1)
endDate = dt.date.today()
SP = SP500(startDate, endDate)
SP.head()

Date,High,Low,Open,Close,Volume,Adj Close
2018-10-01,2937.060059,2917.909912,2926.290039,2924.590088,3364190000,2924.590088
2018-10-02,2931.419922,2919.370117,2923.800049,2923.429932,3401880000,2923.429932
2018-10-03,2939.860107,2921.360107,2931.689941,2925.51001,3598710000,2925.51001
2018-10-04,2919.780029,2883.919922,2919.350098,2901.610107,3496860000,2901.610107
2018-10-05,2909.639893,2869.290039,2902.540039,2885.570068,3328980000,2885.570068
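
The effect of `resample('D').ffill()` in `SP500` is easiest to see on a tiny frame. The sketch below uses made-up closing prices for a Friday, a Monday, and a Tuesday (the values and the `prices` name are illustrative assumptions, not downloaded data) and shows that the weekend rows are created and filled with Friday's value.

```python
import pandas as pd

# Illustrative closing prices for Fri 2018-10-05, Mon 2018-10-08, Tue 2018-10-09.
# These numbers are assumptions for the demo, not real downloaded data.
prices = pd.DataFrame(
    {"Close": [2885.57, 2884.43, 2880.34]},
    index=pd.to_datetime(["2018-10-05", "2018-10-08", "2018-10-09"]),
)

# Upsample to one row per calendar day; each new row copies the
# most recent earlier row (forward fill).
daily = prices.resample("D").ffill()
print(daily)
```

Having one row per calendar day means an article published on a weekend can still be matched to a price row, although that row only repeats the last trading day's values.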


### Collecting news articles from Reuters

In [3]:
# Grab links to news articles from the Reuters archive pages.
# Each page lists ten or more articles.
url_links = []
for i in range(1, 100):
    url = 'https://www.reuters.com/news/archive/marketsNews?view=page&page=' + str(i) + '&pageSize=10'
    html = requests.get(url)
    soup = BeautifulSoup(html.content, "html.parser")
    # href=True skips anchors without an href attribute, which would otherwise raise a KeyError
    for tags in soup.find_all('a', href=True):
        if re.search('article', tags['href']):
            url_links.append(tags['href'])

# Some links appear more than once, so keep only the first occurrence of each
final_urls = []
for url in url_links:
    if url not in final_urls:
        final_urls.append(url)
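
The filter-then-deduplicate step can also be sketched without any network access. The helper name `select_article_links` and the sample hrefs below are hypothetical; the sketch only assumes that article links contain the substring `article`, as the cell above does. It uses `dict.fromkeys`, which keeps the first occurrence of each key in insertion order.

```python
import re

def select_article_links(hrefs):
    """Keep hrefs that contain 'article', dropping repeats but preserving order."""
    article_links = [h for h in hrefs if re.search("article", h)]
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(article_links))

# Hypothetical hrefs, shaped like those found on an archive page
sample_hrefs = [
    "/article/usa-stocks/example-idAAA",
    "/news/archive/marketsNews",
    "/article/usa-stocks/example-idAAA",
    "/article/usa-bonds/example-idBBB",
]
links = select_article_links(sample_hrefs)
print(links)
```

The result matches what the loop over `url_links` produces, but avoids rescanning the growing list for every link.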
    

In [4]:
# Retrieve the title, publish time, and content of each article

title_all = []
time_all = []
content_all = []
url_all = []

for url in final_urls:
    link = 'https://www.reuters.com' + url
    page = requests.get(link).content
    soup = BeautifulSoup(page, "html.parser")
    newsTitle = soup.title.text
    print(newsTitle.lstrip())
    print(link, '\n')
    newsTime = soup.find_all("div", {"class": 'ArticleHeader_date'})[0].text
    newsContent = ''
    for tag in soup.find_all('p'):
        newsContent += tag.text
        
    title_all.append(newsTitle)
    time_all.append(newsTime)
    content_all.append(newsContent)
    url_all.append(link)

# Remove leading whitespace from titles
title_all = [x.lstrip() for x in title_all]

EMERGING MARKETS-Latam stocks extend losses on trade concerns, growth worries - Reuters
https://www.reuters.com/article/emerging-markets-latam/emerging-markets-latam-stocks-extend-losses-on-trade-concerns-growth-worries-idUSL5N22J6SD 

Fed's Clarida says there's no good case for rate hikes or cuts: Bloomberg - Reuters
https://www.reuters.com/article/usa-fed-clarida/update-1-feds-clarida-says-theres-no-good-case-for-rate-hikes-or-cuts-bloomberg-idUSL2N22J0LE 

Investors most neutral on U.S. Treasuries in five weeks - survey - Reuters
https://www.reuters.com/article/treasuries-jpmorgan/investors-most-neutral-on-u-s-treasuries-in-five-weeks-survey-idUSL2N22J0M1 

TREASURIES-U.S. yields fall on trade worries before 3-year supply - Reuters
https://www.reuters.com/article/usa-bonds/treasuries-u-s-yields-fall-on-trade-worries-before-3-year-supply-idUSL2N22J0IW 

Fed's Clarida says there's no good case for rate hikes or cuts-Bloomberg - Reuters
https://www.reuters.com/article/usa-fed-clarida/f

Senate judiciary chair offers Mueller opportunity to testify - Reuters
https://www.reuters.com/article/us-usa-trump-congress-graham/senate-judiciary-chair-offers-mueller-opportunity-to-testify-idUSKCN1S91RH 

RPT-INSIGHT-As IPO looms, Uber clings to hard-knuckled tactics in pursuit of growth - Reuters
https://www.reuters.com/article/uber-ipo-cities/rpt-insight-as-ipo-looms-uber-clings-to-hard-knuckled-tactics-in-pursuit-of-growth-idUSL2N22J00A 

Sri Lanka stocks end near 6-1/2-year low; rupee gains - Reuters
https://www.reuters.com/article/sri-lanka-markets/sri-lanka-stocks-end-near-6-1-2-year-low-rupee-gains-idUSL3N22J3IG 

U.S. cited international security issues for cancellation of Pompeo's Berlin visit: German source - Reuters
https://www.reuters.com/article/germany-usa-pompeo-reason/u-s-cited-international-security-issues-for-cancellation-of-pompeos-berlin-visit-german-source-idUSS8N22B04V 

CORRECTED-Sri Lanka rupee touches near 6-week low; stocks hit 6-1/2-year closing low - Reu

Democratic Senator Bennet of Colorado joins crowded 2020 field - Reuters
https://www.reuters.com/article/us-usa-election-bennet/democratic-senator-bennet-of-colorado-joins-crowded-2020-field-idUSKCN1S814A 

Barr cancels second day of testimony, escalating battle with U.S. Congress - Reuters
https://www.reuters.com/article/us-usa-trump-barr/barr-cancels-second-day-of-testimony-escalating-battle-with-u-s-congress-idUSKCN1S73HF 

Yuan recoups most losses as Beijing confirms Vice Premier Liu's trip to U.S. - Reuters
https://www.reuters.com/article/china-yuan-midday/yuan-recoups-most-losses-as-beijing-confirms-vice-premier-lius-trip-to-u-s-idUSL3N22J24G 

Sri Lanka April tourist arrivals slide after Easter bombings - Reuters
https://www.reuters.com/article/sri-lanka-blasts-tourists/sri-lanka-april-tourist-arrivals-slide-after-easter-bombings-idUSL3N22J27P 

Beirut stock market closed, awaits resumption of work at c.bank - source - Reuters
https://www.reuters.com/article/lebanon-economy-bour

Wall St. falls as White House vows to raise China tariffs - Reuters
https://www.reuters.com/article/usa-stocks/us-stocks-wall-st-falls-as-white-house-vows-to-raise-china-tariffs-idUSL2N22I1L4 

Gundlach recommends buying rate volatility on long maturity U.S. Treasuries: Sohn - Reuters
https://www.reuters.com/article/funds-sohn-gundlach/update-1-gundlach-recommends-buying-rate-volatility-on-long-maturity-u-s-treasuries-sohn-idUSL2N22I1KF 

Illinois credit quality could improve with graduated tax rates -Moody's - Reuters
https://www.reuters.com/article/illinois-tax-moodys/illinois-credit-quality-could-improve-with-graduated-tax-rates-moodys-idUSL2N22I1H5 

Gundlach says buy interest rate volatility on long maturity U.S. Treasuries -Sohn - Reuters
https://www.reuters.com/article/funds-sohn-gundlach/gundlach-says-buy-interest-rate-volatility-on-long-maturity-u-s-treasuries-sohn-idUSL2N22I1JE 

Foxconn chairman travels to White House to discuss Wisconsin: source - Reuters
https://www.reuter

Italian banks back BlackRock's 720 million euro Carige rescue plan - Reuters
https://www.reuters.com/article/eurozone-banks-carige/update-3-italian-banks-back-blackrocks-720-mln-euro-carige-rescue-plan-idUSL5N22I3L5 

GLOBAL MARKETS-Stocks, yields fall as investors seek safety after Trump's China tariff threats - Reuters
https://www.reuters.com/article/global-markets/global-markets-stocks-yields-fall-as-investors-seek-safety-after-trumps-china-tariff-threats-idUSL2N22I0UM 

Starbucks' China rival Luckin seeks to raise up to $586.5 million in IPO - Reuters
https://www.reuters.com/article/luckincoffee-ipo/update-2-starbucks-china-rival-luckin-seeks-to-raise-up-to-586-5-mln-in-ipo-idUSL3N22I2Y5 

Banks tighten standards on commercial real estate, credit card loans - Reuters
https://www.reuters.com/article/usa-fed-credit/banks-tighten-standards-on-commercial-real-estate-credit-card-loans-idUSW1N22900A 

Glenview’s Larry Robbins says he’s shorting 3M shares, likes healthcare stocks: Sohn - 

Former U.S. vice president Biden to announce 2020 election run on Thursday - Reuters
https://www.reuters.com/article/us-usa-election-biden/former-u-s-vice-president-biden-to-announce-2020-election-run-on-thursday-idUSKCN1RZ1WB 

House oversight chairman cites 'massive' obstruction by Trump, Barr - Reuters
https://www.reuters.com/article/us-usa-trump-oversight/house-oversight-chairman-cites-massive-obstruction-by-trump-barr-idUSKCN1S02KE 

Lebanese Druze leader backs austerity plan - Reuters
https://www.reuters.com/article/lebanon-economy-jumblatt/lebanese-druze-leader-backs-austerity-plan-idUSL5N22I3VM 

TREASURIES-U.S. bond yields fall on U.S.-China trade tension - Reuters
https://www.reuters.com/article/usa-bonds/treasuries-u-s-bond-yields-fall-on-u-s-china-trade-tension-idUSL2N22I0D7 

China gives modest boost to economy with RRR cut amid renewed trade tensions - Reuters
https://www.reuters.com/article/china-economy/update-6-china-gives-modest-boost-to-economy-with-rrr-cut-amid-rene

House panel chair subpoenas ex-White House counsel McGahn on Mueller inquiry - Reuters
https://www.reuters.com/article/us-usa-trump-russia-mcgahn/house-panel-chair-subpoenas-ex-white-house-counsel-mcgahn-on-mueller-inquiry-idUSKCN1RY1K7 

On staff following orders, Trump says: 'Nobody disobeys me' - Reuters
https://www.reuters.com/article/us-usa-trump/on-staff-following-orders-trump-says-nobody-disobeys-me-idUSKCN1RY19R 

US STOCKS-Futures sink after Trump escalates China tariff threat - Reuters
https://www.reuters.com/article/usa-stocks/us-stocks-futures-sink-after-trump-escalates-china-tariff-threat-idUSL3N22I1UR 

Boeing 737 slides off runway in Russia's Norilsk - Reuters
https://www.reuters.com/article/russia-airplane-boeing/boeing-737-slides-off-runway-in-russias-norilsk-idUSL5N22I20Z 

German business lobby: Trump's new tariff threat not good at all - Reuters
https://www.reuters.com/article/germany-economy-trade/german-business-lobby-trumps-new-tariff-threat-not-good-at-all-idUSS

DIARY-Top Economic Events to June 27 - Reuters
https://www.reuters.com/article/diary-top-econ/diary-top-economic-events-to-june-27-idUSL3N22F2GQ 

FOREX-Yen firms, yuan falls as Trump China tariff threat jolts risk assets - Reuters
https://www.reuters.com/article/global-forex/forex-yen-firms-yuan-falls-as-trump-china-tariff-threat-jolts-risk-assets-idUSL3N22H0NP 

Top Democrats say Mueller report undercuts Barr claims on Trump obstruction - Reuters
https://www.reuters.com/article/us-usa-trump-russia-democrats/top-democrats-say-mueller-report-undercuts-barr-claims-on-trump-obstruction-idUSKCN1RU267 

Mueller report provides intimate scenes from the Trump White House - Reuters
https://www.reuters.com/article/us-usa-trump-mueller-scenes/mueller-report-provides-intimate-scenes-from-the-trump-white-house-idUSKCN1RU25D 

Factbox: Long-awaited Mueller report is finally out. Now what? - Reuters
https://www.reuters.com/article/us-usa-trump-russia-next-facbox/factbox-long-awaited-mueller-report-

Wall St. climbs as jobs data supports upbeat economic outlook - Reuters
https://www.reuters.com/article/usa-stocks/us-stocks-wall-st-climbs-as-jobs-data-supports-upbeat-economic-outlook-idUSL1N22F1J8 

BUZZ-U.S. stocks weekly: Saddle up - Reuters
https://www.reuters.com/article/buzz-us-stocks-weekly-saddle-up/buzz-u-s-stocks-weekly-saddle-up-idUSL1N22F176 

U.S. attorney general to hold Mueller report news conference on Thursday - Reuters
https://www.reuters.com/article/us-usa-trump-russia-barr/u-s-attorney-general-to-hold-mueller-report-news-conference-on-thursday-idUSKCN1RT2I3 

Former Virginia Governor McAuliffe decides to not enter 2020 presidential race - Reuters
https://www.reuters.com/article/us-usa-election-mcauliffe/former-virginia-governor-mcauliffe-decides-to-not-enter-2020-presidential-race-idUSKCN1RU063 

Energy Secretary Perry planning to leave Trump administration: source - Reuters
https://www.reuters.com/article/us-usa-trump-perry/energy-secretary-perry-planning-to-leav

Trump may be trying to make everyone 'crazy' with sanctuary cities threat:  Sen. Rick Scott - Reuters
https://www.reuters.com/article/us-usa-immigration-sanctuary/trump-may-be-trying-to-make-everyone-crazy-with-sanctuary-cities-threat-sen-rick-scott-idUSKCN1RQ0IG 

Booker launches 'Justice' tour, aiming for surge in U.S. presidential bid - Reuters
https://www.reuters.com/article/us-usa-election-booker/booker-launches-justice-tour-aiming-for-surge-in-u-s-presidential-bid-idUSKCN1RP070 

CORRECTED-UPDATE 5-Oil prices headed for weekly decline as U.S. output grows - Reuters
https://www.reuters.com/article/global-oil/corrected-update-5-oil-prices-headed-for-weekly-decline-as-u-s-output-grows-idUSL3N22F0MH 

METALS-Copper jumps on weaker dollar but logs weekly loss - Reuters
https://www.reuters.com/article/global-metals/metals-copper-jumps-on-weaker-dollar-but-logs-weekly-loss-idUSL3N22F1DE 

Trump asylum policy gets temporary reprieve from Court of Appeals - Reuters
https://www.reuters.com

MONEY MARKETS-U.S. fed funds rate falls after Fed's tweak on reserves - Reuters
https://www.reuters.com/article/usa-moneymarkets-fedfunds/money-markets-u-s-fed-funds-rate-falls-after-feds-tweak-on-reserves-idUSL1N22F0HB 

CORRECTED-UPDATE 1-Hudbay Minerals settles board battle with shareholder Waterton - Reuters
https://www.reuters.com/article/hudbay-shareholders-waterton/corrected-update-1-hudbay-minerals-settles-board-battle-with-shareholder-waterton-idUSL3N22F1R5 

US STOCKS-Strong jobs data set to boost Wall Street at open - Reuters
https://www.reuters.com/article/usa-stocks/us-stocks-strong-jobs-data-set-to-boost-wall-street-at-open-idUSL3N22F237 

UPDATE 1-Itaú Unibanco sees competition among card processors unlikely to cool down - Reuters
https://www.reuters.com/article/itau-unibanco-hldg-call/update-1-ita-unibanco-sees-competition-among-card-processors-unlikely-to-cool-down-idUSL1N22F0EE 

Trump to hold event Friday on 5G, rural broadband: White House - Reuters
https://www.reut

UK government's Brexit talks with Labour to resume after weekend - May's spokeswoman - Reuters
https://www.reuters.com/article/britain-eu-labour/uk-governments-brexit-talks-with-labour-to-resume-after-weekend-pm-mays-spokeswoman-idUSS8N1YB009 

CEE MARKETS-Hungary's forint firms ahead of Moody's review, CPI data - Reuters
https://www.reuters.com/article/easteurope-markets/cee-markets-hungarys-forint-firms-ahead-of-moodys-review-cpi-data-idUSL5N22F2NG 

UPDATE 1-Euro zone inflation jumps beyond expectations in April - Reuters
https://www.reuters.com/article/eurozone-economy-inflation/update-1-euro-zone-inflation-jumps-beyond-expectations-in-april-idUSL5N22F2D9 

EU leaders set to meet just after EU election -officials - Reuters
https://www.reuters.com/article/eu-election/eu-leaders-set-to-meet-just-after-eu-election-officials-idUSL5N22F2EP 

April data points to Turkey meeting its inflation targets -Albayrak - Reuters
https://www.reuters.com/article/turkey-economy-inflation-albayrak/apr

Brazil's Via Varejo proposes change in bylaws that eases sale: filing - Reuters
https://www.reuters.com/article/via-varejo-bylaws/brazils-via-varejo-proposes-change-in-bylaws-that-eases-sale-filing-idUSE6N20K02H 

PG&E says SEC investigating it for disclosures, losses for wildfires - Reuters
https://www.reuters.com/article/pge-us-sec/pge-says-sec-investigating-it-for-disclosures-losses-for-wildfires-idUSL3N22E4IX 

Brazil airline Azul could bid for Avianca Brasil's assets after all - Reuters
https://www.reuters.com/article/avianca-brasil-bankruptcy/brazil-airline-azul-could-bid-for-avianca-brasils-assets-after-all-idUSL1N22E1XJ 

MOVES-Pimco's head of Emerging Markets portfolio management takes 6-month sabbatical - Reuters
https://www.reuters.com/article/funds-pimco/moves-pimcos-head-of-emerging-markets-portfolio-management-takes-6-month-sabbatical-idUSL1N22E1VT 

U.S. advertising groups create privacy bill coalition - Reuters
https://www.reuters.com/article/us-usa-privacy/u-s-advertis

NYSE-owner ICE cool with crypto 'winter' as profits climb - Reuters
https://www.reuters.com/article/interconti-exc-results/update-2-nyse-owner-ice-cool-with-crypto-winter-as-profits-climb-idUSL3N22E290 

METALS-Zinc hits 1-1/2 month low, others recover on trade deal hopes - Reuters
https://www.reuters.com/article/global-metals/metals-zinc-hits-1-1-2-month-low-others-recover-on-trade-deal-hopes-idUSL3N22E1H6 

U.S. first-quarter productivity strongest since 2014, labor costs subdued - Reuters
https://www.reuters.com/article/usa-economy/wrapup-3-u-s-q1-productivity-strongest-since-2014-labor-costs-subdued-idUSL1N22D19D 

UPDATE 1-South Africa says Chinese loan to Eskom not in jeopardy - Reuters
https://www.reuters.com/article/safrica-eskom-china/update-1-south-africa-says-chinese-loan-to-eskom-not-in-jeopardy-idUSL5N22E6SY 

U.S. ratchets up pressure on Venezuela, Cuban backers - Reuters
https://www.reuters.com/article/us-venezuela-politics-pence-houston/u-s-ratchets-up-pressure-on-venez

Proxy firm ISS backs MNG director for Gannett board, rejects two others - Reuters
https://www.reuters.com/article/mng-proxy-gannett-co/proxy-firm-iss-backs-mng-director-for-gannett-board-rejects-two-others-idUSL3N22E2K6 

MONEY MARKETS-Dollar LIBOR falls as Fed tweaks interest on U.S. reserves - Reuters
https://www.reuters.com/article/usa-moneymarkets/money-markets-dollar-libor-falls-as-fed-tweaks-interest-on-u-s-reserves-idUSL1N22E0DC 

GLOBAL MARKETS-Stocks slip, dollar drifts after Fed dents rate cut hopes - Reuters
https://www.reuters.com/article/global-markets/global-markets-stocks-slip-dollar-drifts-after-fed-dents-rate-cut-hopes-idUSL5N22E5BI 

Top Lebanese banker warns against raising interest income tax - Reuters
https://www.reuters.com/article/lebanon-economy-banks/top-lebanese-banker-warns-against-raising-interest-income-tax-idUSL5N22E4ZO 

PRECIOUS-Gold falls to 1-week low after Fed dashes rate cut hopes - Reuters
https://www.reuters.com/article/global-precious/precious-gol

Metro Bank shares tumble after weak Q1 profits, deposit outflows - Reuters
https://www.reuters.com/article/metro-bank-results-stocks/metro-bank-shares-tumble-after-weak-q1-profits-deposit-outflows-idUSL5N22E1LX 

RPT-Thomas Cook sets May 7 deadline for interest in airline business -sources - Reuters
https://www.reuters.com/article/thomas-cook-grp-ma-airlines/rpt-thomas-cook-sets-may-7-deadline-for-interest-in-airline-business-sources-idUSL5N22E1NP 

FOREX-Dollar recovers after overnight stumble; BOE eyed - Reuters
https://www.reuters.com/article/global-forex/forex-dollar-recovers-after-overnight-stumble-boe-eyed-idUSL5N22E1MN 

FOREX-Dollar recovers after overnight stumble; BOE eyed - Reuters
https://www.reuters.com/article/global-forex/forex-dollar-recovers-after-overnight-stumble-boe-eyed-idUSL5N22E1L3 

RPT-GLOBAL MARKETS-Asian shares flatline after Fed's neutral message - Reuters
https://www.reuters.com/article/global-markets/rpt-global-markets-asian-shares-flatline-after-feds-neut

Bank of Canada says rates will go up if economic headwinds dissipate - Reuters
https://www.reuters.com/article/canada-cenbank/bank-of-canada-says-rates-will-go-up-if-economic-headwinds-dissipate-idUSL1N22D1DW 

Dollar rises as Fed's Powell cools bets on rate-cut - Reuters
https://www.reuters.com/article/global-forex/forex-dollar-rises-as-feds-powell-cools-bets-on-rate-cut-idUSL1N22D194 

CANADA FX DEBT-Loonie weakens as Fed inflation view boosts greenback - Reuters
https://www.reuters.com/article/canada-forex/canada-fx-debt-loonie-weakens-as-fed-inflation-view-boosts-greenback-idUSL1N22D1BE 

World stocks fall, dollar gains on Powell comments - Reuters
https://www.reuters.com/article/global-markets/global-markets-stocks-fall-dollar-gains-on-powell-comments-idUSL1N22D19Y 

Ex-U.S. Vice President Biden denies inappropriate conduct over alleged kiss - Reuters
https://www.reuters.com/article/us-usa-election-biden/ex-u-s-vice-president-biden-denies-inappropriate-conduct-over-alleged-kiss-id

Bill to let banks work with cannabis companies advances in U.S. House - Reuters
https://www.reuters.com/article/us-usa-house-cannabis/bill-to-let-banks-work-with-cannabis-companies-advances-in-u-s-house-idUSKCN1R91R6 

Low ECB rates to stay, banks should merge: de Guindos - Reuters
https://www.reuters.com/article/ecb-policy/update-1-low-ecb-rates-to-stay-banks-should-mergede-guindos-idUSL5N22D1K7 

Exports sag, UK factories report lower Brexit stockpile boost - PMI survey - Reuters
https://www.reuters.com/article/britain-economy-pmi/update-1-exports-sag-uk-factories-report-lower-brexit-stockpile-boost-pmi-survey-idUSL5N22D1CM 

FOREX-New Zealand dollar hit by jobs data; investors eye Fed meeting - Reuters
https://www.reuters.com/article/global-forex/forex-new-zealand-dollar-hit-by-jobs-data-investors-eye-fed-meeting-idUSL5N22D1G4 

Bayer supervisory board to meet to discuss crisis: report - Reuters
https://www.reuters.com/article/bayer-management-shareholders/update-1-bayer-supervisory

CANADA FX DEBT-Loonie notches one-week high as Poloz dials back pessimism - Reuters
https://www.reuters.com/article/canada-forex/canada-fx-debt-loonie-notches-one-week-high-as-poloz-dials-back-pessimism-idUSL1N22C20J 

Voce says Argo ROE could double with cost cuts, nominates five directors - Reuters
https://www.reuters.com/article/argo-group-intl-voce/voce-says-argo-roe-could-double-with-cost-cuts-nominates-five-directors-idUSL1N22C01G 

Democrats push for Mueller report to Congress by next week, Republicans resist - Reuters
https://www.reuters.com/article/us-usa-trump-russia/democrats-push-for-mueller-report-to-congress-by-next-week-republicans-resist-idUSKCN1R61J0 

Can Trump block Twitter users whose views he dislikes? U.S. Appeals Court skeptical - Reuters
https://www.reuters.com/article/us-usa-trump-twitter/can-trump-block-twitter-users-whose-views-he-dislikes-u-s-appeals-court-skeptical-idUSKCN1R72DW 

U.S. House fails to override Trump veto in border wall dispute - Reuters
http

Kremlin, after Mueller report, says it's open to better U.S. ties - Reuters
https://www.reuters.com/article/us-usa-trump-russia-kremlin/kremlin-after-mueller-report-says-its-open-to-better-u-s-ties-idUSKCN1R6128 

US STOCKS-Wall Street's record run hits snag after Alphabet tumbles - Reuters
https://www.reuters.com/article/usa-stocks/us-stocks-wall-streets-record-run-hits-snag-after-alphabet-tumbles-idUSL3N22C45Y 

Trump legal team says Mueller report totally vindicates president - Reuters
https://www.reuters.com/article/us-usa-trump-russia-legal/trump-legal-team-says-mueller-report-totally-vindicates-president-idUSKCN1R50US 

Trump slams Russia probe as 'illegal takedown' - Reuters
https://www.reuters.com/article/us-usa-russia-trump-takedown/trump-slams-russia-probe-as-illegal-takedown-idUSKCN1R50UF 

Trump responds to Mueller report: 'complete and total exoneration' - Reuters
https://www.reuters.com/article/us-usa-trump-russia-tweet/trump-responds-to-mueller-report-complete-and-total-

Italian bank fund could take Carige stake in BlackRock rescue: Intesa CEO - Reuters
https://www.reuters.com/article/eurozone-banks-carige-intesa-sanpaolo/update-1-italian-bank-fund-could-take-carige-stake-in-blackrock-rescue-intesa-ceo-idUSL5N22C66A 

Putin says contaminated oil pipeline scandal has hurt Russia's image - Reuters
https://www.reuters.com/article/russia-oil-exports-putin/putin-says-contaminated-oil-pipeline-scandal-has-hurt-russias-image-idUSR4N22804H 

Citizenship question on U.S. Census would cause Hispanic undercount by millions: study - Reuters
https://www.reuters.com/article/us-usa-census-undercount/citizenship-question-on-u-s-census-would-cause-hispanic-undercount-by-millions-study-idUSKCN1R32BV 

Trump aide Lance Leggitt resigning from White House job - Reuters
https://www.reuters.com/article/us-usa-trump-leggitt/trump-aide-lance-leggitt-resigning-from-white-house-job-idUSKCN1R32OQ 

Trump's Fed nominee not sure if U.S. central bank should cut rates: Bloomberg - Re

General Electric quarterly profit more than triples - Reuters
https://www.reuters.com/article/ge-results/general-electric-quarterly-profit-more-than-triples-idUSL3N22C2N5 

Banco Santander Brasil beats profit estimates as provisions fall - Reuters
https://www.reuters.com/article/bco-santander-br-results/banco-santander-brasil-beats-profit-estimates-as-provisions-fall-idUSE6N1ZZ00M 

Euro zone first-quarter economic growth stronger than expected, unemployment falls - Reuters
https://www.reuters.com/article/eurozone-economy-gdp/update-1-euro-zone-q1-economic-growth-stronger-than-expected-unemployment-falls-idUSL5N22C3R7 

Russia's Magnit pulls bid for Lenta as Mordashov's offer moves on - Reuters
https://www.reuters.com/article/lenta-ltd-ma-magnit/update-1-russias-magnit-pulls-bid-for-lenta-as-mordashovs-offer-moves-on-idUSL5N22C31I 

Wealth clients willing to pay for financial advice, 33% switch providers: study - Reuters
https://www.reuters.com/article/wealth-clients-study/wealth-clien

Trump sues Deutsche Bank and Capital One to block House subpoenas - Reuters
https://www.reuters.com/article/usa-trump-russia-banks/update-1-trump-sues-deutsche-bank-and-capital-one-to-block-house-subpoenas-idUSL1N22C047 

Britain's Labour hails progress in cross-party Brexit talks - The Times - Reuters
https://www.reuters.com/article/britain-eu/britains-labour-hails-progress-in-cross-party-brexit-talks-the-times-idUSL9N21601Q 

FOREX-Dollar marks time, Aussie eases on China data miss - Reuters
https://www.reuters.com/article/global-forex/forex-dollar-marks-time-aussie-eases-on-china-data-miss-idUSL3N22C0IK 

German chauffeur service Blacklane plans IPO within three years - Reuters
https://www.reuters.com/article/gulf-travel-blacklane/refile-update-1-german-chauffeur-service-blacklane-plans-ipo-within-3-years-idUSL5N22B4M2 

CORRECTED-Trump sues Deutsche Bank and Capital One to block House subpoenas - Reuters
https://www.reuters.com/article/usa-trump-russia-banks/corrected-trump-sues-de

S&P 500 posts high, extends 2019 rally; Alphabet falls late - Reuters
https://www.reuters.com/article/usa-stocks/refile-us-stocks-sp-500-posts-high-extends-2019-rally-alphabet-falls-late-idUSL1N22B1JA 

Brazil public investment this year may hit new low below 0.5 pct/GDP -Treasury - Reuters
https://www.reuters.com/article/brazil-economy-investment/brazil-public-investment-this-year-may-hit-new-low-below-0-5-pct-gdp-treasury-idUSL1N22B1E9 

EMERGING MARKETS-Latam FX, stocks mostly weaker; Argentina peso jumps - Reuters
https://www.reuters.com/article/emerging-markets-latam/emerging-markets-latam-fx-stocks-mostly-weaker-argentina-peso-jumps-idUSL1N22B1HN 

Brazil govt to insist on pension reform savings of 1.2 tln reais - Reuters
https://www.reuters.com/article/brazil-politics-pensions/brazil-govt-to-insist-on-pension-reform-savings-of-1-2-tln-reais-idUSE4N20D021 

WeWork owner The We Company joins IPO stampede - Reuters
https://www.reuters.com/article/wework-ipo/update-2-wework-owner-th


Iraq will 'cooperate' with Siemens on power grid plan - Reuters
https://www.reuters.com/article/iraq-energy-siemens/iraq-will-cooperate-with-siemens-on-power-grid-plan-idUSL5N22B61K 

European shares end higher as Spanish stocks recover poise - Reuters
https://www.reuters.com/article/europe-stocks/update-2-european-shares-end-higher-as-spanish-stocks-recover-poise-idUSL5N22B1N6 

U.S. consumer spending roars back, but inflation tame - Reuters
https://www.reuters.com/article/usa-economy/wrapup-2-u-s-consumer-spending-roars-back-but-inflation-tame-idUSL1N2281DR 

REFILE-US STOCKS-Wall St gains as soft inflation data supports accommodative Fed - Reuters
https://www.reuters.com/article/usa-stocks/refile-us-stocks-wall-st-gains-as-soft-inflation-data-supports-accommodative-fed-idUSL3N22B3BE 

White House still standing behind Moore for the Fed - Kudlow - Reuters
https://www.reuters.com/article/usa-fed-trump/white-house-still-standing-behind-moore-for-the-fed-kudlow-idUSL1N22B0RD 

Airbnb C

House passes Democrats' campaign finance, ethics bill - Reuters
https://www.reuters.com/article/us-usa-congress-democrats/house-passes-democrats-campaign-finance-ethics-bill-idUSKCN1QP1ZZ 

Putin says Russians and Ukrainians would benefit from shared citizenship: Ifax - Reuters
https://www.reuters.com/article/russia-ukraine-putin/putin-says-russians-and-ukrainians-would-benefit-from-shared-citizenship-ifax-idUSR4N228033 

CANADA STOCKS-TSX futures flat as oil prices slip - Reuters
https://www.reuters.com/article/canada-stocks/canada-stocks-tsx-futures-flat-as-oil-prices-slip-idUSL3N22B29E 

Mnuchin says two rounds of talks may seal U.S.-China deal -Fox Business Network - Reuters
https://www.reuters.com/article/usa-trade-china/mnuchin-says-two-rounds-of-talks-may-seal-u-s-china-deal-fox-business-network-idUSL1N22B06T 

RPT-BUZZ-U.S. stocks weekly: Love 'em or hate 'em - Reuters
https://www.reuters.com/article/idUSL1N22A0AW 

What happens next as Spain's Socialists try to form government

DIARY-Top Economic Events to June 27 - Reuters
https://www.reuters.com/article/diary-top-econ/diary-top-economic-events-to-june-27-idUSL3N22841T 

Brazil's Petrobras details refinery, other asset sale plans - Reuters
https://www.reuters.com/article/petrobras-divestiture/update-2-brazils-petrobras-details-refinery-other-asset-sale-plans-idUSL1N2281S1 

UPDATE 1-Speculators turn net long in Eurodollar for first time since 2016 -CFTC - Reuters
https://www.reuters.com/article/usa-bonds-cftc/update-1-speculators-turn-net-long-in-eurodollar-for-first-time-since-2016-cftc-idUSL1N2281MN 

New Mexico City airport to go in service by mid-2021: official - Reuters
https://www.reuters.com/article/mexico-airport/new-mexico-city-airport-to-go-in-service-by-mid-2021-official-idUSL1N2281KR 

Speculative U.S. 10-year T-note net shorts hit 4-month high -CFTC - Reuters
https://www.reuters.com/article/usa-bonds-cftc/speculative-u-s-10-year-t-note-net-shorts-hit-4-month-high-cftc-idUSAQN00HR4U 

Shareholder

Morgan Stanley sees U.S. second-quarter GDP growth at 1.1%, Goldman view 2.2% - Reuters
https://www.reuters.com/article/usa-economy-morganstanley/update-1-morgan-stanley-sees-us-q2-gdp-growth-at-1-1-goldman-view-2-2-idUSL1N22810G 

Russia to restore oil supplies within two weeks - Deputy PM - Reuters
https://www.reuters.com/article/russia-oil-exports-restart/russia-to-restore-oil-supplies-within-two-weeks-deputy-pm-idUSR4N22602K 

U.S. economy expands 3.2 percent in first quarter; growth details weak - Reuters
https://www.reuters.com/article/usa-economy/wrapup-4-u-s-economy-expands-3-2-percent-in-q1-growth-details-weak-idUSLNSQFEF2R 

European shares get a lift from strong earnings, U.S. GDP - Reuters
https://www.reuters.com/article/europe-stocks/update-3-european-shares-get-a-lift-from-strong-earnings-u-s-gdp-idUSL5N2282J6 

TREASURIES-Yields fall on weak inflation data, shrug off GDP reading - Reuters
https://www.reuters.com/article/usa-bonds/treasuries-yields-fall-on-weak-inflation-

CANADA STOCKS-TSX opens lower as oil rally pauses - Reuters
https://www.reuters.com/article/canada-stocks/canada-stocks-tsx-opens-lower-as-oil-rally-pauses-idUSL3N2283KJ 

U.S. Senate weighs blocking Trump's border emergency gambit - Reuters
https://www.reuters.com/article/us-usa-trump-congress/u-s-senate-weighs-blocking-trumps-border-emergency-gambit-idUSKCN1QH2V1 

Senate confirms ex-coal lobbyist to lead U.S. environment regulator - Reuters
https://www.reuters.com/article/us-usa-epa-wheeler/senate-confirms-ex-coal-lobbyist-to-lead-u-s-environment-regulator-idUSKCN1QH1LA 

Puerto Rico oversight board to appeal appointments ruling - Reuters
https://www.reuters.com/article/us-usa-puertorico/puerto-rico-oversight-board-to-appeal-appointments-ruling-idUSKCN1QH2LU 

Finland's Fortum frees up cash by replacing Nasdaq collateral with bond - Reuters
https://www.reuters.com/article/finland-fortum-nasdaq/finlands-fortum-frees-up-cash-by-replacing-nasdaq-collateral-with-bond-idUSL5N2284XF 

US 

In [17]:
# save all articles to one csv file
file = pd.DataFrame({'Title': title_all, 'Time': time_all, 'Content': content_all, 'Link': url_all})
# the timestamp looks like "May 7, 2019 / 2:34 PM / ..."; keep only the date part
file['Date'] = [x.split('/')[0] for x in file['Time'].tolist()]
file['Date'] = pd.to_datetime(file['Date'])

# drop very short articles (fewer than 600 characters of content)
file['Len'] = [len(x) for x in file['Content']]
file = file[file['Len'] >= 600]

file.to_csv('articles.csv')
file.head(20)

Unnamed: 0,Title,Time,Content,Link,Date,Len
1,Fed's Clarida says there's no good case for ra...,"May 7, 2019 / 2:34 PM / in 3 minutes",2 Min ReadWASHINGTON (Reuters) - The U.S. Fede...,https://www.reuters.com/article/usa-fed-clarid...,2019-05-07,1311
2,Investors most neutral on U.S. Treasuries in f...,"May 7, 2019 / 2:37 PM / a minute ago","1 Min ReadNEW YORK, May 7 (Reuters) - Bond inv...",https://www.reuters.com/article/treasuries-jpm...,2019-05-07,747
4,Fed's Clarida says there's no good case for ra...,"May 7, 2019 / 2:28 PM / Updated 16 minutes ago","1 Min ReadWASHINGTON, May 7 (Reuters) - The U....",https://www.reuters.com/article/usa-fed-clarid...,2019-05-07,677
5,UPDATE 1-Sterling slides to day's low on Brexi...,"May 7, 2019 / 2:27 PM / in a minute",3 Min Read* Graphic: World FX rates in 2019 tm...,https://www.reuters.com/article/britain-sterli...,2019-05-07,2876
6,CANADA STOCKS-TSX falls for second day on U.S....,"May 7, 2019 / 2:28 PM / Updated 16 minutes ago",2 Min ReadMay 7 (Reuters) - Canada’s main stoc...,https://www.reuters.com/article/canada-stocks/...,2019-05-07,2056
7,Wall Street declines on U.S.-China trade tensi...,"May 7, 2019 / 11:08 AM / Updated 13 minutes ago",4 Min Read(Reuters) - U.S. stocks posted broad...,https://www.reuters.com/article/usa-stocks/us-...,2019-05-07,3312
8,Scout24 bidders reach 9.7 percent stake ahead ...,"May 7, 2019 / 2:33 PM / in a minute",1 Min ReadBERLIN (Reuters) - The private equit...,https://www.reuters.com/article/scout24-ag-ma/...,2019-05-07,677
9,Britain will take part in European Parliament ...,"May 7, 2019 / 2:33 PM / a few seconds ago",1 Min ReadLONDON (Reuters) - Britain will have...,https://www.reuters.com/article/britain-eu-ele...,2019-05-07,1173
10,Senate's McConnell to declare 'case closed' on...,"May 7, 2019 / 1:25 PM / in an hour",4 Min ReadWASHINGTON (Reuters) - The divided U...,https://www.reuters.com/article/us-usa-trump-c...,2019-05-07,3435
11,U.S. House panel readies contempt vote against...,"May 6, 2019 / 5:02 AM / Updated 11 hours ago",4 Min ReadWASHINGTON (Reuters) - Congressional...,https://www.reuters.com/article/us-usa-trump-b...,2019-05-06,4048


### TF-IDF

In [6]:
# Getting the word frequency matrix with sklearn
from sklearn.feature_extraction.text import CountVectorizer

corpus = file['Content'].values.tolist()

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
word = vectorizer.get_feature_names()
pd.DataFrame(X.toarray(), columns=word).head()

Unnamed: 0,00,000,0000,001,002,005,006,008,009,01,...,zone,zones,zoom,zooming,zuckerberg,zug,zuma,zurich,zwaan,ﬂat
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer(smooth_idf = False)
tfidf = transformer.fit_transform(X)

df_tfidf = pd.DataFrame(tfidf.toarray(), columns=word)
df_tfidf.head()

Unnamed: 0,00,000,0000,001,002,005,006,008,009,01,...,zone,zones,zoom,zooming,zuckerberg,zug,zuma,zurich,zwaan,ﬂat
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
df_tfidf.sum().reset_index().sort_values([0], ascending = False).head(10)

Unnamed: 0,index,0
15672,the,207.783551
15834,to,107.751337
10990,of,107.130755
1810,and,90.901376
8206,in,89.73571
11058,on,57.047368
6807,for,48.633942
3206,by,48.426135
13751,said,40.649421
11602,percent,39.50566


In [9]:
df_tfidf.sum().reset_index().sort_values([0], ascending = False).tail(10)

Unnamed: 0,index,0
14209,sheffield,0.014541
14330,sibley,0.014541
11322,overy,0.014541
92,0920,0.014541
1762,amelia,0.014541
93,0930,0.014541
11394,panetta,0.014541
7616,haskel,0.014541
259,1630,0.014541
2974,brazier,0.014541


Normally we would expect words such as "the" and "to" to receive zero or near-zero weight from the inverse document frequency, since they appear in almost all articles. However, the imported TfidfTransformer (with smooth_idf = False) computes idf as idf(t) = log [ n / df(t) ] + 1, where n is the number of documents and df(t) is the number of documents containing term t. This means that a word occurring in every document still receives a weight greater than zero, which conflicts with our aim of removing words that carry no meaning. In the following we calculate our own TF-IDF.
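
To make the difference between the two formulas concrete, here is a minimal sketch in plain Python. The corpus size and document frequencies are hypothetical numbers chosen for illustration, not values taken from the Reuters data, and the two helper names are our own labels rather than part of any library.

```python
import math

def idf_sklearn_unsmoothed(n_docs, doc_freq):
    # formula used by TfidfTransformer(smooth_idf=False): ln(n / df) + 1
    return math.log(n_docs / doc_freq) + 1

def idf_plain(n_docs, doc_freq):
    # plain formula: ln(n / df); equals 0 when the term is in every document
    return math.log(n_docs / doc_freq)

n_docs = 1000  # hypothetical corpus size
# hypothetical document frequencies for a ubiquitous term and a rare term
doc_freqs = {"the": 1000, "pemex": 1}

for term, df in doc_freqs.items():
    print(term,
          round(idf_sklearn_unsmoothed(n_docs, df), 4),
          round(idf_plain(n_docs, df), 4))
```

A term found in every document gets a weight of 1 under the first formula and 0 under the second, which is why the plain version is the one that suits our filtering goal.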

In [10]:
# combine the content of all articles to one list
text_all = []
for content in file['Content']:
    text = content.split(' ')
    text = [x.lower() for x in text]
    text_all.append(text)


# calculate term frequency in each article
def computeReviewTFDict(reviews):
    # counts the number of times the word appears in review
    all_TFDict = []
    for review in reviews:
        reviewTFDict = {}
        for word in review:
            if word in reviewTFDict:
                reviewTFDict[word] += 1
            else:
                reviewTFDict[word] = 1
        all_TFDict.append(reviewTFDict)
    
    return all_TFDict

TF = computeReviewTFDict(text_all)
TF_list = [pd.DataFrame(list(doc.values()), index=doc.keys()) for doc in TF]
wfm = pd.concat(TF_list, axis= 1)
wfm = np.transpose(wfm).fillna(0)
wfm.head()




Unnamed: 0,Unnamed: 1,"""avengers:","""case","""disastrous""","""has","""hush","""letting","""one","""other","""raise",...,“you’d,“you’re,“you’ve,“yuan,“zainab,“‘black,“‘sell,“”restoring,…,ﬂat
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
# Inverse document frequency requires the number of documents each word appears in
import math

# number of documents containing each term (document frequency)
doc_freq = wfm.astype(bool).sum(axis=0)

# total number of documents / number of documents with the term in it
idf = len(wfm) / doc_freq

# taking the natural log, so a term found in every document gets idf = log(1) = 0
idf = idf.apply(math.log)

idf = idf.sort_values().reset_index()
idf.head(10)

Unnamed: 0,index,0
0,15,0.0
1,complete,0.0
2,for,0.0
3,2019,0.0
4,exchanges,0.0
5,quotes,0.0
6,a,0.0
7,list,0.0
8,minutes.,0.0
9,of,0.0


In [12]:
idf.tail(10)

Unnamed: 0,index,0
31685,opec+,6.908755
31686,"opec+,",6.908755
31687,opec+.,6.908755
31688,"opec,",6.908755
31689,"collapsed,",6.908755
31690,collaborations.”,6.908755
31691,collaboration,6.908755
31692,collaborating,6.908755
31693,ontario-based,6.908755
31694,benkoe)all,6.908755


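The problem with words appearing in all articles seems to be solved, as words such as "the" and "a" now have zero weight. However, the tail of the list shows that punctuation and symbols attached to words create separate terms (for example "opec+," and "opec,"), so we need to strip them before counting.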
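
As a small illustration of the idea, the sketch below strips punctuation with `str.translate` and splits on whitespace. The extra typographic symbols match the set used in this notebook; the sample sentence and the `tokenize` helper are made up for the example.

```python
import string

# characters to drop: ASCII punctuation plus a few typographic symbols
PUNCT = string.punctuation + "©…”“‘—"
TABLE = str.maketrans("", "", PUNCT)

def tokenize(text):
    # remove punctuation, lowercase, then split on any run of whitespace
    return text.translate(TABLE).lower().split()

# hypothetical sample sentence (note the double space after "agree:")
sample = "OPEC+, Russia agree:  “collaboration” with OPEC."
print(tokenize(sample))
```

Calling `split()` with no argument also discards the empty strings that `split(' ')` produces for consecutive spaces, which may be the source of the unnamed empty column in the frequency matrix above.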

In [18]:
import string
exclude = set(string.punctuation + '©…”“‘—')

# split each article into lowercase tokens, removing punctuation first
text_all = []
for content in file['Content']:
    text = ''.join(ch for ch in content if ch not in exclude)
    text = text.split(' ')
    text = [x.lower() for x in text]
    text_all.append(text)


# calculate term frequency in each article
def computeReviewTFDict(reviews):
    # counts the number of times the word appears in review
    all_TFDict = []
    for review in reviews:
        reviewTFDict = {}
        for word in review:
            if word in reviewTFDict:
                reviewTFDict[word] += 1
            else:
                reviewTFDict[word] = 1
        all_TFDict.append(reviewTFDict)
    
    return all_TFDict

TF = computeReviewTFDict(text_all)
TF_list = [pd.DataFrame(list(doc.values()), index=doc.keys()) for doc in TF]
wfm = pd.concat(TF_list, axis= 1)
wfm = np.transpose(wfm).fillna(0)


# Inverse Document Frequency requires the number of documents each word has appeared in 

import math
total_words = wfm.astype(bool).sum(axis=0).sum()

#Total number of documents / Number of documents with term in it
idf = len(wfm)/wfm.astype(bool).sum(axis=0)

# Taking log
idf = idf.apply(math.log)

idf = idf.sort_values().reset_index()
idf.head(30)




Unnamed: 0,index,0
0,delayed,0.0
1,a,0.0
2,for,0.0
3,all,0.0
4,of,0.0
5,minimum,0.0
6,15,0.0
7,and,0.0
8,min,0.0
9,list,0.0


In [19]:
idf.tail(30)

Unnamed: 0,index,0
20731,danish,6.908755
20732,pemex,6.908755
20733,pencil,6.908755
20734,pencilled,6.908755
20735,penetrated,6.908755
20736,penny,6.908755
20737,dang,6.908755
20738,pensions,6.908755
20739,danforth,6.908755
20740,penciled,6.908755


## Summary and Follow-ups

1. Using TF-IDF, we find that words such as "a" and "for" carry very little meaningful information about the actual content, as they appear in all articles. These words will be excluded from future analysis. 


2. Rarer, high-IDF words such as "indefinitely" and "datadriven" are the terms we are most interested in. We will focus on these words in our sentiment analysis. 


3. Before we continue with this set of words, we will need to address stemming. For example, the words "pension" and "pensions" carry essentially the same meaning, yet they are counted as separate terms. Our text therefore needs further normalization before we construct our new TF-IDF. 
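
The stemming issue in the last point can be sketched with a deliberately naive rule. This is a toy stand-in, not a real stemming algorithm: the `naive_stem` helper and the token list are hypothetical, and in practice a library stemmer (for instance NLTK's `PorterStemmer`) would likely be a better choice.

```python
from collections import Counter

def naive_stem(word):
    # toy rule: strip a trailing "s" from longer words that do not end in "ss"
    # (illustrative only; a real stemmer handles many more suffixes)
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

# hypothetical tokens, two of which should collapse into one term
tokens = ["pension", "pensions", "penny", "loss"]
counts = Counter(naive_stem(t) for t in tokens)
print(counts)
```

With this rule "pension" and "pensions" are counted as a single term, while spelling variants such as "penciled" and "pencilled" would still need a proper stemmer or lemmatizer.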
