<a href="https://colab.research.google.com/github/este7734/Project_DS_Tools/blob/master/Web_Scraper_Reuters_3_A_Aron_Beef.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Dependencies

In [0]:
# Import libraries for processing web text
from bs4 import BeautifulSoup
import requests

from textblob import TextBlob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lxml import html

# Import these dependencies if using Google Colab 
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Define All Functions

In [0]:
# Get content of the webpage in an html string format by passing a url 
def get_html(url):
    page = requests.get(url)
    html_out = html.fromstring(page.content)
    text = page.text
    return html_out, text

In [0]:
# Convert html into soup to enable soup methods
def get_soup(html_string):
    soup = BeautifulSoup(html_string, 'html.parser')
    return soup

In [0]:
# Extract hyperlinks from soup
def get_soup_links(soup):
    links = []
    for link in soup.find_all('a'):
        out_link = link.get('href')
        links.append(out_link)
    return links
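
To show what `get_soup_links` returns, the sketch below runs the same `find_all('a')` loop on a small hand-written HTML string. The markup is made up for this example and is not taken from reuters.com. Anchors without an `href` attribute come back as `None`, so later steps have to tolerate `None` entries.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
sample_html = """
<html><body>
  <a href="/article/us-example/first-story-idUS0001">First story</a>
  <a href="https://www.reuters.com/finance">Finance</a>
  <a name="top">Anchor without href</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Same logic as get_soup_links: one entry per <a> tag, None when href is missing
links = [a.get('href') for a in soup.find_all('a')]
print(links)

# Optional variant: only keep tags that actually carry an href attribute
links_with_href = [a['href'] for a in soup.find_all('a', href=True)]
print(links_with_href)
```

The `href=True` variant removes the `None` entries at the source, which may make later error handling unnecessary.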

In [0]:
# This function is for use only with the Topic pages on reuters.com
# Search through ALL links and keep only those that point to actual articles
# Article links on Topic pages are assumed to be relative paths, so the site domain is prepended
def get_articles_reuters_topics(links, old_url_set):
    articles = []
    for link in links:
        try:
            split_link = link.split('/')
        except AttributeError:
            # <a> tags without an href give None, which has no split method
            continue
        if 'article' in split_link:
            link = 'https://www.reuters.com' + link
            if not url_check(old_url_set, link):
                articles.append(link)
    articles = list(set(articles))
    old_url_set = set(articles + list(old_url_set))
    return articles, old_url_set
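
The filtering step can also be sketched with the standard library's `urllib.parse`. This is an alternative under stated assumptions, not the notebook's own method: `filter_article_links` and `BASE_URL` are names invented here, and the sample links are hypothetical. `urljoin` leaves absolute URLs unchanged, whereas the function above prepends the domain to every matching link.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://www.reuters.com'  # assumed site root

def filter_article_links(links, seen_urls):
    """Return (new_article_urls, updated_seen_urls) for a list of raw hrefs."""
    new_urls = set()
    for href in links:
        if not href:  # skip None and empty strings
            continue
        # Relative paths gain the domain; absolute URLs pass through unchanged
        absolute = urljoin(BASE_URL, href)
        path_parts = urlparse(absolute).path.split('/')
        if 'article' in path_parts and absolute not in seen_urls:
            new_urls.add(absolute)
    return sorted(new_urls), seen_urls | new_urls

# Hypothetical input: a duplicate, a None, a non-article page, and an already-seen URL
raw_links = [
    '/article/us-example/story-idUS0001',
    None,
    '/finance/markets',
    'https://www.reuters.com/article/us-example/story-idUS0002',
    '/article/us-example/story-idUS0001',
]
seen = {'https://www.reuters.com/article/us-example/story-idUS0002'}

new_articles, seen = filter_article_links(raw_links, seen)
print(new_articles)
print(len(seen))
```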

In [0]:
# Check whether a url already exists in old_url_set: return True if it does, False if it does not
# This function is used in the get_articles_reuters_topics function
def url_check(old_url_set, url):
    # Set membership gives the same result as intersecting with a one-element set
    return url in old_url_set

In [0]:
# Get html strings from list of article weblinks
def get_html_reuters(articles):
    soup_list = []
    for article in articles:
        _, text = get_html(article)
        soup = get_soup(text)
        soup_list.append(soup)
    return soup_list

In [0]:
# Break out article_body, article_headline, and article_date from each article in the provided hyperlinks and collect them in a list of dictionaries called: out_list
def get_reuters_elements(soup_list, articles):
    out_list = []
    # Pair each soup with its source link; the link is stored under 'url' in the output
    for article, link in zip(soup_list, articles):
        try:
            article_body = article.find_all('div', {'class': 'StandardArticleBody_body'})
            article_headline = article.find_all('h1', {'class': 'ArticleHeader_headline'})
            article_date = article.find_all('div', {'class': 'ArticleHeader_date'})
            try:
                date_time = article_date[0].text.split(' / ')
                date_in = date_time[0]
                date = format_date(date_in)
                a_time = date_time[1][1:]
            except Exception:
                # Unexpected date layout: fall back to the raw date text for both fields
                date = article_date[0].text
                a_time = article_date[0].text
            headline = article_headline[0].text
            article_p = []
            for item in article_body:
                p_list = item.find_all('p')
                for p in p_list:
                    article_p.append(p.text)
            out_text = ' '.join(article_p)
            out_dict = {'date': date, 'time': a_time, 'source': 'www.reuters.com', 'Title': headline, 'Text': out_text, 'url': link}
            out_list.append(out_dict)
        except Exception:
            print('Unable to decode...skipping article...')
            continue
    return out_list

In [0]:
# Format date for use in the get_reuters_elements function
def format_date(date):
    date_dict = {'January':'1','February':'2','March':'3','April':'4','May':'5','June':'6','July':'7','August':'8','September':'9','October':'10','November':'11','December':'12'}
    split_date = date.split(' ')
    year = split_date[2]
    day_list = split_date[1].split(',')[0]
    day = day_list
    month = date_dict[split_date[0]]
    out_date = year + '-' + month + '-' + day
    return out_date
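
As a possible alternative to the lookup table in `format_date`, the standard library can parse the same `Month D, YYYY` layout directly. The sketch below is an assumption-laden illustration: `to_iso_date` is a name invented here, and `%B` matches English month names only under an English (or default C) locale. Unlike `format_date`, it zero-pads month and day, so the resulting strings sort chronologically.

```python
from datetime import datetime

def to_iso_date(date_text):
    """Convert e.g. 'May 29, 2020' to '2020-05-29' (zero-padded ISO format)."""
    # %B = full month name, %d = day of month, %Y = four-digit year
    parsed = datetime.strptime(date_text.strip(), '%B %d, %Y')
    return parsed.date().isoformat()

print(to_iso_date('May 29, 2020'))
print(to_iso_date('January 5, 2021'))
```

`strptime` raises `ValueError` when the text does not match the pattern, which makes an unexpected date layout easy to detect.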

## Define URL Variables and Run Functions

Step 1. Instantiate `old_url_set` to be used in the `get_articles_reuters_topics` function. This is a running log of article links that will be compiled by iterating from steps 2 - 6.

In [0]:
# Instantiate an empty set to use as a running list of hyperlinks while
# running the scrape iterations
old_url_set = set([])
#old_url_set = set([r'https://www.reuters.com/article/us-space-exploration-spacex-launch/weather-postpones-spacexs-first-astronaut-launch-from-florida-idUSKBN2331B8'])

## Scrape Reuters Topics pages for all the most recent news articles. <font color='orange'>*Run Steps 2 - 7 for each instance of `url` variable, before moving on to the next steps*</font>

<font color='orange'>Step 2.</font> Define variables for each of Reuters main topics pages. Run this cell for each iteration by uncommenting a different url each time.

In [0]:
# Define url variables
# NOTE: You must run these individually through the end of this section
# I didn't have time to figure out how to loop through all of them properly
# There is a section at the very bottom where you can see that I attempted but ran
# into a problem on one of the last functions. 

#url = r'https://www.reuters.com/news/world'
#url = r'https://www.reuters.com/finance/markets'
#url = r'https://www.reuters.com/politics'
#url = r'https://www.reuters.com/breakingviews'
#url = r'https://www.reuters.com/news/lifestyle'
#url = r'https://www.reuters.com/finance'
#url = r'https://www.reuters.com/news/technology'
#url = r'https://www.reuters.com/finance/wealth'
#url = r'https://www.reuters.com/news/archive/wealth-taxes'

<font color='orange'>Step 3.</font> Get HTML string from web `url`

In [0]:
# Pass each instance of the `url` variable to get the web page HTML and convert the result to a string
# Note: get_html returns a tuple (lxml tree, page text), so str() stringifies the whole tuple
html_string = str(get_html(url))

<font color='orange'>Step 4.</font> Get soup from the HTML string

In [0]:
# Pass the HTML string (of the web page) to get its soup
soup = get_soup(html_string)
print(len(soup))
#print(soup)
#print(soup.contents[9])

2


<font color='orange'>Step 5.</font> Extract ALL hyperlinks from the soup

In [0]:
# Find ALL links within the soup
links = get_soup_links(soup)
print(len(links))
#for link in links:
  #print(link)

151


<font color='orange'>Step 6.</font> Filter out only those links that go directly to articles

In [0]:
# Use this for Topics Pages only
# Filter out only those links that are for actual articles. We only want the "good" links
# This filters out things like links to images and advertisements or non-news worthy pages
articles, old_url_set = get_articles_reuters_topics(links, old_url_set)
print(len(articles))
#articles

7


<font color='orange'>Step 7.</font> Print out the running list of hyperlinks to see how many you have

In [0]:
print(len(old_url_set))
#old_url_set

120
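
Steps 2 - 7 repeat the same fetch and filter cycle for each topic page, so they could in principle be wrapped in a loop. The sketch below shows only the accumulation pattern and runs on stand-in data without network access. `collect_article_urls`, `fake_fetch_links`, `fake_filter_links`, and the sample paths are all hypothetical. In the notebook, `fetch_links` might be built from `get_html`, `get_soup`, and `get_soup_links`, and `filter_links` could be `get_articles_reuters_topics`.

```python
def collect_article_urls(topic_urls, fetch_links, filter_links):
    """Run the fetch -> filter cycle once per topic page and return all unique article URLs."""
    seen = set()
    for topic_url in topic_urls:
        links = fetch_links(topic_url)
        # filter_links returns (new_articles, updated_seen), like get_articles_reuters_topics
        _, seen = filter_links(links, seen)
    return seen

# Stand-ins so the sketch runs offline (hypothetical data)
fake_pages = {
    'https://www.reuters.com/news/world': ['/article/a-idUS1', '/article/b-idUS2'],
    'https://www.reuters.com/politics': ['/article/b-idUS2', '/article/c-idUS3'],
}

def fake_fetch_links(url):
    return fake_pages[url]

def fake_filter_links(links, seen):
    new = {'https://www.reuters.com' + link for link in links} - seen
    return sorted(new), seen | new

all_urls = collect_article_urls(list(fake_pages), fake_fetch_links, fake_filter_links)
print(len(all_urls))
```

Passing the fetch and filter functions as arguments keeps the loop testable without touching the live site.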


## <font color='skyblue'>Parse soup from entire list of hyperlinks that you just accumulated</font>

Step 8. Get soup for every link

In [0]:
# Get soup for each one of the "good" links
url_links = list(old_url_set) # Convert the running set of links to a list for use in the following functions

soup_list = get_html_reuters(url_links)
print(len(soup_list))
#print(soup_list[0])

120


Step 9. Parse the soup for each link into `article_body`, `article_headline`, and `article_date`. Create a list of dictionaries, one per web page

In [0]:
# Parse the soup for each "good" link to get article text, title, and date
out_list = get_reuters_elements(soup_list, url_links)
print(len(out_list))
out_list[0:3]

120


[{'Text': '(Reuters) - U.S. stocks finished mostly higher on Friday after President Donald Trump announced measures against China in response to new security legislation that were less threatening to the U.S. economy than investors had feared.  The Dow ended the session slightly lower, but all three indexes registered gains for the month and the week.  The S&P 500 initially extended losses after Trump said he was directing his administration to begin the process of eliminating special treatment for Hong Kong in response to China’s plans to impose new security legislation in the semi-autonomous territory.  But Trump made no mention of any action that could undermine the Phase One trade deal that Washington and Beijing struck early this year, a concern that had cast a cloud over the market throughout the week.  “He began speaking in a very tough tone,” said Chris Zaccarelli, chief investment officer at Independent Advisor Alliance in Charlotte, North Carolina. “The market was worried he 

Step 10. Run `TextBlob` on list of dictionaries to separate all sentences in a single list

In [0]:
blob_sentences = []
for article in out_list:
  blob = TextBlob(article['Text'])
  # Collect every sentence of every article into one flat list
  blob_sentences.extend(blob.sentences)
print(f'You have {len(blob_sentences)} total sentences from {len(out_list)} different articles')

You have 2177 total sentences from 120 different articles


This is just a troubleshooting section to ensure your `TextBlob` came out right. You should see a list of sentences, each starting with the word `Sentence`. Check this before moving to the next step.

In [0]:
# print the first 6 sentences so you see what it looks like
blob_sentences[0:6]

[Sentence("(Reuters) - U.S. stocks finished mostly higher on Friday after President Donald Trump announced measures against China in response to new security legislation that were less threatening to the U.S. economy than investors had feared."),
 Sentence("The Dow ended the session slightly lower, but all three indexes registered gains for the month and the week."),
 Sentence("The S&P 500 initially extended losses after Trump said he was directing his administration to begin the process of eliminating special treatment for Hong Kong in response to China’s plans to impose new security legislation in the semi-autonomous territory."),
 Sentence("But Trump made no mention of any action that could undermine the Phase One trade deal that Washington and Beijing struck early this year, a concern that had cast a cloud over the market throughout the week."),
 Sentence("“He began speaking in a very tough tone,” said Chris Zaccarelli, chief investment officer at Independent Advisor Alliance in Ch

## <font color='skyblue'> Event Detection </font>



Step 11. Determine key words used to search for event of interest in all the articles

In [0]:
# Allow the user to type in key words to search the text for
# Note: the search is case sensitive, so enter each key word exactly as it appears in the articles
# Try entering the following to see some results: corona,COVID,death
filter_list = input("Enter key words to search for separated by commas. don't use spaces.\nSearch is case sensitive.\nExample search: coronavirus,COVID,death\n\n") #.title() # This is still a string... not a list yet

Enter key words to search for separated by commas. don't use spaces.
Search is case sensitive.
Example search: coronavirus,COVID,death

coronavirus,COVID,death


Step 12. Convert keywords into an iterable list for use in the next step


In [0]:
# Split the filter words and convert them into a list for iterating in the next step

f = filter_list.split(",")  # Split the string on commas into a list of key words
f # This is now a list of key words that the user typed in

['coronavirus', 'COVID', 'death']
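
The prompt asks the user not to type spaces because a plain `split(",")` keeps them as part of each key word. A slightly more forgiving parser could look like the sketch below. `parse_keywords` is a hypothetical helper, not part of the notebook.

```python
def parse_keywords(raw):
    """Split a comma-separated string into clean key words."""
    # strip() removes surrounding spaces; empty entries (e.g. a trailing comma) are dropped
    return [word.strip() for word in raw.split(',') if word.strip()]

keywords = parse_keywords('coronavirus, COVID ,death,')
print(keywords)
```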

Step 13. <font color='skyblue'> Search </font> all sentences for the key words and return only those containing a key word



In [0]:
# Instantiate list for holding the sentences
sentences = []

# Generate empty list of lists to store the sentences with your key words in them
for i in range(len(f)):
  sentences.append([])
#print('Here is what you just made, and empty list of lists: ', sentences, '\n')

# Generate lists of sentences for each key word and plug them into the list of lists from above            
for i in range(len(f)):
  for sentence in blob_sentences:
    if f[i] in sentence:
        sentences[i].append(sentence)
        
# Print number of sentences containing each key word
# Print out all sentences containing each key word
for i in range(len(f)):
  print('='*200)   
  print('\nThere are {} sentences containing the word: {} '.format(len(sentences[i]), f[i])) 
  print('-'*200)   
  for sentence in sentences[i]:
      print(sentence)


There are 113 sentences containing the word: coronavirus 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Expectations of a quick economic recovery from the coronavirus pandemic have driven the S&P 500 .SPX up more than 30% from its March lows.
Federal Reserve Chair Jerome Powell, speaking in a webcast organized by Princeton University Friday, reiterated the U.S. central bank’s promise to use its tools to shore up the economy amid the coronavirus pandemic.
WASHINGTON (Reuters) - White House economic adviser Larry Kudlow said on Tuesday that President Donald Trump’s administration is looking carefully at a potential “back to work bonus” to encourage Americans who had been laid off as the coronavirus pandemic spread to return to work.
Kudlow, speaking on Fox News Channel, also said he does not think Congress will approve another $600 pe
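
Because the search above is case sensitive, a key word such as `death` will not match a sentence that starts with `Deaths`. The sketch below shows one way to make case sensitivity optional and to key the results by word rather than by list position. `find_sentences` and the sample sentences are made up for this example. It applies `str()` to each item so that plain strings work, and TextBlob `Sentence` objects are expected to convert the same way.

```python
def find_sentences(sentences, keywords, case_sensitive=True):
    """Map each key word to the list of sentences that contain it."""
    matches = {keyword: [] for keyword in keywords}
    for sentence in sentences:
        text = str(sentence)
        haystack = text if case_sensitive else text.lower()
        for keyword in keywords:
            needle = keyword if case_sensitive else keyword.lower()
            if needle in haystack:
                matches[keyword].append(text)
    return matches

# Hypothetical sentences, for illustration only
sample = [
    'Deaths from COVID-19 rose on Friday.',
    'The Dow ended slightly lower.',
    'Officials reported no new death.',
]

strict = find_sentences(sample, ['COVID', 'death'])
loose = find_sentences(sample, ['COVID', 'death'], case_sensitive=False)
print(len(strict['death']), len(loose['death']))
```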

## <font color='black'>Step 14. Store all article data in a DataFrame </font>  
<font color='grey'>Note: This is not just the filtered data, this contains all information from each web page. While not used in this project, it represents all of the original data used in this run. It can be used for reference.</font> 

In [0]:
# Put parsed data into Pandas DataFrame
pd.DataFrame(out_list)

| | date | time | source | Title | Text | url |
|---|---|---|---|---|---|---|
| 0 | 2020-5-29 | 10:58 AM | www.reuters.com | Wall Street ends mostly up; Trump comments on ... | (Reuters) - U.S. stocks finished mostly higher... | https://www.reuters.com/article/us-usa-stocks/... |
| 1 | 2020-5-26 | 4:04 PM | www.reuters.com | Kudlow says Trump administration looking at 'b... | WASHINGTON (Reuters) - White House economic ad... | https://www.reuters.com/article/us-health-coro... |
| 2 | 2020-5-29 | 11:17 AM | www.reuters.com | GLOBAL MARKETS-Stocks sink as investors await ... | * Graphic: World FX rates in 2020 tmsnrt.rs/2e... | https://www.reuters.com/article/global-markets... |
| 3 | 2020-5-29 | 9:00 PM | www.reuters.com | U.S. warns of Russian bid for Libya stronghold... | WASHINGTON (Reuters) - The U.S. military belie... | https://www.reuters.com/article/us-libya-secur... |
| 4 | 2020-5-29 | 5:50 PM | www.reuters.com | Red light: Mexican coronavirus restart hits sp... | MEXICO CITY (Reuters) - Mexico faces a sluggis... | https://www.reuters.com/article/us-health-coro... |
| ... | ... | ... | ... | ... | ... | ... |
| 115 | 2020-5-29 | 3:32 PM | www.reuters.com | North Carolina Democrats 'dragging their feet'... | (Reuters) - The head of the Republican Nationa... | https://www.reuters.com/article/us-usa-electio... |
| 116 | 2020-5-28 | 4:10 PM | www.reuters.com | The new normal: How safe are beaches? | (Reuters) - People are hitting the beach as co... | https://www.reuters.com/article/us-health-coro... |
| 117 | 2020-5-29 | 5:32 PM | www.reuters.com | Democrats want interviews with Trump admin off... | WASHINGTON (Reuters) - U.S. Democratic lawmake... | https://www.reuters.com/article/us-usa-trump-i... |
| 118 | 2020-5-29 | 7:46 PM | www.reuters.com | GLOBAL MARKETS-Stocks pare losses after Trump'... | (Adds details from Trump statement, gold and o... | https://www.reuters.com/article/global-markets... |
