# Scraping headlines - final

To collect headlines and short summaries I scraped <b>WSJ online</b>, where this information was available without a subscription. This notebook contains all the code needed to go through the process.

I used Selenium to scrape the headlines. This requires a chromedriver executable to be installed, with its path passed to the webdriver. The scraped headlines were written directly to a MongoDB collection as they were collected. The biggest advantage of this was that if the process had ended with an error (in the end this did not happen), the headlines already scraped would not have been lost. It was also possible to check the collected documents in real time in the MongoDB GUI.

The collected headlines still have to go through a cleaning process, because the majority of the articles are not really connected to the company.
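
The cleaning step itself is not part of this notebook. As a rough illustration only, a first relevance filter could keep documents whose headline or summary mentions the company as a whole word. The function name `mentions_company`, the keyword lists and the sample documents below are all made up for this sketch; only the `headline` and `summary` keys match the documents stored by the scraper.

```python
import re


def mentions_company(document, keywords):
    """Return True if any keyword occurs as a whole word in the headline or summary."""
    text = f"{document.get('headline', '')} {document.get('summary', '')}"
    # \b anchors make 'GE' match the ticker but not the letters inside 'LARGE'
    return any(re.search(rf"\b{re.escape(keyword)}\b", text) for keyword in keywords)


# made-up sample documents, shaped like the ones the scraper stores
docs = [
    {'headline': 'Tesla Reports Quarterly Results', 'summary': 'nan'},
    {'headline': 'Investors Await Fed Decision', 'summary': 'Markets were quiet.'},
]

relevant = [doc for doc in docs if mentions_company(doc, ['Tesla'])]
print(len(relevant), 'of', len(docs), 'documents kept')
```

A keyword match like this is only a coarse filter: a headline can mention a company without being about it, so a manual or model-based check would likely still be needed.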

<b>The targeted companies:</b>

* General Electric (GE)
* Goldman Sachs
* Ford Motor
* Tesla
* International Business Machines (IBM)
* Boeing
* JPMorgan
* Microsoft
* General Motors
* Chrysler


## 1. Preparations

### Import libraries and create useful instances

In [1]:
import requests
from selenium import webdriver
import json
import random
import time
import numpy as np  # used below to generate the random sleep times

In [2]:
# generate random sleep times
sleep_time = np.random.normal(loc=15, scale=5, size=100).round(3)  # drawn around a mean of 15 seconds
sleep_time = sleep_time[sleep_time > 5]  # keep only sleep times longer than 5 seconds
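
The same idea can be sketched with only the standard library's `random` module, drawing one delay at a time instead of filtering a pre-generated array. This is an alternative technique (simple rejection sampling), not what the notebook ran. The function name `polite_delay` and the seed are assumptions made for the illustration.

```python
import random


def polite_delay(mean=15.0, std=5.0, minimum=5.0, rng=random):
    """Draw a delay in seconds from N(mean, std), redrawing until it exceeds `minimum`."""
    while True:
        delay = rng.gauss(mean, std)
        if delay > minimum:
            return delay


# seeded generator, only so that the illustration is reproducible
rng = random.Random(42)
delays = [polite_delay(rng=rng) for _ in range(1000)]
average = sum(delays) / len(delays)
print(f'shortest delay: {min(delays):.3f} s, average delay: {average:.3f} s')
```

Because values at or below the minimum are discarded, the average delay ends up slightly above the nominal 15 seconds.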

### Creating a MongoDB database to store the articles

In [22]:
import pymongo

# connect to the local MongoDB server and select the 'news' database
myclient = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
news = myclient['news']

# create a collection if it does not exist yet - example:
# news.create_collection('tesla_news')

# my collections (one per company):
tesla_news = news['tesla_news']
ge_news = news['ge_news']
ibm_news = news['ibm_news']
goldman_news = news['goldman_news']
ford_news = news['ford_news']
boeing_news = news['boeing_news']
microsoft_news = news['microsoft_news']
jpmorgan_news = news['jpmorgan_news']
gm_news = news['gm_news']
chrysler_news = news['chrysler_news']
test_news = news['test_news']

In [21]:
# create the collection if it does not exist yet - example:
news.create_collection('test_news')


Collection(Database(MongoClient(host=['127.0.0.1:27017'], document_class=dict, tz_aware=False, connect=True), 'news'), 'test_news')

### OOP - Class and methods for retrieving headlines and summary texts

In [9]:
class WebNavigator:
    '''
    This class initiates a chrome webdriver.
    '''

    def __init__(self):
        self.driver = webdriver.Chrome('/Users/flatironschool/chromedriver')
        # if there is a modification, the version num helps to track it down
        print('Version2')

    def navigate(self, url):
        self.driver.get(url)

    def get_headlines(self, first_page, collection, num_of_pages=1):
        '''
        This method retrieves the headlines and summaries and saves them to a MongoDB collection.
        -------------------------------------------------------
        Inputs:
            first_page: URL of the first page of the search results
            collection: the MongoDB collection object in which to save the scraped headlines
            num_of_pages: the number of search result pages to scrape
        -------------------------------------------------------
        Returns:
            nothing; every document is saved directly to the MongoDB collection
        '''

        # build the search URLs from the first result page and the number of pages
        if num_of_pages > 1:
            search_urls = [first_page] + [first_page + f'&page={n}' 
                                          for n in range(2, num_of_pages+1)]
        else:
            search_urls = [first_page]

        for url in search_urls:
            self.url = url
            self.driver.get(url)
            # be polite: wait about 15 seconds on average before reading the page
            time.sleep(np.random.choice(sleep_time, 1).item())
            headline_containers = self.driver.find_elements_by_class_name(
                'headline-container')

            for hc in headline_containers:
                # -------------topic (optional)---------------
                try:
                    topic = hc.find_element_by_class_name(
                        'category').find_element_by_tag_name('a').text
                except Exception:
                    topic = 'nan'
                # ------------headline (required)-------------
                try:
                    headline = hc.find_element_by_class_name(
                        'headline').find_element_by_tag_name('a').text
                except Exception:
                    # skip containers without a headline
                    continue
                # --------------summary (optional)-------------
                try:
                    summary = hc.find_element_by_class_name(
                        'summary-container').find_element_by_tag_name('p').text
                except Exception:
                    summary = 'nan'
                # ---------------date (required)---------------
                try:
                    date = hc.find_element_by_tag_name('time').text
                except Exception:
                    # skip containers without a date
                    continue
                # ---------------url (optional)-----------------
                try:
                    url = hc.find_element_by_class_name(
                        'headline').find_element_by_tag_name('a').get_attribute('href')
                except Exception:
                    url = 'nan'

                # ---------insert document into collection-------
                document = {'headline': headline,
                            'summary': summary,
                            'topic': topic,
                            'date': date,
                            'url': url}
                collection.insert_one(document)

    def close(self):
        '''This method closes the window.'''
        self.driver.close()
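
The field handling in `get_headlines` follows one pattern: optional fields fall back to the string `'nan'`, while a missing headline or date causes the whole container to be skipped. A minimal, Selenium-free sketch of that pattern is shown below. The helper names `extract`, `build_document` and `missing`, and the sample values, are hypothetical and only serve the illustration.

```python
MISSING = 'nan'  # same placeholder string the class uses


def extract(getter, default=MISSING):
    """Call `getter` and return its result, or `default` if it raises."""
    try:
        return getter()
    except Exception:  # with Selenium this would typically be NoSuchElementException
        return default


def build_document(getters, required=('headline', 'date')):
    """Return the document dict, or None when a required field is missing."""
    document = {name: extract(getter) for name, getter in getters.items()}
    if any(document[name] == MISSING for name in required):
        return None
    return document


def missing():
    """Stand-in for a lookup that fails because the element is not on the page."""
    raise LookupError('element not found')


# made-up values standing in for the text of located elements
complete = build_document({
    'headline': lambda: 'Example headline',
    'summary': missing,
    'topic': lambda: 'Autos',
    'date': lambda: 'Nov. 22, 2019',
    'url': missing,
})
skipped = build_document({'headline': missing, 'date': lambda: 'Nov. 22, 2019'})
print(complete)
print(skipped)
```

Collecting the getters in a dictionary keeps the required/optional distinction in one place instead of repeating a `try`/`except` block per field.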

In [13]:
def run_search(company, num_of_pages, collection):
    '''
    This function runs the search for one company and scrapes the result pages.
    Input:
        company: the search term (company name), URL-encoded if it contains spaces
        num_of_pages: the number of result pages to scrape
        collection: the MongoDB collection in which to save the documents
    Returns:
        nothing; the collected documents are saved in the given collection
    '''
    search_url=f'https://www.wsj.com/search/term.html?KEYWORDS={company}&min-date=2010/01/01&max-date=2019/11/22&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle'
    
    webpage = WebNavigator()
    try:
        webpage.get_headlines(search_url, collection, num_of_pages=num_of_pages)
    except Exception:
        print('error: ', webpage.url)
    webpage.close()
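
Search terms with spaces have to be URL-encoded before they are placed in the query string (for example `General%20Motors`). A small sketch of doing this with `urllib.parse.quote` is shown below. The template is copied from `run_search`; the names `SEARCH_TEMPLATE` and `build_search_url` are assumptions introduced for the illustration.

```python
from urllib.parse import quote

# query string copied from run_search, with the search term left as a placeholder
SEARCH_TEMPLATE = (
    'https://www.wsj.com/search/term.html?KEYWORDS={keywords}'
    '&min-date=2010/01/01&max-date=2019/11/22&isAdvanced=true'
    '&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle'
)


def build_search_url(company):
    """Return the first-page search URL for a plain (not yet encoded) company name."""
    # quote() percent-encodes unsafe characters, e.g. a space becomes %20
    return SEARCH_TEMPLATE.format(keywords=quote(company))


print(build_search_url('General Motors'))
```

With a helper like this the company name should be passed unencoded: a value that already contains `%20` would be encoded a second time.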

## 2. Getting the headline/summary text from the search results



The advanced search was set up manually on the website, where it was possible to identify the first result page and the total number of result pages. Each search result page contains 20 headlines, each with its topic category, date, short summary and URL.

The "search_url" variable in "run_search" contains the URL of the first page of the search results.
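
The list of page URLs that `get_headlines` visits can be sketched as a small pure function: the first page is used as is, and every further page appends `&page=n`. The function name `build_page_urls` and the example URL are assumptions for this sketch; the URL pattern is the one used in the class.

```python
def build_page_urls(first_page, num_of_pages=1):
    """Return the URLs of all result pages, starting with the first one."""
    # range(2, num_of_pages + 1) is empty when there is a single page,
    # so no separate branch is needed for that case
    return [first_page] + [f'{first_page}&page={n}'
                           for n in range(2, num_of_pages + 1)]


# placeholder URL, only for the illustration
urls = build_page_urls('https://example.com/search?KEYWORDS=Tesla', num_of_pages=3)
for url in urls:
    print(url)
```

With 20 headlines per page, the number of pages gives an upper bound of `20 * num_of_pages` documents per search.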

#### Tesla:

In [42]:
pages = 144
collection = tesla_news
run_search('Tesla', pages, collection)

Version2


#### GE:

In [50]:
pages = 146
collection = ge_news
run_search('GE', pages, collection)

Version2


#### IBM:

In [12]:
pages = 101
collection = ibm_news
run_search('IBM', pages, collection)

Version2


#### Ford:

In [18]:
pages = 515
collection = ford_news
run_search('Ford', pages, collection)

Version2


#### Goldman Sachs:

In [23]:
pages = 702
collection = goldman_news
run_search('Goldman', pages, collection)

Version2


#### Boeing:

In [32]:
pages = 348
collection = boeing_news
run_search('Boeing', pages, collection)

Version2


#### JPMorgan

In [None]:
pages = 104
collection = jpmorgan_news
run_search('JPMorgan', pages, collection)

#### Microsoft

In [None]:
pages = 429
collection = microsoft_news
run_search('Microsoft', pages, collection)

#### General Motors

In [14]:
pages = 491
collection = gm_news
run_search('General%20Motors', pages, collection)

Version2


#### Chrysler

In [19]:
pages = 181
collection = chrysler_news
run_search('Chrysler', pages, collection)

Version2


#### Test news

In [25]:
search_url = 'https://www.wsj.com/search/term.html?KEYWORDS=General%20Motors%20Tesla%20IBM%20Ford%20GE%20Boeing%20Chrysler%20Microsoft&min-date=2019/11/23&max-date=2019/12/04&isAdvanced=true&daysback=90d&andor=OR&sort=date-desc&source=wsjarticle'
collection = test_news
num_of_pages = 27
webpage = WebNavigator()
try:
    webpage.get_headlines(search_url, collection, num_of_pages=num_of_pages)
except Exception:
    print('error: ', webpage.url)
webpage.close()

Version2


## 3. Check the number of documents in MongoDB

In [20]:
print('Tesla', tesla_news.estimated_document_count())
print('GE', ge_news.estimated_document_count())
print('IBM', ibm_news.estimated_document_count())
print('Ford', ford_news.estimated_document_count())
print('Goldman', goldman_news.estimated_document_count())
print('Boeing', boeing_news.estimated_document_count())
print('JPMorgan', jpmorgan_news.estimated_document_count())
print('Microsoft', microsoft_news.estimated_document_count())
print('General Motors', gm_news.estimated_document_count())
print('Chrysler', chrysler_news.estimated_document_count())

Tesla 2735
GE 2838
IBM 1983
Ford 10015
Goldman 12927
Boeing 6529
JPMorgan 2071
Microsoft 8514
General Motors 9415
Chrysler 3529
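
As a quick sanity check, the counts can be compared with the theoretical maximum of 20 headlines per result page. The sketch below uses the page numbers and document counts copied from the cells above. The names `results` and `coverage` are introduced only for this illustration.

```python
HEADLINES_PER_PAGE = 20  # as stated in section 2

# company: (result pages requested, documents counted in MongoDB)
results = {
    'Tesla': (144, 2735),
    'GE': (146, 2838),
    'IBM': (101, 1983),
    'Ford': (515, 10015),
    'Goldman': (702, 12927),
    'Boeing': (348, 6529),
    'JPMorgan': (104, 2071),
    'Microsoft': (429, 8514),
    'General Motors': (491, 9415),
    'Chrysler': (181, 3529),
}


def coverage(pages, count, per_page=HEADLINES_PER_PAGE):
    """Fraction of the theoretical maximum (pages * per_page) that was stored."""
    return count / (pages * per_page)


for company, (pages, count) in results.items():
    maximum = pages * HEADLINES_PER_PAGE
    print(f'{company:15s} {count:6d} / {maximum:6d} = {coverage(pages, count):.1%}')
```

Every collection stays below its maximum. One plausible explanation, not verified here, is that `get_headlines` skips containers without a headline or a date, and that the last result page may hold fewer than 20 headlines.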
