# Gather HTML from Newsrooms

After confirming all of the newsroom links, the next step is to work out how best to iterate through the pages or tabs of each company's newsroom and collect the HTML from each one. This HTML contains the links to the press releases, which can then be used to gather the press release text for modeling.

In this notebook, I limit the number of companies I am working with to just the top five Fortune 100 companies. However, I have included several different code blocks that future iterations of this project can use and expand upon to more easily include more companies.

## Imports

In [1]:
import pandas as pd

from tqdm import tqdm
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

import warnings
warnings.filterwarnings('ignore')

## Read in the data

In [2]:
# read in the data
cos = pd.read_csv('./data/fortune_100_data_w_links.csv')

Because this project is a proof of concept, I am limiting the DataFrame to just the top five Fortune 100 companies. 

In [3]:
cos = cos[cos['rank'] <= 5]
cos

Unnamed: 0,company,rank,fortune_link,co_website,newsroom_link,pressroom_link,corporate_link,final
0,Walmart,1,https://fortune.com/company/walmart/fortune500/,https://corporate.walmart.com,https://corporate.walmart.com/newsroom/2021/03...,https://www.diabetes.org/newsroom/press-releas...,https://corporate.walmart.com/#,https://corporate.walmart.com/newsroom/company...
1,Amazon,2,https://fortune.com/company/amazon-com/fortune...,https://www.amazon.com/,https://www.amazon.com/gp/customer-preferences...,https://www.amazon.com/ref=nav_logo_prime,https://www.amazon.com/Stonebriar-Decorative-S...,https://press.aboutamazon.com/press-releases
2,Exxon Mobil,3,https://fortune.com/company/exxon-mobil/fortun...,https://www.exxonmobil.com/,https://corporate.exxonmobil.com/About-us/Busi...,https://corporate.exxonmobil.com/News/Newsroom...,https://corporate.exxonmobil.com/#main-content,https://corporate.exxonmobil.com/News/Newsroom...
3,Apple,4,https://fortune.com/company/apple/fortune500/,https://www.apple.com/,https://www.apple.com/apple-news/,https://www.apple.com/us/shop/goto/temporary_c...,https://www.apple.com/us/shop/goto/trade_in,https://www.apple.com/newsroom/archive/
4,CVS Health,5,https://fortune.com/company/cvs-health/fortune...,https://www.cvshealth.com/,https://www.cvshealth.com/news-and-insights/to...,https://www.cvshealth.com/news-and-insights/pr...,https://www.cvshealth.com/social-responsibilit...,https://www.cvshealth.com/news-and-insights/pr...


## Adding the `loop_url`, `type` and `page_type` columns 

Part of the overall process not seen in these notebooks is determining how each company website works. For instance, Walmart's newsroom has pages that can be iterated through, while Amazon's press releases are organized in a long list by year, with each year getting its own tab in its newsroom.

For some of the companies not included in the reduced list, the structure of the page URLs doesn't follow the same pattern.

After examining the newsrooms for the top five Fortune 100 companies, I've saved the additional information needed to get us one step closer to our end goal in `company_loop_info.csv`.

Below are the descriptions for what each column contains:

* **company**: The name of the company.
* **loop_url**: This column is the base of the url that the code can use to iterate through.
* **type**: The category of iteration used. For the top five companies, the types are `pages` and `years`.
* **page_type**: This is used in a function created below to return the right ending as the code loops through the values.

In [4]:
loops = pd.read_csv('./data/company_loop_info.csv')
loops

Unnamed: 0,company,loop_url,type,page_type
0,Walmart,https://corporate.walmart.com/newsroom/company...,pages,page
1,Amazon,https://press.aboutamazon.com/press-releases?a...,years,year
2,Exxon Mobil,https://corporate.exxonmobil.com/api/v2/relate...,pages,page
3,Apple,https://www.apple.com/newsroom/archive/?page=,pages,page
4,CVS Health,https://www.cvshealth.com/news-and-insights/pr...,pages,page


In [5]:
cos = cos.merge(loops, on='company')
cos

Unnamed: 0,company,rank,fortune_link,co_website,newsroom_link,pressroom_link,corporate_link,final,loop_url,type,page_type
0,Walmart,1,https://fortune.com/company/walmart/fortune500/,https://corporate.walmart.com,https://corporate.walmart.com/newsroom/2021/03...,https://www.diabetes.org/newsroom/press-releas...,https://corporate.walmart.com/#,https://corporate.walmart.com/newsroom/company...,https://corporate.walmart.com/newsroom/company...,pages,page
1,Amazon,2,https://fortune.com/company/amazon-com/fortune...,https://www.amazon.com/,https://www.amazon.com/gp/customer-preferences...,https://www.amazon.com/ref=nav_logo_prime,https://www.amazon.com/Stonebriar-Decorative-S...,https://press.aboutamazon.com/press-releases,https://press.aboutamazon.com/press-releases?a...,years,year
2,Exxon Mobil,3,https://fortune.com/company/exxon-mobil/fortun...,https://www.exxonmobil.com/,https://corporate.exxonmobil.com/About-us/Busi...,https://corporate.exxonmobil.com/News/Newsroom...,https://corporate.exxonmobil.com/#main-content,https://corporate.exxonmobil.com/News/Newsroom...,https://corporate.exxonmobil.com/api/v2/relate...,pages,page
3,Apple,4,https://fortune.com/company/apple/fortune500/,https://www.apple.com/,https://www.apple.com/apple-news/,https://www.apple.com/us/shop/goto/temporary_c...,https://www.apple.com/us/shop/goto/trade_in,https://www.apple.com/newsroom/archive/,https://www.apple.com/newsroom/archive/?page=,pages,page
4,CVS Health,5,https://fortune.com/company/cvs-health/fortune...,https://www.cvshealth.com/,https://www.cvshealth.com/news-and-insights/to...,https://www.cvshealth.com/news-and-insights/pr...,https://www.cvshealth.com/social-responsibilit...,https://www.cvshealth.com/news-and-insights/pr...,https://www.cvshealth.com/news-and-insights/pr...,pages,page
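
Because `merge()` defaults to an inner join, a company that is missing from `company_loop_info.csv` would drop out of `cos` without any warning. The sketch below shows one way to surface that, using small made-up frames (`cos_demo` and `loops_demo` are hypothetical stand-ins, not the project's data) together with the `validate` and `indicator` arguments of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical stand-ins for `cos` and `loops`; Apple is deliberately left out
# of the second frame to show what a missing row looks like.
cos_demo = pd.DataFrame({"company": ["Walmart", "Amazon", "Apple"], "rank": [1, 2, 4]})
loops_demo = pd.DataFrame({"company": ["Walmart", "Amazon"], "type": ["pages", "years"]})

# how="left" keeps every company; validate raises if a company name is
# duplicated on either side; indicator adds a `_merge` column.
checked = cos_demo.merge(
    loops_demo, on="company", how="left", validate="one_to_one", indicator=True
)

# Companies that have no loop information yet
missing = checked.loc[checked["_merge"] == "left_only", "company"].tolist()
print(missing)
```

An empty `missing` list would mean every company has loop information and the inner merge above loses nothing.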


## Get HTML for companies with `type` == `pages`

I decided to split up this part of the data collection by `type` in order to keep the code blocks shorter and more manageable, rather than try to cram all of the code into one long block. Additionally, I believe this will make the code and these notebooks more adaptable for future iterations of this project that include more companies. 

In [6]:
pages = cos[cos['type'] == 'pages'].reset_index(drop= True)
pages

Unnamed: 0,company,rank,fortune_link,co_website,newsroom_link,pressroom_link,corporate_link,final,loop_url,type,page_type
0,Walmart,1,https://fortune.com/company/walmart/fortune500/,https://corporate.walmart.com,https://corporate.walmart.com/newsroom/2021/03...,https://www.diabetes.org/newsroom/press-releas...,https://corporate.walmart.com/#,https://corporate.walmart.com/newsroom/company...,https://corporate.walmart.com/newsroom/company...,pages,page
1,Exxon Mobil,3,https://fortune.com/company/exxon-mobil/fortun...,https://www.exxonmobil.com/,https://corporate.exxonmobil.com/About-us/Busi...,https://corporate.exxonmobil.com/News/Newsroom...,https://corporate.exxonmobil.com/#main-content,https://corporate.exxonmobil.com/News/Newsroom...,https://corporate.exxonmobil.com/api/v2/relate...,pages,page
2,Apple,4,https://fortune.com/company/apple/fortune500/,https://www.apple.com/,https://www.apple.com/apple-news/,https://www.apple.com/us/shop/goto/temporary_c...,https://www.apple.com/us/shop/goto/trade_in,https://www.apple.com/newsroom/archive/,https://www.apple.com/newsroom/archive/?page=,pages,page
3,CVS Health,5,https://fortune.com/company/cvs-health/fortune...,https://www.cvshealth.com/,https://www.cvshealth.com/news-and-insights/to...,https://www.cvshealth.com/news-and-insights/pr...,https://www.cvshealth.com/social-responsibilit...,https://www.cvshealth.com/news-and-insights/pr...,https://www.cvshealth.com/news-and-insights/pr...,pages,page


Another way I've made this code more flexible is by creating functions that return the end of the URL for each step of the iteration. Other `page_type`s aren't necessarily as straightforward as adding the page number as a string at the end of a URL.

In [7]:
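# create a function that will return the appropriate page ending

def get_page_ending(i, page_type):

    # a plain page number is appended to the url as a string
    if page_type == 'page':
        return str(i)

    # fail loudly instead of silently returning None for an unknown type
    raise ValueError(f'Unknown page_type: {page_type}')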
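
As more companies are added, one possible way to extend this function is a lookup table that maps each `page_type` to a small formatting function. This is only a sketch: `PAGE_ENDINGS`, `get_page_ending_sketch`, and the `offset` and `page_slash` types are hypothetical names for illustration, since only `page` appears in `company_loop_info.csv`.

```python
# Hypothetical lookup table; only 'page' is a type used in this notebook.
PAGE_ENDINGS = {
    "page": lambda i: str(i),            # e.g. ...?page=3
    "offset": lambda i: str(i * 10),     # assumed: sites that page by result offset, 10 per page
    "page_slash": lambda i: f"{i}/",     # assumed: sites that expect a trailing slash
}


def get_page_ending_sketch(i, page_type):
    """Return the URL ending for iteration value `i`, or raise for unknown types."""
    try:
        return PAGE_ENDINGS[page_type](i)
    except KeyError:
        raise ValueError(f"Unknown page_type: {page_type!r}") from None


print(get_page_ending_sketch(3, "page"))
print(get_page_ending_sketch(3, "offset"))
```

Supporting a new URL pattern then means adding one entry to the table rather than another `if` branch.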

In [8]:
# set up the webdriver
options = webdriver.ChromeOptions()
options.page_load_strategy = 'normal'
options.add_argument('headless')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
browser = webdriver.Chrome(options=options)
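browser.execute_cdp_cmd(
    'Network.setUserAgentOverride', {
        # adjacent string literals are joined, so the user agent
        # contains no stray indentation whitespace
        "userAgent":
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/83.0.4103.53 Safari/537.36'
    })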

# loop through each row in the `pages` dataframe
for row in range(len(pages)):

    # create a list that can be appended to
    htmls = []
    
    # get the page type as a variable
    page_type = pages.loc[row, 'page_type']
    
    # get the url
    url = pages.loc[row, 'loop_url']

    for i in tqdm(range(50)):
            
        try:
            # create a dictionary that we can add to
            page_html = {}
            
            # create a variable for the end of the page url, 
            # calling on the previously created function
            ending = get_page_ending(i, page_type)
            
            # add the page number to the end of the url
            page_url = url + ending

            # load the page and give it time to render
            browser.get(page_url)
            time.sleep(5)
            
            # scroll halfway down the page in case content loads lazily
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
            
            # add information for each row in case needed later
            page_html['company'] = pages.loc[row, 'company']
            page_html['base_url'] = pages.loc[row, 'final']
            page_html['url'] = page_url
            page_html['page_num'] = i

            # add the html to the dictionary
            page_html['html'] = browser.page_source

            # append the dictionary for this page to the list
            htmls.append(page_html)
            time.sleep(3)
        
        # catch ordinary errors only, so that Ctrl+C can still stop the loop
        except Exception as e:
            print()
            print(f"Company: {pages.loc[row, 'company']} | Web page: {i} | Page type: {pages.loc[row,'page_type']} | Status: Error ({e})")
            
    # create a dataframe and save locally
    html_df = pd.DataFrame(htmls)
    html_df.to_csv(f'./data/html/{pages.loc[row,"company"].replace(" ","_").lower()}_html.csv',index=False)

# close the browser once every company has been scraped
browser.quit()


100%|██████████| 50/50 [07:25<00:00,  8.90s/it]
100%|██████████| 50/50 [07:16<00:00,  8.73s/it]
100%|██████████| 50/50 [07:45<00:00,  9.31s/it]
100%|██████████| 50/50 [08:41<00:00, 10.42s/it]
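
The loop above always requests 50 pages per company, whether or not the newsroom has that many. One possible refinement is to stop once a page comes back with content that has already been seen. The sketch below assumes a `fetch` callable that returns the HTML for page `i`; `collect_until_repeat` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the Selenium calls so that the example runs without a browser.

```python
import hashlib


def collect_until_repeat(fetch, max_pages=50):
    """Collect pages until one repeats exactly or `max_pages` is reached."""
    seen = set()
    pages = []
    for i in range(max_pages):
        html = fetch(i)
        # Hash the page so only a short digest is kept for comparison
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen:
            break
        seen.add(digest)
        pages.append({"page_num": i, "html": html})
    return pages


def fake_fetch(i):
    # Pretend the site has only pages 0-2 and keeps serving the last one
    return f"<html>page {min(i, 2)}</html>"


collected = collect_until_repeat(fake_fetch)
print(len(collected))
```

Exact hashing only helps when a site returns byte-identical HTML past its last page. Pages that embed timestamps or session tokens would need a looser comparison, such as comparing the extracted press release links.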


## Get HTML for companies with `type` == `years`

In [9]:
years = cos[cos['type'] == 'years'].reset_index(drop= True)
years

Unnamed: 0,company,rank,fortune_link,co_website,newsroom_link,pressroom_link,corporate_link,final,loop_url,type,page_type
0,Amazon,2,https://fortune.com/company/amazon-com/fortune...,https://www.amazon.com/,https://www.amazon.com/gp/customer-preferences...,https://www.amazon.com/ref=nav_logo_prime,https://www.amazon.com/Stonebriar-Decorative-S...,https://press.aboutamazon.com/press-releases,https://press.aboutamazon.com/press-releases?a...,years,year


As with `get_page_ending()` above, I created a function for the year endings, because some of them are formatted differently than just the year as a string. Creating the function now makes this code more adaptable for future iterations.

In [10]:
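def get_year_ending(i, page_type):

    # a plain year is appended to the url as a string
    if page_type == 'year':
        return str(i)

    # fail loudly instead of silently returning None for an unknown type
    raise ValueError(f'Unknown page_type: {page_type}')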
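
Since `get_page_ending()` and `get_year_ending()` share the same signature, the scraping loop used for `pages` could in principle take the ending function and the values to iterate over as parameters, so that one loop serves both types. The following is a rough sketch of that idea, not the notebook's actual code: `collect_html`, `FakeBrowser`, `year_ending`, and the `example.com` URLs are hypothetical, and `FakeBrowser` imitates only the `get()` method and `page_source` attribute of a Selenium driver.

```python
import time


def collect_html(browser, company, base_url, loop_url, values, ending_fn, page_type, pause=0.0):
    """Visit loop_url + ending for each value and return one dict per page."""
    rows = []
    for i in values:
        page_url = loop_url + ending_fn(i, page_type)
        try:
            browser.get(page_url)
            time.sleep(pause)  # the loops in this notebook pause for several seconds
            rows.append({
                "company": company,
                "base_url": base_url,
                "url": page_url,
                "page_num": i,
                "html": browser.page_source,
            })
        except Exception as err:
            print(f"Company: {company} | Web page: {i} | Status: Error ({err})")
    return rows


class FakeBrowser:
    """Stand-in for a Selenium driver so the sketch runs without Chrome."""

    def get(self, url):
        self.page_source = f"<html>{url}</html>"


def year_ending(i, page_type):
    return str(i)


rows = collect_html(
    FakeBrowser(), "Example Co", "https://example.com/news",
    "https://example.com/news?year=", range(2019, 2022), year_ending, "year",
)
print([r["url"] for r in rows])
```

With a real driver, the same function could be called once with `range(50)` and `get_page_ending`, and once with a range of years and `get_year_ending`.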

In [11]:
# set up the webdriver
options = webdriver.ChromeOptions()
options.page_load_strategy = 'normal'
options.add_argument('headless')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
browser = webdriver.Chrome(options=options)
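browser.execute_cdp_cmd(
    'Network.setUserAgentOverride', {
        # adjacent string literals are joined, so the user agent
        # contains no stray indentation whitespace
        "userAgent":
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/83.0.4103.53 Safari/537.36'
    })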

# loop through each row in the `years` dataframe
for row in range(len(years)):

    # create a list that can be appended to
    htmls = []
    
    # get the page type as a variable
    page_type = years.loc[row, 'page_type']
    
    # get the url
    url = years.loc[row, 'loop_url']
    
    for i in tqdm(range(2019,2022)):
            
        try:
            # create a dictionary that we can add to
            page_html = {}
            
            # create a variable for the end of the page url, 
            # calling on the previously created function
            ending = get_year_ending(i, page_type)
            
            # add the year to the end of the url
            page_url = url + ending

            # load the page and give it time to render
            browser.get(page_url)
            time.sleep(5)
            
            # scroll halfway down the page in case content loads lazily
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
            
            # add information for each row in case needed later
            page_html['company'] = years.loc[row, 'company']
            page_html['base_url'] = years.loc[row, 'final']
            page_html['url'] = page_url
            page_html['page_num'] = i

            # add the html to the dictionary
            page_html['html'] = browser.page_source

            # append the dictionary for this page to the list
            htmls.append(page_html)
            time.sleep(3)
        
        # catch ordinary errors only, so that Ctrl+C can still stop the loop
        except Exception as e:
            print()
            print(f"Company: {years.loc[row, 'company']} | Web page: {i} | Page type: {years.loc[row,'page_type']} | Status: Error ({e})")
            
    # create a dataframe and save locally
    html_df = pd.DataFrame(htmls)
    html_df.to_csv(f'./data/html/{years.loc[row,"company"].replace(" ","_").lower()}_html.csv',index=False)

# close the browser once every company has been scraped
browser.quit()


100%|██████████| 3/3 [00:32<00:00, 10.94s/it]
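
Both loops write to `./data/html/`. As far as I know, `to_csv` does not create missing directories, so the folder needs to exist before the loops run. Below is a minimal sketch of building the output path and creating the folder first. `html_csv_path` is a hypothetical helper, and a temporary directory is used here so that the example leaves nothing behind.

```python
from pathlib import Path
import tempfile


def html_csv_path(company, out_dir):
    """Create `out_dir` if needed and return the CSV path for `company`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Same naming rule as the loops above: spaces to underscores, lowercase
    return out_dir / f"{company.replace(' ', '_').lower()}_html.csv"


with tempfile.TemporaryDirectory() as tmp:
    path = html_csv_path("Exxon Mobil", Path(tmp) / "data" / "html")
    created = path.parent.is_dir()

print(path.name)
print(created)
```

In the notebook, the returned path could be passed straight to `html_df.to_csv(path, index=False)`.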
