# Find the Newsroom URLs

After collecting information on the Fortune 100 companies from Fortune's website, the next step is to find the link to each company's newsroom. Companies usually host their press releases on a page labeled Newsroom, or sometimes Press Room. In some instances, a company doesn't link to its newsroom from its main website at all and instead makes the link available on its Corporate page.

To find the page where each company keeps its press releases, the code in this notebook gathers all of the links on each company website collected in the first notebook, then scores each link on its similarity to "news", "press" and "corporate" to help me judge how likely it is to be the newsroom link.
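As a rough illustration of that scoring idea, the sketch below rates a few made-up links against the keywords. It is not the notebook's method: `window_score` is a hypothetical stand-in built on the standard library's `difflib`, and it only approximates what `fuzz.partial_ratio` does by comparing the keyword with every same-length slice of the URL. The `example.com` links are invented for the example.

```python
from difflib import SequenceMatcher


def window_score(keyword, url):
    """Score 0-100 for how well `keyword` matches its best-matching slice of `url`.

    Hypothetical helper: a simplified stand-in for fuzzy partial matching.
    """
    # assumption for this sketch: a missing href scores 0
    if not url:
        return 0
    keyword, url = keyword.lower(), url.lower()
    if len(url) <= len(keyword):
        return round(100 * SequenceMatcher(None, keyword, url).ratio())
    best = 0.0
    # slide a keyword-sized window across the url and keep the best match
    for start in range(len(url) - len(keyword) + 1):
        window = url[start:start + len(keyword)]
        best = max(best, SequenceMatcher(None, keyword, window).ratio())
    return round(100 * best)


# made-up links, for illustration only
example_links = [
    'https://www.example.com/newsroom/',
    'https://www.example.com/careers/',
    'https://www.example.com/press-releases/',
]

scores = [
    {'link': link,
     'news': window_score('news', link),
     'press': window_score('press', link)}
    for link in example_links
]

# the highest-scoring link for each keyword
best_news = max(scores, key=lambda row: row['news'])['link']
best_press = max(scores, key=lambda row: row['press'])['link']
```

A link that contains the keyword verbatim gets the maximum score, which is why exact substrings such as `newsroom` rise to the top while unrelated pages score lower.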

## Imports

In [1]:
import pandas as pd

from tqdm import tqdm
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

from fuzzywuzzy import fuzz

import warnings
warnings.filterwarnings('ignore')

In [2]:
# read in the data
cos = pd.read_csv('./data/fortune_100_data.csv')
cos.head()

Unnamed: 0,company,rank,fortune_link,co_website
0,Walmart,1,https://fortune.com/company/walmart/fortune500/,https://corporate.walmart.com
1,Amazon,2,https://fortune.com/company/amazon-com/fortune...,https://www.amazon.com/
2,Exxon Mobil,3,https://fortune.com/company/exxon-mobil/fortun...,https://www.exxonmobil.com/
3,Apple,4,https://fortune.com/company/apple/fortune500/,https://www.apple.com/
4,CVS Health,5,https://fortune.com/company/cvs-health/fortune...,https://www.cvshealth.com/


## Finding potential newsrooms from corporate websites

As I found when scraping the Fortune website, many companies use JavaScript on the main pages of their websites, which makes the `requests` library unreliable for retrieving their links. To work around this again, I've used `Selenium` to gather the links from each company's website.

The code below visits each `co_website` link collected in the previous notebook and scrapes the HTML for links. It then scores each link with `fuzzywuzzy` on its similarity to 'news', 'press' and 'corporate' to account for differences in how companies refer to their newsroom pages. Finally, after keeping only the links that contain the company's base name, it pulls the highest-scoring link for each of the three keywords into the `cos` DataFrame.

In [3]:
# for any companies that simply don't work, catch them in 
# this list to review later
error_cos = []

# prepare the options for the chrome driver
options = webdriver.ChromeOptions()

# making headless so as not to bombard my screen
options.add_argument('headless')

# getting around website features that stop bots
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

# start chrome browser
browser = webdriver.Chrome(options=options)

# getting around website features that stop bots
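browser.execute_cdp_cmd(
    'Network.setUserAgentOverride', {
        # adjacent string literals are joined, so the indentation
        # does not end up inside the user agent string
        "userAgent":
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/83.0.4103.53 Safari/537.36'
    })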

# iterate through each of the companies in the dataframe
# use multiple try/except to be able to continue gathering data
# for other websites even if the website doesn't work on the first pass

# create different print statements in each of the excepts
# to be able to troubleshoot later, if needed

for i in tqdm(range(len(cos))):
    try:
        # get the site url
        site_url = cos.loc[i, 'co_website']
        # find the base of the url to use later, e.g. 'walmart'
        # from 'https://corporate.walmart.com'
        base = site_url.split('.')[1]

        try:
            # open the website & sleep to avoid security features
            browser.get(f'{site_url}')
            time.sleep(2)

        except Exception:
            print(f'{cos.loc[i, "company"]} browser')
            print(f'{browser.current_url}, {browser.title}')

        # get all the links from the main website
        links = browser.find_elements(By.TAG_NAME, 'a')

        # create list to put all the links from the website into
        site_links = []

        # iterate through all of the links and add to a dictionary
        # that will be added to `site_links`
        for l in links:
            link_info = {}
            url = l.get_attribute('href')

            link_info['link'] = url

            # use fuzzywuzzy package to assess the similarity of each link to a
            # string to find the newsroom, pressroom, and corporate website links
            link_info['news_ratio'] = fuzz.partial_ratio('news', url)
            link_info['press_ratio'] = fuzz.partial_ratio('press', url)
            link_info['corporate_ratio'] = fuzz.partial_ratio('corporate', url)
            try:
                link_info['url_len'] = len(url)
            except TypeError:
                # the href was missing (None), so there is no length to record
                pass

            # append link info to the `site_links` list
            site_links.append(link_info)

        # putting the site links into a data frame
        site_links_df = pd.DataFrame(site_links).drop_duplicates().dropna()
        site_links_df = site_links_df[site_links_df['link'].str.contains(base)]

        # getting dataframes for each of the series of links and resetting at
        # the top of each loop to avoid links from one website being put into
        # another company's row
        try:
            news_link_df = None
            news_link_df = site_links_df[
                site_links_df['news_ratio'] >
                site_links_df['news_ratio'].mean()].sort_values(
                    'news_ratio', ascending=False).reset_index(drop=True)
        except Exception:
            print(f'{cos.loc[i, "company"]} news_link_df')
        try:
            press_link_df = None
            press_link_df = site_links_df[
                site_links_df['press_ratio'] >
                site_links_df['press_ratio'].mean()].sort_values(
                    'press_ratio', ascending=False).reset_index(drop=True)
        except Exception:
            print(f'{cos.loc[i, "company"]} press_link_df')

        try:
            corp_link_df = None
            corp_link_df = site_links_df[
                site_links_df['corporate_ratio'] >
                site_links_df['corporate_ratio'].mean()].sort_values(
                    'corporate_ratio', ascending=False).reset_index(drop=True)
        except Exception:
            print(f'{cos.loc[i, "company"]} corp_link_df')

        # pulling the top links into cos; if a dataframe is empty or
        # was never built, record 'N/A' instead
        try:
            cos.loc[i, 'newsroom_link'] = news_link_df.loc[0, 'link']
        except Exception:
            cos.loc[i, 'newsroom_link'] = 'N/A'

        try:
            cos.loc[i, 'pressroom_link'] = press_link_df.loc[0, 'link']
        except Exception:
            cos.loc[i, 'pressroom_link'] = 'N/A'

        try:
            cos.loc[i, 'corporate_link'] = corp_link_df.loc[0, 'link']
        except Exception:
            cos.loc[i, 'corporate_link'] = 'N/A'

    except Exception:
        error_cos.append(cos.loc[i, "company"])
        print(f'{cos.loc[i, "company"]}')
        print(f'{browser.current_url}, {browser.title}')

# close the browser once every company has been visited
browser.quit()

100%|██████████| 100/100 [12:52<00:00,  7.73s/it]
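
The selection rule inside the loop (keep links scoring above the mean, then take the highest) can be restated in plain Python to make its edge cases easier to see. This is only a sketch of the same logic: `top_link` and the sample rows are hypothetical, and ties are resolved by taking the first row, which the pandas sort in the loop does not guarantee.

```python
from statistics import mean


def top_link(rows, key):
    """Return the highest-scoring link whose score is above the mean, else 'N/A'."""
    if not rows:
        return 'N/A'
    cutoff = mean(row[key] for row in rows)
    # strictly above the mean, matching the `>` comparison in the loop
    candidates = [row for row in rows if row[key] > cutoff]
    if not candidates:
        return 'N/A'
    return max(candidates, key=lambda row: row[key])['link']


# made-up scores, for illustration only
sample_rows = [
    {'link': 'https://www.example.com/newsroom/', 'news_ratio': 100},
    {'link': 'https://www.example.com/careers/', 'news_ratio': 50},
    {'link': 'https://www.example.com/about/', 'news_ratio': 50},
]
flat_rows = [
    {'link': 'https://www.example.com/a/', 'news_ratio': 50},
    {'link': 'https://www.example.com/b/', 'news_ratio': 50},
]

picked = top_link(sample_rows, 'news_ratio')
picked_flat = top_link(flat_rows, 'news_ratio')
```

One consequence worth noting: when every link on a page has the same score, none is strictly above the mean, so the company ends up with 'N/A' even if all of the links matched well.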


In [4]:
cos.head()

Unnamed: 0,company,rank,fortune_link,co_website,newsroom_link,pressroom_link,corporate_link
0,Walmart,1,https://fortune.com/company/walmart/fortune500/,https://corporate.walmart.com,https://corporate.walmart.com/newsroom/2021/03...,https://www.diabetes.org/newsroom/press-releas...,https://corporate.walmart.com/#
1,Amazon,2,https://fortune.com/company/amazon-com/fortune...,https://www.amazon.com/,https://www.amazon.com/gp/customer-preferences...,https://www.amazon.com/ref=nav_logo_prime,https://www.amazon.com/Stonebriar-Decorative-S...
2,Exxon Mobil,3,https://fortune.com/company/exxon-mobil/fortun...,https://www.exxonmobil.com/,https://corporate.exxonmobil.com/About-us/Busi...,https://corporate.exxonmobil.com/News/Newsroom...,https://corporate.exxonmobil.com/#main-content
3,Apple,4,https://fortune.com/company/apple/fortune500/,https://www.apple.com/,https://www.apple.com/apple-news/,https://www.apple.com/us/shop/goto/temporary_c...,https://www.apple.com/us/shop/goto/trade_in
4,CVS Health,5,https://fortune.com/company/cvs-health/fortune...,https://www.cvshealth.com/,https://www.cvshealth.com/news-and-insights/to...,https://www.cvshealth.com/news-and-insights/pr...,https://www.cvshealth.com/social-responsibilit...
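
Walmart's `pressroom_link` above points to diabetes.org because the loop keeps any link that contains the base string anywhere in the URL, not only in the domain. A stricter filter could compare hostnames instead. The sketch below is one possible approach, not part of the notebook: `same_site` is a hypothetical helper, the diabetes.org path is invented, and treating the last two labels of the host as the company's domain is a naive assumption that breaks for suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse


def same_site(link, site_url):
    """Return True if `link` is hosted on the site's domain or one of its subdomains."""
    link_host = (urlparse(link).hostname or '').lower()
    site_host = (urlparse(site_url).hostname or '').lower()
    # naive assumption: the company's domain is the last two labels of the host
    site_domain = '.'.join(site_host.split('.')[-2:])
    return bool(site_domain) and (
        link_host == site_domain or link_host.endswith('.' + site_domain))


site = 'https://corporate.walmart.com'
on_site = same_site('https://corporate.walmart.com/newsroom/', site)
# invented path: a third-party page that merely mentions the company
off_site = same_site('https://www.diabetes.org/newsroom/walmart-example', site)
```

The trade-off is that a strict hostname check would also discard newsrooms hosted on a separate domain, so the looser substring filter may still be the better fit when the results are reviewed by hand.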


## Finding the official newsroom link

After collecting the links for each company that were most similar to 'news', 'press' and 'corporate', I manually checked whether each link was accurate and saved the results into `final_websites.csv`. While the newsroom links are ultimately not the final links I will use in the function that gathers all of the information, they were useful in helping me determine what those links would be.

Although doing this manually would not be feasible for larger datasets, in this case it was the best option to make sure I had all of the correct links.

In [5]:
# read in the final_websites csv
final_websites = pd.read_csv('./data/final_websites.csv')
final_websites.head()

Unnamed: 0,company,final
0,Walmart,https://corporate.walmart.com/newsroom/company...
1,Amazon,https://press.aboutamazon.com/press-releases
2,Exxon Mobil,https://corporate.exxonmobil.com/News/Newsroom...
3,Apple,https://www.apple.com/newsroom/archive/
4,CVS Health,https://www.cvshealth.com/news-and-insights/pr...


In [6]:
cos = cos.merge(final_websites, on='company')
cos.head()

Unnamed: 0,company,rank,fortune_link,co_website,newsroom_link,pressroom_link,corporate_link,final
0,Walmart,1,https://fortune.com/company/walmart/fortune500/,https://corporate.walmart.com,https://corporate.walmart.com/newsroom/2021/03...,https://www.diabetes.org/newsroom/press-releas...,https://corporate.walmart.com/#,https://corporate.walmart.com/newsroom/company...
1,Amazon,2,https://fortune.com/company/amazon-com/fortune...,https://www.amazon.com/,https://www.amazon.com/gp/customer-preferences...,https://www.amazon.com/ref=nav_logo_prime,https://www.amazon.com/Stonebriar-Decorative-S...,https://press.aboutamazon.com/press-releases
2,Exxon Mobil,3,https://fortune.com/company/exxon-mobil/fortun...,https://www.exxonmobil.com/,https://corporate.exxonmobil.com/About-us/Busi...,https://corporate.exxonmobil.com/News/Newsroom...,https://corporate.exxonmobil.com/#main-content,https://corporate.exxonmobil.com/News/Newsroom...
3,Apple,4,https://fortune.com/company/apple/fortune500/,https://www.apple.com/,https://www.apple.com/apple-news/,https://www.apple.com/us/shop/goto/temporary_c...,https://www.apple.com/us/shop/goto/trade_in,https://www.apple.com/newsroom/archive/
4,CVS Health,5,https://fortune.com/company/cvs-health/fortune...,https://www.cvshealth.com/,https://www.cvshealth.com/news-and-insights/to...,https://www.cvshealth.com/news-and-insights/pr...,https://www.cvshealth.com/social-responsibilit...,https://www.cvshealth.com/news-and-insights/pr...


In [7]:
# confirming there are no null values
cos['final'].isna().sum()

0

In [8]:
# saving dataframe to a new csv

cos.to_csv('./data/fortune_100_data_w_links.csv', index=False)