# Gather Press Release Text



## Imports

In [2]:
import pandas as pd

import time
from tqdm import tqdm

import os

import requests
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

In [3]:
files = [f for f in os.listdir('../data/links/') if 'csv' in f]
files

['apple_links.csv',
 'walmart_links.csv',
 'cvs_health_links.csv',
 'amazon_links.csv',
 'exxon_mobil_links.csv']

The below provides a rough estimate of how long the code in the cell that follows will take to run.

In [4]:
lengths = [len(pd.read_csv(f'../data/links/{file}')) for file in files]
time = (sum(lengths)*3.9)/60
print('Files:',files)
print('Lengths:',lengths)
print('Time: %0.2f minutes'%(time))

Files: ['apple_links.csv', 'walmart_links.csv', 'cvs_health_links.csv', 'amazon_links.csv', 'exxon_mobil_links.csv']
Lengths: [263, 437, 767, 449, 143]
Time: 133.83 minutes


For each file, iterate through the rows and use either the link by itself or the base + the link with the `requests` library to gather press release text. These files are then saved in the press_releases folder.

In [5]:
for file in files:
    # create new file name that can be used later
    new_file_name = file.replace('links.csv', 'press_releases.csv')

    # read in the file as a data frame
    df = pd.read_csv(f'../data/links/{file}')

    # create list that dictionaries (created in for loop) can be appended to
    press_releases = []

    try:
        # iterate through each row in the data frame
        # range(len(df))
        for i in tqdm(range(4)):
            try:

                # create dictionary for each row and the results it returns
                press_release = {}

                # for those files that have a base string, get the url, otherwise
                # just use the link column
                if type(df.loc[i, 'base']) == str:
                    url = df.loc[i, 'base'] + df.loc[i, 'link']
                else:
                    url = df.loc[i, 'link']

                req = requests.get(url)

                soup = BeautifulSoup(req.content, 'lxml')

                press_release['full_link'] = url
                press_release['title'] = soup.title.text
                press_release['body'] = soup.body.text
                press_release['html'] = soup
                
                if 'amazon' in file:
                    press_release['year'] = df.loc[i,'year']
                else:
                    pass
                    
                press_releases.append(press_release)
                time.sleep(3)

            except:
                print(f'Error: {file} | {url} | {i} | {req}')

    except:
        print(f'Error: {file} | {url}')

    pr_df = pd.DataFrame(press_releases)
#     pr_df.to_csv(f'../data/press_releases/{new_file_name}', index=False)

 50%|█████     | 2/4 [00:00<00:00,  4.09it/s]

Error: apple_links.csv | https://www.apple.com/newsroom/2021/03/apple-earns-historic-academy-award-nominations-for-wolfwalkers-and-greyhound/ | 0 | <Response [200]>
Error: apple_links.csv | https://www.apple.com/newsroom/2021/03/apple-womens-health-study-releases-preliminary-data-to-help-destigmatize-menstrual-symptoms/ | 1 | <Response [200]>


 75%|███████▌  | 3/4 [00:00<00:00,  4.03it/s]

Error: apple_links.csv | https://www.apple.com/newsroom/2021/03/apple-tv-plus-announces-programming-partnership-with-nobel-laureate-malala-yousafzai/ | 2 | <Response [200]>


100%|██████████| 4/4 [00:01<00:00,  3.68it/s]
  0%|          | 0/4 [00:00<?, ?it/s]

Error: apple_links.csv | https://www.apple.com/newsroom/2021/03/apple-hearing-study-shares-new-insights-on-hearing-health/ | 3 | <Response [200]>


 25%|██▌       | 1/4 [00:00<00:01,  2.15it/s]

Error: walmart_links.csv | https://corporate.walmart.com/newsroom/2021/03/12/walmart-investment-to-accelerate-growth-of-rakutens-global-ecosystem | 0 | <Response [200]>


 50%|█████     | 2/4 [00:00<00:00,  2.15it/s]

Error: walmart_links.csv | https://corporate.walmart.com/newsroom/2021/03/09/walmart-doubles-down-on-tiktok-shopping-hosts-all-new-live-stream-shopping-event | 1 | <Response [200]>


 75%|███████▌  | 3/4 [00:01<00:00,  2.16it/s]

Error: walmart_links.csv | https://corporate.walmart.com/newsroom/2021/03/05/walmart-board-of-directors-adds-former-at-t-chairman-and-ceo-randall-stephenson | 2 | <Response [200]>


100%|██████████| 4/4 [00:01<00:00,  2.17it/s]
  0%|          | 0/4 [00:00<?, ?it/s]

Error: walmart_links.csv | https://corporate.walmart.com/newsroom/2021/03/04/walmart-maintains-relentless-focus-on-growing-frontline-associates-in-the-pandemic-and-beyond | 3 | <Response [200]>


 50%|█████     | 2/4 [00:02<00:01,  1.15it/s]

Error: cvs_health_links.csv | https://www.cvshealth.com/news-and-insights/press-releases/cvs-health-invests-114-million-in-affordable-housing-across-the | 0 | <Response [200]>
Error: cvs_health_links.csv | https://www.cvshealth.com/news-and-insights/press-releases/cvs-health-now-offering-covid-19-vaccines-in-29-states | 1 | <Response [200]>


 75%|███████▌  | 3/4 [00:03<00:01,  1.34s/it]

Error: cvs_health_links.csv | https://www.cvshealth.com/news-and-insights/press-releases/cvs-health-completes-first-round-of-covid-19-vaccine-doses-at | 2 | <Response [200]>


100%|██████████| 4/4 [00:05<00:00,  1.44s/it]
  0%|          | 0/4 [00:00<?, ?it/s]

Error: cvs_health_links.csv | https://www.cvshealth.com/news-and-insights/press-releases/cvs-health-launches-symphonytm-to-support-senior-safety-at-home | 3 | <Response [200]>


 50%|█████     | 2/4 [00:00<00:00,  3.20it/s]

Error: amazon_links.csv | https://press.aboutamazon.com/news-releases/news-release-details/amazon-continues-investment-florida-deltona-fulfillment-center | 0 | <Response [200]>
Error: amazon_links.csv | https://press.aboutamazon.com/news-releases/news-release-details/customers-shopped-record-levels-holiday-season-billions-items | 1 | <Response [200]>


 75%|███████▌  | 3/4 [00:01<00:00,  2.31it/s]

Error: amazon_links.csv | https://press.aboutamazon.com/news-releases/news-release-details/amazon-has-enabled-hundreds-small-businesses-and-created-over | 2 | <Response [200]>


100%|██████████| 4/4 [00:01<00:00,  2.12it/s]
  0%|          | 0/4 [00:00<?, ?it/s]

Error: amazon_links.csv | https://press.aboutamazon.com/news-releases/news-release-details/amazon-makes-returns-even-easier-holiday-free-returns-millions | 3 | <Response [200]>


 25%|██▌       | 1/4 [00:00<00:01,  1.81it/s]

Error: exxon_mobil_links.csv | https://corporate.exxonmobil.com/News/Newsroom/News-releases/2021/0311_Darren-Woods-shares-strategy-for-long-term-growth-in-lower-carbon-future-with-employees | 0 | <Response [200]>


 50%|█████     | 2/4 [00:00<00:00,  2.65it/s]

Error: exxon_mobil_links.csv | https://corporate.exxonmobil.com/News/Newsroom/News-releases/2021/0303_ExxonMobil-outlines-plans-to-grow-long-term-shareholder-value-in-lower-carbon-future | 1 | <Response [200]>


 75%|███████▌  | 3/4 [00:01<00:00,  3.10it/s]

Error: exxon_mobil_links.csv | https://corporate.exxonmobil.com/News/Newsroom/News-releases/2021/0302_ExxonMobil-announces-Singapore-workforce-reductions | 2 | <Response [200]>


100%|██████████| 4/4 [00:01<00:00,  2.89it/s]

Error: exxon_mobil_links.csv | https://corporate.exxonmobil.com/News/Newsroom/News-releases/2021/0301_Neil-Duffin-to-retire-as-president-of-ExxonMobil-Global-Projects-Company_Jon-Gibbs-elected | 3 | <Response [200]>





In [None]:
pr_df