# Scraping Articles About Africa
This notebook scrapes articles about Africa from several major news outlets based in the U.S., along with a few major outlets based outside the U.S. Now that the dictionaries have been finalized, these articles will be fed into ASTRSC to test the tool.

The script below follows the scraping practices outlined in [this freeCodeCamp tutorial](https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/).

Author: Anabelle Colmenares

In [2]:
# necessary packages
import re
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

## Scrape HTML of the Africa news pages from CNN, BBC, NBC, Fox News, Al Jazeera, The Washington Post, and The New Yorker
We will now build a list of links to each outlet's page of recent articles about Africa. Since The New Yorker spreads its articles across several pages, we will include each of those pages in the list as well.

In [7]:
urls = ["https://www.cnn.com/africa", "https://www.bbc.com/news/world/africa", "https://www.nbcnews.com/news/africa",  "https://www.foxnews.com/category/world/world-regions/africa", "https://www.aljazeera.com/africa/", "https://www.washingtonpost.com/world/africa/?itid=sf_world_subnav", "https://www.newyorker.com/tag/africa"]
news_outlets = ["cnn", "bbc", "nbc", "fox", "aljazeera", "wp", "newyorker"]
html_soups = {}

for i in range(2, 39):
    urls.append(urls[6] + '/page/' + str(i))
    news_outlets.append(news_outlets[6] + '_' + str(i))


print(news_outlets, urls)
    

['cnn', 'bbc', 'nbc', 'fox', 'aljazeera', 'wp', 'newyorker', 'newyorker_2', 'newyorker_3', 'newyorker_4', 'newyorker_5', 'newyorker_6', 'newyorker_7', 'newyorker_8', 'newyorker_9', 'newyorker_10', 'newyorker_11', 'newyorker_12', 'newyorker_13', 'newyorker_14', 'newyorker_15', 'newyorker_16', 'newyorker_17', 'newyorker_18', 'newyorker_19', 'newyorker_20', 'newyorker_21', 'newyorker_22', 'newyorker_23', 'newyorker_24', 'newyorker_25', 'newyorker_26', 'newyorker_27', 'newyorker_28', 'newyorker_29', 'newyorker_30', 'newyorker_31', 'newyorker_32', 'newyorker_33', 'newyorker_34', 'newyorker_35', 'newyorker_36', 'newyorker_37', 'newyorker_38'] ['https://www.cnn.com/africa', 'https://www.bbc.com/news/world/africa', 'https://www.nbcnews.com/news/africa', 'https://www.foxnews.com/category/world/world-regions/africa', 'https://www.aljazeera.com/africa/', 'https://www.washingtonpost.com/world/africa/?itid=sf_world_subnav', 'https://www.newyorker.com/tag/africa', 'https://www.newyorker.com/tag/afri
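
The pagination step above could also be wrapped in a small helper. This is a minimal sketch, assuming the `/page/N` URL pattern used in the loop; `build_page_urls` and `page_urls` are hypothetical names, not part of the notebook.

```python
def build_page_urls(base_url, label, last_page):
    """Return (labels, urls) for pages 1..last_page of a paginated listing.

    Page 1 is the bare base_url; later pages follow the '/page/N' pattern
    used in the loop above (an assumption about the site's URL scheme).
    """
    labels = [label]
    page_urls = [base_url]
    for page in range(2, last_page + 1):
        labels.append(f"{label}_{page}")
        page_urls.append(f"{base_url}/page/{page}")
    return labels, page_urls


labels, page_urls = build_page_urls("https://www.newyorker.com/tag/africa", "newyorker", 38)
print(len(labels), labels[-1], page_urls[1])
```

Passing the last page number as an argument keeps the page count in one visible place, which may help if the number of New Yorker pages changes.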

In [8]:
for i in range(len(urls)):
    # send request
    res = requests.get(urls[i])

    txt = res.text
    status = res.status_code

    # convert to a beautiful soup object
    soup = bs(res.content, 'html.parser')
    
    # store soup in dict
    html_soups[news_outlets[i]] = soup

['https://www.cnn.com/africa', 'https://www.bbc.com/news/world/africa', 'https://www.nbcnews.com/news/africa', 'https://www.foxnews.com/category/world/world-regions/africa', 'https://www.aljazeera.com/africa/', 'https://www.washingtonpost.com/world/africa/?itid=sf_world_subnav', 'https://www.newyorker.com/tag/africa', 'https://www.newyorker.com/tag/africa/page/2', 'https://www.newyorker.com/tag/africa/page/3', 'https://www.newyorker.com/tag/africa/page/4', 'https://www.newyorker.com/tag/africa/page/5', 'https://www.newyorker.com/tag/africa/page/6', 'https://www.newyorker.com/tag/africa/page/7', 'https://www.newyorker.com/tag/africa/page/8', 'https://www.newyorker.com/tag/africa/page/9', 'https://www.newyorker.com/tag/africa/page/10', 'https://www.newyorker.com/tag/africa/page/11', 'https://www.newyorker.com/tag/africa/page/12', 'https://www.newyorker.com/tag/africa/page/13', 'https://www.newyorker.com/tag/africa/page/14', 'https://www.newyorker.com/tag/africa/page/15', 'https://www.new
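
The request loop above stores a soup for every URL, even when a request fails or returns an error page. The sketch below shows one possible way to guard against that with a timeout and a status check. `fetch_soup` and its `getter` parameter are hypothetical names introduced for illustration, and the 10-second timeout is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, getter=requests, timeout=10):
    """Fetch url and return a BeautifulSoup object, or None on failure.

    getter is anything with a requests-style .get(); it defaults to the
    requests module and could also be a requests.Session or a test double.
    """
    try:
        res = getter.get(url, timeout=timeout)
        res.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    except requests.RequestException as err:
        print(f"skipping {url}: {err}")
        return None
    return BeautifulSoup(res.content, "html.parser")
```

In the loop, `html_soups[news_outlets[i]] = fetch_soup(urls[i])` would then store `None` for any page that could not be fetched, so later cells would need to skip those entries.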

## Extract Links
We will first look at the source HTML of the CNN, NBC, and Al Jazeera pages to find where the links are. Then, we will extract the links from each outlet's page using Beautiful Soup functions.
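
Reading a full `prettify()` dump can be slow going. As a complement, the sketch below counts `<a href>` values by the first segment of their path, which may make it easier to see which URL pattern the article links share. `href_prefix_counts` is a hypothetical helper, and the sample HTML is hand-written for illustration, not taken from any of the outlets.

```python
from collections import Counter
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def href_prefix_counts(soup):
    """Count <a href> values by the first segment of their path."""
    counts = Counter()
    for a in soup.find_all("a", href=True):  # skip anchors without an href
        path = urlparse(a["href"]).path
        first = path.strip("/").split("/")[0]
        counts[first or "/"] += 1
    return counts


# Tiny hand-written page standing in for a real listing page.
sample_html = """
<body>
  <a href="/2022/11/22/africa/story-one/index.html">one</a>
  <a href="/2022/11/21/africa/story-two/index.html">two</a>
  <a href="/videos">videos</a>
  <a name="top">no href</a>
</body>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(href_prefix_counts(sample_soup).most_common())
```

On a real page, the same call would be `href_prefix_counts(html_soups["cnn"])`.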

In [25]:
print(html_soups["cnn"].body.prettify())

<body class="pg pg-vertical pg-vertical--news pg-africa pg-section domestic t-light" data-eq-pts="xsmall: 0, medium: 460, large: 780, full16x9: 1100">
 <div class="ad ad--epic ad--all">
  <div class="ad-ad_bnr_atf_02 ad-refresh-adbody" data-ad-id="ad_bnr_atf_02" id="ad_bnr_atf_02">
  </div>
 </div>
 <div class="user-msg">
  <div class="user-msg--container">
   <div class="user-msg--header">
    <div class="user-msg--header-text js-user-msg--header-text">
    </div>
    <div class="user-msg--close js-user-msg--close">
    </div>
   </div>
   <div class="user-msg--body">
    <div class="user-msg--body-text js-user-msg--body-text">
    </div>
   </div>
  </div>
 </div>
 <div id="header-wrap">
  <div id="non-sticky-ad-wrap">
   <div class="ad ad--epic ad--all t-high-contrast">
    <div class="ad-ad_bnr_atf_01 ad-refresh-adbanner" data-ad-id="ad_bnr_atf_01" id="ad_bnr_atf_01">
    </div>
   </div>
  </div>
 </div>
 <div class="nav--plain-header" id="nav__plain-header">
  <div data-react-id=

In [26]:
print(html_soups["nbc"].body.prettify())

<body class="frontPage news savory">
 <div class="z-5 relative" id="modal-root">
 </div>
 <div data-reactroot="" id="__next">
  <div class="globalContainerStyles_container__XzPpd bg-knockout-primary" id="content">
   <div class="header-and-footer--banner-ad ad-container topbannerAd isScrolledToTop">
    <div class="ad dn-print">
     <div data-active-tab="true" data-mps="true" data-refresh-interval="0" data-render-on-view="true" data-sizes="[[[1000,1],[[728,90],[970,66],[970,90],[970,250],[1900,400]]],[[758,1],[[728,90],[970,66],[970,90],[970,250],[1900,400]]],[[0,0],[[1900,400]]]]" data-slot="topbanner" data-targeting="{}">
     </div>
    </div>
   </div>
   <style>
    .alert-banner {
          display: none;
        }
   </style>
   <div class="alert-banner">
    IE 11 is not supported. For an optimal experience visit our site on another browser.
    <button aria-label="close" class="alert-banner__close-button icon icon-close" type="button">
    </button>
   </div>
   <div class="s

In [27]:
print(html_soups["aljazeera"].body.prettify())

<body>
 <noscript>
  <iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-MJWQ5L2" style="display:none;visibility:hidden" width="0">
  </iframe>
 </noscript>
 <link data-chunk="section-route" href="/static/css/component~section-route~493df0b3.a1e0c978.chunk.css" rel="stylesheet">
  <script id="__LOADABLE_REQUIRED_CHUNKS__" type="application/json">
   [8]
  </script>
  <script id="__LOADABLE_REQUIRED_CHUNKS___ext" type="application/json">
   {"namedChunks":["section-route"]}
  </script>
  <script async="" data-chunk="section-route" src="/static/js/component~section-route~493df0b3.f6ca404e.chunk.js">
  </script>
  <div id="root">
   <style data-emotion-css="15dx29c">
    .css-15dx29c{font-family:"Roboto","Helvetica Neue","Helvetica","Arial",sans-serif;}
   </style>
   <div class="css-15dx29c">
    <div class="container--ads container--ads-leaderboard-atf">
    </div>
    <div class="container container--header container--white header-is-sticky header-is-minimized">
   

In [33]:
all_links = {}
# loop through soups of all news outlets
for i in range(len(html_soups)):
    a_tags = html_soups[news_outlets[i]].body.find_all('a')
    outlet_links = []
    # loop through a tags of given news outlet
    for a in a_tags:
        full_link = str(a.get('href'))
        # cnn (i == 0) and bbc (i == 1) give partial urls, so prepend the site root
        if i == 0:
            full_link = urls[i].replace('/africa', '') + full_link
        if i == 1:
            full_link = urls[i].replace('/news/world/africa', '') + full_link
        
        if full_link not in outlet_links:  # avoid repeat links from being added
            if i == 0:
                if 'html' in full_link:       # filter cnn for articles ab africa only
                    outlet_links.append(full_link)
            elif i == 1: 
                if 'world-africa' in full_link:  # filter bbc for articles ab africa only
                    outlet_links.append(full_link)
                    
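            elif i == 2:
                # NOTE: the original test, `'africa' or 'world' or 'poli' in full_link`,
                # is always True (a non-empty string is truthy), so no filtering happens
                # here: every nbc link is kept and the slice below does the trimming.
                outlet_links.append(full_link)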
            elif i == 3:
                if 'fox' not in full_link.lower() and '/' in full_link:   # partial rather than full urls are provided in fox html
                    full_link = urls[i].replace('/category/world/world-regions/africa', '') + full_link
                    outlet_links.append(full_link)
            elif i == 4: 
                if '/' in full_link and 'aljazeera' not in full_link.lower():
                    full_link = urls[i].replace('/africa/', '') + full_link
                    outlet_links.append(full_link)
                    
            elif i >= 6:
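                # NOTE: the original chain of `not '...' or ...` tests reduces to this single
                # check, because `not '/page/'` and `not 'https://'` are always False.
                if 'http' not in full_link: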
                    full_link = re.sub('/tag/africa.*', '', urls[i]) + full_link
                    outlet_links.append(full_link)
                
#             elif i > 6: 
#                 full_link = urls[i].replace('', '') + full_link
#                 outlet_links.append(full_link)
            else:
                outlet_links.append(full_link)
                
    all_links[news_outlets[i]] = outlet_links

# exclude all irrelevant links
all_links['nbc'] = all_links['nbc'][53:-12] # note: unsure whether "53" and "-12" will always be the right indices for this page
all_links['aljazeera'] = all_links['aljazeera'][23:-10] # note: same as above
all_links['wp'] = all_links['wp'][15:-48] # note: same as above

# the New Yorker pages start at index 6 of news_outlets
newyorker_pages = news_outlets[6:]
for i, page in enumerate(newyorker_pages):
    if (0 < i < 9) or (9 < i < 27) or i in (28, 29) or i >= 31:
        all_links[page] = all_links[page][20:-40] # note: same as above
    else: 
        all_links[page] = all_links[page][19:-40] # note: same as above
    
print(all_links, len(all_links))

# get total # of links (also prints the count for each outlet)
size = 0
for outlet in all_links:
    size += len(all_links[outlet])
    print(len(all_links[outlet]))
print(size)

{'cnn': ['https://www.cnn.com/interactive/2020/weather/gonzalo-storm-path-tracker/index.html', 'https://www.cnn.com/2022/11/22/africa/king-charles-hosts-ramaphosa-intl/index.html', 'https://www.cnn.com/2022/11/22/africa/gunmen-abduct-hundred-nigeria-intl/index.html', 'https://www.cnn.com/2022/11/21/tech/twitter-africa-elon-musk-intl-lgs/index.html', 'https://www.cnn.com/2022/11/21/africa/president-obiang-extends-rule-intl/index.html', 'https://www.cnn.com/2022/11/18/africa/vaccines-uganda-ebola-outbreak-intl-cmd/index.html', 'https://www.cnn.com/2022/11/18/africa/gates-foundation-africa-pledge-intl/index.html', 'https://www.cnn.com/2022/11/17/africa/lekki-deep-sea-port-construction-completion-spc-intl/index.html', 'https://www.cnn.com/2022/11/14/africa/zambian-prisoner-killed-ukraine-intl/index.html', 'https://www.cnn.com/2022/09/10/africa/colonialism-africa-queen-elizabeth-intl/index.html', 'https://www.cnn.com/2022/09/25/opinions/britain-empire-history-kenya-bergen/index.html', 'http
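
The cell above builds full URLs by string replacement and filters them with per-outlet keyword checks. The sketch below shows an alternative approach, not the notebook's method: `urllib.parse.urljoin` resolves both relative and absolute hrefs against the page URL, and `any()` tests each keyword separately. `clean_links` is a hypothetical helper, and the sample hrefs are made up.

```python
from urllib.parse import urljoin


def clean_links(hrefs, page_url, keywords=()):
    """Resolve hrefs against page_url, filter by keyword, drop duplicates."""
    seen = set()
    kept = []
    for href in hrefs:
        if not href:  # anchors without an href yield None
            continue
        full_link = urljoin(page_url, href)  # handles relative and absolute hrefs
        # any() tests each keyword; `'a' or 'b' in s` would always be True
        if keywords and not any(k in full_link for k in keywords):
            continue
        if full_link not in seen:  # keep first occurrence, preserve order
            seen.add(full_link)
            kept.append(full_link)
    return kept


# Made-up hrefs for illustration only.
sample_hrefs = ["/news/world-africa-1", "https://www.bbc.com/news/world-africa-1",
                "/sport", None, "/news/world-africa-2"]
print(clean_links(sample_hrefs, "https://www.bbc.com/news/world/africa",
                  keywords=("world-africa",)))
```

Because filtering this way changes which links are kept, the hard-coded slice indices used above would no longer apply and would need to be rechecked.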

## Export Links to .csv
Lastly, we will convert our dictionary to a DataFrame and export it as a .csv file.

In [92]:

DF = pd.DataFrame.from_dict(all_links, orient='index')
print(DF.transpose())
DF.transpose().to_csv("articles_ab_africa.csv")

                                                  cnn  \
0   https://www.cnn.com/interactive/2020/weather/g...   
1   https://www.cnn.com/2022/11/21/africa/presiden...   
2   https://www.cnn.com/2022/11/19/world/cop27-egy...   
3   https://www.cnn.com/2022/11/18/africa/vaccines...   
4   https://www.cnn.com/2022/11/18/africa/gates-fo...   
5   https://www.cnn.com/2022/11/17/africa/lekki-de...   
6   https://www.cnn.com/2022/11/14/africa/zambian-...   
7   https://www.cnn.com/2022/11/14/africa/western-...   
8   https://www.cnn.com/2022/11/10/africa/tiktoker...   
9   https://www.cnn.com/2022/09/10/africa/colonial...   
10  https://www.cnn.com/2022/09/25/opinions/britai...   
11  https://www.cnn.com/style/article/great-star-o...   
12  https://www.cnn.com/style/article/ben-enwonwu-...   
13  https://www.cnn.com/2022/11/17/africa/cop27-eg...   
14  https://www.cnn.com/2022/11/06/world/cop27-egy...   
15  https://www.cnn.com/style/article/black-panthe...   
16  https://www.cnn.com/2022/10
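
Since the outlets have different numbers of links, the wide table above pads the shorter columns with missing values. As an alternative layout, the sketch below builds a long table with one row per link. It is only an illustration: `sample_links` is a made-up stand-in for `all_links`, and the column names `outlet` and `url` are assumptions.

```python
import pandas as pd

# Made-up stand-in for all_links: outlets with different numbers of links.
sample_links = {
    "cnn": ["https://example.com/cnn/a", "https://example.com/cnn/b"],
    "bbc": ["https://example.com/bbc/a"],
}

# One row per link, so no padding is needed for the shorter lists.
rows = [(outlet, link) for outlet, links in sample_links.items() for link in links]
long_df = pd.DataFrame(rows, columns=["outlet", "url"])

# With no path argument, to_csv returns the CSV text instead of writing a file.
print(long_df.to_csv(index=False))
```

Passing a file name to `to_csv` instead would write the table to disk, as in the cell above.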