# Web Scraper for The Economist

This notebook develops a web scraper for the news website "The Economist".

The aim is to collect every headline on the home page, together with its publication date, its link, and its publisher, using Python and BeautifulSoup.

In [None]:
# import the necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlparse, urljoin
import re

In [2]:
# Get the HTML of the website
url = "https://www.economist.com/"
response = requests.get(url)
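
The request above has no timeout and does not check the status code, so a stalled connection or an error page could go unnoticed. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the 10-second timeout, and the `User-Agent` string are assumptions for illustration and are not part of the original notebook.

```python
import requests

# Hypothetical helper, not part of the original notebook.
def fetch_html(url, session=None, timeout=10):
    """Return the response body as bytes, raising on HTTP errors."""
    session = session or requests.Session()
    # Illustrative User-Agent; replace it with one that identifies your scraper.
    headers = {"User-Agent": "economist-headline-scraper/0.1 (educational)"}
    response = session.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

The returned bytes could then be passed to `BeautifulSoup(html, "html.parser")` in the same way as `response.content`.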

In [3]:
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(response.content, "html.parser")

In [20]:
# The title of the web page
title_tag = soup.find('title')

In [21]:
# Extract the title
title = title_tag.text 

In [22]:
# Print the title
print("Title:", title)

Title: The Economist | World News, Economics, Politics, Business & Finance


In [4]:
# Search for the different headlines using role and id
main_content = soup.find_all(role="main", id="content")
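
To see what the `role` and `id` filters select, here is a small sketch run against a made-up HTML snippet. The markup is an assumption for illustration only; the real page structure may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; not copied from the real site.
sample_html = """
<html><body>
  <nav><a href="/subscribe">Subscribe</a></nav>
  <main role="main" id="content">
    <a href="/business/2023/05/29/example-article">Example headline</a>
  </main>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keyword arguments are matched against tag attributes,
# so this selects tags with role="main" AND id="content".
sample_main = sample_soup.find_all(role="main", id="content")

# An empty result usually means the page layout has changed.
if not sample_main:
    raise ValueError("No element with role='main' and id='content' found")

# Only links inside the matched element are collected; the nav link is skipped.
sample_links = [a["href"] for a in sample_main[0].find_all("a", href=True)]
print(sample_links)
```

Checking for an empty result early makes a layout change fail loudly instead of silently producing an empty DataFrame.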

In [5]:
# Get the base URL for relative URL resolution
base_url = urlparse(url).scheme + "://" + urlparse(url).netloc
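
The base URL matters because links on the page may be root-relative. A short sketch of how `urljoin` treats the two cases, using made-up paths for illustration:

```python
from urllib.parse import urljoin

base = "https://www.economist.com"

# A root-relative href is resolved against the base URL.
article_link = urljoin(base, "/business/2023/05/29/example-article")

# An href that is already absolute is returned unchanged.
brief_link = urljoin(base, "https://www.economist.com/the-world-in-brief")

print(article_link)
print(brief_link)
```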

In [6]:
# Create empty lists to store the extracted data
urls = []
meanings = []
dates = []

In [7]:
# Iterate over the matching elements
for content in main_content:
    # Find all anchor tags that actually carry an href attribute
    anchor_tags = content.find_all('a', href=True)

    # Iterate over the anchor tags and extract the URL, headline text, and date
    for anchor_tag in anchor_tags:
        # Resolve relative URLs to absolute URLs
        # (stored as article_url so the page-level `url` is not overwritten)
        article_url = urljoin(base_url, anchor_tag['href'])

        # Extract the date from the URL path, e.g. /business/2023/05/29/...
        # Non-article links give an empty or non-date string; they are dropped later
        path_components = urlparse(article_url).path.split('/')
        date_published = "/".join(path_components[2:5])

        # Extract the link text, which is the headline
        meaning = anchor_tag.text

        # Append data to the respective lists
        urls.append(article_url)
        meanings.append(meaning)
        dates.append(date_published)
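
Slicing the path by position treats any path segment as a date, so a newsletter link ends up with a value such as `the-war-room`. One possible alternative, sketched here as an assumption rather than the notebook's method, is to accept only a `/YYYY/MM/DD/` segment with a regular expression. The helper name `extract_date` and the example URLs are illustrative.

```python
import re
from urllib.parse import urlparse

# Matches a /YYYY/MM/DD/ segment anywhere in the URL path.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

# Hypothetical helper, not part of the original notebook.
def extract_date(url):
    """Return 'YYYY/MM/DD' if the path has a date segment, else ''."""
    match = DATE_IN_PATH.search(urlparse(url).path)
    return "/".join(match.groups()) if match else ""

print(extract_date("https://www.economist.com/business/2023/05/29/example-article"))
print(repr(extract_date("https://www.economist.com/economics-a-to-z/")))
```

With this approach every non-article link gets an empty date, so a single filter on empty dates would be enough to remove them.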

In [8]:
# Create a dictionary with the lists
data = {
    'Headlines': meanings,
    'Article Link': urls,
    'Date Published': dates
}

In [9]:
# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

In [10]:
df

| | Headlines | Article Link | Date Published |
|---|---|---|---|
| 0 | The War Room | https://www.economist.com/newsletters/the-war-... | the-war-room |
| 1 | The A to Z of economics | https://www.economist.com/economics-a-to-z/ | |
| 2 | The Intelligence | https://www.economist.com/podcasts/2023/05/29/... | 2023/05/29 |
| 3 | Nvidia is not the only firm cashing in on the ... | https://www.economist.com/business/2023/05/29/... | 2023/05/29 |
| 4 | The speech police are coming for social media | https://www.economist.com/international/2023/0... | 2023/05/29 |
| ... | ... | ... | ... |
| 81 | What happens when Belarus loses its dictator? | https://www.economist.com/the-economist-explai... | 2023/05/22 |
| 82 | The Economist reads | https://www.economist.com/the-economist-reads/ | |
| 83 | The spy who read me: authors under surveillance | https://www.economist.com/the-economist-reads/... | 2023/05/24 |
| 84 | What to read about villains in business | https://www.economist.com/the-economist-reads/... | 2023/05/16 |


In [11]:
# Drop the first three rows from the DataFrame as they are not part of the daily headlines
df = df.drop(df.index[:3])

In [12]:
# Reset the index of the DataFrame
df = df.reset_index(drop=True)

In [13]:
df

| | Headlines | Article Link | Date Published |
|---|---|---|---|
| 0 | Nvidia is not the only firm cashing in on the ... | https://www.economist.com/business/2023/05/29/... | 2023/05/29 |
| 1 | The speech police are coming for social media | https://www.economist.com/international/2023/0... | 2023/05/29 |
| 2 | Bad Bunny, a superstar rapper, is good business | https://www.economist.com/the-americas/2023/05... | 2023/05/29 |
| 3 | Continue reading | https://www.economist.com/the-world-in-brief | |
| 4 | Spain’s prime minister gambles on a snap gener... | https://www.economist.com/europe/2023/05/29/sp... | 2023/05/29 |
| ... | ... | ... | ... |
| 78 | What happens when Belarus loses its dictator? | https://www.economist.com/the-economist-explai... | 2023/05/22 |
| 79 | The Economist reads | https://www.economist.com/the-economist-reads/ | |
| 80 | The spy who read me: authors under surveillance | https://www.economist.com/the-economist-reads/... | 2023/05/24 |
| 81 | What to read about villains in business | https://www.economist.com/the-economist-reads/... | 2023/05/16 |


In [15]:
# Drop rows with missing dates as they are not part of the news
df = df.drop(df[df['Date Published'] == ''].index)

In [16]:
# Reset the index of the DataFrame
df = df.reset_index(drop=True)
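
The two clean-up steps above (dropping the first three rows by position, then dropping empty dates) could also be expressed as one date-validation step. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses a small made-up DataFrame, and the names `sample` and `clean` are illustrative.

```python
import pandas as pd

# Illustrative rows mirroring the three kinds of values seen above.
sample = pd.DataFrame({
    "Headlines": ["The War Room", "The A to Z of economics", "Example headline"],
    "Date Published": ["the-war-room", "", "2023/05/29"],
})

# Values that do not match the format become NaT instead of raising an error.
sample["Date Published"] = pd.to_datetime(
    sample["Date Published"], format="%Y/%m/%d", errors="coerce"
)

# Keep only rows with a real date, then renumber the index from 0.
clean = sample.dropna(subset=["Date Published"]).reset_index(drop=True)
print(clean)
```

Validating by content rather than by row position keeps working if the number of newsletter or podcast links at the top of the page changes.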

In [17]:
df

| | Headlines | Article Link | Date Published |
|---|---|---|---|
| 0 | Nvidia is not the only firm cashing in on the ... | https://www.economist.com/business/2023/05/29/... | 2023/05/29 |
| 1 | The speech police are coming for social media | https://www.economist.com/international/2023/0... | 2023/05/29 |
| 2 | Bad Bunny, a superstar rapper, is good business | https://www.economist.com/the-americas/2023/05... | 2023/05/29 |
| 3 | Spain’s prime minister gambles on a snap gener... | https://www.economist.com/europe/2023/05/29/sp... | 2023/05/29 |
| 4 | Bartleby: Why are corporate retreats so extrav... | https://www.economist.com/business/2023/05/25/... | 2023/05/25 |
| ... | ... | ... | ... |
| 68 | Who are the pro-Ukrainian militias raiding Rus... | https://www.economist.com/the-economist-explai... | 2023/05/23 |
| 69 | What happens when Belarus loses its dictator? | https://www.economist.com/the-economist-explai... | 2023/05/22 |
| 70 | The spy who read me: authors under surveillance | https://www.economist.com/the-economist-reads/... | 2023/05/24 |
| 71 | What to read about villains in business | https://www.economist.com/the-economist-reads/... | 2023/05/16 |


In [18]:
# Add a new column "Publisher" with "The Economist" as the value
df = df.assign(Publisher="The Economist")

In [19]:
df

| | Headlines | Article Link | Date Published | Publisher |
|---|---|---|---|---|
| 0 | Nvidia is not the only firm cashing in on the ... | https://www.economist.com/business/2023/05/29/... | 2023/05/29 | The Economist |
| 1 | The speech police are coming for social media | https://www.economist.com/international/2023/0... | 2023/05/29 | The Economist |
| 2 | Bad Bunny, a superstar rapper, is good business | https://www.economist.com/the-americas/2023/05... | 2023/05/29 | The Economist |
| 3 | Spain’s prime minister gambles on a snap gener... | https://www.economist.com/europe/2023/05/29/sp... | 2023/05/29 | The Economist |
| 4 | Bartleby: Why are corporate retreats so extrav... | https://www.economist.com/business/2023/05/25/... | 2023/05/25 | The Economist |
| ... | ... | ... | ... | ... |
| 68 | Who are the pro-Ukrainian militias raiding Rus... | https://www.economist.com/the-economist-explai... | 2023/05/23 | The Economist |
| 69 | What happens when Belarus loses its dictator? | https://www.economist.com/the-economist-explai... | 2023/05/22 | The Economist |
| 70 | The spy who read me: authors under surveillance | https://www.economist.com/the-economist-reads/... | 2023/05/24 | The Economist |
| 71 | What to read about villains in business | https://www.economist.com/the-economist-reads/... | 2023/05/16 | The Economist |


## Conclusion

There was not much difficulty, as I have some experience with scraping the web for data.

Strategies I use to make data collection more efficient:

* The first is always to define my data collection goals so that I know exactly what I need. In the case above I wanted only news headlines, so it was easy to narrow my search.

* Decide how I am collecting the data and which method is most efficient. Using an API is almost always better than scraping, as the data is more structured and usually more recent.

* Apply proper quality control: check the integrity of the collected data, ensure that it is not biased, and clean and format it.

* Conduct regular updates and maintenance on my data collection processes to check for changes in data formats, website changes, and more.
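
The quality-control point can be made concrete with a small validation pass over scraped records. This is a minimal sketch under assumptions: the function name `clean_records` and the dictionary keys `headline`, `link`, and `date` are hypothetical and do not appear in the notebook above.

```python
from datetime import datetime

# Hypothetical helper; the record keys are assumed for illustration.
def clean_records(records):
    """Drop records with a blank headline, an invalid date, or a repeated link."""
    seen_links = set()
    cleaned = []
    for record in records:
        headline = record.get("headline", "").strip()
        link = record.get("link", "")
        try:
            # Accept only real calendar dates in YYYY/MM/DD form.
            datetime.strptime(record.get("date", ""), "%Y/%m/%d")
        except ValueError:
            continue
        if not headline or link in seen_links:
            continue
        seen_links.add(link)
        cleaned.append({**record, "headline": headline})
    return cleaned
```

Running a check like this after every scrape also helps with the maintenance point: a sudden drop in the number of valid records is a sign that the website's structure has changed.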




### Thank you