#### 2. Scrape the content of https://www.lemonde.fr/ and save it as a CSV.

We want: title, subhead, article URL, whether it's premium or not, byline, article type, image URL.

#### Bonus, if you want to get fancy:

Make the CSV file auto-updating. Use this tutorial (video, text), but ignore the visualization/Datawrapper part.
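
One possible way to let the CSV accumulate across scheduled runs is to merge each fresh scrape into the existing file and drop repeated articles. The sketch below is only an illustration and is not taken from the tutorial: `merge_scrapes` is a hypothetical helper, the rows are made up, and it assumes the `article_url` column identifies an article uniquely.

```python
import pandas as pd

def merge_scrapes(existing: pd.DataFrame, fresh: pd.DataFrame,
                  key: str = "article_url") -> pd.DataFrame:
    """Append fresh rows to existing ones, keeping the newest copy of each article."""
    combined = pd.concat([existing, fresh], ignore_index=True)
    # keep="last" lets a re-scraped article replace its older row
    return combined.drop_duplicates(subset=key, keep="last").reset_index(drop=True)

# Made-up rows, only to show the behaviour
existing = pd.DataFrame([
    {"article_url": "https://example.com/a", "title": "Old title A"},
    {"article_url": "https://example.com/b", "title": "Title B"},
])
fresh = pd.DataFrame([
    {"article_url": "https://example.com/a", "title": "Updated title A"},
    {"article_url": "https://example.com/c", "title": "Title C"},
])

merged = merge_scrapes(existing, fresh)
print(merged)
```

In a scheduled job, `existing` would come from `pd.read_csv("le_monde_scrape.csv")` when the file already exists, and `merged` would be written back with `to_csv`.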

In [3]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

In [4]:
# For Le Monde, each field lives in the following element or class:
# title: <p class='article__title'>
# subhead: <p class='article__desc'>
# article URL: the href of the <a class='lmd-link-clickarea__link'>
# premium or not? class="sr-only"
# byline: class="article__byline" (but not all articles have bylines)
# article type: not on the homepage; taken from the breadcrumbs on the article page
# image URL: class="initial lzld--loading" inside class="article__media"

# fetch the html

url = "https://www.lemonde.fr"
response = requests.get(url, timeout=30)
# raise an HTTPError on a failed request (exit() would shut down the notebook kernel)
response.raise_for_status()

# parse the 'soup' with Beautiful Soup

doc = BeautifulSoup(response.text, 'html.parser')

# look for articles - the article divs that contain images and headlines on Le Monde tend to start with 'article' but add distinct
# modifiers, like 'article article--main' and 'article article--runner old__article-runner'. I'm using a bit of regex to find any
# element that has the word 'article' in its class.
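
Beautiful Soup tests a compiled pattern against each class name separately, using the pattern's `search()` method, so it helps to see which names `\barticle\b` accepts. The sketch below uses only the `re` module; the list of class names is an illustrative sample built from the ones mentioned above plus a few near misses.

```python
import re

pattern = re.compile(r'\barticle\b')

# Sample class names: the ones mentioned above plus a few near misses
class_names = [
    "article",
    "article--main",
    "article--runner",
    "old__article-runner",
    "article__title",
    "articles",
]

for name in class_names:
    print(f"{name!r}: {bool(pattern.search(name))}")

matched = [name for name in class_names if pattern.search(name)]
```

A hyphen counts as a word boundary but an underscore does not, so the pattern accepts `article--main` as well as plain `article`, while names such as `article__title` are skipped.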

In [5]:
items = doc.find_all(class_=re.compile(r'\barticle\b'))

rows = []

for item in items:
    row = {}
    
    # title: take the first heading or paragraph, falling back to the article__title div
    title_tag = item.select_one('h1, h2, p') or item.find('div', class_='article__title')
    row['title'] = title_tag.get_text().strip() if title_tag else ""

    # subhead (left empty when the item has no description)
    subhed_tag = item.find(class_="article__desc")
    row['subhed'] = subhed_tag.get_text().strip() if subhed_tag else ""

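    # article url: the item may be the link itself; otherwise use its first <a> that has an href
    link_tag = item if item.has_attr('href') else item.find('a', href=True)
    if link_tag is None:
        continue  # no link, so there is no article to describe
    row['article_url'] = link_tag['href']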

    # premium?
    premium_icon_exists = item.find(class_=re.compile(r'\bsr-only\b'))
    if premium_icon_exists:
        row['premium_or_not'] = "Premium"
    else:
        row['premium_or_not'] = ""

    # byline - we'll query the metadata of the article itself with a request
    article_response = requests.get(row['article_url'], timeout=30)
    soup = BeautifulSoup(article_response.text, 'html.parser')
    byline_tag = soup.find(class_="meta__author")
    row['byline'] = byline_tag.get_text().strip() if byline_tag else ""
    
    # article type - these are seen as 'breadcrumbs' at the top of the article
    # find_all returns every breadcrumb <li>; find would return only the first one
    breadcrumb_items = soup.find_all('li', class_=re.compile(r'\bbreadcrumb\b'))
    # join the crumbs into one string so the CSV cell stays readable (empty if none were found)
    row['article_type'] = " > ".join(crumb.get_text().strip() for crumb in breadcrumb_items)

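    # image url: prefer the lazy-load attribute, then fall back to src
    # .get() returns None for a missing attribute, so the 'or' fallback actually runs
    img_tag = item.find('img')
    row['image_url'] = (img_tag.get('data-lazy') or img_tag.get('src') or "") if img_tag else ""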

    rows.append(row)
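
The `href` values collected in the loop may be relative paths rather than full URLs, in which case `requests.get` would reject them. A minimal sketch of normalizing them with the standard library's `urljoin` follows; `BASE_URL` and `absolute_url` are names introduced here for illustration, and the sample paths are made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.lemonde.fr"

def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve href against base; URLs that are already absolute pass through unchanged."""
    return urljoin(base, href)

relative_example = absolute_url("/international/")
absolute_example = absolute_url("https://www.lemonde.fr/sport/")

print(relative_example)
print(absolute_example)
```

Inside the loop, the same call could wrap the `href` before it is stored in `row['article_url']`.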

In [6]:
df = pd.json_normalize(rows)
# index=False keeps pandas from writing its row numbers as an extra unnamed column
df.to_csv("le_monde_scrape.csv", index=False)