Scrape the website Towards Data Science using Beautiful Soup
Reference: https://dorianlazar.medium.com/scraping-medium-with-python-beautiful-soup-3314f898bbf5

In [3]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import random

In [7]:
urls = {
    'Towards Data Science': 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}'
}
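To make the URL template concrete, here is a small sketch of how it expands and how the redirect check in the scraping loop further down can be read. The helper names `archive_url` and `is_same_archive_page` are hypothetical, and the redirected URL in the example is only an assumption about how the site might respond for a day without stories.

```python
# Archive URL template copied from the `urls` dictionary above.
ARCHIVE_TEMPLATE = 'https://towardsdatascience.com/archive/{0}/{1:02d}/{2:02d}'


def archive_url(year, month, day):
    """Fill the template with a year and a zero-padded month and day (hypothetical helper)."""
    return ARCHIVE_TEMPLATE.format(year, month, day)


def is_same_archive_page(requested_url, final_url):
    """Mirror the loop's check: the final URL must still start with the requested one."""
    return final_url.startswith(requested_url)


requested = archive_url(2022, 2, 1)
print(requested)  # https://towardsdatascience.com/archive/2022/02/01

# Assumed example: a redirect to a coarser archive page fails the check,
# so the loop would skip that day.
print(is_same_archive_page(requested, 'https://towardsdatascience.com/archive/2022'))  # False
```

The `{1:02d}` and `{2:02d}` fields pad single-digit months and days with a leading zero, which is what the archive paths use.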

In [8]:
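def convert_day(day):
    """Convert a 1-based day of the year into (month, day_of_month), assuming a non-leap year."""
    month_days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    # Subtract whole months until the remainder falls inside month m.
    while day > 0:
        d = day
        day -= month_days[m]
        m += 1
    return (m, d)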
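`convert_day` uses a fixed 28-day February, so in leap years such as 2016 and 2020 every day number after 59 maps to a date one day later than the true calendar date. A possible alternative, sketched here with the standard `datetime` module, takes the year into account. The function name `day_of_year_to_month_day` is hypothetical and not part of the notebook.

```python
from datetime import date, timedelta


def day_of_year_to_month_day(year, day_of_year):
    """Return (month, day) for a 1-based day of the year, honouring leap years."""
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return d.month, d.day


print(day_of_year_to_month_day(2015, 60))  # (3, 1)
print(day_of_year_to_month_day(2016, 60))  # (2, 29)
```

For non-leap years this returns the same pairs as `convert_day`, for example day 151 of 2015 is May 31 in both.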

In [9]:
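def get_claps(claps_str):
    """Parse a clap count such as '479' or '1.2K' into an int; return 0 when no count is present."""
    # The `.split is None` test catches bs4 Tag objects, whose unknown attributes resolve to None.
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    # A trailing 'K' yields two parts, e.g. '1.2K' -> ['1.2', ''], meaning thousands.
    claps = int(claps*1000) if len(split) == 2 else int(claps)
    return claps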
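As an illustration of the same parsing idea, the sketch below is a slightly more defensive variant. It is not the notebook's function: `parse_claps` is a hypothetical name, it rounds instead of truncating so that floating-point error in the multiplication cannot shave off a clap, and it returns 0 for anything it cannot parse, which is an assumption about the desired behaviour.

```python
def parse_claps(text):
    """Parse a clap label like '479' or '1.5K' into an int (hypothetical helper)."""
    # bs4's NavigableString subclasses str, so it passes this check; a Tag does not.
    if not isinstance(text, str):
        return 0
    text = text.strip()
    if not text:
        return 0
    multiplier = 1
    if text[-1] in ('K', 'k'):
        multiplier = 1000
        text = text[:-1]
    try:
        return int(round(float(text) * multiplier))
    except ValueError:
        # Unparseable labels are treated as "no claps" in this sketch.
        return 0


print(parse_claps('479'))   # 479
print(parse_claps('1.5K'))  # 1500
print(parse_claps(''))      # 0
```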

In [36]:
# Source: https://hackernoon.com/how-to-scrape-a-medium-publication-a-python-tutorial-for-beginners-o8u3t69
def get_article_text(story_url):
    """Download a story and return (n_sections, n_paragraphs, section_titles, story_text)."""
    story_page = requests.get(story_url)
    story_soup = BeautifulSoup(story_page.text, 'html.parser')

    sections = story_soup.find_all('section')
    story_paragraphs = []
    section_titles = []
    
    for section in sections:
        paragraphs = section.find_all('p')
        for paragraph in paragraphs:
            story_paragraphs.append(paragraph.text)

        subs = section.find_all('h1')
        for sub in subs:
            section_titles.append(sub.text)

    number_sections = len(section_titles)
    number_paragraphs = len(story_paragraphs)
    section_title_text = " ".join(section_titles)
    story_text = " ".join(story_paragraphs)
    
    return number_sections, number_paragraphs, section_titles, story_text
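The extraction logic in `get_article_text` can be tried without any network access by feeding Beautiful Soup a small HTML string. The sketch below does that. The sample markup and the name `extract_text` are made up for illustration, and real Medium pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded story page.
SAMPLE_HTML = """
<html><body>
<section>
  <h1>First heading</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</section>
<section>
  <h1>Second heading</h1>
  <p>Third paragraph.</p>
</section>
</body></html>
"""


def extract_text(html):
    """Collect <p> text and <h1> titles from every <section>, as get_article_text does."""
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs, titles = [], []
    for section in soup.find_all('section'):
        paragraphs.extend(p.get_text() for p in section.find_all('p'))
        titles.extend(h.get_text() for h in section.find_all('h1'))
    return len(titles), len(paragraphs), titles, " ".join(paragraphs)


result = extract_text(SAMPLE_HTML)
print(result)
```

Note that `find_all('section')` also returns nested sections, so if a page nests one `<section>` inside another, the inner paragraphs would be collected more than once by this approach.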

In [37]:
selected_days = random.sample(range(1, 366), 5)
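Because the sample is drawn without a seed, each run of the notebook scrapes a different set of days. If a repeatable sample is wanted, one option is a dedicated seeded generator, sketched below. The seed value 2022 is arbitrary and purely an assumption for the example.

```python
import random

# A private generator keeps the seed from affecting other uses of `random`.
rng = random.Random(2022)  # arbitrary, hypothetical seed

# Five distinct day-of-year numbers in 1..365, sorted for readability.
selected_days = sorted(rng.sample(range(1, 366), 5))
print(selected_days)
```

Re-creating `random.Random` with the same seed yields the same five days, so the scrape can be repeated on identical dates.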

In [38]:
data = []
article_id = 0
years = range(2015,2023)
i = 0
n = len(years) * len(selected_days)  # total number of (year, day) pairs to visit
for year in years:
    for d in selected_days:
        i += 1
        month, day = convert_day(d)
        date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
        print(f'{i} / {n} ; {date}')
        for publication, url in urls.items():
            response = requests.get(url.format(year, month, day), allow_redirects=True)
            if not response.url.startswith(url.format(year, month, day)):
                continue
            page = response.content
            soup = BeautifulSoup(page, 'html.parser')
            articles = soup.find_all(
                "div",
                class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls")
            for article in articles:
                title = article.find("h3", class_="graf--title")
                if title is None:
                    continue
                title = title.contents[0]
                article_id += 1
                subtitle = article.find("h4", class_="graf--subtitle")
                subtitle = subtitle.contents[0] if subtitle is not None else ''
                #image = article.find("img", class_="graf-image")
                #image = '' if image is None else get_img(image['src'], 'images', f'{article_id}')
                article_url = article.find_all("a")[3]['href'].split('?')[0]
                number_sections, number_paragraphs, section_titles, story_text = get_article_text(article_url)
                buttons = article.find_all("button")
                claps = get_claps(buttons[1].contents[0]) if len(buttons) > 1 else None
                reading_time = article.find("span", class_="readingTime")
                reading_time = 0 if reading_time is None else int(reading_time['title'].split(' ')[0])
                responses = article.find_all("a")
                if len(responses) == 7:
                    responses = responses[6].contents[0].split(' ')
                    if len(responses) == 0:
                        responses = 0
                    else:
                        responses = responses[0]
                else:
                    responses = 0

                data.append([article_id, article_url, title, subtitle,
                             number_sections, number_paragraphs, section_titles, story_text,
                             claps, responses,
                             reading_time, publication, date,year])

1 / 5 ; 2015-05-31
2 / 5 ; 2015-04-22
3 / 5 ; 2015-03-04
4 / 5 ; 2015-10-29
5 / 5 ; 2015-02-28
6 / 5 ; 2016-05-31
7 / 5 ; 2016-04-22
8 / 5 ; 2016-03-04
9 / 5 ; 2016-10-29
10 / 5 ; 2016-02-28
11 / 5 ; 2017-05-31
12 / 5 ; 2017-04-22
13 / 5 ; 2017-03-04
14 / 5 ; 2017-10-29
15 / 5 ; 2017-02-28
16 / 5 ; 2018-05-31
17 / 5 ; 2018-04-22
18 / 5 ; 2018-03-04
19 / 5 ; 2018-10-29
20 / 5 ; 2018-02-28
21 / 5 ; 2019-05-31
22 / 5 ; 2019-04-22
23 / 5 ; 2019-03-04
24 / 5 ; 2019-10-29
25 / 5 ; 2019-02-28
26 / 5 ; 2020-05-31
27 / 5 ; 2020-04-22
28 / 5 ; 2020-03-04
29 / 5 ; 2020-10-29
30 / 5 ; 2020-02-28
31 / 5 ; 2021-05-31
32 / 5 ; 2021-04-22
33 / 5 ; 2021-03-04
34 / 5 ; 2021-10-29
35 / 5 ; 2021-02-28
36 / 5 ; 2022-05-31
37 / 5 ; 2022-04-22
38 / 5 ; 2022-03-04
39 / 5 ; 2022-10-29
40 / 5 ; 2022-02-28
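The loop above reads the reading time and the response count by splitting a label on spaces and keeping the first token. A small helper can express that pattern in one place and return integers consistently. This is only a sketch: `leading_int` is a hypothetical name, and the label formats in the examples ('3 min read', '5 responses') are assumptions about what the page contains.

```python
import re


def leading_int(text, default=0):
    """Return the integer at the start of a label, or `default` if there is none."""
    match = re.match(r'\s*(\d+)', text or '')
    return int(match.group(1)) if match else default


print(leading_int('3 min read'))   # 3
print(leading_int('5 responses'))  # 5
print(leading_int(''))             # 0
```

Using one helper for both fields would also keep the `responses` column from mixing strings and integers.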


In [45]:
data

[[1,
  'https://towardsdatascience.com/difference-between-permutation-and-combination-9e12b6763ee1',
  'Difference between Permutation and Combination',
  '',
  1,
  25,
  ['Difference between Permutation and Combination'],
  'Long story short The difference between Permutation and combination is: A combination lock should be called a permutation lock ;) Long story While studying Machine Learning, on edx.org, the instructor uses Gaussian Distribution to explain the Supervised and Unsupervised learning ( Please move to the discussion ahead if you are purely interested in knowing the difference ). The Gaussian Distribution approximates the Binomial distribution when the occurrence of events is very large and this is where I actually wanted to understand the difference as the formula for Binomial distribution contains multiples of a combination of occurrence of an event. Let’s start with a basic definition for permutation and combinations with examples: Permutation: A selection of objects

In [46]:
medium_df = pd.DataFrame(data, columns=[
    'id', 'url', 'title', 'subtitle',
    'n_sections', 'n_paragraphs', 'section_titles', 'story_text',
    'claps', 'responses',
    'reading_time', 'publication', 'date','year'])

In [47]:
medium_df

Unnamed: 0,id,url,title,subtitle,n_sections,n_paragraphs,section_titles,story_text,claps,responses,reading_time,publication,date,year
0,1,https://towardsdatascience.com/difference-betw...,Difference between Permutation and Combination,,1,25,[Difference between Permutation and Combination],Long story short The difference between Permut...,479.0,5,3,Towards Data Science,2017-05-31,2017
1,2,https://towardsdatascience.com/building-a-real...,Building a realtime dashboard with Flink: The ...,,1,9,[Building a realtime dashboard with Flink: The...,With the demand for “realtime” low latency dat...,16.0,0,3,Towards Data Science,2017-05-31,2017
2,3,https://towardsdatascience.com/artificial-inte...,Artificial Intelligence is the Panacea to Toda...,,12,27,[Artificial Intelligence is the Panacea to Tod...,"My foray into understanding, and more importan...",33.0,3,7,Towards Data Science,2017-05-31,2017
3,4,https://towardsdatascience.com/opportunities-a...,[Opportunities And Obstacles For Deep Learning...,,1,6,[Opportunities And Obstacles For Deep Learning...,Target audience: general. 27 scientists collab...,,0,3,Towards Data Science,2017-05-31,2017
4,5,https://towardsdatascience.com/https-medium-co...,The Single Most Important Thing That Data Can ...,How to reduce the cognitive load that entrepre...,1,22,[The Single Most Important Thing That Data Can...,I’ve worked with a number of startups and ther...,83.0,0,4,Towards Data Science,2017-05-31,2017
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
813,814,https://towardsdatascience.com/a-low-down-on-m...,A Low Down on Machine Learning,,12,68,"[A Low Down on Machine Learning, Part 1, A sho...",How machine learning and artificial intelligen...,79.0,1,13,Towards Data Science,2022-02-28,2022
814,815,https://towardsdatascience.com/background-task...,Background tasks for NLP,Heavy lifting is better in the background rath...,3,33,"[Background tasks for NLP, Recap, Heavy liftin...",Those people dancing in the photo are similar ...,0.0,0,8,Towards Data Science,2022-02-28,2022
815,816,https://towardsdatascience.com/elasticsearch-w...,Elasticsearch Workshop #6 — Scripting Part 4,Regex and pattern matching,9,27,"[Elasticsearch Workshop #6 — Scripting Part 4,...","Welcome to part 6 of the workshop. As usual, t...",15.0,0,5,Towards Data Science,2022-02-28,2022
816,817,https://towardsdatascience.com/data-science-tr...,Data Science training — run them effectively,A few leads and watch-outs as you prepare to t...,2,50,"[Data Science training — run them effectively,...",Data Science and Analytics is a very rapidly e...,55.0,0,8,Towards Data Science,2022-02-28,2022


In [24]:
medium_df.to_csv("tds.csv",index=True)
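Two things are worth knowing when this CSV is read back. With `index=True` the index is written as a column without a header, which `pd.read_csv` names `Unnamed: 0` by default; this may be where that header in the displayed table comes from. List-valued columns such as `section_titles` are also stored as their string representation. The sketch below shows both effects on a tiny made-up frame, using an in-memory buffer instead of `tds.csv`.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for medium_df; the values are invented for illustration.
df = pd.DataFrame({
    'id': [1, 2],
    'section_titles': [['Intro', 'Setup'], ['Recap']],
})

buffer = io.StringIO()
df.to_csv(buffer, index=True)

# Naive read: the index comes back as an 'Unnamed: 0' column and lists as strings.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(list(naive.columns))

# Restoring read: use the first column as the index and parse the list column.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    index_col=0,
    converters={'section_titles': ast.literal_eval},
)
print(restored.loc[0, 'section_titles'])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for turning the stored strings back into lists.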

In [25]:
requests.get(article_url)

<Response [200]>

In [26]:
page = response.content
soup = BeautifulSoup(page, 'html.parser')

In [27]:
page

b'<!DOCTYPE html><html xmlns:cc="http://creativecommons.org/ns#"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=contain"><title>All stories published by Towards Data Science on February 01, 2022</title><link rel="canonical" href="https://towardsdatascience.com/archive/2022/02/01"><meta name="robots" content="index,follow"><meta name="title" content="All stories published by Towards Data Science on February 01, 2022"><meta name="referrer" content="unsafe-url"><meta name="description" content="Read all stories published by Towards Data Science on February 01, 2022. Your home for data science. A Medium publication sharing concepts, ideas and codes."><meta name="theme-color" content="#000000"><meta property="og:title" content="All stories published by Towards Data Science on February 

In [33]:
get_article_text("https://towardsdatascience.com/web-scraping-with-python-beautifulsoup-40d2ce4b6252")

(4,
 34,
 ['Web scraping with Python & BeautifulSoup',
  'Installing the libraries',
  'Using requests & beautiful soup to extract data',
  'Web scraping example: get top 10 linux distros'],
 "The web contains lots of data. The ability to extract the information you need from it is, with no doubt, a useful one, even necessary. Of course, there are still lots of datasets already available for you to download, on places like Kaggle, but in many cases, you won’t find the exact data that you need for your particular problem. However, chances are you’ll find what you need somewhere on the web and you’ll need to extract it from there. Web scraping is the process of doing this, of extracting data from web pages. In this article, we’ll see how to do web scraping in python. For this task, there are several libraries that you can use. Among these, here we will use Beautiful Soup 4. This library takes care of extracting data from a HTML document, not downloading it. For downloading web pages, we 