### Detailed Steps:

- Step 1: Import the required packages.
- Step 2: Initialise variables to hold the article links and the raw and scrubbed contents.
- Step 3: Retrieve all article links from the given blog.
- Step 4: Crawl each article link to retrieve its title, content and link.
- Step 5: Save the scrubbed content in `scrubbed_machine_learning_mastery.json`.
- Step 6: Save the raw HTML of each article as a file in the `Machine Learning Mastery articles` folder.

In [1]:
### Step 1: Import the necessary packages ###

import csv
import pandas
import numpy

# feedparser parses RSS/Atom feeds (XML) into dictionary-like objects
# Install: conda install feedparser
import feedparser

# BeautifulSoup extracts text from HTML
# Install: conda install beautifulsoup4
from bs4 import BeautifulSoup

import json
import urllib3

from collections import Counter

import scrapy
import newspaper
from newspaper import Article
from pandas import read_csv
from lxml import html
import requests

In [2]:
### Step 2: Initialise variables to store raw articles, scrubbed articles and article links ###

# for raw article
raw_json = list()

# for scrubbed article
scrubbed_json = list()

# for article links
article_links = list()

In [3]:
### Step 3: Feed in the blog URL to aggregate content ###
# 3.1 Select the blog that we want to crawl.
# 3.2 Grab all article links from the current page.
# 3.3 Crawl pages until we reach a page that returns no results.

feed_url = "http://machinelearningmastery.com/blog/"
urllib3.disable_warnings()
http = urllib3.PoolManager()

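# Define a function that accepts a page URL, collects its article links,
# and reports whether the page contained any.
def parse(page):
    print("Running for: ", page)
    response = http.request('GET', page)
    # named `page_html` to avoid shadowing `html` imported from lxml
    page_html = response.data
    soup = BeautifulSoup(page_html, 'lxml')
    # find_all is the current name for the older findAll alias
    page_links = soup.find_all('a', attrs={ 'rel': "bookmark" })

    # an empty page signals the end of the pagination
    if len(page_links) == 0:
        return False

    for link in page_links:
        article_links.append(link.get('href'))
    return True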

In [4]:
# start from the feed URL; later pages are addressed as feed_url + "page/N"
current_page = feed_url

# maintain a counter for pagination
count = 1

# iterate until a page returns no article links
while True:
    status = parse(current_page)
    if not status:
        print("I'm done!")
        break
    count += 1
    current_page = feed_url + "page/" + str(count)

Running for:  http://machinelearningmastery.com/blog/
Running for:  http://machinelearningmastery.com/blog/page/2
Running for:  http://machinelearningmastery.com/blog/page/3
Running for:  http://machinelearningmastery.com/blog/page/4
Running for:  http://machinelearningmastery.com/blog/page/5
Running for:  http://machinelearningmastery.com/blog/page/6
Running for:  http://machinelearningmastery.com/blog/page/7
Running for:  http://machinelearningmastery.com/blog/page/8
Running for:  http://machinelearningmastery.com/blog/page/9
Running for:  http://machinelearningmastery.com/blog/page/10
Running for:  http://machinelearningmastery.com/blog/page/11
Running for:  http://machinelearningmastery.com/blog/page/12
Running for:  http://machinelearningmastery.com/blog/page/13
Running for:  http://machinelearningmastery.com/blog/page/14
Running for:  http://machinelearningmastery.com/blog/page/15
Running for:  http://machinelearningmastery.com/blog/page/16
Running for:  http://machinelearningmas

In [5]:
print("Total number of links to crawl: ", len(article_links))

Total number of links to crawl:  547
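
The loop in Step 3 stops only when a page returns no links, so it relies on the site eventually serving an empty page. The sketch below shows the same "crawl until empty" pattern with two additions that are assumptions, not part of the original notebook: a `max_pages` cap and de-duplication of links. `collect_links`, `fake_fetch` and `FAKE_SITE` are hypothetical names; `fake_fetch` stands in for a real fetcher such as `parse` so the example runs offline.

```python
from typing import Callable, List


def collect_links(feed_url: str,
                  fetch_links: Callable[[str], List[str]],
                  max_pages: int = 100) -> List[str]:
    """Walk feed_url, feed_url + 'page/2', ... until a page yields no links."""
    links: List[str] = []
    seen = set()
    for page_number in range(1, max_pages + 1):
        # Page 1 is the bare feed URL; later pages follow the "page/N" pattern.
        page_url = feed_url if page_number == 1 else f"{feed_url}page/{page_number}"
        page_links = fetch_links(page_url)
        if not page_links:
            break
        for link in page_links:
            # Keep the first occurrence of each link and preserve order.
            if link not in seen:
                seen.add(link)
                links.append(link)
    return links


# Hypothetical stand-in for a real site, so the sketch runs without network access.
FAKE_SITE = {
    "http://example.com/blog/": ["a", "b"],
    "http://example.com/blog/page/2": ["b", "c"],
}


def fake_fetch(page_url: str) -> List[str]:
    return FAKE_SITE.get(page_url, [])


print(collect_links("http://example.com/blog/", fake_fetch))
```

Passing the fetcher in as an argument keeps the pagination logic separate from the HTTP and parsing code, which makes it easy to try out on canned data before pointing it at a live blog.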


In [6]:
### Step 4: Crawl every article to pull in title, content and URL ###

# machinelearningmastery.com expects a User-Agent in the request header;
# otherwise it returns a 403 Forbidden error.
headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" }

for link in article_links:
    print("Parsing URL: ", link)

    # fetch the article
    current_page = requests.get(link, headers = headers)
    # pass the HTML content to BeautifulSoup
    tree = BeautifulSoup(current_page.content, "lxml")
    # find the title of the article
    title = tree.body.find('h1', attrs={'class': 'entry-title'}).text
    # keep the whole <html> element as the raw content
    raw_content = tree.html
    # scrub the HTML tags, remove \t and \n, and lowercase the text
    scrubbed_content = raw_content.text.replace('\t', '').replace('\n', '').lower()
    # add raw data to `raw_json` as a string of HTML
    raw_entry = { 'title': title, 'link': link, 'content': str(raw_content) }
    raw_json.append(raw_entry)
    # add scrubbed data to `scrubbed_json`
    scrubbed_entry = { 'title': title, 'link': link, 'content': scrubbed_content }
    scrubbed_json.append(scrubbed_entry)

Parsing URL:  https://machinelearningmastery.com/run-deep-learning-experiments-linux-server/
Parsing URL:  https://machinelearningmastery.com/standard-multivariate-multi-step-multi-site-time-series-forecasting-problem/
Parsing URL:  https://machinelearningmastery.com/install-xgboost-python-macos/
Parsing URL:  https://machinelearningmastery.com/generate-test-datasets-python-scikit-learn/
Parsing URL:  https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/
Parsing URL:  https://machinelearningmastery.com/develop-neural-machine-translation-system-keras/
Parsing URL:  https://machinelearningmastery.com/prepare-french-english-dataset-machine-translation/
Parsing URL:  https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/
Parsing URL:  https://machinelearningmastery.com/configure-encoder-decoder-model-neural-machine-translation/
Parsing URL:  https://machinelearningmastery.com/encoder-decoder-recurrent-ne

Parsing URL:  https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
Parsing URL:  https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/
Parsing URL:  https://machinelearningmastery.com/much-training-data-required-machine-learning/
Parsing URL:  https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/
Parsing URL:  https://machinelearningmastery.com/sequence-prediction-problems-learning-lstm-recurrent-neural-networks/
Parsing URL:  https://machinelearningmastery.com/models-sequence-prediction-recurrent-neural-networks/
Parsing URL:  https://machinelearningmastery.com/difference-test-validation-datasets/
Parsing URL:  https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
Parsing URL:  https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/
Parsing URL:  https://machinelearningmastery.com/how-to-scale-data-for-long-short-

Parsing URL:  https://machinelearningmastery.com/power-transform-time-series-forecast-data-python/
Parsing URL:  https://machinelearningmastery.com/reframe-time-series-forecasting-problem/
Parsing URL:  https://machinelearningmastery.com/gentle-introduction-random-walk-times-series-forecasting-python/
Parsing URL:  https://machinelearningmastery.com/grid-search-arima-hyperparameters-with-python/
Parsing URL:  https://machinelearningmastery.com/machine-learning-books/
Parsing URL:  https://machinelearningmastery.com/gentle-introduction-box-jenkins-method-time-series-forecasting/
Parsing URL:  https://machinelearningmastery.com/model-residual-errors-correct-time-series-forecasts-python/
Parsing URL:  https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
Parsing URL:  https://machinelearningmastery.com/visualize-time-series-residual-forecast-errors-with-python/
Parsing URL:  https://machinelearningmastery.com/time-series-data-visualization-with-python/
Parsing 

Parsing URL:  https://machinelearningmastery.com/use-ensemble-machine-learning-algorithms-weka/
Parsing URL:  https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
Parsing URL:  https://machinelearningmastery.com/use-classification-machine-learning-algorithms-weka/
Parsing URL:  https://machinelearningmastery.com/use-regression-machine-learning-algorithms-weka/
Parsing URL:  https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
Parsing URL:  https://machinelearningmastery.com/estimate-baseline-performance-machine-learning-models-weka/
Parsing URL:  https://machinelearningmastery.com/time-series-prediction-with-deep-learning-in-python-with-keras/
Parsing URL:  https://machinelearningmastery.com/estimate-performance-machine-learning-algorithms-weka/
Parsing URL:  https://machinelearningmastery.com/use-machine-learning-algorithms-weka/
Parsing URL:  https://machinelearningmastery.com/inspira

Parsing URL:  https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Parsing URL:  https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
Parsing URL:  https://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/
Parsing URL:  https://machinelearningmastery.com/how-machine-learning-algorithms-work/
Parsing URL:  https://machinelearningmastery.com/data-terminology-in-machine-learning/
Parsing URL:  https://machinelearningmastery.com/r-machine-learning-mini-course/
Parsing URL:  https://machinelearningmastery.com/dont-use-random-guessing-as-your-baseline-classifier/
Parsing URL:  https://machinelearningmastery.com/get-started-in-machine-learning-with-r/
Parsing URL:  https://machinelearningmastery.com/machine-learning-evaluation-metrics-in-r/
Parsing URL:  https://machinelearningmastery.com/compare-the-performance-of-machine-learning-algorithms-in-r/
Parsing URL:  http

Parsing URL:  https://machinelearningmastery.com/discover-the-methodology-and-mindset-of-a-kaggle-master-an-interview-with-diogo-ferreira/
Parsing URL:  https://machinelearningmastery.com/how-to-build-an-intuition-for-machine-learning-algorithms/
Parsing URL:  https://machinelearningmastery.com/a-data-driven-approach-to-machine-learning/
Parsing URL:  https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
Parsing URL:  https://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/
Parsing URL:  https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Parsing URL:  https://machinelearningmastery.com/tuning-machine-learning-models-using-the-caret-r-package/
Parsing URL:  https://machinelearningmastery.com/data-visualization-with-the-caret-r-package/
Parsing URL:  https://machinelearningmastery.com/caret-r-package-for-applied-predictive-modeling/
Parsing URL:  https://machi

Parsing URL:  https://machinelearningmastery.com/project-spotlight-stack-exchange-clustering-using-mahout-with-konstantin-slisenko/
Parsing URL:  https://machinelearningmastery.com/feature-selection-to-improve-accuracy-and-decrease-training-time/
Parsing URL:  https://machinelearningmastery.com/biggest-mistake-i-made-when-starting-machine-learning-and-how-to-avoid-it/
Parsing URL:  https://machinelearningmastery.com/template-for-working-through-machine-learning-problems-in-weka/
Parsing URL:  https://machinelearningmastery.com/a-simple-intuition-for-overfitting/
Parsing URL:  https://machinelearningmastery.com/4-steps-to-get-started-in-machine-learning/
Parsing URL:  https://machinelearningmastery.com/improve-machine-learning-results-with-boosting-bagging-and-blending-ensemble-methods-in-weka/
Parsing URL:  https://machinelearningmastery.com/how-to-tune-a-machine-learning-algorithm-in-weka/
Parsing URL:  https://machinelearningmastery.com/quick-and-dirty-data-analysis-for-your-machine-
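
The scrubbing line in Step 4 deletes tabs and newlines outright, which can fuse two words that were separated only by a line break. A possible alternative, shown here as a sketch rather than as the notebook's method, is to collapse every run of whitespace into a single space. `scrub` and `sample` are hypothetical names used for illustration.

```python
def scrub(text: str) -> str:
    """Lowercase text and collapse every run of whitespace to one space."""
    # str.split() with no arguments splits on any whitespace run
    # (spaces, tabs, newlines) and drops empty strings.
    return " ".join(text.split()).lower()


sample = "Deep\tLearning\nExperiments  on a Linux Server\n"

# Approach used in Step 4: characters are removed, so neighbouring words merge.
print(sample.replace("\t", "").replace("\n", "").lower())

# Whitespace-collapsing alternative: word boundaries survive.
print(scrub(sample))
```

Which behaviour is preferable depends on what the scrubbed text feeds into; word-level processing such as counting or tokenising generally needs the boundaries kept.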

In [1]:
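### Step 5: Write JSON data in text format in `scrubbed_machine_learning_mastery.json` ###

# Note: this cell depends on `json` (Step 1) and `scrubbed_json` (Steps 2 to 4).
# Run it in the same kernel session as those cells, otherwise it raises a NameError.

# write scrubbed JSON to `scrubbed_machine_learning_mastery.json`
with open('scrubbed_machine_learning_mastery.json', 'w', encoding='utf-8') as outfile:
    json.dump(scrubbed_json, outfile, sort_keys = True, indent = 4,
              ensure_ascii = False)
print('Done writing to file!!!')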

NameError: name 'json' is not defined
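
One way to check the result of Step 5 is to load the file back and compare it with what was written. The sketch below does this on a small sample in a temporary folder; `entries` is a hypothetical stand-in for `scrubbed_json`, and the file name `sample.json` is an assumption chosen so the real output file is not touched.

```python
import json
import os
import tempfile

# Hypothetical sample standing in for `scrubbed_json`.
entries = [
    {"title": "Example title", "link": "https://example.com/a/", "content": "caf\u00e9 text"},
]

path = os.path.join(tempfile.mkdtemp(), "sample.json")

# With ensure_ascii=False non-ASCII characters are written as-is,
# so an explicit UTF-8 encoding keeps the file portable across platforms.
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(entries, outfile, sort_keys=True, indent=4, ensure_ascii=False)

# Read the file back and confirm nothing was lost.
with open(path, encoding="utf-8") as infile:
    loaded = json.load(infile)

print(len(loaded), sorted(loaded[0]))
```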

In [48]:
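### Step 6: Save every article as an HTML file in the `Machine Learning Mastery articles` folder ###

import os

# create the output folder if it does not exist yet
output_dir = "Machine Learning Mastery articles"
os.makedirs(output_dir, exist_ok=True)

for entry in raw_json:
    file_name = os.path.join(output_dir, entry['title'] + ".html")
    # the `with` block closes the file, so no explicit close() is needed
    with open(file_name, 'w', encoding='utf-8') as outfile:
        outfile.write(str(entry['content']))
print("All HTML files downloaded!!!")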

All HTML files downloaded!!!
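
Step 6 uses each article title directly as a file name. A title that contains a path separator or another character the file system rejects would make `open` fail or write to an unintended location. The helper below is a minimal sketch of one way to guard against that; `safe_file_name` is a hypothetical name, and the set of replaced characters is an assumption based on the characters Windows disallows in file names.

```python
import re


def safe_file_name(title: str, extension: str = ".html") -> str:
    """Turn an article title into a file name without reserved characters."""
    # Replace path separators and characters that Windows rejects in file names.
    cleaned = re.sub(r'[\\/:*?"<>|]+', "-", title)
    # Collapse whitespace and trim spaces, dashes and dots left at either end.
    cleaned = " ".join(cleaned.split()).strip(" -.")
    # Fall back to a placeholder if nothing usable remains.
    return (cleaned or "untitled") + extension


print(safe_file_name("TCP/IP: What Is It?"))
```

Two different titles can map to the same cleaned name, so a crawler that must keep every article may also want to append a counter or a slug taken from the URL.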
