The notes and steps below outline the flow of this notebook.

Notes:
1. File `machinelearningmastery_url.csv` has all the article URLs.
2. File `machine-learning-full.json` has all the keywords to match against the articles.
3. File `machinelearningmastery.csv` has the data for each article.

Steps:
1. Import the required packages.
2. Read keywords from `machine-learning-full.json` and save them in `data`.
3. Read articles from `scrubbed_machine_learning_mastery.json` and save them in `articles`.
4. The keywords from step 2 are nested hashes. Flatten them into a single list of keywords.
5. Map the plural form of each keyword to its singular form so that every keyword is a singular noun.
6. Map the original keywords to lists of articles.
7. Map the normalised keywords to lists of articles.

In [70]:
### Step 1: Import necessary packages here ###

import csv
import pandas
import numpy

# feedparser parses XML feeds into a hash
# Install: conda install feedparser
import feedparser

# BeautifulSoup helps to grab text out of html
# Install: conda install beautifulsoup4
from bs4 import BeautifulSoup

import json
import urllib3

from collections import Counter

import newspaper
from newspaper import Article
from pandas import read_csv
from lxml import html
import requests

# inflect helps to identify if a word is singular or plural.
# conda install -c conda-forge inflect
# --- fails to detect if a word is a noun or not ---
import inflect

In [71]:
### Step 2: Read JSON file with domain and topic list to grab all the topics ###
with open('machine-learning-full.json') as infile:
    data = json.load(infile)

In [72]:
### Step 3: Read scrubbed articles from `scrubbed_machine_learning_mastery.json`
with open('scrubbed_machine_learning_mastery.json') as infile:
    articles = json.load(infile)

In [73]:
### Step 4: Flatten all the topics from JSON file ###
flattened_topics = []
p = inflect.engine()

for domain in data:
    # domain-level label
    if domain['label'] is not None:
        flattened_topics.append(domain['label'].lower())
    # concept-level labels nested under the domain
    for concept in domain['concepts']:
        if concept['label'] is not None:
            flattened_topics.append(concept['label'].lower())

keywords = list(filter(None, flattened_topics))

# retrieve unique keywords
keywords = list(set(keywords))

In the steps above:
- `machine-learning-full.json` lists the keywords in a tree structure.
- Levelling keywords by their position in the tree is not yet part of this project.
- For now, we flatten all the keywords just to verify which articles are listed under each keyword.
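
The flattening described above can be sketched on a small sample. This is only an illustration: `sample_domains` and `flatten_labels` are made-up names, and the sample assumes the same shape the Step 4 loop relies on (a list of domains, each with a `label` and a list of `concepts` that also carry a `label`).

```python
# Hypothetical sample mirroring the structure that Step 4 iterates over.
sample_domains = [
    {"label": "Machine Learning",
     "concepts": [{"label": "Loss Function"}, {"label": None}]},
    {"label": None,
     "concepts": [{"label": "Hilbert Space"}, {"label": "loss function"}]},
]


def flatten_labels(domains):
    """Collect lower-cased domain and concept labels, skipping missing ones."""
    labels = set()  # a set removes duplicates as we go
    for domain in domains:
        if domain.get("label"):
            labels.add(domain["label"].lower())
        for concept in domain.get("concepts", []):
            if concept.get("label"):
                labels.add(concept["label"].lower())
    return sorted(labels)


flat = flatten_labels(sample_domains)
print(flat)
```

Using a set from the start folds the "filter out empty values" and "retrieve unique keywords" passes into the loop itself.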

In [75]:
### Step 5: Singularize plural noun ###
# NOTE: This conversion is not fully reliable, e.g. it turns 'image analysis' into 'image analysi'.

normalised_keywords = list()
mappings = dict()

for keyword in keywords:
    key = p.singular_noun(keyword)
    if key == False:
        key = keyword
    normalised_keywords.append(key)
    mappings[keyword] = key

# define uniqueness
normalised_keywords = list(set(normalised_keywords))

In step 5:
- A keyword can be either the plural or the singular form of a noun.
- We do not want to treat the two forms as different concepts.
- We drop the plural form of a keyword and keep the singular form.
- The singular form is usually contained in the plural form, which suffices for our needs.
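
As a rough sketch of the idea in step 5, the snippet below maps each keyword to a singular form and falls back to the keyword itself when no singular is found. To keep it runnable without `inflect`, it uses a hypothetical stand-in, `toy_singularize`, that follows the contract Step 5 checks for (a singular string, or `False` when the word is not a plural). The names `normalise_keywords`, `TOY_PLURALS` and the sample data are assumptions for illustration only.

```python
def normalise_keywords(keywords, singularize):
    """Map each keyword to its singular form, or to itself if none is found.

    `singularize` is any callable that returns the singular string, or
    False when the word is not a plural.
    """
    mapping = {}
    for keyword in keywords:
        singular = singularize(keyword)
        mapping[keyword] = singular if singular else keyword
    return mapping


# Hypothetical lookup table standing in for a real singularizer.
TOY_PLURALS = {"algorithms": "algorithm", "parameters": "parameter"}


def toy_singularize(word):
    return TOY_PLURALS.get(word, False)


sample = ["algorithm", "algorithms", "parameters", "image analysis"]
mapping = normalise_keywords(sample, toy_singularize)
unique_normalised = sorted(set(mapping.values()))
print(mapping)
print(unique_normalised)
```

Passing the singularizer in as an argument makes it easy to swap in a stricter one if the real conversion produces bad forms.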

In [76]:
### Display the mapping from each keyword to its singular form.

mappings

{'visual programming language': 'visual programming language',
 'orthonormal': 'orthonormal',
 'knowledge interchange format': 'knowledge interchange format',
 'range search': 'range search',
 'aston martin': 'aston martin',
 'image analysis': 'image analysi',
 'oded goldreich': 'oded goldreich',
 'cython': 'cython',
 'cutoff (reference value)': 'cutoff (reference value)',
 'never-ending language learning': 'never-ending language learning',
 'toxicogenomics': 'toxicogenomic',
 'semantic analysis (machine learning)': 'semantic analysis (machine learning)',
 'alpha–beta pruning': 'alpha–beta pruning',
 'temporal-difference learning': 'temporal-difference learning',
 'database marketing': 'database marketing',
 'piecewise': 'piecewise',
 'secret escapes': 'secret escape',
 'negative predictive value': 'negative predictive value',
 'iowa state university': 'iowa state university',
 'jeff hawkins': 'jeff hawkin',
 'end-user development': 'end-user development',
 'imagenet': 'imagenet',
 'mi

In [77]:
# inspect keywords
print("Total keywords: ", len(keywords))

# inspect normalised keywords
print("Total normalised keywords: ", len(normalised_keywords))

# inspect articles
print("Total articles: ", len(articles))

Total keywords:  2980
Total normalised keywords:  2931
Total articles:  547


In [78]:
print(normalised_keywords)

['visual programming language', 'knowledge interchange format', 'orthonormal', 'range search', 'aston martin', 'oded goldreich', 'cython', 'cutoff (reference value)', 'never-ending language learning', 'semantic analysis (machine learning)', 'alpha–beta pruning', 'temporal-difference learning', 'database marketing', 'piecewise', 'negative predictive value', 'iowa state university', 'end-user development', 'imagenet', 'minimum message length', "bayes's theorem", 'regular expression', 'gramian matrix', 'university of california, berkeley', 'term (logic)', 'monoid homomorphism', 'rprop', 'interface (computing)', 'data visualization', 'the art of computer programming', 'multivariate analysi', 'loss function', 'maximum likelihood', 'short-term memory', 'video surveillance', 'relational dependency network', 'hilbert space', 'error tolerance (pac learning)', 'physical space', 'rule-based system', 'moore–penrose pseudoinverse', 'inauthentic text', 'world wide web consortium', 'abstraction (soft

In [79]:
### Step 6: Look up by keywords in each of the articles ###
# Note: We split the content on spaces into a list and look for an exact token match,
# so a keyword that contains a space never matches.

occurrences = dict()

print("Gathering articles by keywords...")
for keyword in keywords:
    occurrences[keyword] = []
    for article in articles:
        contents = article['content'].split(' ')
        if keyword in contents:
            occurrences[keyword].append(article['link'])

print("Statistics is now ready to review.")

Gathering articles by keywords...
Statistics is now ready to review.


In step 6:
- We map each keyword to the list of articles that fall under it.
- This is to understand the pattern of articles versus keywords.
- It shows how familiar an article is with respect to the keywords and the people who marked it.
- This step is optional, since it runs over all the original keywords from the JSON file.
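
The matching in step 6 can be sketched on a toy sample as follows. The articles, links and the helper name `match_by_token` are made up for illustration. The sketch keeps the same rule as the loop above (split the content on single spaces, then test for an exact token), but tokenises each article once instead of once per keyword.

```python
# Hypothetical articles with the two fields the notebook reads.
sample_articles = [
    {"link": "https://example.com/a",
     "content": "a loss function measures prediction error"},
    {"link": "https://example.com/b",
     "content": "gradient descent minimises a function"},
]
sample_keywords = ["function", "gradient", "loss function"]


def match_by_token(keywords, articles):
    """Return {keyword: [links]} using exact token matches."""
    # Build one token set per article up front.
    tokenised = [(a["link"], set(a["content"].split(" "))) for a in articles]
    return {kw: [link for link, tokens in tokenised if kw in tokens]
            for kw in keywords}


found = match_by_token(sample_keywords, sample_articles)
print(found)
```

Note that `loss function` gets an empty list even though the first article contains the phrase: splitting on spaces yields single-word tokens, so multi-word keywords cannot match under this rule.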

In [80]:
### Step 7: Look up by normalised keywords in each of the articles ###
# Note: We split the content on spaces into a list and look for an exact token match,
# so a keyword that contains a space never matches.

normalised_occurrences = dict()

print("Gathering articles by normalised keywords...")
for keyword in normalised_keywords:
    normalised_occurrences[keyword] = []
    for article in articles:
        contents = article['content'].split(' ')
        if keyword in contents:
            normalised_occurrences[keyword].append(article['link'])

print("Statistics is now ready to review.")

Gathering articles by normalised keywords...
Statistics is now ready to review.


In step 7:
- We map each normalised keyword to the list of articles that fall under it.
- This is to understand the pattern of articles versus keywords.
- It shows how familiar an article is with respect to the keywords and the people who marked it.
- This step is mandatory, since we normalised all the keywords to get rid of plural forms.
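
Step 7 searches only for the singular token, so an article that contains just the plural form is not counted under the normalised keyword. One possible alternative, which is not what this notebook does, is to merge the step 6 results through the plural-to-singular mapping. The sketch below assumes made-up sample data and a hypothetical helper `merge_by_mapping`.

```python
def merge_by_mapping(occurrences, mappings):
    """Union the article lists of all keywords that share a singular form."""
    merged = {}
    for keyword, links in occurrences.items():
        key = mappings.get(keyword, keyword)
        bucket = merged.setdefault(key, [])
        for link in links:
            if link not in bucket:  # keep each article once per keyword
                bucket.append(link)
    return merged


# Hypothetical step 6 output and step 5 mapping.
sample_occurrences = {
    "pattern": ["u1", "u2"],
    "patterns": ["u2", "u3"],
    "kaggle": ["u1"],
}
sample_mappings = {"pattern": "pattern", "patterns": "pattern", "kaggle": "kaggle"}

merged = merge_by_mapping(sample_occurrences, sample_mappings)
print(merged)
```

With this variant, `u3` (which only mentions `patterns`) still counts towards `pattern`.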

In [81]:
### FORMAT: JSON dump from keywords ###
print(json.dumps(occurrences, sort_keys=True, indent=4))

{
    ".net framework": [],
    "0-1 loss function": [],
    "a priori and a posteriori": [],
    "abductive logic programming": [],
    "abductive reasoning": [],
    "abortion": [],
    "abstract data type": [],
    "abstraction": [
        "https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/",
        "https://machinelearningmastery.com/primer-neural-network-models-natural-language-processing/",
        "https://machinelearningmastery.com/stacked-long-short-term-memory-networks/",
        "https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/",
        "https://machinelearningmastery.com/time-series-data-stationary-python/",
        "https://machinelearningmastery.com/implement-simple-linear-regression-scratch-python/",
        "https://machinelearningmastery.com/dont-implement-machine-learning-algorithms/",
        "https://machinelearningmastery.com/what-is-deep-learning/",
        "https://machinelearningmastery.com/gentl

In [82]:
### FORMAT: JSON dump from normalised keywords ###
print(json.dumps(normalised_occurrences, sort_keys=True, indent=4))

{
    ".net framework": [],
    "0-1 loss function": [],
    "False general intelligence": [],
    "False general public license": [],
    "a priori and a posteriori": [],
    "abductive logic programming": [],
    "abductive reasoning": [],
    "abortion": [],
    "abstract data type": [],
    "abstraction": [
        "https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/",
        "https://machinelearningmastery.com/primer-neural-network-models-natural-language-processing/",
        "https://machinelearningmastery.com/stacked-long-short-term-memory-networks/",
        "https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/",
        "https://machinelearningmastery.com/time-series-data-stationary-python/",
        "https://machinelearningmastery.com/implement-simple-linear-regression-scratch-python/",
        "https://machinelearningmastery.com/dont-implement-machine-learning-algorithms/",
        "https://machinelearningmastery.

In [83]:
### Stats from keyword-articles mappings: count the matching articles for each keyword ###

occ_counts = Counter()
for k, v in occurrences.items():
    occ_counts[k] = len(v)

In [84]:
occ_counts.most_common()

[('learning', 547),
 ('email', 547),
 ('google', 547),
 ('machine', 547),
 ('prediction', 547),
 ('data', 528),
 ('algorithm', 375),
 ('algorithms', 349),
 ('mean', 266),
 ('information', 259),
 ('accuracy', 227),
 ('linear', 217),
 ('parameters', 204),
 ('understanding', 181),
 ('r', 172),
 ('knowledge', 163),
 ('gradient', 145),
 ('parameter', 137),
 ('statistics', 135),
 ('scikit-learn', 130),
 ('numpy', 115),
 ('optimization', 112),
 ('thinking', 105),
 ('probability', 98),
 ('observation', 97),
 ('science', 95),
 ('image', 94),
 ('argument', 90),
 ('software', 90),
 ('tensorflow', 88),
 ('integer', 85),
 ('increasing', 84),
 ('engineering', 82),
 ('complexity', 80),
 ('kaggle', 77),
 ('patterns', 75),
 ('scipy', 73),
 ('subset', 67),
 ('univariate', 65),
 ('translation', 65),
 ('variance', 63),
 ('dimensions', 62),
 ('document', 62),
 ('overfitting', 60),
 ('bias', 56),
 ('concept', 54),
 ('embedding', 51),
 ('video', 51),
 ('analytics', 50),
 ('github', 50),
 ('database', 49),
 (

In [85]:
### Stats from normalised_keyword-articles mappings: count the matching articles for each keyword ###

normalised_occ_counts = Counter()
for k, v in normalised_occurrences.items():
    normalised_occ_counts[k] = len(v)

In [86]:
normalised_occ_counts.most_common()

[('learning', 547),
 ('email', 547),
 ('google', 547),
 ('machine', 547),
 ('prediction', 547),
 ('algorithm', 375),
 ('mean', 266),
 ('information', 259),
 ('accuracy', 227),
 ('linear', 217),
 ('understanding', 181),
 ('r', 172),
 ('knowledge', 163),
 ('gradient', 145),
 ('parameter', 137),
 ('scikit-learn', 130),
 ('numpy', 115),
 ('optimization', 112),
 ('thinking', 105),
 ('probability', 98),
 ('observation', 97),
 ('science', 95),
 ('image', 94),
 ('argument', 90),
 ('software', 90),
 ('logistic', 88),
 ('tensorflow', 88),
 ('integer', 85),
 ('increasing', 84),
 ('engineering', 82),
 ('complexity', 80),
 ('pattern', 79),
 ('kaggle', 77),
 ('scipy', 73),
 ('subset', 67),
 ('univariate', 65),
 ('translation', 65),
 ('variance', 63),
 ('document', 62),
 ('overfitting', 60),
 ('concept', 54),
 ('embedding', 51),
 ('video', 51),
 ('github', 50),
 ('database', 49),
 ('dimension', 47),
 ('perceptron', 47),
 ('distance', 41),
 ('cpu', 41),
 ('computing', 39),
 ('nonlinear', 39),
 ('gpu',

In this step:
- Each keyword shows the number of articles that fall under it.
- This helps to determine how simple or complex the keyword is.
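
The counting above can also be written in one expression. A minimal sketch, assuming a small made-up `sample_occurrences` dict in the same `{keyword: [links]}` shape:

```python
from collections import Counter

# Hypothetical keyword -> article links mapping.
sample_occurrences = {
    "learning": ["u1", "u2", "u3"],
    "gradient": ["u2"],
    "abortion": [],
}

# One count per keyword: the number of articles listed under it.
counts = Counter({kw: len(links) for kw, links in sample_occurrences.items()})
top = counts.most_common(2)
print(top)
```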

In [87]:
### Step 8: Mappings and scores evaluation ###
# 8.1 Map each article to all the keywords that exist in it and store the result in `articles_keywords_maps`.
# 8.2 Sum the scores of all the keywords that exist in an article and store the total in `articles_keywords_scores`.
# 8.3 Collect the scored keywords of each article in `articles_keywords_scores_list`
#     (the loop below appends the keyword itself, not its score).

# 8.1
articles_keywords_maps = dict()
# 8.2
articles_keywords_scores = Counter()
# 8.3
articles_keywords_scores_list = dict()


for keyword, links in normalised_occurrences.items():
    for link in links:
        if link not in articles_keywords_maps:
            articles_keywords_maps[link] = []
        if link not in articles_keywords_scores:
            articles_keywords_scores[link] = 0
        if link not in articles_keywords_scores_list:
            articles_keywords_scores_list[link] = []
        articles_keywords_maps[link].append(keyword)
        articles_keywords_scores[link] += normalised_occ_counts[keyword]
        articles_keywords_scores_list[link].append(keyword)

print("Done aggregating!!!")

Done aggregating!!!


In step 8:

We now reverse the process and, for each article, grab the list of matching keywords from the JSON file.
This is to study how familiar and how complex the article is.

- `articles_keywords_maps` maps a single article to its list of keywords.
- `articles_keywords_scores` maps a single article to the sum of the scores of its keywords. Each score is retrieved from `normalised_occ_counts`.

##### Why is normalised_occ_counts used?
- The higher a keyword's score in `normalised_occ_counts`, the more widely it is used.
- It helps to determine how frequently the word is used to describe a concept.
- This helps to understand how simple the concept behind a given keyword is.

##### Then what? Why are we summing up the scores?
- A keyword's score is simply the number of articles that fall under it.
- So we sum the scores of all the keywords in a given article. The total gives a broader view of how complex or simple the article is.
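
A small worked example of the inversion and summing described above, using made-up data (`u1`, `u2`, `u3` stand in for article links):

```python
from collections import Counter, defaultdict

# Hypothetical keyword -> article links mapping.
sample_occurrences = {
    "learning": ["u1", "u2", "u3"],
    "gradient": ["u2", "u3"],
    "perceptron": ["u3"],
}

# A keyword's score is the number of articles that fall under it.
keyword_score = {kw: len(links) for kw, links in sample_occurrences.items()}

article_keywords = defaultdict(list)  # article -> keywords found in it
article_score = Counter()             # article -> sum of keyword scores

for kw, links in sample_occurrences.items():
    for link in links:
        article_keywords[link].append(kw)
        article_score[link] += keyword_score[kw]

ranking = article_score.most_common()
print(ranking)
```

Here `u3` scores 3 + 2 + 1 = 6, `u2` scores 3 + 2 = 5, and `u1` scores 3, so the article with the most keywords comes out on top.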

In [93]:
articles_keywords_maps

{'https://machinelearningmastery.com/16-options-to-get-started-and-make-progress-in-machine-learning-and-data-science/': ['london',
  'thinking',
  'kaggle',
  'intel',
  'learning',
  'email',
  'u.s.',
  'google',
  'cancer',
  'machine',
  'hadoop',
  'software',
  'pc',
  'conjunction',
  'stanford',
  'algorithm',
  'science',
  'mean',
  'mit',
  'engineering',
  'information',
  'prediction'],
 'https://machinelearningmastery.com/4-steps-to-get-started-in-machine-learning/': ['image',
  'learning',
  'linux',
  'email',
  'medicine',
  'facebook',
  'google',
  'cancer',
  'machine',
  'software',
  'accuracy',
  'database',
  'concept',
  'cross-platform',
  'likelihood',
  'algorithm',
  'scikit-learn',
  'github',
  'knowledge',
  'engineering',
  'information',
  'prediction'],
 'https://machinelearningmastery.com/5-benefits-of-competitive-machine-learning/': ['netflix',
  'ranking',
  'kaggle',
  'learning',
  'email',
  'r',
  'google',
  'machine',
  'analogy',
  'science

In [89]:
### FORMAT: Count of keywords for an article ###

article_keywords_counts = Counter()
for k, v in articles_keywords_maps.items():
    article_keywords_counts[k] = len(v)

In [56]:
article_keywords_counts.most_common()

[('https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/',
  62),
 ('https://machinelearningmastery.com/machine-learning-in-python-step-by-step/',
  57),
 ('https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/',
  50),
 ('https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/',
  48),
 ('https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/',
  46),
 ('https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/',
  44),
 ('https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/',
  37),
 ('https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/',
  36),
 ('https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/',
  36),
 ('https://machinelearningmastery.com/what-is-deep-learning/', 36),
 ('https://machine

In [60]:
articles_keywords_scores.most_common()

[('https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/',
  7083),
 ('https://machinelearningmastery.com/machine-learning-in-python-step-by-step/',
  6928),
 ('https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/',
  6489),
 ('https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/',
  6458),
 ('https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/',
  6269),
 ('https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/',
  6170),
 ('https://machinelearningmastery.com/machine-learning-for-programmers/',
  6113),
 ('https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/',
  6058),
 ('https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/',
  6000),
 ('https://machinelearningmastery.com/regression-tutorial-keras-deep-l

In [64]:
articles_keywords_scores_list

{'https://machinelearningmastery.com/16-options-to-get-started-and-make-progress-in-machine-learning-and-data-science/': ['london',
  'thinking',
  'kaggle',
  'intel',
  'learning',
  'email',
  'u.s.',
  'google',
  'cancer',
  'machine',
  'hadoop',
  'software',
  'pc',
  'conjunction',
  'stanford',
  'algorithm',
  'science',
  'mean',
  'mit',
  'engineering',
  'information',
  'prediction'],
 'https://machinelearningmastery.com/4-steps-to-get-started-in-machine-learning/': ['image',
  'learning',
  'linux',
  'email',
  'medicine',
  'facebook',
  'google',
  'cancer',
  'machine',
  'software',
  'accuracy',
  'database',
  'concept',
  'cross-platform',
  'likelihood',
  'algorithm',
  'scikit-learn',
  'github',
  'knowledge',
  'engineering',
  'information',
  'prediction'],
 'https://machinelearningmastery.com/5-benefits-of-competitive-machine-learning/': ['netflix',
  'ranking',
  'kaggle',
  'learning',
  'email',
  'r',
  'google',
  'machine',
  'analogy',
  'science

In [69]:
### Step 9: Write the keyword-article counts as JSON to `keywords_articles_count.json` ###

# write the (keyword, count) pairs, most common first, to `keywords_articles_count.json`
with open('keywords_articles_count.json', 'w') as outfile:
    json.dump(normalised_occ_counts.most_common(), outfile, sort_keys=True, indent=4,
              ensure_ascii=False)
print('Done writing to file!!!')

Done writing to file!!!


### Observation

`articles_keywords_scores` was expected to list the naive articles at the top and the advanced/complex articles at the bottom.

At the end of this iteration, we noticed that even a complex article can contain all the basic keywords that everybody knows.
In addition, advanced keywords that not everybody is aware of also contributed to an article's score.
As a result, advanced articles ended up with the highest scores.

Simple articles were pushed to the bottom of the list.
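
The effect can be reproduced with a toy example. The numbers below are made up, and the average is shown only as a contrast to the sum; it is not something this notebook computes.

```python
# Hypothetical per-keyword scores for two articles. The advanced article
# contains the same common keywords plus a few rarer ones.
article_keyword_scores = {
    "simple-article": [547, 547, 375],
    "advanced-article": [547, 547, 375, 47, 41, 39],
}

summed = {a: sum(s) for a, s in article_keyword_scores.items()}
averaged = {a: sum(s) / len(s) for a, s in article_keyword_scores.items()}

print(summed)
print(averaged)
```

Every extra keyword can only add to the sum, so the article with more keywords ranks higher by total even though its average keyword is rarer.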

We need this data for analysis going forward.