**The following cell must be run when using Google Colab to set up the notebook.**

In [None]:
# import sys
# sys.path.append('/content/drive/MyDrive/towards_ds_topic_modeling')
# 
# from google.colab import drive
# drive.mount('/content/drive')
# 
# !python -m spacy download en_core_web_md
# !pip install pyspark

# Topic Model Notebook
Author: Andrew  

This notebook outlines the steps used to clean the raw articles from Towards Data Science and to fit topic models on their titles and paragraphs.

The steps are adapted from:
* [PySpark for Natural Language Processing on Dataproc](https://codelabs.developers.google.com/codelabs/spark-nlp/#7)
* [Topic Modelling with PySpark and Spark NLP by Maria Obedkova](https://medium.com/trustyou-engineering/topic-modelling-with-pyspark-and-spark-nlp-a99d063f1a6e)

In [89]:
# Load Libraries
import pyspark

In [90]:
# start SparkSession
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark.getActiveSession()

In [91]:
# load CSV
# use the following when running on Colab
articles = spark.read.csv('../src/TDS_articles.csv', inferSchema=True, header=True, sep="\t")

In [92]:
articles.show()

+----------+--------------------+--------------------+-------------------+----------+-----+------+----------+--------------------+--------------------+
|article_id|               title|            subtitle|             author|      date|claps|images|codeblocks|                link|                body|
+----------+--------------------+--------------------+-------------------+----------+-----+------+----------+--------------------+--------------------+
|      3406|Iteratively Findi...|             Gear Up|       Kemal Tugrul|2018-03-24| null|     0|         0|https://towardsda...|"{""There is a th...|
|      5405|Why Get a Data Ex...|                null|       Lester Leong|2019-07-02| null|     0|         0|https://towardsda...|"{""This article ...|
|      5957|The curse of know...|                null|          Iwo Herka|2019-11-26| null|     0|         0|https://towardsda...|"{""Why are compe...|
|      6932|Image segmentatio...|                null|       Jakub Czakon|      null| nu

## Check Out the Data

In [93]:
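# register the DataFrame as a temporary SQL view
# (createOrReplaceTempView replaces the deprecated registerTempTable)
articles.createOrReplaceTempView('articles')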

In [5]:
query = """
SELECT * FROM articles;
"""
spark.sql(query).show()

+----------+--------------------+--------------------+-------------------+----------+-----+------+----------+--------------------+--------------------+
|article_id|               title|            subtitle|             author|      date|claps|images|codeblocks|                link|                body|
+----------+--------------------+--------------------+-------------------+----------+-----+------+----------+--------------------+--------------------+
|      3406|Iteratively Findi...|             Gear Up|       Kemal Tugrul|2018-03-24| null|     0|         0|https://towardsda...|"{""There is a th...|
|      5405|Why Get a Data Ex...|                null|       Lester Leong|2019-07-02| null|     0|         0|https://towardsda...|"{""This article ...|
|      5957|The curse of know...|                null|          Iwo Herka|2019-11-26| null|     0|         0|https://towardsda...|"{""Why are compe...|
|      6932|Image segmentatio...|                null|       Jakub Czakon|      null| nu

In [7]:
query = """
SELECT COUNT(subtitle)
FROM articles
WHERE subtitle like CONCAT('%', author, '%');
"""

spark.sql(query).show()

+---------------+
|count(subtitle)|
+---------------+
|             75|
+---------------+



For 75 articles, the author's name is repeated in the subtitle.
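
The `LIKE CONCAT('%', author, '%')` predicate in the query above is a substring test. As a minimal plain-Python sketch of the same check, the rows below are made up for illustration and are not taken from the dataset. The sketch assumes author names contain no `%` or `_` characters, which `LIKE` would treat as wildcards, and it treats `None` as SQL NULL, which never matches.

```python
# Hypothetical (subtitle, author) pairs standing in for rows of the articles table.
rows = [
    ("A guide by Jane Doe", "Jane Doe"),  # author's name appears in the subtitle
    ("Gear Up", "John Smith"),            # no match
    (None, "Jane Doe"),                   # NULL subtitle: never matches
    ("Notes from John Smith", None),      # NULL author: never matches
]


def author_in_subtitle(subtitle, author):
    """Mimic `subtitle LIKE CONCAT('%', author, '%')`, treating None as SQL NULL."""
    if subtitle is None or author is None:
        return False
    return author in subtitle


# COUNT(subtitle) over the filtered rows is the number of rows that pass the test.
match_count = sum(author_in_subtitle(s, a) for s, a in rows)
print(match_count)
```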

In [29]:
query = """
SELECT COUNT(title), COUNT(body)
FROM articles;
"""

spark.sql(query).show()

+------------+-----------+
|count(title)|count(body)|
+------------+-----------+
|       35507|      35646|
+------------+-----------+



`COUNT(column)` counts only non-null values. The two counts differ by 139, so at least 139 articles that have a body have no title.
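
`COUNT(column)` skips NULL values, while `COUNT(*)` counts every row, so comparing two column counts only bounds the number of missing titles. The sketch below illustrates this with Python's built-in `sqlite3` module standing in for Spark SQL (an assumption made so the example runs without a Spark session) and a small made-up table.

```python
import sqlite3

# In-memory table with hypothetical rows; None is stored as NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("First title", "First body"),
        (None, "Body without a title"),
        ("Title without a body", None),
        (None, None),
    ],
)

# COUNT(*) counts rows; COUNT(title) and COUNT(body) count non-NULL values only.
total, titles, bodies = conn.execute(
    "SELECT COUNT(*), COUNT(title), COUNT(body) FROM articles"
).fetchone()

# Counting the missing titles directly avoids relying on the difference of two counts.
missing_title_with_body = conn.execute(
    "SELECT COUNT(*) FROM articles WHERE title IS NULL AND body IS NOT NULL"
).fetchone()[0]
conn.close()

print(total, titles, bodies, missing_title_with_body)
```

In this toy table the two column counts are equal even though one article with a body has no title, which is why a direct `IS NULL` filter is the safer check.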

## Clean Data

In [5]:
# Import the cleaning function and the PySpark helpers used below
from cleaning import clean_doc
from pyspark.sql.functions import udf, split, explode, col, posexplode
from pyspark.sql.types import *

In [6]:
# add the cleaning function as a UDF
clean_udf = udf(clean_doc)

# make a UDF to remove the bracket delimiters; null bodies pass through unchanged
remove_brackets = udf(
    lambda text: text.replace('}"', '').replace('"{"', '') if text is not None else None
)

In [29]:
# apply remove brackets to body column
articles = (articles
  .withColumn('body', remove_brackets('body'))
  )

articles.show(5)

+----------+--------------------+--------------------+------------+----------+-----+------+----------+--------------------+--------------------+
|article_id|               title|            subtitle|      author|      date|claps|images|codeblocks|                link|                body|
+----------+--------------------+--------------------+------------+----------+-----+------+----------+--------------------+--------------------+
|      3406|Iteratively Findi...|             Gear Up|Kemal Tugrul|2018-03-24| null|     0|         0|https://towardsda...|"There is a theor...|
|      5405|Why Get a Data Ex...|                null|Lester Leong|2019-07-02| null|     0|         0|https://towardsda...|"This article is ...|
|      5957|The curse of know...|                null|   Iwo Herka|2019-11-26| null|     0|         0|https://towardsda...|"Why are competen...|
|      6932|Image segmentatio...|                null|Jakub Czakon|      null| null|     0|         0|https://towardsda...|"This a

Use *posexplode* to split each article body into its paragraphs, keeping each paragraph's position within the article.

In [30]:
articles_by_paragraph = (articles
  .select('*', posexplode(split(col('body'), '","')))
  .withColumnRenamed('pos', 'p_index')
  .withColumnRenamed('col', 'paragraph')
  )

articles_by_paragraph.show()

+----------+--------------------+--------+------------+----------+-----+------+----------+--------------------+--------------------+-------+--------------------+
|article_id|               title|subtitle|      author|      date|claps|images|codeblocks|                link|                body|p_index|           paragraph|
+----------+--------------------+--------+------------+----------+-----+------+----------+--------------------+--------------------+-------+--------------------+
|      3406|Iteratively Findi...| Gear Up|Kemal Tugrul|2018-03-24| null|     0|         0|https://towardsda...|"There is a theor...|      0|"There is a theor...|
|      3406|Iteratively Findi...| Gear Up|Kemal Tugrul|2018-03-24| null|     0|         0|https://towardsda...|"There is a theor...|      1|"You may have hea...|
|      3406|Iteratively Findi...| Gear Up|Kemal Tugrul|2018-03-24| null|     0|         0|https://towardsda...|"There is a theor...|      2|"There is no free...|
|      3406|Iteratively Find
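
The combination of `remove_brackets`, `split`, and `posexplode` can be sketched in plain Python. The raw string below is hypothetical: it assumes the body follows the `"{""...""}"` pattern visible in the truncated output earlier in the notebook. `enumerate` plays the role of `posexplode`, yielding one (position, element) pair per paragraph.

```python
# Hypothetical raw body, assumed to follow the pattern seen in the truncated output.
raw_body = '"{""First paragraph."",""Second paragraph.""}"'

# Same replacements as the remove_brackets UDF.
body = raw_body.replace('}"', '').replace('"{"', '')

# split(col('body'), '","') followed by posexplode gives one (pos, col) row per element.
paragraph_rows = list(enumerate(body.split('","')))

for p_index, paragraph in paragraph_rows:
    print(p_index, paragraph)
```

Under this assumed format, stray quote characters remain at the paragraph edges, which matches the leading `"` seen in the `paragraph` column.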

In [12]:
# Clean "body" with udf
clean_df = articles.withColumn("clean_body", clean_udf("body"))

clean_df.show(5)

In [14]:
clean_df.printSchema()

root
 |-- article_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- subtitle: string (nullable = true)
 |-- author: string (nullable = true)
 |-- date: string (nullable = true)
 |-- claps: integer (nullable = true)
 |-- images: integer (nullable = true)
 |-- codeblocks: integer (nullable = true)
 |-- link: string (nullable = true)
 |-- body: string (nullable = true)
 |-- clean_body: string (nullable = true)



## Making the Pipeline for Title Topic Modeling

In [7]:
# import from pyspark machine learning
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, StopWordsRemover, IDF
from pyspark.ml.clustering import LDA

**The pipeline below performs topic modeling on the article titles.**
To reuse it with other documents, name the tokenized input column "text" in the training data.

In [8]:
train = (
    articles
        .select('article_id', 'title')
        .where(col('title').isNotNull())
        .withColumn('title', clean_udf('title'))
        .withColumn('title', split(col('title'), ' '))
        .withColumnRenamed('title', 'text')
    )
train.show(5)

+----------+--------------------+
|article_id|                text|
+----------+--------------------+
|      3406|[iteratively, fin...|
|      5405|   [data, executive]|
|      5957|  [curse, knowledge]|
|      6932|[image, segmentat...|
|      8041|[introduce, model...|
+----------+--------------------+
only showing top 5 rows



In [9]:
tf = CountVectorizer(inputCol='text', outputCol='tf_result')
idf = IDF(inputCol=tf.getOutputCol(), outputCol='features')
lda = LDA(k=20, maxIter=10)
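
To make the first two stages concrete, here is a minimal plain-Python sketch of term counting followed by IDF rescaling. It uses the smoothed formula given in the Spark ML documentation, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The tokenized titles are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized titles; the real input is the `text` column of `train`.
docs = [
    ["data", "science", "python"],
    ["data", "visualization"],
    ["data", "science"],
]

num_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency, following the formula in the Spark ML docs.
idf = {term: math.log((num_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

# Term counts (the CountVectorizer step) rescaled by IDF for the first document.
tf_first = Counter(docs[0])
tfidf_first = {term: count * idf[term] for term, count in tf_first.items()}
print(tfidf_first)
```

A term that occurs in every document, such as `data` here, gets an IDF of 0 and therefore contributes nothing to the `features` column that LDA reads.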

In [10]:
pipe = Pipeline(stages=[tf, idf, lda])

In [11]:
model = pipe.fit(train)

In [14]:
vocab = model.stages[0].vocabulary

In [19]:
topics = model.stages[2].describeTopics().collect()
topic_idx = [topic.termIndices for topic in topics]

In [22]:
topic_idx

[[80, 123, 136, 203, 158, 244, 291, 19, 222, 295],
 [13, 35, 34, 36, 7, 177, 90, 75, 155, 255],
 [23, 37, 33, 88, 1, 4, 225, 27, 171, 0],
 [15, 0, 32, 47, 5, 50, 19, 6, 3, 127],
 [98, 160, 254, 264, 139, 303, 234, 288, 511, 333],
 [46, 81, 3, 10, 196, 0, 111, 181, 2, 5],
 [41, 202, 199, 271, 74, 275, 404, 381, 324, 601],
 [28, 60, 39, 25, 265, 286, 262, 502, 294, 142],
 [45, 138, 261, 251, 115, 410, 423, 230, 484, 458],
 [17, 11, 72, 192, 270, 42, 7, 103, 267, 220],
 [49, 5, 0, 147, 188, 169, 56, 191, 73, 246],
 [1, 8, 14, 4, 9, 2, 21, 11, 18, 20],
 [141, 9, 257, 2, 18, 70, 108, 369, 470, 290],
 [99, 53, 2, 146, 30, 350, 280, 308, 29, 250],
 [10, 58, 38, 105, 327, 273, 263, 302, 85, 399],
 [5, 0, 63, 124, 129, 205, 150, 7, 305, 106],
 [83, 148, 227, 180, 396, 446, 281, 625, 361, 393],
 [79, 112, 92, 190, 100, 237, 64, 226, 219, 110],
 [133, 221, 228, 157, 332, 434, 102, 343, 362, 3],
 [162, 3, 208, 97, 151, 479, 380, 995, 301, 565]]

In [25]:
def show_topics(vocab, topic_indexes, topic_labels=None):
    if not topic_labels:
        topic_labels = ["Topic " + str(i) for i in range(len(topic_indexes))]
    assert len(topic_labels) == len(topic_indexes)

    for label, words in zip(topic_labels, topic_indexes):
        topic_words = ', '.join([vocab[word_idx] for word_idx in words])
        print(label + ': ' + topic_words)
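
For reuse outside the notebook, a variant that returns the formatted lines instead of printing them can be easier to test. `format_topics` below is a hypothetical helper, not part of the original notebook, and the toy vocabulary and indices only mimic the shape of `vocab` and `topic_idx`.

```python
def format_topics(vocab, topic_indexes, topic_labels=None):
    """Return one 'label: word, word, ...' string per topic instead of printing."""
    if not topic_labels:
        topic_labels = [f"Topic {i}" for i in range(len(topic_indexes))]
    if len(topic_labels) != len(topic_indexes):
        raise ValueError("need exactly one label per topic")
    return [
        f"{label}: {', '.join(vocab[idx] for idx in indexes)}"
        for label, indexes in zip(topic_labels, topic_indexes)
    ]


# Hypothetical vocabulary and term indices, shaped like `vocab` and `topic_idx`.
toy_vocab = ["data", "science", "python", "spark"]
toy_topic_idx = [[0, 1], [3, 2]]

lines = format_topics(toy_vocab, toy_topic_idx)
print("\n".join(lines))
```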

In [26]:
show_topics(vocab, topic_idx)

Topic 0: cloud, gradient, mean, automate, logistic, descent, predictive, regression, theory, platform
Topic 1: learn, artificial, intelligence, google, ai, scikit, review, business, kaggle, state
Topic 2: r, visualization, understand, graph, learning, machine, database, predict, detect, data
Topic 3: scientist, data, know, linear, science, need, regression, datum, python, engineering
Topic 4: spark, apache, pyspark, knowledge, pipeline, chart, forest, management, charts, hand
Topic 5: start, approach, python, analysis, probability, data, twitter, exploratory, use, science
Topic 6: step, complete, recommender, different, system, gpu, miss, steps, github, football
Topic 7: detection, object, vs, code, type, matrix, vector, global, anomaly, cnn
Topic 8: covid, game, company, environment, example, tech, trends, development, brief, virtual
Topic 9: networks, neural, base, docker, bayes, nlp, ai, line, topic, coronavirus
Topic 10: new, science, data, solve, right, best, way, technique, probl

## Making the Pipeline for Article Topic Modeling

In [35]:
para_train = (
    articles_by_paragraph
        .select('article_id', 'p_index', 'paragraph') # select unique identifiers
        .where(col('paragraph').isNotNull()) # drop null paragraphs
        .withColumn('paragraph', clean_udf('paragraph')) # clean the text
        .withColumn('paragraph', split(col('paragraph'), ' ')) # split on blank space to tokenize words
        .withColumnRenamed('paragraph', 'text') # rename column to text for pipeline
    )
para_train.show(5)

+----------+-------+--------------------+
|article_id|p_index|                text|
+----------+-------+--------------------+
|      3406|      0|[theorem, tell, s...|
|      3406|      1|[hear, free, lunc...|
|      3406|      2|[free, lunch, met...|
|      3406|      3|[good, idea, expl...|
|      3406|      4|[agile, methodolo...|
+----------+-------+--------------------+
only showing top 5 rows



In [37]:
tf = CountVectorizer(inputCol='text', outputCol='tf_result', minDF=0.05, maxDF=0.6)
idf = IDF(inputCol=tf.getOutputCol(), outputCol='features')
lda = LDA(k=20, maxIter=10)
paragraph_pipe = Pipeline(stages=[tf, idf, lda])
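
When `minDF` and `maxDF` are given as fractions below 1, `CountVectorizer` interprets them as shares of the documents: a term must appear in at least `minDF` and at most `maxDF` of them to enter the vocabulary. The plain-Python sketch below illustrates that filter on made-up paragraphs. The helper name and the thresholds chosen for the toy data are assumptions for illustration only.

```python
from collections import Counter


def filter_vocab_by_doc_freq(docs, min_df, max_df):
    """Keep terms whose document-frequency fraction lies within [min_df, max_df]."""
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / num_docs <= max_df
    )


# Hypothetical tokenized paragraphs:
# 'use' is in 4 of 5, 'model' in 2 of 5, every other term in 1 of 5.
docs = [
    ["use", "model"],
    ["use", "model", "gradient"],
    ["use", "spark"],
    ["use", "docker"],
    ["football"],
]

kept = filter_vocab_by_doc_freq(docs, min_df=0.3, max_df=0.6)
print(kept)
```

With `minDF=0.05` applied to short paragraphs, only words that occur in at least 5% of all paragraphs survive, which may be one reason the paragraph topics further down consist of very general words.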

In [38]:
para_model = paragraph_pipe.fit(para_train)

In [39]:
para_model.save("../models/articles_LDA")

In [42]:
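# Note: `vocabSize` is the Param object for the configured maximum, not the fitted size;
# len(para_model.stages[0].vocabulary) gives the number of terms actually kept.
para_model.stages[0].vocabSize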

Param(parent='CountVectorizer_dd7b58202f7d', name='vocabSize', doc='max size of the vocabulary. Default 1 << 18.')

In [87]:
vocab2 = para_model.stages[0].vocabulary
topics2 = para_model.stages[2].describeTopics().collect()
topic_idx2 = [topic.termIndices for topic in topics2]

In [88]:
show_topics(vocab2, topic_idx2)

Topic 0: create, way, let, data, datum, use, learning, look, like, value
Topic 1: use, work, example, way, need, function, datum, like, model, s
Topic 2: set, example, use, create, let, look, datum, time, data, function
Topic 3: value, work, use, time, data, way, like, example, need, datum
Topic 4: model, example, way, use, s, datum, learning, set, good, let
Topic 5: like, want, use, create, find, datum, code, good, learn, time
Topic 6: function, use, create, set, datum, s, like, model, need, code
Topic 7: learn, use, learning, way, data, model, function, datum, set, want
Topic 8: way, need, find, work, datum, data, use, code, value, set
Topic 9: need, want, set, value, model, use, like, create, datum, function
Topic 10: code, let, use, need, function, work, find, time, look, data
Topic 11: good, example, use, datum, work, model, need, learn, learning, like
Topic 12: look, like, model, use, let, way, learning, want, example, datum
Topic 13: learning, work, data, datum, set, like, want,