# Introduction

In this report, we present our methodology, the pipelines we built, and the results of our streaming text analysis using Spark and MLlib. We employed text processing techniques such as tokenization, HashingTF, IDF, and Word2Vec, among other tools, to convert words into vector representations. Finally, we predicted whether new items arriving in the stream would appear on the front page.

Following the example notebooks, we first fetched data from the stream using the code provided there. We then imported the libraries needed for preprocessing, text processing, and the machine learning models.

In [None]:
sc

In [1]:
import threading

# Helper thread to keep the Spark StreamingContext from blocking Jupyter

class StreamingThread(threading.Thread):
    def __init__(self, ssc):
        super().__init__()
        self.ssc = ssc
    def run(self):
        self.ssc.start()
        self.ssc.awaitTermination()
    def stop(self):
        print('----- Stopping... this may take a few seconds -----')
        self.ssc.stop(stopSparkContext=False, stopGraceFully=True)

In [None]:
spark

In [None]:
spark.stop()

In [2]:
import os
from pyspark.streaming import StreamingContext
from pyspark import SparkContext

In [None]:
ssc = StreamingContext(sc, 120)

In [None]:
lines = ssc.socketTextStream("seppe.net", 7778)

In [None]:
out_dir = f"{os.path.abspath('')}{os.path.sep}saved_stories"
lines.saveAsTextFiles(f"file:///{out_dir}")
print("Saving to", out_dir)

In [None]:
ssc_t = StreamingThread(ssc)
ssc_t.start()

In [None]:
ssc_t.stop()

Additionally, since Spark was using only 2 threads, we changed the master setting to `local[*]` so that it can use as many worker threads as there are logical cores on the machine, which avoids limiting performance.

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("PySparkShell") \
    .getOrCreate()


In [6]:
spark

In [4]:
spark.stop()

The incoming data stream follows a fixed structure, so we defined a schema with the corresponding data type for each field. To read all the folders named “saved_stories-(any number)”, we used the wildcard “-*/” in the input path. Because we used two computers to fetch the data, some stories were saved twice, once by each computer. Therefore, we dropped duplicates with respect to “aid” so that we work with unique stories.
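The effect of dropping duplicates on “aid” can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming each saved line is one JSON object with an `aid` field; the sample records and the helper name `dedupe_by_aid` are made up for illustration. The sketch keeps the first occurrence of each `aid`, whereas Spark's `dropDuplicates` does not promise which of the duplicate rows survives.

```python
import json


def dedupe_by_aid(json_lines):
    """Keep only the first record seen for each 'aid' value."""
    seen = set()
    unique_records = []
    for line in json_lines:
        record = json.loads(line)
        if record["aid"] not in seen:
            seen.add(record["aid"])
            unique_records.append(record)
    return unique_records


# Hypothetical captures from two computers that overlap on aid "2".
capture_a = ['{"aid": "1", "title": "first"}', '{"aid": "2", "title": "second"}']
capture_b = ['{"aid": "2", "title": "second"}', '{"aid": "3", "title": "third"}']

unique = dedupe_by_aid(capture_a + capture_b)
print([record["aid"] for record in unique])
```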

In [7]:
import json
import re

import numpy as np
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from pyspark import SparkContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC, LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import (
    CountVectorizer, HashingTF, IDF, RegexTokenizer,
    StopWordsRemover, Tokenizer, VectorAssembler, Word2Vec,
)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, regexp_replace, trim, udf, when
from pyspark.sql.types import (
    ArrayType, BooleanType, DoubleType, IntegerType,
    StringType, StructField, StructType, TimestampType,
)

In [8]:
schema = StructType([
    StructField("aid", StringType()),
    StructField("title", StringType()),
    StructField("url", StringType()),
    StructField("domain", StringType()),
    StructField("votes", IntegerType()),
    StructField("user", StringType()),
    StructField("posted_at", TimestampType()),
    StructField("comments", IntegerType()),
    StructField("source_title", StringType()),
    StructField("source_text", StringType()),
    StructField("frontpage", BooleanType())
])

input_dir = "/Users/umutkurt/Downloads/spark/notebooks/saved_stories-*/"

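# Note: the schema defined above is not passed to the reader here,
# so Spark infers the column types from the JSON files.
df_ = spark.read.format("JSON").load(input_dir)
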

## Pre-Processing


We wrote a preprocess function and moved the duplicate check described above into it. So that the same word is not treated as different tokens because of capitalization, we converted title, source_title, and source_text to lowercase. Using trim(), we removed leading and trailing whitespace, which carries no explanatory value. Then, we used regexp_replace to replace special characters with an empty string; as Microsoft (n.d.) describes, a regex replace function substitutes every match of the specified pattern.

For the target variable, we converted frontpage to 1 if the story appeared on the front page and 0 if it did not.

Usually the data contains no empty values, but we dropped rows with missing values just in case. Then we dropped “aid” and “url”, since they do not provide any additional information for the models.
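To make the three text transformations concrete, here is a small plain-Python sketch that mirrors them on a single string. It is only an approximation of the Spark column functions: the input string and the helper name `clean_text` are invented for illustration, and Python's `strip()` removes all surrounding whitespace rather than only spaces.

```python
import re


def clean_text(text):
    """Lowercase, trim, then drop every character outside the allowed set."""
    lowered = text.lower()
    trimmed = lowered.strip()
    # Same character class as in the preprocess function: letters, digits,
    # a few punctuation marks, and spaces are kept; everything else is removed.
    return re.sub(r"[^a-zA-Z0-9,.!? ]", "", trimmed)


example = "  Show HN: Cronex.nvim - a Neovim plugin!  "
cleaned = clean_text(example)
print(repr(cleaned))
```

Note that removing a character such as the hyphen leaves the two spaces around it in place, so cleaned text can contain consecutive spaces.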


In [9]:
def preprocess(df):

    #Checking for duplicates
    if "aid" in df.columns:
        df_unique = df.dropDuplicates(["aid"])
    else:
        df_unique = df

    # Applying lower(), trim() and regexp_replace() to the text columns
    print("Applying column transformation")
    
    text_cols = ["source_text", "source_title", "title"]
    for col_name in text_cols:
        if col_name in df_unique.columns:
            df_unique = df_unique.withColumn(col_name, lower(col(col_name)))
            df_unique = df_unique.withColumn(col_name, trim(col(col_name)))
            df_unique = df_unique.withColumn(col_name, regexp_replace(col(col_name), "[^a-zA-Z0-9,.!? ]", ""))

    #Converting frontpage into binary
    if 'frontpage' in df_unique.columns:
        df_unique = df_unique.withColumn('frontpage', when(col('frontpage') == True, 1).otherwise(0))

    #Dropping missing values
    df_cleaned = df_unique.dropna(how='any')  

    #Dropping "aid" and "url" since they do not provide any additional information
    if "aid" in df_cleaned.columns and "url" in df_cleaned.columns:
        df_cleaned = df_cleaned.drop("aid", "url")

    #Checking the length
    print("Final count:", df_cleaned.count()) 
    return df_cleaned

df_raw = preprocess(df_)
df_raw.show()

    

Applying column transformation
Final count: 4484

+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+------------+-----+
|comments|              domain|frontpage|          posted_at|         source_text|        source_title|               title|        user|votes|
+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+------------+-----+
|       0|       wikipedia.org|        0|2024-04-15 07:03:05|earthquake early ...|earthquake early ...|earthquake early ...| thunderbong|    1|
|       0|   frontendshape.com|        0|2024-04-15 07:39:09|how to add dragan...|how to add dragan...|how to add dragan...|      Web250|    1|
|       0|indiantinker.bear...|        0|2024-04-15 07:42:09|food proposed ame...|food proposed ame...|one knuckle rice ...|indiantinker|    1|
|       0|github.com/fabrid...|        0|2024-04-15 07:59:02|github  fabridami...|github  fabridami...|cronex.nvim a neo...|       fbrdm

                                                                                

## Creating Pipelines and Converting to Vector Representation

Since the usual train_test_split is not available here, we did some research and used randomSplit instead (Stack Overflow, n.d.).

After splitting our data into train and test sets, we applied several techniques to convert text to vectors. While doing so, we built pipelines that can later be applied to new data. Instead of the default tokenizer, we used a regex tokenizer after some research. Cecchini (n.d.) states that a regex tokenizer extends the Tokenizer with regular expressions and helps to preserve important details. Since the data contains words that carry no meaning, we removed them from source_text at the start of the pipeline. We added user-defined stop words to the default stop words to make the removal more effective. With RegexTokenizer and StopWordsRemover, the first pipeline is complete.

Pipeline 1 --> RegexTokenizer + StopWordsRemover
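As a rough illustration of what the first pipeline does to one document, the sketch below splits a string on non-word characters and then filters stop words in plain Python. It is an approximation under stated assumptions: the sentence, the small stop-word set, and the helper name `tokenize_and_filter` are made up, and `re.ASCII` is used so that `\W` behaves like an ASCII-only pattern.

```python
import re


def tokenize_and_filter(text, stop_words):
    """Split on single non-word characters, then remove stop words."""
    # Treat the pattern as a separator (gap) between tokens.
    pieces = re.split(r"\W", text.lower(), flags=re.ASCII)
    # Adjacent separators leave empty strings behind; drop them and the stop words.
    return [piece for piece in pieces if piece and piece not in stop_words]


stop_words = {"the", "on", "and", "because"}
tokens = tokenize_and_filter("the model, trained on spark.", stop_words)
print(tokens)
```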

We split the process into two pipelines because we wanted to add lemmatization or stemming between them. However, we could not find a working package for lemmatization or stemming: the one package we found caused a Java package error when we tried to use it. Lemmatization can be more effective than stemming, but we could not find a lemmatization UDF based on NLTK or any other tool. Therefore, after some research, we used the Snowball Stemmer as our stemming function. As Sharma et al. (2021) demonstrated, the Snowball Stemmer is well known for handling slang effectively, which suits the source_text values in our case. After a lot of searching on the web, we put together a UDF for the Snowball Stemmer.

After the first pipeline and the stemmer, we convert the words to a vector representation using TF-IDF. Dhamija (2019) states that both CountVectorizer and HashingTF can be used for the term-frequency step and that each has advantages and disadvantages. We chose HashingTF because it is less computationally expensive, and because the risk of hash collisions decreases when a higher number of features is used. To down-weight terms that appear in many documents and give more weight to rare terms, we additionally applied IDF. Therefore, the second pipeline includes HashingTF and IDF. Finally, using a VectorAssembler, we added votes to the feature vector as an additional explanatory variable.
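The IDF weighting can be worked through by hand. The sketch below uses the smoothed formula documented for Spark MLlib, log((m + 1) / (df + 1)) for m documents and document frequency df, and sets the weight to 0 for terms below the minimum document frequency. The function name `spark_style_idf` and the toy numbers are assumptions made for illustration.

```python
import math


def spark_style_idf(doc_freq, num_docs, min_doc_freq=0):
    """Smoothed IDF: log((num_docs + 1) / (doc_freq + 1)), or 0 below min_doc_freq."""
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))


num_docs = 4
idf_rare = spark_style_idf(1, num_docs)      # term found in 1 of 4 documents
idf_common = spark_style_idf(4, num_docs)    # term found in every document
idf_filtered = spark_style_idf(1, num_docs, min_doc_freq=5)

# TF-IDF is the term frequency multiplied by the IDF weight.
tfidf_rare = 2 * idf_rare
print(idf_rare, idf_common, idf_filtered, tfidf_rare)
```

A term that occurs in every document receives a weight of 0, while the rarer term keeps a positive weight.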

Pipeline 2 --> HashingTF + IDF

The overall pipeline to be applied to new data is:

Pipeline 1 + Stemmer + Pipeline 2





In [10]:
train_data, test_data = df_raw.randomSplit([0.8, 0.2])

After creating the pipelines described above, we call ".fit" on the training data to fit each pipeline, and ".transform" to apply the fitted pipeline to the training and test data.

In [None]:
# Applying StopWordsRemover
default_stop_words = StopWordsRemover.loadDefaultStopWords("english")

custom_stop_words = ["and", "or", "but", "so", "because"] 
all_stop_words = default_stop_words + custom_stop_words

#Defining Stemmer
stemmer = SnowballStemmer(language='english')
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))

#First Pipeline --> RegexTokenizer + StopWordsRemover
tokenizer_ = RegexTokenizer(inputCol="source_text", outputCol="tokenized_words", pattern="\\W")
remover_ = StopWordsRemover(inputCol="tokenized_words", outputCol="words").setStopWords(all_stop_words)


pipeline1 = Pipeline(stages=[tokenizer_, remover_])
model1 = pipeline1.fit(train_data)
train_data_ = model1.transform(train_data)
test_data_ = model1.transform(test_data)

df_stemmed_train = train_data_.withColumn("stem_words", stemmer_udf(col("words")))
df_stemmed_test = test_data_.withColumn("stem_words", stemmer_udf(col("words")))

#Second Pipeline --> HashingTF + IDF + VectorAssembler
# numFeatures is set high to reduce the possibility of hash collisions
hashingTF = HashingTF(inputCol="stem_words", outputCol="featurevec", numFeatures=8192)
idf = IDF(inputCol="featurevec", outputCol="features",minDocFreq=5)
assembler = VectorAssembler(inputCols=["features", "votes"], outputCol="featuresfinal")

pipeline = Pipeline(stages=[hashingTF, idf, assembler])
model = pipeline.fit(df_stemmed_train)
train_data_f = model.transform(df_stemmed_train)
test_data_f = model.transform(df_stemmed_test)

When applying HashingTF, one should be careful with the numFeatures hyperparameter. To avoid collisions, this hyperparameter should be high, and according to the sources we consulted, it should be a power of two (2^n). If it is too low, different terms are more likely to collide in the same index. On the other hand, if it is very high, training Random Forest or any other machine learning model on the resulting vectors becomes slower.
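The trade-off can be seen in a small experiment with the hashing trick. The following sketch is illustrative only: it uses CRC32 as a stand-in hash function (Spark's HashingTF uses MurmurHash3), and the vocabulary and helper names are made up.

```python
import zlib


def bucket(term, num_features):
    """Map a term to a vector index by hashing it and taking the remainder."""
    # Stand-in hash: CRC32, not the MurmurHash3 used by Spark.
    return zlib.crc32(term.encode("utf-8")) % num_features


def count_collisions(vocabulary, num_features):
    """Number of terms that share an index with some other term seen earlier."""
    used_indices = {bucket(term, num_features) for term in vocabulary}
    return len(vocabulary) - len(used_indices)


vocabulary = [f"word{i}" for i in range(1000)]
collisions_small = count_collisions(vocabulary, 2 ** 6)
collisions_large = count_collisions(vocabulary, 2 ** 13)
print(collisions_small, collisions_large)
```

With only 64 indices, at least 936 of the 1000 terms must share an index with another term. Raising the size to 8192 can only reduce that count, at the cost of a longer feature vector.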

In [None]:
train_data_f.show(3)

## Machine Learning Models and Accuracy

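We trained RandomForest, SVM (LinearSVC), and LogisticRegression models. We refer to the evaluation score as accuracy throughout; note that BinaryClassificationEvaluator, as used below with its default settings, computes the area under the ROC curve (areaUnderROC) rather than classification accuracy.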
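Classification accuracy and the area under the ROC curve can differ considerably on imbalanced data, and `BinaryClassificationEvaluator` uses `areaUnderROC` as its default `metricName`. The following plain-Python sketch puts the two side by side. It is illustrative only: the labels and scores are made up, and the pairwise-ranking formula is a standard way to compute the area under the ROC curve, not Spark's own implementation.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of examples whose thresholded score matches the label."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


def area_under_roc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # Count 1 for each correctly ordered pair and 0.5 for each tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced sample: 8 negatives and 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.25, 0.35, 0.12, 0.4, 0.28]

acc = accuracy(labels, scores)
auc = area_under_roc(labels, scores)
print(acc, auc)
```

Here no score reaches the 0.5 threshold, so every example is predicted negative and accuracy is 0.8, while the ranking-based score is 0.875. The two numbers answer different questions and should not be read interchangeably.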

In [None]:
rf = RandomForestClassifier(labelCol="frontpage", featuresCol="featuresfinal")
model_rf = rf.fit(train_data_f)
predictions_rf = model_rf.transform(test_data_f)
# BinaryClassificationEvaluator defaults to metricName="areaUnderROC"
evaluator_rf = BinaryClassificationEvaluator(labelCol="frontpage")
accuracy_rf = evaluator_rf.evaluate(predictions_rf)
print("Random Forest score (area under ROC):", accuracy_rf)

In [None]:
svm = LinearSVC(featuresCol="featuresfinal", labelCol="frontpage")
model_svm = svm.fit(train_data_f)
predictions_svm = model_svm.transform(test_data_f)
# BinaryClassificationEvaluator defaults to metricName="areaUnderROC"
evaluator_svm = BinaryClassificationEvaluator(labelCol="frontpage")
accuracy_svm = evaluator_svm.evaluate(predictions_svm)
print("SVM score (area under ROC):", accuracy_svm)

In [None]:
log_reg = LogisticRegression(labelCol = "frontpage", featuresCol = "featuresfinal")
model_log = log_reg.fit(train_data_f)
predictions_log = model_log.transform(test_data_f)
# BinaryClassificationEvaluator defaults to metricName="areaUnderROC"
evaluator_log = BinaryClassificationEvaluator(labelCol="frontpage")
accuracy_log = evaluator_log.evaluate(predictions_log)
print("Logistic Regression score (area under ROC):", accuracy_log)

For this particular example, since RandomForest gave the best score, we can save it and use it for new data that comes into the stream. Better results might be obtained by increasing numFeatures, but our computer could not handle a larger value, so our results may not be optimal. In case the cell outputs are cleared, the scores we obtained were 92% for Random Forest and 74% for Logistic Regression.

After training the machine learning models and checking the score of each, we can save the fitted pipelines and the selected model to a folder.


In [None]:
#Predicting new data 
#Saving model and pipelines
model.write().save("/Users/umutkurt/Downloads/spark/pipe&model/pipeline")
model1.write().save("/Users/umutkurt/Downloads/spark/pipe&model/pipeline1")
model_rf.write().save("/Users/umutkurt/Downloads/spark/pipe&model/randomforestmodel")
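To score new data later, the saved stages have to be loaded and applied in the same order as during training. The function below is a hedged sketch of how that could look, not the code we ran: the name `predict_new_batch` and its parameters are hypothetical, `base_dir` stands for the folder used when saving, and `preprocess` and `stemmer_udf` are assumed to be the objects defined earlier in this notebook.

```python
def predict_new_batch(df_new, preprocess, stemmer_udf, base_dir):
    """Apply the saved pipelines and the saved Random Forest to a new DataFrame."""
    # Imported inside the function so the sketch can be defined
    # before a Spark environment is available.
    from pyspark.ml import PipelineModel
    from pyspark.ml.classification import RandomForestClassificationModel
    from pyspark.sql.functions import col

    # Load the fitted stages from the folder names used when saving.
    model1 = PipelineModel.load(f"{base_dir}/pipeline1")  # RegexTokenizer + StopWordsRemover
    model = PipelineModel.load(f"{base_dir}/pipeline")    # HashingTF + IDF + VectorAssembler
    model_rf = RandomForestClassificationModel.load(f"{base_dir}/randomforestmodel")

    # Same order as in training: Pipeline 1 + Stemmer + Pipeline 2, then the classifier.
    cleaned = preprocess(df_new)
    tokenized = model1.transform(cleaned)
    stemmed = tokenized.withColumn("stem_words", stemmer_udf(col("words")))
    features = model.transform(stemmed)
    return model_rf.transform(features)
```

The returned DataFrame would contain a `prediction` column next to the original columns.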

## Applying Word2Vec

Another option is to apply Word2Vec in the text analysis. However, when we use it in our pipeline with a high vectorSize, the machine we are using crashes. We got better results with Word2Vec; however, it is safer to use HashingTF + IDF for the word-to-vector representation.
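Spark's Word2Vec model turns each document into a single vector by averaging the vectors of its words. The sketch below illustrates that averaging step in plain Python. The tiny two-dimensional embeddings and the helper name `document_vector` are invented for illustration, and dividing by the total number of tokens (including words without a vector) is our assumption about how the average is taken.

```python
def document_vector(tokens, embeddings, vector_size):
    """Average the word vectors of a document into one fixed-length vector."""
    total = [0.0] * vector_size
    if not tokens:
        return total
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # words without a learned vector contribute nothing
        total = [t + v for t, v in zip(total, vector)]
    # Assumption: divide by the number of tokens in the document.
    return [t / len(tokens) for t in total]


# Hypothetical learned vectors for two words.
embeddings = {"spark": [1.0, 0.0], "stream": [0.0, 1.0]}
doc = ["spark", "stream", "spark", "unknown"]

doc_vec = document_vector(doc, embeddings, vector_size=2)
print(doc_vec)
```

Whatever the document length, the result has vectorSize entries, which is why the feature vector here is much shorter than the 8192-dimensional HashingTF output.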

In [None]:
# Using Word2Vec

default_stop_words = StopWordsRemover.loadDefaultStopWords("english")

custom_stop_words = ["and", "or", "but", "so", "because"] 
all_stop_words = default_stop_words + custom_stop_words


stemmer = SnowballStemmer(language='english')
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))

tokenizer_w2v = RegexTokenizer(inputCol="source_text",outputCol="tokenwords", pattern="\\W")
remover_w2v = StopWordsRemover(inputCol="tokenwords", outputCol="words").setStopWords(all_stop_words)

pipeline1 = Pipeline(stages=[tokenizer_w2v, remover_w2v])
model1 = pipeline1.fit(train_data)
train_data_ = model1.transform(train_data)
test_data_ = model1.transform(test_data)

df_stemmed_train = train_data_.withColumn("stem_words", stemmer_udf(col("words")))
df_stemmed_test = test_data_.withColumn("stem_words", stemmer_udf(col("words")))

w2v = Word2Vec(inputCol="stem_words", outputCol="features", vectorSize=100, minCount=5)
assembler = VectorAssembler(inputCols=["features", "votes"], outputCol="combined_features")

doc2vec_pipeline = Pipeline(stages=[w2v, assembler])
doc2vec_model = doc2vec_pipeline.fit(df_stemmed_train)
train_data_w2v = doc2vec_model.transform(df_stemmed_train)
test_data_w2v = doc2vec_model.transform(df_stemmed_test)

In [None]:
rf = RandomForestClassifier(labelCol="frontpage", featuresCol="combined_features")
model_rf = rf.fit(train_data_w2v)
predictions_rf = model_rf.transform(test_data_w2v)
# BinaryClassificationEvaluator defaults to metricName="areaUnderROC"
evaluator_rf = BinaryClassificationEvaluator(labelCol="frontpage")
accuracy_rf = evaluator_rf.evaluate(predictions_rf)
print("Random Forest score (area under ROC):", accuracy_rf)  # 98%

In [None]:
log_reg = LogisticRegression(labelCol = "frontpage", featuresCol = "combined_features")
model_log = log_reg.fit(train_data_w2v)
predictions_log = model_log.transform(test_data_w2v)
# BinaryClassificationEvaluator defaults to metricName="areaUnderROC"
evaluator_log = BinaryClassificationEvaluator(labelCol="frontpage")
accuracy_log = evaluator_log.evaluate(predictions_log)
print("Logistic Regression score (area under ROC):", accuracy_log)  # 98%

# References

Cecchini, D. (n.d.). Unleashing the Power of Text Tokenization with Spark NLP. John Snow Labs. Retrieved from https://www.johnsnowlabs.com/unleashing-the-power-of-text-tokenization-with-spark-nlp/

Dhamija, V. (2019). CountVectorizer & HashingTF. Medium. Retrieved from https://towardsdatascience.com/countvectorizer-hashingtf-e66f169e2d4e

GeeksforGeeks. (2022). Snowball stemmer NLP. Retrieved from https://www.geeksforgeeks.org/snowball-stemmer-nlp/

Microsoft. (n.d.). REGEXREPLACE function. Microsoft Support. Retrieved from https://support.microsoft.com/en-gb/office/regexreplace-function-9c030bb2-5e47-4efc-bad5-4582d7100897

Sharma, V., Srivastava, S., Valarmathi, B., & Srinivasa Gupta, N. (2021). A Comparative Study on the Performance of Deep Learning Algorithms for Detecting the Sentiments Expressed in Modern Slangs. In V. Bindhu, J.M.R.S. Tavares, A.A.A. Boulogeorgos, & C. Vuppalapati (Eds.), International Conference on Communication, Computing and Electronics Systems (Lecture Notes in Electrical Engineering, vol 733). Springer, Singapore. https://doi.org/10.1007/978-981-33-4909-4_33

Stack Overflow. (n.d.). Efficient text preprocessing using PySpark: Clean, tokenize, stopwords, stemming. Retrieved from https://stackoverflow.com/questions/53579444/efficient-text-preprocessing-using-pyspark-clean-tokenize-stopwords-stemming

Stack Overflow. (n.d.). Is there any train test split in PySpark or MLlib? Retrieved from https://stackoverflow.com/questions/69071201/is-there-any-train-test-split-in-pyspark-or-mllib

Stack Overflow. (n.d.). What is the relation between numFeatures in HashingTF in Spark MLlib and actual? Retrieved from https://stackoverflow.com/questions/44966444/what-is-the-relation-between-numfeatures-in-hashingtf-in-spark-mllib-and-actual
