# 2.0 Data Preprocessing 

Authors: Yeap Jie Shen, Siew Ho Yan, Gan Yee Jing

Last Edited: 31/08/2024

## 2.2 Preprocessing Data

### 2.2.1 Importing Necessary Libraries and Instantiating the Spark Session

Observed behaviour:
1) Once a UDF is registered, it cannot be registered again or overridden within the same Spark session.
2) The dependency folders (nltk, regex) must be placed in the same directory as this .ipynb file; otherwise a "module not found" exception is raised.
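
The second observation is consistent with how Python resolves imports: it searches the directories listed in `sys.path`, which for a notebook normally includes the notebook's own folder. The following is a minimal sketch of that mechanism, not the project's setup; the temporary directory and the module name `demo_dependency` are hypothetical.

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway directory holding a hypothetical module (demo_dependency).
dep_dir = tempfile.mkdtemp()
with open(os.path.join(dep_dir, "demo_dependency.py"), "w") as f:
    f.write("VALUE = 42\n")

# Before the directory is on sys.path, the import fails.
try:
    importlib.import_module("demo_dependency")
    found_before = True
except ModuleNotFoundError:
    found_before = False

# After appending the directory, the same import succeeds.
sys.path.append(dep_dir)
importlib.invalidate_caches()
demo_dependency = importlib.import_module("demo_dependency")

print(found_before, demo_dependency.VALUE)
```

This is the same mechanism the notebook relies on when it calls `sys.path.append` before importing from `data_stores`.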

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover
from pyspark.sql.functions import lower, regexp_replace, regexp_extract, udf, col
from pyspark.sql.types import ArrayType, StringType, FloatType

import nltk
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer

import pickle
import csv

import sys

sys.path.append(r'/home/student/RDS2S3G4_CLO2_B')

from data_stores.hbaseClient import HBaseClient
from data_stores.mongodbClient import MongoDBClient
from data_stores.redisClient import RedisClient

spark = SparkSession.builder.appName("Data Preprocessing").getOrCreate()

nltk.download('words')
nltk.download('wordnet')
nltk.download('punkt')

24/09/01 16:55:49 WARN Utils: Your hostname, Gan. resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
24/09/01 16:55:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/01 16:55:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[nltk_data] Downloading package words to /home/student/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /home/student/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/student/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 2.2.2 Reading Data From Redis or HBase

If the records are cached in Redis, they are read from the cache.
Otherwise, they are read from HBase and then cached in Redis for 5 minutes.

Reading from Redis is faster because it is an in-memory data store.
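
This read path is commonly called the cache-aside pattern. The sketch below illustrates it in plain Python under stated assumptions: `TTLCache` is a hypothetical in-memory stand-in for Redis, and `slow_store_read` is a hypothetical stand-in for the HBase read. Neither is part of the project's client classes.

```python
import pickle
import time


class TTLCache:
    """Tiny in-memory stand-in for a Redis-like cache with expiry (hypothetical)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, seconds):
        # Store the value together with the time at which it expires.
        self._store[key] = (value, time.monotonic() + seconds)

    def get(self, key):
        # Return the cached bytes, or None if the key is missing or expired.
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]
            return None
        return value


def slow_store_read():
    # Stand-in for the slower primary store (HBase in this notebook).
    slow_store_read.calls += 1
    return [{"key": "k0", "headline": "example headline"}]


slow_store_read.calls = 0


def read_records(cache, cache_key="record_list_with_key", ttl_seconds=5 * 60):
    cached = cache.get(cache_key)
    if cached is not None:
        return pickle.loads(cached)      # cache hit
    records = slow_store_read()          # cache miss: read the primary store
    cache.set(cache_key, pickle.dumps(records), ttl_seconds)
    return records


cache = TTLCache()
first = read_records(cache)   # miss, reads the primary store
second = read_records(cache)  # hit, served from the cache
print(first == second, slow_store_read.calls)
```

The expiry keeps the cache from serving stale records indefinitely, at the cost of one slow read whenever the entry has lapsed.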

In [2]:
# Instantiating HBase client
hbase_client = HBaseClient(host = 'localhost', port = 9090)

# Instantiating Redis client
redis_client = RedisClient(host = 'localhost', port = 6379, db = 0, start_now = True)

if redis_client.exists_key('record_list_with_key'):
    record_list = [
        {
            'key': record[0],
            'content': record[1]['cf1:content'],
            'headline': record[1]['cf1:headline'],
            'url': record[1]['cf1:url'],
            'author': record[1]['cf1:author'],
            'datetime': record[1]['cf1:datetime'],
            'publisher': record[1]['cf1:publisher']
        }
        for record in pickle.loads(redis_client.get_value('record_list_with_key'))
    ]
else:
    record_list = [
        {
            'key': record[0].decode('utf-8'),
            'content': record[1][b'cf1:content'].decode('utf-8'),
            'headline': record[1][b'cf1:headline'].decode('utf-8'),
            'url': record[1][b'cf1:url'].decode('utf-8'),
            'author': record[1][b'cf1:author'].decode('utf-8'),
            'datetime': record[1][b'cf1:datetime'].decode('utf-8'),
            'publisher': record[1][b'cf1:publisher'].decode('utf-8')
        }
        for record in hbase_client.read_keys('news', ['k' + str(i) for i in range(6690)], [
            'cf1:content', 'cf1:headline', 'cf1:url', 'cf1:author', 'cf1:datetime', 'cf1:publisher'
        ])
    ]
    redis_client.set_key_value('record_list_with_key', pickle.dumps(record_list), seconds = 5 * 60)


### 2.2.3 Preparing Data for Preprocessing

Simple random sampling is employed (fraction 0.75, without replacement); approximately 5,000 records are extracted.
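
Sampling by fraction keeps each record independently with the given probability, so the sample size is only approximately `fraction * n`. The sketch below imitates this with Python's `random` module, and also shows why keys are sorted by their numeric part. It is an illustration only: `bernoulli_sample` and `numeric_key` are hypothetical helpers, and Python's generator differs from Spark's, so the selected keys will not match the notebook's.

```python
import random
import re


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


def numeric_key(key):
    # Extract the digits from keys such as 'k10' so that k2 sorts before k10.
    return int(re.search(r"(\d+)", key).group(1))


keys = ["k" + str(i) for i in range(6690)]
sampled = bernoulli_sample(keys, fraction=0.75, seed=5)
sampled_sorted = sorted(sampled, key=numeric_key)

print(len(sampled))          # close to 0.75 * 6690, but not exactly that
print(sampled_sorted[:5])
print(sorted(["k10", "k2"]), sorted(["k10", "k2"], key=numeric_key))
```

A plain string sort would place `k10` before `k2`, which is why the notebook orders by the extracted integer instead.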

In [3]:
df = spark.createDataFrame(record_list)

# Perform sampling to reduce the number of records in the dataset
df_sampled_full = df.sample(withReplacement = False, fraction = 0.75, seed = 5)

# Sort by the numeric part of 'key' so that, for example, k2 comes before k10
df_sampled_full = df_sampled_full.withColumn('int_key', regexp_extract('key', r'(\d+)', 1).cast('int')).orderBy('int_key').drop('int_key')

# Select 'headline', 'content' and 'key' only
df_sampled = df_sampled_full.select('key', 'headline', 'content')

df_sampled.show()

24/09/01 16:55:55 WARN TaskSetManager: Stage 0 contains a task of very large size (1002 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

+---+--------------------+--------------------+
|key|            headline|             content|
+---+--------------------+--------------------+
| k0|Five charged with...|PORT DICKSON, Jul...|
| k1|Businessman acqui...|KUALA LUMPUR, Jul...|
| k3|Ex-MARii CEO gets...|KUALA LUMPUR, Jul...|
| k4|Nur Farah Kartini...|PEKAN, July 18 — ...|
| k5|Police seize RM9 ...|KUALA LUMPUR, Jul...|
| k6|Moroccan pleads g...|KUALA TERENGGANU,...|
| k8|Murder: Ex-securi...|JOHOR BAHRU, July...|
| k9|Federal Court uph...|JOHOR BAHRU, July...|
|k10|Father convicted ...|PUTRAJAYA, July 1...|
|k11|Consultant charge...|TAIPING, July 16 ...|
|k12|Cops arrest eight...|PUTRAJAYA, July 1...|
|k14|Cops nab 18 for a...|KUALA LUMPUR, Jul...|
|k15|Duo charged with ...|KANGAR, July 12 —...|
|k16|Police detain 39 ...|KUALA LUMPUR, Jul...|
|k18|Cops on lookout f...|KUALA LUMPUR, Jul...|
|k19|Man’s death sente...|PUTRAJAYA, July 4...|
|k20|Former company ow...|KUALA LUMPUR, Jul...|
|k21|Unemployed man ja...|KUALA LUMPUR, 

### 2.2.4 Lowercasing, Removing Punctuation, Numbers, Emojis, and non-ASCII characters

In [4]:
# Lowercasing
df_sampled = df_sampled.select('key', lower('content').alias('content'), lower('headline').alias('headline'))

# Remove punctuation
df_sampled = (
    df_sampled
    .withColumn('content', regexp_replace('content', r'[^\d\w\s]+',''))
    .withColumn('headline', regexp_replace('headline', r'[^\d\w\s]+',''))
)

# Remove number
df_sampled = (
    df_sampled
    .withColumn('content', regexp_replace('content', r'\d+', ''))
    .withColumn('headline', regexp_replace('headline', r'\d+', ''))
)

# Remove emoji
df_sampled = (
    df_sampled
    .withColumn('content', regexp_replace('content', '[\U0001F600-\U0001F64F]', ''))
    .withColumn('headline', regexp_replace('headline', '[\U0001F600-\U0001F64F]', ''))
)

# Remove non-ASCII characters
df_sampled = (
    df_sampled
    .withColumn('content', regexp_replace('content', r'[^\x00-\x7F]+', ''))
    .withColumn('headline', regexp_replace('headline', r'[^\x00-\x7F]+', ''))
)

# Collapse runs of whitespace into a single space
df_sampled = (
    df_sampled
    .withColumn('content', regexp_replace('content', r'\s+',' '))
    .withColumn('headline', regexp_replace('headline', r'\s+',' '))
)
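
The cleaning steps above can be tried on a single string with Python's `re` module. This is a rough equivalent for experimentation, not the Spark implementation: `clean_text` is a hypothetical helper, and Python's `\w` is Unicode-aware while Spark uses Java regular expressions, so intermediate results may differ even where the final output agrees.

```python
import re


def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^\d\w\s]+", "", text)              # punctuation and symbols
    text = re.sub(r"\d+", "", text)                      # digits
    text = re.sub("[\U0001F600-\U0001F64F]", "", text)   # emoticon block
    text = re.sub(r"[^\x00-\x7F]+", "", text)            # remaining non-ASCII
    text = re.sub(r"\s+", " ", text)                     # collapse whitespace
    return text


print(clean_text("Police seize RM9 million in café raid!"))
# prints: police seize rm million in caf raid
```

Note that stripping non-ASCII characters truncates words such as "café" to "caf"; such fragments are later discarded by the English-word filter.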

### 2.2.5 Tokenization

In [5]:
# Tokenization
content_tokenizer = Tokenizer(outputCol = 'content_tokens', inputCol = 'content')
df_tokenized = content_tokenizer.transform(df_sampled)

headline_tokenizer = Tokenizer(outputCol = 'headline_tokens', inputCol = 'headline')
df_tokenized = headline_tokenizer.transform(df_tokenized)

# Alternative: the NLTK tokenizer gives similar results on this cleaned text
# from nltk.tokenize import word_tokenize
# @udf(returnType = ArrayType(StringType()))
# def tokenize_text(text):
#     return word_tokenize(text)
# df_tokenized = df_sampled.withColumn('content_tokens', tokenize_text('content'))
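
Spark's `Tokenizer` lowercases its input and splits it on whitespace. A plain-Python approximation is sketched below; `tokenize` is a hypothetical helper, and `str.split()` discards empty strings, so edge cases involving leading or repeated whitespace may differ from Spark.

```python
def tokenize(text):
    # Lowercase, then split on runs of whitespace.
    return text.lower().split()


print(tokenize("police seize rm million in caf raid"))
```

Because the previous step already collapsed whitespace, each space-separated word becomes one token.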

### 2.2.6 Removing non-English Words

In [6]:
# Remove non-english words
english_words = set(words.words())

# Define a UDF that keeps only tokens found in the English corpus
# (the parameter is named 'tokens' to avoid shadowing the imported nltk 'words' module)
@udf(returnType = ArrayType(StringType()))
def filter_non_english_words(tokens):
    return [word for word in tokens if word.lower() in english_words]

# Apply the UDF to filter words
df_tokenized = (
    df_tokenized
    .withColumn('content_tokens', filter_non_english_words('content_tokens'))
    .withColumn('headline_tokens', filter_non_english_words('headline_tokens'))
)

# Materialise the result now (cached for the later steps), then free the word set on the driver
df_tokenized = df_tokenized.cache()
df_tokenized.count()
del english_words
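
The filter itself is a set-membership test per token. The sketch below shows the idea with a hypothetical five-word vocabulary in place of the NLTK `words` corpus; `keep_english` and `english_vocab` are illustrative names only.

```python
# Hypothetical miniature vocabulary; the notebook uses nltk.corpus.words instead.
english_vocab = {"police", "seize", "million", "raid", "court"}


def keep_english(tokens, vocabulary):
    # Set membership is an average O(1) lookup, so this is linear in len(tokens).
    return [token for token in tokens if token.lower() in vocabulary]


tokens = ["police", "seize", "rm", "million", "caf", "raid"]
print(keep_english(tokens, english_vocab))
# prints: ['police', 'seize', 'million', 'raid']
```

A dictionary filter of this kind also removes any legitimate word that is missing from the word list, such as place names, so some information loss is expected.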


### 2.2.7 Removing Stopwords

In [7]:
# Remove stopword
content_stopword_remover = StopWordsRemover(inputCol= 'content_tokens', outputCol = 'cleaned_content_tokens')
headline_stopword_remover = StopWordsRemover(inputCol= 'headline_tokens', outputCol = 'cleaned_headline_tokens')

# Transform existing dataframe with the StopWordsRemover
df_cleaned = (
    headline_stopword_remover
    .transform(content_stopword_remover.transform(df_tokenized))
    .select('key', 'cleaned_content_tokens', 'cleaned_headline_tokens')
)
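
Stop-word removal works the same way as the previous filter, but with the test inverted. A minimal sketch follows, assuming a hypothetical six-word stop list; Spark's default English list is much longer, and `StopWordsRemover` matches case-insensitively by default.

```python
# Hypothetical short stop-word list; Spark ships a much longer English default.
stop_words = {"the", "in", "of", "a", "was", "with"}


def remove_stop_words(tokens, stop_words):
    # Compare in lowercase to mimic case-insensitive matching.
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["five", "charged", "with", "causing", "death"], stop_words))
# prints: ['five', 'charged', 'causing', 'death']
```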

### 2.2.8 Lemmatization

In [8]:
lemmatizer_broadcast = spark.sparkContext.broadcast(WordNetLemmatizer())

@udf(returnType = ArrayType(StringType()))
def lemmatize_words(words):
    lemmatizer = lemmatizer_broadcast.value
    return [lemmatizer.lemmatize(word) for word in words]

# Apply the UDF to lemmatize words
df_lemmatized = (
    df_cleaned
    .withColumn('cleaned_content_tokens', lemmatize_words('cleaned_content_tokens'))
    .withColumn('cleaned_headline_tokens', lemmatize_words('cleaned_headline_tokens'))
)

In [9]:
df_lemmatized.head()


Row(key='k0', cleaned_content_tokens=['port', 'five', 'people', 'court', 'today', 'causing', 'death', 'friend', 'swimming', 'pool', 'last', 'week', 'accused', 'd', 'b', 'r', 'guilty', 'charge', 'read', 'magistrate', 'jointly', 'accused', 'causing', 'death', 'hotel', 'swimming', 'pool', 'port', 'charge', 'section', 'penal', 'code', 'read', 'together', 'section', 'act', 'maximum', 'jail', 'term', 'two', 'fine', 'upon', 'conviction', 'deputy', 'public', 'prosecutor', 'bail', 'accused', 'court', 'impose', 'additional', 'condition', 'disturb', 'prosecution', 'case', 'resolved', 'court', 'accused', 'bail', 'one', 'surety', 'lawyer', 'wan', 'wan', 'court', 'five', 'accused', 'guilty', 'second', 'charge', 'allegedly', 'port', 'police', 'headquarters', 'court', 'bail', 'fixed', 'mention'], cleaned_headline_tokens=['five', 'causing', 'death', 'hotel', 'swimming', 'pool'])

### 2.2.9 Detecting and Handling Near Duplicates

In [10]:
@udf(returnType = FloatType())
def jaccard_similarity(list1, list2):
    set1 = set(list1)
    set2 = set(list2)
    intersection = set1.intersection(set2)
    union = set1.union(set2)

    if len(union) == 0:
        return 0.0

    return float(len(intersection)) / len(union)
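
The UDF above computes the Jaccard similarity of two token lists, that is, the size of the intersection of their token sets divided by the size of their union. A small worked example in plain Python, using made-up token lists, may help check the arithmetic; `jaccard` here is a standalone copy for illustration.

```python
def jaccard(tokens_a, tokens_b):
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # same convention as the UDF: two empty lists score 0
    return len(set_a & set_b) / len(union)


a = ["court", "bail", "accused", "charge"]
b = ["court", "bail", "accused", "police"]
print(jaccard(a, b))  # 3 shared tokens out of 5 distinct tokens: 0.6
```

Because the lists are converted to sets, repeated tokens and token order do not affect the score.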

In [11]:
# Caution: this cell takes around 6 minutes to run

print("df_lemmatized Count:", df_lemmatized.count())

df_limit = df_lemmatized.limit(5000)

df_cross = (
    df_limit
    .alias('df1')
    .join(df_limit.alias('df2'), col('df1.key') < col('df2.key'))
)
print("df_cross Count:", df_cross.count())

df_jaccard = (
    df_cross
    .withColumn('jaccard_similarity', jaccard_similarity('df1.cleaned_content_tokens', 'df2.cleaned_content_tokens'))
    .select('df1.key', 'df2.key', 'jaccard_similarity')
)

df_duplicates = df_jaccard.filter(col('jaccard_similarity') > 0.8)

print("Number of duplicates:", df_duplicates.count())

df_lemmatized Count: 5036
df_cross Count: 12497500
Number of duplicates: 9341

In [12]:
# Caution: this cell takes around 6 minutes to run
# Collect the second key of every near-duplicate pair into a set for fast lookup
duplicated_keys = {
    row['key']
    for row in df_duplicates.select(col('df2.key').alias('key')).distinct().collect()
}

filtered_rdd = df_limit.rdd.filter(lambda x: x[0] not in duplicated_keys)

df_duplicates_removed = filtered_rdd.toDF()

df_duplicates_removed.show()


+---+----------------------+-----------------------+
|key|cleaned_content_tokens|cleaned_headline_tokens|
+---+----------------------+-----------------------+
| k0|  [port, five, peop...|   [five, causing, d...|
| k1|  [session, court, ...|   [businessman, che...|
| k3|  [former, chief, e...|   [one, year, jail,...|
| k4|  [pekan, body, lat...|   [remains, laid, r...|
| k5|  [police, uncovere...|   [police, seize, c...|
| k6|  [court, fixed, au...|   [guilty, therapis...|
| k8|  [former, security...|   [murder, guard, e...|
| k9|  [federal, court, ...|   [federal, court, ...|
|k10|  [court, appeal, r...|   [father, sexually...|
|k11|  [investment, cons...|   [consultant, nearly]|
|k12|  [police, eight, f...|   [arrest, eight, d...|
|k14|  [eighteen, foreig...|   [nab, involvement...|
|k15|  [two, underage, g...|        [duo, underage]|
|k16|  [total, foreign, ...|   [police, detain, ...|
|k18|  [police, driver, ...|   [lookout, driver,...|
|k19|  [former, scrap, m...|   [death, sentenc
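
The near-duplicate handling in this section can be summarised as: compare every unordered pair of records once, flag pairs whose Jaccard similarity exceeds 0.8, and drop the second record of each flagged pair. The sketch below reproduces that logic on three made-up documents in plain Python; the `docs` dictionary and its contents are hypothetical.

```python
from itertools import combinations


def jaccard(tokens_a, tokens_b):
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


docs = {
    "k0": ["court", "bail", "accused", "charge", "police"],
    "k1": ["court", "bail", "accused", "charge", "police"],  # copy of k0
    "k2": ["flood", "rescue", "village", "river"],
}

# Each unordered pair once, mirroring the df1.key < df2.key join condition.
pairs = list(combinations(sorted(docs), 2))

# The second key of every pair above the threshold is marked as a duplicate.
duplicated_keys = {
    key_b for key_a, key_b in pairs
    if jaccard(docs[key_a], docs[key_b]) > 0.8
}
deduplicated = {key: toks for key, toks in docs.items() if key not in duplicated_keys}

print(len(pairs), sorted(deduplicated))

# n rows give n * (n - 1) / 2 pairs, so 5000 rows give 12,497,500 comparisons.
print(5000 * 4999 // 2)
```

The quadratic growth in the number of pairs explains the long run time of the cells above. Dropping only the second key keeps one representative of each group of near duplicates.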

### 2.2.10 Additional Cleaning (Removing tokens with character length less than 4)

In [13]:
# Removing tokens with character length less than 4
@udf(returnType = ArrayType(StringType()))
def remove_short_length_words(words):
    return [word for word in words if len(word) > 3]

In [14]:
df_duplicates_removed = (
    df_duplicates_removed
    .withColumn('cleaned_content_tokens', remove_short_length_words('cleaned_content_tokens'))
    .withColumn('cleaned_headline_tokens', remove_short_length_words('cleaned_headline_tokens'))
)

### 2.2.11 Export Cleaned Dataset to CSV and Upload to MongoDB

In [15]:
df_cleaned = df_sampled_full.join(df_duplicates_removed, on = 'key', how = 'inner').drop('headline').drop('content')

df_cleaned = df_cleaned.withColumn('int_key', regexp_extract('key', r'(\d+)', 1).cast('int')).orderBy('int_key').drop('int_key')

df_cleaned.show()


+---+------+--------------------+----------------+--------------------+----------------------+-----------------------+
|key|author|            datetime|       publisher|                 url|cleaned_content_tokens|cleaned_headline_tokens|
+---+------+--------------------+----------------+--------------------+----------------------+-----------------------+
| k0|      |2024-07-19T20:15:...|Selangor Journal|https://selangorj...|  [port, five, peop...|   [five, causing, d...|
| k1|      |2024-07-19T17:30:...|Selangor Journal|https://selangorj...|  [session, court, ...|   [businessman, che...|
| k3|      |2024-07-18T21:12:...|Selangor Journal|https://selangorj...|  [former, chief, e...|   [year, jail, fine...|
| k4|      |2024-07-18T19:58:...|Selangor Journal|https://selangorj...|  [pekan, body, lat...|   [remains, laid, r...|
| k5|      |2024-07-18T19:45:...|Selangor Journal|https://selangorj...|  [police, uncovere...|   [police, seize, c...|
| k6|      |2024-07-18T17:09:...|Selangor Journa

In [16]:
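# newline = '' prevents the csv module from writing extra blank lines on some platforms
with open(r'../data/cleaned_dataset.csv', 'w', newline = '', encoding = 'utf-8') as csvfile: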
    writer = csv.DictWriter(csvfile, df_cleaned.columns)

    writer.writeheader()
    for news_item in df_cleaned.collect():
        writer.writerow(news_item.asDict())
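
One consequence of writing rows this way is that list-valued columns, such as the token columns, are stored in the CSV as the string form of a Python list. The sketch below shows one possible way to restore them when reading the file back, using `ast.literal_eval`. It writes to an in-memory buffer, and the sample rows are hypothetical.

```python
import ast
import csv
import io

# Hypothetical sample rows with a list-valued column.
rows = [
    {"key": "k0", "tokens": ["alpha", "beta", "gamma"]},
    {"key": "k1", "tokens": ["delta", "epsilon"]},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["key", "tokens"])
writer.writeheader()
writer.writerows(rows)

# Read the rows back and parse the list column from its string form.
buffer.seek(0)
restored = [
    {"key": row["key"], "tokens": ast.literal_eval(row["tokens"])}
    for row in csv.DictReader(buffer)
]
print(restored == rows)
```

Serialising the lists as JSON strings before writing would be an alternative if the file needs to be read by tools other than Python.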


In [17]:
# Initialise MongoDB client
mongodb_client = MongoDBClient()

Pinged your deployment. You successfully connected to MongoDB!


In [18]:
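# Convert the DataFrame into a list of dictionaries so that all rows can be inserted into MongoDB in one call
# Columns are accessed by name rather than by position, so the mapping does not depend on column order
documents = df_cleaned.rdd.map(lambda row: {
    'key': row['key'],
    'tokenised_content': row['cleaned_content_tokens'],
    'tokenised_headline': row['cleaned_headline_tokens'],
    'publisher': row['publisher'],
    'author': row['author'],
    'datetime': row['datetime'],
    'url': row['url']
}).collect()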

mongodb_client.insert_many('Cleaned_Dataset', 'cleaned_data', documents)

Documents successfully inserted: [ObjectId('66d42f48a3ddbe6f6765ee01'), ObjectId('66d42f48a3ddbe6f6765ee02'), ObjectId('66d42f48a3ddbe6f6765ee03'), ObjectId('66d42f48a3ddbe6f6765ee04'), ObjectId('66d42f48a3ddbe6f6765ee05'), ObjectId('66d42f48a3ddbe6f6765ee06'), ObjectId('66d42f48a3ddbe6f6765ee07'), ObjectId('66d42f48a3ddbe6f6765ee08'), ObjectId('66d42f48a3ddbe6f6765ee09'), ObjectId('66d42f48a3ddbe6f6765ee0a'), ObjectId('66d42f48a3ddbe6f6765ee0b'), ObjectId('66d42f48a3ddbe6f6765ee0c'), ObjectId('66d42f48a3ddbe6f6765ee0d'), ObjectId('66d42f48a3ddbe6f6765ee0e'), ObjectId('66d42f48a3ddbe6f6765ee0f'), ObjectId('66d42f48a3ddbe6f6765ee10'), ObjectId('66d42f48a3ddbe6f6765ee11'), ObjectId('66d42f48a3ddbe6f6765ee12'), ObjectId('66d42f48a3ddbe6f6765ee13'), ObjectId('66d42f48a3ddbe6f6765ee14'), ObjectId('66d42f48a3ddbe6f6765ee15'), ObjectId('66d42f48a3ddbe6f6765ee16'), ObjectId('66d42f48a3ddbe6f6765ee17'), ObjectId('66d42f48a3ddbe6f6765ee18'), ObjectId('66d42f48a3ddbe6f6765ee19'), ObjectId('66d42f

In [19]:
redis_client.stop_service()
spark.stop()
