# **Basic Text Preprocessing**

This notebook covers:

**1.** Converting to lowercase

**2.** Removing stop words and punctuation

**3.** Finding POS tags

**4.** Lemmatization

## **Install packages if not yet installed**

In [1]:
import sys

# Installing a slightly older version of pandas, as the pandas-on-Spark apply function in PySpark 3.3.0 calls `iteritems`,
# which is deprecated in the latest versions of pandas. In PySpark 3.4.0, the pandas-on-Spark apply function calls `items`.
# !{sys.executable} -m pip install -U pandas==1.5.3 # pandas
!{sys.executable} -m pip install cassandra-driver # Cassandra
!{sys.executable} -m pip install nltk # NLTK
!{sys.executable} -m pip install spacy # spaCy
# Restart the kernel after running this code block so that the language model can be loaded.
!{sys.executable} -m pip install --user https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz # spaCy language model

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz (13.7 MB)
  Preparing metadata (setup.py) ... done


In [2]:
%run ../EnvironmentVariablesSetup.ipynb

## **Read database from Keyspaces using PySpark**

**1.** Download the required jar files (`spark-cassandra-connector_2.12-3.3.0.jar` and `spark-cassandra-connector-assembly_2.12-3.3.0.jar`).

**2.** Download your `cassandra_truststore.jks` file.

**3.** Create an `application.conf` file.

**4.** Create a `SparkSession` and set the configuration to connect to Keyspaces using service-specific credentials.

**5.** Read all rows from the `GFGArticles` table in the `GFGArticles` keyspace into a PySpark DataFrame.

In [3]:
import os

# To resolve the following warning ----------------------------------------
# WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. 
# It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. 
# pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.

os.environ["PYARROW_IGNORE_TIMEZONE"]="1"

In [4]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
# import pyspark.pandas as ps
# from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [5]:
spark=SparkSession.builder.appName("BasicTextPreprocessing")\
    .config("spark.files", "../application.conf")\
    .config("spark.jars", "../jar-files/spark-cassandra-connector_2.12-3.3.0.jar,"
                            "../jar-files/spark-cassandra-connector-assembly_2.12-3.3.0.jar")\
    .getOrCreate()

spark.conf.set("spark.cassandra.connection.config.profile.path", "application.conf")
spark.conf.set("spark.cassandra.connection.ssl.clientAuth.enabled", "true")
spark.conf.set("spark.cassandra.connection.ssl.enabled", "true")
# spark.conf.set("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")

# Spark version
spark.sparkContext.version

23/09/21 15:56:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


'3.3.0'

In [6]:
articles=spark.read\
  .format("org.apache.spark.sql.cassandra")\
  .options(table="GFGArticles", keyspace="GFGArticles")\
  .load()

articles.show(5)

23/09/21 15:56:18 WARN CassandraConnectionFactory: Ignoring all programmatic configuration, only using configuration from application.conf



+-----+------------------+--------+--------------------+------------+--------------------+--------------------+
|   ID|          AuthorID|Category|             Content| LastUpdated|                Link|               Title|
+-----+------------------+--------+--------------------+------------+--------------------+--------------------+
| 8772|     GeeksforGeeks|    easy|Consider the foll...|18 Jan, 2018|https://www.geeks...|GATE | GATE-CS-20...|
|11346|PRAKHARAGRAWAL8013|   basic|What is AMCAT?\nA...|17 Jul, 2020|https://www.geeks...|AMCAT Test Experi...|
|23825|     GeeksforGeeks|  medium|Online Coding Rou...|10 Jan, 2019|https://www.geeks...|Paytm Interview e...|
|23790|   ShreyaChourasia|  medium|Samsung R&D Insti...|18 Jul, 2019|https://www.geeks...|Samsung Delhi Int...|
|13740|        AshwinGoel|   basic|man command in Li...|18 Feb, 2021|https://www.geeks...|man command in Li...|
+-----+------------------+--------+--------------------+------------+--------------------+--------------------+

                                                                                

In [7]:
articles.printSchema()

root
 |-- ID: integer (nullable = false)
 |-- AuthorID: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Content: string (nullable = true)
 |-- LastUpdated: string (nullable = true)
 |-- Link: string (nullable = true)
 |-- Title: string (nullable = true)



## **Convert to `pandas-on-spark` Dataframe**

In [8]:
# Duplicating the ID column so it can be used as an index column when converted to pandas-on-spark Dataframe.
articles=articles.withColumn("Index", articles["ID"])
articles.show(5)

+-----+--------------+--------+--------------------+------------+--------------------+--------------------+-----+
|   ID|      AuthorID|Category|             Content| LastUpdated|                Link|               Title|Index|
+-----+--------------+--------+--------------------+------------+--------------------+--------------------+-----+
|23474| ManasChhabra2|  medium|This resizable pr...|04 Dec, 2018|https://www.geeks...|How to disable re...|23474|
| 3531|    manjeet_04|    easy|Sometimes, we req...|25 Apr, 2019|https://www.geeks...|Python | Find the...| 3531|
|22273|KhushalAgarwal|  medium|JavaScript is Syn...|02 Dec, 2021|https://www.geeks...|Async/Await Funct...|22273|
|24847|ayushjauhari14|  medium|Given a non-negat...|08 Apr, 2021|https://www.geeks...|Check if actual b...|24847|
|23798| Sanjit_Prasad|  medium|Non-homogeneous P...|21 Sep, 2018|https://www.geeks...|Nonhomogeneous Po...|23798|
+-----+--------------+--------+--------------------+------------+--------------------+--------------------+-----+

In [9]:
articles=articles.pandas_api("Index")
articles[:5]

                                                                                

| Index | ID | AuthorID | Category | Content | LastUpdated | Link | Title |
|---|---|---|---|---|---|---|---|
| 8772 | 8772 | GeeksforGeeks | easy | Consider the following partial Schedule S invo... | 18 Jan, 2018 | https://www.geeksforgeeks.org/gate-gate-cs-201... | GATE \| GATE-CS-2015 (Set 3) \| Question 40 |
| 11346 | 11346 | PRAKHARAGRAWAL8013 | basic | What is AMCAT?\nAMCAT or Aspiring Minds Comput... | 17 Jul, 2020 | https://www.geeksforgeeks.org/amcat-test-exper... | AMCAT Test Experience 2021 |
| 23825 | 23825 | GeeksforGeeks | medium | Online Coding Round : Platform used was cocube... | 10 Jan, 2019 | https://www.geeksforgeeks.org/paytm-interview-... | Paytm Interview experience (On-Campus) for FTE |
| 23790 | 23790 | ShreyaChourasia | medium | Samsung R&D Institute, Delhi visited our campu... | 18 Jul, 2019 | https://www.geeksforgeeks.org/samsung-delhi-in... | Samsung Delhi Interview Experience (On-Campus ... |
| 13740 | 13740 | AshwinGoel | basic | man command in Linux is used to display the us... | 18 Feb, 2021 | https://www.geeksforgeeks.org/man-command-in-l... | man command in Linux with Examples |


At present, Amazon Keyspaces does not support retrieving the shape or row count of a table, so executing `articles.shape` throws an error. I need the total count, so I get it by collecting one column into a list and taking its length.

In [10]:
NO_OF_ARTICLES=len(articles.LastUpdated.to_list())
NO_OF_ARTICLES

                                                                                

34550

In [11]:
IDs=articles.ID.to_list()



## **Download stopwords**

In [12]:
import nltk, re
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
from nltk.corpus import stopwords
STOP_WORDS_LIST=stopwords.words('english')

In [14]:
# Define a function to remove stopwords and punctuation.
def removeStopwordsAndPunctuation(text):
    # Replace anything other than ASCII letters, digits and whitespace with a space,
    # then collapse extra whitespace. `A-Za-z` is used because the range `A-z` also matches [ \ ] ^ _ and `.
    text=re.sub(r"\s+", " ", re.sub(r"[^A-Za-z0-9\s]", " ", text)).lower()
    # Removing stopwords
    text=[word for word in text.split() if word not in STOP_WORDS_LIST]
    return ' '.join(text)
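
To see what this kind of cleaning does to a sentence, here is a minimal standalone sketch. It assumes a tiny hypothetical stop word set (`DEMO_STOP_WORDS`) instead of the NLTK list so that it runs without any downloads, and the function name `remove_stopwords_and_punctuation` is only illustrative. It uses the character class `A-Za-z`, because the range `A-z` also spans the ASCII characters `[`, `\`, `]`, `^`, `_` and the backtick.

```python
import re

# Hypothetical, tiny stop word set used only for this demo (not the NLTK list).
DEMO_STOP_WORDS = {"the", "is", "a", "of", "to", "and", "in"}

def remove_stopwords_and_punctuation(text, stop_words=DEMO_STOP_WORDS):
    # Replace everything except ASCII letters, digits and whitespace with a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace and lowercase the result.
    text = re.sub(r"\s+", " ", text).lower()
    # Drop stop words; a set gives constant-time membership checks.
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "The `man` command, in Linux, is used to display the user_manual!"
cleaned = remove_stopwords_and_punctuation(sample)
print(cleaned)  # man command linux used display user manual
```

Note that the underscore in `user_manual` is treated as punctuation here, so the word is split in two.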

## **Load the language model**

The language model pipeline consists of the following components: `['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']`

The `ner` (named-entity recognition) component is disabled to speed up the pipeline.

In [15]:
import spacy

In [16]:
nlp=spacy.load("en_core_web_sm", disable=["ner"])
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

## **Preprocess Text**

**1.** Removal of stop words and punctuation.

**2.** Pass the article texts to spaCy as a list of tuples in the following form. The text of each article goes first in each tuple, and the `ID` of the article goes in a dictionary as the second element. This makes it easy to retrieve the ID of an article after spaCy has processed its text. When passing data in this format to the `nlp.pipe()` method, set `as_tuples=True`.

```python
[
    ("text ...", {"ID" : <ID-Value>}),
    ("text ...", {"ID" : <ID-Value>}),
    ...
]
```

**3.** Replace the tokens in the article text with their lemmatized form.
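
The input format from step 2 can be assembled with plain Python. The sketch below uses a few made-up rows (`rows` is hypothetical sample data, not taken from the table) and only builds the list of tuples; it does not call spaCy itself.

```python
# Hypothetical sample rows standing in for (ID, Content) pairs from the table.
rows = [
    (101, "man command in linux"),
    (102, "async await function in javascript"),
]

# The text goes first; the context dictionary carrying the ID goes second.
texts_with_context = [(content, {"ID": article_id}) for article_id, content in rows]

for text, context in texts_with_context:
    print(context["ID"], text)
```

When such a list is passed to `nlp.pipe(texts_with_context, as_tuples=True)`, spaCy yields `(doc, context)` pairs, so `context["ID"]` identifies the article each `Doc` came from.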

In [17]:
# Define a function to preprocess text.
# def textPreprocess(articles):
    # Remove stop words and punctuation
    # articles.loc[:, "Content"]=articles.Content.apply(removeStopwordsAndPunctuation)
    # for article in articles[:5].iterrows():
    #     articles.loc[article[0], "Content"]=removeStopwordsAndPunctuation(article[1].Content)
        
    # Get lemmatized words
    # for article, attr in nlp.pipe(list(articles.apply(lambda article: (article.Content, {"ID": article.name}), axis=1)), as_tuples=True, n_process=-1, batch_size=32):
    #     articles.loc[articles.ID==attr["ID"], "Content"]=" ".join([token.lemma_.strip() for token in article])

In [18]:
# Define a function to preprocess text.
def textPreprocess(i, article):
    article=removeStopwordsAndPunctuation(article)
    # Get lemmatized words
    return i, " ".join([token.lemma_.lower().strip() for token in nlp(article)])

## **Connect to Amazon Keyspaces**

In [19]:
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

In [20]:
# Service username and password for Amazon Keyspaces. I previously saved my Keyspaces credentials as environment variables.
username=os.environ["keyspacesCredentialUsername"]
password=os.environ["keyspacesCredentialPassword"]

In [21]:
# Creates a session connection to the keyspace that is secured by TLS.
ssl_context=SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('../sf-class2-root.crt')
ssl_context.verify_mode=CERT_REQUIRED
exec_profile=ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
auth_provider=PlainTextAuthProvider(username=username, password=password)

cluster=Cluster(['cassandra.us-east-2.amazonaws.com'], 
                ssl_context=ssl_context, 
                auth_provider=auth_provider, 
                execution_profiles={EXEC_PROFILE_DEFAULT: exec_profile}, 
                port=9142)
session=cluster.connect()
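
On Python 3.10 and later, constructing a context with `PROTOCOL_TLSv1_2` emits a `DeprecationWarning`. A possible alternative, sketched below, is `ssl.PROTOCOL_TLS_CLIENT` combined with a minimum version of TLS 1.2. The helper name `build_tls_context` is hypothetical, and this variant has not been verified against Amazon Keyspaces here, so treat it as a starting point rather than a drop-in replacement.

```python
import os
import ssl

def build_tls_context(ca_file=None):
    # PROTOCOL_TLS_CLIENT negotiates the highest TLS version both sides support.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Load the CA certificate only if the file is present, so the sketch runs anywhere.
    if ca_file is not None and os.path.exists(ca_file):
        context.load_verify_locations(ca_file)
    return context

# Same certificate path as in the cell above.
ssl_context = build_tls_context("../sf-class2-root.crt")
print(ssl_context.verify_mode == ssl.CERT_REQUIRED)
```

`PROTOCOL_TLS_CLIENT` enables `CERT_REQUIRED` and hostname checking by default. Whether the driver passes the server hostname to the TLS handshake may depend on the driver version, so hostname checking is worth testing before relying on it.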



## **Preprocess `Content` using Multithreading**

Also save the results to another table in Amazon Keyspaces.


In [22]:
from concurrent.futures import ThreadPoolExecutor, as_completed

In [23]:
TIMEOUT_SECS=60

In [24]:
%%time
# Run the above function for all the articles in batches using multithreading.
futureResultErrors=[]
batchesCount, BATCH_SIZE=0, 1024
# Print batch size
print(f"Batch size: {BATCH_SIZE}")

# Prepare the INSERT statement once and reuse it for every row.
insert_statement=session.prepare('INSERT INTO "GFGArticles"."BasicPreprocessedGFGArticles" '
                                 '("ID", "PreprocessedContent") '
                                 'VALUES (?, ?);')

for batch_start in range(0, NO_OF_ARTICLES, BATCH_SIZE):
    future_to_id={}
    batchesCount+=1 # Batch number of the current batch.
    batch_end=min(batch_start+BATCH_SIZE, NO_OF_ARTICLES)

    with ThreadPoolExecutor(max_workers=64) as executor: 
        for i in range(batch_start, batch_end):
            future_to_id[executor.submit(textPreprocess, IDs[i], articles.loc[IDs[i], "Content"])]=IDs[i]
            
        for future in as_completed(future_to_id):
            try:
                ID, text=future.result(timeout=TIMEOUT_SECS)

                # Insert in Amazon Keyspaces.
                session.execute(insert_statement, parameters=[ID, str(text)])

            except Exception as err:
                futureResultErrors.append(err)
    
    # Print status.
    print(f"Batch #{batchesCount}: Preprocessed `Content` for {(batch_end-batch_start)} rows")

Batch size: 1024
Batch #1: Preprocessed `Content` for 1024 rows
Batch #2: Preprocessed `Content` for 1024 rows
Batch #3: Preprocessed `Content` for 1024 rows
Batch #4: Preprocessed `Content` for 1024 rows
Batch #5: Preprocessed `Content` for 1024 rows
Batch #6: Preprocessed `Content` for 1024 rows
Batch #7: Preprocessed `Content` for 1024 rows
Batch #8: Preprocessed `Content` for 1024 rows
Batch #9: Preprocessed `Content` for 1024 rows
Batch #10: Preprocessed `Content` for 1024 rows
Batch #11: Preprocessed `Content` for 1024 rows
Batch #12: Preprocessed `Content` for 1024 rows
Batch #13: Preprocessed `Content` for 1024 rows
Batch #14: Preprocessed `Content` for 1024 rows
Batch #15: Preprocessed `Content` for 1024 rows
Batch #16: Preprocessed `Content` for 1024 rows
Batch #17: Preprocessed `Content` for 1024 rows
Batch #18: Preprocessed `Content` for 1024 rows
Batch #19: Preprocessed `Content` for 1024 rows
Batch #20: Preprocessed `Content` for 1024 rows
Batch #21: Preprocessed `Content` for 1024 rows

In [25]:
futureResultErrors

[]
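
The batching pattern used above can be tried in isolation. The following is a minimal sketch with in-memory data: `batch_bounds`, `fake_preprocess` and `items` are hypothetical stand-ins for the notebook's batch arithmetic, `textPreprocess` and the rows read from Keyspaces, and nothing is written to a database.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_bounds(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in chunks."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

def fake_preprocess(item_id, text):
    # Stand-in for the real preprocessing function; it only lowercases the text.
    return item_id, text.lower()

# Hypothetical in-memory data instead of rows read from Keyspaces.
items = {i: f"Article {i}" for i in range(10)}
ids = list(items)

results, errors = {}, []
for start, end in batch_bounds(len(ids), batch_size=4):
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map each future back to the ID it was submitted for.
        future_to_id = {
            executor.submit(fake_preprocess, ids[i], items[ids[i]]): ids[i]
            for i in range(start, end)
        }
        for future in as_completed(future_to_id):
            try:
                item_id, text = future.result()
                results[item_id] = text
            except Exception as err:
                # Remember which item failed so it can be retried later.
                errors.append((future_to_id[future], err))

print(list(batch_bounds(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
print(len(results), len(errors))  # 10 0
```

Storing the ID next to each error, as done here, makes it possible to retry only the failed rows. With 34550 articles and a batch size of 1024, the same arithmetic gives 34 batches, the last one holding 758 rows.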