## Introduction to Natural Language Processing with PySpark

### NLP Tools

1. Tokenizer
2. Stop words removal
3. n-grams
4. Term frequency-inverse document frequency (TF-IDF)
5. Count Vectorizer

### SMS Spam Collection Dataset

* Download the dataset for later use

In [1]:
!curl -o smsspamcollection.zip https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  198k    0  198k    0     0  72371      0 --:--:--  0:00:02 --:--:-- 72363


In [2]:
!unzip smsspamcollection.zip

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName("nlp").getOrCreate()



### Tokenizer

In [7]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf 
from pyspark.sql.types import IntegerType

In [16]:
sent_df = spark.createDataFrame(
    [
        (0, "Hello I am happy to be learning Apache Spark"),
        (1, "I enjoy learning about Python and javascript Programming"),
        (2, "I am familiar with machine learning applications"),
        (3, "here,is,a,list,of,words")
    ],
    ["id", "sentence"]
)

In [17]:
sent_df.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hello I am happy ...|
|  1|I enjoy learning ...|
|  2|I am familiar wit...|
|  3|here,is,a,list,of...|
+---+--------------------+



In [18]:
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")  # split on any non-word character

countTokens = udf(lambda w: len(w), IntegerType())

In [19]:
tokenized = tokenizer.transform(sent_df)

In [20]:
tokenized.show()

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  0|Hello I am happy ...|[hello, i, am, ha...|
|  1|I enjoy learning ...|[i, enjoy, learni...|
|  2|I am familiar wit...|[i, am, familiar,...|
|  3|here,is,a,list,of...|[here,is,a,list,o...|
+---+--------------------+--------------------+



In [21]:
tokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show()

+--------------------+--------------------+------+
|            sentence|               words|tokens|
+--------------------+--------------------+------+
|Hello I am happy ...|[hello, i, am, ha...|     9|
|I enjoy learning ...|[i, enjoy, learni...|     8|
|I am familiar wit...|[i, am, familiar,...|     7|
|here,is,a,list,of...|[here,is,a,list,o...|     1|
+--------------------+--------------------+------+



In [22]:
regexTokenized = regexTokenizer.transform(sent_df)
regexTokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show()

+--------------------+--------------------+------+
|            sentence|               words|tokens|
+--------------------+--------------------+------+
|Hello I am happy ...|[hello, i, am, ha...|     9|
|I enjoy learning ...|[i, enjoy, learni...|     8|
|I am familiar wit...|[i, am, familiar,...|     7|
|here,is,a,list,of...|[here, is, a, lis...|     6|
+--------------------+--------------------+------+
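The two token counts differ only for the last row: `Tokenizer` lowercases the text and splits on whitespace, so the comma-separated sentence stays a single token, while `RegexTokenizer` with `pattern="\\W"` treats every non-word character as a delimiter. The plain-Python sketch below imitates that behaviour without Spark; the function names `whitespace_tokenize` and `regex_tokenize` are illustrative, not part of the PySpark API.

```python
import re


def whitespace_tokenize(text):
    # Lowercase, then split on whitespace only (commas stay inside tokens).
    return text.lower().split()


def regex_tokenize(text, pattern=r"\W"):
    # Lowercase, split wherever the pattern matches, and drop empty strings.
    return [tok for tok in re.split(pattern, text.lower()) if tok]


sentence = "here,is,a,list,of,words"
print(whitespace_tokenize(sentence))  # ['here,is,a,list,of,words']
print(regex_tokenize(sentence))       # ['here', 'is', 'a', 'list', 'of', 'words']
```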



In [30]:
sent_df_token = regexTokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words")))

### Stop Words Removal

In [23]:
from pyspark.ml.feature import StopWordsRemover

In [25]:
sent_df.show(truncate=False)

+---+--------------------------------------------------------+
|id |sentence                                                |
+---+--------------------------------------------------------+
|0  |Hello I am happy to be learning Apache Spark            |
|1  |I enjoy learning about Python and javascript Programming|
|2  |I am familiar with machine learning applications        |
|3  |here,is,a,list,of,words                                 |
+---+--------------------------------------------------------+



In [31]:
remover = StopWordsRemover(inputCol="words", outputCol="cleaned")
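Conceptually, `StopWordsRemover` keeps every token that is not in a stop-word list, comparing case-insensitively by default. A minimal plain-Python sketch of that filter follows; the `STOP_WORDS` set is a tiny illustrative subset chosen for the first example sentence, not Spark's actual default English list.

```python
# Illustrative subset only; Spark's default English stop-word list is much longer.
STOP_WORDS = {"i", "am", "to", "be", "about", "and", "with"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep tokens whose lowercase form is not a stop word.
    return [tok for tok in tokens if tok.lower() not in stop_words]


words = ["hello", "i", "am", "happy", "to", "be", "learning", "apache", "spark"]
print(remove_stop_words(words))
# ['hello', 'happy', 'learning', 'apache', 'spark']
```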

In [33]:
remover.transform(sent_df_token).show(truncate=False)

+--------------------------------------------------------+-----------------------------------------------------------------+------+--------------------------------------------------+
|sentence                                                |words                                                            |tokens|cleaned                                           |
+--------------------------------------------------------+-----------------------------------------------------------------+------+--------------------------------------------------+
|Hello I am happy to be learning Apache Spark            |[hello, i, am, happy, to, be, learning, apache, spark]           |9     |[hello, happy, learning, apache, spark]           |
|I enjoy learning about Python and javascript Programming|[i, enjoy, learning, about, python, and, javascript, programming]|8     |[enjoy, learning, python, javascript, programming]|
|I am familiar with machine learning applications        |[i, am, familiar, with, mac

### n-grams

In [34]:
from pyspark.ml.feature import NGram

In [37]:
sent_df_token.show(truncate=False)

+--------------------------------------------------------+-----------------------------------------------------------------+------+
|sentence                                                |words                                                            |tokens|
+--------------------------------------------------------+-----------------------------------------------------------------+------+
|Hello I am happy to be learning Apache Spark            |[hello, i, am, happy, to, be, learning, apache, spark]           |9     |
|I enjoy learning about Python and javascript Programming|[i, enjoy, learning, about, python, and, javascript, programming]|8     |
|I am familiar with machine learning applications        |[i, am, familiar, with, machine, learning, applications]         |7     |
|here,is,a,list,of,words                                 |[here, is, a, list, of, words]                                   |6     |
+--------------------------------------------------------+-----------------------------------------------------------------+------+

In [38]:
bigrams = NGram(n=2, inputCol="words", outputCol="bigrams")

In [40]:
bigram_df = bigrams.transform(sent_df_token)

In [42]:
bigram_df.select("bigrams").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------+
|bigrams                                                                                                    |
+-----------------------------------------------------------------------------------------------------------+
|[hello i, i am, am happy, happy to, to be, be learning, learning apache, apache spark]                     |
|[i enjoy, enjoy learning, learning about, about python, python and, and javascript, javascript programming]|
|[i am, am familiar, familiar with, with machine, machine learning, learning applications]                  |
|[here is, is a, a list, list of, of words]                                                                 |
+-----------------------------------------------------------------------------------------------------------+
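Each bigram above is a window of two consecutive tokens joined by a space. The sliding window can be sketched in plain Python as follows; `ngrams` is a hypothetical helper name, and the sketch returns an empty list when the input has fewer than `n` tokens.

```python
def ngrams(tokens, n=2):
    # Slide a window of size n over the tokens and join each window with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["here", "is", "a", "list", "of", "words"]
print(ngrams(tokens, n=2))
# ['here is', 'is a', 'a list', 'list of', 'of words']
```

A list of k tokens yields k - n + 1 n-grams, which is why the six-word sentence produces five bigrams.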



### Term Frequency-Inverse Document Frequency (TF-IDF)

In [43]:
from pyspark.ml.feature import HashingTF, IDF, RegexTokenizer

In [46]:
sent_df = spark.createDataFrame(
    [
        (0, 0.0, "Hello I am happy to be learning Apache Spark"),
        (1, 0.0, "I enjoy learning about Python and javascript Programming"),
        (2, 1.0, "I am familiar with machine learning applications"),
        (3, 1.0, "here,is,a,list,of,words")
    ],
    ["id", "label", "sentence"]
)

In [47]:
sent_df.show(truncate=False)

+---+-----+--------------------------------------------------------+
|id |label|sentence                                                |
+---+-----+--------------------------------------------------------+
|0  |0.0  |Hello I am happy to be learning Apache Spark            |
|1  |0.0  |I enjoy learning about Python and javascript Programming|
|2  |1.0  |I am familiar with machine learning applications        |
|3  |1.0  |here,is,a,list,of,words                                 |
+---+-----+--------------------------------------------------------+



In [48]:
tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
words_df = tokenizer.transform(sent_df)

In [50]:
words_df.show(truncate=False)

+---+-----+--------------------------------------------------------+-----------------------------------------------------------------+
|id |label|sentence                                                |words                                                            |
+---+-----+--------------------------------------------------------+-----------------------------------------------------------------+
|0  |0.0  |Hello I am happy to be learning Apache Spark            |[hello, i, am, happy, to, be, learning, apache, spark]           |
|1  |0.0  |I enjoy learning about Python and javascript Programming|[i, enjoy, learning, about, python, and, javascript, programming]|
|2  |1.0  |I am familiar with machine learning applications        |[i, am, familiar, with, machine, learning, applications]         |
|3  |1.0  |here,is,a,list,of,words                                 |[here, is, a, list, of, words]                                   |
+---+-----+--------------------------------------------------------+-----------------------------------------------------------------+

In [52]:
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurized = hashingTF.transform(words_df)
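`HashingTF` uses the hashing trick: each token is hashed, the hash is reduced modulo `numFeatures` to get a column index, and the count at that index is incremented. With only 20 buckets, distinct words can collide and share an index. The sketch below illustrates the idea in plain Python; it uses `zlib.crc32` as a stand-in hash (Spark uses MurmurHash3), so the indices it produces will not match Spark's, and `hashing_tf` is a hypothetical helper name.

```python
import zlib
from collections import defaultdict


def hashing_tf(tokens, num_features=20):
    # Map each token to a bucket index and count occurrences per bucket.
    counts = defaultdict(float)
    for tok in tokens:
        index = zlib.crc32(tok.encode("utf-8")) % num_features  # stand-in hash
        counts[index] += 1.0
    # Return a sparse representation: {index: term frequency}, sorted by index.
    return dict(sorted(counts.items()))


tokens = ["i", "enjoy", "learning", "about", "python", "and", "javascript", "programming"]
vector = hashing_tf(tokens)
print(vector)
```

The frequencies always sum to the number of tokens; a value above 1.0 for a sentence without repeated words indicates a collision.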

In [53]:
featurized.show(truncate=False)

+---+-----+--------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|id |label|sentence                                                |words                                                            |rawFeatures                                                      |
+---+-----+--------------------------------------------------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
|0  |0.0  |Hello I am happy to be learning Apache Spark            |[hello, i, am, happy, to, be, learning, apache, spark]           |(20,[3,5,6,7,8,9,12,15,16],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1  |0.0  |I enjoy learning about Python and javascript Programming|[i, enjoy, learning, about, python, and, javascript, programming]|(20,[1,5,9,11,12,14,16],[1.0,1.0,1.0,1.0,1.0,1.0,2.0])        

In [54]:
idf = IDF(inputCol="rawFeatures", outputCol="features")
idf_model = idf.fit(featurized)


In [56]:
rescale = idf_model.transform(featurized)
rescale.select("label", "features").show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(20,[3,5,6,7,8,9,...|
|  0.0|(20,[1,5,9,11,12,...|
|  1.0|(20,[0,2,5,10,12,...|
|  1.0|(20,[7,8,9,12,15]...|
+-----+--------------------+
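The `features` column holds each raw term frequency multiplied by an inverse document frequency weight. Spark's documentation defines the smoothed weight as IDF(t) = log((m + 1) / (DF(t) + 1)), where m is the number of documents and DF(t) is the number of documents containing term t. A minimal plain-Python sketch of that rescaling on toy sparse vectors (the data and helper names are made up for illustration):

```python
import math


def idf_weights(docs, num_features):
    # docs: list of sparse vectors {index: term frequency}.
    m = len(docs)
    doc_freq = [0] * num_features
    for doc in docs:
        for index in doc:
            doc_freq[index] += 1
    # Smoothed IDF: log((m + 1) / (df + 1)).
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]


def tf_idf(doc, idf):
    # Scale each term frequency by the IDF weight of its index.
    return {index: tf * idf[index] for index, tf in doc.items()}


docs = [{0: 1.0, 1: 2.0}, {0: 1.0, 2: 1.0}]
idf = idf_weights(docs, num_features=3)
print(tf_idf(docs[0], idf))
```

Index 0 appears in every document, so its weight is log(3/3) = 0 and it contributes nothing to either vector.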



### Count Vectorization

In [57]:
from pyspark.ml.feature import CountVectorizer

In [59]:
df = spark.createDataFrame([
    (0, list("abcde")),
    (1, list("abbbcccdde"))
], ["id", "words"])

In [60]:
df.show()

+---+--------------------+
| id|               words|
+---+--------------------+
|  0|     [a, b, c, d, e]|
|  1|[a, b, b, b, c, c...|
+---+--------------------+



In [61]:
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=5, minDF=2.0)

In [62]:
model = cv.fit(df)

In [63]:
res = model.transform(df)
res.show(truncate=False)

+---+------------------------------+-------------------------------------+
|id |words                         |features                             |
+---+------------------------------+-------------------------------------+
|0  |[a, b, c, d, e]               |(5,[0,1,2,3,4],[1.0,1.0,1.0,1.0,1.0])|
|1  |[a, b, b, b, c, c, c, d, d, e]|(5,[0,1,2,3,4],[3.0,3.0,2.0,1.0,1.0])|
+---+------------------------------+-------------------------------------+
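Unlike `HashingTF`, `CountVectorizer` learns an explicit vocabulary: it keeps terms that occur in at least `minDF` documents, orders them by total corpus frequency, and truncates to `vocabSize`. The plain-Python sketch below mirrors that logic on the same two documents. Breaking frequency ties alphabetically is an assumption of this sketch; Spark does not guarantee a particular order among equally frequent terms.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size=5, min_df=2):
    total = Counter()     # occurrences across the whole corpus
    doc_freq = Counter()  # number of documents containing each term
    for tokens in docs:
        total.update(tokens)
        doc_freq.update(set(tokens))
    kept = [term for term in total if doc_freq[term] >= min_df]
    # Most frequent first; ties broken alphabetically (assumption of this sketch).
    kept.sort(key=lambda term: (-total[term], term))
    return kept[:vocab_size]


def count_vector(tokens, vocab):
    # Dense count vector aligned with the vocabulary order.
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocab]


docs = [list("abcde"), list("abbbcccdde")]
vocab = fit_vocabulary(docs)
print(vocab)                         # ['b', 'c', 'd', 'a', 'e']
print(count_vector(docs[1], vocab))  # [3.0, 3.0, 2.0, 1.0, 1.0]
```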

