# Using window function SQL for natural language processing

## Loading natural language text

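To load a Parquet file, a columnar file format from the Hadoop ecosystem for storing structured data, call spark.read.load(...) with the path to the .parquet file. Parquet is the default format for spark.read.load unless the spark.sql.sources.default setting has been changed.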

regexp_replace replaces every substring that matches a pattern.

For example, df = df1.select(regexp_replace('value', 'Mr\.', 'Mr').alias('v')) turns "Mr. Holmes" into "Mr Holmes".

The split operation separates a string into individual tokens. Its second argument is a regular expression: df.select(split("name", "[ ]")) splits on spaces only, and adding unwanted symbols to the character class discards those symbols along with the spaces.
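
The two operations can be tried without a Spark session. The sketch below uses Python's built-in re module as a stand-in for regexp_replace and split. Spark evaluates Java regular expressions, but simple patterns like these are assumed to behave the same way in both. The sample strings are made up for illustration.

```python
import re

# Same pattern as the regexp_replace call above: drop the period after "Mr".
cleaned = re.sub(r"Mr\.", "Mr", "Mr. Holmes")

# Splitting on runs of spaces or commas discards the commas along with the spaces.
tokens = re.split(r"[ ,]+", "holmes, watson and lestrade")

print(cleaned)
print(tokens)
```

The `+` in the character class pattern matters: without it, a comma followed by a space would produce an empty token between the two separators.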

In [3]:
from pyspark import SparkContext
sc = SparkContext("local", "pyspark-shell")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

### Loading a dataframe from a parquet file

In [7]:
# from urllib.request import urlretrieve
# url = "https://assets.datacamp.com/production/repositories/3937/datasets/213ca262bf6af12428d42842848464565f3d5504/sherlock.txt"
# data = urlretrieve(url, "sherlock.txt")

In [None]:
#df.to_parquet("sherlock_sentences.parquet", engine='pyarrow', compression='gzip', index=False)

In [4]:
df = spark.read.load("sherlock_sentences.parquet")
df.show()

+--------------------+
|              clause|
+--------------------+
|               title|
|the adventures of...|
|sir arthur conan ...|
|          march 1999|
|          ebook 1661|
|most recently upd...|
|    november 29 2002|
|             edition|
|         12 language|
|english character...|
|               ascii|
|start of the proj...|
|additional editin...|
|the adventures of...|
|a scandal in bohe...|
|             the red|
|   headed league iii|
|a case of identit...|
|the boscombe vall...|
|the five orange p...|
+--------------------+
only showing top 20 rows



In [5]:
from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn("id", monotonically_increasing_id())
df.createOrReplaceTempView("df")
df.show()

+--------------------+---+
|              clause| id|
+--------------------+---+
|               title|  0|
|the adventures of...|  1|
|sir arthur conan ...|  2|
|          march 1999|  3|
|          ebook 1661|  4|
|most recently upd...|  5|
|    november 29 2002|  6|
|             edition|  7|
|         12 language|  8|
|english character...|  9|
|               ascii| 10|
|start of the proj...| 11|
|additional editin...| 12|
|the adventures of...| 13|
|a scandal in bohe...| 14|
|             the red| 15|
|   headed league iii| 16|
|a case of identit...| 17|
|the boscombe vall...| 18|
|the five orange p...| 19|
+--------------------+---+
only showing top 20 rows



In [7]:
df.where('id > 70').show(5, truncate=False) 

+--------------------------------------------------------+---+
|clause                                                  |id |
+--------------------------------------------------------+---+
|i answered                                              |71 |
|indeed i should have thought a little more              |72 |
|just a trifle more i fancy watson                       |73 |
|and in practice again i observe                         |74 |
|you did not tell me that you intended to go into harness|75 |
+--------------------------------------------------------+---+
only showing top 5 rows



### Split and explode a text column

In [10]:
df = spark.sql("SELECT * FROM df LIMIT 100")
df.count()

100

In [13]:
from pyspark.sql.functions import split, explode

split_df = df.select(split("clause", " ").alias("words"))
split_df.show(5, truncate=False)

exploded_df = split_df.select(explode("words").alias("word"))
exploded_df.show(10)

print("Number of rows: ", exploded_df.count())

+-----------------------------------------------+
|words                                          |
+-----------------------------------------------+
|[title]                                        |
|[the, adventures, of, sherlock, holmes, author]|
|[sir, arthur, conan, doyle, release, date]     |
|[march, 1999]                                  |
|[ebook, 1661]                                  |
+-----------------------------------------------+
only showing top 5 rows

+----------+
|      word|
+----------+
|     title|
|       the|
|adventures|
|        of|
|  sherlock|
|    holmes|
|    author|
|       sir|
|    arthur|
|     conan|
+----------+
only showing top 10 rows

Number of rows:  1279
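
As a rough mental model of the cell above, split keeps one row per clause and turns its text into a list, while explode emits one row per list element. The plain-Python sketch below mimics that on three made-up clauses. It is not Spark code, and the variable names are hypothetical.

```python
# Made-up sample clauses, modeled on the dataframe shown above.
clauses = ["title", "the adventures of sherlock holmes", "march 1999"]

# split: one list of words per clause (one row in, one row out).
words_per_clause = [clause.split(" ") for clause in clauses]

# explode: one output row per list element, so the row count grows.
exploded = [word for words in words_per_clause for word in words]

print(words_per_clause)
print("Number of rows:", len(exploded))
```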


## Moving window analysis

A sliding window that moves over the word sequence one position at a time creates 3-tuples of consecutive words.

Properly repartitioning data allows Spark to parallelize operations more efficiently.
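
To make the sliding-window idea concrete, here is a minimal plain-Python sketch. The helper name sliding_tuples is hypothetical, and the word list is a short sample rather than the full text.

```python
def sliding_tuples(words, n=3):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["the", "adventure", "of", "the", "copper", "beeches"]
triples = sliding_tuples(words)

for triple in triples:
    print(triple)
```

A sequence of k words yields k - n + 1 tuples, so the six words here give four 3-tuples.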


### Creating context window feature data

In [35]:
from pyspark.sql.functions import lit

df = spark.read.load("sherlock1.parquet")
df2 = df.filter("id > 95165")
df2 = df2.withColumn("part", lit(12))
df2 = df2.withColumn("title", lit("Sherlock Chapter XII"))
df2.createOrReplaceTempView("df2")
df2.show()

+---------+-----+----+--------------------+
|     word|   id|part|               title|
+---------+-----+----+--------------------+
|      xii|95166|  12|Sherlock Chapter XII|
|      the|95167|  12|Sherlock Chapter XII|
|adventure|95168|  12|Sherlock Chapter XII|
|       of|95169|  12|Sherlock Chapter XII|
|      the|95170|  12|Sherlock Chapter XII|
|   copper|95171|  12|Sherlock Chapter XII|
|  beeches|95172|  12|Sherlock Chapter XII|
|       to|95173|  12|Sherlock Chapter XII|
|      the|95174|  12|Sherlock Chapter XII|
|      man|95175|  12|Sherlock Chapter XII|
|      who|95176|  12|Sherlock Chapter XII|
|    loves|95177|  12|Sherlock Chapter XII|
|      art|95178|  12|Sherlock Chapter XII|
|      for|95179|  12|Sherlock Chapter XII|
|      its|95180|  12|Sherlock Chapter XII|
|      own|95181|  12|Sherlock Chapter XII|
|     sake|95182|  12|Sherlock Chapter XII|
| remarked|95183|  12|Sherlock Chapter XII|
| sherlock|95184|  12|Sherlock Chapter XII|
|   holmes|95185|  12|Sherlock Chapter XII|
+---------+-----+----+--------------------+

In [37]:
query = """SELECT part, LAG(word, 2) OVER(PARTITION BY part ORDER BY id) AS w1,
             LAG(word, 1) OVER(PARTITION BY part ORDER BY id) AS w2,
             word AS w3,
             LEAD(word, 1) OVER(PARTITION BY part ORDER BY id) AS w4,
             LEAD(word, 2) OVER(PARTITION BY part ORDER BY id) AS w5
             FROM df2"""

spark.sql(query).show(10)

+----+---------+---------+---------+---------+---------+
|part|       w1|       w2|       w3|       w4|       w5|
+----+---------+---------+---------+---------+---------+
|  12|     null|     null|      xii|      the|adventure|
|  12|     null|      xii|      the|adventure|       of|
|  12|      xii|      the|adventure|       of|      the|
|  12|      the|adventure|       of|      the|   copper|
|  12|adventure|       of|      the|   copper|  beeches|
|  12|       of|      the|   copper|  beeches|       to|
|  12|      the|   copper|  beeches|       to|      the|
|  12|   copper|  beeches|       to|      the|      man|
|  12|  beeches|       to|      the|      man|      who|
|  12|       to|      the|      man|      who|    loves|
+----+---------+---------+---------+---------+---------+
only showing top 10 rows
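
The null values in the first rows come from LAG looking past the start of the partition. LEAD does the same at the end. The sketch below reproduces this on seven words using Python's built-in sqlite3 module as a stand-in for Spark SQL. That substitution is an assumption made so the example runs without Spark, and it needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df2 (word TEXT, id INTEGER, part INTEGER)")

# A short sample with the same ids and words as the top of the output above.
words = ["xii", "the", "adventure", "of", "the", "copper", "beeches"]
conn.executemany(
    "INSERT INTO df2 VALUES (?, ?, ?)",
    [(w, 95166 + i, 12) for i, w in enumerate(words)],
)

query = """
SELECT LAG(word, 1)  OVER (PARTITION BY part ORDER BY id) AS prev_word,
       word,
       LEAD(word, 1) OVER (PARTITION BY part ORDER BY id) AS next_word
FROM df2
ORDER BY id
"""
rows = conn.execute(query).fetchall()

for row in rows:
    print(row)
```

The first row has no previous word and the last row has no next word, so those positions come back as None, the Python counterpart of null.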



### Repartitioning the data

The dataframe is currently in a single partition. Suppose you know that the upcoming processing steps will group the data by chapter. Processing is most efficient when all rows of a chapter sit in the same partition, because grouping then needs no shuffling of data from one machine to another. To arrange that, use repartition to split the dataframe into 12 partitions keyed on the chapter column (ideally one per chapter), then check the partition count with getNumPartitions.
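
Repartitioning on a column assigns each row to a partition by hashing the column value. The plain-Python sketch below illustrates the principle only: it uses CRC32 as a stand-in hash, which is not the hash function Spark uses, and the helper name partition_for is hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the key (CRC32 here, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Made-up sample rows: (word, chapter).
rows = [
    ("which", "Sherlock Chapter IV"),
    ("followed", "Sherlock Chapter IV"),
    ("one", "Sherlock Chapter I"),
    ("of", "Sherlock Chapter I"),
    ("lord", "Sherlock Chapter IX"),
]

partitions_per_chapter = defaultdict(set)
for word, chapter in rows:
    partitions_per_chapter[chapter].add(partition_for(chapter, 12))

for chapter, parts in partitions_per_chapter.items():
    print(chapter, sorted(parts))
```

Every row of a chapter lands in the same partition, which is what keeps later grouping by chapter local. The reverse is not guaranteed: two chapters can hash to the same partition, so asking for 12 partitions does not by itself ensure that each one holds exactly one chapter.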

In [64]:
from pyspark.sql.functions import when
df3 = df.withColumn("chapter", when(df.id < 9260, "Sherlock Chapter I")
                    .when(df.id < 18520, "Sherlock Chapter II")
                    .when(df.id < 27780, "Sherlock Chapter III")
                    .when(df.id < 37040, "Sherlock Chapter IV")
                    .when(df.id < 46300, "Sherlock Chapter V")
                    .when(df.id < 55560, "Sherlock Chapter VI")
                    .when(df.id < 64820, "Sherlock Chapter VII")
                    .when(df.id < 74080, "Sherlock Chapter VIII")
                    .when(df.id < 83340, "Sherlock Chapter IX")
                    .when(df.id < 92600, "Sherlock Chapter X")
                    .when(df.id < 101860, "Sherlock Chapter XI")
                    .otherwise("Sherlock Chapter XII"))

In [65]:
df3.select("chapter").distinct().sort("chapter").show(truncate=False)

+---------------------+
|chapter              |
+---------------------+
|Sherlock Chapter I   |
|Sherlock Chapter II  |
|Sherlock Chapter III |
|Sherlock Chapter IV  |
|Sherlock Chapter IX  |
|Sherlock Chapter V   |
|Sherlock Chapter VI  |
|Sherlock Chapter VII |
|Sherlock Chapter VIII|
|Sherlock Chapter X   |
|Sherlock Chapter XI  |
|Sherlock Chapter XII |
+---------------------+



In [67]:
repart_df = df3.repartition(12,"chapter")
repart_df.show()
repart_df.rdd.getNumPartitions()

+--------+-----+-------------------+
|    word|   id|            chapter|
+--------+-----+-------------------+
|   which|27780|Sherlock Chapter IV|
|followed|27781|Sherlock Chapter IV|
|     the|27782|Sherlock Chapter IV|
|coroner:|27783|Sherlock Chapter IV|
|    that|27784|Sherlock Chapter IV|
|      is|27785|Sherlock Chapter IV|
|     for|27786|Sherlock Chapter IV|
|     the|27787|Sherlock Chapter IV|
|   court|27788|Sherlock Chapter IV|
|      to|27789|Sherlock Chapter IV|
|  decide|27790|Sherlock Chapter IV|
|       i|27791|Sherlock Chapter IV|
|    need|27792|Sherlock Chapter IV|
|     not|27793|Sherlock Chapter IV|
|   point|27794|Sherlock Chapter IV|
|     out|27795|Sherlock Chapter IV|
|      to|27796|Sherlock Chapter IV|
|     you|27797|Sherlock Chapter IV|
|    that|27798|Sherlock Chapter IV|
|    your|27799|Sherlock Chapter IV|
+--------+-----+-------------------+
only showing top 20 rows



12

## Common word sequences


You can create training sets for predictive models that predict a word from the previous words in a sequence. Categorical data generally have no logical order; when they do, they are called ordinal data.

You can determine what words tend to appear together by sequence analysis.  
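
Both ideas, counting common sequences and building (previous words, next word) training pairs, can be sketched in plain Python before turning to SQL. The sentence below is made up, and the names ngrams and training_pairs are hypothetical.

```python
from collections import Counter

# Made-up sample text, already lower-cased and stripped of punctuation.
words = "i think that it is i think that he was".split(" ")

# Slide a window of n words over the sequence.
n = 3
ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Sequence analysis: which n-grams occur most often?
counts = Counter(ngrams)
top = counts.most_common(1)

# Training pairs: the first n-1 words are the context, the last word is the target.
training_pairs = [(gram[:-1], gram[-1]) for gram in ngrams]

print(top)
print(training_pairs[0])
```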

In [82]:
query = """SELECT w1, w2, w3, w4, w5, COUNT(*) AS count FROM(SELECT word AS w1,
                                                LEAD(word, 1) OVER(PARTITION BY chapter ORDER BY id) AS w2,
                                                LEAD(word, 2) OVER(PARTITION BY chapter ORDER BY id) AS w3,
                                                LEAD(word, 3) OVER(PARTITION BY chapter ORDER BY id) AS w4,
                                                LEAD(word, 4) OVER(PARTITION BY chapter ORDER BY id) AS w5 FROM df3
                                                WHERE id < 41729)
            GROUP BY w1, w2, w3, w4, w5
            ORDER BY count DESC
            LIMIT 10"""
spark.sql(query).show()

+-----+----------+------+--------+------+-----+
|   w1|        w2|    w3|      w4|    w5|count|
+-----+----------+------+--------+------+-----+
|   in|       the|  case|      of|   the|    4|
|  the|adventures|    of|sherlock|holmes|    4|
|  the|    church|    of|      st|monica|    3|
| what|        do|   you|    make|    of|    3|
|  the|       man|   who| entered|   was|    3|
|dying| reference|    to|       a|   rat|    3|
|    i|        am|afraid|    that|     i|    3|
|    i|     think|  that|      it|    is|    3|
|   in|       his| chair|    with|   his|    3|
|    i|      rang|   the|    bell|   and|    3|
+-----+----------+------+--------+------+-----+



### Unique 5-tuples in sorted order

In [83]:
query = """SELECT DISTINCT w1, w2, w3, w4, w5 FROM(SELECT word AS w1,
                                              LEAD(word, 1) OVER(PARTITION BY chapter ORDER BY id) AS w2,
                                              LEAD(word, 2) OVER(PARTITION BY chapter ORDER BY id) AS w3,
                                              LEAD(word, 3) OVER(PARTITION BY chapter ORDER BY id) AS w4,
                                              LEAD(word, 4) OVER(PARTITION BY chapter ORDER BY id) AS w5 FROM df3
                                              WHERE id < 34389)
            ORDER BY w1 DESC, w2 DESC, w3 DESC, w4 DESC, w5 DESC
            LIMIT 10"""
spark.sql(query).show()

+----------+------+---------+------+-----+
|        w1|    w2|       w3|    w4|   w5|
+----------+------+---------+------+-----+
|   zealand| stock|   paying|     4|  1/4|
|   youwill|   see|     your|   pal|again|
|   youwill|    do|     come|  come| what|
|     youth|though|   comely|    to| look|
|     youth|    in|       an|ulster|  who|
|     youth|either|       it|     s| hard|
|     youth| asked| sherlock|holmes|  his|
|yourselves|  that|       my|  hair|   is|
|yourselves|behind|    those|  then| when|
|  yourself|  your|household|   and|  the|
+----------+------+---------+------+-----+



### Most frequent 3-tuples per chapter


In [99]:
subquery = """
SELECT chapter, w1, w2, w3, COUNT(*) as count
FROM
(
    SELECT
    chapter,
    word AS w1,
    LEAD(word, 1) OVER(PARTITION BY chapter ORDER BY id ) AS w2,
    LEAD(word, 2) OVER(PARTITION BY chapter ORDER BY id ) AS w3
    FROM df3
)
GROUP BY chapter, w1, w2, w3
ORDER BY chapter, count DESC
"""

df4 = spark.sql(subquery)
df4.createOrReplaceTempView("df4")
query = ("""SELECT chapter, w1, w2, w3, count FROM(SELECT chapter, 
                                                ROW_NUMBER() OVER(PARTITION BY chapter ORDER BY count DESC) AS row,
                                                w1, w2, w3, count
                                                FROM df4)
                WHERE row = 1
                ORDER BY chapter ASC""")
spark.sql(query).show()

+--------------------+-------+------+-----+-----+
|             chapter|     w1|    w2|   w3|count|
+--------------------+-------+------+-----+-----+
|  Sherlock Chapter I|    one|    of|  the|    7|
| Sherlock Chapter II|    one|    of|  the|    7|
|Sherlock Chapter III|     mr|hosmer|angel|   13|
| Sherlock Chapter IV|   that|    he|  was|    7|
| Sherlock Chapter IX|   lord|    st|simon|   19|
|  Sherlock Chapter V|    one|    of|  the|    7|
| Sherlock Chapter VI|neville|    st|clair|    9|
|Sherlock Chapter VII|     at|   the| time|    7|
|Sherlock Chapter ...|   that|    it|  was|    9|
|  Sherlock Chapter X|   lord|    st|simon|    9|
| Sherlock Chapter XI|     to|    be|    a|    9|
|Sherlock Chapter XII|    one|    of|  the|  327|
+--------------------+-------+------+-----+-----+
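
The query above follows a common pattern: number the rows within each group with ROW_NUMBER, then keep the first row. The sketch below shows the pattern on four made-up rows, using Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption, and it needs SQLite 3.25 or later). The column is named cnt rather than count, and the row number rn, to avoid clashing with SQL keywords and function names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df4 (chapter TEXT, w1 TEXT, w2 TEXT, w3 TEXT, cnt INTEGER)"
)
# Made-up 3-tuple counts for two hypothetical chapters, A and B.
conn.executemany(
    "INSERT INTO df4 VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "one", "of", "the", 7),
        ("A", "that", "he", "was", 5),
        ("B", "lord", "st", "simon", 19),
        ("B", "one", "of", "the", 4),
    ],
)

query = """
SELECT chapter, w1, w2, w3, cnt
FROM (
    SELECT chapter, w1, w2, w3, cnt,
           ROW_NUMBER() OVER (
               PARTITION BY chapter
               ORDER BY cnt DESC, w1, w2, w3
           ) AS rn
    FROM df4
)
WHERE rn = 1
ORDER BY chapter
"""
top_per_chapter = conn.execute(query).fetchall()

for row in top_per_chapter:
    print(row)
```

Ordering by the words after the count is a design choice in this sketch: when two tuples tie on count, ROW_NUMBER would otherwise pick between them arbitrarily, and the extra sort keys make the result repeatable.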

