# PySpark: a brief analysis of the most common words in Dracula, by Bram Stoker

A landmark in Gothic literature, the iconic novel Dracula, written by Bram Stoker in 1897, stirs the emotions of people across the world. Today, to introduce Spark's new concepts and features, we will develop a brief notebook to analyze the most common words in this classic book 🧛🏼‍♂️.

To do this, we will write a notebook in [Google Colab](https://colab.research.google.com/), a cloud service built by Google to encourage machine learning and artificial intelligence research.

This notebook is also available on [Dev Community](https://dev.to/geazi_anc/pyspark-a-brief-analysis-to-the-most-common-words-in-dracula-by-bram-stoker-1ij4).

The novel was obtained from [Project Gutenberg](https://www.gutenberg.org/), a digital library that collects public domain books from around the world.


## Before getting started

Before we start, we need to install the [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) library.

PySpark is the official Python API for Apache Spark. We will develop our data analysis using it 🎲.


In [1]:
!pip install pyspark


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 3.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 KB 13.2 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=575b3bdf2a119f7c6ff930268c8320870028723defca11d4ef926d9ab968d1b8
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e40722170ceba231d7598
Successfully built pyspark
Installing collected packages: py4j, pyspa

## Step one: running Apache Spark

After the installation is complete, we need to run Apache Spark. Let's do it!


In [3]:
from pyspark.sql import SparkSession


spark = (SparkSession.builder
         .appName("The top most common words in Dracula, by Bram Stoker")
         .getOrCreate()
)


## Step two: downloading and reading

In this step, we will download the novel from Project Gutenberg and, after that, load it using PySpark.

We will use the **wget** tool for this, passing it the book's URL and saving the file in the local directory as **Dracula - Bram Stoker.txt**.


In [4]:
!wget https://www.gutenberg.org/cache/epub/345/pg345.txt -O "Dracula - Bram Stoker.txt"


--2023-01-11 01:55:45--  https://www.gutenberg.org/cache/epub/345/pg345.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 881220 (861K) [text/plain]
Saving to: ‘Dracula - Bram Stoker.txt’


2023-01-11 01:55:45 (8.90 MB/s) - ‘Dracula - Bram Stoker.txt’ saved [881220/881220]
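If **wget** is not available in your environment, the same download can be sketched in plain Python with the standard library. The `download` helper below is a hypothetical name introduced for illustration and is not part of the original notebook; it also assumes the server accepts requests from Python's default HTTP client.

```python
import shutil
from urllib.request import urlopen


def download(url: str, destination: str) -> str:
    """Stream the resource at `url` into the local file `destination`."""
    # Hypothetical helper, not part of the original notebook.
    with urlopen(url) as response, open(destination, "wb") as out_file:
        # Copy in chunks so the whole file is never held in memory at once.
        shutil.copyfileobj(response, out_file)
    return destination
```

Calling `download("https://www.gutenberg.org/cache/epub/345/pg345.txt", "Dracula - Bram Stoker.txt")` should leave the same local file that the wget cell produces.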


## Step three: downloading the stopwords

In this section, we will download a list of English stopwords. Stopwords typically include prepositions, particles, interjections, conjunctions, adverbs, pronouns, introductory words, the digits 0 to 9, other frequently used function words, symbols, and punctuation. More recently, such lists have been extended with character sequences that are common on the Internet, such as www, com, and http.

This list was obtained from [CountWordsFree](https://countwordsfree.com/stopwords), a website that provides stopword lists for many languages.

Let's get to work!


In [6]:
!wget https://countwordsfree.com/stopwords/english/txt -O "stop_words_english.txt"


--2023-01-11 01:58:45--  https://countwordsfree.com/stopwords/english/txt
Resolving countwordsfree.com (countwordsfree.com)... 212.83.51.246
Connecting to countwordsfree.com (countwordsfree.com)|212.83.51.246|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6343 (6.2K) [text/plain]
Saving to: ‘stop_words_english.txt’


2023-01-11 01:58:45 (819 MB/s) - ‘stop_words_english.txt’ saved [6343/6343]



After that, let’s load the book using Spark. Create a new code cell and add the following code block:


In [8]:
book = spark.read.text("Dracula - Bram Stoker.txt")


And let’s load the stopwords as well. They will be stored in a list, in the **stopwords** variable.


In [13]:

with open("stop_words_english.txt", "r") as f:
    text = f.read()
    stopwords = text.splitlines()

len(stopwords), stopwords[:15]


(851,
 ['able',
  'about',
  'above',
  'abroad',
  'according',
  'accordingly',
  'across',
  'actually',
  'adj',
  'after',
  'afterwards',
  'again',
  'against',
  'ago',
  'ahead'])
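A plain list is all we need here, but if you also want to check words against the stopwords in ordinary Python, a set is the usual choice. The sketch below is a hypothetical variant of the loading cell: the `load_stopwords` name, and the decision to strip whitespace, lowercase each entry, and skip blank lines, are assumptions and not something the original notebook does.

```python
from pathlib import Path


def load_stopwords(path: str) -> set[str]:
    """Read one stopword per line and return them as a set."""
    # Hypothetical helper: normalizes each line and ignores blank ones.
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}
```

If you go this way, converting the result back with `list(...)` keeps it interchangeable with the **stopwords** list used in the rest of the notebook.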

## Step four: extracting words

Once the data is loaded, we need to extract the words into a DataFrame column.

To do this, we apply the **split** function to each line, splitting it on the blank spaces between words. The result is a list of words for each line.


In [14]:
from pyspark.sql.functions import split


lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)


+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[most, other, par...|
|[whatsoever., You...|
+--------------------+
only showing top 5 rows
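To see why some rows come out looking empty, it can help to look at a plain-Python analogue. The snippet below is only an illustration on made-up sample lines; Python's `str.split(" ")` is used as a stand-in and is assumed to behave like Spark's **split** for this simple single-space pattern.

```python
sample_lines = [
    "The Project Gutenberg eBook of Dracula, by Bram Stoker",
    "",
    "most  other parts",  # two consecutive spaces
]

# Splitting on a single space keeps an empty string wherever nothing sits
# between two separators, and also when the line itself is empty.
tokens_per_line = [line.split(" ") for line in sample_lines]

for tokens in tokens_per_line:
    print(tokens)
```

The empty line turns into a list holding a single empty string, which is where the blank entries in the following steps come from.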



## Step five: exploding the lists of words

Now, let’s turn each list of words into rows of a DataFrame column, one word per row, using the **explode** function.


In [15]:
from pyspark.sql.functions import explode, col


words = lines.select(explode(col("line")).alias("word"))
words.show(15)


+---------+
|     word|
+---------+
|      The|
|  Project|
|Gutenberg|
|    eBook|
|       of|
| Dracula,|
|       by|
|     Bram|
|   Stoker|
|         |
|     This|
|    eBook|
|       is|
|      for|
|      the|
+---------+
only showing top 15 rows
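If **explode** feels abstract, flattening a list of lists in plain Python is a close mental model. This is only an analogue on made-up data, not Spark code: each inner element becomes its own item, just as each array element becomes its own row.

```python
from itertools import chain

# Made-up sample: one inner list per line of the book.
lines = [
    ["The", "Project", "Gutenberg"],
    [""],
    ["This", "eBook", "is"],
]

# Flatten the nested lists so that every word stands on its own.
words = list(chain.from_iterable(lines))
print(words)
```

Note that the empty string from the blank line survives the flattening, just like the blank row in the output above.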



## Step six: converting words to lowercase

This is a simple step. We don't want the same word to be counted as different words because of capital letters, so we convert all words to lowercase using the **lower** function.


In [17]:
from pyspark.sql.functions import lower


words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()


+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|  dracula,|
|        by|
|      bram|
|    stoker|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
|  anywhere|
|        in|
+----------+
only showing top 20 rows



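## Step seven: removing punctuation

So that the same word is not counted as different words because of punctuation attached to it, we need to remove the punctuation.

We'll do this using the **regexp_extract** function, which returns the part of a string that matches a regex. With the pattern `[a-z]+` and group index 0, it keeps the first run of lowercase letters in each word and returns an empty string when nothing matches.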


In [18]:
from pyspark.sql.functions import regexp_extract


words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)

words_clean.show()


+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
+---------+
only showing top 20 rows
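To get a feel for what the `[a-z]+` pattern keeps and drops, here is a small plain-Python sketch using the `re` module as a stand-in for **regexp_extract**. The `first_letter_run` helper and the sample tokens are made up for illustration; the sketch assumes the two regex engines agree on this simple pattern.

```python
import re

LETTERS = re.compile(r"[a-z]+")


def first_letter_run(token: str) -> str:
    """Return the first run of lowercase letters in `token`, or "" if none."""
    # Hypothetical helper mirroring regexp_extract(col, "[a-z]+", 0).
    match = LETTERS.search(token)
    return match.group(0) if match else ""


samples = ["dracula,", "stoker", "", "1897", "don't", "(ebook)"]
cleaned = {token: first_letter_run(token) for token in samples}
print(cleaned)
```

Two side effects are worth knowing about: tokens without any letters, such as `1897`, become empty strings, and only the first run of letters is kept, so `don't` is reduced to `don`.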



## Step eight: removing empty values

However, as you can see, there are still empty values, that is, blank strings.

We need to remove them so that they are not included in the analysis.


In [20]:
words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()


+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|  dracula|
|       by|
|     bram|
|   stoker|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
|       in|
|      the|
+---------+
only showing top 20 rows



## Step nine: removing stopwords

We are almost there! The last step is to remove the stopwords so that, again, these words are not analyzed.


In [27]:
words_without_stopwords = words_nonull.filter(
    ~words_nonull.word.isin(stopwords))


words_count_before_removing = words_nonull.count()
words_count_after_removing = words_without_stopwords.count()

words_count_before_removing, words_count_after_removing


(163399, 50222)
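To put these two numbers in perspective, we can work out how much of the text the stopwords accounted for. The quick calculation below uses only the two counts shown above; the variable names are chosen for this example.

```python
before_removing = 163_399  # words left after dropping empty strings
after_removing = 50_222    # words left after dropping stopwords

removed = before_removing - after_removing
removed_share = removed / before_removing

print(f"{removed} words removed ({removed_share:.1%} of the total)")
```

In other words, 113177 words, roughly 69.3% of the total, were stopwords.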

## Step ten: analyzing the most common words in Dracula, finally!

And, finally, our data is completely clean. Now we can analyze the most common words in our book.

First, we’ll group the words, and then use an aggregate function to count them.


In [29]:
words_count = (words_without_stopwords.groupby("word")
               .count()
               .orderBy("count", ascending=False)
               )


Then, show the top 20 most common words. This number can be changed through the **rank** variable.


In [30]:
rank = 20
words_count.show(rank)


+--------+-----+
|    word|count|
+--------+-----+
|    time|  381|
| helsing|  323|
|     van|  322|
|    lucy|  297|
|    good|  256|
|     man|  255|
|    mina|  240|
|    dear|  224|
|   night|  224|
|    hand|  209|
|    room|  207|
|    face|  206|
|jonathan|  206|
|   count|  197|
|    door|  197|
|   sleep|  192|
|    poor|  191|
|    eyes|  188|
|    work|  188|
|      dr|  187|
+--------+-----+
only showing top 20 rows
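For readers who think more easily in plain Python, the group, count, and sort steps can be mirrored with `collections.Counter`. This is only an analogue on a made-up sample of cleaned words, not a replacement for the Spark code.

```python
from collections import Counter

# Made-up sample of already cleaned words, for illustration only.
cleaned_words = ["lucy", "night", "van", "lucy", "helsing", "van", "lucy"]

# Counter groups equal words and counts them, mirroring groupby + count.
counts = Counter(cleaned_words)

rank = 2
# most_common sorts by descending count, like orderBy("count", ascending=False).
top_words = counts.most_common(rank)
print(top_words)
```

A cross-check like this is handy on a small sample, while Spark remains the better fit once the text no longer fits comfortably on one machine.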



## Conclusion

That’s all for now, folks! In this article, we analyzed the most common words in Dracula, written by Bram Stoker. To do this, we cleaned the words by removing punctuation, converting uppercase letters to lowercase, and removing stopwords.

I hope you enjoyed it. Keep those stakes sharp, watch out for the shadows that walk at night, and see you next time 🧛🏼‍♂️🍷.


## Bibliography

RIOUX, Jonathan. [Data Analysis with Python and PySpark](https://www.amazon.com.br/Analysis-Python-PySpark-Jonathan-Rioux/dp/1617297208).

STOKER, Bram. [Dracula](https://www.gutenberg.org/cache/epub/345/pg345.txt).
