# NLP using PySpark

The goal of this notebook is to perform basic NLP tasks using PySpark. We perform these tasks on a few books. Although these books are small and do not qualify as big data, the same procedure can be applied to much larger data sets such as forum data, log data, etc.

The ``flatMap`` transformation is very important in text processing because it maps each input line to many output elements (a one-to-many mapping), for example one line of text to the individual words it contains.
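
To make the one-to-many behaviour concrete, here is a minimal pure-Python sketch of what `map` and `flatMap` each produce for two lines of text. It does not use Spark: `tokenize`, `mapped`, and `flat_mapped` are illustrative names, and the list operations only mimic the semantics of `rdd.map(f)` and `rdd.flatMap(f)`.

```python
from itertools import chain

def tokenize(line):
    """Split a line into lowercase words (hypothetical helper for this sketch)."""
    return line.lower().split()

lines = ["It is a truth", "universally acknowledged"]

# map: exactly one output element per input line (here, one list of words per line)
mapped = [tokenize(line) for line in lines]

# flatMap: the per-line lists are flattened into a single sequence of words
flat_mapped = list(chain.from_iterable(tokenize(line) for line in lines))

print(mapped)       # nested: one list per line
print(flat_mapped)  # flat: one entry per word
```

The flat form is what word counting needs, because each word can then be paired with a count of 1 independently of the line it came from.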

#### Problem 1:
List the 20 most frequent words in the book Pride and Prejudice, ranked by frequency, with ties in frequency broken alphabetically.
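
The solution ranks words by sorting twice: first alphabetically, then by count, relying on Python's stable sort to keep the alphabetical order among words with equal counts. A minimal sketch of that two-pass ordering on made-up data (the `counts` list is invented for illustration and does not come from the book):

```python
# made-up (word, count) pairs, not real counts from the book
counts = [("she", 3), ("and", 5), ("her", 3), ("the", 5), ("a", 1)]

# pass 1: alphabetical order
by_word = sorted(counts, key=lambda wc: wc[0])

# pass 2: by count, descending; the sort is stable, so words with equal
# counts stay in the alphabetical order produced by pass 1
ranked = sorted(by_word, key=lambda wc: wc[1], reverse=True)

# the same ordering in a single pass, using a composite key
ranked_single_pass = sorted(counts, key=lambda wc: (-wc[1], wc[0]))

print(ranked)
```

Both variants give the same result here; the single-pass version simply encodes the tie-breaking rule in the key.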

In [10]:
%%file codes/word_count.py


# locating pyspark
import findspark
findspark.init()  # adds pyspark to sys.path
# importing the required modules
from pyspark import SparkConf, SparkContext
# importing regex to perform tokenization
import re
# setting up the Spark environment (local mode, 4 worker threads)
conf = SparkConf().setMaster('local[4]').setAppName('WordCounter')
sc = SparkContext(conf=conf)
# importing the file
lines = sc.textFile('file:///Users/Amin/Dropbox/Career Deveoment/Data Science/PySpark/NLP using Pyspark/data/raw/pride_prejudice.txt')
# defining the text parser
def parser(line):
    '''
    Parses one line of the book.
    Input is a line of text; output is the list of words in that line.'''
    line = line.strip().lower()  # the application is not case sensitive
    # split on runs of non-letters (\W+ is similar but keeps digits and underscores)
    punctuations = re.compile(r'[^a-z]+')
    tokens = punctuations.split(line)
    tokens = [word for word in tokens if word]  # drop empty strings left at the edges
    return tokens

def full_line(line):
    '''
    Checks whether the line has any text.'''
    return bool(line)  # always returns an explicit True or False

# perform mapping
filtered_lines = lines.filter(full_line)
tokens_one = filtered_lines.flatMap(parser).map(lambda word: (word, 1))  # tokenization and counting
# perform reduction
tokens_count = tokens_one.reduceByKey(lambda x, y: x + y)
# collect the results on the driver
results = tokens_count.collect()
# output the 20 most frequent words
sorted_alpha = sorted(results, key=lambda x: x[0])  # sort alphabetically first
# Python's sort is stable, so words with equal counts keep their alphabetical order
sorted_results = sorted(sorted_alpha, reverse=True, key=lambda x: x[1])  # then sort by frequency
top20 = sorted_results[:20]
for i, wc in enumerate(top20):
    print('Word-{}: {} \t {} times.'.format(i+1, wc[0], wc[1]))

Overwriting codes/word_count.py


In [11]:
!python codes/word_count.py

Word-1: the 	 4507 times.
Word-2: to 	 4243 times.
Word-3: of 	 3730 times.
Word-4: and 	 3658 times.
Word-5: her 	 2225 times.
Word-6: i 	 2070 times.
Word-7: a 	 2011 times.
Word-8: in 	 1937 times.
Word-9: was 	 1847 times.
Word-10: she 	 1710 times.
Word-11: that 	 1594 times.
Word-12: it 	 1550 times.
Word-13: not 	 1450 times.
Word-14: you 	 1428 times.
Word-15: he 	 1339 times.
Word-16: his 	 1271 times.
Word-17: be 	 1260 times.
Word-18: as 	 1192 times.
Word-19: had 	 1177 times.
Word-20: with 	 1100 times.


Let's apply the same logic, but this time perform the sort in PySpark rather than on the driver, so that the code stays scalable:

In [1]:
%%file codes/scalable_wc.py


# locating pyspark
import findspark
findspark.init()  # adds pyspark to sys.path
# importing the required modules
from pyspark import SparkConf, SparkContext
# importing regex to perform tokenization
import re
# setting up the Spark environment (local mode, 4 worker threads)
conf = SparkConf().setMaster('local[4]').setAppName('WordCounter')
sc = SparkContext(conf=conf)
# importing the file
lines = sc.textFile('file:///Users/Amin/Dropbox/Career Deveoment/Data Science/PySpark/NLP using Pyspark/data/raw/pride_prejudice.txt')
# defining the text parser
def parser(line):
    '''
    Parses one line of the book.
    Input is a line of text; output is the list of words in that line.'''
    line = line.strip().lower()  # the application is not case sensitive
    # split on runs of non-letters (\W+ is similar but keeps digits and underscores)
    punctuations = re.compile(r'[^a-z]+')
    tokens = punctuations.split(line)
    tokens = [word for word in tokens if word]  # drop empty strings left at the edges
    return tokens

def full_line(line):
    '''
    Checks whether the line has any text.'''
    return bool(line)  # always returns an explicit True or False

# perform mapping
filtered_lines = lines.filter(full_line)
tokens_one = filtered_lines.flatMap(parser).map(lambda word: (word, 1))  # tokenization and counting
# perform reduction
tokens_count = tokens_one.reduceByKey(lambda x, y: x + y)
# sort by count inside Spark: swap to (count, word), sort by key, then swap back.
# map passes each pair as a single tuple, so the lambdas take one argument.
sorted_flipped = tokens_count.map(lambda wc: (wc[1], wc[0])).sortByKey(ascending=False)
sorted_correct = sorted_flipped.map(lambda cw: (cw[1], cw[0]))
# collect the results on the driver
sorted_results = sorted_correct.collect()

for res in sorted_results:
    word = res[0]
    count = res[1]
    print(word + ":\t\t", count)

Writing codes/scalable_wc.py


In [None]:
!python codes/scalable_wc.py
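
Since Problem 1 only asks for the top 20 words, sorting the whole data set is more work than strictly needed. PySpark's `RDD.takeOrdered(num, key)` returns the first `num` elements under a given ordering. The sketch below imitates the underlying idea in plain Python: each partition keeps only its own best `n` entries, and the small partial results are then merged. The `partitions` data and the `top_n` helper are hypothetical, and the code illustrates the idea rather than Spark's actual implementation.

```python
import heapq

def top_n(partitions, n):
    """Return the n most frequent (word, count) pairs, ties broken alphabetically.

    Assumes each word occurs in only one partition, as is the case
    after counts have been combined per word.
    """
    def order(wc):
        # higher count first, then alphabetical
        return (-wc[1], wc[0])

    # step 1: each partition keeps only its own n best entries
    partials = [heapq.nsmallest(n, part, key=order) for part in partitions]
    # step 2: merge the small partial results and pick the overall n best
    merged = [wc for part in partials for wc in part]
    return heapq.nsmallest(n, merged, key=order)

# made-up counts split across two partitions
partitions = [
    [("the", 9), ("pride", 2), ("of", 7)],
    [("to", 8), ("and", 7), ("darcy", 3)],
]
print(top_n(partitions, 3))
```

With the RDD from the script above, the corresponding call would be along the lines of `tokens_count.takeOrdered(20, key=lambda wc: (-wc[1], wc[0]))`, which avoids the swap, sort, and swap-back steps when only the top entries are needed.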