## Spark + NLTK Exercise
This exercise uses NLTK to build familiarity with PySpark, work through the basics of the Spark RDD API, and practice applying NLTK. It shows how a Python script that performs part-of-speech tagging on a large dataset can be translated into a PySpark-based approach.

`%%time` cell magics are included, but timings will vary with hardware and operating system configuration.
_________________________

Part-of-speech tagging is a common entry point into NLP; here it is applied to a collection of New York Times articles. The original script was written by Luke Petschauer and a forked version is available at https://github.com/umsi-data-science/NP_chunking_with_nltk/blob/master/NP_chunking_with_the_NLTK.ipynb. The complete analysis should take about 10 minutes to run.

First, let's load libraries.

In [None]:
# !pip install nltk

In [None]:
# !apt-get install openjdk-8-jdk-headless -qq > /dev/null
# !update-java-alternatives -l
# !update-java-alternatives -s java-1.8.0-openjdk-amd64 > /dev/null 2>&1
# !pip install pyspark

In [1]:
import nltk
nltk.download('book')
import re
import pprint
from nltk import Tree

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /home/jovyan/nlt

This is the original non-Spark Python script, run on a small snippet of practice text for brevity while working through the code.

In [2]:
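# This is the original (non-Spark) script

# Note: in an NLTK tag pattern, <NN*> reads as "N" followed by zero or more "N",
# so it appears to match NN but not NNS or NNP (the output below has no plural
# or proper nouns). <NN.*> would match every noun tag.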
patterns = """
    NP: {<JJ>*<NN*>+}
    {<JJ>*<NN*><CC>*<NN*>+}
    """

NPChunker = nltk.RegexpParser(patterns)

def prepare_text(text):
    """Split text into sentences, then tokenize, POS-tag, and chunk each one."""
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences


def parsed_text_to_NP(sentences):
    """Collect every NP chunk as a space-joined string."""
    nps = []
    for sent in sentences:
        # Re-applies the chunker to the already chunked sentence (kept from the original script)
        tree = NPChunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                nps.append(' '.join(word for word, tag in subtree.leaves()))
    return nps


def sent_parse(text):
    """Return the list of noun phrases found in text."""
    sentences = prepare_text(str(text))
    nps = parsed_text_to_NP(sentences)
    return nps


text_to_be_analyzed = """WASHINGTON - Stellar pitching kept the Mets afloat in the first half of last season despite their offensive woes. But they cannot produce an encore of their pennant-winning season if their lineup keeps floundering while their pitching is nicked, bruised and stretched thin.
"We were going to ride our pitching," Manager Terry Collins said before Wednesday’s game. "But we're not riding it right now. We've got as many problems with our pitching as we do anything."
Wednesday's 4-2 loss to the Washington Nationals was cruel for the already-limping Mets. Pitching in Steven Matz's place, the spot starter Logan Verrett allowed two runs over five innings. But even that was too large a deficit for the Mets' lineup to overcome against Max Scherzer, the Nationals' starter.
"We're not even giving ourselves chances," Collins said, adding later, "We just can’t give our pitchers any room to work."
The Mets did not score until the ninth inning, when a last-gasp two-run homer by James Loney off Nationals reliever Shawn Kelley snapped a streak of 23 scoreless innings for the team."""


nps = sent_parse(text_to_be_analyzed)
print(nps)

['Stellar pitching', 'afloat', 'first half', 'last season', 'encore', 'pennant-winning season', 'lineup', 'pitching', 'thin', 'pitching', 's game', 'pitching', 'anything', '4-2 loss', 'place', 'spot starter', 'deficit', 'lineup', 'starter', 'room', 'ninth inning', 'last-gasp two-run homer', 'reliever', 'streak', 'team']


#### Spark Conversion
Now, let's start a Spark session and convert this script to an RDD-based pipeline.

In [9]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('test1') \
    .getOrCreate() 

sc = spark.sparkContext

# Alternative: the same session with an explicit in-memory catalog
# from pyspark.sql import SparkSession
# spark = SparkSession \
#     .builder.master("local[*]") \
#     .appName('test1') \
#     .config("spark.sql.catalogImplementation","in-memory") \
#     .getOrCreate()

# sc = spark.sparkContext

In [10]:
text = sc.textFile('data/nytimes/nytimes_news_articles.txt')
# show the first two lines of the file
text.take(2)

['URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html',
 '']

In [11]:
TOKEN_RE = re.compile(r"\b[\w']+\b")
def pos_tag_counter(line):
    toks = nltk.regexp_tokenize(line, TOKEN_RE)
    postoks = nltk.tag.pos_tag(toks)

    return postoks
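
To see what `TOKEN_RE` keeps before any tagging happens, the pattern can be tried directly with the standard-library `re` module. This is a minimal sketch that assumes `nltk.regexp_tokenize` returns the pattern's non-overlapping matches (its default, non-gap mode); the sample sentence is made up for illustration.

```python
import re

# Same pattern as in the cell above.
TOKEN_RE = re.compile(r"\b[\w']+\b")

# Illustrative sentence, not taken from the dataset.
sample = "But we're not riding it right now. The Mets' lineup lost 4-2."

# findall returns every non-overlapping match, left to right.
tokens = TOKEN_RE.findall(sample)
print(tokens)
# ['But', "we're", 'not', 'riding', 'it', 'right', 'now',
#  'The', 'Mets', 'lineup', 'lost', '4', '2']
```

Internal apostrophes survive (`we're`), while trailing apostrophes, hyphens, and sentence punctuation are dropped, so a score such as `4-2` becomes two separate tokens.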

#### Create RDD Pipeline that does the following:

1. filters out blank lines
2. filters out lines starting with 'URL'
3. applies the `pos_tag_counter` function to each line and, using flatMap, flattens the results into a single collection of (word, tag) pairs
4. maps each pair to its part-of-speech tag (the second element of each pair returned by `pos_tag_counter`)
5. converts each tag to a pair RDD element with the tag as key and a value of 1
6. reduces the resulting RDD by key, adding up all the 1s (like the lecture and lab examples)
7. sorts the resulting list by the counts, in descending order
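
Before running this on the full dataset, the seven steps can be traced in plain Python on a few lines. The sketch below is only an illustration of the logic, not the Spark pipeline itself: `TOY_TAGS` and `toy_pos_tag` are hypothetical stand-ins for the NLTK tagger, and the sample lines (including the example.com URL) are invented.

```python
from collections import Counter

# Hypothetical stand-in for pos_tag_counter: a tiny lookup table instead of
# the NLTK tagger, so the sketch runs without NLTK data or a Spark session.
TOY_TAGS = {"the": "DT", "a": "DT", "mets": "NNP", "lost": "VBD",
            "game": "NN", "close": "JJ"}


def toy_pos_tag(line):
    """Return (word, tag) pairs for a line; unknown words default to 'NN'."""
    return [(word, TOY_TAGS.get(word.lower(), "NN")) for word in line.split()]


lines = [
    "URL: http://example.com/article.html",  # invented URL line
    "",
    "The Mets lost the game",
    "a close game",
]

non_blank = [line for line in lines if len(line) > 0]                   # step 1
non_url = [line for line in non_blank if not line.startswith("URL")]    # step 2
tagged = [pair for line in non_url for pair in toy_pos_tag(line)]       # step 3 (flatMap)
tags = [tag for _, tag in tagged]                                       # step 4
pairs = [(tag, 1) for tag in tags]                                      # step 5

counts = Counter()                                                      # step 6 (reduceByKey)
for tag, one in pairs:
    counts[tag] += one

ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)     # step 7
print(ranked)
```

Each list comprehension corresponds to one RDD transformation; in Spark the same steps are evaluated lazily and in parallel across partitions, and nothing runs until an action such as `collect()` is called.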

In [14]:
# Wrapping the chain in parentheses allows a comment on every step
# (a comment after a trailing backslash would be a syntax error).
pos_tag_counts = (
    text.filter(lambda line: len(line) > 0)                # 1. drop blank lines
    .filter(lambda line: re.findall('^(?!URL).*', line))   # 2. drop lines starting with 'URL'
    .flatMap(pos_tag_counter)                              # 3. one (word, tag) pair per token
    .map(lambda word: word[1])                             # 4. keep only the part-of-speech tag
    .map(lambda tag: (tag, 1))                             # 5. pair each tag with a count of 1
    .reduceByKey(lambda x, y: x + y)                       # 6. sum the 1s per tag
    .sortBy(lambda x: x[1], ascending=False)               # 7. sort by count, descending
)

pos_tag_counts.collect()

[('NN', 1126515),
 ('IN', 928916),
 ('NNP', 853093),
 ('DT', 761492),
 ('JJ', 498482),
 ('NNS', 437116),
 ('VBD', 379509),
 ('PRP', 282603),
 ('RB', 271053),
 ('CC', 231491),
 ('VB', 223717),
 ('CD', 187602),
 ('TO', 187005),
 ('VBN', 174980),
 ('VBZ', 169149),
 ('VBG', 163653),
 ('VBP', 143368),
 ('PRP$', 107984),
 ('MD', 67185),
 ('WDT', 44582),
 ('WP', 42406),
 ('WRB', 33160),
 ('RP', 29345),
 ('JJR', 24746),
 ('NNPS', 18870),
 ('JJS', 16425),
 ('EX', 12397),
 ('RBR', 12286),
 ('RBS', 5146),
 ('PDT', 3784),
 ('FW', 2793),
 ('WP$', 2329),
 ('POS', 493),
 ('UH', 325),
 ('$', 219),
 ('LS', 5),
 ("''", 2)]

In [13]:
# Alternative way to do it, combining steps 4 & 5

# pos_tag_counts = text.filter(lambda line: len(line) > 0) \
#     .filter(lambda line: re.findall('^(?!URL).*', line)) \
#     .flatMap(pos_tag_counter) \
#     .map(lambda word: (word[1], 1)) \
#     .reduceByKey(lambda x, y: x + y) \
#     .sortBy(lambda x: x[1], ascending = False)
    
# pos_tag_counts.collect()

#### Create RDD pipeline to show distribution of length of noun phrases

1. After filtering out blank lines and 'URL' lines as before, apply (using flatMap) a ```tokenize_chunk_parse``` function to each line in the ```text``` RDD
2. Use map to emit the length (number of tokens) of each noun phrase
3. Use map to convert each length to a pair RDD element with the length as key and a value of 1 (the code below combines steps 2 and 3 in a single map)
4. Reduce the resulting RDD by key, adding up all the 1s (like the lecture and lab examples)
5. Sort the resulting list by the counts, in descending order.
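
The counting logic can again be checked in plain Python on hand-built data. In this sketch `chunked_lines` is a hypothetical example of what `tokenize_chunk_parse` is expected to return for two lines: one list of noun phrases per line, each phrase being a list of (word, tag) leaves. The phrases themselves are invented for illustration.

```python
from collections import Counter

# Hypothetical chunker output for two input lines.
chunked_lines = [
    [
        [("Manager", "NNP"), ("Terry", "NNP"), ("Collins", "NNP")],
        [("game", "NN")],
    ],
    [
        [("Washington", "NNP"), ("Nationals", "NNPS")],
        [("loss", "NN")],
    ],
]

# flatMap: merge the per-line lists into a single list of phrases.
phrases = [phrase for line in chunked_lines for phrase in line]

# map: key each phrase by its length in tokens, with a count of 1.
length_pairs = [(len(phrase), 1) for phrase in phrases]

# reduceByKey: add up the 1s for each length.
length_counts = Counter()
for length, one in length_pairs:
    length_counts[length] += one

# sortBy: most frequent phrase length first.
np_ranked = sorted(length_counts.items(), key=lambda kv: kv[1], reverse=True)
print(np_ranked)
```

Because the key is the phrase length rather than the phrase text, the result is a histogram of noun-phrase lengths, which is the shape of the `np_counts` output.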

In [7]:
grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""

  
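def leaves(tree):
    """Yield the (word, tag) leaves of every NP subtree."""
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()


def tokenize_chunk_parse(line):
    """Tokenize and tag a line, then return its noun phrases as lists of (word, tag) pairs."""
    chunker = nltk.RegexpParser(grammar)

    # TOKEN_RE is the pattern defined in the pos_tag_counter cell above
    toks = nltk.regexp_tokenize(line, TOKEN_RE)
    postoks = nltk.tag.pos_tag(toks)

    tree = chunker.parse(postoks)

    return list(leaves(tree))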

In [8]:
np_counts = text.filter(lambda line: len(line) > 0) \
    .filter(lambda line: re.findall('^(?!URL).*', line)) \
    .flatMap(tokenize_chunk_parse) \
    .map(lambda phrase: (len(phrase), 1)) \
    .reduceByKey(lambda x, y: x+y) \
    .sortBy(lambda x: x[1], ascending = False)

np_counts.collect()

[(1, 1194014),
 (2, 345011),
 (3, 106957),
 (4, 34459),
 (5, 10561),
 (6, 3638),
 (7, 1261),
 (8, 494),
 (9, 238),
 (10, 106),
 (11, 50),
 (13, 34),
 (12, 26),
 (14, 23),
 (16, 16),
 (18, 14),
 (27, 10),
 (20, 9),
 (32, 9),
 (19, 9),
 (34, 8),
 (15, 8),
 (40, 7),
 (26, 7),
 (24, 7),
 (25, 7),
 (17, 7),
 (46, 6),
 (28, 6),
 (37, 6),
 (21, 6),
 (22, 5),
 (23, 5),
 (29, 5),
 (44, 4),
 (30, 4),
 (31, 4),
 (39, 4),
 (33, 4),
 (41, 4),
 (55, 4),
 (56, 3),
 (48, 3),
 (50, 3),
 (36, 3),
 (63, 3),
 (49, 3),
 (57, 3),
 (51, 3),
 (71, 3),
 (45, 3),
 (43, 3),
 (61, 2),
 (47, 2),
 (35, 2),
 (65, 2),
 (88, 1),
 (66, 1),
 (58, 1),
 (64, 1),
 (42, 1),
 (82, 1),
 (104, 1),
 (38, 1),
 (80, 1),
 (140, 1),
 (92, 1),
 (54, 1),
 (53, 1),
 (91, 1),
 (131, 1),
 (75, 1),
 (127, 1),
 (135, 1),
 (113, 1)]

______________________
<div style="text-align: right"><sub>Exercise adapted and modified from UMSI homework assignment for SIADS 516.</sub></div>