In [0]:
from IPython.display import HTML


## Natural Language Processing (NLP) the John Snow Labs way

- [Reference link](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html#Getting%20Started.html)
- [Pre-built models](https://nlp.johnsnowlabs.com/2021/01/09/classifierdl_use_fakenews_en.html)

### Annotators

- [documentation](https://nlp.johnsnowlabs.com/docs/en/annotators)

In [0]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/fsS057SNFtg" frameborder="0" allowfullscreen></iframe>')


In [0]:
# %pip install altair spark-nlp

In [0]:
%sh
java -version

In [0]:
spark.version

In [0]:
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
import sparknlp

from IPython.display import HTML
from sklearn.metrics import classification_report


In [0]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StringIndexer, CountVectorizer, StopWordsRemover, NGram
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

In [0]:
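# Import Spark NLP after pyspark.ml.feature: its Tokenizer shadows the
# pyspark.ml.feature.Tokenizer imported above, so Tokenizer() below is the Spark NLP annotator.
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *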

## The NLP process using sparknlp

I pulled material from [here](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html#python/annotation/Spark%20NLP%20start.html) for this guide.

### John Snow Labs background

The [spark-nlp package](https://pypi.org/project/spark-nlp/) developed by [John Snow Labs](https://nlp.johnsnowlabs.com/) looks to be a leading choice for text analytics at scale. The package is not developed by the [Apache Spark team](https://spark.apache.org/), but it is released under the [Apache software license](https://www.apache.org/licenses/). Its API is built to plug into Spark ML pipelines, and [their documentation](https://nlp.johnsnowlabs.com/docs/en/concepts) highlights this connection.

Let's look at their example code.

In [0]:
df = spark.createDataFrame([("Yeah, I get that. is the",)], ["comment"])
display(df)

comment
"Yeah, I get that. is the"


In [0]:
document_assembler = DocumentAssembler() \
    .setInputCol("comment") \
    .setOutputCol("document")
    
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)
    
tokenizer = Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")
    
normalizer = Normalizer() \
    .setInputCols(["stem"]) \
    .setOutputCol("normalized")

finisher = Finisher() \
    .setInputCols(["normalized"]) \
    .setOutputCols(["ntokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(True)

In [0]:
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, stemmer, normalizer, finisher])

nlp_model = nlp_pipeline.fit(df)
processed = nlp_model.transform(df).persist()
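
The pipeline above works by column chaining: each stage reads the column the previous stage wrote. A minimal pure-Python sketch of that wiring follows. The functions are toy stand-ins invented for illustration (they are not Spark NLP's implementations), and stemming is left out on purpose.

```python
import re

# Toy stand-ins for the annotators; names and logic are illustrative assumptions.
def assemble(row):
    # DocumentAssembler analogue: copy the raw text into a "document" column.
    row["document"] = row["comment"]
    return row

def detect_sentences(row):
    # Split wherever whitespace follows ., ! or ?
    row["sentence"] = [s for s in re.split(r"(?<=[.!?])\s+", row["document"]) if s]
    return row

def tokenize(row):
    # Words and single punctuation marks become separate tokens.
    row["token"] = [t for s in row["sentence"] for t in re.findall(r"\w+|[^\w\s]", s)]
    return row

def normalize(row):
    # Lowercase, strip non-alphanumeric characters, drop tokens left empty.
    cleaned = (re.sub(r"[^a-z0-9]", "", t.lower()) for t in row["token"])
    row["normalized"] = [t for t in cleaned if t]
    return row

def run_pipeline(row, stages):
    # Apply the stages in order, like Pipeline(stages=[...]).
    for stage in stages:
        row = stage(row)
    return row

result = run_pipeline({"comment": "Yeah, I get that. is the"},
                      [assemble, detect_sentences, tokenize, normalize])
print(result["normalized"])
```

Because there is no stemmer in this sketch, its tokens will not match the Stemmer-based output of the real pipeline exactly.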


In [0]:
processed.count()

In [0]:
display(processed)

comment,ntokens
"Yeah, I get that. is the","List(yeah, i, get, that, i, the)"


In [0]:
display(processed.toPandas())

comment,ntokens
"Yeah, I get that. is the","List(yeah, i, get, that, i, the)"


#### Multi-Class Text Classification Example

[See the Databricks example that the code below is based on](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html#python/training/ClassifierDL_Train_multi_class_news_category_classifier.html)

In [0]:
%fs ls "file:/dbfs/FileStore/news_category"


path,name,size
file:/dbfs/FileStore/news_category/news_category_test.csv,news_category_test.csv,1504408
file:/dbfs/FileStore/news_category/news_category_train.csv,news_category_train.csv,24032125


In [0]:
trainDataset = spark.read \
  .option("header", True) \
  .option("inferSchema", True) \
  .csv("file:/dbfs/FileStore/news_category/news_category_train.csv")


In [0]:
display(trainDataset)

category,description
Business,"Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business,"Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
Business,Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
Business,"Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday."
Business,"Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections."
Business,"Stocks ended slightly higher on Friday but stayed near lows for the year as oil prices surged past #36;46 a barrel, offsetting a positive outlook from computer maker Dell Inc. (DELL.O)"
Business,"Assets of the nation's retail money market mutual funds fell by #36;1.17 billion in the latest week to #36;849.98 trillion, the Investment Company Institute said Thursday."
Business,"Retail sales bounced back a bit in July, and new claims for jobless benefits fell last week, the government said Thursday, indicating the economy is improving from a midsummer slump."
Business,""" After earning a PH.D. in Sociology, Danny Bazil Riley started to work as the general manager at a commercial real estate firm at an annual base salary of #36;70,000. Soon after, a financial planner stopped by his desk to drop off brochures about insurance benefits available through his employer. But, at 32, """"buying insurance was the furthest thing from my mind"
Business,"Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."


In [0]:
# The text to classify is in the description column
document = DocumentAssembler() \
    .setInputCol("description") \
    .setOutputCol("document")

# Sentence embeddings from the default pretrained Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# The classes/labels/categories are in the category column
classifier_dl = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("category") \
    .setMaxEpochs(1)

pipeline = Pipeline(
    stages=[
        document,
        use,
        classifier_dl
    ])

In [0]:
pipelineModel = pipeline.fit(trainDataset)


In [0]:
# StringType belongs to pyspark.sql.types, not pyspark.sql.functions
from pyspark.sql.types import StringType

dfTest = spark.createDataFrame([
    "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
    "Scientists have discovered irregular lumps beneath the icy surface of Jupiter's largest moon, Ganymede. These irregular masses may be rock formations, supported by Ganymede's icy shell for billions of years..."
], StringType()).toDF("description")

In [0]:
# Run the fitted pipeline on the two sample descriptions
prediction1 = pipelineModel.transform(dfTest)
prediction1.select("class.result").show()
prediction1.select("class.metadata").show(truncate=False)

In [0]:
testDataset = spark.read \
  .option("header", True) \
  .option("inferSchema", True) \
  .csv("file:/dbfs/FileStore/news_category/news_category_test.csv")

In [0]:
preds = pipelineModel.transform(testDataset)
display(preds.select('category','description',"class.result"))


In [0]:
preds_df = preds.select('category','description',"class.result").toPandas()


In [0]:
# `result` is an array because Spark NLP can return one prediction per sentence:
# adding a SentenceDetector ahead of UniversalSentenceEncoder would give one
# label for each sentence. Here each description is a single document, so keep
# the first (and only) element of the array.
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])
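
The comment above notes that `result` can hold one label per sentence. If that happens, taking `x[0]` silently ignores the other sentences. Below is a small sketch of two ways to collapse a list of labels into one; the helper names and the sample labels are made up for illustration.

```python
from collections import Counter

def first_label(labels, default=None):
    # Mirrors x[0], but returns `default` instead of raising on an empty list.
    return labels[0] if labels else default

def majority_label(labels, default=None):
    # Most frequent label; ties go to the label seen first.
    if not labels:
        return default
    return Counter(labels).most_common(1)[0][0]

per_sentence = ["Business", "Sports", "Sports"]  # hypothetical per-sentence predictions
print(first_label(per_sentence), majority_label(per_sentence))
```

Either helper could be passed to `preds_df['result'].apply(...)` in place of the lambda.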

In [0]:
# classification_report expects y_true first, then y_pred
print(classification_report(preds_df['category'], preds_df['result']))
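
scikit-learn's `classification_report` takes the true labels first and the predictions second, so the argument order decides which number is reported as precision and which as recall. The pure-Python sketch below computes both for one class to make that visible; the function name and the toy labels are assumptions for illustration only.

```python
def precision_recall(y_true, y_pred, label):
    # Count true positives, false positives, and false negatives for `label`.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = ["Business", "Business", "Business", "Sports"]  # toy ground truth
y_pred = ["Business", "Sports", "Sports", "Sports"]      # toy predictions

correct_order = precision_recall(y_true, y_pred, "Business")
swapped_order = precision_recall(y_pred, y_true, "Business")
print(correct_order, swapped_order)
```

Swapping the arguments exchanges precision and recall for each class, while accuracy stays the same.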


#### Our data

Let's see if we can predict a category from the purpose (mission description) text.

In [0]:
description = spark.sql('SELECT ein, ActvtyOrMssnDsc as description FROM irs990.return_part_i')
display(description)

ein,description
541733690,TO PROVIDE A LIBERAL ARTS EDUCATION TO STUDENTS.
540505867,PROVIDE HIGH QUALITY CHILD CARE THAT PROMOTES AND FOSTERS GROWTH AND DEVELOPMENT IN YOUNG CHILDREN PRIMARILY FOR LOW INCOME INDIVIDUALS AND FAMILIES
451266409,To provide opportunities for educational choice to students and families of this community. The organization is primarily engaged in providing instructional services to enrolled students in Grades K-8.
436092593,TO SERVE AS A PROFESSIONAL ASSOCIATION OF INDIVIDUALS AND ORGANIZATIONS INVOLVED IN REAL ESTATE AND TO ASSIST THEM IN THAT PROFESSION.
561057522,PROVIDE WATER SERVICES
610844925,"TO MAXIMIZE THE VOCATIONAL POTENTIAL AND QUALITY OF LIFE OF ADULT PERSONS WITH DISABILITES OR OTHER BARRIERS TO EMPLOYMENT THROUGH THE FLEXIBLE INTEGRATION OF COUNSELING, EVALUATION, LIFE AND WORK SKILLS TRAINING, JOB PLACEMENT, SUPPORT SERVICES AND EMPLOYMENT IN A THERAPUTIC OR COMMUNITY BASED REMUNERATIVE WORK ENVIRONMENT."
133277408,"THE MISSION OF THE ALZHEIMER'S ASSOCIATION, NEW YORK CITY CHAPTER IS TO ELIMINATE ALZHEIMER'S DISEASE THROUGH THE ADVANCEMENT OF RESEARCH; TO PROVIDE AND ENHANCE CARE AND SUPPORT FOR ALL AFFECTED; AND TO REDUCE THE RISK OF DEMENTIA THROUGH THE PROMOTION OF BRAIN HEALTH."
210692834,TO PROMOTE AGRICULTURAL PURSUITS
270918026,To provide a community developmental facility for pre-school learing
471846514,"The Garfield American Legion Veterans Memorial Scholarship Fund was organized exclusively and specifically to operate a scholarship fund that will benefit the youth of America. First, the scholarship fund will provide educational assistance to the children, grandchildren and legacy of veterans who served our country and were/are life members of the Garfield American Legion Post 255 or has twenty-five (25) consecutive years membership in Post 255. Second, the scholarship fund will provide educational assistance to the children, grandchildren and legacy of veterans who served our country and were/are life members of American Legion Posts throughout Bergen County, New Jersey. Third, the scholarship fund will provide educational assistance to the children, grandchildren and legacy of veterans who honorably served our country and who have thirty (30) consecutive year membership in any bonafide veterans' organization in Bergen County."


In [0]:
description.count()

In [0]:
temp = description.limit(100).toPandas()

In [0]:
display(temp)

ein,description
541733690,TO PROVIDE A LIBERAL ARTS EDUCATION TO STUDENTS.
540505867,PROVIDE HIGH QUALITY CHILD CARE THAT PROMOTES AND FOSTERS GROWTH AND DEVELOPMENT IN YOUNG CHILDREN PRIMARILY FOR LOW INCOME INDIVIDUALS AND FAMILIES
451266409,To provide opportunities for educational choice to students and families of this community. The organization is primarily engaged in providing instructional services to enrolled students in Grades K-8.
436092593,TO SERVE AS A PROFESSIONAL ASSOCIATION OF INDIVIDUALS AND ORGANIZATIONS INVOLVED IN REAL ESTATE AND TO ASSIST THEM IN THAT PROFESSION.
561057522,PROVIDE WATER SERVICES
610844925,"TO MAXIMIZE THE VOCATIONAL POTENTIAL AND QUALITY OF LIFE OF ADULT PERSONS WITH DISABILITES OR OTHER BARRIERS TO EMPLOYMENT THROUGH THE FLEXIBLE INTEGRATION OF COUNSELING, EVALUATION, LIFE AND WORK SKILLS TRAINING, JOB PLACEMENT, SUPPORT SERVICES AND EMPLOYMENT IN A THERAPUTIC OR COMMUNITY BASED REMUNERATIVE WORK ENVIRONMENT."
133277408,"THE MISSION OF THE ALZHEIMER'S ASSOCIATION, NEW YORK CITY CHAPTER IS TO ELIMINATE ALZHEIMER'S DISEASE THROUGH THE ADVANCEMENT OF RESEARCH; TO PROVIDE AND ENHANCE CARE AND SUPPORT FOR ALL AFFECTED; AND TO REDUCE THE RISK OF DEMENTIA THROUGH THE PROMOTION OF BRAIN HEALTH."
210692834,TO PROMOTE AGRICULTURAL PURSUITS
270918026,To provide a community developmental facility for pre-school learing
471846514,"The Garfield American Legion Veterans Memorial Scholarship Fund was organized exclusively and specifically to operate a scholarship fund that will benefit the youth of America. First, the scholarship fund will provide educational assistance to the children, grandchildren and legacy of veterans who served our country and were/are life members of the Garfield American Legion Post 255 or has twenty-five (25) consecutive years membership in Post 255. Second, the scholarship fund will provide educational assistance to the children, grandchildren and legacy of veterans who served our country and were/are life members of American Legion Posts throughout Bergen County, New Jersey. Third, the scholarship fund will provide educational assistance to the children, grandchildren and legacy of veterans who honorably served our country and who have thirty (30) consecutive year membership in any bonafide veterans' organization in Bergen County."


In [0]:
# http://text-analytics101.rxnlp.com/2014/10/all-about-stop-words-for-text-mining.html
# http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words is used by pyspark and sklearn
remover = StopWordsRemover()
stopwords = remover.getStopWords() 

In [0]:
print(len(stopwords))
stopwords[:10]
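
To show what a stop-word list is used for, here is a minimal pure-Python sketch that filters one of the descriptions above. `STOP_WORDS` is a tiny hypothetical list, far shorter than the `stopwords` list returned by `StopWordsRemover`, and the regex tokenizer is a simplification.

```python
import re

# Hypothetical mini stop-word list for illustration only.
STOP_WORDS = {"to", "a", "the", "and", "of", "in", "for", "that"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase, split into word-like tokens, then drop the stop words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stop_words]

kept = remove_stop_words("TO PROVIDE A LIBERAL ARTS EDUCATION TO STUDENTS.")
print(kept)
```

Lowercasing first matters for this data because many descriptions are written entirely in capitals.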

Some reference links from my first search:

- [Example 1](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3923635548890252/1357850364289680/4930913221861820/latest.html)
- [Example 2](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6052175677058526/3537626382528910/5364082293869370/latest.html)
- [Example 3](https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35)
- [Example 4](https://community.cloudera.com/t5/Community-Articles/Spark-Text-Analytics-Uncovering-Data-Driven-Topics/ta-p/244377)
- [Example 5](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2799933550853697/1853118572324048/2202577924924539/latest.html)
- [Mueller report and spark-nlp](https://hackernoon.com/mueller-report-for-nerds-spark-meets-nlp-with-tensorflow-and-bert-part-1-32490a8f8f12)
- [Settings for cluster and spark-nlp](https://medium.com/spark-nlp/spark-nlp-quickstart-tutorial-with-databricks-5df54853cf0a)

The [John Snow Labs Databricks examples](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html#python/annotation/Spark%20NLP%20start.html) look very promising. You can find their full [repo of materials](https://github.com/JohnSnowLabs/spark-nlp-workshop).

In [0]:
# from sparknlp.pretrained import PretrainedPipeline
# pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
# result = pipeline.annotate('Harry Potter is a great movie')
# print(result['entities']) 