## Basic Linguistic Algorithms Using PreTrainedModels

#### Table of Contents
- **[Spark 2.3 Set Up](#section1)**
- **[Explain Document](#section2)**
- **[Clean Stop Words](#section3)**
- **[Entity Recognition](#section4)**
- **[Clean Slang](#section5)**
- **[Spell Checker](#section6)**
- **[Sentiment Analysis](#section7)**
- **[Matching Chunks](#section8)**
- **[Extract Exact Dates from Referential Date Phrases](#section9)**


### Spark 2.3 Set Up
<a id='section1'></a>

In [1]:
import os
import sys
import findspark
import pandas as pd
os.environ["JAVA_HOME"] = "/usr/lib64/jvm/java-1.8.0-openjdk-1.8.0"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "/dfs/is/home/m292915/spark-2.3.0-bin-hadoop2.7"
findspark.init()
jar_path='/dfs/is/home/m292915/spark-2.3.0-bin-hadoop2.7/jars/'

from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "1000M") \
        .config("spark.driver.maxResultSize", "0")\
        .config("spark.jars", "{}spark-nlp-spark23_2.11-2.5.5.jar,{}spark-nlp-spark23-assembly-2.5.5.jar".format(jar_path,jar_path)) \
        .getOrCreate()

import sparknlp
sparknlp.start(spark23=True)

In [4]:
import sparknlp
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

Spark NLP version 2.6.0
Apache Spark version: 2.3.0


In [2]:
text = '''The Merck Group, branded and commonly known as Merck, is a German multinational pharmaceutical, chemical and life sciences company headquartered in Darmstadt, with about 56,000 employees and present in 66 countries. 
The group includes around 250 companies; the main company is Merck KGaA in Germany. 
Merck was founded in 1668 and is the world's oldest operating chemical and pharmaceutical company, as well as one of the largest pharmaceutical companies in the world'''

### Explain Document
<a id='section2'></a>

**Stages in ML**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)

**Stages in DL**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- NER (NER with GloVe 100D embeddings, CoNLL2003 dataset)
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)

Download the pretrained models from https://github.com/JohnSnowLabs/spark-nlp-models and load them from local disk as shown in the next cell.

In [3]:
from sparknlp.pretrained import PretrainedPipeline
# Load both "explain document" pipelines from local copies instead of downloading them
explain_dl = PretrainedPipeline.from_disk("/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/explain_document_dl_en")
explain_ml = PretrainedPipeline.from_disk("/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/explain_document_ml_en")

Py4JJavaError: An error occurred while calling o135.getParam.
: java.util.NoSuchElementException: Param detectLists does not exist.
	at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
	at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.ml.param.Params$class.getParam(params.scala:728)
	at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42)
	at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
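
The traceback does not name its cause. One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is a version mismatch: the session loads Spark NLP 2.5.5 jars while `sparknlp.version()` reports 2.6.0. The sketch below uses two hypothetical helpers, `jar_version` and `versions_match`, to read the version from a jar file name and compare it with the Python package version.

```python
import re

def jar_version(jar_name):
    """Return the trailing x.y.z version in a jar file name, or None."""
    match = re.search(r"-(\d+\.\d+\.\d+)\.jar$", jar_name)
    return match.group(1) if match else None

def versions_match(jar_name, python_version):
    """True when the jar and the Python package report the same version."""
    return jar_version(jar_name) == python_version

# Jar names taken from the spark.jars setting in the setup cell
jars = ["spark-nlp-spark23_2.11-2.5.5.jar", "spark-nlp-spark23-assembly-2.5.5.jar"]
python_version = "2.6.0"  # value printed by sparknlp.version() in the setup section

for jar in jars:
    status = "ok" if versions_match(jar, python_version) else "MISMATCH"
    print(jar, jar_version(jar), status)
```

If the check reports a mismatch, aligning the jar and the Python package on one version is a reasonable first thing to try.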


In [4]:
result_explain_dl = explain_dl.annotate(text)
result_explain_ml= explain_ml.annotate(text)

In [5]:
print(result_explain_dl.keys())

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])


In [6]:
dl_df = pd.DataFrame({'token':result_explain_dl['token'], 'dl_ner_label':result_explain_dl['ner'],
                      'dl_spell_corrected':result_explain_dl['checked'], 'dl_POS':result_explain_dl['pos'],
                      'dl_lemmas':result_explain_dl['lemma'], 'dl_stems':result_explain_dl['stem']})
ml_df = pd.DataFrame({'token':result_explain_ml['token'], 
                      'ml_corrected':result_explain_ml['spell'], 'ml_POS':result_explain_ml['pos'],
                      'ml_lemmas':result_explain_ml['lemmas'], 'ml_stems':result_explain_ml['stems']})
dl_df.merge(ml_df,on='token').drop_duplicates().head(10)

| | token | dl_ner_label | dl_spell_corrected | dl_POS | dl_lemmas | dl_stems | ml_corrected | ml_POS | ml_lemmas | ml_stems |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The | O | The | DT | The | the | The | DT | The | the |
| 4 | Merck | B-ORG | Merck | NNP | Merck | merck | Merck | NNP | Merck | merck |
| 20 | Group | I-ORG | Group | NNP | Group | group | Group | NNP | Group | group |
| 21 | , | O | , | , | , | , | , | , | , | , |
| 46 | branded | O | branded | VBD | brand | brand | branded | VBD | brand | brand |
| 47 | and | O | and | CC | and | and | and | CC | and | and |
| 72 | commonly | O | commonly | RB | commonly | commonli | commonly | RB | commonly | commonli |
| 73 | known | O | known | VBN | know | known | known | VBN | know | known |
| 74 | as | O | as | IN | as | a | as | IN | as | a |
| 75 | as | O | as | IN | as | a | as | RB | as | a |
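
The row labels in the table above jump (0, 4, 20, ...) because merging on `token` pairs every occurrence of a repeated token in one frame with every occurrence in the other. Assuming both pipelines tokenize the text identically, which is not verified here, aligning rows by position avoids that. A minimal sketch with toy data:

```python
import pandas as pd

# Toy stand-ins for the two pipeline outputs; "as" appears twice.
dl_df = pd.DataFrame({"token": ["as", "well", "as"], "dl_POS": ["RB", "RB", "IN"]})
ml_df = pd.DataFrame({"token": ["as", "well", "as"], "ml_POS": ["RB", "RB", "IN"]})

# Merging on the token text pairs each "as" with both "as" rows: 2*2 + 1 = 5 rows.
merged = dl_df.merge(ml_df, on="token")
print(len(merged))

# Positional alignment keeps exactly one row per token.
aligned = pd.concat([dl_df, ml_df.drop(columns="token")], axis=1)
print(aligned)
```

Positional alignment also removes the need for `drop_duplicates()`, which can hide genuine repeats of a token.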


In [7]:
detailed_result_explain_dl = explain_dl.fullAnnotate(text)

In [8]:
detailed_result_explain_dl

[{'entities': [Annotation(chunk, 4, 14, Merck Group, {'entity': 'ORG', 'sentence': '0', 'chunk': '0'}),
   Annotation(chunk, 47, 51, Merck, {'entity': 'ORG', 'sentence': '0', 'chunk': '1'}),
   Annotation(chunk, 59, 64, German, {'entity': 'MISC', 'sentence': '0', 'chunk': '2'}),
   Annotation(chunk, 148, 156, Darmstadt, {'entity': 'LOC', 'sentence': '0', 'chunk': '3'}),
   Annotation(chunk, 278, 287, Merck KGaA, {'entity': 'ORG', 'sentence': '2', 'chunk': '4'}),
   Annotation(chunk, 292, 298, Germany, {'entity': 'LOC', 'sentence': '2', 'chunk': '5'}),
   Annotation(chunk, 302, 306, Merck, {'entity': 'ORG', 'sentence': '3', 'chunk': '6'})],
  'stem': [Annotation(token, 0, 2, the, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 4, 8, merck, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 10, 14, group, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 15, 15, ,, {'confidence': '0.0', 'sentence': '0'}),
   Annotation(token, 17, 23, brand, {'confiden

In [21]:
detailed_result_explain_dl[0]['entities']

[Annotation(chunk, 4, 14, Merck Group, {'entity': 'ORG', 'sentence': '0', 'chunk': '0'}),
 Annotation(chunk, 47, 51, Merck, {'entity': 'ORG', 'sentence': '0', 'chunk': '1'}),
 Annotation(chunk, 59, 64, German, {'entity': 'MISC', 'sentence': '0', 'chunk': '2'}),
 Annotation(chunk, 148, 156, Darmstadt, {'entity': 'LOC', 'sentence': '0', 'chunk': '3'}),
 Annotation(chunk, 278, 287, Merck KGaA, {'entity': 'ORG', 'sentence': '2', 'chunk': '4'}),
 Annotation(chunk, 292, 298, Germany, {'entity': 'LOC', 'sentence': '2', 'chunk': '5'}),
 Annotation(chunk, 302, 306, Merck, {'entity': 'ORG', 'sentence': '3', 'chunk': '6'})]

In [23]:
# Collect the text and entity type of every recognized chunk
chunks = []
entities = []
for n in detailed_result_explain_dl[0]['entities']:
    chunks.append(n.result)
    entities.append(n.metadata['entity'])

df = pd.DataFrame({'chunks': chunks, 'entities': entities})
df

| | chunks | entities |
|---|---|---|
| 0 | Merck Group | ORG |
| 1 | Merck | ORG |
| 2 | German | MISC |
| 3 | Darmstadt | LOC |
| 4 | Merck KGaA | ORG |
| 5 | Germany | LOC |
| 6 | Merck | ORG |


In [25]:
# Build one row per token: sentence id, text, character span, POS tag, and NER tag
tuples = []
annotations = detailed_result_explain_dl[0]
for x, y, z in zip(annotations["token"], annotations["pos"], annotations["ner"]):
    tuples.append((int(x.metadata['sentence']), x.result, x.begin, x.end, y.result, z.result))

df = pd.DataFrame(tuples, columns=['sent_id', 'token', 'start', 'end', 'pos', 'ner'])
df.head(10)


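| | sent_id | token | start | end | pos | ner |
|---|---|---|---|---|---|---|
| 0 | 0 | The | 0 | 2 | DT | O |
| 1 | 0 | Merck | 4 | 8 | NNP | B-ORG |
| 2 | 0 | Group | 10 | 14 | NNP | I-ORG |
| 3 | 0 | , | 15 | 15 | , | O |
| 4 | 0 | branded | 17 | 23 | VBD | O |
| 5 | 0 | and | 25 | 27 | CC | O |
| 6 | 0 | commonly | 29 | 36 | RB | O |
| 7 | 0 | known | 38 | 42 | VBN | O |
| 8 | 0 | as | 44 | 45 | IN | O |
| 9 | 0 | Merck | 47 | 51 | NNP | B-ORG |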
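
The `ner` column uses B-/I-/O tags, while the `entities` output earlier lists whole chunks such as `Merck Group`. A minimal sketch of how such tags can be grouped into chunks follows. `bio_to_chunks` is a hypothetical helper written for illustration, not the converter the pipeline itself uses.

```python
def bio_to_chunks(tokens, tags):
    """Group B-/I- tagged tokens into (chunk_text, entity_type) pairs."""
    chunks = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk; flush any open one first.
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # An I- tag extends the open chunk of the same type.
            current.append(token)
        else:
            # "O" (or a stray I- tag) closes the open chunk.
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        chunks.append((" ".join(current), current_type))
    return chunks

tokens = ["The", "Merck", "Group", ",", "branded", "and", "commonly", "known", "as", "Merck"]
tags = ["O", "B-ORG", "I-ORG", "O", "O", "O", "O", "O", "O", "B-ORG"]
print(bio_to_chunks(tokens, tags))  # [('Merck Group', 'ORG'), ('Merck', 'ORG')]
```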


### Clean Stop words
<a id='section3'></a>

In [7]:
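clean_stop = PretrainedPipeline.from_disk("/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/clean_stop_en")
result_clean_stop = clean_stop.annotate(text)
# Note: 'token' holds the unfiltered tokens, so stop words still appear in the output below.
# Inspect result_clean_stop.keys() to find the key that holds the cleaned tokens.
print(' '.join(result_clean_stop['token']))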

The Merck Group , branded and commonly known as Merck , is a German multinational pharmaceutical , chemical and life sciences company headquartered in Darmstadt , with about 56,000 employees and present in 66 countries . The group includes around 250 companies ; the main company is Merck KGaA in Germany . Merck was founded in 1668 and is the world's oldest operating chemical and pharmaceutical company , as well as one of the largest pharmaceutical companies in the world
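
The printed text still contains words such as `The` and `and`, because it joins the raw tokens. For comparison, here is a minimal sketch of stop-word filtering in plain Python. The stop-word set is a small hypothetical sample, not the list the pretrained pipeline ships with.

```python
# Small illustrative stop-word set (an assumption, not the pipeline's list).
STOP_WORDS = {"the", "is", "a", "and", "in", "as", "of", "with"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "group", "includes", "around", "250", "companies"]
print(remove_stop_words(tokens))  # ['group', 'includes', 'around', '250', 'companies']
```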


<a id='section4'></a>
### Entity Recognition

In [8]:
recognize_entities = PretrainedPipeline.from_disk('/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/recognize_entities_dl_en')

In [9]:
doc_entity_rec = '''
Peter is a very good persn.
He has a good car though.
Europe is very culture rich.
'''

In [10]:
result_recognize_entities = recognize_entities.annotate(doc_entity_rec)
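# The second column holds NER tags, so name it accordingly
pd.DataFrame({'word': result_recognize_entities['token'], 'ner_label': result_recognize_entities['ner']})

| | word | ner_label |
|---|---|---|
| 0 | Peter | B-PER |
| 1 | is | O |
| 2 | a | O |
| 3 | very | O |
| 4 | good | O |
| 5 | persn | O |
| 6 | . | O |
| 7 | He | O |
| 8 | has | O |
| 9 | a | O |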


### Clean Slang
<a id='section5'></a>

In [11]:
clean_slang = PretrainedPipeline.from_disk('/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/clean_slang_en')
result_clean_slang = clean_slang.annotate(' Whatsup bro, call me ASAP')
print(' '.join(result_clean_slang['normal']))

how are you friend call me as soon as possible
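
The output above looks like a token-by-token dictionary lookup. The toy sketch below illustrates that idea only: the `SLANG` mapping and the `normalize_slang` helper are hypothetical, and the source does not show how the pretrained pipeline actually performs the replacement.

```python
import re

# Hypothetical slang dictionary; the pretrained pipeline's own mapping is not shown in the source.
SLANG = {"whatsup": "how are you", "bro": "friend", "asap": "as soon as possible"}

def normalize_slang(text, slang=SLANG):
    """Lowercase the text, drop punctuation, and expand slang tokens via dictionary lookup."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(slang.get(w, w) for w in words)

print(normalize_slang(" Whatsup bro, call me ASAP"))
# how are you friend call me as soon as possible
```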


### Spell Checker
<a id='section6'></a>

In [12]:
spell_checker_ml = PretrainedPipeline.from_disk('/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/check_spelling_en')

In [13]:
#spell_checker_dl = PretrainedPipeline.from_disk('/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/check_spelling_dl')

In [14]:
text_spell_check = '''
He is a  good persn.
My life  is very intersting.
We are brothrs.
'''

In [15]:
result_spell_checker_ml = spell_checker_ml.annotate(text_spell_check)
#result_spell_checker_dl = spell_checker_dl.annotate(text_spell_check)
pd.DataFrame({'word':result_spell_checker_ml['token'],"corrected_word":result_spell_checker_ml['checked']})

| | word | corrected_word |
|---|---|---|
| 0 | He | He |
| 1 | is | is |
| 2 | a | a |
| 3 | good | good |
| 4 | persn | person |
| 5 | . | . |
| 6 | My | My |
| 7 | life | life |
| 8 | is | is |
| 9 | very | very |
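
The stage lists earlier name the spell checker as Norvig-style. As a rough illustration of that family of algorithms, the sketch below corrects a word by choosing the most frequent known word within one edit. The word-frequency table is a tiny hypothetical sample, and the code is not the pipeline's implementation.

```python
from collections import Counter

# Tiny hypothetical word-frequency table; a real checker learns this from a large corpus.
WORD_COUNTS = Counter({"person": 50, "interesting": 30, "brothers": 20, "good": 80, "life": 60})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the word itself if known, else the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print([correct(w) for w in ["persn", "intersting", "brothrs", "good"]])
# ['person', 'interesting', 'brothers', 'good']
```

This sketch handles lowercase words only and stops at edit distance one; unknown words with no candidate are returned unchanged.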


<a id='section7'></a>
### Sentiment Analysis

In [17]:
sentiment = PretrainedPipeline.from_disk('/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/analyze_sentiment_en')
result_sentiment = sentiment.annotate("The movie I watched today was not a good one")
result_sentiment['sentiment']

['negative']

<a id='section8'></a>
### Matching Chunks

In [26]:
matching_chunks = PretrainedPipeline.from_disk('/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/match_chunks_en')
result_matching_chunks = matching_chunks.annotate("The book has many chapters") # single noun phrase
result_matching_chunks

{'chunk': ['The book'],
 'document': ['The book has many chapters'],
 'pos': ['DT', 'NN', 'VBZ', 'JJ', 'NNS'],
 'token': ['The', 'book', 'has', 'many', 'chapters'],
 'sentence': ['The book has many chapters']}

In [28]:
result_matching_chunks = matching_chunks.annotate("the little yellow dog barked at the cat")  # multiple noun phrases
result_matching_chunks

{'chunk': ['the little yellow dog', 'the cat'],
 'document': ['the little yellow dog barked at the cat'],
 'pos': ['DT', 'JJ', 'JJ', 'NN', 'JJ', 'IN', 'DT', 'NN'],
 'token': ['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat'],
 'sentence': ['the little yellow dog barked at the cat']}
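
Both results above are consistent with matching a noun-phrase pattern over the POS tags. The sketch below assumes the pattern `<DT>?<JJ>*<NN>+` (optional determiner, any adjectives, one or more singular nouns). That pattern is an assumption chosen because it reproduces the two outputs; the source does not state the pipeline's actual rule, and `match_noun_phrases` is a hypothetical helper.

```python
def match_noun_phrases(tokens, pos_tags):
    """Greedily match <DT>?<JJ>*<NN>+ over the POS tag sequence."""
    chunks = []
    i, n = 0, len(tokens)
    while i < n:
        j = i
        if j < n and pos_tags[j] == "DT":      # optional determiner
            j += 1
        while j < n and pos_tags[j] == "JJ":   # zero or more adjectives
            j += 1
        k = j
        while k < n and pos_tags[k] == "NN":   # one or more singular nouns
            k += 1
        if k > j:                              # at least one noun matched
            chunks.append(" ".join(tokens[i:k]))
            i = k
        else:
            i += 1
    return chunks

tokens = ['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']
pos = ['DT', 'JJ', 'JJ', 'NN', 'JJ', 'IN', 'DT', 'NN']
print(match_noun_phrases(tokens, pos))  # ['the little yellow dog', 'the cat']
```

Under this pattern `many chapters` in the first example is not a chunk, because `chapters` is tagged `NNS` rather than `NN`.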

<a id='section9'></a>
### Extract Exact Dates from Referential Date Phrases

In [29]:
match_datetime_en = PretrainedPipeline.from_disk('/dfs/is/home/m292915/spark_nlp/sparknlp_pipelines/match_datetime_en')

In [None]:
result_match_datetime_en = match_datetime_en.annotate("I saw him yesterday and he told me that he will visit us next week")

In [None]:
detailed_result_match_datetime_en = match_datetime_en.fullAnnotate("I saw him yesterday and he told me that he will visit us next week")
detailed_result_match_datetime_en
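
The cells above have no recorded output. To illustrate what resolving referential date phrases means, the sketch below maps a few phrases to day offsets from a fixed reference date. The offsets, the phrase list, and the `resolve_relative_dates` helper are all hypothetical; the pipeline's own rules (for example, how it interprets "next week") are not shown in the source.

```python
from datetime import date, timedelta

# Hypothetical offsets in days; the pretrained pipeline's own rules are not shown in the source.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1, "next week": 7}

def resolve_relative_dates(text, reference):
    """Return (phrase, resolved_date) pairs for each known phrase found in text, in text order."""
    lowered = text.lower()
    found = [(lowered.index(p), p) for p in RELATIVE_OFFSETS if p in lowered]
    return [(p, reference + timedelta(days=RELATIVE_OFFSETS[p])) for _, p in sorted(found)]

reference = date(2020, 9, 1)  # fixed reference date so the result is reproducible
sentence = "I saw him yesterday and he told me that he will visit us next week"
for phrase, resolved in resolve_relative_dates(sentence, reference):
    print(phrase, resolved.isoformat())
# yesterday 2020-08-31
# next week 2020-09-08
```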