![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/8.Keyword_Extraction_YAKE.ipynb)

# 8 Keyword Extraction with YAKE

In [1]:
# ! pip install -q pyspark==3.3.0 spark-nlp==4.2.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spark-nlp-jsl 4.2.2 requires spark-nlp==4.2.2, but you have spark-nlp 4.2.0 which is incompatible.

In [3]:
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType, DataType,ArrayType
from pyspark.sql.functions import udf, struct
from pyspark.ml import Pipeline
from IPython.display import display, HTML
import re

In [5]:
import sparknlp

from pyspark.ml import PipelineModel
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
# spark = sparknlp.start() # for GPU training >> sparknlp.start(gpu = True) # for Spark 2.3 =>> sparknlp.start(spark23 = True)
spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[2]")\
    .config("spark.driver.memory","8G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3,org.postgresql:postgresql:42.5.0")\
    .getOrCreate()
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

22/11/18 15:44:39 WARN Utils: Your hostname, Glorias-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.39 instead (on interface en0)
22/11/18 15:44:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/gloria/opt/anaconda3/envs/pyspark/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/gloria/.ivy2/cache
The jars for the packages stored in: /Users/gloria/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
org.postgresql#postgresql added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c8120f3d-e2ad-4928-82e0-7e08c3738fea;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;4.2.3 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
	found com.github.universal-automata#liblevenshtein;3.0.0 in central
	found com.google.code.findbugs#annotations;3.0.1 in central
	found net.jcip#jcip-annotations;1.0 in central
	found com.google.code.findbugs#jsr305;3.0.1 in central
	found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
	found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
	found com.google.code.gson#gson;2.3 in central
	found it.unimi.dsi#fastu

22/11/18 15:52:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/18 15:52:14 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark NLP version 4.2.0
Apache Spark version: 3.3.0


In [6]:
stopwords = StopWordsCleaner().getStopWords()

22/11/18 16:08:07 WARN StopWordsCleaner: Default locale set was [zh_KR_#Hans]; however, it was not found in available locales in JVM, falling back to en_US locale. Set param `locale` in order to respect another locale.


In [7]:
stopwords[:5]

['i', 'me', 'my', 'myself', 'we']

## YAKE Keyword Extractor

YAKE is an unsupervised, corpus-independent, domain- and language-independent keyword extraction algorithm that works on single documents.

Extracting keywords from text has become harder for individuals and organizations as information grows in volume and complexity. The need to automate this task, so that text can be processed promptly and adequately, has led to automatic keyword extraction tools. YAKE is a feature-based system for multilingual keyword extraction that supports texts of different sizes, domains, and languages. Unlike other approaches, YAKE relies on neither dictionaries nor thesauri, and it is not trained on any corpus. Instead, it follows an unsupervised approach built on features extracted from the text itself, which makes it applicable to documents in different languages without further knowledge. This helps in the many tasks and situations where access to training corpora is limited or restricted.


The algorithm uses the position of each sentence and token. The text must therefore pass through a sentence detector and then a tokenizer before it reaches the annotator.

You can tweak the following parameters to get the best result from the annotator.

- *setMinNGrams(int)* Minimum number of tokens in an extracted keyword
- *setMaxNGrams(int)* Maximum number of tokens in an extracted keyword
- *setNKeywords(int)* Number of top keywords to extract
- *setStopWords(list)* List of stop words
- *setThreshold(float)* Upper bound for the keyword score. Each keyword receives a score greater than 0, and a lower score means a better keyword.
- *setWindowSize(int)* Window size for the co-occurrence matrix that YAKE builds. For example, windowSize=2 looks at the two words to the left and the two words to the right of a candidate word.
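
To make the window size concrete, here is a minimal pure-Python sketch of windowed co-occurrence counting. It only illustrates the idea and is not the code that Spark NLP or YAKE runs internally; the function name `cooccurrence_counts` and the sample tokens are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(tokens, window_size=2):
    """Count ordered (word, neighbour) pairs that lie within
    `window_size` positions of each other. Illustrative only."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                counts[(word, tokens[j])] += 1
    return counts

# Hypothetical token sequence for the demonstration
tokens = ["outcry", "against", "sodom", "and", "gomorrah"]
counts = cooccurrence_counts(tokens, window_size=2)

# Neighbours of "sodom": two words to the left and two to the right
neighbours = sorted(n for (w, n) in counts if w == "sodom")
print(neighbours)
```

With `window_size=1` the same call would only pair "sodom" with "against" and "and", so a larger window spreads co-occurrence evidence over more distant words.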


<b>References</b>

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. Information Sciences, Elsevier, Vol. 509, pp. 257-289. [pdf](https://doi.org/10.1016/j.ins.2019.09.013)

In [8]:
document = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")

sentenceDetector = SentenceDetector() \
            .setInputCols("document") \
            .setOutputCol("sentence")

token = Tokenizer() \
            .setInputCols("sentence") \
            .setOutputCol("token") \
            .setContextChars(["(", ")", "?", "!", ".", ","])

keywords = YakeKeywordExtraction() \
            .setInputCols("token") \
            .setOutputCol("keywords") \
            .setMinNGrams(1) \
            .setMaxNGrams(3)\
            .setNKeywords(20)\
            .setStopWords(stopwords)

yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])

empty_df = spark.createDataFrame([['']]).toDF("text")

yake_Model = yake_pipeline.fit(empty_df)



In [10]:
# LightPipeline

light_model = LightPipeline(yake_Model)

text = '''
Then the LORD said, "The outcry against Sodom and Gomorrah is so great and their sin so grievous'''

light_result = light_model.fullAnnotate(text)[0]

[(s.metadata['sentence'], s.result) for s in light_result['sentence']]

[('0',
  'Then the LORD said, "The outcry against Sodom and Gomorrah is so great and their sin so grievous')]

In [11]:
light_result.keys()


dict_keys(['document', 'sentence', 'token', 'keywords'])

In [20]:
# df = spark.createDataFrame(data=light_result,schema=[['document', 'sentence', 'token', 'keywords']])
# df.printSchema()
# df.show(truncate=False)


In [12]:
import pandas as pd

keys_df = pd.DataFrame([(k.result, k.begin, k.end, k.metadata['score'],  k.metadata['sentence']) for k in light_result['keywords']],
                       columns = ['keywords','begin','end','score','sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# ordered by relevance 
keys_df.sort_values(['sentence','score']).head(30)

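| | keywords | begin | end | score | sentence |
|---|---|---|---|---|---|
| 7 | grievous | 89 | 96 | 0.393326 | 0 |
| 8 | lord said | 10 | 18 | 0.440864 | 0 |
| 0 | lord | 10 | 13 | 0.47587 | 0 |
| 3 | sodom | 41 | 45 | 0.47587 | 0 |
| 4 | gomorrah | 51 | 58 | 0.47587 | 0 |
| 1 | said | 15 | 18 | 0.642974 | 0 |
| 2 | outcry | 26 | 31 | 0.642974 | 0 |
| 5 | great | 66 | 70 | 0.642974 | 0 |
| 6 | sin | 82 | 84 | 0.642974 | 0 |
| 10 | sodom and gomorrah | 41 | 58 | 0.907923 | 0 |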
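
Because a lower score means a better keyword, a score cut-off and a top-N limit can also be applied after extraction. The sketch below uses the scores from the output above. `select_keywords` is a hypothetical helper, and treating the threshold as a strict upper bound is an assumption rather than documented annotator behaviour.

```python
# (keyword, score) pairs copied from the output above
scored = [
    ("grievous", 0.393326),
    ("lord said", 0.440864),
    ("lord", 0.47587),
    ("sodom", 0.47587),
    ("gomorrah", 0.47587),
    ("said", 0.642974),
    ("outcry", 0.642974),
    ("great", 0.642974),
    ("sin", 0.642974),
    ("sodom and gomorrah", 0.907923),
]

def select_keywords(scored, threshold=None, top_n=None):
    """Keep keywords scoring below `threshold` (lower is better),
    then return at most `top_n` of them, best first."""
    kept = [(k, s) for k, s in scored if threshold is None or s < threshold]
    kept.sort(key=lambda pair: pair[1])  # stable sort keeps ties in input order
    return kept if top_n is None else kept[:top_n]

below_half = [k for k, _ in select_keywords(scored, threshold=0.5)]
best_three = [k for k, _ in select_keywords(scored, top_n=3)]
print(below_half)
print(best_three)
```

In the pipeline itself the same effect would come from `setThreshold` and `setNKeywords`, which avoids carrying weak candidates through the rest of the job.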


### Getting keywords from a DataFrame

In [13]:
# Read the CSV without a header row; Spark names the columns _c0 to _c3
df = spark.read.csv("../data/bibleNIV.csv")

df.printSchema()

                                                                                

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)



In [14]:
df2=df.selectExpr('_c3 as text')
df2.show()

+--------------------+
|                text|
+--------------------+
|In the beginning ...|
|Now the earth was...|
|And God said, "Le...|
|God saw that the ...|
|God called the li...|
|And God said, "Le...|
|So God made the e...|
|God called the ex...|
|And God said, "Le...|
|God called the dr...|
|Then God said, "L...|
|The land produced...|
|And there was eve...|
|And God said, "Le...|
|and let them be l...|
|God made two grea...|
|God set them in t...|
|to govern the day...|
|And there was eve...|
|And God said, "Le...|
+--------------------+
only showing top 20 rows



In [15]:
result = yake_pipeline.fit(df2).transform(df2)

In [16]:
result = result.withColumn('unique_keywords', F.array_distinct("keywords.result"))

In [17]:
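def highlight(text, keywords):
    # Match every keyword in one pass (longest first) so inserted HTML is never re-matched
    if not text or not keywords:
        return text
    pattern = '|'.join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    return re.sub(r'\b(%s)\b' % pattern, r'<span style="background-color: yellow;">\1</span>',
                  text, flags=re.IGNORECASE)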

In [18]:
highlight_udf = udf(highlight, StringType())


In [19]:
result = result.withColumn("highlighted_keywords",highlight_udf('text','unique_keywords'))

In [16]:
result.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true

In [44]:
result.write.json('bibleKeyword.json')

In [20]:
pandas_df=result.selectExpr('text','unique_keywords').pandas_api()
pandas_df.head()

                                                                                

| | text | unique_keywords |
|---|---|---|
| 0 | In the beginning God created the heavens and t... | [beginning, god, created, heavens, earth, begi... |
| 1 | Now the earth was formless and empty, darkness... | [earth, formless, empty, darkness, surface, de... |
| 2 | And God said, "Let there be light," and there ... | [god, said, light, god said] |
| 3 | God saw that the light was good, and He separa... | [god, saw, light, good, separated, darkness, g... |
| 4 | God called the light "day," and the darkness h... | [god, called, light, darkness, evening, first,... |


In [21]:
result_db=result.selectExpr('text','unique_keywords')

In [23]:
# pgDF=spark.read.format("jdbc").\
#     option("url", "jdbc:postgresql://192.168.1.39:5432/aiknowledge").\
#     option("dbtable", "public.articles_articles").\
#     option("user", "postgres").\
#     option("password", "postgres").\
#     option("driver", "org.postgresql.Driver").load()    
result_db.write.format("jdbc")\
    .option("url", "jdbc:postgresql://192.168.1.39:5432/aiknowledge")\
    .option("dbtable", "public.bible")\
    .option("user", "postgres")\
    .option("password", "postgres")\
    .option("driver", "org.postgresql.Driver") \
    .mode("append").save()
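
The connection settings above are hard-coded in the cell. A minimal sketch of reading them from the environment instead is shown below. The variable names `PG_URL`, `PG_USER` and `PG_PASSWORD` and the helper `jdbc_options` are hypothetical choices for this example, not part of Spark or the original notebook.

```python
import os

def jdbc_options(env=os.environ):
    """Build the JDBC option dict from environment variables.
    PG_URL, PG_USER and PG_PASSWORD are hypothetical names for this sketch."""
    return {
        "url": env["PG_URL"],
        "dbtable": "public.bible",
        "user": env["PG_USER"],
        "password": env["PG_PASSWORD"],
        "driver": "org.postgresql.Driver",
    }

# Dry run with an explicit mapping so the example works without real credentials
opts = jdbc_options({
    "PG_URL": "jdbc:postgresql://192.168.1.39:5432/aiknowledge",
    "PG_USER": "postgres",
    "PG_PASSWORD": "example-password",
})
print(sorted(opts))

# Usage with the DataFrame above (needs a reachable database):
# result_db.write.format("jdbc").options(**jdbc_options()).mode("append").save()
```

A missing variable raises `KeyError` immediately, which is usually easier to diagnose than a failed connection later in the job.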

In [32]:
for r in result.select("highlighted_keywords").limit(10).collect():
    display(HTML(r.highlighted_keywords))
    print("\n\n")

In [22]:
from neo4j import GraphDatabase
import time
from tqdm import tqdm

In [28]:
class Neo4jConnection:
    
    def __init__(self, uri, user, pwd):
        
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [23]:
from py2neo import Graph
graph = Graph("neo4j://localhost:7687", auth=("daniel", "fighting"),name="aiknowledge")
graph.run("UNWIND range(1, 3) AS n RETURN n, n * n as n_sq")

| n | n_sq |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |


In [26]:
keywords=result.selectExpr("keywords.result as keyword")

In [43]:
# Flatten the per-row keyword arrays into one Python list of strings
keywords_bible=keywords.rdd.flatMap(lambda row: row.keyword).collect()


                                                                                

In [51]:
# keywords_bible=set(keywords_bible)
keywords_bible2=list(set(' '.join(keywords_bible).split()))
len(keywords_bible2)

13339
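
The join-and-split step above breaks every multi-word keyword into single words before deduplicating. The small sketch below, on a hypothetical sample list, shows the effect next to an alternative that keeps each n-gram intact.

```python
# Hypothetical sample; the real list comes from the collected keywords
sample = ["god", "god said", "light", "god said", "sodom and gomorrah"]

# Same logic as the cell above: join everything, split on whitespace, deduplicate
single_words = set(" ".join(sample).split())

# Alternative: deduplicate whole phrases so n-grams stay intact
phrases = set(sample)

print(sorted(single_words))
print(sorted(phrases))
```

Splitting brings back words such as "and" that only occurred inside a longer keyword, so the word-level set can contain stop words even though none was extracted on its own.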

In [52]:
from py2neo import Graph, Node, Relationship
for word in keywords_bible2:
    graph.create(Node("bible",name = word))
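
Creating one node per call means one round trip per word, which adds up for about 13,000 words. A possible alternative is to send the names in batches through a parameterised `UNWIND` statement. The sketch below assumes a py2neo-style `run(cypher, **params)` method; `chunked`, `load_keywords` and `RecordingGraph` are hypothetical helpers, and `MERGE` is used instead of `CREATE` so that re-running does not duplicate nodes, which differs from the loop above.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def load_keywords(graph, words, batch_size=1000):
    """Send one parameterised Cypher statement per batch of words.
    `graph` is any object with a py2neo-style run(cypher, **params) method."""
    query = "UNWIND $names AS name MERGE (:bible {name: name})"
    batches = 0
    for batch in chunked(words, batch_size):
        graph.run(query, names=batch)
        batches += 1
    return batches

class RecordingGraph:
    """Stand-in for a dry run; records calls instead of contacting Neo4j."""
    def __init__(self):
        self.calls = []

    def run(self, cypher, **params):
        self.calls.append((cypher, params))

dry = RecordingGraph()
n_batches = load_keywords(dry, ["god", "light", "earth", "sin", "lord"], batch_size=2)
print(n_batches, [c[1]["names"] for c in dry.calls])

# Usage with the objects defined above (needs a running Neo4j instance):
# load_keywords(graph, keywords_bible2)
```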

In [25]:
const_ners = 'CREATE CONSTRAINT ners IF NOT EXISTS ON (n:NER) ASSERT n.name IS UNIQUE'
graph.run(const_ners)