

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_FINDER_EN.ipynb)




# **Find words/phrases in text using word and regex matching**

**Demo of the following annotators:**


* TextMatcher
* RegexMatcher

## 1. Colab Setup

In [None]:
# Install java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!java -version

# Install pyspark
!pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP (the release must be compatible with the PySpark version pinned above)
!pip install --ignore-installed spark-nlp

In [None]:
import pandas as pd
import numpy as np
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## 2. Start Spark Session

In [None]:
spark = sparknlp.start()

## 3. Select annotator and re-run the cells below

In [None]:
#MODEL_NAME='TextMatcher'
MODEL_NAME='RegexMatcher'

## 4. Create sample texts and the regex/string matching queries

In [None]:
# Sample texts to search, followed by the exact-match and regex queries
text_list = ["""Quantum computing is the use of quantum-mechanical phenomena such as superposition and entanglement to perform computation. Computers that perform quantum computations are known as quantum computers. Quantum computers are believed to be able to solve certain computational problems, such as integer factorization (which underlies RSA encryption), substantially faster than classical computers. The study of quantum computing is a subfield of quantum information science. Quantum computing began in the early 1980s, when physicist Paul Benioff proposed a quantum mechanical model of the Turing machine. Richard Feynman and Yuri Manin later suggested that a quantum computer had the potential to simulate things that a classical computer could not. In 1994, Peter Shor developed a quantum algorithm for factoring integers that had the potential to decrypt RSA-encrypted communications. Despite ongoing experimental progress since the late 1990s, most researchers believe that "fault-tolerant quantum computing is still a rather distant dream." In recent years, investment into quantum computing research has increased in both the public and private sector. On 23 October 2019, Google AI, in partnership with the U.S. National Aeronautics and Space Administration (NASA), published a paper in which they claimed to have achieved quantum supremacy. While some have disputed this claim, it is still a significant milestone in the history of quantum computing.""",
             """Instacart has raised a new round of financing that makes it one of the most valuable private companies in the U.S., leapfrogging DoorDash, Palantir and Robinhood. Amid surging demand for grocery delivery due to the coronavirus pandemic, Instacart has raised $225 million in a new funding round led by DST Global and General Catalyst. The round increases Instacart’s valuation to $13.7 billion, up from $8 billion when it last raised money in 2018.""",
            ]

exact_matches = ['Quantum', 'million', 'payments', 'index', 'market share', 'gap', 'market', 'measure', 'aspects', 'accounts', 'king' ]

# Raw strings keep the regex backslashes literal and avoid invalid-escape warnings
regex_rules = [r"Quantum\s\w+", r"million\s\w+", r"John\s\w+, followed by leader", r"payment.*?\s", r"rall.*?\s", r"\d\d\d\d", r"\d+ Years"]
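
Before running the full pipeline, it can help to preview what a few of these rules match. The sketch below uses Python's built-in `re` module rather than Spark NLP, so it is only an approximation: Spark NLP evaluates the rules with Java regular expressions, whose syntax overlaps with Python's for simple patterns like these but is not identical. The `sample` sentence and the `preview_matches` helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical sample sentence; any entry of text_list could be used instead.
sample = ("Quantum computing began in the early 1980s. "
          "Instacart has raised $225 million in a new round.")

# A subset of the rules defined above, written as raw strings.
rules = [r"Quantum\s\w+", r"million\s\w+", r"\d\d\d\d"]

def preview_matches(text, patterns):
    """Return {pattern: [matched substrings]} using Python's re module."""
    return {p: [m.group(0) for m in re.finditer(p, text)] for p in patterns}

preview = preview_matches(sample, rules)
for pattern, hits in preview.items():
    print(pattern, "->", hits)
```

Rules that match nothing in the sample simply map to an empty list, which makes it easy to spot patterns that need adjusting.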


## 5. Save the queries in separate files

In [None]:
# Write one query per line to the file read by the selected annotator
if MODEL_NAME == 'TextMatcher':
  with open('text_to_match.txt', 'w') as f:
    for i in exact_matches:
      f.write(i + '\n')
else:
  with open('regex_to_match.txt', 'w') as f:
    for i in regex_rules:
      f.write(i + '\n')
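
RegexMatcher's external rules file is documented as holding one rule per line in the form `pattern<delimiter>identifier`. The sketch below writes and re-reads such a file with plain Python, assuming that format and the `,` delimiter used later in this notebook. The identifiers, the file name `labelled_regex_to_match.txt`, and the helpers `write_rules` and `read_rules` are hypothetical and only illustrate the layout; they are not part of Spark NLP.

```python
import os
import tempfile

DELIMITER = ","  # same delimiter this notebook passes to setExternalRules

# Hypothetical (pattern, identifier) pairs; the identifiers are illustrative only.
labelled_rules = [
    (r"Quantum\s\w+", "quantum_phrase"),
    (r"\d\d\d\d", "four_digit_number"),
]

def write_rules(path, rules, delimiter=DELIMITER):
    """Write one 'pattern<delimiter>identifier' line per rule."""
    with open(path, "w") as f:
        for pattern, identifier in rules:
            # A delimiter inside the pattern would make the line ambiguous.
            if delimiter in pattern:
                raise ValueError(f"pattern contains the delimiter: {pattern!r}")
            f.write(f"{pattern}{delimiter}{identifier}\n")

def read_rules(path, delimiter=DELIMITER):
    """Read the file back into (pattern, identifier) tuples."""
    with open(path) as f:
        return [tuple(line.rstrip("\n").split(delimiter, 1))
                for line in f if line.strip()]

rules_path = os.path.join(tempfile.mkdtemp(), "labelled_regex_to_match.txt")
write_rules(rules_path, labelled_rules)
round_trip = read_rules(rules_path)
print(round_trip)
```

Under this format, a rule that itself contains a comma, such as `John\s\w+, followed by leader`, would be split at the comma, so a different delimiter is the safer choice whenever patterns need commas.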

## 6. Define Spark NLP pipeline

In [None]:
# Wrap the raw text column in Spark NLP's document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

if MODEL_NAME == 'TextMatcher':
  # TextMatcher works on tokens, so the document is tokenized first
  tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")
  # Match the phrases listed in text_to_match.txt, ignoring case
  text_matcher = TextMatcher()\
    .setInputCols(["document", "token"])\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setEntities(path="text_to_match.txt")

  nlpPipeline = Pipeline(stages=[documentAssembler,
                                 tokenizer,
                                 text_matcher
                                 ])
else:
  # RegexMatcher runs directly on the document; MATCH_ALL returns every match of every rule
  regex_matcher = RegexMatcher()\
    .setInputCols(["document"])\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("matched_text")\
    .setExternalRules(path='regex_to_match.txt', delimiter=',')

  nlpPipeline = Pipeline(stages=[documentAssembler,
                                 regex_matcher
                                 ])
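
The `setStrategy` call above selects `MATCH_ALL`; RegexMatcher also documents `MATCH_FIRST` and `MATCH_COMPLETE`. As a rough analogy only, the three strategies can be sketched with Python's `re` module. The `match_with_strategy` helper and the `sentence` value are hypothetical, and the sketch does not reproduce Spark NLP's Java regex engine or its annotation output.

```python
import re

def match_with_strategy(pattern, text, strategy="MATCH_ALL"):
    """Approximate the three RegexMatcher strategies with Python's re module."""
    compiled = re.compile(pattern)
    if strategy == "MATCH_ALL":
        # Every non-overlapping match in the text
        return [m.group(0) for m in compiled.finditer(text)]
    if strategy == "MATCH_FIRST":
        # Only the first match, if any
        first = compiled.search(text)
        return [first.group(0)] if first else []
    if strategy == "MATCH_COMPLETE":
        # A match only when the whole text matches the pattern
        full = compiled.fullmatch(text)
        return [full.group(0)] if full else []
    raise ValueError(f"unknown strategy: {strategy}")

sentence = "Shor published in 1994 and Google followed in 2019."
for name in ("MATCH_ALL", "MATCH_FIRST", "MATCH_COMPLETE"):
    print(name, "->", match_with_strategy(r"\d\d\d\d", sentence, name))
```

For a demo that lists every hit in a long passage, `MATCH_ALL` is the natural choice; `MATCH_COMPLETE` is more useful when each document is a short field such as a date or an ID.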

## 7. Run the pipeline

In [None]:
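# These annotators need no training data, so fitting on an empty DataFrame
# is enough to turn the Pipeline into a PipelineModel
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

# One row per sample text; transform() adds the annotation columns
df = spark.createDataFrame(pd.DataFrame({'text':text_list}))
result = pipelineModel.transform(df)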

## 8. Visualize results

In [None]:
result.select(F.explode(F.arrays_zip('matched_text.result', 'matched_text.metadata')).alias("cols")) \
.select(
        F.expr("cols['0']").alias("Matches Found"),
        F.expr("cols['1']['identifier']").alias("matching_regex/string"),
        ).show(truncate=False)
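
The `arrays_zip` plus `explode` combination above pairs each matched string with its metadata and emits one output row per pair. A plain-Python sketch of the same reshaping, with no Spark involved, looks like this. The `annotated_rows` data and the `explode_matches` helper are hypothetical stand-ins for the `matched_text.result` and `matched_text.metadata` arrays, and the identifier values are illustrative.

```python
# Hypothetical annotation output: one dict per input text
annotated_rows = [
    {"result": ["Quantum computing", "1980"],
     "metadata": [{"identifier": "quantum_phrase"},
                  {"identifier": "four_digit_number"}]},
    {"result": [], "metadata": []},  # a text with no matches
]

def explode_matches(rows):
    """Pair each match with its identifier and flatten to one tuple per match."""
    flattened = []
    for row in rows:
        for match, meta in zip(row["result"], row["metadata"]):
            flattened.append((match, meta.get("identifier")))
    return flattened

table = explode_matches(annotated_rows)
for match, identifier in table:
    print(f"{match!r:25} {identifier}")
```

As with `explode`, a text that produced no matches contributes no rows to the final table.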