# Tools for NLP

Text data needs several feature transformations before machine learning algorithms can work with it. Luckily, Spark provides the most important ones as convenient feature transformers.

### Creating a Spark Session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("nlp_basics").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577798857679)
SparkSession available as 'spark'


2019-12-31 18:57:53 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@3a277130


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


## Tokenizer and Regex Tokenizer
<p><a href="http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization">Tokenization</a> is the process of taking text (such as a sentence) and breaking it into individual terms (usually words).  A simple <a href="api/scala/index.html#org.apache.spark.ml.feature.Tokenizer">Tokenizer</a> class provides this functionality.  The example below shows how to split sentences into sequences of words.</p>

<p><a href="api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer">RegexTokenizer</a> allows more
 advanced tokenization based on regular expression (regex) matching.
 By default, the parameter &#8220;pattern&#8221; (regex, default: <code>"\\s+"</code>) is used as the delimiter to split the input text.
 Alternatively, users can set the parameter &#8220;gaps&#8221; to false, indicating that the regex &#8220;pattern&#8221; denotes
 &#8220;tokens&#8221; rather than splitting gaps, so that all matching occurrences are returned as the tokenization result.</p>
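The two modes can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Spark's implementation: the object `RegexModes` and its two helpers are hypothetical names, and the lowercasing step is an assumption made to mirror the lowercase tokens that appear in the outputs of this notebook.

```scala
// Plain-Scala sketch of the two RegexTokenizer modes (hypothetical helper names).
object RegexModes {
  // gaps = true: the pattern describes the separators, so the text is split on it.
  def splitOnGaps(text: String, pattern: String): List[String] =
    text.toLowerCase.split(pattern).filter(_.nonEmpty).toList

  // gaps = false: the pattern describes the tokens, so every match is collected.
  def matchTokens(text: String, pattern: String): List[String] =
    pattern.r.findAllIn(text.toLowerCase).toList

  def main(args: Array[String]): Unit = {
    val sentence = "Logistic,regression,models,are,neat"
    println(splitOnGaps(sentence, "\\s+").size) // no whitespace, so one token
    println(splitOnGaps(sentence, "\\W").size)  // split on non-word characters
    println(matchTokens(sentence, "\\w+").size) // collect runs of word characters
  }
}
```

Splitting on `\\W` and matching `\\w+` give the same tokens for this sentence; which form reads better depends on whether the separators or the tokens are easier to describe.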

#### Imports

In [3]:
import org.apache.spark.ml.feature.{Tokenizer,RegexTokenizer}
import org.apache.spark.sql.types.IntegerType

import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}
import org.apache.spark.sql.types.IntegerType


#### Creating a DataFrame

In [5]:
val sentenceDataFrame = spark.createDataFrame(Array((0,"Hi I heard about Spark"),
                                                    (1, "I wish Java could use case classes"),
                                                    (2, "Logistic,regression,models,are,neat")))
                              .toDF("id", "sentence")

sentenceDataFrame: org.apache.spark.sql.DataFrame = [id: int, sentence: string]


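Alternatively, build the same DataFrame from a `Seq` with `toDF` (outside a notebook, `toDF` requires `import spark.implicits._`):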

In [6]:
val sentenceDataFrame = Seq((0,"Hi I heard about Spark"),
                            (1, "I wish Java could use case classes"),
                            (2, "Logistic,regression,models,are,neat"))
                        .toDF("id", "sentence")

sentenceDataFrame: org.apache.spark.sql.DataFrame = [id: int, sentence: string]


In [7]:
sentenceDataFrame.show(false)

+---+-----------------------------------+
|id |sentence                           |
+---+-----------------------------------+
|0  |Hi I heard about Spark             |
|1  |I wish Java could use case classes |
|2  |Logistic,regression,models,are,neat|
+---+-----------------------------------+



### Creating a UDF `countTokens` that returns the number of tokens

In [8]:
import org.apache.spark.sql.functions.{udf,col}

import org.apache.spark.sql.functions.{udf, col}


In [9]:
val countTokens = udf((words: Seq[String]) => words.length)

countTokens: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(ArrayType(StringType,true))))


### Tokenizer

In [10]:
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_adeadb01950c


In [11]:
val tokenized_df = tokenizer.transform(sentenceDataFrame)

tokenized_df: org.apache.spark.sql.DataFrame = [id: int, sentence: string ... 1 more field]


In [12]:
tokenized_df.printSchema

root
 |-- id: integer (nullable = false)
 |-- sentence: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [13]:
tokenized_df.show(false)

+---+-----------------------------------+------------------------------------------+
|id |sentence                           |words                                     |
+---+-----------------------------------+------------------------------------------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|2  |Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |
+---+-----------------------------------+------------------------------------------+



### Regex Tokenizer

In [14]:
val regexTokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("words").setPattern("\\W")

regexTokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_c691ce66e77a


In [15]:
val regexTokenized_df = regexTokenizer.transform(sentenceDataFrame)

regexTokenized_df: org.apache.spark.sql.DataFrame = [id: int, sentence: string ... 1 more field]


In [16]:
regexTokenized_df.show(false)

+---+-----------------------------------+------------------------------------------+
|id |sentence                           |words                                     |
+---+-----------------------------------+------------------------------------------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|2  |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |
+---+-----------------------------------+------------------------------------------+



### Comparing both results

In [17]:
println("Using Tokenizer:")
tokenized_df.select("sentence","words").withColumn("tokens",countTokens(col("words"))).show(false)
println("Using Regex Tokenizer:")
regexTokenized_df.select("sentence","words").withColumn("tokens",countTokens(col("words"))).show(false)

Using Tokenizer:
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+

Using Regex Tokenizer:
+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
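|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+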

From the output above, the token count for the third sentence is 1 when Tokenizer is used, because Tokenizer splits only on whitespace and therefore treats the comma-separated sentence as a single token. With Regex Tokenizer and the pattern `\\W`, the same sentence is split into 5 tokens. Regex Tokenizer is therefore the better choice when tokens are separated by something other than whitespace.


## Stop Words Removal

<p><a href="https://en.wikipedia.org/wiki/Stop_words">Stop words</a> are words which
should be excluded from the input, typically because the words appear
frequently and don&#8217;t carry as much meaning.</p>

<p><code>StopWordsRemover</code> takes as input a sequence of strings (e.g. the output
of a <a href="ml-features.html#tokenizer">Tokenizer</a>) and drops all the stop
words from the input sequences. The list of stopwords is specified by
the <code>stopWords</code> parameter. Default stop words for some languages are accessible 
by calling <code>StopWordsRemover.loadDefaultStopWords(language)</code>, for which available 
options are &#8220;danish&#8221;, &#8220;dutch&#8221;, &#8220;english&#8221;, &#8220;finnish&#8221;, &#8220;french&#8221;, &#8220;german&#8221;, &#8220;hungarian&#8221;, 
&#8220;italian&#8221;, &#8220;norwegian&#8221;, &#8220;portuguese&#8221;, &#8220;russian&#8221;, &#8220;spanish&#8221;, &#8220;swedish&#8221; and &#8220;turkish&#8221;. 
A boolean parameter <code>caseSensitive</code> indicates if the matches should be case sensitive 
(false by default).</p>
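The effect of `caseSensitive` can be sketched in plain Scala. This is only an illustration of the filtering idea, not Spark's code: `StopWordFilter`, `remove` and the four-word `sampleStopWords` set are hypothetical, and the set is far smaller than Spark's default English list.

```scala
// Plain-Scala sketch of stop word filtering (hypothetical names and word list).
object StopWordFilter {
  // Tiny hand-picked list for illustration only; not Spark's default stop words.
  val sampleStopWords: Set[String] = Set("i", "the", "had", "a")

  // Drops every token found in stopWords; matching ignores case unless caseSensitive is true.
  def remove(tokens: Seq[String],
             stopWords: Set[String],
             caseSensitive: Boolean = false): Seq[String] =
    if (caseSensitive) tokens.filterNot(stopWords.contains)
    else {
      val lowered = stopWords.map(_.toLowerCase)
      tokens.filterNot(token => lowered.contains(token.toLowerCase))
    }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("I", "saw", "the", "red", "balloon")
    println(remove(tokens, sampleStopWords))                       // "I" and "the" are dropped
    println(remove(tokens, sampleStopWords, caseSensitive = true)) // "I" survives, "the" is dropped
  }
}
```

With case-sensitive matching, the capitalised "I" no longer matches the lowercase entry "i", which is why case-insensitive matching is usually the safer default for untokenized-case input.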

In [18]:
import org.apache.spark.ml.feature.StopWordsRemover

import org.apache.spark.ml.feature.StopWordsRemover


In [19]:
val sentenceData = Seq((0, Seq("I", "saw", "the", "red", "balloon")),(1, Seq("Mary", "had", "a", "little", "lamb"))).toDF("id", "raw")

sentenceData: org.apache.spark.sql.DataFrame = [id: int, raw: array<string>]


In [20]:
sentenceData.show(false)

+---+----------------------------+
|id |raw                         |
+---+----------------------------+
|0  |[I, saw, the, red, balloon] |
|1  |[Mary, had, a, little, lamb]|
+---+----------------------------+



In [21]:
val remover = new StopWordsRemover().setInputCol("raw").setOutputCol("filtered")

remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_62a505a8c483


In [22]:
remover.transform(sentenceData).show(false)

+---+----------------------------+--------------------+
|id |raw                         |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+



## n-grams

An n-gram is a sequence of $n$ tokens (typically words) for some integer $n$. The NGram class can be used to transform input features into $n$-grams.

<p><code>NGram</code> takes as input a sequence of strings (e.g. the output of a <a href="ml-features.html#tokenizer">Tokenizer</a>).  The parameter <code>n</code> is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words.  If the input sequence contains fewer than <code>n</code> strings, no output is produced.</p>
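The behaviour described above maps closely onto Scala's `sliding` on collections. Below is a minimal plain-Scala sketch under that assumption; `NGramSketch.ngrams` is a hypothetical helper, not part of Spark, and it expects `n` to be at least 1.

```scala
// Plain-Scala sketch of n-gram generation (hypothetical helper, requires n >= 1).
object NGramSketch {
  // Slides a window of n tokens over the input and joins each full window with spaces.
  // Windows shorter than n (input has fewer than n tokens) are discarded.
  def ngrams(tokens: Seq[String], n: Int): List[String] =
    tokens.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList

  def main(args: Array[String]): Unit = {
    println(ngrams(Seq("Hi", "I", "heard", "about", "Spark"), 2))
    println(ngrams(Seq("Hi"), 2)) // fewer than n tokens, so the result is empty
  }
}
```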


In [23]:
import org.apache.spark.ml.feature.NGram

import org.apache.spark.ml.feature.NGram


In [24]:
val wordDataFrame = Seq((0, Seq("Hi", "I", "heard", "about", "Spark")),
                        (1, Seq("I", "wish", "Java", "could", "use", "case", "classes")),
                        (2, Seq("Logistic", "regression", "models", "are", "neat")))
                    .toDF("id", "words")

wordDataFrame: org.apache.spark.sql.DataFrame = [id: int, words: array<string>]


In [25]:
wordDataFrame.show(false)

+---+------------------------------------------+
|id |words                                     |
+---+------------------------------------------+
|0  |[Hi, I, heard, about, Spark]              |
|1  |[I, wish, Java, could, use, case, classes]|
|2  |[Logistic, regression, models, are, neat] |
+---+------------------------------------------+



In [26]:
val ngram = new NGram().setInputCol("words").setOutputCol("ngrams").setN(2)

ngram: org.apache.spark.ml.feature.NGram = ngram_96a7302d32d4


In [27]:
ngram.transform(wordDataFrame).select("ngrams").show(false)

+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
+------------------------------------------------------------------+



_______
# Feature Extractors
_______

<h2 id="tf-idf">TF-IDF</h2>

<p><a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">Term frequency-inverse document frequency (TF-IDF)</a> 
is a feature vectorization method widely used in text mining to reflect the importance of a term 
to a document in the corpus. Denote a term by <code>$t$</code>, a document by <code>$d$</code>, and the corpus by <code>$D$</code>.
Term frequency <code>$TF(t, d)$</code> is the number of times that term <code>$t$</code> appears in document <code>$d$</code>, while 
document frequency <code>$DF(t, D)$</code> is the number of documents that contain term <code>$t$</code>. If we only use 
term frequency to measure the importance, it is very easy to over-emphasize terms that appear very 
often but carry little information about the document, e.g. &#8220;a&#8221;, &#8220;the&#8221;, and &#8220;of&#8221;. If a term appears 
very often across the corpus, it means it doesn&#8217;t carry special information about a particular document.
Inverse document frequency is a numerical measure of how much information a term provides:

$$ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1} $$

where $|D|$ is the total number of documents in the corpus. Since a logarithm is used, if a term 
appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid 
dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
$$ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). $$
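To make the two formulas concrete, here is a small plain-Scala sketch that evaluates them for a corpus of three documents. `IdfSketch` and its method names are hypothetical, and the natural logarithm is assumed for $\log$.

```scala
// Plain-Scala sketch of the IDF and TF-IDF formulas above (hypothetical names).
object IdfSketch {
  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), using the natural logarithm.
  def idf(numDocs: Long, docFreq: Long): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  // TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
  def tfidf(termFreq: Double, numDocs: Long, docFreq: Long): Double =
    termFreq * idf(numDocs, docFreq)

  def main(args: Array[String]): Unit = {
    println(idf(numDocs = 3, docFreq = 1)) // term in one of three documents: log(4/2)
    println(idf(numDocs = 3, docFreq = 2)) // term in two of three documents: log(4/3)
    println(idf(numDocs = 3, docFreq = 3)) // term in every document: log(1) = 0.0
    println(tfidf(termFreq = 2.0, numDocs = 3, docFreq = 1))
  }
}
```

A term that occurs in a single document out of three gets a weight of about 0.693, one that occurs in two documents about 0.288, and one that occurs in all three gets 0.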


#### Tokenizer

In [28]:
import org.apache.spark.ml.feature.{HashingTF,IDF,Tokenizer}

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}


In [29]:
val sentenceData = Seq((0.0, "Hi I heard about Spark"),
                       (0.0, "I wish Java could use case classes"),
                       (1.0, "Logistic regression models are neat"))
                    .toDF("label", "sentence")

sentenceData: org.apache.spark.sql.DataFrame = [label: double, sentence: string]


In [30]:
sentenceData.show(false)

+-----+-----------------------------------+
|label|sentence                           |
+-----+-----------------------------------+
|0.0  |Hi I heard about Spark             |
|0.0  |I wish Java could use case classes |
|1.0  |Logistic regression models are neat|
+-----+-----------------------------------+



In [31]:
val tokenizer1 = new Tokenizer().setInputCol("sentence").setOutputCol("words")

tokenizer1: org.apache.spark.ml.feature.Tokenizer = tok_62d36a36eb1a


In [32]:
val wordsdata = tokenizer1.transform(sentenceData)

wordsdata: org.apache.spark.sql.DataFrame = [label: double, sentence: string ... 1 more field]


In [33]:
wordsdata.show(false)

+-----+-----------------------------------+------------------------------------------+
|label|sentence                           |words                                     |
+-----+-----------------------------------+------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |
+-----+-----------------------------------+------------------------------------------+



#### HashingTF

In [34]:
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

hashingTF: org.apache.spark.ml.feature.HashingTF = hashingTF_2ede4f04bc1f


In [35]:
val featurizedData = hashingTF.transform(wordsdata)
// alternatively, CountVectorizer can also be used to get term frequency vectors

featurizedData: org.apache.spark.sql.DataFrame = [label: double, sentence: string ... 2 more fields]


In [36]:
featurizedData.show(false)

+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence                           |words                                     |rawFeatures                              |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(20,[0,5,9,17],[1.0,1.0,1.0,2.0])        |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+



#### IDF

In [37]:
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

idf: org.apache.spark.ml.feature.IDF = idf_b7ac09462d8c


In [38]:
val idf_model = idf.fit(featurizedData)

idf_model: org.apache.spark.ml.feature.IDFModel = idf_b7ac09462d8c


In [39]:
val rescaled_data = idf_model.transform(featurizedData)

rescaled_data: org.apache.spark.sql.DataFrame = [label: double, sentence: string ... 3 more fields]


In [40]:
rescaled_data.select("label","features").show(false)

+-----+----------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                              |
+-----+----------------------------------------------------------------------------------------------------------------------+
|0.0  |(20,[0,5,9,17],[0.6931471805599453,0.6931471805599453,0.28768207245178085,1.3862943611198906])                        |
|0.0  |(20,[2,7,9,13,15],[0.6931471805599453,0.6931471805599453,0.8630462173553426,0.28768207245178085,0.28768207245178085]) |
|1.0  |(20,[4,6,13,15,18],[0.6931471805599453,0.6931471805599453,0.28768207245178085,0.28768207245178085,0.6931471805599453])|
+-----+----------------------------------------------------------------------------------------------------------------------+



## CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
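The fitting and transformation steps described above can be sketched in plain Scala. This is a simplified illustration, not Spark's implementation: `VocabSketch` and its methods are hypothetical, `minDF` is treated only as an absolute document count (the fractional form is omitted), ties in term frequency are broken alphabetically as an assumption, and the output is a dense vector rather than Spark's sparse one.

```scala
// Plain-Scala sketch of vocabulary selection and counting (hypothetical names).
object VocabSketch {
  // Keeps terms that appear in at least minDF documents, then takes the
  // vocabSize most frequent ones. Alphabetical tie-breaking is an assumption.
  def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Seq[String] = {
    val termCounts = docs.flatten.groupBy(identity).map { case (t, occ) => t -> occ.size }
    val docFreqs = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    termCounts.toSeq
      .filter { case (term, _) => docFreqs(term) >= minDF }
      .sortBy { case (term, count) => (-count, term) }
      .take(vocabSize)
      .map(_._1)
  }

  // Counts each vocabulary term in one document; binary = true clips nonzero counts to 1.
  def countVector(doc: Seq[String], vocab: Seq[String], binary: Boolean = false): Seq[Double] = {
    val counts = doc.groupBy(identity).map { case (t, occ) => t -> occ.size.toDouble }
    vocab.map { term =>
      val count = counts.getOrElse(term, 0.0)
      if (binary && count > 0) 1.0 else count
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq("a b c".split(" ").toSeq, "a b b c a".split(" ").toSeq)
    val vocab = buildVocabulary(docs, vocabSize = 3, minDF = 2)
    println(vocab)
    docs.foreach(doc => println(countVector(doc, vocab)))
  }
}
```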

In [41]:
import org.apache.spark.ml.feature.CountVectorizer

import org.apache.spark.ml.feature.CountVectorizer


In [42]:
// Input data: each row is a bag of words with an ID.

val df = Seq((0, "a b c".split(" ")),
             (1, "a b b c a".split(" ")))
         .toDF("id", "words")

df: org.apache.spark.sql.DataFrame = [id: int, words: array<string>]


In [43]:
df.show(false)

+---+---------------+
|id |words          |
+---+---------------+
|0  |[a, b, c]      |
|1  |[a, b, b, c, a]|
+---+---------------+



#### Fitting a CountVectorizerModel from the corpus

In [44]:
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features").setVocabSize(3).setMinDF(2.0)

cv: org.apache.spark.ml.feature.CountVectorizer = cntVec_2c1c290fb31f


In [45]:
val cv_model = cv.fit(df)

cv_model: org.apache.spark.ml.feature.CountVectorizerModel = cntVec_2c1c290fb31f


In [46]:
val result = cv_model.transform(df)

result: org.apache.spark.sql.DataFrame = [id: int, words: array<string> ... 1 more field]


In [47]:
result.show(false)

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+



### Closing Spark Session

In [48]:
spark.stop()

## Thank You!