# Chapter 16: Text Analytics

This notebook contains the sample source code explained in the book *Hands-On Julia Programming, Sambit Kumar Dash, 2021, bpb Publications. All Rights Reserved.*

In [1]:
using Pkg
Pkg.activate(".")
Pkg.instantiate()

[32m[1m  Activating[22m[39m environment at `~/work/books/HOJP/Chapter-16/Project.toml`


## Text Processing Pipeline

Text processing can be broadly outlined by the diagram below.

![Text Pipeline](414_16_01.png)


## Preprocessing 

Acquisition, tokenization and representation tasks can be loosely considered as preprocessing. 

### Tokenization

Tokenization is the process of breaking down text into smaller units like paragraphs, sentences, words, subwords or characters. Subwords and characters are particularly useful for handling unknown words. However, words are the most commonly used tokens in text. 
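
To make the token granularities concrete, here is a minimal sketch that uses only Base Julia. The sample strings and the `trigrams` helper are illustrative assumptions; character trigrams are only a crude stand-in for real subword schemes.

```julia
text = "Tokenizers handle unknown words"

# Word tokens: split on whitespace
word_tokens = split(text)

# Character tokens: every character becomes a token
char_tokens = collect("words")

# Character trigrams as a rough stand-in for subword units (ASCII input assumed)
trigrams(w) = [w[i:i+2] for i in 1:length(w)-2]
subword_tokens = trigrams("unknown")
```

An unseen word such as "unknown" still shares trigrams like "kno" and "own" with familiar words, which is why the smaller units help with out-of-vocabulary text.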

In [2]:
using WordTokenizers
s1 = """This is a multi-line text. And the sentence splitter
may not do a great job on this. But if no line ending
is provided it works like a charm. Even keeps Dr. Smith 
intact."""
split_sentences(s1)

7-element Vector{SubString{String}}:
 "This is a multi-line text."
 "And the sentence splitter"
 "may not do a great job on this."
 "But if no line ending"
 "is provided it works like a charm."
 "Even keeps Dr. Smith "
 "intact."

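The `WordTokenizers` package provides a rule-based sentence splitter. As the output above shows, it treats each line ending as a sentence boundary, so text without hard line breaks splits more cleanly.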
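
The output above splits at every line ending. If the source text contains hard line breaks, one option is to collapse them before splitting. The sketch below uses only Base Julia; `unwrap` is a hypothetical helper name, and it would also merge paragraphs, which may or may not be what you want.

```julia
# Collapse each line break, together with any spaces or tabs around it, into one space
unwrap(text) = replace(text, r"[ \t]*\n[ \t]*" => " ")

wrapped = "Even keeps Dr. Smith \nintact. But if no line ending\nis provided it works."
unwrapped = unwrap(wrapped)
```

The unwrapped string can then be passed to the sentence splitter in the same way as the single-line text in the next cell.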

In [3]:
s2 = "This is a multi-line text. And the sentence splitter "*
"provided works like a charm. Even keeps Dr. Smith "*
"intact."
split_sentences(s2)

3-element Vector{SubString{String}}:
 "This is a multi-line text."
 "And the sentence splitter provided works like a charm."
 "Even keeps Dr. Smith intact."

You can change the tokenization heuristics by choosing the tokenizer that best suits your purpose. You can also write your own tokenizer.

In [4]:
s3 = "I won't be able to attend the meeting. This isn't possible."
set_tokenizer(poormans_tokenize);
println("Poorman: \t", tokenize(s3))
set_tokenizer(nltk_word_tokenize);
println("NLTK: \t", tokenize(s3))

Poorman: 	SubString{String}["I", "wont", "be", "able", "to", "attend", "the", "meeting", "This", "isnt", "possible"]
NLTK: 	["I", "wo", "n't", "be", "able", "to", "attend", "the", "meeting.", "This", "is", "n't", "possible", "."]
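
As a rough illustration of a hand-written tokenizer, the sketch below keeps contractions whole and emits punctuation as separate tokens. It uses only Base Julia, and `simple_tokenize` is a hypothetical name. We assume, without having tested it here, that such a function could be passed to `set_tokenizer` like the built-in tokenizers above.

```julia
# A word, optionally followed by apostrophe parts ("won't"), or one punctuation mark
simple_tokenize(s) = [String(m.match) for m in eachmatch(r"\w+(?:'\w+)*|[^\w\s]", s)]

tokens = simple_tokenize("I won't be able to attend the meeting. This isn't possible.")
```

Unlike the two tokenizers above, this variant neither drops the apostrophe nor splits the contraction, which shows how much the choice of heuristics shapes the tokens.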


### Word Relevance

Sentences with similar meanings may be composed of different words. Words often carry prefixes or suffixes, and stemming strips some of these off so that sentences can be compared. Similarly, commonly used words such as articles and prepositions may not contribute significantly to the meaning of a sentence, so they can be removed as stop words. Capitalization may not add significant value either. Once these are removed, the remaining words can be compared to find similar sentences.

In [5]:
using TextAnalysis

doc1 = StringDocument("Educate to add value to a person")
prepare!(doc1, strip_case | strip_stopwords | stem_words)
println(doc1.text)

doc2 = StringDocument("Education is valued for a person")
prepare!(doc2, strip_case | strip_stopwords | stem_words)
println(doc2.text)

educ add valu person
educ valu person
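
One simple way to compare the two prepared sentences is the Jaccard overlap of their word sets. This measure is our choice for illustration and is not part of the `TextAnalysis` call above.

```julia
# Jaccard overlap: shared terms divided by all distinct terms
jaccard(a, b) = length(intersect(a, b)) / length(union(a, b))

t1 = Set(split("educ add valu person"))
t2 = Set(split("educ valu person"))
similarity = jaccard(t1, t2)   # 3 shared terms out of 4 distinct terms
```

Before stemming and stop word removal the two raw sentences share far fewer exact words, so the preprocessing is what makes this comparison meaningful.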


Cosine similarity is another way to compare two sentences and find similarities. 

In [6]:
using LinearAlgebra

doc1 = StringDocument("Educate to add value to a person")
doc2 = StringDocument("Education is valued for a person")
doc3 = StringDocument("Education is valuable asset for a human")
c = Corpus([doc1, doc2, doc3])
prepare!(c, strip_case | strip_stopwords | stem_words)
update_lexicon!(c)

println((c[1].text, c[2].text, c[3].text))
tfm =  tf_idf(dtm(c))
cs = Matrix(tfm*tfm')
d = sqrt.(diag(cs))
cs ./ (d*d')

("educ add valu person", "educ valu person", "educ valuabl asset human")


3×3 Matrix{Float64}:
 1.0       0.462709  0.0
 0.462709  1.0       0.0
 0.0       0.0       1.0
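
The last two lines of the cell above divide each dot product by the product of the two vector norms. The same computation for a single pair of vectors can be sketched as follows. The vectors here are raw term counts that we wrote by hand rather than tf-idf weights, so the value differs from the matrix above.

```julia
using LinearAlgebra

# Cosine similarity: dot product normalized by both vector lengths
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Term counts over the terms (add, educ, person, valu)
v1 = [1, 1, 1, 1]   # "educ add valu person"
v2 = [0, 1, 1, 1]   # "educ valu person"
sim = cosine(v1, v2)

# Vectors with no terms in common are orthogonal
no_overlap = cosine([1, 0], [0, 1])
```

With tf-idf weighting, a term that occurs in every document (here "educ") contributes little or nothing, which is one reason the notebook's similarity is lower than the raw-count value.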

### Embeddings

Embeddings are pretty much the de facto standard for representations in neural computing today. A word or token is converted to a vector of numbers that captures the word, its interaction with other words, and the context in which it is used. Word2Vec, GloVe, and Transformer-based models are used to generate embeddings.


<i><b> Note: </b>If you are running this code for the first time, it may download about 1.5 GB of data over the internet and may take hours to complete, depending on your network speed and connectivity. Ensure that you have a reliable connection during this period. Once downloaded, the embeddings are cached for subsequent use. </i>

In [7]:
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
using Embeddings
const et = load_embeddings(Word2Vec)
const w2idx = Dict((w => i for (i, w) in enumerate(et.vocab)))
embedding(w) = et.embeddings[:, w2idx[w]]

embedding (generic function with 1 method)

`embedding("queen") ≈ embedding("king") - embedding("man") + embedding("woman")`

While the rule above seems intuitive, the nearest neighbor of the computed vector is not always the expected word. In the result below, "king" itself scores highest and "queen" is only the second-best match. It is therefore worth collecting several of the nearest neighbors rather than just the top one. Note that `cosdist` in the code holds similarity scores (higher means closer), despite its name.

In [8]:
res = embedding("king") - embedding("man") + embedding("woman")
cosdist = vec(sum(res .* et.embeddings, dims=1))
ids = partialsortperm(cosdist, 1:5, rev=true)

5-element view(::Vector{Int64}, 1:5) with eltype Int64:
  6031
  9970
 21459
 16561
 14961

In [9]:
[(et.vocab[ii], cosdist[ii]) for ii in ids]

5-element Vector{Tuple{String, Float32}}:
 ("king", 0.8990531)
 ("queen", 0.80069506)
 ("monarch", 0.69625)
 ("princess", 0.6639391)
 ("prince", 0.6048718)
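
A common way to deal with the query word ranking first is to exclude the query words from the candidates. The sketch below shows the idea on hypothetical three-dimensional vectors that we made up for illustration. They are not taken from Word2Vec.

```julia
using LinearAlgebra

# Hypothetical toy "embeddings" with dimensions (royalty, male, female)
toy = Dict(
    "king"   => [1.0, 1.0, 0.0],
    "queen"  => [1.0, 0.0, 1.0],
    "man"    => [0.0, 1.0, 0.0],
    "woman"  => [0.0, 0.0, 1.0],
    "prince" => [0.8, 0.9, 0.1],
)
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

query_words = ["king", "man", "woman"]
target = toy["king"] - toy["man"] + toy["woman"]

# Rank every word that is not part of the query by cosine similarity to the target
candidates = sort([w for w in keys(toy) if w ∉ query_words])
scores = [cosine(target, toy[w]) for w in candidates]
best = candidates[argmax(scores)]
```

With real embeddings the same filter can be applied to the indices returned by `partialsortperm` before looking up the vocabulary.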

## Semantic Analysis

Latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) are often used in topic modeling. Intuitively, the first two sentences below are similar while the last one is slightly different. The results of the `lsa` and `lda` methods show the same pattern.

In [10]:
using LinearAlgebra

doc1 = StringDocument("Educate to add value to a person")
doc2 = StringDocument("Education is valued for a person")
doc3 = StringDocument("Education is valuable asset for a human")
c = Corpus([doc1, doc2, doc3])
prepare!(c, strip_case | strip_stopwords | stem_words)
update_lexicon!(c)

println((c[1].text, c[2].text, c[3].text))
lsa(c)

("educ add valu person", "educ valu person", "educ valuabl asset human")


SVD{Float64, Float64, Matrix{Float64}}
U factor:
3×3 Matrix{Float64}:
 0.0  -0.931471  -0.363815
 0.0  -0.363815   0.931471
 1.0   0.0        0.0
singular values:
3-element Vector{Float64}:
 0.47571307544817304
 0.3266291290597002
 0.16072253690428198
Vt factor:
3×7 Matrix{Float64}:
 -3.63248e-17  0.57735  0.0  0.57735   4.92112e-17   4.92112e-17  0.57735
 -0.783248     0.0      0.0  0.0      -0.439615     -0.439615     0.0
 -0.62171      0.0      0.0  0.0       0.55384       0.55384      0.0
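
The result above is a singular value decomposition of the document-term matrix. To show how such a factorization can be used, the sketch below applies `svd` from `LinearAlgebra` to a hand-written count matrix for the same three documents and compares them in a two-dimensional latent space. The raw counts and the column order are our assumptions. The notebook's output appears to be based on weighted values, so the numbers differ.

```julia
using LinearAlgebra

# Rows: documents; columns: add, asset, educ, human, person, valu, valuabl
A = Float64[1 0 1 0 1 1 0;
            0 0 1 0 1 1 0;
            0 1 1 1 0 0 1]

F = svd(A)

# Keep the k largest singular values to place each document in a latent space
k = 2
coords = F.U[:, 1:k] * Diagonal(F.S[1:k])

cosine(a, b) = dot(a, b) / (norm(a) * norm(b))
sim12 = cosine(coords[1, :], coords[2, :])   # documents 1 and 2
sim13 = cosine(coords[1, :], coords[3, :])   # documents 1 and 3
```

In this reduced space the first two documents end up much closer to each other than either is to the third, which matches the intuition stated earlier.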

In [11]:
m = DocumentTermMatrix(c)
lda(m, 2, 2000, 0.1, 0.1)

(
 0.142857   ⋅    0.428571   ⋅    0.285714  0.142857   ⋅ 
  ⋅        0.25   ⋅        0.25   ⋅        0.25      0.25, [0.75 1.0 0.25; 0.25 0.0 0.75])
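
In the call above we take the arguments to be the number of topics (2), the number of iterations (2000), and the two smoothing parameters α and β. We also assume that the second element of the returned tuple is the topic-by-document matrix. Under that assumption, the dominant topic of each document can be read off as sketched here. The values are copied from the output above and will vary between runs, since the sampling is random.

```julia
# Topic-document matrix from the output above (rows: topics, columns: documents)
θ = [0.75 1.0 0.25;
     0.25 0.0 0.75]

# Each column is a probability distribution over topics for one document
column_sums = vec(sum(θ, dims=1))

# Index of the most probable topic for every document
dominant_topic = [argmax(θ[:, d]) for d in 1:size(θ, 2)]
```

The first two documents fall mostly under one topic and the third under the other, in line with the LSA result.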


## Classifiers

Naive Bayes is a simplified classifier based on Bayes' rule. The words are assumed to be independent of each other given the class to which the sentence belongs. Even with just two training sentences, the test sentence is assigned a significantly higher probability for `:g1`.


In [12]:
using TextAnalysis
m = NaiveBayesClassifier([:g1, :g2])
fit!(m, "Education is valued for a person", :g1)
fit!(m, "Education is valuable asset for a human", :g2)
predict(m, "Education is valued for a man")

Dict{Symbol, Float64} with 2 entries:
  :g2 => 0.278623
  :g1 => 0.721377
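
To see where such probabilities come from, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing and equal class priors. These modeling choices and all helper names are our assumptions, so the numbers need not match the `TextAnalysis` output exactly.

```julia
# Count how often each lowercase word occurs in a text
function word_counts(text)
    counts = Dict{String,Int}()
    for w in String.(split(lowercase(text)))
        counts[w] = get(counts, w, 0) + 1
    end
    return counts
end

training = Dict(
    :g1 => word_counts("Education is valued for a person"),
    :g2 => word_counts("Education is valuable asset for a human"),
)
vocab_size = length(union(Set(keys(training[:g1])), Set(keys(training[:g2]))))

# Log-likelihood of the query under one class, with add-one (Laplace) smoothing
function log_likelihood(counts, query)
    total = sum(values(counts))
    return sum(log((get(counts, w, 0) + 1) / (total + vocab_size))
               for w in String.(split(lowercase(query))))
end

query = "Education is valued for a man"
loglik = Dict(c => log_likelihood(cnt, query) for (c, cnt) in training)

# With equal priors, normalizing the likelihoods gives the posterior probabilities
peak = maximum(values(loglik))
weights = Dict(c => exp(l - peak) for (c, l) in loglik)
posterior = Dict(c => w / sum(values(weights)) for (c, w) in weights)
```

The word "valued" appears only in the `:g1` training sentence, which is what tips the balance toward `:g1` in this sketch.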

## Extractive Summarization

The TextRank algorithm can be used to identify the sentences that carry the most significant information in a paragraph. These sentences are ranked higher and can be used to summarize the text.
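
The ranking idea can be sketched in a few lines: build a graph whose nodes are sentences and whose edge weights measure word overlap, then run a PageRank-style iteration over it. The sketch below is a simplified illustration under our own assumptions (Jaccard overlap as the similarity, a damping factor of 0.85, and a fixed number of iterations). It is not the implementation used by `summarize`.

```julia
# Word-set overlap between two sentences (Jaccard; an assumption for this sketch)
overlap(a, b) = length(intersect(a, b)) / length(union(a, b))

function textrank(sentences; damping=0.85, iterations=50)
    # Lowercase, strip punctuation, and keep the set of words per sentence
    words = [Set(split(lowercase(replace(s, r"[^\w\s]" => "")))) for s in sentences]
    n = length(sentences)

    # Symmetric similarity matrix with no self-links
    W = [i == j ? 0.0 : overlap(words[i], words[j]) for i in 1:n, j in 1:n]

    # Normalize columns so that each sentence distributes its score to its neighbors
    P = W ./ max.(sum(W, dims=1), eps())

    # PageRank-style power iteration
    r = fill(1 / n, n)
    for _ in 1:iterations
        r = (1 - damping) / n .+ damping .* (P * r)
    end
    return r
end

sentences = ["Julia is fast and dynamic.",
             "Julia is a dynamic language.",
             "Julia is fast.",
             "Cats sleep all day."]
ranks = textrank(sentences)
least_central = argmin(ranks)
```

The last sentence shares no words with the others, so it receives only the baseline score and would be the first to be left out of a summary.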

In [13]:
using TextAnalysis
s = "Research in the text analysis space has been so rapid that "*
"many concepts that have been relevant a few years ago are "*
"getting continuously augmented with newer concepts and "*
"knowledge. Neural computing has made significant inroads "*
"into most text processing tasks. The traditional statistical "*
"approaches though relevant are being experimented alongside "*
"deep learning techniques. We shall discuss a few commonly used "*
"packages for text analytics and suggest the reader to explore "*
"and learn newer techniques as they proceed with expanding their "*
"horizons. The codes or texts described here are for the ease of "*
"learning and understanding and should not be considered for a "*
"production performance. Most text processing tasks can be "*
"considered as a simplified pipeline of activities as shown in "*
"Figure 16.1. However, it is an over simplification from your "*
"real life use cases and there may be certain tasks that are "*
"overlapping."

summary = *(summarize(StringDocument(s), ns=4)...)

"Neural computing has made significant inroads into most text processing tasks.The traditional statistical approaches though relevant are being experimented alongside deep learning techniques.The codes or texts described here are for the ease of learning and understanding and should not be considered for a production performance.However, it is an over simplification from your real life use cases and there may be certain tasks that are overlapping."

### Evaluation Metrics

The quality of a summary can be measured against a human-written reference summary using ROUGE-N scores, which measure the overlap of n-grams between the two.

In [14]:
ref = ["Text analysis research has been rapid. ",
       "Neural computing is being used alongwith statistical techniques. ",
       "Deep learning is gaining popularity as well. ",
       "Code and text provides here are for ease of learning and should not be used for production performance."]
rouge_n(ref, summary, 2, avg=true)

0.369047619047619

In [15]:
rouge_n(ref, summary, 1, avg=true)

0.9457236842105263
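
The core of ROUGE-N is n-gram recall: the fraction of the reference n-grams that also occur in the candidate summary. The sketch below implements that idea for a single reference in Base Julia. The helper names and the example sentences are ours, and the `rouge_n` function used above may differ in details such as averaging over multiple references.

```julia
# All n-grams of a text, as space-joined lowercase strings
function ngrams(text, n)
    words = split(lowercase(text))
    return [join(words[i:i+n-1], " ") for i in 1:length(words)-n+1]
end

# Frequency table of the given items
function count_items(items)
    counts = Dict{String,Int}()
    for item in items
        counts[item] = get(counts, item, 0) + 1
    end
    return counts
end

# Matched reference n-grams (clipped by candidate counts) over all reference n-grams
function rouge_n_recall(reference, candidate, n)
    ref = count_items(ngrams(reference, n))
    cand = count_items(ngrams(candidate, n))
    matched = sum(min(c, get(cand, g, 0)) for (g, c) in ref)
    return matched / sum(values(ref))
end

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
r1 = rouge_n_recall(reference, candidate, 1)   # unigram recall
r2 = rouge_n_recall(reference, candidate, 2)   # bigram recall
```

Bigram recall is lower than unigram recall here because a single changed word breaks two bigrams, which is the same pattern as the ROUGE-1 and ROUGE-2 scores above.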

## Named Entity Recognition

We shall use the pretrained `NERTagger` from the `TextModels` package. The recognized entity tags are: Person (PER), Location (LOC), Organization (ORG), Miscellaneous (MISC), and Others (O).

<i><b>Note:</b> This part of the code is temporarily commented out as the TextModels package is not compatible with the Julia 1.6 release. We will update it to the latest version as soon as the package is supported. However, you can run this code in a Julia 1.5.x installation. </i>

In [16]:
#=
using Pkg
Pkg.add("TextModels")

using TextModels # This package is currently not compatible with Julia 1.6. You can run this code with Julia 1.5. 

ner = NERTagger()
s1 = "Kumar is from Bangalore and works for Julia Computing."
println(ner(s1))
s2 = "Julia language is easy to learn."
println(ner(s2))

=#

#= Output
["PER", "O", "O", "LOC", "O", "O", "O", "ORG", "ORG", "O"]
["PER", "O", "O", "O", "O", "O", "O"]
=#
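
The tagger returns one tag per token, so adjacent tokens with the same tag usually need to be merged into a single entity. The sketch below does this for the first output shown above. The token list is our assumption about how the sentence was split, and `group_entities` is a hypothetical helper.

```julia
tokens = ["Kumar", "is", "from", "Bangalore", "and", "works", "for", "Julia", "Computing", "."]
tags   = ["PER", "O", "O", "LOC", "O", "O", "O", "ORG", "ORG", "O"]

# Merge runs of identical non-"O" tags into (entity text, tag) pairs
function group_entities(tokens, tags)
    entities = Tuple{String,String}[]
    i = 1
    while i <= length(tokens)
        if tags[i] == "O"
            i += 1
            continue
        end
        j = i
        while j < length(tokens) && tags[j + 1] == tags[i]
            j += 1
        end
        push!(entities, (join(tokens[i:j], " "), tags[i]))
        i = j + 1
    end
    return entities
end

entities = group_entities(tokens, tags)
```

This simple grouping cannot separate two different entities of the same type that stand next to each other, which is acceptable for a sketch but worth keeping in mind.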

## ULMFiT

## Conclusion

## Exercises