<a href="https://colab.research.google.com/github/a-mhamdi/jlai/blob/main/Codes/Julia/Part-3/nlp/nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NATURAL LANGUAGE PROCESSING
---

In [1]:
versioninfo() # -> v"1.11.5"

Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, broadwell)
Threads: 2 default, 0 interactive, 1 GC (on 2 virtual cores)
Environment:
  LD_LIBRARY_PATH = /usr/local/nvidia/lib:/usr/local/nvidia/lib64
  JULIA_NUM_THREADS = auto


In [2]:
pkgs = """[deps]
Embeddings = "c5bfea45-b7f1-5224-a596-15500f5db411"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
TextAnalysis = "a2db99b7-8b79-58f8-94bf-bbc811eef33d"
TextModels = "77b9cbda-2a23-51df-82a3-24144d1cd378"
"""

open("Project.toml", "w") do file
    write(file, pkgs)
end

220

In [3]:
_ = begin
  import Pkg;
  Pkg.activate(".");
  Pkg.instantiate();
end

[32m[1m  Activating[22m[39m project at `/content`


In [4]:
Pkg.status()

[32m[1mStatus[22m[39m `/content/Project.toml`
  [90m[c5bfea45] [39mEmbeddings v0.4.6
  [90m[a2db99b7] [39mTextAnalysis v0.8.4
  [90m[77b9cbda] [39mTextModels v0.2.1
  [90m[37e2e46d] [39mLinearAlgebra v1.11.0


In [5]:
using TextAnalysis

In [6]:
txt = "The quick brown fox is jumping over the lazy dog" # Pangram [modif.]

"The quick brown fox is jumping over the lazy dog"

Create a `Corpus` using `txt`

In [7]:
crps = Corpus([StringDocument(txt)])

A Corpus with 1 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

In [8]:
lexicon(crps)

Dict{String, Int64}()

In [9]:
update_lexicon!(crps)

In [10]:
lexicon(crps)

Dict{String, Int64} with 10 entries:
  "brown"   => 1
  "The"     => 1
  "lazy"    => 1
  "the"     => 1
  "is"      => 1
  "quick"   => 1
  "fox"     => 1
  "over"    => 1
  "jumping" => 1
  "dog"     => 1

In [11]:
lexical_frequency(crps, "fox")

0.1
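
The value `0.1` is consistent with a simple ratio: "fox" occurs once among the 10 tokens counted in the lexicon. Below is a minimal plain-Julia sketch of that arithmetic, assuming whitespace tokenization and case-sensitive counting. The names `counts` and `freq` are illustrative and not part of TextAnalysis.

```julia
txt = "The quick brown fox is jumping over the lazy dog"

# Count each whitespace-separated token; "The" and "the" stay distinct.
counts = Dict{String, Int}()
for w in split(txt)
    counts[String(w)] = get(counts, String(w), 0) + 1
end

# Relative frequency = occurrences of the word / total number of tokens.
freq(word) = get(counts, word, 0) / sum(values(counts))

freq("fox")
```

A word that never occurs gets frequency `0.0` here because `get` falls back to a count of zero.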

Create a `StringDocument` using `txt`

In [12]:
sd = StringDocument(txt)

A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: The quick brown fox is jumping over the lazy dog

Reduce `text(sd)` to a smaller set of words: strip articles, numbers, punctuation, case, and extra whitespace, then stem what remains.

In [13]:
prepare!(sd, strip_articles | strip_numbers | strip_punctuation | strip_case | strip_whitespace)
stem!(sd)

Get the tokens of `sd`

In [14]:
the_tokens = tokens(sd)

8-element Vector{String}:
 "quick"
 "brown"
 "fox"
 "is"
 "jump"
 "over"
 "lazi"
 "dog"
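
The drop from 10 lexicon entries to 8 tokens comes from lowercasing and article removal: both "The" and "the" disappear. The sketch below approximates only those two steps in plain Julia. It is an illustration under the assumption of whitespace tokenization, not the TextAnalysis implementation, and it does no stemming (so "jumping" and "lazy" stay as they are). The names `articles` and `kept` are hypothetical.

```julia
txt = "The quick brown fox is jumping over the lazy dog"

articles = Set(["a", "an", "the"])

# Lowercase first (similar to strip_case), then split on whitespace.
words = String.(split(lowercase(txt)))

# Drop articles (similar to strip_articles).
kept = [w for w in words if w ∉ articles]
```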

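Stem the tokens of `sd` explicitly with a `Stemmer`. Because `stem!(sd)` already ran in cell 13, `the_tokens` are already stems, so stemming them again returns the same vector.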

In [15]:
stemmer = Stemmer("english")
stemmed_tokens = stem(stemmer, the_tokens)

8-element Vector{String}:
 "quick"
 "brown"
 "fox"
 "is"
 "jump"
 "over"
 "lazi"
 "dog"

In [16]:
println("Tokens after stem!(sd): ", the_tokens)
println("Tokens stemmed again: ", stemmed_tokens)

Tokens after stem!(sd): ["quick", "brown", "fox", "is", "jump", "over", "lazi", "dog"]
Tokens stemmed again: ["quick", "brown", "fox", "is", "jump", "over", "lazi", "dog"]


**Part-of-speech tags**

In [17]:
#=
Common POS tags:

JJ: Adjective
NN: Noun, singular or mass
NNS: Noun, plural
VB: Verb, base form
VBZ: Verb, 3rd person singular present
VBG: Verb, gerund or present participle
VBD: Verb, past tense
RB: Adverb
IN: Preposition or subordinating conjunction
DT: Determiner
PRP: Personal pronoun
CC: Coordinating conjunction
=#

#=
using TextModels
pos = PoSTagger()
pos(txt)
=#

**Word embeddings**

In [18]:
using Embeddings
embtab = load_embeddings(GloVe{:en}, max_vocab_size=5)

Embeddings.EmbeddingTable{Matrix{Float32}, Vector{String}}(Float32[0.418 0.013441 … 0.70853 0.68047; 0.24968 0.23682 … 0.57088 -0.039263; … ; -0.11514 0.044691 … -0.093918 -0.064699; -0.78581 0.30392 … -0.80375 -0.26044], ["the", ",", ".", "of", "to"])

In [19]:
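embtab.vocab       # the 5 loaded words; not displayed, since only the last expression is shown
embtab.embeddings  # 50×5 matrix, one column per word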

50×5 Matrix{Float32}:
  0.418        0.013441   0.15164     0.70853    0.68047
  0.24968      0.23682    0.30177     0.57088   -0.039263
 -0.41242     -0.16899   -0.16763    -0.4716     0.30186
  0.1217       0.40951    0.17684     0.18048   -0.17792
  0.34527      0.63812    0.31719     0.54449    0.42962
 -0.044457     0.47709    0.33973     0.72603    0.032246
 -0.49688     -0.42852   -0.43478     0.18157   -0.41376
 -0.17862     -0.55641   -0.31086    -0.52393    0.13228
 -0.00066023  -0.364     -0.44999     0.10381   -0.29847
 -0.6566      -0.23938   -0.29486    -0.17566   -0.085253
  0.27843      0.13001    0.16608     0.078852   0.17118
 -0.14767     -0.063734   0.11963    -0.36216    0.22419
 -0.55677     -0.39575   -0.41328    -0.11829   -0.10046
  ⋮                                             
  0.012041     0.70358    0.41705     0.24185    0.16351
 -0.054223     0.44858    0.056763    0.36576   -0.21634
 -0.29871     -0.080262  -6.3681f-5  -0.34727   -0.094375
 -0.15749    

In [20]:
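# The second positional argument (3) selects which pre-trained GloVe file to load.
glove = load_embeddings(GloVe{:en}, 3, max_vocab_size=10_000)
# Map each word to its column index in `glove.embeddings`.
const word_to_index = Dict(word => ii for (ii, word) in enumerate(glove.vocab))
# Return the embedding column for `word`; throws a `KeyError` if the word is not in the vocabulary.
function get_word_vector(word)
    idx = word_to_index[word]
    return glove.embeddings[:, idx]
end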

get_word_vector (generic function with 1 method)
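
Since `max_vocab_size=10_000` caps the vocabulary, indexing `word_to_index` with any other word raises a `KeyError`. One possible way to handle that is a lookup that returns `nothing` instead. The sketch below uses a tiny made-up table in place of `glove`, and `try_word_vector` is a hypothetical helper, not part of Embeddings.jl.

```julia
# Toy stand-ins for `glove.vocab` and `glove.embeddings` (made-up values).
vocab = ["king", "queen"]
embeddings = Float32[1 2; 3 4]   # one column per word

word_to_index = Dict(word => ii for (ii, word) in enumerate(vocab))

# Return the word's column, or `nothing` when the word is not in the vocabulary.
function try_word_vector(word)
    idx = get(word_to_index, word, nothing)
    return idx === nothing ? nothing : embeddings[:, idx]
end

try_word_vector("queen")   # a Vector{Float32}
try_word_vector("zebra")   # nothing
```

Callers then check for `nothing` before doing arithmetic on the result.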

In [21]:
using LinearAlgebra
function cosine_similarity(v1::Vector{Float32}, v2::Vector{Float32})
    return dot(v1, v2) / (norm(v1) * norm(v2))
end

cosine_similarity (generic function with 1 method)

_e.g., "king" - "man" + "woman" ≈ "queen"_

In [22]:
king = get_word_vector("king")
queen = get_word_vector("queen")
man = get_word_vector("man")
woman = get_word_vector("woman")

cosine_similarity(king - man + woman, queen)

0.71191657f0

_e.g., "Madrid" - "Spain" + "France" ≈ "Paris"_

In [23]:
Madrid = get_word_vector("madrid")
Spain = get_word_vector("spain")
France = get_word_vector("france")
Paris = get_word_vector("paris")

cosine_similarity(Madrid - Spain + France, Paris)

0.8027344f0
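
The two cells above compare the analogy vector against one known answer. A common follow-up is to search the vocabulary for the word closest to that vector. The sketch below shows the idea on a hypothetical five-word table with made-up 2-D vectors (not real GloVe values); `toy`, `cosine`, and `analogy` are illustrative names.

```julia
using LinearAlgebra

# Made-up 2-D vectors (axes: "royalty", "gender"); NOT real GloVe values.
toy = Dict(
    "king"  => Float32[1, 1],
    "queen" => Float32[1, -1],
    "man"   => Float32[0, 1],
    "woman" => Float32[0, -1],
    "apple" => Float32[-1, 0],
)

cosine(v1, v2) = dot(v1, v2) / (norm(v1) * norm(v2))

# Return the word most similar to a - b + c, excluding the three query words.
function analogy(a, b, c)
    target = toy[a] - toy[b] + toy[c]
    candidates = [w for w in keys(toy) if w ∉ (a, b, c)]
    return candidates[argmax([cosine(toy[w], target) for w in candidates])]
end

analogy("king", "man", "woman")
```

Excluding the query words matters in practice, because the nearest neighbour of `a - b + c` is often `a` or `c` itself.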

**Text classification**

https://github.com/JuliaText/TextAnalysis.jl/blob/master/docs/src/classify.md

In [24]:
m = NaiveBayesClassifier([:legal, :financial])
fit!(m, "this is financial doc", :financial)
fit!(m, "this is legal doc", :legal)
predict(m, "this should be predicted as a legal document")

Dict{Symbol, Float64} with 2 entries:
  :legal     => 0.666667
  :financial => 0.333333
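
The 2/3 versus 1/3 split can be reproduced by hand. The sketch below is one way to arrive at the same numbers, assuming add-one smoothed word counts per class, multiplied over the query tokens that appear in the training vocabulary and then normalized. It is an illustration of the arithmetic, not a copy of the TextAnalysis implementation, and all names are hypothetical.

```julia
# Training documents, tokenized on whitespace.
training = Dict(
    :financial => split("this is financial doc"),
    :legal     => split("this is legal doc"),
)
query = split("this should be predicted as a legal document")

# The vocabulary is every token seen during training.
vocab = unique(vcat(values(training)...))

# Add-one smoothed count of `word` in the training text of `class`.
weight(class, word) = 1 + count(==(word), training[class])

# Score each class by multiplying the weights of the query tokens found in the vocabulary.
scores = Dict(class => prod(Float64[weight(class, w) for w in query if w in vocab])
              for class in keys(training))

# Normalize the scores so they sum to 1.
total = sum(values(scores))
probs = Dict(class => s / total for (class, s) in scores)
```

Only "this" and "legal" from the query are in the vocabulary. "this" contributes a weight of 2 to both classes, while "legal" contributes 2 to `:legal` and 1 to `:financial`, giving scores of 4 and 2.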

**Semantic analysis**

In [25]:
m = DocumentTermMatrix(crps)

A 1 X 10 DocumentTermMatrix

*Latent Semantic Analysis* (**LSA**) is a dimensionality reduction technique that uses singular value decomposition (SVD) on a term-document matrix to discover hidden semantic relationships between words and documents in a corpus.

In [26]:
lsa(m)

SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
1×1 Matrix{Float64}:
 1.0
singular values:
1-element Vector{Float64}:
 0.0
Vt factor:
1×10 Matrix{Float64}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
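
The single singular value is `0.0` because the corpus holds only one document. Assuming the decomposition is applied to TF-IDF weights, every term appears in 1 of 1 documents, so each inverse document frequency is log(1/1) = 0 and the weighted matrix is all zeros. The sketch below reproduces that with a hand-rolled TF-IDF; the formula and variable names are assumptions for illustration.

```julia
using LinearAlgebra

# Term counts for one document with 10 distinct terms, each occurring once.
counts = ones(1, 10)
n_docs = size(counts, 1)

# Term frequency: counts divided by document length.
tf = counts ./ sum(counts, dims=2)

# Inverse document frequency: log(n_docs / number of documents containing the term).
doc_freq = vec(sum(counts .> 0, dims=1))
idf = log.(n_docs ./ doc_freq)

# Every idf is log(1/1) = 0, so the weighted matrix is all zeros.
tfidf = tf .* idf'

svd(tfidf).S
```

A corpus with several documents that differ in vocabulary gives non-zero weights and a more informative decomposition.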

*Latent Dirichlet Allocation* (**LDA**) is a probabilistic generative model that represents documents as mixtures of topics, where each topic is a distribution over words, commonly used for topic modeling and document clustering.

In [27]:
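k = 2              # number of topics
iterations = 1000  # number of Gibbs sampling iterations
α = 0.1            # Dirichlet prior on the per-document topic distributions
β = 0.1            # Dirichlet prior on the per-topic word distributions
ϕ, θ = lda(m, k, iterations, α, β)  # ϕ: topics × words, θ: topics × documents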

(sparse([2, 1, 1, 1, 2, 1, 1, 1, 1, 1], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [0.5, 0.125, 0.125, 0.125, 0.5, 0.125, 0.125, 0.125, 0.125, 0.125], 2, 10), [0.8; 0.2;;])
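
To read this result, treat each row of `ϕ` as one topic's distribution over the 10 terms and each column of `θ` as one document's mixture over the 2 topics. The sketch below rebuilds both arrays from the values printed above and checks that they are normalized. Gibbs sampling is randomized, so another run may print different numbers; the variable names after `θ` are illustrative.

```julia
using SparseArrays

# Values copied from the output above (they may differ from run to run).
ϕ = sparse([2, 1, 1, 1, 2, 1, 1, 1, 1, 1],
           [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
           [0.5, 0.125, 0.125, 0.125, 0.5, 0.125, 0.125, 0.125, 0.125, 0.125],
           2, 10)
θ = reshape([0.8, 0.2], 2, 1)

# Each topic (row of ϕ) is a probability distribution over terms.
topic_word_mass = vec(sum(ϕ, dims=2))

# Each document (column of θ) is a probability distribution over topics.
doc_topic_mass = vec(sum(θ, dims=1))

# Topic with the largest share in the first (and only) document.
dominant_topic = argmax(θ[:, 1])
```

Here topic 1 spreads its mass evenly over eight terms and accounts for 80% of the document, while topic 2 concentrates on the terms in columns 1 and 5.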