In **Bucharest**, I'll explore Omer Adam's song corpus (all of his available English-translated songs) with simple text & NLP algorithms.

I'll use [Clojure](https://clojure.org/) and its libraries:
1. array programming: the "clojure-numpy" library ([core.matrix](https://github.com/mikera/core.matrix))
2. visualization libraries ([incanter](https://github.com/incanter/incanter) and [vega](https://vega.github.io/vega/); vega is an awesome DSL, tool and concept)
3. test libraries for RDD, TDD and property-based testing (clojure.test, clojure.test.check, clojure.spec)
4. notebook-style Clojure: clojupyter (this!), a Clojure kernel for Jupyter (a cooler alternative would be [gorillaREPL](http://gorilla-repl.org/))

Unlike Python, with its awesome and stable numpy-scipy-nltk-tf support, Clojure's tools are less common and not widely used. I'll try to show that great (& fun) alternatives exist.

First let's get the data. We'll look at all of Omer Adam's English-translated songs, around 55 of them, on lyricstranslate.com (a world-famous song-translation website), where Omer is the most famous artist among the Hebrew -> English translations.
<img src="omer1.png" width="200">

I'll make 2 basic assumptions:
1. The words we're interested in (words like "love", "night", "tel-aviv", "habibi") are translated without mistakes.
2. Translation doesn't change the meaning of the words (or their intent), which can happen when a non-professional translator translates a text. This relies on the fact that Omer's songs are not "that deep" :) and "tonight


Simple scraping with Java interop and the Java Jsoup library gets all of the songs into memory: by looking at all the "a[href]" links we can easily pick out the ones that point to English translations, then save each song's title & text:
<figure class="half" style="display:flex">
    <img src="omer2.png" width=400>
    <img src="omer3.png" width=400>
</figure>

In [43]:
; =================================
(require '[clojupyter.misc.helper :as helper])
(require '[clojupyter.misc.display :as display])
(helper/add-dependencies '[org.jsoup/jsoup "1.7.3"])
(import (org.jsoup Jsoup)
        (org.jsoup.select Elements)
        (org.jsoup.nodes Element))
; =================================

(def BASE-URL "https://lyricstranslate.com")

(def OMERS-URL (str BASE-URL "/en/omer-adam-lyrics.html"))

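(defn get-page
  "Fetches and parses the HTML document at url."
  [url]
  (.get (Jsoup/connect url)))

(defn get-elems
  "Selects the elements under page (a Document or Elements) matching css."
  [page css]
  (.select page css))

(defn extract-links
  "Returns absolute URLs of the links whose class is \"lang\" and whose text is \"English\"."
  [url]
  (for [e (get-elems (get-page url) "a[href]")
        :when (and (= (.attr e "class") "lang")
                   (= (.text e) "English"))]
    (str BASE-URL (.attr e "href"))))

(defn extract-song
  "Returns the title and full translated text of the song page at url."
  [url]
  (let [elems (get-elems (get-page url) "div#songtranslation > .translate-node-text")
        title (.text (get-elems elems ".title-h2"))
        text (.text (get-elems elems ".ltf > .par"))]
    {:title title
     :text text}))

;; `for` is lazy: song pages are fetched only when the result is first realized.
(defn scrape-omer []
  (for [song (extract-links OMERS-URL)]
    (extract-song song)))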

; =================================
(def songs (scrape-omer))
; =================================

#'user/songs

In [41]:
(println (type songs))
(println (count songs))
(first songs)

clojure.lang.LazySeq
52


{:title "I give thanks", :text "I offer thanks to you each morning For restoring my soul to me Thank you for the life that covers me Warms me like a flame That protects me from the cold You're there 1 and believe in me I offer thanks each morning For the present moment and for the light Thank you for the generous pale gold 2 You have placed on my table To feed my children You protect, you are great For my joys and my smiles I thank you For my gifts 3 and my passions And for my songs They are all for you Know that, know that I thank you my king I cry to you, my G-d, oh my G-d 4 To you I call To you, my life To you, my heart 5 I thank you To you I call 6 To you I call I offer thanks each morning For the love of my father, of my mother Thank you for the rain that waters the trees Of my fields, for being the gardian 7 Of our lives of our destinies For the day of rest 8 I thank you For the success, for being here For being happy sometimes Know that, know that I thank you, my G-d I cry to yo

In [44]:
(helper/add-dependencies '[net.mikera/core.matrix "0.62.0"])
(helper/add-dependencies '[net.mikera/vectorz-clj "0.48.0"])
(helper/add-dependencies '[incanter "1.9.3"])
(require '[clojure.set :as set])
(require '[clojure.core.matrix :as m])
(use '(incanter core stats charts io))
(m/set-current-implementation :vectorz)
:ok

:ok

In [47]:
(defn distinct-words [nested-v]
  (reduce (fn [s xs]
                (set/union s (set xs)))
          #{} nested-v))

#'user/distinct-words
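
distinct-words expects a nested collection of tokens, while the scraped songs hold one long :text string each. A minimal sketch of that missing step, assuming a simple lower-case regex tokenizer: `tokenize`, `toy-songs` and `toy-corpus` are hypothetical names, and the toy text is made up (not real lyrics).

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

;; Repeated from the cell above so this snippet runs on its own.
(defn distinct-words [nested-v]
  (reduce (fn [s xs]
            (set/union s (set xs)))
          #{} nested-v))

;; Hypothetical helper: lower-case the text and split it into word tokens.
;; The regex keeps letters, apostrophes and hyphens, so "tel-aviv" stays one token.
(defn tokenize [text]
  (->> (re-seq #"[a-z'-]+" (str/lower-case text))
       (remove #{"'" "-"})   ; drop lone apostrophes and hyphens
       vec))

;; Toy stand-in for the scraped `songs` seq (made-up text).
(def toy-songs
  [{:title "a" :text "Tonight I love Tel-Aviv"}
   {:title "b" :text "I love the night, habibi"}])

;; One vector of tokens per song.
(def toy-corpus (mapv (comp tokenize :text) toy-songs))

(distinct-words toy-corpus)
;; a set of 7 words: "tonight" "i" "love" "tel-aviv" "the" "night" "habibi"
```

On the real data the same shape would come from mapping the tokenizer over `songs`; whether this regex suits the translated lyrics (footnote numbers, punctuation) is an assumption to check.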

In [None]:
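(defn co-occurrence-matrix [corpus n]
  (let [;; distinct-words returns a set; sort it so every word gets a stable index
        words (sort (distinct-words corpus))
        num-words (count words)
        word->idx (zipmap words (range num-words))
        shape [num-words num-words]
        ;; co-occur is expected to add the window-n counts of one tokenized song to M'
        M (reduce (fn [M' coll]
                    (co-occur M' coll word->idx n))
                  (m/zero-array shape) corpus)]
    {:M         M
     :word->idx word->idx}))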
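
The cell above calls a `co-occur` helper with the arguments (M' coll word->idx n). A minimal sketch of what such a helper might look like, assuming a symmetric window of n words on each side. To stay dependency-free it works on plain nested vectors, not a core.matrix array, so it is an illustration of the counting logic and not a drop-in for the `m/zero-array` accumulator (a core.matrix version would read and write cells with `m/mget` / `m/mset`). The `toy-*` names are hypothetical.

```clojure
;; Hypothetical sketch: add the window-n co-occurrence counts of one
;; tokenized song (coll) to M, a square vector of vectors.
(defn co-occur [M coll word->idx n]
  (let [idxs (mapv word->idx coll)   ; matrix index of each token, in order
        len  (count idxs)]
    (reduce (fn [M' [i j]]
              ;; count the word at position j as a neighbour of the word at position i
              (update-in M' [(idxs i) (idxs j)] inc))
            M
            (for [i (range len)
                  j (range (max 0 (- i n)) (min len (+ i n 1)))
                  :when (not= i j)]
              [i j]))))

(def toy-doc ["i" "love" "the" "night"])
(def toy-idx (zipmap toy-doc (range)))
(def toy-M (vec (repeat 4 (vec (repeat 4 0)))))

(co-occur toy-M toy-doc toy-idx 1)
;; => [[0 1 0 0] [1 0 1 0] [0 1 0 1] [0 0 1 0]]
```

With a window of 1 each word only counts its immediate neighbours, which gives the symmetric band shown in the result.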

In [32]:
(defn plot-embeddings [M word->idx title words]
  (let [words (filterv word->idx words)        ; keep only words present in the index
        indices (mapv word->idx words)         ; same order as words, so labels match points
        sliced (m/select M indices :all)       ; one row per selected word
        x-cors (m/eseq (m/get-column sliced 0))
        y-cors (m/eseq (m/get-column sliced 1))
        plot (scatter-plot x-cors y-cors
                           :title title
                           :x-label "X"
                           :y-label "Y")]
    ;; label each point with its word, slightly above the marker
    (doseq [[x y w] (map list x-cors y-cors words)]
      (add-text plot x (+ 0.06 y) w))
    (.createBufferedImage plot 600 400)))

#'user/plot-embeddings