# In <font color="red">Bucharest</font>, I'll examine Omer's songs...


# with simple text algorithms..... simple viz.........and........... 
# <font color="green" size=6>clojure!</font>

### NO numpy, no matplotlib....no python!!:
with:
1.  [core.matrix](https://github.com/mikera/core.matrix), array programming
2. [vega](https://vega.github.io/vega/) (awesome DSL) and a little bit of [incanter](https://github.com/incanter/incanter), for viz, graphs
3. clojure.test, [clojure.test.check](https://github.com/clojure/test.check), clojure.spec for example/property based testing (QuickCHECKKK)

# BEGIN:

### let's grab Omer's songs with JSoup (Java's HTML scraping lib) & Java interop (calling Java code from clojure) into memory:
<figure class="half" style="display:flex">
    <img src="omer2.png" width=400>
    <img src="omer3.png" width=400>
</figure>

In [200]:
; =================================
(require '[clojupyter.misc.helper :as helper]
         '[clojupyter.misc.display :as display])
(helper/add-dependencies '[org.jsoup/jsoup "1.7.3"])
(import (org.jsoup Jsoup)
        (org.jsoup.select Elements)
        (org.jsoup.nodes Element))
; =================================

(def base-url "https://lyricstranslate.com")

(defn get-page [url]
  (.get (Jsoup/connect url)))

(defn get-elems [page css]
  (.select page css))

(defn extract-links [url]
  (for [e (get-elems (get-page url) "a[href]")
        :when (and (= (.attr e "class") "lang")
                   (or (= (.text e) "English")
                       (= (.text e) "#1")))]
    (str base-url (.attr e "href"))))

(defn extract-song [url]
  (let [elems (get-elems (get-page url) "div#songtranslation > .translate-node-text")
        title (.text (get-elems elems ".title-h2"))
        text (.text (get-elems elems ".ltf > .par"))]
    {:title title
     :text text}))

(def omer-url (str base-url "/en/omer-adam-lyrics.html"))

(defn scrape-omer []
  (for [song-url (extract-links omer-url)]
    (extract-song song-url)))

; =================================
(def scraped-songs (scrape-omer))
(println "==========================================================")
(println (type scraped-songs))
(println "total scraped songs:" (count scraped-songs))
(println "==========================================================")
; `songs` is only defined in the next cell, so look the song up in `scraped-songs`
(first (filter #(= (:title %) "Tel Aviv") scraped-songs))

clojure.lang.LazySeq
total scraped songs: 67


{:title "Tel Aviv", :text "She feels that her luck has opened up She met a manly man and a rajal 1 And she'll whisper to him, what will she whisper to him? 'Take me on the camel' I'm your beauty You're my beast Welcome to the Middle East Tel Aviv, ya habibi 2, Tel Aviv Look how many lirdim 3 there are around Telling me 'hi, hi' At night 'wai, wai' And well done, Tel Aviv Sun rises in the white city And he's being stared at from every corner And she knows, what does she know? He'll run away from her in a second I'm your beauty You're my beast Welcome to the Middle East Tel Aviv, ya habibi 2, Tel Aviv Look how many lirdim 3 there are around Telling me 'hi, hi' At night 'wai, wai' And well done, Tel Aviv I have it going up, up, up and not down"}

# preprocessing: after cleaning up some badly-scraped songs & dups, we get: 

In [202]:
; clojure.string is needed here for s/starts-with? in pre-process
(require '[clojure.set :as set]
         '[clojure.string :as s])

(def badly-scraped #{"После стольких лет", "Az Halachti", "Khaverot Shelakh", "Noetset Mabat", "Sheket", "Mahapecha Shel Simha"})

(def dup-songs-starts-text "Hi margisha")

(defn pre-process [songs]
    (->> songs
         (remove #(or (contains? badly-scraped (:title %))
                      (s/starts-with? (:text %) dup-songs-starts-text)))))

(def songs (pre-process scraped-songs))
(def titles (map :title songs))

(println "==========================================================")
(println "total after processing:" (count songs))
(println "sanity check:" (set/subset? #{"I give thanks", "Bucharest", "Your girl-friends"} (set titles)))
(println "==========================================================")

total after processing: 60
sanity check: true


nil

## examining the corpus, we can see that most of the songs have a normal structure, with words separated by <font color="red"> [space, ","] </font>,

## while a small number of songs, like "your-girlfriends" and "thousand-times", contain different lines like:

* <font size =3 color="green">"(She) does me "chiqi chiqi dam dam" like this all day."
* "and she has character, (it's) son of a...*"
* "The square(=floor) is on fire, all the ladies are dancing"</font>

## this implies how we should cut (tokenize) the strings (songs): removing symbols like <font color="red"> ['=', '()'] & more </font>, while splitting on spaces and commas.

## after removing all English stop words, we count occurrences and get a simple word count (bag of words) from Omer's songs,

In [203]:
(require '[clojure.string :as s])

(defn tokenize [text]
  (as-> text t
        (s/trim t)
        (filter #(or (Character/isSpace %) (Character/isLetter ^Character %)) t)
        (apply str t)
        (s/lower-case t)
        (s/split t #"\s+")))

(def stops (->> (slurp "stopwords") s/split-lines set))

(defn bow [corpus]
  (->> corpus
       (reduce (fn [m doc]
                 (merge-with + m (-> doc 
                                     :text
                                     tokenize 
                                     frequencies)))
               {})))

; =================================
(def freq-dist
    (as-> songs songs
          (bow songs)
          (select-keys songs 
                       (set/difference (set (keys songs)) stops))
          (sort-by val > songs)))

(take 10 freq-dist)

(["dont" 112] ["im" 99] ["love" 98] ["come" 77] ["like" 75] ["heart" 60] ["day" 60] ["youre" 58] ["end" 47] ["go" 43])
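The tokenize-then-count pipeline can be tried on one of the irregular lines quoted earlier. This is a minimal sketch, not the notebook's own code: `tokenize-line`, `demo-stops` and `word-counts` are hypothetical names, the stop-word set is made up for the illustration, and a regex replaces the character filter (same idea: keep only letters and whitespace).

```clojure
(require '[clojure.string :as s])

;; Keep only letters and whitespace, lower-case, split on runs of whitespace.
(defn tokenize-line [text]
  (-> text
      (s/replace #"[^\p{L}\s]" "")
      s/trim
      s/lower-case
      (s/split #"\s+")))

;; Hypothetical, tiny stop-word set used only for this illustration.
(def demo-stops #{"the" "is" "on" "all" "are"})

(defn word-counts [lines]
  (->> lines
       (mapcat tokenize-line)
       (remove demo-stops)
       frequencies))

(tokenize-line "The square(=floor) is on fire, all the ladies are dancing")
;; => ["the" "squarefloor" "is" "on" "fire" "all" "the" "ladies" "are" "dancing"]

(word-counts ["The square(=floor) is on fire, all the ladies are dancing"
              "Fire, fire!"])
;; => {"squarefloor" 1, "fire" 3, "ladies" 1, "dancing" 1}
```

Because symbols are deleted rather than replaced by a space, "square(=floor)" collapses into the single token "squarefloor"; replacing symbols with a space instead would be one way to keep the two words apart.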

# so our pipeline so far:  
#### <font color="red">array of songs -> tokenize-each-song -> count frequencies -> collect to result</font>, 
### plotting it with vega (cool viz grammar - <font color="green">just define your viz in JSON</font> and run it from any language you want, no python-matplotlib-dependency!!):

In [209]:
; =================================
(helper/add-dependencies '[metasoarous/oz "1.5.0"])
(require '[oz.notebook.clojupyter :as oz])
; =================================

(defn xs->vega-map [xs]
    (map #(hash-map :freq (val %) :word (key %)) xs))

(def viz-data (xs->vega-map (take 30 freq-dist)))

(def stacked-bar 
    {:title "distribution of words in omer's corpus"
     :data {:values (take 15 viz-data)}
     :mark "bar"
     :encoding {:x {:field "word"
                    :type "ordinal"
                    :sort "x"}
                :y {:field "freq"
                    :type "quantitative"}}})

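(def word-cloud
    {:data {:values viz-data
            :name "data"}
     :marks [{:type "text"
              :from {:data "data"}
              ;; :baseline and :align are siblings of :text, not nested inside it
              :encode {:enter {:text {:field "word"}
                               :baseline {:value "alphabetic"}
                               :align {:value "center"}}
                       ;; the wordcloud transform computes x, y, angle and fontSize
                       ;; for each word; the mark has to read them back
                       :update {:x {:field "x"}
                                :y {:field "y"}
                                :angle {:field "angle"}
                                :fontSize {:field "fontSize"}}}
              :transform [{:type "wordcloud"
                           :size [800, 400]
                           :text {:field "word"}
                           :font "Helvetica Neue, Arial"
                           :fontSize {:field "datum.freq"}
                           :fontSizeRange [10, 120]
                           :padding 2}]}]})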
; =================================
(oz/view! stacked-bar)

# or in a word cloud..

In [210]:
(oz/view! word-cloud)

# a more interesting approach would be to look at the "context" of the words and not at their frequency (count),

#### looking at the neighborhood of a word (the words [surrounding it](https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_hypothesis)) can help us understand the word's meaning,

#### for example 
* <font color="green">"I ate sabich today"
* "Sigal made a delicious sabich for us"
* "Sabich should be served in pita"</font>

#### with a window of size 3 (words before/after) around the word 'Sabich', we get (after removing stopwords)  <font color="red"> ['ate', 'made', 'delicious', 'served']</font>

#### and can 'feel' it's in the context of food. of course, this needs to be 'learned' on loads of data (sentences...)
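The window idea can be sketched in a few lines of plain Clojure. This is only an illustration: `context-words`, `demo-stops` and `sabich-context` are hypothetical names, and the stop-word set is a made-up one that just covers the three sentences.

```clojure
(require '[clojure.string :as s])

;; Hypothetical stop-word set, just large enough for the three sentences.
(def demo-stops #{"i" "a" "for" "us" "should" "be" "in"})

(defn context-words
  "Words within n positions before/after each occurrence of target in tokens,
   with stop words removed."
  [tokens target n]
  (let [tokens (vec tokens)]
    (->> (keep-indexed (fn [i w] (when (= w target) i)) tokens)
         (mapcat (fn [i]
                   (concat (subvec tokens (max 0 (- i n)) i)
                           (subvec tokens (inc i) (min (count tokens) (+ i n 1))))))
         (remove demo-stops))))

(def sentences
  ["I ate sabich today"
   "Sigal made a delicious sabich for us"
   "Sabich should be served in pita"])

(def sabich-context
  (mapcat #(context-words (s/split (s/lower-case %) #"\s+") "sabich" 3)
          sentences))

sabich-context
;; => ("ate" "today" "made" "delicious" "served")
```

With this particular stop list "today" also survives; a stop list that contains it would give exactly the four words quoted in the text. "sigal" and "pita" are more than 3 positions away from "sabich", so they fall outside the window.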

# now we can build for each word a vector, which will contain its neighbours (its context). these are <font color="red">word-embeddings</font>, and it's cool!

In [295]:
; =================================
(helper/add-dependencies '[net.mikera/core.matrix "0.62.0"])
(helper/add-dependencies '[net.mikera/vectorz-clj "0.48.0"])
(require '[clojure.core.matrix :as m])
(m/set-current-implementation :vectorz)
; =================================

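(defn distinct-words [nested-v]
  (->> nested-v (mapcat identity) set))

;; increment the cell at [x y]; named inc-at to avoid shadowing clojure.core/inc'
(defn inc-at [M [x y]]
  (m/mset M x y 
        (inc (m/mget M x y))))

;; for every full window of (inc n) words, pair the first word with the n words after it;
;; note: (partition (inc n) 1 line) yields full windows only, so pairs among
;; the last n words of a line are not counted
(defn occurrence-indices [corpus word->idx n]
  (mapcat (fn [line]
            (mapcat (fn [[w & words]]
                      (map #(vector (word->idx w) (word->idx %)) words))
                    (partition (inc n) 1 line)))
          corpus))

(defn co-occurrence-matrix [corpus n]
  (let [word->idx (zipmap (sort (distinct-words corpus)) (range))
        shape (vec (repeat 2 (count word->idx)))
        M (->> (occurrence-indices corpus word->idx n)
               ;; count each pair in both directions so M is symmetric
               (reduce (fn [M' loc]
                         (-> (inc-at M' loc)
                             (inc-at (reverse loc))))
                       (m/zero-array shape)))]
    {:M         M
     :word->idx word->idx}))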

; =================================

(defn print' [M]
    (m/pm M {:column-names? true}))

(def ex ["Bebe yo te bote y te bote", 
         "Te di banda y te solte, yo te solte"])

(def res (co-occurrence-matrix (mapv tokenize ex) 2))
(print' (:M res))
(:word->idx res)

[[    0     1     2     3     4     5     6     7]
 [0.000 0.000 0.000 1.000 0.000 2.000 1.000 0.000]
 [0.000 0.000 0.000 0.000 0.000 1.000 0.000 1.000]
 [0.000 0.000 0.000 0.000 0.000 2.000 2.000 1.000]
 [1.000 0.000 0.000 0.000 0.000 1.000 1.000 0.000]
 [0.000 0.000 0.000 0.000 0.000 2.000 1.000 2.000]
 [2.000 1.000 2.000 1.000 2.000 0.000 3.000 3.000]
 [1.000 0.000 2.000 1.000 1.000 3.000 0.000 0.000]
 [0.000 1.000 1.000 0.000 2.000 3.000 0.000 0.000]]


{"banda" 0, "bebe" 1, "bote" 2, "di" 3, "solte" 4, "te" 5, "y" 6, "yo" 7}
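The same counting can also be sketched without core.matrix, as a plain map from word pairs to counts. This is an alternative formulation rather than the code above: `co-occurrence-counts` and `counts` are hypothetical names, and unlike the `partition`-based version it also counts pairs near the end of a line.

```clojure
(defn co-occurrence-counts
  "Symmetric co-occurrence counts as a map {[w1 w2] count}: every pair of
   tokens at most n positions apart in the same line is counted once in
   each direction."
  [lines n]
  (reduce (fn [counts [a b]]
            (-> counts
                (update [a b] (fnil inc 0))
                (update [b a] (fnil inc 0))))
          {}
          (for [line lines
                :let [line (vec line)]
                i (range (count line))
                j (range (inc i) (min (count line) (+ i n 1)))]
            [(line i) (line j)])))

;; First example sentence only, already tokenized.
(def counts
  (co-occurrence-counts [["bebe" "yo" "te" "bote" "y" "te" "bote"]] 2))

(counts ["te" "y"])    ;; => 2
(counts ["te" "bote"]) ;; => 3
```

The matrix above has 2 for the "te"/"bote" cell, because the final "te bote" pair never starts a full window there; here it is counted, hence 3. A sparse map like this also avoids allocating an N x N array when most pairs never co-occur.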

## so we have built a big square N x N matrix, with N = the number of distinct words in the corpus, where each row is a word-vector holding the counts of its neighbours in a fixed window size (let's say 3). so our dimension is N, and it's big,

### reducing the dimension would let us plot/feel the vectors in 2-3 dimensions,

### a cool feature of matrices is that any real n x m matrix M can be decomposed with SVD into a product of 3 other matrices (M = U·S·Vᵀ); keeping only the first k singular values and the matching columns of U gives a k-dimensional version of our matrix (truncated SVD, closely related to PCA).

### there are other options for performing dim-reduction (tSNE), but we'll stick to SVD,

In [287]:
(require '[clojure.core.matrix.linear :as lm])

(defn reduce-to-dim [k M]
  (let [{:keys [U S]} (lm/svd M)
        U' (->> U 
                m/columns
                (take k) 
                m/transpose)
        S' (->> S 
                (take k) 
                m/diagonal-matrix)]
    (m/mmul U' S')))

; =================================
(m/pm (reduce-to-dim 2 (:M res)))

[[1.793  1.055]
 [0.987  0.286]
 [2.435  0.204]
 [1.263  0.096]
 [2.397  0.159]
 [4.499 -3.336]
 [3.178  1.851]
 [2.897  2.055]]


nil
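In matrix terms, the reduction computed by `reduce-to-dim` can be written as follows. This is a sketch that assumes the singular values come back sorted in descending order, which is the usual convention; the symbols $u_i$, $\sigma_i$ and $M_k$ are introduced here only for the explanation.

```latex
M = U \,\Sigma\, V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_N), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_N \ge 0

M_k = U_k \,\Sigma_k \in \mathbb{R}^{N \times k}, \qquad
U_k = [\,u_1 \;\cdots\; u_k\,], \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k)
```

Row $i$ of $M_k$ is the $k$-dimensional vector of word $i$; with $k = 2$ each word becomes a point in the plane, which is what the 8 x 2 output above shows.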

In [32]:
; =================================
(helper/add-dependencies '[incanter "1.9.3"])
(use '(incanter core stats charts io))
; =================================


(defn plot-embeddings [M word->idx title words]
  (let [;; keep only known words, and look indices up in the order of `words`
        ;; so that every label lands on its own point
        words (filterv word->idx words)
        indices (mapv word->idx words)
        sliced (m/select M indices :all)
        x-cors (m/get-column sliced 0)
        y-cors (m/get-column sliced 1)
        plot (scatter-plot x-cors y-cors
                           :title title
                           :x-label "X"
                           :y-label "Y")]
    (doseq [[x y w] (map list x-cors y-cors words)]
      (add-text plot x (+ 0.06 y) w))
    (.createBufferedImage plot 600 400)))

; =================================



#'user/plot-embeddings

# now, using this co-occurrence matrix, we can get the most 'similar' words (those sharing the same 'context'), using cosine-similarity:

In [None]:
(defn cosine-sim [v1 v2]
  ;; dot(v1, v2) / (|v1| * |v2|); the norms are scalars, so multiply them with *
  (m/div
    (m/dot v1 v2)
    (* (lm/norm v1) (lm/norm v2))))

(defn similarity [M word->idx w1 w2]
  (cosine-sim (m/get-row M (word->idx w1))
              (m/get-row M (word->idx w2))))

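(defn similar-words [M word->idx w n]
  ;; pair every other word with its similarity to w, then sort descending;
  ;; [word sim] pairs avoid losing words that share the same similarity value
  (->> (keys (dissoc word->idx w))
       (map (fn [w'] [w' (similarity M word->idx w w')]))
       (sort-by second >)
       (take n)
       (map first)))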
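To see the ranking at work without core.matrix, here is a minimal sketch on plain Clojure vectors. The names `dot`, `norm`, `cosine-similarity`, `most-similar` and `demo-embeddings` are hypothetical, and the 2-d vectors are made up for the example rather than taken from the corpus.

```clojure
(defn dot [a b] (reduce + (map * a b)))

(defn norm [a] (Math/sqrt (dot a a)))

(defn cosine-similarity
  "dot(a, b) / (|a| * |b|); returns nil when either vector is all zeros."
  [a b]
  (let [denom (* (norm a) (norm b))]
    (when (pos? denom)
      (/ (dot a b) denom))))

;; Hypothetical 2-d embeddings, made up for the example.
(def demo-embeddings
  {"te"   [4.0 -3.0]
   "y"    [3.0 2.0]
   "yo"   [3.0 2.2]
   "bebe" [1.0 0.3]})

(defn most-similar [embeddings w n]
  (let [v (embeddings w)]
    (->> (dissoc embeddings w)
         (map (fn [[w' v']] [w' (cosine-similarity v v')]))
         (sort-by second >)
         (take n)
         (map first))))

(most-similar demo-embeddings "y" 2)
;; => ("yo" "bebe")
```

Cosine similarity only compares directions, so "bebe" ranks above "te" here even though its vector is much shorter: it points in nearly the same direction as "y".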
