In [None]:
(set! *print-length* 10)

In [None]:
(require '[clojupyter.misc.helper :as helper])
(require '[clojupyter.misc.display :as display])

In [None]:
(helper/add-dependencies '[semantic-csv "0.2.1-alpha1"])
(helper/add-dependencies '[hiccup-table "0.2.0"])
(helper/add-dependencies '[metasoarous/oz "1.5.6"])
(helper/add-dependencies '[de.find-method/tfidf "0.1.0"])
(helper/add-dependencies '[com.zensols.nlp/parse "0.1.6"])

In [None]:
(require '[clojure.java.io :as io]
         '[clojure-csv.core :as csv]
         '[semantic-csv.core :as sc])
(require 'hiccup.table)
(require 'oz.core)
(require 'oz.notebook.clojupyter)
(require 'zensols.nlparse.parse
         'zensols.nlparse.config
         'zensols.nlparse.stopword
         'tfidf.tfidf
         'tfidf.freq
         'tfidf.xf)


# Read training data 


The next cell reads the tab-separated training file and stores it as a sequence of maps: every element in the sequence is a map
with one key per column.


In [None]:

(def train
  (sc/slurp-csv "data/sentiment-analysis-rotten-tomatoes/train.tsv"
                :parser-opts {:delimiter \tab}))
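
To make the sequence-of-maps shape concrete, here is a minimal sketch that builds the same kind of structure from a made-up inline sample using only `clojure.string`. The helper `tsv->maps`, the sample rows, and the `PhraseId`/`SentenceId` column names are assumptions for illustration; this is not how `semantic-csv` is implemented.

```clojure
(require '[clojure.string :as str])

;; Made-up two-row sample in a tab-separated layout (not the real file).
(def sample-tsv
  "PhraseId\tSentenceId\tPhrase\tSentiment\n1\t1\tA series of escapades\t1\n2\t1\tA series\t2")

(defn tsv->maps
  "Hypothetical helper: turns TSV text into a sequence of maps keyed by the header row."
  [text]
  (let [[header & rows] (map #(str/split % #"\t") (str/split-lines text))
        ks (map keyword header)]
    (map #(zipmap ks %) rows)))

(:Phrase (first (tsv->maps sample-tsv)))
;; => "A series of escapades"
```

Note that all values stay strings, which is why the sentiment labels are compared as strings such as `"1"` later on.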

We can list the column names by taking the keys of the first row.


In [None]:
(keys (first train))

In [None]:
(defn display-seq-of-maps [seq-of-maps]
  (let [ks (keys (first seq-of-maps))
        mapping (map #(vector % %) ks)]
    (display/hiccup-html (hiccup.table/to-table1d seq-of-maps mapping))))

To see the table nicely in the notebook, we convert it into hiccup and render it as HTML.

In [None]:
(display-seq-of-maps (take 5 train))
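
Hiccup represents HTML as nested Clojure vectors, so a table is plain data until it is rendered. The sketch below shows roughly what such a structure looks like for a sequence of maps. The helper `rows->hiccup-table` is hypothetical and is not the implementation of `hiccup.table/to-table1d`.

```clojure
(defn rows->hiccup-table
  "Hypothetical helper: builds a hiccup-style table vector from a sequence of maps."
  [rows]
  (let [ks (keys (first rows))]
    [:table
     [:thead (into [:tr] (map (fn [k] [:th (name k)]) ks))]
     (into [:tbody]
           (map (fn [row] (into [:tr] (map (fn [k] [:td (str (get row k))]) ks)))
                rows))]))

(rows->hiccup-table [{:Phrase "A series" :Sentiment "2"}])
;; => [:table [:thead [:tr [:th "Phrase"] [:th "Sentiment"]]] [:tbody [:tr [:td "A series"] [:td "2"]]]]
```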


In [None]:
(count train)

So we have around 155000 training cases.

Let's look at the distribution over the 5 different values of "Sentiment": how many cases do we have for each?

In [None]:
(frequencies (map :Sentiment train))
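
Raw counts are easier to compare as fractions of the total. A minimal sketch on made-up labels follows; `sample-labels` and the helper `relative-shares` are assumptions for illustration, not part of the original analysis.

```clojure
;; Made-up labels standing in for (map :Sentiment train).
(def sample-labels ["2" "2" "3" "1" "2" "4" "0" "3"])

(def label-counts (frequencies sample-labels))

(defn relative-shares
  "Hypothetical helper: converts a map of counts into a sorted map of fractions of the total."
  [counts]
  (let [total (reduce + (vals counts))]
    (into (sorted-map) (map (fn [[k v]] [k (double (/ v total))]) counts))))

(relative-shares label-counts)
;; => {"0" 0.125, "1" 0.125, "2" 0.375, "3" 0.25, "4" 0.125}
```

On the real data, the same helper could be applied to `(frequencies (map :Sentiment train))`.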

# Exploratory data analysis
## Word clouds

Word clouds give a first glimpse into the text data by showing the distribution of words.
First we do this for all texts, and then separately for each sentiment value.

The more often a word appears, the larger it is drawn. Very common stopwords, given as a list, are excluded.
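
The counting behind a word cloud can be sketched in plain Clojure: extract words of at least three characters, drop stopwords, and count what remains. The helper `cloud-counts` and the tiny stopword set are assumptions for illustration. The sketch lower-cases the text for simplicity and is not the code that draws the clouds in this notebook.

```clojure
(require '[clojure.string :as str])

;; Tiny illustrative stopword set; a realistic list is much longer.
(def stopwords #{"the" "and" "was" "but"})

(defn cloud-counts
  "Hypothetical helper: counts words of 3+ characters across texts, skipping stopwords."
  [texts]
  (->> texts
       (mapcat #(re-seq #"[\w']{3,}" (str/lower-case %)))
       (remove stopwords)
       frequencies))

(cloud-counts ["The plot was thin" "Thin plot, but the acting was good"])
;; => {"plot" 2, "thin" 2, "acting" 1, "good" 1}
```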


I use the oz library, which draws plots from Vega/Vega-Lite specifications.
The following function builds such a spec for a word cloud from a sequence of texts.

In [None]:
(defn word-cloud-spec-fromtext
  "Returns a Vega spec that draws a word cloud from `texts`, a sequence of strings."
  [texts]
  {"width" 800
   "height" 400
   "padding" 0
   "data" [{"name" "table"
            "values" texts
            ;; Count words of 3+ characters, ignoring stopwords.
            "transform" [{"type" "countpattern"
                          "field" "data"
                          "case" "upper"
                          "pattern" "[\\w']{3,}"
                          "stopwords" "(i|me|my|myself|we|us|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his|himself|she|her|hers|herself|it|its|itself|they|them|their|theirs|themselves|what|which|who|whom|whose|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|will|would|should|can|could|ought|i'm|you're|he's|she's|it's|we're|they're|i've|you've|we've|they've|i'd|you'd|he'd|she'd|we'd|they'd|i'll|you'll|he'll|she'll|we'll|they'll|isn't|aren't|wasn't|weren't|hasn't|haven't|hadn't|doesn't|don't|didn't|won't|wouldn't|shan't|shouldn't|can't|cannot|couldn't|mustn't|let's|that's|who's|what's|here's|there's|when's|where's|why's|how's|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|upon|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|say|says|said|shall)"}
                         ;; Pick a random rotation angle for each word.
                         {"type" "formula"
                          "as" "angle"
                          "expr" "[-45, 0, 45][~~(random() * 3)]"}
                         {"type" "formula"
                          "as" "weight"
                          "expr" "if(datum.text=='VEGA', 600, 300)"}]}]
   "scales" [{"name" "color"
              "type" "ordinal"
              "domain" {"data" "table", "field" "text"}
              "range" ["#d5a928" "#652c90" "#939597"]}]
   "marks" [{"type" "text"
             "from" {"data" "table"}
             "encode" {"enter" {"text" {"field" "text"}
                                "align" {"value" "center"}
                                "baseline" {"value" "alphabetic"}
                                "fill" {"scale" "color", "field" "text"}}
                       "update" {"fillOpacity" {"value" 1}}
                       "hover" {"fillOpacity" {"value" 0.5}}}
             ;; Lay out the words; font size follows the word count.
             "transform" [{"type" "wordcloud"
                           "size" [800 400]
                           "text" {"field" "text"}
                           "rotate" {"field" "datum.angle"}
                           "font" "Helvetica Neue, Arial"
                           "fontSize" {"field" "datum.count"}
                           "fontWeight" {"field" "datum.weight"}
                           "fontSizeRange" [12 56]
                           "padding" 2}]}]})


In [None]:
(def sample-size 1000)

In [None]:
(defn filtered-word-cloud [sentiment]
  (->> train
       (filter #(= sentiment (:Sentiment %)))
       (take sample-size)
       (map :Phrase)
       word-cloud-spec-fromtext
       oz.notebook.clojupyter/view!))

### All text word cloud

In [None]:
(->> (shuffle train)
     (take sample-size)
     (map :Phrase)
     word-cloud-spec-fromtext
     oz.notebook.clojupyter/view!)

### Word clouds for each sentiment

In [None]:
(filtered-word-cloud "0")

In [None]:
(filtered-word-cloud "1")

In [None]:
(filtered-word-cloud "2")

In [None]:
(filtered-word-cloud "3")

In [None]:
(filtered-word-cloud "4")

In order to create the vocabulary, we first need to tokenize the text and get overall counts for each token.

These counts can then be used to filter out rare or very frequent tokens.

In [None]:
;; Parser context configured to tokenize only.
(def tokenize-context
  (->> (zensols.nlparse.config/create-parse-config :only-tokenize? true)
       zensols.nlparse.config/create-context))

(defn tokenize
  "Splits the string `s` into a sequence of token strings."
  [s]
  (zensols.nlparse.config/with-context tokenize-context
    (->> (zensols.nlparse.parse/parse s)
         zensols.nlparse.parse/tokens
         (map :text))))

;; [token count] pairs over the first 100 phrases, most frequent first.
(def tokens
  (->> (sequence
        (comp (map tokenize)
              (map tfidf.freq/freq)
              (filter seq))
        (map :Phrase (take 100 train)))
       (apply merge-with +)
       (sort-by second)
       reverse))
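
The shape of this pipeline (tokenize each phrase, count per phrase, merge the counts, sort) can be tried without the NLP libraries. The sketch below swaps in a naive whitespace tokenizer and `clojure.core/frequencies`; both are stand-ins assumed for illustration and will not match the real tokenizer's output on punctuation or contractions.

```clojure
(require '[clojure.string :as str])

(defn naive-tokenize
  "Hypothetical stand-in for the NLP tokenizer: splits on whitespace."
  [s]
  (remove str/blank? (str/split s #"\s+")))

(def sample-token-counts
  (->> (sequence
        (comp (map naive-tokenize)
              (map frequencies)   ; per-phrase counts
              (filter seq))       ; drop phrases without tokens
        ["a good film" "a bad film" ""])
       (apply merge-with +)       ; sum the per-phrase counts
       (sort-by second)
       reverse))

;; The most frequent tokens come first; here "a" and "film" both occur twice.
(second (first sample-token-counts))
;; => 2
```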



In [None]:
(take 10 tokens)

In [None]:
(display/hiccup-html (hiccup.table/to-table1d (take 50 tokens) [0 "token" 1 "freq"]))
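
With token counts in this `[token count]` shape, filtering out rare or very frequent tokens is a matter of thresholds. A minimal sketch on made-up counts follows; the helper `filter-vocabulary`, the sample data, and the threshold values are assumptions for illustration.

```clojure
;; Made-up [token count] pairs in the same shape as `tokens`.
(def sample-tokens [["the" 120] ["film" 40] ["good" 12] ["escapades" 1]])

(defn filter-vocabulary
  "Hypothetical helper: keeps tokens whose count lies within [min-count, max-count]."
  [token-counts min-count max-count]
  (->> token-counts
       (filter (fn [[_ n]] (<= min-count n max-count)))
       (mapv first)))

(filter-vocabulary sample-tokens 2 100)
;; => ["film" "good"]
```

Suitable thresholds depend on the corpus size, so they would need to be chosen after inspecting the real counts.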
