In [2]:
(require '[clojupyter.javascript.alpha :as cjp-js])
(require '[clojupyter.display :as display])
(require '[clojupyter.misc.helper :as helper])
(require '[clojure.data.json :as json])

nil

# 1.1 Reading data from a csv file

You can read data from a CSV file using the `read-csv` function from `clojure.data.csv`. By default, it assumes that the fields are comma-separated.

We're going to be looking at some cyclist data from Montréal. Here's the original page (in French), but it's already included in this repository. We're using the data from 2012.

This dataset is a list of how many people were on 7 different bike paths in Montreal, each day.

In [21]:
(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(helper/add-dependencies '[metasoarous/oz "1.5.6"])
(require '[oz.notebook.clojupyter :as oz])
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])
(require '[clojure.pprint :as pp])

nil

In [22]:
(def broken-data 
    (with-open [reader (io/reader "../data/bikes.csv")]
      (doall
        (csv/read-csv reader))))

#'user/broken-data

In [23]:
; Look at the first 3 rows
(take 3 broken-data)

(["Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)"] ["01/01/2012;35;;0;38;51;26;10;16;"] ["02/01/2012;83;;1;68;153;53;6;43;"])

You'll notice that this is totally broken! `io/reader` and `csv/read-csv` have options that will let us fix that, though. Here we'll

* Change the column separator to a `;` (the `:separator` option of `csv/read-csv`)
* Set the encoding to `"ISO-8859-1"` (the `:encoding` option of `io/reader`; the default is `"UTF-8"`)

In [29]:
(def fixed-data
    (with-open [reader (io/reader "../data/bikes.csv" :encoding "ISO-8859-1")]
      (doall
        (csv/read-csv reader :separator \;))))
(->> fixed-data
     (take 5)
     pp/pprint)

(["Date"
  "Berri 1"
  "Brébeuf (données non disponibles)"
  "Côte-Sainte-Catherine"
  "Maisonneuve 1"
  "Maisonneuve 2"
  "du Parc"
  "Pierre-Dupuy"
  "Rachel1"
  "St-Urbain (données non disponibles)"]
 ["01/01/2012" "35" "" "0" "38" "51" "26" "10" "16" ""]
 ["02/01/2012" "83" "" "1" "68" "153" "53" "6" "43" ""]
 ["03/01/2012" "135" "" "2" "104" "248" "89" "3" "58" ""]
 ["04/01/2012" "144" "" "1" "116" "318" "111" "8" "61" ""])


nil
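To see why both settings matter, here is a minimal sketch that needs no file on disk. The one-line sample is hypothetical (it only mimics the start of the header row), and it splits on `;` with `clojure.string/split` as a stand-in for what `csv/read-csv` does with `:separator`.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical sample: part of the header row, encoded as ISO-8859-1.
(def sample-bytes (.getBytes "Date;Berri 1;Brébeuf" "ISO-8859-1"))

;; Decoding with the wrong charset turns "é" into the replacement character.
(def wrong (String. sample-bytes "UTF-8"))

;; Passing :encoding to io/reader decodes the same bytes correctly.
(def right
  (with-open [rdr (io/reader (java.io.ByteArrayInputStream. sample-bytes)
                             :encoding "ISO-8859-1")]
    (slurp rdr)))

;; Splitting on ";" rather than "," recovers the three fields.
(def fields (str/split right #";"))

(println wrong)
(println fields)
```

The first `println` shows the same kind of damage as the broken output above, and the second shows the accents intact.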

In [30]:
(require '[clojure.string :as str]) ;; needed for str/blank? below

(defn blank->nil [s]
  (when-not (str/blank? s) s))

(defn csv-data->maps [csv-data]
  (map zipmap
       (->> (first csv-data) ;; First row is the header
            (map keyword) ;; Drop if you want string keys instead
            repeat)
       (->> (rest csv-data)
            (map #(map blank->nil %))))) ;; Drop if you want blank strings to stay

(def fixed-df (csv-data->maps fixed-data))
(->> fixed-df
     (take 5)
     pp/pprint)

({:du Parc "26",
  :Rachel1 "16",
  :Pierre-Dupuy "10",
  :Berri 1 "35",
  :Maisonneuve 1 "38",
  :Brébeuf (données non disponibles) nil,
  :Date "01/01/2012",
  :Côte-Sainte-Catherine "0",
  :St-Urbain (données non disponibles) nil,
  :Maisonneuve 2 "51"}
 {:du Parc "53",
  :Rachel1 "43",
  :Pierre-Dupuy "6",
  :Berri 1 "83",
  :Maisonneuve 1 "68",
  :Brébeuf (données non disponibles) nil,
  :Date "02/01/2012",
  :Côte-Sainte-Catherine "1",
  :St-Urbain (données non disponibles) nil,
  :Maisonneuve 2 "153"}
 {:du Parc "89",
  :Rachel1 "58",
  :Pierre-Dupuy "3",
  :Berri 1 "135",
  :Maisonneuve 1 "104",
  :Brébeuf (données non disponibles) nil,
  :Date "03/01/2012",
  :Côte-Sainte-Catherine "2",
  :St-Urbain (données non disponibles) nil,
  :Maisonneuve 2 "248"}
 {:du Parc "111",
  :Rachel1 "61",
  :Pierre-Dupuy "8",
  :Berri 1 "144",
  :Maisonneuve 1 "116",
  :Brébeuf (données non disponibles) nil,
  :Date "04/01/2012",
  :Côte-Sainte-Catherine "1",
  :St-Urbain (données non disponibl

nil

Finally, for a result which is a sequence of maps (like the above), you can use `clojure.pprint/print-table` to print it as a table:

In [37]:
(->> fixed-df
     (take 5)
     (map #(into {} (take 7 %)))
     pp/print-table)


| :du Parc | :Rachel1 | :Pierre-Dupuy | :Berri 1 | :Maisonneuve 1 | :Brébeuf (données non disponibles) |      :Date |
|----------+----------+---------------+----------+----------------+------------------------------------+------------|
|       26 |       16 |            10 |       35 |             38 |                                    | 01/01/2012 |
|       53 |       43 |             6 |       83 |             68 |                                    | 02/01/2012 |
|       89 |       58 |             3 |      135 |            104 |                                    | 03/01/2012 |
|      111 |       61 |             8 |      144 |            116 |                                    | 04/01/2012 |
|       97 |       95 |            13 |      197 |            124 |                                    | 05/01/2012 |


nil
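Taking the first 7 entries of each map keeps the table narrow, but it leaves the column order up to the map. As an alternative sketch, `print-table` also has a two-argument form that takes the keys to show, in order. The two sample rows below are hypothetical stand-ins for `fixed-df`, with values copied from the output above.

```clojure
(require '[clojure.pprint :as pp]
         '[clojure.string :as str])

;; Hypothetical two-row sample shaped like fixed-df.
(def sample-rows
  [{(keyword "Berri 1") "35", :Rachel1 "16", :Date "01/01/2012"}
   {(keyword "Berri 1") "83", :Rachel1 "43", :Date "02/01/2012"}])

;; The two-argument arity prints only the listed keys, in the listed order.
(def table-text
  (with-out-str
    (pp/print-table [:Date (keyword "Berri 1")] sample-rows)))

(print table-text)
```

This prints `:Date` first and leaves out `:Rachel1`, whatever order the keys have inside each map.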

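The values are still strings, so next we'll:

* Parse the dates in the `:Date` column
* Tell the parser that our dates have the day first instead of the month first
* Keep `:Date` as the key that identifies each row, since a sequence of maps has no index to set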
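The day-first point is easy to get wrong, because the same string is a valid date under either reading. Here is a small sketch using plain `java.time` interop (an assumption made to keep the example dependency-free; it is not the wrapper library this notebook uses):

```clojure
(import '(java.time LocalDate)
        '(java.time.format DateTimeFormatter))

(def day-first   (DateTimeFormatter/ofPattern "dd/MM/yyyy"))
(def month-first (DateTimeFormatter/ofPattern "MM/dd/yyyy"))

;; The same string names two different days depending on the pattern.
(def jan-2 (LocalDate/parse "02/01/2012" day-first))
(def feb-1 (LocalDate/parse "02/01/2012" month-first))

(println (str jan-2)) ; prints 2012-01-02
(println (str feb-1)) ; prints 2012-02-01
```

With the month-first pattern, the first twelve days of every month would parse without any error and silently land in the wrong month.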

In [34]:
(helper/add-dependencies '[clojure.java-time "0.3.2"])
(require '[java-time :as t])
(require '[clojure.edn :as edn])

nil

You can use the edn reader to parse numbers. This has the benefit of giving you floats or Bignums when needed, too.
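As a quick sketch of what that means in practice, these are the types `edn/read-string` returns for a few inputs. The input strings are made up for illustration. The last line shows a caveat: text that is not a number is read as some other edn value instead of raising an error.

```clojure
(require '[clojure.edn :as edn])

(def parsed-long   (edn/read-string "35"))    ; a Long
(def parsed-double (edn/read-string "35.5"))  ; a Double
(def parsed-big    (edn/read-string "123456789012345678901234567890")) ; a BigInt
(def parsed-nil    (edn/read-string nil))     ; nil stays nil, so blank cells survive

;; Caveat: non-numeric text comes back as a symbol, not as an error.
(def parsed-symbol (edn/read-string "abc"))

(println (map class [parsed-long parsed-double parsed-big parsed-symbol]))
```

Because `nil` passes through unchanged, the blank cells that `blank->nil` produced need no special handling in the parser.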

In [38]:
(defn col-parser [col-key]
    (if (= :Date col-key) 
         (partial t/local-date "dd/MM/yyyy") 
         edn/read-string))

(def parsed-df
    (->> fixed-df
         (map #(into {} (map (fn [[k v]] [k ((col-parser k) v)]) %))))) ;; Apply each parser to columns

(->> parsed-df
     (take 5)
     (map #(into {} (take 7 %)))
     pp/print-table)


| :du Parc | :Rachel1 | :Pierre-Dupuy | :Berri 1 | :Maisonneuve 1 | :Brébeuf (données non disponibles) |      :Date |
|----------+----------+---------------+----------+----------------+------------------------------------+------------|
|       26 |       16 |            10 |       35 |             38 |                                    | 2012-01-01 |
|       53 |       43 |             6 |       83 |             68 |                                    | 2012-01-02 |
|       89 |       58 |             3 |      135 |            104 |                                    | 2012-01-03 |
|      111 |       61 |             8 |      144 |            116 |                                    | 2012-01-04 |
|       97 |       95 |            13 |      197 |            124 |                                    | 2012-01-05 |


nil

# 1.2 Selecting columns

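With a sequence of maps, selecting columns means keeping only the keys you want from each row, which is what `select-keys` does. Column names that contain a space, such as `Berri 1`, can't be written as keyword literals, so we build them with `(keyword "Berri 1")`. Here's an example: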

In [41]:
(->> parsed-df 
     (map #(select-keys % [:Date (keyword "Berri 1")]))
     pp/print-table)


|      :Date | :Berri 1 |
|------------+----------|
| 2012-01-01 |       35 |
| 2012-01-02 |       83 |
| 2012-01-03 |      135 |
| 2012-01-04 |      144 |
| 2012-01-05 |      197 |
| 2012-01-06 |      146 |
| 2012-01-07 |       98 |
| 2012-01-08 |       95 |
| 2012-01-09 |      244 |
| 2012-01-10 |      397 |
| 2012-01-11 |      273 |
| 2012-01-12 |      157 |
| 2012-01-13 |       75 |
| 2012-01-14 |       32 |
| 2012-01-15 |       54 |
| 2012-01-16 |      168 |
| 2012-01-17 |      155 |
| 2012-01-18 |      139 |
| 2012-01-19 |      191 |
| 2012-01-20 |      161 |
| 2012-01-21 |       53 |
| 2012-01-22 |       71 |
| 2012-01-23 |      210 |
| 2012-01-24 |      299 |
| 2012-01-25 |      334 |
| 2012-01-26 |      306 |
| 2012-01-27 |       91 |
| 2012-01-28 |       80 |
| 2012-01-29 |       87 |
| 2012-01-30 |      219 |
| 2012-01-31 |      186 |
| 2012-02-01 |      138 |
| 2012-02-02 |      217 |
| 2012-02-03 |      174 |
| 2012-02-04 |       84 |
| 2012-02-05 |       72 |
| 2012-02-0

nil
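For a closer look at the two moves involved, here is a minimal sketch on hypothetical sample rows (values copied from the table above, with dates left as plain strings). It shows `select-keys` for several columns and the keyword used as a function for a single column.

```clojure
;; Hypothetical sample rows shaped like parsed-df.
(def sample-rows
  [{:Date "2012-01-01" (keyword "Berri 1") 35 :Rachel1 16}
   {:Date "2012-01-02" (keyword "Berri 1") 83 :Rachel1 43}])

;; A keyword with a space in its name has to be built with `keyword`.
(def berri (keyword "Berri 1"))

;; Several columns: keep only the listed keys in each row.
(def two-columns (map #(select-keys % [:Date berri]) sample-rows))

;; One column: any keyword works as a lookup function on a map.
(def berri-only (map berri sample-rows))

(println two-columns)
(println berri-only)
```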

# 1.3 Plotting a column


To plot a column, we describe a line chart as a Vega-Lite spec and hand it to `oz/view!`. We can see that, unsurprisingly, not many people are biking in January, February, and March.

In [45]:
(def line-plot
  "Vega-Lite spec: line chart of the daily Berri 1 counts."
  {:mark     "line"
   :data     {:values (map #(update % :Date t/format) parsed-df)}
   :encoding {:x     {:field :Date
                      :type "temporal"}
              :y     {:field (keyword "Berri 1")
                      :type "quantitative"}}
   :width 500})
(oz/view! line-plot)

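We can also plot all the columns just as easily. Vega-Lite colors lines by a field, so we first reshape the data into long format, with one row per date and bike path. We'll make it a little bigger, too.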
You can see that it's more squished together, but all the bike paths behave basically the same -- if it's a bad day for cyclists, it's a bad day everywhere.

In [48]:
;; prepare dataframe for multiple categories
(def value-vars (remove #{:Date} (keys (first parsed-df))))

(defn melt 
    "Reshape wide rows into long format: one map per row and value column."
    [rows key-var value-vars]
    (mapcat (fn [row] 
              (for [v value-vars]
                {key-var (key-var row)
                 :value (v row)
                 :variable v}))
            rows))

(def melted-df (melt parsed-df :Date value-vars))
(->> melted-df
     (take 10)
     pp/print-table)


|      :Date | :value |                            :variable |
|------------+--------+--------------------------------------|
| 2012-01-01 |     26 |                             :du Parc |
| 2012-01-01 |     16 |                             :Rachel1 |
| 2012-01-01 |     10 |                        :Pierre-Dupuy |
| 2012-01-01 |     35 |                             :Berri 1 |
| 2012-01-01 |     38 |                       :Maisonneuve 1 |
| 2012-01-01 |        |   :Brébeuf (données non disponibles) |
| 2012-01-01 |      0 |               :Côte-Sainte-Catherine |
| 2012-01-01 |        | :St-Urbain (données non disponibles) |
| 2012-01-01 |     51 |                       :Maisonneuve 2 |
| 2012-01-02 |     53 |                             :du Parc |


nil

In [49]:
(def line-plot
  "Vega-Lite spec: one line per bike path, colored by :variable."
  {:mark     "line"
   :data     {:values (map #(update % :Date t/format) melted-df)}
   :encoding {:x     {:field :Date
                      :type "temporal"}
              :y     {:field :value
                      :type "quantitative"}
              :color {:field :variable
                      :type "nominal"}}
   :width 500})
(oz/view! line-plot)

# 1.4 Putting all that together

In [50]:
;; read data in
(def fixed-data
    (with-open [reader (io/reader "../data/bikes.csv" :encoding "ISO-8859-1")]
      (doall
        (csv/read-csv reader :separator \;))))

(require '[clojure.string :as str]) ;; needed for str/blank? below

(defn blank->nil [s]
  (when-not (str/blank? s) s))

(defn csv-data->maps [csv-data]
  (map zipmap
       (->> (first csv-data) ;; First row is the header
            (map keyword) ;; Drop if you want string keys instead
            repeat)
       (->> (rest csv-data)
            (map #(map blank->nil %))))) ;; Drop if you want blank strings to stay

(def fixed-df (csv-data->maps fixed-data))

(defn col-parser [col-key]
    (if (= :Date col-key) 
         (partial t/local-date "dd/MM/yyyy") 
         edn/read-string))

(def parsed-df
    (->> fixed-df
         (map #(into {} (map (fn [[k v]] 
                                 [k ((col-parser k) v)])  ;; Apply each parser to the values of the column
                             %)))))

;; prepare dataframe for multiple categories
(def value-vars (remove #{:Date} (keys (first parsed-df))))

(defn melt 
    "Reshape wide rows into long format: one map per row and value column."
    [rows key-var value-vars]
    (mapcat (fn [row] 
              (for [v value-vars]
                {key-var (key-var row)
                 :value (v row)
                 :variable v}))
            rows))

(def melted-df (melt parsed-df :Date value-vars))

(def line-plot
  "Vega-Lite spec: one line per bike path, colored by :variable."
  {:mark     "line"
   :data     {:values (map #(update % :Date t/format) melted-df)}
   :encoding {:x     {:field :Date
                      :type "temporal"}
              :y     {:field :value
                      :type "quantitative"}
              :color {:field :variable
                      :type "nominal"}}
   :height 500
   :width 800})
(oz/view! line-plot)