In [1]:
(require '[clojupyter.javascript.alpha :as cjp-js])
(require '[clojupyter.display :as display])
(require '[clojupyter.misc.helper :as helper])
(require '[clojure.data.json :as json])
(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(require '[clojure.data.csv :as csv])
(helper/add-dependencies '[metasoarous/oz "1.5.6"])
(require '[oz.notebook.clojupyter :as oz])
(require '[clojure.java.io :as io])
(require '[clojure.pprint :as pp])
(helper/add-dependencies '[clojure.java-time "0.3.2"])
(require '[java-time :as t])
(require '[clojure.edn :as edn])
(helper/add-dependencies '[panthera "0.1-alpha.13"])
(require '[libpython-clj.python :as py])
(require '[panthera.panthera :as pt])

nil

In [2]:
;; use panthera html display
(defn show
  [obj]
  (display/html
    (py/call-attr obj "to_html")))

(defn show-table
  [m]
  (-> m
      pt/data-frame
      show))

(show-table [{:a 1 :b 2} {:a 3 :b 4}])

Unnamed: 0,a,b
0,1,2
1,3,4


We saw earlier that pandas is really good at dealing with dates. It is also amazing with strings! We're going to go back to our weather data from Chapter 5, here.

In [5]:
;; Python
;; weather_2012 = pd.read_csv('../data/weather_2012.csv', index_col='Date/Time')
;; weather_2012[:5]

(def fixed-data
    (with-open [reader (io/reader "../data/weather_2012.csv")]
      (doall
        (csv/read-csv reader))))

(defn blank->nil [s]
   (when-not (#{""} s) s))

(defn csv-data->maps [csv-data]
  (map zipmap
       (->> (first csv-data) ;; First row is the header
            (map keyword) ;; Drop if you want string keys instead
            repeat)
       (->> (rest csv-data)
            (map #(map blank->nil %))))) ;; Drop if you want blank strings to stay

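(defn col-parser [col-key]
    ;; Keep text columns as strings. edn/read-string reads only the first form,
    ;; so it would turn "Freezing Drizzle,Fog" into the symbol Freezing.
    (if (#{:Date/Time :Weather} col-key)
         identity
         edn/read-string))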

(def weather-2012
    (->> fixed-data
         csv-data->maps
        (map #(into {} (map (fn [[k v]] [k ((col-parser k) v)]) %))))) ;; Apply each parser to columns

(->> weather-2012
     (take 5)
     show-table)

Unnamed: 0,Time,Temp (C),Dew Point Temp (C),Rel Hum (%),h),Visibility (km),Stn Press (kPa),Weather
0,2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog
1,2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog
2,2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,Freezing
3,2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,Freezing
4,2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog


# 6.1 String operations

You'll see that the 'Weather' column has a text description of the weather that was going on each hour. We'll assume it's snowing if the text description contains "Snow".

pandas provides vectorized string functions, to make it easy to operate on columns containing text. There are some great [examples](http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods) in the documentation.
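In Clojure there is no separate vectorized string API; mapping `clojure.string` functions over a column plays the same role. The following is a minimal sketch on made-up rows (`sample-rows` and `contains-text?` are hypothetical names, not part of this notebook). It treats a missing value as `false`, which is a simplification compared with how pandas handles missing data.

```clojure
(require '[clojure.string :as str])

;; Made-up sample rows, shaped like the weather data used in this chapter.
(def sample-rows
  [{:Weather "Fog"}
   {:Weather "Freezing Drizzle,Fog"}
   {:Weather "Snow Showers"}
   {:Weather nil}])

(defn contains-text?
  "True when s is a string containing substr; false when s is nil."
  [s substr]
  (boolean (and s (str/includes? s substr))))

;; Rough analogue of weather_description.str.contains('Snow')
(def snow-flags
  (mapv #(contains-text? (:Weather %) "Snow") sample-rows))

;; Rough analogue of weather_description.str.upper(); nil stays nil
(def upper-cased
  (mapv #(some-> (:Weather %) str/upper-case) sample-rows))
```

Here `snow-flags` is `[false false true false]`, and `upper-cased` keeps `nil` for the missing entry.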

In [25]:
;; The default parser expects ISO-8601 ("2012-01-01T00:00:00"), so the
;; space-separated timestamps in this file need an explicit pattern.
(t/local-date-time "yyyy-MM-dd HH:mm:ss" "2012-01-01 00:00:00")

In [23]:
;; weather_description = weather_2012['Weather']
;; is_snowing = weather_description.str.contains('Snow')

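(require '[clojure.string :as str])

;; Each row gets a boolean :snow flag; (str ...) guards against missing values.
(def weather-description 
    (->> weather-2012
        (map #(assoc % :snow (str/includes? (str (:Weather %)) "Snow")))))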

(->> weather-description 
     (take 5)
     show-table)

Unnamed: 0,Stn Press (kPa),Time,Weather,snow,Temp (C),Rel Hum (%),h),Visibility (km),Dew Point Temp (C)
0,101.24,2012-01-01 00:00:00,Fog,False,-1.8,86,4,8.0,-3.9
1,101.24,2012-01-01 01:00:00,Fog,False,-1.8,87,4,8.0,-3.7
2,101.26,2012-01-01 02:00:00,Freezing,False,-1.8,89,7,4.0,-3.4
3,101.27,2012-01-01 03:00:00,Freezing,False,-1.5,88,6,4.0,-3.2
4,101.23,2012-01-01 04:00:00,Fog,False,-1.5,88,7,4.8,-3.3


This gives us a boolean `:snow` flag on every row, which is hard to read in a table, so we'll plot it.

In [24]:
;; # More useful!
;; is_snowing.plot()

(oz/view!
  {:data {:values (filter :snow weather-description)}
  :mark "tick"
  :encoding {:x {:field :Date/Time
                 :type "temporal"}}
  :width 800})

# 6.2 Use resampling to find the snowiest month

If we wanted the median temperature each month, pandas has the `resample()` method. Here we get the same result by truncating each timestamp to the first day of its month and grouping on that:

In [62]:
(def x (t/local-date-time "yyyy-MM-dd HH:mm:ss" "2012-01-01 00:00:00"))

(t/adjust x :first-day-of-month)

#object[java.time.LocalDateTime 0x4b0a655e "2012-01-01T00:00"]

In [87]:
((keyword "month") 
 {:month "2012-12-01"})

"2012-12-01"

In [148]:

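(defn median [vect]
    ;; Average the two middle values for even counts, as np.median does.
    (let [v (vec (sort vect)) mid (quot (count v) 2)]
      (if (odd? (count v))
        (v mid)
        (/ (+ (v (dec mid)) (v mid)) 2.0))))

(median [1 4 3 2 5 6])

3.5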

In [150]:
;; Python
;; weather_2012['Temp (C)'].resample('M').apply(np.median).plot(kind='bar')

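(defn median [vect]
    ;; Average the two middle values for even counts, as np.median does.
    (let [v (vec (sort vect)) mid (quot (count v) 2)]
      (if (odd? (count v))
        (v mid)
        (/ (+ (v (dec mid)) (v mid)) 2.0))))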

(def median-temp
    (->> weather-description 
         (map #(assoc % :month (-> % 
                                   :Date/Time 
                                   ((partial t/local-date-time "yyyy-MM-dd HH:mm:ss"))
                                   (t/adjust :first-day-of-month)
                                   (t/local-date)
                                   (t/format))))
         (group-by :month)
         (map (fn [[k v]] {:month k 
                           :median (->> v
                                      (map (keyword "Temp (C)"))
                                      median)}))))
    
(oz/view!
  {:data {:values median-temp}
  :mark "bar"
  :encoding {:x {:field :month
                 :type "ordinal"}
             :y {:field :median
                 :type "quantitative"}}
  :width 800})

Unsurprisingly, July and August are the warmest.
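The monthly grouping can also be sketched with plain `java.time` interop, without the `java-time` wrapper. This is only an illustration on made-up readings; `month-of`, `sample-median`, `monthly-median`, and `sample-readings` are hypothetical names, and the temperatures are invented.

```clojure
(import '(java.time LocalDateTime YearMonth)
        '(java.time.format DateTimeFormatter))

;; Pattern matching the "2012-01-01 00:00:00" timestamps in the CSV.
(def ts-format (DateTimeFormatter/ofPattern "yyyy-MM-dd HH:mm:ss"))

(defn month-of
  "Parses a timestamp string and returns its year-month, e.g. \"2012-01\"."
  [s]
  (str (YearMonth/from (LocalDateTime/parse s ts-format))))

(defn sample-median
  "Median of a non-empty collection; averages the middle pair for even counts."
  [xs]
  (let [v   (vec (sort xs))
        mid (quot (count v) 2)]
    (if (odd? (count v))
      (v mid)
      (/ (+ (v (dec mid)) (v mid)) 2.0))))

(defn monthly-median
  "Groups rows by month and returns one {:month :median} map per month, sorted."
  [rows]
  (->> rows
       (group-by (comp month-of :ts))
       (map (fn [[m rs]] {:month m :median (sample-median (map :temp rs))}))
       (sort-by :month)))

;; Invented readings, for illustration only.
(def sample-readings
  [{:ts "2012-01-01 00:00:00" :temp -1.8}
   {:ts "2012-01-01 01:00:00" :temp -1.5}
   {:ts "2012-01-02 00:00:00" :temp -3.0}
   {:ts "2012-02-01 00:00:00" :temp 2.0}
   {:ts "2012-02-01 01:00:00" :temp 4.0}])

(monthly-median sample-readings)
;; => ({:month "2012-01", :median -1.8} {:month "2012-02", :median 3.0})
```

Sorting by `:month` at the end matters because `group-by` returns a map, whose iteration order is not guaranteed to be chronological.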

So we can think of snowiness as being a bunch of 1s and 0s instead of `True`s and `False`s:

In [152]:
;; is_snowing.astype(float)[:10]

;; Defined here so the cell runs on its own.
(def bool->int #(if % 1 0))

(->> weather-description
     (map #(update % :snow bool->int))
     (take 5)
     show-table)

Unnamed: 0,Stn Press (kPa),Time,Weather,snow,Temp (C),Rel Hum (%),h),Visibility (km),Dew Point Temp (C)
0,101.24,2012-01-01 00:00:00,Fog,0,-1.8,86,4,8.0,-3.9
1,101.24,2012-01-01 01:00:00,Fog,0,-1.8,87,4,8.0,-3.7
2,101.26,2012-01-01 02:00:00,Freezing,0,-1.8,89,7,4.0,-3.4
3,101.27,2012-01-01 03:00:00,Freezing,0,-1.5,88,6,4.0,-3.2
4,101.23,2012-01-01 04:00:00,Fog,0,-1.5,88,7,4.8,-3.3
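The reason for converting to 1s and 0s is that their mean equals the fraction of rows where the flag is true. A small sketch of that equivalence, using a hypothetical helper `fraction-true` and made-up flags:

```clojure
(defn fraction-true
  "Fraction of truthy values in flags, as a double; nil for an empty collection."
  [flags]
  (when (seq flags)
    (double (/ (count (filter identity flags)) (count flags)))))

;; Made-up flags: snowing in 2 of 4 hours.
(def sample-flags [true false false true])

;; Counting truthy values directly...
(def direct (fraction-true sample-flags))

;; ...gives the same number as averaging the 1/0 encoding.
(def via-ints
  (let [ints (map #(if % 1 0) sample-flags)]
    (double (/ (reduce + ints) (count ints)))))
```

Both `direct` and `via-ints` evaluate to `0.5`.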


Then we group by month and take the mean of those 1s and 0s to find the fraction of time it was snowing each month (the pandas version uses `resample` for this):

In [160]:
;; is_snowing.astype(float).resample('M').apply(np.mean)

(def bool->int #(if % 1 0))

(defn mean [vs] 
    (float (/ (reduce + vs) (count vs))))

(->> weather-description
     (map #(update % :snow bool->int))
     (map #(assoc % :month (-> % 
                               :Date/Time 
                               ((partial t/local-date-time "yyyy-MM-dd HH:mm:ss"))
                               (t/adjust :first-day-of-month)
                               (t/local-date)
                               (t/format))))
     (group-by :month)
     (map (fn [[k v]] {:month k 
                       :mean (->> v
                                  (map :snow)
                                  mean)})))

({:month "2012-12-01", :mean 0.19623657} {:month "2012-01-01", :mean 0.22715054} {:month "2012-09-01", :mean 0.0} {:month "2012-06-01", :mean 0.0} {:month "2012-07-01", :mean 0.0} {:month "2012-02-01", :mean 0.15948276} {:month "2012-03-01", :mean 0.086021505} {:month "2012-10-01", :mean 0.0} {:month "2012-08-01", :mean 0.0} {:month "2012-05-01", :mean 0.0} {:month "2012-11-01", :mean 0.0375} {:month "2012-04-01", :mean 0.0069444445})

In [156]:
;; is_snowing.astype(float).resample('M').apply(np.mean).plot(kind='bar')

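;; *1 is the REPL's most recent result, so run this cell right after the
;; monthly snowiness cell above.
(oz/view!
  {:data {:values *1}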
  :mark "bar"
  :encoding {:x {:field :month
                 :type "ordinal"}
             :y {:field :mean
                 :type "quantitative"}}
  :width 800})

In [169]:
(map (fn [[k v]] (prn k) (prn v)) {:a 1 :b 2})
;; (doseq [[k v] db] (prn k v))

(:a 1
:b 2
nil nil)

So now we know! In 2012, December was the snowiest month. Also, this graph suggests something that I feel -- it starts snowing pretty abruptly in November, and then tapers off slowly and takes a long time to stop, with the last snow usually being in April or May.

# 6.3 Plotting temperature and snowiness stats together

We can also combine these two statistics (temperature, and snowiness) into one dataframe and plot them together:

In [157]:
;; temperature = weather_2012['Temp (C)'].resample('M').apply(np.median)
;; is_snowing = weather_2012['Weather'].str.contains('Snow')
;; snowiness = is_snowing.astype(float).resample('M').apply(np.mean)

;; # Name the columns
;; temperature.name = "Temperature"
;; snowiness.name = "Snowiness"

(def stats 
    (->> weather-description
         (map #(update % :snow bool->int))
         (map #(assoc % :month (-> % 
                                   :Date/Time 
                                   ((partial t/local-date-time "yyyy-MM-dd HH:mm:ss"))
                                   (t/adjust :first-day-of-month)
                                   (t/local-date)
                                   (t/format))))
         (group-by :month)
         (map (fn [[k v]] {:month k 
                           :snowiness (->> v
                                          (map :snow)
                                          mean
                                          float)
                           :temperature (->> v
                                             (map (keyword "Temp (C)"))
                                             median)}))))

#'user/stats
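The cell above computes both statistics in one pass over each month's rows. If the two statistics were already computed separately, as in the pandas version, they could instead be joined on `:month`. The sketch below assumes two hypothetical per-month sequences with invented values; `index-by` and `join-on` are illustrative helpers, not part of this notebook.

```clojure
(defn index-by
  "Builds a map from (k row) to row."
  [k rows]
  (into {} (map (juxt k identity)) rows))

(defn join-on
  "Merges rows from several sequences that share the same value for key k."
  [k & row-seqs]
  (->> row-seqs
       (map #(index-by k %))
       (apply merge-with merge)
       vals
       (sort-by k)))

;; Invented values, for illustration only.
(def temperature-by-month
  [{:month "2012-07-01" :temperature 22.0}
   {:month "2012-01-01" :temperature -7.0}])

(def snowiness-by-month
  [{:month "2012-01-01" :snowiness 0.25}
   {:month "2012-07-01" :snowiness 0.0}])

(join-on :month temperature-by-month snowiness-by-month)
;; => ({:month "2012-01-01", :temperature -7.0, :snowiness 0.25}
;;     {:month "2012-07-01", :temperature 22.0, :snowiness 0.0})
```

A month present in only one input is kept with just the keys it has, so this behaves like an outer join.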

In [158]:
;; stats.plot(kind='bar', subplots=True, figsize=(15, 10))

(oz/view! 
     (for [field [:snowiness :temperature]]
        [:vega-lite  
             {:data {:values stats}
              :mark "bar"
              :encoding {:x {:field :month
                             :type "ordinal"}
                         :y {:field field
                             :type "quantitative"}}
              :width 800}]))