In [1]:
(require '[clojupyter.javascript.alpha :as cjp-js])
(require '[clojupyter.display :as display])
(require '[clojupyter.misc.helper :as helper])
(require '[clojure.data.json :as json])
(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(require '[clojure.data.csv :as csv])
(helper/add-dependencies '[metasoarous/oz "1.5.6"])
(require '[oz.notebook.clojupyter :as oz])
(require '[clojure.java.io :as io])
(require '[clojure.pprint :as pp])
(helper/add-dependencies '[clojure.java-time "0.3.2"])
(require '[java-time :as t])
(require '[clojure.edn :as edn])
(helper/add-dependencies '[panthera "0.1-alpha.13"])
(require '[libpython-clj.python :as py])
(require '[panthera.panthera :as pt])

nil

In [6]:
;; use panthera html display
(defn show
  [obj]
  (display/html
    (py/call-attr obj "to_html")))

(defn show-table
  [m]
  (-> m
      pt/data-frame
      show))

(show-table [{:a 1 :b 2} {:a 3 :b 4}])

|   | a | b |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |


We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from [NYC Open Data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).

In [7]:
;; Python: complaints = pd.read_csv('../data/311-service-requests.csv')

;; read data in
(def raw-data
    (with-open [reader (io/reader "../data/311-service-requests.csv")]
      (doall
        (csv/read-csv reader))))

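;; clojure.string is not required in the first cell, so load it explicitly
(require '[clojure.string :as str])

;; Turn empty or whitespace-only CSV cells into nil
(defn blank->nil [s]
  (when-not (str/blank? s) s))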

(defn csv-data->maps [csv-data]
  (map zipmap
       (->> (first csv-data) ;; First row is the header
            (map keyword) ;; Drop if you want string keys instead
            repeat)
       (->> (rest csv-data)
            (map #(map blank->nil %))))) ;; Drop if you want blank strings to stay

(def fixed-df (csv-data->maps raw-data))

#'user/fixed-df
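
To make the shape of the result concrete, here is a minimal sketch that runs the same conversion on a tiny in-memory table instead of the 311 file. The sample rows are made up for illustration, written in the vector-of-vectors shape that `csv/read-csv` returns, and the two helpers are repeated so the snippet runs on its own.

```clojure
(require '[clojure.string :as str])

;; Same helpers as in the cell above, repeated so this snippet is self-contained
(defn blank->nil [s]
  (when-not (str/blank? s) s))

(defn csv-data->maps [csv-data]
  (map zipmap
       (->> (first csv-data) (map keyword) repeat)
       (->> (rest csv-data) (map #(map blank->nil %)))))

;; Hypothetical sample data: a header row followed by two data rows
(def sample-rows
  [["Complaint Type" "Borough"]
   ["Rodent" "MANHATTAN"]
   ["Illegal Parking" ""]])

(def sample-maps (csv-data->maps sample-rows))

;; Each row becomes a map keyed by the header names
(println (first sample-maps))
;; The blank Borough cell in the second row has become nil
(println (:Borough (second sample-maps)))
```

Because the header `Complaint Type` contains a space, its keyword cannot be written as a literal, which is why the rest of this chapter builds it with `(keyword "Complaint Type")`.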

In [8]:
(->> fixed-df
     (take 10)
     (map #(into {} (take 5 %)))
     show-table)

|   | Road Ramp | Resolution Action Updated Date | Bridge Highway Name | Park Facility Name | School Number |
|---|---|---|---|---|---|
| 0 |  | 10/31/2013 02:35:17 AM |  | Unspecified | Unspecified |
| 1 |  |  |  | Unspecified | Unspecified |
| 2 |  | 10/31/2013 02:39:42 AM |  | Unspecified | Unspecified |
| 3 |  | 10/31/2013 02:21:10 AM |  | Unspecified | Unspecified |
| 4 |  | 10/31/2013 01:59:54 AM |  | Unspecified | Unspecified |
| 5 |  |  |  | Unspecified | Unspecified |
| 6 |  | 10/31/2013 01:59:51 AM |  | Unspecified | Unspecified |
| 7 |  | 10/31/2013 01:58:49 AM |  | Unspecified | Unspecified |
| 8 |  | 10/31/2013 02:00:56 AM |  | Unspecified | Unspecified |
| 9 |  | 10/31/2013 01:48:26 AM |  | Unspecified | Unspecified |


# 2.2 Selecting columns and rows

To select a column, we map the column's keyword over the rows. The column names contain spaces, so we build each keyword with `keyword` instead of writing a literal:

In [9]:
(->> fixed-df
     (take 10)
     (map (keyword "Complaint Type"))
     (pp/pprint))

("Noise - Street/Sidewalk"
 "Illegal Parking"
 "Noise - Commercial"
 "Noise - Vehicle"
 "Rodent"
 "Noise - Commercial"
 "Blocked Driveway"
 "Noise - Commercial"
 "Noise - Commercial"
 "Noise - Commercial")


nil

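It doesn't matter which order we do it in. We can take the first 10 rows and then select the column, or select the column and then take the first 10 values: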

In [10]:
(->> fixed-df
     (map (keyword "Complaint Type"))
     (take 10)
     (pp/pprint))

("Noise - Street/Sidewalk"
 "Illegal Parking"
 "Noise - Commercial"
 "Noise - Vehicle"
 "Rodent"
 "Noise - Commercial"
 "Blocked Driveway"
 "Noise - Commercial"
 "Noise - Commercial"
 "Noise - Commercial")


nil
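
A minimal sketch of why the order does not change the result, using a few made-up rows with a hypothetical `:complaint` key. Since `map` is lazy, putting it before `take` does not force the whole dataset to be processed first (Clojure may still realize a few more elements than requested because of chunking).

```clojure
;; Hypothetical rows for illustration
(def rows
  [{:complaint "Rodent"}
   {:complaint "Noise - Vehicle"}
   {:complaint "Illegal Parking"}])

;; Take the first two rows, then pull out the column
(def take-then-map (->> rows (take 2) (map :complaint)))

;; Pull out the column, then take the first two values
(def map-then-take (->> rows (map :complaint) (take 2)))

(println take-then-map)                   ;; prints (Rodent Noise - Vehicle)
(println (= take-then-map map-then-take)) ;; prints true

;; Laziness also means this works on an infinite sequence
(def first-three (->> (range) (map inc) (take 3))) ;; (1 2 3)
```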

# 2.3 Selecting multiple columns


What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want. In Clojure, `juxt` does a similar job by pulling several keys out of each row at once.

In [11]:
(->> fixed-df
    (take 10)
    (map (juxt (keyword "Complaint Type") (keyword "Borough")))
    pp/pprint)

(["Noise - Street/Sidewalk" "QUEENS"]
 ["Illegal Parking" "QUEENS"]
 ["Noise - Commercial" "MANHATTAN"]
 ["Noise - Vehicle" "MANHATTAN"]
 ["Rodent" "MANHATTAN"]
 ["Noise - Commercial" "QUEENS"]
 ["Blocked Driveway" "QUEENS"]
 ["Noise - Commercial" "QUEENS"]
 ["Noise - Commercial" "MANHATTAN"]
 ["Noise - Commercial" "BROOKLYN"])


nil

In [12]:
;; or print as a table
(->> fixed-df
     (take 10)
     (map #(select-keys % [(keyword "Complaint Type") (keyword "Borough")]))
     show-table)

|   | Complaint Type | Borough |
|---|---|---|
| 0 | Noise - Street/Sidewalk | QUEENS |
| 1 | Illegal Parking | QUEENS |
| 2 | Noise - Commercial | MANHATTAN |
| 3 | Noise - Vehicle | MANHATTAN |
| 4 | Rodent | MANHATTAN |
| 5 | Noise - Commercial | QUEENS |
| 6 | Blocked Driveway | QUEENS |
| 7 | Noise - Commercial | QUEENS |
| 8 | Noise - Commercial | MANHATTAN |
| 9 | Noise - Commercial | BROOKLYN |
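
The two cells above select the same columns in different shapes. As a small sketch on a single made-up row (the keys and the `:agency` value are hypothetical), `juxt` returns the values as a vector in the order requested, while `select-keys` returns a smaller map that keeps the column names, which is the form `show-table` needs to label its columns.

```clojure
;; Hypothetical row for illustration
(def row {:complaint "Rodent" :borough "MANHATTAN" :agency "DOHMH"})

;; juxt builds a function that returns the values as a vector
(def as-vector ((juxt :complaint :borough) row))
;; => ["Rodent" "MANHATTAN"]

;; select-keys keeps the keys, returning a smaller map
(def as-map (select-keys row [:complaint :borough]))
;; => {:complaint "Rodent", :borough "MANHATTAN"}

(println as-vector)
(println as-map)
```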


# 2.4 What's the most common complaint type?


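This is a really easy question to answer! Clojure's built-in `frequencies` function counts how often each value occurs, much like `value_counts` in pandas. Sorting the counts in descending order then gives us the most common complaint types: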

In [13]:
;; Python: 
;; complaints['Complaint Type'].value_counts()

(->> fixed-df
     (map (keyword "Complaint Type"))
     frequencies
     (sort-by val >)
     (take 10)
     pp/pprint)

(["HEATING" 14200]
 ["GENERAL CONSTRUCTION" 7471]
 ["Street Light Condition" 7117]
 ["DOF Literature Request" 5797]
 ["PLUMBING" 5373]
 ["PAINT - PLASTER" 5149]
 ["Blocked Driveway" 4590]
 ["NONCONST" 3998]
 ["Street Condition" 3473]
 ["Illegal Parking" 3343])


nil
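
To show what each step of that pipeline returns, here is a minimal sketch on a short made-up list of complaint types. `frequencies` returns a map from value to count, and `sort-by val >` turns it into a sequence of `[value count]` pairs ordered from most to least common.

```clojure
;; Hypothetical complaint types for illustration
(def complaints
  ["Rodent" "Noise" "Rodent" "Heating" "Rodent" "Noise"])

;; Count occurrences of each distinct value
(def counts (frequencies complaints))
;; => {"Rodent" 3, "Noise" 2, "Heating" 1}

;; Sort the map entries by count, largest first
(def ranked (sort-by val > counts))
;; => (["Rodent" 3] ["Noise" 2] ["Heating" 1])

;; The most common entry is simply the first one
(def most-common (first ranked))
;; => ["Rodent" 3]

(println ranked)
```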

But it gets better! We can plot them!

In [14]:
(defn bar-graph [vs]
 {:data {:values (map (fn [[k v]] {:Index k :Value v}) vs)}
  :mark "bar"
  :encoding {:x {:field :Index
                 :type "nominal"
                 :sort false}
             :y {:field :Value
                 :type "quantitative"}}
  :width 800})

(->> fixed-df
     (map (keyword "Complaint Type"))
     frequencies
     (sort-by val >)
     (take 10)
     bar-graph
     oz/view!)