In [134]:
(require '[clojupyter.javascript.alpha :as cjp-js])
(require '[clojupyter.display :as display])
(require '[clojupyter.misc.helper :as helper])
(require '[clojure.data.json :as json])
(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(require '[clojure.data.csv :as csv])
(helper/add-dependencies '[metasoarous/oz "1.5.6"])
(require '[oz.notebook.clojupyter :as oz])
(require '[clojure.java.io :as io])
(require '[clojure.pprint :as pp])
(helper/add-dependencies '[clojure.java-time "0.3.2"])
(require '[java-time :as t])
(require '[clojure.edn :as edn])

nil

We're going to use a new dataset here to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from [NYC Open Data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9).

In [136]:
;; Python: complaints = pd.read_csv('../data/311-service-requests.csv')

;; read data in
(def raw-data
    (with-open [reader (io/reader "../data/311-service-requests.csv")]
      (doall
        (csv/read-csv reader))))

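(require 'clojure.string) ;; make sure clojure.string is loaded before we use it

;; Turn empty or whitespace-only cells into nil
(defn blank->nil [s]
  (when-not (clojure.string/blank? s) s))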

(defn csv-data->maps [csv-data]
  (map zipmap
       (->> (first csv-data) ;; First row is the header
            (map keyword) ;; Drop if you want string keys instead
            repeat)
       (->> (rest csv-data)
            (map #(map blank->nil %))))) ;; Drop if you want blank strings to stay

(def fixed-df (csv-data->maps raw-data))

#'user/fixed-df
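
To see what `csv-data->maps` produces, here is a minimal sketch on a tiny hand-written table instead of the real CSV file. The sample rows and the names `sample-csv` and `sample-rows` are made up for illustration, and the two helpers are repeated from the cell above only so the snippet runs on its own.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for what csv/read-csv returns: a header row
;; followed by data rows, every cell a string.
(def sample-csv
  [["Complaint Type" "Borough"]
   ["Rodent" "MANHATTAN"]
   ["Illegal Parking" ""]])

(defn blank->nil [s]
  (when-not (str/blank? s) s))

(defn csv-data->maps [csv-data]
  (map zipmap
       (repeat (map keyword (first csv-data)))   ; header row becomes the keys
       (map #(map blank->nil %) (rest csv-data)))) ; blank cells become nil

(def sample-rows (csv-data->maps sample-csv))

;; Each row is now a map keyed by the header names.
(first sample-rows)

;; The blank Borough cell is kept as a key whose value is nil.
(second sample-rows)
```

So every row ends up with the same set of keys, and a missing value shows up as `nil` rather than as an empty string.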

In [141]:
(->> fixed-df
     (take 10)
     (map #(into {} (take 5 %)))
     pp/print-table)


| :Road Ramp | :Resolution Action Updated Date | :Bridge Highway Name | :Park Facility Name | :School Number |
|------------+---------------------------------+----------------------+---------------------+----------------|
|            |          10/31/2013 02:35:17 AM |                      |         Unspecified |    Unspecified |
|            |                                 |                      |         Unspecified |    Unspecified |
|            |          10/31/2013 02:39:42 AM |                      |         Unspecified |    Unspecified |
|            |          10/31/2013 02:21:10 AM |                      |         Unspecified |    Unspecified |
|            |          10/31/2013 01:59:54 AM |                      |         Unspecified |    Unspecified |
|            |                                 |                      |         Unspecified |    Unspecified |
|            |          10/31/2013 01:59:51 AM |                      |         Unspecified |    Unspecified |


nil

# 2.2 Selecting columns and rows

To select a column, we map the column's keyword over the rows. The column name contains a space, so we build the keyword with `(keyword "Complaint Type")`:

In [145]:
(->> fixed-df
     (take 10)
     (map (keyword "Complaint Type"))
     (pp/pprint))

("Noise - Street/Sidewalk"
 "Illegal Parking"
 "Noise - Commercial"
 "Noise - Vehicle"
 "Rodent"
 "Noise - Commercial"
 "Blocked Driveway"
 "Noise - Commercial"
 "Noise - Commercial"
 "Noise - Commercial")


nil
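
Why `(keyword "Complaint Type")` instead of a literal keyword? A name with a space can't be typed as a single keyword literal, so we have to construct it. Here is a small sketch on one hand-written row; `row` and `complaint-type` are hypothetical names used only for this example.

```clojure
;; Hypothetical single row, shaped like the maps in fixed-df.
(def row {(keyword "Complaint Type") "Rodent"
          :Borough "MANHATTAN"})

;; A keyword without spaces can be written literally and called as a function.
(:Borough row)

;; A name with a space has to be built with `keyword`; doing it once and
;; reusing the result keeps the pipelines shorter.
(def complaint-type (keyword "Complaint Type"))

;; Keywords are functions of maps, so these two lookups are equivalent.
(complaint-type row)
(get row complaint-type)
```

If the spaces get annoying, one option is to clean up the header names before calling `keyword` when the data is loaded.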

The order of `take` and `map` doesn't matter: both are lazy, so either way we get the same ten values without walking the whole dataset:

In [148]:
(->> fixed-df
     (map (keyword "Complaint Type"))
     (take 10)
     (pp/pprint))

("Noise - Street/Sidewalk"
 "Illegal Parking"
 "Noise - Commercial"
 "Noise - Vehicle"
 "Rodent"
 "Noise - Commercial"
 "Blocked Driveway"
 "Noise - Commercial"
 "Noise - Commercial"
 "Noise - Commercial")


nil
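
Here is a quick sketch of why the order doesn't change the result, using made-up rows rather than the real dataset (`rows`, `take-then-map`, and `map-then-take` are hypothetical names):

```clojure
;; Hypothetical rows, shaped like the maps in fixed-df.
(def rows
  (map (fn [i] {:id i :label (str "row-" i)}) (range 1000)))

(def take-then-map (->> rows (take 3) (map :label)))
(def map-then-take (->> rows (map :label) (take 3)))

;; Both pipelines yield the same three labels.
(= take-then-map map-then-take)

;; Because map is lazy, the map-first order even works on an infinite sequence.
(def first-three (->> (range) (map inc) (take 3)))
```

With an eager `map` the last expression would never finish, which is a good reminder that laziness is what makes the map-first version cheap.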

# 2.3 Selecting multiple columns


What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want. In Clojure, `juxt` does the same job by pulling several keys out of each row at once:

In [169]:
(->> fixed-df
    (take 10)
    (map (juxt (keyword "Complaint Type") (keyword "Borough")))
    pp/pprint)

(["Noise - Street/Sidewalk" "QUEENS"]
 ["Illegal Parking" "QUEENS"]
 ["Noise - Commercial" "MANHATTAN"]
 ["Noise - Vehicle" "MANHATTAN"]
 ["Rodent" "MANHATTAN"]
 ["Noise - Commercial" "QUEENS"]
 ["Blocked Driveway" "QUEENS"]
 ["Noise - Commercial" "QUEENS"]
 ["Noise - Commercial" "MANHATTAN"]
 ["Noise - Commercial" "BROOKLYN"])


nil

In [168]:
;; or keep each row as a map (with its column names) using select-keys
(->> fixed-df
     (take 10)
     (map #(select-keys % [(keyword "Complaint Type") (keyword "Borough")]))
     pp/pprint)

({:Complaint Type "Noise - Street/Sidewalk", :Borough "QUEENS"}
 {:Complaint Type "Illegal Parking", :Borough "QUEENS"}
 {:Complaint Type "Noise - Commercial", :Borough "MANHATTAN"}
 {:Complaint Type "Noise - Vehicle", :Borough "MANHATTAN"}
 {:Complaint Type "Rodent", :Borough "MANHATTAN"}
 {:Complaint Type "Noise - Commercial", :Borough "QUEENS"}
 {:Complaint Type "Blocked Driveway", :Borough "QUEENS"}
 {:Complaint Type "Noise - Commercial", :Borough "QUEENS"}
 {:Complaint Type "Noise - Commercial", :Borough "MANHATTAN"}
 {:Complaint Type "Noise - Commercial", :Borough "BROOKLYN"})


nil
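
The two approaches differ in a small way that is worth knowing about. The sketch below uses hand-written rows with plain keys (all names here are hypothetical), one of which lacks a key entirely; that can't happen in `fixed-df`, where every row has every column, but it shows the difference clearly.

```clojure
(require '[clojure.pprint :as pp])

;; Hypothetical rows; the second one has no :borough key at all.
(def rows
  [{:type "Rodent" :borough "MANHATTAN" :status "Open"}
   {:type "Illegal Parking" :status "Closed"}])

;; juxt returns positional vectors and fills missing keys with nil.
(def as-vectors (map (juxt :type :borough) rows))

;; select-keys returns maps and simply leaves missing keys out.
(def as-maps (map #(select-keys % [:type :borough]) rows))

;; Maps can go straight to print-table; passing the keys fixes the column order.
(pp/print-table [:type :borough] as-maps)
```

So `juxt` is handy when you want compact tuples in a fixed order, while `select-keys` keeps the column names attached to the values.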

# 2.4 What's the most common complaint type?


This is a really easy question to answer! The built-in `frequencies` function counts how often each value occurs, and `sort-by` puts the largest counts first:

In [166]:
;; Python: 
;; complaints['Complaint Type'].value_counts()

(->> fixed-df
     (map (keyword "Complaint Type"))
     frequencies
     (sort-by val >)
     (take 10)
     pp/pprint)

(["HEATING" 14200]
 ["GENERAL CONSTRUCTION" 7471]
 ["Street Light Condition" 7117]
 ["DOF Literature Request" 5797]
 ["PLUMBING" 5373]
 ["PAINT - PLASTER" 5149]
 ["Blocked Driveway" 4590]
 ["NONCONST" 3998]
 ["Street Condition" 3473]
 ["Illegal Parking" 3343])


nil
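
To make the two steps concrete, here is a small sketch on a hand-written list of complaint types (the data and the names `complaint-types`, `counts`, and `ranked` are made up for illustration). Unlike pandas' `value_counts()`, `frequencies` returns an unordered map, which is why the explicit sort is needed.

```clojure
;; Hypothetical sample of complaint types.
(def complaint-types
  ["Rodent" "HEATING" "Rodent" "HEATING" "HEATING" "Noise"])

;; frequencies builds a map from each distinct value to its count.
(def counts (frequencies complaint-types))

;; Sorting the map entries by value, descending, ranks the complaint types.
(def ranked (sort-by val > counts))

;; The most common complaint type and its count.
(first ranked)
```

The sample deliberately has no ties; when two values have the same count, their relative order depends on how the map happens to iterate.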

But it gets better! We can plot them!

In [167]:
;; Build a Vega-Lite bar chart spec from a seq of {:Index ... :Value ...} maps
(defn bar-graph [vs]
  {:data {:values vs}
   :mark "bar"
   :encoding {:x {:field :Index
                  :type "nominal"
                  :sort false}
              :y {:field :Value
                  :type "quantitative"}}})

(->> fixed-df
     (map (keyword "Complaint Type"))
     frequencies
     (sort-by val >)
     (take 10)
     (map (fn [[k v]] {:Index k :Value v}))
     bar-graph
     oz/view!)