# Uploading a CO2e dataset into Wikidata 

This executable notebook uploads a CO2e dataset into [Wikidata](https://www.wikipedia.org/wiki/Wikidata).

### The dataset that is to be uploaded

The [Scottish Environment Protection Agency](https://www.sepa.org.uk) (SEPA) has published CO2e data for household waste in Scotland to indicate the [global warming potential](https://en.wikipedia.org/wiki/Global_warming_potential) of this waste. The data shows the tonnes of CO2e per Scottish council area per year (only 2017 and 2018, at present).

The data is found in two Excel spreadsheets, 
[2017-household-waste-tables](https://www.sepa.org.uk/media/378875/2017-household-waste-summary-tables-final.xlsx) and
[2018-household-waste-tables](https://www.sepa.org.uk/media/469611/2018-household-waste-data-tables.xlsx),
in column "J" of worksheet "Table 1" in each. 
For convenience, I copied this data into the [sepa-CO2e.csv](sepa-CO2e.csv) file.


### Prep tooling

In [1]:
; Add code libraries

(require '[clojupyter.misc.helper :as helper])

(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(helper/add-dependencies '[clj-http/clj-http "3.10.1"])
(helper/add-dependencies '[metasoarous/oz "1.6.0-alpha24"])

(require '[clojure.string :as str]
         '[clojure.set :as set]
         '[clojure.data :as data]
         '[clojure.pprint :as pp]
         '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clj-http.client :as http]
         '[oz.notebook.clojupyter :as oz]
         '[oz.core :as ozcore])

(import 'java.net.URLEncoder)

java.net.URLEncoder

In [2]:
; Define convenience functions

; Convert the CSV structure to a list-of-maps structure.
(defn to-maps [csv-data]
    (map zipmap (->> (first csv-data)
                    (map keyword)
                    repeat)
                (rest csv-data)))

; Map the name of a SPARQL service to its URL.
(def service-urls {:wikidata "https://query.wikidata.org/sparql"})
                                
; Ask the service to execute the given SPARQL query
; and return its result as a list-of-maps.
(defn exec-query [service-name sparql]
  (->> (http/post (service-name service-urls) 
        {:body (str "query=" (URLEncoder/encode sparql "UTF-8")) 
         :headers {"Accept" "text/csv" 
                   "Content-Type" "application/x-www-form-urlencoded"} 
         :debug false})
    :body
    csv/read-csv
    to-maps))

#'user/exec-query
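
To make the list-of-maps shape concrete, here is a minimal sketch of what `to-maps` produces. It repeats the definition so that the snippet runs on its own. The header names follow the keys used later in this notebook, the column order is an assumption, and the two data rows are copied from the sample output shown further down.

```clojure
;; Same definition as in the cell above, repeated so this snippet is self-contained.
(defn to-maps [csv-data]
  (map zipmap
       (->> (first csv-data) (map keyword) repeat)
       (rest csv-data)))

;; clojure.data.csv/read-csv yields a sequence of vectors of strings:
;; the header row first, then one vector per data row.
;; The column order here is assumed for illustration.
(def example-rows
  [["council" "year" "TCO2e"]
   ["East Lothian" "2018" "110685.64"]
   ["Clackmannanshire" "2017" "55348.62"]])

(def example-maps (to-maps example-rows))

(first example-maps)
;; => {:council "East Lothian", :year "2018", :TCO2e "110685.64"}
```

Note that every value stays a string, including the year and the amount. The later filtering and comparison steps rely on that.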

### Load SEPA's CO2e dataset into memory

Load the dataset from the [sepa-CO2e.csv](sepa-CO2e.csv) file that I created for convenience.

In [3]:
; Load the dataset from sepa-CO2e.csv
(def sepa
    (with-open [reader (io/reader "sepa-CO2e.csv")]
        (doall
            (to-maps (csv/read-csv reader)))))

(println (count sepa) "rows")

; Adjust a few keys and values to be in keeping with those used by Wikidata
(def sepa
    (->> sepa
        (map #(set/rename-keys % {:TCO2e :amount}))
        (map #(assoc % :unit "tonne of carbon dioxide equivalent"))
        (map #(let [council (:council %)]
                  (cond 
                    (= council "Na h-Eileanan Siar") (assoc % :council "Outer Hebrides")
                    (str/starts-with? council "Orkney Islands") (assoc % :council "Orkney Islands")
                    :else %)))))

; Print a sample
(pp/print-table [:council :year :amount :unit] (repeatedly 5 #(rand-nth sepa)))

64 rows

|          :council | :year |   :amount |                              :unit |
|-------------------+-------+-----------+------------------------------------|
|      East Lothian |  2018 | 110685.64 | tonne of carbon dioxide equivalent |
|  Clackmannanshire |  2017 |  55348.62 | tonne of carbon dioxide equivalent |
|    North Ayrshire |  2017 | 137512.34 | tonne of carbon dioxide equivalent |
| City of Edinburgh |  2017 | 507553.12 | tonne of carbon dioxide equivalent |
|           Falkirk |  2018 | 154954.43 | tonne of carbon dioxide equivalent |


nil
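
The three adjustments above can also be read as one function applied to each row. The following is only a sketch: `normalise-row` is a hypothetical helper name, the raw spelling "Orkney Islands Council" is an assumed example of a value that starts with "Orkney Islands", and the first two amounts are placeholders rather than SEPA figures.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

;; Hypothetical helper that bundles the key rename, the unit, and the council renames.
(defn normalise-row [row]
  (let [row     (-> row
                    (set/rename-keys {:TCO2e :amount})
                    (assoc :unit "tonne of carbon dioxide equivalent"))
        council (:council row)]
    (cond
      (= council "Na h-Eileanan Siar")            (assoc row :council "Outer Hebrides")
      (str/starts-with? council "Orkney Islands") (assoc row :council "Orkney Islands")
      :else                                       row)))

;; Illustrative input; only the Falkirk row is taken from the sample output above.
(def raw-rows
  [{:council "Na h-Eileanan Siar"     :year "2017" :TCO2e "1.0"}
   {:council "Orkney Islands Council" :year "2017" :TCO2e "2.0"}
   {:council "Falkirk"                :year "2018" :TCO2e "154954.43"}])

(def normalised (map normalise-row raw-rows))

(map :council normalised)
;; => ("Outer Hebrides" "Orkney Islands" "Falkirk")
```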

### Load Wikidata's relevant CO2e data into memory

Load the data by running a SPARQL query against Wikidata.  

In [4]:
; Define the SPARQL query that will fetch the relevant Scottish council area data
(def sparql "
SELECT ?qid ?councilAreaLabel ?year ?amount ?unitLabel
WHERE {
  ?councilArea wdt:P31 wd:Q15060255 . # Scottish council area
  BIND(strafter(str(?councilArea), 'http://www.wikidata.org/entity/') as ?qid)
  OPTIONAL { 
    ?councilArea p:P5991 ?CO2e . 
    ?CO2e psv:P5991 ?quantity ;
          pq:P585 ?date ;
          pq:P828 wd:Q180388 ; # 'has cause' 'waste management'
          pq:P828 wd:Q259059 . # 'has cause' 'household'
    ?quantity wikibase:quantityAmount ?amount;
              wikibase:quantityUnit ?unit.
    BIND(YEAR(?date) as ?year)
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

; Execute the SPARQL query against Wikidata
(def wikidata0
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:councilAreaLabel :council
                                  :unitLabel :unit}))))

(println (count wikidata0) "rows")

64 rows


nil

In [5]:
; wikidata0 might contain rows whose year is not 2017 or 2018.
; This is deliberate: we want the label and qid of every Scottish council area,
;   even one with no 2017 or 2018 CO2e value,
;   because we need all the labels and qids later, when we build the label->qid map.
; However, the next step compares only the relevant (i.e. 2017 & 2018) data items,
;   so for that purpose we filter the non-relevant data items out of wikidata0.
(def wikidata 
    (->> wikidata0
        (filter #(contains? #{"2017" "2018"} (:year %)))
        (map #(select-keys % [:council :year :amount :unit]))))

(println (count wikidata) "rows")

64 rows


nil
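
The reason for keeping both `wikidata0` and `wikidata` is easier to see on a tiny example. In the SPARQL CSV results format, an unbound variable comes back as an empty field, so a council area without a matching CO2e statement appears as a row with an empty-string year. The sketch below uses made-up rows: the qids are placeholders, and the Fife row stands in for a council area with no CO2e statement.

```clojure
;; Illustrative rows in the shape of wikidata0 (placeholder qids).
(def example-wikidata0
  [{:qid "Q-placeholder-1" :council "Falkirk" :year "2018"
    :amount "154954.43" :unit "tonne of carbon dioxide equivalent"}
   {:qid "Q-placeholder-2" :council "Fife" :year "" :amount "" :unit ""}])

;; Keep only the rows for the years being compared, and only the compared keys.
(def example-wikidata
  (->> example-wikidata0
       (filter #(contains? #{"2017" "2018"} (:year %)))
       (map #(select-keys % [:council :year :amount :unit]))))

;; The unfiltered rows still supply every label -> qid pair.
(def example-council->qid
  (into {} (map (juxt :council :qid)) example-wikidata0))

(count example-wikidata)          ;; => 1
(example-council->qid "Fife")     ;; => "Q-placeholder-2"
```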

### Compare Wikidata's CO2e data against SEPA's 

In [6]:
; Check for differences

(def diff (data/diff 
            (set sepa) 
            (set wikidata)))

(println (count (first diff)) "in SEPA only")
(println (count (second diff)) "in Wikidata only")
(println (count (nth diff 2)) "in both")

(with-open [wtr (io/writer "CO2e-diff.txt")]
  (binding [*out* wtr]
    (do
      (println "CO2e-values: SEPA versus Wikidata")
      (println)
      (println (count (first diff)) "in SEPA only...")
      (pp/pprint (first diff))
      (println)
      (println (count (second diff)) "in Wikidata only...")
      (pp/pprint (second diff))
      (println)
      (println (count (nth diff 2)) "in both...")
      (pp/pprint (nth diff 2)))))

0 in SEPA only
0 in Wikidata only
64 in both


nil
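
For readers unfamiliar with `clojure.data/diff`, here is a small sketch of how it behaves on two sets of maps. It returns a three-element vector: the items only in the first set, the items only in the second, and the items in both, with `nil` in place of an empty set. Whole maps are compared by value, and because the amounts are CSV strings the comparison is textual. The Fife amounts below are placeholders chosen to show that; only the Falkirk amount comes from the sample output above.

```clojure
(require '[clojure.data :as data])

(def sepa-example
  #{{:council "Falkirk" :year "2018" :amount "154954.43"}
    {:council "Fife"    :year "2018" :amount "1.0"}})

(def wikidata-example
  #{{:council "Falkirk" :year "2018" :amount "154954.43"}
    {:council "Fife"    :year "2018" :amount "1.00"}})  ; same number, different text

(def example-diff (data/diff sepa-example wikidata-example))

(let [[only-sepa only-wikidata both] example-diff]
  (println (count only-sepa) "in SEPA only")
  (println (count only-wikidata) "in Wikidata only")
  (println (count both) "in both"))

;; Identical sets give nil for both "only" slots, and (count nil) is 0.
(def identical-diff (data/diff sepa-example sepa-example))
```

In this sketch, "1.0" and "1.00" count as a difference, so one row lands in each of the "only" sets. If the two sources formatted amounts differently, parsing the amounts to numbers before comparing would be one way to avoid such false differences.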

#### When I first compared...

When I ran the above comparison for the first time (at 2020-09-24T15_30Z) it discovered the following differences...

```
64 in SEPA only
0 in Wikidata only
0 in both
```
The details are in [this file](CO2e-diff-2020-09-24T15_30Z.txt).

To fix those differences, 
I created/amended the appropriate Wikidata data items 
via the [QuickStatements](https://quickstatements.toolforge.org/) service. 
QuickStatements accepts CSV input that represents the edits to be applied to Wikidata. 
The QuickStatements CSV input was generated as follows.

In [7]:
; Generate the CSV that is to be imported into QuickStatements

(def TQD "\"\"\"") ; triple double-quote 

(def year->url {"2017" "https://www.sepa.org.uk/media/378875/2017-household-waste-summary-tables-final.xlsx"
                "2018" "https://www.sepa.org.uk/media/469611/2018-household-waste-data-tables.xlsx"})

(def council->qid
    (->> wikidata0
        (map #(vector (:council %) (:qid %)))
        (into {})))

(with-open [wtr (io/writer "CO2e-quickstatements.csv")]
  (binding [*out* wtr]
    ;; qid, CO2e, point in time (qualifier), determination method (qualifier), has cause (qualifier), has cause (qualifier), reference URL, editing comment
    (println "qid,P5991,qal585,qal459,qal828,qal828,S854,#") 
    (doseq [m (first diff)]
      (println (str (council->qid (:council m)) ","
                    (:amount m) "U57084755," ; tonne of carbon dioxide equivalent
                    "+" (:year m) "-00-00T00:00:00Z/9,"
                    "Q791801," ; estimation process
                    "Q180388," ; waste management
                    "Q259059," ; household
                    TQD (year->url (:year m)) TQD ","
                    TQD "Set " (:council m) " council area's " (:year m) " CO2e amount" TQD)))))

nil
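
To show what a single generated row looks like, the sketch below builds one row as a string instead of writing a file. `quickstatements-row` is a hypothetical helper name, and the qid is a placeholder standing in for the council area's real one. The council, year and amount are taken from the sample output earlier in this notebook.

```clojure
(def TQD "\"\"\"") ; triple double-quote, as in the cell above

;; Hypothetical helper: builds one CSV row in the column order
;; qid,P5991,qal585,qal459,qal828,qal828,S854,#
(defn quickstatements-row [qid url {:keys [council year amount]}]
  (str qid ","
       amount "U57084755,"             ; amount + unit (tonne of carbon dioxide equivalent)
       "+" year "-00-00T00:00:00Z/9,"  ; point in time; precision 9 means year
       "Q791801,"                      ; estimation process
       "Q180388,"                      ; waste management
       "Q259059,"                      ; household
       TQD url TQD ","
       TQD "Set " council " council area's " year " CO2e amount" TQD))

(def example-row
  (quickstatements-row
    "Q-placeholder" ; not a real qid
    "https://www.sepa.org.uk/media/469611/2018-household-waste-data-tables.xlsx"
    {:council "East Lothian" :year "2018" :amount "110685.64"}))

(println example-row)
```

Each such row carries six statement-level values (the CO2e quantity, four qualifiers and one reference), which is why a row corresponds to several individual edits in QuickStatements.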


When I first ran the CSV generation cell (the output is in [this file](CO2-quickstatements-2020-09-25T08_50Z.csv)), it specified 378 _individual_ edits to Wikidata (after I removed 6 of them that I had already manually edited into Wikidata). QuickStatements executed them successfully against Wikidata, taking about 20 minutes.

### Using the uploaded dataset

As an example use, let's display a grouped bar chart to visualise...

>_for household waste, the tonnes of CO2e generated per citizen per year per Scottish council area_.

To do so...
* we run a SPARQL query against Wikidata 
* that uses the CO2e dataset that we've just uploaded
* and uses the population dataset that we [previously](dataset-into-wikidata.ipynb) uploaded 
* to perform the required calculation 
* then, finally, we instruct Vega-Lite to render the chart. 
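
The calculation in the steps above amounts to a join on council area and point in time, followed by a division. As a rough in-memory sketch of the same idea (not the query itself): the population figures below are made-up placeholders, the names are hypothetical, and the year is used as a stand-in for the full point-in-time value that the SPARQL query matches on.

```clojure
;; Illustrative inputs. The Falkirk CO2e amount comes from the sample shown earlier;
;; the population figures are placeholders, not real statistics.
(def co2e-rows
  [{:council "Falkirk" :year "2018" :amount "154954.43"}])

(def population-rows
  [{:council "Falkirk" :year "2018" :population "100000"}
   {:council "Falkirk" :year "2016" :population "90000"}])

;; Index the population rows by [council year] so that the join is a map lookup.
(def population-by-key
  (into {} (map (juxt (juxt :council :year) :population)) population-rows))

(def co2e-per-citizen
  (for [{:keys [council year amount]} co2e-rows
        :let  [population (population-by-key [council year])]
        :when population] ; rows without a same-year population are dropped
    {:council        council
     :year           year
     :CO2ePerCitizen (/ (Double/parseDouble amount)
                        (Long/parseLong population))}))
```

As in the query, a CO2e value with no population value for the same point in time produces no result row.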

In [8]:
(def sparql "
SELECT ?qid ?councilAreaLabel ?year ?CO2ePerCitizen
WHERE {
  ?councilArea wdt:P31 wd:Q15060255 . # Scottish council area
  BIND(strafter(str(?councilArea), 'http://www.wikidata.org/entity/') as ?qid)
  ?councilArea p:P5991 ?CO2e ;
               p:P1082 ?population. 
  ?CO2e psv:P5991 ?CO2eQuantity ;
        pq:P585 ?date ;
        pq:P828 wd:Q180388 ; # 'has cause' 'waste management'
        pq:P828 wd:Q259059 . # 'has cause' 'household'
  ?population psv:P1082 ?populationQuantity ;
              pq:P585 ?date .
  ?CO2eQuantity wikibase:quantityAmount ?CO2eAmount.
  ?populationQuantity wikibase:quantityAmount ?populationAmount .
  BIND(YEAR(?date) as ?year)
  BIND((xsd:decimal(?CO2eAmount)/xsd:integer(?populationAmount)) AS ?CO2ePerCitizen)
  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

(def wikidata
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:councilAreaLabel :council}))))

(def chart-spec
  {:data {:values wikidata}
   :mark {:type "bar" 
          :tooltip {:content "data"}}
   :width 20
   :encoding {:column {:field "council"
                       :type "nominal"
                       :spacing 0
                       :sort {:field "CO2ePerCitizen" :op "mean"}
                       :header {:labelAngle -90
                               :labelAlign "right"
                               :labelOrient "bottom"
                               :labelBaseline "middle"}
                       :axis {:title nil}}
              :x {:field "year"
                  :type "ordinal"
                  :axis {:title nil
                         :labels false
                         :ticks false}}
              :y {:field "CO2ePerCitizen"
                  :type "quantitative"
                  :axis {:grid true
                         :labelAngle -45
                         :title "CO2e tonnes per citizen"}
                  :scale {:zero false}}
              :color {:field "year"}}})

(oz/view! chart-spec)

