# Uploading a CO2e dataset into Wikidata 

This executable notebook uploads a CO2e dataset into [Wikidata](https://www.wikipedia.org/wiki/Wikidata).

### The dataset that is to be uploaded

The [Scottish Environment Protection Agency](https://www.sepa.org.uk) (SEPA) has published CO2e data for household waste in Scotland to indicate the [global warming potential](https://en.wikipedia.org/wiki/Global_warming_potential) of this waste. The data shows the tonnes of CO2e per Scottish council area per year (only 2017 and 2018, at present).

The data is found in two Excel spreadsheets, 
[2017-household-waste-tables](https://www.sepa.org.uk/media/378875/2017-household-waste-summary-tables-final.xlsx) and
[2018-household-waste-tables](https://www.sepa.org.uk/media/469611/2018-household-waste-data-tables.xlsx),
in column "J" of worksheet "Table 1" in each. 
For convenience, I copied this data into the [sepa-CO2e.csv](sepa-CO2e.csv) file.


### Prep tooling

In [3]:
; Add code libraries

(require '[clojupyter.misc.helper :as helper])

(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(helper/add-dependencies '[clj-http/clj-http "3.10.1"])

(require '[clojure.string :as str]
         '[clojure.set :as set]
         '[clojure.data :as data]
         '[clojure.pprint :as pp]
         '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clj-http.client :as http])

(import 'java.net.URLEncoder)

java.net.URLEncoder

In [4]:
; Define convenience functions

; Convert the CSV structure to a list-of-maps structure.
(defn to-maps [csv-data]
    (map zipmap (->> (first csv-data)
                    (map keyword)
                    repeat)
                (rest csv-data)))

; Map the name of a SPARQL service to its URL.
(def service-urls {:scotgov "https://statistics.gov.scot/sparql" ; used by the population queries later on
                   :wikidata "https://query.wikidata.org/sparql"})
                                
; Ask the service to execute the given SPARQL query
; and return its result as a list-of-maps.
(defn exec-query [service-name sparql]
  (->> (http/post (service-name service-urls) 
        {:body (str "query=" (URLEncoder/encode sparql "UTF-8")) 
         :headers {"Accept" "text/csv" 
                   "Content-Type" "application/x-www-form-urlencoded"} 
         :debug false})
    :body
    csv/read-csv
    to-maps))

#'user/exec-query
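
To make the list-of-maps shape concrete, here is a minimal sketch that applies the same header-to-keyword conversion to a few in-memory rows. The rows are hypothetical stand-ins for what `csv/read-csv` returns (a sequence of string vectors), and `to-maps` is repeated only so that the sketch runs on its own.

```clojure
;; Same shape as the to-maps helper above, repeated so this sketch is self-contained.
(defn to-maps [csv-data]
  (map zipmap
       (repeat (map keyword (first csv-data)))
       (rest csv-data)))

;; Hypothetical input: a header row followed by data rows, all strings.
(def sample-rows
  [["council" "year" "TCO2e"]
   ["Moray" "2017" "97535.97"]
   ["Dundee City" "2018" "148298.68"]])

(def sample-maps (to-maps sample-rows))

(first sample-maps)
;; => {:council "Moray", :year "2017", :TCO2e "97535.97"}
```

Note that every value stays a string, including the year and the CO2e figure, so any later comparison or filter has to work with strings (or parse them first).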

### Load SEPA's CO2e dataset into memory

Load the dataset from the [sepa-CO2e.csv](sepa-CO2e.csv) file that I created for convenience.

In [36]:
; Load the dataset from sepa-CO2e.csv
(def sepa
    (with-open [reader (io/reader "sepa-CO2e.csv")]
        (doall
            (to-maps (csv/read-csv reader)))))

(println (count sepa) "rows")

; Adjust a couple of council area labels to be in keeping with those used by Wikidata
(def sepa
    (map #(let [council (:council %)]
              (cond 
                (= council "Na h-Eileanan Siar") (assoc % :council "Outer Hebrides")
                (str/starts-with? council "Orkney Islands") (assoc % :council "Orkney Islands")
                :else %))
        sepa))

; Print a sample
(pp/print-table [:council :year :TCO2e] (repeatedly 5 #(rand-nth sepa)))

64 rows

|            :council | :year |    :TCO2e |
|---------------------+-------+-----------|
|         Dundee City |  2018 | 148298.68 |
|   City of Edinburgh |  2018 | 492831.56 |
| East Dunbartonshire |  2018 | 111395.06 |
|               Moray |  2017 |  97535.97 |
| West Dunbartonshire |  2018 | 102739.24 |


nil

### Load Wikidata's relevant CO2e data into memory

Load the data by running a SPARQL query against Wikidata.  

In [38]:
; Define the SPARQL query that will fetch the relevant Scottish council area data
(def sparql "
SELECT ?qid ?councilAreaLabel ?year ?co2eQuantity 
WHERE {
  ?councilArea wdt:P31 wd:Q15060255 . # Scottish council area
  BIND(strafter(str(?councilArea), 'http://www.wikidata.org/entity/') as ?qid)
  OPTIONAL { 
    ?councilArea p:P5991 ?co2e . 
    ?co2e ps:P5991 ?co2eQuantity ;
          pq:P585 ?co2eDate ;
          pq:P1269 wd:Q180388 ; # 'facet of' 'waste management'
          pq:P828 wd:Q259059 . # 'has cause' 'household'
    BIND(YEAR(?co2eDate) as ?year)
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

; Execute the SPARQL query against Wikidata
(def wikidata0
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:councilAreaLabel :label}))))

(println (count wikidata0) "rows")

32 rows


nil

In [41]:
; wikidata0 might contain rows where the year is not 2017 or 2018.
; This is deliberate: we want the label and qid of every Scottish council area,
;   even those with no 2017 or 2018 CO2e values,
;   because we need all the labels and qids later, when we build the label->qid map.
; However, the next step is to compare the relevant (i.e. 2017 & 2018) data items,
;   so we filter wikidata0 to remove the irrelevant ones.
; The query results arrive as CSV, so the year values are strings.
(def wikidata 
    (->> wikidata0
        (filter #(contains? #{"2017" "2018"} (:year %)))))

(println (count wikidata) "rows")

0 rows


nil

### Compare Wikidata's CO2e data against SEPA's 

In [43]:
; Check for differences

(def diff (data/diff 
            (set sepa) 
            (set wikidata)))

(println (count (first diff)) "in SEPA only")
(println (count (second diff)) "in Wikidata only")
(println (count (nth diff 2)) "in both")

(with-open [wtr (io/writer "CO2e-values-diff.txt")]
  (binding [*out* wtr]
    (do
      (println "CO2e-values: SEPA versus Wikidata")
      (println)
      (println (count (first diff)) "in SEPA only...")
      (pp/pprint (first diff))
      (println)
      (println (count (second diff)) "in Wikidata only...")
      (pp/pprint (second diff))
      (println)
      (println (count (nth diff 2)) "in both...")
      (pp/pprint (nth diff 2)))))

64 in SEPA only
0 in Wikidata only
0 in both


nil
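
When given two sets, `data/diff` compares whole elements, so a SEPA row and a Wikidata row count as "in both" only if the two maps are identical: same keys, same values, same types. In this notebook the SEPA rows use `:council` and `:TCO2e`, whereas the Wikidata rows use `:qid`, `:label` and `:co2eQuantity`, so some alignment is probably needed before matches can appear. The following is a minimal sketch of one way to do that; the sample rows and the qid are hypothetical, and it assumes that the quantity strings are formatted identically on both sides.

```clojure
(require '[clojure.data :as data]
         '[clojure.set :as set])

;; Hypothetical SEPA-side rows.
(def sepa-sample
  #{{:council "Moray" :year "2017" :TCO2e "97535.97"}
    {:council "Dundee City" :year "2018" :TCO2e "148298.68"}})

;; Hypothetical Wikidata-side row (the qid is a placeholder, not a real identifier).
(def wikidata-sample
  #{{:qid "Q0" :label "Moray" :year "2017" :co2eQuantity "97535.97"}})

;; Rename the Wikidata keys to the SEPA ones and drop the keys that SEPA lacks.
(def wikidata-aligned
  (set (map #(-> %
                 (set/rename-keys {:label :council, :co2eQuantity :TCO2e})
                 (select-keys [:council :year :TCO2e]))
            wikidata-sample)))

;; The result is a vector: [only-in-first only-in-second in-both].
;; An empty part is returned as nil, which count treats as 0.
(def sample-diff (data/diff sepa-sample wikidata-aligned))
```

Here the first part holds the Dundee City row, the second part is `nil`, and the third part holds the Moray row.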

#### When I first compared...

When I ran the above comparison for the first time (at 2020-09-24T16:26GMT), it found the following differences...

```
64 in SEPA only
0 in Wikidata only
0 in both
```
The details are in [this file](CO2e-values-diff-2020-09-24T16_26.txt).

Then I manually edited Wikidata to fix a few of those differences. However, this was laborious, so I decided to introduce some automation by using QuickStatements. QuickStatements accepts CSV input that represents the edits to be applied to Wikidata. The QuickStatements CSV input was generated as follows.

In [35]:
;TODO for use later on


; Build a label->qid map
(def label->qid
    (->> wikidata0
        (map #(vector (:label %) (:qid %)))
        distinct
        (into {})))

(println (count label->qid) "entries")

; Print a sample
(pp/print-table [:label :qid] 
                (map #(hash-map :label (first %) :qid (second %)) (repeatedly 5 #(rand-nth (into [] label->qid)))))

32 entries

|            :label |    :qid |
|-------------------+---------|
|      West Lothian | Q204940 |
|      Renfrewshire | Q211091 |
|    Orkney Islands | Q100166 |
| South Lanarkshire | Q209142 |
|    Outer Hebrides |  Q80967 |


nil
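
The step that turns SEPA-only rows into QuickStatements CSV lines is not shown here, so the following is only a sketch of how it might look. It assumes the same column layout as the population upload later in this notebook, with the property and qualifier ids taken from the SPARQL query above (`P5991`, `P585`, `P1269` = `Q180388`, `P828` = `Q259059`). The names `co2e-row`, `co2e-header`, `source-urls` and `sample-label->qid` are hypothetical, the sample CO2e value is made up, using the SEPA spreadsheets as the `S854` source is my assumption, and no unit is set on the quantity.

```clojure
(require '[clojure.string :as str])

(def TQD "\"\"\"") ; triple double-quote

;; Assumption: cite the SEPA spreadsheet for the relevant year as the source (S854).
(def source-urls
  {"2017" "https://www.sepa.org.uk/media/378875/2017-household-waste-summary-tables-final.xlsx"
   "2018" "https://www.sepa.org.uk/media/469611/2018-household-waste-data-tables.xlsx"})

;; qid, CO2e value, point in time, facet of, has cause, source URL, editing comment
(def co2e-header "qid,P5991,qal585,qal1269,qal828,S854,#")

(defn co2e-row
  "Builds one QuickStatements CSV line from a SEPA row (a map of strings)."
  [label->qid {:keys [council year TCO2e]}]
  (str/join ","
            [(label->qid council)
             TCO2e
             (str "+" year "-00-00T00:00:00Z/9") ; /9 means year precision
             "Q180388"                           ; waste management
             "Q259059"                           ; household
             (str TQD (source-urls year) TQD)
             (str TQD "Set " council " council area's " year " household waste CO2e" TQD)]))

;; Hypothetical stand-in for the label->qid map built above.
(def sample-label->qid {"West Lothian" "Q204940"})

;; The TCO2e value here is invented purely for illustration.
(def sample-line
  (co2e-row sample-label->qid
            {:council "West Lothian" :year "2018" :TCO2e "1000.00"}))
```

In the notebook itself, the rows would presumably come from `(first diff)` and the lookup from `label->qid`.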

#### 2.3.1. When I first compared...

When I ran the above comparison for the first time (at 2020-09-07T21:20GMT) it discovered the following differences...

```
2 in :scotgov only...
#{{:code "S12000050", :label "North Lanarkshire"}
  {:code "S12000013", :label "Na h-Eileanan Siar"}}

2 in :wikidata only...
#{{:code "S12000044", :label "North Lanarkshire"}
  {:code "S12000013", :label "Outer Hebrides"}}

30 in both
```
At that time, `wikidata`'s code value for North Lanarkshire was incorrect so I amended it directly via its web page:
* North Lanarkshire [Q207111](https://www.wikidata.org/wiki/Q207111): `S12000044` -> `S12000050`

I am using `code` values to identify council areas so it is important that they are correct.

For my purpose, the `label` values are not significant so I didn't ponder if the Outer Hebrides' "English" `label` should be changed to be its Scottish Gaelic name. So the above live `diff` is probably still indicating this one difference.

## 3. Population

Now consider the _population_ aspect of the dataset: how do `scotgov` (statistics.gov.scot) and `wikidata` (Wikidata) compare?

### 3.1. Population values from statistics.gov.scot

In [8]:
; Query statistics.gov.scot for population values

(def sparql "

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX pdmx: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX sdmx: <http://statistics.gov.scot/def/dimension/>
PREFIX snum: <http://statistics.gov.scot/def/measure-properties/>
PREFIX uent: <http://statistics.data.gov.uk/def/statistical-entity#>
PREFIX ugeo: <http://statistics.data.gov.uk/def/statistical-geography#>

SELECT 
  (strafter(str(?areaUri), 'http://statistics.gov.scot/id/statistical-geography/') as ?code) 
  ?label
  ?year
  ?population

WHERE {
  ?areaUri uent:code <http://statistics.gov.scot/id/statistical-entity/S12> ;
           ugeo:status 'Live' ;
           rdfs:label ?label .
           
  ?populationUri qb:dataSet <http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries> ;
                 pdmx:refArea ?areaUri ;
                 pdmx:refPeriod ?periodUri ;
                 sdmx:age <http://statistics.gov.scot/def/concept/age/all> ;
                 sdmx:sex <http://statistics.gov.scot/def/concept/sex/all> ;
                 snum:count ?population .
  
  ?periodUri rdfs:label ?year .
}")

(def population-values-scotgov 
    (->> sparql
        (exec-query :scotgov)))

(println (count population-values-scotgov ) "rows")

608 rows


nil

In [9]:
; Print a sample

(def ks [:code :label :year :population])
(pp/print-table ks (repeatedly 5 #(rand-nth population-values-scotgov)))


|     :code |            :label | :year | :population |
|-----------+-------------------+-------+-------------|
| S12000041 |             Angus |  2009 |      114830 |
| S12000026 |  Scottish Borders |  2014 |      114040 |
| S12000036 | City of Edinburgh |  2013 |      487460 |
| S12000018 |        Inverclyde |  2001 |       84150 |
| S12000036 | City of Edinburgh |  2019 |      524930 |


nil

### 3.2. Population values from Wikidata

In [10]:
; Query Wikidata for population values

(def sparql "

SELECT DISTINCT
  ?code
  ?areaEntityLabel
  (YEAR(?populationWhen) as ?year )
  ?population 

WHERE {
  ?areaEntity wdt:P31 wd:Q15060255 ; # Scottish council area
              wdt:P836 ?code ; # nine-character UK Government Statistical Service code
              p:P1082 ?populationEntity .
  ?populationEntity ps:P1082 ?population ;
                    pq:P585 ?populationWhen .

  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

(def population-values-wikidata
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:areaEntityLabel :label}))
        (map #(assoc % :population (str/replace (:population %) #"\.0$" "")))))

(println (count population-values-wikidata) "rows")

608 rows


nil

[Try out the above SPARQL query](https://w.wiki/boc).
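
Wikidata returns quantities such as `116290.0`, whereas statistics.gov.scot returns `116290`, so the values have to be normalised before the two sets can be compared. Below is a minimal sketch of that normalisation, with the dot escaped so that only a literal `.0` suffix is removed. The function name `strip-trailing-zero-decimal` is hypothetical.

```clojure
(require '[clojure.string :as str])

(defn strip-trailing-zero-decimal
  "Drops a literal trailing .0 from a numeric string; other strings pass through unchanged."
  [s]
  (str/replace s #"\.0$" ""))

(strip-trailing-zero-decimal "116290.0") ; => "116290"
(strip-trailing-zero-decimal "116200")   ; => "116200"

;; With an unescaped dot, "." matches any character, so real digits can be lost:
(str/replace "116200" #".0$" "")         ; => "1162"
```
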

In [11]:
; Print a sample

(def ks [:code :label :year :population])
(pp/print-table ks (repeatedly 5 #(rand-nth population-values-wikidata)))


|     :code |            :label | :year | :population |
|-----------+-------------------+-------+-------------|
| S12000041 |             Angus |  2013 |      116290 |
| S12000011 | East Renfrewshire |  2001 |       89410 |
| S12000018 |        Inverclyde |  2002 |       83730 |
| S12000028 |    South Ayrshire |  2015 |      112400 |
| S12000005 |  Clackmannanshire |  2019 |       51540 |


nil

### 3.3. Compare Wikidata's population values against those of statistics.gov.scot

In [12]:
; Check for differences

; Make the label sets in both the same so that the diff 
; doesn't pick up on the 'Outer Hebrides' vs 'Na h-Eileanan Siar' label difference
(def population-values-scotgov
  (map #(if (= "S12000013" (:code %)) 
              (assoc % :label "Outer Hebrides") 
              %)
       population-values-scotgov))

(def diff (data/diff 
            (set population-values-scotgov) 
            (set population-values-wikidata)))

(println (count (first diff)) "in :scotgov only")
(println (count (second diff)) "in :wikidata only")
(println (count (nth diff 2)) "in both")

(with-open [wtr (io/writer "population-values-diff.txt")]
  (binding [*out* wtr]
    (do
      (println "population-values: scotgov versus wikidata")
      (println)
      (println (count (first diff)) "in :scotgov only...")
      (pp/pprint (first diff))
      (println)
      (println (count (second diff)) "in :wikidata only...")
      (pp/pprint (second diff))
      (println)
      (println (count (nth diff 2)) "in both...")
      (pp/pprint (nth diff 2)))))

0 in :scotgov only
0 in :wikidata only
608 in both


nil

#### 3.3.1. When I first compared...

When I ran the above comparison for the first time (at 2020-09-08T10:01GMT) it discovered the following differences...
```
574 in :scotgov only
9 in :wikidata only
34 in both
```
The details are in [this file](population-values-diff-2020-09-08T10_01GMT.txt). 

Then I manually edited Wikidata to fix a few of those differences. However, this was laborious, so I decided to introduce some automation by using [QuickStatements](https://quickstatements.toolforge.org/). QuickStatements accepts CSV input that represents the edits to be applied to Wikidata. The QuickStatements CSV input was generated as follows.

32 rows


#'user/code->qid

In [14]:
; Generate the CSV that is to be imported into QuickStatements

(def TQD "\"\"\"") ; triple double-quote 

(with-open [wtr (io/writer "population-values-quickstatements.csv")]
  (binding [*out* wtr]
    ;; qid, population, point in time (qualifier), determination method (qualifier), editing comment
    (println "qid,P1082,qal585,qal459,S854,#") 
    (doseq [m (first diff)]
      (println (str (code->qid (:code m)) ","
                    (:population m) ","
                    "+" (:year m) "-00-00T00:00:00Z/9,"
                    "Q791801," ; estimation process
                    TQD "http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries" TQD ","
                    TQD "Set " (:label m) " council area's " (:year m) " population" TQD)))))

nil
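
The triple double-quote (`TQD`) deserves a word of explanation. As far as I understand it, QuickStatements expects a string value to be wrapped in double quotes, and CSV in turn requires a field that contains double quotes to be wrapped in quotes with the inner ones doubled. Putting the two together yields three quote characters at each end. The sketch below derives the same result from those two rules; the helper names `csv-field` and `quickstatements-string` are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn csv-field
  "Standard CSV escaping: wrap s in double quotes and double any embedded ones."
  [s]
  (str "\"" (str/replace s "\"" "\"\"") "\""))

(defn quickstatements-string
  "A double-quoted string value, escaped so that it forms a single CSV field."
  [text]
  (csv-field (str "\"" text "\"")))

(quickstatements-string "Set Angus council area's 2019 population")
;; => """Set Angus council area's 2019 population"""
```
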


When I ran the above CSV generation for the first time (the output is in [this file](population-values-quickstatements-2020-09-09T11_20GMT.csv)), it specified 2232 _individual_ edits to Wikidata. QuickStatements executed these successfully against Wikidata, taking about 30 minutes. (Unfortunately, QuickStatements does not yet support a means to set the `rank` of a triple, so I had to edit each of the 32 council area pages individually to mark its 2019 population value as the `Preferred rank` population value, indicating that it is the most up-to-date population value.)

## 4. The usefulness of the uploaded dataset

The uploaded dataset can be pulled (_de-referenced_) into Wikipedia articles and other web pages. 

### 4.1. Embedding dataset values into Wikipedia articles

As an example, I edited the Wikipedia article [Council areas of Scotland](https://simple.wikipedia.org/wiki/Council_areas_of_Scotland) to insert into its main table, the new column "_Number of people (latest estimate)_" whose values are pulled (each time the page is rendered) directly from the data that we have uploaded into Wikidata:

![Screenshot of Wikipedia article](screenshot-wikipedia-council-areas-article.png)

### 4.2. Embedding dataset based graphs into web pages

And >><a href="https://query.wikidata.org/embed.html#%23defaultView%3ALineChart%0ASELECT%20%0A%20%20%3FcouncilArea%0A%20%20(str(YEAR(%3FpopulationWhen))%20as%20%3Fyear%20)%0A%20%20%3Fpopulation%0A%20%20%3FcouncilAreaLabel%0AWHERE%20%7B%0A%20%20%3FcouncilArea%20wdt%3AP31%20wd%3AQ15060255%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20p%3AP1082%20%3FpopulationEntity%20.%0A%20%20%3FpopulationEntity%20ps%3AP1082%20%3Fpopulation%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20pq%3AP585%20%3FpopulationWhen%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%27%5BAUTO_LANGUAGE%5D%2Cen%27%20.%20%7D%0A%7D">here</a><< is a line graph, dynamically generated from the new Wikidata data, that can be embedded in any web page.

### 4.3. Concerns, next steps, alternative approaches

Interestingly, there is [some discussion](https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2) about the pros & cons of inserting Wikidata values into Wikipedia articles. The main argument against is the immaturity of Wikidata's structure, and hence a concern about the durability of references into its data structure. The counterpoint is that early use & evolution might be the best path to maturity.

The case study for our Data Commons Scotland project is _open data about waste in Scotland_. So a next step for the project might be to upload into Wikidata datasets that describe the amounts of household waste generated & recycled, and 'carbon impact' figures. These could also be linked to [council areas](https://www.wikidata.org/wiki/Q15060255), as we have done for the population dataset, to support _per council area_/_per citizen_ statistics and visualisations. Appropriate [properties](https://www.wikidata.org/wiki/Q18616576) do not yet exist in Wikidata for the description of such data about waste, so new ones would need to be ratified by the Wikidata community.

Should such datasets actually be uploaded into Wikidata? These are small datasets and they seem to fit well enough into Wikidata's knowledge graph. Uploading them into Wikidata may make them easier to access, de-silo the data
and help enrich Wikidata's knowledge graph. But then, of course, there is the _keeping it up-to-date_ issue to solve. 
Alternatively, those datasets could be pulled dynamically & directly from statistics.gov.scot into Wikipedia articles, with the help of some new MediaWiki [extensions](https://www.mediawiki.org/wiki/Category:Extensions).




