# An exploration of uploading a dataset into Wikidata 

This workbook explores the steps involved in, and the usefulness of, uploading a dataset into [Wikidata](https://www.wikipedia.org/wiki/Wikidata).

Wikidata maintains [linked data](https://www.wikipedia.org/wiki/Linked_data), so uploading a dataset into it is a little more difficult than uploading into a non-linked repository like [CKAN](https://en.wikipedia.org/wiki/CKAN). 

## 1. The dataset that is to be uploaded

The dataset that is to be uploaded is a slice of [statistics.gov.scot](http://statistics.gov.scot)'s [Population Estimates](http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries) data cube: _population per council area per year_.

### 1.1. Prep tooling

In [129]:
; Add code libraries

(require '[clojupyter.misc.helper :as helper])

(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(helper/add-dependencies '[clj-http/clj-http "3.10.1"])

(require '[clojure.string :as str]
         '[clojure.set :as set]
         '[clojure.data :as data]
         '[clojure.pprint :as pp]
         '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clj-http.client :as http])

nil

In [130]:
; Define convenience functions

; Convert the CSV structure to a list-of-maps structure.
(defn to-maps [csv-data]
    (map zipmap (->> (first csv-data)
                    (map keyword)
                    repeat)
                (rest csv-data)))

; Map the name of a SPARQL service to its URL.
(def service-urls {:scotgov "http://statistics.gov.scot/sparql"
                   :wikidata "https://query.wikidata.org/sparql"})
                                
; Ask the service to execute the given SPARQL query
; and return its result as a list-of-maps.
(defn exec-query [service-name sparql]
  (->> (http/post (service-name service-urls) 
        {:body (str "query=" (java.net.URLEncoder/encode sparql "UTF-8")) 
         :headers {"Accept" "text/csv" 
                   "Content-Type" "application/x-www-form-urlencoded"} 
         :debug false})
    :body
    csv/read-csv
    to-maps))

#'user/exec-query
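
To make the list-of-maps shape concrete, here is a minimal sketch of what `to-maps` does with already-parsed CSV rows. The `sample-rows` and `sample-maps` names are illustrative only, and the rows are typed in by hand rather than fetched from a SPARQL service.

```clojure
;; Same definition as above, repeated so that this sketch is self-contained.
(defn to-maps [csv-data]
  (map zipmap
       (->> (first csv-data) (map keyword) repeat)
       (rest csv-data)))

;; Illustrative input: a header row followed by two data rows,
;; in the shape that csv/read-csv produces.
(def sample-rows [["code" "label"]
                  ["S12000030" "Stirling"]
                  ["S12000013" "Na h-Eileanan Siar"]])

;; Each data row becomes a map keyed by the keywordised header.
(def sample-maps (to-maps sample-rows))
;; => ({:code "S12000030", :label "Stirling"}
;;     {:code "S12000013", :label "Na h-Eileanan Siar"})
```

Because every row becomes a map with keyword keys, later cells can use `:code` and `:label` directly as accessor functions.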

## 2. Council areas

Considering the _council area_ aspect of the dataset, 
how do `scotgov` (statistics.gov.scot) and `wikidata` (Wikidata) compare?

`scotgov` identifies council areas by 9-character codes (e.g. `S12000030` identifies the Stirling council area).   
Happily, `wikidata` can also identify Scottish council areas using the same codes.

### 2.1. Council area codes data from statistics.gov.scot 

In [131]:
; Query statistics.gov.scot for council area codes

(def sparql "

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uent: <http://statistics.data.gov.uk/def/statistical-entity#>
PREFIX ugeo: <http://statistics.data.gov.uk/def/statistical-geography#>

SELECT 
  (strafter(str(?areaUri), 'http://statistics.gov.scot/id/statistical-geography/') as ?code) 
  ?label

WHERE {
  ?areaUri uent:code <http://statistics.gov.scot/id/statistical-entity/S12> ;
           ugeo:status 'Live' ;
           rdfs:label ?label .
}
")

(def area-codes-scotgov 
    (->> sparql
        (exec-query :scotgov)))

(println (count area-codes-scotgov ) "rows")

32 rows


nil

In [132]:
; Print a sample

(def ks [:code :label])
(pp/print-table ks (repeatedly 5 #(rand-nth area-codes-scotgov)))


|     :code |             :label |
|-----------+--------------------|
| S12000013 | Na h-Eileanan Siar |
| S12000030 |           Stirling |
| S12000029 |  South Lanarkshire |
| S12000021 |     North Ayrshire |
| S12000030 |           Stirling |


nil

### 2.2. Council area codes from Wikidata

In [133]:
; Query Wikidata for council area codes

(def sparql "

SELECT DISTINCT
  ?code
  ?areaEntityLabel

WHERE {
  ?areaEntity wdt:P31 wd:Q15060255 ; # Scottish council area
              wdt:P836 ?code . # nine-character UK Government Statistical Service code

  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

(def area-codes-wikidata
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:areaEntityLabel :label}))))

(println (count area-codes-wikidata) "rows")

[W 18:57:25.201 Clojupyter] org.apache.http.client.protocol.ResponseProcessCookies:130 -- Invalid cookie header: "Set-Cookie: WMF-Last-Access=10-Sep-2020;Path=/;HttpOnly;secure;Expires=Mon, 12 Oct 2020 12:00:00 GMT". Invalid 'expires' attribute: Mon, 12 Oct 2020 12:00:00 GMT
[W 18:57:25.203 Clojupyter] org.apache.http.client.protocol.ResponseProcessCookies:130 -- Invalid cookie header: "Set-Cookie: WMF-Last-Access-Global=10-Sep-2020;Path=/;Domain=.wikidata.org;HttpOnly;secure;Expires=Mon, 12 Oct 2020 12:00:00 GMT". Invalid 'expires' attribute: Mon, 12 Oct 2020 12:00:00 GMT
32 rows


nil

In [134]:
; Print a sample

(def ks [:code :label])
(pp/print-table ks (repeatedly 5 #(rand-nth area-codes-wikidata)))


|     :code |              :label |
|-----------+---------------------|
| S12000045 | East Dunbartonshire |
| S12000019 |          Midlothian |
| S12000049 |        Glasgow City |
| S12000027 |    Shetland Islands |
| S12000018 |          Inverclyde |


nil

### 2.3. Compare Wikidata's council area codes against statistics.gov.scot's

In [135]:
; Check for differences

(def diff (data/diff 
            (set area-codes-scotgov) 
            (set area-codes-wikidata)))

(println (count (first diff)) "in :scotgov only...")
(pp/pprint (first diff))
(println)
(println (count (second diff)) "in :wikidata only...")
(pp/pprint (second diff))
(println)
(println (count (nth diff 2)) "in both")

1 in :scotgov only...
#{{:code "S12000013", :label "Na h-Eileanan Siar"}}

1 in :wikidata only...
#{{:code "S12000013", :label "Outer Hebrides"}}

31 in both


nil
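
The comparison above relies on how `clojure.data/diff` treats two sets: it returns a three-element vector of things only in the first, things only in the second, and things in both. The following is a small sketch using two hand-typed sets (the names `a`, `b` and `sample-diff` are illustrative only).

```clojure
(require '[clojure.data :as data])

;; Two tiny illustrative sets that differ only in one label.
(def a #{{:code "S12000030" :label "Stirling"}
         {:code "S12000013" :label "Na h-Eileanan Siar"}})
(def b #{{:code "S12000030" :label "Stirling"}
         {:code "S12000013" :label "Outer Hebrides"}})

(def sample-diff (data/diff a b))

;; (first sample-diff)  => #{{:code "S12000013", :label "Na h-Eileanan Siar"}}
;; (second sample-diff) => #{{:code "S12000013", :label "Outer Hebrides"}}
;; (nth sample-diff 2)  => #{{:code "S12000030", :label "Stirling"}}

;; When the sets are equal, the "only in" slots are nil,
;; and (count nil) is 0, which is why the counts print cleanly.
(def same-diff (data/diff a a))
```

Note that a map is treated as a single set element here, so a row that differs in any one key shows up on both "only in" sides.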

#### 2.3.1. When I first compared...

When I ran the above comparison for the first time (at 2020-09-07T21:20GMT) it discovered the following differences...

```
2 in :scotgov only...
#{{:code "S12000050", :label "North Lanarkshire"}
  {:code "S12000013", :label "Na h-Eileanan Siar"}}

2 in :wikidata only...
#{{:code "S12000044", :label "North Lanarkshire"}
  {:code "S12000013", :label "Outer Hebrides"}}

30 in both
```
At that time, `wikidata`'s code value for North Lanarkshire was incorrect so I amended it directly via its web page:
* North Lanarkshire [Q207111](https://www.wikidata.org/wiki/Q207111): `S12000044` -> `S12000050`

I am using `code` values to identify council areas so it is important that they are correct.

For my purpose, the `label` values are not significant, so I did not consider whether the Outer Hebrides' English `label` should be changed to its Scottish Gaelic name. The live `diff` above therefore still reports this one difference.
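
Since only the `code` values matter for identification, one way to check them in isolation is to compare sets of codes and ignore labels altogether. This is a sketch under that assumption, not something the notebook itself runs; `codes`, `scotgov-sample` and `wikidata-sample` are hypothetical names, and the rows are hand-typed.

```clojure
(require '[clojure.set :as set])

;; Hand-typed samples: same codes, one differing label.
(def scotgov-sample [{:code "S12000030" :label "Stirling"}
                     {:code "S12000013" :label "Na h-Eileanan Siar"}])
(def wikidata-sample [{:code "S12000030" :label "Stirling"}
                      {:code "S12000013" :label "Outer Hebrides"}])

;; Reduce each list-of-maps to just its set of codes.
(defn codes [rows]
  (set (map :code rows)))

;; Codes present on one side but not the other.
(def missing-from-wikidata
  (set/difference (codes scotgov-sample) (codes wikidata-sample)))
(def missing-from-scotgov
  (set/difference (codes wikidata-sample) (codes scotgov-sample)))
;; Both are #{} here, because the label difference is ignored.
```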

## 3. Population

Now considering the _population_ aspect of the dataset, how do `scotgov` (statistics.gov.scot) and `wikidata` (Wikidata) compare?

### 3.1. Population values from statistics.gov.scot

In [146]:
; Query statistics.gov.scot for population values

(def sparql "

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX pdmx: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX sdmx: <http://statistics.gov.scot/def/dimension/>
PREFIX snum: <http://statistics.gov.scot/def/measure-properties/>
PREFIX uent: <http://statistics.data.gov.uk/def/statistical-entity#>
PREFIX ugeo: <http://statistics.data.gov.uk/def/statistical-geography#>

SELECT 
  (strafter(str(?areaUri), 'http://statistics.gov.scot/id/statistical-geography/') as ?code) 
  ?label
  ?year
  ?population

WHERE {
  ?areaUri uent:code <http://statistics.gov.scot/id/statistical-entity/S12> ;
           ugeo:status 'Live' ;
           rdfs:label ?label .
           
  ?populationUri qb:dataSet <http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries> ;
                 pdmx:refArea ?areaUri ;
                 pdmx:refPeriod ?periodUri ;
                 sdmx:age <http://statistics.gov.scot/def/concept/age/all> ;
                 sdmx:sex <http://statistics.gov.scot/def/concept/sex/all> ;
                 snum:count ?population .
  
  ?periodUri rdfs:label ?year .
}")

(def population-values-scotgov 
    (->> sparql
        (exec-query :scotgov)))

(println (count population-values-scotgov ) "rows")

608 rows


nil

In [147]:
; Print a sample

(def ks [:code :label :year :population])
(pp/print-table ks (repeatedly 5 #(rand-nth population-values-scotgov)))


|     :code |             :label | :year | :population |
|-----------+--------------------+-------+-------------|
| S12000035 |    Argyll and Bute |  2013 |       88050 |
| S12000040 |       West Lothian |  2016 |      180130 |
| S12000013 | Na h-Eileanan Siar |  2013 |       27400 |
| S12000026 |   Scottish Borders |  2009 |      113590 |
| S12000030 |           Stirling |  2004 |       86920 |


nil

### 3.2. Population values from Wikidata

In [151]:
; Query Wikidata for population values

(def sparql "

SELECT DISTINCT
  ?code
  ?areaEntityLabel
  (YEAR(?populationWhen) as ?year )
  ?population 

WHERE {
  ?areaEntity wdt:P31 wd:Q15060255 ; # Scottish council area
              wdt:P836 ?code ; # nine-character UK Government Statistical Service code
              p:P1082 ?populationEntity .
  ?populationEntity ps:P1082 ?population ;
                    pq:P585 ?populationWhen .

  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

(def population-values-wikidata
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:areaEntityLabel :label}))
        ; Strip a trailing ".0"; the dot is escaped so that it matches only a literal "."
        (map #(assoc % :population (str/replace (:population %) #"\.0$" "")))))

(println (count population-values-wikidata) "rows")

608 rows


nil
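
The post-processing step in the cell above normalises population strings so that they can be compared with those from statistics.gov.scot. The following sketch assumes, as the `str/replace` call suggests, that some values arrive with a trailing `.0`. It also shows why the dot in the pattern needs to be escaped. The function name `strip-decimal-zero` is hypothetical.

```clojure
(require '[clojure.string :as str])

;; Remove a literal ".0" suffix, leaving other strings untouched.
(defn strip-decimal-zero [s]
  (str/replace s #"\.0$" ""))

(strip-decimal-zero "180130.0") ; => "180130"
(strip-decimal-zero "180130")   ; => "180130"

;; With an unescaped dot, "." matches any character, so an integer
;; string that ends in 0 loses its last two digits.
(str/replace "180130" #".0$" "") ; => "1801"
```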

In [152]:
; Print a sample

(def ks [:code :label :year :population])
(pp/print-table ks (repeatedly 5 #(rand-nth population-values-wikidata)))


|     :code |              :label | :year | :population |
|-----------+---------------------+-------+-------------|
| S12000038 |        Renfrewshire |  2009 |      173020 |
| S12000008 |       East Ayrshire |  2008 |      121600 |
| S12000039 | West Dunbartonshire |  2014 |       89710 |
| S12000013 |      Outer Hebrides |  2003 |       26430 |
| S12000030 |            Stirling |  2019 |       94210 |


nil

### 3.3. Compare Wikidata's population values against those of statistics.gov.scot

In [153]:
; Check for differences

; Make the label sets in both the same so that the diff 
; doesn't pick up on the 'Outer Hebrides' vs 'Na h-Eileanan Siar' label difference
(def population-values-scotgov
  (map #(if (= "S12000013" (:code %)) 
              (assoc % :label "Outer Hebrides") 
              %)
       population-values-scotgov))

(def diff (data/diff 
            (set population-values-scotgov) 
            (set population-values-wikidata)))

(println (count (first diff)) "in :scotgov only")
(println (count (second diff)) "in :wikidata only")
(println (count (nth diff 2)) "in both")

(with-open [wtr (io/writer "population-values-diff.txt")]
  (binding [*out* wtr]
    (do
      (println "population-values: scotgov versus wikidata")
      (println)
      (println (count (first diff)) "in :scotgov only...")
      (pp/pprint (first diff))
      (println)
      (println (count (second diff)) "in :wikidata only...")
      (pp/pprint (second diff))
      (println)
      (println (count (nth diff 2)) "in both...")
      (pp/pprint (nth diff 2)))))

0 in :scotgov only
0 in :wikidata only
608 in both


nil

#### 3.3.1. When I first compared...

When I ran the above comparison for the first time (at 2020-09-08T10:01GMT) it discovered the following differences...
```
574 in :scotgov only
9 in :wikidata only
34 in both
```
The details are in [this file](population-values-diff-2020-09-08T10_01GMT.txt). 

Then I manually edited Wikidata to fix a few of those differences. However, this was laborious, so I decided to introduce some automation by using [QuickStatements](https://quickstatements.toolforge.org/). QuickStatements accepts CSV input that represents the edits to be applied to Wikidata. The QuickStatements CSV input was generated as follows.

In [126]:
; Build the map code->qid from a SPARQL query against Wikidata

(def sparql "
SELECT 
  (strafter(str(?areaEntity), 'http://www.wikidata.org/entity/') as ?qid) 
  ?code
WHERE {
  ?areaEntity wdt:P31 wd:Q15060255 ; # Scottish council area
              wdt:P836 ?code . # nine-character UK Government Statistical Service code
}")

(def code->qid
    (->> sparql
        (exec-query :wikidata)
        (#(do (println (count %) "rows") %))
        (map #(vector (:code %) (:qid %)))
        (into {})))

32 rows


#'user/code->qid

In [125]:
; Generate the CSV that is to be imported into QuickStatements

(def TQD "\"\"\"") ; triple double-quote 

(with-open [wtr (io/writer "population-values-quickstatements.csv")]
  (binding [*out* wtr]
    ;; qid, population, point in time (qualifier), determination method (qualifier), reference URL (source), editing comment
    (println "qid,P1082,qal585,qal459,S854,#") 
    (doseq [m (first diff)]
      (println (str (code->qid (:code m)) ","
                    (:population m) ","
                    "+" (:year m) "-00-00T00:00:00Z/9,"
                    "Q791801," ; estimation process
                    TQD "http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries" TQD ","
                    TQD "Set " (:label m) " council area's " (:year m) " population" TQD)))))

nil


The first time I ran the above CSV generation, its output ([this file](population-values-quickstatements-2020-09-09T11_20GMT.csv)) specified 2232 _individual_ edits to Wikidata. QuickStatements executed these successfully against Wikidata, taking about 30 minutes.
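
To show what a single generated line looks like, here is a sketch that builds one row with the same string concatenation as the generation cell. The function name `quickstatements-row` is hypothetical, and `Q-PLACEHOLDER` is a stand-in rather than a real Wikidata QID. The year and population are taken from the Stirling row in the sample output of section 3.2.

```clojure
(def TQD "\"\"\"") ; triple double-quote

(def source-url
  "http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries")

;; Build one CSV row: qid, P1082, qal585, qal459, S854, comment.
(defn quickstatements-row [code->qid m]
  (str (code->qid (:code m)) ","
       (:population m) ","
       "+" (:year m) "-00-00T00:00:00Z/9," ; /9 means year precision
       "Q791801,"                          ; estimation process
       TQD source-url TQD ","
       TQD "Set " (:label m) " council area's " (:year m) " population" TQD))

;; Stand-in lookup; the real map comes from the Wikidata query above.
(def sample-code->qid {"S12000030" "Q-PLACEHOLDER"})

(def sample-row
  (quickstatements-row sample-code->qid
                       {:code "S12000030" :label "Stirling"
                        :year "2019" :population "94210"}))
;; sample-row begins:
;; Q-PLACEHOLDER,94210,+2019-00-00T00:00:00Z/9,Q791801,"""http://statistics.gov.scot/...
```

The triple double-quotes are presumably there so that, once the CSV layer has unquoted the field, the value still carries the surrounding double quotes that mark it as a string.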

## 4. The usefulness of the uploaded dataset

TODO

* Show a line graph plot.
* Mention how this now enables 'per citizen' statistics and visualizations, for example waste, recycling and carbon impact amounts and trends per citizen.
* There are no existing appropriate Properties for these, so new ones would have to be ratified before the waste and carbon impact datasets could be uploaded.


