# An exploration of uploading a dataset into Wikidata 

This workbook explores the steps involved in, and the usefulness of, uploading a dataset into [Wikidata](https://www.wikipedia.org/wiki/Wikidata).

Wikidata maintains [linked data](https://www.wikipedia.org/wiki/Linked_data), so uploading a dataset into it is a little more difficult than uploading into a non-linked repository like [CKAN](https://en.wikipedia.org/wiki/CKAN).

## 1. The dataset that is to be uploaded

The dataset that is to be uploaded is a slice of [statistics.gov.scot](http://statistics.gov.scot)'s [Population Estimates](http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries) data cube: _population per council area per year_.

### 1.1. Prep tooling

In [15]:
; Add code libraries

(require '[clojupyter.misc.helper :as helper])

(helper/add-dependencies '[org.clojure/data.csv "1.0.0"])
(helper/add-dependencies '[clj-http/clj-http "3.10.1"])

(require '[clojure.string :as str]
         '[clojure.set :as set]
         '[clojure.data :as data]
         '[clojure.pprint :as pp]
         '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clj-http.client :as http])

(import 'java.net.URLEncoder)

java.net.URLEncoder

In [16]:
; Define convenience functions

; Convert the CSV structure to a list-of-maps structure.
(defn to-maps [csv-data]
    (map zipmap (->> (first csv-data)
                    (map keyword)
                    repeat)
                (rest csv-data)))

; Map the name of a SPARQL service to its URL.
(def service-urls {:scotgov "http://statistics.gov.scot/sparql"
                   :wikidata "https://query.wikidata.org/sparql"})
                                
; Ask the service to execute the given SPARQL query
; and return its result as a list-of-maps.
(defn exec-query [service-name sparql]
  (->> (http/post (service-name service-urls) 
        {:body (str "query=" (URLEncoder/encode sparql "UTF-8")) 
         :headers {"Accept" "text/csv" 
                   "Content-Type" "application/x-www-form-urlencoded"} 
         :debug false})
    :body
    csv/read-csv
    to-maps))

#'user/exec-query
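
To see what `to-maps` produces, here is a minimal sketch that applies it to a small, hand-written table in the shape that `csv/read-csv` returns (a sequence of vectors of strings, with the header row first). The sample rows are illustrative, and the function is repeated only so that the snippet runs on its own.

```clojure
;; Same definition as above, repeated so that this snippet is self-contained.
(defn to-maps [csv-data]
  (map zipmap
       (->> (first csv-data) (map keyword) repeat)
       (rest csv-data)))

;; Hand-written rows (illustrative) in the shape that csv/read-csv returns.
(def sample-rows [["code" "label"]
                  ["S12000030" "Stirling"]
                  ["S12000047" "Fife"]])

;; Each data row becomes a map keyed by the keywordised header names.
(def sample-maps (to-maps sample-rows))

(prn sample-maps)
```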

## 2. Council areas

Considering the _council area_ aspect of the dataset, 
how do `scotgov` (statistics.gov.scot) and `wikidata` (Wikidata) compare?

`scotgov` identifies council areas by 9-character codes (e.g. `S12000030` identifies the Stirling council area).   
Happily, `wikidata` can also identify Scottish council areas using the same codes.
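
Because the comparisons below hinge on these codes, it can be worth checking that a value has the expected shape before using it as a key. A minimal sketch, assuming that every council area code is `S12` followed by six digits; that pattern is inferred from the examples in this workbook rather than taken from an official specification, and `council-area-code?` is a hypothetical helper.

```clojure
;; Hypothetical helper: true when s looks like a Scottish council area code.
;; ASSUMPTION: codes are "S12" followed by exactly six digits.
(defn council-area-code? [s]
  (boolean (and (string? s)
                (re-matches #"S12\d{6}" s))))

(prn (council-area-code? "S12000030"))
```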

### 2.1. Council area codes data from statistics.gov.scot 

In [17]:
; Query statistics.gov.scot for council area codes

(def sparql "

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uent: <http://statistics.data.gov.uk/def/statistical-entity#>
PREFIX ugeo: <http://statistics.data.gov.uk/def/statistical-geography#>

SELECT 
  (strafter(str(?areaUri), 'http://statistics.gov.scot/id/statistical-geography/') as ?code) 
  ?label

WHERE {
  ?areaUri uent:code <http://statistics.gov.scot/id/statistical-entity/S12> ;
           ugeo:status 'Live' ;
           rdfs:label ?label .
}
")

(def area-codes-scotgov 
    (->> sparql
        (exec-query :scotgov)))

(println (count area-codes-scotgov ) "rows")

32 rows


nil

In [18]:
; Print a sample

(def ks [:code :label])
(pp/print-table ks (repeatedly 5 #(rand-nth area-codes-scotgov)))


|     :code |            :label |
|-----------+-------------------|
| S12000047 |              Fife |
| S12000021 |    North Ayrshire |
| S12000050 | North Lanarkshire |
| S12000047 |              Fife |
| S12000008 |     East Ayrshire |


nil

### 2.2. Council area codes from Wikidata

In [19]:
; Query Wikidata for council area codes

(def sparql "

SELECT DISTINCT
  ?code
  ?areaEntityLabel

WHERE {
  ?areaEntity wdt:P31 wd:Q15060255 ; # Scottish council area
              wdt:P836 ?code . # nine-character UK Government Statistical Service code

  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

(def area-codes-wikidata
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:areaEntityLabel :label}))))

(println (count area-codes-wikidata) "rows")

32 rows


nil

Hyperlink to try out the above SPARQL: [https://w.wiki/boe](https://w.wiki/boe)

In [20]:
; Print a sample

(def ks [:code :label])
(pp/print-table ks (repeatedly 5 #(rand-nth area-codes-wikidata)))


|     :code |         :label |
|-----------+----------------|
| S12000047 |           Fife |
| S12000021 | North Ayrshire |
| S12000030 |       Stirling |
| S12000028 | South Ayrshire |
| S12000010 |   East Lothian |


nil

### 2.3. Compare Wikidata's council area codes against statistics.gov.scot's

In [21]:
; Check for differences

(def diff (data/diff 
            (set area-codes-scotgov) 
            (set area-codes-wikidata)))

(println (count (first diff)) "in :scotgov only...")
(pp/pprint (first diff))
(println)
(println (count (second diff)) "in :wikidata only...")
(pp/pprint (second diff))
(println)
(println (count (nth diff 2)) "in both")

1 in :scotgov only...
#{{:code "S12000013", :label "Na h-Eileanan Siar"}}

1 in :wikidata only...
#{{:code "S12000013", :label "Outer Hebrides"}}

31 in both


nil

#### 2.3.1. When I first compared...

When I ran the above comparison for the first time (at 2020-09-07T21:20GMT) it discovered the following differences...

```
2 in :scotgov only...
#{{:code "S12000050", :label "North Lanarkshire"}
  {:code "S12000013", :label "Na h-Eileanan Siar"}}

2 in :wikidata only...
#{{:code "S12000044", :label "North Lanarkshire"}
  {:code "S12000013", :label "Outer Hebrides"}}

30 in both
```
At that time, `wikidata`'s code value for North Lanarkshire was incorrect so I amended it directly via its web page:
* North Lanarkshire [Q207111](https://www.wikidata.org/wiki/Q207111): `S12000044` -> `S12000050`

I am using `code` values to identify council areas so it is important that they are correct.

For my purpose, the `label` values are not significant so I didn't ponder if the Outer Hebrides' "English" `label` should be changed to be its Scottish Gaelic name. So the above live `diff` is probably still indicating this one difference.
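
Since only the `code` values matter here, the comparison could also be restricted to codes so that label differences never show up. A minimal sketch of that idea, using the rows from the first comparison above as sample data; `code-diff` is a hypothetical helper, not part of the workbook's tooling.

```clojure
(require '[clojure.set :as set])

;; Sample rows copied from the first comparison shown above.
(def scotgov-sample  [{:code "S12000050" :label "North Lanarkshire"}
                      {:code "S12000013" :label "Na h-Eileanan Siar"}])
(def wikidata-sample [{:code "S12000044" :label "North Lanarkshire"}
                      {:code "S12000013" :label "Outer Hebrides"}])

;; Hypothetical helper: compare two row collections on :code only.
(defn code-diff [a b]
  (let [codes-a (set (map :code a))
        codes-b (set (map :code b))]
    {:only-in-a (set/difference codes-a codes-b)
     :only-in-b (set/difference codes-b codes-a)
     :in-both   (set/intersection codes-a codes-b)}))

(prn (code-diff scotgov-sample wikidata-sample))
```

With this view, the Outer Hebrides label difference disappears and only the North Lanarkshire code mismatch remains.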

## 3. Population

Now consider the _population_ aspect of the dataset: how do `scotgov` (statistics.gov.scot) and `wikidata` (Wikidata) compare?

### 3.1. Population values from statistics.gov.scot

In [22]:
; Query statistics.gov.scot for population values

(def sparql "

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX pdmx: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX sdmx: <http://statistics.gov.scot/def/dimension/>
PREFIX snum: <http://statistics.gov.scot/def/measure-properties/>
PREFIX uent: <http://statistics.data.gov.uk/def/statistical-entity#>
PREFIX ugeo: <http://statistics.data.gov.uk/def/statistical-geography#>

SELECT 
  (strafter(str(?areaUri), 'http://statistics.gov.scot/id/statistical-geography/') as ?code) 
  ?label
  ?year
  ?population

WHERE {
  ?areaUri uent:code <http://statistics.gov.scot/id/statistical-entity/S12> ;
           ugeo:status 'Live' ;
           rdfs:label ?label .
           
  ?populationUri qb:dataSet <http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries> ;
                 pdmx:refArea ?areaUri ;
                 pdmx:refPeriod ?periodUri ;
                 sdmx:age <http://statistics.gov.scot/def/concept/age/all> ;
                 sdmx:sex <http://statistics.gov.scot/def/concept/sex/all> ;
                 snum:count ?population .
  
  ?periodUri rdfs:label ?year .
}")

(def population-values-scotgov 
    (->> sparql
        (exec-query :scotgov)))

(println (count population-values-scotgov ) "rows")

608 rows


nil

In [23]:
; Print a sample

(def ks [:code :label :year :population])
(pp/print-table ks (repeatedly 5 #(rand-nth population-values-scotgov)))


|     :code |            :label | :year | :population |
|-----------+-------------------+-------+-------------|
| S12000023 |    Orkney Islands |  2011 |       21420 |
| S12000023 |    Orkney Islands |  2018 |       22190 |
| S12000050 | North Lanarkshire |  2008 |      333280 |
| S12000048 | Perth and Kinross |  2002 |      135130 |
| S12000018 |        Inverclyde |  2001 |       84150 |


nil

### 3.2. Population values from Wikidata

In [24]:
; Query Wikidata for population values

(def sparql "

SELECT DISTINCT
  ?code
  ?areaEntityLabel
  (YEAR(?populationWhen) as ?year )
  ?population 

WHERE {
  ?areaEntity wdt:P31 wd:Q15060255 ; # Scottish council area
              wdt:P836 ?code ; # nine-character UK Government Statistical Service code
              p:P1082 ?populationEntity .
  ?populationEntity ps:P1082 ?population ;
                    pq:P585 ?populationWhen .

  SERVICE wikibase:label { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en' . }
}")

(def population-values-wikidata
    (->> sparql
        (exec-query :wikidata)
        (map #(set/rename-keys % {:areaEntityLabel :label}))
        (map #(assoc % :population (str/replace (:population %) #"\.0$" "")))))

(println (count population-values-wikidata) "rows")

608 rows


nil

Hyperlink to try out the above SPARQL: [https://w.wiki/boc](https://w.wiki/boc)
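
The population values have to be brought into the same textual form as those from statistics.gov.scot before the two sets can be compared. As an alternative to string replacement, here is a minimal sketch that parses each value as a decimal and renders it as a whole number. It assumes the values arrive as strings such as `"22000.0"` or `"22000"`; `normalise-population` is a hypothetical helper, and it silently drops any fractional part.

```clojure
;; Hypothetical helper: render a numeric string as a whole-number string.
;; ASSUMPTION: s is a decimal string; any fractional part is truncated.
(defn normalise-population [s]
  (str (.toBigInteger (bigdec s))))

(prn (map normalise-population ["22000.0" "333280"]))
```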

In [25]:
; Print a sample

(def ks [:code :label :year :population])
(pp/print-table ks (repeatedly 5 #(rand-nth population-values-wikidata)))


|     :code |            :label | :year | :population |
|-----------+-------------------+-------+-------------|
| S12000023 |    Orkney Islands |  2017 |       22000 |
| S12000023 |    Orkney Islands |  2001 |       19220 |
| S12000050 | North Lanarkshire |  2019 |      341370 |
| S12000048 | Perth and Kinross |  2012 |      147740 |
| S12000027 |  Shetland Islands |  2011 |       23240 |


nil

### 3.3. Compare Wikidata's population values against those of statistics.gov.scot

In [26]:
; Check for differences

; Make the label sets in both the same so that the diff 
; doesn't pick up on the 'Outer Hebrides' vs 'Na h-Eileanan Siar' label difference
(def population-values-scotgov
  (map #(if (= "S12000013" (:code %)) 
              (assoc % :label "Outer Hebrides") 
              %)
       population-values-scotgov))

(def diff (data/diff 
            (set population-values-scotgov) 
            (set population-values-wikidata)))

(println (count (first diff)) "in :scotgov only")
(println (count (second diff)) "in :wikidata only")
(println (count (nth diff 2)) "in both")

(with-open [wtr (io/writer "population-values-diff.txt")]
  (binding [*out* wtr]
    (do
      (println "population-values: scotgov versus wikidata")
      (println)
      (println (count (first diff)) "in :scotgov only...")
      (pp/pprint (first diff))
      (println)
      (println (count (second diff)) "in :wikidata only...")
      (pp/pprint (second diff))
      (println)
      (println (count (nth diff 2)) "in both...")
      (pp/pprint (nth diff 2)))))

0 in :scotgov only
0 in :wikidata only
608 in both


nil
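
A set-based diff reports whole rows, so it does not say directly which area/year pairs hold conflicting values and which are simply missing on one side. A minimal sketch of a more targeted check, which indexes each side by `[code year]` and lists only the pairs whose population values disagree. The helper names are hypothetical, and the second sample value is made up purely for illustration.

```clojure
;; Hypothetical helper: index rows as {[code year] population}.
(defn index-by-area-year [rows]
  (into {} (map (juxt (juxt :code :year) :population)) rows))

;; Hypothetical helper: list the [code year] pairs present on both sides
;; whose population values differ.
(defn population-mismatches [a b]
  (let [ia (index-by-area-year a)
        ib (index-by-area-year b)]
    (for [[k va] ia
          :let [vb (get ib k)]
          :when (and vb (not= va vb))]
      {:code (first k) :year (second k) :a va :b vb})))

;; Illustrative rows: the :b value is invented for this example.
(def side-a [{:code "S12000023" :year "2011" :population "21420"}
             {:code "S12000018" :year "2001" :population "84150"}])
(def side-b [{:code "S12000023" :year "2011" :population "21400"}
             {:code "S12000018" :year "2001" :population "84150"}])

(prn (population-mismatches side-a side-b))
```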

#### 3.3.1. When I first compared...

When I ran the above comparison for the first time (at 2020-09-08T10:01GMT) it discovered the following differences...
```
574 in :scotgov only
9 in :wikidata only
34 in both
```
The details are in [this file](population-values-diff-2020-09-08T10_01GMT.txt). 

Then I manually edited Wikidata to fix a few of those differences. However, this was laborious, so I decided to introduce some automation by using [QuickStatements](https://quickstatements.toolforge.org/). QuickStatements accepts CSV input that represents edits to be applied to Wikidata. The QuickStatements CSV input was generated as follows.

In [27]:
; Build the map code->qid from a SPARQL query against Wikidata

(def sparql "
SELECT 
  (strafter(str(?areaEntity), 'http://www.wikidata.org/entity/') as ?qid) 
  ?code
WHERE {
  ?areaEntity wdt:P31 wd:Q15060255 ; # Scottish council area
              wdt:P836 ?code . # nine-character UK Government Statistical Service code
}")

(def code->qid
    (->> sparql
        (exec-query :wikidata)
        (#(do (println (count %) "rows") %))
        (map #(vector (:code %) (:qid %)))
        (into {})))

32 rows


#'user/code->qid

In [28]:
; Generate the CSV that is to be imported into QuickStatements

(def TQD "\"\"\"") ; triple double-quote 

(with-open [wtr (io/writer "population-values-quickstatements.csv")]
  (binding [*out* wtr]
    ;; qid, population, point in time (qualifier), determination method (qualifier), reference URL (source), editing comment
    (println "qid,P1082,qal585,qal459,S854,#") 
    (doseq [m (first diff)]
      (println (str (code->qid (:code m)) ","
                    (:population m) ","
                    "+" (:year m) "-00-00T00:00:00Z/9,"
                    "Q791801," ; estimation process
                    TQD "http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries" TQD ","
                    TQD "Set " (:label m) " council area's " (:year m) " population" TQD)))))

nil


When I ran the above CSV generation for the first time (the output is in [this file](population-values-quickstatements-2020-09-09T11_20GMT.csv)), it specified 2232 _individual_ edits to Wikidata. These were successfully executed (taking about 30 mins) against Wikidata by QuickStatements. (Unfortunately QuickStatements does not yet support a means to set the `rank` of a triple, so I had to individually edit the 32 council area pages to mark, in each, its 2019 population value as the `Preferred rank` population value, indicating that it is the most up-to-date population value.)
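
To make the shape of a generated row easier to inspect, the row-building step can be pulled out into a pure function. A minimal sketch, mirroring the string concatenation used in the cell above; `quickstatements-row` and `dataset-url` are hypothetical names, and the sample row pairs North Lanarkshire's QID with one of the statistics.gov.scot sample values shown earlier.

```clojure
(def TQD "\"\"\"") ; triple double-quote, as in the cell above

(def dataset-url
  "http://statistics.gov.scot/data/population-estimates-current-geographic-boundaries")

;; Hypothetical helper: build one CSV row for the header
;; qid,P1082,qal585,qal459,S854,#
(defn quickstatements-row [qid {:keys [label year population]}]
  (str qid ","
       population ","
       "+" year "-00-00T00:00:00Z/9,"   ; point in time, year precision
       "Q791801,"                        ; estimation process
       TQD dataset-url TQD ","           ; reference URL
       TQD "Set " label " council area's " year " population" TQD))

(println
  (quickstatements-row "Q207111"
                       {:code "S12000050" :label "North Lanarkshire"
                        :year "2008" :population "333280"}))
```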

## 4. The usefulness of the uploaded dataset

The uploaded dataset can be pulled (dynamically _de-referenced_) into Wikipedia articles and other web pages. 

### 4.1. Embedding dataset values into Wikipedia articles

For example, I edited the Wikipedia article [Council areas of Scotland](https://simple.wikipedia.org/wiki/Council_areas_of_Scotland) to add, to its main table, the new column "_Number of people (latest estimate)_" whose values are pulled (each time the page is rendered) directly from the data that we have uploaded into Wikidata.

![Screenshot of Wikipedia article](screenshot-wikipedia-council-areas-article.png)

### 4.2. Embedding dataset based graphs into web pages

<a href="https://query.wikidata.org/embed.html#%23defaultView%3ALineChart%0ASELECT%20%0A%20%20%3FcouncilArea%0A%20%20(str(YEAR(%3FpopulationWhen))%20as%20%3Fyear%20)%0A%20%20%3Fpopulation%0A%20%20%3FcouncilAreaLabel%0AWHERE%20%7B%0A%20%20%3FcouncilArea%20wdt%3AP31%20wd%3AQ15060255%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20p%3AP1082%20%3FpopulationEntity%20.%0A%20%20%3FpopulationEntity%20ps%3AP1082%20%3Fpopulation%20%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20pq%3AP585%20%3FpopulationWhen%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%27%5BAUTO_LANGUAGE%5D%2Cen%27%20.%20%7D%0A%7D">Here</a> is a line graph, dynamically generated from the new Wikidata data, that can be embedded in any web page.
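
The embed link is the `embed.html` address followed by `#` and the percent-encoded SPARQL query, which itself starts with a `#defaultView:LineChart` directive. A minimal sketch of building such a link from a query string; `embed-url` is a hypothetical helper, and the result may encode a few more characters (such as parentheses) than the hand-made link does.

```clojure
(require '[clojure.string :as str])
(import 'java.net.URLEncoder)

;; Hypothetical helper: build a query.wikidata.org embed link for a query.
;; URLEncoder writes spaces as "+", so they are rewritten to "%20";
;; literal plus signs in the query are already encoded as "%2B" by then.
(defn embed-url [sparql]
  (str "https://query.wikidata.org/embed.html#"
       (-> (URLEncoder/encode sparql "UTF-8")
           (str/replace "+" "%20"))))

(println (embed-url "#defaultView:LineChart\nSELECT ?x"))
```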

### 4.3. Concerns, next steps, alternative approaches

Interestingly, there is [some discussion](https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2) about the pros & cons of inserting Wikidata values into Wikipedia articles. The main argument (by some) against is the immaturity of Wikidata's structure, which raises a concern about the permanence of references to data within that structure. But those in favour point out that early use & evolution might be the best path to maturity.

The case study for our Data Commons Scotland project is _open data about waste in Scotland_. So a next step for the project might be to upload into Wikidata datasets that describe the amounts of household waste generated & recycled, and 'carbon impact' figures. These could also be linked to [council areas](https://www.wikidata.org/wiki/Q15060255), as we have done for the population dataset, to support _per council area_/_per citizen_ statistics and visualisations. Appropriate [properties](https://www.wikidata.org/wiki/Q18616576) do not yet exist in Wikidata for the description of such data about waste, so new ones would need to be ratified by the Wikidata community.

Should such datasets actually be uploaded into Wikidata? These are small datasets and they seem to fit well enough into Wikidata's information graph. Uploading them into Wikidata may make them easier to access and help enrich Wikidata's information graph. But then, of course, there is the _keeping it up-to-date_ issue to deal with. 
Alternatively, those datasets could be pulled directly from statistics.gov.scot into Wikipedia articles with the help of some new MediaWiki [extensions](https://www.mediawiki.org/wiki/Category:Extensions).




