---
title: "Data PPP Reader"
subtitle: "List of references and some quotes"
format: 
   html: 
     toc-depth: 2
   gfm: 
     toc-depth: 2
editor: visual
bibliography:
- bib/datalicensing.bib
- bib/datapooling.bib
- bib/privatelyhelddata.bib
- bib/administrativedata.bib
- bib/libraryLOD.bib
- bib/statisticalLOD.bib
- bib/OpenMusE.bib
- bib/openmusicrepositories.bib
---

::: callout-note
[README](). Call for Papers for [*Trust through Transparency*](https://www.dcc.ac.uk/events/idcc24)*: 18th International Digital Curation Conference*. [Lightning Talk](https://music.dataobservatory.eu/documents/open_music_europe/data-to-policy/Lightning-talk.html) proposal for the same conference. [Conference paper](https://music.dataobservatory.eu/documents/open_music_europe/data-to-policy/IDCC_CfP.html) proposal.
:::

## Data Pooling

`bib/datapooling.bib`

Mattioli analyses data pooling by analogy with patent pools, particularly those that uphold industry standards [@mattioli_data-pooling_2017]. Patent pools are federations of patent holders that reduce transaction costs by collectively licensing complementary patent rights under unified agreements. The original patent pools for the compact disc (CD) and later the DVD are good examples that benefitted the music industry. Commercialising the CD as a digital audio carrier required the joint exploitation of two patented innovations: Sony's error correction technology and Philips' optical technology. What made the CD a viable global standard was that anybody who developed devices for it knew how to get access to both patents. Exploiting the two innovations jointly increased the value of Sony's and Philips' intellectual property, even though the two companies competed in the final markets for CD players and Discmans and, initially, for discs [@flamm_patent-pools_2012].

Mattioli's research focuses on a seemingly straightforward case for organising data pools along the lines of patent pools: cancer research, where much data pooling occurs. While earlier conceptualisations of the data pooling problem focused on free-riding and privacy concerns, he highlights problems that make data pooling difficult in the music sector, too: concerns over professional, competitive or reputational standing [@mattioli_data-pooling_2017, p. 188]. Just as hospitals are reluctant to share data on cancer patients for fear that it will reflect poorly on the quality of the service they provide, very little music export data is released, because many small-nation institutions fear that local stakeholders will judge whatever figures they reveal as insufficient when set against the far larger and globally promoted American repertoire.

When looking for analogies with successful patent pools, Mattioli draws attention to the distinction between "incremental" and "recombinant" innovations [@mattioli_data-pooling_2017, pp. 190-191]. By analogy with "incremental" innovation, a music organisation can perform more, and more valuable, data analysis as the sheer size of its datasets grows. A new music distributor with a short history can initially benefit little from asset valuations or time series projections. Still, its ability to make inferences grows as every royalty period's data is added to the existing pool: a time series of 20 months may not be sufficient for business predictions, but 80 months of data supports considerably more. The "recombinant" insights occur when this distributor can combine its datasets with datasets collected by other organisations. Sales for a Swedish distributor may have a different dynamic in the United Kingdom than in France. Adding external data about the SEK/GBP and SEK/EUR exchange rates, consumer spending dynamics or simply price levels in the two countries may tell the distributor whether it can exploit the different UK/France dynamics with a better sales strategy or, on the contrary, should simply manage its exchange rate risks better.
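
The recombinant case can be illustrated with a small calculation. The sketch below is not taken from Mattioli: it assumes a hypothetical distributor whose income is booked in SEK, combines it with made-up exchange rates, and splits income growth in each market into local-currency growth and exchange rate change. The function name `decompose_growth` and all figures are illustrative assumptions.

```python
# Hypothetical figures for illustration only; none come from the cited sources.

# Internal dataset: the distributor's own income per market, booked in SEK.
income_sek = {
    "UK": {2022: 1000.0, 2023: 1100.0},
    "FR": {2022: 1000.0, 2023: 1100.0},
}

# External dataset: average exchange rates, in SEK per unit of local currency.
sek_per_local = {
    "UK": {2022: 12.5, 2023: 13.75},  # SEK per GBP (made-up values)
    "FR": {2022: 10.0, 2023: 10.0},   # SEK per EUR (made-up values)
}


def decompose_growth(market, start, end):
    """Split SEK income growth into local-currency growth and exchange rate change."""
    local_start = income_sek[market][start] / sek_per_local[market][start]
    local_end = income_sek[market][end] / sek_per_local[market][end]
    return {
        "sek_growth": income_sek[market][end] / income_sek[market][start] - 1,
        "local_growth": local_end / local_start - 1,
        "fx_change": sek_per_local[market][end] / sek_per_local[market][start] - 1,
    }


for market in income_sek:
    parts = decompose_growth(market, 2022, 2023)
    print(market, {name: round(value, 3) for name, value in parts.items()})
```

With these made-up numbers, both markets show the same growth in SEK, but in the UK it comes entirely from the exchange rate, while in France it comes from local sales. The internal dataset alone could not tell the two situations apart.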

As Mattioli concludes, the patent-data analogy is imperfect, and data pooling appears to be more challenging for several reasons. A patent is finished once it is granted, whereas datasets usually need constant updating, and ownership and exploitation rights are far clearer for patents than for data [@mattioli_data-pooling_2017, p. 199]. This topic overlaps with the general problem of data ownership, control and licensing.

## Data licensing

-   No one's ownership as the status quo and a possible way forward: A note on the public consultation on Building a European Data Economy [@kim_2018]

## Privately-held data

`bib/privatelyhelddata.bib`

The widespread use of ever-greater computing and data storage capacity has changed how humanity records and stores data. Until the middle of the 20th century, accumulating substantial amounts of data records for statistical analysis was possible only for large governmental, military or corporate structures. Today, data accumulation based on questionnaires, sensors and digital documentation happens in all public institutions, down to small municipalities, and in the private sector.

The production of official national statistics, which was once almost entirely based on censuses and sample surveys, has relied more and more on so-called administrative data sources in developed countries since the 1980s and 1990s. While a population census is still necessary and held every ten years in each EU member state, statisticians have a relatively accurate view of the country's population from birth and death certificates, tax and social security data, and other computer-recorded and digitally stored data in public institutions. In 2010, many statistical offices started experimenting with digitally stored data held by commercial enterprises, such as mobile phone operators or credit card companies. "Use of new data source --- the case of Statistics Norway" by Sæbø and Dimakos presents the current state of affairs with the example of the Household Budget Survey [@saebo_new_data_sources_2023]. Until 2012, Norwegian statisticians measured household spending with a sample survey: they asked a large sample of people drawn from the Norwegian population register (established in 1964) to fill in questionnaires about their purchases. The data quality was particularly poor for food and beverages because such purchases, unlike the infrequent buying of new television sets or festival tickets, happen daily; the burden on reporting and memory is heavy when people must retrospectively fill out a questionnaire about buying bread, butter and orange juice.

Norway created statistical registers to tap into governmental data stores in 1990 and into municipal ones in 1995; by 2019, it utilised about 100 records and drew data from 30 public institutional sources [@saebo_new_data_sources_2023, p. 1]. As in most countries, such "administrative data" was retrieved from other governmental entities, not from the private sector. Collecting data from people, companies and non-profits still relies on census-like comprehensive surveys and sample surveys, in which the respondent fills out a digital or paper form or answers the questions of an interviewer who fills out the form on the respondent's behalf.

Surveying is costly and often inaccurate. Asking randomly selected music creators about the royalties they received over a year either requires respondents to open and review various royalty statements before answering, or yields answers filtered through the individual's cognitive biases and memory lapses. Statistics Norway realised that instead of asking 7000 households what they were buying in supermarkets, it is far more accurate and potentially cheaper to acquire the data directly from the sales logs of the supermarkets or the payment transaction records of the credit card companies. CEEMID, the predecessor of Open Music Europe, relied on techniques similar to those used by experimenting government statisticians: we kept surveying randomly selected music creators anonymously about their received royalties, but we also compared the responses with the actual anonymised payouts of Artisjus in Hungary and SOZA in Slovakia.

Several European countries are experimenting with new regulations that oblige certain enterprises, such as supermarkets, mobile telephone operators or credit card companies, to provide privately held commercial data for official statistical purposes. The lessons of Norway are interesting because the new statistical law (in force since 2021) allows such data collection after a cost/benefit analysis and risk reduction carried out by Statistics Norway. In Open Music Europe, we are neither planning nor advocating state-mandated data collection, but we find these criteria useful for voluntary data sharing with the government based on individual agreements, which we endorse in the music sector.

*European Statistical System Position paper on access to privately held data which are of public interest*

> Statistical surveys and administrative registers are and will remain important sources for the production of official statistics. However, the traditional data gathering methods could be in the future enhanced and enriched by big data analytics. To achieve this, it is essential that data held by private actors can be used by statistical authorities as raw material for innovative value-added services and statistical products, which will also boost the economy by creating new jobs and encouraging investment in data-driven sectors. Increased efficiency and faster delivery by statistical offices of innovative digital products and services will be one of the cornerstones for evidence-based decisions and contribute to lessening the burden on statistical respondents and notably on businesses. Improved access to data would enable statistical offices to provide more granular and timely statistics that would be more useful to enterprises and the citizens alike. It would also mean providing valuable feedback information to the data holders or delivering more tailored statistical services to companies, which might help them in return to better develop their business model. [@ess_position_privately_held_2017 pp. 3-4]

> Finally, concerning the practical modalities to access the data, our experience shows that such access can take a large variety of forms, ranging from access to raw data as collected by the data holder to access to data processed on the basis of algorithms specially designed and provided by statistical experts. Using algorithms provided by the statistical offices could not only reduce the cost of making data available but also represent an additional safeguard with regard to privacy. In general, data holders should be entitled to give access to their data as they are and should not be required to provide them in any specific format. It is indeed generally agreed that the burden on data providers needs to be minimised and data shall be used in their original format. The same approach applies regarding the metadata; without the relevant metadata, data will be of limited use for statistical purposes. It is therefore essential for statistical offices to access these metadata in order to be able to ascertain contents, quality and usability of the data while avoiding any extra burden on data holders and eventually leaving the details open for bilateral discussion depending on the data concerned. More generally, statistical offices are committed to transparency in the methods, algorithms, techniques, tools, etc. they use and this overall commitment should represent an additional incentive for data holders to share their data. [@ess_position_privately_held_2017, pp. 6-7.]
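
One of the access modes named in the quote above, where statistical experts supply the algorithm and the data holder runs it, can be sketched as follows. This is only an illustration under stated assumptions: the transaction records, the function name `aggregate_spending` and the minimum group size are hypothetical and do not come from the position paper.

```python
from collections import defaultdict

# Hypothetical raw records that stay with the data holder (e.g. a retailer).
transactions = [
    {"household": "h1", "category": "bread", "amount": 2.5},
    {"household": "h2", "category": "bread", "amount": 3.0},
    {"household": "h3", "category": "bread", "amount": 2.0},
    {"household": "h1", "category": "juice", "amount": 4.0},
]


def aggregate_spending(records, min_households=3):
    """Algorithm supplied by the statistical office and run by the data holder.

    Returns total spending per category and suppresses categories with fewer
    than `min_households` distinct households (an assumed disclosure rule).
    """
    totals = defaultdict(float)
    households = defaultdict(set)
    for record in records:
        totals[record["category"]] += record["amount"]
        households[record["category"]].add(record["household"])
    return {
        category: totals[category]
        for category in totals
        if len(households[category]) >= min_households
    }


# Only this aggregate leaves the data holder, not the household-level rows.
shared_with_statistical_office = aggregate_spending(transactions)
print(shared_with_statistical_office)
```

In this toy setup the statistical office receives only category totals, and the single-household "juice" row is withheld, which is one way to read the quote's point that such algorithms can act as an additional privacy safeguard.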

*High-Level Expert Group on Business-to-Government Data Sharing: Towards a European strategy on business-to-government data sharing for the public interest. Final report*

> B2G data-sharing collaborations should be organised:
>
> -- in testing environments ('sandboxes') for pilot testing ('pilots') to help assess the potential value of data for new situations in which a product or service could potentially be used ('use cases'),
>
> -- via public-private partnerships. [@hleg_towards_b2g_data_2020, p. 8.]

> Data taxonomy
>
> In this report, the term 'data' refers to data at all levels of abstraction, from raw data to pre-processed, processed data and insights derived from the data, depending on the use case, the type of data, etc. (for further details, see Annex I: Data taxonomy).
>
> Raw data: data collected from a source (e.g. numbers, instrument readings, images, text, videos, sensor data).
>
> Pre-processed data: pre-processing includes, for example, cleaning, instance selection, re-sampling, normalisation, transformation, feature extraction and selection.
>
> Processed data: data processing can be described as the manipulation of data in order to produce meaningful information.
>
> Data-driven insights: are generated by drawing conclusions from processed, analysed data.
>
> All levels of abstraction need to be handled in full compliance with privacy legislation.
>
> [@hleg_towards_b2g_data_2020, p. 23]
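
The four levels of abstraction in the taxonomy above can be made concrete with a toy pipeline. The sketch below is an assumption-laden illustration, not part of the report: the readings, the function names and the threshold are all hypothetical.

```python
from statistics import mean

# Raw data: readings as collected from a source (hypothetical values).
raw = ["12.0", "15.5", "", "15.5", "n/a", "20.0"]


def preprocess(values):
    """Pre-processed data: clean unusable entries and convert the rest to numbers."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value))
        except ValueError:
            continue  # skip empty or non-numeric entries
    return cleaned


def process(values):
    """Processed data: manipulate the values to produce meaningful information."""
    return {"count": len(values), "mean": mean(values), "max": max(values)}


def insight(summary, threshold=15.0):
    """Data-driven insight: a conclusion drawn from the processed data."""
    return "above threshold" if summary["mean"] > threshold else "at or below threshold"


summary = process(preprocess(raw))
print(summary, "->", insight(summary))
```

A sharing agreement can target any of these levels: the data holder might hand over `raw`, the cleaned list, the summary, or only the final conclusion.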

> The focus of B2G data sharing is on datasets that the public sector either does not have access to or has no way to collect, in addition to situations where the sharing of such data would prevent the duplication of effort and investment in data collection and storage. [@hleg_towards_b2g_data_2020, p. 23]

> Member States should have in place structures to support B2G data sharing. These structures could be a body (or bodies) tasked with assisting public-sector organisations and private companies or civil-society organisations in entering into new data-sharing partnerships and facilitating the sharing of good practice. Over time, such structures could become trusted third parties between the public and private sectors, by bringing the relevant players together. [@hleg_towards_b2g_data_2020, p. 37]

> All B2G data sharing can ultimately be understood as a form of public-private partnership (PPP). However, from a practical point of view, sandboxes and specific PPPs can be put in place for specific public-interest purposes or to target specific challenges related to B2G data sharing. [@hleg_towards_b2g_data_2020, p. 41]

See also the chart on p. 43 of the same report.

## Administrative Data & Statistical registers

`bib/administrativedata.bib`

"A register aims to be a complete list of the objects in a specific group of objects or population." [@anders_register-based_2007] We are planning music industry registers where the objects are *music works* and *sound recordings* (in statistical terms, music products), and the populations are *music authors*, *music performers*, *groupings of performers* (as the majority of musicians perform, record and release in groups, ensembles or orchestras), *record labels* (which may be formal or informal businesses) and *music publishers* (enterprises). From a statistical point of view, our planned music industry registers can be seen as "administrative registers" because they are not created primarily for a statistical purpose by a statistical authority.

The authoritative source on statistical business registers is the *United Nations Guidelines on Statistical Business Registers. Final draft prior to official editing* [@un_sbr_guidelines_2020], which is heavily based on the earlier UNECE *Guidelines on statistical business registers* [@unece_sbr_guidelines_2015]. The European guideline is the *European business statistics methodological manual for statistical business registers. 2021 edition* [@eurostat_sbr_manual_2021].
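
The register definition quoted at the start of this section, a list that aims to be complete for one group of objects, can be sketched as a small data structure. This is a minimal illustration under assumptions: the class names, the identifier format and the `coverage` check are hypothetical and are not taken from the cited guidelines.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegisterEntry:
    identifier: str  # unique key within the register (hypothetical format)
    name: str


@dataclass
class Register:
    """A list that aims to hold every object of one group exactly once."""

    object_type: str
    entries: dict = field(default_factory=dict)

    def add(self, entry):
        # Each object may appear only once, so duplicate identifiers are rejected.
        if entry.identifier in self.entries:
            raise ValueError(f"duplicate identifier: {entry.identifier}")
        self.entries[entry.identifier] = entry

    def coverage(self, known_identifiers):
        """Share of an externally known set of identifiers found in the register."""
        known = set(known_identifiers)
        if not known:
            return 1.0
        return len(known & set(self.entries)) / len(known)


works = Register("music work")
works.add(RegisterEntry("W-001", "Example Work A"))
works.add(RegisterEntry("W-002", "Example Work B"))
print(works.coverage(["W-001", "W-002", "W-003"]))
```

The `coverage` method reflects the word "aims" in the definition: completeness is a target that can be checked against an outside source, here a made-up list of three identifiers of which the register holds two.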

## Data licensing

`bib/datalicensing.bib`

## Dataspaces

A dataspace is an emerging approach to data management which recognises that in large-scale integration scenarios, involving thousands of data sources, it is difficult and expensive to obtain an upfront unifying schema across all sources. Data is integrated on an "as-needed" basis with the labour-intensive aspects of data integration postponed until they are required. Dataspaces reduce the initial effort required to set up data integration by relying on automatic matching and mapping generation techniques. [@curry_dataspaces_2020]
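
The "as-needed" idea can be sketched in a few lines: sources join with their own field names, and a mapping is generated only when a query asks for a field. The example is a simplified assumption, not the method described by Curry: simple string similarity from Python's standard `difflib` module stands in for the automatic matching techniques, and the source names, field names and records are hypothetical.

```python
import difflib

# Hypothetical sources with their own field names; no unified schema is defined upfront.
sources = {
    "label_catalogue": [{"track_title": "Song A", "artist_name": "Band X"}],
    "radio_playlist": [{"Title": "Song A", "Artist": "Band X", "played_at": "2023-05-01"}],
}


def match_field(wanted, record, cutoff=0.5):
    """Return the record's field name that best matches `wanted`, or None."""
    candidates = {name.lower().replace("_", " "): name for name in record}
    hits = difflib.get_close_matches(wanted.lower(), list(candidates), n=1, cutoff=cutoff)
    return candidates[hits[0]] if hits else None


def query(wanted_field):
    """Map and read one field across all sources only when it is asked for."""
    results = {}
    for source_name, records in sources.items():
        matched = match_field(wanted_field, records[0]) if records else None
        if matched is not None:
            results[source_name] = [record[matched] for record in records]
    return results


print(query("title"))
```

No mapping for `played_at` is ever computed unless someone queries for it, which is the postponement of integration effort that the paragraph describes. A real dataspace would need far more robust matching than name similarity.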

Dataspaces can provide an approach to enable data management in smart environments that can help to overcome technical, conceptual, and social/organisational barriers to information sharing. A fundamental requirement for intelligent decision-making within a smart environment is the availability of information about entities and their schemas across multiple data sources and intelligent systems. [@ulHassan_catalog_entity_management_2020]

As the volume and variety of data sources within a dataspace grow, it becomes a semantically heterogeneous and distributed environment; this presents a significant challenge to querying the dataspace. Approaches used for querying siloed databases fail within large dataspaces because users do not have an a priori understanding of all the available datasets. [@freitas_query_heterogeneous_kg_2020]

A dataspace is an emerging data management approach used to tackle heterogeneous data integration in an incremental manner. Data sources that are participants in a dataspace can be of various types such as online services, open datasets, sensors, and smart devices. Given the dynamicity of dataspaces and the diversity of their data sources and user requirements, finding appropriate sources of data can be challenging for users. An approach for organising and indexing data services based on their semantic descriptions and using a feature-oriented model is the application of the Formal Concept Analysis for organising and indexing the descriptions. [@derguech_enhancing_discovery_dataspace_2020]

Humans are playing critical roles in the management of data at large scales, through activities including schema building, matching data elements, resolving conflicts, and ranking results. The application of human-in-the-loop within intelligent systems in smart environments presents challenges in the areas of programming paradigms, execution methods, and task design. This chapter examines current human-in-the-loop approaches for data management tasks, including data integration, data collection (e.g. citizen sensing), and query refinement. [@ulHassan_human-in-the-loop_2020]

## Survey harmonisation

`bib/surveyharmonization.bib`

## Open Music Europe cross-referencing

`bib/OpenMusE.bib`

-   Open Music Europe (OpenMusE) -- An Open, Scalable, Data-to-Policy Pipeline for European Music Ecosystems [@openmuse_2023]
-   Memorandum of Understanding on utilizing the Open Policy Analysis results of the OpenMuse Research and Innovation Consortium in the context of Slovak cultural and creative industries and sectors' public policies [@open_music_europe_sk_mou_2023]
-   *A framework for open policy analysis* [@OPA_framework_2020]

`bib/openmusicrepositories.bib`

-   WP1 Repository: Report on the European Music Economy [@open_music_europe_economy_repository]

-   WP2 Repository: Report on Music Diversity and Circulation in Europe [@open_music_europe_diversity_repository]