# Anonymizing the SmartH2O *consumption* and *households* datasets
## Which data should be opened
**Deliverable D8.3 WP8**, in chapter 4.4 *The Open Data Strategy and Policy of SmartH2O*, defines five levels of data openness. We can represent them as a table with three columns:
1. the level number
2. the user's willingness to share data with the project and its partners
3. how that choice affects which data are opened

| Level | User's willingness to share data with the SmartH2O project | Data opened (anonymized) |
|:-----:|---|---|
| 1 | No consumption data shared with the SmartH2O project ![](10.png) | No individual water consumption data will be published as open data ![](11.png) |
| 2 | Consumption data shared only with the project, but no household information revealed to anyone ![](21.png) | No individual water consumption data will be published as open data ![](11.png) |
| 3 | Consumption data and household information shared **only** with the **SmartH2O project** ![](31.png) | No individual water consumption data will be published as open data ![](11.png) |
| 4 | Consumption data **anonymously** shared; household data is kept private. | Only individual household water consumption is made available as Open Data. No information on household features is revealed. ![](41.png) |
| 5 | Full **anonymous** disclosure ![](31.png) | Water consumption and household features are released as Open Data. ![](52.png) |


The table shows that only the last two levels actually open the data. The *lock icon* inside the pictures means that the data have been anonymized.
**Attention: for the 4th level, it is not clear what the user's willingness to share data is.**

 

## About anonymizing a dataset

### Attributes
Attributes are the building blocks of the anonymization process. Formally: given a list of individuals in tabular format, where every row is an individual, we call every column an **attribute**.
Attributes can be:

- *identifiers*, such as the personal ID card number, social security number, or fiscal code.

- *quasi-identifiers*, e.g. birth date, ZIP code, gender.
Taken individually, these attributes do not allow an individual to be identified, but combined and linked to an external dataset they might.

- *sensitive*, such as health status, a diagnosis, a disease, party affiliation, religion, or salary.

- *non-sensitive*, such as level of education, job, or favorite sport.

The SmartH2O datasets, obtained by executing SQL queries, contain *identifier* attributes such as the Smart Meter IDs and the IDs of the database tables (keys and foreign keys), and *quasi-identifiers* such as the number of children, the ownership of the household, and the number of adults.

### Main models and techniques
Several models and techniques exist to anonymize datasets. Models express the privacy guarantee in terms of attributes, while techniques implement the models by working on attribute values through generalization, suppression, and perturbation.
 
The base model is called ***k*-anonymity**: given a set of quasi-identifier attributes, every combination of their values that occurs in the dataset is shared by *at least* k individuals, so those individuals cannot be distinguished from one another on those attributes [Sweeney2002].
 
<p>The following table is 2-anonymous with quasi-identifier set QI = {Race, Birth, Gender, ZIP}: <img src="kanonymity-table.png" alt="2-anonymous table" align="center"></p>
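
A minimal sketch of how the k of a table could be measured in plain Python, assuming the records are available as a list of dictionaries. The function name `k_anonymity` and the toy records are ours, invented for illustration; they are not part of the SmartH2O code.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    `rows` is a list of dicts (one per individual). The result is the size
    of the smallest group of rows sharing the same quasi-identifier values.
    """
    groups = Counter(tuple(row[qi] for qi in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Toy records, invented for illustration (values already generalized).
rows = [
    {"birth": "1965", "gender": "m", "zip": "0214*", "problem": "short breath"},
    {"birth": "1965", "gender": "m", "zip": "0214*", "problem": "chest pain"},
    {"birth": "1964", "gender": "f", "zip": "0213*", "problem": "obesity"},
    {"birth": "1964", "gender": "f", "zip": "0213*", "problem": "chest pain"},
    {"birth": "1964", "gender": "f", "zip": "0213*", "problem": "short breath"},
]

k = k_anonymity(rows, ["birth", "gender", "zip"])
print(k)
```

The smallest group here has two rows, so the toy table is 2-anonymous with respect to the three chosen quasi-identifiers.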


However, k-anonymity cannot protect the privacy of the dataset if the attacker has background knowledge, or if the sensitive attributes lack diversity within a group of records that share the same quasi-identifier values.

To strengthen the k-anonymity model, the ***l*-diversity** model has been proposed [Machanavajjhala2007]: given a k-anonymous dataset and a set of quasi-identifier attributes, there must be at least l distinct values of the sensitive attribute within each equivalence class (the group of records sharing the same quasi-identifier values).
 
<p>Here's a 4-anonymous table with QI = {Zip, Age, Nationality} that is 3-diverse on the <em>condition</em> attribute: <img src="l-diversity-table.png" alt="3-diverse table" align="center"></p>
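
The distinct variant of the definition can be checked with a few lines of Python. This is only a sketch under the assumption that records are dictionaries; `distinct_l_diversity` and the toy records are hypothetical names and values, not taken from the SmartH2O datasets.

```python
from collections import defaultdict


def distinct_l_diversity(rows, quasi_identifiers, sensitive):
    """Return the largest l for which `rows` satisfies distinct l-diversity.

    For every group of rows sharing the same quasi-identifier values we
    collect the distinct values of the sensitive attribute; the result is
    the size of the smallest such collection.
    """
    values_per_group = defaultdict(set)
    for row in rows:
        key = tuple(row[qi] for qi in quasi_identifiers)
        values_per_group[key].add(row[sensitive])
    return min((len(values) for values in values_per_group.values()), default=0)


# Toy records, invented for illustration.
rows = [
    {"zip": "130**", "age": "<30", "condition": "heart disease"},
    {"zip": "130**", "age": "<30", "condition": "viral infection"},
    {"zip": "130**", "age": "<30", "condition": "cancer"},
    {"zip": "148**", "age": ">=40", "condition": "cancer"},
    {"zip": "148**", "age": ">=40", "condition": "cancer"},
    {"zip": "148**", "age": ">=40", "condition": "heart disease"},
]

l_value = distinct_l_diversity(rows, ["zip", "age"], "condition")
print(l_value)
```

The second group contains only two distinct conditions, so the toy table is 2-diverse even though it is 3-anonymous.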

This instance of the l-diversity model is called *Distinct* l-diversity; other instances, based on different criteria, are called *Probabilistic*, *Entropy*, and *Recursive* l-diversity.

The l-diversity model has two major weaknesses: it doesn't take into account the semantics of the sensitive attributes [elabd2015diversity], and it doesn't consider the overall distribution of the sensitive attributes. To address the latter, a new model called ***t*-closeness** has been proposed by [Li-Li-t-Closeness]. *t*-closeness aims to keep the distribution of a sensitive attribute in any equivalence class (a group of records sharing the same quasi-identifier values) close to the attribute's distribution in the whole dataset.
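
To make the idea concrete, the sketch below measures how far each group's distribution is from the overall one. It is an illustration under our own assumptions: the distance used is the total variation distance, swapped in here for simplicity, whereas the t-closeness proposal defines its own distance measure. All function names and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in `values`."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def t_closeness(rows, quasi_identifiers, sensitive):
    """Smallest t such that every group is within distance t of the whole table."""
    overall = distribution([row[sensitive] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[qi] for qi in quasi_identifiers)
        groups[key].append(row[sensitive])
    return max(
        (total_variation(distribution(values), overall) for values in groups.values()),
        default=0.0,
    )


# Toy records, invented for illustration.
rows = [
    {"zip": "130**", "condition": "flu"},
    {"zip": "130**", "condition": "flu"},
    {"zip": "148**", "condition": "flu"},
    {"zip": "148**", "condition": "cancer"},
]

t = t_closeness(rows, ["zip"], "condition")
print(t)
```

A smaller t means that knowing someone's group reveals less about the sensitive attribute than the published table already does as a whole.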

In conclusion, we cite the most recent notion of privacy, called **Differential Privacy**: the privacy risk of an individual should not substantially increase as a result of participating in a statistical database [differential-privacy]. Differential Privacy thus consists of a framework of techniques that keep a database as privacy-safe as possible when an individual is added to it. Differential privacy was also mentioned in Apple's 2016 WWDC keynote, and it is implemented in iOS 10 [WhatisDi37:online].
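
One well-known technique from this framework is the Laplace mechanism, which adds calibrated random noise to the answer of a query. The sketch below applies it to a counting query; it is an illustration only, not something used in the SmartH2O workflow, and the names `laplace_noise`, `noisy_count`, and the consumption values are invented. The standard `random` module is used for brevity and is not meant as a production-grade noise source.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(rows, predicate, epsilon, rng):
    """Count the rows matching `predicate` and add Laplace noise.

    A counting query changes by at most 1 when one individual is added or
    removed (sensitivity 1), hence the noise scale 1 / epsilon.
    """
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
consumptions = [{"litres": v} for v in (80, 120, 150, 210, 95)]  # invented values

answer = noisy_count(consumptions, lambda row: row["litres"] > 100, epsilon=0.5, rng=rng)
print(answer)
```

A smaller `epsilon` means more noise and therefore stronger privacy, at the price of a less accurate answer.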





## The laws and regulations on data protection
The SmartH2O researchers get the consumption dataset from the utility; the data are then anonymized and finally published to open data portals. Because we process personal data, we need to examine the legal framework for the protection of individuals' data. Three distinct jurisdictions are relevant: Switzerland for the Tegna dataset, the European Union for Valencia, and the UK for London (although there is no SmartH2O London dataset yet).

### European Union  
In April 2016 the EU replaced the twenty-year-old _Data Protection Directive_ with the _General Data Protection Regulation_ (GDPR) [GDPR:online]. Officially approved on 14 April 2016, the regulation becomes enforceable on 25 May 2018; we therefore work on the datasets as if the GDPR were already enforced.
The GDPR introduces some key points that are relevant to us:

- **Controllers and Processors** Controllers are those who control the data; in our case the utility is the controller, because we get the data from it. We are the processors, because we process the consumption data. The host of the anonymized dataset could, in our opinion, be considered a controller. Data processors now carry many obligations; the ones that affect SmartH2O's activity are maintaining a written record of processing activities and notifying the controller (the utility) on becoming aware of a personal data breach [Radicalc91:online].
- **Consent** Consent must be explicit for sensitive data and, in general, unambiguous. The data controller must comply with these legal conditions. The SmartH2O researchers, as processors, can work only on data collected under such consent; this has been taken into account in *Deliverable D8.3 WP8*.
- **Right to be Forgotten** It is also known as the right to *Data Erasure*: when a user asks to be erased from the controller's database, the controller must erase her/his personal data, cease any further dissemination, and halt their processing by third parties. The SmartH2O researchers need to be notified by the utility, in order to remove the subject's data from the archive and re-publish a new anonymized version.
- **Large geographical reach** The personal data of those who are in the EU are protected by the GDPR, even if the processors, the controllers, or the data processing are established outside the EU. Article 3 defines the relevant processing activities as "the offering of goods or services" and "the monitoring of their behaviour as far as their behaviour takes place within the Union". As a consequence, a controller located in the UK must comply with the GDPR if its customers are in EU territory.

### Switzerland
Tegna is located in the canton of Ticino, so we have to apply the cantonal data protection law, the _Legge cantonale sulla protezione dei dati personali_ (LPDP) [LPDP]. That legislation applies to public institutions, so it has to be applied by the SUPSI researchers of the SmartH2O project. The federal _Data Protection Law_ (LPD) [LPD] is in force at the same time: having come into force in 1987, it will soon be revised by the Swiss Federal Council to align it with the GDPR [Laprotez8:online]. Article 15 of the LPDP allows personal data to be processed for scientific purposes only if the data have been anonymized first, which is part of our process.

### United Kingdom
The hypothetical SmartH2O London dataset might contain EU citizens living in London, but the GDPR should not be applicable because they are outside the EU. However, because of Brexit, it is not clear what will happen: experts say that, despite Brexit, UK businesses might adopt the GDPR in order to continue doing business with their customers living in the EU [Employer91:online][7tipstop76:online].

### Conclusions
The GDPR becomes enforceable in 2018: we will produce an open data dataset that complies with the new EU data protection law. Switzerland is also going to revise its data protection law to align it with the GDPR: until then, we use the LPDP as the reference.

Here's the workflow applied by the SmartH2O researchers to comply with the GDPR and the LPDP:

- SmartH2O researchers will process only data collected from users who gave their consent.
- Users' data will be anonymized in order to be published to the open data portals: SUPSI's researchers must anonymize the Tegna dataset because the non-anonymized dataset cannot be exported outside Switzerland. Likewise, Valencia's researchers have to anonymize the Valencia dataset inside EU territory.
- In case of a data breach, researchers must report the event to the utility.
- If a user claims the right to be erased from the database, the utility must erase the user's records and notify the researchers, who then process the data again and publish a new dataset without the user's records.
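
The last step of this workflow could look like the sketch below: before the anonymization is run again, the rows of the erased user are removed from both raw tables. This is a hypothetical helper, not part of the project code. It assumes that the utility identifies the user by meter ID and that both tables expose it in a column named `smart_meter_id`; the other column names and all values are invented.

```python
import pandas as pd


def erase_meter(households, consumptions, meter_id, id_column="smart_meter_id"):
    """Return copies of both tables without the rows belonging to `meter_id`."""
    kept_households = households[households[id_column] != meter_id]
    kept_consumptions = consumptions[consumptions[id_column] != meter_id]
    return (kept_households.reset_index(drop=True),
            kept_consumptions.reset_index(drop=True))


# Toy tables, invented for illustration.
households = pd.DataFrame({"smart_meter_id": ["M1", "M2"],
                           "garden_area": [10.0, 0.0]})
consumptions = pd.DataFrame({"smart_meter_id": ["M1", "M2", "M2"],
                             "reading": [1.5, 2.0, 2.4]})

new_households, new_consumptions = erase_meter(households, consumptions, "M2")
print(new_households)
print(new_consumptions)
```

The original tables are left untouched; the returned copies are the input for a fresh anonymization run and a new publication.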

## The anonymization process

### Utility vs Privacy
There is a trade-off between utility and privacy preservation: while privacy models protect sensitive data, they also remove or alter data in the database. With a strong privacy model applied, the utility of the data can be compromised by data loss. In our case, the most important information that the SmartH2O project is opening is water consumption, which is not sensitive information. We also collect information about the family and household features, which are quasi-identifiers, so these data will be treated before release.

### The balanced model
We need to balance the risk of de-anonymization of the SmartH2O dataset against the utility of the data for research. A good compromise is the __removal of all attributes regarding the individuals__: the children segmentation by age, the number of adults, the family's address, the ownership, and the number of pets. All attributes regarding the household building are kept: this allows researchers to work with all the building's features.
In case of re-identification, the SmartH2O anonymized dataset will disclose only the household's features and the water consumption, not any personal data.
This is a simple but effective model, based only on attribute removal.

### Workflow
The following steps describe the data workflow:
1. Consumption and Features are read into pandas DataFrames
2. The meter IDs are remapped to new numerical values, in order to suppress the original IDs
3. The identifier and quasi-identifier attributes discussed above are removed
4. The anonymized datasets and the remapped IDs table are saved as CSV files.
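
Step 2 can be implemented in several ways. The sketch below is one possible variant, not necessarily the one used in the notebook: it shuffles the distinct IDs explicitly before numbering them, so that the new numbers do not depend on the order or the content of the original IDs. The function name `remap_ids` is hypothetical, and the sample IDs only mimic the format of the meter IDs.

```python
import random


def remap_ids(ids, rng):
    """Map every distinct id to a number in 0..n-1, assigned in shuffled order."""
    unique_ids = sorted(set(ids))  # deterministic starting point
    rng.shuffle(unique_ids)        # the numbering no longer follows the original order
    return {original: new for new, original in enumerate(unique_ids)}


rng = random.SystemRandom()  # randomness taken from the operating system
meter_ids = ["CH_AQU_50991555", "CH_AQU_51080356",
             "CH_AQU_50993221", "CH_AQU_50991555"]

mapping = remap_ids(meter_ids, rng)
anonymized = [mapping[meter] for meter in meter_ids]
print(anonymized)
```

Repeated occurrences of the same meter receive the same new number, so the link between the consumption and the household tables is preserved.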

### Software
The anonymization algorithm has been implemented from scratch in Python because of the simplicity of the model. Should we need to apply more complex models such as k-anonymity, there are some tools that we could try. Restricting ourselves to non-commercial, open-source software, we have: μ-ARGUS [muARGUS33:online], UTD anonymization toolbox [UTDAnony85:online], sdcMicro [CRANPack78:online], and ARX [ARXAComp47:online].


## Implementation
The Python distribution used is Anaconda [Anaconda15:online]: actively maintained by Continuum, it is oriented to the open data science field.

### Pre-condition
We have two datasets, already extracted by running SQL queries on the production DB and saved as CSV files.
__Note: the SQL queries must respect the user's privacy flag.__
The two datasets are the household's features and the consumption, logically linked by the Smart Meter ID field.


In [1]:
import sys
print(sys.version)

3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]


In [2]:
import pandas as pd
from pandas import DataFrame, Series
print(pd.__version__)

0.18.1


In [3]:
# attention: all the int columns will be converted to float64 because of the presence of NULL values
# ref.: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

df1 = pd.read_csv('C:/Users/install/Documents/Export_Corrado_12.10.2016.csv')
df2 = pd.read_csv('C:/Users/install/Documents/Export_Corrado_10.10.2016.csv')



In [4]:
# extract the unique meter ids of both datasets and remap them to ordinal values
set1 = set(df1['smart_meter_id'])
set2 = set(df2['meter'])

# frozenset: immutable union of the two sets; its (hashable) values become the dict keys
uniques = frozenset(set1 | set2)

# remap each meter id to its position in the enumeration of the set
#"CH_AQU_50991555"             {"CH_AQU_50991555": 0,
#"CH_AQU_51080356"   ------->  "CH_AQU_51080356": 1,
#"CH_AQU_50993221"             "CH_AQU_50993221": 2}
mapped_meters = {key: idx for (idx, key) in enumerate(uniques)}

# apply the mapping; the result is assigned back to the column instead of
# relying on an in-place change of a column selection
df1['smart_meter_id'] = df1['smart_meter_id'].replace(mapped_meters)
df2['meter'] = df2['meter'].replace(mapped_meters)
# use the same meter id column name in both datasets
df2.rename(columns={'meter':'smart_meter_id'},inplace=True)

In [5]:
# drop the identifier attribute columns (database keys)
identifiers=['sm_oid','hs_smart_meter_oid','hs_building_oid','bld_oid'
             ,'dst_oid','cdi_hs_oid','sdi_hs_oid']
df1 = df1.drop(identifiers,axis=1)
# drop the quasi-identifier attribute columns
quasi_identifiers = ['irrigation_system.1','ownership','number_pets','second',
                     'children9','children5_9','children0_4','number_adults',
                     'address','zipcode','city','country']
df1 = df1.drop(quasi_identifiers,axis=1)

In [6]:
# prepare the mapping table (original id, new id), sorted by new id, as CSV text
sorted_by_keys = sorted(mapped_meters.items(), key=lambda x: x[1])
CSV ="\n".join([str(k)+","+str(v) for (k,v) in sorted_by_keys])

In [7]:
df1.to_csv('C:/Users/install/Documents/smarth2o_anonymized_households.csv',
           encoding='utf-8',na_rep="NULL",index_label="index")
df2.to_csv('C:/Users/install/Documents/smarth2o_anonymized_consumptions.csv',
           encoding='utf-8',na_rep="NULL",index_label="index")
with open('C:/Users/install/Documents/smarth2o_anonymized_mapping_meters.csv', "w") as f:
    f.write(CSV)
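
Before publishing, a quick sanity check on the output tables might be useful. The sketch below is a suggestion under our own assumptions, not part of the original notebook: `leftover_columns` and `leaked_ids` are hypothetical helpers, the `garden_area` column is invented, and the tables are small in-memory examples rather than the real CSV files.

```python
import pandas as pd


def leftover_columns(df, forbidden):
    """Return the forbidden column names that are still present in `df`."""
    return sorted(set(df.columns) & set(forbidden))


def leaked_ids(df, mapping, id_column="smart_meter_id"):
    """Return the original meter ids (keys of `mapping`) still present in `df`."""
    return sorted(set(df[id_column]) & set(mapping))


# Toy mapping and table, invented for illustration.
mapping = {"CH_AQU_50991555": 0, "CH_AQU_51080356": 1}
published = pd.DataFrame({"smart_meter_id": [0, 1],
                          "number_adults": [2, 3],
                          "garden_area": [10.0, 0.0]})

print(leftover_columns(published, ["number_adults", "address"]))  # columns that should have been dropped
print(leaked_ids(published, mapping))                             # original ids that survived the remapping
```

Both helpers return an empty list when the table is ready to be released with respect to these two checks.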

## References
Due to an issue, the references have to be rendered in a separate PDF.
Here is the rendering:

In [3]:
from IPython.display import IFrame
IFrame("smarth2o-anonymizer-refs.pdf", width=800, height=600)