# Generate Specifications from Scenarios and Properties

This notebook generates a list of actions that augment each user scenario, producing a specification of system behavior in which a legal property is or is not satisfied. The action lists are self-prompted, which illustrates how language models can be used to brainstorm design alternatives.

Large language models may hallucinate, or they may generate completions that conform to the viewpoint and opinion expressed in the prompt, a behavior called sycophancy. It is therefore important to review the generated specifications for inaccurate statements, misleading statements, and incompleteness.

Finally, the specifications are divided into balanced training and testing datasets.

In [1]:
import json

ds_name = 'apple_app'
#ds_name = 'google_play'

# load the sampled scenarios; the context manager closes the file after reading
with open('scenarios/%s_sample.json' % ds_name, 'r') as f:
    dataset = json.load(f)
print('Read %i scenarios.' % len(dataset))
dataset[1]

Read 200 scenarios.


"The user needs to manage their electricity and gas accounts efficiently. They want to easily submit meter readings, pay bills, and view their account statements. They also want to receive notifications for bill arrivals and meter reading periods. Additionally, they need to set up a designated payer to help with bill payments. The app allows the user to input the designated payer's information and billing details for this purpose."

In [2]:
from langchain.chat_models import ChatOpenAI
import os

model = ChatOpenAI(
    openai_api_key = os.environ["OPENAI_API_KEY"],
    model_name='gpt-3.5-turbo'
)

## Define the Legal Properties

The following legal properties are defined and used to generate both compliant and non-compliant specifications.

In [3]:
properties = {
    'P': {
        'property': 'Power Imbalance',
        'rubric': 'Power imbalance generally occurs when the data controller is a public authority or employer, although other cases may arise. For consent to be freely given in the presence of a power imbalance, the controller must demonstrate that there is no detriment when consent is refused or later withdrawn. Recital 43 clearly indicates that it is unlikely that public authorities can rely on consent for processing as whenever the controller is a public authority, there is often a clear imbalance of power in the relationship between the controller and the data subject. There may be situations when it is possible for the employer to demonstrate that consent actually is freely given. Given the imbalance of power between an employer and its staff members, employees can only give free consent in exceptional circumstances, when it will have no adverse consequences at all whether or not they give consent. In addition, Article 88 and Recital 155 describe the need to protect employee’s interests in order to avoid a power imbalance.',
        'axiom': {
            'T': 'there is a power imbalance between the data subject and the data controller',
            'F': 'there is no power imbalance between the data subject and the data controller'
        }
    },
    'C': {
        'property': 'Conditionality',
        'rubric': 'If the purpose for processing a data type is bundled with other contract terms, or if the data subject is otherwise compelled to consent, then it is conditional and is not freely given. Conditionality only applies if the requested data is unnecessary to perform the contract. Contracts include end user agreements, terms of use, and terms and conditions. Article 7(4) GDPR indicates that, inter alia, the situation of “bundling” consent with acceptance of terms or conditions, or “tying” the provision of a contract or a service to a request for consent to process personal data that are not necessary for the performance of that contract or service, is considered highly undesirable.” Par 32. “Article 7(4) is only relevant where the requested data are not necessary for the performance of the contract, (including the provision of a service), and the performance of that contract is made conditional on the obtaining of these data on the basis of consent. Conversely, if processing is necessary to perform the contract (including to provide a service), then Article 7(4) does not apply.',
        'axiom': {
            'T': 'the data subject is compelled to consent or the purpose for data processing is bundled with other contract terms, such as user agreements, terms of use, or terms and conditions', 
            'F': 'the data subject is not compelled to consent and the purpose for data processing is not bundled with other contract terms, such as user agreements, terms of use, or terms and conditions'
            #'T': 'data processing requires the data subject to accept terms and conditions', 
            #'F': 'data processing does not require the data subject to accept terms and conditions'
        }
    },
    'G': {
        'property': 'Granularity',
        'rubric': 'Data subjects should be free to choose which purpose they accept, rather than having to consent to a bundle of processing purposes. Recital 43 clarifies that consent is presumed not to be freely given if the process/procedure for obtaining consent does not allow data subjects to give separate consent for personal data processing operations respectively (e.g. only for some processing operations and not for others) despite it being appropriate in the individual case. Recital 32 states, “Consent should cover all processing activities carried out for the same purpose or purposes. When the processing has multiple purposes, consent should be given for all of them.',
        'axiom': {
            'T': 'the data subject can choose which data processing purposes they accept',
            'F': 'the data subject cannot choose which data processing purposes they accept'
        }
    },
    'D': {
        'property': 'Detriment',
        'rubric': 'The controller needs to demonstrate that it is possible to refuse or withdraw consent without detriment, including no deception, intimidation, coercion or significant negative consequences. Gray Area: permissible incentives, which means a controller can use an incentive that is only obtainable if the data subject consents. This incentive is not viewed as a detriment to refusing to consent. Refusal to consent or withdrawal should not lead to a diminished product or service. The controller needs to demonstrate that it is possible to refuse or withdraw consent without detriment ([see Recital 42]). For example, the controller needs to prove that withdrawing consent does not lead to any costs for the data subject and thus no clear disadvantage for those withdrawing consent.',
        'axiom': {
            'T': 'the data subject can withdraw consent and incur no detriment',
            'F': 'the data subject may incur detriment if they withdraw consent'
        }
    },
    'S': {
        'property': 'Specificity',
        'rubric': 'The processing of data is limited to specific purposes and will not be processed for other purposes, the consent is granular, and the information presented to obtain consent describes the consent and not other unrelated matters. Article 6(1)(a) confirms that the consent of the data subject must be given in relation to “one or more specific” purposes and that a data subject has a choice in relation to each of them… In sum, to comply with the element of "specific" the controller must apply: i. Purpose specification as a safeguard against function creep, ii. Granularity in consent requests, and iii. Clear separation of information related to obtaining consent for data processing activities from information about other matters.',
        'axiom': {
            'T': 'data processing is limited to specific purposes',
            'F': 'data processing is not limited to specific purposes'
        }
    },
    'I': {
        'property': 'Informed',
        'rubric': 'A design description must indicate that a data subject is informed prior to the collection of their data, and at minimum[9] identify (a) the data controller’s identity, (b) the purpose of each processing operation, (c) what type(s) of data will be collected and used, (d) the existence of the right to withdraw consent, (e) information about the use of the data for automated processing, and (f) about the risks due to transfers to countries without adequacy decisions or safeguards. Based on Article 5 of the GDPR, the requirement for transparency is one of the fundamental principles, closely related to the principles of fairness and lawfulness. Providing information to data subjects prior to obtaining their consent is essential in order to enable them to make informed decisions, understand what they are agreeing to, and for example exercise their right to withdraw their consent. For consent to be informed, it is necessary to inform the data subject of certain elements that are crucial to make a choice. Therefore, the EDPB is of the opinion that at least the following information is required for obtaining valid consent: i. the controller’s identity, ii. the purpose of each of the processing operations for which consent is sought, iii. what (type of) data will be collected and used, iv. the existence of the right to withdraw consent, v. information about the use of the data for automated decision-making in accordance with Article 22 (2)(c) where relevant, and on the possible risks of data transfers due to absence of an adequacy decision and of appropriate safeguards as described in Article 46.',
        'axiom': {
            'T': 'the data subject is properly informed prior to the collection of their data',
            'F': 'the data subject is not properly informed prior to the collection of their data'
        }
    },
    'U': {
        'property': 'Unambiguous',
        'rubric': 'Consent must be provided through a clear, affirmative action, which may be a written, oral or electronic means. Article 2(h) of Directive 95/46/EC described consent as an “indication of wishes by which the data subject signifies his agreement to personal data relating to him being processed”. Article 4(11) GDPR builds on this definition, by clarifying that valid consent requires an unambiguous indication by means of a statement or by a clear affirmative action, in line with previous guidance issued by the WP29. A “clear affirmative act” means that the data subject must have taken a deliberate action to consent to the particular processing. Recital 32 sets out additional guidance on this. Consent can be collected through a written or (a recorded) oral statement, including by electronic means.',
        'axiom': {
            'T': 'consent is provided through a clear, affirmative action by the data subject',
            'F': 'consent is not provided through a clear, affirmative action by the data subject'
        }
    },
    'W': {
        'property': 'Withdrawal',
        'rubric': 'The data subject can withdraw consent as easily as they gave it, and at any given time. Article 7(3) of the GDPR prescribes that the controller must ensure that consent can be withdrawn by the data subject as easy as giving consent and at any given time. The GDPR does not say that giving and withdrawing consent must always be done through the same action.”, “However, when consent is obtained via electronic means through only one mouse-click, swipe, or keystroke, data subjects must, in practice, be able to withdraw that consent equally as easily.',
        'axiom': {
            'T': 'the data subject can withdraw consent as easily as they gave it and at any time',
            'F': 'the data subject cannot withdraw consent as easily as they gave it'
        }   
    }
}

# save the property definitions; the context manager flushes and closes the file
with open('results/properties.json', 'w') as f:
    json.dump(properties, f)
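
Because every prompt is built from the `property`, `rubric`, and `axiom` fields, a malformed entry would only surface later as a `KeyError` in the middle of a generation run. The following is a minimal sketch of a structural check, not part of the original notebook; `validate_properties`, `REQUIRED_KEYS`, and the toy `sample` dictionary are hypothetical names introduced for illustration.

```python
# Hypothetical helper: check that every property entry has the expected shape.
REQUIRED_KEYS = {'property', 'rubric', 'axiom'}


def validate_properties(props):
    """Return a list of problems; an empty list means the dict is well-formed."""
    problems = []
    for code, entry in props.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append('%s: missing keys %s' % (code, sorted(missing)))
            continue
        axiom = entry['axiom']
        if set(axiom.keys()) != {'T', 'F'}:
            problems.append('%s: axiom must have exactly the states T and F' % code)
        elif axiom['T'].strip() == axiom['F'].strip():
            problems.append('%s: T and F axioms are identical' % code)
    return problems


# Toy input standing in for the real `properties` dictionary (assumption).
sample = {
    'W': {
        'property': 'Withdrawal',
        'rubric': 'The data subject can withdraw consent as easily as they gave it.',
        'axiom': {
            'T': 'the data subject can withdraw consent as easily as they gave it and at any time',
            'F': 'the data subject cannot withdraw consent as easily as they gave it',
        },
    },
    'X': {'property': 'Broken entry', 'rubric': 'No axiom is defined here.'},
}

problems = validate_properties(sample)
print(problems)
```

In the notebook itself, the same function could be called as `validate_properties(properties)` before the file is written.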

In [4]:
import random

# generate a random order of properties and property states
prop_list = []

case_count = len(properties) * 2
for i in range(0, len(dataset) - (len(dataset) % case_count), case_count):
    for key in properties.keys():
        prop_list.append([key, 'T'])
        prop_list.append([key, 'F'])

random.shuffle(prop_list)
print('Generated list of properties and states: %i' % len(prop_list))

Generated list of properties and states: 192
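
The count of 192 follows from the arithmetic in the loop: 8 properties in 2 states give 16 cases per round, and 200 scenarios hold 12 full rounds, so the last 8 scenarios are left unused. The sketch below re-creates that logic in a standalone function and confirms that every (property, state) pair appears equally often. `build_prop_list` and the `seed` parameter are hypothetical additions for illustration and reproducibility; the notebook cell above shuffles without a seed.

```python
import random
from collections import Counter


def build_prop_list(prop_codes, scenario_count, seed=None):
    """Build a shuffled list of [code, state] pairs, one per usable scenario."""
    case_count = len(prop_codes) * 2        # each property in states T and F
    rounds = scenario_count // case_count   # full rounds that fit in the dataset
    pairs = [
        [code, state]
        for _ in range(rounds)
        for code in prop_codes
        for state in ('T', 'F')
    ]
    # A dedicated Random instance keeps the shuffle reproducible (assumption:
    # reproducibility is wanted; the original cell uses the global generator).
    random.Random(seed).shuffle(pairs)
    return pairs


codes = ['P', 'C', 'G', 'D', 'S', 'I', 'U', 'W']
pairs = build_prop_list(codes, scenario_count=200, seed=0)

# Count how often each (property, state) combination occurs.
counts = Counter((code, state) for code, state in pairs)
print(len(pairs), set(counts.values()))  # prints: 192 {12}
```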


In [5]:
from langchain.prompts.chat import ChatPromptTemplate

prompt1 = ChatPromptTemplate.from_messages([
    ('system', 'You are a helpful assistant.'),
    ('human', """Carefully read the definition, and extend the specification to describe actions by the app and user that cause "{property}" to be {state}. Ensure each action in the extension supports that "{axiom}" is {state}. Do not refer directly to "{property}" in your response.
    
Definition of {property}: {definition}
    
Specification: {specification}

Actions: """)
])

store_code = {
    'apple_app': 'A',
    'google_play': 'G'
}
if ds_name not in store_code:
    store_code[ds_name] = 'X'

chain = prompt1 | model

augmented = []
for i, [prop, prop_state] in enumerate(prop_list):
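    # zero-pad the index to four digits so every ID has the same width;
    # the earlier ('000' + str(i)).rjust(3) never padded, because the string was already 4+ characters
    spec_id = 'S-' + store_code[ds_name] + str(i).zfill(4)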
    response = chain.invoke({
        'definition': properties[prop]['rubric'],
        'specification': dataset[i], 
        'property': properties[prop]['property'],
        'axiom': properties[prop]['axiom'][prop_state],
        'state': 'true' if prop_state == 'T' else 'false'
    })
        
    augmented.append({
        'id': spec_id, 'base-spec': dataset[i], 'prop-actions': response.content, 'prop-code': prop, 'prop-state': prop_state})
    
    print(i)

0
1
2
...
191
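
Before spending API calls on all 192 prompts, it can help to preview exactly what text one property and state produce. The sketch below is a plain-string approximation of the human message in `prompt1` and makes no model call; `TEMPLATE`, `render_prompt`, and `sample_entry` are hypothetical names, and the whitespace between sections is normalized rather than copied exactly from the notebook template.

```python
# Plain-string stand-in for the human message in `prompt1` (approximation).
TEMPLATE = (
    'Carefully read the definition, and extend the specification to describe actions '
    'by the app and user that cause "{property}" to be {state}. Ensure each action in '
    'the extension supports that "{axiom}" is {state}. Do not refer directly to '
    '"{property}" in your response.\n\n'
    'Definition of {property}: {definition}\n\n'
    'Specification: {specification}\n\n'
    'Actions: '
)


def render_prompt(entry, prop_state, specification):
    """Fill the template the same way the generation loop fills its variables."""
    return TEMPLATE.format(
        property=entry['property'],
        definition=entry['rubric'],
        axiom=entry['axiom'][prop_state],
        state='true' if prop_state == 'T' else 'false',
        specification=specification,
    )


# Toy property entry and scenario, for illustration only.
sample_entry = {
    'property': 'Withdrawal',
    'rubric': 'The data subject can withdraw consent as easily as they gave it.',
    'axiom': {
        'T': 'the data subject can withdraw consent as easily as they gave it and at any time',
        'F': 'the data subject cannot withdraw consent as easily as they gave it',
    },
}

preview = render_prompt(sample_entry, 'F', 'The user opens the app and signs in.')
print(preview)
```

Reading a few rendered prompts for both states is a cheap way to catch wording problems in an axiom before they propagate into every generated specification.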


In [6]:
augmented[0]

{'id': 'S-A0000',
 'base-spec': "A user who is a pilot needs to check the weather conditions for their upcoming flight. They open the app and easily access visualized METAR and TAF weather information for the airport they will be departing from. The user appreciates the easy-to-understand metrics and weather symbols provided by the app, which help them make informed decisions about their flight plan. They also find the crosswind calculations and flight rule categories helpful. After customizing the units to their preference, the user listens to the voice synthesis reading of the weather details and notes the sunrise and sunset times for their journey. The app utilizes the user's location and weather preferences to deliver accurate and personalized weather updates, making it a valuable tool for aviation weather forecasting.",
 'prop-actions': '1. The app allows the user to easily customize their consent preferences for data processing related to weather updates by providing a dedicated 

In [7]:
# save the augmented specifications; the context manager flushes and closes the file
with open('scenarios/%s_augmented.json' % ds_name, 'w') as f:
    json.dump(augmented, f)
print('Wrote %i augmentations.' % len(augmented))

Wrote 192 augmentations.


## Divide the Dataset into Training and Testing

The dataset contains multiple specifications per property, so a balanced split should guarantee that each property is evenly represented in both the training and testing divisions. Random sampling alone can closely approximate this distribution, but it does not guarantee the balance.

We first randomize the dataset order, then group the specifications by property, and finally split each group at the training size cutoff (e.g., 20%), so that every property contributes the same share to each division.

In [8]:
import json, random

# combine and randomize the dataset order
with open('scenarios/apple_app_augmented.json', 'r') as f:
    dataset1 = json.load(f)
with open('scenarios/google_play_augmented.json', 'r') as f:
    dataset2 = json.load(f)
dataset = dataset1 + dataset2
random.shuffle(dataset)

# group the specifications by prop-code (insertion order keeps the shuffled order within each group)
indexed = {prop_code:[] for prop_code in properties.keys()}
for data in dataset:
    indexed[data['prop-code']].append(data)

training_size = 0.20
training = []
testing = []

print('Property'.ljust(10) + 'Total'.ljust(10) + 'Train'.ljust(10) + 'Test'.ljust(10))

for prop_code, spec_list in indexed.items():
    spec_count = round(len(spec_list) * training_size)
    training.extend(spec_list[:spec_count])
    testing.extend(spec_list[spec_count:])
    
    print(prop_code.ljust(10) + str(len(spec_list)).ljust(10) + str(spec_count).ljust(10) + str(len(spec_list) - spec_count).ljust(10))
    
print('\nTraining Percent: %0.3f' % training_size)
print('Train Size: %i' % len(training))
print('Test Size: %i' % len(testing))

with open('results/training.json', 'w') as f:
    json.dump(training, f)
with open('results/testing.json', 'w') as f:
    json.dump(testing, f)

Property  Total     Train     Test      
P         48        10        38        
C         48        10        38        
G         48        10        38        
D         48        10        38        
S         48        10        38        
I         48        10        38        
U         48        10        38        
W         48        10        38        

Training Percent: 0.200
Train Size: 80
Test Size: 304
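
The shuffle, group, and cut procedure can be packaged as a small reusable function. The sketch below is one possible formulation under stated assumptions, not the notebook's own code: `stratified_split`, its `key` and `seed` parameters, and the synthetic `records` are hypothetical. The synthetic data mirrors the shape of the table above (48 specifications per property, half in each state).

```python
import random
from collections import defaultdict


def stratified_split(records, key, train_fraction, seed=None):
    """Split records so each group defined by `key` gives the same share to training."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)

    # Group after shuffling, so the order inside each group is random.
    groups = defaultdict(list)
    for record in shuffled:
        groups[key(record)].append(record)

    training, testing = [], []
    for members in groups.values():
        cutoff = round(len(members) * train_fraction)
        training.extend(members[:cutoff])
        testing.extend(members[cutoff:])
    return training, testing


# Synthetic records shaped like the augmented specifications (assumption).
records = [
    {'id': '%s-%s-%02d' % (code, state, n), 'prop-code': code, 'prop-state': state}
    for code in 'PCGDSIUW'
    for state in 'TF'
    for n in range(24)
]

# Same grouping as the notebook cell: by property code only.
train, test = stratified_split(
    records, key=lambda r: r['prop-code'], train_fraction=0.20, seed=0)
print(len(train), len(test))  # prints: 80 304

# Variant: group by property code and state, which also balances T and F.
train_ps, test_ps = stratified_split(
    records, key=lambda r: (r['prop-code'], r['prop-state']),
    train_fraction=0.20, seed=0)
print(len(train_ps), len(test_ps))  # prints: 80 304
```

Grouping by property alone balances the properties but leaves the ratio of true to false states in each division to chance. Passing a key that includes `prop-state`, as in the second call, is one way to balance both, if that matters for the downstream evaluation.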
