Agents centered around Large Language Models (LLMs) are now capable of automating mobile device operations for users. After fine-tuning to learn a user's mobile operations, these agents can adhere to high-level user instructions online. They execute tasks such as goal decomposition, sequencing of sub-goals, and interactive environmental exploration, until the final objective is achieved. However, privacy concerns related to personalized user data arise during mobile operations, requiring user confirmation. Moreover, users' real-world operations are exploratory, with action data being complex and redundant, posing challenges for agent learning. To address these issues, in our practical application, we have designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs. Additionally, we integrated Standard Operating Procedure (SOP) information within the model's in-context learning to enhance the agent's comprehension of complex task execution.
Paper: https://arxiv.org/abs/2401.04124
Model: https://huggingface.co/tineding/ACT-SOP
We surveyed well-known datasets for controlling devices with large models and compared their statistical characteristics. For web control, the Mind2Web dataset is particularly notable; for mobile control, the AitW dataset is the most prominent resource.
- Examples of SOP data related to AIA and AitW scenarios in our paper can be found in the <data> folder.
Dataset/URL | Platform | Human demos | APPs or websites | Task steps | Observation format | Screen features | Real | High-level instruction |
---|---|---|---|---|---|---|---|---|
MiniWoB | web | 100 | 100 | 3.6 | DOM | × | × | × |
WebShop | web | 12,000 | 1 | 11.3 | DOM | × | × | ✔️ |
RUSS | web | 80 | 22 | 5.4 | DOM | × | ✔️ | ✔️ |
Mind2Web | web | 2,350 | 137 | 7.3 | DOM | × | ✔️ | ✔️ |
---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
RicoSCA | Android(apps) | 0 | n/a | 1.0 | VH,screen | × | × | × |
UIBert | Android(apps) | 16,660 | n/a | 1.0 | VH,screen | × | ✔️ | × |
PixelHelp | Android(apps) | 187 | 4 | 4.2 | VH,screen | × | ✔️ | ✔️ |
META-GUI | Android(apps) | 1,125 | 11 | 4.3 | VH,screen | × | ✔️ | ✔️ |
UGIF | Android(apps) | 523 | 12 | 5.3 | VH,screen | × | ✔️ | ✔️ |
MoTIF | Android(apps) | 4,707 | 125 | 4.5 | VH,screen | × | ✔️ | ✔️ |
AitW | Android(apps+web) | 715,142 (5,689,993 examples) | 357+ | 6.5 | screen | ✔️ | ✔️ | ✔️ |
- Mobile
- Android in the Wild (AitW) is a large-scale dataset for mobile device control that contains human-collected demonstrations of natural language instructions, user interface (UI) screens, and actions for a variety of human tasks.
- Web
- Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Mind2Web contains 2,350 tasks from 137 websites spanning 31 domains. These tasks reflect diverse and practical use cases on the web, provide challenging yet realistic environments with real-world websites, and test generalization ability across tasks and environments.
We used the AitW dataset as a standard test set to compare models from different teams, including our own model built on the SOP mechanism.
| Type | Action Entity | Tuning | Research Organization | Model | Overall | General | Install | GoogleApps | Single | WebShopping |
|---|---|---|---|---|---|---|---|---|---|---|
| LLM | Point-Coordinates | No | SJTU | ChatGPT-CoT (5-shot) | 7.72 | 5.93 | 4.38 | 10.47 | 9.39 | 8.42 |
| | | Yes | | PaLM | 39.6 | - | - | - | - | - |
| | | | | Llama2 (1% data) | 28.4 | 28.56 | 35.18 | 30.99 | 27.35 | 19.92 |
| | Enlarged-Frame | | AntFin | Llama 2 | 65.43 | 55.3 | 73.65 | 62.33 | 74.82 | 61.07 |
| | | | | Llama 2+plan | 62.08 | 52.1 | 71.65 | 56.23 | 74.18 | 56.22 |
| | | | | Llama 2+plan+state | 62.86 | 53.77 | 69.1 | 61.19 | 73.51 | 56.74 |
| | | | | Llama 2+SOP | 66.92 | 55.8 | 74.98 | 63.95 | 76.27 | 63.61 |
| Multi-Modal | | No | Microsoft/UC | GPT-4V ZS +text | 50.54 | 41.66 | 42.64 | 49.82 | 72.83 | 45.73 |
| | | | | GPT-4V ZS image-only | 51.92 | 42.44 | 49.18 | 48.26 | 76.34 | 43.35 |
| | | | | GPT-4V ZS +history | 52.96 | 43.01 | 46.14 | 49.18 | 78.29 | 48.18 |
| | Point-Coordinates | Yes | | BC-single | 68.7 | - | - | - | - | - |
| | | | | BC-history | 73.1 | 63.7 | 77.5 | 75.7 | 80.3 | 68.5 |
| | | | SJTU | Auto-UI (separate) | 74.07 | 65.94 | 77.62 | 76.45 | 81.39 | 69.72 |
| | | | | Auto-UI (unified) | 74.27 | 68.24 | 76.89 | 71.37 | 84.58 | 70.26 |
Paper
- Android in the Wild: A Large-Scale Dataset for Android Device Control
- You Only Look at Screens: Multimodal Chain-of-Action Agents
- GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation
We provide basic processing code for AitW so that anyone can train their own large models. The code covers data processing, model training, and model evaluation.
For more information, please refer to AitW and Auto-UI.
- Base enlarged-Frame
> pip install jax jaxlib
> python code/data_process.py --input_dir [aitw_file_path] --output_dir [output_file_path]
-- for example: python code/data_process.py --input_dir /mntnlp/peisu/data/autoui/google_apps_blip_test.obj --output_dir /mntnlp/data/autoui/output_name.json
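As a rough illustration of what this conversion produces, the sketch below renders a step's UI annotations into the text prompt layout used in the example that follows. It is not the code in code/data_process.py: the function names `format_screen`, `build_prompt`, and `format_tap_answer` are hypothetical, and the header sentence is passed in by the caller rather than hard-coded.

```python
from typing import List


def format_screen(ui_text: List[str], ui_type: List[str]) -> str:
    """Render one 'id:<i> ui_text:<text> ui_type:<type>' line per UI element."""
    return "\n".join(
        f"id:{i} ui_text:{text} ui_type:{kind}"
        for i, (text, kind) in enumerate(zip(ui_text, ui_type))
    )


def build_prompt(header: str, previous_actions: List[str], ui_text: List[str],
                 ui_type: List[str], goal: str) -> str:
    """Assemble the prompt: header, action history, screen, instruction."""
    return "\n".join([
        header,
        "Previous Actions:",
        *previous_actions,  # already-formatted 'step_id:... action_type:...' lines
        "Screen:",
        format_screen(ui_text, ui_type),
        f"Instruction:{goal}",
        "Answer:",
    ])


def format_tap_answer(element_id: int, ui_text: List[str], ui_type: List[str]) -> str:
    """Target string for a tap (DUAL_POINT) on the element with the given id."""
    return (f"action_type:DUAL_POINT ui_text:{ui_text[element_id]} "
            f"ui_type:{ui_type[element_id]} id:{element_id}")
```

Calling `build_prompt` with the `ui_text` and `ui_type` lists of a step, plus the formatted history of earlier steps, yields the "instruction" field; `format_tap_answer` yields an "output" string of the same shape as the one in the example.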
For example:
Input:
{
'episode_id': '16150593985172894737',
'data':
[{
'goal': "What's the top post on reddit today?",
'step_id': 1,
'image': tensor([-4.1089e-01, 3.0527e+00, -6.8378e-04, ..., 5.6299e-01,
1.2676e+00, -9.7266e-01], dtype=torch.float16),
'ui_positions': array([[0.05131579, 0.09722222, 0.02171053, 0.27777779],
[0.58684212, 0.31111112, 0.06447368, 0.14166667],
[0.60855263, 0.83333331, 0.025 , 0.02916667],
[0.67171055, 0.77916664, 0.01447368, 0.14166667],
[0.67302632, 0.33888888, 0.01315789, 0.09027778],
[0.67302632, 0.5625 , 0.01315789, 0.10972222],
[0.77302629, 0.12083333, 0.04605263, 0.05555556],
[0.77565789, 0.82499999, 0.04342105, 0.04305556],
[0.77894735, 0.34999999, 0.03618421, 0.05833333],
[0.81776315, 0.40277779, 0.01513158, 0.19583334],
[0.87828946, 0.84027779, 0.0381579 , 0.03472222],
[0.87894738, 0.12083333, 0.03618421, 0.04027778],
[0.95657897, 0.19166666, 0.02894737, 0.02916667],
[0.95657897, 0.48055556, 0.02828947, 0.03194445],
[0.95723683, 0.77083331, 0.02697368, 0.02916667]]),
'ui_text': ['Mon, Oct 10', 'M', '', 'YouTube', 'Gmail', 'Photos', '', '', '', 'Preferences', '', 'G', '', '', ''],
'ui_type': ['TEXT', 'TEXT', 'ICON_PLAY', 'TEXT', 'TEXT', 'TEXT', 'ICON_CALL', 'ICON_LOCATION', 'ICON_CHAT', 'TEXT', 'ICON_MIC',
'ICON_GOOGLE', 'ICON_V_BACKWARD', 'ICON_NAV_BAR_CIRCLE', 'ICON_NAV_BAR_RECT'],
'result_touch_yx': [0.8917460441589355, 0.4927879273891449],
'result_lift_yx': [0.8917460441589355, 0.4927879273891449],
'result_action': ['DUAL_POINT', '']
}]
}
Output:
"id":"2##What's the news in Chile?",
"instruction":
"Given a mobile screen and a question, provide the action based on the screeninformation.
Previous Actions:
step_id:0 action_type:PRESS_HOME
step_id:1 action_type:DUAL_POINT ui_text: ui_type:ICON_MIC
step_id:2 action_type:DUAL_POINT ui_text:abcnews.go.Com ui_type:TEXT
step_id:3 action_type:TYPE typed_text:What's the news in Chile?
step_id:4 action_type:TYPE typed_text:
step_id:5 action_type:PRESS_ENTER
Screen:
id:0 ui_text: ui_type:ICON_HOME
id:1 ui_text: ui_type:ICON_THREE_DOTS
id:2 ui_text:google.com/search?q ui_type:TEXT
id:3 ui_text:Google ui_type:TEXT
id:4 ui_text:= ui_type:ICON_THREE_BARS
id:5 ui_text: ui_type:ICON_MIC
id:6 ui_text:Q ui_type:ICON_MAGNIFYING_GLASS
id:7 ui_text:What's the news in Chile? ui_type:TEXT
id:8 ui_text:Al ui_type:TEXT
id:9 ui_text:News ui_type:TEXT
id:10 ui_text:Images ui_type:TEXT
id:11 ui_text:Videos ui_type:TEXT
id:12 ui_text:Maps ui_type:TEXT
id:13 ui_text: ui_type:ICON_THREE_DOTS
id:14 ui_text:4 ui_type:TEXT
id:15 ui_text:https://www.aljazeera.com> where ui_type:TEXT
id:16 ui_text:Chile | Today's latest from Al ui_type:TEXT
id:17 ui_text:Jazeera ui_type:TEXT
id:18 ui_text:Stay on top of Chile latest developments on ui_type:TEXT
id:19 ui_text:the ground with Al ui_type:TEXT
id:20 ui_text:Jazeera's fact-based ui_type:TEXT
id:21 ui_text:news, ui_type:TEXT
id:22 ui_text:exclusive video footage, ui_type:TEXT
id:23 ui_text:photos ui_type:TEXT
id:24 ui_text:and ui_type:TEXT
id:25 ui_text:updated... ui_type:TEXT
id:26 ui_text: ui_type:ICON_THREE_DOTS
id:27 ui_text:https://www.reuters.com» archive ui_type:TEXT
id:28 ui_text:Chile News Headlines |Reuters ui_type:TEXT
id:29 ui_text:Chile permanently closes ui_type:TEXT
id:30 ui_text:mining areas ui_type:TEXT
id:31 ui_text:Chile files ui_type:TEXT
id:32 ui_text:connected to giant sinkhole ui_type:TEXT
id:33 ui_text:charges against mining company for giant... ui_type:TEXT
id:34 ui_text: ui_type:ICON_THREE_DOTS
id:35 ui_text:https://www.independent.co.uk> topic ui_type:TEXT
id:36 ui_text:Chile ui_type:TEXT
id:37 ui_text:latest news, ui_type:TEXT
id:38 ui_text:breaking ui_type:TEXT
id:39 ui_text:- ui_type:TEXT
id:40 ui_text:stories and comMent - The ui_type:TEXT
id:41 ui_text: ui_type:ICON_NAV_BAR_CIRCLE
id:42 ui_text: ui_type:ICON_NAV_BAR_RECT
id:43 ui_text: ui_type:ICON_V_BACKWARD
Instruction:What's the news in Chile?
Answer:",
"input":"",
"output":"action_type:DUAL_POINT ui_text:Jazeera ui_type:TEXT id:17"
To add the AitW dataset, perform the following steps.
- Create a dataset configuration following the schema described above. Examples can be found in code/llama-recipes-main/src/llama_recipes/configs/datasets.py.
from dataclasses import dataclass

@dataclass
class aitw_dataset:
    dataset: str = "aitw_dataset"
    train_split: str = "/mntnlp/peisu/data/autoui/train.json"
    test_split: str = "/mntnlp/peisu/data/autoui/val.json"
- Create a preprocessing routine that loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function must be (dataset_config, tokenizer, split_name), where split_name is the string for the train/validation split as defined in the dataclass. Examples can be found in code/llama-recipes-main/src/llama_recipes/datasets/aitw_dataset.py and __init__.py.
class aitw(Dataset):
    def __init__(self, tokenizer, json_name=None):
        ...

def get_dataset(dataset_config, tokenizer, json_name=None):
    """cover function for handling loading the working dataset"""
    ...
- Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in code/llama-recipes-main/src/llama_recipes/utils/dataset_utils.py
from llama_recipes.datasets import (
    get_grammar_dataset,
    get_alpaca_dataset,
    get_samsum_dataset,
    get_aitw_dataset,
)

DATASET_PREPROC = {
    "alpaca_dataset": partial(get_alpaca_dataset),
    "grammar_dataset": get_grammar_dataset,
    "samsum_dataset": get_samsum_dataset,
    "custom_dataset": get_custom_dataset,
    "aitw_dataset": get_aitw_dataset,
}
- Set the dataset field in the training config to the dataset name, or use the --dataset option.
--dataset aitw_dataset
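To make the preprocessing step in the list above more concrete, here is a minimal sketch of a map-style dataset over the processed JSON records. It is not the code in aitw_dataset.py. It assumes that the JSON file holds a list of records with "instruction", "input", and "output" keys, that the tokenizer is any callable mapping a string to a list of token ids, and that prompt tokens are masked with -100 (the default ignore index of PyTorch's cross-entropy loss). The names `AitwJsonDataset` and `load_records` are hypothetical, and a real implementation would typically subclass `torch.utils.data.Dataset`.

```python
import json
from typing import Callable, Dict, List


def load_records(path: str) -> List[Dict[str, str]]:
    """Read a JSON file that is assumed to contain a list of record dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


class AitwJsonDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, records: List[Dict[str, str]],
                 tokenize: Callable[[str], List[int]], max_len: int = 2048):
        self.records = records
        self.tokenize = tokenize
        self.max_len = max_len

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int) -> Dict[str, List[int]]:
        rec = self.records[idx]
        prompt_ids = self.tokenize(rec["instruction"] + rec["input"])
        answer_ids = self.tokenize(rec["output"])
        input_ids = (prompt_ids + answer_ids)[: self.max_len]
        # Mask the prompt so that only the answer tokens contribute to the loss.
        labels = ([-100] * len(prompt_ids) + answer_ids)[: self.max_len]
        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": [1] * len(input_ids),
        }


def get_aitw_dataset(dataset_config, tokenize, split_name: str) -> AitwJsonDataset:
    """Pick the JSON path for the requested split from the dataset config."""
    path = dataset_config.train_split if split_name == "train" else dataset_config.test_split
    return AitwJsonDataset(load_records(path), tokenize)
```

The function follows the (dataset_config, tokenizer, split_name) signature described above, so under these assumptions it could be registered in DATASET_PREPROC in the same way as the other datasets.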
Fine-tuning used Llama2-7b. For more information, please refer to llama-recipes.
> cd code && pip install -r llama-recipes-main/requirements.txt
> torchrun --nnodes 1 \
--nproc_per_node 1 llama-recipes-main/examples/finetuning.py \
--enable_fsdp \
--use_peft \
--peft_method lora \
--model_name /mntnlp/common_base_model/llama2-7b \
--dataset aitw_dataset \
--output_dir /mntnlp/tine/temp \
--use_fast_kernels \
--run_validation False \
--batch_size_training 4 \
--num_epochs 10 \
--quantization False
This command uses parameter-efficient fine-tuning (PEFT). When running it, make sure to pass the peft_method argument, which can be set to lora, llama_adapter, or prefix.
Once the model training is complete, the log will output information such as loss, tokenizer, LoRA storage path, and comprehensive training process details.
Epoch 10: train_perplexity=1.3190, train_epoch_loss=0.2768, epoch time 21.28177928365767s
Key: avg_train_prep, Value: 1.6622886657714844
Key: avg_train_loss, Value: 0.49041351675987244
Key: avg_epoch_time, Value: 21.60782462991774
Key: avg_checkpoint_time, Value: 0
> python code/infer_vllm.py --model_dir [llama path] --lora_dir [lora path] --test_file__dir [test file path]
This script loads the LoRA weights into the base model, runs prediction on the test set, and evaluates the results. It reports the number of prompts in the test samples, the number of user task instructions, and the task complete score.
Prompt count: 489436
Task instruction count: 62493
Task complete score : 0.5622314184930866
- Task complete score = Average(Number of Correct Actions / Episode Length)
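The metric can be sketched as follows. This is only an illustration of the formula above, not the evaluation code in code/infer_vllm.py; it assumes each prediction is available as an (episode id, correct flag) pair, and the function name `task_complete_score` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def task_complete_score(step_results: Iterable[Tuple[str, bool]]) -> float:
    """Average over episodes of (correct actions / episode length)."""
    per_episode = defaultdict(lambda: [0, 0])  # episode id -> [correct, total]
    for episode_id, correct in step_results:
        per_episode[episode_id][0] += int(correct)
        per_episode[episode_id][1] += 1
    if not per_episode:
        return 0.0
    ratios = [correct / total for correct, total in per_episode.values()]
    return sum(ratios) / len(ratios)
```

Because the ratio is computed per episode before averaging, a short episode weighs as much as a long one, which differs from plain step-level accuracy over all prompts.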