MobileAgent

1. Introduction

Agents centered around Large Language Models (LLMs) are now capable of automating mobile device operations for users. After fine-tuning to learn a user's mobile operations, these agents can follow high-level user instructions online, executing tasks such as goal decomposition, sequencing of sub-goals, and interactive environmental exploration until the final objective is achieved. However, mobile operations touch personalized user data and therefore raise privacy concerns that require user confirmation. Moreover, users' real-world operations are exploratory, and the resulting action data is complex and redundant, which makes agent learning difficult. To address these issues, in our practical application we designed interactive tasks between the agent and the user to identify sensitive information and align with personalized user needs. In addition, we integrated Standard Operating Procedure (SOP) information into the model's in-context learning to enhance the agent's comprehension of complex task execution.

Paper: https://arxiv.org/abs/2401.04124

Model: https://huggingface.co/tineding/ACT-SOP
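
The SOP mechanism works at the prompt level: a short textual procedure for the task family is placed in the model's context alongside the screen observation, so the agent can ground each low-level action in a named step. The sketch below only illustrates the idea; the SOP wording, helper function, and serialization are hypothetical and not taken from the released code or data.

    # Hypothetical sketch of assembling an SOP-augmented prompt.
    # The SOP text and the screen/history serialization are illustrative only.
    def build_sop_prompt(goal, sop_steps, screen_text, history):
        sop_block = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(sop_steps))
        history_block = "\n".join(history) if history else "None"
        return (
            "Given a mobile screen and a question, provide the action based on the screen information.\n\n"
            f"Standard Operating Procedure:\n{sop_block}\n\n"
            f"Previous Actions:\n{history_block}\n\n"
            f"Screen:\n{screen_text}\n\n"
            f"Instruction:{goal}\nAnswer:"
        )

    prompt = build_sop_prompt(
        goal="What's the news in Chile?",
        sop_steps=["Open the browser", "Search for the query", "Open a relevant result"],
        screen_text="id:0 ui_text:Google ui_type:TEXT",
        history=["step_id:0 action_type:PRESS_HOME"],
    )
    print(prompt)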

2. Data

We surveyed well-known datasets and their statistics in the domain of LLM-driven device control. For web control, the Mind2Web dataset is particularly notable; for mobile control, the AitW dataset stands out as a prominent resource.

  • Examples of SOP data related to AIA and AitW scenarios in our paper can be found in the <data> folder.
| Dataset/URL | Platform | Human demos | Apps or websites | Task steps | Observation format | Screen features | Real | High-level instruction |
|---|---|---|---|---|---|---|---|---|
| MiniWoB | web | 100 | 100 | 3.6 | DOM | × | × | × |
| WebShop | web | 12,000 | 1 | 11.3 | DOM | × | × | ✔️ |
| RUSS | web | 80 | 22 | 5.4 | DOM | × | ✔️ | ✔️ |
| Mind2Web | web | 2,350 | 137 | 7.3 | DOM | × | ✔️ | ✔️ |
| RicoSCA | Android (apps) | 0 | n/a | 1.0 | VH, screen | × | × | × |
| UIBert | Android (apps) | 16,660 | n/a | 1.0 | VH, screen | × | ✔️ | × |
| PixelHelp | Android (apps) | 187 | 4 | 4.2 | VH, screen | × | ✔️ | ✔️ |
| META-GUI | Android (apps) | 1,125 | 11 | 4.3 | VH, screen | × | ✔️ | ✔️ |
| UGIF | Android (apps) | 523 | 12 | 5.3 | VH, screen | × | ✔️ | ✔️ |
| MoTIF | Android (apps) | 4,707 | 125 | 4.5 | VH, screen | × | ✔️ | ✔️ |
| AitW | Android (apps+web) | 715,142 (5,689,993 examples) | 357+ | 6.5 | screen | ✔️ | ✔️ | ✔️ |
  • Mobile
    • Android in the Wild (AitW) is a large-scale dataset for mobile device control that contains human-collected demonstrations of natural language instructions, user interface (UI) screens, and actions for a variety of human tasks.
  • Web
    • Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Mind2Web contains 2,350 tasks from 137 websites spanning 31 domains that reflect diverse and practical use cases on the web, provide challenging yet realistic environments with real-world websites, and test generalization ability across tasks and environments.

3. Dashboard for AitW

We use the AitW dataset as a standard test set to evaluate models from various teams, including our model built on the SOP mechanism.

| Type | Action Entity | Tuning | Research Organization | Model | Overall | General | Install | GoogleApps | Single | WebShopping |
|---|---|---|---|---|---|---|---|---|---|---|
| LLM | Point-Coordinates | No | SJTU | ChatGPT-CoT (5-shot) | 7.72 | 5.93 | 4.38 | 10.47 | 9.39 | 8.42 |
| LLM | Point-Coordinates | Yes | - | PaLM | 39.6 | - | - | - | - | - |
| LLM | Point-Coordinates | Yes | - | Llama2 (1% data) | 28.4 | 28.56 | 35.18 | 30.99 | 27.35 | 19.92 |
| LLM | Enlarged-Frame | Yes | AntFin | Llama 2 | 65.43 | 55.3 | 73.65 | 62.33 | 74.82 | 61.07 |
| LLM | Enlarged-Frame | Yes | AntFin | Llama 2+plan | 62.08 | 52.1 | 71.65 | 56.23 | 74.18 | 56.22 |
| LLM | Enlarged-Frame | Yes | AntFin | Llama 2+plan+state | 62.86 | 53.77 | 69.1 | 61.19 | 73.51 | 56.74 |
| LLM | Enlarged-Frame | Yes | AntFin | Llama 2+SOP | 66.92 | 55.8 | 74.98 | 63.95 | 76.27 | 63.61 |
| Multi-Modal | - | No | Microsoft/UC | GPT-4V ZS +text | 50.54 | 41.66 | 42.64 | 49.82 | 72.83 | 45.73 |
| Multi-Modal | - | No | Microsoft/UC | GPT-4V ZS image-only | 51.92 | 42.44 | 49.18 | 48.26 | 76.34 | 43.35 |
| Multi-Modal | - | No | Microsoft/UC | GPT-4V ZS +history | 52.96 | 43.01 | 46.14 | 49.18 | 78.29 | 48.18 |
| Multi-Modal | Point-Coordinates | Yes | Google | BC-single | 68.7 | - | - | - | - | - |
| Multi-Modal | Point-Coordinates | Yes | Google | BC-history | 73.1 | 63.7 | 77.5 | 75.7 | 80.3 | 68.5 |
| Multi-Modal | Point-Coordinates | Yes | SJTU | Auto-UI separate | 74.07 | 65.94 | 77.62 | 76.45 | 81.39 | 69.72 |
| Multi-Modal | Point-Coordinates | Yes | SJTU | Auto-UI unified | 74.27 | 68.24 | 76.89 | 71.37 | 84.58 | 70.26 |


4. Our Code

We provide basic processing code for AitW so that everyone can train their own large models. The code covers data processing, model training, and model evaluation.

Mobile-AITW

Data processing

For more information, please refer to AitW and Auto-UI.

  • Base Enlarged-Frame
> pip install jax jaxlib
> python code/data_process.py --input_dir [aitw_file_path] --output_dir [output_file_path]
# For example: python code/data_process.py --input_dir /mntnlp/peisu/data/autoui/google_apps_blip_test.obj --output_dir /mntnlp/data/autoui/output_name.json

For example:

Input:

{
	'episode_id': '16150593985172894737', 
	'data': 
			[{
			'goal': "What's the top post on reddit today?", 
			'step_id': 1, 
			'image': tensor([-4.1089e-01,  3.0527e+00, -6.8378e-04,  ...,  5.6299e-01,
         					1.2676e+00, -9.7266e-01], dtype=torch.float16), 
         	'ui_positions': array([[0.05131579, 0.09722222, 0.02171053, 0.27777779],
					       [0.58684212, 0.31111112, 0.06447368, 0.14166667],
					       [0.60855263, 0.83333331, 0.025     , 0.02916667],
					       [0.67171055, 0.77916664, 0.01447368, 0.14166667],
					       [0.67302632, 0.33888888, 0.01315789, 0.09027778],
					       [0.67302632, 0.5625    , 0.01315789, 0.10972222],
					       [0.77302629, 0.12083333, 0.04605263, 0.05555556],
					       [0.77565789, 0.82499999, 0.04342105, 0.04305556],
					       [0.77894735, 0.34999999, 0.03618421, 0.05833333],
					       [0.81776315, 0.40277779, 0.01513158, 0.19583334],
					       [0.87828946, 0.84027779, 0.0381579 , 0.03472222],
					       [0.87894738, 0.12083333, 0.03618421, 0.04027778],
					       [0.95657897, 0.19166666, 0.02894737, 0.02916667],
					       [0.95657897, 0.48055556, 0.02828947, 0.03194445],
					       [0.95723683, 0.77083331, 0.02697368, 0.02916667]]), 
			'ui_text': ['Mon, Oct 10', 'M', '', 'YouTube', 'Gmail', 'Photos', '', '', '', 'Preferences', '', 'G', '', '', ''], 
			'ui_type': ['TEXT', 'TEXT', 'ICON_PLAY', 'TEXT', 'TEXT', 'TEXT', 'ICON_CALL', 'ICON_LOCATION', 'ICON_CHAT', 'TEXT', 'ICON_MIC', 
                        'ICON_GOOGLE', 'ICON_V_BACKWARD', 'ICON_NAV_BAR_CIRCLE', 'ICON_NAV_BAR_RECT'], 
			'result_touch_yx': [0.8917460441589355, 0.4927879273891449], 
            'result_lift_yx': [0.8917460441589355, 0.4927879273891449], 
			'result_action': ['DUAL_POINT', '']
			}]
}

Output:

"id":"2##What's the news in Chile?",

"instruction":
"Given a mobile screen and a question, provide the action based on the screeninformation.

Previous Actions:
step_id:0 action_type:PRESS_HOME
step_id:1 action_type:DUAL_POINT ui_text: ui_type:ICON_MIC
step_id:2 action_type:DUAL_POINT ui_text:abcnews.go.Com ui_type:TEXT
step_id:3 action_type:TYPE typed_text:What's the news in Chile?
step_id:4 action_type:TYPE typed_text:
step_id:5 action_type:PRESS_ENTER

Screen:
id:0 ui_text: ui_type:ICON_HOME
id:1 ui_text: ui_type:ICON_THREE_DOTS
id:2 ui_text:google.com/search?q ui_type:TEXT
id:3 ui_text:Google ui_type:TEXT
id:4 ui_text:= ui_type:ICON_THREE_BARS
id:5 ui_text: ui_type:ICON_MIC
id:6 ui_text:Q ui_type:ICON_MAGNIFYING_GLASS
id:7 ui_text:What's the news in Chile? ui_type:TEXT
id:8 ui_text:Al ui_type:TEXT
id:9 ui_text:News ui_type:TEXT
id:10 ui_text:Images ui_type:TEXT
id:11 ui_text:Videos ui_type:TEXT
id:12 ui_text:Maps ui_type:TEXT
id:13 ui_text: ui_type:ICON_THREE_DOTS
id:14 ui_text:4 ui_type:TEXT
id:15 ui_text:https://www.aljazeera.com> where ui_type:TEXT
id:16 ui_text:Chile | Today's latest from Al ui_type:TEXT
id:17 ui_text:Jazeera ui_type:TEXT
id:18 ui_text:Stay on top of Chile latest developments on ui_type:TEXT
id:19 ui_text:the ground with Al ui_type:TEXT
id:20 ui_text:Jazeera's fact-based ui_type:TEXT
id:21 ui_text:news, ui_type:TEXT
id:22 ui_text:exclusive video footage, ui_type:TEXT
id:23 ui_text:photos ui_type:TEXT
id:24 ui_text:and ui_type:TEXT
id:25 ui_text:updated... ui_type:TEXT
id:26 ui_text: ui_type:ICON_THREE_DOTS
id:27 ui_text:https://www.reuters.com» archive ui_type:TEXT
id:28 ui_text:Chile News Headlines |Reuters ui_type:TEXT
id:29 ui_text:Chile permanently closes ui_type:TEXT
id:30 ui_text:mining areas ui_type:TEXT
id:31 ui_text:Chile files ui_type:TEXT
id:32 ui_text:connected to giant sinkhole ui_type:TEXT
id:33 ui_text:charges against mining company for giant... ui_type:TEXT
id:34 ui_text: ui_type:ICON_THREE_DOTS
id:35 ui_text:https://www.independent.co.uk> topic ui_type:TEXT
id:36 ui_text:Chile ui_type:TEXT
id:37 ui_text:latest news, ui_type:TEXT
id:38 ui_text:breaking ui_type:TEXT
id:39 ui_text:- ui_type:TEXT
id:40 ui_text:stories and comMent - The ui_type:TEXT
id:41 ui_text: ui_type:ICON_NAV_BAR_CIRCLE
id:42 ui_text: ui_type:ICON_NAV_BAR_RECT
id:43 ui_text: ui_type:ICON_V_BACKWARD

Instruction:What's the news in Chile?
Answer:",

"input":"",

"output":"action_type:DUAL_POINT ui_text:Jazeera ui_type:TEXT id:17"

Data preparation

To add an AitW dataset, the following steps need to be performed.

  • Create a dataset configuration after the schema described above. Examples can be found in code/llama-recipes-main/src/llama_recipes/configs/datasets.py.
      @dataclass
      class aitw_dataset:
         dataset: str = "aitw_dataset"
         train_split: str = "/mntnlp/peisu/data/autoui/train.json" 
         test_split: str = "/mntnlp/peisu/data/autoui/val.json"
  • Create a preprocessing routine which loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name is the string for the train/validation split as defined in the dataclass. Examples can be found in code/llama-recipes-main/src/llama_recipes/datasets/aitw_dataset.py and __init__.py; a fuller sketch of such a routine is shown after this list.
      class aitw(Dataset):
         def __init__(self, tokenizer, json_name=None,):
         ...
      def get_dataset(dataset_config, tokenizer, json_name=None):
         """cover function for handling loading the working dataset"""
         ...
  • Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in code/llama-recipes-main/src/llama_recipes/utils/dataset_utils.py
      from llama_recipes.datasets import (
         get_grammar_dataset,
         get_alpaca_dataset,
         get_samsum_dataset,
         get_aitw_dataset,
      )
      DATASET_PREPROC = {
         "alpaca_dataset": partial(get_alpaca_dataset),
         "grammar_dataset": get_grammar_dataset,
         "samsum_dataset": get_samsum_dataset,
         "custom_dataset": get_custom_dataset,
         "aitw_dataset":get_aitw_dataset,
      }
  • Set dataset field in training config to dataset name or use --dataset option.
      --dataset aitw_dataset
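
For reference, a fleshed-out version of the preprocessing skeleton above could look like the following. The JSON fields ("instruction", "output") match the processed example shown earlier, but the tokenization details (max length, padding, label handling) are assumptions rather than the repository's exact code.

    # Hypothetical full version of the aitw dataset class sketched above.
    # Assumes each JSON record has "instruction" and "output" fields, as in
    # the processed example, and that tokenizer.pad_token is set.
    import json
    from torch.utils.data import Dataset

    class aitw(Dataset):
        def __init__(self, tokenizer, json_name=None, max_length=2048):
            with open(json_name) as f:
                self.samples = json.load(f)
            self.tokenizer = tokenizer
            self.max_length = max_length

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            sample = self.samples[idx]
            text = sample["instruction"] + sample["output"]
            enc = self.tokenizer(text, max_length=self.max_length,
                                 truncation=True, padding="max_length",
                                 return_tensors="pt")
            input_ids = enc["input_ids"].squeeze(0)
            return {"input_ids": input_ids,
                    "attention_mask": enc["attention_mask"].squeeze(0),
                    "labels": input_ids.clone()}

    def get_dataset(dataset_config, tokenizer, json_name=None):
        """Cover function: fall back to the split path from the config."""
        if json_name is None:
            json_name = dataset_config.train_split
        return aitw(tokenizer, json_name=json_name)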

Fine-tuning

Fine-tuning is based on Llama2-7b. For more information, please refer to llama-recipes.

> cd code && pip install -r llama-recipes-main/requirements.txt
> torchrun --nnodes 1 \
         --nproc_per_node 1 llama-recipes-main/examples/finetuning.py \
         --enable_fsdp \
         --use_peft \
         --peft_method lora \
         --model_name /mntnlp/common_base_model/llama2-7b \
         --dataset aitw_dataset \
         --output_dir /mntnlp/tine/temp \
         --use_fast_kernels \
         --run_validation False \
         --batch_size_training 4 \
         --num_epochs 10 \
         --quantization False

Here we make use of Parameter-Efficient Fine-Tuning (PEFT); in the command above, LoRA is selected via --peft_method lora.

To run the command above, make sure to pass the --peft_method argument, which can be set to lora, llama_adapter, or prefix.

Once model training is complete, the log outputs information such as the loss, the tokenizer, the LoRA storage path, and other details of the training process.

Epoch 10: train_perplexity=1.3190, train_epoch_loss=0.2768, epoch time 21.28177928365767s
Key: avg_train_prep, Value: 1.6622886657714844
Key: avg_train_loss, Value: 0.49041351675987244
Key: avg_epoch_time, Value: 21.60782462991774
Key: avg_checkpoint_time, Value: 0
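
The LoRA adapter saved by the run above can later be attached back onto the base model. Below is a minimal sketch using the peft library; the paths are placeholders, and code/infer_vllm.py may load the weights differently.

    # Sketch: attach a saved LoRA adapter to the Llama 2 base model for inference.
    # Paths are placeholders; this is not necessarily how code/infer_vllm.py does it.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_model_path = "/path/to/llama2-7b"      # placeholder
    lora_adapter_path = "/path/to/lora_output"  # placeholder (the --output_dir above)

    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")
    model = PeftModel.from_pretrained(model, lora_adapter_path)
    model = model.merge_and_unload()  # fold the LoRA weights into the base weights
    model.eval()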

Inference and Evaluation

> python code/infer_vllm.py --model_dir [llama path] --lora_dir [lora path] --test_file__dir [test file path]

This script loads the LoRA weights into the base model and runs prediction and evaluation on the test set. It then reports the model's prediction results, including the number of prompts in the test samples, the corresponding user task instruction count, and the task complete score.

Prompt count: 489436 
Task instruction count: 62493
Task complete score : 0.5622314184930866
  • Task complete score = Average(Number of Correct Actions / Episode Length)
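
For clarity, the sketch below computes the same metric from per-episode step comparisons; the data structure (a list of (predicted, reference) action pairs per episode) is illustrative rather than the evaluator's actual format.

    # Illustrative computation of the task complete score: per episode, the
    # fraction of correctly predicted steps, averaged over all episodes.
    def task_complete_score(episodes):
        per_episode = []
        for steps in episodes:
            correct = sum(1 for predicted, reference in steps if predicted == reference)
            per_episode.append(correct / len(steps))
        return sum(per_episode) / len(per_episode)

    episodes = [
        [("DUAL_POINT id:17", "DUAL_POINT id:17"), ("PRESS_ENTER", "PRESS_ENTER")],
        [("TYPE chile news", "TYPE chile news"), ("DUAL_POINT id:3", "DUAL_POINT id:5")],
    ]
    print(task_complete_score(episodes))  # (2/2 + 1/2) / 2 = 0.75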
