MobileAgent

1. Introduction

Agents built around Large Language Models (LLMs) can now automate mobile device operations for users. After fine-tuning on a user's mobile operations, these agents can follow high-level user instructions online: they decompose goals, sequence sub-goals, and interactively explore the environment until the final objective is achieved. However, mobile operations involve personalized user data, which raises privacy concerns and requires user confirmation. Moreover, users' real-world operations are exploratory, so the action data is complex and redundant, which makes it hard for agents to learn from. To address these issues, in our practical application we designed interactive tasks between agents and humans to identify sensitive information and align with personalized user needs. In addition, we integrated Standard Operating Procedure (SOP) information into the model's in-context learning to improve the agent's comprehension of complex task execution.

Paper: https://arxiv.org/abs/2401.04124

Model: https://huggingface.co/tineding/ACT-SOP

2. Data

We surveyed well-known datasets for device control with large models, along with their statistical characteristics. For web control, the Mind2Web dataset is particularly notable; for mobile control, the AitW dataset is a prominent resource.

Examples of SOP data related to AIA and AitW scenarios in our paper can be found in the <data> folder.

Dataset/URL	Platform	Human demos	APPs or websites	Task steps	Observation format	Screen features	Real	High-level instruction
MiniWoB	web	100	100	3.6	DOM	×	×	×
WebShop	web	12,000	1	11.3	DOM	×	×	✔️
RUSS	web	80	22	5.4	DOM	×	✔️	✔️
Mind2Web	web	2,350	137	7.3	DOM	×	✔️	✔️
----	----	----	----	----	----	----	----	----
RicoSCA	Android(apps)	0	n/a	1.0	VH,screen	×	×	×
UIBert	Android(apps)	16,660	n/a	1.0	VH,screen	×	✔️	×
PixelHelp	Android(apps)	187	4	4.2	VH,screen	×	✔️	✔️
META-GUI	Android(apps)	1,125	11	4.3	VH,screen	×	✔️	✔️
UGIF	Android(apps)	523	12	5.3	VH,screen	×	✔️	✔️
MoTIF	Android(apps)	4,707	125	4.5	VH,screen	×	✔️	✔️
AitW	Android(apps+web)	715,142 (5,689,993 examples)	357+	6.5	screen	✔️	✔️	✔️

Mobile
- Android in the Wild (AitW) is a large-scale dataset for mobile device control that contains human-collected demonstrations of natural language instructions, user interface (UI) screens, and actions for a variety of human tasks.
Web
- Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. It contains 2,350 tasks from 137 websites spanning 31 domains. These tasks reflect diverse and practical use cases on the web, provide challenging yet realistic environments with real-world websites, and test generalization ability across tasks and environments.

3. Dashboard for AitW

We used the AitW dataset as a standard test set to compare models from various research teams, including our model based on the SOP mechanism.

Type	Action Entity	Tuning	Research Organization	Model	Overall	General	Install	GoogleApps	Single	WebShopping
LLM	Point-Coordinates	No	SJTU	ChatGPT-CoT (5-shot)	7.72	5.93	4.38	10.47	9.39	8.42
		Yes		PaLM	39.6	-	-	-	-	-
				Llama2 (1% data)	28.4	28.56	35.18	30.99	27.35	19.92
	Enlarged-Frame		AntFin	Llama 2	65.43	55.3	73.65	62.33	74.82	61.07
				Llama 2+plan	62.08	52.1	71.65	56.23	74.18	56.22
				Llama 2+plan+state	62.86	53.77	69.1	61.19	73.51	56.74
				Llama 2+SOP	66.92	55.8	74.98	63.95	76.27	63.61
Multi-Modal		No	Microsoft/UC	GPT-4V ZS +text	50.54	41.66	42.64	49.82	72.83	45.73
				GPT-4V ZS image-only	51.92	42.44	49.18	48.26	76.34	43.35
				GPT-4V ZS +history	52.96	43.01	46.14	49.18	78.29	48.18
	Point-Coordinates	Yes	Google	BC-single	68.7	-	-	-	-	-
			Google	BC-history	73.1	63.7	77.5	75.7	80.3	68.5
			SJTU	Auto-UI (separate)	74.07	65.94	77.62	76.45	81.39	69.72
			SJTU	Auto-UI (unified)	74.27	68.24	76.89	71.37	84.58	70.26


4. Our Code

We provide basic processing code for AitW so that anyone can train their own large models. The code covers data processing, model training, and model evaluation.

Mobile-AITW

Data processing

For more information, please refer to AitW and Auto-UI.

Base enlarged-Frame

> pip install jax jaxlib
> python code/data_process.py --input_dir [aitw_file_path] --output_dir [output_file_path]
-- for example: python code/data_process.py --input_dir /mntnlp/peisu/data/autoui/google_apps_blip_test.obj --output_dir /mntnlp/data/autoui/output_name.json

For example:

Input:

{
	'episode_id': '16150593985172894737', 
	'data': 
			[{
			'goal': "What's the top post on reddit today?", 
			'step_id': 1, 
			'image': tensor([-4.1089e-01,  3.0527e+00, -6.8378e-04,  ...,  5.6299e-01,
         					1.2676e+00, -9.7266e-01], dtype=torch.float16), 
         	'ui_positions': array([[0.05131579, 0.09722222, 0.02171053, 0.27777779],
					       [0.58684212, 0.31111112, 0.06447368, 0.14166667],
					       [0.60855263, 0.83333331, 0.025     , 0.02916667],
					       [0.67171055, 0.77916664, 0.01447368, 0.14166667],
					       [0.67302632, 0.33888888, 0.01315789, 0.09027778],
					       [0.67302632, 0.5625    , 0.01315789, 0.10972222],
					       [0.77302629, 0.12083333, 0.04605263, 0.05555556],
					       [0.77565789, 0.82499999, 0.04342105, 0.04305556],
					       [0.77894735, 0.34999999, 0.03618421, 0.05833333],
					       [0.81776315, 0.40277779, 0.01513158, 0.19583334],
					       [0.87828946, 0.84027779, 0.0381579 , 0.03472222],
					       [0.87894738, 0.12083333, 0.03618421, 0.04027778],
					       [0.95657897, 0.19166666, 0.02894737, 0.02916667],
					       [0.95657897, 0.48055556, 0.02828947, 0.03194445],
					       [0.95723683, 0.77083331, 0.02697368, 0.02916667]]), 
			'ui_text': ['Mon, Oct 10', 'M', '', 'YouTube', 'Gmail', 'Photos', '', '', '', 'Preferences', '', 'G', '', '', ''], 
			'ui_type': ['TEXT', 'TEXT', 'ICON_PLAY', 'TEXT', 'TEXT', 'TEXT', 'ICON_CALL', 'ICON_LOCATION', 'ICON_CHAT', 'TEXT', 'ICON_MIC', 
                        'ICON_GOOGLE', 'ICON_V_BACKWARD', 'ICON_NAV_BAR_CIRCLE', 'ICON_NAV_BAR_RECT'], 
			'result_touch_yx': [0.8917460441589355, 0.4927879273891449], 
            'result_lift_yx': [0.8917460441589355, 0.4927879273891449], 
			'result_action': ['DUAL_POINT', '']
			}]
}

Output:

"id":"2##What's the news in Chile?",

"instruction":
"Given a mobile screen and a question, provide the action based on the screeninformation.

Previous Actions:
step_id:0 action_type:PRESS_HOME
step_id:1 action_type:DUAL_POINT ui_text: ui_type:ICON_MIC
step_id:2 action_type:DUAL_POINT ui_text:abcnews.go.Com ui_type:TEXT
step_id:3 action_type:TYPE typed_text:What's the news in Chile?
step_id:4 action_type:TYPE typed_text:
step_id:5 action_type:PRESS_ENTER

Screen:
id:0 ui_text: ui_type:ICON_HOME
id:1 ui_text: ui_type:ICON_THREE_DOTS
id:2 ui_text:google.com/search?q ui_type:TEXT
id:3 ui_text:Google ui_type:TEXT
id:4 ui_text:= ui_type:ICON_THREE_BARS
id:5 ui_text: ui_type:ICON_MIC
id:6 ui_text:Q ui_type:ICON_MAGNIFYING_GLASS
id:7 ui_text:What's the news in Chile? ui_type:TEXT
id:8 ui_text:Al ui_type:TEXT
id:9 ui_text:News ui_type:TEXT
id:10 ui_text:Images ui_type:TEXT
id:11 ui_text:Videos ui_type:TEXT
id:12 ui_text:Maps ui_type:TEXT
id:13 ui_text: ui_type:ICON_THREE_DOTS
id:14 ui_text:4 ui_type:TEXT
id:15 ui_text:https://www.aljazeera.com> where ui_type:TEXT
id:16 ui_text:Chile | Today's latest from Al ui_type:TEXT
id:17 ui_text:Jazeera ui_type:TEXT
id:18 ui_text:Stay on top of Chile latest developments on ui_type:TEXT
id:19 ui_text:the ground with Al ui_type:TEXT
id:20 ui_text:Jazeera's fact-based ui_type:TEXT
id:21 ui_text:news, ui_type:TEXT
id:22 ui_text:exclusive video footage, ui_type:TEXT
id:23 ui_text:photos ui_type:TEXT
id:24 ui_text:and ui_type:TEXT
id:25 ui_text:updated... ui_type:TEXT
id:26 ui_text: ui_type:ICON_THREE_DOTS
id:27 ui_text:https://www.reuters.com» archive ui_type:TEXT
id:28 ui_text:Chile News Headlines |Reuters ui_type:TEXT
id:29 ui_text:Chile permanently closes ui_type:TEXT
id:30 ui_text:mining areas ui_type:TEXT
id:31 ui_text:Chile files ui_type:TEXT
id:32 ui_text:connected to giant sinkhole ui_type:TEXT
id:33 ui_text:charges against mining company for giant... ui_type:TEXT
id:34 ui_text: ui_type:ICON_THREE_DOTS
id:35 ui_text:https://www.independent.co.uk> topic ui_type:TEXT
id:36 ui_text:Chile ui_type:TEXT
id:37 ui_text:latest news, ui_type:TEXT
id:38 ui_text:breaking ui_type:TEXT
id:39 ui_text:- ui_type:TEXT
id:40 ui_text:stories and comMent - The ui_type:TEXT
id:41 ui_text: ui_type:ICON_NAV_BAR_CIRCLE
id:42 ui_text: ui_type:ICON_NAV_BAR_RECT
id:43 ui_text: ui_type:ICON_V_BACKWARD

Instruction:What's the news in Chile?
Answer:",

"input":"",

"output":"action_type:DUAL_POINT ui_text:Jazeera ui_type:TEXT id:17"

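The conversion from a raw record to the text fields shown above can be sketched as follows. This is a minimal illustration, not the repository's `code/data_process.py`: the function names `render_screen`, `tap_to_element_id`, and `format_target`, and the `enlarge` factor, are assumptions made for this example. It also assumes that each `ui_positions` row is `(y, x, height, width)` in normalized screen coordinates, as in the AitW annotations.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (y, x, height, width)


def render_screen(ui_text: Sequence[str], ui_type: Sequence[str]) -> str:
    """Render one line per UI element, in the style of the 'Screen:' block."""
    return "\n".join(
        f"id:{i} ui_text:{text} ui_type:{kind}"
        for i, (text, kind) in enumerate(zip(ui_text, ui_type))
    )


def tap_to_element_id(
    touch_yx: Tuple[float, float],
    ui_positions: Sequence[Box],
    enlarge: float = 1.4,  # hypothetical enlargement factor
) -> Optional[int]:
    """Return the index of the smallest enlarged box containing the tap, or None."""
    ty, tx = touch_yx
    best, best_area = None, float("inf")
    for i, (y, x, h, w) in enumerate(ui_positions):
        # Grow the box around its center by the `enlarge` factor.
        cy, cx = y + h / 2, x + w / 2
        inside = abs(ty - cy) <= h * enlarge / 2 and abs(tx - cx) <= w * enlarge / 2
        if inside and h * w < best_area:
            best, best_area = i, h * w
    return best


def format_target(action_type: str, ui_text, ui_type, element_id: int) -> str:
    """Build an answer string in the style of the "output" field."""
    return (
        f"action_type:{action_type} ui_text:{ui_text[element_id]} "
        f"ui_type:{ui_type[element_id]} id:{element_id}"
    )


# Toy screen with three elements (values invented for illustration).
ui_text = ["Gmail", "", "Photos"]
ui_type = ["TEXT", "ICON_MIC", "TEXT"]
ui_positions = [(0.10, 0.10, 0.05, 0.20), (0.50, 0.50, 0.04, 0.04), (0.80, 0.10, 0.05, 0.20)]

print(render_screen(ui_text, ui_type))
element_id = tap_to_element_id((0.525, 0.545), ui_positions)
if element_id is not None:
    print(format_target("DUAL_POINT", ui_text, ui_type, element_id))
```

In this toy example the tap lands slightly outside the original box of the microphone icon but inside its enlarged frame, so it is still attributed to that element. Choosing the smallest matching box is one possible tie-breaking rule for overlapping elements.
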
Data preparation

To add an AitW dataset, perform the following steps.

Create a dataset configuration dataclass such as the one below. Examples can be found in code/llama-recipes-main/src/llama_recipes/configs/datasets.py.

      from dataclasses import dataclass

      @dataclass
      class aitw_dataset:
          # Name used to look up the preprocessing function
          dataset: str = "aitw_dataset"
          train_split: str = "/mntnlp/peisu/data/autoui/train.json"
          test_split: str = "/mntnlp/peisu/data/autoui/val.json"

Create a preprocessing routine that loads the data and returns a PyTorch-style dataset. The signature of the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name is the string for the train/validation split as defined in the dataclass. Examples can be found in code/llama-recipes-main/src/llama_recipes/datasets/aitw_dataset.py and __init__.py.

      from torch.utils.data import Dataset

      class aitw(Dataset):
          def __init__(self, tokenizer, json_name=None):
              ...

      def get_dataset(dataset_config, tokenizer, json_name=None):
          """Cover function for handling loading the working dataset."""
          ...
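
As a rough illustration of what such a preprocessing class might look like, the sketch below implements the map-style dataset protocol (`__len__` and `__getitem__`) over records with `instruction`, `input`, and `output` keys. It is not the repository's `aitw_dataset.py`: the class name `InstructionDataset`, the `tokenize` callable, and the choice to mask prompt tokens with -100 are assumptions for this example. The class deliberately avoids importing torch so that it runs anywhere; in practice it would subclass `torch.utils.data.Dataset` and use the real tokenizer.

```python
from typing import Callable, Dict, List


class InstructionDataset:
    """Map-style dataset over records with 'instruction', 'input', 'output' keys."""

    def __init__(
        self,
        records: List[Dict[str, str]],
        tokenize: Callable[[str], List[int]],  # hypothetical stand-in for a tokenizer
        max_len: int = 2048,
    ):
        self.records = records
        self.tokenize = tokenize
        self.max_len = max_len

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, index: int) -> Dict[str, List[int]]:
        rec = self.records[index]
        prompt_ids = self.tokenize(rec["instruction"] + rec["input"])
        answer_ids = self.tokenize(rec["output"])
        input_ids = (prompt_ids + answer_ids)[: self.max_len]
        # Assumption: prompt tokens are masked with -100 so that only the
        # answer tokens contribute to the loss.
        labels = ([-100] * len(prompt_ids) + answer_ids)[: self.max_len]
        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": [1] * len(input_ids),
        }


# Toy usage with a character-level "tokenizer" (for illustration only).
toy_records = [{"instruction": "ab", "input": "", "output": "c"}]
dataset = InstructionDataset(toy_records, tokenize=lambda s: [ord(ch) for ch in s])
print(len(dataset), dataset[0])
```
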

Register the dataset name and preprocessing function by inserting it as key and value into the DATASET_PREPROC dictionary in code/llama-recipes-main/src/llama_recipes/utils/dataset_utils.py

      from functools import partial

      from llama_recipes.datasets import (
         get_grammar_dataset,
         get_alpaca_dataset,
         get_samsum_dataset,
         get_aitw_dataset,
      )

      # get_custom_dataset is assumed to be defined or imported elsewhere in dataset_utils.py
      DATASET_PREPROC = {
         "alpaca_dataset": partial(get_alpaca_dataset),
         "grammar_dataset": get_grammar_dataset,
         "samsum_dataset": get_samsum_dataset,
         "custom_dataset": get_custom_dataset,
         "aitw_dataset": get_aitw_dataset,
      }

Set the dataset field in the training config to the dataset name, or use the --dataset option.

      --dataset aitw_dataset
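
The registration pattern above amounts to a name-to-function lookup. The following self-contained sketch shows the idea with toy stand-ins; `toy_dataset`, `get_toy_dataset`, and `resolve_dataset` are hypothetical names for this example and are not part of llama-recipes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class toy_dataset:
    dataset: str = "toy_dataset"
    train_split: str = "train.json"
    test_split: str = "val.json"


def get_toy_dataset(dataset_config, tokenizer, split_name):
    # A real routine would load and tokenize the file; here we only echo the path.
    if split_name == "train":
        return {"path": dataset_config.train_split}
    return {"path": dataset_config.test_split}


# Registry mapping the dataset name to its preprocessing function.
DATASET_PREPROC: Dict[str, Callable] = {"toy_dataset": get_toy_dataset}


def resolve_dataset(tokenizer, dataset_config, split_name="train"):
    """Look up the preprocessing function by name and call it."""
    if dataset_config.dataset not in DATASET_PREPROC:
        raise ValueError(f"{dataset_config.dataset} is not registered")
    return DATASET_PREPROC[dataset_config.dataset](dataset_config, tokenizer, split_name)


print(resolve_dataset(None, toy_dataset(), "train"))
```
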

Fine-tuning

Fine-tuning uses Llama2-7b. For more information, please refer to llama-recipes.

> cd code && pip install -r llama-recipes-main/requirements.txt
> torchrun --nnodes 1 \
         --nproc_per_node 1 llama-recipes-main/examples/finetuning.py \
         --enable_fsdp \
         --use_peft \
         --peft_method lora \
         --model_name /mntnlp/common_base_model/llama2-7b \
         --dataset aitw_dataset \
         --output_dir /mntnlp/tine/temp \
         --use_fast_kernels \
         --run_validation False \
         --batch_size_training 4 \
         --num_epochs 10 \
         --quantization False

Here we use parameter-efficient fine-tuning (PEFT) methods.

To run the command above, make sure to pass the peft_method argument, which can be set to lora, llama_adapter, or prefix.

Once training is complete, the log reports the loss, the tokenizer, the LoRA storage path, and other details of the training process.

Epoch 10: train_perplexity=1.3190, train_epoch_loss=0.2768, epoch time 21.28177928365767s
Key: avg_train_prep, Value: 1.6622886657714844
Key: avg_train_loss, Value: 0.49041351675987244
Key: avg_epoch_time, Value: 21.60782462991774
Key: avg_checkpoint_time, Value: 0
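
The per-epoch values in this log are consistent with perplexity being the exponential of the epoch loss. A quick check, with the loss copied from the epoch 10 line above:

```python
import math

train_epoch_loss = 0.2768  # from the "Epoch 10" log line
train_perplexity = math.exp(train_epoch_loss)

# Should be close to the logged train_perplexity of 1.3190.
print(f"{train_perplexity:.4f}")
```

The averaged values need not satisfy the same identity, because the mean of exponentials generally differs from the exponential of the mean.
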

Inference and Evaluation

> python code/infer_vllm.py --model_dir [llama path] --lora_dir [lora path] --test_file__dir [test file path]

This script loads the LoRA weights into the base model, then runs prediction and evaluation on the test set. It reports the number of prompts in the test samples, the number of corresponding user task instructions, and the task complete score.

Prompt count: 489436 
Task instruction count: 62493
Task complete score : 0.5622314184930866

Task complete score = Average(Number of Correct Actions / Episode Length)
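
The formula can be read as a per-episode accuracy averaged over episodes. A minimal sketch, assuming each episode is represented as a list of booleans marking whether each predicted action was correct (the function name and input format are illustrative, not taken from `code/infer_vllm.py`):

```python
from typing import Sequence


def task_complete_score(episodes: Sequence[Sequence[bool]]) -> float:
    """Mean over episodes of (number of correct actions / episode length)."""
    per_episode = [sum(steps) / len(steps) for steps in episodes if len(steps) > 0]
    return sum(per_episode) / len(per_episode)


# Two toy episodes: 3 of 4 actions correct, then 1 of 2 correct.
score = task_complete_score([[True, True, False, True], [True, False]])
print(score)  # (0.75 + 0.5) / 2 = 0.625
```

Because each episode is weighted equally, short episodes count as much as long ones, which differs from a plain per-action accuracy over all prompts.
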
