New submissions for Fri, 10 Nov 23 #199

Open · A-suozhang (Owner) opened this issue Nov 12, 2023 · 0 comments

Keyword: efficient

Detecting Relevant Information in High-Volume Chat Logs: Keyphrase Extraction for Grooming and Drug Dealing Forensic Analysis

  • Authors: Jeovane Honório Alves, Horácio A. C. G. Pedroso, Rafael Honorio Venetikides, Joel E. M. Köster, Luiz Rodrigo Grochocki, Cinthia O. A. Freitas, Jean Paul Barddal
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.04905
  • Pdf link: https://arxiv.org/pdf/2311.04905
  • Abstract
    The growing use of digital communication platforms has given rise to various criminal activities, such as grooming and drug dealing, which pose significant challenges to law enforcement and forensic experts. This paper presents a supervised keyphrase extraction approach to detect relevant information in high-volume chat logs involving grooming and drug dealing for forensic analysis. The proposed method, JointKPE++, builds upon the JointKPE keyphrase extractor with improvements for handling longer texts effectively. We evaluate JointKPE++ on grooming and drug dealing datasets using BERT-based pre-trained models, including BERT, RoBERTa, SpanBERT, and BERTimbau. The results show significant improvements over traditional approaches and demonstrate the potential of JointKPE++ to aid forensic experts in efficiently detecting keyphrases related to criminal activities.

Education journal rankings: A diversity-based Author Affiliation Index assessment methodology

  • Authors: Yan-Hong Yang, Ying-Hui Shao
  • Subjects: Digital Libraries (cs.DL)
  • Arxiv link: https://arxiv.org/abs/2311.04908
  • Pdf link: https://arxiv.org/pdf/2311.04908
  • Abstract
    Determining the reputation of academic journals is a crucial issue. The Author Affiliation Index (AAI) was proposed as a novel indicator for judging journal quality in many academic disciplines. Nevertheless, the original AAI has several potential limitations, some of which have been discussed and addressed in previous studies. In this paper, we modify the original AAI by incorporating the diversity of top-notch institutions, yielding the AAID, and explore how institutional diversity relates to journal quality assessment. We further conduct a quality assessment of 263 education journals indexed in the Social Sciences Citation Index (SSCI) by applying the AAID, AAI, and weighted AAI. We find that the AAID ranking has a low correlation coefficient with the Journal Impact Factor (JIF) and Eigenfactor Score (ES). That is to say, the AAID rating does not agree closely with the most popular ranking indicators, JIF and ES, for journals in the field of education. Moreover, we analyze the reasons for the highest AAID from the structure of complex networks. Overall, the AAID is an alternative indicator for evaluating the prestige of journals from a new perspective.
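
The abstract does not spell out the AAID formula, but the idea of adjusting an affiliation-based index by institutional diversity can be illustrated with a toy computation. The sketch below is purely hypothetical: it takes the AAI as the share of papers with at least one author from a designated set of top institutions and scales it by the normalized Shannon entropy of those institutions' appearances; the paper's actual definition may differ.

```python
import math
from collections import Counter

def aai(paper_affiliations, top_institutions):
    """Original AAI idea: fraction of papers with at least one author
    affiliated with a designated top institution."""
    hits = sum(any(a in top_institutions for a in affs)
               for affs in paper_affiliations)
    return hits / len(paper_affiliations)

def aaid(paper_affiliations, top_institutions):
    """Hypothetical diversity-adjusted AAI: scale the AAI by the normalized
    Shannon entropy of the top institutions that appear, so a journal drawing
    on many distinct elite institutions scores higher than one dominated by
    a single institution. This is an illustration, not the paper's formula."""
    counts = Counter(a for affs in paper_affiliations
                     for a in affs if a in top_institutions)
    if len(counts) <= 1:
        diversity = 0.0
    else:
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total)
                       for c in counts.values())
        diversity = entropy / math.log(len(counts))  # normalize to [0, 1]
    return aai(paper_affiliations, top_institutions) * diversity

papers = [["MIT", "Oxford"], ["MIT"], ["Unknown U"], ["Stanford", "MIT"]]
print(aaid(papers, {"MIT", "Oxford", "Stanford"}))
```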

Successor Features for Efficient Multisubject Controlled Text Generation

  • Authors: Meng Cao, Mehdi Fatemi, Jackie Chi Kit Cheung, Samira Shabanian
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.04921
  • Pdf link: https://arxiv.org/pdf/2311.04921
  • Abstract
    While large language models (LLMs) have achieved impressive performance in generating fluent and realistic text, controlling the generated text so that it exhibits properties such as safety, factuality, and non-toxicity remains challenging. Existing decoding-based methods are static in terms of the dimension of control: if the target subject is changed, they require new training, and concurrently controlling multiple subjects can quickly become prohibitive. In this work, we introduce SF-GEN, which is grounded in two primary concepts: successor features (SFs) to decouple the LLM's dynamics from task-specific rewards, and language model rectification to proportionally adjust the probability of selecting a token based on the likelihood that the finished text becomes undesired. SF-GEN seamlessly integrates the two to enable dynamic steering of text generation with no need to alter the LLM's parameters. Thanks to the decoupling effect induced by successor features, our method proves to be memory- and computation-efficient for training as well as decoding, especially when dealing with multiple target subjects. To the best of our knowledge, our research represents the first application of successor features in text generation. In addition to its computational efficiency, the language produced by our method is comparable to the SOTA (and outperforms baselines) in both control measures and language quality, which we demonstrate through a series of experiments in various controllable text generation tasks.
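
The rectification step described above can be sketched independently of successor features: each candidate token's probability is reduced in proportion to an estimate of how likely its continuation is to become undesired. Below is a minimal numpy illustration under that reading, with `p_undesired` as a stand-in for the SF-based estimator; it is not the authors' implementation.

```python
import numpy as np

def rectify(logits, p_undesired, strength=1.0):
    """Hypothetical decoding-time rectification: down-weight each candidate
    token in proportion to the estimated probability that choosing it leads
    to an undesired completion.
    logits: (V,) raw LM scores; p_undesired: (V,) estimates in [0, 1]."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Keep probability mass only for the "safe" fraction of each token.
    adjusted = probs * (1.0 - p_undesired) ** strength
    return adjusted / adjusted.sum()

vocab = ["hello", "kind", "rude", "toxic"]
logits = np.array([2.0, 1.5, 1.8, 1.2])
p_bad = np.array([0.01, 0.02, 0.70, 0.95])  # stand-in for an SF-based estimator
print(dict(zip(vocab, rectify(logits, p_bad).round(3))))
```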

Leveraging Large Language Models for Collective Decision-Making

  • Authors: Marios Papachristou, Longqi Yang, Chin-Chia Hsu
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2311.04928
  • Pdf link: https://arxiv.org/pdf/2311.04928
  • Abstract
    In various work contexts, such as meeting scheduling, collaborating, and project planning, collective decision-making is essential but often challenging due to diverse individual preferences, varying work focuses, and power dynamics among members. To address this, we propose a system leveraging Large Language Models (LLMs) to facilitate group decision-making by managing conversations and balancing preferences among individuals. Our system extracts individual preferences and suggests options that satisfy a significant portion of the members. We apply this system to corporate meeting scheduling. We create synthetic employee profiles and simulate conversations at scale, leveraging LLMs to evaluate the system. Our results indicate efficient coordination with reduced interactions between members and the LLM-based system. The system also effectively refines proposed options over time, ensuring their quality and equity. Finally, we conduct a survey study involving human participants to assess our system's ability to aggregate preferences and reasoning. Our findings show that the system exhibits strong performance in both dimensions.

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

  • Authors: In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.04934
  • Pdf link: https://arxiv.org/pdf/2311.04934
  • Abstract
    We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when the segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompts. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
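
The core mechanism, computing attention states for recurring prompt modules once and looking them up thereafter, can be pictured as a memoization layer. In the toy sketch below, `fake_encoder` stands in for a real transformer's key/value computation, and the positional fix-ups that the actual schema performs are only noted in a comment; none of the names come from the paper's code.

```python
import hashlib
import numpy as np

class PromptModuleCache:
    """Minimal sketch of the prompt-module idea: attention states for
    frequently reused text segments (system messages, templates) are
    computed once and reused across prompts."""
    def __init__(self, encode_states):
        self.encode_states = encode_states
        self._cache = {}

    def get(self, segment: str):
        key = hashlib.sha256(segment.encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.encode_states(segment)  # slow path, once
        return self._cache[key]

    def states_for_prompt(self, segments):
        # Concatenate cached per-module states; a real system must also fix
        # up positional encodings, which Prompt Cache's schema handles.
        return np.concatenate([self.get(s) for s in segments], axis=0)

def fake_encoder(text):  # stand-in: one "state" vector per character
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((len(text), 8))

cache = PromptModuleCache(fake_encoder)
sys_msg = "You are a helpful assistant."
kv1 = cache.states_for_prompt([sys_msg, "Q: What is 2+2?"])
kv2 = cache.states_for_prompt([sys_msg, "Q: Capital of France?"])  # sys_msg hits cache
```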

Interpretable Geoscience Artificial Intelligence (XGeoS-AI): Application to Demystify Image Recognition

  • Authors: Jin-Jian Xu, Hao Zhang, Chao-Sheng Tang, Lin Li, Bin Shi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2311.04940
  • Pdf link: https://arxiv.org/pdf/2311.04940
  • Abstract
    As Earth science enters the era of big data, artificial intelligence (AI) not only offers great potential for solving geoscience problems, but also plays a critical role in accelerating the understanding of the complex, interactive, and multiscale processes of Earth's behavior. As geoscience AI models are progressively utilized for significant predictions in crucial situations, geoscience researchers increasingly demand their interpretability and versatility. This study proposes an interpretable geoscience artificial intelligence (XGeoS-AI) framework to unravel the mystery of image recognition in the Earth sciences, and its effectiveness and versatility are demonstrated by taking computed tomography (CT) image recognition as an example. Inspired by the mechanism of human vision, the proposed XGeoS-AI framework generates a threshold value from a local region within the whole image to complete the recognition. Different kinds of AI methods, such as Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Convolutional Neural Networks (CNN), can be adopted as the AI engines of the proposed XGeoS-AI framework to efficiently complete geoscience image recognition tasks. Experimental results demonstrate that the effectiveness, versatility, and heuristics of the proposed framework give it great potential for solving geoscience image recognition problems. Interpretable AI should receive more attention in the Earth sciences, as it is key to promoting more rational and wider applications of AI in the field. In addition, the proposed interpretable framework may be a forerunner of technological innovation in the Earth sciences.

Edge-assisted U-Shaped Split Federated Learning with Privacy-preserving for Internet of Things

  • Authors: Hengliang Tang, Zihang Zhao, Detian Liu, Yang Cao, Shiqiang Zhang, Siqing You
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2311.04944
  • Pdf link: https://arxiv.org/pdf/2311.04944
  • Abstract
    In the realm of the Internet of Things (IoT), deploying deep learning models to process data generated or collected by IoT devices is a critical challenge. However, direct data transmission can cause network congestion and inefficient execution, given that IoT devices typically have limited computation and communication capabilities. Centralized data processing in data centers is also no longer feasible due to concerns over data privacy and security. To address these challenges, we present an innovative Edge-assisted U-Shaped Split Federated Learning (EUSFL) framework, which harnesses the high-performance capabilities of edge servers to assist IoT devices in the model training and optimization process. In this framework, we leverage Federated Learning (FL) to enable data holders to collaboratively train models without sharing their data, thereby enhancing data privacy protection by transmitting only model parameters. Additionally, inspired by Split Learning (SL), we split the neural network into three parts using U-shaped splitting for local training on IoT devices. By exploiting the greater computation capability of edge servers, our framework effectively reduces overall training time and allows IoT devices with varying capabilities to perform training tasks efficiently. Furthermore, we propose a novel noise mechanism called LabelDP to ensure that data features and labels can securely resist reconstruction attacks, eliminating the risk of privacy leakage. Our theoretical analysis and experimental results demonstrate that EUSFL can be integrated with various aggregation algorithms, maintains good performance across different computing capabilities of IoT devices, and significantly reduces training time and local computation overhead.
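
The U-shaped split can be pictured as cutting a network into three parts so that both the input layers and the output layers stay on the IoT device, with only intermediate activations and gradients crossing to the edge server. A minimal PyTorch sketch of that partitioning follows, using an arbitrary toy MLP rather than the paper's models; the transport between device and server is left as comments.

```python
import torch
import torch.nn as nn

# Hypothetical U-shaped split of a small MLP: the first and last blocks stay
# on the IoT device (so raw inputs and labels never leave it), while the
# compute-heavy middle runs on the edge server.
device_head = nn.Sequential(nn.Linear(32, 64), nn.ReLU())        # on IoT device
edge_body   = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                            nn.Linear(256, 64), nn.ReLU())       # on edge server
device_tail = nn.Linear(64, 10)                                  # on IoT device

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))

# Forward pass: device -> edge -> device; only activations cross the network.
h1 = device_head(x)          # would be transmitted to the edge server
h2 = edge_body(h1)           # would be transmitted back to the device
loss = nn.functional.cross_entropy(device_tail(h2), y)

# Backward flows the same path in reverse; gradients, not data, are exchanged.
loss.backward()
```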

On the steerability of large language models toward data-driven personas

  • Authors: Junyi Li, Ninareh Mehrabi, Charith Peris, Palash Goyal, Kai-Wei Chang, Aram Galstyan, Richard Zemel, Rahul Gupta
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2311.04978
  • Pdf link: https://arxiv.org/pdf/2311.04978
  • Abstract
    The recent surge in Large Language Model (LLM) related applications has led to a concurrent escalation in expectations for LLMs to accommodate a myriad of personas and encompass a broad spectrum of perspectives. An important first step towards addressing this demand is to align language models with specific personas, be it groups of users or individuals. Towards this goal, we first present a new conceptualization of a persona. Moving beyond the traditional reliance on demographics like age, gender, or political party affiliation, we introduce a data-driven persona definition methodology built on collaborative-filtering. In this methodology, users are embedded into a continuous vector space based on their opinions and clustered into cohorts that manifest coherent views across specific inquiries. This methodology allows for a more nuanced understanding of different latent social groups present in the overall population (as opposed to simply using demographic groups) and enhances the applicability of model steerability. Finally, we present an efficient method to steer LLMs towards a particular persona. We learn a soft-prompting model to map the continuous representation of users into sequences of virtual tokens which, when prepended to the LLM input, enables the LLM to produce responses aligned with a given user. Our results show that our steerability algorithm is superior in performance compared to a collection of baselines.
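
The collaborative-filtering persona construction can be approximated in a few lines: embed users from their opinion matrix into a continuous space, then cluster the embeddings into cohorts. The sketch below uses TruncatedSVD and KMeans as stand-ins; the paper's exact embedding and clustering choices are not specified in the abstract, so treat every parameter here as illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy opinion matrix: rows are users, columns are survey questions,
# entries in {-1, 0, +1} for disagree / no answer / agree.
rng = np.random.default_rng(0)
opinions = rng.choice([-1, 0, 1], size=(200, 30))

# Collaborative-filtering-style step: embed users in a continuous vector space.
user_vecs = TruncatedSVD(n_components=8, random_state=0).fit_transform(opinions)

# Cluster the embeddings into cohorts with coherent views; each centroid can
# then serve as a continuous persona representation for a soft-prompting model.
cohorts = KMeans(n_clusters=5, n_init=10, random_state=0).fit(user_vecs)
print(np.bincount(cohorts.labels_))  # cohort sizes
```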

MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine

  • Authors: Endri Taka, Aman Arora, Kai-Chiang Wu, Diana Marculescu
  • Subjects: Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2311.04980
  • Pdf link: https://arxiv.org/pdf/2311.04980
  • Abstract
    The increasing computational and memory requirements of Deep Learning (DL) workloads have led to outstanding innovations in hardware architectures. An archetype of such architectures is the novel Versal AI Engine (AIE) by AMD/Xilinx. The AIE comprises multiple programmable processors optimized for vector-based algorithms. An AIE array of 400 processor cores operating at 1.25 GHz is able to deliver a peak throughput of 8 TFLOPs for 32-bit floating-point (fp32) precision, and 128 TOPs for 8-bit integer (int8) precision. In this work, we propose MaxEVA: a novel framework to efficiently map Matrix Multiplication (MatMul) workloads on Versal AIE devices. Our framework maximizes the performance and energy efficiency of MatMul applications by efficiently exploiting features of the AIE architecture and resolving performance bottlenecks from multiple angles. When demonstrated on the VC1902 device of the VCK190 board, MaxEVA accomplishes up to 5.44 TFLOPs and 77.01 TOPs throughput for fp32 and int8 precisions, respectively. In terms of energy efficiency, MaxEVA attains up to 85.11 GFLOPs/W for fp32, and 1.73 TOPs/W for int8. Our proposed method substantially outperforms the state-of-the-art approach by exhibiting up to 2.19x throughput gain and 29.4% higher energy efficiency. The MaxEVA framework provides notable insights to fill the knowledge gap in effectively designing MatMul-based DL workloads on the new Versal AIE devices.

Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

  • Authors: Mohamed Assem Ibrahim, Shaizeen Aga, Ada Li, Suchita Pati, Mahzabeen Islam
  • Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2311.05034
  • Pdf link: https://arxiv.org/pdf/2311.05034
  • Abstract
    Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high and low precision during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6), multiple low-precision weight copies can be required. To lower the memory capacity needs of weights, we explore just-in-time quantization (JIT-Q), where we store only high-precision weights in memory and generate low-precision weights only when needed. To perform JIT-Q efficiently, we evaluate emerging processing-in-memory (PIM) technology to execute quantization. With PIM, we can offload quantization to in-memory compute units, enabling quantization to be performed without incurring costly data movement and allowing it to run concurrently with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24%) at marginal throughput loss (up to 2.4%). These memory capacity savings can unlock several benefits, such as fitting larger models in the same system, reducing model parallelism requirements, and improving overall ML training efficiency.
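
The capacity argument is easy to see in code: keep only the high-precision master copy resident and materialize the low-precision copy on demand. The numpy sketch below shows the data flow only; in the paper the quantization step is offloaded to in-memory compute units rather than run on the host, and the exact quantization scheme is an assumption here.

```python
import numpy as np

def quantize_int8(w_fp32):
    """Symmetric per-tensor int8 quantization; in JIT-Q this step would run
    on in-memory compute units instead of the accelerator."""
    scale = np.abs(w_fp32).max() / 127.0
    return np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8), scale

# Only the fp32 master copy lives in memory ...
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)

# ... and the low-precision copy is generated just in time, used, and dropped,
# instead of being stored alongside the master copy for the whole run.
w_int8, scale = quantize_int8(weights_fp32)
activations = np.random.randn(1024).astype(np.float32)
out = (w_int8.astype(np.float32) * scale) @ activations  # int8 matmul stand-in
```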

An approach to performance portability through generic programming

  • Authors: Andreas Hadjigeorgiou, Christodoulos Stylianou, Michele Weiland, Dirk Jacob Verschuur, Jacob Finkenrath
  • Subjects: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2311.05038
  • Pdf link: https://arxiv.org/pdf/2311.05038
  • Abstract
    The expanding hardware diversity in high performance computing adds enormous complexity to scientific software development. Developers who aim to write maintainable software have two options: 1) use a so-called data locality abstraction that handles portability internally, thereby trading performance for productivity; such abstractions usually come in the form of libraries, domain-specific languages, and run-time systems; 2) use generic programming, where performance, productivity, and portability are subject to software design. Following the second direction, this work describes a design approach that allows the integration of low-level and verbose programming tools into high-level generic algorithms based on template meta-programming in C++. This enables the development of performance-portable applications targeting host-device computer architectures, such as CPUs and GPUs. With a suitable design in place, extending generic algorithms to new hardware becomes a well-defined procedure that can be developed in isolation from other parts of the code. This allows scientific software to remain maintainable and efficient in a period of diversifying hardware in HPC. As proof of concept, a finite-difference modelling algorithm for the acoustic wave equation is developed and benchmarked using roofline model analysis on an Intel Xeon Gold 6248 CPU, an Nvidia Tesla V100 GPU, and an AMD MI100 GPU.

Network Flow Problems with Electric Vehicles

  • Authors: Haripriya Pulyassary, Kostas Kollias, Aaron Schild, David Shmoys, Manxi Wu
  • Subjects: Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2311.05040
  • Pdf link: https://arxiv.org/pdf/2311.05040
  • Abstract
    Electric vehicle (EV) adoption in long-distance logistics faces challenges such as range anxiety and uneven distribution of charging stations. Two pivotal questions emerge: How can EVs be efficiently routed in a charging network considering range limits, charging speeds and prices? And, can the existing charging infrastructure sustain the increasing demand for EVs in long-distance logistics? This paper addresses these questions by introducing a novel theoretical and computational framework to study the EV network flow problems. We present an EV network flow model that incorporates range constraints and nonlinear charging rates, and identify conditions under which polynomial-time solutions can be obtained for optimal single EV routing, maximum flow, and minimum-cost flow problems. Our findings provide insights for optimizing EV routing in logistics, ensuring an efficient and sustainable future.

Active Transfer Learning for Efficient Video-Specific Human Pose Estimation

  • Authors: Hiromu Taketsugu, Norimichi Ukita
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05041
  • Pdf link: https://arxiv.org/pdf/2311.05041
  • Abstract
    Human Pose (HP) estimation is actively researched because of its wide range of applications. However, even estimators pre-trained on large datasets may not perform satisfactorily due to a domain gap between the training and test data. To address this issue, we present our approach combining Active Learning (AL) and Transfer Learning (TL) to adapt HP estimators to individual video domains efficiently. For efficient learning, our approach quantifies (i) the estimation uncertainty based on the temporal changes in the estimated heatmaps and (ii) the unnaturalness in the estimated full-body HPs. These quantified criteria are then effectively combined with the state-of-the-art representativeness criterion to select uncertain and diverse samples for efficient HP estimator learning. Furthermore, we reconsider the existing Active Transfer Learning (ATL) method to introduce novel ideas related to the retraining methods and Stopping Criteria (SC). Experimental results demonstrate that our method enhances learning efficiency and outperforms comparative methods. Our code is publicly available at: https://github.com/ImIntheMiddle/VATL4Pose-WACV2024

Generalized test utilities for long-tail performance in extreme multi-label classification

  • Authors: Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05081
  • Pdf link: https://arxiv.org/pdf/2311.05081
  • Abstract
    Extreme multi-label classification (XMLC) is the task of selecting a small subset of relevant labels from a very large set of possible labels. As such, it is characterized by long-tail labels, i.e., most labels have very few positive instances. With standard performance measures such as precision@k, a classifier can ignore tail labels and still report good performance. However, it is often argued that correct predictions in the tail are more interesting or rewarding, but the community has not yet settled on a metric capturing this intuitive concept. The existing propensity-scored metrics fall short on this goal by confounding the problems of long-tail and missing labels. In this paper, we analyze generalized metrics budgeted "at k" as an alternative solution. To tackle the challenging problem of optimizing these metrics, we formulate it in the expected test utility (ETU) framework, which aims at optimizing the expected performance on a fixed test set. We derive optimal prediction rules and construct computationally efficient approximations with provable regret guarantees and robustness against model misspecification. Our algorithm, based on block coordinate ascent, scales effortlessly to XMLC problems and obtains promising results in terms of long-tail performance.

Fast Combinatorial Algorithms for Efficient Sortation

  • Authors: Madison Van Dyk, Kim Klause, Jochen Koenemann, Nicole Megow
  • Subjects: Discrete Mathematics (cs.DM)
  • Arxiv link: https://arxiv.org/abs/2311.05094
  • Pdf link: https://arxiv.org/pdf/2311.05094
  • Abstract
    Modern parcel logistic networks are designed to ship demand between given origin-destination pairs of nodes in an underlying directed network. Efficiency dictates that volume be consolidated at intermediate nodes in typical hub-and-spoke fashion. In practice, such consolidation requires parcel sortation. In this work, we propose a mathematical model for the physical requirements and limitations of parcel sortation. We then show that it is NP-hard to determine whether a feasible sortation plan exists. We discuss several settings where (near-)feasibility of a given sortation instance can be determined efficiently. The algorithms we propose are fast and build on combinatorial witness-set-type lower bounds that are reminiscent of and extend those used in earlier work on degree-bounded spanning trees and arborescences.

A differentiable brain simulator bridging brain simulation and brain-inspired computing

  • Authors: Chaoming Wang, Tianqiu Zhang, Sichao He, Yifeng Gong, Hongyaoxing Gu, Shangyang Li, Si Wu
  • Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
  • Arxiv link: https://arxiv.org/abs/2311.05106
  • Pdf link: https://arxiv.org/pdf/2311.05106
  • Abstract
    Brain simulation builds dynamical models to mimic the structure and functions of the brain, while brain-inspired computing (BIC) develops intelligent systems by learning from the structure and functions of the brain. The two fields are intertwined and should share a common programming framework to facilitate each other's development. However, none of the existing software in the fields can achieve this goal, because traditional brain simulators lack differentiability for training, while existing deep learning (DL) frameworks fail to capture the biophysical realism and complexity of brain dynamics. In this paper, we introduce BrainPy, a differentiable brain simulator developed using JAX and XLA, with the aim of bridging the gap between brain simulation and BIC. BrainPy expands upon the functionalities of JAX, a powerful AI framework, by introducing complete capabilities for flexible, efficient, and scalable brain simulation. It offers a range of sparse and event-driven operators for efficient and scalable brain simulation, an abstraction for managing the intricacies of synaptic computations, a modular and flexible interface for constructing multi-scale brain models, and an object-oriented just-in-time compilation approach to handle the memory-intensive nature of brain dynamics. We showcase the efficiency and scalability of BrainPy on benchmark tasks, highlight its differentiable simulation for biologically plausible spiking models, and discuss its potential to support research at the intersection of brain simulation and BIC.

Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks

  • Authors: Kartik Gupta, Akshay Asthana
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05109
  • Pdf link: https://arxiv.org/pdf/2311.05109
  • Abstract
    Quantized networks use less computational and memory resources and are suitable for deployment on edge devices. While quantization-aware training (QAT) is the well-studied approach to quantizing networks at low precision, most research focuses on over-parameterized networks for classification, with limited studies on popular, edge-device-friendly single-shot object detection and semantic segmentation methods like YOLO. Moreover, the majority of QAT methods rely on the Straight-through Estimator (STE) approximation, which suffers from an oscillation phenomenon resulting in sub-optimal network quantization. In this paper, we show that it is difficult to achieve extremely low precision (4-bit and lower) for efficient YOLO models even with SOTA QAT methods due to the oscillation issue, and that existing methods to overcome this problem are not effective on these models. To mitigate the effect of oscillation, we first propose an Exponential Moving Average (EMA)-based update to the QAT model. Further, we propose a simple QAT correction method, namely QC, that takes only a single epoch of training after the standard QAT procedure to correct the error induced by oscillating weights and activations, resulting in a more accurate quantized model. With extensive evaluation on the COCO dataset using various YOLO5 and YOLO7 variants, we show that our correction method consistently improves quantized YOLO networks on both object detection and segmentation tasks at low precision (4-bit and 3-bit).
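
An EMA-based update of the kind mentioned is a standard remedy for oscillating weights: maintain a slowly moving average of the model parameters alongside normal QAT training and evaluate or deploy the averaged copy. A generic PyTorch sketch of that pattern follows; it is not the authors' exact procedure, and the toy linear model merely stands in for a quantized YOLO.

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Exponential moving average of parameters; averaging across steps damps
    the weight oscillations that STE-based QAT induces."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

model = torch.nn.Linear(16, 4)          # stand-in for a quantized YOLO network
ema_model = copy.deepcopy(model)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):                 # sketch of the QAT training loop
    loss = model(torch.randn(8, 16)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(ema_model, model)        # evaluate/deploy the EMA weights
```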

ScribblePolyp: Scribble-Supervised Polyp Segmentation through Dual Consistency Alignment

  • Authors: Zixun Zhang, Yuncheng Jiang, Jun Wei, Hannah Cui, Zhen Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05122
  • Pdf link: https://arxiv.org/pdf/2311.05122
  • Abstract
    Automatic polyp segmentation models play a pivotal role in the clinical diagnosis of gastrointestinal diseases. In previous studies, most methods relied on fully supervised approaches, necessitating pixel-level annotations for model training. However, the creation of pixel-level annotations is both expensive and time-consuming, impeding the development of model generalization. In response to this challenge, we introduce ScribblePolyp, a novel scribble-supervised polyp segmentation framework. Unlike fully-supervised models, ScribblePolyp only requires the annotation of two lines (scribble labels) for each image, significantly reducing the labeling cost. Despite the coarse nature of scribble labels, which leave a substantial portion of pixels unlabeled, we propose a two-branch consistency alignment approach to provide supervision for these unlabeled pixels. The first branch employs transformation consistency alignment to narrow the gap between predictions under different transformations of the same input image. The second branch leverages affinity propagation to refine predictions into a soft version, extending additional supervision to unlabeled pixels. In summary, ScribblePolyp is an efficient model that does not rely on teacher models or moving average pseudo labels during training. Extensive experiments on the SUN-SEG dataset underscore the effectiveness of ScribblePolyp, achieving a Dice score of 0.8155, with the potential for a 1.8% improvement in the Dice score through a straightforward self-training strategy.

Counter-Empirical Attacking based on Adversarial Reinforcement Learning for Time-Relevant Scoring System

  • Authors: Xiangguo Sun, Hong Cheng, Hang Dong, Bo Qiao, Si Qin, Qingwei Lin
  • Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2311.05144
  • Pdf link: https://arxiv.org/pdf/2311.05144
  • Abstract
    Scoring systems are commonly seen on platforms in the era of big data. From credit scoring systems in financial services to membership scores on e-commerce shopping platforms, platform managers use such systems to guide users towards encouraged activity patterns and thereby manage resources more effectively and efficiently. To establish such scoring systems, several "empirical criteria" are first determined, followed by a dedicated top-down design for each factor of the score, which usually requires enormous effort to adjust and tune the scoring function in each new application scenario. What is worse, many fresh projects have no ground truth or prior experience with which to evaluate a reasonable scoring system, making the design even harder. To reduce the effort of manually adjusting the scoring function in every new scoring system, we innovatively study the scoring system starting from preset empirical criteria without any ground truth, and propose a novel framework to improve the system from scratch. In this paper, we propose a "counter-empirical attacking" mechanism that generates "attacking" behavior traces which try to break the empirical rules of the scoring system. An adversarial "enhancer" is then applied to evaluate the scoring system and find the improvement strategy. By training on the adversarial learning problem, a proper scoring function can be learned that is robust to attacking activity traces attempting to violate the empirical criteria. Extensive experiments have been conducted on two scoring systems, including a shared computing resource platform and a financial credit system. The experimental results validate the effectiveness of our proposed framework.

Dynamic Association Learning of Self-Attention and Convolution in Image Restoration

  • Authors: Kui Jiang, Xuemei Jia, Wenxin Huang, Wenbin Wang, Zheng Wang, Junjun Jiang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05147
  • Pdf link: https://arxiv.org/pdf/2311.05147
  • Abstract
    CNNs and self-attention have achieved great success in multimedia applications for dynamic association learning of self-attention and convolution in image restoration. However, CNNs have at least two shortcomings: 1) a limited receptive field; 2) static sliding-window weights at inference, unable to cope with content diversity. In view of the advantages and disadvantages of CNNs and self-attention, this paper proposes an association learning method that exploits the advantages and suppresses the shortcomings of each, so as to achieve high-quality and efficient inpainting. We observe that rain distribution reflects the degradation location and degree, beyond the rain distribution prediction itself. Thus, we propose to refine background textures with the predicted degradation prior in an association learning manner. As a result, we accomplish image deraining by associating rain streak removal and background recovery, where an image deraining network and a background recovery network are designed for the two subtasks. The key part of association learning is a novel multi-input attention module (MAM). It generates the degradation prior and produces the degradation mask according to the predicted rainy distribution. Benefiting from the global correlation calculation of self-attention (SA), MAM can extract informative complementary components from the rainy input with the degradation mask and thus help accurate texture restoration. Meanwhile, SA tends to aggregate feature maps with self-attention importance, but convolution diversifies them to focus on local textures. A hybrid fusion network involves one residual Transformer branch and one encoder-decoder branch. The former takes a few learnable tokens as input and stacks multi-head attention and feed-forward networks to encode global features of the image. The latter, conversely, leverages the multi-scale encoder-decoder to represent contextual knowledge.

Global error estimates of high-order fully decoupled schemes for the Cahn-Hilliard-Navier-Stokes model of Two-Phase Incompressible Flows

  • Authors: Xiaoli Li, Nan Zheng, Jie Shen, Zhengguang Liu
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2311.05162
  • Pdf link: https://arxiv.org/pdf/2311.05162
  • Abstract
    In this paper we construct new fully decoupled and high-order implicit-explicit (IMEX) schemes for the two-phase incompressible flows based on the new generalized scalar auxiliary variable approach with optimal energy approximation (EOP-GSAV) for Cahn-Hilliard equation and consistent splitting method for Navier-Stokes equation. These schemes are linear, fully decoupled, unconditionally energy stable, only require solving a sequence of elliptic equations with constant coefficients at each time step, and provide a new technique to preserve the consistency between original energy and modified energy. We derive that numerical solutions of these schemes are uniformly bounded without any restriction on time step size. Furthermore, we carry out a rigorous error analysis for the first-order scheme and establish optimal global error estimates for the phase function, velocity and pressure in two and three-dimensional cases. Numerical examples are presented to validate the proposed schemes.

BrainNetDiff: Generative AI Empowers Brain Network Generation via Multimodal Diffusion Model

  • Authors: Yongcheng Zong, Shuqiang Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05199
  • Pdf link: https://arxiv.org/pdf/2311.05199
  • Abstract
    Brain network analysis has emerged as a pivotal method for gaining a deeper understanding of brain functions and disease mechanisms. Despite the existence of various network construction approaches, shortcomings persist in the learning of correlations between structural and functional brain imaging data. In light of this, we introduce a novel method called BrainNetDiff, which combines a multi-head Transformer encoder to extract relevant features from fMRI time series with a conditional latent diffusion model for brain network generation. Leveraging a conditional prompt and a fusion attention mechanism, this method significantly improves the accuracy and stability of brain network generation. To the best of our knowledge, this represents the first framework that employs diffusion for the fusion of multimodal brain imaging and brain network generation from images to graphs. We validate the applicability of this framework in the construction of brain networks across healthy and neurologically impaired cohorts using an authentic dataset. Experimental results demonstrate the significant effectiveness of the proposed method on downstream disease classification tasks. These findings emphasize the prospective value of this work in brain network research, particularly its key significance for neuroimaging analysis and disease diagnosis. This research provides a valuable reference for the processing of multimodal brain imaging data and introduces a novel, efficient solution to the field of neuroimaging.

Whisper in Focus: Enhancing Stuttered Speech Classification with Encoder Layer Optimization

  • Authors: Huma Ameer, Seemab Latif, Rabia Latif, Sana Mukhtar
  • Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2311.05203
  • Pdf link: https://arxiv.org/pdf/2311.05203
  • Abstract
    In recent years, advancements in the field of speech processing have led to cutting-edge deep learning algorithms with immense potential for real-world applications. The automated identification of stuttered speech is one such application that researchers are addressing by employing deep learning techniques. Recently, researchers have utilized Wav2vec2.0, a speech recognition model, to classify disfluency types in stuttered speech. Although Wav2vec2.0 has shown commendable results, its ability to generalize across all disfluency types is limited. In addition, since its base model uses 12 encoder layers, it is considered a resource-intensive model. Our study unravels the capabilities of Whisper for the classification of disfluency types in stuttered speech. We make notable contributions in three pivotal areas: enhancing the quality of the SEP-28k benchmark dataset, exploring Whisper for classification, and introducing an efficient encoder layer freezing strategy. The optimized Whisper model achieves an average F1-score of 0.81, which attests to its capabilities. This study also reveals the significance of deeper encoder layers in the identification of disfluency types, as the results demonstrate their greater contribution compared to the initial layers. This research represents a substantial contribution, shifting the emphasis towards an efficient solution and thereby driving prospective innovation.

From "What" to "When" -- a Spiking Neural Network Predicting Rare Events and Time to their Occurrence

  • Authors: Mikhail Kiselev
  • Subjects: Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2311.05210
  • Pdf link: https://arxiv.org/pdf/2311.05210
  • Abstract
    In reinforcement learning (RL) tasks, the ability to predict receiving a reward in the near or more distant future means the ability to evaluate the current state as more or less close to the target state (labelled by the reward signal). In the present work, we utilize a spiking neural network (SNN) to predict the time to the next target event (the reward, in the case of RL). In the context of SNNs, events are represented as spikes emitted by network neurons or input nodes. It is assumed that target events are indicated by spikes emitted by a special network input node. Using a description of the current state encoded in the form of spikes from the other input nodes, the network should predict the approximate time of the next target event. This paper presents a novel approach to learning the corresponding predictive model with an SNN consisting of leaky integrate-and-fire (LIF) neurons. The proposed method leverages specially designed local synaptic plasticity rules and a novel columnar-layered SNN architecture. Similar to our previous works, this study places a strong emphasis on the hardware-friendliness of the proposed models, ensuring their efficient implementation on modern and future neuroprocessors. The approach was tested on a simple reward prediction task in the context of one of the RL benchmark ATARI games, ping-pong. It was demonstrated that the SNN described in this paper gives superior prediction accuracy in comparison with precise machine learning techniques, such as decision tree algorithms and convolutional neural networks.
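
For readers unfamiliar with the neuron model, a discrete-time leaky integrate-and-fire unit takes only a few lines: the membrane potential leaks toward rest, integrates input current, and resets after emitting a spike at threshold. The sketch below shows plain LIF dynamics with arbitrary constants; the paper's plasticity rules and columnar-layered architecture are not reproduced.

```python
import numpy as np

def simulate_lif(input_current, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Discrete-time leaky integrate-and-fire neuron: the membrane potential
    leaks, integrates input, and emits a spike (then resets) whenever it
    crosses the threshold."""
    v, spikes = 0.0, []
    for i in input_current:
        v += dt / tau * (-v + i)         # leaky integration step
        if v >= v_th:
            spikes.append(1)
            v = v_reset                  # reset after the spike
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(1)
spike_train = simulate_lif(rng.uniform(0.0, 2.5, size=200))
print(spike_train.sum(), "spikes in 200 steps")
```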

ConRad: Image Constrained Radiance Fields for 3D Generation from a Single Image

  • Authors: Senthil Purushwalkam, Nikhil Naik
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05230
  • Pdf link: https://arxiv.org/pdf/2311.05230
  • Abstract
    We present a novel method for reconstructing 3D objects from a single RGB image. Our method leverages the latest image generation models to infer the hidden 3D structure while remaining faithful to the input image. While existing methods obtain impressive results in generating 3D models from text prompts, they do not provide an easy approach for conditioning on input RGB data. Naïve extensions of these methods often lead to improper alignment in appearance between the input image and the 3D reconstructions. We address these challenges by introducing Image Constrained Radiance Fields (ConRad), a novel variant of neural radiance fields. ConRad is an efficient 3D representation that explicitly captures the appearance of an input image in one viewpoint. We propose a training algorithm that leverages the single RGB image in conjunction with pretrained Diffusion Models to optimize the parameters of a ConRad representation. Extensive experiments show that ConRad representations can simplify preservation of image details while producing a realistic 3D reconstruction. Compared to existing state-of-the-art baselines, we show that our 3D reconstructions remain more faithful to the input and produce more consistent 3D models while demonstrating significantly improved quantitative performance on a ShapeNet object benchmark.

Latent Task-Specific Graph Network Simulators

  • Authors: Philipp Dahlinger, Niklas Freymuth, Michael Volpp, Tai Hoang, Gerhard Neumann
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05256
  • Pdf link: https://arxiv.org/pdf/2311.05256
  • Abstract
    Simulating dynamic physical interactions is a critical challenge across multiple scientific domains, with applications ranging from robotics to material science. For mesh-based simulations, Graph Network Simulators (GNSs) pose an efficient alternative to traditional physics-based simulators. Their inherent differentiability and speed make them particularly well-suited for inverse design problems. Yet, adapting to new tasks from limited available data is an important aspect of real-world applications that current methods struggle with. We frame mesh-based simulation as a meta-learning problem and use a recent Bayesian meta-learning method to improve GNSs' adaptability to new scenarios by leveraging context data and handling uncertainties. Our approach, the latent task-specific graph network simulator, uses non-amortized task posterior approximations to sample latent descriptions of unknown system properties. Additionally, we leverage movement primitives for efficient full trajectory prediction, effectively addressing the issue of accumulating errors encountered by previous auto-regressive methods. We validate the effectiveness of our approach through various experiments, performing on par with or better than established baseline methods. Movement primitives further allow us to accommodate various types of context data, as demonstrated through the utilization of point clouds during inference. By combining GNSs with meta-learning, we bring them closer to real-world applicability, particularly in scenarios with smaller datasets.

Finding Software Vulnerabilities in Open-Source C Projects via Bounded Model Checking

  • Authors: Janislley Oliveira de Sousa, Bruno Carvalho de Farias, Thales Araujo da Silva, Eddie Batista de Lima Filho, Lucas C. Cordeiro
  • Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2311.05281
  • Pdf link: https://arxiv.org/pdf/2311.05281
  • Abstract
    Computer-based systems have solved several domain problems, including industrial, military, educational, and wearable ones. Nevertheless, such arrangements need high-quality software to guarantee security and safety, as both are mandatory for modern software products. We advocate that bounded model-checking techniques can efficiently detect vulnerabilities in general software systems. However, such an approach struggles to scale up and verify extensive code bases. Consequently, we have developed and evaluated a methodology to verify large software systems using a state-of-the-art bounded model checker. In particular, we pre-process input source-code files and guide the respective model checker to explore them systematically. Moreover, the proposed scheme includes a function-wise prioritization strategy, which readily provides results for code entities according to a scale of importance. Experimental results using a real implementation of the proposed methodology show that it can efficiently verify large software systems. Besides, it presents low peak memory allocation when executed. We have evaluated our approach by verifying twelve popular open-source C projects, where we found real software vulnerabilities that their developers confirmed.

VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis

  • Authors: Sen Wang, Wei Zhang, Stefano Gasperini, Shun-Cheng Wu, Nassir Navab
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05289
  • Pdf link: https://arxiv.org/pdf/2311.05289
  • Abstract
    Creating high-quality view synthesis is essential for immersive applications but continues to be problematic, particularly in indoor environments and for real-time deployment. Current techniques frequently require extensive computational time for both training and rendering, and often produce less-than-ideal 3D representations due to inadequate geometric structuring. To overcome this, we introduce VoxNeRF, a novel approach that leverages volumetric representations to enhance the quality and efficiency of indoor view synthesis. Firstly, VoxNeRF constructs a structured scene geometry and converts it into a voxel-based representation. We employ multi-resolution hash grids to adaptively capture spatial features, effectively managing occlusions and the intricate geometry of indoor scenes. Secondly, we propose a unique voxel-guided efficient sampling technique. This innovation selectively focuses computational resources on the most relevant portions of ray segments, substantially reducing optimization time. We validate our approach against three public indoor datasets and demonstrate that VoxNeRF outperforms state-of-the-art methods. Remarkably, it achieves these gains while reducing both training and rendering times, surpassing even Instant-NGP in speed and bringing the technology closer to real-time.

Do personality tests generalize to Large Language Models?

  • Authors: Florian E. Dorner, Tom Sühr, Samira Samadi, Augustin Kelava
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05297
  • Pdf link: https://arxiv.org/pdf/2311.05297
  • Abstract
    With large language models (LLMs) appearing to behave increasingly human-like in text-based interactions, it has become popular to attempt to evaluate various properties of these models using tests originally designed for humans. While re-using existing tests is a resource-efficient way to evaluate LLMs, careful adjustments are usually required to ensure that test results are even valid across human sub-populations. Thus, it is not clear to what extent different tests' validity generalizes to LLMs. In this work, we provide evidence that LLMs' responses to personality tests systematically deviate from typical human responses, implying that these results cannot be interpreted in the same way as human test results. Concretely, reverse-coded items (e.g. "I am introverted" vs "I am extraverted") are often both answered affirmatively by LLMs. In addition, variation across different prompts designed to "steer" LLMs to simulate particular personality types does not follow the clear separation into five independent personality factors from human samples. In light of these results, we believe it is important to pay more attention to tests' validity for LLMs before drawing strong conclusions about potentially ill-defined concepts like LLMs' "personality".
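
The reverse-coded-item finding can be made concrete with a toy consistency check: on a 1-to-5 Likert scale, a respondent who answers a statement and its reversal coherently should produce scores summing to roughly 6. The hypothetical snippet below measures the deviation from that identity; the response values and item indices are illustrative, not data from the paper.

```python
import numpy as np

def reverse_coded_deviation(responses, pairs, scale_max=5):
    """For each (item, reversed_item) pair on a 1..scale_max Likert scale, a
    consistent respondent satisfies r_item + r_reversed ≈ scale_max + 1.
    Returns the mean absolute deviation from that identity; the abstract
    reports that LLMs often affirm both items, inflating this value."""
    devs = [abs(responses[a] + responses[b] - (scale_max + 1))
            for a, b in pairs]
    return float(np.mean(devs))

# Hypothetical responses to "I am introverted" (item 0) and
# "I am extraverted" (item 1).
human = {0: 2, 1: 4}   # consistent: 2 + 4 = 6, deviation 0
llm   = {0: 5, 1: 5}   # affirms both reverse-coded items, deviation 4
pairs = [(0, 1)]
print(reverse_coded_deviation(human, pairs), reverse_coded_deviation(llm, pairs))
```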

Reliable and Efficient Data Collection in UAV-based IoT Networks

  • Authors: Poorvi Joshi (1), Alakesh Kalita (2), Mohan Gurusamy (1) ((1) National University of Singapore, (2) Singapore University of Technology and Design)
  • Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2311.05303
  • Pdf link: https://arxiv.org/pdf/2311.05303
  • Abstract
    The Internet of Things (IoT) involves sensors for monitoring and wireless networks for efficient communication. However, resource-constrained IoT devices and limitations in existing wireless technologies hinder its full potential. Integrating Unmanned Aerial Vehicles (UAVs) into IoT networks can address some challenges by expanding network coverage, providing security, and bringing computing closer to IoT devices. Nevertheless, effective data collection in UAV-assisted IoT networks is hampered by several factors, including dynamic UAV behavior, environmental variables, connectivity instability, and security considerations. In this survey, we first explore UAV-based IoT networks, focusing on communication and networking aspects. Next, we cover various UAV-based data collection methods and their advantages and disadvantages, followed by a discussion on performance metrics for data collection. As this article primarily emphasizes reliable and efficient data collection in UAV-assisted IoT networks, we briefly discuss existing research on data accuracy and consistency, network connectivity, and data security and privacy to provide insights into reliable data collection. Additionally, we discuss efficient data collection strategies in UAV-based IoT networks, covering trajectory and path planning, collision avoidance, sensor network clustering, data aggregation, UAV swarm formations, and artificial intelligence for optimization. We also present two use cases of UAVs as a service for enhancing data collection reliability and efficiency. Finally, we discuss future challenges in data collection for UAV-assisted IoT networks.

Data Valuation and Detections in Federated Learning

  • Authors: Wenqian Li, Shuran Fu, Fengrui Zhang, Yan Pang
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2311.05304
  • Pdf link: https://arxiv.org/pdf/2311.05304
  • Abstract
    Federated Learning (FL) enables collaborative model training without sharing raw data, demanding abundant, high-quality data for optimal model performance. Fair and efficient data evaluation is a fundamental issue for incentivizing clients to provide more high-quality data. Meanwhile, it is likely that only a subset of clients and datasets are relevant for a learning task, while the rest may have a negative impact on model training. This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant data samples without a pre-specified training algorithm. Our proposed approach, FedBary, utilizes the Wasserstein distance within the federated context, offering a pioneering solution for data valuation that provides transparent data evaluation and efficient computation of the Wasserstein barycenter, mitigating reliance on validation data. We conduct extensive empirical experiments and theoretical analysis, demonstrating the promise of this valuation metric.
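
The intuition behind distance-based valuation can be shown with SciPy's 1-D Wasserstein distance: clients whose data distribution sits far from a task's reference distribution are plausibly less relevant. This sketch ignores the paper's privacy machinery and barycenter computation entirely; the client names and distributions are made up to illustrate the ranking idea only.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)      # stand-in for a task distribution

clients = {
    "relevant":   rng.normal(0.1, 1.0, size=1000),
    "shifted":    rng.normal(2.0, 1.0, size=1000),
    "irrelevant": rng.uniform(-5, 5, size=1000),
}

# Rank clients by how close their data distribution is to the reference;
# FedBary does this with Wasserstein barycenters and without exposing raw
# data, which this 1-D sketch does not attempt.
for name, data in clients.items():
    print(name, round(wasserstein_distance(reference, data), 3))
```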

Finite Element Analysis of the Oseen eigenvalue problem

  • Authors: Felipe Lepe, Gonzalo Rivera, Jesus Vellojin
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2311.05321
  • Pdf link: https://arxiv.org/pdf/2311.05321
  • Abstract
    We propose and analyze a finite element method for the Oseen eigenvalue problem. This problem is an extension of the Stokes eigenvalue problem, where the presence of the convective term leads to a non-symmetric problem and hence to complex eigenvalues and eigenfunctions. With the aid of compact operator theory, we prove that convergence holds for inf-sup stable finite elements, and hence error estimates for the eigenvalues and eigenfunctions are derived. We also propose an a posteriori error estimator, which proves to be reliable and efficient. We report a series of numerical tests in two and three dimensions in order to assess the performance of the method and the proposed estimator.

Numerical Modeling for Shoulder Injury Detection Using Microwave Imaging

  • Authors: Sahar Borzooei, Pierre-Henri Tournier, Victorita Dolean, Christian Pichot, Nadine Joachimowicz, Helene Roussel, Claire Migliaccio
  • Subjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP)
  • Arxiv link: https://arxiv.org/abs/2311.05322
  • Pdf link: https://arxiv.org/pdf/2311.05322
  • Abstract
    A portable imaging system for the on-site detection of shoulder injury is necessary to identify its extent and prevent its progression to a severe condition. Here, we first introduce a microwave tomography system that uses state-of-the-art numerical modeling and parallel computing for imaging different tissues in the shoulder. The results show that the proposed method is capable of accurately detecting and localizing rotator cuff tears of different sizes. In the next step, an efficient design in terms of computing time and complexity is proposed to detect variations in the injured model with respect to the healthy model. The method is based on finite element discretization and uses parallel preconditioners from the domain decomposition method to accelerate computations. It is implemented using the open source FreeFEM software.

Spatial Attention-based Distribution Integration Network for Human Pose Estimation

  • Authors: Sihan Gao, Jing Zhu, Xiaoxuan Zhuang, Zhaoyue Wang, Qijin Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05323
  • Pdf link: https://arxiv.org/pdf/2311.05323
  • Abstract
    In recent years, human pose estimation has made significant progress through the implementation of deep learning techniques. However, these techniques still face limitations when confronted with challenging scenarios, including occlusion, diverse appearances, variations in illumination, and overlap. To cope with such drawbacks, we present the Spatial Attention-based Distribution Integration Network (SADI-NET) to improve the accuracy of localization in such situations. Our network consists of three efficient modules: the receptive fortified module (RFM), the spatial fusion module (SFM), and the distribution learning module (DLM). Building upon the classic HourglassNet architecture, we replace the basic block with our proposed RFM. The RFM incorporates a dilated residual block and an attention mechanism to expand receptive fields while enhancing sensitivity to spatial information. In addition, the SFM incorporates multi-scale characteristics by employing both global and local attention mechanisms. Furthermore, the DLM, inspired by residual log-likelihood estimation (RLE), reconfigures a predicted heatmap using a trainable distribution weight. To determine the efficacy of our model, we conducted extensive experiments on the MPII and LSP benchmarks. Notably, our model obtained a remarkable 92.10% accuracy on the MPII test dataset, demonstrating significant improvements over existing models and establishing state-of-the-art performance.

Atom: Neural Traffic Compression with Spatio-Temporal Graph Neural Networks

  • Authors: Paul Almasan, Krzysztof Rusek, Shihan Xiao, Xiang Shi, Xiangle Cheng, Albert Cabellos-Aparicio, Pere Barlet-Ros
  • Subjects: Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2311.05337
  • Pdf link: https://arxiv.org/pdf/2311.05337
  • Abstract
    Storing network traffic data is key to efficient network management; however, it is becoming more challenging and costly due to the ever-increasing data transmission rates, traffic volumes, and connected devices. In this paper, we explore the use of neural architectures for network traffic compression. Specifically, we consider a network scenario with multiple measurement points in a network topology. Such measurements can be interpreted as multiple time series that exhibit spatial and temporal correlations induced by network topology, routing, or user behavior. We present *Atom*, a neural traffic compression method that leverages spatial and temporal correlations present in network traffic. *Atom* implements a customized spatio-temporal graph neural network design that effectively exploits both types of correlations simultaneously. The experimental results show that *Atom* can outperform GZIP's compression ratios by 50%-65% on three real-world networks.
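
The abstract's headline number is a comparison against GZIP. As a point of reference only, here is a minimal sketch of how such a GZIP baseline ratio can be measured on a synthetic traffic matrix; the data shape and values are made up, and Atom's neural codec itself is not reproduced:

```python
import gzip
import numpy as np

# Synthetic stand-in for traffic measurements: 288 five-minute intervals
# across 20 measurement points (purely illustrative data).
traffic = np.random.default_rng(0).poisson(lam=100, size=(288, 20)).astype(np.float32)

raw = traffic.tobytes()
ratio = len(raw) / len(gzip.compress(raw))
print(f"GZIP compression ratio: {ratio:.2f}x")
```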

Accelerated Shapley Value Approximation for Data Evaluation

  • Authors: Lauren Watson, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, Rik Sarkar
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05346
  • Pdf link: https://arxiv.org/pdf/2311.05346
  • Abstract
    Data valuation has found various applications in machine learning, such as data filtering, efficient learning, and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. Despite its popularity, the Shapley value is computationally expensive even to approximate, as it requires repeatedly training models on different subsets of the data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings, including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation. Based on this idea, we describe $\delta$-Shapley -- a strategy that uses only small subsets for the approximation. Experiments show that this approach preserves the approximate value and rank of data while achieving a speedup of up to 9.9x. For pre-trained networks, the approach yields further efficiency, enabling accurate evaluation with small subsets.
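
A toy sketch of the small-subset idea: a Monte Carlo Shapley estimate in which every sampled coalition is capped at a small size. The utility function, model choice, and cap are illustrative stand-ins; the paper derives appropriate subset sizes from its convergence analysis rather than fixing them ad hoc.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def utility(X_tr, y_tr, X_val, y_val):
    """Validation accuracy of a model trained on the given coalition."""
    if len(np.unique(y_tr)) < 2:                    # can't fit on <2 classes
        return 0.0
    model = LogisticRegression(max_iter=200)
    return model.fit(X_tr, y_tr).score(X_val, y_val)

def small_subset_shapley(X, y, X_val, y_val, i, max_size=10, n_samples=200, seed=0):
    """Monte Carlo Shapley estimate for point i using only small coalitions."""
    rng = np.random.default_rng(seed)
    others = np.delete(np.arange(len(X)), i)
    total = 0.0
    for _ in range(n_samples):
        k = int(rng.integers(0, max_size))           # capped coalition size
        S = rng.choice(others, size=k, replace=False)
        with_i = np.append(S, i)
        # marginal contribution of point i to this small coalition
        total += utility(X[with_i], y[with_i], X_val, y_val) \
               - utility(X[S], y[S], X_val, y_val)
    return total / n_samples
```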

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

  • Authors: Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05348
  • Pdf link: https://arxiv.org/pdf/2311.05348
  • Abstract
    Recent advances such as LLaVA and Mini-GPT4 have successfully integrated visual information into LLMs, yielding inspiring outcomes and giving rise to a new generation of multi-modal LLMs, or MLLMs. Nevertheless, these methods struggle with hallucinations and the mutual interference between tasks. To tackle these problems, we propose u-LLaVA, an efficient and accurate approach to adapting to downstream tasks by utilizing an LLM as a bridge that connects multiple expert models. First, we incorporate the modality alignment module and multi-task modules into the LLM. Then, we reorganize or rebuild multi-type public datasets to enable efficient modality alignment and instruction following. Finally, task-specific information is extracted from the trained LLM and provided to different modules for solving downstream tasks. The overall framework is simple and effective, and achieves state-of-the-art performance across multiple benchmarks. We also publicly release our model, the generated data, and the code base.

SIRE: scale-invariant, rotation-equivariant estimation of artery orientations using graph neural networks

  • Authors: Dieuwertje Alblas, Julian Suk, Christoph Brune, Kak Khee Yeung, Jelmer M. Wolterink
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05400
  • Pdf link: https://arxiv.org/pdf/2311.05400
  • Abstract
    Blood vessel orientation as visualized in 3D medical images is an important descriptor of its geometry that can be used for centerline extraction and subsequent segmentation and visualization. Arteries appear at many scales and levels of tortuosity, and determining their exact orientation is challenging. Recent works have used 3D convolutional neural networks (CNNs) for this purpose, but CNNs are sensitive to varying vessel sizes and orientations. We present SIRE: a scale-invariant, rotation-equivariant estimator for local vessel orientation. SIRE is modular and can generalise due to symmetry preservation. SIRE consists of a gauge equivariant mesh CNN (GEM-CNN) operating on multiple nested spherical meshes with different sizes in parallel. The features on each mesh are a projection of image intensities within the corresponding sphere. These features are intrinsic to the sphere and, in combination with the GEM-CNN, lead to SO(3)-equivariance. Approximate scale invariance is achieved by weight sharing and use of a symmetric maximum function to combine multi-scale predictions. Hence, SIRE can be trained with arbitrarily oriented vessels with varying radii to generalise to vessels with a wide range of calibres and tortuosity. We demonstrate the efficacy of SIRE using three datasets containing vessels of varying scales: the vascular model repository (VMR), the ASOCA coronary artery set, and a set of abdominal aortic aneurysms (AAAs). We embed SIRE in a centerline tracker which accurately tracks AAAs, regardless of the data SIRE is trained with. Moreover, SIRE can be used to track coronary arteries, even when trained only with AAAs. In conclusion, by incorporating SO(3) and scale symmetries, SIRE can determine the orientations of vessels outside of the training domain, forming a robust and data-efficient solution to geometric analysis of blood vessels in 3D medical images.

Diffusion Based Causal Representation Learning

  • Authors: Amir Mohammad Karimi Mamaghan, Andrea Dittadi, Stefan Bauer, Karl Henrik Johansson, Francesco Quinzan
  • Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
  • Arxiv link: https://arxiv.org/abs/2311.05421
  • Pdf link: https://arxiv.org/pdf/2311.05421
  • Abstract
    Causal reasoning can be considered a cornerstone of intelligent systems. Having access to an underlying causal graph comes with the promise of cause-effect estimation and the identification of efficient and safe interventions. However, learning causal representations remains a major challenge, due to the complexity of many real-world systems. Previous works on causal representation learning have mostly focused on Variational Auto-Encoders (VAE). These methods only provide representations from a point estimate, and they are unsuitable for handling high dimensions. To overcome these problems, we propose a new Diffusion-based Causal Representation Learning (DCRL) algorithm. This algorithm uses diffusion-based representations for causal discovery. DCRL offers access to infinite-dimensional latent codes, which encode different levels of information. In a first proof of principle, we investigate the use of DCRL for causal representation learning. We further demonstrate experimentally that this approach performs comparably well in identifying the causal structure and causal variables.

Taxonomy for Resident Space Objects in LEO: A Deep Learning Approach

  • Authors: Marta Guimarães, Cláudia Soares, Chiara Manfletti
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05430
  • Pdf link: https://arxiv.org/pdf/2311.05430
  • Abstract
    The increasing number of resident space objects (RSOs) has raised concerns about the risk of collisions and catastrophic incidents for all direct and indirect users of space. To mitigate this issue, it is essential to have a good understanding of the various RSOs in orbit and their behaviour. A well-established taxonomy defining several classes of RSOs is a critical step in achieving this understanding. This taxonomy helps assign objects to specific categories based on their main characteristics, leading to better tracking services. Furthermore, a well-established taxonomy can facilitate research and analysis by providing a common language and framework for understanding the factors that influence RSO behaviour in space. These factors, in turn, help design more efficient and effective strategies for space traffic management. Our work proposes a new taxonomy for RSOs focusing on the low Earth orbit regime to enhance space traffic management. In addition, we present a deep learning-based model that uses an autoencoder architecture to reduce the features representing the characteristics of the RSOs. The autoencoder generates a lower-dimensional representation that is then explored using techniques such as Uniform Manifold Approximation and Projection to identify fundamental clusters of RSOs based on their unique characteristics. This approach captures the complex and non-linear relationships between the features and the identified RSO classes. Our proposed taxonomy and model offer a significant contribution to the ongoing efforts to mitigate the overall risks posed by the increasing number of RSOs in orbit.
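
A compact sketch of the pipeline the abstract describes, assuming PyTorch and the umap-learn package; the feature dimensions, layer sizes, and training budget are illustrative placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import umap   # pip install umap-learn

X = torch.randn(500, 16)                           # stand-in RSO feature vectors
enc = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))
dec = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 16))
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

for _ in range(200):                               # reconstruction training
    opt.zero_grad()
    loss = nn.functional.mse_loss(dec(enc(X)), X)
    loss.backward()
    opt.step()

Z = enc(X).detach().numpy()                        # lower-dimensional codes
embedding = umap.UMAP(n_components=2).fit_transform(Z)
print(embedding.shape)                             # (500, 2), ready for clustering
```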

Airfoil generation and feature extraction using the conditional VAE-WGAN-gp

  • Authors: Kazuo Yonekura, Yuki Tomori, Katsuyuki Suzuki
  • Subjects: Computational Engineering, Finance, and Science (cs.CE)
  • Arxiv link: https://arxiv.org/abs/2311.05445
  • Pdf link: https://arxiv.org/pdf/2311.05445
  • Abstract
    A machine learning method was applied to solve an inverse airfoil design problem. A conditional VAE-WGAN-gp model, which couples a conditional variational autoencoder (VAE) with a Wasserstein generative adversarial network with gradient penalty (WGAN-gp), is proposed as an airfoil generation method and compared with the WGAN-gp and VAE models. The VAEGAN model couples the VAE and GAN models, which enables feature extraction in GAN models. In airfoil generation tasks, where the goal is to generate airfoil shapes that satisfy lift coefficient requirements, VAE is known to outperform WGAN-gp with respect to the accuracy of reproducing the lift coefficient, whereas GAN outperforms VAE with respect to the smoothness and variety of generated shapes. In this study, VAE-WGAN-gp demonstrated good performance in all three aspects. The latent distribution was also studied to compare the feature extraction ability of the proposed method.

Designing ship hull forms using generative adversarial networks

  • Authors: Kazuo Yonekura, Kotaro Omori, Xinran Qi, Katsuyuki Suzuki
  • Subjects: Computational Engineering, Finance, and Science (cs.CE)
  • Arxiv link: https://arxiv.org/abs/2311.05470
  • Pdf link: https://arxiv.org/pdf/2311.05470
  • Abstract
    We propose a GAN-based method to generate a ship hull form. Unlike mathematical hull forms that require geometrical parameters to generate ship hull forms, the proposed method requires desirable ship performance parameters, i.e., the drag coefficient and tonnage. Ship owners' requirements generally focus on ship performance rather than the geometry itself. Hence, the proposed model is useful for obtaining a ship hull form from an owner's requirements. The GAN model was trained on a ship hull form dataset generated from the generalized Wigley hull form. The proposed method was evaluated through numerical experiments and successfully generated ship data with small errors.

On the Complexity of the Virtual Network Embedding in Specific Tree Topologies

  • Authors: Sergey Pankratov, Vitaly Aksenov, Stefan Schmid
  • Subjects: Computational Complexity (cs.CC); Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2311.05474
  • Pdf link: https://arxiv.org/pdf/2311.05474
  • Abstract
    Virtual networks are an innovative abstraction that extends cloud computing concepts to the network: by supporting bandwidth reservations between compute nodes (e.g., virtual machines), virtual networks can provide predictable performance to distributed and communication-intensive cloud applications. However, in order to make the most efficient use of the shared resources, the Virtual Network Embedding (VNE) problem has to be solved: a virtual network should be mapped onto the given physical network so that resource reservations are minimized. The problem has been studied intensively and is known to be NP-hard in general. In this paper, we revisit this problem and consider it on specific topologies, as they often arise in practice. More precisely, we study the weighted version of the VNE problem: we consider a virtual weighted network of a specific topology which we want to embed onto a weighted network with capacities and a specific topology. As for topologies, we consider the most fundamental and commonly used ones: line, star, $2$-tiered star, oversubscribed $2$-tiered star, and tree, in addition to arbitrary topologies. We show that the VNE problem typically remains NP-hard even in these more specialized cases; however, a polynomial-time algorithm sometimes exists: for example, embedding an oversubscribed $2$-tiered star onto a tree is polynomial, while embedding an arbitrary $2$-tiered star is not.

Anytime-Constrained Reinforcement Learning

  • Authors: Jeremy McMahan, Xiaojin Zhu
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2311.05511
  • Pdf link: https://arxiv.org/pdf/2311.05511
  • Abstract
    We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or by the absolute budget. Given our hardness results, our approximation guarantees are the best possible in terms of tractability under worst-case analysis.
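
A minimal illustration of the cost augmentation the abstract builds on: tracking cumulative cost next to the state and masking actions that would breach the budget at this step. This sketch only enforces immediate feasibility; the paper's reduction must also guarantee that a feasible continuation exists. All names here are illustrative.

```python
def feasible_actions(state, cum_cost, budget, actions, cost_fn):
    """Actions whose immediate cost keeps the cumulative cost within budget."""
    return [a for a in actions if cum_cost + cost_fn(state, a) <= budget]

# usage on a toy augmented state with unit-cost actions and budget 3
print(feasible_actions("s0", cum_cost=2, budget=3, actions=["a", "b"],
                       cost_fn=lambda s, a: 1))   # -> ['a', 'b']
```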

BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis

  • Authors: Hao-Bin Duan, Miao Wang, Jin-Chuan Shi, Xu-Chuan Chen, Yan-Pei Cao
  • Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05521
  • Pdf link: https://arxiv.org/pdf/2311.05521
  • Abstract
    Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avatar synthesis, deployable in a standard polygon rasterization pipeline. Our approach extracts deformable multi-layer meshes from learned isosurfaces of the head and computes expression-, pose-, and view-dependent appearances that can be baked into static textures for efficient rasterization. We thus propose a three-stage pipeline for neural head avatar synthesis, which includes learning continuous deformation, manifold, and radiance fields, extracting layered meshes and textures, and fine-tuning texture details with differential rasterization. Experimental results demonstrate that our representation generates synthesis results of comparable quality to other state-of-the-art methods while significantly reducing the inference time required. We further showcase various head avatar synthesis results from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates.

L-WaveBlock: A Novel Feature Extractor Leveraging Wavelets for Generative Adversarial Networks

  • Authors: Mirat Shah, Vansh Jain, Anmol Chokshi, Guruprasad Parasnis, Pramod Bide
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2311.05548
  • Pdf link: https://arxiv.org/pdf/2311.05548
  • Abstract
    Generative Adversarial Networks (GANs) have risen to prominence in the field of deep learning, facilitating the generation of realistic data from random noise. The effectiveness of GANs often depends on the quality of feature extraction, a critical aspect of their architecture. This paper introduces L-WaveBlock, a novel and robust feature extractor that combines the Discrete Wavelet Transform (DWT) with deep learning methodologies. L-WaveBlock is designed to quicken the convergence of GAN generators while simultaneously enhancing their performance. The paper demonstrates the utility of L-WaveBlock across three datasets: a road satellite imagery dataset, the CelebA dataset, and the GoPro dataset, showcasing its ability to make feature extraction easier and more efficient. By utilizing the DWT, L-WaveBlock efficiently captures intricate structural and textural details, and further partitions feature maps into orthogonal subbands across multiple scales while preserving essential information. It not only leads to faster convergence, but also gives competent results on every dataset. The proposed method achieves an Inception Score of 3.6959 and a Structural Similarity Index of 0.4261 on the maps dataset, and a Peak Signal-to-Noise Ratio of 29.05 and a Structural Similarity Index of 0.874 on the CelebA dataset. On the image denoising dataset the proposed method performs competitively with, though not better than, the state-of-the-art, while still converging faster than conventional methods. With this, L-WaveBlock emerges as a robust and efficient tool for enhancing GAN-based image generation, demonstrating superior convergence speed and competitive performance across multiple datasets for image resolution, image generation, and image denoising.

Exploiting Neural-Network Statistics for Low-Power DNN Inference

  • Authors: Lennart Bamberg, Ardalan Najafi, Alberto Garcia-Ortiz
  • Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
  • Arxiv link: https://arxiv.org/abs/2311.05557
  • Pdf link: https://arxiv.org/pdf/2311.05557
  • Abstract
    Specialized compute blocks have been developed for efficient DNN execution. However, due to the vast amount of data and parameter movements, the interconnects and on-chip memories form another bottleneck, impairing power and performance. This work addresses this bottleneck by contributing a low-power technique for edge-AI inference engines that combines overhead-free coding with a statistical analysis of the data and parameters of neural networks. Our approach reduces the interconnect and memory power consumption by up to 80% for state-of-the-art benchmarks while providing additional power savings for the compute blocks by up to 39%. These power improvements are achieved with no loss of accuracy and negligible hardware cost.

SigScatNet: A Siamese + Scattering based Deep Learning Approach for Signature Forgery Detection and Similarity Assessment

  • Authors: Anmol Chokshi, Vansh Jain, Rajas Bhope, Sudhir Dhage
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2311.05579
  • Pdf link: https://arxiv.org/pdf/2311.05579
  • Abstract
    The surge in counterfeit signatures has inflicted widespread inconveniences and formidable challenges for both individuals and organizations. This groundbreaking research paper introduces SigScatNet, an innovative solution to combat this issue by harnessing the potential of a Siamese deep learning network, bolstered by Scattering wavelets, to detect signature forgery and assess signature similarity. The Siamese Network empowers us to ascertain the authenticity of signatures through a comprehensive similarity index, enabling precise validation and comparison. Remarkably, the integration of Scattering wavelets endows our model with exceptional efficiency, rendering it light enough to operate seamlessly on cost-effective hardware systems. To validate the efficacy of our approach, extensive experimentation was conducted on two open-sourced datasets: the ICDAR SigComp Dutch dataset and the CEDAR dataset. The experimental results demonstrate the practicality and resounding success of our proposed SigScatNet, yielding an unparalleled Equal Error Rate of 3.689% with the ICDAR SigComp Dutch dataset and an astonishing 0.0578% with the CEDAR dataset. Through the implementation of SigScatNet, our research spearheads a new state-of-the-art in signature analysis in terms of EER scores and computational efficiency, offering an advanced and accessible solution for detecting forgery and quantifying signature similarities. By employing cutting-edge Siamese deep learning and Scattering wavelets, we provide a robust framework that paves the way for secure and efficient signature verification systems.
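
A hedged sketch of the ingredients named in the title, assuming the kymatio package is installed: scattering-transform features compared via a cosine similarity index. The similarity head and image sizes are stand-ins, not the paper's trained Siamese network.

```python
import torch
from kymatio.torch import Scattering2D   # pip install kymatio

scattering = Scattering2D(J=2, shape=(64, 64))     # fixed wavelet feature extractor

def embed(img):
    """Flattened scattering coefficients of a (1, 64, 64) grayscale crop."""
    return scattering(img).flatten()

a = torch.rand(1, 64, 64)                          # stand-in signature crops
b = torch.rand(1, 64, 64)
sim = torch.nn.functional.cosine_similarity(embed(a), embed(b), dim=0)
print(float(sim))   # similarity index; a threshold would flag forgeries
```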

A Coefficient Makes SVRG Effective

  • Authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05589
  • Pdf link: https://arxiv.org/pdf/2311.05589
  • Abstract
    Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlight, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show that $\alpha$-SVRG better optimizes neural networks, consistently reducing training loss compared to both the baseline and standard SVRG across various architectures and image classification datasets. We hope our findings encourage further exploration of variance reduction techniques in deep learning. Code is available at https://github.com/davidyyd/alpha-SVRG.
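
A toy sketch of the $\alpha$-SVRG update on least squares: the snapshot correction term is scaled by a coefficient $\alpha$ that decays linearly over epochs, as the abstract describes. The schedule constants and problem are illustrative; the paper applies this to deep networks.

```python
import numpy as np

def alpha_svrg(A, b, epochs=20, inner=50, lr=0.01, alpha0=0.5, seed=0):
    """alpha-SVRG on the least-squares objective 0.5/n * ||A w - b||^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)
    grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]   # per-sample gradient
    full_grad = lambda w: A.T @ (A @ w - b) / n
    for e in range(epochs):
        alpha = alpha0 * (1 - e / epochs)            # linearly decayed coefficient
        w_snap, g_snap = w.copy(), full_grad(w)      # SVRG snapshot
        for _ in range(inner):
            i = rng.integers(n)
            # alpha scales the variance-reduction correction term;
            # alpha = 1 recovers standard SVRG, alpha = 0 plain SGD
            g = grad_i(w, i) - alpha * (grad_i(w_snap, i) - g_snap)
            w -= lr * g
    return w

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 10))
b = A @ rng.normal(size=10)
w = alpha_svrg(A, b)
print(np.linalg.norm(A @ w - b))   # residual shrinks over training
```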

LLM Augmented Hierarchical Agents

  • Authors: Bharat Prakash, Tim Oates, Tinoosh Mohsenin
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05596
  • Pdf link: https://arxiv.org/pdf/2311.05596
  • Abstract
    Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning). Humans can generate and execute plans with temporally-extended actions and quickly learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have this same ability. Recently, LLMs have been shown to encode a tremendous amount of knowledge about the world and to perform impressive in-context learning and reasoning. However, using LLMs to solve real-world problems is hard because they are not grounded in the current task. In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. Instead of relying on LLMs entirely, we use them to guide a high-level policy, making learning significantly more sample efficient. This approach is evaluated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks. We show that agents trained using our approach outperform other baseline methods and, once trained, don't need access to LLMs during deployment.

FogROS2-Sky: Optimizing Latency and Cost for Multi-Cloud Robot Applications

  • Authors: Kaiyuan Chen, Kush Hari, Rohil Khare, Charlotte Le, Trinity Chung, Jaimyn Drake, Karthik Dharmarajan, Simeon Adebola, Jeffrey Ichnowski, John Kubiatowicz, Ken Goldberg
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2311.05600
  • Pdf link: https://arxiv.org/pdf/2311.05600
  • Abstract
    This paper studies the cost-performance tradeoffs in cloud robotics with heterogeneous cloud service providers, which have complex pricing models and varying application requirements. We present FogROS2-Sky, a cost-efficient open source robotics platform that offloads unmodified ROS2 applications to multiple cloud providers and enables fine-grained cost analysis of ROS2 applications' communication with multiple cloud providers. As each provider offers different options for CPU, GPU, memory, and latency, it can be very difficult for users to decide which to choose. FogROS2-Sky includes an optimization algorithm which either finds the best available hardware specification that fulfills the user's latency and cost constraints or reports that such a specification does not exist. We use FogROS2-Sky to perform time-cost analysis on three robotics applications: visual SLAM, grasp planning, and motion planning. We are able to sample different hardware setups at nearly half the cost while still creating cost and latency functions suitable for the optimizer. We also evaluate the optimizer's efficacy for these applications with the Pareto frontier and show that the optimizer selects efficient hardware configurations to balance cost and latency. Videos and code are available on the website https://sites.google.com/view/fogros2-sky
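
A minimal sketch of the decision rule the abstract attributes to the optimizer: among candidate hardware specifications, return the cheapest one satisfying the user's latency and cost constraints, or report that none exists. The spec values and field names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Spec:
    name: str
    cost_per_hr: float    # USD per hour (illustrative)
    latency_ms: float     # measured application latency (illustrative)

def choose(specs, max_latency_ms, max_cost_per_hr):
    """Cheapest spec meeting both constraints, or None if none exists."""
    feasible = [s for s in specs
                if s.latency_ms <= max_latency_ms and s.cost_per_hr <= max_cost_per_hr]
    return min(feasible, key=lambda s: s.cost_per_hr) if feasible else None

specs = [Spec("cpu-small", 0.10, 900.0),
         Spec("cpu-large", 0.40, 320.0),
         Spec("gpu", 1.20, 90.0)]
print(choose(specs, max_latency_ms=400.0, max_cost_per_hr=1.00))  # -> cpu-large
```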

Diffusion-Generative Multi-Fidelity Learning for Physical Simulation

  • Authors: Zheng Wang, Shibo Li, Shikai Fang, Shandian Zhe
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05606
  • Pdf link: https://arxiv.org/pdf/2311.05606
  • Abstract
    Multi-fidelity surrogate learning is important for physical simulation applications in that it avoids running numerical solvers from scratch, which is known to be costly, and it uses multi-fidelity examples for training, greatly reducing the cost of data collection. Despite the variety of existing methods, they all build a model that maps the input parameters outright to the solution output. Inspired by the recent breakthrough in generative models, we take an alternative view and consider the solution output as generated from random noise. We develop a diffusion-generative multi-fidelity (DGMF) learning method based on stochastic differential equations (SDEs), where generation is a continuous denoising process. We propose a conditional score model to control the solution generation by the input parameters and the fidelity. By conditioning on additional inputs (temporal or spatial variables), our model can efficiently learn and predict multi-dimensional solution arrays. Our method naturally unifies discrete and continuous fidelity modeling. The advantages of our method in several typical applications point to a promising new direction for multi-fidelity learning.

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

  • Authors: Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo
  • Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2311.05610
  • Pdf link: https://arxiv.org/pdf/2311.05610
  • Abstract
    Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs utilization of 70.5% when training a 13B model.

Keyword: faster

MathNAS: If Blocks Have a Role in Mathematical Architecture Design

  • Authors: Wang Qinsi, Ke Jinhan, Liang Zhi, Zhang Sihai
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.04943
  • Pdf link: https://arxiv.org/pdf/2311.04943
  • Abstract
    Neural Architecture Search (NAS) has emerged as a favoured method for unearthing effective neural architectures. The recent development of large models has intensified the demand for faster search speeds and more accurate search results. However, designing large models by NAS is challenging due to the dramatic increase of the search space and the associated huge performance evaluation cost. Consider a typical modular search space widely used in NAS, in which a neural architecture consists of $m$ block nodes and a block node has $n$ alternative blocks. Facing the space containing $n^m$ candidate networks, existing NAS methods attempt to find the best one by searching and evaluating candidate networks directly. Different from the general strategy that takes architecture search as a whole problem, we propose a novel divide-and-conquer strategy that makes use of the modular nature of the search space. Here, we introduce MathNAS, a general NAS framework based on mathematical programming. In MathNAS, the performances of the $mn$ possible building blocks in the search space are calculated first, and then the performance of a network is directly predicted based on the performances of its building blocks. Although estimating block performances involves network training, just as network performance evaluation does in existing NAS methods, predicting network performance is completely training-free and thus extremely fast. In contrast to the $n^m$ candidate networks to evaluate in existing NAS methods, which require training and a formidable computational burden, there are only $mn$ possible blocks to handle in MathNAS. Therefore, our approach effectively reduces the complexity of network performance evaluation. Our code is available at https://github.com/wangqinsi1/MathNAS.
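
A toy sketch of the divide-and-conquer reduction: if a network's predicted score decomposes additively over its $m$ block nodes, the best of the $n^m$ candidate networks falls out of a per-node argmax over the $mn$ block scores. The score table below is random and purely illustrative; the actual framework solves a constrained mathematical program over measured block performances.

```python
import numpy as np

m, n = 5, 4                                # m block nodes, n candidates per node
rng = np.random.default_rng(0)
block_score = rng.random((m, n))           # stand-in for per-block measurements

best_blocks = block_score.argmax(axis=1)   # independent best choice per node
predicted = block_score.max(axis=1).sum()  # additive network-level prediction
print(best_blocks, predicted)              # searches m*n entries, not n**m nets
```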

Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics

  • Authors: Soo Min Kwon, Zekai Zhang, Dogyoon Song, Laura Balzano, Qing Qu
  • Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05061
  • Pdf link: https://arxiv.org/pdf/2311.05061
  • Abstract
    Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we aim to reduce this complexity by studying the learning dynamics of overparameterized deep networks. By extensively studying these dynamics, we unveil that the weight matrices of various architectures exhibit a low-dimensional structure. This finding implies that we can compress the networks by restricting training to a small subspace. We take a step towards developing a principled approach for compressing deep networks by studying deep linear models. We demonstrate that the principal components of deep linear models are fitted incrementally but within a small subspace, and use these insights to compress deep linear networks by decreasing the width of their intermediate layers. Remarkably, we observe that with a particular choice of initialization, the compressed network converges faster than the original network, consistently yielding smaller recovery errors throughout all iterations of gradient descent. We substantiate this observation by developing a theory focused on the deep matrix factorization problem, and by conducting empirical evaluations on deep matrix sensing. Finally, we demonstrate how our compressed model can enhance the utility of deep nonlinear models. Overall, we observe that our compression technique accelerates the training process by more than 2x, without compromising model quality.
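
A minimal sketch of the compression idea under a low-rank assumption: factor a layer's weight matrix through a rank-$r$ subspace, shrinking the effective width. The rank and the synthetic low-rank weights here are arbitrary choices; the paper derives the subspace from the observed learning dynamics rather than a post-hoc SVD.

```python
import numpy as np

def compress_layer(W, r):
    """Replace W (d_out x d_in) by two rank-r factors: W ~= A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * s[:r]      # d_out x r
    B = Vt[:r]                # r x d_in; forward pass becomes A @ (B @ x)
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 32)) @ rng.normal(size=(32, 512))  # low-rank weights
A, B = compress_layer(W, r=32)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))   # near-zero relative error
```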

TransReg: Cross-transformer as auto-registration module for multi-view mammogram mass detection

  • Authors: Hoang C. Nguyen, Chi Phan, Hieu H. Pham
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05192
  • Pdf link: https://arxiv.org/pdf/2311.05192
  • Abstract
    Screening mammography is the most widely used method for early breast cancer detection, significantly reducing mortality rates. The integration of information from multi-view mammograms enhances radiologists' confidence and diminishes false-positive rates, since they can examine dual views of the same breast to cross-reference the existence and location of a lesion. Inspired by this, we present TransReg, a Computer-Aided Detection (CAD) system designed to exploit the relationship between the craniocaudal (CC) and mediolateral oblique (MLO) views. The system includes a cross-transformer to model the relationship between the regions of interest (RoIs) extracted by a siamese Faster RCNN network for mass detection. Our work is the first to integrate a cross-transformer into an object detection framework to model the relation between ipsilateral views. Our experimental evaluation on the DDSM and VinDr-Mammo datasets shows that TransReg, equipped with SwinT as a feature extractor, achieves state-of-the-art performance. Specifically, at a false positive rate per image of 0.5, TransReg using SwinT reaches a recall of 83.3% on the DDSM dataset and 79.7% on the VinDr-Mammo dataset. Furthermore, we conduct a comprehensive analysis to demonstrate that the cross-transformer can function as an auto-registration module, aligning the masses in dual views and utilizing this information to inform final predictions, replicating the diagnostic workflow of expert radiologists.

Probable Approximate Coordination

  • Authors: Ariel Livshits, Yoram Moses
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2311.05368
  • Pdf link: https://arxiv.org/pdf/2311.05368
  • Abstract
    We study the problem of how to coordinate the actions of independent agents in a distributed system where message arrival times are unbounded, but are determined by an exponential probability distribution. Asynchronous protocols executed in such a model are guaranteed to succeed with probability 1. We demonstrate a case in which the best asynchronous protocol can be improved on significantly. Specifically, we focus on the task of performing actions by different agents in a linear temporal order -- a problem known in the literature as Ordered Response. In asynchronous systems, ensuring such an ordering requires the construction of a message chain that passes through each acting agent, in order. Solving Ordered Response in this way in our model will terminate in time that grows linearly in the number of participating agents $n$, in expectation. We show that relaxing the specification slightly allows for a significant saving in time. Namely, if Ordered Response should be guaranteed with high probability (arbitrarily close to 1), it is possible to significantly shorten the expected execution time of the protocol. We present two protocols that adhere to the relaxed specification. One of our protocols executes exponentially faster than a message chain, when the number of participating agents $n$ is large, while the other is roughly quadratically faster. For small values of $n$, it is also possible to achieve similar results by using a hybrid protocol.

L-WaveBlock: A Novel Feature Extractor Leveraging Wavelets for Generative Adversarial Networks

  • Authors: Mirat Shah, Vansh Jain, Anmol Chokshi, Guruprasad Parasnis, Pramod Bide
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2311.05548
  • Pdf link: https://arxiv.org/pdf/2311.05548
  • Abstract
    Generative Adversarial Networks (GANs) have risen to prominence in the field of deep learning, facilitating the generation of realistic data from random noise. The effectiveness of GANs often depends on the quality of feature extraction, a critical aspect of their architecture. This paper introduces L-WaveBlock, a novel and robust feature extractor that combines the Discrete Wavelet Transform (DWT) with deep learning methodologies. L-WaveBlock is designed to quicken the convergence of GAN generators while simultaneously enhancing their performance. The paper demonstrates the utility of L-WaveBlock across three datasets: a road satellite imagery dataset, the CelebA dataset, and the GoPro dataset, showcasing its ability to make feature extraction easier and more efficient. By utilizing the DWT, L-WaveBlock efficiently captures intricate structural and textural details, and further partitions feature maps into orthogonal subbands across multiple scales while preserving essential information. It not only leads to faster convergence, but also gives competent results on every dataset. The proposed method achieves an Inception Score of 3.6959 and a Structural Similarity Index of 0.4261 on the maps dataset, and a Peak Signal-to-Noise Ratio of 29.05 and a Structural Similarity Index of 0.874 on the CelebA dataset. On the image denoising dataset the proposed method performs competitively with, though not better than, the state-of-the-art, while still converging faster than conventional methods. With this, L-WaveBlock emerges as a robust and efficient tool for enhancing GAN-based image generation, demonstrating superior convergence speed and competitive performance across multiple datasets for image resolution, image generation, and image denoising.

Real-Time Neural Rasterization for Large Scenes

  • Authors: Jeffrey Yunfan Liu, Yun Chen, Ze Yang, Jingkang Wang, Sivabalan Manivasagam, Raquel Urtasun
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2311.05607
  • Pdf link: https://arxiv.org/pdf/2311.05607
  • Abstract
    We propose a new method for realistic real-time novel-view synthesis (NVS) of large scenes. Existing neural rendering methods generate realistic results, but primarily work for small scale scenes (<50 square meters) and have difficulty at large scale (>10000 square meters). Traditional graphics-based rasterization rendering is fast for large scenes but lacks realism and requires expensive manually created assets. Our approach combines the best of both worlds by taking a moderate-quality scaffold mesh as input and learning a neural texture field and shader to model view-dependent effects to enhance realism, while still using the standard graphics pipeline for real-time rendering. Our method outperforms existing neural rendering methods, providing at least 30x faster rendering with comparable or better realism for large self-driving and drone scenes. Our work is the first to enable real-time rendering of large real-world scenes.

Keyword: mobile

Auto deep learning for bioacoustic signals

  • Authors: Giulio Tosato, Abdelrahman Shehata, Joshua Janssen, Kees Kamp, Pramatya Jati, Dan Stowell
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2311.04945
  • Pdf link: https://arxiv.org/pdf/2311.04945
  • Abstract
    This study investigates the potential of automated deep learning to enhance the accuracy and efficiency of multi-class classification of bird vocalizations, compared against traditional manually-designed deep learning models. Using the Western Mediterranean Wetland Birds dataset, we investigated the use of AutoKeras, an automated machine learning framework, to automate neural architecture search and hyperparameter tuning. Comparative analysis validates our hypothesis that the AutoKeras-derived model consistently outperforms traditional models like MobileNet, ResNet50 and VGG16. Our approach and findings underscore the transformative potential of automated deep learning for advancing bioacoustics research and models. In fact, the automated techniques eliminate the need for manual feature engineering and model design while improving performance. This study illuminates best practices in sampling, evaluation and reporting to enhance reproducibility in this nascent field. All the code used is available at https://github.com/giuliotosato/AutoKeras-bioacustic. Keywords: AutoKeras; automated deep learning; audio classification; Wetlands Bird dataset; comparative analysis; bioacoustics; validation dataset; multi-class classification; spectrograms.
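
A minimal usage sketch of the AutoKeras workflow the study relies on, assuming spectrograms are already prepared as image arrays; the data, class count, and trial budget below are placeholders, not the study's configuration.

```python
import numpy as np
import autokeras as ak   # pip install autokeras

x_train = np.random.rand(100, 64, 64, 1).astype("float32")  # fake spectrograms
y_train = np.random.randint(0, 5, size=100)                 # 5 stand-in classes

clf = ak.ImageClassifier(max_trials=3, overwrite=True)  # NAS + HPO search budget
clf.fit(x_train, y_train, epochs=2)
model = clf.export_model()       # best found Keras model for later reuse
```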

Digital Twin-based 3D Map Management for Edge-assisted Device Pose Tracking in Mobile AR

  • Authors: Conghao Zhou, Jie Gao, Mushu Li, Nan Cheng, Xuemin Shen, Weihua Zhuang
  • Subjects: Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2311.04997
  • Pdf link: https://arxiv.org/pdf/2311.04997
  • Abstract
    Edge-device collaboration has the potential to facilitate compute-intensive device pose tracking for resource-constrained mobile augmented reality (MAR) devices. In this paper, we devise a 3D map management scheme for edge-assisted MAR, wherein an edge server constructs and updates a 3D map of the physical environment by using the camera frames uploaded from an MAR device, to support local device pose tracking. Our objective is to minimize the uncertainty of device pose tracking by periodically selecting a proper set of uploaded camera frames and updating the 3D map. To cope with the dynamics of the uplink data rate and the user's pose, we formulate a Bayes-adaptive Markov decision process problem and propose a digital twin (DT)-based approach to solve the problem. First, a DT is designed as a data model to capture the time-varying uplink data rate, thereby supporting 3D map management. Second, utilizing extensive generated data provided by the DT, a model-based reinforcement learning algorithm is developed to manage the 3D map while adapting to these dynamics. Numerical results demonstrate that the designed DT outperforms Markov models in accurately capturing the time-varying uplink data rate, and our devised DT-based 3D map management scheme surpasses benchmark schemes in reducing device pose tracking uncertainty.

Automated Mobile Sensing Strategies Generation for Human Behaviour Understanding

  • Authors: Nan Gao, Zhuolei Yu, Chun Yu, Yuntao Wang, Flora D. Salim, Yuanchun Shi
  • Subjects: Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2311.05457
  • Pdf link: https://arxiv.org/pdf/2311.05457
  • Abstract
    Mobile sensing plays a crucial role in generating digital traces to understand human daily lives. However, studying behaviours like mood or sleep quality in smartphone users requires carefully designed mobile sensing strategies such as sensor selection and feature construction. This process is time-consuming, burdensome, and requires expertise in multiple domains. Furthermore, the resulting sensing framework lacks generalizability, making it difficult to apply to different scenarios. To address these challenges, we propose an automated mobile sensing strategy for human behaviour understanding. First, we establish a knowledge base and consolidate rules for effective feature construction, data collection, and model selection. Then, we introduce the multi-granular human behaviour representation and design procedures for leveraging large language models to generate strategies. Our approach is validated through blind comparative studies and usability evaluation. Ultimately, our approach holds the potential to revolutionise the field of mobile sensing and its applications.

Improving Human Legibility in Collaborative Robot Tasks through Augmented Reality and Workspace Preparation

  • Authors: Yi-Shiuan Tung, Matthew B. Luebbers, Alessandro Roncone, Bradley Hayes
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05562
  • Pdf link: https://arxiv.org/pdf/2311.05562
  • Abstract
    Understanding the intentions of human teammates is critical for safe and effective human-robot interaction. The canonical approach for human-aware robot motion planning is to first predict the human's goal or path, and then construct a robot plan that avoids collision with the human. This method can generate unsafe interactions if the human model and subsequent predictions are inaccurate. In this work, we present an algorithmic approach for both arranging the configuration of objects in a shared human-robot workspace, and projecting "virtual obstacles" in augmented reality, optimizing for legibility in a given task. These changes to the workspace result in more legible human behavior, improving robot predictions of human goals, thereby improving task fluency and safety. To evaluate our approach, we propose two user studies involving a collaborative tabletop task with a manipulator robot, and a warehouse navigation task with a mobile robot.

Keyword: pruning

There is no result

Keyword: diffusion

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

  • Authors: Prasad Gabbur
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.04938
  • Pdf link: https://arxiv.org/pdf/2311.04938
  • Abstract
    We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ and class-conditional models trained on ImageNet datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel.
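
A small worked sketch of the moment-matching constraint itself: a two-component, equal-weight Gaussian mixture whose first two moments reproduce a target Gaussian's $(\mu, \sigma^2)$. The split parameter $\delta$ is a free illustrative choice; the paper instead constrains the GMM to the DDPM forward marginals.

```python
import numpy as np

def matched_gmm(mu, sigma2, delta):
    """Two-component GMM with the same mean and variance as N(mu, sigma2)."""
    assert delta**2 < sigma2, "delta must satisfy delta^2 < sigma2"
    means = np.array([mu - delta, mu + delta])   # symmetric split around mu
    var = sigma2 - delta**2                      # shrunk per-component variance
    return means, np.full(2, var), np.full(2, 0.5)

means, vars_, weights = matched_gmm(0.0, 1.0, 0.5)
mix_mean = weights @ means
mix_var = weights @ (vars_ + means**2) - mix_mean**2
print(mix_mean, mix_var)    # -> 0.0 1.0, matching the target moments
```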

Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search

  • Authors: Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Yansong Tang, Wenwu Zhu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.04950
  • Pdf link: https://arxiv.org/pdf/2311.04950
  • Abstract
    Diffusion models have recently shown remarkable generation ability, achieving state-of-the-art performance in many tasks. However, the high computational cost is still a troubling problem for diffusion models. To tackle this problem, we propose to automatically remove the structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS). Specifically, given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which achieves on-par or even better performance than the teacher. Considering current diffusion models are based on UNet which naturally has a block-wise structure, we perform neural architecture search independently in each block, which largely reduces the search space. Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss. Concretely, during the search process, we block-wisely select the best subnet to avoid the unfairness brought by the global search strategy used in previous works. When retraining the searched architecture, we adopt a dynamic joint loss to maintain the consistency between supernet training and subnet retraining, which also provides informative objectives for each block and shortens the paths of gradient propagation. We demonstrate this joint loss can effectively improve model performance. We also prove the necessity of the dynamic adjustment of this loss. The experiments show that our method can achieve significant computational reduction, especially on latent diffusion models with about 50% MACs and Parameter reduction.

BrainNetDiff: Generative AI Empowers Brain Network Generation via Multimodal Diffusion Model

  • Authors: Yongcheng Zong, Shuqiang Wang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05199
  • Pdf link: https://arxiv.org/pdf/2311.05199
  • Abstract
    Brain network analysis has emerged as a pivotal method for gaining a deeper understanding of brain functions and disease mechanisms. Despite the existence of various network construction approaches, shortcomings persist in learning the correlations between structural and functional brain imaging data. In light of this, we introduce a novel method called BrainNetDiff, which combines a multi-head Transformer encoder to extract relevant features from fMRI time series with a conditional latent diffusion model for brain network generation. Leveraging a conditional prompt and a fusion attention mechanism, this method significantly improves the accuracy and stability of brain network generation. To the best of our knowledge, this represents the first framework that employs diffusion for the fusion of multimodal brain imaging and for brain network generation from images to graphs. We validate the applicability of this framework by constructing brain networks for healthy and neurologically impaired cohorts using an authentic dataset. Experimental results demonstrate the effectiveness of the proposed method on downstream disease classification tasks. These findings convincingly emphasize its prospective value in the field of brain network research, particularly its key significance for neuroimaging analysis and disease diagnosis. This research provides a valuable reference for the processing of multimodal brain imaging data and introduces a novel, efficient solution to the field of neuroimaging.

ConRad: Image Constrained Radiance Fields for 3D Generation from a Single Image

  • Authors: Senthil Purushwalkam, Nikhil Naik
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05230
  • Pdf link: https://arxiv.org/pdf/2311.05230
  • Abstract
    We present a novel method for reconstructing 3D objects from a single RGB image. Our method leverages the latest image generation models to infer the hidden 3D structure while remaining faithful to the input image. While existing methods obtain impressive results in generating 3D models from text prompts, they do not provide an easy approach for conditioning on input RGB data. Naïve extensions of these methods often lead to improper alignment in appearance between the input image and the 3D reconstructions. We address these challenges by introducing Image Constrained Radiance Fields (ConRad), a novel variant of neural radiance fields. ConRad is an efficient 3D representation that explicitly captures the appearance of an input image in one viewpoint. We propose a training algorithm that leverages the single RGB image in conjunction with pretrained Diffusion Models to optimize the parameters of a ConRad representation. Extensive experiments show that ConRad representations can simplify preservation of image details while producing a realistic 3D reconstruction. Compared to existing state-of-the-art baselines, we show that our 3D reconstructions remain more faithful to the input and produce more consistent 3D models while demonstrating significantly improved quantitative performance on a ShapeNet object benchmark.

Predicting the Position Uncertainty at the Time of Closest Approach with Diffusion Models

  • Authors: Marta Guimarães, Cláudia Soares, Chiara Manfletti
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05417
  • Pdf link: https://arxiv.org/pdf/2311.05417
  • Abstract
    The risk of collision between resident space objects has significantly increased in recent years. As a result, spacecraft collision avoidance procedures have become an essential part of satellite operations. To ensure safe and effective space activities, satellite owners and operators rely on constantly updated estimates of encounters. These estimates include the uncertainty associated with the position of each object at the expected TCA. These estimates are crucial in planning risk mitigation measures, such as collision avoidance manoeuvres. As the TCA approaches, the accuracy of these estimates improves, as both objects' orbit determination and propagation procedures are made for increasingly shorter time intervals. However, this improvement comes at the cost of taking place close to the critical decision moment. This means that safe avoidance manoeuvres might not be possible or could incur significant costs. Therefore, knowing the evolution of this variable in advance can be crucial for operators. This work proposes a machine learning model based on diffusion models to forecast the position uncertainty of objects involved in a close encounter, particularly for the secondary object (usually debris), which tends to be more unpredictable. We compare the performance of our model with other state-of-the-art solutions and a naïve baseline approach, showing that the proposed solution has the potential to significantly improve the safety and effectiveness of spacecraft operations.

Diffusion Based Causal Representation Learning

  • Authors: Amir Mohammad Karimi Mamaghan, Andrea Dittadi, Stefan Bauer, Karl Henrik Johansson, Francesco Quinzan
  • Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
  • Arxiv link: https://arxiv.org/abs/2311.05421
  • Pdf link: https://arxiv.org/pdf/2311.05421
  • Abstract
    Causal reasoning can be considered a cornerstone of intelligent systems. Having access to an underlying causal graph comes with the promise of cause-effect estimation and the identification of efficient and safe interventions. However, learning causal representations remains a major challenge, due to the complexity of many real-world systems. Previous works on causal representation learning have mostly focused on Variational Auto-Encoders (VAE). These methods only provide representations from a point estimate, and they are unsuitable for handling high dimensions. To overcome these problems, we propose a new Diffusion-based Causal Representation Learning (DCRL) algorithm. This algorithm uses diffusion-based representations for causal discovery. DCRL offers access to infinite-dimensional latent codes, which encode different levels of information. In a first proof of principle, we investigate the use of DCRL for causal representation learning. We further demonstrate experimentally that this approach performs comparably well in identifying the causal structure and causal variables.

Control3D: Towards Controllable Text-to-3D Generation

  • Authors: Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, Tao Mei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2311.05461
  • Pdf link: https://arxiv.org/pdf/2311.05461
  • Abstract
    Recent remarkable advances in large-scale text-to-image diffusion models have inspired a significant breakthrough in text-to-3D generation, pursuing 3D content creation solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively controlling and shaping the synthetic 3D content according to users' desired specifications (e.g., a sketch). To alleviate this issue, we present the first attempt at text-to-3D generation conditioned on an additional hand-drawn sketch, namely Control3D, which enhances controllability for users. In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF, encouraging each view of the 3D scene to align with the given text prompt and hand-drawn sketch. Moreover, we exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of the rendered image over the synthetic 3D scene. Such an estimated sketch, along with each sampled view, is further enforced to be geometrically consistent with the given sketch, pursuing better controllable text-to-3D generation. Through extensive experiments, we demonstrate that our proposal can generate accurate and faithful 3D scenes that align closely with the input text prompts and sketches.

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

  • Authors: Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2311.05463
  • Pdf link: https://arxiv.org/pdf/2311.05463
  • Abstract
    Recently, the multimedia community has witnessed the rise of diffusion models trained on large-scale multi-modal data for visual content creation, particularly in the field of text-to-image generation. In this paper, we propose a new task for "stylizing" text-to-image models, namely text-driven stylized image generation, that further enhances editability in content creation. Given an input text prompt and a style image, this task aims to produce stylized images that are semantically relevant to the input text prompt while being aligned with the style image in style. To achieve this, we present a new diffusion model (ControlStyle) built by upgrading a pre-trained text-to-image model with a trainable modulation network that enables additional conditioning on text prompts and style images. Moreover, diffusion style and content regularizations are simultaneously introduced to facilitate the learning of this modulation network with these diffusion priors, pursuing high-quality stylized text-to-image generation. Extensive experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results, surpassing a simple combination of a text-to-image model and conventional style transfer techniques.

3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models

  • Authors: Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Tao Mei
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2311.05464
  • Pdf link: https://arxiv.org/pdf/2311.05464
  • Abstract
    3D content creation via text-driven stylization has posed a fundamental challenge to the multimedia and graphics community. Recent advances in cross-modal foundation models (e.g., CLIP) have made this problem feasible. Those approaches commonly leverage CLIP to align the holistic semantics of a stylized mesh with the given text prompt. Nevertheless, it is not trivial to enable more controllable stylization of fine-grained details in 3D meshes solely based on such semantic-level cross-modal supervision. In this work, we propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D diffusion models. Technically, 3DStyle-Diffusion first parameterizes the texture of the 3D mesh into reflectance properties and scene lighting using implicit MLP networks. Meanwhile, an accurate depth map of each sampled view is obtained conditioned on the 3D mesh. Then, 3DStyle-Diffusion leverages a pre-trained controllable 2D diffusion model to guide the learning of rendered images, encouraging the synthesized image of each view to be semantically aligned with the text prompt and geometrically consistent with the depth map. This design elegantly integrates both image rendering via implicit MLP networks and the diffusion process of image synthesis in an end-to-end fashion, enabling high-quality fine-grained stylization of 3D meshes. We also build a new dataset derived from Objaverse and an evaluation protocol for this task. Through both qualitative and quantitative experiments, we validate the capability of our 3DStyle-Diffusion. Source code and data are available at https://github.com/yanghb22-fdu/3DStyle-Diffusion-Official.

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

  • Authors: Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05556
  • Pdf link: https://arxiv.org/pdf/2311.05556
  • Abstract
    Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs), requiring only ~32 A100 GPU training hours. This report further extends LCMs' potential in two aspects: First, by applying LoRA distillation to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded LCM's scope to larger models with significantly less memory consumption, achieving superior image generation quality. Second, we identify the LoRA parameters obtained through LCM distillation as a universal Stable-Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, thus representing a universally applicable accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers such as DDIM and DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that possesses strong generalization abilities. Project page: https://github.com/luosiallen/latent-consistency-model.
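
The project page above documents plug-and-play usage; as a minimal sketch, assuming a diffusers release that ships LCMScheduler (roughly v0.22+) and the Hugging Face repo id latent-consistency/lcm-lora-sdv1-5, accelerated sampling might look like:

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Load a Stable Diffusion checkpoint and swap in the LCM scheduler.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

# Plug in the LCM-LoRA acceleration module -- no further training needed.
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# Very few steps and low guidance are typical for LCM-style sampling.
image = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
image.save("astronaut.png")
```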

Bayesian Methods for Media Mix Modelling with shape and funnel effects

  • Authors: Javier Marin
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05587
  • Pdf link: https://arxiv.org/pdf/2311.05587
  • Abstract
    In recent years, significant progress in generative AI has highlighted the important role of physics-inspired models that utilize advanced mathematical concepts based on fundamental physics principles to enhance artificial intelligence capabilities. Among these models, those based on diffusion equations have greatly improved image quality. This study explores the potential uses of the Maxwell-Boltzmann equation, which forms the basis of the kinetic theory of gases, and the Michaelis-Menten model in Marketing Mix Modelling (MMM) applications. We propose incorporating these equations into Hierarchical Bayesian models to analyse consumer behaviour in the context of advertising. These equations excel at accurately describing the random dynamics in complex systems like social interactions and consumer-advertising interactions.
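
For illustration, the Michaelis-Menten form mentioned above maps spend to a saturating response. The sketch below is the classical equation only; the parameter names vmax and km are generic, and the paper embeds such curves inside Hierarchical Bayesian models rather than using them standalone:

```python
import numpy as np

def michaelis_menten(spend: np.ndarray, vmax: float, km: float) -> np.ndarray:
    """Saturating response curve: rises roughly linearly for small spend,
    approaches the ceiling vmax as spend grows; km is the spend level at
    which half of vmax is reached (the classical half-saturation constant)."""
    return vmax * spend / (km + spend)

# Example: diminishing returns on media spend.
spend = np.array([0.0, 10.0, 50.0, 200.0, 1000.0])
print(michaelis_menten(spend, vmax=100.0, km=50.0))
# -> approximately [0, 16.7, 50, 80, 95.2]
```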

Diffusion-Generative Multi-Fidelity Learning for Physical Simulation

  • Authors: Zheng Wang, Shibo Li, Shikai Fang, Shandian Zhe
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05606
  • Pdf link: https://arxiv.org/pdf/2311.05606
  • Abstract
    Multi-fidelity surrogate learning is important for applications related to physical simulation, as it avoids running numerical solvers from scratch, which is known to be costly, and it uses multi-fidelity examples for training, greatly reducing the cost of data collection. Despite the variety of existing methods, they all build a model that maps the input parameters outright to the solution output. Inspired by the recent breakthrough in generative models, we take an alternative view and consider the solution output as generated from random noise. We develop a diffusion-generative multi-fidelity (DGMF) learning method based on stochastic differential equations (SDEs), where the generation is a continuous denoising process. We propose a conditional score model to control the solution generation by the input parameters and the fidelity. By conditioning on additional inputs (temporal or spatial variables), our model can efficiently learn and predict multi-dimensional solution arrays. Our method naturally unifies discrete and continuous fidelity modeling. The advantage of our method in several typical applications points to a promising new direction for multi-fidelity learning.

Keyword: adaptive

Digital Twin-based 3D Map Management for Edge-assisted Device Pose Tracking in Mobile AR

  • Authors: Conghao Zhou, Jie Gao, Mushu Li, Nan Cheng, Xuemin Shen, Weihua Zhuang
  • Subjects: Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2311.04997
  • Pdf link: https://arxiv.org/pdf/2311.04997
  • Abstract
    Edge-device collaboration has the potential to facilitate compute-intensive device pose tracking for resource-constrained mobile augmented reality (MAR) devices. In this paper, we devise a 3D map management scheme for edge-assisted MAR, wherein an edge server constructs and updates a 3D map of the physical environment by using the camera frames uploaded from an MAR device, to support local device pose tracking. Our objective is to minimize the uncertainty of device pose tracking by periodically selecting a proper set of uploaded camera frames and updating the 3D map. To cope with the dynamics of the uplink data rate and the user's pose, we formulate a Bayes-adaptive Markov decision process problem and propose a digital twin (DT)-based approach to solve the problem. First, a DT is designed as a data model to capture the time-varying uplink data rate, thereby supporting 3D map management. Second, utilizing extensive generated data provided by the DT, a model-based reinforcement learning algorithm is developed to manage the 3D map while adapting to these dynamics. Numerical results demonstrate that the designed DT outperforms Markov models in accurately capturing the time-varying uplink data rate, and our devised DT-based 3D map management scheme surpasses benchmark schemes in reducing device pose tracking uncertainty.

Mental Health Diagnosis in the Digital Age: Harnessing Sentiment Analysis on Social Media Platforms upon Ultra-Sparse Feature Content

  • Authors: Haijian Shao, Ming Zhu, Shengjie Zhai
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2311.05075
  • Pdf link: https://arxiv.org/pdf/2311.05075
  • Abstract
    Amid growing global mental health concerns, particularly among vulnerable groups, natural language processing offers tremendous potential for early detection and intervention of mental disorders via analysis of people's postings and discussions on social media platforms. However, ultra-sparse training data, often due to vast vocabularies and low-frequency words, hinders analysis accuracy. Multi-labeling and co-occurrence of symptoms may also blur the boundaries when distinguishing similar or co-related disorders. To address these issues, we propose a novel semantic feature preprocessing technique with a three-fold structure: 1) mitigating feature sparsity with a weak classifier, 2) adapting feature dimension with modulus loops, and 3) deep-mining and extending features among the contexts. With enhanced semantic features, we train a machine learning model to predict and classify mental disorders. We utilize the Reddit Mental Health Dataset 2022 to examine conditions such as Anxiety, Borderline Personality Disorder (BPD), and Bipolar Disorder (BD) and present solutions to the data sparsity challenge, highlighted by 99.81% non-zero elements. After applying our preprocessing technique, the feature sparsity decreases to 85.4%. Overall, our methods, when compared to seven benchmark models, demonstrate significant performance improvements: 8.0% in accuracy, 0.069 in precision, 0.093 in recall, 0.102 in F1 score, and 0.059 in AUC. This research provides foundational insights for mental health prediction and monitoring, offering innovative solutions to the challenges associated with ultra-sparse data features and intricate multi-label classification in the domain of mental health analysis.

Dynamic Adaptation Gains for Nonlinear Systems with Unmatched Uncertainties

  • Authors: Brett T. Lopez, Jean-Jacques Slotine
  • Subjects: Systems and Control (eess.SY); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05082
  • Pdf link: https://arxiv.org/pdf/2311.05082
  • Abstract
    We present a new direct adaptive control approach for nonlinear systems with unmatched and matched uncertainties. The method relies on adjusting the adaptation gains of individual unmatched parameters whose adaptation transients would otherwise destabilize the closed-loop system. The approach also guarantees the restoration of the adaptation gains to their nominal values and can readily incorporate direct adaptation laws for matched uncertainties. The proposed framework is general as it only requires stabilizability for all possible models.

Differentiable Fluid Physics Parameter Identification Via Stirring

  • Authors: Wenqiang Xu, Dongzhe Zheng, Yutong Li, Jieji Ren, Cewu Lu
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05137
  • Pdf link: https://arxiv.org/pdf/2311.05137
  • Abstract
    Fluid interactions permeate daily human activities, with properties like density and viscosity playing pivotal roles in household tasks. While density estimation is straightforward through Archimedes' principle, viscosity poses a more intricate challenge, especially given the varied behaviors of Newtonian and non-Newtonian fluids. These fluids, which differ in their stress-strain relationships, are delineated by specific constitutive models such as the Carreau, Cross, and Herschel-Bulkley models, each possessing unique viscosity parameters. This study introduces a novel differentiable fitting framework, DiffStir, tailored to identify key physics parameters via the common daily operation of stirring. By employing a robotic arm for stirring and harnessing a differentiable Material Point Method (diffMPM)-based simulator, the framework can determine fluid parameters by matching observations from both the simulator and the real world. Recognizing the distinct preferences of the aforementioned constitutive models for specific fluids, an online strategy was adopted to adaptively select the most fitting model based on real-world data. Additionally, we propose a refining neural network to bridge the sim-to-real gap and mitigate sensor noise-induced inaccuracies. Comprehensive experiments were conducted to validate the efficacy of DiffStir, showcasing its precision in parameter estimation when benchmarked against reported literature values. More experiments and videos can be found in the supplementary materials and on the website: https://sites.google.com/view/diffstir.
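
For reference, the Carreau constitutive model named above gives apparent viscosity as a function of shear rate; the sketch below is the textbook form of the model, not code from DiffStir:

```python
import numpy as np

def carreau_viscosity(shear_rate, eta0, eta_inf, lam, n):
    """Carreau model:
    eta = eta_inf + (eta0 - eta_inf) * (1 + (lam * shear_rate)**2)**((n - 1) / 2)
    eta0 / eta_inf are the zero- and infinite-shear viscosities, lam is a
    relaxation time constant, and n is the power-law index (n < 1 gives
    shear-thinning behaviour, typical of many household fluids)."""
    return eta_inf + (eta0 - eta_inf) * (
        1.0 + (lam * shear_rate) ** 2
    ) ** ((n - 1.0) / 2.0)

# Viscosity drops as stirring speed (shear rate) increases for n < 1.
rates = np.array([0.1, 1.0, 10.0, 100.0])
print(carreau_viscosity(rates, eta0=1.0, eta_inf=0.001, lam=1.0, n=0.5))
```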

SCAAT: Improving Neural Network Interpretability via Saliency Constrained Adaptive Adversarial Training

  • Authors: Rui Xu, Wenkang Qin, Peixiang Huang, Haowang, Lin Luo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05143
  • Pdf link: https://arxiv.org/pdf/2311.05143
  • Abstract
    Deep Neural Networks (DNNs) are expected to provide explanations that help users understand their black-box predictions. The saliency map is a common form of explanation, illustrating feature attributions as a heatmap, but it suffers from noise that hampers the identification of important features. In this paper, we propose a model-agnostic learning method called Saliency Constrained Adaptive Adversarial Training (SCAAT) to improve the quality of DNN interpretability. By constructing adversarial samples under the guidance of the saliency map, SCAAT effectively eliminates most noise and makes saliency maps sparser and more faithful without any modification to the model architecture. We apply SCAAT to multiple DNNs and evaluate the quality of the generated saliency maps on various natural and pathological image datasets. Evaluations across different domains and metrics show that SCAAT significantly improves the interpretability of DNNs by providing more faithful saliency maps without sacrificing their predictive power.
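
The abstract does not specify the construction, but one plausible instantiation of saliency-guided adversarial samples is sketched below purely for illustration; the FGSM-style step, the low-saliency masking rule, and the epsilon value are all assumptions, not the paper's specification:

```python
import torch
import torch.nn.functional as F

def saliency_guided_adversarial(model, x, y, eps=2 / 255, quantile=0.8):
    """FGSM-style perturbation restricted to low-saliency pixels, so training
    on these samples pushes the model to ignore regions its own saliency
    map deems unimportant (one way to sparsify saliency maps)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    saliency = grad.abs().amax(dim=1, keepdim=True)          # per-pixel saliency
    thresh = torch.quantile(saliency.flatten(1), quantile, dim=1)
    mask = (saliency < thresh.view(-1, 1, 1, 1)).float()     # low-saliency region
    return (x + eps * grad.sign() * mask).clamp(0, 1).detach()
```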

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

  • Authors: Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2311.05152
  • Pdf link: https://arxiv.org/pdf/2311.05152
  • Abstract
    In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.

Adaptive-Labeling for Enhancing Remote Sensing Cloud Understanding

  • Authors: Jay Gala, Sauradip Nag, Huichou Huang, Ruirui Liu, Xiatian Zhu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05198
  • Pdf link: https://arxiv.org/pdf/2311.05198
  • Abstract
    Cloud analysis is a critical component of weather and climate science, impacting various sectors like disaster management. However, achieving fine-grained cloud analysis, such as cloud segmentation, in remote sensing remains challenging due to the inherent difficulties in obtaining accurate labels, leading to significant labeling errors in training data. Existing methods often assume the availability of reliable segmentation annotations, limiting their overall performance. To address this inherent limitation, we introduce an innovative model-agnostic Cloud Adaptive-Labeling (CAL) approach, which operates iteratively to enhance the quality of training data annotations and consequently improve the performance of the learned model. Our methodology commences by training a cloud segmentation model using the original annotations. Subsequently, it introduces a trainable pixel intensity threshold for adaptively labeling the cloud training images on the fly. The newly generated labels are then employed to fine-tune the model. Extensive experiments conducted on multiple standard cloud segmentation benchmarks demonstrate the effectiveness of our approach in significantly boosting the performance of existing segmentation models. Our CAL method establishes new state-of-the-art results when compared to a wide array of existing alternatives.
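
A heavily simplified sketch of one CAL round follows; the grayscale intensity proxy, the hard threshold tau, and the omission of how tau itself is trained are all assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def cal_round(model, images, tau, optimizer):
    """One adaptive-labeling round: refresh cloud masks on the fly by
    thresholding pixel intensity at tau, then fine-tune the segmentation
    model on the newly generated labels."""
    with torch.no_grad():
        intensity = images.mean(dim=1, keepdim=True)   # crude grayscale proxy
        new_labels = (intensity > tau).float()         # refreshed cloud masks
    optimizer.zero_grad()
    loss = F.binary_cross_entropy_with_logits(model(images), new_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```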

VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis

  • Authors: Sen Wang, Wei Zhang, Stefano Gasperini, Shun-Cheng Wu, Nassir Navab
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05289
  • Pdf link: https://arxiv.org/pdf/2311.05289
  • Abstract
    Creating high-quality view synthesis is essential for immersive applications but continues to be problematic, particularly in indoor environments and for real-time deployment. Current techniques frequently require extensive computational time for both training and rendering, and often produce less-than-ideal 3D representations due to inadequate geometric structuring. To overcome this, we introduce VoxNeRF, a novel approach that leverages volumetric representations to enhance the quality and efficiency of indoor view synthesis. Firstly, VoxNeRF constructs a structured scene geometry and converts it into a voxel-based representation. We employ multi-resolution hash grids to adaptively capture spatial features, effectively managing occlusions and the intricate geometry of indoor scenes. Secondly, we propose a unique voxel-guided efficient sampling technique. This innovation selectively focuses computational resources on the most relevant portions of ray segments, substantially reducing optimization time. We validate our approach against three public indoor datasets and demonstrate that VoxNeRF outperforms state-of-the-art methods. Remarkably, it achieves these gains while reducing both training and rendering times, surpassing even Instant-NGP in speed and bringing the technology closer to real-time.

Linear Gaussian Bounding Box Representation and Ring-Shaped Rotated Convolution for Oriented Object Detection

  • Authors: Zhen Zhou, Yunkai Ma, Junfeng Fan, Zhaoyang Liu, Fengshui Jing, Min Tan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05410
  • Pdf link: https://arxiv.org/pdf/2311.05410
  • Abstract
    Due to the frequent variability of object orientation, accurate prediction of orientation information remains a challenge in oriented object detection. To better extract orientation-related information, current methods primarily focus on the design of reasonable representations of the oriented bounding box (OBB) and on rotation-sensitive feature extraction. However, existing OBB representations often suffer from boundary discontinuity and representation ambiguity problems. Designing continuous and unambiguous regression losses does not fundamentally solve these problems. The Gaussian bounding box (GBB) avoids these OBB representation problems, but directly regressing GBB is susceptible to numerical instability. In this paper, we propose the linear GBB (LGBB), a novel OBB representation. By linearly transforming the elements of GBB, LGBB avoids the boundary discontinuity and representation ambiguity problems and has high numerical stability. On the other hand, current rotation-sensitive feature extraction methods based on convolutions can only extract features under a local receptive field, which is slow in aggregating rotation-sensitive features. To address this issue, we propose ring-shaped rotated convolution (RRC). By adaptively rotating feature maps to arbitrary orientations, RRC extracts rotation-sensitive features under a ring-shaped receptive field, rapidly aggregating rotation-sensitive features and contextual information. RRC can be applied to various models in a plug-and-play manner. Experimental results demonstrate that the proposed LGBB and RRC are effective and achieve state-of-the-art (SOTA) performance. By integrating LGBB and RRC into various models, detection accuracy is effectively improved on the DOTA and HRSC2016 datasets.
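
For context, the underlying GBB models an oriented box as a 2D Gaussian; the sketch below shows the standard OBB-to-Gaussian conversion from the Gaussian-box literature (LGBB's specific linear transform of these elements is defined in the paper and not reproduced here):

```python
import numpy as np

def obb_to_gaussian(cx, cy, w, h, theta):
    """Oriented box (center, size, angle in radians) -> 2D Gaussian with
    mean at the box center and covariance R diag((w/2)^2, (h/2)^2) R^T.
    Identical boxes at theta and theta + pi map to the same Gaussian,
    which is why GBB sidesteps boundary-discontinuity issues."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2])
    return np.array([cx, cy]), R @ S @ R.T
```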

Active Mining Sample Pair Semantics for Image-text Matching

  • Authors: Yongfeng Chen, Jin Liu, Zhijing Yang, Ruihan Chen, Junpeng Tan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.05425
  • Pdf link: https://arxiv.org/pdf/2311.05425
  • Abstract
    Recently, commonsense learning has been a hot topic in image-text matching. Although it can describe more graphic correlations, commonsense learning still has some shortcomings: 1) existing methods are based on a triplet semantic similarity measurement loss, which cannot effectively match the intractable negatives in image-text sample pairs; 2) the weak generalization ability of the model leads to poor image-text matching performance on large-scale datasets. To address these shortcomings, this paper proposes a novel image-text matching model, called the Active Mining Sample Pair Semantics image-text matching model (AMSPS). Compared with the single semantic learning mode of commonsense learning models with a triplet loss function, AMSPS follows an active learning idea. First, the proposed Adaptive Hierarchical Reinforcement Loss (AHRL) has diversified learning modes; its active learning mode enables the model to focus more on intractable negative samples, enhancing its discriminating ability. In addition, AMSPS can adaptively mine more hidden relevant semantic representations from unannotated items, which greatly improves the performance and generalization ability of the model. Experimental results on the Flickr30K and MSCOCO datasets show that our proposed method is superior to advanced comparison methods.

Keyword: quantization

Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO

  • Authors: Haim Barad, Ekaterina Aidova, Yury Gorbachev
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
  • Arxiv link: https://arxiv.org/abs/2311.04951
  • Pdf link: https://arxiv.org/pdf/2311.04951
  • Abstract
    Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregressive sampling. This can be used together with model-based optimizations (e.g. quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
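
For reference, the core accept/reject rule of speculative sampling, in its standard form from the literature (the OpenVINO notebook wraps this behind its own APIs), is sketched below:

```python
import torch

def verify_draft_token(target_probs, draft_probs, token):
    """Accept the draft model's token with probability min(1, p/q); on
    rejection, resample from the normalized residual max(p - q, 0).
    This scheme yields samples distributed exactly as the target model's,
    while letting the cheap draft model propose most tokens."""
    p, q = target_probs[token], draft_probs[token]
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return token                                   # draft token accepted
    residual = torch.clamp(target_probs - draft_probs, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```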

Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

  • Authors: Mohamed Assem Ibrahim, Shaizeen Aga, Ada Li, Suchita Pati, Mahzabeen Islam
  • Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2311.05034
  • Pdf link: https://arxiv.org/pdf/2311.05034
  • Abstract
    Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high precision and low precision during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6, etc.), multiple low-precision weight copies can be required. To lower the memory capacity needs of weights, we explore just-in-time quantization (JIT-Q), where we only store high-precision weights in memory and generate low-precision weights only when needed. To perform JIT-Q efficiently, in this work, we evaluate emerging processing-in-memory (PIM) technology to execute quantization. With PIM, we can offload quantization to in-memory compute units, enabling quantization to be performed without incurring costly data movement while allowing quantization to be concurrent with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24%) at marginal throughput loss (up to 2.4%). These memory capacity savings can unlock several benefits, such as fitting a larger model in the same system, reducing model parallelism requirements, and improving overall ML training efficiency.
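
As a rough illustration of the kernel JIT-Q would offload, here is plain symmetric integer quantization of a weight tile; the paper targets shared-exponent MX-style formats executed by PIM units, so this numpy sketch only conveys the data flow, not the actual format:

```python
import numpy as np

def quantize_tile(w, bits=8):
    """Generate a low-precision copy of a high-precision weight tile on
    demand; only the high-precision master copy persists in memory."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).astype(np.int8), scale

w_tile = np.random.randn(128).astype(np.float16)   # high-precision master copy
w_q, s = quantize_tile(w_tile)                     # produced just in time
w_deq = w_q.astype(np.float16) * s                 # what the accelerator consumes
```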

Matrix Completion via Memoryless Scalar Quantization

  • Authors: Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian
  • Subjects: Information Theory (cs.IT)
  • Arxiv link: https://arxiv.org/abs/2311.05052
  • Pdf link: https://arxiv.org/pdf/2311.05052
  • Abstract
    We delve into the impact of memoryless scalar quantization on matrix completion. We broaden our theoretical discussion to encompass the coarse quantization scenario with a dithering scheme, where the only available information for low-rank matrix recovery is few-bit low-resolution data. Our primary motivation for this research is to evaluate the recovery performance of nuclear norm minimization in handling quantized matrix problems without the use of any regularization terms such as those stemming from maximum likelihood estimation. We furnish theoretical guarantees for both scenarios: when access to dithers is available during the reconstruction process, and when we have access solely to the statistical properties of the dithers. Additionally, we conduct a comprehensive analysis of the effects of sign flips and prequantization noise on the recovery performance, particularly when the impact of sign flips is quantified using the well-known Hamming distance in the upper bound of recovery error.
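
As the simplest instance of this setting, one-bit memoryless quantization with a uniform dither looks as follows (a generic sketch; the paper's theory covers general few-bit quantizers and specific dither statistics):

```python
import numpy as np

def dithered_sign(x, delta, rng=None):
    """Memoryless one-bit quantizer with uniform dither:
    y = sign(x + tau), tau ~ U(-delta, delta).
    For |x| <= delta, E[y] = x / delta, which is what makes recovery
    from coarse, low-resolution measurements possible."""
    rng = rng or np.random.default_rng()
    tau = rng.uniform(-delta, delta, size=np.shape(x))
    return np.sign(x + tau)
```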

Reducing the Side-Effects of Oscillations in Training of Quantized YOLO Networks

  • Authors: Kartik Gupta, Akshay Asthana
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05109
  • Pdf link: https://arxiv.org/pdf/2311.05109
  • Abstract
    Quantized networks use less computational and memory resources and are suitable for deployment on edge devices. While quantization-aware training (QAT) is the well-studied approach to quantizing networks at low precision, most research focuses on over-parameterized networks for classification, with limited studies on popular, edge-device-friendly, single-shot object detection and semantic segmentation methods like YOLO. Moreover, the majority of QAT methods rely on the Straight-through Estimator (STE) approximation, which suffers from an oscillation phenomenon resulting in sub-optimal network quantization. In this paper, we show that it is difficult to achieve extremely low precision (4-bit and lower) for efficient YOLO models even with SOTA QAT methods due to the oscillation issue, and that existing methods to overcome this problem are not effective on these models. To mitigate the effect of oscillation, we first propose an Exponentially Moving Average (EMA)-based update to the QAT model. Further, we propose a simple QAT correction method, namely QC, that takes only a single epoch of training after the standard QAT procedure to correct the error induced by oscillating weights and activations, resulting in a more accurate quantized model. With extensive evaluation on the COCO dataset using various YOLO5 and YOLO7 variants, we show that our correction method improves quantized YOLO networks consistently on both object detection and segmentation tasks at low precision (4-bit and 3-bit).
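
A minimal sketch of the EMA-based update mentioned above (the decay value is illustrative, not the paper's setting):

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Track an exponential moving average of the QAT model's parameters;
    evaluating with the smoothed copy damps the weight oscillation that
    the straight-through estimator induces around quantization thresholds."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)
```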

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

  • Authors: Jangwhan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2311.05161
  • Pdf link: https://arxiv.org/pdf/2311.05161
  • Abstract
    Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2× hardware efficiency improvement compared to an 8-bit integer MAC unit.

RepQ: Generalizing Quantization-Aware Training for Re-Parametrized Architectures

  • Authors: Anastasiia Prutianova, Alexey Zaytsev, Chung-Kuei Lee, Fengyu Sun, Ivan Koryakovskiy
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05317
  • Pdf link: https://arxiv.org/pdf/2311.05317
  • Abstract
    Existing neural networks are memory-consuming and computationally intensive, making them challenging to deploy in resource-constrained environments. However, there are various methods to improve their efficiency. Two such methods are quantization, a well-known approach for network compression, and re-parametrization, an emerging technique designed to improve model performance. Although both techniques have been studied individually, there has been limited research on their simultaneous application. To address this gap, we propose a novel approach called RepQ, which applies quantization to re-parametrized networks. Our method is based on the insight that the test-stage weights of an arbitrary re-parametrized layer can be presented as a differentiable function of trainable parameters. We enable quantization-aware training by applying quantization on top of this function. RepQ generalizes well to various re-parametrized models and outperforms the baseline LSQ quantization scheme in all experiments.
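
The key insight, that deployed weights are a differentiable function of the training-time branches, can be illustrated on a RepVGG-style block. This is a sketch under that assumption, using plain min-max fake quantization where RepQ itself builds on LSQ:

```python
import torch
import torch.nn.functional as F

def fuse_branches(w3x3, w1x1):
    """Test-time weight as a differentiable function of trainable branches:
    embed the 1x1 kernel at the center of the 3x3 kernel and sum."""
    return w3x3 + F.pad(w1x1, (1, 1, 1, 1))

def fake_quant(w, bits=8):
    """Quantize the *fused* weight; the straight-through estimator lets
    gradients flow back to both branch parameterizations."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    wq = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (wq - w).detach()

w3x3 = torch.randn(64, 64, 3, 3, requires_grad=True)
w1x1 = torch.randn(64, 64, 1, 1, requires_grad=True)
w_deploy = fake_quant(fuse_branches(w3x3, w1x1))   # used in the conv forward
```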