
# Introduction

In this notebook we cluster research article embeddings and project them into two dimensions. The end result is a model that we can use to visualize clusters in the article space.

In [0]:
df = spark.read.table("main.default.ARXIV")
sample_fraction = 0.1
df = df.sample(sample_fraction)
display(df.count())

238300

In [0]:
display(df.limit(3))

abstract,authors_parsed,categories,id,submitter,title,update_date,year
"Imitation learning holds great promise for addressing the complex task of autonomous urban driving, as experienced human drivers can navigate highly challenging scenarios with ease. While behavior cloning is a widely used imitation learning approach in autonomous driving due to its exemption from risky online interactions, it suffers from the covariate shift issue. To address this limitation, we propose a context-conditioned imitation learning approach that employs a policy to map the context state into the ego vehicle's future trajectory, rather than relying on the traditional formulation of both ego and context states to predict the ego action. Additionally, to reduce the implicit ego information in the coordinate system, we design an ego-perturbed goal-oriented coordinate system. The origin of this coordinate system is the ego vehicle's position plus a zero mean Gaussian perturbation, and the x-axis direction points towards its goal position. Our experiments on the real-world large-scale Lyft and nuPlan datasets show that our method significantly outperforms state-of-the-art approaches.","List(List(Guo, Ke, ), List(Jing, Wei, ), List(Chen, Junbo, ), List(Pan, Jia, ))",cs.RO,2305.02649,Ke Guo,CCIL: Context-conditioned imitation learning for urban driving,2023-05-05,2023
"Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose, $\textit{e.g.}$, looking at the specified regions or telling in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation model augmented image captioning framework supporting a wide range of multimodel controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling the flexible combination between different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.","List(List(Wang, Teng, ), List(Zhang, Jinrui, ), List(Fei, Junjie, ), List(Zheng, Hao, ), List(Tang, Yunlong, ), List(Li, Zhe, ), List(Gao, Mingqi, ), List(Zhao, Shanshan, ))",cs.CV,2305.02677,Teng Wang,Caption Anything: Interactive Image Description with Diverse Multimodal  Controls,2023-07-07,2023
"In this paper, we study row graphs of Toeplitz matrices. The notion of row graphs was introduced by Greenberg et al. in 1984 and is closely related to the notion of competition graphs, which has been extensively studied since Cohen had introduced it in 1968.  To understand the structure of the row graphs of Toeplitz matrices, which seem to be quite complicated, we have begun with Toeplitz matrices whose row graphs are triangle-free. We could show that if the row graph G of a Toeplitz matrix T is triangle-free, then T has the maximum row sum at most 2. Furthermore, it turns out that G is a disjoint union of paths and cycles whose lengths cannot vary that much in such a case. Then we study (0, 1)-Toeplitz matrices whose row graphs have only path components, only cycle components, and a cycle component of specific length, respectively. In particular, we completely characterize a (0, 1)-Toeplitz matrix whose row graph is a cycle.","List(List(Cheon, Gi-Sang, ), List(Kang, Bumtle, ), List(Kim, Suh-Ryung, ), List(Ryu, Homoon, ))",math.CO,2305.0269,Bumtle Kang,Row graphs of Toeplitz matrices,2023-05-05,2023



# Cleaning and tokenization

In [0]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, RegexTokenizer
from pyspark.sql.functions import udf, col, regexp_replace

In [0]:
def clean_text(df, input_column = "abstract", output_column = "abstract_clean"):
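    # Remove LaTeX math delimited by matching runs of dollar signs ($...$ or $$...$$)
    df = df.withColumn(output_column, regexp_replace(input_column, r"(\$+)(?:(?!\1)[\s\S])*\1", ""))
    # Remove digits and hyphens
    df = df.withColumn(output_column, regexp_replace(output_column, r"[\d-]", ""))
    # Remove parentheses, pound signs, carets, periods, commas and backslashes
    df = df.withColumn(output_column, regexp_replace(output_column, r"[()£^.,\\]+", ""))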
    return df
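
To see what the three substitutions do to a single abstract, here is a minimal plain-Python sketch using the same patterns with the standard `re` module. The helper name `clean_text_py` and the sample sentence are made up for illustration; the notebook itself applies the patterns through Spark's `regexp_replace`.

```python
import re

def clean_text_py(text):
    """Plain-Python mirror of the three regexp_replace steps (illustrative only)."""
    # 1. Drop LaTeX math delimited by matching runs of dollar signs ($...$ or $$...$$).
    text = re.sub(r"(\$+)(?:(?!\1)[\s\S])*\1", "", text)
    # 2. Drop digits and hyphens, so "state-of-the-art" becomes "stateoftheart".
    text = re.sub(r"[\d-]", "", text)
    # 3. Drop parentheses, pound signs, carets, periods, commas and backslashes.
    text = re.sub(r"[()£^.,\\]+", "", text)
    return text

# Hypothetical sample sentence, not taken from the dataset.
sample = r"A state-of-the-art bound $x^2 \leq 4$ from 1984 (see text)."
print(clean_text_py(sample).split())
```

Note that hyphenated words are fused and periods inside URLs are dropped as well, which matches the `stateoftheart` and `githubcom` tokens visible in the cell outputs.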

In [0]:
# Remove LaTeX math, digits, and punctuation from the abstracts
df = clean_text(df, output_column = "abstract_math_free")
display(df.limit(3))

abstract,authors_parsed,categories,id,submitter,title,update_date,year,abstract_math_free
"Imitation learning holds great promise for addressing the complex task of autonomous urban driving, as experienced human drivers can navigate highly challenging scenarios with ease. While behavior cloning is a widely used imitation learning approach in autonomous driving due to its exemption from risky online interactions, it suffers from the covariate shift issue. To address this limitation, we propose a context-conditioned imitation learning approach that employs a policy to map the context state into the ego vehicle's future trajectory, rather than relying on the traditional formulation of both ego and context states to predict the ego action. Additionally, to reduce the implicit ego information in the coordinate system, we design an ego-perturbed goal-oriented coordinate system. The origin of this coordinate system is the ego vehicle's position plus a zero mean Gaussian perturbation, and the x-axis direction points towards its goal position. Our experiments on the real-world large-scale Lyft and nuPlan datasets show that our method significantly outperforms state-of-the-art approaches.","List(List(Guo, Ke, ), List(Jing, Wei, ), List(Chen, Junbo, ), List(Pan, Jia, ))",cs.RO,2305.02649,Ke Guo,CCIL: Context-conditioned imitation learning for urban driving,2023-05-05,2023,Imitation learning holds great promise for addressing the complex task of autonomous urban driving as experienced human drivers can navigate highly challenging scenarios with ease While behavior cloning is a widely used imitation learning approach in autonomous driving due to its exemption from risky online interactions it suffers from the covariate shift issue To address this limitation we propose a contextconditioned imitation learning approach that employs a policy to map the context state into the ego vehicle's future trajectory rather than relying on the traditional formulation of both ego and context states to predict the ego action Additionally to reduce the implicit ego information in the 
coordinate system we design an egoperturbed goaloriented coordinate system The origin of this coordinate system is the ego vehicle's position plus a zero mean Gaussian perturbation and the xaxis direction points towards its goal position Our experiments on the realworld largescale Lyft and nuPlan datasets show that our method significantly outperforms stateoftheart approaches
"Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose, $\textit{e.g.}$, looking at the specified regions or telling in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation model augmented image captioning framework supporting a wide range of multimodel controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling the flexible combination between different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. 
Our code is publicly available at https://github.com/ttengwang/Caption-Anything.","List(List(Wang, Teng, ), List(Zhang, Jinrui, ), List(Fei, Junjie, ), List(Zheng, Hao, ), List(Tang, Yunlong, ), List(Li, Zhe, ), List(Gao, Mingqi, ), List(Zhao, Shanshan, ))",cs.CV,2305.02677,Teng Wang,Caption Anything: Interactive Image Description with Diverse Multimodal  Controls,2023-07-07,2023,Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose  looking at the specified regions or telling in a particular text style Stateoftheart methods are trained on annotated pairs of input controls and output captions However the scarcity of such wellannotated multimodal data largely limits their usability and scalability for interactive AI systems Leveraging unimodal instructionfollowing foundation models is a promising alternative that benefits from broader sources of data In this paper we present Caption AnyThing CAT a foundation model augmented image captioning framework supporting a wide range of multimodel controls: visual controls including points boxes and trajectories; language controls such as sentiment length language and factuality Powered by Segment Anything Model SAM and ChatGPT we unify the visual and language prompts into a modularized framework enabling the flexible combination between different controls Extensive case studies demonstrate the user intention alignment capabilities of our framework shedding light on effective user interaction modeling in visionlanguage applications Our code is publicly available at https://githubcom/ttengwang/CaptionAnything
"In this paper, we study row graphs of Toeplitz matrices. The notion of row graphs was introduced by Greenberg et al. in 1984 and is closely related to the notion of competition graphs, which has been extensively studied since Cohen had introduced it in 1968.  To understand the structure of the row graphs of Toeplitz matrices, which seem to be quite complicated, we have begun with Toeplitz matrices whose row graphs are triangle-free. We could show that if the row graph G of a Toeplitz matrix T is triangle-free, then T has the maximum row sum at most 2. Furthermore, it turns out that G is a disjoint union of paths and cycles whose lengths cannot vary that much in such a case. Then we study (0, 1)-Toeplitz matrices whose row graphs have only path components, only cycle components, and a cycle component of specific length, respectively. In particular, we completely characterize a (0, 1)-Toeplitz matrix whose row graph is a cycle.","List(List(Cheon, Gi-Sang, ), List(Kang, Bumtle, ), List(Kim, Suh-Ryung, ), List(Ryu, Homoon, ))",math.CO,2305.0269,Bumtle Kang,Row graphs of Toeplitz matrices,2023-05-05,2023,In this paper we study row graphs of Toeplitz matrices The notion of row graphs was introduced by Greenberg et al in and is closely related to the notion of competition graphs which has been extensively studied since Cohen had introduced it in To understand the structure of the row graphs of Toeplitz matrices which seem to be quite complicated we have begun with Toeplitz matrices whose row graphs are trianglefree We could show that if the row graph G of a Toeplitz matrix T is trianglefree then T has the maximum row sum at most Furthermore it turns out that G is a disjoint union of paths and cycles whose lengths cannot vary that much in such a case Then we study Toeplitz matrices whose row graphs have only path components only cycle components and a cycle component of specific length respectively In particular we completely characterize a Toeplitz matrix whose row graph 
is a cycle


In [0]:
word_tokenizer = RegexTokenizer(inputCol = "abstract_math_free", outputCol = "word_tokens", minTokenLength = 3)
stop_words_remover = StopWordsRemover(inputCol = "word_tokens", outputCol = "clean_tokens")

In [0]:
df = word_tokenizer.transform(df)
df = stop_words_remover.transform(df)
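
As a rough illustration of what these two stages do to one abstract, the sketch below lowercases the text, splits it on whitespace, keeps tokens of at least three characters, and then filters stop words. The function names and the tiny `STOP_WORDS` set are assumptions made for this example; Spark's `StopWordsRemover` uses a much longer default list.

```python
import re

# Hypothetical, heavily shortened stop-word list used only for this sketch.
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are"}

def tokenize(text, min_token_length=3):
    # Lowercase, split on whitespace, and keep tokens of at least min_token_length characters.
    tokens = re.split(r"\s+", text.lower().strip())
    return [t for t in tokens if len(t) >= min_token_length]

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word list.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words(tokenize("In this paper we study row graphs of Toeplitz matrices"))
print(tokens)
```

Short words such as "in", "we" and "of" are dropped by the length filter before the stop-word step sees them.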

In [0]:
display(df.select('id', 'clean_tokens').limit(12))

id,clean_tokens
2305.02649,"List(imitation, learning, holds, great, promise, addressing, complex, task, autonomous, urban, driving, experienced, human, drivers, navigate, highly, challenging, scenarios, ease, behavior, cloning, widely, used, imitation, learning, approach, autonomous, driving, due, exemption, risky, online, interactions, suffers, covariate, shift, issue, address, limitation, propose, contextconditioned, imitation, learning, approach, employs, policy, map, context, state, ego, vehicle's, future, trajectory, rather, relying, traditional, formulation, ego, context, states, predict, ego, action, additionally, reduce, implicit, ego, information, coordinate, system, design, egoperturbed, goaloriented, coordinate, system, origin, coordinate, system, ego, vehicle's, position, plus, zero, mean, gaussian, perturbation, xaxis, direction, points, towards, goal, position, experiments, realworld, largescale, lyft, nuplan, datasets, show, method, significantly, outperforms, stateoftheart, approaches)"
2305.02677,"List(controllable, image, captioning, emerging, multimodal, topic, aims, describe, image, natural, language, following, human, purpose, looking, specified, regions, telling, particular, text, style, stateoftheart, methods, trained, annotated, pairs, input, controls, output, captions, however, scarcity, wellannotated, multimodal, data, largely, limits, usability, scalability, interactive, systems, leveraging, unimodal, instructionfollowing, foundation, models, promising, alternative, benefits, broader, sources, data, paper, present, caption, anything, cat, foundation, model, augmented, image, captioning, framework, supporting, wide, range, multimodel, controls:, visual, controls, including, points, boxes, trajectories;, language, controls, sentiment, length, language, factuality, powered, segment, anything, model, sam, chatgpt, unify, visual, language, prompts, modularized, framework, enabling, flexible, combination, different, controls, extensive, case, studies, demonstrate, user, intention, alignment, capabilities, framework, shedding, light, effective, user, interaction, modeling, visionlanguage, applications, code, publicly, available, https://githubcom/ttengwang/captionanything)"
2305.0269,"List(paper, study, row, graphs, toeplitz, matrices, notion, row, graphs, introduced, greenberg, closely, related, notion, competition, graphs, extensively, studied, since, cohen, introduced, understand, structure, row, graphs, toeplitz, matrices, seem, quite, complicated, begun, toeplitz, matrices, whose, row, graphs, trianglefree, show, row, graph, toeplitz, matrix, trianglefree, maximum, row, sum, furthermore, turns, disjoint, union, paths, cycles, whose, lengths, vary, much, case, study, toeplitz, matrices, whose, row, graphs, path, components, cycle, components, cycle, component, specific, length, respectively, particular, completely, characterize, toeplitz, matrix, whose, row, graph, cycle)"
2305.02704,"List(fractional, programming, plays, crucial, role, wireless, network, design, many, relevant, problems, involve, maximizing, minimizing, ratio, terms, notice, maximization, case, minimization, case, converted, general, dealt, separately, previous, studies, thus, existing, method, maximizing, ratios, typically, work, minimization, case, vice, versa, however, objective, mixed, maxandmin, one, may, wish, maximize, signaltointerferenceplusnoise, ratio, sinr, legitimate, receiver, minimizing, eavesdropper, aim, fill, gap, maxfp, minfp, devising, unified, optimization, framework, main, results, threefold, first, extend, existing, maxfp, technique, called, quadratic, transform, minfp, develop, full, generalization, mixed, case, second, provide, minorizationmaximization, interpretation, proposed, unified, approach, thereby, establishing, convergence, also, obtaining, matrix, extension;, another, result, obtain, generalized, lagrangian, dual, transform, facilitates, solving, logarithmic, finally, present, three, typical, applications:, ageofinformation, aoi, minimization, cramerrao, bound, minimization, sensing, secure, data, rate, maximization, none, efficiently, addressed, previous, methods)"
2305.02721,"List(study, symmetry, breaking, particle, physics, plays, important, role, order, get, useful, information, nature, classification, arrangements, subatomic, particles, also, necessary, study, particle, physics, particles, building, blocks, nature, quarks, gluons, leptons, baryons, mesons, composed, quarks, arranged, gellmann, okubo, wellknown, eightfold, way, symmetry, standard, model, particles, composed, particles, particles, also, make, beautiful, patron, make, multiplets, baryons, spin, jp=, multiplets, observed, till, date, paper, multiplets, organized, studied, easy, new, way, result, obtained, clues, masses, characteristics, unknown, hyperons, approximations, characteristics, unidentified, baryons, recorded, article, mass, formula, baryon, multiplets, obtained)"
2305.0273,"List(paper, shown, every, lattice, exists, sequence, equidistributes, horocycle, flow, makes, modest, progress, towards, conjecture, shah, generalizes, result, venkatesh, arxiv:math/, established, equidistribution, cocompact, lattices, proof, utilizes, dichotomy, good, equidistribution, estimates, approximability, closed, horocycles, small, period)"
2305.02735,"List(galois, ring, residue, ring, basic, primitive, polynomial, degree, odd, larger, construct, partition, subsets, type, subsets, type, partition, invariant, multiplication, nonzero, element, teichmuller, set, multiple, action, automorphism, group, corollary, implies, existence, quasicyclic, additive, perfect, codes, index, doob, metric, scheme)"
2305.02767,"List(paper, construct, certain, quantum, spin, systems, moduli, spaces, connections, connected, oriented, finite, graph, simply, connected, compact, lie, group, construct, joint, eigenfunctions, commuting, quantum, hamiltonians, terms, local, invariant, tensors, determine, sufficient, conditions, ensuring, superintegrability, quantum, spin, system, using, irreducibility, criteria, harishchandra, modules, due, harishchandra, lepowsky, mccollum, resulting, class, quantum, superintegrable, spin, systems, includes, quantum, periodic, open, spin, calogeromoser, spin, chains, special, cases, periodic, case, description, joint, eigenfunctions, terms, local, invariant, tensors, multipoint, generalised, trace, functions, open, case, multipoint, spherical, functions, compact, symmetric, spaces)"
2305.02782,"List(largescale, dynamic, network, ldn, source, data, many, big, datarelated, applications, due, large, number, entities, largescale, dynamic, interactions, modeled, highdimensional, incomplete, hdi, tensor, contains, wealth, knowledge, time, patterns, latent, factorization, tensors, lft, model, efficiently, extracts, time, pattern, established, using, stochastic, gradient, descent, sgd, solvers, however, lft, models, based, sgd, often, limited, training, schemes, poor, tail, convergence, solve, problem, paper, proposes, novel, nonlinear, lft, model, mnnl, based, momentumincorporated, sgd, extracts, nonnegative, latent, factors, hdi, tensors, make, training, unconstrained, compatible, general, training, schemes, improving, convergence, accuracy, speed, empirical, studies, two, ldn, datasets, show, compared, existing, models, mnnl, model, higher, prediction, accuracy, convergence, speed)"
2305.02783,"List(recent, improvement, code, generation, capabilities, due, use, large, language, models, mainly, benefited, general, purpose, programming, languages, domain, specific, languages, ones, used, automation, received, far, less, attention, despite, involving, many, active, developers, essential, component, modern, cloud, platforms, work, focuses, generation, ansibleyaml, widely, used, markup, language, automation, present, ansible, wisdom, naturallanguage, ansibleyaml, code, generation, tool, aimed, improving, automation, productivity, ansible, wisdom, transformerbased, model, extended, training, new, dataset, containing, ansibleyaml, also, develop, two, novel, performance, metrics, yaml, ansible, capture, specific, characteristics, domain, results, show, ansible, wisdom, accurately, generate, ansible, script, natural, language, prompts, performance, comparable, better, existing, state, art, code, generation, models, fewshot, settings, asses, impact, training, ansible, yaml, data, compare, different, baselines, including, codexdavinci, also, show, finetuning, ansible, specific, model, bleu:, outperform, much, larger, codexdavinci, bleu:, model, evaluated, shot, settings)"



# Modeling

We first train Word2Vec word embeddings.
We then form each document embedding by averaging the embeddings of its words.

Finally, we use the document embeddings, compared by cosine similarity, in K-means clustering.
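
The averaging and the cosine comparison can be sketched in plain Python as follows. The 3-dimensional `word_vectors` table and all function names are hypothetical; how out-of-vocabulary tokens are counted in the average is also an assumption here, although a different scaling gives the same direction once the vector is normalized to unit length.

```python
import math

# Hypothetical 3-dimensional word vectors; the notebook learns 10-dimensional ones.
word_vectors = {
    "graph": [1.0, 0.0, 1.0],
    "matrix": [1.0, 2.0, 0.0],
    "image": [0.0, 1.0, 3.0],
}

def document_embedding(tokens):
    # Average the vectors of the tokens that are in the vocabulary.
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def l2_normalize(vec):
    # Scale to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]

def cosine_similarity(a, b):
    # For unit-length vectors the cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(a, b))

doc_a = l2_normalize(document_embedding(["graph", "matrix"]))
doc_b = l2_normalize(document_embedding(["image", "unknown"]))
print(cosine_similarity(doc_a, doc_b))
```

A document whose tokens are all unknown ends up with a zero vector, which cannot be normalized and has no meaningful cosine similarity to anything.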


## Word2Vec embedding

To form the Word2Vec embeddings, we will create \\( 10 \\)-dimensional vectors.

In [0]:
import mlflow

In [0]:
from pyspark.ml.feature import Word2Vec, Normalizer
from pyspark.sql.types import DoubleType

In [0]:
word2vec = Word2Vec(inputCol = 'clean_tokens', vectorSize = 10, outputCol = 'embedding')
with mlflow.start_run(run_name = 'run-1') as wordrun:
    # Fit Word2Vec on the cleaned tokens
    model = word2vec.fit(df)
    # Log the fitted model; no signature is recorded for this intermediate model
    mlflow.spark.log_model(model, "word-model")

2023/12/21 09:43:38 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().



In [0]:
df = model.transform(df)

In [0]:
from pyspark.ml.functions import vector_to_array

In [0]:
# Normalize the embedding vectors
normalizer = Normalizer(inputCol = 'embedding', outputCol = 'normalized_embedding', p = 2.0)
df = normalizer.transform(df)
df = df.withColumn('embedding_arr', vector_to_array('normalized_embedding'))

In [0]:
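# Compute the squared L2 norm of each normalized embedding.
# It should be 1 after normalization. A value of 0 means the embedding itself is the zero vector,
# which normalization cannot rescale (e.g. when none of the tokens has a word vector).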
norm_udf = udf(lambda x: sum([el ** 2 for el in x]), DoubleType())
df = df.withColumn('embedding_norm', norm_udf('embedding_arr'))

In [0]:
# View papers with 0 embedding vector and remove them
eps = 1e-6
display(df.where(df.embedding_norm < eps).select('id', 'abstract'))
df = df.filter(df.embedding_norm > eps)

id,abstract



## Clustering

Now that we have the document embeddings, we can use them to compute cosine similarities.
Using these similarities, we can cluster the documents.

The number of clusters must be specified beforehand; we will use \\(K = 8\\).
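
To make the role of the cosine measure concrete, here is a minimal sketch of the assignment step of K-means with cosine distance, in plain Python. The 2-dimensional points, the two centers, and the function names are hypothetical; the full algorithm alternates this step with recomputing the centers until the assignments stabilize.

```python
def cosine_distance(a, b):
    # Assumes a and b are unit-length, so cosine distance is 1 - dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def assign_clusters(points, centers):
    # Assignment step of k-means: each point goes to its nearest center.
    return [
        min(range(len(centers)), key=lambda k: cosine_distance(p, centers[k]))
        for p in points
    ]

# Hypothetical 2-dimensional unit vectors and two cluster centers.
points = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
centers = [[1.0, 0.0], [0.0, 1.0]]
labels = assign_clusters(points, centers)
print(labels)
```

Because the embeddings are normalized to unit length, ranking centers by cosine distance only depends on the angle between a document and each center.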

In [0]:
from pyspark.ml.clustering import KMeans

In [0]:
with mlflow.start_run(run_name = 'run-2'):
    K = 8
    kmeans = KMeans(k = K, distanceMeasure = "cosine", featuresCol = 'normalized_embedding', predictionCol = 'prediction', maxIter = 4)
    kmeans_model = kmeans.fit(df)
    mlflow.spark.log_model(kmeans_model, 'kmeans-model')

2023/12/21 09:50:58 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().



In [0]:
df = kmeans_model.transform(df)


## PCA decomposition

To visualize the data in 2 dimensions we need to map the multidimensional embeddings into the plane. 

We can use _principal component analysis_ (PCA) to achieve this. PCA forms linear combinations of the original axes that preserve as much of the variance of the dataset as possible.
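
As a sketch of the standard formulation (the symbols below are introduced here for illustration and do not appear elsewhere in this notebook): for embeddings \\(x_1, \dots, x_n\\) with mean \\(\bar{x}\\), PCA takes the sample covariance matrix

\\[ C = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top \\]

and its unit eigenvectors \\(w_1, w_2\\) belonging to the two largest eigenvalues \\(\lambda_1 \ge \lambda_2\\). Each embedding is then mapped to the plane as

\\[ y_i = \left( w_1^\top x_i,\; w_2^\top x_i \right), \\]

and the fraction of the total variance that survives the projection is \\((\lambda_1 + \lambda_2) / \sum_j \lambda_j\\). Whether the mean is subtracted before projecting can vary between implementations; it only shifts all projected points by the same constant offset.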

In [0]:
from pyspark.ml.feature import PCA

In [0]:
with mlflow.start_run(run_name = 'run-3'):
    pca = PCA(k = 2, inputCol = 'normalized_embedding', outputCol = 'pca_features')
    pca_model = pca.fit(df)
    mlflow.spark.log_model(pca_model, 'pca-model')

2023/12/21 09:55:33 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().



In [0]:
df = pca_model.transform(df)
df = df.withColumn('pca_features', vector_to_array('pca_features'))

In [0]:
# Preview the columns that will be saved to the Delta table
selectable_columns = ['id', 'authors_parsed', 'categories', 'submitter', 'update_date', 'title', 'abstract', 'embedding_arr', 'prediction', 'pca_features', 'year']
display(df.select(*selectable_columns).limit(3))

id,authors_parsed,categories,submitter,update_date,title,abstract,embedding_arr,prediction,pca_features,year
2305.02649,"List(List(Guo, Ke, ), List(Jing, Wei, ), List(Chen, Junbo, ), List(Pan, Jia, ))",cs.RO,Ke Guo,2023-05-05,CCIL: Context-conditioned imitation learning for urban driving,"Imitation learning holds great promise for addressing the complex task of autonomous urban driving, as experienced human drivers can navigate highly challenging scenarios with ease. While behavior cloning is a widely used imitation learning approach in autonomous driving due to its exemption from risky online interactions, it suffers from the covariate shift issue. To address this limitation, we propose a context-conditioned imitation learning approach that employs a policy to map the context state into the ego vehicle's future trajectory, rather than relying on the traditional formulation of both ego and context states to predict the ego action. Additionally, to reduce the implicit ego information in the coordinate system, we design an ego-perturbed goal-oriented coordinate system. The origin of this coordinate system is the ego vehicle's position plus a zero mean Gaussian perturbation, and the x-axis direction points towards its goal position. Our experiments on the real-world large-scale Lyft and nuPlan datasets show that our method significantly outperforms state-of-the-art approaches.","List(0.26779809606130345, -0.021201288226136154, 0.059815622754490295, -0.33218383169904886, 0.22754517692028842, 0.09961459858577139, -0.290823463809687, -0.43780028804579235, 0.01846008760990836, 0.6896540063066591)",4,"List(-0.06838093397805217, -0.26521568090144404)",2023
2305.02677,"List(List(Wang, Teng, ), List(Zhang, Jinrui, ), List(Fei, Junjie, ), List(Zheng, Hao, ), List(Tang, Yunlong, ), List(Li, Zhe, ), List(Gao, Mingqi, ), List(Zhao, Shanshan, ))",cs.CV,Teng Wang,2023-07-07,Caption Anything: Interactive Image Description with Diverse Multimodal  Controls,"Controllable image captioning is an emerging multimodal topic that aims to describe the image with natural language following human purpose, $\textit{e.g.}$, looking at the specified regions or telling in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation model augmented image captioning framework supporting a wide range of multimodel controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by Segment Anything Model (SAM) and ChatGPT, we unify the visual and language prompts into a modularized framework, enabling the flexible combination between different controls. Extensive case studies demonstrate the user intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.","List(0.38816387190649027, -0.2384371930654559, 0.4529631938748501, -0.46817670581690235, 0.22739352085421488, 0.28351875146127364, -0.11113484039391547, -0.15463716094418192, 0.24044627179566155, 0.37675286457229534)",2,"List(-0.06990281570148421, -0.5280882523295725)",2023
2305.0269,"List(List(Cheon, Gi-Sang, ), List(Kang, Bumtle, ), List(Kim, Suh-Ryung, ), List(Ryu, Homoon, ))",math.CO,Bumtle Kang,2023-05-05,Row graphs of Toeplitz matrices,"In this paper, we study row graphs of Toeplitz matrices. The notion of row graphs was introduced by Greenberg et al. in 1984 and is closely related to the notion of competition graphs, which has been extensively studied since Cohen had introduced it in 1968.  To understand the structure of the row graphs of Toeplitz matrices, which seem to be quite complicated, we have begun with Toeplitz matrices whose row graphs are triangle-free. We could show that if the row graph G of a Toeplitz matrix T is triangle-free, then T has the maximum row sum at most 2. Furthermore, it turns out that G is a disjoint union of paths and cycles whose lengths cannot vary that much in such a case. Then we study (0, 1)-Toeplitz matrices whose row graphs have only path components, only cycle components, and a cycle component of specific length, respectively. In particular, we completely characterize a (0, 1)-Toeplitz matrix whose row graph is a cycle.","List(0.4261522909628924, -0.2573273896462248, -0.24571799918690904, -0.08386825564391794, -0.2512798986686927, -0.3437523851491704, 0.18974223723096262, -0.6657239695801047, -0.05910041328188241, 0.1441351835329684)",3,"List(-0.6009653499864132, 0.339198432436339)",2023


In [0]:
df.select(*selectable_columns).write.mode('overwrite').option('mergeSchema', 'true').format('delta').saveAsTable('main.default.ARXIV_ANALYSIS')


## Registering model

We have now created the relevant models and stored their predictions in a table. The visualization can be done in a separate notebook.

However, we should also combine the steps above into a single pipeline so that the complete model can be saved and registered.

In [0]:
from pyspark.ml import Pipeline 
from mlflow.models.signature import infer_signature

In [0]:
word2vec = Word2Vec(inputCol = 'clean_tokens', vectorSize = 10, outputCol = 'embedding')
normalizer = Normalizer(inputCol = 'embedding', outputCol = 'normalized_embedding', p = 2.0)

K = 8
kmeans = KMeans(k = K, distanceMeasure = "cosine", featuresCol = 'normalized_embedding', predictionCol = 'prediction', maxIter = 4)
pca = PCA(k = 2, inputCol = 'normalized_embedding', outputCol = 'pca_features')

pipeline = Pipeline(stages=[word_tokenizer, stop_words_remover, word2vec, normalizer, kmeans, pca])

In [0]:
with mlflow.start_run(run_name = 'pipeline-run-1'):
    # Fit the full pipeline on the cleaned abstract column only
    train_features = df.select('abstract_math_free')
    pipeline_model = pipeline.fit(train_features)
    predictions = pipeline_model.transform(train_features)
    # The signature records the input column and the columns we expose as output
    output_signature = predictions.select('abstract_math_free', 'prediction')
    signature = infer_signature(train_features, output_signature)
    # Log the fitted pipeline as a Spark model under the current run
    mlflow.spark.log_model(pipeline_model, 'arxiv-model', signature = signature)

2023/12/21 10:48:46 INFO mlflow.spark: Inferring pip requirements by reloading the logged model from the databricks artifact repository, which can be time-consuming. To speed up, explicitly specify the conda_env or pip_requirements when calling log_model().



In [0]:
catalog = "main"
schema = "default"
model_name = "axiv_classifier"
mlflow.set_registry_uri("databricks-uc")
mlflow.register_model(
    model_uri="runs:/01ca3e231e4a4cc39b1dc769f51a7659/arxiv-model",
    name=f"{catalog}.{schema}.{model_name}"
)

Successfully registered model 'main.default.axiv_classifier'.


Created version '1' of model 'main.default.axiv_classifier'.


<ModelVersion: aliases=[], creation_timestamp=1703155818363, current_stage=None, description='', last_updated_timestamp=1703155825660, name='main.default.axiv_classifier', run_id='01ca3e231e4a4cc39b1dc769f51a7659', run_link=None, source='dbfs:/databricks/mlflow-tracking/313095920668275/01ca3e231e4a4cc39b1dc769f51a7659/artifacts/arxiv-model', status='READY', status_message='', tags={}, user_id='tuomas.myllymaki98@outlook.com', version='1'>


# Endpoint querying

After registering the model and creating a serving endpoint for it, we can query the endpoint. Each query should return the predicted cluster for the submitted abstract.

In [0]:
import os, requests, json
import pandas as pd

In [0]:
# `secret_key` (API token) and `url` (endpoint invocation URL) are assumed to be defined elsewhere
headers = {'Authorization': f'Bearer {secret_key}', 'Content-Type': 'application/json'}
dataset = pd.DataFrame({
    "abstract_math_free": ["This is a test. We can consider calculating the mass of the sun by using radio spectrotrometry."]
})
# Wrap the DataFrame in the 'dataframe_split' layout expected by the endpoint
ds_dict = {'dataframe_split': dataset.to_dict(orient='split')}
data_json = json.dumps(ds_dict, allow_nan=True)
response = requests.post(url, headers=headers, data=data_json)
if response.status_code != 200:
    raise Exception(f'Request failed with status {response.status_code}, {response.text}')
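
To make the request and response format more concrete, the sketch below builds the same `dataframe_split` body without pandas and extracts the predictions from a response. The helper names `build_split_payload` and `parse_predictions` are hypothetical. The layout mirrors what `DataFrame.to_dict(orient='split')` produces (`index`, `columns`, `data`), and the response is assumed to carry a top-level `predictions` key, as in the query above.

```python
import json


def build_split_payload(columns, rows):
    """Serialize rows into a {"dataframe_split": {...}} JSON body.

    Mirrors pandas' to_dict(orient='split') with a default integer index.
    """
    rows = [list(row) for row in rows]
    body = {
        "dataframe_split": {
            "index": list(range(len(rows))),
            "columns": list(columns),
            "data": rows,
        }
    }
    return json.dumps(body)


def parse_predictions(response_text):
    """Return the 'predictions' list from a JSON response body."""
    body = json.loads(response_text)
    if "predictions" not in body:
        raise KeyError("response has no 'predictions' key")
    return body["predictions"]
```

Each inner list in `data` is one row, so several abstracts can be scored in a single request by passing more rows.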

In [0]:
response_json = json.loads(response.text)
predictions = response_json['predictions']
predictions

[5]


The model assigns the test abstract to cluster 5.

To check whether that is plausible, we can compare the test abstract with other abstracts in cluster 5.

In [0]:
similar_df = spark.sql('select id, abstract from main.default.ARXIV_ANALYSIS where prediction == 5')
display(similar_df.limit(10))

id,abstract
2207.03898,"Statistical classification of the Helios solar wind observations into several populations sorted by bulk speed has revealed an outward acceleration of the wind. The faster the wind is, the smaller is this acceleration in the 0.3 - 1 au radial range (Maksimovic et al. 2020). In this article we show that recent measurements from the Parker Solar Probe (PSP) are compatible with an extension closer to the Sun of the latter Helios classification. For instance the well established bulk speed/proton temperature (u,Tp) correlation and bulk speed/electron temperature (u,Te) anti-correlation, together with the acceleration of the slowest winds, are verified in PSP data. We also model the combined PSP and Helios data, using empirical Parker-like models for which the solar wind undergoes an ""iso-poly"" expansion: isothermal in the corona, then polytropic at distances larger than the sonic point radius. The polytropic indices are derived from the observed temperature and density gradients. Our modelling reveals that the electron thermal pressure has a major contribution in the acceleration process of slow and intermediate winds (in the range of 300-500 km/s at 1 au), over a broad range of distances and that the global (electron and protons) thermal energy, alone, is able to explain the acceleration profiles. Moreover, we show that the very slow solar wind requires in addition to the observed pressure gradients, another source of acceleration."
2207.03903,"We report the discovery of five cyano derivatives of propene towards TMC-1 with the QUIJOTE line survey: $trans$ and $cis$-crotononitrile ($t$-CH$_3$CHCHCN, $c$-CH$_3$CHCHCN), methacrylonitrile (CH$_2$C(CH$_3$)CN), and $gauche$ and $cis$-allyl cyanide ($g$-CH$_2$CHCH$_2$CN and $c$-CH$_2$CHCH$_2$CN). The observed transitions allowed us to derive a common rotational temperature of 7$\pm$1 K for all them. The derived column densities are N($t$-CH$_3$CHCHCN)=(5$\pm$0.5)$\times$10$^{10}$ cm$^{-2}$, N($c$-CH$_3$CHCHCN)=(1.3$\pm$0.2)$\times$10$^{11}$ cm$^{-2}$, N(CH$_2$C(CH$_3$)CN)=(1.0$\pm$0.1)$\times$10$^{11}$ cm$^{-2}$, N($g$-CH$_2$CHCH$_2$CN)=(8.0$\pm$0.8)$\times$10$^{10}$ cm$^{-2}$, and N($c$-CH$_2$CHCH$_2$CN)=(7.0$\pm$0.7)$\times$10$^{10}$ cm$^{-2}$, respectively. The abundance of cyano-propene relative to that of propene is thus $\sim$10$^{-2}$, which is considerably lower than those of other cyano derivatives of abundant hydrocarbons. Upper limits are obtained for two ethynyl derivatives of propene ($E$ and $Z$-CH$_3$CHCHCCH)."
2207.03968,"We analyze the galaxy pairs in a set of volume limited samples from the SDSS to study the effects of minor interactions on the star formation rate (SFR) and colour of galaxies. We carefully design control samples of the isolated galaxies by matching the stellar mass and redshift of the minor pairs. The SFR distributions and colour distributions in the minor pairs differ from their controls at $>99\%$ significance level. We also simultaneously match the control galaxies in stellar mass, redshift and local density to assess the role of the environment. The null hypothesis can be rejected at $>99\%$ confidence level even after matching the environment. Our analysis shows a quenching in the minor pairs where the degree of quenching decreases with the increasing pair separation and plateaus beyond 50 kpc. We also prepare a sample of minor pairs with $H_{\alpha}$ line information. We calculate the SFR of these galaxies using the $H_{\alpha}$ line and repeat our analysis. We observe a quenching in the $H_{\alpha}$ sample too. We find that the majority of the minor pairs are quiescent systems that could be quenched due to minor interactions. Combining data from the Galaxy Zoo and Galaxy Zoo2, we find that only $\sim 1\%$ galaxies have a dominant bulge, $4\%-7\%$ galaxies host a bar, and $5\%-10\%$ galaxies show the AGN activity in minor pairs. This indicates that the presence of bulge, bar and AGN activity plays an insignificant role in quenching the galaxies in minor pairs. The more massive companion satisfies the criteria for mass quenching in most of the minor pairs. We propose that the stripping and starvation likely caused the quenching in the less massive companion at a later stage of evolution."
2207.04039,"The bias of dark matter halos and galaxies is a crucial quantity in many cosmological analyses. In this work, using large cosmological simulations, we explore the halo mass function and halo bias within cosmic voids. For the first time to date, we show that they are scale-dependent along the void profile, and provide a predictive theoretical model of both the halo mass function and halo bias inside voids, recovering for the latter and 1% accuracy against simulated data. These findings may help shed light on the dynamics of halo formation within voids and improve the analysis of several void statistics from ongoing and upcoming galaxy surveys."
2207.0421,"Thanks to relatively firm mode identification, possible based on period ratios only, High Amplitude Delta Scuti Stars pulsating in at least three radial modes are promising targets for asteroseismic inference. In this study we used the most numerous sample of HADS from the OGLE inner bulge fields that likely pulsate in either three or four radial modes simultaneously. We have computed a grid of pulsation models along evolutionary tracks and determined the physical parameters of stars by matching their pulsation periods and period ratios. For 176 HADS we determined physical parameters, i.e. masses, luminosities, effective temperatures, metallicities and ages. We present the distribution of physical parameters and discuss their properties. We selected 16 candidates for SX Phoenicis stars."
2207.04269,"Chemical models and experiments indicate that interstellar dust grains and their ice mantles play an important role in the production of complex organic molecules (COMs). To date, the most complex solid-phase molecule detected with certainty in the ISM is methanol, but the James Webb Space Telescope (JWST) may be able to identify still larger organic species. In this study, we use a coupled chemo-dynamical model to predict new candidate species for JWST detection toward the young star-forming core Cha-MMS1, combining the gas-grain chemical kinetic code MAGICKAL with a 1-D radiative hydrodynamics simulation using Athena++. With this model, the relative abundances of the main ice constituents with respect to water toward the core center match well with typical observational values, providing a firm basis to explore the ice chemistry. Six oxygen-bearing COMs (ethanol, dimethyl ether, acetaldehyde, methyl formate, methoxy methanol, and acetic acid), as well as formic acid, show abundances as high as, or exceeding, 0.01% with respect to water ice. Based on the modeled ice composition, the infrared spectrum is synthesized to diagnose the detectability of the new ice species. The contribution of COMs to IR absorption bands is minor compared to the main ice constituents, and the identification of COM ice toward the core center of Cha-MMS1 with the JWST NIRCAM/Wide Field Slitless Spectroscopy (2.4-5.0 micron) may be unlikely. However, MIRI observations (5-28 micron) toward COM-rich environments where solid-phase COM abundances exceed 1% with respect to the water ice column density might reveal the distinctive ice features of COMs."
2207.04483,"We present observations with the Cosmic Origins Spectrograph onboard the Hubble Space Telescope of seven compact low-mass star-forming galaxies at redshifts, z,in the range 0.3161-0.4276, with various O3Mg2=[OIII]5007/MgII 2796+2803 and Mg2=MgII 2796/MgII 2803 emission-line ratios. We aim to study the dependence of leaking Lyman continuum (LyC) emission on the characteristics of MgII emission together with the dependences on other indirect indicators of escaping ionizing radiation. LyC emission with escape fractions fesc(LyC)=3.1-4.6 per cent is detected in four galaxies, whereas only 1sigma upper limits of fesc(LyC) in the remaining three galaxies were derived. A strong narrow Ly-alpha emission line with two peaks separated by Vsep~298-592 km/s was observed in four galaxies with detected LyC emission and very weak Ly-alpha emission is observed in galaxies with LyC non-detections. Our new data confirm the tight anti-correlation between fesc(LyC) and Vsep found for previous low-redshift galaxy samples. Vsep remains the best indirect indicator of LyC leakage among all considered indicators. It is found that escaping LyC emission is detected predominantly in galaxies with Mg2>1.3. A tendency of an increase of fesc(LyC) with increasing of both the O3Mg2 and Mg2 is possibly present. However, there is substantial scatter in these relations not allowing their use for reliable prediction of fesc(LyC)."
2207.04702,"After core hydrogen burning, massive stars evolve from blue-white dwarfs to red supergiants by expanding, brightening, and cooling within few millennia. We discuss a previously neglected constraint on mass, age, and evolutionary state of Betelgeuse and Antares, namely their observed colour evolution over historical times: We place all 236 stars bright enough for their colour to be discerned by the unaided eye (V$\le$3.3 mag) on the colour-magnitude-diagram (CMD), and focus on those in the Hertzsprung gap. We study pre-telescopic records on star colour with historically-critical methods to find stars that have evolved noticeably in colour within the last millennia. Our main result is that Betelgeuse was recorded with a colour significantly different (non-red) than today (red, B$-$V=$1.78 \pm 0.05$ mag). Hyginus (Rome) and Sima Qian (China) independently report it two millennia ago as appearing like Saturn (B$-$V=$1.09 \pm 0.16$ mag) in colour and `yellow' (quantifiable as B$-$V=$0.95 \pm 0.35$ mag), respectively (together, 5.1$\sigma$ different from today). The colour change of Betelgeuse is a new, tight constraint for single-star theoretical evolutionary models (or merger models). It is most likely located less than one millennium past the bottom of the red giant branch, before which rapid colour evolution is expected. Evolutionary tracks from MIST consistent with both its colour evolution and its location on the CMD suggest a mass of $\sim$14M$_{\odot}$ at $\sim$14 Myr. The (roughly) constant colour of Antares for the last three millennia also constrains its mass and age. Wezen was reported white historically, but is now yellow."
2207.04754,"Image restoration under severe weather is a challenging task. Most of the past works focused on removing rain and haze phenomena in images. However, snow is also an extremely common atmospheric phenomenon that will seriously affect the performance of high-level computer vision tasks, such as object detection and semantic segmentation. Recently, some methods have been proposed for snow removing, and most methods deal with snow images directly as the optimization object. However, the distribution of snow location and shape is complex. Therefore, failure to detect snowflakes / snow streak effectively will affect snow removing and limit the model performance. To solve these issues, we propose a Snow Mask Guided Adaptive Residual Network (SMGARN). Specifically, SMGARN consists of three parts, Mask-Net, Guidance-Fusion Network (GF-Net), and Reconstruct-Net. Firstly, we build a Mask-Net with Self-pixel Attention (SA) and Cross-pixel Attention (CA) to capture the features of snowflakes and accurately localized the location of the snow, thus predicting an accurate snow mask. Secondly, the predicted snow mask is sent into the specially designed GF-Net to adaptively guide the model to remove snow. Finally, an efficient Reconstruct-Net is used to remove the veiling effect and correct the image to reconstruct the final snow-free image. Extensive experiments show that our SMGARN numerically outperforms all existing snow removal methods, and the reconstructed images are clearer in visual contrast. All codes will be available."
2207.04902,"The use of realistic mock galaxy catalogues is essential in the preparation of large galaxy surveys, in order to test and validate theoretical models and to assess systematics. We present an updated version of the mock catalogue constructed from the Millennium-XXL simulation, which uses a halo occupation distribution (HOD) method to assign galaxies r-band magnitudes and g-r colours. We have made several modifications to the mock to improve the agreement with measurements from the SDSS and GAMA surveys. We find that cubic interpolation, which was used to build the original halo lightcone, produces extreme velocities between snapshots. Using linear interpolation improves the correlation function quadrupole measurements on small scales. We also update the g-r colour distributions so that the observed colours better agree with measurements from GAMA data, particularly for faint galaxies. As an example of the science that can be done with the mock, we investigate how the luminosity function depends on environment and colour, and find good agreement with measurements from the GAMA survey. This full-sky mock catalogue is designed for the ongoing Dark Energy Spectroscopic Instrument (DESI) Bright Galaxy Survey (BGS), and is complete to a magnitude limit r=20.2."



Looking at the results, the test abstract fits this cluster quite well: most of the abstracts deal with astronomical phenomena and astrophysics.

However, cluster 5 also contains some chemistry-oriented papers and an image-restoration paper on snow removal, so the clusters are not perfectly separated.
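
One way to put a number on how mixed a cluster is would be to count the arXiv archives of the papers assigned to it. The sketch below does this in plain Python on `(categories, prediction)` pairs. It assumes that the `categories` field is a space-separated list of tags with the primary tag first, and that such pairs can be pulled from the analysis table. The function name `category_shares` is our own, and the rows in the usage example are made up purely for illustration.

```python
from collections import Counter


def category_shares(rows, cluster):
    """Share of each top-level arXiv archive among papers in one cluster.

    rows: iterable of (categories, prediction) pairs, where categories is
    assumed to be a space-separated string such as "astro-ph.GA astro-ph.CO".
    """
    counts = Counter()
    for categories, prediction in rows:
        if prediction != cluster:
            continue
        primary = categories.split()[0]       # first tag is taken as primary
        counts[primary.split(".")[0]] += 1    # "astro-ph.GA" -> "astro-ph"
    total = sum(counts.values())
    if total == 0:
        return {}
    return {archive: n / total for archive, n in counts.items()}


# Made-up rows for illustration only, not taken from the dataset
toy_rows = [
    ("astro-ph.GA", 5),
    ("astro-ph.SR astro-ph.GA", 5),
    ("physics.chem-ph", 5),
    ("cs.CV", 5),
    ("math.CO", 3),
]
shares = category_shares(toy_rows, cluster=5)
```

A cluster dominated by a single archive would show one share close to 1, while a mixed cluster spreads the shares across several archives.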