<h1 style = "font-family: monospace">Embedding the arXiv</h1>
<p style = "text-align:justify">This notebook contains code neccessary to import arXiv metadata to database that will be later searched by engine. To do this, titles and abstracts of research publications in arXiv database are first converted to vector representations, and these representations or <b></b>embeddings</b> are loaded into database. I will be using two databases: Redis in-memory storage and postgres, both of them have support for indexing vectors. To enable support for vector, pgvector extension has to be installed for postgres. I'm also using redis-stack-server version of Redis that includes all neccessary extensions.</p>

In [2]:
using DataFrames
using JSON3
using LibPQ
using Tables
using Embeddings
using WordTokenizers
using CSV
using Statistics
using Jedis
using Printf
using Random
using PyCall

<h3>import pretrained word2vec into Redis</h3>
<p style = "text-align:justify">The engine will work by calculating vector representation of user query by averaging vectors of all words used in the query. To do this quickly, the program will keep all individual word vectors in Redis memory storage:</p>

In [2]:
# load pretrained word2vec embeddings
emtable = load_embeddings(Word2Vec, "../GoogleNews-vectors-negative300.bin")

# create hash tables to access embedding vector for every word and its lowercase variant
emindex = Dict(word => i for (i, word) in enumerate(emtable.vocab))
emlower = Dict()
for (word, i) in emindex
    lw = lowercase(word)
    if lw ∉ keys(emindex)
        emlower[lw] = i
    end
end

In [None]:
# create file with commands to Redis server to store word - vector pairs
wordvec = "../word2vec.redis"

if !isfile(wordvec)
    open(wordvec, "w") do output
        for ei in [emindex, emlower]
            for (word, i) in ei
                vector = emtable.embeddings[:, i]
                write(output, "HSET vectors:word2vec \"" * word * "\" \"" * "[" * join(string.(vector),", ") * "]" * "\"\n")
            end
        end
    end
end

# import file to Redis
Base.run(pipeline(`cat $wordvec`, `redis-cli`, devnull))

# save Redis database to persistent storage
Jedis.execute(["save"])

# check the result
@printf "imported %i word vectors" Jedis.execute(["hlen", "vectors:word2vec"])

<h3>import arXiv embeddings to Postgres</h3>
<p style = "text-align:justify">The search algorithm will work by calculating cosine distance of query vector to document vectors. In the next step I will load the data to Postgresql database, and create a vector index for similarity search:</p>

In [3]:
# load arXiv metadata into dataframe
dataset = "arxiv-metadata-oai-2024-04-15.json" ;
dataframe = JSON3.read.(eachline(dataset)) |> DataFrame ;

# take a random subset of N documents from dataframe (for testing on personal computer)
N = 0
if N > 0
    dataframe = dataframe[shuffle(1:nrow(dataframe))[1:N], :]
end

# extract publication year of every document
transform!(dataframe, :versions_dates => ByRow(dates -> parse(Int16, split(dates[1], " ")[4])) => :year)

# keep only relevant columns that will be stored in database
dataframe = dataframe[:, [:id, :title, :abstract, :authors, :doi, :year]]

Row,id,title,abstract,authors,doi,year
Unnamed: 0_level_1,String,String,String,String,Union…,Int16
1,0704.0001,Calculation of prompt diphoton production cross sections at Tevatron and\n LHC energies,"A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predictions are made for\nmore detailed tests with CDF and DO data. Predictions are shown for\ndistributions of diphoton pairs produced at the energy of the Large Hadron\nCollider (LHC). Distributions of the diphoton pairs from the decay of a Higgs\nboson are contrasted with those produced from QCD processes at the LHC, showing\nthat enhanced sensitivity to the signal can be obtained with judicious\nselection of events.\n","C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",10.1103/PhysRevD.76.013009,2007
2,0704.0002,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\\ell)$-pebble game with colors, and use\nit obtain a characterization of the family of $(k,\\ell)$-sparse graphs and\nalgorithmic solutions to a family of problems concerning tree decompositions of\ngraphs. Special instances of sparse graphs appear in rigidity theory and have\nreceived increased attention in recent years. In particular, our colored\npebbles generalize and strengthen the previous results of Lee and Streinu and\ngive a new proof of the Tutte-Nash-Williams characterization of arboricity. We\nalso present a new decomposition that certifies sparsity based on the\n$(k,\\ell)$-pebble game with colors. Our work also exposes connections between\npebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and\nWestermann and Hendrickson.\n",Ileana Streinu and Louis Theran,,2007
3,0704.0003,The evolution of the Earth-Moon system based on the dark matter field\n fluid model,"The evolution of Earth-Moon system is described by the dark matter field\nfluid model proposed in the Meeting of Division of Particle and Field 2004,\nAmerican Physical Society. The current behavior of the Earth-Moon system agrees\nwith this model very well and the general pattern of the evolution of the\nMoon-Earth system described by this model agrees with geological and fossil\nevidence. The closest distance of the Moon to Earth was about 259000 km at 4.5\nbillion years ago, which is far beyond the Roche's limit. The result suggests\nthat the tidal friction may not be the primary cause for the evolution of the\nEarth-Moon system. The average dark matter field fluid constant derived from\nEarth-Moon system data is 4.39 x 10^(-22) s^(-1)m^(-1). This model predicts\nthat the Mars's rotation is also slowing with the angular acceleration rate\nabout -4.38 x 10^(-22) rad s^(-2).\n",Hongjun Pan,,2007
4,0704.0004,A determinant of Stirling cycle numbers counts unlabeled acyclic\n single-source automata,We show that a determinant of Stirling cycle numbers counts unlabeled acyclic\nsingle-source automata. The proof involves a bijection from these automata to\ncertain marked lattice paths and a sign-reversing involution to evaluate the\ndeterminant.\n,David Callan,,2007
5,0704.0005,From dyadic $\\Lambda_{\\alpha}$ to $\\Lambda_{\\alpha}$,"In this paper we show how to compute the $\\Lambda_{\\alpha}$ norm, $\\alpha\\ge\n0$, using the dyadic grid. This result is a consequence of the description of\nthe Hardy spaces $H^p(R^N)$ in terms of dyadic and special atoms.\n",Wael Abu-Shammala and Alberto Torchinsky,,2007
6,0704.0006,Bosonic characters of atomic Cooper pairs across resonance,"We study the two-particle wave function of paired atoms in a Fermi gas with\ntunable interaction strengths controlled by Feshbach resonance. The Cooper pair\nwave function is examined for its bosonic characters, which is quantified by\nthe correction of Bose enhancement factor associated with the creation and\nannihilation composite particle operators. An example is given for a\nthree-dimensional uniform gas. Two definitions of Cooper pair wave function are\nexamined. One of which is chosen to reflect the off-diagonal long range order\n(ODLRO). Another one corresponds to a pair projection of a BCS state. On the\nside with negative scattering length, we found that paired atoms described by\nODLRO are more bosonic than the pair projected definition. It is also found\nthat at $(k_F a)^{-1} \\ge 1$, both definitions give similar results, where more\nthan 90% of the atoms occupy the corresponding molecular condensates.\n",Y. H. Pong and C. K. Law,10.1103/PhysRevA.75.043613,2007
7,0704.0007,Polymer Quantum Mechanics and its Continuum Limit,"A rather non-standard quantum representation of the canonical commutation\nrelations of quantum mechanics systems, known as the polymer representation has\ngained some attention in recent years, due to its possible relation with Planck\nscale physics. In particular, this approach has been followed in a symmetric\nsector of loop quantum gravity known as loop quantum cosmology. Here we explore\ndifferent aspects of the relation between the ordinary Schroedinger theory and\nthe polymer description. The paper has two parts. In the first one, we derive\nthe polymer quantum mechanics starting from the ordinary Schroedinger theory\nand show that the polymer description arises as an appropriate limit. In the\nsecond part we consider the continuum limit of this theory, namely, the reverse\nprocess in which one starts from the discrete theory and tries to recover back\nthe ordinary Schroedinger quantum mechanics. We consider several examples of\ninterest, including the harmonic oscillator, the free particle and a simple\ncosmological model.\n","Alejandro Corichi, Tatjana Vukasinac and Jose A. Zapata",10.1103/PhysRevD.76.044016,2007
8,0704.0008,Numerical solution of shock and ramp compression for general material\n properties,"A general formulation was developed to represent material models for\napplications in dynamic loading. Numerical methods were devised to calculate\nresponse to shock and ramp compression, and ramp decompression, generalizing\nprevious solutions for scalar equations of state. The numerical methods were\nfound to be flexible and robust, and matched analytic results to a high\naccuracy. The basic ramp and shock solution methods were coupled to solve for\ncomposite deformation paths, such as shock-induced impacts, and shock\ninteractions with a planar interface between different materials. These\ncalculations capture much of the physics of typical material dynamics\nexperiments, without requiring spatially-resolving simulations. Example\ncalculations were made of loading histories in metals, illustrating the effects\nof plastic work on the temperatures induced in quasi-isentropic and\nshock-release experiments, and the effect of a phase transition.\n",Damian C. Swift,10.1063/1.2975338,2007
9,0704.0009,"The Spitzer c2d Survey of Large, Nearby, Insterstellar Clouds. IX. The\n Serpens YSO Population As Observed With IRAC and MIPS","We discuss the results from the combined IRAC and MIPS c2d Spitzer Legacy\nobservations of the Serpens star-forming region. In particular we present a set\nof criteria for isolating bona fide young stellar objects, YSO's, from the\nextensive background contamination by extra-galactic objects. We then discuss\nthe properties of the resulting high confidence set of YSO's. We find 235 such\nobjects in the 0.85 deg^2 field that was covered with both IRAC and MIPS. An\nadditional set of 51 lower confidence YSO's outside this area is identified\nfrom the MIPS data combined with 2MASS photometry. We describe two sets of\nresults, color-color diagrams to compare our observed source properties with\nthose of theoretical models for star/disk/envelope systems and our own modeling\nof the subset of our objects that appear to be star+disks. These objects\nexhibit a very wide range of disk properties, from many that can be fit with\nactively accreting disks to some with both passive disks and even possibly\ndebris disks. We find that the luminosity function of YSO's in Serpens extends\ndown to at least a few x .001 Lsun or lower for an assumed distance of 260 pc.\nThe lower limit may be set by our inability to distinguish YSO's from\nextra-galactic sources more than by the lack of YSO's at very low luminosities.\nA spatial clustering analysis shows that the nominally less-evolved YSO's are\nmore highly clustered than the later stages and that the background\nextra-galactic population can be fit by the same two-point correlation function\nas seen in other extra-galactic studies. We also present a table of matches\nbetween several previous infrared and X-ray studies of the Serpens YSO\npopulation and our Spitzer data set.\n","Paul Harvey, Bruno Merin, Tracy L. Huard, Luisa M. Rebull, Nicholas\n Chapman, Neal J. Evans II, Philip C. Myers",10.1086/518646,2007
10,0704.0010,"Partial cubes: structures, characterizations, and constructions","Partial cubes are isometric subgraphs of hypercubes. Structures on a graph\ndefined by means of semicubes, and Djokovi\\'{c}'s and Winkler's relations play\nan important role in the theory of partial cubes. These structures are employed\nin the paper to characterize bipartite graphs and partial cubes of arbitrary\ndimension. New characterizations are established and new proofs of some known\nresults are given.\n The operations of Cartesian product and pasting, and expansion and\ncontraction processes are utilized in the paper to construct new partial cubes\nfrom old ones. In particular, the isometric and lattice dimensions of finite\npartial cubes obtained by means of these operations are calculated.\n",Sergei Ovchinnikov,,2007


In [52]:
# define some useful functions to calculate embedding vectors

function wordvector(word, N)
    if word in keys(emindex)
        emtable.embeddings[:, emindex[word]]
    elseif word != lowercase(word)
        wordvector(lowercase(word), N)
    elseif word in keys(emlower)
        emtable.embeddings[:, emlower[word]]
    else
        Float32.(vec(zeros(1, N)))
    end   
end

function getvector(title, text)
    words = reduce(vcat, nltk_word_tokenize.(split_sentences(title * "\n" * text)))
    vectors = stack(wordvector.(words, 300))
    vec(mean(vectors, dims = 2))
end

getvector (generic function with 1 method)

In [None]:
# create a separate column in dataframe that will store embedding vector for each document, might take long on large datasets
transform!(dataframe, [:title, :abstract] => ByRow((title, text) -> getvector(title, text)) => :word2vec) ;

<p style = "text-align:justify">I will create additional column <i>i</i> to index documents, that will be used later when searching for documents using Redis in-memory index:</p>

In [4]:
insertcols!(dataframe, 1, :i => 1:nrow(dataframe)) ;

<p style = "text-align:justify">Now I will save the resulting dataframe to CSV file, that will be later imported to Postgres database:</p>

In [7]:
embeds = "arxiv-embeddings-full.csv"  # output file to store metadata and embeddings that will be uploaded to database
dbname = "science"                    # database name in postgresql
dbowner = "jupyter-alexandra"         # user with permissions to create tables
pguser = "researchers"                # user with a read access to database
pgpass = "KRASLApQ6QjE6hX6ff"         # user password
;

In [13]:
transform(column, value) = something(value, missing)                               # take care of empty values
transform(column, value::String) = replace(value, "\n" => " ")                     # escape newline symbols in abstracts
transform(column, value::Matrix{Float32}) = "[" * join(string.(value), ",") * "]"  # encode vector array as string

CSV.write(embeds, dataframe, transform = transform) ;

In [15]:
cp(embeds, "/tmp/" * embeds, force = true)

open("/tmp/arXiv.sql", "w") do output
    
    sql = "SET client_min_messages TO WARNING; \n" *
          "CREATE EXTENSION IF NOT EXISTS vector; \n" *
          "CREATE TABLE arxiv(i INT PRIMARY KEY, id VARCHAR(32), title TEXT, abstract TEXT, authors TEXT, doi VARCHAR(256), year INT,  word2vec vector(300)); \n" *
          "COPY arxiv(i, id, title, abstract, authors, doi, year, word2vec) FROM '/tmp/" * embeds * "' DELIMITER ',' CSV HEADER; \n" *
          "CREATE INDEX ON arxiv USING hnsw (word2vec vector_cosine_ops); \n" *
          "CREATE INDEX ON arxiv (id); \n" *
          "CREATE INDEX ON arxiv (year); \n"
    
    write(output, sql)
end

Base.run(`sudo -u $dbowner psql -d $dbname -a -f /tmp/arXiv.sql`)

rm("/tmp/arxiv-embeddings-full.csv")

SET
CREATE EXTENSION IF NOT EXISTS vector; 
CREATE EXTENSION
CREATE TABLE arxiv(i INT PRIMARY KEY, id VARCHAR(32), title TEXT, abstract TEXT, authors TEXT, doi VARCHAR(256), year INT,  word2vec vector(300)); 
CREATE TABLE
COPY arxiv(i, id, title, abstract, authors, doi, year, word2vec) FROM '/tmp/arxiv-embeddings-full.csv' DELIMITER ',' CSV HEADER; 
COPY 2459557
CREATE INDEX ON arxiv USING hnsw (word2vec vector_cosine_ops); 
CREATE INDEX
CREATE INDEX ON arxiv (id); 
CREATE INDEX
CREATE INDEX ON arxiv (year); 
CREATE INDEX


Now let's search for some documents using embeddings. To do this, I will define a function that will take query as a string and return top N documents from the database most similar to the query.

In [27]:
# convert query vector from array of numbers to string representation to use in SQL query
function search(query, N, pq, model = "word2vec")
    vector = getvector("", query)
    vector = string("[", join(vector, ","), "]")
    result = LibPQ.execute(pq, "SELECT id, $model <-> \$1 as distance, title FROM arxiv ORDER BY distance LIMIT \$2", [vector, N])
    DataFrame(columntable(result))
end

# run test search for a document specified by title and abstract text
function test(title, text; search::Function, N, pq, model = "word2vec")
    ts = @elapsed results = search(title * "\n" * text, N, pq, model)
    title = replace(title, "\n" => "")
    printstyled("\n" * title * "\n", underline = true)
    println(string(round(ts; digits = 4)) * " sec.\n")
    for result in eachrow(results)
        println("(" * string(round(result.distance; digits = 4)) * ") " * result.title)
    end
end

test (generic function with 1 method)

Let's test if function works correctly by selecting some random documents from database, and searching for them. The documents should appear at the top:

In [44]:
# connect to the postgresql database
pq = LibPQ.Connection("dbname=" * dbname * " host=localhost user=" * pguser * " password='" * pgpass * "'")

N_test = 8
subset = dataframe[shuffle(1:nrow(dataframe))[1:N_test], :]

test.(subset[!, :title], subset[!, :abstract], search = search, N = 2, pq = pq) ;



[0m[4mThe leaves of the Fatou set accumulate on the leaves of the Julia set[24m
2.2484 sec.

(0.0) The leaves of the Fatou set accumulate on the leaves of the Julia set
(0.1008) Tropical curves in sandpile models

[0m[4mAn Improved Stability Method for Linear Systems with Fast-Varying Delays[24m
2.2566 sec.

(0.0) An Improved Stability Method for Linear Systems with Fast-Varying Delays
(0.0923) Quadratic obstructions to small-time local controllability for   scalar-input differential systems

[0m[4mQuasiconformal Mappings and Neumann Eigenvalues of Divergent Elliptic  Operators[24m
2.2627 sec.

(0.0) Quasiconformal Mappings and Neumann Eigenvalues of Divergent Elliptic   Operators
(0.0815) Estimates of Dirichlet eigenvalues of divergent elliptic operators in   non-Lipschitz domains

[0m[4mImproving Large-scale Language Models and Resources for Filipino[24m
2.2664 sec.

(0.0) Improving Large-scale Language Models and Resources for Filipino
(0.0968) The Interplay of Variant,

As can be noticed from the results above, the search algorithm works correctly. Now we can test search by using custom phrase, such as <i>article about artificial intelligence</i>:

In [47]:
query = IJulia.readprompt("input the search query: ")
n = IJulia.readprompt("number of search results to display: ")

total = rowtable(LibPQ.execute(pq, "SELECT count(id) FROM arxiv"))[1][1]
@printf "\nsearching %i documents...\n" total

test(query, "", N = n, pq = pq)

input the search query:  article about artificial intelligence
number of search results to display:  32



searching 2459557 documents...

[0m[4marticle about artificial intelligence[24m
2.277 sec.

(0.4372) Impact of Artificial Intelligence on Economic Theory
(0.4393) Artificial Intelligence in Humans
(0.442) What can the brain teach us about building artificial intelligence?
(0.4437) Does an artificial intelligence perform market manipulation with its own   discretion? -- A genetic algorithm learns in an artificial market simulation
(0.4438) Retracted Articles about COVID-19 Vaccines Enable Vaccine Misinformation   on Twitter
(0.4439) Super forecasting the technological singularity risks from artificial   intelligence
(0.4471) Physical aging in article page views
(0.4483) Ethical Considerations in Artificial Intelligence Courses
(0.4494) Philosophy in the Face of Artificial Intelligence
(0.4503) Automatic Detection of Entity-Manipulated Text using Factual Knowledge
(0.4509) Artificial Intelligence Technology analysis using Artificial   Intelligence patent through Deep Learning model a

<h3>import arXiv embeddings to Redis</h3>
<p style = "text-align:justify">Redis is an in-memory storage, and queries to Redis can be faster than to classical SQL databases. In this section I'm going to create a vector index in Redis and run queries on it. However, keeping the whole metadata table in memory is quite time-consuming, so I will only keep the <b>i</b> column that stores numeric index of every document. The search engine would search for indexes of similar documents in Redis, and after that select metadata (title and etc.) from Postgres database.</p>

In [None]:
# export embeddings with indexes from postgresql database and load them to dataframe
Base.run(pipeline(`sudo -u $dbowner psql -d $dbname -t -A -F"," -c "SELECT id, i, word2vec FROM arxiv"`, "/tmp/arxiv.csv"))
redset = CSV.read("/tmp/arxiv.csv", DataFrame; header = [:id, :i, :word2vec], openquotechar = '[', closequotechar = ']')

# decode the embeddings column in dataframe from string representation to numeric array
redset.word2vec = map(split.(strip.(redset.word2vec, Ref(['[', ']'])), ',')) do nums
   parse.(Float32, nums)
end

redset = NamedTuple.(eachrow(redset)) ;

In [None]:
# define a generic function to import embeddings into Redis

function embed2red(data, model)
    # write the dataset to Redis-command file for importing
    output = open("$model.arxiv","w")
    fembed(record, output) = write(output, "json.set \"paper:" * record.id * "\" \$ \"" * escape_string(JSON3.write(record)) * "\"\n")
    fembed.(redset, output)

    dimensions = 300 # !!!!!!!
    
    # plus command to create index
    write(output, "FT.CREATE $model ON JSON PREFIX 1 \"paper:\" SCHEMA \$.i as i NUMERIC \$.$model as $model VECTOR HNSW 6 TYPE FLOAT32 DIM $dimensions DISTANCE_METRIC COSINE")
    close(output)

    # import the data
    Base.run(pipeline(`cat $model.arxiv`, `redis-cli`, devnull))

    # for large datasets index will be created in background, so we need to wait before using it
    reindex() = parse(Int, Jedis.execute(["ft.info", model])[10])
    if reindex() < length(redset)
        print("arxiv $model embeddings imported to Redis. creating index will take time, waiting... ")
        sleep(1); while reindex() < length(redset) sleep(4) end
        println("done! saving result...")
    end
    
    Jedis.execute(["save"])
    
end

In [14]:
# run the import and wait for the result
embed2red(redset, "word2vec")

arxiv embeddings imported to Redis. creating index will take time, waiting... done!


<p style = "text-align:justify">Now we can test how the search is performing using Redis. To do this, I will use the same functions as for Postgresql previously, but I will define a separate search function that will work on Redis. This function will access Redis vector index through Python call, because it is not clear how to implement binary vector request to Redis in Julia <i>* at least I did not manage to do it so far and will solve this problem later</i></p>

In [51]:
# define a runquery function that will access Redis vector index and return N top matches
py"""

import redis
import numpy
from redis.commands.search.query import Query

def runquery(vector, N, model):
    query = (
        Query("(*)=>[KNN 10000 @" + model + " $vector AS distance]")
        .sort_by("distance")
        .paging(0, N)
        .return_fields("distance", "i")
        .dialect(2)
    )
    bytes = numpy.array(vector, dtype = numpy.float32).tobytes()
    client = redis.Redis(port = 6379, decode_responses = True)
    result = client.ft(model).search(query, {"vector": bytes})
    return result.docs

"""

# the function to search for query on Redis also supports models different from word2vec
function research(query; N, pq, model)
    if model == "word2vec"
        vector = getvector("", query)
    else
        getvecf = "vec4" * model
        vector = getfield(Main, Symbol(getvecf))(query)
    end
    results = py"runquery"(vector, N, model)
    records = DataFrame([NamedTuple([:i => parse(Int, r.i), :distance => parse(Float32, r.distance)])  for r in results])
    result = LibPQ.execute(pq, "SELECT i, id, title FROM arxiv WHERE i IN (" * join([p.i for p in results], ", ") * ")")
    results = DataFrame(columntable(result))
    leftjoin(results, records, on = [:i])
end

research (generic function with 1 method)

In [33]:
pq = LibPQ.Connection("dbname=" * dbname * " host=localhost user=" * pguser * " password='" * pgpass * "'")

N_test = 8
subset = dataframe[shuffle(1:nrow(dataframe))[1:N_test], :]

test.(subset[!, :title], subset[!, :abstract], search = research, N = 2, pq = pq) ;


[0m[4mEntropy Spectrum of a Carged Black Hole of Heterotic String Theory via  Adiabatic Invariance[24m
0.1095 sec.

(0.0) Entropy Spectrum of a Carged Black Hole of Heterotic String Theory via   Adiabatic Invariance
(0.0473) Universality of Quantum Entropy for Extreme Black Holes

[0m[4mUnique continuation for the momentum ray transform[24m
0.0511 sec.

(0.0) Unique continuation for the momentum ray transform
(0.102) HourglassNeRF: Casting an Hourglass as a Bundle of Rays for Few-shot   Neural Rendering

[0m[4mDipolar condensates confined in a toroidal trap: ground state and  vortices[24m
0.0687 sec.

(0.0) Dipolar condensates confined in a toroidal trap: ground state and   vortices
(0.0252) Fast rotating condensates in an asymmetric harmonic trap

[0m[4mThrough the Big Bang[24m
0.0445 sec.

(0.0) Through the Big Bang
(0.0303) Entropy and the Typicality of Universes

[0m[4mInsights On Streamflow Predictability Across Scales Using Horizontal  Visibility Graph Based Networ

Performing custom-phrase search:

In [35]:
query = IJulia.readprompt("input the search query: ")
n = IJulia.readprompt("number of search results to display: ")
test(query, "", search = research, N = n, pq = pq)

input the search query:  articles about the Big Bang
number of search results to display:  12



[0m[4marticles about the Big Bang[24m
0.0534 sec.

(0.3244) Global Fluctuation Spectra in Big Crunch/Big Bang String Vacua
(0.3254) Theory Closing Talk
(0.3263) Wave asymptotics at a cosmological time-singularity: classical and   quantum scalar fields
(0.3267) Have Pulsar Timing Arrays detected the Hot Big Bang? Gravitational Waves   from Strong First Order Phase Transitions in the Early Universe
(0.3271) Big Bang Nucleosynthesis and Particle Dark Matter
(0.3297) Depth-graded motivic Lie algebra
(0.331) Hinode 7: Conference Summary and Future Suggestions
(0.332) Answer to the Comment about the Letter entitled ``Scalar fields as dark   matter in spiral galaxies''
(0.332) Dark Matter and Gravity Waves from a Dark Big Bang
(0.3354) Summary Talk at the 3rd KEK Topical Conference on CP Violation
(0.3368) Lie bialgebra structures on the Schr\"{o}dinger-Virasoro Lie algebra
(0.3379) Report from Sessions 1 and 3, including the Local Bubble Debate


The Redis search works correctly, because results are the same as with Postgresql. Now we are going to measure whether using Redis is faster than Postgresql:

In [38]:
print("Performance of Redis:")
@time research(query, 32, pq);
print("Performance of Postgres:")
@time search(query, 32, pq);

Performance of Redis:  0.045551 seconds (2.00 k allocations: 108.094 KiB)
Performance of Postgres:  2.258991 seconds (1.36 k allocations: 97.523 KiB)


The Redis index performs much better on large datasets.

<h3 style = "line-height:96px"><img align = left width = "256px" style = "min-width:33%" src = "https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.svg">HuggingFace embeddings</h3>
<p style = "text-align:justify">In this section I'm going to use <a target = _blank href = "https://huggingface.co/spaces/mteb/leaderboard">a model from Hugging Face</a> to generate embeddings for each document instead of word2vec. The model taken will be <b>gte-large-en-v1.5</b> since it has shown best performance on information retrieval task for scinetific datasets: SCIDOCS and SciFact, and also has one of the best performance for Englis-language retrieval task overall.</p><p></p>Unfortunately, I will have to use python call here, because current Julia support for Hugging Face models is limited (there seems to be a bug that does not allow to download every model) and Sentence Transformers are also lacking. To save time, I will use PyCall right now and work on a pure Julia implementation afterwards.</p>

In [8]:
py"""
from sentence_transformers import SentenceTransformer

def model(path):
    return SentenceTransformer(path, trust_remote_code = True, device = "cuda")
"""

gtelarge = py"model"("Alibaba-NLP/gte-large-en-v1.5")

PyObject SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [None]:
using CUDA

device_reset!()

In [6]:
chunks = NamedTuple.(eachrow(dataframe[!, [:i, :id, :title, :abstract]])) ;

In [None]:
open("gtelarge.arxiv","w") do output

    datale = length(chunks)
#    sizech = 8192 * 32 * 32 * 8
#    counch = Int(ceil(datale/sizech))

#    for i in range(1, step = sizech, length = counch)
#        i = Int(i)
#        j = i + sizech - 1
#        if j > datale
#            j = datale
#        end

        chunk = [(p.title * "\n" * p.abstract) for p in chunks] # for p in chunks[i:j]]
        chunk = gtelarge.encode(chunk, batch_size = 128)

        for ii in 1:datale # i:j
            record = (i = chunks[ii].i, id = chunks[ii].id, gtelarge = chunk[ii, :]) # -i+1
            write(output, "json.set \"paper:" * record.id * "\" \$ \"" * escape_string(JSON3.write(record)) * "\"\n")
        end
#
    end

    write(output, "FT.CREATE gtelarge ON JSON PREFIX 1 \"paper:\" SCHEMA \$.i as i NUMERIC \$.gtelarge as gtelarge VECTOR HNSW 6 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE")

end

#close(output)

Now I will save the embeddings and import them into Redis only, to avoid unneccessary memory consumption by Postgres index, since onlty Redis indexing will be used for performance:

In [None]:
# check dimensions
# save and import to Redis no Postgres

gteset = NamedTuple.(eachrow(dataframe[!, :i, :id, :gtelarge])) ;

embed2red(gteset, gtelarge)

In [None]:
vec4gtelarge(query) = gtelarge.encode(query)
    
N_test = 2
subset = dataframe[shuffle(1:nrow(dataframe))[1:N_test], :]

test.(subset[!, :title], subset[!, :abstract], search = research, N = 4, pq = pq, model = "gtelarge") ;