
Conversation

@andimarafioti (Contributor) commented Mar 6, 2024

I optimized the substr_matching algorithm by integrating the Aho-Corasick algorithm.

The Aho-Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick in 1975.[1] It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text, matching all strings simultaneously.
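To make that concrete, here is a minimal, self-contained sketch of dictionary matching with pyahocorasick; the dictionary words and the haystack are made-up examples (not entries from metadata.json), and the benchmark script below applies the same pattern to the real metadata:

import ahocorasick

# Hypothetical dictionary and haystack, purely for illustration.
patterns = ["he", "she", "his", "hers"]

automaton = ahocorasick.Automaton()
for idx, word in enumerate(patterns):
    automaton.add_word(word, (idx, word))
automaton.make_automaton()

# One pass over the haystack reports every dictionary word it contains,
# together with the index of the last character of each match.
for end_index, (idx, word) in automaton.iter("ushers"):
    print(end_index, idx, word)
# reports "she" and "he" (both ending at index 3) and "hers" (ending at index 5)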

import json
import timeit
import random
import ahocorasick
from metaclip.substr_matching import spacing

with open("metadata.json") as f:
    metadata = json.load(f)


automaton = None
spaced_metadata = None

def initialize_automaton(spaced_metadata):
    # Build the Aho-Corasick automaton once over all space-padded metadata entries.
    automaton = ahocorasick.Automaton()
    for idx, key in enumerate(spaced_metadata):
        automaton.add_word(key, (idx, key))
    automaton.make_automaton()
    return automaton

def optimized_substr_matching(text, metadata):
    global spaced_metadata, automaton
    if spaced_metadata is None:
        spaced_metadata = [f" {entry} " for entry in metadata]
    text = spacing(text)
    if automaton is None:
        automaton = initialize_automaton(spaced_metadata)
    # A single pass of the automaton over the text reports every metadata entry it contains.
    matched_entry_ids = set()
    for end_index, (entry_id, original_value) in automaton.iter(text):
        matched_entry_ids.add(entry_id)
    return list(matched_entry_ids)
    
def original_substr_matching(text, metadata):
    global spaced_metadata
    if spaced_metadata is None:
        spaced_metadata = []
        for entry in metadata:
            spaced_metadata.append(f" {entry} ")
    text = spacing(text)
    matched_entry_ids = []
    for entry_id, entry in enumerate(spaced_metadata):
        if entry in text:
            matched_entry_ids.append(entry_id)
    return matched_entry_ids

def process_texts_original(texts, metadata):
    return [original_substr_matching(text, metadata) for text in texts]

def process_texts_optimized(texts, metadata):
    return [optimized_substr_matching(text, metadata) for text in texts]

# Generate a sample text for testing
sample_text = " ".join(random.choices(metadata, k=10))

# Time the original and optimized functions
original_time = timeit.timeit(lambda: original_substr_matching(sample_text, metadata), number=1)
warm_up_optimized_time = timeit.timeit(lambda: optimized_substr_matching(sample_text, metadata), number=1)
optimized_time = timeit.timeit(lambda: optimized_substr_matching(sample_text, metadata), number=1)

print(f"Original method time: {original_time:.6f} seconds")
print(f"Warm up optimized method time: {warm_up_optimized_time:.6f} seconds")
print(f"Optimized method time: {optimized_time:.6f} seconds")

is_consistent = sorted(original_substr_matching(sample_text, metadata)) == sorted(optimized_substr_matching(sample_text, metadata)) 
print(f"Is the new implementation consistent with the old one?: {is_consistent}")

# Generate a list of short sentences as sample texts
sample_texts = [" ".join(random.choices(metadata, k=10)) for _ in range(1000)]  # 1000 short sentences

# Time the processing of a list of sentences
original_time = timeit.timeit(lambda: process_texts_original(sample_texts, metadata), number=1)
optimized_time = timeit.timeit(lambda: process_texts_optimized(sample_texts, metadata), number=1)

print(f"Original method time for 1000 texts: {original_time:.6f} seconds")
print(f"Optimized method time for 1000 texts: {optimized_time:.6f} seconds")

Which prints on my laptop:

Original method time: 0.086521 seconds
Warm up optimized method time: 0.662693 seconds
Optimized method time: 0.000024 seconds
Is the new implementation consistent with the old one?: True
Original method time for 1000 texts: 51.219827 seconds
Optimized method time for 1000 texts: 0.015514 seconds

Please do consider that the output of the new substr_matching will not be ordered now. If you need it to be ordered, you can change the last line to a sorted return, as sketched below.
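For instance, the final line of optimized_substr_matching would become:

# return matched entry ids in ascending order for a deterministic output
return sorted(matched_entry_ids)

(sorted accepts the set directly, so wrapping it in list first is not necessary.)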

I appreciate that you want to reduce dependencies for the repo, but considering that the use-case is building massive datasets, this sounds like a sensible improvement to me. What do you think? Is there some context I'm missing?

@facebook-github-bot added the CLA Signed label Mar 6, 2024
@howardhsu (Contributor) commented:
That's a great idea to improve the existing brute-force matching, thanks a lot for the contribution. Will refactor slightly later.

@howardhsu merged commit 17aac1f into facebookresearch:main Mar 25, 2024