# String matching examples
## Another similarity measure

This similarity measure, SequenceMatcher from Python's built-in difflib module, is another way to compare two strings:

In [1]:
from difflib import SequenceMatcher

In [2]:
# comparable to measures like Levenshtein, but ratio() returns a similarity between 0 and 1 (rounded here to two decimals) instead of an edit distance
def similar(str1, str2):
    return round(SequenceMatcher(None, str1, str2).ratio(), 2)
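
To make the ratio less of a black box: difflib documents it as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes that value from the matching blocks. The name ratio_by_hand is a hypothetical helper for illustration and is not part of the notebook.

```python
from difflib import SequenceMatcher

def ratio_by_hand(str1, str2):
    # hypothetical helper: recompute SequenceMatcher.ratio() as 2*M/T
    matcher = SequenceMatcher(None, str1, str2)
    # M: total size of all matching blocks (the final dummy block has size 0)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both strings
    total = len(str1) + len(str2)
    # difflib defines the ratio of two empty strings as 1.0
    return 2.0 * matched / total if total else 1.0

print(ratio_by_hand("spam", "park"))  # only "pa" matches: 2*2/8 = 0.5
```

Because M depends on which blocks the matcher picks, the ratio is not guaranteed to be symmetric in its two arguments.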

A simple function to convert the result into a beautified string:

In [3]:
def stringify_sim(ratio, str1, str2):
    # str.format fills in both strings and the ratio
    return "Similarity between {0} and {1}: {2}".format(str1, str2, ratio)

A few examples:

In [4]:
print(stringify_sim(similar("retrieval", "retro"), "retrieval", "retro"))
print(stringify_sim(similar("spam", "park"), "spam", "park"))
print(stringify_sim(similar("height", "heihgt"), "height", "heihgt"))

Similarity between retrieval and retro: 0.57
Similarity between spam and park: 0.5
Similarity between height and heihgt: 0.83
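
One possible use of such a score is picking the closest entry from a list of candidates, for example to suggest a correction for a misspelled word. The following is a minimal sketch under that assumption: closest_word and the vocabulary are made up for this example.

```python
from difflib import SequenceMatcher

def similar(str1, str2):
    # same definition as in the cell above
    return round(SequenceMatcher(None, str1, str2).ratio(), 2)

def closest_word(query, vocabulary):
    # hypothetical helper: return the candidate with the highest similarity
    return max(vocabulary, key=lambda word: similar(query, word))

# the vocabulary is made up for this example
print(closest_word("heihgt", ["weight", "height", "eight"]))  # height
```

The standard library also offers difflib.get_close_matches, which is built on the same matcher and may be the more convenient choice in practice.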


## Exact string matching variations

In [5]:
import re

Function for checking whether a pattern occurs in the text. It returns the (start, end) indices of the first occurrence, with an inclusive end index, or (-1, -1) if there is none:

In [6]:
def evaluate_single_match(pattern, s):
    match = re.search(pattern, s)
    return (match.start(), match.end()-1) if match is not None else (-1,-1)
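
Note that re.search interprets its first argument as a regular expression, so characters such as "." or "(" are not matched literally. If truly exact matching is wanted, one option is to escape the input first. The sketch below assumes that intent: evaluate_literal_match is a hypothetical variant and not part of the notebook.

```python
import re

def evaluate_literal_match(literal, s):
    # hypothetical variant: escape regex metacharacters so the input is matched verbatim
    match = re.search(re.escape(literal), s)
    return (match.start(), match.end() - 1) if match is not None else (-1, -1)

text = "I'd like an apple (guess I'm hungry...)."
print(evaluate_literal_match("...", text))     # (35, 37), the three literal dots
print(evaluate_literal_match("(guess", text))  # (18, 23), would raise re.error if unescaped
```

Without escaping, the pattern "..." would match the first three characters of any text, since each dot stands for an arbitrary character.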

Function for checking whether a pattern occurs in the text. It returns the (start, end) indices of all occurrences, including overlapping ones:

In [7]:
def evaluate_all_matches(pattern_str, s):
    pattern = re.compile(pattern_str)
    res = []
    match = pattern.search(s)
    while match:
        # store inclusive (start, end) indices
        res.append((match.start(), match.end()-1))
        # resume one character after the match start so overlapping matches are found too
        match = pattern.search(s, match.start() + 1)
    return res
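
The loop restarts the search one character after the start of the previous match, which is why overlapping occurrences are reported as well. The sketch below contrasts this with re.finditer, which only yields non-overlapping matches. Both function names are hypothetical and chosen for this comparison.

```python
import re

def overlapping_spans(pattern_str, s):
    # same idea as evaluate_all_matches: resume one character after each match start
    pattern = re.compile(pattern_str)
    spans = []
    match = pattern.search(s)
    while match:
        spans.append((match.start(), match.end() - 1))
        match = pattern.search(s, match.start() + 1)
    return spans

def non_overlapping_spans(pattern_str, s):
    # re.finditer continues after the end of each match
    return [(m.start(), m.end() - 1) for m in re.finditer(pattern_str, s)]

print(overlapping_spans("aa", "aaaa"))      # [(0, 1), (1, 2), (2, 3)]
print(non_overlapping_spans("aa", "aaaa"))  # [(0, 1), (2, 3)]
```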

Function for evaluating if any string in a set matches in the text - returns all occurrences with their respective word:

In [8]:
def evaluate_all_on_sets(set_of_words, s):
    res = []
    for word in set_of_words:
        res.extend((match, word) for match in evaluate_all_matches(word, s))

    return res

A few examples:

In [9]:
first_text = "This is an example text."
second_text = "I'd like an apple (guess I'm hungry...)."

- Single match:

In [10]:
print(evaluate_single_match("apple", first_text))
print(evaluate_single_match("apple", second_text))

(-1, -1)
(12, 16)


- All matches:

In [11]:
print(evaluate_all_matches("I", first_text))
print(evaluate_all_matches("I", second_text))

[]
[(0, 0), (25, 25)]


- All matches on set:

In [12]:
print(evaluate_all_on_sets({"I", "an", "example"}, first_text))
print(evaluate_all_on_sets({"I", "an", "example"}, second_text))

[((8, 9), 'an'), ((11, 17), 'example')]
[((9, 10), 'an'), ((0, 0), 'I'), ((25, 25), 'I')]
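
Since a Python set has no guaranteed iteration order, the order of the result tuples above can differ between runs. If a stable order is preferred, the results can be sorted by position. The following is a self-contained sketch of that idea: evaluate_all_on_sets_sorted is a hypothetical variant, not part of the notebook.

```python
import re

def evaluate_all_on_sets_sorted(set_of_words, s):
    # hypothetical variant: same result tuples, but sorted by match position
    res = []
    for word in set_of_words:
        pattern = re.compile(word)
        match = pattern.search(s)
        while match:
            res.append(((match.start(), match.end() - 1), word))
            match = pattern.search(s, match.start() + 1)
    return sorted(res)

second_text = "I'd like an apple (guess I'm hungry...)."
print(evaluate_all_on_sets_sorted({"I", "an", "example"}, second_text))
# [((0, 0), 'I'), ((9, 10), 'an'), ((25, 25), 'I')]
```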


## Regex string matching

Regex matching is built into Python via the "re" module, so only a small helper function is defined to wrap it:

In [13]:
# own findall function: returns the whole match even if the regex contains groups (re.findall would return the groups instead)
def findall_regex(regex, text):
    return [x.group(0) for x in re.finditer(regex, text)]
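
A likely reason for wrapping re.finditer instead of calling re.findall directly is that re.findall returns the contents of the capturing groups as soon as the pattern contains any. The short sketch below shows the difference on a made-up sample string.

```python
import re

regex_with_group = "[a-zA-Z]+(-[a-zA-Z]+)?"
sample = "full-text search"  # made-up sample text

# re.findall returns the content of the capturing group, not the whole match
print(re.findall(regex_with_group, sample))                         # ['-text', '']
# group(0) always refers to the whole match
print([m.group(0) for m in re.finditer(regex_with_group, sample)])  # ['full-text', 'search']
```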

A few regex examples on the Wikipedia introduction text on information retrieval:

In [14]:
wiki_ir_intro = """Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides
access to books, journals and other documents; stores and manages those documents. Web search engines are the most visible IR applications."""

print(wiki_ir_intro)

Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides
access to books, journals and other documents; stores and manages those documents. Web search engines are the most visible IR applications.


Match all words:

In [15]:
words_regex = "[a-zA-Z]+(-[a-zA-Z]+)?"

print(findall_regex(words_regex, wiki_ir_intro))

['Information', 'retrieval', 'IR', 'in', 'computing', 'and', 'information', 'science', 'is', 'the', 'process', 'of', 'obtaining', 'information', 'system', 'resources', 'that', 'are', 'relevant', 'to', 'an', 'information', 'need', 'from', 'a', 'collection', 'of', 'those', 'resources', 'Searches', 'can', 'be', 'based', 'on', 'full-text', 'or', 'other', 'content-based', 'indexing', 'Information', 'retrieval', 'is', 'the', 'science', 'of', 'searching', 'for', 'information', 'in', 'a', 'document', 'searching', 'for', 'documents', 'themselves', 'and', 'also', 'searching', 'for', 'the', 'metadata', 'that', 'describes', 'data', 'and', 'for', 'databases', 'of', 'texts', 'images', 'or', 'sounds', 'Automated', 'information', 'retrieval', 'systems', 'are', 'used', 'to', 'reduce', 'what', 'has', 'been', 'called', 'information', 'overload', 'An', 'IR', 'system', 'is', 'a', 'software', 'system', 'that', 'provides', 'access', 'to', 'books', 'journals', 'and', 'other', 'documents', 'stores', 'and', 'ma

Match all words starting with i/I that are preceded by whitespace (so the very first word of the text and "(IR)" are not matched):

In [16]:
words_with_i_regex = r"(?<=\s)(i|I)[a-zA-Z]+"

print(findall_regex(words_with_i_regex, wiki_ir_intro))

['in', 'information', 'is', 'information', 'information', 'indexing', 'Information', 'is', 'information', 'in', 'images', 'information', 'information', 'IR', 'is', 'IR']
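
The lookbehind requires a whitespace character directly before the word, so a word at the very start of the text or right after an opening parenthesis is skipped. If those should be included, a word boundary is one possible alternative. The sketch below compares both on a made-up sample: boundary_regex is an assumption and not part of the notebook.

```python
import re

sample = "Information retrieval (IR) is important."  # made-up sample text

lookbehind_regex = r"(?<=\s)(i|I)[a-zA-Z]+"  # regex from the cell above
boundary_regex = r"\b[iI][a-zA-Z]+"          # assumed alternative using a word boundary

print([m.group(0) for m in re.finditer(lookbehind_regex, sample)])  # ['is', 'important']
print([m.group(0) for m in re.finditer(boundary_regex, sample)])    # ['Information', 'IR', 'is', 'important']
```

The word boundary also fires after a hyphen, so the second part of a hyphenated word would be matched on its own if it starts with i/I.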


Match all hyphenated words:

In [17]:
hyphen_regex = "[a-zA-Z]+-[a-zA-Z]+"

print(findall_regex(hyphen_regex, wiki_ir_intro))

['full-text', 'content-based']


Match all sentences:

In [18]:
sentence_regex = r'[^\s](.|\s)*?\.'

print(findall_regex(sentence_regex, wiki_ir_intro))

['Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources.', 'Searches can be based on full-text or other content-based indexing.', 'Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.', 'Automated information retrieval systems are used to reduce what has been called information overload.', 'An IR system is a software system that provides\naccess to books, journals and other documents; stores and manages those documents.', 'Web search engines are the most visible IR applications.']
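
The group (.|\s) is there so that a sentence may span a line break, since "." does not match a newline by default. The same behavior can presumably be written more compactly with the re.DOTALL flag. The sketch below checks this on a made-up sample: dotall_regex is an assumed equivalent, not part of the notebook.

```python
import re

sample = "First one. Second\nline. Third."  # made-up sample text

original_regex = r'[^\s](.|\s)*?\.'
dotall_regex = re.compile(r'\S.*?\.', re.DOTALL)  # assumed equivalent formulation

print([m.group(0) for m in re.finditer(original_regex, sample)])
print([m.group(0) for m in dotall_regex.finditer(sample)])
# both print ['First one.', 'Second\nline.', 'Third.']
```

Either way, every period ends a sentence, so abbreviations such as "e.g." would be split, and trailing text without a final period is not returned.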
