## Evaluating String Comparison Methods

Let's say I have a large list of change ticket titles: 

```
changes = [
    "CHNG559732: Recent deployment on foodatabasenodeserv for manifest ID foodatabasenodeserv-062723230148699022",
    "CHNG494690: Recent deployment on legacycassandradbnodeserv for manifest ID legacycassandradbnodeserv-062723230148700108",
    "CHNG336829: Recent deployment on oldpostgresadapternodeweb for manifest ID oldpostgresadapternodeweb-062723230148703462",
    ...
]
```

Let's also assume that I have alerts that go off. An example alert title may look like:
```
alert = "Increase in FCIs on /api/v1/foo/status over the past 5 minutes"
```

In this example, notice the relationship between "/api/v1/foo/status" in the alert and the application name "foodatabasenodeserv" in the first item of the change ticket list. Here "foo" establishes the relationship between the two. However, the change ticket list, the app names involved, and the content of the alerts will be constantly changing.

How can I compare an alert against a large list of changes to find commonalities and relationships?
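
As a baseline for what "relationship" means here, the "foo" link can be found with a plain substring check before reaching for any ML. The sketch below is only an illustration: `shared_terms` and `min_len` are hypothetical names, and the change titles are shortened to the part before the manifest ID.

```python
import re

def shared_terms(alert, change, min_len=3):
    """Return alert tokens (length >= min_len) that occur anywhere in the change title."""
    # Split on anything that is not a letter or digit, so "/api/v1/foo/status"
    # yields "api", "v1", "foo" and "status".
    tokens = {t for t in re.split(r"[^a-z0-9]+", alert.lower()) if len(t) >= min_len}
    change_lower = change.lower()
    return sorted(t for t in tokens if t in change_lower)

alert = "Increase in FCIs on /api/v1/foo/status over the past 5 minutes"
changes = [
    "CHNG559732: Recent deployment on foodatabasenodeserv",
    "CHNG494690: Recent deployment on legacycassandradbnodeserv",
    "CHNG336829: Recent deployment on oldpostgresadapternodeweb",
]

for change in changes:
    print(change.split(":")[0], shared_terms(alert, change))
```

Short, common tokens such as "the" can match by accident in other data, so a real version would probably need a stop-word list or a longer minimum length.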

Currently OpenAI embeddings are far superior to any self-hosted options.

Below are some quick attempts at using various ML methods to compare an alert against a set of change tickets.

### Installs

In [1]:
!pip install spacy fuzzywuzzy scikit-learn
!pip install python-Levenshtein



### Imports

In [14]:
import json
import Levenshtein
import spacy


from sklearn.feature_extraction.text import TfidfVectorizer

try:
    nlp = spacy.load("en_core_web_lg")
except OSError:
    spacy.cli.download("en_core_web_lg")
    nlp = spacy.load("en_core_web_lg")


### Constants

In [12]:
alert = "Increase in FCIs on /api/v1/azure/status over the past 5 minutes"
alert2 = "Success rate of paypaldatabasenodeserv has dropped below 99.999% over the past 5 minutes"
changes = [
    "CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022",
    "CHNG494690: Recent deployment on legacycassandradbnodeserv for manifest ID legacycassandradbnodeserv-062723230148700108",
    "CHNG336829: Recent deployment on oldpostgresadapternodeweb for manifest ID oldpostgresadapternodeweb-062723230148703462"
]

### A) Compute Similarity Score using Spacy

TBD:
- Why is the similarity score identical for every change, even though the changes are unique?

In [4]:
# Calculate similarity with each change string
alert_doc = nlp(alert)
for change in changes:
    change_doc = nlp(change)
    similarity = alert_doc.similarity(change_doc)
    print(alert_doc, "<->", change_doc)
    print("Similarity:", similarity)
    print()


Increase in FCIs on /api/v1/azure/status over the past 5 minutes <-> CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022
Similarity: 0.5555793900741783

Increase in FCIs on /api/v1/azure/status over the past 5 minutes <-> CHNG494690: Recent deployment on legacycassandradbnodeserv for manifest ID legacycassandradbnodeserv-062723230148700108
Similarity: 0.5555793900741783

Increase in FCIs on /api/v1/azure/status over the past 5 minutes <-> CHNG336829: Recent deployment on oldpostgresadapternodeweb for manifest ID oldpostgresadapternodeweb-062723230148703462
Similarity: 0.5555793900741783
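
One plausible explanation for the identical scores (an assumption, not verified against `en_core_web_lg`): `Doc.similarity` compares averaged token vectors by cosine, and tokens such as the ticket numbers, service names, and manifest IDs are probably out of vocabulary, so they contribute zero vectors. Only the shared words ("Recent deployment on ... for manifest ID") would then shape the document vector. The toy sketch below reproduces that effect with made-up 3-d vectors. `TOY_VECTORS` and its values are hypothetical.

```python
import math

# Hypothetical 3-d word vectors. Tokens missing from this table stand in for
# out-of-vocabulary service names and IDs.
TOY_VECTORS = {
    "recent": (0.2, 0.7, 0.1),
    "deployment": (0.9, 0.1, 0.3),
    "on": (0.1, 0.1, 0.8),
    "increase": (0.6, 0.5, 0.2),
    "status": (0.3, 0.9, 0.4),
}
ZERO = (0.0, 0.0, 0.0)

def doc_vector(tokens):
    """Average the token vectors, treating unknown tokens as zero vectors."""
    vectors = [TOY_VECTORS.get(token, ZERO) for token in tokens]
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

alert_vec = doc_vector(["increase", "on", "status"])
change_a = doc_vector(["recent", "deployment", "on", "azuredatabasenodeserv"])
change_b = doc_vector(["recent", "deployment", "on", "legacycassandradbnodeserv",
                       "legacycassandradbnodeserv-062723230148700108"])

# The unknown tokens only rescale the average, and cosine ignores scale.
print(cosine(alert_vec, change_a))
print(cosine(alert_vec, change_b))
```

In spaCy itself, printing `token.is_oov` and `token.has_vector` for each token in a change title should confirm or rule out this explanation.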



### B) Extract Entity Relationships w/ Spacy & Fuzzywuzzy

We can see here that the extracted values are low quality. This is likely because the items we're searching for are not common English terms, but rather Linux and/or web service names.

TBD:
- Weirdness: Using a larger spacy model produces poorer quality matches

In [5]:
import spacy
from fuzzywuzzy import fuzz

# Load the model once instead of on every call.
# CHANGE ME! (en_core_web_sm, en_core_web_md, en_core_web_lg)
ner_model = spacy.load("en_core_web_lg")

def extract_entities(text):
    doc = ner_model(text)
    entities = [ent.text for ent in doc.ents]
    print(f"Entities extracted from {text}:\n{entities}\n")
    return entities

def find_relationships(alert, change_tickets):
    alert_entities = extract_entities(alert)
    relationships = []

    for ticket in change_tickets:
        ticket_entities = extract_entities(ticket)

        for alert_entity in alert_entities:
            for ticket_entity in ticket_entities:
                similarity = fuzz.ratio(alert_entity.lower(), ticket_entity.lower())
                if similarity >= 10:  # Adjust the similarity threshold as needed
                    relationships.append((alert_entity, ticket_entity, ticket))

    return relationships

def main():
    relationships = find_relationships(alert, changes)

    if relationships:
        print("\nRelationships found:")
        for alert_entity, ticket_entity, ticket in relationships:
            print(f"Entity from Alert: {alert_entity}")
            print(f"Entity from Ticket: {ticket_entity}")
            print(f"Change Ticket: {ticket}")
            print()
    else:
        print("\nNo relationships found.")

if __name__ == "__main__":
    main()


Entities extracted from Increase in FCIs on /api/v1/azure/status over the past 5 minutes:
['/api', 'the past 5 minutes']

Entities extracted from CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022:
['CHNG559732']

Entities extracted from CHNG494690: Recent deployment on legacycassandradbnodeserv for manifest ID legacycassandradbnodeserv-062723230148700108:
[]

Entities extracted from CHNG336829: Recent deployment on oldpostgresadapternodeweb for manifest ID oldpostgresadapternodeweb-062723230148703462:
['CHNG336829', 'oldpostgresadapternodeweb-062723230148703462']


Relationships found:
Entity from Alert: the past 5 minutes
Entity from Ticket: CHNG559732
Change Ticket: CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022

Entity from Alert: the past 5 minutes
Entity from Ticket: CHNG336829
Change Ticket: CHNG336829: Recent deployment on oldpostgresadapternodeweb for m
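
The spurious matches above ("the past 5 minutes" paired with a ticket number) look like a side effect of the very low threshold of 10. The sketch below uses `difflib.SequenceMatcher` from the standard library as a rough stand-in for `fuzz.ratio`. The exact scores may differ from fuzzywuzzy's, and the `ratio` helper is a hypothetical name.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity on a 0-100 scale, standing in for fuzz.ratio (an approximation)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

# Two strings with nothing meaningful in common still share a couple of characters.
unrelated = ratio("the past 5 minutes", "CHNG559732")

# A genuinely related pair: the alert's path segment and the service name.
related = ratio("azure", "azuredatabasenodeserv")

print("unrelated:", unrelated)
print("related:", related)
```

If the unrelated pair clears 10 while the related pair scores clearly higher, raising the threshold (or scoring path segments against service names rather than NER entities) may be worth trying.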

### C) Using Spacy filters to identify tags

Resources:
- https://spacy.io/usage/rule-based-matching#entityruler
- https://stackoverflow.com/questions/57667710/using-regex-for-phrase-pattern-in-entityruler

TBD:
- BUSINESS does not match when paypal is part of service name?
- ENDPOINT also captures the timeframe (last 5 minutes)
- TIMEFRAME is no longer captured after adding patterns (last 5 minutes)

In [15]:
from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {
        "label": "BUSINESS",
        "id": "BUSINESS",
        "pattern": [{
            "LOWER": "paypal"
        }]
    },
    {
        "label": "ENDPOINT", 
        "pattern": [
            {
                "ORTH": "/"
            }, 
            {
                "IS_ASCII": True, 
                "OP": "+"
            }
        ]
    },
    {
        "label": "SERVICE", 
        "id": "SERVICE",
        "pattern": [{
            "LOWER": {"REGEX": r"\w+(serv|nodeserv|nodeweb)"}
        }]
    }
]

ruler.add_patterns(patterns)

doc = nlp(alert) # Swap between alert and alert2 to test
output = [(ent.text, ent.label_) for ent in doc.ents]

print(json.dumps(output, indent=2))

[
  [
    "/v1/azure/status over the past 5 minutes",
    "ENDPOINT"
  ]
]
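
Some likely answers to the TBD items, stated as assumptions rather than verified facts. `EntityRuler` token patterns match whole tokens, so `{"LOWER": "paypal"}` cannot fire inside the single token "paypaldatabasenodeserv". `IS_ASCII` with `OP: "+"` greedily accepts every following ASCII token, which would explain ENDPOINT swallowing the timeframe. A blank `English()` pipeline has no NER component, so the time entity seen with `en_core_web_lg` has nothing to produce it. For comparison, here is a plain `re` sketch (a different technique from the EntityRuler) that works at the character level. `TAG_PATTERNS` and `tag` are hypothetical names.

```python
import re

# Hypothetical label -> regex table, mirroring the three labels used above.
TAG_PATTERNS = {
    "ENDPOINT": re.compile(r"/\S+"),                               # a slash up to the next whitespace
    "SERVICE": re.compile(r"\b\w+(?:serv|nodeserv|nodeweb)\b"),    # words ending in a service suffix
    "BUSINESS": re.compile(r"paypal", re.IGNORECASE),              # matches inside longer words too
}

def tag(text):
    """Return (matched_text, label) pairs for every pattern hit in text."""
    found = []
    for label, pattern in TAG_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group(0), label))
    return found

alert = "Increase in FCIs on /api/v1/azure/status over the past 5 minutes"
alert2 = "Success rate of paypaldatabasenodeserv has dropped below 99.999% over the past 5 minutes"

print(tag(alert))
print(tag(alert2))
```

The trade-off is that character-level regexes ignore tokenization entirely, so they need tighter patterns to avoid partial-word matches in other data.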


### D) TF-IDF

Resources:
- https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/
- https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency

In [27]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create a TF-IDF vectorizer instance
vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the data
tfidf_matrix = vectorizer.fit_transform([alert] + changes)

# Get the TF-IDF vectors
alert_tfidf = tfidf_matrix[0]
change_tfidf = tfidf_matrix[1:]

# Compare TF-IDF vectors
similarities = cosine_similarity(alert_tfidf, change_tfidf)
most_similar_index = np.argmax(similarities)

# Find top 3 similar changes
top_indices = np.argsort(similarities, axis=1)[0, ::-1][:3]
top_changes = [changes[i] for i in top_indices]
top_scores = [similarities[0, i] for i in top_indices]

print("Top 3 Similar Changes:\n")
for change, score in zip(top_changes, top_scores):
    print("Change:", change)
    print("Similarity Score:", score)
    print()



Top 3 Similar Changes:

Change: CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022
Similarity Score: 0.02813756547803127

Change: CHNG336829: Recent deployment on oldpostgresadapternodeweb for manifest ID oldpostgresadapternodeweb-062723230148703462
Similarity Score: 0.028137565478031264

Change: CHNG494690: Recent deployment on legacycassandradbnodeserv for manifest ID legacycassandradbnodeserv-062723230148700108
Similarity Score: 0.028137565478031264
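
The near-identical scores above are probably because the default word tokenizer treats "azure" and "azuredatabasenodeserv" as unrelated terms, which leaves "on" as the only word shared by the alert and every change. A minimal sketch of one alternative, using character trigram counts with cosine similarity (plain counts, not TF-IDF proper, and standard library only). `char_ngrams` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

alert = "Increase in FCIs on /api/v1/azure/status over the past 5 minutes"
changes = [
    "CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022",
    "CHNG494690: Recent deployment on legacycassandradbnodeserv for manifest ID legacycassandradbnodeserv-062723230148700108",
    "CHNG336829: Recent deployment on oldpostgresadapternodeweb for manifest ID oldpostgresadapternodeweb-062723230148703462",
]

alert_grams = char_ngrams(alert)
ranked = sorted(((cosine(alert_grams, char_ngrams(c)), c) for c in changes), reverse=True)

for score, change in ranked:
    print(f"{score:.3f}  {change.split(':')[0]}")
```

scikit-learn's `TfidfVectorizer` accepts `analyzer="char_wb"` together with `ngram_range`, which should give a comparable character-level comparison with IDF weighting on top.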



### E) KeyBERT / BERTopic

Resources:
- https://coder.social/MaartenGr/KeyBERT/issues/60

In [None]:
# To Do

### F) Levenshtein Distance

Determine how different two strings are by counting the minimum number of changes needed to turn one string into the other. These changes can be adding, removing, or replacing individual characters. Lower score == better match. 
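
To make the definition concrete, here is a small pure-Python version of the standard dynamic-programming approach. It is a sketch for illustration only. The cells below use the `Levenshtein` package, which should return the same distances much faster.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))               # 3
print(levenshtein("azure", "azuredatabasenodeserv"))  # 16
```

The second example hints at a limitation for this use case: the raw distance grows with the length difference, so comparing a whole alert against a whole change title may mostly measure how long the title is.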

Resources:
- https://www.statology.org/levenshtein-distance-in-python/

In [22]:
min_distance = float('inf')  # Initialize with a large value
min_change = None

for change in changes:
    distance = Levenshtein.distance(alert, change)
    print(f"Ticket: {change}, Distance: {distance}")
    if distance < min_distance:
        min_distance = distance
        min_change = change

if min_change is not None:
    print("\nChange with Minimum Distance:")
    print(min_change)
    print("Distance:", min_distance)
else:
    print("\nNo changes found.")


Ticket: CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022, Distance: 88
Ticket: CHNG494690: Recent deployment on legacycassandradbnodeserv for manifest ID legacycassandradbnodeserv-062723230148700108, Distance: 98
Ticket: CHNG336829: Recent deployment on oldpostgresadapternodeweb for manifest ID oldpostgresadapternodeweb-062723230148703462, Distance: 98

Change with Minimum Distance:
CHNG559732: Recent deployment on azuredatabasenodeserv for manifest ID azuredatabasenodeserv-062723230148699022
Distance: 88
