Agnieszka Dutka
# Laboratory 4 - Edit distance and longest common subsequence

Contents:  
[1. Edit distance](#1)   
[2. Edits visualization](#2)  
[3. (2) example usage](#3)   
[4. LCS](#4)  
[5. Tokenize](#5)  
[6. Remove 3%](#6)  
[7. LCS of tokens](#7)  
[8. Diff algorithm](#8)  
[9. Diff usage](#9)  


In [2]:
import numpy as np
from bisect import bisect
from unidecode import unidecode

<a id='1'></a>
### Edit distance (ex. 1)

**_space complexity O(m x n)_**

In [3]:
def edit_distance(x, y, delta, whole_array=False):
    edit_table=np.zeros((len(x)+1,len(y)+1))
    for i in range(len(x)+1):
        edit_table[i,0]=i
    for j in range(len(y)+1):
        edit_table[0,j]=j
    for i in range(len(x)):
        k = i+1
        for j in range(len(y)):
            l = j+1
            edit_table[k,l]= min(
                edit_table[k-1,l]+1, edit_table[k,l-1]+1, edit_table[k-1,l-1]+delta(x[i],y[j]))
    if whole_array:
        return edit_table
    return edit_table[len(x),len(y)]

**_space complexity O(min{m, n})_**

In [4]:
def edit_distance2(x, y, delta):
    if len(x) > len(y):
        x, y = y, x
    edit_row = [i for i in range(len(x)+1)]
    for i in range(1, len(y)+1):
        new_row = [0]*(len(edit_row))
        new_row[0]=i
        for j in range(1, len(edit_row)):
            new_row[j]= min(
                new_row[j-1]+1, edit_row[j]+1, edit_row[j-1]+delta(x[j-1],y[i-1]))
        edit_row = new_row
    return edit_row[-1]

#### Delta functions

In [5]:
def delta1(a,b):  # classic
    if(a==b):
        return 0
    return 1

def delta2(a,b):  # no swap 
    if(a==b):
        return 0
    return 2

def delta3(a,b):  # with unidecode opt
    if(a==b):
        return 0
    elif unidecode(a)==unidecode(b):
        return 0.5
    return 1


#### Examples

In [18]:
# edit table
print(edit_distance("kot", "plotek", delta1, whole_array=True))

[[0. 1. 2. 3. 4. 5. 6.]
 [1. 1. 2. 3. 4. 5. 5.]
 [2. 2. 2. 2. 3. 4. 5.]
 [3. 3. 3. 3. 2. 3. 4.]]


In [24]:
# comparing both algorithms
assert edit_distance("Łódź", "Lodz", delta3) == 1.5 
assert edit_distance2("Łódź", "Lodz", delta3) == 1.5

assert edit_distance("żyrafa", "zwierzęta", delta3) == 6.5
assert edit_distance2("żyrafa", "zwierzęta", delta3) == 6.5
print("assertions passed")

assertions passed


<a id='2'></a>
### Path finding and visualization (ex. 2)

In [25]:
def get_path(x, y, delta): 
    """edit distance algorithm (analougus to edit_distance2) returning min_distance and path"""
    x, y = y, x # x, y swapped xD
    edit_row = [(i, 'i'*i) for i in range(len(x)+1)]
    for i in range(1, len(y)+1):
        new_row = [0]*(len(edit_row))
        new_row[0]=(i, 'd'*i)
        for j in range(1, len(edit_row)):
            min_tuple = min(
                (new_row[j-1][0]+1, new_row[j-1][1]+'i'), 
                (edit_row[j][0]+1, edit_row[j][1]+'d'),
                (edit_row[j-1][0]+delta(x[j-1],y[i-1]), edit_row[j-1][1]+'s'))
            new_row[j]= min_tuple
        edit_row = new_row
    return edit_row[-1]

def edit_distance_vis(a: str, b: str):
    res, path = get_path(a, b, delta3)
    ai, bi = 0, 0
    a = list(a)
    print(''.join(a), "<- start")
    for move in path:
        if move == 's':
            if a[ai]!= b[bi]: # swap
                a[ai] = b[bi]
                print(''.join(a[:ai]+['*', a[ai], '*']+a[ai+1:]), "\t[swap]")
            ai, bi = ai+1, bi+1
        elif move == 'i': # insertion
            a.insert(ai, b[bi])
            print(''.join(a[:ai]+['*', a[ai], '*']+a[ai+1:]), "\t[ins]")
            ai, bi = ai+1, bi+1
        elif move == 'd': #deletion
            print(''.join(a[:ai]+['*','*']+a[ai+1:]), "\t[del]")
            del a[ai]


<a id='3'></a>
### Visualization for given strings (ex. 3)

In [26]:
edit_distance_vis("los", "kloc")

los <- start
*k*los 	[ins]
klo*c* 	[swap]


In [9]:
edit_distance_vis("Łódź", "Lodz")

Łódź <- start
*L*ódź 	[swap]
L*o*dź 	[swap]
Lod*z* 	[swap]


In [10]:
edit_distance_vis("kwintesencja", "quintessence")

kwintesencja <- start
*q*wintesencja 	[swap]
q*u*intesencja 	[swap]
quinte*s*sencja 	[ins]
quintessenc**a 	[del]
quintessenc*e* 	[swap]


In [11]:
edit_distance_vis("ATGAATCTTACCGCCTCG", "ATGAGGCTCTGGCCCCTG")

ATGAATCTTACCGCCTCG <- start
ATGA*G*ATCTTACCGCCTCG 	[ins]
ATGAG*G*ATCTTACCGCCTCG 	[ins]
ATGAGG*C*TCTTACCGCCTCG 	[swap]
ATGAGGCTCT*G*ACCGCCTCG 	[swap]
ATGAGGCTCTG*G*CCGCCTCG 	[swap]
ATGAGGCTCTGGCC**CCTCG 	[del]
ATGAGGCTCTGGCCCCT**G 	[del]


<a id='4'></a>
### Longest common subsequence (ex. 4)

In [12]:
def lcs1(x, y):
    return(len(x)+len(y)-edit_distance(x,y,delta2))/2

def lcs2(x: list,y: list):  # faster, works on any kind of lists
    ranges=[]
    ranges.append(len(y))# I_0 = [0..n]
    for i in range(len(x)):
        positions=[j for j,l in enumerate(y) if l==x[i]]
        positions.reverse()
        for p in positions:
            k = bisect(ranges,p)
            if(k == bisect(ranges,p-1)):
                if(k<len(ranges)-1):
                    ranges[k]=p
                else:
                    ranges[k:k]=[p]
    return len(ranges)-1


#### Examples

In [13]:
lcs1('cbabac','abcabba')

4.0

In [96]:
assert lcs1('bs','sb') == 1
assert lcs2(list('bs'),list('sb')) == 1 

assert lcs1('cccccbbbbbccccccbbabababababa','ccccbcbbbbbcccfcccbbababababgaba') == 29
assert lcs2(list('cccccbbbbbccccccbbabababababa'),list('ccccbcbbbbbcccfcccbbababababgaba')) == 29
print("all assertions passed")

all assertions passed


<a id='5'></a>
### Tokenize Romeo & Juliet (ex. 5)

In [94]:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.pl import Polish
from typing import List
from time import perf_counter

In [35]:
nlp = spacy.blank("pl")
tokenizer = nlp.Defaults.create_tokenizer(nlp)

f = open("romeo-i-julia.txt", encoding='utf-8')
rnj_text = ''.join(f.readlines())

rnj_tok = tokenizer(rnj_text) # tokenized romeo&juliet text

<a id='6'></a>
### Create 2 files with random tokens removed (ex.6)

In [61]:
from random import random

def delete_random(tokens, part: float):
    res = []
    for t in tokens:
        if t.text=="\n" or t.is_punct or random() >= part:
            res.append(t)
    return res

def save(tokens, to_file:str):
    with open(to_file, 'w', encoding='utf-8') as f:
        for token in tokens:
            f.write(token.text_with_ws)
        f.close

In [62]:
rnj_tok1 = delete_random(rnj_tok, 0.03)  # with deleted random 3%
rnj_tok2 = delete_random(rnj_tok, 0.03)  # with deleted random 3%
print("original tokens:",len(rnj_tok))
print("tokens in rnj_tok1:",len(rnj_tok1))
print("tokens in rnj_tok2:",len(rnj_tok2))
save(rnj_tok1, "romeo-i-julia1.txt")
save(rnj_tok2, "romeo-i-julia2.txt")


original tokens: 32009
tokens in rnj_tok1: 31349
tokens in rnj_tok2: 31326


<a id='7'></a>
### Longest common subsequence of tokens (ex. 7)

In [95]:
""" using lcs2 implementation (lcs1 exceeds available memory stack)"""
st = perf_counter()
lcs = lcs2(rnj_tok1, rnj_tok2)
st1 = perf_counter()
print("lcs:", lcs)
print("elapsed time:", st1-st)

lcs: 30689
elapsed time: 153.13491290000002


<a id='8'></a>
### Diff algorithm (ex.8)


In [90]:
def lcs_matrix(a, b): 
    """ lcs algorithm using dynamic programming, to recreate differences in a and b.
    Works on lists of any kind of data with '==' defined (chars, tokens, classes) """
    mx = [[0 for _ in range(len(b) + 1)] for _ in range(len(a) + 1)]
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            mx[i][j] = 1 + mx[i-1][j-1] if ai == bj else  max(mx[i][j-1], mx[i-1][j])
    return mx

def print_diff(mx, a, b):
    lines = []
    i, j = len(a)-1, len(b)-1
    while i >= 0 and j >= 0:
        if i < 0:
            lines.append(f">>> [{j}] {b[j]}")
            j -= 1
        elif j < 0:
            lines.append(f"<< [{i}] {a[i]}")
            i -= 1
        elif a[i] == b[j]:
            i, j = i-1, j-1
        elif mx[i][j-1] >= mx[i-1][j]:
            lines.append(f">>> [{j}] {b[j]}")
            j -= 1
        elif mx[i][j-1] < mx[i-1][j]:
            lines.append(f"<< [{i}] {a[i]}")
            i -= 1
    lines.reverse()
    for line in lines:
        print(line)
        
def diff(a, b):
    mx = lcs_matrix(a, b)
    return print_diff(mx, a, b)


In [67]:
""" compare 2 files with given paths """
def files_diff(a: str, b:str):
    with open(a, encoding='utf-8') as f1, open(b, encoding='utf-8') as f2:
        diff(f1.readlines(), f2.readlines())

#### Example usage

In [86]:
""" test on first 100 lines """
with open("romeo-i-julia1.txt", encoding='utf-8') as f1, open("romeo-i-julia2.txt", encoding='utf-8') as f2:
        diff(f1.readlines()[:20], f2.readlines()[:20])

>>> [0] William Shakespeare

<< [3] tłum. Józef Paszkowski

>>> [3] Józef Paszkowski

<< [10]  * ESKALUS — książę panujący w Weronie

>>> [10]  * ESKALUS — książę w Weronie

<< [12]  * MONTEKI, KAPULET — naczelnicy dwóch nieprzyjaznych sobie

>>> [12]  * MONTEKI, KAPULET — naczelnicy dwóch domów nieprzyjaznych sobie

<< [16]  * BENWOLIO — synowiec Montekiego

>>> [16]  * — synowiec Montekiego



<a id='9'></a>
### Comparing created files (ex.9)

In [91]:
""" compare whole files (takes ~20 sek)"""

st = perf_counter()
files_diff("romeo-i-julia1.txt", "romeo-i-julia2.txt")
print("elapsed time:", perf_counter()-st)


>>> [0] William Shakespeare

<< [3] tłum. Józef Paszkowski

>>> [3] Józef Paszkowski

<< [10]  * ESKALUS — książę panujący w Weronie

>>> [10]  * ESKALUS — książę w Weronie

<< [12]  * MONTEKI, KAPULET — naczelnicy dwóch nieprzyjaznych sobie

>>> [12]  * MONTEKI, KAPULET — naczelnicy dwóch domów nieprzyjaznych sobie

<< [16]  * BENWOLIO — synowiec Montekiego

>>> [16]  * — synowiec Montekiego

<< [25]  * PAŹ PARYSA

>>> [25]  * PARYSA

<< [30]  * — córka Kapuletów

>>> [30]  * JULIA — córka Kapuletów

<< [37] Rzecz odbywa się przez większą część sztuki w Weronie, przez część piątego aktu w Mantui.

>>> [37] Rzecz odbywa się przez większą część sztuki Weronie, przez część piątego aktu w Mantui.

<< [45] Dwa rody, zacne i sławne —

>>> [45] Dwa rody, zacne jednako i sławne —

<< [50] tych dwu wrogów wzięło bowiem życie,

>>> [50] Z łon tych dwu wrogów wzięło bowiem życie,

<< [60] otoczcie cierpliwymi względy,

<< [61] Jest w nim co złego, my usuniem błędy…

>>> [60] Które otoczcie cierp

>>> [6163] Leżała tu ówdzie zbieranina

<< [6172] Donośnym głosem?

>>> [6185] głosem?

<< [6177]                         Zbliż się tu, człowieku,

<< [6178] Widzę, że jesteś w niezamożnym stanie;

>>> [6190]                         się tu, człowieku,

>>> [6191] Widzę, że jesteś w stanie;

<< [6180] Drachmę trucizny takiej, co by mogła

<< [6181] Po wszystkich żyłach rozejść się od 

>>> [6193] Drachmę trucizny takiej, co mogła

>>> [6194] Po wszystkich żyłach rozejść się od razu

<< [6192] Ale w Mantui prawo śmiercią 

<< [6193] Każdego, co się waży go udzielić.

<< [6194] 

<< [6195] 

<< [6196] ROMEO

>>> [6205] Ale w Mantui prawo śmiercią karze

>>> [6206] Każdego, co się waży go udzielić.ROMEO

<< [6202] Świat ci nie sprzyja ani świata,

<< [6203] Bo świat dajeć prawa być bogatym;

>>> [6212] Świat ci nie sprzyja ani prawo świata,

>>> [6213] Bo świat nie dajeć prawa być bogatym;

<< [6205] 

>>> [6217] APTEKARZ

<< [6214] Ubóstwo twoje też, nie chęć opłacam.

>>> [6224] Ubóstwo 

elapsed time: 21.390346099999988
