Title: Fuzzy sequence matching with cross-correlation

## Cross-correlation

I am going to generate a "random file" (an array of values between -1 and 1) and then pick a "random slice" (a contiguous subarray of the "random file") at a random position. I will then randomly perturb the values in the subarray and see whether I can still find where in the "random file" the "random slice" came from.

In [2]:
import random
import numpy as np

FILE_SIZE = 100000
SLICE_SIZE = 100

random_file = np.asarray([random.uniform(-1,1) for i in range(FILE_SIZE)])
random_slice_index = random.randint(0, FILE_SIZE-SLICE_SIZE)
random_slice = random_file[random_slice_index:random_slice_index+SLICE_SIZE]

# Add uniform noise in [-0.5, 0.5] to every value, i.e. up to 50% of the maximum amplitude
random_slice_mutated = np.asarray([i + random.uniform(-0.5,0.5) for i in random_slice])

Now we will try to find where in `random_file` the `random_slice_mutated` came from. The answer should be `random_slice_index`.

### First, let's write our own cross-correlation function to find it.

In [3]:
# Coding our own cross-correlation algorithm

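def find_sample_index(container, sample):
    """Return (index, dot product) of the window of `container` that best matches `sample`."""
    best_match_index = 0
    # Start at -inf so a best match with a negative dot product is still recorded
    largest_dot_product = -np.inf
    sample_size = len(sample)
    # +1 so the final window, which ends at the last element, is also tested
    for index in range(len(container) - sample_size + 1):
        dot = np.dot(container[index:index+sample_size], sample)
        if dot > largest_dot_product:
            best_match_index = index
            largest_dot_product = dot
    return (best_match_index, largest_dot_product)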
    

In [4]:
find_sample_index(random_file, random_slice_mutated)

(14108, 33.238798044766)

In [6]:
print("Actual index (should match):", random_slice_index)

Actual index (should match): 14108
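
The explicit Python loop can be slow for large arrays. As a minimal sketch, assuming NumPy 1.20 or later (for `sliding_window_view`), the same search can be vectorised. The function name `find_sample_index_vectorized` and the `demo_` arrays are illustrative and not part of the original notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def find_sample_index_vectorized(container, sample):
    """Hypothetical vectorised variant of find_sample_index."""
    # Each row is one window of the container; this is a strided view,
    # so building it copies no data
    windows = sliding_window_view(container, len(sample))
    # One dot product per window, computed as a single matrix-vector product
    dots = windows @ sample
    best = int(np.argmax(dots))
    return best, float(dots[best])

# Small deterministic check: the sample [3, 4, 5] sits at index 2
demo_container = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 0.0, -1.0])
demo_sample = np.array([3.0, 4.0, 5.0])
demo_result = find_sample_index_vectorized(demo_container, demo_sample)
print(demo_result)
```

Like the loop version, this compares raw dot products, so it returns the first window with the largest dot product.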


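### Now, let's use the `scipy.signal.correlate()` function instead of our own.

Note that the SciPy implementation returns a vector of all the correlation values, not just the largest one, so to find the best-match index we need the position of the largest value. In the default `full` mode the output also includes the partially overlapping positions at each end, so that position is offset by `SLICE_SIZE-1` from the index into `random_file`.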

In [7]:
import scipy.signal
result = scipy.signal.correlate(random_file, random_slice_mutated)
max_correlation_index = np.argmax(result) - (SLICE_SIZE-1)
max_correlation_index

14108
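
The `SLICE_SIZE-1` offset comes from the default `full` mode, which also reports shifts where the two arrays only partially overlap. As a sketch, `mode="valid"` keeps only the fully overlapping shifts, so the `argmax` is the index directly. This example swaps in `np.correlate`, which accepts the same `mode` names as `scipy.signal.correlate` for 1-D inputs, and runs on a small hand-made array. The `demo_` names are illustrative.

```python
import numpy as np

demo_file = np.array([0.1, -0.4, 0.9, -0.7, 0.3, 0.2, -0.1])
demo_slice = demo_file[2:5]  # [0.9, -0.7, 0.3], taken from index 2

# "full" mode: one value per shift, including partial overlaps at both ends
full = np.correlate(demo_file, demo_slice, mode="full")
index_from_full = int(np.argmax(full)) - (len(demo_slice) - 1)

# "valid" mode: only shifts where the slice lies entirely inside the file
valid = np.correlate(demo_file, demo_slice, mode="valid")
index_from_valid = int(np.argmax(valid))

print(len(full), len(valid))
print(index_from_full, index_from_valid)
```

Both approaches recover the same index. The `valid` output has `len(file) - len(slice) + 1` entries, against `len(file) + len(slice) - 1` for `full`.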

### Cross-correlation algorithm

In [8]:
a = np.array([1,2,3,4,5])
b = np.array([1,2,3])
scipy.signal.correlate(a,b)

array([ 3,  8, 14, 20, 26, 14,  5])

For vectors:

A=
<table>
<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>
</table>

B=
<table>
<tr><td>1</td><td>2</td><td>3</td></tr>
</table>

The cross-correlation algorithm zero-pads each vector as follows (blank cells represent 0):

#### Step 1
<table>
<tr><td></td><td></td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>
<tr><td>1</td><td>2</td><td>3</td><td></td><td></td><td></td><td></td></tr>
</table>

    0*1 + 0*2 + 1*3 + 2*0 + 3*0 + 4*0 + 5*0 = 3

#### Step 2
<table>
<tr><td></td><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>
<tr><td>1</td><td>2</td><td>3</td><td></td><td></td><td></td></tr>
</table>

    0*1 + 1*2 + 2*3 + 3*0 + 4*0 + 5*0 = 8

#### Step 3
<table>
<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td></tr>
<tr><td>1</td><td>2</td><td>3</td><td></td><td></td></tr>
</table>

    1*1 + 2*2 + 3*3 + 4*0 + 5*0 = 14

#### ...

#### Last step
<table>
<tr><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td></td><td></td></tr>
<tr><td></td><td></td><td></td><td></td><td>1</td><td>2</td><td>3</td></tr>
</table>

    1*0 + 2*0 + 3*0 + 4*0 + 5*1 + 0*2 + 0*3 = 5

So each step consists of applying the appropriate zero-padding, taking the dot product of the two vectors, and then shifting B one position to the right relative to A.
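
The stepping procedure above can be sketched directly in code. This is an illustrative re-implementation for 1-D inputs (the name `correlate_full` is hypothetical), not a description of how SciPy computes the result internally.

```python
import numpy as np

def correlate_full(a, b):
    """Illustrative 'full' cross-correlation of two 1-D arrays by explicit zero-padding."""
    a = np.asarray(a)
    b = np.asarray(b)
    pad = len(b) - 1
    # Pad A with len(b)-1 zeros on both sides so B can hang off either end
    zeros = np.zeros(pad, dtype=a.dtype)
    padded = np.concatenate([zeros, a, zeros])
    # Slide B across the padded A one position at a time,
    # taking a dot product at each step
    steps = len(a) + len(b) - 1
    return np.array([np.dot(padded[k:k + len(b)], b) for k in range(steps)])

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3])
steps_result = correlate_full(a, b)
print(steps_result)
```

For the vectors A and B used here, this reproduces the seven values returned by `scipy.signal.correlate(a, b)`.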