## Q.1: Find a max-spacing k-clustering of a set of points using single-link clustering

In [1]:
## Get some points and their distances into the notebook
with open('Downloads/algo2clustering1.txt') as f:
    distances = f.readlines()

In [2]:
distances[0]  ## number of points

'500\n'

In [3]:
distances[1] ## distance from point 1 to point 2

'1 2 6808\n'

In [4]:
distances[500*499//2]   ## should be final 2 points

'499 500 8273\n'

In [5]:
distances = [tuple(int(x) for x in line.split()) for line in distances[1:]]  ## (p1, p2, dist) triples; skip the count line

In [6]:
len(distances)

124750

In [7]:
500*499/2

124750.0

In [8]:
(499, 500, 8273) in distances

True

### Try union-find with path compression, for practice, although it is overkill for 500 points

In [9]:
## Find operation, with path compression
def find(item, leaderlist):
    if leaderlist[item] != item:
        leaderlist[item] = find(leaderlist[item], leaderlist)
    return leaderlist[item]

In [10]:
arr = [1,2,3,4,4]
find(0,arr)

4

In [11]:
arr

[4, 4, 4, 4, 4]

In [12]:
## Union operation (union by rank); l1 and l2 must be leaders (roots) returned by find
def union(l1, l2, ranklist, leaderlist):
    if ranklist[l1] < ranklist[l2]:
        leaderlist[l1] = l2       # attach the shallower tree under the deeper one
    elif ranklist[l1] == ranklist[l2]:
        leaderlist[l1] = l2
        ranklist[l2] += 1         # equal ranks: the surviving root's rank grows by one
    else:
        leaderlist[l2] = l1

In [13]:
leaders = [0,2,3,4,4]
ranks = [0,0,1,2,3] 
leader1 = find(0, leaders)
leader2 = find(1, leaders)
union(leader1, leader2, ranks, leaders)


In [14]:
leaders

[4, 4, 4, 4, 4]

In [15]:
ranks

[0, 0, 1, 2, 3]

In [16]:
leaders = [1,2,2,4,5,5]
ranks = [0,1,2,0,1,2]

In [17]:
union(find(1, leaders), find(4, leaders), ranks, leaders)

In [18]:
leaders

[1, 2, 5, 4, 5, 5]

In [19]:
ranks

[0, 1, 2, 0, 1, 3]
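
The two functions above pass the leader and rank lists around explicitly. As a rough sketch of the same idea bundled into one object, the class below keeps both lists as attributes and tracks the number of clusters. The name `UnionFind` and its `count` attribute are assumptions made for this illustration and are not used elsewhere in the notebook; `find` here is iterative rather than recursive.

```python
class UnionFind:
    """Disjoint sets over items 0..n-1 (hypothetical helper, not the notebook's code)."""

    def __init__(self, n):
        self.leader = list(range(n))  # every item starts as its own leader
        self.rank = [0] * n
        self.count = n                # current number of clusters

    def find(self, item):
        # First pass: walk up to the root.
        root = item
        while self.leader[root] != root:
            root = self.leader[root]
        # Second pass: path compression, point every visited node at the root.
        while self.leader[item] != root:
            nxt = self.leader[item]
            self.leader[item] = root
            item = nxt
        return root

    def union(self, a, b):
        """Merge the clusters of a and b; return True if a merge happened."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra           # make ra the root of higher (or equal) rank
        self.leader[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        self.count -= 1
        return True


uf = UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
print(uf.count)  # 3
```

Unlike the notebook's `union`, this version accepts any two items and calls `find` itself, which removes the requirement to pass in leaders.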

In [20]:
## Now back to Q.1:
dists = sorted(distances, key=lambda x: x[2])

In [21]:
dists[:3]

[(1, 348, 1), (12, 373, 1), (27, 487, 1)]

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number k of clusters is set to 4. What is the maximum spacing of a 4-clustering?

In [22]:
leaders = list(range(501))  # 500 points, but indexed from 1
ranks = [0 for _ in range(501)]

clusters = 500

In [23]:
i = 0  # index into the sorted distances
while clusters > 4:  # k was set to 4 in this question
    pair = dists[i]
    lead1 = find(pair[0], leaders)
    lead2 = find(pair[1], leaders)
    if lead1 != lead2:
        union(lead1, lead2, ranks, leaders)
        clusters -= 1
    i += 1
    

In [24]:
clusters

4

In [25]:
i

1218

In [26]:
# Q1 answer: the next edge that still joins two different clusters is the max spacing
while i < len(dists):  # bounded by the edge list, so the loop cannot run forever
    pair = dists[i]
    lead1 = find(pair[0], leaders)
    lead2 = find(pair[1], leaders)
    if lead1 != lead2:
        print(pair[2])  # max-spacing
        break
    i += 1

106


In [28]:
i  ## seeing how far we got before points were in different clusters

1307
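
The Q.1 steps (sort the edges, merge until k clusters remain, then report the next edge that crosses clusters) can be sketched as a single function. This is a minimal sketch under assumptions: the name `max_spacing` and the toy edge list are made up for illustration, and the inner `find` uses path halving without union by rank, which is simpler than the notebook's version but adequate for small inputs.

```python
def max_spacing(edges, n, k):
    """Spacing of the single-link k-clustering.

    edges: iterable of (p1, p2, dist) with points labelled 1..n.
    Returns None if no edge separates two clusters (e.g. k == 1).
    """
    leader = list(range(n + 1))  # index 0 is unused

    def find(x):
        while leader[x] != x:
            leader[x] = leader[leader[x]]  # path halving
            x = leader[x]
        return x

    clusters = n
    for p1, p2, d in sorted(edges, key=lambda e: e[2]):
        r1, r2 = find(p1), find(p2)
        if r1 == r2:
            continue          # already in the same cluster
        if clusters == k:
            return d          # cheapest edge between two different clusters
        leader[r1] = r2
        clusters -= 1
    return None


# Toy data: four points on a line at positions 0, 1, 5, 6.
toy_edges = [(1, 2, 1), (1, 3, 5), (1, 4, 6), (2, 3, 4), (2, 4, 5), (3, 4, 1)]
print(max_spacing(toy_edges, 4, 2))  # 4
```

With k = 2 the clusters are {1, 2} and {3, 4}, and the closest pair across them is points 2 and 3 at distance 4.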

## Q.2 involves a much bigger set of 'points'

### Easier just to copy and paste in the problem:

For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.
The distance between two nodes u and v in this problem is defined as the Hamming distance (the number of differing bits) between the two nodes' labels. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?
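
As a quick check of the worked example in the problem statement, the snippet below compares the two quoted labels position by position. The variable names are chosen here for illustration only.

```python
# The two 24-bit labels quoted in the problem statement.
a = "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1".split()
b = "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1".split()

# 1-indexed positions where the labels disagree.
diff_positions = [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]

print(diff_positions)       # [3, 7, 21]
print(len(diff_positions))  # 3, the Hamming distance
```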

In [32]:
# Get the data into the notebook, if it fits
with open('Downloads/algo2clustering_big.txt') as f:
    numPts = f.readline()
    print(numPts)
    pts = f.readlines()

200000 24



In [33]:
len(pts)

200000

In [34]:
pts[0]

'1 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 \n'

In [39]:
pts = [pt.strip(' \n').replace(' ','') for pt in pts]

In [40]:
pts[0]

'111000001101001111001111'

In [62]:
bin(int(pts[0], 2) ^ int(pts[1], 2)).count('1')

9

In [61]:
pts[1]

'011001100101111110101101'

In [63]:
def ham(pt1, pt2):
    return bin(int(pt1, 2) ^ int(pt2, 2)).count('1')

### Computing the Hamming distance for every pair of points would work in principle, but with about 20 billion pairs (200,000 choose 2) it would take too long. Since we only need to merge pairs of points at Hamming distance 2 or less, it should be faster to XOR each point with the 300 candidate masks (24 choose 1 plus 24 choose 2) and check whether the result exists in the point set. That is about 60M XORs and 60M lookups.
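
The counts in this estimate can be reproduced with `math.comb` (available from Python 3.8). The variable names below are illustrative.

```python
from math import comb

n_points = 200_000
n_bits = 24

all_pairs = comb(n_points, 2)                  # every unordered pair of points
masks = comb(n_bits, 1) + comb(n_bits, 2)      # XOR masks with one or two bits set
candidate_checks = n_points * masks            # one XOR and one lookup per (point, mask)

print(all_pairs)         # 19999900000
print(masks)             # 300
print(candidate_checks)  # 60000000
```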

In [72]:
bin(int(pts[0], 2) ^ int(pts[1], 2))[2:]

'100001101000110001100010'

In [73]:
## THIS FAILS WHEN THE XOR IS A LOW NUMBER, BECAUSE THE ZEROS DON'T PAD LEFT
#def xor(pt1, pt2):
#    return bin(int(pt1, 2) ^ int(pt2, 2))[2:]
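
One possible way to avoid the padding problem, if a fixed-width bit string is wanted, is to format the XOR with a zero-padded binary format spec. The helper name `xor_bits` and its `width` parameter are assumptions for this sketch; the notebook itself goes on to work with integers instead.

```python
def xor_bits(pt1, pt2, width=24):
    """XOR two bit strings and return the result zero-padded to `width` bits."""
    return format(int(pt1, 2) ^ int(pt2, 2), f"0{width}b")


a = '000000000000000000000101'
b = '000000000000000000000100'

print(bin(int(a, 2) ^ int(b, 2))[2:])  # 1  (leading zeros lost)
print(xor_bits(a, b))                  # 000000000000000000000001
```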

In [141]:
## Build the set of xor's for clustermates
same = '0'* 24
xors = set()
for bit1 in range(23):   
    for bit2 in range(bit1+1, 24):
        xors.add(same[:bit1] + '1' + same[bit1+1:])
        xors.add(same[:bit1] + '1' + same[bit1+1:bit2] + '1' + same[bit2+1:])
xors.add(same[:-1] + '1')  ## add the final single-bit mask manually, since bit1 stops at 22
## convert to ints
xors = {int(x, 2) for x in xors}

In [142]:
len(xors)

300

In [144]:
10 in xors   # '000...1010'

True
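
An alternative sketch builds the same masks directly as integers with bit shifts and `itertools.combinations`, which avoids string slicing and the manual last-bit case. The names `NBITS`, `one_bit`, `two_bit`, and `masks` are illustrative and not part of the notebook.

```python
from itertools import combinations

NBITS = 24

# Masks with exactly one bit set (Hamming distance 1).
one_bit = [1 << b for b in range(NBITS)]

# Masks with exactly two bits set (Hamming distance 2).
two_bit = [(1 << b1) | (1 << b2) for b1, b2 in combinations(range(NBITS), 2)]

masks = set(one_bit) | set(two_bit)

print(len(masks))   # 300
print(10 in masks)  # True, since 10 == 0b1010 has two bits set
```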

In [145]:
pointSet = set(pts)
pointSet = {int(p,2) for p in pointSet}
len(pointSet)

198788

Deleting duplicates effectively merged all identical points (Hamming distance 0), so we start with 198,788 singleton clusters.

In [146]:
pts = list(pointSet)  ## need to order it for clustering with union-find

In [147]:
numClusters = len(pts)
leaders = list(range(numClusters))
ranks = [0 for _ in range(numClusters)]
locs = {p: idx for idx, p in enumerate(pts)}  ## lookup of each point's index in the ordered list

In [148]:
for i in range(numClusters):     ##THIS TOOK ABOUT 24 SECS
    for x in xors:
        p = pts[i] ^ x
        if p in pointSet:
            j = locs[p]  ## where is matching point in pts
            lead1 = find(i, leaders)  ## this is the path compression find
            lead2 = find(j, leaders)
            if lead1 != lead2:
                union(lead1, lead2, ranks, leaders)
                numClusters -= 1
    pointSet.remove(pts[i])  ## Roughly halves the work: a processed point is never matched again with the operands reversed

In [149]:
numClusters  ## hopefully the assignment answer

6118

In [156]:
test=['11011','10011','10000','00110','00001','01010','00110','01000','10001']
testset=set(test)
print(testset)
testset={int(x, 2) for x in testset}
test = list(testset)
print(test)
n = len(test)
leads = list(range(n))
rnks = [0 for _ in range(n)]
locas = {p: idx for idx, p in enumerate(test)}
xs = {'10000', '01000', '00100', '00010', '00001'}  # single-bit masks only: this test merges at Hamming distance 1
xs = {int(x, 2) for x in xs}

for i in range(n):
    for x in xs:
        p = test[i] ^ x
        if p in testset:
            j = locas[p]
            lead1 = find(i, leads)
            lead2 = find(j, leads)
            if lead1 != lead2:
                print(p, x)
                union(lead1, lead2, rnks, leads)
                n -= 1
    testset.remove(test[i])

{'10001', '10000', '01000', '00110', '10011', '00001', '01010', '11011'}
[1, 6, 8, 10, 16, 17, 19, 27]
17 16
10 2
17 1
19 2
27 8


In [157]:
n

3

In [158]:
test

[1, 6, 8, 10, 16, 17, 19, 27]

In [159]:
leads

[5, 1, 3, 3, 5, 5, 5, 5]

In [160]:
rnks

[0, 0, 0, 1, 0, 1, 0, 0]

In [161]:
testset

set()
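
As an independent cross-check of the small test, the cluster count can also be obtained by brute force: compare points pairwise and flood-fill the connected components at Hamming distance at most 1. This is only a sketch for tiny inputs (it is quadratic in the number of points); the names `hamming`, `unvisited`, and `components` are chosen for illustration.

```python
test = ['11011', '10011', '10000', '00110', '00001', '01010', '00110', '01000', '10001']
values = {int(x, 2) for x in test}  # deduplicate, as in the notebook


def hamming(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count('1')


unvisited = set(values)
components = []
while unvisited:
    start = unvisited.pop()
    comp = {start}
    stack = [start]
    while stack:
        cur = stack.pop()
        # Unvisited points within Hamming distance 1 of the current point.
        near = {v for v in unvisited if hamming(cur, v) <= 1}
        unvisited -= near
        comp |= near
        stack.extend(near)
    components.append(sorted(comp))

print(len(components))  # 3
```

The three components are {1, 16, 17, 19, 27}, {8, 10}, and {6}, which agrees with the final value of `n` from the union-find run.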