# Minimum k-cut Algorithm

## High-level Algorithm Specification

1. Create a vertex dictionary (each vertex mapped to its adjacency list) and an edge list, e.g.:

    ```python
    vertices = {1: [2,4,5], 2: [1,3,4,5], 3: [2,4], 4: [1,2,3], 5: [1,2]}
    edges = [[1,2], [1,4], [1,5], [2,3], [2,4], [2,5], [3,4]]
    ```

2. Keep track of the minimum cut so far:

    ```python
    # len(edges) is always a safe upper bound; for k = 2 the minimum vertex degree would be a tighter one
    min_edges_so_far = len(edges)
    min_vertex_sets = {1:[], 2:[], 3:[], 4:[], 5:[]}
    ```

3. *Iterate at least `n^2 log n` times (where n is the original number of vertices)*
    
    **Initialize:**
    
    ```python
    # start every iteration from empty vertex sets, not from the best result so far
    temp_vertex_sets = {v: [] for v in vertices}
    temp_vertices = copy(vertices)
    temp_edges = copy(edges)
    ```

    **While num_vertices > k:**

    1. Pick an edge at random: the first vertex (`v1`) will absorb the second (`v2`). Add `v2` and `temp_vertex_sets[v2]` to `temp_vertex_sets[v1]` and delete `temp_vertex_sets[v2]`.
    2. All vertices adjacent to `v2` are added to `temp_vertices[v1]` unless already present. Remove `v2` from `temp_vertices[v1]` and delete `temp_vertices[v2]`.
    3. Replace all instances of `v2` in `temp_edges` with `v1`, unless the other vertex of the edge is itself `v1`. In the latter case, delete the edge (i.e. remove self-loops). **Note:** Parallel edges are allowed; there may be multiple instances of an edge with the same vertex pair.

    **Finally:** The number of edges remaining is the number of edges crossing the cut found in this iteration. If it is less than `min_edges_so_far`, update `min_edges_so_far = len(temp_edges)` and `min_vertex_sets = temp_vertex_sets`.
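
The specification above can be sketched as a small, self-contained function. This is only an illustration, not the contents of `karger_run.py` (which is not shown here): the names `contract_to_k` and `min_k_cut` are made up for this example, the adjacency dictionary is left out because the edge list already carries the information needed for contraction, and the code is written to run under Python 3.

```python
import random


def contract_to_k(vertex_ids, edges, k, rng):
    """One randomized pass: contract random edges until k super-vertices remain."""
    # Every vertex starts as its own super-vertex that has absorbed nothing.
    vertex_sets = {v: [] for v in vertex_ids}
    remaining = [tuple(e) for e in edges]
    while len(vertex_sets) > k and remaining:
        v1, v2 = rng.choice(remaining)
        # v1 absorbs v2 together with everything v2 absorbed earlier.
        absorbed = vertex_sets.pop(v2)
        vertex_sets[v1].extend([v2] + absorbed)
        # Relabel v2 as v1, keep parallel edges, drop self-loops.
        relabeled = [tuple(v1 if x == v2 else x for x in e) for e in remaining]
        remaining = [e for e in relabeled if e[0] != e[1]]
    return remaining, vertex_sets


def min_k_cut(vertex_ids, edges, k, n_iters, seed=None):
    """Repeat the contraction pass and keep the smallest k-way cut seen."""
    rng = random.Random(seed)
    best_edges, best_sets = None, None
    for _ in range(n_iters):
        cut_edges, vertex_sets = contract_to_k(vertex_ids, edges, k, rng)
        # A pass that ran out of edges early did not produce exactly k groups.
        if len(vertex_sets) != k:
            continue
        if best_edges is None or len(cut_edges) < len(best_edges):
            best_edges, best_sets = cut_edges, vertex_sets
    return best_edges, best_sets


# The five-vertex example graph from step 1.
example_edges = [[1, 2], [1, 4], [1, 5], [2, 3], [2, 4], [2, 5], [3, 4]]
cut_edges, groups = min_k_cut([1, 2, 3, 4, 5], example_edges, k=2, n_iters=200, seed=0)
print(len(cut_edges), groups)
```

Each key of `groups` is a surviving super-vertex and its value lists the vertices it absorbed, mirroring `min_vertex_sets` in the specification.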
        

## Setup: Select all measurements and document ids from the database

In [1]:
import psycopg2
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import json
from copy import deepcopy
import random
execfile('utils.py')

In [2]:
database = 'fomc'
conn = psycopg2.connect("dbname=" + database + " user=abarciauskas")
cur = conn.cursor()

year = 2006
cosine_thresh = 0.25
# pass the values as query parameters instead of concatenating them into the SQL string
cur.execute("SELECT Doc1Id,Doc2Id,CosineSimilarity FROM alignments WHERE Year = %s"
            " AND CosineSimilarity >= %s ORDER BY random() LIMIT 500",
            (str(year), cosine_thresh))
cosine_sims = cur.fetchall()
len(cosine_sims)

500

## Step 1: Create the graph

The graph consists of a list of edges (each a tuple of two vertices) and a dictionary of vertices.

In [3]:
edges, vertices = create_graph(cosine_sims)
print 'Number vertices in complete graph: ' + str(len(vertices))
print 'Number edges in complete graph: ' + str(len(edges))

Number vertices in complete graph: 607
Number edges in complete graph: 500
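
`create_graph` comes from `utils.py`, which is not reproduced here. As a rough sketch of what such a helper could look like, assuming each row is a `(Doc1Id, Doc2Id, CosineSimilarity)` tuple and every row becomes one edge, the hypothetical `create_graph_sketch` below builds both structures in one pass (written for Python 3):

```python
def create_graph_sketch(rows):
    """Build an edge list and an adjacency dictionary from (doc1, doc2, similarity) rows."""
    edges = []
    vertices = {}
    for doc1, doc2, _similarity in rows:
        # One edge per row; the similarity value itself is not stored.
        edges.append((doc1, doc2))
        # Record each endpoint in the other's adjacency list, without duplicates.
        for a, b in ((doc1, doc2), (doc2, doc1)):
            neighbours = vertices.setdefault(a, [])
            if b not in neighbours:
                neighbours.append(b)
    return edges, vertices


sample_rows = [(1, 2, 0.31), (2, 3, 0.58)]
sample_edges, sample_vertices = create_graph_sketch(sample_rows)
print(sample_edges)     # [(1, 2), (2, 3)]
print(sample_vertices)  # {1: [2], 2: [1, 3], 3: [2]}
```

The real helper may differ, for instance in how it represents an edge.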


In [5]:
# split the graph into its connected components and keep the largest one
graphs = build_distinct_graphs(vertices)

graph_lengths = [len(graph) for graph in graphs]
fc_graph = graphs[graph_lengths.index(max(graph_lengths))]
print 'Number vertices fully connected graph: ' + str(len(fc_graph))

Number vertices fully connected graph: 48
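
`build_distinct_graphs` is also defined in `utils.py` and not shown. One common way to split an adjacency dictionary into its connected components is a breadth-first search from every unvisited vertex. The sketch below assumes that approach; the function name `connected_components` is hypothetical, and the code is written for Python 3.

```python
from collections import deque


def connected_components(vertices):
    """Return a list of components, each a list of the vertices it contains."""
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for u in vertices.get(v, []):
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        components.append(component)
    return components


toy_vertices = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
toy_components = connected_components(toy_vertices)
largest = max(toy_components, key=len)
print(sorted(largest))  # [1, 2, 3]
```

Picking the largest component with `max(..., key=len)` does the same job as the `graph_lengths.index(max(graph_lengths))` lookup in the cell above.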


In [8]:
# Drop every vertex outside the largest connected component (fc = "fully connected", used loosely)
set_fc_graph_vertices = set(fc_graph)
loners = set(vertices.keys()) - set_fc_graph_vertices

fc_vertices = deepcopy(vertices)
fc_edges = deepcopy(edges)

for loner in loners: fc_vertices.pop(loner, None)
print 'Vertices in fully connected graph: ' + str(len(fc_vertices))

fc_edges = filter(lambda x: not list(x)[0] in loners and not list(x)[1] in loners, fc_edges)    
print 'Edges in fully connected graph: ' + str(len(fc_edges))

Vertices in fully connected graph: 48
Edges in fully connected graph: 50


## Steps 2 & 3: Keep track of the minimum so far and run many random iterations

In [9]:
execfile('karger_run.py')

In [10]:
k = 10
n = len(fc_vertices)
niters = 1

import time

t0 = time.time()
min_fc_edges_so_far, min_vertex_sets = karger_run(niters)
t1 = time.time()

total = t1-t0
print 'Total time for ' + str(niters) + ': ' + str(total)

Running iter: 0
Total time for 1: 0.00412201881409


In [11]:
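# n^2 log n iterations, as in step 3 of the specification
niters = int(np.ceil(n**2*np.log(n)))

# extrapolate from the single-iteration timing measured above
total_seconds = niters*total
minutes = total_seconds/60
hours = minutes/60
print hours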

0.0102134466171
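
The estimate above can be checked by hand with the values already printed in this notebook: `n = 48` vertices and roughly 0.0041 seconds for the single timed iteration. The helper name `estimate_runtime_hours` is made up for this sketch, which uses only the standard library and is written for Python 3.

```python
import math


def estimate_runtime_hours(n, seconds_per_iter):
    """Extrapolate one iteration's timing to ceil(n^2 * ln n) iterations."""
    n_iters = int(math.ceil(n ** 2 * math.log(n)))
    return n_iters, n_iters * seconds_per_iter / 3600.0


# 48 vertices and the single-iteration timing reported above.
n_iters, hours = estimate_runtime_hours(48, 0.00412201881409)
print(n_iters)  # 8920
print(hours)
```

Because the estimate scales one measured iteration linearly, it is only as good as that single timing.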


In [12]:
t0 = time.time()
min_edges_so_far, min_vertex_sets = karger_run(niters)
t1 = time.time()
total = t1-t0
print 'Total time for ' + str(niters) + ': ' + str(total)

Running iter: 0
Running iter: 50
Running iter: 100
Running iter: 150
Running iter: 200
Running iter: 250
Running iter: 300
Running iter: 350
Running iter: 400
Running iter: 450
Running iter: 500
Running iter: 550
Running iter: 600
Running iter: 650
Running iter: 700
Running iter: 750
Running iter: 800
Running iter: 850
Running iter: 900
Running iter: 950
Running iter: 1000
Running iter: 1050
Running iter: 1100
Running iter: 1150
Running iter: 1200
Running iter: 1250
Running iter: 1300
Running iter: 1350
Running iter: 1400
Running iter: 1450
Running iter: 1500
Running iter: 1550
Running iter: 1600
Running iter: 1650
Running iter: 1700
Running iter: 1750
Running iter: 1800
Running iter: 1850
Running iter: 1900
Running iter: 1950
Running iter: 2000
Running iter: 2050
Running iter: 2100
Running iter: 2150
Running iter: 2200
Running iter: 2250
Running iter: 2300
Running iter: 2350
Running iter: 2400
Running iter: 2450
Running iter: 2500
Running iter: 2550
Running iter: 2600
Running iter: 26

In [13]:
print 'Num crossing edges: ' + str(min_edges_so_far)
total = t1-t0
print 'Total time for ' + str(niters) + ' iterations: ' + str(total/60/60) + ' hours'
super_nodes = min_vertex_sets.keys()
# count only super nodes whose vertex set has at least 4 members as actual clusters
super_nodes = filter(lambda x: len(min_vertex_sets[x]) >= 4, super_nodes)
nclusters = len(super_nodes)
print 'Number of actual clusters: ' + str(nclusters)
for node in super_nodes: print 'Super node of size: ' + str(len(min_vertex_sets[node]))

Num crossing edges: 9
Total time for 8920 iterations: 0.00702449977398 hours
Number of actual clusters: 2
Super node of size: 25
Super node of size: 9


In [20]:
execfile('utils.py')

# find the relative frequency for each super node
cur.execute("SELECT TermVector FROM corpii WHERE Year = %s", (str(year),))
terms = cur.fetchall()[0][0]
nterms = len(terms)

cluster_frequencies, overall_frequencies = sum_term_frequencies(cur, nterms, nclusters, super_nodes, min_vertex_sets)
cluster_frequencies_normalized = normalize_term_frequencies(nterms, cluster_frequencies, overall_frequencies)
print_clusters(cur, super_nodes, cluster_frequencies_normalized, 20, terms)

\_\_\_\_

**Cluster 1**

*Representative:*
> In the Committee's discussion of monetary policy for the intermeeting period, nearly all members favored keeping the target federal funds rate at 5-1/4 percent at this meeting.

| Top Terms   |   Relative Frequency |
|:------------|---------------------:|
| posit       |                    1 |
| mount       |                    1 |
| compromis   |                    1 |
| strengthen  |                    1 |
| vari        |                    1 |
| declinar    |                    1 |
| gener       |                    1 |
| commit      |                    1 |
| pace        |                    1 |
| runoff      |                    1 |
| forecast    |                    1 |
| must        |                    1 |
| executiu    |                    1 |
| suppli      |                    1 |
| volatil     |                    1 |
| baby-boom   |                    1 |
| two         |                    1 |
| advantag    |                    