In [2]:
%run marxan_utils.ipynb

### Needs:

We need to determine the 5 most spatially different solutions (runs) from the algorithm's output.

### [Methodology](http://www.econ.upf.edu/~michael/stanford/):


* Calculate the similarity of the spatial solution for each pair of solutions using the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index), and build a distance matrix from it.

* Perform an agglomerative [hierarchical clustering](http://www.econ.upf.edu/~michael/stanford/maeb7.pdf) analysis of the solutions based on the distance matrix.
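
The two steps above can be sketched on a toy example. This is only an illustration under stated assumptions: the `solutions` array is made-up data (one row per run, one column per planning unit), and SciPy's `pdist`, `linkage` and `fcluster` stand in for whatever tooling is finally used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy solutions matrix (hypothetical data): one row per run, one column per planning unit
solutions = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=bool)

# Step 1: pairwise Jaccard distances (condensed form), printed as a square matrix
condensed = pdist(solutions, metric="jaccard")
print(squareform(condensed).round(2))

# Step 2: agglomerative clustering with average linkage, cut into 2 groups
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Runs 0 and 1 share most of their planning units, as do runs 2 and 3, so each pair ends up in its own group.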

#### R solution using Vegan
`v1_gcv_marxanClusteringRuns.ipynb`

#### Python prototype function

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# readInput, validateFile, OutputSolutionsMatrix and OutputSum are assumed to be
# provided by marxan_utils.ipynb


def clusterSolutions(MARXAN_FOLDER, MARXAN_INPUTDATA, k=5):
    """
    Returns the best run (lowest Score) of each of the k most different groups of solutions
    """

    # Open solutions matrix file
    userInputFile = readInput(MARXAN_FOLDER, MARXAN_INPUTDATA)
    userSolMat_df = validateFile(MARXAN_FOLDER, MARXAN_INPUTDATA, OutputSolutionsMatrix)
    # Expand the list-like second column into separate columns
    solmat = userSolMat_df.drop(columns=userSolMat_df.columns[1]).join(userSolMat_df[userSolMat_df.columns[1]].apply(pd.Series))

    # Keep every column except 'SolutionsMatrix'
    solmat = solmat.loc[:, solmat.columns != 'SolutionsMatrix']

    # Average-linkage hierarchical clustering on pairwise Jaccard distances
    # (linkage returns the merge tree, not the distance matrix itself)
    dist_mat = linkage(solmat, method='average', metric='jaccard')

    # Cut the tree into at most k clusters
    # print(f'Building cluster of {k} most different solutions')
    groups = fcluster(dist_mat, k, criterion='maxclust')

    # Get the overall best solution (lowest Score) from the sum table
    summary = validateFile(MARXAN_FOLDER, MARXAN_INPUTDATA, OutputSum)
    best = summary.loc[summary['Score'].idxmin(), 'Run_Number']
    print(f'Overall best solution is {best}')

    # Get the best solution per cluster
    bestlist = []
    for i in range(k):
        g = np.where(groups == i + 1)[0]  # row positions of the runs in cluster i+1
        sol = summary.loc[summary.iloc[g]['Score'].idxmin(), 'Run_Number']
        # print(f'Group {i+1} best solution = {sol}')
        bestlist.append(sol)

    # See figure
    # plt.figure(figsize=(10, 7))
    # plt.scatter(clust[:,0], clust[:,1], c=cluster.labels_, cmap='rainbow')

    return bestlist
```
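
The "best solution per cluster" step can also be written without an explicit loop. The sketch below is an alternative under assumptions, not the prototype itself: `summary` and `groups` are hypothetical miniature stand-ins for the Marxan sum table and the `fcluster` labels, and the rows of both are assumed to be in the same run order.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the Marxan summary table and the fcluster labels
summary = pd.DataFrame({
    "Run_Number": [1, 2, 3, 4, 5, 6],
    "Score": [10.0, 8.0, 12.0, 7.5, 9.0, 11.0],
})
groups = np.array([1, 1, 2, 2, 3, 3])  # cluster label of each row, in row order

# Row label of the lowest-Score run inside each cluster
best_rows = summary.groupby(groups)["Score"].idxmin()

# Map those rows back to their run numbers
best_runs = summary.loc[best_rows, "Run_Number"].tolist()
print(best_runs)  # [2, 4, 5]
```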

#### [JavaScript prototype](https://codesandbox.io/s/kind-faraday-uro3f?file=/src/index.js)
```javascript
var Jaccard = require("jaccard-index");
var hclust = require("@greenelab/hclust");
var lodash = require("lodash");
// Input data: one entry per run (runId) with its planning unit (PU) values
var logs = [
  { runId: 1, puValues: [1, 1, 1, 1, 1, 1, 1] },
  { runId: 2, puValues: [0, 0, 1, 1, 0, 1, 0] },
  { runId: 3, puValues: [1, 1, 0, 1, 0, 1, 0] },
  { runId: 4, puValues: [1, 1, 1, 1, 1, 1, 1] },
  { runId: 5, puValues: [1, 0, 0, 1, 1, 1, 1] },
  { runId: 6, puValues: [0, 1, 0, 1, 1, 1, 0] },
  { runId: 7, puValues: [0, 1, 0, 0, 0, 0, 0] },
  { runId: 8, puValues: [0, 1, 0, 0, 1, 1, 0] },
  { runId: 9, puValues: [0, 1, 1, 0, 0, 1, 1] },
  { runId: 10, puValues: [0, 0, 0, 0, 0, 0, 0] },
  { runId: 11, puValues: [0, 0, 0, 0, 0, 0, 0] },
  { runId: 12, puValues: [0, 0, 0, 0, 0, 0, 0] },
  { runId: 13, puValues: [0, 0, 0, 0, 0, 0, 0] }
];

// We calculate the Jaccard index (similarity matrix) because we want to
// know how similar the runs are in terms of the spatial distribution
// of the response. That is why we also need an array with the PU values per run

const { clusters, distances, order, clustersGivenK } = hclust.clusterData({
  data: logs.map((state) => ({
    raw: state
  })),
  key: "raw",
  distance: (setA, setB) => {
    console.log(setA.runId, setB.runId);
    console.log(JSON.stringify(Jaccard().index(setA.puValues, setB.puValues)));
    return Jaccard().index(setA.puValues, setB.puValues);
  },
  linkage: (setA, setB, distanceMatrix) =>
    hclust.averageDistance(setA, setB, distanceMatrix)
});
const fiveGroups = clustersGivenK[5];
const data = fiveGroups.map((indices) => indices.map((index) => logs[index]));
const selectedSolutions = data.map((group) =>
  lodash.minBy(group, (element) => element.runId)
);

console.log(JSON.stringify(selectedSolutions));
//console.log(JSON.stringify(distances));
```
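
As a worked example of the measure involved, the sketch below computes the Jaccard index for runs 2 and 3 of the toy data by hand, treating each `puValues` array as a binary selection vector. Note that the index is a similarity: the matching distance is one minus the index, and that distance is what SciPy's `jaccard` (and `metric='jaccard'` in the Python prototype) returns.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# puValues of runs 2 and 3 from the toy data above
run2 = np.array([0, 0, 1, 1, 0, 1, 0], dtype=bool)
run3 = np.array([1, 1, 0, 1, 0, 1, 0], dtype=bool)

shared = np.logical_and(run2, run3).sum()  # PUs selected in both runs
either = np.logical_or(run2, run3).sum()   # PUs selected in at least one run

similarity = shared / either               # Jaccard index
distance = 1 - similarity                  # Jaccard distance

print(similarity, distance)
print(jaccard(run2, run3))                 # SciPy returns the distance
```

The two runs share 2 of the 5 planning units selected by at least one of them, so the index is 2/5 and the distance is 3/5.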

### Comparison of results between the Python algorithm and the JavaScript one

In [3]:
MARXAN_FOLDER='/home/jovyan/work/datasets/raw/marxan'
MARXAN_INPUTDATA='input.dat'
clusterSolutions(MARXAN_FOLDER, MARXAN_INPUTDATA,k=5)

Overall best solution is 29


[49, 78, 29, 25, 6]

### Data preparation for the JavaScript solution

```json
[
  { "runId": 1, "puValues": [1, 1, 1, 1, 1, 1, 1,...], "score":4 },
   ...
]
```
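
A minimal sketch of how records in this shape, including the `score` field, could be assembled. Everything here is hypothetical: `solmat_toy` and `scores` are made-up miniature inputs, not the real Marxan outputs.

```python
import json
import pandas as pd

# Hypothetical miniature inputs: a solutions matrix (one row per run) and the run scores
solmat_toy = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1]],
    index=[1, 2],                # run ids
    columns=["P1", "P2", "P3"],  # planning units
)
scores = {1: 4.0, 2: 3.5}        # run id -> score

# One record per run, with plain Python types so they serialize cleanly
records = [
    {
        "runId": int(run_id),
        "puValues": [int(v) for v in row],
        "score": scores[int(run_id)],
    }
    for run_id, row in zip(solmat_toy.index, solmat_toy.to_numpy())
]
payload = json.dumps(records)
print(payload)
```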

In [11]:
userSolMat_df = validateFile(MARXAN_FOLDER,MARXAN_INPUTDATA, OutputSolutionsMatrix)
solmat = userSolMat_df.drop(columns=userSolMat_df.columns[1]).join(userSolMat_df[userSolMat_df.columns[1]].apply(pd.Series))

solmat.apply(lambda x: {'runId':int(x[0][1:]),'puValues':x[1:].astype(int)}, axis=1, result_type='reduce').to_json(orient='records')

'[{"runId":1,"puValues":[0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

In [12]:
summary = validateFile(MARXAN_FOLDER,MARXAN_INPUTDATA, OutputSum)
summary

| Unnamed: 0 | Run_Number | Score | Cost | Planning_Units | Connectivity | Connectivity_Total | Connectivity_In | Connectivity_Edge | Connectivity_Out | Connectivity_In_Fraction | Penalty | Shortfall | Missing_Values | MPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2.718854e+06 | 1442016.0 | 3314 | 4252000.0 | 51664000.0 | 11130000.0 | 4252000.0 | 36282000.0 | 0.215430 | 1238.113043 | 287000.0 | 0 | 0.994872 |
| 1 | 2 | 2.774136e+06 | 1416536.0 | 3317 | 4524000.0 | 51664000.0 | 11006000.0 | 4524000.0 | 36134000.0 | 0.213030 | 400.340491 | 208000.0 | 0 | 0.996169 |
| 2 | 3 | 2.914071e+06 | 1431084.0 | 3319 | 4940000.0 | 51664000.0 | 10806000.0 | 4940000.0 | 35918000.0 | 0.209159 | 987.059027 | 153000.0 | 0 | 0.980952 |
| 3 | 4 | 2.813159e+06 | 1434284.0 | 3310 | 4596000.0 | 51664000.0 | 10942000.0 | 4596000.0 | 36126000.0 | 0.211792 | 74.735672 | 20000.0 | 0 | 0.999930 |
| 4 | 5 | 2.825450e+06 | 1437756.0 | 3351 | 4620000.0 | 51664000.0 | 11094000.0 | 4620000.0 | 35950000.0 | 0.214734 | 1694.280186 | 181000.0 | 1 | 0.914286 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | 2.869641e+06 | 1422100.0 | 3301 | 4820000.0 | 51664000.0 | 10794000.0 | 4820000.0 | 36050000.0 | 0.208927 | 1541.346704 | 523000.0 | 0 | 0.994872 |
| 96 | 97 | 2.883446e+06 | 1442168.0 | 3323 | 4804000.0 | 51664000.0 | 10890000.0 | 4804000.0 | 35970000.0 | 0.210785 | 77.862595 | 54000.0 | 0 | 0.999182 |
| 97 | 98 | 2.731485e+06 | 1440972.0 | 3348 | 4300000.0 | 51664000.0 | 11242000.0 | 4300000.0 | 36122000.0 | 0.217598 | 512.752419 | 40000.0 | 0 | 0.993909 |
| 98 | 99 | 2.906639e+06 | 1441192.0 | 3312 | 4884000.0 | 51664000.0 | 10806000.0 | 4884000.0 | 35974000.0 | 0.209159 | 246.927440 | 65000.0 | 0 | 0.997807 |


In [13]:
solmat = solmat.loc[:,solmat.columns != 'SolutionsMatrix']
# Average-linkage hierarchical clustering on pairwise Jaccard distances between runs
dist_mat = linkage(solmat, method='average',metric='jaccard')
groups = fcluster(dist_mat, 5, criterion='maxclust')
bestlist =[]
for i in range(5):
    g = np.where(groups == i+1)[0]
    bestlist.append(g)
bestlist

[array([48, 50]),
 array([68, 77, 81, 87]),
 array([ 0,  1,  2,  3,  4,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49, 51, 52, 53, 54,
        55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69, 70, 71, 72,
        73, 74, 75, 76, 78, 79, 80, 82, 83, 84, 85, 86, 88, 89, 90, 91, 92,
        93, 94, 95, 96, 97, 98, 99]),
 array([24]),
 array([5])]

In [9]:
result = clusterSolutions(MARXAN_FOLDER, MARXAN_INPUTDATA,k=5)
result

Overall best solution is 29


[49, 78, 29, 25, 6]

In [15]:
bestlist

[array([48, 50]),
 array([68, 77, 81, 87]),
 array([ 0,  1,  2,  3,  4,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49, 51, 52, 53, 54,
        55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69, 70, 71, 72,
        73, 74, 75, 76, 78, 79, 80, 82, 83, 84, 85, 86, 88, 89, 90, 91, 92,
        93, 94, 95, 96, 97, 98, 99]),
 array([24]),
 array([5])]

In [7]:
test2=[
  [ 5 ],
  [ 24 ],
  [ 48, 50 ],
  [ 68, 81, 77, 87 ],
  [
    26, 88, 80, 75, 86, 43, 79,  8, 45, 95, 12, 14,
    16, 63, 18, 76, 30, 37, 65, 32, 73, 56, 38, 23,
    92, 49, 82, 44, 74, 99, 10, 47,  2, 39, 58, 36,
    41, 60, 71, 27, 85, 11, 42, 64, 96,  1,  3, 34,
    66, 35, 70, 19, 46, 55, 31,  7, 61, 15, 25, 78,
    89, 59, 84, 54, 93, 53, 62, 51, 83, 22,  0, 28,
    72, 98,  9, 97, 29, 57, 94,  6, 67,  4, 20, 52,
    13, 33, 40, 91, 17, 69, 21, 90
  ]
]

In [8]:
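# Find which group of test2 holds each selected run
# (run numbers are 1-based, the indices in test2 are 0-based)
for clust in result:
    for i, leaf in enumerate(test2):
        if (clust - 1) in leaf:
            print(clust, i)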

49 2
78 3
29 4
25 1
6 0
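
Each selected run falls in a different group of `test2`. A stricter check is to compare the two groupings directly while ignoring the order of groups and of members. The sketch below shows one way to do that; `python_groups` and `js_groups` are hypothetical miniature lists and `as_partition` is an illustrative helper, not part of either prototype.

```python
# Hypothetical miniature groupings from two implementations (zero-based run indices)
python_groups = [[48, 50], [24], [5]]
js_groups = [[5], [24], [50, 48]]


def as_partition(groups):
    """Turn a list of groups into a set of frozensets so ordering is ignored."""
    return {frozenset(int(i) for i in group) for group in groups}


same = as_partition(python_groups) == as_partition(js_groups)
print(same)  # True
```

The same helper accepts the NumPy arrays in `bestlist` and the nested lists in `test2`, since it only iterates over the members of each group.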


In [None]:
[21,66,1,3,5]