# Step-by-step analysis of Ant Colony Simulations

**First Step** Read the data into a DataFrame and clean it

In [None]:
#importing needed libraries
# DataFrame library
import pandas as pd
# numerical library
import numpy as np
# Scientific library
import scipy as sp
# Statistical Library with R notation for Linear Models
import statsmodels.formula.api as smf

import matplotlib.pyplot as plt
%matplotlib inline

#Creating the pandas DataFrame from csv file
df = pd.read_csv("burma14.csv")
#dropping rows that contain missing (NaN) cells; copy() so the new columns below
#are written to an independent DataFrame and not to a view of df
clean = df.dropna().copy()
#Creating a column with the excess distance over the optimal result (3323)
clean['delta'] = clean['distance'] - 3323
#Creating a column with the relative excess over the optimal result (a fraction; multiply by 100 for percent)
clean['%delta'] = clean['delta']/3323
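
As a quick check of the two derived columns, here is a minimal sketch in plain Python. The sample distances and the helper name `excess` are made up for illustration; only the optimum 3323 comes from the cell above.

```python
OPTIMAL = 3323  # optimal tour length used in the cell above

def excess(distance, optimal=OPTIMAL):
    """Return (delta, relative delta) of a distance with respect to the optimum."""
    delta = distance - optimal
    return delta, delta / optimal

# Hypothetical tour lengths, for illustration only (not taken from burma14.csv)
sample_distances = [3323, 3400, 3655.3]

for d in sample_distances:
    delta, rel = excess(d)
    print(f"distance={d} delta={delta:.1f} relative={rel:.4f}")
```

A tour that matches the optimum gets a delta of 0, so rows with `delta == 0` are the runs that reached the best known result.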

**Second Step** Create a hexbin plot for each parameter column
Answering the question: *How many times did each value of a parameter reach the optimal result?*

A hexbin plot answers the general question: "How many times was each value y found together with each value x?"
The count is shown by color intensity, and a colorbar beside the plot helps to read it
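
To make the counting idea concrete, here is a simplified sketch that uses square cells instead of hexagons. That is an assumption made to keep the code short; `ax.hexbin` performs the same kind of counting over a hexagonal grid. The function name `bin_counts` and the sample points are hypothetical.

```python
from collections import Counter

def bin_counts(xs, ys, x_width, y_width):
    """Count how many (x, y) points fall into each square cell of a grid."""
    counts = Counter()
    for x, y in zip(xs, ys):
        # Integer cell index along each axis
        cell = (int(x // x_width), int(y // y_width))
        counts[cell] += 1
    return counts

# Made-up parameter values and distances, for illustration only
xs = [0.1, 0.2, 0.2, 0.9, 0.95]
ys = [3323, 3330, 3340, 3600, 3610]

counts = bin_counts(xs, ys, x_width=0.5, y_width=100)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Each cell's count is what the color intensity encodes in the plots.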

In [None]:
#iterate through each parameter column
for name in clean.columns:
    # since we are interested in the relation between distance and each parameter,
    # the distance-related columns (and the ID) are excluded from the analysis
    if name in ['distance','ID','delta','%delta'] :
        continue
    # create figure
    fig, ax = plt.subplots(figsize=(8,6))
    # create plot type and change setup
    d = ax.hexbin(clean[name],clean['distance'], cmap=plt.cm.Blues, alpha=0.8)
    plt.xlabel(name,fontsize=14)
    plt.ylabel('Distance in Km',fontsize=14)
    plt.title('Hexbin of times distance X was reached with parameter '+name,fontsize=14)
    cb = fig.colorbar(d)
    cb.set_label('number of times',fontsize=14)
    fig.savefig('plot_hexbin_'+name+'.png')

**Second Step - Refining** Repeat the hexbin plots

After analyzing the plots, we can already see many good results, but the final answer is still unclear,
since the counts are similar across several values.

To reduce the clutter we raise the **minimum** count: a cell is displayed only once its count reaches this minimum
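
The effect of the minimum count can be sketched without plotting: cells whose count is below the threshold are simply dropped. The dictionary of counts and the helper name `apply_mincnt` are hypothetical; in the plotting cell, this filtering is done by the `mincnt` argument of `ax.hexbin`.

```python
def apply_mincnt(counts, mincnt):
    """Keep only the cells that hold at least `mincnt` points."""
    return {cell: n for cell, n in counts.items() if n >= mincnt}

# Made-up counts per (parameter value, distance) cell, for illustration only
counts = {
    ("alpha=1", 3323): 9500,
    ("alpha=1", 3400): 7000,
    ("alpha=2", 3323): 8000,
}

visible = apply_mincnt(counts, mincnt=8000)
print(visible)
```

A threshold that is too high hides every cell, which is one reason to choose a different value per parameter.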

In [None]:
for name in clean.columns:
    if name in ['distance','ID','delta','%delta'] :
        continue
    fig, ax = plt.subplots(figsize=(8,6))
    
    # change point
    # before, we plotted the hexbin without any tuning
    # this time we set the minimum count (mincnt), so only cells that reach it are displayed
    # the threshold differs per parameter, based on the previous results
    if name == 'alpha':
        d = ax.hexbin(clean[name],clean['distance'],mincnt=8000, cmap=plt.cm.Blues)
    elif name == 'limit':
        d = ax.hexbin(clean[name],clean['distance'],mincnt=48000, cmap=plt.cm.Blues)
    else:
        d = ax.hexbin(clean[name],clean['distance'],mincnt=18000, cmap=plt.cm.Blues)
        
    plt.xlabel(name,fontsize=14)
    plt.ylabel('Distance in Km',fontsize=14)
    plt.title('Hexbin of times distance X was reached with parameter '+name,fontsize=14)
    cb = fig.colorbar(d)
    cb.set_label('number of times',fontsize=14)
    fig.savefig('plot_hexbin_10k_'+name+'.png')

**Third Step** Statistical Analysis

The idea of the previous steps was to gain information about this phenomenon: what is happening, is it replicable, when does it happen, and how does it happen. These kinds of questions were answered, at least in part. What remains is the question **_why_**?

Now we try to answer it, using some statistical tools to gain knowledge about this phenomenon

1) Linear model with one variable
Our first attempt is to see how each parameter behaves in a linear model (LM); we look for something like

```
y = a*x + b
```
This expresses y as a variable that depends linearly on x, with slope a and intercept b

We are interested in 4 values:
* The parameter value itself, in our example the value a
* The P-value, which shows the statistical significance of our parameter, in simple words: 'Could my value be due to chance?'
* The R^2 value, which shows how much of the variation in the data this LM can explain
* The correlation between our parameter and the distance: when the parameter increases, does the distance tend to increase or decrease, and how strongly?
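
As a rough illustration of what these values mean, the sketch below fits `y = a*x + b` by ordinary least squares in plain Python and reports the slope, intercept, R^2 and Pearson correlation. The data and the function name `simple_ols` are made up for illustration, and the P-value is left out because it needs a t-distribution, which the statistical library provides.

```python
import math

def simple_ols(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b, r_squared, pearson_r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                # slope
    b = mean_y - a * mean_x      # intercept
    pearson_r = sxy / math.sqrt(sxx * syy)
    r_squared = pearson_r ** 2   # holds for a one-variable model with an intercept
    return a, b, r_squared, pearson_r

# Made-up parameter values and distances, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [3400, 3380, 3370, 3340, 3330]

a, b, r2, r = simple_ols(xs, ys)
print(f"a={a:.2f} b={b:.2f} R^2={r2:.3f} r={r:.3f}")
```

In this made-up data the slope is negative, so a larger parameter value goes together with a shorter distance.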

In [None]:
# auxiliary list of (R^2, parameter name) pairs
aux = []
# same trick to skip the columns that are not analyzed
for name in clean.columns:
    if name in ['distance','ID','beta','delta','%delta'] :
        continue
    
    # formula for the linear model
    # this notation can be read as
    # distance is modeled as a linear function of the parameter name
    # we are interested in finding this relation
    formula = "Q('distance') ~ Q('"+name+"')"
    model = smf.ols(formula,data=clean)
    # skip the parameter if the Number of OBServations is lower than half of the rows
    if model.nobs < len(clean)/2:
        continue
        
    results = model.fit()
    
    #insert R^2 values in aux array with the parameter name
    aux.append((results.rsquared, name))
    #The Real Values of the Parameters
    print(results.params)
    #The P-values, between 0 and 1; values close to 0 indicate statistical significance
    print(results.pvalues)
    print(' ')
    #Pearson correlation - measures linear association
    print('pearson correlation distance and '+name,clean[name].corr(clean['distance'],method='pearson'))
    
    # Spearman correlation - rank-based, so it is more robust to non-linearity
    print('spearman correlation distance and '+name,clean[name].corr(clean['distance'],method='spearman'))
    print(' ')

    
#rank the list from greatest to lowest R^2
aux.sort(reverse=True)
for rsquared,name in aux:
    print(name,rsquared)
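
The cell above prints both Pearson and Spearman correlations. The sketch below, in plain Python with made-up data, shows why they can differ: for a relation that is monotonic but not linear, the rank-based Spearman value reaches 1 while the Pearson value stays below 1. The helper names `pearson`, `ranks` and `spearman` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up monotonic but non-linear relation, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

print("pearson ", pearson(xs, ys))
print("spearman", spearman(xs, ys))
```

A large gap between the two values for a parameter suggests that its relation with the distance is not well described by a straight line.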