### Required libraries

In [5]:
import commons.graph
import commons.parse
import commons.scores
import warnings
warnings.filterwarnings('ignore')

## Preliminary steps
#### Create a GraphMaker
Specify the stopword list and the desired stemmer:
- ```POR```: Porter Stemmer
- ```SNO```: Snowball Stemmer (English)
- ```LAN```: Lancaster Stemmer

In [6]:
gm = commons.graph.GraphMaker('resources/longStopwords.txt', 'LAN')

#### Create (or update) the list of allowed articles
Set the minimum number of nodes a graph must have to be considered, then run the function.
For the current dataset the file has already been generated, so there is **no need to run it again**.
Run this function only the first time, or whenever the dataset changes (e.g. articles are added or removed).
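
To illustrate what a minimum-node filter does, here is a minimal sketch in plain Python. It is not the code inside ```commons.parse```: the helper names ```cooccurrence_graph``` and ```is_allowed```, the sliding-window rule and the toy token lists are all assumptions made for the example.

```python
from itertools import combinations


def cooccurrence_graph(tokens, window=2):
    """Return an undirected graph as {node: set(neighbours)}.

    Assumed rule: two distinct tokens are linked when they appear
    together in a sliding window of `window` consecutive tokens.
    """
    graph = {}
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        for u, v in combinations(span, 2):
            if u != v:
                graph.setdefault(u, set()).add(v)
                graph.setdefault(v, set()).add(u)
    return graph


def is_allowed(tokens, min_nodes, window=2):
    """Keep an article only if its graph has at least `min_nodes` nodes."""
    return len(cooccurrence_graph(tokens, window)) >= min_nodes


# Toy, already stemmed and stopword-free token lists (hypothetical data).
short_doc = ["graph", "centrality", "graph"]
long_doc = ["keyword", "extraction", "graph", "centrality", "ranking", "abstract"]

allowed = [doc for doc in (short_doc, long_doc) if is_allowed(doc, min_nodes=5)]
print(len(allowed))  # only long_doc passes the filter
```

With ```min_nodes = 5``` the short document yields a two-node graph and is dropped, while the longer one yields six nodes and is kept.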

In [None]:
min_nodes = 5
commons.parse.update_allowed_forbidden_files(gm, min_nodes)

#### Sample the articles
There are 35403 allowed articles available for sampling.

In [None]:
sample_size = 5000
parsed_articles = commons.parse.parse_and_sample(sample_size, gm)

#### Set the run name
You will find the results in ```experiments/run_name/```.

In [7]:
run_name = 'testRun-40000'

## Compute the centralities!
The names for the centralities are:
- ```PR```: PageRank centrality
- ```CC```: Closeness Centrality
- ```BC```: Betweenness centrality
- ```LCC```: Local Clustering Coefficient

For the approximation there is an integer flag:
- ```0```: exact centrality
- ```1```: approximated centrality
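
To make one of the measures above concrete, the following is a minimal sketch of the exact local clustering coefficient (```LCC```) on a tiny undirected graph. It is written in plain Python for illustration only and is not taken from ```commons.scores```; the function name ```local_clustering``` and the toy graph are assumptions.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked.

    `graph` is an undirected adjacency mapping {node: set(neighbours)}.
    Nodes with fewer than two neighbours get a coefficient of 0.0.
    """
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

scores = {node: local_clustering(graph, node) for node in graph}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here ```b``` and ```c``` score 1.0 because their two neighbours are connected, ```a``` scores 1/3 (one link out of three possible neighbour pairs) and ```d``` scores 0.0. Sorting nodes by such a score gives the kind of ranking from which candidate keywords can be read off.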

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'BC', 0, run_name)

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'BC', 1, run_name)

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'PR', 0, run_name)

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'PR', 1, run_name)

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'LCC', 0, run_name)

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'LCC', 1, run_name)

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'CC', 0, run_name)

In [None]:
commons.scores.centrality_print_scores(parsed_articles, 'CC', 1, run_name)

## Look at the Results!
The names for the centralities are:
- ```PR```: PageRank centrality
- ```CC```: Closeness Centrality
- ```BC```: Betweenness centrality
- ```LCC```: Local Clustering Coefficient

For the approximation there is an integer value:
- ```0```: exact centrality
- ```1```: approximated centrality
- ```2```: consider both the exact and the approximated centralities for comparison.

The available metrics for visualization are:
- ```P@5```, ```P@10```, ```P@15```, ```P@20```: Precision at 5, 10, 15, 20
- ```R@5```, ```R@10```, ```R@15```, ```R@20```: Recall at 5, 10, 15, 20
- ```P@tot```: the number of keywords divided by the number of nodes of the co-occurrence graph
- ```R@tot```: the number of keywords effectively present in the abstract (hence, actually retrievable)
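
As a reminder of how precision and recall at a cutoff are usually computed, here is a minimal sketch in plain Python. The function names, the convention of dividing precision by ```k``` even when fewer than ```k``` words are ranked, and the toy data are assumptions for illustration; the actual implementation in ```commons.scores``` may differ.

```python
def precision_at_k(ranked, gold, k):
    """Share of the top-k ranked words that are gold keywords."""
    hits = sum(1 for word in ranked[:k] if word in gold)
    return hits / k


def recall_at_k(ranked, gold, k):
    """Share of the gold keywords found among the top-k ranked words."""
    if not gold:
        return 0.0
    hits = sum(1 for word in ranked[:k] if word in gold)
    return hits / len(gold)


# Hypothetical ranking produced by a centrality, best node first.
ranked = ["graph", "node", "centrality", "paper", "keyword", "method"]
# Hypothetical author-assigned keywords for the same article.
gold = {"graph", "centrality", "keyword", "stemming"}

p5 = precision_at_k(ranked, gold, 5)  # 3 hits in the top 5 -> 0.6
r5 = recall_at_k(ranked, gold, 5)     # 3 of 4 gold keywords -> 0.75
```

In this toy case the keyword "stemming" never appears in the ranking, so recall cannot reach 1.0 at any cutoff.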

Function ```significant_differences``` draws a boxplot of the selected metric for the chosen centralities and prints a table with the results of a Tukey HSD multiple-comparison test. This helps assess whether the selected centralities actually differ in performance. The confidence level is set to 95% by default.

Function ```average_metric``` outputs a simple DataFrame with the average performance (under a metric of choice) for all centralities. This is meant as an aid to interpreting the boxplots.



In [None]:

commons.scores.significant_differences(['PR','CC','BC','LCC'], 2, 'R@20', run_name)
display(commons.scores.average_metric('R@20', run_name))