# Investigating Hybrid Strategies for Systematic Literature Review
**Experiment 1**

This notebook shows:
1. Citation matrix;
2. Matrix with the number of steps needed to find each selected article;
3. Graph with the number of steps needed to find each selected article;
4. Backward and forward matrix for each article.

Note: the citation graph is in the notebook CitationGraph.ipynb.

In [1]:
%matplotlib notebook

import os, sys
sys.path.insert(1, os.path.join(sys.path[0], '..'))
import database
from snowballing.operations import reload, work_by_varname, load_work_map_all_years, find_citation
from snowballing.strategies import Strategy, State
import custom_strategies
from functools import reduce
from matplotlib_venn import venn2, venn2_circles
from matplotlib import pyplot as plt
import pandas as pd
from collections import OrderedDict
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

custom_strategies.LIMIT_YEAR = 2015
reload()
# !pip install matplotlib-venn

In [2]:
selected = [(varname, w) for varname, w in load_work_map_all_years() if w.category == "snowball"]
names = [varname for varname, w in selected]
names

['armbrust2010b',
 'oConnor2015a',
 'petersen2015a',
 'cocozza2014a',
 'lepmets2014a',
 'trendowicz2014a',
 'unterkalmsteiner2014a',
 'münch2013a',
 'münch2013c',
 'stallinger2013a',
 'birkhölzer2011a',
 'esfahani2011a',
 'kaneko2011a',
 'plösch2011a',
 'trendowicz2011a',
 'armbrust2010a',
 'armbrust2010b',
 'barreto2010a',
 'basili2010a',
 'guzmán2010a',
 'kowalczyk2010a',
 'mandić2010a',
 'mandić2010b',
 'mandić2010d',
 'mcloughlin2010a',
 'mcloughlin2010b',
 'sun2010a',
 'albuquerque2009a',
 'trienekens2009a',
 'becker2008a',
 'becker2008b',
 'martins2008a',
 'basili2007a',
 'basili2007b',
 'basili2007c',
 'martins2007b',
 'wilkie2007a',
 'liu2006a',
 'liu2005a',
 'trienekens2005a',
 'wang2005a',
 'trienekens2004a',
 'murugappan2003a',
 'karlström2002a',
 'waina2001a',
 'debou2000a',
 'kautz2000a',
 'messnarz1999a',
 'sommerville1999a',
 'mccoy1998a',
 'reiblein1997a',
 'hinley1995a']

In [3]:

order = OrderedDict([
 ('waina2001a', 0),
 ('wilkie2007a', 1),
 ('becker2008a', 2),
 ('petersen2015a', 3),
 ('kaneko2011a', 4),
 ('barreto2010a', 5),
 ('trienekens2009a', 6),
 ('guzmán2010a', 7),
 ('basili2010a', 8),
 ('sommerville1999a', 9),
 ('wang2005a', 10),
 ('martins2008a', 11),
 ('plösch2011a', 12),
 ('albuquerque2009a', 13),
 ('reiblein1997a', 14),
 ('mandić2010a', 15),
 ('trienekens2005a', 16),
 ('esfahani2011a', 17),
 ('becker2008b', 18),
 ('oConnor2015a', 19),
 ('kautz2000a', 20),
 ('mandić2010b', 21),
 ('messnarz1999a', 22),
 ('mandić2010d', 23),
 ('cocozza2014a', 24),
 ('unterkalmsteiner2014a', 25),
 ('karlström2002a', 26),
 ('stallinger2013a', 27),
 ('hinley1995a', 28),
 ('sun2010a', 29),
 ('armbrust2010a', 30),
 ('debou2000a', 31),
 ('lepmets2014a', 32),
 ('mcloughlin2010a', 33),
 ('mcloughlin2010b', 34),
 ('liu2005a', 35),
 ('mccoy1998a', 36),
 ('basili2007a', 37),
 ('trendowicz2014a', 38),
 ('münch2013c', 39),
 ('münch2013a', 40),
 ('basili2007b', 41),
 ('trendowicz2011a', 42),
 ('martins2007b', 43),
 ('armbrust2010b', 44),
 ('liu2006a', 45),
 ('birkhölzer2011a', 46),
 ('trienekens2004a', 47),
 ('kowalczyk2010a', 48),
 ('murugappan2003a', 49),
 ('basili2007c', 50),
])
id_to_varname = OrderedDict(sorted([
  (index, varname) for varname, index in order.items()
]))
selected = [(varname, work_by_varname(varname)) for index, varname in id_to_varname.items()]
names = ['{} S{}'.format(w.year, order[varname] + 1) for varname, w in selected]
#from snowballing.dbmanager import insert, set_attribute
#for key, value in order.items():
#    set_attribute(key, "selected_order", "{}".format(value + 1))

# Citation Matrix

How to read the citation matrix:
1. Each row lists the references of an article. Example: the article in row (2010 S8 - guzmán2010a) references the articles in columns (2009 S7) and (2010 S9). I verified that (2009 S7) and (2010 S9) appear in the backward list of (2010 S8 - guzmán2010a).
2. Each column shows which articles cite it. Example: the article in column (2011 S5) is cited by the article in row (2015 S4 - petersen2015a). I verified that (2015 S4) appears in the forward list of (2011 S5).
3. A " - " indicates that the article in the row cannot reference (cite) the article in the column because of the publication years. Example: row 5 (ref 2011 S5, id kaneko2011a) could not reference the article in column (cited 2015 S4).
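
The reading rules above can be reproduced on a toy dataset. The sketch below is only an illustration: the papers `A`, `B`, `C`, their years, and their reference sets are hypothetical, and plain dictionaries stand in for the `snowballing` database and `find_citation`.

```python
# Hypothetical toy data: publication year and reference set per paper.
years = {"A": 2015, "B": 2011, "C": 2010}
references = {"A": {"B"}, "B": {"C"}, "C": set()}

def citation_cell(citer, cited):
    """Return '-', 'x' or '' following the reading rules of the matrix."""
    if years[cited] > years[citer]:
        return "-"   # the column paper is newer, so the citation is impossible
    if cited in references[citer]:
        return "x"   # the row paper references the column paper
    return ""        # possible, but no citation exists

papers = list(years)
toy_matrix = [[citation_cell(r, c) for c in papers] for r in papers]
for paper, row in zip(papers, toy_matrix):
    print(paper, row)
```

Rows are the citing papers and columns are the cited papers, the same orientation as the `ref` and `cited` axes of the real matrix.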


In [None]:
matrix1 = [
    [varname] + ['-' if cited.year > citer.year else 
     'x' if find_citation(citer, cited) else ''
     for _, cited in selected]
     for varname, citer in selected
]
df = pd.DataFrame(matrix1)
# Assigning index/columns directly works across pandas versions.
df.index = names
df.columns = ["id"] + names
df = df.rename_axis("cited", axis="columns")
df = df.rename_axis("ref", axis="rows")
def highlight_max(s):
    return [
        'background-color: grey' if k == s.name else
        'background-color: green' if v == 'x' else ''
        for k, v in s.items()
    ]
df_style = df.style.apply(highlight_max).set_properties(**{'text-align': 'center'}).set_table_styles([
    dict(selector="th", props=[("text-align", "center")]),
])
df_style



cited,id,2001 S1,2007 S2,2008 S3,2015 S4,2011 S5,2010 S6,2009 S7,2010 S8,2010 S9,1999 S10,2005 S11,2008 S12,2011 S13,2009 S14,1997 S15,2010 S16,2005 S17,2011 S18,2008 S19,2015 S20,2000 S21,2010 S22,1999 S23,2010 S24,2014 S25,2014 S26,2002 S27,2013 S28,1995 S29,2010 S30,2010 S31,2000 S32,2014 S33,2010 S34,2010 S35,2005 S36,1998 S37,2007 S38,2014 S39,2013 S40,2013 S41,2007 S42,2011 S43,2007 S44,2010 S45,2006 S46,2011 S47,2004 S48,2010 S49,2003 S50,2007 S51
2001 S1,waina2001a,,-,-,-,-,-,-,-,-,,-,-,-,-,,-,-,-,-,-,,-,,-,-,-,-,-,,-,-,,-,-,-,-,,-,-,-,-,-,-,-,-,-,-,-,-,-,-
2007 S2,wilkie2007a,,,-,-,-,-,-,-,-,,,-,-,-,,-,,-,-,-,,-,,-,-,-,,-,,-,-,,-,-,-,,,,-,-,-,,-,,-,,-,,-,,
2008 S3,becker2008a,,,,-,-,-,-,-,-,,,,-,-,,-,,-,,-,,-,,-,-,-,,-,,-,-,,-,-,-,x,x,,-,-,-,,-,,-,,-,,-,,
2015 S4,petersen2015a,,,,,x,,,,x,,,,,,,,,,,,,x,,x,,,,,,,,,,,,,,x,,x,x,,x,,,,,,,,
2011 S5,kaneko2011a,,,,-,,,,,,,,,,,,,,,,-,,,,,-,-,,-,,,,,-,,,,,,-,-,-,x,,,,,,,x,,
2010 S6,barreto2010a,,,,-,-,,,,,,,,-,,,,,-,x,-,,,,,-,-,,-,,,,,-,,,,,,-,-,-,x,-,,,,-,,,,
2009 S7,trienekens2009a,,,,-,-,-,,-,-,,,,-,,,-,,-,,-,,-,,-,-,-,,-,,-,-,,-,-,-,,,,-,-,-,,-,,-,,-,,-,,
2010 S8,guzmán2010a,,,,-,-,,x,,x,,,,-,,,,,-,,-,,,,,-,-,,-,,,,,-,,,,,,-,-,-,,-,,,,-,,,,
2010 S9,basili2010a,,,,-,-,,,,,,,,-,,,,,-,,-,,,,,-,-,,-,,,,,-,,,,,x,-,-,-,,-,,,,-,,,,
1999 S10,sommerville1999a,-,-,-,-,-,-,-,-,-,,-,-,-,-,,-,-,-,-,-,-,-,,-,-,-,-,-,,-,-,-,-,-,-,-,,-,-,-,-,-,-,-,-,-,-,-,-,-,-


In [None]:
with open("output/table.html", "wb") as html:
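    # Styler.render() was removed in pandas 2.0; prefer to_html() when available.
    render = getattr(df_style, "to_html", None) or df_style.render
    html.write(render().encode("utf-8"))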

In [None]:
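# Build an undirected step matrix: 1 for a direct citation, inf otherwise.
nmatrix = [[(1 if x == 'x' else float('inf')) for x in row[1:]]
 for row in matrix1]
for i, row in enumerate(nmatrix):
    for j, v in enumerate(row):
        if v == 1:
            nmatrix[j][i] = 1  # a citation can be followed backward or forward
        if i == j:
            nmatrix[i][j] = 0
# Floyd-Warshall: fewest steps between every pair of articles.
size = len(matrix1)
for k in range(size):
    for i in range(size):
        for j in range(size):
            if nmatrix[i][j] > nmatrix[i][k] + nmatrix[k][j]:
                nmatrix[i][j] = nmatrix[j][i] = nmatrix[i][k] + nmatrix[k][j]

# Largest finite distance, taken after the relaxation so that intermediate
# (non-final) path lengths do not inflate it.
pmax = max([1] + [v for row in nmatrix for v in row if v != float('inf')])
pmax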

# Matrix - Steps to find the article

1. Number of steps needed to reach an article in the selected list, following both the references of each article and the articles that cite it. Example: the article in row (2015 S4 petersen2015a) reaches the article in column 2011 S5 in one step, through its own references, and reaches the article in column 2010 S6 in 3 steps, through referenced articles that in turn reference others.
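
Because an article can be reached either through its references (backward) or through the articles that cite it (forward), the citation relation is treated as undirected when counting steps. The following sketch illustrates that idea with a breadth-first search over hypothetical citation pairs; it is an alternative way to obtain the step counts from a single starting article, not the code used in this notebook, which runs Floyd-Warshall on `nmatrix`.

```python
from collections import deque

# Hypothetical citations: each pair is (citing paper, cited paper).
citations = [("A", "B"), ("B", "C"), ("D", "C")]

# Build an undirected neighbor map: one step may go backward or forward.
neighbors = {}
for citer, cited in citations:
    neighbors.setdefault(citer, set()).add(cited)
    neighbors.setdefault(cited, set()).add(citer)

def steps_from(start):
    """Number of snowballing steps from `start` to every reachable paper."""
    steps = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in steps:
                steps[nxt] = steps[current] + 1
                queue.append(nxt)
    return steps

toy_steps = steps_from("A")
print(toy_steps)
```

Here `D` is found from `A` in three steps: backward to `B`, backward to `C`, then forward to `D`.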


In [None]:
matrix3 = [
    [s[0]] + row
     for s, row in zip(selected, nmatrix)
]
df = pd.DataFrame(matrix3)
# Assigning index/columns directly works across pandas versions.
df.index = names
df.columns = ["id"] + names
df = df.rename_axis("cited", axis="columns")
df = df.rename_axis("ref", axis="rows")
def highlight_max(s):
    return [
        'background-color: grey' if k == s.name else
        'background-color: green' if v != float('inf') and isinstance(v, float) else ''
        for k, v in s.items()
    ]
df_style = df.style.apply(highlight_max).set_properties(**{'text-align': 'center'}).set_table_styles([
    dict(selector="th", props=[("text-align", "center")]),
])
df_style

# Graph Citation - Steps to Find Article

1. Algorithm used to generate the graph: Floyd-Warshall, which computes the distance between all pairs of vertices of a graph.
2. Example: the graph node (2010 S6 35 5) means that the article (2010 S6) reaches 35 of the 51 listed articles in at most 5 steps.
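
As a minimal sketch of how the two numbers in each node label arise, the example below runs Floyd-Warshall on a small hypothetical step matrix (four papers, one of them isolated) and then derives, for each paper, how many others it reaches and the maximum number of steps required. The matrix values and the helper `node_summary` are assumptions made for illustration only.

```python
INF = float("inf")

# Hypothetical undirected step matrix for four papers:
# 0 on the diagonal, 1 for a direct citation link, INF otherwise.
dist = [
    [0, 1, INF, INF],
    [1, 0, 1, INF],
    [INF, 1, 0, INF],
    [INF, INF, INF, 0],  # paper 3 has no citation link to the others
]

# Floyd-Warshall: allow each paper k in turn as an intermediate stop.
size = len(dist)
for k in range(size):
    for i in range(size):
        for j in range(size):
            if dist[i][j] > dist[i][k] + dist[k][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

def node_summary(i):
    """Return (papers reached, maximum steps) for paper i."""
    reached = [d for j, d in enumerate(dist[i]) if j != i and d != INF]
    return len(reached), (max(reached) if reached else 0)

for i in range(size):
    print(i, node_summary(i))
```

Paper 0 reaches two papers in at most two steps, while the isolated paper 3 reaches none, which corresponds to a node labeled `0 0`.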

In [None]:
from subprocess import Popen, PIPE as P
class ViewMatrix:
    def __init__(self, nmatrix, names):
        self.nmatrix = nmatrix
        self.names = names
        
    @property
    def dot(self):
        text = ["digraph G {", "graph [ overlap=false ]"]
        for i, name in enumerate(self.names):
            filtered = [x for j, x in enumerate(self.nmatrix[i]) if x != float('inf') if j != i]
            total = sum(1 for x in filtered)
            maxsteps = max(filtered) if filtered else 0
            work = work_by_varname(id_to_varname[int(self.names[i].split()[-1][1:]) - 1])
            color = "green" if getattr(work, 'final_selected', 0) else "white"
            text.append(f'"{name}" [fillcolor="{color}", style=filled, label="{name}\n{total} {maxsteps}"];')
        for i, lis in enumerate(self.nmatrix):
            for j, v in enumerate(lis):
                if v != float('inf') and j != i:
                    text.append('"{}" -> "{}" [label="{}" color="gray"]'.format(self.names[i], self.names[j], v))
        text.append("}")
        return '\n'.join(text)

    def _ipython_display_(self):
        from IPython.display import display
        bundle = {}

        dot = self.dot
        bundle['text/vnd.graphviz'] = dot

        try:
            kwargs = {} if os.name != 'nt' else {"creationflags": 0x08000000}
            p = Popen(['neato', '-T', "svg", "-Goutputorder=edgesfirst"], stdout=P, stdin=P, stderr=P, **kwargs)
            image = p.communicate(dot.encode('utf-8'))[0]
            bundle['image/svg+xml'] = image.decode("utf-8")
        except OSError as e:
            # os.errno was removed in Python 3.7; use the errno module instead.
            import errno
            print(e)
            if e.errno != errno.ENOENT:
                raise

        display(bundle, raw=True)
m = ViewMatrix(nmatrix, names)
m

# Backward and Forward - Details

1. Number of articles visited, and in how many steps, to find the articles in the selected list;
2. Number of articles found via backward snowballing, and how many of them were selected;
3. Number of articles found via forward snowballing, and how many of them were selected.
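
The backward and forward counts can be illustrated without the `snowballing` library. The sketch below assumes a hypothetical reference map `toy_refs` and a hypothetical set of selected papers; it inverts the map to obtain the forward lists, which plays the role that `strategy.ref` and `strategy.rev_ref` play in the cell that follows.

```python
# Hypothetical reference map: paper -> papers it cites (its backward list).
toy_refs = {
    "A": ["B", "X"],
    "B": ["C"],
    "C": [],
    "X": ["C"],
}
toy_selected = {"A", "B", "C"}  # papers accepted in the review (assumption)

# Invert the map to obtain forward lists (who cites each paper).
toy_rev = {paper: [] for paper in toy_refs}
for citer, cited_list in toy_refs.items():
    for cited in cited_list:
        toy_rev[cited].append(citer)

def summarize(paper):
    """Count total and selected papers found via backward and forward."""
    backward = toy_refs[paper]
    forward = toy_rev[paper]
    return {
        "total backward": len(backward),
        "selected backward": sum(p in toy_selected for p in backward),
        "total forward": len(forward),
        "selected forward": sum(p in toy_selected for p in forward),
    }

print(summarize("C"))
```

Paper `C` has no references, but it is cited by `B` and `X`, of which only `B` is selected.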

In [None]:
strategy = Strategy({})
matrix2 = [[
    "S", "varname", "visits", "steps",
    "total backward", "selected backward",
    "total forward", "selected forward",
    "backward list", "selected backward list",
    "forward list", "selected forward list"
] + ["Step {}".format(i + 1) for i in range(pmax)]]
for varname, index in order.items():
    work = work_by_varname(varname)
    backward = strategy.ref[work]
    selected_backward = [x for x in backward if x.category == "snowball"]
    forward = strategy.rev_ref[work]
    selected_forward = [x for x in forward if x.category == "snowball"]
    steps = [[] for x in range(pmax)]
    for i, v in enumerate(nmatrix[index]):
        if v != float('inf') and v != 0:
            steps[v - 1].append(id_to_varname[i])
    steps = [", ".join(s) for s in steps]
    # Exclude the article itself: compare with `index`, not the stale loop variable `i`.
    filtered = [x for j, x in enumerate(nmatrix[index]) if x != float('inf') if j != index]
    total = sum(1 for x in filtered)
    maxsteps = max(filtered) if filtered else 0
    row = [
        "S{}".format(index + 1), varname,
        total, maxsteps,
        len(backward), len(selected_backward),
        len(forward), len(selected_forward),
        ", ".join(x.metakey for x in backward),
        ", ".join(x.metakey for x in selected_backward),
        ", ".join(x.metakey for x in forward),
        ", ".join(x.metakey for x in selected_forward),
    ] + steps
    
    matrix2.append(row)
pd.set_option('display.max_colwidth', 1000)
df = pd.DataFrame(matrix2)
df

In [None]:
df.to_excel("tabela_grafo.xlsx")

In [None]:
#nmatrix[13]