# The Research Space

"Here we use a large dataset of scholarly publications disambiguated at the individual level to create a map of science — or research  space — where links connect pairs of fields based on the probability that an individual has published in both of them."
[Original Article](https://arxiv.org/ftp/arxiv/papers/1602/1602.08409.pdf)

[Supplementary article](https://link.springer.com/content/pdf/10.1140/epjds/s13688-019-0210-z.pdf)

In [None]:
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx
from collections import Counter
import geopandas as gpd
import networkx as nx
import pandas as pd
import numpy as np
import unidecode
import re

### Revealed Comparative Advantage

"We next use the research space to predict the future presence of an individual, organization, or country in a research field. To make these predictions we define five possible states for individuals, organizations, or countries in a research field. These states are: inactive, active, nascent, intermediate, and developed."

In [None]:
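# x maps (researcher_id, field) -> output of that researcher in that field
rca = dict()
sum_f = dict()   # researcher -> total output over all fields
sum_s = dict()   # field -> total output over all researchers
sum_sf = 0       # total output over all researchers and fields

for sf in x:
    sum_f[sf[0]] = sum_f.get(sf[0], 0) + x[sf]
    sum_s[sf[1]] = sum_s.get(sf[1], 0) + x[sf]
    sum_sf += x[sf]

# RCA: the researcher's share of output in a field divided by the field's global share
for sf in x:
    rca[sf] = (x[sf]/sum_f[sf[0]])/(sum_s[sf[1]]/sum_sf)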
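
As a sanity check on the ratio computed above, the following minimal sketch applies the same formula to a made-up output table. The function name `compute_rca`, the researcher ids, the field names, and the counts in `x_toy` are all hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict

def compute_rca(x):
    """Return the RCA of every (entity, field) key in x, a dict of output counts."""
    entity_total = defaultdict(float)  # entity -> output summed over fields
    field_total = defaultdict(float)   # field -> output summed over entities
    grand_total = 0.0
    for (entity, field), count in x.items():
        entity_total[entity] += count
        field_total[field] += count
        grand_total += count
    # Share of the entity's output in the field, relative to the field's global share
    return {
        (entity, field): (count / entity_total[entity]) / (field_total[field] / grand_total)
        for (entity, field), count in x.items()
    }

# Hypothetical toy data: two researchers, two fields
x_toy = {(1, "physics"): 3, (1, "biology"): 1, (2, "biology"): 4}
rca_toy = compute_rca(x_toy)
```

With these toy numbers, researcher 1 has RCA 2.0 in physics (developed) and 0.4 in biology (nascent), while researcher 2 has RCA 1.6 in biology.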

We can also compute the RCA of institutions.

In [None]:
inst = bio[["id_pesquisador", "nome_instituicao"]].set_index("id_pesquisador").to_dict()["nome_instituicao"]
inst = {int(k): v for k, v in inst.items()}

x_inst = dict()       # (institution, field) -> aggregated output
rca_inst = dict()
sum_f_inst = dict()   # institution -> total output over all fields

# Aggregate each researcher's output into their institution
for sf in x:
    ins = inst[sf[0]]
    x_inst[(ins, sf[1])] = x_inst.get((ins, sf[1]), 0) + x[sf]
    sum_f_inst[ins] = sum_f_inst.get(ins, 0) + x[sf]

# Field totals (sum_s, sum_sf) are unchanged by the aggregation
for sf in x_inst:
    rca_inst[sf] = (x_inst[sf]/sum_f_inst[sf[0]])/(sum_s[sf[1]]/sum_sf)

The same computation applies to municipalities, states, and regions.

In [None]:
id_cep = bio[["id_pesquisador", "cep_instituicao"]].set_index("id_pesquisador").to_dict()["cep_instituicao"]
id_estado = {int(k): cep_estado(v) for k, v in id_cep.items()}

x_est = dict()       # (state, field) -> aggregated output
rca_est = dict()
sum_f_est = dict()   # state -> total output over all fields

# Aggregate each researcher's output into the state of their institution
for sf in x:
    est = id_estado[sf[0]]
    x_est[(est, sf[1])] = x_est.get((est, sf[1]), 0) + x[sf]
    sum_f_est[est] = sum_f_est.get(est, 0) + x[sf]

# Field totals (sum_s, sum_sf) are unchanged by the aggregation
for sf in x_est:
    rca_est[sf] = (x_est[sf]/sum_f_est[sf[0]])/(sum_s[sf[1]]/sum_sf)

In [None]:
# Inspect a slice of the institutions whose CEP could not be mapped to a state
nome_cep = bio[["nome_instituicao", "cep_instituicao"]].set_index("nome_instituicao").to_dict()["cep_instituicao"]
nome_estado = sorted([k for k, v in nome_cep.items() if cep_estado(v) == "DESCONHECIDO"])
nome_estado[1010:1020]

### Visualizing the RCA

Fields in which different institutions in the country are most active. For each category, we measure the fraction of its fields in which an institution has RCA > 1.

In [None]:
dict_area = areas[["Field", "Classification"]].set_index("Field").to_dict()["Classification"]
size = Counter(areas["Classification"].tolist())
macro = dict()

for sf in rca_inst:
    if rca_inst[sf] > 1:
        if (sf[0], dict_area[sf[1]]) in macro:
            macro[(sf[0], dict_area[sf[1]])] += 1 / size[dict_area[sf[1]]]
        else:
            macro[(sf[0], dict_area[sf[1]])] = 1 / size[dict_area[sf[1]]] 

In [None]:
insts = [
    "universidade federal de minas gerais",
    "universidade federal de lavras",
    "universidade de sao paulo",
    "universidade estadual de campinas",
    "universidade federal da paraiba",
    "universidade federal rural de pernambuco",
    "universidade federal do rio de janeiro",
    "pontificia universidade catolica de minas gerais",
    "petrobras",
    "ministerio da fazenda"
]

In [None]:
unique = areas["Classification"].unique()[1:]

plt.rcParams["figure.figsize"] = (10,50)
# matplotlib.cm.get_cmap is deprecated (removed in matplotlib 3.9); plt.get_cmap takes the same arguments
cm = plt.get_cmap('gnuplot', len(unique))
color = cm(np.linspace(0, 1, len(unique)))
fig = plt.figure()

for i, uni in enumerate(insts):
    # Fraction of fields with RCA > 1 in each classification; 0 when the pair is absent
    areas_list = [macro.get((uni, a), 0) for a in unique]

    sub = fig.add_subplot(len(insts), 1, i+1)
    plt.bar(unique, areas_list, color=color, width=1.0)
    plt.xticks(rotation=90)
    plt.title(uni)
    plt.ylabel("Fraction of specialized fields")
    plt.xlabel("Field classification")
    # Show the category tick labels only on the last panel
    if i < len(insts) - 1:
        sub.set_xticks([])

fig.tight_layout()
plt.show()

We can also view the Research Space network colored by the RCA of these institutions.

In [None]:
uni = "universidade federal de lavras"
vals = list()
for f in of:
    if (uni, f) in rca_inst:
        vals.append(rca_inst[(uni, f)])
    else:
        vals.append(0)

pos = show_graph(k, vals, pos)

In [None]:
def rca_disc(val):
    if val == 0:
        return "Inactive"
    if val < 0.5:
        return "Nascent"
    if val < 1:
        return "Intermediate"
    else:
        return "Developed"

# Map each field to its discrete RCA state (avoids reusing the name x, which holds the output table)
dict_area = {f: rca_disc(v) for f, v in zip(of, vals)}
unique = ["Inactive", "Nascent", "Intermediate", "Developed"]
subs = {u: v for v,u in enumerate(sorted(unique))}
values = [subs[dict_area[node]] for node in of]

pos = show_graph(k, values, pos, subs)

In [None]:
# "31" is presumably the IBGE state code for Minas Gerais, matching CD_GEOCUF in the shapefile
estado = "31"
vals = list()
for f in of:
    if (estado, f) in rca_est:
        vals.append(rca_est[(estado, f)])
    else:
        vals.append(0)

# rca_disc is reused from the previous cell
dict_area = {f: rca_disc(v) for f, v in zip(of, vals)}
unique = ["Inactive", "Nascent", "Intermediate", "Developed"]
subs = {u: v for v,u in enumerate(sorted(unique))}
values = [subs[dict_area[node]] for node in of]

pos = show_graph(k, values, pos, subs)

In [None]:
gdf = gpd.read_file("../dataset/br_unidades_da_federacao/BRUFE250GC_SIR.shp")
gdf.head()

In [None]:
dict_area = areas[["Field", "Classification"]].set_index("Field").to_dict()["Classification"]
size = Counter(areas["Classification"].tolist())
macro = dict()

for sf in rca_est:
    if rca_est[sf] > 1:
        if (sf[0], dict_area[sf[1]]) in macro:
            macro[(sf[0], dict_area[sf[1]])] += 1 / size[dict_area[sf[1]]]
        else:
            macro[(sf[0], dict_area[sf[1]])] = 1 / size[dict_area[sf[1]]] 

In [None]:
unique = areas["Classification"].unique()[1:]
estados = gdf["CD_GEOCUF"].to_list()

plt.rcParams["figure.figsize"] = (60,50)
fig = plt.figure()

for i in range(len(unique)):
    a = unique[i]
    estados_list = list()
    for est in estados:
        if (est, a) in macro:
            estados_list.append(macro[(est, a)])
        else:
            estados_list.append(0)

    gdf["colors"] = estados_list
    ax = fig.add_subplot(6,5,i+1)
    ax.axis('off')
    gdf.plot(column="colors", ax=ax, cmap='Blues')
    plt.title(a)
    
fig.tight_layout()
plt.show()

Urban areas

"First, we infer the country in which each affiliation-city pair is located; second, for each country, we compute a geographic distance matrix (using Vicenty’s formula) connecting each pair of cities; and lastly we use hierarchical clustering to define the different urban areas with the additional constraint that the maximum distance within each cluster has to be less than 50 km."

### Knowledge density

"We then predict the probability that individual, organization, or country, $s$ will increase its level of development in field $f$ by creating an indicator of the fraction of fields that are connected to field $f$ and that are already developed by $s$."

In [None]:
s = set()
for sf in x:
    s.add(sf[0])

*evaluating transitions to a developed state*

In [None]:
Ud = {sf: 1 for sf in rca if rca[sf] >= 1 and sf[1] in of}

*evaluating the transition from an inactive to an active state*

In [None]:
# Active means any output in the field, i.e. RCA strictly above zero
Ua = {sf: 1 for sf in rca if rca[sf] > 0 and sf[1] in of}

The density function
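
Written out, the density can be expressed as follows, where $\phi_{ff'}$ is the research-space link weight between fields $f$ and $f'$, and $U_{sf'} = 1$ when $s$ already holds the target state in field $f'$ (0 otherwise). The symbol names are a notational assumption; in the code, `k` appears to play the role of $\phi$ and the dictionaries `Ud` and `Ua` the role of $U$.

$$\omega_{sf} = \frac{\sum_{f'} U_{sf'}\,\phi_{ff'}}{\sum_{f'} \phi_{ff'}}$$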

In [None]:
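def Omega(U):
    """Knowledge density of every entity in s for each field of `of` not yet in U.

    Returns a dict: entity -> list of (density, field_index) with density > 0.
    """
    omega = dict()
    for idx in range(len(of)):
        print((idx+1)/len(of))  # progress indicator
        f = of[idx]
        # Total link weight of field f in the research space matrix k
        norm = sum(k[idx, idx2] for idx2 in range(len(of)))

        for p in s:
            # Only score fields the entity has not reached yet
            if (p,f) in U:
                continue

            if p not in omega:
                omega[p] = list()

            # Link weight from f to the fields the entity already holds in U
            num = sum(k[idx, idx2] for idx2 in range(len(of)) if (p,of[idx2]) in U)
            div = np.round(num/norm, 5)

            if div > 0.0:
                omega[p].append((div, idx))

    return omega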

### Prediction

"Finally, to predict a transition of entity $s$ in field $f$ between a pair of states (i.e. from inactive to active), we look at all fields that are in the initial state (i.e. inactive) and sort them by density ($\omega_{sf}$)."

In [None]:
o = Omega(Ua)
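
The sorting step can be sketched as follows, assuming the input has the same shape as the dictionary returned by `Omega`: entity -> list of `(density, field_index)` pairs. The helper `rank_candidates`, its `top_n` parameter, and the toy values are hypothetical.

```python
def rank_candidates(omega, fields, top_n=3):
    """Return, for each entity, its top_n candidate fields by decreasing density."""
    ranked = {}
    for entity, scored in omega.items():
        # Higher density means more already-held fields are linked to the candidate
        best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]
        ranked[entity] = [(fields[idx], density) for density, idx in best]
    return ranked

# Hypothetical densities for one researcher, in (density, field_index) format
omega_toy = {42: [(0.10, 0), (0.75, 2), (0.30, 1)]}
fields_toy = ["physics", "biology", "chemistry"]
top = rank_candidates(omega_toy, fields_toy, top_n=2)
```

On the notebook's data the equivalent call would be `rank_candidates(o, of)`; the fields at the top of each list are the predicted next transitions.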

### Correlation

Correlation analysis between the average knowledge density, aggregated at the state or municipal level, and a selection of development indicators.
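
One way this analysis could be set up is sketched below. The source does not say which aggregation or correlation coefficient is intended, so averaging all density values per state and using the Pearson coefficient are both assumptions. The helper names, the state labels, and the indicator values are hypothetical placeholders, not real data.

```python
import math

def mean_density_by_state(omega, entity_state):
    """Average every density value of the entities located in each state."""
    totals, counts = {}, {}
    for entity, scored in omega.items():
        state = entity_state[entity]
        for density, _field_idx in scored:
            totals[state] = totals.get(state, 0.0) + density
            counts[state] = counts.get(state, 0) + 1
    return {state: totals[state] / counts[state] for state in totals}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (std_x * std_y)

# Hypothetical inputs: densities per researcher, researcher -> state, state -> indicator
omega_toy = {1: [(0.2, 0), (0.4, 1)], 2: [(0.6, 0)], 3: [(0.8, 1)]}
entity_state_toy = {1: "A", 2: "B", 3: "C"}
indicator_toy = {"A": 1.0, "B": 2.0, "C": 3.0}

avg = mean_density_by_state(omega_toy, entity_state_toy)
states = sorted(avg)
r = pearson([avg[st] for st in states], [indicator_toy[st] for st in states])
```

A rank-based coefficient such as Spearman's may be a better fit if the indicators are heavily skewed.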