# instructions:

This Jupyter Notebook searches the SciELO network for keywords or phrases related to a research topic. For example, you can search for 'niobium', 'aerospace', or 'soybean biodiesel'.

Note that if you search for a phrase with multiple words, only clusters that contain all of that phrase's constituent words will show up in the results. Please do not include any punctuation in the search term.
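If a search term was pasted from elsewhere and may contain punctuation, it can be tidied before it is assigned to 'query'. The following is only a minimal sketch using the Python standard library; the helper name `clean_search_term` is hypothetical and is not part of this notebook.

```python
import string

def clean_search_term(term):
    """Lowercase a search term and replace ASCII punctuation with spaces."""
    # Map every punctuation character to a space so hyphenated terms
    # are split into separate words instead of being glued together.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() followed by join() collapses any run of spaces into one.
    return " ".join(term.translate(table).lower().split())

cleaned = clean_search_term("Soybean, biodiesel!")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it is a design choice: each part of a hyphenated term then becomes its own search word, and every one of them must appear in a cluster for it to match.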

You can change the following parameters:
- the search term by setting the variable 'query'
- the threshold for the minimum number of papers in a cluster that must contain each keyword for the cluster to be included in the search results, by setting the variable 'mycount'
- the ID of the specific cluster you'd like to read through for detailed bibliographic information

Don't change any of the other code or it won't work!

After you have adjusted the parameters you want, go to the "Cell" menu at the top left and click "Run All" to see results.

In [165]:
#don't change anything here
import json
import math
import copy
import string
import re
import pandas as pd
import networkx as nx
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import nltk
from nltk.stem import PorterStemmer

pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 500)

ps = PorterStemmer()
G = nx.read_gpickle('network_clustered3_full.pkl')
groups = {}
for node in G.nodes():
    if type(node) == int:
        groups.setdefault(G.node[node]['group'],[]).extend(list(set(G.node[node]['TA_words'])))
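The loop above pools the words of every paper into one list per cluster. Because each paper's words pass through `set()` first, a word is added at most once per paper, so counting a word in the pooled list gives the number of papers that contain it. A small sketch with made-up records (the `toy_papers` data is hypothetical and stands in for the graph nodes):

```python
# Hypothetical records standing in for graph nodes: each one has a
# cluster ID ('group') and the word list of one paper ('TA_words').
toy_papers = [
    {"group": 7, "TA_words": ["soybean", "oil", "soybean"]},
    {"group": 7, "TA_words": ["soybean", "biodiesel"]},
    {"group": 9, "TA_words": ["diet", "diet", "diet"]},
]

toy_groups = {}
for paper in toy_papers:
    # set() keeps each word once per paper, so repeats inside a paper are ignored
    toy_groups.setdefault(paper["group"], []).extend(set(paper["TA_words"]))

# Counting a word in the pooled list gives the number of papers containing it
soy_papers = toy_groups[7].count("soybean")
diet_papers = toy_groups[9].count("diet")
print(soy_papers, diet_papers)
```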

### replace the text inside the quotes in the next cell to set the keyword you want to search for

In [143]:
query = nltk.word_tokenize('biodiesel soybean')

### this is the threshold for the minimum number of papers in a cluster that should contain each keyword. Set to 1 for the broadest possible search and increase it if you want more specific results.

In [144]:
mycount = 2
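To illustrate how the threshold interacts with a multi-word query, here is a simplified sketch on made-up data. The names `toy_groups`, `toy_query`, and `search` are hypothetical; the sketch assumes, as in the search cell of this notebook, that a cluster is scored by the paper count of its rarest query word.

```python
# Hypothetical pooled word lists: cluster ID -> one entry per (paper, word)
toy_groups = {
    10: ["soybean", "soybean", "soybean", "biodiesel", "biodiesel"],
    11: ["soybean", "biodiesel"],
    12: ["soybean", "soybean"],
}
toy_query = ["biodiesel", "soybean"]

def search(groups, query, min_count):
    """Return {cluster ID: hits}, where hits is the paper count of the rarest query word."""
    hits = {}
    for cid, words in groups.items():
        rarest = min(words.count(w) for w in query)
        # rarest > 0 means every query word occurs at least once in the cluster
        if rarest > 0 and rarest >= min_count:
            hits[cid] = rarest
    return hits

broad = search(toy_groups, toy_query, 1)
narrow = search(toy_groups, toy_query, 2)
print(broad, narrow)
```

Cluster 12 never matches because it lacks 'biodiesel', and cluster 11 drops out once the threshold is raised to 2.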

In [150]:
#don't change anything here
# lowercase and stem each query word so it matches the stemmed words stored in the clusters
query = [ps.stem(i.lower()) for i in query]

cluster_dict = {}
cluster_list = []
count_list = []
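for i in groups:
    allthere = 1   # stays above 0 only if every query word occurs in cluster i
    count = 9999   # will hold the paper count of the rarest query word
    for j in query:
        temp = groups[i].count(j)  # number of papers in cluster i containing word j
        allthere = allthere * temp
        if temp < count:
            count = temp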
    
    if count >= mycount and allthere > 0:
        cluster_dict[i] = {}
        cluster_dict[i]['nodes'] = []
        cluster_dict[i]['count'] = count
        count_list.append(cluster_dict[i]['count'])
        cluster_list.append(i)
        
for node in G.nodes():
    if G.node[node]['group'] in cluster_list:
        temp = copy.deepcopy(G.node[node])
        temp.pop('TA', None)
        temp.pop('TA_words', None)
        temp.pop('viz', None)
        cluster_dict[G.node[node]['group']]['nodes'].append(temp)

### in the next cell there's a table showing the cluster ID, the number of papers with hits on the rarest keyword, the total number of papers in the cluster, and the most common keywords from that cluster

In [166]:
cluster_results = pd.DataFrame({'ID': cluster_list, 'hits': count_list})
cluster_results = cluster_results.set_index('ID')
cluster_results['# papers'] = 0
cluster_results['keywords'] = ''


for i in range(1,max(groups)+1):
    if i in cluster_results.index.values:
        cluster_results.loc[i, '# papers'] = len(cluster_dict[i]['nodes'])
        cluster_results.loc[i, 'keywords'] = ' '.join(t[0] for t in Counter(groups[i]).most_common(10))

cluster_results

ID,hits,# papers,keywords
759,2,16,oil acid fatti oxid extract lipid stabil chromatographi profil chemic
1400,2,11,biodiesel oil ester acid liquid improv third phormidium oper microalga
1401,2,21,diesel fuel biodiesel engin emiss consumpt reduct specif agricultur power
1410,2,12,lipas immobil oil reaction enzym ratio respons convers solvent optim
1412,2,6,reaction lipas ester transesterif enzymat catalyz acid oil ethanol ratio
1557,2,14,enzym ferment amylas cultur highest extract rang lipid degc industri
1741,2,17,degrad matter fiber rumin crude protein deterg neutral acid nutrit
1776,5,40,diet matter meal weight anim inclus fed feed intak soybean
1828,4,24,glycerin diet inclus crude matter complet without corn fed replic
1904,2,13,synthesi nanoparticl prepar reaction size synthes ray copper catalyt diffract
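The 'keywords' column is built with `Counter.most_common`, which ranks the pooled words of a cluster by how many papers contain them. A minimal sketch on a made-up word list (`toy_words` is hypothetical):

```python
from collections import Counter

# Hypothetical pooled word list for one cluster (one entry per paper and word)
toy_words = ["diet", "soybean", "diet", "meal", "diet", "soybean"]

# Keep the two most frequent words and join them into one string,
# the same way the table joins the ten most frequent ones.
top_two = " ".join(word for word, _ in Counter(toy_words).most_common(2))
print(top_two)
```

Words with equal counts are returned in the order they were first encountered, so the tail of a keyword list can depend on the order in which papers were read.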


### change the number in the square brackets to the cluster ID you want to examine

In [148]:
cluster_dict[1776]['nodes']

[{'abstract': 'Foram avaliados o consumo, as digestibilidades totais e ruminais e as taxas de digestao (kd) e de passagem (kp) ruminal dos nutrientes de dietas constituidas de cana-de-acucar in natura e diferentes niveis de concentrado. Utilizaram-se cinco bovinos mesticos, fistulados no rumen, com peso corporal inicial de 300+-50kg, distribuidos em delineamento em quadrado latino 5x5. As dietas experimentais foram constituidas de: 1) 100% cana-de-acucar in natura (CA); 2) 80% de CA + 20% de concentrado (C); 3) 60% de CA + 40% de C; 4) 40% de CA + 60% de C; e 5) 20% de CA + 80% de C. Os dados foram analisados utilizando-se o procedimento MIXED do SAS (versao 9.1), bem como analise de regressao e 5% como nivel critico de probabilidade para o erro tipo I. O consumo de materia seca (MS), expresso em kg/dia ou g/kg de peso corporal foi influenciado (P<0,05) pelos niveis de concentrado. Os demais consumos calculados em kg/dia tambem foram influenciados (P<0,05) pelos niveis de concentrado, 
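The raw list of dictionaries can be hard to read for large clusters. The sketch below shows one possible way to print a compact summary; it is not part of the notebook. The `summarize` helper and the `toy_nodes` records are hypothetical, and only the 'abstract' and 'authors' keys that appear in this notebook are assumed.

```python
import textwrap

# Hypothetical records shaped like the entries of cluster_dict[...]['nodes']
toy_nodes = [
    {"abstract": "Intake and total digestibility of nutrients were evaluated "
                 "for diets with increasing concentrate levels.",
     "authors": ["A. Author", "B. Author"]},
    {"abstract": "Short abstract.", "authors": []},
]

def summarize(nodes, width=40):
    """Return one line per paper: its authors and a shortened abstract."""
    lines = []
    for n, node in enumerate(nodes, start=1):
        authors = "; ".join(node.get("authors", [])) or "(no authors listed)"
        # shorten() cuts at a word boundary and appends the placeholder
        abstract = textwrap.shorten(node.get("abstract", ""), width=width,
                                    placeholder="...")
        lines.append(f"{n}. {authors}: {abstract}")
    return lines

summary = summarize(toy_nodes)
for line in summary:
    print(line)
```

In the notebook itself this could be called as `summarize(cluster_dict[1776]['nodes'])`, assuming every node carries those two keys.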

In [91]:
#don't change anything here
authors = []
institutions = []

for i_id, i_item in cluster_dict.items():
    for j in i_item['nodes']:
        authors.extend(j['authors'])
        institutions.extend(j['affiliations'])

### this lists the most common authors from the search

In [92]:
#don't change anything here
Counter(authors).most_common(20)

[('Joao Almir Oliveira', 7),
 ('Matthieu Tubino', 5),
 ('Renato Mendes Guimaraes', 5),
 ('Javier Solis Estrada', 4),
 ('Jose Fernando Schlosser', 4),
 ('Marcelo Silveira de Farias', 4),
 ('Jose Neuman Miranda Neiva', 4),
 ('J.M.B. Ezequiel', 4),
 ('A.C. Homem Junior', 4),
 ('Tathiana Elisa Masetto', 4),
 ('Stefania Vilas Boas Coelho', 4),
 ('Osvaldo Resende', 4),
 ('Everson Reis Carvalho', 4),
 ('Willian L. G. da Silva', 3),
 ('Bruna Regina Sombrio', 3),
 ('Andrea Lima dos Santos Schneider', 3),
 ('Ana Paula Testa Pezzin', 3),
 ('Claudete Regina Alcalde', 3),
 ('Geraldo Tadeu dos Santos', 3),
 ('Fabiano Ferreira da Silva', 3)]

### this lists the most common institutions

In [93]:
#don't change anything here
Counter(institutions).most_common(10)

[('\n            ', 13),
 (', ', 5),
 ('; ', 4),
 ('Departamento de Agricultura, UFLA, Caixa Postal 3037, 37200-000 - Lavras, MG, Brasil.',
  4),
 ('Universidade Tecnologica Federal do Parana, Dois\n               Vizinhos, Parana, Brasil. ',
  4),
 ('Departamento de Engenharia Rural (DER), Centro de Ciencias Rurais (CCR), Universidade Federal de Santa Maria (UFSM), Santa Maria, RS, Brasil.',
  3),
 ('.', 3),
 ('Universidade Federal da Grande Dourados, Faculdade de Ciencias Agrarias, Caixa Postal 533, 79804-970 - Dourados, MS, Brasil. ',
  3),
 ('Instituto de Quimica, Universidade Estadual de Campinas, CP 6154, 13083-970 Campinas-SP, Brazil',
  2),
 ('Instituto Agronomico do Parana/ Londrina - PR, Brasil', 2)]
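Several of the top entries above are whitespace or bare punctuation, and some names contain embedded line breaks, so the same institution could be counted under different spellings. One possible cleanup before counting is sketched below on made-up strings; `clean_affiliation`, `is_informative`, and `toy_affiliations` are hypothetical names, not part of this notebook.

```python
from collections import Counter

# Hypothetical affiliation strings with the kinds of noise seen in the output
toy_affiliations = [
    "\n            ",
    ", ",
    "Universidade X, Cidade,\n   Brasil. ",
    "Universidade X, Cidade, Brasil.",
    ".",
]

def clean_affiliation(text):
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return " ".join(text.split())

def is_informative(text):
    """True if the string contains at least one letter or digit."""
    return any(ch.isalnum() for ch in text)

cleaned = [clean_affiliation(a) for a in toy_affiliations]
counts = Counter(a for a in cleaned if is_informative(a))
print(counts.most_common(10))
```

This only normalizes whitespace and drops empty entries. It does not merge different spellings of the same institution, which would need a separate matching step.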