# Combining SJR and Lattes

Checks whether the SCImagoJR data cover all the journals and conferences that appear in the Lattes platform profile data.

In [1]:
import unidecode
import numpy as np
import pandas as pd
import re

SCImagoJR data

In [2]:
sjr = pd.read_csv("../dataset/SJR/scimagojr completo.csv", sep=";", header=0, usecols=["Title","Categories"])
sjr.head(10)

Unnamed: 0,Title,Categories
0,CA - A Cancer Journal for Clinicians,Hematology (Q1); Oncology (Q1)
1,MMWR. Recommendations and reports : Morbidity ...,Epidemiology (Q1); Health Information Manageme...
2,Nature Reviews Materials,"Biomaterials (Q1); Electronic, Optical and Mag..."
3,Quarterly Journal of Economics,Economics and Econometrics (Q1)
4,Nature Reviews Genetics,Genetics (Q1); Genetics (clinical) (Q1); Molec...
5,Nature Reviews Molecular Cell Biology,Cell Biology (Q1); Molecular Biology (Q1)
6,Nature Reviews Cancer,Cancer Research (Q1); Oncology (Q1)
7,National vital statistics reports : from the C...,Life-span and Life-course Studies (Q1)
8,Nature Reviews Immunology,Immunology (Q1); Immunology and Allergy (Q1); ...
9,Cell,"Biochemistry, Genetics and Molecular Biology (..."


Lattes platform data

In [3]:
artigos = pd.read_csv("../dataset/lattes/artigos.csv", sep=";sep;", usecols=["id_pesquisador", "ano_publicacao", "journal_ou_conferencia", "numero_pesquisadores"], header=0, engine='python')
artigos = artigos.dropna()
artigos.head(10)

Unnamed: 0,id_pesquisador,ano_publicacao,journal_ou_conferencia,numero_pesquisadores
0,99900328713566,2005,Sitientibus. Série Ciências Biológicas,3.0
1,99900328713566,2007,Mycotaxon,3.0
2,99900328713566,2007,Mycotaxon,4.0
3,99900328713566,2007,Acta Botanica Brasilica,3.0
4,99900328713566,2009,Acta Botanica Brasílica (Impresso),2.0
5,99900328713566,2009,Acta Botanica Brasílica (Impresso),2.0
6,99900328713566,2008,Mycotaxon,4.0
7,99900328713566,2008,Revista Brasileira de Botânica,4.0
8,99900328713566,2008,Mycotaxon,4.0
9,99900328713566,2008,Revista Brasileira de Botânica,3.0


We will need two functions:

- One to normalize the strings
- Another to split a string on marker characters

In [4]:
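def ops(s):
    # Drop parenthesised qualifiers such as "(Impresso)"
    s = re.sub(r"\s?\(.*\)", "", s)
    # Transliterate accented characters to plain ASCII
    s = unidecode.unidecode(s)
    s = s.lower()
    # Collapse runs of whitespace into a single space
    s = re.sub(r"\s+", " ", s)
    s = s.strip()

    return s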
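
To make the effect of the normalization concrete, here is a minimal sketch that applies the same steps to two titles from the Lattes sample above. It is an approximation, not the notebook's function: `normalize_title` is a hypothetical name, and it swaps `unidecode` for the standard library's `unicodedata` so that it runs without extra packages. The two agree on accented Latin letters, but may differ on other scripts and symbols.

```python
import re
import unicodedata


def normalize_title(s):
    """Stdlib-only approximation of ops(); the name is hypothetical."""
    # Drop parenthesised qualifiers such as "(Impresso)"
    s = re.sub(r"\s?\(.*\)", "", s)
    # Decompose accented characters, then drop the combining marks
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    # Lower-case, collapse whitespace runs and trim the ends
    s = re.sub(r"\s+", " ", s.lower()).strip()
    return s


print(normalize_title("Acta Botanica Brasílica (Impresso)"))  # acta botanica brasilica
print(normalize_title("Revista Brasileira de Botânica"))      # revista brasileira de botanica
```

After normalization, "Acta Botanica Brasilica" and "Acta Botanica Brasílica (Impresso)" reduce to the same string, which is what allows the later exact comparison to treat them as one journal.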

In [5]:
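def splt(s):
    # Marker characters that separate the parts of a title
    chars = [".", ";", ":", "/", "-"]
    subs = [s]

    # Split every current piece on each marker in turn
    for c in chars:
        aux = list()
        for part in subs:
            aux += part.split(c)
        subs = aux

    return subs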
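
As an illustration of what the splitting step produces, the sketch below does the same job with a single regular expression. `split_on_markers` is a hypothetical name. Splitting on a character class should give the same pieces, in the same order, as splitting on each marker in turn. Note that the pieces keep their surrounding spaces until they are normalized.

```python
import re


def split_on_markers(s):
    """Split a title on any of the marker characters . ; : / -"""
    # The hyphen comes last in the character class so it is read literally
    return re.split(r"[.;:/-]", s)


print(split_on_markers("CA - A Cancer Journal for Clinicians"))
# ['CA ', ' A Cancer Journal for Clinicians']
print(split_on_markers("Sitientibus. Série Ciências Biológicas"))
# ['Sitientibus', ' Série Ciências Biológicas']
```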

Building the lists of journal titles from SJR and from Lattes

In [6]:
sjr_catg = sjr["Categories"].to_list()
sjr_titl = sjr["Title"].to_list()
sjr_list = [[ops(x) for x in splt(string)] for string in sjr_titl]
    
lattes_list = [ops(x) for x in artigos["journal_ou_conferencia"].to_list()]

Assigning categories to the articles whose journal or conference is found in SJR

In [7]:
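# Cache of titles already matched: normalized title -> SJR categories
buffer = dict()

with open("../dataset/lattes_categories.csv", "w") as arq:
    arq.write("pesq;sep;ano;sep;catg;sep;num\n")

    for i in range(len(lattes_list)):
        # Print the fraction of rows processed every 10,000 rows
        if i%10000 == 0:
            print(i/len(lattes_list))

        title = lattes_list[i]
        lattes = artigos.iloc[i]

        # Use .iloc for positional access; plain integer keys on a
        # string-labelled Series are deprecated in recent pandas
        if lattes.iloc[0] == "None":
            continue

        if title in buffer:
            arq.write("{};sep;{};sep;{};sep;{}\n".format(lattes.iloc[0], lattes.iloc[1], buffer[title], lattes.iloc[3]))
            continue

        # Try each fragment of the title first, then the full title
        strings = splt(title) + [title]

        for s in strings:
            for j in range(len(sjr_list)):
                if s in sjr_list[j]:
                    arq.write("{};sep;{};sep;{};sep;{}\n".format(lattes.iloc[0], lattes.iloc[1], sjr_catg[j], lattes.iloc[3]))
                    buffer[title] = sjr_catg[j]
                    break

            else:
                # No SJR entry contains this fragment; try the next one
                continue

            # A match was written; stop checking this title
            break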

0.0
0.002157444239774901
0.004314888479549802
...
0.9967392387760042
0.9988966830157792
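
The matching loop above scans the whole SJR list for every uncached title, which is slow for a dataset this large. One possible alternative, sketched here under assumptions rather than taken from the notebook, is to build a dictionary from each SJR title fragment to its row once, and then look titles up directly. The names `build_index` and `find_row` are hypothetical, and the toy data are made up for illustration. The same sketch also computes the fraction of titles covered, which is the question this notebook sets out to check.

```python
def build_index(sjr_fragments):
    """Map each title fragment to the first SJR row that contains it."""
    index = {}
    for j, fragments in enumerate(sjr_fragments):
        for frag in fragments:
            # setdefault keeps the lowest row number, matching the
            # first hit of a top-to-bottom linear scan
            index.setdefault(frag, j)
    return index


def find_row(candidates, index):
    """Return the SJR row of the first candidate found, or None."""
    for cand in candidates:
        if cand in index:
            return index[cand]
    return None


# Toy data, made up for illustration and assumed already normalized
toy_sjr = [["journal a"], ["journal b", "series x"], ["journal c"]]
toy_lattes = ["journal b", "journal c", "journal z"]

index = build_index(toy_sjr)
rows = [find_row([title], index) for title in toy_lattes]

# Fraction of Lattes titles that found an SJR row
coverage = sum(r is not None for r in rows) / len(rows)

print(rows)      # [1, 2, None]
print(coverage)  # 2 of the 3 toy titles are covered
```

In the notebook's terms, `toy_sjr` plays the role of `sjr_list`, and the candidates passed to `find_row` would be `splt(title) + [title]`. The row returned would then index `sjr_catg`.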
