# Analysis and processing of ticket #9 of `scieloorg/doi_request`

For more information about the problem and its motivation, see:
https://github.com/scieloorg/doi_request/issues/9

In [1]:
import re
import pandas as pd
import numpy as np

CSV file generated with the data of the documents in the SciELO Brazil collection that have no record in Crossref.

In [2]:
doi_files = pd.read_csv('./SciELO_Brazil_DOI.csv', delimiter=";", low_memory=False)
doi_files.head()

Unnamed: 0,collection,issn_scielo,journal,title,doc_type,pid,doi,doi_prefix,version,pub_year,has_doi,no_doi,doi_found,doi_not_found
0,scl,0044-5967,Acta Amazonica,O uso do solo na Amazônia,editorial,S0044-59671973000100003,10.1590/1809-43921973031003,10.159,xml,1973,1,0,0,1
1,scl,0044-5967,Acta Amazonica,An evolutionary and ecological perspective of ...,undefined,S0044-59671973000100005,10.1590/1809-43921973031005,10.159,xml,1973,1,0,0,1
2,scl,0044-5967,Acta Amazonica,"Anatomia de Anacardium spruceanum Bth, Ex Engl...",undefined,S0044-59671973000100039,10.1590/1809-43921973031039,10.159,xml,1973,1,0,0,1
3,scl,0044-5967,Acta Amazonica,The effect of slash and burn agriculture on pl...,undefined,S0044-59671973000100055,10.1590/1809-43921973031055,10.159,xml,1973,1,0,0,1
4,scl,0044-5967,Acta Amazonica,The chemical composition of Amazonian plants (),undefined,S0044-59671973000100063,10.1590/1809-43921973031063,10.159,xml,1973,1,0,0,1


In [3]:
doi_files.shape

(1626, 14)

#### Check whether the DOI is still not registered at `https://www.doi.org`

To check whether the DOIs are still unavailable, the script `generate_doi_not_found.py` was written. It looks up every DOI in the list on `www.doi.org` itself.

This script must be run in a virtualenv with `python>=3.7` in which the dependencies have been installed with the command `pip install -r requirements.generate_doi_not_found.txt`.

```shell
$ python generate_doi_not_found.py
```
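
The script itself is not reproduced in this notebook. As a rough sketch of the kind of lookup it performs, assuming each DOI is resolved under `https://www.doi.org/` and an HTTP error means "not found", the check could look like this. The function name `doi_is_registered` and its `opener` parameter are hypothetical, added so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def doi_is_registered(doi, opener=urllib.request.urlopen):
    """Return True if the DOI resolves, False if the resolver answers with an HTTP error.

    Hypothetical helper: `opener` defaults to urllib.request.urlopen and can be
    replaced by a stub when testing offline.
    """
    url = "https://www.doi.org/{0}".format(doi)
    try:
        opener(urllib.request.Request(url))
    except urllib.error.HTTPError:
        # The resolver answered with an error status (for example 404 Not Found).
        return False
    return True
```

Other failures, such as a network outage, are deliberately not caught here, so that a connectivity problem is not mistaken for an unregistered DOI.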

When this processing finishes we will have the file `df_doi_not_found.csv` and can continue the analysis.
At this stage we need to check in *SciELO - DOI Manager* whether each PID has already been processed. For that, the script `consult_doi_request_and_extract_data.py` was written. It queries the application's database and builds the files `df_doi_not_processed.csv` and `df_doi_processed.csv`, with the data of the DOIs not yet processed and already processed, respectively.

```shell
$ python consult_doi_request_and_extract_data.py \
    --host HOSTDB \
    --port 5432 \
    --user doi_user \
    --password XXXXXXX \
    --database doi_manager
```
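
The split into the two output files can be pictured as a membership test on the PID. The sketch below only illustrates that idea and is not the script's actual code: the function name `split_by_processed` is hypothetical, and the list of processed PIDs is passed in directly, whereas the real script obtains it from the application's database.

```python
def split_by_processed(rows, processed_pids):
    """Split report rows into (processed, not_processed) by PID membership.

    `rows` is an iterable of dicts with a "pid" key; `processed_pids` is the
    collection of PIDs already known to the DOI Manager (assumed input).
    """
    known = set(processed_pids)
    processed, not_processed = [], []
    for row in rows:
        (processed if row["pid"] in known else not_processed).append(row)
    return processed, not_processed


# Two rows taken from the tables in this notebook.
rows = [
    {"pid": "S0044-59671973000100003", "doi": "10.1590/1809-43921973031003"},
    {"pid": "S0001-37652019000400516", "doi": "10.1590/0001-3765201920180614"},
]
done, pending = split_by_processed(rows, ["S0001-37652019000400516"])
print(len(done), len(pending))
```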

In [4]:
# Load the data from the generated files to make the process easier
df_doi_not_processed = pd.read_csv("./df_doi_not_processed.csv")
df_doi_processed = pd.read_csv("./df_doi_processed.csv")

In [5]:
print("DOIs never processed", len(df_doi_not_processed))
df_doi_not_processed.head()

DOIs never processed 102


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,collection,issn_scielo,journal,title,doc_type,pid,doi,doi_prefix,version,pub_year,has_doi,no_doi,doi_found,doi_not_found,request
0,0,0,scl,0044-5967,Acta Amazonica,O uso do solo na Amazônia,editorial,S0044-59671973000100003,10.1590/1809-43921973031003,10.159,xml,1973,1,0,0,1,404 Client Error: Not Found for url: https://w...
1,1,1,scl,0044-5967,Acta Amazonica,An evolutionary and ecological perspective of ...,undefined,S0044-59671973000100005,10.1590/1809-43921973031005,10.159,xml,1973,1,0,0,1,404 Client Error: Not Found for url: https://w...
2,2,2,scl,0044-5967,Acta Amazonica,"Anatomia de Anacardium spruceanum Bth, Ex Engl...",undefined,S0044-59671973000100039,10.1590/1809-43921973031039,10.159,xml,1973,1,0,0,1,404 Client Error: Not Found for url: https://w...
3,3,3,scl,0044-5967,Acta Amazonica,The effect of slash and burn agriculture on pl...,undefined,S0044-59671973000100055,10.1590/1809-43921973031055,10.159,xml,1973,1,0,0,1,404 Client Error: Not Found for url: https://w...
4,4,4,scl,0044-5967,Acta Amazonica,The chemical composition of Amazonian plants (),undefined,S0044-59671973000100063,10.1590/1809-43921973031063,10.159,xml,1973,1,0,0,1,404 Client Error: Not Found for url: https://w...


In [6]:
print("DOIs already processed", len(df_doi_processed))
df_doi_processed.head()

DOIs already processed 1525


Unnamed: 0.1,Unnamed: 0,journal,pid,doi,submission_status,feedback_status,feedback_xml
0,0,Anais da Academia Brasileira de Ciências,S0001-37652019000400516,10.1590/0001-3765201920180614,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
1,1,Anais da Academia Brasileira de Ciências,S0001-37652019000600402,10.1590/0001-3765201920190208,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
2,2,Anais da Academia Brasileira de Ciências,S0001-37652019000600611,10.1590/0001-3765201920190218,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
3,3,Bragantia,S0006-87052019005013101,10.1590/1678-4499.20180178,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
4,4,Bragantia,S0006-87052019005013102,10.1590/1678-4499.20180251,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."


## Analyzing errors for DOIs not yet processed

These cases need to be added to and processed by SciELO - DOI Manager.

In [7]:
# List of PIDs to be processed
print("\n".join(df_doi_not_processed.pid.to_list()))

S0044-59671973000100003
S0044-59671973000100005
S0044-59671973000100039
S0044-59671973000100055
S0044-59671973000100063
S0044-59671973000100065
S0044-59671973000100071
S0044-59671973000200003
S0044-59671973000200005
S0044-59671973000200007
S0044-59671973000200017
S0044-59671973000200033
S0044-59671973000200041
S0044-59671973000200043
S0044-59671973000200047
S0044-59671973000300003
S0044-59671973000300005
S0044-59671973000300029
S0044-59671973000300041
S0044-59671973000300043
S0044-59671973000300045
S0044-59671973000300051
S0044-59671973000300053
S0044-59671973000300059
S0044-59671978000400523
S0044-59671978000400543
S0044-59671978000400545
S0044-59671978000400549
S0044-59671978000400557
S0044-59671978000400561
S0044-59671978000400577
S0044-59671978000400591
S0044-59671978000400595
S0044-59671978000400601
S0044-59671978000400605
S0044-59671978000400613
S0044-59671978000400621
S0044-59671978000400629
S0044-59671978000400639
S0044-59671978000400657
S0044-59671978000400679
S0044-5967197800

## Analyzing errors for DOIs already processed

The `Situação de depósito` (deposit status) and `XML de resultado do depósito` (deposit result XML) data were extracted directly from the PostgreSQL database by the script used earlier.

In [8]:
# Handle each feedback status separately
gb = df_doi_processed.groupby("feedback_status")
data_errors = {
    x: gb.get_group(x) for x in gb.groups
}

for k,v in data_errors.items():
    print("status:", k, "\tTotal", v.journal.count())

status: error 	Total 8
status: failure 	Total 1411
status: notapplicable 	Total 7
status: semValor 	Total 18
status: success 	Total 67
status: waiting 	Total 14


### Cases with status `failure`

In [9]:
status_failure = data_errors["failure"]
print("Total items with failure", status_failure.journal.count())

Total items with failure 1411


In [10]:
# Regex to extract the message from the CrossRef feedback XML
regex = r"<msg>(.*)<\/msg>"
result = []

for item in status_failure.itertuples():
    group = re.findall(regex, item.feedback_xml, re.MULTILINE)
    if group:
        msg_feedback = group[0]
    else:
        msg_feedback = ""
    result.append(
        [item.journal, msg_feedback]
    )

df_result = pd.DataFrame(result, columns=["journal", "msg_feedback"])
g_df_result = df_result.groupby(["msg_feedback", "journal"]).size().to_frame('size')
g_df_result.sort_values(by="size", ascending=False)



msg_feedback,journal,size
"ISSN ""01007203"" has already been assigned, issn (01007203) is assigned to another title (Revista Brasileira de Ginecologia e Obstetr&#258;&#173;cia / RBGO Gynecology and Obstetrics)",Revista Brasileira de Ginecologia e Obstetrícia,145
"ISSN ""16783166"" has already been assigned, title/issn: Scientiae Studia/16783166 is owned by publisher: 10.11606",Scientiae Studia,105
"ISSN ""01032070"" has already been assigned, title/issn: Tempo Social/01032070 is owned by publisher: 10.11606",Tempo Social,89
"ISSN ""00217557"" has already been assigned, title/issn: Jornal de Pediatria/00217557 is owned by publisher: 10.1016",Jornal de Pediatria,77
"ISSN ""19805098"" has already been assigned, title/issn: Ci&#258;&#350;ncia Florestal/19805098 is owned by publisher: 10.5902",Ciência Florestal,75
"ISSN ""19825676"" has already been assigned, title/issn: Tropical Plant Pathology/19825676 is owned by publisher: 10.1007",Tropical Plant Pathology,68
"ISSN ""18071775"" has already been assigned, issn (18071775) is assigned to another title (Journal of Information Systems and Technology Management)",JISTEM - Journal of Information Systems and Technology Management,66
"ISSN ""18075932"" has already been assigned, title/issn: Clinics/18075932 is owned by publisher: 10.6061",Clinics,60
"ISSN ""0066782X"" has already been assigned, title/issn: Arquivos Brasileiros de Cardiologia/0066782X is owned by publisher: 10.5935",Arquivos Brasileiros de Cardiologia,59
"ISSN ""01013289"" has already been assigned, title/issn: Revista Brasileira de Ci&#258;&#350;ncias do Esporte/01013289 is owned by publisher: 10.1016",Revista Brasileira de Ciências do Esporte,50

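The messages in the table above follow two patterns: the ISSN is either owned by a publisher with a different DOI prefix, or assigned to another title. As a minimal sketch, assuming every message of interest matches one of these two wordings, they can be classified with regular expressions. The function name `classify_failure` and the `kind` labels are hypothetical.

```python
import re

ISSN_RE = re.compile(r'ISSN "(?P<issn>[0-9Xx]{8})" has already been assigned')
OWNER_RE = re.compile(r"is owned by publisher: (?P<prefix>10\.\d+)")
TITLE_RE = re.compile(r"is assigned to another title \((?P<title>.*)\)")


def classify_failure(msg):
    """Return the ISSN and the kind of conflict reported in a feedback message."""
    info = {"issn": None, "kind": "other", "detail": None}
    issn = ISSN_RE.search(msg)
    if issn:
        info["issn"] = issn.group("issn")
    owner = OWNER_RE.search(msg)
    title = TITLE_RE.search(msg)
    if owner:
        # The ISSN is registered under another publisher's DOI prefix.
        info.update(kind="owned_by_other_prefix", detail=owner.group("prefix"))
    elif title:
        # The ISSN is registered under a different journal title.
        info.update(kind="assigned_to_other_title", detail=title.group("title"))
    return info


# Two messages copied from the table above.
samples = [
    'ISSN "16783166" has already been assigned, title/issn: Scientiae Studia/16783166 is owned by publisher: 10.11606',
    'ISSN "18071775" has already been assigned, issn (18071775) is assigned to another title (Journal of Information Systems and Technology Management)',
]
for sample in samples:
    print(classify_failure(sample))
```

Grouping by `kind` and `detail` instead of the raw message would show which external prefixes hold the conflicting ISSNs.
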

### Cases with status `success`

These items must be analyzed separately, because they were apparently processed successfully by CrossRef, yet their DOIs are flagged as errors in the report.

In [11]:
status_success = data_errors["success"]
print("Total items with success", status_success.journal.count())

Total items with success 67


In [12]:
status_success.head()

Unnamed: 0.1,Unnamed: 0,journal,pid,doi,submission_status,feedback_status,feedback_xml
0,0,Anais da Academia Brasileira de Ciências,S0001-37652019000400516,10.1590/0001-3765201920180614,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
1,1,Anais da Academia Brasileira de Ciências,S0001-37652019000600402,10.1590/0001-3765201920190208,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
2,2,Anais da Academia Brasileira de Ciências,S0001-37652019000600611,10.1590/0001-3765201920190218,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
3,3,Bragantia,S0006-87052019005013101,10.1590/1678-4499.20180178,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."
4,4,Bragantia,S0006-87052019005013102,10.1590/1678-4499.20180251,success,success,"<doi_batch_diagnostic status=""completed"" sp=""d..."


In [22]:
import urllib.request
exclude_pid = []
# for row in status_success.itertuples():
#     req = urllib.request.Request("https://www.doi.org/{0}".format(row.doi))
#     try: 
#         urllib.request.urlopen(req)
#     except urllib.error.URLError as e:
#         print(e.reason)
#     else:
#         exclude_pid.append(row.pid)

The PIDs below must be excluded from the error list, because their DOIs exist, are valid, and are registered.

In [23]:
# List of PIDs to exclude from the report
print("Total PIDs: ", len(exclude_pid))

print("\n".join(exclude_pid))

Total PIDs:  0



```
S0001-37652019000400516
S0001-37652019000600402
S0001-37652019000600611
S0006-87052019005013101
S0006-87052019005013102
S0006-87052019005013103
S0006-87052019005013104
S0036-46652019005000219
S0036-46652019005000220
S0036-46652019005000504
S0100-39842019005014101
S0100-39842019005014102
S0100-39842019005014103
S0100-39842019005014104
S0101-28002019005023101
S0101-73302019000100703
S0102-46982019000100901
S0104-07072019000600310
S0104-92242019000100212
S1413-41522019005012101
S1413-70542019000100220
S1413-70542019000100221
S1413-70542019000100222
S1413-70542019000100401
S1516-14392019000400214
S1516-14392019000400215
S1516-14392019000400216
S1516-31802019005004101
S1516-31802019005004102
S1519-69842019005009101
S1519-69842019005009102
S1519-69842019005009103
S1807-76922019000200304
S1807-76922019000200305
S1980-00372019000100317
S1980-00372019000100318
S1980-00372019000100319
S1980-00372019000100320
S1980-00372019000100321
S1980-00372019000100322
S1980-00372019000100323
S1980-00372019000100324
S1980-00372019000100325
S1980-00372019000100326
S1980-00372019000100327
S1980-00372019000100328
S1981-67232019000100442
S1981-67232019000100443
S1981-77462019000300200
S1981-77462019000300508
S1981-77462019000300701
S2175-78602019000100232
S2175-78602019000100233
S2175-78602019000100234
S2175-78602019000100235
S2175-78602019000100236
S2175-78602019000100237
S2175-78602019000100238
S2175-78602019000100239
S2175-78602019000100240
S2175-78602019000100241
S2175-78602019000100601
S2175-78602019000100602
S2179-80872019000400117
S2179-80872019000400401
S2179-975X2019000100318
S2318-03312019000100402
```

### Cases with status `error` and `waiting`

In [15]:
status_error = data_errors["error"]
status_waiting = data_errors["waiting"]

print("Total items with waiting", status_waiting.journal.count())
print("Total items with error", status_error.journal.count())

Total items with waiting 14
Total items with error 8


In [16]:
status_error.head()

Unnamed: 0.1,Unnamed: 0,journal,pid,doi,submission_status,feedback_status,feedback_xml
654,654,Physis: Revista de Saúde Coletiva,S0103-73312015000100332,10.1590/1982&#8211;37030332ERRATA,success,error,
689,689,Acta Ortopédica Brasileira,S1413-78522017000300099,10.1590/1413-785220172503153742,success,error,
729,729,Psicologia: Ciência e Profissão,S1414-98932014000300704,10.1590/1982&#8211;3703001452013,success,error,
735,735,Revista Brasileira de Plantas Medicinais,S1516-05722015000100051,10.1590/1983-084X/ 12_191,success,error,
1330,1330,Alfa : Revista de Linguística (São José do Rio...,S1981-57942014000200252,10.1590/1981-5794-1405-0 251,success,error,


These cases need to be reprocessed by SciELO - DOI Manager.

In [17]:
status_waiting.head()

Unnamed: 0.1,Unnamed: 0,journal,pid,doi,submission_status,feedback_status,feedback_xml
214,214,Radiologia Brasileira,S0100-39842019000100060,10.1590/0100-3984.2017.0091,success,waiting,
655,655,Ciência Rural,S0103-84782019000300150,10.1590/0103-8478cr20180444,success,waiting,
673,673,Revista da Associação Médica Brasileira,S0104-42302019000300319,10.1590/1806-9282.65.3.319,success,waiting,
678,678,Anais Brasileiros de Dermatologia,S0365-05962018000400026,10.1590/abd1806-4841.20187329,success,waiting,
679,679,Revista Brasileira de Educação,S1413-24782018000100603,10.1590/s1413-24782018230099,success,waiting,


These cases need to be reprocessed by SciELO - DOI Manager.

In [18]:
# List of PIDs to be reprocessed
print("\nstatus_waiting\n")
print("\n".join(status_waiting.pid.to_list()))
print("\nstatus_error\n")
print("\n".join(status_error.pid.to_list()))


status_waiting

S0100-39842019000100060
S0103-84782019000300150
S0104-42302019000300319
S0365-05962018000400026
S1413-24782018000100603
S1414-98932018000600003
S1415-43662018000500355
S1415-43662018000800564
S1516-31802018000500449
S1516-31802018000500464
S1517-86922018000500366
S1517-86922018000500382
S1806-37132018000100069
S1983-21172018000100203

status_error

S0103-73312015000100332
S1413-78522017000300099
S1414-98932014000300704
S1516-05722015000100051
S1981-57942014000200252
S1981-81222017000200453
S2237-101X2002000100039
S2237-101X2012000100029


### Cases with status `notapplicable`

In [19]:
status_notapplicable = data_errors["notapplicable"]

print("Total items with status_notapplicable", status_notapplicable.journal.count())

Total items with status_notapplicable 7


In [20]:
status_notapplicable.head()

Unnamed: 0.1,Unnamed: 0,journal,pid,doi,submission_status,feedback_status,feedback_xml
251,251,Revista Brasileira de Educação Médica,S0100-55022015000100171,10.15910/1981-52712015v39n1e00232013er,notapplicable,notapplicable,
393,393,Revista Brasileira de Ginecologia e Obstetrícia,S0100-72032014000700315,10.159/S0100-720320140004977,notapplicable,notapplicable,
394,394,Revista Brasileira de Ginecologia e Obstetrícia,S0100-72032014000800340,10.1509/SO100-720320140005034,notapplicable,notapplicable,
506,506,Cadernos de Saúde Pública,S0102-311X2017000702001,10.15090/0102-311x00138516,notapplicable,notapplicable,
507,507,Cadernos de Saúde Pública,S0102-311X2017001400501,10.15090/0102-311x00058116,notapplicable,notapplicable,


For these cases, the *prefix* used to register the DOIs must be checked, since there was apparently a typing error in this data. They cannot be reprocessed until the prefix value is corrected.

In [21]:
print("\n".join(status_notapplicable.doi.to_list()))

10.15910/1981-52712015v39n1e00232013er
10.159/S0100-720320140004977
10.1509/SO100-720320140005034
10.15090/0102-311x00138516
10.15090/0102-311x00058116
10.15090/0102-311x00087416
10.1509/0104-4060.47764
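
The prefix problem can be made explicit by comparing the part of each DOI before the first slash with the expected value. This is a minimal sketch, assuming the expected prefix is `10.1590`, which is the one used by the valid DOIs elsewhere in this report. The names `EXPECTED_PREFIX` and `doi_prefix` are hypothetical.

```python
EXPECTED_PREFIX = "10.1590"  # assumption: prefix seen in the valid DOIs of this report


def doi_prefix(doi):
    """Return the part of the DOI before the first slash."""
    return doi.split("/", 1)[0]


# The seven DOIs listed above.
dois = [
    "10.15910/1981-52712015v39n1e00232013er",
    "10.159/S0100-720320140004977",
    "10.1509/SO100-720320140005034",
    "10.15090/0102-311x00138516",
    "10.15090/0102-311x00058116",
    "10.15090/0102-311x00087416",
    "10.1509/0104-4060.47764",
]

wrong = [(doi, doi_prefix(doi)) for doi in dois if doi_prefix(doi) != EXPECTED_PREFIX]
for doi, prefix in wrong:
    print(prefix, "->", doi)
```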
