# OpenCitations Notebook
### Arianna Moretti

## Table of Contents

#### 2022

1. [22/02 - 02/03 (Log Data Study)](#entry_1)
2. [02/03 - 09/03 (OC Index Software Code Refactoring - NOCI + Mapping Merge)](#entry_2)
3. [09/03 - 15/03 (Log Data Visualization d3.js)](#entry_3)
4. [15/03 - 22/03 (????)](#entry_4)

#### 202*

5. [??/?? - ??/?? (????)](#entry_5)

## 22/02 - 02/03 (Log Data Study) <a class="anchor" id="entry_1"></a>

### Study of the Raw Log Files

<ul>
    <li>Downloaded the Dropbox folder containing the raw log files for 2021</li>
    <li>Study of the statistical data: <a href="https://github.com/opencitations/statistics/tree/master/script">opencitations statistics repo</a></li>
    <li>Study of the Prometheus file format: <a href="https://sysdig.com/blog/prometheus-metrics/">https://sysdig.com/blog/prometheus-metrics/</a>, <a href="https://prometheus.io/docs/instrumenting/clientlibs/">https://prometheus.io/docs/instrumenting/clientlibs/</a></li>
    <li>Example of an <a href="http://opencitations.net/statistics/2022-01">API call for the January 2021 log data</a></li>
</ul>

#### Prometheus File Format
Prometheus is an open-source time series database that comes with a collection of client libraries; these allow applications to publish metrics so that the metrics server can collect them. The Prometheus metrics format is widely adopted and has also become an independent project, OpenMetrics, which aims to make this metric format specification a standard. Sysdig Monitor dynamically detects and scrapes Prometheus metrics.
##### Custom Metrics
Source: <a href="https://sysdig.com/blog/how-to-instrument-code-custom-metrics-vs-apm-vs-opentracing/">https://sysdig.com/blog/how-to-instrument-code-custom-metrics-vs-apm-vs-opentracing/</a>
Custom metrics (JMX, Golang expvar, Prometheus, statsd...) are an approach to instrumenting code so that the performance of an application can be monitored and troubleshot easily. Typical aspects to monitor are: the most visited components of a web page, the slowest components to load, the difference in loading speed between the frontend and the backend, and the factors that affect the speed of the process (location, browser, device). The options for monitoring these aspects are: using an APM tool, using OpenTracing libraries, or generating ad-hoc metrics for specific components.

##### Comparison between Custom Metrics, APM and OpenTracing

| Issue | Custom Metrics | APM | OpenTracing |
| --- | --- | --- | --- |
| Code-related problems | Developers must expose performance metrics in the code, and problems are not as easy to identify | Yes | Yes |
| Infrastructure-related problems | Yes | No | No |
| Node and service level aggregation | Yes | No | No |
| Standard implementation | Some languages include a standard way to implement them (Prometheus, Java JMX, Go expvar, …) | No | Yes |
| Allows capacity planning | Yes | No | No |
| Allows complete statistical measurements | Yes | No | No |
| Cloud Native Computing Foundation standard | Prometheus metrics only | No | Yes |
| Distributed application analysis | Yes, without per trace analysis | Yes | Yes |
| Useful for developers for pre-production environments | Yes | Yes | Yes |
| Useful for complete DevOps strategy | Yes | No | No |

##### Metrics notations: dot notation vs multi-dimensional tagged metrics
Source: <a href="https://sysdig.com/blog/prometheus-metrics/">https://sysdig.com/blog/prometheus-metrics/</a>
For Python, a third-party Prometheus client library is needed to feed the monitoring system.
There are two main paradigms to represent a metric: <b>dot notation</b> and <b>multi-dimensional tagged metrics</b>.
In a dot-notated metric, the data are encoded in the metric name as dot-separated components, which determine the level of detail and the hierarchy. How the metric is arranged depends on the information needed.
In the Prometheus metric format, metric naming follows a flat approach. Instead of a hierarchical, dot-separated name, a name is combined with a series of labels or tags. "Highly dimensional data" means that any number of context-specific labels can be associated with every submitted metric.
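
To make the contrast concrete, the following minimal sketch rewrites a dot-notated metric as a flat name plus labels. The hierarchy (`environment.service.host.metric`), the sample metric and the helper `dot_to_labeled` are hypothetical and chosen only for illustration.

```python
def dot_to_labeled(dotted_name, value, keys=("environment", "service", "host")):
    """Convert a dot-notated metric into a flat name with labels.

    The hierarchy order in `keys` is an assumption made for this example.
    """
    # The last component is taken as the metric name, the others as label values
    *label_values, metric_name = dotted_name.split(".")
    if len(label_values) != len(keys):
        raise ValueError("unexpected hierarchy depth in %r" % dotted_name)
    labels = ",".join('%s="%s"' % (k, v) for k, v in zip(keys, label_values))
    return "%s{%s} %s" % (metric_name, labels, value)


line = dot_to_labeled("production.api.server5.http_requests_total", 1027)
print(line)
# http_requests_total{environment="production",service="api",host="server5"} 1027
```

With labels, a query can aggregate over any dimension (for example, all hosts of one service) without relying on the position of a component inside the name.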

##### Prometheus metrics (OpenMetrics)
The Prometheus metrics format is line oriented: lines are separated by a line feed character (`\n`), and the last line must end with a line feed character, while empty lines are simply ignored.

A metric is composed of the following components (some of them optional):

<ul>
    <li>Metric name</li>
    <li>Any number of labels (can be 0), represented as a key-value array</li>
    <li>Current metric value</li>
    <li>Optional metric timestamp</li>
</ul>

Metric output is typically preceded by **# HELP** and **# TYPE** metadata lines.
The HELP string identifies the metric name and a brief description of it. The TYPE string identifies the type of metric. If there’s no TYPE before a metric, the metric is set to untyped. Everything else that starts with a # is parsed as a comment.
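
The structure described above can be sketched with the standard library alone. The function `render_metric` and the sample metric are hypothetical; the sketch does not escape special characters in label values and is not a replacement for an official client library.

```python
def render_metric(name, value, labels=None, help_text=None, metric_type=None, timestamp=None):
    """Render one metric sample in the Prometheus text format (simplified)."""
    lines = []
    if help_text is not None:
        lines.append("# HELP %s %s" % (name, help_text))
    if metric_type is not None:
        lines.append("# TYPE %s %s" % (name, metric_type))
    sample = name
    if labels:
        # Labels are written as key="value" pairs inside curly braces
        pairs = ",".join('%s="%s"' % (k, v) for k, v in sorted(labels.items()))
        sample += "{%s}" % pairs
    sample += " %s" % value
    if timestamp is not None:
        sample += " %d" % timestamp
    lines.append(sample)
    # Every line, including the last one, ends with a line feed
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total", 1027,
    labels={"method": "post", "code": "200"},
    help_text="The total number of HTTP requests.",
    metric_type="counter",
    timestamp=1395066363000,
)
print(text, end="")
# # HELP http_requests_total The total number of HTTP requests.
# # TYPE http_requests_total counter
# http_requests_total{code="200",method="post"} 1027 1395066363000
```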

(*to be continued*)

##### Implement Prometheus custom metric instrumentation in Python



## 02/03 - 09/03 (OC Index Software Code Refactoring - NOCI + Mapping Merge) <a class="anchor" id="entry_2"></a>

### NOCI material

1. **OpenCitations Index expansion**

   1.1 *ADDITIONS*
    
      1.1.1 [NIH citation source](#en_1.1.1)

      1.1.2 [NOCI glob](#en_1.1.2)

      1.1.3 [PMID manager](#en_1.1.3)

      1.1.4 [NIH resource finder](#en_1.1.4)
        
   1.2 *ADJUSTMENTS / EXPANSIONS*
    
      1.2.1 [citation/oci.py](#en_1.2.1)
        
      1.2.2 [finder/resourcefinder.py](#en_1.2.2)
        
      1.2.3 [finder/dataciteresourcefinder.py](#en_1.2.3)
      
      1.2.4 [finder/crossrefresourcefinder.py](#en_1.2.4)
      
      1.2.5 [finder/orcidresourcefinder.py](#en_1.2.5)
      
      1.2.6 [storer/csvmanager.py](#en_1.2.6)
      
      1.2.7 [storer/datahandler.py](#en_1.2.7)
      
      1.2.8 [storer/update.py](#en_1.2.8)
      
      
        
2. **Ramose**

    2.1 [NOCI configuration file](#en_2.1)
    
    2.2 [indexapi.py extension](#en_2.2)


### 1.1.1 NIH citation source <a class="anchor" id="en_1.1.1"></a>

Code of the citation source for the National Institutes of Health. The citation dataset (NIH-OCC) provides only the minimal information required by the OpenCitations Data Model, namely the **citing** and **cited** entities, expressed in the "citing" and "referenced" fields of the NIH-OCC CSV file, respectively.
The code was tested successfully.
Below are the code of the NIH citation source and its test case.
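
As a standalone illustration of the field mapping (not the class itself), the sketch below reads NIH-OCC-like rows with the standard `csv` module and yields the six-element tuple with only the first two positions filled. The sample rows and the function name `iter_citation_tuples` are hypothetical.

```python
import csv
import io


def iter_citation_tuples(csv_file):
    """Yield (citing, cited, creation, timespan, journal_sc, author_sc) tuples."""
    for row in csv.DictReader(csv_file):
        # Only citing and cited are available in NIH-OCC; the rest stays None
        yield row["citing"], row["referenced"], None, None, None, None


# Hypothetical sample with the two column names used by NIH-OCC
sample = io.StringIO("citing,referenced\n2140506,2942070\n1523579,7097569\n")
tuples = list(iter_citation_tuples(sample))
print(tuples[0])
# ('2140506', '2942070', None, None, None, None)
```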

In [None]:
from os import walk, sep, remove
from os.path import isdir
from json import load
from csv import DictWriter
from index.citation.citationsource import CSVFileCitationSource
from index.identifier.pmidmanager import PMIDManager
from index.citation.oci import Citation, OCIManager


class NIHCitationSource( CSVFileCitationSource ):
    def __init__(self, src, local_name=""):
        self.pmid = PMIDManager()
        super( NIHCitationSource, self ).__init__( src, local_name )

    def get_next_citation_data(self):
        row = self._get_next_in_file()
        #id_type = OCIManager.pmid_type

        if row is not None:
            citing = self.pmid.normalise(row.get("citing"))
            cited = self.pmid.normalise(row.get("referenced"))

            self.update_status_file()
            # NIH-OCC only provides the citing and cited PMIDs: the remaining
            # four fields of the tuple are left empty
            return citing, cited, None, None, None, None #, id_type

        # No more rows to read: the status file is no longer needed
        remove(self.status_file)

### test/10_noci

In [None]:
import unittest
from index.coci.glob import process
from os import sep, makedirs
from os.path import exists
from shutil import rmtree
from index.storer.csvmanager import CSVManager
from index.noci.nationalinstituteofhealthsource import NIHCitationSource
from csv import DictReader


class NOCITest(unittest.TestCase):

    def setUp(self):
        self.input_file = "index%stest_data%snih_dump%ssource.csv" % (sep, sep, sep)
        self.citations = "index%stest_data%snih_dump%scitations.csv" % (sep, sep, sep)

    def test_citation_source(self):
        ns = NIHCitationSource( self.input_file )
        new = []
        cit = ns.get_next_citation_data()
        while cit is not None:
            citing, cited, creation, timespan, journal_sc, author_sc = cit
            new.append({
                "citing": citing,
                "cited": cited,
                "creation": "" if creation is None else creation,
                "timespan": "" if timespan is None else timespan,
                "journal_sc": "no" if journal_sc is None else journal_sc,
                "author_sc": "no" if author_sc is None else author_sc
            })
            cit = ns.get_next_citation_data()

        with open(self.citations, encoding="utf8") as f:
            old = list(DictReader(f))

        self.assertEqual(new, old)

### 1.1.2 NOCI glob <a class="anchor" id="en_1.1.2"></a>
The iCite Database contains the NIH-OCC, i.e. the data source of NOCI, and another dataset: **iCite Metadata**. While the NIH-OCC contains only citation data, represented by the PMIDs of the citing and cited entities, iCite Metadata contains metadata about the bibliographic entities (identified by PMIDs) involved in the citations of the NIH-OCC. iCite Metadata therefore holds information that can be processed to derive (directly or indirectly) the metadata needed to fill the four fields of the six-element tuple that the NIH-OCC does not cover.
Among the fields of the iCite Metadata dataset, **"doi"** provides the PMID-DOI **mapping**. This is particularly useful because it makes it possible to use the DOI API services to retrieve information that is provided neither in iCite Metadata nor by the API services for PMIDs.
The extracted data are saved in CSV files, which are used as support material when populating the citation index.
Below are the code of the NOCI glob and its test, which passed successfully.
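
Before the full script, here is a minimal sketch of the PMID-DOI mapping idea, using only the standard library. It assumes a CSV with "pmid" and "doi" columns, as read by the script; the sample rows, the DOI values and the function name `build_pmid_doi_map` are hypothetical.

```python
import csv
import io


def build_pmid_doi_map(csv_file):
    """Map each PMID to its DOI, skipping rows whose "doi" field is empty."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        doi = row["doi"].strip().lower()
        if doi:
            mapping[row["pmid"].strip()] = doi
    return mapping


# Hypothetical iCite-Metadata-like rows
sample = io.StringIO(
    "pmid,doi,year,journal\n"
    "1,10.1000/Example.1,1998,Sample J\n"
    "2,,2001,Other J\n"
)
mapping = build_pmid_doi_map(sample)
print(mapping)
# {'1': '10.1000/example.1'}
```

Only entities that have a DOI can be enriched through the DOI-based services; for the others, the fields of iCite Metadata itself are the only source.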

In [None]:
import pandas as pd
from argparse import ArgumentParser
from index.storer.csvmanager import CSVManager
from index.finder.crossrefresourcefinder import CrossrefResourceFinder
from index.finder.orcidresourcefinder import ORCIDResourceFinder
from index.identifier.pmidmanager import PMIDManager
from index.identifier.doimanager import DOIManager
from index.identifier.issnmanager import ISSNManager
from index.identifier.orcidmanager import ORCIDManager
from os import sep, makedirs, walk
import os
from os.path import exists
import json
from re import sub
from index.citation.oci import Citation
from zipfile import ZipFile
from tarfile import TarFile
import re
from timeit import default_timer as timer

def issn_data_recover(directory):
    journal_issn_dict = dict()
    filename = directory + sep + 'journal_issn.json'
    if not os.path.exists(filename):
        return journal_issn_dict
    else:
        with open(filename, 'r', encoding='utf8') as fd:
            journal_issn_dict = json.load(fd)
            types = type(journal_issn_dict)
            return journal_issn_dict

def issn_data_to_cache(name_issn_dict, directory):
    filename = directory + sep + 'journal_issn.json'
    with open(filename, 'w', encoding='utf-8' ) as fd:
            json.dump(name_issn_dict, fd, ensure_ascii=False, indent=4)

#PUB DATE EXTRACTION : takes in input a data structure representing a bibliographic entity
def build_pubdate(row):
    year = str(row["year"])
    str_year = sub( r"[^\d]", "", year)[:4]  # raw string avoids an invalid "\d" escape
    if str_year:
        return str_year
    else:
        return None


# get_all_files extracts all the needed files from the input directory
def get_all_files(i_dir):
    result = []
    opener = None

    if i_dir.endswith( ".zip" ):
        zf = ZipFile( i_dir )
        for name in zf.namelist():
            if name.lower().endswith(".csv") and "citations" not in name.lower() and "source" not in name.lower():
                result.append( name )
        opener = zf.open
    elif i_dir.endswith( ".tar.gz" ):
        tf = TarFile.open( i_dir )
        for name in tf.getnames():
            if name.lower().endswith(".csv") and "citations" not in name.lower() and "source" not in name.lower():
                result.append(name)
        opener = tf.extractfile

    else:
        for cur_dir, cur_subdir, cur_files in walk(i_dir):
            for file in cur_files:
                if file.lower().endswith( ".csv" ) and "citations" not in file.lower() and "source" not in file.lower():
                    result.append(cur_dir + sep + file)
        opener = open
    return result, opener


def process(input_dir, output_dir, n):
    if not exists(output_dir):
        makedirs(output_dir)

    citing_pmid_with_no_date = set()
    valid_pmid = CSVManager( output_dir + sep + "valid_pmid.csv" )
    valid_doi = CSVManager("index/test_data/crossref_glob" + sep + "valid_doi.csv")
    id_date = CSVManager( output_dir + sep + "id_date_pmid.csv" )
    id_issn = CSVManager( output_dir + sep + "id_issn_pmid.csv" )
    id_orcid = CSVManager( output_dir + sep + "id_orcid_pmid.csv" )
    journal_issn_dict = issn_data_recover(output_dir) # cached journal-ISSN mapping recovered after a code break (empty dict if no cache file exists)
    pmid_manager = PMIDManager(valid_pmid)
    crossref_resource_finder = CrossrefResourceFinder(valid_doi)
    orcid_resource_finder = ORCIDResourceFinder(valid_doi)

    doi_manager = DOIManager(valid_doi)
    issn_manager = ISSNManager()
    orcid_manager = ORCIDManager()

    all_files, opener = get_all_files(input_dir)
    len_all_files = len(all_files)

    # Read all the CSV files in the NIH dump to create the main information of all the indexes
    print( "\n\n# Add valid PMIDs from NIH metadata" )
    for file_idx, file in enumerate( all_files, 1 ):
        df = pd.DataFrame()

        for chunk in pd.read_csv(file, chunksize=1000 ):
            f = pd.concat( [df, chunk], ignore_index=True )
            f.fillna("", inplace=True)

            print( "Open file %s of %s" % (file_idx, len_all_files) )
            for index, row in f.iterrows():
                if int(index) !=0 and int(index) % int(n) == 0:
                    print( "Group nr.", int(index)//int(n), "processed. Data from", int(index), "rows saved to journal_issn.json mapping file")
                    issn_data_to_cache(journal_issn_dict, output_dir)

                citing_pmid = pmid_manager.normalise(row['pmid'], True)
                pmid_manager.set_valid(citing_pmid)
                citing_doi = doi_manager.normalise(row['doi'], True)

                if id_date.get_value(citing_pmid) is None:
                    citing_date = Citation.check_date(build_pubdate(row))
                    if citing_date is not None:
                        id_date.add_value(citing_pmid, citing_date)
                        if citing_pmid in citing_pmid_with_no_date:
                            citing_pmid_with_no_date.remove(citing_pmid)
                    else:
                        citing_pmid_with_no_date.add( citing_pmid )

                if id_issn.get_value( citing_pmid ) is None:
                    journal_name = row["journal"]
                    if journal_name: #check that the string is not empty
                        if journal_name in journal_issn_dict.keys():
                            for issn in journal_issn_dict[journal_name]:
                                id_issn.add_value(citing_pmid, issn)
                        else:
                            if citing_doi is not None:
                                json_res = crossref_resource_finder._call_api(citing_doi)
                                if json_res is not None:
                                    issn_set = crossref_resource_finder._get_issn(json_res)
                                    if len(issn_set)>0:
                                        journal_issn_dict[journal_name] = []
                                    for issn in issn_set:
                                        issn_norm = issn_manager.normalise(str(issn))
                                        id_issn.add_value( citing_pmid, issn_norm )
                                        journal_issn_dict[journal_name].append(issn_norm)


                if id_orcid.get_value(citing_pmid) is None:
                    if citing_doi is not None:
                        json_res = orcid_resource_finder._call_api(citing_doi)
                        if json_res is not None:
                            orcid_set = orcid_resource_finder._get_orcid(json_res)
                            for orcid in orcid_set:
                                orcid_norm = orcid_manager.normalise( orcid )
                                id_orcid.add_value(citing_pmid, orcid_norm)

            issn_data_to_cache( journal_issn_dict, output_dir )


    # Iterate once again for all the rows of all the csv files, so to check the validity of the referenced pmids.
    print( "\n\n# Checking the referenced pmids validity" )
    for file_idx, file in enumerate( all_files, 1 ):
        df = pd.DataFrame()

        for chunk in pd.read_csv( file, chunksize=1000 ):
            f = pd.concat( [df, chunk], ignore_index=True )
            f.fillna("", inplace=True)
            print( "Open file %s of %s" % (file_idx, len_all_files) )
            for index, row in f.iterrows():
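                if row["references"] != "":
                    ref_string = row["references"].strip()
                    ref_string_norm = re.sub(r"\s+", " ", ref_string)
                else:
                    # No references in this row: report it and skip it, otherwise
                    # ref_string_norm would be undefined or stale from a previous row
                    print("the type of row reference is", (row["references"]), type(row["references"]))
                    print(index, row )
                    continue

                cited_pmids = set(ref_string_norm.split(" "))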
                for cited_pmid in cited_pmids:
                    if pmid_manager.is_valid(cited_pmid):
                        print("valid cited pmid added:", cited_pmid)
                    else:
                        print("invalid cited pmid discarded:", cited_pmid)

    for pmid in citing_pmid_with_no_date:
        id_date.add_value( pmid, "" )

if __name__ == "__main__":
    arg_parser = ArgumentParser( "Global files creator for NOCI",
                                 description="Process NIH CSV files and create global indexes to enable "
                                             "the creation of NOCI." )
    arg_parser.add_argument( "-i", "--input_dir", dest="input_dir", required=True,
                             help="Either the directory or the zip file that contains the NIH data dump "
                                  "of CSV files." )
    arg_parser.add_argument( "-o", "--output_dir", dest="output_dir", required=True,
                             help="The directory where the indexes are stored." )


    arg_parser.add_argument( "-n", "--num_lines", dest="n", required=True,
                             help="Number of lines after which the data stored in the dictionary for the mapping "
                                  "between a Journal name and the related issns are passed into a JSON cache file" )


    args = arg_parser.parse_args()

    start = timer()
    process(args.input_dir, args.output_dir, args.n)
    end = timer()
    #calculate elapsed time
    print("elapsed time, in seconds:", (end-start))


#python -m index.noci.glob1 -i "index/test_data/nih_dump" -o "index/test_data/nih_glob1" -n 20

### test/13_glob1

In [None]:
import unittest
from os import sep, remove
import os
from os.path import exists
from index.noci.glob1 import issn_data_recover, issn_data_to_cache, build_pubdate, get_all_files, process
from index.storer.csvmanager import CSVManager
from index.identifier.issnmanager import ISSNManager
from index.identifier.orcidmanager import ORCIDManager
from index.identifier.pmidmanager import PMIDManager
from index.identifier.doimanager import DOIManager
import shutil
import pandas as pd

class MyTestCase( unittest.TestCase ):
    def setUp(self):
        self.dir_with_issn_map = "index%stest_data%sglob_noci%sissn_data_recover%swith_issn_mapping" % (sep, sep, sep, sep)
        self.dir_without_issn_map = "index%stest_data%sglob_noci%sissn_data_recover%swithout_issn_mapping" % (sep, sep, sep, sep)
        self.issn_journal_sample_dict = {"N Biotechnol": ["1871-6784"], "Biochem Med": ["0006-2944"], "Magn Reson Chem": ["0749-1581"]}
        self.data_to_cache_dir = "index%stest_data%sglob_noci%sissn_data_to_cache" % (sep, sep, sep)
        self.get_all_files_dir = "index%stest_data%sglob_noci%sget_all_files" % (sep, sep, sep)
        self.csv_sample = "index%stest_data%sglob_noci%sget_all_files%s1.csv" % (sep, sep, sep, sep)
        self.output_dir = "index%stest_data%sglob_noci%sprocess%soutput" % (sep, sep, sep, sep)
        self.valid_pmid = CSVManager( self.output_dir + sep + "valid_pmid.csv" )
        self.valid_doi = CSVManager( "index/test_data/crossref_glob" + sep + "valid_doi.csv" )
        self.id_date = CSVManager( self.output_dir + sep + "id_date_pmid.csv" )
        self.id_issn = CSVManager( self.output_dir + sep + "id_issn_pmid.csv" )
        self.id_orcid = CSVManager( self.output_dir + sep + "id_orcid_pmid.csv" )
        self.doi_manager = DOIManager(self.valid_doi)
        self.pmid_manager = PMIDManager(self.valid_pmid)
        self.issn_manager = ISSNManager()
        self.orcid_manager = ORCIDManager()
        self.sample_reference = "pmid:7829625"

    def test_issn_data_recover(self):
        #Test the case in which there is no mapping file for journals - issn
        self.assertEqual(issn_data_recover(self.dir_without_issn_map), {})
        #Test the case in which there is a mapping file for journals - issn
        issn_map_dict_len = len(issn_data_recover(self.dir_with_issn_map))
        self.assertTrue(issn_map_dict_len>0)

    def test_issn_data_to_cache(self):
        filename = self.data_to_cache_dir + sep + 'journal_issn.json'
        if exists(filename):
            remove(filename)
        self.assertFalse(exists(filename))
        issn_data_to_cache(self.issn_journal_sample_dict, self.data_to_cache_dir)
        self.assertTrue(exists(filename))

    def test_get_all_files(self):
        all_files, opener = get_all_files( self.get_all_files_dir)
        len_all_files = len(all_files)
        #The folder contains 4 csv files, but two of those contains the words "citations" or "source" in their filenames
        self.assertEqual( len_all_files, 2)

    def test_build_pubdate(self):
        df = pd.DataFrame()
        for chunk in pd.read_csv(self.csv_sample, chunksize=1000):
            f = pd.concat( [df, chunk], ignore_index=True )
            f.fillna( "", inplace=True )
            for index, row in f.iterrows():
                pub_date = build_pubdate(row)
                self.assertTrue(isinstance(pub_date, str))
                self.assertTrue(isinstance(int(pub_date), int))
                self.assertEqual(len(pub_date), 4)

    def test_process(self):
        for files in os.listdir( self.output_dir):
            path = os.path.join( self.output_dir, files )
            try:
                shutil.rmtree(path)
            except OSError:
                os.remove(path)
        self.assertEqual(len(os.listdir(self.output_dir)),0)
        process(self.get_all_files_dir, self.output_dir, 20)
        self.assertTrue(len(os.listdir(self.output_dir))>0)

        df = pd.DataFrame()
        for chunk in pd.read_csv( self.csv_sample, chunksize=1000 ):
            f = pd.concat( [df, chunk], ignore_index=True )
            f.fillna( "", inplace=True )
            for index, row in f.iterrows():
                if index == 1:
                    pmid = row["pmid"]

        citing_pmid = self.pmid_manager.normalise(pmid, include_prefix=True)

        self.assertEqual(self.valid_pmid.get_value(citing_pmid), {'v'})
        self.assertEqual(self.valid_pmid.get_value(self.sample_reference), {'v'})
        self.assertEqual(self.id_date.get_value(citing_pmid), {'1998'})
        self.assertEqual(self.id_issn.get_value(citing_pmid), {'0918-8959', '1348-4540'})

        df = pd.DataFrame()
        for chunk in pd.read_csv( self.csv_sample, chunksize=1000 ):
            f = pd.concat( [df, chunk], ignore_index=True )
            f.fillna( "", inplace=True )
            for index, row in f.iterrows():
                if index == 0:
                    pmid = row["pmid"]

        citing_pmid = self.pmid_manager.normalise(pmid, include_prefix=True)

        self.assertEqual(self.id_orcid.get_value(citing_pmid), {'0000-0002-0524-4077'})


if __name__ == '__main__':
    unittest.main()

#python -m unittest index.test.13_glob

### 1.1.3 PMID manager <a class="anchor" id="en_1.1.3"></a>
The PMIDManager class is developed as a **subclass of IdentifierManager**. Its development is modelled on the DOIManager class, with which it shares purpose and functions.
In particular, identifier managers normalise the format of identifiers and then verify their existence and validity through API services specific to each identifier type.
The **API service** used for PMIDs is **https://pubmed.ncbi.nlm.nih.gov/**, which returns an **HTML page**. For this reason, the information on whether the PMID has been validated is extracted with the tools provided by the **BeautifulSoup** library.
Below are the code of the PMID manager and the extension of the test case for the identifier managers. The test passed successfully.
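
The normalisation and syntax check can be tried in isolation with the sketch below. It is a simplified standalone version, not the class itself: the function names are hypothetical, and the existence check against the remote service is deliberately left out.

```python
import re


def normalise_pmid(id_string, include_prefix=False):
    """Keep only the digits of the input and drop leading zeros."""
    digits = re.sub(r"\D", "", str(id_string)).lstrip("0")
    return ("pmid:" if include_prefix else "") + digits


def has_valid_syntax(id_string):
    """Check the shape of a PMID only; existence still needs the remote service."""
    return re.match(r"^pmid:[1-9]\d*$", normalise_pmid(id_string, True)) is not None


print(normalise_pmid("https://pubmed.ncbi.nlm.nih.gov/2942070"))  # 2942070
print(normalise_pmid("0001509982", include_prefix=True))          # pmid:1509982
print(has_valid_syntax("pmid:"))                                   # False
```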

In [None]:
from index.identifier.identifiermanager import IdentifierManager
from re import sub, match
from urllib.parse import unquote, quote
from requests import get
from index.storer.csvmanager import CSVManager
from requests import ReadTimeout
from requests.exceptions import ConnectionError
from time import sleep
from bs4 import BeautifulSoup



class PMIDManager( IdentifierManager ):
    def __init__(self, valid_pmid=None, use_api_service=True):
        if valid_pmid is None:
            valid_pmid = CSVManager( store_new=False )

        self.api = "https://pubmed.ncbi.nlm.nih.gov/"
        self.valid_pmid = valid_pmid
        self.use_api_service = use_api_service
        self.p = "pmid:"
        super( PMIDManager, self ).__init__()

    def set_valid(self, id_string):
        pmid = self.normalise(id_string, include_prefix=True )
        if self.valid_pmid.get_value( pmid ) is None:
            self.valid_pmid.add_value( pmid, "v" )

    def is_valid(self, id_string):
        pmid = self.normalise( id_string, include_prefix=True )
        if pmid is None or match( r"^pmid:[1-9]\d*$", pmid ) is None:
            return False
        else:
            if self.valid_pmid.get_value( pmid ) is None:
                if self.__pmid_exists( pmid ):
                    self.valid_pmid.add_value( pmid, "v" )
                else:
                    self.valid_pmid.add_value( pmid, "i" )
            return "v" in self.valid_pmid.get_value( pmid )

    def normalise(self, id_string, include_prefix=False):
        id_string = str(id_string)
        try:
            pmid_string = sub( "^0+", "", sub( "\0+", "", (sub( r"[^\d+]", "", id_string )) ) )
            return "%s%s" % (self.p if include_prefix else "", pmid_string)
        except:
            return None

    def __pmid_exists(self, pmid_full):
        pmid = self.normalise( pmid_full )
        if self.use_api_service:
            tentative = 3
            while tentative:
                tentative -= 1
                try:
                    r = get( self.api + quote( pmid ) + "/?format=pmid", headers=self.headers, timeout=30 )
                    if r.status_code == 200:
                        r.encoding = "utf-8"

                        soup = BeautifulSoup( r.content, features="lxml" )
                        for i in soup.find_all( "meta", {"name": "uid"} ):
                            id = i["content"]
                            if id == pmid:
                                return True

                except ReadTimeout:
                    pass
                except ConnectionError:
                    sleep(5)

        return False

### test/02_identifiermanager.py (PMID extension)

In [None]:
import unittest
from os import sep
from index.identifier.doimanager import DOIManager
from index.identifier.issnmanager import ISSNManager
from index.identifier.orcidmanager import ORCIDManager
from index.identifier.pmidmanager import PMIDManager
from index.storer.csvmanager import CSVManager


class IdentifierManagerTest(unittest.TestCase):
    """This class aims at testing the methods of the identifier manager classes."""

    def setUp(self):
#[...]

#class extension for pubmedid
        self.valid_pmid_1 = "2942070"
        self.valid_pmid_2 = "1509982"
        self.valid_pmid_3 = "7189714"
        self.invalid_pmid_1 = "0067308798798"
        self.invalid_pmid_2 = "pmid:174777777777"
        self.invalid_pmid_3 = "000009265465465465"
        self.valid_pmid_path = "index%stest_data%svalid_pmid.csv" % (sep, sep)

#[...]

#class extension for pubmedid
    def test_pmid_normalise(self):
        pm = PMIDManager()
        self.assertEqual(self.valid_pmid_1, pm.normalise(self.valid_pmid_1.replace("", "pmid:")))
        self.assertEqual(self.valid_pmid_1, pm.normalise(self.valid_pmid_1.replace("", " ")))
        self.assertEqual(self.valid_pmid_1, pm.normalise("https://pubmed.ncbi.nlm.nih.gov/"+self.valid_pmid_1))
        self.assertEqual(self.valid_pmid_2, pm.normalise("000"+self.valid_pmid_2))

    def test_pmid_is_valid(self):
        pm_nofile = PMIDManager()
        print(pm_nofile.normalise(self.valid_pmid_1, include_prefix=True ))
        print(pm_nofile.is_valid(self.valid_pmid_1))
        self.assertTrue(pm_nofile.is_valid(self.valid_pmid_1))
        self.assertTrue(pm_nofile.is_valid(self.valid_pmid_2))
        self.assertTrue(pm_nofile.is_valid(self.valid_pmid_3))
        self.assertFalse(pm_nofile.is_valid(self.invalid_pmid_1))
        self.assertFalse(pm_nofile.is_valid(self.invalid_pmid_2))
        self.assertFalse(pm_nofile.is_valid(self.invalid_pmid_3))

        valid_pmid = CSVManager(self.valid_pmid_path)
        pm_file = PMIDManager(valid_pmid=valid_pmid, use_api_service=False)
        self.assertTrue(pm_file.is_valid(self.valid_pmid_1))
        self.assertFalse(pm_file.is_valid(self.invalid_pmid_1))

        pm_nofile_noapi = PMIDManager(use_api_service=False)
        self.assertFalse(pm_nofile_noapi.is_valid(self.valid_pmid_1))
        self.assertFalse(pm_nofile_noapi.is_valid(self.invalid_pmid_1))
        self.assertFalse(pm_nofile_noapi.is_valid(self.valid_pmid_2))
        self.assertFalse(pm_nofile_noapi.is_valid(self.invalid_pmid_2))
        self.assertFalse(pm_nofile_noapi.is_valid(self.valid_pmid_3))
        self.assertFalse(pm_nofile_noapi.is_valid(self.invalid_pmid_3))

### 1.1.4 NIH resource finder <a class="anchor" id="en_1.1.4"></a>
The NIHResourceFinder class is developed as a **subclass of ApiIDResourceFinder**, and its purpose is to **retrieve metadata from the API for PMIDs** when the process is not given the support files generated during preprocessing by the NOCI glob.
The service used is https://pubmed.ncbi.nlm.nih.gov/ (with the "pubmed" display option), which returns responses in HTML format. For this reason, the data are extracted using the **Beautiful Soup** library to reach the section containing the text string with the metadata. The data are then extracted from that string using **regexes**.
Unlike the API services for DOIs, this REST API does not provide particularly detailed data and does not cover some of the information required by the OCDM, such as the authors' ORCIDs. Among the available metadata, the only useful ones are the **publication date** and the **ISSN** of the journal.
Below are the NIHResourceFinder script and the related extension of the test case 03_resourcefinder.py. The test passed successfully.
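
The regex step can be illustrated on its own with the sketch below. It assumes that the metadata string uses "IS  - " lines for ISSNs and "DP  - " lines for the publication date, as the class does; the sample record, the function names and the month lookup table are hypothetical, and this is a simplified variant rather than the class code.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1)}

DATE_RE = re.compile(
    r"DP\s+-\s+(\d{4})(?:[ \t]+(%s))?(?:[ \t]+(\d{1,2}))?" % "|".join(MONTHS))


def extract_issns(text):
    """Collect every ISSN found in "IS  - NNNN-NNNN" lines."""
    return set(re.findall(r"IS\s+-\s+(\d{4}-\d{3}[\dX])", text))


def extract_date(text):
    """Return the "DP" date as YYYY-MM-DD, YYYY-MM or YYYY (None if absent)."""
    found = DATE_RE.search(text)
    if found is None:
        return None
    year, month, day = found.groups()
    if month is None:
        return year
    if day is None:
        return "%s-%02d" % (year, MONTHS[month])
    return "%s-%02d-%02d" % (year, MONTHS[month], int(day))


# Hypothetical record in a PubMed-like text format
record = (
    "PMID- 1\n"
    "IS  - 0918-8959 (Print)\n"
    "IS  - 1348-4540 (Electronic)\n"
    "DP  - 1998 Jun\n"
    "TI  - A hypothetical title\n"
)
print(sorted(extract_issns(record)))  # ['0918-8959', '1348-4540']
print(extract_date(record))           # 1998-06
```

Using a lookup table for month names instead of `datetime.strptime` keeps the sketch independent of the system locale; this is a design choice of the example, not of the original class.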

In [None]:
from index.finder.resourcefinder import ApiIDResourceFinder
from index.citation.oci import OCIManager
from requests import get
from urllib.parse import quote
from datetime import datetime
from bs4 import BeautifulSoup
import re

class NIHResourceFinder(ApiIDResourceFinder):
    def __init__(self, date=None, orcid=None, issn=None, pmid=None, use_api_service=True):
        self.use_api_service = use_api_service
        self.api = "https://pubmed.ncbi.nlm.nih.gov/"
        self.baseurl = "https://pubmed.ncbi.nlm.nih.gov/"
        super(NIHResourceFinder, self).__init__(date=date, orcid=orcid, issn=issn, id=pmid, id_type=OCIManager.pmid_type,
                                                     use_api_service=use_api_service)

    def _get_issn(self, txt_obj):
        result = set()
        # Lines look like "IS  - 0003-4819 (Print)"; the ISSN check digit may be "X"
        issns = re.findall(r"IS\s+-\s+\d{4}-\d{3}[\dX]", txt_obj)
        for i in issns:
            issn = re.search(r"\d{4}-\d{3}[\dX]", i).group(0)
            result.add(issn)
        return result

    def _get_date(self, txt_obj):
        # The "DP" field holds the date as "YYYY", "YYYY Mon" or "YYYY Mon D"
        months = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
        days = r"3[0-1]|[1-2][0-9]|0?[1-9]"
        dp_match = re.search(r"DP\s+-\s+(\d{4}(\s?(%s))?(\s?(%s))?)" % (months, days), txt_obj)
        if dp_match is None:
            # No "DP" field in the text: there is no date to return
            return None
        date = dp_match.group(1)
        # Try the most specific format first: year, month and day
        re_search = re.search(r"\d{4}\s+(%s)\s+(%s)" % (months, days), date)
        if re_search is not None:
            datetime_object = datetime.strptime(re_search.group(0), '%Y %b %d')
            return datetime.strftime(datetime_object, '%Y-%m-%d')
        # Then year and month only
        re_search = re.search(r"\d{4}\s+(%s)" % months, date)
        if re_search is not None:
            datetime_object = datetime.strptime(re_search.group(0), '%Y %b')
            return datetime.strftime(datetime_object, '%Y-%m')
        # dp_match guarantees that the string starts with a four-digit year
        return re.search(r"\d{4}", date).group(0)


    def _call_api(self, pmid_full):
        if self.use_api_service:
            pmid = self.pm.normalise(pmid_full)
            r = get(self.api + quote(pmid) + "/?format=pubmed", headers=self.headers, timeout=30)
            if r.status_code == 200:
                r.encoding = "utf-8"
                soup = BeautifulSoup(r.text, features="lxml")
                mdata = str(soup.find(id="article-details"))
                return mdata
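
The regex step described above can be tried offline, without calling the service. The following is a minimal sketch, not the class itself: `SAMPLE` is an invented record written in PubMed's tagged text layout, and `extract_issns` and `extract_date` are hypothetical helper names that only mirror what `_get_issn` and `_get_date` are meant to do. It maps month names through a dictionary instead of `strptime`, so the result does not depend on the system locale.

```python
import re
from datetime import date

# Invented record in the tagged text layout; every value is a placeholder.
SAMPLE = """PMID- 0000000
IS  - 1234-5678 (Print)
IS  - 8765-432X (Electronic)
DP  - 1998 May 25
TI  - Placeholder title
"""

MONTH_NAMES = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

DATE_RE = re.compile(
    r"^DP[ \t]+-[ \t]+(\d{4})"
    r"(?:[ \t]+(" + "|".join(MONTH_NAMES) + r")\b)?"
    r"(?:[ \t]+(\d{1,2})\b)?",
    flags=re.MULTILINE,
)


def extract_issns(text):
    """Return the set of ISSNs found on 'IS' lines (hypothetical helper)."""
    return set(re.findall(r"^IS\s+-\s+(\d{4}-\d{3}[\dX])", text, flags=re.MULTILINE))


def extract_date(text):
    """Return the 'DP' date as YYYY-MM-DD, YYYY-MM or YYYY, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    year, month, day = match.groups()
    if month and day:
        try:
            return date(int(year), MONTHS[month], int(day)).isoformat()
        except ValueError:
            # Impossible day (e.g. "Feb 31"): fall back to year and month
            pass
    if month:
        return "%s-%02d" % (year, MONTHS[month])
    return year


print(extract_issns(SAMPLE))
print(extract_date(SAMPLE))
```

Keeping the three precisions apart matters because a record may carry only the year, or only the year and the month, and the output should not suggest more precision than the source provides.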

### test/03_resourcefinder.py (PMID extension)

In [None]:
import unittest
from os import sep
from index.storer.csvmanager import CSVManager
from index.finder.crossrefresourcefinder import CrossrefResourceFinder
from index.finder.dataciteresourcefinder import DataCiteResourceFinder
from index.finder.nihresourcefinder import NIHResourceFinder
from index.finder.orcidresourcefinder import ORCIDResourceFinder
from index.finder.resourcefinder import ResourceFinderHandler


class ResourceFinderTest(unittest.TestCase):
    """This class aims at testing the methods of the resource finder classes."""

    def setUp(self):
        self.date_path = "index%stest_data%sid_date.csv" % (sep, sep)
        self.date_path_pmid = "index%stest_data%sid_date_pmid.csv" % (sep, sep)
        self.orcid_path = "index%stest_data%sid_orcid.csv" % (sep, sep)
        self.orcid_path_pmid = "index%stest_data%sid_orcid_pmid.csv" % (sep, sep)
        self.issn_path = "index%stest_data%sid_issn.csv" % (sep, sep)
        self.issn_path_pmid = "index%stest_data%sid_issn_pmid.csv" % (sep, sep)
        self.doi_path = "index%stest_data%svalid_doi.csv" % (sep, sep)
        self.pmid_path = "index%stest_data%svalid_pmid.csv" % (sep, sep)
#[...]
    
    def test_nationalinstituteofhealth_get_orcid(self):
        # Do not use support files, only APIs
        nf_1 = NIHResourceFinder()
        self.assertIsNone(nf_1.get_orcid("7189714"))
        self.assertIsNone(nf_1.get_orcid("1509982"))

        # Do use support files, but avoid using APIs
        nf_2 = NIHResourceFinder(orcid=CSVManager(self.orcid_path_pmid),
                                      pmid=CSVManager(self.pmid_path), use_api_service=False)
        self.assertIn("0000-0002-1825-0097", nf_2.get_orcid("7189714"))
        self.assertNotIn("0000-0002-1825-0098", nf_2.get_orcid("1509982"))

        # Use neither support files nor APIs
        nf_3 = NIHResourceFinder(use_api_service=False)
        self.assertIsNone(nf_3.get_orcid("7189714"))

    def test_nationalinstituteofhealth_get_issn(self):
        # Do not use support files, only APIs
        nf_1 = NIHResourceFinder()
        self.assertIn("0003-4819", nf_1.get_container_issn("2942070"))
        self.assertNotIn("0003-0000", nf_1.get_container_issn("2942070"))

        # Do use support files, but avoid using APIs
        nf_2 = NIHResourceFinder(issn=CSVManager(self.issn_path_pmid),
                                      pmid=CSVManager(self.pmid_path), use_api_service=False)
        container = nf_2.get_container_issn("1509982")
        self.assertIn("0065-4299", container)
        self.assertNotIn("0065-4444", nf_2.get_container_issn("1509982"))

        # Use neither support files nor APIs
        nf_3 = NIHResourceFinder(use_api_service=False)
        self.assertIsNone(nf_3.get_container_issn("7189714"))

    def test_nationalinstituteofhealth_get_pub_date(self):
        # Do not use support files, only APIs
        nf_1 = NIHResourceFinder()
        self.assertIn("1998-05-25", nf_1.get_pub_date("9689714"))
        self.assertNotEqual("1998", nf_1.get_pub_date("9689714"))

        # Do use support files, but avoid using APIs
        nf_2 = NIHResourceFinder(date=CSVManager(self.date_path_pmid),
                                      pmid=CSVManager(self.pmid_path), use_api_service=False)
        self.assertIn("1980-06", nf_2.get_pub_date("7189714"))
        self.assertNotEqual("1980-06-22", nf_2.get_pub_date("7189714"))

        # Use neither support files nor APIs
        nf_3 = NIHResourceFinder(use_api_service=False)
        self.assertIsNone(nf_3.get_pub_date("2942070"))
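
The three configurations exercised by each test (API only, support files only, neither) imply a lookup order that can be sketched in isolation. The class below is a hypothetical stand-in, not the NIHResourceFinder or its superclass: it assumes that a value in the support data is returned first, that the API is queried only when `use_api_service` is true, and that an API result is cached for later lookups. A plain dictionary plays the role of the CSVManager and `fake_api` replaces the network call.

```python
class FallbackFinder:
    """Hypothetical stand-in, not the NIHResourceFinder class itself."""

    def __init__(self, support=None, fetch=None, use_api_service=True):
        self.support = dict(support or {})  # plays the role of a CSVManager
        self.fetch = fetch                  # plays the role of the API call
        self.use_api_service = use_api_service

    def get_pub_date(self, pmid):
        # 1. A value already present in the support data wins.
        if pmid in self.support:
            return self.support[pmid]
        # 2. Otherwise ask the API, but only when it is enabled.
        if self.use_api_service and self.fetch is not None:
            value = self.fetch(pmid)
            if value is not None:
                self.support[pmid] = value  # cache for later lookups
            return value
        # 3. Neither source is available.
        return None


def fake_api(pmid):
    # Stub that replaces the network call; it only knows one identifier.
    return {"9689714": "1998-05-25"}.get(pmid)


api_only = FallbackFinder(fetch=fake_api)
files_only = FallbackFinder(support={"7189714": "1980-06"}, use_api_service=False)
neither = FallbackFinder(use_api_service=False)

print(api_only.get_pub_date("9689714"))
print(files_only.get_pub_date("7189714"))
print(neither.get_pub_date("2942070"))
```

Replacing the network call with a stub in this way also keeps a test independent of the availability of the remote service.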

### 1.2.1 citation/oci.py <a class="anchor" id="en_1.2.1"></a>

### 1.2.2 finder/resourcefinder.py <a class="anchor" id="en_1.2.2"></a>

### 1.2.3 finder/dataciteresourcefinder.py <a class="anchor" id="en_1.2.3"></a>

### 1.2.4 finder/crossrefresourcefinder.py <a class="anchor" id="en_1.2.4"></a>

### 1.2.5 finder/orcidresourcefinder.py <a class="anchor" id="en_1.2.5"></a>

### 1.2.6 storer/csvmanager.py <a class="anchor" id="en_1.2.6"></a>

### 1.2.7 storer/datahandler.py <a class="anchor" id="en_1.2.7"></a>

### 1.2.8 storer/update.py <a class="anchor" id="en_1.2.8"></a>

### RAMOSE - NOCI configuration file <a class="anchor" id="en_2.1"></a>

### RAMOSE - indexapi.py <a class="anchor" id="en_2.2"></a>

## 09/03 - 15/03 (Log Data Visualization d3.js) <a class="anchor" id="entry_3"></a>

Lorem Ipsum


## 15/03 - 22/03 (????) <a class="anchor" id="entry_4"></a>

Lorem Ipsum

## ??/?? - ??/?? (????) <a class="anchor" id="entry_5"></a>
????????