1. Ingest citation data from uploaded RIS or BibTeX files, or from individual records entered manually
1. Parse citation data according to its input format, standardizing field names and values across formats as much as possible
1. Munge parsed data into a format convenient for importing into a database, e.g. CSV -> Postgres (see the [`COPY` documentation](https://www.postgresql.org/docs/9.5/static/sql-copy.html))
1. Import citations into the database with additional columns, e.g. citation_id, project_id, user_id, is_duplicate (NULL to start), confirmed_duplicate, ...
1. Apply the trained dedupe model to new citations vs. existing citations for the given project to find possible matches; interactively prompt the user to confirm duplicates when in doubt; mark duplicate records in the db accordingly
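
Steps 2 and 3 can be sketched with the standard library alone. The field mapping, the `standardize` and `to_csv_buffer` helpers, and the column subset below are illustrative assumptions, not part of `cipy`; the real parsers would need to cover many more RIS tags and BibTeX keys.

```python
import csv
import io

# Hypothetical mapping from RIS tags / BibTeX keys to standardized column names.
FIELD_MAP = {
    'TI': 'title', 'title': 'title',
    'AB': 'abstract', 'abstract': 'abstract',
    'PY': 'publication_year', 'year': 'publication_year',
    'DO': 'doi', 'doi': 'doi',
}
COLUMNS = ['title', 'abstract', 'publication_year', 'doi']


def standardize(record):
    """Rename known fields, strip whitespace, and map empty values to None."""
    clean = dict.fromkeys(COLUMNS)
    for key, value in record.items():
        column = FIELD_MAP.get(key)
        if column is None:
            continue  # unknown fields are dropped in this sketch
        value = '' if value is None else str(value).strip()
        clean[column] = value or None
    if clean['publication_year'] is not None:
        # keep only a leading 4-digit year, e.g. '2003/05//' -> 2003
        year = clean['publication_year'][:4]
        clean['publication_year'] = int(year) if year.isdigit() else None
    return clean


def to_csv_buffer(records):
    """Write standardized records to an in-memory CSV; None becomes an empty field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(standardize(record))
    buf.seek(0)
    return buf
```

The resulting buffer is the kind of input that Postgres `COPY ... FROM STDIN WITH CSV HEADER` accepts, for example through a driver such as psycopg2 and its `cursor.copy_expert` method; whether `cipy.db.PostgresDB` exposes something equivalent is not shown here.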

In [2]:
import io
import logging
import os

import dedupe

import cipy

In [21]:
logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)  # the default level is WARNING, which would hide the logger.info calls below

In [4]:
conn_creds = cipy.db.get_conn_creds('DATABASE_URL')
ddl_path = '/Users/burtondewilde/Desktop/datakind/ci/conservation-intl/cipy/db/ddls/citations.yaml'
psql = cipy.db.PostgresDB(ddl_path, conn_creds)

In [5]:
psql.print_table_spec('citations')

name                 pos type                           nullable max_length
---------------------------------------------------------------------------
record_id              1 bigint                         NO                 
project_id             2 integer                        NO                 
user_id                3 integer                        NO                 
insert_ts              4 timestamp without time zone    NO                 
type_of_work           5 character varying              YES      25        
title                  6 character varying              YES      250       
secondary_title        7 character varying              YES      250       
publication_year       8 smallint                       YES                
publication_month      9 smallint                       YES                
authors               10 ARRAY                          YES                
abstract              11 text                           YES                
keywords    

In [6]:
settings_file = '/Users/burtondewilde/Desktop/datakind/ci/conservation-intl/models/dedupe_citations_settings'
training_file = '/Users/burtondewilde/Desktop/datakind/ci/conservation-intl/models/dedupe_citations_training.json'

In [24]:
if os.path.exists(settings_file):
    logger.info('reading dedupe settings from %s', settings_file)
    with io.open(settings_file, mode='rb') as sf:
        deduper = dedupe.StaticDedupe(sf, num_cores=4)
else:
    
    variables = [
        #{'field': 'authors', 'type': 'Set', 'has missing': True},
        {'field': 'title', 'type': 'String', 'has missing': True},
        {'field': 'abstract', 'type': 'Text', 'has missing': True},
        {'field': 'publication_year', 'type': 'Exact', 'has missing': True},
        {'field': 'doi', 'type': 'String', 'has missing': True}
    ]
    deduper = dedupe.Dedupe(variables, num_cores=2)
    
    query = """
            SELECT authors, title, abstract, publication_year, doi, type_of_work
            FROM citations
            LIMIT 1000
            """
    deduper.sample({i: row for i, row in enumerate(psql.run_query(query))}, 25000)
    
    if os.path.exists(training_file):
        logger.info('reading labeled examples from %s', training_file)
        with io.open(training_file, mode='rt') as tf:
            deduper.readTraining(tf)
            
    logger.info('starting active labeling...')
    
    # use the 'y', 'n' and 'u' keys to flag duplicates; press 'f' when you are finished
    dedupe.consoleLabel(deduper)
    
    with io.open(training_file, mode='wt') as tf:
        deduper.writeTraining(tf)
        
    deduper.train(maximum_comparisons=1000000, recall=0.95)
    
    with io.open(settings_file, mode='wb') as sf:
        deduper.writeSettings(sf)
        
    deduper.cleanupTraining()

title : A study of low level vibrations as a power source for wireless sensor nodes
abstract : Advances in low power VLSI design, along with the potentially low duty cycle of wireless sensor nodes open up the possibility of powering small wireless computing devices from scavenged ambient power. A broad review of potential power scavenging technologies and conventional energy sources is first presented. Low-level vibrations occurring in common household and office environments as a potential power source are studied in depth. The goal of this paper is not to suggest that the conversion of vibrations is the best or most versatile method to scavenge ambient power, but to study its potential as a viable power source for applications where vibrations are present. Different conversion mechanisms are investigated and evaluated leading to specific optimized designs for both capacitive MicroElectroMechancial Systems (MEMS) and piezoelectric converters. Simulations show that the potential power 

y


title : Structure and growth of self-assembling monolayers
abstract : The structural phases and the growth of self-assembled monolayers (SAMs) are reviewed from a surface science perspective. with emphasis on simple model systems. The concept of self-assembly is explained, and different self-assembling materials are briefly discussed. A summary of the techniques used for the study of SAMs is given. Different general scenarios for structures obtained by self-assembly are described. Thiols on Au(111) surfaces are used as an archetypal system to investigate in detail the structural phase diagram as a function of temperature and coverage, the specific structural features on a molecular level, and the effect of changes of the molecular backbone and the end group on the structure of the SAM. Temperature effects including phase transitions are discussed. Concepts for the preparation of more complex structures such as multi-component SAMs, laterally structured SAMs, and heterostructures. also 

n


title : The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions
abstract : Objective-To test the feasibility of creating a valid and reliable checklist with the following features: appropriate for assessing both randomised and non-randomised studies; provision of both an overall score for study quality and a profile of scores not only for the quality of reporting, internal validity (bias and confounding) and power, but also for external validity. Design-A pilot version was first developed, based on epidemiological principles, reviews, and existing checklists for randomised studies. Face and content validity were assessed by three experienced reviewers and reliability was determined using two raters assessing 10 randomised and 10 non-randomised studies. Using different raters, the checklist was revised and tested for internal consistency (Kuder-Richardson 20), test-retest and inte

KeyboardInterrupt: 
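
Once a trained model is available, step 5 of the plan largely reduces to turning clustered matches into `is_duplicate` flags. The sketch below assumes clusters shaped like the `(record_ids, scores)` pairs returned by the matching step in dedupe's 1.x releases; the `flag_duplicates` helper, the choice of the lowest id as the canonical record, and both thresholds are hypothetical.

```python
def flag_duplicates(clusters, confirm_above=0.9, review_above=0.5):
    """Split clustered records into canonical records, automatically
    flagged duplicates, and records that need user confirmation."""
    canonical, duplicates, needs_review = set(), set(), set()
    for record_ids, scores in clusters:
        # assumption: the lowest record id in a cluster is kept as canonical
        keep = min(record_ids)
        canonical.add(keep)
        for record_id, score in zip(record_ids, scores):
            if record_id == keep:
                continue
            if score >= confirm_above:
                duplicates.add(record_id)      # confident enough to mark directly
            elif score >= review_above:
                needs_review.add(record_id)    # prompt the user before marking
    return canonical, duplicates, needs_review
```

Records in `needs_review` are the ones to show the user interactively; anything scoring below `review_above` is left unflagged in this sketch.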