1. Ingest citation data from uploaded RIS or BibTeX files, or from individual records entered manually
1. Parse citation data according to its input format, and standardize field names and values across formats as much as possible
1. Munge parsed data into a format convenient for importing into a database, e.g. CSV -> Postgres (see the [`COPY` documentation](https://www.postgresql.org/docs/9.5/static/sql-copy.html))
1. Import citations into the database with additional columns, e.g. citation_id, project_id, user_id, is_duplicate (NULL to start), confirmed_duplicate, ...
1. Apply the trained dedupe model to new citations vs. existing citations for the given project and find possible matches; when in doubt, interactively prompt the user to confirm duplicates; mark duplicate records in the db accordingly
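
As a rough illustration of step 2, the sketch below renames format-specific fields to shared names. The `FIELD_MAPS` table, the `standardize_record` helper, and the shape of the parsed input (a dict per record, with RIS authors already collected into a list) are assumptions made for this example, and only a handful of RIS tags and BibTeX fields are covered.

```python
# Hypothetical mapping from format-specific tags to shared field names.
# Only a small subset of RIS tags / BibTeX fields is covered here.
FIELD_MAPS = {
    "ris": {"TI": "title", "AU": "authors", "PY": "publication_year", "AB": "abstract"},
    "bibtex": {"title": "title", "author": "authors", "year": "publication_year",
               "abstract": "abstract"},
}


def standardize_record(raw, source_format):
    """Rename the fields of one parsed record to shared names; drop unmapped fields."""
    if source_format not in FIELD_MAPS:
        raise ValueError("unsupported format: %r" % (source_format,))
    mapping = FIELD_MAPS[source_format]
    # RIS tags are conventionally upper case; BibTeX field names are case-insensitive.
    normalize_key = str.upper if source_format == "ris" else str.lower

    record = {}
    for key, value in raw.items():
        name = mapping.get(normalize_key(key))
        if name is not None:
            record[name] = value

    # BibTeX joins authors with " and "; split them into a list like the RIS input.
    authors = record.get("authors")
    if isinstance(authors, str):
        record["authors"] = [a.strip() for a in authors.split(" and ") if a.strip()]

    # Years may carry trailing characters (e.g. "2015///" in RIS); keep 4 leading digits.
    year = record.get("publication_year")
    if year is not None:
        digits = str(year).strip()[:4]
        record["publication_year"] = int(digits) if digits.isdigit() else None

    return record


ris_record = {"TI": "Forest loss", "AU": ["Smith, J.", "Lee, K."], "PY": "2015///", "N1": "note"}
bibtex_record = {"Title": "Forest loss", "author": "Smith, J. and Lee, K.", "year": "2015"}
print(standardize_record(ris_record, "ris"))
print(standardize_record(bibtex_record, "bibtex"))
```

Once both formats share field names and value types, the later import and dedupe steps only have to deal with a single schema.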

In [1]:
import logging
from psycopg2 import connect as psql_connect
from psycopg2 import Error as PsycopgError


def get_connection(host, database, port=5439, username=None, password=None):
    """
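    Get a PostgreSQL connection, or None if connecting fails.
    Note: the default port, 5439, is Amazon Redshift's default; PostgreSQL's is 5432.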
    
    @param host: str
    @param database: str, name of the database.
    @param port: int
    @param username: str
    @param password: str
    @return: psycopg2._psycopg.connection|None
    """
    connection_params = {
        'host': host,
        'port': port,
        'user': username,
        'password': password,
        'database': database,
    }

    try:
        connection = psql_connect(**connection_params)
        logging.info('Connected to Redshift, %s:%s/%s', host, port, database)
        return connection
    except PsycopgError:
        logging.exception('Failed to connect to Redshift, %s:%s/%s', host, port, database)

    return None
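
The import step (CSV -> Postgres via `COPY`) could look roughly like the sketch below, which takes a connection such as the one returned by `get_connection`. The `citations` table, its column names, and both helper functions are hypothetical; `cursor.copy_expert` is psycopg2's method for running a `COPY ... FROM STDIN` statement against a file-like object. The table and column names are interpolated straight into the SQL, so they must come from trusted code, never from user input.

```python
import csv
import io


def records_to_csv_buffer(records, columns):
    """Serialize dict records to an in-memory CSV file with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # Missing keys and None values are written as empty fields, which
        # COPY's csv format is expected to read as NULL by default.
        writer.writerow({col: record.get(col) for col in columns})
    buffer.seek(0)
    return buffer


def copy_citations(connection, records, table="citations",
                   columns=("project_id", "user_id", "title", "publication_year")):
    """Bulk-load records into a (hypothetical) citations table using COPY."""
    buffer = records_to_csv_buffer(records, columns)
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv, HEADER true)".format(
        table, ", ".join(columns))
    cursor = connection.cursor()
    try:
        cursor.copy_expert(sql, buffer)
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()
```

Columns such as is_duplicate are left out of the `COPY` column list on purpose, so that they start as NULL and can be filled in after deduplication.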

In [10]:
"""
This code demonstrates how to use dedupe with a comma separated values
(CSV) file. All operations are performed in memory, so will run very
quickly on datasets up to ~10,000 rows.

We start with a CSV file containing our messy data. In this example,
it is listings of early childhood education centers in Chicago
compiled from several different sources.

The output will be a CSV with our clustered results.

For larger datasets, see our [mysql_example](mysql_example.html)
"""
from future.builtins import next

import os
import csv
import re
import logging
import optparse

import dedupe
from unidecode import unidecode

# ## Logging

log_level = logging.WARNING 
logging.getLogger().setLevel(log_level)

# ## Setup

input_file = '/Users/burtondewilde/Desktop/datakind/ci/conservation-intl/data/raw/csv_example_messy_input.csv'
outputs_path = '/Users/burtondewilde/Desktop/datakind/ci/conservation-intl/data/processed'
output_file = os.path.join(outputs_path, 'csv_example_output.csv')
settings_file = os.path.join(outputs_path, 'csv_example_learned_settings')
training_file = os.path.join(outputs_path, 'csv_example_training.json')


def preProcess(column):
    """
    Do a little bit of data cleaning with the help of Unidecode and Regex.
    Things like casing, extra spaces, quotes and new lines can be ignored.
    """
    try:  # Python 2/3 string differences
        column = column.decode('utf8')
    except AttributeError:
        pass
    column = unidecode(column)
    column = re.sub('  +', ' ', column)
    column = re.sub('\n', ' ', column)
    column = column.strip().strip('"').strip("'").lower().strip()
    # If data is missing, indicate that by setting the value to `None`
    if not column:
        column = None
    return column


def readData(filename):
    """
    Read in our data from a CSV file and create a dictionary of records,
    where each key is a unique record ID and each value is a dict of that
    row's cleaned fields.
    """

    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
            row_id = int(row['Id'])
            data_d[row_id] = dict(clean_row)

    return data_d

print('importing data ...')
data_d = readData(input_file)

# If a settings file already exists, we'll just load that and skip training
if os.path.exists(settings_file):
    print('reading from', settings_file)
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
else:
    # ## Training

    # Define the fields dedupe will pay attention to
    fields = [
        {'field' : 'Site name', 'type': 'String'},
        {'field' : 'Address', 'type': 'String'},
        {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
        {'field' : 'Phone', 'type': 'String', 'has missing' : True},
        ]

    # Create a new deduper object and pass our data model to it.
    deduper = dedupe.Dedupe(fields)

    # To train dedupe, we feed it a sample of records.
    deduper.sample(data_d, 15000)

    # If we have training data saved from a previous run of dedupe,
    # look for it and load it in.
    # __Note:__ if you want to train from scratch, delete the training_file
    if os.path.exists(training_file):
        print('reading labeled examples from ', training_file)
        with open(training_file, 'rb') as f:
            deduper.readTraining(f)

    # ## Active learning
    # Dedupe will find the next pair of records
    # it is least certain about and ask you to label them as duplicates
    # or not.
    # use 'y', 'n' and 'u' keys to flag duplicates
    # press 'f' when you are finished
    print('starting active labeling...')

    dedupe.consoleLabel(deduper)

    # Using the examples we just labeled, train the deduper and learn
    # blocking predicates
    deduper.train()

    # When finished, save our training to disk
    with open(training_file, 'w') as tf:
        deduper.writeTraining(tf)

    # Save our weights and predicates to disk.  If the settings file
    # exists, we will skip all the training and learning next time we run
    # this file.
    with open(settings_file, 'wb') as sf:
        deduper.writeSettings(sf)
        
# Find the threshold that will maximize a weighted average of our
# precision and recall.  With recall_weight=1, as here, precision and
# recall count equally; setting it to 2 would say we care twice as
# much about recall as we do precision.
#
# If we had more data, we would not pass in all the blocked data into
# this function but a representative sample.

threshold = deduper.threshold(data_d, recall_weight=1)

# ## Clustering

# `match` will return sets of record IDs that dedupe
# believes are all referring to the same entity.

print('clustering...')
clustered_dupes = deduper.match(data_d, threshold)

print('# duplicate sets', len(clustered_dupes))

# ## Writing Results

# Write our original data back out to a CSV with a new column called 
# 'Cluster ID' which indicates which records refer to each other.

cluster_membership = {}
cluster_id = 0
for (cluster_id, cluster) in enumerate(clustered_dupes):
    id_set, scores = cluster
    cluster_d = [data_d[c] for c in id_set]
    canonical_rep = dedupe.canonicalize(cluster_d)
    for record_id, score in zip(id_set, scores):
        cluster_membership[record_id] = {
            "cluster id" : cluster_id,
            "canonical representation" : canonical_rep,
            "confidence": score
        }

singleton_id = cluster_id + 1

with open(output_file, 'w') as f_output, open(input_file) as f_input:
    writer = csv.writer(f_output)
    reader = csv.reader(f_input)

    heading_row = next(reader)
    heading_row.insert(0, 'confidence_score')
    heading_row.insert(0, 'Cluster ID')
    canonical_keys = canonical_rep.keys()
    for key in canonical_keys:
        heading_row.append('canonical_' + key)

    writer.writerow(heading_row)

    for row in reader:
        row_id = int(row[0])
        if row_id in cluster_membership:
            cluster_id = cluster_membership[row_id]["cluster id"]
            canonical_rep = cluster_membership[row_id]["canonical representation"]
            row.insert(0, cluster_membership[row_id]['confidence'])
            row.insert(0, cluster_id)
            for key in canonical_keys:
                row.append(canonical_rep[key].encode('utf8'))
        else:
            row.insert(0, None)
            row.insert(0, singleton_id)
            singleton_id += 1
            for key in canonical_keys:
                row.append(None)
        writer.writerow(row)

importing data ...


Site name : el valor - carlos cantu
Address : 2434 s kildare ave
Zip : None
Phone : None

Site name : el valor - carlos cantu
Address : 2434 s kildare ave
Zip : None
Phone : None

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


starting active labeling...
10


(y)es / (n)o / (u)nsure / (f)inished


5


(y)es / (n)o / (u)nsure / (f)inished


y


Site name : pathways to learning i/t
Address : 3450-54 w. 79th st
Zip : None
Phone : 4369244

Site name : el valor - kidz colony
Address : 6287 s archer ave
Zip : None
Phone : 7678522

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


Site name : ferguson cpc
Address : 1420 n. hudson
Zip : None
Phone : 5348580

Site name : henry booth house precious little ones
Address : 5327 s michigan ave
Zip : 60615
Phone : None

1/10 positive, 1/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


Site name : community learning center, inc.
Address : 10612-20 south wentworth
Zip : 60628
Phone : 9284104

Site name : community learning center
Address : 10612 s wentworth avenue
Zip : 60628
Phone : 9284104

1/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


Site name : chicago public schools new field primary school
Address : 1707 w. morse
Zip : 60626
Phone : 5342760

Site name : healy
Address : 3040 s. parnell
Zip : 60616
Phone : 5349170

2/10 positive, 2/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


Site name : easter seals society of metropolitan chicago allison's infant & toddler center
Address : 234 e 114th st
Zip : 60628
Phone : 8404502

Site name : henry booth house allison's
Address : 34 e. 115th st.
Zip : 60628
Phone : 8404502

2/10 positive, 3/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


Site name : catholic charities-st mark
Address : 1041 n. campbell
Zip : 60622
Phone : 7726606

Site name : catholic charities chicago - st. mark
Address : 1041 n campbell avenue
Zip : 60622
Phone : 7726606

2/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


Site name : national teachers acad
Address : 55 w. cermack
Zip : None
Phone : 5349970

Site name : chicago public schools n.t.a. (national teachers academy)
Address : 55 w. cermak
Zip : 60616
Phone : 5349970

3/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


Site name : henry booth house - little hands & feet
Address : 7801 s wolcott ave
Zip : None
Phone : None

Site name : evers
Address : 9811 s. lowe
Zip : None
Phone : 5352565

4/10 positive, 4/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


Site name : beethoven
Address : 25 w. 47th st.
Zip : None
Phone : 5351480

Site name : beethoven
Address : 4421 s. state st.
Zip : 60609
Phone : 5351480

4/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


u


Site name : chicago youth centers - rebecca k. crown / cyc
Address : 7601 s phillips ave
Zip : None
Phone : 6481550

Site name : chicago youth centers rebecca crown
Address : 7601 s. phillips
Zip : 60649
Phone : 7310444

4/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


u


Site name : dumas
Address : 6615 s. kenwood ave
Zip : None
Phone : 5350802

Site name : dumas
Address : 6650 s. ellis
Zip : 60637
Phone : 5350750

4/10 positive, 5/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


n


Site name : kiddy kare preschools little learners
Address : 5923 w. 63rd st.
Zip : None
Phone : 5815541

Site name : el valor - little learners
Address : 5923 w 63rd st
Zip : None
Phone : 5815541

4/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


Site name : viva family center
Address : 2516 w. division
Zip : 60622
Phone : 2529100

Site name : children's home viva family center
Address : 2516 w. division
Zip : 60602
Phone : 2526313

5/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


y


Site name : douglass-tubman youth ministries,inc. - douglass-tubman child development center
Address : 5010 w chicago ave
Zip : None
Phone : 6266581

Site name : douglas-tubman child development center
Address : 5010 w chicago avenue
Zip : 60651
Phone : 2683053

6/10 positive, 6/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished


f


Finished labeling


clustering...


# duplicate sets 849


In [11]:
threshold

0.4760381
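
The threshold printed above comes from `deduper.threshold`, which the example's comments describe as maximizing a weighted average of precision and recall. The sketch below illustrates that idea with an F-beta-style objective over made-up pair scores. The `pick_threshold` helper and the numbers are invented for illustration, and this is not presented as dedupe's own implementation.

```python
def pick_threshold(probabilities, recall_weight=1.0):
    """Return the score cutoff that maximizes an F-beta-style objective.

    Each score is treated as the probability that a pair is a true duplicate,
    so the expected number of true duplicates among the top-k pairs is the
    sum of their scores.
    """
    ranked = sorted(probabilities, reverse=True)
    total = sum(ranked)
    if total <= 0:
        raise ValueError("need at least one positive score")

    best_score, best_cutoff = -1.0, None
    expected_dupes = 0.0
    for k, probability in enumerate(ranked, start=1):
        expected_dupes += probability
        recall = expected_dupes / total
        precision = expected_dupes / k
        # Proportional to F-beta with beta = recall_weight.
        score = recall * precision / (recall + recall_weight ** 2 * precision)
        if score > best_score:
            best_score, best_cutoff = score, probability
    return best_cutoff


# Made-up pair scores, for illustration only.
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05]
print(pick_threshold(scores, recall_weight=1))  # precision and recall weighted equally
print(pick_threshold(scores, recall_weight=2))  # recall favored over precision
```

On these scores, raising `recall_weight` pushes the cutoff lower, so more candidate pairs are accepted as duplicates at the cost of precision.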

In [7]:
# TODO: get data into a suitable format

# data = {unique_id_1: dict(record_1),
#         unique_id_2: dict(record_2)}
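
One way to fill in this TODO is sketched below: turn citation rows into the `{unique_id: record}` mapping that dedupe expects, using the field names from the `variables` definition. The input shape (dicts with a `citation_id` key and a list of authors) and both helpers are assumptions. Authors are stored as a tuple on the understanding that dedupe's `Set` type wants a hashable sequence, and the year is stored as a string on the assumption that dedupe's comparisons and predicates operate on text.

```python
import re


def clean_text(value):
    """Collapse whitespace, lower-case, and map empty values to None."""
    if value is None:
        return None
    value = re.sub(r"\s+", " ", str(value)).strip().lower()
    return value or None


def citations_to_dedupe_data(citations):
    """Build {citation_id: record} from an iterable of citation dicts."""
    data = {}
    for citation in citations:
        cleaned_authors = (clean_text(a) for a in citation.get("authors") or [])
        authors = tuple(a for a in cleaned_authors if a)
        year = citation.get("publication_year")
        data[citation["citation_id"]] = {
            "authors": authors or None,  # hashable sequence for the 'Set' type
            "title": clean_text(citation.get("title")),
            "abstract": clean_text(citation.get("abstract")),
            "publication_year": str(year) if year is not None else None,
        }
    return data


# Hypothetical citation row, for illustration only.
citations = [
    {"citation_id": 7, "authors": ["Lee,  K.", "Smith, J."],
     "title": " Forest\nLoss ", "abstract": "", "publication_year": 2015},
]
data = citations_to_dedupe_data(citations)
print(data)
```

Because `title` and `abstract` are declared without `'has missing': True`, records lacking those fields may need to be filtered out, or the field definitions adjusted, before sampling.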

In [9]:
import dedupe

In [3]:
variables = [
    {'field': 'authors', 'type': 'Set', 'has missing': True},
    {'field': 'title', 'type': 'String'},
    {'field': 'abstract', 'type': 'Text'},
    {'field': 'publication_year', 'type': 'Exact', 'has missing': True},
]

In [6]:
deduper = dedupe.Dedupe(variables, num_cores=2)

In [None]:
sample_size = 15000
deduper.sample(data, sample_size=sample_size, blocked_proportion=0.5)

In [None]:
# use the 'y', 'n' and 'u' keys to flag duplicates; press 'f' when you are finished
dedupe.consoleLabel(deduper)

In [None]:
# using the examples we just labeled, train the deduper and learn blocking predicates
deduper.train()
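
After training, the remaining pipeline step is to turn matches into duplicate flags in the database. The sketch below assumes clusters shaped like the `(id_set, scores)` pairs unpacked in the CSV example above. The `plan_duplicate_flags` helper, the choice of the lowest id as the record to keep, and the 0.9 auto-confirm cutoff are all hypothetical choices made for illustration.

```python
def plan_duplicate_flags(clustered_dupes, auto_confirm=0.9):
    """Map each clustered record id to its duplicate status.

    Records scoring below `auto_confirm` are flagged for interactive
    confirmation by the user instead of being marked automatically.
    """
    flags = {}
    for id_set, scores in clustered_dupes:
        scored = sorted(zip(id_set, scores))
        # Assumption: keep the lowest id (e.g. the earliest import) as canonical.
        canonical_id = scored[0][0]
        for record_id, score in scored:
            is_duplicate = record_id != canonical_id
            flags[record_id] = {
                "canonical_id": canonical_id,
                "is_duplicate": is_duplicate,
                "needs_review": is_duplicate and score < auto_confirm,
            }
    return flags


# Made-up clusters, for illustration only.
clusters = [((3, 1, 8), (0.95, 0.97, 0.55)), ((4, 5), (0.99, 0.99))]
for record_id, flag in sorted(plan_duplicate_flags(clusters).items()):
    print(record_id, flag)
```

Records with `needs_review` set would be shown to the user for confirmation; the rest could be written straight to the is_duplicate column.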