General:
----------------------------------------------------------------------------------------------------------------------------
This notebook extracts data from the databases as sentences in the form of list of tokens (or list of words).

Prerequisites:
 - Download sentences.db from __FIXME__: Add public ling here

Sentence processing using SpaCy:
----------------------------------------------------------------------------------------------------------------------------
- Remove stop words
- Remove punctuation
- Mask numbers e.g.: 18 --> dd, 2018-->dddd, 34.54--> dd.dd
- Lookup word lemma and replace tokens with lemmas when they exist. 
  "Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of 
   words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word,  
   which is known as the lemma." (https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)


In [1]:
# -*- coding: utf-8 -*-
import os
import re
import errno
import random
import sqlite3
import subprocess
import pymysql.cursors
from random import shuffle
from __future__ import division
from abbreviations import get_abbreviations

In [2]:
def prepare_directories(): 
    # FIXME: Add link to dowload db in this directory once it's created?
    try:
        os.mkdir("db")
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
        pass
    try:
        os.mkdir("models")
    except OSError as exc:
        if exc.errno != errno.EEXIST:
            raise
        pass

In [3]:
# Create directories to store database and models
prepare_directories()

In [4]:
# Connect to db
def connect_to_db():
    database = "db/sentences.db"
    conn = create_connection(database)
    return conn

In [5]:
# Connect to DB
def create_connection(db_file):
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
 
    return None

In [6]:
def cleanup_token(tkn):
    new_tkn = tkn.strip().strip(',').strip('.')
    if len(new_tkn)>=2 and new_tkn[0] == '(' and new_tkn[len(new_tkn)-1]==')':
        new_tkn=new_tkn.rstrip(')').lstrip('(')
    return new_tkn

In [7]:
# Get sentences
def get_sentences_from_db(conn):
    dictionary = {}
    docids = []
    cur = conn.cursor()
    cur.execute("select sentence, docid from sentences where haspolys = 1") 
    rows = cur.fetchall()
 
    for row in rows:
        sentence = row[0]
        docid = row [1]
        
        #combine sentences with identical docids
        if docid not in docids: 
            docids.append(docid)
            sentences_list = list()
            sentences_list.append(sentence)
            dictionary[docid] = sentences_list
            
        else: 
            sentences_list.append(sentence)

    # Returns sentences as list of tokens where each sentence is broken into words using spaces except within parenthesis
    return dictionary

In [8]:
# Connects to db
connection = connect_to_db()

In [9]:
# Gets sentences and write training and testing files
sentences = get_sentences_from_db(connection)

In [10]:
# Gets abbreviation-polymer pairs and returns as dictionary sorted by docid
polymer_abbrs_dictionary = {}
keys = list()

#gets abbreviation:polymer pair for each docid
for key in sentences: 
    values = sentences[key]
    abbrs = get_abbreviations(values)
    abbreviations = abbrs.items()
    polymer_abbrs_dictionary[key] = abbreviations

In [11]:
#combine entries for each docid dictionary into one dictionary

dictionary_values = polymer_abbrs_dictionary.values()
dictionary3 = {}
for item in dictionary_values: 
    dictionary2 = dict(item)
    for key, value in dictionary2.items(): 
        dictionary3[key]=value

In [12]:
#print polymers: [every acronym found for that polymer]

acronyms = {}
polymers = list()
for key,value in dictionary3.items():
    abbrev = key
    polymer = value
    if polymer not in polymers: 
        polymers.append(polymer)
        abbreviations = list()
        abbreviations.append(abbrev)
        acronyms[polymer] = abbreviations
    else: 
        abbreviations.append(abbrev)

for key in acronyms:
    print key,':',acronyms[key]

cyclohexene oxide : [u'CHO']
phosphonation levels : [u'PLs', u'RALLS-SEC']
linear N-alkyl poly(m-benzamide)s : [u'LPAs']
ionic self-assembly : [u'ISA']
instrument response function : [u'IRF']
controlled radical polymerization techniques : [u'CRP']
poly[N-9′-heptadecanyl-2,7-carbazole-alt-5,5-(4′,7′-di-2-thienyl-2′,1′,3′-benzothiadiazole)] : [u'PCDTBT']
poly(diisopropyl-p-vinylbenzyl phosphonate) : [u'PDIPVBP']
electron affinities : [u'EAs']
core−shell−corona : [u'CSC']
nonequilibrium lattice fluid : [u'NELF']
poly(methacryloyloxyethyl acrylate) : [u'PMEA']
Poly(olefin sulfone)s : [u'POSs']
copolyimides : [u'coPIs']
piezoresponse force microscopy : [u'PFM', u'PFO']
poly(acetylene) : [u'PA']
poly(amino urethane) : [u'PAU']
amphiphilic polymer conetwork : [u'APCN']
poly(glycidyl methacrylate) : [u'PGMA']
vinyl pivalate, i : [u'VPi']
polyacrylamide : [u'PAM']
molecular dynamics : [u'MD']
poly(hexamethylene oxide) : [u'PHMO']
A-b-B/C blend system, poly(vinylphenol-b-methyl methacrylate)/pol

weight fraction of PEO : [u'wPEO']
compositions were nicely controlled by the VSC/ε-CL molar feed ratio, in which : [u'CL-co-VSC']
grafting ratios : [u'Gr']
poly(L-lysine) : [u'PLys']
monomer conversions, thus allowing for the efficient formation of : [u'multi']
band excitation : [u'BE']
styrene sodium sulfonate : [u'SSS']
poly(α-methyl-α-ethyl-β-propiolactone)s : [u'PMEPLs']
butyl acrylate : [u'BA']
initiation mechanism : [u'Ic']
single wall carbon nanotubes : [u'SWCNT']
poly(bithiophene phenylene) : [u'PBTP']
refractive index detector : [u'RID']
3-(2,2,2-Trifluoroethoxymethyl)-3-methyloxetane : [u'3FOx']
Block copolymers : [u'BCP']
l-3-(3,4-dihydroxyphenyl)alanine : [u'l-Dopa']
polyfluoreneethynylene : [u'PFE']
terpyridine : [u'terpy']
poly(urethane–amide)s : [u'PUA']
X-ray fluorescence : [u'XRF']
National Synchrotron Radiation Research Center : [u'NSRRC']
poly((9,9-dioctylfluorene)-2,7-diyl-alt-[4,7-bis(thiophen-5-yl)-2,1,3-benzothiadiazole]-2′,2″-diyl) : [u'PFTBT']
Synthesis of 1-P

polyfluorenes : [u'PFs']
polymerization-induced phase separation : [u'PIPS']
poly(ethylene glycol) methyl ether acrylate : [u'PEGA']
excess : [u'ee']
macromonomers : [u'MMs']
polybenzimidazoles : [u'PBIs']
fluoromethylstyrene : [u'FMST']
poly(ethylene oxide)−dimethyl ether, : [u'PEO\u2212DME']
poly(3-heptylthiophene) : [u'P3HpT']
electroluminescence : [u'EL']
cyclic poly(ethylene oxides) : [u'CPEOs']
hydroxyl-containing initiator for the ring-opening polymerization of lactide, producing a maleimide functionalized polylactide : [u'HEMI-PLLA']
macromolecular chain transfer agents : [u'Macro-CTAs']
coordinative chain transfer polymerization : [u'CCTP', u'PSAs']
diblock copolymer brushes : [u'DCBs']
poly(p-methylstyrene) : [u'PpMS']
methyl 4,6-O-benzylidene-2,3-O-carbonyl-R-d-glucopyranoside : [u'MBCG']
total-heat-flow : [u'total HF']
shell cross-linked reverse micelles : [u'SCRM']
weight ratio PEG:α-CD was 32.3:67.7 : [u'w/w']
Si substrate with a native oxide layer : [u'Si/SiOx']
Gel perm

studied polymer families for : [u'semi']
fluoride ions : [u'F\u2013']
poly(vinyl cinnamate) : [u'PVCN']
M1, and pol : [u'M1']
polytetrahydrophenanthrene : [u'PTHP']
amino-4′-(p-aminophenoxy)triphenylamine : [u'AAPT']
denote the fine chemical structure of PSm-b-P(DCH)n−x- : [u'DCH-Ru']
Kumada catalyst transfer polymerization : [u'KCTP']
polystyrene is replaced with a nonaromatic cyclohexyl ring : [u'PCHE']
head−tail : [u'H\u2212T']
thermally rearranged polybenzoxazoles : [u'TR-PBOs']
a random copolymer having similar comonomer composition, pol : [u'A-Pro-OMe']
and that the use of these Zr(IV) : [u'and Hf(IV)']
order-to-disorder transition : [u'ODT']
head-to-head and tail-to-tail : [u'H-H-T-T']
Isotactic poly(3-methyl-1-pentene) : [u'iP3MP']
atom transfer radical polymerization : [u'ATRP']
poly(2-vinyloxirane carbonate) : [u'PVIC']
syndiotactic : [u'st-']
temperature superposition : [u'tTS']
PEO composition will increase with assembly pH; for : [u'PEO/PAA']
benzoyl peroxide : [u'BPO']
li

carbonyl : [u'C\u2550O', u'\u03b1MS']
food and drug administration : [u'FDA']
polymeric ionic liquids : [u'PIL']
1-phenylethyl bromide : [u'1-PEBr']
was related to the orientation of the solvent : [u'water', u'Mw']
extreme ultraviolet : [u'EUV']
methyl methacrylate : [u'MMA']
bisacrylamide : [u'BIS']
poly(vinyl acetate) : [u'PVAc']
poly(tert-butyl acrylate) : [u'PtBA']
cyclic structure formation through hydrogen bonding between two pendant acryloyl groups : [u'CH\xb7\xb7\xb7OC']
cyano-4-hydroxycinnamic acid : [u'CHCA']
polymer, bearing two π-rich cone-like calix[5]arene cavities : [u'PC[5]']
PS-b-poly(n-alkyl methacrylates) : [u'PS-b-PnAMA']
poly(1,2-methylenedioxybenzene) : [u'PMDOB']
trimethylolpropane triacrylate : [u'TMPTA']
Polyphosphoester Ionomer : [u'PPEI']
tetracyanoethylene : [u'TCNE']
Differential thermal analyses : [u'DTA']
dynamic mechanical analysis of pol : [u'DNMA']
Vogel−Fulcher−Tammann : [u'VFT']
poly(2-chlorostyrene)/poly(vinyl methyl ether) : [u'P2ClS/PVME']
interpe