## A good spacy intro for getting tokens
https://nicschrading.com/project/Intro-to-NLP-with-spaCy/

## Load spacy.io
https://spacy.io/#install

In [1]:
import io
import codecs
from spacy.en import English  # legacy spaCy import path; newer releases load models via spacy.load()
nlp = English(parser=True, tagger=True)  # parser enabled so we can split into sentences

### Examples: iterating over spacy tokens to get words and lemmas
https://spacy.io/docs#token

The tokens have several properties.

Note the lemma handles conversion to lower case for us.

- lemma: the canonical or conceptual form of a word.
- orthographic: the word as written. 'Commonsense', 'commonsense' and 'common-sense' are three different orthographic words, each of them a single orthographic word, while 'common sense' is two orthographic words. The meaning does not change, though the way they are written does.

(https://www.sussex.ac.uk/webteam/gateway/file.php?name=essay---what-is-a-word.pdf&site=1)

I'm guessing the ints for the orthographic forms and lemmas are IDs from the English vocabulary (or the doc? see below) loaded for parsing.

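Note how the probability differs by orthographic representation ('Interval' vs. 'intervals' in the output below). That it would do so might not be obvious from the documentation. The probability is a log probability estimated from a large corpus, so the values are negative, and more negative values mean rarer words.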

https://spacy.io/docs#token-distributional
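To make the sign convention concrete, here is a minimal sketch of log probabilities computed from plain relative frequencies. The counts and the corpus size are made up for illustration, and spacy's own values come from its own corpus and estimation method, so the numbers will not match the output below.

```python
import math

# Hypothetical word counts from a made-up corpus (not spacy's data).
counts = {'with': 5000, 'health': 90, 'intervals': 25, 'Interval': 1}
total = 1000000  # assumed total number of tokens in the corpus

def log_prob(word):
    """Natural log of the word's relative frequency."""
    return math.log(counts[word] / float(total))

# Most frequent first: the rarer the word, the more negative the value.
for word in sorted(counts, key=log_prob, reverse=True):
    print('{}\t{:.2f}'.format(word, log_prob(word)))
```

Because the counts are kept per written form, 'Interval' and 'intervals' get different values even though they share a lemma.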

TODO: test this by looking up a few words directly by their ints, or vice versa. Maybe inspect the string store (strings) and note when it is loaded.

https://spacy.io/docs#stringstore
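The idea of a two-way lookup between strings and ints can be sketched with a toy table. `ToyStringStore` is a hypothetical class written for this note, not spacy's implementation, and the ids it assigns are unrelated to the ints spacy prints.

```python
class ToyStringStore(object):
    """Hypothetical string <-> int table; not spacy's implementation."""

    def __init__(self):
        self._ids = {}      # string -> int
        self._strings = []  # int -> string

    def add(self, string):
        """Return the id for string, assigning the next free int if it is new."""
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        """Look up by int to get the string, or by string to get the int."""
        if isinstance(key, int):
            return self._strings[key]
        return self._ids[key]

store = ToyStringStore()
ids = [store.add(w) for w in ['Interval', 'intervals', 'interval', 'intervals']]
print(ids)
print('{}\t{}'.format(store[ids[0]], store['interval']))
```

Each distinct written form gets its own id, and a repeated form gets the same id back, which is the behaviour the orth and lemma columns in the output below suggest.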

In [5]:
# TODO: add back code that created sentences array, probably what we currently have in 4.2.

In [4]:
print(sentences[0])
print('\n')
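for i, sentence in enumerate(sentences):
    parsed_data = nlp(sentence)
    for j, token in enumerate(parsed_data):  # separate index so the outer i is not overwritten
        print('{}\t{}\t{}\t{}'.format(token.orth, token.orth_, token.lemma, token.lemma_))
        if j > 20:  # cap the number of tokens printed per sentence
            break
    if i > 1:  # stop after the first three sentences
        break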

In [22]:
test = u'Interval intervals (20 symptoms with a health-caring professional) (P > 0.003).'

parsed_data = nlp(test)

for token in parsed_data:
    print('{}\t{}\t{}\t{}'.format(token.orth, token.orth_, token.lemma, token.lemma_))
print('\n')

for token in parsed_data:
    if len(token.lemma_) > 2:
        print('{}\t{}\t{}\t{}'.format(token.orth_, token.lemma, token.lemma_, token.prob))

90778	Interval	16220	interval
14156	intervals	16220	interval
506	(	506	(
1020	20	1020	20
4674	symptoms	11231	symptom
489	with	489	with
469	a	469	a
1357	health	1357	health
498	-	498	-
4899	caring	827	care
2231	professional	2231	professional
487	)	487	)
506	(	506	(
6159	P	10149	p
1216743	>	1216743	>
130535	0.003	130535	0.003
487	)	487	)
419	.	419	.


Interval	16220	interval	-15.9418096542
intervals	16220	interval	-12.7912664413
symptoms	11231	symptom	-11.159992218
with	489	with	-5.24324989319
health	1357	health	-9.29672241211
caring	827	care	-11.2292251587
professional	2231	professional	-10.0627040863
0.003	130535	0.003	-16.5937042236


## Gather positive and negative sentences for Arousal

These are all the annotations for the Arousal topic, as annotated by Mark.

    See 2_spacy_parse_annotations notebook.

In [2]:
wkdir = 'rdoc/results'
%cd $wkdir

In [3]:
!wc ./annotations_processed/AR_mk_pos
!wc ./annotations_processed/AR_mk_not_pos
! file -I annotations_processed/AR_mk_pos # utf-8

     201    5280   38255 ./annotations_processed/AR_mk_pos
     211    4215   29329 ./annotations_processed/AR_mk_not_pos
annotations_processed/AR_mk_pos: text/plain; charset=utf-8


In [4]:
sentences = []
with codecs.open('./annotations_processed/AR_mk_pos', mode='r', encoding='utf-8') as f:
    sents = f.read().splitlines() # 201
    for s in sents:
        s = s.replace('{{', '').replace('}}', '').strip()
        sentences.append(s)
len(sentences)

201

This is a temporary set of negative sentences for Arousal, by Mark.

    See 2_spacy_parse_annotations notebook.

In [5]:
neg_sentences = []
with codecs.open('./annotations_processed/AR_mk_tmp_neg', mode='r', encoding='utf-8') as f:
    neg_sents = f.read().splitlines() # 166
    for s in neg_sents:
        s = s.replace('{{', '').replace('}}', '').strip()
        neg_sentences.append(s)
len(neg_sentences)

166

## Use spacy to create bags of words from each sentence.
TODO: Research in which situations unique words, rather than repeated counts of the words, might be more useful. Does this depend in part on the size of the text, or is it simply something to explore in any data mining task?
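The two options in the TODO can be sketched side by side. The lemma list is hypothetical and written by hand; a real run would take the lemmas from the parser.

```python
from collections import Counter

# Hypothetical lemmas from one sentence.
lemmas = ['symptom', 'interval', 'symptom', 'health', 'symptom']

unique_bag = sorted(set(lemmas))  # presence only: each lemma appears once
count_bag = Counter(lemmas)       # keeps how often each lemma occurred

print(unique_bag)
print(count_bag.most_common())
```

For single sentences the counts are mostly 1, so the two bags carry nearly the same information; the difference should matter more for longer texts.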

In [7]:
### Start postgres if you want to test array wrapping in postgres
#!pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start

### spacy helper functions

In [6]:
def spacy_split_sentences(text):
    sentences = []
    #doc = nlp(text.decode('utf8')) #"This is a sentence. Here's another...".decode('utf8'))
    doc = nlp(text) #"This is a sentence. Here's another...".decode('utf8'))
    for span in doc.sents:
        #sentences.append(u''.join(doc[i].string for i in range(span.start, span.end)).encode('utf-8').strip())
        sentences.append(''.join(doc[i].string for i in range(span.start, span.end)))#.strip())
    return(sentences)

In [57]:
def spacy_lemma_gt_len(text, length=2):
    '''Create bag of unique lemmas, requiring lemma length > length
    
    Note: setting length to 1 may mess up our postgres arrays as we would
    get commas here, unless we were to quote everything.
    '''
    tokens = []
    #doc = nlp(text.decode('utf8')) #"This is a sentence. Here's another...".decode('utf8'))
    parsed_data = nlp(text) #"This is a sentence. Here's another...".decode('utf8'))
    for token in parsed_data:
        if len(token.lemma_) > length:
            tokens.append(token.lemma_.lower())
    return(list(set(tokens)))

In [75]:
def spacy_lemma_biwords_gt_len(text, length=3):
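    '''Create bag of unique bi-lemmas (pairs of adjacent lemmas), requiring
    the joined bi-lemma string to be longer than length characters.

    We crudely drop any bi-lemma in which either lemma is a comma or a quote
    character, to save us trouble when loading postgres arrays.
    '''
    biwords = []
    parsed_data = nlp(text)
    skip_chars = [',', '"', "'"]
    # NOTE: the range starts at 1, so the pair formed by the first two tokens is never emitted.
    for i in range(1, len(parsed_data) - 1):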
        skip = False
        biword = u'{} {}'.format(parsed_data[i].lemma_.lower(), parsed_data[i+1].lemma_.lower())
        if (parsed_data[i].lemma_ in skip_chars or parsed_data[i+1].lemma_ in skip_chars):
            skip = True
        if len(biword) > length and not skip:
            biwords.append(biword)
    return(list(set(biwords)))

test = u'A good, apple once told me there was a rotten worm inside.'
res = spacy_lemma_gt_len(test, length=4)
print(res)
res = spacy_lemma_biwords_gt_len(test, length=4)
# ', '.join(res) # note, flattens formatting.
print(', '.join(res))

[u'rotten', u'there', u'inside', u'apple']
worm inside, rotten worm, tell me, once tell, inside ., apple once, me there, there be, a rotten
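The pairing step in `spacy_lemma_biwords_gt_len` can also be written with `zip`, which walks every adjacent pair starting with the first two tokens. This is a sketch on a hand-written lemma list, not on spacy output; `lemma_bigrams` is a hypothetical helper and it leaves out the length filter.

```python
def lemma_bigrams(lemmas, skip=(',', '"', "'")):
    """Pair each lemma with its successor, dropping pairs that touch a skipped token."""
    pairs = []
    for left, right in zip(lemmas, lemmas[1:]):
        if left in skip or right in skip:
            continue
        pairs.append('{} {}'.format(left, right))
    return pairs

# Hypothetical lemma sequence, written by hand rather than produced by spacy.
lemmas = ['a', 'good', ',', 'apple', 'once', 'tell', 'me']
print(lemma_bigrams(lemmas))
```

Both pairs that touch the comma are dropped, so 'good' and 'apple' never end up in the same bi-lemma.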


In [70]:
def to_lemmas_and_lemma_biwords(sentence, lemma_len=2, bi_lemma_len=4):
    lemma = spacy_lemma_gt_len(sentence, length=lemma_len)
    lemma_biwords = spacy_lemma_biwords_gt_len(sentence, length=bi_lemma_len)
    return(lemma + lemma_biwords)

### Sidenote: what do we use as sentences?

We preserve sentences as parsed in step 2, the 2_spacy_parse_annotations notebook.

Note, we did some manual sentence splitting within the abstracts.

We wish to retain that splitting instead of using spacy to split here, as spacy is imperfect due to all the acronyms in our text.

In [27]:
# with codecs.open('./annotations_processed/AR_mk_pos', mode='r', encoding='utf-8') as f:
#     sents = spacy_split_sentences(f.read())
# len(sents) # 179

# sentences = []
# with codecs.open('./annotations_processed/AR_mk_pos', mode='r', encoding='utf-8') as f:
#     sents = f.read().splitlines() # 201
#     for s in sents:
#         sentences.append(spacy_split_sentences(s))
# len(sentences) # still have 201, spacy found no new splits.

### Create bag of words and biwords from each of our sentences.
The function assumes each bag of words (lemmas) should hold the actual words, with duplicate words removed.

Some approaches to using bags of words or lemmas require that a word appears more than once in the corpus (which makes sense, as a word is meaningless to use if it only occurs once across the corpus?).
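A minimal sketch of that filter, assuming the bags are lists of unique lemmas as produced above. The bags and the threshold `min_sentences` are made up for illustration.

```python
from collections import Counter

# Hypothetical bags of unique lemmas, one per sentence.
bags = [
    ['symptom', 'interval', 'health'],
    ['symptom', 'cancer'],
    ['health', 'survival'],
]

# Each bag holds unique lemmas, so counting across bags gives the
# number of sentences each lemma appears in.
sentence_freq = Counter(lemma for bag in bags for lemma in bag)

min_sentences = 2  # assumed threshold: keep lemmas seen in at least two sentences
filtered = [[lemma for lemma in bag if sentence_freq[lemma] >= min_sentences]
            for bag in bags]
print(filtered)
```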

#### Ex. long lemmas from the 1st sentence

In [65]:
print(len(sentences))
for i, sentence in enumerate(sentences):
    print(sentence)
    long_lemmas = spacy_lemma_gt_len(sentence, length=10)  # lemmas longer than 10 characters
    print('\n'.join(long_lemmas))
    if i >= 0:
        break
    print('\n')

201
Sociodemographic inequalities in the stage of diagnosis and cancer survival may be partly due to differences in the appraisal interval (time from noticing a bodily change to perceiving a reason to discuss symptoms with a health-care professional).
professional
sociodemographic


In [69]:
### Test unicode OK across all sentences

In [70]:
## positives
# print(len(sentences))
# bags_of_lemmas = []
# for sentence in sentences:
#     lemma_2 = spacy_lemma_gt_len(sentence, length=2)
#     bags_of_lemmas.append(lemma_2)
# print(len(bags_of_lemmas))

## negatives
# print(len(neg_sentences))
# neg_bags_of_lemmas = []
# for sentence in neg_sentences:
#     lemma_2 = spacy_lemma_gt_len(sentence, length=2)
#     neg_bags_of_lemmas.append(lemma_2)
# print(len(neg_bags_of_lemmas))

### positives and negatives as bags of lemmas and lemma biwords

In [76]:
print(len(sentences))
bags_of_lemmas = []

for sentence in sentences:
    lbw = to_lemmas_and_lemma_biwords(sentence)
    bags_of_lemmas.append(lbw)
print(len(bags_of_lemmas))

print(len(neg_sentences))
neg_bags_of_lemmas = []
for sentence in neg_sentences:
    lbw = to_lemmas_and_lemma_biwords(sentence)
    neg_bags_of_lemmas.append(lbw)
print(len(neg_bags_of_lemmas))

201
201
166
166


In [3]:
wkdir = './rdoc/results'
%cd $wkdir

### Saving positives and negatives
- #TODO: the positives are not strictly positive yet; we still need to remove the annotations from 'irrelevant' docs.
- remember, positives had '{{' tags somewhere in the sentence.
- we are skipping the 'neutral' sentences, i.e. not saving anything else (the untagged sentences) from a positive document.
- tmp_neg are any sentences from documents that were not positive.

Note: '{' and '}' wrap the array for psql, and \N leaves an empty field for the deepdive reserved id.

In [77]:
# Each row: {lemma, lemma, ...} <tab> label <tab> \N
# '\\N' is a literal backslash followed by N, written so it does not depend on how '\N' is parsed.
with codecs.open('./annotations_processed/AR_mk_pos_bags-of-lemmas', 'w', encoding='utf-8') as f:
    for b in bags_of_lemmas:
        b_arr = ', '.join(b)
        f.write(u'{{{}}}\t{}\t{}\n'.format(b_arr, '+arousal', '\\N'))

with codecs.open('./annotations_processed/AR_mk_tmp_neg_bags-of-lemmas', 'w', encoding='utf-8') as f:
    for b in neg_bags_of_lemmas:
        b_arr = ', '.join(b)
        f.write(u'{{{}}}\t{}\t{}\n'.format(b_arr, '-arousal', '\\N'))
#./annotations_processed/AR_mk_tmp_neg_bags-of-lemmas
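The saving step above avoids commas and quotes by dropping the tokens that contain them. An alternative, mentioned in the `spacy_lemma_gt_len` docstring, is to quote everything. The sketch below assumes the file is loaded with Postgres COPY in its tab-delimited text format; whether the deepdive loader does exactly that is an assumption. All three helper names are hypothetical.

```python
def pg_array_literal(items):
    """Build a Postgres array literal with every element double-quoted."""
    quoted = []
    for item in items:
        # Inside a quoted element, backslash and double quote are backslash-escaped.
        escaped = item.replace('\\', '\\\\').replace('"', '\\"')
        quoted.append('"{}"'.format(escaped))
    return '{' + ','.join(quoted) + '}'

def copy_escape(field):
    """Escape backslash, tab, newline and carriage return for COPY text format."""
    return (field.replace('\\', '\\\\')
                 .replace('\t', '\\t')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r'))

def copy_row(bag, label):
    """One tab-delimited line: array, label, and a NULL marker for the reserved id."""
    return '\t'.join([copy_escape(pg_array_literal(bag)), label, '\\N']) + '\n'

print(copy_row(['apple', 'good , apple'], '+arousal'))
```

With quoting in place, bi-lemmas that contain a comma could be kept instead of skipped, at the cost of a slightly noisier file.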

### Misc (some other features from spacy)
Can iterate over entities (ents), noun chunks (noun_chunks), sentences (sents)...

In [75]:
# print(sentences[17])
# print('.....')
# for i in range(17,18):
#     print(sentences[i])
#     print('------')
#     doc = nlp(sentences[i], tag=True)
#     print('Entities :')
#     for ent in doc.ents:
#         print(ent)
#     print('Noun chunks :')
#     for nchunk in doc.noun_chunks:
#         print(nchunk)
#     for sent in doc.sents:
#          print(sent)