# POS-tagging for comparative/superlative identification

__Contents__

0. [Start the Stanford CoreNLP server](#Start-the-Stanford-CoreNLP-server)
0. [Convenience function for POS tagging](#Convenience-function-for-POS-tagging)
0. [Comparative/Superlative identifiers](#Comparative/Superlative-identifiers)
0. [Data analysis](#Data-analysis)
  0. [Tag the data](#Tag-the-data)
  0. [Identify comparatives and superlatives](#Identify-comparatives-and-superlatives)
  0. [Inspection](#Inspection)
  0. [Write to file](#Write-to-file)

In [10]:
import json
import os
import pandas as pd
import nltk
from pycorenlp import StanfordCoreNLP

## Start the Stanford CoreNLP server

Before running this notebook, [get CoreNLP](http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip), go into its directory, and run

`java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000`

If you're using port 9000 for something else, change that value and then change `PORT` in the next cell.

(To suppress output, append `-prettyPrint false >/dev/null 2>&1` to the command.)
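
Before running the cells below, it can help to confirm that something is actually listening on the chosen port. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, not part of `pycorenlp`, and it only checks for a TCP listener, not that the listener is CoreNLP.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    PORT = 9000  # assumed to match the PORT value used in the notebook
    if port_is_open("localhost", PORT):
        print("Something is listening on port %d" % PORT)
    else:
        print("Nothing is listening on port %d; start the server first" % PORT)
```

If the check fails, start the server as described above (or adjust `PORT`) before continuing.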

In [11]:
PORT = 9000

nlp = StanfordCoreNLP('http://localhost:{}'.format(PORT))

## Convenience function for POS tagging

In [12]:
def stanford_pos(text):
    """
    POS-tag `text` with the CoreNLP server.

    Parameters
    ----------
    text : str
       CoreNLP handles all tokenizing, at the sentence and word level.

    Returns
    -------
    list of tuples (str, str)
       The first member of each pair is the word, the second its POS tag.
    """
    if not isinstance(text, basestring):
        # Report unexpected input types; the strip() call below will then fail.
        print '%s: %s' % (type(text), str(text))
    try:
        if text.strip() == '':
            return []

        ann = nlp.annotate(
            text,
            properties={'annotators': 'pos',
                        'outputFormat': 'json'})
        lemmas = []
        # If the response came back as a raw string instead of parsed JSON,
        # replace NUL characters and parse it here.
        if isinstance(ann, basestring):
            ann = json.loads(
                ann.replace('\x00', '?').encode('latin-1'),
                encoding='utf-8', strict=True)
        for sentence in ann['sentences']:
            for token in sentence['tokens']:
                lemmas.append((token['word'], token['pos']))
    except Exception:
        # Show the offending text before re-raising.
        print text
        raise
    return lemmas
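
The flattening step inside `stanford_pos` can be tried without a running server. The sketch below assumes only the response shape that the function itself relies on (`sentences` → `tokens` → `word`/`pos`); `pairs_from_annotation` and `fake_ann` are hypothetical names, and the annotation is hand-written rather than real server output.

```python
def pairs_from_annotation(ann):
    """Flatten a CoreNLP-style JSON annotation into (word, pos) pairs."""
    return [(token['word'], token['pos'])
            for sentence in ann['sentences']
            for token in sentence['tokens']]


# Hand-written stand-in for a server response. Only the keys used above
# are included; the tags are copied from the notebook's inspection output.
fake_ann = {
    'sentences': [
        {'tokens': [
            {'word': 'The', 'pos': 'DT'},
            {'word': 'darker', 'pos': 'JJR'},
            {'word': 'blue', 'pos': 'JJ'},
            {'word': 'one', 'pos': 'NN'},
        ]},
    ],
}

pairs = pairs_from_annotation(fake_ann)
print(pairs)
```

Because sentences are flattened into one list, the result has one pair per token across the whole message, regardless of how CoreNLP split it into sentences.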

## Comparative/Superlative identifiers

In [13]:
from nltk.stem.wordnet import WordNetLemmatizer

LEMMATIZER = WordNetLemmatizer()

def is_comp_sup(word, pos, tags, check_lemmatizer=False):
    """
    Parameters
    ----------
    word, pos : str, str
        The word and its POS tag.

    tags : iterable of str
        The tags considered positive evidence for comp/sup morphology.

    check_lemmatizer : bool
        If True, then if the `pos` is in `tags`, we also check that
        `word` is different from the lemmatized version of word
        according to WordNet, treating it as an adjective. This 
        could be used to achieve greater precision, perhaps at the
        expense of recall.
       
    Returns
    -------
    bool       
    """
    if pos not in tags:
        return False
    if check_lemmatizer and LEMMATIZER.lemmatize(word, 'a') == word:
        return False
    return True

def is_superlative(word, pos, check_lemmatizer=False):
    return is_comp_sup(
        word, pos, {'JJS', 'RBS'}, check_lemmatizer=check_lemmatizer)

def is_comparative(word, pos, check_lemmatizer=False):
    return is_comp_sup(
        word, pos, {'JJR', 'RBR'}, check_lemmatizer=check_lemmatizer)
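
To illustrate what `check_lemmatizer` changes, here is a small sketch that mirrors the logic of `is_comp_sup` but swaps WordNet for a toy lookup table so that it runs without NLTK data. `TOY_LEMMAS`, `toy_lemmatize`, and `is_comp_sup_sketch` are hypothetical, and the real WordNet lemmatizer may well behave differently on these words.

```python
# Hypothetical stand-in for the WordNet lemmatizer: maps a few inflected
# forms to their base form and leaves every other word unchanged.
TOY_LEMMAS = {'darker': 'dark', 'brightest': 'bright'}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


def is_comp_sup_sketch(word, pos, tags, check_lemmatizer=False,
                       lemmatize=toy_lemmatize):
    """Same decision logic as is_comp_sup, with a pluggable lemmatizer."""
    if pos not in tags:
        return False
    # With the check enabled, a word that lemmatizes to itself is rejected.
    if check_lemmatizer and lemmatize(word) == word:
        return False
    return True


COMPARATIVE_TAGS = {'JJR', 'RBR'}

# 'darker' carries comparative morphology, so it passes either way.
print(is_comp_sup_sketch('darker', 'JJR', COMPARATIVE_TAGS, check_lemmatizer=True))
# 'more' is tagged JJR in the data but is unchanged by the toy lemmatizer,
# so the stricter check filters it out.
print(is_comp_sup_sketch('more', 'JJR', COMPARATIVE_TAGS))
print(is_comp_sup_sketch('more', 'JJR', COMPARATIVE_TAGS, check_lemmatizer=True))
```

This is the precision/recall trade-off mentioned in the docstring: the check drops tokens whose tag says "comparative" but whose form shows no inflection.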

## Data analysis

In [15]:
d_human = (pd.read_csv('humanOutput/filteredCorpus.csv')
     .assign(source = 'human'))
d_prag = (pd.read_csv('modelOutput/speaker_big_sl_perp_sampled_message.csv')
     .assign(source = 'pragmatic'))
d_lit = (pd.read_csv('modelOutput/speaker_big_s0_untuned_sampled_message.csv')
     .assign(source = 'literal'))
d = pd.concat([d_human, d_prag, d_lit])

In [16]:
d['contents'] = d['contents'].fillna('')
d_human['contents'] = d_human['contents'].fillna('')
d_prag['contents'] = d_prag['contents'].fillna('')
d_lit['contents'] = d_lit['contents'].fillna('')

### Tag the data

In [17]:
stanford_pos('\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x81\xe4\xbd\xa0\xe5\xa5\xbd\xe5\x90\x97'.decode('utf-8'))

[(u'\u4f60\u597d', u'NN'), (u'\uff01', u'CD'), (u'\u4f60\u597d\u5417', u'CD')]

In [18]:
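# Each "lemma" here is a (word, POS tag) pair.
lemmas = []
for text in d['contents']:
    # Decode byte strings as UTF-8; convert any other non-unicode value.
    if isinstance(text, str):
        text = text.decode('utf-8')
    elif not isinstance(text, unicode):
        text = unicode(text)
    lemmas.append(stanford_pos(text))
d['lemmas'] = lemmas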

In [19]:
# for i, text in enumerate(d['contents'][62108:]):
#     lemmas.append(stanford_pos(unicode(text).decode('utf-8')))
# d['lemmas'] = lemmas

### Identify comparatives and superlatives

For each message, these steps build a list with a 1 at the position of every comparative/superlative token and a 0 everywhere else, so that each list stays aligned with the tagged tokens of the original text.

In [20]:
d['superlatives'] = [[1 if is_superlative(*lem) else 0 for lem in lemmas]
                     for lemmas in d['lemmas']]

In [21]:
d['comparatives'] = [[1 if is_comparative(*lem) else 0 for lem in lemmas]
                     for lemmas in d['lemmas']]
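
As a small worked example of this alignment, the sketch below builds the indicator list for one tagged message taken from the inspection output and then uses it to recover the flagged word. It tests the tag directly instead of calling `is_comparative`, which is equivalent when `check_lemmatizer` is left at its default; `COMPARATIVE_TAGS`, `flags`, and `hits` are names introduced here for illustration.

```python
COMPARATIVE_TAGS = {'JJR', 'RBR'}

# Tagged message copied from the notebook's inspection output.
tagged = [('The', 'DT'), ('darker', 'JJR'), ('blue', 'JJ'), ('one', 'NN')]

# One 0/1 entry per token, in token order.
flags = [1 if pos in COMPARATIVE_TAGS else 0 for _, pos in tagged]

# Because the lists are aligned, zipping them recovers the flagged words.
hits = [word for (word, _), flag in zip(tagged, flags) if flag]

print(flags)       # [0, 1, 0, 0]
print(hits)        # ['darker']
print(sum(flags))  # 1
```

Summing the indicator list gives the per-message count.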

Count the superlatives and comparatives in each message:

In [22]:
d['numSuper'] = [sum(counts) for counts in d['superlatives']]

d['numComp'] = [sum(counts) for counts in d['comparatives']]

### Inspection

Run the cell below to allow for non-scrolling display:

In [13]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;


In [14]:
d.query('numComp > 0').head()

Unnamed: 0,gameid,time,roundNum,sender,contents,source,lemmas,superlatives,comparatives,numSuper,numComp
0,1124-1,1459877203862,1,speaker,The darker blue one,human,"[(The, DT), (darker, JJR), (blue, JJ), (one, NN)]","[0, 0, 0, 0]","[0, 1, 0, 0]",0,1
13,1124-1,1459877360202,13,speaker,"One of the brown ones, the lighter shaded one",human,"[(One, CD), (of, IN), (the, DT), (brown, JJ), ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]",0,1
14,1124-1,1459877388314,14,speaker,The more vibrantly red one.~~~~~~ not the more...,human,"[(The, DT), (more, JJR), (vibrantly, RB), (red...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",0,2
31,1124-1,1459877544164,26,speaker,darker red,human,"[(darker, JJR), (red, NN)]","[0, 0]","[1, 0]",0,1
33,1124-1,1459877564218,28,speaker,"purple, darker one",human,"[(purple, JJ), (,, ,), (darker, JJR), (one, CD)]","[0, 0, 0, 0]","[0, 0, 1, 0]",0,1


In [45]:
d.query('numComp > 0 & source == "model"').head()

Unnamed: 0,gameid,time,roundNum,sender,contents,source,lemmas,superlatives,comparatives,numSuper,numComp
10,8994-5,1476489931875,46,speaker,"dark blue , lighter",model,"[(dark, JJ), (blue, JJ), (,, ,), (lighter, JJR)]","[0, 0, 0, 0]","[0, 0, 0, 1]",0,1
20,2641-2,1476489571015,6,speaker,"the purplish box . ~ one that is purple , more...",model,"[(the, DT), (purplish, NN), (box, NN), (., .),...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",0,1
26,2641-2,1476489632676,12,speaker,"brighter green , not the olive or dull color",model,"[(brighter, JJR), (green, JJ), (,, ,), (not, R...","[0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 0, 0, 0, 0, 0, 0, 0, 0]",0,1
30,2641-2,1476489717242,16,speaker,the duller purple ~ dullest ~ more purple,model,"[(the, DT), (duller, RBR), (purple, JJ), (~, N...","[0, 0, 0, 0, 1, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 1, 0]",1,2
41,2641-2,1476489870754,27,speaker,this one is a brightest on the rockies 2 look ...,model,"[(this, DT), (one, CD), (is, VBZ), (a, DT), (b...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1,1


### Write to file

In [23]:
(d.drop(['superlatives', 'comparatives'], axis=1)
 .to_csv("taggedColorMsgs.csv", index=False))
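
One thing to keep in mind when reading the file back: CSV has no list type, so the `lemmas` column comes back as text. The sketch below assumes the column is stored as the Python repr of each list and shows how `ast.literal_eval` can restore it. It uses the standard `csv` module and an in-memory buffer as a stand-in for the pandas round trip, so the details may differ from what `to_csv` actually writes.

```python
import ast
import csv
import io

# One illustrative row; the tagged tokens are copied from the inspection output.
rows = [{'contents': 'The darker blue one',
         'lemmas': [('The', 'DT'), ('darker', 'JJR'),
                    ('blue', 'JJ'), ('one', 'NN')]}]

# Write to an in-memory CSV; the list is stored as its string form.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['contents', 'lemmas'])
writer.writeheader()
writer.writerows(rows)

# Read it back: the column is now plain text, not a list.
buf.seek(0)
loaded = list(csv.DictReader(buf))
raw = loaded[0]['lemmas']

# literal_eval safely parses the text back into a list of tuples.
restored = ast.literal_eval(raw)
print(type(raw).__name__, type(restored).__name__)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for this kind of round trip.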