# GAP heuristics, adapted from [this Kaggle kernel](https://www.kaggle.com/sattree/2-reproducing-gap-results)
If needed, change the paths below: the directory where the downloaded models are stored (`/home/kashn500/heavy_models/`), the paths to the input data, and the paths of the feature files to be written.

In [1]:
# change these if needed
PATH_TO_TRAIN = '../input/train.tsv'
PATH_TO_TEST = '../input/test.tsv'
PATH_OUT_TRAIN_FEAT = '../features/train_gap_heuristics.csv'
PATH_OUT_TEST_FEAT = '../features/test_gap_heuristics.csv'
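
The output paths above point into `../features/`, which may not exist on a fresh checkout. A minimal sketch, assuming you want that directory created before any features are written; `ensure_parent_dir` is a hypothetical helper and not part of the original kernel:

```python
import os

def ensure_parent_dir(path):
    """Create the parent directory of `path` if it is missing and return it.

    Hypothetical helper for illustration; the original notebook assumes
    the output directory already exists.
    """
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return parent
```

Calling `ensure_parent_dir(PATH_OUT_TRAIN_FEAT)` once before the `to_csv` calls further down should be enough, since both feature files share the same directory.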

## 1. Download necessary models and install dependencies

In [25]:
%%time
# Download and install all dependencies
# gpr_pub contains the heuristics models and supplementary code
!git clone https://github.com/sattree/gpr_pub.git
!wget -P /home/kashn500/heavy_models/ http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
!unzip /home/kashn500/heavy_models/stanford-corenlp-full-2018-10-05.zip
!pip install allennlp --ignore-installed greenlet
!pip install attrdict

Cloning into 'gpr_pub'...
remote: Enumerating objects: 290, done.
remote: Total 290 (delta 0), reused 0 (delta 0), pack-reused 290
Receiving objects: 100% (290/290), 5.34 MiB | 3.84 MiB/s, done.
Resolving deltas: 100% (128/128), done.
Checking connectivity... done.
--2019-04-01 15:40:27--  http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip [following]
--2019-04-01 15:40:27--  https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 393239982 (375M) [application/zip]
Saving to: ‘/home/kashn500/heavy_models/stanford-corenlp-full-2018-10-05.zip’

In [24]:
import pandas as pd  # used below for reading the TSV files and writing the features
from sklearn.metrics import log_loss, classification_report
from attrdict import AttrDict

import spacy

from allennlp.predictors.predictor import Predictor
from allennlp.models.archival import load_archive
from nltk.parse.corenlp import CoreNLPParser, CoreNLPDependencyParser

import sys
sys.path.append('../../')
from gpr_pub.utils import CoreNLPServer

# gap_scorer_ext has minor fixes for py3 and to take pandas df as input instead of filepaths
from gpr_pub.gap.gap_scorer_ext import read_annotations, calculate_scores, add_to_score_view

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


## 2. Initialize models

In [25]:
# Heuristic models implement coref resolution based on heuristics described in the paper
# Pronoun resolution is a simple wrapper to convert coref predictions into class-specific labels
# Multi pass sieve model implements backoff mechanism
from gpr_pub.models.heuristics.random_distance import RandomModel
from gpr_pub.models.heuristics.token_distance import TokenDistanceModel
from gpr_pub.models.heuristics.syntactic_distance import StanfordSyntacticDistanceModel
from gpr_pub.models.heuristics.parallelism import AllenNLPParallelismModel as ParallelismModel
from gpr_pub.models.heuristics.url_title import StanfordURLTitleModel as URLModel

from gpr_pub.models.pronoun_resolution import PronounResolutionModel

from gpr_pub.models.multi_pass_sieve import MultiPassSieveModel

In [26]:
# Instantiate stanford corenlp server
STANFORD_CORENLP_PATH = 'stanford-corenlp-full-2018-10-05/'
server = CoreNLPServer(classpath=STANFORD_CORENLP_PATH,
                        corenlp_options=AttrDict({'port': 9090, 
                                                  'timeout': '600000', 
                                                  'quiet': 'true',
                                                  'preload': 'tokenize,ssplit,lemma,parse,depparse'}))
server.start()
STANFORD_SERVER_URL = server.url

In [27]:
# !pip install cymem==1.31.2 spacy==2.0.12

In [28]:
# Instantiate base models
STANFORD_MODEL = CoreNLPParser(url=STANFORD_SERVER_URL)
SPACY_MODEL = spacy.load('en_core_web_lg')
model_url = 'https://s3-us-west-2.amazonaws.com/allennlp/models/biaffine-dependency-parser-ptb-2018.08.23.tar.gz'
archive = load_archive(model_url, cuda_device=1)
ALLEN_DEP_MODEL = Predictor.from_archive(archive)

Did not use initialization regex that was passed: .*weight_ih.*
Did not use initialization regex that was passed: .*bias_hh.*
Did not use initialization regex that was passed: .*bias_ih.*
Did not use initialization regex that was passed: .*weight_hh.*


In [29]:
# Instantiate heuristic models
random_coref_model = RandomModel(SPACY_MODEL)
random_proref_model = PronounResolutionModel(random_coref_model)

token_distance_coref_model = TokenDistanceModel(SPACY_MODEL)
token_distance_proref_model = PronounResolutionModel(token_distance_coref_model)

syntactic_distance_coref_model = StanfordSyntacticDistanceModel(STANFORD_MODEL)
syntactic_distance_proref_model = PronounResolutionModel(syntactic_distance_coref_model, n_jobs=12)

parallelism_coref_model = ParallelismModel(ALLEN_DEP_MODEL, SPACY_MODEL)
parallelism_proref_model = PronounResolutionModel(parallelism_coref_model)

url_title_coref_model = URLModel(STANFORD_MODEL)
url_title_proref_model = PronounResolutionModel(url_title_coref_model, n_jobs=12)

## 3. Featurize train data

In [30]:
train_df = pd.read_csv(PATH_TO_TRAIN, sep='\t')
train_df.columns = map(lambda x: x.lower().replace('-', '_'), train_df.columns)
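
The column renaming above lower-cases each header and replaces hyphens with underscores, so that columns can be accessed as attributes (for example `row.a_coref`). A small sketch of the same transformation on an empty frame; the header names below are what I understand the public GAP TSV files to contain and should be treated as an assumption, and `normalize_columns` is a hypothetical helper:

```python
import pandas as pd

def normalize_columns(df):
    """Return a copy with lower-cased, underscore-separated column names."""
    return df.rename(columns=lambda name: name.lower().replace('-', '_'))

# Assumed GAP header; adjust if your TSV differs.
demo = pd.DataFrame(columns=['ID', 'Text', 'Pronoun', 'Pronoun-offset',
                             'A', 'A-offset', 'A-coref',
                             'B', 'B-offset', 'B-coref', 'URL'])
normalized = normalize_columns(demo)
print(list(normalized.columns))
```

Unlike assigning to `df.columns` in place, `rename` leaves the original frame untouched, which can be convenient when the raw headers are still needed elsewhere.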

In [31]:
%%time
# Creates sieve pipeline of heuristic models, applying each new heuristic with appropriate backoff models
# Multi pass sieve - order of models provided as input is important
#    - left to right: recall increases
#    - right to left: precision increases
preds = MultiPassSieveModel(random_proref_model).predict(train_df)
score_df = add_to_score_view(preds, train_df, None, 'Random')

preds = MultiPassSieveModel(token_distance_proref_model).predict(train_df)
score_df = add_to_score_view(preds, train_df, score_df, 'Token Distance')

preds = MultiPassSieveModel(syntactic_distance_proref_model,
                           token_distance_proref_model).predict(train_df)
score_df = add_to_score_view(preds, train_df, score_df, 'Syntactic Distance')

preds = MultiPassSieveModel(parallelism_proref_model,
                            syntactic_distance_proref_model,
                           token_distance_proref_model).predict(train_df)
score_df = add_to_score_view(preds, train_df, score_df, 'Parallelism')

preds = MultiPassSieveModel(url_title_proref_model,
                            parallelism_proref_model,
                            syntactic_distance_proref_model,
                           token_distance_proref_model).predict(train_df)

100%|██████████| 2454/2454 [00:44<00:00, 55.53it/s]




100%|██████████| 2454/2454 [00:41<00:00, 59.29it/s]




[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    2.2s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:   14.1s
[Parallel(n_jobs=12)]: Done 426 tasks      | elapsed:   35.9s
[Parallel(n_jobs=12)]: Done 776 tasks      | elapsed:  1.1min
[Parallel(n_jobs=12)]: Done 1226 tasks      | elapsed:  1.7min
[Parallel(n_jobs=12)]: Done 1776 tasks      | elapsed:  2.7min
[Parallel(n_jobs=12)]: Done 2426 tasks      | elapsed:  3.7min
[Parallel(n_jobs=12)]: Done 2454 out of 2454 | elapsed:  3.9min finished
100%|██████████| 2454/2454 [00:59<00:00, 41.32it/s]




  0%|          | 0/2454 [00:00<?, ?it/s]Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
  6%|▌         | 149/2454 [01:06<31:05,  1.24it/s]

Dependency parse and tokenizer tokens dont match.


 12%|█▏        | 289/2454 [02:11<15:53,  2.27it/s]

Dependency parse and tokenizer tokens dont match.


 12%|█▏        | 298/2454 [02:15<16:41,  2.15it/s]

Dependency parse and tokenizer tokens dont match.


 37%|███▋      | 907/2454 [06:43<11:40,  2.21it/s]

Dependency parse and tokenizer tokens dont match.


 54%|█████▍    | 1334/2454 [10:06<08:14,  2.27it/s]

Dependency parse and tokenizer tokens dont match.


 68%|██████▊   | 1668/2454 [12:27<04:59,  2.63it/s]

Dependency parse and tokenizer tokens dont match.


 81%|████████  | 1977/2454 [14:34<03:23,  2.34it/s]

Dependency parse and tokenizer tokens dont match.


 92%|█████████▏| 2258/2454 [16:49<01:23,  2.35it/s]

Dependency parse and tokenizer tokens dont match.


100%|██████████| 2454/2454 [18:11<00:00,  2.43it/s]
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    2.4s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:   14.2s
[Parallel(n_jobs=12)]: Done 426 tasks      | elapsed:   36.1s
[Parallel(n_jobs=12)]: Done 776 tasks      | elapsed:  1.1min
[Parallel(n_jobs=12)]: Done 1226 tasks      | elapsed:  1.7min
[Parallel(n_jobs=12)]: Done 1776 tasks      | elapsed:  2.7min
[Parallel(n_jobs=12)]: Done 2426 tasks      | elapsed:  3.6min
[Parallel(n_jobs=12)]: Done 2454 out of 2454 | elapsed:  3.9min finished
100%|██████████| 2454/2454 [00:58<00:00, 41.62it/s]




[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    1.9s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:   15.1s
[Parallel(n_jobs=12)]: Done 426 tasks      | elapsed:   38.2s
[Parallel(n_jobs=12)]: Done 776 tasks      | elapsed:  1.2min
[Parallel(n_jobs=12)]: Done 1226 tasks      | elapsed:  1.8min
[Parallel(n_jobs=12)]: Done 1776 tasks      | elapsed:  2.7min
[Parallel(n_jobs=12)]: Done 2426 tasks      | elapsed:  3.7min
[Parallel(n_jobs=12)]: Done 2454 out of 2454 | elapsed:  3.9min finished
  6%|▌         | 149/2454 [01:04<23:30,  1.63it/s]

Dependency parse and tokenizer tokens dont match.


 12%|█▏        | 289/2454 [02:05<16:00,  2.25it/s]

Dependency parse and tokenizer tokens dont match.


 12%|█▏        | 298/2454 [02:09<16:57,  2.12it/s]

Dependency parse and tokenizer tokens dont match.


 37%|███▋      | 907/2454 [06:56<12:09,  2.12it/s]

Dependency parse and tokenizer tokens dont match.


 54%|█████▍    | 1334/2454 [10:01<07:59,  2.34it/s]

Dependency parse and tokenizer tokens dont match.


 68%|██████▊   | 1668/2454 [12:18<04:50,  2.70it/s]

Dependency parse and tokenizer tokens dont match.


 81%|████████  | 1977/2454 [14:33<03:22,  2.35it/s]

Dependency parse and tokenizer tokens dont match.


 92%|█████████▏| 2258/2454 [16:33<01:25,  2.29it/s]

Dependency parse and tokenizer tokens dont match.


100%|██████████| 2454/2454 [17:56<00:00,  2.39it/s]
[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    2.0s
[Parallel(n_jobs=12)]: Done 176 tasks      | elapsed:   14.8s
[Parallel(n_jobs=12)]: Done 426 tasks      | elapsed:   36.2s
[Parallel(n_jobs=12)]: Done 776 tasks      | elapsed:  1.1min
[Parallel(n_jobs=12)]: Done 1226 tasks      | elapsed:  1.7min
[Parallel(n_jobs=12)]: Done 1776 tasks      | elapsed:  2.7min
[Parallel(n_jobs=12)]: Done 2426 tasks      | elapsed:  3.7min
[Parallel(n_jobs=12)]: Done 2454 out of 2454 | elapsed:  3.9min finished
100%|██████████| 2454/2454 [00:52<00:00, 41.92it/s]

CPU times: user 23h 32min 56s, sys: 1min 11s, total: 23h 34min 7s
Wall time: 56min 2s
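
The comments in the cell above describe a backoff order: the first model passed to `MultiPassSieveModel` is the most precise, and later models act as fallbacks that raise recall. The toy sketch below illustrates that general idea only. It is not the `gpr_pub` implementation; how `MultiPassSieveModel` decides that a model has abstained is not shown in this notebook, so the `None` convention and every name here are assumptions:

```python
def sieve_predict(rows, *models):
    """Toy backoff: for each row, keep the answer of the first model
    that does not abstain (abstaining is signalled by returning None)."""
    predictions = []
    for row in rows:
        answer = None
        for model in models:
            answer = model(row)
            if answer is not None:
                break  # an earlier, more precise model answered
        # If every model abstained, predict neither A nor B.
        predictions.append(answer if answer is not None else (False, False))
    return predictions

# Hypothetical stand-ins for a precise heuristic and a catch-all fallback.
def precise_model(row):
    return (True, False) if row.get('hint') == 'A' else None

def fallback_model(row):
    return (False, True)  # always answers

rows = [{'hint': 'A'}, {'hint': None}]
print(sieve_predict(rows, precise_model, fallback_model))
```

Under these assumptions, swapping the order of the two models would let the fallback answer every row, which is why the order of the arguments matters.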





In [37]:
score_df = add_to_score_view(preds, train_df, score_df, 'Parallelism+URL')

| Model | M | F | B | O |
|---|---|---|---|---|
| Random | 45.23 | 48.38 | 1.07 | 46.81 |
| Token Distance | 49.02 | 46.84 | 0.96 | 47.93 |
| Syntactic Distance | 65.89 | 65.49 | 0.99 | 65.69 |
| Parallelism | 68.58 | 66.61 | 0.97 | 67.59 |
| Parallelism+URL | 72.4 | 71.02 | 0.98 | 71.7 |
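
As a quick check on the score table, the `B` column can be recomputed from `M` and `F`. The sketch assumes the GAP convention that `M` and `F` are F1 scores on masculine and feminine examples and that `B` (bias) is the ratio `F / M`; the numbers are copied from the table, and the `bias` helper is hypothetical:

```python
# (M, F) pairs copied from the score table above.
scores = {
    'Random': (45.23, 48.38),
    'Token Distance': (49.02, 46.84),
    'Syntactic Distance': (65.89, 65.49),
    'Parallelism': (68.58, 66.61),
    'Parallelism+URL': (72.4, 71.02),
}

def bias(masculine_f1, feminine_f1):
    """Ratio of feminine to masculine F1, rounded to two decimals."""
    return round(feminine_f1 / masculine_f1, 2)

for name, (m, f) in scores.items():
    print(f'{name}: B = {bias(m, f)}')
```

A value close to 1 indicates similar performance on both genders under this definition.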


In [35]:
len(preds)

2454

In [65]:
y_pred_train = pd.DataFrame(preds, columns=['gap_A', 'gap_B']).astype('uint8')
# y_pred_train['gap_NEITHER'] = 1 - y_pred_train['gap_A'] - y_pred_train['gap_B']
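
The commented-out line above derives a `gap_NEITHER` column by subtraction. If a heuristic ever marks both A and B as true, that subtraction could go below zero and, with an unsigned dtype, wrap around. A hedged alternative that uses boolean logic instead; it assumes `preds` is a sequence of `(A, B)` boolean pairs, as the two-column `DataFrame` construction suggests, and `to_gap_features` is a hypothetical helper:

```python
import pandas as pd

def to_gap_features(preds):
    """Build gap_A / gap_B / gap_NEITHER indicator columns from (A, B) pairs."""
    feats = pd.DataFrame(preds, columns=['gap_A', 'gap_B']).astype('uint8')
    # NEITHER is 1 only when both A and B are 0; no subtraction involved.
    neither = (feats['gap_A'] == 0) & (feats['gap_B'] == 0)
    feats['gap_NEITHER'] = neither.astype('uint8')
    return feats

demo = to_gap_features([(True, False), (False, True), (False, False), (True, True)])
print(demo)
```

Whether the extra column is useful depends on the downstream model; the original notebook leaves it out.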

In [86]:
y_pred_train.to_csv(PATH_OUT_TRAIN_FEAT, index=False)

## 4. Featurize test data

In [None]:
test_df = pd.read_csv(PATH_TO_TEST, sep='\t')
test_df.columns = map(lambda x: x.lower().replace('-', '_'), test_df.columns)

In [None]:
%%time
gap_test_preds = MultiPassSieveModel(url_title_proref_model,
                            parallelism_proref_model,
                            syntactic_distance_proref_model,
                           token_distance_proref_model).predict(test_df)

In [None]:
y_pred_test = pd.DataFrame(gap_test_preds, columns=['gap_A', 'gap_B']).astype('uint8')
# y_pred_test['gap_NEITHER'] = 1 - y_pred_test['gap_A'] - y_pred_test['gap_B']

In [None]:
y_pred_test.to_csv(PATH_OUT_TEST_FEAT, index=False)