<a href="https://colab.research.google.com/github/akamalas5/Capstone/blob/main/neural_coref.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

neuralcoref is compatible only with Python 3.7 and spaCy 2.1.0, so install these versions before running coreference resolution.

In [1]:
!apt-get install python3.7

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3.7 is already the newest version (3.7.12-1+bionic1).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


Install spaCy, neuralcoref, and the `en_core_web_lg` model

In [2]:
!pip install spacy==2.1.0
!pip install neuralcoref
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.1.0/en_core_web_lg-2.1.0.tar.gz

Collecting spacy==2.1.0
  Downloading spacy-2.1.0-cp37-cp37m-manylinux1_x86_64.whl (27.7 MB)
Collecting blis<0.3.0,>=0.2.2
  Downloading blis-0.2.4-cp37-cp37m-manylinux1_x86_64.whl (3.2 MB)
Collecting thinc<7.1.0,>=7.0.2
  Downloading thinc-7.0.8-cp37-cp37m-manylinux1_x86_64.whl (2.1 MB)
Collecting preshed<2.1.0,>=2.0.1
  Downloading preshed-2.0.1-cp37-cp37m-manylinux1_x86_64.whl (82 kB)
Collecting plac<1.0.0,>=0.9.6
  Downloading plac-0.9.6-py2.py3-none-any.whl (20 kB)
Installing collected packages: preshed, plac, blis, thinc, spacy
  Attempting uninstall: preshed
    Found existing installation: preshed 3.0.6
    Uninstalling preshed-3.0.6:
      Successfully uninstalled preshed-3.0.6
  Attempting uninstall: plac
    Found existing in
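
Since the install pins exact versions, a quick check after installation can catch a mismatched runtime before the model is loaded. The following is only a sketch: `environment_ok`, `REQUIRED_PYTHON` and `REQUIRED_SPACY` are hypothetical names introduced here, not part of spaCy or neuralcoref, and the function simply compares the values passed to it against the versions stated at the top of this notebook.

```python
import sys

# Versions stated at the top of this notebook (assumed to be the only supported pair).
REQUIRED_PYTHON = (3, 7)
REQUIRED_SPACY = "2.1.0"


def environment_ok(python_version, spacy_version):
    """Return True when the major.minor Python version and the spaCy version match."""
    return (tuple(python_version[:2]) == REQUIRED_PYTHON
            and spacy_version == REQUIRED_SPACY)


# Typical use in a notebook cell, once spaCy is installed:
#   import spacy
#   print(environment_ok(sys.version_info, spacy.__version__))
```
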

In [3]:
import pandas as pd
import re
import spacy
import neuralcoref
import en_core_web_lg

nlp = en_core_web_lg.load()
neuralcoref.add_to_pipe(nlp)

100%|██████████| 40155833/40155833 [00:01<00:00, 27880865.44B/s]


<spacy.lang.en.English at 0x7f865a3a2dd0>

In [4]:
def coref_resolution(text):
    """Replace each coreferring mention in `text` with its cluster's main mention."""
    doc = nlp(text)
    # Token texts including trailing whitespace, so joining them reproduces the text.
    tok_list = [token.text_with_ws for token in doc]
    for cluster in doc._.coref_clusters:
        # Words of the cluster's representative (main) mention.
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            if coref == cluster.main:
                continue  # never replace the representative mention itself
            # Skip mentions that share any word with the main mention (this also
            # covers identical text). This was done to handle nested coreference scenarios.
            if set(coref.text.split(' ')) & cluster_main_words:
                continue
            # Write the main mention into the first token slot, keep the trailing
            # whitespace of the mention's last token, and blank the remaining slots.
            tok_list[coref.start] = cluster.main.text + doc[coref.end - 1].whitespace_
            for i in range(coref.start + 1, coref.end):
                tok_list[i] = ""

    return "".join(tok_list)
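
The replacement logic in this function can be tried without loading spaCy or neuralcoref. The sketch below mirrors it on plain Python lists. `Mention` and `resolve` are hypothetical stand-ins invented for illustration (a `Mention` plays the role of a spaCy span, `spaces` plays the role of each token's trailing whitespace); this is not the neuralcoref API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for a spaCy span covering tokens[start:end]."""
    start: int
    end: int


def resolve(tokens: List[str], spaces: List[str],
            main: Mention, mentions: List[Mention]) -> str:
    """Replace every mention with the main mention unless they share a word."""
    def text_of(m: Mention) -> str:
        pieces = [t + s for t, s in zip(tokens[m.start:m.end], spaces[m.start:m.end])]
        return "".join(pieces).rstrip()

    out = [t + s for t, s in zip(tokens, spaces)]
    main_text = text_of(main)
    main_words = set(main_text.split(" "))
    for m in mentions:
        if m == main:
            continue  # the representative mention stays as it is
        if set(text_of(m).split(" ")) & main_words:
            continue  # overlapping words: treated as a nested mention, left alone
        # First slot gets the main text plus the whitespace of the last token.
        out[m.start] = main_text + spaces[m.end - 1]
        for i in range(m.start + 1, m.end):
            out[i] = ""
    return "".join(out)


tokens = ["Anna", "met", "Bob", ".", "She", "smiled", "."]
spaces = [" ", " ", "", " ", " ", "", ""]
print(resolve(tokens, spaces, Mention(0, 1), [Mention(0, 1), Mention(4, 5)]))
# prints: Anna met Bob. Anna smiled.
```

With a main mention such as "The brake system" and a later mention "The system", the shared words trigger the overlap guard and the text comes back unchanged.
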

In [5]:
coref_resolution('The brake system and hose system should to be mounted on chassis.  it should also be linked with pedal. Pedel is part of under body. it should be in the hood') 

'The brake system and hose system should to be mounted on chassis.  The brake system and hose system should also be linked with pedal. Pedel is part of under body. Pedel should be in the hood'

In [None]:
input_data = 'data/wiki_brake_all_pages_df.csv'
brake_df = pd.read_csv(input_data)
brake_df.head()

Unnamed: 0.1,Unnamed: 0,wiki_text,text_clean
0,0,A brake is a mechanical device that inhibits m...,A brake is a mechanical device that inhibits m...
1,1,Most brakes commonly use friction between two ...,Most brakes commonly use friction between two ...
2,2,Brakes are generally applied to rotating axles...,Brakes are generally applied to rotating axles...
3,3,Since kinetic energy increases quadratically w...,Since kinetic energy increases quadratically w...
4,4,Almost all wheeled vehicles have a brake of so...,Almost all wheeled vehicles have a brake of so...


### Resolve coreferences in the cleaned-up text

In [None]:
brake_df['text_clean'] = brake_df['text_clean'].astype(str)
brake_df['tex_coref_resolved'] = brake_df['text_clean'].apply(coref_resolution)
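
One caveat with `astype(str)`: a missing value (NaN) becomes the literal string `'nan'`, which would then be passed to `coref_resolution` as if it were text. A minimal sketch of a guard, assuming empty text is an acceptable substitute for missing rows; `to_text` is a hypothetical helper, not something used elsewhere in this notebook.

```python
import math


def to_text(value):
    """Return '' for missing values (None or float NaN), otherwise str(value)."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


# Possible use in place of astype(str):
#   brake_df['text_clean'] = brake_df['text_clean'].map(to_text)
```
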

In [None]:
brake_df.head()

Unnamed: 0.1,Unnamed: 0,wiki_text,text_clean,tex_coref_resolved
0,0,A brake is a mechanical device that inhibits m...,A brake is a mechanical device that inhibits m...,A brake is a mechanical device that inhibits m...
1,1,Most brakes commonly use friction between two ...,Most brakes commonly use friction between two ...,Most brakes commonly use friction between two ...
2,2,Brakes are generally applied to rotating axles...,Brakes are generally applied to rotating axles...,Brakes are generally applied to rotating axles...
3,3,Since kinetic energy increases quadratically w...,Since kinetic energy increases quadratically w...,Since kinetic energy increases quadratically w...
4,4,Almost all wheeled vehicles have a brake of so...,Almost all wheeled vehicles have a brake of so...,Almost all wheeled vehicles have a brake of so...


### Write the coreference-resolved DataFrame to a CSV file. This file is the input for term extraction.

In [None]:
brake_df.to_csv("../data/wiki_brake_all_pages_with_coref_df.csv")