# Coreference Resolution 

- Application of the neuralcoref spaCy module: https://spacy.io/universe/project/neuralcoref

## Install the required spaCy version and the required packages

- Note: this notebook requires a specific spaCy version to run: spacy==2.1.0

In [1]:
# !python -m pip install spacy==2.1.0

In [2]:
# !pip install neuralcoref
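Because the notebook depends on the exact pin spacy==2.1.0, it can help to confirm the installed version before running the remaining cells. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pin` are illustrative and not part of spaCy or neuralcoref.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package, required):
    """Return (matches, found): whether the installed version equals the required pin."""
    found = installed_version(package)
    return found == required, found


# The pin comes from the note above; everything else here is a hypothetical helper.
matches, found = check_pin("spacy", "2.1.0")
if not matches:
    print(f"Expected spacy==2.1.0 but found {found}; the cells below may not run.")
```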

In [3]:
import neuralcoref

In [4]:
# !python -m spacy download en_core_web_lg


In [5]:
# !python -m spacy download en

# Coreference Resolution - Exploration 

## Coreference Resolution Example from spaCy 

- spaCy's Large English model is utilized

In [6]:
import spacy
import neuralcoref
import en_core_web_lg

# Load spaCy's large English model (equivalent to spacy.load('en_core_web_lg'))
nlp = en_core_web_lg.load()

# Register NeuralCoref as a pipeline component
neuralcoref.add_to_pipe(nlp)

doc1 = nlp('My sister has a dog. She loves him.')
print(doc1._.coref_clusters)

doc2 = nlp('Angela lives in Boston. She is quite happy in that city.')
for ent in doc2.ents:
    print(ent._.coref_cluster)


[My sister: [My sister, She], a dog: [a dog, him]]
Angela: [Angela, She]
Boston: [Boston, that city]


In [7]:
doc1._.coref_resolved

'My sister has a dog. My sister loves a dog.'
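The resolved string above can be read as "replace every non-main mention with the main mention of its cluster". The sketch below illustrates that idea on plain character spans. It is not neuralcoref's implementation: the function `resolve_mentions` and the hand-written `clusters` structure are assumptions made for illustration only.

```python
def resolve_mentions(text, clusters):
    """Replace each mention span with the main mention of its cluster.

    `clusters` is a list of (main_mention, [(start, end), ...]) pairs, where the
    spans are character offsets of the mentions to rewrite.
    """
    replacements = []
    for main, spans in clusters:
        for start, end in spans:
            replacements.append((start, end, main))
    # Work from right to left so that earlier offsets remain valid.
    for start, end, main in sorted(replacements, reverse=True):
        text = text[:start] + main + text[end:]
    return text


text = 'My sister has a dog. She loves him.'
she = text.index('She')
him = text.index('him')

# Hand-written clusters mirroring the printed output: main mention plus other mentions.
clusters = [
    ('My sister', [(she, she + len('She'))]),
    ('a dog', [(him, him + len('him'))]),
]

resolved = resolve_mentions(text, clusters)
print(resolved)
```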

## Import the "final_dataframe.csv" as constructed in notebook 1: PDF Parsing & Dataframe Creation

In [8]:
import pandas as pd
import ast

In [9]:
final_dataframe = pd.read_csv(r'C:\Users\Johnn\Desktop\Thesis-Coding\final_dataframe.csv')

def convert_to_list(column):
    return column.apply(ast.literal_eval)

final_dataframe['Content'] = convert_to_list(final_dataframe['Content'])
final_dataframe['Cleaned_Content'] = convert_to_list(final_dataframe['Cleaned_Content'])
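The `ast.literal_eval` step is needed because a CSV cell can only hold text: a Python list written to CSV comes back as its string representation. A small standard-library sketch of that round trip follows; the sample paragraphs are made up for illustration.

```python
import ast
import csv
import io

paragraphs = ["first paragraph", "second paragraph; with, punctuation"]

# Writing a list to CSV stores its string form in a single cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Cleaned_Content"])
writer.writerow([paragraphs])

# Reading the file back yields a plain string, not a list.
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][0]
print(type(raw_cell).__name__)

# literal_eval safely parses the string back into the original list.
restored = ast.literal_eval(raw_cell)
print(restored == paragraphs)
```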

In [10]:
def coref_solve(text):
    """Return `text` with each coreferring mention replaced by its cluster's main mention."""
    doc = nlp(text)
    return doc._.coref_resolved


In [52]:
coref_solve('Angela lives in Boston. She is quite happy in that city.')

'Angela lives in Boston. Angela is quite happy in Boston.'

In [12]:
final_dataframe.head()

Unnamed: 0,Title,Articles,Content,Cleaned_Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...


In [13]:
"".join(final_dataframe.iloc[1,3])

'‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes;‘application document’ means a tender, a request to participate, a grant application or an application in a contest for prizes;‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for the selection of experts or persons or entities implementing the budget pursuant to point (c) of the first subparagraph of article 62(1);‘basic act’ means a legal act, other than a recommendation or an opinion, which provides a legal basis for an action and for the implementation of the corresponding expenditure entered in the budget or of the budgetary guarantee or financial assistance backed by the budget, and which may take any of the following forms: (a) in implementation of the treaty on the functioning of the european union (tfeu) and the treaty establishing the european atom

In [14]:
final_dataframe["Articles"].unique()

array(['Article 1  Subject matter ', 'Article 2  Definitions ',
       'Article 3  Compliance of secondary legislation with this Regulation ',
       'Article 4  Periods, dates and time limits ',
       'Article 5  Protection of personal data ',
       'Article 6  Respect for budgetary principles ',
       'Article 7  Scope of the budget ',
       'Article 8  Specific rules on the principles of unity and budgetary accuracy ',
       'Article 9  Definition ',
       'Article 10  Budgetary accounting for revenue and appropriations ',
       'Article 11  Commitment of appropriations ',
       'Article 12  Cancellation and carry-over of appropriations ',
       'Article 13  Detailed provisions on cancellation and carry-over of appropriations ',
       'Article 14  Decommitments ',
       'Article 15  Making appropriations corresponding to decommitments available again ',
       'Article 16  Rules applicable in the event of late adoption of the budget ',
       'Article 17  Definition and s

In [15]:
final_dataframe[final_dataframe["Articles"]=='Article 282  Entry into force and application '].iloc[0,3]

['this regulation shall enter into force on the third day following that of its publication in the official journal of the european union.',
 'it shall apply from 2 august 2018.',
 'by way of derogation from paragraph 2 of this article: (a) article 271(1)(a), article 272(2), article 272(10)(a), article 272(11)(b)(i), (c), (d) and (e), article 272(12)(a), (b)(i) and (c), article 272(14)(c), article 272(15), (17), (18), (22) and (23), article 272(26)(d), article 272(27)(a)(i), article 272(53), and (54), article 272(55)(b)(i), article 273(3), article 276(2) and article 276(4)(b) shall apply from 1 january 2014; (b) article 272(11)(a) and (f), article 272(13), article 272(14)(b), article 272(16), article 272(19)(a) and article 274(3) shall apply from 1 january 2018; (c) articles 6 to 60, 63 to 68, 73 to 207, 241 to 253 and 264 to 268 shall apply from 1 january 2019 as regards the implementation of the administrative appropriations of union institutions; this is without prejudice to point (

In [16]:
example_text = "\n ".join(final_dataframe[final_dataframe["Articles"]=='Article 282  Entry into force and application '].iloc[0,3])
example_text

'this regulation shall enter into force on the third day following that of its publication in the official journal of the european union.\n it shall apply from 2 august 2018.\n by way of derogation from paragraph 2 of this article: (a) article 271(1)(a), article 272(2), article 272(10)(a), article 272(11)(b)(i), (c), (d) and (e), article 272(12)(a), (b)(i) and (c), article 272(14)(c), article 272(15), (17), (18), (22) and (23), article 272(26)(d), article 272(27)(a)(i), article 272(53), and (54), article 272(55)(b)(i), article 273(3), article 276(2) and article 276(4)(b) shall apply from 1 january 2014; (b) article 272(11)(a) and (f), article 272(13), article 272(14)(b), article 272(16), article 272(19)(a) and article 274(3) shall apply from 1 january 2018; (c) articles 6 to 60, 63 to 68, 73 to 207, 241 to 253 and 264 to 268 shall apply from 1 january 2019 as regards the implementation of the administrative appropriations of union institutions; this is without prejudice to point (h) of

In [17]:
coref_solve(example_text)

'this regulation shall enter into force on the third day following that of this regulation publication in the official journal of the european union.\n this regulation shall apply from 2 august 2018.\n by way of derogation from paragraph 2 of this article: (a) article 271(1)(a), article 272(2), article 272(10)(a), article 272(11)(b)(i), (c), (d) and (e), article 272(12)(a), (b)(i) and (c), article 272(14)(c), article 272(15), (17), (18), (22) and (23), article 272(26)(d), article 272(27)(a)(i), article 272(53), and (54), article 272(55)(b)(i), article 273(3), article 276(2) and article 276(4)(b) shall apply from 1 january 2014; (b) article 272(11)(a) and (f), article 272(13), article 272(14)(b), article 272(16), article 272(19)(a) and article 274(3) shall apply from 1 january 2018; (c) articles 6 to 60, 63 to 68, 73 to 207, 241 to 253 and 264 to 268 shall apply from 1 january 2019 as regards the implementation of the administrative appropriations of union institutions; this is without 

In [19]:
example_text.split("\n ") == final_dataframe[final_dataframe["Articles"]=='Article 282  Entry into force and application '].iloc[0,3]

True
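The check above relies on joining with "\n " and splitting on the same separator being a lossless round trip. That holds only while no paragraph contains the separator itself. The sketch below makes the condition explicit; the function names and sample paragraphs are illustrative assumptions.

```python
SEPARATOR = "\n "


def join_paragraphs(paragraphs, sep=SEPARATOR):
    """Join a list of paragraphs into one text."""
    return sep.join(paragraphs)


def split_paragraphs(text, sep=SEPARATOR):
    """Split a joined text back into paragraphs."""
    return text.split(sep)


def round_trips(paragraphs, sep=SEPARATOR):
    """True if joining and then splitting reproduces the original list."""
    return split_paragraphs(join_paragraphs(paragraphs, sep), sep) == list(paragraphs)


clean = ["this regulation shall enter into force.", "it shall apply from 2 august 2018."]
leaky = ["a paragraph with an embedded\n line break", "another paragraph"]

print(round_trips(clean))
print(round_trips(leaky))
```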

In [51]:
coref_solve('‘prize’ means a financial contribution given as a reward following a contest. where such a contribution is provided under direct management, it shall be governed by title ix;')

'‘prize’ means a financial contribution given as a reward following a contest. where such a contribution is provided under direct management, such a contribution shall be governed by title ix;'

In [20]:
dataframe_I = final_dataframe.explode('Cleaned_Content')

In [21]:
dataframe_I.reset_index()

Unnamed: 0,index,Title,Articles,Content,Cleaned_Content
0,0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,this regulation lays down the rules for the es...
1,1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,‘applicant’ means a natural person or an entit...
2,1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,"‘application document’ means a tender, a reque..."
3,1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,‘award procedure’ means a procurement procedur...
4,1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,"‘basic act’ means a legal act, other than a re..."
...,...,...,...,...,...
1099,280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...","without prejudice to article 279(3), the commi..."
1100,280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...",references to the repealed regulation shall be...
1101,281,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 282 Entry into force and application,[ \nThis Regulation shall enter into force on ...,this regulation shall enter into force on the ...
1102,281,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 282 Entry into force and application,[ \nThis Regulation shall enter into force on ...,it shall apply from 2 august 2018.


In [22]:
dataframe_I.iloc[1102,3]

'it shall apply from 2 august 2018.'
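For readers unfamiliar with `explode`, the effect seen above (one row per paragraph, with the original row index repeated) can be sketched in plain Python. This is only an approximation on a list of dicts: `explode_rows` and the `source_index` key are hypothetical, and unlike pandas this sketch simply drops rows whose list is empty.

```python
def explode_rows(rows, column):
    """Return one copy of each row per element of row[column].

    The position of the source row is kept under 'source_index', mirroring the
    repeated index values in the exploded dataframe.
    """
    exploded = []
    for position, row in enumerate(rows):
        for item in row[column]:
            new_row = dict(row)
            new_row[column] = item
            new_row["source_index"] = position
            exploded.append(new_row)
    return exploded


rows = [
    {"Articles": "Article 1", "Cleaned_Content": ["p1"]},
    {"Articles": "Article 2", "Cleaned_Content": ["d1", "d2", "d3"]},
]

exploded = explode_rows(rows, "Cleaned_Content")
print([(r["source_index"], r["Cleaned_Content"]) for r in exploded])
```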

In [23]:
final_dataframe

Unnamed: 0,Title,Articles,Content,Cleaned_Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...
...,...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,[\nIn Article 4 of Decision No 541/2014/EU of ...,[in article 4 of decision no 541/2014/eu of th...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[ \nLegal commitments for grants implementing ...,[legal commitments for grants implementing the...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,[\nThis Regulation shall be reviewed whenever ...,[this regulation shall be reviewed whenever it...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...","[regulation (eu, euratom) no 966/2012 is repea..."


## Obtain the whole text of each Article by joining each Article's paragraphs together

In [24]:
final_dataframe['whole_text'] = final_dataframe['Cleaned_Content'].apply(lambda x: "\n ".join(x))

In [25]:
final_dataframe

Unnamed: 0,Title,Articles,Content,Cleaned_Content,whole_text
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...,this regulation lays down the rules for the es...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...,‘applicant’ means a natural person or an entit...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...,provisions concerning the implementation of th...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,...","unless otherwise provided in this regulation, ..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...,this regulation is without prejudice to regula...
...,...,...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,[\nIn Article 4 of Decision No 541/2014/EU of ...,[in article 4 of decision no 541/2014/eu of th...,in article 4 of decision no 541/2014/eu of the...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[ \nLegal commitments for grants implementing ...,[legal commitments for grants implementing the...,legal commitments for grants implementing the ...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,[\nThis Regulation shall be reviewed whenever ...,[this regulation shall be reviewed whenever it...,this regulation shall be reviewed whenever it ...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...","[regulation (eu, euratom) no 966/2012 is repea...","regulation (eu, euratom) no 966/2012 is repeal..."


In [26]:
final_dataframe[final_dataframe["Articles"]=='Article 282  Entry into force and application '].iloc[0,4]

'this regulation shall enter into force on the third day following that of its publication in the official journal of the european union.\n it shall apply from 2 august 2018.\n by way of derogation from paragraph 2 of this article: (a) article 271(1)(a), article 272(2), article 272(10)(a), article 272(11)(b)(i), (c), (d) and (e), article 272(12)(a), (b)(i) and (c), article 272(14)(c), article 272(15), (17), (18), (22) and (23), article 272(26)(d), article 272(27)(a)(i), article 272(53), and (54), article 272(55)(b)(i), article 273(3), article 276(2) and article 276(4)(b) shall apply from 1 january 2014; (b) article 272(11)(a) and (f), article 272(13), article 272(14)(b), article 272(16), article 272(19)(a) and article 274(3) shall apply from 1 january 2018; (c) articles 6 to 60, 63 to 68, 73 to 207, 241 to 253 and 264 to 268 shall apply from 1 january 2019 as regards the implementation of the administrative appropriations of union institutions; this is without prejudice to point (h) of

In [27]:
final_dataframe['whole_text_list'] = final_dataframe['whole_text'].apply(lambda x: x.split("\n "))

In [28]:
equality_check = final_dataframe['whole_text_list'] == final_dataframe['Cleaned_Content']

In [29]:
equality_check.sum()

282

In [30]:
final_dataframe

Unnamed: 0,Title,Articles,Content,Cleaned_Content,whole_text,whole_text_list
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...,this regulation lays down the rules for the es...,[this regulation lays down the rules for the e...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...,‘applicant’ means a natural person or an entit...,[‘applicant’ means a natural person or an enti...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...,provisions concerning the implementation of th...,[provisions concerning the implementation of t...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,...","unless otherwise provided in this regulation, ...","[unless otherwise provided in this regulation,..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...,this regulation is without prejudice to regula...,[this regulation is without prejudice to regul...
...,...,...,...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,[\nIn Article 4 of Decision No 541/2014/EU of ...,[in article 4 of decision no 541/2014/eu of th...,in article 4 of decision no 541/2014/eu of the...,[in article 4 of decision no 541/2014/eu of th...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[ \nLegal commitments for grants implementing ...,[legal commitments for grants implementing the...,legal commitments for grants implementing the ...,[legal commitments for grants implementing the...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,[\nThis Regulation shall be reviewed whenever ...,[this regulation shall be reviewed whenever it...,this regulation shall be reviewed whenever it ...,[this regulation shall be reviewed whenever it...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...","[regulation (eu, euratom) no 966/2012 is repea...","regulation (eu, euratom) no 966/2012 is repeal...","[regulation (eu, euratom) no 966/2012 is repea..."


## Apply coreference resolution to each Article's content

In [31]:
final_dataframe['coreference_text'] = final_dataframe['whole_text'].apply(coref_solve)

In [32]:
final_dataframe

Unnamed: 0,Title,Articles,Content,Cleaned_Content,whole_text,whole_text_list,coreference_text
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...,this regulation lays down the rules for the es...,[this regulation lays down the rules for the e...,this regulation lays down the rules for the es...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...,‘applicant’ means a natural person or an entit...,[‘applicant’ means a natural person or an enti...,‘applicant’ means a natural person or an entit...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...,provisions concerning the implementation of th...,[provisions concerning the implementation of t...,provisions concerning the implementation of th...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,...","unless otherwise provided in this regulation, ...","[unless otherwise provided in this regulation,...","unless otherwise provided in this regulation, ..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...,this regulation is without prejudice to regula...,[this regulation is without prejudice to regul...,this regulation is without prejudice to regula...
...,...,...,...,...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,[\nIn Article 4 of Decision No 541/2014/EU of ...,[in article 4 of decision no 541/2014/eu of th...,in article 4 of decision no 541/2014/eu of the...,[in article 4 of decision no 541/2014/eu of th...,in article 4 of decision no 541/2014/eu of the...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[ \nLegal commitments for grants implementing ...,[legal commitments for grants implementing the...,legal commitments for grants implementing the ...,[legal commitments for grants implementing the...,legal commitments for grants implementing the ...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,[\nThis Regulation shall be reviewed whenever ...,[this regulation shall be reviewed whenever it...,this regulation shall be reviewed whenever it ...,[this regulation shall be reviewed whenever it...,this regulation shall be reviewed whenever thi...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...","[regulation (eu, euratom) no 966/2012 is repea...","regulation (eu, euratom) no 966/2012 is repeal...","[regulation (eu, euratom) no 966/2012 is repea...","regulation (eu, euratom) no 966/2012 is repeal..."


In [33]:
final_dataframe[final_dataframe["Articles"]=='Article 282  Entry into force and application '].iloc[0,6]

'this regulation shall enter into force on the third day following that of this regulation publication in the official journal of the european union.\n this regulation shall apply from 2 august 2018.\n by way of derogation from paragraph 2 of this article: (a) article 271(1)(a), article 272(2), article 272(10)(a), article 272(11)(b)(i), (c), (d) and (e), article 272(12)(a), (b)(i) and (c), article 272(14)(c), article 272(15), (17), (18), (22) and (23), article 272(26)(d), article 272(27)(a)(i), article 272(53), and (54), article 272(55)(b)(i), article 273(3), article 276(2) and article 276(4)(b) shall apply from 1 january 2014; (b) article 272(11)(a) and (f), article 272(13), article 272(14)(b), article 272(16), article 272(19)(a) and article 274(3) shall apply from 1 january 2018; (c) articles 6 to 60, 63 to 68, 73 to 207, 241 to 253 and 264 to 268 shall apply from 1 january 2019 as regards the implementation of the administrative appropriations of union institutions; this is without 

In [34]:
final_dataframe['coreference_list_ready'] = final_dataframe['coreference_text'].apply(lambda x: x.split("\n ")) 

In [36]:
final_dataframe['identical_length'] = final_dataframe.apply(lambda row: len(row['Cleaned_Content']) == len(row['coreference_list_ready']), axis=1)

In [37]:
final_dataframe['identical_length'].sum()

280

In [38]:
final_dataframe[final_dataframe['identical_length']==False]

Unnamed: 0,Title,Articles,Content,Cleaned_Content,whole_text,whole_text_list,coreference_text,coreference_list_ready,identical_length
141,TITLE V \nCOMMON RULES,Article 142 The early-detection and exclusion...,[ \nInformation exchanged within the early-det...,[information exchanged within the early-detect...,information exchanged within the early-detecti...,[information exchanged within the early-detect...,information exchanged within the early-detecti...,[information exchanged within the early-detect...,False
242,TITLE XIII \nANNUAL ACCOUNTS AND OTHER FINANCI...,Article 243 Financial statements,[ \nThe financial statements shall be presente...,[the financial statements shall be presented i...,the financial statements shall be presented in...,[the financial statements shall be presented i...,the financial statements shall be presented in...,[the financial statements shall be presented i...,False
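To see how far the paragraph count drifts for such rows, the same comparison can report the expected and actual counts instead of a boolean. A minimal sketch on plain lists follows; `paragraph_count_mismatches` is a hypothetical helper, and the toy data simply simulates a resolved text in which one separator has been lost.

```python
def paragraph_count_mismatches(original_lists, resolved_texts, sep="\n "):
    """Return (position, expected_count, actual_count) for every row whose
    resolved text no longer splits into the original number of paragraphs."""
    mismatches = []
    for position, (paragraphs, resolved) in enumerate(zip(original_lists, resolved_texts)):
        actual = len(resolved.split(sep))
        if actual != len(paragraphs):
            mismatches.append((position, len(paragraphs), actual))
    return mismatches


# Toy data: the second resolved text is missing its "\n " separator.
originals = [["a b.", "it c."], ["x.", "y."]]
resolved_texts = ["a b.\n a b c.", "x. y."]

print(paragraph_count_mismatches(originals, resolved_texts))
```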


In [43]:
final_dataframe[final_dataframe['identical_length']==False].iloc[0,3]

['information exchanged within the early-detection and exclusion system referred to in article 135 shall be centralised in a database set up by the commission (‘the database’) and shall be managed in accordance with the right to privacy and other rights provided for in regulation (ec) no 45/2001. l 193/100   information on cases of early detection, exclusion and/or financial penalties shall be entered in the database by the authorising officer responsible after notifying the person or entity concerned, as referred to in article 135(2). such notification may be deferred in exceptional circumstances, where there are compelling legitimate grounds to preserve the confidentiality of an investigation or of national judicial proceedings, until such compelling legitimate grounds to preserve the confidentiality cease to exist. in accordance with regulation (ec) no 45/2001, the commission shall upon request inform the person or entity subject to the early-detection and exclusion system, as refer

In [45]:
final_dataframe[final_dataframe['identical_length']==False].iloc[0,4]

'information exchanged within the early-detection and exclusion system referred to in article 135 shall be centralised in a database set up by the commission (‘the database’) and shall be managed in accordance with the right to privacy and other rights provided for in regulation (ec) no 45/2001. l 193/100   information on cases of early detection, exclusion and/or financial penalties shall be entered in the database by the authorising officer responsible after notifying the person or entity concerned, as referred to in article 135(2). such notification may be deferred in exceptional circumstances, where there are compelling legitimate grounds to preserve the confidentiality of an investigation or of national judicial proceedings, until such compelling legitimate grounds to preserve the confidentiality cease to exist. in accordance with regulation (ec) no 45/2001, the commission shall upon request inform the person or entity subject to the early-detection and exclusion system, as referr

In [49]:
final_dataframe[final_dataframe['identical_length']==False].iloc[0,2]

[' \nInformation exchanged within the early-detection and exclusion system referred to in Article 135 shall be \ncentralised in a database set up by the Commission (‘the database’) and shall be managed in accordance with the \nright to privacy and other rights provided for in Regulation (EC) No 45/2001. \nL 193/100  \n\x0c \nInformation on cases of early detection, exclusion and/or financial penalties shall be entered in the database by the \nauthorising officer responsible after notifying the person or entity concerned, as referred to in Article 135(2). Such \nnotification may be deferred in exceptional circumstances, where there are compelling legitimate grounds to preserve \nthe confidentiality of an investigation or of national judicial proceedings, until such compelling legitimate grounds to \npreserve the confidentiality cease to exist. \nIn accordance with Regulation (EC) No 45/2001, the Commission shall upon request inform the person or entity subject \nto the early-detection a

In [46]:
final_dataframe[final_dataframe['identical_length']==False].iloc[0,6]

'information exchanged within the early-detection and exclusion system referred to in article 135 shall be centralised in a database set up by the commission (‘the database’) and shall be managed in accordance with the right to privacy and other rights provided for in regulation (ec) no 45/2001. l 193/100   information on cases of early detection, exclusion and/or financial penalties shall be entered in the database by the authorising officer responsible after notifying the person or entity concerned, as referred to in article 135(2). such notification may be deferred in exceptional circumstances, where there are compelling legitimate grounds to preserve the confidentiality of an investigation or of national judicial proceedings, until such compelling legitimate grounds to preserve the confidentiality cease to exist. in accordance with regulation (ec) no 45/2001, the commission shall upon request inform the person or entity subject to the early-detection and exclusion system, as referr

In [None]:
coref_solve(final_dataframe[final_dataframe['identical_length']==False].iloc[0,6])

'information exchanged within the early-detection and exclusion system referred to in article 135 shall be centralised in a database set up by the commission (‘the database’) and shall be managed in accordance with the right to privacy and other rights provided for in regulation (ec) no 45/2001. l 193/100   information on cases of early detection, exclusion and/or financial penalties shall be entered in the database by the authorising officer responsible after notifying the person or entity concerned, as referred to in article 135(2). such notification may be deferred in exceptional circumstances, where there are compelling legitimate grounds to preserve the confidentiality of an investigation or of national judicial proceedings, until such compelling legitimate grounds to preserve the confidentiality cease to exist. in accordance with regulation (ec) no 45/2001, the commission shall upon request inform the person or entity subject to the early-detection and exclusion system, as referr

In [50]:
coref_solve('The information contained in the database shall be updated, where appropriate, following a rectification, an erasure or any modification of data. It shall only be published in accordance with article 140.')

'The information contained in the database shall be updated, where appropriate, following a rectification, an erasure or any modification of data. It shall only be published in accordance with article 140.'

# Coreference Resolution Implementation Pipeline

- Import Packages
- Give spaCy example
- Import Financial Regulations Dataframe
- Apply coreference resolution to the whole content of an article (all paragraphs joined)
- Apply coreference resolution to each paragraph of an article, separately 

## Import Packages

In [56]:
import neuralcoref
import spacy
import pandas as pd
import ast

## spaCy Coreference Resolution Example

In [57]:
import en_core_web_lg

nlp = en_core_web_lg.load()

neuralcoref.add_to_pipe(nlp)

doc1 = nlp('My sister has a dog. She loves him.')
print(doc1._.coref_clusters)

doc2 = nlp('Angela lives in Boston. She is quite happy in that city.')
for ent in doc2.ents:
    print(ent._.coref_cluster)

[My sister: [My sister, She], a dog: [a dog, him]]
Angela: [Angela, She]
Boston: [Boston, that city]


## Import Dataframe of Financial Regulations as created in notebook 1: Parsing PDF

In [58]:
final_dataframe = pd.read_csv(r'C:\Users\Johnn\Desktop\Thesis-Coding\final_dataframe.csv')

def convert_to_list(column):
    return column.apply(ast.literal_eval)

final_dataframe['Content'] = convert_to_list(final_dataframe['Content'])
final_dataframe['Cleaned_Content'] = convert_to_list(final_dataframe['Cleaned_Content'])

In [59]:
def coref_solve(text):
    """Return `text` with each coreferring mention replaced by its cluster's main mention."""
    doc = nlp(text)
    return doc._.coref_resolved

In [60]:
final_dataframe

Unnamed: 0,Title,Articles,Content,Cleaned_Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...
...,...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,[\nIn Article 4 of Decision No 541/2014/EU of ...,[in article 4 of decision no 541/2014/eu of th...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[ \nLegal commitments for grants implementing ...,[legal commitments for grants implementing the...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,[\nThis Regulation shall be reviewed whenever ...,[this regulation shall be reviewed whenever it...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...","[regulation (eu, euratom) no 966/2012 is repea..."


## Apply Coreference Resolution to each Paragraph and to each Article's Content

In [61]:
# Join each article's paragraphs, resolve coreferences on the joined text, then split back into paragraphs
final_dataframe['whole_text'] = final_dataframe['Cleaned_Content'].apply("\n ".join)
final_dataframe['coreference_text'] = final_dataframe['whole_text'].apply(coref_solve)
final_dataframe['coreference_list_ready'] = final_dataframe['coreference_text'].apply(lambda x: x.split("\n "))
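The cell above resolves coreferences on the joined article text and then splits the result. The other option listed in the pipeline overview, resolving each paragraph separately, could be sketched as follows. The helper `resolve_per_paragraph` and the stand-in `toy_resolver` are assumptions for illustration; in the notebook the resolver would be `coref_solve`.

```python
def resolve_per_paragraph(paragraphs, resolver):
    """Apply `resolver` to each paragraph on its own, preserving the list length."""
    return [resolver(paragraph) for paragraph in paragraphs]


def toy_resolver(text):
    """Stand-in for coref_solve: rewrites 'she' only if 'angela' occurs in the same text."""
    if "angela" in text:
        return text.replace(" she ", " angela ")
    return text


paragraphs = ["angela lives in boston. she is quite happy.", "she likes that city."]
resolved_paragraphs = resolve_per_paragraph(paragraphs, toy_resolver)
print(resolved_paragraphs)
```

With the dataframe, this would presumably be applied as `final_dataframe['Cleaned_Content'].apply(lambda ps: resolve_per_paragraph(ps, coref_solve))`. The paragraph count is preserved by construction, but a pronoun whose antecedent sits in an earlier paragraph stays unresolved, as the second toy paragraph shows.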


## Save CSV - coreferenced_final_dataframe.csv

In [63]:
final_dataframe.to_csv('coreferenced_final_dataframe.csv', index=False)  