# PDF-->Text-->Dictionary-->Dataframe (final_dataframe.csv)
- Note: The "Content" column of the final dataframe contains the original, unprocessed text of each article.

- For the future: check PyMuPDF4LLM for converting the PDF text into an LLM-friendly format, e.g. Markdown.

## Convert PDF to Text using PyMuPDF

In [1]:
# Import pymupdf module  https://pymupdf.readthedocs.io/en/latest/

import pymupdf
import re

In [2]:
doc = pymupdf.open("regulations_new.pdf") # open a document
with open("output.txt", "wb") as out: # create a text output, closed automatically
    for page in doc: # iterate the document pages
        text = page.get_text().encode("utf8") # get plain text, encoded as UTF-8
        out.write(text) # write text of page
        out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
doc.close()

In [3]:
with open('output.txt', 'rt',encoding='utf-8') as in_file:
    text = in_file.read()
    print(repr(text))
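Because the extraction cell writes a form feed (0x0C) after every page, the page boundaries can still be recovered from the saved text. A minimal sketch, assuming a toy string in place of the real contents of output.txt; `sample_text` and `pages` are illustrative names, not part of the notebook:

```python
# Toy stand-in for the contents of output.txt: two pages, each followed by a form feed.
sample_text = "first page text\n\x0csecond page text\n\x0c"

# Split on the form feed. The delimiter after the last page leaves an empty
# final chunk, so empty chunks are dropped.
pages = [page for page in sample_text.split("\x0c") if page]

for number, page in enumerate(pages, start=1):
    print(number, repr(page))
```

The later cleaning step removes '\x0c' from the article text, so this only applies to the raw text read in the cell above.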



In [4]:
# Get Titles for use in splitting the document 

pattern = r'(TITLE\s+[a-zA-Z]+[A-Za-z\s,]+)\s*\n'
matches = re.findall(pattern, text)
matches

['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ',
 'TITLE II \nBUDGET AND BUDGETARY PRINCIPLES ',
 'TITLE III \nESTABLISHMENT AND STRUCTURE OF THE BUDGET ',
 'TITLE IV \nBUDGET IMPLEMENTATION ',
 'TITLE V \nCOMMON RULES ',
 'TITLE VI \nINDIRECT MANAGEMENT ',
 'TITLE VII \nPROCUREMENT AND CONCESSIONS ',
 'TITLE VIII \nGRANTS ',
 'TITLE IX \nPRIZES ',
 'TITLE X \nFINANCIAL INSTRUMENTS, BUDGETARY GUARANTEES AND FINANCIAL ASSISTANCE ',
 'TITLE XI \nCONTRIBUTIONS TO EUROPEAN POLITICAL PARTIES ',
 'TITLE XII \nOTHER BUDGET IMPLEMENTATION INSTRUMENTS ',
 'TITLE XIII \nANNUAL ACCOUNTS AND OTHER FINANCIAL REPORTING ',
 'TITLE XIV \nEXTERNAL AUDIT AND DISCHARGE ',
 'TITLE XV \nADMINISTRATIVE APPROPRIATIONS ',
 'TITLE XVI \nINFORMATION REQUESTS AND DELEGATED ACTS ']

In [5]:
title_list = re.split(pattern, text)
title_list

[' \n(245) Some modifications regarding financial instruments, budgetary guarantees and financial assistance should only \napply from the date of application of the post-2020 multiannual financial framework in order to allow sufficient \ntime to adapt the applicable legal bases and programmes to the new rules. \n(246) The information on the annual average of full-time equivalents and on the estimated amount of assigned revenue \ncarried over from preceding years should be provided for the first time together with the draft budget to be \npresented in 2021 in order to allow sufficient time for the Commission to adapt to the new obligation, \nHAVE ADOPTED THIS REGULATION: \nPART ONE \nFINANCIAL REGULATION \n',
 'TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ',
 'Article 1 \nSubject matter \nThis Regulation lays down the rules for the establishment and the implementation of the general budget of the European \nUnion and of the European Atomic Energy Community (‘the budget’)

In [6]:
# Create a dictionary of key:value = Title:Text
# re.split keeps each captured heading in the list, directly followed by its text

titles = {}
for index, chunk in enumerate(title_list[:-1]):
    if chunk in matches:
        titles[chunk] = title_list[index + 1]

for key in titles:
    titles[key] = "\n" + titles[key]
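The dictionary construction relies on a property of `re.split`: when the pattern contains one capturing group, the captured delimiters are kept in the result, so the list alternates between text, heading, text, heading, and so on. A minimal sketch on a made-up sample string (the sample text and the names `parts` and `sections` are assumptions for illustration), pairing headings and bodies by slicing:

```python
import re

# Same title pattern as in the notebook.
pattern = r'(TITLE\s+[a-zA-Z]+[A-Za-z\s,]+)\s*\n'

# Made-up miniature document with two titles.
sample = (
    "preamble \n"
    "TITLE I \nFIRST PART \nArticle 1 \nbody one \n"
    "TITLE II \nSECOND PART \nArticle 2 \nbody two \n"
)

parts = re.split(pattern, sample)

# parts[0] is the text before the first title; after that, headings sit at odd
# positions and their text at the following even positions.
sections = dict(zip(parts[1::2], parts[2::2]))

for heading, body in sections.items():
    print(repr(heading), "->", repr(body))
```

Slicing avoids looking up each heading with `list.index`, which returns only the first occurrence and would pair a repeated heading with the wrong text.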

In [7]:
titles

{'TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ': '\nArticle 1 \nSubject matter \nThis Regulation lays down the rules for the establishment and the implementation of the general budget of the European \nUnion and of the European Atomic Energy Community (‘the budget’) and the presentation and auditing of their accounts. \nArticle 2 \nDefinitions \nFor the purposes of this Regulation, the following definitions apply: \n(1) ‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application \nin a grant award procedure or in a contest for prizes; \n(2) ‘application document’ means a tender, a request to participate, a grant application or an application in a contest for \nprizes; \n(3) ‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for \nthe selection of experts or persons or entities implementing the budget pursuant to point (c) of the first \nsubparagraph of Art

In [8]:
titles['TITLE II \nBUDGET AND BUDGETARY PRINCIPLES ']

'\nArticle 6 \nRespect for budgetary principles \nThe budget shall be established and implemented in accordance with the principles of unity, budgetary accuracy, \nannuality, equilibrium, unit of account, universality, specification, sound financial management and transparency as set \nout in this Regulation. \nCHAPTER 1 \nPrinciples of unity and of budgetary accuracy \nArticle 7 \nScope of the budget \n1. \nFor each financial year, the budget shall forecast and authorise all revenue and expenditure considered necessary for \nthe Union. It shall comprise: \n(a) the revenue and expenditure of the Union, including administrative expenditure resulting from the implementation of \nthe provisions of the TEU relating to the common foreign and security policy (CFSP), and operational expenditure \noccasioned by implementation of those provisions where it is charged to the budget; \n(b) the revenue and expenditure of the European Atomic Energy Community. \n2. \nThe budget shall contain differen

In [9]:
# For a specific Title: Create Dictionary of key:value = Article:Text 

# Heading line such as "Article 268 ". The group is not made optional or repeated,
# so a blank line cannot act as a split point and cut an article short.
pattern_article = r'\n(Article\s+\d+\s)\n'
title_text = titles['TITLE XVI \nINFORMATION REQUESTS AND DELEGATED ACTS ']
matches = re.findall(pattern_article, title_text)
articles_list_I = re.split(pattern_article, title_text)

pattern_article_topic = r'^[^\n]+'  # first line after the heading = article topic
articles_2 = {}
for index, article in enumerate(articles_list_I[:-1]):
    if article in matches:
        body = articles_list_I[index + 1]
        matches_topic = re.findall(pattern_article_topic, body)
        key_name = article + " " + matches_topic[0]
        new_text = re.sub(pattern_article_topic, "", body)
        articles_2[key_name] = new_text

In [10]:
articles_2

{'Article 268  Information requests by the European Parliament and by the Council ': '\nThe European Parliament and the Council shall be entitled to obtain any information or explanations regarding budgetary \nmatters within their fields of competence. ',
 'Article 269  Exercise of the delegation ': '\n1. \nThe power to adopt delegated acts is conferred on the Commission subject to the conditions laid down in this \nArticle. \n2. \nThe power to adopt delegated acts referred to in Articles 70(1), the third paragraph of Article 71, Article 161 and \nthe second and third subparagraphs of Article 213(2) shall be conferred on the Commission for a period ending on \n31 December 2020. The Commission shall draw up a report in respect of the delegation of power not later than \n31 December 2018. The delegation of power shall be tacitly extended for the periods of duration of the subsequent \nmultiannual financial frameworks, unless the European Parliament or the Council opposes such extension n

In [11]:
# FUNCTION OBTAIN_ARTICLES
# Given a Title name, create a dictionary of Article:Text


def obtain_articles(title_name):
    """Return {'Article N  Topic ': text} for the given key of the titles dictionary."""
    # Heading line such as "Article 6 "; the group is neither optional nor repeated
    pattern_article = r'\n(Article\s+\d+\s)\n'
    title_text = titles[title_name]
    matches = re.findall(pattern_article, title_text)
    articles_list = re.split(pattern_article, title_text)

    pattern_article_topic = r'^[^\n]+'  # first line after the heading = article topic
    articles_dict = {}
    for index, article in enumerate(articles_list[:-1]):
        if article in matches:
            body = articles_list[index + 1]
            matches_topic = re.findall(pattern_article_topic, body)
            key_name = article + " " + matches_topic[0]
            new_text = re.sub(pattern_article_topic, "", body)
            articles_dict[key_name] = new_text

    return articles_dict


In [12]:
obtain_articles('TITLE II \nBUDGET AND BUDGETARY PRINCIPLES ')

{'Article 6  Respect for budgetary principles ': '\nThe budget shall be established and implemented in accordance with the principles of unity, budgetary accuracy, \nannuality, equilibrium, unit of account, universality, specification, sound financial management and transparency as set \nout in this Regulation. \nCHAPTER 1 \nPrinciples of unity and of budgetary accuracy ',
 'Article 7  Scope of the budget ': '\n1. \nFor each financial year, the budget shall forecast and authorise all revenue and expenditure considered necessary for \nthe Union. It shall comprise: \n(a) the revenue and expenditure of the Union, including administrative expenditure resulting from the implementation of \nthe provisions of the TEU relating to the common foreign and security policy (CFSP), and operational expenditure \noccasioned by implementation of those provisions where it is charged to the budget; \n(b) the revenue and expenditure of the European Atomic Energy Community. \n2. \nThe budget shall contain 
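In the output above, the text of Article 6 ends with "CHAPTER 1 \nPrinciples of unity and of budgetary accuracy ": a chapter heading that precedes the next article is carried along with the previous one. A minimal sketch of one way to trim it, assuming chapter headings always look like "CHAPTER <number>" followed by a single title line; the helper name `drop_trailing_chapter` and the shortened sample text are made up for illustration, and other subdivision headings (if the document has them) would need their own pattern:

```python
import re

# Assumption: a chapter heading is "CHAPTER <number>" plus one title line,
# located at the very end of the article text.
CHAPTER_TAIL = re.compile(r'\nCHAPTER\s+\d+\s*\n[^\n]*$')


def drop_trailing_chapter(article_text):
    """Remove a chapter heading that was captured at the end of an article's text."""
    return CHAPTER_TAIL.sub('', article_text)


# Shortened, illustrative version of the Article 6 text.
sample = (
    '\nThe budget shall be established and implemented in accordance with the principles '
    'set out in this Regulation. \nCHAPTER 1 \nPrinciples of unity and of budgetary accuracy '
)
print(repr(drop_trailing_chapter(sample)))
```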

In [13]:
# Create Article:Text dictionary for each Title 
# Store in Dictionary of dictionaries where Title : Dictionary of Articles

titles_2 = {}
for key in titles:
    articles_dictionary = obtain_articles(key)
    titles_2[key]=articles_dictionary

In [14]:
titles_2

{'TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ': {'Article 1  Subject matter ': '\nThis Regulation lays down the rules for the establishment and the implementation of the general budget of the European \nUnion and of the European Atomic Energy Community (‘the budget’) and the presentation and auditing of their accounts. ',
  'Article 2  Definitions ': '\nFor the purposes of this Regulation, the following definitions apply: \n(1) ‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application \nin a grant award procedure or in a contest for prizes; \n(2) ‘application document’ means a tender, a request to participate, a grant application or an application in a contest for \nprizes; \n(3) ‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for \nthe selection of experts or persons or entities implementing the budget pursuant to point (c) of the first \nsubparagra

## Remove page noise such as page notes and marks 

In [15]:
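# Remove noise patterns from the article texts of each Title
# Raw strings keep \. and \d from being read as (invalid) string escapes

pattern_EN = r'\nEN'
pattern_EU = r'\nOfficial Journal of the European Union'
pattern_date = r'\n30\.7\.2018'
pattern_L = r'\nL 193/\d+\n'
# Text starting with a marker such as "( 1 )" up to the next "p." (apparently footnotes)
pattern_extra_text = r'\n\(\s\d\s\)\s[\s\S]+?p\.'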


for title_key in titles_2:
    for article_key in titles_2[title_key]: 
        titles_2[title_key][article_key] = re.sub(pattern_EN, '', titles_2[title_key][article_key])
        titles_2[title_key][article_key] = re.sub(pattern_EU, '', titles_2[title_key][article_key])
        titles_2[title_key][article_key] = re.sub(pattern_date, '', titles_2[title_key][article_key])
        titles_2[title_key][article_key] = re.sub(pattern_L, '', titles_2[title_key][article_key])
        titles_2[title_key][article_key] = re.sub(pattern_extra_text, '', titles_2[title_key][article_key])
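One thing to watch with a pattern such as '\nEN' is that it also matches the first two letters of any line that merely starts with "EN". A minimal sketch of an alternative that removes the page furniture only when it fills a whole line, using one compiled pattern; the names `NOISE_LINES` and `strip_page_noise` and the sample text are assumptions for illustration, and the footnote pattern is left out on purpose:

```python
import re

# Page header/footer lines seen in this PDF, matched only as complete lines.
# [ \t]* tolerates trailing spaces; $ (with MULTILINE) anchors at the line end.
NOISE_LINES = re.compile(
    r'^(?:EN|Official Journal of the European Union|30\.7\.2018|L 193/\d+)[ \t]*$\n?',
    flags=re.MULTILINE,
)


def strip_page_noise(text):
    """Delete whole lines that consist only of page furniture."""
    return NOISE_LINES.sub('', text)


sample = (
    "first line \n"
    "EN \n"
    "Official Journal of the European Union \n"
    "30.7.2018 \n"
    "L 193/25 \n"
    "ENSURE this stays \n"
    "last line "
)
print(repr(strip_page_noise(sample)))
```

Unlike the loop above, this variant deletes each noise line together with its own line break, so the surrounding lines stay separated by a single '\n'.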

        
    

In [16]:
titles_2

{'TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ': {'Article 1  Subject matter ': '\nThis Regulation lays down the rules for the establishment and the implementation of the general budget of the European \nUnion and of the European Atomic Energy Community (‘the budget’) and the presentation and auditing of their accounts. ',
  'Article 2  Definitions ': '\nFor the purposes of this Regulation, the following definitions apply: \n(1) ‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application \nin a grant award procedure or in a contest for prizes; \n(2) ‘application document’ means a tender, a request to participate, a grant application or an application in a contest for \nprizes; \n(3) ‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for \nthe selection of experts or persons or entities implementing the budget pursuant to point (c) of the first \nsubparagra

## For each article, create a list of its text elements

In [17]:
list_example = re.split(r'\n\d+\.', titles_2['TITLE II \nBUDGET AND BUDGETARY PRINCIPLES ']['Article 12  Cancellation and carry-over of appropriations '])
list_example

['',
 ' \nAppropriations which have not been used by the end of the financial year for which they were entered shall be \ncancelled, unless they are carried over in accordance with paragraphs 2 to 8. ',
 ' \nThe following appropriations may be carried over by a decision taken pursuant to paragraph 3, but only to the \nfollowing financial year: \n(a) commitment appropriations and non-differentiated appropriations, for which most of the preparatory stages of the \ncommitment procedure have been completed by 31 December of the financial year. Such appropriations may be \ncommitted up to 31 March of the following financial year, with the exception of non-differentiated appropriations \nrelated to building projects which may be committed up to 31 December of the following financial year; \n(b) appropriations which are necessary when the legislative authority has adopted a basic act in the final quarter of the \nfinancial year and the Commission has been unable to commit the appropriations p

In [18]:
titles_2['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']['Article 2  Definitions ']

'\nFor the purposes of this Regulation, the following definitions apply: \n(1) ‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application \nin a grant award procedure or in a contest for prizes; \n(2) ‘application document’ means a tender, a request to participate, a grant application or an application in a contest for \nprizes; \n(3) ‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for \nthe selection of experts or persons or entities implementing the budget pursuant to point (c) of the first \nsubparagraph of Article 62(1); \n(4) ‘basic act’ means a legal act, other than a recommendation or an opinion, which provides a legal basis for an action \nand for the implementation of the corresponding expenditure entered in the budget or of the budgetary guarantee \nor financial assistance backed by the budget, and which may take any of the following forms: \n(a) in implement

In [19]:
# A single \d only matches one-digit markers, i.e. (1) to (9)
pattern_article = r'\n\(\d\)'
matches = re.findall(pattern_article, titles_2['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']['Article 2  Definitions '])
print(matches)

['\n(1)', '\n(2)', '\n(3)', '\n(4)', '\n(5)', '\n(6)', '\n(7)', '\n(8)', '\n(9)']
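The list printed above stops at (9), which is what a single `\d` produces whenever points numbered 10 and above exist: `\d` matches exactly one digit, so "(10)" is never matched. A minimal sketch on a made-up sample string showing the difference from `\d+`:

```python
import re

# Made-up sample with one single-digit and two double-digit markers.
sample = "\n(9) ninth; \n(10) tenth; \n(11) eleventh; "

single_digit = re.findall(r'\n\(\d\)', sample)   # one digit only
multi_digit = re.findall(r'\n\(\d+\)', sample)   # one or more digits

print(single_digit)
print(multi_digit)
```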


In [20]:
pattern_article = r'\n\(\d\)'
matches = re.findall(pattern_article, titles_2['TITLE XVI \nINFORMATION REQUESTS AND DELEGATED ACTS ']['Article 270  Amendments to Regulation (EU) No 1296/2013 '])
print(matches)

['\n(1)', '\n(2)', '\n(3)', '\n(4)', '\n(5)', '\n(6)']


In [21]:
pattern_article_2 = r'\n\d\.'
matches = re.findall(pattern_article_2, titles_2['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']['Article 3  Compliance of secondary legislation with this Regulation '])
print(matches)

['\n1.', '\n2.']


In [23]:
pattern_article_2 = r'\n\(\d+\)'
re.split(pattern_article_2, titles_2['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']['Article 2  Definitions '])[1:]

[' ‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application \nin a grant award procedure or in a contest for prizes; ',
 ' ‘application document’ means a tender, a request to participate, a grant application or an application in a contest for \nprizes; ',
 ' ‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for \nthe selection of experts or persons or entities implementing the budget pursuant to point (c) of the first \nsubparagraph of Article 62(1); ',
 ' ‘basic act’ means a legal act, other than a recommendation or an opinion, which provides a legal basis for an action \nand for the implementation of the corresponding expenditure entered in the budget or of the budgetary guarantee \nor financial assistance backed by the budget, and which may take any of the following forms: \n(a) in implementation of the Treaty on the Functioning of the European Union (TFEU) and the 

In [24]:
pattern_article_2 = r'\n\d+\.'
re.split(pattern_article_2, titles_2['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']['Article 3  Compliance of secondary legislation with this Regulation '])[1:]


[' \nProvisions concerning the implementation of the revenue and expenditure of the budget, and contained in a basic \nact, shall comply with the budgetary principles set out in Title II. ',
 ' \nWithout prejudice to paragraph 1, any proposal or amendment to a proposal submitted to the legislative authority \ncontaining derogations from the provisions of this Regulation other than those set out in Title II, or from delegated acts \nadopted pursuant to this Regulation, shall clearly indicate such derogations and shall state the specific reasons justifying \nthem in the recitals and in the explanatory memorandum of such proposals or amendments. ']

In [25]:
if re.search(pattern_article_2, titles_2['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']['Article 3  Compliance of secondary legislation with this Regulation ']):
    print('True')
    

True


In [26]:
re.findall(pattern_article_2, titles_2['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']['Article 3  Compliance of secondary legislation with this Regulation '])

['\n1.', '\n2.']

In [27]:
# Split the content of each article into a list of elements, for every Title. The articles use
# one of two numbering styles: points marked (1), (2), ... or paragraphs marked 1., 2., ...
# The [1:] slice drops any text that comes before the first marker.
# Articles with neither style are left as plain strings.

pattern_article_2 = r'\n\(\d+\)'
pattern_article_3 = r'\n\d+\.'
for title_key in titles_2:
    for article_key in titles_2[title_key]:
        content = titles_2[title_key][article_key]
        if re.search(pattern_article_2, content):
            titles_2[title_key][article_key] = re.split(pattern_article_2, content)[1:]
        elif re.search(pattern_article_3, content):
            titles_2[title_key][article_key] = re.split(pattern_article_3, content)[1:]
        else:
            titles_2[title_key][article_key] =  titles_2[title_key][article_key]
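Because the (1) style is tested first, an article whose paragraphs are numbered 1., 2., ... but which also contains a nested (1), (2) list would be split on the nested markers, if such articles occur. A minimal sketch of an alternative that splits on whichever style appears first in the text; the function name `split_elements` and the sample strings are assumptions for illustration, and unlike the loop above it always returns a list:

```python
import re

POINT = re.compile(r'\n\(\d+\)')     # "(1)", "(2)", ...
PARAGRAPH = re.compile(r'\n\d+\.')   # "1.", "2.", ...


def split_elements(text):
    """Split on the numbering style that occurs first; return [text] if none occurs."""
    first = None
    for pattern in (POINT, PARAGRAPH):
        match = pattern.search(text)
        if match and (first is None or match.start() < first[0]):
            first = (match.start(), pattern)
    if first is None:
        return [text]
    # As in the notebook, text before the first marker is dropped.
    return first[1].split(text)[1:]


nested = "\n1. \nFirst paragraph: \n(1) inner point; \n2. \nSecond paragraph. "
print(split_elements(nested))
```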

     
            
        

In [28]:
# Convert each article's content from list to string to prepare for the dataframe conversion

for title_key in titles_2:
    for article_key in titles_2[title_key]: 
        titles_2[title_key][article_key] =  str(titles_2[title_key][article_key])


In [29]:
titles_2

{'TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ': {'Article 1  Subject matter ': '\nThis Regulation lays down the rules for the establishment and the implementation of the general budget of the European \nUnion and of the European Atomic Energy Community (‘the budget’) and the presentation and auditing of their accounts. ',
  'Article 2  Definitions ': "[' ‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application \\nin a grant award procedure or in a contest for prizes; ', ' ‘application document’ means a tender, a request to participate, a grant application or an application in a contest for \\nprizes; ', ' ‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for \\nthe selection of experts or persons or entities implementing the budget pursuant to point (c) of the first \\nsubparagraph of Article 62(1); ', ' ‘basic act’ means a legal act, other than a rec

## From Dictionary to Dataframe 

In [30]:
import pandas as pd

In [31]:
# Example Dataframe 

for title_key in titles_2:
    print(title_key)
    dataframe = pd.DataFrame.from_dict(titles_2[title_key], orient='index',columns=['Content']).reset_index()
    print(dataframe)


TITLE I 
SUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES 
                                               index  \
0                         Article 1  Subject matter    
1                            Article 2  Definitions    
2  Article 3  Compliance of secondary legislation...   
3         Article 4  Periods, dates and time limits    
4            Article 5  Protection of personal data    

                                             Content  
0  \nThis Regulation lays down the rules for the ...  
1  [' ‘applicant’ means a natural person or an en...  
2  [' \nProvisions concerning the implementation ...  
3  \nUnless otherwise provided in this Regulation...  
4  \nThis Regulation is without prejudice to Regu...  
TITLE II 
BUDGET AND BUDGETARY PRINCIPLES 
                                                index  \
0        Article 6  Respect for budgetary principles    
1                     Article 7  Scope of the budget    
2   Article 8  Specific rules on the principles of...   
3

In [32]:
# For each Title, create a dataframe from its Article:Text dictionary, then concatenate all dataframes into a single dataframe

dataframes_titles = []
for title_key in titles_2:
    dataframe = pd.DataFrame.from_dict(titles_2[title_key], orient='index',columns=['Content']).reset_index()
    dataframe.columns = ['Articles','Content']
    dataframe.insert(0, 'Title', title_key)
    dataframes_titles.append(dataframe)

final_dataframe = pd.concat(dataframes_titles, ignore_index=True, axis=0)
final_dataframe

Unnamed: 0,Title,Articles,Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,\nThis Regulation lays down the rules for the ...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[' ‘applicant’ means a natural person or an en...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[' \nProvisions concerning the implementation ...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",\nUnless otherwise provided in this Regulation...
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,\nThis Regulation is without prejudice to Regu...
...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,\nIn Article 4 of Decision No 541/2014/EU of t...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[' \nLegal commitments for grants implementing...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,\nThis Regulation shall be reviewed whenever i...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[' \nRegulation (EU, Euratom) No 966/2012 is r..."


In [306]:
final_dataframe.iloc[0,2]

'\nThis Regulation lays down the rules for the establishment and the implementation of the general budget of the European \nUnion and of the European Atomic Energy Community (‘the budget’) and the presentation and auditing of their accounts. '

In [33]:
final_dataframe['Title'].unique()

array(['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ',
       'TITLE II \nBUDGET AND BUDGETARY PRINCIPLES ',
       'TITLE III \nESTABLISHMENT AND STRUCTURE OF THE BUDGET ',
       'TITLE IV \nBUDGET IMPLEMENTATION ', 'TITLE V \nCOMMON RULES ',
       'TITLE VI \nINDIRECT MANAGEMENT ',
       'TITLE VII \nPROCUREMENT AND CONCESSIONS ', 'TITLE VIII \nGRANTS ',
       'TITLE IX \nPRIZES ',
       'TITLE X \nFINANCIAL INSTRUMENTS, BUDGETARY GUARANTEES AND FINANCIAL ASSISTANCE ',
       'TITLE XI \nCONTRIBUTIONS TO EUROPEAN POLITICAL PARTIES ',
       'TITLE XII \nOTHER BUDGET IMPLEMENTATION INSTRUMENTS ',
       'TITLE XIII \nANNUAL ACCOUNTS AND OTHER FINANCIAL REPORTING ',
       'TITLE XIV \nEXTERNAL AUDIT AND DISCHARGE ',
       'TITLE XV \nADMINISTRATIVE APPROPRIATIONS ',
       'TITLE XVI \nINFORMATION REQUESTS AND DELEGATED ACTS '],
      dtype=object)

In [307]:
import ast

In [308]:
# Convert the content of each article from string to list again 

def back_to_list(text):
    if text.startswith("["):
        new_text = ast.literal_eval(text)
    else: 
        new_text = [text]
    return new_text

final_dataframe['Content'] = final_dataframe['Content'].apply(back_to_list)
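`back_to_list` decides by the first character alone, so a text that starts with "[" without being a list literal would make `ast.literal_eval` raise. That may never happen in this document, since unsplit articles appear to start with '\n'. A slightly more defensive sketch, with the hypothetical name `to_list`:

```python
import ast


def to_list(text):
    """Parse text that looks like a list literal; otherwise wrap it in a one-element list."""
    if text.startswith("["):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            # Starts with "[" but is not valid Python literal syntax.
            return [text]
        if isinstance(parsed, list):
            return parsed
    return [text]


print(to_list("[' first element; ', ' second element; ']"))
print(to_list("[sic] not a list"))
print(to_list("\nplain article text "))
```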

In [309]:
final_dataframe


Unnamed: 0,Title,Articles,Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...
...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,[\nIn Article 4 of Decision No 541/2014/EU of ...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[ \nLegal commitments for grants implementing ...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,[\nThis Regulation shall be reviewed whenever ...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re..."


In [310]:
final_dataframe.iloc[1,2]

[' ‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application \nin a grant award procedure or in a contest for prizes; ',
 ' ‘application document’ means a tender, a request to participate, a grant application or an application in a contest for \nprizes; ',
 ' ‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for \nthe selection of experts or persons or entities implementing the budget pursuant to point (c) of the first \nsubparagraph of Article 62(1); ',
 ' ‘basic act’ means a legal act, other than a recommendation or an opinion, which provides a legal basis for an action \nand for the implementation of the corresponding expenditure entered in the budget or of the budgetary guarantee \nor financial assistance backed by the budget, and which may take any of the following forms: \n(a) in implementation of the Treaty on the Functioning of the European Union (TFEU) and the 

## Text preprocessing of the elements in each article's content list
- Strip leading and trailing whitespace
- Lowercase
- Remove noise: '\n', '\x0c'
- Store the cleaned content list of each article in a new column

Save the result as final_dataframe.csv


In [311]:
for element in final_dataframe.iloc[1,2]:
    element=element.strip().lower()
    element = re.sub('\n','',element)
    element = re.sub('\x0c','',element)
    print(repr(element))

'‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes;'
'‘application document’ means a tender, a request to participate, a grant application or an application in a contest for prizes;'
'‘award procedure’ means a procurement procedure, a grant award procedure, a contest for prizes, or a procedure for the selection of experts or persons or entities implementing the budget pursuant to point (c) of the first subparagraph of article 62(1);'
'‘basic act’ means a legal act, other than a recommendation or an opinion, which provides a legal basis for an action and for the implementation of the corresponding expenditure entered in the budget or of the budgetary guarantee or financial assistance backed by the budget, and which may take any of the following forms: (a) in implementation of the treaty on the functioning of the european union (tfeu) and the treaty establishing the euro

In [312]:
# Create text preprocessing function for strip, lowercase and noise removal 

def clean_content(content):
    if isinstance(content,list):
        cleaned_list = []
        for element in content:
            new_element=element.strip().lower()
            new_element = re.sub(r'\n','',new_element)
            new_element = re.sub(r'\x0c','',new_element)
            cleaned_list.append(new_element)
       
        return cleaned_list
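Deleting '\n' outright works here because the extracted lines end with a space before the line break. If a line ever ends without that space, the two neighbouring words would be fused. A minimal sketch of a variant that collapses every whitespace run into one space instead; the name `normalize` and the sample string are assumptions for illustration:

```python
import re


def normalize(element):
    """Lowercase and collapse each whitespace run (newlines and form feeds included) into one space."""
    return re.sub(r'\s+', ' ', element).strip().lower()


# Made-up sample: the second line break has no space in front of it.
sample = " Provisions concerning \nthe Budget,\nand more.\x0c"
print(repr(normalize(sample)))
```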
    


In [313]:
# Store the preprocessed content lists in a new column, Cleaned_Content

final_dataframe['Cleaned_Content'] = final_dataframe['Content'].apply(clean_content)

In [314]:
final_dataframe

Unnamed: 0,Title,Articles,Content,Cleaned_Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...
...,...,...,...,...
277,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 278 Amendment to Decision No 541/2014...,[\nIn Article 4 of Decision No 541/2014/EU of ...,[in article 4 of decision no 541/2014/eu of th...
278,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 279 Transitional provisions,[ \nLegal commitments for grants implementing ...,[legal commitments for grants implementing the...
279,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 280 Review,[\nThis Regulation shall be reviewed whenever ...,[this regulation shall be reviewed whenever it...
280,TITLE XVI \nINFORMATION REQUESTS AND DELEGATED...,Article 281 Repeal,"[ \nRegulation (EU, Euratom) No 966/2012 is re...","[regulation (eu, euratom) no 966/2012 is repea..."


In [316]:
final_dataframe.dtypes

Title              object
Articles           object
Content            object
Cleaned_Content    object
dtype: object

## Save to CSV

In [315]:
# final_dataframe.to_csv('final_dataframe.csv', index=False)  

# Experiment !!! IGNORE

In [178]:
final_dataframe['Title'].unique()

array(['TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ',
       'TITLE II \nBUDGET AND BUDGETARY PRINCIPLES ',
       'TITLE III \nESTABLISHMENT AND STRUCTURE OF THE BUDGET ',
       'TITLE IV \nBUDGET IMPLEMENTATION ', 'TITLE V \nCOMMON RULES ',
       'TITLE VI \nINDIRECT MANAGEMENT ',
       'TITLE VII \nPROCUREMENT AND CONCESSIONS ', 'TITLE VIII \nGRANTS ',
       'TITLE IX \nPRIZES ',
       'TITLE X \nFINANCIAL INSTRUMENTS, BUDGETARY GUARANTEES AND FINANCIAL ASSISTANCE ',
       'TITLE XI \nCONTRIBUTIONS TO EUROPEAN POLITICAL PARTIES ',
       'TITLE XII \nOTHER BUDGET IMPLEMENTATION INSTRUMENTS ',
       'TITLE XIII \nANNUAL ACCOUNTS AND OTHER FINANCIAL REPORTING ',
       'TITLE XIV \nEXTERNAL AUDIT AND DISCHARGE ',
       'TITLE XV \nADMINISTRATIVE APPROPRIATIONS ',
       'TITLE XVI \nINFORMATION REQUESTS AND DELEGATED ACTS '],
      dtype=object)

In [269]:
final_dataframe.iloc[6,3]


['for each financial year, the budget shall forecast and authorise all revenue and expenditure considered necessary for the union. it shall comprise: (a) the revenue and expenditure of the union, including administrative expenditure resulting from the implementation of the provisions of the teu relating to the common foreign and security policy (cfsp), and operational expenditure occasioned by implementation of those provisions where it is charged to the budget; (b) the revenue and expenditure of the european atomic energy community.',
 'the budget shall contain differentiated appropriations, which consist of commitment appropriations and payment appropriations, and non-differentiated appropriations. the appropriations authorised for the financial year shall consist of: (a) appropriations provided in the budget, including by amending budgets; (b) appropriations carried over from preceding financial years; (c) appropriations made available again in accordance with article 15; (d) approp

In [267]:
final_dataframe.iloc[6,2]

[' \nFor each financial year, the budget shall forecast and authorise all revenue and expenditure considered necessary for \nthe Union. It shall comprise: \n(a) the revenue and expenditure of the Union, including administrative expenditure resulting from the implementation of \nthe provisions of the TEU relating to the common foreign and security policy (CFSP), and operational expenditure \noccasioned by implementation of those provisions where it is charged to the budget; \n(b) the revenue and expenditure of the European Atomic Energy Community. ',
 ' \nThe budget shall contain differentiated appropriations, which consist of commitment appropriations and payment \nappropriations, and non-differentiated appropriations. \nThe appropriations authorised for the financial year shall consist of: \n(a) appropriations provided in the budget, including by amending budgets; \n(b) appropriations carried over from preceding financial years; \n(c) appropriations made available again in accordance 

In [270]:
final_dataframe.iloc[0,0]

'TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES '

In [271]:
dataframe_I = final_dataframe[final_dataframe["Title"]=='TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENERAL PRINCIPLES ']

In [272]:
dataframe_I

Unnamed: 0,Title,Articles,Content,Cleaned_Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,[this regulation lays down the rules for the e...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,[‘applicant’ means a natural person or an enti...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,[provisions concerning the implementation of t...
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"[unless otherwise provided in this regulation,..."
4,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 5 Protection of personal data,[\nThis Regulation is without prejudice to Reg...,[this regulation is without prejudice to regul...


In [273]:

dataframe_I = dataframe_I.explode('Cleaned_Content')
dataframe_I

Unnamed: 0,Title,Articles,Content,Cleaned_Content
0,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 1 Subject matter,[\nThis Regulation lays down the rules for the...,this regulation lays down the rules for the es...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,‘applicant’ means a natural person or an entit...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,"‘application document’ means a tender, a reque..."
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,‘award procedure’ means a procurement procedur...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,"‘basic act’ means a legal act, other than a re..."
...,...,...,...,...
1,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 2 Definitions,[ ‘applicant’ means a natural person or an ent...,‘works contract’ means a contract covering eit...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,provisions concerning the implementation of th...
2,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...",Article 3 Compliance of secondary legislation...,[ \nProvisions concerning the implementation o...,"without prejudice to paragraph 1, any proposal..."
3,"TITLE I \nSUBJECT MATTER, DEFINITIONS AND GENE...","Article 4 Periods, dates and time limits",[\nUnless otherwise provided in this Regulatio...,"unless otherwise provided in this regulation, ..."


In [48]:
dataframe_I.iloc[1,3]

'‘applicant’ means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes;'

In [49]:
def remove_punct(text):
    """Remove ASCII punctuation and curly single quotes from text."""
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`’‘'
    # Keep only the characters that are not in the punctuation set
    return "".join(ch for ch in text if ch not in punctuation)


In [50]:
dataframe_I['Cleaned_Content'] = dataframe_I['Cleaned_Content'].apply(remove_punct)

In [51]:
dataframe_I.iloc[1,3]

'applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes'
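
One side effect worth checking: the punctuation set used by `remove_punct` includes `-` and `/`, so hyphenated terms and document numbers get fused. The sketch below uses a hypothetical helper, `remove_punct_keep`, built on `str.translate`, with an assumed `keep` parameter for characters that should survive. The sample string is illustrative.

```python
PUNCTUATION = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`’‘'

def remove_punct_keep(text, keep="-/"):
    """Delete punctuation characters except those listed in `keep` (hypothetical helper)."""
    to_delete = "".join(ch for ch in PUNCTUATION if ch not in keep)
    # str.maketrans with a third argument maps every listed character to None
    return text.translate(str.maketrans("", "", to_delete))

sample = "‘non-differentiated appropriations’ under decision no 541/2014/eu;"

stripped_all = remove_punct_keep(sample, keep="")   # same effect as remove_punct
stripped_some = remove_punct_keep(sample)           # hyphens and slashes preserved

print(stripped_all)
print(stripped_some)
```

Whether to keep those characters depends on what the downstream triple extraction expects, so this is an option to test rather than a recommendation.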

# Experiment !!!! IGNORE

Stanford OpenIE - "Subject-Relation-Object" Triple Extraction

In [52]:
from openie import StanfordOpenIE

In [53]:
properties = {
    'openie.affinity_probability_cap': 2 / 3,
}

with StanfordOpenIE(properties=properties) as client:
    text = 'applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes'
    print('Text: %s.' % text)
    for triple in client.annotate(text):
        print('|-', triple)

Text: applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes.
Starting server with command: java -Xmx8G -cp C:\Users\Johnn\.stanfordnlp_resources\stanford-corenlp-4.5.3/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-b199a51d41aa4237.props -preload openie


PermanentlyFailedException: Timed out waiting for service to come alive.

In [15]:
properties = {
    'openie.affinity_probability_cap': 2 / 3,
}

with StanfordOpenIE(properties=properties) as client:
    text = 'applicant means a natural person or an entity with or without legal personality'
    print('Text: %s.' % text)
    for triple in client.annotate(text):
        print('|-', triple)

Text: applicant means a natural person or an entity with or without legal personality.
Starting server with command: java -Xmx8G -cp C:\Users\Johnn\.stanfordnlp_resources\stanford-corenlp-4.5.3/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-d53e9f7d3fea4e5b.props -preload openie
|- {'subject': 'applicant', 'relation': 'means', 'object': 'natural person'}
|- {'subject': 'entity', 'relation': 'is with', 'object': 'legal personality'}
|- {'subject': 'applicant', 'relation': 'means', 'object': 'person'}


In [55]:
import spacy
from spacy.tokens import Doc
from spacy.matcher import Matcher

# Load the pre-trained SpaCy model
nlp = spacy.load("en_core_web_sm")

# Define the getter function used for the custom Doc extension attribute
def extract_relations(doc):
    matcher = Matcher(nlp.vocab)
    # Define patterns for matching relations
    pattern = [
        {'DEP': 'nsubj'},
        {'DEP': 'aux', 'OP': '?'},
        {'DEP': 'ROOT'},
        {'DEP': 'det', 'OP': '?'},
        {'DEP': 'amod', 'OP': '*'},
        {'DEP': 'dobj'}
    ]
    matcher.add("relation_pattern", [pattern])
    matches = matcher(doc)

    relations = []
    for match_id, start, end in matches:
        span = doc[start:end]
        relations.append((span.text, span.root.dep_))
    return relations

# Register the getter as a custom extension attribute, available as doc._.relations
Doc.set_extension("relations", getter=extract_relations, force=True)

# Sample text
text = "applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes"

# Process the text
doc = nlp(text)

# Extract and print relations
relations = doc._.relations
for relation in relations:
    print(relation)
    

('applicant means a natural person', 'ROOT')


In [None]:
import spacy

In [286]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes")

for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_)

applicant PROPN NNP nsubj
means VERB VBZ ROOT
a DET DT det
natural ADJ JJ amod
person NOUN NN dobj
or CCONJ CC cc
an DET DT det
entity NOUN NN conj
with ADP IN prep
or CCONJ CC cc
without ADP IN conj
legal ADJ JJ amod
personality NOUN NN pobj
who PRON WP nsubj
has AUX VBZ aux
submitted VERB VBN relcl
an DET DT det
application NOUN NN dobj
in ADP IN prep
a DET DT det
grant NOUN NN compound
award NOUN NN compound
procedure NOUN NN pobj
or CCONJ CC cc
in ADP IN conj
a DET DT det
contest NOUN NN pobj
for ADP IN prep
prizes NOUN NNS pobj


In [293]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)


applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes


In [292]:
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep')

In [330]:

from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
  {
    "RIGHT_ID": "relation",       # unique name
    "RIGHT_ATTRS": {"ORTH": "means"}  # anchor token: the verb "means"
  },
  {
      "LEFT_ID": "relation",
      "REL_OP": ">",
      "RIGHT_ID": "head",
      "RIGHT_ATTRS": {"POS": "PROPN"},
  },
    {
        "LEFT_ID": "relation",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["dobj"]}},
    },
        {
        "LEFT_ID": "founded_object_modifier",
        "REL_OP": ">",
        "RIGHT_ID": "object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod"]}},
    },
    {
        "LEFT_ID": "founded_object_modifier",
        "REL_OP": ">",
        "RIGHT_ID": "det_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["det"]}},
    },
      {
        "LEFT_ID": "founded_object_modifier",
        "REL_OP": ">",
        "RIGHT_ID": "conj_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["conj"]}},
    },
    {
        "LEFT_ID": "founded_object_modifier",
        "REL_OP": ">",
        "RIGHT_ID": "cnn_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["cc"]}},
    }


    #     {
    #     "LEFT_ID": "founded_object_modifier",
    #     "REL_OP": ">",
    #     "RIGHT_ID": "new_founded_object_modifier",
    #     "RIGHT_ATTRS": {"DEP": {"IN": ["amod","det"]}},
    # }

]
matcher.add("MEANS", [pattern])
doc = nlp("applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes")
matches = matcher(doc)
print(matches)
match_id, token_ids = matches[0]
for i in range(len(token_ids)):
    print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)

[(7875272566945416800, [1, 0, 4, 3, 2, 7, 5])]
relation: means
head: applicant
founded_object_modifier: person
object_modifier: natural
det_object_modifier: a
conj_object_modifier: entity
cnn_object_modifier: or
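
The printout above relies on the `DependencyMatcher` returning token ids in the same order as the pattern entries. A small sketch of that alignment in plain Python, with no spaCy needed: the token list and ids are copied from the outputs above, and `roles` is a hypothetical name.

```python
# First eight tokens of the sentence, indexed as in the spaCy doc
tokens = "applicant means a natural person or an entity".split()

# RIGHT_ID of each pattern entry, in pattern order
right_ids = [
    "relation",
    "head",
    "founded_object_modifier",
    "object_modifier",
    "det_object_modifier",
    "conj_object_modifier",
    "cnn_object_modifier",
]

# token_ids as returned for the single match
token_ids = [1, 0, 4, 3, 2, 7, 5]

# token_ids[i] belongs to pattern entry i, so zip pairs each role with its token
roles = {rid: tokens[idx] for rid, idx in zip(right_ids, token_ids)}

for rid, word in roles.items():
    print(f"{rid}: {word}")
```

A dictionary like this makes it easier to assemble a triple such as (`roles["head"]`, `roles["relation"]`, `roles["founded_object_modifier"]`) without tracking positions by hand.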


In [326]:
doc[token_ids[4]].text

'a'

In [328]:
token_ids

[1, 0, 4, 3, 2, 7]

In [263]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("applicant means a natural person or an entity with or without legal personality who has submitted an application in a grant award procedure or in a contest for prizes")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

applicant applicant nsubj means
a natural person person dobj means
an entity entity conj person
legal personality personality pobj without
who who nsubj submitted
an application application dobj submitted
a grant award procedure procedure pobj in
a contest contest pobj in
prizes prizes pobj for


In [266]:
dataframe_II = final_dataframe[final_dataframe["Title"]=='TITLE II \nBUDGET AND BUDGETARY PRINCIPLES ']

In [268]:

dataframe_II = dataframe_II.explode('Cleaned_Content')
dataframe_II

Unnamed: 0,Title,Articles,Content,Cleaned_Content
5,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 6 Respect for budgetary principles,[\nThe budget shall be established and impleme...,the budget shall be established and implemente...
6,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 7 Scope of the budget,"[ \nFor each financial year, the budget shall ...","for each financial year, the budget shall fore..."
6,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 7 Scope of the budget,"[ \nFor each financial year, the budget shall ...",the budget shall contain differentiated approp...
6,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 7 Scope of the budget,"[ \nFor each financial year, the budget shall ...",commitment appropriations shall cover the tota...
6,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 7 Scope of the budget,"[ \nFor each financial year, the budget shall ...",payment appropriations shall cover payments ma...
...,...,...,...,...
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...",save in the cases referred to in paragraphs 3 ...
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...",the information referred to in the first subpa...
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...",persons or entities implementing union funds p...
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...",the information referred to in paragraph 1 sha...


In [269]:
dataframe_II.iloc[1,3]

'for each financial year, the budget shall forecast and authorise all revenue and expenditure considered necessary for the union. it shall comprise: (a) the revenue and expenditure of the union, including administrative expenditure resulting from the implementation of the provisions of the teu relating to the common foreign and security policy (cfsp), and operational expenditure occasioned by implementation of those provisions where it is charged to the budget; (b) the revenue and expenditure of the european atomic energy community.'

In [270]:
dataframe_II['Cleaned_Content'] = dataframe_II['Cleaned_Content'].apply(lambda x: x.split('.'))
dataframe_II = dataframe_II.explode('Cleaned_Content')
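
Splitting on `'.'` leaves an empty string after every trailing full stop, and after `explode` these show up as rows with an empty `Cleaned_Content`. A minimal sketch of a splitter that drops them follows. `split_sentences` is a hypothetical helper and the sample text is shortened for illustration.

```python
def split_sentences(paragraph):
    """Split on '.' and drop fragments that are empty after stripping (naive sketch)."""
    return [part.strip() for part in paragraph.split(".") if part.strip()]

sample = (
    "the budget shall forecast all revenue. "
    "it shall comprise: (a) the revenue of the union."
)

sentences = split_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

Passing `split_sentences` to `.apply` in place of the lambda would avoid the empty rows. A plain `'.'` split still breaks on decimal points and abbreviations, so a sentence segmenter such as spaCy's `doc.sents` may be worth comparing on this text.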

In [271]:
dataframe_II.iloc[0,3]

'the budget shall be established and implemented in accordance with the principles of unity, budgetary accuracy, annuality, equilibrium, unit of account, universality, specification, sound financial management and transparency as set out in this regulation'

In [272]:
dataframe_II

Unnamed: 0,Title,Articles,Content,Cleaned_Content
5,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 6 Respect for budgetary principles,[\nThe budget shall be established and impleme...,the budget shall be established and implemente...
5,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 6 Respect for budgetary principles,[\nThe budget shall be established and impleme...,chapter 1 principles of unity and of budgetar...
6,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 7 Scope of the budget,"[ \nFor each financial year, the budget shall ...","for each financial year, the budget shall fore..."
6,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 7 Scope of the budget,"[ \nFor each financial year, the budget shall ...",it shall comprise: (a) the revenue and expend...
6,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 7 Scope of the budget,"[ \nFor each financial year, the budget shall ...",
...,...,...,...,...
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...","the commission shall make available, in an ap..."
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...",
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...","where personal data are published, the informa..."
37,TITLE II \nBUDGET AND BUDGETARY PRINCIPLES,Article 38 Publication of information on reci...,"[ \nThe Commission shall make available, in an...",this shall also apply to personal data referr...


In [273]:
dataframe_II['Articles'].unique()

array(['Article 6  Respect for budgetary principles ',
       'Article 7  Scope of the budget ',
       'Article 8  Specific rules on the principles of unity and budgetary accuracy ',
       'Article 9  Definition ',
       'Article 10  Budgetary accounting for revenue and appropriations ',
       'Article 11  Commitment of appropriations ',
       'Article 12  Cancellation and carry-over of appropriations ',
       'Article 13  Detailed provisions on cancellation and carry-over of appropriations ',
       'Article 14  Decommitments ',
       'Article 15  Making appropriations corresponding to decommitments available again ',
       'Article 16  Rules applicable in the event of late adoption of the budget ',
       'Article 17  Definition and scope ',
       'Article 18  Balance from financial year ',
       'Article 19  Use of euro ', 'Article 20  Scope ',
       'Article 21  Assigned revenue ',
       'Article 22  Structure to accommodate assigned revenue and provision of correspondi

In [248]:
dataframe_II.isna().sum()

Title              0
Articles           0
Content            0
Cleaned_Content    0
dtype: int64