## Clean the data
* "basic" cleaning: drop empty values, drop duplicate pairs, drop pairs whose English or Romanian side repeats (subset duplicates) => results in L1(basic) of size 787895 sentence pairs
* "strong" cleaning: run bicleaner-hardrules on L1(basic) => results in L2(strong) of size 574034 sentence pairs
* "intermediate" cleaning: run bicleaner-hardrules on L1(basic) with some rules disabled => results in L3(intermediate) of size 614480 sentence pairs (610480 after removing dev and test)
## Split the data into train, dev and test
* use L2(strong) to create the dev and test sets (2000 sentence pairs each)
* remove dev and test from L1(basic) and L3(intermediate) to avoid training on dev/test data

In [1]:
import pandas as pd
import csv
import os

In [2]:
path_ENRO_bisents = os.path.join(os.pardir, "data/DCEP/00-raw/EN-RO-bisentences.txt")
#QUOTE_NONE treats quote characters as literal text, so a sentence containing quotes is read as one field
df=pd.read_csv(path_ENRO_bisents, sep='\t', names=['english', 'romanian'], quoting=csv.QUOTE_NONE)

The raw file (df.shape=(2668811, 2)) contains many duplicates. The eight most frequent sentence pairs are:

In [3]:
df_duplicates_sorted=df.groupby(df.columns.tolist(),as_index=False).size().sort_values(by='size',ascending=False)[:8]
df_duplicates_sorted

Unnamed: 0,english,romanian,size
174855,Amendment,Amendamentul,59392
872241,–,–,33074
55777,",",",",31713
596108,Text proposed by the Commission,Textul propus de Comisie,24069
433286,Justification,Justificare,18142
93126,1.,1.,14492
561622,Rule,Articolul,11383
528323,Proposal for a regulation,Propunere de regulament,11349


Remove exact duplicate pairs and rows with NaN values:

In [4]:
print(df.shape)
df=df.drop_duplicates()
print(df.shape)
df=df.dropna()
print(df.shape)

(2668811, 2)
(902307, 2)
(902303, 2)
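
The two cleaning calls above can be illustrated on a toy frame. This is a minimal sketch; the four rows are made up for illustration and only mimic the shape of the real data.

```python
import pandas as pd

# Toy stand-in for the raw bitext; the rows are illustrative only.
toy = pd.DataFrame(
    {
        "english": ["Amendment", "Amendment", "Justification", None],
        "romanian": ["Amendamentul", "Amendamentul", "Justificare", "Articolul"],
    }
)

# drop_duplicates() compares whole rows, so only identical
# (english, romanian) pairs are collapsed into one.
deduped = toy.drop_duplicates()

# dropna() removes rows where either side is missing.
cleaned = deduped.dropna()

print(len(toy), len(deduped), len(cleaned))  # prints: 4 3 2
```

Note that `drop_duplicates()` without `subset` keeps alternative translations of the same sentence, because those rows differ in one column.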


## Inconsistencies found in the data

The original dataset contains some words with space noise (stray spaces inside words):

In [5]:
pd.set_option('display.max_colwidth', None)
contains_noise = df[df['romanian'].str.contains('a ț ine seama de evolu ț ia tehnică a pie ț ei financiare')]
contains_noise

Unnamed: 0,english,romanian
1754573,"In order to take account of technical developments on financial markets and to ensure uniform application of this Directive, the Commission shall lay down, by means of delegated acts in accordance with Articles 24, 24a and 24b, measures concerning the format of the prospectus or base prospectus and supplements. [...] ""","„(5) Pentru a ț ine seama de evolu ț ia tehnică a pie ț ei financiare ș i pentru a asigura o punere în aplicare uniformă a prezentei directive, Comisia stabile ș te, cu ajutorul actelor delegate în conformitate cu articolele 24, 24a ș i 24b, măsurile [...] privind schema prospectului sau a prospectului de bază ș i a suplimentelor. [...]”"
1754716,"In order to take account of technical developments on financial markets and to ensure uniform application of this Directive, the Commission shall lay down, by means of delegated acts in accordance with Articles 24, 24a and 24b, measures concerning the conditions in accordance with which time limits may be adjusted. [...] ""","„(7) Pentru a ț ine seama de evolu ț ia tehnică a pie ț ei financiare ș i pentru a asigura o punere în aplicare uniformă a prezentei directive, Comisia stabile ș te, cu ajutorul actelor delegate în conformitate cu articolele 24, 24a ș i 24b, măsurile [...] privind condi ț iile la care termenele se pot adapta. [...] ”"
1754741,"In order to take account of technical developments on financial markets and to ensure uniform application of this Directive, the Commission shall lay down , by means of delegated acts in accordance with Articles 24, 24a and 24b, measures concerning paragraphs 1, 2, 3 and 4. [...] ""","„(7) Pentru a ț ine seama de evolu ț ia tehnică a pie ț ei financiare ș i pentru a asigura o punere în aplicare uniformă a prezentei directive, Comisia stabile ș te, cu ajutorul actelor delegate în conformitate cu articolele 24, 24a ș i 24b, măsurile [...] privind alineatele (1), (2), (3) ș i (4). [...]”"
1754752,"In order to take account of technical developments on financial markets and to ensure uniform application of this Directive, the Commission shall lay down, by means of delegated acts in accordance with Article s 24, 24a and 24b , measures concerning the dissemination of advertisements announcing the intention to offer securities to the public or the admission to trading on a regulated market, in particular before the prospectus has been made available to the public or before the opening of the subscription, and concerning paragraph 4. [...] ""","„(7) Pentru a ț ine seama de evolu ț ia tehnică a pie ț ei financiare ș i pentru a asigura o aplicare uniformă a prezentei directive, Comisia stabile ș te, cu ajutorul actelor delegate în conformitate cu articolele 24, 24a ș i 24b, măsurile [...] privind difuzarea comunicărilor cu caracter promo ț ional care anun ț ă inten ț ia de a face oferte publice de valori mobiliare sau de a impune aceste valori la tranzac ț ionarea pe o pia ț ă reglementată, în special înainte ca prospectul să fie pus la dispozi ț ia publicului sau înaintea deschiderii subscrierii, precum ș i măsurile [...] privind alineatul (4). [...]”"
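
Rows like these can be counted instead of searched for one phrase at a time. The sketch below is a hypothetical heuristic, not part of the original pipeline: it assumes that a Romanian `ț` or `ș` standing alone between spaces is rarely a legitimate token and therefore signals a split word. The name `has_space_noise` and the regular expression are illustrative choices.

```python
import re

import pandas as pd

# Assumption: an isolated "ț" or "ș" (upper or lower case) is a fragment
# of a split word, as in "a ț ine" for "a ține".
ISOLATED_DIACRITIC = re.compile(r"(?<!\S)[țșȚȘ](?!\S)")


def has_space_noise(text: str) -> bool:
    """Return True if the text contains a diacritic letter as a lone token."""
    return bool(ISOLATED_DIACRITIC.search(text))


toy = pd.Series(
    [
        "Pentru a ț ine seama de evolu ț ia tehnică",  # noisy
        "Pentru a ține seama de evoluția tehnică",      # clean
    ]
)
flags = toy.map(has_space_noise)
print(flags.tolist())  # prints: [True, False]
```

On the real data, `df['romanian'].map(has_space_noise).sum()` would give a rough count of affected rows. The pattern only covers the comma-below letters seen in the examples above, so other kinds of space noise would go unnoticed.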


Untranslated (or only partly translated) sentences:

In [6]:
contains_untranslated = df[df['romanian'].str.contains('Pursuant to this differentiation the Conference')]
contains_small_difference = df[df['english'].str.contains('Recovery plan for bluefin tuna in the Eastern')]
contains_untranslated


Unnamed: 0,english,romanian
1458024,Pursuant to this differentiation the Conference of Presidents also decided to refer to the Committee on Constitutional Affairs new items requiring changes to Parliament's Rules of Procedure.,Pursuant to this differentiation the Conference of Presidents also decided to refer to the Committee on Constitutional Affairs new items requiring changes to Regulamentul de procedură al Parlamentului.
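
One rough way to look for more pairs of this kind is to measure how many target tokens also occur on the source side. The sketch below is an assumption-laden heuristic rather than the method used in this notebook: `shared_token_ratio` and the 0.5 threshold are hypothetical, and the second toy row is a shortened version of the example above.

```python
import pandas as pd


def shared_token_ratio(source: str, target: str) -> float:
    """Fraction of whitespace-separated target tokens that also occur in the source."""
    source_tokens = set(source.lower().split())
    target_tokens = target.lower().split()
    if not target_tokens:
        return 0.0
    shared = sum(token in source_tokens for token in target_tokens)
    return shared / len(target_tokens)


pairs = pd.DataFrame(
    {
        "english": [
            "to the Council",
            "Pursuant to this differentiation the Conference of Presidents also decided",
        ],
        "romanian": [
            "Consiliului",
            "Pursuant to this differentiation the Conference of Presidents also decided",
        ],
    }
)
pairs["overlap"] = [
    shared_token_ratio(en, ro) for en, ro in zip(pairs["english"], pairs["romanian"])
]

# Hypothetical threshold: flag pairs where more than half of the target is copied.
suspicious = pairs[pairs["overlap"] > 0.5]
print(pairs["overlap"].tolist())  # prints: [0.0, 1.0]
```

Names, numbers and document references overlap legitimately between the two languages, so any threshold would need to be checked against samples before it is used for filtering.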


In [7]:
contains_small_difference

Unnamed: 0,english,romanian
206974,Recovery plan for bluefin tuna in the Eastern Atlantic and Mediterranean,Redresarea stocurilor de ton roşu din Oceanul Atlantic de Est şi din Marea Mediterană
207329,Recovery plan for bluefin tuna in the Eastern Atlantic and the Mediterranean,Redresarea stocurilor de ton roşu din Oceanul Atlantic de Est şi din Marea Mediterană
207509,Recovery plan for bluefin tuna in the Eastern Atlantic and Mediterranean * (debate),Redresarea stocurilor de ton roşu din Oceanul Atlantic de Est şi din Marea Mediterană * (dezbatere)
210136,Recovery plan for bluefin tuna in the Eastern Atlantic and Mediterranean * (vote),Redresarea stocurilor de ton roşu din Oceanul Atlantic de Est şi din Marea Mediterană * (vot)
212386,Recovery plan for bluefin tuna in the Eastern Atlantic and the Mediterranean *,Redresarea stocurilor de ton roşu din Oceanul Atlantic de Est şi din Marea Mediterană *


Example of a freely translated sentence (the Romanian side contains a translator's note):

In [8]:
contains_freely=df[df.iloc[:, 1].str.contains('joc de cuvinte')]
contains_freely

Unnamed: 0,english,romanian
322678,"We ""pass"" arguments among ourselves, exchange views and experience in order to achieve a common ""goal"".","Ne «pasăm» argumentele de la unul la altul, facem schimburi de opinii şi de experienţe pentru a obţine un «obiectiv» [joc de cuvinte în limba engleză, «goal» însemnând în acelaşi timp «obiectiv» şi «gol» n.r.]."


If we also remove alternative translations, keeping one pair per English sentence and one pair per Romanian sentence, we are left with 787895 sentence pairs:

In [9]:
df_no_ro_alternative=df.drop_duplicates(subset='english')

see=df_no_ro_alternative[df_no_ro_alternative.duplicated(subset="romanian", keep=False)]
grouped=see.groupby("romanian").size().reset_index(name="counts").sort_values(['counts'], ascending=True)

In [10]:
#examples of Romanian sentences that are paired with more than one English sentence
df.merge(grouped, how="right").iloc[250:259]

Unnamed: 0,english,romanian,counts
250,2009 would be an extremely difficult year for the citizens and economies of eurozone.,"Recunoscând succesul incontestabil al monedei unice, preşedintele Eurogroup a arătat că adevăratul test de coeziune urmează, 2009 fiind un an extrem de dificil pentru cetăţeni şi economiile din zona euro.",2
251,Collective redress,Recursuri colective,2
252,Seeking justice collectively,Recursuri colective,2
253,Redirecting EU funds to help create jobs for young people,Redirecționarea fondurilor UE pentru crearea de locuri de muncă pentru tineri,2
254,"Plan to redirect regional funds to create jobs could undermine trust, MEPs warn",Redirecționarea fondurilor UE pentru crearea de locuri de muncă pentru tineri,2
255,Recovery plan for bluefin tuna in the Eastern Atlantic and Mediterranean,Redresarea stocurilor de ton roşu din Oceanul Atlantic de Est şi din Marea Mediterană,2
256,Recovery plan for bluefin tuna in the Eastern Atlantic and the Mediterranean,Redresarea stocurilor de ton roşu din Oceanul Atlantic de Est şi din Marea Mediterană,2
257,Citation 1 a (new),Referirea 1a (nouă),2
258,Citation 1a (new),Referirea 1a (nouă),2


In [11]:
df_no_alternatives=df_no_ro_alternative.drop_duplicates(subset='romanian')
df_no_alternatives.shape

(787895, 2)
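
The two `drop_duplicates(subset=...)` calls can be traced on a toy frame. This is a minimal sketch: the first two rows are taken from the table above, while the `Rule` rows (in particular the value `Regula`) are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "english": ["Collective redress", "Seeking justice collectively", "Rule", "Rule"],
        "romanian": ["Recursuri colective", "Recursuri colective", "Articolul", "Regula"],
    }
)

# Step 1: keep the first row for every English sentence
# (removes alternative Romanian translations).
one_per_english = toy.drop_duplicates(subset="english")

# Step 2: keep the first row for every Romanian sentence
# (removes alternative English sources).
one_to_one = one_per_english.drop_duplicates(subset="romanian")

print(list(one_to_one.index))  # prints: [0, 2]
```

Because the default is `keep='first'`, which of the alternative translations survives depends only on row order, not on translation quality.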

In [12]:
df=df_no_alternatives

In [13]:
#keep the index as the first column; bicleaner then needs --scol 2 and --tcol 3 to find the sentence columns
df.to_csv("../data/DCEP/01-intermediate/bisentences_787895.txt", header=False, sep="\t")

## Apply more rules with the bicleaner tool to the just generated bisentences_787895.txt (size 787895)

In [14]:
!bicleaner-hardrules ../data/DCEP/01-intermediate/bisentences_787895.txt -s en -t ro --annotated_output --disable_minimal_length --scol 2 --tcol 3 > ../data/DCEP/01-intermediate/annotated_bisentences_787895.txt

2021-08-22 17:36:44,951 - INFO - LM filtering disabled.
2021-08-22 17:36:44,951 - INFO - Porn removal disabled.
2021-08-22 17:36:44,951 - INFO - Executing main program...
2021-08-22 17:36:44,951 - INFO - Starting process
2021-08-22 17:36:44,951 - INFO - Running 31 workers at 10000 rows per block
2021-08-22 17:36:45,029 - INFO - Start mapping
2021-08-22 17:37:32,346 - INFO - End mapping
2021-08-22 17:40:20,780 - INFO - Hard rules applied. Output available in <stdout>
2021-08-22 17:40:20,831 - INFO - Finished
2021-08-22 17:40:20,831 - INFO - Total: 787895 rows
2021-08-22 17:40:20,831 - INFO - Elapsed time 215.88 s
2021-08-22 17:40:20,831 - INFO - Troughput: 3649 rows/s
2021-08-22 17:40:20,832 - INFO - Program finished


In [15]:
df_annotated=pd.read_csv(os.path.join(os.pardir,"data/DCEP/01-intermediate/annotated_bisentences_787895.txt"), sep="\t", names=['english', 'romanian', 'yesno', 'reason'], quoting=csv.QUOTE_NONE)
df_annotated

Unnamed: 0,english,romanian,yesno,reason
0,ORAL QUESTION H-0336/07,ÎNTREBARE ORALĂ H-0336/07,1,keep
1,for Question Time at the part-session in May 2007,pentru timpul afectat întrebărilor din perioada de sesiune mai 2007,1,keep
2,pursuant to Rule 109 of the Rules of Procedure,în conformitate cu articolul 109 din Regulamentul de procedură,1,keep
3,by,de,1,keep
4,Roberta Alma Anastase,Roberta Alma Anastase,0,c_identical
...,...,...,...,...
2668799,Calls on the Commission to take into consideration the results of any studies on the effects of this programme on children's development;,"invită Comisia să ia în considerare rezultatele studiilor disponibile, referitoare la efectele acestui program asupra dezvoltării copiilor;",1,keep
2668801,"Instructs its President to forward this declaration, together with the names of the signatories The list of signatories is published in Annex 3 to the Minutes of 15 March 2012 ( P7_PV-PROV(2012)03-15(ANN3) ).","încredințează Președintelui sarcina de a transmite prezenta declarație, însoțită de numele semnatarilor Lista semnatarilor este publicată în anexa 3 la procesul-verbal din 15 martie 2012 ( P7_PV-PROV(2012)03-15(ANN3) ).",1,keep
2668802,", to the Commission and to the Parliaments of the Member States.",", Comisiei și parlamentelor statelor membre.",1,keep
2668803,The Week : 01-01-2004(s),"""""""Săptămâna"""" : 01-01-2004(s)""",0,c_majority_alpha(left)


In [16]:
df_reason=df_annotated.groupby(['yesno', 'reason'])['english'].count().reset_index(name='count') \
                             .sort_values(['count'], ascending=False)

df_reason

Unnamed: 0,yesno,reason,count
28,1,keep,574034
2,0,c_identical,131687
6,0,c_majority_alpha(left),23505
5,0,c_length or c_length_bytes,14810
1,0,c_different_language,10774
22,0,"c_reliable_long_language(right, targetlang)",7035
25,0,left.istitle() and right.istitle(),6746
8,0,c_no_breadcrumbs1(left),4913
3,0,c_identical_wo_digits,3644
4,0,c_identical_wo_punct,1838
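
To put the counts into proportion, they can be turned into shares of the 787895 input rows. This is a small sketch that hard-codes the three largest counts from the table above instead of reading the annotated file; the variable names are illustrative.

```python
import pandas as pd

# Three largest groups copied from the table above.
counts = pd.Series(
    {
        "keep": 574034,
        "c_identical": 131687,
        "c_majority_alpha(left)": 23505,
    }
)
total_rows = 787895  # total reported in the bicleaner log

# Share of all input rows, in percent, rounded to one decimal.
share = (counts / total_rows * 100).round(1)
print(share)
```

With the full annotated frame loaded, `df_annotated['reason'].value_counts(normalize=True)` would give the same shares for every reason at once.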


In [20]:
pd.set_option('display.max_colwidth', None)
#change the reason string below to sample sentence pairs that were rejected by a different rule
samples=df_annotated.loc[df_annotated['reason']=='c_no_space_noise(left)']
samples.sample(n=5)

Unnamed: 0,english,romanian,yesno,reason
1583759,"(a) Class C1 tyres - tyres intended for vehicles of category M 1 , O 1 and O 2 ;","(a) Pneuri din clasa C1 - concepute pentru vehicule de categoriile M 1 , O 1 și O 2 ;",0,c_no_space_noise(left)
1705470,Item 1 0 0 0 — Salaries,Postul 1 0 0 0 — Indemnizații,0,c_no_space_noise(left)
1706555,Article 4 0 3 — Contributions to European political foundations,Articolul 4 0 3 — Contribuții la fundațiile politice europene,0,c_no_space_noise(left)
1294851,Item 2 0 2 6 — Security and surveillance of buildings,Postul 2 0 2 6 – Securitatea şi supravegherea imobilelor,0,c_no_space_noise(left)
1772127,Item 5 0 0 1 — Proceeds from the sale of other movable property — Assigned revenue,Postul 5 0 0 1 — Încasări din vânzarea de alte bunuri mobile – Venituri alocate,0,c_no_space_noise(left)


In [21]:
title_not_available=df_annotated[df_annotated.iloc[:, 1].str.contains('Acest titlu nu este încă disponibil în toate limbile')]
title_not_available

Unnamed: 0,english,romanian,yesno,reason
56894,- Report on gender mainstreaming in the work of the committees ( 2005/2149(INI) ) - FEMM Committee - Rapporteur: Anna Záborská ( A6-0478/2006 ),- Acest titlu nu este încă disponibil în toate limbile: Report on gender mainstreaming in the work of the committees ( 2005/2149(INI) ) - FEMM - Raportoare: Anna Záborská ( A6-0478/2006 ),0,c_different_language
56896,- Report on European Road Safety Action Programme - mid-term review ( 2006/2112(INI) ) - TRAN Committee - Rapporteur: Ewa Hedkvist Petersen ( A6-0449/2006 ),- Acest titlu nu este încă disponibil în toate limbile: Report on European Road Safety Action Programme - mid-term review ( 2006/2112(INI) ) - TRAN - Raportoare: Ewa Hedkvist Petersen ( A6-0449/2006 ),0,c_different_language
56897,- Report on the Council's Seventh and Eighth Annual Reports according to Operative Provision 8 of the European Union Code of Conduct on Arms Exports ( 2006/2068(INI) ) - AFET Committee - Rapporteur: Raül Romeva i Rueda ( A6-0439/2006 ),- Acest titlu nu este încă disponibil în toate limbile: Report on the Council's Seventh and Eighth Annual Reports according to Operative Provision 8 of the European Union Code of Conduct on Arms Exports ( 2006/2068(INI) ) - AFET - Raportor: Raül Romeva i Rueda ( A6-0439/2006 ),0,c_different_language
56898,- Report with recommendations to the Commission on the European private company statute ( 2006/2013(INI) ) - JURI Committee - Rapporteur: Klaus-Heiner Lehne ( A6-0434/2006 ),- Acest titlu nu este încă disponibil în toate limbile: Report with recommendations to the Commission on the European private company statute ( 2006/2013(INI) ) - JURI - Raportor: Klaus-Heiner Lehne ( A6-0434/2006 ),0,c_different_language
56906,"- Michał Tomasz Kamiński, on behalf of the UEN Group.","- Acest titlu nu este încă disponibil în toate limbile: Michał Tomasz Kamiński, în numele grupului UEN.",1,keep
...,...,...,...,...
1105420,- Quality management for European Statistics ( 2011/2289(INI) ),- Acest titlu nu este încă disponibil în toate limbile Quality management for European Statistics ( 2011/2289(INI) ),1,keep
1105425,- Women in Political Decision Making - quality and equality ( 2011/2295(INI) ),- Acest titlu nu este încă disponibil în toate limbile Women in Political Decision Making - quality and equality ( 2011/2295(INI) ),1,keep
1105432,- The small scale and artisanal fisheries and the CFP reform ( 2011/2292(INI) ) (opinion: DEVE),- Acest titlu nu este încă disponibil în toate limbile The small scale and artisanal fisheries and the CFP reform ( 2011/2292(INI) ) (aviz: DEVE),0,c_different_language
1105433,- Reporting obligations under Regulation (EC) No 2371/2002 on the conservation and sustainable exploitation of fisheries resources under the CFP ( 2011/2291(INI) ) (opinion: ENVI),- Acest titlu nu este încă disponibil în toate limbile Reporting obligations under Regulation (EC) No 2371/2002 on the conservation and sustainable exploitation of fisheries resources under the CFP ( 2011/2291(INI) ) (aviz: ENVI),0,c_different_language


In [22]:
df_keep=df_annotated.loc[df_annotated['reason']=='keep']
df_keep

Unnamed: 0,english,romanian,yesno,reason
0,ORAL QUESTION H-0336/07,ÎNTREBARE ORALĂ H-0336/07,1,keep
1,for Question Time at the part-session in May 2007,pentru timpul afectat întrebărilor din perioada de sesiune mai 2007,1,keep
2,pursuant to Rule 109 of the Rules of Procedure,în conformitate cu articolul 109 din Regulamentul de procedură,1,keep
3,by,de,1,keep
5,to the Council,Consiliului,1,keep
...,...,...,...,...
2668795,Calls on the Commission and the Member States to encourage the introduction of the programme ‘Chess in School’ in the educational systems of the Member States;,invită Comisia și statele membre să încurajeze introducerea programului „șahul în instituțiile de învățământ” în sistemele educaționale din statele membre;,1,keep
2668797,"Calls on the Commission, in its forthcoming communication on sport, to pay the necessary attention to the program ‘Chess in School’ and to ensure sufficient funding for it from 2012 onwards;","invită Comisia ca, în viitoarea sa comunicare privind sportul, să acorde atenția cuvenită programului „șahul în instituțiile de învățământ” și să asigure finanțarea suficientă a acestuia începând din 2012;",1,keep
2668799,Calls on the Commission to take into consideration the results of any studies on the effects of this programme on children's development;,"invită Comisia să ia în considerare rezultatele studiilor disponibile, referitoare la efectele acestui program asupra dezvoltării copiilor;",1,keep
2668801,"Instructs its President to forward this declaration, together with the names of the signatories The list of signatories is published in Annex 3 to the Minutes of 15 March 2012 ( P7_PV-PROV(2012)03-15(ANN3) ).","încredințează Președintelui sarcina de a transmite prezenta declarație, însoțită de numele semnatarilor Lista semnatarilor este publicată în anexa 3 la procesul-verbal din 15 martie 2012 ( P7_PV-PROV(2012)03-15(ANN3) ).",1,keep


In [23]:
df_keep=df_keep[["english", "romanian"]]
df_keep

Unnamed: 0,english,romanian
0,ORAL QUESTION H-0336/07,ÎNTREBARE ORALĂ H-0336/07
1,for Question Time at the part-session in May 2007,pentru timpul afectat întrebărilor din perioada de sesiune mai 2007
2,pursuant to Rule 109 of the Rules of Procedure,în conformitate cu articolul 109 din Regulamentul de procedură
3,by,de
5,to the Council,Consiliului
...,...,...
2668795,Calls on the Commission and the Member States to encourage the introduction of the programme ‘Chess in School’ in the educational systems of the Member States;,invită Comisia și statele membre să încurajeze introducerea programului „șahul în instituțiile de învățământ” în sistemele educaționale din statele membre;
2668797,"Calls on the Commission, in its forthcoming communication on sport, to pay the necessary attention to the program ‘Chess in School’ and to ensure sufficient funding for it from 2012 onwards;","invită Comisia ca, în viitoarea sa comunicare privind sportul, să acorde atenția cuvenită programului „șahul în instituțiile de învățământ” și să asigure finanțarea suficientă a acestuia începând din 2012;"
2668799,Calls on the Commission to take into consideration the results of any studies on the effects of this programme on children's development;,"invită Comisia să ia în considerare rezultatele studiilor disponibile, referitoare la efectele acestui program asupra dezvoltării copiilor;"
2668801,"Instructs its President to forward this declaration, together with the names of the signatories The list of signatories is published in Annex 3 to the Minutes of 15 March 2012 ( P7_PV-PROV(2012)03-15(ANN3) ).","încredințează Președintelui sarcina de a transmite prezenta declarație, însoțită de numele semnatarilor Lista semnatarilor este publicată în anexa 3 la procesul-verbal din 15 martie 2012 ( P7_PV-PROV(2012)03-15(ANN3) )."


## Create a dev and a test set to use for both models

In [24]:
#dev and test sets are created from the bicleaner cleaned set
test=df_keep.sample(n=2000, random_state=42)
temp_train=df_keep.drop(test.index)

dev=temp_train.sample(n=2000, random_state=42)
train=temp_train.drop(dev.index)

In [34]:
#draw a spot-check sample from train (assigned but not displayed); the cell output is the size of train
checking=train.sample(n=10, random_state=42)
train.shape

(570034, 2)

In [26]:
#remove those entries from the bigger df which are now contained in dev or test
basic_train=df.drop(test.index)
basic_cleaning_train=basic_train.drop(dev.index)
basic_cleaning_train.shape

(783895, 2)
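
The split and the removal of dev/test rows from the larger set rely on both frames sharing the original row index. The sketch below reproduces the logic on toy frames and adds explicit leak checks; the helper `split_off` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd


def split_off(frame: pd.DataFrame, n: int, seed: int):
    """Sample n rows and return (held_out, remainder)."""
    held_out = frame.sample(n=n, random_state=seed)
    return held_out, frame.drop(held_out.index)


# Toy "basic" set with 20 rows; the "strong" set is a subset that keeps
# the same index labels, as the bicleaner output does here.
basic = pd.DataFrame(
    {"english": [f"en {i}" for i in range(20)], "romanian": [f"ro {i}" for i in range(20)]}
)
strong = basic.iloc[::2]

test, rest = split_off(strong, n=2, seed=42)
dev, train = split_off(rest, n=2, seed=42)

# Remove the held-out rows from the larger set by index label.
basic_train = basic.drop(test.index).drop(dev.index)

# Leak checks: no index label may appear in more than one split.
held_out_ids = set(test.index) | set(dev.index)
assert not set(test.index) & set(dev.index)
assert not held_out_ids & set(train.index)
assert not held_out_ids & set(basic_train.index)

print(len(train), len(basic_train))  # prints: 6 16
```

`drop` raises a `KeyError` if a label is missing, so this only works while the cleaned set keeps the index of the set it was derived from. The checks compare index labels only; identical sentence text under a different index would not be detected.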

## Too strict!

* Applying all available rules of the bicleaner tool filtered too strictly, so we disabled some rules and ran bicleaner again.
* The output, annotated_bisentences_787895_less_rules.txt, contains 614480 sentence pairs marked keep.

(Remark: we disabled rules and reran the tool instead of selecting rows from the annotated df by reason, because some sentences matched more than one rule.)


In [27]:
# here run the bicleaner tool with some rules disabled
!bicleaner-hardrules ../data/DCEP/01-intermediate/bisentences_787895.txt -s en -t ro --annotated_output --disable_minimal_length --disable_max_length --disable_paren_check --disable_majority_nonalpha --disable_titlecased_check --disable_breadcrumbs --disable_nonidentical_punct --scol 2 --tcol 3 > ../data/DCEP/01-intermediate/annotated_bisentences_787895_less_rules.txt

2021-08-22 17:49:41,841 - INFO - LM filtering disabled.
2021-08-22 17:49:41,841 - INFO - Porn removal disabled.
2021-08-22 17:49:41,841 - INFO - Executing main program...
2021-08-22 17:49:41,841 - INFO - Starting process
2021-08-22 17:49:41,841 - INFO - Running 31 workers at 10000 rows per block
2021-08-22 17:49:41,917 - INFO - Start mapping
2021-08-22 17:50:37,455 - INFO - End mapping
2021-08-22 17:52:13,951 - INFO - Hard rules applied. Output available in <stdout>
2021-08-22 17:52:14,008 - INFO - Finished
2021-08-22 17:52:14,008 - INFO - Total: 787895 rows
2021-08-22 17:52:14,010 - INFO - Elapsed time 152.17 s
2021-08-22 17:52:14,010 - INFO - Troughput: 5177 rows/s
2021-08-22 17:52:14,010 - INFO - Program finished


In [28]:
#L3 also includes long sentences (>=1024 characters)
df_ann_less_rules=pd.read_csv(os.path.join(os.pardir,"data/DCEP/01-intermediate/annotated_bisentences_787895_less_rules.txt"), sep="\t", names=['english', 'romanian', 'yesno', 'reason'], quoting=csv.QUOTE_NONE)
df_ann_less_keep=df_ann_less_rules.loc[df_ann_less_rules['reason']=='keep']
df_ann_less_keep

Unnamed: 0,english,romanian,yesno,reason
0,ORAL QUESTION H-0336/07,ÎNTREBARE ORALĂ H-0336/07,1,keep
1,for Question Time at the part-session in May 2007,pentru timpul afectat întrebărilor din perioada de sesiune mai 2007,1,keep
2,pursuant to Rule 109 of the Rules of Procedure,în conformitate cu articolul 109 din Regulamentul de procedură,1,keep
3,by,de,1,keep
5,to the Council,Consiliului,1,keep
...,...,...,...,...
2668799,Calls on the Commission to take into consideration the results of any studies on the effects of this programme on children's development;,"invită Comisia să ia în considerare rezultatele studiilor disponibile, referitoare la efectele acestui program asupra dezvoltării copiilor;",1,keep
2668801,"Instructs its President to forward this declaration, together with the names of the signatories The list of signatories is published in Annex 3 to the Minutes of 15 March 2012 ( P7_PV-PROV(2012)03-15(ANN3) ).","încredințează Președintelui sarcina de a transmite prezenta declarație, însoțită de numele semnatarilor Lista semnatarilor este publicată în anexa 3 la procesul-verbal din 15 martie 2012 ( P7_PV-PROV(2012)03-15(ANN3) ).",1,keep
2668802,", to the Commission and to the Parliaments of the Member States.",", Comisiei și parlamentelor statelor membre.",1,keep
2668803,The Week : 01-01-2004(s),"""""""Săptămâna"""" : 01-01-2004(s)""",1,keep


In [29]:
df_less_rules_reason=df_ann_less_rules.groupby(['yesno', 'reason'])['english'].count().reset_index(name='count') \
                             .sort_values(['count'], ascending=False)

df_less_rules_reason

Unnamed: 0,yesno,reason,count
16,1,keep,614480
2,0,c_identical,131821
4,0,c_length or c_length_bytes,14859
1,0,c_different_language,11074
13,0,"c_reliable_long_language(right, targetlang)",7744
3,0,c_identical_wo_digits,3644
12,0,"c_reliable_long_language(left, sourcelang)",1485
10,0,c_no_urls(left),1478
5,0,c_no_glued_words(left),385
8,0,c_no_space_noise(left),314


In [30]:
df_ann_less_keep=df_ann_less_keep[["english", "romanian"]]
df_ann_less_keep.sample(n=10)

Unnamed: 0,english,romanian
1847447,"– having regard to the European Commission Communication entitled ‘Roadmap for Maritime Spatial Planning: Achieving Common Principles in the EU’ ( COM(2008)0791 ),","– având în vedere Comunicarea Comisiei Europene intitulată „Foaie de parcurs privind amenajarea spațiului maritim: Realizarea principiilor comune în UE” ( COM(2008)0791 ),"
508475,Presentation of a draft study on the interrelationship between freedom of expression and freedom of religion,Prezentarea unui proiect de studiu privind relația de interdependență între libertatea de exprimare și libertatea religioasă
1570627,"(c) appropriate knowledge and understanding of the essential requirements, of the applicable harmonised standards and of the relevant provisions of Community harmonisation legislation and of its implementing regulations;","(c) cunoștințe și înțelegere corespunzătoare a cerințelor esențiale, a standardelor armonizate aplicabile și a dispozițiilor relevante din legislația comunitară de armonizare și din reglementările de punere în aplicare a acesteia ;"
1820458,All Member States must take action and fight youth unemployment with policy priorities and strategies that reflect the national specificities.,Toate statele membre trebuie să intervină și să combată șomajul în rândul tinerilor prin priorități politice și strategii care să reflecte specificul național.
809662,"""However, the European Parliament calls again for the immediate and unconditional release of all political prisoners in Cuba.""""""","""Cu toate acestea, Parlamentul European cere din nou eliberarea imediată și necondiționată a tuturor prizonierilor politici din Cuba""""."""
1841465,"– having regard to the Commission communication on the mid-term assessment of implementing the EC biodiversity action plan ( COM(2008)0864 final),","– având în vedere Comunicarea Comisiei privind evaluarea intermediară a implementării planului de acțiune comunitar pentru biodiversitate ( COM(2008)0864 final),"
1859607,(4) It is appropriate to clarify that staff of the EEAS who carry out tasks for the Commission as part of their duties should follow instructions given by the Commission.,(4) Este oportun a clarifica faptul că personalul SEAE care desfășoară anumite sarcini pentru Comisie ca parte a îndatoririlor sale ar trebui să urmeze instrucțiunile Comisiei.
1985378,"As a consequence, the Rapporteur calls for effective measures against uncontrolled speculation on food and agricultural commodities.","Drept urmare, raportoarea solicită măsuri eficace de combatere a speculațiilor necontrolate cu mărfuri alimentare și agricole."
1393987,"Article 9, paragraph 2, point (f) (Directive 98/70/EC)",Articolul 9 alineatul (2) litera (f) (Directiva 98/70/CE)
2543412,"Every year, Member States shall provide the Commission (Eurostat) with a report on the quality of the data relating to the reference periods in the reference year, and on any methodological changes that have been made.","În fiecare an, statele membre vor furniza Comisiei (Eurostat) un raport privind calitatea datelor pentru perioadele de referință în anul de referință și privind orice schimbări metodologice care au fost efectuate."


In [31]:
#remove dev and test entries from df_ann_less_keep
train_less_notest=df_ann_less_keep.drop(test.index)
train_less_nodev=train_less_notest.drop(dev.index)
train_less_nodev.shape

(610480, 2)

## Write all DataFrames to files

In [35]:
#write bicleaner cleaned L2 df's to files 
source_train=train.iloc[:,0]
target_train=train.iloc[:,1]
#size 570034
source_train.to_csv("../data/DCEP/01-intermediate/L2_strong/L2_train.en", index = None, header = False)
target_train.to_csv("../data/DCEP/01-intermediate/L2_strong/L2_train.ro", index = None, header = False)

source_test=test.iloc[:,0]
target_test=test.iloc[:,1]

source_test.to_csv("../data/DCEP/01-intermediate/L2_strong/L2_test.en", index = None, header = False)
target_test.to_csv("../data/DCEP/01-intermediate/L2_strong/L2_test.ro", index = None, header = False)

source_dev=dev.iloc[:,0]
target_dev=dev.iloc[:,1]

source_dev.to_csv("../data/DCEP/01-intermediate/L2_strong/L2_dev.en", index = None, header = False)
target_dev.to_csv("../data/DCEP/01-intermediate/L2_strong/L2_dev.ro", index = None, header = False)

In [36]:
!head ../data/DCEP/01-intermediate/L2_strong/*

==> ../data/DCEP/01-intermediate/L2_strong/L2_dev.en <==
"Except in the case of a flagrant offence, restraining measures requiring the intervention of a judge cannot be instituted against a member of any Chamber for the duration of a session, regarding repressive matters, except by the first President of the Court of Appeal at the demand of the competent judge."
conclusion
"""Furthermore, at the request of the then Irish Minister for Enterprise Trade and Employment, IFSRA """" undertook a review of matters raised by Irish policyholders in order to identify how their position could be improved """" (Ms O'DEA, H4)."""
Recognise voluntary activities
It should be underlined that this reduction is still subject to the final decision of the Commission.
Article 2 – point 5 a (new)
", which entered into force on 22 January 2011 and which identifies public procurement directives as ‘Community acts which refer to matters governed by the Convention’,"
"As you can see from our photo, fruit

c ) Țesături pentru tapiserie
PE428.631v01-00 B7 ‑ 0027 / 2009 Rezoluția Parlamentului European referitoare la legea lituaniană privind protecția minorilor față de efectele dăunătoare ale informațiilor publice
&quot; Pentru a evita repetarea utilizărilor finale ale subproduselor , se include un nou articol care reglementează o singură dată posibilitățile de eliminare pentru toate categoriile de subproduse . &quot;
&quot; &quot; &quot; În 2007 premiul LUX a fost acordat filmului &quot; &quot; &quot; &quot; Auf der anderen Seite &quot; &quot; &quot; &quot; ( On the other side ) de Fatih Akin . &quot; &quot; &quot;
· Suficienţă : Vor fi veniturile suficiente pentru a acoperi cheltuielile UE pe termen lung ?
salută dezbaterea publică referitoare la cartea verde și îndeamnă departamentele competente ale Comisiei să efectueze o analiză temeinică a rezultatului acestei consultări ;
&quot; , fără a aduce atingere obligațiilor care revin în acest sens angajatorului &quot;
Definiţia „ plăţii de 
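
Several `.en` lines in the `head` output above are wrapped in double quotes, and embedded quotes are doubled. This comes from `to_csv`, which by default quotes any field that contains a comma, a quote character or a line break. If verbatim sentences are wanted in the parallel files, one option is to write the lines directly. The following is a sketch of that alternative, not what was run here; `write_lines` is a hypothetical helper, and replacing embedded line breaks with a space is an assumption about the desired behavior.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_lines(series: pd.Series, path: Path) -> None:
    """Write one sentence per line, verbatim, without CSV quoting or escaping."""
    # A line break inside a sentence would shift the alignment between the
    # source and target files, so it is replaced with a single space.
    lines = series.astype(str).str.replace(r"[\r\n]+", " ", regex=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


toy = pd.Series(['He said "yes", then left.', "conclusion"])
out_path = Path(tempfile.mkdtemp()) / "toy.en"
write_lines(toy, out_path)

written = out_path.read_text(encoding="utf-8").splitlines()
print(written)
```

Calling the same helper on the English and Romanian columns of one frame keeps the two files line-aligned, because both are written in the same row order.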

In [37]:
!wc -l ../data/DCEP/01-intermediate/L2_strong/*

     2000 ../data/DCEP/01-intermediate/L2_strong/L2_dev.en
     2000 ../data/DCEP/01-intermediate/L2_strong/L2_dev.en.tok
     2000 ../data/DCEP/01-intermediate/L2_strong/L2_dev.ro
     2000 ../data/DCEP/01-intermediate/L2_strong/L2_dev.ro.tok
     2000 ../data/DCEP/01-intermediate/L2_strong/L2_test.en
     2000 ../data/DCEP/01-intermediate/L2_strong/L2_test.en.tok
     2000 ../data/DCEP/01-intermediate/L2_strong/L2_test.ro
     2000 ../data/DCEP/01-intermediate/L2_strong/L2_test.ro.tok
   570034 ../data/DCEP/01-intermediate/L2_strong/L2_train.en
   570034 ../data/DCEP/01-intermediate/L2_strong/L2_train.en.tok
   570034 ../data/DCEP/01-intermediate/L2_strong/L2_train.ro
   570034 ../data/DCEP/01-intermediate/L2_strong/L2_train.ro.tok
       11 ../data/DCEP/01-intermediate/L2_strong/README.md
  2296147 total
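The `wc -l` listing shows that every `.en` file has the same number of lines as its `.ro` counterpart, which is the property a parallel corpus needs. The same check can be scripted so that it fails loudly when the two sides drift apart. This is a minimal sketch: the helper names `count_lines` and `check_parallel` are hypothetical, and the demonstration runs on two temporary toy files rather than on the real corpus.

```python
import os
import tempfile


def count_lines(path):
    """Count newline characters in a file, which is what `wc -l` reports."""
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large corpus files do not need to fit in memory.
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the two sides differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"{src_path}: {n_src} lines, {tgt_path}: {n_tgt} lines")
    return n_src


# Toy demonstration on temporary files (hypothetical data, not the DCEP corpus).
tmp_dir = tempfile.mkdtemp()
toy_en = os.path.join(tmp_dir, "toy.en")
toy_ro = os.path.join(tmp_dir, "toy.ro")
with open(toy_en, "w", encoding="utf-8") as f:
    f.write("by\nto the Council\n")
with open(toy_ro, "w", encoding="utf-8") as f:
    f.write("de\nConsiliului\n")

n_pairs = check_parallel(toy_en, toy_ro)
print(n_pairs)
```

For the real data, `check_parallel` would be called once per split, for example with the `L2_dev.en` and `L2_dev.ro` paths from the listing.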


In [38]:
# Write the L1 (basic cleaning) training split to one file per language (783895 sentence pairs).
source_train_basic = basic_cleaning_train.iloc[:, 0]  # English side
target_train_basic = basic_cleaning_train.iloc[:, 1]  # Romanian side

source_train_basic.to_csv("../data/DCEP/01-intermediate/L1_basic/L1_train.en", index=False, header=False)
target_train_basic.to_csv("../data/DCEP/01-intermediate/L1_basic/L1_train.ro", index=False, header=False)

In [39]:
!head ../data/DCEP/01-intermediate/L1_basic/*

==> ../data/DCEP/01-intermediate/L1_basic/L1_basic.txt <==
0	ORAL QUESTION H-0336/07	ÎNTREBARE ORALĂ H-0336/07
1	for Question Time at the part-session in May 2007	pentru timpul afectat întrebărilor din perioada de sesiune mai 2007
2	pursuant to Rule 109 of the Rules of Procedure	în conformitate cu articolul 109 din Regulamentul de procedură
3	by	de
4	Roberta Alma Anastase	Roberta Alma Anastase
5	to the Council	Consiliului
6	Subject: More active EU involvement in settling unresolved conflicts and measures proposed for 2007	Subiect: Implicarea mai activă a UE în soluţionarea conflictelor îngheţate şi măsurile prevăzute pentru 2007
7	In the context of the recent exchange of views, held at the March part-session, with High Representative Javier Solana on the priorities for the Union's common foreign and defence policies, a large number of Members stressed, as a major priority for 2007, the need to deal with the problems of security and stability in the Union's eastern neighbourhood, especi

In [40]:
!wc -l ../data/DCEP/01-intermediate/L1_basic/*

   787895 ../data/DCEP/01-intermediate/L1_basic/L1_basic.txt
   783895 ../data/DCEP/01-intermediate/L1_basic/L1_train.en
   783895 ../data/DCEP/01-intermediate/L1_basic/L1_train.en.tok
   783895 ../data/DCEP/01-intermediate/L1_basic/L1_train.ro
   783895 ../data/DCEP/01-intermediate/L1_basic/L1_train.ro.tok
        3 ../data/DCEP/01-intermediate/L1_basic/README.md
  3923478 total
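`L1_basic.txt` has 787895 lines and the L1 training files have 783895, a difference of 4000. That matches the 2000 dev plus 2000 test pairs listed for `L2_strong`, as one would expect if each held-out pair occurs exactly once in L1. The cell that removes the held-out pairs is not part of this section. The sketch below shows one way such a removal could be written with a left merge and the `indicator` flag. The function name `remove_held_out` and the toy frames are hypothetical and are not taken from the notebook.

```python
import pandas as pd


def remove_held_out(train_df, held_out_df):
    """Drop every row of train_df whose full row also appears in held_out_df."""
    merged = train_df.merge(
        held_out_df.drop_duplicates(),  # unique keys, so no training row is duplicated
        how="left",
        on=list(train_df.columns),
        indicator=True,                 # adds a "_merge" column: "left_only" or "both"
    )
    keep = merged["_merge"] == "left_only"
    return merged.loc[keep, train_df.columns].reset_index(drop=True)


# Toy frames using the same column names as the notebook (hypothetical data).
basic = pd.DataFrame({
    "english": ["by", "to the Council", "Justification"],
    "romanian": ["de", "Consiliului", "Justificare"],
})
dev = pd.DataFrame({"english": ["by"], "romanian": ["de"]})
test = pd.DataFrame({"english": ["Justification"], "romanian": ["Justificare"]})

train = remove_held_out(basic, pd.concat([dev, test], ignore_index=True))
print(len(basic) - len(train))  # number of held-out pairs that were removed
```

Matching on both columns removes only exact sentence pairs. A pair that shares the English side with a dev sentence but has a different Romanian side would stay in the training data.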


In [41]:
# Write the L3 (intermediate cleaning, fewer bicleaner rules) training split
# to one file per language (610480 sentence pairs).
source_train_less_rules = train_less_nodev.iloc[:, 0]  # English side
target_train_less_rules = train_less_nodev.iloc[:, 1]  # Romanian side

source_train_less_rules.to_csv("../data/DCEP/01-intermediate/L3_intermediate/L3_train.en", index=False, header=False)
target_train_less_rules.to_csv("../data/DCEP/01-intermediate/L3_intermediate/L3_train.ro", index=False, header=False)
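With its default settings, `to_csv` wraps any field that contains a comma in double quotes, so sentences with commas appear in the written files with a leading and trailing quotation mark. If those quotes are not wanted in the training data, the sentences can be written as plain text instead. This is a minimal sketch: the helper `write_sentences` is hypothetical, the demonstration uses toy sentences and a temporary file, and replacing embedded newlines with spaces is an assumption made here to keep one sentence per line.

```python
import io
import os
import tempfile

import pandas as pd


def write_sentences(sentences, path):
    """Write one sentence per line without any CSV quoting."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for sentence in sentences:
            # Assumption: an embedded newline would break line alignment, so flatten it.
            handle.write(str(sentence).replace("\n", " ") + "\n")


# Toy sentences (hypothetical data); the second one contains a comma.
sentences = pd.Series(["by", "In this connection, how will the Council act?"])

# Default to_csv behaviour: the field with a comma gets wrapped in double quotes.
buffer = io.StringIO()
sentences.to_csv(buffer, index=False, header=False)
csv_lines = buffer.getvalue().splitlines()

# Plain-text writer: the sentence is written exactly as stored.
plain_path = os.path.join(tempfile.mkdtemp(), "toy_train.en")
write_sentences(sentences, plain_path)
with open(plain_path, encoding="utf-8") as handle:
    plain_lines = handle.read().splitlines()

print(csv_lines[1])
print(plain_lines[1])
```

Both sides of a split would need to go through the same writer so that the `.en` and `.ro` files stay aligned line by line.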

In [44]:
!head ../data/DCEP/01-intermediate/L3_intermediate/*

==> ../data/DCEP/01-intermediate/L3_intermediate/L3_train.en <==
ORAL QUESTION H-0336/07
for Question Time at the part-session in May 2007
pursuant to Rule 109 of the Rules of Procedure
by
to the Council
Subject: More active EU involvement in settling unresolved conflicts and measures proposed for 2007
"In the context of the recent exchange of views, held at the March part-session, with High Representative Javier Solana on the priorities for the Union's common foreign and defence policies, a large number of Members stressed, as a major priority for 2007, the need to deal with the problems of security and stability in the Union's eastern neighbourhood, especially via a more active involvement in settling unresolved conflicts and eliminating their consequences."
What concrete measures will the Council take for the consolidation and further development of the existing efforts in this direction in 2007?
"In this connection, how will the Council take account of the recent Commission communi

In [45]:
!wc -l /home/bernadeta/BA_code/data/DCEP/01-intermediate/L3_intermediate/*

   610480 /home/bernadeta/BA_code/data/DCEP/01-intermediate/L3_intermediate/L3_train.en
   610480 /home/bernadeta/BA_code/data/DCEP/01-intermediate/L3_intermediate/L3_train.en.tok
   610480 /home/bernadeta/BA_code/data/DCEP/01-intermediate/L3_intermediate/L3_train.ro
   610480 /home/bernadeta/BA_code/data/DCEP/01-intermediate/L3_intermediate/L3_train.ro.tok
        7 /home/bernadeta/BA_code/data/DCEP/01-intermediate/L3_intermediate/README.md
  2441927 total
