In [1]:
from jadoch.data import costep
from jadoch.data.costep import language, contains, starts_with, speaker

### Basics
---

A particular session can be loaded with `costep.session("YYYY-MM-DD")`.

In [2]:
next(costep.session("1996-04-15"))

{'session': '1996-04-15',
 'chapter': '1',
 'turn': '1',
 'speaker': {'president': 'yes'},
 'texts': {'danish': ['Jeg erklærer Europa-Parlamentets session, der blev afbrudt den 28. marts 1996, for genoptaget.'],
  'german': ['Ich erkläre die am Donnerstag, den 28. März 1996 unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen.'],
  'greek': ['Kηρύσσω την επανάληψη της συνόδου του Eυρωπαϊκού Kοινοβουλίου που είχε διακοπεί την Πέμπτη 28 Mαρτίου 1996.'],
  'english': ['I declare resumed the session of the European Parliament adjourned on Thursday, 28 March 1996.'],
  'spanish': ['Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el 28 de marzo de 1996.'],
  'french': ['Je déclare reprise la session du Parlement européen, qui avait été interrompue le jeudi 28 mars 1996.'],
  'italian': ['Dichiaro ripresa la sessione del Parlamento europeo interrotta giovedì 28 marzo 1996.'],
  'dutch': ['Ik verklaar de zitting van het Europees Parlemen

---
Alternatively, every session can be searched by calling `costep.speeches()`.

In [3]:
next(costep.speeches())

{'session': '1996-04-15',
 'chapter': '1',
 'turn': '1',
 'speaker': {'president': 'yes'},
 'texts': {'danish': ['Jeg erklærer Europa-Parlamentets session, der blev afbrudt den 28. marts 1996, for genoptaget.'],
  'german': ['Ich erkläre die am Donnerstag, den 28. März 1996 unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen.'],
  'greek': ['Kηρύσσω την επανάληψη της συνόδου του Eυρωπαϊκού Kοινοβουλίου που είχε διακοπεί την Πέμπτη 28 Mαρτίου 1996.'],
  'english': ['I declare resumed the session of the European Parliament adjourned on Thursday, 28 March 1996.'],
  'spanish': ['Declaro reanudado el período de sesiones del Parlamento Europeo, interrumpido el 28 de marzo de 1996.'],
  'french': ['Je déclare reprise la session du Parlement européen, qui avait été interrompue le jeudi 28 mars 1996.'],
  'italian': ['Dichiaro ripresa la sessione del Parlamento europeo interrotta giovedì 28 marzo 1996.'],
  'dutch': ['Ik verklaar de zitting van het Europees Parlemen

---
You can get sentence-aligned data for a set of languages by calling `costep.sentences(lang1, lang2, ...)`.

In [4]:
next(costep.sentences("english", "german"))

{'english': 'I declare resumed the session of the European Parliament adjourned on Thursday, 28 March 1996.',
 'german': 'Ich erkläre die am Donnerstag, den 28. März 1996 unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen.',
 'meta': {'session': '1996-04-15',
  'chapter': '1',
  'turn': '1',
  'speaker': {'president': 'yes'}}}
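Every example above uses `next` to pull a single item, which suggests these calls return lazy iterators. To look at the first few items instead of just one, `itertools.islice` from the standard library can be used. The sketch below runs without the corpus: `fake_sentences` and its rows are made up for illustration and only imitate the shape of what `costep.sentences` yields (without the `meta` entry).

```python
from itertools import islice


def fake_sentences(*langs):
    """Hypothetical stand-in for costep.sentences; yields made-up aligned rows."""
    rows = [
        {"english": "I declare the session resumed.",
         "german": "Ich erkläre die Sitzung für wiederaufgenommen."},
        {"english": "Thank you.", "german": "Vielen Dank."},
        {"english": "The debate is closed.",
         "german": "Die Aussprache ist geschlossen."},
    ]
    for row in rows:
        # Keep only the requested languages, as the real call appears to do.
        yield {lang: row[lang] for lang in langs}


# Take the first two rows without consuming the rest of the iterator.
first_two = list(islice(fake_sentences("english", "german"), 2))
for pair in first_two:
    print(pair["english"], "|", pair["german"])
```

The same pattern should apply to the real iterators, for example `islice(costep.sentences("english", "german"), 10)`.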

### Simple Filtering
---

Maybe you want to find English sentences that start with "after all,". One way to do this is with Python's built-in `filter`.

In [5]:
filter?

Init signature: filter(self, /, *args, **kwargs)
Docstring:
filter(function or None, iterable) --> filter object

Return an iterator yielding those items of iterable for which function(item)
is true. If function is None, return the items that are true.
Type:           type
Subclasses:


In [6]:
def english_starts_with_after_all(dct):
    return dct["english"].lower().startswith("after all,")


next(
    filter(
        english_starts_with_after_all,
        costep.sentences("english")
    )
)

{'english': 'After all, we have in you an expert who is in any case closely concerned with these matters.',
 'meta': {'session': '1996-04-15',
  'chapter': '3',
  'turn': '11',
  'speaker': {'president': 'yes'}}}

---
However, this is somewhat cumbersome. Some common filters are included in the library to make tasks like this easier. We can perform the same search using the `language` and `starts_with` functions.

In [7]:
english = language("english")

In [8]:
next(
    filter(
        english(starts_with("after all,")),
        costep.sentences("english")
    )
)

{'english': 'After all, we have in you an expert who is in any case closely concerned with these matters.',
 'meta': {'session': '1996-04-15',
  'chapter': '3',
  'turn': '11',
  'speaker': {'president': 'yes'}}}
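One way to picture what `english(starts_with("after all,"))` does is as two nested closures: `starts_with` builds a check on a list of words, and `language` adapts that check so it reads one language out of a sentence dictionary. The sketch below is not the library's code. The plain functions, the lowercased whitespace tokenization, and the sample row are all assumptions made for illustration.

```python
def starts_with(phrase):
    """Sketch: build a check on a list of words (not the jadoch implementation)."""
    words = phrase.lower().split()

    def check(tokens):
        # True when the sentence's leading words equal the phrase's words.
        return tokens[: len(words)] == words

    return check


def language(lang):
    """Sketch: adapt a word-list check so it applies to one language of a row."""

    def adapt(word_filter):
        def check(dct):
            # Assumed tokenization: lowercase, then split on whitespace.
            return word_filter(dct[lang].lower().split())

        return check

    return adapt


english = language("english")
row = {
    "english": "After all, we have in you an expert.",
    "german": "Wir haben ja mit Ihnen einen Experten.",
}
matches = english(starts_with("after all,"))(row)
misses = english(starts_with("we have"))(row)
print(matches, misses)
```

Separating "which language" from "which pattern" is what lets the same `starts_with` check be reused for any language in the row.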

In [9]:
language?

Signature: language(lang: str) -> Callable[[jadoch.core.functional.Filter], jadoch.core.functional.Filter]
Docstring:
Creates a function that when given a filter will modify the input
to apply to a particular language from the corpus.

Args:
    lang (str): The language for the filter (E.g."english")

Returns:
    A function that will modify a filter to apply to that language.

Examples:
    >>> german = language("german")
    >>> next(filter(german(contains("ja")), sentences("german", "english")))
    {'german': 'Wir haben ja mit Ihnen einen Experten, der ohnehin mit diesen...',
     'english': 'After all, we have in you an expert who is in any case closely...',
     'meta': {'sess

In [10]:
starts_with?

Signature: starts_with(phrase: str) -> jadoch.core.functional.Filter
Docstring:
Creates a filter that will search for sentences starting with the given phrase.

Args:
    phrase(str): A space delimited phrase to search for.

Returns:
    Filter: A filter which searches for sentences starting with that phrase.
File:      ~/dev/cogstates/jadoch/notebooks/jadoch/data/costep.py
Type:      function


---
Another useful filter is `contains`, which will find sentences that contain a particular phrase.

In [11]:
next(
    filter(
        english(contains("you know")),
        costep.sentences("english")
    )
)

{'english': 'As you know Madam President, ladies and gentlemen, a committee of inquiry has been set up to investigate fraud involving Community transit operations and its initial findings have already shown that there is a great deal of fraud and a great deal of effort required in the matter of TIR documents, computerizing the Community transit system, data exchange, and effective checks on these data.',
 'meta': {'session': '1996-04-16',
  'chapter': '2',
  'turn': '3',
  'speaker': {'president': 'no',
   'name': 'Caudron',
   'language': 'fr',
   'forename': 'Gérard',
   'surname': 'Caudron',
   'country': 'FR',
   'group': 'GUE/NGL',
   'id': '407'}}}

In [12]:
contains?

Signature: contains(phrase: str) -> jadoch.core.functional.Filter
Docstring:
Creates a filter that will search for sentences containing the given phrase.

Args:
    phrase (str): A space delimited phrase to search for.

Returns:
    Filter: A filter which searches sentences for that phrase.
File:      ~/dev/cogstates/jadoch/notebooks/jadoch/data/costep.py
Type:      function


---
There is also the built-in filter `speaker`, which lets you search for sentences that were originally spoken in a particular language.

In [13]:
next(filter(speaker("english"), costep.sentences("english", "german")))

{'english': 'Mr President, it concerns the speech made last week by Mr Fischler on BSE and reported in the Minutes.',
 'german': 'Es geht um die Erklärung von Herrn Fischler zu BSE, die im Protokoll festgehalten wurde.',
 'meta': {'session': '1996-04-15',
  'chapter': '3',
  'turn': '4',
  'speaker': {'president': 'no',
   'name': 'Sturdy',
   'language': 'en',
   'forename': 'Robert',
   'surname': 'Sturdy',
   'country': 'GB',
   'group': 'ECR',
   'id': '2306'}}}

In [14]:
speaker?

Signature: speaker(lang: str) -> jadoch.core.functional.Filter
Docstring:
Creates a filter that will search for sentences where the speaker was
originally speaking the given language.

Args:
    lang (str): The language for the filter (E.g. "german", "German", "de", "DE")

Returns:
    Filter: A filter which searches for sentences originally spoken in that language.
File:      ~/dev/cogstates/jadoch/notebooks/jadoch/data/costep.py
Type:      function


### Advanced Filtering
---

Notably, the objects returned by the filters described above are special: they support composition through logical operations, namely and-ing (`&`), or-ing (`|`), and inversion (`~`), which makes them very powerful.

In [15]:
(contains("ja") | contains("doch")) & ~contains("haben")

<jadoch.core.functional.Filter at 0x131a43100>
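Composition of this kind is usually built on Python's operator-overloading hooks `__and__`, `__or__`, and `__invert__`. The sketch below shows one way such a class could work. `SketchFilter` and `contains_word` are hypothetical names, and this is not the source of `jadoch.core.functional.Filter`, whose internals are not shown here.

```python
class SketchFilter:
    """Hypothetical stand-in showing how &, | and ~ could be supported."""

    def __init__(self, func):
        self.func = func

    def __call__(self, item):
        # Calling the object runs the wrapped predicate, so filter() accepts it.
        return self.func(item)

    def __and__(self, other):
        return SketchFilter(lambda item: self(item) and other(item))

    def __or__(self, other):
        return SketchFilter(lambda item: self(item) or other(item))

    def __invert__(self):
        return SketchFilter(lambda item: not self(item))


def contains_word(word):
    """Hypothetical helper: check that a word occurs in a list of words."""
    return SketchFilter(lambda tokens: word in tokens)


fltr = (contains_word("ja") | contains_word("doch")) & ~contains_word("haben")

print(fltr(["das", "ist", "doch", "klar"]))  # has "doch", lacks "haben"
print(fltr(["wir", "haben", "ja", "zeit"]))  # has "ja" but also "haben"
```

Each operator returns a new filter object, so composed filters can themselves be combined or inverted again.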

Say, for example, you want German sentences containing "ja" where the English sentence does not contain "yes".

In [16]:
german = language("german")

In [17]:
# Note the `~` to indicate english sentences that do NOT contain "yes".
fltr = german(contains("ja")) & ~english(contains("yes"))
# Using the filter just like before.
next(filter(fltr, costep.sentences("english", "german")))

{'english': 'After all, we have in you an expert who is in any case closely concerned with these matters.',
 'german': 'Wir haben ja mit Ihnen einen Experten, der ohnehin mit diesen Fragen eng befaßt ist.',
 'meta': {'session': '1996-04-15',
  'chapter': '3',
  'turn': '11',
  'speaker': {'president': 'yes'}}}

Suppose you want to get all the sentences the previous filter didn't match. This task, which might be useful for creating training data, becomes easy.

In [18]:
next(filter(~fltr, costep.sentences("english", "german")))

{'english': 'I declare resumed the session of the European Parliament adjourned on Thursday, 28 March 1996.',
 'german': 'Ich erkläre die am Donnerstag, den 28. März 1996 unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen.',
 'meta': {'session': '1996-04-15',
  'chapter': '1',
  'turn': '1',
  'speaker': {'president': 'yes'}}}
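When both the matching and the non-matching sentences are needed, for instance as positive and negative training examples, a single pass that labels each row avoids iterating over the corpus twice. The sketch below uses a plain predicate and three made-up rows instead of the library's filters and corpus. `tokens`, `ja_without_yes`, and the sample data are hypothetical, and the punctuation stripping is a simplification.

```python
def tokens(text):
    """Lowercase, split on whitespace, and strip common punctuation."""
    return [word.strip(".,;:!?") for word in text.lower().split()]


def ja_without_yes(row):
    """German contains "ja" while the English side does not contain "yes"."""
    return "ja" in tokens(row["german"]) and "yes" not in tokens(row["english"])


rows = [
    {"english": "After all, we have an expert.",
     "german": "Wir haben ja einen Experten."},
    {"english": "Yes, that is right.", "german": "Ja, das stimmt."},
    {"english": "The session is resumed.",
     "german": "Die Sitzung ist wiederaufgenommen."},
]

positives, negatives = [], []
for row in rows:
    # One pass: route each row to the list that matches its label.
    (positives if ja_without_yes(row) else negatives).append(row)

print(len(positives), len(negatives))
```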

### Creating Custom Filters
---
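A simple custom filter might check whether the speaker was the president. Any function that takes a sentence dictionary and returns a boolean can be passed to `filter`.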

In [19]:
def president(dct):
    return dct["meta"]["speaker"].get("president") == "yes"

In [20]:
next(filter(president, costep.sentences("english")))

{'english': 'I declare resumed the session of the European Parliament adjourned on Thursday, 28 March 1996.',
 'meta': {'session': '1996-04-15',
  'chapter': '1',
  'turn': '1',
  'speaker': {'president': 'yes'}}}

However, you will find that this version of the filter does not support inversion.

In [21]:
next(filter(~president, costep.sentences("english")))

TypeError: bad operand type for unary ~: 'function'

To fix this we need to use the library's `Filter` class.

In [22]:
from jadoch.core.functional import Filter

In [23]:
president = Filter(lambda dct: dct["meta"]["speaker"].get("president") == "yes")

In [24]:
next(filter(~president, costep.sentences("english")))

{'english': 'Mr President, on behalf of my fellow-members from the Committee on Agriculture I should like to ask you to change a few things in the voting about the BSE resolution.',
 'meta': {'session': '1996-04-15',
  'chapter': '3',
  'turn': '2',
  'speaker': {'president': 'no',
   'name': 'Oomen-Ruijten',
   'language': 'nl',
   'forename': 'Ria',
   'surname': 'Oomen-Ruijten',
   'country': 'NL',
   'group': 'PPE',
   'id': '1765'}}}

This can be written equivalently in decorator notation if you prefer.

In [25]:
@Filter
def president(dct):
    return dct["meta"]["speaker"].get("president") == "yes"

In [26]:
next(filter(~president, costep.sentences("english")))

{'english': 'Mr President, on behalf of my fellow-members from the Committee on Agriculture I should like to ask you to change a few things in the voting about the BSE resolution.',
 'meta': {'session': '1996-04-15',
  'chapter': '3',
  'turn': '2',
  'speaker': {'president': 'no',
   'name': 'Oomen-Ruijten',
   'language': 'nl',
   'forename': 'Ria',
   'surname': 'Oomen-Ruijten',
   'country': 'NL',
   'group': 'PPE',
   'id': '1765'}}}

---
Suppose we want to make a filter that interfaces with the `language` function and finds sentences that end with a particular string. Keeping in mind that the `language` function feeds our filter a list of words, we can write something like this.

In [27]:
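def ends_with(phrase):
    # Split the phrase into lowercase words, since `language` passes a list of words.
    phr = phrase.lower().split()
    def fltr(sent):
        # Compare the last len(phr) words of the sentence with the phrase.
        return sent[-len(phr) :] == phr
    return Filter(fltr)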

In [28]:
next(filter(english(ends_with("these matters.")), costep.sentences("english")))

{'english': 'After all, we have in you an expert who is in any case closely concerned with these matters.',
 'meta': {'session': '1996-04-15',
  'chapter': '3',
  'turn': '11',
  'speaker': {'president': 'yes'}}}
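The slicing logic inside `ends_with` can be checked on its own with plain word lists, without `Filter` or the corpus. The sketch below uses a hypothetical plain-function variant, `ends_with_words`. It also shows one edge case worth knowing: with an empty phrase, `sent[-0:]` is the whole list, so the comparison is only true for an empty sentence.

```python
def ends_with_words(phrase):
    """Plain-function variant of the slicing check, for illustration only."""
    phr = phrase.lower().split()

    def check(sent):
        # sent[-n:] is the last n words; with n == 0 it is the whole list.
        return sent[-len(phr):] == phr

    return check


sentence = "after all, we are concerned with these matters.".split()

ends_matters = ends_with_words("these matters.")(sentence)
ends_these = ends_with_words("these")(sentence)
# A sentence shorter than the phrase never matches.
too_short = ends_with_words("these matters.")(["matters."])
# Edge case: an empty phrase compares the whole sentence with [].
empty_phrase = ends_with_words("")(sentence)

print(ends_matters, ends_these, too_short, empty_phrase)
```

If an empty phrase should match every sentence, as `str.endswith("")` does, an explicit `if not phr: return True` guard would be needed.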