# Query model

To retrieve the sentences needed for my PhD research, the corpus must go through the following filtering steps:

1. Keep the sentences that contain a token whose `lemma` might surface in a DcI or GcI construction; those `lemmata` are stored in `../data/mvi.csv`;
1. Keep the sentences that have an `infinitive` token dependent on the token that evaluated `true` in the previous filtering step;
1. Keep the sentences that have a `dative` or `genitive` token dependent on the token that evaluated `true` in the first filtering step;
1. Keep the sentences that have any nominal token in the `accusative`, or in one of the previously filtered cases, dependent on a token that evaluated `true` in the previous two filtering steps.
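
The first three steps can be sketched on a handful of in-memory tokens, without a database. This is only an illustration under assumptions: the field names mirror the ones used in the queries further down, the values (`V1`, `s1-2`, and so on) are invented, and I assume that `text-sentence-head` holds the `text-sentence-id` of the token's head. The fourth step would follow the same pattern with the infinitive and dative/genitive tokens as heads.

```python
import re

# Invented toy tokens; only the field names come from the real collection.
tokens = [
    # s1: MVI verb with an infinitive and a dative dependent
    {"text-sentence": "s1", "text-sentence-id": "s1-1", "text-sentence-head": "s1-0",
     "lemma": "V1", "feats": "VerbForm=Fin"},
    {"text-sentence": "s1", "text-sentence-id": "s1-2", "text-sentence-head": "s1-1",
     "lemma": "V2", "feats": "VerbForm=Inf"},
    {"text-sentence": "s1", "text-sentence-id": "s1-3", "text-sentence-head": "s1-1",
     "lemma": "N1", "feats": "Case=Dat|Number=Sing"},
    # s2: MVI verb with an infinitive dependent only
    {"text-sentence": "s2", "text-sentence-id": "s2-1", "text-sentence-head": "s2-0",
     "lemma": "V1", "feats": "VerbForm=Fin"},
    {"text-sentence": "s2", "text-sentence-id": "s2-2", "text-sentence-head": "s2-1",
     "lemma": "V2", "feats": "VerbForm=Inf"},
    # s3: no MVI lemma at all
    {"text-sentence": "s3", "text-sentence-id": "s3-1", "text-sentence-head": "s3-0",
     "lemma": "V3", "feats": "VerbForm=Fin"},
]
mvi_lemmata = {"V1"}  # stands in for the lemmata read from the CSV file


def ids(matched):
    """Token identifiers of the matched tokens."""
    return {t["text-sentence-id"] for t in matched}


def keep(pool, matched):
    """All tokens of the sentences that contain at least one matched token."""
    sentences = {t["text-sentence"] for t in matched}
    return [t for t in pool if t["text-sentence"] in sentences]


def dependents(pool, head_ids, feats_pattern):
    """Tokens headed by one of head_ids whose feats match the pattern."""
    return [t for t in pool
            if t["text-sentence-head"] in head_ids
            and re.search(feats_pattern, t["feats"])]


step1 = [t for t in tokens if t["lemma"] in mvi_lemmata]
pool1 = keep(tokens, step1)
step2 = dependents(pool1, ids(step1), r"VerbForm=Inf")
pool2 = keep(pool1, step2)
step3 = dependents(pool2, ids(step1), r"Case=(Dat|Gen)")
pool3 = keep(pool2, step3)

remaining = sorted({t["text-sentence"] for t in pool3})
print(remaining)
```

Each step narrows the pool of tokens to whole sentences, so that the next step can still look at every dependent of the matched heads.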

To build these filters programmatically, there must be a way to:
- store the sentences that were filtered;
- store the identifiers of the tokens that evaluated `true` in each step;
- store the criteria used for the evaluation.
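
One possible shape for such a record is a small dataclass per filtering step. The class `FilterStep` and its method `record` are hypothetical names introduced here for illustration; they are not part of `doc_data`. The sketch only assumes that each matched document carries the `text-sentence` and `text-sentence-id` fields.

```python
from dataclasses import dataclass, field


@dataclass
class FilterStep:
    """Hypothetical record of one filtering step."""
    name: str
    criteria: dict                                   # the match document used
    token_ids: list = field(default_factory=list)    # tokens that evaluated true
    sentences: list = field(default_factory=list)    # sentences that were kept

    def record(self, docs):
        """Store the token ids and the (deduplicated) sentences of the matches."""
        for doc in docs:
            self.token_ids.append(doc["text-sentence-id"])
            if doc["text-sentence"] not in self.sentences:
                self.sentences.append(doc["text-sentence"])
        return self


# Invented example documents, shaped like the projected query results.
step = FilterStep(name="infinitive",
                  criteria={"feats": {"$regex": "VerbForm=Inf"}})
step.record([
    {"text-sentence": "s1", "text-sentence-id": "s1-2"},
    {"text-sentence": "s1", "text-sentence-id": "s1-4"},
])
print(step.token_ids, step.sentences)
```

Keeping the criteria next to the results makes it possible to tell later which query produced which set of tokens.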

# Query testing

## Building handlers

In [1]:
from doc_data.processor import read_data
from doc_data.db import mongo
import pandas as pd

In [2]:
db = mongo("mongodb://localhost:27017")
valid_tokens = db.tokens

Connected successfully


In [3]:
mvi = pd.read_csv("../data/mvi.csv")
mvi = list(mvi.lemma)

## Filtering words with MVI

In [4]:
sentences_with_mvi = list(
    valid_tokens.aggregate([
        {"$match": {"lemma": {"$in": mvi}}},
        {"$project": {
            "text-sentence": 1,
            "text-sentence-id": 1,
            "_id": 0,
        }},
    ])
)

ts = [x["text-sentence"] for x in sentences_with_mvi]
mvi_ids = [x["text-sentence-id"] for x in sentences_with_mvi]

valid_tokens.aggregate([
    {"$match": {"text-sentence": {"$in": ts}}},
    {"$out": "mvi_tokens"}
])

<pymongo.command_cursor.CommandCursor at 0x7f515d0c9130>

## Filtering sentences with infinitives dependent on the MVI

In [5]:
valid_tokens = db.mvi_tokens

In [6]:
sentences_with_infinitive = list(
    valid_tokens.aggregate([
        {"$match": {
            "text-sentence-head": {"$in": mvi_ids},
            "feats": {"$regex": "VerbForm=Inf"},
        }},
        {"$project": {
            "text-sentence": 1,
            "text-sentence-id": 1,
            "_id": 1,
        }},
    ])
)

ts = [x["text-sentence"] for x in sentences_with_infinitive]
inf_ids = [x["text-sentence-id"] for x in sentences_with_infinitive]

valid_tokens.aggregate([
    {"$match": {"text-sentence": {"$in": ts}}},
    {"$out": "inf_tokens"}
])

<pymongo.command_cursor.CommandCursor at 0x7f50b9b5bdc0>
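
The `$regex` above matches `VerbForm=Inf` anywhere inside the `feats` string. Assuming that `feats` holds a CoNLL-U style string of `Key=Value` pairs separated by `|` (which the values used in these queries suggest, but which I have not checked against every document), the same test can be done outside the database by parsing the string. The helpers `parse_feats` and `has_feature` are hypothetical names used only in this sketch.

```python
def parse_feats(feats):
    """Turn 'Case=Dat|Number=Sing' into {'Case': 'Dat', 'Number': 'Sing'}."""
    if not feats or feats == "_":  # '_' marks an empty feature set in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def has_feature(feats, key, values):
    """True if the feature `key` is present with one of the given values."""
    return parse_feats(feats).get(key) in values


print(parse_feats("Case=Dat|Gender=Masc|Number=Sing"))
print(has_feature("Tense=Pres|VerbForm=Inf|Voice=Act", "VerbForm", {"Inf"}))
```

Parsing the string compares whole feature names and values, whereas the regular expression matches any substring.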

## Filtering sentences with Dat/Gen

In [7]:
valid_tokens = db.inf_tokens

In [8]:
possible_heads = mvi_ids
sentences_with_dat = list(
    valid_tokens.aggregate([
        # Step 3 covers both cases, so match dative or genitive.
        {"$match": {
            "text-sentence-head": {"$in": possible_heads},
            "feats": {"$regex": "Case=(Dat|Gen)"},
        }},
        {"$project": {
            "text-sentence": 1,
            "text-sentence-id": 1,
            "_id": 1,
        }},
    ])
)

ts = [x["text-sentence"] for x in sentences_with_dat]
dat_ids = [x["text-sentence-id"] for x in sentences_with_dat]

valid_tokens.aggregate([
    {"$match": {"text-sentence": {"$in": ts}}},
    {"$out": "dat_ids"}
])

<pymongo.command_cursor.CommandCursor at 0x7f50b9b5b700>
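
The infinitive and dative cells share the same two pipelines and differ only in the head identifiers, the `feats` pattern, and the output collection. A possible way to factor this out is sketched below; `dependent_pipeline` and `sentence_pipeline` are hypothetical helpers, not functions of `doc_data`, and they only build the pipeline lists without touching the database.

```python
def dependent_pipeline(head_ids, feats_regex):
    """Pipeline that finds tokens headed by head_ids whose feats match."""
    return [
        {"$match": {
            "text-sentence-head": {"$in": list(head_ids)},
            "feats": {"$regex": feats_regex},
        }},
        {"$project": {"text-sentence": 1, "text-sentence-id": 1, "_id": 1}},
    ]


def sentence_pipeline(sentences, out_collection):
    """Pipeline that writes all tokens of the given sentences to a collection."""
    return [
        {"$match": {"text-sentence": {"$in": list(sentences)}}},
        {"$out": out_collection},
    ]


# Invented identifiers, only to show the shape of the result.
pipeline = dependent_pipeline(["s1-1"], "VerbForm=Inf")
print(pipeline[0]["$match"])
```

With these helpers, a filtering cell could be reduced to something like `list(valid_tokens.aggregate(dependent_pipeline(mvi_ids, "VerbForm=Inf")))` followed by `valid_tokens.aggregate(sentence_pipeline(ts, "inf_tokens"))`.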

In [9]:
from doc_data.query import get_value_by_tsi, get_dependents

In [12]:
# Inspect the dependents of the first MVI token only (hence the `break`).
for x in mvi_ids:
    for y in get_dependents(valid_tokens, x):
        print(y["text"])
    break

καθόλου
μὲν
τὸ
μηδενὶ
μέρει
δὲ
ὑπάρχειν
