# Second prepositional case in Russian dialectal corpora

## Introduction

This project collects and analyzes data from corpora of Russian dialects for my diploma. It focuses on the use of the second prepositional (locative) case, tagged loc2 in the corpora. Existing work on the topic suggests that the following parameters may be relevant:
1) animacy -- in standard Russian only inanimate nouns can take loc2;
2) preposition -- only 'na' and 'v'/'vo' are attested with loc2;
3) gender -- masculine dominates, but dialects may also have neuter loc2 forms (3rd declension feminine nouns have a loc2 form too, but it is marked only by stress placement, which is not available in our data);
4) last consonant of the stem -- loc2 forms may be more frequent with velars and less frequent with palatalized consonants;
5) number of syllables (one-syllable nouns are preferred) and stress placement (not available in our data).

Also, in many dialects only a limited group of nouns can be used in loc2, so the lemma itself could be a parameter.

## Data Collection

Automatic downloading of the data is still under development. Once it is ready, the script will be applied to several corpora that work in the same way. For now I search manually for the relevant data in one corpus (http://lingconlab.ru/khislavichi/#!/) using its built-in CQL search.

The query looks for word forms that end in the letter -u or -ju (у or ю) and carry the tags mentioned above. The parser does not identify loc2 forms reliably, so they may be analyzed as locative or dative as well as loc2.

The corpus lets us download the results in TSV format, which we then load into a dataframe.

In [81]:
import pandas as pd
data = pd.read_csv("khislavichi_res.tsv", sep='\t')

This is what our dataframe looks like:

In [82]:
data.head()

Unnamed: 0,utterance_id,From,To,File,Id,String id,Year of birth,Sex,Education,Place of birth,Place of living,Left Context,DMatch,Right Context
0,24.0,108453.0,114376.0,2019_zhanvil_pds1932_4_2,,pds1932,1932.0,f,4 класса,Шипы,Жанвиль,"Ну, вы = из",дому,"выносят, кто ямку роет, и те (тэя) выносят."
1,39.0,175700.0,180982.0,2019_zhanvil_pds1932_4_2,,pds1932,1932.0,f,4 класса,Шипы,Жанвиль,Ну тогда (тады) ж при = к,вечеру,"ж её, надо надеть и всё, гроб если готовый."
2,132.0,553567.0,558426.0,2019_zhanvil_pds1932_4_2,,pds1932,1932.0,f,4 класса,Шипы,Жанвиль,Ну ва = ну копали,могилочку,", как (як) чей - то где - то, вон там (тама), ..."
3,194.0,868946.0,872511.0,2019_zhanvil_pds1932_4_2,,pds1932,1932.0,f,4 класса,Шипы,Жанвиль,"Как (як), этот (этый), как (як) опустишь гроб ...",гробу,","
4,319.0,59180.0,63480.0,2019_zhanvil_pds1932_4_1,,pds1932,1932.0,f,4 класса,Шипы,Жанвиль,"Ну, ну она (ина) ну но =, может,",ночьу,", ну дочка тутай её."


This is raw data with a lot of noise in it.

## Preprocessing

Now let's clean the data by dropping entries that lack one of the required prepositions in the left context. We then extract the relevant parameters (animacy, gender, preposition, number of syllables in the stem, and the last consonant of the stem) and add them to a new dataframe.

In [83]:
from pymorphy2 import MorphAnalyzer
morph = MorphAnalyzer()

In [84]:
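VOWELS = 'уеыаоэяиюё'
TARGET_PREPS = ('на', 'в', 'во')

lemas = []
anim = []
gender = []
finale = []
nsyll = []
prep = []

# Keep only rows whose nearest preposition within the last four tokens
# of the left context is 'на', 'в' or 'во'; drop all other rows.
for i, cont in zip(data.index, data['Left Context']):
    c = str(cont).lower().split()
    p = ''
    for w in c[:-5:-1]:  # last four tokens, nearest to the match first
        ana = morph.parse(w.strip(',.?:;-=()!'))[0]
        if ana.tag.POS == 'PREP':
            p = ana.word
            break
    if p in TARGET_PREPS:
        prep.append(p)
    else:
        data = data.drop([i])

# Extract the parameters for every remaining match.
for word in data['DMatch']:
    word = str(word)
    ana = morph.parse(word.lower())[0]
    lemas.append(ana.normal_form)  # lemma
    anim.append(ana.tag.animacy)   # animacy
    gender.append(ana.tag.gender)  # gender

    # Stem-final segment: the plain consonant before -у, 'j' when -ю
    # follows a vowel, otherwise a palatalized consonant (marked with ').
    if word[-1] == 'у':
        finale.append(word[-2])
    elif word[-2] in VOWELS:
        finale.append('j')
    else:
        finale.append(word[-2] + "'")

    # Number of syllables, counted as vowel letters in the lemma.
    nsyll.append(sum(letter in VOWELS for letter in ana.normal_form))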
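
The stem-final and syllable-count rules used in the cell above can be isolated into small functions and checked without the corpus. This is a minimal sketch rather than the notebook's own code: the names `stem_final` and `count_syllables` are hypothetical, and, unlike the cell above, the sketch lowercases the form before inspecting it and rejects forms shorter than two letters.

```python
VOWELS = 'уеыаоэяиюё'


def stem_final(form: str) -> str:
    """Classify the stem-final segment of a form ending in -у or -ю.

    Returns the plain consonant before -у, 'j' when -ю follows a vowel,
    and the consonant plus an apostrophe (palatalized) otherwise.
    """
    form = form.lower()
    if len(form) < 2:
        raise ValueError(f'form too short to have a stem: {form!r}')
    if form[-1] == 'у':
        return form[-2]
    if form[-2] in VOWELS:
        return 'j'
    return form[-2] + "'"


def count_syllables(lemma: str) -> int:
    """Count syllables as the number of vowel letters in the lemma."""
    return sum(ch in VOWELS for ch in lemma.lower())


# Illustrative forms paired with their lemmas.
for form, lemma in [('гробу', 'гроб'), ('краю', 'край'), ('сундуку', 'сундук')]:
    print(form, stem_final(form), count_syllables(lemma))
```

Keeping the two rules in separate functions makes it easier to test edge cases, such as very short forms, before running them over the whole dataframe.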

In [85]:
df = pd.DataFrame({'word': data['DMatch'],
                   'lema': lemas,
                   'animacy': anim,
                   'gender': gender,
                   'finale': finale,
                   'n_syll': nsyll,
                   'preposition': prep
                  })
df.head()

Unnamed: 0,word,lema,animacy,gender,finale,n_syll,preposition
5,сундуку,сундук,inan,masc,к,2,в
7,сундуку,сундук,inan,masc,к,2,в
9,гробу,гроб,inan,masc,б,1,в
13,году,год,inan,masc,д,1,в
14,году,год,inan,masc,д,1,в


In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 229 entries, 5 to 827
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   word         229 non-null    object
 1   lema         229 non-null    object
 2   animacy      229 non-null    object
 3   gender       229 non-null    object
 4   finale       229 non-null    object
 5   n_syll       229 non-null    int64 
 6   preposition  229 non-null    object
dtypes: int64(1), object(6)
memory usage: 14.3+ KB


## Analyses

Here are the parameters and their frequencies.

In [90]:
df['lema'].value_counts(normalize=True)

год          0.279476
лес          0.126638
вид          0.052402
край         0.030568
бок          0.026201
               ...   
бык          0.004367
як           0.004367
санаторий    0.004367
бресник      0.004367
белор        0.004367
Name: lema, Length: 74, dtype: float64

The five most frequent lemmas account for just over half of all examples (about 52 percent). This suggests that some lemmas prefer loc2 and others do not.
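
The share covered by the five most frequent lemmas can be checked by summing the relative frequencies printed above. The snippet below is only a sanity check with those five values hard-coded; on the real dataframe the same figure would come from the first five entries of `value_counts(normalize=True)`.

```python
# Relative frequencies of the five most frequent lemmas, copied from the
# value_counts output above.
top5 = {
    'год': 0.279476,
    'лес': 0.126638,
    'вид': 0.052402,
    'край': 0.030568,
    'бок': 0.026201,
}

top5_share = sum(top5.values())
print(f'Top-5 lemmas cover {top5_share:.1%} of the examples')
```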

In [91]:
df['animacy'].value_counts(normalize=True)

inan    0.938865
anim    0.061135
Name: animacy, dtype: float64

As expected, most nouns used in loc2 are inanimate. However, about 6 percent are tagged as animate, which would be impossible in standard Russian. Let's have a look at them.

In [102]:
print(df[df['animacy']=='anim'])

          word      lema animacy gender finale  n_syll preposition
48     собняку    собняк    anim   masc      к       2          на
82        лясу       ляс    anim   masc      с       1           в
113    горлачу    горлач    anim   masc      ч       2           в
168    мусорку    мусорк    anim   masc      к       2           в
196        яку        як    anim   masc      к       1           в
203    Жавинку    жавинк    anim   masc      к       2           в
340      Бычку     бычок    anim   masc      к       2           в
341      Бычку     бычок    anim   masc      к       2           в
342      Бычку     бычок    anim   masc      к       2           в
364   человеку   человек    anim   masc      к       3           в
405       быку       бык    anim   masc      к       1          на
569  Малиннику  малинник    anim   masc      к       3           в
598   большаку   большак    anim   masc      к       2          на
757  покойнику  покойник    anim   masc      к       3        

Looking at the contexts, it turns out that even the examples with genuinely animate nouns (rather than nouns falsely recognized as animate) are irrelevant, because they are not in loc2. Avoiding such noise would require taking syntactic structure into account.
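
Full syntactic analysis is beyond this notebook, but one crude approximation is to narrow the window in which the preposition is searched for. The sketch below is an illustration under stated assumptions, not the method used above: it replaces the pymorphy2 lookup with a small hand-written preposition list (`PREPOSITIONS`, hypothetical and incomplete) and exposes the window size as a parameter. With `window=1` only a directly preceding preposition counts, which trades recall for precision.

```python
from typing import Optional

# Hypothetical, incomplete list used here instead of a morphological parser.
PREPOSITIONS = {'в', 'во', 'на', 'к', 'ко', 'из', 'при', 'по', 'о', 'об', 'у', 'с', 'со'}
PUNCTUATION = ',.?:;-=()!'


def nearest_preposition(left_context: str, window: int = 4) -> Optional[str]:
    """Return the preposition closest to the match within the last `window` tokens."""
    if window < 1:
        raise ValueError('window must be at least 1')
    tokens = left_context.lower().split()
    # Walk backwards from the token nearest to the match.
    for token in reversed(tokens[-window:]):
        token = token.strip(PUNCTUATION)
        if token in PREPOSITIONS:
            return token
    return None


# Default window of four tokens, as in the preprocessing cell.
print(nearest_preposition('Ну тогда (тады) ж при = к'))
# Strict adjacency: the preposition must directly precede the match.
print(nearest_preposition('в большом', window=1))
```

A narrow window would also discard genuine loc2 phrases with an intervening modifier, so the results for different window sizes would need to be compared by hand before choosing one.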

In [92]:
df['gender'].value_counts(normalize=True)

masc    0.973799
neut    0.026201
Name: gender, dtype: float64

In [94]:
print(df[df['gender']=='neut'])

            word         lema animacy gender finale  n_syll preposition
150       у'лицу       у'лицо    inan   neut      ц       3          на
155       у'лицу       у'лицо    inan   neut      ц       3          на
178    могилочку    могилочко    inan   neut      к       4          на
520  государству  государство    inan   neut      в       4          на
534   обсуждению   обсуждение    inan   neut      j       5           в
581      тёрочку      тёрочко    inan   neut      к       3          на


Loc2 applies mostly to masculine nouns. The few neuter nouns are either tagged as neuter by mistake or need further analysis in a broader context.

In [88]:
df['finale'].value_counts()

д     88
к     44
с     30
т     15
j     10
б      7
л      6
в      6
р      5
г      5
н      3
м      2
ц      2
ф      2
ч      2
н'     1
х      1
Name: finale, dtype: int64

The most frequent finales are d, k and s. The high counts for d and s are explained by 'god' and 'les', the lexemes most frequent in loc2, while the high frequency of k supports the hypothesis that loc2 prefers velar finales. Only one example has a palatalized finale, n' (NB: j is palatal, not palatalized), which supports the idea that loc2 disprefers palatalized finales.
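
The velar and palatalized shares can be computed directly from the counts printed above. The snippet below hard-codes those counts; the grouping of к, г and х as velars and the apostrophe as the mark of palatalization follow the conventions of this notebook.

```python
# Stem-final counts copied from the value_counts output above.
finale_counts = {
    'д': 88, 'к': 44, 'с': 30, 'т': 15, 'j': 10, 'б': 7, 'л': 6, 'в': 6,
    'р': 5, 'г': 5, 'н': 3, 'м': 2, 'ц': 2, 'ф': 2, 'ч': 2, "н'": 1, 'х': 1,
}

VELARS = {'к', 'г', 'х'}

total = sum(finale_counts.values())
velar = sum(n for seg, n in finale_counts.items() if seg in VELARS)
# Palatalized finales are the ones marked with an apostrophe.
palatalized = sum(n for seg, n in finale_counts.items() if seg.endswith("'"))

print(f'total: {total}')
print(f'velar: {velar} ({velar / total:.1%})')
print(f'palatalized: {palatalized} ({palatalized / total:.1%})')
```

On these counts, velars account for 50 of the 229 examples and palatalized finales for one.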

In [89]:
df['n_syll'].value_counts()

1    175
2     34
3     15
4      4
5      1
Name: n_syll, dtype: int64

We can see that the more syllables a noun has, the less frequently it occurs in loc2.

Let us compare the frequencies of the ordinary locative and loc2 in inanimate masculine one-syllable nouns. For that we use another CQL query:

Query: `[(word='.*(е)'%c)& (tag = 'NOUN,inan,masc,sing,loct.*'%c)]`

Number of results: 738

In [115]:
data_e = pd.read_csv("khislavichi_e.tsv", sep='\t')

In [116]:
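# Keep only forms with a one-syllable stem: drop every row whose form,
# minus the final -е, contains more than one vowel.
i = 0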
for word in data_e['DMatch']:
    word = str(word).lower()
    n = 0
    for letter in word[:-1]:
        if letter in 'уеыаоэяиюё':
            n += 1
    if n > 1:
        data_e = data_e.drop([i])
    i += 1

In [124]:
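# Note: u_forms counts all inanimate masculine nouns in df, whereas data_e
# was restricted above to one-syllable stems.
u_forms = len(df[(df.gender == 'masc') & (df.animacy == 'inan')].index)
e_forms = len(data_e.index)
loc2_perc = u_forms / (e_forms + u_forms) * 100
print("Percentage of loc2 forms out of all locative forms:", loc2_perc)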

Percentage of loc2 forms out of all locative forms: 46.96629213483146
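
The percentage is the share of -у forms among all locative forms counted, u / (u + e) * 100. The counts in the sketch below are not printed anywhere in the notebook. They are back-calculated from the outputs shown (229 rows in `df`, of which 14 are animate and 6 neuter, plus the percentage printed above), so treat them as a sanity check. The second call is an assumption about what a like-for-like comparison might look like: it restricts the -у side to one-syllable lemmas as well, taking the 175 one-syllable rows minus the three one-syllable animate ones.

```python
def loc2_share(u_forms: int, e_forms: int) -> float:
    """Percentage of -у (loc2) forms among all locative forms counted."""
    return u_forms / (u_forms + e_forms) * 100


# Counts back-calculated from the outputs above (assumption, see lead-in):
# 229 rows - 14 animate - 6 neuter = 209 inanimate masculine -у forms,
# and 236 one-syllable -е forms reproduce the printed percentage.
print(round(loc2_share(209, 236), 2))

# Like-for-like variant: only one-syllable lemmas on the -у side as well
# (175 one-syllable rows - 3 animate ones = 172).
print(round(loc2_share(172, 236), 2))
```

The restricted figure comes out somewhat lower. Note also that the two sides count syllables differently (vowels in the lemma versus vowels in the form without its ending), so the restricted figure is only approximate.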


We can compare this result with data from DARL (the Dialectological Atlas of the Russian Language).

![DARL Map 14 Volume II Morphology](019.jpg "DARL Map 14 Volume II Morphology")

As we can see, DARL marks the Khislavichi district (south of Smolensk, on the river Sozh) as a territory with 61 to 70 percent loc2, whereas the modern corpus data give only about 47 percent.

## Conclusions

Data collection based on a CQL query that allows for errors in the automatic morphological analysis (by including all possible tags), followed by context-based cleaning, worked well for finding loc2 forms. The irrelevant results came only from words unknown to MorphAnalyzer and from the fact that we ignored syntactic structure when looking for prepositions.
The parameters discussed in the introduction appear to be relevant for our data.
The comparison with DARL data suggests that the percentage of loc2 forms in the chosen area has decreased.
Future plans include extending the project to other corpora and automating data collection.