# Morfeusz

**[Morfeusz 2](http://sgjp.pl/morfeusz/morfeusz.html)** performs morphological analysis of Polish. It comes with a large dictionary of Polish words (about 7 million), together with their lemmas and parts of speech.

It does not use a word's context for lemmatization: each word is analyzed separately. If a word is morphologically ambiguous (that is, several different lemmas are possible for it), all candidate lemmas are returned.
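
To make the ambiguity concrete, here is a minimal sketch that groups analysis rows by token and collects every candidate lemma. The rows are copied from the analysis output shown further down in this notebook; the column layout `[start, end, form, lemma, tag, ...]` is inferred from that output, and `lemma_candidates` is a hypothetical helper, not part of the bindings.

```python
from collections import OrderedDict

# Rows copied from the analysis output shown later in this notebook.
# Assumed layout: [start, end, form, lemma, tag, name type, qualifiers].
rows = [
    ['1', '2', 'ma', 'mój:a', 'adj:sg:nom.voc:f:pos', '_', '_'],
    [' 1', '2', 'ma', 'mieć', 'fin:sg:ter:imperf', '_', '_'],
    ['2', '3', 'kota', 'kota', 'subst:sg:nom:f', 'nazwa pospolita', '_'],
    [' 2', '3', 'kota', 'kot:s1', 'subst:sg:gen.acc:m2', 'nazwa pospolita', '_'],
    [' 2', '3', 'kota', 'kot:s2', 'subst:sg:gen.acc:m1', 'nazwa pospolita', 'pot.|środ.'],
]

def lemma_candidates(rows):
    """Map each token start position to (form, [candidate lemmas])."""
    grouped = OrderedDict()
    for row in rows:
        # The start index may carry a leading space or '[' in the raw output.
        start = row[0].strip(' [')
        form, lemma = row[2], row[3]
        _, lemmas = grouped.setdefault(start, (form, []))
        if lemma not in lemmas:
            lemmas.append(lemma)
    return grouped

candidates = lemma_candidates(rows)
for start, (form, lemmas) in candidates.items():
    print(start, form, lemmas)
# 1 ma ['mój:a', 'mieć']
# 2 kota ['kota', 'kot:s1', 'kot:s2']
```

Choosing among the candidates is left to the caller, since Morfeusz itself does not look at context.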

Morfeusz does not have Python 3 bindings. On the other hand, Polish NLP in Python 2 is not very convenient (yep, getting the encoding right hurts!). Thus, we provide our own Python 3 'bindings' [here](../src/morfeusz2.py).

The easiest way to run Morfeusz is through Docker.

You may use [this Dockerfile](../docker/morfeusz). 

How? Just navigate to the root directory of the `nlp_workshop` repository and run in your terminal:

```
make docker-build-morfeusz
make docker-run-morfeusz
```
 
The commands above will start a Jupyter notebook kernel.

Now you are ready to run this notebook.

In [9]:
import sys
sys.path.append('../src/') 

from morfeusz2 import Morfeusz

In [10]:
morf = Morfeusz()

In [11]:
text = 'Ala ma kota. Alo, gdzie Twój kot?'

In [12]:
result = morf.analyse(text)

In [13]:
result

[['[0', '1', 'Ala', 'Ala', 'subst:sg:nom:f', 'imię', '_'],
 [' 0', '1', 'Ala', 'Al', 'subst:sg:gen.acc:m1', 'imię', '_'],
 [' 0', '1', 'Ala', 'Alo', 'subst:sg:gen.acc:m1', 'imię', '_'],
 ['1', '2', 'ma', 'mój:a', 'adj:sg:nom.voc:f:pos', '_', '_'],
 [' 1', '2', 'ma', 'mieć', 'fin:sg:ter:imperf', '_', '_'],
 ['2', '3', 'kota', 'kota', 'subst:sg:nom:f', 'nazwa pospolita', '_'],
 [' 2', '3', 'kota', 'kot:s1', 'subst:sg:gen.acc:m2', 'nazwa pospolita', '_'],
 [' 2',
  '3',
  'kota',
  'kot:s2',
  'subst:sg:gen.acc:m1',
  'nazwa pospolita',
  'pot.|środ.'],
 ['3', '4', '.', '.', 'interp', '_', '_'],
 ['4', '5', 'Alo', 'Alo', 'subst:sg:nom:m1', 'imię', '_'],
 [' 4', '5', 'Alo', 'Alo', 'subst:sg:voc:m1', 'imię', '_'],
 ['5', '6', '', '', '', '', 'interp', '_', '_'],
 ['6', '7', 'gdzie', 'gdzie:d', 'adv', '_', '_'],
 [' 6', '7', 'gdzie', 'gdzie:q', 'qub', '_', '_'],
 ['7', '8', 'Twój', 'twój:s', 'subst:sg:nom:m1', 'nazwa pospolita', 'pot.'],
 [' 7', '8', 'Twój', 'twój:s', 'subst:sg:voc:m1', 'naz
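
The fifth element of each row is a morphosyntactic tag. In the output above, tags are colon-separated, and a single position can list several alternative values separated by dots (for example `gen.acc`). A minimal sketch of splitting such a tag follows; `parse_tag` is a hypothetical helper, not part of the bindings.

```python
def parse_tag(tag):
    """Split a tag such as 'subst:sg:gen.acc:m1' into (pos, features)."""
    pos, *positions = tag.split(':')
    # Each position may hold several alternative values separated by dots.
    features = [position.split('.') for position in positions]
    return pos, features

print(parse_tag('subst:sg:gen.acc:m1'))
# ('subst', [['sg'], ['gen', 'acc'], ['m1']])
print(parse_tag('interp'))
# ('interp', [])
```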

In [14]:
import pandas as pd

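def lemmatize(text):
    # Each row is one candidate analysis: [start, end, form, lemma, tag, ...].
    result = morf.analyse(text)
    # Column 1 is the token's end position; column 2 is the surface form
    # (the candidate lemma sits in column 3).
    morf_df = pd.DataFrame(result)[[1, 2]]
    morf_df.columns = ['word_number', 'word']
    # Keep only the first candidate row for each token.
    morf_df = morf_df.drop_duplicates(subset=['word_number'], keep='first')
    return list(morf_df['word'])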

In [15]:
lemmatize(text)

['Ala', 'ma', 'kota', '.', 'Alo', '', 'gdzie', 'Twój', 'kot', '?']
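
The list above contains the surface forms taken from the third element of each row. If the lemma of the first candidate is wanted instead, a sketch along these lines could be used. It works on rows copied from the earlier output; `first_lemmas` is a hypothetical helper, and treating the part after the colon as a homonym identifier (as in `kot:s1`) is an assumption that holds for the rows shown here.

```python
# Rows copied from the analysis output earlier in this notebook.
rows = [
    ['[0', '1', 'Ala', 'Ala', 'subst:sg:nom:f', 'imię', '_'],
    [' 0', '1', 'Ala', 'Al', 'subst:sg:gen.acc:m1', 'imię', '_'],
    [' 0', '1', 'Ala', 'Alo', 'subst:sg:gen.acc:m1', 'imię', '_'],
    ['1', '2', 'ma', 'mój:a', 'adj:sg:nom.voc:f:pos', '_', '_'],
    [' 1', '2', 'ma', 'mieć', 'fin:sg:ter:imperf', '_', '_'],
    ['2', '3', 'kota', 'kota', 'subst:sg:nom:f', 'nazwa pospolita', '_'],
    [' 2', '3', 'kota', 'kot:s1', 'subst:sg:gen.acc:m2', 'nazwa pospolita', '_'],
    ['3', '4', '.', '.', 'interp', '_', '_'],
]

def first_lemmas(rows):
    """Return the first candidate lemma for each token, in text order."""
    lemmas = []
    seen = set()
    for row in rows:
        # Normalize the start index, which may carry a leading space or '['.
        start = row[0].strip(' [')
        if start in seen:
            continue
        seen.add(start)
        lemma = row[3]
        # Drop the homonym identifier, e.g. 'kot:s1' -> 'kot'; keep the
        # original string if nothing precedes the colon.
        lemmas.append(lemma.split(':')[0] or lemma)
    return lemmas

print(first_lemmas(rows))
# ['Ala', 'mój', 'kota', '.']
```

Note that the first candidate is not necessarily the right one: `ma` is resolved to `mój` here rather than `mieć`, because no context is taken into account.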