Members :
- Xi WANG : lilaswang2227@gmail.com
- Jiren REN : renjiren120@gmail.com

In [2]:
! python3 -m spacy download en_core_web_sm
! python3 -m spacy download fr_core_news_sm
! pip install nltk pandas scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Defaulting to user installation because normal site-packages is not writeable
Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
✔ Download and installation successful
You can now load the package via spacy.load('fr

# Exercise 1 : Lemmatization

In this exercise, the objective is to create your own lemmatizer for the French language. We will test different lemmatization approaches:
* Based on a dictionary
* Based on a machine-learning approach (you can use sklearn), or on your own architecture defined with PyTorch
* With and without the POS tag given as input

In all cases, you should compare your results and report the performance of the proposed algorithms against the [spacy](https://spacy.io/models/fr) lemmatizer (in its different configurations).

You are free to use any machine-learning algorithm/model, whether or not it takes the sentence context into account, such as [LinearRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LinearRegression.html) or training your own [RNN with pytorch](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html). 
However, you must always motivate your choices and compare the results of the different configurations.

Send the report in PDF format, named as follows, to *thomas.gerald@universite-paris-saclay.fr*, together with the code (a zip containing the notebook with the output of the two exercises):


**report_[firstname]_[lastname].pdf**

The report for the two exercises must not exceed three pages!


## Dataset
To train or build your lemmatizer, you have three files in *tab-separated values* format:
* [training-set.tsv](https://thomas-gerald.fr/TMC/resources/data/training-set.tsv), which you can use to train/build your dictionary/model
* [testing-set.tsv](https://thomas-gerald.fr/TMC/resources/data/testing-set.tsv), used to evaluate the different approaches
* [testing-gallica.tsv](https://thomas-gerald.fr/TMC/resources/data/testing-gallica.tsv), used as a gold standard to evaluate performance (source: [github (in French)](https://github.com/Gallicorpora/Lemmatisation))

In our case we have two possibilities for a lemma:
* (a) A sequence of characters, meaning that "to rule" and "a rule" share the same lemma
* (b) A tuple of a sequence of characters and a POS tag, meaning that "to rule" is represented by the tuple ("rule", "V") while "a rule" is represented by the tuple ("rule", "N")
In the (a) case the size of the vocabulary (output) will be 
## Spacy :

Below is a small example of lemmatization with spaCy:
```python
import spacy
nlp = spacy.load("en_core_web_sm")
text_a = "He is thirty years old"
text_b = "We still are champions"
print(f'Lemmatization A : {[(w.lemma_, w.pos_) for w in nlp(text_a)]}')
print(f'Lemmatization B : {[(w.lemma_, w.pos_) for w in nlp(text_b)]}')
```

### Reading data

You can use pandas to read the data with a tab separator, as follows:

In [1]:
import pandas as pd
train_file = "data/training-set.tsv"
pd.read_csv(train_file, sep='\t', names=["token", "lemma", "pos"])

Unnamed: 0,token,lemma,pos
0,Certes,certes,ADV
1,",",",",PONCT
2,rien,rien,PRO
3,ne,ne,ADV
4,dit,dire,V
...,...,...,...
261384,effet,effet,N
261385,positif,positif,A
261386,.,.,PONCT
261387,tenir,tenir,V


In [2]:

w_vocabulary = {'unknow_word'}
l_vocabulary = set()
lp_vocabulary = set()

with open(train_file, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            word, lemma, pos = line.split()
        except ValueError:
            # skip lines that do not contain exactly three fields
            continue
        w_vocabulary.add(word)
        l_vocabulary.add(lemma)
        lp_vocabulary.add((lemma, pos))

print(f'The input vocabulary contains : {len(w_vocabulary)} words' )
print(f'The number of str lemma is :  {len(l_vocabulary)}')
print(f'The number of lemma (considering PoS) is :  {len(lp_vocabulary)}')

The input vocabulary contains : 23271 words
The number of str lemma is :  15194
The number of lemma (considering PoS) is :  16144
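
The loop above skips every line that `line.split()` cannot unpack into exactly three fields, which may include tokens that themselves contain a space. A minimal sketch of a stricter reader, assuming the three columns are always separated by tabs; `read_tsv_rows` and the sample lines are hypothetical and only illustrate the idea:

```python
def read_tsv_rows(lines):
    """Split each line on tabs only and report what was skipped."""
    rows, skipped = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore empty lines (e.g. sentence separators)
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
        else:
            skipped.append(line)  # keep malformed lines for inspection
    return rows, skipped


# Toy input (not taken from the dataset)
sample = ["Certes\tcertes\tADV\n", "parce que\tparce_que\tC\n", "\n", "broken line\n"]
rows, skipped = read_tsv_rows(sample)
print(rows)     # the token "parce que" is kept because only tabs separate fields
print(skipped)  # lines without exactly three tab-separated fields
```

Returning the skipped lines makes it possible to check how many examples are lost instead of dropping them silently.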


### Lemmatization based on a dictionary

In [3]:
from collections import defaultdict, Counter

lemma_dict = defaultdict(Counter)          # case (a)
lemma_pos_dict = defaultdict(Counter)      # case (b)

In [4]:
with open(train_file, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            word, lemma, pos = line.strip().split()
        except ValueError:
            # skip lines that do not contain exactly three fields
            continue

        word = word.lower()
        lemma = lemma.lower()

        # case (a): word -> lemma counts
        lemma_dict[word][lemma] += 1
        # case (b): (word, pos) -> lemma counts
        lemma_pos_dict[(word, pos)][lemma] += 1

print(lemma_dict)
print(lemma_pos_dict)

defaultdict(<class 'collections.Counter'>, {'certes': Counter({'certes': 39}), ',': Counter({',': 15911}), 'rien': Counter({'rien': 54}), 'ne': Counter({'ne': 719}), 'dit': Counter({'dire': 36}), "qu'": Counter({'que': 596, "qu'": 1}), 'une': Counter({'un': 2378, 'une': 9}), 'seconde': Counter({'second': 25, 'seconde': 4}), 'motion': Counter({'motion': 3}), 'de': Counter({'de': 12730, 'un': 167, 'du': 1, 'en_vue>de': 1, 'à_raison>de': 1}), 'censure': Counter({'censure': 3}), 'sur': Counter({'sur': 1298}), 'son': Counter({'son': 689}), 'projet': Counter({'projet': 94}), 'loi': Counter({'loi': 85}), 'reprenant': Counter({'reprendre': 2}), "l'": Counter({'le': 5066, 'il': 114, "l'": 15}), 'accord': Counter({'accord': 152}), 'du': Counter({'de': 2696, 'du': 186, 'du_coup': 1, 'à_cause_de': 1, 'au_sein_de': 1, 'un': 1, 'en_raison_de': 1}), '10': Counter({'10': 120}), 'avril': Counter({'avril': 64}), "n'": Counter({'ne': 766}), 'aurait': Counter({'avoir': 66}), 'pas': Counter({'pas': 846}), 

##### Training the dictionary-based lemmatizer

In [5]:
# case (a): without POS
dict_lemma = {
    word: counter.most_common(1)[0][0]
    for word, counter in lemma_dict.items()
}

# case (b): with POS
dict_lemma_pos = {
    key: counter.most_common(1)[0][0]
    for key, counter in lemma_pos_dict.items()
}

print(dict_lemma)
print(dict_lemma_pos)

{'certes': 'certes', ',': ',', 'rien': 'rien', 'ne': 'ne', 'dit': 'dire', "qu'": 'que', 'une': 'un', 'seconde': 'second', 'motion': 'motion', 'de': 'de', 'censure': 'censure', 'sur': 'sur', 'son': 'son', 'projet': 'projet', 'loi': 'loi', 'reprenant': 'reprendre', "l'": 'le', 'accord': 'accord', 'du': 'de', '10': '10', 'avril': 'avril', "n'": 'ne', 'aurait': 'avoir', 'pas': 'pas', 'été': 'être', 'la': 'le', 'bonne': 'bon', 'mais': 'mais', 'cette': 'ce', 'probabilité': 'probabilité', 'reconnaissent': 'reconnaître', 'les': 'le', 'socialistes': 'socialiste', 'était': 'être', 'plus': 'plus', 'plausible': 'plausible', '.': '.', 'toujours': 'toujours', 'est': 'être', '-il': 'il', 'que': 'que', 'le': 'le', 'gouvernement': 'gouvernement', 'a': 'avoir', 'cédé': 'céder', 'alors_que': 'alors_que', 'ses': 'son', 'adversaires': 'adversaire', 'politiques': 'politique', 'proposent': 'proposer', 'aucune': 'aucun', 'solution': 'solution', 'alternative': 'alternatif', 'et': 'et', 'considèrent': 'considér

In [6]:
# Lemmatize with the dictionaries built above: try (word, pos), then word, then return the word itself
def lemmatize_dict(word, pos=None):
    word = word.lower()
    if pos is not None and (word, pos) in dict_lemma_pos:
        return dict_lemma_pos[(word, pos)]
    if word in dict_lemma:
        return dict_lemma[word]
    return word 

print(lemmatize_dict("dit"))
print(lemmatize_dict("dit", "V"))

dire
dire


##### Accuracy of the dictionary-based lemmatizer

In [7]:
test_file = "data/testing-set.tsv"
test_data_dict = []

with open(test_file, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            word, lemma, pos = line.strip().split()
        except ValueError:
            # skip lines that do not contain exactly three fields
            continue
        test_data_dict.append((word.lower(), lemma.lower(), pos))

print(f'The test set contains {len(test_data_dict)} tokens.')

The test set contains 16694 tokens.


In [8]:
def evaluate(lemmatizer, data, use_pos=False):
    correct = 0
    total = 0

    for word, gold_lemma, pos in data:
        if use_pos:
            pred = lemmatizer(word, pos)
        else:
            pred = lemmatizer(word)

        if pred == gold_lemma:
            correct += 1
        total += 1

    return correct / total

acc_no_pos = evaluate(lemmatize_dict, test_data_dict, use_pos=False)
acc_with_pos = evaluate(lemmatize_dict, test_data_dict, use_pos=True)

print(f"Dictionary lemmatizer (no POS): {acc_no_pos:.4f}")
print(f"Dictionary lemmatizer (with POS): {acc_with_pos:.4f}")

Dictionary lemmatizer (no POS): 0.9475
Dictionary lemmatizer (with POS): 0.9611


Using POS information improves lemmatization accuracy, especially for ambiguous forms that can correspond to different lemmas depending on their syntactic category (e.g. nouns vs verbs).

However, the POS-based dictionary approach is sensitive to POS tagging consistency and suffers from out-of-vocabulary errors when unseen word–POS combinations appear in the test set.
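
To see how often each fallback level of `lemmatize_dict` is actually used, one can count, for every test row, whether the `(word, pos)` key, the word alone, or nothing at all is found. The sketch below is an illustration on toy data; `coverage_report` and the toy dictionaries are hypothetical names, not part of the notebook:

```python
from collections import Counter


def coverage_report(test_rows, word_pos_dict, word_dict):
    """Count which lookup level answers each (word, lemma, pos) test row."""
    hits = Counter()
    for word, _lemma, pos in test_rows:
        if (word, pos) in word_pos_dict:
            hits["word+pos"] += 1      # exact (word, pos) entry exists
        elif word in word_dict:
            hits["word only"] += 1     # falls back to the POS-free dictionary
        else:
            hits["unseen"] += 1        # out of vocabulary: the word is returned as-is
    return hits


# Toy dictionaries and test rows (not taken from the dataset)
toy_word_pos = {("dit", "V"): "dire", ("une", "D"): "un"}
toy_word = {"dit": "dire", "une": "un"}
toy_test = [("dit", "dire", "V"), ("une", "un", "PRO"), ("champions", "champion", "N")]

report = coverage_report(toy_test, toy_word_pos, toy_word)
print(dict(report))
```

On the real data, the same function could be called with `test_data_dict`, `dict_lemma_pos` and `dict_lemma` to quantify the out-of-vocabulary share.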

### Based on Machine Learning (sklearn)

In [9]:
def extract_rule(word, lemma, max_len=6):
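    """
    Derive a suffix-rewrite rule from a (word, lemma) pair.

    Returns (word_suffix, lemma_suffix) for the smallest suffix length i
    (up to max_len) such that word[:-i] == lemma[:-i].
    If no such i exists, the whole (word, lemma) pair is returned as the rule.
    """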
    for i in range(1, min(len(word), max_len) + 1):
        if word[:-i] == lemma[:-i]:
            return (word[-i:], lemma[-i:])
    return word, lemma



def make_features(word, pos):
    """extract suffix features from the word"""
    word = word.lower()
    return {
        "suffix1": word[-1:] if len(word) >= 1 else "",
        "suffix2": word[-2:] if len(word) >= 2 else "",
        "suffix3": word[-3:] if len(word) >= 3 else "",
        "suffix4": word[-4:] if len(word) >= 4 else "",
        "pos": pos
    }
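
Because `extract_rule` strips the same number of characters from both strings, it only yields a short rule when the word and the lemma have the same length; otherwise the rule is the whole word mapped to the whole lemma (for example `une→un`), which does not generalize to unseen words. A possible alternative, sketched here as an assumption rather than as the method used in this notebook, is to cut both strings after their longest common prefix. `extract_prefix_rule` and `apply_prefix_rule` are hypothetical names:

```python
def extract_prefix_rule(word, lemma):
    """Return (word_suffix, lemma_suffix) left after removing the longest common prefix."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def apply_prefix_rule(word, rule):
    """Replace the word suffix by the lemma suffix; leave the word unchanged if the rule does not fit."""
    w_suf, l_suf = rule
    if w_suf and not word.endswith(w_suf):
        return word
    # len() arithmetic avoids the word[:-0] pitfall when w_suf is empty
    stem = word[:len(word) - len(w_suf)]
    return stem + l_suf


print(extract_prefix_rule("économies", "économie"))  # ('s', '')
print(extract_prefix_rule("dit", "dire"))            # ('t', 're')
print(apply_prefix_rule("maisons", ("s", "")))       # maison
```

With such rules, many plural forms would share the single label `('s', '')` instead of one label per word, which reduces the number of classes the classifier has to learn.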

In [10]:
X = []
y = []

with open(train_file, 'r', encoding='utf-8') as f:
    for line in f:
        #print("line:", line)
        line = line.rstrip('\n')
        if not line:
            continue
        try:
            word, lemma, pos = line.split()
        except ValueError:
            continue

        word = word.lower()
        lemma = lemma.lower()
        w_suf, l_suf = extract_rule(word, lemma)

        X.append(make_features(word, pos))
        y.append(f"{w_suf}→{l_suf}")

print(len(X), len(y))
print(X[:10])
print(y[:10])

261378 261378
[{'suffix1': 's', 'suffix2': 'es', 'suffix3': 'tes', 'suffix4': 'rtes', 'pos': 'ADV'}, {'suffix1': ',', 'suffix2': '', 'suffix3': '', 'suffix4': '', 'pos': 'PONCT'}, {'suffix1': 'n', 'suffix2': 'en', 'suffix3': 'ien', 'suffix4': 'rien', 'pos': 'PRO'}, {'suffix1': 'e', 'suffix2': 'ne', 'suffix3': '', 'suffix4': '', 'pos': 'ADV'}, {'suffix1': 't', 'suffix2': 'it', 'suffix3': 'dit', 'suffix4': '', 'pos': 'V'}, {'suffix1': "'", 'suffix2': "u'", 'suffix3': "qu'", 'suffix4': '', 'pos': 'C'}, {'suffix1': 'e', 'suffix2': 'ne', 'suffix3': 'une', 'suffix4': '', 'pos': 'D'}, {'suffix1': 'e', 'suffix2': 'de', 'suffix3': 'nde', 'suffix4': 'onde', 'pos': 'A'}, {'suffix1': 'n', 'suffix2': 'on', 'suffix3': 'ion', 'suffix4': 'tion', 'pos': 'N'}, {'suffix1': 'e', 'suffix2': 'de', 'suffix3': '', 'suffix4': '', 'pos': 'P'}]
['s→s', ',→,', 'n→n', 'e→e', 'dit→dire', "'→e", 'une→un', 'seconde→second', 'n→n', 'e→e']


##### Training the sklearn-based lemmatizer

In [11]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

ml_model = Pipeline([
    ("vectorizer", DictVectorizer()),
    ("classifier", LogisticRegression(max_iter=200))
])

# Fitting on the full training set takes too long here, so we train on the first 10,000 samples only.
#ml_model.fit(X, y) 
ml_model.fit(X[:10000], y[:10000])
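
One way to use more of the training data without fitting on every row is to merge identical `(features, label)` pairs and keep their frequency as a weight, since frequent tokens such as `,` or `de` produce many exact duplicates. The sketch below only shows the merging step on toy data; `collapse_duplicates` is a hypothetical helper, and whether this makes the full fit fast enough is an assumption that was not tested here:

```python
from collections import Counter


def collapse_duplicates(features, labels):
    """Merge identical (features, label) pairs and count how often each occurs."""
    counts = Counter(
        (tuple(sorted(feat.items())), label)
        for feat, label in zip(features, labels)
    )
    X_unique = [dict(key) for key, _ in counts]
    y_unique = [label for _, label in counts]
    weights = list(counts.values())
    return X_unique, y_unique, weights


# Toy examples in the same shape as X and y above
X_toy = [
    {"suffix1": "s", "pos": "N"},
    {"suffix1": "s", "pos": "N"},
    {"suffix1": "t", "pos": "V"},
]
y_toy = ["s→s", "s→s", "dit→dire"]

X_u, y_u, w = collapse_duplicates(X_toy, y_toy)
print(len(X_toy), "->", len(X_u), "examples, weights:", w)
# Assumption: the weights could then be given to the classifier step, e.g.
# ml_model.fit(X_u, y_u, classifier__sample_weight=w)
```

`LogisticRegression.fit` accepts a `sample_weight` argument, so the weighted fit on unique rows optimizes the same objective as the unweighted fit on the duplicated rows.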


In [12]:
#test the model
print(ml_model.predict([make_features('ties', 'N')]))
print(ml_model.predict([make_features('onet', 'V')]))

['économies→économie']
['t→t']


In [13]:
def apply_rule(word, rule_str):
    """
    Apply a predicted transformation rule to a word.
    """
    w_suf, l_suf = rule_str.split("→")

    if word.endswith(w_suf):
        return word[:-len(w_suf)] + l_suf
    else:
        return word


def lemmatize_ml(word, pos=None):
    """Lemmatize a word using the ML model."""
    word = word.lower()

    features = make_features(word, pos)

    rule_str = ml_model.predict([features])[0]
    return apply_rule(word, rule_str)

# test the ML lemmatizer
print(lemmatize_ml("une", "D"))       # gold lemma: un
print(lemmatize_ml("dit", "V"))       # gold lemma: dire
print(lemmatize_ml("seconde", "A"))   # gold lemma: second

un
dir
seconde


##### Accuracy of the sklearn-based lemmatizer

In [15]:
test_file = "data/testing-set.tsv"

test_data_skl = []

with open(test_file, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue

        parts = line.split('\t')
        if len(parts) != 3:
            continue

        word, lemma, pos = parts
        test_data_skl.append((word.lower(), lemma.lower(), pos))


def evaluate_case_a(lemmatizer, data):
    correct = 0
    total = 0

    for word, gold_lemma, pos in data:
        pred = lemmatizer(word, pos)
        if pred == gold_lemma:
            correct += 1
        total += 1

    return correct / total


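def evaluate_case_b(lemmatizer, data):
    """
    Case (b): compare (lemma, pos) tuples.
    The gold POS tag is used on both sides, so this score equals case (a).
    """
    correct = 0
    total = 0

    for word, gold_lemma, pos in data:
        pred = lemmatizer(word, pos)
        if (pred, pos) == (gold_lemma, pos):
            correct += 1
        total += 1

    return correct / total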


acc_a = evaluate_case_a(lemmatize_ml, test_data_skl)
acc_b = evaluate_case_b(lemmatize_ml, test_data_skl)

print(f"ML lemmatizer – (no pos) lemma: {acc_a:.4f}")
print(f"ML lemmatizer – (with pos) lemma: {acc_b:.4f}")

ML lemmatizer – (no pos) lemma: 0.8425
ML lemmatizer – (with pos) lemma: 0.8425
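
Both figures above come from calls that pass the gold POS tag to `lemmatize_ml`, and `(pred, pos) == (gold_lemma, pos)` holds exactly when `pred == gold_lemma`, so the two scores are equal by construction. A minimal sketch of an evaluation that separates the two settings by withholding the tag in one of them; `accuracy` and `toy_lemmatizer` are hypothetical names used only for illustration:

```python
def accuracy(lemmatizer, rows, use_pos):
    """Share of rows whose predicted lemma equals the gold lemma."""
    correct = 0
    for word, gold, pos in rows:
        # Withhold the tag when use_pos is False
        pred = lemmatizer(word, pos if use_pos else None)
        correct += pred == gold
    return correct / len(rows)


def toy_lemmatizer(word, pos=None):
    # Hypothetical stand-in for lemmatize_ml or lemmatize_dict
    table = {("portes", "N"): "porte", ("portes", "V"): "porter"}
    return table.get((word, pos), word.rstrip("s"))


toy_rows = [("portes", "porte", "N"), ("portes", "porter", "V")]
print(accuracy(toy_lemmatizer, toy_rows, use_pos=False))  # 0.5
print(accuracy(toy_lemmatizer, toy_rows, use_pos=True))   # 1.0
```

For the sklearn model, a fair "without POS" score would also need a model trained without the `pos` feature, since a model trained with it never saw a missing tag.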


We used a LogisticRegression model from sklearn as a baseline machine-learning approach for lemmatization. It acts as a classifier: each training label is a suffix-rewrite rule, and the predicted rule is then applied to the word.

Words were represented using suffix-based features (last 1–4 characters) and part-of-speech tags.

The sklearn model reaches a lower accuracy than the dictionary model, probably because it was trained on only the first 10,000 training examples.

Although RNN-based models implemented in PyTorch could potentially capture more complex morphological patterns, they typically need more training data and compute, and benefit from sentence-level context. Therefore, we did not consider them in this exercise.

### Based on SpaCy (fr_core_news_sm)

In [16]:
import spacy
nlp = spacy.load("fr_core_news_sm")
text_a = "Il a trente ans."
text_b = "Nous sommes toujours champions."
print(f'Lemmatization text A : {[(w.lemma_, w.pos_) for w in nlp(text_a)]}')
print(f'Lemmatization text B : {[(w.lemma_, w.pos_) for w in nlp(text_b)]}')



Lemmatization text A : [('il', 'PRON'), ('avoir', 'AUX'), ('trente', 'NUM'), ('an', 'NOUN'), ('.', 'PUNCT')]
Lemmatization text B : [('nous', 'PRON'), ('être', 'AUX'), ('toujours', 'ADV'), ('champion', 'ADJ'), ('.', 'PUNCT')]


In [19]:
def compare_with_spacy(text, my_lemmatizer, pass_pos_to_my_model=False):
    """
    my_lemmatizer: lemmatize_dict or lemmatize_ml
    pass_pos_to_my_model:
      - False: call my_lemmatizer(word, None)
      - True : call my_lemmatizer(word, spacy_pos)
    """
    doc = nlp(text)
    rows = []
    for tok in doc:
        word = tok.text
        spacy_pos = tok.pos_
        spacy_lemma = tok.lemma_

        if pass_pos_to_my_model:
            my_lemma = my_lemmatizer(word, spacy_pos)
        else:
            my_lemma = my_lemmatizer(word, None)

        rows.append({
            "word": word,
            "spacy_pos": spacy_pos,
            "spacy_lemma": spacy_lemma,
            "my_lemma": my_lemma
        })
    return pd.DataFrame(rows)
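
Note that spaCy returns Universal Dependencies tags (`NOUN`, `AUX`, `PUNCT`, ...) while the training data uses tags such as `N`, `V` and `PONCT`, so passing `spacy_pos` directly would never match a `(word, pos)` key. The table below is a guessed correspondence based only on the tags visible in the outputs above; it is an assumption, not an official mapping, and `to_corpus_pos` is a hypothetical helper:

```python
# Assumed mapping from spaCy's coarse tags to the tags seen in the training data
UD_TO_CORPUS = {
    "NOUN": "N",
    "VERB": "V",
    "AUX": "V",
    "ADJ": "A",
    "ADV": "ADV",
    "PRON": "PRO",
    "DET": "D",
    "ADP": "P",
    "CCONJ": "C",
    "SCONJ": "C",
    "PUNCT": "PONCT",
}


def to_corpus_pos(ud_tag):
    """Return the assumed corpus tag, or None when no mapping is known."""
    return UD_TO_CORPUS.get(ud_tag)


print(to_corpus_pos("AUX"))  # V
print(to_corpus_pos("NUM"))  # None
```

Returning `None` for unmapped tags lets `lemmatize_dict` fall back to its word-only dictionary.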


In [20]:
# Compare with dictionary lemmatizer
print("=== Dictionary vs spaCy ===")
display(compare_with_spacy(text_a, lemmatize_dict, pass_pos_to_my_model=False))
display(compare_with_spacy(text_b, lemmatize_dict, pass_pos_to_my_model=False))

# Compare with ML (sklearn) lemmatizer
print("=== ML sklearn vs spaCy ===")
display(compare_with_spacy(text_a, lemmatize_ml, pass_pos_to_my_model=False))
display(compare_with_spacy(text_b, lemmatize_ml, pass_pos_to_my_model=False))


=== Dictionary vs spaCy ===


Unnamed: 0,word,spacy_pos,spacy_lemma,my_lemma
0,Il,PRON,il,il
1,a,AUX,avoir,avoir
2,trente,NUM,trente,trente
3,ans,NOUN,an,an
4,.,PUNCT,.,.


Unnamed: 0,word,spacy_pos,spacy_lemma,my_lemma
0,Nous,PRON,nous,il
1,sommes,AUX,être,être
2,toujours,ADV,toujours,toujours
3,champions,ADJ,champion,champions
4,.,PUNCT,.,.


=== ML sklearn vs spaCy ===


Unnamed: 0,word,spacy_pos,spacy_lemma,my_lemma
0,Il,PRON,il,il
1,a,AUX,avoir,avoir
2,trente,NUM,trente,trente
3,ans,NOUN,an,ans
4,.,PUNCT,.,.


Unnamed: 0,word,spacy_pos,spacy_lemma,my_lemma
0,Nous,PRON,nous,nous
1,sommes,AUX,être,sommes
2,toujours,ADV,toujours,toujours
3,champions,ADJ,champion,champions
4,.,PUNCT,.,.


### Conclusion

The dictionary-based lemmatizer was built from the full training set (about 261,000 tokens) and evaluated on the test set (16,694 tokens). In contrast, due to computational constraints, the sklearn LogisticRegression model was trained on only the first 10,000 training examples; it was evaluated on the same test set.

The dictionary-based approach achieves higher accuracy than the sklearn model both without POS (0.9475 vs. 0.8425) and with POS (0.9611 vs. 0.8425). POS information improves the dictionary baseline by resolving forms whose lemma depends on the syntactic category. The two sklearn scores are identical because both evaluations pass the POS tag to the model and check the same predicted lemma, so they do not measure the effect of POS on the sklearn model.

The qualitative comparison with spaCy on two short sentences points in the same direction: the dictionary method agrees with spaCy on most frequent in-vocabulary forms (although it maps "Nous" to "il" and leaves "champions" unchanged), whereas the sklearn model also misses "ans" and "sommes", that is, a plural noun and an irregular verb form. spaCy appears stronger on these examples, plausibly because it relies on richer linguistic resources, but it was not scored on the test set here.