# Lemmatization

## Anish Sachdeva (DTU/2K16/MC/013)

## Natural Language Processing (NLP) - Dr. Seba Susan

### Overview
We introduce lemmatization in this notebook and use it to process our resume, lemmatizing each token in the corpus (the resume). We also compare the results with those of the previously seen Porter Stemmer algorithm on the same resume.

### 1. Introduction

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:


_am, are, is_ $\Rightarrow$ __be__

_car, cars, car's, cars'_ $\Rightarrow$ __car__

The result of such a mapping can be something like this:

_the boy's cars are different colors_ $\Rightarrow$ _the boy car be differ color_

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

If confronted with the token _saw_, stemming might return just _s_, whereas lemmatization would attempt to return either _see_ or _saw_ depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. 
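
The contrast can be made concrete with a toy sketch. It is only an illustration, not how a real stemmer or lemmatizer is implemented: `toy_stem`, `toy_lemmatize` and the tiny `TOY_LEMMAS` vocabulary are hypothetical names invented here.

```python
# Hypothetical mini-vocabulary: (token, part of speech) -> lemma.
TOY_LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("are", "v"): "be",
    ("cars", "n"): "car",
}


def toy_stem(token: str) -> str:
    """Chop the first matching suffix, without consulting any vocabulary."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def toy_lemmatize(token: str, pos: str) -> str:
    """Look the (token, pos) pair up in the vocabulary; unknown pairs are returned unchanged."""
    return TOY_LEMMAS.get((token, pos), token)


print(toy_stem("cars"), toy_lemmatize("cars", "n"))                         # car car
print(toy_stem("organizing"))                                               # organiz
print(toy_stem("saw"), toy_lemmatize("saw", "v"), toy_lemmatize("saw", "n"))  # saw see saw
```

The suffix stripper never needs to know what a word means, while the lookup needs both a vocabulary and the part of speech to choose between _see_ and _saw_.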

Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.

## 2. Importing basic tools required for tokenization, stemming and lemmatization of the resume corpus.

In [1]:
import nltk
import pickle
from collections import Counter
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\anish\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 3. Tokenizer
The tokenizer is a basic tool used in NLP that returns the tokens of a given document. We can also create our own tokenizer based on a regular expression (regex).

In [2]:
class Tokenizer:
    def __init__(self):
        self._tokenizer = nltk.RegexpTokenizer(r'\w+')

    def tokenize(self, document: str) -> list:
        return self._tokenizer.tokenize(document)

In [3]:
# We define a basic tokenizer that takes all word strings from a document as a token. Test it below
# You can also modify the sentence below to see tokens
message = "Have no fear of perfection. You'll never reach it 🔥"
tokenizer = Tokenizer()
print(tokenizer.tokenize(message))

['Have', 'no', 'fear', 'of', 'perfection', 'You', 'll', 'never', 'reach', 'it']
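
The output above splits _You'll_ into `You` and `ll`, because the apostrophe is not a word character. The sketch below reproduces this with the standard `re` module and shows one possible alternative pattern that keeps such contractions together. The second pattern is an assumption made for illustration and is not used anywhere else in this notebook.

```python
import re

message = "Have no fear of perfection. You'll never reach it 🔥"

# Same pattern as the Tokenizer class: runs of word characters only.
basic_tokens = re.findall(r"\w+", message)

# Hypothetical alternative: allow one apostrophe followed by more word characters.
contraction_tokens = re.findall(r"\w+(?:'\w+)?", message)

print(basic_tokens)
print(contraction_tokens)
```

Whether contractions should stay together depends on the downstream task; for the resume used here the simpler `\w+` pattern is kept.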


## 4. Porter Stemmer
We now define the Porter Stemmer algorithm, one of the best-known stemming algorithms for the English language, created by M. F. Porter in 1980. The algorithm reduces a token by removing its inflections, e.g. _running_ $\rightarrow$ _run_.

We use the Porter stemmer as defined in the first assignment, which can be seen [here](https://github.com/anishLearnsToCode/porter-stemmer).
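
Many rules of the stemmer defined below only fire when the remaining stem is long enough, which the algorithm expresses through the measure _m_: the number of vowel-consonant sequences in the stem. The sketch below is a compact, standalone way to compute it, separate from the `m()` method of the class. `cv_pattern` and `measure` are names chosen for this illustration.

```python
import re

VOWELS = "aeiou"


def cv_pattern(stem: str) -> str:
    """Map each letter to 'c' or 'v'; 'y' counts as a vowel when it follows a consonant."""
    pattern = []
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            pattern.append("v")
        elif ch == "y" and i > 0 and pattern[i - 1] == "c":
            pattern.append("v")
        else:
            pattern.append("c")
    return "".join(pattern)


def measure(stem: str) -> int:
    """Collapse runs of c and v, then count the 'vc' pairs in <c>(vc)^m<v>."""
    collapsed = re.sub(r"(.)\1+", r"\1", cv_pattern(stem))
    return collapsed.count("vc")


for word in ("tree", "by", "trouble", "oats", "troubles", "private"):
    print(word, cv_pattern(word), measure(word))
```

For instance, _tree_ has measure 0, _trouble_ has measure 1 and _private_ has measure 2, matching the `<c><v>`, `<c>vc<v>` and `<c>vcvc<v>` shapes described in the `m()` docstring.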

In [4]:
class PorterStemmer:

    def __init__(self):
        """The word is a buffer holding a word to be stemmed. The letters are in the range
        word[start], word[start + 1], ... ending at word[end]."""

        self.vowels = ('a', 'e', 'i', 'o', 'u')
        self.word = ''
        self.end = 0
        self.start = 0
        self.offset = 0
        self._tokenizer = Tokenizer()

    def is_vowel(self, letter):
        return letter in self.vowels

    def is_consonant(self, index):
        """:returns True if word[index] is a consonant."""
        if self.is_vowel(self.word[index]):
            return False
        if self.word[index] == 'y':
            if index == self.start:
                return True
            else:
                return not self.is_consonant(index - 1)
        return True

    def m(self):
        """m() measures the number of consonant sequences between start and offset.
        if c is a consonant sequence and v a vowel sequence, and <..>
        indicates arbitrary presence,
           <c><v>       gives 0
           <c>vc<v>     gives 1
           <c>vcvc<v>   gives 2
           <c>vcvcvc<v> gives 3
           ....
        """
        n = 0
        i = self.start
        while True:
            if i > self.offset:
                return n
            if not self.is_consonant(i):
                break
            i += 1
        i += 1
        while True:
            while True:
                if i > self.offset:
                    return n
                if self.is_consonant(i):
                    break
                i += 1
            i += 1
            n += 1
            while True:
                if i > self.offset:
                    return n
                if not self.is_consonant(i):
                    break
                i += 1
            i += 1

    def contains_vowel(self):
        """:returns TRUE if the word contains a vowel in the range [start, offset]"""
        for i in range(self.start, self.offset + 1):
            if not self.is_consonant(i):
                return True
        return False

    def contains_double_consonant(self, j):
        """:returns TRUE if word[j - 1] and word[j] are the same consonant (a double consonant)"""
        if j < (self.start + 1):
            return False
        if self.word[j] != self.word[j - 1]:
            return False
        return self.is_consonant(j)

    def is_of_form_cvc(self, i):
        """:returns TRUE if the indices {i-2, i-1, i} have the form consonant - vowel - consonant
        and the second consonant is not w, x or y. This is used when trying to
        restore an e at the end of a short word, e.g.
           cav(e), lov(e), hop(e), crim(e), but
           snow, box, tray.
        """
        if i < (self.start + 2) or not self.is_consonant(i) or self.is_consonant(i - 1) or not self.is_consonant(i - 2):
            return False
        ch = self.word[i]
        if ch == 'w' or ch == 'x' or ch == 'y':
            return False
        return True

    def ends_with(self, s):
        """:returns TRUE when {start...end} ends with the string s."""
        length = len(s)
        if s[length - 1] != self.word[self.end]:  # tiny speed-up
            return False
        if length > (self.end - self.start + 1):
            return False
        if self.word[self.end - length + 1: self.end + 1] != s:
            return False
        self.offset = self.end - length
        return True

    def set_to(self, s):
        """sets [offset + 1, end] to the characters in the string s, readjusting end."""
        length = len(s)
        self.word = self.word[:self.offset + 1] + s + self.word[self.offset + length + 1:]
        self.end = self.offset + length

    def replace_morpheme(self, s):
        """is a mapping function to change morphemes"""
        if self.m() > 0:
            self.set_to(s)

    def remove_plurals(self):
        """This is step 1 ab and gets rid of plurals and -ed or -ing. e.g.
           caresses  ->  caress
           ponies    ->  poni
           ties      ->  ti
           caress    ->  caress
           cats      ->  cat
           feed      ->  feed
           agreed    ->  agree
           disabled  ->  disable
           matting   ->  mat
           mating    ->  mate
           meeting   ->  meet
           milling   ->  mill
           messing   ->  mess
           meetings  ->  meet
        """
        if self.word[self.end] == 's':
            if self.ends_with("sses"):
                self.end = self.end - 2
            elif self.ends_with("ies"):
                self.set_to("i")
            elif self.word[self.end - 1] != 's':
                self.end = self.end - 1
        if self.ends_with("eed"):
            if self.m() > 0:
                self.end = self.end - 1
        elif (self.ends_with("ed") or self.ends_with("ing")) and self.contains_vowel():
            self.end = self.offset
            if self.ends_with("at"):
                self.set_to("ate")
            elif self.ends_with("bl"):
                self.set_to("ble")
            elif self.ends_with("iz"):
                self.set_to("ize")
            elif self.contains_double_consonant(self.end):
                self.end = self.end - 1
                ch = self.word[self.end]
                if ch == 'l' or ch == 's' or ch == 'z':
                    self.end = self.end + 1
            elif self.m() == 1 and self.is_of_form_cvc(self.end):
                self.set_to("e")

    def terminal_y_to_i(self):
        """This defines step 1 c which turns terminal y to i when there is another vowel in the stem."""
        if self.ends_with('y') and self.contains_vowel():
            self.word = self.word[:self.end] + 'i' + self.word[self.end + 1:]

    def map_double_to_single_suffix(self):
        """Defines step 2 and maps double suffices to single ones.
        so -ization ( = -ize plus -ation) maps to -ize etc. note that the
        string before the suffix must give m() > 0.
        """
        if self.word[self.end - 1] == 'a':
            if self.ends_with("ational"):
                self.replace_morpheme("ate")
            elif self.ends_with("tional"):
                self.replace_morpheme("tion")
        elif self.word[self.end - 1] == 'c':
            if self.ends_with("enci"):
                self.replace_morpheme("ence")
            elif self.ends_with("anci"):
                self.replace_morpheme("ance")
        elif self.word[self.end - 1] == 'e':
            if self.ends_with("izer"):
                self.replace_morpheme("ize")
        elif self.word[self.end - 1] == 'l':
            if self.ends_with("bli"):
                self.replace_morpheme("ble")  # --DEPARTURE--
            # To match the published algorithm, replace this phrase with
            #   if self.ends_with("abli"):      self.replace_morpheme("able")
            elif self.ends_with("alli"):
                self.replace_morpheme("al")
            elif self.ends_with("entli"):
                self.replace_morpheme("ent")
            elif self.ends_with("eli"):
                self.replace_morpheme("e")
            elif self.ends_with("ousli"):
                self.replace_morpheme("ous")
        elif self.word[self.end - 1] == 'o':
            if self.ends_with("ization"):
                self.replace_morpheme("ize")
            elif self.ends_with("ation"):
                self.replace_morpheme("ate")
            elif self.ends_with("ator"):
                self.replace_morpheme("ate")
        elif self.word[self.end - 1] == 's':
            if self.ends_with("alism"):
                self.replace_morpheme("al")
            elif self.ends_with("iveness"):
                self.replace_morpheme("ive")
            elif self.ends_with("fulness"):
                self.replace_morpheme("ful")
            elif self.ends_with("ousness"):
                self.replace_morpheme("ous")
        elif self.word[self.end - 1] == 't':
            if self.ends_with("aliti"):
                self.replace_morpheme("al")
            elif self.ends_with("iviti"):
                self.replace_morpheme("ive")
            elif self.ends_with("biliti"):
                self.replace_morpheme("ble")
        elif self.word[self.end - 1] == 'g':
            if self.ends_with("logi"):
                self.replace_morpheme("log")

    def step3(self):
        """step3() deals with -ic-, -full, -ness etc."""
        if self.word[self.end] == 'e':
            if self.ends_with("icate"):
                self.replace_morpheme("ic")
            elif self.ends_with("ative"):
                self.replace_morpheme("")
            elif self.ends_with("alize"):
                self.replace_morpheme("al")
        elif self.word[self.end] == 'i':
            if self.ends_with("iciti"):
                self.replace_morpheme("ic")
        elif self.word[self.end] == 'l':
            if self.ends_with("ical"):
                self.replace_morpheme("ic")
            elif self.ends_with("ful"):
                self.replace_morpheme("")
        elif self.word[self.end] == 's':
            if self.ends_with("ness"):
                self.replace_morpheme("")

    def step4(self):
        """step4() takes off -ant, -ence etc., in context <c>vcvc<v>."""
        if self.word[self.end - 1] == 'a':
            if self.ends_with("al"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'c':
            if self.ends_with("ance"):
                pass
            elif self.ends_with("ence"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'e':
            if self.ends_with("er"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'i':
            if self.ends_with("ic"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'l':
            if self.ends_with("able"):
                pass
            elif self.ends_with("ible"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'n':
            if self.ends_with("ant"):
                pass
            elif self.ends_with("ement"):
                pass
            elif self.ends_with("ment"):
                pass
            elif self.ends_with("ent"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'o':
            if self.ends_with("ion") and (self.word[self.offset] == 's' or self.word[self.offset] == 't'):
                pass
            elif self.ends_with("ou"):
                pass
            # takes care of -ous
            else:
                return
        elif self.word[self.end - 1] == 's':
            if self.ends_with("ism"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 't':
            if self.ends_with("ate"):
                pass
            elif self.ends_with("iti"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'u':
            if self.ends_with("ous"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'v':
            if self.ends_with("ive"):
                pass
            else:
                return
        elif self.word[self.end - 1] == 'z':
            if self.ends_with("ize"):
                pass
            else:
                return
        else:
            return
        if self.m() > 1:
            self.end = self.offset

    def step5(self):
        """step5() removes a final -e if m() > 1, and changes -ll to -l if m > 1."""
        self.offset = self.end
        if self.word[self.end] == 'e':
            a = self.m()
            if a > 1 or (a == 1 and not self.is_of_form_cvc(self.end - 1)):
                self.end = self.end - 1
        if self.word[self.end] == 'l' and self.contains_double_consonant(self.end) and self.m() > 1:
            self.end = self.end - 1

    def stem_document(self, document):
        result = []
        for line in document.split('\n'):
            result.append(self.stem_sentence(line))
        return '\n'.join(result)

    def alphabetic(self, word):
        return ''.join([letter if letter.isalpha() else '' for letter in word])

    def stem_sentence(self, sentence):
        result = []
        for word in self._tokenizer.tokenize(sentence):
            result.append(self.stem_word(word))
        return ' '.join(result)

    def stem_word(self, word):
        if word == '':
            return ''

        self.word = word
        self.end = len(word) - 1
        self.start = 0

        self.remove_plurals()
        self.terminal_y_to_i()
        self.map_double_to_single_suffix()
        self.step3()
        self.step4()
        self.step5()
        return self.word[self.start: self.end + 1]

In [5]:
# test the porter stemmer with any word of your choice
stemmer = PorterStemmer()
print(stemmer.stem_word('running'))

run


In [9]:
# test the porter stemmer with any sentence of your choice
print(stemmer.stem_sentence("the boy's cars are different colors"))

the boi s car ar differ color


## 5. Lemmatization
We now create a class for lemmatization that uses the `Tokenizer` class and wraps NLTK's `WordNetLemmatizer`. An instance of the `Lemmatizer` class will help us reduce words to their __lemmas__.

In [11]:
class Lemmatizer:
    def __init__(self):
        self._lemmatizer = WordNetLemmatizer()
        self._tokenizer = Tokenizer()

    def _tokenize(self, document: str) -> list:
        return self._tokenizer.tokenize(document)

    def lemmatize_word(self, word: str, pos=None) -> str:
        return self._lemmatizer.lemmatize(word, pos) if pos is not None else self._lemmatizer.lemmatize(word)

    def lemmatize_sentence(self, sentence: str, pos=None) -> str:
        # lemmatize_word already handles pos=None, so no branching is needed here
        return ' '.join(self.lemmatize_word(word, pos) for word in self._tokenize(sentence))

    def lemmatize_document(self, document: str) -> str:
        result = []
        for line in document.split('\n'):
            result.append(self.lemmatize_sentence(line))
        return '\n'.join(result)

In [15]:
# test the lemmatizer with any word of your choice
lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize_word('wolves'))

wolf


In [19]:
# try the lemmatization algorithm with a sentence of your choice
print(lemmatizer.lemmatize_sentence('the quick brown fox 🦊 jumps over the lazy dog 🐶'))

the quick brown fox jump over the lazy dog
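
Note that `lemmatize_word` only passes a part of speech to WordNet when `pos` is given; otherwise the `WordNetLemmatizer` treats the token as a noun, so verb forms such as _worked_ can be left unchanged. If tokens were first tagged with Penn Treebank tags (for instance by `nltk.pos_tag`, which is not used in this notebook), a small mapping like the hypothetical `penn_to_wordnet` below could supply the `pos` argument. This is a sketch under that assumption, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun and everything else


# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged = [("worked", "VBD"), ("wolves", "NNS"), ("quick", "JJ"), ("internally", "RB")]
for token, tag in tagged:
    print(token, penn_to_wordnet(tag))
```

Each pair could then be passed on as `lemmatizer.lemmatize_word(token, penn_to_wordnet(tag))`.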


## 6. Loading in the Resume
We now load in our resume and see the sample text in it.

In [20]:
with open('../assets/resume.txt', 'r') as resume_file:
    resume = resume_file.read().lower()
print(resume)

anish sachdeva
software developer + clean code enthusiast

phone : 8287428181
email : anish_@outlook.com
home : sandesh vihar, pitampura, new delhi - 110034
date of birth : 7th april 1998
languages : english, hindi, french

work experience
what after college ( 4 months )
delhi, india
creating content to teach core java and python with data structures and algorithms and giving online classes to students

summer research fellow at university of auckland ( 2 months )
auckland, new zealand
worked on geometry of mobius transformations, differential grometry under dr. pedram hekmati at the department of
mathematics, university of auckland

software developer at cern ( 14 months )
cern, geneva, switzerland
worked in the core platforms team of the fap-bc group. part of an agile team of developers that maintains and adds core
functionality to applications used internally at cern by hr, financial, administrative and other departments
including scientific

worked on legacy applications that compr

## 7. Stemming the Resume
We now stem our resume by applying the `PorterStemmer` algorithm to it and inspect the output.

In [21]:
stemmer = PorterStemmer()
with open('../assets/resume.txt', 'r') as resume_file:
    resume = resume_file.read().lower()

resume_stemmed = stemmer.stem_document(resume)
# save the stemmed resume both as a pickle and as plain text for later analysis
with open('../assets/resume_stemmed.p', 'wb') as pickle_file:
    pickle.dump(obj=resume_stemmed, file=pickle_file)

with open('../assets/resume_stemmed.txt', 'w') as resume_stemmed_file:
    resume_stemmed_file.write(resume_stemmed)

In [22]:
# We now display the stemmed resume
print(resume_stemmed)

anish sachdeva
softwar develop clean code enthusiast

phone 8287428181
email anish_ outlook com
home sandesh vihar pitampura new delhi 110034
date of birth 7th april 1998
languag english hindi french

work experi
what after colleg 4 month
delhi india
creat content to teach core java and python with data structur and algorithm and give onlin class to student

summer research fellow at univers of auckland 2 month
auckland new zealand
work on geometri of mobiu transform differenti grometri under dr pedram hekmati at the depart of
mathemat univers of auckland

softwar develop at cern 14 month
cern geneva switzerland
work in the core platform team of the fap bc group part of an agil team of develop that maintain and add core
function to applic us intern at cern by hr financi administr and other depart
includ scientif

work on legaci applic that compris of singl and some time multipl framework such a java spring boot
hibern and java ee also work with googl polym 1 0 and jsp on the client sid

## 8. Creating the Lemmatized Resume
We now use our `Lemmatizer` class to lemmatize our resume and save it so that we can run analytics on it later on.

In [23]:
lemmatizer = Lemmatizer()
with open('../assets/resume.txt') as resume_file:
    resume = resume_file.read().lower()

resume_lemmatized = lemmatizer.lemmatize_document(resume)
# save the lemmatized resume both as a pickle and as plain text for later analysis
with open('../assets/resume_lemmatized.p', 'wb') as pickle_file:
    pickle.dump(resume_lemmatized, pickle_file)

with open('../assets/resume_lemmatized.txt', 'w') as resume_lemmatized_file:
    resume_lemmatized_file.write(resume_lemmatized)

In [24]:
# displaying lemmatized resume
print(resume_lemmatized)

anish sachdeva
software developer clean code enthusiast

phone 8287428181
email anish_ outlook com
home sandesh vihar pitampura new delhi 110034
date of birth 7th april 1998
language english hindi french

work experience
what after college 4 month
delhi india
creating content to teach core java and python with data structure and algorithm and giving online class to student

summer research fellow at university of auckland 2 month
auckland new zealand
worked on geometry of mobius transformation differential grometry under dr pedram hekmati at the department of
mathematics university of auckland

software developer at cern 14 month
cern geneva switzerland
worked in the core platform team of the fap bc group part of an agile team of developer that maintains and add core
functionality to application used internally at cern by hr financial administrative and other department
including scientific

worked on legacy application that comprise of single and some time multiple framework such a ja

## 9. Analytics
We now run a few basic analytics and compare the outputs of the stemmed and lemmatized resumes with each other and with the original resume.

In [25]:
# We load in the original, stemmed and lemmatized resumes
with open('../assets/resume.txt', 'r') as resume_file:
    resume = resume_file.read().lower()
with open('../assets/resume_stemmed.p', 'rb') as stemmed_file:
    resume_stemmed = pickle.load(stemmed_file)
with open('../assets/resume_lemmatized.p', 'rb') as lemmatized_file:
    resume_lemmatized = pickle.load(lemmatized_file)

In [26]:
# extracting tokens from the original, stemmed and lemmatized outputs
resume_tokens = word_tokenize(resume)
stemmed_resume_tokens = word_tokenize(resume_stemmed)
lemmatized_resume_tokens = word_tokenize(resume_lemmatized)

In [27]:
# Comparing the number of tokens in original, stemmed and lemmatized outputs
print('No. of tokens in Resume:', len(resume_tokens))
print('No. of tokens in Stemmed Resume:', len(stemmed_resume_tokens))
print('No. of tokens in Lemmatized Resume:', len(lemmatized_resume_tokens))

No. of tokens in Resume: 424
No. of tokens in Stemmed Resume: 363
No. of tokens in Lemmatized Resume: 363


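We observe that the stemmed and lemmatized resumes have the same number of tokens (363), which is expected because both processes use the same `Tokenizer` class. The original resume has more tokens (424) mainly because `word_tokenize` also returns punctuation marks as tokens, whereas the `\w+` pattern of our `Tokenizer` drops them before stemming and lemmatization.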
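
The effect of punctuation on the token count can be sketched on a single line of the resume. The second regex below is only a rough stand-in for a tokenizer that emits punctuation as separate tokens; it is an assumption for illustration and does not reproduce `word_tokenize` exactly.

```python
import re

line = "languages : english, hindi, french"

# Word characters only, like the Tokenizer class used before stemming and lemmatization.
word_tokens = re.findall(r"\w+", line)

# Words plus every single non-space, non-word character as its own token.
all_tokens = re.findall(r"\w+|[^\w\s]", line)

print(len(word_tokens), word_tokens)
print(len(all_tokens), all_tokens)
```

On this line the word-only pattern yields 4 tokens, while keeping punctuation yields 7.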

In [28]:
# comparing no. of words and word frequencies in both stemmed and lemmatized outputs
stemmed_resume_frequencies = Counter(stemmed_resume_tokens)
lemmatized_resume_frequencies = Counter(lemmatized_resume_tokens)
print('\nNo. of unique tokens/words in the stemmed output:', len(stemmed_resume_frequencies))
print('No. of unique tokens/words in the lemmatized output:', len(lemmatized_resume_frequencies))


No. of unique tokens/words in the stemmed output: 220
No. of unique tokens/words in the lemmatized output: 229


The stemmed output has fewer unique tokens (220) than the lemmatized output (229), but the difference is small. If the purpose of our task is to reduce the number of distinct tokens in the corpus, stemming is definitely a way to go, but lemmatization achieves a similar reduction.
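
To put the two counts side by side, the short calculation below uses only the numbers printed above. The ratio of unique tokens to total tokens is one simple way to compare them; the variable names are chosen for this illustration.

```python
total_tokens = 363        # tokens in both the stemmed and the lemmatized resume
stemmed_types = 220       # unique tokens after stemming
lemmatized_types = 229    # unique tokens after lemmatization

# Share of tokens that are unique in each output.
stemmed_ratio = stemmed_types / total_tokens
lemmatized_ratio = lemmatized_types / total_tokens

# How much smaller the stemmed vocabulary is relative to the lemmatized one.
extra_reduction = (lemmatized_types - stemmed_types) / lemmatized_types

print(f"{stemmed_ratio:.1%} {lemmatized_ratio:.1%} {extra_reduction:.1%}")
# 60.6% 63.1% 3.9%
```

Stemming therefore shrinks the vocabulary by roughly 4% more than lemmatization on this resume.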

In [30]:
# seeing the top 30 most common words in the stemmed and lemmatized outputs
print('\nTop 30 most common words/tokens in the stemmed output:\n', stemmed_resume_frequencies.most_common(30))
print('\nTop 30 most common words/tokens in the lemmatized output:\n', lemmatized_resume_frequencies.most_common(30))


Top 30 most common words/tokens in the stemmed output:
 [('and', 16), ('of', 12), ('work', 7), ('java', 7), ('the', 6), ('with', 5), ('on', 5), ('develop', 4), ('com', 4), ('delhi', 4), ('month', 4), ('to', 4), ('core', 4), ('data', 4), ('structur', 4), ('algorithm', 4), ('at', 4), ('univers', 4), ('cern', 4), ('in', 4), ('comput', 4), ('code', 3), ('creat', 3), ('teach', 3), ('auckland', 3), ('mathemat', 3), ('that', 3), ('applic', 3), ('http', 3), ('softwar', 2)]

Top 30 most common words/tokens in the lemmatized output:
 [('and', 16), ('of', 12), ('java', 7), ('worked', 6), ('the', 6), ('with', 5), ('on', 5), ('com', 4), ('delhi', 4), ('month', 4), ('to', 4), ('core', 4), ('data', 4), ('structure', 4), ('algorithm', 4), ('at', 4), ('university', 4), ('cern', 4), ('in', 4), ('developer', 3), ('auckland', 3), ('mathematics', 3), ('that', 3), ('application', 3), ('computer', 3), ('http', 3), ('software', 2), ('new', 2), ('english', 2), ('4', 2)]


With stemming, words are reduced to their stems (e.g. _structur_, _univers_), whereas in the lemmatized output words stay closer to their original form and are much clearer in meaning (e.g. _structure_, _university_). In the lemmatized output we also see the number __4__ among the most frequently occurring tokens.
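
The two lists differ in entries such as `('work', 7)` versus `('worked', 6)`, which suggests that stemming merges several surface forms into one entry. One way to inspect such merges is to group tokens by their normalized form. In the sketch below, `group_by_normal_form` and `strip_suffix` are hypothetical helpers, and `strip_suffix` is only a toy stand-in for a real stemmer.

```python
from collections import defaultdict


def group_by_normal_form(tokens, normalize):
    """Group surface forms by their normalized form, keeping only real merges."""
    groups = defaultdict(set)
    for token in tokens:
        groups[normalize(token)].add(token)
    return {form: sorted(forms) for form, forms in groups.items() if len(forms) > 1}


def strip_suffix(token):
    """Toy normalizer: drops a trailing 'ing', 'ed' or 's' from longer tokens."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


tokens = ["work", "worked", "working", "months", "month", "java"]
print(group_by_normal_form(tokens, strip_suffix))
# {'work': ['work', 'worked', 'working'], 'month': ['month', 'months']}
```

With the objects of this notebook, the same helper could be called as `group_by_normal_form(resume_tokens, stemmer.stem_word)` or with `lemmatizer.lemmatize_word` to compare what each method merges.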

We now introduce a helper method that tags each token in the corpus with its corresponding part-of-speech (POS) tag. The WordNet POS tags are of the following types:

- Noun (n)
- Verb (v)
- Adjective (a)
- Adverb (r)
- Adjective satellite (s)

By tagging the original, stemmed and lemmatized resumes we can check whether the frequency of POS tags is preserved. This in turn indicates whether the meaning or context of our words has been retained despite the pre-processing steps of stemming or lemmatization.

In [31]:
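def get_pos_frequency(tokens: list) -> Counter:
    """Counts POS tags, taking the POS of the first WordNet synset of each token.
    Tokens that have no synset in WordNet are skipped."""
    synsets = [wordnet.synsets(token) for token in tokens]
    pos_tags = []
    for synset in synsets:
        if isinstance(synset, list) and len(synset) > 0:
            pos_tags.append(synset[0].pos())
    return Counter(pos_tags)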


# Analyzing the frequency of POS tags in the original, stemmed and lemmatized resumes
resume_pos_frequency = get_pos_frequency(resume_tokens)
stemmed_resume_pos_frequency = get_pos_frequency(stemmed_resume_tokens)
lemmatized_resume_pos_frequency = get_pos_frequency(lemmatized_resume_tokens)

print('\nResume POS Tags Frequency:', resume_pos_frequency)
print('Stemmed Resume POS Tags Frequency:', stemmed_resume_pos_frequency)
print('Lemmatized Resume POS Tags Frequency:', lemmatized_resume_pos_frequency)


Resume POS Tags Frequency: Counter({'n': 202, 'v': 20, 'a': 18, 's': 10, 'r': 4})
Stemmed Resume POS Tags Frequency: Counter({'n': 161, 'a': 11, 'v': 10, 's': 7, 'r': 4})
Lemmatized Resume POS Tags Frequency: Counter({'n': 214, 'v': 20, 'a': 18, 's': 11, 'r': 5})


We observe that the number of nouns (202 $\rightarrow$ 214) and adverbs (4 $\rightarrow$ 5) increases after the lemmatization step, while the number of nouns clearly decreases after stemming (202 $\rightarrow$ 161). So, if in our application the user wishes to search for proper nouns and obtain only results that match exactly, like searching for __java__ or __python__ in a resume, lemmatization will give the better result.
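
Because the number of tokens that receive a tag differs between the three versions, it can also help to compare shares rather than raw counts. The sketch below reuses the counters printed above; `pos_shares` is a hypothetical helper named for this illustration.

```python
from collections import Counter

# POS counts copied from the output above.
pos_counts = {
    "original": Counter({'n': 202, 'v': 20, 'a': 18, 's': 10, 'r': 4}),
    "stemmed": Counter({'n': 161, 'a': 11, 'v': 10, 's': 7, 'r': 4}),
    "lemmatized": Counter({'n': 214, 'v': 20, 'a': 18, 's': 11, 'r': 5}),
}


def pos_shares(counts: Counter) -> dict:
    """Convert raw POS counts into shares of all tagged tokens."""
    total = sum(counts.values())
    return {pos: round(count / total, 3) for pos, count in counts.items()}


for name, counts in pos_counts.items():
    print(name, sum(counts.values()), pos_shares(counts))
```

The totals are 254, 193 and 268 tagged tokens respectively, so the stemmed resume loses about a quarter of the tokens that WordNet could tag in the original, while the lemmatized resume gains a few.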