In [43]:
import re
import nltk
import spacy
import string
import numpy as np
import pandas as pd
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [44]:
nltk.download("stopwords")
nltk.download('punkt_tab')
nltk.download("wordnet")

nltk.download('averaged_perceptron_tagger_eng') # For POS tagging

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [54]:
# Extra domain-specific stop words on top of NLTK's English list
english_stopset = set(stopwords.words('english')).union(
    {"things", "that's", "something", "take", "don't", "may", "want", "you're",
     "set", "might", "says", "including", "lot", "much", "said", "know",
     "good", "step", "often", "going", "thing", "think",
     "back", "actually", "better", "look", "find", "right", "example",
     "verb", "verbs"})
# This reassignment replaces the extended set above, so only NLTK's default
# English stop words reach the vectorizer
english_stopset = list(stopwords.words('english'))

In [46]:
docs = ['i loved you ethiopian, stored elements in Compress find Sparse Ethiopia is the greatest country in the world of nation at universe',

        'also, sometimes, the same words can have multiple different ‘lemma’s. So, based on the context it’s used, you should identify the \
        part-of-speech (POS) tag for the word in that specific context and extract the appropriate lemma. Examples of implementing this comes \
        in the following sections countries.ethiopia With a planned.The name that the Blue Nile river loved took in Ethiopia is derived from the \
        Geez word for great to imply its being the river of rivers The word Abay still exists in ethiopia major languages',

        'With more than  million people, ethiopia is the second most populous nation in Africa after Nigeria, and the fastest growing \
         economy in the region. However, it is also one of the poorest, with a per capita income',

        'The primary purpose of the dam ethiopia is electricity production to relieve Ethiopia’s acute energy shortage and for electricity export to neighboring\
         countries.ethiopia With a planned.',

        'The name that the Blue Nile river loved takes in Ethiopia "abay" is derived from the Geez blue loved word for great to imply its being the river of rivers The \
         word Abay still exists in Ethiopia major languages to refer to anything or anyone considered to be superior.',

        'Two non-upgraded loved turbine-generators with MW each are the first loveto go into operation with loved MW delivered to the national power grid. This early power\
         generation will start well before the completion']

title = ['Two upgraded', 'Loved Turbine-Generators', 'Operation With Loved', 'National', 'Power Grid', 'Generator']

keywords = ['two','non','loved','ethiopia','operation','grid','power','fight','survive']  # keywords could also be generated from the articles with spaCy

## Issue 1: Title Wasn't Being Preprocessed
- Moved the preprocessing logic into a reusable `preprocess` function and applied it to both `docs` and `title`, so the cleaning steps are defined once (DRY: Don't Repeat Yourself).

In [47]:
# Preprocessing function
def preprocess(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # Remove non-ASCII characters
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)  # Remove punctuation
    text = re.sub(r'[0-9]', '', text)  # Remove numbers
    text = re.sub(r'\s{2,}', ' ', text)  # Remove extra spaces
    return text.strip()

documents_clean = [preprocess(doc) for doc in docs]
documents_cleant = [preprocess(doc) for doc in title]

print(documents_clean)
print(documents_cleant)


['i loved you ethiopian stored elements in compress find sparse ethiopia is the greatest country in the world of nation at universe', 'also sometimes the same words can have multiple different lemma s so based on the context it s used you should identify the part of speech pos tag for the word in that specific context and extract the appropriate lemma examples of implementing this comes in the following sections countries ethiopia with a planned the name that the blue nile river loved took in ethiopia is derived from the geez word for great to imply its being the river of rivers the word abay still exists in ethiopia major languages', 'with more than million people ethiopia is the second most populous nation in africa after nigeria and the fastest growing economy in the region however it is also one of the poorest with a per capita income', 'the primary purpose of the dam ethiopia is electricity production to relieve ethiopia s acute energy shortage and for electricity export to neighb
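To see what each step of `preprocess` contributes, here is a small walk-through that applies the same regular expressions one at a time. The sample string is made up for illustration and is not part of the corpus. It also shows why fragments such as "it s" and "lemma s" appear in the cleaned output: the curly apostrophe is non-ASCII, so it is replaced by a space.

```python
import re
import string

# Hypothetical input, chosen to exercise every step
sample = "Hello @user, it’s 2024!! Ethiopia’s GDP"

# The same substitutions as preprocess(), in the same order
steps = [
    ("non-ASCII -> space", lambda s: re.sub(r"[^\x00-\x7F]+", " ", s)),
    ("drop @mentions", lambda s: re.sub(r"@\w+", "", s)),
    ("lowercase", str.lower),
    ("punctuation -> space", lambda s: re.sub(r"[%s]" % re.escape(string.punctuation), " ", s)),
    ("drop digits", lambda s: re.sub(r"[0-9]", "", s)),
    ("collapse whitespace", lambda s: re.sub(r"\s{2,}", " ", s).strip()),
]

text = sample
for name, step in steps:
    text = step(text)
    print(f"{name:22} {text!r}")

# Final value of text: 'hello it s ethiopia s gdp'
```

Mentions are removed before punctuation is stripped; in the opposite order the `@` would already be gone and the user name would stay in the text.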

## Issue 2: Lemmatizer Was Limited
- The lemmatizer handled only a small subset of words (for example, "elements" became "element") and left others, such as words ending in "-ed", unchanged. The cause: `WordNetLemmatizer.lemmatize` treats every word as a noun unless it is told the word's part of speech (noun, verb, adjective, etc.).

- Resolved the issue by following the article [Lemmatization Examples in Python](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/).

- Added a reusable function, `get_wordnet_pos`, which tags a word with `nltk.pos_tag` and maps the resulting part-of-speech (POS) tag to the format `WordNetLemmatizer` understands (e.g., 'n' for noun, 'v' for verb).

- Updated the lemmatizer call to include the POS argument: `lemmer.lemmatize(word, get_wordnet_pos(word))`.

- The outputs before and after the update are printed below for comparison.
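
As a quick illustration of the tag mapping, the sketch below reproduces only the lookup step, without calling NLTK. `penn_to_wordnet` is a hypothetical helper name, and the tags are typed in by hand rather than produced by `nltk.pos_tag`; the one-letter codes are the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`.

```python
# WordNet's single-character POS codes, keyed by the first letter of a Penn Treebank tag
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code, defaulting to noun."""
    return WORDNET_POS.get(penn_tag[:1].upper(), "n")


# Hand-written example tags: past-tense verb, plural noun, superlative adjective,
# adverb, and a preposition (which falls back to noun)
for tag in ["VBD", "NNS", "JJS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Only the first letter of the tag matters, so every verb form (VB, VBD, VBG, ...) maps to 'v'. Any tag outside the four groups falls back to noun, which matches the lemmatizer's own default.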


In [63]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)


lemmer=WordNetLemmatizer()

# Previous approach: no POS argument; the cleaned text contains no commas, so each document stays one unchanged chunk
new_docs_prev = [' '.join([lemmer.lemmatize(chunk) for chunk in text.split(',')]) for text in documents_clean]
# Updated approach: lemmatize word by word, passing the POS from get_wordnet_pos
new_docs = [' '.join([lemmer.lemmatize(word, get_wordnet_pos(word)) for word in text.split()]) for text in documents_clean]
print(f"Previous: {new_docs_prev}")
print(f"Updated: {new_docs}\n")

# Titles: the same POS-aware lemmatization, applied to the preprocessed titles
titles = [' '.join([lemmer.lemmatize(word, get_wordnet_pos(word)) for word in t.split()]) for t in documents_cleant]
print(f"Previous: {title}")
print(f"Updated: {titles}")


Previous: ['i loved you ethiopian stored elements in compress find sparse ethiopia is the greatest country in the world of nation at universe', 'also sometimes the same words can have multiple different lemma s so based on the context it s used you should identify the part of speech pos tag for the word in that specific context and extract the appropriate lemma examples of implementing this comes in the following sections countries ethiopia with a planned the name that the blue nile river loved took in ethiopia is derived from the geez word for great to imply its being the river of rivers the word abay still exists in ethiopia major languages', 'with more than million people ethiopia is the second most populous nation in africa after nigeria and the fastest growing economy in the region however it is also one of the poorest with a per capita income', 'the primary purpose of the dam ethiopia is electricity production to relieve ethiopia s acute energy shortage and for electricity export

In [69]:
vectorizer = TfidfVectorizer(analyzer='word',
                              ngram_range=(1, 2),
                              min_df=0.002,
                              max_df=0.99,
                              max_features=10000,
                              lowercase=True,
                              stop_words=english_stopset)

In [70]:
X = vectorizer.fit_transform(new_docs)

In [71]:
# Create a DataFrame
df = pd.DataFrame(X.T.toarray())
print(df.head(10))
print(df.shape)

     0         1         2         3         4    5
0  0.0  0.083112  0.000000  0.000000  0.229908  0.0
1  0.0  0.000000  0.000000  0.000000  0.140185  0.0
2  0.0  0.083112  0.000000  0.000000  0.114954  0.0
3  0.0  0.000000  0.000000  0.174451  0.000000  0.0
4  0.0  0.000000  0.000000  0.174451  0.000000  0.0
5  0.0  0.000000  0.167583  0.000000  0.000000  0.0
6  0.0  0.000000  0.167583  0.000000  0.000000  0.0
7  0.0  0.083112  0.137421  0.000000  0.000000  0.0
8  0.0  0.000000  0.167583  0.000000  0.000000  0.0
9  0.0  0.101354  0.000000  0.000000  0.000000  0.0
(224, 6)


## Issue 3: It Finally Works for "love" but Stopped Working for "loved"
- Resolved by lemmatizing the query with the same POS-aware approach before passing it to `get_similar_articles`. Applying `lemmer.lemmatize(lemma_ops, get_wordnet_pos(lemma_ops))` to each query token determines the word type (e.g., verb or noun) and picks the matching lemma, so "love" and "loved" both reduce to "love" and return the same results.

In [81]:
def get_similar_articles(q, t, df):
  print("Done Searching. Full Result: \n")
  print("searched items : ", q)
  print("Article with the Highest Cosine Similarity Values: ")
  top_results = 5

  # Vectorize the body query (q) and the title query (t) with the fitted vectorizer
  q_vec = vectorizer.transform([q]).toarray().reshape(df.shape[0],)
  q_vect = vectorizer.transform([t]).toarray().reshape(df.shape[0],)

  def cosine(a, b):
    # Cosine similarity: dot product divided by the product of both norms
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / denom if denom else 0.0

  sim = {}
  titl = {}
  for i in range(min(len(new_docs), len(titles))):
    # sklearn.metrics.pairwise.cosine_similarity would give the same values
    sim[i] = cosine(df.loc[:, i].values, q_vec)
    titl[i] = cosine(df.loc[:, i].values, q_vect)

  sim_sorted = sorted(sim.items(), key=lambda x: x[1], reverse=True)[:min(len(sim), top_results)]
  sim_sortedt = sorted(titl.items(), key=lambda x: x[1], reverse=True)[:min(len(titl), top_results)]

  # Results are ranked by the title-query scores; the calls below pass the same
  # string for q and t, so sim_sorted holds the same ranking
  for i, v in sim_sortedt:    # Print the articles and their similarity values
    if v != 0.0:
      print("Similaritas score: ", v)
      print(titles[i])
      print(new_docs[i])
      print('\n')
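
A note on the similarity formula: cosine similarity divides the dot product by the product of both norms. Writing it as `dot / norm(a) * norm(b)` without parentheses multiplies by the second norm instead of dividing by it. The two forms agree only when that norm is 1, which is the case for `TfidfVectorizer` output under its default `norm='l2'`. The sketch below uses plain Python and made-up vectors to show the difference; the function names are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product over the product of norms; 0.0 if either vector is all zeros."""
    denom = norm(a) * norm(b)
    return dot(a, b) / denom if denom else 0.0


# Made-up vectors that are deliberately not unit length
doc = [3.0, 4.0]    # norm 5
query = [0.0, 2.0]  # norm 2

correct = cosine_similarity(doc, query)                      # 8 / (5 * 2)
unparenthesized = dot(doc, query) / norm(doc) * norm(query)  # (8 / 5) * 2

print("with parentheses:   ", correct)
print("without parentheses:", unparenthesized)
```

With these vectors the parenthesized form stays within [-1, 1], while the unparenthesized one exceeds 1, which is a quick sanity check when a score looks suspicious.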


In [91]:

lemma_ops = 'loved'
list1 = nltk.word_tokenize(lemma_ops)
q1 = ' '.join([lemmer.lemmatize(lemma_ops, get_wordnet_pos(lemma_ops)) for lemma_ops in list1])

get_similar_articles(q1,q1, df)
print('-'*100)

Done Searching. Full Result: 

searched items :  love
Article with the Highest Cosine Similarity Values: 
Similaritas score:  0.17053622712109406
generator
two non upgraded love turbine generator with mw each be the first loveto go into operation with love mw deliver to the national power grid this early power generation will start well before the completion


Similaritas score:  0.16633214688113884
power grid
the name that the blue nile river love take in ethiopia abay be derive from the geez blue love word for great to imply it be the river of river the word abay still exists in ethiopia major language to refer to anything or anyone consider to be superior


Similaritas score:  0.12578354491916732
two upgraded
i love you ethiopian store element in compress find sparse ethiopia be the great country in the world of nation at universe


Similaritas score:  0.06012924359789893
love turbine generator
also sometimes the same word can have multiple different lemma s so base on the context i

In [92]:

lemma_ops = 'love'
list1 = nltk.word_tokenize(lemma_ops)
q1 = ' '.join([lemmer.lemmatize(lemma_ops, get_wordnet_pos(lemma_ops)) for lemma_ops in list1])

get_similar_articles(q1,q1, df)
print("-" * 100)

Done Searching. Full Result: 

searched items :  love
Article with the Highest Cosine Similarity Values: 
Similaritas score:  0.17053622712109406
generator
two non upgraded love turbine generator with mw each be the first loveto go into operation with love mw deliver to the national power grid this early power generation will start well before the completion


Similaritas score:  0.16633214688113884
power grid
the name that the blue nile river love take in ethiopia abay be derive from the geez blue love word for great to imply it be the river of river the word abay still exists in ethiopia major language to refer to anything or anyone consider to be superior


Similaritas score:  0.12578354491916732
two upgraded
i love you ethiopian store element in compress find sparse ethiopia be the great country in the world of nation at universe


Similaritas score:  0.06012924359789893
love turbine generator
also sometimes the same word can have multiple different lemma s so base on the context i

In [93]:

lemma_ops = 'electrical productions'
list1 = nltk.word_tokenize(lemma_ops)
q1 = ' '.join([lemmer.lemmatize(lemma_ops, get_wordnet_pos(lemma_ops)) for lemma_ops in list1])

get_similar_articles(q1,q1, df)
print('-' * 100)

Done Searching. Full Result: 

searched items :  electrical production
Article with the Highest Cosine Similarity Values: 
Similaritas score:  0.17445107707418728
national
the primary purpose of the dam ethiopia be electricity production to relieve ethiopia s acute energy shortage and for electricity export to neighbor country ethiopia with a plan


----------------------------------------------------------------------------------------------------
