# NDS similarity metrics

## Objective
* Compare two NDS (News Dataset) embeddings using similarity metrics

## Scope
* Read data from Kaggle
* Embed the data with TF-IDF and CountVectorizer (bag of words)
* Query the NDS embeddings to find the documents with the highest similarity

In [1]:
import pandas as pd

In [2]:
data_news_bbc= pd.read_csv("bbc_news.csv").sample(n=5000)

In [3]:
data_news_bbc.head()

|       | title | pubDate | guid | link | description |
|------:|:------|:--------|:-----|:-----|:------------|
| 15496 | The Papers: 'You're nicked' and 'no new smart ... | Wed, 05 Apr 2023 23:33:31 GMT | https://www.bbc.co.uk/news/blogs-the-papers-65... | https://www.bbc.co.uk/news/blogs-the-papers-65... | Thursday's papers feature the former SNP CEO's... |
| 30337 | How much will the 2p National Insurance cut sa... | Wed, 06 Mar 2024 16:09:51 GMT | https://www.bbc.co.uk/news/explainers-63635185#5 | https://www.bbc.co.uk/news/explainers-63635185 | The government has announced a further 2p cut ... |
| 21537 | 23-year-old rescue dog celebrates year in new ... | Fri, 01 Sep 2023 14:01:45 GMT | https://www.bbc.co.uk/news/uk-wales-66684282 | https://www.bbc.co.uk/news/uk-wales-66684282?a... | Dogs Trust says 23-year-old Ty is a rare but r... |
| 25228 | Russia seeks extremist label for LGBT movement | Fri, 17 Nov 2023 16:35:36 GMT | https://www.bbc.co.uk/news/world-europe-67454386 | https://www.bbc.co.uk/news/world-europe-674543... | The measure could leave any LGBT activist in t... |
| 6912 | Africa's week in pictures: 12-18 August 2022 | Thu, 18 Aug 2022 23:38:35 GMT | https://www.bbc.co.uk/news/world-africa-62588364 | https://www.bbc.co.uk/news/world-africa-625883... | A selection of the best photos from across Afr... |


In [57]:
data_news_bbc["description"].sample(n=1).to_list()

['The black, ballerina-length velvet evening dress sold for 11 times its estimated price.']

In [4]:
data_news_bbc = data_news_bbc[["title","description"]]

## Pipeline Data preprocessing

In [5]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stp = stopwords.words('english')
stemmer = PorterStemmer()


def caseFolding(text):
    """Lowercase the text."""
    return text.lower()


def punc_removal(text):
    """Replace every non-letter character (digits, punctuation) with a space."""
    return re.sub(r"[^a-zA-Z]", " ", text)


def stemsWords(text):
    # Note: the whole string is handed to the stemmer as a single "word",
    # so in practice only the end of the string (the last word) is stemmed.
    return stemmer.stem(text)


def flatten(l):
    """
    Join a list of token lists into one space-separated string.

    Example: [["a", "b"], ["c"]] -> "a b c"
    """
    return " ".join([item for sublist in l for item in sublist])


def remove_stop_words(text):
    """Drop the NLTK English stop words plus a few custom ones."""
    stopWords_add = ['us', 'like', 'ur', 'gt']
    stopWords_combine = stopWords_add + stp
    kept_words = [word for word in text.split() if word not in stopWords_combine]
    return flatten([kept_words])


def preprocessing_text(text):
    """Run the full pipeline: lowercase, strip non-letters, drop stop words, stem."""
    text = caseFolding(text)
    text = punc_removal(text)
    text = remove_stop_words(text)
    text = stemsWords(text)
    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Reza\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
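
Because `stemsWords` passes the whole cleaned string to `stemmer.stem`, the stemmer only sees the end of that string, which matches cleaned samples in which only the last word looks stemmed. A minimal per-token sketch is given here for comparison. The names `stem_each_word` and `toy_stem` are hypothetical, and `toy_stem` is only a stand-in suffix stripper so that the block runs without NLTK; in this notebook one would presumably pass `stemmer.stem` as `stem_fn` instead.

```python
def toy_stem(word):
    """Hypothetical stand-in for stemmer.stem: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_each_word(text, stem_fn=toy_stem):
    """Apply stem_fn to every whitespace-separated token, then re-join."""
    return " ".join(stem_fn(token) for token in text.split())


sample = "kyiv refuses say many soldiers"
whole = toy_stem(sample)            # stems the string as one "word"
per_token = stem_each_word(sample)  # stems each token separately
print(whole)
print(per_token)
```

Switching to per-token stemming would change the vocabulary and therefore the scores reported in the rest of the notebook, so the outputs here should not be expected to match after such a change.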


# Similarity metrics

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

## NDS bag of words: data_news_bbc

In [7]:
data_news_bbc["description_clean"] = data_news_bbc["description"].apply(preprocessing_text) 
documents = data_news_bbc["description_clean"]
BagOfWord = CountVectorizer()
embed_bagofword_data_news_bbc = BagOfWord.fit_transform(documents)
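
As a rough illustration of what the bag-of-words step produces, the sketch below builds a vocabulary and a document-term count matrix in plain Python. `bag_of_words` and `toy_docs` are hypothetical names, and the tokenisation is a simple whitespace split, which is an assumption: `CountVectorizer` by default lowercases and keeps only tokens of two or more alphanumeric characters.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for whitespace-tokenised documents."""
    vocabulary = sorted({token for doc in documents for token in doc.split()})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc.split()).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix


toy_docs = ["cold weather health alerts", "health service health plan"]
vocab, counts = bag_of_words(toy_docs)
print(vocab)
print(counts)
```

Each row is one document and each column one vocabulary term, which is the same layout as the sparse matrix returned by `fit_transform`.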

In [59]:
print(data_news_bbc["description_clean"].sample(n=1).to_list())

['kyiv refuses say many soldiers still trapped saying information sensit']


In [21]:
doc_term_matrix = embed_bagofword_data_news_bbc.todense()
df = pd.DataFrame(
   doc_term_matrix,
   columns=BagOfWord.get_feature_names_out(),
   index=data_news_bbc["title"].to_list()
)

In [22]:
df.head()

Unnamed: 0,aaron,ab,abandon,abandoned,abandoning,abbas,abbess,abbey,abbie,abby,...,zinchenko,zoe,zon,zone,zonzolo,zoo,zookeepers,zoom,zoysa,zuckerberg
The Papers: 'You're nicked' and 'no new smart motorways',0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
How much will the 2p National Insurance cut save me?,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23-year-old rescue dog celebrates year in new home,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Russia seeks extremist label for LGBT movement,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Africa's week in pictures: 12-18 August 2022,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## NDS TF-IDF: data_news_bbc

In [14]:
data_news_bbc["description_clean"] = data_news_bbc["description"].apply(preprocessing_text) 
documents = data_news_bbc["description_clean"]
tfidf = TfidfVectorizer()
embed_tfidf_data_news_bbc = tfidf.fit_transform(documents)
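
For intuition, here is a plain-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term frequency times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. `tfidf_rows` and `toy_docs` are hypothetical names, and whitespace tokenisation is an assumption made to keep the example short.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """TF-IDF with smoothed IDF and L2-normalised rows, as {term: weight} dicts."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        # Fall back to 1.0 so that an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_docs = ["cold weather health alerts", "health service plan"]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

In this toy corpus `health` occurs in both documents, so it receives a lower weight than `cold`, which occurs in only one; that down-weighting of common terms is the main difference from the raw counts of the bag-of-words matrix.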

In [15]:
doc_term_matrix = embed_tfidf_data_news_bbc.todense()
df = pd.DataFrame(
   doc_term_matrix,
   columns=tfidf.get_feature_names_out(),
   index=data_news_bbc["title"].to_list()
)

In [16]:
df.head()

Unnamed: 0,aaron,ab,abandon,abandoned,abandoning,abbas,abbess,abbey,abbie,abby,...,zinchenko,zoe,zon,zone,zonzolo,zoo,zookeepers,zoom,zoysa,zuckerberg
The Papers: 'You're nicked' and 'no new smart motorways',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
How much will the 2p National Insurance cut save me?,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23-year-old rescue dog celebrates year in new home,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Russia seeks extremist label for LGBT movement,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Africa's week in pictures: 12-18 August 2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Comparison

In [29]:
query = "health ukranian"

## Bag of words

In [24]:
doc_term_matrix = embed_bagofword_data_news_bbc.todense()
df_bag_of_word = pd.DataFrame(
   doc_term_matrix,
   columns=BagOfWord.get_feature_names_out(),
   index=data_news_bbc["title"].to_list()
)

In [26]:
df_bag_of_word.head()

Unnamed: 0,aaron,ab,abandon,abandoned,abandoning,abbas,abbess,abbey,abbie,abby,...,zinchenko,zoe,zon,zone,zonzolo,zoo,zookeepers,zoom,zoysa,zuckerberg
The Papers: 'You're nicked' and 'no new smart motorways',0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
How much will the 2p National Insurance cut save me?,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23-year-old rescue dog celebrates year in new home,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Russia seeks extremist label for LGBT movement,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Africa's week in pictures: 12-18 August 2022,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
query_vector = BagOfWord.transform([query])
# Calculate cosine similarity between the query and all documents
cosine_similarities_bagOfWord = cosine_similarity(query_vector, embed_bagofword_data_news_bbc).flatten()
data_bagOfWord = pd.DataFrame(
   cosine_similarities_bagOfWord,
   columns=[query],
   index=data_news_bbc["title"].to_list()
) 
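
The cosine score computed in the previous cell can be reproduced by hand on a toy corpus. The sketch below assumes whitespace tokenisation and drops query terms that are not in the vocabulary, which is what `transform` does with unseen words; `cosine_scores` and `toy_docs` are hypothetical names. If the query term `ukranian` is absent from the fitted vocabulary (an assumption, since the notebook does not check this), the query reduces to the single term `health`, and the score for a document is the count of `health` divided by the norm of the document vector. Scores such as 0.377964 (about 1/√7) and 0.353553 (about 1/√8) in the bag-of-words results are consistent with that reading.

```python
import math
from collections import Counter


def cosine_scores(query, documents):
    """Cosine similarity between a query and each document on raw term counts."""
    vocabulary = {t for doc in documents for t in doc.split()}
    # Out-of-vocabulary query terms are ignored, as a fitted vectorizer would do.
    q = Counter(t for t in query.split() if t in vocabulary)
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    scores = []
    for doc in documents:
        d = Counter(doc.split())
        norm_d = math.sqrt(sum(c * c for c in d.values()))
        dot = sum(q[t] * d[t] for t in q)
        scores.append(dot / (norm_q * norm_d) if norm_q and norm_d else 0.0)
    return scores


toy_docs = [
    "health alerts issued cold weather across england",  # 7 distinct terms
    "dog celebrates year new home",                      # no query term
]
scores = cosine_scores("health ukranian", toy_docs)
print([round(s, 6) for s in scores])
```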

In [39]:
result_score_bag_of_word=data_bagOfWord.sort_values(by=query,ascending=False).head()

In [40]:
result_score_bag_of_word

|                                                                |   health ukranian |
|:---------------------------------------------------------------|------------------:|
| Cold weather: How do cold-health alerts work?                  |          0.5547   |
| New treatment for migraine attacks on NHS to benefit thousands |          0.377964 |
| What is GDP and how does it affect me?                         |          0.377964 |
| Head teacher apologises after pupils hurt in crush             |          0.353553 |
| Strep A symptoms: What you need to know, in a minute           |          0.353553 |


## TF-IDF

In [27]:
doc_term_matrix = embed_tfidf_data_news_bbc.todense()
df_tfidf = pd.DataFrame(
   doc_term_matrix,
   columns=tfidf.get_feature_names_out(),
   index=data_news_bbc["title"].to_list()
)

In [28]:
df_tfidf.head()

Unnamed: 0,aaron,ab,abandon,abandoned,abandoning,abbas,abbess,abbey,abbie,abby,...,zinchenko,zoe,zon,zone,zonzolo,zoo,zookeepers,zoom,zoysa,zuckerberg
The Papers: 'You're nicked' and 'no new smart motorways',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
How much will the 2p National Insurance cut save me?,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23-year-old rescue dog celebrates year in new home,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Russia seeks extremist label for LGBT movement,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Africa's week in pictures: 12-18 August 2022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
query_vector = tfidf.transform([query])

In [31]:
# Calculate cosine similarity between the query and all documents
cosine_similarities_tfidf = cosine_similarity(query_vector, embed_tfidf_data_news_bbc).flatten()
data_score_tfidf = pd.DataFrame(
   cosine_similarities_tfidf,
   columns=[query],
   index=data_news_bbc["title"].to_list()
) 

In [37]:
result_score_tfidf = data_score_tfidf.sort_values(by=query,ascending=False).head()

In [38]:
result_score_tfidf 

|                                                        |   health ukranian |
|:-------------------------------------------------------|------------------:|
| Cold weather: How do cold-health alerts work?          |          0.487961 |
| Matt Hancock paid £320K for I'm a Celebrity appearance |          0.318486 |
| England appoints ambassador to shake up women's health |          0.309159 |
| Head teacher apologises after pupils hurt in crush     |          0.3019   |
| Strep A symptoms: What you need to know, in a minute   |          0.301159 |


In [76]:
print(result_score_tfidf.to_markdown())

|                                                        |   health ukranian |
|:-------------------------------------------------------|------------------:|
| Cold weather: How do cold-health alerts work?          |          0.487961 |
| Matt Hancock paid £320K for I'm a Celebrity appearance |          0.318486 |
| England appoints ambassador to shake up women's health |          0.309159 |
| Head teacher apologises after pupils hurt in crush     |          0.3019   |
| Strep A symptoms: What you need to know, in a minute   |          0.301159 |


# Conclusion

In [51]:
# Compare the highest cosine score returned by each embedding
top_tfidf = result_score_tfidf["health ukranian"].iloc[0]
top_bow = result_score_bag_of_word["health ukranian"].iloc[0]

if top_tfidf > top_bow:
    print("TF-IDF is higher")
    print(top_tfidf)
else:
    print("BOW is higher")
    print(top_bow)

BOW is higher
0.5547001962252291
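
The two top scores come from differently weighted vector spaces, so a complementary check is how far the two rankings agree. The sketch below is one possible way to do that, not part of the original analysis: it copies the top-5 titles from the two result tables above and measures their overlap. The names `bow_top5`, `tfidf_top5`, `shared`, and `jaccard` are hypothetical.

```python
# Top-5 titles copied from the bag-of-words and TF-IDF result tables.
bow_top5 = [
    "Cold weather: How do cold-health alerts work?",
    "New treatment for migraine attacks on NHS to benefit thousands",
    "What is GDP and how does it affect me?",
    "Head teacher apologises after pupils hurt in crush",
    "Strep A symptoms: What you need to know, in a minute",
]
tfidf_top5 = [
    "Cold weather: How do cold-health alerts work?",
    "Matt Hancock paid £320K for I'm a Celebrity appearance",
    "England appoints ambassador to shake up women's health",
    "Head teacher apologises after pupils hurt in crush",
    "Strep A symptoms: What you need to know, in a minute",
]

# Titles returned by both embeddings, in bag-of-words rank order.
shared = [title for title in bow_top5 if title in set(tfidf_top5)]
# Jaccard overlap: shared titles divided by all distinct titles.
jaccard = len(shared) / len(set(bow_top5) | set(tfidf_top5))

print(shared)
print(round(jaccard, 3))
```

Both embeddings rank the same article first and share three of their five top results for this query.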
