[TF-IDF — Term Frequency-Inverse Document Frequency](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/)



In [18]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity
import math
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [19]:
# Dataset: 20 Newsgroups, with headers, footers and quoted replies stripped
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = newsgroups.data

In [20]:
# Text cleaning
def clean_text(text):
    """Replace every character that is neither alphanumeric nor whitespace with a space."""
    return ''.join(char if char.isalnum() or char.isspace() else ' ' for char in text)

# Build the stop word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Tokenization and stop word removal
def tokenize_and_remove_stopwords(text):
    """Tokenize with NLTK, lowercase each token, and drop English stop words."""
    words = (word.lower() for word in nltk.word_tokenize(text))
    return [word for word in words if word not in STOP_WORDS]
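As a quick illustration of the cleaning step, the sketch below runs `clean_text` on a made-up sentence. The function is repeated so the snippet runs on its own, and `sample` is a hypothetical input, not a line from the dataset. Because every non-alphanumeric character becomes a space, contractions and decimals are split apart.

```python
def clean_text(text):
    # Same logic as the notebook's clean_text: keep letters, digits and
    # whitespace, and turn everything else into a single space.
    cleaned_chars = [char if char.isalnum() or char.isspace() else ' ' for char in text]
    return ''.join(cleaned_chars)

sample = "Hello, world! It's 3.5GHz."   # hypothetical input for illustration
cleaned = clean_text(sample)
tokens = cleaned.split()

print(repr(cleaned))
print(tokens)
```

The apostrophe in "It's" and the dot in "3.5GHz" are both replaced, which is why stray tokens such as `'s'`, `'u'` or `'c'` show up in the cleaned documents.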




In [21]:
cleaned_documents = [tokenize_and_remove_stopwords(clean_text(doc)) for doc in documents]

In [22]:
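# Re-join the tokens, because TfidfVectorizer expects one string per document
documents_str = [' '.join(doc) for doc in cleaned_documents]

# dtm is a sparse document-term matrix with one TF-IDF row per document.
# Note: the default token pattern drops single-character tokens such as 'c' or '1'.
vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(documents_str)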
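To make the weighting concrete, here is a minimal pure-Python sketch of the smoothed TF-IDF variant that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The helper `tfidf_vectors` and the toy corpus are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (idf, vectors) for a list of token lists, using smoothed idf and L2 norm."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Empty documents yield an empty vector instead of dividing by zero
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return idf, vectors

# Hypothetical toy corpus, already tokenized
toy_corpus = [
    ["pens", "beat", "devils", "pens"],
    ["video", "card", "bus"],
    ["pens", "video"],
]
toy_idf, toy_vectors = tfidf_vectors(toy_corpus)

for term, weight in sorted(toy_vectors[0].items()):
    print(term, round(weight, 4))
```

A term that occurs in few documents gets a larger idf, and a term repeated inside one document gets a larger tf, so "pens" outweighs "beat" in the first toy document even though "pens" is the more common term across the corpus.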


In [23]:
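print("Cleaned Documents:")

# Note: printing every document overflows the notebook's IOPub data rate limit,
# so the output is truncated. Slice cleaned_documents (e.g. [:5]) to preview a few.
for i, doc in enumerate(cleaned_documents, 1):
    print("Document", i, ":", doc)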


Cleaned Documents:
Document 1 : ['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'however', 'going', 'put', 'end', 'non', 'pittsburghers', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats', 'also', 'lot', 'fo', 'fun', 'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'games', 'since', 'pens', 'going', 'beat', 'pulp', 'jersey', 'anyway', 'disappointed', 'see', 'islanders', 'lose', 'final', 'regular', 'season', 'game', 'pens', 'rule']
Document 2 : ['brother', 'market', 'high', 'performance', 'video', 'card', 'supports', 'vesa', 'local', 'bus', '1', '2mb', 'ram', 'anyone', 'suggestions', 'ideas', 'diamond', 'stealth', 'pro', 'local', 'bus', 'orchid', 'farenheit', '1280', 'ati', 'graphics', 'ultra', 'pro', 'high', 'performance', 'vlb', 'card', '

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



 : ['native', 'american', 'girlfriend', 'asks', 'government', 'really', 'care', 'hill', 'beans', 'religion', 'come', 'still', 'busting', 'us', 'oregon', 'washington', 'places', 'christian', 'u', 'army', 'marched', 'church', 'gunpoint']
Document 2947 : ['similar', 'note', 'good', 'friend', 'mine', 'worked', 'clerk', 'chain', 'bookstore', 'several', 'peers', 'amazing', 'one', 'woman', 'particular', 'customer', 'asked', 'autobiography', 'benjamin', 'franklin', 'first', 'question', 'still', 'alive', 'fiction', 'non', 'fiction', 'finally', 'friend', 'intervened', 'showed', 'guy', 'makes', 'one', 'wonder', 'standards', 'employment']
Document 2948 : ['improper', 'etiquette', 'illegal', 'people', 'responsible', 'junk', 'mailings', 'evil']
Document 2949 : []
Document 2950 : ['sean', '68070', 'exists', 'sean', 'want', 'get', 'mini', 'war', 'going', 'say', 'little', 'bit', 'skeptic', 'performance', 'claiming', 'centris', 'see', 'please', 'flames', 'reserve', 'c', 'chicago', 'last', 'consumer', 'e


8314 : ['opening', 'game', 'effect', 'maybe', 'pros', 'arrive', 'late', 'nervousness', 'rookie', 'wc', 'players', 'problems', 'get', 'lines', 'clicking', 'may', 'make', 'things', 'hard', 'get', 'going', 'worse', 'nations', 'guess', 'better', 'team', 'face', 'opening', 'game', 'better', 'since', 'chances', 'upset', 'greater', 'reasons', 'worse', 'teams', 'tough', 'beat', 'presented', 'hans', 'virus', 'lindberg', 'former', 'coach', 'switzerland', '1', 'worse', 'teams', 'referring', 'france', 'switzerland', 'austria', 'italy', 'etc', 'usually', 'world', 'class', 'goalies', '2', 'defensive', 'play', 'become', 'much', 'disciplined', 'take', 'much', 'less', 'unnecessary', 'penalties', '3', 'use', 'four', 'lines', 'makes', 'harder', 'make', 'run', 'gas', '4', 'ice', 'quality', 'german', 'wc', 'rinks', 'poor', 'another', 'weird', 'thing', 'czechs', 'played', 'entertaining', 'hockey', 'err', 'kidding', 'david', 'alex', 'new', 'name', 'ok', 'forgot', 'czech', 'roster', 'home', 'yesterday', 'know


: ['payne', 'crl', 'dec', 'com', 'andrew', 'payne', 'message', 'id', '1993apr20', '004418', '11548', 'crl', 'dec', 'com', 'organization', 'dec', 'cambridge', 'research', 'lab', 'date', 'tue', '20', 'apr', '1993', '00', '44', '18', 'gmt', 'anyone', 'know', 'source', 'tcm3105', 'modem', 'chips', 'used', 'baycom', 'pmp', 'modems', 'ideally', 'something', 'geared', 'toward', 'hobbyists', 'small', 'quantity', 'mail', 'order', 'etc', 'years', 'buying', 'distributor', 'marshall', 'hundreds', 'pmp', 'kits', 'orders', 'dropped', 'point', 'longer', 'afford', 'offer', 'service', 'distributors', 'checked', 'crazy', 'minimum', 'order', '100', 'like', 'find', 'source', 'still', 'interested', 'building', 'pmp', 'kits', 'suggestions', 'andrew', 'c', 'payne', 'dec', 'cambridge', 'research', 'lab', 'r110b', 'wnet', 'hal', '9000']
Document 10716 : ['ever', 'noticed', 'hockey', 'player', 'interviewed', 'periods', 'tv', 'game', 'usually', 'get', 'goal', 'assist', 'explain', 'usually', 'talk', 'stars', 'reg


Document 16943 : ['hmmm', 'gave', 'two', 'examples', 'matched', 'objective', 'criteria', 'response', 'subjective', 'claptrap', 'lame', 'never', 'counter', 'fact', 'examples', 'fit', 'objective', 'criteria', 'one', 'wonders', 'playing', 'semantic', 'games', 'rick', 'schaut', 'uucp', 'uunet', 'uw', 'beaver', 'microsoft', 'richs']
Document 16944 : ['looking', 'c', 'itoh', 'printer', 'driver', 'windows', '3', '1', 'anybody', 'happen', 'know', 'could', 'find', 'beast', 'thanks', 'advance', 'jerry']
Document 16945 : ['go', '39', 'lincoln', 'continental', 'could', 'find', 'one', 'sad', 'part', 'edsel', 'ford', 'designed', 'look', 'abortion', 'named', 'justice']
Document 16946 : ['proventil', 'inhaler', 'asthma', 'relief', 'fall', 'steroid', 'nonsteroid', 'category', 'looking', 'product', 'literature', 'clear']
Document 16947 : ['apollo', 'done', 'hard', 'way', 'big', 'hurry', 'limited', 'technology', 'base', 'government', 'contracts', 'privately', 'rather', 'government', 'project', 'cuts', 'c

In [24]:
print("Vocabulary:")
print()
print(len(vectorizer.get_feature_names_out()))

Vocabulary:

129906


In [25]:

# Cosine similarity between two single-row vectors
def calculate_cosine_similarity(vector1, vector2):
    return cosine_similarity(vector1.reshape(1, -1), vector2.reshape(1, -1))[0][0]

# Document similarity search
def document_similarity_search(input_document, top_n=5):
    # Preprocess the query the same way as the corpus before vectorizing
    query = ' '.join(tokenize_and_remove_stopwords(clean_text(input_document)))
    input_vector = vectorizer.transform([query])
    # Score the query against every row of the document-term matrix
    similarities = [calculate_cosine_similarity(input_vector, doc_vector) for doc_vector in dtm]
    ranked_indices = sorted(range(len(similarities)), key=similarities.__getitem__, reverse=True)

    # Return the top N (original document, similarity) pairs
    return [(documents[i], similarities[i]) for i in ranked_indices[:top_n]]
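The search above scores the query against one row of the matrix at a time, which means one `cosine_similarity` call per document. A possible alternative, sketched here under the assumption that the whole matrix fits in memory, is to pass the full matrix in a single call and rank with `numpy.argsort`. The helper `top_similar` and the toy corpus are hypothetical names used only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_similar(query, docs, doc_matrix, fitted_vectorizer, top_n=5):
    """Rank docs by cosine similarity to query using one vectorized call."""
    query_vector = fitted_vectorizer.transform([query])
    # Shape (1, n_docs) -> flat array with one score per document
    scores = cosine_similarity(query_vector, doc_matrix).ravel()
    # Stable sort on negated scores: highest first, ties keep corpus order
    order = np.argsort(-scores, kind="stable")[:top_n]
    return [(docs[i], float(scores[i])) for i in order]

# Hypothetical toy corpus standing in for the newsgroup documents
toy_docs = ["life is good", "get a life", "video card performance"]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

toy_results = top_similar("Life", toy_docs, toy_matrix, toy_vectorizer, top_n=2)
for rank, (text, score) in enumerate(toy_results, 1):
    print(rank, "Similarity", round(score, 4), "|", text)
```

Shorter documents that contain the query term tend to rank higher, because L2 normalization concentrates their weight on fewer terms. The same effect is visible in the real results, where one-line posts about "life" come first.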



In [26]:
input_doc = "Life"
similar_documents = document_similarity_search(input_doc)

for i, (doc, similarity) in enumerate(similar_documents, 1):
    print(i, 'Similarity', similarity)
    print(doc)
    print()

1 Similarity 0.7281291629864991






People get a life !!!!!!!!!!

2 Similarity 0.6144503163753218
                                                       

Since it is a Life Time membership, you won't have to worry about it
until your next life.


3 Similarity 0.5828044515795875
Please get a REAL life.



4 Similarity 0.5068872862885634

YOUR sex life, maybe....

5 Similarity 0.46488023468733886



This was all badly reported in the news.  There is no evidence that
signs of life found in old rock predate putative planet-sterilizing
events.  Rather, the argument was that if life arose shortly the last
sterilizing event, then it must be easily formed.  The *inference*
was that life originated before and was destroyed, but there was
no evidence of that.

However, even this argument is flawed.  It could well be that origin of
life requires specific conditions (say, a certain composition of the
atmosphere) that do not last for long.  So, perhaps life formed
early only because i