[TF-IDF — Term Frequency-Inverse Document Frequency](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/)



In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity
import math
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
# Dataset: 20 Newsgroups, with headers, footers, and quoted replies stripped
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents = newsgroups.data

In [3]:
# Text cleaning

def clean_text(text):
    cleaned_chars = [char if char.isalnum() or char.isspace() else ' ' for char in text]
    cleaned_text = ''.join(cleaned_chars)
    return cleaned_text

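# Tokenization and stop word removal
# Build the stop word set once rather than on every call
stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    # Lowercase each token, then drop English stop words
    words = [word.lower() for word in nltk.word_tokenize(text)]
    words = [word for word in words if word not in stop_words]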
    return words
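
To see what the two helpers do to a single string, here is a minimal sketch on a made-up sentence. It reuses the same character filter as `clean_text`, but swaps `nltk.word_tokenize` for `str.split` and uses a tiny hard-coded stop word set (`SAMPLE_STOP_WORDS`, a hypothetical stand-in for NLTK's English list) so that it runs without any downloads. Output from the real NLTK pipeline may differ.

```python
def clean_text(text):
    # Replace every character that is not alphanumeric or whitespace with a space
    return ''.join(ch if ch.isalnum() or ch.isspace() else ' ' for ch in text)

# Assumption: a tiny illustrative stop word set, not NLTK's full English list
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "to", "in"}

def simple_tokenize_and_remove_stopwords(text):
    # Assumption: whitespace splitting stands in for nltk.word_tokenize
    words = [word.lower() for word in text.split()]
    return [word for word in words if word not in SAMPLE_STOP_WORDS]

sample = "The Pens' win is a relief, isn't it?"
cleaned = clean_text(sample)
tokens = simple_tokenize_and_remove_stopwords(cleaned)
print(cleaned)
print(tokens)
```

Because punctuation is replaced before tokenization, the contraction "isn't" is split into `isn` and `t` in this sketch. Whether such fragments are later removed depends on the stop word list in use.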




In [4]:
cleaned_documents = [tokenize_and_remove_stopwords(clean_text(doc)) for doc in documents]

In [5]:
documents_str = [' '.join(doc) for doc in cleaned_documents]

vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(documents_str)
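
As a rough illustration of what `fit_transform` computes, the sketch below works through TF-IDF by hand on a three-document toy corpus. It assumes the vectorizer's default settings as I understand them: raw term counts, the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each document row. The corpus and the helper name `tfidf_rows` are made up for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and tokenized
toy_docs = [
    ["pens", "beat", "devils"],
    ["pens", "pens", "rule"],
    ["video", "card", "bus"],
]

def tfidf_rows(docs):
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf_rows(toy_docs)
for i, row in enumerate(rows, 1):
    print("Document", i, ":", {term: round(w, 3) for term, w in row.items()})
```

A term that appears in more documents ("pens") gets a lower IDF than a term confined to one document ("beat"), so it weighs less within the first document even though both occur once there.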


In [6]:
print("Cleaned Documents:")

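# Printing every document overruns the notebook's output rate limit;
# slice cleaned_documents to preview only a few.
for i, doc in enumerate(cleaned_documents, 1):
    print("Document", i, ":", doc)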


Cleaned Documents:
Document 1 : ['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'however', 'going', 'put', 'end', 'non', 'pittsburghers', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats', 'also', 'lot', 'fo', 'fun', 'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'games', 'since', 'pens', 'going', 'beat', 'pulp', 'jersey', 'anyway', 'disappointed', 'see', 'islanders', 'lose', 'final', 'regular', 'season', 'game', 'pens', 'rule']
Document 2 : ['brother', 'market', 'high', 'performance', 'video', 'card', 'supports', 'vesa', 'local', 'bus', '1', '2mb', 'ram', 'anyone', 'suggestions', 'ideas', 'diamond', 'stealth', 'pro', 'local', 'bus', 'orchid', 'farenheit', '1280', 'ati', 'graphics', 'ultra', 'pro', 'high', 'performance', 'vlb', 'card', '

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



 5360 : ['4', 'com', 'port', 'boards', 'available', 'pcs', 'want', 'standard', 'com', 'ports', 'need', 'mention', 'expensive', 'coprocessed', 'ones', 'either', 'able', 'share', 'irqs', 'able', 'use', 'irqs', '8', '15']
Document 5361 : ['anyone', 'know', 'program', 'record', 'keyboard', 'sequences', 'windowed', 'dos', 'box', 'would', 'like', 'something', 'starts', 'telnet', 'program', 'logs', 'accounts', 'windows', 'recorder', 'seem', 'able', 'record', 'key', 'sequences']
Document 5362 : ['origional', 'bit', 'missing', 'long', 'short', 'follows', 'origional', 'poster', 'asked', 'could', 'use', 'old', 'vga', 'svga', 'monitor', 'centris', 'hence', 'title', 'answer', 'ot', 'question', 'unqualified', 'yes', 'use', 'old', 'vga', 'svga', 'monitor', 'centris', 'need', 'adaptor', 'use', 'mac', 'vga', 'q', 'james', 'engineering', '510', '525', '7350', 'run', 'two', 'machines', 'adaptor', 'mentioned', 'convert', 'centris', 'three', 'row', 'vga', 'svga', '25', 'pin', 'adaptor', 'monitor', 'special




 : ['dean', 'velasco', 'quoted', 'letter', 'james', 'stowell', 'president', 'moody', 'bible', 'institute', 'lot', 'discussion', 'far', 'nobody', 'seems', 'hit', 'exactly', 'criticism', 'arrogance', 'aimed', 'arrogance', 'attacked', 'think', 'ones', 'know', 'absolutes', 'short', 'many', 'evangelicals', 'claim', 'infallible', 'matter', 'religious', 'texts', 'particular', 'problem', 'one', 'epistemology', 'shorthand', 'think', 'epistemology', 'know', 'question', 'turns', 'troubling', 'one', 'problem', 'absolute', 'certainty', 'bottom', 'least', 'thinking', 'goes', 'inside', 'head', 'unless', 'certain', 'everything', 'happens', 'head', 'infallible', 'reasoning', 'discover', 'source', 'truth', 'question', 'means', 'absolute', 'justification', 'source', 'authority', 'means', 'absolute', 'certainty', 'let', 'take', 'specific', 'example', 'biblical', 'inerrancy', 'fictional', 'inerrantist', 'named', 'zeke', 'following', 'arguments', 'applies', 'idea', 'papal', 'infallibility', 'zeke', 'presume




In [7]:
print("Vocabulary:")
print()
print(len(vectorizer.get_feature_names_out()))

Vocabulary:

129906


In [8]:

# Cosine similarity between two single-row vectors
def calculate_cosine_similarity(vector1, vector2):
    return cosine_similarity(vector1.reshape(1, -1), vector2.reshape(1, -1))[0][0]

# Document similarity search: score every corpus document against the query
def document_similarity_search(input_document, top_n=5):
    input_vector = vectorizer.transform([input_document])
    similarities = [calculate_cosine_similarity(input_vector, doc_vector) for doc_vector in dtm]
    ranked_indices = sorted(range(len(similarities)), key=similarities.__getitem__, reverse=True)

    # Return top N similar documents
    result = [(documents[i], similarities[i]) for i in ranked_indices[:top_n]]
    return result
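
The search above ranks documents by cosine similarity, cos(a, b) = a·b / (‖a‖ ‖b‖). The following minimal sketch spells that out on sparse vectors stored as plain dictionaries and picks the top matches with `heapq.nlargest`. The vectors and the names `cosine` and `top_matches` are invented for illustration and are not part of the notebook.

```python
import heapq
import math

def cosine(a, b):
    # Dot product over the terms the two sparse vectors share
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # Treat an all-zero vector as having no similarity to anything
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_matches(query, doc_vectors, top_n=2):
    scores = [(cosine(query, vec), idx) for idx, vec in enumerate(doc_vectors)]
    # nlargest avoids sorting the full list when only top_n results are needed
    return heapq.nlargest(top_n, scores)

# Hypothetical term-weight vectors
doc_vectors = [
    {"college": 1.0, "hockey": 1.0},
    {"video": 1.0, "card": 1.0},
    {"college": 2.0},
]
query = {"college": 1.0}

results = top_matches(query, doc_vectors)
for score, idx in results:
    print(idx, round(score, 3))
```

The third vector scores highest because cosine similarity ignores vector length and only compares direction. With scikit-learn, `cosine_similarity(input_vector, dtm)` accepts the whole document-term matrix at once and should return a 1 × n_documents array, which avoids the per-row Python loop in `document_similarity_search`.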



In [10]:
input_doc = "College"
similar_documents = document_similarity_search(input_doc)

for i, (doc, similarity) in enumerate(similar_documents, 1):
    print(i,'Similarity',similarity)
    print(doc)
    print()

1 Similarity 0.36001385451023843

Ask me whether I'm surprised that you haven't managed to waddle out of
college after all this time.


2 Similarity 0.26326039070130103
Hello,

I am planning on attending Podiatry School next year.

I have narrowed my choices to the Pennsylvania College of Podiatric
Medicine, in Philadelphia, or the California College of Podiatric
Medicine in San Francisco.  

If anyone has any information or oppinions about these two schools, please
tell me.  I am having a hard time deciding which one to attend, and must
make a decision very soon.  

thank you, Larry

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Live From New York, It's SATURDAY NIGHT...

3 Similarity 0.2398779889319091
Apparently, the only place to take the MSF course around
here in NC is at a community college.

That woudl preclude some sort of state
subsidation, then, no?


4 Similarity 0.23621977831442106
Could someone please post the rosters for the College Hockey All-Star ga