## **TF-IDF**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import numpy as np
import pandas as pd

In [4]:
data = pd.read_csv("/content/drive/MyDrive/india-news-headlines.csv")

In [5]:
data

Unnamed: 0,publish_date,headline_category,headline_text
0,20010102,unknown,Status quo will not be disturbed at Ayodhya; s...
1,20010102,unknown,Fissures in Hurriyat over Pak visit
2,20010102,unknown,America's unwanted heading for India?
3,20010102,unknown,For bigwigs; it is destination Goa
4,20010102,unknown,Extra buses to clear tourist traffic
...,...,...,...
3650965,20220331,city.srinagar,J&K sacks 2 cops; 3 other employees over terro...
3650966,20220331,entertainment.hindi.bollywood,Ranbir Kapoor says 'Rishi Kapoor enjoyed his a...
3650967,20220331,city.trichy,As Covid-19 cases drop to nil in southern dist...
3650968,20220331,city.erode,Tamil Nadu sees marginal rise of Covid cases w...


## **Data Preprocessing**

In [6]:
data.head()

Unnamed: 0,publish_date,headline_category,headline_text
0,20010102,unknown,Status quo will not be disturbed at Ayodhya; s...
1,20010102,unknown,Fissures in Hurriyat over Pak visit
2,20010102,unknown,America's unwanted heading for India?
3,20010102,unknown,For bigwigs; it is destination Goa
4,20010102,unknown,Extra buses to clear tourist traffic


In [7]:
data.tail()

Unnamed: 0,publish_date,headline_category,headline_text
3650965,20220331,city.srinagar,J&K sacks 2 cops; 3 other employees over terro...
3650966,20220331,entertainment.hindi.bollywood,Ranbir Kapoor says 'Rishi Kapoor enjoyed his a...
3650967,20220331,city.trichy,As Covid-19 cases drop to nil in southern dist...
3650968,20220331,city.erode,Tamil Nadu sees marginal rise of Covid cases w...
3650969,20220331,city.salem,Tamil Nadu sees marginal rise of Covid cases w...


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3650970 entries, 0 to 3650969
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   publish_date       int64 
 1   headline_category  object
 2   headline_text      object
dtypes: int64(1), object(2)
memory usage: 83.6+ MB


In [9]:
data.describe()

Unnamed: 0,publish_date
count,3650970.0
mean,20131930.0
std,52522.82
min,20010100.0
25%,20100420.0
50%,20140230.0
75%,20170930.0
max,20220330.0


In [10]:
data.isnull().sum()

publish_date         0
headline_category    0
headline_text        0
dtype: int64

In [11]:
data.duplicated().sum()

24860

In [12]:
data.drop_duplicates(inplace=True)

In [13]:
data.duplicated().sum()

0
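
One side effect of `drop_duplicates` is worth keeping in mind: the surviving rows keep their original index labels, so the index has gaps afterwards and row positions no longer match row labels. The sketch below shows this on a small made-up frame (the `toy` data is hypothetical, not taken from the headlines file) and resets the index as one possible remedy.

```python
import pandas as pd

# Hypothetical toy frame with one duplicated row
toy = pd.DataFrame({"headline_text": ["a", "b", "b", "c"]})

deduped = toy.drop_duplicates()
print(list(deduped.index))      # original labels are kept, so label 2 is missing

# Renumber the rows 0..n-1 so that labels and positions agree again
reindexed = deduped.reset_index(drop=True)
print(list(reindexed.index))
```

Whether to reset the index is a design choice. Leaving the gaps is fine as long as later lookups by position use `.iloc` rather than plain `[]`.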

### **Finding the most relevant terms in the first headline using TF-IDF scores**

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
# Select the headline text and create a TF-IDF vectorizer with default settings
txt_content = data['headline_text']
vec = TfidfVectorizer()

In [16]:
tfidf=vec.fit_transform(txt_content)

In [17]:
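# TF-IDF weights of the first headline, one row per vocabulary term
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
df = pd.DataFrame(tfidf[0].T.todense(), index=vec.get_feature_names_out(), columns=["TF-IDF"])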



In [18]:
df = df.sort_values('TF-IDF', ascending=False)

In [19]:
df

Unnamed: 0,TF-IDF
disturbed,0.460244
quo,0.429967
vajpayee,0.377465
ayodhya,0.359033
status,0.329715
...,...
forevers,0.000000
forewarn,0.000000
forewarned,0.000000
foreword,0.000000
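
To make the scores above less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings, as far as the scikit-learn documentation describes it: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The corpus and the helper `tfidf_row` are hypothetical, and tokenisation is simplified to whitespace splitting, so the numbers will not match the notebook's output.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the headlines dataset
docs = [
    "status quo at ayodhya",
    "extra buses for tourist traffic",
    "tourist traffic at goa",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

row0 = tfidf_row(tokenized[0])
for term, weight in sorted(row0.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:10s} {weight:.4f}")
```

In this toy corpus "at" occurs in two documents and therefore gets a lower weight than "status", "quo" and "ayodhya", which occur in only one. The same effect is why rare words such as "disturbed" rank highest for the first headline.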


### **Ranking the news headlines against a user query using TF-IDF and cosine similarity**

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
#doc2_tfidf=vec.transform(["Status quo will not be disturbed at Ayodhya"])
# calculate the cosine similarity between the documents
#sim = cosine_similarity(tfidf, doc2_tfidf).flatten()
#print(sim)

In [22]:
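# Get the user query and project it into the fitted TF-IDF space
query = input("Enter your query: ")
doc2_tfidf = vec.transform([query])
sim = cosine_similarity(tfidf, doc2_tfidf).flatten()
print(sim)

# Sort by cosine similarity, highest first; the slice [:-5:-1] keeps the top four
related_headlines_indices = sim.argsort()[:-5:-1]
print("Top related headlines:")
for i in related_headlines_indices:
    # argsort returns row positions, but drop_duplicates left gaps in the index labels,
    # so txt_content[i] would look up by label and can return an unrelated headline
    print(txt_content.iloc[i])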


Enter your query: Status quo will not be disturbed at Ayodhya
[0.8992015 0.        0.        ... 0.        0.        0.       ]
Top related headlines:
Status quo will not be disturbed at Ayodhya; says Vajpayee
Blair pledges undying troth to the Iraqi people
Educated and aware but still suffering
Guess where Veerappan keeps his money?
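
When only a few headlines share a term with the query, most similarities are exactly zero, and a plain top-k slice then pads the result with arbitrary unrelated rows. The sketch below shows one way to keep only positive matches. The helper `top_k_matches` and the scores are hypothetical and not part of the notebook.

```python
import numpy as np

def top_k_matches(sim, k=5):
    """Return positions of the k highest positive similarities, best first."""
    sim = np.asarray(sim, dtype=float)
    # Descending order; the stable sort keeps ties in their original order
    order = np.argsort(-sim, kind="stable")[:k]
    # Drop rows that share no term with the query
    return order[sim[order] > 0]

# Hypothetical similarity scores for six headlines
scores = np.array([0.90, 0.0, 0.0, 0.35, 0.0, 0.10])
print(top_k_matches(scores, k=5))
```

The returned values are row positions, so with a pandas Series of headlines they would be passed to `.iloc` rather than used as labels.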
