# TfidfVectorizer Explanation
`TfidfVectorizer` converts a collection of raw documents to a matrix of TF-IDF features.

In TF-IDF, TF stands for term frequency and IDF for inverse document frequency.

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ['Hello Varun Singh here, I love machine learning','Welcome to the Machine learning hub' ]

In [33]:
vect = TfidfVectorizer()

In [34]:
vect.fit(text)

| Parameter | Value |
|---|---|
| input | 'content' |
| encoding | 'utf-8' |
| decode_error | 'strict' |
| strip_accents | None |
| lowercase | True |
| preprocessor | None |
| tokenizer | None |
| analyzer | 'word' |
| stop_words | None |
| token_pattern | `'(?u)\\b\\w\\w+\\b'` |


In [35]:
# TF counts how often each word occurs in a document; IDF down-weights words that occur in many documents
print(vect.idf_)

[1.40546511 1.40546511 1.40546511 1.         1.40546511 1.
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511]


In [36]:
print(vect.vocabulary_)

{'hello': 0, 'varun': 9, 'singh': 6, 'here': 1, 'love': 4, 'machine': 5, 'learning': 3, 'welcome': 10, 'to': 8, 'the': 7, 'hub': 2}


### A word that is present in every document gets a low IDF value. Words that occur in few documents get the highest IDF values, so they stand out.
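
The two distinct values printed by `vect.idf_` can be reproduced by hand. The sketch below assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents that contain term t. `smoothed_idf` is a helper name introduced here for illustration only.

```python
import math

n_documents = 2  # the two sentences in `text`


def smoothed_idf(document_frequency, n_docs=n_documents):
    # Assumed default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + document_frequency)) + 1


idf_one_doc = smoothed_idf(1)    # e.g. 'hello', found in one sentence
idf_both_docs = smoothed_idf(2)  # 'machine' and 'learning', found in both

print(round(idf_one_doc, 8), round(idf_both_docs, 8))
```

Words that occur in only one sentence get the higher value, while `machine` and `learning` get the minimum of 1.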

In [37]:
example = text[0]
example

'Hello Varun Singh here, I love machine learning'

In [38]:
example = vect.transform([example])
print(example.toarray())

[[0.4078241  0.4078241  0.         0.29017021 0.4078241  0.29017021
  0.4078241  0.         0.         0.4078241  0.        ]]


### Here, a 0 marks each vocabulary index whose word does not occur in the given sentence.
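
The non-zero weights can be checked by hand too. This sketch assumes the vectorizer's defaults: every row is L2-normalised (`norm='l2'`), and the default token pattern keeps only tokens of two or more characters, so the single-letter `I` never enters the vocabulary. The IDF values are copied from the `vect.idf_` output; the variable names are illustrative.

```python
import math

idf_unique = 1.40546511  # hello, varun, singh, here, love
idf_shared = 1.0         # machine, learning

# Every word occurs once in the sentence, so the raw tf-idf equals the idf
raw = [idf_unique] * 5 + [idf_shared] * 2

# Divide by the Euclidean length of the row (L2 normalisation)
norm = math.sqrt(sum(value * value for value in raw))
weights = [value / norm for value in raw]

print(round(weights[0], 4), round(weights[-1], 4))
```

The two results correspond to the two distinct non-zero entries in the transformed row.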

## PassiveAggressiveClassifier

### Passive: if the prediction is correct, keep the model unchanged. Aggressive: if the prediction is wrong, update the model to correct for the misclassified example.

Passive-Aggressive algorithms are generally used for large-scale learning. They belong to the small family of online-learning algorithms. In online learning, the input data arrives sequentially and the model is updated one example at a time, as opposed to batch learning, where the entire training dataset is used at once. This is useful when the dataset is so large that training on all of it at once is computationally infeasible. Put simply, an online-learning algorithm receives a training example, updates the classifier, and then discards the example.
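
A single update step can be sketched in plain Python. This is a minimal illustration of the PA-I variant, which scikit-learn documents as equivalent to `loss='hinge'`; it is not scikit-learn's implementation. The function name `pa_update` and the toy vectors are assumptions made for this example. Note that in this formulation the model stays passive only when the example is classified correctly with a margin of at least 1.

```python
def pa_update(weights, features, label, C=1.0):
    """One PA-I step for a linear model; label must be +1 or -1."""
    margin = label * sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return weights  # passive: the margin is already large enough
    squared_norm = sum(x * x for x in features)
    tau = min(C, loss / squared_norm)  # aggressive: step size capped by C
    return [w + tau * label * x for w, x in zip(weights, features)]


w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)  # margin 0, loss 1: the weights move
w = pa_update(w, [2.0, 0.0], +1)  # margin 2, loss 0: the weights stay
print(w)
```

The example is discarded after each call, which is what makes the algorithm suitable for streaming data.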

## Let's start the work

In [39]:
import os
os.chdir(r"C:\Users\varun\Code\FakeNewsDetection1")

In [40]:
import pandas as pd

In [41]:
dataframe = pd.read_csv('cleaned_IFND.csv')
dataframe.head()

| Unnamed: 0 | id | Statement | Image | Web | Category | Date | Label |
|---|---|---|---|---|---|---|---|
| 0 | 2 | who praises india's aarogya setu app, says it ... | https://cdn.dnaindia.com/sites/default/files/s... | DNAINDIA | COVID-19 | Oct-20 | REAL |
| 1 | 3 | in delhi, deputy us secretary of state stephen... | https://cdn.dnaindia.com/sites/default/files/s... | DNAINDIA | VIOLENCE | Oct-20 | REAL |
| 2 | 4 | lac tensions: china's strategy behind delibera... | https://cdn.dnaindia.com/sites/default/files/s... | DNAINDIA | TERROR | Oct-20 | REAL |
| 3 | 5 | india has signed 250 documents on space cooper... | https://cdn.dnaindia.com/sites/default/files/s... | DNAINDIA | COVID-19 | Oct-20 | REAL |
| 4 | 6 | tamil nadu chief minister's mother passes away... | https://cdn.dnaindia.com/sites/default/files/s... | DNAINDIA | ELECTION | Oct-20 | REAL |


In [42]:
from sklearn.utils import resample

df_fake = dataframe[dataframe['Label'] == 'FAKE']
df_real = dataframe[dataframe['Label'] == 'REAL']

df_fake_upsampled = resample(df_fake,
                             replace=True,
                             n_samples=len(df_real),
                             random_state=42)

df_balanced = pd.concat([df_real, df_fake_upsampled])


In [43]:
df_balanced['Label'].value_counts()

Label
REAL    32788
FAKE    32788
Name: count, dtype: int64

In [44]:
# Shuffle the dataset to mix REAL and FAKE rows
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Confirm it’s shuffled
print("✅ Dataset shuffled successfully!")
print(df_balanced['Label'].head(10))


✅ Dataset shuffled successfully!
0    REAL
1    REAL
2    REAL
3    FAKE
4    REAL
5    FAKE
6    FAKE
7    REAL
8    FAKE
9    REAL
Name: Label, dtype: object


In [45]:
labels = df_balanced['Label'].tolist()

# Find index where label changes from REAL to FAKE or vice versa
change_points = [i for i in range(1, len(labels)) if labels[i] != labels[i-1]]

if len(change_points) == 1:
    print(f"⚠️ Dataset appears sorted — label changes only once at index {change_points[0]}")
else:
    print(f"✅ Dataset seems mixed — label changes {len(change_points)} times.")


✅ Dataset seems mixed — label changes 32634 times.


In [46]:
x = df_balanced['Statement']
y = df_balanced['Label']

In [47]:
x

0          pm modi biopic to re-release as theatres reopen
1        india to witness animal spirits revival with 1...
2        rahul gandhi condoles death of venugopal s mot...
3        mp assembly poll video shared as evm tampering...
4        sc grants bail to 84-year-old man as paternity...
                               ...                        
65571    no, that is not the rs 1,000 currency note tha...
65572    bjp worker died from shotgun pellets during pa...
65573    2013 video of unclaimed bodies at osmania hosp...
65574    sushant singh rajput case: uddhav never interf...
65575              won t take any retrograde step: rajnath
Name: Statement, Length: 65576, dtype: object

In [48]:
y

0        REAL
1        REAL
2        REAL
3        FAKE
4        REAL
         ... 
65571    FAKE
65572    REAL
65573    FAKE
65574    REAL
65575    REAL
Name: Label, Length: 65576, dtype: object

In [49]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [50]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
y_train

8412     FAKE
44454    FAKE
30224    REAL
5419     FAKE
45199    FAKE
         ... 
41993    FAKE
21243    FAKE
45891    REAL
42613    FAKE
43567    REAL
Name: Label, Length: 52460, dtype: object
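
Because the FAKE rows were upsampled with replacement before this split, copies of one statement can land in both the training and the test set, which may make the test score look better than it would on unseen statements. One possible alternative, sketched here on a hypothetical toy dataframe, is to split first and upsample only the training portion. The names `toy`, `train_df`, `test_df`, and `train_balanced` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({
    "Statement": [f"statement {i}" for i in range(12)],
    "Label": ["REAL"] * 9 + ["FAKE"] * 3,
})

# 1. Split first, keeping the class ratio in both parts
train_df, test_df = train_test_split(
    toy, test_size=4, stratify=toy["Label"], random_state=0
)

# 2. Upsample the minority class inside the training part only
train_real = train_df[train_df["Label"] == "REAL"]
train_fake = train_df[train_df["Label"] == "FAKE"]
train_fake_up = resample(
    train_fake, replace=True, n_samples=len(train_real), random_state=42
)
train_balanced = pd.concat([train_real, train_fake_up])

# No statement appears on both sides of the split
overlap = set(train_balanced["Statement"]) & set(test_df["Statement"])
counts = train_balanced["Label"].value_counts().to_dict()
print(counts, len(overlap))
```

With this ordering the test set keeps the original class ratio, so precision, recall, and F1 per class are more informative than accuracy alone.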

In [52]:
tfvect = TfidfVectorizer(stop_words='english',max_df=0.7)
tfid_x_train = tfvect.fit_transform(x_train)
tfid_x_test = tfvect.transform(x_test)

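* A float `max_df` is a proportion: max_df = 0.50 means "ignore terms that appear in more than 50% of the documents", so the max_df = 0.7 used above ignores terms that appear in more than 70% of them.
* An integer `max_df` is an absolute count: max_df = 25 means "ignore terms that appear in more than 25 documents".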
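
The difference between the two forms is easy to see on a tiny corpus. The four documents below are made up for this sketch, and `half` and `once` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana", "apple cherry", "apple date", "banana egg"]

# Float: drop terms found in more than 50% of the 4 documents (df > 2)
half = TfidfVectorizer(max_df=0.5).fit(docs)
vocab_half = sorted(half.vocabulary_)
print(vocab_half)  # 'apple' (df = 3) is dropped

# Integer: drop terms found in more than 1 document
once = TfidfVectorizer(max_df=1).fit(docs)
vocab_once = sorted(once.vocabulary_)
print(vocab_once)  # 'apple' and 'banana' (df = 2) are dropped
```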

In [53]:
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(tfid_x_train,y_train)

| Parameter | Value |
|---|---|
| C | 1.0 |
| fit_intercept | True |
| max_iter | 50 |
| tol | 0.001 |
| early_stopping | False |
| validation_fraction | 0.1 |
| n_iter_no_change | 5 |
| shuffle | True |
| verbose | 0 |
| loss | 'hinge' |


In [54]:
y_pred = classifier.predict(tfid_x_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 96.33%


In [55]:
import numpy as np
print("Unique labels in training data:", np.unique(y_train))

Unique labels in training data: ['FAKE' 'REAL']


In [56]:
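def fake_news_det(news):
    """Vectorize one news statement with the fitted TF-IDF vectorizer and print the predicted label."""
    input_data = [news]  # transform() expects an iterable of documents
    vectorized_input_data = tfvect.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)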

In [57]:
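import pickle
# Save the trained classifier; the with-block closes the file handle
with open('model.pkl', 'wb') as model_file:
    pickle.dump(classifier, model_file)

In [58]:
# load the model from disk
with open('model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)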

In [59]:
text_samples = [
    #  Likely Fake (matches 'fact', 'fake', 'photo', 'viral' etc.)
    "Fact Check: Viral video claiming earthquake in Noida is from 2015.",
    "Fake image shared on WhatsApp shows manipulated picture of Taj Mahal.",
    "Photo of politician queuing at ration shop is falsely linked to current crisis.",
    "Factcheck: No, Bollywood actor did not donate 100 crore to Uttarakhand flood victims.",
    "Truth: Viral claim about new ₹2000 note design is incorrect.",

    #  Likely Real (matches 'dies', 'nabs', 'winner', 'gangster', 'Noida', etc.)
    "TV host dies mysteriously after revealing TRP scam in Noida.",
    "Gangster arrested in Uttarakhand for manipulating election results.",
    "Staff declares politician as reality show winner to boost TRP ratings.",
    "Wednesday blast in Noida mall kills 12 people, police nabs suspects.",
    "Actor’s digital footprint found in secret TRP manipulation case."
]

for txt in text_samples:
    fake_news_det(txt)


['FAKE']
['FAKE']
['FAKE']
['FAKE']
['FAKE']
['REAL']
['REAL']
['REAL']
['REAL']
['REAL']


In [60]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Predict on test set
y_pred = classifier.predict(tfid_x_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='FAKE')
recall = recall_score(y_test, y_pred, pos_label='FAKE')
f1 = f1_score(y_test, y_pred, pos_label='FAKE')

# Print metrics nicely
print(f"Accuracy:  {round(accuracy * 100, 2)}%")
print(f"Precision: {round(precision * 100, 2)}%")
print(f"Recall:    {round(recall * 100, 2)}%")
print(f"F1 Score:  {round(f1 * 100, 2)}%")

# Optional: full classification report (for both labels)
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy:  96.33%
Precision: 95.09%
Recall:    97.74%
F1 Score:  96.39%

Detailed Classification Report:
              precision    recall  f1-score   support

        FAKE       0.95      0.98      0.96      6593
        REAL       0.98      0.95      0.96      6523

    accuracy                           0.96     13116
   macro avg       0.96      0.96      0.96     13116
weighted avg       0.96      0.96      0.96     13116
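
The F1 score is the harmonic mean of precision and recall, so it can be checked from the two printed values. Because the inputs below are already rounded to two decimal places, the result agrees with the reported 96.39% only up to rounding.

```python
precision = 0.9509  # printed precision for the FAKE class
recall = 0.9774     # printed recall for the FAKE class

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")
```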



In [61]:
import numpy as np

# Get the feature names (words) from the TF-IDF vectorizer
feature_names = np.array(tfvect.get_feature_names_out())

# Get the coefficients of the trained PassiveAggressiveClassifier.
# For a binary problem, coef_[0] scores classifier.classes_[1] ('REAL'),
# so negative weights push toward 'FAKE' and positive weights toward 'REAL'.
coeffs = classifier.coef_[0]

# 20 most negative weights: words that push the model toward "FAKE"
top_fake_indices = np.argsort(coeffs)[:20]
print(" Top words predicting FAKE:")
print(feature_names[top_fake_indices])

# 20 most positive weights: words that push the model toward "REAL"
# (argsort is ascending, so the strongest word is listed last)
top_real_indices = np.argsort(coeffs)[-20:]
print("\n Top words predicting REAL:")
print(feature_names[top_real_indices])


 Top words predicting FAKE:
['fact' 'fake' 'did' 'photo' 'viral' 'video' 'shared' 'factcheck'
 'falsely' 'image' 'false' 'misinformation' 'check' 'picture' 'underlying'
 'picnic' 'bimaru' 'truth' 'simultaneously' 'incorrect']

 Top words predicting REAL:
['fast' 'highlights' 'gangster' '4715' 'enforced' '55' 'runs' 'noida'
 'dies' 'uttarakhand' 'ioc' 'figure' 'appeal' 'spreading' 'urges'
 'declares' 'irritate' 'fir' 'politicising' 'trp']
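
Reading the lists this way relies on the sign convention of a binary linear classifier: the labels are sorted, and `coef_[0]` scores the second class. The sketch below checks that convention on four made-up documents; `docs`, `labels`, `vec`, and `clf` are illustrative names, and `random_state=0` is set only to make the run repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

docs = ["viral hoax", "hoax photo", "official statement", "official report"]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = PassiveAggressiveClassifier(max_iter=50, random_state=0).fit(X, labels)

# Classes are sorted alphabetically; coef_[0] scores the second one
class_order = list(clf.classes_)
hoax_weight = clf.coef_[0][vec.vocabulary_["hoax"]]
official_weight = clf.coef_[0][vec.vocabulary_["official"]]

print(class_order, hoax_weight < 0, official_weight > 0)
```

A word seen only in FAKE documents ends up with a negative weight, and a word seen only in REAL documents with a positive one.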


In [62]:
# Save model
with open('model1.pkl', 'wb') as model_file:
    pickle.dump(classifier, model_file)

# Save TF-IDF vectorizer
with open('tfidf_vectorizer.pkl', 'wb') as vec_file:
    pickle.dump(tfvect, vec_file)

print("✅ model1.pkl and tfidf_vectorizer.pkl saved successfully!")


✅ model1.pkl and tfidf_vectorizer.pkl saved successfully!
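
Every prediction needs both the classifier and the fitted vectorizer, so one option is to bundle them in a scikit-learn `Pipeline` and pickle a single object. The sketch below uses made-up toy data and a temporary directory; `pipeline`, `restored`, and the file name `pipeline.pkl` are assumptions for illustration, not files produced by this notebook.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for x_train / y_train
docs = ["viral hoax", "hoax photo", "official statement", "official report"]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

# Vectorizer and classifier fitted together as one estimator
pipeline = make_pipeline(
    TfidfVectorizer(),
    PassiveAggressiveClassifier(max_iter=50, random_state=0),
)
pipeline.fit(docs, labels)

# Round-trip the whole pipeline through a single pickle file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline accepts raw text directly
same = list(restored.predict(docs)) == list(pipeline.predict(docs))
print(same)
```

As with any pickle, the file should only be loaded from a trusted source and with a compatible scikit-learn version.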
