## Step 0: Importing Required Libraries

Before we start, we need to import some Python libraries.

Each library has a very specific role.  
We import **only what we need** — nothing extra.

Below is an explanation of every import.


In [3]:
import pandas as pd
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Explanation of Imports

- **pandas (`pd`)**
  - Used to load and manipulate tabular data (CSV files).
  - Our dataset is stored in a table format, so pandas is essential.

- **re (Regular Expressions)**
  - Used for text cleaning.
  - Helps remove URLs, punctuation, and unwanted characters from text.

- **train_test_split**
  - Used to split our data into training and testing sets.
  - This helps us evaluate how well our model performs on unseen data.

- **TfidfVectorizer**
  - Converts text into numerical vectors.
  - Machines cannot understand text, so we convert words into numbers using TF-IDF.

- **LogisticRegression**
  - Our machine learning model.
  - Used for binary classification (Fake vs Real).

- **accuracy_score, confusion_matrix, classification_report**
  - Used to evaluate model performance.
  - They tell us how many predictions were correct and where the model made mistakes.


## Step 1: Loading the Dataset

We are using a **Fake News dataset** where:
- Each row represents a news article
- We keep just two columns from the dataset
  - `text` → the news content
  - `label` → whether the news is Fake (`0`) or Real (`1`), the convention the coefficient analysis at the end relies on

We will load this dataset using pandas.


In [23]:
df = pd.read_csv("dataset1.csv")

# Keep only the relevant columns (assign the result, otherwise df is unchanged)
df = df[["text", "label"]]
df


Unnamed: 0,text,label
0,At least 31 people were wounded by Israeli arm...,1
1,The racists came out in full force when Jimmy ...,0
2,"On Tuesday night, Ted Cruz dropped out of the ...",0
3,"Russia, Turkey and Iran are close to finalizin...",1
4,U.S. House Speaker Paul Ryan on Wednesday did ...,1
...,...,...
4995,"The mosque in Finsbury Park, London where this...",0
4996,U.S. Food and Drug Administration Commissioner...,1
4997,In a desperate bid to normalize the Republican...,0
4998,Senator Pat Toomey is a staunch Republican cau...,0
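Before cleaning, it can help to confirm that the two columns look as expected. The sketch below runs on a tiny hypothetical DataFrame (`toy_df` and its rows are invented for illustration, not taken from `dataset1.csv`); the same calls should apply to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (rows invented for illustration)
toy_df = pd.DataFrame({
    "text": [
        "Senate passes budget bill",
        None,
        "You won't believe this video",
        "Minister issues statement",
    ],
    "label": [1, 0, 0, 1],
})

# Missing values per column: string methods such as .lower() fail on NaN
missing = toy_df.isna().sum()

# Class balance: a heavily skewed split would make accuracy misleading
balance = toy_df["label"].value_counts(normalize=True)

# Drop rows without text before any preprocessing
toy_df = toy_df.dropna(subset=["text"]).reset_index(drop=True)

print(missing)
print(balance)
print(len(toy_df))
```

If the real dataset had missing text, dropping those rows (or filling them with an empty string) before Step 2 would be one way to avoid errors during cleaning.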


## Step 2: Text Cleaning (Preprocessing)

Raw text is messy.

It contains:
- Capital letters
- URLs
- Punctuation
- Extra spaces

If we feed this raw text directly into a model, the same word shows up in several surface forms (for example `Video`, `video` and `video!`), which adds **noise** and can hurt performance.

So we clean the text before doing anything else.


In [25]:
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)

    # Keep only lowercase letters and whitespace
    # (this drops punctuation, special characters, and also digits)
    text = re.sub(r"[^a-z\s]", "", text)

    # Remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()

    return text
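
To see what each step does, here is the same function applied to two made-up strings (the sample sentences are hypothetical). The function is repeated so the snippet runs on its own.

```python
import re

def clean_text(text):
    text = text.lower()                          # lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # strip URLs
    text = re.sub(r"[^a-z\s]", "", text)         # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace
    return text

sample = "BREAKING: Visit https://example.com NOW!!! 31 people said   'wow'."
cleaned = clean_text(sample)
print(cleaned)    # breaking visit now people said wow

# Side effect worth knowing: abbreviations and numbers are altered too
abbrev = clean_text("U.S. House")
print(abbrev)     # us house
```

Note that digits disappear along with punctuation, so "31 people" becomes "people". Whether that loss matters depends on the task.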


In [29]:
df["clean_text"] = df["text"].apply(clean_text)

print(df[["text", "clean_text"]].head())


                                                text  \
0  At least 31 people were wounded by Israeli arm...   
1  The racists came out in full force when Jimmy ...   
2  On Tuesday night, Ted Cruz dropped out of the ...   
3  Russia, Turkey and Iran are close to finalizin...   
4  U.S. House Speaker Paul Ryan on Wednesday did ...   

                                          clean_text  
0  at least people were wounded by israeli army g...  
1  the racists came out in full force when jimmy ...  
2  on tuesday night ted cruz dropped out of the r...  
3  russia turkey and iran are close to finalizing...  
4  us house speaker paul ryan on wednesday did no...  


In [31]:
X = df["clean_text"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
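
The split above is purely random. As an optional variation (not what this notebook does), `train_test_split` also accepts a `stratify` argument that keeps the Fake/Real ratio the same in both parts. A minimal sketch on hypothetical toy data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 "articles", 5 per class
toy_X = [f"article {i}" for i in range(10)]
toy_y = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X,
    toy_y,
    test_size=0.2,      # same 80/20 ratio as in the notebook
    random_state=42,
    stratify=toy_y,     # preserve the class ratio in both splits
)

print(len(X_tr), len(X_te))
print(Counter(y_te))
```

With balanced classes, as in this dataset's test set (524 vs 476), a random split is usually fine; stratification matters more when one class is rare.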


## Step 3: Converting Text into Numbers using TF-IDF

Machine learning models cannot understand text.

They only work with numbers.

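TF-IDF (Term Frequency - Inverse Document Frequency) converts each document
into a numerical vector. A word gets a high weight when it occurs often in
that document but in few other documents, and a low weight when it is
common everywhere.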
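
To make the weighting concrete, the sketch below recomputes TF-IDF by hand on a hypothetical three-document corpus and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "senate passes budget bill",
    "shocking video shocks senate",
    "budget video leaked",
]

# Term frequency: raw word counts per document
counts = CountVectorizer().fit_transform(docs).toarray()

# Inverse document frequency (scikit-learn's smoothed default)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply, then scale each document vector to unit length
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Same corpus through TfidfVectorizer with default parameters
auto = TfidfVectorizer().fit_transform(docs).toarray()

matches = np.allclose(manual, auto)
print(matches)
```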


In [33]:
vectorizer = TfidfVectorizer(
    stop_words="english",
    max_df=0.7
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
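
The vectorizer is fitted on the training text only and then reused on the test text, so the vocabulary and IDF weights never see test data. The sketch below, on hypothetical sentences, shows the consequence: a word that appears only in the test set is simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy split
toy_train = ["senate passes budget bill", "shocking video leaked online"]
toy_test = ["aliens leaked budget video"]

toy_vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and IDF weights from training text
train_matrix = toy_vectorizer.fit_transform(toy_train)

# transform reuses that vocabulary; unseen words get no column
test_matrix = toy_vectorizer.transform(toy_test)

print("aliens" in toy_vectorizer.vocabulary_)
print(train_matrix.shape[1] == test_matrix.shape[1])
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.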


## Step 4: Training the Logistic Regression Model on the Train Split

We do NOT train our model on the entire dataset.

Instead, we use the 80/20 split created earlier with `train_test_split`:
- **Training data** → Used to teach the model
- **Testing data** → Used to evaluate the model

This simulates a real-world scenario where the model sees new, unseen data.


In [35]:
model = LogisticRegression(max_iter=1000)

model.fit(X_train_tfidf, y_train)
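
For intuition about what the fitted model computes: in the binary case, logistic regression takes a weighted sum of the TF-IDF features plus an intercept and passes it through the sigmoid function to get a probability for class `1`. The sketch below checks this on hypothetical toy data; all `toy_*` names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data
toy_texts = [
    "shocking video you must see",
    "shocking image goes viral",
    "reuters reports minister statement",
    "reuters says minister resigned",
]
toy_labels = [0, 0, 1, 1]

toy_X = TfidfVectorizer().fit_transform(toy_texts).toarray()
toy_model = LogisticRegression(max_iter=1000).fit(toy_X, toy_labels)

# Linear score: weights . features + intercept
scores = toy_X @ toy_model.coef_[0] + toy_model.intercept_[0]

# Sigmoid turns the score into P(label == 1)
manual_prob = 1.0 / (1.0 + np.exp(-scores))
sklearn_prob = toy_model.predict_proba(toy_X)[:, 1]

same = np.allclose(manual_prob, sklearn_prob)
print(same)
```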


## Step 5: Evaluation

In [37]:
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.953

Confusion Matrix:
 [[493  31]
 [ 16 460]]

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.94      0.95       524
           1       0.94      0.97      0.95       476

    accuracy                           0.95      1000
   macro avg       0.95      0.95      0.95      1000
weighted avg       0.95      0.95      0.95      1000
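
Every number in the report can be recovered from the confusion matrix, where rows are true labels and columns are predicted labels. A short check using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the output above (rows: true, columns: predicted)
cm = np.array([[493, 31],
               [16, 460]])

# Accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions of a class over everything predicted as it
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions of a class over everything truly in it
recall = np.diag(cm) / cm.sum(axis=1)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(float(accuracy), 3))   # 0.953
print(precision.round(2))          # [0.97 0.94]
print(recall.round(2))             # [0.94 0.97]
print(f1.round(2))                 # [0.95 0.95]
```

So 31 articles with true label `0` were predicted as `1`, and 16 with true label `1` were predicted as `0`.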



## Major Features for Prediction

In [39]:
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

coef_df = pd.DataFrame({
    "word": feature_names,
    "coefficient": coefficients
})

# Top words pushing prediction towards FAKE
print("\nTop FAKE indicators:")
print(coef_df.sort_values(by="coefficient").head(10))

# Top words pushing prediction towards REAL
print("\nTop REAL indicators:")
print(coef_df.sort_values(by="coefficient", ascending=False).head(10))



Top FAKE indicators:
          word  coefficient
25106     just    -3.067233
1761   america    -2.797485
22462    image    -2.775776
21305  hillary    -2.669309
27034     like    -2.667040
32537    obama    -2.639149
13614      don    -2.589608
53036     wire    -2.220258
19314      gop    -2.207081
51486    video    -2.115254

Top REAL indicators:
             word  coefficient
49503      trumps     4.975804
48318    thursday     3.378301
52434   wednesday     3.358363
29760    minister     2.997831
49621     tuesday     2.989750
30185      monday     2.697419
18183      friday     2.657097
40345     reuters     2.595226
39910  republican     2.582633
45478   statement     2.449888


The words the model associates with FAKE news are mostly authoritative-sounding terms: the names of prominent American political figures ("hillary", "obama") and "america" itself.
The mere presence of such words does not, of course, confirm that an article is FAKE.

Fake news also tends to lean on vague attributions of the "according to sources" kind, and words like "image" and "video" point in that direction.
The appearance of "just", often used for exaggeration, is no big surprise either.

Hence, some of these words make sense.

However, an article is not REAL simply because it mentions the ruling party or a government office ("republican", "minister"), so these weights may point to bias in the data itself.
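
One way to probe for this kind of bias is to check how often a suspicious word appears in each class. The sketch below is a rough illustration on a hypothetical DataFrame; `share_containing` and the toy rows are invented here, and the same idea could be applied to `df["clean_text"]` and `df["label"]`.

```python
import re
import pandas as pd

# Hypothetical toy data mirroring the notebook's column names
toy_df = pd.DataFrame({
    "clean_text": [
        "reuters reports minister statement on thursday",
        "shocking video you must see",
        "reuters says talks resume on monday",
        "just watch what obama said",
    ],
    "label": [1, 0, 1, 0],
})

def share_containing(frame, word):
    """Fraction of articles per label whose text contains `word` as a whole token."""
    pattern = rf"\b{re.escape(word)}\b"
    has_word = frame["clean_text"].str.contains(pattern, regex=True)
    return has_word.groupby(frame["label"]).mean()

shares = share_containing(toy_df, "reuters")
print(shares)
```

If a word such as "reuters" appears in nearly all articles of one class and almost none of the other, the model may be learning the source of the articles rather than anything about their truthfulness.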