## Fake News Predicator

My this project aims to build a machine learning model to classify news articles as either "fake" or "true" based on their textual content. The dataset contains two separate CSV files: one with fake news articles and another with true news.  
The methodology involves data loading, preprocessing, text cleaning, feature extraction, model training, and evaluation using classification metrics. Various classifiers will be tested for performance.

### Importing required libraries

In [77]:
# Libraries for data manipulation and numerical operations
import pandas as pd
import numpy as np

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for model building and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Libraries for text processing and cleaning
import re
import string


### Loading the Dataset

Loading two datasets:  
- `Fake.csv` containing fake news articles  
- `True.csv` containing true news articles

To combine the data later and create target labels for classification.


In [78]:
data_fake = pd.read_csv('Fake.csv')
data_true = pd.read_csv('True.csv')

In [79]:
# Displaying sample records from the fake news dataset
data_fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [80]:
# Displaying sample records from the true news dataset
data_true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


### Data Preprocessing

In [81]:
# Adding a new column 'class' to label the datasets
data_fake['class'] = 0  # 0 indicates fake news
data_true['class'] = 1  # 1 indicates true news

In [82]:
# Checking the shape (rows, columns) of both datasets after labeling
data_fake.shape, data_true.shape

((23481, 5), (21417, 5))

In [83]:
# Reserving last 10 rows from fake and true datasets for manual testing

# Extracting last 10 fake news samples for testing
data_fake_manual_testing = data_fake.tail(10)

# Removing these 10 rows from the main fake news dataset
for i in range(23480, 23470, -1):
    data_fake.drop([i], axis=0, inplace=True)

# Extracting last 10 true news samples for testing
data_true_manual_testing = data_true.tail(10)

# Removing these 10 rows from the main true news dataset
for i in range(21416, 21406, -1):
    data_true.drop([i], axis=0, inplace=True)


In [84]:
# Checking the new shapes after removing rows for manual testing
data_fake.shape, data_true.shape

((23471, 5), (21407, 5))

In [85]:
# Assigning class labels again for the manual testing datasets
data_fake_manual_testing['class'] = 0
data_true_manual_testing['class'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_fake_manual_testing['class'] = 0
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_true_manual_testing['class'] = 1


In [86]:
data_fake_manual_testing.head(10)

Unnamed: 0,title,text,subject,date,class
23471,Seven Iranians freed in the prisoner swap have...,"21st Century Wire says This week, the historic...",Middle-east,"January 20, 2016",0
23472,#Hashtag Hell & The Fake Left,By Dady Chery and Gilbert MercierAll writers ...,Middle-east,"January 19, 2016",0
23473,Astroturfing: Journalist Reveals Brainwashing ...,Vic Bishop Waking TimesOur reality is carefull...,Middle-east,"January 19, 2016",0
23474,The New American Century: An Era of Fraud,Paul Craig RobertsIn the last years of the 20t...,Middle-east,"January 19, 2016",0
23475,Hillary Clinton: ‘Israel First’ (and no peace ...,Robert Fantina CounterpunchAlthough the United...,Middle-east,"January 18, 2016",0
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",0
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",0
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",0
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",0
23480,10 U.S. Navy Sailors Held by Iranian Military ...,21st Century Wire says As 21WIRE predicted in ...,Middle-east,"January 12, 2016",0


In [87]:
data_true_manual_testing.head(10)

Unnamed: 0,title,text,subject,date,class
21407,"Mata Pires, owner of embattled Brazil builder ...","SAO PAULO (Reuters) - Cesar Mata Pires, the ow...",worldnews,"August 22, 2017",1
21408,"U.S., North Korea clash at U.N. forum over nuc...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21409,"U.S., North Korea clash at U.N. arms forum on ...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",1
21410,Headless torso could belong to submarine journ...,COPENHAGEN (Reuters) - Danish police said on T...,worldnews,"August 22, 2017",1
21411,North Korea shipments to Syria chemical arms a...,UNITED NATIONS (Reuters) - Two North Korean sh...,worldnews,"August 21, 2017",1
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",1
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",1
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",1
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",1
21416,Indonesia to buy $1.14 billion worth of Russia...,JAKARTA (Reuters) - Indonesia will buy 11 Sukh...,worldnews,"August 22, 2017",1


#### Combine Datasets and Create Target Variable

In [88]:
data_merge = pd.concat([data_fake, data_true], axis = 0)
data_merge.head(10)

Unnamed: 0,title,text,subject,date,class
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017",0
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017",0
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017",0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017",0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017",0


In [89]:
data_merge.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

In [90]:
# Dropping the unnecessary columns
data = data_merge.drop(['title', 'subject', 'date'], axis=1)

In [91]:
data.isnull().sum()

Unnamed: 0,0
text,0
class,0


In [92]:
data = data.sample(frac=1)

In [93]:
data.head()

Unnamed: 0,text,class
10021,The Anheuser-Busch Brewery put beer production...,0
3893,WASHINGTON (Reuters) - U.S. Environmental Prot...,1
13841,PHNOM PENH (Reuters) - Cambodian Prime Ministe...,1
3872,(Reuters) - The U.S. Senate voted on Tuesday t...,1
452,LONDON (Reuters) - Britain’s ambassador to the...,1


In [94]:
data.reset_index(inplace=True)
data.drop(['index'], axis=1, inplace=True)

In [95]:
data.columns

Index(['text', 'class'], dtype='object')

In [96]:
data.head()

Unnamed: 0,text,class
0,The Anheuser-Busch Brewery put beer production...,0
1,WASHINGTON (Reuters) - U.S. Environmental Prot...,1
2,PHNOM PENH (Reuters) - Cambodian Prime Ministe...,1
3,(Reuters) - The U.S. Senate voted on Tuesday t...,1
4,LONDON (Reuters) - Britain’s ambassador to the...,1


In [97]:
def wordopt(text):
  text = text.lower()
  text = re.sub('\[.*?\]', '', text)
  text = re.sub("\\W", " ", text)
  text = re.sub('htps?://\S+www\.\S+', '', text)
  text = re.sub('<.*?>+', '', text)
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
  text = re.sub('\n', '', text)
  text = re.sub('\w*\d\w*', '', text)
  return text

  text = re.sub('\[.*?\]', '', text)
  text = re.sub('htps?://\S+www\.\S+', '', text)
  text = re.sub('\w*\d\w*', '', text)


In [98]:
data['text'] = data['text'].apply(wordopt)

In [99]:
X = data['text']
y = data['class']

### Train Test Splitting

In [100]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

### Text Vectorization

In [101]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorization = TfidfVectorizer()
Xv_train = vectorization.fit_transform(X_train)
Xv_test = vectorization.transform(X_test)

### LogisticeRegression Model Training

In [102]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(Xv_train, y_train)

In [103]:
pred_lr = LR.predict(Xv_test)

In [104]:
LR.score(Xv_test, y_test)

0.9860739750445633

In [105]:
print(classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      4748
           1       0.98      0.99      0.99      4228

    accuracy                           0.99      8976
   macro avg       0.99      0.99      0.99      8976
weighted avg       0.99      0.99      0.99      8976



### Decision Tree Model Training


In [106]:
from sklearn.tree import DecisionTreeClassifier

DT = DecisionTreeClassifier()

DT.fit(Xv_train, y_train)

In [107]:
pred_dt=DT.predict(Xv_test)

In [108]:
DT.score(Xv_test, y_test)

0.9945409982174688

In [109]:
print(classification_report(y_test, pred_dt))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      4748
           1       0.99      1.00      0.99      4228

    accuracy                           0.99      8976
   macro avg       0.99      0.99      0.99      8976
weighted avg       0.99      0.99      0.99      8976



### Gradient Boosting Classifier Training

In [110]:
from sklearn.ensemble import GradientBoostingClassifier

GB = GradientBoostingClassifier(random_state = 0)
GB.fit(Xv_train, y_train)

In [111]:
predict_gb = GB.predict(Xv_test)

In [112]:
GB.score(Xv_test, y_test)

0.9943181818181818

In [113]:
print(classification_report(y_test, predict_gb))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      4748
           1       0.99      1.00      0.99      4228

    accuracy                           0.99      8976
   macro avg       0.99      0.99      0.99      8976
weighted avg       0.99      0.99      0.99      8976



### Random Forest Classifier Training

In [114]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(random_state=0)
RF.fit(Xv_train, y_train)

In [115]:
predict_rf = RF.predict(Xv_test)

In [116]:
RF.score(Xv_test, y_test)

0.9894162210338681

In [117]:
print(classification_report(y_test, predict_rf))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4748
           1       0.99      0.99      0.99      4228

    accuracy                           0.99      8976
   macro avg       0.99      0.99      0.99      8976
weighted avg       0.99      0.99      0.99      8976



### Model Evaluation

In [118]:
def output_label(n):
    if n == 0:
        return 'Fake News'
    elif n == 1:
        return "Not A Fake News"

def manual_testing(news):
    # Convert the news into a dataframe
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)

    # Apply preprocessing
    new_def_test["text"] = new_def_test["text"].apply(wordopt)

    # Transform text using vectorization
    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)

    # Predictions
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_GB = GB.predict(new_xv_test)
    pred_RF = RF.predict(new_xv_test)

    # Print results
    print("\nPredictions for the given news:")
    print(f"Logistic Regression: {output_label(pred_LR[0])}")
    print(f"Decision Tree: {output_label(pred_DT[0])}")
    print(f"Gradient Boosting: {output_label(pred_GB[0])}")
    print(f"Random Forest: {output_label(pred_RF[0])}")


In [119]:
news = str(input())
manual_testing(news)

The Pentagon is considering a Boeing proposal to supply Ukraine with cheap, small precision bombs fitted to abundantly available rockets, allowing Kyiv to strike far behind Russian lines, according to a Reuters report. US and allied military inventories are shrinking, and Ukraine faces an increasing need for more sophisticated weapons as the war drags on. Boeing's proposed system, dubbed Ground-Launched Small Diameter Bomb (GLSDB), is one of about a half-dozen plans for getting new munitions into production for Ukraine and America's eastern European allies, industry sources told the news agency. GLSDB could be delivered as early as spring 2023, according to a document reviewed by Reuters and three people familiar with the plan. It combines the GBU-39 Small Diameter Bomb (SDB) with the M26 rocket motor, both of which are common in US inventories. Although a handful of GLSDB units have already been made, there are many logistical obstacles to formal procurement. The Boeing plan requires 

In [120]:
news = str(input())
manual_testing(news)

Pro-Russian users have often repeated the Kremlin’s original position that the invasion of Ukraine is a “special military operation” to “denazify” and “demilitarise” a “Neo-Nazi state”. Many have downplayed allegations of Russian war crimes or even claimed that the war is a supposed “hoax”. In one widely shared video, a news reporter could be seen standing in front of lines of body bags, one of which was moving. However, the footage did not show invented war casualties in Ukraine, but a “Fridays for Future” climate change protest in Vienna in February, three weeks before the invasion began. Days later, another viral video of a mannequin claimed to be proof that Ukrainian authorities had “staged” the mass killing of civilians in the town of Bucha. The misleading clip showed a prosthetic doll being dressed and prepared by two men. Nadezhda, an assistant director for a Russian television programme, confirmed to Euronews that the video showed their film set near St. Petersburg and not Ukra

### Some Limitations

The model is trained on a specific dataset of news articles that may not fully capture the diversity and nuances of real-world news, especially social media posts.

Text preprocessing is relatively basic; slang, sarcasm, and contextual nuances might not be effectively handled.

Models may be biased towards patterns learned from the training data, limiting generalization to unseen topics or styles.

### Future Improvements

Incorporate advanced NLP techniques such as word embeddings (Word2Vec, GloVe) or transformer models (BERT) to capture deeper semantic information.

Expand preprocessing to handle social media-specific text features like hashtags, emojis, and abbreviations.

Apply extensive hyperparameter tuning and ensemble methods to boost classification performance.