<a href="https://colab.research.google.com/github/devansh1206/Fake_news_detection/blob/main/fake_new_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Detect fake news**

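About the Dataset:

The data comes in two files, `True.csv` (real news) and `Fake.csv` (fake news). Each file has four columns:

1.   **title**: the title of a news article
2.   **text**: the text of the article; could be incomplete
3.   **subject**: the subject of the article (dropped during pre-processing)
4.   **date**: the date of the article (dropped during pre-processing)

A **fake** column is added during pre-processing as the label that marks whether the news is real or fake: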



```
1: Fake news
0: Real news
```





Importing the dependencies

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import nltk.corpus
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
# downloading the NLTK stopword list and printing the English stopwords
nltk.download('stopwords')
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Data Loading and Pre-processing

In [21]:
# loading the dataset

real_news_data = pd.read_csv("True.csv")
fake_news_data = pd.read_csv("Fake.csv")
print(real_news_data.shape)
print(fake_news_data.shape)

(21417, 4)
(23481, 4)


In [22]:
# removing the subject and date columns and adding a new column "fake" that marks whether the news is fake
# for real_news_data, every row gets fake = 0

real_news_data.drop(['subject', 'date'], axis=1, inplace=True)
real_news_data['fake'] = 0  # a scalar is broadcast to every row
real_news_data.head()

Unnamed: 0,title,text,fake
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,0


In [23]:
# removing the subject and date columns and adding a new column "fake" that marks whether the news is fake
# for fake_news_data, every row gets fake = 1

fake_news_data.drop(['subject', 'date'], axis=1, inplace=True)
fake_news_data['fake'] = 1  # a scalar is broadcast to every row
fake_news_data.head()

Unnamed: 0,title,text,fake
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,1


In [24]:
# combining the two datasets and shuffling the rows

# concatenating the datasets
combined_data = pd.concat([real_news_data,fake_news_data])

#shuffling the rows
df = combined_data.sample(frac=1).reset_index(drop=True)
df

Unnamed: 0,title,text,fake
0,WATCH: Fox News Host DEFENDS Slavery In Bizar...,"Fox News most popular prime time host, Bill O...",1
1,Obama's remarks on Orlando shooting: 'We will ...,WASHINGTON (Reuters) - President Barack Obama ...,0
2,‘Journalistic Malpractice’: CNN Slammed for ‘B...,21st Century Wire says This definitely needed ...,1
3,U.S. says Kerry to meet Iran foreign minister ...,WASHINGTON (Reuters) - U.S. Secretary of State...,0
4,THE “OBAMA BOUNCE”: UKIP Leader Claims Obama’s...,Barack Hussein Obama has been in over his head...,1
...,...,...,...
44893,Samantha Bee BLASTS GOP Over ‘C*ckblocking’ O...,Full Frontal host Samantha Bee is NOT one to m...,1
44894,HYSTERICAL…THE DEMOCRAT CONVENTION Schedule Is...,LOL! You ll want to share this with everyone D...,1
44895,Copts call for action over church closures in ...,"MINYA, Egypt (Reuters) - Coptic Christians in ...",0
44896,UK electoral body fines Liberal Democrats over...,LONDON (Reuters) - Britain s Electoral Commiss...,0
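
The shuffle above calls `sample(frac=1)` without a seed, so the row order changes on every run. A minimal sketch of a reproducible variant, assuming small toy frames in place of `True.csv` and `Fake.csv`; the names `toy_real` and `toy_fake` and the seed `42` are illustrative choices, not part of the original notebook:

```python
import pandas as pd

# toy stand-ins for the two labelled frames (hypothetical data)
toy_real = pd.DataFrame({"title": ["r1", "r2", "r3"], "fake": 0})
toy_fake = pd.DataFrame({"title": ["f1", "f2", "f3"], "fake": 1})

# ignore_index=True renumbers the rows 0..n-1 right after concatenation
combined = pd.concat([toy_real, toy_fake], ignore_index=True)

# a fixed random_state makes the shuffle repeatable across runs
shuffled_a = combined.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = combined.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a)
```

Fixing the seed here, together with the `random_state` already passed to `train_test_split`, should make the whole run repeatable.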


In [27]:
print(df.shape)
df.head(10)

(44898, 3)


Unnamed: 0,title,text,fake
0,WATCH: Fox News Host DEFENDS Slavery In Bizar...,"Fox News most popular prime time host, Bill O...",1
1,Obama's remarks on Orlando shooting: 'We will ...,WASHINGTON (Reuters) - President Barack Obama ...,0
2,‘Journalistic Malpractice’: CNN Slammed for ‘B...,21st Century Wire says This definitely needed ...,1
3,U.S. says Kerry to meet Iran foreign minister ...,WASHINGTON (Reuters) - U.S. Secretary of State...,0
4,THE “OBAMA BOUNCE”: UKIP Leader Claims Obama’s...,Barack Hussein Obama has been in over his head...,1
5,Trump's unpredictability already troubles U.S....,WASHINGTON (Reuters) - New U.S. President Dona...,0
6,U.N. pleads for end of Yemen blockade or 'unto...,(This story from November 16 corrects paragra...,0
7,Woman Born In NAZI GERMANY Says Trump Doesn’t ...,A woman who was born in Nazi Germany says that...,1
8,BOOM! TED CRUZ Will Conduct Hearing TODAY: Foc...,Ted Cruz has no interest in playing Mr Nice G...,1
9,White House slams Russia over veto on Syria ch...,(Reuters) - The White House admonished Russia ...,0


In [28]:
# counting the number of missing values

df.isnull().sum()

title    0
text     0
fake     0
dtype: int64

In [35]:
# separating the features and label

x = df[['title', 'text']]
y = df[['fake']]

In [36]:
print(x,y)

                                                   title  \
0       WATCH: Fox News Host DEFENDS Slavery In Bizar...   
1      Obama's remarks on Orlando shooting: 'We will ...   
2      ‘Journalistic Malpractice’: CNN Slammed for ‘B...   
3      U.S. says Kerry to meet Iran foreign minister ...   
4      THE “OBAMA BOUNCE”: UKIP Leader Claims Obama’s...   
...                                                  ...   
44893   Samantha Bee BLASTS GOP Over ‘C*ckblocking’ O...   
44894  HYSTERICAL…THE DEMOCRAT CONVENTION Schedule Is...   
44895  Copts call for action over church closures in ...   
44896  UK electoral body fines Liberal Democrats over...   
44897  U.S. Treasury sanctions 26 individuals, nine b...   

                                                    text  
0      Fox News  most popular prime time host, Bill O...  
1      WASHINGTON (Reuters) - President Barack Obama ...  
2      21st Century Wire says This definitely needed ...  
3      WASHINGTON (Reuters) - U.S. Secretar

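Stemming:

    The process of reducing a word to its stem by stripping suffixes.
    A stem is not always a dictionary word (in the output below, "slavery" becomes "slaveri").
    example: acting-> act
             remarks-> remark
             shooting-> shoot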


In [37]:
port_stem = PorterStemmer()

In [40]:
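# build the stopword set once; a set gives fast membership tests
stop_words = set(stopwords.words('english'))

def stemming(content):
  # replace everything except ASCII letters with a space
  stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
  stemmed_content = stemmed_content.lower()
  stemmed_content = stemmed_content.split()
  # drop stopwords, then reduce each remaining word to its stem
  stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content = ' '.join(stemmed_content)
  return stemmed_content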
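
A note on the character class used in the cleanup step: in ASCII, the range `A-z` also covers the six punctuation characters that sit between `Z` and `a`, so a pattern like `[^A-z]` lets `_`, `^` and square brackets through, whereas `[^a-zA-Z]` keeps letters only. A small check on a made-up string (the `sample` text is purely illustrative):

```python
import re

sample = "fake_news [update] ^now^"

# 'A-z' spans ASCII 65..122, which also contains [ \ ] ^ _ `
loose = re.sub('[^A-z]', ' ', sample)

# listing both letter ranges keeps letters only
strict = re.sub('[^a-zA-Z]', ' ', sample)

print(repr(loose))
print(repr(strict))
```

With the loose pattern the punctuation survives into the tokens, which would then reach the stemmer and the vectorizer unchanged.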

In [44]:
x = x.copy()  # work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
x['title'] = x['title'].apply(stemming)
x.head()


Unnamed: 0,title,text
0,watch fox news host defend slaveri bizarr rant,"Fox News most popular prime time host, Bill O..."
1,obama remark orlando shoot give fear,WASHINGTON (Reuters) - President Barack Obama ...
2,journalist malpractic cnn slam bias coverag fa...,21st Century Wire says This definitely needed ...
3,u say kerri meet iran foreign minist tuesday,WASHINGTON (Reuters) - U.S. Secretary of State...
4,obama bounc ukip leader claim obama insult thr...,Barack Hussein Obama has been in over his head...


In [45]:
X = x['title'].values
Y = y['fake'].values

In [49]:
X

array(['watch fox news host defend slaveri bizarr rant',
       'obama remark orlando shoot give fear',
       'journalist malpractic cnn slam bias coverag favour hillari', ...,
       'copt call action church closur southern egypt',
       'uk elector bodi fine liber democrat brexit vote expens',
       'u treasuri sanction individu nine bank north korea'], dtype=object)

In [50]:
print(X.shape)
print(Y.shape)

(44898,)
(44898,)


In [51]:
# converting textual data to numerical data

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

In [53]:
print(X)

  (0, 9363)	0.38604439085493597
  (0, 1177)	0.43292972466092466
  (0, 10688)	0.4649383327021748
  (0, 2990)	0.3345216781649767
  (0, 5569)	0.32228078623195694
  (0, 7885)	0.2773741551462617
  (0, 4523)	0.31548561884076054
  (0, 12795)	0.23481199416456255
  (1, 4203)	0.4468905161805813
  (1, 4833)	0.35545747508985914
  (1, 10515)	0.39576276218296247
  (1, 8245)	0.4906200957344086
  (1, 9613)	0.46526278848902025
  (1, 8073)	0.24522060543292146
  (2, 5445)	0.18054506659966874
  (2, 4190)	0.4793568047258695
  (2, 2612)	0.3522903199174871
  (2, 1106)	0.39843275903973774
  (2, 10679)	0.2794204509932518
  (2, 2227)	0.24159172898327178
  (2, 7046)	0.4793568047258695
  (2, 6225)	0.29755352715528066
  (3, 12075)	0.4829265451505065
  (3, 7452)	0.35626662461515435
  (3, 4471)	0.3713382048966759
  :	:
  (44894, 9758)	0.4002652797475677
  (44894, 3051)	0.2962538362256977
  (44895, 2531)	0.5369607843353947
  (44895, 10899)	0.38209313579715326
  (44895, 2210)	0.4387738720042664
  (44895, 2093)	0.34302
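
Each line of the printed matrix reads `(row, column)  weight`: the row is the title's index, the column is the word's index in the vocabulary, and the weight is its TF-IDF score. The sketch below reproduces that layout by hand on three made-up titles. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which to my understanding are the `TfidfVectorizer` defaults; the corpus and all helper names are hypothetical:

```python
import math
from collections import Counter

# three made-up, already-stemmed titles
docs = ["obama remark orlando shoot", "trump remark twitter", "obama trump debate"]
tokenised = [d.split() for d in docs]

# vocabulary: one column index per distinct word, in sorted order
vocab = sorted({w for doc in tokenised for w in doc})
n_docs = len(docs)

# document frequency: in how many titles each word appears
doc_freq = Counter(w for doc in tokenised for w in set(doc))

# smoothed inverse document frequency (assumed to match sklearn's default)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Return {column: weight} for one title, L2-normalised."""
    counts = Counter(tokens)
    weights = {vocab.index(w): counts[w] * idf[w] for w in counts}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {col: v / norm for col, v in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenised]

# print in the same (row, column) weight layout as the sparse matrix
for r, row in enumerate(rows):
    for col, value in sorted(row.items()):
        print(f"({r}, {col})\t{value:.4f}")
```

Words that appear in fewer titles get a larger weight, which is why rare words carry more signal than common ones in the resulting features.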

Splitting the dataset into training and testing data


In [57]:
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2, stratify=Y, random_state=2)

In [58]:
x_train.shape

(35918, 13225)

In [59]:
y_train.shape

(35918,)
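
Because the vectorizer was fitted on all titles before the split, the test rows influence the vocabulary and the IDF weights. One way to avoid that is to split the raw titles first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that fitting only ever sees the training rows. A minimal sketch on invented titles (the data and variable names below are hypothetical, and stemming is skipped for brevity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# invented titles: 0 = real, 1 = fake
toy_titles = [
    "senate passes budget bill",
    "court rules on travel order",
    "minister meets trade delegation",
    "central bank holds rates",
    "election commission sets date",
    "shocking secret they hide",
    "you will not believe this",
    "celebrity destroys rival hilarious",
    "watch host lose it",
    "bombshell leak changes everything",
]
toy_labels = [0] * 5 + [1] * 5

# split the raw text first; stratify keeps the class ratio in both parts
train_titles, test_titles, train_labels, test_labels = train_test_split(
    toy_titles, toy_labels, test_size=0.2, stratify=toy_labels, random_state=2
)

# the pipeline fits the vectorizer on the training titles only
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_titles, train_labels)

test_accuracy = pipeline.score(test_titles, test_labels)
print("Toy test accuracy ->", test_accuracy)
```

The toy accuracy is meaningless in itself; the point is only that the test titles never reach `fit`.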

Creating and training a Logistic Regression Model

In [60]:
model = LogisticRegression()

In [61]:
model.fit(x_train, y_train)

Evaluation

Accuracy score

In [63]:
# accuracy score on training data

x_train_prediction = model.predict(x_train)
train_data_accuracy = accuracy_score(y_train, x_train_prediction)  # argument order is (y_true, y_pred)
print("Training data accuracy -> ",train_data_accuracy)

Training data accuracy ->  0.9591569686508158


In [64]:
# accuracy score on test data

x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)  # argument order is (y_true, y_pred)
print("Test data accuracy -> ", test_data_accuracy)

Test data accuracy ->  0.9452115812917594
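
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix, precision and recall separate real titles flagged as fake from fake titles that slip through. A small sketch with made-up labels (the `y_true` and `y_pred` lists are illustrative, not results from this notebook):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# made-up ground truth and predictions: 0 = real, 1 = fake
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# for binary labels the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# precision: of the titles predicted fake, how many really are fake
precision = precision_score(y_true, y_pred)
# recall: of the truly fake titles, how many were caught
recall = recall_score(y_true, y_pred)

print("tn, fp, fn, tp ->", tn, fp, fn, tp)
print("precision ->", precision, "recall ->", recall)
```

On the real data the same calls would take `y_test` and `x_test_prediction` as their two arguments.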


Making a predictive system

In [112]:
# defining a function to predict if the news is fake or not

def find_fake_news(title):
  # applying the same preprocessing steps that we used on the training data
  title = stemming(title)
  # the vectorizer expects an iterable of documents, so wrap the single title in a list
  title = vectorizer.transform([title])
  # prediction: 0 = real, 1 = fake
  fake = model.predict(title)
  print("Real news" if fake[0] == 0 else "Fake news")


In [115]:
title = fake_news_data['title'][100]
print(title)

 Watch: Joe Biden Just Wrecked Trump In Six Words And It’s Hilarious


In [116]:
find_fake_news(title)

Fake news
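
A printed label hides how sure the model is. A possible variant returns the label together with the predicted probability of the fake class, via `predict_proba`. The sketch below trains a throwaway model on invented titles so that it runs on its own; `classify_title`, `toy_model` and the data are hypothetical names, and stemming is skipped for brevity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# invented training titles: 0 = real, 1 = fake
toy_titles = [
    "senate passes budget bill",
    "court rules on travel order",
    "central bank holds rates",
    "shocking secret they hide",
    "celebrity destroys rival hilarious",
    "watch host lose it",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_model.fit(toy_titles, toy_labels)

def classify_title(title, fitted_model):
    """Return (label, probability that the title is fake)."""
    # classes_ is sorted, so column 1 holds the probability of label 1 (fake)
    proba_fake = float(fitted_model.predict_proba([title])[0][1])
    label = "Fake news" if proba_fake > 0.5 else "Real news"
    return label, proba_fake

label, proba_fake = classify_title("watch host destroys rival", toy_model)
print(label, round(proba_fake, 3))
```

Returning values instead of printing also makes the function easier to reuse, for example to apply a stricter threshold than 0.5.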


In [117]:
# contact me at devanshchauhan08@gmail.com with any questions about this notebook or any general machine learning query
# I am open to work as a Machine Learning Intern