# Project Overview: Fake News Detection
The Fake News Detection project aims to develop a machine learning model capable of identifying and classifying news articles as either authentic or misleading. With the rise of misinformation, this project holds significance in promoting information integrity and aiding users in making informed decisions.

# Importing Dependencies

In [1]:
import numpy as np
import pandas as pd
import re   # regular expressions: pattern-based searching, matching, and replacing in strings
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

# Data gathering

In [4]:
df = pd.read_csv('/kaggle/input/fake-news/fake_news.csv')
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


The dataset contains 4 input columns and 1 output column.
The goal of the project is to predict whether a news article is fake or not.
> **Input columns:**
* > id: unique id for a news article
* > title: the title of a news article
* > author: author of the news article
* > text: the text of the article; could be incomplete

> **Output column:**
> label: a label that marks whether the news article is real or fake:
* >     1: fake news
* >     0: real news

# Data analysis & Data preprocessing

In [5]:
df.shape

(20800, 5)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [7]:
df['label'].value_counts()    # check whether the classes are imbalanced

label
1    10413
0    10387
Name: count, dtype: int64
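
The two classes are close to 50/50, which matters when reading the accuracy scores later: a classifier that always predicts the majority class would already reach the majority share. A quick check of that baseline, using only the two counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
n_fake = 10413   # label 1
n_real = 10387   # label 0

total = n_fake + n_real
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(n_fake, n_real) / total

print(total)
print(round(majority_baseline, 4))
```

Any model worth keeping should score well above this baseline.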

In [8]:
# checking the null values
df.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [9]:
# it's difficult to fill the null values with actual text, so fill them with empty strings
df = df.fillna('')

In [10]:
df.isnull().sum()  # no null values remain

id        0
title     0
author    0
text      0
label     0
dtype: int64
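
`df.fillna('')` replaces missing values in every column. Here only the text columns contain nulls, so the result is the same, but restricting the fill to those columns keeps numeric columns untouched if the data ever changes. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import pandas as pd

# Toy frame with the same column names as the dataset; values are invented
toy = pd.DataFrame({
    'id': [0, 1, 2],
    'title': ['A headline', None, 'Another headline'],
    'author': [None, 'Jane Doe', 'John Roe'],
    'text': ['body one', 'body two', None],
    'label': [1, 0, 1],
})

text_cols = ['title', 'author', 'text']
toy[text_cols] = toy[text_cols].fillna('')   # fill only the text columns

null_counts = toy.isnull().sum()
print(null_counts)
```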

In [11]:
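# merging the title & author columns
# note: no separator is inserted, so the last word of the title fuses with the author name (e.g. 'FiredConsortiumnew...' in row 2 below)
df['content'] = df['title'] + df['author']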

In [12]:
print(df['content'])

0        House Dem Aide: We Didn’t Even See Comey’s Let...
1        FLYNN: Hillary Clinton, Big Woman on Campus - ...
2        Why the Truth Might Get You FiredConsortiumnew...
3        15 Civilians Killed In Single US Airstrike Hav...
4        Iranian woman jailed for fictional unpublished...
                               ...                        
20795    Rapper T.I.: Trump a ’Poster Child For White S...
20796    N.F.L. Playoffs: Schedule, Matchups and Odds -...
20797    Macy’s Is Said to Receive Takeover Approach by...
20798    NATO, Russia To Hold Parallel Exercises In Bal...
20799               What Keeps the F-35 AliveDavid Swanson
Name: content, Length: 20800, dtype: object
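
Rows 2 and 20799 above show a side effect of concatenating without a separator: the last word of the title and the author name fuse into one token. A small sketch of the difference, using the title and author of row 2 from the table above. Whether to add the space is a design choice, and doing so would change the outputs shown in the rest of this notebook:

```python
import re

title = 'Why the Truth Might Get You Fired'
author = 'Consortiumnews.com'

def tokens(s):
    # Same cleaning the notebook applies later: letters only, lowercase, split
    return re.sub('[^a-zA-Z]', ' ', s).lower().split()

fused = tokens(title + author)             # what the notebook does
separated = tokens(title + ' ' + author)   # alternative with a space

print(fused[-2:])
print(separated[-3:])
```

With the space, "fired" and "consortiumnews" stay separate tokens, so each can be matched against the same word in other articles.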


**Data processing on the 'content' column**

In [13]:
port_stem = PorterStemmer()

In [14]:
def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]',' ',content)   # replace every character that is not a letter with a space
    stemmed_content = stemmed_content.lower()           # convert to lowercase
    stemmed_content = stemmed_content.split()           # split into a list of words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]  # drop stopwords and stem the remaining words with the Porter stemmer
    stemmed_content = ' '.join(stemmed_content)         # join the processed words back into a space-separated string
    return stemmed_content

In [15]:
df['content'] = df['content'].apply(stemming)
print(df['content'])

0        hous dem aid even see comey letter jason chaff...
1        flynn hillari clinton big woman campu breitbar...
2                   truth might get firedconsortiumnew com
3        civilian kill singl us airstrik identifiedjess...
4        iranian woman jail fiction unpublish stori wom...
                               ...                        
20795    rapper trump poster child white supremaci jero...
20796    n f l playoff schedul matchup odd new york tim...
20797    maci said receiv takeov approach hudson bay ne...
20798    nato russia hold parallel exercis balkansalex ...
20799                            keep f alivedavid swanson
Name: content, Length: 20800, dtype: object


Let's go through each step and discuss the types of variables involved:

Function Definition:

> def stemming(content):

> Variable Type: content

> Explanation: content is assumed to be a string representing the text content that needs to be processed.


Removing Non-Alphabetic Characters:

>stemmed_content = re.sub('[^a-zA-Z]',' ', content)

>Variable Type: content, stemmed_content

>Explanation: content is a string. stemmed_content is a string with non-alphabetic characters replaced by spaces.


Lowercasing:
> stemmed_content = stemmed_content.lower()

> Variable Type: stemmed_content

> Explanation: stemmed_content is a string. The operation converts it to lowercase.


Tokenization (Splitting into Words):
> stemmed_content = stemmed_content.split()

> Variable Type: stemmed_content

> Explanation: stemmed_content is now a list of words.


Stemming:
> stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]

> Variable Type: stemmed_content

> Explanation: stemmed_content is still a list, but now it contains stemmed words. The list comprehension filters out stopwords.


Joining Words Back into a String:
> stemmed_content = ' '.join(stemmed_content)

> Variable Type: stemmed_content

> Explanation: stemmed_content is now a string again, with stemmed words joined by spaces.


Applying the Function to the 'content' Column:
> df['content'] = df['content'].apply(stemming)

> Variable Type: DataFrame column ('content')

> Explanation: The 'content' column is a column of strings in the df DataFrame. The apply function calls the stemming function on each element of this column.

In summary, the variable content starts as a string, goes through various transformations, and ends up as a processed string after stemming, suitable for natural language processing tasks. The operations involve strings, lists, and DataFrames, and the variable types are adjusted accordingly at each step.
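
One practical note on the stemming step: `stopwords.words('english')` is evaluated again for every word, and membership is tested against a list. Building a set once gives the same result with much less repeated work. The sketch below rests on assumptions: `make_preprocessor` is a hypothetical helper, and the tiny stopword list and identity stemmer are stand-ins so the example runs without NLTK data.

```python
import re
from typing import Callable, Iterable

def make_preprocessor(stop_words: Iterable[str], stem: Callable[[str], str]):
    """Build a cleaning function; the stopword collection is turned into a set once."""
    stop_set = frozenset(stop_words)        # constant-time membership checks
    non_letters = re.compile('[^a-zA-Z]')   # same pattern as in stemming()

    def preprocess(content: str) -> str:
        words = non_letters.sub(' ', content).lower().split()
        return ' '.join(stem(w) for w in words if w not in stop_set)

    return preprocess

# Stand-ins for illustration only (hypothetical, not the NLTK resources)
toy_stop_words = ['why', 'the', 'you']
identity_stem = lambda word: word

preprocess = make_preprocessor(toy_stop_words, identity_stem)
cleaned = preprocess('Why the Truth Might Get You Fired')
print(cleaned)
```

In the notebook this could be built as `make_preprocessor(stopwords.words('english'), port_stem.stem)` and passed to `df['content'].apply(...)`.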







In [16]:
# separating the data and label
X = df['content']
Y = df['label']
print(X)
print(Y)

0        hous dem aid even see comey letter jason chaff...
1        flynn hillari clinton big woman campu breitbar...
2                   truth might get firedconsortiumnew com
3        civilian kill singl us airstrik identifiedjess...
4        iranian woman jail fiction unpublish stori wom...
                               ...                        
20795    rapper trump poster child white supremaci jero...
20796    n f l playoff schedul matchup odd new york tim...
20797    maci said receiv takeov approach hudson bay ne...
20798    nato russia hold parallel exercis balkansalex ...
20799                            keep f alivedavid swanson
Name: content, Length: 20800, dtype: object
0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64


In [17]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
print(X)

  (0, 21557)	0.2736369479869461
  (0, 18009)	0.2438301027041085
  (0, 11974)	0.34466883664274506
  (0, 11617)	0.2783091851108118
  (0, 10495)	0.311553446057155
  (0, 10381)	0.41343221816522613
  (0, 9475)	0.20871803491508256
  (0, 6798)	0.22134331972572915
  (0, 5178)	0.25645024223907936
  (0, 3933)	0.23592778464338887
  (0, 3291)	0.34851330509336254
  (0, 381)	0.25686395241555227
  (1, 23108)	0.2952143706864955
  (1, 9174)	0.18812765977413537
  (1, 7534)	0.6987204016565229
  (1, 3714)	0.18820851327454977
  (1, 2957)	0.37436858023248293
  (1, 2482)	0.3616637468521842
  (1, 1956)	0.2878737833766196
  (2, 21453)	0.41331452278016145
  (2, 12835)	0.4627692646157023
  (2, 8113)	0.3260098284202015
  (2, 7391)	0.6469324358467595
  (2, 3910)	0.30035267305096314
  (3, 21978)	0.23565193182482072
  :	:
  (20797, 23430)	0.08239795721216821
  (20797, 20788)	0.19858848268663878
  (20797, 19961)	0.32033450422171095
  (20797, 17564)	0.24595228419376997
  (20797, 16511)	0.2701082984026285
  (20797, 162
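
Each printed entry has the form `(row, column)  weight`: the row is the article index, the column is the index of a term in the learned vocabulary, and the weight is that term's TF-IDF score in that article. To make the weights less opaque, here is a hand-rolled sketch of the computation on a toy corpus. It assumes the defaults documented for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalisation per document); the function name and toy documents are illustrative, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, list of {term: weight}) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})  # L2-normalised row
    return vocab, rows

docs = ['trump poster child', 'trump new york time', 'nato russia new exercis']
vocab, rows = tfidf_rows(docs)
for i, row in enumerate(rows):
    for term, weight in sorted(row.items()):
        print((i, vocab.index(term)), term, round(weight, 3))
```

Terms that appear in fewer documents (such as "poster") receive a higher weight than terms shared across documents (such as "trump").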

# Splitting the dataset into train & test data

In [18]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,stratify=Y,test_size=0.2,random_state=2)
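
`stratify=Y` keeps the fake/real ratio the same in the training and test sets, and `random_state=2` makes the split reproducible. A small sketch on made-up labels that checks the effect (the toy arrays are illustrative only):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 100 toy samples with a perfectly balanced label, 50 per class
X_toy = list(range(100))
y_toy = [0] * 50 + [1] * 50

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=2
)

print(Counter(y_tr))  # class counts in the training part
print(Counter(y_te))  # class counts in the test part
```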

# Model Building & evaluation

In [19]:
model1 = RandomForestClassifier()
model2 = LogisticRegression()

In [20]:
model1.fit(X_train,Y_train)

In [21]:
model2.fit(X_train,Y_train)

# prediction on training data

In [22]:

train_pred1 = model1.predict(X_train)
train_pred2 = model2.predict(X_train)

In [23]:
print('accuracy score of Random Forest model on training data :', accuracy_score(Y_train, train_pred1))
print('accuracy score of Logistic Regression model on training data :', accuracy_score(Y_train, train_pred2))

accuracy score of Random Forest model on training data : 1.0
accuracy score of Logistic Regression model on training data : 0.9825120192307693


# prediction on test data

In [24]:
test_pred1 = model1.predict(X_test)
test_pred2 = model2.predict(X_test)

In [25]:
print('accuracy score of Random Forest model on test data :', accuracy_score(Y_test, test_pred1))
print('accuracy score of Logistic Regression model on test data :', accuracy_score(Y_test, test_pred2))

accuracy score of Random Forest model on test data : 0.9814903846153846
accuracy score of Logistic Regression model on test data : 0.9673076923076923
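
Accuracy alone does not show which kind of error a model makes. `confusion_matrix` and `classification_report` are imported at the top of the notebook but never used; a sketch of how they could be applied is below. It uses made-up labels so it runs on its own; in the notebook the calls would take `Y_test` and `test_pred1` (or `test_pred2`) instead.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Made-up labels for illustration: 1 = fake, 0 = real
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, both ordered [0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1
print(classification_report(y_true, y_pred, target_names=['real', 'fake']))
```

The off-diagonal cells separate real articles flagged as fake (top right) from fake articles that slipped through (bottom left).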


# Cross-validation for a more reliable accuracy estimate

In [26]:
models = [model1,model2]

In [27]:
from sklearn.model_selection import cross_val_score

In [28]:
for model in models:
    accuracy = cross_val_score(model, X, Y, cv=3, scoring='accuracy')
    print('model:', model, 'and its accuracy:', accuracy)

model: RandomForestClassifier() and its accuracy: [0.98254975 0.982403   0.98283571]
model: LogisticRegression() and its accuracy: [0.96308047 0.96264243 0.96091158]
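
One caveat about these scores: the `TfidfVectorizer` was fitted on all 20,800 rows before any splitting, so vocabulary and idf statistics from the held-out folds leak into training. The effect is often small, but wrapping the vectorizer and classifier in a `Pipeline` refits the vectorizer inside each fold and avoids it. A minimal sketch on a made-up toy corpus (texts and labels are invented, so the scores it prints carry no meaning):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Invented, already-stemmed toy texts: 1 = fake, 0 = real
texts = [
    'shock claim secret plot expos', 'secret plot shock reveal',
    'expos shock claim viral', 'senat pass budget bill',
    'court rule budget case', 'senat vote court nomin',
]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),   # fitted on the training folds only
    ('clf', LogisticRegression()),
])

scores = cross_val_score(pipe, texts, labels, cv=3, scoring='accuracy')
print(scores)
```

In the notebook, `texts` would be the preprocessed `df['content']` column before vectorization and `labels` would be `df['label']`.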


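# Conclusion

We successfully built a model that classifies news articles as fake or real from their title and author alone. The Random Forest classifier reaches about 98% accuracy on the test set and in 3-fold cross-validation, while Logistic Regression reaches about 96 to 97%.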