### Fake News Classification with Natural Language Processing.

Fake news detection is a hot topic in the field of natural language processing. We consume news through several mediums throughout the day in our daily routine, but sometimes it becomes difficult to decide which one is fake and which one is authentic. Our job is to create a model which predicts whether a given news is real or fake.

In [1]:
Project Flow:
  1. Problem Statement
  2. Data Gathering
  3. Data Processing:
    3.1. Tokenization
    3.2. Lower Case
    3.3. Stopwords
    3.4. Stenning
  4. Vectorization
    4.1. Bag of Words
    4.2. TF-IDF
  5. Model Building:
    5.1. Model Object Initialization
    5.2. Train and Test Model
  6. Model Evaluation:
    6.1. Accuracy Score
    6.2. Confusition Matrix
    6.3. Classification Report
  7. Model Deployment
  8. Prediction on Client Data

SyntaxError: ignored

In [36]:
import pandas as pd
import numpy as np
import re
import nltk
import sklearn
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

##Data Gathering

In [3]:
df = pd.read_csv('/content/drive/MyDrive/WELFake_Dataset.csv')

df = df.drop(df.index[40000:])

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  40000 non-null  int64 
 1   title       39721 non-null  object
 2   text        39982 non-null  object
 3   label       40000 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 1.2+ MB


In [6]:
df.isnull().sum()

Unnamed: 0      0
title         279
text           18
label           0
dtype: int64

In [7]:
df = df.dropna()

In [8]:
df.isnull().sum()

Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

In [9]:
df.shape

(39703, 4)

In [10]:
df['label'].value_counts()

1    20362
0    19341
Name: label, dtype: int64

In [11]:

df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
5,5,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1


In [12]:
df = df.rename(columns={'Unnamed: 0': 'index'})
df = df.reset_index()
df.head(10)
df = df.drop(['level_0'], axis = 1)

##Data Preprocessing

###1.Tokenization

In [13]:
sample_data = 'The quick brown fox jumps over the lazy dog'
sample_data = sample_data.split()

##2. Make Lowercase

In [14]:
sample_data = [data.lower() for data in sample_data]

##Remove Stopwords

In [15]:
nltk.download('stopwords')
stopwords = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [16]:
sample_data = [data for data in sample_data if data not in stopwords]

In [17]:
sample_data

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

##4. Stemming

In [18]:
ps = PorterStemmer()
sample_data_stemming = [ps.stem(data) for data in sample_data]
sample_data_stemming

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']

##5. Lemmatization

In [19]:
lm = WordNetLemmatizer()
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [20]:
sample_data_lemma = [lm.lemmatize(data) for data in sample_data]
sample_data_lemma

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']

In [21]:
df.drop(['index'], axis = 1)
df.head()

Unnamed: 0,index,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
2,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
3,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
4,5,About Time! Christian Group Sues Amazon and SP...,All we can say on this one is it s about time ...,1


In [22]:
lm = WordNetLemmatizer()
corpus = []
for i in range(len(df)):
  review = re.sub('^a-zA-Z0-9',' ', df['title'][i])
  review = review.lower()
  review = review.split()
  review = [lm.lemmatize(x) for x in review if x not in stopwords]
  review = ' '.join(review)
  corpus.append(review)

#4. Vectorization (Convert Text data into the Vector)

In [23]:
tf = TfidfVectorizer()
x = tf.fit_transform(corpus).toarray()

In [24]:
y = df['label']
y

0        1
1        1
2        0
3        1
4        1
        ..
39698    0
39699    0
39700    1
39701    0
39702    1
Name: label, Length: 39703, dtype: int64

## Data splitting into the train and test

In [25]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1, stratify = y)

In [26]:
len(x_train), len(y_train)

(27792, 27792)

#5. Model Building

In [27]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)

In [30]:
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)

In [37]:
lr = LogisticRegression()
lr.fit(x_train, y_train)

#6. Model Evaluation

In [31]:
y_pred = rf.predict(x_test)
accuracy_score_ = accuracy_score(y_test, y_pred)
accuracy_score_

0.886995214507598

In [38]:
dt_y_pred = dt.predict(x_test)
accuracy_score_ = accuracy_score(y_test, dt_y_pred)
accuracy_score_

0.8463605070942826

In [39]:
lr_y_pred = lr.predict(x_test)
accuracy_score_ = accuracy_score(y_test, lr_y_pred)
accuracy_score_

0.8878347745781211