About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. text: the text of the article; could be incomplete
4. label: a label that marks whether the news article is real or fake:
           
           1: Fake news
           0: real News


Importing the dependencies

In [2]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aditi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [4]:
#printing the stopwords in english
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data pre-processing

In [5]:
# loading the dataset to a pandas dataframe
news_dataset = pd.read_csv('train.csv')

In [6]:
news_dataset.shape

(72134, 4)

In [7]:
# print the first five rows of dataset
news_dataset.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [8]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

In [9]:
# replacing the null values with empty strings
news_dataset = news_dataset.fillna('')

In [11]:
# seperting the data and label
X = news_dataset.drop(columns = 'label', axis=1)
Y = news_dataset['label']

In [12]:
print(X)
print(Y)

       Unnamed: 0  ...                                               text
0               0  ...  No comment is expected from Barack Obama Membe...
1               1  ...     Did they post their votes for Hillary already?
2               2  ...   Now, most of the demonstrators gathered last ...
3               3  ...  A dozen politically active pastors came here f...
4               4  ...  The RS-28 Sarmat missile, dubbed Satan 2, will...
...           ...  ...                                                ...
72129       72129  ...  WASHINGTON (Reuters) - Hackers believed to be ...
72130       72130  ...  You know, because in fantasyland Republicans n...
72131       72131  ...  Migrants Refuse To Leave Train At Refugee Camp...
72132       72132  ...  MEXICO CITY (Reuters) - Donald Trump’s combati...
72133       72133  ...  Goldman Sachs Endorses Hillary Clinton For Pre...

[72134 rows x 3 columns]
0        1
1        1
2        1
3        0
4        1
        ..
72129    0
72130    

Stemming: It is the process of reducing a word to its root word
example: actor, actress, acting ---> root word is "act"

In [13]:
port_stem  = PorterStemmer()

In [14]:
def stemming(title) :
    stemmed_content = re.sub('[^a-zA-z]',' ',title)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content

In [15]:
news_dataset['title'] = news_dataset['title'].apply(stemming)

In [17]:
print(news_dataset['title'])

0        law enforc high alert follow threat cop white ...
1                                                         
2        unbeliev obama attorney gener say charlott rio...
3        bobbi jindal rais hindu use stori christian co...
4        satan russia unv imag terrifi new supernuk wes...
                               ...                        
72129    russian steal research trump hack u democrat p...
72130    watch giuliani demand democrat apolog trump ra...
72131         migrant refus leav train refuge camp hungari
72132    trump tussl give unpopular mexican leader much...
72133           goldman sach endors hillari clinton presid
Name: title, Length: 72134, dtype: object


In [16]:
# seperating the data and label
X = news_dataset['title'].values
Y = news_dataset['label'].values

In [22]:
print(X)


['law enforc high alert follow threat cop white blacklivesmatt fyf terrorist [video]'
 ''
 'unbeliev obama attorney gener say charlott rioter peac protest home state north carolina [video]'
 ... 'migrant refus leav train refuge camp hungari'
 'trump tussl give unpopular mexican leader much need shot arm'
 'goldman sach endors hillari clinton presid']


In [21]:
print(Y)

[1 1 1 ... 0 0 1]


In [23]:
Y.shape

(72134,)

In [None]:
# converting the textual data to numerical matrix form of data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)


In [25]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 629127 stored elements and shape (72134, 19662)>
  Coords	Values
  (0, 407)	0.31901616740593475
  (0, 1805)	0.33473339572306554
  (0, 3685)	0.2487111216788284
  (0, 5518)	0.31820373781752753
  (0, 6435)	0.28932597161771145
  (0, 6740)	0.4855284351114635
  (0, 7897)	0.2674627355023195
  (0, 9711)	0.22829651152218416
  (0, 17278)	0.2487111216788284
  (0, 17383)	0.25426350326440594
  (0, 18669)	0.12979640357535882
  (0, 19129)	0.19134824060732464
  (2, 1051)	0.28403320490491896
  (2, 2678)	0.30808922726971466
  (2, 2924)	0.36395276344507194
  (2, 6890)	0.2652218649803486
  (2, 8032)	0.26922191912407045
  (2, 11876)	0.22313514796758355
  (2, 12023)	0.168784385727874
  (2, 12756)	0.27904133025095906
  (2, 13605)	0.22695029102228706
  (2, 14606)	0.3579942399172441
  (2, 15110)	0.16099277720355756
  (2, 16463)	0.19996539254710202
  (2, 18054)	0.35961554134655777
  :	:
  (72130, 17798)	0.13227219506940732
  (72130, 18958)	0.25302499

Splitting the dataset to training and test data

In [28]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)

Training the model: LogisticRegression

In [26]:
model = LogisticRegression()

In [29]:
model.fit(X_train, Y_train)

Evalluation

accuracy_score

In [31]:
# accuracy store on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print("accuracy score of training data", training_data_accuracy)

accuracy score of training data 0.9193858630668723


In [32]:
# accuracy store on the test data
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print("accuracy score of test data", testing_data_accuracy)

accuracy score of test data 0.9005337214944202


Making a predictive system

In [35]:
X_new = X_test[1]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0] == 0):
    print("the news is real")
else:
    print("the news is fake")

[0]
the news is real


In [36]:
print(Y_test[1])

0


In [37]:
X_new = X_test[0]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0] == 0):
    print("the news is real")
else:
    print("the news is fake")

[1]
the news is fake


In [38]:
print(Y_test[0])

1
