### <center>Fake News Prediction </center>

In this project, we build a machine learning system that predicts whether a news article is authentic or fake. This is a binary classification problem.

About the Dataset:

id: unique id for a news article

title: the title of a news article

author: author of the news article

text: the text of the article; could be incomplete

label: a label that marks whether the news article is real or fake:

1: Fake news

0: Real/Authentic News



#### Import the required packages

In [1]:
import numpy as np
import pandas as pd
import re #regular expression for searching words in a text or paragraph
from nltk.corpus import stopwords #words that do not add much value to a paragraph or text
from nltk.stem.porter import PorterStemmer #used to stem our words i.e. used to give us a root word for a particular word
from sklearn.feature_extraction.text import TfidfVectorizer #used to convert the text into feature vectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\onale\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Stopwords
Stop words are common words (such as "the", "is" and "and") that are filtered out before or after processing natural language data because they carry little information. They add little value to our dataset.
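To illustrate the filtering step, here is a minimal sketch that uses a small hand-picked stop list instead of NLTK's full list. The `SAMPLE_STOPWORDS` set and the `remove_stopwords` helper are hypothetical names introduced only for this illustration.

```python
# A tiny, hand-picked stop list for illustration; NLTK's English list is much longer.
SAMPLE_STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def remove_stopwords(text):
    """Return the words of `text` that are not in the sample stop list."""
    words = text.lower().split()
    return [word for word in words if word not in SAMPLE_STOPWORDS]

print(remove_stopwords("The title of the article is a warning to readers"))
# ['title', 'article', 'warning', 'readers']
```

Storing the stop list in a set keeps each membership check fast, which matters when the same check runs for every word in thousands of articles.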

In [2]:
# print stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

### Data Pre-processing

In [4]:
# load the dataset to a pandas DataFrame
news_dataset = pd.read_csv('datasets/train.csv')

In [5]:
news_dataset.shape

(20800, 5)

The dataset has 20800 rows and 5 columns/features

In [6]:
# print the first 5 rows of the dataframe
news_dataset.head()

| | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [7]:
# count the number of missing values in the dataset
news_dataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

From the above, it is evident that the title feature has 558 missing values, author has 1957 and text has 39. The id and label columns have no missing values.

In [8]:
# replace the null values with empty string
news_dataset = news_dataset.fillna('')

Because we have a lot of data for this project, we can afford a simple approach to missing values: we replace them with empty strings rather than estimating them. When a dataset is small, more careful imputation techniques may be needed to handle the missing values.

In [9]:
# merge the author name and news title features or columns into a new column
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

We combined the author and title features because they are the ones we will use for making predictions. The text field contains long passages that would take considerably more computation to preprocess, but it could be used as well.

In [10]:
print(news_dataset['content'])

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


In [11]:
# separate the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [12]:
print(X)
print(Y)

          id                                              title  \
0          0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
20795  20795  Rapper T.I.: Trump a ’Poster Child For White S...   
20796  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
20797  20797  Macy’s Is Said to Receive Takeover Approach by...   
20798  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
20799  20799                          What Keeps the F-35 Alive   

                                          author  \
0                                  Darrell Lucus   
1                                Daniel J. Flynn   
2                             Consortiu

### Stemming:

Stemming is the process of reducing a word to its root form.

example: acting, acted, acts --> act

In [13]:
port_stem = PorterStemmer()

In [14]:
# create a custom function called stemming
stop_words = set(stopwords.words('english'))  # build the stop list once instead of once per word

def stemming(content):
    # replace every character that is not a letter with a space (drops digits, punctuation and so forth)
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    # convert to lower case so that 'News' and 'news' are treated as the same word
    stemmed_content = stemmed_content.lower()
    # split the text into a list of words
    stemmed_content = stemmed_content.split()
    # skip the stopwords and stem each remaining word
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    # join the stemmed words back into a single string
    return ' '.join(stemmed_content)

In [15]:
#use the stemming function on the content column
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [16]:
print(news_dataset['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


In [17]:
#separate the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [18]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [19]:
print(Y)

[1 0 1 ... 0 1 1]


In [20]:
Y.shape

(20800,)

In [21]:
# convert the textual data to numerical data
vectorizer = TfidfVectorizer() 
vectorizer.fit(X)

TfidfVectorizer()

Term frequency (TF) counts the number of times a particular word occurs in a document/text/paragraph. A word that is repeated often in a document is treated as more important to that document.

Inverse document frequency (IDF) measures how rare a word is across all documents. Words that appear in many documents receive a lower weight because they do little to distinguish one document from another.

Together, TF-IDF gives each word in each document a numerical weight, and these weights form the feature vectors. We only do this to X, which is textual data. Y is already in numerical form.
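The weighting can be sketched by hand in plain Python. This is a simplified illustration, not the library's implementation: it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization described in the scikit-learn documentation for the default settings, and it tokenizes with a plain whitespace split, so its values need not match `TfidfVectorizer` exactly. The function name `tfidf_vectors` and the sample documents are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute L2-normalized TF-IDF weights for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        weights = {term: count * idf[term] for term, count in counts.items()}
        # scale to unit length; fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Hypothetical, already-stemmed documents
docs = ["trump said news", "clinton said news", "nato russia exercis"]
for vector in tfidf_vectors(docs):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the first document, "trump" appears in only one document while "said" and "news" appear in two, so "trump" ends up with the largest weight.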

In [22]:
X = vectorizer.transform(X)

In [23]:
print(X)

  (0, 15686)	0.28485063562728646
  (0, 13473)	0.2565896679337957
  (0, 8909)	0.3635963806326075
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 7005)	0.21874169089359144
  (0, 4973)	0.233316966909351
  (0, 3792)	0.2705332480845492
  (0, 3600)	0.3598939188262559
  (0, 2959)	0.2468450128533713
  (0, 2483)	0.3676519686797209
  (0, 267)	0.27010124977708766
  (1, 16799)	0.30071745655510157
  (1, 6816)	0.1904660198296849
  (1, 5503)	0.7143299355715573
  (1, 3568)	0.26373768806048464
  (1, 2813)	0.19094574062359204
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (1, 1497)	0.2939891562094648
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  (2, 5389)	0.3866530551182615
  (2, 3103)	0.46097489583229645
  :	:
  (20797, 13122)	0.2482526352197606
  (20797, 12344)	0.27263457663336677
  (20797, 12138)	0.24778257724396507
  (20797, 10306)	0.08038079000566466
  (20797, 9588)	0.174553480255222
  (20797, 9518)	0.295420

### Split the dataset to training & test data

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)

### Train the Model: Logistic Regression

Logistic regression is a simple and widely used model for binary classification problems, and it often performs well on sparse text features. That is why we used it.
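As a rough sketch of how a logistic regression model turns features into a class, the snippet below computes a weighted sum, passes it through the sigmoid function and applies a 0.5 threshold. The weights, bias and feature values are hypothetical numbers chosen for illustration; they are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (fake) if the predicted probability reaches the threshold, else 0 (real)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= threshold else 0

# Hypothetical weights and feature values, chosen only for illustration.
weights = [2.0, -1.5]
bias = -0.25
print(predict_label([0.9, 0.1], weights, bias))  # score = 1.8 - 0.15 - 0.25 = 1.4, prints 1
print(predict_label([0.1, 0.8], weights, bias))  # score = 0.2 - 1.2 - 0.25 = -1.25, prints 0
```

Training amounts to finding the weights and bias that make these probabilities agree with the known labels as closely as possible.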

In [25]:
model = LogisticRegression()

In [26]:
model.fit(X_train, Y_train)

LogisticRegression()

### Model Evaluation

In [27]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)  # true labels first, then predictions

In [28]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9865985576923076


In [29]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)  # true labels first, then predictions

In [30]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9790865384615385


The training accuracy (about 98.7%) and the test accuracy (about 97.9%) are both high and close to each other. This suggests the model has low bias and low variance: it fits the training data well and generalizes well to unseen articles from this dataset.
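Accuracy alone does not show which kind of error a classifier makes. A minimal sketch of precision and recall for the "fake" class follows; the label lists are hypothetical and are not results from this notebook, and the helper names are made up for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class, using 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels: 1 = fake, 0 = real
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```

Precision answers "of the articles flagged as fake, how many really were?", while recall answers "of the fake articles, how many were flagged?".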

### Making a Predictive System

In [31]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[0]
The news is Real


In [32]:
print(Y_test[3])

0
