# Fake News Detection

This notebook uses the [Fake News Classification dataset](https://www.kaggle.com/ruchi798/source-based-news-classification) from Kaggle.


## Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## The Data

In [3]:
news_df = pd.read_csv('news_articles.csv')

In [4]:
news_df.head()

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
0,Barracuda Brigade,2016-10-26T21:41:00.000+03:00,muslims busted they stole millions in govt ben...,print they should pay all the back all the mon...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,muslims busted stole millions govt benefits,print pay back money plus interest entire fami...,1.0
1,reasoning with facts,2016-10-29T08:47:11.259+03:00,re why did attorney general loretta lynch plea...,why did attorney general loretta lynch plead t...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,attorney general loretta lynch plead fifth,attorney general loretta lynch plead fifth bar...,1.0
2,Barracuda Brigade,2016-10-31T01:41:49.479+02:00,breaking weiner cooperating with fbi on hillar...,red state \nfox news sunday reported this mor...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,breaking weiner cooperating fbi hillary email ...,red state fox news sunday reported morning ant...,1.0
3,Fed Up,2016-11-01T05:22:00.000+02:00,pin drop speech by father of daughter kidnappe...,email kayla mueller was a prisoner and torture...,english,100percentfedup.com,http://100percentfedup.com/wp-content/uploads/...,bias,Real,pin drop speech father daughter kidnapped kill...,email kayla mueller prisoner tortured isis cha...,1.0
4,Fed Up,2016-11-01T21:56:00.000+02:00,fantastic trumps point plan to reform healthc...,email healthcare reform to make america great ...,english,100percentfedup.com,http://100percentfedup.com/wp-content/uploads/...,bias,Real,fantastic trumps point plan reform healthcare ...,email healthcare reform make america great sin...,1.0


In [5]:
news_df.tail()

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
2091,-NO AUTHOR-,2016-10-27T15:36:10.573+03:00,teens walk free after gangrape conviction,,english,wnd.com,http://www.wnd.com/files/2016/10/hillary_haunt...,bias,Real,good samaritan wearing indian headdress disarm...,,1.0
2092,-NO AUTHOR-,2016-10-27T15:36:10.671+03:00,school named for munichmassacre mastermind,,english,wnd.com,http://www.wnd.com/files/2016/10/rambo_richard...,bias,Real,skype sex scam fortune built shame,,1.0
2093,-NO AUTHOR-,2016-10-27T13:30:00.000+03:00,russia unveils satan missile,,english,wnd.com,http://www.wnd.com/files/2016/10/skype_sex_sca...,bs,Fake,cannabis aficionados develop thca crystalline ...,,1.0
2094,-NO AUTHOR-,2016-10-27T15:58:41.935+03:00,check out hillarythemed haunted house,,english,wnd.com,http://worldtruth.tv/wp-content/uploads/2016/1...,bs,Fake,title,,0.0
2095,Eddy Lavine,2016-10-28T01:02:00.000+03:00,cannabis aficionados develop thca crystalline ...,,,,,,,,,


In [6]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2096 entries, 0 to 2095
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   author                   2096 non-null   object 
 1   published                2096 non-null   object 
 2   title                    2096 non-null   object 
 3   text                     2050 non-null   object 
 4   language                 2095 non-null   object 
 5   site_url                 2095 non-null   object 
 6   main_img_url             2095 non-null   object 
 7   type                     2095 non-null   object 
 8   label                    2095 non-null   object 
 9   title_without_stopwords  2094 non-null   object 
 10  text_without_stopwords   2046 non-null   object 
 11  hasImage                 2095 non-null   float64
dtypes: float64(1), object(11)
memory usage: 196.6+ KB


In [7]:
news_df.isnull().sum()

author                      0
published                   0
title                       0
text                       46
language                    1
site_url                    1
main_img_url                1
type                        1
label                       1
title_without_stopwords     2
text_without_stopwords     50
hasImage                    1
dtype: int64

## Data Cleaning

In [8]:
news_df[news_df['language'].isnull()]

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
2095,Eddy Lavine,2016-10-28T01:02:00.000+03:00,cannabis aficionados develop thca crystalline ...,,,,,,,,,


**Let us drop this row, since most of its fields are missing**

In [9]:
news_df.drop(index = 2095, inplace = True)
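Rows like this one can also be found without reading the output by eye. The following is a minimal sketch on a toy DataFrame (the values and the `min_non_null` threshold are assumptions made for illustration, not taken from the dataset): it counts the missing fields per row and drops rows that have too few populated fields.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one mostly-empty row, similar to row 2095.
toy_df = pd.DataFrame({
    'author': ['A', 'B', 'C'],
    'title': ['t1', 't2', 't3'],
    'text': ['body one', 'body two', np.nan],
    'language': ['english', 'english', np.nan],
    'label': ['Real', 'Fake', np.nan],
})

# Number of missing fields in each row.
missing_per_row = toy_df.isnull().sum(axis=1)
print(missing_per_row)

# Keep only rows with at least `min_non_null` populated fields.
min_non_null = 4  # assumed threshold, chosen for this toy example
cleaned_df = toy_df.dropna(thresh=min_non_null)
print(cleaned_df)
```

Dropping by explicit index, as done in this notebook, is simpler when only one row is affected. A threshold is more useful when the number of sparse rows is not known in advance.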

In [10]:
news_df.isnull().sum()

author                      0
published                   0
title                       0
text                       45
language                    0
site_url                    0
main_img_url                0
type                        0
label                       0
title_without_stopwords     1
text_without_stopwords     49
hasImage                    0
dtype: int64

**Let us next check the missing values for title_without_stopwords**

In [11]:
news_df[news_df['title_without_stopwords'].isnull()]

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
374,Daniel Haiphong,2016-11-17T02:00:00.000+02:00,won now what,the syrian army and hezbollah resistance force...,english,ahtribune.com,http://ahtribune.com/images/media/Donald_Trump...,bs,Fake,,syrian army hezbollah resistance forces contin...,1.0


**Let us drop this row**

In [12]:
news_df.drop(index = 374, inplace = True)

In [13]:
news_df.isnull().sum()

author                      0
published                   0
title                       0
text                       45
language                    0
site_url                    0
main_img_url                0
type                        0
label                       0
title_without_stopwords     0
text_without_stopwords     49
hasImage                    0
dtype: int64

**Now let us look at the null values in text**

In [17]:
news_df[news_df['text'].isnull()].head()

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
2050,-NO AUTHOR-,2016-10-27T03:19:40.578+03:00,hillarys emails might not be missing after all,,english,wnd.com,No Image URL,bias,Real,meteor space junk rocket mysterious flash hits...,,1.0
2051,-NO AUTHOR-,2016-10-27T03:32:23.580+03:00,hillarys emails might not be missing after all,,english,wnd.com,http://www.wnd.com/files/2016/10/meteor_russia...,bias,Real,democrats really stuff ballot heres answer,,1.0
2052,Leo Hohmann,2016-10-27T03:32:35.039+03:00,wikileaks bombshells on hillary you need to know,,english,wnd.com,http://mobile.wnd.com/files/2013/07/ballot-box...,bias,Real,men cry rape irans top quran reader,,1.0
2053,-NO AUTHOR-,2016-10-27T03:32:37.291+03:00,fascinated with sex,,english,wnd.com,http://mobile.wnd.com/files/2016/10/Saeed_Toos...,bias,Real,democrats really stuff ballot heres answer,,1.0
2054,-NO AUTHOR-,2016-10-27T04:01:58.682+03:00,meteor space junk rocket mysterious flash hits...,,english,wnd.com,http://www.wnd.com/files/2013/07/ballot-box-vo...,bias,Real,men cry rape irans top quran reader,,1.0


In [18]:
news_df.shape

(2094, 12)

**We drop the rows where text is missing**

In [19]:
news_df.dropna(subset=['text'],inplace=True)

In [20]:
news_df.shape

(2049, 12)

**To fill in the missing values of text_without_stopwords, we need the stopword list from nltk**

In [24]:
import nltk

In [25]:
nltk.download_shell()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.....

**The corpus for stopwords is already downloaded**

In [28]:
from nltk.corpus import stopwords

In [29]:
news_df.isnull().sum()

author                     0
published                  0
title                      0
text                       0
language                   0
site_url                   0
main_img_url               0
type                       0
label                      0
title_without_stopwords    0
text_without_stopwords     4
hasImage                   0
dtype: int64

In [31]:
news_df.loc[news_df['text_without_stopwords'].isnull(), 'text_without_stopwords']

2046    NaN
2047    NaN
2048    NaN
2049    NaN
Name: text_without_stopwords, dtype: object

**We now fill in the missing data**

In [37]:
# Build the stopword set once; split on whitespace so we filter words, not characters
stop_words = set(stopwords.words('english'))
missing_mask = news_df['text_without_stopwords'].isnull()
news_df.loc[missing_mask, 'text_without_stopwords'] = news_df.loc[missing_mask, 'text'].apply(lambda text: ' '.join(word for word in text.split() if word.lower() not in stop_words))

In [38]:
news_df.isnull().sum()

author                     0
published                  0
title                      0
text                       0
language                   0
site_url                   0
main_img_url               0
type                       0
label                      0
title_without_stopwords    0
text_without_stopwords     0
hasImage                   0
dtype: int64
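To make the stopword filtering step concrete, here is a minimal sketch on a single toy sentence. It assumes a tiny hand-written stopword set (`toy_stopwords`) instead of NLTK's English list, and `remove_stopwords` is a hypothetical helper name, so that the example runs without any downloads. Note that the text has to be split into words first, because iterating over a Python string directly yields single characters.

```python
# Tiny hand-written stopword set (an assumption for illustration only;
# the notebook itself uses nltk's English stopword list).
toy_stopwords = {'the', 'a', 'of', 'and', 'is', 'in', 'to'}


def remove_stopwords(text, stop_words):
    """Return `text` with stopwords dropped, keeping the word order."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)


sample = 'The price of oil is rising in the east'
filtered = remove_stopwords(sample, toy_stopwords)
print(filtered)  # price oil rising east
```

Holding the stopwords in a `set` keeps each membership test cheap, which matters when the same check runs for every word of every article.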

In [39]:
news_df.shape

(2049, 12)

## Exploratory Data Analysis

In [40]:
news_df.describe(include = 'all')

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
count,2049,2049,2049,2049,2049,2049,2049,2049,2049,2049,2049,2049.0
unique,486,1960,1757,1940,5,68,1184,8,2,1753,1940,
top,No Author,2016-11-01T13:00:00.000+02:00,no title,notify me of followup comments by email notify...,english,clickhole.com,No Image URL,bs,Fake,title,notify followup comments email notify new post...,
freq,505,8,186,6,1971,100,465,598,1291,186,6,
mean,,,,,,,,,,,,0.772572
std,,,,,,,,,,,,0.419274
min,,,,,,,,,,,,0.0
25%,,,,,,,,,,,,1.0
50%,,,,,,,,,,,,1.0
75%,,,,,,,,,,,,1.0


We can see that many rows in the author column hold a placeholder such as `No Author` (505 rows) instead of a real name, so the column does not appear to be reliable. We might drop it.
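One way to quantify this is to count the placeholder entries directly. The sketch below uses a toy Series (hypothetical values) and assumes that the placeholders are spelling variants of "no author", such as the `No Author` and `-NO AUTHOR-` values visible in the outputs of this notebook.

```python
import pandas as pd

# Toy author column (hypothetical values) mixing real names and placeholders.
authors = pd.Series(['No Author', '-NO AUTHOR-', 'Fed Up', 'No Author', 'Shawn Helton'])

# Strip surrounding dashes/spaces and lower-case so the variants compare equal.
normalised = authors.str.strip('- ').str.lower()
is_placeholder = normalised == 'no author'

print(is_placeholder.sum())   # number of placeholder rows: 3
print(is_placeholder.mean())  # share of placeholder rows: 0.6
```

On the real data, the same mask could be built from `news_df['author']`. Whether other junk values (for example headlines stored as author names) should count as placeholders is a separate judgment call.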

In [42]:
news_df.groupby('author').describe()

Unnamed: 0_level_0,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
author,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
# 1 NWO Hatr,3.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
-NO AUTHOR-,24.0,0.666667,0.481543,0.0,0.0,1.0,1.0,1.0
4 Goals For The Neomasculinity Movement During Trumps First Term,1.0,0.000000,,0.0,0.0,0.0,0.0,0.0
?????? ???? ???? ?????????,1.0,1.000000,,1.0,1.0,1.0,1.0,1.0
A. Griffee,2.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...
watchmannonthewall,1.0,1.000000,,1.0,1.0,1.0,1.0,1.0
willz,1.0,1.000000,,1.0,1.0,1.0,1.0,1.0
wmw_admin,1.0,0.000000,,0.0,0.0,0.0,0.0,0.0
wtromp@operamail.com (WT),2.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0


In [43]:
news_df.groupby('site_url').describe()

Unnamed: 0_level_0,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
site_url,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
100percentfedup.com,33.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
21stcenturywire.com,24.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
abcnews.com.co,2.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
abeldanger.net,82.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
abovetopsecret.com,53.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...
washingtonsblog.com,3.0,0.666667,0.577350,0.0,0.5,1.0,1.0,1.0
westernjournalism.com,100.0,0.980000,0.140705,0.0,1.0,1.0,1.0,1.0
whatreallyhappened.com,10.0,0.500000,0.527046,0.0,0.0,0.5,1.0,1.0
whydontyoutrythis.com,2.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0


**We can observe that there are 68 different websites in the given data**

In [44]:
news_df.groupby('language').describe()

Unnamed: 0_level_0,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
language,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
english,1971.0,0.764587,0.424365,0.0,1.0,1.0,1.0,1.0
french,2.0,0.5,0.707107,0.0,0.25,0.5,0.75,1.0
german,72.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
ignore,3.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
spanish,1.0,0.0,,0.0,0.0,0.0,0.0,0.0


**Here we can see that most of the articles are in English (1971 of 2049)**

In [45]:
news_df['type'].unique()

array(['bias', 'conspiracy', 'fake', 'bs', 'satire', 'hate', 'junksci',
       'state'], dtype=object)

In [46]:
news_df['label'].unique()

array(['Real', 'Fake'], dtype=object)

In [47]:
news_df[news_df['label'] == 'Real'].head()

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
0,Barracuda Brigade,2016-10-26T21:41:00.000+03:00,muslims busted they stole millions in govt ben...,print they should pay all the back all the mon...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,muslims busted stole millions govt benefits,print pay back money plus interest entire fami...,1.0
1,reasoning with facts,2016-10-29T08:47:11.259+03:00,re why did attorney general loretta lynch plea...,why did attorney general loretta lynch plead t...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,attorney general loretta lynch plead fifth,attorney general loretta lynch plead fifth bar...,1.0
2,Barracuda Brigade,2016-10-31T01:41:49.479+02:00,breaking weiner cooperating with fbi on hillar...,red state \nfox news sunday reported this mor...,english,100percentfedup.com,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,bias,Real,breaking weiner cooperating fbi hillary email ...,red state fox news sunday reported morning ant...,1.0
3,Fed Up,2016-11-01T05:22:00.000+02:00,pin drop speech by father of daughter kidnappe...,email kayla mueller was a prisoner and torture...,english,100percentfedup.com,http://100percentfedup.com/wp-content/uploads/...,bias,Real,pin drop speech father daughter kidnapped kill...,email kayla mueller prisoner tortured isis cha...,1.0
4,Fed Up,2016-11-01T21:56:00.000+02:00,fantastic trumps point plan to reform healthc...,email healthcare reform to make america great ...,english,100percentfedup.com,http://100percentfedup.com/wp-content/uploads/...,bias,Real,fantastic trumps point plan reform healthcare ...,email healthcare reform make america great sin...,1.0


In [48]:
news_df[news_df['label'] == 'Fake'].head()

Unnamed: 0,author,published,title,text,language,site_url,main_img_url,type,label,title_without_stopwords,text_without_stopwords,hasImage
33,No Author,2016-10-27T02:24:00.000+03:00,intl community still financing protecting terr...,st century wire says \nwire reported on friday...,english,21stcenturywire.com,http://21stcenturywire.com/wp-content/uploads/...,conspiracy,Fake,intl community still financing protecting terr...,st century wire says wire reported friday fbis...,1.0
34,No Author,2016-10-29T16:20:00.000+03:00,fbi director comeys leaked memo explains why h...,in a stunning turn of events days before the ...,english,21stcenturywire.com,http://21stcenturywire.com/wp-content/uploads/...,conspiracy,Fake,fbi director comeys leaked memo explains hes r...,stunning turn events days presidential electio...,1.0
35,Shawn Helton,2016-10-29T04:22:00.000+03:00,fbi redux whats behind new probe into hillary ...,a tidal wave of revelations is pouring out of ...,english,21stcenturywire.com,http://21stcenturywire.com/wp-content/uploads/...,conspiracy,Fake,fbi redux whats behind new probe hillary clint...,tidal wave revelations pouring clinton campaig...,1.0
36,Mike Rivero,2016-11-02T01:43:00.000+02:00,party corruption clinton campaign directly tie...,november by wire comments \npatrick henning...,english,21stcenturywire.com,http://i1.wp.com/21stcenturywire.com/wp-conten...,conspiracy,Fake,party corruption clinton campaign directly tie...,november wire comments patrick henningsen st c...,1.0
37,No Author,2016-11-01T16:48:00.000+02:00,hillarys russian hack hoax the biggest lie of ...,november by shawn helton comment \nshawn he...,english,21stcenturywire.com,http://i0.wp.com/21stcenturywire.com/wp-conten...,conspiracy,Fake,hillarys russian hack hoax biggest lie electio...,november shawn helton comment shawn helton st ...,1.0


In [49]:
news_df[news_df['label'] == 'Real'].shape

(758, 12)

In [50]:
news_df[news_df['label'] == 'Fake'].shape

(1291, 12)

**We can see that there are more fake articles (1291) than real ones (758)**
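The class balance can also be read off in one step with `value_counts`. As a minimal sketch, the label column is rebuilt here from the two row counts reported above, so that the example runs without the CSV file. On the real data, `news_df['label']` would take the place of `labels`.

```python
import pandas as pd

# Label column rebuilt from the row counts reported above (758 Real, 1291 Fake).
labels = pd.Series(['Real'] * 758 + ['Fake'] * 1291, name='label')

counts = labels.value_counts()                # absolute counts per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(3))
```

Roughly 63% of the rows are Fake and 37% are Real. The imbalance is moderate, but it is worth keeping in mind when judging accuracy later.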

In [51]:
news_df.groupby('label').describe()

Unnamed: 0_level_0,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Fake,1291.0,0.736638,0.440628,0.0,0.0,1.0,1.0,1.0
Real,758.0,0.833773,0.37253,0.0,1.0,1.0,1.0,1.0


**Let us look at the type column**

In [52]:
news_df.groupby(['label','type']).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage,hasImage
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max
label,type,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Fake,bs,598.0,0.745819,0.435764,0.0,0.0,1.0,1.0,1.0
Fake,conspiracy,430.0,0.669767,0.470845,0.0,0.0,1.0,1.0,1.0
Fake,fake,15.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
Fake,junksci,102.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
Fake,satire,146.0,0.684932,0.466142,0.0,0.0,1.0,1.0,1.0
Real,bias,393.0,0.933842,0.248874,0.0,1.0,1.0,1.0,1.0
Real,hate,244.0,0.590164,0.492814,0.0,0.0,1.0,1.0,1.0
Real,state,121.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


**We can see that each type belongs to exactly one label: bs, conspiracy, fake, junksci and satire are Fake, while bias, hate and state are Real**
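The relationship between `type` and `label` can be checked programmatically. The sketch below rebuilds a small summary table from the counts in the groupby output above (the names `summary`, `type_by_label` and `labels_per_type` are introduced only for this example) and verifies how many distinct labels each type maps to.

```python
import pandas as pd

# (label, type, count) triples copied from the groupby output above.
summary = pd.DataFrame(
    [
        ('Fake', 'bs', 598),
        ('Fake', 'conspiracy', 430),
        ('Fake', 'fake', 15),
        ('Fake', 'junksci', 102),
        ('Fake', 'satire', 146),
        ('Real', 'bias', 393),
        ('Real', 'hate', 244),
        ('Real', 'state', 121),
    ],
    columns=['label', 'type', 'count'],
)

# Wide table: one row per type, one column per label, 0 where a pair never occurs.
type_by_label = (
    summary.pivot(index='type', columns='label', values='count')
    .fillna(0)
    .astype(int)
)
print(type_by_label)

# Number of distinct labels attached to each type.
labels_per_type = summary.groupby('type')['label'].nunique()
print(labels_per_type)
```

On the full DataFrame, `pd.crosstab(news_df['type'], news_df['label'])` should produce the same wide table. Since every type maps to a single label, `type` fully determines `label`, so it should presumably not be used as an input feature when predicting the label.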